A Bag-of-Words Based Feature Extraction Scheme for American Sign Language Number Recognition from Hand Gesture Images

Human Computer Interaction (HCI) focuses on the interaction between humans and machines. An extensive list of applications exists for hand gesture recognition techniques, major candidates for HCI. The list covers various fields, one of which is sign language recognition. In this field, however, high accuracy and robustness are both needed, and achieving both is a major challenge. In addition, feature extraction from hand gesture images is difficult because of the many parameters associated with them. This paper proposes an approach based on a bag-of-words (BoW) model for automatic recognition of American Sign Language (ASL) numbers. In this method, the first step is to obtain the set of representative vocabularies by applying a K-means clustering algorithm to a few randomly chosen images. Next, the vocabularies are used as bin centers for BoW histogram construction. The proposed histograms are shown to provide distinguishable features for classification of ASL numbers. For classification, the K-nearest neighbors (kNN) classifier is employed, using the BoW histogram bin frequencies as features. For validation, extensive experiments are conducted on two large ASL number-recognition datasets; the proposed method shows superior performance in classifying the numbers, achieving an F1 score of 99.92% on the Kaggle ASL numbers dataset.


I. INTRODUCTION
The human computer interface (HCI) refers to the user interfaces in a production or process-control system; it deals with the design, implementation, and assessment of new interfaces to improve the interaction between humans and machines [1], [2]. An efficient, robust, and customized interface can greatly reduce the gap between a human's mental model and the way a computer or robot accomplishes a given task. In recent industrial scenarios, gestures, hand and body poses, speech, and gaze are among the many natural interaction modes that can be used to design affordable user interfaces [3]. Among all the modes of natural interaction, gesturing serves as one of the most comfortable and expressive ways to conduct effective and meaningful communication between two individuals from different cultures; this is true even for physically challenged people, including those who are hearing-impaired or speech-impaired [1]-[6]. However, very few people can understand hand gestures properly [1], so a communication gap exists that isolates those in the impaired community from the general population. Because of this, automatic hand gesture recognition (HGR) has become a major concern for researchers [7], [8]. It has been applied in fields such as gaming, sign language recognition (SLR), and virtual reality [9]. For applications in the real world, automatic HGR faces significant challenges caused by its needs for accuracy and robustness [1], [2], [8]. A comprehensive and detailed analysis of existing sign language recognition techniques is given in [1]. It is accompanied by a discussion of the usual challenges for gesture recognition systems; the work aims to guide entry into (and to facilitate increasing efforts in) the SLR research field. Most studies of SLR are based on American Sign Language (ASL), Indian sign language, and Arabic sign language [1]. Various state-of-the-art HGR methods have been employed.
Among them are methods based on hidden Markov models (HMMs), neural networks (NNs), and fuzzy logic; as they incorporate some complex processes [10] and varying parameters [11], their computational costs tend to be high [12]-[14]. A user evaluation study [7], conducted on 25 visually challenged people with the aim of enabling the impaired community to use hand gestures to interact with machines, has led the way to the proposal of an innovative dactylology. A quantitative rating analysis of the subjects' performances has led to the creation of an ideal collection of gestures. The same work presents a module, accompanied by the proposed dactylology, aimed at recognizing dactylological symbols and enabling a writing support system. Kirsti and Thad [11], [15] explore the achievement of SLR through the use of HMMs, while Keshav [16] uses HMM and hand trajectory tracking techniques in an SLR system created to identify Roman numbers and Arabic alphabets. However, significant performance is not achieved when an SLR system uses a context-dependent HMM model only [17]. Sharmila [2] implements an approach that uses different image processing and machine learning algorithms to recognize ASL by means of hand gestures; this suggests the possibility of operating a system with no direct human touch. Gongfa [18] develops a method with moderate accuracy; based on a skeletonization algorithm and a Convolutional Neural Network (CNN), it can reduce the effect of shooting angle and surroundings, both of which have a large impact on recognition. Jayashree [4] proposes an approach that uses the minimum number of possible constraints and achieves a satisfactory detection rate; it identifies 26 different ASL alphabets in the presence of complex backgrounds that include varying lighting conditions, hand shapes and hand sizes. Antonakos [19] proposes a semi-supervised approach for classifying the extreme states from facial cues in sign language videos.
Jaya [20] provides a feature extraction approach to identify ASL alphabets; it is based on the DWT and F-ratio and uses an SVM classifier. Nasser [21] presents another gesture recognizer that uses bag-of-features and multiclass SVM techniques along with SIFT and K-means clustering. Ching [22] uses LMC for the recognition of ASL, while Teak and Wenjinu [3], [5] propose methods based on deep learning. Wenjinu [5] describes a unique approach to recognizing the ASL alphabet from depth images; it uses a CNN along with multi-view augmentation and an inference fusion technique. Though this method outperforms the state-of-the-art methods with respect to some specific symbols, some of its technicalities lead to the misclassification of other signs and hence to the failure of recognition in some cases.
The objective of this paper is to develop an automatic scheme, based on bag-of-words (BoW) histogram features, for ASL number recognition. The first step is to extract a set of representative vocabularies by applying a K-means clustering algorithm to the pixel intensities of each of a few training images, randomly selected from each class of ASL numbers. At the feature extraction stage, these vocabularies are used as bin centers of the BoW-based histogram. For feature extraction, the pixel intensities of the image are mapped to the nearest vocabulary to construct the BoW-based histogram. The histogram bin frequencies are used as the feature vector. Finally, a supervised K-nearest neighbors (kNN) classifier is employed for classification. Experiments are done using a large number of captured images to evaluate the performance of the proposed method using a ten-fold cross-validation scheme.
The rest of the paper is organized as follows: Section II presents the proposed method of ASL number recognition in detail; in Section III, different experiments are performed to evaluate the performance of the proposed method; Section IV presents the concluding remarks.

II. PROPOSED BOW-BASED RECOGNITION SCHEME FOR ASL NUMBERS
This section presents the proposed method in detail. It consists of four steps: preprocessing; BoW vocabulary extraction; feature extraction based on the BoW histogram; and classification with a kNN classifier.

A. Preprocessing
The many noisy background pixels in any image intended for ASL recognition may degrade the feature quality if they are included. To reduce noise and subtract the background, therefore, a preprocessing step is necessary. A captured background image is subtracted from the given image, and the resulting image is filtered with a median filter to reduce noise. For simplicity, the filtered image is converted into a grayscale image, which is then submitted for feature extraction. Fig. 1 shows examples of the preprocessed grayscale images. The images in the first row are the captured RGB images; those in the second row are the corresponding grayscale images after background subtraction and noise reduction. Preprocessing enhances the hand gesture images and reduces background noise, so the step is expected to improve the feature quality for ASL number recognition.
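The three operations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `preprocess`, the 3 × 3 median window, the use of an absolute difference for background subtraction, and the ITU-R BT.601 grayscale weights are all choices made here, since the paper does not specify them.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def preprocess(image_rgb, background_rgb, window=3):
    """Background subtraction, median filtering, then grayscale conversion."""
    # Absolute difference between the gesture frame and the empty background
    # (assumption: the paper does not say how the subtraction is signed).
    diff = np.abs(image_rgb.astype(np.int16) - background_rgb.astype(np.int16))
    diff = diff.astype(np.uint8)

    # Median filter on each colour channel; borders are padded by replicating
    # edge pixels so that the output keeps the input size.
    pad = window // 2
    padded = np.pad(diff, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    windows = sliding_window_view(padded, (window, window), axis=(0, 1))
    filtered = np.median(windows, axis=(-2, -1))

    # Luminance-weighted grayscale conversion (BT.601 weights, an assumption).
    gray = filtered @ np.array([0.299, 0.587, 0.114])
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)
```

With an 8 × 8 black background and a frame that differs from it in a single pixel, the median filter removes the isolated pixel and the output is all zeros, which is the noise-suppression behaviour the step relies on.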

B. BoW Vocabulary Extraction
The efficient solution of a classification problem depends heavily on the quality of the extracted features; feature extraction is therefore a challenging task for ASL number recognition, especially when a large dataset is involved. BoW-based feature extraction schemes are, however, widely popular for many pattern recognition problems [23], so this paper proposes such a scheme, using the pixel intensities for histogram construction. To begin, 10% of the images of each class are randomly chosen from the entire training set for BoW vocabulary formulation. Fig. 2 shows the procedure for BoW vocabulary construction.
The pixel intensities of the images are fed to a K-means clustering algorithm [24] to obtain k cluster centers for each class. The k cluster centers obtained for the i-th class are treated as the BoW vocabularies of that class. The BoW vocabularies of each class are obtained using this method and then grouped together to form a BoW dictionary D containing N × k vocabularies, where N is the total number of considered classes. Later, in both training and test phases, the vocabularies in D are considered as the histogram-bin centers for constructing a BoW-based histogram from an image.
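The vocabulary construction can be sketched as a one-dimensional K-means over pixel intensities, run once per class and concatenated. This is an illustrative sketch only: the names `kmeans_1d` and `build_dictionary`, the random initialization from observed intensities, and the iteration limit are assumptions, and plain Lloyd iterations stand in for whichever K-means variant the authors used.

```python
import numpy as np


def kmeans_1d(values, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means on scalar values; returns k sorted centers."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float).ravel()
    # Initialise from k distinct observed values (needs at least k of them).
    centers = rng.choice(np.unique(values), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = values[labels == j]
            if members.size:  # keep the old center if a cluster empties
                new_centers[j] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return np.sort(centers)


def build_dictionary(images_by_class, k):
    """Concatenate the k centers of each class into one dictionary of N*k bins."""
    per_class = [
        kmeans_1d(np.concatenate([np.ravel(img) for img in imgs]), k)
        for imgs in images_by_class
    ]
    return np.concatenate(per_class)
```

Here `images_by_class` is a hypothetical list with one entry per class, each entry holding the randomly selected grayscale images of that class. For ten classes and k = 16 the returned dictionary has 160 bin centers.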

C. Proposed BoW-Based Histogram Feature Extraction
After vocabulary or bin-center extraction, the next step is to obtain a BoW-based histogram from an image I for ASL number recognition. To do this, the pixel intensities of the image are mapped to the nearest vocabulary for histogram construction. First, the bin frequencies hm of all the bin centers cm ∈ D are initialized to zero: hm = 0, ∀m ∈ D. Then, the distances dm = {dist(pij, cm)}, ∀m ∈ D, are calculated from the pixel intensity pij of image I to all the BoW histogram-bin centers. The dist function represents the Euclidean distance. The bin frequency of the bin center nearest to the pixel intensity, that is, the one with the minimum distance among all bin centers, is increased by one. In this way, a histogram is created considering all the pixels in each sign image. Fig. 3 shows an example of BoW histogram construction for the ten classes considered. It is clear from the figure that the proposed BoW histograms are distinct for each of the ten classes under examination, a characteristic that can be considered a strong feature for ASL number recognition. BoW histogram-bin frequencies are therefore used as the main features for ASL number recognition in this paper. In this proposed BoW-based feature extraction scheme, the pixel intensities of each grayscale image are used both to extract the bin centers and to construct the histogram. If the available sign images are in color, pixel intensity triplets could be used as local features for final histogram-based feature extraction.

Fig. 3. BoW histograms for ASL number signs. The first and third columns present the images of ten ASL number signs; the second and fourth present the BoW histograms corresponding to each image.
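The histogram construction described in this subsection amounts to a nearest-center assignment followed by counting. A minimal sketch follows; the function name `bow_histogram` is hypothetical, and for scalar gray levels the Euclidean distance reduces to an absolute difference.

```python
import numpy as np


def bow_histogram(image, centers):
    """Count, for each bin center, the pixels whose intensity is closest to it."""
    pixels = np.asarray(image, dtype=float).ravel()
    centers = np.asarray(centers, dtype=float)
    # Distance from every pixel to every bin center (|p - c| for scalars).
    dists = np.abs(pixels[:, None] - centers[None, :])
    nearest = np.argmin(dists, axis=1)
    # Bin frequencies start at zero and grow by one per assigned pixel.
    return np.bincount(nearest, minlength=centers.size)
```

For example, an image with intensities 0, 10, 90, 100, 110 and 240 and bin centers 0, 100 and 250 yields the bin frequencies 2, 3 and 1. The frequencies always sum to the number of pixels, so images of equal size produce directly comparable feature vectors.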

D. Classification with a kNN Classifier
The kNN is a simple, non-parametric supervised classifier in common use [25], [26]. It predicts the class of a test feature by searching for the K training samples nearest to the test feature. In general, the 'Cityblock', 'Cosine', 'Correlation', and 'Euclidean' distance functions are used to measure the distances from the new test feature to all the training samples in the feature space. The label held by the majority of the K nearest training samples is assigned to the test sample.
To obtain a suitable value for K, various values of K are tried.
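The classification rule can be sketched as below, using the city-block distance that the experiments later report as the best setting. The function name `knn_predict` and the tie-breaking rule (the smallest label wins) are assumptions made for this illustration.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=1):
    """Predict labels by majority vote among the k nearest training samples."""
    train_x = np.asarray(train_x, dtype=float)
    test_x = np.asarray(test_x, dtype=float)
    train_y = np.asarray(train_y)
    # City-block (L1) distance between every test and training feature vector.
    dists = np.abs(test_x[:, None, :] - train_x[None, :, :]).sum(axis=2)
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    preds = []
    for row in train_y[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)
```

Trying several values of K then reduces to calling `knn_predict` with each candidate on held-out data and keeping the value that gives the highest score.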

III. RESULTS AND DISCUSSION
This section documents the performance of the proposed method and describes the dataset and performance measurement criteria.

A. Dataset
Two datasets, one publicly available and the other acquired, are used to validate the performance of the proposed method.
International Journal of Machine Learning and Computing, Vol. 11, No. 1, January 2021

For the acquired dataset, 500 images are captured for each ASL number class, using a webcam interfaced with MATLAB, thus providing 5,000 images for performance evaluation. To capture the images, an environment with a clear background is set up and suitably lit for recording videos with a webcam. The hand gestures are captured in a region of interest (ROI) that is set up in the video. Snapshots are taken from the video at three-second intervals. Once the background frame has been captured by taking a snapshot with no gesture present at the ROI position, snapshots are taken with defined hand gestures (for example, 'sign zero') in the ROI position. In this way, 500 hand gesture images are taken for each of the ten ASL numbers, the image size being 250 × 274 pixels. After all the images have been captured, the preprocessing step described in Section II-A is applied to reduce the background noise and to enhance the hand gestures before feature extraction. For fair comparison, another dataset, the Kaggle ASL numbers dataset publicly available in [27], is also used for performance evaluation. In this dataset, 1000 images of size 30 × 30 are available for each class.

B. Performance Measurement Criteria
Classification produces four recognition types for the signed images. With respect to a given class, an image of another class that is misclassified as belonging to the given class is a false positive (Fp), while an image of the given class that is misclassified as belonging to another class is a false negative (Fn). When the class of a considered image is accurately predicted, the recognition is defined as a true positive (Tp) for the considered class and as a true negative (Tn) for all other classes. The standard performance measures accuracy, precision, recall, and F1 score are used to evaluate the performance of the proposed algorithm; they can be easily calculated from the confusion matrix using the equations provided below:
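The equations themselves do not survive in this copy of the text. The standard per-class definitions of the four measures, written with the symbols defined above, are given here as an assumption about what the original contained:

```latex
\begin{align}
\text{Accuracy}  &= \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \\
\text{Precision} &= \frac{T_p}{T_p + F_p} \\
\text{Recall}    &= \frac{T_p}{T_p + F_n} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```

How the per-class values are aggregated into the single figures reported in the results (for example, by averaging over the ten classes) is not stated in the text.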

C. Performance of the Proposed ASL Number Recognition Scheme
To evaluate the performance of the proposed ASL number recognition method, the first step is to vary the cluster size in the K-means clustering algorithm in the range k ∈ {4, 8, 16, ..., 128}. Next, using the cluster centroids as bin centers, BoW histograms are constructed. Because there are ten classes, the number of bin centers is ten times the value of the chosen k. The supervised kNN classifier then performs the classification with K = 1. All the results are evaluated using a tenfold cross-validation scheme and appear in Table I, which shows that the best performance on the constructed dataset is achieved with a clustering size of 16. The results are as follows: accuracy is 92.22%; precision is 92.19%; recall is 92.22%; and the F1-score is 92.21%. Table II shows the confusion matrix, where 'A' represents the actual class and 'P' the predicted class. The computational complexity of the proposed algorithm is O(N(p × q)m). The proposed algorithm takes 0.0123 seconds for feature extraction per sample and 0.02666 seconds for classification (system configuration: Intel(R) Core(TM) i5-6200U CPU @ 2.30 GHz 2.40 GHz, 8 GB RAM, 64-bit OS). In the rest of the paper, therefore, results from the constructed dataset are reported using a clustering size of 16 unless otherwise specified. The best results for the Kaggle ASL dataset are 100% for all performance indices with k = 128. Next, the performance of the proposed BoW histogram is compared with that of the conventional histogram-based approach. This involves using the gray pixel intensities of the sign images to construct the histogram for a bin size of 30, with the bin centers chosen by dividing the overall range of gray-level intensity into equal portions. The histogram bin frequencies are used as features in the kNN classifier with 'cityblock' distance. The results are shown in Fig. 4, which makes it clear that the proposed BoW histogram approach outperforms the conventional histogram approach.
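For contrast with the BoW histogram, the conventional baseline described above can be sketched in a few lines. The function name `conventional_histogram` and the 8-bit gray range of 0 to 255 are assumptions; the paper states only that 30 equal-width bins over the gray-level range are used.

```python
import numpy as np


def conventional_histogram(image, n_bins=30, lo=0, hi=255):
    """Equal-width histogram of gray levels over the range [lo, hi]."""
    pixels = np.asarray(image).ravel()
    counts, _ = np.histogram(pixels, bins=n_bins, range=(lo, hi))
    return counts
```

The only difference from the BoW histogram is where the bins sit: here they are fixed and evenly spaced, whereas the BoW bin centers are learned from the training images of each class.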
This paper therefore proposes the use of BoW histogram-based features for the recognition of signed numbers. Table III compares the performance of the proposed method for three supervised classifiers: artificial neural network (ANN), support vector machine (SVM), and K-nearest neighbors (kNN). The ANN classifier is implemented for the 'trainscg', 'trainrp', 'trainbfg', 'trainlm', and 'traingd' training functions and for a hidden node size of N ∈ {10, 20, 30, ..., 100}. The SVM classifier is implemented for multi-class classification using the 'one versus all' coding scheme with regularization parameter C ∈ {1, 2, 4, ..., 128} and Gaussian Radial Basis Function (RBF) kernel parameter σ ∈ {1, 2, 4, ..., 128}. The kNN classifier is implemented for 'Cityblock', 'Cosine', 'Correlation', and 'Euclidean' distances with K ∈ {1, 2, 3, ..., 10}. Table III shows the best performance of each classifier, each with its best possible settings. The kNN classifier, with K = 1 and 'cityblock' distance, produces the best result for both datasets. Finally, Table IV compares the BoW method with two others, one proposed in [26] and the other by SK Dixit [26], who documented the use of features obtained using the Combined Orientation Histogram and Statistical (COHST) and Discrete Wavelet Transform (DWT) approaches for ASL number recognition. The statistical features in [26] are extracted from the entire image; this may degrade the feature quality when the image pixels of different classes are at the same intensity level. Note also that the DWT method extracts features only from low-frequency components. Table IV shows that the proposed BoW-based feature extraction scheme outperforms the methods proposed in [26] in terms of all performance indices.

TABLE II. Confusion matrix for the constructed dataset (A: actual class, rows; P: predicted class, columns). Only the rows for actual classes 3 to 9 are legible here; of the rows for classes 0 to 2, only the last two entries of row 2 (1 and 0) survive.

A \ P     0     1     2     3     4     5     6     7     8     9
3         0     0     1   485     3     1     2     1     0     7
4         2     0     1     1   452     7    17    11     6     3
5         3     0     0     2     5   487     0     0     0     3
6         0     0     7     2    31     0   422    35     2     1
7         3     2    28     3    13     0    32   409     4     6
8         1    19     2     7     9     2     3     9   448     0
9         6     0     0     8     2     6     1     1     0   476
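As a worked example of reading a confusion matrix, the sketch below computes per-class recall from the Table II values that appear in this section. It rests on an assumption: each run of eleven numbers is read as a class label followed by ten prediction counts, which is supported by every such row summing to the 500 images per class. Only classes 3 to 9 can be read this way, so precision (which needs full columns) is not computed. The names `rows` and `per_class_recall` are hypothetical.

```python
# Rows for actual classes 3 to 9; each list holds the counts predicted
# as classes 0 to 9 (grouping of the raw values is an assumption).
rows = {
    3: [0, 0, 1, 485, 3, 1, 2, 1, 0, 7],
    4: [2, 0, 1, 1, 452, 7, 17, 11, 6, 3],
    5: [3, 0, 0, 2, 5, 487, 0, 0, 0, 3],
    6: [0, 0, 7, 2, 31, 0, 422, 35, 2, 1],
    7: [3, 2, 28, 3, 13, 0, 32, 409, 4, 6],
    8: [1, 19, 2, 7, 9, 2, 3, 9, 448, 0],
    9: [6, 0, 0, 8, 2, 6, 1, 1, 0, 476],
}


def per_class_recall(rows):
    """Recall of each class: correct predictions divided by the row total."""
    return {c: counts[c] / sum(counts) for c, counts in rows.items()}


recall = per_class_recall(rows)
for c in sorted(recall):
    print(f"class {c}: recall = {recall[c]:.3f}")
```

Under this reading, signs 6 and 7 are the ones most often confused with each other (35 and 32 images respectively), and sign 7 has the lowest recall of the seven legible classes.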

IV. CONCLUSION
An efficient ASL number recognition scheme is developed in this paper, using bag-of-words (BoW) histograms to extract features. It is experimentally shown that the BoW-based features are very suitable for classifying the ASL numbers. For classification, kNN, a simple and widely used classifier, is employed. The performance of the proposed scheme is evaluated in terms of accuracy, precision, recall, and F1 score, using two large datasets. The results show that the proposed BoW-based histogram method outperforms methods using conventional histograms; when compared with two other methods, it is shown to outperform both in terms of all performance indices. On our constructed dataset, the best performance is achieved with a clustering size of 16: accuracy is 92.22%; precision is 92.19%; recall is 92.22%; and the F1-score is 92.21%. For the Kaggle ASL dataset, all performance indices are 100% with a k of 128. The proposed method is therefore expected to help people who are deaf or speech-impaired by allowing them to communicate with others via intelligent devices; it is also expected to help the gaming and robotics industries develop applications that use hand gestures. In future, the authors wish to extend the proposed BoW-based model for ASL number recognition using semi-supervised learning techniques with larger datasets, where labeling all the images is challenging.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
Rasel Ahmed Bhuiyan and Abdul Matin actively participated in the implementation of the proposed model and prepared the draft paper. Md. Shafiur Raihan Shafi and Amit Kumar Kundu prepared the manuscript and proofread it. During the review process, all the authors worked together to address the reviewers' comments.

ACKNOWLEDGMENT
We would like to thank Amit Kumar Kundu for guiding us. He supervised us in the right direction, and his supervision encouraged us to take the implementation further.