Implementation of HOG Feature Extraction with Tuned Parameters for Human Face Detection

Extracting and tracking faces in image sequences is a required first step in many applications, such as face recognition, facial expression classification, and face tracking. It is a challenging problem in computer vision because many factors affect the image, including luminosity, skin color, background patterns, face orientation, and variability in size, shape, and expression. The objective of this paper is to experiment with a wide range of parameters for a HOG face detector, to select the most suitable kernel for the Support Vector Machine (SVM), and then to compare this method with some well-known face detection methods in order to identify the most reliable one. The aim of this study is not to provide the best face detection method, but rather to evaluate the performance of HOG features for detecting faces, to experiment with different kernels, and eventually to find tuned parameters for HOG descriptors. Based on the experimental results shown in Table IV, HOG + SVM scores the highest precision, accuracy, and sensitivity: 0.8824, 0.9986, and 0.75 respectively, compared with the Viola-Jones method, which scores 0.6512, 0.9973, and 0.7, and the skin color method, which scores 0.3968, 0.9947, and 0.625.


I. INTRODUCTION
Human face detection is one of the most important first steps in many computer vision applications, such as face tracking, emotion recognition, facial feature extraction, document control, access control, clustering, gender recognition and classification, human-computer interaction (HCI), and, lately, threat detection from facial expressions. Early face detection methods, dating from the 1970s, used anthropometric and heuristic techniques [1]. These methods tended to fail because they were very sensitive to changes in lighting, scale, and orientation. Furthermore, they were only able to detect faces in simple environments, such as pictures with a plain white background. Since then, many approaches and well-developed algorithms have become available for face detection, such as knowledge-based methods, feature-based methods, and template matching techniques [2]-[4]. In the last 20 years, various algorithms and methods have been introduced for face detection, such as Viola-Jones (2001) [5], which is based on Haar-like features; skin-based methods; and machine learning methods such as the Convolutional Neural Network (CNN or ConvNet) or a support vector machine (SVM) combined with a feature extraction technique such as histogram of oriented gradients (HOG), speeded-up robust features (SURF), or local binary patterns (LBP). In this paper our main attention is on the geometrical shape-based method (HOG) with tuned parameters, together with experimental results for different kernels such as Gaussian and Laplacian. We then compare our results with a Haar-like feature-based method (Viola-Jones). In order to make a fair comparison, the same database, the MIT CBCL face database, has been used for training both methods.
Manuscript received August 2, 2019; revised June 24, 2020. The authors are with the Dept. of Computer Science, College of Science, University of Duhok, Duhok, Kurdistan Region, Iraq (e-mail: mohammed.guhdar@uod.ac, amera_melhum@uod.ac).

A. Viola-Jones Method
One of the most successful attempts was by Paul Viola and Michael Jones [5]. Their approach was based on a cascade of simple features available on a human face, resulting in one of the most successful achievements of its time in terms of accuracy and speed. In this method Viola and Jones used Haar-like features, each of which is a scalar product between an image and a Haar-like feature template, as shown in Fig. 1. The mathematical representation is given in equation 1, and the image resulting from applying the equation is shown in Fig. 2. In the following equation I and P denote an image and a pattern respectively, both of the same size N × N. The feature associated with pattern P of image I is defined by: (1)  Only the pixels in black and white are used for feature calculation. The AdaBoost learning algorithm is then used to give meaning to the features calculated by the scalar product between the image and the Haar-like templates, and it decides whether a given area (window) is a face or not. One of the most interesting points of the work presented by Viola and Jones is that, in contrast to most face detection systems, which work directly on pixel intensities, it uses a new image representation called the "integral image", or summed-area table, to speed up feature calculation, as shown in Fig. 3. A linear combination of weak classifiers forms a strong classifier in order to increase accuracy and reduce computation time. Fig. 4 shows the structure of the cascade classifier. AdaBoost requires positive and negative training data in order to correctly separate face candidates from non-face candidates. One serious problem with the Viola-Jones method is handling faces in different orientations; however, it can be addressed by adding more cascades for different orientations.
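The integral image and a two-rectangle Haar-like feature can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function names are hypothetical, the table is padded with a zero row and column for convenience, and the sign convention (white region minus black region) is assumed because equation 1 is not reproduced above.

```python
import numpy as np

def integral_image(img):
    """Summed-area table, padded with a leading zero row and column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle feature: left (white) half minus right (black) half.

    The white-minus-black sign convention is an assumption.
    """
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return white - black

# Tiny 4x4 example image with pixel values 0..15.
img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
feature = haar_two_rect(ii, 0, 0, 4, 4)
```

Once the table is built, any rectangle sum costs four lookups regardless of the rectangle size, which is what makes evaluating many Haar-like features per window affordable.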

B. Skin Color Detection
One of the fastest methods for face detection is human skin color detection [6]-[10]. Skin color is used to decide whether a given pixel is a skin pixel or a non-skin pixel. To obtain accurate results across different skin colors, most researchers tend to use the YCbCr color space because, unlike RGB, YCbCr separates luminance from chrominance and therefore achieves better accuracy. The algorithm is fast because, by eliminating all non-skin pixels, it discards many unwanted pixels, typically about 60% to 70% of the image area that would otherwise need to be processed.
In YCbCr, Y is the luma component, which represents the luminance and is computed from nonlinear RGB [11] as a weighted sum of the RGB values. Cb is the difference between the blue component and luma, and Cr is the difference between the red component and luma. Y thus carries the luminance, while Cb and Cr are the chrominance components.
The thresholds used in this specific algorithm for detecting skin color are 76 < Cb < 127 and 132 < Cr < 173 [12]. Unlike RGB, YCbCr has separate luminance and chrominance components, which makes this color space attractive for skin color segmentation [13].
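The thresholding step can be sketched as below. The Cb and Cr limits are the ones quoted above; the RGB-to-YCbCr coefficients are an assumption (the full-range BT.601 variant used by JPEG), since the text does not state which conversion was used, and the function names are hypothetical.

```python
import numpy as np

def rgb_to_cbcr(rgb):
    """Return the Cb and Cr planes of an (H, W, 3) RGB image.

    Coefficients are the full-range BT.601 (JPEG) ones, assumed here.
    """
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr

def skin_mask(rgb):
    """Boolean mask of pixels with 76 < Cb < 127 and 132 < Cr < 173."""
    cb, cr = rgb_to_cbcr(rgb)
    return (cb > 76) & (cb < 127) & (cr > 132) & (cr < 173)

# One skin-like pixel and one pure blue pixel (illustrative values only).
pixels = np.array([[[224, 172, 105], [0, 0, 255]]], dtype=np.uint8)
mask = skin_mask(pixels)
```

Because Y is never consulted, the decision depends only on chrominance, which is the property that motivates the choice of YCbCr over RGB.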

III. SUPPORT VECTOR MACHINES (SVMS)
Support vector machines (SVMs) are a set of related supervised learning methods in machine learning used for regression and classification [14]. In simple terms, given a set of training examples, each labeled as belonging to one of two classes, the SVM algorithm builds a model that predicts into which category a new example falls. Intuitively, an SVM model is a representation of the data as points in space, mapped so that the examples of the two classes become distinguishable and are separated by a gap. New examples are then mapped into that same space and predicted to belong to a class based on which side of the gap they fall on. A linear support vector machine is composed of a set of given support vectors z and a set of weights w. The output of a given SVM with N support vectors z1, z2, … , zN and weights w1, w2, … , wN is then given by equation 2 [15].
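As an illustration of how such an output is evaluated, the sketch below assumes the common form F(x) = Σ w_i ⟨z_i, x⟩ + b. The bias term b, the sign-based decision rule, and the function names are assumptions, since equation 2 is not reproduced above.

```python
import numpy as np

def linear_svm_output(x, support_vectors, weights, bias=0.0):
    """Weighted sum of dot products between x and each support vector.

    Assumed form: F(x) = sum_i w_i * <z_i, x> + b.
    """
    return float(weights @ (support_vectors @ x) + bias)

def classify(x, support_vectors, weights, bias=0.0):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if linear_svm_output(x, support_vectors, weights, bias) >= 0 else -1

# Two illustrative support vectors with opposite-signed weights.
z = np.array([[1.0, 1.0], [-1.0, -1.0]])
w = np.array([0.5, -0.5])
score = linear_svm_output(np.array([2.0, 1.0]), z, w)
```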

A. Kernel Trick
One of the simplest definitions of the kernel trick is transforming linearly inseparable data, as shown in Fig. 5, into linearly separable data, as shown in Fig. 6, so that the classes can be separated by a plane. The kernel function is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
In 1992, Bernhard Boser, Isabelle Guyon and Vladimir Vapnik suggested a way to create non-linear classifiers from a linear classifier by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high-dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.
International Journal of Machine Learning and Computing, Vol. 10, No. 5, September 2020
Using kernels, the original formula for an SVM with support vectors z1, z2, … , zN and weights w1, w2, … , wN is now given by equation 3.

1) Gaussian kernel
Alternatively, it could also be implemented using
The sigma (σ) is an adjustable parameter that plays an important role in the accuracy of the kernel and should be tuned carefully for the problem at hand. If it is overestimated, the exponential will behave almost linearly and the higher-dimensional projection will lose its non-linear power. On the other hand, if it is underestimated, the function will lack regularization and the decision boundary will become sensitive to noise. x and y are N×N size convolution matrices.

2) Laplacian kernel
The Laplace kernel is closely related to the Gaussian kernel, with only the square of the norm left out. It is also a radial basis function kernel, but it is less sensitive to changes in the sigma parameter.
By using the Laplacian kernel function, the data can be transformed into a higher-dimensional space in which there is more freedom to separate them. x and y are N×N size convolution matrices.
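The two kernels can be sketched as follows. The formulas are the commonly used ones, exp(-‖x - y‖² / (2σ²)) for the Gaussian kernel and exp(-‖x - y‖ / σ) for the Laplacian kernel, and are assumed here because the corresponding equations are not reproduced above. The function names and the bias term are hypothetical.

```python
import numpy as np

def _distance(x, y):
    """Euclidean (Frobenius) distance between two arrays of equal shape."""
    diff = np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)
    return float(np.linalg.norm(diff.ravel()))

def gaussian_kernel(x, y, sigma):
    """Assumed form: exp(-||x - y||^2 / (2 * sigma^2))."""
    return float(np.exp(-_distance(x, y) ** 2 / (2.0 * sigma ** 2)))

def laplacian_kernel(x, y, sigma):
    """Assumed form: exp(-||x - y|| / sigma), i.e. the norm is not squared."""
    return float(np.exp(-_distance(x, y) / sigma))

def kernel_svm_output(x, support_vectors, weights, kernel, bias=0.0):
    """Linear SVM output with every dot product replaced by kernel(z_i, x)."""
    return float(sum(w * kernel(z, x) for z, w in zip(support_vectors, weights)) + bias)

# Illustrative values: two support vectors at distance 5 from each other.
svs = [np.array([0.0, 0.0]), np.array([3.0, 4.0])]
ws = [1.0, -1.0]
out = kernel_svm_output(np.array([0.0, 0.0]), svs, ws,
                        lambda a, b: gaussian_kernel(a, b, sigma=5.0))
```

With a very large sigma both kernels approach 1 for all pairs of inputs, which illustrates the loss of non-linear power described above.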

IV. HISTOGRAM OF ORIENTED GRADIENT AND SUPPORT VECTOR MACHINE
At the CVPR conference in 2005, Dalal and Triggs [17] used a HOG descriptor with a linear SVM and a detection window 64 pixels wide by 128 pixels tall to detect people in images with high accuracy. In 2018, Chee, Kok Wei, and Soo Siang implemented a combination of HOG with HOM (histogram of magnitude) for the same purpose and achieved 99.0% accuracy, compared with 95.5% for HOM and 98.6% for HOG when the features are used independently [18]. The same concept has been applied to human tracking [19] with high accuracy.
HOG features are a coded version of the image created using cell and block structures. They provide information about shape orientation, based on equation 7, and luminosity density, based on equation 8, for each pixel. Most importantly, they let us observe changes in shape along each direction. The HOG feature vector is generated as follows:
1) Combine the gradient calculations of each pixel, as shown in Fig. 7.
2) Generate a histogram for each block using the gradient values.
3) Normalize the histograms, as shown in Fig. 8, and collect the normalized vectors of each block together.
For an 8×8 block, a histogram of gradients with 9 bins is calculated from the above data, as shown in Fig. 8. Gradients with an orientation below 10° add their magnitudes to the 0° bin, those with an orientation from 10° to below 30° add their magnitudes to the 20° bin, and so on. Based on the above steps, and after the pre-processing steps, all the images in the MIT CBCL face database are converted into their HOG representation, a feature vector that represents the face image, as visualized on the right side of Fig. 9.
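The steps above can be sketched for a single 8×8 cell as follows. This is a minimal illustration under several assumptions: centred differences for the gradients, unsigned orientations in [0°, 180°), hard assignment of each pixel to the nearest bin centre (0°, 20°, …, 160°), and L2 normalization. None of these choices is confirmed by the text, and all function names are hypothetical.

```python
import numpy as np

def gradients(img):
    """Per-pixel gradient magnitude and unsigned orientation in degrees."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # centred difference in x
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # centred difference in y
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation

def cell_histogram(magnitude, orientation, n_bins=9):
    """Add each pixel's magnitude to the bin nearest to its orientation."""
    bin_width = 180.0 / n_bins
    # Orientations of 170 degrees or more wrap around to the 0 degree bin.
    idx = np.floor(orientation / bin_width + 0.5).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, idx.ravel(), magnitude.ravel())
    return hist

def l2_normalize(v, eps=1e-6):
    """L2 normalization of a histogram vector (assumed scheme)."""
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)

# 8x8 cell containing a horizontal intensity ramp.
cell = np.tile(np.arange(8), (8, 1))
mag, ang = gradients(cell)
hist = cell_histogram(mag, ang)
normalized = l2_normalize(hist)
```

For the ramp above every non-zero gradient points along the x axis, so all of the magnitude falls into the 0° bin.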

V. THE PROPOSED METHOD
The proposed method uses a support vector machine with HOG features and either a Laplacian or a Gaussian kernel for transforming the data. Furthermore, tuned parameters for the HOG features, namely the number of bins, the cell size, and the block size, have been identified for face detection based on experimental results. A bin size of 7, a cell size of 3, and a block size of 6 achieved the most successful results in detecting faces in crowded and complex scenes.
Viola-Jones has difficulty detecting faces in different orientations; to address this, the method adds more Haar-like features in rotated form. Many available methods rotate the sliding window by 35°, 45°, and 60° in both directions in order to detect faces in different orientations. All of these additional steps increase complexity and decrease performance. Another problem with these methods is the use of sliding windows of different sizes to detect an object at multiple scales; Viola-Jones simply adds more Haar-like features, this time with larger and smaller window sizes, to detect faces at multiple scales. In the proposed method, an image pyramid is used to detect objects at multiple scales, and instead of rotating the sliding window, the whole image is rotated. This increased the speed by almost 90%. For example, suppose we have a 500 px by 350 px image and slide the window by 1 px in the X direction and 1 px in the Y direction each time. We end up with 175,000 windows. If we rotate each window 3 times, once per rotation angle, this becomes 525,000 operations, and with, say, 4 scale levels we end up with 2,100,000 operations, which is a huge number, especially for a real-time implementation.
In our case we still have 175,000 windows at the first level, but only 3 rotation operations, because we rotate the whole image 3 times instead of rotating window by window. For scaling, using an image pyramid that halves the image dimensions at each of 3 levels, we have 175,000 + 43,750 + 10,875 = 229,625 windows. Multiplying this number by 3, once for each rotation angle, gives a total of 688,875 operations.
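The arithmetic of this example can be reproduced with a short sketch. It follows the simplification used in the text of counting one window per pixel position, and it assumes integer (floor) halving of the image dimensions, which is what yields 125 × 87 = 10,875 at the third level. The function name is hypothetical.

```python
def pyramid_window_counts(width, height, levels):
    """Windows per pyramid level, one window per pixel position.

    Dimensions are halved with floor division at each level (assumed).
    """
    counts = []
    for _ in range(levels):
        counts.append(width * height)
        width, height = width // 2, height // 2
    return counts

rotations = 3

# Per-window scheme: every window is rotated, at 4 window scales.
naive_operations = 500 * 350 * rotations * 4

# Image-level scheme: 3-level pyramid, the whole image rotated 3 times.
counts = pyramid_window_counts(500, 350, levels=3)
proposed_operations = sum(counts) * rotations

ratio = proposed_operations / naive_operations
```

By this count, the image-level scheme needs roughly a third of the operations of the per-window scheme.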
The rotation operation performs a geometric transform that maps the position (x1, y1) of a picture element in an input image onto a position (x2, y2) in an output image by rotating it through a user-specified angle θ about an origin O. This is an iterative process: each pixel [x1(i), y1(i)], where i denotes a pixel location, is rotated one by one using equations 9 and 10. In most implementations, output locations (x2, y2) that fall outside the boundary of the image are ignored. Rotation is mostly used to improve the visual appearance of an image, although it can also be useful as a pre-processing step in applications where directional operators are involved.
The rotation operator performs a transformation of the form given in equations 9 and 10, where (x0, y0) are the coordinates of the center of rotation (in the input image) and θ is the angle of rotation, with clockwise rotations having positive angles. Fig. 12 shows an image rotated by 90°.
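A per-pixel rotation of this kind can be sketched as below. The formulas used are the standard ones, x2 = cos(θ)(x1 - x0) - sin(θ)(y1 - y0) + x0 and y2 = sin(θ)(x1 - x0) + cos(θ)(y1 - y0) + y0, and are assumed here because equations 9 and 10 are not reproduced above. In image coordinates, where y grows downwards, a positive θ then corresponds to a clockwise rotation on screen. The function name is hypothetical.

```python
import math

def rotate_point(x1, y1, theta_deg, x0=0.0, y0=0.0):
    """Rotate (x1, y1) by theta_deg degrees about the center (x0, y0).

    Assumed standard form of the rotation equations.
    """
    t = math.radians(theta_deg)
    dx, dy = x1 - x0, y1 - y0
    x2 = math.cos(t) * dx - math.sin(t) * dy + x0
    y2 = math.sin(t) * dx + math.cos(t) * dy + y0
    return x2, y2

# A point one pixel to the right of the center (1, 1), rotated by 90 degrees.
x2, y2 = rotate_point(2.0, 1.0, 90.0, x0=1.0, y0=1.0)
```

Rotating a whole image applies this mapping to every pixel and discards output locations that fall outside the image boundary.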

VI. PRE-PROCESSING
In the pre-processing stage, a grayscale filter is first applied to the image using a predefined function of the open source Accord.NET library, and then histogram equalization is applied, as shown in Fig. 13. Fig. 14 shows the low performance achieved before applying histogram equalization to the image, and Fig. 15 shows how the performance improves after applying it.
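The two pre-processing steps can be illustrated with the NumPy sketch below. It does not use or reproduce the Accord.NET filters mentioned above; the BT.601 luma weights and the cumulative-histogram mapping are assumptions about how such filters are commonly implemented, and the function names are hypothetical.

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted sum of R, G, B with BT.601 luma weights (assumed)."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

def equalize_histogram(gray):
    """Spread the grey levels using the cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    total = gray.size
    if total == cdf_min:
        # Constant image: there is nothing to stretch.
        return gray.copy()
    lut = np.rint((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# Low-contrast example: grey levels confined to the range 100..120.
gray = np.array([[100, 100], [110, 120]], dtype=np.uint8)
equalized = equalize_histogram(gray)
```

After equalization the grey levels of the example span the full 0 to 255 range, which is the contrast stretch that precedes feature extraction.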

A. Test Results for HOG + SVM
In this research we used HOG descriptors + SVM with Gaussian and Laplacian kernels for face detection. For training, the MIT CBCL face database has been used, which contains about 4,500 negative samples and 2,500 positive samples, all in grayscale PGM format. The HOG transformation is applied before training. The same database has been used for training AdaBoost in Viola-Jones. After tuning the HOG hyperparameters through many experiments and adjustments of the cell size, block size, and number of bins, the highest precision value was achieved, as shown in Table I, with a detection rate between 86% and 97% and a false positive rate between 1% and 4.5%. The experiments show that the best result is achieved by setting the cell size to 6, the block size to 3, and the number of bins to 7.
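For reference, the measures reported in this paper (precision, accuracy, sensitivity, and false positive rate) can be computed from detection counts as sketched below. The definitions are the standard ones and are assumed to match those used for the tables. The counts in the example are hypothetical and are not taken from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),          # also called detection rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=8, fp=2, tn=85, fn=5)
```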

1) Effect of different kernels on HOG + SVM performance
Based on the experiments, the Gaussian kernel achieved the best performance among all the kernels tested, and the Laplacian kernel achieved the second best.

B. Test Results for Viola-Jones
After training Viola-Jones on the MIT CBCL database for face detection, it achieves good results on the frontal faces of the MIT database. However, when it is tested on the CMU test set, the accuracy falls significantly. A possible reason is that the MIT training set mainly contains frontal faces, as shown on the left side of Fig. 17, while the CMU set mainly contains varied illumination and multi-view conditions, as shown on the right side of Fig. 17. Furthermore, according to [5] and [20], the original Haar-like features have limitations for multi-view face detection and lack robustness under varying lighting conditions. As shown in Table II, the experiment shows that the Viola-Jones algorithm scores a low precision value and a high false positive rate.

C. Test Results for Skin-Based Face Detection
The test results for detecting faces based on skin color are shown in Table III. They show that relying only on skin color yields a very low precision value and a high rate of false positives.
Depending on the color tone of the image, the results can be better or worse because of the diversity in skin color. This is one reason why skin-based face detection fails at some point: there is no limit to the range of color tones. In Fig. 18 and Fig. 19 the same method with the same configuration has been applied, but totally different results have been achieved. These images are the test results of skin pixel detection. The tests show that there are many false positives, since at this stage we depend only on skin color. Most of the false positives can be eliminated by using an additional step.
One example of such a step would be template matching or a machine learning technique such as support vector machines or neural networks. However, the accuracy of face localization is the real issue here, and the solution to that problem is an active research topic.
The experiments show that one cannot depend only on skin detection for detecting a face accurately. There should always be an additional method alongside skin detection in order to remove these false positives. For example, if we use template matching or any classification method alongside skin detection, the false detection rate could be reduced by 30% to 60%, depending on the training data provided to the additional method.

D. Running Speed
The performance of the three methods is tested on a 64-bit 1.2 GHz quad-core ARM Cortex-A53 CPU, and there is a huge difference in the speed of the three methods, as shown in Table V. The HOG method is more precise and covers most cases, such as rotation, scale, and even people wearing glasses or turning their face to either side, so it detects faces in most frames, which slows down the whole process.

VIII. CONCLUSION
It is concluded that, for face detection based on skin color, we struggled to filter out many false positives even after using template matching methods. Varying illumination and different skin tones were the main problems for face detection based on skin color. On the other hand, the Viola-Jones algorithm takes much more time to train than the SVM-based algorithm, but in real-time detection it achieves promising results in terms of speed. It is good at detecting frontal faces and can overcome illumination problems by using histogram equalization. However, it struggles to detect faces in different orientations, although Haar-like features are more robust to illumination changes than color. It also becomes inefficient in terms of memory and processing cost when the orientation problem is addressed by adding more Haar-like features. The Histogram of Oriented Gradients (HOG) is an excellent alternative feature choice, as it is largely invariant to illumination changes and is capable of capturing geometric properties of faces that are difficult to capture with simple edge filters such as Haar-like features; it is fast, accurate, and achieves promising results. Finally, based on its experimental results, this research recommends HOG + SVM over Viola-Jones and skin color for face detection.

CONFLICT OF INTEREST
The authors declare no conflict of interest.