Gender Classification of Thai Facebook Usernames

Abstract: This paper presents an application of machine learning to classify the gender of Facebook users based on their username alone. User profile information on social networks is important in many studies, but some information, such as age or gender, is occasionally not publicly available online. Most studies use only textual information from the web page. Instead, we study gender classification by username, in which gender is inferred from the user's first name and alias name. We focus only on Thai names, which may have certain patterns that reveal the owner's gender. A combination of different models is proposed to classify gender based on Thai Facebook usernames. Each model was trained using a supervised learning approach, and the classification results were then combined in a final model. Using this method, the model achieved an accuracy of 91.75%.


I. INTRODUCTION
Social media is used extensively and has become a part of people's daily lives. It influences the lifestyles of people of all ages in almost every aspect of life [1]. Facebook is the largest and most used social media platform both globally and in Thailand [2]. Facebook provides a simple way for users to express their feelings, ideas, and opinions by enabling them to generate their own content. Information published online can be studied to analyze social processes ranging from economics to public health. For example, marketing teams can use online data to forecast target groups, opinions about political events can be analyzed, and perspectives on social issues or mental health can be assessed [3]. Yet most social media sites do not require user demographic details. Moreover, users are not required to reveal their identities, and some users disguise or conceal their private information to hide their true identities. Consequently, user information derived from social media can be inaccurate and can fail to match the target group when used for analysis or communication. Correct gender classification of social media users is therefore important, since it can help solve issues caused by limited or inaccurate user information. According to previous studies, existing methods for classifying gender usually analyze text, but it is not always possible to find messages to analyze. The name of a user is another basis for classifying gender, especially in the Thai language. In Thai, female and male names have different characteristics. For example, "พัชราภา" (Patcharapa), "วิชุลดา" (Wichulada), and "พรพรรณ" (Pornpan) are female names, while "สุชาติ" (Suchart), "สมเกียรติ" (Somkiat), and "ประวุฒิ" (Prawut) are male names. There are also names that can be used by either gender, such as "วิรัตน์" (Wirut), "สุวรรณ" (Suwan), and "สมพร" (Somporn).
Further challenges were caused by many social media users utilizing an alias name rather than their actual name.
For the aforementioned reasons, we decided to study and solve such problems using machine learning techniques to classify gender based on Thai Facebook usernames. This was done using both first names and alias names and analyzing the features extracted from username-derived textual information.

II. RELATED WORK
In recent years, several studies have proposed machine learning approaches to infer the demographic attributes of social media users from text, names, images, location, or profile colors, covering gender, age, personality, education, marital status, ethnicity, geographic location, language, and race [4]-[11].
International Journal of Machine Learning and Computing, Vol. 10, No. 5, September 2020

These authors explored a variety of methods for gender classification from usernames. For example, Alowibdi, J.S., U.A. Buy, and P. Yu [5] experimented with a phoneme-based feature set, with word-frequency-based features, and with 1-gram through 5-gram features. The best result was 82.5% accuracy for first names, using 3-gram phoneme-based features with a Decision Tree classifier, and 75.2% accuracy for usernames, using the same features and classifier. Bergsma, S., Dredze, M., Van D.B., Wilson, T., and Yarowsky, D. [6] ran experiments using a Support Vector Machine to classify gender by first name and surname, comparing five feature sets: token, character n-gram, cluster, token with n-gram, and all features together. The best result was 90.2% accuracy with all features. Akbar, R. [7] compared Multinomial Naïve Bayes with Random Forest to classify gender from Indonesian names using the frequency of characters, the last character, and the last two characters as features. The classifiers yielded an accuracy of around 70% (Multinomial Naïve Bayes) and 83% (Random Forest). Further, Septiandri, A.A. [8] classified the gender of Indonesian names using a character-level Long Short-Term Memory network, compared with Naïve Bayes, Logistic Regression, and XGBoost using n-grams as features. The results showed that Naïve Bayes and Logistic Regression performed best with 3-grams, and XGBoost performed best with 2-grams. The character-level Long Short-Term Memory network classified gender more accurately than Logistic Regression: accuracy rose from 85.28% to 92.25% in the full name case, while using first names only yielded an accuracy of 90.65%.

Vicente, M., F. Batista, and J.P. Carvalho [11] proposed methods based on the combination of different classifiers. They created four distinct classifiers, each of which considered a group of features extracted from one of four different sources, and compared the performance of Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine, and Decision Tree classifiers. The final classifier, which combined the four individual classifiers, achieved the best performance, corresponding to 93.2% accuracy for English and 96.9% accuracy for Portuguese data.

A. Dataset
We focused on Thai Facebook usernames collected between January and March 2019 using the Selenium library. We selected users whose usernames contained only Thai characters and whose gender was publicly visible on their profile. After all the data was collected, the usernames were manually labelled as one of two types, namely first name or alias name. Of the 4317 usernames collected, 2047 were female (47.42%) and 2270 were male (52.58%). Moreover, 1961 of them were first names while 2356 were alias names (see Table I). An example of the dataset is given in Fig. 1.

B. Thai Word Tokenization Features
In written Thai, words are not separated by spaces; spaces are instead used to mark a new sentence. Furthermore, most alias names can be separated into words, so we used the PyThaiNLP library [12] with the Maximal Matching method to segment the Thai names. An example of Thai word tokenization is given in Fig. 2. The word tokenization analysis results show that each gender uses self-describing words. For example, females used "น้อง" (sister), "แม่" (mother), and "หญิง" (girl), while males used "พี่" (brother), "หนุ่ม" (boy), and "เสือ" (tiger).
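The idea of dictionary-based segmentation can be illustrated with a minimal sketch. It is a simplified greedy longest-match tokenizer, not PyThaiNLP's Maximal Matching implementation, and the function name and the toy vocabulary (built from the example words above) are assumptions made for illustration only.

```python
def longest_match_tokenize(text, vocabulary):
    """Greedy forward longest-match segmentation against a word list.

    Simplified illustration only: characters not covered by the
    vocabulary are grouped into a single "unknown" token.
    """
    max_len = max((len(word) for word in vocabulary), default=0)
    tokens = []
    unknown = ""  # buffer for characters that match no vocabulary word
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                match = candidate
                break
        if match is None:
            unknown += text[i]
            i += 1
        else:
            if unknown:
                tokens.append(unknown)
                unknown = ""
            tokens.append(match)
            i += len(match)
    if unknown:
        tokens.append(unknown)
    return tokens


# Hypothetical toy vocabulary taken from the self-describing words above.
TOY_VOCABULARY = {"น้อง", "แม่", "หญิง", "พี่", "หนุ่ม", "เสือ"}
alias_tokens = longest_match_tokenize("น้องหญิง", TOY_VOCABULARY)
```

With this toy vocabulary, the alias "น้องหญิง" is split into "น้อง" and "หญิง". A production tokenizer would rely on a full Thai dictionary and would resolve ambiguous segmentations, which this sketch does not attempt.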
Nonetheless, such words may be found in the alias name.

C. Thai Speech Part Classification Features
Virach S. et al. [13] classified Thai parts of speech into 14 categories (noun, pronoun, verb, auxiliary, determiner, adverb, classifier, conjunction, preposition, interjection, prefix, ending, negator, and punctuation) and divided them into 47 subcategories. In the present study, the PyThaiNLP library was used to extract parts of speech together with word tokenization. An example of Thai speech part classification is given in Fig. 4.

D. Thai Character Frequency Features
In the Thai language there are three main character types: consonants, vowels, and tones. These are located in the upper, middle, and lower levels [14]. The frequency of each character was counted from the first name only. An example of Thai character frequency is given in Fig. 5. Character occurrences were counted separately for female and male names, as shown in Fig. 6. The bar chart shows that the character 'ณ' occurs more often in female first names than in male first names. The same holds for 'ญ', 'ร', 'ภ', and 'า'. Meanwhile, the characters 'ษ', 'ง', 'ช', 'ต', and 'ิ' appear more frequently in male first names. To measure how mixed the genders are for each character in first names, we calculated the entropy of the gender distribution for each character. A high value denotes a character used by both genders, and a low value indicates a character used in the names of just one gender. Specifically, entropy is defined as the negative sum, over the genders, of the probability of each gender multiplied by the log of that probability.
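As a minimal sketch of these two steps, the code below counts characters in a first name and computes, for each character c, the entropy H(c) = -Σ p(g|c) log p(g|c) over the genders g. The function names, the use of a base-2 logarithm, and counting every occurrence of a character (rather than once per name) are assumptions, since the text does not specify them.

```python
import math
from collections import Counter, defaultdict


def character_frequency(first_name):
    """Count how often each character (code point) occurs in one first name."""
    return Counter(first_name)


def gender_entropy_per_character(labelled_names):
    """Return {character: entropy of the gender distribution for that character}.

    labelled_names is an iterable of (first_name, gender) pairs.
    Entropy is 0 when a character appears under one gender only and
    reaches 1 bit when it is split evenly between two genders.
    """
    counts = defaultdict(Counter)
    for name, gender in labelled_names:
        for character in name:
            counts[character][gender] += 1

    entropy = {}
    for character, by_gender in counts.items():
        total = sum(by_gender.values())
        value = 0.0
        for count in by_gender.values():
            probability = count / total
            value -= probability * math.log2(probability)
        entropy[character] = value
    return entropy


name_counts = character_frequency("พรพรรณ")
```

Characters with entropy near zero are the ones that separate the genders most cleanly, which matches the interpretation given in the paragraph above.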

E. Thai Substring Character Features
Thai male and female first names have different characteristics. For example, names starting with "พร-" are most likely to be female, while names starting with "พล-" are most likely to be male. Meanwhile, names ending with "-วรรณ" are most likely to be female, and names ending with "-วัฒน์" are always male. Features were created from substrings of the first name. Six feature types were used in this experiment: the first two characters, the first three characters, the first four characters, the last two characters, the last three characters, and the last four characters. An example of Thai substring characters is given in Fig. 7. Fig. 8 presents the 10 most frequent substrings for first names, with female substrings on the left and male substrings on the right.
Some substrings were found in female first names but not in male first names, for instance the last two characters 'ภา', the last two characters 'ณี', the last four characters 'ิพย์', and the last three characters 'ิดา'. Conversely, some substrings were found in male first names but not in female first names, for example the last three characters 'ชัย', the last two characters 'พล', the last four characters 'พงษ์', and the first two characters 'ธี'.
The substring character analysis results show that female and male first names have different characteristics. These distinct properties could therefore be used as features for the model.
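The six substring feature types can be sketched as follows. The function name, the feature keys, and the choice to skip a substring length when the name is shorter than that length are assumptions for illustration. Characters are treated as Unicode code points, which agrees with the examples in this section, where vowel and tone marks count as separate characters.

```python
def substring_features(first_name):
    """Extract the first and last 2, 3, and 4 characters of a first name.

    Lengths longer than the name itself are skipped (an assumption;
    the handling of very short names is not described in the text).
    """
    features = {}
    for size in (2, 3, 4):
        if len(first_name) >= size:
            features[f"first_{size}"] = first_name[:size]
            features[f"last_{size}"] = first_name[-size:]
    return features


example_features = substring_features("พรพรรณ")
```

For the female name "พรพรรณ" this yields, among others, the prefix "พร" mentioned above as a likely female indicator.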

F. Process Design
In this study, four feature groups were used: 1) word tokenization features; 2) speech part classification features; 3) character frequency features; and 4) substring character features. These feature groups served as inputs to four separate models, each with a different purpose: A) classifying gender from the first name only; B) classifying gender from the alias name only; C) classifying a username as a first name or an alias name; and D) classifying gender from the username. The final model in Fig. 9 combines the outputs of all four models.
We used K-Nearest Neighbor, Support Vector Machine, Random Forest, Multinomial Naïve Bayes, and Neural Network in Models A, B, C, and D, while a Neural Network was used to combine the results from all the classifiers.
The experiments on the combined models were divided into three parts. The first part combined the models that classify gender from the first name (Model A) and from the alias name (Model B). The second part combined Model A, Model B, and the model that distinguishes first names from alias names (Model C). The third part combined all four models.
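The three input sets for the final model can be sketched as a simple concatenation of base-model scores. This is an illustrative assumption about how the outputs are assembled: the dictionary layout, the names, the ordering of the scores, and the score pair assumed for Model D are all hypothetical.

```python
# Which base models feed the final model in each experiment.
COMBINATIONS = {
    "two_models": ("A", "B"),
    "three_models": ("A", "B", "C"),
    "four_models": ("A", "B", "C", "D"),
}


def build_final_input(scores, combination):
    """Concatenate the base-model scores for one username into one vector.

    scores maps a model name to its output pair, for example
    (female_score, male_score) for Models A and B and
    (first_name_score, alias_score) for Model C.
    """
    vector = []
    for model in COMBINATIONS[combination]:
        vector.extend(scores[model])
    return vector


# Hypothetical scores for a single username.
example_scores = {
    "A": (0.81, 0.19),
    "B": (0.55, 0.45),
    "C": (0.90, 0.10),
    "D": (0.70, 0.30),
}
final_input = build_final_input(example_scores, "three_models")
```

The resulting vector would then be passed to the final classifier, so the two-, three-, and four-model experiments differ only in how many score pairs the vector contains.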

G. Evaluation Approach
The program was developed in Python 3 using the open-source scikit-learn 0.21.3 library [15].
All the experiments were performed using stratified 10-fold cross-validation, and performance was evaluated based on accuracy. In k-fold cross-validation, the original samples are randomly partitioned into k equal-sized subsamples. One subsample is used as the test data, while the remaining k-1 subsamples are used as training data. The process is repeated k times, with each of the k subsamples used exactly once as the test data. The k results from the folds are then averaged to produce a single estimate. In the stratified variant, each fold preserves the proportion of each class found in the full dataset.
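The splitting procedure can be sketched in plain Python as shown below. This is an illustration of the idea only, not the scikit-learn routine used in the experiments, and the function name, the round-robin assignment, and the seed parameter are assumptions.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for stratified k-fold CV.

    Indices of each class are shuffled and dealt round-robin into k
    folds, so every fold keeps roughly the overall class proportions.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for test_fold in range(k):
        test = sorted(folds[test_fold])
        train = []
        for fold_number, fold in enumerate(folds):
            if fold_number != test_fold:
                train.extend(fold)
        yield sorted(train), test
```

Each sample appears in the test set of exactly one fold, and the per-fold accuracies can then be averaged to give the single reported figure.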
Accuracy was defined as the proportion of true results, either true positive or true negative (1).

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (1)$$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
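As a small worked example, accuracy is the number of correct predictions (TP + TN) divided by the total of all four counts. The function name and the sample counts are illustrative assumptions.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical confusion counts: 85 of 100 predictions are correct.
example_accuracy = accuracy(tp=40, tn=45, fp=10, fn=5)
```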

A. Gender Classification from First Name
Each classifier was trained with the character frequency and substring character features of the first name only, not including the surname. The model outputs were female and male scores. Support Vector Machine achieved the best performance, with an accuracy of 78.17%, followed by Multinomial Naïve Bayes (76.90%), Neural Network (76.09%), K-Nearest Neighbor (73.13%), and Random Forest (72.72%). The accuracy of the gender classification for first names is summarized in Table II.

B. Gender Classification from Alias Name
Each classifier was trained with word tokenization features from the alias name only. In this research, three gender groups were identified: female, male, and unclassified. Female and male names had a probability of over 60%, while unclassified names had a probability of less than 60%. The model outputs were female and male scores. Support Vector Machine achieved the best performance, with an accuracy of 69.45%, followed by Multinomial Naïve Bayes (66.04%), Neural Network (65.09%), Random Forest (61.30%), and K-Nearest Neighbor (56.41%). The gender classification accuracy for alias names is summarized in Table III.
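One possible reading of the 60% rule is sketched below: the gender with the higher score is assigned only when that score exceeds the threshold, and the name is otherwise left unclassified. This interpretation, the function name, and the treatment of a score of exactly 60% as unclassified are assumptions, since the text does not state how the boundary case is handled.

```python
def label_from_scores(female_score, male_score, threshold=0.60):
    """Assign 'female', 'male', or 'unclassified' from two gender scores."""
    candidates = (("female", female_score), ("male", male_score))
    best_label, best_score = max(candidates, key=lambda item: item[1])
    # Only commit to a gender when the winning score is above the threshold.
    return best_label if best_score > threshold else "unclassified"


example_label = label_from_scores(0.72, 0.28)
```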

C. First Name and Alias Name Classification
Each classifier was trained with word tokenization and speech part features. The purpose was to distinguish first names from alias names. The model outputs were first name and alias name scores. Multinomial Naïve Bayes was the best classifier, achieving an accuracy of 83.18%, followed by Support Vector Machine (82.44%), Neural Network (80.12%), Random Forest (78.11%), and K-Nearest Neighbor (74.10%). The accuracy of the first name and alias name classification is shown in Table IV.

E. Combining Classification Models
To determine the best final classifier, the performance of five classifiers was compared: K-Nearest Neighbor, Support Vector Machine, Random Forest, Multinomial Naïve Bayes, and Neural Network. We also compared the performance obtained from three different input sets: the combination of all four models, the combination of three models, and the combination of two models. Using the outputs of all the models as input gave the best accuracy, 91.75% with the Neural Network, followed by the combination of three models and the combination of two models, respectively. The accuracy of the combined model is summarized in Table VI. Fig. 10 summarizes the accuracy achieved per classifier for the three combination types. With the Neural Network, the combination of two models obtained an accuracy of 82.07%, the combination of three models achieved 83.46%, and the combination of four models achieved 91.75%.

V. CONCLUSION
Facebook data can be analyzed and studied for many purposes, but the data often lack user demographic details, which can cause the user information used in analysis or communication to be incorrect and to fail to match the target group. This study therefore aimed to classify gender from the username alone, so that user information can be used correctly and to greater benefit.
The present study demonstrated that Thai names have certain patterns that can reveal the owner's gender. The authors also presented a gender classification method for Thai Facebook usernames using a combined model. Instead of applying the same model to all features, related features were grouped and used in separate models, since first names and alias names have different characteristics. The output of each model was then used as input for the final model. The proposed features, including word tokenization, speech part classification, character frequency, and substring characters, achieved good results. The experimental results demonstrate that using word tokenization for all usernames achieved a baseline accuracy of 65.81%, whereas the combined model improved this to 91.75%.