Abstract—In the era of data and information, insight into user behavior, such as trends, is commonly used in real-time marketing to improve gross profit; it is therefore beneficial to know the trends in social media. Word tokenization and stop word lists are the conventional tools for keyword extraction. For Thai-language content on social media platforms, however, there are still no efficient word tokenization tools or stop word lists for extracting trends from platforms such as Facebook. In this research, we therefore propose an algorithm for Trend Keyword extraction that requires neither a word tokenization tool nor an external stop word list. The core idea is to use character n-grams, instead of word n-grams, to tokenize and process the text, and then to combine the n-grams into keywords. We then separate Trend Keywords from other keywords by using our algorithm to generate a stop word list, which is used to filter out stop words. For evaluation, human judges classify the retrieved Trend Keywords, and we compare them with the Trend Keywords from a baseline method. Our algorithm identifies more keywords than the baseline method. The precision of the generated stop word list is 97.6%, and the precision of the Trend Keywords is 40% when a stop word list generated from one month of data is used. With a stop word list generated from two months of data, the precision increases to 44%, at the cost of more processing time for generating the list.
Index Terms—Information retrieval, keyword extraction, social media mining, stop words.
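The core idea stated in the abstract, working on character n-grams so that no word tokenizer is needed, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names, the per-post counting, and the `min_count` and `stop_ratio` thresholds are all assumptions made for illustration, and the step that combines n-grams into full keywords is omitted.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text (no word tokenizer needed)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(posts, n):
    """Count in how many posts each character n-gram occurs."""
    counts = Counter()
    for post in posts:
        # set() so that an n-gram is counted at most once per post
        counts.update(set(char_ngrams(post, n)))
    return counts


def trend_candidates(current_posts, background_posts, n,
                     min_count=2, stop_ratio=0.5):
    """Hypothetical filter, not the paper's method.

    An n-gram is treated as a stop n-gram when it occurs in at least
    stop_ratio of the background posts (for example, an earlier month).
    Remaining n-grams that occur in at least min_count current posts
    are returned as trend candidates.
    """
    background = ngram_counts(background_posts, n)
    threshold = stop_ratio * len(background_posts)
    stop_list = {g for g, c in background.items() if c >= threshold}
    current = ngram_counts(current_posts, n)
    return {g: c for g, c in current.items()
            if c >= min_count and g not in stop_list}
```

Because Python strings are sequences of Unicode code points, the same functions can be applied directly to Thai text, although Thai vowel and tone marks are then treated as separate characters within an n-gram.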
The authors are with Chulalongkorn University, Thailand (e-mail: firstname.lastname@example.org, email@example.com).
Cite: Nattapong Ousirimaneechai and Sukree Sinthupinyo, "Extraction of Trend Keywords and Stop Words from Thai Facebook Pages Using Character n-Grams," International Journal of Machine Learning and Computing, vol. 8, no. 6, pp. 589-594, 2018.