Abstract—This paper presents the Standard Arabic Profiling (SAP) toolset, which helps researchers perform textual analysis and compare Arabic corpora. Because tools for the Arabic language are still needed, SAP is intended to simplify the textual analysis process. The toolset consists of three profilers. The Part of Speech (POS) profiler gives a statistical analysis of a given document. The vocabulary profiler gives the user an indication of the vocabulary used in a document with reference to the Open Source Arabic Corpus (OSAC) of two news agencies (CNN and BBC); it does so by computing the similarity between the document and the corpus using the log-likelihood measure. Lastly, the newly added readability profiler is used to 1) assess the readability level of a document according to the Flesch Reading Ease readability formula, and 2) measure the simplicity and ambiguity levels of the document. We describe the current part-of-speech profiler of the toolset and how its functionality can be extended to include vocabulary and readability profiling.
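The two measures named in the abstract can be illustrated with a minimal sketch. It is not the SAP implementation. It assumes the standard Flesch Reading Ease coefficients and the log-likelihood form commonly used to compare one word's frequency across two corpora; the abstract does not say which variant the toolset uses. The function names and the count-based inputs are hypothetical, and counting syllables in Arabic text is left to the caller.

```python
import math


def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease score computed from raw counts.

    Assumption: the counts are supplied by the caller; this sketch does
    not tokenize or syllabify Arabic text.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def log_likelihood(freq_doc, size_doc, freq_ref, size_ref):
    """Log-likelihood score for one word across two corpora.

    Assumption: this is the common corpus-comparison form, where the
    expected frequencies come from the pooled relative frequency.
    """
    if size_doc + size_ref <= 0:
        raise ValueError("corpus sizes must sum to a positive number")
    total_freq = freq_doc + freq_ref
    expected_doc = size_doc * total_freq / (size_doc + size_ref)
    expected_ref = size_ref * total_freq / (size_doc + size_ref)
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_doc > 0:
        score += freq_doc * math.log(freq_doc / expected_doc)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * score
```

A higher log-likelihood score means the word's frequency in the document differs more from its frequency in the reference corpus. A higher Flesch score indicates text that is easier to read.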
Index Terms—Arabic natural language processing, part-of-speech tagging (POST), text analysis, software.
Khalid M. O. Nahar and Malek Barahoush are with the Department of Computer Sciences, Faculty of IT and Computer Sciences, Yarmouk University, Irbid, 21163, Jordan (Corresponding author: Khalid M.O. Nahar; e-mail: firstname.lastname@example.org).
Ahmed F. Al Eroud and Abdallah M Al-Akhras are with the Department of Computer Information System, Faculty of IT and Computer Sciences, Yarmouk University, Irbid, 21163, Jordan.
Cite: Khalid M. O. Nahar, Ahmed F. Al Eroud, Malek Barahoush, and Abdallah M Al-Akhras, "SAP: Standard Arabic Profiling Toolset for Textual Analysis," International Journal of Machine Learning and Computing, vol. 9, no. 2, pp. 222-229, 2019.