How Different Are Full and Short Papers in Word-Usage?

Academic papers submitted to a conference are assessed by reviewers, who judge whether they deserve to be presented at the conference. The accepted papers are often classified into full papers, short papers, and other types, mainly according to the reviewers' assessment. The major aim of the study presented in this paper is to find tips that help improve a paper which would otherwise be classified as a short paper so that it is accepted as a full paper. In this study, we investigate a scenario for finding the differences between full and short papers in their usage of words, or terms. We then extract words that are characteristic of either full or short papers through an experimental study. To find these words, we introduce a couple of indexes for a word. The results suggest that practical tips can be obtained with this approach by refining the method.


I. INTRODUCTION
How to write highly assessed papers, so that they can obtain higher academic degrees and/or better reputations, is a major issue for researchers and graduate students in academic organizations. Academic papers submitted to a conference are reviewed, and the reviewers' assessment determines whether they deserve to be presented at the conference. The accepted papers are often classified into full papers, short papers, and other types, mainly according to the reviewers' assessment.
The major aim of the study presented in this paper is to find effective tips for improving a paper that would be evaluated as a short paper so that it is evaluated as a full paper. Among the various possible approaches, we take the approach of finding tips by analyzing sample papers, as the very first step toward our goal. More specifically, we investigate a scenario for finding the differences between full and short papers in their usage of words, or terms, and show some of the most discriminating words found in the analysis.
One of the contributions of this paper is to propose a couple of indexes of words that show how much they are used in full and short papers. Precisely, we define a function called FS-index (full-short-index) for a word. The FS-index is a ratio that represents how much the word is used in full papers in comparison with short papers. Furthermore, we define a function called μ-index (mixed index), which considers both the FS-index value and the popularity of the word.
Manuscript received December 9, 2019; revised April 18, 2020. T. Minami was with Kyushu Institute of Information Sciences (KIIS), 6-3-1 Saifu, Dazaifu, Fukuoka 818-0117 Japan (e-mail: minamitoshiro@gmail.com, ohura@kiis.ac.jp).
Even though this paper analyzes only word/term usage, the results suggest that we could obtain practical tips by refining the analysis method and investigating not only the differences in word usage but also the differences in organization and other features of the papers.
In a separate study, we proposed a method for discriminating full and short papers using the same data as in this paper. The proposed method performs better than discrimination by the number of pages [1].
We have also analyzed other types of data with a similar approach [2], [3]. From the free texts of students' retrospective evaluations, we found that students with a wide perspective on learning have better outcomes, i.e., examination scores, than those with a narrow perspective. We take the approach of introducing appropriate measuring indexes that reflect the specific needs of the problem, and of finding properties that are interesting and informative.
The use of citation data for assessing the importance of papers is a different approach from the one taken in this study. Nakatoh et al. proposed the concept of a focused citation index in that approach [4].
The rest of this paper is organized as follows: In Section II, we describe the target data for analysis, including the tools we use.
In Section III, we start by analyzing the total frequencies of words to see how the words are used in the target set of papers. Then, we analyze how the words appear in the full and short papers.
Finally, in Section IV, we summarize our discussions and results and show possible future directions.

II. THE TARGET DATA
The target data used in this paper are the academic papers presented in the session of "Area 2 - Information Technologies Supporting Learning" of the 9th International Conference on Computer Supported Education (CSEDU 2017) [5]. Among the 68 papers, 19 (28%) are full papers and the remaining 49 (72%) are short papers.
Full papers have a 12-page limit, whereas short papers have an 8-page limit. Up to 4 extra pages are allowed, if necessary, for an additional fee.
The numbers of pages of the actual papers vary from 5 to 13. Table I shows the numbers of full and short papers for each number of pages. We can see that 12 out of 19 (63%) full papers have fewer pages than the limit of 12, and 1 paper has more pages than the limit. Of the 49 short papers, 24 (49%) have fewer pages than the limit of 8, and 2 papers have more pages than the limit. We can also see that 6 full papers have page counts within the range for short papers. Therefore, we cannot determine whether a given paper is full or short from the number of pages alone.
The pre-processing of data consists of three steps: 1) Convert the papers from PDF format into text data with a PDF-to-text converter. 2) Format the text file so that it is suitable for analysis with KH Coder [6], i.e., add HTML-tag-like information. 3) Apply KH Coder and extract the frequency data of words, or terms.
In step (1), the information about the layout, the non-text parts such as figures and tables, and some other organizational information are eliminated. The remaining data contain only the text parts of the original papers. We use the obtained text data without editing.
KH Coder is free software for analyzing text data. It can deal with hierarchical data by using the HTML-like tags <h1> to <h5>. The <h1> tags mark the topmost groups of text data, the <h2> tags mark the next lower level, and so on. We can change the target range by specifying to KH Coder the part to be analyzed.
The file obtained in step (2) in this study has the following hierarchical organization: The topmost <h1> tag is used for grouping full and short papers. For example, "<h1>FullPapers</h1>" specifies that the full papers follow this tag.
The second-level <h2> tag is used as the header of each full or short paper. For example, "<h2>file:FCSEDU_2017_101_CR.pdf.txt </h2>" specifies the paper with the number 101. The "F" character after the string "file:" indicates that this paper is classified as a full paper. The content of the paper follows the "<h2>" tag. For example, the content of paper number 101 starts with the following string: "Personalized, Affect and Performance-driven Computer-based Learning Christos Athanasiadis, Enrique Hortal, Dimitrios Koutsoukos, ".
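The hierarchical file described above could be assembled along the following lines. This is a minimal sketch under stated assumptions, not the script used in the study: the function name `build_tagged_text`, the "ShortPapers" heading, the "S"-prefixed file name, and the short-paper text are hypothetical; only the "<h1>FullPapers</h1>" heading and the file name of paper 101 come from the description above.

```python
def build_tagged_text(full_papers, short_papers):
    """Return one string with <h1> group headers and <h2> per-paper headers.

    full_papers / short_papers: dicts mapping a file name to its plain text.
    """
    lines = []
    # "ShortPapers" is an assumed heading mirroring "FullPapers".
    groups = [("FullPapers", full_papers), ("ShortPapers", short_papers)]
    for heading, papers in groups:
        lines.append(f"<h1>{heading}</h1>")
        for name in sorted(papers):
            lines.append(f"<h2>file:{name}</h2>")
            lines.append(papers[name])
    return "\n".join(lines) + "\n"


# Toy input: the short-paper entry is made up for illustration.
full = {
    "FCSEDU_2017_101_CR.pdf.txt":
        "Personalized, Affect and Performance-driven Computer-based Learning",
}
short = {"SCSEDU_2017_999_CR.pdf.txt": "Text of a hypothetical short paper."}

tagged = build_tagged_text(full, short)
print(tagged)
```

The resulting string has one <h1> line per group and one <h2> line per paper, which matches the counts of tags reported for the real data (2 and 68).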
After step (3), i.e., by applying KH Coder, we obtain the statistical data of all the papers: the file contains 58,008 sentences, 47,168 paragraphs, 68 <h2> tags (the number of papers), and 2 <h1> tags (full and short).
By choosing the Tools>Words>Frequency List menu and choosing the options "By POS tags" and "Term frequency", we obtain the word frequency data by POS tags, which are used in the following analysis steps.

III. WORD USAGE ANALYSIS
In this section, we analyze the word-usage data and investigate the characteristic features for discriminating the full and short papers.
This section is organized as follows: In Section A, we start the analysis by investigating how the words used in the papers are distributed in terms of their frequencies in the full and short papers.
Then, in Section B, we define an index that characterizes whether and how much a word is used in full and short papers. We also investigate the distribution of this index with a histogram.
In Section C, we introduce another index called Pop(w) for measuring the popularity of words, and examine word usage with these two indexes together.
Finally, in Section D, we define an index for measuring the importance of words and show some of the most characteristic words for full and short papers.

A. Preliminary Analysis of Word-Usage as a Whole
We start by capturing what words/terms are used in the papers and, especially, how their usage differs between the full and short papers. Fig. 1 and Table II show how words are used in full papers and short papers in terms of frequency (the number of occurrences). A word is represented in the form "word:POS" so that it shows the part of speech in which the word is used in the paper. For example, "be:v" represents the verb "be" and "student:n" represents the noun "student".
As we can easily see, the verb "be" is used much more than any other word, and it is followed by popular nouns such as "student" and "they" in both full and short papers.
The underline in the list for full papers in Table II shows that the rank of the word is higher (a smaller ranking number) than in the list for short papers, and vice versa for the underlined words in the list for short papers. These words can be considered to represent, to some extent, the differences between the full and short papers. In particular, a word whose ranking numbers in full and short papers differ greatly may represent a characteristic difference between them. Comparing these words, we are inclined to think that words in short papers such as "system" and "datum" suggest that short papers deal more with specific topics, such as educational systems, than full papers do.
On the other hand, words in full papers such as "course" and "user" suggest that full papers tend to deal with topics from wider points of view, such as curriculums, educational frameworks, and students, who are the main participants in education.
International Journal of Machine Learning and Computing, Vol. 10, No. 4, July 2020
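The rank comparison behind the underlining can be sketched as follows. This is an illustrative sketch only: the helper `ranks` and the frequency counts are made up and are not the values of Table II.

```python
def ranks(freq):
    """Map each word to its 1-based rank by descending frequency."""
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}


# Hypothetical counts in the "word:POS" form used above.
full_freq = {"be:v": 900, "student:n": 300, "course:n": 120, "system:n": 80}
short_freq = {"be:v": 2000, "student:n": 500, "system:n": 400, "course:n": 150}

full_rank = ranks(full_freq)
short_rank = ranks(short_freq)

# Words that rank higher (smaller number) in one list than in the other.
higher_in_full = [w for w in full_rank
                  if w in short_rank and full_rank[w] < short_rank[w]]
higher_in_short = [w for w in short_rank
                   if w in full_rank and short_rank[w] < full_rank[w]]

print(higher_in_full, higher_in_short)
```

With these toy counts, "course:n" ranks higher in the full list and "system:n" ranks higher in the short list, which mirrors the kind of contrast discussed above.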

B. Analysis of Word-Usage of Nouns in Full and Short Papers
In this section, as well as in the rest of this paper, we mainly deal with nouns as the target words for analysis. Nouns are the most important part of speech (POS) because they carry the main idea, or the subject, of sentences. We now adjust the weight of word occurrences. Because the total numbers of word occurrences in the full papers and in the short papers differ, one occurrence of a word in a full paper and one in a short paper carry different weights among all word occurrences. Thus, we use the ratios of occurrences in full papers and in short papers.
For a word w, we define the ratio of w for full papers by
$$\mathrm{Fr}(w) = \frac{\#\,\text{occurrences of } w \text{ in full papers}}{\#\,\text{all occurrences of words in full papers}},$$
where the # symbol stands for the number of elements of the following set.
By this definition, Fr(w) is the share of the word w among all word occurrences in full papers.
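The ratio can be computed directly from token lists. The following is a minimal sketch, assuming each group of papers has already been reduced to a flat list of words; the function name `occurrence_ratios` and the toy tokens are hypothetical. Applying the same function to the short papers gives the analogous ratio for that group.

```python
from collections import Counter


def occurrence_ratios(tokens):
    """Return {word: occurrences of word / all word occurrences}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


# Toy token lists, made up for illustration.
full_tokens = ["student", "course", "student", "user"]
short_tokens = ["system", "student", "datum", "system", "system"]

Fr = occurrence_ratios(full_tokens)   # ratios for full papers
Sr = occurrence_ratios(short_tokens)  # ratios for short papers

print(Fr["student"], Sr["student"])
```

Because each ratio is normalized by the total for its own group, the values are comparable even though the two groups contain different numbers of word occurrences.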
We define Sr(w) in a similar way for short papers. The FS index FS(w) shows how much the word w is used in full papers relative to short papers. $\mathrm{FS}(w) > 0$ means that w is used more in full papers, and $\mathrm{FS}(w) = 1$ means that w is used only in full papers. Similarly, $\mathrm{FS}(w) < 0$ means that w is used more in short papers than in full papers.
Furthermore, $\mathrm{FS}(w) > 0$ iff $\mathrm{Fr}(w) > \mathrm{Sr}(w)$, $\mathrm{FS}(w) < 0$ iff $\mathrm{Fr}(w) < \mathrm{Sr}(w)$, $\mathrm{FS}(w) = 1$ iff $\mathrm{Fr}(w) \neq 0$ and $\mathrm{Sr}(w) = 0$ (the word w appears only in one or more full papers), and $\mathrm{FS}(w) = -1$ iff $\mathrm{Sr}(w) \neq 0$ and $\mathrm{Fr}(w) = 0$ (the word w appears only in one or more short papers). From these properties, we call a word w an F-word if $\mathrm{FS}(w) > 0$ and an S-word if $\mathrm{FS}(w) < 0$.
We can see that quite a few words have FS indexes close to either 1 or −1. Actually, 2,766 words have the value 1, i.e., they appear only in full papers. Among them, 1,804 (65%) words appear only once; thus, each of them appears in only one paper. For short papers, 6,028 words appear only in short papers, and 3,605 (60%) of them appear only once.
Among these words, the number of S-words is greater than that of F-words. However, dividing by the number of papers, the number of S-words per paper is 124, whereas that of F-words is 146. Thus, we could say that these values are almost the same for full and short papers.
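The sign and boundary properties of the FS index can be illustrated with a small sketch. The normalized difference used here, (Fr − Sr) / (Fr + Sr), is an assumption: it is one form that satisfies the properties listed in this subsection, and it is not confirmed to be the exact definition used in the study. The function names are hypothetical.

```python
def fs_index(fr, sr):
    """Assumed FS index: normalized difference of the two ratios.

    ASSUMPTION: (fr - sr) / (fr + sr) is chosen only because it satisfies
    the stated properties (sign follows Fr vs. Sr, range [-1, 1]).
    """
    if fr == 0 and sr == 0:
        raise ValueError("word occurs in neither full nor short papers")
    return (fr - sr) / (fr + sr)


def word_type(fr, sr):
    """Label a word as F-word, S-word, or neutral from its FS index."""
    fs = fs_index(fr, sr)
    if fs > 0:
        return "F-word"
    if fs < 0:
        return "S-word"
    return "neutral"


print(fs_index(0.02, 0.0))    # word used only in full papers
print(fs_index(0.0, 0.01))    # word used only in short papers
print(word_type(0.03, 0.01))  # used more in full papers
```

Under this assumed form, a word that occurs a single time in one full paper already reaches the value 1, which is why so many words pile up at the two ends of the range.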

C. Word-Usage Analysis with Two Indexes
As we have seen in the previous subsection, a word having an FS index of 1 does not necessarily characterize full papers. For example, a word that occurs only once has an FS index value of 1 if it happens to be used in a full paper. Likewise, if it happens to be used in a short paper, its FS index value becomes −1. Therefore, we cannot say that a word characterizes either full papers or short papers merely from the fact that its FS value is 1 or −1.
Thus, the characteristic words we are looking for should be chosen by also considering their popularity, or number of occurrences, so that the total frequency ratio is not very low. Let us define the popularity of a word w as the mean value of Fr(w) and Sr(w), i.e.,
$$\mathrm{Pop}(w) = \frac{\mathrm{Fr}(w) + \mathrm{Sr}(w)}{2}.$$
Fig. 3 shows a scatter diagram of the FS index and the popularity of words. We can see that the words "student", "they", "we", "it", "course", etc. are located close to the line FS index = 0, i.e., they are used in full and short papers in similar ratios, even though their popularity is higher than that of other words. In Fig. 3, many of the nouns that appear in Table II are located in the middle popularity area, with absolute FS index values in the range from 0.1 to 0.4. For example, the words "course", "user", and "study" appear in the right area of Fig. 3, i.e., where FS(w) > 0. Their ranks in full papers in Table II are 9, 11, and 14, respectively. Similarly, the words "learning", "system", "education", and "datum" appear in the left area. Their ranks in Table II are 4, 10, 11, and 12, respectively.
A characteristic word needs to satisfy the following two conditions: 1) It has a high absolute FS index value, because this value shows how specifically the word is used in full or short papers.
2) It also has a high popularity value, because a very small popularity means that the word appears in only a small number of papers, and thus it is quite possible that the word appears there merely by chance.
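The two conditions can be read as a simple filter over (FS index, popularity) pairs. The sketch below is illustrative only: the thresholds `min_abs_fs` and `min_pop`, the function names, and the sample values are hypothetical and are not taken from the study.

```python
def popularity(fr, sr):
    """Popularity of a word: the mean of its full and short ratios."""
    return (fr + sr) / 2


def is_characteristic(fs, pop, min_abs_fs=0.1, min_pop=0.001):
    """True if both conditions hold; both thresholds are assumed values."""
    return abs(fs) >= min_abs_fs and pop >= min_pop


# (word, FS index, popularity): toy values for illustration.
samples = [
    ("course", 0.30, 0.004),    # clearly one-sided and popular
    ("rareword", 1.00, 0.00001),  # one-sided but almost never used
    ("student", 0.02, 0.020),   # popular but used evenly
]
selected = [w for w, fs, pop in samples if is_characteristic(fs, pop)]
print(selected)
```

Only the first toy word passes both conditions; the second fails on popularity and the third on the FS index, which are exactly the two failure cases discussed above.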

D. Use of Mixed Index for Measuring the Amount of Characteristic Feature
In order to find the characteristic words that discriminate full and short papers more appropriately, we propose in this section an index that mixes the FS index and popularity. First, we consider the properties required of the mixing-up function, and then we take an example function and investigate what words we obtain from the target data.

1) Defining an index of word for measuring the amount of discrimination between full and short papers
Before defining a specific mixing-up function, we investigate the conditions required of such a function. An indexing function $\mu(i, p)$ for mixing up an FS index value $i$ and a popularity $p$ should satisfy the following properties. (iii) The absolute value is strictly increasing with respect to $p$, i.e., $|\mu(i, p_1)| < |\mu(i, p_2)|$ if $p_1 < p_2$. In this paper, we define the mixing-up function by using multiplication as follows, which we call the m-index: for an FS index $i$ and a popularity $p$,
$$\mu(i, p) = i\,p.$$
Note that some functions of the form $\mu(i, p) = i^{\alpha} p^{\beta}$ also satisfy the three conditions shown above; e.g., when $\alpha = 3$ and $\beta =$ [...].
Most words are located in the central part, where the absolute values are small. At the same time, some 500 words are located near 1 as well as near −1. These words are considered to be the characteristic words for full and short papers, respectively.
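The multiplicative m-index and the monotonicity condition (iii) can be checked numerically with a short sketch. The function name `m_index` and the sample values are hypothetical; a nonzero FS index is assumed in the check, since the product is 0 for every popularity when the FS index is 0.

```python
def m_index(i, p):
    """m-index: product of an FS index value i and a popularity p."""
    return i * p


# Check condition (iii) for a fixed, nonzero FS index and growing popularity.
pops = [0.001, 0.002, 0.010]          # toy popularity values
for i in (0.3, -0.5):                 # toy FS index values
    magnitudes = [abs(m_index(i, p)) for p in pops]
    increasing = all(a < b for a, b in zip(magnitudes, magnitudes[1:]))
    print(i, magnitudes, increasing)
```

The sign of the product follows the sign of the FS index, so positive values point to full papers and negative values to short papers, while the popularity scales the magnitude.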

3) Highly characteristic words
The words having large absolute m-index values discriminate full papers and short papers more than other words do. Among them, the words having positive m-index values are those that characteristically appear in full papers, which we call F-words, and we call those having negative values S-words. Table III shows the list of the top 30 F-words and S-words. From the words in the list, we find here again that F-words contain generic words concerning the study and lectures of university students, such as (together with their ranks) "student/2", "user/3", "study/4", "knowledge/7", "course/11", "classroom/14", "teacher/16", "participant/17", "question/20", "response/22", etc.
On the other hand, S-words contain words referring to educational systems and experimental results, such as "datum/2", "system/3", "education/4", "result/6", "game/7", "video/8", "environment/9", "technology/12", "experiment/18", "evaluation/22", etc. Roughly speaking, most of these words relate to educational systems and experiments in educational situations. These words might reflect that short papers pay more attention to specific systems and experiments than full papers do, as we have recognized in the previous sections.
According to these findings, we may summarize that short papers deal more with specific systems, experiments, and their results than full papers do, whereas full papers discuss more philosophical aspects, models, and other topics on a more theoretical basis. Deeper analysis is needed for more precise findings and conclusions.
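Producing ranked lists like those in Table III amounts to sorting words by their m-index. The following is a minimal sketch; the function name `top_words` and the m-index values are made up for illustration and do not reproduce the table.

```python
def top_words(m_values, k):
    """Return (top-k F-words, top-k S-words) ranked by m-index magnitude."""
    f_words = sorted((w for w, m in m_values.items() if m > 0),
                     key=lambda w: -m_values[w])[:k]
    s_words = sorted((w for w, m in m_values.items() if m < 0),
                     key=lambda w: m_values[w])[:k]
    return f_words, s_words


# Hypothetical m-index values (positive: full papers, negative: short papers).
m_values = {
    "student": 0.004, "user": 0.003, "study": 0.002,
    "datum": -0.005, "system": -0.004, "game": -0.001,
    "it": 0.0,
}
f_top, s_top = top_words(m_values, 2)
print(f_top, s_top)
```

Words with an m-index of exactly 0 belong to neither list, and the parameter k plays the role of the cut-off of 30 used for the table.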

IV. CONCLUDING REMARKS
In this paper, we investigated the different features of full papers and short papers based on an analysis of the usage of words, or terms. Our tentative conclusion is that short papers pay more attention to specific experiments and their results than full papers do, whereas full papers are more concerned with educational discussions from general points of view.
Our objectives include not only finding differences and extracting valuable tips for improving the quality of papers, but also finding effective analysis methods. In this paper, we proposed indexes for measuring the differences between full and short papers in terms of word usage.
Our target data for analysis are small data [7] rather than big data, because we aim to find domain-specific tips rather than generic knowledge applicable to a wide variety of domains. Thus, we intend to investigate more deeply the tools for analysis and useful findings in education according to our study approach.
The analysis carried out in this paper is just the very first stage toward our goal of finding more effective and practical tips for writing more sophisticated papers. We understand that it is quite difficult to find tips for writing better papers from word usage alone.
In order to obtain satisfactory results, we need to investigate the papers more deeply, including the following topics:
1) In this paper, we used Fr and Sr for weighting the importance of words. We may use other measures such as TF-IDF [8] instead of Fr/Sr, which is a candidate for our future studies. We used FS(w) for classifying words w in this paper. For this purpose we may use SVM [9], [10], which is also a candidate for future studies.
2) Analysis not only of the usage of specific words but also of the usage of types of words, i.e., which features of word usage are more full-paper oriented and which are more short-paper oriented.
3) In this paper, we first took all the words regardless of their parts of speech (POS), and then took only nouns for specific analysis. The usage of words with other parts of speech should also differ between full and short papers. We need to analyze further by considering other types of POS.
4) Organizational analysis is another important topic for investigation. The organization should differ between full and short papers. Analyses of organizational differences, such as the structure of contents, the layout of figures and tables, and the use of mathematical formulas, are other important topics toward our eventual goal.