How to Fight against SMS-Spam: Structural Approach and Results

Although the number of different uses for mobile-data networks has grown rapidly, the short message service (SMS) remains the primary message-exchange method. SMS is still necessary because it provides several advantages, including a small monetary cost per transmission and greater security compared to online social networks (OSNs). Because of its popularity, SMS can also be used to pursue malicious economic intents such as phishing and illegal advertising, or to distribute unwanted messages to numerous phone users. In this paper, we explore the effectiveness of a social structural approach to SMS-spam detection. To this end, we introduce a methodology that expands SMS networks from small SMS datasets into social networks, based on real-world datasets and a possible SMS-spam attack. We also verify the detection effectiveness of our approach by conducting experiments.


I. INTRODUCTION
The short-message service (SMS) is a communication method that sends messages of a limited size and is mostly transmitted over mobile networks (e.g., mobile phones). As the widespread use of mobile phones has expanded globally, the cost of SMS continues to decrease. As a result of the ease and low cost of using SMS, it has become the most widely-used communication service, followed by voice communications.
According to the International Telecommunication Union (ITU), 5.3 billion active mobile cellular users around the world sent 1.8 trillion SMS messages (approximately 200,000 SMS messages per second) in 2010 [1]. The worldwide penetration of mobile broadband services [2] is further stimulating the use of SMS, and 18.7 billion texts were sent worldwide every day (not including app-to-app messaging) in 2017 [3]. Even though email and messaging services such as Twitter are commonly used in everyday life due to the explosive growth of smart-device use, SMS is still a major method for mobile communication. From a research perspective, SMS has not been sufficiently investigated relative to similar Internet messaging services (e.g., Twitter and email); as is discussed in [4], Twitter comprises less than 1 % of the world's communications, but accounts for more than 75 % of the research regarding short-message communications.
Giseop Noh is with Cheongju University, Cheongju-si, Chungcheongbuk-do, South Korea, 28503 (e-mail: kafa46@gmail.com).
Spam refers to unsolicited and/or unwanted messages that are transmitted to a large number of recipients with a malicious intent [4], [5], including economic motives, phishing, and scamming. Spam frequently appears in various online communications including email, online social networks, SMS, blogs, and online guest rooms. SMS spam is effective from the attacker's perspective for the following reasons: (i) SMS spam is not cost-prohibitive because the increase in the use of SMS leads to a reduction in message costs (below USD $0.001 in China and a cost-free service in other countries) [6]. (ii) SMS messages are often personal, privileged, and private when compared to other online short-message services [7], so users tend to trust incoming SMS messages and are more likely to open them than email or other online communications. (iii) SMS spammers can more easily acquire target accounts (the targeted telephone numbers), as they simply need to enumerate all of the numbers from a finite phone-number space, or search for phone information on the Internet; specifically, 20 % to 30 % of SMS traffic in China and India is SMS spam [8], while 30 % of SMS messages in the Asia region have been classified as spam [6]. (iv) People tend to use acronyms when writing SMS messages, so the abbreviations used by SMS users are not standard for their language; this is a problem for filtering because fewer words mean that there is less information to work with, and such linguistic variability produces a greater number of terms or features with a sparser representation [9].
So far, methods to detect email spam have been the center of attention for the research community, whereas SMS spam has rarely been considered. The available defense approaches against email spam include blacklisting, address management, and content-based filtering. A content-based approach has been proven to be an effective solution [10], and several corresponding research papers have been published [9]-[13]. The challenge of SMS-spam filtering in comparison with email-spam filtering, however, is the nature of SMS text: the messages are brief, informal characters are used, and less header information is available.
Recent studies have explored more advanced spam filtering methods on online social networks (OSNs). OSNs reflect real social interactions between users and have a unique structure that includes small-world behavior, clustering, and sparse cuts between the clusters.
In addition to content-based methods, spam filtering through the use of network structures has also received much attention; however, spam filtering based on the OSN structure also faces difficulties as a result of the limited network information for SMS due to the highly private nature of SMS activity. S. J. Delany et al. summarized possible spam-filtering approaches in [8], whereby a great amount of attention is paid to content-based SMS-spam filtering.
SMS-spam filtering has not been fully explored, and it is challenging to design effective spam filters since it is difficult to collect SMS datasets to analyze their graph structures. We therefore believe that the use of social networks provides a meaningful solution. In this paper, we explore reasonable methodologies to address the question, "Can we fully exploit the advantages of the nature of social networks to detect SMS spam?" To answer this question, we propose a structural approach to configure and expand a social-network-based SMS dataset. The main contributions of this paper are summarized as follows:
First, we created a social network for SMS messages and SMS spammers to analyze the characteristics of the network using a real-world dataset and sociology theory. Through our SMS-network-building approach, we provide a framework to build and configure SMS networks as one of the guidelines.
Secondly, we propose a social-network-based spam-detection approach. We believe that our approach can be the first trial and can serve as a baseline concept to improve social-network-based SMS detection.
The remainder of this paper is organized as follows: We review previous works related to SMS spam-detection methodologies in Section II, including content-based and social-network-based approaches; in Section III, we present the manner in which we built the SMS network and tackled the privacy obstacle regarding SMS datasets; in Section IV, we propose our spam-detection approach, which includes a social-network-based detector, and describe a possible SMS-spam attack (called One2N); in Section V, we explain how the proposed approach works by providing the results of experiments; and Section VI concludes our paper.

A. Spam Filtering
Much of the existing research on spam filtering has focused on protection techniques for email [14], Twitter [4], [15]-[20], Facebook [21], and the Internet [22], including white and black listing; digital signatures; postage control; address management; collaborative, content-based filtering [23]; and social-network-based filtering. Specifically, most of the spam- or spammer-detection approaches for social networks involve the content-based approach [21].
From email spam to SNS spam, a user's information is treated as the most valuable feature for spam filtering. The Naïve Bayesian filter and SVM (Support Vector Machine) are popular approaches in spam-filtering research, as they are common, well-known machine-learning algorithms and have shown a superior performance compared to other approaches.
The authors of [15] suggested graph-based features such as the in-degree, out-degree, and user-reputation level for a micro-blogging service like Twitter, in addition to content-based features such as duplicate tweets, HTTP links, replies and mentions, and topics. For online voting systems, Benevenuto et al. explored YouTube.com to detect spammers who try to increase the reputations of malicious movies by posting a series of responses, and they exploited video attributes (ratings), user attributes (activities), and social-network metrics (clustering coefficient and betweenness) [24]. To enhance the performance of spam detection, several approaches build social-network-based approaches on top of content-based schemes [14], [25], [26]. Using network spam-filter features (in- and out-link, cross-link, etc.) from 12 million Web pages, the authors of [22] tested various network features for an improved classification performance.
However, the previously mentioned social-network-based approaches, unlike our trial, do not address SMS-spam filtering at all.

B. SMS-Spam Filtering
SMS-spam filtering is a relatively new task that inherits many issues and solutions from email-spam filtering; however, it poses its own specific challenges due to the brief length of the messages. The Naive Bayes algorithm, pattern-matching algorithms, evolutionary algorithms, Logistic Regression (LR), Dynamic Markov Compression (DMC), and SVM, among others, can be used in the SMS-spam-detection field, but the performance of traditional content-based filters may be seriously degraded as the level of ambiguity increases. Since SMS messages are fairly short, with at most 160 characters, and their text is generally rife with idioms and abbreviations [6], it is difficult to adopt traditional email-spam filters without any kind of modification.
The authors of [9] first studied the possibility of applying Bayesian filtering techniques to the problem of SMS-spam filtering, whereby the Bayesian filtering techniques that are used to block email spam were extended. Qian Xu et al. also utilized SVM-classifier and k-nearest-neighbor (k-NN) algorithms with content-less features for SMS-spam detection. They showed that temporal features and network features, including the number of recipients and the clustering coefficient (CC), can be effective compared to conventional static features [10].
Ref. [27] considered the problem of content-based spam filtering, whereby the technique checks enough features in short spam messages (i.e., mobile (SMS) communication, blog comments, and email-summary information) to distinguish them from non-spam messages in a low bandwidth client. Their purpose is to examine the transferability of successful email filtering techniques to very short messages.
Liu and Wang first considered an effective online SMS-spam filtering application based on individual classifiers with the same weight; however, the authors relied only partially on SMS messages from Chinese volunteers [28], and they extracted email-body text and split it into sentences for a pseudo SMS (PSMS) collection. Ningning Wu et al. implemented mobile, parallel real-time monitoring and filtering with a multi-core software platform for SMS. They combined Pinyin Fuzzed Keyword Matching Technology with a dynamic adjustment of the user's credit grade based on the keyword dictionary of Bayesian Learning [29].
International Journal of Machine Learning and Computing, Vol. 10, No. 1, January 2020

A. How to Overcome the Difficulty of SMS Networks
SMS differs from OSNs such as Twitter and Facebook in the following two key ways [7]: (i) SMS is typically a private communication between two (or more) persons who trust each other. (ii) Although SMS and Twitter messages ("tweets") are similar in terms of their message length, SMS messages are more likely to be brief.
Unlike Twitter's open API that provides access to the platform's public messages, SMS communication is highly private. Due to SMS privacy issues, it is difficult to collect datasets from users. One possible approach is to collect SMS messages from volunteers; however, the senders of the volunteers' received messages have not given their consent in this case. A publicly released dataset can therefore only contain those SMS messages sent by the volunteers, which leads to only restricted message contents and sender-receiver networks.
There are several research studies for which SMS datasets were collected. In our paper, we exploit the SMS network that was gathered and publicly released by National University of Singapore (NUS). The NUS dataset contains 42,140 English and 31,205 Chinese SMS messages from a collection period between 2004 and April 2014; this dataset appeared in [7], [30], [31], but unfortunately, it does not distinguish the spam messages. To explore the possibility of spam detection using social networks, we needed to adopt another dataset for which the spam SMS messages have been clearly classified. We subsequently cover the details of overcoming this obstacle in a step-by-step manner.
Due to the difficulty of obtaining SMS datasets, we first used the real-world (NUS) dataset as a seed structure for constructing an SMS-message network using social-network theories that are as realistic as possible. We believe that the proposed artificial construction of an SMS network is one of the best solutions to explore the unseen structure of the unknown SMS world. From the seed structure, we needed to expand the SMS network; however, we did not have any clues as to what the exact nature of the SMS network was. To tackle this problem, we selected and analyzed the most similar structure among the different social networks. We selected the Twitter dataset since Twitter resembles SMS technology in terms of message length. In summary, we generated SMS networks according to the following steps:
Step 1. Select a baseline network: Select one real-world SMS dataset (NUS dataset in this paper) as a seed network (structure). The seed network is required to expand its network size (the number of nodes and edges) according to network-expansion rules.
Step 2. Analyze a reference network: Analyze a known message-exchange network as a reference network to expand the seed network. We selected a Twitter message network as the reference network due to the commonality regarding the message-length limit. We derived the expansion rules from analyzing the reference network. In this paper, we use the power law exponent as an expansion rule.
Step 3. Expand the baseline network: Expand the SMS network by exploiting the characteristics of the reference network.
We elaborate on each of these steps in the following three sub-sections.

B. Selecting a Baseline Network (Step 1)
As we discussed in the previous sub-section, we selected the NUS dataset as our seed network. In terms of basic characteristics, the NUS dataset contains 51,654 English SMS messages with the corresponding time stamp, country, phone model, source ID, destination ID, message body (text), and message profile. We selected and gathered the information regarding the source ID, destination ID, and message body, as these data are essential for the construction of our baseline SMS network; we filtered out the messages without a source ID. For brevity, let the source ID, destination ID, and message body be sender, receiver, and text, respectively, in the remainder of this paper. Lastly, we extracted the sender, receiver, and text from the original NUS dataset, resulting in 40,077 messages comprised of 60 senders and 2,409 receivers.
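The extraction described above can be illustrated with a minimal sketch. It assumes that each record has already been reduced to a (sender, receiver, text) tuple and that a missing source ID appears as an empty string or None; the function name `build_baseline_network` and the toy records are hypothetical and are not part of the NUS dataset tooling.

```python
from collections import defaultdict


def build_baseline_network(records):
    """Build a directed sender -> receivers map from (sender, receiver, text) tuples.

    Records without a sender (source ID) are dropped, mirroring the
    filtering step described in the text.
    """
    graph = defaultdict(set)
    for sender, receiver, _text in records:
        if not sender:  # assumption: a missing source ID is "" or None
            continue
        graph[sender].add(receiver)
    return dict(graph)


# Toy records (hypothetical, for illustration only).
records = [
    ("s1", "r1", "hi"),
    ("s1", "r2", "ok"),
    ("", "r3", "no source id"),
    ("s2", "r1", "see you"),
]
network = build_baseline_network(records)
```

Each key of `network` is a sender and each value is the set of receivers that the sender has messaged at least once, which corresponds to one directed social link per sender-receiver pair.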

C. Analyzing a Reference Network (Step 2)
Since we do not know the complete structure of the SMS social network, and given that a sound term representation is one of the most important parts, we need to accept that SMS messages do not have the same structure and characteristics as those of email or other previous short-message formats; therefore, we first analyzed the Twitter network including spam messages.
Twitter and SMS share message-length brevity: their messages are limited to 140 and 160 characters, respectively [7]; we selected the Twitter dataset because of this similarity. In our paper, we focus on the characteristics of an expanded SMS network that meets the following two conditions: (i) It resembles a large and popular social network that embeds a message-exchange function and contains a portion of spam messages. (ii) It is similar to large and popular social networks so as to reflect the real world as much as possible. Lastly, we believe that the Twitter social network is one of the clues for inferring an SMS network through an analysis of the network characteristics.
We chose a Twitter dataset (http://twitter.mpisws.org/links-anon.txt.gz) for use in this paper that was generated by M. Cha et al. and contains 1,963,263,821 social links [17]. With reference to spam, the list of spammers in the Twitter dataset was provided by S. Ghosh et al. in [18] and contains 41,352 spammer accounts; also, Kwak et al. conducted a quantitative study with a very large Twitter network (41.7 million users, 1.47 billion social links, and 106 million tweets) [19]. Java et al. showed that the out-degree exponent of their Twitter dataset is 2.4, and that OSNs and human-contact networks are "scale-free networks" that show a "small-world phenomenon" [20]. It has been proven that social links (degree distribution) follow a power-law distribution [32]-[34]. Additionally, Meng Jiang et al. focused on Twitter attacks in [35], whereby they searched for groups of accounts (spammers) that were used to unfairly bolster the popularity of their customers.

D. Expanding the Baseline Network (Step 3)
Using the baseline network (NUS dataset), we generated a unidirectional social link if an SMS record existed between a sender and receiver (refer to the upper-left graph in Fig. 1). Since the baseline network is not enough to reflect the complete nature of the entire SMS network, we expanded the baseline network. From Step 2 ("Analyzing a reference network"), we chose the value of the power law exponent as our expansion rule.
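The expansion step of this sub-section (drawing a per-node edge count from a power law with exponent 2.4, then attaching the new edges either at random or preferentially) can be sketched as follows. This is only an illustration under stated assumptions: the lookup table is replaced by an inverse-CDF draw over the discrete distribution P(k) ∝ k^(-2.4), the cap `k_max` and the "in-degree plus one" weighting are choices made here, and the names `power_law_cdf`, `sample_edge_count`, and `expand_network` are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def power_law_cdf(exponent=2.4, k_max=50):
    """Cumulative distribution of P(k) proportional to k**(-exponent), k = 1..k_max."""
    weights = [k ** -exponent for k in range(1, k_max + 1)]
    total = sum(weights)
    return [c / total for c in accumulate(weights)]


def sample_edge_count(cdf, rng):
    """Draw u in [0, 1) and look up the matching edge count in the CDF."""
    u = rng.random()
    return min(bisect_left(cdf, u), len(cdf) - 1) + 1


def expand_network(seed_edges, exponent=2.4, preferential=True, seed=0):
    """Return the seed edges plus new directed edges (needs at least two nodes)."""
    rng = random.Random(seed)
    edges = set(seed_edges)
    nodes = sorted({n for edge in edges for n in edge})
    cdf = power_law_cdf(exponent, k_max=min(50, len(nodes) - 1))
    in_degree = {n: 0 for n in nodes}
    for _src, dst in edges:
        in_degree[dst] += 1
    for src in nodes:
        k = sample_edge_count(cdf, rng)
        candidates = [n for n in nodes if n != src]
        # Preferential attachment: popular receivers are more likely targets.
        weights = [in_degree[n] + 1 for n in candidates] if preferential else None
        for dst in rng.choices(candidates, weights=weights, k=k):
            if (src, dst) not in edges:
                edges.add((src, dst))
                in_degree[dst] += 1
    return edges


seed_edges = {("a", "b"), ("a", "c"), ("d", "b")}
expanded = expand_network(seed_edges)
```

Passing `preferential=False` gives the random-attachment variant, in which every other node is an equally likely target.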
Since we only have the messages sent by the research volunteers, the baseline data alone were not sufficient to generate a complete SMS-message network. To overcome this shortcoming, we exploited the observations from the spammed Twitter network. We expanded the original NUS SMS network by increasing the edges via two methods (random and preferential attachments [34]). In the random attachment, we added edges in the following manner: (i) Randomly choose a number between 0.0 and 1.0. (ii) Select the number of nodes (by looking up Table I) to be connected for each node. (iii) Connect all of the nodes to their selected nodes. In our experiment on the random attachment, 2,586 edges were added. Additionally, we generated edges based on the preferential attachment (PA) model [36]. We set the power law distribution with an exponent of 2.4, since Step 2 showed that the out-degree exponent of the Twitter dataset is 2.4, and generated the number of edges for each randomly chosen node, whereby we borrowed the idea of edge generation from [37]. The SMS network generated via the PA model is reported in the upper-right graph of Fig. 1.
*PA: Preferential attachment is a way of generating networks. In this paper, we followed the network generation of [34].

IV. SPAM DETECTION

A. Detection Metric
To exploit the social-network characteristics, we explored the network metrics. The possible metrics that exist include betweenness centrality (BC), in-degree, and out-degree; BC is given by [38]. Another metric is to represent a graph by using the clustering coefficient (CC). CC is the fraction of the pairs of node n's friends that are connected to each other by edges. In this paper, we use CC as our basic social-network metric since CC is reported as a possible metric in spam-filtering techniques [16], [39]. We leave the exploration of other metrics for our future works.
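The CC of a specific node n is given by [40] in (1):
$$C_n = \frac{2 e_n}{\deg(n)\,(\deg(n) - 1)} \qquad (1)$$
where $e_n$ and $\deg(n)$ are the number of real edges between the neighbors of node n and the number of neighbors of node n, respectively.
Therefore, for the global structure analysis, the average CC can be computed as (2):
$$\bar{C} = \frac{1}{|V|} \sum_{n \in V} C_n \qquad (2)$$
where V is the set of nodes in the network.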
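As an illustration of the metric, the following minimal sketch computes the CC of a node and the average CC of a graph. It assumes the standard local clustering coefficient on an undirected view of the network (the fraction of neighbor pairs that are themselves connected) and assigns a CC of 0 to nodes with fewer than two neighbors; the function names and the toy adjacency map are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, n):
    """Fraction of pairs of n's neighbors that are directly connected."""
    neighbors = adj.get(n, set()) - {n}
    degree = len(neighbors)
    if degree < 2:
        return 0.0  # assumption: CC is taken as 0 when no neighbor pair exists
    linked_pairs = sum(
        1 for u, v in combinations(neighbors, 2) if v in adj.get(u, set())
    )
    return 2.0 * linked_pairs / (degree * (degree - 1))


def average_clustering(adj):
    """Mean CC over all nodes of the graph."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


# Toy undirected graph: a triangle a-b-c plus a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cc_a = clustering_coefficient(adj, "a")
cc_avg = average_clustering(adj)
```

In this toy graph, node a has three neighbors and only one of the three neighbor pairs (b, c) is connected, so its CC is 1/3, whereas b and c each have a CC of 1.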

B. Spammer Characteristics
Reference [16] found that spammers have a low CC in an email network because spammers send emails to randomly selected recipients. Adopting social theories such as those of [41], we believe that the finding of [16] carries over to SMS, that is, that spammers have a lower CC than normal SMS users. Since a spammer has fewer social relations than normal users, the CC of the spammer is also lower than those of normal users.

C. Spam Detection
Using the CC values to detect spam, we devised a detection approach based on the mean and on upper and lower thresholds. After our detector decides that a message is spam and that the corresponding sender is a spammer, it provides an "attack" notice to users. The spam decision rules are as follows:
Input:
μ: the mean CC of the receivers of a sender
γ: the mean CC of the rest of the receivers
α: the threshold value
Decision: If γ is in the range [μ(1-α), μ(1+α)], the messages are normal (the sender is a normal message sender). Otherwise, the messages are spam (the sender is a spammer).
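The decision rule can be written as a short function. This is a minimal sketch that takes μ, γ, and α as already-computed numbers; the function name `is_spammer` is hypothetical, the default α = 0.35 is the value used in the experiments of this paper, and the example inputs reuse the average CC values reported there purely for illustration.

```python
def is_spammer(mu, gamma, alpha=0.35):
    """Apply the decision rule: normal if gamma lies in [mu*(1-alpha), mu*(1+alpha)].

    mu    -- mean CC of the receivers of the sender under test
    gamma -- mean CC of the rest of the receivers
    alpha -- allowed relative deviation (threshold)
    """
    lower = mu * (1.0 - alpha)
    upper = mu * (1.0 + alpha)
    return not (lower <= gamma <= upper)


# Receivers with a mean CC of 0.323 against a background mean of 0.612.
flagged = is_spammer(mu=0.323, gamma=0.612)
# A sender whose receivers look like the rest of the network.
normal = is_spammer(mu=0.60, gamma=0.612)
```

With α = 0.35 and μ = 0.323, the accepted range is roughly [0.21, 0.44], so a background mean of 0.612 falls outside it and the sender is flagged.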

A. Attack Scenarios
We hold two assumptions regarding the construction of the attack type: (i) If a spammer sends a single message to multiple receivers, the possibility of detection by defenders is greater. (ii) A spammer knows the various spam-detection algorithms (even though nearly all of the detection approaches are only valid for email networks) and can adapt to them, for example by varying either the number of SMS messages or the number of spammers. We define a spammer's spam strategy as follows:
1 to N (One2N) spam: A spammer sends SMS spam to N receivers via a sender. The spammer selects N receivers at random and only one type of spam message is sent.
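A One2N attack as defined above can be simulated with a few lines. The sketch below assumes that the network is available as a collection of node identifiers and that receivers are drawn uniformly at random without replacement; the function name `one2n_attack` and the node labels are hypothetical.

```python
import random


def one2n_attack(nodes, n_receivers, spammer="spammer", seed=0):
    """Return the spam edges of one spammer who targets N random receivers."""
    rng = random.Random(seed)
    candidates = sorted(set(nodes) - {spammer})
    receivers = rng.sample(candidates, n_receivers)  # uniform, no repeats
    return [(spammer, receiver) for receiver in receivers]


# Toy network of 100 users; the spammer reaches N = 10 of them.
users = [f"user{i}" for i in range(100)]
spam_edges = one2n_attack(users, n_receivers=10)
```

Because the receivers are sampled from the whole node set, the spam edges tend to land in different local communities, which is the property the detector relies on.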
One of the main features of spammers is that they have a large number of message receivers; however, it is also possible for normal users to send a large, or "heavy," SMS load. For instance, the organizer of a business conference will send a large number of SMS messages to the attendee list. In this case, the organizer's behavior is similar to that of a spammer in terms of SMS-message quantity. We applied the term "normal heavy sender" (NHS) for this type of non-spam heavy message sender who has many receivers.
Let the NHS set be $h_n$, so that n is the number of the edges of node n. We assign the level of randomness using R and it is defined as follows: R = a ratio allocating the fraction of the receiver's sent-message number from $h_n$.
Let the set of randomly selected receivers be $r_n$. Let the total number of nodes in the SMS be N, and let the number of receiver nodes from $h_n$ be $h_n$; for example, if R = 0.

B. Experiments on One2N attack
An example of a One2N attack is portrayed in Fig. 3. The color coding follows Table II, and these color rules are applied to all graphical representations of the attack scenario (One2N). For comparison with a situation in which normal messages are sent from NHSs, we generated a graph with one NHS and R = 1.0. As we can see from Fig. 2 and Fig. 3, the edge patterns differ from each other: One2N attacks form edges into different local communities, whereas the NHS edges belong to the one community that the NHS itself belongs to. In other words, NHSs send SMS messages to known people, while spammers send messages to unknown members of various communities.
We conducted experiments with 8 NHSs whose neighbor (receiver) numbers are greater than or equal to 50. Our spam detection rule (SDR) classified all of the 8 NHSs as normal nodes with α = 0.35. Note that α is the controlling parameter that indicates how much variation can be allowed around the mean value of the non-spamming receivers (refer to Section IV-C).
To monitor the robustness of the SDR against spam attacks, we simulated the One2N attack using different N values, where $10 \le N \le 500$ and the interval resolution is 10; note that N represents the number of spammed receivers. We calculated the average CCs of the spammed nodes and normal nodes, and the average results are 0.323 for the former and 0.612 for the latter. The averaged CC distribution of the One2N attack is portrayed in Fig. 4. As we can see in Fig. 4, the CC values of the receivers that were spammed by one spammer increase as N grows, whereas the CC values of the normal receivers are stable. The SDR identified all of the One2N cases as attacks regardless of the N value.

VI. CONCLUSIONS
From the outset of this study, we tried to answer the question, "Can we fully exploit the advantages of the nature of social networks to detect SMS spam?" To answer this question, we proposed an approach for generating a reasonable social network to test SMS-spam detectors. We also proposed a social-network-based SMS-spam-detection approach; additionally, we introduced a possible type of SMS-spam attack (called One2N) that includes the random nature of spammers. In simulations of social-network-based spam detection against spam attacks with various numbers of receivers, our approach detected all of the spam attacks. Therefore, we argue that the answer to our question is "Yes," since the detected difference between spam and normal conditions is notable and meaningful.