An Efficient Intelligent Cache Replacement Policy Suitable for PACS

This work proposes an efficient intelligent cache replacement policy suitable for picture archiving and communication systems (PACS). By combining the support vector machine (SVM) with the classic least recently used (LRU) cache replacement policy, we create a new intelligent cache replacement policy called SVM-LRU. Unlike conventional cache replacement policies, which depend solely on the intrinsic properties of the cached items, our PACS-oriented SVM-LRU algorithm mines medical data to identify the variables that affect file access probabilities. The SVM algorithm is then used to model the future access probabilities of the cached items, thus improving cache performance. Finally, a trace-driven simulation experiment shows that the SVM-LRU cache algorithm significantly improves PACS cache performance compared with conventional cache replacement policies such as LRU, LFU, SIZE, and GDS.


I. INTRODUCTION
A PACS is a computer application system dedicated to the processing, storage, and transmission of medical images. Image information collected by CT, CR, DSA, MRI, gastrointestinal, ultrasound, endoscopy, and other imaging equipment is digitized and transferred over the network to a server for classification and storage, so that image terminals can retrieve it quickly when needed. For instance, the PACS in Yancheng 1st People's Hospital has an annual data update rate of 20%. The growth of such massive data poses numerous challenges to storage systems. Storage systems have recently been evolving toward high performance, high capacity, and low cost. However, no single storage medium, such as a hard disk drive (HDD) or a solid state drive (SSD), can meet all of these requirements, because of the restrictions of its inherent characteristics. Hybrid storage is therefore an efficient solution.
A hybrid storage system is composed of storage media with different characteristics. According to the characteristics of the data access and the system load, each data request is handed to the medium best suited to handle it, thereby improving the performance of the entire system. The SSD, with its high reliability, low energy consumption, and high performance, is widely used as memory in hybrid storage systems. Considering the high performance of the SSD and the low cost and large capacity of the HDD, the SSD is usually used as a cache for the HDD, which improves the user experience by providing high-speed, large-capacity storage. However, as a cache medium, the SSD is relatively expensive and limited in capacity. Therefore, a cache replacement strategy good enough to manage the hybrid storage is needed.
Manuscript received September 24, 2019; revised October 5, 2020. Yinyin Wang and Yuwang Yang are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, China (e-mail: wyywx699@163.com, yuwangyang@njust.edu.cn).
Qingguang Wang is with the Yancheng 1st People's Hospital and Medical School of Nantong University, China (e-mail: wqg699@163.com).
In PACS, the classic cache replacement strategies perform relatively poorly, because each of them considers only a single factor of the cache object (such as size, last access time, or access frequency). The literature offers machine learning-based approaches to improve on the traditional cache replacement strategies, such as the Artificial Neural Network (ANN), the Support Vector Machine (SVM), and the C4.5 decision tree (C4.5). However, these approaches are time-consuming, which leads to computational overhead, and their prediction accuracy is poor. More importantly, these policies focus only on the influential factors of the cache object itself and ignore the characteristics of the entity to which the cache object corresponds. Hence, the existing intelligent cache replacement strategies are inefficient in PACS.
A large amount of medical data is produced during medical treatment and stored in databases such as HIS, RIS, and EMR. Mining these data and analyzing the resulting feature values reveals several interesting phenomena:
1. The complete imaging workflow includes the medical examination, film printing, the preliminary report, the final report, and so on. The image data are accessed multiple times before the workflow ends.
2. Image data of inpatients are accessed much more often than those of patients who have been discharged.
3. Image data with positive results are accessed more often than those with negative results.
4. Different doctors access image data in quite different ways.
5. Image data of patients who have undergone a surgical operation or who are seriously ill are more likely to be accessed than those of other patients.
Therefore, this paper aims to use these feature values to predict, through machine learning, the probability that a cache object will be accessed in the future, and to combine this prediction with a classic cache replacement strategy to improve the cache hit ratio.
Based on an analysis of the PACS cache architecture, the features of the cache objects, and the users' habits, this paper combines the SVM algorithm with the classic LRU cache replacement strategy to form the SVM-LRU strategy. This strategy uses a relatively simple SVM algorithm to build a model that incorporates the influencing factors and features of the cache object together with the users' habits. The SVM-LRU strategy requires little training time and computation, achieves high prediction accuracy, and improves both the hit ratio and the byte hit ratio.

A. Hybrid Storage System Architecture
In PACS, SSDs and HDDs are commonly combined into a hybrid storage system [1]-[3], with the SSD used as a cache for the HDD. The logical addresses of the hybrid storage correspond to the physical addresses of the HDD, and the SSD holds only copies of the hot data on the HDD. The total capacity of the hybrid storage is therefore the capacity of the HDD. When an access request arrives, the data is first looked up on the SSD; on a hit, it is read directly from the SSD. On a miss, the data is transferred from the HDD to the SSD and then read. This architecture is mainly suitable for file-based storage systems. Because a PACS mainly stores large numbers of image files, it is well suited to this architecture.
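The read path described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class name `HybridStore` and the dictionary-backed "devices" are hypothetical, and capacity limits and eviction are deliberately left out, since they are the job of the cache replacement strategy.

```python
class HybridStore:
    """Read path of an SSD-over-HDD hybrid store (illustrative only)."""

    def __init__(self, hdd):
        self.hdd = hdd    # backing store: file id -> bytes, holds every file
        self.ssd = {}     # cache: file id -> bytes, holds copies of hot files
        self.hits = 0
        self.misses = 0

    def read(self, file_id):
        if file_id in self.ssd:
            # Hit: the file is served directly from the SSD.
            self.hits += 1
            return self.ssd[file_id]
        # Miss: copy the file from the HDD to the SSD, then serve it.
        self.misses += 1
        data = self.hdd[file_id]
        self.ssd[file_id] = data
        return data


store = HybridStore({"ct_001": b"CT image", "dr_002": b"DR image"})
first = store.read("ct_001")     # miss: promoted to the SSD
second = store.read("ct_001")    # hit: served from the SSD
print(store.hits, store.misses)
```

Because the HDD always keeps the authoritative copy, dropping an entry from the SSD never loses data, which is why the hybrid capacity equals the HDD capacity.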

B. Cache Replacement Strategy
This section reviews cache replacement strategies, which can be divided into six types:
1) Strategy based on the random algorithm (RAND). This type of strategy uses a random number generator, implemented in software or hardware, to determine the object in main memory to be replaced. It is the simplest strategy and is straightforward to implement. However, it does not consider any of the factors that affect the object, so its hit ratio is relatively low and it is of little practical use.
2) Strategy based on LRU [4]. This strategy replaces the object that has gone unaccessed for the longest time in the recent past. Its advantages are relatively easy implementation, low time complexity, and good performance in applications whose cached objects are of uniform size. However, it ignores factors such as the access frequency and size of the object, and it suffers from cache pollution.
3) Strategy based on the Least Frequently Used algorithm (LFU) [5]. This type of strategy replaces the object that has been accessed least frequently, which can help avoid cache pollution. However, this policy is difficult to implement: it keeps a counter for each object and updates the counters on a fixed clock. It also considers only the access frequency of the cache object and ignores other factors.
4) Strategy based on SIZE [6]. This strategy evicts the largest object from the cache when a new object requires space. It is simple and easy to implement. However, the cache can be polluted by small objects that are never accessed again but are difficult to replace. This policy suits Web cache replacement applications, where it achieves a high hit ratio but a low byte hit ratio.
5) Strategy based on a function. Strategies such as Greedy Dual-Size (GDS) and Greedy Dual-Size Frequency (GDSF) [7] can optimize cache performance by selecting appropriate weighting parameters, and they can take multiple influential factors into account to handle different application scenarios. However, choosing appropriate weighting parameters is very difficult, and computing the function values may introduce new problems.
6) Intelligent cache replacement strategy. Numerous intelligent cache replacement policies have been proposed in recent years. The strategies proposed in References [8]-[12] can generally be divided into two categories. In the first category, intelligent algorithms are used independently as cache replacement policies; in the second category, intelligent algorithms are combined with conventional cache replacement policies. Both approaches rely on the prediction of future access probabilities to enhance cache performance. However, these intelligent cache replacement policies depend only on the intrinsic properties of the cached items (e.g., file size, last access time, and access frequency). In our method, medical data is mined to identify the patients who correspond to the PACS's cached items. A machine learning algorithm is then used to construct a predictive model based on the variability of the treatment process. This model is used to predict the future access probability of each cached item, thus improving cache performance.
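To make the differences between the classic strategies concrete, the sketch below expresses LRU, LFU, SIZE, and GDS as victim-selection rules over the same cache entries. It is a simplified illustration rather than any referenced implementation: the `Entry` fields and function names are hypothetical, and the GDS priority uses the commonly cited form H = L + cost / size with a uniform cost of 1, where L is the inflation value (the priority of the most recently evicted object).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    size: int            # object size (any consistent unit)
    last_access: int     # logical timestamp of the most recent access
    hits: int            # number of accesses so far
    gds_h: float = 0.0   # GDS priority, H = L + cost / size


def lru_victim(entries):
    # LRU evicts the object that has gone unaccessed the longest.
    return min(entries, key=lambda e: e.last_access)


def lfu_victim(entries):
    # LFU evicts the object with the fewest accesses.
    return min(entries, key=lambda e: e.hits)


def size_victim(entries):
    # SIZE evicts the largest object.
    return max(entries, key=lambda e: e.size)


def gds_priority(inflation, size, cost=1.0):
    # Small objects get a high priority, so large ones are evicted first.
    return inflation + cost / size


def gds_victim(entries):
    # GDS evicts the object with the lowest priority H.
    return min(entries, key=lambda e: e.gds_h)


entries = [
    Entry("ct", size=300, last_access=5, hits=4),
    Entry("dr", size=8, last_access=1, hits=2),
    Entry("us", size=4, last_access=3, hits=1),
]
for e in entries:
    e.gds_h = gds_priority(0.0, e.size)

victims = {
    "LRU": lru_victim(entries).name,
    "LFU": lfu_victim(entries).name,
    "SIZE": size_victim(entries).name,
    "GDS": gds_victim(entries).name,
}
print(victims)
```

On this toy data the four rules do not agree on the victim, which illustrates how strongly the choice of factor shapes what stays in the cache.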

C. Support Vector Machine
The support vector machine (SVM) is one of the most robust and accurate of the well-known machine learning algorithms. SVM has been used successfully in a wide range of applications such as text classification, Web page classification, and bioinformatics [13], [14]. Consider the problem of separating the set of training data $(x_1, y_1), \ldots, (x_n, y_n)$, with class labels $y_i \in \{-1, +1\}$, by a hyperplane $w \cdot x + b = 0$ in some space $H$. If we have no prior knowledge about the data distribution, then the optimal hyperplane is the one which maximizes the margin [15]. The optimal values for $w$ and $b$ can be found by solving a constrained minimization problem, using Lagrange multipliers $\alpha_i$: $f(x) = \operatorname{sign}\left(\sum_{i} \alpha_i y_i K(x_i, x) + b\right)$, where $\alpha_i$ and $b$ are found by using an SVC learning algorithm [15]. Those $x_i$ with nonzero $\alpha_i$ are the "support vectors". For [...] where $\bar{x}$ is the mean of $x$.
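As a small illustration of how a trained kernel SVM classifies a new point, the sketch below evaluates the standard decision rule: the sign of the weighted sum of kernel evaluations against the support vectors, plus the bias b. Everything concrete here is an assumption made for the example: the RBF kernel, the value of `gamma`, and the hand-picked support vectors and multipliers, which in practice would come from an SVC training algorithm.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel, exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Return (predicted label, raw score) for the point x."""
    score = b
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        # Only support vectors (nonzero alpha) contribute to the score.
        score += alpha * y * kernel(sv, x)
    return (1 if score >= 0 else -1), score


# Hand-picked values for illustration only, not the result of training.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]

label_near_pos, score_pos = svm_decision((1.8, 2.1), support_vectors, labels, alphas, b=0.0)
label_near_neg, score_neg = svm_decision((0.1, -0.2), support_vectors, labels, alphas, b=0.0)
print(label_near_pos, label_near_neg)
```

The raw score, rather than only its sign, is what a cache policy would typically map to an access probability.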

III. SVM-LRU INTELLIGENT CACHE REPLACEMENT STRATEGY FOR PACS
The SVM-LRU intelligent cache replacement strategy for PACS uses the SVM algorithm to predict the probability that a cache object will be accessed in the future, based on the feature values, drawn from the medical process, of the patient to whom the cache object corresponds. Combining this prediction with LRU improves PACS cache replacement performance.

A. SVM-LRU Intelligent Cache Replacement Strategy Framework
In this section, the SVM-LRU intelligent cache replacement policy is presented; its framework is shown in Fig. 1. The ETL component extracts data from the heterogeneous data sources distributed across the hospital (such as HIS, RIS, and EMR) into a temporary intermediate layer for cleaning, conversion, and integration, and finally loads the data into the target database, which lays the basis for later data analysis and data mining [16], [17].
The offline component does not handle users' access requests directly. It trains the SVM-LRU policy when the server is idle. The updated SVM-LRU policy is then applied to the cache manager of the online component.
The online component runs the cache manager. When a user issues an access request, the feature values of the cache object are first obtained from the target database and passed to the cache manager, which then carries out cache management using the SVM-LRU policy.
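The step in which the online component turns a record from the target database into model input might look like the following sketch. The field names are hypothetical: they are loosely modelled on the access patterns listed in the introduction (inpatient status, positive results, workflow progress, surgery, severity) and are not the paper's actual feature set, and the 30-day cap is an arbitrary choice for scaling.

```python
from dataclasses import dataclass


@dataclass
class StudyRecord:
    """Hypothetical row of the target database for one cached study."""
    is_inpatient: bool
    result_positive: bool
    final_report_done: bool
    had_surgery: bool
    critically_ill: bool
    days_since_exam: int


def to_feature_vector(rec, max_days=30):
    """Encode a record as a list of floats in [0, 1] for the SVM."""
    return [
        float(rec.is_inpatient),
        float(rec.result_positive),
        float(rec.final_report_done),
        float(rec.had_surgery),
        float(rec.critically_ill),
        # Cap and scale the age of the study so it matches the other features.
        min(rec.days_since_exam, max_days) / max_days,
    ]


record = StudyRecord(
    is_inpatient=True,
    result_positive=True,
    final_report_done=False,
    had_surgery=False,
    critically_ill=False,
    days_since_exam=15,
)
features = to_feature_vector(record)
print(features)
```

Keeping every feature on the same scale matters for SVMs, because kernel values depend on distances between feature vectors.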

B. SVM-LRU Intelligent Cache Replacement Strategy
LRU is a classical cache replacement policy. However, it is subject to cache pollution, in which unwanted objects reside in the cache for a long time. In LRU, a new object is inserted at the top of the cache stack. If the object is never accessed again, it takes a long time to move down to the bottom of the stack before it is removed from the cache.
To reduce cache pollution in LRU, the SVM is combined with LRU to form the new SVM-LRU policy. The SVM-LRU workflow is as follows. When a user requests object X, the SVM predicts the probability P that the object will be accessed in the future. If P is greater than or equal to the threshold, object X is classified as hot data and placed at the top of the cache stack. Otherwise, object X is classified as cold data and placed in the middle of the cache stack so that it is evicted quickly. The pseudo code of the SVM-LRU strategy is as follows.
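A minimal Python sketch of this workflow is given here. It is an illustration under stated assumptions, not the authors' pseudo code: the class name `SvmLruCache` is hypothetical, the `predict_prob` callback stands in for the trained SVM, the threshold of 0.5 is a placeholder because the threshold value is not given in the text, and "the middle of the stack" is taken to mean index `(len + 1) // 2`.

```python
class SvmLruCache:
    """LRU stack in which predicted-cold objects enter at the middle."""

    def __init__(self, capacity, predict_prob, threshold=0.5):
        self.capacity = capacity
        self.predict_prob = predict_prob   # callable: key -> probability in [0, 1]
        self.threshold = threshold         # assumed value, not from the paper
        self.stack = []                    # index 0 = top, last index = bottom
        self.sizes = {}                    # key -> object size
        self.used = 0

    def access(self, key, size):
        """Process one request and return True on a cache hit."""
        hit = key in self.sizes
        if hit:
            self.stack.remove(key)         # O(n); kept simple for clarity
        else:
            if size > self.capacity:
                return False               # object can never fit in the cache
            while self.used + size > self.capacity:
                victim = self.stack.pop()  # evict from the bottom of the stack
                self.used -= self.sizes.pop(victim)
            self.sizes[key] = size
            self.used += size
        if self.predict_prob(key) >= self.threshold:
            self.stack.insert(0, key)      # hot: top of the stack
        else:
            middle = (len(self.stack) + 1) // 2
            self.stack.insert(middle, key) # cold: middle of the stack
        return hit


hot_keys = {"a", "b"}
cache = SvmLruCache(3, lambda k: 0.9 if k in hot_keys else 0.1)
results = [cache.access(k, 1) for k in ["a", "b", "c", "a", "d"]]
print(results, cache.stack)
```

In this toy run the cold object "c" is evicted when "d" arrives, whereas plain LRU would have evicted the hot object "b", which is the pollution-reducing effect the policy aims for.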

A. Cache Replacement Strategy Evaluation Index
Two main indicators are used to evaluate cache performance: the hit ratio (HR) and the byte hit ratio (BHR). The hit ratio is the percentage of access requests served from the cache among all access requests. The byte hit ratio is the percentage of bytes served from the cache among the total bytes of all access requests.
HR and BHR have different focuses. HR focuses on reducing users' response time and achieving a better user experience, while BHR focuses on reducing the erase frequency of the SSD and extending its service life. It is very difficult for a cache replacement strategy to optimize both HR and BHR [18], because strategies that improve HR usually favour small objects over large ones, thus reducing BHR. In contrast, strategies that do not favour small objects tend to increase BHR at the expense of HR.
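In symbols, these two definitions can be written as below. The notation is introduced here for illustration only and is not taken from the paper: $N$ is the number of requests, $h_i \in \{0, 1\}$ indicates whether request $i$ was a hit, and $s_i$ is the size in bytes of the requested object.

```latex
\mathrm{HR} = \frac{\sum_{i=1}^{N} h_i}{N} \times 100\%,
\qquad
\mathrm{BHR} = \frac{\sum_{i=1}^{N} h_i \, s_i}{\sum_{i=1}^{N} s_i} \times 100\%
```

As a made-up example, take four requests of 10, 300, 5, and 85 MB in which only the 10 MB and 5 MB requests hit. Then HR = 2/4 = 50%, but BHR = 15/400 = 3.75%, which shows how a policy can score well on one metric and poorly on the other.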

B. Comparison between SVM-LRU Strategy and Classic Strategy
The trace-driven simulation method is used to conduct the experiments [19]. The experimental data were derived from the PACS of Yancheng No. 1 People's Hospital and were collected between 7th May 2018 and 27th May 2018 (a total of 21 days). Before the experiment, its stopping point, namely the infinite cache size, must be determined. An infinite cache has enough space to store all cached objects without replacing any of them. Its capacity is the sum of the sizes of all cache objects, and at this capacity HR and BHR reach their maximum values. Because of cost, however, an infinite cache cannot be realized in practice. The infinite cache size obtained in the experiment is 16134 GB, with $\mathrm{HR}_{\max} = 42.36\%$ and $\mathrm{BHR}_{\max} = 45.18\%$. Nine cache capacity levels are therefore selected, from $2^{5}$ GB to $2^{13}$ GB. The SVM-LRU strategy and the classic GDS, LRU, LFU, and SIZE strategies were used in the experiments. Fig. 3 and Fig. 4 show the HR and BHR of the different strategies at different cache capacities. As the cache capacity increases, the HR and BHR of all strategies increase. When the cache capacity approaches that of the infinite cache, HR and BHR become stable and close to their maximum values.
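The general shape of such a trace-driven experiment is sketched below for the plain LRU baseline only. This is an assumption-laden illustration: the function name `simulate_lru`, the `(key, size)` trace format, and the toy trace are hypothetical and have no connection to the hospital data; only the nine capacity levels from 2^5 GB to 2^13 GB follow the text.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Replay a list of (key, size) requests; return (HR %, BHR %)."""
    cache = OrderedDict()              # key -> size, least recently used first
    used = hits = hit_bytes = total_bytes = 0
    for key, size in trace:
        total_bytes += size
        if key in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(key)     # mark as most recently used
            continue
        if size > capacity:
            continue                   # object can never fit in the cache
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)   # evict the LRU object
            used -= evicted_size
        cache[key] = size
        used += size
    if not trace:
        return 0.0, 0.0
    return 100 * hits / len(trace), 100 * hit_bytes / total_bytes


# Nine capacity levels, 32 GB to 8192 GB, as used in the experiment.
capacities_gb = [2 ** k for k in range(5, 14)]

# Toy trace with unit-sized objects, for illustration only.
toy_trace = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
small = simulate_lru(toy_trace, capacity=2)
large = simulate_lru(toy_trace, capacity=3)
print(small, large)
```

Replaying the same trace at each capacity level, once per strategy, yields curves of the kind plotted against cache capacity; once the capacity can hold every object, further increases no longer change HR or BHR.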
As shown in Fig. 3, SVM-LRU clearly improves on the HR of the classic LRU strategy, indicating that introducing the SVM into the LRU strategy is effective. The access frequencies of the image files cached in PACS do not differ much under the same circumstances, that is, there are few extremely hot or extremely cold cache files, so the LFU strategy, which relies on the frequency factor, has the worst performance. In contrast, the sizes of the image files stored by PACS differ greatly: for example, files from ultrasonography, DR, and similar examinations are only 1~10 MB, whereas files from CT, MRI, and others reach hundreds of MB. Strategies that take size into account, such as SIZE and GDS, therefore have an advantage over the other strategies when the cache capacity is small. But as the cache capacity increases, their excessive eviction of large files makes their HR grow significantly more slowly than that of the other strategies. The HR of GDS differs little from that of SVM-LRU: GDS has the advantage when the cache capacity is small, and SVM-LRU has the advantage otherwise.
As for BHR, shown in Fig. 4, SVM-LRU, LRU, and LFU do not consider the size factor, so their BHR trends are basically the same as their HR trends. SIZE and GDS prefer small objects, so they obtain a high HR at the expense of a low BHR. When the cache capacity is small, their HR is superior to that of SVM-LRU, but their BHR is significantly inferior. To better illustrate the advantages of the SVM-LRU strategy, the improvement ratio (IR) of HR and BHR is introduced to compare the strategies. IR is calculated as in formula (16), where PM represents the proposed method and CM represents the comparative model. Table I summarizes the IR statistics of SVM-LRU relative to the other strategies.
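Formula (16) does not appear in the text above. A common definition of the improvement ratio, assumed here rather than taken from the paper, is the relative change of the proposed method's metric over the comparative model's metric:

```latex
\mathrm{IR} = \frac{\mathrm{PM} - \mathrm{CM}}{\mathrm{CM}} \times 100\%
```

Under this assumed definition, if a comparative strategy reaches an HR of 40% and the proposed method reaches 44% at the same cache capacity (illustrative numbers only), then IR = (44 - 40) / 40 = 10%. A negative IR means the proposed method performs worse on that metric.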
Overall, the results in Table I show that SVM-LRU gains a larger BHR advantage at a relatively small loss of HR. SVM-LRU can predict and replace cold data in the cache faster, and it significantly improves cache performance, especially when the cache capacity is relatively small, which indicates that introducing the SVM into LRU is correct and effective. SVM-LRU performs well on both HR and BHR compared with LFU, indicating that access frequency is not a key factor for cache replacement in a hybrid storage system such as PACS. Compared with GDS and SIZE, SVM-LRU has a slightly worse HR only when the cache capacity is 32 GB. In all other cases both HR and BHR are significantly improved, which indicates that strategies focusing on size are not suitable for the PACS hybrid storage cache.
In summary, SVM-LRU outperforms the other classical strategies. Its BHR is good, and its HR performs well when the cache capacity is large enough and is generally superior to that of the other classical strategies.

V. CONCLUSION AND FUTURE WORKS
This paper proposes an intelligent cache replacement strategy for PACS that combines the SVM with LRU to form the SVM-LRU intelligent cache replacement strategy. Existing cache replacement policies use only the properties of the cache object itself (such as size, last access time, and access frequency) to formulate the strategy. In contrast, the PACS-oriented SVM-LRU strategy mines the medical data to obtain the feature values, drawn from the medical process, of the patient corresponding to each cached object. Experimental results show that SVM-LRU significantly improves cache performance compared to LRU, LFU, SIZE, and GDS.
The SVM-LRU strategy proposed in this article is based on the example of PACS in hospitals. Its core ideas can be extended to industries such as electricity, banking, aerospace, and telecommunications, which resemble hospital systems in having a large amount of data, a large number of data dimensions, and strong data relevance. In each case, the feature values are obtained through data mining, and a machine learning algorithm is used to build a model that predicts the probability that a cache object will be accessed in the future, thus improving cache performance.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
Yinyin Wang performed the data analyses and wrote the manuscript. Yuwang Yang provided the conception of the study. Qingguang Wang contributed significantly to analysis and manuscript preparation.