Fast CU Splitting Algorithms for Virtual Reality Video Based on KNN

Abstract—The current coding framework for virtual reality video first projects the 3D data into a 2D format and then encodes it with traditional coding tools, which incurs high computational complexity. To reduce the coding complexity while respecting the quality evaluation standard of virtual reality video, this paper presents a fast algorithm that speeds up Coding Unit (CU) partitioning by predicting the maximum depth of each largest coding unit (LCU) with a KNN classifier. Experimental results show that the proposed algorithm achieves an average time reduction of 37.9% compared to the reference HM-16.16+360lib4.0, with only a 1.31% BD-rate increase.


I. INTRODUCTION
Virtual reality video is a special kind of video that represents the whole surrounding scene in 360 degrees. It is captured by multiple professional cameras, stitched in software, and played back on special devices. It also lets the viewer manipulate the video, for example by zooming in and out and moving in all directions, so as to simulate and reproduce the real environment [1].
At present, the coding and transmission of virtual reality video mainly rely on projecting every frame of the 3D data into a 2D frame and then encoding it with a traditional coding framework such as HEVC or H.264. Commonly used projection formats include ERP, EAP, and CMP. In addition, unlike traditional video, virtual reality video has its own quality evaluation metric. In this paper, we study virtual reality video in the ERP projection format.
HEVC is one of the coding frameworks used in virtual reality video coding. HEVC adopts the coding tree unit (CTU) as its basic processing unit. A CTU consists of one luma CTB, two chroma CTBs, and the corresponding syntax elements. Figure 1 shows a frame divided into CUs within CTUs. A CTU may contain only one coding unit (CU), or HEVC can use a quadtree structure to recursively divide a CU into CUs of different sizes [2]. There are four CU sizes in HEVC: 64×64, 32×32, 16×16, and 8×8. For an LCU of size 64×64, the encoder first treats it as a single CU, calculates its best prediction mode, and records the best prediction data for this partitioning. The encoder then divides the LCU into four 32×32 CUs, calculates the best prediction mode of each, and records the best prediction data. Similarly, the encoder divides each 32×32 CU into four 16×16 CUs, calculates and records the best prediction mode of each 16×16 CU, and then divides each 16×16 CU into four 8×8 CUs. Since 8×8 is the smallest CU size, the encoder simply evaluates the best prediction mode of each 8×8 CU and records its data. When the 8×8 CUs are done, the encoder compares the sum of the RD-costs of the four 8×8 CUs with the RD-cost of the enclosing 16×16 CU to decide whether to keep the four 8×8 CUs or the single 16×16 CU. After the first 16×16 CU is completed, the encoder repeats these steps for the second, third, and fourth 16×16 CUs. When all four 16×16 CUs are completed, the encoder compares the sum of their RD-costs with the RD-cost of the 32×32 CU to decide whether to select the 32×32 CU or the four 16×16 CUs.
When the first 32×32 CU is completed, the encoder repeats the previous steps to determine the partitioning pattern of the second, third, and fourth 32×32 CUs. When all four 32×32 CUs are complete, the encoder compares the sum of their RD-costs with the RD-cost of the 64×64 CU and decides whether to choose the 64×64 CU or the four 32×32 CUs with their sub-partitions. In HEVC, deciding whether a block in the quadtree needs further partitioning therefore requires comparing the coding cost of the block with that of its sub-blocks after all of them have been traversed. If the RD-cost of a CU is larger than the sum of the RD-costs of its sub-CUs, the smaller CUs are used. Otherwise, there is no need to divide it, and the current CU is coded as a whole. These comparisons occur only after CUs of all sizes have been traversed, which imposes a heavy computational burden. In many cases, however, the CU sizes in the optimal partition vary; if we can predict the maximum depth of an LCU, we can terminate the partitioning process early and avoid traversing all the possibilities.
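The exhaustive search described above, and the saving obtained from a known maximum depth, can be sketched as follows. This is a minimal illustration rather than the HM implementation: the callback `rd_cost` and the function `best_partition` are hypothetical names, and a real encoder would run its prediction-mode search inside the callback.

```python
from typing import Callable, List, Tuple

# Hypothetical cost callback: rd_cost(x, y, size) returns the RD-cost of
# coding the square block at (x, y) as a single CU.
RdCost = Callable[[int, int, int], float]
Leaf = Tuple[int, int, int]


def best_partition(x: int, y: int, size: int, rd_cost: RdCost,
                   depth: int = 0, max_depth: int = 3) -> Tuple[float, List[Leaf]]:
    """Return (cost, leaves) of the cheapest quadtree partition of one block.

    leaves is a list of (x, y, size) tuples. Recursion stops at max_depth,
    so a predicted maximum depth below 3 skips the smaller CU sizes.
    """
    whole_cost = rd_cost(x, y, size)
    if depth == max_depth or size == 8:
        return whole_cost, [(x, y, size)]

    half = size // 2
    split_cost, split_leaves = 0.0, []
    for dy in (0, half):
        for dx in (0, half):
            cost, leaves = best_partition(x + dx, y + dy, half, rd_cost,
                                          depth + 1, max_depth)
            split_cost += cost
            split_leaves += leaves

    # Keep the four sub-CUs only when their summed RD-cost is lower.
    if split_cost < whole_cost:
        return split_cost, split_leaves
    return whole_cost, [(x, y, size)]
```

With the default `max_depth=3`, one 64×64 LCU triggers 1 + 4 + 16 + 64 = 85 cost evaluations; with `max_depth=1` only 5 are needed, which is the kind of saving the depth prediction aims for.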
In this paper, we use KNN to predict the maximum depth of each LCU and thereby reduce redundant LCU partitioning operations.

II. RELATED WORKS
In [3], K. Choi proposed a coding tree termination method for the CU SKIP mode. Deyuan Liu [4] proposed a fast CU size decision algorithm based on Support Vector Machines (SVM). The work in [5] uses a weighted SVM to predict early CU termination and reduce computational complexity. In [6], an early termination method based on a Bayesian decision rule was reported, in which online and offline learning were jointly applied to generate the model parameters of the classifiers.
In [7], a fast CU partitioning algorithm is proposed for the HEVC encoder, which terminates the CU partitioning process early based on the Bayesian decision rule using joint online and offline learning. In [8], the authors proposed a fast and efficient mode decision algorithm based on the Neyman-Pearson rule, which consists of an early SKIP mode decision and a fast CU size decision. In [9], the authors proposed a machine-learning-based fast CU depth decision method for High Efficiency Video Coding (HEVC), which optimizes the complexity allocation at the CU level under given rate-distortion (RD) cost constraints. The work in [10] proposes a fast CU splitting algorithm that narrows the CU depth range and terminates CU splitting early based on the Sobel edge detection operator. The work in [11] proposed a method that reduces the encoding time by pruning the coding quadtrees using prediction residual statistics.
So far, a number of methods have been proposed to reduce the encoding complexity of HEVC through early CU decisions. However, these algorithms were evaluated only on conventional HEVC video. Although virtual reality video is encoded by HEVC after projection, its quality evaluation standard differs from that of conventional video. Moreover, some of the above algorithms rely on thresholds derived from video statistics for fast partitioning, and such statistical thresholds do not always apply to all videos.

III. LCU DEPTH PREDICTION BASED ON KNN
An LCU may contain CUs of four sizes. If we predict the size of the smallest CU in an LCU before encoding it, and skip the evaluation of smaller CUs once CUs of this size have been encoded, we can improve the efficiency of the encoder when partitioning CUs.
In this paper, the minimum CU size in an LCU is predicted so that the evaluation of the remaining CU partitions in the LCU can be skipped. This problem can be regarded as a classification problem. Considering computational efficiency and the complexity of model training, a KNN classifier is adopted. This section covers LCU feature selection, the classification method, classification accuracy analysis, and the choice of the KNN parameter K and the proportion of the prediction set.

A. LCU Complexity Feature Analysis Based on Sobel Filtering
Generally, simpler areas are coded well with larger CUs, while more complex areas need to be divided into smaller CUs for prediction. Based on this idea, a number of algorithms have been proposed that achieve fast CU partitioning to a certain extent. In this paper, the Sobel operator is used to filter the content of the LCU being encoded in order to measure its complexity.
We apply horizontal and vertical Sobel filtering to the CU content to be coded, which yields the gradients $G_x$ and $G_y$ in the horizontal and vertical directions. In equation (1), A represents the content of the LCU.
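As a rough sketch of the Sobel filtering step, the code below applies the standard 3×3 Sobel kernels to a block of luma samples and returns the mean absolute horizontal and vertical gradients. The function name `sobel_means`, the use of mean absolute gradients as the feature, and the skipping of border pixels are assumptions made for illustration; they are not taken from the paper's equations.

```python
# Standard 3x3 Sobel kernels for the horizontal (GX) and vertical (GY) gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_means(block):
    """Return (mean |Gx|, mean |Gy|) over the interior pixels of a 2D block.

    block is a list of rows of luma samples. Border pixels are skipped,
    which is an assumption; the border handling is not specified here.
    """
    h, w = len(block), len(block[0])
    sum_x = sum_y = 0
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = gy = 0
            # Weighted sum of the 3x3 neighborhood centered on (i, j).
            for di in range(3):
                for dj in range(3):
                    p = block[i + di - 1][j + dj - 1]
                    gx += GX[di][dj] * p
                    gy += GY[di][dj] * p
            sum_x += abs(gx)
            sum_y += abs(gy)
    count = (h - 2) * (w - 2)
    return sum_x / count, sum_y / count
```

A flat block yields two zero means, while a block with a single vertical edge yields a large horizontal mean and a zero vertical mean, which matches the intuition that edge-rich LCUs are the complex ones.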
After obtaining $G_x$ and $G_y$, we calculate the edge points and average gray values by equations (3) and (4), where i and j are the coordinates of each pixel and n is the width of the LCU. The more edge points there are in an … If the size of the smallest CU block is predicted before LCU partitioning, further partitioning can be terminated after the CUs of the corresponding size have been encoded, thus improving the encoding efficiency.
Through experiments, we find that the horizontal and vertical edge mean values of an LCU are closely related to its depth. The proportions of LCUs at the different thresholds are shown in Table I.
At present, commonly used fast CU algorithms extract features of the CU, quantify them, and derive a threshold, which is then used to treat CUs above or below the threshold differently. However, the threshold of such an algorithm is usually derived from the statistics of the test videos, and it may not represent the features of every video well.
International Journal of Machine Learning and Computing, Vol. 10, No. 6, November 2020
For different videos, the optimal threshold is not the same, because the image characteristics and complexity of each video differ. Statistical thresholds are averaged over the characteristics of the video set they were derived from and are therefore not necessarily optimal for the video to be coded.

C. Adjustment of Classification Method
The experimental results show that the edge characteristics of $LCU_0$ and $LCU_1$ (LCUs with maximum depth 0 and 1) are similar, and it is difficult to distinguish them using edge features.
We counted the proportion of each CU size in the video, as shown in Table II.

D. LCU Depth Prediction Based on KNN Classifier
Because LCU depth prediction can itself be regarded as a classification problem, we encode a part of the video frames normally, record the depth and edge features of each LCU, and then use these data with a classifier to predict the LCU depths in the remaining frames.
Since the classifier is trained during encoding, the algorithm chosen for training and prediction must have low computational complexity in order to improve coding efficiency.
The KNN algorithm has low training complexity and a relatively simple structure. Although its prediction time complexity is relatively high, it is still much lower than that of traversing all LCU depths. Therefore, this paper uses KNN to predict the depth of the LCU.
KNN is a basic classification and regression method. Its input is the feature vector of an instance. By calculating the distance between the new sample and the stored training samples, the K (K ≥ 1) nearest neighbors are selected and used for classification (by voting) or regression. If K = 1, the new sample is simply assigned to the class of its nearest neighbor.
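The voting rule just described can be sketched in a few lines. This is a generic KNN classifier under stated assumptions, not the authors' implementation: the function name `knn_predict`, the Euclidean distance, and the tie-breaking rule are all choices made for illustration.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training samples.

    Distances are Euclidean. On a tie, the class whose member appears first
    among the nearest neighbors wins.
    """
    # Indices of the training samples, sorted by distance to the query.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

In the encoder setting, each `train_x` entry would be the edge feature vector of an LCU from a normally coded frame, and the matching `train_y` entry its recorded depth. Because every prediction scans all stored samples, the cost per query grows linearly with the size of the training set.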
As shown in Fig. 4, the question is which class the circle should be assigned to. If K = 3, the decision is made by the majority among its three nearest neighbors (a proportion of 2/3 in the figure); if K = 5, it is made by the majority among its five nearest neighbors (a proportion of 4/5 in the figure).
The KNN algorithm is simple and effective. It is a lazy-learning algorithm: the classifier needs no explicit training phase, so the training time complexity is 0. The computational complexity of KNN classification is directly proportional to the number of samples in the training set; that is, if the training set contains n samples, the classification time complexity of KNN is O(n). Although the KNN method also depends on the limit theorem in principle, its class decision involves only a small number of adjacent samples. Because it relies mainly on neighboring samples rather than on discriminating class domains, KNN is more suitable than other methods for sample sets whose class domains intersect or overlap. Because of the distribution characteristics of …
In this algorithm, as shown in Fig. 5, we divide the encoded sequence into frame sets, each containing several frames. In each set, some frames form the training set and the rest form the prediction set. When encoding the training set, the original algorithm is used to partition the LCU, and the edge density attributes …
If the predicted result is $LCU_2$ (maximum depth 2), the 64×64 CU and the 8×8 CUs are skipped, and only the 32×32 and 16×16 CUs are coded, from which the optimal partition is selected.
If the predicted result is $LCU_3$ (maximum depth 3), the 64×64 CU is skipped and only the 32×32, 16×16, and 8×8 CUs are coded, from which the optimal partition is selected.
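The two skipping rules above amount to a lookup from the predicted depth class to the CU sizes that are still evaluated. The sketch below encodes only the two rules stated in the text; the names `SIZES_BY_PREDICTED_DEPTH` and `cu_sizes_to_code` are hypothetical, and the fallback to a full search for any other class is an assumption, since the rule for those classes is not given here.

```python
ALL_CU_SIZES = [64, 32, 16, 8]

# Only the rules for predicted depths 2 and 3 are stated in the text.
SIZES_BY_PREDICTED_DEPTH = {
    2: [32, 16],
    3: [32, 16, 8],
}


def cu_sizes_to_code(predicted_depth):
    """Return the CU sizes the encoder still evaluates for one LCU.

    Falling back to the full search for any other class is an assumption
    made here for safety, not a rule taken from the paper.
    """
    return SIZES_BY_PREDICTED_DEPTH.get(predicted_depth, ALL_CU_SIZES)
```

Keeping the rules in a table makes it easy to see how many CU sizes each prediction removes from the search: two sizes for a predicted depth of 2 and one size for a predicted depth of 3.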

E. KNN Parameter Selection
For a given input sample $x$ with true value $y$, the output value $\hat{y} = f(x)$ predicted by the classifier may be inconsistent with $y$. The result of correct classification is measured by the correct rate function, denoted $R(y, f(x))$.
If the total number of frames is $N$, then the correct rate $P$ of the classifier is:

$$P = \frac{\sum R(y, f(x))}{N} \quad (6)$$

In this algorithm, the prediction result is the depth of the LCU. If $LCU_2$ is predicted to be $LCU_3$, the LCU with depth 2 is additionally evaluated at the 8×8 CU level, but this has no effect on the coding result. If the number of $LCU_2$ samples classified as $LCU_3$ is $L_{23}$, the overall accuracy of the classifier, $P_{Total}$, also counts these $L_{23}$ cases as correct.
If the total number of frames of the coded video is $N$, and the number of frames using KNN to predict LCU depth is $N_{PF}$, the proportion of prediction frames is $R_{PF} = N_{PF} / N$.
In this paper, two parameters must be determined for KNN classification: the value of k and $R_{PF}$. The $P_{Total}$ for different $R_{PF}$ and k is shown in Table III. Because increasing k and decreasing the proportion of predicted frames both reduce the efficiency of the fast algorithm, we choose the KNN classifier with k = 2 and $R_{PF}$ = 60%.
WS_PSNR is an objective quality assessment standard for virtual reality video adopted by 360Lib. Under this criterion, pixels at different latitudes carry different weights when a 2D image is projected onto the spherical field of view, and virtual reality videos are evaluated by applying these weights to the 2D images projected from them. Assuming that the size of the projected 2D image is M×N, the weighted mean square error is computed over all pixels with a per-pixel weight $w(i, j)$. Because WS_PSNR is calculated directly from the projected 2D image, the weight $w(i, j)$ depends on the projection format, and the ERP format has its own weight definition.
Due to the characteristics of virtual reality video, WS_PSNR is used as the objective quality assessment standard. We use WS_PSNR instead of the original PSNR to calculate the BD-rate in video coding.
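To make the WS_PSNR weighting concrete, the following sketch computes it for a single-channel ERP image. It assumes the commonly used ERP weight, the cosine of the latitude of each pixel row, and an 8-bit peak value of 255. The function names `erp_weight` and `ws_psnr` are hypothetical, and this is not the 360Lib code.

```python
import math


def erp_weight(row, height):
    """Latitude weight of an ERP row: cos((row + 0.5 - height/2) * pi / height)."""
    return math.cos((row + 0.5 - height / 2) * math.pi / height)


def ws_psnr(ref, rec, max_value=255):
    """WS-PSNR of two equally sized 2D images stored as lists of rows."""
    height = len(ref)
    weighted_error = weight_sum = 0.0
    for r in range(height):
        w = erp_weight(r, height)  # same weight for every pixel in the row
        for a, b in zip(ref[r], rec[r]):
            weighted_error += w * (a - b) ** 2
            weight_sum += w
    # Weighted mean square error, normalized by the sum of the weights.
    wmse = weighted_error / weight_sum
    if wmse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / wmse)
```

Rows near the poles receive weights close to 0 and rows near the equator weights close to 1, so the same distortion lowers WS_PSNR less when it sits near the top or bottom of the ERP frame than when it sits in the middle.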
The actual encoding time is measured on a workstation with a 3.60-GHz processor and 8 GB of RAM. The anchor uses the "encoder intra main" configuration together with the "encoder 360 ERP" configuration. As shown in Table IV, the proposed algorithm achieves a 37.9% time reduction with a 1.31% BD-rate increase.

V. CONCLUSION
In order to reduce the computational complexity of virtual reality video coding, this work proposes a fast algorithm that speeds up the CU partitioning process based on a KNN classifier. The classifier uses edge information to predict the depth of each LCU and terminates the CU partitioning process early based on this depth. Experimental results show that the proposed algorithm achieves an average time reduction of 37.9% compared to the reference HM-16.16+360lib4.0, with only a 1.31% BD-rate increase.