Implicit Adaptation to Low Rank Structure in Online Learning

This paper concerns the relationship between regret (in online learning) and the rank of an ensemble's loss matrix Y. Recently, several new algorithms have been developed to exploit low rank structure in Y. Unfortunately, none of these is known to be order minimax optimal outside of specialized settings. This paper explores through simulation whether this apparent difficulty in achieving minimax optimality arises because highly specialized algorithms are required. We observe that a horizon-adaptive Hedge algorithm appears to exploit low rank structure effectively, suggesting that algorithms need not work explicitly to exploit low rank structure.


I. INTRODUCTION
Prediction with expert advice is a fundamental problem in online learning. There are T rounds and N experts. In each round t = 1, …, T, the learner chooses a probability vector p_t ∈ Δ_N, where Δ_N denotes the N-simplex, namely the set of all distributions over the N experts. Intuitively, if an expert has the greatest overall performance over the previous rounds, the player can trust this expert more and follow its advice. This strategy is known as Follow the Leader and can be interpreted as a form of empirical risk minimization.
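As a concrete sketch, Follow the Leader can be implemented in a few lines. This is a minimal illustration, not code from the paper; the convention that the loss matrix has N rows (experts) and T columns (rounds) is an assumption carried through the later experiments.

```python
import numpy as np

def follow_the_leader(losses):
    """Follow the Leader: in each round, put all mass on the expert
    with the lowest cumulative loss so far (ties broken by index).

    losses: (N, T) array; losses[i, t] is expert i's loss in round t.
    Returns the learner's total loss over all T rounds.
    """
    N, T = losses.shape
    cum = np.zeros(N)                 # cumulative loss of each expert
    total = 0.0
    for t in range(T):
        leader = int(np.argmin(cum))  # follow the current best expert
        total += losses[leader, t]
        cum += losses[:, t]           # update cumulative losses after the round
    return total
```

On a matrix where one expert dominates throughout, FTL locks onto that expert immediately and incurs its loss.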
As described in [1], there are two types of settings in online learning: stochastic and adversarial. In the former, each loss vector ℓ_t is generated IID from a fixed distribution, and the learner seeks to control the expected regret. In the latter, ℓ_t can depend arbitrarily on the history of play. It is the latter, so-called adversarial, setting that is less well understood and the focus of this paper.

A. Algorithms
Many algorithms handle adversarial data, including Follow the Regularized Leader (FTRL), Exponentially Weighted Averaging (EWA, also known as Hedge), and Online Mirror Descent (OMD), though the latter two can be considered special cases of the former.
Manuscript received January 5, 2020; revised December 7, 2020. Weiqi Yang is with the University of Science and Technology of China, China (e-mail: 976237481@qq.com).
EWA is named and defined in [2]. Its fundamental idea is to weight experts based on their performance: if an expert has relatively high loss, it receives an exponentially decreasing proportion of the weight. The update rule is as follows: with learning rate η > 0, the weight placed on expert i in round t is p_{i,t} ∝ exp(−η L_{i,t−1}), where L_{i,t−1} = Σ_{s<t} ℓ_{i,s} is expert i's cumulative loss before round t. This algorithm can be modified in many ways. One modification is the doubling trick, in which the learning rate is adapted over exponentially increasing time epochs [2]. Consequently, T need not be known in advance. We consider another horizon-adaptive algorithm in Section II.
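The exponential-weighting update above can be sketched directly. This is a standard fixed-rate Hedge implementation, not the paper's code; subtracting the minimum cumulative loss before exponentiating is a numerical-stability choice that leaves the normalized weights unchanged.

```python
import numpy as np

def hedge(losses, eta):
    """EWA/Hedge with a fixed learning rate eta.

    losses: (N, T) array of per-round expert losses in [0, 1].
    Returns the learner's total (expected) loss sum_t <p_t, l_t>.
    """
    N, T = losses.shape
    cum = np.zeros(N)
    total = 0.0
    for t in range(T):
        # Weights decay exponentially in cumulative loss; shifting by
        # cum.min() avoids underflow without changing the distribution.
        w = np.exp(-eta * (cum - cum.min()))
        p = w / w.sum()
        total += p @ losses[:, t]
        cum += losses[:, t]
    return total
```

With a large η and one clearly bad expert, nearly all weight moves to the good expert after the first round.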
FTRL modifies Follow the Leader by adding a regularization term R(p) [3], which provides stability to the algorithm and helps ensure it does not overfit. The standard form of FTRL applied to the prediction with expert advice problem is as follows: p_t = argmin_{p ∈ Δ_N} ⟨p, Σ_{s<t} ℓ_s⟩ + R(p). OMD is a transformation of FTRL, and its update rule is naturally interpreted in terms of a gradient operation.
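One way to see the connection between FTRL and EWA: with the negative-entropy regularizer R(p) = (1/η) Σ_i p_i ln p_i, the FTRL minimizer over the simplex has the closed form of the EWA weights. The sketch below verifies this numerically (an illustration of the standard fact, not code from the paper).

```python
import numpy as np

def ftrl_entropic(cum_loss, eta):
    """FTRL step with negative-entropy regularizer: the minimizer of
    <p, L> + (1/eta) sum_i p_i ln p_i over the simplex is the softmax
    of -eta * L, i.e. exactly the EWA weights."""
    w = np.exp(-eta * (cum_loss - cum_loss.min()))
    return w / w.sum()

def ftrl_objective(p, cum_loss, eta):
    """The FTRL objective <p, L> + (1/eta) sum_i p_i ln p_i."""
    return float(p @ cum_loss + (p @ np.log(p)) / eta)
```

Perturbing the closed-form solution toward any other point of the simplex can only increase the objective, since it is strictly convex.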

B. Online Learning with Low Rank Experts
In more recent studies, researchers have begun to focus on parameters with smaller values, rather than depending on the number of experts, in order to make algorithms more efficient. In particular, [3] proposes the Low-Rank Experts setting, which assumes there is a low dimensional (r ≪ N) embedding of the loss matrix. This idea springs from the popular matrix factorization technique proposed by [4], in which the researchers assume that a large rating matrix can be projected into a lower dimensional embedding.
Low rank assumptions are ubiquitous in data science; one of the most famous techniques is matrix completion, proposed by [5] and extended by [6]. [3] proposes a type of OMD: if the embedding is known in advance, it achieves a regret upper bound of 8√(rT). They also show that, in the stochastic case, Follow the Leader achieves a bound in terms of the ε-rank of the loss matrix. The ε-rank of Y is defined as the smallest rank of any matrix Y′ with max_{i,t} |Y_{i,t} − Y′_{i,t}| ≤ ε. They proved a regret bound related to the ε-rank in the stochastic case, but not the adversarial case. [3] then poses the open problem of whether an upper bound of O(√(rT)) holds in the adversarial setting.
Michael Spece is with Carnegie Mellon University, USA.
Ref. [7] proposes Online Lazy Newton, which achieves a regret bound of r√T log(T); the authors show that their method is invariant to affine transformations.
Ref. [8] treats the problem from a different perspective: they consider an infinite number of experts and modify EWA. Nevertheless, when they apply their algorithm to the low rank experts problem, they do not obtain a bound in terms of the rank; instead, they address another open problem in the stochastic setting, stated in terms of an ε-rank of the embedding.
Ref. [9] treats the problem from a wholly new perspective. They assume that the loss matrix structure in hindsight is an additive combination of low rank spaces and other spaces. Under their setup with noisy low rank experts, they achieve a regret bound of √(2(16 + r)T). But even under their setup, this bound is suboptimal.
Ref. [10] studies low rank online learning in the supervised setting for real-valued predictions and achieves a regret of O(√T) for a fixed ambient dimension. However, that paper does not show that this bound carries over to the setting of [3].

C. Restart Approaches
Restart approaches have been studied in various areas of online learning for years. One of the most famous applications is the adaptive regret for fixed share given by [11], which uses a restart approach for prediction under mix-loss by implementing the Follow the Leading History algorithm presented in [12]. Ref. [13] applies the restart approach to EWA under branching experts.
Ref. [8] also applies the restart approach to EWA so that both the number of effective experts and the step-size parameter can be set dynamically.
Ref. [1] provides the foregoing review of low rank learning and identifies various settings, including sub-settings of [3], where O(√(rT)) is achievable.

II. LEARNING ALGORITHMS UNDER STUDY
The first algorithm (Algorithm 1) we consider is a horizon-adaptive version of EWA, which also appears in [4]:
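A horizon-adaptive Hedge can be sketched as follows. The paper's display of Algorithm 1 is not reproduced in this text, so the specific schedule η_t = √(8 ln N / t) below is an assumption (a standard anytime choice); the point is that the rate shrinks with t, so T need not be known in advance.

```python
import numpy as np

def anytime_hedge(losses):
    """Horizon-adaptive Hedge sketch: learning rate eta_t = sqrt(8 ln N / t)
    decays over rounds, so the horizon T is never used to set parameters.
    (This schedule is an assumption, not necessarily Algorithm 1's exact one.)

    losses: (N, T) array; returns the learner's total loss.
    """
    N, T = losses.shape
    cum = np.zeros(N)
    total = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(8.0 * np.log(N) / t)       # time-varying rate
        w = np.exp(-eta * (cum - cum.min()))     # stable exponential weights
        p = w / w.sum()
        total += p @ losses[:, t - 1]
        cum += losses[:, t - 1]
    return total
```

Against a constantly bad expert, the total loss stays close to the best expert's (here 0), even though T was never supplied.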

III. SIMULATION
In the simulations, each component of the loss vector follows a distribution D.
In the following experiments, we fix N = 20 and T = 20 unless otherwise specified.

A. How to Generate the Loss Matrix
Algorithm 3: Generate an N × T loss matrix of a given low rank r (relative to min(N, T)).
Data: a distribution D and a target rank r.
Result: a loss matrix of rank r.
1. Generate three matrices U, S, V: U is an N × min(N, T) matrix whose entries are drawn from D; V is a min(N, T) × T matrix whose entries are drawn from D; S is a min(N, T) × min(N, T) diagonal matrix whose first r diagonal entries are 1 and whose remaining entries are 0.
2. Set Y = U × S × V. If an element of Y is greater than 1, set it to 1; if an element of Y is smaller than 0, set it to 0.
3. If Y's rank equals r, use it as the loss matrix; otherwise, discard it.
However, Algorithm 3 is inefficient for generating matrices of higher rank. For higher rank (relative to min(N, T)), S is close to the min(N, T)-dimensional identity matrix, so U × S × V is close to U × V, and its elements easily exceed 1. After clipping, Y is then close to a matrix most of whose elements are 1, with rank lower than expected. Therefore, only a very small percentage of trials meet the requirement that the rank equal r, and it is very hard to approach the true worst regrets at high rank with Algorithm 3.
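The three steps of Algorithm 3 can be sketched as rejection sampling (an illustration of the procedure as described, with D taken to be the Bernoulli(0.5) distribution used in the experiments; `max_tries` is an added safeguard, not part of the paper's description):

```python
import numpy as np

def gen_low_rank(N, T, r, rng, max_tries=1000):
    """Algorithm 3 sketch: Y = U @ S @ V with U (N x m) and V (m x T),
    m = min(N, T), entries drawn Bernoulli(0.5); S diagonal with r ones;
    clip to [0, 1]; accept only if rank(Y) == r exactly."""
    m = min(N, T)
    for _ in range(max_tries):
        U = rng.integers(0, 2, size=(N, m)).astype(float)
        V = rng.integers(0, 2, size=(m, T)).astype(float)
        S = np.diag([1.0] * r + [0.0] * (m - r))
        Y = np.clip(U @ S @ V, 0.0, 1.0)   # step 2: clip losses into [0, 1]
        if np.linalg.matrix_rank(Y) == r:  # step 3: rejection on exact rank
            return Y
    return None                            # acceptance rate too low at this rank
```

For small r the acceptance rate is high (e.g. r = 1 fails only when a sampled column or row is all zeros), matching the text's observation that the method degrades as r approaches min(N, T).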
So we consider another way to generate the loss matrices (Algorithm 4):
Algorithm 4: Generate an N × T loss matrix of a given (high) rank r.
Data: a distribution D and a target rank r.
1. Generate a matrix Y of N rows and T columns whose entries are drawn from D.
2. If Y's rank equals r, use it as the loss matrix; otherwise, discard it.
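Algorithm 4 is a simpler rejection sampler over the whole matrix (again a sketch with D = Bernoulli(0.5) and an added `max_tries` safeguard):

```python
import numpy as np

def gen_full_matrix(N, T, r, rng, max_tries=1000):
    """Algorithm 4 sketch: draw the whole N x T matrix entrywise from
    Bernoulli(0.5) and accept only if its rank is exactly r. Efficient
    when r is close to min(N, T), inefficient when r is small."""
    for _ in range(max_tries):
        Y = rng.integers(0, 2, size=(N, T)).astype(float)
        if np.linalg.matrix_rank(Y) == r:
            return Y
    return None
```

A random square 0/1 matrix is full rank with substantial probability, so requesting r = min(N, T) succeeds quickly; requesting small r almost always rejects, which is the disadvantage discussed next.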
Algorithm 4 also has a disadvantage: it is very hard to obtain matrices of low rank (relative to min(N, T)) this way, because a rank-r matrix must have (N − r) linearly dependent rows and (T − r) linearly dependent columns, which is rare among random draws when N and T are much larger than r.
Fig. 1 depicts the relationship between the worst regret and the rank r of the loss matrices. In Experiment B, we use the traditional learning algorithm (Algorithm 1) to compute the regrets.
In Experiment B, we fix N = T = 20. We use Algorithm 3 with D = binomial distribution (p = 0.5) to generate the loss matrices of rank r when r ≤ 17, and Algorithm 4 with the same distribution when r ≥ 18. For each rank r ∈ {1, …, min(N, T)}, we run 100000 independent trials.
From Fig. 1, we can draw three results: 1. The worst regret can decrease when the rank increases. 2. Although the curve fluctuates at some points, in general it has the same shape as the curve y = 2√r.
3. The curve declines rapidly once the rank becomes relatively high.
Regarding Result 1, we expected the worst regret to keep increasing as the rank increases, but our finding is inconsistent with that expectation. To check whether this is merely because the number of trials is too small to estimate the worst regrets accurately, we run Experiment C.
Regarding Result 2, we expected the relationship between the worst regret and the loss matrix's rank to follow the function y = c × √r (c a constant), and our finding is consistent with this assumption.
Regarding Result 3, we believe this anomalous decline is an artifact of how we generate the loss matrices. Since we use Algorithm 3 when r ≤ 17 and Algorithm 4 when r ≥ 18, it is reasonable that r = 17 is a minimum point: as r (≤ 17) approaches 17, it becomes harder and harder for Algorithm 3 to generate matrices of rank r, so the estimated regret drifts below the accurate worst regret; and as r (≥ 18) approaches 20, it becomes easier and easier for Algorithm 4 to generate matrices of rank r, so the estimate approaches the accurate regret. As a result, r = 17 is a minimum point.

C. The Accurate Worst Regret for Small N, T
In Experiment C, we again use the traditional learning algorithm (Algorithm 1) to compute the regrets, and we fix N = T = 5. This time, we generate all possible matrices with 5 rows and 5 columns whose elements are chosen from the set {0, 1}. In this case, we can obtain the accurate worst regret. The following table reports the results. From the table, we see that the worst regret of the rank-4 matrices is larger than that of the rank-5 matrices. Although the difference is not large, it shows that the worst regret can decrease when the rank of the loss matrices increases.
So Result 1 in Experiment B (the worst regret can decrease when the rank increases) is reasonable.
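Experiment C's exhaustive enumeration can be sketched at a smaller size for speed (3 × 3 here, 2^9 = 512 matrices, versus the paper's 5 × 5; the anytime-rate schedule η_t = √(8 ln N / t) inside the learner is an assumption, as Algorithm 1's exact schedule is not reproduced in this text):

```python
import itertools
import numpy as np

def anytime_hedge_regret(Y):
    """Regret of a horizon-adaptive Hedge (assumed rate sqrt(8 ln N / t))
    against the best single expert on loss matrix Y (N x T)."""
    N, T = Y.shape
    cum = np.zeros(N)
    total = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(8.0 * np.log(N) / t)
        w = np.exp(-eta * (cum - cum.min()))
        p = w / w.sum()
        total += p @ Y[:, t - 1]
        cum += Y[:, t - 1]
    return total - cum.min()

def worst_regret_by_rank(n=3):
    """Enumerate every n x n 0/1 loss matrix and record, for each rank,
    the largest regret observed: the exact worst case at this size."""
    worst = {}
    for bits in itertools.product([0.0, 1.0], repeat=n * n):
        Y = np.array(bits).reshape(n, n)
        r = int(np.linalg.matrix_rank(Y))
        worst[r] = max(worst.get(r, 0.0), anytime_hedge_regret(Y))
    return worst
```

The rank-0 class contains only the zero matrix, so its worst regret is exactly zero, while every positive rank contributes some matrix with strictly positive regret.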

D. N = T = 20 and Distribution D = Uniform Distribution on [0, 1]
Fig. 3 depicts the relationship between the worst regret and the rank r of the loss matrices. In Experiment D, we again use the traditional learning algorithm (Algorithm 1) to compute the regrets. Comparing Fig. 3 with Fig. 1, we find that the worst regrets in Fig. 3 are much lower than the corresponding worst regrets in Fig. 1. We attribute this to the different distributions.
When the elements of the loss matrix follow the binomial distribution, they can take only two values, so it is easier to obtain linearly dependent rows or columns. It is therefore easier to generate loss matrices of a given rank r, which means the simulated results are close to the accurate worst regret.
But when the elements of the loss matrix follow the uniform distribution, they can take infinitely many values. As a result, it is very hard to generate loss matrices of a given rank r, and therefore harder to approach the accurate worst regret at rank r.
For this reason, our subsequent experiments focus on the binomial distribution.

E. Larger N and T (Also Binomial Distribution)
Fig. 4 depicts the relationship between the worst regret and the rank r of the loss matrices. In Experiment E, we again use the traditional learning algorithm (Algorithm 1) to compute the regrets.
In Experiment E, we fix N = T = 50. We use Algorithm 3 with D = binomial distribution (p = 0.5) to generate the loss matrices of rank r when r ≤ 42, and Algorithm 4 with the same distribution when r ≥ 43. For each rank r ∈ {1, …, min(N, T)}, we run 100000 independent trials.
Although the regrets may not be close to the accurate worst regrets at some points (40 ≤ r ≤ 45), we can still see the general shape of the curve.
From Fig. 4, we find that the three results from Experiment B hold here as well, which further indicates that the findings in Experiment B apply not only to particular N and T but also to larger N and T.
The anomalous points (40 ≤ r ≤ 45) are again an artifact of how we generate the loss matrices. Since we use Algorithm 3 when r ≤ 42 and Algorithm 4 when r ≥ 43, it is reasonable that the points near r = 42 are minimum points: as r (≤ 42) approaches 42, it becomes harder and harder for Algorithm 3 to generate matrices of rank r, so the estimated regret drifts below the accurate worst regret; and as r (≥ 43) approaches 50, it becomes easier and easier for Algorithm 4 to generate matrices of rank r, so the estimate approaches the accurate regret. As a result, these points (40 ≤ r ≤ 45) fall well below the accurate worst regrets.

F. Use Another Algorithm
In the following experiments, we replace the traditional learning algorithm (Algorithm 1) with another learning algorithm (Algorithm 2), which is explicitly designed to exploit low rank structure. In Algorithm 2, we use only those experts that have distinct cumulative losses when computing the regrets. In this way, we expect this learning algorithm to compute the regrets more efficiently, and we want to see whether the two learning algorithms produce different results.
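Algorithm 2 is described only as restricting attention to experts with distinct cumulative losses, so the sketch below fills in the details under that assumption: in each round, one representative is kept per distinct cumulative-loss value, and the (assumed) horizon-adaptive Hedge weights are computed over those representatives only.

```python
import numpy as np

def dedup_hedge_regret(losses):
    """Sketch of Algorithm 2 (details are our assumption): collapse experts
    sharing a cumulative loss to one representative each round, then run
    horizon-adaptive Hedge over the k <= N representatives.

    losses: (N, T) array; returns regret against the best single expert.
    """
    N, T = losses.shape
    cum = np.zeros(N)
    total = 0.0
    for t in range(1, T + 1):
        # Indices of one expert per distinct cumulative-loss value
        # (rounding guards against floating-point ties breaking apart).
        _, reps = np.unique(np.round(cum, 12), return_index=True)
        k = len(reps)
        eta = np.sqrt(8.0 * np.log(max(k, 2)) / t)  # rate uses effective count k
        w = np.exp(-eta * (cum[reps] - cum[reps].min()))
        p = w / w.sum()
        total += p @ losses[reps, t - 1]
        cum += losses[:, t - 1]
    return total - cum.min()
```

When all experts are identical there is a single representative throughout, so the learner matches the best expert exactly and the regret is zero.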

1) Use another algorithm to repeat the Experiment B
In Experiment a, we use the other learning algorithm (Algorithm 2) to repeat Experiment B. We again fix N = T = 20, and we use Algorithm 3 with D = binomial distribution (p = 0.5) to generate the loss matrices of rank r when r ≤ 17, and Algorithm 4 with the same distribution when r ≥ 18. For each rank r ∈ {1, …, min(N, T)}, we run 100000 independent trials.
We show the curve of the traditional algorithm (Algorithm 1) and the curve of the other algorithm (Algorithm 2) in the same figure.
From Fig. 5, we find that in Experiment B the two algorithms produce almost identical results.
2) Use another algorithm to repeat Experiment D
In Experiment b, we use the other learning algorithm (Algorithm 2) to repeat Experiment D. We show the curve of the traditional algorithm (Algorithm 1) and the curve of the other algorithm (Algorithm 2) in the same figure. From Fig. 6, we find that in Experiment D the two algorithms produce almost identical results.
3) Use another algorithm to repeat Experiment E
In Experiment c, we use the other learning algorithm (Algorithm 2) to repeat Experiment E. We again fix N = T = 50, and we use Algorithm 3 with D = binomial distribution (p = 0.5) to generate the loss matrices of rank r when r ≤ 42, and Algorithm 4 with the same distribution when r ≥ 43. For each rank r ∈ {1, …, min(N, T)}, we run 100000 independent trials. We show the curve of the traditional algorithm (Algorithm 1) and the curve of the other algorithm (Algorithm 2) in the same figure.
From Fig. 7, we find that in Experiment E the two algorithms produce almost identical results.
In conclusion, the new algorithm (Algorithm 2) produces almost the same results as the traditional algorithm (Algorithm 1) when used to compute the regrets of low rank loss matrices.

IV. CONCLUSION
We have shown that a traditional algorithm exploits low rank structure effectively and comparably to an algorithm explicitly designed to exploit low rank structure. This suggests that the difficulty of proving sharp upper bounds lies not so much in algorithm design as in uncovering how algorithms take advantage of coincidental structure. Future research might examine existing algorithms in more depth. In particular, given that our results are experimental, theoretical justification or other independent verification of them is warranted.
[12] E. Hazan
Michael Spece is the chief AI / data scientist at AllocateRite, New York City, and researches online learning and the foundations of data science. Dr. Spece was offered the NSF fellowship in artificial intelligence and the NDSEG fellowship in mathematics, and received an NSF honorable mention in economics.