Abstract—Learning-to-rank (LtR) is a framework that spans from training set generation from unlabelled data to learning a model to runtime efficiency. While a good number of learning-to-rank algorithms have been developed over the past decade, “out-of-the-box” topics such as investigation into the properties of training sets have not been given much attention to date. In this article we investigate the imbalanced nature of LtR training sets, which generally contain very few relevant documents as compared to the non-relevant ones. Since the relevant documents are rare in a document collection, the need to include in the training set as many relevant documents as possible is well-understood. However, the lower bound of the number of non-relevant documents needed to learn a good ranking function is still largely uninvestigated. This article aims at addressing this question with a special focus on random forest based LtR algorithms. We employ both random and deterministic undersampling techniques to reduce the number of non-relevant documents. Minimization of training set size reduces the computational complexity (i.e., learning time) which is an important factor, especially for large scale LtR. Extensive experiments on Letor benchmark datasets reveal that in many cases the performance of a LtR algorithm trained on a much smaller training set (in terms of presence of non-relevant documents) remains largely similar to that of a model learnt from the original training set. This investigation thus suggests that for large scale LtR tasks, we can leverage undersampling techniques to reduce training time with oftentimes negligible effect on predictive accuracy. We further examine the potential benefit of random forest based LtR algorithms in the context of undersampling which demonstrates that the inherent structure of a random forest is conducive to using undersampling, and thereby allows for even more scalability.
Index Terms—Learning-to-rank, undersampling, random forests, scalability, imbalanced data.
Muhammad Ibrahim is with the Department of Computer Science and Engineering, University of Dhaka, Dhaka-1000, Bangladesh (e-mail: firstname.lastname@example.org).
Cite: Muhammad Ibrahim, "Sampling Non-relevant Documents of Training Sets for Learning-to-Rank Algorithms," International Journal of Machine Learning and Computing vol. 10, no. 3, pp. 406-415, 2020.Copyright © 2020 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).