In recent years, the information retrieval (IR) field has experienced a paradigm shift in the application of machine learning techniques to achieve effective ranking models. A few years ago,we were using hill-climbing optimisation techniques such as simulated annealing to optimise the parameters in weighting models, such as BM25 or PL2, or latterly BM25F or PL2F. Instead, driven first by commercial search engines, IR is increasingly adopting a feature-based approach, where various mini-hypothesis are represented as numerical features, and learning to rank techniques are deployed to decide their importance in the final ranking formulae.
The typical approach for ranking is described in the following figure from our recently presented WSDM 2013 paper:
Phases of a retrieval system deploying learning to rank, taken from Tonellotto et al, WSDM 2013.
In particular, there are typically three phases:
- Top K Retrieval, where a number of top-ranked documents are identified, which is known as the sample.
- Feature Extraction - various features are calculated for each of the sample documents.
- Learned Model Application - the learned model obtained from a learning to rank technique re-ranks the sample documents to better satisfy the user.
The set of top K documents selected within the first retrieval phase is called the sample by Liu, even though the selected documents are not iid. Indeed, in selecting the sample, Liu suggested that the top K documents ranked by a simple weighting model such as BM25 is not the best, but is sufficient for effective learning to rank. However, the size of the sample - i.e. the number of documents to be re-ranked by the learned model - is an important parameter: with less documents, the first pass retrieval can be made more efficient by the use of dynamic pruning strategies (e.g. WAND); on the other hand, too few documents may result in insufficient relevant documents being retrieved, and hence effectiveness being degraded.
Our article The Whens and Hows of Learning to Rank in the Information Retrieval Journal studied the sample size parameter for many topic sets and learning to rank techniques - for the mixed information needs on the TREC ClueWeb09 collection, we found that while a sample size of 20 documents was sufficient for effective performance according to ERR@20, larger sample sizes of thousands of documents were needed for effective NDCG@20; for navigational information needs, predominantly larger samples sizes (upto 5000 documents) were needed; Moreover, the particular document representations that used to identify the sample was shown to have an impact on effectiveness - indeed, navigational queries were found to be considerably easier (requiring smaller samples) when anchor text was used, but for informational queries, the opposite was observed. In the article, we examined these issues in detail, across a number of test collections and learning to rank techniques, as well as investigating the role of the evaluation measure and its rank cutoff for listwise techniques - for in depth details and conclusions, see the IR Journal article.