CRI: CI-SUSTAIN: Collaborative Research: Sustaining Lemur Project Resources for the Long-Term
By Jamie Callan
The project is organized around three types of activities: sustaining software, sustaining datasets, and operation. The project achieves long-term software sustainability by adding support for Indri and Galago functionality and creating integration and migration paths with the open-source Lucene search engine, which has large user and volunteer-developer communities. Research done with Galago or Indri will thus be reproducible in Lucene and more accessible to Lucene's industry users. The project also extends the Galago Application Programming Interface to support the newest developments in neural network (deep learning) document ranking technologies, which are now widely studied and are expected in a state-of-the-art research system. It broadens the utility of Ranklib by supporting neural algorithms for better comparison with high-quality learning-to-rank approaches, and broadens the utility of the Sifaka text mining application with support for additional document and machine learning formats. The older ClueWeb09 and ClueWeb12 datasets are superseded by a new ClueWeb2020 dataset that is designed to last a decade and to support research on newer learning-to-rank and neural network (deep learning) ranking algorithms. The project maintains and operates the existing infrastructure through software maintenance and support, dataset licensing and distribution, and operation of online search services. The new Lemur Project infrastructure supports a broad range of Information Retrieval research, for example, research on retrieval models, training of learned rankers, use of semi-structured knowledge bases, result diversification, query optimization, and distributed search. In particular, it greatly improves support for research on learned and neural (deep learning) ranking algorithms, which have become important research topics in recent years. The ClueWeb datasets are used by a broad human language technologies research community.
This project makes enhancements that sustain this infrastructure for the research community for at least the next decade.