
Roadmap

This page is somewhat misnamed: RAGatouille doesn't have a set-in-stone roadmap, but rather a set of objectives.

Below, you'll find things that we're hoping to integrate and/or support in upcoming versions (⛰️ denotes a major milestone):

New Features

Synthetic Data Generation
  • Build upon our tutorial 3 and integrate OpenAI query generation into a built-in DataProcessor.
  • Leverage DSPy to perform data augmentation via LLM compiling, reducing the reliance on API providers by enabling locally run models to generate data.
  • ⛰️ Integrate UDAPDR: UDAPDR is an extremely impressive method to adapt retrievers to a target domain via entirely synthetic queries. All you need to provide is your document collection. We're hoping to integrate this in an upcoming version of RAGatouille.
  • Provide a toolkit to generate synthetic passages for provided queries.

Improvements

  • ⛰️ Full ColBERTv2 style training: transparently use an existing cross-encoder teacher model to generate distillation scores and improve model training.
  • Evaluation support: at the moment, RAGatouille doesn't ship with any evaluation metrics, as these are already widely available elsewhere. Future versions of RAGatouille will include some form of evaluation for convenience!
  • Support for more "late-interaction" models, such as Google's SparseEmbed.
  • New negative miners, such as ColBERTMiner (not a huge priority, as dense hard negatives work well enough, but it would be a nice feature for thoroughness).
  • Full LlamaIndex integration
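
To give a rough idea of what the evaluation support mentioned above could cover, here is a minimal sketch of two common retrieval metrics, recall@k and reciprocal rank, in plain Python. The function names, signatures, and document ids are hypothetical and chosen for illustration only; they are not part of RAGatouille's API, and the eventual built-in evaluation may look quite different.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    # Use a set so a document id repeated in the ranking is only counted once.
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy data: these ids are made up for illustration.
ranked = ["d3", "d1", "d7", "d2"]   # retriever output, best match first
relevant = {"d1", "d2"}             # ground-truth relevant documents

print(recall_at_k(ranked, relevant, k=2))
print(reciprocal_rank(ranked, relevant))
```

Averaging `reciprocal_rank` over a set of queries gives the mean reciprocal rank (MRR). In practice, the ranked ids would come from whatever search results your retriever returns for each query.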

Library Upkeep

  • ⛰️ Improve the documentation to cover every component and concept of the library in-depth.
  • Comprehensive test coverage