Hello 👋
I’m a French person working in Machine Learning, more specifically NLP and Information Retrieval. I’m currently based in Tokyo, Japan, and intend to remain here for the rest of my time on this planet, should I get the chance.
I’m currently doing R&D at a really cool place named Answer.AI.
Things I’ve Done
In no particular order, and without any claim to exhaustiveness:
Research/Models
Retrieval Models
- JaColBERT (then JaColBERTv2), released in late 2023, which improved the state of the art for Japanese retrieval by over 20 percentage points at the time, and helped kickstart a wave of greatly improved JA embedding models.
- JaColBERTv2.5, released after JaColBERT finally lost its SOTA crown. It’s a thorough study in considerably improving the training of retrievers/late interaction models, which resulted in both the JaColBERTv2.5 model itself (SOTA on JA retrieval with ~130M params) and answerai-colbert-small-v1, a proof-of-~~hubris~~concept 33M-param model competitive with 500M+ param models on English retrieval, with over 2.5M monthly downloads on HF.
- Token Pooling, a shockingly simple clustering-based method which we introduced to address a major issue with multi-vector models: the sheer number of vectors that need to be stored. Without any retraining, we can cut the number of vectors, and thus the storage footprint, by 50 to 66%, without incurring any sort of retrieval quality hit! It’s even been adopted by other teams who showed it works on multimodal retrieval too ☺️ (a quick sketch of the idea follows this list).
- I have been very interested in the importance of synthetic data for retrieval models, as well as how it can be combined with base models using model merging to avoid any sort of catastrophic forgetting. I have some pretty cool proofs of concepts that I need to explore a bit more…
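For the curious, here’s roughly what token pooling boils down to. This is a minimal, illustrative sketch rather than the reference implementation: it assumes `doc_embeddings` is one document’s token vectors as a `(num_tokens, dim)` NumPy array, and `pool_factor` controls how aggressively we compress.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pool_tokens(doc_embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Mean-pool a document's token vectors within clusters of similar tokens."""
    num_tokens = doc_embeddings.shape[0]
    if num_tokens < 3:  # too few tokens to bother clustering
        return doc_embeddings
    target_clusters = max(num_tokens // pool_factor, 1)
    # Hierarchically cluster the token embeddings, then cut the tree into the target number of clusters.
    tree = linkage(doc_embeddings, method="average", metric="cosine")
    labels = fcluster(tree, t=target_clusters, criterion="maxclust")
    pooled = [doc_embeddings[labels == c].mean(axis=0) for c in np.unique(labels)]
    return np.stack(pooled)

# e.g. a (hypothetical) document with 300 token vectors of dim 128, pooled with factor 2:
doc = np.random.randn(300, 128).astype(np.float32)
print(pool_tokens(doc, pool_factor=2).shape)  # -> at most (150, 128)
```

The pooled vectors are then scored exactly like ordinary document vectors at query time (in practice you’d re-normalise them first for late interaction scoring), which is why no retraining is needed.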
Encoders
- ModernBERT, a project which I started and then co-led (with Benjamin Warner and Antoine Chaffin) where we trained a BERT-like model with a much more modern architecture, longer context, and modern data approaches. Proof that there’s still some juice left in encoder models; now go and explore them further!
- It’s all in the [MASK], a quick exploratory study showing that encoders trained on modern data mixes exhibit a certain degree of instruction following, without being autoregressive and with pretty tiny parameter counts.
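If you want to poke at the general idea yourself, the quickest way is the fill-mask pipeline. A tiny sketch (the prompt below is just an illustration of answering through the [MASK] token, not one of the paper’s evaluation tasks):

```python
from transformers import pipeline

# Load ModernBERT as a masked language model and ask it to fill in the blank.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
predictions = fill_mask("The capital of France is [MASK].")
print(predictions[0]["token_str"])  # top predicted token for the masked position
```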
Multimodal
- byaldi, a one-shot mini-toolkit to let people try out ColPali, which at the time was the best approach for multimodal retrieval. You might’ve seen it in the quick intro to multimodal retrieval notebook by Hugging Face’s Merve Noyan, showing how easy it is to retrieve PDFs from textual queries and pass them to your LLM so you can, finally, after all these years, “chat with your PDFs”. (A quick usage sketch follows this list.)
- (No actual research here yet, but it’s of very high interest to me right now, so I’m including it here :).)
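The byaldi flow, roughly (method names and arguments are from memory of the README, and the file path and query are placeholders, so treat the exact identifiers as assumptions):

```python
from byaldi import RAGMultiModalModel

# Load a ColPali-style multimodal retriever.
model = RAGMultiModalModel.from_pretrained("vidore/colpali")
# Index a PDF: each page is stored as an image and embedded.
model.index(input_path="docs/report.pdf", index_name="reports")
# Retrieve the pages that best match a plain-text query.
results = model.search("What was the quarterly revenue?", k=3)
```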
Tools/Software
- RAGatouille, my claim to extremely niche nerdy fame: a toolkit which is in dire need of updates (soon!) but which made ColBERT/late interaction usable through an extremely simple, three-line API (sketched after this list). It greatly helped bridge the gap between research and actual applications, and increased ColBERTv2 downloads from ~50k to ~3M a month 🤯 (but the real genius to credit for this is Omar Khattab, the creator of, among other things, ColBERT!)
- rerankers, a library which makes it extremely easy to use almost all commonly used approaches to re-ranking. Re-ranking is an important step in information retrieval and recommender systems, where a strong model re-orders a list of candidates generated by a weaker one. The problem is that there are a lot of approaches, and they all come with their own implementations, libraries, dependencies and APIs, making experimentation ridiculously time-consuming. rerankers unifies many of these methods behind a single API (also sketched below), matching the results of the research implementations while being very easy to use, and to swap in and out!
- fastkmeans, in collaboration with Benjamin Warner, a mini-library (it does 1 (one) thing) which implements the K-Means algorithm with PyTorch & Triton GPU acceleration while maintaining CPU compatibility. It fixes pretty much all the pain points people have with the dreaded faiss: it has no dependencies except torch and numpy (and optionally triton), it runs ~5x faster than faiss when using the Triton kernel, and it can be (uv) pip installed into just about any environment, rather than requiring compilation from scratch like faiss does.
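Since I keep mentioning RAGatouille’s “three-line API”, here is roughly what it looks like, sketched from memory of the README (the document and query are placeholders):

```python
from ragatouille import RAGPretrainedModel

# Load ColBERTv2, build an index over a (tiny, illustrative) collection, then search it.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
RAG.index(collection=["Hayao Miyazaki co-founded Studio Ghibli in 1985."], index_name="demo")
results = RAG.search("Who founded Studio Ghibli?")
```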
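And the rerankers “single API” in a nutshell, as a hedged sketch (the model-type string and documents are placeholders; the point is that swapping re-ranking approaches is meant to be a one-line change):

```python
from rerankers import Reranker

# Load one re-ranking approach; swapping to another is just a different string here.
ranker = Reranker("cross-encoder")
# Re-order candidate documents for a query with the stronger model.
results = ranker.rank(
    query="What is late interaction retrieval?",
    docs=[
        "ColBERT scores queries and documents token by token.",
        "K-Means clusters vectors into groups.",
    ],
)
```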
Talks
- To the dismay of my great shyness, a lot of people might know me from the talk on the basic components of retrieval (beyond just vector search) that I gave for Hamel Husain’s Mastering LLMs course.
- I’ll also soon be giving a talk at RIKEN AIP about encoder models, specifically about ModernBERT and the avenues I see as future developments for this kind of model, and for small models in general!
In the past
Prior to 2023, I got an MScR in AI at The University of Edinburgh and worked on applied data science at a bunch of companies, over ~5 years. Quick highlights include:
- I worked on legaltech problems at Jus Mundi and LexisNexis, where I contributed to some of the first legal transformers-based language modelling work in French (I know, that’s a lot of qualifiers, but that was really cool back in 2018!), and then put it into production (again, in 2018, this was extremely exciting). I also did some work there that I find pretty interesting, showing that SVMs are remarkably strong baselines for simple legal tasks, and that ULMFiT-like models are pretty good at them, too.
- I explored skill-matching problems at the UK startup Bright Network. For a bit of background: the EU maintains the ESCO skills database, which contains roughly 13k items listing all the skills, both soft and hard, that are considered useful for jobs, and a lot of governmental policies and HR decisions rely on it. Automatically detecting these skills within job postings is an important problem in that space. My team there showed that using LLM-generated synthetic data to train first-stage retrievers, combined with LLM re-ranking, considerably outperformed all previous approaches to automated skill detection. Our approach was then refined and made (much) better by Stanford’s Karel d’Oosterlinck using the DSPy framework, once again highlighting that synthetic data can solve all your problems.
- Another thing about working at startups: I now know more than I ever thought I would about the pricing and naming of GPU instances on various cloud providers…