Hello 👋

I’m a French person working in Machine Learning, more specifically NLP and Information Retrieval. I’m based in Tokyo, Japan, and intend to stay for the rest of my time on this planet, should I get the chance.

I work at Mixedbread, where we’re trying to solve information retrieval.

Open Things I’ve Done

Nowadays, my time is split between Mixedbread’s internal models, which are not publicly released, and open science, as we’re keen to contribute to the IR ecosystem. Below, you may find open projects/publications, presented in no particular order; the list is not exhaustive.

Research/Models
Retrieval
  • JaColBERT (then JaColBERTv2), released in late 2023, which improved the state of the art for Japanese retrieval by over 20 percentage points at the time, and helped kickstart a wave of greatly improved JA embedding models.
  • JaColBERTv2.5, released after JaColBERT finally lost its SOTA crown. It’s a thorough study in considerably improving the training of retrievers/late interaction models, which resulted in both the JaColBERTv2.5 model itself (SOTA on JA retrieval with ~130M params) and answerai-small-colbert-v2, a proof-of-~hubris~concept 33M-param model competitive with 500M+ models on English retrieval, with over 2.5M monthly downloads on HF.
  • Token Pooling, a shockingly simple clustering-based method which we introduced to address a major issue with multi-vector models: the sheer number of vectors that need to be stored. Without any retraining, we can achieve a 50-to-66% reduction in the number of vectors needed, and thus the storage footprint, without incurring any sort of retrieval quality hit! It’s even been adopted by other teams, who showed it works on multimodal retrieval too ☺️
  • I have been very interested in the importance of synthetic data for retrieval models, as well as how it can be combined with base models using model merging to avoid any sort of catastrophic forgetting. I have some pretty cool proofs of concept that I need to explore a bit more…
  • I love figuring out what makes the models work, so we can improve them. One such example is our recent work on ColBERT’s projection layers, which highlighted that MaxSim has pretty cool learning properties and that ColBERT models still have a lot of free lunches on the table.
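The two recurring ideas above, MaxSim scoring and token pooling, can be sketched in a few lines of numpy. This is a toy illustration under my own assumptions, not the published implementation: the real method uses hierarchical clustering over trained ColBERT embeddings, whereas here a tiny k-means stands in for the clustering step, and all the names are mine.

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) scoring: each query token takes the
    cosine similarity of its best-matching document token; per-token
    maxima are summed into a single document score."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return (q @ d.T).max(axis=1).sum()

def token_pool(doc_vecs, pool_factor=2, n_iters=10, seed=0):
    """Shrink a document's token vectors by clustering them and
    mean-pooling each cluster (a k-means stand-in for the hierarchical
    clustering the actual method uses). pool_factor=2 halves storage."""
    rng = np.random.default_rng(seed)
    k = max(1, len(doc_vecs) // pool_factor)
    centers = doc_vecs[rng.choice(len(doc_vecs), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each vector to its nearest center, then recompute means
        assign = np.argmin(
            ((doc_vecs[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1
        )
        for c in range(k):
            members = doc_vecs[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers

# toy demo: pooling halves the stored vectors; scores stay comparable
rng = np.random.default_rng(42)
doc = rng.normal(size=(32, 8))
query = doc[:4] + 0.01 * rng.normal(size=(4, 8))  # queries near doc tokens
full = maxsim_score(query, doc)
pooled = maxsim_score(query, token_pool(doc, pool_factor=2))
print(token_pool(doc, 2).shape, full, pooled)
```

With near-duplicate query tokens, the full MaxSim score sits just under the number of query tokens (each max similarity is ~1.0), and the pooled score stays in the same ballpark with half the vectors stored.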
Encoders
  • ModernBERT, a project which I started and then co-led (with Benjamin Warner and Antoine Chaffin) where we trained a BERT-like model with a much more modern architecture, longer context, and using modern data approaches. A proof that there’s still some juice left in encoder models, now go and further explore them!
  • It’s all in the [MASK], a quick exploratory study showing that encoders trained on modern data mixes exhibit a certain degree of instruction following w/o being autoregressive and with pretty tiny parameter counts.
Multimodal
  • I largely work on multimodal retrieval as part of Mixedbread’s research effort. It has produced some pretty cool demos, such as our semantic art search.
Tools/Software
  • RAGatouille, my claim to extremely niche nerdy fame, a toolkit which is in dire need of updates (soon!) but which made ColBERT/late interaction usable with an extremely simple, three-line API. It greatly helped bridge the gap between research and actual applications, and increased ColBERTv2 downloads from ~50k to ~3M a month 🤯 (but the real genius to credit for this is Omar Khattab, the creator of, among other things, ColBERT!)
  • byaldi, a one-shot mini-toolkit to let people try out ColPali, which at the time was the best approach to multimodal retrieval. You might’ve seen it in Hugging Face’s Merve Noyan’s quick intro to multimodal retrieval notebook, showing how easy it is to retrieve PDFs from textual queries and pass them to your LLM so you can, finally, after all these years, “chat with your PDFs”.
  • rerankers, a library which makes it extremely easy to use almost all commonly used approaches to reranking. Re-ranking is an important step in information retrieval and recommender systems, where we use a strong model to rank a list of candidates generated by a weaker model. The problem is that there are a lot of approaches, and they all come with their own implementations, libraries, dependencies and APIs, making experimentation ridiculously time-consuming. rerankers unifies these methods behind a single API, matching the results of the research implementations, but in a very easy-to-use (and swap-in/out) way!
  • fastkmeans, in collaboration with Benjamin Warner, a mini-library (it does 1 (one) thing) which implements the k-means algorithm with PyTorch & Triton GPU acceleration while maintaining CPU compatibility. It fixes pretty much all the pain points people have with the dreaded faiss: it has no dependencies except torch and numpy (and optionally triton), it runs ~5x faster than faiss when using the Triton kernel, and it can be (uv) pip installed into just about any environment, rather than requiring compilation from scratch like faiss does.
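The two-stage retrieve-then-rerank pattern that rerankers streamlines looks roughly like this. To be clear, none of this is the rerankers API: the bag-of-words “retriever” and word-overlap “scorer” below are deliberately toy stand-ins for a bi-encoder and a cross-encoder, just to show why the two stages exist (a cheap scorer over everything, a strong scorer over a short list).

```python
import numpy as np

def bow_vector(text, vocab):
    """Toy 'embedding': bag-of-words counts (stand-in for a bi-encoder)."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1
    return v

def first_stage(query_vec, doc_vecs, k=10):
    """Fast candidate generation: dot-product similarity, top-k doc ids."""
    scores = doc_vecs @ query_vec
    return np.argsort(scores)[::-1][:k]

def rerank(query, docs, candidate_ids, scorer):
    """Re-score only the short candidate list with a stronger (slower)
    model, and return ids sorted by the new scores."""
    scored = [(i, scorer(query, docs[i])) for i in candidate_ids]
    return [i for i, _ in sorted(scored, key=lambda t: t[1], reverse=True)]

def overlap_score(query, doc):
    """Toy 'cross-encoder': word-overlap count (stand-in for a reranker)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["cats purr when happy", "dogs bark loudly", "cats and dogs play"]
vocab = {w: i for i, w in
         enumerate(sorted({w for d in docs for w in d.lower().split()}))}
doc_vecs = np.stack([bow_vector(d, vocab) for d in docs])

query = "do cats purr"
candidates = first_stage(bow_vector(query, vocab), doc_vecs, k=2)
ranking = rerank(query, docs, candidates, overlap_score)
print(ranking)  # → [0, 2]: "cats purr when happy" wins
```

Swapping the reranker then just means swapping `scorer`, which is essentially the convenience a unified reranking API buys you.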
Talks

Academic Volunteering

I am not, currently, an academic, although I am affiliated with the National Institute of Informatics (NII) in Tokyo, under Professor Makoto P. Kato and the Kasys lab.

I think that it’s important that we nurture an open IR community to see the same kind of growth experienced by NLP/Language Modeling. I’m the main organizer of the Late Interaction Workshop series at IR conferences, and currently serve on the program committees of CIKM and ECIR.

In the distant past

Prior to 2023, I got an MScR in AI at The University of Edinburgh, then worked on applied data science at a bunch of companies over ~5 years, oftentimes in a consulting fashion. There’s not a lot of public-facing work there, but quick highlights include:

  • I worked on legaltech problems at Jus Mundi and LexisNexis, where I contributed to some of the first legal transformers-based language modelling work in French (I know, that’s a lot of qualifiers, but that was really cool back in 2018!), and then put it into production (again, in 2018, this was extremely exciting). I also did some work there that I find pretty interesting, showing that SVMs are remarkably strong baselines for simple legal tasks and that ULMFiT-like models are pretty good at them, too.
  • I explored skill-matching problems at UK startup Bright Network. For a bit of background, the EU maintains the ESCO skills database, which contains roughly ~13k items listing all the skills, both soft and hard, that are considered useful for jobs; a lot of governmental policies and HR decisions use this database. Automated detection of skills within job postings is an important problem in that space. My team there showed that using LLM-generated synthetic data to train first-stage retrievers, combined with LLM reranking, considerably outperformed all previous approaches to automated skills detection. Our approach was then refined and made (much) better by Stanford’s Karel d’Oosterlinck, using the DSPy framework, once again highlighting that synthetic data can solve all your problems.
  • A side effect of working with startups is that I now know more than I ever thought I would about the pricing and naming of GPU instances on various cloud providers…