
A Japanese version of this post is coming soon (I’m still learning Japanese, so please forgive any mistakes!)

Welcome to this stream-of-thoughts report on fio-base-japanese-v0.1, the first version of Fio, a family of Japanese sentence embedding models. These are notes I took while training the models, loosely ordered into relevant categories! I hope they can be useful to anyone interested in Japanese embeddings.

In short:

  • Fio-v0.1 is currently (18/12/2023) the best sentence embedding model for Japanese similarity tasks, as well as the best overall monolingual model (similarity + retrieval). On retrieval tasks, it is still beaten by the strongest multilingual models.
  • Fio-v0.1 is trained on relatively little data.
  • Fio is based on cl-tohoku/bert-base-japanese-v3.
  • Fio-v0.1 is trained on a single GPU.
  • A lot of the performance comes from relatively simple but important small steps.
  • Fio-v0.1 could likely perform much better with more compute, as scaling trends in both the data and the literature show a lot of margin for improvement.
  • Data augmentation using LLMs is very promising, and is very, very far from being fully explored in Fio-v0.1.

tl;dr: Fio v0.1 is a proof of concept. It uses little data/compute and leverages very few of the data augmentation techniques that could be used. The scaling observed in training indicates a good chance that, with more compute, Fio could match its similarity performance on retrieval tasks. Fio’s retrieval performance increases considerably at higher recall cut-offs, and its light weight could make it a great first-stage model to use before a reranker.

Results

Main Results

⚠️ WARNING: fio-base-japanese-v0.1 has seen textual entailment tasks during its training, which is not the case for the other Japanese-only models in this table. This gives Fio an unfair advantage over the previous best results, cl-nagoya/sup-simcse-ja-[base|large]. During mid-training evaluations, this didn’t seem to greatly affect performance; however, JSICK (NLI set) was included in the training data, so it is impossible to fully remove this contamination at the moment. I intend to fix this in a future release, but please keep it in mind as you view the results (see the JSQuAD results below for a fully unseen comparison, although one focused on retrieval).

This table is adapted and truncated (to keep only the most popular models) from oshizo’s benchmarking GitHub repo; please check it out for more information and give it a star, as it was very useful!

Italic denotes best model for its size (base/large | 768/1024), bold denotes best overall.

| Model | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
|---|---|---|---|---|
| bclavie/fio-base-japanese-v0.1 | **0.863** | **0.894** | 0.718 | *0.825* |
| cl-nagoya/sup-simcse-ja-base | 0.809 | 0.827 | 0.527 | 0.721 |
| cl-nagoya/sup-simcse-ja-large | *0.831* | *0.831* | 0.507 | 0.723 |
| colorfulscoop/sbert-base-ja | 0.742 | 0.657 | 0.254 | 0.551 |
| intfloat/multilingual-e5-base | 0.796 | 0.806 | *0.845* | 0.816 |
| intfloat/multilingual-e5-large | 0.819 | 0.794 | **0.883** | **0.832** |
| pkshatech/GLuCoSE-base-ja | 0.818 | 0.757 | 0.692 | 0.755 |
| text-embedding-ada-002 | 0.790 | 0.789 | 0.7232 | 0.768 |

JSQuAD Retrieval

Fio is not trained on JSQuAD at all. This gives an idea of how generalisable the current embeddings are, and of their suitability for a two-step approach (retrieve + rerank); a rough sketch of such a pipeline is shown after the table below.

This ranking is adapted from the nouu-me/document_vector_search_benchmark GitHub repo. Please do check it out if you’re interested.

| Multilingual | Embedding model | Recall@1 | Recall@3 | Recall@5 | Recall@10 |
|---|---|---|---|---|---|
| Yes | intfloat/multilingual-e5-large | 0.864926 | 0.952949 | 0.965781 | 0.977488 |
| Yes | intfloat/multilingual-e5-small | 0.840387 | 0.933814 | 0.953850 | 0.972985 |
| Yes | intfloat/multilingual-e5-base | 0.838361 | 0.934039 | 0.954975 | 0.972535 |
| Yes | textembedding-gecko-multilingual@001 | 0.780729 | 0.904322 | 0.932463 | 0.961054 |
| Yes | textembedding-gecko-multilingual@latest | 0.780729 | 0.904548 | 0.932238 | 0.960603 |
| Yes | text-embedding-ada-002 | 0.753940 | 0.874606 | 0.906799 | 0.937866 |
| No | bclavie/fio-base-japanese-v0.1 (This Model) | 0.69968 | 0.84151 | 0.87933 | 0.92436 |
| No | sonoisa/sentence-bert-base-ja-mean-tokens-v2 | 0.654210 | 0.810671 | 0.862900 | 0.914228 |
| No | sonoisa/sentence-luke-japanese-base-lite | 0.652634 | 0.813147 | 0.861324 | 0.908825 |
| No | pkshatech/GLuCoSE-base-ja | 0.644755 | 0.798064 | 0.846466 | 0.896668 |
| No | sonoisa/sentence-bert-base-ja-mean-tokens-v2 | 0.639802 | 0.782981 | 0.841288 | 0.894867 |
| No | cl-nagoya/sup-simcse-ja-base | 0.631923 | 0.792661 | 0.848942 | 0.897118 |
| No | cl-nagoya/sup-simcse-ja-large | 0.603107 | 0.776452 | 0.833408 | 0.889239 |
| No | cl-nagoya/unsup-simcse-ja-large | 0.594777 | 0.755966 | 0.818100 | 0.879559 |
| No | cl-nagoya/unsup-simcse-ja-base | 0.577217 | 0.746961 | 0.804142 | 0.870779 |
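
As a rough illustration of such a two-step pipeline, here is a minimal sketch assuming the standard sentence-transformers and CrossEncoder APIs; the reranker checkpoint name is a placeholder, not a recommendation:

```python
# Sketch of a two-step retrieve-then-rerank pipeline: Fio embeddings for the
# first stage, then a cross-encoder reranker over the top-k candidates.
# The reranker checkpoint name is hypothetical; swap in any Japanese cross-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("bclavie/fio-base-japanese-v0.1")
reranker = CrossEncoder("your-org/your-japanese-reranker")  # hypothetical checkpoint

corpus = ["..."]  # your passages
query = "..."     # your query

# Queries use the retrieval prompt described in the Prompt section below;
# passages are encoded as-is.
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
query_emb = retriever.encode(
    "関連記事を取得するために使用できるこの文の表現を生成します: " + query,
    convert_to_tensor=True,
)

# First stage: top-k candidates by cosine similarity.
hits = util.semantic_search(query_emb, corpus_emb, top_k=10)[0]

# Second stage: rerank the candidates with the cross-encoder.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)
reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)
```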

Background

I’ve been very interested in embeddings for years, and have been working on improving them professionally for a long time. While all the new tech has been really cool and interesting, I haven’t really had an opportunity to make any public contributions before. I’ve been learning Japanese in anticipation of a (mid-term) move to Japan, and got really interested in the current state of Japanese NLP, as it comes with many different constraints. Fio-v0.1 is the result of this combination, and I hope it can be useful to others!

How to use Fio?

Fio is a pretty standard sentence-transformers model, though it does require unidic-lite and fugashi. Please check the Hugging Face model page for the boilerplate instructions; a minimal example is sketched below.
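
As a quick illustration, a minimal sketch assuming the standard sentence-transformers API (the example sentences are arbitrary):

```python
# Minimal usage sketch: encode two sentences and compute their cosine similarity.
# Requires: pip install sentence-transformers fugashi unidic-lite
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bclavie/fio-base-japanese-v0.1")

sentences = ["猫が窓の外を見ている。", "猫が外の景色を眺めている。"]
embeddings = model.encode(sentences, convert_to_tensor=True)

print(util.cos_sim(embeddings[0], embeddings[1]))
```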

Data

Datasets

Compute resources are limited for v0.1, so we use a restricted set of datasets.

Similarity/Entailment:

  • JSTS (train)
  • JSNLI (train)
  • JNLI (train)
  • JSICK (train)

Retrieval:

  • MMARCO (multilingual MS MARCO) (train, 124k sentence pairs, <1% of the full data)
  • Mr.TyDI (train)
  • MIRACL (train, 50% sample)
  • JSQuAD (train, 50% sample, no LLM enhancement). JSQuAD is not used in the released version, so that it can serve as an unseen test set.

Data Augmentation

Data augmentation experiments are ongoing and seem to yield very good results, but aren’t widely used in the released version.

Training Procedure

We train using sentence pairs only. The model was trained in three phases, all supervised:

  • Binary (label: 0/1) training on JSICK, JSNLI and JNLI. Entailment is 1, contradiction is 0, and neutral pairs are discarded. We use a batch size of 64 and a sequence length of either 64 or 128 (very little impact) for 3 epochs.
  • Binary (label: 0/1) retrieval training on MMARCO, Mr.TyDI and MIRACL (if used). Training data is sorted so that negative examples are in the same batch as their positive example (very important). A sequence length of 512 is required for best performance; 256 is serviceable, and anything below is not recommended as performance degrades rapidly. See below for comments on batch size. This step is only performed for one epoch in the released version; it is likely undertrained.
  • Fine-tuning for specific tasks can be done at this stage, as a third step: for example, similarity training on JSTS and JSICK. This is in line with the common idea that training for NLI and then STS is a good substitute for multi-task learning. The label is a float representing the similarity (0 to 5). We use a batch size of 64 and a sequence length of either 64 or 128 (as above, very little impact) for 3 epochs. A rough sketch of this setup is shown after this list.
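
The exact loss functions aren’t specified above; this is a minimal sketch of stage 1, assuming a contrastive objective over binary pairs and the classic sentence-transformers fit API (the example pairs are made up):

```python
# Rough sketch of stage 1: binary (0/1) pair training starting from the base model.
# The specific loss is an assumption; any binary-pair objective could be used here.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("cl-tohoku/bert-base-japanese-v3")
model.max_seq_length = 64  # 64 vs 128 made very little difference

# Hypothetical pre-built pairs: label 1.0 for entailment, 0.0 for contradiction
# (neutral pairs are discarded, as described above).
train_examples = [
    InputExample(texts=["男性がギターを弾いている。", "人が楽器を演奏している。"], label=1.0),
    InputExample(texts=["男性がギターを弾いている。", "誰も演奏していない。"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.ContrastiveLoss(model)  # assumption: the actual loss used may differ

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)

# For stage 3, one might instead use losses.CosineSimilarityLoss with the
# JSTS/JSICK scores rescaled from [0, 5] to [0, 1] (again, an assumption).
```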

Stages 1 and 3 were performed on an RTX 4090, and stage 2 on an A100.

Stage 3 does not appear to be strictly required for most tasks; it only improves performance marginally. Fine-tuning per task and using LM-Cocktail to merge the weights is likely a better option.
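
LM-Cocktail itself does something more principled, but as a naive illustration of the underlying weight-merging idea only (this is not the LM-Cocktail API, and the checkpoint paths are hypothetical):

```python
# Naive weight merging: a plain average of two fine-tuned checkpoints' parameters.
# This is only a sketch of the general idea; LM-Cocktail is more sophisticated.
from sentence_transformers import SentenceTransformer

model_sts = SentenceTransformer("path/to/fio-finetuned-sts")              # hypothetical
model_retrieval = SentenceTransformer("path/to/fio-finetuned-retrieval")  # hypothetical

state_a = model_sts.state_dict()
state_b = model_retrieval.state_dict()

# Average floating-point parameters; leave integer buffers (e.g. position_ids) untouched.
merged = {
    k: 0.5 * state_a[k] + 0.5 * state_b[k] if state_a[k].is_floating_point() else state_a[k]
    for k in state_a
}

model_sts.load_state_dict(merged)
model_sts.save("fio-merged")
```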

Base Model

The model is based on the excellent cl-tohoku/bert-base-japanese-v3. Whether it is the optimal choice remains unclear, but it provided the most balanced performance during subsampled runs, so it was picked. The other models considered were:

  • studio-ousia/luke-japanese-base-lite: A strong Japanese LUKE model. Performance appeared slightly better than bert-base-japanese-v3 on some tasks, and worse on others.
  • nlp-waseda/roberta-base-japanese: Worse than bert-base-japanese-v3 on all metrics (though never by a lot!)
  • intfloat/multilingual-e5-base: Training directly from multilingual E5 was considered, but dismissed for multiple reasons, among them its tokeniser not being optimised for Japanese and the fact that a language-specific model is generally preferred.

Similarity

The sentence similarity tasks were learned very easily by the model, and performance shot up quickly. After the first training stage, JSTS performance was already >0.805. In training runs without any attempt to improve retrieval and with no data augmentation, JSTS performance went up to ~0.87 and JSICK ~0.90.

It seems like reaching above 0.90 on both datasets should be fairly doable with a model specifically optimised for sentence similarity/entailment, especially with LLM enhancements. In early tests, there is a ceiling around the scores above (0.875 and 0.901) that seems to require further optimisation to break.

Retrieval

An interesting aspect: including the MIRACL train split in the training set (or not) had an extremely minor impact on downstream MIRACL performance. The NLI-only model from step 1 of training (as explained above) had relatively poor retrieval performance, but once retrieval training had been performed:

  • Performance on all NLI tasks went up
  • Performance on MIRACL went up, as long as MMARCO (downsampled)/Mr.TyDI was present in the data.

Prompt

Retrieval performance was hugely affected by using a prompt in the query. We tried multiple approaches:

  • Not using any sort of pre-processing: performance was mediocre.
  • Prepending all queries with "質問: " ("question: ") and all passages with "段落: " ("paragraph: "). This improved performance, but the model was still unable to retrieve the correct passage for many queries (MIRACL dev: ~0.66).
  • Prepending all queries with "関連記事を取得するために使用できるこの文の表現を生成します: " (roughly, "Generate a representation of this sentence that can be used to retrieve related articles: "), to give the querying context more weight, especially for very short queries. In this setting, we do not prepend anything to the passages. This yielded the best performance; a minimal sketch is shown below.

We have more prompt combinations to experiment with, but the last approach yielded by far the best results (+5 points on MIRACL dev compared to the one above).
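
For illustration, a minimal sketch of that best-performing setting, assuming the standard sentence-transformers API (the query and passage are arbitrary examples):

```python
# Prompting scheme sketch: the retrieval prefix is prepended to queries only;
# passages are encoded as-is.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bclavie/fio-base-japanese-v0.1")

QUERY_PREFIX = "関連記事を取得するために使用できるこの文の表現を生成します: "

queries = ["富士山の高さは?"]                     # arbitrary example query
passages = ["富士山の標高は3776メートルである。"]  # arbitrary example passage

query_emb = model.encode([QUERY_PREFIX + q for q in queries], convert_to_tensor=True)
passage_emb = model.encode(passages, convert_to_tensor=True)

print(util.cos_sim(query_emb, passage_emb))
```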