API Reference
RAGTrainer
Main trainer for fine-tuning or training ColBERT models in a few lines of code.
Source code in ragatouille/RAGTrainer.py
__init__(model_name, pretrained_model_name, language_code='en', n_usable_gpus=-1)
Initialise a RAGTrainer instance. This will load a base model: either an existing ColBERT model to fine-tune or a BERT/RoBERTa-like model to build a new ColBERT model from.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Name of the model to train. This will be used to name the checkpoints and the index. | *required* |
| `pretrained_model_name` | `str` | Name of the pretrained model to use as a base. Can be a local path to a checkpoint or a Hugging Face model name. | *required* |
| `language_code` | `str` | Language code of the model to train. This will be used to name the checkpoints and the index. | `'en'` |
| `n_usable_gpus` | `int` | Number of GPUs to use. Defaults to `-1`: use all available GPUs, or none if no GPU is available. | `-1` |
Returns:
| Name | Type | Description |
|---|---|---|
| `self` | `RAGTrainer` | The current instance of RAGTrainer, with the base model initialised. |
Source code in ragatouille/RAGTrainer.py
export_training_data(path)
Manually export the training data processed by prepare_training_data to a given path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Union[str, Path]` | Path to the directory where the data will be exported. | *required* |
Source code in ragatouille/RAGTrainer.py
prepare_training_data(raw_data, all_documents=None, data_out_path='./data/', num_new_negatives=10, hard_negative_minimum_rank=10, mine_hard_negatives=True, hard_negative_model_size='small', pairs_with_labels=False, positive_label=1, negative_label=0)
Fully pre-process input data in various raw formats into ColBERT-ready files and triplets. Accepts a variety of formats: unannotated pairs, annotated pairs, triplets of strings, and triplets of lists of strings. The data is processed into a ColBERT-ready format and exported to data_out_path. Hard negatives are generated when mine_hard_negatives is True, with num_new_negatives controlling how many negatives are generated. If mine_hard_negatives is False and num_new_negatives is greater than 0, the negatives are randomly sampled instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw_data` | `Union[list[tuple], list[list]]` | List of pairs, annotated pairs, or triplets of strings. | *required* |
| `all_documents` | `Optional[list[str]]` | A corpus of documents to be used for sampling negatives. | `None` |
| `data_out_path` | `Union[str, Path]` | Path to the directory where the data will be exported (can be a tmp directory). | `'./data/'` |
| `num_new_negatives` | `int` | Number of new negatives to generate for each query. | `10` |
| `mine_hard_negatives` | `bool` | Whether to use hard negative mining or not. | `True` |
| `hard_negative_model_size` | `str` | Size of the model to use for hard negative mining. | `'small'` |
| `pairs_with_labels` | `bool` | Whether the raw_data is a list of pairs with labels or not. | `False` |
| `positive_label` | `Union[int, str]` | Label to use for positive pairs. | `1` |
| `negative_label` | `Union[int, str]` | Label to use for negative pairs. | `0` |
Returns:
| Name | Type | Description |
|---|---|---|
| `data_out_path` | `Union[str, Path]` | Path to the directory where the data has been exported. |
Source code in ragatouille/RAGTrainer.py
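The accepted raw_data shapes described above can be sketched in plain Python. This is a hedged illustration: the strings are made up, and the label convention shown simply matches the pairs_with_labels, positive_label, and negative_label parameters.

```python
# Illustrative shapes for raw_data, per the formats listed above.
query = "What is ColBERT?"
pos = "ColBERT is a late-interaction retrieval model."
neg = "The Eiffel Tower is in Paris."

# Unannotated pairs: (query, relevant_passage)
unannotated_pairs = [(query, pos)]

# Annotated pairs, used with pairs_with_labels=True:
# (query, passage, label), with positive_label=1 and negative_label=0
labeled_pairs = [(query, pos, 1), (query, neg, 0)]

# Triplets of strings: (query, positive_passage, negative_passage)
triplets = [(query, pos, neg)]

# Triplets of lists of strings: multiple positives/negatives per query
list_triplets = [(query, [pos], [neg, "Another unrelated passage."])]
```

Any of these lists can be passed as raw_data; the trainer normalises them into ColBERT-ready triplets.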
train(batch_size=32, nbits=2, maxsteps=500000, use_ib_negatives=True, learning_rate=5e-06, dim=128, doc_maxlen=256, use_relu=False, warmup_steps='auto', accumsteps=1)
Launch training or fine-tuning of a ColBERT model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Total batch size; divide by n_usable_gpus for the per-GPU batch size. | `32` |
| `nbits` | `int` | Number of bits used for vector compression by the trained model. 2 is usually ideal. | `2` |
| `maxsteps` | `int` | End training early after maxsteps steps. | `500000` |
| `use_ib_negatives` | `bool` | Whether to use in-batch negatives to calculate loss or not. | `True` |
| `learning_rate` | `float` | The ColBERT literature usually has this performing best between 3e-6 and 2e-5, depending on data size. | `5e-06` |
| `dim` | `int` | Size of individual vector representations. | `128` |
| `doc_maxlen` | `int` | The maximum length after which passages will be truncated. | `256` |
| `warmup_steps` | `Union[int, Literal['auto']]` | How many warmup steps to use for the learning rate. `'auto'` defaults to 10% of total steps. | `'auto'` |
| `accumsteps` | `int` | How many gradient accumulation steps to use to simulate higher batch sizes. | `1` |
Returns:
| Name | Type | Description |
|---|---|---|
| `model_path` | `str` | Path to the trained model. |
Source code in ragatouille/RAGTrainer.py
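The batch-size and warmup relationships described in the table above come down to simple arithmetic. A minimal sketch, assuming the 'auto' heuristic of 10% of total steps stated in the table (the helper function names are illustrative, not part of the library):

```python
def resolve_warmup_steps(warmup_steps, maxsteps):
    """'auto' defaults to 10% of total steps, per the parameter table."""
    return maxsteps // 10 if warmup_steps == "auto" else warmup_steps

def per_gpu_batch_size(batch_size, n_usable_gpus):
    """batch_size is the *total* batch size; divide by the GPU count."""
    return batch_size // max(n_usable_gpus, 1)

def effective_batch_size(batch_size, accumsteps):
    """Gradient accumulation simulates a batch accumsteps times larger."""
    return batch_size * accumsteps
```

For example, with the defaults (maxsteps=500000, batch_size=32, accumsteps=1) on 4 GPUs, warmup runs for 50000 steps and each GPU sees 8 examples per step.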
RAGPretrainedModel
Wrapper class for a pretrained RAG late-interaction model, and all the associated utilities. Allows you to load a pretrained model from disk or from the hub, build or query an index.
Usage

Load a pre-trained checkpoint:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
```

Load a checkpoint from an existing index:

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_index("path/to/my/index")
```

Both methods load a fully initialised instance of ColBERT, which you can use to build and query indexes:

```python
RAG.search("How many people live in France?")
```
Source code in ragatouille/RAGPretrainedModel.py
add_to_index(new_collection, new_document_ids=None, new_document_metadatas=None, index_name=None, split_documents=True, document_splitter_fn=llama_index_sentence_splitter, preprocessing_fn=None)
Add documents to an existing index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `new_collection` | `list[str]` | The documents to add to the index. | *required* |
| `new_document_metadatas` | `Optional[list[dict]]` | An optional list of metadata dicts. | `None` |
| `index_name` | `Optional[str]` | The name of the index to add documents to. If None (the default), adds documents to the already-initialised index. | `None` |
Source code in ragatouille/RAGPretrainedModel.py
clear_encoded_docs(force=False)
Clear documents encoded in-memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `force` | `bool` | Whether to force the clearing of encoded documents without enforcing a 10s wait time. | `False` |
Source code in ragatouille/RAGPretrainedModel.py
delete_from_index(document_ids, index_name=None)
Delete documents from an index by their IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `document_ids` | `Union[TypeVar(T), List[TypeVar(T)]]` | The IDs of the documents to delete. | *required* |
| `index_name` | `Optional[str]` | The name of the index to delete documents from. If None (the default), deletes documents from the already-initialised index. | `None` |
Source code in ragatouille/RAGPretrainedModel.py
encode(documents, bsize=32, document_metadatas=None, verbose=True, max_document_length='auto')
Encode documents in memory to be searched through with no Index. Performance degrades rapidly with more documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `documents` | `list[str]` | The documents to encode. | *required* |
| `bsize` | `int` | The batch size to use for encoding. | `32` |
| `document_metadatas` | `Optional[list[dict]]` | An optional list of metadata dicts. Each entry must correspond to a document. | `None` |
Source code in ragatouille/RAGPretrainedModel.py
from_index(index_path, n_gpu=-1, verbose=1)
classmethod
Load an Index and the associated ColBERT encoder from an existing document index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `index_path` | `Union[str, Path]` | Path to the index. | *required* |
| `n_gpu` | `int` | Number of GPUs to use. Defaults to `-1`: use all available GPUs, or none if no GPU is available. | `-1` |
| `verbose` | `int` | The level of ColBERT verbosity requested. Defaults to 1, which filters out most internal logs. | `1` |
Returns:
| Name | Type | Description |
|---|---|---|
| `cls` | `RAGPretrainedModel` | The current instance of RAGPretrainedModel, with the model and index initialised. |
Source code in ragatouille/RAGPretrainedModel.py
from_pretrained(pretrained_model_name_or_path, n_gpu=-1, verbose=1, index_root=None)
classmethod
Load a ColBERT model from a pre-trained checkpoint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pretrained_model_name_or_path` | `str` | Local path or Hugging Face model name. | *required* |
| `n_gpu` | `int` | Number of GPUs to use. Defaults to `-1`: use all available GPUs, or none if no GPU is available. | `-1` |
| `verbose` | `int` | The level of ColBERT verbosity requested. Defaults to 1, which filters out most internal logs. | `1` |
| `index_root` | `Optional[str]` | The root directory where indexes will be stored. If None, uses the default directory, `.ragatouille/`. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `cls` | `RAGPretrainedModel` | The current instance of RAGPretrainedModel, with the model initialised. |
Source code in ragatouille/RAGPretrainedModel.py
index(collection, document_ids=None, document_metadatas=None, index_name=None, overwrite_index=True, max_document_length=256, split_documents=True, document_splitter_fn=llama_index_sentence_splitter, preprocessing_fn=None)
Build an index from a list of documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `collection` | `list[str]` | The collection of documents to index. | *required* |
| `document_ids` | `Optional[list[str]]` | An optional list of document IDs. IDs will be generated at index time if not supplied. | `None` |
| `index_name` | `str` | The name of the index that will be built. | `None` |
| `overwrite_index` | `Union[bool, str]` | Whether to overwrite an existing index with the same name. | `True` |
| `max_document_length` | `int` | The maximum length of a document. Documents longer than this will be split into chunks. | `256` |
| `split_documents` | `bool` | Whether to split documents into chunks. | `True` |
| `document_splitter_fn` | `Optional[Callable]` | A function to split documents into chunks. If None, defaults to llama_index_sentence_splitter. | `llama_index_sentence_splitter` |
| `preprocessing_fn` | `Optional[Union[Callable, list[Callable]]]` | A function or list of functions to preprocess documents. If None (the default), documents are not preprocessed. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `index` | `str` | The path to the index that was built. |
Source code in ragatouille/RAGPretrainedModel.py
rerank(query, documents, k=10, zero_index_ranks=False, bsize=64)
Encode documents and rerank them in-memory. Performance degrades rapidly with more documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Union[str, list[str]]` | The query or list of queries to search for. | *required* |
| `documents` | `list[str]` | The documents to rerank. | *required* |
| `k` | `int` | The number of results to return for each query. | `10` |
| `zero_index_ranks` | `bool` | Whether to zero-index the ranks of the results. By default, the highest-ranked result has rank 1. | `False` |
| `bsize` | `int` | The batch size to use for re-ranking. | `64` |
Returns:
| Name | Type | Description |
|---|---|---|
| `results` | `Union[list[dict], list[list[dict]]]` | A list of dicts containing individual results for each query. If a list of queries is provided, returns a list of lists of dicts. |
Individual results are always in the format:
{"content": "text of the relevant passage", "score": 0.123456, "rank": 1}
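The zero_index_ranks flag can be illustrated on a hand-made results list shaped like the documented format (the passages and scores here are made up for illustration):

```python
# A results list in the documented format: rank 1 is the top result by default.
results = [
    {"content": "Paris is the capital of France.", "score": 0.98, "rank": 1},
    {"content": "France borders Spain.", "score": 0.71, "rank": 2},
]

def zero_index(results):
    """Mimic zero_index_ranks=True: shift ranks so the top result has rank 0."""
    return [{**r, "rank": r["rank"] - 1} for r in results]

zeroed = zero_index(results)
```

Zero-indexed ranks are convenient when results feed directly into list indexing or evaluation code that counts from 0.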
Source code in ragatouille/RAGPretrainedModel.py
search(query, index_name=None, k=10, force_fast=False, zero_index_ranks=False, **kwargs)
Query an index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Union[str, list[str]]` | The query or list of queries to search for. | *required* |
| `index_name` | `Optional[str]` | The name of an index to query. If None (the default), queries the already-initialised index. | `None` |
| `k` | `int` | The number of results to return for each query. | `10` |
| `force_fast` | `bool` | Whether to force the use of a faster but less accurate search method. | `False` |
| `zero_index_ranks` | `bool` | Whether to zero-index the ranks of the results. By default, the highest-ranked result has rank 1. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `results` | `Union[list[dict], list[list[dict]]]` | A list of dicts containing individual results for each query. If a list of queries is provided, returns a list of lists of dicts. |
Individual results are always in the format:
{"content": "text of the relevant passage", "score": 0.123456, "rank": 1, "document_id": "x"}
or
{"content": "text of the relevant passage", "score": 0.123456, "rank": 1, "document_id": "x", "document_metadata": {"metadata_key": "metadata_value", ...}}
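Consuming results in the formats above is plain dict handling; note that document_metadata is only present when metadata was supplied at indexing time. A sketch over hand-made results (the contents, IDs, and metadata keys are illustrative):

```python
# Results shaped like the documented search() output.
results = [
    {"content": "Roughly 68 million people live in France.", "score": 21.5,
     "rank": 1, "document_id": "doc-42",
     "document_metadata": {"source": "census"}},
    {"content": "France is in Western Europe.", "score": 17.2,
     "rank": 2, "document_id": "doc-7"},
]

# Top passage text, plus per-result metadata (None when absent).
top_content = results[0]["content"]
sources = [r.get("document_metadata", {}).get("source") for r in results]
```

Using `dict.get` with a default guards against results that carry no document_metadata key.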
Source code in ragatouille/RAGPretrainedModel.py
search_encoded_docs(query, k=10, bsize=32)
Search through documents encoded in-memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Union[str, list[str]]` | The query or list of queries to search for. | *required* |
| `k` | `int` | The number of results to return for each query. | `10` |
| `bsize` | `int` | The batch size to use for searching. | `32` |
Returns:
| Name | Type | Description |
|---|---|---|
| `results` | `list[dict[str, Any]]` | A list of dicts containing individual results for each query. If a list of queries is provided, returns a list of lists of dicts. |
Source code in ragatouille/RAGPretrainedModel.py