We take the Passage dataset as an example.
Download the data and convert it to our format:
```bash
bash ./prepare_data/download_data.sh
```
The data will be saved into `./data/passage/dataset`.
We use co-condenser as the text encoder:
```bash
python ./prepare_data/get_embeddings.py \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--tokenizer_name Luyu/co-condenser-marco-retriever \
--max_doc_length 256 \
--max_query_length 32 \
--output_dir ./data/passage/evaluate/co-condenser
```
The script preprocesses the data into `preprocess_dir` (for training the encoder) and generates embeddings into `output_dir` (for training the index). For more information about the data format, please refer to dataset.README.md.
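Once the embeddings are generated, you can load them for inspection or downstream experiments. Below is a minimal sketch; the file names and the 768-dimensional embedding size are assumptions, so check the actual output of `get_embeddings.py`:

```python
import numpy as np

# Hypothetical file names; adjust to whatever get_embeddings.py actually writes.
emb_dim = 768  # assumed co-condenser embedding dimension
doc_embeddings = np.memmap(
    "./data/passage/evaluate/co-condenser/docs.memmap",
    dtype=np.float32, mode="r",
).reshape(-1, emb_dim)
query_embeddings = np.memmap(
    "./data/passage/evaluate/co-condenser/dev-queries.memmap",
    dtype=np.float32, mode="r",
).reshape(-1, emb_dim)
print(doc_embeddings.shape, query_embeddings.shape)
```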
Build a Faiss index with the embeddings:
```bash
python ./basic_index/faiss_index.py \
--preprocess_dir ./data/passage/preprocess \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method ivf_opq \
--ivf_centers_num 10000 \
--subvector_num 64 \
--subvector_bits 8 \
--nprobe 100
```
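For reference, the following is a rough sketch of what an IVF-OPQ index with the same parameters looks like in raw Faiss; the embedding dimension and data are placeholders, not the script's actual code:

```python
import faiss
import numpy as np

d = 768  # placeholder embedding dimension
doc_embeddings = np.random.rand(100000, d).astype("float32")    # placeholder data
query_embeddings = np.random.rand(100, d).astype("float32")

# OPQ rotation, 10000 IVF centroids, PQ with 64 subvectors of 8 bits each
index = faiss.index_factory(d, "OPQ64,IVF10000,PQ64x8", faiss.METRIC_INNER_PRODUCT)
index.train(doc_embeddings)
index.add(doc_embeddings)

# Probe 100 inverted lists at search time
faiss.extract_index_ivf(index).nprobe = 100
scores, ids = index.search(query_embeddings, 10)
```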
Or build a ScaNN index:
```bash
python ./basic_index/scann_index.py \
--preprocess_dir ./data/passage/preprocess \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--ivf_centers_num 10000 \
--subvector_num 32 \
--nprobe 100
```
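For comparison, a ScaNN searcher with comparable settings can be built directly with the `scann` package roughly as follows; this is a sketch with placeholder data, and the exact settings inside `scann_index.py` may differ:

```python
import numpy as np
import scann

d = 768  # placeholder embedding dimension
doc_embeddings = np.random.rand(100000, d).astype("float32")

searcher = (
    scann.scann_ops_pybind.builder(doc_embeddings, 10, "dot_product")
    .tree(num_leaves=10000, num_leaves_to_search=100, training_sample_size=100000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)  # asymmetric hashing (PQ-style scoring)
    .reorder(100)
    .build()
)

query = np.random.rand(d).astype("float32")
neighbors, distances = searcher.search(query, final_num_neighbors=10)
```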
Finetune the index with fixed embeddings (this requires the embeddings of queries and docs):
```bash
python ./learnable_index/train_index.py \
--preprocess_dir ./data/passage/preprocess \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method ivf_opq \
--ivf_centers_num 10000 \
--subvector_num 32 \
--subvector_bits 8 \
--nprobe 100 \
--training_mode {distill_index, distill_index_nolabel, contrastive_index} \
--per_device_train_batch_size 512
```
Jointly train the index and the query encoder, which usually gives better performance (this requires the embeddings and a query encoder):
```bash
python ./learnable_index/train_index_and_encoder.py \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method ivf_opq \
--ivf_centers_num 10000 \
--subvector_num 32 \
--subvector_bits 8 \
--nprobe 100 \
--training_mode {distill_index-and-query-encoder, distill_index-and-query-encoder_nolabel, contrastive_index-and-query-encoder} \
--per_device_train_batch_size 512
```
We provide several different training modes:
- contrastive_(): contrastive learning;
- distill_(): knowledge distillation; it transfers knowledge (i.e., the ranking of docs) from the dense vectors to the IVF and PQ modules;
- distill_()_nolabel: knowledge distillation for unlabeled data; it first finds the top-k docs for each training query by brute-force search (or an index with high performance), then uses these results as new training data.

For more implementation details, please refer to train_index.py and train_index_and_encoder.py; a simplified sketch of the distillation objective is given below.
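A minimal sketch of the distillation idea, assuming the teacher scores come from the original dense doc embeddings and the student scores from their quantized (IVF/PQ) reconstructions; the actual loss and negative sampling in train_index.py may differ:

```python
import torch
import torch.nn.functional as F

def distill_loss(query_emb, doc_emb, quantized_doc_emb, temperature=1.0):
    """KL distillation from dense scores (teacher) to quantized scores (student)."""
    teacher_scores = query_emb @ doc_emb.t()             # [n_query, n_doc]
    student_scores = query_emb @ quantized_doc_emb.t()   # [n_query, n_doc]
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    # Match the student's ranking distribution to the teacher's
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```

In the index-and-query-encoder modes, the query embeddings are produced by the trainable query encoder, so the same gradient also updates the encoder parameters.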
Methods | MRR@10 | Recall@10 | Recall@100 |
---|---|---|---|
Faiss-IVFPQ | 0.1380 | 0.2820 | 0.5617 |
Faiss-IVFOPQ | 0.3102 | 0.5593 | 0.8148 |
Scann | 0.1791 | 0.3499 | 0.6345 |
LibVQ(contrastive_index) | 0.3179 | 0.5724 | 0.8214 |
LibVQ(distill_index) | 0.3253 | 0.5765 | 0.8256 |
LibVQ(distill_index_nolabel) | 0.3234 | 0.5813 | 0.8269 |
LibVQ(contrastive_index-and-query-encoder) | 0.3192 | 0.5799 | 0.8427 |
LibVQ(distill_index-and-query-encoder) | 0.3311 | 0.5907 | 0.8429 |
LibVQ(distill_index-and-query-encoder_nolabel) | 0.3285 | 0.5875 | 0.8401 |
For PQ, you can reuse the above commands and only change `--index_method` to `pq` or `opq`.
For example:
```bash
python ./learnable_index/train_index.py \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method opq \
--subvector_num 32 \
--subvector_bits 8 \
--training_mode distill_index \
--per_device_train_batch_size 128
```
Besides, you can train both the doc encoder and the query encoder when training only PQ (training_mode = distill_index-and-two-encoders):
```bash
python ./learnable_index/train_index_and_encoder.py \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method opq \
--subvector_num 32 \
--subvector_bits 8 \
--training_mode distill_index-and-two-encoders \
--per_device_train_batch_size 128
```
Methods | MRR@10 | Recall@10 | Recall@100 |
---|---|---|---|
Faiss-PQ | 0.1145 | 0.2369 | 0.5046 |
Faiss-OPQ | 0.3268 | 0.5939 | 0.8651 |
Scann | 0.1795 | 0.3516 | 0.6409 |
LibVQ(distill_index) | 0.3435 | 0.6203 | 0.8825 |
LibVQ(distill_index_nolabel) | 0.3467 | 0.6180 | 0.8849 |
LibVQ(distill_index-and-query-encoder) | 0.3446 | 0.6201 | 0.8837 |
LibVQ(distill_index-and-two-encoders) | 0.3475 | 0.6223 | 0.8901 |