# MSMARCO Passage Ranking

We take the MSMARCO Passage dataset as an example.

## Preparing Data

Download the data and convert it to our format:

```bash
bash ./prepare_data/download_data.sh
```

The data will be saved into `./data/passage/dataset`.

## Preprocess and Generate Embeddings

We use co-Condenser as the text encoder:

```bash
python ./prepare_data/get_embeddings.py  \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--tokenizer_name Luyu/co-condenser-marco-retriever \
--max_doc_length 256 \
--max_query_length 32 \
--output_dir ./data/passage/evaluate/co-condenser
```

The code will preprocess the data into `preprocess_dir` (for training the encoder) and generate embeddings into `output_dir` (for training the index). For more information about the data format, please refer to `dataset.README.md`.
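Before building an index, it can help to sanity-check the generated embeddings. Below is a minimal sketch that loads them as flat float32 memmap files; the file names (`docs.memmap`, `queries.memmap`) and the 768-dim embedding size are assumptions for illustration, not guaranteed by the script.

```python
# Sanity-check the generated embeddings (file names and dim are assumptions).
import numpy as np

emb_dim = 768  # co-Condenser is BERT-base sized; verify against your setup

doc_emb = np.memmap("./data/passage/evaluate/co-condenser/docs.memmap",
                    dtype=np.float32, mode="r").reshape(-1, emb_dim)
query_emb = np.memmap("./data/passage/evaluate/co-condenser/queries.memmap",
                      dtype=np.float32, mode="r").reshape(-1, emb_dim)

print(doc_emb.shape, query_emb.shape)  # (num_docs, 768), (num_queries, 768)
```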

## IVFPQ

- Faiss Index

```bash
python ./basic_index/faiss_index.py  \
--preprocess_dir ./data/passage/preprocess \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method ivf_opq \
--ivf_centers_num 10000 \
--subvector_num 64 \
--subvector_bits 8 \
--nprobe 100
```
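For reference, here is a rough sketch of an equivalent index built directly with the faiss library (the script wraps similar logic; the factory string mirrors `--ivf_centers_num 10000`, `--subvector_num 64`, and `--subvector_bits 8`, while the 768-dim size and the random stand-in data are assumptions):

```python
# Build an IVF-OPQ index with raw faiss, mirroring the flags above.
import faiss
import numpy as np

d = 768  # embedding dim (assumption)
xb = np.random.rand(100_000, d).astype("float32")  # stand-in for doc embeddings
xq = np.random.rand(10, d).astype("float32")       # stand-in for query embeddings

# OPQ64 rotation -> 10000 IVF centers -> PQ with 64 subvectors x 8 bits
index = faiss.index_factory(d, "OPQ64,IVF10000,PQ64x8", faiss.METRIC_INNER_PRODUCT)
index.train(xb)  # real runs should train on the full doc embeddings
index.add(xb)

faiss.extract_index_ivf(index).nprobe = 100  # matches --nprobe 100
scores, ids = index.search(xq, 10)           # top-10 docs per query
```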
- ScaNN Index

```bash
python ./basic_index/scann_index.py  \
--preprocess_dir ./data/passage/preprocess \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--ivf_centers_num 10000 \
--subvector_num 32 \
--nprobe 100
```
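Similarly, a rough sketch with the scann library directly (`num_leaves` mirrors `--ivf_centers_num` and `num_leaves_to_search` mirrors `--nprobe`; mapping `--subvector_num 32` to `dimensions_per_block = 768 // 32` is an assumption, as is the random stand-in data):

```python
# Build a ScaNN searcher with raw scann, mirroring the flags above.
import numpy as np
import scann

d = 768
db = np.random.rand(100_000, d).astype("float32")
queries = np.random.rand(10, d).astype("float32")

searcher = (
    scann.scann_ops_pybind.builder(db, 10, "dot_product")
    .tree(num_leaves=10000, num_leaves_to_search=100, training_sample_size=100_000)
    .score_ah(d // 32, anisotropic_quantization_threshold=0.2)  # 32 subvector blocks
    .build()
)
neighbors, distances = searcher.search_batched(queries)
```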
- Learnable Index

Finetune the index with fixed embeddings (this requires the embeddings of queries and docs):

```bash
python ./learnable_index/train_index.py  \
--preprocess_dir ./data/passage/preprocess \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method ivf_opq \
--ivf_centers_num 10000 \
--subvector_num 32 \
--subvector_bits 8 \
--nprobe 100 \
--training_mode {distill_index, distill_index_nolabel, contrastive_index} \
--per_device_train_batch_size 512
```

Jointly train the index and query encoder, which usually achieves better performance (this requires the embeddings and a query encoder):

```bash
python ./learnable_index/train_index_and_encoder.py  \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method ivf_opq \
--ivf_centers_num 10000 \
--subvector_num 32 \
--subvector_bits 8 \
--nprobe 100 \
--training_mode {distill_index-and-query-encoder, distill_index-and-query-encoder_nolabel, contrastive_index-and-query-encoder} \
--per_device_train_batch_size 512
```

We provide several different training modes:

1. `contrastive_*`: contrastive learning;
2. `distill_*`: knowledge distillation, which transfers knowledge (i.e., the ranking of docs) from the dense vectors to the IVF and PQ structures (a sketch of this objective follows below);
3. `distill_*_nolabel`: knowledge distillation for unlabeled data; first find the top-k docs for each training query by brute-force search (or an index with high performance), then use these results to construct new training data.

For more implementation details, please refer to train_index.py and train_index_and_encoder.py.
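To make the `distill_*` modes concrete, here is a minimal sketch of a ranking-distillation objective, where the scores from the uncompressed dense vectors act as the teacher and the scores from the quantized (IVF/PQ) representations act as the student; this illustrates the idea, not the repo's exact loss:

```python
# Distill the dense (teacher) score distribution into the quantized (student) one.
import torch
import torch.nn.functional as F

def distill_loss(dense_scores, quantized_scores, temperature=1.0):
    """Per-query KL divergence over a shared candidate-doc list."""
    teacher = F.softmax(dense_scores / temperature, dim=-1)
    student = F.log_softmax(quantized_scores / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Toy shapes: 4 queries, 16 candidate docs each.
dense = torch.randn(4, 16)      # query-doc scores from dense vectors (teacher)
quantized = torch.randn(4, 16)  # query-doc scores from quantized vectors (student)
print(distill_loss(dense, quantized))
```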

- Results

| Methods | MRR@10 | Recall@10 | Recall@100 |
|:--|--:|--:|--:|
| Faiss-IVFPQ | 0.1380 | 0.2820 | 0.5617 |
| Faiss-IVFOPQ | 0.3102 | 0.5593 | 0.8148 |
| ScaNN | 0.1791 | 0.3499 | 0.6345 |
| LibVQ(contrastive_index) | 0.3179 | 0.5724 | 0.8214 |
| LibVQ(distill_index) | 0.3253 | 0.5765 | 0.8256 |
| LibVQ(distill_index_nolabel) | 0.3234 | 0.5813 | 0.8269 |
| LibVQ(contrastive_index-and-query-encoder) | 0.3192 | 0.5799 | 0.8427 |
| LibVQ(distill_index-and-query-encoder) | 0.3311 | 0.5907 | 0.8429 |
| LibVQ(distill_index-and-query-encoder_nolabel) | 0.3285 | 0.5875 | 0.8401 |

## PQ

- Index

For PQ, you can reuse the above commands and only change `--index_method` to `pq` or `opq`. For example:

```bash
python ./learnable_index/train_index.py  \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method opq \
--subvector_num 32 \
--subvector_bits 8 \
--training_mode distill_index \
--per_device_train_batch_size 128
```
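As background for the `--subvector_num` / `--subvector_bits` settings, here is a minimal numpy illustration of what product quantization does with these values; random codebooks stand in for the k-means-trained ones:

```python
# Product quantization at a glance: 32 subvectors x 8 bits per vector.
import numpy as np

d, M, K = 768, 32, 256   # dim, subvectors (--subvector_num), 2**8 centroids
sub_d = d // M           # 24 dims per subvector
rng = np.random.default_rng(0)

codebooks = rng.standard_normal((M, K, sub_d)).astype("float32")  # k-means in practice
x = rng.standard_normal(d).astype("float32")

subs = x.reshape(M, sub_d)
# Encode: nearest centroid id per subspace -> 32 bytes instead of 3072.
codes = np.array([np.argmin(((codebooks[m] - subs[m]) ** 2).sum(-1)) for m in range(M)])
# Decode: concatenate the selected centroids to approximate x.
x_hat = np.concatenate([codebooks[m, codes[m]] for m in range(M)])
print(codes.shape, float(np.linalg.norm(x - x_hat)))
```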

Besides, you can train both the doc encoder and the query encoder when training only PQ (`--training_mode distill_index-and-two-encoders`):

```bash
python ./learnable_index/train_index_and_encoder.py  \
--data_dir ./data/passage/dataset \
--preprocess_dir ./data/passage/preprocess \
--max_doc_length 256 \
--max_query_length 32 \
--embeddings_dir ./data/passage/evaluate/co-condenser \
--index_method opq \
--subvector_num 32 \
--subvector_bits 8 \
--training_mode distill_index-and-two-encoders \
--per_device_train_batch_size 128
```
- Results

| Methods | MRR@10 | Recall@10 | Recall@100 |
|:--|--:|--:|--:|
| Faiss-PQ | 0.1145 | 0.2369 | 0.5046 |
| Faiss-OPQ | 0.3268 | 0.5939 | 0.8651 |
| ScaNN | 0.1795 | 0.3516 | 0.6409 |
| LibVQ(distill_index) | 0.3435 | 0.6203 | 0.8825 |
| LibVQ(distill_index_nolabel) | 0.3467 | 0.6180 | 0.8849 |
| LibVQ(distill_index-and-query-encoder) | 0.3446 | 0.6201 | 0.8837 |
| LibVQ(distill_index-and-two-encoders) | 0.3475 | 0.6223 | 0.8901 |