Official implementation of our QAEncoder method for more advanced QA systems.
Affiliation: Peking University, Institute for Advanced Algorithms Research, Shanghai
Modern QA systems rely on retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. We introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to effectively distinguish these embeddings. Extensive experiments across diverse datasets, languages, and embedding models confirm QAEncoder's alignment capability, which offers a simple yet effective solution with zero additional index storage, retrieval latency, training costs, or risk of hallucination.
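The core idea can be sketched in a few lines: embed a set of potential queries for a document, average them, and interpolate with the original document embedding. This is a minimal illustration under our own assumptions — the embedding stub and the exact interpolation rule stand in for a real model and for the paper's formulation:

```python
import hashlib
import numpy as np

def embed(text):
    # Hypothetical stand-in for a real embedding model (e.g. bge-large-en-v1.5):
    # a deterministic pseudo-embedding derived from the text, for illustration only.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(64)
    return v / np.linalg.norm(v)

def qae_emb(doc, queries, alpha=0.5):
    # Surrogate document embedding: interpolate the document embedding with the
    # mean of potential-query embeddings. alpha=1 keeps the plain document
    # embedding; alpha=0 uses only the query mean. This interpolation rule is
    # our simplification, not necessarily the paper's exact formula.
    q_mean = np.mean([embed(q) for q in queries], axis=0)
    v = alpha * embed(doc) + (1 - alpha) * q_mean
    return v / np.linalg.norm(v)
```

At retrieval time, user queries are embedded as usual and matched against these surrogate vectors, so no extra index storage or retrieval latency is introduced.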
Set up the environment and run the demo script:
git clone https://github.com/IAAR-Shanghai/QAEncoder.git
cd QAEncoder
conda create -n QAE python=3.10
conda activate QAE
pip install -r requirements-demo.txt
python demo.py # network access is also required
The output should look like:
You can change the embedding models, languages, documents, and potential queries to verify our hypothesis.
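The hypothesis — that a document's potential queries cluster around a common center, and that this center is a better retrieval target than the raw document embedding — can be checked with synthetic geometry. Everything below is a toy construction of ours, not real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

dim = 64
# Toy geometry (our assumption, for illustration): queries about a document
# cluster around a center that is shifted away from the document embedding.
center = unit(rng.standard_normal(dim))                         # query cluster center
doc_emb = unit(center + unit(rng.standard_normal(dim)))         # query-document gap

def sample_query():
    return unit(center + 0.3 * unit(rng.standard_normal(dim)))

observed = [sample_query() for _ in range(10)]
q_mean = unit(np.mean(observed, axis=0))   # QAEncoder-style surrogate
new_q = sample_query()                     # an unseen user query

sim_to_mean = float(new_q @ q_mean)        # higher: surrogate matches unseen queries
sim_to_doc = float(new_q @ doc_emb)        # lower: raw document embedding is off-center
```

In this toy setup the unseen query is markedly closer to the mean of observed queries than to the document embedding, which is the effect the demo lets you verify with real models.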
We currently provide the core datasets and code to reproduce the results on FIGNEWS. The instructions are as follows:
cd FIGNEWS
pip install -r requirements-fignews.txt
pip uninstall llama-index-core
pip install llama-index-core==0.11.1 # reinstall to avoid subtle bugs
mkdir model output; unzip data.zip # setup datasets
python download_model.py # Download bge-large-en-v1.5 model for alignment
python QAE.py --method QAE_emb --alpha_value 0.0 --dataset_name figEnglish
python QAE.py --method QAE_emb --alpha_value 0.5 --dataset_name figEnglish
python QAE.py --method QAE_hyb --alpha_value 0.15 --beta_value 1.5 --dataset_name figEnglish
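The three invocations above vary the interpolation weights. If you want to sweep more values, the command lines are easy to generate programmatically; this sketch only reuses the flag names shown above, and everything else (e.g. that `QAE.py` prints its metrics and exits cleanly) is an assumption:

```python
import subprocess  # only needed if you actually execute the commands

def sweep_commands(alphas, dataset="figEnglish"):
    # Build one QAE_emb command line per alpha value, mirroring the calls above.
    return [["python", "QAE.py", "--method", "QAE_emb",
             "--alpha_value", str(a), "--dataset_name", dataset]
            for a in alphas]

cmds = sweep_commands([0.0, 0.25, 0.5])
# To execute: for cmd in cmds: subprocess.run(cmd, check=True)
```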
The results should look like:
For fast query generation, these online interfaces are recommended.
The following standard prompt is extracted from the LlamaIndex workflow; see the blog or docs for details.
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge.
generate only questions based on the below query.
You are a Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. The questions should not contain options, not start with Q1/ Q2. Restrict the questions to the context information provided.
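The template's placeholders are filled in before the prompt is sent to the LLM; a minimal sketch (the helper name is our own, not part of the repo):

```python
STANDARD_PROMPT = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge.\n"
    "generate only questions based on the below query.\n"
    "You are a Professor. Your task is to setup {num_questions_per_chunk} "
    "questions for an upcoming quiz/examination. The questions should be "
    "diverse in nature across the document. The questions should not contain "
    "options, not start with Q1/ Q2. Restrict the questions to the context "
    "information provided."
)

def build_prompt(context_str, num_questions_per_chunk=5):
    # Fill the template; the resulting string is what gets sent to the LLM.
    return STANDARD_PROMPT.format(
        context_str=context_str,
        num_questions_per_chunk=num_questions_per_chunk,
    )
```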
A simpler prompt can be:
Context information is below.
-----------------
{context_str}
-----------------
You are an expert in text comprehension and question formulation, tasked with generating {num_questions_per_chunk} high-quality questions, based solely on the context.
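Either prompt returns free-form text, so the generated questions still need to be pulled out of the model's reply. A small parser of our own (not part of the repo) that strips enumerators and keeps question lines:

```python
import re

def parse_questions(llm_output):
    # Split the model's raw output into a clean list of questions, dropping
    # blank lines and leading enumeration such as "1.", "2)", "-", or "*".
    questions = []
    for line in llm_output.splitlines():
        line = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if line.endswith("?"):
            questions.append(line)
    return questions
```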
This work is currently under review and the code is being refactored. We plan to fully open-source our project in the following order:
- Release Demo
- Release QAEncoder core codes and datasets
- Release QAEncoder codes compatible with Llamaindex and Langchain
- Release QAEncoder++, our future works
@article{wang2024qaencoder,
title={QAEncoder: Towards Aligned Representation Learning in Question Answering Systems},
author={Wang, Zhengren and Yu, Qinhan and Wei, Shida and Li, Zhiyu and Xiong, Feiyu and Wang, Xiaoxing and Niu, Simin and Liang, Hao and Zhang, Wentao},
journal={arXiv preprint arXiv:2409.20434},
year={2024}
}