A large-scale complex question answering evaluation of ChatGPT and similar large-language models
A framework for detailed evaluation of the ability of ChatGPT and similar large-scale language models to answer complex questions.
This repository is a subproject of KSESEU.
If you use the code, please cite the following paper:
This package is mainly contributed by Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Guilin Qi.
To evaluate ChatGPT's ability to answer complex questions, we propose an evaluation framework. First, we classify the latent features that constitute complex questions and describe each question under test with multiple labels that identify combinatorial reasoning. Second, following the black-box test specification of CheckList proposed by Microsoft, we design an evaluation method that introduces CoT hints to measure the reasoning functionality and reliability of large language models in answering complex questions. Our evaluation uses 8 real complex question answering datasets, including six English datasets and two multilingual datasets, to further analyze the potential impact of language bias. We compare the evaluation results of ChatGPT, GPT3.5, GPT3, and T5 to identify persistent historical issues in LLMs. All data and results are available for further analysis.
Given that the training data for large language models (LLMs) extensively covers Wikipedia, we evaluate the models on open-domain complex question answering datasets related to Wikipedia. Specifically, we curated a set of 7 different datasets for this purpose: WebQuestionSP, ComplexWebQuestion, GraphQA, GrailQA, KQApro, QALD-9, and MKQA. The comparison models are GPT3 davinci-1, GPT3.5 davinci-2/davinci-3, and T5.
Monolingual datasets | Source | Paper |
---|---|---|
WebQuestionSP(WQSP) | download_url | paper_url |
ComplexWebQuestion(CWQ) | download_url | paper_url |
GraphQA | download_url | paper_url |
GrailQA | download_url | paper_url |
KQApro | download_url | paper_url |
Multilingual dataset
Multilingual datasets | Source | Paper |
---|---|---|
QALD-9 | download_url | paper_url |
MKQA | download_url | paper_url |
We make the resources of datasets public and classify them according to dataset type and model type.
Please visit this folder for specific information classified by dataset and this folder for specific information classified by model. The folder contains the detailed structure and organization of our dataset.
We assess the LLM's ability to handle each feature in the CQA scenario through the Minimum Functionality Test (MFT). We classify the answer types into 9 categories: Mixed fact (MISC); Reason (WHY); Location (LOC); Time (DATE/TIME); Person (PER); Yes or no (Boolean); Number (NUM); Organization (ORG); Unable to answer (UNA).
We also divide the "reasoning type" labels into eight categories: SetOperation; Filtering; Counting; The most valuable; Sort; Single-hop; Multi-hop; Star-shape.
We also take into account the "language type" label, which may affect model performance: de; ru; pt; hi_IN; en; Fa; it; fr; ro; es; nl; pt_BR.
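The three label families above can be pictured as a small record attached to each question. The following is only a sketch: the class name `QuestionLabels`, its field names, and the assumption that a question carries one answer type and one or more reasoning types are illustrative and not taken from the repository.

```python
from dataclasses import dataclass
from typing import List

# Label vocabularies copied from the lists above.
ANSWER_TYPES = {"MISC", "WHY", "LOC", "DATE/TIME", "PER",
                "Boolean", "NUM", "ORG", "UNA"}
REASONING_TYPES = {"SetOperation", "Filtering", "Counting", "The most valuable",
                   "Sort", "Single-hop", "Multi-hop", "Star-shape"}
LANGUAGE_TYPES = {"de", "ru", "pt", "hi_IN", "en", "Fa",
                  "it", "fr", "ro", "es", "nl", "pt_BR"}


@dataclass
class QuestionLabels:
    """Hypothetical container for the labels of one test question."""
    question: str
    answer_type: str
    reasoning_types: List[str]
    language: str = "en"

    def __post_init__(self):
        # Reject labels that are not part of the vocabularies above.
        if self.answer_type not in ANSWER_TYPES:
            raise ValueError(f"unknown answer type: {self.answer_type}")
        unknown = set(self.reasoning_types) - REASONING_TYPES
        if unknown:
            raise ValueError(f"unknown reasoning types: {sorted(unknown)}")
        if self.language not in LANGUAGE_TYPES:
            raise ValueError(f"unknown language type: {self.language}")


example = QuestionLabels(
    question="How many rivers flow through more than two countries?",
    answer_type="NUM",
    reasoning_types=["Counting", "Filtering"],
)
```

Keeping the reasoning types as a list is what allows a single question to be counted under several features at once.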
To make answer matching generalize better, we adopt a simple idea: widen the matching range. This involves the following two operations:
- Use the subtree labels provided by the constituency parse tree to extract a list of noun phrases.
- Apply exact matching between the noun phrase list and the answer list.
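The exact-matching step can be sketched as follows. This assumes the noun phrases have already been extracted by a constituency parser upstream; the function names and the normalization (lowercasing and collapsing whitespace) are assumptions for illustration, not the repository's implementation.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(noun_phrases: Iterable[str], gold_answers: Iterable[str]) -> bool:
    """Return True if any extracted noun phrase equals any gold answer."""
    candidates = {normalize(p) for p in noun_phrases}
    golds = {normalize(a) for a in gold_answers}
    return not candidates.isdisjoint(golds)


# Hypothetical phrases extracted from a model response.
phrases = ["Barack  Obama", "the 44th president"]
matched = exact_match(phrases, ["barack obama"])
```

Matching against a list of phrases rather than the whole response is what widens the matching range: a long free-text answer counts as correct as soon as one of its noun phrases equals a gold answer.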
For samples that remain unmatched, we compute the cosine similarity between phrase vectors and apply a threshold to find potentially correct matches. Candidates above the threshold are then judged manually as right or wrong.
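A minimal sketch of this fallback step is given below. The phrase vectors are assumed to come from some embedding model that is not specified here, and the threshold value `0.5` and the function names are placeholders rather than the settings used in the evaluation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidates_for_review(
    phrase_vecs: Dict[str, Sequence[float]],
    answer_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # placeholder value, not the one used in the paper
) -> List[Tuple[str, str, float]]:
    """Collect (phrase, answer, similarity) pairs above the threshold.

    The returned pairs are only *potential* matches and still need
    a manual right/wrong judgment.
    """
    pairs = []
    for phrase, pv in phrase_vecs.items():
        for answer, av in answer_vecs.items():
            sim = cosine_similarity(pv, av)
            if sim > threshold:
                pairs.append((phrase, answer, sim))
    # Most similar pairs first, so reviewers see the likeliest matches early.
    return sorted(pairs, key=lambda t: t[2], reverse=True)


# Toy two-dimensional vectors, purely for illustration.
review_queue = candidates_for_review(
    {"NYC": [1.0, 0.0]},
    {"New York City": [1.0, 0.1], "Paris": [0.0, 1.0]},
)
```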
The invariance test adds perturbations to the original sentence that should not change the model's output. Its main purpose is to verify that ChatGPT keeps its answer unchanged when such perturbations are added. We mainly use two methods to perform the invariance test:
- Change the spelling of words in the sentence: to imitate human typing habits, we apply random letter repetition, random letter omission, and stemming to words.
- Paraphrase the sentence without changing its original meaning, and check whether the result changes.
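The spelling perturbations can be sketched as below. This is a rough illustration only: the function names, the perturbation rate, and the rule of touching only alphabetic words longer than three characters are assumptions, and stemming is left out because it needs an external stemmer.

```python
import random


def repeat_random_letter(word: str, rng: random.Random) -> str:
    """Duplicate one randomly chosen letter, e.g. 'river' -> 'rivver'."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[: i + 1] + word[i] + word[i + 1:]


def omit_random_letter(word: str, rng: random.Random) -> str:
    """Drop one randomly chosen letter, e.g. 'river' -> 'rier'."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def perturb_sentence(sentence: str, rate: float = 0.3, seed: int = 0) -> str:
    """Apply a random typo to roughly `rate` of the longer alphabetic words."""
    rng = random.Random(seed)  # seeded so a perturbed test set is reproducible
    out = []
    for word in sentence.split():
        if word.isalpha() and len(word) > 3 and rng.random() < rate:
            typo = rng.choice([repeat_random_letter, omit_random_letter])
            word = typo(word, rng)
        out.append(word)
    return " ".join(out)


original = "which river flows through the capital of Austria"
perturbed = perturb_sentence(original, rate=0.5, seed=42)
```

The model's answers to `original` and `perturbed` can then be compared; under the invariance test they are expected to agree.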
The Directional Expectation test perturbs the input in ways whose expected effect is known, and evaluates whether the final result moves in the expected direction. We mainly conduct directional expectation tests from three aspects:
- Reasoning types: experiments mainly cover the SetOperation, Filtering, Counting, and comparison (most value and sorting) types.
- Answer-type guidance: prompt the model with the expected answer type for the question, then evaluate whether the type of the returned answer matches the prompted type.
- Step-by-step guidance: first ask about each noun or noun phrase in the sentence, then ask the original question again and evaluate whether the accuracy of the answer has improved.
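The answer-type and step-by-step guidance can be sketched as simple prompt builders. The prompt wording, the hint dictionary, and the function names below are hypothetical and are not the prompts used in the evaluation; the noun phrases are assumed to be extracted beforehand.

```python
from typing import Dict, List

# Hypothetical natural-language descriptions for a few answer-type labels.
ANSWER_TYPE_HINTS: Dict[str, str] = {
    "LOC": "a location",
    "DATE/TIME": "a date or time",
    "PER": "a person",
    "NUM": "a number",
    "ORG": "an organization",
    "Boolean": "yes or no",
}


def answer_type_prompt(question: str, answer_type: str) -> str:
    """Append a hint about the expected answer type to the question."""
    hint = ANSWER_TYPE_HINTS[answer_type]
    return f"{question} The answer should be {hint}."


def step_by_step_prompts(question: str, noun_phrases: List[str]) -> List[str]:
    """Ask about each noun phrase first, then repeat the original question."""
    prompts = [f'What is "{phrase}"?' for phrase in noun_phrases]
    prompts.append(question)
    return prompts


question = "Where was the director of Inception born?"
typed = answer_type_prompt(question, "LOC")
dialogue = step_by_step_prompts(question, ["the director", "Inception"])
```

In the step-by-step case the prompts would be sent in order within one conversation, so that the final question is answered with the earlier clarifications in context.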