feat: Add llm consistency eval tool v1 (#70)
* feat: Add llm consistency eval tool v1

* Revise the calculation of consistency

* Update the checklist evaluation tool for demo

* fix: fix the user prompt, store the report when running function '

* Update consistency evaluation tool to take TestEvaluator as input

* Create class ConsistencyEvaluator for consistency evaluation

* Refactor the consistency tool code into Python script files

* Update docstring and variable name for ConsistencyEvaluator

* move `llm_eval/` into `modules/`

---------

Co-authored-by: SoloSynth1 <solosynth1@gmail.com>
tonyshumlh and SoloSynth1 authored May 21, 2024
1 parent 37e8253 commit a06411d
Showing 3 changed files with 406 additions and 0 deletions.
351 changes: 351 additions & 0 deletions src/test_creation/modules/llm_eval/01_checklist_eval_consistency.ipynb
@@ -0,0 +1,351 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d4b9be5c-c6c2-4fce-9bca-815f8772443a",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"id": "78741377-167e-41d6-9542-c3593c0079ff",
"metadata": {},
"source": [
"This Jupyter notebook is a tool to evaluate the consistency of ML Test Evaluation performed by ChatGPT based on the research paper (Alexander, R., Katz, L., Moore, C., & Schwartz, Z. (2023)). \\\n",
"It serves the purpose of evaluating the application performance before and after changes (e.g. checklist modification, model setting changes)."
]
},
{
"cell_type": "markdown",
"id": "f427397a-321d-4ba8-ba63-512e18eea528",
"metadata": {},
"source": [
"### Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "bcb67c7b-0b04-4f5f-a42b-7b31dcd963bc",
"metadata": {},
"source": [
"import sys\n",
"sys.path.append(\"../test_creation/\")"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 42,
"id": "12b6521a-4c59-4c34-ae5f-720706d2f1e8",
"metadata": {},
"source": [
"from analyze import TestEvaluator\n",
"\n",
"import pandas as pd"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "409a4642-ffbb-49a8-ab2c-8503d6bc58aa",
"metadata": {},
"source": [],
"outputs": []
},
{
"cell_type": "markdown",
"id": "9d2139ca-98d3-4253-a2e1-3ecdce1a1018",
"metadata": {},
"source": [
"## Inputs"
]
},
{
"cell_type": "markdown",
"id": "4a03e63c-fb18-4361-b117-aa2355b7f5bb",
"metadata": {},
"source": [
"Please specify the `test_functions_directory` below to load the ML test code base, the parameters, e.g. checklist, and the corresponding models to for evaluation"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "d4408d48-9590-444c-8725-52b06363fdda",
"metadata": {},
"source": [
"models = []"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 3,
"id": "74036e8d-1dee-459b-bdf7-fa797b262e2f",
"metadata": {},
"source": [
"test_functions_directory = '../../../lightfm/tests'"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 5,
"id": "210373f1-9354-434c-9103-5c4b767b14c4",
"metadata": {},
"source": [
"# temperatures = [0.1]\n",
"# models = ['gpt-3.5-turbo']"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 26,
"id": "457946ba-0723-45a1-9491-c40ede92992b",
"metadata": {},
"source": [
"checklist_directory = '../../checklist/checklist_demo.yaml'"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 27,
"id": "cc01e5b0-f3d7-4230-b413-e94372d88634",
"metadata": {},
"source": [
"name = 'checklist_demo_1'\n",
"evaluator = TestEvaluator(test_functions_directory)\n",
"evaluator.load_checklist(checklist_directory)\n",
"models.append({'name': name, 'model': evaluator})"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 28,
"id": "46898c0d-dd18-4812-a6ba-fced01aabcc9",
"metadata": {},
"source": [
"name = 'checklist_demo_2'\n",
"evaluator = TestEvaluator(test_functions_directory)\n",
"evaluator.load_checklist(checklist_directory)\n",
"models.append({'name': name, 'model': evaluator})"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 29,
"id": "7f1a285c-f50e-423d-9d54-be0e15190244",
"metadata": {},
"source": [
"models"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 30,
"id": "fccbbab1-65de-495e-8071-38021e67cb4c",
"metadata": {},
"source": [
"pd.DataFrame(models)"
],
"outputs": []
},
{
"cell_type": "markdown",
"id": "bc9dee5f-fe3a-42ec-9daf-28ca7830d388",
"metadata": {},
"source": [
"## API Running"
]
},
{
"cell_type": "markdown",
"id": "202ba74b-6ac9-4d6b-a11a-71f17a21614f",
"metadata": {},
"source": [
"Incorporate the data, prompts and parameters, feed into OpenAI API for test runs and fetch responses."
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "f36cdd4a-6afe-4b39-8a65-46261ebaab16",
"metadata": {},
"source": [
"# # Clone the model to make sure that all the test runs are independent.\n",
"# import copy\n",
"# model_temp = copy.copy(models[0]['model'])"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 69,
"id": "82f59d57-1081-41c0-884a-fb8fc8ba11f0",
"metadata": {},
"source": [
"class ConsistencyEvaluator:\n",
" def __init__(self):\n",
" self.evaluation_reports = None\n",
"\n",
" def evaluate(self, models, num_test_runs=2, verbose=False):\n",
" \"\"\"\n",
" Input the initialized TestEvaluator models, test run `num_test_runs` times to obtain the result\n",
" models = [{'name': 'model_no1', 'model': {{model object}}}, ...]\n",
" \"\"\"\n",
" results = []\n",
" for item in models:\n",
" if verbose:\n",
" print(f'Model: {item['name']}')\n",
" \n",
" for test_no in range(num_test_runs):\n",
" if verbose:\n",
" print(f'Test Run No.: {test_no+1}')\n",
" \n",
" result = dict()\n",
" model = item['model']\n",
" model.evaluate()\n",
" \n",
" result['score'] = model.get_completeness_score(score_format='number')\n",
" result['report'] = model.evaluation_report\n",
" result['model_name'] = item['name']\n",
" result['test_no'] = test_no+1\n",
" results.append(result)\n",
" self.evaluation_reports = pd.DataFrame(results)\n",
" return\n",
"\n",
" def get_completeness_score_dist(self):\n",
" \"\"\"\n",
" Obtain the distribution of the Test Completeness scores\n",
" \"\"\"\n",
" completeness_score_df = self.evaluation_reports.drop(columns='report')\n",
" completeness_score_df = completeness_score_df.pivot(index='model_name', columns='test_no', values='score')\n",
" return completeness_score_df\n",
"\n",
" def get_consistency_dist(self):\n",
" \"\"\"\n",
" Obtain the distribution of the consistency per checklist item\n",
" \"\"\"\n",
" consistency_df = pd.DataFrame()\n",
" for idx in self.evaluation_reports.index:\n",
" result = self.evaluation_reports.iloc[idx]['report'].reset_index()\n",
" result['test_no'] = self.evaluation_reports.iloc[idx]['test_no']\n",
" result['model_name'] = self.evaluation_reports.iloc[idx]['model_name']\n",
" consistency_df = pd.concat([consistency_df, result], axis = 0, ignore_index=True)\n",
" consistency_df = consistency_df.pivot(index=['model_name', 'ID'], columns=['test_no'], values=['is_Satisfied'])\n",
" consistency_df.columns = consistency_df.columns.droplevel(level=0)\n",
" consistency_df['consistency'] = consistency_df.eq(consistency_df.iloc[:, 0], axis=0).all(1)\n",
" return consistency_df"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 66,
"id": "fa7e37b6-c946-4fbb-a166-346d4101b528",
"metadata": {},
"source": [
"consistency_evaluator = ConsistencyEvaluator()\n",
"consistency_evaluator.evaluate(models, num_test_runs=5, verbose=True)"
],
"outputs": []
},
{
"cell_type": "markdown",
"id": "24ce51c6-0dbf-4fa7-8722-5bef38e0224a",
"metadata": {},
"source": [
"## Result & Evaluation"
]
},
{
"cell_type": "markdown",
"id": "fc750f68-4fa7-4264-a270-2f3cc0ea667c",
"metadata": {},
"source": [
"The evaluation will be based on 2 metrics calculated from the response:\n",
"- Completeness Score distribution: The distribution of the `num_test_runs` completeness scores per each set of parameters\n",
"- Consistency Score: Out of all `checklist` items, the proportion of results remain consistent among `num_test_runs` runs per each set of parameters"
]
},
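{
"cell_type": "markdown",
"id": "consistency-toy-example-note",
"metadata": {},
"source": [
"Below is a minimal sketch (hypothetical data, not produced by the tool) of how the per-item `consistency` flag and the overall Consistency Score follow from a pivoted results table, mirroring the logic in `ConsistencyEvaluator.get_consistency_dist()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "consistency-toy-example",
"metadata": {},
"source": [
"# Toy example (hypothetical data): each column is one test run,\n",
"# each row is one checklist item's is_Satisfied result.\n",
"toy = pd.DataFrame({\n",
" 1: [True, True, False], # run 1 results per checklist item\n",
" 2: [True, False, False], # run 2 results per checklist item\n",
"}, index=pd.Index(['2.1', '3.2', '4.1'], name='ID'))\n",
"\n",
"# An item is consistent when every run matches the first run's result.\n",
"toy['consistency'] = toy.eq(toy.iloc[:, 0], axis=0).all(axis=1)\n",
"\n",
"# Consistency Score: proportion of checklist items with consistent results (2/3 here).\n",
"toy['consistency'].mean()"
],
"outputs": []
},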
{
"cell_type": "code",
"execution_count": 67,
"id": "22ea8a4d-1f08-4f4d-9c80-4dc56ac16b14",
"metadata": {},
"source": [
"consistency_evaluator.get_completeness_score_dist()"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 51,
"id": "dd191f9e-6cb6-4964-b060-7976f1529edd",
"metadata": {},
"source": [
"# import matplotlib\n",
"# completeness_score_df.plot(kind='box')"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 68,
"id": "391b097d-df72-4490-a91d-7d0858852ad5",
"metadata": {},
"source": [
"consistency_evaluator.get_consistency_dist()"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ca45a592-7490-4f86-8357-0707ef81e0e9",
"metadata": {},
"source": [
"# consistency_df.groupby(['model_name']).agg({'consistency': 'mean'})"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b5c0b4b-ca68-4d44-9cb7-977e08abadb5",
"metadata": {},
"source": [],
"outputs": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:test-creation]",
"language": "python",
"name": "conda-env-test-creation-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}