feat: Add llm consistency eval tool v1 (#70)
* feat: Add llm consistency eval tool v1
* Revise the calculation of consistency
* Update the checklist evaluation tool for demo
* fix: fix the user prompt, store the report when running function '
* Update consistency evaluation tool to take TestEvaluator as input
* Create class ConsistencyEvaluator for consistency evaluation
* Refactor the consistency tool code into Python script files
* Update docstring and variable name for ConsistencyEvaluator
* move `llm_eval/` into `modules/`

---------

Co-authored-by: SoloSynth1 <solosynth1@gmail.com>
1 parent 37e8253, commit a06411d
Showing 3 changed files with 406 additions and 0 deletions.
src/test_creation/modules/llm_eval/01_checklist_eval_consistency.ipynb (351 additions, 0 deletions)
@@ -0,0 +1,351 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d4b9be5c-c6c2-4fce-9bca-815f8772443a",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78741377-167e-41d6-9542-c3593c0079ff",
   "metadata": {},
   "source": [
"This Jupyter notebook is a tool to evaluate the consistency of ML Test Evaluation performed by ChatGPT based on the research paper (Alexander, R., Katz, L., Moore, C., & Schwartz, Z. (2023)). \\\n", | ||
"It serves the purpose of evaluating the application performance before and after changes (e.g. checklist modification, model setting changes)." | ||
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f427397a-321d-4ba8-ba63-512e18eea528",
   "metadata": {},
   "source": [
    "### Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bcb67c7b-0b04-4f5f-a42b-7b31dcd963bc",
   "metadata": {},
   "source": [
    "import sys\n",
    "sys.path.append(\"../test_creation/\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "12b6521a-4c59-4c34-ae5f-720706d2f1e8",
   "metadata": {},
   "source": [
    "from analyze import TestEvaluator\n",
    "\n",
    "import pandas as pd"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "409a4642-ffbb-49a8-ab2c-8503d6bc58aa",
   "metadata": {},
   "source": [],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "9d2139ca-98d3-4253-a2e1-3ecdce1a1018",
   "metadata": {},
   "source": [
    "## Inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a03e63c-fb18-4361-b117-aa2355b7f5bb",
   "metadata": {},
   "source": [
"Please specify the `test_functions_directory` below to load the ML test code base, the parameters, e.g. checklist, and the corresponding models to for evaluation" | ||
] | ||
}, | ||
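  {
   "cell_type": "markdown",
   "id": "stub-evaluator-md",
   "metadata": {},
   "source": [
    "For a dry run of the consistency tooling without any OpenAI API calls, a stand-in evaluator can be appended to `models` instead. This is a minimal illustrative sketch, not part of the tool: the class name `StubTestEvaluator` and its checklist IDs are hypothetical, and it only assumes the `TestEvaluator` interface exercised below (`load_checklist()`, `evaluate()`, `get_completeness_score()`, and an `evaluation_report` indexed by checklist `ID` with an `is_Satisfied` column)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "stub-evaluator-code",
   "metadata": {},
   "source": [
    "# Hypothetical stand-in for TestEvaluator (offline dry runs only; no API calls).\n",
    "# It mimics the interface the consistency tooling below relies on.\n",
    "import random\n",
    "\n",
    "class StubTestEvaluator:\n",
    "    def __init__(self, test_functions_directory):\n",
    "        self.test_functions_directory = test_functions_directory\n",
    "        self.checklist_ids = []\n",
    "        self.evaluation_report = None\n",
    "\n",
    "    def load_checklist(self, checklist_directory):\n",
    "        # Pretend the checklist has three items (made-up IDs).\n",
    "        self.checklist_ids = ['2.1', '3.2', '5.3']\n",
    "\n",
    "    def evaluate(self):\n",
    "        # Randomly mark each checklist item as satisfied (1) or not (0).\n",
    "        self.evaluation_report = pd.DataFrame({\n",
    "            'ID': self.checklist_ids,\n",
    "            'is_Satisfied': [random.choice([0, 1]) for _ in self.checklist_ids],\n",
    "        }).set_index('ID')\n",
    "\n",
    "    def get_completeness_score(self, score_format='number'):\n",
    "        # Completeness here is simply the number of satisfied items.\n",
    "        return float(self.evaluation_report['is_Satisfied'].sum())"
   ],
   "outputs": []
  },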
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "d4408d48-9590-444c-8725-52b06363fdda",
   "metadata": {},
   "source": [
    "models = []"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "74036e8d-1dee-459b-bdf7-fa797b262e2f",
   "metadata": {},
   "source": [
    "test_functions_directory = '../../../lightfm/tests'"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "210373f1-9354-434c-9103-5c4b767b14c4",
   "metadata": {},
   "source": [
    "# temperatures = [0.1]\n",
    "# models = ['gpt-3.5-turbo']"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "457946ba-0723-45a1-9491-c40ede92992b",
   "metadata": {},
   "source": [
    "checklist_directory = '../../checklist/checklist_demo.yaml'"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "cc01e5b0-f3d7-4230-b413-e94372d88634",
   "metadata": {},
   "source": [
    "name = 'checklist_demo_1'\n",
    "evaluator = TestEvaluator(test_functions_directory)\n",
    "evaluator.load_checklist(checklist_directory)\n",
    "models.append({'name': name, 'model': evaluator})"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "46898c0d-dd18-4812-a6ba-fced01aabcc9",
   "metadata": {},
   "source": [
    "name = 'checklist_demo_2'\n",
    "evaluator = TestEvaluator(test_functions_directory)\n",
    "evaluator.load_checklist(checklist_directory)\n",
    "models.append({'name': name, 'model': evaluator})"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "7f1a285c-f50e-423d-9d54-be0e15190244",
   "metadata": {},
   "source": [
    "models"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "fccbbab1-65de-495e-8071-38021e67cb4c",
   "metadata": {},
   "source": [
    "pd.DataFrame(models)"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "bc9dee5f-fe3a-42ec-9daf-28ca7830d388",
   "metadata": {},
   "source": [
    "## API Running"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "202ba74b-6ac9-4d6b-a11a-71f17a21614f",
   "metadata": {},
   "source": [
"Incorporate the data, prompts and parameters, feed into OpenAI API for test runs and fetch responses." | ||
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "f36cdd4a-6afe-4b39-8a65-46261ebaab16",
   "metadata": {},
   "source": [
    "# # Clone the model to make sure that all the test runs are independent.\n",
    "# # Note: copy.copy() is shallow; copy.deepcopy() would be needed for a fully independent clone.\n",
    "# import copy\n",
    "# model_temp = copy.copy(models[0]['model'])"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "82f59d57-1081-41c0-884a-fb8fc8ba11f0",
   "metadata": {},
   "source": [
"class ConsistencyEvaluator:\n", | ||
" def __init__(self):\n", | ||
" self.evaluation_reports = None\n", | ||
"\n", | ||
" def evaluate(self, models, num_test_runs=2, verbose=False):\n", | ||
" \"\"\"\n", | ||
" Input the initialized TestEvaluator models, test run `num_test_runs` times to obtain the result\n", | ||
" models = [{'name': 'model_no1', 'model': {{model object}}}, ...]\n", | ||
" \"\"\"\n", | ||
" results = []\n", | ||
" for item in models:\n", | ||
" if verbose:\n", | ||
" print(f'Model: {item['name']}')\n", | ||
" \n", | ||
" for test_no in range(num_test_runs):\n", | ||
" if verbose:\n", | ||
" print(f'Test Run No.: {test_no+1}')\n", | ||
" \n", | ||
" result = dict()\n", | ||
" model = item['model']\n", | ||
" model.evaluate()\n", | ||
" \n", | ||
" result['score'] = model.get_completeness_score(score_format='number')\n", | ||
" result['report'] = model.evaluation_report\n", | ||
" result['model_name'] = item['name']\n", | ||
" result['test_no'] = test_no+1\n", | ||
" results.append(result)\n", | ||
" self.evaluation_reports = pd.DataFrame(results)\n", | ||
" return\n", | ||
"\n", | ||
" def get_completeness_score_dist(self):\n", | ||
" \"\"\"\n", | ||
" Obtain the distribution of the Test Completeness scores\n", | ||
" \"\"\"\n", | ||
" completeness_score_df = self.evaluation_reports.drop(columns='report')\n", | ||
" completeness_score_df = completeness_score_df.pivot(index='model_name', columns='test_no', values='score')\n", | ||
" return completeness_score_df\n", | ||
"\n", | ||
" def get_consistency_dist(self):\n", | ||
" \"\"\"\n", | ||
" Obtain the distribution of the consistency per checklist item\n", | ||
" \"\"\"\n", | ||
" consistency_df = pd.DataFrame()\n", | ||
" for idx in self.evaluation_reports.index:\n", | ||
" result = self.evaluation_reports.iloc[idx]['report'].reset_index()\n", | ||
" result['test_no'] = self.evaluation_reports.iloc[idx]['test_no']\n", | ||
" result['model_name'] = self.evaluation_reports.iloc[idx]['model_name']\n", | ||
" consistency_df = pd.concat([consistency_df, result], axis = 0, ignore_index=True)\n", | ||
" consistency_df = consistency_df.pivot(index=['model_name', 'ID'], columns=['test_no'], values=['is_Satisfied'])\n", | ||
" consistency_df.columns = consistency_df.columns.droplevel(level=0)\n", | ||
" consistency_df['consistency'] = consistency_df.eq(consistency_df.iloc[:, 0], axis=0).all(1)\n", | ||
" return consistency_df" | ||
], | ||
"outputs": [] | ||
}, | ||
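  {
   "cell_type": "markdown",
   "id": "consistency-toy-md",
   "metadata": {},
   "source": [
    "As a quick sanity check of the pivot logic in `get_consistency_dist()`, here is a minimal sketch on hand-made data. The model name, checklist IDs, and `is_Satisfied` values below are made up for illustration; only the pandas operations mirror the class above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "consistency-toy-code",
   "metadata": {},
   "source": [
    "# Toy illustration (made-up data) of the consistency check in get_consistency_dist():\n",
    "# rows are (model_name, checklist ID) pairs, columns are test runs, and an item\n",
    "# is consistent when every run produced the same is_Satisfied value.\n",
    "toy = pd.DataFrame({\n",
    "    'model_name': ['m1'] * 4,\n",
    "    'ID': ['2.1', '2.1', '3.2', '3.2'],\n",
    "    'test_no': [1, 2, 1, 2],\n",
    "    'is_Satisfied': [1, 1, 1, 0],\n",
    "})\n",
    "wide = toy.pivot(index=['model_name', 'ID'], columns='test_no', values='is_Satisfied')\n",
    "wide['consistency'] = wide.eq(wide.iloc[:, 0], axis=0).all(1)\n",
    "wide  # item 2.1 is consistent across both runs; item 3.2 is not"
   ],
   "outputs": []
  },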
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "fa7e37b6-c946-4fbb-a166-346d4101b528",
   "metadata": {},
   "source": [
    "consistency_evaluator = ConsistencyEvaluator()\n",
    "consistency_evaluator.evaluate(models, num_test_runs=5, verbose=True)"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "24ce51c6-0dbf-4fa7-8722-5bef38e0224a",
   "metadata": {},
   "source": [
    "## Result & Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc750f68-4fa7-4264-a270-2f3cc0ea667c",
   "metadata": {},
   "source": [
"The evaluation will be based on 2 metrics calculated from the response:\n", | ||
"- Completeness Score distribution: The distribution of the `num_test_runs` completeness scores per each set of parameters\n", | ||
"- Consistency Score: Out of all `checklist` items, the proportion of results remain consistent among `num_test_runs` runs per each set of parameters" | ||
] | ||
}, | ||
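  {
   "cell_type": "markdown",
   "id": "consistency-score-toy-md",
   "metadata": {},
   "source": [
    "Continuing the hand-made toy example above, a minimal sketch of the Consistency Score calculation: the mean of the per-item consistency flags for each model. The numbers are illustrative only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "consistency-score-toy-code",
   "metadata": {},
   "source": [
    "# Consistency Score sketch on the toy frame above: the proportion of checklist\n",
    "# items whose results agree across all runs, per model.\n",
    "wide.groupby('model_name')['consistency'].mean()\n",
    "# With item 2.1 consistent and item 3.2 not, this yields 0.5 for model 'm1'."
   ],
   "outputs": []
  },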
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "22ea8a4d-1f08-4f4d-9c80-4dc56ac16b14",
   "metadata": {},
   "source": [
    "consistency_evaluator.get_completeness_score_dist()"
   ],
   "outputs": []
  },
  {
"cell_type": "code", | ||
"execution_count": 51, | ||
"id": "dd191f9e-6cb6-4964-b060-7976f1529edd", | ||
"metadata": {}, | ||
"source": [ | ||
"# import matplotlib\n", | ||
"# completeness_score_df.plot(kind='box')" | ||
], | ||
"outputs": [] | ||
}, | ||
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "391b097d-df72-4490-a91d-7d0858852ad5",
   "metadata": {},
   "source": [
    "consistency_evaluator.get_consistency_dist()"
   ],
   "outputs": []
  },
  {
"cell_type": "code", | ||
"execution_count": 20, | ||
"id": "ca45a592-7490-4f86-8357-0707ef81e0e9", | ||
"metadata": {}, | ||
"source": [ | ||
"# consistency_df.groupby(['model_name']).agg({'consistency': 'mean'})" | ||
], | ||
"outputs": [] | ||
}, | ||
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b5c0b4b-ca68-4d44-9cb7-977e08abadb5",
   "metadata": {},
   "source": [],
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:test-creation]",
   "language": "python",
   "name": "conda-env-test-creation-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}