feat: Add llm consistency eval tool v1 (#70)
* feat: Add llm consistency eval tool v1
* Revise the calculation of consistency
* Update the checklist evaluation tool for demo
* fix: fix the user prompt, store the report when running function '
* Update consistency evaluation tool to take TestEvaluator as input
* Create class ConsistencyEvaluator for consistency evaluation
* Refactor the consistency tool code into Python script files
* Update docstring and variable name for ConsistencyEvaluator
* move `llm_eval/` into `modules/`

---------

Co-authored-by: SoloSynth1 <solosynth1@gmail.com>
1 parent 37e8253, commit a06411d
Showing 3 changed files with 406 additions and 0 deletions.
src/test_creation/modules/llm_eval/01_checklist_eval_consistency.ipynb (351 additions, 0 deletions)
@@ -0,0 +1,351 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d4b9be5c-c6c2-4fce-9bca-815f8772443a",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78741377-167e-41d6-9542-c3593c0079ff",
   "metadata": {},
   "source": [
"This Jupyter notebook is a tool to evaluate the consistency of ML Test Evaluation performed by ChatGPT based on the research paper (Alexander, R., Katz, L., Moore, C., & Schwartz, Z. (2023)). \\\n", | ||
"It serves the purpose of evaluating the application performance before and after changes (e.g. checklist modification, model setting changes)." | ||
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f427397a-321d-4ba8-ba63-512e18eea528",
   "metadata": {},
   "source": [
    "### Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bcb67c7b-0b04-4f5f-a42b-7b31dcd963bc",
   "metadata": {},
   "source": [
    "import sys\n",
    "sys.path.append(\"../test_creation/\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "12b6521a-4c59-4c34-ae5f-720706d2f1e8",
   "metadata": {},
   "source": [
    "from analyze import TestEvaluator\n",
    "\n",
    "import pandas as pd"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "409a4642-ffbb-49a8-ab2c-8503d6bc58aa",
   "metadata": {},
   "source": [],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "9d2139ca-98d3-4253-a2e1-3ecdce1a1018",
   "metadata": {},
   "source": [
    "## Inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a03e63c-fb18-4361-b117-aa2355b7f5bb",
   "metadata": {},
   "source": [
"Please specify the `test_functions_directory` below to load the ML test code base, the parameters, e.g. checklist, and the corresponding models to for evaluation" | ||
] | ||
}, | ||
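  {
   "cell_type": "markdown",
   "id": "stub-evaluator-md",
   "metadata": {},
   "source": [
    "For a dry run of the consistency tooling without any OpenAI API calls, a stand-in evaluator can be appended to `models` instead. This is a minimal illustrative sketch, not part of the tool: the class name `StubTestEvaluator` and its checklist IDs are hypothetical, and it only assumes the `TestEvaluator` interface exercised below (`load_checklist()`, `evaluate()`, `get_completeness_score()`, and an `evaluation_report` indexed by checklist `ID` with an `is_Satisfied` column)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "stub-evaluator-code",
   "metadata": {},
   "source": [
    "# Hypothetical stand-in for TestEvaluator (offline dry runs only; no API calls).\n",
    "# It mimics the interface the consistency tooling below relies on.\n",
    "import random\n",
    "\n",
    "class StubTestEvaluator:\n",
    "    def __init__(self, test_functions_directory):\n",
    "        self.test_functions_directory = test_functions_directory\n",
    "        self.checklist_ids = []\n",
    "        self.evaluation_report = None\n",
    "\n",
    "    def load_checklist(self, checklist_directory):\n",
    "        # Pretend the checklist has three items (made-up IDs).\n",
    "        self.checklist_ids = ['2.1', '3.2', '5.3']\n",
    "\n",
    "    def evaluate(self):\n",
    "        # Randomly mark each checklist item as satisfied (1) or not (0).\n",
    "        self.evaluation_report = pd.DataFrame({\n",
    "            'ID': self.checklist_ids,\n",
    "            'is_Satisfied': [random.choice([0, 1]) for _ in self.checklist_ids],\n",
    "        }).set_index('ID')\n",
    "\n",
    "    def get_completeness_score(self, score_format='number'):\n",
    "        # Completeness here is simply the number of satisfied items.\n",
    "        return float(self.evaluation_report['is_Satisfied'].sum())"
   ],
   "outputs": []
  },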
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "d4408d48-9590-444c-8725-52b06363fdda",
   "metadata": {},
   "source": [
    "models = []"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "74036e8d-1dee-459b-bdf7-fa797b262e2f",
   "metadata": {},
   "source": [
    "test_functions_directory = '../../../lightfm/tests'"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "210373f1-9354-434c-9103-5c4b767b14c4",
   "metadata": {},
   "source": [
    "# temperatures = [0.1]\n",
    "# models = ['gpt-3.5-turbo']"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "457946ba-0723-45a1-9491-c40ede92992b",
   "metadata": {},
   "source": [
    "checklist_directory = '../../checklist/checklist_demo.yaml'"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "cc01e5b0-f3d7-4230-b413-e94372d88634",
   "metadata": {},
   "source": [
    "name = 'checklist_demo_1'\n",
    "evaluator = TestEvaluator(test_functions_directory)\n",
    "evaluator.load_checklist(checklist_directory)\n",
    "models.append({'name': name, 'model': evaluator})"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "46898c0d-dd18-4812-a6ba-fced01aabcc9",
   "metadata": {},
   "source": [
    "name = 'checklist_demo_2'\n",
    "evaluator = TestEvaluator(test_functions_directory)\n",
    "evaluator.load_checklist(checklist_directory)\n",
    "models.append({'name': name, 'model': evaluator})"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "7f1a285c-f50e-423d-9d54-be0e15190244",
   "metadata": {},
   "source": [
    "models"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "fccbbab1-65de-495e-8071-38021e67cb4c",
   "metadata": {},
   "source": [
    "pd.DataFrame(models)"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "bc9dee5f-fe3a-42ec-9daf-28ca7830d388",
   "metadata": {},
   "source": [
    "## API Running"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "202ba74b-6ac9-4d6b-a11a-71f17a21614f",
   "metadata": {},
   "source": [
"Incorporate the data, prompts and parameters, feed into OpenAI API for test runs and fetch responses." | ||
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "f36cdd4a-6afe-4b39-8a65-46261ebaab16",
   "metadata": {},
   "source": [
    "# # Clone the model to make sure that all the test runs are independent.\n",
    "# # Note: copy.copy() is shallow; copy.deepcopy() would be needed for a fully independent clone.\n",
    "# import copy\n",
    "# model_temp = copy.copy(models[0]['model'])"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "82f59d57-1081-41c0-884a-fb8fc8ba11f0",
   "metadata": {},
   "source": [
"class ConsistencyEvaluator:\n", | ||
" def __init__(self):\n", | ||
" self.evaluation_reports = None\n", | ||
"\n", | ||
" def evaluate(self, models, num_test_runs=2, verbose=False):\n", | ||
" \"\"\"\n", | ||
" Input the initialized TestEvaluator models, test run `num_test_runs` times to obtain the result\n", | ||
" models = [{'name': 'model_no1', 'model': {{model object}}}, ...]\n", | ||
" \"\"\"\n", | ||
" results = []\n", | ||
" for item in models:\n", | ||
" if verbose:\n", | ||
" print(f'Model: {item['name']}')\n", | ||
" \n", | ||
" for test_no in range(num_test_runs):\n", | ||
" if verbose:\n", | ||
" print(f'Test Run No.: {test_no+1}')\n", | ||
" \n", | ||
" result = dict()\n", | ||
" model = item['model']\n", | ||
" model.evaluate()\n", | ||
" \n", | ||
" result['score'] = model.get_completeness_score(score_format='number')\n", | ||
" result['report'] = model.evaluation_report\n", | ||
" result['model_name'] = item['name']\n", | ||
" result['test_no'] = test_no+1\n", | ||
" results.append(result)\n", | ||
" self.evaluation_reports = pd.DataFrame(results)\n", | ||
" return\n", | ||
"\n", | ||
" def get_completeness_score_dist(self):\n", | ||
" \"\"\"\n", | ||
" Obtain the distribution of the Test Completeness scores\n", | ||
" \"\"\"\n", | ||
" completeness_score_df = self.evaluation_reports.drop(columns='report')\n", | ||
" completeness_score_df = completeness_score_df.pivot(index='model_name', columns='test_no', values='score')\n", | ||
" return completeness_score_df\n", | ||
"\n", | ||
" def get_consistency_dist(self):\n", | ||
" \"\"\"\n", | ||
" Obtain the distribution of the consistency per checklist item\n", | ||
" \"\"\"\n", | ||
" consistency_df = pd.DataFrame()\n", | ||
" for idx in self.evaluation_reports.index:\n", | ||
" result = self.evaluation_reports.iloc[idx]['report'].reset_index()\n", | ||
" result['test_no'] = self.evaluation_reports.iloc[idx]['test_no']\n", | ||
" result['model_name'] = self.evaluation_reports.iloc[idx]['model_name']\n", | ||
" consistency_df = pd.concat([consistency_df, result], axis = 0, ignore_index=True)\n", | ||
" consistency_df = consistency_df.pivot(index=['model_name', 'ID'], columns=['test_no'], values=['is_Satisfied'])\n", | ||
" consistency_df.columns = consistency_df.columns.droplevel(level=0)\n", | ||
" consistency_df['consistency'] = consistency_df.eq(consistency_df.iloc[:, 0], axis=0).all(1)\n", | ||
" return consistency_df" | ||
], | ||
"outputs": [] | ||
}, | ||
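  {
   "cell_type": "markdown",
   "id": "consistency-toy-md",
   "metadata": {},
   "source": [
    "As a quick sanity check of the pivot logic in `get_consistency_dist()`, here is a minimal sketch on hand-made data. The model name, checklist IDs, and `is_Satisfied` values below are made up for illustration; only the pandas operations mirror the class above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "consistency-toy-code",
   "metadata": {},
   "source": [
    "# Toy illustration (made-up data) of the consistency check in get_consistency_dist():\n",
    "# rows are (model_name, checklist ID) pairs, columns are test runs, and an item\n",
    "# is consistent when every run produced the same is_Satisfied value.\n",
    "toy = pd.DataFrame({\n",
    "    'model_name': ['m1'] * 4,\n",
    "    'ID': ['2.1', '2.1', '3.2', '3.2'],\n",
    "    'test_no': [1, 2, 1, 2],\n",
    "    'is_Satisfied': [1, 1, 1, 0],\n",
    "})\n",
    "wide = toy.pivot(index=['model_name', 'ID'], columns='test_no', values='is_Satisfied')\n",
    "wide['consistency'] = wide.eq(wide.iloc[:, 0], axis=0).all(1)\n",
    "wide  # item 2.1 is consistent across both runs; item 3.2 is not"
   ],
   "outputs": []
  },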
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "fa7e37b6-c946-4fbb-a166-346d4101b528",
   "metadata": {},
   "source": [
    "consistency_evaluator = ConsistencyEvaluator()\n",
    "consistency_evaluator.evaluate(models, num_test_runs=5, verbose=True)"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "24ce51c6-0dbf-4fa7-8722-5bef38e0224a",
   "metadata": {},
   "source": [
    "## Result & Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc750f68-4fa7-4264-a270-2f3cc0ea667c",
   "metadata": {},
   "source": [
"The evaluation will be based on 2 metrics calculated from the response:\n", | ||
"- Completeness Score distribution: The distribution of the `num_test_runs` completeness scores per each set of parameters\n", | ||
"- Consistency Score: Out of all `checklist` items, the proportion of results remain consistent among `num_test_runs` runs per each set of parameters" | ||
] | ||
}, | ||
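  {
   "cell_type": "markdown",
   "id": "consistency-score-toy-md",
   "metadata": {},
   "source": [
    "Continuing the hand-made toy example above, a minimal sketch of the Consistency Score calculation: the mean of the per-item consistency flags for each model. The numbers are illustrative only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "consistency-score-toy-code",
   "metadata": {},
   "source": [
    "# Consistency Score sketch on the toy frame above: the proportion of checklist\n",
    "# items whose results agree across all runs, per model.\n",
    "wide.groupby('model_name')['consistency'].mean()\n",
    "# With item 2.1 consistent and item 3.2 not, this yields 0.5 for model 'm1'."
   ],
   "outputs": []
  },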
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "22ea8a4d-1f08-4f4d-9c80-4dc56ac16b14",
   "metadata": {},
   "source": [
    "consistency_evaluator.get_completeness_score_dist()"
   ],
   "outputs": []
  },
  {
"cell_type": "code", | ||
"execution_count": 51, | ||
"id": "dd191f9e-6cb6-4964-b060-7976f1529edd", | ||
"metadata": {}, | ||
"source": [ | ||
"# import matplotlib\n", | ||
"# completeness_score_df.plot(kind='box')" | ||
], | ||
"outputs": [] | ||
}, | ||
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "391b097d-df72-4490-a91d-7d0858852ad5",
   "metadata": {},
   "source": [
    "consistency_evaluator.get_consistency_dist()"
   ],
   "outputs": []
  },
  {
"cell_type": "code", | ||
"execution_count": 20, | ||
"id": "ca45a592-7490-4f86-8357-0707ef81e0e9", | ||
"metadata": {}, | ||
"source": [ | ||
"# consistency_df.groupby(['model_name']).agg({'consistency': 'mean'})" | ||
], | ||
"outputs": [] | ||
}, | ||
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b5c0b4b-ca68-4d44-9cb7-977e08abadb5",
   "metadata": {},
   "source": [],
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:test-creation]",
   "language": "python",
   "name": "conda-env-test-creation-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}