From 346bdb0165dd356d78bcf3d1fa37bc412ae37c26 Mon Sep 17 00:00:00 2001 From: shumlh Date: Wed, 19 Jun 2024 17:58:11 -0700 Subject: [PATCH 01/12] Holistic review on final_report --- report/final_report/final_report.qmd | 96 ++++++++++++++-------------- report/final_report/references.bib | 34 ++++++++++ 2 files changed, 81 insertions(+), 49 deletions(-) diff --git a/report/final_report/final_report.qmd b/report/final_report/final_report.qmd index 1d7d9b3..4626a9a 100644 --- a/report/final_report/final_report.qmd +++ b/report/final_report/final_report.qmd @@ -17,51 +17,51 @@ by John Shiu, Orix Au Yeung, Tony Shum, Yingzi Jin ### Problem Statement -The global artificial intelligence (AI) market is growing exponentially {cite}`grand2021artificial`, driven by its ability to autonomously make complex decisions impacting various aspects of human life, including financial transactions, autonomous transportation, and medical diagnosis. +The global artificial intelligence (AI) market is growing exponentially ({cite}`grand2021artificial`), driven by its ability to autonomously make complex decisions impacting various aspects of human life, including financial transactions, autonomous transportation, and medical diagnosis. -However, ensuring the software quality of these systems remains a significant challenge {cite}`openja2023studying`. Specifically, the lack of a standardized and comprehensive approach to testing machine learning (ML) systems introduces potential risks to stakeholders. For example, inadequate quality assurance in ML systems can lead to severe consequences, such as substantial financial losses ({cite}`Asheeta2019`, {cite}`Asheeta2019`, {cite}`Asheeta2019`) and safety hazards. +However, ensuring the software quality of these systems remains a significant challenge ({cite}`openja2023studying`). Specifically, the lack of a standardized and comprehensive approach to testing machine learning (ML) systems introduces potential risks to stakeholders. For example, inadequate quality assurance in ML systems can lead to severe consequences, such as misinformation ({cite}`Ashley2024`), social bias ({cite}`Alice2023`), substantial financial losses ({cite}`Asheeta2019`) and safety hazards. Therefore, defining and promoting an industry standard and establishing robust testing methodologies for these systems is crucial. But how? ### Our Objectives -We propose to develop testing suites diagnostic tools based on Large Language Models (LLMs) and curate checklists based on ML research papers and best practices to facilitate comprehensive testing of ML systems with flexibility. Our goal is to enhance applied ML software's trustworthiness, quality, and reproducibility across both the industry and academia {cite}`kapoor2022leakage`. +We propose to develop testing suites diagnostic tools based on Large Language Models (LLMs) and to curate checklists based on ML research papers and best practices to facilitate comprehensive testing of ML systems with flexibility. Our goal is to enhance applied ML software's trustworthiness, quality, and reproducibility across both the industry and academia ({cite}`kapoor2022leakage`). ## Data Science Methods ### Current Approaches -To ensure the reproducibility, trustworthiness and free-of-bias ML system, comprehensive assessment is essential. We have observed some traditional approaches in assessing the quality of ML systems, which contain different advantages and drawbacks as follows. 
+To ensure the reproducibility, trustworthiness and free-of-bias ML system, comprehensive testing is essential. We have observed some traditional approaches in assessing the completeness of ML system tests, which contain different advantages and drawbacks as follows. -#### 1. Code Coverage +1. **Code Coverage** -Code coverage is a measure of the proportion of source code of a program executed when a particular test suite is run. It is widely used in software development domain as one of the measurements. It quantifies the test quality and is scalable given the short process time. However, it cannot provide the reasons and in which ML areas that the test suites fall short under the context of ML system development. +Code coverage is a measure of the proportion of source code of a program executed when a particular test suite is run. It is widely used in software development domain as one of the measurements. It quantifies the test quality and is scalable given the short processing time. However, it cannot provide the reasons and in which ML areas that the test suites fall short under the context of ML system development. -#### 2. Manual Evaluation +2. **Manual Evaluation** -Manual evaluation involves human expert review at the source code, whom can take the business logic into considerations and find vulnerabilites. Manual evaluation usually delivers comments for improvement under specific development context, and it is still one of the most reliable methods in practice. However, the time cost is large and it is not scalable due to the scarcity of time and human expert. Different human expert might put emphasis on different ML test areas instead of a comprehensive and holistic review on the ML system test suites. +Manual evaluation involves human expert review at the source code, whom can take the business logic into considerations and find vulnerabilites. Manual evaluation usually delivers comments for improvement under specific development context, and it is still one of the most reliable methods in practice ({cite}`openja2023studying`, {cite}`alexander2023evaluating`). However, the time cost is large and it is not scalable due to the scarcity of time and human expert. Different human expert might put emphasis on different ML test areas instead of a comprehensive and holistic review on the ML system test suites. ### Our Approach Our approach is to deliver an automated code review tool with the best practices of ML test suites embedded, which can be used by ML users to learn the best practices as well as to obtain a comprehensive evaluation on their ML system codes. -To come up with the best practices of ML test suites, ML research paper and recognized online resources are our data. Under the collaboration with our partner, we have researched industrial best practices (cite: Microsoft, Jordan) and published academic literature (cite: OpenJa) and consolidated the testing strategies of ML projects into a format which is easily legible and editable by human (researchers, ML engineers, etc.). The format is also machine-friendly that can be easily incorporated into the automated tool. +To come up with the best practices of ML test suites, ML research paper and recognized online resources are our data. 
Under the collaboration with our partner, we have researched industrial best practices ({cite}`msise2023`, {cite}`jordan2020`) and published academic literature ({cite}`openja2023studying`) and consolidated the testing strategies of ML projects into a checklist which is easily legible and editable by human (researchers, ML engineers, etc.). The checklist is also machine-friendly that can be embedded into the automated tool. -To develop our automated code review tool, GitHub repositories of ML projects are our data. We have collected 11 repositories studied in {cite}`openja2023studying`, where these projects include comprehensive test suites and are written in Python programming language, for our product development. Our tool is capable of understanding the test suites in these projects, comparing and contrasting the test suites with the embedded best practices, and delivering evaluations and suggestions to the current test suties. +To develop our automated code review tool, GitHub repositories of ML projects are our data. We have collected 11 repositories studied in {cite}`openja2023studying`, where these projects include comprehensive test suites and are written in Python programming language, for our product development. Our tool is capable of understanding the test suites in these projects, comparing and contrasting the test suites with the embedded best practices, and delivering evaluations to the current test suites. -By developing our approach, we expect that it can provide reliable test suites evaluation to multiple ML projects in a scalable manner. However, we acknowledged that the consolidation of best practices currently focused on a few high priority test areas due to time constraint, where we expect to expand in the future. The test evaluation results provided by our tool are yet as reliable as human evaluation, where we will quantify its performance using the success metrics below. +By developing our approach, we expect that it can provide reliable test suites evaluation to multiple ML projects in a scalable manner. However, we acknowledged that the current consolidation of best practices only focused on a few high priority test areas due to time constraint. The test evaluation results provided by our tool are yet as reliable as human evaluation, where we will quantify its performance. ### Success Metrics -To properly assess the performance of our tool which leverages the capability of LLMs, we have researched and taken reference of the methods in {cite}`alexander2023evaluating` and defined the 2 success metrics: accuracy and consistency. With these metrics, our users (researchers, ML engineers, etc.) can assess the trustworthiness while obtaining the evaluation results from our tool. +To properly assess the performance of our tool which leverages the capability of LLMs, we have researched and taken reference of the methods in {cite}`alexander2023evaluating` and defined the 2 success metrics: accuracy and consistency. With these metrics, our users (researchers, ML engineers, etc.) can assess the degree of trustworthiness of the evaluation results from our tool. 1. **Accuracy of the Application vs Human Expert Judgement** -We run our tool on the ML projects in {cite}`openja2023studying` to obtain the evaluation results (i.e. completeness score) per each ML test best practice item. We then manually assess the test suites of these ML projects using the same criteria as the ground truth data. Machine evaluation results are compared and contrasted with the ground truth data. 
Accuracy is defined as the number of matching results over total number of results. +We run our tool on the ML projects in {cite}`openja2023studying` to obtain the evaluation results (i.e. completeness score) per each checklist item on ML test. We then manually assess the test suites of these ML projects using the same criteria as the ground truth data. Machine evaluation results are compared and contrasted with the ground truth data. Accuracy is defined as the number of matching results over total number of results. 2. **Consistency of the Application** -Multiple runs on each ML project are performed and the evaluation results per each ML test best practice item are obtained. Standard deviation of these results per ML projects are calculated as a measure of consistency. +Multiple runs on each ML project are performed and the evaluation results per each checklist item are obtained. Standard deviation of these results per ML projects are calculated as a measure of consistency. ## Data Product & Results @@ -69,7 +69,7 @@ Multiple runs on each ML project are performed and the evaluation results per ea Our solution offers both a curated checklist on robust ML testing, and a Python package that facilitates the use of LLMs in checklist-based evaluation on the robustness of users' ML projects. The Python package is made publicly available for distribution on the Python Packaging Index (PyPI). -The justifications for creating these products are, on one hand, checklists have been shown to decrease errors in software systems and promote code submissions (cite: Gawande 2010, Pineau et al. (2021) from Tiffany PDF). Moreover, Python is chosen to be the programing language of our package given its prevalence in the ML landscape, its ubiquitous presence across different OSes and the existence of Python libraries for the integration with LLMs. This lowers the barrier to use and develop our package and provides better user experience. +The justifications for creating these products are, on one hand, checklists have been shown to decrease errors in software systems and promote code submissions ({cite}`Atul2010`, {cite}`pineau2021improving`). Moreover, Python is chosen to be the programming language of our package given its prevalence in the ML landscape, its ubiquitous presence across different OSes and the existence of Python libraries for the integration with LLMs. This lowers the barrier to use and develop our package and provides better user experience. #### How to use the product @@ -79,13 +79,13 @@ There are two ways to make use of this package: 2. **As a high-level API.** Alternatively, one can use the package to import all components necessary for performing the tasks as part of their own system. Documentations are provided in terms of docstrings. -By formating our product as a CLI tool and API, one (researchers, ML engineers, etc.) will find it user-friendly to interact with. Moreover, it is versatile to support various use cases, such as web application development, data science research, etc. +By formatting our product as a CLI tool and API, one (researchers, ML engineers, etc.) will find it user-friendly to interact with, and versatile to support various use cases, e.g. web application development, scientific research, etc. #### System Design -(To be revised) ![image](../../img/proposed_system_overview.png) +(FIXME To be revised) ![image](../../img/proposed_system_overview.png) -The design principle of our package adheres to object-oriented design and SOLID principles, which is fully modular. 
One can easily switch between different prompts, models and checklists to use. This enables code reuse and promote users' collaboration to extend its functionality. +The design principle of our package adheres to object-oriented design and SOLID principles, which is fully modular. One can easily switch between different prompts, models and checklists to use. This facilitates code reusability and users' collaboration to extend its functionality. There are five components in the system of our package: @@ -93,10 +93,10 @@ There are five components in the system of our package: This component extracts the information relevant to test suites from the input codebase, which is essential for injecting only the most relevant information to LLMs given its token limits. 2. **Prompt Templates** -This component stores the prompt template necessary for instructing LLM to behave and return responses in consistent and expected format. Few-shot learning is applied for the instruction. +This component stores the prompt template necessary for instructing LLMs to behave and return responses in the expected format. 3. **Checklist** -This component reads the curated checklist, which is stored in CSV format, as a dict with fixed schema for injection into prompt. Default checklist is also included inside the package for distribution. +This component reads the curated checklist, which is stored in CSV format, as a dictionary with fixed schema for injection to LLMs. Default checklist is also included inside the package for distribution. 4. **Runners** This component involves the Evaluator module, which evaluates each file from the test suites using LLMs and outputs evaluation results, and Generator module, which generates test specifications. Both modules include validation and retry logics and record all relevant information in the responses. @@ -106,7 +106,7 @@ This components parses the responses from Evaluator into evaluation reports in v #### Checklist Design -The package will incorporate a checklist ([Fig. 1](overview-diagram)) which contains the best practices in testing ML pipeline and is curated manually based on ML researches and recognized online resources. Prompt engineering is applied to the checklist for better performance. This also helps combating the hallucination of LLMs ({cite}`zhang2023sirens`) during the evaluation of ML projects by prompting it to follow **exactly** the checklist. +The checklist ([Fig. 1](overview-diagram)) embedded in the package contains the best practices in testing ML pipeline and is curated manually based on ML researches and recognized online resources. Prompt engineering is applied for better performance. This also helps combating the hallucination of LLMs ({cite}`zhang2023sirens`) during the evaluation of ML projects by prompting it to follow **exactly** the checklist. Here is an example of how the checklist would be structured: @@ -120,46 +120,48 @@ Here is an example of how the checklist would be structured: | Reference | References of the checklist item, e.g. academic paper | | Is Evaluator Applicable | Whether the checklist item is selected to be used during evaluation. 0 indicates No, 1 indicates Yes | -(To be revised) +(FIXME To be revised) #### Artifacts There are three artifacts after using our package: 1. **Evaluation Responses** -The artifact stores both the evaluation responses from LLMs and meta-data of the process in JSON format. This supports downstream tasks, such as report render, scientific research, etc. 
+The artifact stores both the evaluation responses from LLMs and meta-data of the process in JSON format. This supports downstream tasks, e.g. report render, scientific research, etc. -(To be revised) schema of the JSON saved & what kind of information is stored +(FIXME To be revised) schema of the JSON saved & what kind of information is stored 2. **Evaluation Report** The artifact stores the evaluation results of the ML projects in a structured format, which includes completeness score breakdown and corresponding detailed reasons. -(To be revised) +(FIXME To be revised) 3. **Test Specification Script** -The artifacts stores the test specification responses from LLMs in Python script format. +The artifacts stores the test specifications generated by LLMs in Python script format. -(To be revised) +(FIXME To be revised) ### Evaluation Results -As illustrated in `Success Metrics`, we ran 30 iterations on each of the repositories in {cite}`openja2023studying` and examined the breakdown of the ML Completeness Score to assessed the quality of evaluation determined by our tool. (FIXME: would it be better to show a table of the repos? like how the Openja does?) +As illustrated in `Success Metrics`, we ran 30 iterations on each of the repositories in {cite}`openja2023studying` and examined the breakdown of the completeness score to assess the evaluation quality by our tool. -#### Accuracy +(FIXME: would it be better to show a table of the repos? like how the Openja does?) -For accuracy, we targeted 3 of the repositories (`lightfm` (FIXME: link), `qlib` (FIXME: link), `DeepSpeech` (FIXME: link)) for human evaluation and compared the ground truth with the outputs from our tool. +1. **Accuracy** + +For accuracy, we targeted 3 of the repositories ([`lightfm`](https://github.com/lyst/lightfm), [`qlib`](https://github.com/microsoft/qlib), [`DeepSpeech`](https://github.com/mozilla/DeepSpeech)) for human evaluation and compared the ground truth with the outputs from our tool. ```{python} # FIXME: table: checklist id, title, (ground truth, (lightfm, qlib, DeepSpeech)) ``` -> Caption: Ground truth data on the 3 repositories +> Caption: Ground truth data on the 3 repositories. 1 = fully satisfied, 0.5 = partially satisfied, 0 = not satisfied ```{python} # FIXME: jitter-mean-sd plot (checklist item vs. score) for each repo ``` > Caption: Comparison of the satisfaction determined by our system versus the ground truth for each checklist item and repository -We found that our tool tends to undermine the actual satisfying cases. For the items that are actually satisfied (score = 1), our tool tends to classify as partially satisfied (score = 0.5), while for those that are partially satisfied (score = 0.5), our tool often classfies as not satisfied (score = 0). +We found that our tool tends to undermine the actual satisfying cases. For the items that are actually satisfied, our tool tends to classify as partially satisfied, while for those that are partially satisfied, our tool often classifies as not satisfied. ```{python} # FIXME: contingency table @@ -168,55 +170,51 @@ We found that our tool tends to undermine the actual satisfying cases. For the i The accuracy issue may be attributed to the need for improvement of prompts in our checklist. -#### Consistency +2. **Consistency** -Since the completeness score from LLMs contain randomness, we further studied the consistency of scores across checklist items and reposities. 
+Since the completeness score from LLMs contain randomness, we further studied the consistency of scores across checklist items and repositories. ```{python} # FIXME: jitter-boxplot, checklist item vs. SD ``` -> Caption: Standard deviations of the score for each checklist item. Each dot represents the standard deviation of scores of 30 runs of a sigle repository +> Caption: Standard deviations of the score for each checklist item. Each dot represents the standard deviation of scores of 30 runs of a single repository -We found 2 diverging cases. For example, it shows high standard deviations across repositories for item `3.2 Data in the Expected Format`. This might be a proof of poor prompt quality, making it ambiguous for the LLM and hence hard to produce consistent results. Prompt engineering might solve this problem. +We found 2 diverging cases. For example, it shows high standard deviations across repositories for item `3.2 Data in the Expected Format`. This might be a proof of poor prompt quality, making it ambiguous for the LLM to produce consistent results. Prompt engineering might solve this problem. -On the other hand, there are outliers yielding exceptionally high standard deviations for item `5.3 Ensure Model Output Shape Aligns with Expectation`. This may be because those repositories are unorthodox, and careful manual examination is required to achieve a more robust conclusion. +On the other hand, there are outliers yielding exceptionally high standard deviations for item `5.3 Ensure Model Output Shape Aligns with Expectation`. This might be because those repositories are unorthodox, but careful manual examination is required for a more definite conclusion. #### Comparison of `gpt-3.5-turbo` and `gpt-4o` -To examine if newer LLMs help in both metrics, we preliminarily compared system outputs from `gpt-4o` and `gpt-3.5-turbo` on the `lightfm` repository, we observed that the `gpt-4o` system consistently returned "Satisfied", which deviates from the ground truth. +To examine if newer LLMs help in both metrics, we preliminarily compared system outputs from `gpt-4o` versus `gpt-3.5-turbo` on the `lightfm` repository. We observed that the one from `gpt-4o` consistently returned "Satisfied", which deviates from the ground truth. ```{python} # FIXME: jitter-mean-sd plot (checklist item vs. score) for each repo ``` > Caption: Comparison of the satisfaction using `gpt-4o` versus using `gpt-3.5-turbo` for each checklist item on `lightfm` -Further investigation into `gpt-4o` is required to address this issue and enhance the system performance. +Further investigation into `gpt-4o` is required to address if it can enhance the system performance. ## Conclusion ### Wrap Up -Our project, FixML, represents a significant step forward in the field of machine learning (ML) testing by providing curated checklists and automated tools that enhance the evaluation and creation of test suites for ML models. The development and implementation of FixML have been driven by both the need of better quality assurance in ML systems, and the current limitations of traditional testing methods on ML projects which are either too general without comprehensive clarification, or are too human-reliant. - -FixML seamlessly takes in the user’s ML codebase, identifies and extracted its existing test suites. 
Together with the curated checklist on ML testing, FixML leverages Large Language Models (LLMs) to assess the completeness of the test suites and output detailed evaluation reports with completeness scores and specific reasons. This assists users in understanding the performance of their current test suites with insights. Additionally, FixML can generate test function specifications corresponding to the curated checklist, helping users utilizing their test suites. - -In return, FixML solution combines the scalability of automated testing with the reliability of expert evaluation. By automating the evaluation process, FixML significantly reduces the time and human effort required to assess the quality of ML test suites. This popularizes thorough and efficient quality assessment on ML projects. +The development of FixML have been driven by both the need of better quality assurance in ML systems, and the current limitations of traditional testing methods on ML projects. FixML provides curated checklists and automated tools that enhance the evaluation and creation of test suites for ML projects, which in return, significantly reduces the time and human effort required to assess the completeness of ML test suites. This popularizes thorough and efficient assessment on ML projects. ### Limitation & Future Improvement -While FixML provides substantial benefits, there are limitations and areas that aim to be addressed in future development: +While FixML provides substantial benefits, there are limitations and areas to be addressed in future development: 1. **Specialized Checklist** -The current checklist is designed to be general and may not cover all specific requirements for different ML projects. Future development will focus on creating more specialized checklists for different domains and project types, allowing for more tailored evaluations. Since the format of the checklist is designed to allow users to easily expand, edit and select checklist items based on their specific use case, we welcome any collaboration with ML researchers on the creation of specalized checklists. +The default checklist is general and might not cover all requirements for different ML projects. Future development will focus on creating more specialized checklists for more tailored evaluations across domains and project types. We welcome any collaboration with ML researchers on the creation of specalized checklists based on their use cases. 2. **Enhanced Test Evaluator** -Our current study unveils the varying accuracy and consistency issues on the evaluation results using OpenAI GPT models. Future improvements involves prompt enhancement with prompt engineering techniques and support for multiple LLMs for higher performance and flexibility of FixML test evaluator functionality. We also expect to deliver user guidelines in editing the prompts in our system, where ML developers can customize prompts for better performance and collaborate with us to embed them into the system. +Our study reveals the accuracy and consistency issues on the evaluation results using OpenAI GPT-3.5-turbo model. Future improvements involves better prompt engineering techniques and support for multiple LLMs for enhanced performance and flexibility. We expect to include user guidelines in prompt creation to faciliate collaboration with ML developers. 3. 
**Customized Test Specification** -FixML test specification generator currently produces general test function skeletons solely based on the curated checklist without the context of the specific ML projects. Future developments will involve the integration of the ML project codebase in the generation process to output customized test functions skeletons. This further lower the barrier of ML users in creating comprehensive test suites relevant to the projects. +FixML currently produces general test function skeletons solely based on the curated checklist. Future developments involve the integration of the ML project infromation in the generation process to produce customized test functions skeletons. This further incentivizes users to create comprehensive tests. 4. Workflow Optimization #FIXME: have to review whether to include as it seems lower priority. @@ -226,4 +224,4 @@ The current test evaluator and test specification generator are separate entitie Performance optimization is another critical area for future development. As FixML handles large codebases and complex evaluations, optimizing the system to handle these tasks more efficiently is essential. This includes improving the speed and accuracy of the LLM responses, reducing the time taken to analyze and generate reports, and ensuring the system can scale effectively to handle more extensive and more complex projects. -By addressing these limitations and focusing on these future improvements, FixML will become an even more powerful tool for ensuring the quality and robustness of machine learning and data science projects. \ No newline at end of file +By addressing these limitations with future improvements, we hope FixML promote better ML systems and thus better human life. \ No newline at end of file diff --git a/report/final_report/references.bib b/report/final_report/references.bib index 1302306..8c6febe 100644 --- a/report/final_report/references.bib +++ b/report/final_report/references.bib @@ -75,4 +75,38 @@ @article{alexander2023evaluating author={Alexander, Rohan and Katz, Lindsay and Moore, Callandra and Schwartz, Zane}, journal={arXiv preprint arXiv:2310.01402}, year={2023} +} + +@misc{msise2023, + title = {Testing data science and MLOps Code}, + author = {Microsoft Industry Solutions Engineering Team}, + year = 2023, + month = {May}, + journal = {Testing Data Science and MLOps Code - Engineering Fundamentals Playbook}, + url = {https://microsoft.github.io/code-with-engineering-playbook/machine-learning/ml-testing/} +} + +@misc{jordan2020, + title = {Effective Testing for Machine Learning Systems}, + author = {Jordan, Jeremy}, + year = 2020, + month = {August}, + url = {https://www.jeremyjordan.me/testing-ml/} +} + +@article{pineau2021improving, + title={Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program)}, + author={Pineau, Joelle and Vincent-Lamarre, Philippe and Sinha, Koustuv and Larivi{\`e}re, Vincent and Beygelzimer, Alina and d'Alch{\'e}-Buc, Florence and Fox, Emily and Larochelle, Hugo}, + journal={Journal of Machine Learning Research}, + volume={22}, + number={164}, + pages={1--20}, + year={2021} +} + +@book{Atul2010, + title = "Checklist Manifesto, the (HB)", + author = "Gawande, Atul.", + year = 2010, + publisher = "Penguin Books India" } \ No newline at end of file From a205f3b5af3bf51b6c498c5618bebe3234d1e10c Mon Sep 17 00:00:00 2001 From: shumlh Date: Wed, 19 Jun 2024 23:30:15 -0700 Subject: [PATCH 02/12] Fix references in the final report --- 
report/final_report/final_report.qmd | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/report/final_report/final_report.qmd b/report/final_report/final_report.qmd index 6b82247..56c5fd0 100644 --- a/report/final_report/final_report.qmd +++ b/report/final_report/final_report.qmd @@ -40,13 +40,13 @@ Code coverage is a measure of the proportion of source code of a program execute 2. **Manual Evaluation** -Manual evaluation involves human expert review at the source code, whom can take the business logic into considerations and find vulnerabilites. Manual evaluation usually delivers comments for improvement under specific development context, and it is still one of the most reliable methods in practice ({cite}`openja2023studying`, {cite}`alexander2023evaluating`). However, the time cost is large and it is not scalable due to the scarcity of time and human expert. Different human expert might put emphasis on different ML test areas instead of a comprehensive and holistic review on the ML system test suites. +Manual evaluation involves human expert review at the source code, whom can take the business logic into considerations and find vulnerabilites. Manual evaluation usually delivers comments for improvement under specific development context, and it is still one of the most reliable methods in practice ([@openja2023studying], [@alexander2023evaluating]). However, the time cost is large and it is not scalable due to the scarcity of time and human expert. Different human expert might put emphasis on different ML test areas instead of a comprehensive and holistic review on the ML system test suites. ### Our Approach Our approach is to deliver an automated code review tool with the best practices of ML test suites embedded, which can be used by ML users to learn the best practices as well as to obtain a comprehensive evaluation on their ML system codes. -To come up with the best practices of ML test suites, ML research paper and recognized online resources are our data. Under the collaboration with our partner, we have researched industrial best practices ({cite}`msise2023`, {cite}`jordan2020`) and published academic literature ({cite}`openja2023studying`) and consolidated the testing strategies of ML projects into a checklist which is easily legible and editable by human (researchers, ML engineers, etc.). The checklist is also machine-friendly that can be embedded into the automated tool. +To come up with the best practices of ML test suites, ML research paper and recognized online resources are our data. Under the collaboration with our partner, we have researched industrial best practices ([@msise2023], [@jordan2020]) and published academic literature ([@openja2023studying]) and consolidated the testing strategies of ML projects into a checklist which is easily legible and editable by human (researchers, ML engineers, etc.). The checklist is also machine-friendly that can be embedded into the automated tool. To develop our automated code review tool, GitHub repositories of ML projects are our data. We have collected 11 repositories studied in [@openja2023studying], where these projects include comprehensive test suites and are written in Python programming language, for our product development. Our tool is capable of understanding the test suites in these projects, comparing and contrasting the test suites with the embedded best practices, and delivering evaluations to the current test suites. 
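To make the first of these steps (identifying a project's test suites) concrete, the sketch below shows one way such files could be located in a cloned repository. It is an illustration only, assuming pytest-style naming conventions (`test_*.py` / `*_test.py`), and is not necessarily how the tool itself performs the extraction.

```python
# Illustrative sketch: locate candidate test-suite files in a cloned repository.
# Assumes pytest-style naming conventions; not the tool's actual implementation.
from pathlib import Path


def find_test_files(repo_root: str) -> list[Path]:
    """Return candidate test files under a repository root, sorted by path."""
    root = Path(repo_root)
    patterns = ("test_*.py", "*_test.py")
    matches = {path for pattern in patterns for path in root.rglob(pattern)}
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical local checkout of one of the studied repositories.
    for path in find_test_files("./lightfm"):
        print(path)
```

Only the files found in this way, rather than the whole codebase, would then be passed on for checklist-based evaluation, which helps keep the input within the token limits of an LLM.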
@@ -70,7 +70,7 @@ Multiple runs on each ML project are performed and the evaluation results per ea Our solution offers both a curated checklist on robust ML testing, and a Python package that facilitates the use of LLMs in checklist-based evaluation on the robustness of users' ML projects. The Python package is made publicly available for distribution on the Python Packaging Index (PyPI). -The justifications for creating these products are, on one hand, checklists have been shown to decrease errors in software systems and promote code submissions ({cite}`Atul2010`, {cite}`pineau2021improving`). Moreover, Python is chosen to be the programming language of our package given its prevalence in the ML landscape, its ubiquitous presence across different OSes and the existence of Python libraries for the integration with LLMs. This lowers the barrier to use and develop our package and provides better user experience. +The justifications for creating these products are, on one hand, checklists have been shown to decrease errors in software systems and promote code submissions ([@Atul2010], [@pineau2021improving]). Moreover, Python is chosen to be the programming language of our package given its prevalence in the ML landscape, its ubiquitous presence across different OSes and the existence of Python libraries for the integration with LLMs. This lowers the barrier to use and develop our package and provides better user experience. #### How to use the product @@ -393,3 +393,5 @@ The current test evaluator and test specification generator are separate entitie Performance optimization is another critical area for future development. As FixML handles large codebases and complex evaluations, optimizing the system to handle these tasks more efficiently is essential. This includes improving the speed and accuracy of the LLM responses, reducing the time taken to analyze and generate reports, and ensuring the system can scale effectively to handle more extensive and more complex projects. By addressing these limitations with future improvements, we hope FixML achieves better performance and promotes better ML systems and thus better human life. + +## References \ No newline at end of file From f480f9c63842f1898b642738a6ae55e41d83c8a8 Mon Sep 17 00:00:00 2001 From: shumlh Date: Thu, 20 Jun 2024 00:34:12 -0700 Subject: [PATCH 03/12] Polish the final report --- report/final_report/final_report.qmd | 125 ++++++++++++++------------- 1 file changed, 64 insertions(+), 61 deletions(-) diff --git a/report/final_report/final_report.qmd b/report/final_report/final_report.qmd index 56c5fd0..b9e2725 100644 --- a/report/final_report/final_report.qmd +++ b/report/final_report/final_report.qmd @@ -32,133 +32,135 @@ We propose to develop testing suites diagnostic tools based on Large Language Mo ### Current Approaches -To ensure the reproducibility, trustworthiness and free-of-bias ML system, comprehensive testing is essential. We have observed some traditional approaches in assessing the completeness of ML system tests, which contain different advantages and drawbacks as follows. +To ensure the reproducibility, trustworthiness, and lack of bias in ML systems, comprehensive testing is essential. We outlined some traditional approaches for assessing the completeness of ML system tests with their advantages and drawbacks as follows. 1. **Code Coverage** -Code coverage is a measure of the proportion of source code of a program executed when a particular test suite is run. 
It is widely used in software development domain as one of the measurements. It quantifies the test quality and is scalable given the short processing time. However, it cannot provide the reasons and in which ML areas that the test suites fall short under the context of ML system development.
+Code coverage measures the proportion of source code of a program executed when a particular test suite is run. Widely used in software development, it quantifies test quality and is scalable due to its short processing time. However, it cannot indicate the reasons or specific ML areas where the test suites fall short in the context of ML system development.

2. **Manual Evaluation**

-Manual evaluation involves human expert review at the source code, whom can take the business logic into considerations and find vulnerabilites. Manual evaluation usually delivers comments for improvement under specific development context, and it is still one of the most reliable methods in practice ([@openja2023studying], [@alexander2023evaluating]). However, the time cost is large and it is not scalable due to the scarcity of time and human expert. Different human expert might put emphasis on different ML test areas instead of a comprehensive and holistic review on the ML system test suites.
+Manual evaluation involves human experts reviewing the source code, who can take business logic into consideration and identify vulnerabilities. It often provides context-specific improvement suggestions and remains one of the most reliable practices ([@openja2023studying], [@alexander2023evaluating]). However, it is time-consuming and not scalable due to the scarcity of human experts. Moreover, different experts might put emphasis on different ML test areas and lack a comprehensive and holistic review of the ML system test suites.

### Our Approach

-Our approach is to deliver an automated code review tool with the best practices of ML test suites embedded, which can be used by ML users to learn the best practices as well as to obtain a comprehensive evaluation on their ML system codes.
+Our approach is to deliver an automated code review tool with the best practices of ML test suites embedded. This tool aims to educate ML users on best practices while providing comprehensive evaluations of their ML system code.

-To come up with the best practices of ML test suites, ML research paper and recognized online resources are our data. Under the collaboration with our partner, we have researched industrial best practices ([@msise2023], [@jordan2020]) and published academic literature ([@openja2023studying]) and consolidated the testing strategies of ML projects into a checklist which is easily legible and editable by human (researchers, ML engineers, etc.). The checklist is also machine-friendly that can be embedded into the automated tool.
+To establish these best practices, we utilized data from ML research papers and recognized online resources. In collaboration with our partner, we researched industrial best practices ([@msise2023], [@jordan2020]) and academic literature ([@openja2023studying]), and consolidated testing strategies into a human-readable and machine-friendly checklist that can be embedded into the automated tool.

-To develop our automated code review tool, GitHub repositories of ML projects are our data. We have collected 11 repositories studied in [@openja2023studying], where these projects include comprehensive test suites and are written in Python programming language, for our product development. 
Our tool is capable of understanding the test suites in these projects, comparing and contrasting the test suites with the embedded best practices, and delivering evaluations to the current test suites. +For development, we collected 11 GitHub repositories of ML projects as studied in [@openja2023studying]. These Python-based projects include comprehensive test suites. Our tool should be able to analyze these test suites, compare them with embedded best practices, and deliver evaluations. -By developing our approach, we expect that it can provide reliable test suites evaluation to multiple ML projects in a scalable manner. However, we acknowledged that the consolidation of best practices currently focused on a few high priority test areas due to time constraint, where we expect to expand in the future. The test evaluation results provided by our tool are yet as reliable as human evaluation, where we will quantify its performance using the success metrics below. +We expect that our approach will provide scalable and reliable test suite evaluations for multiple ML projects. However, we recognize that our current best practices only focus on a few high-priority test areas due to time constraints. We plan to expand this scope in the future. While our tool's evaluations are not yet as reliable as human evaluations, we will quantify its performance. ### Success Metrics -To properly assess the performance of our tool which leverages the capability of LLMs, we have researched and taken reference of the methods in [@alexander2023evaluating] and defined the 2 success metrics: accuracy and consistency. With these metrics, our users (researchers, ML engineers, etc.) can assess the degree of trustworthiness of the evaluation results from our tool. +To properly assess the performance of our tool which leverages LLMs capability, we have taken reference of the methods in [@alexander2023evaluating] and defined two success metrics: accuracy and consistency. These metrics will help users (researchers, ML engineers, etc.) gauge the trustworthiness of our tool's evaluation results. -1. **Accuracy of the Application vs Human Expert Judgement** +1. **Accuracy vs Human Expert Judgement** -We run our tool on the ML projects in [@openja2023studying] to obtain the evaluation results (i.e. completeness score) per each ML test best practice item. We then manually assess the test suites of these ML projects using the same criteria as the ground truth data. Machine evaluation results are compared and contrasted with the ground truth data. Accuracy is defined as the number of matching results over total number of results. +We run our tool on ML projects from [@openja2023studying] to obtain evaluation results for each ML checklist item. These results are then compared with our manually assessed ground truth data based on the same criteria. Accuracy is calculated as the proportion of matching results to the total number of results. -2. **Consistency of the Application** +2. **Consistency** -Multiple runs on each ML project are performed and the evaluation results per each checklist item are obtained. Standard deviation of these results per ML projects are calculated as a measure of consistency. +We perform multiple runs on each ML project to obtain evaluation results for each checklist item. Consistency is measured by calculating the standard deviation of these results across multiple runs for each project. 
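To make these two metrics concrete, the sketch below computes both from a small, made-up table of evaluation results with one row per repository, checklist item and run. The column names and values are illustrative assumptions rather than the package's actual output schema; scores follow the 1 / 0.5 / 0 satisfaction scale used later in this report.

```python
# Illustrative sketch with made-up data: one row per (repository, checklist item, run).
import pandas as pd

runs = pd.DataFrame({
    "repo":    ["lightfm"] * 4 + ["qlib"] * 4,
    "item_id": ["2.1", "2.1", "3.2", "3.2"] * 2,
    "run":     [1, 2, 1, 2] * 2,
    "score":   [1.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.5, 0.0],  # completeness scores
})
ground_truth = pd.DataFrame({
    "repo":    ["lightfm", "lightfm", "qlib", "qlib"],
    "item_id": ["2.1", "3.2", "2.1", "3.2"],
    "truth":   [1.0, 0.5, 0.0, 0.5],
})

# Accuracy: proportion of evaluations that match the manually assessed ground truth
# (one simple way to operationalize the definition above).
merged = runs.merge(ground_truth, on=["repo", "item_id"])
accuracy = (merged["score"] == merged["truth"]).mean()

# Consistency: standard deviation of the score across runs, per repository and item.
consistency = runs.groupby(["repo", "item_id"])["score"].std()

print(f"accuracy = {accuracy:.2f}")
print(consistency)
```

In our evaluation, the same calculation is applied across the 30 runs per repository described in the results section.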
## Data Product & Results ### Data Products -Our solution offers both a curated checklist on robust ML testing, and a Python package that facilitates the use of LLMs in checklist-based evaluation on the robustness of users' ML projects. The Python package is made publicly available for distribution on the Python Packaging Index (PyPI). +Our solution includes a curated checklist for robust ML testing and a Python package for checklist-based evaluation of ML project testing robustness using LLMs. The package is publicly available on the Python Packaging Index (PyPI). -The justifications for creating these products are, on one hand, checklists have been shown to decrease errors in software systems and promote code submissions ([@Atul2010], [@pineau2021improving]). Moreover, Python is chosen to be the programming language of our package given its prevalence in the ML landscape, its ubiquitous presence across different OSes and the existence of Python libraries for the integration with LLMs. This lowers the barrier to use and develop our package and provides better user experience. +Justifications for these products are: + +- Checklists have been shown to reduce errors in software systems and promote code submissions ([@Atul2010], [@pineau2021improving]). +- Python is widely used in ML, compatible with various OSes, and integrates well with LLMs. These ensure the ease of use and development. #### How to use the product There are two ways to make use of this package: -1. **As a CLI tool.** A runnable command `fixml` is provided by the package. Once installed, users can perform the codebase evaluation, test function specification generation and other relevant tasks by running subcommands under `fixml` in terminal environment. +1. **As a CLI tool.** A runnable command `fixml` is provided by the package. Once installed, users can perform codebase evaluations, generate test function specifications, and more by running subcommands under `fixml` in the terminal. -2. **As a high-level API.** Alternatively, one can use the package to import all components necessary for performing the tasks as part of their own system. Documentations are provided in terms of docstrings. +2. **As a high-level API.** Users can import necessary components from the package into their own systems. Documentation is available through docstrings. -By formatting our product as a CLI tool and API, one (researchers, ML engineers, etc.) will find it user-friendly to interact with, and versatile to support various use cases, e.g. web application development, scientific research, etc. +By offering it as both CLI tool and API, our product is user-friendly to interact with, and versatile to support various use cases such as web application development and scientific research. #### System Design (FIXME To be revised) ![image](../../img/proposed_system_overview.png) -The design principle of our package adheres to object-oriented design and SOLID principles, which is fully modular. One can easily switch between different prompts, models and checklists to use. This facilitates code reusability and users' collaboration to extend its functionality. -The design principle of our package adheres to object-oriented design and SOLID principles, which is fully modular. One can easily switch between different prompts, models and checklists to use. This facilitates code reusability and users' collaboration to extend its functionality. +The design of our package follows object-oriented and SOLID principles, which is fully modularity. 
Users can easily switch between different prompts, models, and checklists, which facilitates code reusability and collaboration to extend its functionality. There are five components in the system of our package: 1. **Code Analyzer** -This component extracts the information relevant to test suites from the input codebase, which is essential for injecting only the most relevant information to LLMs given its token limits. +It extracts test suites from the input codebase, to ensure only the most relevants details are provided to LLMs given token limits. 2. **Prompt Templates** -This component stores the prompt template necessary for instructing LLMs to behave and return responses in the expected format. +It stores prompt templates for instructing LLMs to generate responses in the expected format. 3. **Checklist** -This component reads the curated checklist, which is stored in CSV format, as a dictionary with fixed schema for injection to LLMs. Default checklist is also included inside the package for distribution. +It reads the curated checklist from a CSV file into a dictionary with a fixed schema for LLM injection. The package includes a default checklist for distribution. 4. **Runners** -This component involves the Evaluator module, which evaluates each file from the test suites using LLMs and outputs evaluation results, and Generator module, which generates test specifications. Both modules include validation and retry logics and record all relevant information in the responses. +It includes the Evaluator module, which assesses each test suite file using LLMs and outputs evaluation results, and the Generator module, which creates test specifications. Both modules feature validation, retry logic, and record response and relevant information. 5. **Parsers** -This components parses the responses from Evaluator into evaluation reports in various formats (HTML, PDF) using Jinja template engine. Adhering to our design principle, this enables flexibility in creating customized report structure. +It converts Evaluator responses into evaluation reports in various formats (HTML, PDF) using the Jinja template engine, which enables customizable report structures. #### Checklist Design -The checklist ([Fig. 1](overview-diagram)) embedded in the package contains the best practices in testing ML pipeline and is curated manually based on ML researches and recognized online resources. Prompt engineering is applied for better performance. This also helps combating the hallucination of LLMs ([@zhang2023sirens]) during the evaluation of ML projects by prompting it to follow **exactly** the checklist. +The embedded checklist ([Fig. 1](overview-diagram)) contains best practices for testing ML pipelines, and is curated from ML research and recognized online resources. Prompt engineering further improves performance. THis helps mitigate LLM hallucinations ([@zhang2023sirens]) by ensuring strict adherence to the checklist. -Here is an example of how the checklist would be structured: +Example checklist structure: | Column | Description | |------------------:|:----------------------------------------------------| -| ID | The Unique Identifier of the checklist item | -| Topic | The Test Area of the checklist item | -| Title | The Title of the checklist item | -| Requirement | The Prompt of the checklist item to be injected into LLMs for evaluation | -| Explanations | Detailed explanations of the checklist item for human understanding | -| Reference | References of the checklist item, e.g. 
academic paper | -| Is Evaluator Applicable | Whether the checklist item is selected to be used during evaluation. 0 indicates No, 1 indicates Yes | +| ID | Unique Identifier of the checklist item | +| Topic | Test Area of the checklist item | +| Title | Title of the checklist item | +| Requirement | Prompt for the checklist item to be injected into LLMs for evaluation | +| Explanations | Detailed explanations for human understanding | +| Reference | References for the checklist item, e.g., academic papers | +| Is Evaluator Applicable | Indicates if the checklist item is used during evaluation (0 = No, 1 = Yes) | (FIXME To be revised) #### Artifacts -There are three artifacts after using our package: +Using our package results in three artifacts: 1. **Evaluation Responses** -The artifact stores both the evaluation responses from LLMs and meta-data of the process in JSON format. This supports downstream tasks, e.g. report render, scientific research, etc. +These responses include both LLM evaluation results and process metadata stored in JSON format.This supports downsteam tasks like report rendering and scientific research, etc. (FIXME To be revised) schema of the JSON saved & what kind of information is stored 2. **Evaluation Report** -The artifact stores the evaluation results of the ML projects in a structured format, which includes completeness score breakdown and corresponding detailed reasons. +This report presents structured evaluation results of ML projects, which includes a detailed breakdown of completeness scores and reasons for each score. (FIXME To be revised) 3. **Test Specification Script** -The artifacts stores the test specifications generated by LLMs in Python script format. +Generated test specifications are stored as Python scripts. (FIXME To be revised) ### Evaluation Results -As illustrated in `Success Metrics`, we ran 30 iterations on each of the repositories in [@openja2023studying] and examined the breakdown of the completeness score to assessed the quality of evaluation determined by our tool. +As described in `Success Metrics`, we conducted 30 iterations on each repository from [@openja2023studying] and examined the breakdown of the completeness score to assess our tool's evaluation quality. (FIXME: would it be better to show a table of the repos? like how the Openja does?) 1. **Accuracy** -For accuracy, we targeted 3 of the repositories ([`lightfm`](https://github.com/lyst/lightfm), [`qlib`](https://github.com/microsoft/qlib), [`DeepSpeech`](https://github.com/mozilla/DeepSpeech)) for human evaluation and compared the ground truth with the outputs from our tool. +We targeted 3 of the repositories ([`lightfm`](https://github.com/lyst/lightfm), [`qlib`](https://github.com/microsoft/qlib), [`DeepSpeech`](https://github.com/mozilla/DeepSpeech)) for human evaluation compared our tool's outputs with the ground truth. ```{python} import pandas as pd gt = pd.read_csv('ground_truth.csv') gt ``` -> Caption: Ground truth data on the 3 repositories. 1 = fully satisfied, 0.5 = partially satisfied, 0 = not satisfied +> Caption: Ground truth data for the 3 repositories. (1 = fully satisfied, 0.5 = partially satisfied, 0 = not satisfied) ```{python} # FIXME: jitter-mean-sd plot (checklist item vs. 
score) for each repo @@ -215,9 +217,9 @@ errorbars = base.mark_errorbar().encode( titleFontSize=12 ) ``` -> Caption: Comparison of the satisfaction determined by our system versus the ground truth for each checklist item and repository +> Caption: Comparison of our system's satisfaction determination versus the ground truth for each checklist item and repository -We found that our tool tends to undermine the actual satisfying cases. For the items that are actually satisfied, our tool tends to classify as partially satisfied, while for those that are partially satisfied, our tool often classifies as not satisfied. +Our tool tends to underrate satisfying cases, which often classifies fully satisfied items as partially satisfied and partially satisfied items as not satisfied. ```{python} df_repo_run = pd.read_csv('score_by_repo_run_3.5-turbo.csv') @@ -235,14 +237,13 @@ contingency_table = pd.pivot_table( contingency_table.index.names = ['Repository', 'Checklist Item', 'Ground Truth'] contingency_table.sort_index(level=[0, 2]) ``` -> Contingency table of the satisfaction determined by our system versus the ground truth +> Caption: Contingency table of our system's satisfaction determination versus the ground truth -The accuracy issue may be attributed to the need for improvement of prompts in our checklist. +The accuracy issue may be attributed to a need to improve our checklist prompts. 2. **Consistency** -Since the completeness score from LLMs contain randomness, we further studied the consistency of scores across checklist items and repositories. -Since the completeness score from LLMs contain randomness, we further studied the consistency of scores across checklist items and repositories. +As the completeness scores from LLMs contain randomness, we examined the consistency of completeness scores across checklist items and repositories. ```{python} stds = df_repo__stat[['repo', 'std', 'id_title']].pivot(index='repo', columns='id_title').copy() @@ -289,16 +290,19 @@ stripplot = base.mark_circle(size=100).encode( title="30 Runs on Openja's Repositories for each Checklist Item" ) ``` -> Caption: Standard deviations of the score for each checklist item. Each dot represents the standard deviation of scores of 30 runs of a single repository -> Caption: Standard deviations of the score for each checklist item. Each dot represents the standard deviation of scores of 30 runs of a single repository +> Caption: Standard deviations of the score for each checklist item. Each dot represents the standard deviation of scores from 30 runs of a single repository. + +We identified two diverging cases: -We found 2 diverging cases. For example, it shows high standard deviations across repositories for item `3.2 Data in the Expected Format`. This might be a proof of poor prompt quality, making it ambiguous for the LLM to produce consistent results. Prompt engineering might solve this problem. +1. **High Standard Deviations** +Items like `3.2 Data in the Expected Format` showed high standard deviations across repositories. This might indicate potential poor prompt quality for the LLM to produce consistent results. Improved prompt engineering could address this issue. -On the other hand, there are outliers yielding exceptionally high standard deviations for item `5.3 Ensure Model Output Shape Aligns with Expectation`. This might be because those repositories are unorthodox, but careful manual examination is required for a more definite conclusion. +2. 
**Outliers with High Standard Deviations** +Items like `5.3 Ensure Model Output Shape Aligns with Expectation` had outliers with exceptionally high standard deviations, which is possibly due to unorthodox repositories. A careful manual examination is required for a more definitive conclusion. #### Comparison of `gpt-3.5-turbo` and `gpt-4o` -To examine if newer LLMs help in both metrics, we preliminarily compared system outputs from `gpt-4o` versus `gpt-3.5-turbo` on the `lightfm` repository. We observed that the one from `gpt-4o` consistently returned "Satisfied", which deviates from the ground truth. +To evaluate if newer LLMs improve performance, we preliminarily compared outputs from `gpt-4o` and `gpt-3.5-turbo` on the `lightfm` repository. We observed that `gpt-4o` consistently returned "Satisfied," which deviated from the ground truth. ```{python} # FIXME: jitter-mean-sd plot (checklist item vs. score) for each repo @@ -358,15 +362,15 @@ errorbars = base.mark_errorbar().encode( titleFontSize=12 ) ``` -> Caption: Comparison of the satisfaction using `gpt-4o` versus using `gpt-3.5-turbo` for each checklist item on `lightfm` +> Caption: Comparison of satisfaction using `gpt-4o` versus `gpt-3.5-turbo` for each checklist item on lightfm -Further investigation into `gpt-4o` is required to address if it can enhance the system performance. +Further investigation into `gpt-4o` is required to determine its effectiveness in system performance. ## Conclusion ### Wrap Up -The development of FixML have been driven by both the need of better quality assurance in ML systems, and the current limitations of traditional testing methods on ML projects. FixML provides curated checklists and automated tools that enhance the evaluation and creation of test suites for ML projects, which in return, significantly reduces the time and human effort required to assess the completeness of ML test suites. This popularizes thorough and efficient assessment on ML projects. +The development of FixML has been driven by the need of better quality assurance in ML systems and the current limitations of traditional testing methods on ML projects. FixML provides curated checklists and automated tools that enhance the evaluation and creation of test suites for ML projects. This in return, significantly reduces the time and effort required to assess the completeness of ML test suites, and thus promotes thorough and efficient assessment on ML projects. ### Limitation & Future Improvement @@ -374,24 +378,23 @@ While FixML provides substantial benefits, there are limitations and areas to be 1. **Specialized Checklist** -The default checklist is general and might not cover all requirements for different ML projects. Future development will focus on creating more specialized checklists for more tailored evaluations across domains and project types. We welcome any collaboration with ML researchers on the creation of specalized checklists based on their use cases. +The default checklist is general and may not cover all requirements for different ML projects. Future development will focus on creating specialized checklists for tailored evaluations across various domains and project types. Collaboration with ML researchers is welcomed for creating specialized checklists based on specific use cases. 2. **Enhanced Test Evaluator** -Our study reveals the accuracy and consistency issues on the evaluation results using OpenAI GPT-3.5-turbo model. 
Future improvements involves better prompt engineering techniques and support for multiple LLMs for enhanced performance and flexibility. We expect to include user guidelines in prompt creation to faciliate collaboration with ML developers. +Our study reveals the accuracy and consistency issues on the evaluation results using OpenAI GPT-3.5-turbo model. Future improvements involves better prompt engineering techniques and support for multiple LLMs for enhanced performance and flexibility. User guidelines in prompt creation will be provided to facilitate collaboration with ML developers. 3. **Customized Test Specification** - -FixML currently produces general test function skeletons solely based on the curated checklist. Future developments involve the integration of the ML project infromation in the generation process to produce customized test functions skeletons. This further incentivizes users to create comprehensive tests. +Future developments will integrate project-specific information to produce customized test function skeletons. This may further encourage users to create comprehensive tests. 4. Workflow Optimization #FIXME: have to review whether to include as it seems lower priority. -The current test evaluator and test specification generator are separate entities. This could be improved by embedding a workflow engine that allows the system to automatically take actions based on the LLM response. For instance, if the LLM response suggests that test suites are partially satisfied or non-satisfied, the system could automatically run the test generator to produce test function skeletons and then reevaluate them until they are satisfied or some threshold is met. This would create a more cohesive and efficient workflow, reducing manual intervention and improving overall system performance. +The test evaluator and test specification generator are currently separate. Future improvements could embed a workflow engine that automatically takes actions based on LLM responses. This creates a more cohesive and efficient workflow, recues manual intervention, and improves overall system performance. 5. Performance Optimization #FIXME: have to review whether to include as it seems lower priority. -Performance optimization is another critical area for future development. As FixML handles large codebases and complex evaluations, optimizing the system to handle these tasks more efficiently is essential. This includes improving the speed and accuracy of the LLM responses, reducing the time taken to analyze and generate reports, and ensuring the system can scale effectively to handle more extensive and more complex projects. +As FixML handles large codebases and complex evaluations, performance optimization is essential. Future developments will focus on improving the speed and accuracy of LLM responses, reducing analysis and report generation times, and ensuring scalability for handling larger and more complex projects. -By addressing these limitations with future improvements, we hope FixML achieves better performance and promotes better ML systems and thus better human life. +By addressing these limitations and implementing future improvements, we aim for FixML to achieve better performance and contribute to the development of better ML systems, and ultimately enhance human life. 
## References \ No newline at end of file From 79a5d9d0e7313ca7ba7fd811bddc890e1b0779ff Mon Sep 17 00:00:00 2001 From: shumlh Date: Thu, 20 Jun 2024 10:00:39 -0700 Subject: [PATCH 04/12] Add reference to final report --- report/final_report/final_report.qmd | 4 ++-- report/final_report/references.bib | 8 ++++++++ 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/report/final_report/final_report.qmd b/report/final_report/final_report.qmd index b9e2725..405d548 100644 --- a/report/final_report/final_report.qmd +++ b/report/final_report/final_report.qmd @@ -20,7 +20,7 @@ by John Shiu, Orix Au Yeung, Tony Shum, Yingzi Jin The global artificial intelligence (AI) market is growing exponentially ([@grand2021artificial]), driven by its ability to autonomously make complex decisions impacting various aspects of human life, including financial transactions, autonomous transportation, and medical diagnosis. -However, ensuring the software quality of these systems remains a significant challenge ([@openja2023studying]). Specifically, the lack of a standardized and comprehensive approach to testing machine learning (ML) systems introduces potential risks to stakeholders. For example, inadequate quality assurance in ML systems can lead to severe consequences, such as misinformation ([@Ashley2024]), social bias ([@Alice2023]), substantial financial losses ([@Asheeta2019]) and safety hazards. +However, ensuring the software quality of these systems remains a significant challenge ([@openja2023studying]). Specifically, the lack of a standardized and comprehensive approach to testing machine learning (ML) systems introduces potential risks to stakeholders. For example, inadequate quality assurance in ML systems can lead to severe consequences, such as misinformation ([@Ashley2024]), social bias ([@Alice2023]), substantial financial losses ([@Asheeta2019]) and safety hazards ([@David2023]) Therefore, defining and promoting an industry standard and establishing robust testing methodologies for these systems is crucial. But how? @@ -110,7 +110,7 @@ It converts Evaluator responses into evaluation reports in various formats (HTML #### Checklist Design -The embedded checklist ([Fig. 1](overview-diagram)) contains best practices for testing ML pipelines, and is curated from ML research and recognized online resources. Prompt engineering further improves performance. THis helps mitigate LLM hallucinations ([@zhang2023sirens]) by ensuring strict adherence to the checklist. +The embedded checklist contains best practices for testing ML pipelines, and is curated from ML research and recognized online resources. Prompt engineering further improves performance. THis helps mitigate LLM hallucinations ([@zhang2023sirens]) by ensuring strict adherence to the checklist. 
Example checklist structure: diff --git a/report/final_report/references.bib b/report/final_report/references.bib index 8c6febe..8823afa 100644 --- a/report/final_report/references.bib +++ b/report/final_report/references.bib @@ -36,6 +36,14 @@ @misc{Asheeta2019 institution = {Firstpost} } +@misc{David2023, + author = {David Shepardson}, + year = {2023}, + title = {GM's Cruise recalling 950 driverless cars after pedestrian dragged in crash}, + url = {https://www.reuters.com/business/autos-transportation/gms-cruise-recall-950-driverless-cars-after-accident-involving-pedestrian-2023-11-08/}, + institution = {Reuters} +} + @article{kapoor2022leakage, title={Leakage and the reproducibility crisis in ML-based science}, author={Kapoor, Sayash and Narayanan, Arvind}, From 0c940e35b940228dcb6742a9729b293cc285432b Mon Sep 17 00:00:00 2001 From: shumlh Date: Thu, 20 Jun 2024 11:58:05 -0700 Subject: [PATCH 05/12] Fix final report image issue --- report/final_report/_quarto.yml | 2 +- report/final_report/docs/01_preprocess.html | 2 +- .../final_report/docs/02_finding-report.html | 2 +- .../docs/02_plots-for-final-report.html | 2 +- .../docs/04_plots-for-presentations.html | 2 +- report/final_report/docs/final_report.html | 246 ++++++++++-------- .../docs/img}/checklist_sample.png | Bin report/final_report/docs/{ => img}/logo.png | Bin .../docs/img/proposed_system_overview.png | Bin 0 -> 74338 bytes .../img}/test_evaluation_report_sample.png | Bin .../docs/img}/test_spec_sample.png | Bin report/final_report/docs/proposal.html | 4 +- report/final_report/docs/search.json | 12 +- report/final_report/final_report.qmd | 8 +- report/final_report/img/checklist_sample.png | Bin 0 -> 210625 bytes report/final_report/{ => img}/logo.png | Bin .../img/proposed_system_overview.png | Bin 0 -> 74338 bytes .../img/test_evaluation_report_sample.png | Bin 0 -> 287875 bytes report/final_report/img/test_spec_sample.png | Bin 0 -> 134698 bytes 19 files changed, 152 insertions(+), 128 deletions(-) rename {img => report/final_report/docs/img}/checklist_sample.png (100%) rename report/final_report/docs/{ => img}/logo.png (100%) create mode 100644 report/final_report/docs/img/proposed_system_overview.png rename {img => report/final_report/docs/img}/test_evaluation_report_sample.png (100%) rename {img => report/final_report/docs/img}/test_spec_sample.png (100%) create mode 100644 report/final_report/img/checklist_sample.png rename report/final_report/{ => img}/logo.png (100%) create mode 100644 report/final_report/img/proposed_system_overview.png create mode 100644 report/final_report/img/test_evaluation_report_sample.png create mode 100644 report/final_report/img/test_spec_sample.png diff --git a/report/final_report/_quarto.yml b/report/final_report/_quarto.yml index a18ba50..05d3bd5 100644 --- a/report/final_report/_quarto.yml +++ b/report/final_report/_quarto.yml @@ -5,7 +5,7 @@ project: website: sidebar: style: "docked" - logo: "logo.png" + logo: "img/logo.png" search: true contents: - section: "Final Report" diff --git a/report/final_report/docs/01_preprocess.html b/report/final_report/docs/01_preprocess.html index b724d86..1bd826e 100644 --- a/report/final_report/docs/01_preprocess.html +++ b/report/final_report/docs/01_preprocess.html @@ -122,7 +122,7 @@ @@ -230,85 +231,102 @@

Executive Summary

Introduction

Problem Statement


+

The global artificial intelligence (AI) market is growing exponentially (Grand-View-Research 2021), driven by its ability to autonomously make complex decisions impacting various aspects of human life, including financial transactions, autonomous transportation, and medical diagnosis.

+

However, ensuring the software quality of these systems remains a significant challenge (Openja et al. 2023). Specifically, the lack of a standardized and comprehensive approach to testing machine learning (ML) systems introduces potential risks to stakeholders. For example, inadequate quality assurance in ML systems can lead to severe consequences, such as misinformation (Belanger 2024), social bias (Nunwick 2023), substantial financial losses (Regidi 2019), and safety hazards (Shepardson 2023).

Therefore, defining and promoting an industry standard and establishing robust testing methodologies for these systems is crucial. But how?

Our Objectives


+

We propose to develop test suite diagnostic tools based on Large Language Models (LLMs) and to curate checklists based on ML research papers and best practices to facilitate comprehensive and flexible testing of ML systems. Our goal is to enhance applied ML software's trustworthiness, quality, and reproducibility across both industry and academia (Kapoor and Narayanan 2022).

Data Science Methods

Current Approaches


-
+

To ensure the reproducibility, trustworthiness, and lack of bias in ML systems, comprehensive testing is essential. We outlined some traditional approaches for assessing the completeness of ML system tests with their advantages and drawbacks as follows.

+
    +
  1. Code Coverage
  2. +
+

Code coverage measures the proportion of a program's source code that is executed when a particular test suite is run. Widely used in software development, it quantifies test quality and is scalable due to its short processing time. However, it cannot indicate why, or in which ML-specific areas, a test suite falls short in the context of ML system development; a short illustration of this approach appears at the end of this list.

+
    +
2. Manual Evaluation
+

Manual evaluation involves human experts reviewing the source code; they can take the business logic into consideration and identify vulnerabilities. It often provides context-specific improvement suggestions and remains one of the most reliable practices (Openja et al. 2023; Alexander et al. 2023). However, it is time-consuming and not scalable due to the scarcity of human experts. Moreover, different experts might emphasize different ML test areas rather than providing a comprehensive and holistic review of the ML system test suites.
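As a concrete illustration of the first approach above, line coverage for a Python ML project can be collected programmatically. This is a minimal sketch, assuming a pytest-based test suite under `tests/` and a hypothetical package name; it is not part of FixML itself.

```python
# Minimal sketch: measure line coverage of a pytest-based test suite.
# "my_ml_project" and "tests/" are hypothetical names used for illustration.
import coverage
import pytest

cov = coverage.Coverage(source=["my_ml_project"])
cov.start()
pytest.main(["tests/"])          # run the project's test suite in-process
cov.stop()
cov.save()
cov.report(show_missing=True)    # print per-file percentage of executed lines
```

A single percentage like this is easy to automate, but, as noted above, it says nothing about which ML-specific behaviours the tests actually exercise.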

Our Approach


+

Our approach is to deliver an automated code review tool with the best practices of ML test suites embedded. This tool aims to educate ML users on best practices while providing comprehensive evaluations of their ML system codes.

+

To establish these best practices, we utilized data from ML research papers and recognized online resources. In collaboration with our partner, we researched industrial best practices ((Team 2023), (Jordan 2020)) and academic literature ((Openja et al. 2023)), and consolidated testing strategies into a human-readable and machine-friendly checklist that can be embedded into the automated tool.

+

For development, we collected 11 GitHub repositories of ML projects as studied in (Openja et al. 2023). These Python-based projects include comprehensive test suites. Our tool should be able to analyze these test suites, compare them with embedded best practices, and deliver evaluations.

+

We expect that our approach will provide scalable and reliable test suite evaluations for multiple ML projects. However, we recognize that our current best practices only focus on a few high-priority test areas due to time constraints. We plan to expand this scope in the future. While our tool’s evaluations are not yet as reliable as human evaluations, we will quantify its performance.

Success Metrics


+

To properly assess the performance of our tool, which leverages LLM capabilities, we referenced the methods in (Alexander et al. 2023) and defined two success metrics: accuracy and consistency. These metrics help users (researchers, ML engineers, etc.) gauge the trustworthiness of our tool's evaluation results.

    -
  2. +
1. Accuracy vs Human Expert Judgement
-


+

We run our tool on ML projects from (Openja et al. 2023) to obtain evaluation results for each ML checklist item. These results are then compared with our manually assessed ground truth data based on the same criteria. Accuracy is calculated as the proportion of matching results to the total number of results.

    -
  2. +
2. Consistency
-


+

We perform multiple runs on each ML project to obtain evaluation results for each checklist item. Consistency is measured by calculating the standard deviation of these results across multiple runs for each project.
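The sketch below shows one way these two metrics can be computed from per-run results, assuming a long-format table of scores (columns `repo`, `id`, `run`, `score`) and a matching ground-truth table; the actual column names and file layout used in our experiments may differ.

```python
# Minimal sketch: accuracy against ground truth and per-item consistency.
import pandas as pd

scores = pd.read_csv("score_by_repo_run.csv")    # assumed columns: repo, id, run, score
ground_truth = pd.read_csv("ground_truth.csv")   # assumed columns: repo, id, ground_truth

# Accuracy: proportion of tool evaluations that match the human label.
merged = scores.merge(ground_truth, on=["repo", "id"])
accuracy = (merged["score"] == merged["ground_truth"]).mean()

# Consistency: standard deviation of the score across runs,
# computed per repository and checklist item.
consistency = scores.groupby(["repo", "id"])["score"].std().reset_index(name="std")

print(f"accuracy = {accuracy:.2f}")
print(consistency.head())
```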

Data Product & Results

Data Products


+

Our solution includes a curated checklist for robust ML testing and a Python package for checklist-based evaluation of ML project testing robustness using LLMs. The package is publicly available on the Python Packaging Index (PyPI).

+

Justifications for these products are:

+
    +
• Checklists have been shown to reduce errors in software systems and promote code submissions (Gawande 2010; Pineau et al. 2021).
• Python is widely used in ML, compatible with various OSes, and integrates well with LLMs. These ensure the ease of use and development.

How to use the product

There are two ways to make use of this package:


1. As a CLI tool. A runnable command fixml is provided by the package. Once installed, users can perform codebase evaluations, generate test function specifications, and more by running subcommands under fixml in the terminal.

2. As a high-level API. Users can import necessary components from the package into their own systems. Documentation is available through docstrings.

-

By formating our product as a CLI tool and API, one (researchers, ML engineers, etc.) will find it user-friendly to interact with. Moreover, it is versatile to support various use cases, such as web application development, data science research, etc.

+

By offering it as both CLI tool and API, our product is user-friendly to interact with, and versatile to support various use cases such as web application development and scientific research.
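For orientation, a typical invocation driven from Python might look like the sketch below. The subcommand and flag names are placeholders rather than the package's documented interface; consult the fixml help output and the docstrings for the actual commands.

```python
# Hypothetical CLI invocation; the subcommand and flag ("evaluate", "--report")
# are illustrative placeholders only, not the package's confirmed interface.
import subprocess

subprocess.run(
    ["fixml", "evaluate", "path/to/ml_project", "--report", "report.html"],
    check=True,
)
```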

System Design


+

(FIXME To be revised) image

+

The design of our package follows object-oriented design and SOLID principles and is fully modular. Users can easily switch between different prompts, models, and checklists, which facilitates code reuse and collaboration to extend its functionality.

There are five components in the system of our package:


1. Code Analyzer
+

It extracts the test suites from the input codebase, ensuring that only the most relevant details are provided to the LLM given its token limits.

+
    +
2. Prompt Templates
+

It stores prompt templates for instructing LLMs to generate responses in the expected format.

+
    +
3. Checklist

It reads the curated checklist from a CSV file into a dictionary with a fixed schema for LLM injection. The package includes a default checklist for distribution.

+
    +
4. Runners
+

It includes the Evaluator module, which assesses each test suite file using LLMs and outputs evaluation results, and the Generator module, which creates test specifications. Both modules feature validation, retry logic, and record response and relevant information.

+
    +
5. Parsers
+

It converts Evaluator responses into evaluation reports in various formats (HTML, PDF) using the Jinja template engine, which enables customizable report structures.
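As an illustration of this last component, the snippet below renders a tiny evaluation report with Jinja2. The response fields shown are invented for the example and do not reflect the package's actual response schema.

```python
# Minimal sketch: render evaluation results to HTML with the Jinja template engine.
# The field names (id, title, score, reason) are illustrative only.
from jinja2 import Template

template = Template(
    "<h1>Test suite evaluation</h1>\n"
    "<ul>\n"
    "{% for item in items %}"
    "<li>{{ item.id }} {{ item.title }}: score {{ item.score }} - {{ item.reason }}</li>\n"
    "{% endfor %}"
    "</ul>\n"
)

items = [
    {"id": "2.1", "title": "Ensure Data File Loads as Expected", "score": 0.5,
     "reason": "Only one loading path is exercised by the tests."},
]
print(template.render(items=items))
```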

Checklist Design


+

The embedded checklist contains best practices for testing ML pipelines and is curated from ML research and recognized online resources. Prompt engineering further improves performance and helps mitigate LLM hallucinations (Zhang et al. 2023) by ensuring strict adherence to the checklist.

+

Example checklist structure:

--++ @@ -319,117 +337,702 @@

Checklist Design

- + - + - + - + - + - + - +
IDThe Unique Identifier of the checklist itemUnique Identifier of the checklist item
TopicThe Test Area of the checklist itemTest Area of the checklist item
TitleThe Title of the checklist itemTitle of the checklist item
RequirementThe Prompt of the checklist item to be injected into LLMs for evaluationPrompt for the checklist item to be injected into LLMs for evaluation
ExplanationsDetailed explanations of the checklist item for human understandingDetailed explanations for human understanding
ReferenceReferences of the checklist item, e.g. academic paperReferences for the checklist item, e.g., academic papers
Is Evaluator ApplicableWhether the checklist item is selected to be used during evaluation. 0 indicates No, 1 indicates YesIndicates if the checklist item is used during evaluation (0 = No, 1 = Yes)
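A checklist stored in this format can be loaded with a few lines of pandas; the sketch below assumes the CSV columns match the schema above (the file path is illustrative).

```python
# Minimal sketch: load the checklist CSV and keep the items flagged for evaluation.
import pandas as pd

checklist = pd.read_csv("checklist/checklist.csv")          # illustrative path
selected = checklist[checklist["Is Evaluator Applicable"] == 1]

# Convert to a list of dicts ready to be injected into an LLM prompt.
items = selected[["ID", "Topic", "Title", "Requirement"]].to_dict(orient="records")
for item in items:
    print(f"{item['ID']} ({item['Topic']}): {item['Title']}")
```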
-

(To be revised)

+

(FIXME To be revised)

Artifacts

-

There are three artifacts after using our package:

+

Using our package results in three artifacts:

    -
  2. +
1. Evaluation Responses
-

(To be revised) schema of the JSON saved & what kind of information is stored

+

These responses include both the LLM evaluation results and process metadata, stored in JSON format. This supports downstream tasks such as report rendering and scientific research.

+

(FIXME To be revised) schema of the JSON saved & what kind of information is stored

    -
  2. +
2. Evaluation Report
-

(To be revised)

+

This report presents structured evaluation results of ML projects, which includes a detailed breakdown of completeness scores and reasons for each score.

+

(FIXME To be revised)

    -
  1. Test Specification Script The artifacts stores the test specification responses from LLMs in Python script format.
  2. +
  3. Test Specification Script
-

(To be revised)

+

Generated test specifications are stored as Python scripts.

+

(FIXME To be revised)

Evaluation Results

-


-
-

Accuracy

-


+

As described in Success Metrics, we conducted 30 iterations on each repository from (Openja et al. 2023) and examined the breakdown of the completeness score to assess our tool’s evaluation quality.

+

(FIXME: would it be better to show a table of the repos? like how the Openja does?)

+
    +
  1. Accuracy
  2. +
+

We targeted 3 of the repositories (lightfm, qlib, DeepSpeech) for human evaluation and compared our tool's outputs with the ground truth.

Code -
# FIXME: table: checklist id, title, (ground truth, (lightfm, qlib, DeepSpeech))
+
import pandas as pd
+gt = pd.read_csv('ground_truth.csv')
+gt
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
idtitleDeepSpeechlightfmqlib
02.1Ensure Data File Loads as Expected0.01.00.5
13.2Data in the Expected Format0.01.01.0
23.5Check for Duplicate Records in Data0.00.00.0
34.2Verify Data Split Proportion0.01.00.5
45.3Ensure Model Output Shape Aligns with Expectation0.00.51.0
56.1Verify Evaluation Metrics Implementation0.01.01.0
66.2Evaluate Model's Performance Against Thresholds0.01.01.0
+ +
+
-


+

Ground truth data for the 3 repositories. (1 = fully satisfied, 0.5 = partially satisfied, 0 = not satisfied)

Code -
# FIXME: jitter-mean-sd plot (checklist item vs. score) for each repo
+
# FIXME: jitter-mean-sd plot (checklist item vs. score) for each repo
+import altair as alt
+import pandas as pd
+
+df_repo__stat = pd.read_csv('score_stat_by_repo_3.5-turbo.csv')
+gt = pd.read_csv('ground_truth.csv')
+gt = gt.melt(id_vars=['id', 'title'], var_name='repo', value_name='ground_truth')
+
+df_repo__stat_with_gt = df_repo__stat.merge(gt, on=['id', 'title', 'repo'])
+
+base = alt.Chart(
+    df_repo__stat_with_gt.query('repo in ["lightfm", "qlib", "DeepSpeech"]')
+).transform_calculate(
+    min="max(0, datum.mean-datum.std)",
+    max="min(1, datum.mean+datum.std)"
+)
+    
+# generate the points
+points = base.mark_point(
+    filled=True,
+    size=50,
+    color='black'
+).encode(
+    x=alt.X('mean:Q').scale(domainMin=0, domainMax=1).title("Score").axis(
+        labelExpr="datum.value % 0.5 ? null : datum.label"
+    ),
+    y=alt.Y('id_title:N', title=None, axis=alt.Axis(labelPadding=10, labelLimit=1000, grid=False))#.scale(domainMin=0, domainMax=1).title('Score'),
+)
+
+# generate the points for ground truth
+gt_points = base.mark_point(
+    filled=True,
+    size=200,
+    color='green',
+    shape="diamond"
+).encode(
+    x=alt.X('ground_truth:Q'),
+    y=alt.Y('id_title:N')
+)
+
+# generate the error bars
+errorbars = base.mark_errorbar().encode(
+    x=alt.X("min:Q").title('1 SD'), #"id:N",
+    x2="max:Q",
+    y="id_title:N"
+)
+
+(gt_points + points + errorbars).facet(
+    column=alt.Column('repo:N').title(None)
+).configure_axis( 
+    labelFontSize=12, 
+    titleFontSize=12
+)
+
+ + +
+ +
-


+

Comparison of our system’s satisfaction determination versus the ground truth for each checklist item and repository

-

We found that our tool tends to undermine the actual satisfying cases. For the items that are actually satisfied (score = 1), our tool tends to classify as partially satisfied (score = 0.5), while for those that are partially satisfied (score = 0.5), our tool often classfies as not satisfied (score = 0).

+

Our tool tends to underrate satisfied cases: it often classifies fully satisfied items as partially satisfied, and partially satisfied items as not satisfied.

Code -
# FIXME: contingency table
+
df_repo_run = pd.read_csv('score_by_repo_run_3.5-turbo.csv')
+
+df_repo_run = df_repo_run.merge(gt, on=['id', 'title', 'repo'])
+
+contingency_table = pd.pivot_table(
+    df_repo_run,
+    values='run', 
+    index=['repo', 'id_title', 'ground_truth'], 
+    columns=['score'],
+    aggfunc='count', 
+    fill_value=0
+)
+contingency_table.index.names = ['Repository', 'Checklist Item', 'Ground Truth']
+contingency_table.sort_index(level=[0, 2])
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
score0.00.51.0
RepositoryChecklist ItemGround Truth
lightfm3.5. Check for Duplicate Records in Data0.03000
5.3. Ensure Model Output Shape Aligns with Expectation0.51290
2.1. Ensure Data File Loads as Expected1.00030
3.2. Data in the Expected Format1.00300
4.2. Verify Data Split Proportion1.001119
6.1. Verify Evaluation Metrics Implementation1.00525
6.2. Evaluate Model's Performance Against Thresholds1.00129
qlib3.5. Check for Duplicate Records in Data0.02370
2.1. Ensure Data File Loads as Expected0.50030
4.2. Verify Data Split Proportion0.53252
3.2. Data in the Expected Format1.001416
5.3. Ensure Model Output Shape Aligns with Expectation1.01254
6.1. Verify Evaluation Metrics Implementation1.021810
6.2. Evaluate Model's Performance Against Thresholds1.00246
+ +
+
-


+

Contingency table of our system’s satisfaction determination versus the ground truth

-

The accuracy issue may be attributed to the need for improvement of prompts in our checklist.

-
-
-

Consistency

-

Since the completeness score from LLMs contain randomness, we further studied the consistency of scores across checklist items and reposities.

+

The accuracy issue may be attributed to a need to improve our checklist prompts.

+
    +
2. Consistency
  2. +
+

As the completeness scores from LLMs contain randomness, we examined the consistency of completeness scores across checklist items and repositories.

Code -
# FIXME: jitter-boxplot, checklist item vs. SD
+
stds = df_repo__stat[['repo', 'std', 'id_title']].pivot(index='repo', columns='id_title').copy()
+stds.columns = [col[1] for col in stds.columns]
+stds = stds.reset_index()
+stds = stds.melt(id_vars='repo', var_name='id_title')
+
+base = alt.Chart(stds)
+
+box = base.mark_boxplot(
+    color='grey',
+    opacity=0.5,
+    size=20,
+).encode(
+    x=alt.X('value:Q').title('Standard Deviation of Scores'),
+    y=alt.Y('id_title:N', title=None, axis=alt.Axis(labelPadding=10, labelLimit=1000, grid=False))
+)
+
+stripplot = base.mark_circle(size=100).encode(
+    y=alt.Y( 
+        'id_title:N',
+        axis=alt.Axis(ticks=False, grid=True, labels=True), 
+        scale=alt.Scale(), 
+    ), 
+    x='value:Q',
+    yOffset="jitter:Q",
+    color=alt.Color('id_title:N', legend=None),
+    tooltip='repo'
+).transform_calculate(
+    # Generate Gaussian jitter with a Box-Muller transform
+    jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
+)
+
+(
+    box + stripplot
+).configure_view( 
+    stroke=None
+).configure_axis( 
+    labelFontSize=12, 
+    titleFontSize=12
+).properties(
+    height=300, 
+    width=600,
+    title="30 Runs on Openja's Repositories for each Checklist Item"
+) 
+
+ + +
+ +
-


+

Standard deviations of the score for each checklist item. Each dot represents the standard deviation of scores from 30 runs of a single repository.

-


-


-
+

We identified two diverging cases:

+
    +
  1. High Standard Deviations
  2. +
+

Items like 3.2 Data in the Expected Format showed high standard deviations across repositories. This might indicate poor prompt quality, making it difficult for the LLM to produce consistent results. Improved prompt engineering could address this issue.

+
    +
2. Outliers with High Standard Deviations
  2. +
+

Items like 5.3 Ensure Model Output Shape Aligns with Expectation had outliers with exceptionally high standard deviations, possibly because those repositories are structured in unorthodox ways. A careful manual examination is required for a more definitive conclusion.

Comparison of gpt-3.5-turbo and gpt-4o

-


+

To evaluate if newer LLMs improve performance, we preliminarily compared outputs from gpt-4o and gpt-3.5-turbo on the lightfm repository. We observed that gpt-4o consistently returned “Satisfied,” which deviated from the ground truth.

Code -
# FIXME: jitter-mean-sd plot (checklist item vs. score) for each repo
+
# FIXME: jitter-mean-sd plot (checklist item vs. score) for each repo
+df_repo_4o__stat = pd.read_csv('score_stat_by_repo_4o.csv')
+df_repo_4o__stat_with_gt = df_repo_4o__stat.merge(gt, on=['id', 'title', 'repo'])
+df_repo_4o__stat_with_gt['model'] = 'gpt-4o'
+
+df_repo_35turbo__stat_with_gt = df_repo__stat_with_gt.query("repo == 'lightfm'").copy()
+df_repo_35turbo__stat_with_gt['model'] = 'gpt-3.5-turbo'
+
+df_model_comp = pd.concat(
+    (df_repo_35turbo__stat_with_gt, df_repo_4o__stat_with_gt), 
+    axis=0
+)
+
+base = alt.Chart(
+    df_model_comp
+).transform_calculate(
+    min="max(0, datum.mean-datum.std)",
+    max="min(1, datum.mean+datum.std)"
+)
+    
+# generate the points
+points = base.mark_point(
+    filled=True,
+    size=50,
+    color='black'
+).encode(
+    x=alt.X('mean:Q').scale(domainMin=0, domainMax=1).title("Score").axis(
+        labelExpr="datum.value % 0.5 ? null : datum.label"
+    ),
+    y=alt.Y('id_title:N', title=None, axis=alt.Axis(labelPadding=10, labelLimit=1000, grid=False))#.scale(domainMin=0, domainMax=1).title('Score'),
+)
+
+# generate the points for ground truth
+gt_points = base.mark_point(
+    filled=True,
+    size=200,
+    color='green',
+    shape="diamond"
+).encode(
+    x=alt.X('ground_truth:Q'),
+    y=alt.Y('id_title:N')
+)
+
+# generate the error bars
+errorbars = base.mark_errorbar().encode(
+    x=alt.X("min:Q").title('1 SD'), #"id:N",
+    x2="max:Q",
+    y="id_title:N"
+)
+
+(gt_points + points + errorbars).facet(
+    column=alt.Column('model:N').title(None)
+).configure_axis( 
+    labelFontSize=12, 
+    titleFontSize=12
+)
+
+ + +
+ +
-


+

Comparison of satisfaction using gpt-4o versus gpt-3.5-turbo for each checklist item on lightfm

-

Further investigation into gpt-4o is required to address this issue and enhance the system performance.

+

Further investigation into gpt-4o is required to determine whether it actually improves system performance.

@@ -437,40 +1040,82 @@

Com

Conclusion

Wrap Up

-


-


-


+

The development of FixML has been driven by the need for better quality assurance in ML systems and by the limitations of traditional testing methods on ML projects. FixML provides curated checklists and automated tools that enhance the evaluation and creation of test suites for ML projects. This, in turn, significantly reduces the time and effort required to assess the completeness of ML test suites, and thus promotes thorough and efficient assessment of ML projects.

Limitation & Future Improvement

-

While FixML provides substantial benefits, there are limitations and areas that aim to be addressed in future development:

+

While FixML provides substantial benefits, there are limitations and areas to be addressed in future development:

  1. Specialized Checklist
-


+

The default checklist is general and may not cover all requirements for different ML projects. Future development will focus on creating specialized checklists for tailored evaluations across various domains and project types. Collaboration with ML researchers is welcomed for creating specialized checklists based on specific use cases.

2. Enhanced Test Evaluator
-


+

Our study reveals accuracy and consistency issues in the evaluation results produced with OpenAI's GPT-3.5-turbo model. Future improvements involve better prompt engineering techniques and support for multiple LLMs for enhanced performance and flexibility. User guidelines for prompt creation will be provided to facilitate collaboration with ML developers.

3. Customized Test Specification
-


+

Future developments will integrate project-specific information to produce customized test function skeletons. This may further encourage users to create comprehensive tests.

4. Workflow Optimization #FIXME: have to review whether to include as it seems lower priority.
-


+

The test evaluator and test specification generator are currently separate. Future improvements could embed a workflow engine that automatically takes actions based on LLM responses, for example regenerating test skeletons for items judged partially satisfied or not satisfied and re-evaluating them until they pass or a threshold is reached. This would create a more cohesive and efficient workflow, reduce manual intervention, and improve overall system performance.

5. Performance Optimization #FIXME: have to review whether to include as it seems lower priority.
-


-


+

As FixML handles large codebases and complex evaluations, performance optimization is essential. Future developments will focus on improving the speed and accuracy of LLM responses, reducing analysis and report generation times, and ensuring scalability for handling larger and more complex projects.
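To make the workflow-engine idea above concrete, the sketch below shows the kind of evaluate-then-regenerate loop such an engine could run. `evaluate()` and `generate_specs()` are hypothetical stand-ins, not FixML's actual API.

```python
# Hypothetical sketch of an automated evaluate/generate loop; the two helper
# functions are placeholders for the package's evaluator and generator components.
def evaluate(codebase: str, checklist: list) -> list:
    # Placeholder: would call the LLM-backed evaluator and return per-item scores.
    return [{"id": item["ID"], "score": 1.0} for item in checklist]

def generate_specs(codebase: str, unsatisfied: list) -> None:
    # Placeholder: would emit test function skeletons for the unsatisfied items.
    print("generating specs for", [item["id"] for item in unsatisfied])

def run_workflow(codebase: str, checklist: list, max_rounds: int = 3) -> list:
    report = []
    for _ in range(max_rounds):
        report = evaluate(codebase, checklist)
        unsatisfied = [item for item in report if item["score"] < 1]
        if not unsatisfied:
            break                      # every checklist item is satisfied
        generate_specs(codebase, unsatisfied)
    return report

run_workflow("path/to/ml_project", [{"ID": "3.2"}, {"ID": "5.3"}])
```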

+

By addressing these limitations and implementing future improvements, we aim for FixML to achieve better performance and contribute to the development of better ML systems, and ultimately enhance human life.

+
+ +
+ + -
- +

References

+
+Alexander, Rohan, Lindsay Katz, Callandra Moore, and Zane Schwartz. 2023. “Evaluating the Decency and Consistency of Data Validation Tests Generated by LLMs.” arXiv Preprint arXiv:2310.01402. +
+
+Belanger, Ashley. 2024. “Air Canada Must Honor Refund Policy Invented by Airline’s Chatbot.” Ars Technica. https://arstechnica.com/tech-policy/2024/02/air-canada-must-honor-refund-policy-invented-by-airlines-chatbot/. +
+
+Gawande, Atul. 2010. Checklist Manifesto, the (HB). Penguin Books India. +
+
+Grand-View-Research. 2021. “Artificial Intelligence Market Size, Share & Trends Analysis Report by Solution, by Technology (Deep Learning, Machine Learning), by End-Use, by Region, and Segment Forecasts, 2023 2030.” Grand View Research San Francisco. +
+
+Jordan, Jeremy. 2020. “Effective Testing for Machine Learning Systems.” https://www.jeremyjordan.me/testing-ml/. +
+
+Kapoor, Sayash, and Arvind Narayanan. 2022. “Leakage and the Reproducibility Crisis in ML-Based Science.” arXiv Preprint arXiv:2207.07048. +
+
+Nunwick, Alice. 2023. “ITutorGroup Settles AI Hiring Lawsuit Alleging Age Discrimination.” Verdict. https://www.verdict.co.uk/itutorgroup-settles-ai-hiring-lawsuit-alleging-age-discrimination/. +
+
+Openja, Moses, Foutse Khomh, Armstrong Foundjem, Zhen Ming, Mouna Abidi, Ahmed E Hassan, et al. 2023. “Studying the Practices of Testing Machine Learning Software in the Wild.” arXiv Preprint arXiv:2312.12604. +
+
+Pineau, Joelle, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo Larochelle. 2021. “Improving Reproducibility in Machine Learning Research (a Report from the Neurips 2019 Reproducibility Program).” Journal of Machine Learning Research 22 (164): 1–20. +
+
+Regidi, Asheeta. 2019. “SEBI’s Circular: The Black Box Conundrum and Misrepresentation in AI-Based Mutual Funds.” Firstpost. https://www.firstpost.com/business/sebis-circular-the-black-box-conundrum-and-misrepresentation-in-ai-based-mutual-funds-6625161.html. +
+
+Shepardson, David. 2023. “GM’s Cruise Recalling 950 Driverless Cars After Pedestrian Dragged in Crash.” Reuters. https://www.reuters.com/business/autos-transportation/gms-cruise-recall-950-driverless-cars-after-accident-involving-pedestrian-2023-11-08/. +
+
+Team, Microsoft Industry Solutions Engineering. 2023. “Testing Data Science and MLOps Code.” Testing Data Science and MLOps Code - Engineering Fundamentals Playbook. https://microsoft.github.io/code-with-engineering-playbook/machine-learning/ml-testing/. +
+
+Zhang, Yue, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, et al. 2023. “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.” https://arxiv.org/abs/2309.01219. +
+
@@ -73,7 +92,7 @@ - +