diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 16c51c8..ecfc0af 100755
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -40,30 +40,18 @@ If you are proposing a feature:
 
 ## Get Started!
 
-Ready to contribute? Here's how to set up `fixml` for local development.
+1. Follow [our guide](https://fixml.readthedocs.io/en/latest/install_devel_build.html)
+on installing the development build of FixML on your system.
 
-1. Download a copy of `fixml` locally.
-2. Create a new conda environment and Install all essential libraries:
+2. Use `git` (or similar) to create a branch for local development and make your changes:
 
    ```console
-   $ conda env create -f environment.yaml
+   git checkout -b name-of-your-bugfix-or-feature
   ```
 
-3. Activate the newly created environment:
+3. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests.
 
-   ```console
-   $ conda activate fixml
-   ```
-
-4. Use `git` (or similar) to create a branch for local development and make your changes:
-
-   ```console
-   $ git checkout -b name-of-your-bugfix-or-feature
-   ```
-
-5. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests.
-
-6. Commit your changes and open a pull request.
+4. Commit your changes and open a pull request.
 
 ## Pull Request Guidelines
 
diff --git a/README.md b/README.md
index 0920fd3..926a335 100644
--- a/README.md
+++ b/README.md
@@ -16,183 +16,79 @@
 A tool for providing context-aware evaluations using a checklist-based approach
 on the Machine Learning project code bases.
 
-## Motivation
-
-Testing codes in Machine Learning project mostly revolves around ensuring the
-findings are reproducible. To achieve this, currently it requires a lot of
-manual efforts. It is because such projects usually have assumptions that are
-hard to quantify in traditional software engineering approach i.e. code
-coverage. One such example would be testing the model's performance, which will
-not result in any errors, but we do expect this result to be reproducible by
-others. Testing such codes, therefore, require us to not only quantitatively,
-but also to qualitatively gauge how effective the tests are.
-
-A common way to handle this currently is to utilize expertise from domain
-experts in this area. Researches and guidelines have been done on how to
-incorporate such knowledge through the use of checklists. However, this requires
-manually validating the checklist items which usually results in poor
-scalability and slow feedback loop for developers, which are incompatible with
-today's fast-paced, competitive landscape in ML developments.
-
-This tool aims to bridge the gap between these two different approaches, by
-adding Large Language Models (LLMs) into the loop, given LLMs' recent
-advancement in multiple areas including NLU tasks and code-related tasks. They
-have been shown to some degrees the ability to analyze codes and to produce
-context-aware suggestions. This tool simplifies such workflow by providing a
-command line tool as well as a high-level API for developers and researchers
-alike to quickly validate if their tests satisfy common areas that are required
-for reproducibility purposes.
-
-Given LLMs' tendency to provide plausible but factually incorrect information,
-extensive analyses have been done on ensuring the responses are aligned with
-ground truths and human expectations both accurately and consistently.
-Based on these analyses, we are also able to continuously refine our prompts
-and workflows.
+## Documentation
 
-## Installation
+- Guides and API documentation: [https://fixml.readthedocs.org](https://fixml.readthedocs.org)
+- Reports and proposals: [https://ubc-mds.github.io/fixml](https://ubc-mds.github.io/fixml)
 
-This tool is on PyPI. To install, please run:
+## Installation
 
 ```bash
 pip install fixml
-```
 
-## Usage
+# For unix-like systems e.g. Linux, macOS
+export OPENAI_API_KEY={your-openai-api-key}
 
-### CLI tool
+# For windows systems
+set OPENAI_API_KEY={your-openai-api-key}
+```
 
-Once installed, the tool offers a Command Line Interface (CLI) command `fixml`.
-By using this command you will be able to evaluate your project code bases,
-generate test function specifications, and perform various relevant tasks.
+For a more detailed installation guide,
+visit [the related page on Read the Docs](https://fixml.readthedocs.io/en/latest/installation.html).
 
-Run `fixml --help` for more details.
+## Usage
 
-> [!IMPORTANT]
-> By default, this tool uses OpenAI's `gpt3.5-turbo` for evaluation. To run any
-> command that requires calls to LLM (i.e. `fixml evaluate`, `fixml generate`),
-> an environment variable `OPENAI_API_KEY` needs to be set. To do so, either use
-`export` to set the variable in your current session, or create a `.env` file
-> with a line `OPENAI_API_KEY={your-api-key}` saved in your working directory.
+### CLI tool
 
-> [!TIP]
-> Currently, only calls to OpenAI endpoints are supported. This tool is still in
-> ongoing development and integrations with other service providers and locally
-> hosted LLMs are planned.
+FixML offers a CLI command as a quick and easy way to evaluate existing tests
+and generate new ones.
 
 #### Test Evaluator
 
-The test evaluator command is used to evaluate the tests of your repository. It
-generates an evaluation report and provides various options for customization,
-such as specifying a checklist file, output format, and verbosity.
+Here is an example command to evaluate a local repo:
 
-Example calls:
 ```bash
-# Evaluate repo, and output the evalutions as a JSON file in working directory
-fixml evaluate /path/to/your/repo
-
-# Perform the above verbosely, and use the JSON file to export a HTML report
-fixml evaluate /path/to/your/repo -e ./eval_report.html -v
-
-# Perform the above, but use a custom checklist, and to overwrite existing report
-fixml evaluate /path/to/your/repo -e ./eval_report.html -v -o -c checklist/checklist.csv
-
-# Perform the above, and to use gpt-4o as the evaluation model
-fixml evaluate /path/to/your/repo -e ./eval_report.html -v -o -c checklist/checklist.csv -m gpt-4o
+fixml evaluate /path/to/your/repo \
+    --export_report_to=./eval_report.html --verbose
 ```
 
 #### Test Spec Generator
 
-The test spec generator command is used to generate a test specification from a
-checklist. It allows for the inclusion of an optional checklist file to guide
-the test specification generation process.
-
-Example calls:
+Here is an example command to generate test specifications:
 ```bash
-# Generate test function specifications and to write them into a .py file
 fixml generate test.py
-
-# Perform the above, but to use a custom checklist
-fixml generate test.py -c checklist/checklist.csv
 ```
 
-### Package
+> [!TIP]
+> Run command `fixml {evaluate|generate} --help` for more information and all
+> available options.
+>
+> You can also refer
+> to [our Quickstart guide](https://fixml.readthedocs.io/en/latest/quickstart.html)
+> for a more detailed walkthrough on how to use the CLI tool.
 
-Alternatively, you can use the package to import all components necessary for running the evaluation/generation workflows listed above.
+### Package
 
-The workflows used in the package have been designed to be fully modular. You
-can easily switch between different prompts, models and checklists to use. You
-can also write your own custom classes to extend the capability of this library.
+Alternatively, you can use the package to import all components necessary for
+running the evaluation/generation workflows listed above.
 
-Consult the [API documentation on Readthedocs](https://fixml.readthedocs.io/en/latest/)
+Consult [our documentation on using the API](https://fixml.readthedocs.io/en/latest/using-the-api.html)
 for more information and example calls.
 
 ## Development Build
 
-If you are interested in helping the development of this tool, or you would like
-to get the cutting-edge version of this tool, you can install this tool via
-conda.
-
-To do this, ensure you have Miniconda/Anaconda installed on your system. You can
-download miniconda
-on [their official website](https://docs.anaconda.com/miniconda/).
+Please refer to [the related page in our documentation](https://fixml.readthedocs.io/en/latest/install_devel_build.html).
 
+## Rendering Documentation
 
-1. Clone this repository from GitHub:
-```bash
-git clone git@github.com:UBC-MDS/fixml.git
-```
-
-2. Create a conda environment:
-
-```bash
-conda env create -f environment.yaml
-```
-
-3. Activate the newly created conda environment (default name `fixml`):
-
-```bash
-conda activate fixml
-```
-
-4. Use `poetry` which is preinstalled in the conda environment to create a local package install:
-
-```bash
-poetry install
-```
-
-5. You now should be able to run `fixml`, try:
-```bash
-fixml --help
-```
-
-## Rendering API Documentation
-
-Make sure you have installed dev dependencies listed in `pyproject.toml`.
-
-```bash
-cd docs/
-
-python -m sphinx -T -b html -D language=en . _build
-```
-
-## Running the Tests
-
-Navigate to the project root directory and use the following command in terminal
-to run the test suite:
-
-```bash
-# skip integration tests
-pytest -m "not integeration"
-
-# run ALL tests, which requires OPENAI_API_KEY to be set
-pytest
-```
+Please refer to [the related page in our documentation](https://fixml.readthedocs.io/en/latest/render.html).
 
 ## Contributing
 
-Interested in contributing? Check out the contributing guidelines. Please note
-that this project is released with a Code of Conduct. By contributing to this
-project, you agree to abide by its terms.
+Interested in contributing? Check out
+the [contributing guidelines](CONTRIBUTING.md). Please note that this project is
+released with a [Code of Conduct](CONDUCT.md). By contributing to this project,
+you agree to abide by its terms.
 
 ## License
 
@@ -227,5 +123,5 @@ resource for the community.
 
 Special thanks to the University of British Columbia (UBC) and the University of
 Wisconsin-Madison for their support and resources. We extend our gratitude to
-Dr. Tiffany Timbers and Dr. Simon Goringfor their guidance and expertise, which
+Dr. Tiffany Timbers and Dr. Simon Goring for their guidance and expertise, which
 have been instrumental in the development of this project.
diff --git a/docs/conf.py b/docs/conf.py
index aa12a16..f368f1c 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -10,7 +10,7 @@
 copyright = "2024, John Shiu, Orix Au Yeung, Tony Shum, and Yingzi Jin"
 author = "John Shiu, Orix Au Yeung, Tony Shum, and Yingzi Jin"
 
-release = '0.1'
+release = '0.1.0'
 version = '0.1.0'
 
 # -- General configuration ---------------------------------------------------
diff --git a/docs/index.md b/docs/index.md
index b869d12..ff74b56 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,9 +4,13 @@
 ```{toctree}
 :maxdepth: 1
 :hidden:
-:caption: Guides
+:caption: Getting Started
 
-changelog.md
+motivation.md
+installation.md
+quickstart.md
+using_api.md
+reliability.md
 ```
 
 ```{toctree}
@@ -14,9 +18,12 @@ changelog.md
 :hidden:
 :caption: Development
 
+install_devel_build.md
 contributing.md
 conduct.md
+render.md
 release_walkthrough.md
+changelog.md
 ```
 
 ```{toctree}
diff --git a/docs/install_devel_build.md b/docs/install_devel_build.md
new file mode 100644
index 0000000..5a9537d
--- /dev/null
+++ b/docs/install_devel_build.md
@@ -0,0 +1,51 @@
+# Install Development Build
+
+If you are interested in helping with the development of this tool, or you
+would like to get the cutting-edge version of it, you can install it via
+conda.
+
+To do this, ensure you have Miniconda/Anaconda installed on your system. You can
+download Miniconda from [their official website](https://docs.anaconda.com/miniconda/).
+
+
+1. Clone this repository from GitHub:
+   ```bash
+   git clone git@github.com:UBC-MDS/fixml.git
+   ```
+
+2. Create a conda environment:
+
+   ```bash
+   cd fixml && conda env create -f environment.yml
+   ```
+
+3. Activate the newly created conda environment (default name `fixml`):
+
+   ```bash
+   conda activate fixml
+   ```
+
+4. Use `poetry`, which is preinstalled in the conda environment, to create a
+   local package install:
+
+   ```bash
+   poetry install
+   ```
+
+5. Done! You should now be able to run unit tests to confirm the build works
+   without problems:
+   ```bash
+   # skip integration tests
+   pytest -m "not integeration"
+
+   # run ALL tests, which requires OPENAI_API_KEY to be set
+   echo "OPENAI_API_KEY={your-openai-api-key}" > .env
+   pytest
+   ```
+
+```{note}
+For a more detailed walkthrough on how to set up the OpenAI API key, please
+refer to the
+[API key section of our installation guide](installation.md#configuring-api-keys).
+```
\ No newline at end of file
diff --git a/docs/installation.md b/docs/installation.md
new file mode 100644
index 0000000..82f2e32
--- /dev/null
+++ b/docs/installation.md
@@ -0,0 +1,151 @@
+# Installation
+
+On this page, we will guide you through how to install FixML on your
+computer and get started using it to evaluate your code projects!
+
+## Prerequisites
+
+FixML supports Python version 3.12 or later. Make sure that you have a
+compatible Python version installed on your system.
+
+```{note}
+Get Python from its [official website](https://www.python.org).
+
+Alternatively, you can use the one that comes with your operating system, or
+one from a virtual environment manager such
+as [virtualenv](https://virtualenv.pypa.io) or [conda](https://conda.io).
+```
+
+## Getting the package from PyPI
+
+The `fixml` package is hosted
+on [the Python Package Index (PyPI)](https://pypi.org). You can visit the
+project page on PyPI [here](https://pypi.org/project/fixml/).
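+
+Before installing, you may optionally create and activate a virtual
+environment so that FixML and its dependencies stay isolated from your system
+Python. This is a standard `venv` workflow shown here only as an illustrative
+sketch, not a FixML-specific requirement:
+
+```bash
+# optional: create and activate a virtual environment
+# (the directory name .venv is an arbitrary choice)
+python3 -m venv .venv
+source .venv/bin/activate        # on Windows: .venv\Scripts\activate
+```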
+
+To install this, enter this command in your terminal:
+```bash
+pip install fixml
+
+# or
+
+python3 -m pip install fixml
+```
+
+## Configuring API Keys
+
+As of version 0.1.0, the only supported connector to LLMs is the OpenAI API.
+Therefore, for any workload that involves calls to LLMs, an API key for
+accessing the OpenAI API is required.
+
+### Getting the API Key
+
+You can refer to OpenAI's page
+on [obtaining API keys](https://platform.openai.com/api-keys).
+
+```{warning}
+The API keys are credentials that should be treated the same way you treat
+passwords. Refer to OpenAI's page on best practices for
+[keeping your API key safe](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).
+```
+
+```{note}
+FixML only uses the API key stored *locally* on your system for calls to LLMs,
+and will not save, transmit, or leak the API key in any way.
+```
+
+### Increasing Quota
+
+```{note}
+The free trial version of the OpenAI API comes with many limits, such as a very
+low ceiling on the tokens-per-minute quota.
+
+Although optimizations have been done to reduce token usage, FixML still needs
+to transmit a significant portion of the code base when conducting the
+analysis.
+
+As such, for most code bases, the free trial version of the API is unsuitable
+for use and will often result in errors stating that the rate limit has been
+reached.
+```
+
+To prevent FixML from hitting rate limit errors, you should upgrade to the paid
+version of the API, which raises the rate limit and removes this restriction.
+
+Based
+on [OpenAI's documentation](https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-one),
+we estimate you must be at least at Tier 1 to be able to use our tool without
+frequent intermittent rate limit errors.
+
+Refer to OpenAI's documentation
+on [rate limits and quotas](https://platform.openai.com/docs/guides/rate-limits)
+and [account limits](https://platform.openai.com/account/limits).
+
+### Saving the API Key into your system
+
+Once you have obtained the API key, it needs to be stored on your system so
+that FixML can discover it and subsequently use it when calling the OpenAI
+service endpoints.
+
+Currently, FixML will look for the key through the use
+of [Environment Variables](https://en.wikipedia.org/wiki/Environment_variable).
+
+There are two ways to do this:
+
+#### 1. Saving it Directly as an Environment Variable
+
+This way, the API key is saved directly into the operating system's set of
+environment variables.
+
+```{note}
+**Advantage**: Since it is saved as an environment variable in your operating
+system, it is accessible by `fixml` from any directory you are working in.
+
+**Disadvantage**: This setting is transient and will not persist after a system
+reboot. To make this change permanent, you can add the `export` (or `SET`)
+command below to your startup script.
+```
+
+##### Unix-like systems (Linux, macOS, etc.)
+
+1. Run this in your console/terminal emulator:
+   ```bash
+   export OPENAI_API_KEY={your-openai-api-key}
+   ```
+
+2. After running the command, confirm that the variable has been saved into the
+   system:
+   ```bash
+   export | grep OPENAI_API_KEY
+   ```
+
+##### Windows
+
+1. Run this in your console:
+   ```cmd
+   SET OPENAI_API_KEY={your-openai-api-key}
+   ```
+
+2. After running the command, confirm that the variable has been saved into the
+   system:
+   ```cmd
+   SET OPENAI_API_KEY
+   ```
+#### 2. Saving it inside an `.env` File
+
+This method saves your API key into a file named `.env`. It does not directly
+inject your API key as an environment variable into your system. Rather, when
+FixML is run, it looks for a file named `.env` in the *current working
+directory* and injects its contents as a set of temporary environment
+variables.
+
+```{note}
+**Advantage**: Persistent storage; because it is a file, it survives reboots.
+
+**Disadvantage**: Since it depends on the `.env` file's location, FixML cannot
+find the key when run from a directory that does not contain such a file.
+```
+
+To do this, run the following command:
+```bash
+echo OPENAI_API_KEY={your-openai-api-key} > .env
+```
\ No newline at end of file
diff --git a/docs/motivation.md b/docs/motivation.md
new file mode 100644
index 0000000..41dc98c
--- /dev/null
+++ b/docs/motivation.md
@@ -0,0 +1,41 @@
+# Motivation
+
+## Why another tool for testing tests? Aren't code coverage tools enough?
+
+Testing code in a Machine Learning project mostly revolves around ensuring the
+findings are reproducible. Achieving this currently requires a lot of manual
+effort, because such projects usually rest on assumptions that are hard to
+quantify with traditional software engineering metrics such as code coverage.
+One example is testing the model's performance: we would not only check that
+there are no errors during training, but also write tests expecting the model's
+performance to be consistent and reproducible by others. Testing such code
+therefore requires us to gauge how effective the tests are not only
+quantitatively, but also qualitatively.
+
+## OK, but we can evaluate the tests by looking at them ourselves...
+
+Yes, a common way to handle this currently is to draw on the expertise of
+domain experts in this area. Research and guidelines exist on how to
+incorporate such knowledge through the use of checklists. However, this
+requires manually validating the checklist items, which usually results in poor
+scalability and a slow feedback loop for developers, both of which are
+incompatible with today's fast-paced, competitive landscape in ML development.
+
+## So what does this tool offer?
+
+This tool aims to bridge the gap between these two approaches by adding Large
+Language Models (LLMs) into the loop, given LLMs' recent advancements in
+multiple areas including NLU and code-related tasks. They have shown, to some
+degree, the ability to analyze code and to produce context-aware suggestions.
+This tool simplifies such a workflow by providing a command line tool as well
+as a high-level API for developers and researchers alike to quickly validate
+whether their tests satisfy the common areas required for reproducibility.
+
+## LLMs are known for occasional hallucinations. How is this mitigated?
+
+Given LLMs' tendency to provide plausible but factually incorrect information,
+extensive analyses have been done to ensure the responses align with ground
+truths and human expectations both accurately and consistently. Based on these
+analyses, we are also able to continuously refine our prompts and workflows.
\ No newline at end of file
diff --git a/docs/quickstart.md b/docs/quickstart.md
new file mode 100644
index 0000000..35b6866
--- /dev/null
+++ b/docs/quickstart.md
@@ -0,0 +1,90 @@
+# Quickstart and Usage
+
+Once FixML is installed, the tool offers a Command Line Interface (CLI)
+command `fixml`.
+By using this command, you will be able to evaluate your project code bases,
+generate test function specifications, and perform various relevant tasks.
+
+```{warning}
+By default, this tool uses OpenAI's `gpt-3.5-turbo` for evaluation. To run any
+command that requires calls to an LLM (i.e. `fixml evaluate`, `fixml generate`),
+an environment variable `OPENAI_API_KEY` needs to be set.
+
+Visit the [installation guide](installation.md) for more information.
+```
+
+```{note}
+Currently, only calls to OpenAI endpoints are supported. This tool is still in
+ongoing development and integrations with other service providers and locally
+hosted LLMs are planned.
+```
+
+## Available CLI commands
+
+Main commands:
+- `fixml evaluate` - Test Evaluator
+- `fixml generate` - Test Spec Generator
+
+Additional commands are grouped into the following command groups:
+- `fixml export` - Report-exporting related commands
+- `fixml checklist` - Checklist related commands
+- `fixml repository` - Repository related commands
+
+```{note}
+Run `fixml --help` and `fixml {export|checklist|repository} --help` for more
+details.
+```
+
+## `fixml evaluate` - Test Evaluator
+
+The test evaluator command is used to evaluate the tests of your repository. It
+generates an evaluation report and provides various options for customization,
+such as specifying a checklist file, output format, and verbosity.
+
+Here is a very basic call:
+```bash
+fixml evaluate /path/to/your/repo
+```
+
+This will generate a JSON file in your current working directory containing
+the evaluation results.
+
+Of course, a JSON file is rarely enough. `fixml evaluate` is actually very
+versatile and supports many options/flags to modify its behaviour. Here is an
+elaborated example displaying what this command can do:
+```bash
+# Evaluate the repo, generate a JSON file, export the report as HTML after the
+# evaluation, display verbose messages while evaluating, overwrite existing
+# reports, use a custom checklist instead of the default one, and use GPT-4o
+# instead of GPT-3.5-turbo.
+
+fixml evaluate /path/to/your/repo \
+    --export_report_to=./eval_report.html \
+    --verbose --overwrite --model=gpt-4o \
+    --checklist_path=checklist/checklist.csv
+```
+
+```{note}
+The command `fixml evaluate --help` provides a comprehensive explanation of all
+flags and options available.
+```
+
+## `fixml generate` - Test Spec Generator
+
+The test spec generator command is used to generate a test specification from a
+checklist. It allows for the inclusion of an optional checklist file to guide
+the test specification generation process.
+
+Example call:
+```bash
+# Generate test function specifications using a custom checklist and write
+# them into a .py file
+fixml generate test.py -c checklist/checklist.csv
+```
+
+```{note}
+The command `fixml generate --help` provides a comprehensive explanation of all
+flags and options available.
+```
diff --git a/docs/reliability.md b/docs/reliability.md
new file mode 100644
index 0000000..bb54fe3
--- /dev/null
+++ b/docs/reliability.md
@@ -0,0 +1,15 @@
+# Reliability of the Tool
+
+Given LLMs' tendency to provide plausible but factually incorrect information,
+extensive analyses have been done to ensure the responses align with ground
+truths and human expectations both accurately and consistently. Based on these
+analyses, we are also able to continuously refine our prompts and workflows.
+
+Furthermore, we analyzed the responses' consistency and accuracy when
+evaluating 11 well-known Machine Learning projects on GitHub. We have also
+manually performed human evaluations on three repositories to make sure the
+evaluations of this tool align with human expectations.
+
+The analyses and findings are available inside
+the [`report/` directory on GitHub](https://github.com/ubc-mds/fixml/tree/main/report/).
diff --git a/docs/render.md b/docs/render.md
new file mode 100644
index 0000000..257cf47
--- /dev/null
+++ b/docs/render.md
@@ -0,0 +1,40 @@
+# Rendering Documentation
+
+This project comes with both reports rendered using Quarto and API
+documentation rendered using Sphinx (the one you're reading right now!).
+
+This page will guide you through the rendering process if you're interested
+in rendering them locally.
+
+## Quarto reports
+
+This includes the proposal, the final report, and the human evaluation reports.
+To render these, you must have Quarto installed on your system. You can
+obtain a copy of Quarto from [their official website](https://quarto.org).
+
+Then go to the project root folder and run this command:
+```bash
+make clean && make all
+```
+
+The rendered reports will be located in `report/docs/`.
+
+## Sphinx documentation
+
+This includes the installation and quickstart guides, changelogs, the API
+reference, and all other things related to the tool as a package.
+
+To render this, you need to install the development build of this tool. The
+documentation dependencies are only installed with the development build, not
+with the regular version of the package.
+
+Refer to [the related documentation](install_devel_build.md) for how to
+install a development build of FixML.
+
+After installing the build, run this to render the documentation in HTML:
+```bash
+cd docs/
+make clean && make html
+```
+
+The rendered website will be located in `docs/_build/html/`.
diff --git a/docs/using_api.md b/docs/using_api.md
new file mode 100644
index 0000000..df23886
--- /dev/null
+++ b/docs/using_api.md
@@ -0,0 +1,48 @@
+# Using the API
+
+Besides the CLI tool, you can also make use of the package's high-level,
+modular API to replicate the workflow inside your Python environment.
+
+Here is a high-level overview of the FixML system:
+
+![The high-level overview of the FixML system](../img/proposed_system_overview.png)
+
+There are five main components in the FixML system:
+
+1. **Code Analyzer**
+
+It extracts test suites from the input codebase, to ensure only the most
+relevant details are provided to LLMs given token limits.
+
+2. **Prompt Templates**
+
+It stores prompt templates for instructing LLMs to generate responses in the
+expected format.
+
+3. **Checklist**
+
+It reads the curated checklist from a CSV file into a dictionary with a fixed
+schema for LLM injection. The package includes a default checklist for
+distribution.
+
+4. **Runners**
+
+It includes the Evaluator module, which assesses each test suite file using LLMs
+and outputs evaluation results, and the Generator module, which creates test
+specifications. Both modules feature validation and retry logic, and record the
+responses and relevant information.
+
+5. **Parsers**
+
+It reads the report templates and converts the Evaluator's responses into
+evaluation reports in various formats (QMD, HTML, PDF) using the Jinja template
+engine, which enables customizable report structures.
+
+```{note}
+The workflows used in the package have been designed to be fully modular.
+You can easily switch between different prompts, models, and checklists. You
+can also write your own custom classes to extend the capability of this
+library.
+```
+
+For more usage and examples, refer to the API Reference.
\ No newline at end of file
diff --git a/report/docs/.gitkeep b/report/docs/.gitkeep
deleted file mode 100644
index e69de29..0000000
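
Taken together, the documentation pages above describe an end-to-end workflow.
The sketch below condenses it using only commands shown earlier; the repository
path, checklist path, and API key placeholder are illustrative values to
substitute with your own:

```bash
# install the released package from PyPI
pip install fixml

# make the OpenAI API key available in the current shell (unix-like systems)
export OPENAI_API_KEY={your-openai-api-key}

# evaluate an existing repository and export an HTML report
fixml evaluate /path/to/your/repo \
    --export_report_to=./eval_report.html --verbose

# generate test function specifications from a custom checklist
fixml generate test.py -c checklist/checklist.csv
```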