From 8bf9f068fa3179455a2461cf50abb8d23e3c67a1 Mon Sep 17 00:00:00 2001
From: Kelly Brown
Date: Wed, 18 Sep 2024 13:46:54 -0400
Subject: [PATCH 1/3] [Docs] Updates for SDG README

Signed-off-by: Kelly Brown
---
 README.md | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 8752dceb..cdf85d65 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# sdg
+# Synthetic Data Generation (SDG)
 
 ![Lint](https://github.com/instructlab/sdg/actions/workflows/lint.yml/badge.svg?branch=main)
 ![Build](https://github.com/instructlab/sdg/actions/workflows/pypi.yaml/badge.svg?branch=main)
@@ -10,3 +10,68 @@
 ![`e2e-nvidia-l40s-x4.yml` on `main`](https://github.com/instructlab/sdg/actions/workflows/e2e-nvidia-l40s-x4.yml/badge.svg?branch=main)
 
 Python library for Synthetic Data Generation
+
+## Introduction
+
+Synthetic Data Generation (SDG) is a process that creates an artificially generated dataset that mimics real data based on provided examples. SDG uses a YAML file containing question-and-answer pairs as input data.
+
+## Installing the SDG library
+
+Clone the library and navigate to the repo:
+
+```bash
+git clone https://github.com/instructlab/sdg
+cd sdg
+```
+
+Install the library:
+
+```bash
+pip install .
+```
+
+### Using the library
+
+You can import SDG into your Python files with the following items:
+
+```python
+ from instructlab.sdg.generate_data import generate_data
+ from instructlab.sdg.utils import GenerateException
+```
+
+## Pipelines
+
+There are four pipelines that are used in SDG. Each pipeline requires specific hardware specifications.
+
+
+*Full* -
+
+This pipeline is targeted for running SDG on consumer grade accelerators (GPUs).
+
+*Simple* -
+
+### Pipeline architecture
+
+All the pipelines are written in YAML format.
+
+Knowledge:
+
+Grounded Skills:
+
+Freeform Skills:
+
+
+
+## Repository structure
+
+```bash
+|-- sdg/src/instructlab/ (1)
+|-- sdg/docs/ (2)
+|-- sdg/scripts/ (3)
+|-- sgd/tests/ (4)
+```
+
+1. Contains the SDG code that interacts with InstructLab.
+2. Contains documentation on various SDG methodologies.
+3. Contains the code that tests the SDG data types: Knowledge, grounded skills, and freeform skills.
+4. Contains all the CI tests for the SDG repository.
\ No newline at end of file

From d88ec91ad5a51f35bca485a179c1b9cc34b9764d Mon Sep 17 00:00:00 2001
From: Ben Browning
Date: Wed, 20 Nov 2024 12:36:55 -0500
Subject: [PATCH 2/3] Flesh out some of the technical details of SDG README.md

This fills in some placeholder sections of our updated README.md. It's
not as detailed as it should eventually be, but at least gives a bit
more information as users browse the repository.

Signed-off-by: Ben Browning
---
 .spellcheck-en-custom.txt |  6 ++++++
 README.md                 | 35 ++++++++++++++++++-----------------
 2 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 2130dd21..0ed144c1 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -4,10 +4,13 @@
 Backport
 backported
 codebase
+configs
 Dataset
 dataset
 datasets
 distractor
+Eval
+eval
 FIXME
 freeform
 ICL
@@ -17,12 +20,15 @@ Langchain's
 LLM
 LLMBlock
 MCQ
+Merlinite
+Mixtral
 MMLU
 Ouput
 Pre
 pre
 Pregenerated
 qna
+quantized
 repo
 sdg
 Splitter
diff --git a/README.md b/README.md
index cdf85d65..d9419295 100644
--- a/README.md
+++ b/README.md
@@ -41,37 +41,38 @@ You can import SDG into your Python files with the following items:
 
 ## Pipelines
 
-There are four pipelines that are used in SDG. Each pipeline requires specific hardware specifications.
-
+A pipeline describes a series of steps to execute in-order to generate data.
 
-*Full* -
+There are three default pipelines shipped in SDG. These are the `simple`, `full`, and `eval` pipelines. Each pipeline requires specific hardware specifications
 
-This pipeline is targeted for running SDG on consumer grade accelerators (GPUs).
+### Simple Pipeline
 
-*Simple* -
+The [simple pipeline](src/instructlab/sdg/pipelines/simple) is designed to be used with [quantized Merlinite](https://huggingface.co/instructlab/merlinite-7b-lab-GGUF) as the teacher model. It exists to enable basic data generation results on lower end consumer grade hardware, such as laptops and desktops with small or no discrete GPUs.
 
-### Pipeline architecture
+### Full Pipeline
+
+The [full pipeline](src/instructlab/sdg/pipelines/full) is designed to be used with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as the the teacher model, but has also been successfully tested with smaller models such as [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and even some quantized versions of the two above. This is the preferred data generation pipeline on higher end consumer grade hardware and on all enterprise hardware.
 
-All the pipelines are written in YAML format.
+### Eval Pipeline
 
-Knowledge:
+The [eval pipeline](src/instructlab/sdg/pipelines/eval) is used to generate [MMLU](https://en.wikipedia.org/wiki/MMLU) benchmark data that can be used to later evaluate a trained model on your knowledge dataset. It does not generate data for use during model training.
 
-Grounded Skills:
+### Pipeline architecture
 
-Freeform Skills:
+All the pipelines are written in a YAML format and must adhere to a [specific schema](src/instructlab/sdg/pipelines/schema/v1.json).
 
-
+The pipelines that generate data for model training (simple and full pipelines) expect to have three different pipeline configs - one each for knowledge, grounded skills, and freeform skills. They are expected to exist in files called `knowledge.yaml`, `grounded_skills.yaml`, and `freeform_skills.yaml` respectively. For background on the difference in knowledge, grounded skills, and freeform skills, refer to the [InstructLab Taxonomy repository](https://github.com/instructlab/taxonomy).
 
 ## Repository structure
 
 ```bash
-|-- sdg/src/instructlab/ (1)
-|-- sdg/docs/ (2)
-|-- sdg/scripts/ (3)
-|-- sgd/tests/ (4)
+|-- src/instructlab/ (1)
+|-- docs/ (2)
+|-- scripts/ (3)
+|-- tests/ (4)
 ```
 
 1. Contains the SDG code that interacts with InstructLab.
 2. Contains documentation on various SDG methodologies.
-3. Contains the code that tests the SDG data types: Knowledge, grounded skills, and freeform skills.
-4. Contains all the CI tests for the SDG repository.
\ No newline at end of file
+3. Contains some utility scripts, but not part of any supported API.
+4. Contains all the tests for the SDG repository.

From 1f78089dd883f263c5572dc3b1ca032b5fbf80b3 Mon Sep 17 00:00:00 2001
From: Kelly Brown
Date: Thu, 21 Nov 2024 10:51:57 -0500
Subject: [PATCH 3/3] Updating some nits in the SDG README

Signed-off-by: Kelly Brown
---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index d9419295..1ea8e267 100644
--- a/README.md
+++ b/README.md
@@ -41,17 +41,17 @@ You can import SDG into your Python files with the following items:
 
 ## Pipelines
 
-A pipeline describes a series of steps to execute in-order to generate data.
+A pipeline is a series of steps to execute in order to generate data.
 
-There are three default pipelines shipped in SDG. These are the `simple`, `full`, and `eval` pipelines. Each pipeline requires specific hardware specifications
+There are three default pipelines shipped in SDG: `simple`, `full`, and `eval`. Each pipeline requires specific hardware specifications
 
 ### Simple Pipeline
 
-The [simple pipeline](src/instructlab/sdg/pipelines/simple) is designed to be used with [quantized Merlinite](https://huggingface.co/instructlab/merlinite-7b-lab-GGUF) as the teacher model. It exists to enable basic data generation results on lower end consumer grade hardware, such as laptops and desktops with small or no discrete GPUs.
+The [simple pipeline](src/instructlab/sdg/pipelines/simple) is designed to be used with [quantized Merlinite](https://huggingface.co/instructlab/merlinite-7b-lab-GGUF) as the teacher model. It enables basic data generation results on low-end consumer grade hardware, such as laptops and desktops with small or no discrete GPUs.
 
 ### Full Pipeline
 
-The [full pipeline](src/instructlab/sdg/pipelines/full) is designed to be used with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as the the teacher model, but has also been successfully tested with smaller models such as [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and even some quantized versions of the two above. This is the preferred data generation pipeline on higher end consumer grade hardware and on all enterprise hardware.
+The [full pipeline](src/instructlab/sdg/pipelines/full) is designed to be used with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as the the teacher model, but has also been successfully tested with smaller models such as [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and even some quantized versions of the two teacher models. This is the preferred data generation pipeline on higher end consumer grade hardware and all enterprise hardware.
 
 ### Eval Pipeline
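
Note on the "Pipeline architecture" text introduced in patch 2/3: the README says the simple and full pipelines expect three config files named `knowledge.yaml`, `grounded_skills.yaml`, and `freeform_skills.yaml`. A minimal sketch of how a reader might check a pipeline directory against that description is shown below. The helper name `missing_pipeline_configs` and the constant `EXPECTED_CONFIGS` are hypothetical and are not part of `instructlab.sdg`; only the three file names come from the README text.

```python
from pathlib import Path

# File names quoted from the README text in this patch series.
EXPECTED_CONFIGS = ("knowledge.yaml", "grounded_skills.yaml", "freeform_skills.yaml")


def missing_pipeline_configs(pipeline_dir):
    """Return the expected config file names that are absent from pipeline_dir.

    Hypothetical helper for illustration only; not an instructlab.sdg API.
    """
    root = Path(pipeline_dir)
    # Keep the README's ordering so the result is easy to compare by eye.
    return [name for name in EXPECTED_CONFIGS if not (root / name).is_file()]
```

Under these assumptions, calling `missing_pipeline_configs("src/instructlab/sdg/pipelines/full")` from a checkout would return an empty list if that directory matches what the README describes. The sketch only checks file presence; it does not validate the YAML against the schema the README links to.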