From 2d9b587b6d5358fe42d9d23341a9e155fe6fb043 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Mon, 6 Jan 2025 17:31:51 -0600 Subject: [PATCH 1/9] docs(taxonomy-index): resort taxonomy index page to make it clearer; clean up a bit Signed-off-by: Laura Santamaria --- docs/taxonomy/index.md | 29 +++++++++++------------------ 1 file changed, 11 insertions(+), 18 deletions(-) diff --git a/docs/taxonomy/index.md b/docs/taxonomy/index.md index 9eb8669..13b3c95 100644 --- a/docs/taxonomy/index.md +++ b/docs/taxonomy/index.md @@ -5,18 +5,13 @@ logo: images/ilab_dog.png --- ## Welcome to the InstructLab Taxonomy -InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for -Large Language Models (LLMs.) The "**lab**" in Instruct**Lab** 🐶 stands for -[**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081) [1]. +InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The "**lab**" in Instruct**Lab** 🐶 stands for [**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081) [^1]. -The LAB method is driven by taxonomies, which are largely created manually and -with care. +The LAB method is driven by taxonomies, which are largely created manually and with care. -This repository contains a taxonomy tree that allows you to create models -tuned with your data (enhanced via synthetic data generation) using the LAB 🐶 -method. +The [instructlab/taxonomy](https://github.com/instructlab/taxonomy) repository contains a taxonomy tree that allows you to create models tuned with your data (enhanced via synthetic data generation) using the LAB 🐶 method. -[1] Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", arXiv preprint arXiv: 2403.01081, 2024. (* denotes equal contributions) +[^1]: Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. 
Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", arXiv preprint arXiv: 2403.01081, 2024. (* denotes equal contributions) ## Choosing domains for the taxonomy @@ -28,19 +23,17 @@ If you are unsure where to put your knowledge or compositional skill, create a f Learn about the concepts of "skills" and "knowledge" in our [InstructLab Community Learning Guide](https://github.com/instructlab/community/blob/main/docs/README.md). -## Taxonomy tree Layout +## Taxonomy tree layout -The taxonomy tree is organized in a cascading directory structure. At the end of -each branch, there is a YAML file (qna.yaml) that contains the examples for that -domain. Maintainers can decide to change the names of the existing branches or to add new branches. +The taxonomy tree is organized in a cascading directory structure. At the end of each branch, there is a YAML file (`qna.yaml`) that contains the examples for that domain along with any attribution files (`attribution.txt`). Maintainers can decide to change the names of the existing branches or to add new branches. !!! important Folder names do not have spaces. Use underscores between words. -## Taxonomy diagram +### Taxonomy diagram !!! note - These diagrams shows a subset of the taxonomy. It is not a complete representation. + These diagrams show subsets of the taxonomy. They are not a complete representation. ```mermaid flowchart TD; @@ -110,7 +103,7 @@ By contributing your skills and knowledge to this repository, you will see your While public contributions are welcome to help drive community progress, you can also fork this repository under [the Apache License, Version 2.0](../LICENSE), add your own internal skills, and train your own models internally. However, you might need your own access to significant compute infrastructure to perform sufficient retraining. 
-## Ways to Contribute +### Ways to contribute You can contribute to the taxonomy in the following two ways: @@ -119,14 +112,14 @@ You can contribute to the taxonomy in the following two ways: For more information, see the [Ways of contributing to the taxonomy repository](https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md#ways-of-contributing-to-the-taxonomy-repository) documentation. -## How to contribute skills and knowledge +### How to contribute skills and knowledge To contribute to this repo, you'll use the *Fork and Pull* model common in many open source repositories. You can add your skills and knowledge to the taxonomy in multiple ways; for additional information on how to make a contribution, see the [Documentation on contributing](../community/CONTRIBUTING.md). You can also use the following guides to help with contributing: - Contributing using the [GitHub webpage UI](https://github.com/instructlab/taxonomy/blob/main/docs/contributing_via_GH_UI.md). - Contributing knowledge to the taxonomy in the [Knowledge contribution guidelines](../taxonomy/knowledge/guide.md). -### Why should I contribute? +#### Why should I contribute? This taxonomy repository will be used as the seed to synthesize the training data for InstructLab-trained models. We intend to retrain the model(s) using the main From 120940e40908251fffab69cb7973e1bdd16fb97e Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Mon, 6 Jan 2025 17:32:43 -0600 Subject: [PATCH 2/9] docs(guide): clean up taxonomy guide a bit Signed-off-by: Laura Santamaria --- docs/taxonomy/knowledge/guide.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/taxonomy/knowledge/guide.md b/docs/taxonomy/knowledge/guide.md index 12fd975..4096d7b 100644 --- a/docs/taxonomy/knowledge/guide.md +++ b/docs/taxonomy/knowledge/guide.md @@ -5,11 +5,11 @@ logo: images/ilab_dog.png --- # What is "Knowledge"? -Knowledge consists of data and facts and is backed by documents. 
When you create knowledge for a model, you're giving it additional data to more accurately answer questions. +In the InstructLab world, knowledge consists of data and facts and is backed by documents. When you create knowledge for a model, you're giving it additional data to more accurately answer questions. -Knowledge contributions in this project contain a few things. +Knowledge contributions in this project contain a few things: -- A file in a git repository that holds your information. For example, these repositories can include markdown versions of information on: Oscar 2024 winners, Law books, Shakespeare, Sports, Chemistry, etc. +- A file in a git repository that holds your information. For example, these repositories can include markdown versions of information on Oscar 2024 winners, Law books, Shakespeare, Sports, Chemistry, etc. - A `qna.yaml` file that asks and answers questions about the information in the git repository. - An `attribution.txt` file that includes the sources for the information used in the `qna.yaml`. @@ -58,7 +58,7 @@ We received many joke and poem submissions at the beginning of the project, and LLMs have inherent limitations that make certain tasks extremely difficult, like doing math problems. They're great at other tasks, like creative writing. And they could be better at things like logical reasoning. -An LLM with knowledge helps it create a basis of information that it can learn from, then you can teach it to use this knowledge via the `qna.yaml` files. +Providing an LLM training pipeline with knowledge helps create a basis of information that the model can learn from. With InstructLab, you can teach it to use this knowledge via the `qna.yaml` files. For example, you can give an LLM the entire periodic table, then in a `qna.yaml` add something like: @@ -68,7 +68,7 @@ answer: | The symbol for chlorine is Cl and the atomic number is 17. 
``` -With a few of these qna's, the model will learn the periodic table because it has the knowledge data. +With a few of these question-and-answer pairs, the model will learn the periodic table because it has the knowledge data. ### LLMs are great at From 609d3ba8d95182776bf5109ace666455cf3d49d6 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Mon, 6 Jan 2025 17:34:30 -0600 Subject: [PATCH 3/9] docs(knowledge): start adding clearer guides and a field table to the contribution guide for knowledge Signed-off-by: Laura Santamaria --- .../knowledge/contribution_details.md | 62 ++++++++++++++++--- 1 file changed, 53 insertions(+), 9 deletions(-) diff --git a/docs/taxonomy/knowledge/contribution_details.md b/docs/taxonomy/knowledge/contribution_details.md index 08be534..51138ac 100644 --- a/docs/taxonomy/knowledge/contribution_details.md +++ b/docs/taxonomy/knowledge/contribution_details.md @@ -4,35 +4,49 @@ description: The overview of 🐶 InstructLab's Knowledge contribution guideline logo: images/ilab_dog.png --- -You can create a Git repository to host your knowledge contributions anywhere (GitLab, Gerrit, etc.) but it might be favorable to create one on GitHub. The following instructions show you how to create a knowledge repository in GitHub and contribute to the taxonomy. +You can create a Git repository to host your knowledge contributions anywhere (GitLab, Gerrit, etc.), but it might be favorable to create one on GitHub. At the current time, we require a GitHub username to contribute, and all work is done in GitHub. + +The following instructions show you how to create a knowledge repository in GitHub and contribute to the taxonomy. 
## Prerequisites +If you are submitting to the repository directly: - You have a GitHub account - You have a forked copy of the [taxonomy](https://github.com/instructlab/taxonomy/tree/main) repository - You have verified that the model does not already know the knowledge you want to submit -## Creating your own knowledge repository +If you are using the [UI](https://ui.instructlab.ai) to submit: +- You have a GitHub account +- You have verified that the model does not already know the knowledge you want to submit + +## Preparing your knowledge documents + +You need to set up your source documents as Markdown files in a git repository. + +!!! warning + **We are currently only accepting sources from [this list](https://github.com/instructlab/community/blob/main/docs/DataSources.md) at this time due to legal requirements to keep InstructLab open source.** Our taxonomy triage team will reject any contributions that do not match this pattern. Thanks for helping us keep InstructLab 100% open source! + +### Creating your own knowledge repository To create a new GitHub repository, follow the GitHub documentation in [Creating a new repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-new-repository). The specific steps are listed as follows: 1. In your GitHub profile page, navigate to the repositories tab. You will see a search bar where you can search your repositories or create a new one. -2. This takes you to a page titled “Create a new repository”. Create a custom name for your repository and add a `README.md` file. For example, “knowlege_contributions” could be a good name for your repository. +2. This takes you to a page titled “Create a new repository”. Create a custom name for your repository and add a `README.md` file. For example, “knowledge_contributions” could be a good name for your repository. 3. Click “Create” when you are all set. 
-## Convert your knowledge documentation to markdown +### Convert your knowledge documentation to Markdown -There are many online tools that can help you convert your documents to markdown. If you are using a wiki page for your contributions, you can use [pandocs](https://pandoc.org/try/) to convert the documents. For wikipedia sources on pandoc, use `from: mediawiki` and convert `to: markdown_strict` to access the proper markdown format. +There are many online tools that can help you convert your documents to Markdown. If you are using a wiki page for your contributions, you can use [pandoc](https://pandoc.org/try/) to convert the documents. For Wikipedia sources on pandoc, use `from: mediawiki` and convert `to: markdown_strict` to access the proper Markdown format. -## Add the markdown file to your repository +### Add the Markdown file to your repository To add a file to your GitHub repository, follow the GitHub documentation in [Adding a file to a repository](https://docs.github.com/en/repositories/working-with-files/managing-files/adding-a-file-to-a-repository). The specific steps are listed as follows: -1. Navigate to “Add files”. Click “Create new file” if you want to manually add your markdown content. Click “Upload files” if you have a file locally to add. +1. Navigate to “Add files”. Click “Create new file” if you want to manually add your Markdown content. Click “Upload files” if you have a file locally to add. 2. Add a description and commit your changes. Since this is your own repository, you can commit directly to the `main` branch. @@ -42,6 +56,37 @@ The specific steps are listed as follows: !!! important Make a note of your commit SHA; you'll need it for your `qna.yaml`. +## Creating your knowledge submission in GitHub + +For knowledge submissions, we need a `qna.yaml` file and an `attribution.txt` file. 
+ +### The `qna.yaml` file + +For the current version of the taxonomy, version 3, here are the available fields: + +Key | Required | Type | Constraints | Value | Notes +--|--|--|--|--|-- +`version` | Y | integer | - | `3` | The taxonomy schema version used in the `qna.yaml` file. Defined in [instructlab/schema](https://github.com/instructlab/schema) +`created_by` | Y | string | - | Your GitHub username | - +`domain` | Y | string | - | Knowledge sub-category | The knowledge domain which is used in prompts to the teacher model during synthetic data generation. The domain should be brief such as the title to a textbook chapter or section. +`seed_examples` | Y | array | at least 5 sets | null | This is a collection of questions and answers with context from the knowledge document that InstructLab uses to generate data synthetically. +`context` | Y | string | < 500 words | A chunk of information from the original knowledge document | This should be a copy-paste from the Markdown version of your document +`questions_and_answers` | Y | array | at least 3 pairs per context | null | This is a collection of questions and answers. +`question` | Y | string | > 250 words | A question related to the context | Questions are things you'd expect someone to ask the model based on the context given. This will be used for synthetic data generation. +`answer` | Y | string | > 250 words | An answer for the question | Answers are what you'd like the model to give as an answer. It will not be an exact answer the model always gives. +`document_outline` | Y | string | - | A brief summary of the document | - +`document` | Y | object | - | null | The collection of data for the knowledge document. 
+`repo` | Y | string | a git URL | The URL (with a `.git` suffix) that identifies your git repo where you've stored your knowledge documents | - +`commit` | Y | string | full commit hash | A SHA1 full commit hash that corresponds to the document in the repo | This hash must be exactly where the system can find the document. +`patterns` | Y | array | `*.md`, `*.pdf` | A list of glob patterns specifying the files in the repo. | Any glob pattern that starts with `*` must be quoted due to YAML rules. Currently, the system accepts `.md` and `.pdf` files. + +!!! important + There must be at least 5 sets of questions and answers with context in every `qna.yaml` file. + +#### An example file + +To build a strong taxonomy, + ## Create a pull request in the taxonomy repository Navigate to your forked taxonomy repository and ensure it is up-to-date. @@ -61,8 +106,7 @@ Here are a few things to check before seeking reviews for your contribution: ## PR Upstream Workflow -The following table outlines the expected timing for the PRs you have submitted. The PRs go through a few steps, and checks, but you should be able to map your `label` to -the place that it is in. +The following table outlines the expected timing for the PRs you have submitted. The PRs go through a few steps, and checks, but you should be able to map your `label` to the place that it is in. 
| Label | Actor | Action | Duration | | --- | --- | --- | --- | From 7f123ce9e946afe23f5a3fe86d56b622311f1aff Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Tue, 7 Jan 2025 14:58:05 -0600 Subject: [PATCH 4/9] style(footnotes): fix footnote Signed-off-by: Laura Santamaria --- docs/taxonomy/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/taxonomy/index.md b/docs/taxonomy/index.md index 13b3c95..5b628e4 100644 --- a/docs/taxonomy/index.md +++ b/docs/taxonomy/index.md @@ -5,13 +5,13 @@ logo: images/ilab_dog.png --- ## Welcome to the InstructLab Taxonomy -InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The "**lab**" in Instruct**Lab** 🐶 stands for [**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081) [^1]. +InstructLab 🐶 uses a novel synthetic data-based alignment tuning method for Large Language Models (LLMs.) The "**lab**" in Instruct**Lab** 🐶 stands for [**L**arge-Scale **A**lignment for Chat**B**ots](https://arxiv.org/abs/2403.01081)[^1]. The LAB method is driven by taxonomies, which are largely created manually and with care. The [instructlab/taxonomy](https://github.com/instructlab/taxonomy) repository contains a taxonomy tree that allows you to create models tuned with your data (enhanced via synthetic data generation) using the LAB 🐶 method. -[^1]: Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", arXiv preprint arXiv: 2403.01081, 2024. (* denotes equal contributions) +[^1]: Shivchander Sudalairaj*, Abhishek Bhandwaldar*, Aldo Pareja*, Kai Xu, David D. Cox, Akash Srivastava*. "LAB: Large-Scale Alignment for ChatBots", [arXiv preprint arXiv: 2403.01081, 2024](https://arxiv.org/abs/2403.01081). 
(* denotes equal contributions) ## Choosing domains for the taxonomy From 8500c64a5dc5e5f553cc2bdc923fef346759a25a Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 8 Jan 2025 16:48:31 -0600 Subject: [PATCH 5/9] docs(sources): add list of accepted sources from ADR/dev-docs Signed-off-by: Laura Santamaria --- docs/taxonomy/knowledge/guide.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/docs/taxonomy/knowledge/guide.md b/docs/taxonomy/knowledge/guide.md index 4096d7b..4be86d3 100644 --- a/docs/taxonomy/knowledge/guide.md +++ b/docs/taxonomy/knowledge/guide.md @@ -15,13 +15,37 @@ Knowledge contributions in this project contain a few things: You can learn more about the knowledge structure in [Getting Started with Knowledge contributions](https://github.com/instructlab/taxonomy/blob/main/README.md#getting-started-with-knowledge-contributions). -## Accepted Knowledge +## Accepted Sources of Knowledge !!! important We are currently only accepting knowledge contributions as a limited private beta and sources will be limited to articles from Wikipedia. These are the main knowledge domains that we are currently accepting knowledge contributions for: arts, engineering, geography, history, linguistics, mathematics, philosophy, religion, science, and technology. +Due to the open source nature of InstructLab, all content has to meet specific licensing requirements. The following list shows the currently approved sources for knowledge. If you wish to use a different source, we need to approve it, and that means your submission will be on hold until we get legal review and approval. Please be patient! 
+ +Domain Name | Status | Notes +--|--|-- +[Wikipedia](https://en.wikipedia.org/wiki/Main_Page) | approved | - +[Project Gutenberg](https://www.gutenberg.org) | approved | Pre-1927 works; public domain under US copyright law +[Wikisource](https://en.wikisource.org) (library) | approved | "free library that anyone can improve" +[OpenStax textbooks family of publications](https://openstax.org/subjects) | approved | - +[The Open Organization publications](https://theopenorganization.org) | approved | - +[The Scrum Guide](https://scrumguides.org/index.html) | approved | - +[US Congress site](https://www.congress.gov) | reviewed - manually verify | US government sources may have different licensing; a legal review will need to verify each source +[US White House site](https://www.whitehouse.gov) | reviewed - manually verify | US government sources may have different licensing; a legal review will need to verify each source +[US Senate site](https://www.senate.gov) | reviewed - manually verify | US government sources may have different licensing; a legal review will need to verify each source +[US IRS site](https://www.irs.gov) | reviewed - manually verify | US government sources may have different licensing; a legal review will need to verify each source +[NASA](https://www.nasa.gov) | reviewed - manually verify | [See guidelines](https://www.nasa.gov/nasa-brand-center/images-and-media/) +[Smithsonian Libraries](https://library.si.edu/) | reviewed - manually verify | For any material marked \"No Copyright - United States" or "CC0" as [described here](https://library.si.edu/copyright) +[European Union (EU) site](https://european-union.europa.eu/) | reviewed - manually verify | Specifically documents submitted under "public registrars" as [described here](https://european-union.europa.eu/principles-countries-history/principles-and-values/access-information_en) +[Internet Archive](https://archive.org/) | reviewed - manually verify | Pre-1927 works; public domain under US 
copyright law +[PLOS family of open access journals](https://plos.org/publish) | reviewed - manually verify | - +[Open Practice Library](https://openpracticelibrary.com/) | reviewed - manually verify | - +[Cynefin.io wiki](https://cynefin.io/wiki/Main_Page) | reviewed - manually verify | - +[The Open Education Project](https://research.redhat.com/blog/research_project/foundations-in-open-source-education/) | reviewed - manually verify | - + + ## Avoid These Topics While the tuning process may eventually benefit from being used to help the models work with complex social topics, at this time this is an area of active research we do not want to take lightly. Therefore, please keep your submissions clear of the following topics: From 6a6e79bf5d6d3d6b7e1b67eb5b64d5281ae58ac8 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 8 Jan 2025 16:51:02 -0600 Subject: [PATCH 6/9] docs(contributions): add note pointing to UI for non-GH people Signed-off-by: Laura Santamaria --- docs/taxonomy/knowledge/contribution_details.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/taxonomy/knowledge/contribution_details.md b/docs/taxonomy/knowledge/contribution_details.md index 51138ac..834469f 100644 --- a/docs/taxonomy/knowledge/contribution_details.md +++ b/docs/taxonomy/knowledge/contribution_details.md @@ -4,6 +4,9 @@ description: The overview of 🐶 InstructLab's Knowledge contribution guideline logo: images/ilab_dog.png --- +!!! info + The following information is if you are comfortable contributing using GitHub, a version control system primarily used for code. If you are **not** comfortable with this platform, you can use the InstructLab UI to submit knowledge. To learn more, [head to the UI overview page](../../user-interface/ui_overview.md). + You can create a Git repository to host your knowledge contributions anywhere (GitLab, Gerrit, etc.), but it might be favorable to create one on GitHub. 
At the current time, we require a GitHub username to contribute, and all work is done in GitHub. The following instructions show you how to create a knowledge repository in GitHub and contribute to the taxonomy. From bc7ee727448fee85de307bf7dc3210f2ff141fd4 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 22 Jan 2025 13:58:16 -0700 Subject: [PATCH 7/9] feat(skills): update skills guide with current taxonomy info Signed-off-by: Laura Santamaria --- docs/taxonomy/skills/index.md | 60 ++++++++++++++++++++++------------- 1 file changed, 38 insertions(+), 22 deletions(-) diff --git a/docs/taxonomy/skills/index.md b/docs/taxonomy/skills/index.md index 33a717f..cfcf5b3 100644 --- a/docs/taxonomy/skills/index.md +++ b/docs/taxonomy/skills/index.md @@ -7,11 +7,10 @@ logo: images/ilab_dog.png Skills require a much smaller volume of content than knowledge contributions. An entire skill contribution to the taxonomy tree can be just a few lines of YAML in the `qna.yaml` file ("qna" is short for "questions and answers") and an `attribution.txt` file for citing sources. -Your skills contribution pull requests must include the following: +Your skills contribution pull requests include the following: -- A `qna.yaml` that contains a set of key/value entries with the following keys - - Each `qna.yaml` file requires a minimum of five question and answer pairs. -- An `attribution.txt` that includes the sources for the information used in the `qna.yaml` +- A `qna.yaml` that contains a set of key/value entries +- An `attribution.txt` that includes the sources for the information used in the `qna.yaml`. Even if you are authoring the skill with no additional sources, you must have this file for legal purposes. !!! tip The skill taxonomy structure is used in several ways: @@ -27,21 +26,26 @@ Your skills contribution pull requests must include the following: Compositional skills can either be grounded (includes a context) or ungrounded (does not include a context). 
Grounded or ungrounded is declared in the taxonomy tree, for example: `linguistics/writing/poetry/haiku/` (ungrounded) or `grounded/linguistics/grammar` (grounded). The `qna.yaml` is in the final node. +### The structure of the `qna.yaml` file + Taxonomy skill files must be a valid [YAML](https://yaml.org/) file named `qna.yaml`. Each `qna.yaml` file contains a set of key/value entries with the following keys: -- `version`: The value must be the number 2. **Required** -- `task_description`: A description of the skill. **Required** -- `created_by`: The GitHub username of the contributor. **Required** -- `seed_examples`: A collection of key/value entries. New - submissions should have at least five entries, although - older files may have fewer. **Required** - - `context`: Grounded skills require the user to provide context containing information that the model is expected to take into account during processing. This is different from knowledge, where the model is expected to gain facts and background knowledge from the tuning process. The context key should not be used for ungrounded skills. - - `question`: A question for the model. **Required** - - `answer`: The desired response from the model. **Required** +Field | Required? | Content +--|--|-- +`version` | yes | The value must be the number 3. +`task_description` | yes | A description of the skill. +`created_by` | yes | The GitHub username of the contributor. +`seed_examples` | yes | A collection of key/value entries. New submissions should have at least five entries, although older files may have fewer.

Note that collections are nested lists, like subentries in a bulleted list. +`context` | only for grounded skills | Part of the `seed_examples` collection.

Grounded skills require the user to provide context containing information that the model is expected to take into account during processing. This is different from knowledge, where the model is expected to gain facts and background knowledge from the tuning process.

**Note:** The context key should not be used for ungrounded skills. +`question` | yes | Part of the `seed_examples` collection.

A question for the model. +`answer` | yes | Part of the `seed_examples` collection.

The desired response from the model. Other keys at any level are currently ignored. -### Skills: YAML examples +!!! important + Each `qna.yaml` file requires a minimum of five question and answer pairs. + +### Submissions To make the `qna.yaml` files easier and faster for humans to read, it is recommended to specify `version` first, followed by `task_description`, then `created_by`, and finally `seed_examples`. In `seed_examples`, it is recommended to specify `context` first (if applicable), followed by `question` and `answer`. @@ -64,9 +68,9 @@ seed_examples: ... ``` -Then, you create an `attribution.txt` file that includes the sources of your information. These can also be self authored sources. +Then, you create an `attribution.txt` file that includes the sources of your information, if any. These sources can also be self-authored sources for skills. -*Example `attribution.txt`* +*Fields in `attribution.txt`* ```text [Link to source] @@ -75,14 +79,27 @@ Then, you create an `attribution.txt` file that includes the sources of your inf [Creator name] ``` +*Example of a self-authored source `attribution.txt`* + +```text +Title of work: Customizing an order for tea +Link to work: - +License of the work: CC BY-SA-4.0 +Creator names: Jean-Luc Picard +``` + +You may copy this example and replace the title of the work (your skill) and the creator name to submit a skill. The license is [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/), which is shortened to `CC BY-SA-4.0`. + For more information on what to include in your `attribution.txt` file, see [For your attribution.txt file](https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md#for-your-attributiontxt-file) in CONTRIBUTING.md. -If you have not written YAML before, don't be intimidated - it's just text. +### Writing YAML + +If you have not written YAML before, YAML is a text file where indentation matters. !!! 
tip - - Spaces and indentation matter in YAML. Two spaces to indent. - - Don't use tabs! + - Spaces and indentation matter in YAML. Use two spaces to indent. + - Don't use tabs! - Be careful to not have trailing spaces at the end of a line. - Each example in `seed_examples` begins with a "-". Place this "-" in front of the first field (`question` or `context`). The remaining keys in the @@ -98,6 +115,8 @@ If you have not written YAML before, don't be intimidated - it's just text. It is recommended that you **lint**, or verify, your YAML using a tool. One linter option is [yamllint.com](https://yamllint.com). You can copy/paste your YAML into the box and click **Go** to have it analyze your YAML and make recommendations. Online tools like [prettified](https://onlineyamltools.com/prettify-yaml) and [yaml-validator](https://jsonformatter.org/yaml-validator) can automatically reformat your YAML to adhere to our `yamllint` PR checks, such as breaking lines longer than 120 characters. +### Examples + #### Ungrounded compositional skill: YAML example ```yaml @@ -117,8 +136,6 @@ seed_examples: answer: wake, lake, steak, make, and quake. ``` -Seriously, that's it. - Here is the location of this YAML in the taxonomy tree. 
Note that the YAML file itself, plus any added directories that contain the file, is the entirety of the skill in terms of a taxonomy contribution: @@ -149,7 +166,6 @@ Remember that [grounded compositional skills](skills_guide.md#grounded-compositi This example snippet assumes the GitHub username `mairin` and shows some of the question/answer pairs present in the actual file: - ```yaml version: 2 task_description: | From 1568f113a51167e520ce059f0d9bb00385c86706 Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 22 Jan 2025 14:02:49 -0700 Subject: [PATCH 8/9] fix(layout): fixing layout/style Signed-off-by: Laura Santamaria --- docs/taxonomy/skills/index.md | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/docs/taxonomy/skills/index.md b/docs/taxonomy/skills/index.md index cfcf5b3..1d464a1 100644 --- a/docs/taxonomy/skills/index.md +++ b/docs/taxonomy/skills/index.md @@ -100,18 +100,15 @@ If you have not written YAML before, YAML is a text file where indentation matte - Spaces and indentation matter in YAML. Use two spaces to indent. - Don't use tabs! - - Be careful to not have trailing spaces at the end of a line. - - Each example in `seed_examples` begins with a "-". Place this "-" in + - Do not have trailing spaces at the end of a line. + - Each example in `seed_examples` begins with a dash (`-`). Place this dash in front of the first field (`question` or `context`). The remaining keys in the - example should not have this "-". - - Some special characters such as " and ' need to be escaped with backslash. This is why some - of the lines for keys in the example YAML start the value with the '|' character followed a new line and then an indented multi-line string. - This character disables all of the special characters in the value for the key. - You might also want to use the '|' character for multi-line strings. 
- - Consider quoting all values with " to avoid surprising YAML parser behavior + example should not have this dash. + - Some special characters such as a double quotation mark (`"`) and an apostrophe or single quotation mark (`'`) need to be escaped with a backslash. This is why some of the lines for keys in the example YAML start the value with the pipe character (`|`) followed by a new line and then an indented multi-line string. This character disables all of the special characters in the value for the key.

You might also want to use the pipe character (`|`) for multi-line strings. + - Consider quoting all values with double quotation marks (`"`) to avoid surprising YAML parser behavior (e.g. Yes answer can be interpreted by the parser as a boolean of `True` value, unless "Yes" is quoted.) - - See https://yaml-multiline.info/ for more info. + - See [yaml-multiline.info](https://yaml-multiline.info/) for more info. It is recommended that you **lint**, or verify, your YAML using a tool. One linter option is [yamllint.com](https://yamllint.com). You can copy/paste your YAML into the box and click **Go** to have it analyze your YAML and make recommendations. Online tools like [prettified](https://onlineyamltools.com/prettify-yaml) and [yaml-validator](https://jsonformatter.org/yaml-validator) can automatically reformat your YAML to adhere to our `yamllint` PR checks, such as breaking lines longer than 120 characters. From 64bbaf276e45c6bc4ca523bd81bd6acdc3d2fd8b Mon Sep 17 00:00:00 2001 From: Laura Santamaria Date: Wed, 22 Jan 2025 14:10:58 -0700 Subject: [PATCH 9/9] docs(knowledge): set up knowledge docs with table/one source of truth to match skills Signed-off-by: Laura Santamaria --- docs/taxonomy/knowledge/index.md | 26 +++++++------------------- 1 file changed, 7 insertions(+), 19 deletions(-) diff --git a/docs/taxonomy/knowledge/index.md b/docs/taxonomy/knowledge/index.md index 1c617a7..b516724 100644 --- a/docs/taxonomy/knowledge/index.md +++ b/docs/taxonomy/knowledge/index.md @@ -22,25 +22,13 @@ Knowledge in the taxonomy tree consists of a few more elements than skills: - All submissions must be text; images will be ignored - Do not use tables in your markdown freeform contribution -The `qna.yaml` format must include the following fields: +### Structure of the `qna.yaml` file -- `version`: The version of the `qna.yaml` file; this is the format of the file that is used for SDG. The value must be the number 3. -- `created_by`: Your GitHub username. 
-- `domain`: Specify the category of the knowledge. -- `seed_examples`: A collection of key/value entries. - - `context`: A chunk of information from the knowledge document. Each `qna.yaml` needs five `context` blocks. The context has a maximum token count of 500 tokens. Also, each `context` blocks should have at least 3 question and answer pairs, with a maximum token count of 250 for all 3 question and answer pairs. - - `questions_and_answers`: The parameter that holds your questions and answers. - - `question`: Specify a question for the model. Each `qna.yaml` file needs at least three question and answer pairs per `context` chunk. - - `answer`: Specify the desired answer from the model. Each `qna.yaml` file needs at least three question and answer pairs per `context` chunk. -- `document_outline`: Describe an overview of the document your submitting. -- `document`: The source of your knowledge contribution. - - `repo`: The URL for your repository that holds your knowledge markdown files. - - `commit`: The SHA of the commit in your repository with your knowledge markdown files. - - `patterns`: A list of glob patterns that specify the markdown files in your repository. Any glob pattern that starts with `*`, such as `*.md`, must be quoted due to YAML rules. For example, `"*.md"`. +Reference the structure provided in the [contribution details guide](contribution_details.md#the-qnayaml-file) to understand the required keys and values. 
-### Knowledge: YAML examples +### Example of a knowledge submission -*Example of a `qna.yaml` file* +#### Example of a `qna.yaml` file ```yaml version: 3 @@ -204,7 +192,7 @@ document: - phoenix_constellation.md ``` -*Example of an `attribution.txt` file* +#### Example of an `attribution.txt` file ```text Title of work: Phoenix (constellation) @@ -216,7 +204,7 @@ Creator names: Wikipedia Authors For more information on what to include in your `attribution.txt` file, see [For your attribution.txt file](https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md#for-your-attributiontxt-file) in the CONTRIBUTING.md file. -### Knowledge: Markdown file example +### Example of a Markdown file The previous knowledge example references one markdown file: `phoenix_constellation.md`. You can also add multiple markdown files for knowledge contributions. @@ -253,7 +241,7 @@ Phoenicids. You can organize the knowledge markdown files in your repository however you want. You just need to ensure the YAML is pointing to the correct file. -### Knowledge: directory tree example +### Example of a directory tree In the taxonomy repository, here's what the previously referenced knowledge might look like in the tree: