Add FAQ.md to provide sample estimation guidelines #408

Open
wants to merge 1 commit into
base: main
from

relyt0925
Contributor

@relyt0925 relyt0925 commented Nov 25, 2024

The new FAQ.md file includes detailed explanations and examples on how to estimate the number of synthetic samples produced at various stages of the SDG training process. This addition aims to enhance user understanding of the sample generation methodology.

I believe this is an MVP toward resolving #307.

@mergify mergify bot added documentation Improvements or additions to documentation ci-failure labels Nov 25, 2024
@relyt0925 relyt0925 force-pushed the faq-size branch 5 times, most recently from b31ff34 to 1f568eb Compare November 25, 2024 05:06
@mergify mergify bot removed the ci-failure label Nov 25, 2024
@aakankshaduggal aakankshaduggal requested a review from a team December 6, 2024 18:21
For each knowledge leaf node: the formula to estimate the number of produced synthetic samples in the training dataset is:

```text
(total cumulative size of knowledge documents / max document chunk size) * number of qna pairs in the knowledge file leaf node * 30 synthetic samples per qna pair
```
Member

Does this factor in the amount of samples that will be filtered out?

Contributor Author

@RobotSail this is an excellent point. It does not, since that part is non-deterministic to a certain extent. But folks have been looking for guidance on ballpark numbers and for answers to general questions about how the taxonomy is processed through SDG.

I should call that out as a disclaimer and will do that
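
For reference, the estimation formula from the diff hunk under discussion can be written as a small Python sketch. The function and parameter names here are hypothetical and are not part of SDG; the formula does not state a unit for document and chunk size, so the sketch only assumes both use the same unit. As noted in this thread, the result does not account for samples that are filtered out, so it is a rough ballpark rather than a guaranteed count.

```python
def estimate_knowledge_samples(
    total_document_size: float,
    max_chunk_size: float,
    qna_pairs: int,
    samples_per_qna_pair: int = 30,
) -> float:
    """Ballpark number of synthetic samples for one knowledge leaf node.

    Mirrors the FAQ formula:
        (total document size / max chunk size) * qna pairs * 30
    Both sizes must be in the same unit (an assumption; the formula
    does not name one). Filtering is NOT accounted for.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    # Approximate number of chunks the documents are split into.
    chunks = total_document_size / max_chunk_size
    return chunks * qna_pairs * samples_per_qna_pair


# Example: 10,000 units of documents, 1,000-unit chunks, 15 qna pairs.
print(estimate_knowledge_samples(10_000, 1_000, 15))  # 4500.0
```

The default of 30 comes from the "30 synthetic samples per qna pair" term in the formula; it is exposed as a parameter only so the arithmetic is easy to adjust.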

The new FAQ.md file includes detailed explanations and examples on how to estimate the number of synthetic samples produced at various stages of the SDG training process. This addition aims to enhance user understanding of the sample generation methodology.

Signed-off-by: Tyler Lisowski <[email protected]>
@relyt0925
Contributor Author

This is a good initial set of FAQs that we have seen pop up, along with what I feel (based on my study of the codebase) are the appropriate answers. More than happy to adjust anything that is inaccurate based on expert opinion!

Contributor

@bbrowning bbrowning left a comment

Thanks for expanding our docs! I have a couple of comments, but also a more general question. Do you think it would be better to document most of these things in the instructlab/instructlab repository instead of directly in SDG? The reason I ask is that the questions touch on taxonomy, SDG, training, and the intersection of all these. And the target users here would likely be using the ilab CLI to run these workflows as opposed to SDG directly?


There is no known limit to the number of seed example entries for a knowledge leaf node. There must be a minimum of 5 seed examples. These parameters can be seen in the taxonomy schema repo: <https://github.com/instructlab/schema/blob/main/src/instructlab/schema/v3/compositional_skills.json#L31>

## How long can a given seed_example be for a knowledge leaf node?
Contributor

I believe there's a typo here and this should be "skills leaf node" vs "knowledge leaf node".


## How many qna pairs can be listed for a given seed_example in a knowledge leaf node

**Exactly** 3 QNA pairs must be listed for a given seed_example. If more is specified they will be ignored and not processed by the appropriate prompt. This can be seen by looking at the prompt files in SDG: <https://github.com/instructlab/sdg/blob/v0.6.2/src/instructlab/sdg/configs/knowledge/simple_generate_qa.yaml#L21-L28>
Contributor

I'm a bit uncomfortable linking to v0.6.2 here directly, as this doc lives in the main branch. It feels like we should use a relative link (can we do that in GitHub markdown?) to the file in the same branch as this doc.
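
As an illustration of the seed-example rules stated in the FAQ text quoted above (a minimum of 5 seed examples, and exactly 3 QNA pairs per seed example), here is a minimal Python sketch of a pre-flight check. It is not SDG or schema code: the function name is hypothetical, and it assumes the leaf node has already been loaded into a list of dicts, each with a `questions_and_answers` list.

```python
MIN_SEED_EXAMPLES = 5            # minimum stated in the FAQ text
QNA_PAIRS_PER_SEED_EXAMPLE = 3   # "exactly 3" per the FAQ text


def check_seed_examples(seed_examples: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means no problems.

    Assumes each seed example is a dict whose "questions_and_answers"
    key holds a list of QNA pairs (an assumption for this sketch).
    """
    problems = []
    if len(seed_examples) < MIN_SEED_EXAMPLES:
        problems.append(
            f"expected at least {MIN_SEED_EXAMPLES} seed examples, "
            f"found {len(seed_examples)}"
        )
    for index, example in enumerate(seed_examples):
        pairs = example.get("questions_and_answers", [])
        if len(pairs) != QNA_PAIRS_PER_SEED_EXAMPLE:
            problems.append(
                f"seed example {index}: expected exactly "
                f"{QNA_PAIRS_PER_SEED_EXAMPLE} QNA pairs, found {len(pairs)}"
            )
    return problems


# Example: five well-formed seed examples produce no problems.
qna = {"question": "placeholder question", "answer": "placeholder answer"}
good = [{"questions_and_answers": [qna] * 3} for _ in range(5)]
print(check_seed_examples(good))  # []
```

A check like this only mirrors the limits as the FAQ describes them; the schema repository remains the source of truth for what is actually enforced.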

@relyt0925
Contributor Author

@bbrowning sorry for the delay! Thank you so much for the review! I agree with you that there may be a better home for this page. I will go ahead and get the comments addressed so we can make sure we are all comfortable with the content, and then we can think about where we want its home to be!
