Releases: instructlab/sdg
v0.7.0
SDG v0.7.0
Features
Custom Blocks and Teacher Models via BlockRegistry and PromptRegistry
Advanced users are now able to supply custom Pipeline Block
implementations by registering new blocks with the BlockRegistry
. It's also possible to register new chat templates for custom teacher models using the new PromptRegistry
.
See the tests/testdata/custom_block.py
and tests/testdata/custom_block_pipeline.yaml
files in this repository for an example of how to create custom blocks and use them from your own pipeline config yamls.
See the tests/testdata/custom_prompt.py
file in this repository for an example how to register custom chat templates used when formatting prompts.
New Blocks - IterBlock and LLMMessagesBlock
We have two new Block types available for pipelines in this release - IterBlock
and LLMMessagesBlock
. IterBlock
allows you to execute another Block
multiple times, based on a configured number of iterations. LLMMessagesBlock
is like LLMBlock
but uses the newer chat/completions API of OpenAI-compatible servers instead of the legacy completions API.
Consolidated PDF and Markdown ingestion and chunking implementations
Instead of sending PDF input documents through Docling and using something custom for Markdown, we now send both types of documents through Docling and have consolidated the chunking implementation across both document types. This may result in different chunks being generated for markdown content compared to previous releases.
Added a new instructlab.sdg.mix_datasets
Python API
We've added a new Python API for advanced users that need to re-mix our generated outputs, for example to weight one taxonomy leaf node over others in the output or to have more than our default of 30 skill samples per leaf node in the final mixed output. See the example at docs/examples/mix_datasets/
for some example Python code and Recipe yaml files to accomplish this.
Breaking Changes
Pipeline configs and Prompt templates switched to Jinja
All of our Pipeline config yamls and prompt template files have moved to Jinja templates instead of Python string format()
calls. This brings more expressiveness into our templating language - especially for prompt templates - but does mean any variable substitutions need to be updated from single brackets to double brackets - ie {document}
becomes {{document}}
. This only impacts you if you were using custom pipeline config yaml files or custom prompt templates in your config blocks.
ImportBlock removed from Pipeline blocks
Any users that were specifying custom pipeline configs (instead of using the default full
or simple
shipped by us) and also using the ImportBlock
will now need to rewrite their pipelines to no longer use that block. We do not anticipate that anyone was actually using this block, but please reach out if you were so we can capture your needs in a future release.
Fixes
- The PyTorch dependency is removed, because SDG doesn't directly use PyTorch. The test suite still depends on
instructlab
core, which depends on PyTorch. - The
batch_size
parameter is now respected every time we call an inference server from anLLMBlock
. Previously, we were only batching the initial input but not accounting for some Blocks that may emit more output samples than input samples, meaning we would exceed our configuredbatch_size
when actually making batching inference calls to vLLM, causing more memory to be consumed than expected as well as leading to scenarios where we were overloading inference servers in unexpected ways due to sending in batches with hundreds of completion requests instead of the configured size, which defaults to8
on most hardware profiles.
All Changes
- fix: missing regex from actionlint action by @nathan-weinberg in #390
- Don't fail fast for unit and functional tests by @danmcp in #397
- Adjust to slack-github-action 2.0 api changes by @danmcp in #395
- build(deps): bump slackapi/slack-github-action from 1.27.0 to 2.0.0 by @dependabot in #385
- refactor: remove unused generate_data arguments by @makelinux in #396
- Add [End] to parser cleanup tags by @abhi1092 in #400
- build(deps): bump step-security/harden-runner from 2.10.1 to 2.10.2 by @dependabot in #401
- build(deps-dev): update pre-commit requirement from <4.0,>=3.0.4 to >=3.0.4,<5.0 by @dependabot in #387
- [Docs] Updates for SDG README by @kelbrown20 in #281
- refactor: Introduce jldump by @makelinux in #402
- Ensure knowledge docs are cloned into unique dirs by @bbrowning in #416
- build(deps): bump actions/cache from 4.1.2 to 4.2.0 by @dependabot in #431
- Move AWS_REGION from using secret to var by @danmcp in #422
- Add disk check after tests run by @danmcp in #419
- Add a CHANGELOG.md and fill it in for latest 2 releases by @bbrowning in #418
- build(deps): bump pypa/gh-action-pypi-publish from 1.12.2 to 1.12.3 by @dependabot in #433
- fix: Restrict
docling
library versions to resolve dependency issues + updatemypy
linting packages by @courtneypacheco in #434 - Update CHANGELOG.md for release v0.6.2 by @bbrowning in #440
- Reconcile core data generation features with latest research advances by @bbrowning in #409
- refactor: generated_data as list by @makelinux in #398
- Update README.md with newer content from research team by @bbrowning in #444
- build(deps): bump hynek/build-and-inspect-python-package from 2.10.0 to 2.11.0 by @dependabot in #453
- feat: add discord e2e status reporting by @RobotSail in #455
- feat: update release-strategy to include discord by @RobotSail in #454
- build(deps): bump DavidAnson/markdownlint-cli2-action from 18.0.0 to 19.0.0 by @dependabot in #459
- build(deps): bump rhysd/actionlint from 1.7.4 to 1.7.6 in /.github/workflows by @dependabot in #460
- build(deps): bump rojopolis/spellcheck-github-actions from 0.45.0 to 0.46.0 by @dependabot in #464
- chore!: Update PyTorch to 2.5 by @fabiendupont in #465
- fix: typo in mergify configuration by @nathan-weinberg in #468
- Update CHANGELOG.md for v0.6.3 by @bbrowning in #473
- Add a CONTRIBUTING.md with basic dev setup instructions by @bbrowning in #470
- chore: Change default temporary write directory in all e2e CI jobs from
tmpfs
to/home/tmp
by @courtneypacheco in #475 - build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.3 by @dependabot in #472
- fix: Remove unused PyTorch dependency by @fabiendupont in #479
- Refactor Document Chunker to always use docling by @khaledsulayman in #430
- Document updating of CHANGELOG.md as part of release by @bbrowning in #435
- Split up
generate_data
and add amix_datasets
top level API by @bbrowning in #443 - build(deps): bump DavidAnson/markdownlint-cli2-action from 19.0.0 to 19.1.0 by @dependabot in #487
- build(deps): bump sarisia/actions-status-discord from 1.15.1 to 1.15.2 by @dependabot in #488
- build(deps): bump step-security/harden-runner from 2.10.3 to 2.10.4 by @dependabot in #489
- build(deps): bump rhysd/actionlint from 1.7.6 to 1.7.7 in /.github/workflows by @dependabot in #486
- Implement LLMMessagesBlock by @bbrowning in #461
- Adding Batching After Every Block by @eshwarprasadS in #484
- use render method for jinja template, add unit tests by @eshwarprasadS in #493
- Update release notes for v0.7.0 by @bbrowning in #495
New Contributors
- @kelbrown20 made their first contribution in #281
- @courtneypacheco made their first contribution in #434
- @fabiendupont made their first contribution in #465
- @eshwarprasadS made their first contribution in #484
Full Changelog: v0.6.3...v0.7.0
v0.6.3
SDG v0.6.3
Fixes
- The max version constraint of PyTorch in our requirements file was raised so that we don't prevent SDG users from using it PyTorch 2.5.
All Changes
- chore!: Update PyTorch to 2.5 (backport #465) by @mergify in #469
- Update CHANGELOG.md for v0.6.3 (backport #473) by @mergify in #474
Full Changelog: v0.6.2...v0.6.3
v0.6.2
SDG v0.6.2
Fixes
- Fixed a bug in our version specification of
docling
anddocling_parse
dependencies that were causing new installs of InstructLab to pull in incompatible versions of these. We also fixed a similar bug in themypy
dependency, but that one only impacts developers of SDG as opposed to users of InstructLab.
All Changes
- Move AWS_REGION from using secret to var (backport #422) by @mergify in #438
- fix: Restrict
docling
library versions to resolve dependency issues + updatemypy
linting packages (backport #434) by @mergify in #437 - Update CHANGELOG.md for release v0.6.2 (backport #440) by @mergify in #441
Full Changelog: v0.6.1...v0.6.2
v0.6.1
v0.6.0
SDG v0.6.0
What's Changed
- fix: formatting error by @RobotSail in #378
- Prefer tesserocr over easyocr, if available by @bbrowning in #369
- ci: add large-size E2E CI job by @nathan-weinberg in #380
- Add Release Strategy Document by @khaledsulayman in #381
- Docling models path by @aakankshaduggal in #362
- Check for tokenizer in downloaded models directory by @khaledsulayman in #364
- fix: upsample the phase10 knowledge dataset by @RobotSail in #377
- build(deps): bump DavidAnson/markdownlint-cli2-action from 17.0.0 to 18.0.0 by @dependabot in #386
- Delete .gitattributes by @khaledsulayman in #393
New Contributors
- @RobotSail made their first contribution in #378
Full Changelog: v0.5.0...v0.6.0
v0.3.3
What's Changed
- Prepare release-v0.3 branch for backports by @bbrowning in #371
- Run the simple pipeline on small runners by @bbrowning in #372
- Data mix fix (backport #366) by @mergify in #368
Full Changelog: v0.3.2...v0.3.3
v0.5.0
v0.5.0
What's Changed
- build(deps): bump actions/cache from 4.1.0 to 4.1.1 by @dependabot in #300
- build(deps): bump rojopolis/spellcheck-github-actions from 0.42.0 to 0.43.0 by @dependabot in #299
- build(deps): bump actions/checkout from 4.2.0 to 4.2.1 by @dependabot in #298
- chore: rename 'basic-workflow-tests' to 'e2e-custom' by @nathan-weinberg in #306
- fix: change "group" to "tag" for mmlu_branch task config by @alimaredia in #305
- fix: remove stop token from mixtral by @cdoern in #310
- ci: update small E2E job to align with CLI and Training by @nathan-weinberg in #317
- ci: update medium job to run as PR check by @nathan-weinberg in #318
- build(deps): bump rojopolis/spellcheck-github-actions from 0.43.0 to 0.43.1 by @dependabot in #314
- fix: medium E2E CI job was missing HF_TOKEN by @nathan-weinberg in #319
- build(deps): bump actions/cache from 4.1.1 to 4.1.2 by @dependabot in #320
- ci: use org variable for AWS EC2 AMI in E2E CI jobs by @nathan-weinberg in #322
- ci: convert med E2E CI job to L4 GPU by @nathan-weinberg in #325
- build(deps): bump rojopolis/spellcheck-github-actions from 0.43.1 to 0.44.0 by @dependabot in #326
- build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #323
- build(deps): bump pypa/gh-action-pypi-publish from 1.10.3 to 1.11.0 by @dependabot in #327
- build(deps): bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #321
- build(deps): bump machulav/ec2-github-runner from 2.3.6 to 2.3.7 by @dependabot in #328
- build(deps): bump hynek/build-and-inspect-python-package from 2.9.0 to 2.10.0 by @dependabot in #329
- build(deps): bump rhysd/actionlint from 1.7.3 to 1.7.4 in /.github/workflows by @dependabot in #332
- build(deps): bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.0 by @dependabot in #337
- build(deps): bump rojopolis/spellcheck-github-actions from 0.44.0 to 0.45.0 by @dependabot in #338
- build(deps): bump pypa/gh-action-pypi-publish from 1.12.0 to 1.12.2 by @dependabot in #342
- Integrate Context-Aware Chunking and PDF Support by @khaledsulayman in #284
- feat: parametrize system prompt by @jaideepr97 in #339
- feat: support converting messages datasets into multiple pre-training formats by @jaideepr97 in #341
- Move to Docling v2 APIs by @bbrowning in #347
- feat: expose max_num_tokens as configurable by @cdoern in #340
- Remove unnecessary requirement for qna.yaml in ContextAwareChunker by @khaledsulayman in #351
- Upgrade docling, expand chunking testing by @bbrowning in #349
- Don't attempt batching with InstructLab's llama-cpp-python by @bbrowning in #358
- Consolidate test sample documents into one subdir by @bbrowning in #356
- Move a spurious print to a debug log message by @bbrowning in #359
- Only use CPU for the docling OCR models by @bbrowning in #361
- Data mix fix by @aakankshaduggal in #366
New Contributors
- @alimaredia made their first contribution in #305
Full Changelog: v0.4.2...v0.5.0
v0.5.0a2
What's Changed
- build(deps): bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #321
- build(deps): bump machulav/ec2-github-runner from 2.3.6 to 2.3.7 by @dependabot in #328
- build(deps): bump hynek/build-and-inspect-python-package from 2.9.0 to 2.10.0 by @dependabot in #329
- build(deps): bump rhysd/actionlint from 1.7.3 to 1.7.4 in /.github/workflows by @dependabot in #332
- build(deps): bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.0 by @dependabot in #337
- build(deps): bump rojopolis/spellcheck-github-actions from 0.44.0 to 0.45.0 by @dependabot in #338
- build(deps): bump pypa/gh-action-pypi-publish from 1.12.0 to 1.12.2 by @dependabot in #342
- Integrate Context-Aware Chunking and PDF Support by @khaledsulayman in #284
- feat: parametrize system prompt by @jaideepr97 in #339
- feat: support converting messages datasets into multiple pre-training formats by @jaideepr97 in #341
- Move to Docling v2 APIs by @bbrowning in #347
- feat: expose max_num_tokens as configurable by @cdoern in #340
- Remove unnecessary requirement for qna.yaml in ContextAwareChunker by @khaledsulayman in #351
- Upgrade docling, expand chunking testing by @bbrowning in #349
Full Changelog: v0.5.0a1...v0.5.0a2
v0.5.0a1
v0.5.0a1
What's Changed
- build(deps): bump actions/cache from 4.1.0 to 4.1.1 by @dependabot in #300
- build(deps): bump rojopolis/spellcheck-github-actions from 0.42.0 to 0.43.0 by @dependabot in #299
- build(deps): bump actions/checkout from 4.2.0 to 4.2.1 by @dependabot in #298
- chore: rename 'basic-workflow-tests' to 'e2e-custom' by @nathan-weinberg in #306
- fix: change "group" to "tag" for mmlu_branch task config by @alimaredia in #305
- fix: remove stop token from mixtral by @cdoern in #310
- ci: update small E2E job to align with CLI and Training by @nathan-weinberg in #317
- ci: update medium job to run as PR check by @nathan-weinberg in #318
- build(deps): bump rojopolis/spellcheck-github-actions from 0.43.0 to 0.43.1 by @dependabot in #314
- fix: medium E2E CI job was missing HF_TOKEN by @nathan-weinberg in #319
- build(deps): bump actions/cache from 4.1.1 to 4.1.2 by @dependabot in #320
- ci: use org variable for AWS EC2 AMI in E2E CI jobs by @nathan-weinberg in #322
- ci: convert med E2E CI job to L4 GPU by @nathan-weinberg in #325
- build(deps): bump rojopolis/spellcheck-github-actions from 0.43.1 to 0.44.0 by @dependabot in #326
- build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #323
- build(deps): bump pypa/gh-action-pypi-publish from 1.10.3 to 1.11.0 by @dependabot in #327
New Contributors
- @alimaredia made their first contribution in #305
Full Changelog: v0.4.2...v0.5.0a1
v0.3.2
What's Changed
- map mistral model name to mixtral by @cdoern in #315
- Without these changes, the mistral models will use merlinite templates which will result in unusable output.
Full Changelog: v0.3.1...v0.3.2