22 Jan 21:14

405ff22

v0.7.0 Latest

Latest

SDG v0.7.0

Features

Custom Blocks and Teacher Models via BlockRegistry and PromptRegistry

Advanced users are now able to supply custom Pipeline Block implementations by registering new blocks with the BlockRegistry. It's also possible to register new chat templates for custom teacher models using the new PromptRegistry.

See the tests/testdata/custom_block.py and tests/testdata/custom_block_pipeline.yaml files in this repository for an example of how to create custom blocks and use them from your own pipeline config yamls.

See the tests/testdata/custom_prompt.py file in this repository for an example how to register custom chat templates used when formatting prompts.

New Blocks - IterBlock and LLMMessagesBlock

We have two new Block types available for pipelines in this release - IterBlock and LLMMessagesBlock. IterBlock allows you to execute another Block multiple times, based on a configured number of iterations. LLMMessagesBlock is like LLMBlock but uses the newer chat/completions API of OpenAI-compatible servers instead of the legacy completions API.

Consolidated PDF and Markdown ingestion and chunking implementations

Instead of sending PDF input documents through Docling and using something custom for Markdown, we now send both types of documents through Docling and have consolidated the chunking implementation across both document types. This may result in different chunks being generated for markdown content compared to previous releases.

Added a new `instructlab.sdg.mix_datasets` Python API

We've added a new Python API for advanced users that need to re-mix our generated outputs, for example to weight one taxonomy leaf node over others in the output or to have more than our default of 30 skill samples per leaf node in the final mixed output. See the example at docs/examples/mix_datasets/ for some example Python code and Recipe yaml files to accomplish this.

Breaking Changes

Pipeline configs and Prompt templates switched to Jinja

All of our Pipeline config yamls and prompt template files have moved to Jinja templates instead of Python string format() calls. This brings more expressiveness into our templating language - especially for prompt templates - but does mean any variable substitutions need to be updated from single brackets to double brackets - ie {document} becomes {{document}}. This only impacts you if you were using custom pipeline config yaml files or custom prompt templates in your config blocks.

ImportBlock removed from Pipeline blocks

Any users that were specifying custom pipeline configs (instead of using the default full or simple shipped by us) and also using the ImportBlock will now need to rewrite their pipelines to no longer use that block. We do not anticipate that anyone was actually using this block, but please reach out if you were so we can capture your needs in a future release.

Fixes

The PyTorch dependency is removed, because SDG doesn't directly use PyTorch. The test suite still depends on instructlab core, which depends on PyTorch.
The batch_size parameter is now respected every time we call an inference server from an LLMBlock. Previously, we were only batching the initial input but not accounting for some Blocks that may emit more output samples than input samples, meaning we would exceed our configured batch_size when actually making batching inference calls to vLLM, causing more memory to be consumed than expected as well as leading to scenarios where we were overloading inference servers in unexpected ways due to sending in batches with hundreds of completion requests instead of the configured size, which defaults to 8 on most hardware profiles.

All Changes

fix: missing regex from actionlint action by @nathan-weinberg in #390
Don't fail fast for unit and functional tests by @danmcp in #397
Adjust to slack-github-action 2.0 api changes by @danmcp in #395
build(deps): bump slackapi/slack-github-action from 1.27.0 to 2.0.0 by @dependabot in #385
refactor: remove unused generate_data arguments by @makelinux in #396
Add [End] to parser cleanup tags by @abhi1092 in #400
build(deps): bump step-security/harden-runner from 2.10.1 to 2.10.2 by @dependabot in #401
build(deps-dev): update pre-commit requirement from <4.0,>=3.0.4 to >=3.0.4,<5.0 by @dependabot in #387
[Docs] Updates for SDG README by @kelbrown20 in #281
refactor: Introduce jldump by @makelinux in #402
Ensure knowledge docs are cloned into unique dirs by @bbrowning in #416
build(deps): bump actions/cache from 4.1.2 to 4.2.0 by @dependabot in #431
Move AWS_REGION from using secret to var by @danmcp in #422
Add disk check after tests run by @danmcp in #419
Add a CHANGELOG.md and fill it in for latest 2 releases by @bbrowning in #418
build(deps): bump pypa/gh-action-pypi-publish from 1.12.2 to 1.12.3 by @dependabot in #433
fix: Restrict docling library versions to resolve dependency issues + update mypy linting packages by @courtneypacheco in #434
Update CHANGELOG.md for release v0.6.2 by @bbrowning in #440
Reconcile core data generation features with latest research advances by @bbrowning in #409
refactor: generated_data as list by @makelinux in #398
Update README.md with newer content from research team by @bbrowning in #444
build(deps): bump hynek/build-and-inspect-python-package from 2.10.0 to 2.11.0 by @dependabot in #453
feat: add discord e2e status reporting by @RobotSail in #455
feat: update release-strategy to include discord by @RobotSail in #454
build(deps): bump DavidAnson/markdownlint-cli2-action from 18.0.0 to 19.0.0 by @dependabot in #459
build(deps): bump rhysd/actionlint from 1.7.4 to 1.7.6 in /.github/workflows by @dependabot in #460
build(deps): bump rojopolis/spellcheck-github-actions from 0.45.0 to 0.46.0 by @dependabot in #464
chore!: Update PyTorch to 2.5 by @fabiendupont in #465
fix: typo in mergify configuration by @nathan-weinberg in #468
Update CHANGELOG.md for v0.6.3 by @bbrowning in #473
Add a CONTRIBUTING.md with basic dev setup instructions by @bbrowning in #470
chore: Change default temporary write directory in all e2e CI jobs from tmpfs to /home/tmp by @courtneypacheco in #475
build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.3 by @dependabot in #472
fix: Remove unused PyTorch dependency by @fabiendupont in #479
Refactor Document Chunker to always use docling by @khaledsulayman in #430
Document updating of CHANGELOG.md as part of release by @bbrowning in #435
Split up generate_data and add a mix_datasets top level API by @bbrowning in #443
build(deps): bump DavidAnson/markdownlint-cli2-action from 19.0.0 to 19.1.0 by @dependabot in #487
build(deps): bump sarisia/actions-status-discord from 1.15.1 to 1.15.2 by @dependabot in #488
build(deps): bump step-security/harden-runner from 2.10.3 to 2.10.4 by @dependabot in #489
build(deps): bump rhysd/actionlint from 1.7.6 to 1.7.7 in /.github/workflows by @dependabot in #486
Implement LLMMessagesBlock by @bbrowning in #461
Adding Batching After Every Block by @eshwarprasadS in #484
use render method for jinja template, add unit tests by @eshwarprasadS in #493
Update release notes for v0.7.0 by @bbrowning in #495

New Contributors

@kelbrown20 made their first contribution in #281
@courtneypacheco made their first contribution in #434
@fabiendupont made their first contribution in #465
@eshwarprasadS made their first contribution in #484

Full Changelog: v0.6.3...v0.7.0

Contributors

bbrowning, danmcp, and 10 other contributors

Assets 6

10 Jan 15:02

bbrowning

v0.6.3

2763286

v0.6.3

SDG v0.6.3

Fixes

The max version constraint of PyTorch in our requirements file was raised so that we don't prevent SDG users from using it PyTorch 2.5.

All Changes

chore!: Update PyTorch to 2.5 (backport #465) by @mergify in #469
Update CHANGELOG.md for v0.6.3 (backport #473) by @mergify in #474

Full Changelog: v0.6.2...v0.6.3

Contributors

mergify

Assets 6

10 Dec 17:44

bbrowning

v0.6.2

9bcde30

v0.6.2

SDG v0.6.2

Fixes

Fixed a bug in our version specification of docling and docling_parse dependencies that were causing new installs of InstructLab to pull in incompatible versions of these. We also fixed a similar bug in the mypy dependency, but that one only impacts developers of SDG as opposed to users of InstructLab.

All Changes

Move AWS_REGION from using secret to var (backport #422) by @mergify in #438
fix: Restrict docling library versions to resolve dependency issues + update mypy linting packages (backport #434) by @mergify in #437
Update CHANGELOG.md for release v0.6.2 (backport #440) by @mergify in #441

Full Changelog: v0.6.1...v0.6.2

Contributors

mergify

Assets 6

27 Nov 23:48

bbrowning

v0.6.1

c220b5f

v0.6.1

SDG v0.6.1

What's Changed

Add [End] to parser cleanup tags (backport #400) by @mergify in #403
Ensure knowledge docs are cloned into unique dirs (backport #416) by @mergify in #417

Full Changelog: v0.6.0...v0.6.1

Contributors

mergify

Assets 6

15 Nov 19:52

khaledsulayman

v0.6.0

4e90549

v0.6.0

SDG v0.6.0

What's Changed

fix: formatting error by @RobotSail in #378
Prefer tesserocr over easyocr, if available by @bbrowning in #369
ci: add large-size E2E CI job by @nathan-weinberg in #380
Add Release Strategy Document by @khaledsulayman in #381
Docling models path by @aakankshaduggal in #362
Check for tokenizer in downloaded models directory by @khaledsulayman in #364
fix: upsample the phase10 knowledge dataset by @RobotSail in #377
build(deps): bump DavidAnson/markdownlint-cli2-action from 17.0.0 to 18.0.0 by @dependabot in #386
Delete .gitattributes by @khaledsulayman in #393

New Contributors

@RobotSail made their first contribution in #378

Full Changelog: v0.5.0...v0.6.0

Contributors

bbrowning, aakankshaduggal, and 4 other contributors

Assets 6

13 Nov 17:15

nathan-weinberg

v0.3.3

fbfe7d4

v0.3.3

What's Changed

Prepare release-v0.3 branch for backports by @bbrowning in #371
Run the simple pipeline on small runners by @bbrowning in #372
Data mix fix (backport #366) by @mergify in #368

Full Changelog: v0.3.2...v0.3.3

Contributors

bbrowning and mergify

Assets 6

12 Nov 22:32

khaledsulayman

v0.5.0

b6f07a8

v0.5.0

What's Changed

build(deps): bump actions/cache from 4.1.0 to 4.1.1 by @dependabot in #300
build(deps): bump rojopolis/spellcheck-github-actions from 0.42.0 to 0.43.0 by @dependabot in #299
build(deps): bump actions/checkout from 4.2.0 to 4.2.1 by @dependabot in #298
chore: rename 'basic-workflow-tests' to 'e2e-custom' by @nathan-weinberg in #306
fix: change "group" to "tag" for mmlu_branch task config by @alimaredia in #305
fix: remove stop token from mixtral by @cdoern in #310
ci: update small E2E job to align with CLI and Training by @nathan-weinberg in #317
ci: update medium job to run as PR check by @nathan-weinberg in #318
build(deps): bump rojopolis/spellcheck-github-actions from 0.43.0 to 0.43.1 by @dependabot in #314
fix: medium E2E CI job was missing HF_TOKEN by @nathan-weinberg in #319
build(deps): bump actions/cache from 4.1.1 to 4.1.2 by @dependabot in #320
ci: use org variable for AWS EC2 AMI in E2E CI jobs by @nathan-weinberg in #322
ci: convert med E2E CI job to L4 GPU by @nathan-weinberg in #325
build(deps): bump rojopolis/spellcheck-github-actions from 0.43.1 to 0.44.0 by @dependabot in #326
build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #323
build(deps): bump pypa/gh-action-pypi-publish from 1.10.3 to 1.11.0 by @dependabot in #327
build(deps): bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #321
build(deps): bump machulav/ec2-github-runner from 2.3.6 to 2.3.7 by @dependabot in #328
build(deps): bump hynek/build-and-inspect-python-package from 2.9.0 to 2.10.0 by @dependabot in #329
build(deps): bump rhysd/actionlint from 1.7.3 to 1.7.4 in /.github/workflows by @dependabot in #332
build(deps): bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.0 by @dependabot in #337
build(deps): bump rojopolis/spellcheck-github-actions from 0.44.0 to 0.45.0 by @dependabot in #338
build(deps): bump pypa/gh-action-pypi-publish from 1.12.0 to 1.12.2 by @dependabot in #342
Integrate Context-Aware Chunking and PDF Support by @khaledsulayman in #284
feat: parametrize system prompt by @jaideepr97 in #339
feat: support converting messages datasets into multiple pre-training formats by @jaideepr97 in #341
Move to Docling v2 APIs by @bbrowning in #347
feat: expose max_num_tokens as configurable by @cdoern in #340
Remove unnecessary requirement for qna.yaml in ContextAwareChunker by @khaledsulayman in #351
Upgrade docling, expand chunking testing by @bbrowning in #349
Don't attempt batching with InstructLab's llama-cpp-python by @bbrowning in #358
Consolidate test sample documents into one subdir by @bbrowning in #356
Move a spurious print to a debug log message by @bbrowning in #359
Only use CPU for the docling OCR models by @bbrowning in #361
Data mix fix by @aakankshaduggal in #366

New Contributors

@alimaredia made their first contribution in #305

Full Changelog: v0.4.2...v0.5.0

Contributors

bbrowning, alimaredia, and 6 other contributors

Assets 6

08 Nov 21:22

khaledsulayman

v0.5.0a2

e0698d6

v0.5.0a2 Pre-release

Pre-release

What's Changed

build(deps): bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #321
build(deps): bump machulav/ec2-github-runner from 2.3.6 to 2.3.7 by @dependabot in #328
build(deps): bump hynek/build-and-inspect-python-package from 2.9.0 to 2.10.0 by @dependabot in #329
build(deps): bump rhysd/actionlint from 1.7.3 to 1.7.4 in /.github/workflows by @dependabot in #332
build(deps): bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.0 by @dependabot in #337
build(deps): bump rojopolis/spellcheck-github-actions from 0.44.0 to 0.45.0 by @dependabot in #338
build(deps): bump pypa/gh-action-pypi-publish from 1.12.0 to 1.12.2 by @dependabot in #342
Integrate Context-Aware Chunking and PDF Support by @khaledsulayman in #284
feat: parametrize system prompt by @jaideepr97 in #339
feat: support converting messages datasets into multiple pre-training formats by @jaideepr97 in #341
Move to Docling v2 APIs by @bbrowning in #347
feat: expose max_num_tokens as configurable by @cdoern in #340
Remove unnecessary requirement for qna.yaml in ContextAwareChunker by @khaledsulayman in #351
Upgrade docling, expand chunking testing by @bbrowning in #349

Full Changelog: v0.5.0a1...v0.5.0a2

Contributors

bbrowning, jaideepr97, and 3 other contributors

Assets 6

01 Nov 17:13

nathan-weinberg

v0.5.0a1

5abc57f

v0.5.0a1 Pre-release

Pre-release

v0.5.0a1

What's Changed

build(deps): bump actions/cache from 4.1.0 to 4.1.1 by @dependabot in #300
build(deps): bump rojopolis/spellcheck-github-actions from 0.42.0 to 0.43.0 by @dependabot in #299
build(deps): bump actions/checkout from 4.2.0 to 4.2.1 by @dependabot in #298
chore: rename 'basic-workflow-tests' to 'e2e-custom' by @nathan-weinberg in #306
fix: change "group" to "tag" for mmlu_branch task config by @alimaredia in #305
fix: remove stop token from mixtral by @cdoern in #310
ci: update small E2E job to align with CLI and Training by @nathan-weinberg in #317
ci: update medium job to run as PR check by @nathan-weinberg in #318
build(deps): bump rojopolis/spellcheck-github-actions from 0.43.0 to 0.43.1 by @dependabot in #314
fix: medium E2E CI job was missing HF_TOKEN by @nathan-weinberg in #319
build(deps): bump actions/cache from 4.1.1 to 4.1.2 by @dependabot in #320
ci: use org variable for AWS EC2 AMI in E2E CI jobs by @nathan-weinberg in #322
ci: convert med E2E CI job to L4 GPU by @nathan-weinberg in #325
build(deps): bump rojopolis/spellcheck-github-actions from 0.43.1 to 0.44.0 by @dependabot in #326
build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #323
build(deps): bump pypa/gh-action-pypi-publish from 1.10.3 to 1.11.0 by @dependabot in #327

New Contributors

@alimaredia made their first contribution in #305

Full Changelog: v0.4.2...v0.5.0a1

Contributors

alimaredia, cdoern, and 2 other contributors

Assets 6

18 Oct 21:14

aakankshaduggal

v0.3.2

481e3f6

v0.3.2

What's Changed

map mistral model name to mixtral by @cdoern in #315
Without these changes, the mistral models will use merlinite templates which will result in unusable output.

Full Changelog: v0.3.1...v0.3.2

Contributors

cdoern

Assets 6

Releases: instructlab/sdg

v0.7.0

SDG v0.7.0

Features

Custom Blocks and Teacher Models via BlockRegistry and PromptRegistry

New Blocks - IterBlock and LLMMessagesBlock

Consolidated PDF and Markdown ingestion and chunking implementations

Added a new instructlab.sdg.mix_datasets Python API

Breaking Changes

Pipeline configs and Prompt templates switched to Jinja

ImportBlock removed from Pipeline blocks

Fixes

All Changes

New Contributors

Contributors

v0.6.3

SDG v0.6.3

Fixes

All Changes

Contributors

v0.6.2

SDG v0.6.2

Fixes

All Changes

Contributors

v0.6.1

What's Changed

Contributors

v0.6.0

What's Changed

New Contributors

Contributors

v0.3.3

What's Changed

Contributors

v0.5.0

v0.5.0

What's Changed

New Contributors

Contributors

v0.5.0a2

What's Changed

Contributors

v0.5.0a1

v0.5.0a1

What's Changed

New Contributors

Contributors

v0.3.2

What's Changed

Contributors

Added a new `instructlab.sdg.mix_datasets` Python API