Skip to content

Releases: instructlab/sdg

v0.7.0

22 Jan 21:14
405ff22
Compare
Choose a tag to compare

SDG v0.7.0

Features

Custom Blocks and Teacher Models via BlockRegistry and PromptRegistry

Advanced users are now able to supply custom Pipeline Block implementations by registering new blocks with the BlockRegistry. It's also possible to register new chat templates for custom teacher models using the new PromptRegistry.

See the tests/testdata/custom_block.py and tests/testdata/custom_block_pipeline.yaml files in this repository for an example of how to create custom blocks and use them from your own pipeline config yamls.

See the tests/testdata/custom_prompt.py file in this repository for an example how to register custom chat templates used when formatting prompts.

New Blocks - IterBlock and LLMMessagesBlock

We have two new Block types available for pipelines in this release - IterBlock and LLMMessagesBlock. IterBlock allows you to execute another Block multiple times, based on a configured number of iterations. LLMMessagesBlock is like LLMBlock but uses the newer chat/completions API of OpenAI-compatible servers instead of the legacy completions API.

Consolidated PDF and Markdown ingestion and chunking implementations

Instead of sending PDF input documents through Docling and using something custom for Markdown, we now send both types of documents through Docling and have consolidated the chunking implementation across both document types. This may result in different chunks being generated for markdown content compared to previous releases.

Added a new instructlab.sdg.mix_datasets Python API

We've added a new Python API for advanced users that need to re-mix our generated outputs, for example to weight one taxonomy leaf node over others in the output or to have more than our default of 30 skill samples per leaf node in the final mixed output. See the example at docs/examples/mix_datasets/ for some example Python code and Recipe yaml files to accomplish this.

Breaking Changes

Pipeline configs and Prompt templates switched to Jinja

All of our Pipeline config yamls and prompt template files have moved to Jinja templates instead of Python string format() calls. This brings more expressiveness into our templating language - especially for prompt templates - but does mean any variable substitutions need to be updated from single brackets to double brackets - ie {document} becomes {{document}}. This only impacts you if you were using custom pipeline config yaml files or custom prompt templates in your config blocks.

ImportBlock removed from Pipeline blocks

Any users that were specifying custom pipeline configs (instead of using the default full or simple shipped by us) and also using the ImportBlock will now need to rewrite their pipelines to no longer use that block. We do not anticipate that anyone was actually using this block, but please reach out if you were so we can capture your needs in a future release.

Fixes

  • The PyTorch dependency is removed, because SDG doesn't directly use PyTorch. The test suite still depends on instructlab core, which depends on PyTorch.
  • The batch_size parameter is now respected every time we call an inference server from an LLMBlock. Previously, we were only batching the initial input but not accounting for some Blocks that may emit more output samples than input samples, meaning we would exceed our configured batch_size when actually making batching inference calls to vLLM, causing more memory to be consumed than expected as well as leading to scenarios where we were overloading inference servers in unexpected ways due to sending in batches with hundreds of completion requests instead of the configured size, which defaults to 8 on most hardware profiles.

All Changes

  • fix: missing regex from actionlint action by @nathan-weinberg in #390
  • Don't fail fast for unit and functional tests by @danmcp in #397
  • Adjust to slack-github-action 2.0 api changes by @danmcp in #395
  • build(deps): bump slackapi/slack-github-action from 1.27.0 to 2.0.0 by @dependabot in #385
  • refactor: remove unused generate_data arguments by @makelinux in #396
  • Add [End] to parser cleanup tags by @abhi1092 in #400
  • build(deps): bump step-security/harden-runner from 2.10.1 to 2.10.2 by @dependabot in #401
  • build(deps-dev): update pre-commit requirement from <4.0,>=3.0.4 to >=3.0.4,<5.0 by @dependabot in #387
  • [Docs] Updates for SDG README by @kelbrown20 in #281
  • refactor: Introduce jldump by @makelinux in #402
  • Ensure knowledge docs are cloned into unique dirs by @bbrowning in #416
  • build(deps): bump actions/cache from 4.1.2 to 4.2.0 by @dependabot in #431
  • Move AWS_REGION from using secret to var by @danmcp in #422
  • Add disk check after tests run by @danmcp in #419
  • Add a CHANGELOG.md and fill it in for latest 2 releases by @bbrowning in #418
  • build(deps): bump pypa/gh-action-pypi-publish from 1.12.2 to 1.12.3 by @dependabot in #433
  • fix: Restrict docling library versions to resolve dependency issues + update mypy linting packages by @courtneypacheco in #434
  • Update CHANGELOG.md for release v0.6.2 by @bbrowning in #440
  • Reconcile core data generation features with latest research advances by @bbrowning in #409
  • refactor: generated_data as list by @makelinux in #398
  • Update README.md with newer content from research team by @bbrowning in #444
  • build(deps): bump hynek/build-and-inspect-python-package from 2.10.0 to 2.11.0 by @dependabot in #453
  • feat: add discord e2e status reporting by @RobotSail in #455
  • feat: update release-strategy to include discord by @RobotSail in #454
  • build(deps): bump DavidAnson/markdownlint-cli2-action from 18.0.0 to 19.0.0 by @dependabot in #459
  • build(deps): bump rhysd/actionlint from 1.7.4 to 1.7.6 in /.github/workflows by @dependabot in #460
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.45.0 to 0.46.0 by @dependabot in #464
  • chore!: Update PyTorch to 2.5 by @fabiendupont in #465
  • fix: typo in mergify configuration by @nathan-weinberg in #468
  • Update CHANGELOG.md for v0.6.3 by @bbrowning in #473
  • Add a CONTRIBUTING.md with basic dev setup instructions by @bbrowning in #470
  • chore: Change default temporary write directory in all e2e CI jobs from tmpfs to /home/tmp by @courtneypacheco in #475
  • build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.3 by @dependabot in #472
  • fix: Remove unused PyTorch dependency by @fabiendupont in #479
  • Refactor Document Chunker to always use docling by @khaledsulayman in #430
  • Document updating of CHANGELOG.md as part of release by @bbrowning in #435
  • Split up generate_data and add a mix_datasets top level API by @bbrowning in #443
  • build(deps): bump DavidAnson/markdownlint-cli2-action from 19.0.0 to 19.1.0 by @dependabot in #487
  • build(deps): bump sarisia/actions-status-discord from 1.15.1 to 1.15.2 by @dependabot in #488
  • build(deps): bump step-security/harden-runner from 2.10.3 to 2.10.4 by @dependabot in #489
  • build(deps): bump rhysd/actionlint from 1.7.6 to 1.7.7 in /.github/workflows by @dependabot in #486
  • Implement LLMMessagesBlock by @bbrowning in #461
  • Adding Batching After Every Block by @eshwarprasadS in #484
  • use render method for jinja template, add unit tests by @eshwarprasadS in #493
  • Update release notes for v0.7.0 by @bbrowning in #495

New Contributors

Full Changelog: v0.6.3...v0.7.0

v0.6.3

10 Jan 15:02
2763286
Compare
Choose a tag to compare

SDG v0.6.3

Fixes

  • The max version constraint of PyTorch in our requirements file was raised so that we don't prevent SDG users from using it PyTorch 2.5.

All Changes

Full Changelog: v0.6.2...v0.6.3

v0.6.2

10 Dec 17:44
9bcde30
Compare
Choose a tag to compare

SDG v0.6.2

Fixes

  • Fixed a bug in our version specification of docling and docling_parse dependencies that were causing new installs of InstructLab to pull in incompatible versions of these. We also fixed a similar bug in the mypy dependency, but that one only impacts developers of SDG as opposed to users of InstructLab.

All Changes

  • Move AWS_REGION from using secret to var (backport #422) by @mergify in #438
  • fix: Restrict docling library versions to resolve dependency issues + update mypy linting packages (backport #434) by @mergify in #437
  • Update CHANGELOG.md for release v0.6.2 (backport #440) by @mergify in #441

Full Changelog: v0.6.1...v0.6.2

v0.6.1

27 Nov 23:48
c220b5f
Compare
Choose a tag to compare

SDG v0.6.1

What's Changed

Full Changelog: v0.6.0...v0.6.1

v0.6.0

15 Nov 19:52
4e90549
Compare
Choose a tag to compare

SDG v0.6.0

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.6.0

v0.3.3

13 Nov 17:15
fbfe7d4
Compare
Choose a tag to compare

What's Changed

Full Changelog: v0.3.2...v0.3.3

v0.5.0

12 Nov 22:32
b6f07a8
Compare
Choose a tag to compare

v0.5.0

What's Changed

New Contributors

Full Changelog: v0.4.2...v0.5.0

v0.5.0a2

08 Nov 21:22
e0698d6
Compare
Choose a tag to compare
v0.5.0a2 Pre-release
Pre-release

What's Changed

  • build(deps): bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #321
  • build(deps): bump machulav/ec2-github-runner from 2.3.6 to 2.3.7 by @dependabot in #328
  • build(deps): bump hynek/build-and-inspect-python-package from 2.9.0 to 2.10.0 by @dependabot in #329
  • build(deps): bump rhysd/actionlint from 1.7.3 to 1.7.4 in /.github/workflows by @dependabot in #332
  • build(deps): bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.0 by @dependabot in #337
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.44.0 to 0.45.0 by @dependabot in #338
  • build(deps): bump pypa/gh-action-pypi-publish from 1.12.0 to 1.12.2 by @dependabot in #342
  • Integrate Context-Aware Chunking and PDF Support by @khaledsulayman in #284
  • feat: parametrize system prompt by @jaideepr97 in #339
  • feat: support converting messages datasets into multiple pre-training formats by @jaideepr97 in #341
  • Move to Docling v2 APIs by @bbrowning in #347
  • feat: expose max_num_tokens as configurable by @cdoern in #340
  • Remove unnecessary requirement for qna.yaml in ContextAwareChunker by @khaledsulayman in #351
  • Upgrade docling, expand chunking testing by @bbrowning in #349

Full Changelog: v0.5.0a1...v0.5.0a2

v0.5.0a1

01 Nov 17:13
5abc57f
Compare
Choose a tag to compare
v0.5.0a1 Pre-release
Pre-release

v0.5.0a1

What's Changed

New Contributors

Full Changelog: v0.4.2...v0.5.0a1

v0.3.2

18 Oct 21:14
481e3f6
Compare
Choose a tag to compare

What's Changed

  • map mistral model name to mixtral by @cdoern in #315
  • Without these changes, the mistral models will use merlinite templates which will result in unusable output.

Full Changelog: v0.3.1...v0.3.2