instructlab/sdg v0.7.0 released!

2 views

Skip to first unread message

Ben Browning

unread,

Jan 22, 2025, 4:22:41 PM1/22/25

to InstructLab Dev

We just put out a v0.7.0 release of instructlab/sdg that contains a number of new features, fixes, as well as some breaking changes that may impact any users that were creating custom Pipelines. Users of the `ilab` CLI workflow should not be impacted.

We have four new first-time contributors in this release as well - thank you new contributors, and looking forward to more!

Details inline below, or read them on the web at https://github.com/instructlab/sdg/releases/tag/v0.7.0

- Ben

SDG v0.7.0

Features

Custom Blocks and Teacher Models via BlockRegistry and PromptRegistry

Advanced users are now able to supply custom Pipeline Block implementations by registering new blocks with the BlockRegistry. It's also possible to register new chat templates for custom teacher models using the new PromptRegistry.

See the tests/testdata/custom_block.py and tests/testdata/custom_block_pipeline.yaml files in this repository for an example of how to create custom blocks and use them from your own pipeline config yamls.

See the tests/testdata/custom_prompt.py file in this repository for an example how to register custom chat templates used when formatting prompts.

New Blocks - IterBlock and LLMMessagesBlock

We have two new Block types available for pipelines in this release - IterBlock and LLMMessagesBlock. IterBlock allows you to execute another Block multiple times, based on a configured number of iterations. LLMMessagesBlock is like LLMBlock but uses the newer chat/completions API of OpenAI-compatible servers instead of the legacy completions API.

Consolidated PDF and Markdown ingestion and chunking implementations

Instead of sending PDF input documents through Docling and using something custom for Markdown, we now send both types of documents through Docling and have consolidated the chunking implementation across both document types. This may result in different chunks being generated for markdown content compared to previous releases.

Added a new `instructlab.sdg.mix_datasets` Python API

We've added a new Python API for advanced users that need to re-mix our generated outputs, for example to weight one taxonomy leaf node over others in the output or to have more than our default of 30 skill samples per leaf node in the final mixed output. See the example at docs/examples/mix_datasets/ for some example Python code and Recipe yaml files to accomplish this.

Breaking Changes

Pipeline configs and Prompt templates switched to Jinja

All of our Pipeline config yamls and prompt template files have moved to Jinja templates instead of Python string format() calls. This brings more expressiveness into our templating language - especially for prompt templates - but does mean any variable substitutions need to be updated from single brackets to double brackets - ie {document} becomes {{document}}. This only impacts you if you were using custom pipeline config yaml files or custom prompt templates in your config blocks.

ImportBlock removed from Pipeline blocks

Any users that were specifying custom pipeline configs (instead of using the default full or simple shipped by us) and also using the ImportBlock will now need to rewrite their pipelines to no longer use that block. We do not anticipate that anyone was actually using this block, but please reach out if you were so we can capture your needs in a future release.

Fixes

The PyTorch dependency is removed, because SDG doesn't directly use PyTorch. The test suite still depends on instructlab core, which depends on PyTorch.
The batch_size parameter is now respected every time we call an inference server from an LLMBlock. Previously, we were only batching the initial input but not accounting for some Blocks that may emit more output samples than input samples, meaning we would exceed our configured batch_size when actually making batching inference calls to vLLM, causing more memory to be consumed than expected as well as leading to scenarios where we were overloading inference servers in unexpected ways due to sending in batches with hundreds of completion requests instead of the configured size, which defaults to 8 on most hardware profiles.

New Contributors

@kelbrown20 made their first contribution in #281
@courtneypacheco made their first contribution in #434
@fabiendupont made their first contribution in #465
@eshwarprasadS made their first contribution in #484

Full Changelog: v0.6.3...v0.7.0

Reply all

Reply to author

Forward

0 new messages