Data generation inefficiencies with larger teacher models


Ben Browning

May 21, 2024, 10:23:42 AM
to d...@instructlab.ai
I've been experimenting with larger teacher models for `ilab generate` recently, mostly Mixtral-8x7B-Instruct-v0.1 and Mistral-7B-Instruct-v0.2.

One thing I've noticed with both of these models is that they produce large numbers of discarded instructions due to unexpected output formats, at least when generating knowledge docs. An example from a recent partial generation run:

```
sh-5.1# wc -l generated/train_Mistral-7B-Instruct-v0_2024-05-21T11_13_28.jsonl
192 generated/train_Mistral-7B-Instruct-v0_2024-05-21T11_13_28.jsonl
sh-5.1# wc -l generated/discarded_Mistral-7B-Instruct-v0_2024-05-21T11_13_28.log
453 generated/discarded_Mistral-7B-Instruct-v0_2024-05-21T11_13_28.log
sh-5.1# grep -Po "Discarded instruction\(.+?\):" generated/discarded_Mistral-7B-Instruct-v0_2024-05-21T11_13_28.log | sort | uniq -c
     13 Discarded instruction(began with punctuation):
     36 Discarded instruction(contained a word from the denylist):
    404 Discarded instruction(didn't match expected format):
```

That's 192 valid generated instructions and 453 discarded, and 404 of those discards were format matching errors. This is after some small changes on my local fork of ilab that make format parsing a bit more lenient.
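To illustrate the kind of leniency I mean (this is just a sketch, not the actual diff from my fork), something like a looser regex would recover a lot of these outputs instead of discarding them:

```python
import re

# Hypothetical sketch: the stock parser expects lines like "1. <instruction>",
# but Mixtral/Mistral often emit variants such as "1) ...", "**1.** ...",
# or "Instruction 1: ...". A looser pattern accepts those too.
LENIENT_ITEM = re.compile(
    r"^\s*(?:\*\*)?(?:instruction\s*)?(\d+)[.):]*(?:\*\*)?\s*(.+)$",
    re.IGNORECASE,
)

def parse_instructions(raw: str) -> list[str]:
    """Pull numbered instructions out of raw model output, tolerating
    common formatting variations instead of failing the whole batch."""
    items = []
    for line in raw.splitlines():
        m = LENIENT_ITEM.match(line)
        if m:
            items.append(m.group(2).strip())
    return items
```

The tradeoff is a few false positives (any line starting with a number would match), which the downstream filters would still need to catch.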


And it's on top of my unmerged PR at https://github.com/instructlab/instructlab/pull/1069, which fixes generation so it actually finishes, instead of erroring out in most cases, when used with models that don't have extremely long context windows.

I have some ideas for how to fix this, such as customized output parsing per model, or moving to something like langchain output parsers (https://python.langchain.com/v0.2/docs/concepts/#output-parsers) so we can retry and/or ask the model to fix misformatted outputs.
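The retry/fix idea, sketched in plain Python rather than langchain, would look roughly like this. `generate` here stands in for whatever completion call ilab makes, and `parse` is any parser that raises on bad output; neither is real ilab code:

```python
class FormatError(ValueError):
    """Raised when model output doesn't match the expected format."""

def generate_with_retries(generate, parse, prompt: str, max_retries: int = 2):
    """Call the model, and on a parse failure re-prompt it with its own
    malformed output instead of discarding the instruction outright."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = generate(attempt_prompt)
        try:
            return parse(raw)
        except FormatError:
            # Ask the model to reformat its previous answer.
            attempt_prompt = (
                f"{prompt}\n\nYour previous answer did not match the expected "
                f"format. Reformat this answer as a numbered list:\n{raw}"
            )
    raise FormatError("model output never matched the expected format")
```

The cost is extra inference calls per misformatted output, but that's likely cheaper than throwing away two-thirds of a generation run.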

However, given how long my other generation fix PR has sat without merge or comment, I'm wondering whether there is a desire to make substantial improvements here. Is anyone else using `ilab generate` in anger against real datasets with real teacher models and hitting similar issues? Am I holding this wrong when pointing at a larger teacher model? I'm already ensuring `--model-family mixtral` is used with both Mixtral and Mistral, since that is the prompt format both expect.

Ben

mark...@gmail.com

May 31, 2024, 7:59:39 PM
to Ben Browning, d...@instructlab.ai

Hi Ben,


I think simple fixes like these to "normalize" synthetic output for better matching/parsing are definitely a good idea. Assuming most of the discarded data can be salvaged, users shouldn't be stuck waiting for data just because we didn't strip some noise.
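For example, a normalization pass ahead of the format checks might look something like this (a hypothetical sketch, not actual ilab code):

```python
import re

def normalize(text: str) -> str:
    """Strip markdown emphasis, wrapping quotes, and stray whitespace
    before running format checks, so otherwise-good instructions pass
    instead of being discarded as noise."""
    text = text.strip()
    text = re.sub(r"[*_`]+", "", text)   # drop markdown emphasis/backticks
    text = text.strip("\"' ")            # drop wrapping quotes
    return re.sub(r"\s+", " ", text)     # collapse internal whitespace
```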


The more complicated solutions, like making output parsing "pluggable" per model, also sound like something we will need, at least long term. As it gets more complicated, it would be best to keep it from becoming too speculative. That can be tricky, but it sounds like you'll be able to frame a reasonable problem and solution in the short term using your example.


Regards,

Mark (markstur)

