Hi there!
The error is related to the synthetic data generation (SDG) process. I can't say for sure without more information, but I suspect the questions you're providing are too long: the process fails when it tries to put the second question into the prompt because the prompt has exceeded its size limit.
There are a few things you need to fix in your qna.yaml file before the InstructLab process can work:
- You only have 2 question-and-answer pairs in the qna.yaml file in the repository you shared. You need at least 3 pairs for every piece of context, and there need to be at least 5 pieces of context for the SDG process to work for knowledge files. That means you need a minimum of 15 different question-and-answer pairs, spread across at least 5 different pieces of context.
- The `context` field needs to be a snippet of information from your document file (the `.txt` files in your repository). This snippet provides context for your question-and-answer pairs. It *must* be taken directly from the document, not paraphrased.
- The questions provided are not actually questions. The teacher model uses them as the pattern for generating questions in the synthetic data generation (SDG) process, so if you do not word them as questions based on the content of the `context` field, the process will fail.
- The `context`, `question`, and `answer` fields all have a maximum length that depends on the teacher model. Read more here: https://github.com/instructlab/sdg/blob/main/docs/FAQ.md#how-long-can-a-given-seed_example-be-for-a-knowledge-leaf-node. I think your questions and answers are too long to work properly, but I don't know enough about DeepSeek's context window to help here. You will have to do that research and figure it out.
- Because the qna.yaml file is not in a `knowledge` directory, the SDG process may have some issues. https://docs.instructlab.ai/taxonomy/ has more information, though you don't need as detailed a tree for your use case (updates to the docs to address use cases like yours are being made here: https://github.com/instructlab/docs.instructlab.ai/pull/40/files#diff-e274bfa7faf3772da3670021b18497fdd38959e35ef5e23d9d0c6d11d9d69fcb).
```
context: <short part of the code that is wrong>
question: How is this command definition wrong?
answer: This command definition is wrong because <reason>. A corrected command definition is <code snippet from corrected code>.
```
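If it helps, the count requirements from the list above can be sanity-checked with a short script before running SDG. This is only a sketch under some assumptions: it expects the file to be already parsed into a dictionary with a top-level `seed_examples` list whose entries carry `context` and `questions_and_answers` keys, and `check_seed_examples` is a hypothetical helper, not part of InstructLab. The trailing question mark test is a rough heuristic, and the script does not check field lengths, since those depend on the teacher model.

```python
# Minimums taken from the list above: 5 contexts, 3 Q&A pairs per context.
MIN_CONTEXTS = 5
MIN_PAIRS_PER_CONTEXT = 3


def check_seed_examples(qna):
    """Return a list of human-readable problems found in a parsed qna.yaml."""
    problems = []
    # Assumed layout: a top-level "seed_examples" list whose entries hold
    # "context" and "questions_and_answers".
    seed_examples = qna.get("seed_examples") or []
    if len(seed_examples) < MIN_CONTEXTS:
        problems.append(
            f"found {len(seed_examples)} contexts, need at least {MIN_CONTEXTS}"
        )
    for index, example in enumerate(seed_examples, start=1):
        if not str(example.get("context") or "").strip():
            problems.append(f"seed example {index}: missing context")
        pairs = example.get("questions_and_answers") or []
        if len(pairs) < MIN_PAIRS_PER_CONTEXT:
            problems.append(
                f"seed example {index}: found {len(pairs)} pairs, "
                f"need at least {MIN_PAIRS_PER_CONTEXT}"
            )
        for pair in pairs:
            question = str(pair.get("question") or "").strip()
            # Rough heuristic only: a real question normally ends with "?".
            if not question.endswith("?"):
                problems.append(
                    f"seed example {index}: not phrased as a question: {question!r}"
                )
    return problems
```

To run it on a real file, load the YAML first (for example with PyYAML's `yaml.safe_load`) and pass the resulting dictionary to `check_seed_examples`. An empty list means the counts and question wording look fine.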
If you are still having issues after reading through all of that and fixing the issues in your qna.yaml file, please feel free to reach back out.
Thanks!
Cheers,
Laura
---
InstructLab Taxonomy Lead