facing an issue while performing synthetic data generation


Prithiviraj N

Feb 9, 2025, 2:06:55 AM
to users
I am trying to perform SynGen, but I am facing the issue shown in the attached image.
The repository that I used in the qna.yaml file is
Screenshot 2025-02-09 at 12.35.34 PM.png

Máirín Duffy

Feb 9, 2025, 11:39:00 AM
to Prithiviraj N, users
Hi! 👋

What type of system are you running it on?

~m

Seolta ó mo fhón póca. (Sent from my phone.)


Prithiviraj N

Feb 9, 2025, 10:00:56 PM
to users, du...@redhat.com, users, Prithiviraj N
I am running it on macOS, on a MacBook with an M3 Pro chip.

Máirín Duffy

Feb 9, 2025, 10:14:58 PM
to Prithiviraj N, users
Hey,

I saw your post on Discord. Did you swap out the SDG model from Mixtral to DeepSeek? If so, there are different prompt templates, etc., that would need to be updated in code.

~m

Prithiviraj N

Feb 9, 2025, 11:02:23 PM
to users, du...@redhat.com, users, Prithiviraj N
No, I didn't swap the SDG model. I just changed the model path in config.yaml, and I was able to serve the model and chat with it, but the issue arises when I try to perform SynGen. Please see the attachment for the error message. To see what type of data I am trying to run SynGen on, you can look at my Git repo ( https://github.com/Prithiviraj25/ZOs-patch-data ), which is used in the qna.yaml file.
Screenshot 2025-02-09 at 12.35.34 PM.png

Máirín Duffy

Feb 9, 2025, 11:16:27 PM
to Prithiviraj N, users
Could you show me the lines in config.yaml you changed? It's really helpful to be very specific. Where did you put deepseek into your config.yaml, which lines?
I looked at your repo but I don't see your qna.yaml. Where is that?

~m

Prithiviraj N

Feb 10, 2025, 12:07:09 AM
to users, du...@redhat.com, Prithiviraj N
Please look at this repo: https://github.com/Prithiviraj25/ZOs-patch-data . I have uploaded the qna.yaml file there.
I changed the model path under generate (please look at the attached image),
and similarly I changed the model path under serve.

If you have time, could you please help me over a Google Meet call?
I really need to get this done, as the project has strict deadlines.

Screenshot 2025-02-10 at 10.33.55 AM.png

Máirín Duffy

Feb 10, 2025, 12:16:39 AM
to Prithiviraj N, users
Your configuration indicates you're using deepseek for synthetic data generation. InstructLab has prompts that are built around the Mixtral model for SDG. You'll need to look around the codepaths for that and update the prompts for deepseek accordingly. Unfortunately the codebase right now doesn't support easy swapping of teacher models for generate.

~m

Prithiviraj N

Feb 10, 2025, 12:52:30 AM
to Máirín Duffy, users
So what other way do I have to get my work done?

Máirín Duffy

Feb 10, 2025, 12:54:02 AM
to Prithiviraj N, users
You're using an unsupported configuration, so you have to take a look at the prompts and update them for DeepSeek.

~m

Prithiviraj N

Feb 10, 2025, 3:03:38 AM
to Máirín Duffy, users

OK, I will do that. Thank you.

Prithiviraj N

Feb 11, 2025, 11:29:26 AM
to users, du...@redhat.com, users, Prithiviraj N
I now get this error: `failed to generate data with exception: 'icl_query_2'`. Do you know why this error is shown?

Laura Santamaria

Feb 11, 2025, 11:56:32 AM
to Prithiviraj N, users, du...@redhat.com
Hi there!

The error is related to the synthetic data generation (SDG) process. Without more information, I suspect the issue is that the questions you're providing are too long, so it's failing when it tries to put the second question into the prompt because the prompt has exceeded the size limits.

There are a few things you need to fix in your qna.yaml file before the InstructLab process can work:
  • You only have 2 question-and-answer pairs in the qna.yaml file that's in the repository you shared. You need to have at least 3 pairs for every piece of context, and there need to be at least 5 pieces of context for the SDG process to work for knowledge files. That means you will need 15 different question-and-answer pairs, at a minimum, spread across at least 5 different pieces of context.
  • The `context` field needs to be the snippet of information from your document file (the `.txt` files in your repository). This snippet provides context for your question and answer pairs. It *must* be directly from the document, not paraphrased.
  • The questions provided are not actually questions. These questions are how the teacher model knows how to generate questions in the synthetic data generation (SDG) process. If you do not word them as questions based on the context provided in the `context` field, the system will fail.
  • The `context` fields, `question` fields, and `answer` fields all have a maximum length dependent on the teacher model. Read more here: https://github.com/instructlab/sdg/blob/main/docs/FAQ.md#how-long-can-a-given-seed_example-be-for-a-knowledge-leaf-node. I think your questions and answers are too long to work properly, but I don't know enough about DeepSeek's context window to help here. You will have to do that research and figure it out.
  • As the qna.yaml file is not in a `knowledge` directory, the SDG process may have some issues. https://docs.instructlab.ai/taxonomy/ has more information, though you don't need as detailed of a tree for your use case (updates to the docs to address use cases like yours are being made here: https://github.com/instructlab/docs.instructlab.ai/pull/40/files#diff-e274bfa7faf3772da3670021b18497fdd38959e35ef5e23d9d0c6d11d9d69fcb)
Please read through the documents at https://github.com/instructlab/taxonomy/blob/main/README.md#getting-started-with-knowledge-contributions to get a better sense of what the qna.yaml file needs to contain. Note that the values provided there are specific to the Granite model that we currently use with InstructLab.

So, if you're trying to use the code you're providing (which, please note, we do not yet have a lot of information about using InstructLab for code, as noted in https://docs.instructlab.ai/taxonomy/skills/skills_guide/#avoid-these-topics), an example snippet of your qna.yaml would appear more like this:

```
context: <short part of the code that is wrong>
question: How is this command definition wrong?
answer: This command definition is wrong because <reason>. A corrected command definition is <code snippet from corrected code>.
```
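
The structural points above (at least 5 pieces of context, at least 3 question-and-answer pairs for each, questions worded as questions, and a length limit) can be checked before running SDG. Here is a minimal sketch in Python, not InstructLab's own validator. It assumes the qna.yaml has already been parsed into a dict with a `seed_examples` list whose entries hold a `context` string and a `questions_and_answers` list. The function name `check_seed_examples` and the `max_words` budget are hypothetical; the real length limit depends on the teacher model, as described in the FAQ linked above.

```python
from typing import Any


def check_seed_examples(
    qna: dict[str, Any],
    min_contexts: int = 5,
    min_pairs: int = 3,
    max_words: int = 500,  # hypothetical budget; the real limit depends on the teacher model
) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: list[str] = []
    examples = qna.get("seed_examples") or []

    # The thread asks for at least 5 pieces of context in a knowledge file.
    if len(examples) < min_contexts:
        problems.append(
            f"found {len(examples)} seed example(s), expected at least {min_contexts}"
        )

    for i, example in enumerate(examples, start=1):
        context = str(example.get("context") or "")
        pairs = example.get("questions_and_answers") or []

        if not context.strip():
            problems.append(f"seed example {i}: context is empty")

        # Each piece of context needs at least 3 question-and-answer pairs.
        if len(pairs) < min_pairs:
            problems.append(
                f"seed example {i}: found {len(pairs)} Q&A pair(s), "
                f"expected at least {min_pairs}"
            )

        words = len(context.split())
        for j, pair in enumerate(pairs, start=1):
            question = str(pair.get("question") or "").strip()
            answer = str(pair.get("answer") or "").strip()
            # Crude heuristic: a question should at least end with a question mark.
            if not question.endswith("?"):
                problems.append(
                    f"seed example {i}, pair {j}: question does not end with '?'"
                )
            if not answer:
                problems.append(f"seed example {i}, pair {j}: answer is empty")
            words += len(question.split()) + len(answer.split())

        # Rough word count of context plus all pairs, as a stand-in for a token count.
        if words > max_words:
            problems.append(
                f"seed example {i}: about {words} words, "
                f"over the {max_words}-word budget"
            )

    return problems
```

To use it, load the file with a YAML parser (for example PyYAML's `yaml.safe_load`) and pass the resulting dict to `check_seed_examples`. The word count is only a rough proxy for tokens, so treat a clean result as a sanity check, not as a guarantee that generation will succeed.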

If you are still having issues after reading through all of that and fixing the issues in your qna.yaml file, please feel free to reach back out.

Thanks!

Cheers,
Laura
---
InstructLab Taxonomy Lead

Prithiviraj N

Feb 11, 2025, 12:26:57 PM
to Laura Santamaria, users, du...@redhat.com
Thank you for the detailed explanation. I'll get back to you soon if I have any queries.
