to annif...@googlegroups.com
Hi Uldis,
I seem to remember that your vocabulary NLLSH is quite large. With only
67k training records, there probably will not be enough training data to
get good results on this type of model. You would need a lot more than
that; a good rule of thumb is that you need at least 10 times as many
training examples as there are subjects/concepts in your vocabulary. For
example, we train our YSO models on over 1M short text records.
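As a rough illustration of that rule of thumb, here is a minimal Python sketch. The function names and the vocabulary size of 50,000 concepts are made up for the example (this is not the actual size of NLLSH, and none of this is part of Annif); only the factor of 10 and the 67k record count come from this thread.

```python
# Sketch of the "10 training examples per concept" rule of thumb.
# All names here are hypothetical helpers, not part of Annif.

def min_training_records(num_concepts: int, examples_per_concept: int = 10) -> int:
    """Smallest number of training records the rule of thumb asks for."""
    return num_concepts * examples_per_concept


def max_concepts_supported(num_records: int, examples_per_concept: int = 10) -> int:
    """Largest vocabulary size the rule of thumb supports for a given corpus."""
    return num_records // examples_per_concept


def has_enough_training_data(num_records: int, num_concepts: int,
                             examples_per_concept: int = 10) -> bool:
    """True if the corpus meets the rule of thumb for this vocabulary size."""
    return num_records >= min_training_records(num_concepts, examples_per_concept)


if __name__ == "__main__":
    num_records = 67_000        # training records mentioned in this thread
    assumed_concepts = 50_000   # ASSUMPTION: placeholder vocabulary size

    print("records needed:", min_training_records(assumed_concepts))
    print("concepts supported by 67k records:", max_concepts_supported(num_records))
    print("enough data:", has_enough_training_data(num_records, assumed_concepts))
```

Replace the placeholder with the real number of concepts in your vocabulary to see how far off you are. The check is only a heuristic; it says nothing about how evenly the records are spread across the concepts.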
If you can't find enough training data in your own databases, one option
is to look at synthetic data generation. We did that in the
LLMs4Subjects Shared Task. See this paper for some details:
https://arxiv.org/abs/2508.15877