Hi MJ,
I'm not sure I understand why you would want such long labels in your
vocabulary, but it is up to you what kind of labels to use.
Most of the Annif backends don't care at all what the label says; to all
the associative backends (omikuji, fasttext, svc, tfidf...) the
concepts/subjects are just abstract categories that the algorithm learns
to recognize based on the training data. The labels make no difference
to the result; they are only used when returning the results through the
CLI or REST API.
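To make this concrete, here is a toy sketch in Python. It is not Annif code: the word-count "classifier", the example.org URIs and both label dictionaries are made up for illustration. The only point is that training and prediction work on subject identifiers, and the label is looked up at output time.

```python
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, subject_id). Returns per-subject word counts."""
    model = defaultdict(Counter)
    for text, subject_id in docs:
        model[subject_id].update(text.lower().split())
    return model

def predict(model, text):
    """Return the subject ID whose training words overlap most with the text."""
    words = text.lower().split()
    return max(model, key=lambda sid: sum(model[sid][w] for w in words))

# Hypothetical training data: the model only ever sees the subject IDs.
training = [
    ("cats and dogs are pets", "http://example.org/s1"),
    ("stocks and bonds are investments", "http://example.org/s2"),
]
model = train(training)

# Two hypothetical label sets for the same subjects, one short and one long.
short_labels = {
    "http://example.org/s1": "pets",
    "http://example.org/s2": "finance",
}
long_labels = {
    "http://example.org/s1": "domesticated animals kept for companionship",
    "http://example.org/s2": "financial instruments and investment products",
}

subject = predict(model, "my dogs chase cats")
print(short_labels[subject])  # label is only looked up when presenting the result
print(long_labels[subject])   # same prediction, different display label
```

Swapping one label dictionary for the other changes what is printed, but not which subject is predicted.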
The lexical backends, on the other hand, do look at the labels. The MLLM
backend tries to find matches between the labels (terms) in the
vocabulary and the text. In practice, it looks for sentences that
contain ALL the words/tokens of a concept's label. So if you have
very long labels, it is highly unlikely that MLLM will find any matches
at all! The situation for STWFSA and YAKE is pretty similar IIRC.
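A rough sketch of the "all tokens in one sentence" idea, in Python. This is only an illustration of the principle, not MLLM's actual implementation: the tokenizer, the sentence splitting and the example labels are my own simplifications, and the real backend does considerably more than this.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (toy tokenizer, no stemming)."""
    return set(re.findall(r"\w+", text.lower()))

def matching_labels(text, labels):
    """Return the labels whose tokens ALL occur within a single sentence."""
    sentences = [tokenize(s) for s in re.split(r"[.!?]+", text) if s.strip()]
    hits = []
    for label in labels:
        label_tokens = tokenize(label)
        # A label matches only if every one of its tokens is in the sentence.
        if label_tokens and any(label_tokens <= sent for sent in sentences):
            hits.append(label)
    return hits

text = "Libraries use machine learning for subject indexing. It saves time."
labels = [
    "machine learning",
    "subject indexing",
    "machine learning methods for automated subject indexing of library collections",
]
print(matching_labels(text, labels))
# → ['machine learning', 'subject indexing']
```

The long label fails even though the text is clearly about its topic, because words like "methods" and "collections" never appear in the sentence. Every extra word in a label is one more token that has to be present.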
Hope this helps,
Osma
On 18/12/2024 17:30, MJ Suhonos wrote:
> Hi all,
>
> I'm thinking about the subject vocabulary formats as documented in the
> wiki <https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats>, and
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi