Hello Javier,
thank you for your message. The warning is not specific to MLLM; it comes
from reading the training corpus. You didn't say what kind of training
data you have, but I assume you are using the simple subject file format:
https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats#simple-subject-file-format
This means that for each training document you have a .txt file with the
text and a .tsv (or .key) file with the subject labels. Those subject
files seem to contain labels like "space telescope", "milky way" and
"silicon", which are mentioned in the warning messages.
I took a look at your Euroscivoc vocabulary. It does contain these
labels, but they are altLabels, not the preferred labels of the
concepts. As the Annif wiki page I mentioned above states, "the labels
must exactly match the preferred labels of concepts in the subject
vocabulary", which is not the case here. Annif doesn't try to resolve
altLabels in this context, as they could be ambiguous.
There are at least two ways to fix this:
1. fix the training corpus by replacing altLabels with prefLabels, e.g.
"space telescope" -> "observational astronomy" and "milky way" ->
"galactic astronomy"
2. use concept URIs instead of (or in addition to) the labels in the
subject files.
I would recommend option 2, because concept URIs are unambiguous and
tend to be stable, whereas labels may change over time, in which case
you would hit the same problem again. If the subject files contain URIs,
Annif will use them and ignore the labels.
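
To illustrate option 2, here is a minimal Python sketch of how the labels
in a subject file could be turned into URI lines. It is not part of Annif.
The vocabulary table, the example.org URIs and the function names are
hypothetical placeholders. "silicon" is left out of the table on purpose,
to show what happens to a label that cannot be resolved. The sketch
assumes that a .tsv subject line has the form <URI>, a tab, then the
label (please check the wiki page above for the exact format), and it
matches labels case-insensitively, which is my own simplification.

```python
# Hypothetical vocabulary excerpt: concept URI -> (prefLabel, [altLabels]).
# The example.org URIs are placeholders, not real EuroSciVoc identifiers.
VOCAB = {
    "http://example.org/concept/1": ("observational astronomy", ["space telescope"]),
    "http://example.org/concept/2": ("galactic astronomy", ["milky way"]),
}


def build_label_index(vocab):
    """Map each lower-cased label (pref or alt) to the set of URIs using it."""
    index = {}
    for uri, (pref, alts) in vocab.items():
        for label in [pref, *alts]:
            index.setdefault(label.lower(), set()).add(uri)
    return index


def labels_to_uri_lines(labels, vocab):
    """Turn subject labels into '<URI>TABprefLabel' lines.

    Returns (lines, unresolved). Labels that are unknown, or that match
    more than one concept, are left for manual review rather than guessed.
    """
    index = build_label_index(vocab)
    lines, unresolved = [], []
    for label in labels:
        uris = index.get(label.strip().lower(), set())
        if len(uris) == 1:
            uri = next(iter(uris))
            lines.append(f"<{uri}>\t{vocab[uri][0]}")
        else:
            unresolved.append(label)
    return lines, unresolved


if __name__ == "__main__":
    lines, unresolved = labels_to_uri_lines(
        ["space telescope", "milky way", "silicon"], VOCAB
    )
    print("\n".join(lines))
    print("needs manual review:", unresolved)
```

In practice the table would be built from the vocabulary file itself
rather than typed by hand. Any label that ends up in the "needs manual
review" list is either missing from the vocabulary or shared by several
concepts, which is exactly the ambiguity that keeps Annif from
interpreting altLabels on its own.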
Hope this helps!
-Osma
> https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/euroscivoc
> In case you can't trust the link, you can also look it up in Google by
> "Euroscivoc", you'll have no problems whatsoever finding it.
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi