MLLM subject label issues


Javier de Torres

Jan 31, 2023, 6:56:20 AM
to Annif Users
Dear all,
I'm using Annif's MLLM backend for a subject indexing task and I've run into an issue when training MLLM. Below is an excerpt of the training log:
(annif-venv) user@server:~/Annif-tutorial$ annif train urosci-mllm-en --docs-limit 100 ./../mllm_data/train/
Backend mllm: starting train
warning: Unknown subject label "space telescope"@en
warning: Unknown subject label "milky way"@en
warning: Unknown subject label "silicon"@en
Backend mllm: preparing training data
  warnings.warn(
warning: Unknown subject label "space telescope"@en
warning: Unknown subject label "milky way"@en
(...)
warning: Unknown subject label "hominid population"@en
Backend mllm: training model
Backend mllm: saving model

My training corpus is exactly as specified. I believe the vocabulary file is the source of the issue. I attach both .rdf and .ttl versions. Most words in the warning list are there, but Annif won't spot them in training. Could you please tell me how I can fix this? How can I get Annif to recognize my subject labels?

Kind regards, looking forward to your reply,

Javier

P.S.: I can't attach the .rdf and .ttl files, so here is the link where you can download them; they're public. It's from the European Commission website, so you know it's 100% safe.
https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/euroscivoc
In case you can't trust the link, you can also look it up on Google by searching for "Euroscivoc"; you'll have no problem finding it.

Osma Suominen

Jan 31, 2023, 8:15:04 AM
to annif...@googlegroups.com
Hello Javier,

thank you for your message. The warning is not specific to MLLM, but is
related to reading of the training corpus. You didn't state what kind of
training data you have, but I believe that you are probably using the
simple subject file format:
https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats#simple-subject-file-format

This means that for each training document you have a .txt file with the
text and a .tsv file (or .key) with the subject labels. That file seems
to contain labels like "space telescope", "milky way" and "silicon"
which are mentioned in the warning messages.

I took a look at your Euroscivoc vocabulary. It does contain these
labels, but they are altLabels, not the preferred labels of the
concepts. As the Annif wiki page I mentioned above states, "the labels
must exactly match the preferred labels of concepts in the subject
vocabulary" - but this is not the case for you, as these are altLabels.
Annif doesn't try to interpret altLabels in this context, as they could
be ambiguous.
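
To see which labels in a corpus fall into this trap, the check described above can be sketched in a few lines of Python. This is only an illustration, not Annif's own code: the function name is hypothetical, the prefLabel/altLabel pairs are the two mentioned in this thread, and "dark matter" is a made-up label standing in for one that is missing from the vocabulary entirely. In practice the two lookup tables would be filled from the SKOS file.

```python
def classify_labels(labels, pref_labels, alt_labels):
    """Sort subject labels into preferred, alternative, and unknown.

    pref_labels: set of preferred labels in the vocabulary
    alt_labels:  dict mapping an altLabel to the prefLabel of its concept
    """
    report = {"preferred": [], "alternative": [], "unknown": []}
    for label in labels:
        if label in pref_labels:
            report["preferred"].append(label)
        elif label in alt_labels:
            # Keep the suggested replacement next to the offending label.
            report["alternative"].append((label, alt_labels[label]))
        else:
            report["unknown"].append(label)
    return report


# Lookup tables built by hand from the examples in this thread.
pref_labels = {"observational astronomy", "galactic astronomy"}
alt_labels = {
    "space telescope": "observational astronomy",
    "milky way": "galactic astronomy",
}

# "dark matter" is a hypothetical label used only to show the unknown case.
report = classify_labels(
    ["space telescope", "galactic astronomy", "dark matter"],
    pref_labels,
    alt_labels,
)
for kind, items in report.items():
    print(kind, items)
```

Labels listed under "alternative" are the ones that trigger the "Unknown subject label" warning even though they appear in the vocabulary file.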

There are at least two ways to fix this:

1. fix the training corpus by replacing altLabels with prefLabels, e.g.
"space telescope" -> "observational astronomy" and "milky way" ->
"galactic astronomy"
2. use concept URIs instead of (or in addition to) the labels in the
subject files.

I would recommend option 2, because concept URIs are unambiguous and
tend to be stable, whereas labels may change over time and then you will
again hit the same problem. If there are URIs in subject files, Annif
will use them and ignore the labels.
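
Option 2 can be sketched as a small conversion step. This is a rough illustration under stated assumptions, not an official tool: the function name and the example.org URIs are hypothetical placeholders (not real EuroSciVoc URIs), "mercury" is an invented example of an ambiguous label, and the output line layout (URI in angle brackets, a tab, then the label) is assumed from the corpus format wiki page linked above, so please verify it there before relying on it.

```python
def convert_subjects(labels, label_to_uris):
    """Turn plain labels into '<URI>\\tlabel' lines for a subject file.

    label_to_uris maps a label (pref or alt) to the list of concept URIs
    that carry it. Labels with zero or several candidate URIs are not
    guessed at; they are returned separately for manual review.
    """
    lines, unresolved = [], []
    for label in labels:
        uris = label_to_uris.get(label, [])
        if len(uris) == 1:
            lines.append(f"<{uris[0]}>\t{label}")
        else:
            unresolved.append(label)
    return lines, unresolved


# Hypothetical lookup table; the URIs are placeholders.
label_to_uris = {
    "observational astronomy": ["http://example.org/concept/1"],
    # Invented ambiguous label shared by two concepts.
    "mercury": ["http://example.org/concept/2", "http://example.org/concept/3"],
}

lines, unresolved = convert_subjects(
    ["observational astronomy", "mercury", "space telescope"],
    label_to_uris,
)
print("\n".join(lines))
print("needs manual review:", unresolved)
```

Refusing to pick a URI for an ambiguous or missing label mirrors the reason given above for why altLabels are not interpreted automatically.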

Hope this helps!

-Osma
> https://op.europa.eu/en/web/eu-vocabularies/dataset/-/resource?uri=http://publications.europa.eu/resource/dataset/euroscivoc

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Javier de Torres

Feb 1, 2023, 4:39:32 AM
to Annif Users
Hello Osma,
Thank you, this is the answer I needed! I will implement one of your solutions (the second one, probably) and I'll make an update post.
Kind regards,
Javier

Javier de Torres

Feb 9, 2023, 9:45:25 AM
to Annif Users
Hello Osma,
The problem was resolved by switching to a .tsv vocabulary and using the extended subject files. The training was successful and I could build the model. Again, thanks a lot for the help!
Best,
Javier
