Practical length of subject vocabulary labels

MJ Suhonos

Dec 18, 2024, 10:30:06 AM

to Annif Users

Hi all,

I'm thinking about the subject vocabulary formats as documented in the wiki (https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats), and wondering if there is a practical limit on the length of concept labels; i.e. are these truncated at all, and if so, is there a default or practical limit on how much text they can contain?

My use case: I have a monolingual vocabulary with URIs, each corresponding to a label -- however, the labels are gathered from much longer text, and many are well over 1000 characters. These longer labels do contain valuable n-gram information, e.g. in the common case where a concept is described by one or more specific phrases.

These labels often include repeated words and phrases, so it's possible to reduce them using e.g. a bag-of-words approach to a much shorter list of terms, typically around 300 characters. This is still much longer than the labels in all of the examples I've seen, which are typically < 50 characters.
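
As a rough illustration of the bag-of-words reduction described above, here is a minimal sketch in Python. It is not the exact processing used on this data: the function name `bag_of_words_label`, the tokenization by `\w+`, the lowercasing, and the sample string are all assumptions made for the example.

```python
import re


def bag_of_words_label(label: str) -> str:
    """Collapse a long label into its unique lowercased words, in first-seen order."""
    seen = set()
    unique = []
    for word in re.findall(r"\w+", label.lower()):
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return " ".join(unique)


# Hypothetical long label containing repeated words.
long_label = "Andrei Tarkovsky Andrej Tarkovskij Andrei Tarkovski andrei tarkovsky"
print(bag_of_words_label(long_label))
# prints: andrei tarkovsky andrej tarkovskij tarkovski
```

Note that deduplicating words this way shortens the label but also discards which words originally occurred together as a phrase.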

Unfortunately the source data is flat text; i.e. not SKOS or another structured format that contains preferred/alternate labels. I know some of the backends can recognize alternate labels, so if that's a more desirable approach then I can try to generate new data, but I'd prefer to avoid that work if possible.

Does anyone have any experience with using or processing long labels for use in Annif?

Thanks in advance,

MJ

Osma Suominen

Dec 19, 2024, 8:53:59 AM

to annif...@googlegroups.com

Hi MJ,

I'm not sure I understood the reason why you would like to have such
long labels in your vocabulary. But it is up to you what kind of labels
to use.

Most of the Annif backends don't care at all what the label says; to all
the associative backends (omikuji, fasttext, svc, tfidf...) the
concepts/subjects are just abstract categories that the algorithm learns
to recognize based on the training data. The labels make no difference
to the result; they are only used when returning the results through the
CLI or REST API.

The lexical backends, on the other hand, do look at the labels. The MLLM
backend tries to find matches between the labels (terms) in the
vocabulary and the text. In practice, it will look for sentences that
contain ALL the words/tokens for particular concepts. So if you have
very long labels, it is highly unlikely that MLLM will find any matches
at all! The situation for STWFSA and YAKE is pretty similar IIRC.
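
To make the point about long labels concrete, here is a toy sketch in Python of "a label matches only if ALL of its tokens occur in the sentence". This is not MLLM's actual implementation (which is not shown in this thread); the helper names `tokens` and `matches`, the simple regex tokenization, and the sample strings are assumptions for illustration only.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def matches(label: str, sentence: str) -> bool:
    """True only if every token of the label also occurs in the sentence."""
    return tokens(label) <= tokens(sentence)


sentence = "The film was directed by Andrei Tarkovsky."
short_label = "Andrei Tarkovsky"
# Hypothetical long label that concatenates several spelling variants.
long_label = "Andrei Tarkovsky Andrej Tarkovskij Andrei Tarkovski"

print(matches(short_label, sentence))  # True
print(matches(long_label, sentence))   # False: "andrej" etc. are not in the sentence
```

Under this rule, every extra word packed into a label is one more word that has to appear in the same sentence, so the longer the label, the less likely a match.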

Hope this helps,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

MJ Suhonos

Dec 19, 2024, 10:05:25 AM

to Annif Users

Hi Osma,

Thanks very much, this is quite helpful. To be more clear, the dataset I'm exploring is Wikidata5m: https://deepgraphlearning.github.io/project/wikidata5m

In particular, the entity aliases data contains a list of wikidata QIDs with a TSV list of labels, similar to the multilingual CSV format supported by Annif; unfortunately, there are no language labels so it's just a long string of text.

One example that I'm considering is spelling variants for Q853 (Andrei Tarkovsky, https://www.wikidata.org/wiki/Q853), which would be potentially valuable for lexical matching, but of course there are a lot of variants in the English transliteration -- 46 labels for this QID alone, each averaging about 30 characters (over 1300 characters in total). If I use the SKOS labels from a separate Wikidata RDF dump and parse them into a bag-of-words, I get 16 labels averaging 10 characters each, but 160 characters is still pretty long! If I only use the prefLabel "Andrei Tarkovsky", then of course I lose the altLabel variants.

MLLM is the backend I was thinking of, so thank you for clarifying its behaviour in matching. It sounds like it wouldn't match variants anyway, given how they would be mixed together in one long label. My intent is to use x-transformer (PECOS) on this dataset, which, as far as I understand, clusters labels in order to increase the cardinality of matches when training. I'll likely use the BOW approach to provide as much information as possible, but it's really useful to know that if I want to do an apples-to-apples comparison between e.g. x-transformer and MLLM, I'll have to use the prefLabel values.

Thanks again,

MJ

Osma Suominen

Dec 20, 2024, 2:06:58 AM

to annif...@googlegroups.com

Hi MJ,

Thanks for the clarification!

Could you do something like this:

wd:Q853 a skos:Concept ;
    skos:prefLabel "Andrei Tarkovsky"@en ;
    skos:altLabel "Andrej Tarkovskij"@en, "Andrei Tarkovski"@en,
        "Andrej Tarkovszkij"@en ...

i.e. pick one as prefLabel and make every other label an altLabel.

MLLM looks at prefLabels and all altLabels separately when matching, so
in this case, it should find any variant. Also hiddenLabels can be used,
though you will have to tell MLLM to consider them as well because it's
off by default.

You will need to use language tags for the labels (English if your text
is English) because MLLM will ignore labels in other languages than the
one it's configured to use.
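
For a flat alias file, the conversion suggested above could be sketched roughly as follows in Python. This is only a sketch under stated assumptions: each input line is taken to be a QID followed by tab-separated labels, the first label is arbitrarily chosen as the prefLabel, all labels are tagged as English, and the names `turtle_literal`, `alias_line_to_turtle` and `PREFIXES` are hypothetical helpers, not part of Annif.

```python
# Prefix declarations for the generated Turtle.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix wd: <http://www.wikidata.org/entity/> .\n"
)


def turtle_literal(text: str, lang: str = "en") -> str:
    """Quote a label as a language-tagged Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def alias_line_to_turtle(line: str, lang: str = "en") -> str:
    """Turn 'QID<TAB>label<TAB>label...' into one skos:Concept in Turtle."""
    qid, *labels = line.rstrip("\n").split("\t")
    # Drop empty entries and exact duplicates while keeping the original order.
    labels = list(dict.fromkeys(lab.strip() for lab in labels if lab.strip()))
    if not labels:
        raise ValueError(f"no labels found for {qid}")
    pref, alts = labels[0], labels[1:]  # assumption: first label is preferred
    parts = [
        f"wd:{qid} a skos:Concept",
        f"    skos:prefLabel {turtle_literal(pref, lang)}",
    ]
    if alts:
        alt_literals = ", ".join(turtle_literal(a, lang) for a in alts)
        parts.append(f"    skos:altLabel {alt_literals}")
    return " ;\n".join(parts) + " .\n"


line = "Q853\tAndrei Tarkovsky\tAndrej Tarkovskij\tAndrei Tarkovski"
print(PREFIXES)
print(alias_line_to_turtle(line))
```

Picking the first alias as the prefLabel is arbitrary; if a real preferred label is available from another source, it would be better to use that instead.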

-Osma

MJ Suhonos

Dec 20, 2024, 10:05:17 AM

to Annif Users

Hi Osma,

This is a great solution. In this case, since I have the Wikidata RDF dump, I guess I can just try loading a filtered SKOS subset as the vocabulary, and Annif will magically do the rest. Hadn't even thought of that. :)

Ultimately, this particular example is a transliteration/stemming issue, which is its own discussion…