Different corpus formats and backends

Gabriel Kulevicius

May 17, 2022, 12:28:41 PM
to Annif Users
Hi team, this is my first post, and I want to start by congratulating the Annif team on their effort and the resulting product.

I've done the tutorial exercises and then ran a few tests in Spanish with good results, following the same steps as the tutorial but with my own corpus and vocabulary.
In particular, I was able to get a 2000-document sample corpus with associated terms in the "full text document" corpus format, but I could not get a 1M-document sample for training like the one you use in exercise 2 for TF-IDF training.

I have some conceptual questions about this.

Is there any relation, advantage or disadvantage in using one corpus format (full text or short text) with a specific backend?
In particular, in the tutorial exercises you use the short text format for TF-IDF and the full text format for the others (MLLM, Omikuji).

As far as I can see, a vocabulary is always mandatory for training in all backends, and terms in the training data MUST exist in our vocabulary in all backends (I got some warnings when they did not). Is this true?

thanks and I'll be sharing my experience here.
best

Osma Suominen

May 18, 2022, 3:31:23 AM
to annif...@googlegroups.com
Hi Gabriel,

Glad you have found Annif and already obtained good results. I'll try to
respond to your questions below:

Gabriel Kulevicius wrote on 17.5.2022 at 19:28:
> Hi team, this is my first post, and I want to start by congratulating
> the Annif team on their effort and the resulting product.
>
> I've done the tutorial exercises and then ran a few tests in Spanish
> with good results, following the same steps as the tutorial but with
> my own corpus and vocabulary.
> In particular, I was able to get a 2000-document sample corpus
> with associated terms in the "full text document" corpus format, but I
> could not get a 1M-document sample for training like the one you use in
> exercise 2 for TF-IDF training.

2000 documents is already good for many tasks, especially for training
lexical models like MLLM and STWFSA. You didn't mention what kind of
vocabulary you are using. The amount of training data you need for the
associative models (tfidf, omikuji, fasttext, svc...) depends a lot on
the size of the vocabulary. A good rule of thumb is to aim for at least
ten times more documents than you have concepts/classes in the
vocabulary. But of course you use what you have - often it's not
possible to find large amounts of training documents. Even with just
2000 documents it's worth testing what kind of results e.g. Omikuji will
produce, it's probably still useful (especially if your vocabulary is
small) although not ideal. Note that associative models are only able to
suggest subjects they have "seen" at least once in the training data, so
if your training corpus is small, it likely doesn't cover the vocabulary
very well.
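
As a rough illustration of the rule of thumb and the coverage point above,
here is a minimal Python sketch. It is not part of Annif: the function name
corpus_coverage and the example.org URIs are made up for this example, and
it assumes you have already loaded the vocabulary URIs and the per-document
subject URIs into plain Python lists.

```python
def corpus_coverage(vocab_uris, doc_subjects):
    """Return (documents per concept, fraction of the vocabulary seen).

    vocab_uris   -- iterable of concept URIs in the vocabulary
    doc_subjects -- list with one list of subject URIs per training document
    """
    vocab = set(vocab_uris)
    seen = set()
    for subjects in doc_subjects:
        # Only count subjects that actually exist in the vocabulary.
        seen.update(s for s in subjects if s in vocab)
    return len(doc_subjects) / len(vocab), len(seen) / len(vocab)


# Hypothetical data: 500 concepts, 2000 documents that between them
# only ever use the first 100 concepts.
vocab = [f"http://example.org/c{i}" for i in range(500)]
docs = [[vocab[i % 100]] for i in range(2000)]

ratio, coverage = corpus_coverage(vocab, docs)
print(f"{ratio:.1f} documents per concept, {coverage:.0%} of vocabulary seen")
```

In this made-up case the ratio is well below ten documents per concept and
most of the vocabulary never appears in the training data, so an
associative model could not suggest those unseen concepts.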

> I have some conceptual questions about this.
>
> Is there any relation, advantage or disadvantage in using one corpus
> format (full text or short text) with a specific backend?
> In particular, in the tutorial exercises you use the short text format
> for TF-IDF and the full text format for the others (MLLM, Omikuji).

There is no conceptual difference between the corpus formats. Internally
they are handled the same way in Annif. It's just sometimes more
convenient to produce one or the other. For example if you have a
million metadata records, it probably isn't very practical to create a
fulltext corpus as it would have 2 million very small files, so the
short text format (single file) is more convenient. But when you have a
thousand PDF files, it's easy to convert each PDF into TXT using e.g.
pdftotext and then add the subjects in separate TSV files to create a
full text corpus.
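
Since the two formats carry the same information, converting between them
is mostly file shuffling. The Python sketch below is only an illustration
under stated assumptions: it assumes the short text file has the text in
the first tab-separated column and whitespace-separated subject URIs in
angle brackets in the second, and it writes one .txt plus one .tsv subject
file per document, containing only the URI column (no labels). The function
name short_to_fulltext and the docNNNNNN file naming are made up here, so
check the Annif corpus format documentation before relying on it.

```python
from pathlib import Path


def short_to_fulltext(tsv_path, out_dir):
    """Split a short text corpus file into per-document .txt/.tsv pairs.

    Returns the number of documents written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(tsv_path, encoding="utf-8") as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip empty lines
            # Assumed layout: text <TAB> <uri1> <uri2> ...
            text, _, subjects = line.partition("\t")
            count += 1
            stem = f"doc{count:06d}"  # hypothetical naming scheme
            (out / f"{stem}.txt").write_text(text + "\n", encoding="utf-8")
            # One subject URI per line; labels are omitted in this sketch.
            (out / f"{stem}.tsv").write_text(
                "".join(f"{uri}\n" for uri in subjects.split()),
                encoding="utf-8",
            )
    return count


# Example call (paths are placeholders):
# short_to_fulltext("records.tsv", "fulltext-corpus")
```

Note that this produces two files per record, which is exactly why the
single-file format is the more practical choice for very large sets of
metadata records.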

> As far as I can see, a vocabulary is always mandatory for training in
> all backends, and terms in the training data MUST exist in our vocabulary
> in all backends (I got some warnings when they did not). Is this true?

A vocabulary is always needed. Annif only suggests subjects from a
controlled vocabulary (the YAKE backend in theory could also suggest
out-of-vocabulary terms and keyphrases but this hasn't yet been
implemented).

Terms in training data SHOULD exist in the vocabulary, otherwise you
will get warnings as you saw. These warnings indicate that the
term/subject was ignored by Annif because it couldn't be found in the
vocabulary.
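
If you want to find such terms before training, a check along these lines
may help. This is a standalone Python sketch, not an Annif feature: the
function name unknown_subjects and the example.org URIs are invented for
the example, and it assumes the vocabulary URIs and the subject URIs of
each training document are already available as Python collections.

```python
from collections import Counter


def unknown_subjects(vocab_uris, doc_subjects):
    """Count training subjects that are missing from the vocabulary.

    Returns a Counter mapping each unknown URI to the number of
    times it occurs in the training data.
    """
    vocab = set(vocab_uris)
    return Counter(
        uri
        for subjects in doc_subjects
        for uri in subjects
        if uri not in vocab
    )


# Hypothetical data: x9 is used in two documents but is not in the vocabulary.
vocab = ["http://example.org/c1", "http://example.org/c2"]
docs = [
    ["http://example.org/c1", "http://example.org/x9"],
    ["http://example.org/x9"],
    ["http://example.org/c2"],
]

missing = unknown_subjects(vocab, docs)
for uri, n in missing.most_common():
    print(f"{uri} used {n} times but not in vocabulary")
```

Subjects listed by a check like this are the ones that would be ignored
during training, so they are worth either adding to the vocabulary or
correcting in the training data.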

> thanks and I'll be sharing my experience here.

Great! Looking forward to hearing more about your work.

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Gabriel Kulevicius

May 18, 2022, 9:03:54 AM
to Annif Users
Thanks! I will continue testing with these new considerations in mind.
best
