Custom corpus preparation

Parthasarathi Mukhopadhyay

Apr 11, 2022, 8:08:03 AM
to Annif Users
I have two questions related to custom corpus preparation:

1. For the short text document corpus, the documentation says:

"The first column contains the text of the document (e.g. title or title + abstract) while the second column contains a whitespace-separated list of subject URIs (again within angle brackets) for that document."

Now say I would like to put the title + abstract of collected papers in the first column. What should the separator between the title and the abstract be: whitespace? In that case, what happens to an entry like <main title>: <subtitle> <abstract>? Here there is whitespace both between the title and the subtitle and between the subtitle and the abstract.

2. If I prepare several short text document corpora (from different databases such as Scopus, Lens, SemanticScholar, Crossref, etc.) in the given format, and then use these corpora to train Annif one after another, will that cause any issues?

Apologies for these novice questions.

Best regards

Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

Osma Suominen

Apr 11, 2022, 10:02:20 AM
to annif...@googlegroups.com
Hi Parthasarathi!

Good questions, I will try to answer inline:

Parthasarathi Mukhopadhyay wrote on 11.4.2022 at 15:07:
> I have two questions related to custom corpus preparation:
>
> 1. For the short text document corpus, the documentation says:
>
> "The first column contains the text of the document (e.g. title or title
> + abstract) while the second column contains a whitespace-separated list
> of subject URIs (again within angle brackets) for that document."
>
> Now say I would like to put the title + abstract of collected papers in
> the first column. What should the separator between the title and the
> abstract be: *whitespace*? In that case, what happens to an entry like
> <main title>: <subtitle> <abstract>? Here there is whitespace both between
> the title and the subtitle and between the subtitle and the abstract.

To Annif and its algorithms/backends, both the title and the abstract are
just text (a sequence of sentences and words), so there is no need to
separate them. You could just use a space as a separator and everything
should work fine.

That said, sometimes you, as the person managing the corpus, may want to
know later on where the title ends and where the abstract starts.
This could be important, for example, if you discover problems in the
data. I've been in this situation a few times and I've found it helpful
to use a special character to separate the title from the abstract
(and sometimes also from the author keywords). I've chosen the international
currency symbol ¤ as the separator, because it's so rarely used in
real-world text. So the corpora look like this:

The Title: A Subtitle ¤ Since the beginning of... <tab> <subjects...>
A Title without an Abstract. ¤ <tab> <subjects...>

This kind of special character gets filtered out by text
preprocessing/tokenizing functions in Annif so it doesn't affect the
result, but makes it easier for me to see where the title ends and the
abstract starts.
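
As a rough illustration of the layout above, here is a minimal Python sketch that builds one corpus line from a title, an abstract and a list of subject URIs. The function name `make_corpus_line` and the example.org URIs are hypothetical and not part of Annif. The sketch also assumes that tabs and line breaks inside the title or abstract should be collapsed to single spaces, so the only tab on the line is the column separator.

```python
SEPARATOR = "\u00a4"  # the currency sign ¤ used as a visual title/abstract marker


def make_corpus_line(title, abstract, subject_uris):
    """Return one TSV line: '<title> ¤ <abstract><TAB><uri1> <uri2> ...'.

    Hypothetical helper for illustration only; not part of Annif.
    """
    def clean(text):
        # Collapse tabs, newlines and repeated spaces into single spaces
        return " ".join(text.split())

    # rstrip() drops the trailing space when the abstract is empty
    text = f"{clean(title)} {SEPARATOR} {clean(abstract)}".rstrip()
    subjects = " ".join(f"<{uri}>" for uri in subject_uris)
    return f"{text}\t{subjects}"


if __name__ == "__main__":
    print(make_corpus_line(
        "The Title: A Subtitle",
        "Since the beginning of...",
        ["http://example.org/subject1", "http://example.org/subject2"],
    ))
```

A record without an abstract can be passed with an empty string, which gives a line ending in "¤" before the tab, as in the second example line above.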

> 2. If I prepare several short text document corpora (from different
> databases such as Scopus, Lens, SemanticScholar, Crossref, etc.) in the
> given format, and then use these corpora to train Annif one after
> another, will that cause any issues?

Most Annif algorithms can only be trained once (the only exception at
the moment is the NN ensemble, which supports the "learn" operation
that continues from where the last training run ended). So if you have
many corpora, you will still have to train on all of them in one go. You
could simply concatenate them into a single file. But it's also possible
to pass more than one file to the annif train command, like this:

annif train my-project corpus1.tsv corpus2.tsv corpus3.tsv

This is equivalent to combining them like this:

cat corpus1.tsv corpus2.tsv corpus3.tsv >big-corpus.tsv
annif train my-project big-corpus.tsv
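
If the files come from several different databases, it may be worth checking them while combining. The following is a minimal Python sketch of that idea, not something Annif provides: `is_valid_line` and `merge_corpora` are hypothetical names, and the check assumes every line should have non-empty text, exactly one tab, and at least one subject URI in angle brackets. Unlike plain cat, it drops lines that fail the check, and it also adds a newline after a file's last line if one is missing, so two records are never glued together.

```python
import re

# One or more whitespace-separated URIs, each inside angle brackets
URI_LIST = re.compile(r"^<[^<>\s]+>(\s+<[^<>\s]+>)*$")


def is_valid_line(line):
    """Check for 'text<TAB><uri> <uri> ...' (an assumed, fairly strict rule)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        return False
    text, subjects = parts
    return bool(text.strip()) and bool(URI_LIST.match(subjects.strip()))


def merge_corpora(paths, out_path):
    """Write valid lines from all input files to out_path.

    Returns a (kept, skipped) pair of line counts.
    """
    kept = skipped = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if is_valid_line(line):
                        # Normalise the line ending so records stay separate
                        out.write(line.rstrip("\n") + "\n")
                        kept += 1
                    else:
                        skipped += 1
    return kept, skipped


if __name__ == "__main__":
    import sys
    # Usage: python merge_corpora.py big-corpus.tsv corpus1.tsv corpus2.tsv ...
    print(merge_corpora(sys.argv[2:], sys.argv[1]))
```

The merged file can then be passed to annif train in the same way as big-corpus.tsv above.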

> Apologies for these novice questions.

These are the best kinds of questions, because when you ask them on a
public forum, then others can also see the answers :)

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Parthasarathi Mukhopadhyay

Apr 11, 2022, 10:32:17 AM
to Annif Users
Dear Osma

Excellent explanations.

Now, I'm all set to start the training process.

Best regards
