Training Portuguese Publications in Annif


Renan Luiz da Silva Nascimento

Oct 24, 2024, 9:26:57 AM
to Annif Users

Good morning,

During our attempt to train Annif for Brazilian publications, we chose to use the FastText backend. I believe that the correct format for the training file would be as follows:

__label__ID Normalized text


In a .txt file, each line represents a publication, and each label ID corresponds to a subject ID. For testing purposes, I referred to the yso-nlf folder and, based on the subjects.csv file, indexed some test publications using the IDs from that file. If the test was successful, I intended to apply the same process to our Bibliodata publications.


After loading the YSO vocabulary, the test file I used was as follows:


__label__2851 Art is the expression of human creativity.

__label__2851 The painting is a form of artistic expression.

__label__331 This text is about Alice and the world of wonder.


According to the subjects.csv file, ID 2851 corresponds to "art" and ID 331 to "sociology." (print3)

However, when attempting to run the training, I received the following error: (print1)

Next, I tried modifying the spacing between the label and the normalized text as follows:


__label__2851 Art is the expression of human creativity.

__label__2851 The painting is a form of artistic expression.

__label__331 This text is about Alice and the world of wonder.


Yet, I encountered another error message, this time indicating that each word from the normalized text was being interpreted as a subject label. (print2)

I would like to know if I am doing something wrong in this process or if there is an alternative recommended way to create an appropriate training file.

Thank you in advance for your attention, and I look forward to your response.


Sincerely,

Renan Luiz

Brazilian Institute of Information in Science and Technology (IBICT)


print 1.png
print 2.png
print3.png

juho.i...@helsinki.fi

Oct 25, 2024, 3:58:46 AM
to Annif Users
Dear Renan,

The .txt file format you described appears to be the input format of the FastText library (see https://fasttext.cc/docs/en/supervised-tutorial.html#getting-and-preparing-the-data ), but it is not a correct format for Annif.

Annif supports several algorithms and external libraries via its backends, and most of these libraries, like FastText, require input data in a specific format. To allow the same document corpora to be used with multiple backends, Annif has its own document corpus formats. Annif automatically converts data in these formats to whatever each backend needs when performing train/suggest/eval operations.

The two document formats of Annif are:

1. Full-text document corpus
A directory with text files that have the file extension .txt and subject files that list the assigned subjects for each text file.

2. Short text document corpus
A single TSV file, where the first column contains the text of the document (e.g. title or title + abstract) while the second column contains a whitespace-separated list of subject URIs for that document.

Please see this Wiki page for details: https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats

I guess the Short text document corpus would be more suitable for your documents.
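
To illustrate the difference between the two formats, here is a minimal Python sketch that converts FastText-style lines into the short text TSV layout described above. The function name `fasttext_line_to_tsv` and the URI prefix `http://example.org/subject/` are hypothetical placeholders, not part of Annif; the real prefix depends on the vocabulary you load. Wrapping each URI in angle brackets is also an assumption here, so please check the Wiki page for the exact convention.

```python
import re

# Matches FastText-style labels such as "__label__2851".
LABEL_RE = re.compile(r"__label__(\S+)")


def fasttext_line_to_tsv(line, uri_prefix):
    """Convert one FastText-style line to 'text<TAB>subject URIs'.

    uri_prefix is a hypothetical placeholder; use the URI namespace
    of your own vocabulary.
    """
    labels = LABEL_RE.findall(line)
    # Remove the labels, then collapse all whitespace (including stray
    # tabs) so that the only tab in the output is the column separator.
    text = " ".join(LABEL_RE.sub(" ", line).split())
    uris = " ".join(f"<{uri_prefix}{label}>" for label in labels)
    return f"{text}\t{uris}"


if __name__ == "__main__":
    example = "__label__2851 Art is the expression of human creativity."
    print(fasttext_line_to_tsv(example, "http://example.org/subject/"))
```

Each output line has the document text in the first column and the subject URIs in the second. A tab anywhere else on the line would shift part of the text into the subject column, which is why the sketch collapses whitespace in the text.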

Also, FastText may not be the easiest backend to start with, because it requires quite a long training time. I suggest starting with the TFIDF backend and trying other backends only once your TFIDF project works at a conceptual level. The Omikuji backend could also be easier to work with than FastText.

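I guess you have found the Annif-tutorial ( https://github.com/NatLibFi/Annif-tutorial ), which tries to provide a comprehensive picture of Annif. If you want further cooperation and tips, maybe you can attend our workshop at SWIB24 on November 25: https://forum.swib.org/t/workshop-introduction-to-annif-automated-indexing-tool/139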

Best regards,
-Juho

Renan Luiz da Silva Nascimento

Dec 13, 2024, 1:04:16 AM
to juho.i...@helsinki.fi, Annif Users

Good morning, I hope you’re doing well.

First of all, thank you for your previous response. Thanks to it, we were able to create a training file using data from our own institution. The initial results are very promising, but two questions arose during the process:

  1. I am using the YSO vocabulary. Is there any difference in terms of performance or usability compared to the STW vocabulary?
  2. While reviewing the files in the YSO folder, I noticed that the yso-finna.tsv.gz file is mentioned as the recommended one for training in the Annif step-by-step guide. However, when attempting to download it to understand its structure and assess its suitability for our goal with publications in Portuguese, I received an error message. I’d like to know if the file contains only English-language publications and if there’s another way to access it.

Thank you in advance for your attention.
Renan Luiz


--
You received this message because you are subscribed to the Google Groups "Annif Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to annif-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/annif-users/55c1d52d-0d50-435f-a180-c3e6f73a4421n%40googlegroups.com.

Osma Suominen

Dec 19, 2024, 8:48:28 AM
to annif...@googlegroups.com
Dear Renan,

Regarding your first question: the vocabulary to use depends on a lot of
factors. Ideally you should use an existing vocabulary that your
institution already uses and for which you have training data, i.e.
documents indexed with that vocabulary. The vocabulary should also have
good coverage of the domains you are interested in and of course be in
an appropriate language. The YSO vocabulary has been created in a
Finnish context while STW is related to economics and is maintained in
Germany. I don't think either of these would be an ideal choice for a
Brazilian institution.

Regarding your second question: if you are referring to the Annif
tutorial files, those can be accessed on GitHub using a web browser, or
by cloning the whole repository using git tools. The file
yso-finna.tsv.gz can be found here:

https://github.com/NatLibFi/Annif-tutorial/blob/main/data-sets/yso-nlf/yso-finna.tsv.gz

There is a small icon near the upper right corner "Download raw file"
which should allow you to download it to your computer.

The file should contain only English language document titles along with
subjects.
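
Once the file is downloaded, its structure can be inspected without unpacking it. The following is a minimal Python sketch, assuming the file follows the short text TSV layout (text in the first column, whitespace-separated subject URIs after the first tab). The function name `peek_corpus` is a hypothetical helper, not part of Annif.

```python
import gzip


def peek_corpus(path, limit=5):
    """Return the first `limit` (text, subject_uris) pairs from a
    gzipped TSV corpus file.

    Assumes one document per line: text, a tab, then subject URIs
    separated by whitespace.
    """
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as corpus:
        for line in corpus:
            if len(rows) >= limit:
                break
            # Split at the first tab only; everything after it is
            # treated as the list of subject URIs.
            text, _, uris = line.rstrip("\n").partition("\t")
            rows.append((text, uris.split()))
    return rows
```

Calling `peek_corpus("yso-finna.tsv.gz")` on the downloaded file should return a handful of titles with their subject URIs, which is enough to check the language and layout.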

Best,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Renan Luiz da Silva Nascimento

Apr 29, 2025, 7:02:01 AM
to Osma Suominen, annif...@googlegroups.com

Good morning, I hope you are doing well.


I would like to sincerely thank you for all the support we have received from you. The guidance you provided was crucial for us to move forward with our exploration of the Annif tool in the context of IBICT.

After completing the exploration phase, we concluded that Annif shows great potential for our needs, and we have started working on developing an API integrated with the tool. During the exploration, we used GitHub Codespaces as our working environment. However, for the creation of the API, I identified the need to install Annif locally on my machine.

Following the instructions available at this link: https://github.com/NatLibFi/Annif-tutorial/blob/main/exercises/01_install_annif.md, I attempted to install Annif using both the "VirtualBox based install" and the "Docker based install" options.

Unfortunately, I encountered issues with both approaches. When trying the VirtualBox installation, I received an error, as shown in Attachment 1.

Then, when attempting the Docker installation, the process failed to locate the file in the specified path, even though the file is correctly placed, as shown in Attachments 2 and 3.

I would like to kindly ask if you could provide any guidance on what might be going wrong, so that I can make the necessary corrections and continue with the project.

Thank you very much for your attention and support.

Sincerely,
Renan Luiz


Attachment 1.png
Attachment 2.png
Attachment 3.png

juho.i...@helsinki.fi

May 2, 2025, 4:59:32 AM
to Annif Users
Hi Renan,

I think the VirtualBox error is the same one discussed in this thread: https://forums.virtualbox.org/viewtopic.php?t=105752 There, the solution given in the first answer was reported to resolve the error. It consists of the steps in this thread: https://forums.virtualbox.org/viewtopic.php?f=25&t=99390 You could try that, but otherwise it is really hard to give any good advice.

About the Docker error: even if the subjects.csv file is present on your host computer, it is not necessarily available in the container. You need to mount the directory from the host into the container with the option "-v ~/Annif-tutorial:/Annif-tutorial". Please make sure you use the command with all the options given here: https://github.com/NatLibFi/Annif-tutorial/blob/main/exercises/01_install_annif.md#13-docker-based-install
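
To make the effect of the mount concrete, here is a small Python sketch of how a "-v host_dir:container_dir" option maps host paths to container paths. It is only an illustration of the idea, not how Docker is implemented. The function name `to_container_path` and the `/home/user` paths are hypothetical.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, host_dir, container_dir):
    """Map a host path to the path seen inside the container.

    Returns None when host_path lies outside the mounted host_dir,
    meaning the file is not visible in the container at all.
    """
    try:
        relative = PurePosixPath(host_path).relative_to(host_dir)
    except ValueError:
        # host_path is not under host_dir, so the mount does not cover it.
        return None
    return str(PurePosixPath(container_dir) / relative)


if __name__ == "__main__":
    # Hypothetical host location of the cloned tutorial repository.
    mounted = "/home/user/Annif-tutorial"
    print(to_container_path(
        "/home/user/Annif-tutorial/data-sets/yso-nlf/subjects.csv",
        mounted, "/Annif-tutorial"))
    print(to_container_path(
        "/home/user/Downloads/subjects.csv", mounted, "/Annif-tutorial"))
```

A file under the mounted directory gets a path below /Annif-tutorial inside the container, while a file anywhere else on the host has no container path, so commands run in the container cannot find it.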

Best regards,
-Juho