XML support and asking for best practices

Robina M

Jul 11, 2025, 11:59:11 AM

to Annif Users

Dear Annif users and developers,

I would like to know two things.

First: Since our documents are already parsed into XML format in another project, I was wondering whether Annif can work with XML input as well?

Second: We want to achieve the best possible results by experimenting 1. with the size and versions of our training corpus and 2. with different parameters in our project configuration (e.g. a different language analyzer or a different set of subjects). Do you have best practices or recommendations on how to build the Annif setup? For example, would it be better to define a separate projects.cfg file for every changed parameter, or to keep everything in one file? Or would it be better to run each experiment with one changed parameter in its own git branch instead of doing everything on one main branch?

I apologize if the answers to these questions seem very obvious to you. My programming skills are still very basic and this is the first time I am doing this kind of project, so I hope you understand.

Regards,

Robina

juho.i...@helsinki.fi

Jul 16, 2025, 6:29:38 AM

to Annif Users

Dear Robina,

1.) Using a corpus in XML format is not possible; you need to convert the documents to a TSV file (short-text corpus) or to TXT and TSV files in a directory (full-text corpus): see this Annif wiki page.
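
As an illustration of such a conversion, here is a minimal Python sketch that turns one XML document into one line of a short-text TSV corpus. The element names (`<text>`, `<subject uri="...">`) are hypothetical and will differ in your XML; the output layout (text, a tab, then subject URIs in angle brackets separated by spaces) is my understanding of the short-text format, so please check it against the wiki page before relying on it.

```python
import xml.etree.ElementTree as ET


def xml_to_tsv_line(xml_text: str) -> str:
    """Convert one XML document into one line of a short-text TSV corpus."""
    root = ET.fromstring(xml_text)
    # Assumption: the document text lives in a <text> element (hypothetical name).
    text = root.findtext("text", default="")
    # Collapse tabs and newlines so the text fits into a single TSV column.
    text = " ".join(text.split())
    # Assumption: subjects are <subject uri="..."> elements (hypothetical name).
    uris = [s.get("uri") for s in root.iter("subject") if s.get("uri")]
    subjects = " ".join(f"<{uri}>" for uri in uris)
    return f"{text}\t{subjects}"
```

Writing the returned lines of all documents into one `.tsv` file, one line per document, should give a corpus file that can be passed to the training and evaluation commands.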

2.) How to organize experimentation is a good question. It can have a very big impact on how easy it is to run and track experiments, but I don't have a good answer, apart from "try things out and do what best suits your workflow".

I would probably set up individual project configurations for things like different analyzers (for which there is a limited number of choices), but not for corpus sizes (which can take arbitrarily many values). Separate project configurations also let you run experiments in parallel more confidently. The configuration is read when an operation starts, so in principle you could change parameters once one operation is running and then start another one, but I don't think that is a good practice.
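
One way to get one project per analyzer without copying sections by hand is to generate them. The following is only a sketch: the base settings (`tfidf` backend, vocabulary id `my-vocab`, `limit`) and the project id scheme are placeholders I made up, and the analyzer specifications should be checked against the analyzers available in your Annif installation.

```python
import configparser
import io

# Hypothetical base settings shared by all generated projects.
BASE = {"language": "en", "backend": "tfidf", "vocab": "my-vocab", "limit": "100"}
# Label used in the project id -> analyzer specification (assumed values).
ANALYZERS = {"simple": "simple", "snowball": "snowball(english)"}


def build_projects_cfg(base: dict, analyzers: dict) -> str:
    """Return INI text with one project section per analyzer."""
    cfg = configparser.ConfigParser()
    for label, analyzer in analyzers.items():
        project_id = f"{base['backend']}-{label}"
        cfg[project_id] = {
            "name": f"{base['backend']} with {label} analyzer",
            **base,
            "analyzer": analyzer,
        }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(build_projects_cfg(BASE, ANALYZERS))
```

The printed text can be saved as a projects.cfg file, so that each analyzer variant has its own project id and its own trained model.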

Note that Annif commands have the --backend-param/-b option to override (most) values set up in a configuration file, which can be helpful.
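
For example, a small Python helper can build the command line for each parameter value you want to try. This is a sketch under assumptions: the project id, the corpus path and the `tfidf.limit` parameter are placeholders, and the `<backend>.<parameter>=<value>` syntax for the option is how I remember it, so please verify it with `annif eval --help`.

```python
def eval_command(project_id: str, corpus_path: str, overrides: dict) -> list:
    """Build an 'annif eval' argument list with backend parameter overrides."""
    cmd = ["annif", "eval"]
    for key, value in overrides.items():
        # Assumed syntax: --backend-param <backend>.<parameter>=<value>
        cmd += ["--backend-param", f"{key}={value}"]
    return cmd + [project_id, corpus_path]


if __name__ == "__main__":
    # Hypothetical project id, corpus path and parameter values.
    for limit in (50, 100, 200):
        print(" ".join(eval_command("tfidf-en", "corpus/test.tsv", {"tfidf.limit": limit})))
```

Each returned list can be passed to `subprocess.run` to actually run the evaluation; printing the commands first makes it easy to check them before starting long runs.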

For keeping track of evaluation scores etc. we use online spreadsheets.

However, if you are setting up a larger set of projects that will be used for a longer time, I recommend taking a look at DVC pipelines. While this adds complexity to the initial setup, in the long run it helps in many ways.