Threshold/confidence matrix?


christelann...@gmail.com

Nov 17, 2023, 3:12:27 AM
to Annif Users

Hi there!

I am currently writing a research proposal that has to do with automatic metadata generation (with Annif), and I have a question.

I was wondering about applying Annif once one has created a model and runs it on previously unseen texts: is it possible to build in a threshold or confidence measure to know how certain Annif is about applying a particular keyword to a text? I mean, if there is very little training data on a certain topic in the training set, it could be challenging to recognize that topic in new texts. Is this something you have considered, or is it perhaps already there in Annif and I overlooked it? In other words, would Annif be able to “throw texts in a bin” when it is uncertain about the topics, so that these can be evaluated?

Best,
Annemieke

Anna Kasprzik

Nov 17, 2023, 3:19:17 AM
to christelann...@gmail.com, Annif Users
Hello Annemieke,

Not sure if this solves your problem, but we at ZBW have developed a
model, Qualle, that estimates overall quality post hoc, and we use it
after having applied subject indexing models via Annif. We have been
using it productively for over a year now.

Best
Anna


Osma Suominen

Nov 17, 2023, 3:54:04 AM
to annif...@googlegroups.com
Hello Annemieke!

As you probably know, Annif suggestions always come with a score value
between 0 and 1. But the interpretation of that value varies by
algorithm and usually it doesn't have any specific meaning except that
higher values represent more confident suggestions.

You can apply a threshold when using the suggest command (or API
method), but this simply drops the suggestions below that threshold.
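
For example, on the command line this could look roughly like the
following (the project ID and input file are placeholders; the same
limit and threshold parameters are also available on the REST API
method):

   # suggest subjects for one document, keeping at most 5 suggestions
   # and dropping any that score below 0.3
   annif suggest --limit 5 --threshold 0.3 my-project < document.txt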

If you want a better picture of the relationship between suggestion
scores and the actual likelihood of a topic being correct, you need
some sort of gold-standard evaluation set. That means a document corpus
with verified subjects, containing documents from the subject area you
are interested in (not necessarily the same documents as in the
training set). Then you can do one or more of the following (rough
command sketches follow the list):

1. Run "annif eval" on the evaluation corpus: this will give you e.g. F1
score and nDCG (by default for 10 suggestions per document), which give
a picture of the overall quality of results

2. Run "annif optimize" on the evaluation corpus: this will give you
suggested limit and threshold values that maximize the F1 score

3. Configure a PAV ensemble project around the project(s) you are using
and then train that project with the gold standard set. For this to
work, you need a sufficiently large set - thousands of documents. The
PAV ensemble will internally create, for each concept/subject, an
isotonic regression model that estimates the relationship between the
score supplied by the algorithm and the likelihood that the suggestion
is correct. After training, you can use the PAV ensemble project to
suggest subjects, and the scores it gives are likelihoods (e.g. 0.5
means an estimated 50% chance of being correct). This only works for
subjects that were common enough in the training data for it to form a
regression model; by default, at least 10 documents in the training set
must have that subject (you can adjust this with the min-docs setting).
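
To make this concrete, steps 1 and 2 could look roughly like this on
the command line (the project ID and corpus path are placeholders):

   # 1. Overall quality metrics (precision, recall, F1, nDCG, ...)
   annif eval my-project /path/to/eval-corpus/

   # 2. Search for the limit and threshold values that maximize F1
   annif optimize my-project /path/to/eval-corpus/

Step 3 needs a new project entry in projects.cfg; here is a minimal
sketch, assuming a single source project and an English-language
vocabulary called "my-vocab" (check the wiki for the full set of PAV
backend parameters):

   [my-pav-ensemble]
   name=PAV ensemble around my-project
   language=en
   backend=pav
   vocab=my-vocab
   sources=my-project
   min-docs=10
   limit=100

After that, the ensemble is trained on the gold-standard corpus like
any other project:

   annif train my-pav-ensemble /path/to/gold-standard-corpus/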

These are the facilities built into Annif; it doesn't really have a
mechanism for saying "I don't know what this document is about", except
very indirectly, e.g. by applying a high threshold. I see that Anna
already answered with a pointer to Qualle; that tool may be a more
appropriate solution for answering the question "is the quality of the
automated suggestions good enough, or should this be checked manually?".

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

christelann...@gmail.com

Nov 17, 2023, 4:50:13 AM
to Annif Users
Thank you both, Anna and Osma, for these details! This is very useful.
Luckily I do have such a gold-standard dataset (I currently have close
to 5,000 texts ready, and it could be expanded to as many as 200,000
texts that were once manually labelled).

Did either of you publish an article on these kinds of 'control'
mechanisms? It would be good for references in my application!

Many many thanks, it is much appreciated.

Annemieke


Anna Kasprzik

Nov 17, 2023, 6:31:35 AM
to christelann...@gmail.com, Annif Users
Hello Annemieke!

I published a paper that addresses our quality management:
https://repository.ifla.org/handle/123456789/2047 (the paper PDF is at
the bottom of that page). It also references the original scientific
article that our PhD student published:

Toepfer, Martin, and Christin Seifert. 2018. 'Content-Based Quality
Estimation for Automatic Subject Indexing of Short Texts Under
Precision and Recall Constraints'. In: Méndez, E., Crestani, F.,
Ribeiro, C., David, G., Lopes, J. (eds.) Digital Libraries for Open
Knowledge. TPDL 2018. LNCS, vol. 11057. Springer.
https://doi.org/10.1007/978-3-030-00066-0_1

The same paper is also referenced at the bottom of the Qualle GitHub
repository: https://github.com/zbw/qualle

Best wishes
Anna


Osma Suominen

Nov 17, 2023, 7:56:06 AM
to annif...@googlegroups.com
Hi Annemieke,

I don't have any very specific articles about these mechanisms, but the
JLIS.it article from 2022 does include this little section:

> The PAV ensemble (Pool Adjacent Violations) uses isotonic regression to estimate probabilities of particular subject suggestions being correct, based on the documents the ensemble has been trained on (see Wilbur and Kim 2014), and combines the estimated probabilities to calculate an overall suggestion.

The reference is to an article where this kind of regression was
suggested for automated subject indexing with MeSH:

Wilbur, W. John, and Won Kim. 2014. ‘Stochastic Gradient Descent and the
Prediction of MeSH for PubMed Records’. AMIA Annual Symposium
Proceedings 2014 (November): 1198–1207.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419959/

...and our article is of course this one:

Suominen, Osma, Juho Inkinen, and Mona Lehtinen. 2022. “Annif and Finto
AI: Developing and Implementing Automated Subject Indexing”. JLIS.It 13
(1):265-82. https://doi.org/10.4403/jlis.it-12740.


-Osma

Uldis Bojars

Jan 8, 2024, 4:39:53 AM
to Annif Users
Hi!

We are about to start exploring automated subject indexing (using Annif) at the National Library of Latvia.

What are the hardware requirements for training and using Annif? 
What hardware are you using in real-life Annif projects?

Some information about requirements can be found on the wiki [1], but
it does not say much beyond recommending 16 GB of RAM (or more).

[1] https://github.com/NatLibFi/Annif/wiki/System-requirements

Best regards,
Uldis Bojārs

Osma Suominen

Jan 8, 2024, 9:36:38 AM
to annif...@googlegroups.com
Hi Uldis!

On 08/01/2024 11:39, Uldis Bojars wrote:
> We are about to start exploring automated subject indexing (using Annif)
> at the National Library of Latvia.

This is great!

> What are the hardware requirements for training and using Annif?

That really depends a lot on what kind of vocabulary you have
(especially the number of concepts), what algorithm(s) you use, and how
you train them (e.g. the number of training documents). You should be
able to get started with just a laptop, but for larger models you may
need a beefier system. Still, the requirements are nowhere near those
of LLMs and other deep learning systems. Annif doesn't currently
support GPU computing in any of its algorithms, though this may change
in the future.

> What hardware are you using in real-life Annif projects?
>
> Some information about requirements can be found on the wiki [1] but it
> does not say much besides recommending 16 Gb of RAM (or more).
>
> [1] https://github.com/NatLibFi/Annif/wiki/System-requirements

There is a little bit more information, in particular about Finto AI, on
this tutorial exercise page:

https://github.com/NatLibFi/Annif-tutorial/blob/master/exercises/OPT_production_use.md

At NLF we do Annif development and basic testing on laptops. For
training and evaluating production models we use a physical server with
48 CPU cores and 512 GB of RAM; this is actually way overkill (we could
probably manage just as well with half the resources or less), but it
happened to be available to us.

We are running Finto AI and the Annif instance powering annif.org as
Docker containers on an OpenShift platform. The basic requirement there
is ~30 GB of RAM for our current set of models.
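
If you want to try a containerized setup yourself, something along
these lines should get you started (a rough sketch, assuming the
published image on quay.io and that the container accepts the same
"annif" commands as a local install - check the wiki for current tags
and the recommended production setup):

   # pull the published Annif image and start the web API on port 5000
   docker run -it -p 5000:5000 quay.io/natlibfi/annif \
       annif run --host 0.0.0.0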


The DNB uses a similar setup with a powerful physical machine for model
training and evaluation and a Docker platform for the production
environment. Some details were given in their SWIB21 presentation:
https://swib.org/swib21/slides/03-02-uhlmann.pdf


Your best bet is probably to start small with experiments and see how
far you get before you need more powerful hardware.

-Osma