Annif Analyzer Shootout paper in Code4Lib Journal

Osma Suominen

Aug 30, 2022, 2:22:19 AM
to Annif Users
Hi all,

We have just published a new paper that compares different Annif
analyzer approaches, including both the currently implemented
lemmatization methods and possible future ones, and their effect on
downstream subject indexing quality. The paper could be useful for
practitioners who apply Annif to their own data sets and want to find
the best-performing setup. The experiments in the paper also led to the
implementation of the multilingual Simplemma analyzer, which was
included in the Annif 0.58 release.


Annif Analyzer Shootout: Comparing text lemmatization methods for
automated subject indexing

by Osma Suominen, Ilkka Koskenniemi

https://journal.code4lib.org/articles/16719

Abstract:

Automated text classification is an important function for many AI
systems relevant to libraries, including automated subject indexing and
classification. When implemented using the traditional natural language
processing (NLP) paradigm, one key part of the process is the
normalization of words using stemming or lemmatization, which reduces
the amount of linguistic variation and often improves the quality of
classification. In this paper, we compare the output of seven different
text lemmatization algorithms as well as two baseline methods. We
measure how the choice of method affects the quality of text
classification using example corpora in three languages. The experiments
have been performed using the open source Annif toolkit for automated
subject indexing and classification, but should generalize also to other
NLP toolkits and similar text classification tasks. The results show
that lemmatization methods in most cases outperform baseline methods in
text classification particularly for Finnish and Swedish text, but not
English, where baseline methods are most effective. The differences
between lemmatization methods are quite small. The systematic comparison
will help optimize text classification pipelines and inform the further
development of the Annif toolkit to incorporate a wider choice of
normalization methods.
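
To make the idea of "reducing linguistic variation" concrete, here is a
minimal toy sketch in Python. It is not code from the paper or from
Annif: the lookup table TOY_LEMMAS and the functions baseline() and
lemmatize() are made up for illustration, and a real lemmatizer would
use a proper language resource instead of four hand-written entries.
The sketch only shows how lemmatization shrinks the vocabulary compared
with a lowercase-only baseline.

```python
# Toy illustration only: a hand-written lookup table stands in for a
# real lemmatizer. All names here are hypothetical, not Annif APIs.
TOY_LEMMAS = {
    "libraries": "library",
    "indexed": "index",
    "indexes": "index",
    "books": "book",
}


def baseline(tokens):
    """Baseline normalization: lowercase each token, nothing else."""
    return [t.lower() for t in tokens]


def lemmatize(tokens):
    """Lowercase, then replace a token with its lemma if the toy table has one."""
    return [TOY_LEMMAS.get(t, t) for t in baseline(tokens)]


tokens = "Libraries indexed books the library indexes a book".split()

# Distinct terms a classifier would see under each normalization.
baseline_vocab = set(baseline(tokens))
lemma_vocab = set(lemmatize(tokens))

print(len(baseline_vocab), sorted(baseline_vocab))
print(len(lemma_vocab), sorted(lemma_vocab))
```

With the baseline, "libraries" and "library" stay separate terms; after
lemmatization they collapse into one, so the same eight tokens yield
fewer distinct features.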


-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi