ANN: Annif 1.4 released

juho.i...@helsinki.fi

Sep 3, 2025, 6:54:53 AM
to Annif Users
Annif 1.4 has been released!

https://github.com/NatLibFi/Annif/releases/tag/v1.4.0

This release introduces three new corpus formats (https://github.com/NatLibFi/Annif/wiki/Corpus-formats): a JSON-based full text corpus format (one file per document) and two short-text formats, one based on JSON Lines and another based on CSV. All the new corpus formats support document IDs as well as metadata, so it is now possible to include structured information such as titles and abstracts for documents. This flexibility is intended to improve the handling of documents that require additional context beyond the text itself; projects may be configured to operate only on specific metadata fields using the new select transform (https://github.com/NatLibFi/Annif/wiki/Transforms#select-transform). All the new corpus formats can be used alongside existing formats.
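
As a rough illustration of what a short-text JSON Lines corpus with document IDs and metadata could look like, the following Python sketch writes two records and reads them back. The field names (`document_id`, `text`, `metadata`, `subjects`) and the example URIs are assumptions made for illustration only; the Corpus formats wiki page is the authoritative description of the actual schema.

```python
import io
import json

# Hypothetical records: the field names are illustrative, not the official schema.
records = [
    {
        "document_id": "doc-001",
        "text": "Short text to be indexed.",
        "metadata": {"title": "An example title", "abstract": "An example abstract."},
        "subjects": [{"uri": "http://example.org/subject/1", "label": "example subject"}],
    },
    {
        "document_id": "doc-002",
        "text": "Another short text.",
        "metadata": {"title": "A second title"},
        "subjects": [],
    },
]


def write_jsonl(items, stream):
    """Write one JSON object per line (the JSON Lines convention)."""
    for item in items:
        stream.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
loaded = read_jsonl(buffer)
print(loaded[0]["metadata"]["title"])
```

Keeping each document on its own line means a corpus can be streamed record by record without loading the whole file into memory.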

It is now possible to exclude and include subjects from a vocabulary (https://github.com/NatLibFi/Annif/wiki/Subject-exclusion-and-inclusion). Excluding individual concepts can be useful in cases where algorithms frequently produce incorrect subject suggestions. Using exclude and include rules, it is also possible to define more specialized projects that operate on only one type or class of concepts.
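
Conceptually, exclude and include rules narrow a vocabulary down to the concepts a project is allowed to suggest. The Python sketch below is not Annif's implementation or its configuration syntax: the `Concept` class, the `filter_concepts` helper, the rule semantics (an include rule overrides an exclusion) and the example URIs are all assumptions made for illustration. The Subject exclusion and inclusion wiki page describes the real rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    uri: str
    label: str
    concept_type: str  # hypothetical attribute, e.g. "topic" or "place"


def filter_concepts(concepts, exclude_uris=(), exclude_all=False, include_types=()):
    """Return the concepts that survive the rules.

    Assumed semantics: a concept is dropped if exclude_all is set or its URI
    is listed in exclude_uris, unless its type is listed in include_types.
    """
    excluded = set(exclude_uris)
    included_types = set(include_types)
    kept = []
    for concept in concepts:
        is_excluded = exclude_all or concept.uri in excluded
        is_included = concept.concept_type in included_types
        if not is_excluded or is_included:
            kept.append(concept)
    return kept


vocabulary = [
    Concept("http://example.org/c1", "cats", "topic"),
    Concept("http://example.org/c2", "research", "topic"),
    Concept("http://example.org/c3", "Helsinki", "place"),
]

# Denylist a single concept that is suggested too often.
without_research = filter_concepts(vocabulary, exclude_uris=["http://example.org/c2"])
print([c.label for c in without_research])

# Specialized project: exclude everything, then include only one type of concept.
places_only = filter_concepts(vocabulary, exclude_all=True, include_types=["place"])
print([c.label for c in places_only])
```

The two calls correspond to the two use cases mentioned above: suppressing an individual concept, and restricting a project to one class of concepts.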

Several improvements have been made to the REST API, including exposing vocabulary information via the vocabs method and disabling the learn method by default (controlled by the allow_learn setting in the NN ensemble backend, https://github.com/NatLibFi/Annif/wiki/Backend%3A-nn_ensemble).

The annif index command can now be used on short-text corpus formats (TSV, CSV or JSON Lines) in addition to full text formats (TXT+TSV or JSON). In the case of short-text formats, output including the suggested subjects and their scores is produced in JSON Lines format.

The hyperopt command has been enhanced to better support parallel processing on multiple CPU cores, which can significantly reduce overall processing time.

This release also adds support for Python 3.13, ensuring compatibility with the latest Python version. Furthermore, the tfidf backend has been refactored to eliminate the dependency on gensim, which addresses compatibility issues and simplifies the codebase. Support for Python 3.9 has been dropped. Various maintenance updates and bug fixes are included, such as resolving warnings related to Click and upgrading many libraries to more recent versions.

Special thanks to the German National Library (DNB) EMa team (@c-poley, @RietdorfC, @san-uh) for their work on proposing, specifying and testing the new features in this release!

Supported Python versions:

  • 3.10, 3.11, 3.12 and 3.13

Backward compatibility:

  • ⚠️ tfidf projects trained with Annif 1.3 or older need to be retrained.
  • For other projects, the warnings issued by scikit-learn are harmless.
  • ⚠️ This is very likely the last Annif minor release to support the current fasttext backend, because the original fastText library is no longer maintained and there are compatibility issues with other libraries. We are looking for alternative implementations of fasttext (https://github.com/NatLibFi/Annif/issues/795).

Enhancements:
#875/#876 Add JSONL short text corpus format
#872/#868 Support metadata in fulltext corpus format / JSON fulltext corpus format
#886/#885 Support document_id in JSON(L) and CSV corpus formats & JSONL output
#889/#639/#877 Support for all corpus formats in annif index CLI command
#863/#140 Flexible fusion part 1: CSV short-text document corpus format
#864 Flexible fusion part 2: core functionality
#866 Flexible fusion part 3: CLI suggest option for additional metadata
#867 Flexible fusion part 4: REST API document metadata support
#844/#846 Support exclude/include rules for vocabulary concepts
#735/#840 Support subject exclusion / Dealing with overrepresented concepts / denylisting
#839/#837 Expose vocabulary information via REST API
#843 Disable /learn REST API method by default
#688/#873 Parallel hyperparameter optimization using multiple CPU cores

Maintenance:
#878 Remove gensim dependency in tfidf backend
#871 Update dependencies for 1.4 release
#890 Use NumPy 2 compatible fastText fork
#849 Drop Python 3.9 support
#850/#869 Support Python 3.13
#884 Upgrade to Poetry 2.0 / Resolve Poetry deprecation warnings
#848 Resolve DeprecationWarning: avoid use of datetime.utcfromtimestamp
#852/#891 Bump GitHub Actions versions

Fixes:
#882 Resolve UserWarning: The parameter --verbosity... for annif list-* CLI commands
#847 Add superclass constructor call to LMDBSequence, to prevent TensorFlow warning
#874 JSON corpus bugfix: avoid parsing subjects in annif index
#887 Fix slow annif train JSONL test & avoid slow jsonschema import


Parthasarathi Mukhopadhyay

Sep 3, 2025, 8:57:33 AM
to Annif Users
Dear Annif Team,

Thank you for this excellent news. I’m glad to report that the installation of Annif version 1.4 went smoothly in a Python 3.12 virtual environment. This time I did not face the TensorFlow/Keras conflict error that occurred with Annif version 1.3 (i.e., `tensorflow==2.18.0` with `keras==3.9.2` — the default Keras 3.10 had caused problems earlier).

The new Subject exclusion and inclusion feature looks very helpful. For example, in biomedical literature indexing with MeSH, records often include both “Animal” and “Human” descriptors by default; hopefully, this feature will help resolve such issues.

Another great concept is the use of synthetic training data. I do have one question here (though it is explained well [in the wiki](https://github.com/NatLibFi/Annif/wiki/Generating-synthetic-training-data)):

If I have 10,000 manually curated records, would the correct ratio then be 10,000 manual + 20,000 synthetic records?

Congratulations to the Annif team on this release.

Thanks and kudos,

Parthasarathi Mukhopadhyay

--
You received this message because you are subscribed to the Google Groups "Annif Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to annif-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/annif-users/672034c0-dab9-41e0-b2ba-678acb8e69f9n%40googlegroups.com.

Osma Suominen

Sep 3, 2025, 10:22:07 AM
to annif...@googlegroups.com
Hello Parthasarathi,

thank you for testing the new release so promptly! We are glad to hear
that the install worked fine this time.

Regarding synthetic data: as is often the case in machine learning, there is
no single right answer, so you will have to test different techniques and amounts.
The wiki page you mention is based on our experiences from the first
LLMs4Subjects Shared Task at SemEval-2025. There we ended up using 1
part of original data and 3 parts synthetic data; in your case, that
would be equivalent to 10,000 manually curated records plus 30,000
synthetic records.

However, we continued the experiments in the second LLMs4Subjects Shared
Task at GermEval-2025. This time, we used different LLMs to generate the
synthetic records in an attempt to make the synthetic data a bit more
diverse, and also experimented with different ratios of original vs.
synthetic data. Our final solution (for the Omikuji models) was to use 2
parts original data and 4 parts synthetic data; we simply duplicated the
original records so that each record was used twice. You can read more
about this in our pre-print: https://doi.org/10.48550/arXiv.2508.15877
In particular, check out Figure 4 on the last page that shows the effect
of synthetic training data on the nDCG score.
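
The arithmetic behind these ratios can be sketched in Python as follows. This is only an illustration, not the pipeline used in the shared tasks: the `build_training_mix` helper and the placeholder records are hypothetical, and the sketch assumes that one "part" equals the size of the original data set and that synthetic records are sampled without replacement.

```python
import random


def build_training_mix(original, synthetic, original_parts, synthetic_parts, seed=0):
    """Combine original and synthetic records by 'parts'.

    One part equals len(original) records. Original records are repeated
    original_parts times; synthetic records are sampled without replacement.
    """
    needed = synthetic_parts * len(original)
    if needed > len(synthetic):
        raise ValueError(f"need {needed} synthetic records, have {len(synthetic)}")
    rng = random.Random(seed)
    mix = list(original) * original_parts + rng.sample(list(synthetic), needed)
    rng.shuffle(mix)  # avoid long runs of one kind of record
    return mix


# Placeholder records standing in for real training documents.
original = [f"orig-{i}" for i in range(10_000)]
synthetic = [f"synth-{i}" for i in range(40_000)]

one_to_three = build_training_mix(original, synthetic, 1, 3)  # 10,000 + 30,000
two_to_four = build_training_mix(original, synthetic, 2, 4)   # 20,000 + 40,000
print(len(one_to_three), len(two_to_four))
```

Under these assumptions, 10,000 curated records give 40,000 training records at 1:3 and 60,000 at 2:4, with every original record appearing twice in the latter.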

Your data is likely different from this, so you will have to conduct
your own experiments in order to find the optimal ratio and/or the point
where additional synthetic data does not improve the results anymore. If
you do this, please report on your findings here!

Best regards,
Osma / Annif team

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Parthasarathi Mukhopadhyay

Sep 4, 2025, 5:28:57 AM
to annif...@googlegroups.com

Hello Osma,

Thank you once again for your insightful advice.

After going through the preprint you referred me to, I now have a clearer understanding of the ratio between actual and synthetic data.

I will, of course, report the results of my experiments on synthetic data usage in automated subject indexing and classification.

Thanks and regards,

-Parthasarathi




Christoph Poley

Sep 4, 2025, 7:24:25 AM
to Annif Users
Dear Annif-team,

thank you for releasing the new Annif version. The fully flexible fusion of text data is a proven way to achieve better results in indexing and cataloging. Thank you in return for the special thanks to us. But without your initiative to find a solution that starts with the data formats and ends with the new transform parameter, it would never have worked :)

Best regards from the EMa team,
Christoph
