Best practices for big data processing?

Janne

Sep 1, 2022, 2:15:26 AM
to Annif Users
Hi

What is the recommended way to process a huge number of texts with Annif? I have about 4 million text samples on disk and would like to process them all in a reasonable time. I would also like to use the best available pre-trained models, "yso-fi" and "yso-en". I can think of three different ways at the moment:

1. Using the web API with parallel processes (say at least ~50-100 processes). Very easy to do with only a few lines of code, but it is typically a no-no and a rude exploitation of a free API service, and probably too slow anyway (~1-2 s per text).
2. Using the API via a local Docker server. This appears too slow, around 0.4-0.6 s per text, when using the development server.
3. Using Annif natively from Python code. This is not an officially supported method, so it is difficult to set up and optimize.

Are there some other options? What is the realistic maximum performance (time per text) that can be achieved and how?

Sorry if this has already been discussed here or in some tutorial. You can point me there also.

Osma Suominen

Sep 1, 2022, 6:55:44 AM
to annif...@googlegroups.com
Hi Janne!

These are excellent questions, thank you for asking.

4 million is indeed a huge number of texts to process! Are these long or
short texts? You can't really get around the fact that processing so
many documents is going to need a lot of time and computing resources.
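
As a rough back-of-the-envelope sketch of the scale involved: the 0.5 s
figure below is simply the midpoint of the 0.4-0.6 s per text that you
measured, and the perfect scaling across workers is an optimistic
assumption, not something I have measured.

```python
SECONDS_PER_DAY = 86400


def total_days(n_docs, seconds_per_doc, workers=1):
    """Estimate wall-clock days, assuming ideal (linear) scaling over workers."""
    return n_docs * seconds_per_doc / workers / SECONDS_PER_DAY


# One sequential client at 0.5 s per text
print(round(total_days(4_000_000, 0.5), 1))             # 23.1
# Eight parallel workers, if the speedup were perfectly linear
print(round(total_days(4_000_000, 0.5, workers=8), 1))  # 2.9
```

So a single sequential run would take weeks, which is why both the
per-document time and the degree of parallelism matter here.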

I assume that you've downloaded the Finto AI models from
annif.org/download/models since you seem to be referring to them.

Comments on your suggestions:

1. is indeed quite rude, considering that you have so many documents.
The bandwidth usage and network delays are also going to be significant.
I wouldn't recommend this, since you have better choices.

2. is much better. You are in control of how many resources to spend. I
don't think that the overhead of using the REST API is going to be very
significant in this scenario.

If 0.4-0.6s per document is too slow for you, then you can consider
using a simpler model setup, for example just the MLLM and/or just the
Omikuji models to save on processing, instead of using the full Finto AI
model setup (mllm+omikuji+fasttext all wrapped in an NN ensemble). I
think you could also get a significant speedup without much loss of
quality by defining a simple ensemble of omikuji+mllm, with equal (1:1)
weights for each. You can of course use the Finto AI pretrained omikuji
and mllm models for this.

If your documents are long, you can consider truncating them and sending
only, say, the first 5000 characters. Some of the Finto AI models (at
least Omikuji) do this internally as well, via the transform=limit
setting, but it is even better to do the truncation already before
sending the document to Annif. In our experience, the quality of the
results may actually improve when looking only at the beginning of the
document, because important topics tend to be mentioned near the top.
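
A minimal sketch of such client-side truncation could look like this.
The function name is made up for illustration, and cutting at the last
space is just one possible choice to avoid splitting a word in half.

```python
def truncate_text(text, max_chars=5000):
    """Return at most the first max_chars characters of text."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space (if any) so the final word is not split
    last_space = cut.rfind(" ")
    return cut[:last_space] if last_space > 0 else cut
```

You would call this on each document just before sending it to Annif,
for example truncate_text(document) with the default limit of 5000.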

Another idea is to try to make use of parallel processing. Can you speed
this up by submitting queries in parallel? Or by running many copies of
the same container, perhaps on different hosts? Or if running containers
is cumbersome for you, you can also do the same with just a plain Annif
installation that is running the REST API ("annif run" command).
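
Submitting queries in parallel could be sketched roughly like this. The
assumptions are a local server started with "annif run" listening on
port 5000, a project called "yso-fi", and the suggest method of the REST
API accepting a form-encoded "text" field; please check these against
your own installation. The helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: adjust the address and project ID to your own setup
API_BASE = "http://localhost:5000/v1"
PROJECT_ID = "yso-fi"


def suggest_http(text, limit=10):
    """POST one text to the suggest method and return the list of results."""
    url = f"{API_BASE}/projects/{PROJECT_ID}/suggest"
    data = urllib.parse.urlencode({"text": text, "limit": limit}).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=60) as response:
        return json.load(response)["results"]


def suggest_many(texts, suggest=suggest_http, workers=8):
    """Submit texts concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(suggest, texts))
```

How much this helps depends on how many requests the server can actually
handle at the same time, so it is worth measuring with a small sample
and a few different values of workers before starting the full run.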

Option 3 (Annif from native code) is basically bypassing the REST API.
But as said above, I don't think the overhead of the API is going to be
very significant. You can do this, but it's a bit fiddly to set up and
may break with future versions of Annif, as currently the Python API is
not very stable. We need to be able to rearrange the furniture inside
the house so to speak, so there are no guarantees of internal API
stability going forward either.

One further option is to make use of the "annif index" command. For
this you need to store the texts as individual TXT files in a single
directory (or group of directories). Annif will then store the
suggested subjects in a TSV file alongside each TXT file. You can also,
if you have enough memory, run several "annif index" commands in
parallel. I don't know whether this is actually any faster than using
the REST API, but at least it bypasses the REST API and gets rid of any
overhead, so in this sense it is similar to using Annif natively from
Python code.
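
Preparing the files for several parallel "annif index" runs could be
sketched like this. The directory and file naming scheme and the helper
names are made up for illustration, and "yso-fi" is assumed to be the
project ID in your configuration.

```python
from pathlib import Path


def write_shards(texts, base_dir, n_shards=4):
    """Write each text to its own TXT file, spread round-robin over n_shards directories."""
    base = Path(base_dir)
    shard_dirs = [base / f"shard{i:02d}" for i in range(n_shards)]
    for shard_dir in shard_dirs:
        shard_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(texts):
        path = shard_dirs[i % n_shards] / f"doc{i:07d}.txt"
        path.write_text(text, encoding="utf-8")
    return shard_dirs


def index_commands(shard_dirs, project_id="yso-fi"):
    """Build one "annif index" command line (as an argument list) per directory."""
    return [["annif", "index", project_id, str(d)] for d in shard_dirs]
```

Each command list can then be started with subprocess.Popen, so that one
"annif index" process per directory runs at the same time. Since every
process loads its own copy of the models, the number of directories
should be chosen according to the memory you have available.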

Hope this helps!

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi