help with training nn ensemble


Steven Holloway

Oct 28, 2022, 9:39:34 AM
to Annif Users

Hi,

We are evaluating ANNIF as an LCSH recommendation service at James Madison University. I have been able to train and evaluate all of the pertinent backends on our ETDs, with one exception. It is clear that the ensembles have the best chance of compensating for the subject-matter bias in our training set. PAV and the simple Ensemble work as advertised, but my dev platform throws errors when I try to train the NN Ensemble on the two sources, MLLM and Omikuji/AttentionXML, errors that I am unable to diagnose.

Hardware: 

MacBook Pro, macOS 11.7, 2.4 GHz quad-core Intel Core i5, 16 GB 2133 MHz LPDDR3 RAM.

Software:

ANNIF v.59

Python v.3.9

TensorFlow v.2.10

Keras v.2.10

The sources for ANNIF MLLM & Omikuji/AttentionXML were trained on 1,500 full-text documents with subject headings in TSV files, with validation and test sets numbering 300 and 200 documents, respectively.

The vocab is LCSH-SKOS, pared down to prefLabels and altLabels, plus about 600 LCNAF headings similarly processed; the resulting Turtle file is only 120 MB.

ANNIF project configs:

[mllm-fulltext]
name=MLLM Fulltext project
language=en
backend=mllm
vocab=lcsubjects-lcnames-skosrdf
analyzer=snowball(english)
limit=1000

[omikuji-attention-fulltext]
name=Omikuji Attention Fulltext project
language=en
backend=omikuji
vocab=lcsubjects-lcnames-skosrdf
analyzer=snowball(english)
cluster_balanced=False
cluster_k=2
collapse_every_n_layers=5
min_df=2
limit=1000

[nn-ensemble-fulltext]
name=NN Ensemble Fulltext project
language=en
backend=nn_ensemble
sources=omikuji-attention-fulltext:1,mllm-fulltext:2
vocab=lcsubjects-lcnames-skosrdf
analyzer=snowball(english)
limit=100
nodes=100
dropout_rate=0.2
epochs=10

The full message when I try to train NN Ensemble:

(ANNIF) LIB-20-0157:ANNIF hollowswx$ annif train nn-ensemble-fulltext /Users/hollowswx/GitHub/ANNIF/JMU-ETD/docs/train-fulltext
2022-10-28 08:51:21.182496: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Backend nn_ensemble: creating NN ensemble model
2022-10-28 08:51:26.794016: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Backend nn_ensemble: Initializing source projects: omikuji-attention-fulltext, mllm-fulltext
2022-10-28T12:51:28.577Z INFO [omikuji::model] Loading model from data/projects/omikuji-attention-fulltext/omikuji-model...
2022-10-28T12:51:28.577Z INFO [omikuji::model] Loading model settings from data/projects/omikuji-attention-fulltext/omikuji-model/settings.json...
2022-10-28T12:51:28.578Z INFO [omikuji::model] Loaded model settings Settings { n_features: 305616, classifier_loss_type: Hinge }...
2022-10-28T12:51:28.578Z INFO [omikuji::model] Loading tree from data/projects/omikuji-attention-fulltext/omikuji-model/tree0.cbor...
2022-10-28T12:51:28.600Z INFO [omikuji::model] Loading tree from data/projects/omikuji-attention-fulltext/omikuji-model/tree1.cbor...
2022-10-28T12:51:28.625Z INFO [omikuji::model] Loading tree from data/projects/omikuji-attention-fulltext/omikuji-model/tree2.cbor...
2022-10-28T12:51:28.647Z INFO [omikuji::model] Loaded model with 3 trees; it took 0.07s
Backend nn_ensemble: Processing training documents...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/parallel.py", line 45, in suggest
    project = self.registry.get_project(project_id)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/registry.py", line 68, in get_project
    projects = self.get_projects(min_access)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/registry.py", line 62, in get_projects
    for project_id, project in self._projects[self._rid].items()
KeyError: 140361445793648
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/bin/annif", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/cli.py", line 333, in run_train
    proj.train(documents, backend_params, jobs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/project.py", line 214, in train
    self.backend.train(corpus, beparams, jobs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/backend/backend.py", line 64, in train
    return self._train(corpus, params=beparams, jobs=jobs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/backend/nn_ensemble.py", line 175, in _train
    self._fit_model(
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/backend/nn_ensemble.py", line 232, in _fit_model
    self._corpus_to_vectors(corpus, seq, n_jobs)
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/site-packages/annif/backend/nn_ensemble.py", line 204, in _corpus_to_vectors
    for hits, subject_set in pool.imap_unordered(
  File "/usr/local/Caskroom/miniconda/base/envs/ANNIF/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
KeyError: 140361445793648

[END]

Both TensorFlow and Keras have been tested independently in the same Python Conda environment. The TensorFlow message generated at the beginning of the NN Ensemble training run seems to be only an informational notice about CPU instruction support.

Any help diagnosing this would be greatly appreciated.

Osma Suominen

Oct 31, 2022, 4:51:16 AM
to annif...@googlegroups.com
Hi Steven,

Thanks for the report. This seems to be related to the parallel
processing that is done while preparing the training data for the neural
network. Each training document is sent to the source projects and this
is done in parallel to maximize efficiency. There are some tricks in
Annif that rely on the forking process model of Linux and *nix-like
systems (it won't work on Windows at all). You are using Mac OS and
although it has a similar forking model, the details are a bit
different. I think the error you are seeing could be related to this
change in the multiprocessing module of Python 3.8:

> Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess. See bpo-33725.

Annif is being developed on Linux systems, so we don't necessarily notice
problems like this. It is also hard to fix without losing the efficiency
that Linux-style forking enables. We should probably state more clearly
that the recommended OS for running Annif is Linux.
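The difference can be seen with a minimal Python sketch (illustrative only, not Annif code): a registry populated in the parent process survives forking but not spawning, which produces exactly this kind of KeyError from worker processes.

```python
# Sketch of the fork vs. spawn difference behind the KeyError.
# Forked workers inherit the parent's memory, so they see a registry
# populated before the pool was created; spawned workers start from a
# fresh interpreter where the registry was never filled in.
import multiprocessing as mp

REGISTRY = {}  # populated only in the parent process

def lookup(key):
    try:
        return REGISTRY[key]
    except KeyError:
        return "KeyError"  # what a spawned worker would hit

if __name__ == "__main__":
    REGISTRY["proj"] = "loaded model"
    # "fork" is the Linux default; on macOS the default since
    # Python 3.8 is "spawn", under which REGISTRY would be empty
    # inside the workers and every lookup would fail.
    with mp.get_context("fork").Pool(2) as pool:
        print(pool.map(lookup, ["proj", "proj"]))  # → ['loaded model', 'loaded model']
```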

Here are things you can try:

1. Avoid the parallel processing by using the --jobs option to limit
processing to a single thread/process: annif train --jobs 1 ...
This will make the training slower compared to the default which is to
use many CPU cores in parallel, but it could help avoid the problem
you're having.

2. Use the Docker version of Annif. This way the code will actually run
inside a Linux VM.

3. Use some other means of running Annif on Linux, for example a
VirtualBox VM or installing Linux natively.


Other comments on your setup:

Having 1500 training documents is probably good enough for MLLM, but
it's not a lot for training Omikuji models. If you can find more
training data (even just titles or titles+abstracts, with LCSH subject
indexing), that would probably help a lot with the quality. LCSH is such
a huge vocabulary that you need a large number of training documents to
attain good coverage.

Also, you have pared down LCSH quite a lot. MLLM could benefit from some
of the parts you've removed, in particular the hierarchy
(skos:broader/skos:narrower) and associative relationships (skos:related).

Cheers,
-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Osma Suominen

Nov 1, 2022, 10:07:32 AM
to annif...@googlegroups.com
Hi again Steven,

I filed an issue about the "spawn" multiprocessing mode which I think is
the reason why you couldn't train the NN ensemble on Mac OS:

https://github.com/NatLibFi/Annif/issues/637

I also created a pull request with some code changes that I think should
solve the problem:

https://github.com/NatLibFi/Annif/pull/638

Is there any chance you could test the code in that PR branch and see if
it works in your environment and allows you to train the NN ensemble?
That would be very helpful, as I currently don't have a Mac OS
environment to test with.

If you need more specific instructions on how to get that code running,
just ask.

Best,
Osma

Florian Grässle

Nov 2, 2022, 7:02:30 AM
to Annif Users
Hi there,

Chiming in, as I am in a similar situation to Steven. I'm using Annif on a MacBook with an M1 processor and, to make matters even worse, the ARM architecture. Training an NN ensemble project ended in a similar KeyError, so I gave the fix Osma proposed in the issue637-multiprocessing-spawn branch a try. I can confirm that it works on my computer in my simple test scenario (10,000 training documents). Compared to the --jobs 1 option, training time was reduced by approximately 70%. This is a big improvement for me, since I can't use the Docker container; it currently won't run on ARM because of issues with TensorFlow.

Thanks Steven for bringing this up and Osma for looking into it.

Regards,
Florian

Osma Suominen

Nov 2, 2022, 8:40:28 AM
to annif...@googlegroups.com
Hi,

Thanks a lot Florian for testing! I noted this successful test in a
comment to the PR as well. I'm very glad to hear that it worked for you
and that there was a big improvement over using --jobs 1.

It would also be interesting to hear about the memory usage. If the base
models are big, then I would expect that training the NN ensemble in
parallel could eat up lots of memory. For example if you have a 1GB
Omikuji model and then train an NN ensemble that includes it, with 8
parallel workers in spawn mode, you would need 8GB of memory as each
worker has to load the Omikuji model separately. In fork mode (Linux)
most of the model can be shared between processes so it should only need
around 1GB regardless of the number of workers.
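That arithmetic can be sketched as a quick back-of-the-envelope estimate (the function and figures below are illustrative, not Annif internals):

```python
# Rough peak-memory estimate for NN ensemble training, depending on
# the multiprocessing start method. Illustrative only.
def training_memory_gb(model_gb: float, workers: int, start_method: str) -> float:
    if start_method == "spawn":
        # each worker loads its own copy of the source models
        return model_gb * workers
    # "fork": copy-on-write pages let all workers share one loaded copy
    return model_gb

print(training_memory_gb(1.0, 8, "spawn"))  # → 8.0
print(training_memory_gb(1.0, 8, "fork"))   # → 1.0
```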

Best,
Osma

Steven Holloway

Nov 2, 2022, 11:34:10 AM
to Annif Users
Osma,

A belated thanks for your advice and a further report.
I was able to train the nn_ensemble backend by setting the "jobs" parameter to 1, and have since trained it perhaps ten times without errors.
I regenerated the LCSH SKOS Turtle file to include broader than/narrower than/related triples, with no discernible reduction in processing speed. The gains in indexing accuracy are slight but measurable.
I know that the training set for the Omikuji backend is much smaller than optimal. I cannot achieve any meaningful outcome using the fastText backend, and TF-IDF isn't suitable for ensemble work, so I am stuck with the Omikuji variants. I will retrain the nn-ensemble in the future on a much larger "gold standard" data set, but even as it is, at least half of the first 20 LCSH suggestions are plausible. Watching the sampling bias play out is fascinating: an MA thesis on the psychology of aggression bags four LCSH "police" headings, including police misconduct and police brutality, even though the word "police" does not occur in the text -- tells us something about American culture, what?

Regarding the OS platform for ANNIF -- your documentation is lucidly clear on the need to base a production environment on Linux. I used macOS for convenience as a test environment. If we do stand up an ANNIF instance for production, it will run on a virtualized Linux install, probably in the cloud.

Steven