Not working: spacy, ensemble, nn_ensemble

Aurélie Thébault

May 31, 2023, 7:53:18 AM

to Annif Users

Dear all,

I managed to format my data for use with Annif. I have a single training set, output-train.tsv, in the short text document format (all my notices are in the same file).
I can run TF-IDF, MLLM and Omikuji with the analyzer snowball(french), but I cannot get them to work with spaCy.

I installed spaCy with pip install .[spacy]:

$ pip3 freeze | grep [s]pacy
fr-core-news-md @ https://github.com/explosion/spacy-models/releases/download/fr_core_news_md-3.4.0/fr_core_news_md-3.4.0-py3-none-any.whl#sha256=7af020a5be75d7537ded4043390a7082a60cf51ec5177271eac940814215c6a5
fr-core-news-sm @ https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.4.0/fr_core_news_sm-3.4.0-py3-none-any.whl#sha256=f2f89186633f13e2726c32bb032d8162dc6c0324241af16fc2c79533a9528dab
spacy==3.5.3
spacy-alignments==0.9.0
spacy-legacy==3.0.12
spacy-loggers==1.0.4
spacy-transformers==1.2.4

$ python -m spacy download fr_core_web_sm
2023-05-31 11:51:35.518296: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

✘ No compatible package found for 'fr_core_web_sm' (spaCy v3.5.3)

I cannot run nn and nn_ensemble either.

$ annif train rameau-ensemble-fr /home/aurelie/ABES/Annif-tutorial/data-sets/rameau/rameau-train.tsv
Error: Not supported: Training ensemble backend is not possible

Are there known bugs in this new version?
Best regards,

Aurélie

juho.i...@helsinki.fi

May 31, 2023, 8:21:52 AM

to Annif Users

Hi Aurélie!

The spaCy models for French are named fr_core_news_<size>, unlike the English ones, which are named en_core_web_<size> ("news" vs. "web"), so use the command:

python -m spacy download fr_core_news_sm
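
To check whether a model package is already installed before downloading it, a small sketch like the following could help. It assumes that pip-installed spaCy models are importable as ordinary Python packages under their model name; the helper name model_installed is made up for this example.

```python
import importlib.util


def model_installed(model_name: str) -> bool:
    """Return True if a Python package with this name can be found.

    Assumption: spaCy models installed with pip are importable
    under their model name, e.g. fr_core_news_sm.
    """
    return importlib.util.find_spec(model_name) is not None


# The French models use "news", not "web", in their names.
for name in ("fr_core_news_sm", "fr_core_news_md"):
    status = "installed" if model_installed(name) else "missing"
    print(f"{name}: {status}")
```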

For the nn_ensemble problem, make sure you have the correct backend setting in the project configuration, i.e. "backend=nn_ensemble" (instead of "backend=ensemble", which is used for the simple ensemble that cannot be trained).
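
As a rough illustration, the sketch below parses a sample project section and reports which backend it uses. Only the project ID rameau-ensemble-fr and the two backend names come from this thread; the source project IDs, the vocabulary ID and the helper get_backend are hypothetical placeholders, so adapt them to your own projects.cfg.

```python
import configparser

# Sample section in projects.cfg style. The sources and vocab
# values are hypothetical placeholders, not taken from this thread.
SAMPLE_CFG = """
[rameau-ensemble-fr]
name=RAMEAU NN ensemble French
language=fr
backend=nn_ensemble
sources=rameau-tfidf-fr,rameau-mllm-fr,rameau-omikuji-fr
vocab=rameau
"""


def get_backend(cfg_text: str, project_id: str) -> str:
    """Return the backend setting of one project section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser[project_id]["backend"]


backend = get_backend(SAMPLE_CFG, "rameau-ensemble-fr")
if backend == "ensemble":
    print("simple ensemble: cannot be trained")
else:
    print(f"backend is {backend}")
```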

-Juho

Aurélie Thébault

May 31, 2023, 9:19:59 AM

to Annif Users

Thanks a lot for your answer Juho, it seems to work! So far, nn_ensemble is taking a long time. Is that normal? I have a training set of 115000 notices.

I have one last question: is there a way to retrieve predictions (URIs) when using the short text document format (one TSV file)? I can get the metrics using annif eval, but annif suggest seems to give me a single prediction for my entire dataset, whereas I would like to retrieve predictions for each line of my TSV file (each line corresponding to one notice).

Thanks again for your answer!

Aurélie

juho.i...@helsinki.fi

May 31, 2023, 11:21:32 AM

to Annif Users

Good to hear you got the project working :)

Yes, training the NN ensemble can take a long time, for example around 2 hours when training on 1400 full-text documents (see the NN ensemble exercise of the Annif tutorial). You have a great many documents, but they are short, so I can't guess how long it would take. You could first try with a limited number of training documents (lines in your TSV file) by adding the --docs-limit <number> option to the train command. For example, to train with only 1000 documents:

annif train rameau-ensemble-fr /home/aurelie/ABES/Annif-tutorial/data-sets/rameau/rameau-train.tsv --docs-limit 1000

Once you know how long this takes, you can estimate how long training on your full training set would take.
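
The estimate can be sketched as a simple linear extrapolation. This assumes training time grows roughly in proportion to the number of documents, which ignores fixed overhead such as loading the source projects. The 120-second timing below is an invented figure for illustration, not a measurement.

```python
def estimate_full_training_seconds(sample_seconds: float,
                                   sample_docs: int,
                                   total_docs: int) -> float:
    """Linearly extrapolate training time from a limited trial run."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    return sample_seconds * total_docs / sample_docs


# Hypothetical timing: suppose the 1000-document trial took 120 seconds.
seconds = estimate_full_training_seconds(120, 1000, 115000)
print(f"Estimated full training time: {seconds / 3600:.1f} hours")
```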

Also, we usually train the NN ensemble with full-text documents (that is, texts at least several pages long, as in PDF documents), so I'm not sure how much using the NN ensemble instead of the simple ensemble helps here. Usually, using the NN ensemble instead of the simple ensemble increases the evaluation metrics by about 1-3 percentage points.

I think using annif suggest on a TSV file does not make sense if the file is in the short text document format, because the file then already contains the subjects (URIs). For annif eval this is exactly what is needed (for comparing the Annif suggestions against the gold-standard subjects). If you want Annif to give suggestions for each notice in the TSV file, then each line would need to be fed separately (and without the URIs) to annif suggest, but this could be quite slow. An alternative would be to write each notice to its own txt file and store the files in a directory, on which you could then run annif index.
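
Splitting the TSV file into one txt file per notice could be sketched as follows. This assumes the text is in the first tab-separated column and the subject URIs come after the first tab; the function name, the output directory and the notice-NNNNNN.txt naming scheme are arbitrary choices made for this example.

```python
from pathlib import Path


def split_tsv_to_txt(tsv_path: str, out_dir: str) -> int:
    """Write the text column of each TSV line to its own .txt file.

    Everything after the first tab (the subject URIs) is dropped.
    Returns the number of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(tsv_path, encoding="utf-8") as tsv_file:
        for line_no, line in enumerate(tsv_file, start=1):
            text = line.rstrip("\n").split("\t", 1)[0].strip()
            if not text:
                continue  # skip empty lines
            target = out / f"notice-{line_no:06d}.txt"
            target.write_text(text + "\n", encoding="utf-8")
            written += 1
    return written
```

For example, split_tsv_to_txt("rameau-train.tsv", "notices") would fill a notices directory, which could then be passed to annif index together with the project ID.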

-Juho