Seeking guidance in selecting training dataset


Parthasarathi Mukhopadhyay

unread,
Nov 9, 2023, 3:54:02 AM
to Annif Users

Dear all

The backdrop for these questions is as follows: after an array of curation steps, we have made ready a set of 3 lakh (0.3 million) bibliographic records from the Medline database (available under the ODbL) as a short-text corpus in the Annif-compatible format, with title, abstract, and MeSH term URIs. Medline records are indexed with the MeSH vocabulary by trained LIS professionals. Another 0.1 million records will be ready by this weekend, for a total of 0.4 million records for training. We want to create four projects (all using the same MeSH vocabulary) with four regular backends, namely TF-IDF, fastText, Omikuji-Bonsai, and Omikuji-Parabel, and then a final project with the NN-ensemble backend combining the regular backends with weights like tf-idf:1, fastText:2, Omikuji-B:3, and Omikuji-P:3. We have already tested this process with a small set of 10,000 records, and it works fine. But before beginning the training with all 0.3 million records, we would like to know the answers to the following questions:
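For concreteness, the setup described above might be sketched in projects.cfg roughly as follows. This is an illustrative sketch only: the project IDs, analyzer, and limit values are my assumptions, the fastText hyperparameters are examples rather than tuned values, and the Omikuji-Bonsai parameters follow my reading of the Annif wiki rather than tested settings.

```ini
# Sketch only: project IDs, analyzer and parameter values are assumptions.
[tfidf-mesh-en]
name=TF-IDF MeSH
language=en
backend=tfidf
vocab=mesh
analyzer=snowball(english)
limit=100

[fasttext-mesh-en]
name=fastText MeSH
language=en
backend=fasttext
vocab=mesh
analyzer=snowball(english)
loss=hs
limit=100

[omikuji-parabel-mesh-en]
name=Omikuji Parabel MeSH
language=en
# Parabel-style tree settings are the Omikuji backend defaults
backend=omikuji
vocab=mesh
analyzer=snowball(english)
limit=100

[omikuji-bonsai-mesh-en]
name=Omikuji Bonsai MeSH
language=en
backend=omikuji
vocab=mesh
analyzer=snowball(english)
# Bonsai-style tree settings, as suggested on the Annif wiki
cluster_balanced=False
cluster_k=100
max_depth=3
limit=100

[nn-mesh-en]
name=NN ensemble MeSH
language=en
backend=nn_ensemble
vocab=mesh
sources=tfidf-mesh-en:1,fasttext-mesh-en:2,omikuji-bonsai-mesh-en:3,omikuji-parabel-mesh-en:3
```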

1. Which approach is better: 1.1) train each regular backend on all 0.3 million records, or 1.2) divide the dataset into four equal sets and use one set per regular backend (set 1 for TF-IDF, set 2 for fastText, and so on)?

2. Assuming we follow approach 1.1, which is the better way to handle the NN-ensemble training and learning: 2.1) a subset of the same 0.3 million records (say, 0.1 million selected randomly from it), or 2.2) a completely new training set (the additional 0.1 million records still in the pipeline)?

best regards


Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

Osma Suominen

unread,
Nov 10, 2023, 4:13:31 AM
to annif...@googlegroups.com
Hello Parthasarathi,

I'm glad to hear you have found plenty of training material for MeSH-based models.

My suggestion would be the following:

Split the 0.4 million records into three subsets: train, validate, test.
(If you already have another test set for evaluation, you can of course
use that instead.) The train set should be as big as possible, while
validate and test sets could be perhaps 10,000 records each.

Use the train subset to train associative backends such as tf-idf
(though I can't recommend using this backend in a real setting),
fasttext and the Omikuji variants. You can also use the train subset to
train lexical backends like MLLM or stwfsa, but note that they don't
require this many examples. You will probably get good enough results
with, say, 10,000 records and anything more than that will just be a
waste of resources. It's okay to use the same records for training each
of the backends, so no need to split further within the train set.

Then set up a basic ensemble with all the projects you've trained so
far. Use the "annif hyperopt" command against the validate set, with
enough trial rounds (say 200 or more) to find the best set of weights
for each individual backend.

Finally set up the NN ensemble, using the weights you obtained using
hyperopt, and use the validate set to train it.

It's important to use a different set of records than the train set for
hyperopt and for training the NN ensemble, because both the
hyperparameter optimization and the NN ensemble will try to correct for
any bias in the base models, and they need a realistic view of how well
the projects are performing. So it's important that you give them
"fresh" records that have not been used in training the base models. But
you can use the same validation set for both.

Finally evaluate the NN ensemble against the test set. You should of
course also evaluate all the individual projects independently
(including the basic ensemble after setting the weights according to the
hyperparameter optimization) to make sure each of them is working properly.
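Put together, the workflow sketched above would look roughly like this on the command line. The project IDs and corpus file names here are assumptions; substitute the ones from your own configuration:

```shell
# Train the associative backends on the big train set
annif train tfidf-mesh-en train.tsv
annif train fasttext-mesh-en train.tsv
annif train omikuji-bonsai-mesh-en train.tsv
annif train omikuji-parabel-mesh-en train.tsv

# Find good ensemble weights against the validate set (200+ trials)
annif hyperopt --trials 200 ensemble-mesh-en validate.tsv

# Train the NN ensemble on the same validate set
annif train nn-mesh-en validate.tsv

# Evaluate against the held-out test set
annif eval ensemble-mesh-en test.tsv
annif eval nn-mesh-en test.tsv
```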

I hope this helps!

Best,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Parthasarathi Mukhopadhyay

unread,
Nov 10, 2023, 4:58:05 AM
to Annif Users
Hello Osma

This guidance will be extremely helpful for us.

One more question: by "basic ensemble", I take it you mean the "Simple Ensemble". Can we use the weight preferences there too (in the Simple Ensemble)?

Heartfelt thanks and best regards

Parthasarathi




Osma Suominen

unread,
Nov 10, 2023, 5:14:28 AM
to annif...@googlegroups.com
Hello Parthasarathi,

yes, I meant the simple ensemble (backend type "ensemble"). You can use
weighted sources with all ensemble types.

But I suggest you start with equal weights:


backend=ensemble
sources=mllm,omikuji-bonsai,omikuji-parabel

then run the hyperopt command, which will at the end print out a new
sources setting that includes optimized weights, something like:

sources=mllm:0.2231,omikuji-bonsai:0.4456,omikuji-parabel:0.3313

then you can use this new setting when setting up the NN ensemble as well.
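Carried over into the project configuration, that would look something like the fragment below; the project ID and the source project names are illustrative, and the weights are the hyperopt output from the example above:

```ini
[nn-ensemble-en]
name=NN ensemble
language=en
backend=nn_ensemble
vocab=mesh
sources=mllm:0.2231,omikuji-bonsai:0.4456,omikuji-parabel:0.3313
```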

-Osma

Parthasarathi Mukhopadhyay

unread,
Nov 10, 2023, 5:27:13 AM
to Annif Users
Thanks. I understand it now.

Regards
