Neural Network model - Successive learning


Parthasarathi Mukhopadhyay

Oct 20, 2024, 9:09:00 AM
to Annif Users
Dear all

We are working on a project on automated indexing based on MeSH, using datasets from the PubMed/MEDLINE Baseline (after extensive curation) as a short-text corpus with the following structure:

[attachment: image.png - corpus structure]

We have curated a total of 750,000+ publication records and distributed them into five sets (of 125,000 each) for training five backends, namely MLLM, STWFSA, fastText, Omikuji Bonsai and SVC. The remaining 125,000 records were kept for training the neural network (NN) model. We created a simple ensemble by combining all these trained backends and obtained weightage scores for the neural network through the hyperoptimization command. We then trained the NN model (one 'train' and four 'learn' commands) using five sets (around 25,000 records per set).
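In Annif command terms, the pipeline above was roughly the following (a sketch only; the project IDs and corpus file names here are illustrative placeholders, not our actual configuration):

```
# Train each base backend on its own 125k-record set
annif train biomed-mllm train-set-1.tsv
annif train biomed-stwfsa train-set-2.tsv
annif train biomed-fasttext train-set-3.tsv
annif train biomed-omikuji train-set-4.tsv
annif train biomed-svc train-set-5.tsv

# Obtain source weights for the ensemble on a validation set
annif hyperopt biomed-ensemble validation-set.tsv

# Initial NN ensemble training, then successive learn cycles
annif train biomed-nn-ensemble nn-set-1.tsv
annif learn biomed-nn-ensemble nn-set-2.tsv
annif learn biomed-nn-ensemble nn-set-3.tsv
annif learn biomed-nn-ensemble nn-set-4.tsv
annif learn biomed-nn-ensemble nn-set-5.tsv

# Evaluate after each cycle on the held-out test set
annif eval biomed-nn-ensemble test-set.tsv
```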

Our expectation was that the scores of the neural network model would improve with successive training, but as you can see below that is not actually happening: F1@5 and NDCG are highest in the second training cycle (NN-Cycle2), and by the fifth cycle they are actually lower than in Cycle1 (after the first training of the NN model).

Is there any issue with our methodology, or can this happen with large-scale data?


Parameters                      NN-Cycle1  NN-Cycle2  NN-Cycle3  NN-Cycle4  NN-Cycle5
Precision (doc avg):               0.5558     0.5622     0.5485     0.5466     0.5478
Recall (doc avg):                  0.4221     0.4276     0.4153     0.4141     0.4150
F1 score (doc avg):                0.4636     0.4694     0.4568     0.4554     0.4564
Precision (subj avg):              0.0790     0.0799     0.0709     0.0715     0.0729
Recall (subj avg):                 0.0604     0.0604     0.0470     0.0469     0.0478
F1 score (subj avg):               0.0636     0.0640     0.0526     0.0528     0.0537
Precision (weighted subj avg):     0.4684     0.4748     0.4645     0.4638     0.4696
Recall (weighted subj avg):        0.3958     0.4004     0.3907     0.3893     0.3901
F1 score (weighted subj avg):      0.3918     0.3952     0.3767     0.3752     0.3771
Precision (microavg):              0.5558     0.5622     0.5485     0.5466     0.5478
Recall (microavg):                 0.3958     0.4004     0.3907     0.3893     0.3901
F1 score (microavg):               0.4623     0.4677     0.4563     0.4548     0.4557
F1@5:                              0.3737     0.3736     0.3607     0.3540     0.3554
NDCG:                              0.5188     0.5205     0.5077     0.5050     0.5067
NDCG@5:                            0.7272     0.7224     0.7074     0.6987     0.7016
NDCG@10:                           0.6410     0.6430     0.6289     0.6257     0.6278
Precision@1:                       0.8548     0.8356     0.8484     0.8544     0.8616
Precision@3:                       0.7593     0.7533     0.7301     0.7217     0.7256
Precision@5:                       0.6883     0.6874     0.6682     0.6567     0.6584
True positives:                     13894      14056      13713      13666      13694
False positives:                    11106      10944      11287      11334      11306
False negatives:                    21208      21046      21389      21436      21408
Documents evaluated:                 2500       2500       2500       2500       2500

Regards

Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

https://orcid.org/0000-0003-0717-9413

juho.i...@helsinki.fi

Oct 22, 2024, 6:04:19 AM
to Annif Users
Hi Parthasarathi!

It is hard to say why you are seeing decreasing scores with more training cycles, but unfortunately it is not totally uncommon. We are actually not using the learn method on NatLibFi's Annif instances, both because our experiments did not show clear score improvements with it and because of the complexities it introduces (how to prevent mischief by users steering projects to give wrong suggestions, how to retain the changes obtained via online learning when the project needs to be updated, etc.).

One reason for the decreasing scores can be differences between the training/learning sets: maybe the set you used for the initial training was the most similar to your evaluation set, and the sets for subsequent learn cycles are somehow different? Ideally the evaluation set should consist of the most recent documents, to resemble as closely as possible the future situation when new documents are fed to Annif.

I think giving all the documents in one training cycle would give better results than giving them in parts with multiple learn commands.

In principle, if you set the learn-epochs parameter (default 1) to the same value as the epochs parameter (default 10), I think the end result should be the same whether you use one train cycle or one train and several learn cycles. (The NN Ensemble wiki page was missing information about the learn-epochs parameter; I have just added it.)
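For example, an nn_ensemble project section in projects.cfg could set both parameters explicitly (a sketch only; the project ID, vocabulary, sources and weights here are placeholders):

```ini
[biomed-nn-ensemble]
name=Biomed NN ensemble
language=en
backend=nn_ensemble
vocab=mesh
sources=biomed-mllm:1,biomed-fasttext:1,biomed-omikuji:1
epochs=10
learn-epochs=10
```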

What kind of scores do you get when you train the project with just the train command but with varying amounts of documents, like 25k, 50k, 75k, 100k and 125k? That way you would get a learning curve, which could give some insights.
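Such a learning curve can be sketched with a simple shell loop, using the --docs-limit option of the train command (a sketch only; the project ID and file names are placeholders):

```
for limit in 25000 50000 75000 100000 125000; do
    annif train biomed-nn-ensemble train-set.tsv --docs-limit "$limit"
    annif eval biomed-nn-ensemble test-set.tsv
done
```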

-Juho

PS Training projects with the MeSH vocabulary sounds very exciting! That could be of interest to other people as well.

Parthasarathi Mukhopadhyay

Oct 24, 2024, 1:12:49 PM
to Annif Users
Thanks Juho, for a crystal clear explanation.

I'll test and report the results ASAP. Meanwhile, let me ask one thing. When you say "varying amounts of the documents, like 25k, 50k, 75k, 100k, 125k documents", does that mean we need to first train with 25k randomly selected records, check the efficacy of the NN model and record it, then clear the project, and repeat until the final set with all 125k records? Or should we issue the 'train' command successively without clearing the model every time?

Another observation: the documentation now says the command to clear a project is "clear-project" [https://annif.readthedocs.io/en/stable/source/commands.html#annif-clear-project], but in reality the old 'clear' command works, not the new 'clear-project' command:
annif clear-project biomed-tfidf
Usage: annif [OPTIONS] COMMAND [ARGS]...
Try 'annif --help' for help.
Error: No such command 'clear-project'.


Regards

Parthasarathi



Parthasarathi Mukhopadhyay

Oct 25, 2024, 6:01:15 AM
to Annif Users
Hello Juho

Another dilemma I'm facing in conducting this learning-curve experiment for the neural network model:

As we need to specify sources (preferably with weightage) in projects.cfg for the neural network model, which of these two should we do:

1. First run validation (I have 12,500 records for this) to determine the weightage, use that weightage formula in sources, and then run the trn-evl-lmt script.

2. Add the target sources to the NN model without any weightage, run the trn-evl-lmt script to get an idea of the optimum number of records, and then run validation for fine-tuning the weightage formula.

Actually, I tried without any source list but got an error like this:

KeyError: 'sources'

real 0m5.170s
user 0m11.889s
sys 0m4.157s

Regards

Parthasarathi

juho.i...@helsinki.fi

Oct 25, 2024, 6:48:00 AM
to Annif Users

Hi Parthasarathi!

First, thanks for reporting the documentation error; I have just corrected the docs in the main branch.



> does that mean we need to first train with 25k randomly selected records, then check the efficacy of the NN model & record it, and then clear the project; and repeat the same until the final set with all records (125k). Or should we issue the 'train' command successively without clearing models everytime?

It is enough to just run the train command with a different, increasing number of records each time: every time the train command is invoked, the project is trained from scratch, so there is no need for the clear command between runs. (This is where the learn command differs from train: learn continues training a project from the state where the previous train or learn operation left it.)

And as you said, the records should be randomly selected for each of the training subsets; that's important.


> what should be done out of these two:
> 1. First run validation (I have 12,500 records for this) to determine the weightage, use that weightage formula in sources, and then run the trn-evl-lmt script.
> 2. Add the target sources to the NN model without any weightage, run the trn-evl-lmt script to get an idea of the optimum number of records, and then run validation for fine-tuning the weightage formula.

I don't think I have a good answer for this, but I guess the order of optimizing the weights and the number of records should not matter much. So either way. :)

But I also suspect that these optimizations won't necessarily improve the evaluation scores much. In our projects, the improvement from optimizing the source weights in an NN ensemble has been around one percentage point in the F1@5 score. But your case may differ, so it's good to check this.

-Juho

Parthasarathi Mukhopadhyay

Oct 27, 2024, 11:36:04 AM
to Annif Users
Hello Juho

Thanks for introducing me to this hidden gem [https://github.com/NatLibFi/Annif-tutorial/blob/main/exercises/05_mllm_project.md#extra-experiment-with-different-amounts-of-training-data], the script that determines the learning curve automatically. I had never noticed it before.

It worked flawlessly on our server for the NN backend (it took almost a day on a 48-core, 64 GB CentOS server with a max limit of 125,000 and a test set of 2,500). We needed one minor correction in the second line of the script to run it at our end:

print "usage: $0 <project-id> <trainset> <testset> <minlimit> <maxlimit> <step>" had to be changed to echo "usage: $0 <project-id> <trainset> <testset> <minlimit> <maxlimit> <step>" (but this may be because we are using an old bash shell on our server).

When Annif applies '--docs-limit', does it select the documents sequentially from the dataset or pick them randomly? If sequentially, can it be changed to pick records randomly when a document limit is set?

Another thing I was wondering: this script always ends with a training run at the maximum limit of documents. Is it possible to store the limit with the best NDCG value during the loop, and finish with a training run at that best '--docs-limit'? And is there any way in Annif to see how many documents were actually used in training a given model (say, in the show-project command)?

I'm attaching the results in case they are useful for other Annif users.


[attachment: image.png - learning-curve results]


Heartfelt thanks and regards

Parthasarathi


juho.i...@helsinki.fi

Oct 28, 2024, 9:17:29 AM
to Annif Users

Hi Parthasarathi!

The --docs-limit option does not affect the order of the documents; it just selects the given number of documents to process. For randomizing the document order (the line order in a file), you might want to check the shuf command, if you are not familiar with it already. We use it like this in one of our pipelines.
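For instance, a TSV corpus file can be shuffled once up front, so that any --docs-limit subset becomes a random sample instead of the file's first lines (a sketch with a tiny demo corpus; real corpus file names will differ):

```shell
# Create a tiny demo corpus: one "text<TAB>subject-URI" line per document
printf 'text one\t<http://example.org/A>\n'   >  corpus.tsv
printf 'text two\t<http://example.org/B>\n'   >> corpus.tsv
printf 'text three\t<http://example.org/C>\n' >> corpus.tsv

# Shuffle the line (= document) order; each line itself stays intact
shuf corpus.tsv > corpus-shuffled.tsv
```

After this, training with --docs-limit on corpus-shuffled.tsv uses a random subset of the documents.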

The learning-curve script as such does not support automatically selecting the run with the best docs value; maybe it could be modified to do that, but it would also increase its complexity. I think Python code might be more suitable for that.
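As a sketch of what such Python code could do, the bookkeeping part is simple: collect an NDCG score per docs-limit (e.g. by invoking annif train/eval per limit and parsing the NDCG line, which is not shown here) and pick the best limit for the final training run. The function and the example scores below are hypothetical, not real evaluation results:

```python
def best_docs_limit(ndcg_by_limit):
    """Return the --docs-limit value with the highest NDCG score.

    ndcg_by_limit: dict mapping a docs-limit (int) to its NDCG (float),
    as collected over the learning-curve loop.
    """
    return max(ndcg_by_limit, key=ndcg_by_limit.get)

# Example with made-up scores:
scores = {25000: 0.501, 50000: 0.512, 75000: 0.509}
print(best_docs_limit(scores))  # -> 50000
```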

Currently Annif does not store the number of documents used in training a project, but there is a GitHub issue about this and other project metadata. If anyone finds it valuable or has new ideas for metadata to store, please react or comment on the issue.

-Juho