Lexical models for a small vocabulary


Parthasarathi Mukhopadhyay

Jun 11, 2024, 5:33:34 AM
to Annif Users
Dear all

I cannot work out what is going wrong in the following situation (my Annif version is 1.0.2, with Python 3.10):

1. I am using a small vocabulary: the 17 SDGs of the United Nations (the first level of this list: https://metadata.un.org/sdg/?lang=en, e.g. the URI http://metadata.un.org/sdg/1 for Goal 1, No Poverty).
2. I loaded this tiny vocabulary into Annif.
3. I created two lexical projects (MLLM and STWFSA).
4. I trained both projects on a dataset of 15K records (each containing the title and abstract of a journal paper plus its SDG classification URI).
5. Both projects (MLLM and STWFSA) return no suggestions (annif suggest just returns to the command prompt).
6. Evaluation scores for both are very low against a test dataset of 1K records (which may be why the annif suggest command returns nothing).
7. I tried the Omikuji Bonsai model: everything works well, and NDCG is around 0.89.

This does not look like a technical issue in the system, because the same MLLM and STWFSA backends work as expected with a relatively large vocabulary (Homosaurus, with 2893 terms) and produce results after training on 10K records.

Any clue where to look?

Regards

Parthasarathi

Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

https://orcid.org/0000-0003-0717-9413

juho.i...@helsinki.fi

Jun 11, 2024, 6:19:32 AM
to Annif Users
Hi Parthasarathi,

On the web page of the vocabulary's first subject, the subject's skos:prefLabel is shown as "End poverty in all its forms everywhere", and no other labels are present.

MLLM works by constructing a term index from the preferred, alternate and optionally also hidden labels of a vocabulary (see the MLLM wiki page). For a subject to be suggested for a document, the subject's label (or possibly some processed form of it) needs to occur in the input document. In this case the label is quite long, and I suspect it does not occur as such in your test documents, which would explain why you get essentially no suggestions and very low eval results.
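To make the point about label matching concrete, here is a deliberately naive Python sketch. It is not MLLM's actual algorithm (MLLM uses its own text analysis and a trained model on top of the matches); it only assumes that a label "occurs" when its normalized tokens appear as a contiguous run in the document. The function names are made up for this illustration.

```python
import re


def normalize(text):
    """Lowercase and split into word tokens (a crude stand-in for real text analysis)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def label_occurs(label, document):
    """Return True if the label's tokens appear as a contiguous run in the document."""
    label_tokens = normalize(label)
    doc_tokens = normalize(document)
    n = len(label_tokens)
    return any(
        doc_tokens[i:i + n] == label_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


long_label = "End poverty in all its forms everywhere"
abstract = "Patients are paying very high prices for both IB and LPG medicines."

# A long label like this is unlikely to appear verbatim in an abstract.
print(label_occurs(long_label, abstract))
# A short label can match when the text happens to contain it.
print(label_occurs("No poverty", "Zero hunger to No poverty - the aim of FAO research"))
```

Under this simplified view, a subject can only be suggested when the document's wording overlaps with one of its labels, which is rarely the case for research abstracts and goal-style labels.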

You could check what happens if you give just a label as input to your MLLM project, for example:

    echo "End poverty in all its forms everywhere" | annif suggest <mllm-project-id>

I assume this would give a suggestion.

The SDG vocabulary subjects have links to Wikidata; maybe you could get shorter and possibly better-working labels from there, like "Sustainable Development", which is linked from the first subject?
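One possible way to act on this, sketched in Python: generate a SKOS Turtle vocabulary in which each goal carries extra skos:altLabel values, so that a lexical backend has shorter terms to match. This assumes the vocabulary is loaded from a SKOS file rather than the TSV, and the alternate labels below are hypothetical placeholders, not labels taken from the UN vocabulary or from Wikidata.

```python
PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"


def concept_to_turtle(uri, pref_label, alt_labels, lang="en"):
    """Serialize one concept with a preferred label and optional alternate labels.

    Labels are assumed not to contain double quotes or backslashes,
    so no escaping is done in this sketch.
    """
    lines = [
        f"<{uri}> a skos:Concept ;",
        f'    skos:prefLabel "{pref_label}"@{lang}',
    ]
    for alt in alt_labels:
        lines[-1] += " ;"
        lines.append(f'    skos:altLabel "{alt}"@{lang}')
    lines[-1] += " ."
    return "\n".join(lines)


# Hypothetical alternate labels, for illustration only.
turtle = concept_to_turtle(
    "https://metadata.un.org/sdg/1",
    "No poverty",
    ["poverty", "poverty reduction"],
)
print(PREFIX)
print(turtle)
```

Whether such labels actually improve the results would have to be checked with annif eval; overly generic alternate labels could also produce many false matches.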

-Juho

Parthasarathi Mukhopadhyay

Jun 11, 2024, 7:44:59 AM
to Annif Users
Hello Juho

Thanks for looking into this problem.

Let me share more details related to this pilot project so that you can get the complete picture:

1. The purpose of the project is to categorize Indian research publications (2012-2023) by their affinity with the SDGs. This will give us an idea of the research focus of different institutes in India and help us build a larger picture.

2. The SDG URIs (developed by the library at UN HQ) are used for this purpose, as no SDG ontology is available. We decided to create a subjects.tsv for this work in the following format:

<https://metadata.un.org/sdg/1> No poverty
<https://metadata.un.org/sdg/2> Zero hunger
<https://metadata.un.org/sdg/3> Good health and well-being
<https://metadata.un.org/sdg/4> Quality education

3. The load-vocab command created subjects.csv and subjects.ttl in the following formats:

uri                              notation   label_en
https://metadata.un.org/sdg/1               No poverty
https://metadata.un.org/sdg/2               Zero hunger
https://metadata.un.org/sdg/3               Good health and well-being
https://metadata.un.org/sdg/4               Quality education

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<https://metadata.un.org/sdg/1> a skos:Concept ;
    skos:notation "None" ;
    skos:prefLabel "No poverty"@en .

<https://metadata.un.org/sdg/2> a skos:Concept ;
    skos:notation "None" ;
    skos:prefLabel "Zero hunger"@en .

<https://metadata.un.org/sdg/3> a skos:Concept ;
    skos:notation "None" ;
    skos:prefLabel "Good health and well-being"@en .

<https://metadata.un.org/sdg/4> a skos:Concept ;
    skos:notation "None" ;
    skos:prefLabel "Quality education"@en .

3. The structure of our training dataset (around 15K records categorized manually so far; the target is around 100K):

Modeling of Packed Bed Column for the Removal of Cu (II) Ions from Aqueous Solution by Indion 730 %% Background: Acid mine / Copper mine drainage is one of the major source of Cu(II) ions. Acid mine or copper mine drainage occurs naturally within environments, but it is intensify by large scale earth disturbances found with copper mining or exhausted leach heap operations, as a result of generation of metal ions. These metal ions decomposed and produce an acidic waste stream which contains dissolved copper, iron, aluminium and zinc. Objective: The aim of this work was to study the equilibrium and performance of the Indion 730 (strong acid cation exchange resin). Indion 730 resins used as an adsorbent for removal of Cu (II) from Acid Mine Drainage (AMD). Present study also highlights the evaluation of sorption capacity of Indion730as an ion exchangers. A breakthrough curve was used to observe the effectiveness of packed bed column for removal of Cu (II) ions. The Profile of breakthrough curve and time intended for development of breakthrough curve are vital characteristics for determining the process and response of fixed bed column. Method: In the present study, the kinetics of fixed bed column has been tested for Clark’s model. Clark proposed a new simulation of the development of breakthrough curves. Clark model is based on the hypotheses that uses mass-transfer concept in combination with the Freundlich isotherm. Results: The result of this study has shown best testing of experimental breakthrough curves by the Clark kinetic equation has shown outstanding matching of experimental values with modeled curves. The linear method of solving of the equation gives matching values of parameters A and r, which suggests similar values of the sorption rate coefficient and removal capacity. Conclusion: The Clark kinetic models have been found to be suitable for the removal of Cu (II) on Indion 730 for various experimental conditions. 
On the basis of the premeditated parameters, the theoretical breakthrough curves have been plotted and compared with the experimental values. Keywords: Acid Mine Drainage (AMD), breakthrough curve, cation exchange resin, Clark equation, Cu (II), Indion 730. <https://metadata.un.org/sdg/6>
Assessment Of Prices Of Essential Medicines For Chronic Diseases Prevalent In The Asia Pacific Region %% To assess the prices of essential medicines for chronic diseases prevalent in the Asia Pacific Region. A secondary analysis of medicine prices data from the World Health Organization/Health Action International’s database on medicine prices, availability and affordability was undertaken in March - May 2016. Data on price of 18 medicines used for chronic diseases prevalent in the older population were obtained from facility-based surveys conducted between 2001 and 2013 in 11 countries, namely China, Fiji, India, Indonesia, Lao, Malaysia, Mongolia, the Philippines, Sri Lanka, Thailand and Vietnam. Prices were converted into the base year of 2015. Patient prices were adjusted for inflation and purchasing power parity, and procurement prices for inflation and official exchange rates. Data were analysed for lowest priced generic (LPG) and innovator brand (IB) products in both public and private sectors. Outcome measures were median (range) price ratios to international reference price (IRP). The median (range) procurement price for IBs were found be highest in the Philippines [23.39 (7.24-106.43)] and lowest in Malaysia [4.05 (1.13-56.77)]; and for LPGs highest in Mongolia [2.71 (1.43-22.73)] and lowest in India [0.36 (0.23-2.2)]. Patient price in public sector for IBs were found be highest in the Philippines [79.13 (12.05-380.08)] and nil in Malaysia as it is providing freely; and for LPGs highest in the Philippines [32.88 (14.19-53.93)] and nil in Malaysia and India as they are providing freely. Patient price in private sector for IBs were found be highest in Indonesia [150.03 (15.53-329.28)] and lowest in India [12 (1.39-29.73)]; and for LPGs highest in the Philippines [46.21 (9.76-140.86)] and lowest in China [0.92 (0.46-9.14)]. Procurement price of essential medicines for chronic conditions were high in Asia Pacific Region compared to IRP, especially for IBs. 
Patients are paying very high prices for both IB and LPG medicines, especially in private sector. <https://metadata.un.org/sdg/17>
Concrete Incorporated With Silica Fume And Manufactured Sand Under Acidic Nature %% Now a dayAcid attack is growing threat to concrete structures. Using some alternate materials as partial replacement of cement and sand in the concrete is an important factor that enhances the performance of concrete in an aggressive environment. In this present study, the cement is replaced with silica fume by 10 % and the sand is replaced with manufactured sand up to 50%. The selection of these replacement materials involves a balance between economy and durability.The specimens are immersed in 2% concentration of both acids such as sulphuric acid and hydro chloric acid at 28 days. The grade of concrete is M30. From the results, it is observed that the concrete containing 10% silica fume and 40% manufactured sand has the better performance in acid environment. <https://metadata.un.org/sdg/15>
Pleurodesis in pulmonary Langerhans cell histiocytosis in children – A life saving measure <https://metadata.un.org/sdg/8>
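For readers who want to inspect records like these programmatically, here is a minimal Python sketch of a parser. It assumes that the text and the subject URIs are separated by a tab character (the tab is not visible in the samples above) and that " %% " separates title and abstract, as in the samples. The function name and the returned dictionary keys are made up for this illustration.

```python
import re


def parse_record(line):
    """Split one corpus line into title, abstract and a list of subject URIs."""
    # Assumption: the first tab separates the document text from the subjects.
    text, _, subject_part = line.rstrip("\n").partition("\t")
    # Title-only records have no " %% " marker, so the abstract stays empty.
    title, _, abstract = text.partition(" %% ")
    uris = re.findall(r"<([^>]+)>", subject_part)
    return {"title": title.strip(), "abstract": abstract.strip(), "uris": uris}


sample = (
    "Concrete Incorporated With Silica Fume And Manufactured Sand Under Acidic Nature"
    " %% The grade of concrete is M30."
    "\t<https://metadata.un.org/sdg/15>\n"
)
record = parse_record(sample)
print(record["title"])
print(record["uris"])
```

A quick pass with such a parser can reveal records without an abstract or without a subject URI before training.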

4. The eval command produced the following results (against a test dataset of 2.5K records) for MLLM and STWFSA:

MLLM

Precision (doc avg):           0.0019
Recall (doc avg):             0.0019
F1 score (doc avg):           0.0019
Precision (subj avg):         0.0915
Recall (subj avg):             0.0154
F1 score (subj avg):           0.0200
Precision (weighted subj avg): 0.0339
Recall (weighted subj avg):   0.0019
F1 score (weighted subj avg): 0.0030
Precision (microavg):         0.2941
Recall (microavg):             0.0019
F1 score (microavg):           0.0038
F1@5:                         0.0019
NDCG:                         0.0019
NDCG@5:                       0.0019
NDCG@10:                       0.0019
Precision@1:                   0.0019
Precision@3:                   0.0019
Precision@5:                   0.0019
True positives:               5
False positives:               12
False negatives:               2640
Documents evaluated:           2596

STWFSA

Precision (doc avg):           0.0004
Recall (doc avg):             0.0004
F1 score (doc avg):           0.0004
Precision (subj avg):         0.0588
Recall (subj avg):             0.0009
F1 score (subj avg):           0.0017
Precision (weighted subj avg): 0.0261
Recall (weighted subj avg):   0.0004
F1 score (weighted subj avg): 0.0007
Precision (microavg):         1.0000
Recall (microavg):             0.0004
F1 score (microavg):           0.0008
F1@5:                         0.0004
NDCG:                         0.0004
NDCG@5:                       0.0004
NDCG@10:                       0.0004
Precision@1:                   0.0004
Precision@3:                   0.0004
Precision@5:                   0.0004
True positives:               1
False positives:               0
False negatives:               2644
Documents evaluated:           2596
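The micro-averaged figures in these two reports follow directly from the true positive, false positive and false negative counts. The short Python sketch below recomputes them with the standard formulas (the function name is made up for this illustration), which also makes it visible how few suggestions the lexical projects produced in total: 17 for MLLM and 1 for STWFSA across 2596 documents.

```python
def micro_scores(tp, fp, fn):
    """Micro-averaged precision, recall and F1 from pooled counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Counts copied from the eval reports above: (TP, FP, FN).
reports = {"MLLM": (5, 12, 2640), "STWFSA": (1, 0, 2644)}

for name, counts in reports.items():
    p, r, f1 = micro_scores(*counts)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# MLLM: precision=0.2941 recall=0.0019 f1=0.0038
# STWFSA: precision=1.0000 recall=0.0004 f1=0.0008
```

So the perfect micro precision of STWFSA only reflects a single correct suggestion; recall is the number that shows the problem.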

5. You are right: when I give a short fictitious title, both work:

echo "Zero hunger to No poverty - the aim of FAO research" | annif suggest sdg-mllm

<https://metadata.un.org/sdg/1> No poverty 0.5265
<https://metadata.un.org/sdg/2> Zero hunger 0.4800

echo "Zero hunger to No poverty - the aim of FAO research" | annif suggest sdg-stwfsa

<https://metadata.un.org/sdg/1> No poverty 0.4667
<https://metadata.un.org/sdg/2> Zero hunger 0.4667


But for real-life records they return nothing (even for records from the training dataset):

echo "Molecular marker systems with special reference to the Silkworm Bombyx mori L. %% Study on genetic diversity is critical to success in any crop breeding and it provides information about the quantum of genetic divergence and serves a platform for specific breeding objectives. Genetic diversity is a particular concern because greater genetic uniformity in silkworm can increase vulnerability to pests and diseases. Hence maintenance of genetic diversity is a fundamental component in long term management strategies for genetic improvement of silkworm which is cultivated by millions of people around the worlds for its lustrous silk. In view of the present study, genetic diversity studies carried out in silkworm using divergent methods (Quantitative traits, biochemical and molecular markers) and present level of diversity, pertaining to the literature has been reviewed. Genetic diversity is the genetic variation within species, both among geographically separated populations and among individuals within a single population. Genetic diversity is an essential aspect in conservation biology because a fundamental concept of natural selection states that the rate of evolutionary change in a population is proportional to the amount of genetic diversity present in it. Decreasing genetic diversity increases the extinction risk of populations due to a decline in fitness. Therefore, both biochemical and molecular markers have recently been employed to estimate the extent of genetic diversity present among various types of silkworm strains such as mono-, bi and multivoltines present in China, Japan, Korea, India, and several other countries." | annif suggest sdg-stwfsa

(annif-venv) roshni@roshni-HP-Pavilion-Laptop-14-dv0xxx:~/annif$

6. The test results from an associative model (Omikuji Bonsai) show the opposite picture:

Precision (doc avg):           0.1001
Recall (doc avg):             0.9821
F1 score (doc avg):           0.1814
Precision (subj avg):         0.0978
Recall (subj avg):             0.9827
F1 score (subj avg):           0.1711
Precision (weighted subj avg): 0.1473
Recall (weighted subj avg):   0.9822
F1 score (weighted subj avg): 0.2493
Precision (microavg):         0.1002
Recall (microavg):             0.9822
F1 score (microavg):           0.1818
F1@5:                         0.3215
NDCG:                         0.8987
NDCG@5:                       0.8886
NDCG@10:                       0.8987
Precision@1:                   0.8093
Precision@3:                   0.3124
Precision@5:                   0.1939
True positives:               2598
False positives:               23332
False negatives:               47
Documents evaluated:           2596
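A side note on reading this report: the low precision next to the high recall is largely an effect of how many suggestions were counted per document. The Python sketch below derives this from the counts in the report. That the roughly ten suggestions per document come from the evaluation's default limit is my assumption, not something stated in the report; the function name is made up for this illustration.

```python
def suggestion_stats(tp, fp, fn, docs):
    """Derive per-document averages and the precision ceiling from pooled counts."""
    suggested = tp + fp   # every suggestion is either a true or a false positive
    gold = tp + fn        # every gold label is either found or missed
    return {
        "suggestions_per_doc": suggested / docs,
        "gold_labels_per_doc": gold / docs,
        # Even a perfect ranking cannot exceed this micro precision,
        # because there are far more suggestions than gold labels.
        "max_micro_precision": gold / suggested,
    }


# Counts copied from the Omikuji Bonsai report above.
stats = suggestion_stats(tp=2598, fp=23332, fn=47, docs=2596)
for key, value in stats.items():
    print(f"{key}: {value:.4f}")
# suggestions_per_doc: 9.9884
# gold_labels_per_doc: 1.0189
# max_micro_precision: 0.1020
```

With about one gold label and about ten suggestions per document, a micro precision near 0.10 is close to the best possible value, so NDCG and Precision@1 are the more informative figures here. Evaluating with a score threshold, as in the suggest command below, should give a more realistic precision.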

echo "A flexible security architecture for mobile data offloading %% The use of complementary network technologies for delivering data, originally targeted for transmission over cellular networks, in order to save money and relieve the mobile telephony network is known as mobile-WIFI offloading. In the proposed model, the double layered data encryption model has been designed for the security of data which is travelling across the communication links between mobile network and Wi-Fi network. The combination of advance encryption standard with RSA algorithm has been designed for the security of data. The various performance parameters has been evaluated such as network load, transmission delay, packet loss and throughput." | annif suggest sdg-omikujiB --threshold 0.5

2024-06-11T11:36:55.479Z INFO [omikuji::model] Loading model from data/projects/sdg-omikujiB/omikuji-model...
2024-06-11T11:36:55.479Z INFO [omikuji::model] Loading model settings from data/projects/sdg-omikujiB/omikuji-model/settings.json...
2024-06-11T11:36:55.479Z INFO [omikuji::model] Loaded model settings Settings { n_features: 89833, classifier_loss_type: Hinge }...
2024-06-11T11:36:55.479Z INFO [omikuji::model] Loading tree from data/projects/sdg-omikujiB/omikuji-model/tree0.cbor...
2024-06-11T11:36:55.486Z INFO [omikuji::model] Loading tree from data/projects/sdg-omikujiB/omikuji-model/tree1.cbor...
2024-06-11T11:36:55.492Z INFO [omikuji::model] Loading tree from data/projects/sdg-omikujiB/omikuji-model/tree2.cbor...
2024-06-11T11:36:55.499Z INFO [omikuji::model] Loaded model with 3 trees; it took 0.02s

<https://metadata.un.org/sdg/9> Industry, Innovation and Infrastructure 0.7995

7. If this is the situation, should we follow this strategy?

discard lexical models >> use all associative models (TF-IDF, FastText, Omikuji Bonsai, Omikuji parabel) >> Join these models in Simple Ensemble >> Derive weightage formula through hyperparameter optimization against a validation dataset >> use that weightage formula in creating Neural Network model >> obtain eval scores for all models >> try successive learning of the NN model to increase eval score of the model

Regards

Parthasarathi


juho.i...@helsinki.fi

Jun 11, 2024, 11:47:14 AM
to Annif Users
Thank you for the details! Sounds like an interesting project.

> discard lexical models >> use all associative models (TF-IDF, FastText, Omikuji Bonsai, Omikuji parabel) >> Join these models in Simple Ensemble >> Derive weightage formula through hyperparameter optimization against a validation dataset >> use that weightage formula in creating Neural Network model >> obtain eval scores for all models >> try successive learning of the NN model to increase eval score of the model

Yes, I think this is the right approach. For starters, just FastText and either Omikuji variant combined in an ensemble could be enough; TF-IDF probably won't help much.
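As a rough illustration of what combining base projects means, here is a small Python sketch that merges per-subject scores from several projects by weighted averaging. It is a simplified stand-in, not Annif's implementation, and the project names, weights and scores are hypothetical.

```python
def combine(scores_by_project, weights):
    """Weighted average of subject scores over all base projects.

    A subject missing from a project's suggestions counts as score 0.0 there.
    """
    total_weight = sum(weights[project] for project in scores_by_project)
    combined = {}
    for project, scores in scores_by_project.items():
        share = weights[project] / total_weight
        for uri, score in scores.items():
            combined[uri] = combined.get(uri, 0.0) + share * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical suggestions for one document from two base projects.
scores_by_project = {
    "sdg-fasttext": {"https://metadata.un.org/sdg/9": 0.6,
                     "https://metadata.un.org/sdg/4": 0.2},
    "sdg-omikuji": {"https://metadata.un.org/sdg/9": 0.8},
}
weights = {"sdg-fasttext": 1, "sdg-omikuji": 3}

ranking = combine(scores_by_project, weights)
for uri, score in ranking:
    print(uri, round(score, 4))
```

Hyperparameter optimization against a validation set then amounts to searching for the weights that give the best evaluation score.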

Good luck with the project! Summer vacations are starting soon in Finland, so at some point we probably cannot respond very promptly, but it would be nice to hear what kind of results you get.
-Juho

juho.i...@helsinki.fi

Jun 12, 2024, 4:31:08 AM
to Annif Users
Forgot to mention that, as you are doing multiclass classification (i.e. there is only one "true class" per document), not multilabel classification (where there can be multiple "true classes/labels"), you could also try adding SVC as a base project to the ensemble.
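In a multiclass setting only the top-ranked suggestion matters, so the evaluation reduces roughly to accuracy of the first suggestion (which is what Precision@1 measures when each document has a single true class). A minimal Python sketch, with hypothetical suggestions and made-up function names:

```python
def top1(suggestions):
    """Return the URI with the highest score, or None if there are no suggestions."""
    if not suggestions:
        return None
    return max(suggestions, key=lambda pair: pair[1])[0]


def accuracy(predicted, gold):
    """Share of documents whose single predicted class equals the single true class."""
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold) if gold else 0.0


# Hypothetical suggestions for two documents, as (URI, score) pairs.
doc_suggestions = [
    [("https://metadata.un.org/sdg/9", 0.80), ("https://metadata.un.org/sdg/4", 0.10)],
    [("https://metadata.un.org/sdg/3", 0.35), ("https://metadata.un.org/sdg/17", 0.40)],
]
gold = ["https://metadata.un.org/sdg/9", "https://metadata.un.org/sdg/3"]

predicted = [top1(s) for s in doc_suggestions]
print(accuracy(predicted, gold))  # 0.5: the first document is right, the second is not
```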

Also, the DDC classification article by Osma could offer some valuable insights, because it deals with quite a similar problem.

-Juho

Parthasarathi Mukhopadhyay

Jun 12, 2024, 5:16:09 AM
to Annif Users
Thanks again, Juho, for all the help.

I'm presently going through Osma's article.

I'll explore SVC and report back all the results.

Meanwhile, happy summer vacation.

Regards

Parthasarathi
