MeSH - issue with Annif


Parthasarathi Mukhopadhyay

Jul 9, 2022, 11:58:05 AM
to Annif Users
We are using Annif version 0.57.0 on Ubuntu 22.04 (Python 3.8.13).

We have prepared a comprehensive training dataset in the format Annif accepts, based on the Medline/PubMed dataset available in XML format, without much trouble.

Since MeSH is available for download in NT format (from https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/2022/), we thought it would be a cakewalk to load it into Annif with loadvoc, but after a day-long struggle we still have not managed it.

We first tried to load MeSH directly in NT format (extracted from the zipped file at the above link), but this failed because the terminal closed abruptly (we tried three times). We then decided to run it through Skosify first, to convert the NT file to TTL.

We ran skosify with the following options: --label "MeSH-2022" --eliminate-redundancy. It produced the following messages and failed to output a TTL file:

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#AllowedDescriptorQualifierPair

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#CheckTag

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#Concept

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#DisallowedDescriptorQualifierPair

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#GeographicalDescriptor

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#PublicationType

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#Qualifier

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#SCR_Chemical

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#SCR_Disease

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#SCR_Organism

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#SCR_Protocol

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#Term

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#TopicalDescriptor

INFO: Don't know what to do with type http://id.nlm.nih.gov/mesh/vocab#TreeNumber

INFO: No skos:ConceptScheme or owl:Ontology found. Using namespace auto-detection for creating concept scheme.

CRITICAL: Namespace auto-detection failed. Set namespace using the --namespace option.


A little study suggests that the problem is deep-rooted, as MeSH has its own SKOS standard.

Is there any workaround?

Best regards

Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

juho.i...@helsinki.fi

Jul 13, 2022, 5:27:28 AM
to Annif Users
Hi Parthasarathi!

It seems that running skosify on mesh2022.nt requires the --namespace option, although I'm not sure what the namespace value should be for MeSH.

By the way, I tried loading the mesh2022.nt vocabulary directly into Annif, and it succeeded. But it seemed to need quite a lot of memory, I think something like 15 GB at least (I did not actually measure it, just noticed it at some point with the top command). Maybe your terminal got closed because of memory problems(?).
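
For anyone who wants an actual number rather than a glance at top, here is a minimal Python sketch that runs a command and reports the peak resident memory of its child process. It assumes Linux, where `ru_maxrss` is reported in kilobytes. The helper name `peak_memory_mb`, and the project id and file name in the usage comment, are placeholders and not taken from anyone's actual setup.

```python
import resource
import subprocess
import sys


def peak_memory_mb(cmd):
    """Run cmd and return (exit code, peak memory in MB).

    The figure is the largest resident set size among the child
    processes that have terminated so far. Assumes Linux, where
    ru_maxrss is reported in kilobytes.
    """
    result = subprocess.run(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return result.returncode, usage.ru_maxrss / 1024


# Hypothetical usage (project id and file name are placeholders):
#   code, peak = peak_memory_mb(["annif", "loadvoc", "mesh", "mesh2022.nt"])
#   print(f"exit code {code}, peak memory about {peak:.0f} MB")
```

A non-zero exit code together with a peak close to the machine's total RAM would point towards the process being killed for lack of memory.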

-Juho

Parthasarathi Mukhopadhyay

Jul 14, 2022, 7:38:31 AM
to Annif Users
Hello Juho

Thanks for the clue. It was a memory issue.

Previously, we were using an i7 machine with 16 GB RAM running Ubuntu 22.04, and a memory-intensive program was running alongside the Annif venv.

This time, following your clue, we started the Annif venv on the same machine after closing all other programs, and ran the loadvoc command on the MeSH vocabulary.

We watched memory usage closely with top/htop: within 2 minutes it reached 15.7 GB, leaving only 200 MB free.

It worked for us this time, but took almost 90 minutes.

Thanks for the guidance

Best regards

--
You received this message because you are subscribed to the Google Groups "Annif Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to annif-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/annif-users/3529cbc9-3c9b-4ba5-b8af-8fa64f97cc2cn%40googlegroups.com.

Osma Suominen

Aug 1, 2022, 4:53:03 AM
to annif...@googlegroups.com
Hello Parthasarathi,

please note that the MeSH download that NLM provides is not in SKOS but
in a custom RDF data model ("MeSH vocabulary"). My guess is that Annif
cannot find any subjects in there even if it seems that the loadvoc
operation will complete without errors. Please check the subjects file
under data/vocabs/mesh (or whatever you used as the vocabulary id) - it
should contain one line per subject, so tens of thousands of lines. If
it is empty, then you don't have any subjects!
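
The check described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_subjects` is made up, and the path in the usage comment assumes the vocabulary id is "mesh" and that the file is literally named "subjects", so adjust it to whatever exists under your data directory.

```python
from pathlib import Path


def count_subjects(subjects_path):
    """Count the non-empty lines in a vocabulary subjects file.

    Returns 0 if the file does not exist.
    """
    path = Path(subjects_path)
    if not path.is_file():
        return 0
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


# Hypothetical usage (path is an assumption, see above):
#   print(count_subjects("data/vocabs/mesh/subjects"))
```

A result of 0 would mean the vocabulary was loaded without any subjects.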

You may want to use an alternative, SKOS version of MeSH. For example,
Finto provides one here: https://finto.fi/mesh/en/
(you can find the download links at the bottom of the page)

Best,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Parthasarathi Mukhopadhyay

Aug 1, 2022, 2:33:11 PM
to Annif Users
Hello Osma

Thanks for pointing out the issue, and for the right version of the MeSH TTL file.

Actually, we were able to loadvoc the MeSH ttl file into Annif after the initial hiccup with the memory issue, but it never worked with our training dataset built from PubMed. Now I understand the reason.

We tried to solve this by creating a subject key file in TSV format, based on the PubMed XML download (https://www.nlm.nih.gov/databases/download/pubmed_medline.html) processed in OpenRefine, like this:


<http://id.nlm.nih.gov/mesh/D000255> Adenosine Triphosphate
<http://id.nlm.nih.gov/mesh/D000273> Adipose Tissue
<http://id.nlm.nih.gov/mesh/D000317> Adrenergic alpha-Antagonists
<http://id.nlm.nih.gov/mesh/D000319> Adrenergic beta-Antagonists
<http://id.nlm.nih.gov/mesh/D000818> Animals
<http://id.nlm.nih.gov/mesh/D004837> Epinephrine
<http://id.nlm.nih.gov/mesh/D006593> Hexokinase
<http://id.nlm.nih.gov/mesh/D066298> In Vitro Techniques
<http://id.nlm.nih.gov/mesh/D008156> Luciferases

......
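
For readers who prefer a script to OpenRefine, a file like the one above could be produced roughly as follows. This is a sketch under assumptions, not the procedure we actually used: it assumes each line is a URI in angle brackets, a tab, then the label, and that the (URI, label) pairs have already been extracted from the PubMed XML. The function names are hypothetical.

```python
def format_subject_line(uri, label):
    """Return one vocabulary line: <URI>, a tab, then the label."""
    return f"<{uri}>\t{label.strip()}"


def write_subject_tsv(subjects, out_path):
    """Write unique (uri, label) pairs to out_path, sorted by URI.

    If a URI occurs more than once, the last label seen wins.
    Returns the number of lines written.
    """
    unique = dict(subjects)
    with open(out_path, "w", encoding="utf-8") as handle:
        for uri in sorted(unique):
            handle.write(format_subject_line(uri, unique[uri]) + "\n")
    return len(unique)
```

Deduplicating by URI matters here because the same descriptor appears in many PubMed records.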

We then trained the mesh project on a dataset of 3 million records (as a test case). The pilot project works for us with this approach.

Today we tried the MeSH TTL you pointed out. We needed to change one statement in the ttl file, from @prefix mesh: <http://www.yso.fi/onto/mesh/> . to @prefix mesh: <http://id.nlm.nih.gov/mesh/> .

as our training dataset format is like this:


[Biochemical studies on camomile components/III. In vitro studies about the antipeptic activity of (--)-alpha-bisabolol (author's transl)]. ¤ (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which is not caused by an alteration of the pH-value. The proteolytic activity of pepsin is reduced by 50 percent through addition of bisabolol in the ratio of 1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact. In case of a previous contact with the substrate, the inhibiting effect is lost.
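
The prefix change mentioned above can also be scripted. A minimal sketch, assuming the declaration appears in the file exactly as quoted (same spacing) and that the concepts are written with the mesh: prefix rather than as full URIs; the function name is made up for illustration.

```python
OLD_PREFIX = "@prefix mesh: <http://www.yso.fi/onto/mesh/> ."
NEW_PREFIX = "@prefix mesh: <http://id.nlm.nih.gov/mesh/> ."


def rewrite_mesh_prefix(ttl_text):
    """Swap the mesh: prefix declaration in Turtle text.

    Raises ValueError if the expected declaration is not present,
    so a silently unchanged file is not mistaken for success.
    """
    if OLD_PREFIX not in ttl_text:
        raise ValueError("expected mesh prefix declaration not found")
    return ttl_text.replace(OLD_PREFIX, NEW_PREFIX, 1)


# Hypothetical usage (file names are placeholders):
#   with open("mesh-skos.ttl", encoding="utf-8") as src:
#       text = rewrite_mesh_prefix(src.read())
#   with open("mesh-skos-nlm.ttl", "w", encoding="utf-8") as dst:
#       dst.write(text)
```

Any URIs written out in full inside the file would not be affected by this and would need a separate replacement.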


It's rocking now. 

Thanks a lot.

Best regards




