Issues with LCSH


Parthasarathi Mukhopadhyay

Aug 13, 2022, 10:15:55 AM
to Annif Users
Dear all

I am reporting an issue I ran into while setting up a project with LCSH as the vocabulary.

I downloaded LCSH in Turtle format (around 41 MB) directly from the LoC site and loaded it into Annif as is. Loading went smoothly, but during training the process reported many unknown URI errors. This means that my training dataset contains URIs that are not available in the loaded vocabulary. I was quite sure that my training dataset was encoded correctly, so I looked at the subject file that the loadvoc process had created to find the reason for the mismatch. To my surprise, the subject file under the voc folder includes only around 280K URIs with their corresponding descriptors, although an earlier analysis had shown me that LCSH has 500K+ subject descriptors.
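A mismatch of this kind can be checked directly by comparing the URIs in the training data with those in the subject file. The following is only a minimal sketch: it assumes that each subject file line starts with a URI in angle brackets, and that each training line holds the document text, a tab, and then the subject URIs in angle brackets. The function names and the example.org URIs are made up for illustration.

```python
import re

URI_PATTERN = re.compile(r"<([^<>\s]+)>")

def vocab_uris(subject_lines):
    """Collect the URI at the start of each subject file line."""
    uris = set()
    for line in subject_lines:
        match = URI_PATTERN.match(line)
        if match:
            uris.add(match.group(1))
    return uris

def unknown_uris(corpus_lines, known):
    """Return URIs used in the training data but missing from the vocabulary."""
    missing = set()
    for line in corpus_lines:
        # Assumed layout: text, a tab, then one or more <URI> tokens.
        _, _, subjects = line.rstrip("\n").partition("\t")
        for uri in URI_PATTERN.findall(subjects):
            if uri not in known:
                missing.add(uri)
    return missing

# Tiny inline sample; with real data, pass open file objects instead.
subjects = ["<http://example.org/s1>\tCats\n", "<http://example.org/s2>\tDogs\n"]
corpus = [
    "A text about cats\t<http://example.org/s1>\n",
    "A text about birds\t<http://example.org/s3>\n",
]
missing = unknown_uris(corpus, vocab_uris(subjects))
print(sorted(missing))
```

With real files, the size of the returned set gives a quick idea of how much of the training data refers to subjects that the vocabulary does not contain.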

I decided to run loadvoc again after Skosifying LCSH with the Skosify tool. The result was almost the same: around 278K URI-descriptor pairs. Then I tested with the LCSH TTL file prepared by Jim Hahn and made available for use by others. The result was the same.

A closer look at the subject files created in all three experiments reveals a common error. In some places, the loadvoc process failed to load URIs and descriptors properly, and in all these cases there were double quotes (" ") inside the subject descriptors, for example: Friends of God ("Gottesfreunde") | Tovarishchestvo "Iskusstvo ili smertʹ" (Group of artists) | Latvia--Kolhozs "Nākotne"

In all such cases, the loadvoc process created malformed rows such as <URI> | Descriptor URI Descriptor and so on.

In some rows the descriptor column even contained around 5000 further URIs (<URI> | another 5000 URIs). There were a lot of these: about 144 rows, each with around 3000 URIs lost on average.
Now I understand why almost half of the descriptors were lost after the loadvoc process. It cannot handle " " if present in the strings of subject descriptors.
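The symptom described here (several URIs ending up in one row) can be looked for in the raw file. This is a minimal sketch, assuming one `<URI>` token is expected per line; `scan_subject_lines` and the sample lines are hypothetical. Reading the file as plain lines means double quotes get no special treatment, so the counts reflect what is actually stored on each line.

```python
import re

URI_PATTERN = re.compile(r"<[^<>\s]+>")

def scan_subject_lines(lines):
    """Count raw lines and report those holding more than one <URI> token."""
    total = 0
    merged = []  # (line number, number of URIs found on that line)
    for number, line in enumerate(lines, start=1):
        total += 1
        uri_count = len(URI_PATTERN.findall(line))
        if uri_count > 1:
            merged.append((number, uri_count))
    return total, merged

# Made-up sample: the second line carries two URIs instead of one.
sample = [
    '<http://example.org/s1>\tFriends of God ("Gottesfreunde")\n',
    '<http://example.org/s2>\tFirst label <http://example.org/s3>\tSecond label\n',
]
total, merged = scan_subject_lines(sample)
print(total, merged)
```

For a real file, `scan_subject_lines(open(path, encoding="utf-8"))` would give the raw line count and the positions of any rows that contain more than one URI.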
 
I took the most comprehensive subject file, with 280K rows (as produced by the loadvoc process), and cleaned it in OpenRefine. I now have 503K+ descriptors, but there are still problems with 214 rows where " " is present in the descriptor (the loadvoc process reported that it cannot handle the URI and that the TTL file cannot be generated). So I finally excluded these 214 rows and used the resulting TSV file to load the vocabulary. This time it went smoothly. Everything is working, and no unknown URIs were reported during training.
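The exclusion step can also be done with a short script instead of by hand. The sketch below only illustrates the idea and is not the exact procedure used here: it assumes a tab-separated file with the URI in the first column and the descriptor in the second, and `split_on_quotes` is a hypothetical helper.

```python
def split_on_quotes(rows):
    """Separate tab-separated rows by whether the label contains a double quote."""
    clean, quoted = [], []
    for row in rows:
        uri, _, label = row.rstrip("\n").partition("\t")
        (quoted if '"' in label else clean).append((uri, label))
    return clean, quoted

# Made-up sample rows; with real data, iterate over an open TSV file.
rows = [
    "<http://example.org/s1>\tCats\n",
    '<http://example.org/s2>\tLatvia--Kolhozs "Nākotne"\n',
]
clean, quoted = split_on_quotes(rows)
print(len(clean), len(quoted))
```

The `quoted` list corresponds to the rows set aside for further analysis, and `clean` to the rows kept for loading.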
 
If anyone is interested, I have attached a TSV file (the 214 rows with " " in the descriptor column) for further analysis.




Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

LCSH-214rows.tsv

Osma Suominen

Aug 16, 2022, 3:53:16 AM
to annif...@googlegroups.com
Hello Parthasarathi!

Thank you for the report. It looks like a serious issue if so many LCSH
descriptors are getting lost.

I tried reproducing what you did. This is on a server running Ubuntu
Linux 20.04.

I downloaded the LCSH file from id.loc.gov (subjects.skosrdf.ttl.gz, 42
MB, dated yesterday) and uncompressed it into subjects.skosrdf.ttl. Then
I installed Annif 0.58 (the most recent release) into a virtual
environment, set up a minimal tfidf project and loaded the vocabulary:

$ /usr/bin/time -v annif loadvoc lcsh-tfidf-en subjects.skosrdf.ttl
Command being timed: "annif loadvoc lcsh-tfidf-en subjects.skosrdf.ttl"
User time (seconds): 1667.52
System time (seconds): 65.98
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 28:49.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 34447692
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 88510792
Voluntary context switches: 403
Involuntary context switches: 1528122
Swaps: 0
File system inputs: 8
File system outputs: 1566280
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

This all went fine. Then I checked the length of the subjects file
created by the loadvoc process:

$ wc -l data/vocabs/lcsh/subjects
505449 data/vocabs/lcsh/subjects

The length is 500k+ lines. So it seems that a bit more than 500k
subjects have been successfully loaded. I also verified the problematic
URIs you mentioned:

$ grep 'http://id.loc.gov/authorities/subjects/sh85051990' data/vocabs/lcsh/subjects
<http://id.loc.gov/authorities/subjects/sh85051990> Friends of God ("Gottesfreunde")

$ grep 'http://id.loc.gov/authorities/subjects/sh2010012828' data/vocabs/lcsh/subjects
<http://id.loc.gov/authorities/subjects/sh2010012828> Tovarishchestvo "Iskusstvo ili smertʹ" (Group of artists)

$ grep 'http://id.loc.gov/authorities/subjects/sh92002692-781' data/vocabs/lcsh/subjects
<http://id.loc.gov/authorities/subjects/sh92002692-781> Latvia--Kolhozs "Nākotne"


All three seem to have been properly loaded into the subjects file. I
also did a visual check of some of the file contents but couldn't find
anything suspicious.

Can you compare notes and see if there's anything you did differently?

You mentioned that the loadvoc process created e.g. "5000 URIs in the
descriptor column". Can you post the file somewhere so that I could take
a look?

Best,
Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.s...@helsinki.fi
http://www.nationallibrary.fi

Parthasarathi Mukhopadhyay

Aug 16, 2022, 1:23:38 PM
to Annif Users
Hello Osma

Thank you from the bottom of my heart for such a detailed answer to help an Annif beginner.

It's now working just fine (as expected).

$ wc -l annif/data/vocabs/lcsh-en/subjects
505942 annif/data/vocabs/lcsh-en/subjects

A few additional lines were created, possibly because I Skosified the LCSH TTL file before loading it into Annif. These lines in the final subjects file (all deprecated headings in LCSH) look like this:

<nb30ff4b44bec41e486bf500a3b82b715b1105102> Marquesas Islands (French Polynesia)
<nb30ff4b44bec41e486bf500a3b82b715b1106274> Lau Province (Fiji)
<nb30ff4b44bec41e486bf500a3b82b715b1106960> Cyclades (Greece)
<nb30ff4b44bec41e486bf500a3b82b715b1107904> Nasu Jinja (Ōtawara-shi, Japan)
<nb30ff4b44bec41e486bf500a3b82b715b1108690> Balearic Islands (Spain)

But overall it is now working great.

I think I made a mistake earlier during Skosifying. I am trying to reproduce the error and the resultant subjects file again for sharing with you.


Thanks and best regards

