I downloaded LCSH in Turtle format (around 41 MB) directly from the LoC site and loaded it into Annif as-is. Loading went smoothly, but during training the process reported many unknown-URI errors. This means my training dataset contains URIs that are not available in the loaded vocabulary. I was quite sure that my training dataset was encoded correctly, so I decided to look at the subject file that the loadvoc process had created to find the reason for the mismatch. To my surprise, the subject file under the voc folder contains only around 280K URIs with their corresponding descriptors, whereas an earlier analysis had shown me that LCSH has 500K+ subject descriptors.
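For anyone who wants to reproduce this kind of check, here is a rough sketch of how the unknown URIs could be listed by comparing the training URIs against the subject file. It assumes each subject file line looks like `<URI>`, a tab, then the descriptor; the function names and the example.org URIs are made up for illustration and are not part of Annif.

```python
def load_vocab_uris(lines):
    """Collect URIs from lines assumed to look like '<URI>\\tdescriptor'."""
    uris = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # First tab-separated field is assumed to be the URI in angle brackets.
        uris.add(line.split("\t", 1)[0].strip("<>"))
    return uris


def unknown_uris(vocab_lines, training_uris):
    """Return training URIs that do not appear in the vocabulary, sorted."""
    return sorted(set(training_uris) - load_vocab_uris(vocab_lines))


# Hypothetical sample data, not real LCSH identifiers.
vocab = [
    "<http://example.org/subjects/s1>\tExample heading A",
    "<http://example.org/subjects/s2>\tExample heading B",
]
training = [
    "http://example.org/subjects/s1",
    "http://example.org/subjects/s3",
]

missing = unknown_uris(vocab, training)
print(missing)
```

In practice the two lists would come from the subject file under the voc folder and from the training corpus, read with `open(path, encoding="utf-8")`.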
I decided to run loadvoc again after skosifying LCSH with the Skosify tool. The result was almost the same: around 278K URI-descriptor pairs. Then I tested with the LCSH TTL file prepared by Jim Hahn and made available for others to use. The result was the same.
A closer look at the subject files created in all three experiments reveals a common error. In some places, the loadvoc process failed to load URIs and descriptors properly, and in all these cases the subject descriptors contained double quotes (" "), for example: Friends of God ("Gottesfreunde") | Tovarishchestvo "Iskusstvo ili smertʹ" (Group of artists) | Latvia--Kolhozs "Nākotne"
In all such cases, the loadvoc process produced malformed entries such as <URI> | Descriptor URI Descriptor and so on.
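A small script along these lines could help locate the broken entries in the subject file. It is only a sketch: it assumes the separator shown as " | " above is a tab in the actual file, and `find_malformed` is a hypothetical helper, not an Annif function. A line is flagged when it does not split into exactly one `<URI>` field and one descriptor, or when the descriptor itself contains another URI.

```python
import re

# A well-formed first field is assumed to be a single URI in angle brackets.
URI_FIELD = re.compile(r"^<[^<>\s]+>$")


def find_malformed(lines):
    """Return (line_number, text) pairs for lines that look broken."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        fields = text.split("\t")
        ok = (
            len(fields) == 2
            and URI_FIELD.match(fields[0])
            and "<http" not in fields[1]  # a second URI leaked into the label
        )
        if not ok:
            bad.append((number, text))
    return bad


# Hypothetical sample lines; the URIs are placeholders.
sample = [
    '<http://example.org/s1>\tFriends of God ("Gottesfreunde")',
    "<http://example.org/s2>\tLatvia--Kolhozs <http://example.org/s3>\tSome label",
]

bad_line_numbers = [number for number, _ in find_malformed(sample)]
print(bad_line_numbers)
```

Running this over the real subject file should show whether every broken entry lines up with a descriptor that contains double quotes in the source Turtle.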