Loading OCLC FAST Vocabulary


Remi Malessa

Nov 20, 2020, 2:50:33 AM
to Annif Users
Hello, 
I am starting to work with Annif and trying to load the OCLC FAST dataset (vocabulary).
I am working with the RDF (Linked Data Format, N-Triples) files from this website:
https://www.oclc.org/research/areas/data-science/fast/download.html

One of the problems is that it comes as a pack of 8 files. Looking at the Annif wiki, I don't think I will be able to load them separately:

"If a vocabulary has already been loaded, reinvoking loadvoc with a new subject file will update the Annif's internal vocabulary: label names are updated and any subject not appearing in the new subject file is removed. Note that new subjects won't be suggested before the project is retrained with the updated vocabulary."

I've tried merging the files and loading them as one, but the process gets "Killed", which I'm guessing means it is running out of memory (I likely don't have 8GB of memory on the server where it is installed). I am running my Annif instance in a Docker container.

I also found that Annif couldn't load records whose URLs (especially Wikipedia ones) contain quotation marks or apostrophes. For example:

<https://en.wikipedia.org/wiki/Muse\u0301e_de\u0301partemental_Maurice_Denis_"The_Priory"> <http://www.w3.org/2000/01/rdf-schema#label> "Mus\u00E9e d\u00E9partemental Maurice Denis \"The Priory\"" .

Any help would be much appreciated

Remi

Remi Malessa

Nov 20, 2020, 7:55:20 AM
to Annif Users
Actually, my server has around 32MB of memory and it still fails to load the vocabulary. It's quite big: 2.8GB.

MJ Suhonos

Nov 20, 2020, 10:25:06 AM
to Remi Malessa, Annif Users
Hi Remi,

Actually, my server has around 32MB of memory and it still fails to load the vocabulary. It's quite big: 2.8GB.

On Friday, November 20, 2020 at 7:50:33 AM UTC Remi Malessa wrote:
Hello, 
I am starting to work with Annif and trying to load the OCLC FAST dataset (vocabulary).
I am working with the RDF (Linked Data Format, N-Triples) files from this website:
https://www.oclc.org/research/areas/data-science/fast/download.html

One of the problems is that it comes as a pack of 8 files. Looking at the Annif wiki, I don't think I will be able to load them separately:

"If a vocabulary has already been loaded, reinvoking loadvoc with a new subject file will update the Annif's internal vocabulary: label names are updated and any subject not appearing in the new subject file is removed. Note that new subjects won't be suggested before the project is retrained with the updated vocabulary."

I've tried merging the files and loading them as one, but the process gets "Killed", which I'm guessing means it is running out of memory (I likely don't have 8GB of memory on the server where it is installed). I am running my Annif instance in a Docker container.

To begin: the entirety of FAST is a *huge* dataset.  I’ve worked with LCSH as provided by the Library of Congress, and in total it’s only about 50MB once it’s been processed.

You can see the repository I use, along with some possibly useful tools, here: https://github.com/mjsuhonos/Annif-corpora

I also assume you mean 32GB of memory, which should be plenty to load a big vocabulary in Annif.  However, you will certainly have to pre-process your data before loading it.  I would recommend the following steps:

1) You can use the Unix “cat” command to combine all of the N-Triples files together; this will result in one huge file which you can then work with, e.g. “cat *.nt > FAST-Combined.nt”.

2) Most RDF dumps, including FAST, have a lot of extra statements that Annif doesn’t use (e.g. skos:inScheme, anything outside of SKOS, etc.).  You can use the Unix “grep” command to filter out statements, e.g. “grep -v schema.org FAST-Combined.nt > FAST-Combined-schema-removed.nt” (this would remove all statements using the schema.org vocabulary, which Annif doesn’t use).

You will have to run grep multiple times to filter adequately; I would start with “purl.org”, “schema.org”, “#inScheme” and “#type”, which I believe are all ignored by Annif.  These can be chained together into one command like:

cat FAST-Combined.nt | grep -v "purl.org" | grep -v schema.org | grep -v "#inScheme" | grep -v "#type" | grep -v "odcby.htm" > FAST-Combined-filtered.nt

In this case, the triples file reduces from 2.8GB to 1.1GB (over 60% reduction).
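
The same filtering could also be done with a short Python script along these lines (a minimal sketch only; the substring list mirrors the grep patterns above and may need adjusting for your data):

import sys

# Substrings that mark statements Annif doesn't use (mirrors the grep -v patterns above).
SKIP = ("purl.org", "schema.org", "#inScheme", "#type", "odcby.htm")

def filter_nt(src_path, dst_path, skip=SKIP):
    """Copy an N-Triples file, dropping any line that contains one of the skip substrings."""
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not any(s in line for s in skip):
                dst.write(line)

if __name__ == "__main__":
    # e.g. python filter_nt.py FAST-Combined.nt FAST-Combined-filtered.nt
    filter_nt(sys.argv[1], sys.argv[2])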

3) N-Triples is great for combining and filtering as above, but it’s still a very verbose format.  You can reduce it further by converting to Turtle using the “rapper” tool (from http://librdf.org/raptor/), or by converting to a simple TSV format that Annif can read (I have a Ruby script here: https://github.com/mjsuhonos/Annif-corpora/blob/master/vocab/rdf2tsv.rb).

Simple usage for rapper would be e.g. “rapper -i ntriples -o turtle FAST-Combined-filtered.nt > FAST-Combined-filtered.ttl”.
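
To go straight from the filtered N-Triples to TSV, something along the lines of the rdf2tsv.rb script above could be sketched in Python as below.  This is only a rough sketch, not that script: it assumes the labels you want are skos:prefLabel triples and that the target is Annif’s “URI, tab, label” subject-file format; adjust the predicate if your dump uses something else.

import re
import sys

# Crude pattern for N-Triples lines of the form: <subject> <predicate> "literal"(@lang)? .
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"(.*)"(?:@[\w-]+)?\s*\.\s*$')
PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

def nt_to_tsv(nt_path, tsv_path, predicate=PREF_LABEL):
    """Write one URI<tab>label line (Annif subject-file style) per matching label triple."""
    with open(nt_path, encoding="utf-8") as src, open(tsv_path, "w", encoding="utf-8") as out:
        for line in src:
            m = TRIPLE.match(line)
            if m and m.group(2) == predicate:
                uri, label = m.group(1), m.group(3)
                # Classic N-Triples is ASCII with \uXXXX and \" escapes inside literals; decode them.
                label = label.encode("ascii", "backslashreplace").decode("unicode_escape")
                out.write(f"<{uri}>\t{label}\n")

if __name__ == "__main__":
    # e.g. python nt2tsv.py FAST-Combined-filtered.nt fast.tsv
    nt_to_tsv(sys.argv[1], sys.argv[2])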

These aren’t strictly necessary, but again they will reduce the work Annif has to do to load the vocabulary, reducing processing time and memory overhead.

In this case, converting to turtle produces a file 880MB in size with 9.2M triples.  It is still very large, but now only about 31% of the original size.  I’ve placed a temporary copy of the output file at:


I also found that Annif couldn't load records whose URLs (especially Wikipedia ones) contain quotation marks or apostrophes. For example:

<https://en.wikipedia.org/wiki/Muse\u0301e_de\u0301partemental_Maurice_Denis_"The_Priory"> <http://www.w3.org/2000/01/rdf-schema#label> "Mus\u00E9e d\u00E9partemental Maurice Denis \"The Priory\"" .

Any help would be much appreciated

That URL looks malformed to me; it should correctly be encoded as “https://en.wikipedia.org/wiki/Musée_départemental_Maurice_Denis_%22The_Priory%22”.  The FAST triples file appears to have almost 1.9M statements with these escaped Unicode sequences.

For this, you’ll have to use a tool like “uni2ascii” or some shell commands like “echo -ne” or “echo $” depending on your platform.  If you’re willing to lose some statements, you could also simply filter them out using grep -v “\u0”.
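
Another option that keeps those statements is a small Python filter that decodes the \uXXXX escapes inside the <...> URIs and percent-encodes the characters that aren’t legal there (quotes, spaces and so on).  A rough sketch, reading N-Triples on stdin and writing repaired lines to stdout:

import re
import sys
from urllib.parse import quote

UESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})')
IRI = re.compile(r'<([^<>]*)>')

def fix_iri(iri):
    """Decode \\uXXXX escapes, then percent-encode characters that are not allowed in a URI."""
    decoded = UESCAPE.sub(lambda m: chr(int(m.group(1), 16)), iri)
    # Reserved URI delimiters stay as-is; quotes, spaces and non-ASCII characters get
    # percent-encoded, e.g. "The_Priory" becomes %22The_Priory%22.
    return quote(decoded, safe=":/?#[]@!$&'()*+,;=-._~%")

def fix_line(line):
    # Only rewrite the <...> URI parts of each statement; literals are left untouched.
    return IRI.sub(lambda m: "<" + fix_iri(m.group(1)) + ">", line)

if __name__ == "__main__":
    # e.g. python fix_uris.py < FAST-Combined-filtered.nt > FAST-Combined-fixed.nt
    for line in sys.stdin:
        sys.stdout.write(fix_line(line))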

In summary, if you’re just getting started with Annif, I would strongly recommend starting with a small vocabulary instead of tackling a large one like FAST right away.  Also, don’t underestimate the amount of data preprocessing you’ll have to do in order to get your vocabulary (and training) files into a format suitable for Annif.

Good luck, and hope this helps,

MJ



Remi Malessa

Nov 24, 2020, 9:09:15 AM
to Annif Users
Hi MJ,

Firstly, thank you very much for the reply, much appreciated. I followed your suggestions and, using a Python script, removed all the offending lines and omitted the unnecessary ones. That brought the vocabulary down to around 1.1 GB, and I actually managed to load it as N-Triples.

But I think I misunderstood the idea of training the project, as I now need training data indexed using the FAST vocabulary. So it seems I am going to stick with what's available "out of the box" anyway.

Thank you
Remi

mjsu...@ryerson.ca

Dec 17, 2020, 12:28:25 PM
to Annif Users
Hi all,

I wanted to provide some more information about the work I did using LCSH in Annif, in particular the data preprocessing stages, in case it is useful to others.

The data set in  https://github.com/mjsuhonos/Annif-corpora is about 6000 papers (in PDF) from our institutional repository, all of which were catalogued with LCSH.  However, they were only assigned textual headings, which I had to convert to URIs based on the LCSH SKOS RDF download.

The resultant files in the git repository are:

- fulltext: the source documents as text and subject files, divided as described in https://github.com/NatLibFi/Annif/wiki/Achieving-good-results (test/train/validate); there are also numbered sets (100, 500, 1000, 3000) for testing backend sensitivity, but they aren’t strictly necessary

- subjects: the short documents (title only), TSV formatted as described in https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats (see the format example after this list); this is a more compact version of the same documents as the “all” folder, above

- tools: some useful scripts for creating / training / destroying Annif on ephemeral VMs (I use DigitalOcean)

- training: the short documents, TSV formatted, i.e. a more compact version of the “training” folder above

- vocab: 1) the LCSH vocabulary, downloaded from LoC and processed as I described in my earlier message; 2) a custom RULA vocabulary, which contains headings that exist in our repository but not in the LCSH download
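
For reference, the short-text TSV corpora above use the document corpus format from that wiki page: one document per line, with the text, a tab, and then the assigned subject URIs in angle brackets separated by spaces.  Roughly like this, where <TAB> stands for a single tab character and the titles and URIs are made-up placeholders rather than real LCSH identifiers:

Remote sensing of boreal forests<TAB><http://id.loc.gov/authorities/subjects/sh0000001> <http://id.loc.gov/authorities/subjects/sh0000002>
A study of northern photography collections<TAB><http://id.loc.gov/authorities/subjects/sh0000003>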

The vocab part was the most challenging for me: because LCSH allows (potentially infinite?) subdivisions, there are inevitably subject headings that will not exist in the LCSH RDF and won’t have URIs (e.g. “Photography--Canada, Northern--History”).

For these, I generated RULA.tsv, which uses unique URNs for each subject heading (the script is subjects/match-subjects.rb).  Then I combined the LCSH vocabulary with the RULA vocabulary (vocab/rula/rula-lcsh.ttl) and loaded that into Annif. (Note: the use of mixed URNs / URIs can cause issues with the Maui backend; this is fixed in https://github.com/NatLibFi/maui/pull/7 )
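
For anyone replicating that step without Ruby, the same idea can be sketched in Python roughly as follows.  This is a hypothetical sketch rather than the match-subjects.rb script itself: it looks each textual heading up in a label-to-URI map built from the vocabulary TSV and mints a stable URN for the ones that aren’t there; the urn: prefix and the hashing scheme are purely illustrative choices.

import csv
import hashlib

def load_label_map(vocab_tsv):
    """Build a lowercased label -> URI map from a 'URI<tab>label' vocabulary TSV."""
    mapping = {}
    with open(vocab_tsv, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                mapping[row[1].strip().lower()] = row[0].strip("<>")
    return mapping

def heading_to_uri(heading, label_map, urn_prefix="urn:example:subject:"):
    """Return the vocabulary URI for a heading, or mint a stable URN for local headings
    (e.g. subdivided forms like 'Photography--Canada, Northern--History')."""
    key = heading.strip().lower()
    if key in label_map:
        return label_map[key]
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
    return urn_prefix + digest

# e.g. heading_to_uri("Photography--Canada, Northern--History", load_label_map("lcsh.tsv"))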

At a very high level, the steps are:

1. Process your vocabulary / subject data (generate URNs, merge LCSH): either into TTL or TSV format
2. Process your document data: generate full-text and short-text corpora, and then split into test/train/validate
3. Configure Annif backends: most backends will do best with the short-text corpus; the exception is Maui, which prefers a smaller number of full-text documents (about 500 is plenty). See the example configuration after this list.
4. Evaluate your backends using “eval” and your “test” corpus.  I never use the “validate” corpus; it’s reserved for possible future use.
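
For step 3, the backend settings live in Annif’s projects.cfg.  As a rough illustration only (the project ID, vocabulary name and parameter values below are made up, and the available options depend on your Annif version, so check the documentation), a simple TF-IDF project against a FAST vocabulary might look something like:

[fast-tfidf-en]
name=FAST TF-IDF English
language=en
backend=tfidf
vocab=fast
analyzer=snowball(english)
limit=100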

It sounds like a lot of work, and it is.  But it’s not hard; the biggest challenge is planning out the data processing, which I would even recommend doing on paper (I used a whiteboard).