I wanted to provide some more information about the work I did using LCSH in Annif, in particular the data preprocessing stages, in case it is useful to others.
The resultant files in the git repository are:
- fulltext: the source documents as text and subject files, divided as described in
https://github.com/NatLibFi/Annif/wiki/Achieving-good-results (test/train/validate); there are also numbered sets (100, 500, 1000, 3000) for testing backend sensitivity, but they aren’t strictly necessary
- subjects: the short documents (title only), TSV formatted as described in
https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats (sample lines below); this is a more compact version of the same documents as the “all” folder, above
- tools: some useful scripts for creating / training / destroying Annif on ephemeral VMs (I use DigitalOcean)
- training: the short documents, TSV formatted, i.e. a more compact version of the “training” folder above
- vocab: 1) the LCSH vocabulary, downloaded from LoC and processed as I described in my earlier message; 2) a custom RULA vocabulary, which contains headings that exist in our repository but not in the LCSH download
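For reference, a couple of lines in the short-document TSV format look roughly like this (the titles and URIs here are invented for illustration; the real files use LCSH URIs and RULA URNs). Each line is the document text, then a tab, then one or more subject URIs in angle brackets:

    Photography in the Canadian North	<urn:rula:000123>
    An introduction to prairie agriculture	<http://example.org/subjects/0001> <http://example.org/subjects/0002>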
This last part was the most challenging for me: because LCSH allows (potentially unbounded) subdivisions, there are inevitably subject headings that will not exist in the LCSH RDF and won't have URIs (e.g. "Photography--Canada, Northern--History").
For these, I generated RULA.tsv, which uses a unique URN for each subject heading (the script is subjects/match-subjects.rb). Then I combined the LCSH vocabulary with the RULA vocabulary (vocab/rula/rula-lcsh.ttl) and loaded that into Annif. (Note: the use of mixed URNs / URIs can cause issues with the Maui backend; this is fixed in
https://github.com/NatLibFi/maui/pull/7 )
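As a rough illustration of the URN step (the real work is done by subjects/match-subjects.rb, which is Ruby; the urn:rula namespace and file names here are only placeholders), something along these lines turns a list of headings with no LCSH URI into vocabulary TSV entries:

    # assumes one missing heading per line in missing-headings.txt
    i=0
    while IFS= read -r heading; do
      i=$((i + 1))
      printf '<urn:rula:%06d>\t%s\n' "$i" "$heading" >> RULA.tsv
    done < missing-headings.txt

Those entries are then merged with the LCSH vocabulary (in my case as Turtle, vocab/rula/rula-lcsh.ttl) before loading into Annif.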
At a very high level, the steps are:
1. Process your vocabulary / subject data (generate URNs, merge LCSH) into either TTL or TSV format
2. Process your document data: generate full-text and short-text corpora, and then split into test/train/validate (a split sketch follows this list)
3. Configure Annif backends: most backends will do best with the short-text corpus; the exception is Maui, which prefers a smaller number of full-text documents (about 500 is plenty). An example project definition follows below.
4. Evaluate your backends using “eval” and your “test” corpus (example commands below). I never use the “validate” corpus; it’s reserved for possible future use.
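For step 2, the split itself can be as simple as a shuffle plus head/tail; this is only a sketch (file names and the 80/10/10 ratio are my assumptions, adjust to taste):

    shuf all.tsv > shuffled.tsv
    total=$(wc -l < shuffled.tsv)
    ntrain=$((total * 80 / 100))
    nval=$((total * 10 / 100))
    head -n "$ntrain" shuffled.tsv > train.tsv
    tail -n +"$((ntrain + 1))" shuffled.tsv | head -n "$nval" > validate.tsv
    tail -n +"$((ntrain + nval + 1))" shuffled.tsv > test.tsv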
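For step 3, a backend definition in projects.cfg looks roughly like this (the project id, analyzer and vocab id are made up here; check the Annif wiki for the parameters your backend and version actually support):

    [tfidf-en]
    name=TF-IDF English
    language=en
    backend=tfidf
    analyzer=snowball(english)
    vocab=rula-lcsh
    limit=100

Maui gets its own project entry with its Maui-specific parameters (see the Annif wiki), and it is the one I train on the smaller full-text set.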
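And for step 4, the load/train/eval cycle is just the Annif CLI (project ids and paths are placeholders; exact command names can differ between Annif versions, e.g. newer releases use load-vocab):

    annif loadvoc tfidf-en vocab/rula/rula-lcsh.ttl
    annif train tfidf-en training/train.tsv
    annif eval tfidf-en fulltext/test/

eval reports precision, recall, F1 and other metrics against the gold-standard subjects in the test corpus.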
It sounds like a lot of work, and it is. But it’s not hard; the biggest challenge is planning out the data processing, which I’d even recommend doing on paper (I used a whiteboard).