Advice on which backend to try


christelann...@gmail.com

Jul 6, 2022, 4:58:50 AM
to Annif Users


Dear all,

I have about 5000 early modern German texts (legal texts) that have been labelled, usually with 3 to 4 labels each, though in one case with as many as 26 labels (n=1). The texts date from approximately 1500 to 1800. As you might suspect, their spelling is definitely not standardised.

1) At this point, the labels are all written out in words. Is that ok, or do I need to label them in numbers (or does this depend on the SKOS?)
2) Which backends do you advise trying?

Best,
Annemieke

christelann...@gmail.com

Jul 6, 2022, 5:07:18 AM
to Annif Users
The SKOS we have created can be found here: https://skohub.io/rg-mpg-de/vocabs-polmat/heads/main/w3id.org/rg-mpg-de/polmat/n01.1so.1.de.html

On Wednesday, July 6, 2022 at 10:58:50 AM UTC+2, christelann...@gmail.com wrote:

juho.i...@helsinki.fi

Jul 6, 2022, 11:26:03 AM
to Annif Users
Hi Annemieke!

1)
If the labels assigned to the texts are identical to the prefLabels of the SKOS vocabulary, you are good to go with just those. But then you need to use the full-text document corpus with simple subject files, that is, a directory of .txt files containing the texts and corresponding .key files containing the subjects: https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats#simple-subject-file-format
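
For illustration, such a corpus directory might look like the following (the filenames and labels here are made-up placeholders):

    corpus/
        doc0001.txt    full text of the first document
        doc0001.key    its subjects, one per line
        doc0002.txt
        doc0002.key
        ...

where doc0001.key would contain something like:

    First subject prefLabel
    Second subject prefLabel

Each line of a .key file must match a prefLabel of the vocabulary exactly.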
 
However, in the long run it could be better to convert the labels of the texts to the URIs (i.e. ids) of your SKOS vocabulary. That would let you easily change the prefLabels in the SKOS later, if the need arises, without also having to update the labels of the documents. It would also let you use the extended subject file format or the short text document corpus (TSV file) format for the corpus.

As for the numbers you refer to: they are the last parts of the concept ids in the SKOS, and they also appear as the altLabels and notations(?). In principle you could use them, but as you already have the ids in the SKOS, I think it is better to use the ids (i.e. the URIs).
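
To sketch what that looks like (the URI below is only my guess at the form of your concept ids, based on your SkoHub link, so please check it against the actual SKOS file), a line in a short text document corpus TSV file has the text in the first column and the subject URIs, in angle brackets and separated by spaces, in the second:

    Text of the document...	<https://w3id.org/rg-mpg-de/polmat/n01.1so.1>

In the extended subject file format, each line of a .key file correspondingly contains a URI in angle brackets, optionally followed by a tab and a label.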

2)
Omikuji, with either the Bonsai or the Parabel configuration, usually gives very good results. And since you said the spelling in your texts is not standardized, I would guess fastText could work well too, as it can use character-level information to tolerate varying spellings to some extent. For that you need to set the character n-gram length parameters in the configuration of the fastText project; you can try, for example, the values minn=2 and maxn=5. There are also many other fastText parameters, and it can be tricky to find values that make the model work well: https://github.com/NatLibFi/Annif/wiki/Backend%3A-fastText#backend-specific-parameters
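
To make this concrete, here is a rough sketch of project definitions in projects.cfg (the project ids and the vocab name are made up, and most parameter values are just typical starting points like those in the Annif wiki examples, so treat them as a baseline to tune):

    # Omikuji with Bonsai-style parameters
    [omikuji-bonsai-de]
    name=Omikuji Bonsai German
    language=de
    backend=omikuji
    analyzer=snowball(german)
    vocab=polmat
    cluster_balanced=False
    cluster_k=100
    max_depth=3

    # fastText with subword n-grams of 2 to 5 characters
    [fasttext-de]
    name=fastText German
    language=de
    backend=fasttext
    analyzer=snowball(german)
    vocab=polmat
    dim=500
    lr=0.25
    epoch=30
    loss=hs
    minn=2
    maxn=5

You would then train a project with, for example, annif train omikuji-bonsai-de /path/to/corpus.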

I hope I could give at least some useful thoughts; please ask again if anything is unclear or problems arise!
-Juho