Hi Paul!
Good that you got the vocabulary working!
I think it would be helpful if you gave some more background about what
you are trying to do with Annif, what kind of training data you have
available, what kind of documents you wish to apply Annif to and what
kind of results you are aiming for. Then it would be easier to guide you
in the right direction.
Also, I suggest that you take a closer look at the Annif tutorial videos
and exercises. They explain many of the ideas and choices, for example
what kind of data sets you can use for training and what kind of
backends (different algorithms) are available:
https://github.com/NatLibFi/Annif-tutorial
To answer your questions:
1. Yes, you could create a training data set where the text is the same
as the keyword (subject label) and train, for example, a TFIDF model on
it. I'm not sure how useful this would be, though: I suspect the quality
wouldn't be very good when the model is applied to longer real-world
documents. If you are looking for a solution for lexical subject
indexing (matching words and phrases in the text directly to subject
terms in your vocabulary), then I would suggest that you instead look at
the MLLM backend (which does require some training data; for example,
50 or 100 documents with manually assigned subjects would be a good
start) or the YAKE backend, which doesn't need any training data.
2. The training documents can be almost any length, from a few words
(e.g. just document titles) to longer abstracts, tables of contents or
full text documents with over a hundred pages of text. It depends on
what you have available and what kind of documents you want to apply
Annif to later.
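
If you do want to try the approach from point 1, here is a minimal
sketch in Python of how such a label-only training file could be built.
It assumes the TSV short-text corpus layout (document text in the first
column, the subject URI in angle brackets in the second), so please
check that against the Annif documentation. The function name
build_label_corpus and the three sample subjects are made up for
illustration; the URIs just follow the example.org/vocab# pattern from
your vocabulary.

```python
def build_label_corpus(subjects):
    """Build TSV text with one line per subject: label<TAB><URI>.

    `subjects` is an iterable of (uri, label) pairs. This helper is a
    hypothetical illustration, not part of Annif itself.
    """
    lines = []
    for uri, label in subjects:
        # Collapse tabs and newlines so the label stays in one TSV column.
        text = " ".join(label.split())
        lines.append(f"{text}\t<{uri}>")
    return "\n".join(lines) + "\n"


# Sample data (assumed): each "document" is just the subject's own label.
subjects = [
    ("http://example.org/vocab#Mount_Baker", "Mount Baker"),
    ("http://example.org/vocab#Spielzeugeisenbahn", "Spielzeugeisenbahn"),
    ("http://example.org/vocab#Wasserschaden", "Wasserschaden"),
]

corpus = build_label_corpus(subjects)
print(corpus, end="")
```

You would then save the result to a file (say label-corpus.tsv) and
train with something like "annif train my-tfidf label-corpus.tsv",
where my-tfidf is a hypothetical project ID from your own projects
configuration.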
Best,
Osma