My thinking about this is similar; I will comment inline:
jaoh wrote on 22.5.2022 at 19:12:
> · Every Word of the vocabulary should be present in the training
> data so it can be learned.
Ideally, yes - every subject should be represented with several
examples. This is especially important for associative algorithms
(tfidf, fasttext, omikuji, svc) because they cannot learn what they
haven't seen. For lexical algorithms (mllm, stwfsa) this is not so
crucial, as they learn higher-level heuristics, not individual subjects.
In practice this is rarely the case. It's quite typical especially for
large subject vocabularies that a significant fraction (a third or even
half) of the subjects are used just once or not at all. That doesn't
make the training data useless though, since usually the more important
subjects are also the ones that are used frequently.
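
To see how this looks for a particular data set, one could count how often
each vocabulary subject occurs in the training corpus. Below is a minimal
sketch, not part of Annif. It assumes a TSV corpus where each line holds the
document text, a tab, and then the subject URIs in angle brackets separated
by whitespace (please check this against the corpus format you actually
use). The function name subject_usage and the example.org URIs are made up
for illustration.

```python
from collections import Counter


def subject_usage(corpus_lines, vocabulary_uris):
    """Count subject occurrences and list subjects used never or only once.

    Assumed line format (hypothetical data, not an Annif API):
    text<TAB><uri1> <uri2> ...
    """
    counts = Counter()
    for line in corpus_lines:
        # Everything after the first tab is treated as the subject column.
        _text, _, subjects = line.rstrip("\n").partition("\t")
        for token in subjects.split():
            counts[token.strip("<>")] += 1
    # Counter returns 0 for subjects that never occurred.
    unused = [uri for uri in vocabulary_uris if counts[uri] == 0]
    used_once = [uri for uri in vocabulary_uris if counts[uri] == 1]
    return counts, unused, used_once


if __name__ == "__main__":
    lines = [
        "cats and dogs\t<http://example.org/cat> <http://example.org/dog>",
        "more about cats\t<http://example.org/cat>",
    ]
    vocab = [
        "http://example.org/cat",
        "http://example.org/dog",
        "http://example.org/fish",
    ]
    counts, unused, used_once = subject_usage(lines, vocab)
    rare = len(unused) + len(used_once)
    print(f"{rare} of {len(vocab)} subjects are used once or not at all")
```

The same counts also give a rough picture of how skewed the subject
distribution is, e.g. by looking at counts.most_common(20).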
> · The classes should be balanced i.e. no word of the vocabulary
> should be overrepresented in the training set to avoid bias
Again, ideally yes - especially for multiclass (choose one out of many
classes) classification tasks. Some algorithms are more sensitive to
class imbalance than others, though. I think the Annif algorithms that
are suitable for multiclass classification (fasttext, omikuji, svc)
should be fairly robust in this sense but I admit I haven't really tried
to measure this. tfidf, on the other hand, is easily thrown off - it's
really just intended as a simple and fast proof-of-concept algorithm.
For multilabel classification (choose a few subjects out of many) it's
more difficult to define what "balanced" means. In practice, all
multilabel data sets tend to be heavily imbalanced.
> Are there other important aspects I should consider?
Well, there are basic things like:
- the manually assigned subjects should be chosen consistently
- the subjects used in the data should match those in the vocabulary
(for example the vocabulary could be an older/newer version!)
- there should be a reasonable amount of text in each document (e.g. a
200 word abstract is probably better than just a 5 word title)
- the documents should be in a single language, not many languages mixed
There are probably more aspects but these were off the top of my head.
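
Two of the points above (subjects matching the vocabulary, and a reasonable
amount of text) are easy to check mechanically. Here is a rough sketch under
the same assumption as before: a TSV corpus with the text, a tab, and then
subject URIs in angle brackets. The function name check_documents and the
min_words threshold of 20 are arbitrary choices for illustration, not
recommendations. Consistency of the manual indexing and language mixing are
not covered here.

```python
def check_documents(corpus_lines, vocabulary_uris, min_words=20):
    """Report subjects missing from the vocabulary and very short documents.

    Assumed line format (hypothetical, not an Annif API):
    text<TAB><uri1> <uri2> ...
    Returns (set of unknown subject URIs, list of 1-based line numbers
    whose text has fewer than min_words whitespace-separated words).
    """
    vocab = set(vocabulary_uris)
    unknown_subjects = set()
    short_docs = []
    for lineno, line in enumerate(corpus_lines, start=1):
        text, _, subjects = line.rstrip("\n").partition("\t")
        for token in subjects.split():
            uri = token.strip("<>")
            if uri not in vocab:
                # Possibly a subject from an older/newer vocabulary version.
                unknown_subjects.add(uri)
        if len(text.split()) < min_words:
            short_docs.append(lineno)
    return unknown_subjects, short_docs
```

Usage would be something like
check_documents(open("train.tsv", encoding="utf-8"), vocab_uris), where
train.tsv and vocab_uris stand in for your own corpus file and vocabulary.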
> Are there plans to extend Annif to measure these aspects of the training
> data or is this not part of the project scope?
For now it isn't in scope, but please feel free to propose, especially
if you have an idea about metrics that could be helpful. The best way to
propose a new feature is to open up an issue on GitHub.
> Annif could also implement resampling algorithms to solve the class
Yes, although this could be done externally as well, for example with
scripts that read and write Annif-compatible corpora. I've actually toyed
with something like this some time back, but it didn't become a finished tool.
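
To give an idea of what such an external script might look like, here is a
naive downsampling sketch. It is not the unfinished tool mentioned above and
not an established resampling algorithm, just one simple possibility. It
again assumes a TSV corpus (text, a tab, subject URIs in angle brackets); the
function name downsample and the max_per_subject parameter are made up. A
document is kept only while at least one of its subjects has appeared in
fewer than max_per_subject kept documents, so frequent subjects can still
exceed the cap when they co-occur with rare ones.

```python
from collections import Counter


def downsample(corpus_lines, max_per_subject=100):
    """Drop documents whose subjects are all already well represented.

    Assumed line format (hypothetical, not an Annif API):
    text<TAB><uri1> <uri2> ...
    Lines are returned unchanged, so the output can be written back
    in the same format. Documents without subjects are dropped.
    """
    kept = []
    counts = Counter()  # subject URI -> number of kept documents using it
    for line in corpus_lines:
        _text, _, subjects = line.rstrip("\n").partition("\t")
        uris = [token.strip("<>") for token in subjects.split()]
        # Keep the document if it still adds an example for some subject.
        if any(counts[uri] < max_per_subject for uri in uris):
            kept.append(line)
            counts.update(uris)
    return kept
```

The result depends on document order, so shuffling the corpus before
downsampling is probably a good idea.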
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529