SVC backend: wiki page and training-data format

Claudia Grote

29 Jun 2021, 11:03:24 a.m.
to Annif Users
Hi all,

Many thanks to Osma and the Annif team for your great work and advancement of Annif!

Out of curiosity, we tried the SVC backend with Annif 0.53 and ran into two questions:
1. The SVC wiki page does not appear in the generated TOC of the Annif wiki, although it can be found in the page list above the TOC.
2. The 20news example works, but we did not succeed with one of our own training corpora, which works fine with other backends. Our corpora use the full-text document corpus format (one txt file and one tsv file, containing one class, per document in the same directory). This does not seem to be accepted, because SVC training warns that it is skipping every line of every file.

Since the SVC wiki page does not mention any restrictions on the training corpus format, we are wondering: does SVC training only work with a single tsv file containing the whole training corpus, as in the 20news example, or can per-document files be used as well, as with other backends?

Thank you very much for your support!

Cheers,
Claudia

juho.i...@helsinki.fi

29 Jun 2021, 4:00:45 p.m.
to Annif Users
Hi Claudia,

Good to hear about your interest in testing SVC, and thanks for the findings!

1. The SVC wiki page was forgotten from the TOC, but now it is in place.

2. The intention is that SVC can be used with both full-text and short-text document corpora, just like any other backend. However, there is a bug in the full-text case: when I try to train an SVC project on a full-text corpus, it fails with an error. The last 3 lines of the traceback are:

  File "/home/local/jmminkin/git/Annif/annif/backend/svc.py", line 60, in _corpus_to_texts_and_classes
    classes.append(doc.uris[0])
TypeError: 'set' object is not subscriptable

This is straightforward to fix, but you wrote that you see warnings about ignored lines rather than this error. Could you post the output here?
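
For readers unfamiliar with this error, here is a minimal Python sketch of what the traceback describes. It is only an illustration and not the code in annif/backend/svc.py: the variable `uris` and the example.org URI are hypothetical stand-ins for a document's subject URIs stored in a set, and the workaround shown is just one possible approach, not necessarily the fix that Annif adopted.

```python
# Hypothetical stand-in for a document whose subject URIs are held in a set.
uris = {"http://example.org/subject/1"}

try:
    first = uris[0]  # sets are unordered and do not support positional indexing
except TypeError as err:
    message = str(err)  # the interpreter reports that a set cannot be indexed

# One possible workaround: take an arbitrary element through an iterator.
# With a single-class document the set has exactly one element, so this is unambiguous.
first = next(iter(uris))
```

If a document carried several classes, a deterministic choice such as `sorted(uris)[0]` might be preferable, since set iteration order is not guaranteed.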

-Juho

Claudia Grote

29 Jun 2021, 5:21:23 p.m.
to Annif Users
Hi Juho,

Very good to hear that SVC is meant to be used like the other backends.

When I execute the training on a full-text corpus with one txt and one tsv file for each document, I get the following output for the first document and subsequently for all other documents alike:
warning: Unknown subject URI <133.4>
warning: Skipping invalid line (missing tab): "Das kleine Hexenbuch  Grundlagenwissen für Hexen"
warning: Skipping invalid line (missing tab): ""
warning: Skipping invalid line (missing tab): "Das kleine Hexenbuch - Grundlagenwissen fiir Hexen INHALTSVERZEICHNIS 1 Vorwort 15 2[…]

The first warning comes from a tsv file: the URI <http://d-nb.info/ddc/133.4> is interpreted as the document text and the literal class 133.4 is interpreted as the URI.
The other warnings come from the 3 lines of the txt file. Apparently each line is expected to contain the text followed by a tab and a URI (as in the short-text document format).
There are no errors.
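
To make the interpretation above concrete, here is a small Python sketch of a line parser for a tab-separated short-text format. It is a simplified, hypothetical parser (`parse_short_text_line` is not an Annif function), and it assumes that subjects are given as whitespace-separated URIs in angle brackets after the tab. It only shows why lines from a txt file get skipped and why a per-document tsv line gets misread.

```python
def parse_short_text_line(line):
    """Split one 'text<TAB>subjects' line; return None if there is no tab."""
    line = line.rstrip("\n")
    if "\t" not in line:
        # Mirrors the wording of the warnings quoted above.
        print(f'warning: Skipping invalid line (missing tab): "{line}"')
        return None
    text, subjects = line.split("\t", 1)
    # Assumption: subjects are whitespace-separated <URI> tokens.
    uris = [token.strip("<>") for token in subjects.split()]
    return text, uris


# A line from a full-text .txt file contains no tab, so it is skipped.
skipped = parse_short_text_line("Das kleine Hexenbuch  Grundlagenwissen für Hexen")

# A line from a per-document .tsv file (URI<TAB>label) is misread:
# the URI ends up as the "text" and the label as the "subject URI".
misread = parse_short_text_line("<http://d-nb.info/ddc/133.4>\t133.4")
```

Under these assumptions, `misread` holds the URI as the text and `133.4` as the only subject, which matches the "Unknown subject URI <133.4>" warning.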

Cheers,
Claudia

Claudia Grote

30 Jun 2021, 1:05:44 a.m.
to Annif Users
Hi Juho,

I had run the training on the individual files in the directory.
If I run the training on the directory itself, I also get the "TypeError: 'set' object does not support indexing" error you describe.
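
The difference between the two invocations can be pictured with the following Python sketch. It rests on an assumption that fits the behaviour seen in this thread but is not taken from Annif's source: a directory path is read as a full-text corpus, while individual file paths are read as short-text TSV files. The helper `corpus_kind` and the file name `doc1.txt` are hypothetical.

```python
import os
import tempfile


def corpus_kind(path):
    """Hypothetical helper: guess how a training path would be interpreted."""
    if os.path.isdir(path):
        return "full-text directory (.txt/.tsv pairs)"
    return "short-text TSV file"


with tempfile.TemporaryDirectory() as corpus_dir:
    # Create one dummy document so that there is a file path to inspect.
    doc_path = os.path.join(corpus_dir, "doc1.txt")
    with open(doc_path, "w", encoding="utf-8") as handle:
        handle.write("Das kleine Hexenbuch\n")

    kind_for_dir = corpus_kind(corpus_dir)  # passing the directory itself
    kind_for_file = corpus_kind(doc_path)   # passing a file inside it
```

If that assumption holds, passing the files one by one would explain the "missing tab" warnings, and passing the directory would reach the full-text code path where the TypeError occurred.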

Cheers,
Claudia

juho.i...@helsinki.fi

1 Jul 2021, 3:55:57 a.m.
to Annif Users
Hi Claudia,

We just released Annif 0.53.1, which fixes the TypeError raised when training the SVC backend on a full-text corpus: https://github.com/NatLibFi/Annif/releases/tag/v0.53.1

Thanks again for bringing this up!
-Juho