Formats for importing documents, organizing multiple annotators

Tatjana Scheffler

Nov 26, 2014, 11:18:29 AM
to webann...@googlegroups.com
Dear all,

I'm trying to use WebAnno for the first time. Is there any documentation on what the text documents to be annotated should look like? I haven't found anything in the user or developer guidelines (other than to pick "the right format"). I'm trying to annotate tweets and would like to use pre-segmented data (each tweet presented on one line).

For another question, is there any support for annotation projects with multiple annotators? I have 30-40 annotators who will each only annotate some of the documents (each document annotated thrice) - but I assume I must organize this separately and just instruct the annotators accordingly?

Thanks a lot,
Tatjana.

Chris Biemann

Nov 26, 2014, 11:24:57 AM
to webann...@googlegroups.com
Dear Tatjana,

please have a look at our video tutorials and documentation at
https://code.google.com/p/webanno/, in particular
https://code.google.com/p/webanno/wiki/UserGuide and
https://code.google.com/p/webanno/wiki/Tutorials .

You should be able to use plain text format for your tweets, and you can
assign annotators to documents on the monitoring page - annotators will
only see their own documents, not the documents of others.

best wishes,

Chris

Tatjana Scheffler

Dec 2, 2014, 12:26:43 PM
to webann...@googlegroups.com, bi...@lt.informatik.tu-darmstadt.de
Thank you, Chris. Yes, I have looked at these documents, but they are not very specific.

In particular, the plain text import for documents seems to do some kind of automatic segmentation. After each period, the text is split so that the next sentence starts on a new line. On the other hand, line breaks in the text file are ignored if the line doesn't end with a period. I would like to preserve the line breaks from the original document. Is this possible?

In the documents you have referenced I haven't found any kind of instructions on how to format the input (corpus) files.

Best
Tatjana

Richard Eckart de Castilho

Dec 2, 2014, 12:51:32 PM
to Tatjana Scheffler, webann...@googlegroups.com, Prof. Dr. Chris Biemann
On 02.12.2014, at 18:26, Tatjana Scheffler <tsche...@gmail.com> wrote:

> In particular, the plain text import for documents seems to do some kind of automatic segmentation. After each period, the sentence is split into a new line. On the other hand, line breaks in the text file are ignored if they don't end with a period. I would like to preserve the new lines from the original document. Is this possible?

WebAnno displays text according to sentence boundaries, not according to line breaks. Lines are wrapped automatically in the web interface.

When you import plain text, a basic heuristic is used to segment the text into sentences, and line breaks are largely (if not completely) ignored by this heuristic.

If you want WebAnno to render your text based on line breaks, then you need to explicitly mark these as sentence breaks. This means you have to use another data format for import, e.g. TCF or TSV.

Consider the text

Sie sind so sehr vermessen /
Weil sie des Tods verges=
sen .

To make WebAnno respect the line breaks, you would have to render it like this in the TSV format:

1-1 Sie
1-2 sind
1-3 so
1-4 sehr
1-5 vermessen
1-6 /

2-1 Weil
2-2 sie
2-3 des
2-4 Tods
2-5 verges=

3-1 sen
3-2 .

The first column is <lineId>-<tokenId> (strictly speaking, <lineId> is the <sentenceId>: each line is marked as its own sentence).
The second column is the token text.
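
For pre-segmented input such as one tweet per line, a conversion along the following lines could produce that layout. This is only a minimal sketch and not part of WebAnno: the function name lines_to_tsv, the whitespace tokenization, and the tab as column separator are all assumptions, and a given WebAnno version may expect additional header lines or annotation columns in its TSV files.

```python
def lines_to_tsv(text: str) -> str:
    """Turn one-unit-per-line text into <sentenceId>-<tokenId><TAB><token> rows.

    Hypothetical helper: every non-empty input line becomes one sentence,
    and sentences are separated by a blank line.
    """
    blocks = []
    sentence_id = 0
    for line in text.splitlines():
        tokens = line.split()  # assumption: tokens are separated by whitespace
        if not tokens:
            continue  # skip empty lines so sentence ids stay consecutive
        sentence_id += 1
        rows = [
            f"{sentence_id}-{token_id}\t{token}"
            for token_id, token in enumerate(tokens, start=1)
        ]
        blocks.append("\n".join(rows))
    if not blocks:
        return ""
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    poem = "Sie sind so sehr vermessen /\nWeil sie des Tods verges=\nsen ."
    print(lines_to_tsv(poem), end="")
```

Run on the three-line example text, this yields the same sentence and token ids as the listing above. Tweets usually need a more careful tokenizer than a whitespace split (for punctuation, hashtags, or emoticons), so that step would likely have to be replaced.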

Cheerio,

-- Richard

Tatjana Scheffler

Dec 2, 2014, 2:26:04 PM
to webann...@googlegroups.com, tsche...@gmail.com, bi...@lt.informatik.tu-darmstadt.de
Thank you, Richard, for this quick and helpful explanation.

Best
Tatjana