configuring project for short documents

Dario Bonaretti

Oct 25, 2021, 15:14:40
to inception-users
Hello,

I need to annotate several thousand short documents (e.g., online reviews, tweets). Can anyone advise on the best approach for importing this data?

Right now, I'm uploading each review as a separate .txt document. In total I have some 2k reviews/documents, about 8MB altogether. Moving forward, I'll probably need to upload more batches like this, that is, 2-3k documents, each one containing one review.

Does this approach make any sense, or should I really try to keep all reviews in one document?

I would also like some input on another question: the reviews have some metadata which I would like to pass to INCEpTION (e.g., author's name, review's score from 1 to 5). I saw there's an experimental feature for importing metadata. Would you say that's the way to go?

Any feedback is appreciated.

Richard Eckart de Castilho

Oct 26, 2021, 03:00:23
to incepti...@googlegroups.com
Hi Dario,

> On 25. Oct 2021, at 21:14, Dario Bonaretti <bonaret...@gmail.com> wrote:
>
> I need to annotate several thousands of short documents (e.g., online reviews, tweets). Can anyone advise on the best approach for importing this data?

Sounds like the "Dynamic workload" [1] management mode is for you. That mode allows you to define how many annotators should label a document before the document is considered to be finished. The assignment of documents to annotators happens automatically. Mind that annotators need to explicitly mark a document as finished. In the dynamic mode, marking a document as finished is the only way of being able to access the next document.

> Right now, I'm uploading each review as a .txt documents. In total I have some 2k reviews/documents for a total of ~8MB. Moving forward, I'll probably need to upload more batches like this, that is, 2-3k documents, each one containing one review.
>
> Does this approach make any sense or should I really try to keep all reviews on one document?

If your annotation task is to assign labels at the level of a full review, then modelling these annotations as a "document metadata" [2] layer is likely to be more convenient for your annotators than having to manually mark the span of a review in order to create a span-level annotation and then assign a label to that span-level annotation. You will find that the recently introduced "singleton" [3] option on document metadata layers is very useful for this purpose.

If every review is a document, then annotators finish the reviews one by one. The action to finish a document currently involves a confirmation dialog. Having to go through that dialog hundreds of times may be annoying to the annotators.

You could reduce the frequency of the confirmation dialog by having for example batches of 10 reviews per document. If sentence boundaries are not crucial for your annotation process, you may want to import the data in such a way that each review is treated as a single sentence. You may also want to increase the default page size in the annotation editor to your batch size [4].
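The batching described above could be prepared outside INCEpTION before upload. The following is only a minimal sketch: the function names, the file naming scheme, and the batch size of 10 are illustrative assumptions, and it assumes the resulting files are imported with a plain-text format that treats each line as one sentence.

```python
from pathlib import Path
from typing import Iterable, List


def flatten_review(text: str) -> str:
    """Collapse all whitespace (including line breaks) so a review fits on one line."""
    return " ".join(text.split())


def make_batches(reviews: Iterable[str], batch_size: int = 10) -> List[str]:
    """Group reviews into documents of `batch_size` reviews, one review per line."""
    lines = [flatten_review(r) for r in reviews]
    lines = [line for line in lines if line]  # skip reviews that are empty
    return [
        "\n".join(lines[i:i + batch_size]) + "\n"
        for i in range(0, len(lines), batch_size)
    ]


def write_batches(documents: List[str], out_dir: Path, prefix: str = "reviews") -> List[Path]:
    """Write each batch to its own .txt file (hypothetical naming: reviews_0001.txt, ...)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, doc in enumerate(documents, start=1):
        path = out_dir / f"{prefix}_{n:04d}.txt"
        path.write_text(doc, encoding="utf-8")
        paths.append(path)
    return paths
```

With 2k reviews and a batch size of 10, this would yield 200 files, so annotators would see the confirmation dialog 200 times instead of 2,000.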

If your annotation task is to annotate words/short spans within each review, then you should be fine with these mini-batches at this point. But if your task is annotating at the level of the full review, then your annotators would now face the issue of having to first create an annotation for the span of the review and then assign a label to it. Assuming you configure your annotation layer as a "sentence level" layer and that you have imported each review as "one sentence", creating an annotation is as easy as double-clicking anywhere within the sentence. But it is still a step that needs to be taken for each sentence. Also, once the annotation has been created, the default "brat (one line per sentence)" visualization mode can no longer line-wrap the sentence, which can be quite inconvenient. This problem can be avoided by switching to the "brat (break at 120 chars)" visualization mode via the preferences dialog on the annotation page. Presently, every annotator has to make that switch once manually - the project manager cannot (yet) set a default visualization mode.

> I would also like some input on another question: The reviews have some metadata which I would like to pass to inception (e.g., author's name, review's score from 1 to 5). I saw there's an experimental feature for importing metadata, would you say that's the way to go?

If you have one review per document, you can model the metadata as a document-metadata layer. If you have "one review as a sentence", you can model it as a sentence-level layer. If you do not want to display this metadata to the annotators, you can uncheck the "enabled" checkbox of the layer. If you want to display it but not let the user edit it, you can check the "read only" checkbox.

Cheers,

-- Richard

[1] https://inception-project.github.io/releases/21.1/docs/user-guide.html#sect_dynamic_workload
[2] https://inception-project.github.io/releases/21.1/docs/user-guide.html#_document_metadata
[3] https://inception-project.github.io/releases/21.1/docs/user-guide.html#_singletons
[4] https://inception-project.github.io/releases/21.1/docs/admin-guide.html#sect_settings_annotation

Richard Eckart de Castilho

Oct 26, 2021, 03:01:55
to incepti...@googlegroups.com
> On 25. Oct 2021, at 21:14, Dario Bonaretti <bonaret...@gmail.com> wrote:
>
> Right now, I'm uploading each review as a .txt documents. In total I have some 2k reviews/documents for a total of ~8MB. Moving forward, I'll probably need to upload more batches like this, that is, 2-3k documents, each one containing one review.

You can try adding all of these to a single project incrementally, but I would probably recommend creating a new project for each batch of about 2k documents.
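One way to plan such a split before uploading is sketched below. It only groups file names; creating the projects and uploading the files would still happen in INCEpTION itself. The function name, the project naming scheme, and the batch size of 2000 are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def plan_project_batches(
    filenames: Sequence[str], batch_size: int = 2000, prefix: str = "reviews-batch"
) -> Dict[str, List[str]]:
    """Map a hypothetical project name to the files that would go into it."""
    ordered = sorted(filenames)  # stable order so reruns give the same plan
    plan: Dict[str, List[str]] = {}
    for start in range(0, len(ordered), batch_size):
        name = f"{prefix}-{start // batch_size + 1:02d}"
        plan[name] = ordered[start:start + batch_size]
    return plan
```

For example, 4,500 files would be planned as three projects holding 2,000, 2,000, and 500 documents.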

Cheers,

-- Richard

Dario Bonaretti

Nov 2, 2021, 16:59:50
to inception-users
Richard, thanks a lot for your answer, which helps a lot!

To clarify how I'm planning the annotation task: annotators will need to annotate multiple spans within the same review, so a review (possibly 3-4 sentences) can have multiple spans (typically 1-5 words each) labeled differently.