Data splitting and ID mapping


sliz...@googlemail.com

Oct 26, 2015, 9:39:24 AM
to dkpro-tc-users
Hello,

I just started using dkpro-tc for my bachelor thesis. After going through the examples, there are a couple of questions I have in mind.

- I am using a dataset in which I have to decide whether a sentence fits a template or not. In your examples, you had one file per sentence. Is it necessary to split my data into several files? For now I have one file for good sentences and one for bad ones.

- Every sentence I'm using has a unique ID. Can I use this ID in dkpro-tc? I saw that the evaluation is sorted by an ID. I want to access the original sentences after evaluating them, and it seems easier to me to map the IDs to the sentences. Or is there another way to get the original text?

Best regards,

Sebastian

Emily Jamison

Oct 26, 2015, 12:40:36 PM
to sliz...@googlemail.com, dkpro-tc-users
Hi Sebastian,

Welcome to DKPro TC!

> - I am using a dataset in which I have to decide whether a sentence fits a template or not. In your examples, you had one file per sentence. Is it necessary to split my data into several files? For now I have one file for good sentences and one for bad ones.

As you probably noticed from the demos, each dataset needs a special Reader to convert the dataset into JCases. You will write this Reader yourself, customizing it to your dataset's format.

Depending on how you want to structure your classification problem, you might find it particularly helpful to study the readers from the demos:
/dkpro-tc-examples/src/main/java/de/tudarmstadt/ukp/dkpro/tc/examples/single/unit/BrownUnitPosDemo.java
/dkpro-tc-examples/src/main/java/de/tudarmstadt/ukp/dkpro/tc/examples/single/document/TwentyNewsgroupsDemo.java
 
> - Every sentence I'm using has a unique ID. Can I use this ID in dkpro-tc? I saw that the evaluation is sorted by an ID. I want to access the original sentences after evaluating them, and it seems easier to me to map the IDs to the sentences. Or is there another way to get the original text?

Yes, pre-existing sentence IDs are usable.  For example, if you use a unit reader, one option is to set a suffix on the unit instance id:
(From BrownCorpusReader)
unit.setSuffix(sentence.getCoveredText());
or
unit.setSuffix("_" + myIdHere);
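
To get back from a classified instance to its original sentence, the suffix idea can be paired with a simple lookup table. The following is a minimal plain-Java sketch, not DKPro TC code. It assumes (these are illustration-only assumptions, not something stated in this thread) that the source data is stored as one `<id><TAB><sentence>` pair per line, that the instance ID written to the report ends in `"_" + myIdHere`, and that your own IDs contain no underscore. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps your own sentence IDs back to the original text.
class SentenceIdLookup {

    /** Builds an ID -> sentence index from lines of the assumed form "<id>\t<sentence>". */
    static Map<String, String> indexById(List<String> lines) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines that do not match the assumed format
            }
            index.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return index;
    }

    /**
     * Extracts your own ID from an instance ID that ends in "_" + ownId.
     * Assumes the own ID itself contains no underscore.
     */
    static String ownIdFromInstanceId(String instanceId) {
        int sep = instanceId.lastIndexOf('_');
        return sep < 0 ? instanceId : instanceId.substring(sep + 1);
    }

    public static void main(String[] args) {
        Map<String, String> index = indexById(Arrays.asList(
                "s17\tThe cat sat on the mat.",
                "s18\tDogs bark loudly."));

        // "good.txt_0_s17" is a made-up instance ID used only for illustration.
        String ownId = ownIdFromInstanceId("good.txt_0_s17");
        System.out.println(ownId + " -> " + index.get(ownId));
    }
}
```

If your IDs can contain underscores, a different separator or a fixed-width ID would be needed for the split to stay unambiguous.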

Hope this helps,
Emily



sliz...@googlemail.com

Oct 28, 2015, 4:38:59 AM
to dkpro-tc-users, sliz...@googlemail.com
Hi Emily,

Thanks for the quick help.

> As you probably noticed from the demos, each dataset needs a special Reader to convert the dataset into JCases.  You will write this Reader yourself, customizing it to your dataset's format.

I have now written my own Reader. After some problems (do units have to read annotated texts?), I decided to extend the Conll Reader. I am setting the sentences as a unit. I tested the Reader, and my sentences are in each unit. Your tip with the ID worked fine.
When I look at the IdOutcomeReport, I get only two outputs, one for each dataset. Do I have to set an option so that Weka knows it has to map the units?

Best,
Sebastian

Emily Jamison

Oct 28, 2015, 10:37:34 AM
to Sebastian Z, dkpro-tc-users
Hi Sebastian,

> I have now written my own Reader. After some problems (do units have to read annotated texts?), I decided to extend the Conll Reader. I am setting the sentences as a unit. I tested the Reader, and my sentences are in each unit. Your tip with the ID worked fine.

Just to clarify: the purpose of a unit reader, as opposed to, say, a document reader, is that it creates multiple machine learning instances within the same document, while the document is subjected to preprocessing (tokenizing, POS tagging, etc.) only once. Additionally, in the case of sequence classification, the units grouped together in one text classification sequence should be related, such as the tokens of a single sentence for POS tagging, because the text classification sequence itself is used in the sequence classification.
I'm not sure exactly what your task is, but if you are trying to classify a sequence of sentences from a document (such as one news article), it could make sense to create a unit reader with the (single) sentence as the unit, and all sentences from a single document together in a text classification sequence, with as many text classification sequences as you have news articles. In the Id2Outcome file, each unit will be listed with its individual classification.
If your sentences have nothing in common with each other, then you probably want to set each sentence as an individual JCas in a document reader, as in the TwentyNewsgroups demo.

No, units don't have to be annotated text.
 
> When I look at the IdOutcomeReport, I get only two outputs, one for each dataset. Do I have to set an option so that Weka knows it has to map the units?

This sounds like you set each entire file as a unit.  After you fix your reader, the Id2Outcome file will show one classification for each sentence, as fits your task.
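
To illustrate the difference between "one instance per file" and "one instance per sentence", here is a small plain-Java sketch of the splitting step that a fixed reader would need to perform. It does not use the DKPro TC API. It assumes (not confirmed in this thread) that each of the two files holds one sentence per line and that the file's label ("good" or "bad") applies to every sentence in it. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: turn each label file into one instance per sentence.
class SentenceInstances {

    /** One classification instance: a single sentence plus its gold label. */
    static final class Instance {
        final String id;
        final String text;
        final String label;

        Instance(String id, String text, String label) {
            this.id = id;
            this.text = text;
            this.label = label;
        }
    }

    /** Creates one instance per non-empty line; IDs are "<label>_<running number>". */
    static List<Instance> fromLines(String label, List<String> lines) {
        List<Instance> instances = new ArrayList<>();
        int counter = 0;
        for (String line : lines) {
            String text = line.trim();
            if (text.isEmpty()) {
                continue; // blank lines do not become instances
            }
            instances.add(new Instance(label + "_" + counter, text, label));
            counter++;
        }
        return instances;
    }

    public static void main(String[] args) {
        List<Instance> all = new ArrayList<>();
        // In a real reader these lines would come from the two data files.
        all.addAll(fromLines("good", Arrays.asList("This sentence fits.", "So does this one.")));
        all.addAll(fromLines("bad", Arrays.asList("This one does not fit.")));

        // One line per sentence, rather than one line per file.
        for (Instance instance : all) {
            System.out.println(instance.id + "\t" + instance.label + "\t" + instance.text);
        }
    }
}
```

With two files, a reader that emits one instance per file yields two report entries, whereas splitting as above yields one entry per sentence.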

Hope this helps,
Emily
 