Training TSV (tab-separated) files in ClearTK


dreamsco...@gmail.com

Apr 13, 2015, 2:23:18 AM
to cleart...@googlegroups.com
Hi,

I am using the basic Document Classification example. Is there a way to train on tab-separated (TSV) files? Is any example code available?

Thanks in Advance
dreams

Lee Becker

Apr 13, 2015, 10:30:22 AM
to cleart...@googlegroups.com
ClearTK doesn't support this explicitly, but you will not need too much beyond the Document Classification example code in ClearTK.

What is your TSV's schema?  I assume it's something like:
document text<tab>document label

Most of the pipeline is set up to ingest files, load the text from the URI, extract features, etc. What you will want to do is swap out the GoldDocumentCategoryAnnotator, which assumes a 1:1 correspondence between document and category, for something that can put multiple document annotations within a single CAS. Your process method would look something like this[1]:

String tsvText = jCas.getDocumentText();
int begin = 0;
for (String tsvLine : tsvText.split("\n")) {
   // This is where you would swap the TSV parsing to match your own schema
   String[] parts = tsvLine.split("\t");
   if (parts.length >= 2) {
      String docText = parts[0];
      String docLabel = parts[1].trim();  // trim() also drops a trailing \r
      int end = begin + docText.length();

      // Swap UsenetDocument with your own type
      UsenetDocument document = new UsenetDocument(jCas, begin, end);
      document.setCategory(docLabel);
      document.addToIndexes();
   }
   // Advance past this line and the newline character that split() removed
   begin += tsvLine.length() + 1;
}

[1] This isn't actual working code. You will need to make sure the offsets are calculated correctly and that the Java I'm cobbling together from memory is correct. You may also want to change UsenetDocument to something from your own type system.
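
Since the offsets are the easiest part to get wrong, here is a minimal sketch for checking them in plain Java, without any ClearTK or UIMA classes. It assumes the `document text<tab>document label` schema guessed at above. The names `TsvOffsets`, `Span`, and `parse` are made up for illustration and are not part of ClearTK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a ClearTK class: computes the character span of
// the text column of each TSV line, plus that line's label.
final class TsvOffsets {

    /** One labeled row: offsets of the text column within the whole input, and its label. */
    static final class Span {
        final int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Returns a span for every line that has at least two tab-separated columns. */
    static List<Span> parse(String tsvText) {
        List<Span> spans = new ArrayList<>();
        int begin = 0;
        for (String line : tsvText.split("\n")) {
            String[] parts = line.split("\t");
            if (parts.length >= 2) {
                // The text column is first on the line, so its span starts at `begin`.
                int end = begin + parts[0].length();
                spans.add(new Span(begin, end, parts[1].trim())); // trim() drops a trailing \r
            }
            // Skip the whole line plus the '\n' that split() removed.
            begin += line.length() + 1;
        }
        return spans;
    }

    public static void main(String[] args) {
        // Mixed line endings and a blank line, to exercise the offset arithmetic.
        String tsv = "good movie\tpos\r\n\nbad plot\tneg\n";
        for (Span s : parse(tsv)) {
            System.out.println(tsv.substring(s.begin, s.end) + " -> " + s.label);
        }
        // prints:
        // good movie -> pos
        // bad plot -> neg
    }
}
```

If `substring(begin, end)` gives back exactly the text column for every row, the same `begin` and `end` values should be safe to pass to the annotation constructor, because annotation offsets index into the same document text.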

Cheers,
Lee