Reading an archive of CAS objects as input


Samudra Banerjee

Feb 18, 2014, 3:40:56 PM
to dkpro-big...@googlegroups.com
Hi Experts,

I would like to implement a pipeline that accepts a large document corpus in the form of an archive of un-annotated JCas objects, runs tokenization and POS tagging on them, and saves the annotated JCas objects back as a collection archive. I expect the number of JCas objects to be huge (for example, each corresponding to a page from Wikipedia), and since I need to run this pipeline on Hadoop, I was thinking of trying dkpro-bigdata.
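
For reference, on a single machine the kind of pipeline I have in mind would look roughly like this with uimaFIT and DKPro Core (just a sketch; the reader, the OpenNLP tagger, the XMI writer, and the paths are placeholder choices, not a fixed design):

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.io.xmi.XmiWriter;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

public class LocalPipelineSketch {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
            // read plain-text documents from a local directory
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "corpus/*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            // tokenization and POS tagging
            createEngineDescription(BreakIteratorSegmenter.class),
            createEngineDescription(OpenNlpPosTagger.class),
            // write the annotated CASes back out as XMI
            createEngineDescription(XmiWriter.class,
                XmiWriter.PARAM_TARGET_LOCATION, "corpus-annotated"));
    }
}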

Can someone shed some light on how I should go about this? I have not yet tried the example in the wiki, but I will. In particular, does the CollectionReader support reading from archives? If not, is it possible to implement that?

Thanks and Regards,
Samudra

Richard Eckart de Castilho

Feb 18, 2014, 3:49:22 PM
to Samudra Banerjee, dkpro-big...@googlegroups.com
Hi,

I don't think that it is necessary to read from an archive on Hadoop. The equivalent is afaik a sequence file. I can even imagine that HDFS may be able to deal nicely with large numbers of files… However, I'm not one of the experts ;)
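
For example, packing many small plain-text documents into one sequence file with the stock Hadoop API could look roughly like this (an untested sketch; the file name, document ids, and texts are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackCorpusIntoSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // one record per document: key = document id, value = document text
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("corpus.seq"), Text.class, Text.class);
        try {
            writer.append(new Text("doc-0001"), new Text("Text of the first page ..."));
            writer.append(new Text("doc-0002"), new Text("Text of the second page ..."));
        } finally {
            writer.close();
        }
    }
}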

Cheers,

-- Richard

Hans-Peter Zorn

Feb 18, 2014, 4:57:33 PM
to dkpro-big...@googlegroups.com
Hi,

If you want to do it with dkpro-bigdata, it is not that difficult. You will have to implement the (very simple) interface DocumentTextExtractor within the class Text2CASInputFormat in dkpro.bigdata.io.hadoop. I actually have such an InputFormat for tweets (in a different format, however) half-finished; if you are interested, I can send it to you.
The tweets are then transformed into CASes by the InputFormat, and the Mapper can process those directly. The resulting CASes will then be written as Hadoop sequence files, either as XML (CASWritable) or binary (BinCasWithTypesystemWritable).

dkpro-bigdata and the associated scripts will also help you assemble the dependencies without needing a fat JAR, which can lead to problems with finding type system descriptors.
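
The extractor interface itself is tiny; a minimal (untested) sketch that simply treats each input line as one document would be:

import org.apache.hadoop.io.Text;

import de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.Text2CASInputFormat.DocumentTextExtractor;

// Minimal sketch: use the whole input line as the document text.
public class LineTextExtractor implements DocumentTextExtractor {
    @Override
    public Text extractDocumentText(Text key, Text value) {
        return new Text(value.toString());
    }
}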

Best,
-hp


Samudra Banerjee

Feb 18, 2014, 5:04:18 PM
to Hans-Peter Zorn, dkpro-big...@googlegroups.com
Thanks a lot Richard and Hans for your inputs.

@Hans: Great, I will try that. And yes, it would be great if you could send it to me; I plan to deal with Twitter data as well, and it would give me an idea. You can mail me at sam...@gmail.com or point me to a repository if you are maintaining one!

Thanks once again,
Regards,
Samudra

Sent from my iPhone

Hans-Peter Zorn

Feb 19, 2014, 7:38:18 AM
to dkpro-big...@googlegroups.com
Hi,

I just committed a change to the dkpro-bigdata Git repository. You will need to check it out and compile it, or add the UKP snapshot Maven repository to your Maven configuration.

With this, you should be able to use the example given below (use the examples project in
dkpro-bigdata as a template). 

The class defines custom document text and metadata extractors for Text2CASInputFormat. This is where you can plug in your code that parses the tweet JSON; the prerequisite is that each record is on a separate line.

It will run a simple pipeline on the data and store the results as XMI CASes in a SequenceFile.

These files can then be read and processed again by dkpro-bigdata directly.

For small documents such as tweets, the XMI format is more efficient because it does not need to store a type system with each document.

Disclaimer: I didn't really test the code below, so use it at your own risk :-)

Best,
-hp


import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ToolRunner;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.resource.ResourceInitializationException;

import de.tudarmstadt.ukp.dkpro.bigdata.hadoop.DkproHadoopDriver;
import de.tudarmstadt.ukp.dkpro.bigdata.hadoop.DkproMapper;
import de.tudarmstadt.ukp.dkpro.bigdata.hadoop.DkproReducer;
import de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.CASWritable;
import de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.Text2CASInputFormat;
import de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.Text2CASInputFormat.DocumentMetadataExtractor;
import de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.Text2CASInputFormat.DocumentTextExtractor;
import de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData;
import de.tudarmstadt.ukp.dkpro.core.snowball.SnowballStemmer;
import de.tudarmstadt.ukp.dkpro.core.tokit.BreakIteratorSegmenter;

public class Tweet2CASExample extends DkproHadoopDriver {

    // This extractor assumes the tweet data as tab-separated values: timestamp\tuser\ttweet
    public static class TweetTextExtractor implements DocumentTextExtractor,
            DocumentMetadataExtractor {

        @Override
        public void extractDocumentMetaData(Text key, Text value,
                DocumentMetaData metadata) {
            String[] values = value.toString().split("\t");
            // set document id as user%timestamp
            metadata.setDocumentId(key.toString() + "%" + values[0]);
        }

        @Override
        public Text extractDocumentText(Text key, Text value) {
            // Put your JSON parser here.
            String[] values = value.toString().split("\t");
            return new Text(values[1]);
        }

    }

    public AnalysisEngineDescription buildMapperEngine(Configuration job)
            throws ResourceInitializationException {
        AnalysisEngineDescription tokenizer = createEngineDescription(BreakIteratorSegmenter.class);
        AnalysisEngineDescription stemmer = createEngineDescription(
                SnowballStemmer.class, SnowballStemmer.PARAM_LANGUAGE, "en");
        return createEngineDescription(tokenizer, stemmer);
    }

    public static void main(String[] args) throws Exception {
        Tweet2CASExample pipeline = new Tweet2CASExample();
        pipeline.setMapperClass(DkproMapper.class);
        pipeline.setReducerClass(DkproReducer.class);
        ToolRunner.run(new Configuration(), pipeline, args);
    }

    @Override
    public void configure(JobConf job) {
        /*
         * Use the custom extractors defined above.
         */
        job.set("dkpro.uima.text2casinputformat.documentmetadataextractor",
                TweetTextExtractor.class.getCanonicalName());
        job.set("dkpro.uima.text2casinputformat.documenttextextractor",
                TweetTextExtractor.class.getCanonicalName());
        /*
         * Tweets are very small documents; the default BinCasWithTypesystemWritable output is very
         * inefficient for this kind of data, therefore we use XMI serialization (CASWritable).
         */
        job.setOutputValueClass(CASWritable.class);
        /*
         * Use Text2CASInputFormat to read the texts directly from HDFS.
         */
        job.setInputFormat(Text2CASInputFormat.class);
    }

    @Override
    public Class getInputFormatClass() {
        return Text2CASInputFormat.class;
    }

}
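
If your tweets come as one JSON object per line rather than tab-separated values, the text extractor could parse them with Jackson instead, roughly like this (an untested sketch; it assumes Jackson 2 on the classpath and the Twitter API field name "text", so adjust it to your data):

import java.io.IOException;

import org.apache.hadoop.io.Text;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import de.tudarmstadt.ukp.dkpro.bigdata.io.hadoop.Text2CASInputFormat.DocumentTextExtractor;

// Sketch of a JSON variant of the extractor above (one JSON tweet per input line).
public class JsonTweetTextExtractor implements DocumentTextExtractor {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public Text extractDocumentText(Text key, Text value) {
        try {
            JsonNode tweet = mapper.readTree(value.toString());
            // "text" holds the tweet body in the Twitter API JSON
            return new Text(tweet.get("text").asText());
        } catch (IOException e) {
            throw new RuntimeException("Could not parse tweet JSON: " + value, e);
        }
    }
}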
