Issue with tokenizer when running on hadoop

Samudra Banerjee

Feb 18, 2014, 3:06:33 PM
to dkpro-c...@googlegroups.com
Hi Experts,

I have a scenario where I would like to POS-tag documents, and I need to run the annotation on Hadoop. I started with Twitter data and will generalize later. The Mapper reads lines (which correspond to tweets in JSON format) from the input file, and once 100 tweets have been accumulated for a language, a "process" method is invoked that runs a tokenizer and a POS tagger over the collection. This MapReduce code runs correctly when executed locally, but when I run it on Hadoop, I get the following exception at the line "tc.annotator.process(jCas)":

JCas type "de.tudarmstadt.ukp.dkpro.core.api.metadata.type.Sentence" used in Java code, but was not declared in the XML type descriptor

I am not familiar with dkpro-bigdata yet, but I am looking into it. Do I need to use dkpro-bigdata in order to run annotations on Hadoop? What could be the reason for the failure here?

My Mapper class is as follows:

import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.uima.UIMAException;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

import de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData;

public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    HashMap<String, TweetConcat> map = new HashMap<String, TweetConcat>();
    JSONParser parser = new JSONParser();
    TweetConcat tc;

    public static void process(TweetConcat tc) throws UIMAException {
        //AnalysisEngineDescription cc = createEngineDescription(CasDumpWriter.class, CasDumpWriter.PARAM_OUTPUT_FILE, "/home/sabanerjee/Twitter/annotations2/" + tc.language + "/" + tc.document_id + "output.txt");
        JCas jCas = JCasFactory.createJCas();
        jCas.setDocumentLanguage(tc.language);
        jCas.setDocumentText(tc.sb.toString());
        DocumentMetaData metaData = new DocumentMetaData(jCas);
        metaData.setDocumentId(tc.document_id);
        metaData.addToIndexes();
        SimplePipeline.runPipeline(jCas, tc.tokenizer);
        tc.annotator.process(jCas); // <-- this line throws the exception on Hadoop
        //SimplePipeline.runPipeline(jCas, cc);
        SimplePipeline.runPipeline(jCas, tc.serializer);
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        //System.out.println("[MAPPER]" + line + "------" + count);
        try {
            JSONObject tweet = (JSONObject) parser.parse(line);
            String lang = (String) tweet.get("lang");
            // guard against tweets without a lang field
            if (lang != null && (lang.equals("en") || lang.equals("de") || lang.equals("ar"))) {
                //System.out.println(lang);
                tc = map.get(lang);
                if (tc == null) {
                    tc = new TweetConcat(lang);
                    tc.document_id = String.valueOf(tweet.get("id"));
                    map.put(lang, tc);
                }
                else if (tc.tweetcount == 0) {
                    tc.document_id = String.valueOf(tweet.get("id"));
                }
                tc.sb.append(tweet.get("text") + "\n");
                tc.tweetcount++;
                if (tc.tweetcount > 100) {
                    System.out.println("Started: " + tc.language + " " + tc.document_id);
                    try {
                        process(tc);
                    }
                    catch (Exception e) {
                        e.printStackTrace();
                    }
                    tc.tweetcount = 0;
                    tc.sb.setLength(0);
                    System.out.println("Done: " + tc.language + " " + tc.document_id);
                }
            }
        }
        catch (ParseException e) {
            // JSONParser.parse throws a checked ParseException
            e.printStackTrace();
        }
    }
}

Thanks and Regards,
Samudra

--
Samudra Banerjee
First year graduate student,
Department of Computer Science,
State University of New York at Stony Brook
NY 11794-3393

Richard Eckart de Castilho

Feb 18, 2014, 3:55:19 PM
to dkpro-c...@googlegroups.com
Hi,

it sounds like you may have hit a known problem:

https://issues.apache.org/jira/browse/UIMA-3385

A quick workaround would be to call TypeSystemDescriptionFactory.createTypeSystemDescription() once outside Hadoop, save the resulting type system description as an XML file in one of your JARs, and later load it in your code with JCasFactory.createJCas("path.to.file") (note the package-like notation and the missing .xml extension).
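Roughly, a sketch of that (untested; assumes uimaFIT 2.x, and the names DumpTypeSystem, typesystem.xml and desc/type are just examples, not fixed conventions):

import java.io.FileWriter;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.factory.TypeSystemDescriptionFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.metadata.TypeSystemDescription;

public class DumpTypeSystem {

    // Step 1: run once outside Hadoop, with all type system JARs on the
    // classpath. This auto-detects the types and writes them to one XML file.
    public static void main(String[] args) throws Exception {
        TypeSystemDescription tsd =
                TypeSystemDescriptionFactory.createTypeSystemDescription();
        FileWriter out = new FileWriter("typesystem.xml");
        tsd.toXML(out);
        out.close();
    }

    // Step 2: package typesystem.xml into the job JAR (here under desc/type/)
    // and load it explicitly; note the package-like notation and the
    // missing .xml extension.
    public static JCas createJCasForMapper() throws Exception {
        return JCasFactory.createJCas("desc.type.typesystem");
    }
}

That way the mapper no longer depends on the automatic classpath scanning, which is presumably what fails under Hadoop's class loading.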

There may be better ways. Feel free to comment on the issue mentioned above.

Cheers,

-- Richard