Writing CASes to HDFS


Yamen Ajjour

unread,
Jun 29, 2016, 7:41:03 AM6/29/16
to dkpro-bigdata-users
Hello Everybody,

I would like to ask how users of DKPro BigData write CASes to HDFS. What I tried was to create an engine from the TextWriter class and pass it a URI pointing to the place where I would like the output to be saved, but sadly it is not working.

This is the code I tried:


public AnalysisEngineDescription buildMapperEngine(Configuration job)
    throws ResourceInitializationException
{
    ExternalResourceDescription locator = createExternalResourceDescription(
            HdfsResourceLoaderLocator.class,
            HdfsResourceLoaderLocator.PARAM_FILESYSTEM, hdfsURI);
    AnalysisEngineDescription tokenizer = createEngineDescription(BreakIteratorSegmenter.class);
    AnalysisEngineDescription stemmer = createEngineDescription(SnowballStemmer.class,
            SnowballStemmer.PARAM_LANGUAGE, "en");
    AnalysisEngineDescription writer = createEngineDescription(TextWriter.class,
            TextWriter.PARAM_TARGET_LOCATION, "hdfs:/user/befi8957/sample_experiment/result");
    return createEngineDescription(tokenizer, stemmer, writer);
}


Regards,
Yamen

Richard Eckart de Castilho

unread,
Jun 29, 2016, 7:46:27 AM6/29/16
to Yamen Ajjour, dkpro-bigdata-users
Hi,

DKPro Core writers presently do not support writing to HDFS, but it is something we have on the radar (contributions welcome):

https://github.com/dkpro/dkpro-core/issues/869

DKPro BigData has some provisions for copying data that was written to a local file system directory back into HDFS after a process ends. However, for some (unknown?) reason, this is currently disabled: the number of reducers has been hard-coded to 0.

Cf. https://groups.google.com/d/msg/dkpro-bigdata-developers/UVqRqKpa2IE/3UPXwuh3BwAJ

You could go into DkproHadoopDriver and change it back so that running a reducer works, or consider porting the copying logic into the mapper.
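In plain Hadoop terms, re-enabling that copy-back step amounts to turning the reducer count back up. A minimal sketch using the standard Hadoop mapred API — where exactly DkproHadoopDriver configures this is an assumption; only JobConf.setNumReduceTasks itself is standard Hadoop:

    // Hypothetical excerpt; the actual spot in DkproHadoopDriver may differ.
    JobConf job = new JobConf(getConf(), DkproHadoopDriver.class);
    // Was hard-coded to 0, which disables the reduce phase (and with it the copy-back):
    job.setNumReduceTasks(1);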

Cheers,

-- Richard

Yamen Ajjour

unread,
Jun 29, 2016, 9:07:57 AM6/29/16
to Richard Eckart de Castilho, dkpro-bigdata-users
Thank you very much. You mean the number of reducers is 0, right? I also tried using a local directory for the output, but no output is generated.
I used the following TextWriter:
            AnalysisEngineDescription writer = createEngineDescription(TextWriter.class,
                    TextWriter.PARAM_FILENAME_SUFFIX, ".txt",
                    TextWriter.PARAM_TARGET_LOCATION, "~/output");

Is there any way I can debug the code? I tried writing to stdout and looking at the logs, but didn't manage to find anything.

Richard Eckart de Castilho

unread,
Jun 29, 2016, 9:09:33 AM6/29/16
to Yamen Ajjour, dkpro-bigdata-users
On 29.06.2016, at 15:07, Yamen Ajjour <yamen...@gmail.com> wrote:
>
> Thank you very much. You mean the number of reducers is 0, right? I also tried using a local directory for the output, but no output is generated.
> I used the following TextWriter:
> AnalysisEngineDescription writer = createEngineDescription(TextWriter.class,
>         TextWriter.PARAM_FILENAME_SUFFIX, ".txt", TextWriter.PARAM_TARGET_LOCATION, "~/output");
>
> Is there any way I can debug the code? I tried writing to stdout and looking at the logs, but didn't manage to find anything.

I would recommend you check out the code from GitHub and import it into your IDE (Eclipse, IntelliJ, etc.).

Of course you can debug it using the debugger integrated in your IDE. Just set breakpoints in the code and run your program in debugging mode.

Cheers,

-- Richard

Hans-Peter Zorn

unread,
Jun 29, 2016, 4:54:42 PM6/29/16
to Richard Eckart de Castilho, Yamen Ajjour, dkpro-bigdata-users
Hi,
copying back is implemented in UIMAMapReduceBase, so it should work for both mappers and reducers. However, since this functionality was mainly implemented for embarrassingly parallel execution of UIMA pipelines, the number of reducers is set to 0.
Did you try using the $dir placeholder for the TARGET_LOCATION of the TextWriter? DKPro BigData looks for this in the UIMA metadata and replaces it with a temporary path from which data is then copied back to HDFS.

At least this was how it worked 3 years ago :)

Best,
-hp




Richard Eckart de Castilho

unread,
Jun 29, 2016, 4:56:08 PM6/29/16
to Hans-Peter Zorn, Yamen Ajjour, dkpro-bigdata-users
I think I tried it with the $dir in the mapper, but I don't remember exactly. Maybe I didn't ;)

I'll try again!

Thanks for the pointer!

-- Richard

Hans-Peter Zorn

unread,
Jun 29, 2016, 5:09:22 PM6/29/16
to Richard Eckart de Castilho, Yamen Ajjour, dkpro-bigdata-users
Just looking at the original question:
> I would like to ask how users of DKPro BigData write CASes to HDFS.

Just for clarification: do you want to use a *Writer to write some textual output to HDFS? Or do you want to write
CASes for further processing by DKPro BigData? In the latter case, you just need to emit CASes from the last component in your pipeline (i.e., not have a writer at all); they will then be written to HDFS as a SequenceFile of binary CAS objects.
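To illustrate the latter option, a mapper engine without a writer would simply return the analysis components — a sketch based on the pipeline from the original question:

    public AnalysisEngineDescription buildMapperEngine(Configuration job)
        throws ResourceInitializationException
    {
        AnalysisEngineDescription tokenizer = createEngineDescription(BreakIteratorSegmenter.class);
        AnalysisEngineDescription stemmer = createEngineDescription(SnowballStemmer.class,
                SnowballStemmer.PARAM_LANGUAGE, "en");
        // No writer here: the CASes emitted by the last component are serialized by
        // DKPro BigData itself and written to HDFS as binary CASes in a SequenceFile.
        return createEngineDescription(tokenizer, stemmer);
    }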

Best,
Hans-Peter

Yamen Ajjour

unread,
Jun 30, 2016, 3:13:16 AM6/30/16
to Hans-Peter Zorn, Richard Eckart de Castilho, dkpro-bigdata-users
Thank you for this information. May I ask how I can choose the path where the CASes will be saved?

Richard Eckart de Castilho

unread,
Jul 19, 2016, 6:39:34 AM7/19/16
to Yamen Ajjour, Hans-Peter Zorn, dkpro-bigdata-users
On 30.06.2016, at 09:13, Yamen Ajjour <yamen...@gmail.com> wrote:
>
> Thank you for this information. May I ask how I can choose the path where the CASes will be saved?

The directory where DKPro BigData stores the files appears to be hard-coded/automatically determined.
If you run a job, you should see a message starting with "Writing local data to: ..." in the log.

You can tell components to store their output in that location using the "$dir" placeholder, e.g.

createEngineDescription(TextWriter.class,
        TextWriter.PARAM_TARGET_LOCATION, "$dir/output");

Cheers,

-- Richard