Is there a MultiStorage-like output format in Pangool?


Alexei Perelighin

Dec 13, 2013, 4:15:46 AM
to pangoo...@googlegroups.com
Hi,

Is there an output format like org.apache.pig.piggybank.storage.MultiStorage that is compatible with ITuple?

It would be very convenient for ETL Pangool jobs that prepare partitioned data for inserting straight into Hive.

Thanks,
Alexei

Pere Ferrera

Dec 13, 2013, 5:57:08 AM
to pangoo...@googlegroups.com
Hello Alexei,

I haven't used MultiStorage, but it seems to be able to write to arbitrary directories named after some Tuple field.

The current Pangool version has a method, setDefaultNamedOutput(), which lets you define the type of an "arbitrary" named output. This means that you can use any number of Pangool's named outputs without configuring each one beforehand. I use this in some projects to produce date-partitioned outputs (i.e. one sub-folder per date).

I believe this feature is only available in the current snapshot. The snapshot has many good new things, which we want to release before the end of the year. We also need to update the documentation, as many features are not well covered there.

Tell me if this works out well for you,

Alexei Perelighin

Dec 13, 2013, 6:12:07 AM
to pangoo...@googlegroups.com

Hi Pere,

It looks similar to MultiStorage.

Could you give a code snippet showing how to configure it and use it in the mapper or reducer?

Is the snapshot available in a Maven repository?

Thanks,
Alexei

Pere Ferrera

Dec 13, 2013, 6:20:02 AM
to pangoo...@googlegroups.com
Sure. It's actually pretty straightforward. If you are using tuple output with a certain schema:

builder.setDefaultNamedOutput(outputSchema);

(There are also other methods with the same name for configuring the Hadoop Key and Value classes, in case you don't emit ITuple.)

And then you just use named outputs as usual:

collector.getNamedOutput(folderName).write(...)
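
To make the date-partitioned pattern concrete, here is a rough, self-contained sketch in plain Java. It is not Pangool code: InMemoryNamedOutputs is a hypothetical stand-in for the collector, and folderFor() is a made-up helper. This thread does not confirm which characters Pangool accepts in a named-output name, so the sketch keeps only letters and digits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionedOutputSketch {

    /** Hypothetical stand-in for a collector with named outputs; NOT Pangool's API. */
    static class InMemoryNamedOutputs {
        private final Map<String, List<String>> outputs = new LinkedHashMap<>();

        // The output for a folder is created lazily on first write,
        // so no folder name has to be declared up front.
        void write(String folderName, String record) {
            outputs.computeIfAbsent(folderName, k -> new ArrayList<>()).add(record);
        }

        Map<String, List<String>> asMap() {
            return outputs;
        }
    }

    /** Hypothetical helper: derives a folder name from a date field, letters and digits only. */
    static String folderFor(String date) {
        return "date" + date.replaceAll("[^A-Za-z0-9]", "");
    }

    public static void main(String[] args) {
        InMemoryNamedOutputs collector = new InMemoryNamedOutputs();
        // Each record is {date field, payload}; the date decides the target folder.
        String[][] records = {
            {"2013-12-12", "a"},
            {"2013-12-13", "b"},
            {"2013-12-12", "c"}
        };
        for (String[] record : records) {
            collector.write(folderFor(record[0]), record[1]);
        }
        System.out.println(collector.asMap());
    }
}
```

In a real job, the write() call in the loop would correspond to the getNamedOutput(folderName).write(...) call above, with the folder name computed from a tuple field.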

You can use the snapshot Maven repository specified here: http://pangool.net/build.html
The current development version is 0.60.7-SNAPSHOT, which is about to become an official release.

Let me know how it works,