Pangool 0.60.0 major release with nulls support

Pere Ferrera

Mar 12, 2013, 6:47:50 AM
to pangoo...@googlegroups.com
We have released major version 0.60.0 with an important new feature: native null support.

When we created Pangool we conceived it as a generalized, improved low-level API for Hadoop. In Hadoop you can't emit null as a key or as a value, so we didn't think Pangool should allow you to, either. However, as Pangool has matured within our stack, we realized there was a need to at least make null support optional, to make it easier to flow data coming in and out of other systems through Pangool job chains.

For instance, when processing or dumping data from relational databases, it is convenient to have nulls in your Tuples and let them flow through your pipeline.

But we didn't want to add it at the cost of a performance loss, so we implemented it as efficiently as possible, and made it optional. You can now declare individual fields of your schema to be "nullable" with a ? symbol (e.g. field1:int?), and if at least one field is nullable, a bitset indicating which fields are null is serialized underneath for each Tuple.
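
As a quick illustration, here is a minimal sketch of declaring and using a nullable field. It assumes the Fields.parse() helper and the Tuple setter API described in the user guide; treat the exact package paths as approximate:

import com.datasalt.pangool.io.Fields;
import com.datasalt.pangool.io.Schema;
import com.datasalt.pangool.io.Tuple;

public class NullableFieldExample {
  public static void main(String[] args) {
    // "field1:int?" marks field1 as nullable; field2 remains non-nullable.
    Schema schema = new Schema("example", Fields.parse("field1:int?,field2:string"));

    Tuple tuple = new Tuple(schema);
    tuple.set("field1", null);        // allowed because field1 is nullable
    tuple.set("field2", "some text");
    // Because the schema has at least one nullable field, a small bitset
    // recording which fields are null is serialized with each Tuple.
  }
}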

Besides, this release adds deep copy semantics to Tuples and fixes an important issue related to instance serialization, as well as other minor race conditions. We are also beginning to explore native integration between Pangool and existing Hadoop ETL systems such as Cascading, Pig and Hive. In this release we add InputFormats for Cascading and HCatalog (which allows for reading any Hive table), and a StoreFunc for Pig.

The release is already available on Maven Central: http://search.maven.org/#search%7Cga%7C1%7Cpangool

And the updated documentation, as usual: http://pangool.net/userguide

Jay Vyas

Mar 12, 2013, 11:53:30 AM
to pangoo...@googlegroups.com
Cool :) Re: "deep copy semantics to tuples"... what does that mean? I assumed that
Pangool only used primitives in the tuples, but maybe I am missing a feature?


--
Jay Vyas
http://jayunit100.blogspot.com

Pere Ferrera Bertran

Mar 12, 2013, 2:28:28 PM
to pangoo...@googlegroups.com
Hi Jay,

There are two different questions here. On one hand, Pangool Tuples allow any object as long as it can be serialized by Hadoop. On the other hand, even primitive types didn't have deep copy semantics before, as we reuse objects internally for efficiency (binary buffers). I believe deep copy semantics are only for primitive types, though.

In any case, Iván implemented this, so he can explain it better than I can.

2013/3/12 Jay Vyas <jayun...@gmail.com>



--
Pere Ferrera
CTO & Co-founder

Jay Vyas

Mar 12, 2013, 3:20:45 PM
to pangoo...@googlegroups.com
Hmmm... This is intriguing... So:

1) How could a primitive type need deep copy semantics?
And even if so -- 2) why would reusing the object preclude deep copy?

Pere Ferrera Bertran

Mar 12, 2013, 3:38:21 PM
to pangoo...@googlegroups.com
Jay, think of Hadoop's Text object and you'll understand it all.

2013/3/12 Jay Vyas <jayun...@gmail.com>

Pere Ferrera Bertran

Mar 12, 2013, 4:20:54 PM
to pangoo...@googlegroups.com
What I meant is that in Pangool, strings are wrapped in objects like Text, which are reused for efficiency... so if you create an array of Tuples and save the references, you will have the same string in all of them (because it is the same object).

What we added is a deepCopy() method for performing a real deep copy of the Tuple, in case you need to keep Tuples in memory. This is the commit: https://github.com/datasalt/pangool/commit/b3e5e7224dda4d73586b78dbd4999a95cb1cb5d2

And it seems that Iván made it general so that any Object can be cloned with custom logic, but he will clarify if needed...
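
To make the pitfall concrete, here is a minimal sketch. It assumes deepCopy() is an instance method returning a new Tuple; the exact signature is in the commit above:

import java.util.ArrayList;
import java.util.List;

import com.datasalt.pangool.io.Fields;
import com.datasalt.pangool.io.ITuple;
import com.datasalt.pangool.io.Schema;
import com.datasalt.pangool.io.Tuple;

public class DeepCopyExample {
  public static void main(String[] args) {
    Schema schema = new Schema("example", Fields.parse("line:string"));
    ITuple reused = new Tuple(schema);  // Pangool reuses instances like this one

    List<ITuple> aliased = new ArrayList<ITuple>();
    List<ITuple> copied = new ArrayList<ITuple>();
    for (String s : new String[] { "a", "b", "c" }) {
      reused.set("line", s);
      aliased.add(reused);             // bug: every entry aliases the same Tuple
      copied.add(reused.deepCopy());   // 0.60.0: an independent copy per entry
    }
    // aliased ends up holding "c" three times; copied holds "a", "b", "c".
  }
}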

2013/3/12 Pere Ferrera Bertran <pe...@datasalt.com>

Jay Vyas

Mar 12, 2013, 4:33:24 PM
to pangoo...@googlegroups.com
Thanks for finding that commit, Pere! That helps. I've seen code like this - typically for databases, ORMs and such. Is there now an example of a Pangool MR job that explicitly uses this functionality, which was previously not possible?

Pere Ferrera Bertran

Mar 12, 2013, 4:47:38 PM
to pangoo...@googlegroups.com
Yes, we made this mostly because we needed it for Splout SQL... In Splout SQL we perform sampling to distribute the data. One method implements reservoir sampling in a map-only job, and each mapper keeps an array of Tuples in memory (the samples); see the reservoirSampling() method in: https://github.com/datasalt/splout-db/blob/master/splout-hadoop/src/main/java/com/splout/db/hadoop/TupleSampler.java
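
For context, here is a hypothetical sketch of the idea: classic reservoir sampling (Algorithm R) over reused Tuples. The class and method names are invented for this sketch; the real implementation is in the TupleSampler class linked above:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import com.datasalt.pangool.io.ITuple;

public class ReservoirSketch {
  private final int k;                 // reservoir capacity
  private final List<ITuple> reservoir = new ArrayList<ITuple>();
  private final Random random = new Random();
  private long seen = 0;               // tuples offered so far

  public ReservoirSketch(int k) {
    this.k = k;
  }

  public void offer(ITuple tuple) {
    seen++;
    if (reservoir.size() < k) {
      // deepCopy() is essential: the framework reuses the incoming Tuple
      // instance, so storing the raw reference would alias a single object.
      reservoir.add(tuple.deepCopy());
    } else {
      // Keep the new tuple with probability k / seen, in a random slot.
      long j = (long) (random.nextDouble() * seen);
      if (j < k) {
        reservoir.set((int) j, tuple.deepCopy());
      }
    }
  }

  public List<ITuple> samples() {
    return reservoir;
  }
}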

Another example of something that was not possible before but is possible now would be a generic cross product (or co-group, if you like).

2013/3/12 Jay Vyas <jayun...@gmail.com>

Iván de Prado

Mar 13, 2013, 5:57:51 AM
to pangoo...@googlegroups.com
In general, you need to create copies of Tuples if you want to keep them in memory for any reason. We have added the possibility to create safe copies in general, and the possibility to provide custom methods that help when copying more complex objects (like Thrift objects).
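
As a purely hypothetical illustration of that last point (the interface and class names below are invented for this sketch; the real hook is in the commit Pere linked), a pluggable per-type copier could look like:

// Invented for illustration: lets a deep copy delegate to custom logic
// for objects it cannot clone generically, such as Thrift structs.
interface CustomCopier<T> {
  T copy(T original);
}

// Stand-in for a Thrift-generated class; real Thrift classes expose a
// copy constructor just like this one.
class ThriftLikeRecord {
  String payload;
  ThriftLikeRecord(String payload) { this.payload = payload; }
  ThriftLikeRecord(ThriftLikeRecord other) { this.payload = other.payload; }
}

class ThriftLikeCopier implements CustomCopier<ThriftLikeRecord> {
  public ThriftLikeRecord copy(ThriftLikeRecord original) {
    return new ThriftLikeRecord(original);  // deep copy via copy constructor
  }
}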

Iván
Iván de Prado
CEO & Co-founder