Recomputing Datasets and PailTap

Stefan Hübner

Jul 26, 2012, 3:20:18 AM

to cascal...@googlegroups.com

I'm following Nathan and Sam's Big Data book and I have a bunch of
conceptual questions around dfs-datastores-cascading and PailTap.
Instead of posting to the author forum on Manning, I decided to dump
these questions here.

The book talks at length about the advantages of recomputing whole
datasets in the batch layer instead of incrementally adding to them.
That makes a lot of sense to me.

So, I've started to build a data model along the lines described in the
book using Apache Thrift as the model description technology. The model
will have lots of different node types with lots of different
properties. My intention was then to go over the raw datasets, compute
these properties, serialize them into a temporary PailTap and then
replace the existing data in the final Pail with the data from the
just-computed PailTap.

But there are two questions I haven't gotten my head around yet:

1) A PailTap only serializes the first element of a Cascading
record. How do you serialize several properties (read: "lots of
properties") efficiently? I guess a bufferop could be used to
transpose a record of multiple thrift objects into lots of records
with one object each. Another way could be to have individual jobs - one
for each property to be recomputed. Both strategies seem suboptimal
to me and I feel like I'm missing something.

2) Once you have a freshly recomputed dataset in a temporary PailTap,
how do you update the "production" Pail? Do you run .clean() followed by
an .absorb()? This would cause trouble for any job currently
operating on that pail, no?

Thanks for any hints!
-Stefan

Andy Xue

Jul 27, 2012, 1:33:18 AM

to cascal...@googlegroups.com, sthu...@googlemail.com

Stefan -- I totally sympathize with 1)

It seems restrictive that it only works on the 1st element; this forces the data into a thrift object that is the first (and only) element in the record, with all info/properties in the record encompassed within that thrift object. Not sure why you think this would require a bufferop (which gathers tuples); rather, I see the issue as gathering up all the elements (fields) within a tuple into a single object.

I'm also confused about why you want to use Pail. The whole reason to use Pail is that it handles adding incremental data onto a data set really well, not regenerating the whole set anew (adding incrementally is basically what .absorb does). I feel like Pail is the opposite of what you want? Or maybe I have totally misunderstood the issue here?

Stefan Hübner

Jul 30, 2012, 3:44:39 PM

to cascal...@googlegroups.com

Andy Xue <and...@lumoslabs.com> writes:

> Stefan -- I totally sympathize with 1)
>
> It seems restrictive that it only works on the 1st element; this forces the data into a
> thrift object that is the first (and only) element in the record, with all info/properties
> in the record encompassed within that thrift object. Not sure why you think this would
> require a bufferop (which gathers tuples); rather, I see the issue as gathering up all the
> elements (fields) within a tuple into a single object.

Well, the SplitPailDataStructure splits the pail based on the
information found in a "Data" object. One Data object represents one
piece of information - a property of a node or an edge. Thrift is just a
means of serializing that data object and describing the schema of that
data object.

But how can I compute several properties of a node and sink them into a
PailTap at once? I can read information related to a specific node type
and create a query with N data objects as output vars - representing N
property values. The idea to use a bufferop was to turn every record of
N data objects into N records of one data object each. Then these
transposed records could easily be sunk using PailTap with the
SplitPailDataStructure as implemented in the book. But to me that
doesn't feel right and I wonder if this is really what one would do.
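
To make the transposition idea concrete, here is a minimal sketch in plain
Python rather than Cascalog or Cascading. The Data class is a hypothetical
stand-in for the Thrift "Data" object, and transpose is an illustrative name,
not part of any library. It only shows the shape of the operation: each record
is expanded independently of all others, so this is a per-record flat-map that
needs no grouping.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Data:
    """Hypothetical stand-in for the Thrift "Data" object: one property of one node."""
    node_id: str
    prop: str
    value: object


def transpose(records: Iterable[Tuple[Data, ...]]) -> Iterator[Tuple[Data]]:
    """Turn each record of N Data objects into N records of one Data object each."""
    for record in records:
        for data in record:
            # One single-field record per property, which is the shape
            # a sink that only looks at the first element expects.
            yield (data,)


# Two "wide" records, each carrying two property values for one node.
wide = [
    (Data("n1", "name", "alice"), Data("n1", "age", 31)),
    (Data("n2", "name", "bob"), Data("n2", "age", 27)),
]
narrow = list(transpose(wide))
```

With this shape, a query producing N output vars would yield N single-object
records per input record, and the splitting by property would be left to the
sink.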

> I'm also confused about why you want to use Pail. The whole reason to use Pail is that it
> handles adding incremental data onto a data set really well, not regenerating the whole set
> anew (adding incrementally is basically what .absorb does). I feel like Pail is the opposite
> of what you want? Or maybe I have totally misunderstood the issue here?

My goal is to recompute the graph data model regularly and host it in a
well-known HDFS directory that other jobs can work on without
breaking. I could just delete all the contents of that directory and
have the jobs that compute the fresh data write into it. But that
would disturb jobs that are currently reading the old data. How would I
replace the old data with the newly recomputed data?
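
For illustration only, one general pattern for this kind of swap is to never
overwrite the old data at all: write each recomputation into a fresh version
directory and then atomically repoint a small pointer file at it. The sketch
below uses the local filesystem as a stand-in for HDFS. The names publish,
current_dir and the CURRENT pointer file are all hypothetical; none of this is
Pail or dfs-datastores API, and whether HDFS renames give the same guarantee
is an assumption that would need checking.

```python
import os
import tempfile


def publish(root: str, version: str, files: dict) -> None:
    """Write a new version directory, then atomically repoint CURRENT at it."""
    version_dir = os.path.join(root, version)
    os.makedirs(version_dir)
    for name, content in files.items():
        with open(os.path.join(version_dir, name), "w") as f:
            f.write(content)
    # Write the pointer to a temp file first, then rename it over the old
    # pointer, so readers see either the old version name or the new one.
    tmp = os.path.join(root, "CURRENT.tmp")
    with open(tmp, "w") as f:
        f.write(version)
    os.replace(tmp, os.path.join(root, "CURRENT"))


def current_dir(root: str) -> str:
    """Resolve the directory of the currently published version."""
    with open(os.path.join(root, "CURRENT")) as f:
        return os.path.join(root, f.read().strip())


root = tempfile.mkdtemp()
publish(root, "v1", {"part-0": "old"})
reader_dir = current_dir(root)  # a running job resolves its input once, up front
publish(root, "v2", {"part-0": "new"})  # recompute and publish while it runs
```

A job that resolved its input before the second publish keeps reading the
untouched v1 directory, while jobs started afterwards pick up v2. Old versions
would still have to be deleted at some point, once no job reads them any more.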
