Andy Xue <and...@lumoslabs.com> writes:
> Stefan -- I totally sympathize with 1)
>
> Restricting it to work only on the 1st element seems to force the data to be in a thrift
> object that is the first (and only) element in the record, with all info/properties in the
> record encompassed within that thrift object. Not sure why you think this would require a
> bufferop (which gathers tuples); rather, I see the issue as gathering up all the elements
> (fields) within a tuple into a single object
Well, the SplitPailDataStructure splits the pail based on the
information found in a "Data" object. One Data object represents one
piece of information - a property of a node or an edge. Thrift is just a
means of serializing that data object and describing the schema of that
data object.
But how can I compute several properties of a node and sink them into a
PailTap at once? I can read information related to a specific node type
and create a query with N data objects as output vars - representing N
property values. The idea behind using a bufferop was to turn every
record of N data objects into N records of one data object each. These
transposed records could then easily be written to a PailTap with the
SplitPailDataStructure as implemented in the book. But to me that
doesn't feel right, and I wonder if this is really what one would do.
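
To make the transposition concrete, here is a minimal sketch in plain
Python rather than Cascalog. The names Data and transpose are
hypothetical stand-ins for illustration only; they are not the Thrift
class or the code from the book. Each input record is handled on its
own, so this is a per-record flat-map and needs no grouping, which
suggests a mapcat-style operation might fit better than a buffer.

```python
from typing import Iterable, Iterator, NamedTuple


class Data(NamedTuple):
    """Hypothetical stand-in for the Thrift Data object: one property of one node."""
    node_id: str
    prop: str
    value: object


def transpose(records: Iterable[tuple]) -> Iterator[Data]:
    """Turn each record of N Data objects into N records of one Data object."""
    for record in records:
        # Every field of the record is already a complete Data object,
        # so it can be emitted as its own single-field record.
        for data in record:
            yield data


# Two query results, each carrying two properties of one node.
rows = [
    (Data("n1", "name", "alice"), Data("n1", "age", 30)),
    (Data("n2", "name", "bob"), Data("n2", "age", 41)),
]
flat = list(transpose(rows))
```

After this step every element of flat is a single Data object, which is
the shape a sink that splits on one Data object per record expects.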
> Also confused why you want to use Pail. The whole reason to use Pail is that it handles
> adding incremental data onto a data set really well, not re-generating the whole set anew
> (this is basically what .absorb does). I feel like Pail is the opposite of what you want? Or
> maybe I have totally misunderstood the issue here?
My goal is to recompute the graph data model regularly and host it in a
well-known HDFS directory where other jobs can work on it without
breaking. I could just delete all the contents of that directory and
have the jobs that compute the fresh data write it there. But that
would disturb jobs that are currently reading the old data. How would I
replace the old data with the newly recomputed data?