OK, you really lost me in the details, but an example of an extract/load driver would be great.
Furthermore, I would appreciate a little conceptual insight into this driver mechanism as you intend it (although Javadocs etc. for the example might be enough).
If it seems suitable for our needs, we might dedicate some developer time to Peregrine.
Currently our clusters are rather small, so optimizing for small systems would be a good thing.
We are working on a recommender system and are researching the benefits and pitfalls of map/reduce and different map/reduce frameworks for us.
Hadoop (and Mahout), for instance, is pretty close to what we need but way too complex to set up, manage and maintain, while being optimized for really large compute clusters.
A first go would be to iterate over the complete local data and map the entries to the correct subsets. This should already provide a serious computation speedup and eliminate several bottlenecks.
From there we would work our way up towards full-fledged map/reduce logic.
So key requirements for us are
- cassandra integration (best with data locality)
- ease of operations (setup/maintain map/reduce cluster)
- elimination of single points of failure
Hadoop has a Cassandra integration with data locality but cannot deliver on the other requirements.
cloud-map-reduce has no single point of failure but lacks Cassandra integration.
On Thu, Jan 5, 2012 at 5:27 AM, Roland Gude <r...@ndgu.de> wrote:
> Currently our clusters are rather small, so optimizing for small systems would be a good thing.

How many machines? 'small' is relative if you're used to 4000 node clusters :-P

> We are working on a recommender system and are researching the benefits and pitfalls of map/reduce and different map/reduce frameworks for us.
> Hadoop (and Mahout), for instance, is pretty close to what we need but way too complex to set up, manage and maintain, while being optimized for really large compute clusters.

Yes… and Hadoop is missing a number of optimizations, both for smaller installations and for optimum per-node performance (mmap, fallocate, etc.)
Also, it might be worth noting that DataStax has a version of Cassandra in DataStax Enterprise that embeds the Hadoop runtime into the Cassandra JVM. It can then take advantage of data locality, divvying up the jobs appropriately so input data need not traverse the network.
When using the Thrift client directly and the "out-of-the-box" Hadoop Support in Cassandra, I don't believe you get the data locality, which means you could be (and most likely are) slurping data over the network. (could be expensive)
Of course this only matters if you are co-locating the analytics processes (Hadoop) with the data (Cassandra).
OK, this looks very promising.
Meanwhile I was able to run some examples and port one simple algorithm of ours to Peregrine (with random data).
using one node it already works but i still have issues when using more nodes (shuffle input not available). should keys always be hashcodes to make sure that every node gets part of the data?
btw. i was wondering why the partitioner does not apply the "hashcode" stuff to the key (like randompartitioner in cassandra) so that a key could still contain information for the computation.
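To make the question concrete, here is a minimal sketch of what "applying the hashcode stuff inside the partitioner" could look like. It is not Peregrine's actual partitioner; the class name `HashPartitionSketch` and the method `partitionFor` are made up for illustration. It hashes the key bytes with MD5 (the same digest Cassandra's RandomPartitioner is based on) and uses the hash only for routing, so the key itself can still carry information for the computation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical sketch, not Peregrine's real partitioner. */
public class HashPartitionSketch {

    /**
     * Maps an arbitrary key to a partition in [0, numPartitions).
     * Only the routing decision uses the hash; the key is left untouched.
     */
    static int partitionFor(byte[] key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key);
            // Use the first four digest bytes as a signed int.
            int hash = ByteBuffer.wrap(digest).getInt();
            // floorMod keeps the result non-negative for negative hashes.
            return Math.floorMod(hash, numPartitions);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        int partitions = 4;
        int[] counts = new int[partitions];
        for (int i = 0; i < 1000; i++) {
            byte[] key = ("item-" + i).getBytes(StandardCharsets.UTF_8);
            counts[partitionFor(key, partitions)]++;
        }
        // Prints how many of the 1000 keys landed on each partition.
        System.out.println(Arrays.toString(counts));
    }
}
```

With something like this, keys such as "item-17" would not have to be hash codes themselves in order to spread over the nodes.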
wow great, i'll pull the changes and try it out first thing on monday.
The algorithm I have ported, a simple popularity count, needs a last reduce step where all data goes to the same reducer (i.e. the key is 0). Actually I am not sure this is required, but I could not think of another way: I just need to sort the output from the other reducers. That is probably possible in parallel, but how?
On my last test this led to exceptions on some nodes (I think it was "shuffle data does not exist" and sometimes NullPointerExceptions), while producing the correct output on the responsible node. Still, the job output led me to believe my job had failed.
I will reproduce the exceptions and retry with the new version; maybe this is already resolved.
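For reference, the popularity count described above boils down to the following plain-Java sketch. It runs in memory and does not use the Peregrine API; `PopularityCountSketch`, `countByItem` and `sortByPopularity` are names made up here. The first method plays the role of the per-item reduce, the second the final step where everything arrives at one reducer and gets sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sketch of the popularity count, not Peregrine code. */
public class PopularityCountSketch {

    /** First step: sum up one count per occurrence of each item. */
    static Map<String, Integer> countByItem(List<String> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : events) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    /** Final step: all counts in one place, sorted by count descending, ties by item name. */
    static List<Map.Entry<String, Integer>> sortByPopularity(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "b", "a", "c", "a", "b");
        for (Map.Entry<String, Integer> e : sortByPopularity(countByItem(events))) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

In the distributed version the counting step spreads over all reducers; only the final sort is funnelled to a single one, which is the part in question.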
The class you are referring to is exactly the one i meant when i talked about cassandra/hadoop integration.
i think forking that one is probably a good idea
I don't think that a Turing-complete language is needed for the filtering. You just need a way to provide the desired key range and the desired slice, or the slice for a given row key, so basically just one interface, a bit like this:
public boolean isKeyInRange(ByteBuffer key);
public SliceRange getSliceRange(ByteBuffer key);
or
public boolean isNameInSliceRange(ByteBuffer key, ByteBuffer name);
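As a rough illustration of how small such a filter could stay, here is a sketch of the second variant (`isKeyInRange` plus `isNameInSliceRange`), chosen so the example does not depend on Cassandra's Thrift `SliceRange` class. The names `RowFilterSketch`, `RowFilter`, `RangeFilter` and `bytes` are hypothetical, and the comparison uses `ByteBuffer.compareTo`, which compares bytes as signed values; a real integration would presumably use the column family's own comparator instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the proposed filter interface; not existing Peregrine or Cassandra API. */
public class RowFilterSketch {

    interface RowFilter {
        boolean isKeyInRange(ByteBuffer key);
        boolean isNameInSliceRange(ByteBuffer key, ByteBuffer name);
    }

    /** Accepts row keys in [startKey, endKey) and column names in [startName, endName]. */
    static final class RangeFilter implements RowFilter {
        private final ByteBuffer startKey, endKey, startName, endName;

        RangeFilter(ByteBuffer startKey, ByteBuffer endKey,
                    ByteBuffer startName, ByteBuffer endName) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.startName = startName;
            this.endName = endName;
        }

        @Override
        public boolean isKeyInRange(ByteBuffer key) {
            // compareTo is lexicographic over the remaining bytes (signed comparison).
            return key.compareTo(startKey) >= 0 && key.compareTo(endKey) < 0;
        }

        @Override
        public boolean isNameInSliceRange(ByteBuffer key, ByteBuffer name) {
            return isKeyInRange(key)
                    && name.compareTo(startName) >= 0
                    && name.compareTo(endName) <= 0;
        }
    }

    /** Helper for the example: wraps a string's UTF-8 bytes. */
    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        RowFilter f = new RangeFilter(bytes("user:a"), bytes("user:n"),
                                      bytes("rating:"), bytes("rating:~"));
        System.out.println(f.isKeyInRange(bytes("user:bob")));   // true
        System.out.println(f.isKeyInRange(bytes("user:zoe")));   // false
        System.out.println(f.isNameInSliceRange(bytes("user:bob"), bytes("rating:item7"))); // true
    }
}
```

The row keys and column names in `main` are invented sample data.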
OK, the irritating errors are gone with the changes,
but I still get NullPointerExceptions if I do a reduce where the input is a shuffle and the output is another shuffle, i.e. new ReducerX(new Input("shuffle:step1"), new Output("shuffle:step2")) gives me an NPE.
The same code works if the output is a file.
Wow... i did not know i was tripping over a conceptual error.
I think I do not yet understand the concept of shuffle (I actually have no thorough knowledge of the whole map and reduce thingy, but I am picking it up).
but basically what i get now is that shuffle means to divide the data between the nodes (a bit like MPI scatter) while writing to a file means "stay on the node where you are".