I have Cassandra input support working ... I need to test it some more but the general functionality is there.
I have a branch called burton-cassandra-support that I have to merge into default first.
It's somewhat maintainable as I'm just using the stock Hadoop+Cassandra InputFormat and then decorating it to look like a Peregrine JobInput ...
Output should be easy after this point.
The general idea is that you just write a normal job, but instead of reading from a file you read from:
cassandra://localhost:9160/mykeyspace/graph
which is just a URI for building the config used in Cassandra.
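
To make the URI idea concrete, here is a minimal sketch of how such a URI could be split into config values. The class name CassandraUri and its fields are hypothetical (this is not the code in the branch), and I'm assuming the two path segments are the keyspace and the column family:

```java
import java.net.URI;

// Hypothetical helper, not part of Peregrine: splits a cassandra:// URI
// into the pieces needed to build the input config.
final class CassandraUri {
    final String host;
    final int port;
    final String keyspace;
    final String columnFamily; // assumption: second path segment is the column family

    private CassandraUri(String host, int port, String keyspace, String columnFamily) {
        this.host = host;
        this.port = port;
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
    }

    static CassandraUri parse(String text) {
        URI uri = URI.create(text);
        if (!"cassandra".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Not a cassandra:// URI: " + text);
        }
        if (uri.getHost() == null || uri.getPort() < 0) {
            throw new IllegalArgumentException("Host and port are required: " + text);
        }
        String path = uri.getPath(); // e.g. "/mykeyspace/graph"
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("Missing keyspace in: " + text);
        }
        // Drop the leading slash, then expect exactly two non-empty segments.
        String[] parts = path.substring(1).split("/");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Expected /keyspace/columnfamily in: " + text);
        }
        return new CassandraUri(uri.getHost(), uri.getPort(), parts[0], parts[1]);
    }
}
```

With the example URI above, parse() would yield host localhost, port 9160, keyspace mykeyspace, and column family graph.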
There is more work to be done of course:
- It would be nice to actually have Peregrine unit tests start up Cassandra, import data into it, then have Peregrine map over it...
- I haven't run any benchmarks.
- We're taking the key/value maps for Cassandra records and mapping them to a new interface, since EVERY 'record' in Cassandra is key/value based. This wastes a bit of CPU, but I can't think of an elegant way to avoid it without breaking a LOT of abstractions.
- We don't do any routing based on host name, so data is just read randomly. This is the FALLBACK case for when you're reading data from non-local machines, but we should do something more intelligent. Directly mapping by hostname is the first step, but understanding network topology is probably required eventually.
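
For the last item, a rough sketch of what "directly mapping by hostname" might look like: prefer a split that has a replica on the local machine, and only fall back to a random pick when none does. SplitRouter and Split are hypothetical names for illustration, not Peregrine or Hadoop classes; I'm assuming each split carries a list of replica host names, similar to what Hadoop's InputSplit reports via getLocations().

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not existing Peregrine code.
final class SplitRouter {

    // Minimal stand-in for an input split: an id plus the hosts holding its data.
    static final class Split {
        final String id;
        final List<String> hosts;

        Split(String id, List<String> hosts) {
            this.id = id;
            this.hosts = hosts;
        }
    }

    // Returns a split with a replica on localHost if one exists,
    // otherwise falls back to a random pending split (the current behavior).
    static Split pick(List<Split> pending, String localHost, Random random) {
        if (pending.isEmpty()) {
            throw new IllegalArgumentException("No pending splits");
        }
        for (Split split : pending) {
            if (split.hosts.contains(localHost)) {
                return split; // local read, no network hop
            }
        }
        return pending.get(random.nextInt(pending.size()));
    }
}
```

This only matches exact host names. Topology awareness (same rack, same datacenter) would need a distance function in place of the contains() check.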