Just merged Cassandra support into default...


burtonator

Feb 15, 2012, 1:19:09 AM
to peregrine...@googlegroups.com
This took me longer than I thought because our testing and continuous integration system needed some work.

I *thought* there was a bug but it turns out to be a performance issue with the machine that was performing the testing.

Anyway ... There is a TestCassandraJobs unit test that shows how to work with Cassandra.

It needs more work and TLC... specifically, it doesn't support all Cassandra options in the URI, but that won't be as hard to implement now that the core functionality is complete.

Running a basic job looks like:

            controller.map( Mapper.class,
                            new Input( "cassandra://localhost:9160/mykeyspace/graph" ),
                            new Output( "shuffle:default" ) );

            controller.reduce( Reducer.class,
                               new Input( "shuffle:default" ),
                               new Output( output ) );

I'm going to work on output next ... 

Also, I have been working on combiner support and I should be able to make progress again now that integration works and I can make sure that my builds are functional.


Roland Gude

Feb 16, 2012, 12:01:50 PM
to peregrine...@googlegroups.com
great i will check it out asap

Roland Gude

Feb 22, 2012, 7:24:16 AM
to peregrine...@googlegroups.com
unfortunately i cannot figure out how this works.

what do the mappers/reducers actually get in the structreaders/writers
given i have a row like this

"rowkeyABCD: colA='a', colB='B'"

what will the readers/writers contain?
will it be like this?
key="rowkeyABCD"
value="colA"+"a"+"colB" + "B"

Kevin Burton

Feb 22, 2012, 2:58:50 PM
to peregrine...@googlegroups.com
Yeah… 

It's a bit weird because every row in Cassandra is a key plus a set of column name/value pairs, so I had to bundle up the 'value' as a serialized map.

I refactored the key/value pair interface to be a SequenceReader…

so it's basically an interface that supports

hasNext()
next()
key()
value()

so the Cassandra values are StructSequenceReaders that you can read from as key value pairs.

I guess what I really should do is have a unit/integration test that first starts up a new Cassandra instance, populates it with new data, then runs a map and reduce over it.  I think this would do a good job of explaining how the system works.

Anyway.. in your map function you take the struct reader and then make a new StructSequenceReader out of it and then you can read the key/value pairs from that Cassandra row.
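
To make the reading pattern concrete, here is a minimal, self-contained sketch of the hasNext()/next()/key()/value() loop. It does not use the real Peregrine classes: `SequenceReaderSketch` and `RowColumnsReader` are hypothetical stand-ins that work on Strings, whereas the real StructSequenceReader works on StructReaders.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SequenceReaderDemo {

    // Hypothetical stand-in for the interface described above.
    interface SequenceReaderSketch {
        boolean hasNext();
        void next();
        String key();
        String value();
    }

    // In-memory reader over the columns of one row (illustration only).
    static final class RowColumnsReader implements SequenceReaderSketch {
        private final Iterator<Map.Entry<String, String>> it;
        private Map.Entry<String, String> current;

        RowColumnsReader(Map<String, String> columns) {
            this.it = columns.entrySet().iterator();
        }

        public boolean hasNext() { return it.hasNext(); }
        public void next() { current = it.next(); }
        public String key() { return current.getKey(); }
        public String value() { return current.getValue(); }
    }

    // Collects "name=value" for every column the reader yields.
    static List<String> readAll(SequenceReaderSketch reader) {
        List<String> out = new ArrayList<>();
        while (reader.hasNext()) {
            reader.next();
            out.add(reader.key() + "=" + reader.value());
        }
        return out;
    }

    public static void main(String[] args) {
        // The row from the question: rowkeyABCD with colA='a', colB='B'.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("colA", "a");
        row.put("colB", "B");
        System.out.println(readAll(new RowColumnsReader(row))); // [colA=a, colB=B]
    }
}
```

In this sketch the row key is not part of the sequence at all; only the columns are, which matches the "value is a serialized map" description.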

Kevin
--

Founder/CEO Spinn3r.com

Location: San Francisco, CA
Skype: burtonator

Skype-in: (415) 871-0687


Roland Gude

Feb 23, 2012, 4:45:29 AM
to peregrine...@googlegroups.com
ok i will try that

Roland Gude

Feb 24, 2012, 8:21:47 AM
to peregrine...@googlegroups.com
i can now successfully extract column names and values, but i cannot get the key properly. is it again just a hashcode? where can i get the complete rowkey?

Roland Gude

Feb 24, 2012, 10:50:27 AM
to peregrine...@googlegroups.com
i have created a patch as a workaround so i can move on

it wraps the original key and the original value into a new structreader which becomes the value.
it can be found on my patch queue if you are interested.
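
roughly, the idea looks like the sketch below. this is not the patch itself: it uses plain byte arrays instead of structreaders, and the length-prefixed layout and the names `pack`, `unpackKey` and `unpackValue` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class KeyInValueSketch {

    // Hypothetical layout: [4-byte key length][key bytes][value bytes].
    static byte[] pack(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(4 + key.length + value.length);
        buf.putInt(key.length).put(key).put(value);
        return buf.array();
    }

    // Recovers the complete, variable-length row key.
    static byte[] unpackKey(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        byte[] key = new byte[buf.getInt()];
        buf.get(key);
        return key;
    }

    // Recovers the original value (everything after the key).
    static byte[] unpackValue(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int keyLength = buf.getInt();
        buf.position(4 + keyLength);
        byte[] value = new byte[buf.remaining()];
        buf.get(value);
        return value;
    }

    public static void main(String[] args) {
        byte[] key = "rowkeyABCD".getBytes(StandardCharsets.UTF_8);
        byte[] value = "serialized-columns".getBytes(StandardCharsets.UTF_8);
        byte[] packed = pack(key, value);
        // The full row key survives even if the job key is only a hash.
        System.out.println(new String(unpackKey(packed), StandardCharsets.UTF_8));
    }
}
```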

burtonator

Feb 24, 2012, 3:29:06 PM
to peregrine...@googlegroups.com
ok... I'll take a look. I don't like the way we're doing key generation right now so this is good.

What I'm going to do is create a per Job directive so that we can determine when a key is ALREADY usable as a hash code or we need to run a hash function against it.

Kevin

burtonator

Feb 24, 2012, 6:40:46 PM
to peregrine...@googlegroups.com
 so... I'm going to back out the change where we hash code the key... 

so this:


            value = ssw.toStructReader();
            key = StructReaders.wrap( reader.getCurrentKey() );

            // now hashcode it so we are fixed width.  In the future we should
            // consider NOT doing this for performance reasons but this kept it
            // easy to implement
            key = StructReaders.hashcode( key.toByteArray() );

... will just become:

            key = StructReaders.wrap( reader.getCurrentKey() );
            value = ssw.toStructReader();

... however, this might break YOUR code if you have keys that are > 8 bytes and not hash codes.

If this will bite you I can implement variable length keys which we should probably have ... but I think this would take me a while.
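
To show why hashing made the row key unrecoverable, here is a small standalone sketch. It is not what StructReaders.hashcode() actually does: the choice of MD5 truncated to 8 bytes and the name `FixedWidthKeySketch` are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class FixedWidthKeySketch {

    // Fixed key width discussed in the thread.
    static final int KEY_WIDTH = 8;

    // Assumed hash: the first 8 bytes of an MD5 digest. The real
    // implementation may use a different function.
    static byte[] hashcode(byte[] rawKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(rawKey);
            return Arrays.copyOf(digest, KEY_WIDTH);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "rowkeyABCD".getBytes(StandardCharsets.UTF_8);
        byte[] hashed = hashcode(raw);
        // The hashed key is always 8 bytes, so the 10-byte row key
        // cannot be reconstructed from it.
        System.out.println("raw=" + raw.length + " bytes, hashed=" + hashed.length + " bytes");
    }
}
```

With the hashing step removed, the wrapped key keeps its original bytes, which is why keys longer than 8 bytes become a problem until variable length keys are supported.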

Roland Gude

Feb 25, 2012, 4:18:30 AM
to peregrine...@googlegroups.com
this is very important for our use case, so if i am able to help here, just tell me where in the code to look and i will start working on that as well

burtonator

Feb 25, 2012, 5:28:41 PM
to peregrine...@googlegroups.com
Just to clarify, which use case? Using keys which are variable length?

Roland Gude

Feb 27, 2012, 6:29:36 AM
to peregrine...@googlegroups.com
yes.

we have important information in the rowkeys and we need the complete keys. they are however of variable length

burtonator

Mar 5, 2012, 3:44:58 PM
to peregrine...@googlegroups.com
OK... I mostly have this implemented now... So if the key IS a hash you can map over it... however, if it isn't you may not get good key distribution.

I'm going to work on a job parameter to automatically encode the key if it isn't a hash code.
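
Here is a rough sketch of the distribution problem and of what such a parameter could do. Everything in it is an assumption for illustration: the routing rule (first 8 bytes of the key taken as an unsigned long, modulo the partition count), the `keyIsHash` flag, and the MD5-based encoding are not the actual implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class KeyRoutingSketch {

    // Hypothetical routing rule: treat the first 8 bytes of the key as an
    // unsigned long (zero-padded if the key is shorter).
    static int route(byte[] key, int partitions) {
        long prefix = ByteBuffer.wrap(Arrays.copyOf(key, 8)).getLong();
        return (int) Long.remainderUnsigned(prefix, partitions);
    }

    // Assumed encoding for keys that are not hash codes already.
    static byte[] hash8(byte[] key) {
        try {
            return Arrays.copyOf(MessageDigest.getInstance("MD5").digest(key), 8);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical per-job directive: only encode when the key is not a hash.
    static byte[] encodeIfNeeded(byte[] key, boolean keyIsHash) {
        return keyIsHash ? key : hash8(key);
    }

    // Counts how many partitions 100 keys with a shared prefix land on.
    static int partitionsUsed(boolean keyIsHash, int partitions) {
        Set<Integer> used = new HashSet<>();
        for (int i = 0; i < 100; i++) {
            byte[] raw = String.format(Locale.ROOT, "rowkey%04d", i)
                               .getBytes(StandardCharsets.UTF_8);
            used.add(route(encodeIfNeeded(raw, keyIsHash), partitions));
        }
        return used.size();
    }

    public static void main(String[] args) {
        // Raw keys all start with "rowkey00", so under this routing rule
        // they collapse onto a single partition.
        System.out.println("treated as hash: " + partitionsUsed(true, 4));
        // Encoded keys should spread across partitions.
        System.out.println("encoded first:   " + partitionsUsed(false, 4));
    }
}
```

Note that encoding the key for routing does not have to replace the original key; the full row key can still travel alongside it.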

Kevin