Matt,
Sure, your point is valid when moving data from one completely separate distributed system to another. It is not optimal when I am using Couchbase as a distributed cache on the same nodes as my Hadoop cluster. In fact, one of Hadoop's main strengths is its ability to track data locality and pass that information down to the map/reduce layer so that mappers can be scheduled on the nodes closest to the data. The network is far slower than memory: if I can have a mapper on each node pulling data from its local Couchbase TAP instead of hitting the network at all, then I'm in a much better position.
I'd say the same for Couchbase proper: if I could hash the data my own way so that I control which node it ends up on, I'd be in a better position with my use of the distributed cache. I want to do stream processing but give my users the ability to query the cache using Elasticsearch. From what I've seen in the Couchbase Java Client, I can implement an interface to determine which vbucket a key should end up in, but I'd have to recompile the client in order to use my specialized hashing function. I don't mind doing this, but again I have no way to find out which node will host that vbucket. Your hashing solution works when I always want to guarantee an even distribution, but I don't always want that (or maybe I know better, based on my domain's use cases, what a useful even distribution looks like than Couchbase does with its auto-sharding).
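To make the hashing idea concrete, here is a rough sketch of the kind of function I have in mind. It is plain Java and does not touch the Couchbase client at all: the class name, the ':' key convention, and the exact CRC32 bit-twiddling are my own assumptions for illustration, not the client's actual interface. The point is only that keys sharing a domain prefix land in the same vbucket.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: none of these names come from the Couchbase client.
final class PrefixVBucketHash {
    private PrefixVBucketHash() {}

    // Returns the part of the key before the first ':' (or the whole key if there is none).
    static String localityPrefix(String key) {
        int i = key.indexOf(':');
        return i < 0 ? key : key.substring(0, i);
    }

    // Maps a key to a vbucket id in [0, numVBuckets), hashing only the prefix
    // so that related keys share a vbucket. numVBuckets must be a power of two.
    static int vbucketFor(String key, int numVBuckets) {
        if (numVBuckets <= 0 || Integer.bitCount(numVBuckets) != 1) {
            throw new IllegalArgumentException("numVBuckets must be a power of two");
        }
        CRC32 crc = new CRC32();
        crc.update(localityPrefix(key).getBytes(StandardCharsets.UTF_8));
        // Assumed digest shape: take 15 bits from the upper half of the CRC.
        int digest = (int) ((crc.getValue() >> 16) & 0x7fff);
        return digest & (numVBuckets - 1);
    }

    public static void main(String[] args) {
        // Both keys share the "user42" prefix, so they map to the same vbucket.
        System.out.println(vbucketFor("user42:profile", 1024));
        System.out.println(vbucketFor("user42:orders", 1024));
    }
}
```

Even with something like this plugged in, I still only control the vbucket, not the node, which is the missing piece.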
In my environment, I'm using Couchbase as a mutability layer on top of Hadoop because my data can change quite frequently for a period of time until it's considered immutable and I can vet it into Accumulo via a map/reduce job. For this use case, the Sqoop plugin just adds an extra step: writing a file to HDFS and then running map/reduce over that file to put the data somewhere else. It also adds storage overhead. I pulled the CouchbaseInputFormat out of the Sqoop plugin's GitHub project. I don't know why the version of the Sqoop plugin that works with CDH3 uses the Membase client to perform the TAP, but I could not get that to work with Couchbase 2.x. I changed it to use CouchbaseClient instead of the Membase client and it works fine. I've now got an InputFormat that correctly pushes the data directly to Accumulo, but it goes entirely over the network. It would definitely benefit from locality and from not wasting precious network bandwidth. I'm not an Erlang developer, so I don't think pointing me directly to an Erlang function would be useful to me, though I know from past experience with Couchbase that some of the functions have been exposed via a remote "eval" function (or maybe Erlang does this automatically?). Is it possible to use that eval to ask Couchbase which nodes are hosting a given vbucket? It's a call I'd only need to make once, during initialization of the InputFormat.
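For context, this is roughly what I'd do with the answer once I have it. The sketch below assumes I can somehow obtain a list of server host names plus, for each vbucket, the index of the server holding its active copy; both inputs and all names here are hypothetical, since getting them is exactly what I'm asking about. It groups vbuckets by host so the InputFormat could create one split per node and report that host from the split's getLocations().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: the inputs stand in for whatever Couchbase can tell me.
final class VBucketLocality {
    private VBucketLocality() {}

    // servers: host names of the cluster nodes.
    // activeServerIndex[v]: index into servers of the node holding the active
    // copy of vbucket v, or a negative value if no active copy is known.
    static Map<String, List<Integer>> groupByNode(List<String> servers, int[] activeServerIndex) {
        Map<String, List<Integer>> byNode = new TreeMap<>();
        for (int v = 0; v < activeServerIndex.length; v++) {
            int idx = activeServerIndex[v];
            if (idx < 0 || idx >= servers.size()) {
                continue; // no usable owner for this vbucket, so no locality hint
            }
            byNode.computeIfAbsent(servers.get(idx), h -> new ArrayList<>()).add(v);
        }
        return byNode;
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("node-a", "node-b");
        int[] owners = {0, 1, 0, 1};
        // Each entry would become one input split, scheduled on that host.
        groupByNode(servers, owners).forEach((host, vbuckets) ->
                System.out.println(host + " -> " + vbuckets));
    }
}
```

Each mapper would then open a TAP only against the vbuckets listed for its own host, which is what would keep the traffic off the network.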
Thanks again Matt!