Cascading 3.0.0-wip & HBase (RFC)


Cyrille Chépélov

May 12, 2015, 2:35:03 PM
to cascadi...@googlegroups.com
Hello, it's me breaking stuff again :-D

I'm now trying to get Cascading 3.0.0-wip-X to load data from an HBase table. The data is accumulated by an external, non-Scalding tool, with a non-trivial serialization scheme[*].

Environment:
  • cascading-3.0.0-wip-115
  • cascading-hbase-{hadoop ; hadoop2-mr1 ; hadoop2-tez} 3.0.0-wip-12
  • scalding-0.13.1 + PR1220
  • where applicable, tez-0.6.0+patches
  • Custom scheme (which, in a nutshell, (de)serializes the data model into case classes)

After the customary fight against incompatible Guava versions (HBase is built against Guava < 14.0, and HBaseTestingUtility makes use of Stopwatch#elapsedMillis()), here's the difficulty I'm hitting now:

  • The (pre-existing) local model & serialization library is currently built against HBase 1.0.0. It depends heavily on the hbase.client.Connection interface.
  • cascading-hbase-*-3.0.0-wip-12 is built against HBase 0.98.12, which:
    1. lacked hbase.client.Connection
    2. had different return types for most of the Scan#set(Whatever) accessors (0.98.12 returned void; 1.0.0 returns Scan, which is source-compatible going forward but breaks binary compatibility)
  • HBase 1.0.0 dropped hadoop-1 compatibility (hbase-compat-hadoop1 no longer exists), so simply building cascading-hbase against HBase 1.0.0 seems out of the question.
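The binary-compatibility trap in the second point can be sketched with a toy pair of classes (my own illustration, not HBase's actual Scan):

```scala
// Toy sketch: why changing a setter's return type from void to the builder
// type is source-compatible but binary-incompatible.

// "0.98-style" API: the setter returns Unit (void)
class ScanV098 {
  private var caching = 1
  def setCaching(n: Int): Unit = { caching = n }
  def getCaching: Int = caching
}

// "1.0-style" API: the same call sites still compile unchanged, but the JVM
// method descriptor changes from setCaching(I)V to setCaching(I)LScanV100; --
// bytecode compiled against the 0.98-style class fails at link time with
// NoSuchMethodError when run against this one.
class ScanV100 {
  private var caching = 1
  def setCaching(n: Int): ScanV100 = { caching = n; this }
  def getCaching: Int = caching
}
```

Recompiling against the new jar fixes it, which is exactly why mixing jars built against different HBase lines blows up at runtime rather than at compile time.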

So tonight, I'm kind of stuck:

  • cascading-hbase's code looks compatible with evicting hbase-server 0.98.12 in favour of hbase-client 1.0.0, but the JVM crushes my hopes.
  • I can't use a cascading-hbase locally upgraded to HBase 1.0.0, if only because Scalding's TestJob#runHadoop is hard-coded to hadoop1.
  • the model uses Connection all over the place, which didn't exist in 0.98.12.

I'm now planning to break out of this as follows:

  1. downgrade hbase-client in my code base to 0.98.12
  2. replace all uses of hbase.client.Connection with a new local.TableMaker class, filling in the missing methods (this new class will be dissolved once we move back up to HBase 1.0.0)
  3. avoid patching more stuff
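A minimal, dependency-free sketch of the TableMaker idea in step 2 (all HBase types are stubbed here; the real shim would delegate to 0.98's HConnection, and the method surface shown is my assumption of what the model needs):

```scala
// The subset of a table's surface the model code uses (stub for illustration)
trait Table {
  def getName: String
  def close(): Unit
}

// Stub standing in for a 0.98-era connection (e.g. HConnection)
class LegacyConnection {
  def getTable(tableName: String): Table = new Table {
    def getName: String = tableName
    def close(): Unit = ()
  }
}

// The shim: keeps call sites in the 1.0 Connection shape while delegating
// to the 0.98-era connection underneath. Dissolves once we're back on 1.0.
class TableMaker(underlying: LegacyConnection) {
  def getTable(name: String): Table = underlying.getTable(name)
  def close(): Unit = ()
}
```

The point is that only the construction site changes when upgrading: call sites written against TableMaker already look like 1.0's Connection.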

Will report as I progress...

    -- Cyrille


[*]
    case class Foo(abc: String, `def`: String)
    case class Bar(id: String, items: Seq[Foo])

    val bar = Bar("A123", Seq(Foo("aaa1", "bbb1"), Foo("aaa2", "bbb2"), Foo("aaa3", "bbb3")))
is serialized within a CF as (JSON-ish):
    { "id": "A123",
      "items[#]": 3,
      "items[0].abc": "aaa1", "items[0].def": "bbb1",
      "items[1].abc": "aaa2", "items[1].def": "bbb2",
      "items[2].abc": "aaa3", "items[2].def": "bbb3"
    }
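For the curious, the flattening above can be sketched in plain Scala (illustrative helper, not the real serialization library):

```scala
// Case classes from the footnote; `def` is a Scala keyword, hence the backticks
case class Foo(abc: String, `def`: String)
case class Bar(id: String, items: Seq[Foo])

// Flatten a Bar into qualifier -> value pairs within one column family,
// following the "items[#]" / "items[i].field" key scheme shown above.
def flatten(bar: Bar): Map[String, String] = {
  val header = Map("id" -> bar.id, "items[#]" -> bar.items.size.toString)
  val items = bar.items.zipWithIndex.flatMap { case (f, i) =>
    Seq(s"items[$i].abc" -> f.abc, s"items[$i].def" -> f.`def`)
  }
  header ++ items
}
```

Deserialization walks the same keys in reverse, reading "items[#]" first to know how many entries to rebuild.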

Chris K Wensel

May 12, 2015, 2:38:52 PM
to cascadi...@googlegroups.com
I'll let Andre chime in tomorrow; we had some high-level HBase + 3.0 discussions this morning that I think are relevant.

ckw

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/555247D0.1040604%40transparencyrights.com.
For more options, visit https://groups.google.com/d/optout.

Chris K Wensel
Andre Kelpe

May 13, 2015, 4:03:10 AM
to cascadi...@googlegroups.com
Hi!

Since we are in the 3.0 wip cycle, we could choose to break backwards compatibility and let go of everything before HBase 1.0. I think that is not a good idea at this point: we are still on the 0.98 release, since that is what ships with the main distributions and it still supports hadoop 1.x. For instance, the latest HDP ships HBase 0.98.x, CDH has _just_ shipped 1.0, EMR is still on 0.94, and MapR also ships 0.98.x. I don't know the demographics of HBase users, but sticking with 0.98 seems the safest bet at this point.

That being said, we can think about moving to HBase 1.x during the 3.1 or 3.5 release cycle. We would keep the hadoop 1.x sub-project on 0.98 forever and evolve the code in the hadoop2-mr1/hadoop2-tez sub-projects independently. That would mean a bit of project-structure reorganization, but we do that all the time.

Here is another thought: Would there be a way for you to use HBase 0.98 in production or is 1.x a strict requirement?

- André


Cyrille Chépélov

May 13, 2015, 4:06:51 AM
to cascadi...@googlegroups.com
Indeed, it's probably premature to drop HBase < 1.0.0 at this stage.

Especially as HBase 1.0.0 no longer ships hbase-compat-hadoop1, which may create another set of interesting problems (hi, Scalding!).

I'm currently trying to downgrade the code base to being an HBase 0.98 client; we'll see what happens today.

    -- Cyrille

Cyrille Chépélov

May 13, 2015, 6:13:45 AM
to cascadi...@googlegroups.com
Got it to work!

The feeder client seems to behave once built against HBase 0.98.12-hadoop2 (the server remains on 1.0.0). I could also successfully build the specific scheme to handle our local serialization system, overriding setSourceInitFields()/source()/sink().

Once I got past petty classloader issues (95% of the work, and still some (expletive) Guava collisions), this actually worked beautifully and required no further patches.

    -- Cyrille