Corc: An ORC File Scheme for Cascading


Dave Maughan

May 18, 2015, 9:21:03 AM
to cascadi...@googlegroups.com
Hotels.com is pleased to announce the open source release of Corc, a library for reading and writing files in the Apache ORC file format using Cascading.

Our intention is to provide a Scheme that allows developers to access the full range of unique features provided by ORC and fully realise the advantages of this high performance format. Currently Corc includes:
  • Full support of the rich ORC type system with the option to customise.
  • Column projection: read only those columns required by your application.
  • Predicate pushdown: skip stripes of data that do not contain pertinent values.
  • Seamlessly read the ACID data sets that back Hive's transactional tables.
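
To make the predicate pushdown idea concrete, here is a minimal conceptual sketch in plain Java. It is not Corc's or ORC's API: the `StripeSkipSketch` class, the `Stripe` type and the `stripesToRead` method are hypothetical names invented for illustration. It only assumes that each stripe carries min/max statistics for a column, so a reader can skip any stripe whose value range cannot satisfy the predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Conceptual sketch only; every name here is hypothetical and not part of Corc or ORC. */
public class StripeSkipSketch {

    /** A stripe described by assumed min/max statistics for a single numeric column. */
    static final class Stripe {
        final int id;
        final long min;
        final long max;

        Stripe(int id, long min, long max) {
            this.id = id;
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns the ids of stripes that might hold a value in [lo, hi].
     * A stripe is skipped when its [min, max] range does not overlap [lo, hi].
     */
    static List<Integer> stripesToRead(List<Stripe> stripes, long lo, long hi) {
        List<Integer> ids = new ArrayList<>();
        for (Stripe s : stripes) {
            boolean rangesOverlap = s.max >= lo && s.min <= hi;
            if (rangesOverlap) {
                ids.add(s.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = Arrays.asList(
                new Stripe(0, 1, 100),
                new Stripe(1, 101, 200),
                new Stripe(2, 201, 300));
        // Predicate: column value between 150 and 180, so only stripe 1 can match.
        System.out.println(stripesToRead(stripes, 150, 180)); // prints [1]
    }
}
```

Stripes that survive this check still have to be read and filtered row by row, since overlapping ranges only mean a match is possible, not guaranteed.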

We aim to closely follow future developments in the ORC file format and expose new features as they are released.

Corc is freely available on GitHub under the Apache 2.0 license.


- Dave

Koert Kuipers

May 18, 2015, 11:15:59 AM
to cascadi...@googlegroups.com
that is really great. we would love to be able to use ORC format in cascading, but are nervous about pulling in all these hive dependencies (hive is a bit of a kitchen sink). in practice do you treat hive as provided dependencies, or actually bundle them in your jar?

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/667a825b-19f0-43b5-b629-5fad8bcb943b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Elliot West

May 18, 2015, 11:19:13 AM
to cascadi...@googlegroups.com

Andre Kelpe

May 18, 2015, 11:35:39 AM
to cascadi...@googlegroups.com
That is excellent! Thanks for releasing it.

Are you using this with Cascading-Hive by any chance? If so, would you mind writing a little howto/example? We get quite a few questions about ORC in Cascading-Hive and if this works, we can make quite a few users happy.

- André


Andre Kelpe

May 18, 2015, 11:40:18 AM
to cascadi...@googlegroups.com
Now that ORC is a top-level project, I expect the dependency tree to become simpler. Let's hope that this is true.

- André



Elliot West

May 18, 2015, 12:02:58 PM
to cascadi...@googlegroups.com
Hi André,

We're not currently using Corc with cascading-hive as we're only reading/writing vanilla ORC at this time. However, it should 'just work' if you supply an OrcFile scheme when constructing a HiveTap. Corc's README hopefully covers the various construction options available. We're keen to use ORC+ACID, which will then require interaction with Hive's metastore DB. We're currently implementing this in a cascading-hive fork, and this includes a demo using Corc for the more involved ACID case.

We hope to have the pull request finalised soon.

Cheers - Elliot.

Dave Maughan

May 19, 2015, 7:19:56 AM
to cascadi...@googlegroups.com
Hi Koert,

I'll try to share our findings on which dependencies are needed when using Corc in an application.

If you have Hive >= 1.0.0 on your cluster, then if you ensure that hive-exec.jar is on your HADOOP_CLASSPATH it should just work.

If you have Hive < 1.0.0 then things get a bit more involved. You could just shade in hive-exec, but that is itself a shaded jar and you will massively bloat your own jar. What we do is bring in the following with scope compile:
  • org.apache.hive:hive-exec:1.0.0-core (the non-shaded version)
  • org.apache.hive:hive-serde:1.0.0
  • com.esotericsoftware.kryo:kryo:2.22
  • com.google.protobuf:protobuf-java:2.5.0

We also exclude various dependencies from hive-exec and hive-serde to keep it as slim as possible. We need to explicitly bring in kryo and protobuf-java because these are required for ORC.

We're looking forward to ORC being top-level and hope that brings simplification of the dependency tree!

Hope that helps,

- Dave

Koert Kuipers

May 20, 2015, 10:20:32 PM
to cascadi...@googlegroups.com
thanks dave that is helpful

Pushpender Garg

May 21, 2015, 3:43:47 AM
to cascadi...@googlegroups.com
Sorry for this naive question. I have not worked with cascading-hive, but my assumption is that it gets SerDes, storage formats, storage handlers, etc. from the Hive metastore and expects them to be available on the classpath. Internally it would make sense of the data the same way Hive does and bridge it into a Cascading Tap using Hive-to-Cascading data type mappings.
If this assumption is true, then cascading-hive would support any format that Hive supports out of the box, so it would already support any Hive table stored as ORC.
Isn't that how it works? I'm not sure where the gap in my understanding is.

Elliot West

May 21, 2015, 6:47:33 AM
to cascadi...@googlegroups.com
Hi,

It's not a naïve question, it's a reasonable expectation. Something similar has been attempted with HCatalog here:


I can't speak about the intentions behind the implementation specifics of cascading-hive ('-' not '.') but I can explain why we chose to implement an ORC specific Scheme. The first point to make is that ORC can be used outside of Hive, and in fact Apache are in the process of splitting it out into its own independent project. Similarly, Corc was conceived as a Scheme that enables reading/writing of ORC files independently of Hive. You may choose to use it with cascading-hive when your ORC data happens to underpin a Hive table but it's not a requirement. A second consideration is that ORC has a number of features that make it very attractive. To my knowledge these could not be easily exploited if we were to access the data via Hive. Column projection comes to mind.

So far I've been talking about vanilla ORC. In the case of ORC+ACID (currently used to underpin a Hive transactional table) there are cases when we might want to access some of the ACID metadata that Hive hides away. Corc allows you to retrieve this data.

Hope this helps.

Thanks - Elliot.
