Automatic Hive Partitions

Andrew Otto

Oct 21, 2013, 3:34:47 PM
to camu...@googlegroups.com, Dan Andreescu
Hi all,

Over at Wikimedia, we're about to start using Camus + Kafka in production.  We'll be creating external Hive tables on top of the time-bucketed directory hierarchies that Camus automatically creates for us.  We want these external tables to be automatically partitioned.

Felix over at Mate1 has a really nice camus2hive[1] shell script.  I toyed with using it, but wanted something that supported more options (--auxpath for Hive SerDes, --database, --tables, etc.).  I also wanted to keep this Camus-agnostic :p.  We have other ways of importing data into HDFS, and there was no reason why Camus had to be associated with the logic that creates Hive partitions.

We came up with this:

(It uses the util.py file there as well).

We think we should probably pull this code out into a standalone repository, but we're not sure what to name it.  The fact that util.py is its own file is a bit annoying, since you can't just download a single script and run it.

Anybody got any good naming ideas?  

-Andrew Otto


P.S.  Oh yeah, it depends on Python docopt: http://docopt.org/.  If you are using Ubuntu/Debian, you can get a backport for precise here: http://apt.wikimedia.org/wikimedia/pool/universe/d/docopt/
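
In case it helps to picture the interface: below is a rough, hypothetical docopt sketch of how a script like this might declare and dispatch those options.  The option names --database, --tables, and --auxpath are the ones mentioned above; the script name, usage pattern, and hive() helper are made up for illustration and are not the actual kraken-etl code.

"""Add Hive partitions for time-bucketed topic directories in HDFS.

Usage:
  partitioner.py [--database=<db>] [--tables=<t1,t2>] [--auxpath=<jars>] <hdfs-base-path>

Options:
  --database=<db>   Hive database to use [default: default]
  --tables=<t1,t2>  Comma-separated list of tables/topics to restrict to.
  --auxpath=<jars>  Extra jars (e.g. Hive SerDes) passed through to the hive CLI.
"""
import subprocess
from docopt import docopt


def hive(statement, database='default', auxpath=None):
    """Run a HiveQL statement via the hive CLI and return its output."""
    cmd = ['hive']
    if auxpath:
        cmd += ['--auxpath', auxpath]
    cmd += ['--database', database, '-e', statement]
    return subprocess.check_output(cmd).decode()


if __name__ == '__main__':
    args = docopt(__doc__)
    # The real script would walk <hdfs-base-path>, find hourly directories that
    # are not yet Hive partitions, and issue ALTER TABLE ... ADD PARTITION
    # statements for them via hive() above.
    print(args)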


Roger Hoover

Mar 12, 2014, 2:32:00 PM
to camu...@googlegroups.com, Dan Andreescu, aco...@gmail.com
Andrew,

This script looks like it could be useful.  I don't see a license on it though.  What are the terms of using it?

Thanks,

Roger

Dan Andreescu

Mar 12, 2014, 2:42:48 PM
to Roger Hoover, camu...@googlegroups.com, Andrew Otto
We should probably make it explicit and pick a venerable license, but it's WTFPL for now as far as I'm concerned.

Roger Hoover

Mar 12, 2014, 2:53:01 PM
to Dan Andreescu, camu...@googlegroups.com, Andrew Otto
Thanks, Dan.

Is this the one? http://www.wtfpl.net/

Do you mind adding it to the files so that we can freely use it?

Thanks,

Roger

Roger Hoover

Mar 12, 2014, 2:56:30 PM
to Dan Andreescu, camu...@googlegroups.com, Andrew Otto

aco...@gmail.com

Mar 18, 2014, 12:18:25 PM
to camu...@googlegroups.com, Dan Andreescu, Andrew Otto
Done! I added an Apache-2 License.

Roger Hoover

Mar 18, 2014, 12:35:59 PM
to aco...@gmail.com, camu...@googlegroups.com, Dan Andreescu
Thank you!

Sent from my iPhone

> On Mar 18, 2014, at 9:18 AM, aco...@gmail.com wrote:
>
> Done! I added an Apache-2 License.

Andrew Otto

Mar 20, 2014, 5:17:36 PM
to Roger Hoover, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Hi Roger,

After I committed this, I was informed by a coworker that the POM files in that project specify GPL-2, which is incompatible with Apache-2. We’ve updated the licenses on the kraken-etl .py files to GPL-2 accordingly. I hope that’s ok!

-Ao

Roger Hoover

Mar 20, 2014, 5:42:27 PM
to Andrew Otto, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Uggg.  :(  I don't think that'll work for us.  We could never distribute any code that uses it without making it GPL as well.

It would be great (for me at least) if you could split these utility files from the main kraken project and license them under Apache.

Andrew Otto

Mar 20, 2014, 6:29:11 PM
to Roger Hoover, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Ah rats, really?

I am not a license pro at all here.  Christian, what would we do?

Roger Hoover

Mar 20, 2014, 7:42:58 PM
to Andrew Otto, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
My reading of the two statements below is that if you use a GPL Python package in your application, the entire application must be GPL if you ever distribute it.


You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.


If a library is released under the GPL (not the LGPL), does that mean that any program which uses it has to be under the GPL?

Yes, because the program as it is actually run includes the library.

Christian Aistleitner

Mar 21, 2014, 6:26:39 AM
to Andrew Otto, Roger Hoover, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Hi,

On Thu, Mar 20, 2014 at 06:29:11PM -0400, Andrew Otto wrote:
> Ah rats, really?

Yes. GPLv2 is quite copylefty :-)

> Christian, what would we do?

Convince the world that GPL is the way to go!
But no luck with companies or in a Java environment :-(

IANAL, but for the files that have been passed around under Apache
License 2.0, there should not be a problem. WMF allows us to release
as Apache License 2.0. WMF allows us to dual license.
The files have been written by Dan and Andrew (if git-foo didn't fail
me). If both of you are fine with dual licensing them under Apache
License 2.0 in addition to kraken's GPL, Roger can continue to use the
files that he received under Apache License 2.0 without problems.

But for the files that we have in our kraken repo, the situation
is a bit different. From my point of view, we can

* Keep everything GPL
GPL is good™.
It really is.

* Make kraken-etl Apache License 2.0
(either by splitting it out into a separate repository, or by using
aggregate constructs)
(Needs only a bit of work on our side)

* Make kraken Apache License 2.0 as a whole
(Needs a bit more work on our side. But we ran into the need to get
Apache License 2.0 code into kraken before. So I guess sooner or
later, we'd have to do it anyways. The sooner, the better.)


Being a Copylefty, I'd vote for the first option. But given that up to
now not too many people have contributed to kraken, and anything non-Apache
License 2.0 is a problem around Java, the last option seems most
viable and most friendly to me.


Best regards,
Christian



--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a Email: chri...@quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------

Roger Hoover

Mar 25, 2014, 2:30:01 PM
to Christian Aistleitner, Andrew Otto, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Do you guys think options 2 or 3 will happen?  I need to decide if I can use the code or not.

Thanks,

Roger

Andrew Otto

Mar 25, 2014, 2:45:06 PM
to Roger Hoover, Christian Aistleitner, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Ah yes, sorry, I think we can switch the whole repo to Apache-2.  I have to run but will try to do this tomorrow, if someone doesn’t get to it before me.

-Andrew

Christian Aistleitner

Mar 25, 2014, 3:22:50 PM
to Roger Hoover, Andrew Otto, camu...@googlegroups.com, Dan Andreescu
Hi Roger,

On Tue, Mar 25, 2014 at 11:30:01AM -0700, Roger Hoover wrote:
> Do you guys think options 2 or 3 will happen? I need to decide if I can
> use the code or not.

IANAL. If I understood Andrew correctly, he gave you the code under
the Apache License 2.0. WMF gives us permission to do so, so I guess
it's safe to use that code under the Apache License 2.0, regardless of
how the code continues in our repos.

About turning the relevant parts of our kraken repository into Apache
license:
In private emails, our team pretty strongly said that we should take
it to Apache license 2.0. And I did not hear an opposing voice.

So, yes. I really /think/ (no guarantee) and hope this will
happen. But we're not the fastest, and I guess it'll take some
discussions with our legal team. So give it some time. I filed a bug
for it in our bugzilla:

https://bugzilla.wikimedia.org/show_bug.cgi?id=63084

Have fun,

Roger Hoover

Mar 25, 2014, 5:06:16 PM
to Christian Aistleitner, Andrew Otto, camu...@googlegroups.com, Dan Andreescu
Thank you, Andrew and Christian.  

Zhu Wayne

Apr 4, 2014, 1:19:13 PM
to camu...@googlegroups.com, Christian Aistleitner, Andrew Otto, Dan Andreescu
Thanks, guys, for sharing the camus2hive tool.
However, Hive seems to have issues reading data in partitions. It works without partitions.

Here is the JIRA:
https://issues.apache.org/jira/browse/HIVE-5820

CREATE EXTERNAL TABLE avro_price_external
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.url'='hdfs:///user/me/camus/camus-avsc/PriceSchema.avsc'
);
hive> ALTER TABLE avro_price_external ADD IF NOT EXISTS PARTITION (year=2014, month=03, day=31, hour=19) LOCATION '/user/me/camus/dest/pricesingle/hourly/2014/03/31/19';
hive> select * from avro_price_external;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.avro.BadSchemaException
Time taken: 0.301 seconds


Félix GV

Apr 4, 2014, 2:21:58 PM
to Zhu Wayne, camu...@googlegroups.com, Christian Aistleitner, Andrew Otto, Dan Andreescu
Hi Zhu,

I have tested the camus2hive script only on CDH 4.2 and 4.5, and it worked on both.  Someone in the JIRA ticket you linked said they were unable to reproduce the issue on the latest trunk.  Since the ticket was opened against CDH 4.3, and you are also using 4.3, perhaps the issue was introduced in that version and then fixed later on?

This mailing list thread is also consistent with the theory that this bug was introduced in CDH 4.3: https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/PEUiS5AcYlo

If you can upgrade to a newer CDH 4.x release, that would probably be helpful.

Regards,

--
Félix


Andrew Otto

Apr 4, 2014, 2:45:09 PM
to Zhu Wayne, camu...@googlegroups.com, Christian Aistleitner, Andrew Otto, Dan Andreescu
Also, Zhu's issue doesn't look like a problem with hive-partitioner, but rather with Hive + the Avro SerDe.

Zhu Wayne

Apr 4, 2014, 2:59:10 PM
to Andrew Otto, Christian Aistleitner, camu...@googlegroups.com, Dan Andreescu, Andrew Otto

Thanks, all, for your responses. I will upgrade to the latest CDH 4.

Also, one suggestion: could we add a function to camus2hive to pick up a schema from an HDFS location, or from HttpFS?  Thanks.
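
For what it's worth, one way a script could "pick up" a schema from HDFS without any new Hive feature is to cat the .avsc out of HDFS and attach it to the table as avro.schema.literal instead of avro.schema.url.  The snippet below is only a hypothetical illustration of that idea, not a camus2hive patch; the table name and schema path are the ones from my earlier example.

import subprocess


def set_avro_schema_from_hdfs(table, schema_path):
    """Read an Avro schema (.avsc) out of HDFS and attach it to a Hive table
    as avro.schema.literal rather than pointing avro.schema.url at it."""
    schema = subprocess.check_output(['hdfs', 'dfs', '-cat', schema_path]).decode().strip()
    # Avro schemas are JSON and normally contain no single quotes, but escape
    # them just in case so the HiveQL string literal stays valid.
    statement = ("ALTER TABLE {0} SET TBLPROPERTIES "
                 "('avro.schema.literal'='{1}');").format(table, schema.replace("'", "\\'"))
    subprocess.check_call(['hive', '-e', statement])


set_avro_schema_from_hdfs('avro_price_external',
                          '/user/me/camus/camus-avsc/PriceSchema.avsc')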

Félix GV

Apr 4, 2014, 3:03:31 PM
to Zhu Wayne, Andrew Otto, Christian Aistleitner, camu...@googlegroups.com, Dan Andreescu, Andrew Otto
Pull requests accepted ;)

--
Félix


Andrew Otto

Apr 4, 2014, 4:03:11 PM
to Zhu Wayne, Christian Aistleitner, camu...@googlegroups.com, Dan Andreescu, Andrew Otto
Zhu, I don’t know about camus2hive, but hive-partitioner doesn’t deal with Hive schemas at all.  It examines a list of tables / topic directories for a partition scheme, compares the partitions to what is in HDFS, and then issues ALTER TABLE _ ADD PARTITION statements for any missing partitions.

If you are using hive-partitioner, you should create your tables before you attempt to use hive-partitioner to automatically add partitions.
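
To make that concrete, here is a rough, hypothetical Python sketch of that flow (not the actual hive-partitioner code): list the hourly leaf directories under a topic's base path in HDFS, compare them against SHOW PARTITIONS output, and issue ALTER TABLE ... ADD IF NOT EXISTS PARTITION for anything missing.  The hourly/YYYY/MM/DD/HH layout, partition columns, and example table/path are taken from Zhu's message earlier in the thread.

import re
import subprocess


def hive(statement):
    """Run a HiveQL statement via the hive CLI and return its stdout."""
    return subprocess.check_output(['hive', '-e', statement]).decode()


def existing_partitions(table):
    """Partitions Hive already knows about, as specs like 'year=2014/month=3/day=31/hour=19'."""
    out = hive('SHOW PARTITIONS {0};'.format(table))
    return set(line.strip() for line in out.splitlines() if '=' in line)


def hdfs_hourly_dirs(base_path):
    """Map partition spec -> HDFS path for every base_path/YYYY/MM/DD/HH directory."""
    listing = subprocess.check_output(['hdfs', 'dfs', '-ls', '-R', base_path]).decode()
    leaf = re.compile(r'(%s/(\d{4})/(\d{2})/(\d{2})/(\d{2}))\s*$' % re.escape(base_path))
    found = {}
    for line in listing.splitlines():
        m = leaf.search(line)
        if m:
            path, y, mo, d, h = m.groups()
            # int() drops leading zeros so the spec matches how Hive names
            # partitions whose columns are typed int (e.g. month=3, not month=03).
            spec = 'year={0}/month={1}/day={2}/hour={3}'.format(int(y), int(mo), int(d), int(h))
            found[spec] = path
    return found


def add_missing_partitions(table, base_path):
    """Issue ALTER TABLE ... ADD PARTITION for hourly directories Hive does not know about yet."""
    have = existing_partitions(table)
    for spec, path in sorted(hdfs_hourly_dirs(base_path).items()):
        if spec not in have:
            y, mo, d, h = re.findall(r'=(\d+)', spec)
            hive("ALTER TABLE {0} ADD IF NOT EXISTS PARTITION "
                 "(year={1}, month={2}, day={3}, hour={4}) LOCATION '{5}';"
                 .format(table, y, mo, d, h, path))


if __name__ == '__main__':
    # Hypothetical invocation, reusing the table and path from Zhu's example.
    add_missing_partitions('avro_price_external', '/user/me/camus/dest/pricesingle/hourly')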

Andrew Otto

Apr 11, 2014, 1:07:33 PM
to Roger Hoover, Christian Aistleitner, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Ok great!  The kraken repository has been switched to Apache-2, including kraken-etl and hive-partitioner.

Sorry that took so long!  We had to get a response from everyone who had ever worked on the repository giving their go-ahead before we could switch it.


Roger Hoover

Apr 11, 2014, 1:24:30 PM
to Andrew Otto, Christian Aistleitner, camu...@googlegroups.com, Dan Andreescu, Christian Aistleitner
Thank you guys at Wikimedia for all your effort to make this happen.  I really appreciate it.  It's saved me from having to re-implement something similar.

Sent from my iPhone

Andrew Otto

Aug 20, 2014, 5:32:18 PM
to camu...@googlegroups.com, ot...@wikimedia.org, chri...@quelltextlich.at, dandr...@wikimedia.org, caistl...@wikimedia.org
An update!

We've rearranged things a bit at Wikimedia, and are no longer using hive-partitioner to automatically create Hive partitions.  We now use Oozie to handle this for us.  It is more automatic and more flexible than the Python script, but also more annoying to develop and troubleshoot.  Pick your poison!


We are still using a Python script to automatically drop old partitions and delete old data imported into HDFS, but that script is mostly specific to our datasets.



Roger Hoover

Aug 21, 2014, 12:38:31 PM
to Andrew Otto, camu...@googlegroups.com, Andrew Otto, Christian Aistleitner, Dan Andreescu, Christian Aistleitner
Thanks for sharing this.

Roger


