cascading.avro 2.1 released

570 views
Skip to first unread message

Christopher Severs

unread,
Oct 29, 2012, 2:51:28 PM10/29/12
to cascadi...@googlegroups.com
Hi everyone,

We put out a new shiny version of cascading.avro. It's a mash of the previous cascading.avro and cascading-avro projects with some new features.
https://github.com/bixolabs/cascading.avro

Highlights:
- Read avros without specifying a schema (can't do this with write yet but it's in the works)
- Construct the AvroScheme with either an avro schema (new style) or by specifying a list of fields and types (old style).
- Supports all avro types (except union of two non-null types), including nested records as cascading tuples.
- User can specify avro maps and arrays either with a cascading tuple or a java list/map.
- New option to tell cascading avro to not pack or unpack your avros when reading/writing. Use this if you want the actual Avro Record in hand or you provide a single output field containing an Avro record (similar to how WritableSequenceFile is handled).

2.1 is up on conjars now:
http://conjars.org/cascading.avro/avro-scheme

Here is a quick modification of the Cascading for the Impatient lessons 1 and 2 using Avro:
https://gist.github.com/3975481

Please give it a shot and let us know what breaks.




Johannes Schulte

unread,
Nov 1, 2012, 5:56:06 AM11/1/12
to cascadi...@googlegroups.com
Hi Christopher,

first of all thanks a lot for the scheme.

I am however currently facing trouble with cascading avro. I always get a NPE when trying to read from an avro file, with or without schema. My intention is to read from avro and from thereon never look back and use cascading tuples.

at java.io.StringReader.<init>(StringReader.java:50)
	at org.apache.avro.Schema$Parser.parse(Schema.java:921)
	at org.apache.avro.Schema.parse(Schema.java:970)
	at org.apache.avro.mapred.AvroJob.getMapOutputSchema(AvroJob.java:78)

The Nullpointer exception is from my point of view rooted in the odd behaviour of AvroJob, who expects an OutputSchema to be set as soon as you call setInputSchema (->configureAvroInput() ->configureShuffle() sets KeyComparator which tries to get the map output schema which defaults to output schema)

The WordCountExample basically tries to read from Avro and write to TextDelimited without an output schema like i want to do, but even this example fails:

"unable to read from input identifier"

which seems to be rooted in the AvroToCascading class when a null Record is Passed in (since the cause is a NPE in line 40 there)

I have to admit that i had a lot of trouble getting to run even a regular map reduce job with avro since it always assumes all intermediate and final key and values to be avro and/or pairs. 

Does anyone else had similar problems someday?

Johannes Schulte

unread,
Nov 1, 2012, 8:24:22 AM11/1/12
to cascadi...@googlegroups.com
Okay I just realised that there are null union fields which make the official WordCount example fail. I'll further dig into this and see if i find something....

Johannes Schulte

unread,
Nov 1, 2012, 10:21:52 AM11/1/12
to cascadi...@googlegroups.com
Still..

The 

AvroReadExample

https://gist.github.com/3975481

works perfectly now.

But when i include a simple GroupBy and Count into the flow, I get the Schema null exception the from org.apache.avro.mapred.AvroJob.getMapOutputSchema method.


Does this mean i can't directly operate with GroupBy steps on avro input and have to map-only them first? 


Cheers and thanks for your time,


Johannes

Paco Nathan

unread,
Nov 1, 2012, 3:45:28 PM11/1/12
to cascadi...@googlegroups.com
Wonderful. I'll see how to incorporate that new code into the Impatient series!






--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/cxqRNHeTOyoJ.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

Ken Krugler

unread,
Nov 1, 2012, 6:25:15 PM11/1/12
to cascadi...@googlegroups.com
On Nov 1, 2012, at 7:21am, Johannes Schulte wrote:

Still..

The 

AvroReadExample

https://gist.github.com/3975481

works perfectly now.

But when i include a simple GroupBy and Count into the flow,

In which flow? AvroReadExample, I assume. If so, what does that code look like?

I get the Schema null exception the from org.apache.avro.mapred.AvroJob.getMapOutputSchema method.


Stack trace?

Does this mean i can't directly operate with GroupBy steps on avro input and have to map-only them first? 

No.

Even if you have no operation before a GroupBy, you'll be using an identity map task - every Hadoop job needs one.

-- Ken


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/GZuBXXRZNhMJ.

To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

--------------------------------------------




Christopher Severs

unread,
Nov 1, 2012, 8:07:30 PM11/1/12
to cascadi...@googlegroups.com
I'm seeing the same thing. I'll try and track it down.

Thanks for finding this.

Here is the stacktrace I get, do you see something similar?

java.lang.NullPointerException
    at java.io.StringReader.<init>(StringReader.java:33)
    at org.apache.avro.Schema$Parser.parse(Schema.java:971)
    at org.apache.avro.Schema.parse(Schema.java:1020)
    at org.apache.avro.mapred.AvroJob.getMapOutputSchema(AvroJob.java:78)
    at org.apache.avro.mapred.AvroKeyComparator.setConf(AvroKeyComparator.java:39)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:773)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:959)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)


-----
Chris

Christopher Severs

unread,
Nov 1, 2012, 9:17:54 PM11/1/12
to cascadi...@googlegroups.com
Okay I found it. I was using AvroJob.setInputSchema which in turn calls an internal AvroJob function and sets a bunch of key comparator and shuffle information. Setting the input schema and class directly fixes the problem. I'll try and push a new jar up soon.
-----
Chris

Johannes Schulte

unread,
Nov 2, 2012, 2:24:45 AM11/2/12
to cascadi...@googlegroups.com
Hi Chris, I am glad I'm not the only one :)

As said, I also suspected the AvroInputJob since I had the same trouble when i was trying to operate with plain m/r on avro files. I ended up creating a small Extension of the AvroInputFormat for reading GenericRecords (as i dont wanted to specify a schema before),


Since you peek at the schema anyway in AvroScheme this isnt necessary in the AvroScheme case.

Have a good day,
Johannes

Christopher Severs

unread,
Nov 2, 2012, 2:05:24 PM11/2/12
to cascadi...@googlegroups.com
I put up a new bugfix branch here if you want to try it out. I'll try and push the 2.1.1 jar today as well.
https://github.com/bixolabs/cascading.avro/tree/2.1-bugfix

I also modified the AvroReadExample in the gist to count the counts with a groupBy to make sure everything is behaving correctly.
https://gist.github.com/3975481

Thanks again for finding this. I'm not sure how it slipped through.


Regards,
Chris

Johannes Schulte

unread,
Nov 2, 2012, 5:57:09 PM11/2/12
to cascadi...@googlegroups.com
Great Chris,

I'll try it. I did basically the same modification locally and it worked perfectly.

By the way, is anybody interested in a LocalAvroScheme? Since I like to develop cascading flows locally, i hacked that together. It's not a big deal though

yo

Ken Krugler

unread,
Nov 2, 2012, 6:11:04 PM11/2/12
to cascadi...@googlegroups.com
On Nov 2, 2012, at 2:57pm, Johannes Schulte wrote:

Great Chris,

I'll try it. I did basically the same modification locally and it worked perfectly.

By the way, is anybody interested in a LocalAvroScheme? Since I like to develop cascading flows locally, i hacked that together. It's not a big deal though

We probably would be interested - how much was changed from the base AvroScheme?

-- Ken

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/fzx8rJTELDcJ.

To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

Johannes Schulte

unread,
Nov 3, 2012, 2:48:50 AM11/3/12
to cascadi...@googlegroups.com
The common parts woul have to be factored out probably. It's basically just using the local counterparts of the scheme (Properties instead of JobConf)
and a avro DataFileStream for reading the data.

As I'm fairly new to cascading I don't know if everything is perfect but at least it works in the unit tests..
Cheers,
Johannes

Ken Krugler

unread,
Nov 3, 2012, 11:58:57 AM11/3/12
to cascadi...@googlegroups.com
On Nov 2, 2012, at 11:48pm, Johannes Schulte wrote:

The common parts woul have to be factored out probably. It's basically just using the local counterparts of the scheme (Properties instead of JobConf)
and a avro DataFileStream for reading the data.

As I'm fairly new to cascading I don't know if everything is perfect but at least it works in the unit tests..

Excellent.

It would be great if you could file an issue and attach a patch.

Thanks!

-- Ken

To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/dPeeHgk5wLkJ.

To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

Ken Krugler

unread,
Feb 22, 2013, 10:16:05 AM2/22/13
to cascadi...@googlegroups.com
On Nov 2, 2012, at 2:57pm, Johannes Schulte wrote:

Great Chris,

I'll try it. I did basically the same modification locally and it worked perfectly.

By the way, is anybody interested in a LocalAvroScheme? Since I like to develop cascading flows locally, i hacked that together. It's not a big deal though

This would be interesting - do you still have the code hanging around?

Best (for me, of course:)) would be a pull request.

Thanks,

-- Ken


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/fzx8rJTELDcJ.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Jakub Kotowski

unread,
Mar 24, 2013, 5:42:27 PM3/24/13
to cascadi...@googlegroups.com
Hi,

I'm seeing the same problem. I am using Cascading 2.1.5 and Cascading.avro 2.1.1. I only set the input scheme as Avro, the output is different - can this be the problem?

Thanks,

Jakub

P.S. The stack trace I'm getting:

java.lang.NullPointerException
	at java.io.StringReader.<init>(StringReader.java:33)
	at org.apache.avro.Schema$Parser.parse(Schema.java:971)
	at org.apache.avro.Schema.parse(Schema.java:1020)
	at org.apache.avro.mapred.AvroJob.getMapOutputSchema(AvroJob.java:78)
	at org.apache.avro.mapred.AvroKeyComparator.setConf(AvroKeyComparator.java:39)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:773)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:959)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

Christopher Severs

unread,
Mar 25, 2013, 10:05:44 AM3/25/13
to cascadi...@googlegroups.com
Hi Jakub,

That definitely looks like the bug that was fixed previously. Can you post a cut down example of your job so I can try it on my end?

Thanks,
Chris

Jakub Kotowski

unread,
Mar 25, 2013, 10:44:08 AM3/25/13
to cascadi...@googlegroups.com
Hi Christopher,

I was debugging it now and I found out that due to a mixup it was using Cascading 2.1.0. Switching to Cascading 2.1.5 for real seems to have fixed the issue. I can't easily post an example of the job because it is very complex. Thanks for your reply and sorry about the confusion.

Best regards,

Jakub
--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.

To post to this group, send email to cascadi...@googlegroups.com.

Christopher Severs

unread,
Mar 25, 2013, 1:20:56 PM3/25/13
to cascadi...@googlegroups.com
Glad it is working!

More generally, I was thinking it might be nice to not force the include of cascading in cascading.avro. Does that sound reasonable to you (and anyone else)? Right now I think cascading.avro has a dependency on some version of cascading but it could be changed to be a provided dependency so it doesn't step on your chosen version of cascading if you use maven. It's also possible of course to exclude the cascading from cascading.avro in the pom file but it might be nice to do so by default.

Jakub Kotowski

unread,
Mar 25, 2013, 3:00:13 PM3/25/13
to cascadi...@googlegroups.com
Hi Chris,

I discussed it briefly with our team and we all agreed that having cascading as a provided dependency in cascading.avro makes a lot of sense. It's important to know with which Cascading version it is tested and as you say, we can always exclude it in maven.

Jakub
Reply all
Reply to author
Forward
0 new messages