handle Avro nested Record in cascading or Scalding


Jason Cao

Nov 26, 2013, 4:56:46 AM
to cascadi...@googlegroups.com
Hi,

Is there any way to handle a nested-Record Avro file with Scalding or Cascading?
For example, given the schema below, how do I read/write it to HDFS with Cascading?
I know the current cascading.avro supports nested Records; do you have a real example?
And how do I map a sub-Record to Fields?


{
  "type": "record",
  "name": "test",
  "namespace": "com.ebay.pandaren.avro",
  "doc": "A key/value pair",
  "fields": [
    {
      "name": "key",
      "type": {
        "type": "record",
        "name": "BambusKey",
        "namespace": "com.ebay.pandaren.avro.key",
        "fields": [
          { "name": "employee_number", "type": ["string", "null"] }
        ]
      },
      "doc": "The key"
    },
    {
      "name": "value",
      "type": {
        "type": "record",
        "name": "BambusValue",
        "namespace": "com.ebay.pandaren.avro.value",
        "fields": [
          { "name": "employee_number",         "type": ["int", "null"] },
          { "name": "manager_employee_number", "type": ["int", "null"] },
          { "name": "department_number",       "type": ["int", "null"] },
          { "name": "job_code",                "type": ["int", "null"] },
          { "name": "last_name",               "type": ["string", "null"] },
          { "name": "first_name",              "type": ["string", "null"] },
          { "name": "hire_date",               "type": ["string", "null"] },
          { "name": "birthdate",               "type": ["string", "null"] },
          { "name": "salary_amount",           "type": ["double", "null"] },
          { "name": "expense_amount",          "type": ["double", "null"] }
        ]
      },
      "doc": "The value"
    }
  ]
}

Vitaly Gordon

Nov 26, 2013, 1:59:42 PM
to cascadi...@googlegroups.com
Hi Jason,
Seeing that you are from eBay, and Chris Severs from eBay is the one who added Avro support to Scalding, I'm sure you can find a lot of production examples of Avro usage there. Just ping Chris for examples.

Vitaly

Christopher Severs

Nov 28, 2013, 9:57:57 AM
to cascadi...@googlegroups.com
Reading nested records with cascading.avro should just work; if you have a look at the unit tests, there is a nested record test. The default behavior is to recursively unpack every record into a Cascading tuple, so in your case the output tuple will contain two fields, each itself a tuple. Writing a nested record works the same way: the sink also takes nested Cascading tuples as input.

In Scalding the preferred way would be to use the PackedAvroSource which will give you a TypedPipe[com.ebay.pandaren.avro.test] which you can do anything you want with. Since it just sees a TypedPipe of Java objects it doesn't matter if each object has other objects nested inside. 
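To make the TypedPipe idea concrete, here is a minimal sketch. The case classes below are plain-Scala stand-ins for the Avro-generated classes, and a `List` stands in for the `TypedPipe[com.ebay.pandaren.avro.test]` that PackedAvroSource would provide; all names are illustrative, not actual cascading.avro or Scalding API.

```scala
// Plain-Scala stand-ins for the Avro-generated classes (hypothetical names,
// mirroring the schema in the first post).
case class BambusKey(employeeNumber: Option[String])
case class BambusValue(lastName: Option[String], salaryAmount: Option[Double])
case class TestRecord(key: BambusKey, value: BambusValue)

// A List stands in for the TypedPipe[TestRecord] a PackedAvroSource would
// yield. Nested records are just ordinary field access on the objects.
val records = List(
  TestRecord(BambusKey(Some("42")), BambusValue(Some("Smith"), Some(100000.0)))
)

// Roughly what `pipe.map { r => (r.key.employeeNumber, r.value.lastName) }`
// would look like on a real TypedPipe.
val pairs = records.map(r => (r.key.employeeNumber, r.value.lastName))
```

Because the pipe only sees Java objects, the same pattern works no matter how deeply the records are nested.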

Niranjan Reddy

Mar 27, 2015, 1:38:01 PM
to cascadi...@googlegroups.com
Chris, Jason,

Could you please share sample code for doing this in Scalding? I was unsuccessful in getting Scalding to read nested Avro files.

Thanks,
Niranjan

cse...@ebay.com

Mar 27, 2015, 6:29:40 PM
to cascadi...@googlegroups.com
Hi Niranjan,

Are you using the Packed or Unpacked source? If you use the PackedAvroSource you will have the entire Avro record (including any nested records) in hand and can do whatever you want with it. This is my preferred way of dealing with Avro, since the choices we have to make for automatic unpacking usually aren't quite right for every use case.
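The packed-vs-unpacked trade-off can be sketched with plain Scala stand-ins (class and field names here are hypothetical, not cascading.avro API):

```scala
// Hypothetical stand-ins for Avro-generated classes.
case class Key(employeeNumber: Option[String])
case class Value(lastName: Option[String], salaryAmount: Option[Double])
case class Record(key: Key, value: Value)

val r = Record(Key(Some("42")), Value(Some("Smith"), Some(100000.0)))

// "Packed": the source hands you the whole record object, nesting intact,
// and you decide how to traverse it.
val packed: Record = r

// "Unpacked": the source flattens the record into one tuple of leaf fields,
// committing up front to one particular layout for the nested records.
val unpacked = (r.key.employeeNumber, r.value.lastName, r.value.salaryAmount)
```

With the packed form, nothing about the nesting is decided for you, which is why it tends to fit more use cases.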

------
Chris

Niranjan Reddy

Mar 30, 2015, 12:23:13 PM
to cascadi...@googlegroups.com
Chris,

I'm trying to use the PackedAvroSource. I'm fairly new to the Scala and Java world, so I may be missing a step somewhere. Here are the steps I'm following:
  1. Generated a Java file from the schema file using avro-tools (based on your suggestion in a different thread; the idea is to use the generated class as the type parameter for PackedAvroSource).
  2. Tried to compile the auto-generated Java file (to create a class file) using javac, but got an error: package org.apache.avro.specific does not exist (this is where, I think, I'm doing something wrong).

Niranjan

cse...@ebay.com

Mar 31, 2015, 8:28:45 PM
to cascadi...@googlegroups.com
Hi Niranjan,

Are you using something like Maven or SBT for your project? If so, your best bet is to use a plugin to autogenerate the avro classes when you compile. For Maven, check out:

The very first bit is about using the Maven plugin.

For SBT you can try:

In both cases you just put your schema in the right place and Maven or SBT takes care of the code generation, with all the proper libraries, for you. 
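A typical setup with the official avro-maven-plugin looks roughly like the following pom.xml fragment; the version number and directory paths are illustrative and should be adapted to your project:

```xml
<!-- Generates Java classes from .avsc schemas under src/main/avro
     during the generate-sources phase. Version is illustrative. -->
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.7.7</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```

This also pulls in the Avro libraries at compile time, which avoids the "package org.apache.avro.specific does not exist" error from compiling the generated file by hand.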

------
Chris