Re: Using scalding to run MR jobs from reading Pail files with protobuf


Sam Ritchie

May 21, 2013, 2:20:31 AM
to cascadi...@googlegroups.com
Hey Helena,

The current definitive reference on Pail is an article by Nathan Marz (Pail's author):

http://www.manning.com/marz/

http://www.manning.com/free/excerpt_marz.html

PailSource is in serious alpha, and the brief answer is that we don't currently have great examples or tests of how to use Pail with more complex data formats.

The article describes how to write a PailStructure that serializes to/from Thrift. Scalding-Commons, as I'm sure you've seen, abstracts away serialization behind an "Injection" from our Bijection library:

https://github.com/twitter/bijection

The easiest way to use Protobufs with the Scalding source is to use the Protobuf Injections in that library for serialization/deserialization.
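
To make the "Injection" idea concrete, here is a minimal, self-contained sketch. It does not use the Bijection library: `SimpleInjection`, `Event` and `eventToBytes` are hypothetical stand-ins, and the pipe-delimited encoding is only a placeholder for real protobuf bytes. The point is the shape of the contract: encoding always succeeds, decoding may fail.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.util.Try

object InjectionSketch {
  // Hypothetical stand-in for the idea behind an Injection:
  // apply always succeeds, invert can fail on bytes apply never produced.
  trait SimpleInjection[A, B] {
    def apply(a: A): B
    def invert(b: B): Try[A]
  }

  // Hypothetical record type standing in for a generated protobuf message.
  final case class Event(name: String, year: Int)

  // Placeholder encoding ("name|year"); a real job would use protobuf bytes.
  val eventToBytes: SimpleInjection[Event, Array[Byte]] =
    new SimpleInjection[Event, Array[Byte]] {
      def apply(e: Event): Array[Byte] =
        s"${e.name}|${e.year}".getBytes(UTF_8)

      def invert(bytes: Array[Byte]): Try[Event] = Try {
        new String(bytes, UTF_8).split('|') match {
          case Array(name, year) => Event(name, year.toInt)
          case _ => throw new IllegalArgumentException("not an encoded Event")
        }
      }
    }
}
```

A sink only needs the `apply` direction and a source only needs `invert`, which is why hiding both behind one value keeps the source/sink code independent of the wire format.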

Please keep posting as you explore with Pail! I'm happy to give what insight I have as you move forward.
May 18, 2013 1:29 PM
Hi, I'm creating a new scalding project to run MR jobs that read Pail files (partitioned essentially by year, dayOfYear, hourOfDay), using protobuf types.

I have samples working using scald.rb from 

but these just use Int types. So I'm looking for samples on how to use PailSource.source in more complex scenarios: how to pass in credentials, use the S3 URI scheme, specify pail files, map fields from protobufs...

Eventually I will need to run these jobs on EMR but I did find a sample project that describes a way to do this.

Any insight will help,
Helena

--
Sam Ritchie, Twitter Inc
703.662.1337
@sritchie

Helena

May 22, 2013, 10:34:21 AM
to cascadi...@googlegroups.com
Thanks for responding, Sam. I'm in the middle of Nathan's book, and deep in the bowels of the scalding/scalding-commons/bijection code :-) There's a serious lack of samples for the pail work in this path, but I'm happy to contribute some. Liking the bijection protobuf, clean!

In the sample, the write is essentially:
def writejob = {
  val pipe = IterableSource((1 to 100), "src").read
  val sink = PailSource.sink[Int]("pailtest", structure)
  pipe.write(sink)
}

I am trying to figure out how to work with IterableSource, or whether it is even the correct Source implementation for me to be using with protobuf+pail. Any pointers on the Source usage would be greatly appreciated:

def writejob: Pipe = {
  // So far this creates MyEvent/2013/1/0/part-000000, which is wrong (the 'part-000000' part).
  val pipe = IterableSource(Seq(event), "src").read
  val sink = PailSource.sink[MyBaseProtobuf]("myrootpath", new MyPailStructure)
  pipe.write(sink)
}

It breaks during pail structure validation, which makes sense given that I'm still trying to figure out the proper configuration of Source or IterableSource:
Caused by: java.lang.IllegalArgumentException: MyEvent/2013/1/0/part-000000 is not valid with the pail structure {structure=com.foo.MyPailStructure, args={}, format=SequenceFile} --> [MyEvent, 2013, 1, 0]
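
For readers following along, the two PailStructure methods involved in partitioning are getTarget (which directory a record belongs in) and isValidTarget (whether a list of directory names is acceptable). The sketch below models only that pair in plain Scala. It does not extend the real Pail interface, and `Event`, its fields and the strictness of the check are assumptions chosen to mirror the year/dayOfYear/hourOfDay layout mentioned in this thread.

```scala
object PailTargetSketch {
  // Hypothetical event; fields mirror the kind/year/dayOfYear/hourOfDay layout.
  final case class Event(kind: String, year: Int, dayOfYear: Int, hourOfDay: Int)

  // Plays the role of getTarget: the directory components for one record.
  def getTarget(e: Event): List[String] =
    List(e.kind, e.year.toString, e.dayOfYear.toString, e.hourOfDay.toString)

  private def isInt(s: String): Boolean = s.nonEmpty && s.forall(_.isDigit)

  // Plays the role of isValidTarget: accept exactly four components,
  // a non-empty kind followed by three integers. Anything else is rejected.
  def isValidTarget(dirs: String*): Boolean = dirs.toList match {
    case kind :: year :: day :: hour :: Nil =>
      kind.nonEmpty && isInt(year) && isInt(day) && isInt(hour)
    case _ => false
  }
}
```

With a check this strict, extra components such as a file name or a temporary attempt directory make the whole list invalid, so logging exactly what the validity check receives is a cheap first debugging step. Whether the real structure should be this strict is a separate design decision.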


Also, WRT serialization, I did have to add the override in my base Job:
override def ioSerializations = List("com.twitter.elephantbird.cascading2.io.protobuf.ProtobufSerialization") ++ super.ioSerializations

but I'd written a custom 
class MyEventSerialization extends com.esotericsoftware.kryo.Serializer[MyBaseEvent]  
and expected to configure the override as List("com.foo.MyEventSerialization") ++ super.ioSerializations
However, that blew up with "cannot be cast to org.apache.hadoop.io.serializer.Serialization", whereas the ProtobufSerialization works so far.
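
The cast error is consistent with how Hadoop treats this setting: entries in its io.serializations list are expected to be classes implementing org.apache.hadoop.io.serializer.Serialization, and a Kryo Serializer subclass is a different kind of object. The sketch below reproduces the mismatch with stand-in types only. Every name in it is hypothetical and none of it touches Hadoop, Kryo or Scalding.

```scala
object SerializationCastSketch {
  // Stand-in for the Hadoop-side Serialization interface (simplified).
  trait HadoopStyleSerialization {
    def accept(c: Class[_]): Boolean
  }

  // Stand-in for a Kryo-style serializer: it can write bytes,
  // but it does not implement the interface the framework expects.
  class KryoStyleSerializer {
    def write(value: String): Array[Byte] = value.getBytes("UTF-8")
  }

  // Stand-in for a serialization that does implement the expected interface.
  class ProtobufStyleSerialization extends HadoopStyleSerialization {
    def accept(c: Class[_]): Boolean = true
  }

  // Mimics a framework checking a configured entry before using it.
  def load(candidate: AnyRef): Either[String, HadoopStyleSerialization] =
    candidate match {
      case s: HadoopStyleSerialization => Right(s)
      case other =>
        Left(s"${other.getClass.getName} cannot be cast to HadoopStyleSerialization")
    }
}
```

Under that reading, a custom Kryo serializer would need to be registered through whatever Kryo hook the job provides rather than listed alongside Hadoop serializations. That is an inference from the error text, not something confirmed in this thread.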

Thanks :)
Helena
@helenaedelson

Toby Evans

Mar 4, 2014, 7:48:21 AM
to cascadi...@googlegroups.com
Hi there Helena/Sam,

I'm also ploughing through the MEAP Big Data book, and have also run into this error:
---------------------------------
It breaks during pail structure validation, which makes sense given that I'm still trying to figure out the proper configuration of Source or IterableSource:
Caused by: java.lang.IllegalArgumentException: MyEvent/2013/1/0/part-000000 is not valid with the pail structure {structure=com.foo.MyPailStructure, args={}, format=SequenceFile} --> [MyEvent, 2013, 1, 0]
---------------------------------

as this is the only other reference on the entire internet to this specific error message, I'm trying my luck reopening this thread ...

basically, I can create new custom Pail objects, with vertical partitioning, using the SplitDataPailStructure right out of the book. This works fine in the local filesystem, but breaks when I run inside HDFS:

Caused by: java.lang.IllegalArgumentException: 1/1/part-000000 is not valid with the pail structure {structure=com.foo.pail.SplitDataPailStructure, args={}, format=SequenceFile} --> [0, _temporary, attempt_local137248367_0002_r_000000_0, 1, 1]
at com.backtype.hadoop.pail.Pail.checkValidStructure(Pail.java:563)


It's a bit frustrating, as Pail seems to automatically delete the "1/1/part-000000" file, so I don't get to even manually check it. Did you get anywhere?  I'm going to drop the partitioning, see if that works, then work back up

thanks

T  

Helena

Mar 4, 2014, 9:48:37 AM
to cascadi...@googlegroups.com
Hi Toby,

Check that your code is passing in a valid path that maps to your S3 pailfile dir structure. It looks like, at least in some cases, it does not, which is entirely possible.
 
- Helena



Toby Evans

Mar 11, 2014, 7:13:36 AM
to cascadi...@googlegroups.com
Hi Helena,

that's what it is: the whole path, e.g. "MyEvent/2013/1/0/part-000000", is being passed to isValidTarget, rather than "MyEvent/2013/1/0/", where "part-000000" is the file that contains the data.

I'm working on the "Select isn't broken" principle, so I must be getting something wrong somewhere. Working out why will be more useful than having something I don't understand just work, and then getting caught out later ...

cheers

T