local/load


Andy L

Aug 4, 2015, 1:01:54 AM
to PigPen Support
Hi,

First, I would like to thank you for PigPen - it's awesome software.

Currently, I am trying to use the existing loaders to explore some Hadoop data. I am stuck trying to understand the role of `defmethod local/load :my-custom-storage` in https://github.com/Netflix/PigPen/wiki/Custom-Loaders , as compared with this example: https://gist.github.com/mbossenbroek/8461143 (apparently, PigPen was refactored in between).

My question is: why must `local/load` be defined in order to run `pigpen.pig/write-script`? Otherwise I get "No method in multimethod 'load' for dispatch value: :my-custom-storage".

Is it okay to leave it empty as in the tutorial?

Thanks,
AndyL

Matt Bossenbroek

Aug 4, 2015, 5:17:19 PM
to Andy L, PigPen Support
The purpose of :my-custom-storage is to uniquely identify the type of storage so that the different platforms can dispatch to the appropriate readers. It can be any arbitrary keyword that you want.


For example, when we add parquet storage, we use :parquet as the storage type [1]:

(raw/load$ location :parquet fields {:schema schema})

This informs pigpen that we want to load something using the parquet format, but pigpen doesn't have to know anything about that format yet.


To generate a pig script that loads parquet, we implement the appropriate multimethod that the script generation defines, using the parquet identifier [2]:

(defmethod pigpen.pig.script/storage->script [:load :parquet]

Once we do that, pigpen.pig/write-script now knows how to load parquet files.
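
As a sketch, such an implementation might look like the following. Only the multimethod name and dispatch value come from this thread; the argument destructuring, the loader class, and the emitted string are illustrative assumptions, not pigpen's actual implementation:

```clojure
;; Sketch only: the multimethod and dispatch value are real; the body
;; is a hypothetical illustration of emitting a Pig LOAD fragment.
(defmethod pigpen.pig.script/storage->script [:load :parquet]
  [{:keys [location fields opts]}]
  ;; Return the Pig fragment for this load command, e.g.
  ;; LOAD 'location' USING parquet.pig.ParquetLoader()
  (str "LOAD '" location "' USING parquet.pig.ParquetLoader()"))
```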


The local / REPL is just another platform (like pig, cascading, rx), but it has a slightly different model for loading data, so we implement a different multimethod [3]:

(defmethod pigpen.local/load :parquet

And now we can load the parquet format in the REPL. The cascading platform doesn't implement its loader dispatch for :parquet, so if you were to try to load a parquet file & generate a cascading flow, you'd get a no such method exception.
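
A hedged sketch of what a local implementation could look like, assuming the method receives the command map and returns the loaded records. The toy body below just reads one EDN record per line; a real parquet reader would go here:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.edn :as edn])

;; Sketch only: the multimethod name comes from the thread; the body is
;; a toy stand-in that reads one EDN record per line instead of parquet.
(defmethod pigpen.local/load :parquet
  [{:keys [location fields]}]
  (with-open [r (io/reader location)]
    ;; doall forces the lazy seq before the reader is closed
    (doall (map edn/read-string (line-seq r)))))
```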


So, to answer your question, you shouldn't have to define a local version of the loader if you only want to generate scripts; you would only require the pigpen.pig.script/storage->script implementation. You can choose to implement as many or as few platforms as makes sense for your scenario. If you don't implement the local version, it just won't work locally. If you've implemented the script multimethod & are getting that error when generating a script, send me a stack trace and I can figure out why it's happening.


And yeah, that gist is way out of date - sorry for the confusion there.

Let me know if that answers your question or if you had any other questions...

-Matt




--
You received this message because you are subscribed to the Google Groups "PigPen Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pigpen-suppor...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andy L

Aug 4, 2015, 9:17:29 PM
to PigPen Support
Matt,

Thanks for the explanation. It turns out that my issue was caused by three things at once: 1) learning Hadoop, Pig, and PigPen; 2) some REPL issues with redefining multimethods; and 3) what I believe is a small mistake in this part of the tutorial, https://github.com/Netflix/PigPen/wiki/Custom-Loaders, where:
        '(pigpen.op/map->bind vector) ;; vector is the function we use to combine
should be (I believe):
        '(pigpen.runtime/map->bind vector) ;; vector is the function we use to combine

With all that addressed, I am able to run my job now. However, there is one more unclear item I am trying to resolve. It seems to me that specifying the fields in
      (op/load$               ;; this is a raw pig load command
        location              ;; the location of the data - this should be a string
        :my-custom-storage    ;; this indicates the type of storage to use
        '[name address phone] ;; these are the fields this loader returns
        opts)  

should result in Pig code like:

load18297 = LOAD 'input.foo'
USING my.custom.Storage('arg') AS (name address phone);

However, I am getting something like this:

load18297 = LOAD 'input.foo'
USING my.custom.Storage('arg');

I am not sure if I misinterpreted the example, or if there is another way to specify the fields of the tuple returned by my custom loader. Manually adding "AS ()" to the generated Pig script fixed it, and the job runs without a problem.

I was also wondering if there is an option to add an arbitrary "REGISTER jar;" statement to PigPen so that it is included in the generated script?
Best regards,
Andy


Matt Bossenbroek

Aug 4, 2015, 11:28:00 PM
to Andy L, PigPen Support
Yep - that's a problem with the tutorial. Good eye & thanks for pointing it out!


But here are the short answers:

map->bind is defined in pigpen.runtime, but aliased in pigpen.core.op (which used to be pigpen.op). Either will work, but pigpen.core.op is the official public API. That namespace isn't automatically loaded, so we need to pass the fully-qualified symbol explicitly to the bind$ command.

The fields aren't automatically used as aliases - they need to be pulled out and used within the script generation code. The example now reflects this. Note that the fields are namespace-qualified symbols here, hence the (map name) part.
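
To illustrate the (map name) part: the fields arrive as namespace-qualified symbols, and clojure.core/name strips the namespace, which makes them usable as Pig aliases. The relation prefix and storage class below are just the examples from this thread:

```clojure
(require '[clojure.string :as str])

;; Fields as pigpen sees them: namespace-qualified symbols.
(def fields '[load18297/name load18297/address load18297/phone])

;; `name` strips the namespace, leaving plain alias names for Pig.
(str "LOAD 'input.foo' USING my.custom.Storage('arg') AS ("
     (str/join ", " (map name fields))
     ")")
;; => "LOAD 'input.foo' USING my.custom.Storage('arg') AS (name, address, phone)"
```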

There's another multimethod to use for loading jars (pigpen.pig.oven/storage->references). Example is in the docs now.
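
As a sketch of that multimethod (only its name comes from this thread; the dispatch value, return shape, and jar path are assumptions based on the dispatch pattern above):

```clojure
;; Sketch only: assumes the method dispatches on the storage keyword and
;; returns the jar paths to emit as REGISTER statements.
(defmethod pigpen.pig.oven/storage->references :my-custom-storage
  [_]
  ["hdfs://namenode:port/aa/bb/cc/ex.jar"])
```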


Hope that helps & apologies for the confusion. Let me know if that works for you or if you find anything else.

Thanks,
Matt

Andy L

Aug 5, 2015, 10:01:36 PM
to PigPen Support, core....@gmail.com
Hi,

With all the additional hints, my job ran without a hitch the first time I submitted it. There is one cosmetic issue though: my jars are stored in a non-default location, and I need to refer to them like this:
register hdfs://namenode:port/aa/bb/cc/ex.jar;

While using pigpen.pig.oven/storage->references works for my custom storage jar, I still need to edit the script by hand to account for pigpen.jar itself, which (I checked in the sources) seems to have a fixed default location. Not a big deal though.

Thanks,
Andy

Matt Bossenbroek

Aug 6, 2015, 12:14:47 PM
to Andy L, PigPen Support
You can specify the location of the pigpen jar as an option to write-script:

(write-script "my-script.pig" {:pigpen-jar-location "hdfs://pigpen.jar"} …)

-Matt


Andy L

Aug 7, 2015, 9:09:07 PM
to PigPen Support, core....@gmail.com


On Thursday, August 6, 2015 at 9:14:47 AM UTC-7, Matt Bossenbroek wrote:
You can specify the location of the pigpen jar as an option to write-script:

(write-script "my-script.pig" {:pigpen-jar-location "hdfs://pigpen.jar"} …)


Thanks, this is what I was looking for.

Andy