How to build a model and make it available in a PigPen job?


ajaybap...@gmail.com

Jan 25, 2016, 10:15:50 PM
to PigPen Support
Hello,

We are trying out PigPen in our project. At present we are stuck at one point and were wondering if anyone can help.

Problem:
We have a training file from which we have to build a model (in Clojure code), and that model will be used in one of the steps of the PigPen job for classification. What's the best way to do that?

Example:

(def model (atom {})) ;; bad practice, used only for ease of illustration

(defn train-model [model-file]
  ...) ;; trains the model, recording features in `model`

(defn classify [record]
  ...) ;; classifies a record, uses `model`

(defn process [training-file input-path output-path]
  (let [trained-model (train-model training-file)] ;; is this the right thing to do?
    (->>
      (pig/load-string input-path)
      (pig/map classify)
      (pig/store output-path))))

(pigpen/write-script "large-scale-classification.pig"
                     (process "$training-file" "$input-file" "$output-path"))


Thank you in advance.

Matt Bossenbroek

Jan 26, 2016, 1:10:34 PM
to ajaybap...@gmail.com, PigPen Support
Close, but obviously the atom won't work in a distributed environment :)

I would reduce your model into a single row and then join it to the larger dataset. If you use a replicated join, it'll keep the model record in memory and stream the larger dataset over it.


(defn train-model [training-file]
  (->>
    (pig/load-string training-file)
    (pig/reduce ...))) ;; reduce the model into a single record if it isn't already

(defn classify [record model]
  ...) ;; classifies a record, uses the model

(defn process [training-file input-path output-path]
  (let [input-data (pig/load-string input-path)
        trained-model (train-model training-file)]
    (->>
      ;; classify is the join's consolidation fn, receiving each
      ;; input record paired with the (single) model record
      (pig/join [(input-data :on (constantly 42))
                 (trained-model :on (constantly 42))]
                classify
                {:strategy :replicated})
      (pig/store output-path))))


More on pig joins here: https://pig.apache.org/docs/r0.15.0/perf.html#replicated-joins


The (constantly 42) bit gives pig a constant join key so that the model is applied to every record. However, this has the unfortunate effect of sending every record to the same reducer, so you won't get any parallelism. To get more parallelism, I would play around with creating n copies of the model with pig/mapcat and randomly selecting one of them in the join. Let me know if that makes sense or if you need a better example.
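That n-copies idea might look something like the sketch below. It's untested and the helper names (`n`, `replicate-model`, `model-copies`) are illustrative, not part of pigpen; the point is that each input record picks a random key in `[0, n)`, so records spread across the `n` tagged model copies instead of piling onto one reducer:

```clojure
(def n 10) ;; number of model copies, roughly the parallelism you want

(defn replicate-model [trained-model]
  ;; emit n copies of the single model record, each tagged with a distinct key
  (pig/mapcat (fn [model]
                (for [k (range n)]
                  [k model]))
              trained-model))

;; join each input record on a random key so it lands on one of the n copies
(defn classify-all [input-data model-copies]
  (pig/join [(input-data :on (fn [_] (rand-int n)))
             (model-copies :on first)]
            (fn [record [_k model]] (classify record model))
            {:strategy :replicated}))
```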


Another, radically different, approach would be to just load your precomputed model into memory before the classify map stage. Something like this:

(defn process [model-file input-path output-path]
  (->>
    (pig/load-string input-path)
    (pig/map (let [model (load-trained-model model-file)]
               (fn [record]
                 (classify record model))))
    (pig/store output-path)))


The (load-trained-model model-file) function would be called once per mapper, when it starts. You can't use the pig/load-* functions in there, though; it needs to load the model natively from the environment.
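As one hypothetical shape for that native loader: if the precomputed model is serialized as EDN at a path every mapper can read (e.g. a shared filesystem or the distributed cache), plain Clojure reading suffices, with no pigpen involved:

```clojure
(require '[clojure.edn :as edn])

(defn load-trained-model [model-file]
  ;; read a precomputed model from an EDN file visible to every mapper;
  ;; any deserialization that works outside pigpen would do equally well
  (edn/read-string (slurp model-file)))
```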

HTH

-Matt


ajaybap...@gmail.com

Jan 26, 2016, 8:37:19 PM
to PigPen Support, ajaybap...@gmail.com
Hey Matt,

Thank you so much for the reply.

- Ajay