surprising behavior w/ memoized function

Andrew Xue

Apr 28, 2012, 6:12:00 AM
to cascalog-user
hi --

so i have a lookup function that runs a mapreduce job to read small
dimension data from S3 and puts the result into a hashmap. i memoized
the function so that the map is kept in memory. the code looks like
this:

(defn- get-referral-dimension-map* [referral-dimension-path]
  ;; sdt/get-query makes a query and selects the fields
  ;; !referral_key and !ref_name
  (let [rd-src (sdt/get-query referral-dimension-path
                              ["!referral_key" "!ref_name"])
        tuples (??- rd-src)]
    (into {} (first tuples))))

(def get-referral-dimension-map (memoize get-referral-dimension-map*))

(defn get-referral-name [referral-key referral-dimension-path]
  (let [m (get-referral-dimension-map referral-dimension-path)]
    (m referral-key)))

this gets called in a "main" query, something like

(<- [?referral_name]
    (src ?referral_key)
    (get-referral-name ?referral_key :> ?referral_name))

oddly, the behavior i am observing is that a mapreduce job is launched
for the referral-dimension data for every map task in the "main" query.
it seems like once one map task has called the get-referral-name
function, the result should be memoized, and all subsequent map tasks
on that node that call the function should not need to re-do the
mapreduce job.
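
For reference, a minimal sketch of what `memoize` does inside a single process. It uses plain Clojure only (no Cascalog or Hadoop), and `load-dimension-map*` and `call-count` are hypothetical stand-ins for the real lookup. `clojure.core/memoize` keeps its cache in the memory of the JVM that called it, so a function memoized in one JVM starts with an empty cache in any other JVM. Whether the map tasks in the job above share a JVM is not stated here and depends on the Hadoop configuration.

```clojure
;; Hypothetical stand-in for the expensive lookup; it only counts its calls.
(def call-count (atom 0))

(defn- load-dimension-map* [path]
  (swap! call-count inc)          ; record that the expensive work ran
  {"k1" (str path "/name-1")})    ; fake dimension data, not read from S3

(def load-dimension-map (memoize load-dimension-map*))

;; Repeated calls with the same argument in this one process hit the cache.
(load-dimension-map "dim")
(load-dimension-map "dim")
(println @call-count)   ; prints 1

;; A different argument is a cache miss and runs the function again.
(load-dimension-map "other")
(println @call-count)   ; prints 2
```

The cache is keyed on the argument list, so the saving only applies to calls made with the same path in the same process.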

Andrew Xue

Apr 28, 2012, 6:53:19 AM
to cascalog-user
so the fix for this was to do something like

(let [rd-map (get-referral-dimension-map* referral-dimension-path)]
  (<- [?referral_name]
      (src ?referral_key)
      (rd-map ?referral_key :> ?referral_name)))

the odd thing is that in this case there is only one mapreduce job for
the referral dimension data -- but this also seems odd; my intuition
is that there should be at least as many mapreduce jobs as there are
nodes (ie, copies of the app jar)? what is actually going on under the
hood?
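
One part of this can be sketched in plain Clojure, under stated assumptions. A `let` binding is evaluated once, when the enclosing form is evaluated and before its body runs. In the fix above that means the dimension map is built once, where the query is constructed, and the resulting map is then used directly as the lookup function. The sketch below is only an analogy: `build-lookup-map`, `names-per-record`, `names-prebuilt` and the sample data are hypothetical, the per-record variant is deliberately left unmemoized to make the cost visible, and nothing here models how Cascalog ships values to the tasks.

```clojure
(def build-count (atom 0))

;; Hypothetical stand-in for the expensive map construction.
(defn build-lookup-map []
  (swap! build-count inc)
  {1 "search" 2 "email"})

;; Lookup map is resolved inside the per-record function.
(defn names-per-record [referral-keys]
  (mapv (fn [k] ((build-lookup-map) k)) referral-keys))

;; Lookup map is built once in a let; the map itself acts as the function.
(defn names-prebuilt [referral-keys]
  (let [rd-map (build-lookup-map)]
    (mapv rd-map referral-keys)))

(println (names-prebuilt [1 2 1]) @build-count)
(reset! build-count 0)
(println (names-per-record [1 2 1]) @build-count)
```

Both functions return the same names; only the number of times the map is built differs (once for the prebuilt variant, once per key for the per-record variant).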