which way of filtering would be better/faster?

Andy Xue

Dec 20, 2012, 1:27:33 PM
to cascal...@googlegroups.com
Hi, I have a job which filters on a collection of unique user_ids.

I can think of two ways to do this. One is to use the collection as a generator and let that generator act as the filter:

(def user_ids [1 2 3 4 5])
;; user_ids is used as a generator, so sharing !user_id with sub-query makes this a join
(defn filter-as-gen [user_ids sub-query]
  (<- [!event_name] (sub-query !user_id !event_name) (user_ids !user_id)))

The other is to filter explicitly, by turning the collection into a set and testing membership with contains?:

(def user_ids [1 2 3 4 5])
;; user_ids becomes an in-memory set; contains? is applied to each tuple as a filter
(defn filter-as-set [user_ids sub-query]
  (let [user_ids_set (into #{} user_ids)]
    (<- [!event_name] (sub-query !user_id !event_name) (contains? user_ids_set !user_id))))

Is one of these preferable? In my particular use case user_ids will be small, i.e. it will fit into memory. If that is the case, is it better to use the latter method? Thanks,
Andy

Mayank Agarwal

Dec 20, 2012, 1:47:27 PM
to cascal...@googlegroups.com
Hi Andy,

If user_ids fits in memory, then the latter is preferable in my opinion, because it will run completely map-side.

The former is essentially a join on !user_id, so it will trigger both a map and a reduce task.
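
To make the difference concrete, here is a rough sketch in plain Clojure (no Cascalog or Hadoop involved). It only mimics the shape of the two approaches: a per-record set lookup versus a relational join on user_id. The sample events and the names filter-by-set and filter-by-join are made up for illustration, and nothing here measures real job performance.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sample data standing in for the sub-query's output.
(def events [{:user_id 1 :event_name "login"}
             {:user_id 2 :event_name "click"}
             {:user_id 9 :event_name "logout"}
             {:user_id 3 :event_name "click"}])

(def user-ids [1 2 3 4 5])

;; Set-lookup approach: each record is checked on its own against an
;; in-memory set, which is the kind of work that can stay map-side.
(defn filter-by-set [ids evs]
  (let [id-set (set ids)]
    (filter #(contains? id-set (:user_id %)) evs)))

;; Join approach: the ids are treated as a second relation and matched
;; with the events on the shared :user_id key. In a distributed job this
;; matching is what requires bringing equal keys together.
(defn filter-by-join [ids evs]
  (set/join (set (map (fn [id] {:user_id id}) ids))
            (set evs)))
```

Both functions keep the same events; they differ only in how the match is made, which is where the extra reduce step of the generator version comes from.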

thanks
Mayank
