Using the distributed cache from cascalog

Gerrard McNulty

Jan 4, 2012, 8:49:24 AM
to cascalog-user
Hi guys,

If I want to share a file in the distributed cache from a Cascalog
query, do I have to fall back to Hadoop's Java APIs? Or is there a
Cascalog way of doing it?

Sam Ritchie

Jan 4, 2012, 1:24:17 PM
to cascal...@googlegroups.com
There isn't any way of doing this currently. What are you trying to share in the distributed cache? One other option might be to distribute the value to each operation as a parametrized argument, as with

(defmapop [add-n [n]]
 [x]
 (+ x n))

but this gets a little flaky with large data structures as arguments.
--
Sam Ritchie, Twitter Inc
@sritchie09

(Too brief? Here's why! http://emailcharter.org)

Gerrard McNulty

Jan 4, 2012, 4:48:17 PM
to cascalog-user
I'm using MaxMind GeoIP to do different kinds of lookups on IP addresses. They export their databases as CSV or .dat files (with a library to access the .dat). I'd like to make Cascalog queries based on lookup information without performing an expensive join or adding the lookups to my data in advance.

It seems pushing the .dat file out to the distributed cache is the quickest way to do this, but of course I'm open to suggestions :)


Andrew Xue

Jan 4, 2012, 11:39:26 PM
to cascalog-user
I do something similar, but I just jar the CSV files up into the uberjar instead of using the distributed cache.

I wonder if there's a performance difference between that and the distributed cache, though. Both the app jar and the distributed cache files get copied from master to slaves all the same, so it seems equivalent?

Andrew Xue

Jan 4, 2012, 11:42:46 PM
to cascalog-user
Also, take a look at this:

https://gist.github.com/872918

"join a small file that can fit in memory, map-side"

My solution, which is pretty heavy-handed but works because my data is small enough,
is to load the CSV files into a hashmap and do lookups from that.
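
The hashmap approach above might look something like the following. This is only a minimal sketch: the two-column CSV layout, the sample rows, and the names `sample-csv`, `parse-lookup-table`, and `lookup-country` are all hypothetical, and real GeoIP CSV exports describe IP ranges rather than single addresses, so an exact-key lookup like this is a simplification.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample data standing in for the contents of a bundled CSV file.
(def sample-csv
  "1.2.3.4,US\n5.6.7.8,IE\n")

(defn parse-lookup-table
  "Builds a map from the first CSV column to the second.
  Assumes two plain columns with no quoting or embedded commas."
  [csv-text]
  (into {}
        (for [line (str/split-lines csv-text)
              :when (not (str/blank? line))]
          ;; Limit of 2 keeps any further commas inside the value.
          (let [[k v] (str/split line #"," 2)]
            [k v]))))

;; delay defers parsing until first use, so the map is built at most
;; once per JVM rather than at namespace load time.
(def lookup-table (delay (parse-lookup-table sample-csv)))

(defn lookup-country
  "Returns the value stored for ip, or nil when there is no entry."
  [ip]
  (get @lookup-table ip))
```

In a real job the text would presumably come from `(slurp (clojure.java.io/resource ...))` on the file bundled in the uberjar, and `lookup-country` would be called from inside an operation, so the whole table has to fit in each task's memory.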

rweald

Jan 5, 2012, 12:59:13 PM
to cascal...@googlegroups.com
We are currently bundling the .dat file as part of the uberjar and then using the MaxMind Java API through Clojure. It works well for us and is still quite concise.

(def lookup (new LookupService (.getPath (clojure.java.io/resource "GeoLiteCity.dat")) LookupService/GEOIP_MEMORY_CACHE))

Then, to perform a lookup, we simply define a function that takes the IP address as an argument:

(defn geocode_ip [ip]
  (.getLocation lookup ip))

If you discover a more idiomatic way I would love to hear about it.

Ryan Weald
Engineer @ Sharethrough
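
One caveat with the resource-based approach above: `.getPath` on a resource URL only names a readable file when the resource sits unpacked on disk. When it is still inside a jar, the URL has the form `jar:file:...!/name`, which a file-path API cannot open. A possible workaround, sketched here under that assumption, is to copy the resource to a temporary file first. The helper name `resource->temp-file` is hypothetical, and this is not claimed to be what anyone in the thread runs.

```clojure
(require '[clojure.java.io :as io])

(defn resource->temp-file
  "Copies a classpath resource to a temporary file and returns the File.
  Throws IllegalArgumentException if the resource is not on the classpath."
  [resource-name]
  (let [url (io/resource resource-name)]
    (when (nil? url)
      (throw (IllegalArgumentException.
              (str "Resource not found on classpath: " resource-name))))
    (let [tmp (java.io.File/createTempFile "resource-" ".tmp")]
      ;; Ask the JVM to remove the copy when it exits.
      (.deleteOnExit tmp)
      (with-open [in (io/input-stream url)]
        (io/copy in tmp))
      tmp)))

;; Hypothetical usage with the LookupService call from the post above,
;; wrapped in delay so the copy happens on first lookup, not at load time:
;; (def lookup
;;   (delay (LookupService. (.getPath (resource->temp-file "GeoLiteCity.dat"))
;;                          LookupService/GEOIP_MEMORY_CACHE)))
;; (defn geocode_ip [ip]
;;   (.getLocation @lookup ip))
```

The trade-off is one extra copy of the database per JVM, in exchange for not depending on how the jar happens to be unpacked on the task nodes.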

Andrew Xue

May 18, 2012, 2:38:34 AM
to cascal...@googlegroups.com
Hey Ryan -- Do you run this code on Amazon EMR? I used this solution for getting the MaxMind GeoIP data, and for some reason using the GeoIP Java API (ver 1.2.5) leads to a really odd, hard-to-diagnose classloading issue and lots of task failures. Have you ever encountered anything like that?