Pails and Thrift and JCascalog - oh my!


David Kincaid

Jan 4, 2013, 4:57:42 PM
to cascal...@googlegroups.com
Is anyone else out there reading Nathan's book Big Data and trying to work through the ideas and code? I've been struggling my way through it and am at the point where I've got the Thrift schema set up, can create Thrift objects in a Pail, and can read them using JCascalog queries one type at a time. I'm now stuck on how to write queries that involve more than one type of Thrift object at a time. JCascalog seems to get very confused and calls only a single function, giving it all of the data.

Below is my attempt at a query that gives me the e-mail addresses of all the clients together with the "sap_id" (essentially a store id) that each client belongs to. I realize it may be difficult to follow without the rest of the code, but I think I can state the problem without it.

If I run the clientQuery or the emailQuery subquery by itself, it works correctly and gives me the expected results. When I run them in debug mode with one breakpoint inside ExtractClientEdgeFields.operate() and another inside ExtractClientId.operate(), I can see each function getting called with the right data.

However, if I try to join them using fullQuery, the only function that gets called at all is ExtractClientId.operate(), and it gets passed all of the data from both taps. No results are produced. Taps.clientEdgeTap() and Taps.clientTap() set different PailOptions.attrs() values, so they read from different subpails, and I can see them working correctly when the queries are run individually.

Have I found a bug in JCascalog, or maybe in the PailTap? Why does it send all of the records from both taps to only one function? I may give Clojure and Cascalog a shot at it, but that will take some work to set up. Does anyone have any idea what is going on here?

    // Note: the sap_id parameter is not used in the query yet.
    public static Subquery practiceClientEmail(String pailPath, String sap_id) {
        PailTap clientEdgeTap = Taps.clientEdgeTap(pailPath);
        PailTap clientTap = Taps.clientTap(pailPath);

        // (sapid, clientid) pairs read from the client-edge subpail
        Subquery clientQuery = new Subquery("?sapid", "?clientid")
                .predicate(clientEdgeTap, "_", "?client-edge-data")
                .predicate(new ExtractClientEdgeFields(), "?client-edge-data").out("?sapid", "?clientid");

        // (clientid, email) pairs read from the client subpail
        Subquery emailQuery = new Subquery("?clientid", "?email")
                .predicate(clientTap, "_", "?pet-owner-data")
                .predicate(new ExtractClientId(), "?pet-owner-data").out("?clientid", "?email");

        // Join the two subqueries on the shared ?clientid variable
        Subquery fullQuery = new Subquery("?sapid", "?clientid", "?email")
                .predicate(clientQuery, "?sapid", "?clientid")
                .predicate(emailQuery, "?clientid", "?email");

        return fullQuery;
    }
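
For anyone following along without the rest of the code, here is a minimal plain-Java sketch of the result I expect fullQuery to produce: an inner join of the (sapid, clientid) rows with the (clientid, email) rows on clientid. It does not use JCascalog, Pail or Thrift at all; the JoinSketch class, the join method and the sample values are made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the expected join semantics; not JCascalog code.
public class JoinSketch {

    // clientEdges rows are {sapid, clientid}; clientEmails rows are {clientid, email}.
    // Returns {sapid, clientid, email} for every pair of rows sharing a clientid.
    public static List<String[]> join(List<String[]> clientEdges, List<String[]> clientEmails) {
        // Index the e-mail rows by clientid so each edge row can look them up.
        Map<String, List<String>> emailsByClient = new HashMap<>();
        for (String[] row : clientEmails) {
            emailsByClient.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        List<String[]> out = new ArrayList<>();
        for (String[] edge : clientEdges) {
            List<String> emails = emailsByClient.get(edge[1]);
            if (emails == null) {
                continue; // inner join: clients without an e-mail row are dropped
            }
            for (String email : emails) {
                out.add(new String[] {edge[0], edge[1], email});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<String[]> edges = Arrays.asList(
                new String[] {"store-1", "client-1"},
                new String[] {"store-1", "client-2"});
        List<String[]> emails = Arrays.asList(
                new String[] {"client-1", "a@example.com"},
                new String[] {"client-3", "b@example.com"});

        for (String[] row : join(edges, emails)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

With the sample data above, only client-1 appears on both sides, so a single row comes out. That is the behaviour I was expecting from the joined query, rather than no results at all.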

Andy Xue

Jan 8, 2013, 2:04:33 AM
to cascal...@googlegroups.com
I'm not familiar with JCascalog, but it sounds like the ExtractClientId function was moved into the reduce stage? That might explain why it gets data from both taps.

David Kincaid

Jan 8, 2013, 9:27:12 AM
to cascal...@googlegroups.com
Andy, thanks for the reply. I did get some help over on the Cascading group. It turns out there is a bug in dfs-datastores-cascading. Check out this thread: https://groups.google.com/forum/?fromgroups=#!topic/cascading-user/1vNtHhPI39E

Dave

Jeroen van Dijk

Jan 11, 2013, 4:09:12 AM
to cascal...@googlegroups.com
Hi Dave, 

After reading the Cascading thread, it sounds like you ran into the same issue as the Forma team, or a similar one. See https://github.com/reddmetrics/forma-clj/issues/60 and https://groups.google.com/forum/#!msg/cascalog-user/9kBdt-NNFPA/2T2B3rvDuHgJ

Jeroen