Is anyone else out there reading Nathan's book Big Data and trying to work through the ideas and code? I've been struggling my way through it and am at the point where I have the Thrift schema set up and can create Thrift objects in a Pail and read them with JCascalog queries, one type at a time. I'm now stuck on how to write queries that involve more than one type of Thrift object. JCascalog seems to get confused and calls only a single function, passing it all of the data.
The code at the end of this post is my attempt at a query that returns the e-mail address of every client together with the "sap_id" (essentially a store ID) of the store they belong to. I realize it may be difficult to follow without the rest of the code, but I think I can state the problem without it.
If I run clientQuery or emailQuery by itself, it works correctly and gives me the expected results. When I run them in debug mode with one breakpoint inside ExtractClientEdgeFields.operate() and another inside ExtractClientId.operate(), I can see each function being called with the right data.
However, if I try to join them using fullQuery, the only function that gets called at all is ExtractClientId.operate(), and it is passed all of the data from both taps. No results are produced. Taps.clientEdgeTap() and Taps.clientTap() set different PailOptions.attrs() values, so they read from different subpails, and I can see them working correctly when the queries are run individually.
Have I found a bug in JCascalog, or maybe in PailTap? Why does it send all of the records from both taps to only one function? I may give Clojure and Cascalog a shot, but that will take some work to set up. Does anyone have any idea what is going on here?
public static Subquery practiceClientEmail(String pailPath, String sap_id) {
    PailTap clientEdgeTap = Taps.clientEdgeTap(pailPath);
    PailTap clientTap = Taps.clientTap(pailPath);

    // (sapid, clientid) pairs read from the client-edge subpail
    Subquery clientQuery = new Subquery("?sapid", "?clientid")
        .predicate(clientEdgeTap, "_", "?client-edge-data")
        .predicate(new ExtractClientEdgeFields(), "?client-edge-data").out("?sapid", "?clientid");

    // (clientid, email) pairs read from the client subpail
    Subquery emailQuery = new Subquery("?clientid", "?email")
        .predicate(clientTap, "_", "?pet-owner-data")
        .predicate(new ExtractClientId(), "?pet-owner-data").out("?clientid", "?email");

    // Join the two subqueries on the shared ?clientid variable
    Subquery fullQuery = new Subquery("?sapid", "?clientid", "?email")
        .predicate(clientQuery, "?sapid", "?clientid")
        .predicate(emailQuery, "?clientid", "?email");

    return fullQuery;
}
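
To be clear about the result I expect from fullQuery, here is a rough plain-Java sketch of the join I have in mind. It does not use JCascalog or Pail at all. The class name ExpectedJoin, the join method, and the sample data are made up purely for illustration, and it assumes that sharing ?clientid between the two subqueries behaves as an inner join, so a client with no e-mail record is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpectedJoin {
    /**
     * Hypothetical helper: inner-joins (sapid, clientid) rows with
     * (clientid, email) rows on clientid and returns (sapid, clientid, email).
     */
    public static List<List<String>> join(List<String[]> edges, List<String[]> emails) {
        List<List<String>> out = new ArrayList<>();
        for (String[] edge : edges) {
            for (String[] email : emails) {
                // edge[1] and email[0] both hold the client id
                if (edge[1].equals(email[0])) {
                    out.add(Arrays.asList(edge[0], edge[1], email[1]));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for what clientQuery would emit
        List<String[]> edges = Arrays.asList(
            new String[] {"store-1", "client-a"},
            new String[] {"store-1", "client-b"},
            new String[] {"store-2", "client-c"});
        // Made-up stand-ins for what emailQuery would emit
        List<String[]> emails = Arrays.asList(
            new String[] {"client-a", "a@example.com"},
            new String[] {"client-c", "c@example.com"});

        for (List<String> row : join(edges, emails)) {
            System.out.println(row);
        }
    }
}
```

With this sample data I would expect two rows, one for client-a and one for client-c. What I actually get from fullQuery is no rows at all.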