joining predicates with a matching function?


Eric Gebhart

Aug 9, 2016, 2:46:50 PM
to cascalog-user

I sure thought I could do this, but I haven't figured it out yet.

I have two taps that each return, as one of their values, a vector of numbers. The master vector, 'mcatv', might be shorter, and I still want a match if
the vectors are the same up to that point.

Of course Cascalog can't join the two predicates, because there is no shared variable to join on. I must be thinking about this the wrong way around, but I
can't see how to do it.

Hmm, maybe not a matcher, but a function that takes all the variables and returns the output vars when there is a match...
OK, trying that, but in the meantime, here's my code.


(fact "matching master and product vectors matches the master vector even if it is shorter."
      (catv-match [1 2 4 5] [1 2 4 5]) => truthy
      (catv-match [1 2 4] [1 2 4 5]) => truthy
      (catv-match [1 2] [1 2 4 5]) => truthy
      (catv-match [1 2 3] [1 2 4 5]) => falsey)
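
The definition of catv-match was not posted in the thread. A minimal sketch that would satisfy the facts above, assuming both arguments are sequential collections of numbers, might look like the following. Treat it as a hypothetical stand-in rather than the original implementation.

```clojure
;; Hypothetical stand-in for the catv-match used in the facts above;
;; the original definition does not appear in the thread.
(defn catv-match
  "True when the master vector mcatv is a prefix of the product vector pcatv."
  [mcatv pcatv]
  ;; take returns at most (count mcatv) items, so a master vector that is
  ;; longer than pcatv can never compare equal.
  (= mcatv (take (count mcatv) pcatv)))
```

Note that under this definition an empty master vector matches every product vector, which may or may not be wanted.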


(<- [?pid ?cat]
    (prod-categories :> ?cat ?mcatv)
    (prods :> ?pid ?pcatv)
    (catv-match ?mcatv ?pcatv))

Eric Gebhart

Aug 10, 2016, 6:05:34 PM
to cascalog-user
Well, I thought I would be clever and create a mapfn to give me the mcat. This is not very much data and could easily be held in an anonymous function.
The function works great, but I still get a block on the job. This is so basic I can't imagine what could be wrong.

The code makes perfect sense to me, but I'm baffled as to why it doesn't work.  I did check that I don't have clojure flycheck-mode on, so that's not the problem.

Here's what I've tried: 

I have an anonymous function that takes any catv and matches it to return an mcat. This works perfectly on its own. p-cat-tap just returns a flattened tree of keys and vectors.
I use my verified catv-match function to filter it down to what should be one record; if it's not, the first one should do. I get the key, turn it into a string, and return it.

(defn get-mcat-fn [pm cats-key]
  (let [mcats (p-cat-tap pm cats-key)]
    (fn [catv]
      (str (name (ffirst (filter #(catv-match (second %) catv) mcats)))))))

;; here's the test to verify that function works.

(fact "We can get the master category for any product category vector; Lego is [159 81]"
      (let [get-mcat (get-mcat-fn pm :master-categories)]
        (get-mcat [159 81 2 1]) => "lego"
        (get-mcat [159 81 6 5]) => "lego"
        (get-mcat [159 81 12]) => "lego"
        (get-mcat [159 81]) => "lego"
        (get-mcat [156 28 12]) => "tinkertoy"))

;; Just for sanity I tried using this function in the query instead; it works just fine.

(defn foo [pcat]
  (first pcat))

(defn p-cats
  "return a list of pids and mcats"
  ([pm pk catdata-key]
   (let [prods (p-cat-vectors pm pk)
         get-mcat (get-mcat-fn pm catdata-key)]
     (<- [?pid ?mcat ]
         (prods :> ?pid ?pcatv)
         (get-mcat ?pcatv :> ?mcat)))))

;; This is essentially the same as this example, which we have all written some variation of countless times.

(defn square [x]
  (* x x))

(def src [1 2 3 4 5])

(def bar (??<- [!x !squared]
               (src !x)
               (square !x :> !squared)))

Eric Gebhart

Aug 10, 2016, 6:37:05 PM
to cascalog-user
Well, I thought maybe there were no matches on occasion, and I was right about that.  I modified the matcher to return "_" when the match was nil.

It still gets a block on the job. Then I wrote a function to do the same thing outside of Cascalog. I have a small test data set, so I just run the tap query to
load up a variable and then use map on it. That works just fine, and it did show that I'm getting records without matches.

(defn foobar [pm pk catdata-key]
  (let [prods (tc/run<- (product-category-vector pm pk))
        get-mcat (get-mcat-fn pm catdata-key)]
    (map #(into [(first %)] [(second %) (get-mcat (second %))]) prods)))

Igor Postelnik

Aug 11, 2016, 11:13:22 AM
to cascalog-user
This cannot work in M/R world. You need a common key on both sides of the join, which will be used as the reduce key.

-Igor
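
One way to obtain such a common key, sketched here in plain Clojure rather than Cascalog, is to expand each product vector into all of its prefixes and compare those for exact equality against the master vectors. The names `prefixes`, `master-index`, and `mcat-for`, and the `[156 28]` master vector, are made up for illustration; only the "lego is [159 81]" pairing comes from the thread. In a real query the expansion would presumably live in a mapcat-style operation so that each prefix becomes its own tuple, which is an assumption and not something tested here.

```clojure
;; Hypothetical helper: every non-empty prefix of vector v, shortest first.
(defn prefixes [v]
  (map #(subvec v 0 %) (range 1 (inc (count v)))))

;; Illustrative master data keyed by category vector. [159 81] is from the
;; thread; [156 28] is an assumed value for tinkertoy.
(def master-index
  {[159 81] "lego"
   [156 28] "tinkertoy"})

;; Each prefix is an exact-equality key, so the lookup behaves like an
;; equi-join on (prefix = mcatv). Returns nil when no prefix matches.
(defn mcat-for [index pcatv]
  (some index (prefixes pcatv)))
```

For example, `(mcat-for master-index [159 81 2 1])` finds the `[159 81]` prefix and yields the lego entry, while a vector with no matching prefix yields nil.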

Eric Gebhart

Aug 12, 2016, 3:48:47 PM
to cascalog-user

I turned half of it into a mapfn. This works now.

It didn't for a while, though, because apparently just disabling clojure-flycheck was not thorough enough.
I commented out the clojure flycheck setup completely and restarted Emacs before I started getting any sort of reasonable behavior.

After that, I just made sure my anonymous function was (mapfn ... and not just (fn ..., and now it works.

If there is a better way to do the mapfn part, without the filter, I'd be interested. In this case, the number of rows in mcats is only 200 or so.
I suppose I could have made the m-cats function take the catv, so that it creates a filter fn each time this mapfn is called. That would result
in a fresh query every time, whereas this version only runs the query once, on setup. Which is fine in this case.

(defn get-mcat-fn [pm cats-key]
  (let [mcats (m-cats pm cats-key)]
    (mapfn [catv]
           (let [mcat (ffirst (filter #(catv-match (second %) catv) mcats))]
             (when-not (nil? mcat)
               (str (name mcat)))))))
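
On the question of avoiding the filter: one option is `some`, which returns the first non-nil result and so folds the filter, ffirst, and nil check into one step. The sketch below is plain Clojure with hypothetical names (`prefix-of?`, `find-mcat`, `sample-mcats`) and made-up sample rows shaped like the [key vector] pairs described earlier; it is not the original code and has not been run inside a Cascalog job.

```clojure
;; Hypothetical prefix test standing in for catv-match.
(defn prefix-of? [mcatv catv]
  (= mcatv (take (count mcatv) catv)))

;; Illustrative rows of [key category-vector]; [156 28] is an assumed value.
(def sample-mcats
  [[:lego [159 81]]
   [:tinkertoy [156 28]]])

;; Returns the name of the first master category whose vector is a prefix
;; of catv, or nil when nothing matches.
(defn find-mcat [mcats catv]
  (some (fn [[k mcatv]]
          (when (prefix-of? mcatv catv)
            (name k)))
        mcats))
```

This is still a linear scan, which should be fine for around 200 rows; the gain is mostly readability, since a lazy filter followed by ffirst also stops early.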


(defn p-cats
  ([pm pk catdata-key]
   (let [ps (p-cat-vectors pm pk)
         get-mcat (get-mcat-fn pm catdata-key)]
     (<- [?pid ?mcat ]
         (ps :> ?pid ?pcatv)
         (get-mcat ?pcatv :> ?mcat)))))

It is essentially this example except get-mcat is anonymous.

(defn square [x]
  (* x x))

(def src [1 2 3 4 5])

(defn bar [] (??<- [!x !squared]
                   (src !x)
                   (square !x :> !squared)))