Self join OK? Errors when using defn to define the source tap query


John Liberty

May 24, 2012, 11:42:16 AM
to cascal...@googlegroups.com
I have a single file of (key, name, value) tuples, and I want to explore relationships among the name/value pairs.
For example, given the following raw data:
(def raw
  [["K1" "C1" "c1-2"]
   ["K1" "C2" "c2-2"]
   ["K1" "C3" "c3-1"]
   ["K2" "C1" "c1-2"]
   ["K2" "C2" "c2-2"]
   ["K2" "C3" "c3-2"]
   ["K3" "C1" "c1-3"]
   ["K3" "C2" "c2-2"]
   ["K3" "C3" "c3-3"]])

To get each C1 value and its count, over the keys where C2 = c2-2:
(def q1 (<- [?key ?name ?val] (raw ?key ?name ?val)))
(?<- (stdout) [?c1 ?count] (q1 ?key "C2" "c2-2") (q1 ?key "C1" ?c1) (c/count ?count))

This works fine in the REPL.
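
As a cross-check on what the self join should return, here is a minimal sketch of the same computation in plain Clojure, with no Cascalog involved. The names sample-rows and count-c1-where-c2 are made up for this illustration, and it assumes each key has at most one row per name, which holds for the sample data above.

```clojure
;; Plain Clojure sketch (no Cascalog): the same self join on in-memory rows.
;; sample-rows and count-c1-where-c2 are hypothetical names for illustration.
(def sample-rows
  [["K1" "C1" "c1-2"] ["K1" "C2" "c2-2"] ["K1" "C3" "c3-1"]
   ["K2" "C1" "c1-2"] ["K2" "C2" "c2-2"] ["K2" "C3" "c3-2"]
   ["K3" "C1" "c1-3"] ["K3" "C2" "c2-2"] ["K3" "C3" "c3-3"]])

(defn count-c1-where-c2
  "Tallies the C1 values of every key whose C2 value equals c2-val.
  Assumes at most one row per (key, name) pair."
  [rows c2-val]
  (let [;; first side of the join: keys that have a C2 row with the wanted value
        matching-keys (set (for [[k n v] rows
                                 :when (and (= n "C2") (= v c2-val))]
                             k))]
    ;; second side: C1 values of those keys, counted per distinct value
    (frequencies (for [[k n v] rows
                       :when (and (= n "C1") (contains? matching-keys k))]
                   v))))

(count-c1-where-c2 sample-rows "c2-2")
;; => {"c1-2" 2, "c1-3" 1}
```

All three keys have C2 = c2-2 in the sample data, so every C1 row takes part in the count.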

To move to Hadoop, where the raw data is now in a single file with one tuple per line, I use hfs taps:
(defn knv [dir] (textline-parsed dir 3))
(def raw-q (<- [?key ?name ?val] ((knv "/tmp/raw") ?key ?name ?val)))

I can run the following query:
(?<- (hfs-textline "/tmp/r3") [?c1 ?count]
          (raw-q ?key "C1" ?c1) (raw-q ?key "C2" "c2-2") (c/count ?count))

However, I would rather specify a file argument:
(defn raw-q
  [dir]
  (<- [?key ?name ?val] ((knv dir) ?key ?name ?val)))

and then write the query as:
(?<- (hfs-textline "/tmp/r3") [?c1 ?count]
          ((raw-q "/tmp/raw") ?key "C1" ?c1) ((raw-q "/tmp/raw") ?key "C2" "c2-2") (c/count ?count))


However, this flow fails, and I am not sure exactly what the error is. The main difference I can see is at flow startup: in the latter case (using defn) the log shows two source statements, whereas in the former (using def) it shows just one. With defn, for example:
12/05/24 15:20:42 INFO flow.Flow: []  source: Hfs["TextLine[['line']->[ALL]]"]["/tmp/raw"]"]
12/05/24 15:20:42 INFO flow.Flow: []  source: Hfs["TextLine[['line']->[ALL]]"]["/tmp/raw"]"]

The only indication of trouble I get is:
12/05/24 15:20:42 WARN flow.FlowStep: [] abandoning step: (3/4) ...quenceFile  ..... (snip)

which causes the rest of the job to stop.


So, is there an issue with the self join? Is there an alternative (besides using two separate source files)? Or is there some way to define the source with defn that avoids this, whatever the real failure is?

Thanks,
John



Paul Lam

May 28, 2012, 4:32:33 AM
to cascal...@googlegroups.com