I have a single file of (key, name, value) tuples and I wish to explore relationships among the name/value pairs.
For example, with the following raw data:
(def raw
[
["K1" "C1" "c1-2"]
["K1" "C2" "c2-2"]
["K1" "C3" "c3-1"]
["K2" "C1" "c1-2"]
["K2" "C2" "c2-2"]
["K2" "C3" "c3-2"]
["K3" "C1" "c1-3"]
["K3" "C2" "c2-2"]
["K3" "C3" "c3-3"]
]
)
To get a summary of the C1 values, with a count for each, over the keys where C2 = c2-2:
(def q1 (<- [?key ?name ?val] (raw ?key ?name ?val)))
(?<- (stdout) [?c1 ?count] (q1 ?key "C2" "c2-2") (q1 ?key "C1" ?c1) (c/count ?count))
This works fine in the REPL.
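For reference, the result I expect can be cross-checked in plain Clojure, without Cascalog. This is only a sketch: raw-tuples and count-c1-where-c2 are names made up for the illustration, and it assumes each key has at most one C1 row and one C2 row (as in the sample data), so that it agrees with the join on ?key.

```clojure
;; Plain-Clojure cross-check of the self-join; no Cascalog involved.
;; raw-tuples repeats the sample data so the snippet stands alone.
(def raw-tuples
  [["K1" "C1" "c1-2"] ["K1" "C2" "c2-2"] ["K1" "C3" "c3-1"]
   ["K2" "C1" "c1-2"] ["K2" "C2" "c2-2"] ["K2" "C3" "c3-2"]
   ["K3" "C1" "c1-3"] ["K3" "C2" "c2-2"] ["K3" "C3" "c3-3"]])

(defn count-c1-where-c2
  "Hypothetical helper: counts C1 values over the keys whose C2 equals c2-val.
  Assumes at most one C1 row and one C2 row per key."
  [tuples c2-val]
  (let [;; keys that have a C2 row with the wanted value
        matching-keys (set (for [[k n v] tuples
                                 :when (and (= n "C2") (= v c2-val))]
                             k))]
    ;; tally the C1 values belonging to those keys
    (frequencies (for [[k n v] tuples
                       :when (and (= n "C1") (contains? matching-keys k))]
                   v))))

(count-c1-where-c2 raw-tuples "c2-2")
;; => {"c1-2" 2, "c1-3" 1}
```

All three keys have C2 = c2-2, so the counts cover K1, K2 and K3.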
To move to Hadoop, where the raw data is now in a single file with one tuple per line, I use hfs taps:
(defn knv [dir] (textline-parsed dir 3))
(def raw-q (<- [?key ?name ?val] ((knv "/tmp/raw") ?key ?name ?val)))
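To make the one-tuple-per-line layout concrete, here is a rough plain-Clojure sketch of the kind of parsing I mean. It is not the textline-parsed function used above: parse-line is a hypothetical name, and the whitespace delimiter is an assumption made only for this example.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, NOT the textline-parsed used above.
;; Assumes fields are separated by whitespace; change the regex
;; if the file uses a different delimiter.
(defn parse-line
  "Splits line into exactly n fields, or returns nil if the count differs."
  [n line]
  (let [fields (str/split (str/trim line) #"\s+")]
    (when (= n (count fields))
      fields)))

(parse-line 3 "K1 C1 c1-2")
;; => ["K1" "C1" "c1-2"]
```

Lines with the wrong number of fields come back as nil here, so they can be filtered out before any query runs.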
I can run the following query:
(?<- (hfs-textline "/tmp/r3") [?c1 ?count]
(raw-q ?key "C1" ?c1) (raw-q ?key "C2" "c2-2") (c/count ?count))
However, I would rather specify a file argument:
(defn raw-q
[dir]
(<- [?key ?name ?val] ((knv dir) ?key ?name ?val)))
and then I can run the following query:
(?<- (hfs-textline "/tmp/r3") [?c1 ?count]
((raw-q "/tmp/raw") ?key "C1" ?c1) ((raw-q "/tmp/raw") ?key "C2" "c2-2") (c/count ?count))
But this flow fails (I am not sure of the exact error). The main difference I can see is that in the latter case (using defn), the log shows two source statements when the flow starts up, whereas the former (using def) shows just one, e.g.:
12/05/24 15:20:42 INFO flow.Flow: [] source: Hfs["TextLine[['line']->[ALL]]"]["/tmp/raw"]"]
12/05/24 15:20:42 INFO flow.Flow: [] source: Hfs["TextLine[['line']->[ALL]]"]["/tmp/raw"]"]
The only indication of trouble I get is:
12/05/24 15:20:42 WARN flow.FlowStep: [] abandoning step: (3/4) ...quenceFile ..... (snip)
which causes the rest of the job to stop.
So, is there an issue with the self-join? Is there an alternative (besides two separate source files)? Or is there some way to define the source defn to avoid this (whatever the real failure is)?
Thanks,
John