Am I doing something wrong here?


thejus

Jun 1, 2012, 5:30:41 PM
to cascalog-user
Hi guys,

I am writing a log processor in Cascalog.
Here is a snippet that reproduces the problem:

(defn extract [source]
  (<- [?a ?b ?c]
      (source ?line)
      (extract-log ?line :> ?line-params) ; extract-log returns a map
      (get ?line-params :a1 :> ?a)
      (get ?line-params :b1 :> ?b)
      (get ?line-params :c1 :> ?c)))

(defn process [input-dir output-tap]
  (let [source (hfs-textline input-dir)
        extracted (extract source)]
    ;; output-tap is invoked here, so it must be a zero-argument function returning a tap
    (?<- (output-tap) [?l ?m ?n]
         (extracted ?l ?m ?n))))

Questions:
There are around 14k records in my log, but the above code
produces only 7k records in the output tap. What could be the issue?
Also, my log files contain key-value pairs. Is there a better way of
dealing with them than the extract function?

Mayank Agarwal

Jun 1, 2012, 7:02:29 PM
to cascal...@googlegroups.com
It looks to me that you just got bitten by the (:distinct false) nightmare.

Cascalog, by default, reduces so as to output unique tuples.

For example

(?<-  (stdout) [?a] ([[1] [1]] ?a))
will output just 1.

If you want it not to reduce automatically, you can include (:distinct false) in your query at the end.
For example:

(?<-  (stdout) [?a] ([[1] [1]] ?a) (:distinct false))
should give 
1
1
as output.
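
As a rough plain-Clojure analogy (not Cascalog itself, and assuming the default behaviour described above), the two queries behave like a tuple sequence with and without `distinct` applied. The names `tuples`, `deduplicated` and `kept` are made up for this sketch:

```clojure
;; Plain-Clojure analogy only; no Cascalog is involved here.
(def tuples [[1] [1]])

;; Roughly what the default query does: duplicate tuples are collapsed.
(def deduplicated (vec (distinct tuples)))

;; Roughly what (:distinct false) does: every tuple is kept.
(def kept (vec tuples))

(println deduplicated) ; prints [[1]]
(println kept)         ; prints [[1] [1]]
```

If every tuple carries a unique field, deduplication of this kind cannot reduce the record count.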


I have been bitten by this design feature several times before.

thanks
Mayank

thejus

Jun 2, 2012, 3:03:20 AM
to cascalog-user
In my case one of the values emitted by extract is the logId,
so all the records will be unique.

(let [source (hfs-textline input-dir)
      extracted (extract source)]
  (?<- (output-tap) [?l ?m ?n]
       (extracted ?l ?m ?n)))

Is the way I have used extracted wrong?

--
Thejus

Paul Lam

Jun 2, 2012, 3:27:59 AM
to cascal...@googlegroups.com
Use nullable variables. Any tuple in which a ?-variable was bound to null was filtered out by your previous query.

(defn extract [source] 
  (<- [!a !b !c] 
    (source ?line) 
    (extract-log ?line :> !line-params)  ; extract-log returns a map 
    (get !line-params :a1 :> !a) 
    (get !line-params :b1 :> !b) 
    (get !line-params :c1 :> !c) 
    (:distinct false)  ;; noted you had unique id, would the ID be nullable too?
  ) 
 )  
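
The filtering described above can be sketched in plain Clojure. This is an analogy rather than Cascalog code, and the sample maps and the names `parsed-lines`, `row`, `non-nullable-result` and `nullable-result` are invented for illustration:

```clojure
;; Plain-Clojure analogy only; the sample data is made up.
(def parsed-lines
  [{:a1 "id-1" :b1 "x" :c1 "y"}
   {:a1 "id-2" :b1 "x"}            ; :c1 is missing, so (get m :c1) is nil
   {:a1 "id-3" :b1 "z" :c1 "w"}])

(defn row
  "Builds one output tuple from a parsed log line."
  [m]
  [(get m :a1) (get m :b1) (get m :c1)])

;; ?-variables: a tuple with nil in any position is dropped.
(def non-nullable-result
  (vec (remove #(some nil? %) (map row parsed-lines))))

;; !-variables: nil is an allowed value, so every tuple survives.
(def nullable-result
  (mapv row parsed-lines))

(println (count non-nullable-result)) ; prints 2
(println (count nullable-result))     ; prints 3
```

A log in which roughly half the lines lack one of the keys would lose about half its records in the same way, even when every line has a unique id.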

thejus

Jun 2, 2012, 4:02:36 AM
to cascalog-user
Yup, that was the problem.
Sorry to bother you guys with silly questions;
I am still learning to use it.

Is there a better pattern for accessing logs based on key-value pairs?
Right now I am hard-coding the parameters :a1 :b1 :c1 to extract
values from the map, and doing that for different kinds of sources is
painful.
Is there any way of generating a wide source dynamically, based on the
keys,
so that later I can access the values with something like this?
(<- [?a ?b ?sum]
    (+ ?a ?b :> ?sum)
    ((select-fields generator ["?a" "?b"]) ?a ?b))

--
Thejus

Paul Lam

Jun 2, 2012, 5:25:21 AM
to cascal...@googlegroups.com
(defmapop map-horizontal [m expected-keys]
       (into [] (my-vals m expected-keys))) ;; my-vals returns nil for any missing but expected key

(defn map-expander [src expected-keys]
  (let [src-fields (get-out-fields src)
        key-fields (keys-to-fields expected-keys)]
    (<- (reduce conj src-fields key-fields)
        (src :>> src-fields)
        (map-horizontal ?map expected-keys :>> key-fields))))

The above assumes that the map's keys have a fixed structure. Otherwise, expand the map vertically into multiple tuples, which would push any multi-key processing to the reduce side.

(defmapcatop map-vertical [m]
     (into [] (map vector (keys m) (vals m))))

(<- [?id ?key ?val]
       (src ?id ?map)
       (map-vertical ?map :> ?key ?val))
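
To make the two shapes concrete, here is a plain-Clojure sketch of what each expansion produces for one map. It leaves Cascalog out entirely; `my-vals` is a hypothetical stand-in for the helper that is not shown in this thread, and `horizontal`, `vertical` and `sample` are names made up for the example:

```clojure
;; Hypothetical stand-in for my-vals: returns nil for any missing key.
(defn my-vals [m expected-keys]
  (map #(get m %) expected-keys))

;; Horizontal: one wide tuple with one position per expected key.
(defn horizontal [m expected-keys]
  (into [] (my-vals m expected-keys)))

;; Vertical: one [key value] tuple per map entry.
(defn vertical [m]
  (into [] (map vector (keys m) (vals m))))

(def sample {:a1 1 :b1 2})

(println (horizontal sample [:a1 :b1 :c1])) ; prints [1 2 nil]
(println (count (vertical sample)))         ; prints 2
```

The horizontal form needs the key list up front, while the vertical form works for any map but yields several tuples per input record.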


Paul
@Quantisan

Paul Lam

Jun 2, 2012, 5:36:08 AM
to cascal...@googlegroups.com
I'm guessing you're parsing a delimited text file? If so, just use TextDelimited in Cascading or its Clojure wrapper at https://github.com/nathanmarz/cascalog-contrib/blob/master/cascalog.more-taps/src/cascalog/more_taps.clj

thejus

Jun 2, 2012, 1:41:27 PM
to cascalog-user
I couldn't make complete sense of map-expander.
Would you mind explaining it in a bit more detail?

But this did the trick for me:

(defn map-horizontal [m expected-keys]
  (into [] (map #(get m %) expected-keys)))

(defn extract [source expected-keys]
  (let [outargs (v/gen-nullable-vars (count expected-keys))]
    (<- outargs
        (source ?line)
        (extract-log ?line :> ?line-params) ;; extract-log returns a map
        (map-horizontal ?line-params expected-keys :>> outargs))))

(defn process [output input]
  (let [source (hfs-textline input)
        extracted (extract source [:a :b])]
    (?<- (hfs-textline output) [?l ?m]
         (extracted ?l ?m))))

Used part of your idea. Thanks :)

--
Thejus


Paul Lam

Jun 5, 2012, 4:41:40 AM
to cascal...@googlegroups.com
map-expander differs from your extract fn only in that it returns 1) named vars and 2) both the original vars and the expanded vars. It's just a design choice.