Best way to run multiple filters on the same [lazy] collection?

63 views
Skip to first unread message

Dmitry Kakurin

unread,
Oct 18, 2009, 3:48:09 AM10/18/09
to Clojure
I need to run 3 separate filters on the same collection.
Can I do this in a single pass thru collection (as opposed to 3
separate passes)?
Basically I don't want to bind the head of 'idata' collection because
it holds the whole thing in memory. Also collection itself is coming
from a file.

Here is what I have right now:
(let [idata ... something ...
sales
(filter
#($ (% "Product Type Identifier") = "1" &&
% "Vendor Identifier" = "01010012"
)
idata
)
upgrades
(filter
#($ % "Product Type Identifier" = "7" &&
% "Vendor Identifier" = "01010012"
)
idata
)
demo
(filter
#($ % "Product Type Identifier" = "1" &&
% "Vendor Identifier" = "01010318"
)
idata
)
]
(println "Sales : " (sum-units sales))
(println "Upgrades: " (sum-units upgrades))
(println "Demo : " (sum-units demo))

I can of cause write my own custom filter fn that would return 3
collections, but I'm wondering if there is a generic way to do it.

- Dmitry

Meikel Brandmeyer

unread,
Oct 18, 2009, 11:12:55 AM10/18/09
to clo...@googlegroups.com
Hi,

Maybe you can do:

(reduce (fn [[sales upgrades demo :as v] data]
(cond
(is-sales? data) [(conj sales data) upgrades demo]
(is-upgrade? data) [sales (conj upgrades data) demo]
(is-demo? data) [sales upgrades (conj demo data)]
:else v))
[[] [] []] (get-idata))

This will hold the resulting collections completely in memory, but not
the idata part. This is maybe not what you want, because it's too big.
Or it is what you want, because you want to close the file at some
point in time. I think defining anything lazy will hold onto idata.
(or at least part of the end results)

Sincerely
Meikel

Alex Osborne

unread,
Oct 18, 2009, 1:03:22 PM10/18/09
to clo...@googlegroups.com
Meikel Brandmeyer wrote:
> Maybe you can do:
>
> (reduce (fn [[sales upgrades demo :as v] data]
> (cond
> (is-sales? data) [(conj sales data) upgrades demo]
> (is-upgrade? data) [sales (conj upgrades data) demo]
> (is-demo? data) [sales upgrades (conj demo data)]
> :else v))
> [[] [] []] (get-idata))

Another variation:

(use 'clojure.contrib.seq-utils)

(defn record-type [data]
(cond
(is-sale? data) :sales
(is-upgrade? data) :upgrades
(is-demo? data) :demos))

(group-by record-type (filter record-type (get-idata)))

If the three output lists themselves are too large, I'd just explicitly
sum your units with reduce:

(reduce
(fn [counts data]
(let [type (record-type data)]
(assoc counts type (+ (units data)
(get counts type 0)))))
{} (get-idata))

=> {:sales 1233, :upgrades 17, :demos 42, nil 30}

Alex Osborne

unread,
Oct 18, 2009, 1:54:24 PM10/18/09
to clo...@googlegroups.com
Alex Osborne wrote:

> If the three output lists themselves are too large, I'd just explicitly
> sum your units with reduce:
>
> (reduce
> (fn [counts data]
> (let [type (record-type data)]
> (assoc counts type (+ (units data)
> (get counts type 0)))))
> {} (get-idata))
>
> => {:sales 1233, :upgrades 17, :demos 42, nil 30}

Actually come to think of it, this sort of thing is common enough that
you could pull out a 'reduce-by' function like this:

(defn reduce-by [grouper f val coll]
(reduce
(fn [m x]
(let [group (grouper x)]
(assoc m group (f (get m group val) x))))
(sorted-map) coll))

Then group-by could be easily defined in terms of it:

(defn group-by [f coll]
(reduce-by f conj [] coll))

And it makes your unit summing example:

(reduce-by record-type
(fn [count data] (+ count (units data)))
0 (get-idata))

Timothy Pratley

unread,
Oct 18, 2009, 9:19:26 PM10/18/09
to Clojure


On Oct 19, 4:54 am, Alex Osborne <a...@meshy.org> wrote:
> (reduce-by record-type
>             (fn [count data] (+ count (units data)))
>             0 (get-idata))

Wow - that is a really nice abstraction!

Kind of leads to SQLalike queries:

(defn select-count
[grouper coll]
(reduce-by grouper (fn [x _] (inc x)) 0 coll))

(defn select-sum
[summer grouper coll]
(reduce-by grouper (fn [x row] (+ x (summer row))) 0 coll))

(let [owners #{{:name "dorris" :sale 5}
{:name "joe" :sale 2}
{:name "dorris" :sale 2}
{:name "joe" :sale 2}
{:name "carol" :sale 2}
{:name "louise" :sale 2}}]
(select-sum :sale :name owners))
#_({"carol" 2, "dorris" 7, "joe" 2, "louise" 2})

Alex Osborne

unread,
Oct 18, 2009, 10:20:51 PM10/18/09
to Dmitry Kakurin, clo...@googlegroups.com
Dmitry Kakurin wrote:
> I actually like your "tag them then group them" approach.
> But what if the same record can have multiple tags?
> E.g. :sales and :upgrades?
>

Hmmm. the first way that occurred to me is just make your tagging
function return a set:

(defn record-types [x]
(disj
#{(when (is-sale? x) :sales)
(when (is-upgrade? x) :upgrades)
(when (is-demo? x) :demos)}
nil))

So you'd then get groups like:

{#{:sales :upgrades} [...]
#{:sales} [...]
#{:upgrades} [...]}

But unfortunately it seems that because the group-by in
clojure.contrib.seq-utils uses a sorted-map it can't have keys that are
sets (as there's no ordering defined for sets). You could either use
vectors instead of sets, or better yet lets just define reduce-by and
group-by to use a hash-map instead:

(defn reduce-by [grouper f val coll]
(reduce (fn [m x]
(let [group (grouper x)]
(assoc m group (f (get m group val) x))))

{} coll))

(defn group-by [grouper coll]
(reduce-by grouper conj [] coll))

This has the advantage that you can do things like intersections
("number of sales sales that were also upgrades"). If you don't care
about intersections and just want the three :sales, :upgrades, and
:demos lists, then I guess we start with 'for' to generate [tag record]
pairs:

(for [record (get-idata)
type (record-types record)]
[type record])

So if "data1" is both sales and upgrades then you'd get:

([:sales data1] [:upgrades data1] [:sales data2] [:demos data3] ...)

We can then reduce-by grouped on the first value in the pair (the tag)
and then transform the resulting maps to select just the second item in
the pair (the record):

(reduce-by
first #(conj %1 (second %2)) []
(for [record (get-idata)
type (record-types record)]
[type record]))

=> {:sales [data1 data2 ... ],
:upgrades [data1 data3 ...],
:demos [data4 ...]}

Dmitry Kakurin

unread,
Oct 18, 2009, 7:34:12 PM10/18/09
to Clojure
Thanks Meikel,
This is in line with what I was thinking for my own custom filter
function.
Now how would you modify it if the same record can be both "sales" and
"upgrade" at the same time?
I.e. if filters are not mutually exclusive.

- Dmitry
>  smime.p7s
> 3KViewDownload

Dmitry Kakurin

unread,
Oct 18, 2009, 7:51:48 PM10/18/09
to Clojure
The reduce-by approach (while cool) would not work for me because I
need to run multiple queries on the results.

- Dmitry

Meikel Brandmeyer

unread,
Oct 19, 2009, 3:13:25 PM10/19/09
to clo...@googlegroups.com
Hi,

Am 19.10.2009 um 01:34 schrieb Dmitry Kakurin:

> This is in line with what I was thinking for my own custom filter
> function.
> Now how would you modify it if the same record can be both "sales" and
> "upgrade" at the same time?
> I.e. if filters are not mutually exclusive.

I'd define a helper. (Code untested)

(defmacro ->if
"Acts like ->, but pipes the expression only through the next step
if the predicate clause returns true. Otherwise it skips the step
going on with the next steps, if any."
([x] x)
([x pred form & forms]
(let [x-gs (gensym "x")]
`(let [~x-gs ~x]
(if (-> ~x-gs ~pred)
(-> ~x-gs form (->if ~@forms))
(->if ~x-gs ~@forms))))))

(letfn [(add-sale
[[sales upgrades demos] x]
[(conj sales x) upgrades demos])
(add-upgrade
[[sales upgrades demos] x]
[sales (conj upgrades x) demos])
(add-demo
[[sales upgrades demos] x]
[sales upgrades (conj demos x)])]
(reduce (fn [sales-upgrades-demos data]
(->if sales-ugprades-demos
(do (is-sales? data)) (add-sale data)
(do (is-upgrade? data)) (add-upgrade data)
(do (is-demo? data)) (add-demo data)))
[[] [] []] (get-idata)))

This is maybe trying to be a little too clever, but well... I found
something like ->if useful at times. Note that, the predicate there
normally depends on the thing piped through the chain with ->. However
in our case it is not. That's what's the do is for: to ignore the
argument of the ->.

Looking at this, I would think, that the other approaches using the
group are much more elegant.

Hmm.. Another approach might be juxt:

(defn conj-if
"Conjoins x to coll if x fulfills pred."
[pred coll x]
(if (pred x)
(conj coll x)
coll))

(reduce (juxt #(conj-if is-sales? (nth %1 0) %2)
#(conj-if is-upgrade? (nth %1 1) %2)
#(conj-if is-demo? (nth %1 2) %2))
[[] [] []] (get-idata))

This looks much nicer, although there is still the ugly nth
"destructuring" for vector. And juxt is maybe only semi-official API
at the moment... If you don't want to depend on contrib, this might be
an alternative, though...

Sincerely
Meikel

Dmitry Kakurin

unread,
Oct 19, 2009, 9:16:52 PM10/19/09
to Clojure
Thanks Alex, this is a VERY elegant solution.
But there is still ex-C++ programmer sitting inside my head, screaming
"Aaaaah, so many temporaries and lookups, this is so inefficient" :-).
I'll try to make "him" shut up by comparing perf with my multi-filter
approach.

- Dmitry

Dmitry Kakurin

unread,
Oct 19, 2009, 9:08:51 PM10/19/09
to Clojure
Thank you for your feedback and interesting ideas.
I've decided to bite the bullet and to write my own multi-filter fn.
I'm very new to Clojure so any (constructive :-) ) criticism is
welcome:

(defn multi-filter [filters coll]
(reduce
(fn [res e]
(map (fn [f r] (if (f e) (conj r e) r))
filters
res))
(repeat (count filters) [])
coll
)
)

It takes a list of filter functions and coll and returns a list of
vectors each one containing result of corresponding filter. Here is
how I use it in my context:
(let [ ...
[sales upgrades demo]
(multi-filter [
#($ % "Product Type Identifier" = "1" &&
% "Vendor Identifier" = "01010012"
)
#($ % "Product Type Identifier" = "7" &&
% "Vendor Identifier" = "01010012"
)
#($ % "Product Type Identifier" = "1" &&
% "Vendor Identifier" = "01010318"
)
]
(get-idata)
)
]

Evaluation happens during destructuring bind (which is good for me
because idata is coming from file).
Interestingly enough replacing map with eager-map in multi-filter
definition gives very insignificant boost in perf (less than 10%), so
I've decided to leave it as is.
May be there are improvements possible to my emap definition?

(defn emap [f c1 c2]
(loop [s1 (seq c1) s2 (seq c2) res [] ]
(if (and s1 s2)
(recur (seq (rest s1)) (seq (rest s2)) (conj res (f (first s1)
(first s2))))
res)))

- Dmitry
>  smime.p7s
> 3KViewDownload

Meikel Brandmeyer

unread,
Oct 20, 2009, 8:52:33 AM10/20/09
to Clojure
Hi,

On Oct 20, 3:08 am, Dmitry Kakurin <dmitry.kaku...@gmail.com> wrote:

> (defn multi-filter [filters coll]
>   (reduce
>     (fn [res e]
>       (map (fn [f r] (if (f e) (conj r e) r))
>         filters
>         res))
>     (repeat (count filters) [])
>     coll
>   )
> )

I think this basically equivalent to the juxt solution. Besides your
are using sequence, where juxt uses vectors directly. Maybe that is a
bit faster? Dunno.

Stylistic: you should not put the closing parens on dedicated lines.
They are normally collected on the last line. While this is only a
style issue, you should get used to it, since 99.9% of all code at in
the wild will use that style...

> Interestingly enough replacing map with eager-map in multi-filter
> definition gives very insignificant boost in perf (less than 10%), so
> I've decided to leave it as is.
> May be there are improvements possible to my emap definition?
>
> (defn emap [f c1 c2]
>   (loop [s1 (seq c1) s2 (seq c2) res [] ]
>     (if (and s1 s2)
>       (recur (seq (rest s1)) (seq (rest s2)) (conj res (f (first s1)
> (first s2))))
>       res)))

Use next instead of the seq-rest combination.

Sincerely
Meikel

Sean Devlin

unread,
Oct 20, 2009, 9:36:11 AM10/20/09
to Clojure
Hey everyone,
I've developed a predicate library to handle stuff like this.

http://github.com/francoisdevlin/devlinsf-clojure-utils/

Check out lib.devlinsf.pred-utils

There are two methods, ever-pred? and any-pred? designed to handle
this. Here's an example

;;mimics OR
user=> (filter (any-pred? odd? zero?) (range 0 10))
(0 1 3 5 7 9)

;;mimics AND
user=> (filter (every-pred #(zero? (mod % 5)) #(zero? (mod % 3)))
(range 1 61))
(15 30 45 60)

They are based on every? & some, so they are appropriately lazy. You
can check my docs for more information.

Hope this helps.

Sean Devlin

unread,
Oct 20, 2009, 9:43:06 AM10/20/09
to Clojure
Oops... I misread the problem statement. Sorry!

Alex Osborne

unread,
Oct 20, 2009, 10:50:07 AM10/20/09
to clo...@googlegroups.com
Dmitry Kakurin wrote:
> Thanks Alex, this is a VERY elegant solution.

Hehe. I think I got a bit carried away generalising mine, but I found
it interesting. :-) I think your way or Meikal's juxt (which is really
neat, I didn't know about juxt) is much better for this specific
problem. I guess mine is more suited for when the 'groups' are derived
from the data (like the 'SQLalike' examples Timothy came up with) rather
than predefined by the code.

> But there is still ex-C++ programmer sitting inside my head, screaming
> "Aaaaah, so many temporaries and lookups, this is so inefficient" :-).
> I'll try to make "him" shut up by comparing perf with my multi-filter
> approach.
>

On my PC your multi-filter is 3-4 times faster than my last reduce-by
version, which I guess is not too bad considering my version has all
those lookups and temporary objects (I tested by filtering a large range
of integers into groups divisible by 2, 3 and 6). The JVM seems to be
actually pretty fast at creating objects and Clojure does lots of little
tricks to speed up lookups, like caching hash codes and using vectors
instead of hash tables for small maps.

To quiet my inner premature optimisation fanatic I usually rationalise
with "the disk is going to be the bottleneck anyway" and "lets try the
first way that occurs to me and only if it turns out to be too slow in
practice only then tweak it".

Dmitry Kakurin

unread,
Oct 20, 2009, 7:40:31 PM10/20/09
to Clojure
On Oct 20, 5:52 am, Meikel Brandmeyer <m...@kotka.de> wrote:
> > (defn multi-filter [filters coll]
> >   (reduce
> >     (fn [res e]
> >       (map (fn [f r] (if (f e) (conj r e) r))
> >         filters
> >         res))
> >     (repeat (count filters) [])
> >     coll))
>
> I think this basically equivalent to the juxt solution. Besides your
> are using sequence, where juxt uses vectors directly. Maybe that is a
> bit faster? Dunno.

Where do I use a sequence? reduce? or this line?
> > (repeat (count filters) [])
Should I wrap it in (vec ...) or is there a better way?

> Stylistic: you should not put the closing parens on dedicated lines.
> They are normally collected on the last line. While this is only a
> style issue, you should get used to it, since 99.9% of all code at in
> the wild will use that style...

I've read it many times, and I've tried to do it.
But how do you solve this practical problem:
When you need to insert something in the middle, how do you find the
right closing paren
where to split it? I use TextMate if it matters.

> Use next instead of the seq-rest combination.

Thanks, source is cleaner now, but perf is exactly the same. Not
surprisingly I guess as next is defined as (seq (rest x)) :-) .

- Dmitry

P.S. Why does it take almost 2 DAYS for my messages to appear in this
group? I've sent two last messages on Sunday afternoon.

John Harrop

unread,
Oct 20, 2009, 8:16:39 PM10/20/09
to clo...@googlegroups.com
On Tue, Oct 20, 2009 at 7:40 PM, Dmitry Kakurin <dmitry....@gmail.com> wrote:
P.S. Why does it take almost 2 DAYS for my messages to appear in this
group? I've sent two last messages on Sunday afternoon.

It's to keep out spam. Posts from new users are held for manual approval, until they've made a few posts none of which were spam. For some reason it doesn't seem to be 100% effective though, but it seems likely there'd be a lot more spam here otherwise. 

Alex Osborne

unread,
Oct 20, 2009, 8:30:21 PM10/20/09
to clo...@googlegroups.com
Dmitry Kakurin wrote:
>> Stylistic: you should not put the closing parens on dedicated lines.
>> They are normally collected on the last line. While this is only a
>> style issue, you should get used to it, since 99.9% of all code at in
>> the wild will use that style...
>
> I've read it many times, and I've tried to do it.
> But how do you solve this practical problem:
> When you need to insert something in the middle, how do you find the
> right closing paren
> where to split it? I use TextMate if it matters.

Most people writing in a lisp use an editor that does paren
highlighting. So for example in Emacs, when I have something like this:

(foo (bar (baz (quox '(a b c)))))

It's hard to show this in an email but if I move the cursors (_) to the
right of one of closing parens, then emacs will highlight the one it
matches with, by changing the background colour:

(foo (bar *(*baz (quox '(a b c)))_))

Some people also use something called "rainbow parens" which is where
each pair of matching parens are colour coded. Others use s-expression
highlighting, which colour codes the whole expression. You can see
various versions of this here:

http://lemonodor.com/archives/001207.html

It's worth trying out an editor with Clojure support (Emacs with SLIME,
Vim with VimClojure, Enclojure, the IDEA plugin etc), they have useful
shortcuts like hotkeys for moving between s-expressions and you can
quickly evaluate things from in your editor without having to switch to
the REPL. For example if I write (apply + [1 2 3 4]) in a file in Emacs
and hit Ctrl-X Ctrl-E then "10" appears in the status line.

Meikel Brandmeyer

unread,
Oct 21, 2009, 1:20:04 AM10/21/09
to clo...@googlegroups.com
Hi,

Am 21.10.2009 um 01:40 schrieb Dmitry Kakurin:

> Where do I use a sequence? reduce? or this line?

No. reduce itself is eager.

>>> (repeat (count filters) [])
> Should I wrap it in (vec ...) or is there a better way?

repeat, map, filter, cycle, take, iterate, etc. are all sequence
functions. They will turn their argument into sequence (via seq) and
return again a sequence. As I said: I don't know. It's probably not
much of a difference. If it works for you, use it! Worry about speed
when it's a problem.

> I've read it many times, and I've tried to do it.
> But how do you solve this practical problem:
> When you need to insert something in the middle, how do you find the
> right closing paren
> where to split it? I use TextMate if it matters.

As others said: try to get some support into your editor. I use Vim
with VimClojure. It colors the paren ("rainbow parens") so it's
relatively easy to match the paren visually. Even without that aid Vim
would highlight the correct paren when putting the cursor on its
partner paren. % let's me move between matching parens etc... With
some training the "paren problem" (if you want to call it like that)
will mostly vanish.

> P.S. Why does it take almost 2 DAYS for my messages to appear in this
> group? I've sent two last messages on Sunday afternoon.

Because Google Groups is one of the worst interfaces ever? I don't
know what it's problem is, but mailing lists are not a terribly new
idea. After 20 years of working mailing lists Google Groups is really
disappointing. I noticed this delay also from the web interface. Maybe
you should try to send emails directly from you mailing client.

Sincerely
Meikel

John Harrop

unread,
Oct 21, 2009, 10:52:30 AM10/21/09
to clo...@googlegroups.com
On Wed, Oct 21, 2009 at 1:20 AM, Meikel Brandmeyer <m...@kotka.de> wrote:
>>>    (repeat (count filters) [])
> Should I wrap it in (vec ...) or is there a better way?

repeat, map, filter, cycle, take, iterate, etc. are all sequence
functions. They will turn their argument into sequence (via seq) and
return again a sequence.

Not all of them turn their arguments into sequences. Repeat doesn't; the above repeat function will return a sequence of empty vectors, not a sequence of empty sequences.

The ones that iterate over a collection turn the iterated-over arguments into a sequence: all but repeat and iterate. Still only those arguments: for instance, map won't apply "seq" to its function argument. :)

Their outputs are all lazy sequences.

Just to clarify.

Meikel Brandmeyer

unread,
Oct 22, 2009, 1:40:46 PM10/22/09
to Dmitry Kakurin, clo...@googlegroups.com
Hi,

Am 22.10.2009 um 03:48 schrieb Dmitry Kakurin:

> Has someone developed a better support for "paren problem" for
> TextMate editor?
> Right now all it does is flashing matching open paren when I type the
> closing one. And that's it (apart from cmd-shift-B which is too much
> work).
>
> Also TextMate Clojure bundle seems to be replicated in several places,
> which one is considered the latest and greatest?

I'm sorry. I don't use TextMate, so I can't tell. Maybe someone on the
list might help here.

Sincerely
Meikel

Reply all
Reply to author
Forward
0 new messages