Michael newbie challenge nr 1


Michael Jaaka

Nov 8, 2009, 7:33:22 AM
to Clojure
Hi! How would you solve this problem:

I have a collection of pairs (key, value) ->
[  [ "tom" 32 ] [ "tom" 2333 ] [ "anne" 12 ] [ "anne" 55 ] ]

As you can see, keys can occur more than once. Also, the collection is very large, so it should be evaluated lazily (for example, the collection comes from a stream input).

I have a function with two arguments, a key and a sequence of its values; the pseudocode would be:

callbackListener(string key, iterator values)

Now I would like callbackListener to be called twice for the example collection:

once with "tom" and an iterator (or a lazy seq) over a lazily evaluated collection of (32 and 2333),
and a second time with "anne" and an iterator over the collection of (12 and 55).

In Java it is easy because the language is imperative. But how about such logic in Clojure?
Note that all must be evaluated lazily.

Thanks in advance.
 

Meikel Brandmeyer

Nov 8, 2009, 10:34:08 AM
to clo...@googlegroups.com
Hi,

Obviously, this cannot be done completely lazily, since a change in the
first element is not a stopping time. You'll always need at least one
element of lookahead to detect the change of key, and you'll have to
realize the input to find all values for a key.

(defn grouped-seq
  [input]
  (lazy-seq
    (when-let [s (seq input)]
      (let [k (ffirst s)
            [vs r] (split-with #(-> % first (= k)) s)]
        (cons [k vs] (grouped-seq r))))))

(map #(apply callbackListener %)
     (grouped-seq [["tom" 32] ["tom" 2333] ["anne" 12] ["anne" 55]]))

Note: if you try to avoid realizing all values of a given key, you'll
need two sequences: one to keep track of the key and one for the
values. While callbackListener realises the values, the key sequence
will retain the head of the input, causing all values to stay in
memory. If you want real laziness, you'll have to traverse the input
twice over independent input streams. So if you have a file containing
the data, you'll have to read the file twice from independent streams
to get full laziness.

Sincerely
Meikel

John Harrop

Nov 8, 2009, 1:22:36 PM
to clo...@googlegroups.com
On Sun, Nov 8, 2009 at 7:33 AM, Michael Jaaka <michae...@googlemail.com> wrote:
> Hi! How would you solve this problem:
>
> I have a collection of pairs (key, value) ->
> [ [ "tom" 32 ] [ "tom" 2333 ] [ "anne" 12 ] [ "anne" 55 ] ]
> <snip>

Something related to

(let [ks (distinct (map first s))
      vals-for-key (fn [k] (map second (filter #(= (first %) k) s)))]
  (doseq [k ks] (callback k (vals-for-key k))))

or to collect and return the return values of the callbacks, use "for" instead of "doseq".
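
For instance, a sketch of the collecting variant (same hypothetical s and
callback as above; doall forces the lazy for so the callbacks actually run
and their results are returned):

(let [ks (distinct (map first s))
      vals-for-key (fn [k] (map second (filter #(= (first %) k) s)))]
  (doall (for [k ks] (callback k (vals-for-key k)))))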

The problem is that this will realize s and hold onto its head, as Meikel points out. To make it really lazy you need something more like

(defn do-it [callback producer]
  (let [ks (fn [p] (distinct (map first (p))))
        vals-for-key (fn [k p] (map second (filter #(= (first %) k) (p))))]
    (doseq [k (ks producer)] (callback k (vals-for-key k producer)))))

Call this with two arguments: first, the callback, which takes a key and a lazy seq of the associated values; second, a producer function that produces anew a lazy sequence of key-value pairs from your disk file each time it's called, with the property that the lazy-seq step that returns nil when the sequence is exhausted also has the side effect of closing the input stream, so as not to leak file handles.
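
A minimal sketch of such a producer, assuming the pairs are stored one per
line as "key value" text (the name make-producer and the parsing are
illustrative only, not from the original post):

(defn make-producer [file]
  (fn []
    (let [rdr (java.io.BufferedReader. (java.io.FileReader. file))
          step (fn step []
                 (lazy-seq
                   (if-let [line (.readLine rdr)]
                     (let [[k v] (.split line "\\s+")]
                       (cons [k (Integer/parseInt v)] (step)))
                     ;; input exhausted: close the stream before ending the seq
                     (do (.close rdr) nil))))]
      (step))))

Each call to the returned function opens a fresh stream, so the key scan and
the per-key value scans each get their own reader.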

This will grovel over the file N+1 times where N is the number of distinct keys, though, and will have in general two input streams open on the file at the same time, one of them crawling along discovering more distinct keys, and the other finding all the values for one key, then for the next, etc.

It will also build up a hash set of all the keys over time, hidden inside the seq returned by "distinct". So if the summed length of the distinct keys is big enough to blow the heap on its own, you've got real problems, and you'll probably need a scratch disk file. In that case I'd recommend a more imperative style with loop/recur and lazy-seq. The outer loop starts with a copy of the input file, runs the inner loop, then deletes its input and reruns itself on the file generated by the inner loop (hence starting with a *copy* of the input file). The inner loop reads the first key from the current input file, creates an empty temp file to be the current output file, then calls the callback with a lazy seq object whose realization actually advances through the input file, copying entries that don't match the current key to the output file and producing lazy seq elements for entries that do match.

The idea is that if your input file foo.dat reads [[x 1] [y 2] [x 3] [z 4] [y 5]], it first makes a copy foo1.tmp, then calls the callback with x and a lazy seq whose realization emits (1 3) while also generating foo2.tmp = [[y 2] [z 4] [y 5]]. The outer loop then deletes foo1.tmp and repeats on foo2.tmp: the callback is called with y and a lazy seq whose realization emits (2 5) while generating foo3.tmp = [[z 4]]. Next the outer loop deletes foo2.tmp, and the callback is called with z and a lazy seq (4) whose realization creates an empty foo4.tmp. Finally the outer loop deletes foo3.tmp, sees foo4.tmp is empty, deletes foo4.tmp, and terminates.
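
A rough sketch of that inner pass, under heavy assumptions: pairs are stored
one per line as "key value" text, every name here is made up for
illustration, and the callback must fully realize the seq it is given before
the next pass starts (otherwise the spill file is incomplete):

(defn spill-pass
  "Lazily emits the values for key k from in-file; as a side effect of
  realization, copies each non-matching line to out-file and closes both
  streams once the input is exhausted."
  [k in-file out-file]
  (let [rdr (java.io.BufferedReader. (java.io.FileReader. in-file))
        w   (java.io.PrintWriter. (java.io.FileWriter. out-file))
        step (fn step []
               (lazy-seq
                 (if-let [line (.readLine rdr)]
                   (let [[pk pv] (.split line "\\s+")]
                     (if (= pk k)
                       (cons pv (step))
                       (do (.println w line) (step))))
                   (do (.close rdr) (.close w) nil))))]
    (step)))

The outer loop described above would drive this: read the first key of the
current temp file, call the callback with the result of spill-pass, delete
the current file, and recur on the spill file.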

If you want the callback's return values, you'll need to make the outer loop another lazy-seq.

I'm a bit uncomfortable with lazy-seq bodies that have side effects.

You might want to consider whether you should be using an RDBMS instead of plain disk files for whatever it is you're doing.

DTH

Nov 8, 2009, 1:11:50 PM
to Clojure
On Nov 8, 12:33 pm, Michael Jaaka <michael.ja...@googlemail.com>
wrote:
>
> now I would like get such effect that callbackListener will be called twice
> for the example collection.
>
> once with "tom" and iterator (or a lazy seq) to lazy evaluated collection of
> (32 and 2333)
> and second with "anne" and iterator for collection of (12 and 55)
<snip>
> Note that all must be evaluated lazily.

Unless the keys in your sequence will always be sorted, I don't
believe you can do this lazily; in order to generate a sequence of the
values for a given key in your collection, you would have to traverse
the entire collection first in order to check each key, i.e. if you
had:

[[ "tom" 32 ] [ "anne" 12 ] [ "anne" 55 ] [ "tom" 2333 ]]

Then you'd need to get right to the end of the list in order to
generate the sequence of values for "tom".

One non-lazy way to accomplish what you're after would be:

(use '[clojure.contrib.seq-utils :only (group-by)])

(def *s* [["tom" 32] ["tom" 2333] ["anne" 12] ["anne" 55]])

(defn callback-listener
  [k v]
  ;; pr-str realizes the lazy seq and prints it as a list; plain str on a
  ;; lazy seq would yield something like "clojure.lang.LazySeq@1a2b3c"
  (println (str "key: " k ", value: " (pr-str v))))

(doseq [[k v] (map (fn [[k v]] [k (map second v)])
                   (group-by first *s*))]
  (callback-listener k v))
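
Since contrib's group-by returns a sorted map, this should print the groups
in key order:

key: anne, value: (12 55)
key: tom, value: (32 2333)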

Note that since you used the name "callback-listener", I'm assuming
that it has side effects, which is why I used doseq. If
callback-listener does _not_ have side effects (i.e., if it performs
some computation upon its arguments and simply yields a result), then
you could use:

(defn callback-fn
  [[k v]]
  (compute-some-value-from-arguments k v))

(map (comp callback-fn (fn [[k v]] [k (map second v)]))
     (group-by first *s*))

which would yield a sequence of the results of each call to
callback-fn. Note that while the resulting sequence would, technically,
be lazy, group-by is not, so you would still consume the entire input
sequence up front.

Now, _if_ you can guarantee that your input sequence will always be
sorted by key (that is, if you know for a fact that all the entries
for a given key will be consecutive, as in your example: [tom tom anne
anne], rather than mine above: [tom anne tom anne]), then you could
use partition-by instead, which would (I believe) let you achieve full
laziness:

(use '[clojure.contrib.seq-utils :only (partition-by)])

(map (comp callback-fn (fn [part] [(ffirst part) (map second part)]))
     (partition-by first *s*))
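
For the example data this yields consecutive runs per key, so each group can
be handed off without scanning the whole input first:

user=> (partition-by first *s*)
((["tom" 32] ["tom" 2333]) (["anne" 12] ["anne" 55]))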

Hope this helps,

-David

Michael Jaaka

Nov 9, 2009, 9:09:00 AM
to clo...@googlegroups.com
Keys are always sorted, so once a key stops appearing it won't appear again.
The first solution seems to be the right one, because all values are processed sequentially and lazily.

2009/11/8 DTH <dth...@gmail.com>

Michael Jaaka

Nov 9, 2009, 9:14:38 AM
to clo...@googlegroups.com
By the first I mean the one by Meikel Brandmeyer. However, I will also check this one later:

"(use '[clojure.contrib.seq-utils :only (partition-by)])

(map (comp callback-fn (fn [part] [(ffirst part) (map second part)]))
     (partition-by first *s*))"

And the other question: if I have (def c [[1 2] [3 4]]) and want to lazily get [2 4] (the values of the tuples of a sequence), is (map #(-> % fnext) c) the correct way?


2009/11/9 Michael Jaaka <michae...@googlemail.com>

Meikel Brandmeyer

Nov 9, 2009, 9:53:03 AM
to Clojure
Hi,

On Nov 9, 3:14 pm, Michael Jaaka <michael.ja...@googlemail.com> wrote:

> and the other question is if I have (def c [ [ 1 2 ] [ 3 4 ] ]) and want to
> get lazily [ 2 4 ] (values of tuplets of a sequence) will be this a correct
> (map #(-> % fnext)  c ) way?

(map second c) is what you want. This is missing from my solution,
btw. The vs should go through exactly this map call: (map second vs).
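
Applying that correction to the earlier snippet, as a sketch
(callbackListener as in the original post, wrapped in dorun since it is
side-effecting):

(dorun
  (map (fn [[k vs]] (callbackListener k (map second vs)))
       (grouped-seq [["tom" 32] ["tom" 2333] ["anne" 12] ["anne" 55]])))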

Sincerely
Meikel