[suggestion+patch] Bounded sorted-map searches

27 views
Skip to first unread message

Phil Jordan

unread,
Apr 10, 2008, 11:15:21 AM4/10/08
to clo...@googlegroups.com
Hi all,

I'm slowly getting to grips with Clojure and starting to appreciate its
subtleties. It's definitely my favourite Lisp so far. :)

In any case, one thing I've missed in the available data structures is
the ability to do bounded searches in a (sorted-map), so e.g. find the
smallest item for which the key is >= 5.

I've attached a patch that adds exactly that, it introduces 4 functions
that work with sorted-maps:

(find< [sortedmap key])
Returns the greatest sortedmap entry less than key, or nil if no match
present.

(find<= [sortedmap key])
Returns the greatest sortedmap entry less than or equal to key, or nil
if no match present.

(find> [sortedmap key])
Returns the smallest sortedmap entry greater than key, or nil if no
match present.

(find>= [sortedmap key])
Returns the smallest sortedmap entry greater than or equal to key, or
nil if no match present.


Such that

=> (let [tm (sorted-map 5 'A 68 'B 6 'C 0 'D)]
(prn (find tm -3) (find tm 0) (find tm 5) (find tm 6)
(find tm 10) (find tm 68) (find tm 100))
(prn (find>= tm -3) (find>= tm 0) (find>= tm 5) (find>= tm 6)
(find>= tm 10) (find>= tm 68) (find>= tm 100))
(prn (find> tm -3) (find> tm 0) (find> tm 5) (find> tm 6)
(find> tm 10) (find> tm 68) (find> tm 100))
(prn (find<= tm -3) (find<= tm 0) (find<= tm 5) (find<= tm 6)
(find<= tm 10) (find<= tm 68) (find<= tm 100))
(prn (find< tm -3) (find< tm 0) (find< tm 5) (find< tm 6)
(find< tm 10) (find< tm 68) (find< tm 100)))
nil [0 D] [5 A] [6 C] nil [68 B] nil
[0 D] [0 D] [5 A] [6 C] [68 B] [68 B] nil
[0 D] [5 A] [6 C] [68 B] [68 B] nil nil
nil [0 D] [5 A] [6 C] [6 C] [68 B] [68 B]
nil nil [0 D] [5 A] [6 C] [6 C] [68 B]
nil


I hope this is deemed sufficiently useful to be included in Clojure. The
Java part seems to involve a lot of repetition, but I couldn't see any
obvious way to reduce it - sorry, my Java isn't so great.

I've not included a (get ) counterpart, as I find that for bounded
searches, I really do want the key returned as well. Adding get>, get=>
etc. ought to be trivial though, if completeness is an issue.


This sort of thing is probably useful for sorted-set as well, although
the only interface for searching for set items I've found is fn-calling
the set itself, which obviously doesn't extend analogously to (find []).
Maybe find* should be able to operate on sets, too?

Feedback welcome!

Rich: consider copyright assigned to you if you choose to apply the
patch. (I assume that's how it's handled for Clojure)

~phil

bounded-sorted-map-find.diff

Rich Hickey

unread,
Apr 10, 2008, 12:22:59 PM4/10/08
to Clojure


On Apr 10, 11:15 am, Phil Jordan <li...@philjordan.eu> wrote:
> Hi all,
>
> I'm slowly getting to grips with Clojure and starting to appreciate its
> subtleties. It's definitely my favourite Lisp so far. :)
>
> In any case, one thing I've missed in the available data structures is
> the ability to do bounded searches in a (sorted-map), so e.g. find the
> smallest item for which the key is >= 5.
>
> I've attached a patch that adds exactly that, it introduces 4 functions
> that work with sorted-maps:

> I hope this is deemed sufficiently useful to be included in Clojure. The
> Java part seems to involve a lot of repetition, but I couldn't see any
> obvious way to reduce it - sorry, my Java isn't so great.
>

> Feedback welcome!
>

Phil - thanks for taking a swing at this, it's a long-standing todo
item for me. IMO, the 'right thing' is to build this kind of
functionality on seqs, then we not only get single lookups but ranges.

Everything can be built on just 2 functions on the Java side:

seqFrom(key)
rseqFrom(key)

these will map to seq-from and rseq-from on the Clojure side. What
they should do is return a PersistentTreeMap.Seq parked on the Node >=
or <= the key respectively, with the direction set accordingly. It's
easy to see how you can then build all of the single lookups you want
(which will just use the first/second of the seqs), plus ranges seq-
from, seq-after, rseq-from, rseq-before, and range-bounded versions
(e.g.seq-from-to, seq-from-through etc), of same. All of those
functions can/should be defined in Clojure.

seqFrom and rseqFrom will be a little tricky though, because the
PersistentTreeMap.Seq uses a stack to keep track of where it is, and
will therefore need to call the Seq ctor with a stack built during the
search.

If you are up for it, it would be a terrific addition, if not, I
understand :)

Rich

phil jordan

unread,
Apr 10, 2008, 4:00:52 PM4/10/08
to Clojure
On Apr 10, 6:22 pm, Rich Hickey <richhic...@gmail.com> wrote:
> On Apr 10, 11:15 am, Phil Jordan <li...@philjordan.eu> wrote:
> Phil - thanks for taking a swing at this, it's a long-standing todo
> item for me. IMO, the 'right thing' is to build this kind of
> functionality on seqs, then we not only get single lookups but ranges.

Yep, that makes a lot of sense, and would indeed be very useful.

> Everything can be built on just 2 functions on the Java side:
>
> seqFrom(key)
> rseqFrom(key)
>
> these will map to seq-from and rseq-from on the Clojure side. What
> they should do is return a PersistentTreeMap.Seq parked on the Node >=
> or <= the key respectively, with the direction set accordingly. It's
> easy to see how you can then build all of the single lookups you want
> (which will just use the first/second of the seqs),

Although in that case you also need access to the comparator to
determine whether the found entry is equal or greater/less.

> plus ranges seq-
> from, seq-after, rseq-from, rseq-before, and range-bounded versions
> (e.g.seq-from-to, seq-from-through etc), of same. All of those
> functions can/should be defined in Clojure.

Again for the bounded versions you'll need access to the comparator,
so I suppose we'll have to just provide an accessor - not the end of
the world.

> seqFrom and rseqFrom will be a little tricky though, because the
> PersistentTreeMap.Seq uses a stack to keep track of where it is, and
> will therefore need to call the Seq ctor with a stack built during the
> search.

Yep, I've now stared at the PersistentTreeMap.Seq for long enough to
understand how it works. Man, are you *sure* the JVM won't let us have
continuations?

Looks like it stores the path of nodes from root to leaf explicitly in
the list, and the rest as subtrees. This would indeed be fairly easy
to construct as part of a search. Maintaining an accurate O(1) count
operation won't be possible, I suspect, as there is no way of cheaply
determining how many nodes are to the left and right of a node in a
standard red-black tree. This will probably have to end up being O(n),
maybe O(1) on subsequent calls. (despite the caching it could still
maintain the appearance of being immutable)

After all these changes, the easiest way to do it is probably actually
to implement a new type of Seq for search results, a SearchSeq or
something like that. It could perform the actual search and construct
the search stack locally.

> If you are up for it, it would be a terrific addition, if not, I
> understand :)

I'll see what I can do.

~phil

Phil Jordan

unread,
Apr 11, 2008, 8:38:56 AM4/11/08
to clo...@googlegroups.com
Rich, I've made a second attempt at this based on your feedback, patch
attached. Extra commentary below.

Rich Hickey wrote:
> IMO, the 'right thing' is to build this kind of
> functionality on seqs, then we not only get single lookups but ranges.
>
> Everything can be built on just 2 functions on the Java side:
>
> seqFrom(key)
> rseqFrom(key)

I've done exactly that.

> these will map to seq-from and rseq-from on the Clojure side. What
> they should do is return a PersistentTreeMap.Seq parked on the Node >=
> or <= the key respectively, with the direction set accordingly.

So, I didn't find a better solution than what I said in my email last
night, and I've inherited a new SortedSeq from PersistentTreeMap.Seq,
which provides the appropriate lazy count semantics. (count ) is O(n) on
the first call, after which it's cached, and the cached value is passed
on to any (rest ) of that SortedSeq, so any subsequent (count ) is O(1).
The way I've implemented it isn't particularly idiomatic, feel free to
suggest something cleaner.

> It's
> easy to see how you can then build all of the single lookups you want
> (which will just use the first/second of the seqs), plus ranges seq-
> from, seq-after, rseq-from, rseq-before,

Done. It's a little awkward with having to read out the comparator, but
it works and isn't too shabby. I'm not 100% sure I like the terminology,
seq>=, seq>, (r)seq<= and (r)seq< somehow seem more obvious to me.

> and range-bounded versions
> (e.g.seq-from-to, seq-from-through etc), of same. All of those
> functions can/should be defined in Clojure.

This part is embryonic in my current patch: I've done the most idiomatic
thing I could think of, which is to apply take-while to the result of
seq-from. This suffers from doing a comparison on every iteration, I'll
have to do a little more thinking here. Maybe another, specialised Seq
is called for which directly takes into account two search stacks.

I'm also *really* not happy with the terminology here. seq-from-to and
seq-from-through don't seem any different to me - I'd be looking up the
difference all the time. seq-after-to-before and such seem a little
long-winded. Unfortunately, mathematical symbols don't seem to help much
either: seq<=<= or seq-<=-<= aren't exactly what I'd call readable, and
the usual range indicators in maths - e.g. [a, b) - have other meanings
already. Ideas?

> seqFrom and rseqFrom will be a little tricky though, because the
> PersistentTreeMap.Seq uses a stack to keep track of where it is, and
> will therefore need to call the Seq ctor with a stack built during the
> search.

This actually turned out to be pretty easy based on my previous
implementations of find>= and find<=. I've reimplemented those functions
based on seq-from and its brethren.

What do you think of it so far?

~phil

bounded-sorted-map-seq.diff

Phil Jordan

unread,
Apr 11, 2008, 10:55:56 AM4/11/08
to clo...@googlegroups.com
Phil Jordan wrote:
> This part is embryonic in my current patch: I've done the most idiomatic
> thing I could think of, which is to apply take-while to the result of
> seq-from. This suffers from doing a comparison on every iteration, I'll
> have to do a little more thinking here. Maybe another, specialised Seq
> is called for which directly takes into account two search stacks.

So, straight after I hit send I actually thought of a (hopefully) better
way to do seq-from-to, by doing two searches and iterating until the
second search is found. I also discovered that with my initial
implementation, you could run into trouble with the end item, for
example if from > to. In any case, the improved version is attached.
It's not exactly pretty, so this is probably not going to be the last
word on the matter.

~phil

bounded-sorted-map-seq2.diff

Rich Hickey

unread,
Apr 11, 2008, 12:24:27 PM4/11/08
to Clojure
Thanks Phil!

I took a look and am uncomfortable about the derivation from Seq -
(concrete derivation from a non-abstract base by a different
author...), but I had forgotten about the counts. I made a revision to
Seq so that the count is optional - in your case leave it out. I think
you should now be able to use Seq as-is with your existing search
logic by passing the stack to the Seq ctor, let me know if that's not
the case. This will greatly ease maintenance.

I think your original formulation (sans any buglets) for range-bounded
versions using take-while is the correct path and what I expected.
Don't worry about the cost of the compare call.

This is really close - I would have made the additional changes myself
but I'm running today.

We should be able to get this in in the next iteration, I think.

Thanks again,

Rich

Phil Jordan

unread,
Apr 11, 2008, 6:33:06 PM4/11/08
to clo...@googlegroups.com
Right, here we go. Next attempt.

Rich Hickey wrote:
> I took a look and am uncomfortable about the derivation from Seq -
> (concrete derivation from a non-abstract base by a different
> author...), but I had forgotten about the counts.

Hmm, okay. I personally didn't see anything wrong with that solution,
other than maybe the wasted final cnt.

> I made a revision to
> Seq so that the count is optional - in your case leave it out. I think
> you should now be able to use Seq as-is with your existing search
> logic by passing the stack to the Seq ctor, let me know if that's not
> the case. This will greatly ease maintenance.

I've adapted it to use Seq. (count) is now always O(n).

> I think your original formulation (sans any buglets) for range-bounded
> versions using take-while is the correct path and what I expected.

OK, done. I've done a rseq version as well now. Not sure it's got the
semantics of least surprise though.

> Don't worry about the cost of the compare call.

I do still worry, although I suppose if I run into this problem in live
code I probably have other troubles.

> This is really close - I would have made the additional changes myself
> but I'm running today.

No worries, this way I get to see some of the inner bits of Clojure,
which is something I want to do anyway. I get the impression that the
Clojure implementation is still small enough that a newcomer can get a
decent handle on it 'on the side'. I hope it stays that way to some
extent, it's not something I can say for most languages.

> We should be able to get this in in the next iteration, I think.

We'll see. Patch attached.

~phil

bounded-sorted-map-seq3.diff

jon

unread,
Apr 11, 2008, 6:36:38 PM4/11/08
to Clojure
Hi,
Just been looking at.. bounded-sorted-map-seq2.diff..
Isn't it handling not-strictly-increasing sequences wrongly?
(ie. isn't seq-after right only because seq-from is wrong?)
Jon

user=> (def sm (sorted-map 1 :a1, 2 :b1, 2 :b2, 2 :b3, 3 :c1, 4 :d1))
#<Var: user/sm>
user=> (seq-from sm 2)
([2 :b3] [3 :c1] [4 :d1])
user=> (seq-after sm 2)
([3 :c1] [4 :d1])

Phil Jordan

unread,
Apr 11, 2008, 6:57:51 PM4/11/08
to clo...@googlegroups.com
jon wrote:
> Hi,
> Just been looking at.. bounded-sorted-map-seq2.diff..
> Isn't it handling not-strictly-increasing sequences wrongly?
> (ie. isn't seq-after right only because seq-from is wrong?)
> Jon

I didn't think that sorted-map actually tracks non-unique entries, and

user=> (sorted-map 1 :a1, 2 :b1, 2 :b2, 2 :b3, 3 :c1, 4 :d1)
{1 :a1, 2 :b3, 3 :c1, 4 :d1}
user=> (count (sorted-map 1 :a1, 2 :b1, 2 :b2, 2 :b3, 3 :c1, 4 :d1))
4

seems to support that theory.
Do you have any examples of how to persuade sorted-map to track
non-uniquely keyed entries?

Thanks

~phil

jon

unread,
Apr 11, 2008, 7:26:22 PM4/11/08
to Clojure
whoops! sorry about the red herring.. I wasn't thinking straight.
back to bottom of the class :(

Rich Hickey

unread,
Apr 15, 2008, 2:51:19 PM4/15/08
to Clojure
Thanks. I got to think about this over the weekend and here's what I
came up with. Two functions, subseq and rsubseq. They take test
functions and keys. In simple case, a single test and key splits the
sequence and returns the head or the tail to/from the split point:

(def ss (sorted-set :a :b :c :d :e))

user=> (subseq ss >= :c)
(:c :d :e)
user=> (subseq ss < :c)
(:a :b)
user=> (rsubseq ss >= :c)
(:e :d :c)
user=> (rsubseq ss < :c)
(:b :a)

There is also a double-bounded case, taking start/end tests and keys:

user=> (subseq ss >= :b < :e)
(:b :c :d)
user=> (rsubseq ss >= :b < :e)
(:d :c :b)

Works on sorted sets and sorted maps.

It's up now, let me know what you think,

Rich

christop...@gmail.com

unread,
Apr 15, 2008, 3:46:24 PM4/15/08
to Clojure
Hi,

Is there any reasn why you have both subseq and rsubseq? Seems
(rsubseq...) === (reverse (subseq ...)).

Not to mention that the rsubseq probably will perform poorly on data-
structures that are forward-iterable only (to reuse some c++
terminology).

Another question, why is this done in java and not clojure?

Best regards,
Christophe Poucet

Rich Hickey

unread,
Apr 15, 2008, 4:26:38 PM4/15/08
to Clojure


On Apr 15, 3:46 pm, "christophe.pou...@gmail.com"
<christophe.pou...@gmail.com> wrote:
> Hi,
>
> Is there any reasn why you have both subseq and rsubseq? Seems
> (rsubseq...) === (reverse (subseq ...)).

rsubseq finds the reverse head in logN time and is lazy. reverse is
linear on the length of the subsequence and is eager.

>
> Not to mention that the rsubseq probably will perform poorly on data-
> structures that are forward-iterable only (to reuse some c++
> terminology).
>

There are no forward-iterable-only sorted data structures in Clojure.

> Another question, why is this done in java and not clojure?
>

The data structures are written in Java for bootstrapping and
performance reasons. The logic for subseq is more Clojure than Java.

Rich

Phil Jordan

unread,
Apr 17, 2008, 7:37:35 AM4/17/08
to clo...@googlegroups.com

OK, that looks pretty elegant. It'll presumably break fairly badly if
you decide to overload >=, >, <= and < in your local namespace, but I
guess then you deserve what you get. :)

~phil

Rich Hickey

unread,
Apr 17, 2008, 8:08:42 AM4/17/08
to Clojure
Ok, so now we're back to your original request. Options are:

;add an overload to find
(find sm < :c)

;bundle direction in name (needs 4 variations):

(find< sm :c)

;new name:

(seek sm < :c)

Thoughts?

Rich

Phil Jordan

unread,
Apr 17, 2008, 8:51:36 AM4/17/08
to clo...@googlegroups.com
Rich Hickey wrote:
> Ok, so now we're back to your original request. Options are:
>
> ;add an overload to find
> (find sm < :c)

Definitely an acceptable option. Could even change the argument order:
(find < sm :c)
and although that could be considered inconsistent with subseq, I
personally find it more intuitive.

> ;bundle direction in name (needs 4 variations):
>
> (find< sm :c)

This is what I'm using locally at the moment. The namespace pollution
isn't really an issue here as far as I'm concerned, and it generally
seems clean enough to me.

> ;new name:
>
> (seek sm < :c)

This seems unnecessarily confusing and I'm already occasionally running
into the problem of all the good, short names being taken, so I'm
against this. (e.g.: str, key, val)

~phil

Reply all
Reply to author
Forward
0 new messages