I'm slowly getting to grips with Clojure and starting to appreciate its
subtleties. It's definitely my favourite Lisp so far. :)
In any case, one thing I've missed in the available data structures is
the ability to do bounded searches in a (sorted-map), so e.g. find the
smallest item for which the key is >= 5.
I've attached a patch that adds exactly that, it introduces 4 functions
that work with sorted-maps:
(find< [sortedmap key])
Returns the greatest sortedmap entry less than key, or nil if no match
present.
(find<= [sortedmap key])
Returns the greatest sortedmap entry less than or equal to key, or nil
if no match present.
(find> [sortedmap key])
Returns the smallest sortedmap entry greater than key, or nil if no
match present.
(find>= [sortedmap key])
Returns the smallest sortedmap entry greater than or equal to key, or
nil if no match present.
Such that
=> (let [tm (sorted-map 5 'A 68 'B 6 'C 0 'D)]
(prn (find tm -3) (find tm 0) (find tm 5) (find tm 6)
(find tm 10) (find tm 68) (find tm 100))
(prn (find>= tm -3) (find>= tm 0) (find>= tm 5) (find>= tm 6)
(find>= tm 10) (find>= tm 68) (find>= tm 100))
(prn (find> tm -3) (find> tm 0) (find> tm 5) (find> tm 6)
(find> tm 10) (find> tm 68) (find> tm 100))
(prn (find<= tm -3) (find<= tm 0) (find<= tm 5) (find<= tm 6)
(find<= tm 10) (find<= tm 68) (find<= tm 100))
(prn (find< tm -3) (find< tm 0) (find< tm 5) (find< tm 6)
(find< tm 10) (find< tm 68) (find< tm 100)))
nil [0 D] [5 A] [6 C] nil [68 B] nil
[0 D] [0 D] [5 A] [6 C] [68 B] [68 B] nil
[0 D] [5 A] [6 C] [68 B] [68 B] nil nil
nil [0 D] [5 A] [6 C] [6 C] [68 B] [68 B]
nil nil [0 D] [5 A] [6 C] [6 C] [68 B]
nil
I hope this is deemed sufficiently useful to be included in Clojure. The
Java part seems to involve a lot of repetition, but I couldn't see any
obvious way to reduce it - sorry, my Java isn't so great.
I've not included a (get ) counterpart, as I find that for bounded
searches, I really do want the key returned as well. Adding get>, get=>
etc. ought to be trivial though, if completeness is an issue.
This sort of thing is probably useful for sorted-set as well, although
the only interface for searching for set items I've found is fn-calling
the set itself, which obviously doesn't extend analogously to (find []).
Maybe find* should be able to operate on sets, too?
Feedback welcome!
Rich: consider copyright assigned to you if you choose to apply the
patch. (I assume that's how it's handled for Clojure)
~phil
Rich Hickey wrote:
> IMO, the 'right thing' is to build this kind of
> functionality on seqs, then we not only get single lookups but ranges.
>
> Everything can be built on just 2 functions on the Java side:
>
> seqFrom(key)
> rseqFrom(key)
I've done exactly that.
> these will map to seq-from and rseq-from on the Clojure side. What
> they should do is return a PersistentTreeMap.Seq parked on the Node >=
> or <= the key respectively, with the direction set accordingly.
So, I didn't find a better solution than what I said in my email last
night, and I've inherited a new SortedSeq from PersistentTreeMap.Seq,
which provides the appropriate lazy count semantics. (count ) is O(n) on
the first call, after which it's cached, and the cached value is passed
on to any (rest ) of that SortedSeq, so any subsequent (count ) is O(1).
The way I've implemented it isn't particularly idiomatic, feel free to
suggest something cleaner.
> It's
> easy to see how you can then build all of the single lookups you want
> (which will just use the first/second of the seqs), plus ranges seq-
> from, seq-after, rseq-from, rseq-before,
Done. It's a little awkward with having to read out the comparator, but
it works and isn't too shabby. I'm not 100% sure I like the terminology,
seq>=, seq>, (r)seq<= and (r)seq< somehow seem more obvious to me.
> and range-bounded versions
> (e.g.seq-from-to, seq-from-through etc), of same. All of those
> functions can/should be defined in Clojure.
This part is embryonic in my current patch: I've done the most idiomatic
thing I could think of, which is to apply take-while to the result of
seq-from. This suffers from doing a comparison on every iteration, I'll
have to do a little more thinking here. Maybe another, specialised Seq
is called for which directly takes into account two search stacks.
I'm also *really* not happy with the terminology here. seq-from-to and
seq-from-through don't seem any different to me - I'd be looking up the
difference all the time. seq-after-to-before and such seem a little
long-winded. Unfortunately, mathematical symbols don't seem to help much
either: seq<=<= or seq-<=-<= aren't exactly what I'd call readable, and
the usual range indicators in maths - e.g. [a, b) - have other meanings
already. Ideas?
> seqFrom and rseqFrom will be a little tricky though, because the
> PersistentTreeMap.Seq uses a stack to keep track of where it is, and
> will therefore need to call the Seq ctor with a stack built during the
> search.
This actually turned out to be pretty easy based on my previous
implementations of find>= and find<=. I've reimplemented those functions
based on seq-from and its brethren.
What do you think of it so far?
~phil
So, straight after I hit send I actually thought of a (hopefully) better
way to do seq-from-to, by doing two searches and iterating until the
second search is found. I also discovered that with my initial
implementation, you could run into trouble with the end item, for
example if from > to. In any case, the improved version is attached.
It's not exactly pretty, so this is probably not going to be the last
word on the matter.
~phil
Rich Hickey wrote:
> I took a look and am uncomfortable about the derivation from Seq -
> (concrete derivation from a non-abstract base by a different
> author...), but I had forgotten about the counts.
Hmm, okay. I personally didn't see anything wrong with that solution,
other than maybe the wasted final cnt.
> I made a revision to
> Seq so that the count is optional - in your case leave it out. I think
> you should now be able to use Seq as-is with your existing search
> logic by passing the stack to the Seq ctor, let me know if that's not
> the case. This will greatly ease maintenance.
I've adapted it to use Seq. (count) is now always O(n).
> I think your original formulation (sans any buglets) for range-bounded
> versions using take-while is the correct path and what I expected.
OK, done. I've done a rseq version as well now. Not sure it's got the
semantics of least surprise though.
> Don't worry about the cost of the compare call.
I do still worry, although I suppose if I run into this problem in live
code I probably have other troubles.
> This is really close - I would have made the additional changes myself
> but I'm running today.
No worries, this way I get to see some of the inner bits of Clojure,
which is something I want to do anyway. I get the impression that the
Clojure implementation is still small enough that a newcomer can get a
decent handle on it 'on the side'. I hope it stays that way to some
extent, it's not something I can say for most languages.
> We should be able to get this in in the next iteration, I think.
We'll see. Patch attached.
~phil
I didn't think that sorted-map actually tracks non-unique entries, and
user=> (sorted-map 1 :a1, 2 :b1, 2 :b2, 2 :b3, 3 :c1, 4 :d1)
{1 :a1, 2 :b3, 3 :c1, 4 :d1}
user=> (count (sorted-map 1 :a1, 2 :b1, 2 :b2, 2 :b3, 3 :c1, 4 :d1))
4
seems to support that theory.
Do you have any examples of how to persuade sorted-map to track
non-uniquely keyed entries?
Thanks
~phil
OK, that looks pretty elegant. It'll presumably break fairly badly if
you decide to overload >=, >, <= and < in your local namespace, but I
guess then you deserve what you get. :)
~phil
Definitely an acceptable option. Could even change the argument order:
(find < sm :c)
and although that could be considered inconsistent with subseq, I
personally find it more intuitive.
> ;bundle direction in name (needs 4 variations):
>
> (find< sm :c)
This is what I'm using locally at the moment. The namespace pollution
isn't really an issue here as far as I'm concerned, and it generally
seems clean enough to me.
> ;new name:
>
> (seek sm < :c)
This seems unnecessarily confusing and I'm already occasionally running
into the problem of all the good, short names being taken, so I'm
against this. (e.g.: str, key, val)
~phil