Proposal: "smarter" set operations that take size into account for speed

已查看 55 次
跳至第一个未读帖子

Jason Wolfe

未读,
2009年1月20日 18:54:032009/1/20
收件人 Clojure
I ran into a situation in my application where I wanted to take a set-
difference between a small set and a large set. In this circumstance,
clojure.set/difference is needlessly slow. Because of this, in
general, a sequence of n clojure.set operations may run in O(n^2) time
when O(n) is easily achievable. Here's the basic idea:

(defn difference2 "Like difference, but fast for big s2 and small
s1" [s1 s2]
(reduce (fn [result item] (if (contains? s2 item) (disj result item)
result))
s1 s1))

(defn fast-difference "Like difference, but faster." [s1 s2]
(let [c1 (int (count s1))
c2 (int (count s2))]
(if (< c1 c2)
(difference2 s1 s2)
(clojure.set/difference s1 s2))))

user> (let [small-set (hash-set 1), big-set (set (range 10000))]
(doseq [fn [clojure.set/difference difference2 fast-difference]
[s1 s2] [[small-set big-set] [big-set small-set]]]
(time (dotimes [_ 1000] (fn s1 s2)))))

"Elapsed time: 3995.249 msecs" ; (set/difference small big)
"Elapsed time: 3.358 msecs" ; (set/difference big small)
"Elapsed time: 1.91 msecs" ; (difference2 small big)
"Elapsed time: 4935.183 msecs" ; (difference2 big small)
"Elapsed time: 3.088 msecs" ; (fast-difference small big)
"Elapsed time: 3.401 msecs" ; (fast-difference big small)

For the commutative operations (union and intersection), I suppose you
could argue that the user should know how the operations work and put
the arguments in the right (fast) order -- but sometimes, that
information will only be known at runtime. Overall, the possibility
of a big O improvement for a small constant penalty seems worth it to
me.

Questions or comments? Any objections/support for me adding this as
an issue?

Thanks,
Jason

Jason Wolfe

未读,
2009年1月20日 19:11:222009/1/20
收件人 Clojure
One caveat: I do like the fact that the set operations currently don't
require the second argument to be a set. In the case that it's not, I
believe the current implementations (at least for difference) are as
good as it gets.
So, I guess fast-difference should read:

(defn fast-difference "Like difference, but faster." [s1 s2]
(if (or (not (set? s2))
(> (int (count s1)) (int (count s2))))
(clojure.set/difference s1 s2)
(difference2 s1 s2)))

-Jason

Jason Wolfe

未读,
2009年1月20日 19:44:492009/1/20
收件人 Clojure
Apologies for multiple posts; I will try to thoroughly investigate
things before starting posting next time. Anyway, to round out the
thread, here are corresponding results for intersection and union:


(defn- fast-intersection- "Expects s1 and s2 sets, s1 bigger than
s2." [s1 s2]
(reduce (fn [result item]
(if (contains? s1 item)
result
(disj result item)))
s2 s2))

(defn fast-intersection "Like intersection, but faster." [s1 s2]
(if (or (not (set? s2))
(> (int (count s1)) (int (count s2))))
(fast-intersection- s1 s2)
(fast-intersection- s2 s1)))

user> (let [small-set (hash-set 1), big-set (set (range 10000))]
(doseq [fn [clojure.set/intersection fast-intersection]
[s1 s2] [[small-set big-set] [big-set small-set]]]
(time (dotimes [_ 1000] (fn s1 s2)))))

"Elapsed time: 3976.334 msecs" ; (clojure.set/intersection small big)
"Elapsed time: 18059.634 msecs" ; (clojure.set/intersection big small)

"Elapsed time: 9.68 msecs" ; (fast-intersection small big)
"Elapsed time: 7.598 msecs" ; (fast-intersection big small)

As you can see, it looks like clojure.set/intersection is fast for
*neither* ordering of sets if they are of disparate size.



(defn fast-union "Like union, but faster." [s1 s2]
(if (or (not (set? s2))
(> (int (count s1)) (int (count s2))))
(reduce conj s1 s2)
(reduce conj s2 s1)))

user> (let [small-set (hash-set 1), big-set (set (range 10000))]
(doseq [fn [clojure.set/union fast-union]
[s1 s2] [[small-set big-set] [big-set small-set]]]
(time (dotimes [_ 1000] (fn s1 s2)))))
"Elapsed time: 20444.035 msecs" ; (clojure.set/union small big)
"Elapsed time: 0.956 msecs" ; (clojure.set/union big small)

"Elapsed time: 2.906 msecs" ; (fast-union small big)
"Elapsed time: 2.826 msecs" ; (fast-union big small)

And, finally, clojure.set/union is fast for one order of arguments but
(especially) slow for the other.

-Jason



Jason Wolfe

未读,
2009年1月20日 20:49:492009/1/20
收件人 Clojure

On Jan 20, 4:44 pm, Jason Wolfe <jawo...@berkeley.edu> wrote:
> Apologies for multiple posts; I will try to thoroughly investigate
> things before starting posting next time.  Anyway, to round out the
> thread, here are corresponding results for intersection and union:

Well, that didn't last too long ... from now on ... :) Anyway, there
was a bug in fast-intersection when s2 was not a set. Fixed version:

(defn fast-intersection "Like intersection, but faster." [s1 s2]
(cond (not (set? s2))
(reduce (fn [result item]
(if (contains? s1 item)
(conj result item)
result))
#{} s2)
(> (int (count s1)) (int (count s2)))
(fast-intersection- s1 s2)
:else
(fast-intersection- s2 s1)))

-Jason

Mark Engelberg

未读,
2009年1月20日 23:27:122009/1/20
收件人 clo...@googlegroups.com
I say "thumbs up" to this proposal. Sets really should test for the
size of the two inputs, and adjust the algorithm accordingly.

pc

未读,
2009年1月21日 09:32:092009/1/21
收件人 Clojure
I think that having efficient algorithms for the fundamental data
structures is extremely important.
But Rich and the Contributers must have tons of other stuff to worry
about. In the meantime I will
use Jason's fix.

Jason Wolfe

未读,
2009年1月23日 15:23:442009/1/23
收件人 Clojure
OK, if these are not wanted in core right now, will anyone sign off
for adding them to clojure.contrib?

I can't say I like the idea of having two sets of functions that do
exactly the same thing, but ... sometimes you just don't want things
to run ~1000 times slower than they should.

-Jason

Mark Engelberg

未读,
2009年1月23日 15:52:372009/1/23
收件人 clo...@googlegroups.com
On Fri, Jan 23, 2009 at 12:23 PM, Jason Wolfe <jaw...@berkeley.edu> wrote:
>
> OK, if these are not wanted in core right now, will anyone sign off
> for adding them to clojure.contrib?
>

Well, *I* want these changes you've proposed in the core, but of
course, I'm not in charge. I guess the real question is, what is the
process to ensure that Rich sees and considers a potential core
improvement like this? I think the main mechanism for this is to post
it as an "issue" on google code, but I'm not certain whether you're
supposed to post an issue unless he's seen the newsgroup thread and
says, "Yes, that sounds good, please post it as an issue." But if he
misses the thread for some reason, that will never happen. So it's a
bit of a catch-22. Anyway, maybe someone can clarify the procedure.

Cosmin Stejerean

未读,
2009年1月23日 16:03:082009/1/23
收件人 clo...@googlegroups.com
In a previous thread Rich suggested that additions to clojure-contrib be discussed here and lacking any objections they should be posted as issues with attached patches on the clojure-contrib project. From what I've seen in the past clojure-contrib is the place for functions like the fast set operations discussed here. This gives people a chance to use them and identify any problems, etc before being considered for a move into clojure core.

--
Cosmin Stejerean
http://offbytwo.com

Jason Wolfe

未读,
2009年1月23日 16:19:562009/1/23
收件人 clo...@googlegroups.com
>
> On Fri, Jan 23, 2009 at 12:23 PM, Jason Wolfe <jaw...@berkeley.edu>
> wrote:
>>
>> OK, if these are not wanted in core right now, will anyone sign off
>> for adding them to clojure.contrib?
>>
>
> Well, *I* want these changes you've proposed in the core, but of
> course, I'm not in charge.

Thanks.

> I guess the real question is, what is the
> process to ensure that Rich sees and considers a potential core
> improvement like this? I think the main mechanism for this is to post
> it as an "issue" on google code, but I'm not certain whether you're
> supposed to post an issue unless he's seen the newsgroup thread and
> says, "Yes, that sounds good, please post it as an issue." But if he
> misses the thread for some reason, that will never happen. So it's a
> bit of a catch-22. Anyway, maybe someone can clarify the procedure.

Yes, it is not supposed to be posted as a core issue unless Rich OK's
it here.

I just had a discussion about just this "meta"-issue on IRC. Chouser
says that Rich still reads every message on the group. See also the
further-up discussion in [1] for more procedural details, where it is
also suggested that an explicit sign-off here should be required for
posting clojure.contrib issues.

[1] http://groups.google.com/group/clojure/msg/657291bc62c48f32?hl=en

Anyway, I'm feeling quite frustrated and won't try to push this (or
any other) issue further. I know Rich and the team are very busy, but
it really saps my will to contribute when I have to keep pushing to
get an authoritative answer (be it yes or no) on even (what seems to
me to be) a fairly uncontroversial change like this one, or [2].

[2] http://groups.google.com/group/clojure/browse_thread/thread/5d11bc0da25a7ecc?hl=en#

Sorry for taking your question as a jumping off point for whining
about not getting attention. I guess my short answer is: the policy
is fairly clear, but its current implementation may be discouraging
potential contributors like myself.

-Jason


Rich Hickey

未读,
2009年1月23日 16:48:572009/1/23
收件人 Clojure


On Jan 23, 4:19 pm, Jason Wolfe <jawo...@berkeley.edu> wrote:
> > On Fri, Jan 23, 2009 at 12:23 PM, Jason Wolfe <jawo...@berkeley.edu>
> [2]http://groups.google.com/group/clojure/browse_thread/thread/5d11bc0da...
>
> Sorry for taking your question as a jumping off point for whining
> about not getting attention. I guess my short answer is: the policy
> is fairly clear, but its current implementation may be discouraging
> potential contributors like myself.
>

I appreciate your desire to contribute, but Clojure is not just about
your needs. You have flooded the group with every idea you have, some
are bugs (important), some are good ideas, some not, but there are
simply too many to address at the rate you are producing them. In
addtion, sometimes you've made issues, and often blogged about them.
So, for #2 the issue was addressed and you found out about it that
way. I can't answer you (or anyone) in every forum.

I'd advise you to be more patient, build up a small library of helper
functions you use frequently, make contributions of the most important
of them to contrib. Clojure doesn't change that fast and it's not
going to. I like to consider things and I can't address every
suggestion as it is made.

Separate out the important things (like potential bugs) so they get
more attention.

As for these:

- 0-arg distinct? returns true, not an exception (so (apply distinct?
nil) = true)

Not now, will consider.

- rewrite concat so that (apply concat seq) doesn't evaluate the
first three elements of seq

No, may fall out of streams work.

- make every?, not-every?, some, not-any? take multiple seq args like
map, i.e., (every? not= [1 2 3] [2 3 4])

No.

- allow arguments to merge-with after the first to be lists of
pairs.

No.

Rich

Stephen C. Gilardi

未读,
2009年1月23日 17:26:172009/1/23
收件人 clo...@googlegroups.com
Hi Jason,

Thanks very much for all your recent thought, work, and postings.

I think it makes sense for clojure-contrib to be much more agile in accepting new, experimental, and incrementally improved code than Clojure proper. Things in contrib are always optional for end users so contrib can be quite welcoming to any high quality code without giving anyone much pain. One thing we may need to look out for over time is the namespace getting crowded, but we have a whole hierarchy under clojure.contrib to work with. We can manage that issue. I'll be thinking along those lines when responding to issue requests for contrib going forward.

For changes to clojure that can possibly be implemented as a lib, clojure-contrib is the right place to get them some visibility. They may stay a contrib forever, or they may at some point move to Clojure itself. In either case, they will be easily available to Clojure programmers.

Please add the improved set operations you've proposed as an "enhancement request" issue for clojure-contrib. I suggest you come up with a clojure-contrib.set lib that includes all changes you'd like to see. Users can easily use clojure.contrib.set rather than clojure.set to try them out. Features of "ns" even allow an end user to pick and choose set operations at the individual function level if that fine grained control is desired.

(ns my-ns
   (:use [clojure.set :exclude (difference)]
            [clojure.contrib.set :only (difference)]))

One could even make a custom lib that picks and chooses and re-exports functionality from existing libs using clojure.contrib.def/defalias.

If anyone has a pending issue request for clojure-contrib they'd like approved, please make a fresh posting to the thread where the proposal was discussed and perhaps change its subject to include a banner like "[Issue Request]" I'll try to help get our pending issue request count for clojure-contrib down to zero (and keep it hovering there).

Thanks,

--Steve

Jason Wolfe

未读,
2009年1月23日 17:42:212009/1/23
收件人 clo...@googlegroups.com

> I appreciate your desire to contribute, but Clojure is not just about
> your needs. You have flooded the group with every idea you have, some
> are bugs (important), some are good ideas, some not, but there are
> simply too many to address at the rate you are producing them.

OK, I am sorry. I know I have made at least one misguided post, and a
few quibbles about relatively minor things, but by and large I thought
I was being helpful. I have tried to meter out my posts, and focus on
issues of broader use to the community, but I guess I am failing on
both counts and will have to try harder.

> I'd advise you to be more patient, build up a small library of helper
> functions you use frequently, make contributions of the most important
> of them to contrib. Clojure doesn't change that fast and it's not
> going to. I like to consider things and I can't address every
> suggestion as it is made.

Fair enough.

> Separate out the important things (like potential bugs) so they get
> more attention.
>
> As for these:
>
> - 0-arg distinct? returns true, not an exception (so (apply distinct?
> nil) = true)
>
> Not now, will consider.
>
> - rewrite concat so that (apply concat seq) doesn't evaluate the
> first three elements of seq
>
> No, may fall out of streams work.
>
> - make every?, not-every?, some, not-any? take multiple seq args like
> map, i.e., (every? not= [1 2 3] [2 3 4])
>
> No.
>
> - allow arguments to merge-with after the first to be lists of
> pairs.
>
> No.

OK, thanks.

-Jason

Jason Wolfe

未读,
2009年1月23日 18:23:532009/1/23
收件人 Clojure
Thank you very much, Steve.

I just posted an issue here:

http://code.google.com/p/clojure-contrib/issues/detail?id=14

-Jason

P.S. I couldn't see how to make an "enhancement request" rather than a
"defect". The "Template" drop-down that shows up for the main Clojure
SVN doesn't show up here ... maybe it never got configured properly?

e

未读,
2009年1月23日 19:56:202009/1/23
收件人 clo...@googlegroups.com

(ns my-ns
   (:use [clojure.set :exclude (difference)]
            [clojure.contrib.set :only (difference)]))

Did you have to say ":exclude" or does the last thing override?



Stephen C. Gilardi

未读,
2009年1月23日 20:41:232009/1/23
收件人 clo...@googlegroups.com
I had to say :exclude. :use makes a call to refer which will refuse to replace an existing entry in ns-map. You can see an example of this in a fresh repl:

user=> (use 'clojure.zip)
java.lang.IllegalStateException: replace already refers to: #'clojure.core/replace in namespace: user (NO_SOURCE_FILE:0)
user=> 

--Steve

Rich Hickey

未读,
2009年1月24日 08:45:222009/1/24
收件人 Clojure
The key thing here is patience. You started this thread on the 20th,
it's now the morning of the 24th and the first time I've had to really
look at the set issue.

I'm all for optimizing for size here, however, the fact that these
functions happen to work when the second argument is not a set is an
implementation artifact and not a promise of the interface, so I'm not
in favor of the set? testing or any other accommodation of that.

An issue w/patch for this would be welcome.

Thanks,

Rich

Jason Wolfe

未读,
2009年1月24日 20:11:452009/1/24
收件人 clo...@googlegroups.com
> I'm all for optimizing for size here, however, the fact that these
> functions happen to work when the second argument is not a set is an
> implementation artifact and not a promise of the interface, so I'm not
> in favor of the set? testing or any other accommodation of that.

OK, from that I take it the Clojure way is to not call "set" on the
arguments either, and let whatever happens happen if you pass the
functions non-sets? Note that here this has the potential to be
especially confusing since, e.g., (union #{1 2 3} [4 5]) might succeed
while (union #{1 2} [3 4 5]) would error.

Second, should this be combined with this issue to make efficient
versions that take any number of arguments?
http://code.google.com/p/clojure/issues/detail?id=52&colspec=ID%20Type%20Status%20Priority%20Reporter%20Owner%20Summary

> An issue w/patch for this would be welcome.

Finally, I don't know how to make a patch, and found nothing in a
quick search of the wiki/newsgroup/website. I heard "Git" floating
around somewhere earlier; am I to check out the SVN with git-svn and
make a patch that way?

-Jason


Stephen C. Gilardi

未读,
2009年1月24日 21:01:462009/1/24
收件人 clo...@googlegroups.com

On Jan 24, 2009, at 8:11 PM, Jason Wolfe wrote:

> Finally, I don't know how to make a patch, and found nothing in a
> quick search of the wiki/newsgroup/website. I heard "Git" floating
> around somewhere earlier; am I to check out the SVN with git-svn and
> make a patch that way?

To prepare a patch, use svn to checkout Clojure to a directory. Edit
the files within it to incorporate the changes you'd like included in
the patch, then cd to the top level of the checkout directory and
issue the command:

svn diff > my-patch.patch

You can also use the command:

svn status

from the top level of the checkout directory to see which files you
changed.

For completeness, if you'd like to apply such a patch to a fresh
Clojure checkout, copy the patch file to "/tmp", cd to the top level
of the checkout directory and issue the command:

patch -p0 < /tmp/my-patch.patch

Note: I used the /tmp temporary placeholder just to make the path in
the example literal.

--Steve

Jason Wolfe

未读,
2009年1月25日 02:51:362009/1/25
收件人 clo...@googlegroups.com

Thanks for the detailed instructions!

-Jason

回复全部
回复作者
转发
0 个新帖子