Collections annoyances / missed opportunities for structural sharing


Rüdiger Klaehn

Aug 6, 2012, 2:21:18 PM
to scala-language, scala-debate
Hi all,

All in all I am quite happy with the scala collections library. But
there are quite a few annoyances and inefficiencies I have been
confronted with over the last year. This is an attempt to summarize
them and put them up for discussion. I am working mostly with the
default implementations of Set and Map, so that is where I have the
most remarks.

In general, while the collections "just work" and do exactly what you
expect, they occasionally behave inefficiently and unintuitively
regarding performance and structural sharing. Often, methods that you
would expect to work very efficiently for a certain type of collection
instead just use the default implementation from e.g. TraversableLike.


Set:

- filter is easy to implement for HashSet, and it is very important
since it is used by operations that people expect to be fast on a set,
like intersection. Nevertheless, filter just uses the default
implementation from TraversableLike, which is very inefficient. I
created an SI for that issue and implemented a filter method on
HashSet. For maximum benefit you would have to implement filter on
Set1..Set4 as well: https://issues.scala-lang.org/browse/SI-6196
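The structural-sharing idea behind such a filter can also be sketched from the outside (an illustrative workaround, not the actual SI-6196 patch): return the original instance unchanged when the predicate keeps every element.

```scala
// Illustrative sketch (not the SI-6196 patch itself): a filter wrapper
// that preserves structural sharing in the common "nothing removed"
// case, at the cost of one extra traversal to check the predicate.
def sharingFilter[A](s: Set[A])(p: A => Boolean): Set[A] =
  if (s.forall(p)) s else s.filter(p)

val x = Set(1, 2, 3, 4)
println(sharingFilter(x)(_ > 0) eq x) // true: same instance, nothing rebuilt
```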

- merge exists for HashMap, but not for HashSet. It should be
implemented, and used from union and ++ whenever possible. That would
increase the efficiency of that operation tremendously. A set should
have quick union and intersection methods.

- while implementing filter on HashSet, I also found some minor issues
that result in some inefficiency:
https://issues.scala-lang.org/browse/SI-6197 and
https://issues.scala-lang.org/browse/SI-6198

- intersect should check if both sides are of the same type, and then
do smaller.filter(larger). People should not have to do this manually
to get best performance. See for example this discussion:
https://groups.google.com/forum/?fromgroups#!searchin/scala-user/smaller$20filter$20larger/scala-user/3QJYuUU3frU/YfMTW9mRXsYJ
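The manual workaround described in that discussion can be sketched as follows (an illustrative helper, not library code): filter the smaller set against the larger one, so the traversal cost is proportional to the smaller side.

```scala
// Sketch of the smaller.filter(larger) workaround: traverse only the
// smaller set, keeping the elements also present in the larger one.
// A Set[A] extends A => Boolean, so it can be passed as the predicate.
def fastIntersect[A](a: Set[A], b: Set[A]): Set[A] =
  if (a.size <= b.size) a.filter(b) else b.filter(a)

println(fastIntersect(Set(1, 2, 3), (1 to 1000).toSet)) // Set(1, 2, 3)
```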

- (set should be covariant, but I guess there is nothing that can be
done about this at this point. Maybe have another version of set?)


Map:

- since map is very similar to set, it should also have a filter
method that works efficiently instead of just using the one from
TraversableLike. I added one here:
https://issues.scala-lang.org/browse/SI-6200 . Basically the same as
https://issues.scala-lang.org/browse/SI-6196 .

- HashMap has a merge method. There was a bug in it which has been
fixed. https://issues.scala-lang.org/browse/SI-5879 . But almost
nobody will ever use this very efficient method because when people
want to merge two maps they use ++. ++ should check if both arguments
are of the right type and use HashMap.merge if possible.
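For reference, the efficient merge is exposed on immutable.HashMap (spelled `merged` in current Scala, taking an explicit collision function). Calling it directly so that the right-hand value wins on key collisions reproduces the semantics of ++:

```scala
import scala.collection.immutable.HashMap

// HashMap's structure-aware merge. The collision function receives both
// key/value pairs; returning the right-hand pair matches what ++ does.
val a = HashMap(1 -> "a", 2 -> "b")
val b = HashMap(2 -> "B", 3 -> "c")
val viaMerged = a.merged(b)((kv1, kv2) => kv2)
println(viaMerged == (a ++ b)) // true
```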

- mapValues and filterKeys: either make it clear that they are lazy by
using an appropriate return value (MapView?), or make them eager. I
would prefer the latter. There is an old SI for this issue (
https://issues.scala-lang.org/browse/SI-4776 ), but nothing has been
done about it.

- updated should not create a new instance if not necessary.
https://issues.scala-lang.org/browse/SI-5139


Vector:

- vector is a brilliant data structure that would be very useful for
number crunching like time series analysis if it weren't for the boxing
overhead of Vector[Double] etc. Would it be possible to use
@specialized for Vector? And for other collections like Set and Map
for that matter?


In general, care should be taken to use structural sharing whenever
possible. If that means that you have to override some of the generic
methods of TraversableLike or SeqLike, then so be it. For example
this:

scala> val x=Set(1,2,3,4)
x: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4)

scala> x eq (Set.empty ++ x)
res9: Boolean = false // Set.empty ++ x builds a completely new set,
which can be very slow for large sets!

scala> x eq (x ++ Set.empty)
res10: Boolean = true

This would not happen if ++ would use merge when both sides are a Set.
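A hypothetical override sketching what such a ++ could do (illustrative only, not the library implementation): check for an empty operand on either side before falling back to a real merge.

```scala
// Hypothetical sketch: a union that returns an existing instance when
// one side is empty, so structural sharing works in both argument
// orders, instead of rebuilding the whole set.
def unionSharing[A](left: Set[A], right: Set[A]): Set[A] =
  if (right.isEmpty) left
  else if (left.isEmpty) right
  else left ++ right // the real fix would use a HashSet-aware merge here

val x = Set(1, 2, 3, 4)
println(unionSharing(Set.empty[Int], x) eq x) // true
```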


I think that having an efficient collections library is tremendously
important. At least for me, having maximum efficiency in the
collections is more important than new language features such as
macros.

best regards,

Rüdiger

Simon Ochsenreither

Aug 6, 2012, 2:45:21 PM
to scala-l...@googlegroups.com, scala-debate, rkl...@googlemail.com
Hi Rüdiger,

looks like you did a lot of research. Great summary! I'm currently trying to fix the slow Vector#isEmpty¹² and I agree that there are quite a few optimization possibilities.

Let me know if I can help! (Although I'm currently trying to understand why exactly the first element of the view gets evaluated for “tail” in #1037 ...)

Thanks,

Simon

¹ https://github.com/scala/scala/pull/1036
² https://github.com/scala/scala/pull/1037

Josh Suereth

Aug 6, 2012, 3:08:58 PM
to scala-l...@googlegroups.com, scala-debate
First - Thanks for this comprehensive list!  I think these are all great improvements to make.  Perhaps a ticket to track each would be good.  I see you already have a few created.

Second - Let's not issue a statement like "These would be better to do than macros".  As I'm sure you may be aware, it's not an either/or situation.  Macros are being contributed by Eugene Burmako, who wouldn't be contributing to Scala if he weren't making macros.  In other words, his research is on macros, so that's what he's doing.  Those of us responsible for collections need to sit down and take time to improve this situation.  We're the ones working on other things, like improved testing, etc.  If you haven't noticed, we lack a cohesive test suite to detect performance issues like those stated below.  We can't even specify intent, like "filter on Set should be closer to HashSet's efficiency than Traversable".  These kinds of slip-ups happen by virtue of not having a *good* set of warning bells to prevent them.  Some of us are working on those warnings, so we don't have to continually be fixing these problems as we maintain the collections.

Third - All of these points seem like simple no-brainer fixes.  All of them are optimisations/fixes.  If we get pull requests before 2.10.0 I don't see a big issue accepting a lot of these changes into the codebase.  Most of us are busy fixing critical bugs.  If you have time to create patches, I'll certainly review them and we can include non-controversial performance improvements for 2.10.x.


- Josh

Rüdiger Klaehn

Aug 6, 2012, 3:22:23 PM
to Josh Suereth, scala-l...@googlegroups.com, scala-debate
On Mon, Aug 6, 2012 at 9:08 PM, Josh Suereth <joshua....@gmail.com> wrote:
> First - Thanks for this comprehensive list! I things these are all great
> improvements to make. Perhaps a ticket to track each would be good. I see
> you already have a few created.
>
You're welcome.

> Second - Let's not issue a statement like "These would be better to do than
> macros". As I'm sure you may be aware, it's not an either/or situation.
> macros are being contributed by Eugene Burmako, who wouldn't be contributing
> to Scala if he weren't making macros. In other words, his research is on
> Macros, so that's what he's doing. It's not an either or situation.
>
I agree. This was just strictly from my perspective.

> Those of us responsible for collections need to sit down and take time to
> improve this situation. We're the ones working on other things, like
> improved testing, etc. If you haven't noticed, we lack of cohesive test
> suite to detect performance issues like those stated below. We can't even
> specify intent, like "filter on Set should be closer to HashSet's efficiency
> than Traversable". These kinds of slip-ups happen by virtue of not having
> a *good* set of warning bells to prevent them. Some of us are working on
> those warnings, so we don't have to continually be fixing this problems as
> we maintain the collections.
>
Specifying performance characteristics in some kind of test would be
great. Sounds difficult though.

> Third - All of these points seem like simple no-brainer fixes. All of
> them are optimisatoins/fixes. If we get pull/requests before 2.10.0 I
> don't see a big issue accepting a lot of these changes into the codebase.
> Most of us are busy fixing critical bugs. If you have time to create
> patches, I'll certainly review them and we can include non-controversial
> performance improvements for 2.10.x.
>
I added patches to most of the tickets I opened instead of making pull
requests.

The last time I tried to contribute a bit to the scala collections
library I struggled with the build process of the scala collections
library until I ran out of time. Simple patches against the currently
installed scala version are much easier to create when you work with
an IDE under Windows. But if you prefer pull requests, I will make
some.

Josh Suereth

Aug 6, 2012, 3:27:05 PM
to Rüdiger Klaehn, scala-l...@googlegroups.com, scala-debate
I can't guarantee I'll have time to pull patches out of JIRA.  That was a blocker for us accepting patches in the past.  Pull requests are nicer in that we can achieve faster two-way communication directly on the code, and I can pull your code into my repo with an automatic script to try it out.  We've migrated most of our mindshare to these.

- Josh

Rüdiger Klaehn

Aug 6, 2012, 4:06:12 PM
to Josh Suereth, scala-l...@googlegroups.com, scala-debate
OK. If it increases the chances of this being used, I will create pull
requests for each issue. Here is one for SI-6196:
https://github.com/scala/scala/pull/1068

Paul Phillips

Aug 6, 2012, 6:24:09 PM
to scala-l...@googlegroups.com, scala-debate
I'll be the bad cop and tell you, please don't cross-post to scala-language and scala-debate.  Or to scala-X and scala-Y for that matter.

General question, how many of the issues you point out are a direct result of the fact that about half the time you inherit an implementation which is more like a tumor than a healthy organ? To me there's something wrong if you can lose orders of magnitude performance by neglecting to override an implementation.  All my sketches for future collections make this go away; no shared implementations between immutable/mutable or between linear/indexed or other incompatible data structures, and there is really no need for it.  It means a lot of bugs, especially a lot of performance bugs, can not even occur.

On Mon, Aug 6, 2012 at 11:21 AM, Rüdiger Klaehn <rkl...@googlemail.com> wrote:
- (set should be covariant, but I guess there is nothing that can be
done about this at this point. Maybe have another version of set?)

"Should" is one of those things.  I like having a typesafe contains.  Are you ready to abstract over variance? I was going to pretend to write some syntax Set[VT] where V is a member of { +, "" }  and I couldn't stop from laughing.
 
I think that having an efficient collections library is tremendously
important. At least for me, having maximum efficiency in the
collections is more important than new language features such as
macros.

Macros are part of how we'll get efficient.

Rüdiger Klaehn

Aug 6, 2012, 6:43:12 PM
to Paul Phillips, scala-l...@googlegroups.com, scala-debate
On Tue, Aug 7, 2012 at 12:24 AM, Paul Phillips <pa...@improving.org> wrote:
> I'll be the bad cop and tell you, please don't cross-post to scala-language
> and scala-debate. Or to scala-X and scala-Y for that matter.
>
OK, copy. Which of the two would have been more appropriate for
the original post?

> General question, how many of the issues you point out are a direct result
> of the fact that about half the time you inherit an implementation which is
> more like a tumor than a healthy organ? To me there's something wrong if you
> can lose orders of magnitude performance by neglecting to override an
> implementation.
The question is if having a slow method available on a collection
where you expect it to be fast is better or worse than not having it
available at all.

Is it really possible to do fundamental changes to the design of the
scala collections at this point without breaking a lot of scala code
out there?

> All my sketches for future collections make this go away;
> no shared implementations between immutable/mutable or between
> linear/indexed or other incompatible data structures, and there is really no
> need for it. It means a lot of bugs, especially a lot of performance bugs,
> can not even occur.
>
That would be great, as long as it does not break too much existing
source code. What is the timeframe for that new collections library? I
would really like to get some improvements quickly.

> On Mon, Aug 6, 2012 at 11:21 AM, Rüdiger Klaehn <rkl...@googlemail.com>
> wrote:
>>
>> - (set should be covariant, but I guess there is nothing that can be
>> done about this at this point. Maybe have another version of set?)
>
>
> "Should" is one of those things. I like having a typesafe contains. Are
> you ready to abstract over variance? I was going to pretend to write some
> syntax Set[VT] where V is a member of { +, "" } and I couldn't stop from
> laughing.
>
It's not a big deal anyway.

>>
>> I think that having an efficient collections library is tremendously
>> important. At least for me, having maximum efficiency in the
>> collections is more important than new language features such as
>> macros.
>
>
> Macros are part of how we'll get efficient.
>
So macros are going to be involved in the redesign of the collections
library? Just when I started to understand the current design...

Paul Phillips

Aug 6, 2012, 6:48:29 PM
to Rüdiger Klaehn, scala-l...@googlegroups.com, scala-debate
On Mon, Aug 6, 2012 at 3:43 PM, Rüdiger Klaehn <rkl...@googlemail.com> wrote:
OK, copy. Which of the two would have been more appropriate for
the original post?

Depends on what sort of thread you want.  Either would be fine.
 
The question is if having a slow method available on a collection
where you expect it to be fast is better or worse than not having it
available at all.

That's not the choice.  The only question is whether pessimal-for-you implementations should loom in your superclass, waiting for the unwary deriver who fails to override.

Is it really possible to do fundamental changes to the design of the
scala collections at this point without breaking a lot of scala code
out there?

If it seems impossible, that only makes it sound more interesting.

That would be great, as long as it does not break too much existing
source code. What is the timeframe for that new collections library? I
would really like to get some improvements quickly.

I'm just a guy with some sketches, I'm not sitting on any next-generation libraries.
 
> Macros are part of how we'll get efficient.
>
So macros are going to be involved in the redesign of the collections
library? Just when I started to understand the current design...

That was more of a long-term prediction.

Rüdiger Klaehn

Aug 6, 2012, 6:57:03 PM
to Paul Phillips, scala-l...@googlegroups.com, scala-debate
On Tue, Aug 7, 2012 at 12:48 AM, Paul Phillips <pa...@improving.org> wrote:
>>
>> The question is if having a slow method available on a collection
>> where you expect it to be fast is better or worse than not having it
>> available at all.
>
>
> That's not the choice. The only question is whether pessimal-for-you
> implementations should loom in your superclass, waiting for the unwary
> deriver who fails to override.
>
Well, then the answer is a clear no.

But you will need the efficient implementations of the various
operations (e.g. HashSet.filter) in any case, so I guess there is no
harm in including them in the current collections via overrides.

Jason Zaugg

Aug 7, 2012, 2:18:04 AM
to Rüdiger Klaehn, Paul Phillips, scala-l...@googlegroups.com, scala-debate
BTW, the new reflection API lets you review the family tree of method
inheritance.

https://gist.github.com/3282240

I'll bet you can find other cases where performance and/or correctness
is compromised by analysing a broader set of methods and collections.

-jason

Rüdiger Klaehn

Aug 7, 2012, 4:31:55 AM
to Paul Phillips, scala-l...@googlegroups.com, scala-debate
On Tue, Aug 7, 2012 at 12:24 AM, Paul Phillips <pa...@improving.org> wrote:
> I'll be the bad cop and tell you, please don't cross-post to scala-language
> and scala-debate. Or to scala-X and scala-Y for that matter.
>
> General question, how many of the issues you point out are a direct result
> of the fact that about half the time you inherit an implementation which is
> more like a tumor than a healthy organ? To me there's something wrong if you
> can lose orders of magnitude performance by neglecting to override an
> implementation. All my sketches for future collections make this go away;
> no shared implementations between immutable/mutable or between
> linear/indexed or other incompatible data structures, and there is really no
> need for it. It means a lot of bugs, especially a lot of performance bugs,
> can not even occur.
>
I got bitten by this again in a seemingly trivial piece of code. In
this commit https://github.com/rklaehn/scala/commit/10656bcf1c264cba4f4faa5280eea8f7ccd7467e
I try to check if ks1 has one element. Using size is obviously not a
good idea for a list data structure, since it is O(N). N should rarely
be large since that would mean a lot of entries with the same hash
code. But this rare case has to be handled gracefully as well.

But using head and tail is also O(N), since both are not overridden by
ListSet. This is completely against the reasonable expectations you
have for a list-like data structure. You expect head and tail to be
O(1). As far as I can see there is no way to find out if a ListSet
contains just one element except modifying it or using an O(N) method.

This really is a common issue in the collections. You cannot really
rely on your performance expectations unless you check every single
method invocation manually.
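One generic workaround (an illustrative helper, not library code) is to consume at most two elements from the iterator instead of calling size, head or tail, though this only beats O(N) when the collection's iterator is itself lazy:

```scala
// Generic "exactly one element?" check that consumes at most two
// elements from the iterator, instead of calling size, head or tail.
// (Workaround sketch; only O(1) when the iterator is lazily produced.)
def hasExactlyOne[A](xs: Iterable[A]): Boolean = {
  val it = xs.iterator
  it.hasNext && { it.next(); !it.hasNext }
}

println(hasExactlyOne(List(42)))   // true
println(hasExactlyOne(List(1, 2))) // false
```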

Chris Marshall

Aug 7, 2012, 7:06:12 AM
to scala-l...@googlegroups.com
On Mon, Aug 6, 2012 at 7:21 PM, Rüdiger Klaehn <rkl...@googlemail.com> wrote:
- (set should be covariant, but I guess there is nothing that can be
done about this at this point. Maybe have another version of set?)

This is simply not true: Set being covariant would imply that Set is a functor; but it is not. If Set were "properly" defined it would look something like this:

trait Set[A: Equiv]

Then the functor for set would look like:

implicit val SetFunctor = new Functor[Set] { def fmap[A, B](s: Set[A])(f: A => B): Set[B] = ??? }

Oh noes! There's no way to inject the context bound Equiv on B because a functor quantifies over any B.

Chris

Alex Cruise

Aug 7, 2012, 6:12:44 PM
to scala-l...@googlegroups.com, Paul Phillips, scala-debate
On Tue, Aug 7, 2012 at 1:31 AM, Rüdiger Klaehn <rkl...@googlemail.com> wrote:
> I try to check if ks1 has one element. Using size is obviously not a
> good idea for a list data structure, since it is O(N).
> ...

> This really is a common issue in the collections. You cannot really
> rely on your performance expectations unless you check every single
> method invocation manually.

Not to get involved in the actual substantive conversation, I just want to point out that this is what lengthCompare is there for.  Here's one way I use it. :)

  /**
   * Characterizes the length of the provided Seq as empty (0),
   * singular (1) or multiple (2) without having to actually
   * get the whole length.
   */
  def characterizeLength(in: Seq[Any]) = {
    if (in.isEmpty)
      0
    else if (in.lengthCompare(1) == 0)
      1
    else
      2
  }


-0xe1a

Rüdiger Klaehn

Aug 7, 2012, 6:48:23 PM
to Alex Cruise, scala-l...@googlegroups.com, Paul Phillips, scala-debate
That's one thing I tried. But lengthCompare is not available on s.c.i.ListSet.

Luke Vilnis

Aug 7, 2012, 10:24:50 PM
to scala-l...@googlegroups.com
Sorry for off-topic: I think functor implies covariant, but covariant doesn't imply functor. Covariance comes from functor because upcasting A => B is just a function, but I don't see how being able to "fmap" the subtyping relation implies being able to fmap every function? Or am I way off-base here? Now I think the fact that Set[A] extends A => Boolean means that without breaking changes it can't be made covariant, but that's unrelated.

Chris Marshall

Aug 8, 2012, 3:18:11 AM
to scala-l...@googlegroups.com
I'll try a different tack then: covariant means that Set[A] <: Set[B] when A <: B. But this means that Set[A] <: Set[Any]. But Set[Any] makes no sense ~ how can one have an Equiv for Any?

C

Rüdiger Klaehn

Aug 8, 2012, 8:46:58 AM
to scala-l...@googlegroups.com
I don't know what you mean by Equiv. A quick googling revealed that it
is some concept from category theory that describes if two things are
equivalent ( http://en.wikipedia.org/wiki/Equivalence_of_categories )

> <: B. But this means that Set[A] <: Set[Any]. But Set[Any] makes no sense ~
> how can one have an Equiv for Any?
>
For a language like scala that has a common base type like Any, a Set
of Any makes perfect sense. And Any defines equals, so it is possible
to determine if two values of type Any are considered equal. You have
to rely on the equals method being properly implemented, but that is
true in any case.

Making set covariant would have required extending Any => Boolean
instead of T => Boolean. So you would not get an error if you have a
Set[Int] and try contains("abc"). But why should you get an error?
Just returning false is a perfectly valid response, since obviously a
set of Int can never contain any string.

I know that there are people that consider any program containing Any
defective. But the way scala is designed even methods like != and ==
use Any. So if you want to avoid Any at any cost it is very hard in
scala.

Daniel Kristensen

Aug 8, 2012, 9:09:32 AM
to scala-l...@googlegroups.com
Also, even though this is not a type error, it seems it would be possible for the compiler to warn about calling contains("abc") on a Set[Int] even if Set were covariant, since this call is supposed to always return false.


On Wed, Aug 8, 2012 at 2:46 PM, Rüdiger Klaehn <rkl...@gmail.com> wrote:
...

Making set covariant would have required extending Any => Boolean
instead of T => Boolean. So you would not get an error if you have a
Set[Int] and try contains("abc"). But why should you get an error?
Just returning false is a perfectly valid response, since obviously a
set of Int can never contain any string.
...

Paul Phillips

Aug 8, 2012, 1:10:00 PM
to scala-l...@googlegroups.com
On Wed, Aug 8, 2012 at 5:46 AM, Rüdiger Klaehn <rkl...@gmail.com> wrote:
Making set covariant would have required extending Any => Boolean
instead of T => Boolean. So you would not get an error if you have a
Set[Int] and try contains("abc"). But why should you get an error?

Maybe because it is 100% of the time not what you meant to write? Or do you spend a lot of time looking for Strings in sets of Ints? To be told there is a programmer error at compile time in preference to silently being given the right answer to the wrong question at runtime is the reason there are types.

Every method could have signature (Any*)Any and throw a ClassCastException if the arguments are wrong, why don't we do that? These questions all have the same answer.

I know that there are people that consider any program containing Any
defective. But the way scala is designed even methods like != and ==
use Any. So if you want to avoid Any at any cost it is very hard in
scala.

It's hard to avoid exhaust fumes too, but it doesn't follow that I should run hoses from all the neighborhood mufflers into my house.

Nils Kilden-Pedersen

Aug 8, 2012, 1:46:39 PM
to scala-l...@googlegroups.com
On Wed, Aug 8, 2012 at 12:10 PM, Paul Phillips <pa...@improving.org> wrote:
To be told there is a programmer error at compile time in preference to silently being given the right answer to the wrong question at runtime is the reason there are types.

Well said, Sir.

Daniel Sobral

Aug 8, 2012, 1:59:13 PM
to scala-l...@googlegroups.com
It needs to be 35 characters shorter, though.


--
Daniel C. Sobral

I travel to the future all the time.

Rüdiger Klaehn

Aug 8, 2012, 2:49:57 PM
to scala-l...@googlegroups.com
On Wed, Aug 8, 2012 at 7:10 PM, Paul Phillips <pa...@improving.org> wrote:
>
> On Wed, Aug 8, 2012 at 5:46 AM, Rüdiger Klaehn <rkl...@gmail.com> wrote:
>>
>> Making set covariant would have required extending Any => Boolean
>> instead of T => Boolean. So you would not get an error if you have a
>> Set[Int] and try contains("abc"). But why should you get an error?
>
>
> Maybe because it is 100% of the time not what you meant to write? Or do you
> spend a lot of time looking for Strings in sets of Ints? To be told there is
> a programmer error at compile time in preference to silently being given the
> right answer to the wrong question at runtime is the reason there are types.
>
You are right. Extending Any => Boolean would probably make it worse
in most use cases.

But what about Seq[T]? That has a contains method taking Any so that
Seq[T] can be covariant. Do you consider that a mistake as well?

> Every method could have signature (Any*)Any and throw a ClassCastException
> if the arguments are wrong, why don't we do that? These questions all have
> the same answer.

You mean like ask in akka?

Paul Phillips

Aug 8, 2012, 2:55:38 PM
to scala-l...@googlegroups.com


On Wed, Aug 8, 2012 at 11:49 AM, Rüdiger Klaehn <rkl...@gmail.com> wrote:
But what about Seq[T]? That has a contains method taking Any so that
Seq[T] can be covariant. Do you consider that a mistake as well?

I consider it the price of covariance. Because I am aware there is a price, I consider the tradeoffs before I make something covariant.
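The asymmetry under discussion can be seen directly (a small illustrative snippet): Seq#contains accepts Any, so a category error compiles and quietly returns false, while the invariant Set#contains rejects it at compile time.

```scala
// Seq#contains takes Any (the price of Seq's covariance), so this
// compiles and always returns false:
val xs: Seq[Int] = Seq(1, 2, 3)
println(xs.contains("abc")) // false

// Set#contains takes the element type, so the analogous call is a
// compile-time error (shown commented out):
// Set(1, 2, 3).contains("abc")  // type mismatch: String is not Int
```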

Rüdiger Klaehn

Aug 8, 2012, 3:00:17 PM
to scala-l...@googlegroups.com
Well, then why are the tradeoffs different for Seq than for Set?
Because contains is just some method on Seq, whereas contains on Set
is more important?

Paul Phillips

Aug 8, 2012, 3:09:17 PM
to scala-l...@googlegroups.com
On Wed, Aug 8, 2012 at 12:00 PM, Rüdiger Klaehn <rkl...@gmail.com> wrote:
Because contains is just some method on Seq, whereas contains on Set
is more important?

Yes, apply (aliased to contains) is the central operation of Set, whereas contains is indeed "just some method" on Seq.

Nobody's saying you don't sometimes want a covariant Set, but on balance, it is more useful invariant. 

Rüdiger Klaehn

Aug 8, 2012, 3:53:16 PM
to scala-l...@googlegroups.com
OK, you convinced me. Thanks for taking the time to explain the tradeoffs.

Simon Ochsenreither

Aug 8, 2012, 5:28:34 PM
to scala-l...@googlegroups.com
Maybe because it is 100% of the time not what you meant to write? Or do you spend a lot of time looking for Strings in sets of Ints?
What about Int/BigInt? In the current design equality is not based on the type and I consider that to be a good thing in general. Making it invariant would break that and would make it considerably harder for equal things to be considered equal.

Paul Phillips

Aug 8, 2012, 5:40:41 PM
to scala-l...@googlegroups.com
On Wed, Aug 8, 2012 at 2:28 PM, Simon Ochsenreither <simon.och...@gmail.com> wrote:
What about Int/BigInt? In the current design equality is not based on the type and I consider that to be a good thing in general.

I don't.  I would not call Int/BigInt a success story. But I don't want to convince you; the real problem is that I can't easily opt for a different definition of == than you do.

We could probably make an Eq[T] oriented equals zero-overhead now -- in fact, because equals(Any) forces you to box every argument every time, we can do a lot better.  Faster and type-safe, wouldn't that be something.
 
Making it invariant would break that and would make it considerably harder for equal things to be considered equal.

"Equal things considered equal" assumes the conclusion (what are "equal things" ?)  but I find it difficult to believe that the challenge of making "equal things equal" given a little type-safety is greater than the harm done by having no type-safety in what is approximately the most frequently invoked binary operation there is.

Erik Osheim

Aug 8, 2012, 6:29:45 PM
to scala-l...@googlegroups.com
On Wed, Aug 08, 2012 at 02:40:41PM -0700, Paul Phillips wrote:
> I don't. I would not call Int/BigInt a success story. But I don't want to
> convince you; the real problem is that I can't easily opt for a different
> definition of == than you do.
>
> We could probably make an Eq[T] oriented equals zero-overhead now -- in
> fact, because equals(Any) forces you to box every argument every time, we
> can do a lot better. Faster and type-safe, wouldn't that be something.

For what it's worth, Spire has an Eqv[A] typeclass that has zero
overhead (via specialization), is typesafe, and can easily be extended
to support whatever notion of equivalence you're interested in. It also
plugs into a specialized Order[A] typeclass.
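A minimal type-safe equality type class in that spirit can be sketched like this (illustrative names and signatures, not Spire's actual API):

```scala
// Minimal sketch of a type-safe equality type class: comparisons only
// compile where an Eq instance exists, so Int-vs-String is rejected
// at compile time. (Illustrative; Spire's real version is specialized.)
trait Eq[A] { def eqv(x: A, y: A): Boolean }

implicit val intEq: Eq[Int] = new Eq[Int] { def eqv(x: Int, y: Int) = x == y }

implicit class EqOps[A](val x: A) {
  def ===(y: A)(implicit ev: Eq[A]): Boolean = ev.eqv(x, y)
}

println(1 === 1) // true
// 1 === "abc"   // does not compile: no Eq instance for the common type
```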

I think I agree with Paul, at least that there is no one notion of
equivalence that satisfies all cases. You might mean equivalent object
references (i.e. pointer equality), equivalent values with possible
type conversions (i.e. Int/BigInt, Int/Double), equivalent values
within some error threshold (i.e. BigDecimal/Double), typesafe
equivalence, or something else.

If we had more flexibility around the existing methods like equals, eq
and == I think it would be pretty trivial to implement this well.
Scheme has three different equality tests (equal? eqv? and eq?). As-is
we would need to use other operators to achieve this, like ===, eqv,
etc.

Having a dynamic/runtime check between Any/Any seems good, but not at
the expense of a typesafe A/A check, especially when we can have both!

Another world is possible!

-- Erik

Daniel Kristensen

Aug 9, 2012, 6:20:50 AM
to scala-l...@googlegroups.com
Equivalent values within some threshold is tricky, because it's not an equivalence relation (not transitive). Of course you can use some interval-based equivalence within fixed intervals instead. Actually I guess this is probably what you meant :)

Chris Marshall

Aug 9, 2012, 6:49:28 AM
to scala-l...@googlegroups.com
On Wed, Aug 8, 2012 at 1:46 PM, Rüdiger Klaehn <rkl...@gmail.com> wrote:

For a language like scala that has a common base type like Any, a Set
of Any makes perfect sense. And Any defines equals, so it is possible
to determine if two values of type Any are considered equal. You have
to rely on the equals method being properly implemented, but that is
true in any case.

The fact that equals and hashCode exist on Object is one of Java's bigger mistakes. It only makes sense to have a set of a type A where values of that type can be sensibly compared. A Set[Any] could contain:
  • Int
  • String
  • List[Double]
Can you think of a mechanism for comparing these values which makes *any* sense at all under *any* circumstances? Scala is (as I understand it) currently embarking on a deprecation cycle that will eventually see the "removal of equals" from AnyRef. At least, Martin posted a plan for doing this a while ago - not sure if it's still the intention

Chris
 

Rüdiger Klaehn

Aug 9, 2012, 9:09:01 AM
to scala-l...@googlegroups.com
On Thu, Aug 9, 2012 at 12:49 PM, Chris Marshall <oxbow...@gmail.com> wrote:
> sense at all under *any* circumstances? Scala is (as I understand it)
> currently embarking on a deprecation cycle that will eventually see the
> "removal of equals" from AnyRef. At least, Martin posted a plan for doing
> this a while ago - not sure if it's still the intention
>
I don't see how you can pull this off without deprecating almost every
piece of Scala code out there.

But if this is really possible without breaking everything, please
make sure to deprecate the thread-synchronization primitives
(synchronized, notify, notifyAll, wait) as well.

Making them methods of java.lang.Object was a really bad design
decision. The idea that you can just use any object for low-level
synchronization is evil. Normal application programmers should never
even be confronted with these primitives.

Chris Marshall

unread,
Aug 9, 2012, 9:31:35 AM8/9/12
to scala-l...@googlegroups.com
On Thu, Aug 9, 2012 at 2:09 PM, Rüdiger Klaehn <rkl...@gmail.com> wrote:
On Thu, Aug 9, 2012 at 12:49 PM, Chris Marshall <oxbow...@gmail.com> wrote:
> sense at all under *any* circumstances? Scala is (as I understand it)
> currently embarking on a deprecation cycle that will eventually see the
> "removal of equals" from AnyRef. At least, Martin posted a plan for doing
> this a while ago - not sure if it's still the intention
>
I don't see how you can pull this off without deprecating almost every
piece of scala code out there.

Simple - make the use of == require an implicit type class and then make sure there is (to begin with) such an implicit instance in scope for any type. The original conversation is here:


I'm not sure what the status of it is; the last message was over 18 months ago.
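For illustration, a minimal sketch of what such a type class might look like (the names Eq and === are hypothetical choices for this sketch; the actual proposal may differ):

```scala
// Hypothetical sketch of typeclass-based equality; Eq and === are
// made-up names, not the actual proposal.
trait Eq[A] {
  def eqv(x: A, y: A): Boolean
}

implicit val intEq: Eq[Int] = new Eq[Int] {
  def eqv(x: Int, y: Int): Boolean = x == y
}

implicit class EqOps[A](x: A) {
  def ===(y: A)(implicit ev: Eq[A]): Boolean = ev.eqv(x, y)
}

assert(1 === 1)
assert(!(1 === 2))
// 1 === "one" would not even compile: there is no Eq instance relating
// Int and String, so ill-typed comparisons are rejected statically.
```

The key point is that `1 === "one"` becomes a compile-time error rather than a silently-false runtime comparison, which is exactly the typesafe A/A check Erik asked for.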

C

Alex Cruise

unread,
Aug 9, 2012, 1:22:24 PM8/9/12
to scala-l...@googlegroups.com
On Thu, Aug 9, 2012 at 6:31 AM, Chris Marshall <oxbow...@gmail.com> wrote:
Simple - make the use of == require an implicit type class and then make sure there is (to begin with) such an implicit instance in scope for any type. The original conversation is here:

Bruno

unread,
Aug 10, 2012, 10:05:24 AM8/10/12
to scala-l...@googlegroups.com
Because contains is just some method on Seq, whereas contains on Set
is more important?

Yes, apply (aliased to contains) is the central operation of Set, whereas contains is indeed "just some method" on Seq.

Nobody's saying you don't sometimes want a covariant Set, but on balance, it is more useful invariant. 

At least for me, Set would be more useful covariant and with method signatures uniform with those in Seq. The fact that it isn't causes me this kind of inconvenience:
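For illustration, a sketch of the sort of friction invariance can cause (not necessarily the example the link pointed to):

```scala
// Seq is covariant, so a Seq[Int] is a Seq[Any] and can be passed
// directly to a method expecting the wider element type:
def seqSize(xs: Seq[Any]): Int = xs.size
assert(seqSize(Seq(1, 2, 3)) == 3)

// Set is invariant, so a Set[Int] is NOT a Set[Any]:
def setSize(xs: Set[Any]): Int = xs.size
// setSize(Set(1, 2, 3))  // does not compile: type mismatch

// One has to widen the element type explicitly:
assert(setSize(Set(1, 2, 3).map(x => x: Any)) == 3)
```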

Best regards,

Bruno

Paolo Giarrusso

unread,
Aug 12, 2012, 10:06:05 PM8/12/12
to scala-l...@googlegroups.com, rkl...@googlemail.com
Hi all,
first, thanks for your effort. I'd like to mention a problem I can't investigate further right now: Stream.map does not seem to run in constant stack space, at least in 2.10.0-M5, and reading the code suggests the same for Stream.flatMap.
Below there's my (unconfirmed) analysis of the code, of the complexity involved, and so on. The issue is pretty annoying for me, performance-wise: I need to convert an Iterator (returned by sliding) to a Traversable, and the only lazy way to do that is to use a Stream.

I currently experience a stack overflow in Stream#StreamWithFilter.map:

<repeat often:>
at scala.collection.immutable.Stream$StreamWithFilter.scala$collection$immutable$Stream$StreamWithFilter$$tailMap$1(Stream.scala:482)
at scala.collection.immutable.Stream$StreamWithFilter.map(Stream.scala:486)

The problematic code is below. It seems that while calls to cons(f(head), tailMap) don't consume stack space, as expected, the problem is the unprotected call to tailMap on line 486. In other words, the stack depth seems to be O(f), with f = the number of consecutive stream elements which do not satisfy the predicate `p`. This would make sense, since in my application the filter is a bug detector which finds 2 matches over the whole JDK. The streams my app filters are streams of bytecode instructions in a JDK method, of which there are at most 11594. Still, those entries apparently manage to exhaust a 1 MB stack.

  final class StreamWithFilter(p: A => Boolean) extends WithFilter(p) {

    override def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Stream[A], B, That]): That = {
      def tailMap = asStream[B](tail withFilter p map f) //Line 482
      if (isStreamBuilder(bf)) asThat(
        if (isEmpty) Stream.Empty
        else if (p(head)) cons(f(head), tailMap)
        else tailMap //Line 486
      )
      else super.map(f)(bf)
    }

It seems that the call to tailMap could be rewritten to be tail-recursive, but the transformation seems somewhat involved:
//p, f, A, B available in lexical scope
def tailMap(coll: Stream[A]): Stream[B] = {
  if (coll.isEmpty) Stream.Empty
  else if (p(coll.head)) cons(f(coll.head), tailMap(coll.tail)) //Non-tail call?
  else tailMap(coll.tail) //tail-call
}
I guess the non-tail call shown is not a problem in practice, since cons takes its second argument by name; but the compiler cannot eliminate the other call for us, so the transformation has to be done by hand. Something like:

def tailMap(coll: Stream[A]): Stream[B] = {
  var head: A = null.asInstanceOf[A]
  var tail: Stream[A] = coll
  do {
    if (tail.isEmpty)
      return Stream.Empty
    head = tail.head
    tail = tail.tail
    if (p(head))
      return cons(f(head), tailMap(tail))
  } while (true)
  sys.error("unreachable") // needed so the body typechecks as Stream[B]
}

The body would then become something like:
    override def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Stream[A], B, That]): That = {
      def tailMap //... as above
      if (isStreamBuilder(bf)) asThat(tailMap(Stream.this))
      else super.map(f)(bf)
    }

Best regards

Il giorno lunedì 6 agosto 2012 20:21:18 UTC+2, Rüdiger Klaehn ha scritto:
Hi all,

all in all I am quite happy with the Scala collections library. But
there are quite a few annoyances and inefficiencies I have been
confronted with over the last year. This is an attempt to summarize
them and put them up for discussion. I am working mostly with the
default implementations of Set and Map, so this is where I have the
most remarks.

In general, while the collections "just work" and do exactly what you
expect, they occasionally behave inefficiently and unintuitively
regarding performance and structural sharing. Often, methods that you
would expect to work very efficiently for a certain type of collection
instead just use the default implementation from e.g. TraversableLike.


Set:

- filter is easy to implement for HashSet, and it is very important
since it is used by operations that people expect to be fast on a set,
like intersection. Nevertheless filter just uses the default
implementation from TraversableLike, which is very inefficient. I
created a SI for that issue and implemented a filter method on
HashSet. For maximum benefit you would have to implement filter on
Set1..Set4 as well: https://issues.scala-lang.org/browse/SI-6196

- merge exists for HashMap, but not for HashSet. It should be
implemented, and used from union and ++ whenever possible. That would
increase the efficiency of that operation tremendously. A set should
have quick union and intersection methods.

- while implementing filter on HashSet, I also found some minor issues
that result in some inefficiency:
https://issues.scala-lang.org/browse/SI-6197 and
https://issues.scala-lang.org/browse/SI-6198

- intersect should check if both sides are of the same type, and then
do smaller.filter(larger). People should not have to do this manually
to get best performance. See for example this discussion:
https://groups.google.com/forum/?fromgroups#!searchin/scala-user/smaller$20filter$20larger/scala-user/3QJYuUU3frU/YfMTW9mRXsYJ
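The manual workaround mentioned above looks like this (since Set[A] extends A => Boolean, the larger set can be passed directly to filter as the predicate):

```scala
val a = Set(1, 2, 3, 4, 5)
val b = Set(3, 4, 5, 6, 7, 8, 9)

// Filter the smaller set with the larger one, so the traversal cost
// is proportional to the smaller set.
val (smaller, larger) = if (a.size <= b.size) (a, b) else (b, a)
val intersection = smaller.filter(larger)

assert(intersection == (a intersect b))
assert(intersection == Set(3, 4, 5))
```

The point of the bullet is that `a intersect b` itself should do this size check internally, so users never have to write it by hand.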

- (set should be covariant, but I guess there is nothing that can be
done about this at this point. Maybe have another version of set?)


Map:

- since map is very similar to set, it should also have a filter
method that works efficiently instead of just using the one from
TraversableLike. I added one here:
https://issues.scala-lang.org/browse/SI-6200 . Basically the same as
https://issues.scala-lang.org/browse/SI-6196 .

- HashMap has a merge method. There was a bug in it which has been
fixed. https://issues.scala-lang.org/browse/SI-5879 . But almost
nobody will ever use this very efficient method because when people
want to merge two maps they use ++. ++ should check if both arguments
are of the right type and use HashMap.merge if possible.
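For reference, a sketch of calling the merge method directly (it is spelled `merged` in recent versions; the supplied function decides which entry wins on key collisions):

```scala
import scala.collection.immutable.HashMap

val m1 = HashMap(1 -> "a", 2 -> "b")
val m2 = HashMap(2 -> "c", 3 -> "d")

// ++ rebuilds the result entry by entry; on collisions the right side wins.
val viaPlusPlus = m1 ++ m2

// merged walks both hash tries and can share structure; the function
// resolves collisions (here: keep the entry from m2, matching ++).
val viaMerged = m1.merged(m2)((kv1, kv2) => kv2)

assert(viaPlusPlus == viaMerged)
assert(viaMerged == HashMap(1 -> "a", 2 -> "c", 3 -> "d"))
```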

- mapValues and filterKeys: either make it clear that they are lazy by
using an appropriate return value (MapView?), or make them eager. I
would prefer the latter. There is an old SI for this issue
(https://issues.scala-lang.org/browse/SI-4776), but nothing has been
done about it.
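A small sketch of the surprise: the function passed to mapValues is re-evaluated on every lookup, because the result is a lazy view typed as a plain Map (the counter below makes that visible):

```scala
var evaluations = 0
val m = Map("a" -> 1, "b" -> 2)

// mapValues returns a lazy view: the function runs on each access,
// not once up front.
val lazyMapped = m.mapValues { v => evaluations += 1; v * 10 }
lazyMapped("a")
lazyMapped("a")
assert(evaluations == 2) // evaluated twice, nothing is cached

// An eager alternative using plain map:
evaluations = 0
val eager = m.map { case (k, v) => evaluations += 1; (k, v * 10) }
assert(evaluations == 2) // once per entry, up front
eager("a"); eager("a")
assert(evaluations == 2) // lookups do no further work
```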

- updated should not create a new instance if not necessary.
https://issues.scala-lang.org/browse/SI-5139


Vector:

- vector is a brilliant data structure that would be very useful for
number crunching like time series analysis if it wasn't for the boxing
overhead of Vector[Double] etc. Would it be possible to use
@specialized for Vector? And for other collections like Set and Map
for that matter?


In general, care should be taken to use structural sharing whenever
possible. If that means that you have to override some of the generic
methods of TraversableLike or SeqLike, then so be it. For example
this:

scala> val x=Set(1,2,3,4)
x: scala.collection.immutable.Set[Int] = Set(1, 2, 3, 4)

scala> x eq (Set.empty ++ x)
res9: Boolean = false // Set.empty ++ x builds a completely new set,
which can be very slow for large sets!

scala> x eq (x ++ Set.empty)
res10: Boolean = true

This would not happen if ++ used merge when both sides are Sets.


I think that having an efficient collections library is tremendously
important. At least for me, having maximum efficiency in the
collections is more important than new language features such as
macros.

best regards,

Rüdiger

Paolo Giarrusso

unread,
Aug 20, 2012, 4:48:45 PM8/20/12
to scala-l...@googlegroups.com
My proposed fix, together with a testcase, is now in
https://github.com/scala/scala/pull/1167