Hi Rex,
Can you please share your benchmarks?
It would be interesting to augment them with ScalaBlitz, which uses its own spliterators and kernels.
I've been comparing ScalaBlitz with the standard Scala Parallel Collections, and I see substantial speedups (100x) on primitive-based tests with cheap operations, and substantial but smaller ones (~15x) on operations over boxed elements (e.g. value classes).
In my experience, Scala Parallel Collections create enormous overhead for small operations (e.g. array.par.sum), so I'm interested to see which operations you measured. I can hardly make a direct comparison, as ScalaBlitz doesn't implement vectors, but the numbers I get for ranges, arrays, mutable hash maps and hash sets, and immutable maps and sets backed by hash tries suggest that Scala Parallel Collections have enormous overhead.
Speaking of what could be improved, there's a simple weak spot: when the time required to process an element differs from element to element, Scala Parallel Collections can perform arbitrarily badly, as their scheduler is primitive and splits the work into fixed-size blocks in advance. I believe the state of the art here is work-stealing tree scheduling, which Aleksandar Prokopec implemented in ScalaBlitz.
Actually, using a WS-tree should substantially lower the overhead of orchestrating small operations, and also solve the problem of non-uniform workloads.
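To make the non-uniform workload point concrete, here is a small simulation sketch (not the actual scheduler of either library). `staticMakespan` models splitting the elements into one fixed block per worker decided in advance; `dynamicMakespan` models workers that grab the next small chunk whenever they become idle. The dynamic version is plain greedy self-scheduling standing in for work stealing, not the WS-tree algorithm, and all names and costs are hypothetical.

```scala
// Hypothetical simulation: per-element costs are abstract "time units", not measurements.
object SchedulingSketch {
  // Fixed-size blocking in advance: one contiguous block per worker, decided up front.
  // The makespan (finishing time of the slowest worker) is the most expensive block.
  def staticMakespan(costs: Vector[Int], workers: Int): Int = {
    val blockSize = (costs.size + workers - 1) / workers // ceiling division
    costs.grouped(blockSize).map(_.sum).max
  }

  // Dynamic self-scheduling (a stand-in for work stealing, not the WS-tree algorithm):
  // each small chunk goes to whichever worker becomes idle first.
  def dynamicMakespan(costs: Vector[Int], workers: Int, chunk: Int): Int = {
    val finish = Array.fill(workers)(0)
    costs.grouped(chunk).foreach { c =>
      val idle = finish.indices.minBy(i => finish(i))
      finish(idle) += c.sum
    }
    finish.max
  }

  def main(args: Array[String]): Unit = {
    // Skewed workload: the first 100 of 1000 elements are 100x more expensive.
    val skewed = Vector.tabulate(1000)(i => if (i < 100) 100 else 1)
    println(s"static:  ${staticMakespan(skewed, 4)}")      // 10150
    println(s"dynamic: ${dynamicMakespan(skewed, 4, 10)}") // 3000
  }
}
```

With a uniform workload (all costs equal) both models give the same makespan; the gap only opens when costs are skewed. The dynamic result is still above the ideal 10900 / 4 = 2725 because the expensive chunks are coarse; finer-grained splitting of the expensive region would close that gap.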
Is it part of the contract that it should? I would say no: if users want a snapshot, they should do something explicit about it or use a data structure that supports it (a Ctrie, for example).
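As an illustration of asking for a snapshot explicitly: Scala's standard library ships a Ctrie as `scala.collection.concurrent.TrieMap`, which offers `snapshot()` and `readOnlySnapshot()`. A minimal sketch (the object and method names are hypothetical, and the updates are sequential only to keep the output deterministic) showing that later updates do not leak into a snapshot taken beforehand:

```scala
import scala.collection.concurrent.TrieMap

object SnapshotSketch {
  // Returns (snapshot taken before the updates, live map after the updates).
  def demo(): (scala.collection.Map[String, Int], TrieMap[String, Int]) = {
    val live = TrieMap("a" -> 1, "b" -> 2)
    // Explicit, lock-free snapshot; TrieMap rebuilds shared structure lazily on later access.
    val frozen = live.readOnlySnapshot()
    live.put("c", 3)
    live.remove("a")
    (frozen, live)
  }

  def main(args: Array[String]): Unit = {
    val (frozen, live) = demo()
    println(frozen.get("a"))      // Some(1)
    println(frozen.contains("c")) // false
    println(live.get("a"))        // None
  }
}
```

A parallel traversal over `frozen` would then see a consistent view regardless of what other threads do to `live`.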
Specifically (since more is usually better than less, but prioritizing is important!), which of these things would you really like to use, as opposed to just finding it reassuring that they're there in case you need them?
--Rex