Thread safety with iterators

575 views
Skip to first unread message

Rex Kerr

unread,
Dec 29, 2014, 9:14:29 PM12/29/14
to scala-i...@googlegroups.com
Several Iterator methods return two or more iterators: in addition to duplicate, there are also span and partition.  duplicate, but not the other two, make a nominal attempt at synchronization on the underlying single iterator for hasNext and next, but not any other methods (which almost all call hasNext and/or next multiple times).

Do we think that maintaining basic consistency of the data structures for duplicate are valuable?  Should we insist upon the same for span and partition, or drop the sync on duplicate?

My instinct is that since duplicate only preserves exactly one use-case (duplicate an iterator, pass one each to two threads), it isn't that useful.  Then again, it's the only thing we've got that lets you give two threads access to the data in one iterator without necessarily duplicating all the data.

  --Rex

Viktor Klang

unread,
Dec 30, 2014, 11:02:48 AM12/30/14
to scala-i...@googlegroups.com
I wouldn't expect an Iterator to be thread safe, and I don't see the value of having it be like that since there's an inherent check-then-act problem between hasNext and next anyway.

--
You received this message because you are subscribed to the Google Groups "scala-internals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-interna...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Cheers,

Erik Osheim

unread,
Dec 30, 2014, 1:47:41 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 05:02:46PM +0100, Viktor Klang wrote:
> I wouldn't expect an Iterator to be thread safe, and I don't see the value
> of having it be like that since there's an inherent check-then-act problem
> between hasNext and next anyway.

Agreed.

I would prefer synchronization be opt-in for cases like this (and done
at a higher level probably).

-- Erik

Haoyi Li

unread,
Dec 30, 2014, 1:54:35 PM12/30/14
to scala-internals
I was under the impression that all of scala collections were not synchronized by default. Is that not the case?

Rex Kerr

unread,
Dec 30, 2014, 2:40:40 PM12/30/14
to scala-i...@googlegroups.com
Hi Viktor (and Erik and Haoyi),

That's what I thought at first too, but maybe I didn't make the counterargument clear enough (i.e. the one that made me wonder).

    val it = myCollection.iterator.filter(p)
    val (ita, itb) = it.duplicate
    val (a, b) = (Future(foo(ita)), Future(bar(itb))   // Horrible breakage ensues

This pattern has no check-then-act problem since each thread gets its own iterator.

But it does have a synchronization problem because duplicate is mutable under the hood (as it must be, since iterators themselves are mutable), so there is an easy-to-overlook dependency between ita and itb.  It's especially easy to overlook if I add more lines of code.

Right now, duplicate is safe in this pattern, but partition and span are not.  I'm happy to make duplicate also unsafe for this pattern, but it's not quite the normal problem with iterator thread-safety: normally it is fine to hand an iterator to another thread as long as you don't touch it any more (i.e. transfer ownership, in Rust lingo).  (I haven't checked what the performance implications are in a single-threaded context.  Maybe I'd better before coming to conclusions about what I should do.)

FWIW, views of mutable collections are full of problems of this sort, and I presently have no intention to fix them.  Stream might be also; I haven't checked.

  --Rex

Viktor Klang

unread,
Dec 30, 2014, 2:59:07 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 8:40 PM, Rex Kerr <ich...@gmail.com> wrote:
Hi Viktor (and Erik and Haoyi),

That's what I thought at first too, but maybe I didn't make the counterargument clear enough (i.e. the one that made me wonder).

    val it = myCollection.iterator.filter(p)
    val (ita, itb) = it.duplicate

What happens if you use `it` after this line?



--
Cheers,

Rex Kerr

unread,
Dec 30, 2014, 3:49:54 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 11:59 AM, Viktor Klang <viktor...@gmail.com> wrote:


On Tue, Dec 30, 2014 at 8:40 PM, Rex Kerr <ich...@gmail.com> wrote:
Hi Viktor (and Erik and Haoyi),

That's what I thought at first too, but maybe I didn't make the counterargument clear enough (i.e. the one that made me wonder).

    val it = myCollection.iterator.filter(p)
    val (ita, itb) = it.duplicate

What happens if you use `it` after this line?

You have violated the contract for iterator, which is that after using any method save hasNext, next, isEmpty, or nonEmpty, you may only operate sensibly on the return value from that method (if any).  (I think there are a few others; we need to document the not-completely-destructive ones better.)

That is easy to keep track of with a "use me only once" rule.  (It would be lovely if the type system could make sure.)

It's much harder to keep track of a "passing iterators to other threads is totally fine unless you happened to originally create them this particular way" rule.

I'm perfectly okay with saying: yes, it's a tougher rule; life is hard, get used to it--or just fall back to one of no-duplication or no-passing-to-other-threads.

But it's worth noting the higher cognitive demand.

  --Rex
 

Scott Carey

unread,
Dec 30, 2014, 3:50:58 PM12/30/14
to scala-i...@googlegroups.com


On Tuesday, December 30, 2014 11:40:40 AM UTC-8, Rex Kerr wrote:
Hi Viktor (and Erik and Haoyi),

That's what I thought at first too, but maybe I didn't make the counterargument clear enough (i.e. the one that made me wonder).

    val it = myCollection.iterator.filter(p)
    val (ita, itb) = it.duplicate
    val (a, b) = (Future(foo(ita)), Future(bar(itb))   // Horrible breakage ensues


if foo() races far ahead of bar(), could this produce an OOM?   Doesn't duplicating an arbitrary iterator imply buffering?  An iterator could have unbounded elements.

val c = myCollection.withFilter(p)
val ita = c.iterator
val itb = c.iterator

or the equivalent would be more sane if so.  If it buffers, I feel the best thing is to remove it entirely -- it does not fit well with the concept of mutable iterators if it implies potentially unbounded buffering.  Although iterators carry state, users generally expect their state to be small and bounded.   Some implementations (such as an iterator for an IndexedSeq) could avoid the buffering, but in general this seems dangerous for arbitrary iterators rather than specific subtypes of iterators where this would be safer. 

Scott Carey

unread,
Dec 30, 2014, 3:53:05 PM12/30/14
to scala-i...@googlegroups.com


On Tuesday, December 30, 2014 12:50:58 PM UTC-8, Scott Carey wrote:


On Tuesday, December 30, 2014 11:40:40 AM UTC-8, Rex Kerr wrote:
Hi Viktor (and Erik and Haoyi),

That's what I thought at first too, but maybe I didn't make the counterargument clear enough (i.e. the one that made me wonder).

    val it = myCollection.iterator.filter(p)
    val (ita, itb) = it.duplicate
    val (a, b) = (Future(foo(ita)), Future(bar(itb))   // Horrible breakage ensues


if foo() races far ahead of bar(), could this produce an OOM?   Doesn't duplicating an arbitrary iterator imply buffering?  An iterator could have unbounded elements.

val c = myCollection.withFilter(p)
val ita = c.iterator
val itb = c.iterator

or the equivalent would be more sane if so.  If it buffers, I feel the best thing is to remove it entirely -- it does not fit well with the concept of mutable iterators if it implies potentially unbounded buffering.  Although iterators carry state, users generally expect their state to be small and bounded.   Some implementations (such as an iterator for an IndexedSeq) could avoid the buffering, but in general this seems dangerous for arbitrary iterators rather than specific subtypes of iterators where this would be safer. 

Additionally, at least in the few cases I have thought of so far -- the cases that do not require buffering do not require synchronization either. 

Viktor Klang

unread,
Dec 30, 2014, 4:05:20 PM12/30/14
to scala-i...@googlegroups.com
Perhaps `duplicate` should be deprecated instead?

If you want that kind of behavior* you may as well do:

val (it1, it2) = { val v = it.toVector; (v.iterator, v.iterator) }

(* I've only been programming for 20-odd years and I've never had the "Iterator.duplicate" use-case yet)

--
You received this message because you are subscribed to the Google Groups "scala-internals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-interna...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Cheers,

Rex Kerr

unread,
Dec 30, 2014, 4:05:38 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 12:50 PM, Scott Carey <scott...@gmail.com> wrote:


On Tuesday, December 30, 2014 11:40:40 AM UTC-8, Rex Kerr wrote:
Hi Viktor (and Erik and Haoyi),

That's what I thought at first too, but maybe I didn't make the counterargument clear enough (i.e. the one that made me wonder).

    val it = myCollection.iterator.filter(p)
    val (ita, itb) = it.duplicate
    val (a, b) = (Future(foo(ita)), Future(bar(itb))   // Horrible breakage ensues


if foo() races far ahead of bar(), could this produce an OOM?   Doesn't duplicating an arbitrary iterator imply buffering? 

Sure does!  This is an anti-pattern if you your iterators might be long, or if you don't know how long they are.

The docs have an note about it: "The implementation may allocate temporary storage for elements iterated by one iterator but not yet by the other."
 
If it buffers, I feel the best thing is to remove it entirely -- it does not fit well with the concept of mutable iterators if it implies potentially unbounded buffering.

Well, then we'd better also remove foldRight, which stores the entire iterator even if it terminates early after only a few function evaluations, and partition and span, which also necessarily has buffering since they produce multiple iterators from one.

I think we're better off allowing these operations but documenting the potential pitfalls, since these things are useful.  (Maybe the docs could be better.)

  --Rex
 

Rex Kerr

unread,
Dec 30, 2014, 4:10:35 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 1:05 PM, Viktor Klang <viktor...@gmail.com> wrote:
Perhaps `duplicate` should be deprecated instead?

If you want that kind of behavior* you may as well do:

val (it1, it2) = { val v = it.toVector; (v.iterator, v.iterator) }

Welllll, this necessarily instead of only potentially buffers the entire iterator.
 
(* I've only been programming for 20-odd years and I've never had the "Iterator.duplicate" use-case yet)

Yeah, me too.  And, me too.  (Caveat: I don't recall using an iterator at all in Turbo Pascal.  Second caveat: I guess duplicating pointers to walk a singly-linked list in C++ doesn't count.)

But I have used partition and span.  And they have the thread-unsafety problem, which I should at least document.

  --Rex
 

Haoyi Li

unread,
Dec 30, 2014, 4:15:26 PM12/30/14
to scala-internals
I think we're better off allowing these operations but documenting the potential pitfalls, since these things are useful.

I'd argue the opposite; that scala's collections have too many "useful but full of pitfalls" things, and that the API in general needs to be smaller. When you have so many methods that nobody knows what is safe and what isn't, that's as good as not having them =/ What good is "necessarily" vs "potentially" if nobody really knows what the current implementation is doing?

I know most of the methods on Iterator are basically off-limits to me because I can never be sure when It'll do something stupid and I'll have to put in a .toSeq anyway.

Viktor Klang

unread,
Dec 30, 2014, 4:22:46 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 10:10 PM, Rex Kerr <ich...@gmail.com> wrote:
On Tue, Dec 30, 2014 at 1:05 PM, Viktor Klang <viktor...@gmail.com> wrote:
Perhaps `duplicate` should be deprecated instead?

If you want that kind of behavior* you may as well do:

val (it1, it2) = { val v = it.toVector; (v.iterator, v.iterator) }

Welllll, this necessarily instead of only potentially buffers the entire iterator.

Seem legit :)
 
 
(* I've only been programming for 20-odd years and I've never had the "Iterator.duplicate" use-case yet)

Yeah, me too.  And, me too.  (Caveat: I don't recall using an iterator at all in Turbo Pascal.  Second caveat: I guess duplicating pointers to walk a singly-linked list in C++ doesn't count.)

But I have used partition and span.  And they have the thread-unsafety problem, which I should at least document.

Couldn't/shouldn't there be a general "Iterators are assumed to be thread-unsafe"?
 


  --Rex
 

--
You received this message because you are subscribed to the Google Groups "scala-internals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scala-interna...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Cheers,

Erik Osheim

unread,
Dec 30, 2014, 4:30:48 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 10:22:44PM +0100, Viktor Klang wrote:
> Couldn't/shouldn't there be a general "Iterators are assumed to be
> thread-unsafe"?

I agree with Viktor here and also with Haoyi. Mutable structures in
the standard library like Iterator, ArrayBuffer, etc. are really only
provided for "lower-level" optimizations, and shouldn't need a large
set of complex/subtle methods (IMO anyway).

I would prefer to make it clear that there is no synchronization for
Iterators, and to avoid providing high-level methods with subtle
interactions. In this case, I would vote for documenting potential
problems and possibly deprecating methods deemed "too complex".

-- Erik

Rex Kerr

unread,
Dec 30, 2014, 4:31:24 PM12/30/14
to scala-i...@googlegroups.com
On Tue, Dec 30, 2014 at 1:15 PM, Haoyi Li <haoy...@gmail.com> wrote:
I think we're better off allowing these operations but documenting the potential pitfalls, since these things are useful.

I'd argue the opposite; that scala's collections have too many "useful but full of pitfalls" things, and that the API in general needs to be smaller.

That's a good goal for a rewrite.  It's sort of like the RISC/CISC debate, you know.  Both won: RISC implementation of CISC instruction set.
 
When you have so many methods that nobody knows what is safe and what isn't, that's as good as not having them =/ What good is "necessarily" vs "potentially" if nobody really knows what the current implementation is doing?

Needs better docs, then.
 

I know most of the methods on Iterator are basically off-limits to me because I can never be sure when It'll do something stupid and I'll have to put in a .toSeq anyway.

Do you have actual examples of it doing something stupid?  Iterator is actually written quite well (for single-threaded context).  Bug reports or just a "this messed me up" are welcome.

Views are more problematic, both conceptually and in implementation.  (Especially IndexedSeqView.)  Stream also has quite a few more gotchas (not being lazy when you think it should, not releasing memory when you think it should, etc.).

  --Rex


Rex Kerr

unread,
Dec 30, 2014, 4:34:07 PM12/30/14
to scala-i...@googlegroups.com
This seems like the consensus.  Sounds reasonable to me.  I don't think I'll deprecate anything until I have a better sense of the use in the wild.  It's usually not that popular to deprecate things that can be and are widely being used safely, with no real alternative, just because some people are careless.

  --Rex
 

Bardur Arantsson

unread,
Dec 30, 2014, 4:38:27 PM12/30/14
to scala-i...@googlegroups.com
On 2014-12-30 21:49, Rex Kerr wrote:
[--snip--]

The solution seems simple (yet extremely difficult if you're going the
no-rewrites way): Split the immutable and mutable hierarchies completely
and then consider what operations make sense on each of those. (I
realize that I just asked for a pony^3, but that's the only way most of
these things are going to make any sense and be safe.)

In fact, take paulp's lessons to heart and dump the mutable collections
-- or at least dump all the fancy stuff in the mutable collections.

Of course, just writing a new collections implementation from scratch
would probably be easier. (The idea being that this would eventually
supercede the current collections.)

Regards,

martin odersky

unread,
Dec 30, 2014, 4:42:28 PM12/30/14
to scala-internals
Sounds good to me. I also think duplicate does too much magic for a
low-level abstraction such as iterator. Let's figure out where it is
used and deprecate if possible. Grepping through the source of the
community build should help with that.

- Martin

> --Rex
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "scala-internals" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to scala-interna...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
Martin Odersky
EPFL

Rex Kerr

unread,
Dec 30, 2014, 5:31:49 PM12/30/14
to scala-i...@googlegroups.com
Incidentally, I can't easily construct any scenario where there is a significant performance penalty for the synchronized blocks where you wouldn't otherwise have random incorrect results.  This suggests to me that I should add them to span and partition (independent of whether we eventually deprecate these methods).

  --Rex

Rex Kerr

unread,
Dec 30, 2014, 5:40:07 PM12/30/14
to scala-i...@googlegroups.com
Er, wait.  Yes I can, if I build in the correct window.  About 2x on trivial operations.  Out it goes in my optimized branch.

  --Rex

Hanns Holger Rutz

unread,
Dec 31, 2014, 8:53:48 AM12/31/14
to scala-i...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/30/2014 08:40 PM, Rex Kerr wrote:
> val it = myCollection.iterator.filter(p) val (ita, itb) =
> it.duplicate val (a, b) = (Future(foo(ita)), Future(bar(itb)) //
> Horrible breakage ensues

I always assumed than an Iterator should be consume as soon as it gets
created, in particular on the same thread. If you can't use those
constraints, then Iterable ('Iterator-Factory') is the best thing to
use IMO.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBAgAGBQJUo//nAAoJEKZFmaPaYk6QFAQP/2CBQ3Wd/zJ5lNDgKbC6vk/b
M0+NDilvEX7RsUA5vFxtUiPri65fd01tO/Fpe8zm58uCJaYSH8w+AMA+lY07hJLo
FuP+qeGoO5co8vhsPEyUb3JvNfoMgVTQwh14Xmk4rxWQButqBShPjV/O23KDFmXT
kbhlJ6U+cz3rTp+IEMs2/7YQ3+Sx/Qsmc12BWrALSA3wWk2nUqZTSHnHUG0iUqrN
qKjFzJ8sp2LH+qGP6095ldH+yMmLdbyII9XKyi4G/R5AHhKXQvky7OehrauQX0Cu
jmrrDnqZXvBzWMic24i4kOGhmALZ5QfLqV4Gg5F04dhb9trNeognasPR7sgTpTdC
BltPAsrsJdssla3w4xmBUZ2yl5W5yv2SkuvrMkv3QKeQk/G/7oakrTCv1G+jQgoA
+mjAUwlqcAXccO40DWvx+w13a/omCl/jwFfmNFUBg+d/7/ar1MVGWjKO0RxjnQKC
6dnLX3A5wYkE0fA2Z0mfo+j+1XWfV7QqU+RgRcR3+9+/v1so5ugwnkbpeScELFHW
fIb0szXAvQ032TbWGKcbjH0O7RZJOgZsVPbwJlB82p/vhgdDqld6bVBNBPky1ffS
wbc0iodvPJNpc+ZttO1vPTqohy1tYjZV+vGDdvhR5xfgtbLIiT9TjzPgNzdTpgaR
D7jR0QRu6nrnqXR/uCtz
=Wlj6
-----END PGP SIGNATURE-----
Reply all
Reply to author
Forward
0 new messages