
Check for correct multithreading


Harlan Messinger

Dec 21, 2009, 3:36:17 PM
I've written an enumerable class, LazyFileList, that builds a list (a
List<string>) of the paths of the files in a given directory (with an
optional mask) on a separate thread. The MoveNext method of the
enumerator it returns will do one of three things, depending on
the value of its Timeout and TimeoutAction properties, if it goes to get
the next file and finds that the LazyFileList doesn't have any more
files to offer at the moment but hasn't finished gathering them yet:

*ThrowException (the default) means that MoveNext will wait Timeout
milliseconds, and if another file hasn't been added to the list by then,
it will throw a System.TimeoutException.

*ReportComplete means that MoveNext will wait Timeout milliseconds, and
if another file hasn't been added to the list by then, it will return
false--that is, it will report that enumeration is complete, even though
it really isn't yet.

*NoTimeout means that MoveNext will wait indefinitely for the
LazyFileList to add the next file to its list.

The code is at http://www.gavelcade.com/lazyfilelist.html. I think the
code is satisfactorily thread-safe; actually, at one point I think I
tried to overdo it, but uselessly, because even if that locking was
necessary, it was only half the job. Before I tinker any further, I
wondered if any of you might look at it and give me guidance.

The block marked [1] in the code is where I launch the thread that
builds up the file list.

Block [2] is where the enumeration retrieves the next file to add to the
list. The reason I locked this code was because I thought it might
interfere with one of its enumerators looking up a file. But (a) there's
only one thread filling the list, so there's no concern about
concurrency there, and I didn't lock the code in MoveNext that reads a
file from the list. I'm not sure whether I need to lock both blocks of
code or don't need to lock either, but I'm pretty sure having just this
lock doesn't accomplish anything.

Block [3] in the Current method and Blocks [4] and [5]--essentially, the
whole MoveNext method--are where I might also have locks. Besides
concern for concurrency with the AddFile method, I note that I've got
boolean formulas where I'm using different properties from the
LazyFileList such as its Count, its TimeoutAction, and its Finish status
that are read at different moments and therefore could, taken together,
not represent a consistent state. However, given that progress for the
enumeration is unidirectional (time waited so far can go from less than
the Timeout threshold to more than it but not the reverse; the
enumeration can go from Finished = false to Finished = true but not the
reverse), I don't think this is a problem, because the worst that will
happen is that the method will wait through one more sleep cycle than
was strictly necessary before responding to the caller. Does that make
sense?

Peter Duniho

Dec 21, 2009, 6:53:23 PM
Harlan Messinger wrote:
> [...]

> Block [2] is where the enumeration retrieves the next file to add to the
> list. The reason I locked this code was because I thought it might
> interfere with one of its enumerators looking up a file. But (a) there's
> only one thread filling the list, so there's no concern about
> concurrency there, and I didn't lock the code in MoveNext that reads a
> file from the list. I'm not sure whether I need to lock both blocks of
> code or don't need to lock either, but I'm pretty sure having just this
> lock doesn't accomplish anything.

Was there a "(b)"?

You are correct: locking only at the writer accomplishes nothing. You
need to lock both the writer and reader. Otherwise, the reader may
observe the collection in an inconsistent state (adding to the
collection isn't atomic).
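A minimal sketch of the point (hypothetical class and member names, not the actual LazyFileList code): the reader has to take the very same lock the writer does, or the reader can see a List<T> mid-Add().

```csharp
using System;
using System.Collections.Generic;

class SharedList
{
    private readonly List<string> items = new List<string>();
    private readonly object sync = new object();

    public void Add(string item)
    {
        lock (sync) { items.Add(item); }   // the writer locks...
    }

    public string TryGet(int index)
    {
        lock (sync)                        // ...and so must every reader
        {
            return index < items.Count ? items[index] : null;
        }
    }

    static void Main()
    {
        var s = new SharedList();
        s.Add("a.txt");
        Console.WriteLine(s.TryGet(0));    // a.txt
        Console.WriteLine(s.TryGet(1) == null);   // True
    }
}
```

Locking only in Add() would serialize writers against each other (of which there is only one anyway) while leaving readers free to observe a half-updated collection.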

> Block [3] in the Current method and Blocks [4] and [5]--essentially, the
> whole MoveNext method--are where I might also have locks. Besides
> concern for concurrency with the AddFile method, I note that I've got
> boolean formulas where I'm using different properties from the
> LazyFileList such as its Count, its TimeoutAction, and its Finish status
> that are read at different moments and therefore could, taken together,

> not represent a consistent state. [...]

Actually, your biggest issue is that because you're using automatic
properties, the backing field isn't "volatile" and so you can't count on
the value being consistently visible in one thread after an update in
some other thread.

> However, given that progress for the
> enumeration is unidirectional (time waited so far can go from less than
> the Timeout threshold to more than it but not the reverse; the
> enumeration can go from Finished = false to Finished = true but not the
> reverse), I don't think this is a problem, because the worst that will
> happen is that the method will wait through one more sleep cycle than
> was strictly necessary before responding to the caller. Does that make
> sense?

Without "volatile", you could loop through any number of sleep cycles,
even indefinitely, depending on the compiler optimizations applied to
the code.

Basically:

-- Any mutable state at a minimum needs to be marked "volatile".

-- If the mutable state depends on non-atomic operations or is
modified based on its own value or some other mutable state, you need
full synchronization. (For example, the Count property of your list
changes according to a non-atomic operation, and you can't make the data
structure the Count property depends on volatile in any case, so you
have to lock around the retrieval of the Count property value).

-- A corollary to the above is that anything you can make
immutable, you should. For example, your "Mask" and "Root" properties,
why allow these to be mutable at all? Just back them with a readonly
field, initialized at object construction only. And since your code
doesn't really handle mutation of the "Timeout" property, I'd say
there's a good argument for making that immutable as well (require it to
be initialized during construction of the object).

-- And a general rule of locking: don't lock on "this". Use an
explicit sentinel object for the specific purpose.
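Taken together, those rules might look like this in a stripped-down sketch (hypothetical names, not the actual LazyFileList members):

```csharp
using System;

class Example
{
    // A dedicated, private lock object: client code can't contend on it
    // the way it could if we locked on "this".
    private readonly object sync = new object();

    // Immutable state: readonly field, set once at construction.
    private readonly string root;

    // Mutable flag: volatile so an update in one thread is visible
    // to reads in another.
    private volatile bool _finished;

    public Example(string root) { this.root = root; }

    public string Root { get { return root; } }

    public bool Finished
    {
        get { return _finished; }
        set { _finished = value; }
    }

    public void Mutate(Action work)
    {
        lock (sync)   // full synchronization for non-atomic updates
        {
            work();
        }
    }

    static void Main()
    {
        var e = new Example(@"C:\temp");
        e.Finished = true;
        Console.WriteLine(e.Finished && e.Root == @"C:\temp");   // True
    }
}
```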

I would also not bother checking the "Finished" property except in
between the more time-consuming operations. That is, any time you
return from a GetDirectories() or GetFiles() method call, but not after
the AddFile() call (i.e. in the loop that calls that). Statistically,
if the flag gets set (by the client or timeout), it's nearly always
going to occur while the code's waiting for GetDirectories() or
GetFiles() to return anyway, and only checking at those points will keep
the code simpler.

From an API point of view, IMHO it makes more sense for your Timeout
property to be a TimeSpan. This is especially true given that the way
you use it is basically as a TimeSpan anyway. But in any case, given
that the value is in milliseconds as it stands now, the code would IMHO
be more readable if you used "new TimeSpan(0, 0, 0, 0, list.Timeout)"
instead of "new TimeSpan(10000 * list.Timeout)". (Yeah, I know that's
as much typing, but at least there's no "magic conversion" going on in
the visible code :) ).
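As a quick check, the two spellings are equivalent, and TimeSpan.FromMilliseconds (a third alternative not mentioned above) may read best of all:

```csharp
using System;

class TimeoutDemo
{
    static void Main()
    {
        int timeoutMs = 500;

        // 1 millisecond == 10,000 ticks (TimeSpan.TicksPerMillisecond)
        TimeSpan byTicks  = new TimeSpan(10000L * timeoutMs);
        TimeSpan byParts  = new TimeSpan(0, 0, 0, 0, timeoutMs);   // d, h, m, s, ms
        TimeSpan byHelper = TimeSpan.FromMilliseconds(timeoutMs);

        Console.WriteLine(byTicks == byParts && byParts == byHelper);   // True
    }
}
```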

Finally, I find the enumerator code a bit over-complicated, and
inefficient as well. There's no need to poll, since your code knows
exactly when the timeout may expire, and you can signal to the code when
to wake up for other purposes (i.e. when the client has canceled the
enumeration, or there's new data to return).

I hope you don't mind, I went ahead and made some modifications to your
original code. For the most part, I tried to leave things as I found
them, except for fixing bugs. But two major things I modified in the
process are:

-- I moved the logic for dealing with consuming the file list and
handling timeouts back into the main class. This leaves the enumerator
itself much simpler, and allows the producer and consumer sections of
the code to exist in the same class.

-- I removed the polling aspects of the consumer code, replacing it
with what is IMHO a better solution. Specifically, using the Monitor
class to handle both synchronization and the timeout aspects. This way,
the code only wakes up if a timeout has expired or if there's work to be
done. Otherwise, the thread can remain dormant.

You will also notice some code protected by "TESTING_SUPPORT". I added
that so that I could more easily test the various delay/cancel effects
of the code. I wrote a little test console application to exercise the
class, which responds to key inputs to insert delays in processing or to
cancel the operation altogether. Obviously, none of that code would be
included in a production application.

I also (slightly) changed the semantics of your IEnumerator<T>
implementation. Note that the behavior of the Current property is
documented to be undefined if called at times when MoveNext() hasn't
been called, or has returned "false". We can take advantage of that to
just do something halfway sensible rather than doing a lot of checking
to ensure consistency. Basically, we push the question of correct usage
off onto the client code, rather than loading the enumeration code down
with that logic.

Speaking of semantics, I notice that you've implemented the
IEnumerator.Reset() method. Strictly speaking, it's perfectly fine to
just throw a NotSupportedException. And if you allow that, then your
enumerator implementation could in fact be a single iterator method,
rather than the full-blown interface implementation you've got.

If you need to support Reset(), then of course you have to do that. But
otherwise, the code could be even simpler without it, well beyond the
simple removal of the method. In particular, after my other changes as
well, you could remove the entire LazyFileEnumerator class, and replace
it with this:

public IEnumerator<string> GetEnumerator()
{
    int ifile = 0;
    string file;

    while ((file = NextFile(ifile++)) != null)
    {
        yield return file;
    }
}

...letting the compiler write all the rest of the implementation for you.

All that said, I have copied and pasted the code below, preserving the
Reset() implementation you had before.

Which, by the way, is what you should have done with your original code.
This forum is a public forum, and one of the main benefits of the
forum is that others can learn from the discussions found in it, even
when reading those discussions long after they have taken place. If you
refer to code that is not actually posted in the forum, then that
benefit relies entirely on the external site being hosted indefinitely,
for as long as the forum itself is ever available by archive.

Given that no matter where you host the external content, the lifetime
of the archive is likely to be MUCH longer, it's generally a bad idea to
post messages that involve a discussion about some specific code without
actually including that code in the message. If you want to do some
special formatting, I suppose you might consider posting the external
link _also_. But IMHO you might as well just include comments in the
code that highlight the specific areas you want to point out.

Anyway, I believe that the code I posted has fixed all the threading
bugs your code had, as well as cleaned up a bit the implementation. Of
course, while I believe the code to be correct, it's possible I've
introduced my own bugs. :) Please feel free to ask for any
clarifications or corrections!

Pete


#define TESTING_SUPPORT
using System;
using System.Collections.Generic;
using System.Threading;
using System.IO;

namespace Harlan.IO
{
public enum EnumerationTimeoutActions
{
NoTimeout,
ReportComplete,
ThrowException
}

public class LazyFileList : IEnumerable<string>
{
// Public method.
public void Cancel()
{
Finished = true;
}

#if TESTING_SUPPORT
public bool Delay
{
get { bool fRet = _fDelay; _fDelay = false; return fRet; }
set { _fDelay = value; }
}
private volatile bool _fDelay;
#endif // TESTING_SUPPORT

// Public modifiers

/// <summary>
/// Timeout value in milliseconds
/// </summary>
public int Timeout { get { return _timeout; } }
private readonly int _timeout;

/// <summary>
/// Action to take when timeout expires
/// </summary>
public EnumerationTimeoutActions TimeoutAction
{
get { return _action; }
set { _action = value; }
}
private volatile EnumerationTimeoutActions _action;

// Public accessors
/// <summary>
/// Directory name of root of file search
/// </summary>
public string Root { get { return _root; } }
private readonly string _root;

/// <summary>
/// Filename mask for file search
/// </summary>
public string Mask { get { return _mask; } }
private readonly string _mask;

/// <summary>
/// True if the search has completed or was cancelled
/// </summary>
public bool Finished
{
get { return _finished; }
private set
{
// Locking here ensures that a consumer will not
// get stuck waiting for something to be added to
// the list. Otherwise, it would be possible for
// the producer to just stop working without ever
// pulsing the consumer again.
lock (_objLock)
{
_finished = value;
Monitor.Pulse(_objLock);
}
}
}
// Making this volatile allows the producer to check the variable
// without having to bother locking. It can be read atomically,
// and the producer is just polling it, so additional
// synchronization isn't that useful.
private volatile bool _finished;

public int Count
{
get
{
lock (_objLock)
{
return fileList.Count;
}
}
}

// Private implementation
private List<string> fileList = new List<string>();
private readonly object _objLock = new object();

// Constructors
public LazyFileList(string root) : this(root, null) { }

public LazyFileList(string root, string mask) : this(root,
mask, 10000, EnumerationTimeoutActions.NoTimeout) { }

public LazyFileList(string root, string mask, int timeout,
EnumerationTimeoutActions action)
{
_timeout = timeout;
_action = action;
_root = root;
_mask = mask;

// This is superfluous. Finished is initialized to
// false by default.
// Finished = false;

Thread thread = new Thread(new ThreadStart(Fill));
thread.Start();
}

// File getter
private void Fill()
{
Fill(Root);
Finished = true;
}

private void Fill(string parentDirectory)
{
// Note that you could just make the default Mask value "*",
// using that if the client code passes null, and avoid the
// conditional expression here altogether.
string[] files = (Mask == null) ?
Directory.GetFiles(parentDirectory) :
Directory.GetFiles(parentDirectory, Mask);

if (Finished)
{
return;
}

foreach (string file in files)
{
#if TESTING_SUPPORT
Thread.Sleep(Delay ? 11000 : 250);
#endif // TESTING_SUPPORT
AddFile(file);
}

string[] directories =
Directory.GetDirectories(parentDirectory);

foreach (string directory in directories)
{
if (Finished)
{
return;
}

Fill(directory);
}
}

private void AddFile(string path)
{
lock (_objLock)
{
fileList.Add(path);
Monitor.Pulse(_objLock);
}
}

private string NextFile(int ifile)
{
    lock (_objLock)
    {
        while (fileList.Count <= ifile)
        {
            if (Finished)
            {
                // Producer will never add anything more to the list,
                // so we must be done.
                return null;
            }

            // Otherwise, we need to wait...either indefinitely, or with
            // a timeout, depending on configuration
            if (TimeoutAction == EnumerationTimeoutActions.NoTimeout)
            {
                Monitor.Wait(_objLock);
            }
            else
            {
                // The Wait() can complete either due to a Pulse() or
                // the timeout. In theory, we could use the return value
                // to predict whether the queue now has non-zero length
                // or not, but IMHO the code is clearer and easier to
                // prove correct just checking the length again.

                Monitor.Wait(_objLock, Timeout);

                if (fileList.Count <= ifile)
                {
                    // Finished could be set by the client code calling
                    // Cancel(). Presumably, one does not expect an
                    // exception to be thrown in that case. :)
                    if (Finished || TimeoutAction ==
                        EnumerationTimeoutActions.ReportComplete)
                    {
                        return null;
                    }

                    throw new System.TimeoutException(
                        "File enumeration timed out");
                }
            }
        }

        return fileList[ifile];
    }
}

#region IEnumerable<string> Members

public IEnumerator<string> GetEnumerator()
{
return new LazyFileEnumerator(this);
}

#endregion

#region IEnumerable Members

System.Collections.IEnumerator
System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}

#endregion

public class LazyFileEnumerator : IEnumerator<string>
{
private int index = -1;
private LazyFileList list;
private bool dead = false;
private string _current;

public LazyFileEnumerator(LazyFileList list)
{
this.list = list;
}

#region IEnumerator<string> Members

public string Current
{
get
{
return _current;
}
}

#endregion

#region IDisposable Members

public void Dispose()
{
// Nothing to dispose.
}

#endregion

#region IEnumerator Members

object System.Collections.IEnumerator.Current
{
get { return Current; }
}

public bool MoveNext()
{
if (!dead)
{
string file = list.NextFile(++index);

if (file != null)
{
_current = file;
}
else
{
dead = true;
}
}

return !dead;
}

public void Reset()
{
index = -1;
_current = null;
dead = false;
}

#endregion
}
}
}

Harlan Messinger

Dec 21, 2009, 8:38:41 PM
Peter Duniho wrote:
> Harlan Messinger wrote:
>> [...]
>> Block [2] is where the enumeration retrieves the next file to add to
>> the list. The reason I locked this code was because I thought it might
>> interfere with one of its enumerators looking up a file. But (a)
>> there's only one thread filling the list, so there's no concern about
>> concurrency there, and I didn't lock the code in MoveNext that reads a
>> file from the list. I'm not sure whether I need to lock both blocks of
>> code or don't need to lock either, but I'm pretty sure having just
>> this lock doesn't accomplish anything.
>
> Was there a "(b)"?

Ah. Um. I think (b) was a variant of what I wrote, something to the
effect of "I'm not sure that the adding of an item at the end of the
list interferes with indexed reading of items that were in the list
before it, or with reading its Count property as long as I don't mind
the possibility of an undercount".

>
> You are correct: locking only at the writer accomplishes nothing. You
> need to lock both the writer and reader. Otherwise, the reader may
> observe the collection in an inconsistent state (adding to the
> collection isn't atomic).
>
>> Block [3] in the Current method and Blocks [4] and [5]--essentially,
>> the whole MoveNext method--are where I might also have locks. Besides
>> concern for concurrency with the AddFile method, I note that I've got
>> boolean formulas where I'm using different properties from the
>> LazyFileList such as its Count, its TimeoutAction, and its Finish
>> status that are read at different moments and therefore could, taken
>> together, not represent a consistent state. [...]
>
> Actually, your biggest issue is that because you're using automatic
> properties, the backing field isn't "volatile" and so you can't count on
> the value being consistently visible in one thread after an update in
> some other thread.

Wow. I didn't think about that. I don't suppose (I'm asking instead of
checking it myself since I don't have Visual Studio in front of me right
now) that properties can be marked volatile. If not, obviously I can't
use autoproperties--so I guess implementing them the long way, but based
on explicitly declared volatile variables would work?

>> However, given that progress for the enumeration is unidirectional
>> (time waited so far can go from less than the Timeout threshold to
>> more than it but not the reverse; the enumeration can go from Finished
>> = false to Finished = true but not the reverse), I don't think this is
>> a problem, because the worst that will happen is that the method will
>> wait through one more sleep cycle than was strictly necessary before
>> responding to the caller. Does that make sense?
>
> Without "volatile", you could loop through any number of sleep cycles,
> even indefinitely, depending on the compiler optimizations applied to
> the code.
>
> Basically:
>
> -- Any mutable state at a minimum needs to be marked "volatile".
>
> -- If the mutable state depends on non-atomic operations or is
> modified based on its own value or some other mutable state, you need
> full synchronization. (For example, the Count property of your list
> changes according to a non-atomic operation, and you can't make the data
> structure the Count property depends on volatile in any case, so you
> have to lock around the retrieval of the Count property value).
>
> -- A corollary to the above is that anything you can make immutable,
> you should. For example, your "Mask" and "Root" properties, why allow
> these to be mutable at all? Just back them with a readonly field,
> initialized at object construction only.

This was the source of my question about readonly properties late last
week. OK.

> And since your code doesn't
> really handle mutation of the "Timeout" property, I'd say there's a good
> argument for making that immutable as well (require it to be initialized
> during construction of the object).

Makes sense. I can create a couple more overloads to handle the case
where the user doesn't want to or doesn't know to specify it.


>
> -- And a general rule of locking: don't lock on "this". Use an
> explicit sentinel object for the specific purpose.

I'm sure I read the reason why when I studied threading years ago, but I
can't remember it. Assuming I'm going to create *one* object for
everything to lock on, what's the reason for not using the shared
resource itself?

> I would also not bother checking the "Finished" property except in
> between the more time-consuming operations. That is, any time you
> return from a GetDirectories() or GetFiles() method call, but not after
> the AddFile() call (i.e. in the loop that calls that). Statistically,
> if the flag gets set (by the client or timeout), it's nearly always
> going to occur while the code's waiting for GetDirectories() or
> GetFiles() to return anyway, and only checking at those points will keep
> the code simpler.

I considered that. I was thinking about the case where a directory has
10,000 files, but that's a rarity, and GetFiles returns them in an array
and I *suppose* that loading them into the List would still be
negligible for the purpose at hand. OK, thanks.

> From an API point of view, IMHO it makes more sense for your Timeout
> property to be a TimeSpan. This is especially true given that the way
> you use it is basically as a TimeSpan anyway. But in any case, given
> that the value is in milliseconds as it stands now, the code would IMHO
> be more readable if you used "new TimeSpan(0, 0, 0, 0, list.Timeout)"
> instead of "new TimeSpan(10000 * list.Timeout)". (Yeah, I know that's
> as much typing, but at least there's no "magic conversion" going on in
> the visible code :) ).

I considered TimeSpan, but only really learned about it after I'd
already planned to use milliseconds as the unit and didn't take the time
to retrofit it. It makes sense.

> Finally, I find the enumerator code a bit over-complicated, and
> inefficient as well. There's no need to poll, since your code knows
> exactly when the timeout may expire, and you can signal to the code when
> to wake up for other purposes (i.e. when the client has canceled the
> enumeration, or there's new data to return).

I didn't think of that--you mean, define an event in the enumeration for
which the enumerator can contain a handler?

> I hope you don't mind, I went ahead and made some modifications to your
> original code. For the most part, I tried to leave things as I found
> them, except for fixing bugs. But two major things I modified in the
> process are:

I don't mind at all. I don't think I've told you yet, but you are
remarkably helpful, and give great, detailed explanations, and I hope
the others here appreciate you as much as I do. :-) I'm going to end
here because I need to go, but I'll continue reading this later. Thanks!

Peter Duniho

Dec 21, 2009, 9:21:53 PM
Harlan Messinger wrote:
> [...]

>>> (a) there's only one thread filling the list, so there's no concern
>>> about concurrency there, and I didn't lock the code in MoveNext that
>>> reads a file from the list. I'm not sure whether I need to lock both
>>> blocks of code or don't need to lock either, but I'm pretty sure
>>> having just this lock doesn't accomplish anything.
>>
>> Was there a "(b)"?
>
> Ah. Um. I think (b) was a variant of what I wrote, something to the
> effect of "I'm not sure that the adding of an item at the end of the
> list interferes with indexed reading of items that were in the list
> before it, or with reading its Count property as long as I don't mind
> the possibility of an undercount".

Okay. Your inference might be correct, but it's entirely
implementation-dependent. That is, it assumes things about the
internals of the List<T> class that may or may not be true, and if true
now, may or may not remain true in the future.

For example, there's no guarantee about whether the Count property is
updated before or after the actual new data has been copied to the
List<T>'s internal array. If I were writing List<T> I'd probably update
the Count last, but it's perfectly legal to update it first, in which
case code that examines Count and then does something based on that
(such as retrieving the element at index Count - 1) could fail (the
element might not have been copied to the array yet or, even worse, the
array may need reallocating to let the element fit, but not have been
reallocated yet).

Basically, List<T> doesn't guarantee that the Add() method is atomic, so
you need to synchronize any code that calls Add() with any other code
that may use something in List<T> that depends on the call to Add()
(which for List<T> is practically everything, of course :) ).

> [...]
>> Actually, your biggest issue is that because you're using automatic
>> properties, the backing field isn't "volatile" and so you can't count
>> on the value being consistently visible in one thread after an update
>> in some other thread.
>
> Wow. I didn't think about that. I don't suppose (I'm asking instead of
> checking it myself since I don't have Visual Studio in front of me right
> now) that properties can be marked volatile. If not, obviously I can't
> use autoproperties--so I guess implementing them the long way, but based
> on explicitly declared volatile variables would work?

Right. No, you can't declare a property "volatile", and yes using a
long-hand property with a "volatile" backing variable works (and is how
I did it in my modifications to your code).

> [...]


>> -- And a general rule of locking: don't lock on "this". Use an
>> explicit sentinel object for the specific purpose.
>
> I'm sure I read the reason why when I studied threading years ago, but I
> can't remember it. Assuming I'm going to create *one* object for
> everything to lock on, what's the reason for not using the shared
> resource itself?

The main reason is that if you make a dedicated object instance for the
lock, you can ensure that you have complete control over how that
instance is used for locking. If you use "this", then any client code
may wind up using that reference for object locking as well. At best,
this could create synchronization contention when there was no need, and
at worst it makes it easier to run into a deadlock situation.

Of course, if _all_ code were following the rule of not using "this" for
locking, then you wouldn't have to worry about client code using your
own "this" for its lock. It would use some other object. But you can't
always predict what the client code will do, so it's better to design
such that you aren't reliant on any specific behavior.

>> I would also not bother checking the "Finished" property except in
>> between the more time-consuming operations. That is, any time you
>> return from a GetDirectories() or GetFiles() method call, but not
>> after the AddFile() call (i.e. in the loop that calls that).
>> Statistically, if the flag gets set (by the client or timeout), it's
>> nearly always going to occur while the code's waiting for
>> GetDirectories() or GetFiles() to return anyway, and only checking at
>> those points will keep the code simpler.
>
> I considered that. I was thinking about the case where a directory has
> 10,000 files, but that's a rarity, and GetFiles returns them in an array
> and I *suppose* that loading them into the List would still be
> negligible for the purpose at hand. OK, thanks.

Note that even if you have a directory with 10,000 files, statistically
you are still much more likely to have the "Finished" property wind up
getting set during that call to GetFiles(), rather than during your own
code's processing of it, because the 10,000 files causes the GetFiles()
method to take longer to return as well.

I suspect that at least some of the overhead is constant for the method
call, and of course that overhead would be amortized over all the files.
So the relative proportion might wind up slightly less biased in favor
of your code waiting on the GetFiles() method call when "Finished" is
set. But I doubt it gets biased enough to make a significant
difference. Adding 10,000 elements from an array to a List<T> is
_extremely_ fast.

(So fast, in fact, that if you are especially concerned about
efficiency, you might find it worthwhile to change the adding logic so
that you pass the entire array to the adding method, so that you can
lock just once per directory, reducing thread-switching and
synchronization contention).
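That batching idea might look something like this (a hypothetical AddFiles variant, not code from the thread):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class BatchAdd
{
    private readonly List<string> fileList = new List<string>();
    private readonly object _objLock = new object();

    // Hypothetical variant of AddFile(): one lock acquisition per
    // directory's worth of files, rather than one per file.
    public void AddFiles(string[] paths)
    {
        lock (_objLock)
        {
            fileList.AddRange(paths);
            Monitor.Pulse(_objLock);   // one wake-up for the whole batch
        }
    }

    public int Count
    {
        get { lock (_objLock) return fileList.Count; }
    }

    static void Main()
    {
        var b = new BatchAdd();
        b.AddFiles(new[] { "a.txt", "b.txt", "c.txt" });
        Console.WriteLine(b.Count);   // 3
    }
}
```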

> [...]


>> Finally, I find the enumerator code a bit over-complicated, and
>> inefficient as well. There's no need to poll, since your code knows
>> exactly when the timeout may expire, and you can signal to the code
>> when to wake up for other purposes (i.e. when the client has canceled
>> the enumeration, or there's new data to return).
>
> I didn't think of that--you mean, define an event in the enumeration for
> which the enumerator can contain a handler?

No. An event can be used for asynchronous processing, of course. But
in that case, you still wind up needing some kind of underlying
mechanism to support that. What I'm talking about is more along the
lines of that underlying mechanism, rather than the higher-level view
the code sees.

Hopefully the code I posted illustrates what I mean. The short version
is this: the "lock" statement in C# is just a shorthand way of using the
Monitor class in an exception-safe way. So you can use the Monitor
class within a "lock" statement, just as if you'd called Monitor.Enter()
explicitly.

The Monitor class has a Wait() method that includes a timeout. That
overload of the method will wait until some other thread signals it by
calling the Pulse() method, or until the timeout expires, whichever
comes first.

So, your consumer code (the enumerator), when it tries to get another
filename to return, rather than sitting there in a loop checking state
variables, can just call the Wait() method instead with the appropriate
timeout. If something changes that the consumer code needs to know
about, the code making that change would call the Pulse() method, so
that the consumer then can wake up and continue processing. Otherwise,
the consumer code will stay dormant until the timeout expires, at which
time it then will still wake up and continue with processing.

Depending on the state of things after it wakes up, the consumer code
can then make a decision as to whether it can return a valid string, or
should terminate the enumeration, or should throw an exception.
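Stripped to its essentials, the pattern looks like this (a minimal sketch with hypothetical names, not the LazyFileList code itself):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class PulseDemo
{
    static readonly object sync = new object();
    static readonly Queue<string> queue = new Queue<string>();

    static void Main()
    {
        // Producer: adds an item after a short delay, then signals.
        var producer = new Thread(() =>
        {
            Thread.Sleep(100);
            lock (sync)
            {
                queue.Enqueue("hello");
                Monitor.Pulse(sync);   // wake the waiting consumer
            }
        });
        producer.Start();

        // Consumer: no polling loop. Wait() releases the lock while
        // dormant and reacquires it on wake-up (Pulse or timeout).
        lock (sync)
        {
            while (queue.Count == 0)
            {
                if (!Monitor.Wait(sync, 5000))
                    throw new TimeoutException("No item arrived in time");
            }
            Console.WriteLine(queue.Dequeue());   // hello
        }

        producer.Join();
    }
}
```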

>> I hope you don't mind, I went ahead and made some modifications to
>> your original code. For the most part, I tried to leave things as I
>> found them, except for fixing bugs. But two major things I modified
>> in the process are:
>
> I don't mind at all. I don't think I've told you yet, but you are
> remarkably helpful, and give great, detailed explanations, and I hope
> the others here appreciate you as much as I do. :-) I'm going to end
> here because I need to go, but I'll continue reading this later. Thanks!

You're welcome. I realize not everyone is willing to believe this, but
I really do intend for everything I write to be helpful in some way.
I'm glad that you find my input helpful and appreciate your thanks. :)

Pete

Harlan Messinger
Dec 22, 2009, 4:57:51 PM
Peter Duniho wrote:
>[...]

> I hope you don't mind, I went ahead and made some modifications to your
> original code. For the most part, I tried to leave things as I found
> them, except for fixing bugs. But two major things I modified in the
> process are:
>
> -- I moved the logic for dealing with consuming the file list and
> handling timeouts back into the main class. This leaves the enumerator
> itself much simpler, and allows the producer and consumer sections of
> the code to exist in the same class.

Perhaps this is a judgment call, but this surprises me. Is leaving the
enumerator simpler a goal? More so than leaving the enumerable simpler?
I figure that the business of the enumerator is to enumerate, and to
know how to do it. If I tell the enumerable, "I want to enumerate you",
it spawns an enumerator as the means by which I can do so. That's why
nested classes exist, isn't it--to give classes the ability to model
part of their implementation via helper classes privy to their internal
workings, and that can be farmed out for use by clients, instead of
being restricted to data members and properties and methods?

> -- I removed the polling aspects of the consumer code, replacing it
> with what is IMHO a better solution. Specifically, using the Monitor
> class to handle both synchronization and the timeout aspects. This way,
> the code only wakes up if a timeout has expired or if there's work to be
> done. Otherwise, the thread can remain dormant.

This is mildly embarrassing. My experience with multithreading and
contention for resources is so long ago I'd forgotten about the most
basic aspect of the lock structure, with a Wait block and
Notify/NotifyAll calls. I don't even remember what language I was
using--Java or C++? Java, I think. "Pulse", that's a funny name. Did
Microsoft choose it just to be different?

> You will also notice some code protected by "TESTING_SUPPORT". I added
> that so that I could more easily test the various delay/cancel effects
> of the code. I wrote a little test console application to exercise the
> class, which responds to key inputs to insert delays in processing or to
> cancel the operation altogether. Obviously, none of that code would be
> included in a production application.
>
> I also (slightly) changed the semantics of your IEnumerator<T>
> implementation. Note that the behavior of the Current property is
> documented to be undefined if called at times when MoveNext() hasn't
> been called, or has returned "false". We can take advantage of that to
> just do something halfway sensible rather than doing a lot of checking
> to ensure consistency. Basically, we push the question of correct usage
> off onto the client code, rather than loading the enumeration code down
> with that logic.

I don't see how this is "loading it down" with that logic. If the
enumerator is asked to provide the Current item when it isn't on a
Current item, isn't it sensible to assume that the developer programmed
his client incorrectly and needs to have this pointed out to him? The
spec doesn't require this, but I don't see any reason to return a result
that leaves the client not knowing that something was wrong. Well, I'll
split the difference with you: returning null is one way to do this, and
that's what your approach does before MoveNext is ever called. But then
it looks like Current will continue to return the last actual item in
the list even after the enumerator is dead. After

dead = true;

did you mean to have

_current = null;

?

>
> Speaking of semantics, I notice that you've implemented the
> IEnumerator.Reset() method. Strictly speaking, it's perfectly fine to
> just throw a NotSupportException. And if you allow that, then your
> enumerator implementation could in fact be a single iterator method,
> rather than the full-blown interface implementation you've got.

I hadn't realized this and disagreed at first. Normally I assume that
the default NotSupportedException that VS puts into its method stubs is
there not to warn the developer of a calling procedure that he has
called an optional method that isn't implemented and shouldn't be
called, but to warn him that he forgot an implementation that is
promised under the contract embodied by the interface. But now I've come
across a reference that explains that Reset exists only for COM
interoperability. So I guess it's a special case. And I don't mind not
implementing it, because I didn't even like seeing it there. The normal
paradigm for using an enumerator is that you use it up and throw it
away, and if you need to enumerate again, you get yourself a new enumerator.

> If you need to support Reset(), then of course you have to do that. But
> otherwise, the code could be even simpler without it, well beyond the
> simple removal of the method. In particular, after my other changes as
> well, you could remove the entire LazyFileEnumerator class, and replace
> it with this:
>
> public IEnumerator<string> GetEnumerator()
> {
>     int ifile = 0;
>     string file;
>
>     while ((file = NextFile(ifile++)) != null)
>     {
>         yield return file;
>     }
> }
>
> ...letting the compiler write all the rest of the implementation for you.

Out of curiosity, what happens if I have

IEnumerator<string> e = GetEnumerator();
e.MoveNext();
e.Reset();

?

OK. Of course, if I use this approach, then my comments above about
whether the determination of the "next" item is in the producer or the
consumer is eliminated, since there is no longer an explicitly defined
consumer where I could put that code.

> All that said, I have copied and pasted the code below, preserving the
> Reset() implementation you had before.
>
> Which, by the way, is what you should have done with your original code.
> This forum is a public forum, and one of the main benefits of the forum
> is so that others can learn from the discussions found in it, even when
> reading those discussions long after they have taken place. If you
> refer to code that is not actually posted in the forum, then that
> benefit relies entirely on the external site being hosted indefinitely,
> for as long as the forum itself is ever available by archive.

You should visit comp.infosystems.www.authoring.html, where a typical
(and sometimes justified, sometimes not) reaction to code included in the
posting is, "What, are we supposed to copy and paste this onto our own
website so we can see what your problem is? Put this page on your
website and let us look at it!" Normally, outside of c.i.w.a.*, I do
post code directly in my messages, but this was such a long stretch I
thought it might be easier to reference it from a separate space. I
really, really wish Usenet didn't break lines at 80 characters (or
however many it is) because that makes it so difficult to present and
discuss code. I see your point (though in this case my example can be
there for a very long time because I have no reason to get rid of it).
>[...]

I prefer explicit initialization. It is clearer, and I prefer consistent
treatment regardless of the initial value over the likes of

int x; // which I will rely on later to be 0
int y = 60;
bool foo1; //which I will rely on later to be false
bool foo2 = true;
bool foo3; //which I will rely on later to be false

Besides, after all the programming languages I've been through over the
years, each with its own rules, I prefer not to have to remember what a
type's default value is, and in what cases a variable is left undefined!
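For the record, the two cases even follow different rules in C#: fields are zero-initialized, while locals must be definitely assigned before use. A quick sketch:

```csharp
using System;

class Defaults
{
    int x;        // fields are zero-initialized: x == 0
    bool flag;    // flag == false

    public void M()
    {
        int y;
        // Console.WriteLine(y);  // compile error: use of unassigned local
        int z = 0;                // explicit, and required for locals anyway
        Console.WriteLine(x + z);
        Console.WriteLine(flag);
    }
}
```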

>
>     Thread thread = new Thread(new ThreadStart(Fill));
>     thread.Start();
> }
>
> // File getter
> private void Fill()
> {
>     Fill(Root);
>     Finished = true;
> }
>
> private void Fill(string parentDirectory)
> {
>     // Note that you could just make the default Mask value "*",
>     // using that if the client code passes null, and avoid the
>     // conditional expression here altogether.
>     string[] files = (Mask == null) ?
>         Directory.GetFiles(parentDirectory) :
>         Directory.GetFiles(parentDirectory, Mask);

Does * work? It seems to me that there have been occasions over the
years where * has been interpreted as "all files without extensions". I
would have defaulted to "*.*", but I'm not sure how that deals with
files called, for example, foo.aspx.cs, or files that DON'T have an
extension.

Am I misreading this, or are you never setting a value to return in the
NoTimeout case?

Did you mean to have

_current = null;

here?

Peter Duniho
Dec 22, 2009, 6:03:07 PM
Harlan Messinger wrote:
> Peter Duniho wrote:
>> [...]
>> I hope you don't mind, I went ahead and made some modifications to
>> your original code. For the most part, I tried to leave things as I
>> found them, except for fixing bugs. But two major things I modified
>> in the process are:
>>
>> -- I moved the logic for dealing with consuming the file list and
>> handling timeouts back into the main class. This leaves the
>> enumerator itself much simpler, and allows the producer and consumer
>> sections of the code to exist in the same class.
>
> Perhaps this is a judgment call, but this surprises me. Is leaving the
> enumerator simpler a goal? More so than leaving the enumerable simpler?

I agree it's a judgment call. But let me justify my judgment to you. :)

> I figure that the business of the enumerator is to enumerate, and to
> know how to do it.

The phrase "to know how to do it" hides a lot of complex
decision-making. In particular, what is "enumeration" and what is just
"data retrieval"?

In my suggested design, _enumeration_ is simply initializing and then
updating an index. Data retrieval is left to the class where the data
actually exists.

IMHO, the biggest signal in the code that argues in favor of my chosen
design is that the data retrieval code _requires_ intimate access to
members of the outer class. The various settings, finished state, along
with the list itself.

A class in which most of the code simply refers to private fields in a
completely different class is telling me that that class has a bunch of
code that really belongs in that completely different class. Whatever
_doesn't_ rely on the private data from the other class can remain, of
course.

You may say that this is not a hard-and-fast rule, and I would agree.
There are definitely times when it might be beneficial to allow a nested
class to use private members from an outer class. However, IMHO those
situations are rare, and generally involve private members created for
the _express_ purpose of the nested class. Private members that are
part of the primary implementation of the outer class should remain
unused by any other class.

In fact, you'll see this division in the code I posted. There is in
fact _a_ private member (the "NextFile()" method) of the outer class
accessed by the nested class. It's private, because it's inappropriate
for any code outside the outer class to call that method. But the rest
of the private members, the ones that have to be in the outer class, and
which are regularly accessed by code in the outer class, remain private
and isolated to the code in the outer class.

IMHO this division enhances encapsulation, mainly by not spreading
access to shared data structures (such as the list in this case) across
multiple classes. There's a single point of access from the OO point of
view. Providing that bottleneck is an important hallmark of OO design.

> If I tell the enumerable, "I want to enumerate you",
> it spawns an enumerator as the means by which I can do so. That's why
> nested classes exist, isn't it--to give classes the ability to model
> part of their implementation via helper classes privy to their internal
> workings, and that can be farmed out for use by clients, instead of
> being restricted to data members and properties and methods?

I think nested classes exist for a variety of reasons, and in fact I
think that the most important reason they exist is to provide for
private interface implementations. This can be done without any access
to the private members of the outer class at all, in many cases.
Instead, the goal is to force consumers of the interface to go through
the outer class, rather than instantiating the nested class directly.

Access to the private members of the outer class should be done rarely
and judiciously. The nested class should still be thought of as a class
unto itself, rather than an extension of the outer class.

All IMHO, of course.
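As a sketch of what I mean (a made-up minimal example, not the LazyFileList code):

```csharp
using System.Collections;
using System.Collections.Generic;

public class Sequence : IEnumerable<int>
{
    private readonly int[] _data = { 1, 2, 3 };

    // Consumers must go through GetEnumerator(); they can never
    // instantiate the nested enumerator type directly.
    public IEnumerator<int> GetEnumerator()
    {
        return new Enumerator(_data);
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // Private nested implementation: note it takes only the data it
    // needs, rather than reaching into the outer class's fields.
    private class Enumerator : IEnumerator<int>
    {
        private readonly int[] _data;
        private int _index = -1;

        public Enumerator(int[] data) { _data = data; }

        public int Current { get { return _data[_index]; } }
        object IEnumerator.Current { get { return Current; } }

        public bool MoveNext() { return ++_index < _data.Length; }
        public void Reset() { _index = -1; }
        public void Dispose() { }
    }
}
```

The nested class here is private purely to force clients through the outer class, yet it remains a self-contained class in its own right.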

>> [...] Basically, we push the question of

>> correct usage off onto the client code, rather than loading the
>> enumeration code down with that logic.
>
> I don't see how this is "loading it down" with that logic. If the
> enumerator is asked to provide the Current item when it isn't on a
> Current item, isn't it sensible to assume that the developer programmed
> his client incorrectly

No need to assume. It's hard evidence of that fact.

> and needs to have this pointed out to him?

That's up to you. The truth is, the enumerator will practically never
be used in an explicit way. They almost always are used in "foreach" or
LINQ expressions, where there's no chance of misuse.

It should be a very minor convenience to a programmer using the
enumerator explicitly to have the Current property throw an exception
rather than just returning some poorly-defined value, and personally it
is my opinion that writing the code to detect that case and throw the
exception is a waste of code. Especially since the documentation of the
interface specifically says that the behavior of the Current property in
that scenario is "undefined" (which is programmer-speak for "you can do
whatever you want in this situation").

Note also that by throwing an exception, you may turn what is a simple,
benign, off-by-one error, into a program crash. It depends on whether
the programmer tests their code well enough to run into the exception
before they deliver the code to the customer. Sometimes a program crash
is actually more desirable, but sometimes it's not.

Finally, note that the more code you write, the more code there is that
might have a bug in it. All else being equal, my preference is to write
less code.

> [...] But then


> it looks like Current will continue to return the last actual item in
> the list even after the enumerator is dead. After
>
> dead = true;
>
> did you mean to have
>
> _current = null;

Nope. Again, since the behavior in that situation is "undefined", my
general rule is "don't write any more code than one has to". I don't
care what Current returns when "dead" is true, so I don't bother to set
it to anything.

>> Speaking of semantics, I notice that you've implemented the
>> IEnumerator.Reset() method. Strictly speaking, it's perfectly fine to
>> just throw a NotSupportedException. And if you allow that, then your
>> enumerator implementation could in fact be a single iterator method,
>> rather than the full-blown interface implementation you've got.
>
> I hadn't realized this and disagreed at first. Normally I assume that
> the default NotSupportedException that VS puts into its method stubs is
> there not to warn the developer of a calling procedure that he has
> called an optional method that isn't implemented and shouldn't be
> called, but to warn him that he forgot an implementation that is
> promised under the contract embodied by the interface.

I think you are confusing "NotSupportedException" with
"NotImplementedException". The latter is what VS inserts, not the former.

> But now I've come
> across a reference that explains that Reset exists only for COM
> interoperability. So I guess it's a special case. And I don't mind not
> implementing it, because I didn't even like seeing it there. The normal
> paradigm for using an enumerator is that you use it up and throw it
> away, and if you need to enumerate again, you get yourself a new
> enumerator.

Exactly. And the simpler your enumerator is, the easier it is to
justify that approach. :)

>> If you need to support Reset(), then of course you have to do that.
>> But otherwise, the code could be even simpler without it, well beyond
>> the simple removal of the method. In particular, after my other
>> changes as well, you could remove the entire LazyFileEnumerator class,
>> and replace it with this:
>>
>> public IEnumerator<string> GetEnumerator()
>> {
>>     int ifile = 0;
>>     string file;
>>
>>     while ((file = NextFile(ifile++)) != null)
>>     {
>>         yield return file;
>>     }
>> }
>>
>> ...letting the compiler write all the rest of the implementation for you.
>
> Out of curiosity, what happens if I have
>
> IEnumerator<string> e = GetEnumerator();
> e.MoveNext();
> e.Reset();

Well, what happened when you tried it? :)

You will find that, per the interface specification, the
iterator-generated enumerator just throws NotSupportedException. That's
why I suggested that implementation only if it's okay for the enumerator
to not support the Reset() method.
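A tiny test (hypothetical iterator method) shows the behavior:

```csharp
using System;
using System.Collections.Generic;

class ResetDemo
{
    // A minimal iterator method; the compiler generates the enumerator.
    static IEnumerator<int> Numbers()
    {
        yield return 1;
        yield return 2;
    }

    static void Main()
    {
        IEnumerator<int> e = Numbers();
        e.MoveNext();
        try
        {
            e.Reset();
        }
        catch (NotSupportedException)
        {
            // The compiler-generated enumerator always lands here.
            Console.WriteLine("Reset() is not supported on iterator-generated enumerators.");
        }
    }
}
```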

> OK. Of course, if I use this approach, then my comments above about
> whether the determination of the "next" item is in the producer or the
> consumer is eliminated, since there is no longer an explicitly defined
> consumer where I could put that code.

Very true. :) In fact, IMHO the fact that moving the code in
"NextFile()" out to the outer class allows for this drastic
simplification of the enumerator implementation itself is yet another
argument in favor of the general rule I've suggested. By keeping the
code isolated, you've so greatly simplified the interaction between two
different classes that one class can just disappear.

That is not always going to be the outcome, but in general I think you
will find that following the rule will in fact result in _some_ kind of
simplification in the class design and implementation. And IMHO that's
the whole goal. Make those connections between classes as simple and
independent as possible, and the code will be easier to write, and
easier to maintain.

>> All that said, I have copied and pasted the code below, preserving the
>> Reset() implementation you had before.
>>
>> Which, by the way, is what you should have done with your original
>> code. This forum is a public forum, and one of the main benefits of
>> the forum is so that others can learn from the discussions found in
>> it, even when reading those discussions long after they have taken
>> place. If you refer to code that is not actually posted in the forum,
>> then that benefit relies entirely on the external site being hosted
>> indefinitely, for as long as the forum itself is ever available by
>> archive.
>
> You should visit comp.infosystems.www.authoring.html, where a typical
> (and sometimes justified, sometimes not) reaction to code included in the
> posting is, "What, are we supposed to copy and paste this onto our own
> website so we can see what your problem is? Put this page on your
> website and let us look at it!"

Well, a) that's a completely different group of people. And b) judging
from the name of the newsgroup, the "code" being discussed there is
immediately renderable in any standard browser, and yet can still be
immediately viewed in said browser as well. That is, it's HTML.

That's quite a different situation from C# code, which requires
compilation, and even that step is further complicated if you've
included text like "[1]", "[2]", etc. in the code that isn't even valid
C# code.

Regardless, even in the context of the HTML newsgroup I would disagree
that _generally_ one should provide a link rather than source, and for
the same reasons I've made that comment regarding C# code.
Specifically, how is anyone who comes along a few years later and reads
a discussion about how to use some particular technique or fix some
particular problem supposed to have any context or see any examples of
what people are actually talking about?

> Normally, outside of c.i.w.a.*, I do
> post code directly in my messages, but this was such a long stretch I
> thought it might be easier to reference it from a separate space. I
> really, really wish Usenet didn't break lines at 80 characters (or
> however many it is) because that makes it so difficult to present and
> discuss code.

No doubt there are difficulties associated with posting code to a
newsgroup, line breaks certainly being one of them. But IMHO those
difficulties are far outweighed by the advantages of keeping the entire
discussion together.

If Google Groups would archive linked pages as well as the original
newsgroup posts, then that wouldn't be so much an issue. But they
don't, and AFAIK it's not necessarily even practical or feasible to do that.

> I see your point (though in this case my example can be
> there for a very long time because I have no reason to get rid of it).

It will be wonderful if you can continue to host that link for a long
time. But I suspect that no matter what, the lifetime of that link will
be significantly less than the lifetime of this discussion on a
newsgroup archive.

Of course, if the lifetime of both far exceeds the lifetime of common
usage of C# (I would guess that by 20 or 30 years from now, C# would be
viewed as archaic), then it's probably not a big problem. :)

>> [...]
> [...]


>> private void Fill(string parentDirectory)
>> {
>> // Note that you could just make the default Mask value "*",
>> // using that if the client code passes null, and avoid the
>> // conditional expression here altogether.
>> string[] files = (Mask == null) ?
>>     Directory.GetFiles(parentDirectory) :
>>     Directory.GetFiles(parentDirectory, Mask);
>
> Does * work? It seems to me that there have been occasions over the
> years where * has been interpreted as "all files without extensions". I
> would have defaulted to "*.*", but I'm not sure how that deals with
> files called, for example, foo.aspx.cs, or files that DON'T have an
> extension.

Well, from the documentation of the Directory.GetFiles(string) overload:

This method is identical to GetFiles with
the asterisk (*) specified as the search
pattern.

If you find that my advice doesn't work, you should file a bug report. :)
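If you want to convince yourself, a quick sketch comparing the overloads (point it at any directory you like; the directory choice here is arbitrary):

```csharp
using System;
using System.IO;

class MaskCheck
{
    static void Main()
    {
        string dir = Path.GetTempPath();
        // Per the documentation, these two calls are equivalent:
        string[] all     = Directory.GetFiles(dir);
        string[] star    = Directory.GetFiles(dir, "*");
        // "*.*" in practice also matches extensionless files, for
        // historical Win32 reasons, though it reads as "name.ext".
        string[] starDot = Directory.GetFiles(dir, "*.*");
        Console.WriteLine("{0} / {1} / {2}",
            all.Length, star.Length, starDot.Length);
    }
}
```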

> [...]


>> if (TimeoutAction == EnumerationTimeoutActions.NoTimeout)
>> {
>>     Monitor.Wait(_objLock);
>> }
>
> Am I misreading this, or are you never setting a value to return in the
> NoTimeout case?

Follow the code execution path. And remember, the C# compiler won't let
you fail to return a value from a method. :)

Since I only have one return statement in the entire method, it should
be apparent that the method will either return the value of interest, or
exit the method by some other means. And of course, by "by some other
means" I really mean "by throwing an exception". :)

Pete

Harlan Messinger
Dec 29, 2009, 12:51:14 PM
Peter Duniho wrote:
> If you need to support Reset(), then of course you have to do that. But
> otherwise, the code could be even simpler without it, well beyond the
> simple removal of the method. In particular, after my other changes as
> well, you could remove the entire LazyFileEnumerator class, and replace
> it with this:
>
> public IEnumerator<string> GetEnumerator()
> {
>     int ifile = 0;
>     string file;
>
>     while ((file = NextFile(ifile++)) != null)
>     {
>         yield return file;
>     }
> }
>
> ...letting the compiler write all the rest of the implementation for you.

What this tells me is that the compiler goes in and sees this code that
defines a method and, solely based on the fact that it returns an
IEnumerator<string> and has a yield statement in it, knows to turn it
into a factory that will generate a distinct object derived from
IEnumerator<string> that has data members that correspond to the
method's automatic variables, and that will turn the code inside-out so
that whatever happens between successive invocations of yield will
become the body of MoveNext(), and Current() will return the same thing
as the yield return. In other words,

public class AnonymousClass : IEnumerator<string>
{
    private int ifile = 0;
    private string file;
    private LazyFileList myEnumeration;
    public AnonymousClass(LazyFileList list)
    {
        myEnumeration = list;
    }
    public bool GetNext()
    {
        return (file = NextFile(ifile++)) != null;
    }
    public string Current()
    {
        return file;
    }
}

[inside LazyFileList.GetEnumerator]
AnonymousClass anonymousThing = new AnonymousClass(this);

I gather that this is supposed to be what happens, but it seems so
bizarre I guess I'm just looking for confirmation. Is this what happens?
It must be, but I can't imagine the compiler figuring out that, for example

file = NextFile(ifile++);

is what GetNext is supposed to *do* while

file != null

is what GetNext is supposed to *return*, and so on.

Peter Duniho
Dec 29, 2009, 1:15:33 PM
Harlan Messinger wrote:
> What this tells me is that the compiler goes in and sees this code that
> defines a method and, solely based on the fact that it returns an
> IEnumerator<string> and has a yield statement in it,

It can return IEnumerator<T>, IEnumerator, IEnumerable<T>, or
IEnumerable. Any of those types are permitted.

> knows to turn it
> into a factory that will generate a distinct object derived from
> IEnumerator<string> that has data members that correspond to the
> method's automatic variables, and that will turn the code inside-out so
> that whatever happens between successive invocations of yield will
> become the body of MoveNext(), and Current() will return the same thing
> as the yield return.

Basically, yes. But this:

> In other words,
>
> public class AnonymousClass : IEnumerator<string>
> {
>     private int ifile = 0;
>     private string file;
>     private LazyFileList myEnumeration;
>     public AnonymousClass(LazyFileList list)
>     {
>         myEnumeration = list;
>     }
>     public bool GetNext()
>     {
>         return (file = NextFile(ifile++)) != null;
>     }
>     public string Current()
>     {
>         return file;
>     }
> }
>
> [inside LazyFileList.GetEnumerator]
> AnonymousClass anonymousThing = new AnonymousClass(this);

Isn't very much like what the compiler _actually_ generates.

In particular, the generated code is actually based very closely on the
original declared iterator method, all contained within the MoveNext()
method implementation in the generated class.

> I gather that this is supposed to be what happens, but it seems so
> bizarre I guess I'm just looking for confirmation. Is this what happens?
> It must be, but I can't imagine the compiler figuring out that, for example
>
> file = NextFile(ifile++);
>
> is what GetNext is supposed to *do* while
>
> file != null
>
> is what GetNext is supposed to *return*, and so on.

It's actually much more complex than you imagine, and yet using much
more basic, simple techniques than you also imagine. :)

The compiler doesn't do any analysis at all with respect to the various
flow control conditions in the method. It keys entirely off of "yield"
and "return" statements.

A "yield return" simply interrupts the flow of the method, using the
return value of the "yield return" statement as the current value for
the enumeration. A "yield break" or just a plain "return" (implicit or
explicit) terminates the enumeration.

Initializing and then later resuming the flow of execution in the code
is handled simply by keeping an extra state variable and judicious use
of "goto" statements.
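To give a rough idea (greatly simplified compared to the compiler's actual output), an iterator that yields two values corresponds to something like:

```csharp
using System.Collections;
using System.Collections.Generic;

// Hand-written approximation of a compiler-generated enumerator for:
//   IEnumerator<int> GetEnumerator() { yield return 1; yield return 2; }
class TwoValues : IEnumerator<int>
{
    private int _state;      // where to resume on the next MoveNext()
    private int _current;

    public int Current { get { return _current; } }
    object IEnumerator.Current { get { return _current; } }

    public bool MoveNext()
    {
        switch (_state)
        {
            case 0: _current = 1; _state = 1; return true;
            case 1: _current = 2; _state = 2; return true;
            default: return false;           // enumeration finished
        }
    }

    public void Reset() { throw new System.NotSupportedException(); }
    public void Dispose() { }
}
```

The real generated code uses the same state-variable idea, just with gotos and labels so that arbitrary flow control in the original method body can be suspended and resumed.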

If you'd like more detailed information…

Raymond Chen wrote a pretty good series of articles on the topic:
http://blogs.msdn.com/oldnewthing/archive/2008/08/12/8849519.aspx
http://blogs.msdn.com/oldnewthing/archive/2008/08/13/8854601.aspx
http://blogs.msdn.com/oldnewthing/archive/2008/08/14/8862242.aspx
http://blogs.msdn.com/oldnewthing/archive/2008/08/15/8868267.aspx

Eric Lippert wrote a series of articles discussing the various
implementation details and design constraints that led to the way
iterators are today in C#:
http://blogs.msdn.com/ericlippert/archive/tags/Iterators/default.aspx

Pete

Harlan Messinger
Dec 29, 2009, 3:55:11 PM
Peter Duniho wrote:
> Harlan Messinger wrote:
>

I'm realizing this is sort of like (I have no idea if you'll be familiar
with this) old-fashioned FORTRAN subroutines with internal "entry
points" (other than the fact that those were static subroutines, since
there was no such thing as OO code).

>
> The compiler doesn't do any analysis at all with respect to the various
> flow control conditions in the method. It keys entirely off of "yield"
> and "return" statements.

After I wrote the above, it occurred to me to just try out a test case,
so I created one, and I see that it's much simpler than I
thought--which means, as with auto-implemented properties, "it's only
good for what it's good for"--which is to say, it's actually quite
valuable, but only in the limited context in which it's valuable! I see,
for example, that my guess that the local variable initializations might
be treated like class initializers is wrong. If I have

// MySimplifiedEnumerable implements IEnumerable<string> and uses
// "yield return" inside its GetEnumerator method
IEnumerable<string> foo = new MySimplifiedEnumerable();
IEnumerator<string> bar = foo.GetEnumerator();

and at the beginning of GetEnumerator I have

int i = this.SomeIntegerProperty;
string s = this.SomeStringProperty;

the keyword "this" refers to the MySimplifiedEnumerable, not to the
IEnumerable<string> that will be generated, and these initializations
don't happen until the first time MoveNext is called, which means i and
s will have values based on the state of the MySimplifiedEnumerable at
the time MoveNext is first called, and not at the time the enumerator
was constructed!

I understand that for "foreach" use, there is no explicit construction,
or explicit call to MoveNext or Current, anyway, so this is adequate for
that use, which is to say, for most uses. But in any case where I need
something to happen specifically at the time an enumerator is created,
this approach won't do.
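Here's a stripped-down sketch of what I mean (a hypothetical class, not my LazyFileList):

```csharp
using System;
using System.Collections.Generic;

class Holder
{
    public string Name { get; set; }

    // 'Name' is not read here; the body only runs once MoveNext() is called.
    public IEnumerator<string> GetEnumerator()
    {
        string prefix = Name;
        yield return prefix + ".1";
    }
}

class Demo
{
    static void Main()
    {
        var h = new Holder { Name = "first" };
        IEnumerator<string> e = h.GetEnumerator();
        h.Name = "second";             // changed before the first MoveNext()
        e.MoveNext();
        Console.WriteLine(e.Current);  // prints "second.1", not "first.1"
    }
}
```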

>
> A "yield return" simply interrupts the flow of the method, using the
> return value of the "yield return" statement as the current value for
> the enumeration. A "yield break" or just a plain "return" (implicit or
> explicit) terminates the enumeration.
>
> Initializing and then later resuming the flow of execution in the code
> is handled simply by keeping an extra state variable and judicious use
> of "goto" statements.
>
> If you'd like more detailed information…
>
> Raymond Chen wrote a pretty good series of articles on the topic:
> http://blogs.msdn.com/oldnewthing/archive/2008/08/12/8849519.aspx
> http://blogs.msdn.com/oldnewthing/archive/2008/08/13/8854601.aspx
> http://blogs.msdn.com/oldnewthing/archive/2008/08/14/8862242.aspx
> http://blogs.msdn.com/oldnewthing/archive/2008/08/15/8868267.aspx
>
> Eric Lippert wrote a series of articles discussing the various
> implementation details and design constraints that led to the way
> iterators are today in C#:
> http://blogs.msdn.com/ericlippert/archive/tags/Iterators/default.aspx

Thanks for the leads!

Peter Duniho
Dec 29, 2009, 4:18:45 PM
Harlan Messinger wrote:
> [...] If I have

>
> // MySimplifiedEnumerable implements IEnumerable<string> and uses
> // "yield return" inside its GetEnumerator method
> IEnumerable<string> foo = new MySimplifiedEnumerable();
> IEnumerator<string> bar = foo.GetEnumerator();
>
> and at the beginning of GetEnumerator I have
>
> int i = this.SomeIntegerProperty;
> string s = this.SomeStringProperty;
>
> the keyword "this" refers to the MySimplifiedEnumerable, not to the
> IEnumerable<string> that will be generated, and these initializations
> don't happen until the first time MoveNext is called, which means i and
> s will have values based on the state of the MySimplifiedEnumerable at
> the time MoveNext is first called, and not at the time the enumerator
> was constructed!

Well, being variables that are hoisted out into the compiler-generated
class, they still have values at the time the enumerator is constructed
(their default values). But, since they are local variables in your
iterator method, it doesn't matter. There's no way for any other code
to depend on them until the code in the iterator method that may use
them has a chance to execute (and even then, it requires some
contortions to "let them out", so to speak).

Now, if they were instance fields in the class implementing IEnumerable,
that would be different, and yes, the state of those fields is modified
in an implementation-dependent way, and not necessarily at the time the
IEnumerator is actually created.
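
To make the hoisting concrete, here is a hand-rolled approximation of the state machine the compiler generates for such an iterator method. This is a sketch only: the real generated class has unpronounceable names, and the cut-down Blah and all field names here are invented for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

class Blah
{
    public string currentName;
    public string[] items = { "four", "score" };
}

// Hand-rolled approximation of the class the compiler generates for an
// iterator method; the generated names differ, but the shape is this.
class BlahEnumerator : IEnumerator<string>
{
    int _state;              // 0 = body not started yet, 1 = running
    string _current;
    readonly Blah _parent;   // the captured "this"
    string _myName;          // hoisted local: stays null until first MoveNext
    int _index;              // hoisted local

    public BlahEnumerator(Blah parent) { _parent = parent; }

    public string Current { get { return _current; } }
    object IEnumerator.Current { get { return _current; } }

    public bool MoveNext()
    {
        if (_state == 0)
        {
            // The statements at the top of the iterator body run HERE,
            // on the first MoveNext call, not in the constructor:
            _myName = _parent.currentName;
            _state = 1;
        }
        if (_index < _parent.items.Length)
        {
            _current = _myName + " " + _parent.items[_index++];
            return true;
        }
        return false;
    }

    public void Reset() { throw new NotSupportedException(); }
    public void Dispose() { }
}

class Demo
{
    static void Main()
    {
        Blah blah = new Blah();
        blah.currentName = "foo1";
        IEnumerator<string> e = new BlahEnumerator(blah); // _myName still null
        blah.currentName = "foo2";
        e.MoveNext();
        Console.WriteLine(e.Current); // prints "foo2 four", not "foo1 four"
    }
}
```

Since _myName is only a field of the enumerator object, nothing outside the state machine can observe it before MoveNext assigns it, which is why the deferral is harmless for locals.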

> I understand that for "foreach" use, there is no explicit construction,
> or explicit call to MoveNext or Current, anyway, so this is adequate for
> that use, which is to say, for most uses. But in any case where I need
> something to happen specifically at the time an enumerator is created,
> this approach won't do.

I would be curious to see a code example where it makes sense for you to
have some side-effect that happens as part of the creation of the
IEnumerator instance. I would think that would be a design flaw, rather
than something positive. Under what circumstances would that actually
be a good thing, and why?

Pete

Harlan Messinger

unread,
Dec 29, 2009, 4:33:36 PM12/29/09
to

In the code below, with two enumerable instances, and two enumerator
instances extracted from each, I have each enumerator take a name from
its parent enumerable to be used as a prefix when it emits each of its
values. However, the output shows that foo1 and bar1 display "foo2" and
"bar2" respectively, because their myName fields aren't set until
MoveNext is called, despite the fact that at the time they were
constructed, the values of their parents' respective currentName fields
were "foo1" and "bar1".

class Program
{
    static void Main(string[] args)
    {
        Blah foo = new Blah();
        Blah bar = new Blah();

        foo.currentName = "foo1";
        IEnumerator<string> foo1 = foo.GetEnumerator();
        bar.currentName = "bar1";
        IEnumerator<string> bar1 = bar.GetEnumerator();

        foo.currentName = "foo2";
        IEnumerator<string> foo2 = foo.GetEnumerator();
        bar.currentName = "bar2";
        IEnumerator<string> bar2 = bar.GetEnumerator();

        foo1.MoveNext(); Console.WriteLine(foo1.Current);
        foo1.MoveNext(); Console.WriteLine(foo1.Current);
        bar1.MoveNext(); Console.WriteLine(bar1.Current);
        bar2.MoveNext(); Console.WriteLine(bar2.Current);
        foo1.MoveNext(); Console.WriteLine(foo1.Current);
        foo1.MoveNext(); Console.WriteLine(foo1.Current);
        foo2.MoveNext(); Console.WriteLine(foo2.Current);
        foo2.MoveNext(); Console.WriteLine(foo2.Current);
        bar1.MoveNext(); Console.WriteLine(bar1.Current);
        foo2.MoveNext(); Console.WriteLine(foo2.Current);
        bar2.MoveNext(); Console.WriteLine(bar2.Current);
        foo2.MoveNext(); Console.WriteLine(foo2.Current);
        foo2.MoveNext(); Console.WriteLine(foo2.Current);
        foo2.MoveNext(); Console.WriteLine(foo2.Current);
        bar1.MoveNext(); Console.WriteLine(bar1.Current);
        bar2.MoveNext(); Console.WriteLine(bar2.Current);
        bar1.MoveNext(); Console.WriteLine(bar1.Current);
        bar1.MoveNext(); Console.WriteLine(bar1.Current);
        foo1.MoveNext(); Console.WriteLine(foo1.Current);
        foo1.MoveNext(); Console.WriteLine(foo1.Current);
        bar1.MoveNext(); Console.WriteLine(bar1.Current);
        bar2.MoveNext(); Console.WriteLine(bar2.Current);
        bar2.MoveNext(); Console.WriteLine(bar2.Current);
        bar2.MoveNext(); Console.WriteLine(bar2.Current);

        Console.ReadKey();
    }
}

public class Blah : IEnumerable<string>
{
    public string currentName;
    string[] items = { "four", "score", "and", "seven", "years", "ago" };

    #region IEnumerable<string> Members

    public IEnumerator<string> GetEnumerator()
    {
        string myName = currentName;
        int index;
        for (index = 0; index < items.Length; index++)
        {
            yield return myName + " " + items[index];
        }
    }

    #endregion

    #region IEnumerable Members

    System.Collections.IEnumerator
    System.Collections.IEnumerable.GetEnumerator()
    {
        throw new NotImplementedException();
    }

    #endregion
}
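
One standard fix, when the snapshot has to happen at GetEnumerator time, is to split the method in two: a plain (non-iterator) GetEnumerator that captures the state eagerly, delegating to a private iterator method that takes the captured values as parameters. A sketch against the Blah class above (this is an idiom, not the code actually posted in the thread):

```csharp
using System.Collections;
using System.Collections.Generic;

public class Blah : IEnumerable<string>
{
    public string currentName;
    string[] items = { "four", "score", "and", "seven", "years", "ago" };

    public IEnumerator<string> GetEnumerator()
    {
        // No "yield" appears in this method, so it is NOT rewritten
        // into a state machine: this body runs immediately when
        // GetEnumerator is called, and currentName is snapshotted now.
        string myName = currentName;
        return Enumerate(myName);
    }

    IEnumerator<string> Enumerate(string myName)
    {
        // Only this method's body is deferred; myName is already fixed,
        // because it was passed in as a parameter.
        for (int index = 0; index < items.Length; index++)
        {
            yield return myName + " " + items[index];
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

With this split, foo1 and bar1 would print "foo1" and "bar1" as intended, even though foo2 and bar2 are created before their first MoveNext.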

Harlan Messinger

unread,
Dec 31, 2009, 12:31:07 PM12/31/09
to
Peter Duniho wrote:
> private void AddFile(string path)
> {
>     lock (_objLock)
>     {
>         fileList.Add(path);
>         Monitor.Pulse(_objLock);
>     }
> }
>

What if more than enumerators one on more than one thread are waiting?
Should this be PulseAll?

And then shouldn't the following method also call Monitor.PulseAll
before exiting, in case the enumerator calling this code is not the only
one waiting?

Also, in that case, would a good way to handle that be to wrap the
contents of the locked block inside a try block, and then put the
PulseAll call in a finally block?

Harlan Messinger

unread,
Dec 31, 2009, 12:31:58 PM12/31/09
to
Harlan Messinger wrote:
> What if more than enumerators one on more than one thread are waiting?

This should read, "What if enumerator on more than one thread are waiting?"

Peter Duniho

unread,
Dec 31, 2009, 2:19:21 PM12/31/09
to
Harlan Messinger wrote:
> Peter Duniho wrote:
>> private void AddFile(string path)
>> {
>>     lock (_objLock)
>>     {
>>         fileList.Add(path);
>>         Monitor.Pulse(_objLock);
>>     }
>> }
>>
>
> What if more than enumerators one on more than one thread are waiting?
> Should this be PulseAll?

Yes, if you have multiple consumers, you should call PulseAll() so that
all consumers are notified.

> And then shouldn't the following method also call Monitor.PulseAll
> before exiting, in case the enumerator calling this code is not the only

> one waiting? [...]

No, I don't think so. You shouldn't need to _add_ a new place where the
Monitor is pulsed. Just change the other call to Pulse(), in the
Finished property's setter, to PulseAll() as well.
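
For reference, the producer/consumer pairing described here looks roughly like the sketch below. The names fileList, _objLock and Finished come from the thread, but the class shell is hypothetical scaffolding, not the code posted at the link. The important detail on the consumer side is re-testing the condition in a loop after Wait() returns, since with PulseAll() another consumer may acquire the lock first.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class LazyFileList
{
    private readonly object _objLock = new object();
    private readonly List<string> fileList = new List<string>();
    private bool _finished;

    public bool Finished
    {
        get { lock (_objLock) return _finished; }
        set
        {
            lock (_objLock)
            {
                _finished = value;
                Monitor.PulseAll(_objLock); // wake waiters on completion too
            }
        }
    }

    public void AddFile(string path)
    {
        lock (_objLock)
        {
            fileList.Add(path);
            Monitor.PulseAll(_objLock); // wake every waiting consumer
        }
    }

    // Consumer side: re-test the condition in a loop after Wait returns,
    // because another consumer woken by the same PulseAll may run first.
    public string NextFile(int index)
    {
        lock (_objLock)
        {
            while (index >= fileList.Count && !_finished)
            {
                Monitor.Wait(_objLock);
            }
            return index < fileList.Count ? fileList[index] : null;
        }
    }
}
```

Monitor.Wait releases the lock while waiting and reacquires it before returning, which is what lets the producer get in between the consumers' turns.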

Pete

Harlan Messinger

unread,
Dec 31, 2009, 2:36:27 PM12/31/09
to
Peter Duniho wrote:
> Harlan Messinger wrote:
>> Peter Duniho wrote:
>>> private void AddFile(string path)
>>> {
>>>     lock (_objLock)
>>>     {
>>>         fileList.Add(path);
>>>         Monitor.Pulse(_objLock);
>>>     }
>>> }
>>>
>>
>> What if more than enumerators one on more than one thread are waiting?
>> Should this be PulseAll?
>
> Yes, if you have multiple consumers, you should call PulseAll() so that
> all consumers are notified.

In that case I will, since there's never any harm, when designing an app
that's multithreaded anyway, in assuming that multiple enumerators can
be requested for simultaneous use.

>> And then shouldn't the following method also call Monitor.PulseAll
>> before exiting, in case the enumerator calling this code is not the
>> only one waiting? [...]
>
> No, I don't think so. You shouldn't need to _add_ a new place where the
> Monitor is pulsed. Just change the other call to Pulse(), in the
> Finished property's setter, to PulseAll() as well.

Let's say file names are being retrieved on thread T0, and I have two
enumerators being used concurrently on threads T1 and T2. Suppose the
conditions and sequence of events at some point are:

5 items have been retrieved by T0 so far and it's still looking for more.

The two enumerators have both run through all 5 items.

T1 locks on _objLock and then calls Monitor.Wait.

T2 locks on _objLock and then calls Monitor.Wait.

T0 locks on _objLock, finds another file, and then calls Monitor.PulseAll.

T1 and T2 are now both in the middle of a lock block on _objLock. What
happens then? Is it the same thing as when both threads arrived at the
lock(_objLock): one of them (say, T1) will get there first; T2 will be
blocked until T1 unlocks _objLock; and then T2 will continue execution?

Peter Duniho

unread,
Dec 31, 2009, 3:01:39 PM12/31/09
to
Harlan Messinger wrote:
> [...]

>>> What if more than enumerators one on more than one thread are
>>> waiting? Should this be PulseAll?
>>
>> Yes, if you have multiple consumers, you should call PulseAll() so
>> that all consumers are notified.
>
> In that case I will, since there's never any harm, when designing an app
> that's multithreaded anyway, in assuming that multiple enumerators can
> be requested for simultaneous use.

Note that there may be specific scenarios where it may be more desirable
to call Pulse(), even if there are multiple threads that may call
Wait(). The general rule is sound, but be careful to not follow it too
blindly in other designs.

Here, you are correct that PulseAll() is needed.

> [...]


> T1 locks on _objLock and then calls Monitor.Wait.
>
> T2 locks on _objLock and then calls Monitor.Wait.
>
> T0 locks on _objLock, finds another file, and then calls Monitor.PulseAll.
>
> T1 and T2 are now both in the middle of a lock block on _objLock. What
> happens then? Is it the same thing as when both threads arrived at the
> lock(_objLock): one of them (say, T1) will get there first; T2 will be
> blocked until T1 unlocks _objLock; and then T2 will continue execution?

Calling PulseAll() moves all threads that were in the wait queue over to
the ready queue. But only one thread may still have the lock at a time,
so yes, if T1 acquires the lock first, T2 has to wait for T1 to release
the lock again (either by calling Wait() again or exiting the "lock"
block) before it can proceed.

Because of this, in some cases it is desirable to have each consumer
copy as much of the existing data as is available. This doesn't work
if you have multiple consumers that do _not_ share work items; in that
case, you really want each consumer to just pull out one at a time and
work on it, while other consumers get a chance to get one too.

But in the case where there's only ever one consumer, or as in your
example where there are multiple consumers but each consumer has to
process all items produced, you can reduce contention by getting each
thread to do as much work as it can each time it acquires the lock, to
minimize the number of times it does have to acquire the lock.
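
That batching idea can be sketched like so (again borrowing the thread's names; the class shell is hypothetical): each consumer drains everything currently available under a single lock acquisition.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

class BatchingConsumer
{
    private readonly object _objLock = new object();
    private readonly List<string> fileList = new List<string>();
    private bool _finished;

    public void AddFile(string path)
    {
        lock (_objLock)
        {
            fileList.Add(path);
            Monitor.PulseAll(_objLock);
        }
    }

    public void Finish()
    {
        lock (_objLock)
        {
            _finished = true;
            Monitor.PulseAll(_objLock);
        }
    }

    // Drain everything currently available in ONE lock acquisition, so
    // each consumer pays for the lock once per batch, not once per item.
    public List<string> NextBatch(int index)
    {
        lock (_objLock)
        {
            while (index >= fileList.Count && !_finished)
            {
                Monitor.Wait(_objLock);
            }
            int available = fileList.Count - index;
            return available > 0
                ? fileList.GetRange(index, available)
                : new List<string>(); // finished and nothing left
        }
    }
}
```

An empty batch tells the caller the enumeration is complete, since NextBatch only returns with nothing when _finished is set.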

Note that this is yet another example of why the simplified
IEnumerator<T> implementation I mentioned before may be desirable. It's
a lot easier to merge the NextFile() method and the enumerator loop
itself if both are in the same class. Merging the two is a requirement
for optimizing the lock acquisition as I suggested, so that the iterator
method itself acquires the lock.

Pete
