This is a very interesting and ambitious experiment! If successful, this is getting at the type of evidence that would be useful when considering to change the language.
When evaluating the change, we will need to find cases where the semantics of the program might differ after the change. Semantic changes include race conditions being eliminated (like in the
Bugs section for the project). It can also include changes to the memory layout after the loop (one aliased memory allocation vs a memory allocation per iteration being stored in memory), changes in branching behavior (pointer comparisons), changes in variable lifetime when an escape does happen (which can impact when finalizers get run), and order of calls to external functions. Looking at closures of loop variables and escaping references is definitely a good start. I am not yet sure exactly what would suffice (but happy to chat over email).
To make a decision, we probably want to know for each location:
* Is it "semantically equivalent" before/after the change? This can be done by a combination of automation and manual effort.
* If it differs, was the behavior before clearly "incorrect"/"has "bugs"?
* If it differs, was the behavior after clearly "incorrect"/"has "bugs"?
For the above, a location might not have a clear answer after expert review. It might be worth documenting this. A few of these map decently onto the existing categories. Knowing when the number of differing locations where someone is clearly helped vs differing locations where someone is clearly hurt is probably the most important in making a decision.
Cases with bugs may want to include a notion of confidence. Examples like those in the
Bugs section are extremely likely to be bugs. Others may not be as clear. ("Is there actually a race condition here?" can be hard to assess in someone else's code.) The current phrasing "could lead to undesirable behavior" leaves the confidence ambiguous. It might be helpful to clarify what this means to reviewers.
The current wording also currently suggests we can also stop looking once we have identified "a fix to a bug" without considering downstream interactions after the loop. For example: `for _, x := range l { defer func() { x.someMethod() }() }`. That is almost certainly a bug. However, on the list `{nil, <something valid>}` the original version avoids a nil pointer dereference panic by only using the last element. Basically having two bugs pull in the opposite direction mitigating the other. I have only really seen this where an iffy unit test was broken by creating a new copy for each loop. The larger point though is it is not always clear how much context to look at before labeling an example "clearly buggy before and would be fixed". Stopping early could be a source of false negatives. This may be more of a theoretical concern than a real one.
Minor technical points:
For folks developing/commenting on the static analyzers, it would also help to formalize the notion of "semantic equivalence" that is trying to be reached. It is likely a refinement relation like one would see in the program transformation literature. The change will always be safe when the loop after the change is a refinement of the original loop. (If we also care about performance, we will also want to know how often this transformation can be undone to reuse the memory, e.g. the original refines the modified. But declaring this out of scope is reasonable.)