Shall we revert partest changes in 2.10.x branch?


Grzegorz Kossakowski

Jul 19, 2012, 3:53:36 AM
to scala-internals
Hi,


It appears to me that the problems discussed there are fairly deep and it will take some time to sort them out. While I warmly welcomed the increased utilization of my eight cores after the partest rewrite, I think we cannot afford any instability in the build/test infrastructure for the 2.10.x branch at this point.

My exact proposal would be:
  • revert the partest changes in 2.10.x but keep them in master, so the work is not dropped and people can keep working on figuring out the Jenkins failures
  • if partest in master becomes stable again (let's say we don't see anything suspicious for one week), we can reconsider merging it back into 2.10.x
WDYT?

--
Grzegorz Kossakowski

Adriaan Moors

Jul 19, 2012, 4:00:31 AM
to scala-i...@googlegroups.com
I'm not sure.

The main regressions were a couple of tests related to separate compilation that started failing spuriously.
I am worried about those. Did the higher degree of concurrency elicit some kind of race condition?
Is the test framework to blame? We'll never know, since those tests were moved to pending recently (unless we revert the move).
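
(A purely hypothetical illustration of the kind of race meant here, not partest code: the directory name, the sleep, and the whole setup are made up. Two tests that compile into, and clean up, the same output directory behave fine sequentially but can trip over each other once they run concurrently.)

// Hypothetical illustration only, not partest code.
import java.io.File
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object SharedOutputRace extends App {
  val sharedOut = new File(System.getProperty("java.io.tmpdir"), "shared-test-out")
  sharedOut.mkdirs()

  def fakeTest(): Boolean = {
    val classFile = new File(sharedOut, "A.class")
    classFile.createNewFile()            // first "compilation round" writes a class file
    Thread.sleep(10)                     // the other test may clean the directory here
    val stillThere = classFile.exists()  // second round expects the file to still exist
    Option(sharedOut.listFiles()).foreach(_.foreach(_.delete())) // per-test cleanup also races
    stillThere
  }

  val results = Await.result(
    Future.sequence(Seq(Future(fakeTest()), Future(fakeTest()))),
    Duration.Inf)
  println(results)  // run sequentially this is always List(true, true); concurrently, not always
}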

Additionally, some benchmarking tests were failing because they were run concurrently and thus didn't finish fast enough.
We shouldn't have tests like this in the run/ category.
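
(Again only a sketch of the idea, with made-up types, not partest's actual categories or API: benchmark-style, timing-sensitive tests could be kept out of the concurrent pool and run strictly one at a time, so contention for cores doesn't push them over their time limits.)

// Hypothetical sketch, not partest's actual API.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

case class TestCase(name: String, timingSensitive: Boolean, body: () => Boolean)

object MixedRunner {
  def runAll(tests: Seq[TestCase]): Seq[(String, Boolean)] = {
    val (serial, parallel) = tests.partition(_.timingSensitive)
    // the bulk of the suite runs concurrently across all cores
    val concurrent = Await.result(
      Future.traverse(parallel)(t => Future(t.name -> t.body())),
      Duration.Inf)
    // benchmark-style tests run afterwards, strictly one at a time
    val sequential = serial.map(t => t.name -> t.body())
    concurrent ++ sequential
  }
}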

The random aborts we've been seeing date from before the partest change, according to Lukas.
Also, the abort occurs way after, as well as before, partest runs, so I don't see how that could be related.

Finally, Paul said he has some improvements in the pipeline.

That said, I agree test instability is horrible. I experience it first-hand every day.

Grzegorz Kossakowski

Jul 19, 2012, 4:09:03 AM
to scala-i...@googlegroups.com
On 19 July 2012 10:00, Adriaan Moors <adriaa...@epfl.ch> wrote:
> I'm not sure.
>
> The main regressions were a couple of tests related to separate compilation that started failing spuriously.
> I am worried about those. Did the higher degree of concurrency elicit some kind of race condition?
> Is the test framework to blame? We'll never know, since those tests were moved to pending recently (unless we revert the move).
>
> Additionally, some benchmarking tests were failing because they were run concurrently and thus didn't finish fast enough.
> We shouldn't have tests like this in the run/ category.

I agree. However, it takes time to smooth everything out, and we don't have that luxury when dealing with the 2.10.x branch. That's my point.
 
> The random aborts we've been seeing date from before the partest change, according to Lukas.
> Also, the abort occurs way after, as well as before, partest runs, so I don't see how that could be related.

We all know that in complex systems nothing is truly impossible. I think it doesn't hurt to revert the partest change and see whether the problems are related. If our problems with Jenkins do not go away, then we have to look for other suspects.
 
> Finally, Paul said he has some improvements in the pipeline.

Which I want to see (in master). Again, I didn't say we should block or drop his work. I'm just asking for baby steps. That's the main reason why we have two branches.

--
Grzegorz Kossakowski

Adriaan Moors

Jul 19, 2012, 4:11:38 AM
to scala-i...@googlegroups.com
I guess our main problem is a lack of data.

How much time has been lost to spurious failures, and how much was gained by faster builds?
When we revert, we're sure to get slower builds, but not sure to get fewer spurious failures.

So the rational thing to do is not obvious to me.

√iktor Ҡlang

Jul 19, 2012, 4:13:25 AM
to scala-i...@googlegroups.com
On Thu, Jul 19, 2012 at 10:11 AM, Adriaan Moors <adriaa...@epfl.ch> wrote:
> I guess our main problem is a lack of data.
>
> How much time has been lost to spurious failures, and how much was gained by faster builds?
> When we revert, we're sure to get slower builds, but not sure to get fewer spurious failures.
>
> So the rational thing to do is not obvious to me.


Also, until the failures have been diagnosed it is impossible to conclude whether they are bugs in partest or actual bugs in the tests themselves.

Cheers,
 

--
Viktor Klang

Akka Tech Lead
Typesafe - The software stack for applications that scale

Twitter: @viktorklang

Lukas Rytz

Jul 19, 2012, 5:06:21 AM
to scala-i...@googlegroups.com
One abort happened recently in 2.10.x, build #21

Here's what the Jenkins log file says about it:

Jul 19, 2012 10:40:26 AM hudson.model.Run execute
INFO: scala-checkin-2.10.x #21 aborted
java.lang.InterruptedException
  at java.lang.Object.wait(Native Method)
  at hudson.remoting.Request.call(Request.java:146)
  at hudson.remoting.Channel.call(Channel.java:663)
  at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:158)
  at $Proxy41.join(Unknown Source)
  at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:861)
  at hudson.Launcher$ProcStarter.join(Launcher.java:345)
  at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:82)
  at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
  at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
  at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:717)
  at hudson.model.Build$BuildExecution.build(Build.java:199)
  at hudson.model.Build$BuildExecution.doRun(Build.java:160)
  at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
  at hudson.model.Run.execute(Run.java:1488)
  at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
  at hudson.model.ResourceController.execute(ResourceController.java:88)
  at hudson.model.Executor.run(Executor.java:236)

Paul Phillips

Jul 19, 2012, 10:32:31 AM
to scala-i...@googlegroups.com


On Thu, Jul 19, 2012 at 2:06 AM, Lukas Rytz <lukas...@epfl.ch> wrote:
> java.lang.InterruptedException

I was pretty sure it was going to be this.  This might be all we need:


I'll probably need to do something more interesting to handle the interrupt, the origin of which I don't know, but if we can get this in there it will tell us if this is the spot (and I'll be surprised if it isn't).
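
(Since the change itself isn't quoted here, the following is only a speculative sketch of what handling the interrupt in the process-spawning path could look like; the object and method names are invented, not partest's.)

// Speculative sketch only, not the actual change being discussed.
// Idea: when the blocking wait on a spawned test process is interrupted,
// log it (so we learn where the interrupt arrives), restore the interrupt
// flag, and kill the child instead of letting the exception abort the run.
object InterruptDiagnostics {
  def waitForTestProcess(builder: ProcessBuilder): Int = {
    val process = builder.start()
    try process.waitFor()
    catch {
      case ie: InterruptedException =>
        System.err.println("Interrupted while waiting for a test process:")
        ie.printStackTrace()
        Thread.currentThread().interrupt() // preserve the interrupt status for callers
        process.destroy()                  // don't leave the child process running
        -1                                 // report the test as aborted
    }
  }

  // usage sketch
  def main(args: Array[String]): Unit = {
    val exitCode = waitForTestProcess(new ProcessBuilder("echo", "hello"))
    println("exit code: " + exitCode)
  }
}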
