Regression tests for SBCL with cl-test-grid

Jan Moringen

Jun 15, 2013, 10:18:54 AM

to cl-test-grid, sbcl-devel

Hi,

I'm working on automated regression tests for SBCL using cl-test-grid.
Most things worked very well so far - thank you for the great project. I
am writing to ask two questions.

First, should I get everything working, would the use of your
infrastructure (result storage and result namespace) be acceptable? For
now, I set up the result storage id "sbcl" and user email
"sbcl-maintainers". I expect about three test results each day: one test
run on each of three machines. If that is currently unacceptable, would
donations help?

The second question is about a problem with the result upload: after a
complete test run, the upload of log files to the blob store starts but
fails at some point. I observed two failure modes:

1. With vanilla cl-test-grid, the upload failed due to a
"connection reset by peer". I suspected this to be due to some
app engine quota violation since the upload was probably more
than 20 MB within one minute. I tried to work around this by
waiting between log file uploads.
2. This slowed-down version behaves similarly but fails with a
"broken pipe".

A full log of the second failure mode is available [1] (Warning: ~ 40 MB
of text; search for "Broken pipe"). Do you have an idea about what may
be going wrong?

Many thanks in advance and kind regards,
Jan

[1] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/lastBuild/consoleFull

Anton Vodonosov

Jun 15, 2013, 10:34:56 AM

to cl-tes...@googlegroups.com, sbcl-devel

15.06.2013, 18:18, "Jan Moringen" <jmor...@techfak.uni-bielefeld.de>:

> Hi,
>
> I'm working on automated regression tests for SBCL using cl-test-grid.
> Most things worked very well so far - thank you for the great project. I
> am writing to ask two questions.
>
> First, should I get everything working, would the use of your
> infrastructure (result storage and result namespace) be acceptable?

Absolutely. It is created for people to use.

> The second question is about a problem with the result upload: after a
> complete test run, the upload of log files to the blob store starts but
> fails at some point. I observed two failure modes:
>

> 1. With vanilla cl-test-grid, the upload failed due to a
>    "connection reset by peer". I suspected this to be due to some
>    app engine quota violation since the upload was probably more
>    than 20 MB within one minute. I tried to work around this by
>    waiting between log file uploads.
> 2. This slowed-down version behaves similarly but fails with a
>    "broken pipe".

>
> A full log of the second failure mode is available [1] (Warning: ~ 40 MB
> of text; search for "Broken pipe"). Do you have an idea about what may
> be going wrong?
>
> Many thanks in advance and kind regards,
> Jan
>
> [1] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/lastBuild/consoleFull
>

For me currently the uploads work. Do you have a way to reproduce the problem?
For example, do you have the folder with library test logs, so that I can try to upload
it from my machine?

Best regards,
- Anton

Jan Moringen

Jun 15, 2013, 10:54:01 AM

to cl-tes...@googlegroups.com

Hi Anton,

thank you for the quick reply.

On Sat, 2013-06-15 at 18:34 +0400, Anton Vodonosov wrote:
> 15.06.2013, 18:18, "Jan Moringen" <jmor...@techfak.uni-bielefeld.de>:
> > Hi,
> >

> > First, should I get everything working, would the use of your
> > infrastructure (result storage and result namespace) be acceptable?
>
> Absolutely. It is created for people to use.

Great, thanks.

> > The second question is about a problem with the result upload: after a

> > [...]

>
> For me currently the uploads work. Do you have a way to reproduce the problem?

Sorry for not mentioning this initially: on other machines, the upload
works fine for me as well.

The problematic machines are virtual machines (Ubuntu x86 and x86_64 and
MacOS) behind a firewall. However, http[s] are permitted and the logs
[1] seem to indicate a partially successful upload.

> For example, do you have the folder with library test logs, so that I can try to upload
> it from my machine?

The "workspace" from the most recent failed attempt can be accessed via
https [2]. cl-test-grid resides in the "cl-test-grid" subdirectory. The
"workspace" is also downloadable as a (huge) archive [3].

Thanks again and kind regards,
Jan

[1] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/lastBuild/consoleFull
[2] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/ws/
[3] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/ws/*zip*/ubuntu_quantal_64bit.zip

Anton Vodonosov

Jun 15, 2013, 11:21:38 AM

to cl-tes...@googlegroups.com

15.06.2013, 18:54, "Jan Moringen" <jmor...@techfak.uni-bielefeld.de>:
>>> The second question is about a problem with the result upload: after a
>>> [...]
>> For me currently the uploads work. Do you have a way to reproduce the problem?

>
> Sorry for not mentioning this initially: on other machines, the upload
> works fine for me as well.
>
> The problematic machines are virtual machines (Ubuntu x86 and x86_64 and
> MacOS) behind a firewall. However, http[s] are permitted and the logs
> [1] seem to indicate a partially successful upload.
>

>> For example, do you have the folder with library test logs, so that I can try to upload
>> it from my machine?

>
> The "workspace" from the most recent failed attempt can be accessed via
> https [2]. cl-test-grid resides in the "cl-test-grid" subdirectory. The
> "workspace" is also downloadable as a (huge) archive [3].
>
> Thanks again and kind regards,
> Jan
>
> [1] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/lastBuild/consoleFull
> [2] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/ws/
> [3] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_64bit/ws/*zip*/ubuntu_quantal_64bit.zip

I have found the logs (cl-test-grid/work-dir/agent/logs/)

Will experiment with uploading them tonight.

Can it be that the virtual machines are slow?
One of the restrictions of Google App Engine is that requests should not take more than 30 seconds.
If this timeout is exceeded, then it interrupts request handling by throwing an exception
(https://developers.google.com/appengine/docs/java/javadoc/com/google/apphosting/api/DeadlineExceededException?hl=en)

I am not sure this is exactly the cause, just guessing.

To satisfy this limit and another limit, the logs are submitted in batches.
The function which submits files to GAE is called test-grid-gae-blobstore:submit-files2.
It has a keyword parameter :batch-size with a default value of 300. I have tuned this default value
so that it works from my machines.

You can try a smaller value for the batch size, so that every request contains less data.

Change the batch size either by changing the default value of the parameter, or
by passing another value explicitly. The test grid agent calls this function from tg-agent::submit-logs,
in the file agent/submit-results.lisp, line 26.

Best regards,
- Anton

Anton Vodonosov

Jun 15, 2013, 11:41:12 AM

to cl-tes...@googlegroups.com

15.06.2013, 19:21, "Anton Vodonosov" <avodo...@yandex.ru>:

>
> To satisfy this limit and another limit, the logs are submitted in batches.
> The function which submits files to GAE is called test-grid-gae-blobstore:submit-files2.
> It has a keyword parameter :batch-size with a default value of 300. I have tuned this default value
> so that it works from my machines.
>
> You can try a smaller value for the batch size, so that every request contains less data.
>
> Change the batch size either by changing the default value of the parameter, or
> by passing another value explicitly. The test grid agent calls this function from tg-agent::submit-logs,
> in the file agent/submit-results.lisp, line 26.

BTW, if you try this, there is no need to perform a new test run.
For experiments you can just call
(tg-agent::submit-logs (tg-agent::make-gae-blobstore) "cl-test-grid/work-dir/agent/test-runs/20130615153704-sbcl-1.1.8.57-d5c8232-dirty-linux-x64/")

P.S. In the previous message I said the logs I am interested in are "cl-test-grid/work-dir/agent/logs/",
but of course that was a mistake; the test run logs submitted to GAE are
"cl-test-grid/work-dir/agent/test-runs/20130615153704-sbcl-1.1.8.57-d5c8232-dirty-linux-x64/"

Jan Moringen

Jun 16, 2013, 11:24:40 AM

to cl-tes...@googlegroups.com

On Sat, 2013-06-15 at 19:21 +0400, Anton Vodonosov wrote:

> [...]

>
> Will experiment with uploading them tonight.

Thanks.

> Can it be that the virtual machines are slow?

Not slower than other machines I tried this on. Because of a firewall,
their internet connection may have different/unknown constraints,
though.

> One of the restrictions of Google App Engine is that requests should not take more than 30 seconds.
> If this timeout is exceeded, then it interrupts request handling by throwing an exception
> (https://developers.google.com/appengine/docs/java/javadoc/com/google/apphosting/api/DeadlineExceededException?hl=en)
>
> I am not sure this is exactly the cause, just guessing.
>
> To satisfy this limit and another limit, the logs are submitted in batches.
> The function which submits files to GAE is called test-grid-gae-blobstore:submit-files2.
> It has a keyword parameter :batch-size with a default value of 300. I have tuned this default value
> so that it works from my machines.
>
> You can try a smaller value for the batch size, so that every request contains less data.
>
> Change the batch size either by changing the default value of the parameter, or
> by passing another value explicitly. The test grid agent calls this function from tg-agent::submit-logs,
> in the file agent/submit-results.lisp, line 26.

I will experiment with the batch size (using your tip from the other mail
for submitting logs without running tests).

Thanks and kind regards,
Jan

Anton Vodonosov

Jun 17, 2013, 8:22:48 PM

to cl-tes...@googlegroups.com, Jan Moringen

Hello Jan.

I wanted to try the upload from my machine, but the directory
cl-test-grid/work-dir/agent/test-runs/20130615153704-sbcl-1.1.8.57-d5c8232-dirty-linux-x64/
is not available online anymore, because your test system runs another test run now.

I can say now that my theory about the 30-second timeout is wrong. I saw in the agent.log
that the network connection problem occurred 4 seconds after the request was started.

Currently, the theory that your virtual machines have some network configuration
problem looks the most probable to me.

Do you have any news about this problem?

Best regards,
- Anton

Jan Moringen

Jun 17, 2013, 10:19:39 PM

to Anton Vodonosov, cl-tes...@googlegroups.com

Hi Anton.

On Tue, 2013-06-18 at 04:22 +0400, Anton Vodonosov wrote:
> I wanted to try the upload from my machine, but the directory
> cl-test-grid/work-dir/agent/test-runs/20130615153704-sbcl-1.1.8.57-d5c8232-dirty-linux-x64/
> is not available online anymore, because your test system runs another test run now.

Sorry, I did some experiments of my own in the meantime, which caused the
workspace to be deleted. Thanks for helping to figure this out.

> I can say now that my theory about the 30-second timeout is wrong. I saw in the agent.log
> that the network connection problem occurred 4 seconds after the request was started.
>
> Currently, the theory that your virtual machines have some network configuration
> problem looks the most probable to me.
>
> Do you have any news about this problem?

I did three experiments:

1. I started SBCL on one of the virtual machines in the workspace
in which the failed upload occurred. I connected to this SBCL
via SLIME/SWANK, changed the batch size to 50 and performed a
manual upload. That worked.
2. After that, I tried automated runs with batch size 50 and it
worked once on one machine, but did not work on the other
machines.
3. After that, I kept the batch size 50 and added a 30 second delay
between individual upload batches, but it still did not
work on the machine that had failed before. At the time of writing, the
previously successful machine is still running with this
configuration.

All failed uploads failed with "Connection reset by peer" after less
than 300 files (6 batches with my batch size).

Next, I will try a 60 second delay between individual upload batches.

Kind regards,
Jan

Jan Moringen

Jun 18, 2013, 5:05:31 PM

to cl-tes...@googlegroups.com, Anton Vodonosov

Hi.

On Tue, 2013-06-18 at 04:19 +0200, Jan Moringen wrote:

> Next, I will try a 60 second delay between individual upload batches.

After trying different combinations, I found one that seems to work
reliably: batches of 10 log files with delays of 10 seconds between
batches. I will perform more experiments to determine whether this
really works reliably and whether some speed up is possible.

Kind regards,
Jan
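
The workaround described above (small batches with a pause between them) can be sketched roughly as follows. This is only an illustration under stated assumptions: SPLIT-INTO-BATCHES, UPLOAD-IN-BATCHES and the UPLOAD-FN argument are hypothetical names, not part of cl-test-grid, and the thread does not show how the real test-grid-gae-blobstore:submit-files2 is implemented.

```lisp
;; Sketch only: none of these names come from cl-test-grid.
(defun split-into-batches (items batch-size)
  "Return ITEMS as a list of lists, each with at most BATCH-SIZE elements."
  (assert (plusp batch-size))
  (loop while items
        collect (loop repeat batch-size
                      while items
                      collect (pop items))))

(defun upload-in-batches (files upload-fn
                          &key (batch-size 10) (delay-seconds 10))
  "Call UPLOAD-FN on successive batches of FILES, sleeping between batches.
UPLOAD-FN is a caller-supplied function of one argument (a list of files).
Returns the number of batches submitted."
  (loop for (batch . more) on (split-into-batches files batch-size)
        do (funcall upload-fn batch)
           ;; Pause only when another batch follows.
           (when more (sleep delay-seconds))
        count t))
```

A call such as (upload-in-batches log-files #'my-upload :batch-size 10 :delay-seconds 10), where log-files and my-upload are placeholders, would correspond to the combination reported to work here. The defaults of 10 and 10 simply mirror those numbers and are not tuned values.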

Anton Vodonosov

Jun 18, 2013, 6:32:00 PM

to Jan Moringen, cl-tes...@googlegroups.com

19.06.2013, 01:05, "Jan Moringen" <jmor...@techfak.uni-bielefeld.de>:

> Hi.
>
> On Tue, 2013-06-18 at 04:19 +0200, Jan Moringen wrote:
>

>
> After trying different combinations, I found one that seems to work
> reliably: batches of 10 log files with delays of 10 seconds between
> batches. I will perform more experiments to determine whether this
> really works reliably and whether some speed up is possible.
>
> Kind regards,
> Jan

Hello Jan.

That's good.

We need not only to find a combination that works, but also to find the reason why the usual way does not work.
For this we need a way to reliably reproduce the problem, at least on your machine, but preferably on some other machines too.
Since yesterday I have performed and submitted 7 test runs from my machine, so I can confirm once again
that the usual submit works for me.

Best regards,
- Anton

Anton Vodonosov

Jun 21, 2013, 5:25:40 AM

to cl-tes...@googlegroups.com, Jan Moringen

Hi Jan.

How is it going with this upload error? Have you solved it?

Best regards,
- Anton

Jan Moringen

Jun 22, 2013, 1:02:57 AM

to Anton Vodonosov, cl-tes...@googlegroups.com

Hi Anton.

On Fri, 2013-06-21 at 13:25 +0400, Anton Vodonosov wrote:

> How is it going with this upload error? Have you solved it?

I had some success tuning the batch size parameter and delays between
batches: the Ubuntu Quantal 32bit and 64bit slaves now upload their logs
successfully most of the time. However, there are still occasional upload
failures with the old "connection reset by peer" error.

Another error I noticed on the Ubuntu Quantal 32bit slave is "Error
uploading files, the HTTP response code 502: Bad Gateway" [1] (this
happened only once so far).

I also tried enabling our MacOS slave, but the upload for this slave fails
with yet another failure mode: the upload just hangs at some point until
the whole job times out and gets cancelled [2].

I didn't have time yet to try repeated manual uploads via SWANK/SLIME,
but that is probably still the next logical step.

Kind regards,
Jan

[1] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=ubuntu_quantal_32bit/58/consoleFull
[2] https://ci.cor-lab.org/job/sbcl-master-test-grid/label=MAC_OS_lion_64bit/58/consoleFull

Anton Vodonosov

Jul 9, 2013, 2:39:21 PM

to Jan Moringen, cl-tes...@googlegroups.com

Jan, I have another idea.

What if we test your virtual machines in a different network?
Perhaps I could download the VM images? How big are they?

Jan Moringen

Jul 10, 2013, 3:04:42 AM

to Anton Vodonosov, cl-test-grid

Hi Anton.

That would be complicated.

But your suggestion did give me another idea: I will recreate our setup
with a local Jenkins instance and check whether the problem goes away.
This may tell us whether Jenkins or the vm/networking setup is the
problem. I may need a day or a few days for this experiment, though.

Kind regards,
Jan
