SR2014 Comp Server Fail

Rob Spanton

May 12, 2014, 11:11:40 AM
to srobo...@googlegroups.com
Hey Guys,

Here's an overview of the server failure that occurred during the SR2014
competition:

Day 1: Too many workers.

The maximum number of apache worker processes was set to be very
high (400). As load increased on the server, we reached
somewhere near this number. Each worker consumed some non-zero
amount of RAM. This caused the machine to start swapping a
*lot*. The machine was at least in the ballpark of using ~1.5GB
swap, and possibly actually ran out of swap. We had to hard
reboot the server to escape this problem.

(The machine has 2GB RAM and 2GB swap.)

We rescued the server for the last half (third?) of the day by
reducing the maximum number of workers to 200, and turning off
keep-alive.
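
For a rough sense of the numbers (the per-worker figure below is an
assumption for illustration, not something we measured):

    # Back-of-the-envelope sums for the apache worker memory footprint.
    # 8 MB average resident size per worker is an assumed figure; the
    # real value depends on which modules are loaded.
    RAM_MB = 2048
    WORKER_RSS_MB = 8

    for max_workers in (400, 200, 100):
        total = max_workers * WORKER_RSS_MB
        headroom = RAM_MB - total
        if headroom < 0:
            status = "swapping"
        else:
            status = "%d MB left for disk cache" % headroom
        print("%3d workers -> ~%d MB resident, %s" % (max_workers, total, status))

With anything like these numbers, 400 workers don't fit in 2GB of RAM,
which matches the swapping we saw; 200 survives but leaves little room
for disk cache.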

Day 2: Still too many workers, then too much IO.

All was fine at the beginning of day 2. However, as load
increased, the server started grinding to a halt again. After
some inspection using top and iotop, it was clearly spending
most of its time doing two types of IO:

1. IDE git-related IO.
2. swapping

The git-related IO appeared to be larger in volume than the
swapping. I think the machine was teetering on the edge of being
acceptable memory-wise, but because so much memory was in use the
disk cache was small, so the git-related IO was considerably
slower than normal.

So we further reduced the maximum number of workers down to 100.
This *significantly* reduced IO immediately -- within moments the
load average dropped from ~100 to somewhere around 1.

It still wasn't great. There was still quite a lot of
git-related IO going on. So lots of apache workers were blocked
doing git-related things, which obviously meant that response
times went up considerably.

There's a rumour going around that it had something to do with the load
from the srcomp stuff polling. However, this was not the case. The
srcomp stuff only really loads the CPU, since its data is only loaded to
memory once and then served many times after that. The server has 8
CPUs, and these were minimally loaded throughout -- the linode logs show
a peak of ~150% (out of a possible 800%) across the event. This fits
with what we observed in top on the machine.

Suggested mitigation approaches to be applied in various arrangements:

1. Migrate to Linode's SSD situation. This is part of their normal
plan now. We just have to get off a 32-bit arch, and onto
64-bit. This is easy, as we need to transition to Fedora 20
this summer anyway.
2. Switch to a linode with more RAM. Having more free RAM for disk
cache will benefit us considerably.
3. Split load across multiple linodes. e.g. get two additional
linodes, and serve one half of the teams the IDE from one, and
the other half from the other.
4. Upgrade the linode to a much more beefy one for just the
duration of the competition. This appears to be possible now as
linode have switched to hourly billing.
5. Ensure that git-repack is run against IDE repos reasonably
regularly. This should improve git's performance somewhat.
However, it is a very IO-intensive operation, so it can only
really be done at less loaded times -- e.g. 2-6 AM (see the
sketch after this list).
6. Reduce the IDE's git-related loading. e.g. only populate the
drop-down showing a file's history when it is actually necessary
(i.e. the user clicks on it). Presumably most of the IDE's git
activity is superfluous most of the time, as the most common use
case is a user just editing a single file, rather than using any
of the other bits of the IDE's interface.
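
As a rough sketch of the repack job from item 5 (the repo path is
/var/www/html/ide/repos on the current setup; the flags and schedule
here are just a suggestion):

    #!/usr/bin/env python
    # Repack every IDE git repository; intended to be run from cron in
    # a quiet window (e.g. 02:00-06:00).
    import os
    import subprocess

    REPO_ROOT = "/var/www/html/ide/repos"

    for dirpath, dirnames, filenames in os.walk(REPO_ROOT):
        if "HEAD" in filenames and "objects" in dirnames:
            # Looks like a git repository: repack and drop redundant packs.
            subprocess.check_call(["git", "repack", "-a", "-d", "-q"],
                                  cwd=dirpath)
            dirnames[:] = []  # don't descend into the repo's internals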

Cheers,

Rob

Alistair Lynn

May 12, 2014, 11:18:01 AM
to srobo...@googlegroups.com
Hi Rob-

> <analysis>

Thanks for digging into this.

> 5. Ensure that git-repack is run against IDE repos reasonably
> regularly. This should improve git's performance somewhat.
> However, it is a very IO-intensive operation, so it can only
> really be done at less loaded times -- e.g. 2-6 AM.

Shouldn't this happen automatically due to the periodic git gc mechanism?

Alistair


Rob Spanton

May 12, 2014, 11:21:58 AM
to srobo...@googlegroups.com
On Mon, 2014-05-12 at 16:18 +0100, Alistair Lynn wrote:
> Shouldn't this happen automatically due to the periodic git gc
> mechanism?

It isn't. I repacked all the repos after the competition, which reduced
the 556,000 files in /var/www/html/ide/repos down to 186,000, and took 55
minutes.
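
(For what it's worth, that figure is just a recursive file count over
the repos directory, along the lines of:

    import os
    print(sum(len(files) for _, _, files in os.walk("/var/www/html/ide/repos")))

rather than anything git-aware.)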

Also, we probably don't want it to happen automatically in the IDE's
requests, as that would increase the latency of each IDE request
further. Repacking them in the early hours of the day, when IDE latency
does not matter, is preferable I think.

R

Andrew Cottrell

May 12, 2014, 11:28:56 AM
to srobo...@googlegroups.com
Hi Rob

>    2. Switch to a linode with more RAM.  Having more free RAM for disk
>      cache will benefit us considerably.

I hear it's possible for us to do this temporarily. Next competition we could arrange for it to happen a week or two beforehand, check everything works (with time to roll back if there's some inexplicable bug), and then roll back after the competition. Is this a viable option / the plan?

Cheers,
Andy

Rob Spanton

May 12, 2014, 11:28:58 AM
to srobo...@googlegroups.com
On Mon, 2014-05-12 at 16:21 +0100, Rob Spanton wrote:
> On Mon, 2014-05-12 at 16:18 +0100, Alistair Lynn wrote:
> > Shouldn't this happen automatically due to the periodic git gc
> > mechanism?
>
> It isn't. I repacked all the repos after the competition, which reduces
> the 556000 files in /var/www/html/ide/repos down to 186000, and took 55
> minutes.

This stackoverflow answer hints that it's not being run on our system
because the threshold isn't being reached:
http://stackoverflow.com/questions/3532740/do-i-ever-need-to-run-git-gc-on-a-bare-repo

However, at the moment, we don't really know whether running it actually
helps. All I know is that it reduces the number of files, which may
reduce the amount of filesystem IO. Some actual simulated competition
load would help with answering this question, as well as with fixing the
whole problem.

Cheers,

Rob

Rob Spanton

May 12, 2014, 11:29:29 AM
to srobo...@googlegroups.com
On Mon, 2014-05-12 at 16:28 +0100, Andrew Cottrell wrote:
> I hear it's possible for us to do this temporarily, so next
> competition we could arrange for this to happen a week or two before
> the competition, check it all works (with time for a roll back if
> there's some inexplicable bug) and then roll back after the
> competition. Is this a viable option /
> the plan?

That was number 4 on the list.

Sam Phippen

May 12, 2014, 11:30:42 AM
to srobo...@googlegroups.com
>
> However, at the moment, we don't really know whether running it actually
> helps. All I know is that it reduces the number of files, which may
> reduce the amount of filesystem IO. Some actual simulated competition
> load would help with answering this question, as well as fixing the
> whole problem.
>

To this end, I’m vaguely starting to bake some load testing tools that might help with this.
I will explain in more detail when I’m happy that the solution isn’t completely nuts. Don’t
expect particularly fast delivery of this, but it will exist eventually.
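
Just to illustrate the sort of thing I mean (this is not the actual
tool, and the log path, target host and concurrency below are all
placeholders): one possible shape is replaying logged GET requests
against a staging host with a pool of simulated clients.

    from concurrent.futures import ThreadPoolExecutor
    import re
    import urllib.request

    LOG = "access_log"           # copy of the competition webserver log
    TARGET = "http://localhost"  # staging server to load test
    WORKERS = 50                 # simulated concurrent clients

    def fetch(path):
        try:
            with urllib.request.urlopen(TARGET + path, timeout=30) as resp:
                return resp.status
        except Exception:
            return None

    with open(LOG) as f:
        paths = [m.group(1) for m in
                 (re.search(r'"GET (\S+) HTTP', line) for line in f) if m]

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = list(pool.map(fetch, paths))

    print(len(results), "requests replayed,",
          sum(r is None for r in results), "failed")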

Thanks

Sam Phippen

Andrew Cottrell

May 12, 2014, 11:32:09 AM
to srobo...@googlegroups.com
Not sure how I missed that, sorry!

Peter Law

May 12, 2014, 4:13:51 PM
to Student Robotics
Rob wrote:
> <failure analysis>

Thanks for digging into that.

> Suggested mitigation approaches to be applied in various arrangements:

I've converted most of these to trac tickets, as follows:

> 1. Migrate to Linode's SSD situation.

http://srobo.org/trac/ticket/2463, also updated
http://srobo.org/trac/ticket/1480 with the new x64 requirement.

> 2. Switch to a linode with more RAM. Having more free RAM for disk
> cache will benefit us considerably.

Not a ticket yet as I'm not sure how this fits there. Is this
something we definitely want to do? I'm guessing that this would cost
us more, so is it something that we want to do just for the
competition? If it is, how does this differ from #4?

> 3. Split load across multiple linodes.

http://srobo.org/trac/ticket/2465 exists to investigate this. We can
create further tickets to make it happen later.

> 4. Upgrade the linode to a much more beefy one for just the
> duration of the competition. This appears to be possible now as
> linode have switched to hourly billing.

Also not a ticket yet as it appears to be similar to #2 which has
questions attached.

> 5. Ensure that git-repack is run against IDE repos

This feels like something we should do anyway;
http://srobo.org/trac/ticket/2466.

> 6. Reduce the IDE's git-related loading.

http://srobo.org/trac/ticket/2467

I expect that the majority of the loading comes from actions which are
part of the standard workflow anyway, since the IDE does a git update
(wrapped in various autosave-preserving functionality) as part of
pretty much all requests which use a user's repo. This is needed to
make sure the latest data is available.

That said, lots of the read-only operations now occur directly on the
bare master repos, which shouldn't be hit by that, but which would
still be hit by slow IO. In turn, since these still grab exclusive
locks rather than shared locks on the master repos, each will prevent
others from accessing the repo. How much of an issue that causes is
unclear, since the chances are that at the competition a given team
would only have a single IDE instance pointed at each project anyway.
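
To illustrate the shared vs. exclusive distinction (assuming
flock-style locking, which may not be exactly what the IDE does), a
read-only operation could take a shared lock and let other readers
proceed concurrently:

    import fcntl

    # LOCK_SH lets any number of readers hold the lock at once;
    # LOCK_EX excludes everyone, which needlessly serialises read-only
    # requests against the same master repo.
    with open("/tmp/master-repo.lock", "w") as lockfile:  # placeholder path
        fcntl.flock(lockfile, fcntl.LOCK_SH)
        # ... read-only git operations on the master repo ...
        fcntl.flock(lockfile, fcntl.LOCK_UN)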

Thanks,
Peter

Jeremy Morse

May 14, 2014, 9:12:04 AM
to srobo...@googlegroups.com
Hi,

I've provided Sam with the webserver logs from across the competition,
and a copy of the git repos backed up on the morning of the 27th (so
they're all unpacked).

IMO, being able to load test a) the IDE and b) the rest of the
website in a way that can be replicated is extremely desirable, since
that's the best way of identifying future bottlenecks. It's a
significant task though, so I'll be most impressed if it happens (and
happens well).

~

This is pure speculation, but libgit2 does support non-disk database
backends -- even mysql, redis and suchlike. It's /possible/ that moving
to an actual database would improve performance, but testing that would
require a really high investment (although it's already desirable [0]).

IMO the primary benefit of such a move is that the IDE would use
technology that scales in a way that's well understood and measurable,
and is common in existing web tech designs.

The downsides would be that everything becomes more opaque (one cannot
just navigate to a directory and frob a git repo) and there'd be more
technical work to integrate it with the rest of our infrastructure
(git-http-backend for example).

Any discussion of this should happen in a different thread.

[0] https://www.studentrobotics.org/trac/ticket/773

--
Thanks,
Jeremy
