Competition Server Fail Analysis


Sam Phippen

Jun 29, 2014, 2:52:11 PM
to sr...@googlegroups.com
Hello Everyone,

I did some analysis of the server logs from the competition and some benchmarking to attempt to reproduce the failures we encountered at the competition. This reproduction was successful and I’ve published my analysis here: http://funandplausible.com:8888/

The basic conclusion is: let’s get an SSD-backed server and let’s move the competition API and teams_data.php endpoints to another server.

Thanks
Sam Phippen

Sam Phippen

Jun 29, 2014, 4:08:54 PM
to sr...@googlegroups.com
Btw Everyone,

I’ve gisted the results csvs here:


The CSVs are slightly inconsistent because I opened one of them in Numbers (Apple's spreadsheet tool) and saved it out again; the rest are as they came out of Locust.

I’ve also got my locust file here: 


Thanks
Sam Phippen

Peter Law

Jun 29, 2014, 4:24:10 PM
to Student Robotics
I'm probably missing something really obvious, but could you clarify
what each of the columns means? In particular, I'm not sure I understand
what the % ones are.

Thanks,
Peter

Sam Phippen

Jun 29, 2014, 4:25:24 PM
to sr...@googlegroups.com
They’re centile (percentile) response times:

50% -> 50% of requests are faster than this (so it’s the median)
75% -> 75% of requests are faster than this (so it’s the 3rd quartile)
95% -> 95% of requests are faster than this (so 5% of requests, 1 in 20, are slower)
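
To make that concrete, here’s a throwaway sketch, with made-up numbers, of how a centile column is derived from a list of response times. Locust’s exact method may differ slightly:

# Nearest-rank centile over a list of response times (milliseconds).
def centile(samples, pct):
    ordered = sorted(samples)
    index = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[index]

times = [120, 135, 150, 180, 210, 250, 400, 750, 900, 4000]  # made up
print(centile(times, 50))  # -> 210: the 50% column
print(centile(times, 95))  # -> 4000: the 95% column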

Thanks

Sam Phippen

Peter Law

Jun 29, 2014, 4:36:45 PM
to Student Robotics
Sam wrote:
> I did some analysis of the server logs from the competition and some
> benchmarking to attempt to reproduce the failures we encountered at the
> competition. This reproduction was successful and I've published my analysis
> here: http://funandplausible.com:8888/

Many thanks for this. I've some general thoughts first, and then some
more specific queries against the data.

Is this (Locust) something that we can run against a locally hosted
badger clone? I'd really like to be able to explore alternative
scenarios without needing to thrash badger each time. Thanks for
posting your locust config.
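
For reference, a minimal locustfile for a locally hosted clone might look roughly like the sketch below. This is an assumption-laden sketch rather than Sam's actual file: the host, port and task weights are invented, and the endpoint paths are just the ones mentioned in this thread. With a current Locust release it would be run as `locust -f local_badger.py --host=http://localhost:8080` and driven from the web UI (the 2014-era Locust API used HttpLocust/TaskSet rather than HttpUser):

from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    # Each simulated user waits 1-5 seconds between requests.
    wait_time = between(1, 5)

    @task(3)
    def front_page_matches(self):
        # Same style of comp-api request as in the benchmark.
        self.client.get(
            "/comp-api/matches/A?numbers=previous%2Ccurrent%2Cnext%2Cnext%2B1")

    @task(1)
    def teams_data(self):
        self.client.get("/teams_data.php")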

> The IDE accounts for a tiny fraction of the requests throughout the competition.
> Of the 1,950,723 requests only 139,476 came from the IDE; that's just over 7%.

I'm a little surprised about this, given what we expected the teams to
mainly be doing. I'm also wondering if the "low" number of requests
could be related to the fact that the IDE endpoints took relatively
long to respond.

> The comp-api family of endpoints accounts for 74% of all requests. My guess:
> some kind of polling rate was insane somewhere, or just the sheer number of
> people who had this open caused some craziness.

The polling rate on the comp-api clients was very mixed. Most of the
website-related ones polled about every 10s, though some were less
often. The arena screens polled every 1s (though this load was
intended to have been routed to a separate VM, and I think was for the
latter portion of Sunday), and the venue displays polled every 30s.

The endpoints used in the benchmarking
(`/comp-api/matches/A?numbers=previous%2Ccurrent%2Cnext%2Cnext%2B1`
and the variant for arena B) look like they will have originated
(only) from the front page of the website. Were these chosen for a
specific reason?

~~

While you do note that disk IO / blocking are an issue, I'm not sure I
follow how that's related to the loading from comp-api. The comp-api
endpoints by their very nature have effectively no disk access [1].

If we're suggesting that this is the main cause of the problematic
load it would seem to be in contradiction to both what it should be
doing and what Rob's earlier investigation [2] showed. Is there
anything which specifically points to comp-api being at fault?

I should probably ask at this point what evidence Rob had for
suggesting that it was IDE related git IO causing the issue, since his
analysis didn't say why he felt that was to blame either. Rob: please
could you add that to this discussion?

teams_data.php feels like a much more likely candidate for causing IO
issues than comp-api, and unfortunately would be somewhat harder to
move to another server. This endpoint reads all of the team-status
json files as well as the teams.json and munges them together for
output. Its output doesn't get cached on the server anywhere I'm aware
of.
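
In rough pseudocode, every request appears to do something like the following (a Python sketch of the pattern only; the real endpoint is PHP, and the file names and output shape here are guesses):

import glob
import json

def build_teams_data(status_dir="team-status", teams_file="teams.json"):
    # No caching: every request re-reads every file from disk.
    with open(teams_file) as f:
        output = {"teams": json.load(f), "statuses": []}
    for path in sorted(glob.glob(status_dir + "/*.json")):
        with open(path) as f:  # one read per team, per request
            output["statuses"].append(json.load(f))
    return json.dumps(output)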

Would it be possible to run a test which combines the teams_data.php
endpoint with the IDE testing (but without comp-api), in order to see
which of comp-api and teams_data.php was causing the issues?

> The basic conclusion is: let's get an SSD backed server

This seems a useful thing to do anyway (and we've a ticket for it
already [3]), and hopefully should happen. I don't think we should
rely on it solving the problems however.

> and let's move the
> competition api and teams_data.php endpoints to another server.

Currently, the main (public) consumers of these are pages on our website.
In theory (though it didn't actually happen) the other clients (the
arena screens, venue displays etc.; note that these only use comp-api
and not teams_data as far as I know) should have been routed to a
separate VM anyway.

Do we have any data on the proportion of requests which came from each
source? (admittedly I'm not really sure how you'd determine this).

The main complication with moving comp-api to another host is that
we'd need to do something to avoid hitting same-origin policy issues when
using it as part of the website.

Peter

[1] it checks whether it should reload the data at most once every
five seconds. The check consists of comparing the modification time of
a specific file to the last time it was checked; only when that
detects that an update is needed is the main state reloaded (see the
sketch after these footnotes).
I'm not actually sure how/if it avoids two requests trying to reload
the data at the same time, but since this test won't have changed the
data this shouldn't have been an issue.
[2] https://groups.google.com/d/msg/srobo-devel/S_fufDXr5ro/xUPHzSYYMfoJ
[3] https://www.studentrobotics.org/trac/ticket/2463
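
For completeness, the pattern described in [1] is roughly the following (an assumed sketch, not the actual comp-api code; the file name and interval are illustrative):

import os
import time

STATE_FILE = "compstate.trigger"  # hypothetical file whose mtime signals updates
CHECK_INTERVAL = 5                # seconds between mtime checks

_state, _last_check, _last_mtime = None, 0.0, 0.0

def get_state():
    # Reload the (expensive) competition state at most once per interval,
    # and only when the trigger file's mtime has actually changed.
    global _state, _last_check, _last_mtime
    now = time.time()
    if _state is None or now - _last_check >= CHECK_INTERVAL:
        _last_check = now
        mtime = os.path.getmtime(STATE_FILE)
        if _state is None or mtime != _last_mtime:
            _last_mtime = mtime
            with open(STATE_FILE) as f:
                _state = f.read()  # stand-in for the real reload
    return _state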

Rob Spanton

Jun 29, 2014, 6:17:40 PM
to sr...@googlegroups.com
Hi Sam,

Sam wrote:
> I did some analysis of the server logs from the competition and some
> benchmarking to attempt to reproduce the failures we encountered at
> the competition. This reproduction was successful and I've published
> my analysis here: http://funandplausible.com:8888/

Thanks.

> The basic conclusion is: let's get an SSD backed server and let's move
> the competition api and teams_data.php endpoints to another server.

It doesn't seem clear to me from this data that moving the comp-api onto
another server will sort out the problems. Do you have any data that shows
what the actual load of the comp API is when it alone is being accessed?

I note that you say in the last-but-one paragraph that you observed
improved response times when the comp API is not accessed, and you state
that accessing it causes "disk thrashing". It doesn't seem like the
immediate conclusion should be to split into two servers. For example, it
may be that doubling or quadrupling the amount of RAM the server has would
make the problem go away (as the disk cache could then grow) -- see my
original email about the server fail summary.

Could you clarify what sort of ordering of, and timing between, requests
you used? It sounds like you've got 100 workers continuously making
requests to the server, but it's unclear exactly what's going on with
them.

Also, are the response times shown here the times for the whole HTTP
request to complete? It would be good to have some data on how long
these connections take to be accepted, and then how long it takes for
them to actually be processed. This would shed some light on what was
actually going on.

It'd also be good to see what the results look like for a static file
being served by Apache, to provide a baseline.

Were the IDE endpoints that you benchmarked really the ones that were
called the most? It seems more likely that things like the poll
endpoint would have been called quite often (clients call it every 30
seconds) in comparison to some of the ones that you've included (like
lint, put, co, commit).

Cheers,

Rob

Sam Phippen

Jun 29, 2014, 6:21:17 PM
to sr...@googlegroups.com
Hi Rob,

Totally fair point: a bigger server will almost certainly make the load of the comp API vanish. I don’t have sequencing or accept-time data as Locust doesn’t produce that. The IDE endpoints benchmarked here were the ones I determined to actually cause load in earlier testing; I couldn’t make the poll endpoint cause any kind of load in initial testing.

Thanks

Sam Phippen

Jeremy Morse

Jun 30, 2014, 11:27:40 AM
to sr...@googlegroups.com
Hi,

I'm intensely overjoyed that a data-driven approach is happening here.
Being able to test where the load on the webserver comes from is so /so/ good.
Great thanks to Sam for making this happen and publishing the data / scripts.

I think we'd be wise in this thread to focus on what else we can measure
about the website before thinking about how to go about scaling it up.
As Rob suggests, there are still a few more benchmarks that could be run.

A couple of things,

Peter wrote:
> While you do note that disk IO / blocking are an issue, I'm not sure I
> follow how that's related to the loading from comp-api. The comp-api
> endpoints by their very nature have effectively no disk access [1].

One of the things we can't measure is memory pressure -- additional
worker threads that consume significant amounts of memory wipe out the
file cache, which indirectly leads to more IO later.

I'd also like to point out that the SSD server ('hahabusiness') was a
linode2048 (as of 28/06/14), so it has the same amount of memory as badger,
but with an SSD disk. One thing that's different between the two, though,
is the CPU allocation: hahabusiness has two virtual CPUs that (I believe)
have two dedicated cores on the VM host, whereas badger has 8 virtual
CPUs with some kind of scheduling applied to them. Exactly how this
affects server performance is open to speculation.

--
Thanks,
Jeremy


Peter Law

Jul 3, 2014, 5:38:23 PM
to Student Robotics
Hi,

Sam wrote:
> I've published my analysis here: http://funandplausible.com:8888/

This appears to be down. Does anyone have a copy of it? I think it
would be useful if we could duplicate it onto trac [1].

Thanks,

Peter

[1] https://www.studentrobotics.org/trac/wiki/2014/Competition/ServerLoadAnalysis

Jeremy Morse

Jul 15, 2014, 7:54:51 PM
to sr...@googlegroups.com
Hi,

This has stalled a little; I think we can gather a lot more information
with additional benchmarks. Some of the ones suggested so far:
1. Base cost of serving a static file
2. Comp-API requests only
3. Comp-API with a greater variety of endpoints (as Peter notes, the ones
picked appear to originate only from the front page)
4. teams_data.php + IDE

I'd also like to run some of these benchmarks with the worker.c MPM --
i.e., with an instance of Apache that serves clients through threads,
rather than forking one server process per client. This kills the IDE,
but IMO its load originates mostly from non-Apache things [0]. Threaded
httpd has been Apache's preferred configuration for years, apparently.

The first two suggested benchmarks are straightforward; the latter two
need some consideration of what requests to make. Could someone who
understands the website infrastructure suggest which requests should be
made?

[0] And if it provides a considerable performance improvement, then we
know where our time should be invested.

--
Thanks,
Jeremy
