Sam wrote:
> I did some analysis of the server logs from the competition and some
> benchmarking to attempt to reproduce the failures we encountered at the
> competition. This reproduction was successful and I've published my analysis
> here:
> http://funandplausible.com:8888/
Many thanks for this. I've some general thoughts first, and then some
more specific queries against the data.
Is this (Locust) something that we can run against a locally hosted
badger clone? I'd really like to be able to explore alternative
scenarios without needing to thrash badger each time. Thanks for
posting your Locust config.
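Assuming your config is a standard locustfile, pointing it at a local
clone should mostly just be a matter of overriding the host. A minimal
sketch (recent Locust API; the wait time and task below are
illustrative rather than lifted from your config):

    # locustfile_local.py -- minimal sketch, not the actual benchmark config
    from locust import HttpUser, task, constant

    class WebsiteVisitor(HttpUser):
        wait_time = constant(10)  # roughly the website polling interval

        @task
        def front_page(self):
            self.client.get("/")  # placeholder task; reuse the real ones from your config

Then something like `locust -f locustfile_local.py
--host=http://localhost:8080` (or whatever port the local clone serves
on) should let us drive the clone from the Locust web UI.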
> The IDE accounts for a tiny fraction of the requests throughout the competition.
> Of the 1,950,723 requests, only 139,476 came from the IDE; that's just over 7%.
I'm a little surprised by this, given what we expected the teams to
mainly be doing. I'm also wondering whether the "low" number of
requests could be related to the IDE endpoints taking a relatively
long time to respond.
> The comp-api family of endpoints account for 74% of all requests. My guess:
> some kind of polling rate was insane somewhere, or just the sheer number of
> people who had this open caused some craziness.
The polling rate on the comp-api clients was very mixed. Most of the
website-related ones polled about every 10s, though some were less
frequent. The arena screens polled every 1s (though this load was
intended to have been routed to a separate VM, and I think it was for
the latter portion of Sunday), and the venue displays polled every 30s.
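If it's useful for modelling that mix in Locust, weighted user classes
map onto it fairly directly. A sketch, with guessed weights and a
simplified endpoint rather than whatever each client actually requests:

    from locust import HttpUser, task, constant

    class WebsitePage(HttpUser):
        weight = 20               # many people with the site open
        wait_time = constant(10)

        @task
        def poll(self):
            self.client.get("/comp-api/matches/A")  # plus whatever query string the page uses

    class ArenaScreen(HttpUser):
        weight = 2                # only a handful of these existed
        wait_time = constant(1)

        @task
        def poll(self):
            self.client.get("/comp-api/matches/A")

    class VenueDisplay(HttpUser):
        weight = 4
        wait_time = constant(30)

        @task
        def poll(self):
            self.client.get("/comp-api/matches/A")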
The endpoints used in the benchmarking
(`/comp-api/matches/A?numbers=previous%2Ccurrent%2Cnext%2Cnext%2B1`
and the variant for arena B) look like they would have originated
only from the front page of the website. Were these chosen for a
specific reason?
~~
While you do note that disk IO / blocking are an issue, I'm not sure I
follow how that's related to the load from comp-api. The comp-api
endpoints by their very nature have effectively no disk access [1].
If we're suggesting that this is the main cause of the problematic
load, that would seem to contradict both what it should be doing and
what Rob's earlier investigation [2] showed. Is there anything which
specifically points to comp-api being at fault?
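For context, my reading of the reload behaviour described in [1] is
roughly the following shape (a sketch only, not the actual comp-api
code; the names are made up):

    import os
    import time

    CHECK_INTERVAL = 5  # seconds; the check runs at most this often

    _last_check = 0.0   # when we last looked at the marker file
    _state = None

    def get_state(marker_path, load_state):
        # Return cached state, re-reading it only when the marker file changes.
        global _last_check, _state
        now = time.time()
        if _state is None or now - _last_check >= CHECK_INTERVAL:
            # The only routine disk access is this single stat() of the marker file.
            if _state is None or os.path.getmtime(marker_path) > _last_check:
                _state = load_state()  # the expensive reload; rare in practice
            _last_check = now
        return _state

i.e. at most a stat() every few seconds plus an occasional reload,
which is what the "effectively no disk access" claim above rests on.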
I should probably ask at this point what evidence Rob had for
suggesting that it was IDE-related git IO causing the issue, since his
analysis didn't say why he felt that was to blame either. Rob: please
could you add that to this discussion?
teams_data.php feels like a much more likely candidate for causing IO
issues than comp-api, and unfortunately would be somewhat harder to
move to another server. This endpoint reads all of the team-status
JSON files as well as teams.json and munges them together for output.
Its output doesn't get cached on the server anywhere that I'm aware
of.
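To make the comparison concrete, my understanding is that each request
to that endpoint does something like the following (Python purely for
illustration; the real merge details will differ, but the IO pattern is
the point):

    import glob
    import json
    import os

    def teams_data(status_dir="team-status", teams_file="teams.json"):
        # Hypothetical output shape; the real munging may differ.
        with open(teams_file) as f:               # one read
            output = {"teams": json.load(f), "statuses": {}}

        # One further read per team, on every single request, with no cache in front.
        for path in glob.glob(os.path.join(status_dir, "*.json")):
            tla = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                output["statuses"][tla] = json.load(f)

        return output

So unlike comp-api, this really does do disk work on every hit.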
Would it be possible to run a test which combines the teams_data.php
endpoint with the IDE testing (but without comp-api), in order to see
which of comp-api and teams_data.php was causing the issues?
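In Locust terms that could just be a cut-down set of user classes,
something like this (both paths are placeholders; the IDE requests in
particular would need lifting from your existing config):

    from locust import HttpUser, task, constant

    class TeamsDataUser(HttpUser):
        wait_time = constant(10)

        @task
        def teams_data(self):
            self.client.get("/teams_data.php")  # path assumed; adjust to wherever it's served

    class IdeUser(HttpUser):
        wait_time = constant(5)

        @task
        def ide(self):
            self.client.get("/ide/")  # placeholder for the existing IDE benchmark requests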
> The basic conclusion is: let's get an SSD backed server
This seems a useful thing to do anyway (and we've a ticket for it
already [3]), and hopefully should happen. I don't think we should
rely on it solving the problems, however.
> and let's move the
> competition api and teams_data.php endpoints to another server.
Currently, the main (public) consumers of these endpoints are pages on
our website. In theory (though it didn't actually happen) the other
clients (the arena screens, venue displays, etc.; note that these only
use comp-api and not teams_data as far as I know) should have been
routed to a separate VM anyway.
Do we have any data on the proportion of requests which came from each
source? (Admittedly, I'm not really sure how you'd determine this.)
The main complication with moving comp-api to another host is that
we'd need to work around the browser's same-origin policy when calling
it from pages on the website.
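The usual ways round that would be either to keep proxying /comp-api
through the main web host, or to have the new host send CORS headers.
Purely as an illustration of the latter, and assuming something
WSGI-like could sit in front of comp-api (which may well not match how
it's actually deployed):

    def allow_website_origin(app, origin="https://www.studentrobotics.org"):
        # WSGI middleware sketch: add a CORS header so pages on the main site
        # can call a comp-api that lives on a different host.
        def middleware(environ, start_response):
            def cors_start_response(status, headers, exc_info=None):
                headers = list(headers) + [("Access-Control-Allow-Origin", origin)]
                return start_response(status, headers, exc_info)
            return app(environ, cors_start_response)
        return middleware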
Peter
[1] it checks whether it should reload the data at most once every
five seconds. The check consists of comparing the modification time of
a specific file with the last time it was checked. Only when that
detects that an update is needed is the main state reloaded.
I'm not actually sure how/if it avoids two requests trying to reload
the data at the same time, but since this test won't have changed the
data, this shouldn't have been an issue.
[2]
https://groups.google.com/d/msg/srobo-devel/S_fufDXr5ro/xUPHzSYYMfoJ
[3]
https://www.studentrobotics.org/trac/ticket/2463