Just to try to summarize our findings in #clusto today:
Clusto becomes very slow when you have a long history of clusto
transactions and/or many attributes. The problem seems to be
multi-faceted, but boils down to two issues: Clusto hits the database
far too often, and the queries it is making are not cacheable. Listing
the contents of a single pool with 82 entities in it makes 338
expensive, non-cacheable queries to the database for our data set.
Jorge/kad- suggests that the non-cacheable queries happen because even
reads are wrapped in an SQLAlchemy transaction in which the clusto
version is incremented, and that increment is then rolled back afterwards.
However, within the context of the query, that increment has happened,
making the resulting SQL query miss the cache.
Lex/lexlinden notes that cacheability is not the whole story, by far.
They were only able to achieve a 30% improvement by caching everything
in local memcached. He suggests that the other side of the problem is
reducing the number of times clusto hits the database. Notably, every
time you wish to fetch a single attribute or small subset thereof,
Clusto apparently loads all of the attributes from the database.
I hope I didn't miss anything, but if I did, others can chime in.
Also, Jeremy/synack logs everything in IRC so we can always pester him
for a transcript.
P.S. Big thanks to Rob/rcoli and Timeless/mrphilov for showing off
their DBA black magic helping me diagnose this issue.
Just to clarify:
There's no way to ask clusto to fetch a single attribute, even if you
know there will be only one matching your key/subkey/number/etc
arguments. Attributes are always fetched by calling Driver.attrs(), or
another accessor that eventually hits Driver.attrs(). That function, in
turn, accesses Entity.attrs, fetching the entity's entire attribute list
and filtering it on the client side.
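
To make the difference concrete, here is a minimal sketch comparing
"fetch everything, filter in Python" with "let the database filter".
It uses an in-memory SQLite database rather than MySQL, and the table
layout, function names and sample rows are all made up for
illustration; this is not Clusto's actual schema or code.

```python
import sqlite3

# Hypothetical, simplified attribute table; NOT Clusto's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attrs (entity_id INTEGER, key TEXT, subkey TEXT, value TEXT)"
)
conn.execute("CREATE INDEX idx_attrs ON attrs (entity_id, key, subkey)")
conn.executemany(
    "INSERT INTO attrs VALUES (?, ?, ?, ?)",
    [
        (1, "system", "hostname", "web01"),
        (1, "port-nic-eth", "mac", "00:11:22:33:44:55"),
        (1, "ip", "ipstring", "10.0.0.5"),
        (2, "system", "hostname", "web02"),
    ],
)


def attrs_client_side(entity_id, key, subkey):
    """Fetch every attribute of the entity, then filter in Python."""
    fetched = conn.execute(
        "SELECT key, subkey, value FROM attrs WHERE entity_id = ?",
        (entity_id,),
    ).fetchall()
    matches = [row for row in fetched if row[0] == key and row[1] == subkey]
    # Second value: how many rows crossed the wire to answer the question.
    return matches, len(fetched)


def attrs_server_side(entity_id, key, subkey):
    """Push the key/subkey filter into the WHERE clause."""
    fetched = conn.execute(
        "SELECT key, subkey, value FROM attrs "
        "WHERE entity_id = ? AND key = ? AND subkey = ?",
        (entity_id, key, subkey),
    ).fetchall()
    return fetched, len(fetched)


if __name__ == "__main__":
    print(attrs_client_side(1, "system", "hostname"))
    print(attrs_server_side(1, "system", "hostname"))
```

Both functions return the same matching rows; the only difference is
how many rows the database has to hand back to get there.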
There are two problems here:
1) It's almost always possible to do the attribute filtering on the
MySQL side, which may allow the use of indexes, speeding things up and
limiting the amount of data traversing the network. To do this, we'd
have to replace the Entity.attrs references with Driver.do_attr_query().
2) Driver.attrs() is called way too often. As Plathrop saw, for a
relatively small set of hosts (e.g. a rack of 41 hosts) and a relatively
small set of data per host (hostname, mac address(es), etc) we at Linden
saw Entity.attrs get referenced over 300 times, so the same data is
retrieved over and over. Even though the query can be completed in a
few tens of milliseconds, added up this takes a long time, and a lot of
data traverses the network over and over. Even a local memcached doesn't
help much, simply because of the sheer number of times this data is retrieved.
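
As a rough illustration of problem 2, here is a toy model that counts
database round trips when every accessor call re-fetches the full
attribute list, next to a variant that fetches once and reuses the
result. FakeDatabase, UncachedEntity and CachedEntity are hypothetical
names invented for this sketch and do not correspond to Clusto classes;
the cached variant is only one possible approach, and it gives up the
"always see the latest data" guarantee until invalidate() is called.

```python
class FakeDatabase:
    """Stand-in for the real database; counts round trips."""

    def __init__(self, data):
        self.data = data
        self.query_count = 0

    def fetch_all_attrs(self, entity_name):
        self.query_count += 1
        return list(self.data[entity_name])


class UncachedEntity:
    """Every attrs() call goes back to the database."""

    def __init__(self, db, name):
        self.db = db
        self.name = name

    def attrs(self, key=None):
        rows = self.db.fetch_all_attrs(self.name)
        return [row for row in rows if key is None or row[0] == key]


class CachedEntity(UncachedEntity):
    """Fetches the attribute list once and filters the local copy."""

    def __init__(self, db, name):
        super().__init__(db, name)
        self._rows = None

    def attrs(self, key=None):
        if self._rows is None:
            self._rows = self.db.fetch_all_attrs(self.name)
        return [row for row in self._rows if key is None or row[0] == key]

    def invalidate(self):
        """Drop the local copy so the next attrs() call re-fetches."""
        self._rows = None


def count_queries(entity_cls, calls):
    """Run `calls` attribute lookups and report the database round trips."""
    db = FakeDatabase({"web01": [("hostname", "web01"), ("mac", "00:11:22:33:44:55")]})
    entity = entity_cls(db, "web01")
    for _ in range(calls):
        entity.attrs(key="hostname")
    return db.query_count


if __name__ == "__main__":
    print("uncached:", count_queries(UncachedEntity, 300))
    print("cached:  ", count_queries(CachedEntity, 300))
```

The trade-off is staleness: the cached variant will not notice changes
made by someone else until its local copy is dropped.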
#2 is due to a plausibly sensible design decision: always check the
database for the most up-to-date information in case someone else has
updated the entity we're looking at. In practice, we need to evaluate
whether that design decision is worth causing Clusto to be relatively
slow, or whether we can slightly relax this constraint in the name of
speed (by only fetching a given entity's full attribute list once per