Hi Ken,
Thanks. The 200+ clients rarely trigger a DB call now, as I'm using
the datastore for caching data and have cron jobs that periodically
flush data to/from GC-SQL and datastore. The DB hangs under a load of
under 1 QPS (according to the GAE dashboard, though I know those
numbers aren't one-to-one for GC-SQL queries) and remains hung even
when there's only a single client in 10 minutes trying to access the
DB. I.e. once the connections are maxed out, none seem to time out in
a quick enough fashion for the system to heal itself.
The application is a research project that involves worker nodes and
the GAE/GC-SQL app as the control master that allocates work and
aggregates results. I can give you more specific detail off-channel,
if that might help. It's a university/Google collaboration, actually.
In general, things work for long stretches now, except that every x
number of hours I come back to find the DB hung with maxed out
connections ("Instance has too many concurrent requests: 101"), which
I can only resolve by restarting my instance manually via the APIs
console. I'm unable to see what the maxed out connections are doing,
as neither the GC-SQL APIs console nor commandline tool can even
connect to the DB to poke around when the connections are maxed out.
There's probably some deadlock scenario that creeps up occasionally,
or otherwise it's the case of random load occasionally leading to a
QPS spike that causes a chain of events (i.e. GAE deadlines --> DB
connections hanging --> GAE using up all 100 connections --> splat).
I think I'd like to take your advice and write another cron job that
kills GC-SQL processes that are stuck waiting too long when the DB
appears down. But, this job would still fail to work once all 100
connections are established (after all, it needs a connection itself
to see the processlist and execute the kill -- same issue that makes
the APIs console mostly useless in this scenario). I'd have to run it
very often to have a chance at being useful. So, this brings me to
another related question:
Is there any programmatic introspection/management available for
GC-SQL instances? E.g. is there a better way to detect and/or
troubleshoot the type of scenario I'm running into? It seems that if
there's an APIs console for it that the team could code up a meta API
for controlling the instance. For example, if I could have GAE calls
that detect the state of the DB, check the number of connections,
(maybe even get the min,max,avg process connection time), test when
the DB fails to load (as the APIs console typically reports under the
above scenario), and allow me to make a programmatic call to restart
the instance, that would be very useful.
Relatedly, is there any logging visible for the GC-SQL itself?
There's a "Logs" tab in the APIs console, but that only shows when I
first created the instance. It could be a bit more verbose. :-)
Thanks for your time and help!
Jake