the issue: it looks like we may not be getting all of our log entries
when we pull the logs from app engine.
first, a little context. there's a lot here, so bear with me.
we need to record event lines for metrics. normally, we would write
the lines to a local file on each app server and then pull those logs
every few minutes from the metrics system. we found this to be the
most stable and scalable architecture.
however, in app engine land, we can't write to a file. so we wrote the
event lines to the logs, set up a script to pull them at 10-minute
intervals, and loaded them into the metrics system.
to be clear, the process goes like this:
1.) an event happens on the server that we'd like to record. we write
a line to the log using logging.info(...) in python.
2.) every 10 minutes, a job starts on a metrics server, which requests
the next batch of logs by calling appcfg.py. the last log line in the
new batch is kept in an append file to use as the 'sentinel' for the
next fetch.
3.) the new log file is parsed for event lines, which are written to
another 'event' file.
4.) a few minutes later, another job grabs new event files and loads
the events into the metrics system.
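to make steps 1 and 3 concrete, here's a minimal sketch. the 'EVENT'
marker and tab-separated line format are made up for illustration - our
real line format differs:

```python
import logging
import re

EVENT_PREFIX = 'EVENT'  # hypothetical marker so the parser can find our lines

def log_event(name, user_id):
    # step 1: write a parseable event line into the normal app logs
    logging.info('%s\t%s\t%s', EVENT_PREFIX, name, user_id)

# step 3: pull event lines back out of a downloaded log file
EVENT_RE = re.compile(r'EVENT\t(?P<name>[^\t]+)\t(?P<user_id>\S+)')

def extract_events(log_lines):
    events = []
    for line in log_lines:
        m = EVENT_RE.search(line)
        if m:
            events.append((m.group('name'), m.group('user_id')))
    return events
```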
everything seemed to work great, until we realized that we were
missing events. a lot of them: between 20% and 50%.
there are some events that need to be shared with other systems. for
one of those event types, i was feeling lazy, so i just fired http
hits at the other system as the events happened. at some point, we
compared these numbers - and found them to be drastically different.
i ran tests today comparing the number of events recorded 'through'
the logs system against the same events recorded by http hit at
runtime. the percent of 'missing' events ranged from 18% to 56%, and
the percent missing appeared to be significantly higher when the
frequency of events was higher (during peak).
i've done a significant amount of work that points to the log lines
already being missing by the time appcfg.py pulls them down. i've
reasonably verified that all the event lines appcfg.py does pull down
make it into the metrics system. oh, and all the numbers are run on
unique user counts, so there's no way the counts could be mistakenly
large (accidentally reading an event twice does not produce a new
unique user id).
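to spell out that last point with toy data: a duplicated log line
can't inflate a unique-user count, because the set of user ids is
unchanged:

```python
batch = ['u1', 'u2', 'u1']        # events keyed by (made-up) user ids
batch_with_dupe = batch + ['u2']  # same event accidentally read twice

# the unique count is identical either way
assert len(set(batch)) == len(set(batch_with_dupe)) == 2
```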
my questions / issues:
1.) should we be getting all of our logged lines from appcfg.py's
request_logs command? is this a known behavior? recall that we are
missing 20-50% of events - this is not a small discrepancy.
2.) we're pulling our logs every 10 minutes. seeing as the
request_logs command lets you specify the window you want in days, i
imagine this is more frequent than intended. could that be causing an
issue?
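for reference, the fetch in step 2 is roughly the following invocation
(app directory and output path are placeholders); --append merges into
the existing file using its last line as the sentinel:

```shell
# pull request logs, severity 0 and up so logging.info lines are included,
# appending to the running log file
appcfg.py request_logs --num_days=1 --severity=0 --append myapp/ logs.txt
```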
3.) we switch major versions of the app every time we push, which can
be several times a day. this doesn't make sense as the issue, since
the numbers are known to be wrong over periods with no version
changes, but i wanted to mention it.
4.) can you suggest a better solution for getting data to offline
processing? right now we're getting the correct numbers using async
http requests without ever calling get_result() or the like - a
'fire-and-forget' http hit (not even sure we're supposed to use it
like this, but it seems to work). however, this approach has serious
drawbacks:
a.) http requests are very slow and expensive for something that does
not need to happen immediately.
b.) if the metrics system endpoint becomes unavailable, then, at best,
the data gets lost. at worst, the issue dominoes back up into the http
servers and takes the app down as well. (each http request has to time
out, which takes significantly longer, spiking the concurrent
connections. this has screwed me multiple times. maybe you google guys
mark an endpoint as down system-wide so that subsequent requests never
attempt the connection, but we were never that smart.)
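for what it's worth, the fire-and-forget hit looks roughly like this.
i've parameterized the urlfetch module so the sketch runs outside app
engine; on the server you'd pass google.appengine.api.urlfetch. the
endpoint url and 2-second deadline are made up:

```python
METRICS_URL = 'http://metrics.example.com/event'  # placeholder endpoint

def fire_and_forget(urlfetch, payload):
    # start the fetch but never call rpc.get_result(), so the request
    # handler doesn't block on the metrics endpoint. a short deadline
    # caps how long a dead endpoint can hold a connection open.
    rpc = urlfetch.create_rpc(deadline=2)
    urlfetch.make_fetch_call(rpc, METRICS_URL, payload=payload,
                             method='POST')
    return rpc
```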
thanks in advance, neil.