the issue: it looks like we may not be getting all of our log entries
when we pull the logs from app engine.
first, a little context. there's a lot here, so bear with me.
we need to record event lines for metrics. normally, we would write
the lines to a local file on each app server and then pull those logs
every few minutes from the metrics system. we found this to be the
most stable and scalable architecture.
however, in app engine land, we can't write to a file. so we wrote the
event lines to the logs, set up a script to pull them at 10-minute
intervals, and loaded them into the metrics system.
to be clear, the process goes like this:
1.) an event happens on the server that we'd like to record. we write
a line to the log using logging.info(...) in python.
2.) every 10 minutes, a job starts on a metrics server, which requests
the next batch of logs by calling appcfg.py. the last log line in the
new batch is kept in an append file to use as the 'sentinel' for the
next fetch.
3.) the new log file is parsed for event lines, which are written to
another 'event' file.
4.) a few minutes later, another job grabs new event files and loads
the events into the metrics system.
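to make steps 1 and 3 concrete, here's a minimal sketch. the 'EVENT'
marker and tab-separated line format are made up for illustration - our
real line format differs:

```python
import logging
import re

EVENT_PREFIX = 'EVENT'  # hypothetical marker so the parser can find our lines

def log_event(name, user_id):
    # step 1: write a parseable event line into the normal app logs
    logging.info('%s\t%s\t%s', EVENT_PREFIX, name, user_id)

# step 3: pull event lines back out of a downloaded log file
EVENT_RE = re.compile(r'EVENT\t(?P<name>[^\t]+)\t(?P<user_id>\S+)')

def extract_events(log_lines):
    events = []
    for line in log_lines:
        m = EVENT_RE.search(line)
        if m:
            events.append((m.group('name'), m.group('user_id')))
    return events
```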
everything seemed to work great, until we realized that we were
missing events. a lot of them: between 20% and 50%.
there are some events that need to be shared with other systems. for
one of those event types, i was feeling lazy, so i just fired http
hits at the other system as the events happened. at some point, we
compared these numbers - and found them to be drastically different.
i ran tests today comparing the number of events recorded 'through'
the logs system against the same events recorded by http hit at
runtime. the percent of 'missing' events ranged from 18% to 56%, and
the percent missing appeared to be significantly higher when the
frequency of events was higher (during peak).
i've done a significant amount of work that points to the log lines
already being missing by the time appcfg.py pulls them down. i've
reasonably verified that all the event lines appcfg.py does pull down
make it into the metrics system. oh, and all the numbers are run on
unique user counts, so there's no way the counts could be mistakenly
large (accidentally reading an event twice does not produce a new
unique user id).
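to spell out that last point with toy data: a duplicated log line
can't inflate a unique-user count, because the set of user ids is
unchanged:

```python
batch = ['u1', 'u2', 'u1']        # events keyed by (made-up) user ids
batch_with_dupe = batch + ['u2']  # same event accidentally read twice

# the unique count is identical either way
assert len(set(batch)) == len(set(batch_with_dupe)) == 2
```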
my questions / issues:
1.) should we be getting all of our logged lines from appcfg.py's
request_logs command? is this a known behavior? recall that we are
missing 20-50% of events - this is not a small discrepancy.
2.) we're pulling our logs every 10 minutes. seeing as the
request_logs command lets you specify the window you want in days, i
imagine this is more frequent than intended. could that be causing an
issue?
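for reference, the fetch in step 2 is roughly the following invocation
(app directory and output path are placeholders); --append merges into
the existing file using its last line as the sentinel:

```shell
# pull request logs, severity 0 and up so logging.info lines are included,
# appending to the running log file
appcfg.py request_logs --num_days=1 --severity=0 --append myapp/ logs.txt
```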
3.) we switch major versions of the app every time we push, which can
be several times a day. this doesn't make sense as the issue, since
the numbers are known to be wrong over periods with no version
changes, but i wanted to mention it.
4.) can you suggest a better solution for getting data to offline
processing? right now we're getting the correct numbers using async
http requests without ever calling get_result() or the like - a
'fire-and-forget' http hit (not even sure we're supposed to use it
like this, but it seems to work). however, this approach has serious
drawbacks:
a.) http requests are very slow and expensive for something that does
not need to happen immediately.
b.) if the metrics system endpoint becomes unavailable, then, at best,
the data gets lost. at worst, the issue dominoes back up into the http
servers and takes the app down as well. (each http request has to time
out, which takes significantly longer, spiking the concurrent
connections. this has screwed me multiple times. maybe you google guys
mark an endpoint as down system-wide so that subsequent requests never
attempt the connection, but we were never that smart.)
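for what it's worth, the fire-and-forget hit looks roughly like this.
i've parameterized the urlfetch module so the sketch runs outside app
engine; on the server you'd pass google.appengine.api.urlfetch. the
endpoint url and 2-second deadline are made up:

```python
METRICS_URL = 'http://metrics.example.com/event'  # placeholder endpoint

def fire_and_forget(urlfetch, payload):
    # start the fetch but never call rpc.get_result(), so the request
    # handler doesn't block on the metrics endpoint. a short deadline
    # caps how long a dead endpoint can hold a connection open.
    rpc = urlfetch.create_rpc(deadline=2)
    urlfetch.make_fetch_call(rpc, METRICS_URL, payload=payload,
                             method='POST')
    return rpc
```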
thanks in advance, neil.