
Spike in Treeherder New Relic alerts


Ed Morley

Apr 2, 2015, 7:25:48 AM
to Mauro Doglio, Cameron Dawson, William Lachance, Jonathan Griffin, dev-tree-...@lists.mozilla.org
Hi everyone - hope you're all well!

I'm just going through my vacation email backlog & have noticed that there
are a significant number of New Relic alerts, starting shortly after I went
on PTO.

Has anyone looked into these? I can't see any bugs filed for them, and the
only mention of them I could find on IRC was yesterday (a week after they
started), which unfortunately concluded they could be ignored. The alert
emails are currently sent to myself, Cameron and Mauro (though yesterday's
IRC scrollback mentions that people are filtering these out - and
presumably deleting/ignoring them?). I'd also hope that anyone deploying
would check the New Relic app error page for stage/prod beforehand (and
there were several deployments whilst I was away).

Just to put this into context, in the 8 days of my PTO there was a massive
spike:
- 264 alert emails (I'd normally expect single or double digits - and for
server alerts, not app alerts).
- 84,000 app exceptions on prod and 180,000 on stage (I'd expect
1,000-10,000 at most, since our error rate has fallen in recent
weeks/months).
- roughly 20-30 different exception types (whereas we normally see only
half a dozen to a dozen) - and more importantly, the top exception types
are not ones we normally see.

Frustratingly, New Relic only keeps exception logs for 7 days - so we
really need to look into spikes like these as they occur - otherwise we
lose valuable context as to when each exception type first started.
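
In the meantime, something like the script below could archive a daily
error-count snapshot outside New Relic, so a spike stays diagnosable after
the 7-day window has passed. This is only a minimal sketch against New
Relic's REST API v2 metric-data endpoint - the API key, app ID and output
filename are placeholders, not anything we have set up today:

    # Sketch only: archive daily error metrics beyond New Relic's
    # 7-day retention window. API_KEY and APP_ID are placeholders.
    import datetime
    import json
    import requests

    API_KEY = "YOUR_NEW_RELIC_API_KEY"  # placeholder
    APP_ID = "12345"                    # placeholder

    def fetch_error_metrics(hours=24):
        """Fetch summarised error metrics for the last `hours` hours."""
        now = datetime.datetime.utcnow()
        resp = requests.get(
            "https://api.newrelic.com/v2/applications/%s/metrics/data.json"
            % APP_ID,
            headers={"X-Api-Key": API_KEY},
            params={
                "names[]": "Errors/all",
                "values[]": "errors_per_minute",
                "from": (now - datetime.timedelta(hours=hours)).isoformat(),
                "to": now.isoformat(),
                "summarize": "true",
            },
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        # Run daily (eg from cron), appending one JSON line per snapshot.
        snapshot = {
            "fetched_at": datetime.datetime.utcnow().isoformat(),
            "metrics": fetch_error_metrics(),
        }
        with open("newrelic-error-history.jsonl", "a") as f:
            f.write(json.dumps(snapshot) + "\n")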

I've only glanced quickly at the exceptions so far, but my gut instinct is
that a fair proportion of them are due to bad data being submitted by
TaskCluster (there's a rough sketch of the kind of guard I mean after the
list below). Even so, they ideally still needed to be investigated last
week, since:
* Investigating them would presumably have caught bug 1147958.
* They drastically reduce the signal-to-noise ratio of the exception
overview page - and as the cycle-data failure a few months ago proved
(it resulted in us running out of DB disk space on stage, halting
ingestion!), sometimes a major problem produces only one exception per
24 hours - so 10,000-20,000 spurious exceptions a day can easily mask it.
* Now that we've switched off TBPL, there is an even greater need to be
vigilant about infra/reliability issues (particularly initially), so that
we don't harm the perceived reliability of Treeherder.
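
On the TaskCluster point: the general shape of the fix is to validate
payloads at the submission boundary, so malformed data gets a 400 back to
the submitter rather than becoming an unhandled exception that floods the
overview page. The following is purely a hypothetical sketch - none of
these names come from the actual Treeherder ingestion code:

    # Hypothetical sketch - the real Treeherder ingestion code differs.
    # Reject malformed job submissions up front, so bad data produces a
    # clean client error instead of an app exception in New Relic.
    def validate_job_payload(payload):
        """Raise ValueError describing the first problem found."""
        if not isinstance(payload, dict):
            raise ValueError("payload must be a JSON object")
        for field in ("project", "revision", "job"):
            if field not in payload:
                raise ValueError("missing required field: %r" % field)
        if not isinstance(payload["job"], dict):
            raise ValueError("'job' must be an object")

    # At the API boundary (Django-style sketch):
    #     try:
    #         validate_job_payload(payload)
    #     except ValueError as e:
    #         return HttpResponseBadRequest(str(e))  # 400, not a 500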

I know I've been the one digging into New Relic the most so far - but I
think it may be worth spreading that knowledge more, so we maintain
coverage when one of us goes on vacation. In addition, since all of us end
up deploying at one point or another, I think we all need to know how to
read the New Relic tea leaves regardless of PTO :-)

I'm happy to give a quick demo/guide presentation during today's Treeherder
meeting (presuming Vidyo screenshare cooperates), covering things like:
* The difference between app, server, plugin & key transaction pages/alerts.
* How to configure alert settings/thresholds, and what error rates we
typically see.
* Discussing our workflow for dealing with these alerts (e.g. who/when,
filing bugs, when to ignore, etc.).

Catch up with you all later :-)

Best wishes,

Ed