Observability Team Newsletter (2024q1)
(cross-posted to sentry-users, crash-reporting-wg, obs-team)
The Observability Team is dedicated to the problem domain and discipline of observability at Mozilla.
We own, manage, and support monitoring infrastructure and tools supporting Mozilla products and services. Currently this includes Sentry and crash ingestion related services (Crash Stats (Socorro), Mozilla Symbols Server (Tecken), and Mozilla Symbolication Service (Eliot)).
In 2024, we'll be working with SRE to take over more of the monitoring services they currently support, such as New Relic and InfluxDB/Grafana.
This newsletter gives an overview of 2024q1. Please forward it to interested readers.
🤹 Observability Services: Change in user support
🏆 Sentry: Change in ownership
‼️ Sentry: Please don't start new trials
⏲️ Sentry: Cron monitoring trial ending April 30th
⏱️ Sentry: Performance monitoring pilot
🤖 Socorro: Improvements to Fenix support
🐛 Socorro: Support guard page access information
See details below.
None this quarter.
We overhauled our pages in Confluence, started an #obs-help Slack channel, created a new Jira OBSHELP project, built out a support rotation, and leveled up our ability to do support for Observability-owned services.
See our User Support Confluence page for:
where to get user support
documentation for common tasks (get protected data access, create a Sentry team, etc.)
self-serve instructions
Hop in #obs-help in Slack to ask for service support, help with monitoring problems, and advice.
The Observability team now owns Sentry!
We successfully completed Phase 1 of the transition in Q1. If you're a member of the Mozilla Sentry organization, you should have received a separate email about this to the sentry...@mozilla.com Google group.
We've overhauled Sentry user support documentation to improve it in a few ways:
easier to find "how to" articles for common tasks
best practices to help you set up and configure Sentry for your project needs
Check out our Sentry user guide.
There's still a lot that we're figuring out, so we appreciate your patience and cooperation.
Sentry sends marketing and promotional emails to Sentry users which often include links to start a new trial. Please contact us before starting any new feature trials in Sentry.
Starting new trials may prevent us from trialing those features in the future when we’re in a better position to evaluate the feature. There's no way for admins to prevent users from starting a trial.
The Cron Monitoring trial that was started a couple of months ago will end April 30th.
Based on feedback and other factors, we will not be enabling this feature once the trial ends.
This is a good reminder to build redundancy into your monitoring systems. Don't rely solely on trial or pilot features for mission-critical information!
Once the trial is over, we'll put together an evaluation summary.
Performance Monitoring is being piloted by a couple of teams; it is not currently available for general use.
In the meantime, if you are not one of these pilot teams, please do not use Performance Monitoring. There is a shared transaction event quota for the entire Mozilla Sentry organization. Once we hit that quota, further transaction events are dropped.
If you have questions about any of this, please reach out.
Once the pilot is over, we'll put together an evaluation summary.
We worked on improvements to crash ingestion and the Crash Stats site for the Fenix project:
Previously, the platform for Fenix crash reports would be "Unknown". Now it is "Android". Further, the platform_pretty_version includes the Android ABI version.
1819628: reject crash reports for unsupported Fenix forks
Forks of Fenix outside of our control periodically send large swaths of crash reports to Socorro. When these sudden spikes happened, Mozillians would spend time looking into them only to discover they're not related to our code or our users. This is a waste of our time and resources.
We implemented support for the Android_PackageName crash annotation and added a throttle rule to the collector to drop crash reports from any non-Mozilla releases of Fenix.
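A throttle rule of this kind can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the function name, the ACCEPT/REJECT constants, and the package allowlist are hypothetical, and accepting reports that lack the annotation is an assumption made here for safety.

```python
# Hypothetical allowlist for illustration only. The real list of supported
# package names lives in the collector's configuration and may differ.
SUPPORTED_FENIX_PACKAGES = frozenset(
    {
        "org.mozilla.firefox",
        "org.mozilla.firefox_beta",
        "org.mozilla.fenix",
    }
)

ACCEPT = "accept"
REJECT = "reject"


def throttle_fenix_fork(annotations):
    """Return REJECT for Fenix crash reports from unsupported packages.

    `annotations` is a dict of crash annotations as submitted by the client.
    Reports from other products, and Fenix reports that carry no
    Android_PackageName annotation, are accepted (an assumption of this
    sketch), so the rule only drops reports it can positively identify
    as coming from a fork.
    """
    if annotations.get("ProductName") != "Fenix":
        return ACCEPT
    package_name = annotations.get("Android_PackageName")
    if package_name is None:
        return ACCEPT
    if package_name in SUPPORTED_FENIX_PACKAGES:
        return ACCEPT
    return REJECT
```

Rejecting at the collector means a fork's crash reports never reach processing, so they cannot show up in Top Crashers.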
From 2024-01-18 to 2024-03-31, Socorro accepted 2,072,785 Fenix crash reports for processing and rejected 37,483 unhelpful crash reports with this new rule. That's roughly 1.8%. That's not a huge amount, but because they sometimes come in bursts with the same signature, they show up in Top Crashers, wasting investigation time.
1884041: fix create-a-bug links to work with java_exception
A long time ago, in an age partially forgotten, Fenix would report a crash in Java code by sending a crash report with a JavaStackTrace crash annotation. This crash annotation was a string representation of the Java exception. As such, it was difficult-to-impossible to parse reliably.
In 2020, Roger Yang and Will Kahn-Greene spec'd out a new JavaException crash annotation. The value is a JSON-encoded structure mirroring what Sentry uses for exception information. This structure provides more information than the JavaStackTrace crash annotation did and is much easier to work with because we don't have to parse it first.
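To illustrate why a JSON-encoded value is easier to work with, here is a small sketch that decodes such an annotation and pulls out its frames. The field names follow Sentry's exception interface and are an assumption about the JavaException schema. The sample values are made up, and java_exception_frames is a hypothetical helper, not a Socorro function.

```python
import json

# Made-up annotation value. The layout (exception -> values -> stacktrace ->
# frames) is assumed from Sentry's exception interface.
JAVA_EXCEPTION = json.dumps(
    {
        "exception": {
            "values": [
                {
                    "type": "SQLiteException",
                    "module": "android.database.sqlite",
                    "value": "example message",
                    "stacktrace": {
                        "frames": [
                            {
                                "module": "android.database.sqlite.SQLiteConnection",
                                "function": "open",
                                "filename": "SQLiteConnection.java",
                                "lineno": 218,
                                "in_app": False,
                            }
                        ]
                    },
                }
            ]
        }
    }
)


def java_exception_frames(raw_value):
    """Return the frames of the first exception, or [] if the value is malformed."""
    try:
        data = json.loads(raw_value)
        values = data["exception"]["values"]
        return list(values[0]["stacktrace"]["frames"])
    except (ValueError, KeyError, IndexError, TypeError):
        # json.JSONDecodeError is a ValueError; the rest cover missing or
        # wrongly typed parts of the structure.
        return []
```

One json.loads call yields structured frames, whereas the older string form would need custom parsing for every exception format.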
Between 2020 and now, we have been transitioning from crash reports that only contained a JavaStackTrace to crash reports that contained both a JavaStackTrace and a JavaException. Once all Fenix crash reports from crashes in Java code contained a JavaException, we could transition Socorro code to use the JavaException value for Crash Stats views, signature generation, generate-create-bug-url, and other things.
Recently, Fenix dropped the JavaStackTrace crash annotation. However, we hadn't yet gotten to updating Socorro code to use--and prefer--the JavaException values. This broke the ability to generate a bug for a Fenix crash with the needed data added to the bug description. Work on bug 1884041 fixed that.
Comments for Fenix Java crash reports went from:
Crash report: https://crash-stats.mozilla.org/report/index/eb6f852b-4656-4cf5-8350-fd91a0240408
to:
Crash report: https://crash-stats.mozilla.org/report/index/eb6f852b-4656-4cf5-8350-fd91a0240408
Top 10 frames:
0 android.database.sqlite.SQLiteConnection nativePrepareStatement SQLiteConnection.java:-2
1 android.database.sqlite.SQLiteConnection acquirePreparedStatement SQLiteConnection.java:939
2 android.database.sqlite.SQLiteConnection executeForString SQLiteConnection.java:684
3 android.database.sqlite.SQLiteConnection setJournalMode SQLiteConnection.java:369
4 android.database.sqlite.SQLiteConnection setWalModeFromConfiguration SQLiteConnection.java:299
5 android.database.sqlite.SQLiteConnection open SQLiteConnection.java:218
6 android.database.sqlite.SQLiteConnection open SQLiteConnection.java:196
7 android.database.sqlite.SQLiteConnectionPool openConnectionLocked SQLiteConnectionPool.java:503
8 android.database.sqlite.SQLiteConnectionPool open SQLiteConnectionPool.java:204
9 android.database.sqlite.SQLiteConnectionPool open SQLiteConnectionPool.java:196
This both fixes the bug and vastly improves the bug comments over what we were previously doing with JavaStackTrace.
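The "Top 10 frames" listing above could be produced from structured frames along these lines. This is a sketch under assumptions: format_top_frames is a hypothetical name, the frame keys (module, function, filename, lineno) are assumed from Sentry's frame format, and the frames are assumed to arrive with the crashing frame first.

```python
def format_top_frames(frames, limit=10):
    """Render frames as 'index module function filename:lineno' lines.

    `frames` is a list of dicts; missing keys are rendered as '?'.
    Only the first `limit` frames are included.
    """
    lines = []
    for index, frame in enumerate(frames[:limit]):
        location = f"{frame.get('filename', '?')}:{frame.get('lineno', '?')}"
        lines.append(
            f"{index} {frame.get('module', '?')} {frame.get('function', '?')} {location}"
        )
    return "\n".join(lines)
```

Capping the listing at ten frames keeps the bug comment short while still showing where the crash happened.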
Between 2024-03-31 and 2024-04-06, there were 158,729 Fenix crash reports processed. Of those, 15,556 match the circumstances affected by this bug: they have a JavaException but no JavaStackTrace. That's roughly 10% of incoming Fenix crash reports.
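As a quick check of that share, using only the two counts quoted above (the variable names are just for illustration):

```python
processed = 158_729  # Fenix crash reports processed, 2024-03-31 to 2024-04-06
affected = 15_556  # reports with a JavaException but no JavaStackTrace

share = affected / processed
print(f"{share:.1%}")  # prints 9.8%
```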
While working on this, we refactored the code that generates these crash report bugs, so it's in a separate module that's easier to copy and use in external systems in case others want to generate bug comments from processed crash data.
Further, we changed the code so that instead of dropping arguments in function signatures, it now truncates them at 80 characters.
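One reading of that truncation behavior is sketched below. The helper name and the trailing "..." marker are assumptions, as is applying the 80-character limit to the whole frame signature string. The actual code may cut differently.

```python
def truncate(text, max_length=80):
    """Shorten text to at most max_length characters, marking the cut with '...'."""
    if len(text) <= max_length:
        return text
    return text[: max_length - 3] + "..."
```

Truncating rather than dropping arguments keeps enough of a long signature to tell overloaded functions apart.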
We're hoping to improve signature generation for Java crashes using JavaException values in 2024q2. That work is tracked in bug #1541120.
1830954: Expose crashes which were likely accessing a guard page
We updated the stackwalker to pick up the changes for determining is_likely_guard_page. Then we exposed that in crash reports in the has_guard_page_access field. We added this field to the Details tab in crash reports and made it searchable. We also added this to the signature report.
This helps us know whether a crash may be due to a memory access bug that could be a security vulnerability vector--something we want to prioritize fixing.
Since this field is security sensitive, viewing or searching on it requires protected data access.
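The effect of that restriction can be pictured with a small sketch. It is purely illustrative and not how Crash Stats implements permissions: PROTECTED_FIELDS and visible_fields are hypothetical names, and the set holds only the one field discussed here.

```python
# Illustrative subset: only the field discussed in this section.
PROTECTED_FIELDS = frozenset({"has_guard_page_access"})


def visible_fields(processed_crash, has_protected_access):
    """Return the crash report fields a user is allowed to see.

    Users with protected data access get a copy of everything; others get
    a copy with the protected fields removed.
    """
    if has_protected_access:
        return dict(processed_crash)
    return {
        name: value
        for name, value in processed_crash.items()
        if name not in PROTECTED_FIELDS
    }
```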
4 signature generation changes. Thank you Andrew McCreight and Jim Blandy!
Maintenance and documentation improvements.
6 production deploys. Created 71 issues. Resolved 61 issues.
Maintenance and documentation improvements.
5 production deploys. Created 21 issues. Resolved 28 issues.
Find us:
Confluence page: Observability Team
User support hub: User Support
Support: #obs-help (Slack)
Crash ingestion: #crashreporting (Matrix)
Thank you for reading!