Welcome to the April and May edition of the Engineering Effectiveness Newsletter! The Engineering Effectiveness org makes it easy to develop, test and release Mozilla software at scale. See below for some highlights, then read on for more detailed info!
The Select Translations MVP has landed in Nightly, and the feature is scheduled to ride the trains to ship in Firefox 128. This allows users to translate selected text via the context menu.
A bug in Identical Code Folding detection was fixed for Firefox Desktop and Android builds. This leads to a 20MB reduction on Firefox Desktop build size and a 2MB reduction on Android!
We published Linux ARM64 Nightlies, which have seen a steady increase in DAU/MAU since launch. The deb package already represents 45% of ARM64 MAU.
Mozillians across many teams (both within EE and without) successfully rotated the Certification Authority we use to sign Firefox plugins and addons! This prevented a third “Armag-addon” (only this one would have been much worse).
We kicked off our first big parallel translations training run! This follows a long effort to stabilize the incredibly complex pipeline such that it can run hundreds of training tasks in parallel.
We added support for running tests on try matching tags in the manifest. Now you can do ./mach try fuzzy --tag <tag> and only tests annotated with that tag will be selected (WPT and Reftest based suites are not yet supported).
Benjamin Mah built a new ML model to classify Fenix bugs into suitable components. This model will work in coordination with the general component model to improve bug classification.
Benjamin Mah implemented an enhancement for BugBot’s triage rotations feature to notify involved triage owners when performing a rotation.
Benjamin Mah implemented an improvement for BugBot to automatically clear its needinfo requests when closing variant expiration bugs.
A new component was created for Release Engineering's packaging.
Serge Guelton fixed a bug in Identical Code Folding (ICF) detection for Firefox desktop and Android builds. This leads to a 20MB reduction on Firefox desktop build size and a 2MB reduction on Android!
Serge Guelton reduced the execution time of mach configure + mach export by ~25%, mostly through parallelisation of various operations.
Alphare implemented batch Taskcluster APIs and updated Taskgraph to use them, resulting in a 20% performance improvement in Gecko Decision tasks.
Ben Hearsum discovered and debugged problems with our GCP spot termination logic. Tasks can now upload artifacts even after being terminated, opening the door for long running tasks to “pick up where they left off”.
Ben Hearsum helped enable GCP resource usage collection on our workers, giving us much more detailed insight into how workers are (or aren’t) being utilized.
Andrew Halberstadt upgraded Gecko to Taskgraph 7.x, which contains several incompatible changes, including a simpler and more intuitive /taskcluster layout.
Sebastian Hengst + contractors from Teklia migrated Treeherder’s database from MySQL to PostgreSQL. This unblocks further updates to modern versions of tooling like Django and better analytics of CI data.
Eva Bardou upgraded Treeherder’s frontend to Django 4.2.
Sebastian Hengst enabled local development of Treeherder with a remote PostgreSQL database instance behind Google Cloud SQL Proxy.
Joel Maher added support for running tests on try matching tags in the manifest. Now you can do ./mach try fuzzy --tag <tag> and only tests annotated with that tag will be selected. This does not yet work for web-platform-tests, reftest or crashtest.
Joel Maher fixed a couple of Reftest issues, ensuring that Linux tests have a valid theme, and Windows GPUs have the correct device driver. These fixes should help reduce intermittents.
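For readers curious how the tag-based selection mentioned above behaves conceptually, here is a minimal sketch. It is not the actual mach or manifestparser implementation: the `select_by_tag` helper, the test paths and the space-separated `tags` field are all assumptions made for illustration.

```python
# Hypothetical illustration of tag-based test selection.
# The data layout (a "tags" string of space-separated labels) and the
# helper name are assumptions, not the real mach try internals.

def select_by_tag(tests, tag):
    """Return the paths of tests whose 'tags' value contains the given tag."""
    return [
        test["path"]
        for test in tests
        # Tests without a "tags" key are treated as having no tags.
        if tag in test.get("tags", "").split()
    ]


# Made-up manifest entries for demonstration purposes only.
tests = [
    {"path": "dom/test_a.html", "tags": "webrtc media"},
    {"path": "dom/test_b.html", "tags": "media"},
    {"path": "dom/test_c.html"},
]

if __name__ == "__main__":
    print(select_by_tag(tests, "webrtc"))
```

Only tests annotated with the requested tag are kept; untagged tests are never selected, which matches the behaviour described for `./mach try fuzzy --tag <tag>`.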
Suhaib Mujahid and Marco Castelluccio published a paper titled “Predicting the Impact of Crashes Across Release Channels” at the MSR conference. The paper was published in collaboration with Diego Elias Costa from Concordia University.
Many issues were fixed in the new crash reporter client, including: improved localization, better Thunderbird support and superior backwards compatibility with the old client.
Gabriele Svelto ensured crash reports intercepted by the Windows Error Reporting runtime exception module now always contain an install time.
The Linux symbol scrapers have been expanded to cover more packages and not fail when presented with huge amounts of debug information in a single pass.
Crash Pings submitted over Glean on desktop now contain the full telemetry environment and the crash stack.
Marco Castelluccio, Christian Holler and Jason Kratzer published a study about code coverage gaps and automatic generation of tests, titled “Mind the Gap: What Working With Developers on Fuzz Tests Taught Us About Coverage Gaps”. This study was published at the ICSE conference in collaboration with Carolin Brandt and Andy Zaidman from Delft University of Technology and with Alberto Bacchelli from the University of Zurich.
QA has begun testing our integration of the DLP (data loss prevention) SDK support in Nightly. This is an enterprise feature allowing data loss prevention vendors such as Broadcom and Trellix to integrate with Firefox in a more reliable and stable manner.
Calixte replaced the jpeg2000 decoder with the OpenJPEG one using WASM. This fixes various rendering issues and improves the overall performance of jpeg2000 decoding.
Nicolò Ribaudo implemented fixes for text selection flickering on touch screen devices.
Aditi fixed a discrepancy between the lang tag of the PDF viewer and of the canvas, which led to misaligned text selection.
We have started experimenting with alt text generation using local AI models in the feature to add images within PDFs.
We kicked off our first big parallel training run! This follows a long effort to stabilize the incredibly complex pipeline such that it can run hundreds of training tasks in parallel.
Greg Tatum created a dashboard that shows the current training run’s progress. Updates are also manually tracked in a spreadsheet (which also contains a link to the most recent dashboard).
We will train the first half of the model pipeline (up until a single teacher training) and look at the initial evaluation results. If the models are good enough to continue, we'll trigger the rest of the training to produce the final production-ready models.
The first wave will be the models going into English, because there is a lot of English monolingual data available. After the first wave, we'll continue with a second wave going from English. We can bootstrap this second wave with the xx-en models trained in the first wave.
A full training run for a single language direction takes about 3-4 weeks. The first stage, where we are stopping for now, takes about 1 week. Both figures depend on data size and will vary.
Evgeny Pavlov has been leading much of the work on developing our training recipe and coordinating with Teklia contractors to set up our experiment tracking integration with Weights and Biases.
Ben Hearsum has done significant work to ensure that we can train new language pairs on preemptible GCP instances, which will greatly lower the financial cost of training them.
Erik Nordin has nearly completed the implementation of the Select Translations MVP, and the feature is scheduled to ride the trains to ship in Firefox 128.
Connor Sheehan completed a migration of the Treestatus tool from a standalone service owned by RelEng into a feature of Lando. The new Lando Treestatus has a proper test suite and the UI is implemented in technologies familiar to our engineering teams.
Connor Sheehan implemented several of the hook checks on hg.mozilla.org as checks within Lando, which is required for the hg->git migration.
Connor Sheehan added support for the cypress project branch to Lando/Phabricator.
Release Management shipped two new Firefox releases and a number of follow-up dot releases to address quality issues found post-release.
Gabriel Bustamante published a Linux ARM64 Nightly. Since then, DAU and MAU have increased linearly, with no sign of slowing. Interesting fact: the .deb package represents 45% of ARM64 MAU.
Julien Cristau, Ben Hearsum, and members of the Desktop Integration team averted a problem that would have left Windows users unable to update or reinstall Firefox. The cause was a certificate expiring in mid-June whose replacement came with new constraints.
Ben Hearsum and several Mozillians from many teams successfully rotated the Certification Authority we use to sign Firefox plugins and addons. This prevented another “Armag-addon” like the one that occurred in May 2019.
Heitor Neiva optimized how macOS builds are notarized. The average task duration went from ~30 minutes down to ~12 minutes. Overall, notarization represented 860+ hours of compute in April, and we are now down to ~450 hours.
Andrew Halberstadt converted Firefox iOS to use the new Bitrise scriptworker. This allows Firefox iOS to securely trigger Bitrise workflows from Taskcluster, allowing these workflows to plug into standard Taskcluster release pipelines.
Geoff Brown, Julien Cristau, Johan Lorenzo and members of the Android teams wrapped up the Android repository migration. Now all Fenix and Focus releases happen on hg.mozilla.org.
Julien Cristau updated beetmover to stop archiving test packages, saving storage costs.
Sylvestre audited the artifacts stored on archive.m.o to remove a lot of old and unused files.
Ben Hearsum has been working on migrating l10n strings to GitHub with the l10n team. We expect to cut over to the new repository in early June.
Ryan VanderMeulen has been working with Mike Kaply and Release Engineering to mitigate slow Google Play review times impacting our ability to ship timely Android releases.
In response to a recent incident, Donal Meehan created a Release Delay Runbook documenting the steps to take when a release needs to be delayed.
Multiple tooling improvements were made for the release management team (on https://whattrainisitnow.com/, https://trainqueries.herokuapp.com/, https://bugimpact.herokuapp.com).
Van Le, Greg Cox and Connor Sheehan worked to increase the amount of RAM on hg.mozilla.org, eliminating many OOM issues and making the service more stable.
Connor Sheehan added a pushchangedfiles endpoint to hg.mozilla.org, which is a minimal and more performant version of the json-automationrelevance endpoint used by various tasks in CI, and Andrew Halberstadt updated CI to use it.
Sylvestre upgraded Sphinx to 7.2.6 and all other dependencies for https://firefox-source-docs.mozilla.org.
Thanks for reading and see you next time!