Engineering Effectiveness Newsletter (February/March 2024 Edition)

Connor Sheehan

Apr 9, 2024, 2:22:40 PM

to dev-pl...@mozilla.org

Welcome to the February/March edition of the Engineering Effectiveness Newsletter! The Engineering Effectiveness org makes it easy to develop, test and release Mozilla software at scale. See below for some highlights, then read on for more detailed info!

Highlights

Firefox Android has been merged into mozilla-central. Fenix, Focus, and android-components are now built and tested in sync with Gecko and GeckoView, and Android developers can use the same tools and workflows used for desktop.
Mach is now mostly compatible with Python 3.12.
Caching, redirects and filesystem changes were made to improve the performance and stability of hg.mozilla.org.
Release SREs reduced the amount of time it takes an Azure worker to claim a task by nearly 15 minutes, resulting in reduced wait times for Windows test jobs.

Contributors

Detailed Project Updates

Bugzilla and Bugbug

Suhaib Mujahid dropped a BugBot rule that was requesting needinfo from the authors of bugs with STR about a missing regression range.
BugBug and BugBot are now using Ruff instead of Flake8, Black and isort (thanks to Juliet Owah and Baasanbayar).
Promise Fru has integrated the performance bugs model into BugBot, which will request evaluation from the performance team for new potential performance-related bugs by setting the “Performance Impact” field to “?”.
John Pangas has integrated the accessibility bugs model into BugBot, which will add the “access” keyword for new potential accessibility-related bugs.
Bugzilla has an updated attachment details page implemented by bug 1157227.
Certain Bugzilla pages are also much more usable on mobile devices after bug 1878813.

Build System and Mach Environment

Most linker checks have been moved from legacy autoconf-based configuration to the Python configuration step.
Mach is now mostly compatible with Python 3.12.

CI and Treeherder

Heitor Neiva and Andrew Halberstadt baked Taskgraph into the decision images, so it’s now possible to pin to a Taskgraph version via docker tag.
Andrew Halberstadt significantly reduced the number of queries CI makes to hgmo’s json-automationrelevance endpoint to help reduce load.
Joel Maher has been working to reduce CI costs by reducing test load by 16%, halving Android emulator costs by using smaller instances, and removing the extra disks on test instances to cut SSD costs by 25%.
Tom Marble has added support for `./mach manifest skip-fails`.

Crash Management

The crash reporter client has been fully rewritten in Rust.
Crash annotations are now using the new pull-based system which significantly reduces their runtime cost.
Symbol files now contain the name, version and build ID of the product they were generated for.
Linux crash reports now contain the library versions as specified in the library file name.
The Windows Error Reporting runtime exception module is significantly more robust.
Crashes on macOS cannot lead to hangs anymore.
A new version of rust-minidump was released with significant improvements when dealing with native debug information during unwinding.

Lint, Static Analysis and Code Coverage

The Python codebase has been cleaned of invalid escape sequence warnings, and a Ruff linter warning has been turned on to prevent regressions.
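
For readers unfamiliar with this class of warning, here is a minimal sketch of what an invalid escape sequence looks like and how a raw string avoids it. The helper name `escape_warnings` and the sample snippets are hypothetical and not taken from the Mozilla tree. The warning category depends on the Python version (DeprecationWarning before 3.12, SyntaxWarning from 3.12 on), and the Ruff rule involved is presumably `W605` (invalid-escape-sequence), which is an assumption because the newsletter does not name it.

```python
import warnings


def escape_warnings(source: str) -> list[str]:
    """Compile `source` and return the names of any warning categories emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [w.category.__name__ for w in caught]


# Hypothetical snippet: "\d" is not a recognized escape in a normal string literal.
BAD = "pattern = '\\d+'"
# The same pattern as a raw string keeps the backslash literally, so nothing is flagged.
GOOD = "pattern = r'\\d+'"

bad_result = escape_warnings(BAD)
good_result = escape_warnings(GOOD)
print(bad_result, good_result)
```

Switching regular expressions and Windows-style paths to raw strings is usually the least invasive fix, since it leaves the runtime value of the string unchanged.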

OS Integration and Security

On Windows, the content process now runs at untrusted integrity in Nightly.

PDF.js

Calixte Denizet landed many improvements and fixes for the highlighting feature, in preparation for its release in Firefox 125.
Marco Castelluccio has been collecting a list of websites and third-party software relying on PDF.js. Please reach out if you know any additional users.

Firefox Translations

Thanks to Serge Guelton, the translation engine can take advantage of AVX and AVX512 VNNI extension instructions for faster translation on Intel architectures.
We shipped our first big update of translation models. These models have mostly been in Nightly since our original launch, but we came up with an evaluation metric using COMET scores to decide if they were good enough to ship.
- el -> en (Greek)
- en -> et (Estonian, from-English)
- et -> en (Estonian, to-English)
- hu -> en (Hungarian)
- fi -> en (Finnish)
- ru -> en (Russian)
- sl -> en (Slovenian)
- tr -> en (Turkish)
- uk -> en (Ukrainian)
- See the full list of released models here: https://gregtatum.github.io/taskcluster-tools/models.html
Greg Tatum trained his first full language model using the pipeline for training English to Catalan. It is in Nightly and is meeting requirements to ship to all release channels. English to Czech is currently training with good initial stats.
Evgeny Pavlov has been working on translation robustness, focusing on the Russian model, to make our translations more consistent and reliable.
Thanks to Erik Nordin, the initial Select Translation UI is about to land in Nightly, and can be enabled by setting the pref browser.translations.select.enable to true. See Bug 1870316.
Erik Nordin has been analyzing our language release with great telemetry. See his post on Slack for some interesting insights: https://mozilla.slack.com/archives/C04DGR18D0F/p1709330531349729
As a team, we're preparing to kick off our first big parallel training run, with the goal of training and shipping the following languages. It takes two models to ship a full language:
- Russian (en->ru)
- Indonesian
- Czech (en->cs)
- Hungarian (en->hu)
- Turkish (en->tr)
- Greek (en->el)
- Finnish (en->fi)
- Swedish
- Romanian
- Danish
- Bosnian
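
To illustrate the "two models per full language" point, here is a small sketch that checks which languages have both a to-English and a from-English model. It uses only the model pairs listed in the update above, not the full set of released models, and the function name `fully_covered` is hypothetical.

```python
# Model pairs copied from the update list in this newsletter (source, target).
UPDATED_MODELS = [
    ("el", "en"), ("en", "et"), ("et", "en"), ("hu", "en"), ("fi", "en"),
    ("ru", "en"), ("sl", "en"), ("tr", "en"), ("uk", "en"),
]


def fully_covered(models):
    """Return the language codes that have models in both directions with English."""
    pairs = set(models)
    langs = {code for pair in pairs for code in pair if code != "en"}
    return sorted(
        lang for lang in langs
        if (lang, "en") in pairs and ("en", lang) in pairs
    )


full = fully_covered(UPDATED_MODELS)
print(full)
```

Among the pairs listed in this update, only Estonian has both directions, which is why several entries in the training plan are from-English models for languages that so far appear here only in the to-English direction.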

Phabricator, moz-phab, and Lando

Connor Sheehan added a blocking check to Lando to ensure that revisions with the needs-data-classification tag cannot land.
Connor Sheehan fixed Lando’s automated code formatting by running a specific subset of linters, avoiding an issue with the WPT linter in Lando.
Connor Sheehan added an environment variable to moz-phab to disable the progress spinner, allowing workarounds when the spinner causes moz-phab to hang.

Release Engineering and Management

After almost a year of preparations, we merged firefox-android to mozilla-central on March 18th. Fenix, Focus, and android-components are now built and tested in sync with Gecko and GeckoView, and Android developers can use the same tools and workflows used for desktop.
- Julien Cristau and Geoff Brown participated in this work from release engineering, working with Titouan Thibaud and Gabriel Luong from the Android team, with support from countless others.
- This project culminated in two pushes to mozilla-central, and will ride the trains with Firefox 126.
RelEng had a meet-up in early February to address several growing pain points:
- Julien Cristau and Gabriel Bustamante fixed the action tasks on GitHub PRs. This enables developers, sheriffs and release managers to retrigger and backfill tasks on PRs.
- Andrew Halberstadt and Heitor Neiva automated the publication of taskgraph and decision images, something RelEng has done manually every week for 5 years.
- Ben Hearsum and Johan Lorenzo reduced the number of steps needed to add a new product (like Firefox for iOS) to ShipIt, RelMan’s one-stop shop for kicking off releases.
Geoff Brown quickly enabled nightly updates to get #sidebar-foxfooding started. It was done on really short notice (Bug 1877483).
Gabriel Bustamante respun an off-cycle Firefox partner-repack. Such requests are rare enough nowadays to be highlighted (Bug 1881085).
Andrew Halberstadt implemented a new scriptworker-script for interfacing with Bitrise. It will allow securely triggering Bitrise workflows from Taskcluster.
Heitor Neiva implemented a tool to periodically check Mozilla’s account in AppStoreConnect and report on any user changes. This ensures proper protocol is followed when adding new users to the platform and helps prevent rogue actors on our Apple account. The results are being posted in #infosec-releng-alerts.
RelEng-adjacent: our Release SREs reduced the amount of time it takes an Azure worker to claim a task by nearly 15 minutes, resulting in reduced wait times for Windows test jobs.
Julien Cristau helped the Desktop Integration team fix macOS updates on Nightly by identifying the root cause and proposing the optimal fix (bug 1882322).

Version Control

Connor Sheehan added clonebundles to northamerica-northeast1, so CI jobs in that region will clone much faster.
Greg Cox and Connor Sheehan made some configuration changes to the NFS mount on the hg push server that provide a moderate performance boost to pushes.
Connor Sheehan added a redirect for an in-tree file that is accessed by a popular third-party testing package, mitigating load issues when external CI farms used the package.
Greg Cox and Connor Sheehan made configuration changes to the Zeus load balancer in front of hgweb to enable caching of raw-file requests, which will reduce the load on hg.mozilla.org from CI jobs which request many copies of the same file.
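
As a rough illustration of why caching helps here, the sketch below memoizes responses by request path so that repeated requests for the same file reach the origin only once. It is a toy model, not the Zeus configuration: `fetch_from_origin`, `cached_raw_file`, the cache size, and the example path are all hypothetical.

```python
from functools import lru_cache

# Counts how many requests actually reach the (simulated) origin server.
origin_hits = 0


def fetch_from_origin(path: str) -> bytes:
    """Stand-in for an expensive request served by hgweb itself."""
    global origin_hits
    origin_hits += 1
    return f"contents of {path}".encode()


@lru_cache(maxsize=1024)
def cached_raw_file(path: str) -> bytes:
    """Serve repeated requests for the same path from an in-memory cache."""
    return fetch_from_origin(path)


# Hypothetical path; 100 CI jobs asking for the same file.
for _ in range(100):
    cached_raw_file("/example-repo/raw-file/tip/README.txt")

print(origin_hits)
```

A real load-balancer cache also has to handle expiry and invalidation, which this sketch leaves out.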