tl;dr: the way https://hg.mozilla.org/ exposes data after a push has
changed, and this may impact service consumers. This should be a positive
change for most. Read on for details.
5+ years ago, requests hitting https://hg.mozilla.org/ were served off an
NFS volume. As soon as a push was made to hg.mozilla.org, it was available
via the HTTP endpoint.
For various reasons, we changed this several years ago so that pushes to
hg.mozilla.org would asynchronously replicate independently to local
storage on N servers behind a load balancer fronting
https://hg.mozilla.org.
Because HTTP requests are round-robined to N servers and the local storage
was independent, there were race conditions after a push where the state of
a repository could vary between servers. For example, server A would have
the new push and server B would not. This could result in consumers seeing
inconsistent repository state at any given instant in time.
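To make the race concrete, here is a toy model in Python. It is only a sketch: the Mirror class, the delays, and the revision names are all made up for illustration and have nothing to do with the real server code. Each mirror exposes the push as soon as its own replication finishes, and requests alternate between mirrors.

```python
from itertools import cycle


class Mirror:
    """Hypothetical stand-in for one HTTP server with its own local storage."""

    def __init__(self, name, replication_delay):
        self.name = name
        # Seconds after the push at which this mirror has the new data.
        self.replication_delay = replication_delay

    def visible_head(self, now, push_time, old_head, new_head):
        # Old behavior: expose the push as soon as *this* mirror has it.
        if now >= push_time + self.replication_delay:
            return new_head
        return old_head


def observe(mirrors, times, push_time=0.0, old_head="rev1", new_head="rev2"):
    """Send one request per timestamp, round-robin across the mirrors."""
    rr = cycle(mirrors)
    return [next(rr).visible_head(t, push_time, old_head, new_head) for t in times]


# Mirror A replicates in 100ms, mirror B in 400ms (made-up numbers).
mirrors = [Mirror("A", 0.1), Mirror("B", 0.4)]
seen = observe(mirrors, [0.2, 0.3, 0.5])
print(seen)  # ['rev2', 'rev1', 'rev2']
```

The consumer sees the new push, then the old state, then the new push again. That flip-flop is the inconsistency window.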
In reality, this wasn't a major issue because the servers were all
homogeneous and operating in a well-defined environment. So the
"inconsistency window" after a push was small - typically no more than a
few hundred milliseconds. Short enough that you probably wouldn't ever
notice.
But the migration of hg.mozilla.org to a different data center required us
to make servers non-homogeneous for a period of time. This widened the
"inconsistency window" to as much as several seconds and caused disruptive
intermittent failures in CI (bug 1462323). And upcoming plans to host
Mercurial endpoints in AWS would suffer the same fate (since performance in
EC2 is highly varied).
With the help of glob, sheehan, and fubar, we've rolled out a change (bug
1470606) that should practically eliminate the inconsistency window after
pushes and make https://hg.mozilla.org/ expose atomic state at any given
instant in time.
Essentially, instead of an individual server exposing data once it has been
replicated, we wait until all servers behind the load balancer have
replicated the data. At that point, the new data is exposed by all servers.
i.e. new pushes are exposed once the slowest server has replicated them.
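The rule can be sketched in a few lines of Python. This is a toy model under made-up assumptions (the function name, delays, and revision names are hypothetical), not the actual implementation: a push becomes visible everywhere at the moment the slowest server has finished replicating it.

```python
def exposed_head(now, push_time, replication_delays, old_head, new_head):
    """Return the head every server exposes at time `now`.

    New behavior (toy model): nothing is exposed until the slowest
    server has replicated, so the answer does not depend on which
    server handles the request.
    """
    ready_at = push_time + max(replication_delays)
    return new_head if now >= ready_at else old_head


# Two servers replicating in 100ms and 400ms (made-up numbers).
delays = [0.1, 0.4]
heads = [exposed_head(t, 0.0, delays, "rev1", "rev2") for t in (0.2, 0.3, 0.5)]
print(heads)  # ['rev1', 'rev1', 'rev2']
```

Consumers may wait slightly longer to see a push, but once they see it, they never see the old state again.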
I know many consumers of hg.mozilla.org have been affected by the
inconsistency window in the past. People have had to add things like sleeps
and excessive retries to work around the issue. And developers and sheriffs
have been annoyed by failures that fall through the cracks (especially
those in the past few weeks). Hopefully now that the inconsistency window
is practically non-existent, these workarounds can be removed and we can
all enjoy a more reliable service.
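For reference, the kind of workaround I have in mind looks roughly like the Python below. This is a generic sketch, not code from any particular consumer: fetch_with_retries and its parameters are hypothetical, and fetch stands in for whatever request your tool makes for a just-pushed changeset.

```python
import time


def fetch_with_retries(fetch, attempts=5, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    `fetch` is a hypothetical callable that returns None while the
    changeset is not yet visible on the server that answered.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay)  # wait for the lagging server to catch up
    raise LookupError("changeset not visible after %d attempts" % attempts)


# Simulate a server that only has the changeset on the third request.
responses = iter([None, None, "rev2"])
sleeps = []
result = fetch_with_retries(lambda: next(responses), sleep=sleeps.append)
print(result, sleeps)  # rev2 [1.0, 1.0]
```

If your tooling has a loop like this solely to paper over the inconsistency window, it should be a candidate for removal. Retries for ordinary network errors are a separate matter.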
We're still tracking down some minor fallout from the change. But for the
most part it appears things "just work." If you see anything wonky or want
to track work as a result of this change, please chain things up to bug
1470606.
I'd like to thank glob, sheehan, and fubar for helping with the design,
implementation, and rollout of this significant change.
If you are interested in learning more about how the replication works, it
is described at
https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/replication.html#architecture.
Gregory