Metamath mirrors: Current state & proposed plan


David A. Wheeler

Sep 13, 2022, 10:06:27 PM
to Metamath Mailing List
We currently have a few mirrors of the Metamath website (e.g., cn.metamath.org). Their purpose
is to provide access to Metamath information even if the "main" site us.metamath.org is down.

If you manage a Metamath mirror, please contact me! If you have an opinion about handling mirrors,
please let me know (or post here for discussion).

The main question is: Do we want to continue to support Metamath mirrors at all?
Mirrors made more sense when websites were less reliable and CDNs didn't exist.
If we stop supporting mirrors, that's one less complication.
The mirror in the Chinese mainland has the strongest case to continue, due to the
Great Firewall of China.

If we *do* want to support mirrors, we need to change things, as mirrors no longer work.
Historically these copied data from "rsync.metamath.org", which was really
Norm's "us2.metamath.org" site in his house. Unfortunately, us2.metamath.org
has stopped working & it's not clear if we can get it working again.
(We managed to get off us2.metamath.org just in time!!).
In addition, the old system used an unsecured rsync connection, so we need to change it anyway.
I want to lock it down to make it unlikely to be a serious vulnerability.

If we *stop* supporting mirrors that'd be one less complication & effort. But if we
want to continue supporting mirrors, then we need to make them actually work :-).

Below is the current state & my proposed plan *if* we want to continue to have mirrors.

--- David A. Wheeler

=============

First, the current state. The *working* metamath mirror sites are
(not counting us.metamath.org and us2.metamath.org):
- at.metamath.org Secondary mirror (Austria) [courtesy of Digital Solutions Marco Kriegner]
- cn.metamath.org Secondary mirror (China) [courtesy of caiyunapp.com]
- de.metamath.org Secondary mirror (Germany)

They use rsync to synchronize their data. Rsync is a great protocol when combined
with ssh, but using it *without* ssh over the public internet is a terrible idea.

If we continue to support mirrors, I propose configuring a special account for
synchronization. Mirrors would each log in using ssh and their specific private SSH key.
They'd log into a restricted account specific to that mirror
via ssh in a way that provides read-only access to only the
mirrored public files (the program "rrsync" can restrict what access is allowed to rsync).

Here is my proposed configuration:

* Every mirror would create a public/private keypair & send the public key to me.
* For each mirror we'd create a new account on us.metamath.org (e.g., "mirror.cn")
* That account's /home/mirror.cn/.ssh/authorized_keys would be modified to this:
command="/usr/bin/rrsync -ro /var/www/us.metamath.org/",restrict ssh-rsa {PUBLIC_KEY}
...this forces ssh logins to run the restricted rsync in read-only mode for JUST the mirror directory,
restricts (eliminates) all other ssh access, and uses ssh-rsa for login.
We could use a different key pair algorithm. Note that the PUBLIC_KEY will be public,
but that's fine!!
* Ensure nothing in its .ssh dir can be easily changed: chmod -R a-w /home/mirror.cn/.ssh
* Disable any other kind of login with: chsh mirror.cn -s /bin/nologin
* Note: The "rsync" daemon will *not* be enabled; the only way to access this is to
log in via ssh. It's not wise to run a bare rsync nowadays.
* The mirrors would periodically run an "rsync -e ssh -a mirr...@us.metamath.org:/var/www/us.metamath.org/ /var/www/cn.metamath.org/".
Rsync is clever about updates, so this would typically be extremely fast unless the internet link is bad.
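
As a rough sketch of the authorized_keys step above (untested against the real server, like the rest of this plan), a small helper could build the entry for one mirror. The function name is made up for illustration, and the key shown is a placeholder, not a real key:

```shell
#!/bin/sh
# Sketch only: build the authorized_keys entry for one mirror account.
# mirror_authorized_keys_entry is a hypothetical helper name; the served
# directory and the public key are supplied by the admin and the mirror.

# Usage: mirror_authorized_keys_entry SERVED_DIR PUBLIC_KEY_LINE
mirror_authorized_keys_entry() {
    served_dir=$1
    public_key=$2   # key type plus key material, e.g. "ssh-ed25519 AAAA... comment"
    # command=  forces every login with this key to run rrsync read-only (-ro)
    # restrict  turns off forwarding, PTY allocation and similar ssh features
    printf 'command="/usr/bin/rrsync -ro %s",restrict %s\n' "$served_dir" "$public_key"
}

# Example with a placeholder key (NOT a real key):
mirror_authorized_keys_entry /var/www/us.metamath.org/ "ssh-ed25519 PLACEHOLDER_KEY mirror.cn"
```

The printed line would go into the mirror account's authorized_keys file before the chmod step. If a newer key algorithm is preferred over ssh-rsa, a mirror could generate its keypair with something like "ssh-keygen -t ed25519" and send only the .pub file.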

This would make mirrors log in - mirrors can make us do "extra work" so we don't want just anyone
to be able to mirror. However, these steps will mean that the only thing the mirrors can do is make us
do extra work to serve public info... making the accounts not-very-interesting.

I haven't actually tried to *implement* this configuration, so there may need to be some tweaks & changes,
but that's the general idea.

Some info sources:
* https://serverfault.com/questions/965053/restricting-a-ssh-key-to-only-allow-rsync-file-transfer
* https://www.whatsdoom.com/posts/2017/11/07/restricting-rsync-access-with-ssh/
* http://gergap.de/restrict-ssh-to-rsync.html

Mingli Yuan

Sep 13, 2022, 10:27:05 PM
to meta...@googlegroups.com
Hi David,

I am the maintainer of cn.metamath.org. In my opinion, we should continue hosting mirrors.

I am from the machine-learning community, and I see growing interest in using Metamath as a language corpus.
Keeping the mirrors will help researchers, especially those in China, to access Metamath.
Mirrors can also spread out the visiting and download traffic.

I can check your configuration in 2-3 days and settle the implementation details with your help, if the community reaches a consensus on keeping the mirrors.

Mingli








heiph...@wilsonb.com

Sep 14, 2022, 12:05:36 AM
to meta...@googlegroups.com
Nice work modernizing and hardening the infra. These days what's considered the
"bare minimum" has a lot of moving pieces.

Anyway, please permit me to butt in with a small idea. The mirror setup you
propose has each mirror polling the source server for changes. What about a
push-centered architecture?

Since rsync is equally capable of pushing changes, I'm imagining a reversal of
roles in your ssh setup and having some post-update hook that rsyncs the
changes to each mirror.
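
For concreteness, a push-style hook of this kind might look roughly like the sketch below. The function name and the idea of passing mirror targets as arguments are assumptions for illustration only; nothing like this exists in the current setup, and each mirror would have to accept ssh logins from the main site for it to work:

```shell
#!/bin/sh
# Sketch only: push one directory to several mirrors and report failures.
# push_to_mirrors is a hypothetical name; RSYNC can be overridden for testing.

# Usage: push_to_mirrors SRC_DIR TARGET [TARGET...]
# Each TARGET is an rsync destination of the form user@host:/path/
push_to_mirrors() {
    src_dir=$1
    shift
    failed=0
    for target in "$@"; do
        # -a preserves permissions and times; -e ssh tunnels the transfer over ssh
        if ! ${RSYNC:-rsync} -a -e ssh "$src_dir" "$target"; then
            echo "push to $target failed" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every mirror accepted the push
    [ "$failed" -eq 0 ]
}
```

A failure on one mirror does not stop the pushes to the others; the hook's exit status only says whether all of them succeeded.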

Off the top of my head, pros:

- Tighter sync between mirrors and us.metamath.org
- Less network noise
- Slight reduction in attack surface area on us.metamath.org

cons:

- Error handling becomes responsibility of the post-update hook
- Might require updates to Firewall settings

Thoughts?


David A. Wheeler

Sep 14, 2022, 10:23:19 AM
to Metamath Mailing List


> On Sep 14, 2022, at 12:05 AM, heiphohmia via Metamath <meta...@googlegroups.com> wrote:
>
> Nice work modernizing and hardening the infra. These days what's considered the
> "bare minimum" has a lot of moving pieces.

It really isn't bad. There are more pieces, but the tools to manage it are better,
and many things "just work". Trivial example: historically you had to take extra
steps to install rrsync, now it's just another program you install & manage with the
package manager.


> Anyway, please permit me to butt in with a small idea. The mirror setup you
> propose has each mirror polling the source server for changes. What about a
> push-centered architecture?
>
> Since rsync is equally capable of pushing changes, I'm imagining a reversal of
> roles in your ssh setup and having some post-update hook that rsyncs the
> changes to each mirror.

Like everything, there are pros & cons :-).


> Off the top of my head, pros:
>
> - Tighter sync between mirrors and us.metamath.org

I don't think that's critical. We update 1/day, and it's not a crisis if
the mirror update is delayed. Also, rsync is *really* fast at determining
"there is nothing to do".

> - Less network noise
> - Slight reduction in attack surface area on us.metamath.org
>
> cons:
>
> - Error handling becomes responsibility of the post-update hook

We really *can't* handle such errors. If the other side is inaccessible, or
can't write an update, there's little we can do about it.

> - Might require updates to Firewall settings

Not really, and we control our firewall settings anyway.

There's a bigger "con": the mirrors would need to allow
someone from the outside (specifically us.metamath.org) to log in
*to* their system & write to it. I don't know if they're willing to do that.
They may not even have the rights necessary to do it.
That's not a technical issue, but it certainly might matter :-).

Another con: It'd mean that the us.metamath.org site would have to
store the private keys for logging in to those other sites.
Doable, but an extra step.

Mirror folks: Comments?

--- David A. Wheeler

Jim Kingdon

Sep 14, 2022, 11:14:33 AM
to David A. Wheeler, Metamath Mailing List


On September 14, 2022 7:23:11 AM PDT, "David A. Wheeler" <dwhe...@dwheeler.com> wrote:
>
>Mirror folks: Comments?

For all the reasons you suggest, I'd pick the design where the mirrors pull from the central site. But I'm fine with whatever you and the mirror operators want to do.

Cris Perdue

Sep 14, 2022, 2:47:13 PM
to meta...@googlegroups.com
Hi David,

Your plan looks quite good to me, especially given that you prefer to only permit authorized mirroring, not just anyone who chooses to set up a mirror.

My one small suggestion would be to check that mirror servers can run rsync like this:

$ rsync -e ssh -a mirr...@us.metamath.org: /var/www/cn.metamath.org/

 -- in other words, omitting the path to the source area to be mirrored.  

https://dev-notes.eu/2015/06/secure-rsync-between-servers/ indicates that this should work (OK, they add a trailing "/"),
and you really don't want the mirrors to even be able to copy over other parts of your source filesystem, so the
full source path ought to be ignored if given.
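
One way a mirror could check this without touching its files is a dry run with both spellings of the source. The sketch below is only an illustration: try_pull is a made-up helper, and MIRROR_LOGIN is a hypothetical variable for the mirror's user@host login, which is left for the operator to set. How rrsync treats a full path may depend on the rrsync version, which is exactly what the dry run would reveal:

```shell
#!/bin/sh
# Sketch only: dry-run a pull to see what a given source spelling would copy.
# MIRROR_LOGIN (user@host) is a hypothetical variable, deliberately not set here.
# RSYNC can be overridden for testing.

# Usage: try_pull REMOTE_PATH DEST_DIR
try_pull() {
    remote_path=$1
    dest_dir=$2
    # -n  dry run: nothing is written locally
    # -i  itemize: list each file that would be transferred
    ${RSYNC:-rsync} -n -a -i -e ssh \
        "${MIRROR_LOGIN:?set MIRROR_LOGIN first}:${remote_path}" "$dest_dir"
}
```

For example, "try_pull '' /var/www/cn.metamath.org/" tests the form with the source path omitted, and "try_pull /var/www/us.metamath.org/ /var/www/cn.metamath.org/" tests the full path.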

Best regards,
Cris


David A. Wheeler

Sep 14, 2022, 5:33:18 PM
to Metamath Mailing List


> On Sep 14, 2022, at 2:46 PM, Cris Perdue <cr...@perdues.com> wrote:
>
> Hi David,
>
> Your plan looks quite good to me, especially given that you prefer to only permit authorized mirroring, not just anyone who chooses to set up a mirror.
>
> My one small suggestion would be to check that mirror servers can run rsync like this:
>
> $ rsync -e ssh -a mirr...@us.metamath.org: /var/www/cn.metamath.org/
>
> -- in other words, omitting the path to the source area to be mirrored.

That's a new syntax for me, but sure, we can try it.

> https://dev-notes.eu/2015/06/secure-rsync-between-servers/ indicates that this should work (OK, they add a trailing "/"),

Interesting. I note that they do the same thing - the mirrors initiate the rsyncs (not the other way around).

> and you really don't want the mirrors to even be able to copy over other parts of your source filesystem, so the
> full source path ought to be ignored if given.

I agree that mirrors shouldn't have access to everything on the server.
That said, leaking the name of the path is okay. The configuration is publicly visible after all.

The overall goal is to minimize privilege to make an attacker's job harder.
That also makes my life easier; if the system isn't 0wned, then I don't have to
spend my time fixing it :-). Determined & clever attackers can get into all sorts of
systems, but if we make it a pain, the attackers are likely to attack
a different system that's more valuable to them.

--- David A. Wheeler

heiph...@wilsonb.com

Sep 14, 2022, 8:55:52 PM
to meta...@googlegroups.com
> > - Tighter sync between mirrors and us.metamath.org
>
> I don't think that's critical. We update 1/day, and it's not a crisis if
> the mirror update is delayed. Also, rsync is *really* fast at determining
> "there is nothing to do".

Good point. With a slow, prescribed update cadence, this point is far from
critical.

I'm thinking a bit more broadly, however. It sounds like the current
HTML-generation process is quite slow, but if something like localized
incremental updates becomes possible, then it's not hard to imagine "live"
updates of the site every time an MR gets merged in.

It mostly boils down to an experiential difference---"live" updated page vs
"static" site with periodic updates. I'm clearly biased to thinking the former
is way cooler.

> We really *can't* handle such errors. If the other side is inaccessible, or
> can't write an update, there's little we can do about it.

SMTP might like to have a word about that :D
Transient network errors have to be handled or ignored by *someone*, be that
us.metamath.org or the mirrors.

> > - Might require updates to Firewall settings
>
> Not really, and we control our firewall settings anyway.

I'm thinking more broadly of the us.metamath.org + mirrors distributed
"system". Potential firewalls around each node.

> There's a bigger "con": the mirrors would need to allow
> someone from the outside (specifically us.metamath.org) to log in
> *to* their system & write to it. I don't know if they're willing to do that.
> They may not even have the rights necessary to do it.
> That's not a technical issue, but it certainly might matter :-).
>
> Another con: It'd mean that the us.metamath.org site would have to
> store the private keys for logging in to those other sites.
> Doable, but an extra step.

The configuration overhead is symmetric for each pairwise connection, so I'm
not really convinced this is a con so much as a policy decision. I probably
shouldn't have framed the discussion with a half-baked pro vs con list in the
first place, though :P

Anyway, security-wise, the pull model implicitly delegates trust out to each of
the mirrors. If any of them are compromised, then up to any (not unheard of)
rrsync bug, us.metamath.org is hosed.

Controlling connections with ssh keys implicitly centralizes control of the
mirror list into the main server; so from a purely "maximize security of
us.metamath.org" perspective, switching to a push-centric model simply makes
this explicit and lets you remove attack surface area on the main site.

Anyway, modulo other constraints, in my "make everything stateless containers"
world, the default tends to be to avoid polling as much as possible, so adjust
priors accordingly.


All that said, thanks for all the good work!

Cheers,

David A. Wheeler

Sep 14, 2022, 10:27:59 PM
to Metamath Mailing List


> On Sep 14, 2022, at 8:55 PM, heiphohmia via Metamath <meta...@googlegroups.com> wrote:
> It mostly boils down to an experiential difference---"live" updated page vs
> "static" site with periodic updates. I'm clearly biased to thinking the former
> is way cooler.

That doesn't seem important to me. After all,
we don't create new theorems THAT quickly :-).
Rsync is *remarkably* efficient in the "no change" case. Mirrors could poll hourly
(or even more often) with no problem.


>> We really *can't* handle such errors. If the other side is inaccessible, or
>> can't write an update, there's little we can do about it.
>
> SMTP might like to have a word about that :D
> Transient network errors have to be handled or ignored by *someone*, be that
> us.metamath.org or the mirrors.

Sure, but the mirror admin will have to fix the problem. If there's a network problem
that doesn't affect us.metamath.org but DOES affect the mirror, the mirror will be in a
better position to do something about it. A more likely problem is out-of-storage at the mirror site;
again, we can't solve that, only the mirror can.
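
To illustrate handling transient errors on the mirror side, a mirror's sync job could wrap its rsync call in a small retry loop. This is just a sketch; the retry helper and the attempt and delay values are illustrative, not part of any existing script:

```shell
#!/bin/sh
# Sketch only: retry a command a few times, then give up and report failure.

# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Run the command; stop as soon as it succeeds
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

A mirror might call it as "retry 3 300 rsync -a -e ssh ..." from cron. Problems that retrying cannot fix, such as a full disk, would still surface as a non-zero exit that the mirror admin can be alerted on.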


> Anyway, security-wise, the pull model implicitly delegates trust out to each of
> the mirrors. If any of them are compromised, then up to any (not unheard of)
> rrsync bug, us.metamath.org is hosed.

No, we have other lines of defense even in that case:
(1) The login shell is still disabled, making "use login shell" tricks not work
(2) Even if the rrsync sandbox is escaped and the account is fully taken over,
the attacker only gets a separate user account created for that mirror with no
special privileges. It's *not* a privileged account. The mirror account
can't modify the files being served by the webserver (owned by user "generator"),
and it can't see private keys (owned by user "root").
The attacker would have to use yet another vulnerability (e.g., a kernel exploit) to get real privileges.
To make that harder, I've hardened the system & the system auto-installs security updates.
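
As a small illustration of checking the "can't modify the served files" property, an admin could periodically look for world-writable entries under the served directory, since those are what an unprivileged account could change. This sketch only inspects mode bits (not ACLs or group membership), and the helper name is made up:

```shell
#!/bin/sh
# Sketch only: list entries under a directory that any user could write to.
# Checks permission bits only; ACLs and group-writable files are not covered.

# Usage: list_world_writable DIR
list_world_writable() {
    # -perm -0002 matches entries with the "other write" bit set;
    # symlinks are skipped because their own mode bits are not meaningful
    find "$1" ! -type l -perm -0002
}
```

Empty output from "list_world_writable /var/www/us.metamath.org/" would be the expected result on a correctly configured server.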

That's on top of the other defenses to *get* to that point:
a) The mirror system has to be taken over and/or have its private key exfiltrated
b) ssh and/or rrsync are subverted. But both are relatively small, widely used, and
specifically intended to be securely usable for these purposes.

Sure, an attacker *might* get through all that. If worst comes to worst, we trash the
virtual machine, create another, and re-run the script to recreate the site.

> Anyway, modulo other constraints, in my "make everything stateless containers"
> world, the default tends to be to avoid polling as much as possible, so adjust
> priors accordingly.

I don't have any allergy to polling, and a web server necessarily has state
(that's what we're serving!).

I think the key is that we can always revisit this stuff later.
My goal is to "make it work reasonably well" for now, so that everything is automated.

--- David A. Wheeler

David A. Wheeler

Sep 14, 2022, 10:48:24 PM
to Metamath Mailing List
I've started implementing mirroring here:
https://github.com/metamath/metamath-website-scripts/pull/1
Click on the "Files Changed" tab for all the details.

I'll need *real* public keys from the mirror maintainers for this to be useful.
For the moment I've stubbed that out.
It's possible this won't work at first, even with real public keys,
but if so we'll work out the bugs until it does. I don't expect any serious problems.

I still haven't heard from:
- at.metamath.org - Austria [courtesy of Digital Solutions Marco Kriegner]
- de.metamath.org - Germany.

--- David A. Wheeler
