RFC: my quest to make our discourse redundant


Alon Levy

Jun 13, 2019, 12:53:17 PM
to hasadna
Hi,

 Recently I started thinking about getting higher uptime for our
self-hosted Discourse and Wekan web apps. The keyword here is self:
using physical servers I control from the ground up, not relying on any
cloud infrastructure.

You can debate this starting point, but for this thread, it is a given.

General two-host redundancy problem statement:

 Given a hosted service = (FQDN, (host1, host2))

 Provide a setup such that, if at least one of the hosts is up, the
service is operational.

Before I describe my insights into this problem: any input on it? Is it
well defined enough? I.e., is there a general solution not specific to a
given webapp? Is anyone interested in this problem?

So far I think this statement is too broad. "Operational" is by itself a
can of worms, but I think it should mean minimal change in usability
compared to a single-host solution.

 In general the problem can be divided into static and dynamic sites.
The minimal requirement, even for static sites, is to resolve the FQDN
to a host that is up.

 So far I think this is the easy part, and it can be done via DNS using
an A record with multiple IPs. I've tested it locally using puredns and
Firefox, and it seemed to work (I used the localhost domain, with both
127.0.0.1 and 127.0.0.2 allocated to the same A record).
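To make the round-robin behavior concrete, here is a small sketch of what a well-behaved client does with a multi-IP A record: walk the address list until one host answers. The addresses are the test values from above, and the reachability predicate is a stand-in for a real TCP connect.

```python
def first_reachable(addresses, can_connect):
    """Mimic a browser iterating over a multi-IP A record:
    return the first address that accepts a connection, else None."""
    for addr in addresses:
        if can_connect(addr):
            return addr
    return None

# Simulated outage: the first host is down, the second is up.
up_hosts = {"127.0.0.2"}
record = ["127.0.0.1", "127.0.0.2"]  # the two IPs behind one A record
print(first_reachable(record, up_hosts.__contains__))  # → 127.0.0.2
```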

 For dynamic sites I don't think there is a general solution. By dynamic
I mean updated via client-side actions; for centrally updated sites the
problem is easier and amounts to distributing changes to the hosts in
sync.

For discourse, this seems to amount to replicating a postgresql server
and a directory of uploads.

It seems weird that there is no general know-how about this, which is
why I'm writing this. I looked at the Discourse meta site
(meta.discourse.org), searching for keywords such as redundancy and high
availability.

 I hope I'm just missing something trivial, but assuming I'm not: is
anyone interested in this area? Something like YunoHost, but with
redundancy built in. All of the new-internet approaches will eventually
also produce this (thinking of IPFS, MaidSafe, and many others I'm
sure), but I'm interested in something that works now.

Alon


Yuval Adam

Jun 13, 2019, 6:10:59 PM
to TAMI
Generally speaking, you're not charting any undiscovered territory; high availability is a well-known problem with many well-known solutions.

DNS load balancing works well for distributing load, but when one of the servers goes down, 50% (or any other k-out-of-n fraction) of your requests will still hit the dead server.
Thus you need an active health-checking load balancer in front of your web instances.
After that's settled, you need a highly available DB cluster.
And of course a highly available object storage layer.

If you're taking this as a proof of concept, by all means enjoy yourself.
But if we want to be realistic about it, adding up all the aforementioned resources will probably end up costing a hefty amount of $$$ for not a lot of gain for a service that can easily handle some downtime.

--
archive and web access >https://groups.google.com/forum/#!forum/hasadna
---
You received this message because you are subscribed to the 'TAMI' group on Google Groups.
To unsubscribe from this group and stop receiving emails from it, send an email to hasadna+u...@googlegroups.com.
To post to this group, send an email to has...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/hasadna/6e31e786-d0b3-f8bc-faa9-1618fb9fe3e0%40pobox.com
For more options, visit https://groups.google.com/d/optout.


--
Yuval Adam

Alon Levy

Jun 14, 2019, 12:48:16 AM
to has...@googlegroups.com, Yuval Adam

I have not claimed, nor do I claim, that this is new. However, I am looking for anyone who wants to work on this, or to share their knowledge of real-world setups built from the ground up. Yes, it is partially a learning experience, but I also think it will be of value. The $$$ you mention doesn't make sense to me: the cost will be exactly those two hosts. Regarding the requests hitting a dead server, the solution I saw is having a small TTL on the A record and a service (on both hosts) to update it. That still results in 50% of TCP connection attempts timing out during the TTL window.
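A sketch of the decision logic such an updater service could run on each host. The function name, the example IPs, and the "set the record" step are assumptions; a real version would push the result to the DNS provider's update API.

```python
def desired_a_records(my_ip, peer_ip, peer_is_up):
    """What the low-TTL A record should contain, as seen from one host:
    advertise both hosts while the peer answers health checks,
    drop it from the record as soon as it stops."""
    return [my_ip, peer_ip] if peer_is_up else [my_ip]

# Peer died: only this host should remain in the record.
print(desired_a_records("203.0.113.1", "203.0.113.2", peer_is_up=False))
# → ['203.0.113.1']
```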


Duncan Thomas

Jun 14, 2019, 8:14:00 AM
to TAMI
The problem as you're stating it is very, very difficult to solve.

You can make it far easier by having 3 (or any other odd number of) hosts. This allows you to have quorum in the event of a network split. The third host doesn't need to be a fully functioning host; it just needs to run enough to act as a tie-breaker in the event of a network partition. This is basic CAP-theorem stuff - let me know if you want more detail.
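The tie-breaker rule above boils down to a strict-majority check; a minimal sketch (the function name is mine, not from any particular tool):

```python
def have_quorum(up_votes, total_hosts=3):
    """A partition side may act as primary only if it can see a strict
    majority of the hosts -- with 3 hosts, at least 2 (itself included)."""
    return up_votes > total_hosts // 2

print(have_quorum(2))  # → True: this side keeps serving
print(have_quorum(1))  # → False: the minority side must stop writing
```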

For discourse, you need to set up a distributed database at a minimum, and ideally a distributed object store. Given the modest traffic levels, entirely achievable on a hobby budget and a great learning exercise.

For the actual hosting, the cheapest option, as you've noted, is to keep the DNS as the source of truth. The time-to-recovery is the length of the DNS TTL, which works entirely fine at 1 minute and can potentially be shorter, though your mobile and other low-bandwidth users will start to see a noticeably degraded experience in that case.

The more expensive but 'better' option is to use a redundantly hosted load balancer. I'm not sure how you'd go about setting one up yourself from scratch, though it would be interesting to attempt it. Putting your service behind an Amazon ELB/NLB is easy enough, not especially expensive and I'd suggest at least experimenting with it in order to understand it - very commonly used in commercial sites.

Good luck and feel free to ping me with any questions. I'd love to read a write-up of any experiments.



--
Duncan Thomas

Michael Bravo

Jun 14, 2019, 11:41:10 AM
to TAMI
I suggest you look at haproxy.

Alon Levy

Jun 14, 2019, 12:57:55 PM
to Duncan Thomas, TAMI
Hey Duncan. Thanks for the response. I am well aware of the theoretical underpinnings; I am interested in practical solutions.

I will try haproxy fronting two dynamic, SSH-tunneled, NATed hosts, because that's the cheapest common internet connection everyone in Israel gets. I have to pay an extra 17 NIS a month to get a static IP and the ability to open ports. Yuck. Of course that means I am not completely independent, but it's still hosting the content locally, with only haproxy on a VM.
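For the record, a minimal sketch of what that could look like on the droplet. The port numbers, hostnames, and health-check path are assumptions, not a tested config:

```
# /etc/haproxy/haproxy.cfg (sketch)
# Each home server keeps a reverse SSH tunnel open to the droplet, e.g.:
#   ssh -N -R 8001:localhost:80 droplet    # home server 1
#   ssh -N -R 8002:localhost:80 droplet    # home server 2

frontend web
    bind *:80
    default_backend discourse

backend discourse
    option httpchk GET /
    server home1 127.0.0.1:8001 check
    server home2 127.0.0.1:8002 check backup
```

With `check`, haproxy stops routing to a tunnel that goes dead; marking the second server as `backup` keeps all traffic on one host unless it fails, which fits a master/read-only pair.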

To do anything with DNS I need either two (or more, as you pointed out) static IPs, or my own DNS servers, which are required to have fixed IPs as well AFAICT, making it a chicken-and-egg problem.

I appreciate your interest in updates; I will post any achievement for your and others' scrutiny.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Michael Tewner

Jun 14, 2019, 1:13:42 PM
to has...@googlegroups.com, Duncan Thomas
I don't understand why hardware HA is so important when the servers, presumably, sit somewhere where the internet connection ("dynamic SSH tunneled NATed hosts") and power are less reliable than the hardware itself.

Alon - Have we met? I work next door to TAMI and I have extensive experience with building/hosting production infrastructure - Want to swing by for coffee this week?


Michael Bravo

Jun 14, 2019, 1:19:56 PM
to has...@googlegroups.com
I second Michael's point. If we're talking about trying to host commodity hardware behind NATed household connections, then the reliability of such a setup is constrained by the dubious reliability of household internet connectivity (which is wildly asymmetrical to start with), not to mention the power. "Cloud" hardware is vastly better connected and powered, and in case of physical malfunction is easily replaced. What's the upside?


Alon Levy

Jun 14, 2019, 3:25:06 PM
to has...@googlegroups.com, Michael Tewner, Duncan Thomas
I would be glad to meet. I am not sure what you mean by hardware specifically, so here is what I have in mind:

Haproxy setup:

Digitalocean droplet
- uptime: high (never measured; 95%?)
- haproxy

Home server 1 and 2
- downtime: approx. 20%
- discourse (one read-only)
- postgresql in master-slave replication (table level)
- disk ("blob storage") rsynced from master to slave
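For the replication pieces, one concrete shape (a sketch; hostnames, user names, and paths are assumptions). Note that PostgreSQL's built-in streaming replication works at the whole-cluster level; table-level replication means logical replication (PostgreSQL 10+), roughly:

```
# postgresql.conf on the master
wal_level = logical

-- on the master, in the discourse database:
CREATE PUBLICATION discourse_pub FOR ALL TABLES;

-- on the read-only replica:
CREATE SUBSCRIPTION discourse_sub
  CONNECTION 'host=home1 dbname=discourse user=replicator'
  PUBLICATION discourse_pub;

# uploads directory, pulled periodically by a cron job on the replica:
#   rsync -az home1:/var/discourse/shared/uploads/ /var/discourse/shared/uploads/
```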

This is not solving the problem as stated, but an easier one, where I want HA for reads only, i.e. when the master is down the slave is still available for reading.

Read availability = 1 - failure

Failure = a + (1 - a) * b^2


a = haproxy failure = 0.05

b = home failure (independent, per host) = 0.2
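Spelling the arithmetic out (assuming the home-server failures really are independent, so both being down at once has probability b^2):

```python
a = 0.05   # haproxy droplet failure probability
b = 0.20   # single home server failure probability

# Reads fail when the droplet is down, or when it is up
# but both home servers are down at the same time.
failure = a + (1 - a) * b**2
print(round(1 - failure, 3))  # read availability → 0.912
```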

When is it better than a VM? Never, because it is capped by a. But if I go the DNS route I can fix that, and meanwhile I'll solve part of the problem I need solved anyway: discourse redundancy.

But I'm starting from the requirement that the service itself be hosted on my own servers, not in a cloud. More on that in my next reply to Michael Bravo (later), but in brief: I think self-hosting is valuable and I want to advance it because it contributes to my freedom; see projects such as yunohost.org and freedombox.org for a more elaborate rationale. Even beyond that, I find it interesting to learn the nitty-gritty. My background is software development; I've used Wireshark for a long time and developed network protocols, but never did self-hosting with HA specifically.
