Self-hosted server architecture (recommendations please)


Adon Irani

Oct 21, 2017, 11:27:48 PM
to Canvas LMS Users
Hi everyone,

I'm building a self-hosted Canvas LMS instance for a client - an online learning academy (based primarily in BC, Canada with expectations to serve South America and other global regions).

Right now the enrolments will be low, but I want to be prepared for scale.

My current architecture (pre-Canvas) is:
- bare metal dedicated server in New York (E3-1230 v5 @ 3.40GHz, 4 cores, 8 virtual processors, 16GB RAM)
- 2x VPS (2GB RAM, 2 virtual processors) in Toronto and Amsterdam
- off-site backup solution + a mail server
...running Nginx and Percona MySQL, and primarily hosting Drupal sites, all on Debian.

In preparation for Canvas, I replaced MySQL with Postgres and migrated all my Drupal sites to pgsql. (Bit of a pain, but allows me to drop out MySQL). I encountered problems getting Debian 9 to play nice, so I switched to Ubuntu and voila! Now I've got my 2x VPS running the Drupal sites, and I've freed up my dedicated bare metal for everything Canvas. Everything is working great so far. On my bare metal dedicated server, I've got Canvas LMS (ruby, Nginx, passenger, postgres, redis, delayed jobs). Runs fast, although i haven't had any user load to mention.

...so, my underlying question is what architecture should i really be looking at? Planning current state (low enrolments) and next stage (not sure what that'll be, e.g. at what user level will i need to ramp up??)

A couple scenarios i'm considering (and would welcome input...)
- bare metals vs. VPS; am i best with 2 bare metal dedicated servers, or a bunch of VPS (dollar for dollar, I could probably get 4x 4GB VPS for price of my one bare metals)
- is first step to separate postgres from ruby (+nginx/passenger)? (my dedicated server has more processing power than a VPS, but i could easily spin up a 4GB VPS as an app server for canvas/ruby to start)
- is postgres a single instance? while canvas/ruby application server gets load-balanced?
- where does redis best go? e.g. if it's canvas/ruby that's memory hungry, can redis sit on the postgres server, or best to spin up another VPS? so far my redis is barely used with bare install of canvas, and server ram jumps between 2-4GB used.
- same question with delayed jobs server; best to sit on bare metal dedicated server along with postgres for now? or spin up another VPS? and if so, can that be a low-end one? how low-end? 0.5 GB ram?

I'm pretty budget sensitive, so having some input from the self-hosted community will really help me put my money where i'll get the most impact.

As enrolments grow, I can increase my budget of course. But for now, i'm hoping to add incrementally to my current architecture. Also, I need to propose a fair pricing scheme for my client to cover my fixed costs (on a sliding scale based on enrolments as the key metric I'd presume). 

Your insight is welcomed!!!

best,

Adon

Adon Irani

Oct 22, 2017, 7:49:21 AM
to Canvas LMS Users
Oh, and for network storage between app servers, is GlusterFS a good choice? Any insight here is welcome.

Much obliged!

Adon

Bryan Petty

Oct 22, 2017, 6:49:24 PM
to canvas-l...@googlegroups.com
On Sat, Oct 21, 2017 at 9:27 PM, Adon Irani <adon...@gmail.com> wrote:
> In preparation for Canvas, I replaced MySQL with Postgres and migrated all
> my Drupal sites to pgsql. (Bit of a pain, but allows me to drop out MySQL).

I agree, this was a good choice.

> I encountered problems getting Debian 9 to play nice, so I switched to
> Ubuntu and voila! Now I've got my 2x VPS running the Drupal sites, and I've
> freed up my dedicated bare metal for everything Canvas. Everything is
> working great so far. On my bare metal dedicated server, I've got Canvas LMS
> (ruby, Nginx, passenger, postgres, redis, delayed jobs). Runs fast, although
> i haven't had any user load to mention.

> - bare metals vs. VPS; am i best with 2 bare metal dedicated servers, or a
> bunch of VPS (dollar for dollar, I could probably get 4x 4GB VPS for price
> of my one bare metals)

This question is frequently more affected by choices involving
security and ease of updates, rather than resources. It depends on
what resources you need to allow access to, and how you can
efficiently and securely grant that access without giving people
access to more than they should have. Generally speaking, the more you
can split each service into an isolated box, the better you can
provide access controls around each service, and the easier it becomes
to upgrade each service individually. Of course, the cost to that is
redundant base services (syslog, NTP, monitoring tools, etc), which
chips away at your total available memory on all servers (often the
most critical resource).

> - is first step to separate postgres from ruby (+nginx/passenger)? (my
> dedicated server has more processing power than a VPS, but i could easily
> spin up a 4GB VPS as an app server for canvas/ruby to start)

I think this is your first step. It's usually the easiest and quickest
performance win just to split the DB out into its own dedicated
machine. Coincidentally, it's also the best choice for better security
too.

> - is postgres a single instance? while canvas/ruby application server gets
> load-balanced?

Yes. It takes a lot of time and load to get to this point with a
single institution, but past having a single DB server for PostgreSQL,
the next step is provisioning a read-only replicated DB to offload the
majority of SQL reads from the primary DB, and you only do that if the
DB is already on its own dedicated server, you're not able to upgrade
the hardware on it anymore, and you are still running into performance
issues.

Make sure to look into connection pooling between your app servers
(web and work) and your PostgreSQL server if necessary.

> - where does redis best go? e.g. if it's canvas/ruby that's memory hungry,
> can redis sit on the postgres server, or best to spin up another VPS? so far
> my redis is barely used with bare install of canvas, and server ram jumps
> between 2-4GB used.

Redis should go somewhere that isn't tied into the scaling models of
your app servers, since you still want one Redis server behind all app
servers. PostgreSQL can be pretty memory intensive as well, so if you
stick it there, be sure you understand how to properly tune both
services so they don't stomp on each other (probably most important
being the "maxmemory" setting for Redis).
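
A rough way to sanity-check that kind of co-hosting is to add up the
worst-case commitments before picking a maxmemory value. The sketch
below is only back-of-the-envelope arithmetic, not a tuning guide:
shared_buffers, work_mem and max_connections are real PostgreSQL
settings and maxmemory is the Redis one, but every number is a made-up
example and the function name headroom_mb is hypothetical. It also
ignores that work_mem can be used more than once per connection, and
that Redis can go past maxmemory through fragmentation or while
forking for persistence.

```python
def headroom_mb(total_mb, os_reserved_mb, shared_buffers_mb,
                max_connections, work_mem_mb, redis_maxmemory_mb):
    """Return RAM (MB) left over after rough worst-case commitments.

    Rough model only: PostgreSQL is counted as shared_buffers plus one
    work_mem allocation per allowed connection; Redis is counted at its
    maxmemory cap.
    """
    postgres_mb = shared_buffers_mb + max_connections * work_mem_mb
    return total_mb - os_reserved_mb - postgres_mb - redis_maxmemory_mb


# Assumed example values for a 16 GB box (illustrative, not recommendations).
left = headroom_mb(
    total_mb=16 * 1024,
    os_reserved_mb=1024,
    shared_buffers_mb=4096,
    max_connections=100,
    work_mem_mb=16,
    redis_maxmemory_mb=4096,
)
print(f"headroom: {left} MB")
```

If the result comes out small or negative, that is a hint to lower one
of the limits or move Redis elsewhere.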

> - same question with delayed jobs server; best to sit on bare metal
> dedicated server along with postgres for now? or spin up another VPS? and if
> so, can that be a low-end one? how low-end? 0.5 GB ram?

Generally speaking, each background job worker will use just as much
memory as each Passenger worker. It's the same application being
initialized, and it still requires a lot of memory. This entirely
depends on how many background job workers you configure to spin up,
just like the number of Passenger workers. I don't believe it's smart
to run either background jobs or Passenger workers on a server with
less than 2GB though (and that's assuming they aren't running
*anything* else on them, have a very low number of workers, and the
server is tuned for that low amount of resources).
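
To put numbers on that, here is a minimal sketch of the arithmetic for
how many workers (Passenger or delayed_jobs, since they cost about the
same) fit on a box. The per-worker and reserved figures are
assumptions for illustration only; measure the real resident size of
a worker on your own install before relying on anything like this.
The function name workers_that_fit is hypothetical.

```python
def workers_that_fit(total_ram_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Return how many app workers fit after reserving RAM for the OS and base services."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    return max(usable_mb // per_worker_mb, 0)


# Assumed figures: 1 GB reserved for the OS, 512 MB per worker (not measured values).
for ram_mb in (512, 2048, 4096):
    print(ram_mb, "MB ->", workers_that_fit(ram_mb, 1024, 512), "workers")
```

Under these assumed figures a 0.5GB box fits no workers at all, which
is the point of the 2GB floor above.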

Instructure runs background jobs on a dedicated pool of app servers
that scales on demand, and then a completely different pool of app
servers for Passenger web requests, also scaling with demand. The more
load you start seeing, the more you'll notice that it's not evenly
distributed for both of those, so you may find out that it's necessary
to split them up so they can be scaled independently.

That said, for self-hosted, you might be able to find a good way of
balancing your delayed_jobs workers per server along with the number
of Passenger workers, that works just fine on the same server. You
might find it's easier to update Canvas like that at least.

Speaking of updates, try to find a way of load balancing the web
requests early, with a minimum of a couple servers hooked up behind
your load balancer. If you want to provide no-downtime updates, you'll
need to pull one offline at a time (ideally during low load) to
restart it with new updates to the application.
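
The one-at-a-time procedure can be sketched roughly as below. This is
an outline of the control flow only: the drain, update, health check
and enable steps are passed in as callables because they depend
entirely on your load balancer and deploy tooling, and all names here
(rolling_update, app1, app2) are hypothetical.

```python
from typing import Callable, List, Sequence


def rolling_update(
    servers: Sequence[str],
    drain: Callable[[str], None],
    update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    enable: Callable[[str], None],
) -> List[str]:
    """Update servers one at a time; stop at the first failed health check."""
    done: List[str] = []
    for server in servers:
        drain(server)    # take it out of the load balancer rotation
        update(server)   # deploy the new code and restart the app
        if not is_healthy(server):
            raise RuntimeError(f"{server} failed its health check; rollout stopped")
        enable(server)   # put it back before touching the next one
        done.append(server)
    return done


# Tiny in-memory stand-in for a load balancer pool (hypothetical names).
in_rotation = {"app1", "app2"}
serving_counts = []


def drain(server: str) -> None:
    in_rotation.discard(server)
    serving_counts.append(len(in_rotation))


updated = rolling_update(
    ["app1", "app2"],
    drain=drain,
    update=lambda server: None,      # placeholder for the real deploy step
    is_healthy=lambda server: True,  # placeholder for a real HTTP check
    enable=in_rotation.add,
)
print(updated, min(serving_counts))
```

With two servers there is always one left serving while the other
restarts, which is why a couple of servers is the minimum.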

--
Regards,
Bryan Petty

Adon Irani

Oct 22, 2017, 8:34:13 PM
to Canvas LMS Users
Hi Bryan,
This is great insight, thank you.

It sounds like the next immediate step for me would be to spin up one 4GB VPS for the web application and perhaps delayed jobs (with a single worker), and keep the dedicated server for postgres and redis. Then monitor, monitor, monitor. I can then clone out the web application servers as load increases.

Could i put out a few more questions:
- Any guesses in terms of max concurrent users per single VPS (assuming 4GB RAM running only the Canvas LMS application, with no database, redis or delayed jobs on it)? This is the key metric that'll help me evaluate fixed costs at various levels of load
- With the 16GB dedicated box, could redis co-exist with postgres for now, say with maxmemory set to 4GB?
- For delayed jobs, if I set a single worker and run it on the 4GB VPS for now, is there any negative impact other than some admin tasks taking longer? The plan would be to monitor, then separate jobs onto their own 2GB VPS and upgrade it to 4GB as needed. In phases: 1x 4GB VPS combined application/delayed jobs; then 2x 4GB VPS combined; then a 2GB (eventually 4GB) VPS for delayed jobs only, no longer sharing with the 4GB application servers
- What am I missing about the file nodes? Graham Ballantyne's "So now what" video (https://www.youtube.com/watch?v=85DnvWwGvms) talks about having as many file nodes as application nodes. Are those simply a cluster of networked storage, or something else entirely? Is this related to the file_store.yml setting that defaults to /tmp/files? I wouldn't think these run as hot as the application servers if it's simply file storage, but perhaps there's a lot of activity?
- For the shared network storage across multiple VPS, is GlusterFS a good choice or what is recommended/used?
- for upgrades i'm going to do some more research; i found some great insight also from Graham Ballantyne on this in the forum...
- for backup policies, is a daily recursive copy (rsync or incremental rsnapshot) of /var/canvas and an hourly database .pgsql dump sufficient? everything goes offsite daily with multiple restore points after that
- and if i've only configured for production, does this mean i don't have test and development iterations using resources? I have to do more research into how these all play together. For now i figured a 

So in terms of phases, i have dedicated server + 4GB VPS; then dedicated server (db, redis) + 2x 4GB VPS (app, jobs); and from there i have dedicated server (db), 4GB VPS each for redis, delayed jobs, and multiple app servers. With app servers sharing a common storage mountpoint.

Thanks again for the insight on all this. It's extremely helpful. I'm looking forward to being a long-time Canvas LMS user.

Regards,

Adon

Graham Ballantyne

Oct 22, 2017, 9:39:00 PM
to canvas-l...@googlegroups.com
Hi Adon,

Bryan's response covers most of what I would suggest. Our philosophy at SFU has been to use separate VMs for each facet of Canvas: a pool of app servers, a pool of delayed jobs servers, a pool of redis servers, and a database server. No one server fulfils multiple roles. This is advantageous for a few reasons: performance, security, reliability, and maintainability.

SFU's current Canvas infrastructure (serving ~29000 students and ~2000 instructors and TAs) is:

- 20 application servers
- 2 servers for delayed jobs and other background tasks (e.g. bulk enrollments)
- 3 redis servers
- 1 postgres server (behind pgbouncer)
- 1 cassandra server

These are all virtual servers running Red Hat Enterprise Linux. App & management servers are quad-CPU, 16GB RAM. Database is 8-CPU, 32GB. Redis servers are dual-CPU, 12GB RAM.
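
For a very rough ratio out of those figures, here is the arithmetic
as a quick sketch. Keep in mind these are enrolled users, not
concurrent users, and the app servers are 16GB machines, so the
result does not translate directly to a 4GB VPS.

```python
# Figures taken from the list above; the ratio is only a crude yardstick.
students = 29_000
staff = 2_000          # instructors and TAs
app_servers = 20

users_per_app_server = (students + staff) / app_servers
print(users_per_app_server)  # 1550.0
```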

With small load, you can get away with running delayed jobs on your app servers. We chose to separate them out for performance and scalability. It's easy enough to do this later if load warrants.

I'm a firm believer that Redis should be on its own server(s). Redis' sole purpose in life is to use as much memory as you give it. There are also some suggested tuning settings that don't always play well with other applications. I've had mixed results with Redis' maxmemory setting; I don't fully trust it.

File nodes: it used to be that file downloads were processed and served by the rails app. We ran into performance problems in large courses that had very large files, so we split off several dedicated nodes to serve the files domain. Files are now served directly by Apache (if you install and enable mod_xsendfile), so while we're still running this setup, it isn't necessary, and will be removed when we rearchitect our installation as part of moving to a new datacentre.

For file storage, we use NFS mounts from our NetApp storage appliances. I'm not familiar with GlusterFS so I can't comment on that one. The code that handles S3 storage in Canvas is robust, as Instructure uses and relies on it. The code for local file storage is not as robust, and we occasionally find (and fix where we can) weird bugs in it; most of the time, though, it's fine.

For no-downtime upgrades, you're going to need multiple app servers behind a load balancer, or you'll need to spring for Passenger Enterprise and use the rolling restart feature. We do both (20 nodes behind a LB, all running Passenger Enterprise). It works very well.

For backups: DB backups are handled for me by a different group so I'm not fluent in the details, but I believe we use Postgres' built-in backup tools (see pg_start_backup) and WAL. We don't, to my knowledge, use pg_dump for backups. Our database volumes are NFS-mounts, so I think we get filesystem-level snapshots from that as well. I don't explicitly back up anything on the application nodes; they're disposable as far as I care. If one breaks, I can remove it from the pool and either redeploy to it, or more likely just have it destroyed and re-cloned from a good one. Our file storage backups are handled by the NetApps.

Hope that helps. Feel free to ask more questions. You can also join us in the #canvas-lms channel on freenode.

Graham.


--
Graham Ballantyne
IT Services
Simon Fraser University
--


Adon Irani

Oct 22, 2017, 11:12:17 PM
to Canvas LMS Users
You guys are awesome!! Thanks for the leads.
I have a few follow-up questions, but I'll head to IRC for those. What I have here is a great blueprint to work with.

Very much obliged,

Adon