Reducing the pain of a clojars outage


Toby Crawley

Jan 1, 2016, 10:32:11 PM
to clo...@googlegroups.com
Given the recent DDoS-triggered outages at linode (including the one
today that has been the worst yet, currently 10 hours at the time I'm
writing this), I've been giving some more thought to how we can make
future outages less painful for the community.

I have an open issue[1] (but no code yet) to move the repository off
of the server and onto a block store (s3, etc), with the goal of
making repo reads (which is what we use clojars for 99.9% of the
time) independent of the status of the server. But I'm not sure that
really solves the problem we are seeing today. Currently, we have two
points of failure for repo reads:

(1) the server itself (hosted on linode)
(2) DNS for the clojars.org domain (also hosted on linode)

Moving the repo off of the server to a block store still leaves two
points of failure:

(1) the block store (aws, rackspace, etc)
(2) DNS for the clojars.org domain (hosted on linode), since we would
CNAME the block store

The block store provider would probably be better distributed, though,
and have more resources to withstand a DDoS (but do any block store
providers have 100% uptime?).

The block store solution is complex - it introduces more moving parts
into clojars, and requires reworking the way we generate usage stats,
and how the api gets its data. It also requires reworking the way we
administer the repo (deletion requests, cleaning up failed/partial
deploys). And it may not solve the availability problem at all, since
we still have two points of failure.
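To make the points-of-failure argument concrete, here is a minimal Python sketch using hypothetical availability figures and assuming failures are independent (a shared provider like linode breaks that assumption). Components in series must all be up, so their availabilities multiply; redundant paths in parallel fail only when every one of them fails.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Availability of a chain where every component must be up
    (e.g. the server *and* DNS for clojars.org)."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of redundant paths where any single one suffices.
    Assumes failures are independent."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    server, dns = 0.99, 0.99
    one_path = series(server, dns)
    print(f"server + DNS in series: {one_path:.4f}")  # 0.9801
    # A better-distributed block store raises one factor,
    # but the DNS link still caps the product.
    print(f"block store (0.999) + DNS: {series(0.999, dns):.4f}")
    # Three independently hosted paths (primary plus two mirrors that
    # share neither the hosting provider nor the DNS):
    print(f"three independent paths: {parallel(one_path, one_path, one_path):.6f}")
```

The parallel case only helps if the paths really are independent; mirrors that all sit behind the same DNS are still a single chain.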

I think a better solution may be to have multiple mirrors of the repo,
either run by concerned citizens or maintained by the clojars staff. I
know some folks in the community already run internal caching proxies
or rsynced mirrors (and are probably chuckling knowingly at those of
us affected by the outage), but those proxies don't really help those
in the community that don't have that internal infrastructure. And I
don't want to recommend that everyone set up a private mirror - that
seems like a lot of wasted effort.

Ideally, it would be nice if we had a turn-key tool for creating a
mirror of clojars. We currently provide a way to rsync the repo[2], so
the seed for a mirror could be small, and could then slurp down the
full repo (and could continue to do so on a schedule to remain up to
date). We could then publish a list of mirrors that the community
could turn to in times of need (or use all the time, if they are
closer geographically or just generally more responsive). Any deploys
would still need to hit the primary server, but deploys are
dwarfed by reads.
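A minimal sketch of how a client could use a published mirror list: try the primary repo, then each mirror in order, and return the first successful response. The source URLs (including the assumed `https://clojars.org/repo` primary) and the helper names are hypothetical; lein, boot and maven configure mirrors through their own settings, which this does not model.

```python
import urllib.request
from typing import Callable, List, Sequence, Tuple


class AllSourcesFailed(Exception):
    """Raised when neither the primary repo nor any mirror could serve a file."""


def http_fetch(url: str, timeout: float = 10.0) -> bytes:
    # urlopen raises URLError/HTTPError (both subclasses of OSError) on failure.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_fallback(
    path: str,
    bases: Sequence[str],
    fetch: Callable[[str], bytes] = http_fetch,
) -> Tuple[str, bytes]:
    """Try each base URL in order; return (url, body) from the first success."""
    errors: List[str] = []
    for base in bases:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return url, fetch(url)
        except OSError as exc:  # refused connection, DNS failure, timeout, HTTP error
            errors.append(f"{url}: {exc}")
    raise AllSourcesFailed("; ".join(errors))


# Hypothetical source list: the primary repo first, then published mirrors.
SOURCES = [
    "https://clojars.org/repo",
    "https://clojars-mirror.example.org/repo",
]
```

Because any source in the list can serve the bytes, client-side checksum verification still matters here.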

There are a few issues with using mirrors:

(1) security - with artifacts in more places, there are more
opportunities to introduce malicious versions. This could be
prevented if we had better tools for verifying that the artifacts
are signed by trusted keys, and we required that all artifacts be
signed, but that's not the case currently. But if we had a regular
process that crawled all of the mirrors and the canonical repo to
verify that the checksums of every artifact are identical, this could
actually improve security, since we could detect if any checksum
had been changed (a malicious party would have to change the
checksum of a modified artifact, since maven/lein/boot all confirm
checksums by default).

(2) download stats - any downloads from a mirror wouldn't get
reflected in the stats for the artifact unless we had some way to
report those stats back to clojars.org. We currently generate the
stats by parsing the nginx access logs, mirrors could do the same
and report stats back to clojars.org if we care enough about
this. We don't get stats from the existing private mirrors, and
the stats aren't critical, so this may be a non-issue, and
definitely isn't something that has to be solved right away, if
ever.
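A sketch of how a mirror might derive per-artifact download counts from its own nginx access log, assuming nginx's default `combined` log format and the Maven directory layout (`group/.../artifact/version/artifact-version.jar`). This is not the code clojars.org uses for its stats; the counting rules (successful `GET`s of main jars only) and the function names are illustrative.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

Coords = Tuple[str, str, str]  # (group, artifact, version)


def jar_coordinates(path: str, prefix: str = "/") -> Optional[Coords]:
    """Map a Maven-layout jar path to (group, artifact, version), or None."""
    path = path.split("?", 1)[0]
    if not path.startswith(prefix) or not path.endswith(".jar"):
        return None
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) < 4:  # need at least group/artifact/version/file
        return None
    *group, artifact, version, filename = parts
    if filename != f"{artifact}-{version}.jar":
        # Skips classifier jars (e.g. *-sources.jar) and timestamped SNAPSHOT jars.
        return None
    return ".".join(group), artifact, version


def count_downloads(lines: Iterable[str], prefix: str = "/") -> Counter:
    """Count successful GETs of main artifact jars, keyed by coordinates."""
    counts: Counter = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m["method"] != "GET" or m["status"] != "200":
            continue
        coords = jar_coordinates(m["path"], prefix)
        if coords is not None:
            counts[coords] += 1
    return counts
```

How a mirror would then report these counts back to clojars.org is left open.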

The repo is just served as static files, so I think a mirror could
simply be:

(1) a webserver (preferably (required to be?) HTTPS)
(2) a cronjob that rsyncs every N minutes
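A minimal sketch of the cronjob half, written as a Python wrapper around rsync so that overlapping cron runs are skipped and failures are logged. The actual rsync source and command are the ones documented in [2] and are not reproduced here; `source`, the flags, the lock path, and the script name are assumptions. It relies on `fcntl`, so it is Unix-only.

```python
import fcntl
import subprocess
import sys
from typing import List


def build_rsync_cmd(source: str, dest: str, timeout_s: int = 300) -> List[str]:
    """Build the rsync argv. The flags are illustrative, not the wiki's:
    -a preserves the tree, --delete propagates upstream deletions (e.g.
    honored deletion requests), --timeout aborts a stalled transfer."""
    return [
        "rsync", "-a", "--delete", f"--timeout={timeout_s}",
        source.rstrip("/") + "/",  # trailing slash: copy the contents of source
        dest.rstrip("/") + "/",
    ]


def sync_once(source: str, dest: str, lock_path: str = "/tmp/clojars-mirror.lock") -> int:
    """Run one sync unless a previous run (from an earlier cron tick) holds the lock."""
    with open(lock_path, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("previous sync still running; skipping this run", file=sys.stderr)
            return 0
        result = subprocess.run(build_rsync_cmd(source, dest))
        if result.returncode != 0:
            print(f"rsync exited with {result.returncode}", file=sys.stderr)
        return result.returncode  # the lock is released when the file is closed


if __name__ == "__main__" and len(sys.argv) == 3:
    # Intended to be invoked by cron every N minutes: mirror_sync.py SOURCE DEST
    sys.exit(sync_once(sys.argv[1], sys.argv[2]))
```

Whether to propagate deletions with `--delete` is a policy choice for each mirror operator.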

And the cronjob would just need the rsync command in [2], so, to get
this started, we just need:

(1) linode to be up
(2) people willing to run mirrors

(I would say "(3) add a page to the wiki on how to use a mirror", but
that would destroy the symmetry of all the other 2-item lists in this
message)

And it would be nice to have the process in place to verify checksums
soon - that would actually be a boon if we had another linode
compromise[3].
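A sketch of the checksum-verification process described above, assuming every repo (canonical and mirrors) serves the Maven-conventional `.sha1` file next to each artifact. Fetching is passed in as a function so the comparison logic runs offline; the base URLs and helper names are hypothetical. A crawler like this only sees what each server returns to the crawler itself.

```python
import hashlib
from typing import Callable, List, Sequence

# fetch(base_url, path) -> bytes; injected so the logic is testable offline.
Fetch = Callable[[str, str], bytes]


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def published_sha1(text: str) -> str:
    """A .sha1 file holds the hex digest, sometimes followed by a filename."""
    return text.strip().split()[0].lower()


def checksum_mismatches(
    path: str, canonical: str, mirrors: Sequence[str], fetch: Fetch
) -> List[str]:
    """Compare one artifact on every mirror against the canonical repo."""
    expected = published_sha1(fetch(canonical, path + ".sha1").decode("ascii"))
    problems: List[str] = []
    for mirror in mirrors:
        published = published_sha1(fetch(mirror, path + ".sha1").decode("ascii"))
        if published != expected:
            problems.append(f"{mirror}: published .sha1 differs from canonical")
        if sha1_hex(fetch(mirror, path)) != expected:
            problems.append(f"{mirror}: artifact bytes do not match canonical checksum")
    return problems
```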

Does anyone see any issues with this plan? I'm curious whether there are
security implications (or anything else) that I haven't thought of.

Are you willing to run a mirror?

One issue that comes to mind is if we do decide to move the repo to a
block store, it actually makes mirroring more difficult unless we keep
a copy of the repo on disk on clojars.org as well. But I would like to
have mirrors in place as soon as possible, and worry about that later.

- Toby

[1]: https://github.com/clojars/clojars-web/issues/433
[2]: https://github.com/clojars/clojars-web/wiki/Data#rsync-the-whole-classic-repository
[3]: https://groups.google.com/d/msg/clojars-maintainers/uAVJVwRAnSU/WISqQn5E9KIJ

Toby Crawley

Jan 1, 2016, 10:50:03 PM
to clo...@googlegroups.com
One potential issue with the mirrors is java 6 and HTTPS: the mirrors
couldn't use 2048-bit dhparams[1] or SNI[2], since neither is
supported in java 6. Yes, we all should be on java 7 or 8 at this
point, but I believe IntelliJ still uses java 6 on Mac OS, which would
mean Cursive couldn't download from the mirrors.

[1]: https://weakdh.org/sysadmin.html
[2]: https://en.wikipedia.org/wiki/Server_Name_Indication

Daniel Compton

Jan 1, 2016, 11:50:59 PM
to Clojure
IntelliJ 15 (the new version) bundles JDK8 for Mac OS X, so the concern about Java 6 will diminish over time.

It could be helpful to extend https://github.com/clojars/clojars-web/issues/432 to support these third-party mirrors, so people just need to point an Ansible script at a server and it will be set up for them.

Ken Restivo

Jan 2, 2016, 12:09:17 AM
to clo...@googlegroups.com
Any tooling would also have to upgrade to clj-http 2.0.0 and/or HttpClient 4.5, because before that SNI was broken even on Java 8:

https://issues.apache.org/jira/browse/HTTPCLIENT-1613?devStatusDetailDialog=repository

Supposedly fixed in HttpClient 4.5, which clj-http 2.0.0 pulls in, but I haven't tested to confirm.

-ken

Toby Crawley

Jan 2, 2016, 12:10:40 AM
to clo...@googlegroups.com
On Fri, Jan 1, 2016 at 11:50 PM, Daniel Compton
<daniel.com...@gmail.com> wrote:
> IntelliJ 15 (the new version), bundles JDK8 for Mac OS X so the concern about Java 6 will get less over time.

Ah, good to know.

>
> It could be helpful to extend https://github.com/clojars/clojars-web/issues/432 to support these third party mirrors so people just need to point an Ansible script at a server and it will be set up for them.

Yes, definitely. I was thinking of the bare minimum to get a few
mirrors started.

Michael Gardner

Jan 2, 2016, 12:48:16 AM
to clo...@googlegroups.com

> On Jan 1, 2016, at 21:31, Toby Crawley <to...@tcrawley.org> wrote:
>
> But if we had a regular
> process that crawled all of the mirrors and the canonical repo to
> verify that the checksums every artifact are identical, this could
> actually improve security, since we could detect if any checksum
> had been changed

I would caution against this approach. An attacker could easily target specific organizations, serving compromised artifacts only to particular IP ranges. A periodic verification process wouldn't detect this[1], and might lend a false sense of security that lulls people into putting off real security measures.

[1] Unless run by every organization that uses lein, and even then it still might not catch anything if the attackers are clever.

Nando Breiter

Jan 2, 2016, 4:30:34 AM
to clo...@googlegroups.com
Would CloudFlare help in the short term? I haven't used the service yet; I just ran across it while researching DDoS solutions, but judging from the overview of how it works, it might be able to cache all clojars.org assets in a distributed manner and handle the DNS issue as well. https://www.cloudflare.com/ If it would work, the advantage is a very quick initial setup: all you need to do is let them handle the DNS.





Aria Media Sagl
Via Rompada 40
6987 Caslano
Switzerland

+41 (0)91 600 9601
+41 (0)76 303 4477 cell
skype: ariamedia

Toby Crawley

Jan 2, 2016, 11:27:58 AM
to clo...@googlegroups.com
That's a good point. Would you trust this approach more if the mirrors
were all managed by the clojars staff instead of by community members?
You currently trust the clojars staff to not act maliciously, and to
detect an intrusion by a third party against clojars.org.

- Toby

Michael Gardner

Jan 2, 2016, 2:00:19 PM
to clo...@googlegroups.com
I would trust it somewhat more. An increase in the number of servers still means an increase in the system's attack surface, but at least there shouldn't be any additional risk from those running the mirrors.

Still, my personal opinion (for whatever it's worth) is that ensuring the entire process is always cryptographically secure end-to-end should be a higher priority than establishing mirrors.

Toby Crawley

Jan 2, 2016, 3:34:20 PM
to clo...@googlegroups.com
On Sat, Jan 2, 2016 at 1:59 PM, Michael Gardner <gard...@gmail.com> wrote:
> Still, my personal opinion (for whatever it's worth) is that ensuring the entire process is always cryptographically secure end-to-end should be a higher priority than establishing mirrors.

I agree, ensuring the process is cryptographically secure end-to-end
should be a priority, but it is also a Sisyphean task, since it would
require at least:

* getting everyone to sign releases: not difficult - we just require
signatures at deploy time on clojars.org and deal with the pain of
bringing everyone up to speed
* dealing with existing unsigned releases: deprecate them? give the
authors a way to sign them after the fact?
* changing tooling to confirm that the artifacts are signed with keys
that are in your web of trust: lein and boot can already tell you
what in the dep graph is signed, and verify that the signatures are
valid, but don't yet confirm against the caller's web of
trust. Without that, how would you know that the artifact isn't
signed with a random, throwaway key?
* organizing key-signing parties around the world to build the web of
trust for the clojure community: Phil Hagelberg started that process
with key-signing meetings at clojure conferences, but it didn't
spread very far. Initiatives like https://keybase.io/ may help with
this.
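The third bullet's missing step, checking a valid signature against the caller's own trusted keys, can be sketched as below. The input shape (a map from dependency coordinate to the fingerprint of the key that produced a valid signature) is an assumption, not something lein or boot expose in this form, and real web-of-trust evaluation (trust levels, expiry, revocation) is far more involved than set membership.

```python
from enum import Enum
from typing import Dict, Iterable, Optional


class Trust(Enum):
    UNSIGNED = "unsigned"
    UNTRUSTED_KEY = "validly signed, but key not in your web of trust"
    TRUSTED = "signed by a key in your web of trust"


def _normalize(fingerprint: str) -> str:
    # Fingerprints are often printed in space-separated groups of hex digits.
    return fingerprint.replace(" ", "").upper()


def classify(
    signers: Dict[str, Optional[str]], trusted: Iterable[str]
) -> Dict[str, Trust]:
    """signers maps a dependency coordinate to the fingerprint of the key that
    produced a *valid* signature for it, or None if it is unsigned."""
    trusted_set = {_normalize(fp) for fp in trusted}
    result: Dict[str, Trust] = {}
    for coord, fp in signers.items():
        if fp is None:
            result[coord] = Trust.UNSIGNED
        elif _normalize(fp) in trusted_set:
            result[coord] = Trust.TRUSTED
        else:
            result[coord] = Trust.UNTRUSTED_KEY  # e.g. a random, throwaway key
    return result
```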

And this assumes that everyone in your web of trust who publishes
artifacts is who you think they are, keeps their keys 100% secure,
and can't be coerced.

Even after all that, we still won't be able to pull jars when
clojars.org is down unless we have some alternate source.

- Toby

Colin Fleming

Jan 2, 2016, 6:45:37 PM
to clo...@googlegroups.com
I'm travelling at the moment so I don't have time to respond to everything right now, but one thing about the Java 6 issue - IntelliJ won't be fully on Java 8 until IntelliJ 16. This means that Java 6 will be around until a) everyone is on whatever comes after El Capitan (the last OSX to support Apple's Java 6, which came out not long ago), or b) everyone is on IntelliJ 16, which has only just gone into beta. I support the last two major IntelliJ versions, so that'll be another two years or so. Of course, there may be a vanishingly small number of users still on Java 6 at that point but that's the timeline. It's anyone's guess when a majority of OSX users will be on JDK 8 - at some point I'll just have to say that you need to upgrade IntelliJ if you want to use Leiningen on OSX, but that won't be for a while yet - at least a year I guess.

Glen Mailer

Jan 2, 2016, 6:46:22 PM
to Clojure
This seems like it could be a fruitful avenue to me (Cloudflare or another CDN).

I know the folks at npm use Fastly in a similar fashion, gaining both geographical distribution and improved resiliency.

Mikhail Kryshen

Jan 2, 2016, 8:48:24 PM
to clo...@googlegroups.com
I would suggest also considering decentralized technologies.
IPFS (https://ipfs.io/) looks like a good fit for the task.

- It is distributed: every node used to access the repository will
contribute to its availability.

- Directory trees in IPFS work like Clojure's persistent data
structures: they are immutable and share identical substructures.

- IPFS is content-addressable: to access the repository one will only
need to know the hash of the current version of the root directory.
The content cannot be changed without also changing the hash.

- The current hash can be published using IPNS
(https://github.com/ipfs/examples/tree/master/examples/ipns) or using
a special DNS TXT record ("dnslink=/ipfs/<hash>") on clojars.org
domain. Then the current version of the repository (at least the
files other nodes have copies of) will be accessible regardless of the
availability of the main server:
- via public IPFS gateway: https://ipfs.io/ipns/clojars.org/
- or local gateway: http://localhost:8080/ipns/clojars.org/
- or local fuse mount at /ipns/clojars.org/
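The content-addressing and structural-sharing properties above can be illustrated with a toy Merkle tree in Python. This is a simplified sketch using SHA-256 over a made-up serialization, not IPFS's actual object format, so its hashes are not IPFS hashes; `node_hash` and `with_file` are hypothetical names.

```python
import hashlib
from typing import Dict, Union

# A directory is a dict of name -> file contents (bytes) or subdirectory.
Tree = Dict[str, Union[bytes, "Tree"]]


def node_hash(node: Union[bytes, Tree]) -> str:
    """Content address of a file or directory (toy scheme, not IPFS's)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"file\0" + node).hexdigest()
    h = hashlib.sha256(b"dir\0")
    for name in sorted(node):  # sorted so the hash ignores insertion order
        h.update(name.encode() + b"\0" + node_hash(node[name]).encode() + b"\n")
    return h.hexdigest()


def with_file(tree: Tree, path: str, data: bytes) -> Tree:
    """Return a new tree with one file replaced; untouched subtrees are shared."""
    head, _, rest = path.partition("/")
    new = dict(tree)  # shallow copy: siblings are reused, not copied
    new[head] = data if not rest else with_file(tree.get(head, {}), rest, data)
    return new
```

Changing one jar changes the hashes on the path from that file up to the root, while sibling subtrees are reused unchanged, which is why publishing only the root hash is enough to pin an entire repository version.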

--
Mikhail

Toby Crawley

Jan 3, 2016, 2:01:27 PM
to clo...@googlegroups.com
Cloudflare (or a similar CDN) would be useful - we have an open issue
to implement that, but haven't had a chance to get to it:
https://github.com/clojars/clojars-web/issues/434

- Toby

Nando Breiter

Jan 3, 2016, 4:14:34 PM
to clo...@googlegroups.com
I've spent some time looking into both Cloudflare and Fastly over the weekend. Fastly seems to have a sophisticated purging mechanism which the ticket mentions would be a requirement. See https://docs.fastly.com/guides/purging/

Initial setup is dead easy (for both), basically requiring a signup and a change to the DNS record, adding a CNAME. Fastly charges for bandwidth and caches everything. Cloudflare charges monthly flat rates but only caches the most popular assets, unless the subscriber pays $200 a month. In a nutshell, you have full control over the content cached in the CDN with Fastly and full control of the price paid, but not the service rendered, with Cloudflare.




Toby Crawley

Jan 4, 2016, 2:00:28 PM
to clo...@googlegroups.com
On Sat, Jan 2, 2016 at 8:47 PM, Mikhail Kryshen <mik...@kryshen.net> wrote:
> I would suggest also considering decentralized technologies.
> IPFS (https://ipfs.io/) looks like a good fit for the task.

IPFS looks interesting, but I'm not sure it's worth moving to an
experimental solution, especially when there are simpler,
battle-tested solutions (block stores, CDNs) we're not yet taking
advantage of.

- Toby

Toby Crawley

Jan 4, 2016, 2:03:53 PM
to clo...@googlegroups.com
Nando:

Thanks for looking into this. I've added your comments to the issue.

- Toby

Lucas Bradstreet

Jan 4, 2016, 3:31:30 PM
to clo...@googlegroups.com
Good info. Now that we've performed the initial clojars drive, which was performed at a very fortuitous time, do you think that the problem is primarily one of money, manpower, or both? I realise that there's a lot of kI'm happy to help in I'm one of
Rikkkkeeee way, because I think we definitely want to avoid some of the past issues in Node JS - which I think they have mostly solved now

Lucas

Toby Crawley

Jan 4, 2016, 10:52:50 PM
to clo...@googlegroups.com
On Mon, Jan 4, 2016 at 3:31 PM, Lucas Bradstreet
<lucasbr...@gmail.com> wrote:
> Good info. Now that we've performed the initial clojars drive, which was
> performed at a very fortuitous time, do you think that the problem is
> primarily one of money, man poweror, or both? I realise that there's a lot
> of kI'm happy to help in I'm one of
> Rikkkkeeee way, because I think we definitely want to avoid some of the past
> issues in Node JS - which I think they have mostly solved now

I don't quite follow all of that, but I think I get the gist :)

Seriously though, what issues did the Node JS community have? I
haven't been involved there at all, so haven't paid attention.

The donations have been great, and I appreciate every bit of it. But
what we primarily need right now is time from others. For the past
nine months, I've been the only administrator, but today Daniel
Compton graciously agreed to help out with that[1], so I think we are
good there. I also need help with some of the bigger issues (moving
the repo to block storage[2], possibly behind a CDN[3], and
implementing atomic deploys[4]), which I plan to post bounties[5] for
(using some of the donations) in the next few days.

Beyond that, we have quite a few other smaller issues that are ready
for work (marked with the "ready" tag[6], along with a subjective
rough estimate of effort involved ("small", "medium", "large")), if
people are looking for other ways to contribute. And, if you are
wanting to be more involved in and up to date with what is happening
with Clojars, I urge you to join the clojars-maintainers list[7].

- Toby

[1]: https://groups.google.com/d/msg/clojars-maintainers/75VmB2F0VX4/hL6dQZAKCQAJ
[2]: https://github.com/clojars/clojars-web/issues/433
[3]: https://github.com/clojars/clojars-web/issues/434
[4]: https://github.com/clojars/clojars-web/issues/226
[5]: https://www.bountysource.com/teams/clojars
[6]: https://github.com/clojars/clojars-web/labels/ready
[7]: https://groups.google.com/forum/#!forum/clojars-maintainers

Lucas Bradstreet

Jan 6, 2016, 3:11:55 AM
to clo...@googlegroups.com
Ouch. I'm not sure what happened to that email. I blame autocorrect.

There were some scaling problems with npm in the past and they ended
up taking funding. The list of issues you've provided looks good.
Perhaps some "newbie" tags in the issues would be good too. I will
join the maintainers list.

Thank you for your effort in providing this essential service.

Lucas