A Software Story - Securely Proxying 16,000+ Computers in 4,500+ Locations

Christopher Miller

Nov 3, 2013, 3:12:17 PM
to BerkTI...@googlegroups.com
On the encouragement of John, I've drafted this missive to report on my recent activity in the nonprofit enterprise sector, where I spent two years serving as a full-time missionary for my church. (This is not a proselytizing message, so if your hackles are raised by religion, don't worry - this is palatable.) The purpose of this essay is to foster discussion about using open-source and other technological tools to solve real-world problems facing real-world organizations.


# The problem domain:

I worked for the Genealogical Society of Utah, DBA FamilySearch. I was attached to their Patron Services Engineering team, which maintained internal software for the Patron Services Division and was tasked with keeping 16,000+ computers in 4,500+ Family History Centers within their Service Level Agreement (SLA). We also maintained the Premium Services Proxy. FamilySearch has deals with several other genealogical organizations, mainly for-profit companies, that let patrons use those organizations' premium services for free from any computer in one of our Family History Centers. Those third-party services know that if traffic originates from our proxy, they need to remove the paywall. This is a popular service which we have offered for about ten years, but the old proxy server was beginning to show its age. We devised a new system which was like a rocket ship in comparison. We call it Cuttlefish.

# The solution system:

I'll begin by saying that several more elegant solutions fell through because of technical or organizational challenges.

We ultimately decided upon a phone-home pattern. Every machine is remotely administered using IBM Tivoli (BigFix), an enterprise-class system also used for pushing software out to grids of Windows servers. It's a very complex, very interesting system. One of our engineers (not me) focused on this aspect of the solution because he also maintains the Sophos Antivirus policies, so he's all over those machines.

He installed a script that detects whether the Tivoli endpoint is active and properly configured and, if so, "phones home" to a web service I wrote in Ruby on Rails, called Cuttlefish. The script then installs a proxy auto-configuration (PAC) file, which configures the machine to send relevant traffic through our proxy server (a plain Squid server). Squid's configuration uses an external_acl_type declaration, which calls a small C program I wrote; that program uses libcurl to ask the Cuttlefish application, via a web call, whether a given request is authorized to proxy to a given endpoint. Cuttlefish remains the authoritative source of proxy authorization at all times.
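The Squid side of this arrangement can be sketched as follows. This is a Ruby illustration of an external ACL helper, not the C/libcurl program described above. The `AUTH_URL` endpoint, its `src`/`dst` parameters, and the assumption that Squid passes the client address and destination host on each line (for example via the `%SRC %DST` format codes) are all hypothetical.

```ruby
require "net/http"
require "uri"

# Hypothetical endpoint; the post does not give Cuttlefish's real URL or API.
AUTH_URL = "http://cuttlefish.example/authorize"

# Ask the authorization service whether `client_ip` may proxy to `dest`.
# Any failure (timeout, refused connection, bad response) counts as a denial.
def authorized?(client_ip, dest, http: Net::HTTP)
  uri = URI(AUTH_URL)
  uri.query = URI.encode_www_form(src: client_ip, dst: dest)
  http.get_response(uri).is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Helper loop: read one lookup per input line, write one verdict per line.
# `checker` is injectable so the loop can be exercised without a network.
def run_helper(input, output, checker: method(:authorized?))
  input.each_line do |line|
    client_ip, dest = line.split
    verdict = client_ip && dest && checker.call(client_ip, dest)
    output.puts(verdict ? "OK" : "ERR")
    output.flush # Squid waits on each answer, so never leave it buffered
  end
end

# When launched by Squid, the helper would end with:
#   run_helper($stdin, $stdout)
```

Squid keeps such a helper running and writes one lookup per line; the helper answers `OK` to allow and `ERR` to deny. Failing closed on errors is one way to keep the authorization service authoritative: if it cannot be reached, nothing is proxied.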

# Lessons learned:

## 1. Never underestimate your traffic

Just don't. We had no analytics on our old server because it was old. It was running SLES 2, which is really old. If the server were a person, it would be collecting social-security checks and would put its teeth in a glass jar at night. Given this, I had to guess how much traffic the Cuttlefish system was going to receive.

4 requests per second sounds like a lot to me! So that's what I used.

Let me tell you, this was a huge mistake.

Our original application stack was MRI 1.8.7 on Phusion Passenger. Passenger uses a process model instead of threading, and that truly sucks (more on that later). I walked in on the Monday we deployed the phone-home scripts to all 16,000+ machines, and both our servers were down. Munin graphs showed them pegged at about 50% CPU and 100% memory, while Apache was serving a cool 30-40 requests per second. Requests were coming in far faster than the servers could handle, so we were dropping more than half of them.

That's a problem.
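A back-of-envelope estimate relates fleet size to request rate. The Ruby sketch below is only that: the check-in intervals are hypothetical, since nothing above says how often each machine phones home, and it ignores the authorization lookups that Squid adds on top.

```ruby
# Back-of-envelope load estimate for a phone-home fleet.
# The intervals below are illustrative assumptions, not measured values.

MACHINES = 16_000

# Average requests per second if every machine checks in once per interval.
def requests_per_second(machines, interval_seconds)
  machines.to_f / interval_seconds
end

# Check-in interval (in seconds) the fleet must keep to stay under a rate.
def interval_for(machines, target_rps)
  machines.to_f / target_rps
end

[3600, 600, 60].each do |interval|
  rate = requests_per_second(MACHINES, interval)
  puts format("every %4d s -> %6.1f requests/s", interval, rate)
end

puts format("4 requests/s only holds if each machine waits %d s between calls",
            interval_for(MACHINES, 4))
```

Under these assumptions, a budget of 4 requests per second allows each machine roughly one call every 67 minutes. Anything chattier, or a burst right after deployment, lands in the tens or hundreds of requests per second.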

## 2. Multiprocess architecture doesn't scale

I know I'm going to come under some fire for this, but you folks need to go stress-test your software some more, because you're dead wrong. Multiprocess is a terrible idea. There are some tricks using forking and copy-on-write (COW) memory to alleviate this; however, the cost of creating a new thread will still be lower than the cost of creating a new process. Green threading, for more complex operations, will take longer than native threading. For example, this worker pattern:

    # Start one thread per work item, then wait for all of them to finish.
    %w(work_item_1 work_item_2 ... work_item_n).map { |work_item|
      Thread.new do
        send("do_#{work_item}!")
      end
    }.each(&:join)

will not work well in a green-threaded environment. If you never have a worker-pattern requirement, I'm happy for you, but some days you just can't get around it.

Also note that in a COW environment you'll want to use Process.fork instead of Thread.new. That gets you most of the effects of threading, but it takes a while longer, and you have to collect the PIDs and use a kernel call to wait for those processes to terminate instead of using in-process machinery to wait for the threads to finish. In layman's terms, processes are inelegant because they have to poke outside of the protected memory scope more often.
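To make the contrast concrete, here is a minimal Ruby sketch of the same fan-out done both ways. `WORK_ITEMS`, `do_work`, `run_with_threads`, and `run_with_processes` are hypothetical names, and `Process.fork` is not available on JRuby or on Windows, so the second half assumes MRI on a Unix-like system.

```ruby
# Hypothetical work items standing in for the do_<item>! methods.
WORK_ITEMS = %w[alpha beta gamma].freeze

def do_work(item)
  item.upcase
end

# Threads: results stay in-process and come back through Thread#value,
# which also waits for the thread to finish.
def run_with_threads(items)
  items.map { |item| Thread.new { do_work(item) } }.map(&:value)
end

# Processes: each child has its own memory, so the result has to cross
# a pipe, and the parent must collect every PID with Process.wait.
def run_with_processes(items)
  children = items.map do |item|
    reader, writer = IO.pipe
    pid = Process.fork do
      reader.close
      writer.write(do_work(item))
      writer.close
    end
    writer.close # the parent only reads
    [pid, reader]
  end

  children.map do |pid, reader|
    result = reader.read # returns once the child closes its end
    reader.close
    Process.wait(pid)    # reap the child so it does not linger as a zombie
    result
  end
end

puts run_with_threads(WORK_ITEMS).inspect
puts run_with_processes(WORK_ITEMS).inspect
```

Both calls produce the same results; the process version simply needs more plumbing (a pipe per child and an explicit wait) to get them back.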

Phusion Passenger was spinning up a new Rails process for each and every worker. If Passenger decided that it needed thirty workers, then I had thirty totally independent copies of Rails running at the same time. This created a situation where I had used all of my server's 6 GB of RAM but hadn't even come close to saturating all four CPU cores.

## 3. Never underestimate threading and/or the JVM

I had been working on an alternate application stack: a port of my Rails application to TorqueBox, a JRuby application stack which places the Rails application in a JBoss Application Server full of enterprisey acronyms.

At one point I told our system administrator to shut down all garbage mashers on the detention level, so he took down both application servers and we redeployed using TorqueBox. No sooner had we done that than we were pegged at 100% CPU and about 50% RAM.

Why?

JRuby and the JVM support threading, so the server spawns a new worker thread instead of a new worker process. The core Rails code wasn't duplicated, and the JVM's concurrent generational garbage collector made the whole system scream along much faster as well. So the memory use went down, but the CPU use went up because we were now processing about 80 requests per second. But that still wasn't enough.
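The thread-per-worker idea can be illustrated with a small Ruby sketch. This is not TorqueBox's actual implementation; `process_with_pool` and its parameters are hypothetical names for a fixed pool of threads draining a shared queue inside one process.

```ruby
require "thread" # Queue; already built in on modern Rubies

# Handle every request on a fixed pool of worker threads.
# All workers share the same loaded code and heap; only their stacks differ.
def process_with_pool(requests, pool_size: 4, &handler)
  queue = Queue.new
  requests.each { |request| queue << request }
  pool_size.times { queue << :stop } # one stop marker per worker

  results = Queue.new
  workers = Array.new(pool_size) do
    Thread.new do
      while (request = queue.pop) != :stop
        results << handler.call(request)
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop } # completion order, not input order
end

doubled = process_with_pool((1..10).to_a, pool_size: 3) { |n| n * 2 }
puts doubled.sort.inspect
```

Because the application code is loaded once and shared by every worker, adding workers costs far less memory than adding processes. On JRuby the threads can also run on separate cores at the same time.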

We batphoned our infrastructure team, and they added two more CPU cores to each virtual machine. We now had 6 CPUs and 6 GB of RAM. We hit 140 requests per second and stayed at that for a few minutes as the request queue cleared, then dropped down to about 120 requests per second. The CPU graph never rose above 70%.

If it weren't for the engineering behind the JVM and the much more performant architecture of JBoss, our application would have been doomed to another round of server provisioning at next month's datacenter resources meeting.

## 4. Developer operations is not negotiable

I focused on application development and that alone. I built stress tests and I wrote rigorous integration tests for every part of the application. I had two deployment options (MRI/Passenger and JRuby/TorqueBox).

Our heroic system admin handled writing the deployment scripts and configuring the servers for me.

As applications and deployments become increasingly complex, it pays to have people who can specialize in one aspect of your product. I couldn't have done it without Chris.

## 5. Always have analytics

Through every emergency we always had the information we needed to diagnose the problem.

• Munin graphs tracked server CPU count, CPU use, memory, and Apache/Squid throughput over time. From these we were able to infer a server's peak throughput.
• We used Splunk, a piece of proprietary software, for log analysis. You could just as easily use an open-source solution; we chose Splunk because our organization already had Splunk infrastructure set up, so it was as easy as installing a package and adding a few log file locations to a Splunk configuration. During debugging and diagnosis, the ability to search through our log files, and to do it on all servers simultaneously, was absolutely fantastic.
• If you're using Java or any language which runs on the JVM, get VisualVM onto your box. I used remote X11 forwarding due to some firewall issues with the JMX connector, but just use VisualVM. It's amazing software which gives you incredibly deep analytics into your application's performance, threading, CPU usage, and memory. It removes the question of "where is my bottleneck" and lets you solve the more pressing matter of "how do I resolve this to deliver the most value to users?"


I count myself very lucky to have worked on such a project, and I'd like to close with my belief that software, whether free or closed, is built to make people's lives easier. As a developer, things which help me do that more effectively are Good. As an open-source enthusiast, if those things are free or open-source, that's Better.

giovanni_re

Nov 4, 2013, 2:07:04 AM
to BTG, Christopher Miller
Great report, Chris!
:)


Superquick reply, cause im overloaded busy:
1) I LOLd twice! Extra credit, no reply necessary: guess where. ;)

2) i hope to reply to your content, if i can make time. :( ;) :)

3) Please ASAP forward an
_exact_ copy of your statements about your current activities &
educational plans,
from our offlist discussion -
I have a comment cued for you on that, which is timely. :)



PS:

On Sun, Nov 3, 2013, at 12:12 PM, Christopher Miller wrote:
> On the encouragement of John,

My friends call me Giovanni. ;)




--- Join the BerkeleyTIP-Global mail list - http://groups.google.com/group/BerkTIPGlobal. All Freedom SW, HW & Culture.

giovanni_re

Nov 6, 2013, 1:53:16 PM
to BTG, Christopher Miller
On Sun, Nov 3, 2013, at 11:07 PM, giovanni_re wrote:
> Great report, Chris!
> :)


> 3) Please ASAP forward an
> _exact_ copy of your statements about your current activities &
> educational plans,
> from our offlist discussion -
> I have a comment cued for you on that, which is timely. :)


Time
waits.
for.
no.
man.

Rick Moen

Nov 6, 2013, 2:06:02 PM
to BTG
Quoting Giovanni Re (joh...@fastmail.us):

> Time
> waits.
> for.
> no.
> man.


Scoutmaster: Time flies.
Smart Tenderfoot: You can't. They go too fast.

-- a 1930 issue of _Boys' Life_



Time flies like an arrow.
Fruit flies like a banana.

-- Bill Banze, Sept. 8, 1982 post to net.jokes
(No, it is _not_ a Groucho Marx quotation.)



Time Flies Like an Arrow
An Ode to Oettinger

Now, thin fruit flies like thunderstorms
And thin farm boys like farm girls narrow;
And tax firm men like fat tax forms -
But time flies like an arrow.
When tax forms tax all firm men's souls,
While farm girls slim their boyfriends' flanks;
That's when the murd'rous thunder rolls -
And thins the fruit flies ranks.
Like tossed bananas in the skies,
The thin fruit flies like common yarrow;
Then's the time to time the time flies -
Like the time flies like an arrow.

-- Edison B. Schroeder
_Scientific American_, Nov. 1966, p. 12, correspondence column
(commenting on Anthony Oettinger's project to devise a computer
program to manipulate the English language, covered in Oettinger's
Sept. 1966 article, 'The Uses of Computers in Science')


giovanni_re

Nov 6, 2013, 2:21:26 PM
to BTG, Rick Moen
On Wed, Nov 6, 2013, at 11:06 AM, Rick Moen wrote:
> Quoting Giovanni Re (joh...@fastmail.us):

Very quick response there, Rick! :)

I presume you had all of these in your memory, yes?

Ie, you didn't merely find them right now from doing a search, yes?
> --
> --
> http://groups.google.com/group/BerkTIPGlobal
> ---
> You received this message because you are subscribed to the Google Groups
> "Berkeley-TIP-Global" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to BerkTIPGloba...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

giovanni_re

Nov 6, 2013, 2:33:52 PM
to BTG, Rick Moen
On Wed, Nov 6, 2013, at 11:06 AM, Rick Moen wrote:
> Quoting Giovanni Re (joh...@fastmail.us):
>
> > Time
> > waits.
> > for.
> > no.
> > man.
>
>
> Scoutmaster: Time flies.
> Smart Tenderfoot: You can't. They go too fast.
>
> -- a 1930 issue of _Boys' Life_

http://en.wikipedia.org/wiki/Speed_of_light#First_measurement_attempts



PS: I've had a Boys' Life comic on my mind for a week or two to send to
btip,
but im to bz to send it.





giovanni_re

Nov 6, 2013, 2:58:25 PM
to BTG, Christopher Miller
On Sun, Nov 3, 2013, at 12:12 PM, Christopher Miller wrote:
> On the encouragement of John, I've drafted this missive to report on my
> recent activity in the nonprofit enterprise sector, where I recently
> spent two years serving as a full-time missionary for my church. (This is
> not a proselytizing message, so if your hackles are raised by religion,
> don't worry - this is palatable).


Actually, under the rubric of Free Culture,
it would be fine to have a best effort shot at a proselytizing message.

Prolly would help to be as scientifically based as possible,
but that's just a thought.

You invested 2 years of your life in it-
I think if you valued it that much
you'd care to share the ideas of the main benefits of that organization,
to benefit btip's members.

;) :)


We'll forward a url to the discussion to Linus when the discussion
finishes.

;)


Oh - & to Richard Stallman / Saint IGNUcious-
probably the Latest Day Saint.

;) :)








Rick Moen

Nov 6, 2013, 3:01:50 PM
to BTG
Quoting Giovanni Re (joh...@fastmail.us):

> On Wed, Nov 6, 2013, at 11:06 AM, Rick Moen wrote:
> > Quoting Giovanni Re (joh...@fastmail.us):
>
> Very quick response there, Rick! :)
>
> I presume you had all of these in your memory, yes?
>
> Ie, you didn't merely find them right now from doing a search, yes?

I wrote it all myself.

You don't really believe there's such a magazine as _Scientific American_,
do you? I mean, really. It's an obvious put-on.

