HAPI FHIR Write Performance Benchmarks

karl....@cms.hhs.gov

Oct 18, 2016, 2:01:34 PM
to HAPI FHIR
Hello all,

I've been talking about it for a long while, and just last sprint I finally got a chance to pull together some actual benchmark numbers for HAPI's write performance in our project. The upshot: I was able to get performance up from an initial low of 7 FHIR Bundle transactions per second to 82 TXs per second. That's sadly still well over an order of magnitude below where I need to get it, but I've got a long list of things to try yet.

Here are the things I've done to increase performance so far:
  • Ensure that more than one TX is running in parallel. This was easily the biggest win so far, moving us from 7 TXs per second to 26 (almost a 4x improvement). A sketch of this parallel-submission pattern follows this list.
    • Our data architecture doesn't yet handle this perfectly: we end up with multiple threads trying to upsert the same Practitioner records, causing about 25% of our TXs to fail. We'll be fixing that in early November or so after we move to STU3.
  • Move from m4.large EC2 instances to c4.8xlarge instances. This moved us from 26 TXs per second to 46 (almost a 2x improvement).
    • Note that you can get pretty much the same performance on c4.2xlarge instances: performance doesn't keep scaling upwards with more vCPUs as I'd expect it to. This is a big red flag that I plan to track down in a profiler. HAPI's performance should be mostly linear with respect to available CPUs (assuming it's being fed enough parallel TXs).
  • Disabling HAPI's Hibernate Search/Lucene indexing moved us from 46 TXs per second to 82 (a config sketch for this also follows the list).
    • I'm not positive I did this correctly yet (see my other recent thread on that subject).
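For anyone who wants the shape of that first change: below is a minimal sketch of submitting Bundle transactions from a worker pool with HAPI's generic client (DSTU2-era API). The pool size, server URL, and the buildTransactionBundles() helper are hypothetical stand-ins for the ETL's specifics, not code from the project.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.model.dstu2.resource.Bundle;
import ca.uhn.fhir.rest.client.IGenericClient;

public class ParallelTransactionLoader {

    public static void main(String[] args) throws InterruptedException {
        // FhirContext is expensive to create and thread-safe; share one per JVM.
        FhirContext ctx = FhirContext.forDstu2();

        // Pool size is the main knob; going from serial to parallel submission
        // was the single biggest win in the numbers above (7 -> 26 TX/s).
        ExecutorService pool = Executors.newFixedThreadPool(16);

        for (Bundle tx : buildTransactionBundles()) {
            pool.submit(() -> {
                // Clients are cheap to create once the context exists.
                IGenericClient client =
                        ctx.newRestfulGenericClient("http://fhir.example.org/baseDstu2");
                client.transaction().withBundle(tx).execute();
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Hypothetical stand-in for the ETL's "records come in, get converted" step.
    private static List<Bundle> buildTransactionBundles() {
        throw new UnsupportedOperationException("project-specific");
    }
}
```

Note that the race described in the first sub-bullet (multiple workers upserting the same Practitioner) is exactly what a pool like this will surface, so conditional-update conflicts need retry handling or per-resource serialization.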
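And for the indexing bullet, since Karl notes he isn't positive he disabled it correctly: the knob usually pointed at in HAPI JPA servers of that era is Hibernate Search's listener auto-registration. This is a sketch under that assumption; whether it's the right switch for a given HAPI version is exactly the open question in his other thread.

```java
import java.util.Properties;

public class NoFulltextIndexingJpaProperties {

    /**
     * JPA/Hibernate properties for a HAPI FHIR JPA server, with Hibernate
     * Search's automatic indexing listeners disabled so writes skip Lucene
     * entirely. hibernate.search.autoregister_listeners is a standard
     * Hibernate Search 5 property; how these properties get wired into the
     * server's EntityManagerFactory is project-specific.
     */
    public static Properties jpaProperties() {
        Properties props = new Properties();
        props.put("hibernate.search.autoregister_listeners", "false");
        return props;
    }
}
```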
So. That's where I'm at right now. All of my raw data and analysis are available in a Google sheet here: benchmark-data-analysis.ods. Here's some background on our use case and setup:
  • Our goal here is to complete a giant transform and load of about 2TB of data, which works out to at least 4 billion FHIR Bundle TXs. Mostly Patient and ExplanationOfBenefit resources.
    • Right now, my timing data covers just 3672 TXs. I wanted to start small to keep the benchmark run times workable. Over the next month, we'll be dialing that up to see what happens to performance as we go. Obviously, we can expect it to degrade; it's just a question of how much.
    • If we need to, we'll shard our data across multiple DBs and HAPI FHIR instances. That significantly increases AWS costs, though.
  • Ultimately, the resulting FHIR database will help power the "Blue Button API" that millions of US Medicare beneficiaries use to access their claims data.
  • We built a bespoke Java ETL application to do the transform and load. It's nothing too exciting: records come in, get converted, and get pushed to a FHIR server.
  • We're currently using the same random 1.4-SNAPSHOT of HAPI that we started with at the beginning of 2016. That should be changing in the next week or two, as we're working right now on moving up to HAPI 2.0.
  • The benchmarks yield pretty noisy data, so I quickly decided I'd need multiple parallel runs to get enough data for a tight confidence interval (the interval arithmetic is sketched just after this list). We use Ansible to spin up multiple AWS environments, each with an ETL application server, a HAPI FHIR server, and a Postgres DB server (via RDS). Once that's all ready to go, we run the ETL, wait for it to finish, scrape the logs for the performance data we need, and then tear everything down. This lets us run about 10 samples at a time, getting us the 30 or so samples that we need in a bit under 90 minutes.
  • The most intriguing thing I'm seeing with all of this data collection so far is that there's no obvious bottleneck: none of the three systems' CPUs are getting even close to saturated, nor their disks, memory, or network. I'm very curious to get things going in a profiler to see WTF is going on with that.
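As a concrete example of the interval arithmetic behind that ~30-sample target, here's a small self-contained sketch; the sample values are made up for illustration, not Karl's data.

```java
import java.util.Arrays;

public class ThroughputConfidenceInterval {

    public static void main(String[] args) {
        // Hypothetical per-run throughput samples (TX/s); not real data.
        double[] txPerSec = {79.2, 84.1, 80.5, 86.3, 81.7, 83.0, 78.8, 85.2};
        int n = txPerSec.length;

        double mean = Arrays.stream(txPerSec).average().orElse(Double.NaN);
        double sampleVariance = Arrays.stream(txPerSec)
                .map(x -> (x - mean) * (x - mean))
                .sum() / (n - 1);
        double stdErr = Math.sqrt(sampleVariance / n);

        // Two-sided 95% t critical value: ~2.05 at n=30 (one reason ~30
        // samples is a common target), ~2.36 at the toy n=8 used here.
        double halfWidth = 2.36 * stdErr;

        System.out.printf("mean = %.1f TX/s, 95%% CI ~ [%.1f, %.1f]%n",
                mean, mean - halfWidth, mean + halfWidth);
    }
}
```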
We need to get performance up to at least 2000 TXs per second. That's a tall order, but I've found plenty of PostgreSQL benchmarks indicating it's at least theoretically possible: folks were seeing almost 2500 TXs per second on PostgreSQL 9.2, on not-great hardware. I've got lots more ideas on how to get HAPI FHIR there, and I'll keep working my way through them until we do. Based on some conversations I've seen on this list, I'm hopeful that just catching up to the latest release will get me another 2x or 3x, as I think the server or client was doing some weird extra validation in my old pre-1.4 version. And obviously, sharding the data lets us crank up the TXs per second basically as much as we want, as long as we're willing to pay through the nose for the EC2 time (a hypothetical routing sketch follows below). I'd like to keep costs down, though, so I'm hoping to get performance up significantly before we go there.
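To make the sharding idea concrete, a hypothetical router like the one below is all the ETL side would need: a deterministic patient-to-shard mapping, so every resource for a given beneficiary lands on (and is later read from) the same HAPI FHIR instance and database. The names here are invented for illustration.

```java
import java.util.List;

public class ShardRouter {

    private final List<String> serverBaseUrls; // one HAPI FHIR server per shard

    public ShardRouter(List<String> serverBaseUrls) {
        this.serverBaseUrls = serverBaseUrls;
    }

    /**
     * Deterministically maps a beneficiary/patient ID to a shard, so all of
     * that patient's Patient and ExplanationOfBenefit resources stay together.
     */
    public String baseUrlFor(String patientId) {
        int shard = Math.floorMod(patientId.hashCode(), serverBaseUrls.size());
        return serverBaseUrls.get(shard);
    }
}
```

Keeping each patient's data on one shard also sidesteps cross-shard transactions, which is what lets throughput scale roughly with the number of shards, at the EC2 cost mentioned above.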

This thread is mostly an info-dump that I hope folks in the group find useful, but I'd love to hear about any performance data or suggestions that other folks have!

Best regards,
Karl M. Davis
HHS Associate Entrepreneur-in-Residence, CMS/OEDA

James Agnew

Oct 18, 2016, 4:33:23 PM
to karl....@cms.hhs.gov, HAPI FHIR
Hi Karl,

Just wanted to say a huge thanks to you for posting this, on behalf of the entire HAPI community.

Benchmarks like this are so tedious to write, and so very useful for everyone to read. :)

Cheers,
James


Kevin Mayfield

Oct 20, 2016, 4:08:45 AM
to James Agnew, karl....@cms.hhs.gov, HAPI FHIR
Thanks. Very useful information; I'll be using it.

Did you use any open source tools to test? I've been recommended gatling.io to test read access (DocumentReference and Patient).

karl....@cms.hhs.gov

Oct 20, 2016, 8:09:49 AM
to HAPI FHIR, james...@gmail.com, karl....@cms.hhs.gov
Kevin,

I didn't use any open source performance testing tools for the write performance benchmarking, because I wanted to benchmark the performance of the whole system of systems, including our bespoke ETL application (though it turns out that's nowhere close to being the bottleneck).

Earlier this year, I did start setting up some read performance benchmarks. For those, I used JMeter, because I'd had to use it in the past and was already somewhat familiar with it. It's not my favorite tool, though, and I've heard good things about Gatling. If you do end up using it, I'd strongly encourage you to start your test design here: Gatling Docs: Scaling Out. It's very important to run any read tests across multiple client machines, with the server hosted on a separate system; otherwise it's pretty easy to just hit the performance ceiling of whatever single host you're running on. That's what I did with my JMeter tests, and I found that Ansible made that kind of orchestration quite easy. (I never did really finish those read benchmarks, but they're nonetheless on GitHub here: https://github.com/HHSIDEAlab/fhir-stress-test. The README is wrong, by the way; it was copy-pasted from another project.) A bare-bones sketch of that multi-client idea follows.
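This is neither Gatling nor JMeter syntax, but it shows the shape of the advice above in plain Java with HAPI's own client: a concurrent read loop, run as one copy per client machine so no single host becomes the ceiling. The server URL, thread count, and resource ID are all placeholders.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.IGenericClient;

public class ReadStressClient {

    public static void main(String[] args) throws InterruptedException {
        FhirContext ctx = FhirContext.forDstu2();
        // The server must live on a separate host, and copies of this class
        // should run from several client machines at once.
        String serverBase = "http://fhir.example.org/baseDstu2"; // placeholder
        int threads = 32;
        long endAtMillis = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(5);

        AtomicLong reads = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                IGenericClient client = ctx.newRestfulGenericClient(serverBase);
                while (System.currentTimeMillis() < endAtMillis) {
                    // Placeholder ID; a real test should randomize IDs so it
                    // isn't just measuring the server's cache.
                    client.read().resource("Patient").withId("example").execute();
                    reads.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(6, TimeUnit.MINUTES);
        System.out.println("Total reads in ~5 minutes: " + reads.get());
    }
}
```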

Best regards,
Karl
