Evaluating Elixir & Phoenix for a web-scale, performance-critical application


Matt Hornsby

unread,
Jun 10, 2016, 1:20:56 AM
to elixir-lang-talk
Hi all - I could use some help. I am currently evaluating Elixir and Phoenix for a performance-critical application for a Fortune 500 company. This could be another great case study for Elixir and Phoenix if I can show that it can meet our needs. Initial performance testing looked phenomenal, but I am running into some performance concerns that will force me to abandon this tech stack entirely if I cannot make the case.

The setup: an out-of-the-box Phoenix app generated with mix phoenix.new. No Ecto. Returning a static JSON response. Basically a hello-world app.
The hardware:
  • MacBook Pro, 16GB RAM, 8 cores, 2.5GHz, running Elixir/Phoenix natively, and also in a Docker container
  • Amazon EC2 t2.medium running an Elixir Docker image
The tests: used ab, wrk, siege, artillery, and curl with a variety of configurations. Up to 100 concurrent connections. Not super scientific, I know... but

No matter what I try, Phoenix logs impressive numbers to stdout - generally on the order of 150-300 microseconds. However, none of the load testing tooling agrees. No matter the hardware or load test configuration, I see around 20-40 ms response times. The goal for the services I am designing is 20ms and several thousand requests per second. The load tests that Chris McCord and others have published suggest that I should be able to expect 3ms or less when running on localhost, but I'm not seeing anything close to that.

Would anyone be willing to work with me to look at some options here? I'd be incredibly grateful. Don't make me go back to Java, please :)

Louis Pilfold

unread,
Jun 10, 2016, 2:13:45 AM
to elixir-lang-talk

Hello!

Perhaps a silly question, but are you running the application in the production environment when performance testing it?
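For reference, a minimal way to do that with a Phoenix 1.x app (the port here is an arbitrary example):

```shell
# Compile and boot in the prod environment instead of the default dev.
MIX_ENV=prod mix compile
MIX_ENV=prod PORT=4001 mix phoenix.server
```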

Cheers,
Louis

--
You received this message because you are subscribed to the Google Groups "elixir-lang-talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-ta...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-talk/6a625e85-8c8d-43c7-9c1b-a204db09307a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Doug Goldie

unread,
Jun 10, 2016, 2:18:11 AM
to elixir-lang-talk, lo...@lpil.uk
I was thinking the same thing.

So I took a hello_phoenix app I have lying around and compiled it for production.

Before, the API call took 26ms. After:

23:10:19.497 request_id=me5p9o26becunto6chjloa7oghiq7ek5 [info] GET /api/test/

23:10:19.497 request_id=me5p9o26becunto6chjloa7oghiq7ek5 [info] Sent 200 in 46µs

23:10:19.497 request_id=c6qs9s938de0eku6jb8hcpvak1cal9fn [info] GET /api/test/

23:10:19.497 request_id=c6qs9s938de0eku6jb8hcpvak1cal9fn [info] Sent 200 in 63µs

23:10:19.497 request_id=avqgtd3tgo6q9bs6prm3jmud8cojic6o [info] GET /api/test/

23:10:19.497 request_id=n67rqmpkc5kfjgb28ocqhs0e4odfrj7b [info] GET /api/test/

23:10:19.497 request_id=avqgtd3tgo6q9bs6prm3jmud8cojic6o [info] Sent 200 in 48µs

23:10:19.497 request_id=n67rqmpkc5kfjgb28ocqhs0e4odfrj7b [info] Sent 200 in 74µs



and... that's on my Mac - I didn't shut anything else down :)


Good Luck,


-doug.

Sairam Kunala

unread,
Jun 10, 2016, 2:20:03 AM
to elixir-l...@googlegroups.com
I am currently learning Elixir. I have read a handful of posts on this, so what I am saying may not be completely accurate, or may be out of date (since I only read about these in various blogs).

Am I right in assuming you are building a front-end web server which serves JSON? If not, RPC would be a faster mechanism to use than exposing HTTP endpoints.
Also, a minor note: JSON is slower than other encodings like MessagePack, if switching is an option. (Link to Uber Blog where the analysis was made)

You need to set MIX_ENV=prod when compiling Elixir: http://www.slideshare.net/petegamache/real-world-elixir-deployment#23
Note: a t2.medium is 2 vCPUs with 4GB of RAM.

Maybe you don't even need all the plugs the Phoenix library provides by default (cookies etc.), especially given you don't use Ecto.

Also, horizontal scaling is another aspect you might want to consider, to show scale-up and scale-down.
Also, note the limit Linux puts on the number of open file descriptors by default. Raise that limit as high as you can.
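A sketch of that tweak, assuming a Linux box (65536 is an arbitrary example value, not a recommendation):

```shell
# Raise the per-process open-file limit before starting the BEAM;
# every TCP connection consumes one file descriptor.
ulimit -n 65536
```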

I have not seen/read many blog posts where Docker is used for Elixir. Maybe you could try it on a bigger box on Heroku if possible, to get an end-to-end benchmark (which also takes care of network connectivity).


Matt Hornsby

unread,
Jun 10, 2016, 2:29:10 AM
to elixir-lang-talk, lo...@lpil.uk
Hi Louis - not a silly question at all. In fact, I hadn't compiled it for production. I just did and ran it again; I *think* the results are better, but I'm not convinced yet. The confusing thing is that Phoenix says things like:

23:21:25.967 request_id=lhoodfssn5bvpjhlldn1c1r2i74fe38o [info] Sent 200 in 80µs

80 microseconds is pretty great!

But when I hit it with wrk, with 100 open connections from the Mac, I get:

  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.16ms   32.83ms 365.59ms   94.76%
    Req/Sec   494.23    139.45   770.00     76.14%
  116284 requests in 30.06s, 71.44MB read
Requests/sec:   3868.18
Transfer/sec:      2.38MB

But... if I scale it down to only 15 connections, it seems much better:

  8 threads and 15 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.22ms    6.57ms 125.08ms   92.91%
    Req/Sec     1.16k   290.49     1.95k    72.12%
  92484 requests in 10.02s, 56.82MB read
Requests/sec:   9229.39
Transfer/sec:      5.67MB

Still nowhere near 80 microseconds, but still well under the 20ms I am looking for. I would expect more than 100 open connections at a time in production, though; perhaps the limits are down to the configuration of this laptop. Thank you for the suggestion!
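For what it's worth, those two runs are roughly self-consistent under Little's law: with N open connections each waiting an average of W seconds, throughput tops out around N / W requests per second. A quick sanity check with the numbers above:

```shell
# Little's law: throughput ≈ concurrency / average latency.
# 100 connections at ~29.16 ms average latency each:
awk 'BEGIN { printf "%d req/s\n", 100 / 0.02916 }'
```

That lands in the same ballpark as wrk's measured 3868 req/s, so the 29ms average latency and the throughput figure describe one bottleneck, not two separate problems.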

On Thursday, June 9, 2016 at 11:13:45 PM UTC-7, Louis Pilfold wrote:

Matt Hornsby

unread,
Jun 10, 2016, 2:32:07 AM
to elixir-lang-talk, lo...@lpil.uk
Thanks Doug - what do your numbers look like when you hit it with something like ab or wrk for the load testing? Phoenix outputs some amazing numbers to the console, like you listed above, but when I actually hit the API with one of the aforementioned tools, or simply through Chrome, it still comes back in 8-40 milliseconds.

Appreciate your quick reply, and your help.

Matt Hornsby

unread,
Jun 10, 2016, 2:40:52 AM
to elixir-lang-talk
Thanks for the reply Sairam - these are excellent suggestions. I will give them a try. I am in fact building a front-end web server serving JSON. These are internal APIs for my company, so perhaps HTTP/JSON is not the most efficient approach possible. However, I did want to show what Phoenix & Elixir are capable of next to the well-established Java stack that my company currently wants to continue to invest in.

I am currently not really encoding anything (I don't think). My view has a render function that looks like this:

  def render("product.json", %{product: product}) do
    %{id: product.id,
      name: product.name,
      barcode: product.barcode,
      image: product.image,
      price: product.price,
      market_id: product.market_id}
  end

Thanks for the great information - I'll see if I can change the file descriptors and disable some of the plugs to see if that changes much. My goal was to see what Phoenix & Elixir looked like out of the box without a lot of tweaking, but perhaps I just made some really naive newbie mistakes. As far as Docker goes, I mainly did that so it would be really easy to get running on EC2. Currently, I can't seem to get yum to find the elixir package on the default Amazon Linux AMIs, so it's been kind of a pain to get Elixir set up properly.

Doug Goldie

unread,
Jun 10, 2016, 2:48:34 AM
to elixir-lang-talk, lo...@lpil.uk
That was ab.
The output I posted earlier was from the logger.

I ran it again and created a gist of the bench output for:

ab -n 1000 -c 100  http://127.0.0.1:4001/api/test/



I forget how to read this....

Sairam Kunala

unread,
Jun 10, 2016, 2:50:57 AM
to elixir-l...@googlegroups.com

José Valim

unread,
Jun 10, 2016, 3:08:14 AM
to elixir-l...@googlegroups.com
Can you put the project you are benchmarking online? Can you also include:

* how you installed Erlang and which version
* how you installed Elixir and which version
* the exact command you are using to run the app. For example, running in production environment is extremely important
* the exact command you are using to benchmark the application, and from which machine. For example, if you are benchmarking from your machine, the difference you see between client and server numbers will include network latency, which cannot be accounted for by the server




--


José Valim
Skype: jv.ptec
Founder and Director of R&D

Ed W

unread,
Jun 10, 2016, 6:36:00 AM
to elixir-l...@googlegroups.com
On 10/06/2016 06:20, Matt Hornsby wrote:

> No matter the hardware or load test configuration, I see around 20-40
> ms response times. The goal for the services that I am designing is
> 20ms and several thousand requests per second. The load tests that
> Chris McCord and others have published suggest that I should be able
> to expect 3ms or less when running localhost, but i'm not seeing
> anything close to that.

This smells like a networking issue?

I would get a packet trace from the machine and check packet timing. I
wonder if you are seeing:

- Nagle, ie tcp_nodelay
- delayed ack (probably not because this is much larger)

A cursory GitHub search for the phrase "nodelay" (the Elixir symbol should be
:nodelay) doesn't turn up any references in the phoenix repo... I guess
the correct place to search would be cowboy. Searching the cowboy repo
shows two examples in the tests (and some suggestions in bug reports),
but no other mentions.

I wonder: if you hack your local Phoenix code to turn on tcp_nodelay,
does the extra few milliseconds of delay disappear? (See the search of
the cowboy repo for examples.)
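A sketch of that hack, assuming a Phoenix 1.x endpoint whose extra `:http` options get forwarded down to cowboy/ranch (the app and endpoint names are placeholders, and I have not verified which versions honor this):

```elixir
# config/prod.exs -- unverified sketch: ask ranch to set TCP_NODELAY
# on accepted sockets via the endpoint's HTTP transport options.
config :hello_phoenix, HelloPhoenix.Endpoint,
  http: [port: 4001, nodelay: true]
```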

Interesting... Perhaps Chris M could chime in?

Note, I found this helpful for quick-and-dirty testing (it showed me that
name resolution of "localhost" takes half a ms on OSX...):
http://stackoverflow.com/questions/18215389/how-do-i-measure-request-and-response-times-at-once-using-curl
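The recipe from that answer looks roughly like this (the file name is arbitrary):

```shell
# Write curl's -w template to a file; \n stays literal inside the quoted
# heredoc, which is what curl's --write-out format expects.
cat > curl-format.txt <<'EOF'
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
time_total: %{time_total}\n
EOF

# Time a single request; -o /dev/null -s suppresses the body and progress.
curl -w "@curl-format.txt" -o /dev/null -s http://127.0.0.1:4000/
```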

Ed

Ed W

unread,
Jun 10, 2016, 7:26:59 AM
to elixir-l...@googlegroups.com
On 10/06/2016 11:35, Ed W wrote:
> On 10/06/2016 06:20, Matt Hornsby wrote:
>
>> No matter the hardware or load test configuration, I see around 20-40
>> ms response times. The goal for the services that I am designing is
>> 20ms and several thousand requests per second. The load tests that
>> Chris McCord and others have published suggest that I should be able
>> to expect 3ms or less when running localhost, but i'm not seeing
>> anything close to that.
>
> This smells like a networking issue?
>
> I would get a packet trace from the machine and check packet timing.
> I wonder if you are seeing:

I have to agree, there is something a bit odd here.

I did a mix phoenix.new and created a basic project, nothing else. Then
I tested using curl over localhost on OSX 10.11.

Phoenix logs show:
[info] GET /
[debug] Processing by HelloPhoenix.PageController.index/2
Parameters: %{}
Pipelines: [:browser]
[info] Sent 200 in 179µs



tcpdump shows:

listening on lo0, link-type NULL (BSD loopback), capture size 262144 bytes
12:20:27.372566 IP localhost.60834 > localhost.terabase: Flags [S], seq
2825464538, win 65535, options [mss 16344,nop,wscale 5,nop,nop,TS val
507676387 ecr 0,sackOK,eol], length 0
12:20:27.372656 IP localhost.terabase > localhost.60834: Flags [S.], seq
2342322357, ack 2825464539, win 65535, options [mss 16344,nop,wscale
5,nop,nop,TS val 507676387 ecr 507676387,sackOK,eol], length 0
12:20:27.372668 IP localhost.60834 > localhost.terabase: Flags [.], ack
1, win 12759, options [nop,nop,TS val 507676387 ecr 507676387], length 0
12:20:27.372681 IP localhost.terabase > localhost.60834: Flags [.], ack
1, win 12759, options [nop,nop,TS val 507676387 ecr 507676387], length 0
12:20:27.372718 IP localhost.60834 > localhost.terabase: Flags [P.], seq
1:79, ack 1, win 12759, options [nop,nop,TS val 507676387 ecr
507676387], length 78
12:20:27.372748 IP localhost.terabase > localhost.60834: Flags [.], ack
79, win 12756, options [nop,nop,TS val 507676387 ecr 507676387], length 0
12:20:27.381452 IP localhost.terabase > localhost.60834: Flags [P.], seq
1:2276, ack 79, win 12756, options [nop,nop,TS val 507676395 ecr
507676387], length 2275
12:20:27.381472 IP localhost.60834 > localhost.terabase: Flags [.], ack
2276, win 12688, options [nop,nop,TS val 507676395 ecr 507676395], length 0
12:20:27.381593 IP localhost.60834 > localhost.terabase: Flags [F.], seq
79, ack 2276, win 12688, options [nop,nop,TS val 507676395 ecr
507676395], length 0
12:20:27.381616 IP localhost.terabase > localhost.60834: Flags [.], ack
80, win 12756, options [nop,nop,TS val 507676395 ecr 507676395], length 0
12:20:27.381623 IP localhost.60834 > localhost.terabase: Flags [.], ack
2276, win 12688, options [nop,nop,TS val 507676395 ecr 507676395], length 0
12:20:27.381638 IP localhost.terabase > localhost.60834: Flags [F.], seq
2276, ack 80, win 12756, options [nop,nop,TS val 507676395 ecr
507676395], length 0
12:20:27.381723 IP localhost.60834 > localhost.terabase: Flags [.], ack
2277, win 12688, options [nop,nop,TS val 507676395 ecr 507676395], length 0


And this is backed up by the curl stats:
time_namelookup: 0.000
time_connect: 0.000
time_appconnect: 0.000
time_pretransfer: 0.000
time_redirect: 0.000
time_starttransfer: 0.009
----------
time_total: 0.009


The delay does appear to be between receiving the GET and delivering the
response (about 9ms from the packet timestamps, backed up by the curl
timings).

Any thoughts? Perhaps the delay is somewhere in cowboy/ranch? Should we
move this to the phoenix mailing list?

Ed W

Ben Wilson

unread,
Jun 10, 2016, 10:27:42 AM
to elixir-lang-talk
Hey there!

I am confident we can improve your experience here. First, the usual suspects:

1) Are you running with MIX_ENV=prod? In development mode Phoenix runs extra code that improves the development experience but is of no use in production, and it should not be part of any benchmark
2) It seems clear that you've got logging output on every request. This is also problematic for benchmarks: printing to the console is not particularly fast, and it does not reflect production logging solutions either. Please make sure to set `config :logger, level: :warn` in your config.exs file when benchmarking
3) More generally, can you put your code up somewhere? It will be much easier to look into whatever issues you may be having if we can see the code.
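For item 2, the change is a one-liner, assuming the standard generated config layout:

```elixir
# config/config.exs -- suppress per-request :info lines during benchmarks.
config :logger, level: :warn
```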

Thanks!

- Ben

Ashley Penney

unread,
Jun 10, 2016, 10:40:34 AM
to elixir-l...@googlegroups.com
Just a quick note from someone that tried (and failed) to get acceptable performance numbers from Elixir.

You should check out wrk2 instead of wrk for your testing. The reason is that wrk has a "coordinated omission" problem, whereby the tool and the service accidentally conspire to produce nicer numbers!

wrk2 will send traffic at a steady rate instead of "when it can", which often results in drastically different numbers. I was seeing latency in the seconds at the 99th percentile with wrk2 and a quick test with just Plug.

When we tested our own Scala-based service internally, we went from seeing 100ms at the 100th percentile to 9 minutes with wrk2, so it was quite an eye-opener!
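A sketch of such a run -- note that the wrk2 binary is typically still named wrk, and the 2000 req/s target here is an arbitrary example:

```shell
# -R pins throughput at a constant 2000 req/s, so latency is measured
# against a steady arrival rate rather than "as fast as the tool can go".
wrk -t8 -c100 -d30s -R2000 --latency http://127.0.0.1:4001/api/test/
```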

Matt Hornsby

unread,
Jun 15, 2016, 9:54:32 PM
to elixir-lang-talk
Just wanted to say thanks to everyone who jumped onto this thread to provide advice. I am still working through some of the suggestions and will put the sample code up as soon as time permits. I have seen good benchmarking improvements from turning off stdout logging and from building an exrm release (though in some situations plain MIX_ENV=prod still outperforms the exrm release, which is curious). Thanks again for the quick replies. I will update as soon as I get some code up.

Scott Ribe

unread,
Jul 23, 2016, 9:16:20 AM
to elixir-l...@googlegroups.com
On Jun 10, 2016, at 12:29 AM, Matt Hornsby <matt.h...@gmail.com> wrote:
>
> But when i hit it with wrk, with 100 open connections from the mac, i get:

1) Phoenix cannot measure the time it takes a client to establish a connection, while of course any benchmark from outside will account for that. However, I would expect the discrepancy to be more like 2ms.

2) wrk can be slower than Elixir/Phoenix ;-) I suspect it has some poor code around its threading. When I benchmarked with it, I found that latencies were much more strongly correlated with the number of threads per wrk node than with the number of threads on the Phoenix node -- of course that requires more than 2 machines to test, and is a real pain. I don't know why people like wrk so much; I found it to be very weak. Anyway, I can assure you that with 100 threads you absolutely ARE measuring some delay in wrk itself.

3) When you use exrm, don't forget to set MIX_ENV=prod when building the release; otherwise you just bundle the dev build, even if you've previously compiled for prod.
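i.e. something like this (exrm-era commands):

```shell
# Compile and bundle the prod build into the release; without MIX_ENV=prod
# the release packages whatever was last compiled for dev.
MIX_ENV=prod mix compile
MIX_ENV=prod mix release
```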

--
Scott Ribe
scott...@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice





OvermindDL1

unread,
Jul 25, 2016, 9:59:15 AM
to elixir-lang-talk
And don't benchmark from Windows either; at least on Windows 10 we've measured the latency at ~200ms, and turning off Nagle, caching, everything does not affect it. Bench from a Linux VM on the same box and that 200ms vanishes: we get 2ms latency over the internal network again...

Scott Ribe

unread,
Jul 26, 2016, 8:19:03 AM
to elixir-l...@googlegroups.com
On Jul 25, 2016, at 7:59 AM, OvermindDL1 <overm...@gmail.com> wrote:
>
> And don't benchmark from Windows either; at least on Windows 10 we've measured the latency at ~200ms, and turning off Nagle, caching, everything does not affect it. Bench from a Linux VM on the same box and that 200ms vanishes: we get 2ms latency over the internal network again...

Interesting... The last time I wrote networking code on Windows was ~20 years ago, for NT, and the networking performance was really *weird* even then.

OvermindDL1

unread,
Jul 26, 2016, 9:54:46 AM
to elixir-lang-talk
As far as I can figure, it is because the Windows 10 TCP stack holds onto transmitted data (possibly including the SYNs/ACKs - I'm not sure, I haven't looked in that much detail yet) for up to 200ms, or until the buffer is full, per packet. Disabling Nagle and anything else I've tried seems to make no difference. IE 11, Edge, and Chrome all behave the same way, and the little Googling I've done has not revealed an easy way to disable this 'feature' yet. However, running Ubuntu in a hypervisor let me work around it, and I was able to siege with impunity (hitting my Phoenix server with 60k connections to various router-points in less than a second).