Racket Web servlet performance benchmarked and compared

dbohdan

Sep 1, 2017, 1:19:19 PM
to Racket Users
Hi, everyone. Long time (occasional) reader, first time writer here.

In the 5.x days I played with Racket's Web servlets and found them slower than I'd expected. (My expectations were, admittedly, quite high after seeing how much better Racket performed at other tasks than your average scripting language.) I've decided to try Web servlets out again, but this time to put some rough numbers on the performance with a reproducible benchmark.

My benchmark compares Racket's stateful and stateless servlets against the SCGI package for Racket, Caddy (an HTTP server written in Go), Flask (a Python web microframework), GNU Guile's Web server module, Ring/Compojure (a Clojure HTTP middleware/routing library), Plug (Elixir HTTP middleware), and Sinatra (a Ruby web microframework). On each of these platforms the benchmark implements a trivial web application that serves around 4K of plain text, then uses ApacheBench to stress it with a configurable number of concurrent connections. The application and ApacheBench run in separate Docker containers, which lets you tune the memory and the CPU time available to each. I've published the source code for the benchmark at https://gitlab.com/dbohdan/racket-vs-the-world/. It should be straightforward to run on Linux with Docker (but please report any difficulties!).

I've attached the results I got on a two-core VM. According to them, Racket's servlets do lag behind everything else but Sinatra. The results are for 100 concurrent connections, which is the default, but the relative differences in throughput are very similar with 20 connections and quite similar with just one. I'd appreciate any feedback on these results (do they look reasonable to you?) and on the code behind the benchmark (did I miss any crucial bits of configuration for the servlet?).

Best,
D. Bohdan

results.txt

Neil Van Dyke

Sep 1, 2017, 2:38:25 PM
to dbohdan, Racket Users
Thank you very much for doing this work, D. Bohdan.

If I'm reading these results quickly, and if my guess about the
distribution is correct, then it looks like Racket SCGI+nginx *might*
actually have the best times of any of your tested combinations *except
when a GC cycle kicks in*.

results/scgi.txt-Percentage of the requests served within a certain time (ms)
results/scgi.txt- 50% 3
results/scgi.txt- 66% 4
results/scgi.txt- 75% 4
results/scgi.txt- 80% 5
results/scgi.txt- 90% 7
results/scgi.txt- 95% 11
results/scgi.txt- 98% 1018
results/scgi.txt- 99% 1030
results/scgi.txt- 100% 55256 (longest request)

If GC is indeed the cause, and you avoid or reduce the GC hits that are
killing you 5% of the time, then maybe 100% of your requests are fast.
(I suggest looking at avoiding/reducing GC hits holistically, in an
application-specific way, since there are lots of things you can do,
depending on the situation, and each has costs and benefits. One very
likely situation is that inefficiencies in the application code itself
are the bottleneck, and it's best to take a look at those before
focusing on where the bottleneck moves next. That familiarization with
the application code can also help you decide how to address any
bottlenecks external to it.)
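
A quick way to check whether GC is implicated, as a minimal sketch
(`report-gc-stats` is an illustrative name, not benchmark code):
Racket exposes its cumulative GC time, so sampling it around a burst
of requests shows whether pauses line up with the slow percentiles.

#lang racket/base
;; Print cumulative GC time and current heap size; call periodically,
;; e.g. from a background thread in the server process.
(define (report-gc-stats)
  (printf "gc: ~a ms total; heap: ~a bytes\n"
          (current-gc-milliseconds)
          (current-memory-use)))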

This performance of Racket SCGI+nginx relative to the others you tested
is surprising to me, since I made the Racket `scgi` package for
particular non-performance requirements, and performance was really
secondary. (If I were prioritizing speed higher, I suspect I could make
serving much faster, doing it a different way, and then micro-optimizing
on top of that.)

Not to look a gift horse in the mouth, but it's possible that something
else was going on, to give surprisingly good numbers. For example,
often, errors can cause good performance numbers. Sometimes I used
JMeter instead of `ab` to rule out that cause of bad numbers in
performance testing (as well as to implement some testing). Also, the
bad numbers could be something else, like the OS pushing into swap,
sporadic network latency (though looks like you might've controlled for
that one), or some other OS/hardware/network burp outside of your Racket
process(es).

I'd want to have a better understanding of these numbers before
Racketeers started either bragging or donning burlap sacks of shame. :)

Neil Van Dyke

Sep 1, 2017, 2:51:13 PM
to dbohdan, Racket Users
Oh yeah, contention from simultaneous requests, if you're doing that,
can also complicate your numbers. Adjusting `#:scgi-max-allow-wait`
might be a quick way to see whether that changes your numbers. (Hitting
a limit here could give you better numbers, or worse numbers, but
removing a limit being hit should change the numbers.) You can also run
Wireshark or "tcpdump", to be certain of what's going on at the packet
level, but that can be time-consuming to trace through.

George Neuner

Sep 1, 2017, 5:58:21 PM
to racket...@googlegroups.com
On Fri, 1 Sep 2017 14:38:21 -0400, Neil Van Dyke
<ne...@neilvandyke.org> wrote:

> ... *except when a GC cycle kicks in*.

Speaking of web servers and GC ...


I have an HTTP web-server application that needs to be able to upload
and download fairly large files. The application is somewhat memory
constrained, and I'd like to handle these large data transfers in
chunks, using (relatively) small buffers.

Outgoing is not [much of] a problem because the client TCP connection
is available in response/output ... but with incoming data, the
web-server framework has slurped in everything before my code even
runs.

Is there a way in the web-server to stream incoming data? Or maybe to
get at the request before the (multipart) form data is read in?

George

Piyush Katariya

Sep 2, 2017, 1:46:54 PM
to Racket Users
Just curious ...

Does a Racket app make use of all CPU cores by having multiple processes?

In a Go app, there isn't any need to, because the Go runtime uses all available CPUs by default. The same is the case with the JVM and the Erlang VM.

Philip McGrath

Sep 2, 2017, 2:12:00 PM
to Piyush Katariya, Racket Users
The Racket web server does not make use of multiple CPU cores, but with stateless continuations you can run multiple instances behind a reverse proxy. See https://groups.google.com/d/topic/racket-users/TC4JJnZo1U8/discussion ("it is exactly node.js without callbacks").
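
As a minimal sketch of one such instance (illustrative, not from the
linked thread; the handler is a placeholder), each OS process runs one
single-core server on its own port, and the reverse proxy balances
across them:

#lang racket/base
(require web-server/servlet-env
         web-server/http)

;; Run one instance per OS process; pass a distinct port (8081, 8082,
;; ...) on the command line and point the reverse proxy at all of them.
(define port
  (string->number (vector-ref (current-command-line-arguments) 0)))

(serve/servlet (λ (req) (response/xexpr '(html (body (p "hello")))))
               #:port port
               #:servlet-path "/"
               #:command-line? #t)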

-Philip



Piyush Katariya

Sep 2, 2017, 2:21:22 PM
to Racket Users, corporat...@gmail.com
Then it might not be a fair benchmark in comparison with the other platforms, is it?

George Neuner

Sep 2, 2017, 3:01:59 PM
to Piyush Katariya, racket users

On 9/2/2017 1:46 PM, Piyush Katariya wrote:
> Does a Racket app make use of all CPU cores by having multiple processes?

If it is written to use "places", which are parallel instances of the
Racket VM that run on separate kernel threads.
https://docs.racket-lang.org/guide/parallelism.html?q=place#%28tech._place%29

What Racket calls "threads" are "green" [user-space] threads
multiplexed on a single kernel thread.
https://en.wikipedia.org/wiki/Green_threads
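
As a minimal sketch of the distinction (illustrative, not benchmark
code): the place below runs in a separate Racket VM instance on its
own OS thread, while the thread shares the main kernel thread.

#lang racket/base
(require racket/place)

;; The place body runs in a separate instance of the Racket VM on its
;; own OS thread, so it can occupy a second core.
(define (start-worker)
  (place ch
    (place-channel-put ch (for/sum ([i (in-range 10000000)]) i))))

(module+ main
  (define p (start-worker))
  ;; A Racket "thread", by contrast, is multiplexed on this same
  ;; kernel thread.
  (thread (λ () (displayln "green thread, same kernel thread")))
  (printf "place computed: ~a\n" (place-channel-get p)))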

George

Piyush Katariya

Sep 2, 2017, 3:05:05 PM
to George Neuner, racket users
Thanks for the clarification. 

dbohdan

Sep 2, 2017, 3:12:40 PM
to Racket Users, danyil...@gmail.com

On Friday, September 1, 2017 at 9:38:25 PM UTC+3, Neil Van Dyke wrote:
> Thank you very much for doing this work, D. Bohdan.

You're welcome! I had fun doing it.

> This performance of Racket SCGI+nginx relative to the others you tested
> is surprising to me, since I made the Racket `scgi` package for
> particular non-performance requirements, and performance was really
> secondary.

Thanks for making the 'scgi package. I rather like the SCGI protocol. It's a pity that it isn't as widely supported as FastCGI, considering that it's much simpler to implement (second only to plain old CGI), but still has a performance profile similar to FastCGI's.

> Not to look a gift horse in the mouth, [...]

No worries. The horse is given very much with that in mind. :-) To address your specific concerns:

> errors can cause good performance numbers. Sometimes I used
> JMeter instead of `ab` to rule out that cause of bad numbers in
> performance testing (as well as to implement some testing).

I think the SCGI benchmark works correctly because of the data sizes that ApacheBench reports. For example, here is the request data from one run:

> Complete requests: 178572
> Failed requests: 0
> Total transferred: 755002416 bytes
> HTML transferred: 733038060 bytes

733038060 / 178572 = 4105, which is exactly the size of the text message the application serves. The same is true of other data I've examined so far (5 runs). To help detect errors, the benchmark is also programmed to abort if the first request to an application doesn't serve exactly the right text (see `run-when-ready.sh`) or if ApacheBench sees enough of nginx's status 502 pages, which are served when the SCGI server doesn't respond correctly or at all.

I'll look into using JMeter in addition to ApacheBench.

> the OS pushing into swap

Good point. I thought I'd already disabled the containers' access to swap, but apparently it didn't work because of a cgroups issue. The "benchmarked" container still must have used swap, because it began to run out of memory for some applications when I disabled swap on the VM itself. I've increased "benchmarked's" memory quota to 768 MiB and added a recommendation to disable swap system-wide in README.md.

> sporadic network latency (though looks like you might've controlled for
> that one)

The application and the load generator communicate through a virtual network between two Docker containers on the same host, so this should not be an issue.

> some other OS/hardware/network burp outside of your Racket
> process(es).

Such burps are possible, and even likely, because I run the VM on a machine I use for other tasks. I try to ensure no taxing tasks run alongside the benchmark and mitigate the inevitable CPU spikes by simply benchmarking every application for longer (three minutes by default).

On Friday, September 1, 2017 at 9:51:13 PM UTC+3, Neil Van Dyke wrote:
> `#:scgi-max-allow-wait`

Thanks for the suggestion. This turned out to be the key to SCGI performance. Increasing #:scgi-max-allow-wait from 1 to 4 (the default), 16, 64, and 256 gives a moderate increase in throughput (from ~2350 req/s to ~2900 req/s), but *decreases* the maximum latency in a very major way (from ~50000 ms to ~250 ms). See scgi-max-allow.md in the attachments for some detailed data samples. The effect levels off at 256: there isn't an obvious difference between 256, 1024, 4096, and 16384. I've pushed an update to run the tests with #:scgi-max-allow-wait 256.

Besides scgi-max-allow.md, I've also attached the results for 1) a five-minute benchmark with one concurrent connection, 768 MiB RAM, no swap, #:scgi-max-allow-wait 4; 2) a rerun of the first benchmark with the updated settings (three minutes, 100 connections, 768 MiB RAM, no swap, #:scgi-max-allow-wait 256).

results-3min-100conns.txt
results-5min-1conn.txt
scgi-max-allow.md

Neil Van Dyke

Sep 2, 2017, 4:12:38 PM
to dbohdan, Racket Users
dbohdan wrote on 09/02/2017 03:12 PM:
> I rather like the SCGI protocol. It's a pity that it isn't as widely
> supported as FastCGI, considering that it's much simpler to implement
> (second only to plain old CGI), but still has a performance profile
> similar to FastCGI's.

I mostly implemented FastCGI in Racket at one point, but then I read
about the FastCGI implementation in my target HTTP server having hard
bugs, so I abandoned that.

I also think there are faster ways to serve HTTP from Racket, but I'd
have to find funding to work through them.

And we have to keep in mind that, unlike benchmarks for LINPACK or
standard transaction processing, the real-world applications of HTTP
servers are messier. Also, I don't think many people have been
tuning for Web application benchmarks, unlike what was once done for
LINPACK and TP. I think the Racket community has enough skill to make a
respectable showing in a benchmark tuning war, or in general platform
performance for real-world Web applications, but I'm not aware of any
funding going into that right now.

antoine

Sep 3, 2017, 6:01:03 AM
to racket...@googlegroups.com
A while ago I implemented a minimal FastCGI protocol and compared it
against various other implementations.

http://antoineb.github.io/blog/2015/06/02/basic-fastcgi-with-racket/

dbohdan

Sep 3, 2017, 4:50:19 PM
to Racket Users, corporat...@gmail.com
On Saturday, September 2, 2017 at 8:46:54 PM UTC+3, Piyush Katariya wrote:
> Does a Racket app make use of all CPU cores by having multiple processes?

Thanks for asking this question. It prompted me to revise how the benchmark is run. The short answer is that the servlet application uses a single core. The SCGI application is single-core the same way, but benefits from nginx's built-in support for multiple cores through worker processes.

I've made the servlet application use futures according to Jay McCarthy's post at https://lists.racket-lang.org/users/archive/2014-July/063419.html, but found that, as he predicted, it did not improve the performance (in fact, it reduced it). I don't know straight away how to implement places-based workers for servlets. I may investigate it later (I'm interested in message-passing parallelism), but my primary intention with this project is to measure the performance a developer gets out of existing, reusable, hopefully already debugged libraries and frameworks. A little custom code and configuration is fine, but a custom work scheduler seems to me to go beyond that. Does a library exist for running servlets in places?
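
For reference, a minimal sketch of the limitation (illustrative, not
the benchmark code): a future runs in parallel only until it hits an
operation the runtime cannot perform in parallel, such as I/O, and
then stalls until touched, which is why I/O-bound server code gains
little.

#lang racket/base
(require racket/future)

;; The arithmetic can run in parallel, but I/O is a "blocking"
;; operation for futures: the future stalls at displayln until the
;; main thread calls (touch f).
(define f
  (future (λ ()
            (define sum (for/sum ([i (in-range 1000000)]) i))
            (displayln sum)
            sum)))
(touch f)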

I've also experimented with having nginx load balance between two Racket SCGI instances. The result was somewhat better throughput (~2650 req/s instead of ~2300 req/s) and identical latency when the application had two cores to work with, and worse throughput (~1800 req/s) and latency with only one.

As far as fairness goes, I don't think either a strictly single-core or a use-them-if-you-can multi-core benchmark is clearly unfair. Both types have value. After some consideration, I've decided to commit to single-core (sort of — read on) as the default for this benchmark.

My first solution was to limit the VM in which I ran the benchmarks to a single core, but that led to ApacheBench and the application competing for CPU time. This would take the benchmark further away from the real world, and it is generally not recommended to have the benchmarked application and the load generator share a CPU. I've tried a few solutions, and the best I have found in terms of how the resources are allocated is to bind each of the two containers (the application and ApacheBench) to a separate CPU core. That way the applications get only one core, but don't have to fight for it with ApacheBench. I've pushed this update to the repository.

Here are some numbers for the three configurations: one core, two cores, and one core per container.

-- 1 shared core
results/caddy.txt -Requests per second: 2312.93 [#/sec] (mean)
results/compojure.txt -Requests per second: 1677.89 [#/sec] (mean)
results/flask.txt -Requests per second: 977.33 [#/sec] (mean)
results/guile.txt -Requests per second: 1508.77 [#/sec] (mean)
results/plug.txt -Requests per second: 2335.21 [#/sec] (mean)
results/scgi.txt -Requests per second: 2163.00 [#/sec] (mean)
results/sinatra.txt -Requests per second: 317.75 [#/sec] (mean)
results/stateful.txt -Requests per second: 494.55 [#/sec] (mean)
results/stateless.txt -Requests per second: 584.34 [#/sec] (mean)

-- 2 shared cores
results/caddy.txt -Requests per second: 4358.68 [#/sec] (mean)
results/compojure.txt -Requests per second: 4730.50 [#/sec] (mean)
results/flask.txt -Requests per second: 1140.01 [#/sec] (mean)
results/guile.txt -Requests per second: 2092.78 [#/sec] (mean)
results/plug.txt -Requests per second: 5235.78 [#/sec] (mean)
results/scgi.txt -Requests per second: 3074.15 [#/sec] (mean)
results/sinatra.txt -Requests per second: 329.35 [#/sec] (mean)
results/stateful.txt -Requests per second: 604.30 [#/sec] (mean)
results/stateless.txt -Requests per second: 687.77 [#/sec] (mean)

-- 2 fixed cores (one for "benchmarked", one for "ab")
results/caddy.txt -Requests per second: 3963.03 [#/sec] (mean)
results/compojure.txt -Requests per second: 2513.05 [#/sec] (mean)
results/flask.txt -Requests per second: 1207.77 [#/sec] (mean)
results/guile.txt -Requests per second: 2133.48 [#/sec] (mean)
results/plug.txt -Requests per second: 4322.55 [#/sec] (mean)
results/scgi.txt -Requests per second: 2406.02 [#/sec] (mean)
results/sinatra.txt -Requests per second: 347.89 [#/sec] (mean)
results/stateful.txt -Requests per second: 573.48 [#/sec] (mean)
results/stateless.txt -Requests per second: 658.67 [#/sec] (mean)

Jay McCarthy

Sep 4, 2017, 11:23:46 AM
to George Neuner, racket...@googlegroups.com
There is not a way to do this. Should there be? Could there be?

The relevant code is here:

https://github.com/racket/web-server/blob/master/web-server-lib/web-server/http/request.rkt#L52

and

https://github.com/racket/web-server/blob/master/web-server-lib/web-server/http/request.rkt#L219

It would be challenging to do this. Suppose that a request struct
contained the input port (the obvious thing to do.) The problem is
that the servlet MUST consume it all before the Web server can look
for the next HTTP request (because in 1.1 there's multiple per
connection.) So what if the servlet doesn't? Do we detect when it
loses the reference and then slurp it up and keep going on the
connection? It is currently allowed to store a request in a global
data-structure for later, for instance.

Let's suppose we ignore that problem and only enable this when
#:connection-close? is true and there's only one request per
connection... then there's a backwards compatibility problem because a
request already contains the raw-bytes. We could delay them, but
getting them will require having a copy of the result of reading the
input port, which is exactly what you are trying to avoid.

I think we're stuck as far as the Web server's defaults go (but maybe
you can think of something.) I think the best thing to do is to add
the right knob to make-read-request and make it easy to configure a
special dispatcher that does what you want and use URI case matching
to get into the right code path for you. (I.e. Have your server
default to the new behavior and then switch to the old behavior when
you are going to do a normal non-streaming read before handing the
request to your servlet.)

Jay



--
-=[ Jay McCarthy http://jeapostrophe.github.io ]=-
-=[ Associate Professor PLT @ CS @ UMass Lowell ]=-
-=[ Moses 1:33: And worlds without number have I created; ]=-

Jay McCarthy

Sep 4, 2017, 12:11:14 PM
to dbohdan, Racket Users, Piyush Katariya
Thank you for working on this Danyil. I think it is fair to test what
the defaults give you.

Would you please add this file to your tests (and each of its three
ways of running?) It would be interesting to compare the performance
of Racket versus the particular Web server library. (The Web server
sets up a lot of safety state per connection to ensure that each
individual connection doesn't run out of memory or crash anything. I
am curious what the cumulative effect of all those features and
protections is.)

Jay

#lang racket/base
(require racket/tcp
         racket/match)

(define message #"Lorem ipsum...") ;; XXX fill this in
(define len (bytes-length message))

;; Discard the request line and headers, then write a fixed response
;; and close the connection.
(define (serve! r w)
  (let read-loop ()
    (define b (read-bytes-line r 'any))
    (unless (or (eof-object? b) (bytes=? b #""))
      (read-loop)))
  (close-input-port r)
  (write-bytes #"HTTP/1.1 200 OK\r\n" w)
  (write-bytes #"Connection: close\r\n" w)
  (write-string (format "Content-Length: ~a\r\n" len) w)
  (write-bytes #"\r\n" w)
  (write-bytes message w)
  (close-output-port w))

(define port-no 5000)
(define (setup! k)
  (define l (tcp-listen port-no 500 #t "0.0.0.0"))
  (k l)
  (tcp-close l))

;; Serve each connection to completion before accepting the next.
(define (single-at-a-time l)
  (let loop ()
    (define-values (r w) (tcp-accept l))
    (serve! r w)
    (loop)))

;; Serve each connection in its own (green) thread.
(define (many-at-a-time l)
  (let loop ()
    (define-values (r w) (tcp-accept l))
    (thread (λ () (serve! r w)))
    (loop)))

;; One place per processor; the main place accepts connections and
;; hands them off over a place channel.
(define (k-places l)
  (local-require racket/place)

  (define-values (jobs-ch-to jobs-ch-from) (place-channel))
  (define ps
    (for/list ([i (in-range (processor-count))])
      (place/context
       local-ch
       (let loop ()
         (define r*w (place-channel-get jobs-ch-from))
         (serve! (car r*w) (cdr r*w))
         (loop)))))

  (let loop ()
    (define-values (r w) (tcp-accept l))
    (place-channel-put jobs-ch-to (cons r w))
    (loop)))

(module+ main
  (setup!
   (match (current-command-line-arguments)
     [(vector "single")
      single-at-a-time]
     [(vector "many")
      many-at-a-time]
     [(vector "places")
      k-places])))
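
(Assuming the file is saved as lipsum.rkt, the three modes would be
started as "racket lipsum.rkt single", "racket lipsum.rkt many", and
"racket lipsum.rkt places".)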

George Neuner

Sep 4, 2017, 4:22:53 PM
to racket...@googlegroups.com
Hi Jay,


On Mon, 4 Sep 2017 16:23:43 +0100, Jay McCarthy
<jay.mc...@gmail.com> wrote:


>On Fri, Sep 1, 2017 at 10:57 PM, George Neuner <gneu...@comcast.net> wrote:
>
>> Is there a way in the web-server to stream incoming data? Or maybe to
>> get at the request before the (multipart) form data is read in?
>


>There is not a way to do this. Should there be? Could there be?
>
>The relevant code is here:
>
>https://github.com/racket/web-server/blob/master/web-server-lib/web-server/http/request.rkt#L52
>
>and
>
>https://github.com/racket/web-server/blob/master/web-server-lib/web-server/http/request.rkt#L219
>
>It would be challenging to do this. Suppose that a request struct
>contained the input port (the obvious thing to do.) The problem is
>that the servlet MUST consume it all before the Web server can look
>for the next HTTP request (because in 1.1 there's multiple per
>connection.) So what if the servlet doesn't? Do we detect when it
>loses the reference and then slurp it up and keep going on the
>connection? It is currently allowed to store a request in a global
>data-structure for later, for instance.

Hmmm. Does 1.1 allow queuing requests, or only keeping the connection
open between requests? If the latter, then it should be safe to just
drain the connection.


>Let's suppose we ignore that problem and only enable this when
>#:connection-close? is true and there's only one request per
>connection... then there's a backwards compatibility problem because a
>request already contains the raw-bytes. We could delay them, but
>getting them will require having a copy of the result of reading the
>input port, which is exactly what you are trying to avoid.

This I understand - unfortunately. <frown>


>I think we're stuck as far as the Web server's defaults go (but maybe
>you can think of something.) I think the best thing to do is to add
>the right knob to make-read-request and make it easy to configure a
>special dispatcher that does what you want and use URI case matching
>to get into the right code path for you. (I.e. Have your server
>default to the new behavior and then switch to the old behavior when
>you are going to do a normal non-streaming read before handing the
>request to your servlet.)

I'll take a look at it. Thanks for the exposition!

George

Jay McCarthy

Sep 4, 2017, 4:30:13 PM
to dbohdan, Racket Users, Piyush Katariya
I thought of another way, so here's a fourth version:

#lang racket/base
(require racket/tcp
         racket/match
         racket/place)

;; One place per processor; each place serves each handed-off
;; connection in its own (green) thread, so a slow connection doesn't
;; stall the place.
(define (many-places l)
  (define ps
    (for/list ([i (in-range (processor-count))])
      (place/context
       jobs-ch-from
       (let loop ()
         (define r*w (place-channel-get jobs-ch-from))
         (thread (λ () (serve! (car r*w) (cdr r*w))))
         (loop)))))

  (let loop ()
    (for ([send-to-p-ch (in-list ps)])
      (define-values (r w) (tcp-accept l))
      (place-channel-put send-to-p-ch (cons r w)))
    (loop)))

(module+ main
  (setup!
   (match (current-command-line-arguments)
     [(vector "single")
      single-at-a-time]
     [(vector "many")
      many-at-a-time]
     [(vector "places")
      k-places]
     [(vector "many-places")
      many-places])))

dbohdan

Sep 5, 2017, 1:38:03 AM
to Racket Users, danyil...@gmail.com, corporat...@gmail.com
On Monday, September 4, 2017 at 7:11:14 PM UTC+3, Jay McCarthy wrote:
> Thank you for working on this Danyil.

You're welcome!

> Would you please add this file to your tests (and each of its three
> ways of running?)

Added, and updated to the "many-places" version.

I would like to add you to the AUTHORS file (https://gitlab.com/dbohdan/racket-vs-the-world/blob/master/AUTHORS — please read). Would this attribution line be okay?

> Jay McCarthy <your-real-emai...@gmail.com> https://jeapostrophe.github.io/

I've run the default benchmark with the new application, which I've dubbed "racket-custom". (Actually, I had to make a tweak to the benchmark to accommodate the number of requests it was fulfilling. It made ApacheBench overstep its memory quota and get killed.) When started with the "places" or the "many-places" command line argument on Linux, racket-custom quickly runs out of file descriptors. It opens one per request and apparently doesn't close them. The following results are for the other two modes.

======
> grep 'Requests per second' results/*
results/caddy.txt:Requests per second: 3724.58 [#/sec] (mean)
results/compojure.txt:Requests per second: 3342.73 [#/sec] (mean)
results/custom-single.txt:Requests per second: 8086.51 [#/sec] (mean)
results/custom-many.txt:Requests per second: 7000.06 [#/sec] (mean)
results/flask.txt:Requests per second: 1113.81 [#/sec] (mean)
results/guile.txt:Requests per second: 2025.52 [#/sec] (mean)
results/plug.txt:Requests per second: 4367.07 [#/sec] (mean)
results/scgi.txt:Requests per second: 2243.83 [#/sec] (mean)
results/sinatra.txt:Requests per second: 324.91 [#/sec] (mean)
results/stateful.txt:Requests per second: 538.47 [#/sec] (mean)
results/stateless.txt:Requests per second: 657.18 [#/sec] (mean)
======

Long-form results with latency data are attached.
long.txt

Jon Zeppieri

Sep 5, 2017, 1:50:27 AM
to dbohdan, Racket Users, corporat...@gmail.com
On Tue, Sep 5, 2017 at 1:38 AM, dbohdan <danyil...@gmail.com> wrote:
>
> I've run the default benchmark with the new application, which I've dubbed "racket-custom". (Actually, I had to make a tweak to the benchmark to accommodate the number of requests it was fulfilling. It made ApacheBench overstep its memory quota and get killed.) When started with the "places" or the "many-places" command line argument on Linux, racket-custom quickly runs out of file descriptors. It opens one per request and apparently doesn't close them.

In this code:

(let loop ()
  (define-values (r w) (tcp-accept l))
  (place-channel-put jobs-ch-to (cons r w))
  (loop)))

after sending the ports to the place and before looping, I think the
ports need to be abandoned:

(tcp-abandon-port r)
(tcp-abandon-port w)

- Jon

Jay McCarthy

Sep 5, 2017, 1:52:10 AM
to Jon Zeppieri, dbohdan, Racket Users, Piyush Katariya
I think so too (in both places versions)



Jay McCarthy

Sep 5, 2017, 2:01:14 AM
to dbohdan, Racket Users, Piyush Katariya
On Tue, Sep 5, 2017 at 6:38 AM, dbohdan <danyil...@gmail.com> wrote:
> On Monday, September 4, 2017 at 7:11:14 PM UTC+3, Jay McCarthy wrote:
> I would like to add you to the AUTHORS file (https://gitlab.com/dbohdan/racket-vs-the-world/blob/master/AUTHORS — please read). Would this attribution line be okay?
>
>> Jay McCarthy <your-real-emai...@gmail.com> https://jeapostrophe.github.io/

Yes, that's good.

> I've run the default benchmark with the new application, which I've dubbed "racket-custom". (Actually, I had to make a tweak to the benchmark to accommodate the number of requests it was fulfilling. It made ApacheBench overstep its memory quota and get killed.) When started with the "places" or the "many-places" command line argument on Linux, racket-custom quickly runs out of file descriptors. It opens one per request and apparently doesn't close them. The following results are for the other two modes.
>
> ======
>> grep 'Requests per second' results/*
> results/custom-single.txt:Requests per second: 8086.51 [#/sec] (mean)
> results/custom-many.txt:Requests per second: 7000.06 [#/sec] (mean)

It is really surprising to me that the many version doesn't perform
better, because I assumed that there would be IO delays on one
connection and you wouldn't want to stall others while waiting to
read/write it. Presumably this is a bit of an artifact of the
benchmarking happening on localhost?

----

These seem like pretty good results (x2 over the best before!) and I
interpret them as telling us that the problem is not in Racket's IO
system but in how the Web server adds overhead.

dbohdan

Sep 5, 2017, 3:07:08 AM
to Racket Users, danyil...@gmail.com, corporat...@gmail.com
On Tuesday, September 5, 2017 at 8:50:27 AM UTC+3, Jon Zeppieri wrote:
> (tcp-abandon-port r)
> (tcp-abandon-port w)

You're right. This worked for "places". I've rerun "single" and "many" along with "places".

======
results/custom-many.txt:Requests per second: 6720.43 [#/sec] (mean)
results/custom-places.txt:Requests per second: 7095.99 [#/sec] (mean)
results/custom-single.txt:Requests per second: 7609.11 [#/sec] (mean)
======

As for "many-places", I was mistaken about it running out of file descriptors. I accidentally tested "places" in its stead. As-is (https://gitlab.com/dbohdan/racket-vs-the-world/blob/97dd7858aecab9af2a66ed687d12ce45adb4899d/apps/racket-custom/lipsum.rkt), "many-places" does not send anything to incoming connections and never closes them.

On Tuesday, September 5, 2017 at 9:01:14 AM UTC+3, Jay McCarthy wrote:
> Yes, that's good.

All right.

> It is really surprising to me that the many version doesn't perform
> better, because I assumed that there would be IO delays on one
> connection and you wouldn't want to stall others while waiting to
> read/write it. Presumably this is a bit of an artifact of the
> benchmarking happening on localhost?

I was wondering about the reason myself. To tease it out, I'll try a few variations on the benchmark later.

Jay McCarthy

Sep 5, 2017, 3:57:17 AM
to dbohdan, Racket Users, Piyush Katariya
On Tue, Sep 5, 2017 at 8:07 AM, dbohdan <danyil...@gmail.com> wrote:
> ======
> results/custom-many.txt:Requests per second: 6720.43 [#/sec] (mean)
> results/custom-places.txt:Requests per second: 7095.99 [#/sec] (mean)
> results/custom-single.txt:Requests per second: 7609.11 [#/sec] (mean)
> ======

That is interesting too. The places version serves one request at a
time on each place, so there's some parallelism but each place does
its work serially.

> As for "many-places", I was mistaken about it running out of file descriptors. I accidentally tested "places" in its stead. As-is (https://gitlab.com/dbohdan/racket-vs-the-world/blob/97dd7858aecab9af2a66ed687d12ce45adb4899d/apps/racket-custom/lipsum.rkt), "many-places" does not send anything to incoming connections and never closes them.

I've just tested on Linux and OS X and I don't see that behavior. I'm
quite confused.

dbohdan

Sep 5, 2017, 4:41:46 AM
to Racket Users, danyil...@gmail.com, corporat...@gmail.com
On Tuesday, September 5, 2017 at 10:57:17 AM UTC+3, Jay McCarthy wrote:
> I've just tested on Linux and OS X and I don't see that behavior. I'm
> quite confused.

Yes, scratch what I said. The "many-places" benchmark only fails this way for me on a particular Linux VM, which just so happened to be the one I was testing it on. Maybe I got the VM in a bad state. If the problem is meaningfully related to the benchmarked application, I'll follow up on it.

Meanwhile, here are some benchmark results for "many-places". The transferred data sizes suggest it worked correctly.

======
results/custom-many-places.txt:Requests per second: 4931.29 [#/sec] (mean)
results/custom-many.txt:Requests per second: 6449.73 [#/sec] (mean)
results/custom-places.txt:Requests per second: 7325.81 [#/sec] (mean)
results/custom-single.txt:Requests per second: 7793.91 [#/sec] (mean)
======

I'll try this again with two fixed cores available to the application container.

dbohdan

Sep 5, 2017, 5:00:00 AM
to Racket Users, danyil...@gmail.com, corporat...@gmail.com
On Tuesday, September 5, 2017 at 11:41:46 AM UTC+3, dbohdan wrote:
> I'll try this again with two fixed cores available to the application container.

results/custom-many-places.txt:Requests per second: 6517.83 [#/sec] (mean)
results/custom-many.txt:Requests per second: 7949.04 [#/sec] (mean)
results/custom-places.txt:Requests per second: 7521.15 [#/sec] (mean)
results/custom-single.txt:Requests per second: 8675.64 [#/sec] (mean)

Jay McCarthy

Sep 5, 2017, 5:41:04 AM
to dbohdan, Racket Users, Piyush Katariya
Is the benchmarking client core the same core as the server core?
Could that help explain why single threaded performance is best?



Piyush Katariya

Sep 5, 2017, 6:17:38 AM
to Racket Users, danyil...@gmail.com, corporat...@gmail.com
Wow. ~7K looks like a good number.

Is it common practice to spawn a thread for each request? Is it that cheap from a resource point of view? Could a thread pool be of some help here?

Jack Firth

Sep 5, 2017, 9:22:48 PM
to Racket Users, danyil...@gmail.com, corporat...@gmail.com
On Tuesday, September 5, 2017 at 3:17:38 AM UTC-7, Piyush Katariya wrote:
> Wow. ~7K looks like a good number.
>
> Is it common practice to spawn a thread for each request? Is it that cheap from a resource point of view? Could a thread pool be of some help here?

Racket threads are not OS threads. They're "green threads" and are cooperatively scheduled by the Racket runtime. They're very cheap to create, even with a short life span.
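
A quick sketch to illustrate the cheapness (the count is arbitrary):
spawning and joining a hundred thousand trivial threads is purely a
matter of runtime allocation and scheduling, with no OS threads
involved.

#lang racket/base
;; Create 100,000 short-lived green threads and wait for them all.
(time
 (for-each thread-wait
           (for/list ([i (in-range 100000)])
             (thread void))))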

dbohdan

Sep 7, 2017, 4:52:17 PM
to Racket Users
On Tuesday, September 5, 2017 at 12:41:04 PM UTC+3, Jay McCarthy wrote:
> Is the benchmarking client core the same core as the server core?
> Could that help explain why single threaded performance is best?

The not-quite-yes-or-no answer is that they were limited to separate virtual cores inside a VirtualBox VM. When a VirtualBox VM has N virtual cores on a physical CPU with M cores, it can use roughly up to N/M of the CPU's resources. In that benchmark the applications, which had access to two virtual cores, had 50% of the four-core physical CPU to themselves, and the load generator had 25% with one virtual core.

I've run three variants of the benchmark to see if running in a VM had a noticeable effect on single-threaded vs. multi-threaded performance and to address your question about whether "many" underperformed because of a virtual network.

First, I got rid of the VM. Both the applications and the load generator ran in containers on the same machine, but not in a VM. This meant they were limited to different physical cores. The results were similar to those in a VM. The numbers were lower overall due to the slightly weaker hardware.

======
results/caddy.txt:Requests per second: 2758.06 [#/sec] (mean)
results/compojure.txt:Requests per second: 2670.11 [#/sec] (mean)
results/custom-many-places.txt:Requests per second: 4326.27 [#/sec] (mean)
results/custom-many.txt:Requests per second: 4655.04 [#/sec] (mean)
results/custom-places.txt:Requests per second: 4584.75 [#/sec] (mean)
results/custom-single.txt:Requests per second: 5191.93 [#/sec] (mean)
results/flask.txt:Requests per second: 1111.25 [#/sec] (mean)
results/guile.txt:Requests per second: 1933.10 [#/sec] (mean)
results/plug.txt:Requests per second: 3346.99 [#/sec] (mean)
results/scgi.txt:Requests per second: 2092.03 [#/sec] (mean)
results/sinatra.txt:Requests per second: 293.60 [#/sec] (mean)
results/stateful.txt:Requests per second: 532.61 [#/sec] (mean)
results/stateless.txt:Requests per second: 625.02 [#/sec] (mean)
======

Second, I ran the benchmark over a gigabit local network. Yesterday I pushed a script for this (`remote-benchmark.exp`) to the repository. The applications ran on one machine (in a Docker container with access to two virtual cores). The load generator ran on another (my laptop).

======
remote-results/caddy.txt-Requests per second: 3119.23 [#/sec] (mean)
remote-results/compojure.txt-Requests per second: 4009.71 [#/sec] (mean)
remote-results/custom-many-places.txt-Requests per second: 4409.48 [#/sec] (mean)
remote-results/custom-many.txt-Requests per second: 5499.20 [#/sec] (mean)
remote-results/custom-places.txt-Requests per second: 5072.63 [#/sec] (mean)
remote-results/custom-single.txt-Requests per second: 6246.09 [#/sec] (mean)
remote-results/flask.txt-Requests per second: 1106.43 [#/sec] (mean)
remote-results/guile.txt-Requests per second: 2062.53 [#/sec] (mean)
remote-results/plug.txt-Requests per second: 4034.74 [#/sec] (mean)
remote-results/scgi.txt-Requests per second: 2046.91 [#/sec] (mean)
remote-results/sinatra.txt-Requests per second: 288.52 [#/sec] (mean)
remote-results/stateful.txt-Requests per second: 542.27 [#/sec] (mean)
remote-results/stateless.txt-Requests per second: 614.18 [#/sec] (mean)
======

In both cases the ordering is still "single" > "many" > "places" > "many-places". Though "many" and "places" are pretty close in the first case, "many" consistently comes out ahead if you retest.

Last, I ran the benchmark over the Internet with two machines about 1.89×10^-10 light years apart. The applications ran on a very humble VPS. Due to its humbleness I had to reduce the number of concurrent connections to 25. "places", "many-places", and racket-scgi ran out of memory with as few as 10 concurrent connections (racket-scgi seemingly due to nginx), so I decided to exclude them rather than reduce the number of connections further.

======
> env CONNECTIONS=25 ./remote-benchmark.exp vps
remote-results/caddy.txt:Requests per second: 231.37 [#/sec] (mean)
remote-results/compojure.txt:Requests per second: 242.41 [#/sec] (mean)
remote-results/custom-many.txt:Requests per second: 250.35 [#/sec] (mean)
remote-results/custom-single.txt:Requests per second: 255.21 [#/sec] (mean)
remote-results/flask.txt:Requests per second: 235.26 [#/sec] (mean)
remote-results/guile.txt:Requests per second: 242.38 [#/sec] (mean)
remote-results/plug.txt:Requests per second: 244.98 [#/sec] (mean)
remote-results/sinatra.txt:Requests per second: 239.78 [#/sec] (mean)
remote-results/stateful.txt:Requests per second: 239.60 [#/sec] (mean)
remote-results/stateless.txt:Requests per second: 238.71 [#/sec] (mean)
======

Jon Zeppieri

Sep 7, 2017, 5:01:29 PM
to dbohdan, Racket Users
On Thu, Sep 7, 2017 at 4:52 PM, dbohdan <danyil...@gmail.com> wrote:
>
> In both cases the ordering is still "single" > "many" > "places" > "many-places". Though "many" and "places" are pretty close in the first case, "many" consistently comes out ahead if you retest.

This is really interesting. I wonder how costly the inter-place
communication is, relative to the cost of actually generating and
sending the response.

Jay McCarthy

Sep 8, 2017, 6:09:19 AM
to dbohdan, Racket Users
Wow! Thanks for all of this work. It is really interesting to see how
different the performance is on the Internet workload!

Neil Van Dyke

Sep 8, 2017, 9:29:34 AM
to dbohdan, Racket Users
dbohdan wrote on 09/07/2017 04:52 PM:
> Last, I ran the benchmark over the Internet with two machines about 1.89×10^-10 light years apart. The applications ran on a very humble VPS. Due to its humbleness I had to reduce the number of concurrent connections to 25.

The #/sec for each implementation are suspiciously similar. I wonder
whether they're limited by something like an accounting limit imposed on
the VPS (such as network-bytes-per-second or TCP-connections-open), or
by some other host/network limit.

> "places", "many-places", and racket-scgi ran out of memory with as few as 10 concurrent connections (racket-scgi seemingly due to nginx),

I want to acknowledge that this humble-VPS benchmarking is, or
approximates, a real scenario. For example, small embedded/IoT devices
communicating via HTTP, or students/hobbyists using "free tier"
VPSs/instances.

Just to note for the email list: small devices tend to force us to think
about resources earlier and more often than bigger devices do. For
example, the difference between "I'm trying to fit the image in this
little computer's flash / This little computer can barely boot before we
start getting out-of-memory process kills" and "I've just been coding
this Web app for three months, and haven't really thought about what
size and number of Amazon EC2 instances we'll need for the non-CDN
serving." GC is another thing you might feel earlier on a small device.

I think we could probably find a way to serve 25 simultaneous
connections via Racket on a pretty small device (maybe even an OpenWrt
small home WiFi router, "http://www.neilvandyke.org/racket-openwrt/").
As is often the case on small devices, it takes some upfront decisions
of architecture and software, with time&space a high priority, and then
often some tuning beyond that.

For purposes of this last humble-VPS benchmarking (if we can keep making
more benchmarking work for you), you might get those initial numbers
from places/many-places/racket-scgi by setting Racket's memory usage
limit. That might force it to GC early and often, and give poorer
numbers, but at least it's running (the first priority is to not exhaust
system memory, get any processes OOM-killed, deadlock, etc.).
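
For example (a sketch; the 256 MiB figure is arbitrary), a cap
installed at startup makes the runtime collect more aggressively as
the limit nears, and shuts down the custodian's threads rather than
letting the OS OOM-kill the whole process:

;; Cap memory accounted to the current custodian.
(custodian-limit-memory (current-custodian)
                        (* 256 1024 1024)) ; 256 MiB; arbitrary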

For the racket-scgi + nginx setup, if nginx can't quickly be tuned to
not be a problem itself, there are HTTP servers targeting smaller
devices, like what OpenWrt uses for its admin interface. But I'd be
tempted to whip up a fast and light HTTP server in pure Racket, and to
then tackle GC delays and process growth.

(That "tempted" is hypothetical. Personally, any work I did right now
would likely be a holistic approach to a particular consulting
client's/employer's particular needs. Hopefully, this would contribute
back open source Racket packages and knowledge. But the contributions
would probably be of the form "here's one way to do X and Y, which well
works for our needs, in context with A, B, and C requirements and other
architectural decisions", which is usually not the same as "here's an
overall near-optimal generalized solution to a familiar class of
problem". Unless the client/employer needs are for the generalized
solution, or they are altruistic on this.)

dbohdan

Sep 9, 2017, 2:40:05 PM
to Racket Users
On Friday, September 8, 2017 at 4:29:34 PM UTC+3, Neil Van Dyke wrote:
> dbohdan wrote on 09/07/2017 04:52 PM:
>
> The #/sec for each implementation are suspiciously similar. I wonder
> whether they're limited by something like an accounting limit imposed on
> the VPS (such as network-bytes-per-second or TCP-connections-open), or
> by some other host/network limit.
>

While the VPS provider does impose a limit on throughput, at approximately 250 req/s * 5 KB/req = 1.25 MB/s I wasn't hitting it. The numbers were very similar for different applications because at 25 concurrent connections no application reached the maximum request rate it could sustain. I thought the memory constraints wouldn't allow for more than about 25 connections, but I was mistaken. With some tuning I was able to get the applications that ran with 25 concurrent connections to run with 50 and 100. I've rerun the benchmark with 1, 25, 50, 100, and 200 connections to show how the differences between the applications emerge.

======
CONNECTIONS=1
remote-results/caddy.txt:Requests per second: 12.57 [#/sec] (mean)
remote-results/compojure.txt:Requests per second: 12.41 [#/sec] (mean)
remote-results/custom-many-places.txt:Requests per second: 12.62 [#/sec] (mean)
remote-results/custom-many.txt:Requests per second: 12.55 [#/sec] (mean)
remote-results/custom-places.txt:Requests per second: 12.56 [#/sec] (mean)
remote-results/custom-single.txt:Requests per second: 12.58 [#/sec] (mean)
remote-results/flask.txt:Requests per second: 12.44 [#/sec] (mean)
remote-results/guile.txt:Requests per second: 12.53 [#/sec] (mean)
remote-results/plug.txt:Requests per second: 12.57 [#/sec] (mean)
remote-results/scgi.txt:Requests per second: 12.46 [#/sec] (mean)
remote-results/sinatra.txt:Requests per second: 12.08 [#/sec] (mean)
remote-results/stateful.txt:Requests per second: 12.42 [#/sec] (mean)
remote-results/stateless.txt:Requests per second: 12.41 [#/sec] (mean)
======


======
CONNECTIONS=25
remote-results/caddy.txt:Requests per second: 311.19 [#/sec] (mean)
remote-results/compojure.txt:Requests per second: 309.69 [#/sec] (mean)
remote-results/custom-many-places.txt:(Killed) Total of 9153 requests completed
remote-results/custom-many.txt:Requests per second: 309.63 [#/sec] (mean)
remote-results/custom-places.txt:(Killed) Total of 13085 requests completed
remote-results/custom-single.txt:Requests per second: 308.02 [#/sec] (mean)
remote-results/flask.txt:Requests per second: 310.91 [#/sec] (mean)
remote-results/guile.txt:Requests per second: 310.28 [#/sec] (mean)
remote-results/plug.txt:Requests per second: 313.60 [#/sec] (mean)
remote-results/sinatra.txt:Requests per second: 287.03 [#/sec] (mean)
remote-results/stateful.txt:Requests per second: 298.05 [#/sec] (mean)
remote-results/stateless.txt:Requests per second: 295.90 [#/sec] (mean)
======

======
CONNECTIONS=50
remote-results/caddy.txt:Requests per second: 594.78 [#/sec] (mean)
remote-results/compojure.txt:Requests per second: 604.64 [#/sec] (mean)
remote-results/custom-many-places.txt:(Killed) Total of 9444 requests completed
remote-results/custom-many.txt:Requests per second: 598.88 [#/sec] (mean)
remote-results/custom-places.txt:(Killed) Total of 13088 requests completed
remote-results/custom-single.txt:Requests per second: 591.44 [#/sec] (mean)
remote-results/flask.txt:Requests per second: 605.75 [#/sec] (mean)
remote-results/guile.txt:Requests per second: 612.28 [#/sec] (mean)
remote-results/plug.txt:Requests per second: 617.95 [#/sec] (mean)
remote-results/scgi.txt:(Killed) Total of 12020 requests completed
remote-results/sinatra.txt:Requests per second: 367.58 [#/sec] (mean)
remote-results/stateful.txt:Requests per second: 530.00 [#/sec] (mean)
remote-results/stateless.txt:Requests per second: 546.76 [#/sec] (mean)
======

======
CONNECTIONS=100
remote-results/caddy.txt:Requests per second: 1016.63 [#/sec] (mean)
remote-results/compojure.txt:Requests per second: 1103.84 [#/sec] (mean)
remote-results/custom-many-places.txt:(Killed) Total of 9908 requests completed
remote-results/custom-many.txt:Requests per second: 1140.40 [#/sec] (mean)
remote-results/custom-places.txt:(Killed) Total of 13081 requests completed
remote-results/custom-single.txt:Requests per second: 1134.93 [#/sec] (mean)
remote-results/flask.txt:Requests per second: 1024.25 [#/sec] (mean)
remote-results/guile.txt:Requests per second: 1085.03 [#/sec] (mean)
remote-results/plug.txt:Requests per second: 1140.43 [#/sec] (mean)
remote-results/scgi.txt:(Killed) Total of 10969 requests completed
remote-results/sinatra.txt:Requests per second: 384.41 [#/sec] (mean)
remote-results/stateful.txt:Requests per second: 726.84 [#/sec] (mean)
remote-results/stateless.txt:Requests per second: 682.58 [#/sec] (mean)
======

======
CONNECTIONS=200
remote-results/caddy.txt:Requests per second: 1093.88 [#/sec] (mean)
remote-results/compojure.txt:Requests per second: 1157.95 [#/sec] (mean)
remote-results/custom-many-places.txt:(Killed) Total of 9728 requests completed
remote-results/custom-many.txt:Requests per second: 1219.76 [#/sec] (mean)
remote-results/custom-places.txt:(Killed) Total of 13154 requests completed
remote-results/custom-single.txt:Requests per second: 1171.37 [#/sec] (mean)
remote-results/flask.txt:Requests per second: 937.90 [#/sec] (mean)
remote-results/guile.txt:Requests per second: 1182.95 [#/sec] (mean)
remote-results/plug.txt:Requests per second: 1222.03 [#/sec] (mean)
remote-results/scgi.txt:(Killed) Total of 7857 requests completed
remote-results/sinatra.txt:Requests per second: 381.36 [#/sec] (mean)
remote-results/stateful.txt:(Killed) Total of 97039 requests completed
remote-results/stateless.txt:(Killed) Total of 24712 requests completed
======

For comparison, here is one concurrent connection over a local network. With greater computing resources and a lower round-trip time (the latter is probably far more important) you get a much higher request rate.

======
CONNECTIONS=1
remote-results/caddy.txt:Requests per second: 679.79 [#/sec] (mean)
remote-results/compojure.txt:Requests per second: 680.88 [#/sec] (mean)
remote-results/custom-many-places.txt:Requests per second: 842.54 [#/sec] (mean)
remote-results/custom-many.txt:Requests per second: 899.55 [#/sec] (mean)
remote-results/custom-places.txt:Requests per second: 841.69 [#/sec] (mean)
remote-results/custom-single.txt:Requests per second: 775.15 [#/sec] (mean)
remote-results/flask.txt:Requests per second: 513.99 [#/sec] (mean)
remote-results/guile.txt:Requests per second: 661.27 [#/sec] (mean)
remote-results/plug.txt:Requests per second: 678.93 [#/sec] (mean)
remote-results/scgi.txt:Requests per second: 606.11 [#/sec] (mean)
remote-results/sinatra.txt:Requests per second: 247.25 [#/sec] (mean)
remote-results/stateful.txt:Requests per second: 406.68 [#/sec] (mean)
remote-results/stateless.txt:Requests per second: 412.21 [#/sec] (mean)
======

> For purposes of this last humble-VPS benchmarking (if we can keep making
> more benchmarking work for you), you might get those initial numbers
> from places/many-places/racket-scgi by setting Racket's memory usage
> limit.

When I limit the memory usage in racket-custom to the total RAM on the VPS minus what the OS uses (through custodian-limit-memory) Racket quits with an out of memory error at the point when it would be killed by the OS. racket-scgi seems to behave the same, though I didn't look at the memory usage split between Racket and nginx when I tested it.

> For the racket-scgi + nginx setup, if nginx can't quickly be tuned to
> not be a problem itself, there are HTTP servers targeting smaller
> devices, like what OpenWrt uses for its admin interface.

But do they support SCGI?

dbohdan

Sep 9, 2017, 2:41:39 PM
to Racket Users
On Friday, September 8, 2017 at 1:09:19 PM UTC+3, Jay McCarthy wrote:
> Wow! Thanks for all of this work. It is really interesting to see how
> different the performance is on the Internet workload!

Once again, you're welcome! See my reply to Neil Van Dyke for some reasoning about the Internet workload and more results.

Neil Van Dyke

Sep 9, 2017, 4:14:53 PM
to dbohdan, Racket Users
dbohdan wrote on 09/09/2017 02:40 PM:
> When I limit the memory usage in racket-custom to the total RAM on the
> VPS minus what the OS uses (through custodian-limit-memory) Racket
> quits with an out of memory error at the point when it would be killed
> by the OS. racket-scgi seems to behave the same, though I didn't look
> at the memory usage split between Racket and nginx when I tested it.

Especially for small devices pushed to the limit, like this benchmark is
approximating... We can manage process size in Racket so that it doesn't
get OOM-killed or crash on a failed allocation at a bad time in the
Racket VM. This can be via smaller limits within Racket, the timing of
GC, application code being savvy about allocations, maybe something with
Racket Places, or being creative with some of the properties of Linux
host processes.

>> For the racket-scgi + nginx setup, if nginx can't quickly be tuned to not be a problem itself, there are HTTP servers targeting smaller devices, like what OpenWrt uses for its admin interface.
> But do they support SCGI?

I used Lighttpd several years ago, which supports SCGI, though I don't
know the current resource footprint. (I used Lighttpd as a tiny Web
server within each cloned Windows image in a research virtualization
experimental testbed, and it worked fine for that light purpose.)

For Racket Web serving on *small* devices, I'd want to try a
lightweight, hand-optimized HTTP server in pure Racket, not put
Nginx/Apache/etc. and SCGI/FastCGI in front of it. Nginx and Apache
might not be carrying their own weight on a small device, for the kinds
of applications I'd expect on a small device (unless you need to
implement an organization's complex, custom SSO authentication method,
and there's an Apache module). Other reasons for a fronting server
don't usually apply to small devices: serving high-volume static content
from the same host/port, using off-the-shelf load balancing, and
possibly an off-the-shelf attempt at enduring DoS. Racket's I/O and
language are sophisticated enough that you can do some clever
performance things, and maybe that's how you make a particular
application on a particular device viable.

Your benchmarking work has been good for getting some interest and
discussion going. Racket made a good showing in some of the benchmarks
already, but these aren't going to show off the best that can be done in
Racket, since a lot of space is yet to be explored. Two ways to move
forward:

(1) work on individual real-world applications, incidentally advancing
Racket's capabilities in ways that transfer to some other applications; and

(2) a priori generalized effort like "we want to make a Racket solution
for many simultaneous clients of trivial/nontrivial Web services on
small devices, that will usually do what people need, out of the box",
and/or similar effort for large scale Web services/applications.

The Racket community's skill base is capable of both #1 and #2 above.
But, for funding reasons (I suspect hard to find a research angle on #2,
unless it involves something novel and big with the Racket backend), I
suspect that an organic #1 is more likely than #2. A possible exception
in favor of #2 is if someone has the hobby time available to go to war
on pure benchmarks, without a motivating/guiding application (and I
could certainly appreciate the appeal of that, when one has the time).

Jon Zeppieri

Sep 9, 2017, 6:25:48 PM
to dbohdan, Racket Users
When I ran experiments similar to yours on OS X I saw some odd
scheduling behavior. It looks like after roughly 2^14 requests are
`accept`-ed, there's a *long* delay before the next one succeeds. It
appears that the program is `poll`-ing, waiting for activity, but, for
whatever reason, it doesn't receive notice of any for a long time.

- Jon

Jon Zeppieri

Sep 9, 2017, 7:52:36 PM
to dbohdan, Racket Users
Okay, it seems this occurs when the listen backlog fills up (the
listen(2) man page on OS X says that the backlog is limited to 128),
at which point the server stops sending SYN-ACKs (it appears to send
ACKs instead), and the clients respond with RSTs. It looks like the
server and clients play this game for some time, with the clients
backing off exponentially, so that their reset requests come more and
more infrequently, until the server manages to start processing
requests again.

It does seem odd, though, that the server seems to *favor* sending
ACKs to clients it can't service over responding to the ones it can.

Jon Zeppieri

Sep 9, 2017, 8:05:57 PM
to dbohdan, Racket Users
On Sat, Sep 9, 2017 at 7:52 PM, Jon Zeppieri <zepp...@gmail.com> wrote:
>
> It does seem odd, though, that the server seems to *favor* sending
> ACKs to clients it can't service over responding to the ones it can.

No, there has to be something else wrong. The tcpdump output shows
significant gaps in time while this ACK/RST game is going on (I'm
looking at a gap of 8 seconds right now), so there's plenty of time
where the server is just sitting idle. But, for whatever reason, the
`poll`-ing Racket program isn't waking up. -J

Jon Zeppieri

Sep 9, 2017, 9:37:26 PM
to dbohdan, Racket Users
As it turns out, the same thing happens with the Caddy example, too,
so it seems to be an OS X thing, rather than a Racket thing. -J

Jon Zeppieri

Sep 10, 2017, 3:05:34 PM
to dbohdan, Racket Users
On Sat, Sep 9, 2017 at 6:25 PM, Jon Zeppieri <zepp...@gmail.com> wrote:
> It looks like after roughly 2^14 requests are
> `accept`-ed, there's a *long* delay before the next one succeeds.

Okay, the above happens when the host runs out of ephemeral ports. So,
not a big deal.
---

My tests suggest the same thing (w.r.t. places) that D. Bohdan's do:
that using places consistently lowers the server throughput (even when
there are multiple cores available). Don't know why, though.

I wanted to see if inter-place communication was the bottleneck, so I
made some changes to allow the individual places to do their work
without needing to communicate:

- First, I made tcp-listeners able to be sent over place-channels, so
the only inter-place communication would be at initialization time.

- Then I realized I could accomplish the same goal with a lot less
fuss by changing the meaning of tcp-listen's reuse? parameter so that
it would set SO_REUSEPORT (instead of SO_REUSEADDR) on the listening
socket. (This allows each place to bind to the same port without
needing any inter-place communication at all.)

This did not improve throughput. But it doesn't exactly prove that
inter-place communication isn't a bottleneck, since both of the above
changes required some other changes to rktio, which, for all I know,
may have caused different performance problems. (If multiple OS
threads are polling the same socket, you need to handle the race
between them to accept an incoming connection.)

So, I'm still puzzled by this.

-J