OSv vs Docker vs Linux networking performance comparison


Waldek Kozaczuk

Mar 26, 2019, 6:29:03 PM
to OSv Development
Last week I spent some time investigating OSv performance and comparing it to Docker and Linux guests. To that end I adapted the "unikernels-v-containers" repo by Tom Goethals and extended it with 2 new apps (Rust and Node.js) and new scripts to build and deploy OSv apps on QEMU/KVM - https://github.com/wkozaczuk/unikernels-v-containers. So as you can see my focus was on OSv on QEMU/KVM and firecracker vs Linux on firecracker vs Docker, whereas Tom's paper compared OSv on Xen vs Docker (you can find the details of that discussion and a link to the paper here - https://groups.google.com/forum/#!topic/osv-dev/lhkqFfzbHwk).

Specifically I wanted to compare networking performance in terms of the number of REST API requests per second processed by a typical microservice app implemented in Rust (built using hyper), Golang and Java (built using vertx.io), running on the following:
  • OSv on QEMU/KVM
  • OSv on firecracker
  • Docker container
  • Linux on firecracker
Each app in essence implements a simple todo REST API returning a JSON payload 100-200 characters long (for example see the Java one here - https://github.com/wkozaczuk/unikernels-v-containers/blob/master/restapi/java-osv/src/main/java/rest/SimpleREST.java). The source code of all apps is under this subtree - https://github.com/wkozaczuk/unikernels-v-containers/blob/master/restapi. One thing to note is that each request always returns the same payload (I wonder if that causes the response to get cached and affects the results).
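For illustration, querying such an app is just this (the port and path are placeholders - see the run scripts for the real values):

  curl http://<guest-ip>:<port>/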

The test setup looked like this:

Host:
  • MacBook Pro with an Intel i7 4-core CPU with hyperthreading (8 cpus reported by lscpu), 16GB of RAM, running Ubuntu 18.10
  • firecracker 0.15.0
  • QEMU 2.12.0

Client machine:
The host and the client machine were connected directly to a 1 GBit ethernet switch, and the host exposed the guest IP using a bridged TAP NIC (please see the script used - https://raw.githubusercontent.com/cloudius-systems/osv/master/scripts/setup-external-bridge.sh).
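The essence of that bridged setup is roughly the following sketch (the linked script is the authoritative version; the eth0/br0/tap0 names and the IP handling are simplified here):

  sudo ip link add name br0 type bridge
  sudo ip link set br0 up
  sudo ip link set eth0 master br0      # enslave the physical NIC (host IP moves to br0)
  sudo ip tuntap add dev tap0 mode tap  # TAP device the guest attaches to
  sudo ip link set tap0 master br0
  sudo ip link set tap0 up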

You can find the scripts to start the applications on OSv and Docker here - https://github.com/wkozaczuk/unikernels-v-containers (the run* scripts). Please note the --cpuset-cpus parameter used in the docker script to limit the number of CPUs.
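For reference, the Docker side conceptually boils down to something like this (a sketch, not the exact script - the image name and port are made up):

  docker run --rm -p 8080:8080 --cpuset-cpus=0-3 todo-rest-image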



OSv on QEMU
Golang
1 CPU
Requests/sec:  24313.06
Requests/sec:  23874.74
Requests/sec:  23300.26
2 CPUs
Requests/sec:  37089.26
Requests/sec:  35475.22
Requests/sec:  33581.87
4 CPUs
Requests/sec:  42747.11
Requests/sec:  43057.99
Requests/sec:  42346.27

Java
1 CPU
Requests/sec:  41049.41
Requests/sec:  43622.81
Requests/sec:  44777.60
2 CPUs
Requests/sec:  46245.95
Requests/sec:  45746.48
Requests/sec:  46224.42
4 CPUs
Requests/sec:  48128.33
Requests/sec:  45467.53
Requests/sec:  45776.45

Rust
1 CPU
Requests/sec:  43455.34
Requests/sec:  43927.73
Requests/sec:  41100.07
2 CPUs
Requests/sec:  49120.31
Requests/sec:  49298.28
Requests/sec:  48076.98
4 CPUs
Requests/sec:  51477.57
Requests/sec:  51587.92
Requests/sec:  49118.68

OSv on firecracker
Golang
1 CPU
Requests/sec:  16721.56
Requests/sec:  16422.33
Requests/sec:  16540.24
2 CPUs
Requests/sec:  28538.35
Requests/sec:  26676.68
Requests/sec:  28100.00
4 CPUs
Requests/sec:  36448.57
Requests/sec:  33808.45
Requests/sec:  34383.20

Java
1 CPU
Requests/sec:  20191.95
Requests/sec:  21384.60
Requests/sec:  21705.82
2 CPUs
Requests/sec:  40876.17
Requests/sec:  40625.69
Requests/sec:  43766.45
4 CPUs
Requests/sec:  46336.07
Requests/sec:  45933.35
Requests/sec:  45467.22

Rust
1 CPU
Requests/sec:  23604.27
Requests/sec:  23379.86
Requests/sec:  23477.19
2 CPUs
Requests/sec:  46973.84
Requests/sec:  46590.41
Requests/sec:  46128.15
4 CPUs
Requests/sec:  49491.98
Requests/sec:  50255.20
Requests/sec:  50183.11

Linux on firecracker
Golang
1 CPU
Requests/sec:  14498.02
Requests/sec:  14373.21
Requests/sec:  14213.61
2 CPUs
Requests/sec:  28201.27
Requests/sec:  28600.92
Requests/sec:  28558.33
4 CPUs
Requests/sec:  48983.83
Requests/sec:  47590.97
Requests/sec:  45758.82

Java
1 CPU
Requests/sec:  18217.58
Requests/sec:  17709.30
Requests/sec:  19829.01
2 CPUs
Requests/sec:  33188.75
Requests/sec:  33233.55
Requests/sec:  36951.05
4 CPUs
Requests/sec:  47718.13
Requests/sec:  46456.51
Requests/sec:  48408.99

Rust
Could not get the same Rust app running on Alpine Linux, which uses musl

Docker
Golang
1 CPU
Requests/sec:  24568.70
Requests/sec:  24621.82
Requests/sec:  24451.52
2 CPUs
Requests/sec:  49366.54
Requests/sec:  48510.87
Requests/sec:  43809.97
4 CPUs
Requests/sec:  53613.09
Requests/sec:  53033.38
Requests/sec:  51422.59

Java
1 CPU
Requests/sec:  40078.52
Requests/sec:  43850.54
Requests/sec:  44588.22
2 CPUs
Requests/sec:  48792.39
Requests/sec:  51170.05
Requests/sec:  52033.04
4 CPUs
Requests/sec:  51409.24
Requests/sec:  52756.73
Requests/sec:  47126.19

Rust
1 CPU
Requests/sec:  40220.04
Requests/sec:  44601.38
Requests/sec:  44419.06
2 CPUs
Requests/sec:  53420.56
Requests/sec:  53490.33
Requests/sec:  53320.99
4 CPUs
Requests/sec:  53892.23
Requests/sec:  52814.93
Requests/sec:  54050.13

[{"name":"Write presentation","completed":false,"due":"2019-03-23T15:30:40.579556117+00:00"},{"name":"Host meetup","completed":false,"due":"2019-03-23T15:30:40.579599959+00:00"},{"name":"Run tests","completed":false,"due":"2019-03-23T15:30:40.579600610+00:00"},{"name":"Stand in traffic","completed":false,"due":"2019-03-23T15:30:40.579601081+00:00"},{"name":"Learn Rust","completed":false,"due":"2019-03-23T15:30:40.579601548+00:00"}]-----------------------------------
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.86ms    1.20ms  30.81ms   62.92%
    Req/Sec     5.42k   175.14     5.67k    87.71%
  1622198 requests in 30.10s, 841.55MB read
Requests/sec:  53892.23
Transfer/sec:     27.96MB
-----------------------------------
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.90ms    1.19ms   8.98ms   58.18%
    Req/Sec     5.31k   324.18     5.66k    90.10%
  1589778 requests in 30.10s, 824.73MB read
Requests/sec:  52814.93
Transfer/sec:     27.40MB
-----------------------------------
  10 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.85ms    1.14ms   8.39ms   54.70%
    Req/Sec     5.44k   204.22     7.38k    92.12%
  1626902 requests in 30.10s, 843.99MB read
Requests/sec:  54050.13
Transfer/sec:     28.04MB
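The three reports above match the Docker/Rust 4-CPU numbers, and are consistent with a wrk invocation along the lines of (the URL is a placeholder - the exact one is in the test scripts):

  wrk -t10 -c100 -d30s http://<server-ip>:<port>/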

I am also enclosing an example of an iperf run between the client and server machine to illustrate the raw network bandwidth available (BTW, whether I tested against iperf running natively on the host or on OSv on QEMU and firecracker, I got pretty much identical results of ~940 Mbits/sec - see https://github.com/wkozaczuk/unikernels-v-containers/tree/master/test_results/remote).
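Based on the output format, the transcript below corresponds to an iperf3 invocation roughly like this (5201 is the default port):

  iperf3 -c 192.168.1.102 -t 30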

Connecting to host 192.168.1.102, port 5201
[  5] local 192.168.1.98 port 65179 connected to 192.168.1.102 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   111 MBytes   930 Mbits/sec
[  5]   1.00-2.00   sec   111 MBytes   932 Mbits/sec
[  5]   2.00-3.00   sec   112 MBytes   938 Mbits/sec
[  5]   3.00-4.00   sec   112 MBytes   939 Mbits/sec
[  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec
[  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec
[  5]   6.00-7.00   sec   112 MBytes   940 Mbits/sec
[  5]   7.00-8.00   sec   112 MBytes   940 Mbits/sec
[  5]   8.00-9.00   sec   112 MBytes   941 Mbits/sec
[  5]   9.00-10.00  sec   112 MBytes   941 Mbits/sec
[  5]  10.00-11.00  sec   112 MBytes   939 Mbits/sec
[  5]  11.00-12.00  sec   112 MBytes   941 Mbits/sec
[  5]  12.00-13.00  sec   112 MBytes   941 Mbits/sec
[  5]  13.00-14.00  sec   112 MBytes   942 Mbits/sec
[  5]  14.00-15.00  sec   112 MBytes   941 Mbits/sec
[  5]  15.00-16.00  sec   111 MBytes   927 Mbits/sec
[  5]  16.00-17.00  sec   112 MBytes   941 Mbits/sec
[  5]  17.00-18.00  sec   112 MBytes   942 Mbits/sec
[  5]  18.00-19.00  sec   112 MBytes   941 Mbits/sec
[  5]  19.00-20.00  sec   112 MBytes   941 Mbits/sec
[  5]  20.00-21.00  sec   112 MBytes   936 Mbits/sec
[  5]  21.00-22.00  sec   112 MBytes   940 Mbits/sec
[  5]  22.00-23.00  sec   112 MBytes   941 Mbits/sec
[  5]  23.00-24.00  sec   112 MBytes   941 Mbits/sec
[  5]  24.00-25.00  sec   112 MBytes   941 Mbits/sec
[  5]  25.00-26.00  sec   112 MBytes   941 Mbits/sec
[  5]  26.00-27.00  sec   112 MBytes   940 Mbits/sec
[  5]  27.00-28.00  sec   112 MBytes   941 Mbits/sec
[  5]  28.00-29.00  sec   112 MBytes   940 Mbits/sec
[  5]  29.00-30.00  sec   112 MBytes   941 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-30.00  sec  3.28 GBytes   939 Mbits/sec                  sender
[  5]   0.00-30.00  sec  3.28 GBytes   939 Mbits/sec                  receiver

iperf Done.


Observations/Conclusions
  • OSv fares a little better on QEMU/KVM than on firecracker, by anywhere from ~5% to ~20% (Golang). Also please note the vast difference between the 1-CPU test results on firecracker and QEMU (hyperthreading is handled differently). On QEMU there is only a small bump from 1 to 2 to 4 CPUs except for Golang; on firecracker there is an almost ~90-100% bump from 1 to 2 CPUs.
  • When you compare OSv on firecracker vs Linux on firecracker (comparing OSv on QEMU would, I guess, be unfair) you can see that:
    • The Golang app on OSv was ~15% faster than on Linux with 1 CPU, almost identical with 2 CPUs, and ~30% slower than on Linux with 4 CPUs (I did check that the Golang runtime properly detects the number of CPUs)
    • The Java app on OSv was ~5% faster with 1 CPU, ~20% faster with 2 CPUs, and slightly slower with 4 CPUs
    • I could not run the Rust app on Linux because the guest was an Alpine distribution built with musl, and I did not have time to get Rust to build properly for that scenario
  • When you compare OSv on QEMU/KVM vs Docker you can see that:
    • All apps running with a single CPU fare almost the same, with OSv sometimes being a little faster
    • The Java and Rust apps performed only a little better (2-10%) on Docker than on OSv
    • The Golang app scaled with the number of CPUs on OSv, but still performed much worse there (20-30%) than on Docker with 2 and 4 CPUs
  • There seems to be a bottleneck somewhere around 40-50K requests per second. Looking at one result, the raw network rate reported was around 26-28MB per second. Given that HTTP requires sending both a request and a response, possibly that is the maximum this network - the combination of the ethernet switch and the server and client machines - can handle? (Some rough numbers below.)
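Some rough numbers behind that suspicion, taken from the best wrk report above (assuming wrk's Transfer/sec counts only the response bytes it reads back):

  843.99MB / 1626902 requests  ≈ 540 bytes per response
  54050 req/s * ~540 bytes     ≈ 28 MB/s of response traffic
  1 GBit/s line rate           ≈ 112 MB/s (the iperf runs above show ~940 Mbits/sec)

So responses alone use only about a quarter of the link; requests, TCP ACKs and per-packet overhead add to that, but raw bandwidth by itself does not obviously explain a 50K ceiling - it may rather be a packets-per-second or latency limit somewhere in the path.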

Questions
  • Are there any flaws in this test setup?
  • Why does OSv not scale in some scenarios - especially when bumping from 2 to 4 CPUs? Networking bottleneck? Scheduler? Locks?
  • Could we further optimize OSv running with a single CPU (skip the global cross-CPU page allocator, etc.)?

To get even more insight I also compared how OSv on QEMU would fare against the same apps running in Docker, with wrk running on the host and firing requests locally. You can find the results under https://github.com/wkozaczuk/unikernels-v-containers/tree/master/test_results/host.

OSv on QEMU
Golang
1 CPU
Requests/sec:  25188.60
Requests/sec:  24664.43
Requests/sec:  23935.77
2 CPUs
Requests/sec:  37118.95
Requests/sec:  37108.96
Requests/sec:  35997.58
4 CPUs
Requests/sec:  49987.20
Requests/sec:  48710.74
Requests/sec:  44789.96

Java
1 CPU
Requests/sec:  43648.02
Requests/sec:  45457.98
Requests/sec:  41818.13
2 CPUs
Requests/sec:  76224.39
Requests/sec:  75734.63
Requests/sec:  70597.35
4 CPUs
Requests/sec:  80543.30
Requests/sec:  75187.46
Requests/sec:  72986.93

Rust
1 CPU
Requests/sec:  42392.75
Requests/sec:  39679.21
Requests/sec:  37871.49
2 CPUs
Requests/sec:  82484.67
Requests/sec:  83272.65
Requests/sec:  71671.13
4 CPUs
Requests/sec:  95910.23
Requests/sec:  86811.76
Requests/sec:  83213.93


Docker
Golang
1 CPU
Requests/sec:  24191.63
Requests/sec:  23574.89
Requests/sec:  23716.33
2 CPUs
Requests/sec:  34889.01
Requests/sec:  34487.01
Requests/sec:  34468.03
4 CPUs
Requests/sec:  48850.24
Requests/sec:  48690.09
Requests/sec:  48356.66

Java
1 CPU
Requests/sec:  32267.09
Requests/sec:  34670.41
Requests/sec:  34828.68
2 CPUs
Requests/sec:  47533.94
Requests/sec:  50734.05
Requests/sec:  50203.98
4 CPUs
Requests/sec:  69644.61
Requests/sec:  72704.40
Requests/sec:  70805.84

Rust
1 CPU
Requests/sec:  37061.52
Requests/sec:  36637.62
Requests/sec:  33154.57
2 CPUs
Requests/sec:  51743.94
Requests/sec:  51476.78
Requests/sec:  50934.27
4 CPUs
Requests/sec:  75125.41
Requests/sec:  74051.27
Requests/sec:  74434.78
  • Does this test even make sense?
  • As you can see OSv outperforms Docker in this scenario to various degrees, by 5-20%. Can anybody explain why? Is it because in this case both wrk and the apps are on the same machine, and there are fewer context switches between kernel and user mode, in favor of OSv? Does it mean that we could benefit from a setup where a load balancer (for example haproxy or squid) runs on the same host in user mode and forwards to single-CPU OSv instances, vs a single OSv guest with multiple CPUs?
Looking forward to hearing what others think.

Waldek




Dor Laor

Mar 26, 2019, 8:32:00 PM
to Waldek Kozaczuk, OSv Development
While the performance numbers indicate something, a MacBook is a horrible environment for performance
testing. There are effects of other desktop apps, hyperthreading, etc.
Also a 1gbps network can be a bottleneck. Every benchmark case should have a matching performance
analysis and point to the bottleneck reason - cpu/networking/context switching/locking/filesystem/..
Just a hyperthread vs a different thread in another core is a very significant change.
You need to pin the qemu threads in the host to the right physical threads.

Better to run on a good physical server (like i3.metal on AWS or similar - it could be smaller, but not 2 cores) and
track all the metrics appropriately. Best is to isolate workloads (and make sure they scale linearly too) in terms of cpu/mem/net/disk, and only then
show how a more complex workload performs.


Pekka Enberg

Mar 27, 2019, 4:48:44 AM
to Waldek Kozaczuk, OSv Development
Hi Waldek!

On Wed, Mar 27, 2019 at 12:29 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Last week I spent some time investigating OSv performance and comparing it to Docker and Linux guests.

Nice!

On Wed, Mar 27, 2019 at 12:29 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
The test setup looked like this:

Host:
  • MacBook Pro with an Intel i7 4-core CPU with hyperthreading (8 cpus reported by lscpu), 16GB of RAM, running Ubuntu 18.10
  • firecracker 0.15.0
  • QEMU 2.12.0

Client machine:
The host and the client machine were connected directly to a 1 GBit ethernet switch, and the host exposed the guest IP using a bridged TAP NIC (please see the script used - https://raw.githubusercontent.com/cloudius-systems/osv/master/scripts/setup-external-bridge.sh).

You can find the scripts to start the applications on OSv and Docker here - https://github.com/wkozaczuk/unikernels-v-containers (the run* scripts). Please note the --cpuset-cpus parameter used in the docker script to limit the number of CPUs.


Some questions about the evaluation setup and measurements:

- Did you establish a baseline with bare metal configuration?
- Did you measure CPU utilization during the throughput tests? This is important because you could be hitting CPU limits with QEMU and Firecracker because of software processing needed by virtualized networking.
- Are the QEMU and Firecracker tests using virtio or vhost?
- Is Docker also configured to use the bridge device? If not, QEMU and Firecracker also have some additional overheads from the bridging.
- Is multiqueue enabled for QEMU and Firecracker? If not, this would limit the ability to leverage multiple vCPUs.
- Is QEMU or Firecracker setting CPU affinity for the vCPU threads? If not, two or more vCPUs could be running on the same physical CPU, which obviously limits throughput.
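For example, with QEMU you can look up the vCPU thread IDs in the monitor and pin them from the host - a rough sketch:

  (qemu) info cpus                    # prints the host thread_id of each vCPU
  taskset -pc 2 <thread-id-of-vCPU0>
  taskset -pc 4 <thread-id-of-vCPU1>

(Pick CPUs from different physical cores - "lscpu -e" shows the core topology.)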

Regards,

- Pekka 

Pekka Enberg

Mar 27, 2019, 4:51:05 AM
to Waldek Kozaczuk, OSv Development
Oh forgot the obvious:

- Is CPU scaling governor set to performance? Also, if the CPU has TurboBoost, is it disabled?
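For example, on Linux something like this (the no_turbo knob applies to the intel_pstate driver):

  # set all cores to the "performance" governor
  for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance | sudo tee $f
  done
  # disable TurboBoost
  echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo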

- Pekka 

Matias Vara

Mar 27, 2019, 6:59:30 AM
to Waldek Kozaczuk, OSv Development
Hello Waldek, 

The experiments are very interesting. I showed something similar at OSSummit'18 (see https://github.com/torokernel/papers/blob/master/OSSummit18.pdf). What I do not understand from your conclusions is why you expect OSv to scale with the number of cores. Maybe I did not understand something.

Matias   

Dor Laor

Mar 27, 2019, 1:50:30 PM
to Matias Vara, Waldek Kozaczuk, OSv Development
On Wed, Mar 27, 2019 at 3:59 AM Matias Vara <matia...@gmail.com> wrote:
Hello Waldek, 

The experiments are very interesting. I showed something similar at OSSummit'18 (see https://github.com/torokernel/papers/blob/master/OSSummit18.pdf). What I do not understand from your conclusions is why you expect OSv to scale with the number of cores. Maybe I did not understand something.

Because it's designed to scale, and it does scale most of the time with a proper setup.
There are sometimes issues related to scheduling with spin locks that affect scaling a lot, but OSv
should handle them well; in the past we've done a good amount of tests and shared the results.

Waldek Kozaczuk

Mar 27, 2019, 6:36:02 PM
to OSv Development
Overall I must say I am not a performance tuning/measuring expert and clearly have lots of things to learn ;-) BTW can you point me to any performance setup/procedures/docs that you guys used with OSv?
I also feel I have tried to kill too many birds with one stone. Ideally I should have divided the whole thing into 3 categories:
- OSv on firecracker vs QEMU
- OSv vs Docker
- OSv vs Linux guest

On Tuesday, March 26, 2019 at 8:32:00 PM UTC-4, Dor Laor wrote:
While the performance numbers indicate something, a MacBook is a horrible environment for performance
testing. There are effects of other desktop apps, hyperthreading, etc.
Well that is what I have available in my home lab :-) I understand you are suggesting that apps running on the MacBook might affect and skew the results. I made sure the only apps open were one or two terminal windows. I also had mpstat open, and most of the time the CPUs were idle when tests were not running. But I get your point that ideally I should use a proper headless server machine. I also get the effect of hyperthreading - is there a way to switch it off in Linux by some kind of boot parameter?
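Something like this, maybe (just a guess on my side - I understand the runtime knob needs a recent kernel):

  nosmt                 # kernel boot parameter
  # or at runtime:
  echo off | sudo tee /sys/devices/system/cpu/smt/control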

Also a 1gbps network can be a bottleneck.
Very likely, I have been suspecting the same thing.

Every benchmark case should have a matching performance
analysis and point to the bottleneck reason - cpu/networking/context switching/locking/filesystem/..
To figure this out I guess I would need to use the OSv tracing capability - https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py

Just a hyperthread vs a different thread in another core is a very significant change.
You need to pin the qemu threads in the host to the right physical threads.
I was not even aware that one can pin to specific CPUs. What parameters do I pass to qemu?

Better to run on a good physical server (like i3.metal on AWS or similar - it could be smaller, but not 2 cores) and
track all the metrics appropriately. Best is to isolate workloads (and make sure they scale linearly too) in terms of cpu/mem/net/disk, and only then
show how a more complex workload performs.
I cannot afford $5 per hour ;-) Unless I have a fully automated test suite.

My dream would be to have an automated process I could trigger with a single click of a button that would:
1) Use a CloudFormation template to create a VPC with all the components of the test environment.
2) Automatically start each instance under test and the corresponding test client.
3) Automatically collect all test results (both wrk and possibly tracing data) and put them somewhere in S3. 

Finally, if I had a suite of visualization tools that would generate whatever graphs I need to analyze, it would save soooooo much time. Possibly under an hour => then I could pay 5 bucks for it ;-)

But it takes time to build one ;-)


Waldek Kozaczuk

Mar 27, 2019, 6:49:48 PM
to OSv Development


On Wednesday, March 27, 2019 at 4:51:05 AM UTC-4, Pekka Enberg wrote:


On Wed, Mar 27, 2019 at 10:48 AM Pekka Enberg <pen...@scylladb.com> wrote:
Hi Waldek!

On Wed, Mar 27, 2019 at 12:29 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Last week I spent some time investigating OSv performance and comparing it to Docker and Linux guests.

Nice!

On Wed, Mar 27, 2019 at 12:29 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
The test setup looked like this:

Host:
  • MacBook Pro with an Intel i7 4-core CPU with hyperthreading (8 cpus reported by lscpu), 16GB of RAM, running Ubuntu 18.10
  • firecracker 0.15.0
  • QEMU 2.12.0

Client machine:
The host and the client machine were connected directly to a 1 GBit ethernet switch, and the host exposed the guest IP using a bridged TAP NIC (please see the script used - https://raw.githubusercontent.com/cloudius-systems/osv/master/scripts/setup-external-bridge.sh).

You can find the scripts to start the applications on OSv and Docker here - https://github.com/wkozaczuk/unikernels-v-containers (the run* scripts). Please note the --cpuset-cpus parameter used in the docker script to limit the number of CPUs.


Some questions about the evaluation setup and measurements:

- Did you establish a baseline with bare metal configuration?
How would I create a baseline with a bare metal configuration for 1, 2, 4 CPUs? With docker or qemu I can specify the number of cpus.

- Did you measure CPU utilization during the throughput tests? This is important because you could be hitting CPU limits with QEMU and Firecracker because of software processing needed by virtualized networking.
Nothing rigorous. I had mpstat running and I could see that during the 1 and 2 cpu tests they were pretty highly utilized (80-90%) but only 40-50% for the 4 cpu tests. But nothing I recorded.
  
- Are the QEMU and Firecracker tests using virtio or vhost?
I thought OSv only supports virtio. Sorry to be ignorant. I heard the terms but what is actually the difference between vhost and virtio?
 
- Is Docker also configured to use the bridge device? If not, QEMU and Firecracker also have some additional overheads from the bridging.
I need to check. Per this - https://raw.githubusercontent.com/wkozaczuk/unikernels-v-containers/master/run-rest-in-docker.sh - I am sure I would expose the container port to the host. So I think I was bypassing the bridge.

BTW is there a way to run OSv on QEMU without a bridge to make it visible on the LAN?

- Is multiqueue enabled for QEMU and Firecracker? If not, this would limit the ability to leverage multiple vCPUs.
No idea what you are talking about ;-) 
- Is QEMU or Firecracker setting CPU affinity for the vCPU threads? If not, two or more vCPUs could be running on the same physical CPU, which obviously limits throughput.
Not sure. I doubt it. I have to investigate.

Oh forgot the obvious:

- Is CPU scaling governor set to performance? Also, if the CPU has TurboBoost, is it disabled?
No idea. Need to read up on that and check ;-)

- Pekka 

Waldek Kozaczuk

Mar 27, 2019, 6:52:22 PM
to OSv Development
Thanks for the article. I definitely heard about Toro. 

I have not had time to thoroughly read all the slides, but it looks like a unikernel. I did not see it advertised as such, though. Can it run unmodified Linux executables like the JVM?

Dor Laor

Mar 27, 2019, 8:24:23 PM
to Waldek Kozaczuk, OSv Development
On Wed, Mar 27, 2019 at 3:36 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Overall I must say I am not a performance tuning/measuring expert and clearly have lots of things to learn ;-) BTW can you point me to any performance setup/procedures/docs that you guys used with OSv?

I tried to look but didn't find much. I do remember we played with all these options when we implemented different
scheduling options to deal with spin locks, network performance, etc.
 
I also feel I have tried to kill too many birds with one stone. Ideally I should have divided the whole thing into 3 categories:
- OSv on firecracker vs QEMU
- OSv vs Docker
- OSv vs Linux guest

On Tuesday, March 26, 2019 at 8:32:00 PM UTC-4, Dor Laor wrote:
While the performance numbers indicate something, a MacBook is a horrible environment for performance
testing. There are effects of other desktop apps, hyperthreading, etc.
Well that is what I have available in my home lab :-) I understand you are suggesting that apps running on the MacBook might affect and skew the results. I made sure the only apps open were one or two terminal windows. I also had mpstat open, and most of the time the CPUs were idle when tests were not running. But I get your point that ideally I should use a proper headless server machine. I also get the effect of hyperthreading - is there a way to switch it off in Linux by some kind of boot parameter?

I think it's mainly in the BIOS. You can pin a vcpu to a hyperthread running on a different core and thus eliminate 2 hyperthreads on the same core, and get almost the real thing this way.
 
 

Also a 1gbps network can be a bottleneck.
Very likely, I have been suspecting the same thing.

Every benchmark case should have a matching performance
analysis and point to the bottleneck reason - cpu/networking/context switching/locking/filesystem/..
To figure this out I guess I would need to use the OSv tracing capability - https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py

Yes, it has lots of good tactics, using perf and the tracer, and also figuring out where the cpu time goes.
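For example, sampling the whole host for the duration of a run (standard perf usage):

  sudo perf record -a -g -- sleep 30
  sudo perf report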
 

Just a hyperthread vs a different thread in another core is a very significant change.
You need to pin the qemu threads in the host to the right physical threads.
I was not even aware that one can pin to specific CPUs. What parameters do I pass to qemu?

I forgot, need to read the manual.. everything is supported.
 

Better to run on a good physical server (like i3.metal on AWS or similar - it could be smaller, but not 2 cores) and
track all the metrics appropriately. Best is to isolate workloads (and make sure they scale linearly too) in terms of cpu/mem/net/disk, and only then
show how a more complex workload performs.
I cannot afford $5 per hour ;-) Unless I have a fully automated test suite.

My dream would be to have an automated process I could trigger with a single click of a button that would:
1) Use a CloudFormation template to create a VPC with all the components of the test environment.
2) Automatically start each instance under test and the corresponding test client.
3) Automatically collect all test results (both wrk and possibly tracing data) and put them somewhere in S3. 

Finally, if I had a suite of visualization tools that would generate whatever graphs I need to analyze, it would save soooooo much time. Possibly under an hour => then I could pay 5 bucks for it ;-)

But it takes time to build one ;-)

I think it's possible to continue with your desktop (but try to use Linux) and focus on 1,2 vcpus and analyze each
test carefully. Try to realize what's the bottleneck of each test.
 

Waldek Kozaczuk

Mar 27, 2019, 9:14:57 PM
to OSv Development
My Mac is configured as a triple-boot machine that can boot OSX, Ubuntu or Windows 10. All the tests were run on the MacBook Pro booted into Linux on bare metal. Do you mean that my test client should also run on a bare-metal Linux machine?

Dor Laor

Mar 27, 2019, 9:21:02 PM
to Waldek Kozaczuk, OSv Development
No, the client machine shouldn't matter as long as it generates enough requests, and it seems this is the case.
 
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.

Pekka Enberg

Mar 28, 2019, 6:09:53 AM
to Waldek Kozaczuk, OSv Development
Hi Waldek,

On Thu, Mar 28, 2019 at 12:49 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Some questions about the evaluation setup and measurements:

- Did you establish a baseline with bare metal configuration?
How would I create a baseline with a bare metal configuration for 1, 2, 4 CPUs? With docker or qemu I can specify the number of cpus.

You can use the "taskset" command to restrict a process to run on specific CPUs. But I think a 4 CPU bare metal baseline is sufficient, because then you know the maximum expected throughput.
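For example (a sketch - the jar name is made up):

  # bare-metal baseline restricted to 4 CPUs
  taskset -c 0-3 java -jar rest-api.jar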
 

- Did you measure CPU utilization during the throughput tests? This is important because you could be hitting CPU limits with QEMU and Firecracker because of software processing needed by virtualized networking.
Nothing rigorous. I had mpstat running and I could see that during the 1 and 2 cpu tests they were pretty highly utilized (80-90%) but only 40-50% for the 4 cpu tests. But nothing I recorded.

I would encourage you to run something like "vmstat" or "sar" in the background to obtain the average CPU utilization for the run, so that you can compare the results.
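For example (standard sysstat/procps tools, one sample per second for a 30-second run):

  sar -u 1 30 > cpu.log &       # average CPU utilization
  # or
  vmstat 1 30 > vmstat.log &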

The drop in CPU utilization suggests that you're bound by the network. How big are the HTTP requests and responses your test is generating? You could be hitting the ~110 MB/s bandwidth limit of a 1 GbE NIC. Also, note that 40-50% CPU utilization is quite low for a throughput test, so you're mostly seeing the impact of latency here. This is where the network bridge configuration becomes relevant too. The more layers you have, the higher the latency is going to be.

  
- Are the QEMU and Firecracker tests using virtio or vhost?
I thought OSv only supports virtio. Sorry to be ignorant. I heard the terms but what is actually the difference between vhost and virtio?

Sorry for not being explicit. Virtio is the guest/hypervisor I/O interface, which OSv also supports. However, there are two host-side implementations of the I/O model: virtio (in host userspace) and vhost (in the host kernel). You can think of vhost as a host kernel accelerator for virtio.


The main difference is that vhost is supposed to be faster than virtio because it reduces VM exits.
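With QEMU, vhost is enabled on the tap netdev, roughly like this (a sketch):

  qemu-system-x86_64 ... \
      -netdev tap,id=net0,ifname=tap0,script=no,vhost=on \
      -device virtio-net-pci,netdev=net0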
 
 
 
- Is Docker also configured to use the bridge device? If not, QEMU and Firecracker also have some additional overheads from the bridging.
I need to check. Per this - https://raw.githubusercontent.com/wkozaczuk/unikernels-v-containers/master/run-rest-in-docker.sh - I am sure I would expose the container port to the host. So I think I was bypassing the bridge.

BTW is there a way to run OSv on QEMU without a bridge to make it visible on the LAN?

AFAICT, this would require either device assignment or SR-IOV, but neither is supported by OSv due to lack of (real) hardware device drivers.
 

- Is multiqueue enabled for QEMU and Firecracker? If not, this would limit the ability to leverage multiple vCPUs.
No idea what you are talking about ;-) 

IIRC, there's a "queues" option you pass to "-netdev" with QEMU. No idea about Firecracker.
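If I remember the syntax right, something like this (vectors is conventionally 2*queues+2):

  -netdev tap,id=net0,vhost=on,queues=4 \
  -device virtio-net-pci,netdev=net0,mq=on,vectors=10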

That said, we have the following comment in virtio-net drivers:

    //
    // We currently have only a single TX queue. Select a proper TXq here when
    // we implement a multi-queue.
    //

So perhaps we don't even support multiqueue in OSv at the moment... 

- Pekka