Hi all,
Last week I moved our Elixir application (https://github.com/evercam/evercam-media) from EC2 to Hetzner and changed the storage functionality to read from and write to local disk (RAID6) instead of S3. The app has supervised workers fetching around 100 images per second and serving individual ones when requested.
Since the move the app has started to behave strangely: some simple requests take minutes to complete, and memory usage climbs steadily. To pinpoint where the issue comes from, I wrapped some of the controller and storage code in Task async/await with a timeout of a couple of seconds. This uncovered a pattern: right after startup everything works normally, then after a couple of minutes the functions that touch the filesystem start to time out more and more often, until the app becomes completely unresponsive. At that point I have to kill the process and restart it.
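The wrapping looks roughly like this (a simplified sketch; `save_with_timeout` and the argument names are illustrative, only `Storage.save/4` is from the actual code). When `Task.await/2` hits the timeout it exits with `{:timeout, {Task, :await, ...}}`, which is exactly what shows up in the logs below:

```elixir
defmodule EvercamMedia.Snapshot.TimedStorage do
  # Run the storage call in a separate task and give up after 2 seconds.
  # If the underlying file operation hangs, the await exits with
  # {:timeout, {Task, :await, [%Task{...}, 2000]}}.
  def save_with_timeout(camera_exid, timestamp, image, notes) do
    Task.async(fn ->
      EvercamMedia.Snapshot.Storage.save(camera_exid, timestamp, image, notes)
    end)
    |> Task.await(2_000)
  end
end
```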
This is an example of what I'd see in the logs:
[error] {:timeout, {Task, :await, [%Task{owner: #PID<0.3408.2>, pid: #PID<0.3409.2>, ref: #Reference<0.0.11.34035>}, 2000]}}
[error] (elixir) lib/task.ex:336: Task.await/2
(evercam_media) lib/evercam_media/snapshot/storage.ex:54: EvercamMedia.Snapshot.Storage.save/4
(evercam_media) web/controllers/snapshot_controller.ex:226: anonymous fn/2 in EvercamMedia.SnapshotController.parse_camera_response/3
[error] {:timeout, {Task, :await, [%Task{owner: #PID<0.4078.2>, pid: #PID<0.4092.2>, ref: #Reference<0.0.11.34552>}, 1000]}}
[error] (elixir) lib/task.ex:336: Task.await/2
(evercam_media) lib/evercam_media/snapshot/storage.ex:93: EvercamMedia.Snapshot.Storage.exists?/3
(evercam_media) web/controllers/snapshot_controller.ex:166: EvercamMedia.SnapshotController.snapshot_data/3
(elixir) lib/task/supervised.ex:89: Task.Supervised.do_apply/2
(elixir) lib/task/supervised.ex:40: Task.Supervised.reply/5
(stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
It doesn't look like I'm hitting any limit on the server: according to iotop and New Relic, disk I/O utilization stays below 5% (~10 MB/s).
I also benchmarked using dd just to be sure:
$ dd bs=1M count=2560 if=/dev/zero of=test oflag=dsync
2684354560 bytes (2.7 GB) copied, 9.17346 s, 293 MB/s
File limits in ulimit were already set pretty high:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257435
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 2000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257435
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Port usage is nowhere near the limit either:
iex(evercam_media@localhost)1> length(:erlang.ports)
943
iex(evercam_media@localhost)2> :erlang.system_info(:port_limit)
131072
I also tried:
- setting async-threads (+A) to 0, 10, 100, 1024
- enabling and disabling kernel-poll (+K)
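For reference, this is how I passed the emulator flags (assuming a mix-based start; with an OTP release they would go into vm.args instead):

```shell
# Assumption: the app is started via mix; adjust for your deployment.
# +A sets the size of the async thread pool, +K toggles kernel poll.
ERL_FLAGS="+A 100 +K true" mix phoenix.server
```

I verified the settings took effect from the running node with `:erlang.system_info(:thread_pool_size)` and `:erlang.system_info(:kernel_poll)`.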
At this point I'm stuck. Any ideas?