Performance issues with File operations


Miloš Mošić

Feb 29, 2016, 10:48:14 AM
to elixir-lang-talk
Hi all,

Last week I moved our Elixir application (https://github.com/evercam/evercam-media) from EC2 to Hetzner and changed the storage functionality to read and write to local disk (using RAID6) instead of using S3. The app has supervised workers fetching around 100 images per second and serving individual ones when requested.

Since that move the app started to behave strangely: some simple requests would take minutes to complete and the app would use more and more memory. In order to pinpoint where the issues came from, I wrapped some of the controller and storage code in Task async/await with a timeout of a couple of seconds. This uncovered a pattern: upon starting everything would work normally, and after a couple of minutes the functions that use file operations would start to time out more and more, until the app became completely unresponsive. At that point I'd have to kill the process and start it again.
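For reference, the wrapping looks roughly like this (a simplified sketch - the argument names are placeholders, Storage.save/4 is the real function from the traces below):

task = Task.async(fn ->
  EvercamMedia.Snapshot.Storage.save(camera_exid, timestamp, image, notes)
end)

# Exits with {:timeout, {Task, :await, ...}} if the call doesn't finish in 2 seconds
Task.await(task, 2_000)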

This is an example of what I'd see in the logs:

[error] {:timeout, {Task, :await, [%Task{owner: #PID<0.3408.2>, pid: #PID<0.3409.2>, ref: #Reference<0.0.11.34035>}, 2000]}}
[error]     (elixir) lib/task.ex:336: Task.await/2
    (evercam_media) lib/evercam_media/snapshot/storage.ex:54: EvercamMedia.Snapshot.Storage.save/4
    (evercam_media) web/controllers/snapshot_controller.ex:226: anonymous fn/2 in EvercamMedia.SnapshotController.parse_camera_response/3

[error] {:timeout, {Task, :await, [%Task{owner: #PID<0.4078.2>, pid: #PID<0.4092.2>, ref: #Reference<0.0.11.34552>}, 1000]}}
[error]     (elixir) lib/task.ex:336: Task.await/2
    (evercam_media) lib/evercam_media/snapshot/storage.ex:93: EvercamMedia.Snapshot.Storage.exists?/3
    (evercam_media) web/controllers/snapshot_controller.ex:166: EvercamMedia.SnapshotController.snapshot_data/3
    (elixir) lib/task/supervised.ex:89: Task.Supervised.do_apply/2
    (elixir) lib/task/supervised.ex:40: Task.Supervised.reply/5
    (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3


It doesn't look like I'm hitting the server's limits at any point: according to iotop and New Relic, disk I/O utilization is less than 5% (about 10 MB/s).

I also benchmarked using dd just to be sure:

$ dd bs=1M count=2560 if=/dev/zero of=test oflag=dsync
2684354560 bytes (2.7 GB) copied, 9.17346 s, 293 MB/s

File limits in ulimit were already set pretty high:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 257435
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 2000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 257435
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Port limits too:

iex(evercam_media@localhost)1> length(:erlang.ports)
943
iex(evercam_media@localhost)2> :erlang.system_info(:port_limit)
131072

I also tried the following (passing the flags as shown in the sketch below):
 - setting async-threads (+A) to 0, 10, 100, and 1024
 - enabling and disabling kernel-poll (+K)
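For reference, I'm passing the flags roughly like this (the exact mix task depends on how the app is started; phoenix.server here is just an example):

$ elixir --erl "+A 100 +K true" -S mix phoenix.server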


At this point I'm stuck. Any ideas?

Ben Wilson

Feb 29, 2016, 11:05:08 AM
to elixir-lang-talk
Have you been able to hook up a remote console with :observer and look at what the Erlang I/O, memory, and other statistics look like? Nothing strikes me as obviously wrong when looking at your code. If you're getting slowdowns, that could very well be a process mailbox getting long, but it's hard to say without insight into the production system.
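If you haven't done that before, something along these lines should get you there (a sketch - the cookie is a placeholder, and it assumes you're on the same box as the running node, e.g. over SSH with X forwarding):

$ iex --sname debug --cookie <cookie>
iex(debug@yourhost)1> Node.connect(:"evercam_media@localhost")
true
iex(debug@yourhost)2> :observer.start()
# then pick the evercam_media node under Observer's Nodes menu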

Robert Virding

Feb 29, 2016, 11:08:14 AM
to elixir-lang-talk
Have you experimented with increasing the size of the async thread pool ("erl +A ...")? I don't know at how many file operations per second this becomes interesting.

Robert

Miloš Mošić

Feb 29, 2016, 11:43:14 AM
to elixir-lang-talk
I tried connecting just now; here is a screenshot of the load charts: http://i.imgur.com/evxbf1T.png

And stats after 54 seconds: http://i.imgur.com/9EyzxnL.png and after 8 minutes: http://i.imgur.com/VlNCzZE.png

At the time of the last screenshot I started getting hundreds of timeouts, so I restarted the application. How can I check if a mailbox is growing?

I can give you access to the server if that's easier?



Miloš Mošić

Feb 29, 2016, 11:46:37 AM
to elixir-lang-talk
Yes, I mentioned it briefly at the end of the mail. I tried changing +A from the default 10 to 0, 100, and 1024 and didn't see any obvious difference.

But I keep thinking that those kinds of optimizations shouldn't be necessary here and that I'm probably doing something else wrong.

Ben Wilson

Feb 29, 2016, 11:57:17 AM
to elixir-lang-talk
I appreciate your confidence in my trustworthiness, but as I don't know of any way to connect to a node as "read-only", I'm gonna advise against giving me access to your server.

If you select the Processes tab, you can sort by the MsgQ column, which tells you how many messages are in the queue of a given process. If a process has more than just a few, that's an indicator that you're sending it more messages than it can handle.
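If you'd rather check from an attached shell instead of the UI, something like this (a sketch) lists the processes with the longest queues:

Process.list()
|> Enum.map(fn pid -> {pid, Process.info(pid, :message_queue_len)} end)
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)   # skip processes that died meanwhile
|> Enum.sort_by(fn {_pid, {:message_queue_len, len}} -> -len end)
|> Enum.take(10)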

You have a LOT of binary memory usage, which isn't that surprising given that you're dealing with images, but it does seem to grow quite a lot, which suggests that some processes are holding onto binary references somewhere.

Robert Virding

Feb 29, 2016, 12:07:44 PM
to elixir-lang-talk
Also the number of processes grows. From when are the load charts? Near the beginning of the run or at around 8 minutes?

Robert

Miloš Mošić

Feb 29, 2016, 12:51:35 PM
to elixir-lang-talk
Ben,

I already trusted you when I put ex_aws into production, so this isn't that big of a leap :)

In the first couple of minutes MsgQ is 0-1 for all processes: http://i.imgur.com/tHjTA7P.png; then, when the timeouts start piling up, it looks like this: http://i.imgur.com/5bgdfvx.png

Robert,

that was closer to 8 minutes. After taking the above screenshots I opened the load charts on a fresh BEAM process; this time it took 18 minutes to freeze (using Observer seems to have a Hawthorne effect on the server :) ).

These are the screenshots for that crash: http://imgur.com/a/cJda9

Ben Wilson

Feb 29, 2016, 12:56:42 PM
to elixir-l...@googlegroups.com
Wow, yeah, it looks like there's a fair bit of gridlock at that point. Whether it's all because of Logger or not is hard to say while Logger is still that locked up. Are you logging a bazillion messages? I'm not entirely sure what's causing the lockup; maybe someone else will have an idea?


Miloš Mošić

Feb 29, 2016, 1:17:04 PM
to elixir-lang-talk
Not really; it's logging about 30 lines per second. I actually decreased the amount of logging in the last couple of weeks; we used to log a lot more. I'll try decreasing it a bit further now and see if that changes anything.

Btw, I forgot to mention that I'm caching images for a couple of seconds after receiving them (using con_cache/ETS), so the binary memory usage doesn't seem that high to me. The server this is running on has 64 GB of memory, so that shouldn't be an issue.

Saša Jurić

Feb 29, 2016, 2:59:20 PM
to elixir-lang-talk
I just took a casual glance at the Erlang/OTP source. If I understand it right, read_file and write_file (which are used by Elixir's File.read/write) are handled inside a registered server (file_server). Unless I'm missing something, this means that all reads/writes are serialized, which would explain why your code is timing out (and why +A doesn't make any difference - all the operations go through a single file_server process).

Perhaps try working with files in raw mode, in which case you'd be reading straight from your own process, so you'd get concurrent operations. Though I don't know how much that will help (or possibly even hurt) if it's all hitting the same disk.
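A minimal sketch of what I mean, using the :file module directly (path and image are placeholders):

# With :raw the operation is performed by the calling process itself,
# instead of being funneled through the file_server process.
{:ok, fd} = :file.open(path, [:write, :binary, :raw])
:ok = :file.write(fd, image)
:ok = :file.close(fd)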

Also, that Logger message queue looks somewhat large-ish. IIRC, Logger switches to back-pressure at some point (I think it's configurable). Try disabling it completely, or raising the level to :error, and see if you get better throughput.
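If it does turn out to be Logger, something along these lines in the prod config might help (a sketch - I'm not 100% sure :sync_threshold is the right knob here):

config :logger,
  level: :error,
  sync_threshold: 100   # messages in Logger's queue before it switches to sync mode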

Jose Garcia-Sacristan

Feb 29, 2016, 6:30:50 PM
to elixir-l...@googlegroups.com
This old entry in erlang-bugs describes a similar problem and also points to switching to raw mode as a possible solution, as suggested by Saša.

http://erlang.org/pipermail/erlang-bugs/2011-September/002591.html




José Valim

Feb 29, 2016, 6:33:18 PM
to elixir-l...@googlegroups.com
If file_server is the bottleneck, though, you should be able to see a large MsgQ for it in Observer. This has happened with Plug.Static before, and we moved to using :raw to fix it as well.
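A quick way to check that from a remote shell (a sketch):

# The OTP file server is registered as :file_server_2
:file_server_2
|> Process.whereis()
|> Process.info(:message_queue_len)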



José Valim
Skype: jv.ptec
Founder and Director of R&D

Miloš Mošić

Mar 1, 2016, 10:26:41 PM
to elixir-lang-talk, jose....@plataformatec.com.br
I did more testing yesterday and today, trying out various things.

I raised the log level to :error; it didn't have any effect.

I replaced `File.write!(path, image)` with:

`File.open(path, [:write, :binary, :raw], fn(file) -> IO.binwrite(file, image) end)`

This also didn't seem to improve anything.

I then tried wrapping that and `File.mkdir_p!` in separate tasks, and found out that most of the time it's actually making the directory that times out, not saving the file. I tried using `Porcelain.shell` and `System.cmd` with the same mkdir command and both pegged the CPU at 100%, so that was a no-go. I also tried `:filelib.ensure_dir`, which seems to behave the same as, or slightly better than, `File.mkdir_p!`.
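The wrapping looked roughly like this (simplified - path and image are placeholders, and each step gets its own short timeout so the logs show which one is blocking):

Task.async(fn -> File.mkdir_p!(Path.dirname(path)) end)
|> Task.await(1_000)

Task.async(fn ->
  File.open(path, [:write, :binary, :raw], &IO.binwrite(&1, image))
end)
|> Task.await(1_000)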

Strangely, what did improve stability dramatically was disabling motion detection. This is a simple C function used via a NIF that loads two images, compares them pixel by pixel, and counts how many pixels differ between them. I didn't write it, but as far as I can tell everything is done in memory; it doesn't read from or write to disk at any point.

But even with the NIF disabled, the app eventually got into a gridlock again - this time after about 12 hours of uptime (instead of the 5-15 minutes before).

Does anyone see a connection here?

Sasa Juric

Mar 2, 2016, 4:13:54 AM
to elixir-l...@googlegroups.com
If your NIF is blocking a scheduler for more than 1 ms, it might skew the schedulers or lead to scheduler collapse (see here for a brief explanation), which would perhaps explain your situation. If this is indeed the case, you could consider dirty schedulers, or run the C program externally through a port.
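For the port route, even something as simple as this would do (a sketch - "compare_images" is a hypothetical standalone binary built from your existing C code that prints the number of changed pixels):

# System.cmd runs the executable through a port, so a long-running
# comparison no longer blocks a scheduler.
{output, 0} = System.cmd("/path/to/compare_images", [previous_image_path, current_image_path])
{changed_pixels, _rest} = Integer.parse(output)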

Regarding the problems that happen after 12 hours of uptime, it's really hard to say. You'll probably need to collect various metrics for your app as well as the BEAM and the OS, log a bunch of stuff, and then try to analyze what happened. I expect it will take a couple of iterations to get to the bottom of it, which sucks, but I don't have a better proposal at this point.



Miloš Mošić

Mar 4, 2016, 12:43:54 PM
to elixir-l...@googlegroups.com
I wasn't aware that a NIF needs to complete in under 1 ms - that was an interesting explanation. I'll disable it for now, as we have more important things to handle at the moment.

Thanks for the responses everyone, this was very helpful.

Eugene HP

Mar 6, 2016, 6:56:24 AM
to elixir-lang-talk
Hey Milos,

Yes, it looks like the architecture has to be changed to support lazy loading of the results back from the NIF.

Regards,
Eugene

Srinivas JONNALAGADDA

Mar 6, 2016, 3:51:42 PM
to elixir-lang-talk
Milos,


On Friday, 4 March 2016 23:13:54 UTC+5:30, Miloš Mošić wrote:
I wasn't aware that a NIF needs to complete in under 1 ms - that was an interesting explanation. I'll disable it for now, as we have more important things to handle at the moment.

Did you go ahead with this plan? Could you let us know the outcome, please? Has it solved the problem?

-- 0|0

Miloš Mošić

Mar 7, 2016, 3:14:19 AM
to elixir-l...@googlegroups.com
Srinivas, I did disable the NIF. It "solved" the problem by making it much less frequent, as I mentioned in a previous post. I've set up a cron job that restarts the app every couple of hours, which is acceptable for now.
