Erl.exe high CPU usage every Monday morning on Windows

10,265 views
Skip to first unread message

Przemek D

unread,
Apr 3, 2017, 4:05:06 AM4/3/17
to rabbitmq-users
Hi,

Several people of my dev team (at least 2) are experiencing unusual high CPU usage of erl.exe process used by RabbitMQ. It happens on following conditions:
- the computer must be shut down for several days, at least it happens always on Monday morning after a weekend (we turn off our computers)
- the high CPU usage usually drops down after 3-4 hours, everything works normally then
- during this period RabbitMQ is unresponsive in terms it is not possible to send any message to any exchange nor read any message from any of the queues, API calls are being locked/timed out, administration panel show the status as healthy but it is also affected eg. when trying to read something from queue or purge queue it is locked
- it looks like one core is used by thread started in beam.smp.dll, see this picture https://drive.google.com/open?id=0B1E_DOCmqmgiTmFLSUJUZDNwRTQ
- killing erl.exe helps a little (the process respawns and behaves well)

We tried to update RabbitMQ and Erlang several times to newest version, it doesn't help. Our current environment where this "bug" is reproducible every 7 days:
- Windows 10 (running in domain)
- Core i7 (4 cores, 8 logical CPUs), 12GB RAM, at least 10GB of free space on drive C (SSD)
- Erlang 19.3, RabbitMQ 3.6.6 (it happens for one year now, since beginning of our journey with RabbitMQ, so older versions are also affected)
- We simply install Erlang and RabbitMQ without any modifications, only one instance is used

I attached a debugger with sources to this process and all what I achieved is that the thread with CPU usage is continuously running in loop in "scheduler_wait(int *fcalls, ErtsSchedulerData *esdp, ErtsRunQueue *rq)" (erl_process.c) function. The process was built without sufficient debug information so I can't say anything more, maybe I'm wrong. My other ideas:
- I could build Erlang with extended debugging symbols, but learning how to build debug version from scratch and deploy Erlang on Windows could be very time consuming (any help? I could not find any helpful tutorial how to do it easily on Windows)
- Even if I had knowledge what was wrong in Erlang it could be caused by some invalid behavior of RabbitMQ, learning how it works could take another weeks :( any advice?

Could you give me some hints which could help to diagnose the cause? Currently my team leader ignores the issue, also we don't want to spend weeks without progress of our internal work :(

Thanks in advance
Przemek





Karl Nilsson

unread,
Apr 3, 2017, 4:14:33 AM4/3/17
to rabbitmq-users
Hi,

Is there anything in the server log to indicate what RabbitMQ is doing during this high CPU use?

Do these systems have large numbers of messages stored in queues?

Cheers
Karl

Przemek D

unread,
Apr 3, 2017, 5:16:53 AM4/3/17
to rabbitmq-users
Hi Karl,

Thanks for your quick answer. Currently there are 12 total messages in all queues, in "C:\Users\(...)\Roaming\RabbitMQ\db\rabbit@(...)-mnesia\msg_store_persistent" directory of RabbitMQ there is one relatively big file (14MB file, name: 9.rdq), other files are relatively small (less than 1KB).

In rab...@MYHOSTNAME-sasl.log I found something like that, repeating every several minutes:
=SUPERVISOR REPORT==== 3-Apr-2017::08:20:47 ===
     Supervisor: {<0.20480.1>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.20483.1>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [1,<0.20475.1>,<0.20481.1>,<0.20475.1>,
                           <<"[::1]:63528 -> [::1]:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"admin">>,
                               [administrator],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"RTICS">>,
                           [{<<"publisher_confirms">>,bool,true},
                            {<<"exchange_exchange_bindings">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true}],
                           <0.20476.1>,<0.20482.1>]}},
                  {restart_type,intrinsic},
                  {shutdown,70000},
                  {child_type,worker}]


=SUPERVISOR REPORT==== 3-Apr-2017::08:35:21 ===
     Supervisor: {<0.20484.1>,rabbit_channel_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.20487.1>},
                  {name,channel},
                  {mfargs,
                      {rabbit_channel,start_link,
                          [2,<0.20475.1>,<0.20485.1>,<0.20475.1>,
                           <<"[::1]:63528 -> [::1]:5672">>,
                           rabbit_framing_amqp_0_9_1,
                           {user,<<"admin">>,
                               [administrator],
                               [{rabbit_auth_backend_internal,none}]},
                           <<"RTICS">>,
                           [{<<"publisher_confirms">>,bool,true},
                            {<<"exchange_exchange_bindings">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true}],
                           <0.20476.1>,<0.20486.1>]}},
                  {restart_type,intrinsic},
                  {shutdown,70000},
                  {child_type,worker}]



In rab...@MYHOSTNAME.log I found, repeating every several minutes:
=ERROR REPORT==== 3-Apr-2017::07:52:07 ===
closing AMQP connection <0.20234.1> ([::1]:56900 -> [::1]:5672):
missed heartbeats from client, timeout: 60s

=INFO REPORT==== 3-Apr-2017::07:52:08 ===
accepting AMQP connection <0.20272.1> ([::1]:61419 -> [::1]:5672)

=INFO REPORT==== 3-Apr-2017::07:52:08 ===
connection <0.20272.1> ([::1]:61419 -> [::1]:5672): user 'admin' authenticated and granted access to vhost 'RTICS'

=INFO REPORT==== 3-Apr-2017::07:52:38 ===
accepting AMQP connection <0.20291.1> ([::1]:61490 -> [::1]:5672)

=INFO REPORT==== 3-Apr-2017::07:52:38 ===
connection <0.20291.1> ([::1]:61490 -> [::1]:5672): user 'admin' authenticated and granted access to vhost 'RTICS'

=INFO REPORT==== 3-Apr-2017::07:52:38 ===
closing AMQP connection <0.20291.1> ([::1]:61490 -> [::1]:5672)

=ERROR REPORT==== 3-Apr-2017::07:55:07 ===
closing AMQP connection <0.20272.1> ([::1]:61419 -> [::1]:5672):
missed heartbeats from client, timeout: 60s

=INFO REPORT==== 3-Apr-2017::07:55:19 ===
accepting AMQP connection <0.20307.1> ([::1]:61807 -> [::1]:5672)

=INFO REPORT==== 3-Apr-2017::07:55:19 ===
connection <0.20307.1> ([::1]:61807 -> [::1]:5672): user 'admin' authenticated and granted access to vhost 'RTICS'

Michael Klishin

unread,
Apr 3, 2017, 5:44:28 AM4/3/17
to rabbitm...@googlegroups.com
We have seen this on Windows specifically before. The exact reason is not known but
we believe it's a runtime behavior, e.g. Erlang on Windows does not support IOCP.

You can try using 19.3.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Przemek D

unread,
Apr 3, 2017, 7:20:48 AM4/3/17
to rabbitmq-users
Hi Michael,

Unfortunately this does also happen for 19.3 as I wrote :(

Regarding Erlang and IOCP: I thought IOCP should only affect performance. My case looks rather like a bug which is reproducible and it has transient symptoms, something like: "timeout value for scheduler is being calculated in local timezone whereas later it's being check in UTC timezone" or "RabbitMQ database is corrupted on Friday and it has to be checked on Monday".
Nevertheless in meantime I will try to setup environment for building Erlang with debug symbols, which is pretty hard for Windows (msys, openssl, activeperl, nasm... all must be installed and configured by hand)

Regards
Przemek
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Michael Klishin

unread,
Apr 3, 2017, 8:17:58 AM4/3/17
to rabbitm...@googlegroups.com
You can use msys2 to get more or less the same toolchain available on other OS'es.

It would definitely help if there was a way to get more information from the runtime, e.g.
a profiler run data or similar.

Thank you.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Przemek D

unread,
Apr 3, 2017, 8:41:57 AM4/3/17
to rabbitmq-users
Hi Michael,

Is full dump sufficient for this?

Michael Klishin

unread,
Apr 3, 2017, 8:52:36 AM4/3/17
to rabbitm...@googlegroups.com
What specifically do you mean by "full dump"?

Also, I'd like to clarify that while we would be interested in learning more about what's going on,
we are not Erlang runtime engineers or Windows experts, so please don't expect us to take a heap
dump and have an RCA report in a few days.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Przemek D

unread,
Apr 3, 2017, 10:01:58 AM4/3/17
to rabbitmq-users
Hi,

I was referring to the dump created with Process Explorer  "Create Full Dump" option.

Michael Klishin

unread,
Apr 3, 2017, 12:29:09 PM4/3/17
to rabbitm...@googlegroups.com
I believe that creates a point-in-time memory dump? That's not really what you need
when analyzing an ongoing CPU burn problem.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Przemek D

unread,
Apr 4, 2017, 2:25:46 AM4/4/17
to rabbitmq-users
Hi Michael,

I appreciate your help, can you point me some windows application profiler which you are familiar with? Did you work with VS 2015 profiler?

Regarding msys2: I already started to work with it, based on http://erlang.org/doc/installation_guide/INSTALL-WIN32.html I was able to prepare environment a little, but I didn't manage to finish it within one day (I failed on the stage where msys2 was unable to locate VC compiler... it's not easy as for *nix versions)

Regards

Karl Nilsson

unread,
Apr 5, 2017, 11:53:10 AM4/5/17
to rabbitm...@googlegroups.com
Hi,

I don't actually know if a "Full Dump" would tell us anything interesting. It is fairly easy to do however so may be worth a go, you could then either share it or try to analyse using WinDBG or similar.

When you say you shut your computers down, is that a full, proper shutdown with power off or just a hibernate? Erlang can get a bit funny when coming back from hibernation and it seems even more so on windows.

I take it this is something that is only happening on development environments? Are your production server running on windows as well?

Cheers
Karl

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Karl Nilsson

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

unread,
Apr 5, 2017, 12:12:42 PM4/5/17
to rabbitm...@googlegroups.com
Some specifics: when a machine running an Erlang VM is suspended 
and then resumed, all timer events fire very rapidly to catch up and that causes a CPU usage spike but it eventually ends.

--
Karl Nilsson

Staff Software Engineer, Pivotal/RabbitMQ

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Staff Software Engineer, Pivotal/RabbitMQ

Przemek D

unread,
Apr 10, 2017, 3:57:20 AM4/10/17
to rabbitmq-users
Hi,

Thanks for your suggestion, but I think it's different scenario. The machines are powered down for the weekend and started as usually. Today high CPU usage happened again as usually, I have done small debugging session but it didn't make anything clearer than before.
As I wrote previously: killing erl.exe does help, today I can also confirm that restarting machine also helps.

vladimir...@phaestis.com

unread,
Apr 10, 2017, 5:35:07 AM4/10/17
to rabbitmq-users
Hi,
I get very similar behavior at mondays (but not every time). I don't remember is there high usage of  CPU last time, but interaction with RabbitMQ (purge queue, subscribe) throw timeout exception until I restart RabbitMQ service. In log only this string repeat:
closing AMQP connection <0.9482.0> ([::1]:57984 -> [::1]:5672):
client unexpectedly closed TCP connection

Windows 10
Erlang OTP 19 (8.3)
RabbitMQ Server 3.6.8

Przemek D

unread,
Apr 10, 2017, 8:21:19 AM4/10/17
to rabbitmq-users
Hi Vladimir,

So there are other similar cases, not only in my company :) Next time please check erl.exe process CPU usage, task manager will show you that one logical processor is busy for at least 3 hours (if you have multicore system then task manager will show CPU usage around 100/number_of_cores).
I'm not sure what to do, it's easy to reproduce I think eg. by setting date and time in BIOS but what should be checked then? How to gather some meaningful data?

Regards

vladimir...@phaestis.com

unread,
Apr 10, 2017, 8:32:29 AM4/10/17
to rabbitmq-users
One moment that I forgot: when my client get timeouts I can see rabbitmq status via management plugin (web). It work fine, I see all my client connections. Next time I'm try to create exchanges/queues, send messages. If this work that will be very strange, on my opinion.

Przemek D

unread,
Apr 24, 2017, 5:06:53 AM4/24/17
to rabbitmq-users
Hi everybody!

I think I've found possible cause! Michael was probably right, suspending machine causes that resumed Erlang tries to catch up with his timers and it could take some time. I thought that on Friday my shutdown stopped all windows processes. Today in the morning I turned on the machine and after several hours of checking what his happening (CPU usage was high on Monday as usually) I discovered in Process Explorer that Erlang process was started several days ago (on Thursday exactly, when I was forced to restart RabbitMQ manually), not today in the morning! As a matter of fact there were plenty of such service processes which were started several weeks ago according to PE.

That gave me a clue that Win 10 Fast Startup feature must mess something with Erlang, it does not stop the processes but it keeps them in sleeping state!

I disabled Fast Startup in Windows settings, next time I will see if that worked! Hope this will solve the issue :)

Regards

Michael Klishin

unread,
Apr 24, 2017, 5:08:10 AM4/24/17
to rabbitm...@googlegroups.com
Hi Przemek,

Thank you very much for reporting back!

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Benjamin Dodd

unread,
Sep 3, 2017, 4:59:17 PM9/3/17
to rabbitmq-users
Hi,

We are also experiencing this problem in our organisation. Exactly the same environment and situation.

Did the fast startup fix the problem? Would it be possible to handle this within erlang code?

Ben
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Przemek

unread,
Sep 6, 2017, 8:16:58 AM9/6/17
to rabbitm...@googlegroups.com
Hi Ben,

Yes, disabling fast startup fixed the issue. From my understanding erlang VM uses their own implementation of interval timers which try to guarantee that in given period of time let say 1 hour the timer is fired 3600 / timer_interval_period_seconds times (just an example). When process is put into sleep mode and awaken later then VM tries to catch up with number of timer routine executions.
I think one could detect wake up event using some sort of system events like: https://msdn.microsoft.com/en-us/library/microsoft.win32.systemevents.powermodechanged.aspx?f=255&MSPPError=-2147217396 or https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/distinguishing-fast-startup-from-wake-from-hibernation

However from my perspective rabbit-mq is a server component which should be highly reliable so there should be no hibernation or fast startup enabled on server machine. We experienced the issues only on developers machines.

Regards
Przemek



--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/myl74gsYyYg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

Michael Klishin

unread,
Sep 8, 2017, 11:58:26 AM9/8/17
to rabbitmq-users
In Przemek's case it was VM suspension which causes Erlang VM's timers to fire very rapidly for a period of time.
This cannot be handled in Erlang code. The solution is to make sure your VM is not suspended.

Benjamin Dodd

unread,
Sep 8, 2017, 8:21:28 PM9/8/17
to rabbitm...@googlegroups.com
Thanks Przemek, Michael.

This is a different case to us then - we have laptops with lids closed over the weekend, on Monday when we're back at work this occurs. 

I'm surprised this couldn't be handled in code, as there is no other software that I know that has this problem. 

--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/myl74gsYyYg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.

Michael Klishin

unread,
Sep 8, 2017, 10:36:42 PM9/8/17
to rabbitm...@googlegroups.com
Closing your laptop lid will have the same effect: the OS is hibernated,
then woken up and time suddenly jumps several hours or even days. Stop RabbitMQ before you leave.

The runtime manages timers when time jumps and I can think of only two options:

 * Skip all timers that were not fired during the time your laptop was asleep
 * Fire all those timers rapidly

Both of them have pros and cons. I don't know why the Erlang runtime choses option #2. The OTP team is on the erlang-questions list, perhaps they can explain it. Anyhow, RabbitMQ does not control it and it is a piece of server software.
Servers are not usually hibernated for a long time.

Benjamin Dodd

unread,
Sep 10, 2017, 6:10:33 PM9/10/17
to rabbitm...@googlegroups.com
My workaround is to stop/start the RabbitMQ service after startup. It's hard to do this though as my machine grinds to a halt for 5-10 minutes while it tries to start up. 

I'll take a look around and see if I can stop the windows service on hibernate/lid close.

I understand this is designed for servers, but we still have to build software for it on our own machines (including debug, testing, etc).

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.

To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages