System Lockups

mikeant...@gmail.com

Aug 18, 2014, 9:19:23 AM
to sna...@googlegroups.com
A project I'm developing uses SnakeMQ as its messaging system and has been experiencing total system-wide lockups, seemingly at random. This is very troubling, as nothing running in userspace should be able to lock up the entire system (a hard reset is required).

Below is a minimal example which reproduces this problem:

Server:

import json

from snakemq.link import Link
from snakemq.packeter import Packeter
from snakemq.messaging import Messaging
from snakemq.message import Message


def on_recv(conn, ident, message):
    # Print the raw payload, check that it parses as JSON, then reply to the sender.
    print(message.data)
    payload = json.loads(message.data.decode())
    messaging.send_message(ident, Message(b'Hello :)' * 100))


# Build the snakemq stack: link -> packeter -> messaging.
link = Link()
packeter = Packeter(link)
messaging = Messaging('worker_test', '', packeter)

link.add_listener(('', 4000))
messaging.on_message_recv.add(on_recv)
link.loop()

==========================================================================


Client:

from snakemq.link import Link
from snakemq.packeter import Packeter
from snakemq.messaging import Messaging
from snakemq.message import Message

import threading
import json
import time


class TestThread(threading.Thread):
    def __init__(self, ident):
        """Initialises thread and message queue.
        """
        super(TestThread, self).__init__()

        self.snakemq_ident = ident
        self.received_flag = threading.Event()
        self.stop_flag = threading.Event()
        self.link = Link()
        self.packeter = Packeter(self.link)

        self.messaging = Messaging(ident, '', self.packeter)
        self.messaging.on_message_recv.add(self.on_recv)
        self.link.on_loop_pass.add(self.on_loop_pass)
        self.link.add_connector(('localhost', 4000))

    def on_recv(self, conn_id, ident, payload):
        """Called when thread receives message."""
        print(payload.data)
        self.received_flag.set()

    def on_loop_pass(self):
        """Called after link loop poll is processed. Checks to see if
        we signalled for message thread to exit.
        """
        if self.stop_flag.is_set():
            self.link.stop()
            self.link.cleanup()

    def start(self):
        """Block for a brief period to let thread start up."""
        super(TestThread, self).start()
        time.sleep(0.2)

    def stop(self):
        """Terminates the thread."""
        self.stop_flag.set()

    def run(self):
        """Main thread subroutine - run link loop."""
        self.link.loop()

    def send_message(self, ident, message):
        """Packages up message and sends via broker to destination
        ident.
        """
        # Every caller below passes 'worker_test' as ident.
        self.messaging.send_message(
            ident,
            Message(json.dumps(message).encode('utf-8')))


iteration = 0

# Each pass builds and tears down two complete snakemq stacks.
while True:
    iteration += 1
    print(iteration)
    t1 = TestThread('t1')
    t1.start()
    t1.send_message('worker_test', 'Hello from T1!!!' * 100)
    t1.received_flag.wait()
    t1.stop()

    t2 = TestThread('t2')
    t2.start()
    t2.send_message('worker_test', 'Hello from T2!!!' * 100)
    t2.received_flag.wait()
    t2.stop()

==========================================================================

I've found (possibly a coincidence) that crashes sometimes happen when there's some user interaction occurring - two recent examples include opening new tabs in the browser, and opening an Explorer window. The only common link between these is that they all utilise sockets.

Sometimes only a few iterations are required (< 100), sometimes it requires many more before a crash occurs.

There's nothing in the SnakeMQ logs, nothing on the console and nothing in Event Viewer. I have absolutely no idea what's causing the lockup.

System is Windows 7 64-bit, Python 3.4 64-bit.


I've tried this on three separate Windows machines (different hardware) and the result is the same on each. No problems running under Ubuntu 14.04 64-bit.


Any clues?

Unfortunately for me this bug is a bit of a show-stopper :(

David Siroky

Aug 18, 2014, 11:14:55 AM
to sna...@googlegroups.com
Hi!

Does it lock only input (keyboard, mouse) or does it freeze the whole
system?
Is the system "pingable" from outside after the freeze?
Does it run out of memory?

David

mikeant...@gmail.com

Aug 18, 2014, 11:35:58 AM
to sna...@googlegroups.com
Hi David,

It freezes the entire system, even if you leave it for a couple of hours. Interestingly the machine is still pingable, so it hasn't trashed the IP stack.

Resource usage is normal at the time of the crash: http://imgur.com/IyrirVC.

Cheers,
Mike

David Siroky

Aug 18, 2014, 12:15:33 PM
to sna...@googlegroups.com
Looking at cli.py, there might be one reason: you keep creating new
snakemq stacks and therefore new sockets. The cleanup in snakemq's Link
might leak resources/sockets, which would eventually exhaust the
system.

Try the following link.py:
https://gist.github.com/dsiroky/be08e9d88ffcfec6d8e0

In any case, I advise creating only one snakemq stack for the whole
process/thread lifetime.

David

mikeant...@gmail.com

Aug 20, 2014, 4:00:08 AM
to sna...@googlegroups.com
Hi David,

Sorry for the delay. Unfortunately the revised link.py didn't work, it still crashes (sometimes after only 1 - 10 iterations). I'm not convinced it's a resource exhaustion issue as I've seen it make several thousand iterations before crashing, but also as few as 1 or 2. It seems more like a resource contention / access violation problem given its random nature.

I've been fairly busy this week, however I shall have another look through the source of link.py. One test I shall try is setting up a link without sending a message.

Do you have any further thoughts as to where the problem may be?

Shall I submit an issue for this on GitHub?

Regards,
Mike

mikeant...@gmail.com

Aug 20, 2014, 5:02:07 AM
to sna...@googlegroups.com, mikeant...@gmail.com
I have also made the demonstration code as small as possible: https://gist.github.com/anonymous/6fcca5461af04f4e1598.

Some interesting things to note:

- The code now seems to crash reliably within a very short period (< 10 iterations most times)
- Crashes occur even when no messages are sent, so something is definitely wrong with link setup / teardown
- If I change line 23 of client.py so that it listens instead of connecting, I can't seem to get it to crash. Maybe it has something to do with internal connection messages?


I've also written a non-threaded client: https://gist.github.com/anonymous/468b597162b4c9461d10

This raises an exception after 40 iterations:
Traceback (most recent call last):
  File "C:/Users/mwild/SW-SCRT-0020/bin/nonthreaded.py", line 34, in <module>
    link.loop()
  File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\link.py", line 433, in loop
    self.on_loop_pass()
  File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\callbacks.py", line 31, in __call__
    callback(*args, **kwargs)
  File "C:/Users/mwild/SW-SCRT-0020/bin/nonthreaded.py", line 17, in on_loop_pass
    link.cleanup()
  File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\link.py", line 276, in cleanup
    self.handle_close(sock)
  File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\link.py", line 650, in handle_close
    del self._sock_by_fd[sock.fileno()]
KeyError: 568

Do you think this could be the issue?

David Siroky

Aug 21, 2014, 4:13:17 PM
to sna...@googlegroups.com
>Hi David,
>
>Sorry for the delay. Unfortunately the revised link.py didn't work, it still crashes (sometimes after only 1 - 10 iterations). I'm not convinced it's a resource exhaustion issue as I've seen it make several thousand iterations before crashing, but also as few as 1 or 2. It seems more like a resource contention / access violation problem given its random nature.
>
>I've been fairly busy this week, however I shall have another look through the source of link.py. One test I shall try is setting up a link without sending a message.
>
>Do you have any further thoughts as to where the problem may be?
>
>Shall I submit an issue for this on GitHub?
>
>Regards,
>Mike

The random number of iterations might be caused by garbage collection
running at unpredictable times and releasing the unclosed sockets. But
if it crashes even after only a few iterations, then the problem must be
somewhere else.

I don't have any other idea what might be going on there. For now I
suggest working around it by creating only a single snakemq stack.


David

David Siroky

Aug 21, 2014, 4:21:38 PM
to sna...@googlegroups.com
>I have also made the demonstration code as small as possible:
>https://gist.github.com/anonymous/6fcca5461af04f4e1598.
>
>Some interesting things to note:
>
>- Code seems to reliably crash in a very short period now (< 10 iterations most times)
>- Crashes occur even when no messages are sent - definitely something wrong with link setup / teardown
>- Changing line 23 of client.py so that it listens instead of connects, I can't seem to get it to crash - maybe something to do with internal connection messages?
>
>
>I've also written a non-threaded client: https://gist.github.com/anonymous/468b597162b4c9461d10
>
>This raises an exception after 40 iterations:
>Traceback (most recent call last):
> File "C:/Users/mwild/SW-SCRT-0020/bin/nonthreaded.py", line 34, in <module>
> link.loop()
> File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\link.py", line 433, in loop
> self.on_loop_pass()
> File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\callbacks.py", line 31, in __call__
> callback(*args, **kwargs)
> File "C:/Users/mwild/SW-SCRT-0020/bin/nonthreaded.py", line 17, in on_loop_pass
> link.cleanup()
> File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\link.py", line 276, in cleanup
> self.handle_close(sock)
> File "C:\Users\mwild\SW-SCRT-0020\venv34\lib\site-packages\snakemq\link.py", line 650, in handle_close
> del self._sock_by_fd[sock.fileno()]
>KeyError: 568
>
>Do you think this could be the issue?

In this code you are calling link.cleanup() inside the link loop, which
causes unpredictable behaviour. Call cleanup() after link.loop() returns.
I see some issues in the cleanup process anyway. I'll look into it.

David
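
A minimal sketch of the ordering described above: the loop callback only requests a stop, and cleanup runs once loop() has returned. It relies only on the Link methods that already appear in this thread (on_loop_pass.add, stop, loop, cleanup). The helper name run_link_until_stopped is hypothetical and not part of snakemq.

```python
def run_link_until_stopped(link, stop_flag):
    """Run link.loop() and clean up only after the loop has exited.

    `link` is assumed to behave like snakemq.link.Link as used in this
    thread; `stop_flag` is a threading.Event set by another thread.
    """
    def on_loop_pass():
        # Inside the loop: only ask the loop to end, never tear down here.
        if stop_flag.is_set():
            link.stop()

    link.on_loop_pass.add(on_loop_pass)
    try:
        link.loop()       # expected to return once link.stop() was called
    finally:
        link.cleanup()    # the loop is no longer running at this point
```

In the threaded client, this would take the place of the body of run(), with on_loop_pass no longer calling cleanup() itself.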

David Siroky

Aug 21, 2014, 5:46:24 PM
to sna...@googlegroups.com
I have pushed another cleanup fix to GitHub. It now closes all sockets
correctly.
There is an exhaustion problem, at least under Linux: after approx. 23000
connects it fails with EADDRNOTAVAIL, which is an OS limitation. After a
socket is closed, its port remains in the TIME_WAIT state. Setting
SO_REUSEADDR on a connecting socket does not help. A similar problem
might exist under MS Windows, where instead of raising an exception the
system freezes.

David
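
To illustrate the mechanism behind that limit, here is a small standard-library sketch (not snakemq code). It shows that every outgoing connection takes a local port chosen by the OS. In a tight connect/close loop, the ports of recently closed connections can linger in TIME_WAIT and are not immediately reusable. The function name local_ports_used is made up for this example.

```python
import socket


def local_ports_used(n_connections):
    """Connect to a loopback listener n times; return the client-side ports."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server.listen(n_connections)
    ports = []
    try:
        for _ in range(n_connections):
            client = socket.create_connection(server.getsockname(), timeout=5)
            accepted, _addr = server.accept()
            ports.append(client.getsockname()[1])
            # The side that closes first is the one whose port lingers
            # in TIME_WAIT; here that is the client.
            client.close()
            accepted.close()
    finally:
        server.close()
    return ports


if __name__ == "__main__":
    print(local_ports_used(5))
```

The printed port numbers depend on the OS and its ephemeral port range, so no particular output should be expected.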

mikeant...@gmail.com

Aug 22, 2014, 6:22:42 AM
to sna...@googlegroups.com
I pulled in the fix and raised my hopes slightly when it went a thousand iterations without crashing; unfortunately it crashed soon after. The bug is still present, I'm afraid.

Mike

David Siroky

Aug 22, 2014, 8:16:50 AM
to sna...@googlegroups.com
>I pulled in the fix and raised my hopes slightly when it went a
>thousand iterations without crashing, unfortunately it crashed soon
>after. The bug is still present I'm afraid.
>
>Mike

If it is what I described then it is not a bug. It's an OS limitation.
Why do you need to create so many snakemq stacks on a single host?

David

mikeant...@gmail.com

Aug 22, 2014, 8:50:02 AM
to sna...@googlegroups.com
I have quite a complicated architecture which consists of a server (with several threads), multiple clients (browsers) and multiple workers (each with several threads). All of these need to be able to communicate with each other, so the simplest way to do this is to have everything connected to a central message broker which passes everything on to its destination.

The system requires both permanent and temporary message queues (spawned on an HTTP request). This is why the demonstration is the way it is: an HTTP request is made, a message thread is spawned to fire off a query to some other part of the system, it waits for a response and then closes the thread and passes the result back to the client.

David Siroky

Aug 22, 2014, 10:30:59 AM
to sna...@googlegroups.com
> a HTTP request is made, a message thread is spawned to fire off
>a query to some other part of the system, it waits for a response and
>then closes the thread and passes the result back to the client.

Is it necessary to open and close the communication every time as well?
The simplest, though not very clever, way might be to create a pool of
snakemq stacks that holds a "ready made" stack for every part of the
system. The best way should be to create only a single stack. Let the
threads send the messages (snakemq sending is thread safe) and filter
received messages by their origin into local queues.
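
One possible shape for the "filter by origin into local queues" part, as a sketch using only the standard library. The callback signature (conn, ident, message) is the one used by the on_message_recv handlers earlier in this thread. The class name ReplyDispatcher and its methods are hypothetical.

```python
import queue
import threading


class ReplyDispatcher:
    """Route received messages into one local queue per origin ident."""

    def __init__(self):
        self._lock = threading.Lock()
        self._queues = {}

    def queue_for(self, ident):
        """Return the queue for `ident`, creating it on first use."""
        with self._lock:
            if ident not in self._queues:
                self._queues[ident] = queue.Queue()
            return self._queues[ident]

    def on_recv(self, conn, ident, message):
        """Meant to be registered once on the single long-lived stack."""
        self.queue_for(ident).put(message)
```

With a single stack, the dispatcher would be registered once via messaging.on_message_recv.add(dispatcher.on_recv). A request thread then sends through the shared messaging object and waits on dispatcher.queue_for('worker_test').get(timeout=...). If several threads wait on replies from the same origin, some request identifier inside the payload would still be needed to tell the replies apart.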

mikeant...@gmail.com

Aug 22, 2014, 10:35:22 AM
to sna...@googlegroups.com
I spawn temporary message threads because it seemed like the simplest way to do it at the time. Now I know that's not possible I'll have to find a workaround - the thread pool suggestion sounds good!

Thanks,
Mike
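
The pool idea discussed above could be sketched roughly as follows. This is a generic, standard-library object pool, not a snakemq feature. The factory argument is an assumption: in practice it would build one Link/Packeter/Messaging stack and start its loop thread, which is left out here.

```python
import contextlib
import queue


class StackPool:
    """Hold a fixed number of ready-made objects and lend them out."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())   # built once, reused afterwards

    @contextlib.contextmanager
    def borrow(self, timeout=None):
        # Blocks while every pooled object is in use; raises queue.Empty
        # if `timeout` seconds pass without one becoming free.
        stack = self._idle.get(timeout=timeout)
        try:
            yield stack
        finally:
            self._idle.put(stack)       # hand it back instead of closing it
```

A request handler would then use "with pool.borrow() as stack:" around its send/wait cycle, so no new sockets are opened per HTTP request.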