Unix vs Windows & async connections


iodboi

Jun 17, 2009, 11:45:57 PM
to rpyc
Hi all,
I've started to play with rpyc as it looks like it will fit what I need
very well, and I've been very impressed so far. However, I have come up
with one slight inconsistency which I can't quite figure out, so I'm
hoping someone here may be able to :)

The problem in a nutshell is a discrepancy between how long an
asynchronous connection takes to return on a Unix-like system compared
to a win32 system. For simplicity's sake we can take the two blobs of
example code from the tutorial on the website as your server and
client (though the problem is there in other client/server code also),
with the addition of a simple timer to see how long the call to
'conn.root.FileMonitor' takes (obviously swapping out the location of
the temp file depending on *nix/win32):

<snip server>
import rpyc
import os
import time
from threading import Thread

class FileMonitorService(rpyc.Service):
    class exposed_FileMonitor(object):   # exposing names is not limited to methods :)
        def __init__(self, filename, callback, interval = 1):
            self.filename = filename
            self.interval = interval
            self.last_stat = None
            self.callback = rpyc.async(callback)   # create an async callback
            self.active = True
            self.thread = Thread(target = self.work)
            self.thread.start()

        def exposed_stop(self):   # this method has to be exposed too
            self.active = False
            self.thread.join()

        def work(self):
            while self.active:
                stat = os.stat(self.filename)
                if self.last_stat is not None and self.last_stat != stat:
                    self.callback(self.last_stat, stat)   # notify the client of the change
                self.last_stat = stat
                time.sleep(self.interval)

if __name__ == "__main__":
    from rpyc.utils.server import ThreadedServer
    ThreadedServer(FileMonitorService, port = 18871).start()
</snip server>

<snip client>
import rpyc, time

#win32
f = open("c:\\Users\\username\\AppData\\Local\\temp\\floop.bloop", "w")
#nix
#f = open("/tmp/floop.bloop", "w")
conn = rpyc.connect("localhost", 18871)
bgsrv = rpyc.BgServingThread(conn)   # create a bg thread to process incoming events

def on_file_changed(oldstat, newstat):
    print "file changed"
    print "    old stat: %s" % (oldstat,)
    print "    new stat: %s" % (newstat,)

st = time.time()
#win32
mon = conn.root.FileMonitor("c:\\Users\\username\\AppData\\Local\\temp\\floop.bloop", on_file_changed)
#nix
#mon = conn.root.FileMonitor("/tmp/floop.bloop", on_file_changed)
print "Time to connect:", time.time() - st
</snip client>

When this is run over localhost on a win32 system, the time it takes
for the call to return is roughly 0.5 secs, and this is a consistent
number.
However, when the same code is run on a *nix system the time taken can
vary wildly: in my tests it has been between 1 and 290 seconds!!

I have tested this across a variety of different *nix & win32 systems
(win2k, xp, vista, ubuntu, gentoo, fedora, OS X), using both Python
2.5 & 2.6, with the same effect every time.

So my question is obvious - what the heck is going on!!

My initial thoughts were something to do with the GIL, threads &
scheduling across multiple cores, as my *nix boxes all had more than
one core (I tested on 2, 4 and 8 cores) & the windows boxes were only
single core, but I get the same results when I try a Linux VM with a
single core. So I'm stumped :(

Can other people reproduce this behaviour, and if so any ideas on how
to fix it?

Many thanks
Rich

iodboi

Jun 19, 2009, 2:01:23 PM
to rpyc
OK, so I worked out the problem: it is in the client. There is lock
contention between the BgServingThread and the 'root' property in
protocol.py, which tries to establish the remote root object, if you
start the BgServingThread prior to connecting to an exposed service
(FileMonitor in the above example).

So in the client if you do:

bgsrv = rpyc.BgServingThread(conn)   # create a bg thread to process incoming events
mon = conn.root.FileMonitor("/tmp/floop.bloop", on_file_changed)

You get a race condition between the BgServingThread object and the
sync_request call made from the call to 'conn.root.FileMonitor'; both
are attempting to get a lock within the same connection object's serve
method. bgsrv wins most of the time, as it is essentially just a tight
loop dominating the serve method:

<snip>
def _bg_server(self):
    try:
        while self._active:
            self._conn.serve(self.INTERVAL)
    except Exception:
        if self._active:
            raise
</snip>

This, however, only happens in *nix environments; on Windows the race
between the two threads seems not to be as one-sided?

If people want my debug scripts to show this happening let me know.
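[The contention shape described above can be illustrated without rpyc at all. The sketch below is a hypothetical stand-in, not code from the thread: `bg_serve` tight-loops over a shared lock the way BgServingThread drives `conn.serve()`, while `sync_request` is the main thread making a single blocking acquisition, like the call to `conn.root.FileMonitor`. The tiny sleep between iterations is what lets the waiter through; removing it recreates the one-sided race on schedulers that keep favouring the running thread.]

```python
import threading
import time

recvlock = threading.Lock()   # plays the role of the connection's _recvlock
active = True
done = threading.Event()
elapsed = None

def bg_serve():
    # the "bg serving" thread: grabs the shared lock over and over
    while active:
        with recvlock:
            pass
        time.sleep(0.001)   # tiny yield between iterations; dropping this
                            # recreates the one-sided race described above

def sync_request():
    # the one-off "sync" acquisition made by the main thread
    with recvlock:
        done.set()

bg = threading.Thread(target=bg_serve)
bg.start()
start = time.time()
sync_request()
elapsed = time.time() - start
active = False
bg.join()
```

With the yield in place, `sync_request` returns almost immediately; the question in this thread is why, on *nix, the waiter could be starved for minutes without it.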

tomer filiba

Jun 20, 2009, 4:40:54 AM
to rp...@googlegroups.com
hi

sorry for the long delay, i haven't had the time to look into it this week. however, the race you're reporting is serious and i'll look into it asap. i have experienced some sporadic 'deadlocks' in my own tests, which may be related to this race.

hopefully by tuesday or so i'll release a bugfix version. thanks for investigating :)


-tomer

An NCO and a Gentleman

tomer filiba

Jun 22, 2009, 11:07:00 AM
to rp...@googlegroups.com, iod...@gmail.com
hey

thanks to your research, i was able to focus my debugging efforts on the relevant
spot, but it turns out to be much more tricky than i thought. either i'm
doing something very stupid, or there's a serious bug in python's locking
mechanism (i'd imagine it's not the kernel :)

btw, i'm using python 2.5 and 2.6 on linux (2.6.28)

i adapted tests/threads.py a little. it is essentially very similar to your code,
running a bg server thread to allow the main thread to work "freely". i also
added debug prints to core/protocol.py, to get a better understanding of what's
going on there.

you can find this version in the svn (http://sebulbasvn.googlecode.com/svn/trunk/rpyc)

it works around 10%-20% of the time, and the rest of the time, i get this:

TEST Multithreaded
    STEP invoker
conn1 (0) | recv wait_for_lock= True
conn1 (1) | send
conn1 (0) | recv acq
conn1 (1) | send acq
conn1 (1) | send rel
conn1 (1) | recv wait_for_lock= True
conn1 (0) | recv rel
conn1 (0) | send
conn1 (0) | send acq
conn1 (0) | send rel
conn1 (0) | recv wait_for_lock= True
conn1 (0) | recv acq
conn1 (0) | recv rel
conn1 (0) | recv wait_for_lock= True
conn1 (0) | recv acq
conn1 (0) | recv rel
conn1 (0) | recv wait_for_lock= True
conn1 (0) | recv acq
conn1 (0) | recv rel
conn1 (0) | recv wait_for_lock= True
conn1 (0) | recv acq
conn1 (0) | recv rel
... ad infinitum

as you can see, two threads attempt to use conn1 (a single rpyc connection).
thread 0 is the bg thread, and thread 1 is the main thread. "recv" is printed
before trying to acquire the lock, "recv acq" is printed after it has been
acquired, and "recv rel" is printed when the lock is released.

there are two locks in rpyc's connection: _sendlock and _recvlock. these two are
used to synchronize the bytes on the underlying socket.

so what we see in this snippet is this:
* t0 acquired the recvlock.
* meanwhile t1 sent its request (c.root....), and is now expecting the reply
* t1 calls _recv(), which attempts to acquire the recvlock, but since it belongs
  to t0, t1 blocks.
* t0 now releases the lock and processes the request (which triggers more
  requests to the server)
* for some reason, t1 never woke up. t0 continues to acquire and release the
  lock, while t1 never gets woken.

this is very cryptic. when a thread blocks on a lock, and the lock is released,
the thread should wake up... my gut feeling is that this is related to the GIL
or something nasty in that context.

i found a (terrible) workaround for this -- giving a penalty to the thread that
acquired the lock, by making it sleep a little (protocol.py:292).

this makes t0 sleep and allows t1 to wake up and continue its loop.

another solution is to add a little sleep in the BgServingThread (utils/helper.py),
i.e.,
    while self._active:
        self._conn.serve(self.INTERVAL)
        time.sleep(0.05)

but it doesn't solve the whole problem.
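[That second workaround can be sketched self-contained. The class below is a hypothetical stand-in for rpyc's BgServingThread, not its actual code: a queue plays the role of the connection, and the added sleep after each served event is the "penalty" that gives other threads a chance at the connection's locks.]

```python
import queue
import threading
import time

class BgServer:
    """Stand-in for a BgServingThread-style loop: repeatedly 'serve' one
    pending event, then sleep briefly so other threads can get at the
    shared connection. The queue stands in for the rpyc connection."""
    INTERVAL = 0.05

    def __init__(self, events, handler):
        self._events = events
        self._handler = handler
        self._active = True
        self._thread = threading.Thread(target=self._bg_server)
        self._thread.start()

    def _bg_server(self):
        while self._active:
            try:
                # roughly analogous to self._conn.serve(self.INTERVAL)
                ev = self._events.get(timeout=self.INTERVAL)
            except queue.Empty:
                continue
            self._handler(ev)
            time.sleep(self.INTERVAL)   # the added penalty sleep

    def stop(self):
        self._active = False
        self._thread.join()

# usage: feed a few events and check they all get handled
seen = []
q = queue.Queue()
srv = BgServer(q, seen.append)
for i in range(3):
    q.put(i)
time.sleep(0.5)
srv.stop()
```

The trade-off is exactly the one discussed here: the sleep caps how fast the bg thread can drain events, in exchange for not monopolising the lock.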

these are all hacks, for sure, but i tried to reproduce the bug without rpyc,
to no avail. however, i didn't put enough effort into it. i'm a little busy at
work.

if you or anybody else can look into it, it would be great. otherwise i can't
guarantee a bugfix in the near future.




An NCO and a Gentleman


iodboi

Jun 22, 2009, 11:24:10 AM
to rpyc
Hi there Tomer,
I agree this is a strange bug and I too think it is related to the
GIL. There is a talk about the GIL, multiple cores and locking that I
*think* is probably at the root of the problems we are seeing:
http://blip.tv/file/2232410 -- even if it is not the problem it is
still a very good talk :)

Anyway, from the explanations given by David Beazley in the video I
think this could well be behind at least some of what we are seeing.

Now that we are agreed it is a bug, and not me doing something stupid,
I will take some time later today to see if I can come up with a
solution that does not involve the performance hit of hardcoding a
sleep into the BgServingThread.

Thanks for the debugging work & confirmation of what I was seeing.

Cheers
Rich

Petr Man

Jun 25, 2009, 8:28:47 PM
to rpyc
Hello guys,
I have run into exactly the same problem tonight. I have been ripping
out my hair for hours now, trying to find out what is wrong with my
code.
The weird thing is that it seems to happen somewhat randomly, as some
of my code using these constructs works flawlessly and other parts just
lock up. I will try to look into it more deeply tomorrow and possibly
post the two pieces of code in question, so that we can find out where
the difference is.
Good night,
Petr

tomer filiba

Jun 27, 2009, 8:28:18 AM
to rp...@googlegroups.com, pe...@madnetwork.org
i wrote a debug lock (to make sure i don't mix up locks, etc), but the phenomenon still happens:

[T0 L(send 1)] acq
[T0 L(send 1)] ACQ
[T0 L(send 1)] rel
[T1 L(recv 0)] acq
[T1 L(recv 0)] ACQ
[T0 L(send 1)] REL
[T0 L(recv 0)] acq
[T1 L(recv 0)] rel
[T1 L(recv 0)] REL
[T1 L(send 1)] acq
[T0 L(recv 0)] ACQ
[T1 L(send 1)] ACQ
[T1 L(send 1)] rel
[T1 L(send 1)] REL
[T1 L(recv 0)] acq  <<<<<<<
[T0 L(recv 0)] rel
[T0 L(recv 0)] REL
[T0 L(recv 0)] acq
[T0 L(recv 0)] ACQ
[T0 L(recv 0)] rel
[T0 L(recv 0)] REL
[T0 L(recv 0)] acq
[T0 L(recv 0)] ACQ
.
.
.
ad infinitum.

T is for thread, L is the lock name. "acq" is printed before acquiring, "ACQ" is printed after successfully acquiring; the same with "rel" and "REL".

as you can see, T1 tries to acquire recv0. a moment later, T0 releases recv0, and then T0 goes on to acquire and release it, time and again -- while T1 remains blocked for some reason.

i can't see why a lock wouldn't wake up a thread that blocks on it, when another thread releases it. it seems like a bug in python -- most probably with the GIL.

this deadlock happens around ~25% of the time, and i can't think of any workaround.
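[One direction a workaround could take -- purely an illustrative sketch, not anything that was in rpyc at the time -- is to replace the plain lock with a FIFO "handoff" lock, so a thread that releases and immediately re-acquires can never jump ahead of a waiter that is already queued:]

```python
import threading
import time
from collections import deque

class FairLock:
    """A FIFO lock: release() hands ownership to the oldest waiter,
    so a tight acquire/release loop cannot starve a queued thread."""
    def __init__(self):
        self._mutex = threading.Lock()   # protects the internal state
        self._waiters = deque()
        self._locked = False

    def acquire(self):
        with self._mutex:
            if not self._locked and not self._waiters:
                self._locked = True
                return
            ev = threading.Event()
            self._waiters.append(ev)
        ev.wait()   # woken by release() in strict arrival order

    def release(self):
        with self._mutex:
            if self._waiters:
                # hand ownership directly to the oldest waiter;
                # _locked stays True across the handoff
                self._waiters.popleft().set()
            else:
                self._locked = False

# usage: with the lock held, queue three waiters, then release --
# they must complete in arrival order
lock = FairLock()
lock.acquire()
order = []

def worker(name):
    lock.acquire()
    order.append(name)
    lock.release()

threads = []
for name in ("first", "second", "third"):
    t = threading.Thread(target=worker, args=(name,))
    t.start()
    time.sleep(0.05)   # let each waiter enqueue before starting the next
    threads.append(t)
lock.release()         # hands off to "first", then FIFO onward
for t in threads:
    t.join()
```

The cost is an Event per waiter and an extra mutex on every operation, so it is slower than a bare lock under no contention; the gain is that fairness no longer depends on the scheduler or the GIL's switching behaviour.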



-tomer

An NCO and a Gentleman


Manuel Naranjo - www.aircable.net

Sep 16, 2009, 10:30:52 AM
to tomer filiba, rp...@googlegroups.com
Hello guys,

I'm the main developer of OpenProximity: http://code.google.com/p/proximitymarketing/
and I've been struggling with your problem for a few days, and finally
found the solution :D.

The problem is that <connection>.root is a sync call, so if there are
async calls waiting it gets locked and never returns. The way I figured
out to solve it is a bit like twisted does it: ALL calls need to be
async.

So what I did was get the async wrapper (rpyc.async) well before making
the call; then I don't need to do <connection>.root every time I make
the call, I just do it once and keep the reference in memory.

I know this is not the best solution at all, but it works for now.

Manuel

ps: Tomer, sorry I had to send the mail directly to you, but I can't
reply to this thread directly to the group from google groups.
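[Manuel's "fetch the root once" idea can be sketched without rpyc at all. FakeConnection below is a hypothetical stand-in that counts how many blocking sync round-trips the `.root` property would cost; the client fetches it a single time, up front, and reuses the reference, so later calls never touch the contended sync path again.]

```python
class FakeConnection(object):
    """Hypothetical stand-in for an rpyc connection: every access to
    .root performs one blocking sync round-trip to the server."""
    def __init__(self):
        self.sync_requests = 0

    @property
    def root(self):
        self.sync_requests += 1   # pretend this blocks on the recvlock
        return "remote-root"

class CachingClient(object):
    """Fetch conn.root once, before any background serving starts, and
    keep the reference -- the pattern described in the post above."""
    def __init__(self, conn):
        self._conn = conn
        self._root = conn.root    # the single sync request

    @property
    def root(self):
        return self._root

# usage: many accesses, but only one sync round-trip ever happens
conn = FakeConnection()
client = CachingClient(conn)
for _ in range(10):
    client.root
```

This sidesteps the race rather than fixing it: the sync `.root` request simply happens before there is a bg thread to compete with.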
