Hi all –
Using bog-standard code (see http://pastebin.com/W3AvhQn5), and then forcing the TCP connection between the publisher and the subscriber to be reset, I am seeing duplicate messages on the subscriber’s side. I’ve observed this behavior both in 2.0 and in 2.1.7, in both C and Python code.
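(The code is roughly the following shape: a minimal pyzmq sketch with an illustrative port. The pastebin has the actual code, and in the real setup the publisher and subscriber run as separate processes.)

import time
import zmq

ctx = zmq.Context()

# Publisher side: bind and publish on a fixed interval.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

# Subscriber side (a separate process in the real setup): connect and
# subscribe to everything.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"")

time.sleep(0.5)  # let the connection and subscription settle

for i in range(5):
    pub.send(("msg %d" % i).encode())
    print(sub.recv())
    time.sleep(1.0)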
Basic timeline of events:
· Start up the publisher on a regular message-generation loop.
· Start up the subscriber, connecting to the publisher.
· Validate that the subscriber sees one message per published message.
· Force the connection to be closed – in my case, I attached gdb to the Python process and close(fd)’ed the non-listening socket (easily identifiable via lsof).
· The subscriber auto-reconnects, but now receives duplicates of every message being sent.
My questions:
· Is this expected?
· What should I be doing differently to prevent this from happening?
Cheers,
--B
There are two details I’d like to add to this.
First, the approach I used to close the fd – gdb and calling close(fd) explicitly – hits an unexpected behavior with the epoll mechanism. On fd close, epoll automatically – and, as best I can determine, silently – removes the fd from the event set. That means no notifications occur.
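For illustration, a standalone sketch of the kernel behavior using Python's select.epoll (Linux only; none of this is 0MQ code, and the socketpair is just a stand-in for the real connection):

import select
import socket

# A connected pair; register one end with epoll, much as an I/O thread would.
a, b = socket.socketpair()
ep = select.epoll()
ep.register(a.fileno(), select.EPOLLIN)

# Close the registered fd behind epoll's back, as close(fd) from gdb does.
a.close()

# The kernel has silently dropped the fd from the interest set: no EPOLLIN,
# no EPOLLERR, no EPOLLHUP, even after the peer goes away.
b.close()
print(ep.poll(1))  # prints [] after the one-second timeout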
However, I would not be surprised if the internal tracking in 0mq simply keys on the fd value. Since the client reconnects so rapidly, the new connection ends up with the same fd on the server side, leaving a duplicate entry in the internal tracking objects. This, I suspect, explains the duplicate messages.
I’d like some independent confirmation of this analysis.
> Using bog-standard code (see http://pastebin.com/W3AvhQn5), and then forcing
> the TCP connection between the publisher and the subscriber to be reset, I
> am seeing duplicate messages on the subscriber’s side. I’ve observed this
> behavior both in 2.0 and in 2.1.7, in both C and Python code.
This is normal and expected. The PUB socket holds a small queue of outgoing messages so that new subscribers get some history (10 messages or so, iirc). The SUB socket gets its messages, then you break the connection, the SUB socket reconnects, and it thus appears to the PUB as a new subscriber.
-Pieter
> However, I would not be surprised if the internal tracking in 0mq simply
> keys on the fd value. Since the client reconnects so rapidly, the new
> connection ends up with the same fd on the server side, leaving a duplicate
> entry in the internal tracking objects. This, I suspect, explains the
> duplicate messages.
>
> I’d like some independent confirmation of this analysis.
Yes. That's the case. If an error is encountered with the socket, the connection is restarted.
Martin
Just to clarify - I don't believe this is the PUB queuing case, since the duplicates occur indefinitely. Rather, it's an issue of the same 'fd' value being registered in the internal data structures multiple times, since 0MQ never gets any kind of notification from epoll that the fd has been artificially closed.
Cheers,
--B
> Just to clarify - I don't believe this is the PUB queuing case, since
> the duplicates occur indefinitely. Rather, it's an issue of the same 'fd'
> value being registered in the internal data structures multiple times,
> since 0MQ never gets any kind of notification from epoll that the fd has
> been artificially closed.
Not even EPOLLERR?
Martin
I haven't experimented directly, but the documentation indicates that the fd just gets cleaned up silently. There was some mildly related discussion on this point here: http://lwn.net/Articles/430793/
I haven't banged together a test case, but if an EPOLLERR is being generated, then 0mq surely isn't cleaning up the lingering reference internally.
Cheers,
--B
>> Not even EPOLLERR?
>
> I haven't experimented directly, but the documentation indicates that
> the fd just gets cleaned up silently. There was some mildly related
> discussion on this point here: http://lwn.net/Articles/430793/
>
> I haven't banged together a test case, but if an EPOLLERR is being
> generated, then 0mq surely isn't cleaning up the lingering reference
> internally.
Ok. But what are we actually discussing here? 0MQ is surely not expected to be resilient against someone attaching gdb to it and messing with its internal structures.
Are you seeing any problem in real usage?
Martin