[zeromq-dev] Duplicate messages on subscriber reconnect in pub/sub pair

Benn Bollay

May 12, 2011, 6:36:15 PM
to zerom...@lists.zeromq.org

Hi all,

Using bog-standard code (see http://pastebin.com/W3AvhQn5) and then forcing the TCP connection between the publisher and the subscriber to be reset, I am seeing duplicate messages on the subscriber’s side. I’ve observed this behavior in both 2.0 and 2.1.7, and in both C and Python code.

Basic timeline of events:

· Start up the publisher on some regular message-generation loop.
· Start up the subscriber, connecting to the publisher.
· Validate that the subscriber sees one message per publisher message.
· Force the connection to be closed. In my case, I attached gdb to the Python process and called close(fd) on the non-listening socket (easily identifiable via lsof).
· The subscriber auto-reconnects, but now receives duplicates of every message being sent.

My questions:

· Is this expected?
· What should I be doing differently to prevent this from happening?

Cheers,

--B

Benn Bollay

May 12, 2011, 7:21:40 PM
to ZeroMQ development list

There are two details I’d like to add to this.

First, the approach I used to close the fd (attaching gdb and calling close(fd) explicitly) runs into unexpected behavior in the epoll mechanism. When an fd is closed, epoll automatically, and as best I can determine silently, removes it from the event set. That means no notifications occur.

Second, I would not be surprised if the internal tracking in 0mq simply keys connections by fd. Since the client reconnects so rapidly, the new connection gets the same fd on the server, and a duplicate entry ends up in the internal tracking objects. This, I suspect, explains the duplicate messages.

I’d like some independent clarification of this analysis.
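
For anyone who wants to check the epoll part of this without gdb, here is a minimal sketch using only the Python standard library. It assumes Linux (select.epoll does not exist elsewhere) and uses a socketpair as a stand-in for the publisher/subscriber TCP connection. The function name epoll_after_close is made up for illustration, and nothing here touches 0MQ itself.

```python
import select
import socket


def epoll_after_close():
    """Register a readable socket with epoll, close it, and poll again.

    Linux-only sketch: a socketpair stands in for the real TCP connection.
    """
    ep = select.epoll()
    a, b = socket.socketpair()
    try:
        fd = a.fileno()
        ep.register(fd, select.EPOLLIN)

        b.send(b"x")          # make `a` readable
        before = ep.poll(0)   # non-blocking poll while fd is still open

        a.close()             # the only descriptor for this socket goes away
        after = ep.poll(0)    # poll again: is anything reported for fd?

        # Stand-in for the rapid reconnect: does the new socket get
        # the same descriptor number as the one just closed?
        c, d = socket.socketpair()
        reused = c.fileno() == fd
        c.close()
        d.close()

        return fd, before, after, reused
    finally:
        b.close()
        ep.close()


if __name__ == "__main__":
    fd, before, after, reused = epoll_after_close()
    print("fd:", fd)
    print("events before close:", before)
    print("events after close:", after)
    print("new socket reused the fd number:", reused)
```

If the analysis above is right, the second poll should report nothing for the closed fd, not even an error event. Because the kernel hands out the lowest free descriptor number, the reconnect would normally land on the same number in a single-threaded process, although that depends on what else the process has open.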

Ilja Golshtein

May 13, 2011, 1:09:09 AM
to ZeroMQ development list
Benn,

1. Use identities: zmq_setsockopt (socket, ZMQ_IDENTITY, "something_uniq", 14). See http://zguide.zeromq.org/page:all#toc42 for details.
 
--
Best regards,
Ilja Golshtein.

Pieter Hintjens

May 13, 2011, 2:19:16 AM
to ZeroMQ development list
On Fri, May 13, 2011 at 12:36 AM, Benn Bollay <be...@f5.com> wrote:

> Using bog standard code (see http://pastebin.com/W3AvhQn5), and then forcing
> the tcp connection between the publisher and the subscriber to be reset, I
> am seeing duplicate messages on the subscriber’s side.  I’ve observed this
> behavior both in 2.0 and in 2.1.7, and in C and python code.

This is normal and expected. The PUB socket holds a small queue of
outgoing messages so that new subscribers get some history (10
messages or so, iirc). The SUB socket gets its messages, and then you
break the connection and the SUB socket reconnects, thus appears to
the PUB as a new subscriber.

-Pieter

Martin Sustrik

May 13, 2011, 4:00:41 AM
to ZeroMQ development list, Benn Bollay
Hi Benn,

> However, I would not be surprised if the internal tracking in 0mq simply
> tracks this by fd. Since the client reconnects so rapidly, it gets the
> same fd with the server, and a duplicate entry is contained within the
> internal tracking objects. This, I suspect, explains the duplicate messages.
>
> I’d like some independent clarification of this analysis.

Yes, that's the case. If an error is encountered on the socket, the
connection is restarted.

Martin

Benn Bollay

May 13, 2011, 12:08:30 PM
to ZeroMQ development list
> > Using bog standard code (see http://pastebin.com/W3AvhQn5), and then forcing
> > the tcp connection between the publisher and the subscriber to be reset, I
> > am seeing duplicate messages on the subscriber's side.  I've observed this
> > behavior both in 2.0 and in 2.1.7, and in C and python code.
>
> This is normal and expected. The PUB socket holds a small queue of
> outgoing messages so that new subscribers get some history (10
> messages or so, iirc). The SUB socket gets its messages, and then you
> break the connection and the SUB socket reconnects, thus appears to
> the PUB as a new subscriber.

Just to clarify: I don't believe this is a PUB queuing case, since the duplication continues indefinitely rather than stopping after a few messages. Rather, it looks like the same 'fd' value ends up registered in the internal data structures multiple times, because 0MQ never gets any kind of notification from epoll that the fd has been artificially closed.
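
To make the hypothesis concrete, here is a toy model in plain Python. It is only a sketch of the reasoning and not 0MQ's actual bookkeeping: ToyPublisher, its methods, and the fd numbering starting at 3 are all invented for illustration. The publisher tracks subscribers purely by fd number, the fd is closed without the tracking list being told, and the reconnect is given the lowest free number again.

```python
class ToyPublisher:
    """Toy fan-out that tracks subscribers by fd number only.

    Hypothetical model for illustration; not how 0MQ is implemented.
    """

    def __init__(self):
        self.open_fds = {}  # fd number -> inbox of the peer on that connection
        self.tracked = []   # fd numbers the publisher believes are subscribers

    def _lowest_free_fd(self):
        fd = 3  # pretend 0-2 are stdin/stdout/stderr
        while fd in self.open_fds:
            fd += 1
        return fd

    def accept(self, inbox):
        """A subscriber connects; it gets the lowest free fd number."""
        fd = self._lowest_free_fd()
        self.open_fds[fd] = inbox
        self.tracked.append(fd)
        return fd

    def close_behind_the_back(self, fd):
        """Close the fd without updating `tracked` (the gdb close(fd) step)."""
        del self.open_fds[fd]

    def publish(self, msg):
        """Send msg once per tracked entry, whatever that fd now refers to."""
        for fd in self.tracked:
            inbox = self.open_fds.get(fd)
            if inbox is not None:
                inbox.append(msg)


def demo():
    pub = ToyPublisher()
    first, second = [], []      # what the subscriber sees before/after reconnect

    fd1 = pub.accept(first)
    pub.publish("m1")           # delivered once

    pub.close_behind_the_back(fd1)
    fd2 = pub.accept(second)    # reconnect lands on the same fd number

    pub.publish("m2")           # two tracked entries, both point at fd2
    pub.publish("m3")
    return fd1, fd2, first, second


if __name__ == "__main__":
    print(demo())
```

In this model the stale entry never goes away, so every message after the reconnect is delivered twice for as long as the publisher runs, which matches the indefinite duplication I am seeing and is not what a bounded history queue would produce.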

Cheers,
--B

Martin Sustrik

May 13, 2011, 12:15:08 PM
to ZeroMQ development list, Benn Bollay
On 05/13/2011 06:08 PM, Benn Bollay wrote:

> Just to clarify - I don't believe this is any pub queuing case, since
> the duplications occur indefinitely. Rather it's an issue with
> having the same 'fd' value registered in the internal data structures
> multiple times, since it never gets any kind of notification from
> epoll that the fd has been artificially closed.

Not even EPOLLERR?

Martin

Benn Bollay

May 13, 2011, 12:31:03 PM
to Martin Sustrik, ZeroMQ development list
> > Just to clarify - I don't believe this is any pub queuing case, since
> > the duplications occur indefinitely. Rather it's an issue with
> > having the same 'fd' value registered in the internal data structures
> > multiple times, since it never gets any kind of notification from
> > epoll that the fd has been artificially closed.
>
> Not even EPOLLERR?

I haven't experimented directly, but the documentation indicates that it just gets cleaned up. There was some mildly related discussion on this point here: http://lwn.net/Articles/430793/

I haven't banged together a test case, but if it is generating an EPOLLERR, then 0mq surely isn't cleaning up the lingering reference internally.

Cheers,
--B

Martin Sustrik

May 14, 2011, 4:29:50 AM
to Benn Bollay, ZeroMQ development list
On 05/13/2011 06:31 PM, Benn Bollay wrote:

>> Not even EPOLLERR?
>
> I haven't experimented directly, but the documentation indicates that
> it just gets cleaned up. There was some mildly related discussion on
> this point here: http://lwn.net/Articles/430793/
>
> I haven't banged together a test case, but if it is generating an
> EPOLLERR, then 0mq surely isn't cleaning up the lingering reference
> internally.

Ok. But what are we actually discussing here? 0MQ is surely not expected
to be resilient against attaching it to gdb and messing with its
internal structures.

Are you seeing any problem in real usage?

Martin
