Seti user count mismatch when one comet is paused

27 views
Skip to first unread message

steven winner

unread,
Feb 24, 2016, 12:20:18 PM2/24/16
to cometd-users
I'm profiling our cometd app, and one thing I noticed (this might be unavoidable, but I'm not sure) is that the count returned from

seti.getUserIds().size()

is the same on each comet/node (as I expect it to be), and then if I take a snapshot/heapdump, it obviously pauses the JVM of that cometd while it takes the snapshot/dump (which is to be expected), 
but the counts are off once the snapshot/dump completes and seti doesn't seem to sync back up properly with each other
(I have two nodes, and if I snapshot node A, node B's seti count is the one that drops by approximately the count of clients connected to node A when the snapshot/dump was initiated)

Is this behavior expected/intentional, or is there something I can implement to make sure the user's associated with seti in each node are properly in sync with each other? Or a way to manually trigger them to sync back up?


Thanks,

Steve

Simone Bordet

unread,
Feb 24, 2016, 1:26:21 PM2/24/16
to cometd-users
Hi,

On Wed, Feb 24, 2016 at 6:20 PM, steven winner <steven...@gmail.com> wrote:
> I'm profiling our cometd app, and one thing I noticed (this might be
> unavoidable, but I'm not sure) is that the count returned from
>
> seti.getUserIds().size()
>
> is the same on each comet/node (as I expect it to be), and then if I take a
> snapshot/heapdump, it obviously pauses the JVM of that cometd while it takes
> the snapshot/dump (which is to be expected),
> but the counts are off once the snapshot/dump completes and seti doesn't
> seem to sync back up properly with each other

That's not expected.

> (I have two nodes, and if I snapshot node A, node B's seti count is the one
> that drops by approximately the count of clients connected to node A when
> the snapshot/dump was initiated)

This typically means that the pause has been really long and the node
B thought that node A was down.

> Is this behavior expected/intentional, or is there something I can implement
> to make sure the user's associated with seti in each node are properly in
> sync with each other? Or a way to manually trigger them to sync back up?

There is no facility to resync nodes across the cluster but that might
be a good thing to have.

Can you provide the details (DEBUG logs) of the 2 nodes while you're
heap dumping ?

Thanks !

--
Simone Bordet
----
http://cometd.org
http://webtide.com
Developer advice, training, services and support
from the Jetty & CometD experts.

steven winner

unread,
Feb 24, 2016, 2:15:08 PM2/24/16
to cometd-users, sbo...@webtide.com
Thanks for the quick reply Simone.


On Wednesday, February 24, 2016 at 1:26:21 PM UTC-5, Simone Bordet wrote:
Hi,

On Wed, Feb 24, 2016 at 6:20 PM, steven winner <steven...@gmail.com> wrote:
> I'm profiling our cometd app, and one thing I noticed (this might be
> unavoidable, but I'm not sure) is that the count returned from
>
> seti.getUserIds().size()
>
> is the same on each comet/node (as I expect it to be), and then if I take a
> snapshot/heapdump, it obviously pauses the JVM of that cometd while it takes
> the snapshot/dump (which is to be expected),
> but the counts are off once the snapshot/dump completes and seti doesn't
> seem to sync back up properly with each other

That's not expected.

> (I have two nodes, and if I snapshot node A, node B's seti count is the one
> that drops by approximately the count of clients connected to node A when
> the snapshot/dump was initiated)

This typically means that the pause has been really long and the node
B thought that node A was down.

> Is this behavior expected/intentional, or is there something I can implement
> to make sure the user's associated with seti in each node are properly in
> sync with each other? Or a way to manually trigger them to sync back up?

There is no facility to resync nodes across the cluster but that might
be a good thing to have.

Can you provide the details (DEBUG logs) of the 2 nodes while you're
heap dumping ?

I will get those logs for you, but since it's from production (profiling/snapshots from testing doesn't pause long enough to replicate this scenario), the logs will be rather chatty when I put org.cometd.oort into debug logging.
Do you need the debug logs from startup through the heapdump, or should I null the logs out (cat /dev/null > debug.log) right before initiating the snapshot/heapdump?

Just want to keep the logs as short and relevant as possible so you don't have to sift through a bunch of irrelevant debug log statements.
I can provide the complete logs from startup or the shortened version, whichever you'd like.

-Steve

Simone Bordet

unread,
Feb 24, 2016, 2:41:45 PM2/24/16
to cometd-users, Bordet, Simone
Hi,

On Wed, Feb 24, 2016 at 8:15 PM, steven winner <steven...@gmail.com> wrote:
> I will get those logs for you, but since it's from production
> (profiling/snapshots from testing doesn't pause long enough to replicate
> this scenario), the logs will be rather chatty when I put org.cometd.oort
> into debug logging.
> Do you need the debug logs from startup through the heapdump, or should I
> null the logs out (cat /dev/null > debug.log) right before initiating the
> snapshot/heapdump?

The latter may suffice... perhaps give it a minute to get some context.

If you can reproduce this with a test environment will be great.

steve winner

unread,
Feb 24, 2016, 2:57:47 PM2/24/16
to cometd...@googlegroups.com
Simone,

> On Feb 24, 2016, at 2:41 PM, Simone Bordet <sbo...@webtide.com> wrote:
>
> Hi,
>
> On Wed, Feb 24, 2016 at 8:15 PM, steven winner <steven...@gmail.com> wrote:
>> I will get those logs for you, but since it's from production
>> (profiling/snapshots from testing doesn't pause long enough to replicate
>> this scenario), the logs will be rather chatty when I put org.cometd.oort
>> into debug logging.
>> Do you need the debug logs from startup through the heapdump, or should I
>> null the logs out (cat /dev/null > debug.log) right before initiating the
>> snapshot/heapdump?
>
> The latter may suffice... perhaps give it a minute to get some context.
>
> If you can reproduce this with a test environment will be great.

No problem. I’ll try again in a test environment to double check if this issue is reproducible at that scale. If so, awesome, if not, I’ll post those shortened logs, and if we need to, we’ll escalate to full debug logs.

I’ll post them in a bit.

Thanks!

-Steve

>
> --
> Simone Bordet
> ----
> http://cometd.org
> http://webtide.com
> Developer advice, training, services and support
> from the Jetty & CometD experts.
>
> --
> --
> You received this message because you are subscribed to the Google Groups "cometd-users" group.
> To post to this group, send email to cometd...@googlegroups.com
> To unsubscribe from this group, send email to cometd-users...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/cometd-users
>
> Visit the CometD website at http://www.cometd.org
>
> ---
> You received this message because you are subscribed to a topic in the Google Groups "cometd-users" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/cometd-users/RGStNGweVTc/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to cometd-users...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

steven winner

unread,
Feb 25, 2016, 9:52:00 AM2/25/16
to cometd-users
Simone,

Here are the logs (they were too big to share directly here, so here are google drive links)

push1_cometd_oort.log.gz:  https://drive.google.com/file/d/0B2koH7ylr620eEhlQjdPZ3JnSHM/view?usp=sharing


I am using log4j2.x which I have setup to scan for changes to the log config file, so I changed it to send all logging from package org.cometd to a separate log file, 
with org.cometd logging level set to INFO, and org.cometd.oort set to DEBUG.

As soon as I did that, this scenario happened again (it unleashes a LOT of log statements), so I didn't end up taking a snapshot/heapdump to replicate this (seti was only a few clients out of sync when I changed the logging conf). 

Please let me know if you need more/different logging data.

Thanks,

Steve

*note.  I did a string replacement for our sites domain name with "DOMAIN_NAME_REPLACED", just in case you are wondering why that's in there.
> To unsubscribe from this group, send email to cometd-users-unsubscribe@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/cometd-users
>
> Visit the CometD website at http://www.cometd.org
>
> ---
> You received this message because you are subscribed to a topic in the Google Groups "cometd-users" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/cometd-users/RGStNGweVTc/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to cometd-users+unsubscribe@googlegroups.com.

Simone Bordet

unread,
Mar 4, 2016, 9:37:21 AM3/4/16
to cometd-users
Steven,

On Thu, Feb 25, 2016 at 3:51 PM, steven winner <steven...@gmail.com> wrote:
> Simone,
>
> Here are the logs (they were too big to share directly here, so here are
> google drive links)
>
> push1_cometd_oort.log.gz:
> https://drive.google.com/file/d/0B2koH7ylr620eEhlQjdPZ3JnSHM/view?usp=sharing
> push2_cometd_oort.log.gz:
> https://drive.google.com/file/d/0B2koH7ylr620bllOUHVENEV3bEE/view?usp=sharing
>
>
> I am using log4j2.x which I have setup to scan for changes to the log config
> file, so I changed it to send all logging from package org.cometd to a
> separate log file,
> with org.cometd logging level set to INFO, and org.cometd.oort set to DEBUG.
>
> As soon as I did that, this scenario happened again (it unleashes a LOT of
> log statements), so I didn't end up taking a snapshot/heapdump to replicate
> this (seti was only a few clients out of sync when I changed the logging
> conf).
>
> Please let me know if you need more/different logging data.

Unfortunately I don't see anything weird in these logs.
The first log entry that tells how many associations there are for
push1 says 1580.
For push2 this number is 1744.

Yes they are off, but there is no evidence in the logs of what happened.

If you can reproduce in test environment, can please attach the logs
from the beginning ?

steven winner

unread,
Mar 10, 2016, 10:19:02 AM3/10/16
to cometd-users, sbo...@webtide.com
Simone,

I am unable to reproduce this in testing, but it's not necessarily a bad thing, since I refactored my whole application after profiling with yourkit, and it's running smoother than ever now (no more memory leaks, and memory footprint is tiny now).  However, I would be of the opinion that it would be useful to be able to "re-sync" the associated Seti users (by some exposed method you can call manually, or by some other means)  when they are clearly out of sync.  This is something I've noticed from time to time in the past during normal usage (admittedly in my older, memory leaking version of my application, that would cause some STW operations), and might come in handy.

Also, one other quick question about Seti... Are there any plans in the works to possibly associate more than just a string user id with Seti, like you can with the session?  Would be a useful feature to have a simple user object associated with a key in Seti (a DB user id), like you can with sessions, and obviously can accomplish with OortStringMap etc, but would be available across comets without having to maintain our own distributed maps/lists/sets/whatever of connected users... Just a thought.

Thanks!

Steve

Simone Bordet

unread,
Mar 11, 2016, 3:03:28 AM3/11/16
to cometd-users, Bordet, Simone
Hi,

On Thu, Mar 10, 2016 at 4:19 PM, steven winner <steven...@gmail.com> wrote:
> Simone,
>
> I am unable to reproduce this in testing, but it's not necessarily a bad
> thing, since I refactored my whole application after profiling with yourkit,
> and it's running smoother than ever now (no more memory leaks, and memory
> footprint is tiny now). However, I would be of the opinion that it would be
> useful to be able to "re-sync" the associated Seti users (by some exposed
> method you can call manually, or by some other means) when they are clearly
> out of sync. This is something I've noticed from time to time in the past
> during normal usage (admittedly in my older, memory leaking version of my
> application, that would cause some STW operations), and might come in handy.

I'm not sure about the Seti re-sync. If there is a bug so that users
go out of sync, then I'd prefer to chase that down.
Also, there always is the possibility of a race between an external
re-sync and a normal presence message, so that the re-sync may
actually cause an out-of-sync.

> Also, one other quick question about Seti... Are there any plans in the
> works to possibly associate more than just a string user id with Seti, like
> you can with the session? Would be a useful feature to have a simple user
> object associated with a key in Seti (a DB user id), like you can with
> sessions, and obviously can accomplish with OortStringMap etc, but would be
> available across comets without having to maintain our own distributed
> maps/lists/sets/whatever of connected users... Just a thought.

Not sure here either. There would be restrictions to be applied to the
objects you can store (they must be JSON serializable), and perhaps at
that point using OortStringMap rather than Seti is similar.

Feel free to raise an issue, and would be great if you have a more
detailed use case, perhaps one that shows that there is a missing
functionality between Seti and OortObject: you can do that
functionality with both, but it's too much work with either. If you
are in that spot, we may consider adding a new functionality (e.g. an
extended Seti) to CometD.

steven winner

unread,
Mar 11, 2016, 9:42:59 AM3/11/16
to cometd-users, sbo...@webtide.com
Simone,

Thanks for the response/feedback.


On Friday, March 11, 2016 at 3:03:28 AM UTC-5, Simone Bordet wrote:
Hi,

On Thu, Mar 10, 2016 at 4:19 PM, steven winner <steven...@gmail.com> wrote:
> Simone,
>
> I am unable to reproduce this in testing, but it's not necessarily a bad
> thing, since I refactored my whole application after profiling with yourkit,
> and it's running smoother than ever now (no more memory leaks, and memory
> footprint is tiny now).  However, I would be of the opinion that it would be
> useful to be able to "re-sync" the associated Seti users (by some exposed
> method you can call manually, or by some other means)  when they are clearly
> out of sync.  This is something I've noticed from time to time in the past
> during normal usage (admittedly in my older, memory leaking version of my
> application, that would cause some STW operations), and might come in handy.

I'm not sure about the Seti re-sync. If there is a bug so that users
go out of sync, then I'd prefer to chase that down.
Also, there always is the possibility of a race between an external
re-sync and a normal presence message, so that the re-sync may
actually cause an out-of-sync.


Yea, I can see how that would be a potential problem.  I need to setup some load testing so I can try to replicate in a development environment, because the debug logs in production are crazy.  I do notice however that bringing up a new comet, or restarting just one comet will have that new comet unaware of existing users (i.e. the ```Seti.getUserIds().size()``` on the new comet doesn't seem to reflect the total number of users that were switched to the comet still running when the other comet was taken down and brought back up, most likely due to the presence messages not being received when down... maybe when a comet joins the cluster, the new comet should be informed of all the associated users on the other nodes?)

Hopefully I can get some good logs in dev so we can shed some more light on this.
 
> Also, one other quick question about Seti... Are there any plans in the
> works to possibly associate more than just a string user id with Seti, like
> you can with the session?  Would be a useful feature to have a simple user
> object associated with a key in Seti (a DB user id), like you can with
> sessions, and obviously can accomplish with OortStringMap etc, but would be
> available across comets without having to maintain our own distributed
> maps/lists/sets/whatever of connected users... Just a thought.

Not sure here either. There would be restrictions to be applied to the
objects you can store (they must be JSON serializable), and perhaps at
that point using OortStringMap rather than Seti is similar.


This one was a long shot, and I agree can be handled already with an OortStringMap.  Just something that might be easier to setup than using the Oort objects, and both need to be JSON serializable anyhow.

Simone Bordet

unread,
Mar 11, 2016, 9:58:56 AM3/11/16
to cometd-users, Bordet, Simone
Hi,

On Fri, Mar 11, 2016 at 3:42 PM, steven winner <steven...@gmail.com> wrote:
> I do notice however that bringing
> up a new comet, or restarting just one comet will have that new comet
> unaware of existing users (i.e. the ```Seti.getUserIds().size()``` on the
> new comet doesn't seem to reflect the total number of users that were
> switched to the comet still running when the other comet was taken down and
> brought back up, most likely due to the presence messages not being received
> when down... maybe when a comet joins the cluster, the new comet should be
> informed of all the associated users on the other nodes?)

Seti has a node listener that is informed when a new node comes up.
That listener sends to the new Seti the full list of userIds in order
to keep the Setis in sync.

However, starting a node involves first starting Oort and then starting Seti.
The Seti on the started node will see that Oort started and may emit
the userIds before the Seti on the starting node is ready to receive
them.

Can you please file an issue so that I can verify this ?

steven winner

unread,
Mar 11, 2016, 10:14:37 AM3/11/16
to cometd-users, sbo...@webtide.com
Simone,


Thanks!

Simone Bordet

unread,
Mar 11, 2016, 10:19:27 AM3/11/16
to cometd-users, Bordet, Simone
Hi,

> On Friday, March 11, 2016 at 9:58:56 AM UTC-5, Simone Bordet wrote:
>> Seti has a node listener that is informed when a new node comes up.
>> That listener sends to the new Seti the full list of userIds in order
>> to keep the Setis in sync.
>>
>> However, starting a node involves first starting Oort and then starting
>> Seti.
>> The Seti on the started node will see that Oort started and may emit
>> the userIds before the Seti on the starting node is ready to receive
>> them.
>>
>> Can you please file an issue so that I can verify this ?

Nah, it's handled correctly.

While it is true that the first notification may be missed, when Seti2
is finally started it broadcasts its own users (typically an empty
array) with a specific field called "alive".
When Seti1 receives this "alive" message from Seti2, it sends back its
own users, so that the 2 Setis are now in sync.

We're back to square 1: if you can reproduce that will be great.

Thanks !

steven winner

unread,
Mar 11, 2016, 10:56:02 AM3/11/16
to cometd-users, sbo...@webtide.com
Simone,


On Friday, March 11, 2016 at 10:19:27 AM UTC-5, Simone Bordet wrote:
Nah, it's handled correctly.

While it is true that the first notification may be missed, when Seti2
is finally started it broadcasts its own users (typically an empty
array) with a specific field called "alive".
When Seti1 receives this "alive" message from Seti2, it sends back its
own users, so that the 2 Setis are now in sync.

We're back to square 1: if you can reproduce that will be great.

Thanks !


We're going to get to the bottom of this one.  I'm going to get a good test going on my dev so we have some logs we can work with and a test that's reproducible. 

Thanks again!

Steve

Fabian Christ

unread,
Mar 23, 2016, 4:59:16 AM3/23/16
to cometd-users, sbo...@webtide.com
Hi Steve, Simone,

we have seen the same issue once in our setup. We have two nodes and restart them occasionally on updates. Once we had the situation that Seti was out of sync but we could not reproduce it since then. So if there are any news about this issue, it would be great to get to know.

Best
Fabian

Simone Bordet

unread,
Mar 23, 2016, 6:04:35 AM3/23/16
to cometd-users, Bordet, Simone
Hi,

On Wed, Mar 23, 2016 at 9:59 AM, 'Fabian Christ' via cometd-users
<cometd...@googlegroups.com> wrote:
> Hi Steve, Simone,
>
> we have seen the same issue once in our setup. We have two nodes and restart
> them occasionally on updates. Once we had the situation that Seti was out of
> sync but we could not reproduce it since then. So if there are any news
> about this issue, it would be great to get to know.

Sure, and any mean you have to reproduce the problem will be great.
So far, it has been an elusive issue.
Reply all
Reply to author
Forward
0 new messages