GCM CCS stability issues

Bogdan Zurac

unread,

Dec 4, 2014, 7:32:09 AM12/4/14

to andro...@googlegroups.com

Hello there.

We are in the final stages of releasing our app on Google Play, which uses GCM CCS upstream+downstream at its core, so stability is paramount to us. Until now we've ran into issues at every step of the way and we're still facing them.

For example, last night at around 8PM I received an upstream message from the app with id c2573369-db4d-4055-95c2-2c7e79bb15c2. I couldn't proccess it at the moment, so I didn't ACK it, relying on the fact that it will be sent again in the future by GCM server as per docs (For each device message your app server receives from CCS, it needs to send an ACK message. It never needs to send a NACK message. If you don't send an ACK for a message, CCS will just resend it.). It's 2 PM the next day and it's still not sent. This is just unbelievable... I really don't know what else to do. The amount of instability from GCM is huge for apps that rely on it at their core. If we keep running into issues like this, I'm seriously reconsidering relying on GCM... Haven't you guys started to wonder by this time what's up with all these people with stability issues ? This should've been resolved by now...

Thanks for taking the time to read this.

Have a nice day!

Sergey Mkhitaryan

unread,

Dec 4, 2014, 3:36:05 PM12/4/14

to andro...@googlegroups.com

Hi Bogdan,

CCS doesn't work that way. The message will be resent only after your client disconnect before acking it (assuming that the message would not expire by the moment). Also, if you receive too many messages and not ack them, your connection will be flow controlled and it won't receive any upstream messages before acking previous first.

If you're not always able to process upstream messages immediately, you should maintain your own storage for them.

Good day to you and thanks for your interest in GCM.

-Sergey

Bogdan Zurac

unread,

Dec 4, 2014, 3:43:57 PM12/4/14

to andro...@googlegroups.com

The message will be resent only after your client disconnect before acking it (assuming that the message would not expire by the moment)

I haven't understood what you we're trying to explain here. There was no problem with too many messages. In 10 minutes only that message was received.

Anyway, like I wrote above, this is the API DOC ("For each device message your app server receives from CCS, it needs to send an ACK message. It never needs to send a NACK message. If you don't send an ACK for a message, CCS will just resend it."). So the system did *not* perform as stated in the doc. That's the problem.

Tingmui Li

unread,

Dec 4, 2014, 9:22:25 PM12/4/14

to andro...@googlegroups.com

Hi,

The API documentation unfortunately is not clear (we will update to clarify for upcoming release), the way that CCS protocol works is:

It sends a message to the app server on a XMPP connection (connection 1), if the app server doesn't ack, we will retry sending only if the following conditions are met:

1) The app server disconnected the XMPP connection (connection 1) and reestablished a new XMPP connection

2) The message has not expired at the time that we retry (as mentioned above)

The reason that we do not retry sending the same message on the same XMPP connection is because upon delivering the message, our server has no visibility on what happens to the message (received, being processed etc); therefore, our server will retry only when a new XMPP connection is established (in which case, we will be sure that the app server did not receive/process the message).

Hope this helps clarify the confusion resulting from the documentation.

If you think that CCS is not delivering the message to your app server (first time or upon reconnect retry), please provide us the message attribute (without the payload and sensitive data) and time that it was sent from the device, and we can take a closer look.

Thanks

Bogdan Zurac

unread,

Dec 5, 2014, 4:23:27 PM12/5/14

to andro...@googlegroups.com

Hi Tingmui.

Thank you for the clarification, now I understand the situation and indeed this was the cause. You guys should be more careful with documentation parts like this, as developers rely on it for the actual implementation and system design. We are now required to redesign the respective part of our system in order to take into consideration that GCM system interaction.

Thanks anyway for clearing things up.

Have a nice day !

Bogdan Zurac

unread,

Dec 9, 2014, 4:38:17 AM12/9/14

to andro...@googlegroups.com

There also appears to be a problem with the same message being received multiple times on the server from GCM. Ex, I first received message with id "5bf3688f-f654-4c33-894e-06cef294076a" at 2014-12-08T17:42:48.041+0100 as I should, then I ACKed it.

Afterwards, at 2014-12-08T17:55:54.178+0100 I receive it again, with the exact same message id. What's up with that ?

On Thursday, December 4, 2014 2:32:09 PM UTC+2, Bogdan Zurac wrote:

Bogdan Zurac

unread,

Dec 10, 2014, 5:35:04 PM12/10/14

to andro...@googlegroups.com

I also get the below error from time to time. Today I got it twice in 5 minutes... Any idea what causes it ?

[2014-12-10T11:46:22.592-0500] [glassfish 4.1] [WARNING] [] [org.jivesoftware.smack.tcp.PacketWriter] [tid: _ThreadID=410 _ThreadName=Smack Packet Writer (1)] [timeMillis: 1418229982592] [levelValue: 900] [[
  Exception writing closing stream element
java.net.SocketException: Connection closed by remote host
 at sun.security.ssl.SSLSocketImpl.checkWrite(SSLSocketImpl.java:1523)
 at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:71)
 at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
 at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
 at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
 at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
 at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
 at java.io.BufferedWriter.flush(BufferedWriter.java:254)
 at org.jivesoftware.smack.tcp.PacketWriter.writePackets(PacketWriter.java:190)
 at org.jivesoftware.smack.tcp.PacketWriter.access$000(PacketWriter.java:40)
 at org.jivesoftware.smack.tcp.PacketWriter$1.run(PacketWriter.java:77)
]]


[2014-12-10T11:46:22.594-0500] [glassfish 4.1] [WARNING] [] [org.jivesoftware.smack.XMPPConnection] [tid: _ThreadID=411 _ThreadName=Smack Packet Reader (1)] [timeMillis: 1418229982594] [levelValue: 900] [[
  Connection closed with error
java.io.EOFException: no more data available - expected end tag </stream:stream> to close start tag <stream:stream> from line 1, parser stopped on END_TAG seen ...<iq type="result" id="zHEa7-9"/> ... @1:464
 at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:3035)
 at org.xmlpull.mxp1.MXParser.more(MXParser.java:3046)
 at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
 at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
 at org.jivesoftware.smack.tcp.PacketReader.parsePackets(PacketReader.java:291)
 at org.jivesoftware.smack.tcp.PacketReader.access$000(PacketReader.java:47)
 at org.jivesoftware.smack.tcp.PacketReader$1.run(PacketReader.java:81)
]]

On Thursday, December 4, 2014 2:32:09 PM UTC+2, Bogdan Zurac wrote:

Bogdan Zurac

unread,

Dec 15, 2014, 7:36:49 AM12/15/14

to andro...@googlegroups.com

Any info on the 2 above issues ? They really hinder all progress...

Costin Manolache

unread,

Dec 15, 2014, 5:28:07 PM12/15/14

to andro...@googlegroups.com

Duplicated message looks like a bug, we'll investigate it - it looks unusual, are you sure it is not a message id generated by your

application, and any chance it'll be sent 2x ? Do you happen to also have the persistent ID and/or registration ID associated with the message ?

For the 2 errors - looks like connection got terminated, it is expected to happen occasionally - we normally send a notification when

servers are upgrading gracefully, but there are many causes for a connection to be killed ( routers in between, unexpected hardware issues, etc ).

Your server needs to reconnect when this happens.

Costin

On Mon, Dec 15, 2014 at 4:36 AM, Bogdan Zurac <bogdan...@gmail.com> wrote:

Any info on the 2 above issues ? They really hinder all progress...

--
You received this message because you are subscribed to the Google Groups "android-gcm" group.
To unsubscribe from this group and stop receiving emails from it, send an email to android-gcm...@googlegroups.com.
To post to this group, send email to andro...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/android-gcm/8c8e1b93-2202-4aa0-90b6-55e3f4f6d614%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Bogdan Zurac

unread,

Dec 15, 2014, 5:53:14 PM12/15/14

to andro...@googlegroups.com

I have double checked multiple times to see if I don't generate/send the message multiple times. That doesn't appear to be the case. Another example happened today again.

First message received was at 2014-12-15T18:47:21.573+0000 with message_id 75898163-907a-4e76-a84b-47fa183b1568 from registration_id APA91bGuhx1EvYrbAUiV4v4NDAhZWko6qcttO4PtouS-To_dX-F9vDaQ1I-Br7cg8Qv1yTflSfu3KG6crjIH7KrFC7yTpOGQr44EFsr1DLgZX5OCnqT71dJKOzyu-f1dZz4mHmRr1pWOAhYOvypc7kze9eJfyFcuQU4Zr64snxBQmUC-Kk9dHe8.

I ACKed it at 2014-12-15T18:47:21.594+0000 with this JSON payload {"message_type":"ack","message_id":"75898163-907a-4e76-a84b-47fa183b1568","to":"APA91bGuhx1EvYrbAUiV4v4NDAhZWko6qcttO4PtouS-To_dX-F9vDaQ1I-Br7cg8Qv1yTflSfu3KG6crjIH7KrFC7yTpOGQr44EFsr1DLgZX5OCnqT71dJKOzyu-f1dZz4mHmRr1pWOAhYOvypc7kze9eJfyFcuQU4Zr64snxBQmUC-Kk9dHe8"}

Second message had the exact same message_id and registration_id as the first one and came at 2014-12-15T18:55:52.003+0000. Does something appear to be wrong from this data on my end ?

As for the connection terminated errors, how often would you expect them to occur ? I encounter them quite often; since a week ago from when I reported it, I encountered 3-4 more. So they don't appear to be as rare as one would think, ex once a week or so. Also, from the log ("Connection closed by remote host") it appears to be an outside error, not related to our server. Can't this be fixed by any means ? Regarding the re-connection, we already have that in place, of course. But it would be great to not have this problem in the first place. That's why I'm asking about a proper fix. Hasn't anybody else encountered this issue in their implementations ?

Bogdan Zurac

unread,

Dec 17, 2014, 6:01:08 AM12/17/14

to andro...@googlegroups.com

Again, 3 connection terminations in 6 hours...

2014-12-16T17:55:26.991+0000

2014-12-16T19:03:58.008+0000 - connection draining, opened another one

2014-12-16T19:24:45.857+0000 - 2nd terminated

2014-12-16T23:27:44.039+0000 - 2nd terminated

I seriously doubt this behavior is normal...

Bogdan Zurac

unread,

Dec 17, 2014, 8:46:36 PM12/17/14

to andro...@googlegroups.com

Just 4 hours ago from what I can see, I also received the same message twice, but now it was on the client side, which has never happened before. What the hell is going on.... ?

Tingmui Li

unread,

Dec 19, 2014, 12:11:05 PM12/19/14

to andro...@googlegroups.com

- For client receiving dup message, can you provide the message ID & registration ID, we will take a look. We are not aware of any dup message issues on client direction in general.

- For server receiving dup message (We are still investigating)

a) It appeared that message id: 75898163-907a-4e76-a84b-47fa183b1568 was delivered to your server 8 times. (We deliver message until we receive an ACK) - Interesting that your server only received it twice

b) Please confirm that you sent ACK on the same connection that you receive the message

c) It is possible that an ACK is sent but not received by us (router/network connection), the recommended best practice is to ACK each message received (even if it is a dup).

Do you ack message that is dup today? Would be good to know if after you ack dup messages whether there are more delivery of the message with the same ID

Bogdan Zurac

unread,

Dec 19, 2014, 12:39:16 PM12/19/14

to andro...@googlegroups.com

On Friday, December 19, 2014 7:11:05 PM UTC+2, Tingmui Li wrote:

- For client receiving dup message, can you provide the message ID & registration ID, we will take a look. We are not aware of any dup message issues on client direction in general.

There were 4 messages, all sent to APA91bF7sK8SocdhuECi_pn9qjbqA7nkrgb6hj20vhe4xcLNBYtDe5A_rk59iyQhAjUKkz6jO0xOhyr5F3em_cqFkVSmomQSQ06UBpVD-5o-U7TLjDInf4a2b79AQhd6-58BvmH4WQv8ArJYWe5rBOlGeLB29qx5gUIptjfVa0DXWeNMNoireTk.

m-cfa964a7-0ad5-41ea-b08d-15554921d4c3 sent at 2014-12-18T00:46:47.531+0000

m-9360fbeb-26bc-466d-b4ef-abbfe48602ed sent at 2014-12-18T00:46:47.551+0000

m-d7f332a6-ec43-4c09-91bc-aa3fef3ac584 sent at 2014-12-18T00:46:47.582+0000

m-ecc8e926-eaed-4e6b-9ac0-ce0ac5ffc985 sent at 2014-12-18T00:46:47.639+0000

I'll keep an eye out in the future for this.

- For server receiving dup message (We are still investigating)
a) It appeared that message id: 75898163-907a-4e76-a84b-47fa183b1568 was delivered to your server 8 times. (We deliver message until we receive an ACK) - Interesting that your server only received it twice

b) Please confirm that you sent ACK on the same connection that you receive the message

I have only 1 connection open at all times (except the time it takes for your servers to close the draining connection when switching to a newly opened one, but this wasn't the case), so yes, I ACKed it as soon as I received the message, as you can tell by the time stamps, 0.021 secs.

c) It is possible that an ACK is sent but not received by us (router/network connection), the recommended best practice is to ACK each message received (even if it is a dup).

It may be best practice to ACK it even though I receive it twice, but it still completely hinders our system put in place if I receive it multiple times. So this should not happen at all.

Do you ack message that is dup today? Would be good to know if after you ack dup messages whether there are more delivery of the message with the same ID

No, I haven't ACKed it after the first time. That's probably why you've seen it sent 8 times. Still, the question is why it has been sent the 2nd time, even though I ACKed it. Please check what happened in that respective time span (2014-12-15T18:47:21.573+0000 --- 2014-12-15T18:55:52.003+0000), not afterwards.

Bogdan Zurac

unread,

Jan 6, 2015, 2:36:17 AM1/6/15

to andro...@googlegroups.com

Can we get a resolution on these matters ? We really need to wrap things up once and for all...

Tingmui Li

unread,

Jan 6, 2015, 5:55:48 PM1/6/15

to andro...@googlegroups.com

We did not receive the ACK sent from your server to the upstream message. Although XMPP is a reliable protocol, but message not reaching an end point is still possible depending on actual connectivity.

Recommendation is to send ACK if you receive the same upstream message again.

Bogdan Zurac

unread,

Jan 8, 2015, 6:23:29 AM1/8/15

to andro...@googlegroups.com

On Wednesday, January 7, 2015 12:55:48 AM UTC+2, Tingmui Li wrote:

We did not receive the ACK sent from your server to the upstream message. Although XMPP is a reliable protocol, but message not reaching an end point is still possible depending on actual connectivity.

Recommendation is to send ACK if you receive the same upstream message again.

From all the issues we've encountered the past few months while developing with GCM CCS, doesn't really feel like it's very reliable... For example, on the live server we haven't encountered any issue for the last 4-5 days. But other times it's just not reliable at all (see examples above). Anyhow, we'll try and modify our system to account for duplicate messages and ACK-ing them as well. I'll try and keep an eye out if duplicate messages arrive after the change and if so, how many duplicates arrive.

What about the duplicate downstream messages ? Have you checked those as well ?

Also, is there currently any way to check info like this by ourselves ? This is not a very efficient way of debugging issues with developer apps... If there was a way for developers to view and observe by themselves messages from their GCM implementations, it would be much better, as developers know their applications best and are way more efficient in debugging the issues.

Tingmui Li

unread,

Jan 8, 2015, 8:48:34 PM1/8/15

to andro...@googlegroups.com

On downstream dup messages, we did not observe anomaly from server side, i.e.,

1) we sent the first message, wait for ack from device (ack is handled by GCM on device)

2) device disconnected without opportunities to send ack, it reconnected and we re-sent the message to device

In most of this situation, this does not translate to app receiving dup messages because GCM on device would handle de-duping messages.

The situation where dup messages could arrive to the app is when the device storage max out. Do you know if that device storage was at max when the downstream dup message occurred?

Is this easily reproducible? If so, it would be good to get the bug report from the device side, this together with the corresponding server info would help to give a more complete picture for the scenario.

Regarding enabling devs to check info like this, we are exploring the possibilities, and will update the dev community if this becomes available.

Bogdan Zurac

unread,

Jan 9, 2015, 10:09:37 AM1/9/15

to andro...@googlegroups.com

On Friday, January 9, 2015 at 3:48:34 AM UTC+2, Tingmui Li wrote:

On downstream dup messages, we did not observe anomaly from server side, i.e.,
1) we sent the first message, wait for ack from device (ack is handled by GCM on device)
2) device disconnected without opportunities to send ack, it reconnected and we re-sent the message to device
In most of this situation, this does not translate to app receiving dup messages because GCM on device would handle de-duping messages.
The situation where dup messages could arrive to the app is when the device storage max out. Do you know if that device storage was at max when the downstream dup message occurred?

No, 10GB of storage are available..

Is this easily reproducible? If so, it would be good to get the bug report from the device side, this together with the corresponding server info would help to give a more complete picture for the scenario.

Well, no. Like I said, that was the first time it happened, and no code base was changed. We'll look into this further for more examples. If you have any other idea how to troubleshoot this, it would be great.

Bogdan Zurac

unread,

Apr 25, 2015, 3:47:31 PM4/25/15

to andro...@googlegroups.com

Almost 4 months have passed and we haven't received anymore duplicate messages on the client side. Until today.

We received 3 upstream messages from a device and tried to send 6 downstream messages to 2 devices. Unfortunately we received a lot of SERVICE_UNAVAILABLE NACKS. No problem, our implementation saves the downstream unsent yet messages and tried to send them later on. And so it did, until they were ACK. The problem is that on one device, we received 2 messages twice.

Again, no code change, nothing out of the ordinary. Just that the server received those NACKs. So, is it possible that we received a NACK, even though the message got through?

Device registration id: APA91bF7sK8SocdhuECi_pn9qjbqA7nkrgb6hj20vhe4xcLNBYtDe5A_rk59iyQhAjUKkz6jO0xOhyr5F3em_cqFkVSmomQSQ06UBpVD-5o-U7TLjDInf4a2b79AQhd6-58BvmH4WQv8ArJYWe5rBOlGeLB29qx5gUIptjfVa0DXWeNMNoireTk

Message 1 that wasn't duplicated: f169ca3a-656a-4b7e-8652-116b3dbc2541

Messages 2 and 3 which were duplicated: 98ffc915-0c1b-40ce-a350-b67a5560cfca, ea625215-5c59-4624-93d4-248ff5b2e5b3

Bogdan Zurac

unread,

Apr 30, 2015, 3:35:42 AM4/30/15

to andro...@googlegroups.com

Any feedback on this?

Bogdan Zurac

unread,

May 5, 2015, 6:26:31 AM5/5/15

to andro...@googlegroups.com

Can we get some response on this matter please ?

Bogdan Zurac

unread,

May 16, 2015, 2:49:46 AM5/16/15

to andro...@googlegroups.com

Yea, cool feedback guys... GG. We had to switch away from the GCM CCS implementation because of extremely poor stability and support.

I would suggest to other developers to do the same, if their system provides this option. It's just not worth all the trouble...

Reply all

Reply to author

Forward