Pub/Sub performance / Stackdriver metrics

450 views
Skip to first unread message

Doug Hoard

unread,
Feb 28, 2018, 8:01:59 PM2/28/18
to Google Cloud Pub/Sub Discussions
I have a Java application that pulls a batch of messages directly via the HTTPS API.

The application uses 50 threads to pull messages, puts them on an internal queue as a buffer.

The application uses 50 threads to process the queue to forward to an HTTPS REST service.

The application uses 50 threads to asynchronously ack the messages.

This seems to work well and isn’t CPU bound ... but the Stackdriver metrics seem odd.

Stackdriver will sometimes show the count of unacknowledged messages for the subscription slowly growing, yet it will also show the age of the oldest unacknowledged message being 60s or less.

If I look at my logs, it seems to disagree with the unacknowledged message ... but seems to agree with the oldest unacknowledged message age.

Thoughts?

Kir Titievsky

unread,
Mar 1, 2018, 6:35:45 AM3/1/18
to doug....@gmail.com, Google Cloud Pub/Sub Discussions
Doug, 

I think the data point missing in your report is the publish rate.  Sorry if that's trivial, but worth covering: your observations could be explained by spikes in publish volume, lasting under 60 seconds. 

Otherwise, if you have a support plan, please submit a case with all the details (project number & id, topic & subscription names, time range) and Stackdriver screenshots.

If not, you can send me a mail directly with the same info.  No promises, but I might have a bit of time to look into this.

Also: may I ask why you chose to call the REST API directly rather than using the java client library?




--
You received this message because you are subscribed to the Google Groups "Google Cloud Pub/Sub Discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-pubsub-dis...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-pubsub-discuss/241bb823-d442-4bc1-a98c-009744a36d0a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Kir Titievsky | Product Manager | Google Cloud Pub/Sub 

Doug Hoard

unread,
Mar 1, 2018, 9:26:49 AM3/1/18
to Google Cloud Pub/Sub Discussions
Kir,

Regarding publish rate, from about 4 AM - 6 AM this morning, the rate is only around 3 messages a minute (we send "heartbeat" messages from our application).... yet I still see around 100 messages on the subscription as well as the oldest message being over 60 minutes.

Yet, if I do a "gcloud beta pubsub subscriptions pull <subscription name> --limit 1000 --auto-ack" I don't get any messages other than the 3 "heartbeat" messages.

We have support, and I have opened Case 15126613, which has more detailed / private information.

I chose to implement direct to the HTTP API because I saw various issues that have been documented in ...


Some of these have been marked "Closed", yet I have seen them with the lasted official Google Java Pub / Sub client libraries.

Also, I was seeing pauses responding to messages volume adjustments. (When messages volumes where consistent, processing was consistent .... when message volumes dropped / sporadic, processing had high latency.)


On Thursday, March 1, 2018 at 6:35:45 AM UTC-5, Kir Titievsky wrote:
Doug, 

I think the data point missing in your report is the publish rate.  Sorry if that's trivial, but worth covering: your observations could be explained by spikes in publish volume, lasting under 60 seconds. 

Otherwise, if you have a support plan, please submit a case with all the details (project number & id, topic & subscription names, time range) and Stackdriver screenshots.

If not, you can send me a mail directly with the same info.  No promises, but I might have a bit of time to look into this.

Also: may I ask why you chose to call the REST API directly rather than using the java client library?




On Wed, Feb 28, 2018 at 8:02 PM Doug Hoard <doug....@gmail.com> wrote:
I have a Java application that pulls a batch of messages directly via the HTTPS API.

The application uses 50 threads to pull messages, puts them on an internal queue as a buffer.

The application uses 50 threads to process the queue to forward to an HTTPS REST service.

The application uses 50 threads to asynchronously ack the messages.

This seems to work well and isn’t CPU bound ... but the Stackdriver metrics seem odd.

Stackdriver will sometimes show the count of unacknowledged messages for the subscription slowly growing, yet it will also show the age of the oldest unacknowledged message being 60s or less.

If I look at my logs, it seems to disagree with the unacknowledged message ... but seems to agree with the oldest unacknowledged message age.

Thoughts?

--
You received this message because you are subscribed to the Google Groups "Google Cloud Pub/Sub Discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-pubsub-discuss+unsub...@googlegroups.com.

Doug Hoard

unread,
Mar 3, 2018, 1:29:17 PM3/3/18
to Google Cloud Pub/Sub Discussions
Looking at some more logging I have added to the code, it appears that the Pub / Sub service is introducing latency.

Here is some example data ...

PubSubMessage(ackId=[GERJUytDCypYEU4EISE-MD5FU0RQBhYsXUZIUTcZCGhRDk9eIz81IChFEQMHTwIoXXk1SzRBW3NoUQ0Zcn1gc2sOQFQCEAEqDlEaDmxaXn8HQzwZUENld2hbFQIFQlV5XloYCECwlqns5NkjZiE9XhJLLD5-LC1FQV5AEkwn] publishTimestamp=[2018-03-03T18:03:49.785Z] creationTimestamp=[1520100530455]

The message was published at "2018-03-03T18:03:49.785Z" but the API pull request (which is set to Integer.MAX_VALUE messages with returnImmediately = false did not get the message until "1520100530455" milliseconds, which is ""2018-03-03T18:08:50.000Z"

With a a 300 seconds (5 minute) acknowledgement timeout on the subscription, the message is already "expired" when we get it.

Doug Hoard

unread,
Mar 4, 2018, 11:14:06 PM3/4/18
to Google Cloud Pub/Sub Discussions
I have made some code changes, specifically changed the pull size to 100 messages (from Integer.MAX_VALUE), and appear to be experiencing less latency. The code hasn't ran long enough to confirm this works around the issue.

Without any hard metrics, it seems like setting the pull size to a large value (Integer.MAX_VALUE), with "returnImmediately = false", that server will end up returning messages with less then the configured subscription acknowledgement setting.

I can change the pull size to a large value, with "returnImmediately = true", sleeping for 10 second, if no messages are returned, which in theory would fix latency, but concerned with over polling the Pub / Sub REST API.

Alex Mordkovich

unread,
Mar 5, 2018, 10:11:27 AM3/5/18
to doug....@gmail.com, Google Cloud Pub/Sub Discussions
On Sun, Mar 4, 2018 at 11:14 PM Doug Hoard <doug....@gmail.com> wrote:
I have made some code changes, specifically changed the pull size to 100 messages (from Integer.MAX_VALUE), and appear to be experiencing less latency. The code hasn't ran long enough to confirm this works around the issue.

Without any hard metrics, it seems like setting the pull size to a large value (Integer.MAX_VALUE), with "returnImmediately = false", that server will end up returning messages with less then the configured subscription acknowledgement setting.

I can change the pull size to a large value, with "returnImmediately = true", sleeping for 10 second, if no messages are returned, which in theory would fix latency, but concerned with over polling the Pub / Sub REST API.

You need not worry about overpolling the Pub / Sub REST API. However, it is recommended that you use the client libraries, which are built on top of the GRPC-based (rather than JSON/REST-based) Streaming Pull API, which allows you to not worry about maintaining outstanding Pull requests.


On Saturday, March 3, 2018 at 1:29:17 PM UTC-5, Doug Hoard wrote:
Looking at some more logging I have added to the code, it appears that the Pub / Sub service is introducing latency.

Here is some example data ...

PubSubMessage(ackId=[GERJUytDCypYEU4EISE-MD5FU0RQBhYsXUZIUTcZCGhRDk9eIz81IChFEQMHTwIoXXk1SzRBW3NoUQ0Zcn1gc2sOQFQCEAEqDlEaDmxaXn8HQzwZUENld2hbFQIFQlV5XloYCECwlqns5NkjZiE9XhJLLD5-LC1FQV5AEkwn] publishTimestamp=[2018-03-03T18:03:49.785Z] creationTimestamp=[1520100530455]

The message was published at "2018-03-03T18:03:49.785Z" but the API pull request (which is set to Integer.MAX_VALUE messages with returnImmediately = false did not get the message until "1520100530455" milliseconds, which is ""2018-03-03T18:08:50.000Z"

With a a 300 seconds (5 minute) acknowledgement timeout on the subscription, the message is already "expired" when we get it.

Note that the acknowledgement deadline "starts ticking" when the message is returned in the PullResponse, not when the message is published.

 


On Wednesday, February 28, 2018 at 8:01:59 PM UTC-5, Doug Hoard wrote:
I have a Java application that pulls a batch of messages directly via the HTTPS API.

The application uses 50 threads to pull messages, puts them on an internal queue as a buffer.

The application uses 50 threads to process the queue to forward to an HTTPS REST service.

The application uses 50 threads to asynchronously ack the messages.

This seems to work well and isn’t CPU bound ... but the Stackdriver metrics seem odd.

Stackdriver will sometimes show the count of unacknowledged messages for the subscription slowly growing, yet it will also show the age of the oldest unacknowledged message being 60s or less.

If I look at my logs, it seems to disagree with the unacknowledged message ... but seems to agree with the oldest unacknowledged message age.

Thoughts?

--
You received this message because you are subscribed to the Google Groups "Google Cloud Pub/Sub Discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-pubsub-dis...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-pubsub-discuss/bd41eaea-c0f8-45f3-abee-d024c62ef7a4%40googlegroups.com.

Doug Hoard

unread,
Mar 5, 2018, 10:38:38 AM3/5/18
to Google Cloud Pub/Sub Discussions
Good to know that Pub / Sub REST API isn't sensitive to overpolling.

Regarding the official Java gRPC libraries ... 

The gRPC pull API was originally in use.  After seeing numerous latency issues as well as gRPC INFO prints about "AbstractClientStream - Received data on closed stream"  (https://github.com/grpc/grpc-java/issues/3810) I moved to my own client.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-pubsub-discuss+unsub...@googlegroups.com.

Doug Hoard

unread,
Mar 9, 2018, 2:47:17 PM3/9/18
to Google Cloud Pub/Sub Discussions
Things seemed pretty good with my implementation of a polling client, but had a large number of requests.

I switch to back to use the Official Pub / Sub Java async client ... which seemed good until about 10:37 EST AM today.

Logging in the code there are messages that are taking 10 minutes from publish time till the time I receive them.

Message was published more than 11 minutes 1 seconds ago, received 0 minutes 0 seconds ago, message = [PubSubMessage(publishTimestamp=[1520621403167],receiveTimestamp=[1520622064226],data=[{"key":"value"})]


On Wednesday, February 28, 2018 at 8:01:59 PM UTC-5, Doug Hoard wrote:
Reply all
Reply to author
Forward
0 new messages