High latencies with NSQ

Ashish Goel

Oct 15, 2015, 5:23:31 AM
to nsq-users
Hi,

We are observing unexpectedly high latencies when producing a single message to NSQ on a production server. The queue is hosted on the same server. The latencies sometimes go as high as 5 minutes. During the same period, I also see a lot of connection errors in the nsq logs. Is there any default timeout configured in TrendrrJavaClient?

Matt Reiferson

Oct 15, 2015, 11:11:52 AM
to Ashish Goel, nsq-users
Hi Ashish,

Can you provide the command line or config you're passing to nsqd, as well as logs from nsqd and the consumer in question?

Thanks,

Matt

Ashish Goel

Oct 15, 2015, 6:34:25 PM
to Matt Reiferson, nsq-users
Hi Matt,

Thanks for the quick response.

Here are the parameters used to start nsqd:

-tcp-address=0.0.0.0:8195 -http-address=0.0.0.0:8196 -e2e-processing-latency-percentile=10,50,90,100 -lookupd-tcp-address=lookup-server-1.ec2:8197 -lookupd-tcp-address=lookup-server-2.ec2:8197

The nsqd logs from that period show I/O errors. I have mailed you the logs.

Overall, TCP connections kept being established and dropped for almost 10 minutes, after which things stabilized.

--
Thanks,
Ashish

Matt Reiferson

Oct 17, 2015, 2:32:46 PM
to Ashish Goel, nsq-users
After looking at the logs (provided separately for those following along at home), and given your description that it was intermittent, it looks like there could have been either CPU or network starvation on that host causing a variety of timeouts and errors.

Was the host under pressure during this window of time?  Do you have CPU/memory/disk/network charts for the time period in question?

Also, I don't see any TrendrrJavaClient connections in the logs; they all look like various nsq_to_file and nsq_to_http consumers?

Ashish Goel

Oct 17, 2015, 4:49:26 PM
to Matt Reiferson, nsq-users
It looked intermittent because after that event the service had a brownout and had to be restarted.
TrendrrJavaClient is used for publishing; nsq_to_file and nsq_to_http are the consumers.

Host starvation is possible, but probably not CPU: it is a 36-core machine, and I looked at the CPU load during that time. It did go high for a short period, but peaked at 21. More than 50 GB of memory was available, and disk usage was also low, under 10%.

It could be network starvation. Which network logs do you want me to look at? The number of CLOSE_WAIT connections did increase on the host, because the clients calling the service (the one that publishes to nsq) started to time out, creating a big backlog of CLOSE_WAIT connections on the host.
--
Thanks,
Ashish

Matt Reiferson

Oct 17, 2015, 7:17:41 PM
to Ashish Goel, nsq-users
Unless you're running nsqd with GOMAXPROCS set to > 1, it will only use a single core.

By disk I meant IO, not space.

What is the volume of messages and number of topics on this nsqd?
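
For anyone following along, the GOMAXPROCS point above can be checked with a few lines of Go. This is only a sketch of how a Go program reads the setting, not anything taken from nsqd itself; the helper name effectiveProcs is made up for illustration. Note that Go releases before 1.5 defaulted GOMAXPROCS to 1, while Go 1.5 and later default it to the number of logical CPUs, so the result depends on which Go version built the binary (not stated in this thread). The value can also be set through the GOMAXPROCS environment variable when starting the process.

```go
package main

import (
	"fmt"
	"runtime"
)

// effectiveProcs (hypothetical helper) reports how many OS threads may
// execute Go code simultaneously and how many logical CPUs the host has.
func effectiveProcs() (procs, cpus int) {
	// Passing 0 queries the current GOMAXPROCS value without changing it.
	return runtime.GOMAXPROCS(0), runtime.NumCPU()
}

func main() {
	procs, cpus := effectiveProcs()
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", procs, cpus)
}
```

If the two numbers differ on a multi-core host, the Go scheduler is running Go code on fewer cores than the machine offers.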

Ashish Goel

Oct 17, 2015, 8:39:46 PM
to Matt Reiferson, nsq-users
Thanks, Matt, for the quick succession of responses. I appreciate the support.

We haven't touched the GOMAXPROCS param. A quick Google search shows that it's a Go runtime variable that controls concurrency. For the problem we are discussing, is it worth changing this param?

For this particular environment, the system is processing 10K messages per minute spread over a fleet of 4 hosts, so each host is processing about 41.67 messages/sec.
Number of topics - 3
Number of consumer nodes - 6

All Consumers are consuming on all topics.

Disk IO starvation is unlikely. The system is not disk IO bound. I looked at the disk IO metrics; the graphs only indicate that disk usage dropped during that time.




--
Thanks,
Ashish

Ashish Goel

Oct 17, 2015, 10:38:23 PM
to Matt Reiferson, nsq-users
One more thing, Matt: I can see a pattern across the past two similar incidents. I am no expert at reading kernel logs, but they show CPU stalls during that time.
--
Thanks,
Ashish

Matt Reiferson

Oct 18, 2015, 2:09:41 PM
to Ashish Goel, nsq-users
That message volume isn't high at all so it would seem unlikely that nsqd would be maxing a single core (because of GOMAXPROCS defaulting to 1).

It seems much more likely that some process (periodic or long running) co-tenant with nsqd on this host might be causing occasional issues.

Ashish Goel

Oct 23, 2015, 2:13:20 AM
to Matt Reiferson, nsq-users
I can't say for sure that a process caused it; it may be a hardware issue. I was able to reproduce the same issue, with the same sequence of error logs, by manually stopping (not killing) the process and resuming it after a minute.

Now that you've mentioned nsqd's core usage: is there any benchmark showing how a single-threaded nsqd performs as the number of parallel producer connections increases? Right now we have it configured to 32.

The TrendrrJavaClient timeout is too high at 15 seconds. I need to reduce it to a few milliseconds for our case.
--
Thanks,
Ashish

Lior Messinger

Feb 16, 2023, 11:31:08 AM
to nsq-users
Hi guys

I know a few years have passed... I wondered whether you managed to solve the issue, and how.

Thank you!
Lior