CPU goes to 90%, nsqd stops responding

Lior Messinger

Feb 16, 2023, 5:49:37 AM
to nsq-...@googlegroups.com
Hi everyone,

At times we hit the issue in the subject line: CPU goes to 90% and nsqd stops responding. The machine is a Linux e2-standard-2 VM on GCP, with 2 vCPUs and 8 GB of memory.

There doesn't seem to be a memory leak, but I'm not sure.

The messages are not very small; I'd guess 500 B to 1 KB each.

Has anyone experienced such issues, and if so, what was the cause?
How can I test, monitor, or poke at it?
Is there any way to add code to see why and where the CPU shoots up?

Thanks for any direction!
Lior

Lior Messinger

Feb 16, 2023, 5:59:51 AM
to nsq-...@googlegroups.com
To add a few more details, in case they spark some ideas:
  • we call nsqd from Python and Go
  • we use it to process operations that are sometimes lengthy, 1-2 minutes per message
  • each consumer usually works on 1-4 messages in parallel
  • sometimes the queues have a few hundred messages waiting
  • we are not using Docker; nsqd runs directly on the OS

thanks!
Lior

Pierce Lopez

Feb 16, 2023, 9:36:01 PM
to Lior Messinger, nsq-...@googlegroups.com
Hi Lior,

What version of nsqd are you running? The latest, v1.2.1?

Are you using ephemeral channels or topics? There have been some potential issues caused by repeatedly connecting to and disconnecting from ephemeral topics or channels (do you do that?), with some internal busy-looping while nsqd tries to clean up or recreate the "flapping" ephemeral topic/channel.

Even if memory exhaustion isn't the problem, nsqd offers a lot of memory stats that might show GC issues. If you have statsd+graphite (or dogstatsd ...) set up, you can use nsqd's --statsd-address flag to send topic/channel stats and memory/GC stats through those systems. (You can also configure nsqadmin to fetch nice relevant graphs from graphite.) Or you can just take a few snapshots of the current stats from an nsqd by fetching its `/stats` endpoint (on the HTTP port, using curl or similar; there's also a format=json query param).
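As a rough illustration of the snapshot approach, here is a minimal Go sketch that fetches `/stats?format=json` and prints one line per channel. The address 127.0.0.1:4151 assumes nsqd's default HTTP port on the local machine, and the JSON field names are assumptions based on the nsqd v1.x stats output; check them against what your nsqd actually returns.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// Only the fields this sketch reads. The field names are assumed from
// nsqd's JSON stats output and should be verified against your version.
type channelStats struct {
	Name     string            `json:"channel_name"`
	Depth    int64             `json:"depth"`
	InFlight int               `json:"in_flight_count"`
	Clients  []json.RawMessage `json:"clients"`
}

type topicStats struct {
	Name     string         `json:"topic_name"`
	Channels []channelStats `json:"channels"`
}

type nsqdStats struct {
	Topics []topicStats `json:"topics"`
}

// summarize turns a /stats?format=json payload into one line per channel.
func summarize(payload []byte) ([]string, error) {
	var s nsqdStats
	if err := json.Unmarshal(payload, &s); err != nil {
		return nil, err
	}
	var lines []string
	for _, t := range s.Topics {
		for _, c := range t.Channels {
			lines = append(lines, fmt.Sprintf("%s/%s depth=%d in_flight=%d clients=%d",
				t.Name, c.Name, c.Depth, c.InFlight, len(c.Clients)))
		}
	}
	return lines, nil
}

func main() {
	// Assumed address: nsqd's default HTTP port on the local machine.
	resp, err := http.Get("http://127.0.0.1:4151/stats?format=json")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	defer resp.Body.Close()

	payload, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	lines, err := summarize(payload)
	if err != nil {
		fmt.Fprintln(os.Stderr, "decode failed:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Running this every minute or so and comparing the output around a CPU spike would show whether depth, in-flight counts, or client counts change at the same time.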

Finally, nsqd does have pprof profiling endpoints, see https://nsq.io/components/nsqd.html#get-debugpprof
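For example, a CPU profile could be captured while the spike is happening with something like the sketch below. It assumes nsqd's HTTP port is reachable at 127.0.0.1:4151 (the default) and that the standard Go `/debug/pprof/profile` path with a `seconds` parameter is served there; the output filename is arbitrary.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// profileURL builds the URL of the standard Go pprof CPU profile endpoint.
func profileURL(httpAddr string, seconds int) string {
	u := url.URL{
		Scheme:   "http",
		Host:     httpAddr,
		Path:     "/debug/pprof/profile",
		RawQuery: url.Values{"seconds": {fmt.Sprint(seconds)}}.Encode(),
	}
	return u.String()
}

func main() {
	// Assumed address: nsqd's default HTTP port on the local machine.
	// The request blocks for roughly the requested number of seconds.
	resp, err := http.Get(profileURL("127.0.0.1:4151", 30))
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "unexpected status:", resp.Status)
		return
	}

	// Arbitrary output filename for the captured profile.
	out, err := os.Create("nsqd-cpu.prof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "create failed:", err)
		return
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
}
```

The resulting file can then be opened with `go tool pprof nsqd-cpu.prof`, where the `top` command lists the functions using the most CPU.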

Good luck,
- Pierce




Lior Messinger

Feb 20, 2023, 6:14:22 AM
to Pierce Lopez, nsq-...@googlegroups.com
Pierce, thank you for your response. The topic and channel are not ephemeral.

We now think the issue is related to producers that were never stopped.
In other words, code like:
    config := nsq.NewConfig()
    producer, err := nsq.NewProducer(EnvNSQDUrl(), config)
    producer.Publish(models.User_Created, data)

didn't have the

    defer producer.Stop()

so the number of TCP connections to nsqd kept growing. Eventually this led to high CPU (the link between the two is not straightforward; maybe it exposed some other bug).
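One way to avoid leaking producers is to create a single producer, reuse it for every publish, and stop it exactly once. The sketch below shows that shape. So that it runs without go-nsq or an nsqd, it uses a small `publisher` interface listing only the `Publish` and `Stop` methods of `*nsq.Producer`, plus a hypothetical `fakeProducer`; the topic name is also made up. In real code, a producer from `nsq.NewProducer` would be passed in where the fake is.

```go
package main

import "fmt"

// publisher lists the two *nsq.Producer methods this sketch relies on.
type publisher interface {
	Publish(topic string, body []byte) error
	Stop()
}

// publishAll sends every message through one shared producer.
func publishAll(p publisher, topic string, bodies [][]byte) error {
	for _, b := range bodies {
		if err := p.Publish(topic, b); err != nil {
			return fmt.Errorf("publish to %s: %w", topic, err)
		}
	}
	return nil
}

// fakeProducer is a hypothetical stand-in so the sketch runs without nsqd.
type fakeProducer struct {
	published int
	stopped   bool
}

func (f *fakeProducer) Publish(topic string, body []byte) error {
	f.published++
	return nil
}

func (f *fakeProducer) Stop() { f.stopped = true }

// run owns the producer's lifetime: Stop is deferred so the connection is
// released even if a publish fails.
func run(p publisher) error {
	defer p.Stop()
	// "user_created" is a made-up topic name for illustration.
	return publishAll(p, "user_created", [][]byte{[]byte("a"), []byte("b")})
}

func main() {
	// In real code this would be something like:
	//   p, err := nsq.NewProducer(addr, nsq.NewConfig())
	p := &fakeProducer{}
	if err := run(p); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("published:", p.published, "stopped:", p.stopped)
}
```

In a long-running service, the producer would typically live for the lifetime of the process and be stopped during shutdown, rather than being created per request.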

In any case, thanks!
I'll keep the group posted.
Best,
Lior