NSQ Message Persistence

Shahked Bleicher

Nov 14, 2020, 7:49:15 PM

to nsq-users

Hello!

I'm looking into different tools for creating a worker queue for processing background tasks. NSQ looks like a great option; the only thing I'm unsure about is message loss if nsqd crashes. The following are some of the strategies I've seen discussed to deal with this. I apologize in advance for the wall of text.

1) Use nsq_to_file on the task topic to create a log of published messages/tasks
- Leaves a very small window between publishing and persisting a message for loss to occur
2) Use a redundant nsqd broker that receives the same messages as another, so if one crashes the other can continue processing tasks
- Need to dedup and/or make tasks idempotent, though should probably do that anyway
- Maybe set --mem-queue-size to 0 for the second nsqd broker
3) Use an intermediary/secondary storage like Cassandra to persist messages separately. The publisher would add the message and payload here at the same time as publishing, ideally as one transaction

Of those, 3) seems the most foolproof, but requires another service and logic. I'm on a project with only two devs, including me, so I'm hoping to keep everything as simple as possible.

Basically, what I'm picturing is below as a take on option 2):

Primary NSQ cluster
- nsqd (1 or more) that would be configured normally, with in-memory messages
- nsqlookupd (1 or more) registering primary nsqd brokers only
Secondary, backup cluster
- nsqd (1 or more) with --mem-queue-size set to 0 so all messages are persisted
- nsqlookupd (1 or more) registering secondary nsqd brokers only

This way one cluster has optimal performance, while the secondary may be slower, but ensures a message isn't lost. Publishers would publish to both clusters at once. Does this seem like a reasonable production setup? Having never set up or worked on any message queue infrastructure before, I'd appreciate any insights.

Thanks

Patrick Morris

Nov 14, 2020, 8:13:23 PM

to Shahked Bleicher, nsq-users

Your option 3 actually sounds like a good idea. Have a registration service.

If a machine fails, repopulate the messages.

The only issue is that there is no easy way to get a very large number of records out of Cassandra after a failure.

Try to keep the count low and you are fine.

Cassandra partition design is important.

--
You received this message because you are subscribed to the Google Groups "nsq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nsq-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nsq-users/b7b122c6-a3aa-4f3c-9374-f07c9679b45an%40googlegroups.com.

Shahked Bleicher

Nov 15, 2020, 2:17:40 PM

to nsq-users

Hi, thanks for the input.

I think we're going with option 3 like you suggested, but for now we'll stick with a table in Postgres with jsonb fields, since we're already using it. If we reach a scale where this becomes a bottleneck, we'll revisit using something like Cassandra instead.
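
For illustration, the option 3 approach might be sketched roughly as below. This is not the poster's actual code: the `tasks` table, its columns, and the `enqueue`/`mark_done` helpers are hypothetical names, and `sqlite3` stands in for Postgres only so the sketch runs on its own (the `payload` column would be `jsonb` there). The row is committed before publishing, so a failed publish leaves a `pending` row behind instead of losing the task.

```python
import json
import sqlite3
import uuid
from typing import Callable

# Hypothetical schema; in Postgres, payload would be a jsonb column.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id      TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending'
)
"""

def enqueue(conn: sqlite3.Connection, payload: dict,
            publish: Callable[[bytes], None]) -> str:
    """Persist the task first, then hand it to the queue.

    `publish` is any callable that sends bytes to NSQ (an assumption here).
    If it raises, the row stays 'pending' and can be republished later.
    """
    task_id = str(uuid.uuid4())
    body = json.dumps({**payload, "task_id": task_id})
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO tasks (id, payload) VALUES (?, ?)",
                     (task_id, body))
    publish(body.encode())
    return task_id

def mark_done(conn: sqlite3.Connection, task_id: str) -> bool:
    """Flip a task from 'pending' to 'done'; returns False if already done."""
    with conn:
        cur = conn.execute(
            "UPDATE tasks SET status = 'done' WHERE id = ? AND status = 'pending'",
            (task_id,))
    return cur.rowcount == 1
```

Because `mark_done` only succeeds once per task, a worker that receives the same task twice can use its return value to skip the duplicate.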

Pierce Lopez

Nov 15, 2020, 2:43:47 PM

to Shahked Bleicher, nsq-users

For what it's worth, I recommend option 2, redundant nsqd. I guess you'd want to make a helper function for the service producing messages, to publish each message to multiple NSQD.
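
A minimal sketch of such a helper, assuming the producer publishes over nsqd's HTTP interface (`POST /pub?topic=...`, port 4151 by default). The function names, the `min_acks` parameter, and the injectable `post` callable are illustrative choices, not part of NSQ or any client library.

```python
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List

def http_post(url: str, body: bytes, timeout: float = 2.0) -> int:
    """POST body to url and return the HTTP status code."""
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status

def publish_to_all(nsqd_http_addrs: Iterable[str], topic: str, body: bytes,
                   min_acks: int = 1,
                   post: Callable[[str, bytes], int] = http_post) -> List[str]:
    """Publish body to every nsqd and return the addresses that accepted it.

    Raises RuntimeError if fewer than min_acks nsqd accepted the message.
    """
    acked = []
    query = urllib.parse.urlencode({"topic": topic})
    for addr in nsqd_http_addrs:
        url = f"http://{addr}/pub?{query}"
        try:
            if post(url, body) == 200:
                acked.append(addr)
        except OSError:
            # Unreachable nsqd or HTTP error: skip it, the others may succeed.
            continue
    if len(acked) < min_acks:
        raise RuntimeError(f"only {len(acked)} nsqd accepted the message")
    return acked
```

A call might look like `publish_to_all(["10.0.0.1:4151", "10.0.0.2:4151"], "tasks", body)`, where the addresses are placeholders for nsqd instances on different hardware.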

I don't recommend --mem-queue-size=0 because it has a significant performance impact, doesn't work with ephemeral channels, and doesn't protect against the primary risk: the physical hardware or operating system crashing. nsqd crashing without the OS/hardware crashing is very rare. Publishing to two or three different nsqd on different hardware (usually different virtual machines in different "availability zones" or different "racks") should prevent losing messages when a single server somewhere crashes.

If processing of these messages is idempotent, then you could have the same nsqlookupd and consumers for all the redundant nsqd nodes. That would be pretty simple. It would result in redundant work processing the copies of the messages, hurting performance ... but other schemes where you only switch to the "backup cluster" when there's a problem with the "primary cluster" can be pretty complicated. When do you switch, how do you switch, and how do you throw out the many days-old messages on the backup cluster that were successfully processed on the primary cluster? (There are a couple ways I can think of, but they can get tricky ...)
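
One way to cut down the redundant work is to have consumers skip copies they have already handled. The sketch below assumes each message body is JSON carrying a publisher-assigned `task_id` (a hypothetical field; the copies need an application-level ID because each nsqd assigns its own message IDs). The in-memory set only deduplicates within one process, so several consumers would need a shared store instead.

```python
import json
from typing import Callable, Optional, Set

class DedupingHandler:
    """Wraps a task processor and skips task IDs that were already handled."""

    def __init__(self, process: Callable[[dict], None],
                 seen: Optional[Set[str]] = None) -> None:
        self.process = process
        self.seen = seen if seen is not None else set()

    def handle(self, raw: bytes) -> bool:
        """Process one message body; return False if it was a duplicate."""
        task = json.loads(raw)
        task_id = task["task_id"]
        if task_id in self.seen:
            return False
        self.process(task)
        # Record the ID only after success, so a failed task can be retried.
        self.seen.add(task_id)
        return True
```

The handler would be called from whatever consumer callback the NSQ client library provides, finishing the message in both the processed and the duplicate case.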

If your strategy using a conventional database as backup for the full message history works well for your case, then by all means, do what works. NSQD was always a bit more of a toolkit than a solution. IMHO.

Patrick Morris

Nov 15, 2020, 2:55:37 PM

to Pierce Lopez, Shahked Bleicher, nsq-users

The problem with multiple nsqd is that you either process each message multiple times or need complicated coordination to avoid doing so.

With the DB solution you can process each message a single time, or at worst have a very small window of potential multi-processing. It is also very simple, with virtually no coordination. Common publish and common subscribe logic will easily handle it.

On node/service/machine start, you could check for messages and add them to NSQ. You would have to fine-tune the logic for your use case.
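
A rough sketch of that startup check, assuming a hypothetical `tasks` table with `id`, `payload`, and `status` columns where unprocessed rows are marked `pending`. `sqlite3` is used only to keep the example self-contained, and `publish` stands for whatever function sends bytes to NSQ.

```python
import sqlite3
from typing import Callable

def republish_pending(conn: sqlite3.Connection,
                      publish: Callable[[bytes], None],
                      limit: int = 1000) -> int:
    """Re-add tasks that were stored but never marked done.

    Returns the number of tasks handed to `publish`. The limit keeps one
    recovery pass small; call again until it returns 0 to drain the backlog
    once workers have marked the republished tasks as done.
    """
    rows = conn.execute(
        # rowid ordering is SQLite-specific; a timestamp column would be
        # the natural choice in another database.
        "SELECT payload FROM tasks WHERE status = 'pending' "
        "ORDER BY rowid LIMIT ?",
        (limit,),
    ).fetchall()
    for (payload,) in rows:
        publish(payload.encode())
    return len(rows)
```

Tasks that were published but not yet finished when the crash happened will be published again, so the worker still needs to tolerate the occasional duplicate.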
