Reducing duplicate messages using bloom filters with message queues


Kelly Sommers

Jul 27, 2012, 4:38:08 PM
to distsys...@googlegroups.com
So I had this crazy idea today while debating on Twitter about message delivery and the trade-offs batching introduces. I was thinking about ways that duplication could be avoided.

When at-least-once delivery is the guarantee offered, is there value in reducing the amount of duplication, or, since you have to accept duplicates anyway, is there no purpose in reducing it?

In an extremely high-throughput system, rather than retaining all the message ids that should be treated as duplicates (which you might hold in memory for only a certain period, offering de-duplication within a small window), I wonder if a bloom filter would be more appropriate.

A bloom filter can cause some false positives, but you're already accepting this in an environment with at-least-once delivery, and it could reduce the number of duplicates processed. The other upside is that I'm guessing a bloom filter would be much smaller to hold in memory than a list of message ids in a system running at a high rate. You would also be able to offer de-duplication over a larger window of time.
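A minimal sketch of what such a bloom-filter deduper might look like (class and parameter names are hypothetical, and the sizes are purely illustrative):

```python
import hashlib

class BloomDeduper:
    """Tiny Bloom filter keyed on message ids (illustrative sketch)."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, msg_id):
        # derive num_hashes bit positions from salted hashes of the id
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{msg_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, msg_id):
        for pos in self._positions(msg_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_seen(self, msg_id):
        # True if every bit for this id is set; may be a false positive
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(msg_id))

dedup = BloomDeduper()
dedup.add("msg-42")
assert dedup.probably_seen("msg-42")  # a retained duplicate is always caught
```

With roughly 8 KB of bits this covers far more ids than an explicit list would; the trade-off is that `probably_seen` can occasionally answer True for an id that was never added.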

I can't recall hearing of bloom filters being used in message queues, so I'm guessing this is a bad idea for some reason, and I thought I would bring it up for discussion.

Thoughts?

Michael Rose

Jul 27, 2012, 5:03:35 PM
to distsys...@googlegroups.com
I've heard before about using the "opposite of a bloom filter":


I haven't explored the concept; we haven't hit any issues that would need it.
-- 
Michael Rose (@Xorlev)
Senior Backend Engineer, FullContact
mic...@fullcontact.com

--
You received this message because you are subscribed to the Google Groups "Distributed Systems" group.
To post to this group, send email to distsys...@googlegroups.com.
To unsubscribe from this group, send email to distsys-discu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msg/distsys-discuss/-/1jKc4WlCiaAJ.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Justin Sheehy

Jul 31, 2012, 10:28:36 AM
to distsys...@googlegroups.com
Kelly,

It's an interesting idea, but (as Michael points out by referring to Jeff's blog post) a Bloom filter is the wrong way around for this. A "false positive" in terms of a Bloom filter would mean that you'd filter out some messages even if they haven't been seen before. This would violate your at-least-once guarantee.

I think that for at least some versions of this problem the simplest answer is probably to use a fixed-size identifier cache with time-based expiry. This is likely to work particularly well for cases like the one you describe since I would expect most duplicates are likely to be sent soon after each other.
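A sketch of such a cache, assuming an insertion-ordered map so the oldest ids sit at the front (all names and sizes here are illustrative):

```python
import time
from collections import OrderedDict

class IdCache:
    """Fixed-size message-id cache with time-based expiry (a sketch)."""
    def __init__(self, max_size=100_000, ttl_seconds=60.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.entries = OrderedDict()  # msg_id -> arrival time, oldest first

    def seen_before(self, msg_id, now=None):
        """Return True for a duplicate inside the window; otherwise record the id."""
        now = time.monotonic() if now is None else now
        # evict expired entries, and any overflow, from the old end
        while self.entries:
            oldest_id, arrived = next(iter(self.entries.items()))
            if now - arrived > self.ttl or len(self.entries) >= self.max_size:
                self.entries.pop(oldest_id)
            else:
                break
        if msg_id in self.entries:
            return True
        self.entries[msg_id] = now
        return False

cache = IdCache(ttl_seconds=5.0)
assert cache.seen_before("m1", now=0.0) is False   # first delivery
assert cache.seen_before("m1", now=1.0) is True    # duplicate within window
assert cache.seen_before("m1", now=10.0) is False  # expired: treated as new
```

The usage lines show the limitation Kelly raised: a duplicate that arrives after the TTL (or after the id has been evicted for space) is processed again, which at-least-once semantics already force you to tolerate.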

In higher-volume or less predictable scenarios the "opposite of a bloom filter" as shown by Jeff may have better behavior. Measure!
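As I understand the idea, the "opposite" structure can be sketched as a fixed array of slots holding full ids, where only an exact match counts as seen. Collisions overwrite older ids, so it gives false negatives (a duplicate slips through) but never false positives (names and sizes below are illustrative):

```python
import hashlib

class OppositeOfBloomFilter:
    """Fixed array of full ids: exact matches only, so no false positives;
    hash collisions evict older ids, which yields false negatives instead."""
    def __init__(self, num_slots=1 << 16):
        self.slots = [None] * num_slots

    def _index(self, msg_id):
        digest = hashlib.sha256(msg_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % len(self.slots)

    def seen_before(self, msg_id):
        i = self._index(msg_id)
        seen = self.slots[i] == msg_id
        self.slots[i] = msg_id  # record it, possibly evicting another id
        return seen

f = OppositeOfBloomFilter()
assert f.seen_before("msg-1") is False  # never claims "seen" for a new id
assert f.seen_before("msg-1") is True   # a retained duplicate is caught
```

Because a false negative only means an extra duplicate gets processed, this failure mode is exactly the one at-least-once delivery already permits.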

-Justin


Kelly Sommers

Aug 1, 2012, 9:17:24 PM
to distsys...@googlegroups.com
Thanks, Justin and Michael, for the clarification!

I would guess that the most common approach to de-duplication in messaging systems is the fixed-size identifier cache with time-based expiration. I think I mentioned that in my original post. The problem with this is that it can be quite short-sighted, and increasing the cache size can consume a lot of memory.

The nice thing about a bloom filter is that you can offer guarantees for a longer period of time. You're getting a probabilistic result with lower memory consumption (as far as I understand it). Although, as you both pointed out, it's the opposite of what's good for this problem: false negatives make more sense than false positives.

It would be cool if you could combine the fixed-size identifier cache for recently processed messages to get perfect de-duplication in a small window of time but then use a probabilistic approach for messages that are older than that cache. 

Is there no data structure that can give probabilistic results with false negatives instead?

I'll need to give the link Michael provided a few more reads to grok it properly :)

Please correct me if I've said anything incorrect!

Thanks,
Kelly


Carter Schonwald

Aug 4, 2012, 4:02:18 PM
to distsys...@googlegroups.com
On a slightly related note, Edward Kmett recently pointed me to a blog post of his about using Bloom filters to accelerate database joins (which is relevant / spiritually related). The idea there is that bloom filters are used to reduce the number of elements you need to filter through, in which case false positives are OK; you just make them rare.


This isn't the same as your deduplication idea, but it's nifty and related. Enjoy!
-Carter
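The bloom-join idea Carter describes can be sketched roughly like this (the filter is deliberately tiny and all names and sizes are illustrative):

```python
import hashlib

def bloom_bits(key, num_bits=1 << 16, num_hashes=3):
    """Yield num_hashes bit positions for a join key."""
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % num_bits

def bloom_join(left_rows, right_rows):
    """Join (key, value) rows on key, using a Bloom filter over the left
    keys to cheaply discard most right rows that cannot match. A false
    positive just means a wasted exact lookup, never a wrong result."""
    num_bits = 1 << 16
    bits = bytearray(num_bits // 8)
    left_by_key = {}
    for key, value in left_rows:
        left_by_key.setdefault(key, []).append(value)
        for pos in bloom_bits(key, num_bits):
            bits[pos // 8] |= 1 << (pos % 8)
    for key, value in right_rows:
        # cheap probabilistic pre-check before the real (exact) lookup
        if all(bits[pos // 8] & (1 << (pos % 8))
               for pos in bloom_bits(key, num_bits)):
            for left_value in left_by_key.get(key, ()):
                yield key, left_value, value

pairs = list(bloom_join([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")]))
assert pairs == [("a", 1, "x")]
```

In a real system the exact lookup would be a disk or network fetch rather than an in-memory dict, which is where skipping non-matching rows actually pays off.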