I would guess that the most common approach to de-duplication in messaging systems is a fixed-size identifier cache with time-based expiration. I think I mentioned that in my original post. The problem is that it's quite short-sighted: any duplicate older than the cache window slips through, and extending the window by increasing the cache size can consume a lot of memory.
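Roughly what I mean by that, as a toy sketch (the size and TTL numbers are just illustrative):

```python
import time
from collections import OrderedDict

class RecentIdCache:
    """Fixed-size cache of message IDs with time-based expiration (a sketch)."""

    def __init__(self, max_size=100_000, ttl_seconds=300):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._seen = OrderedDict()  # message_id -> insertion time, oldest first

    def check_and_add(self, message_id):
        """Return True if the ID is a duplicate within the window, else record it."""
        now = time.monotonic()
        # Expire entries that have aged out, oldest first.
        while self._seen and now - next(iter(self._seen.values())) > self.ttl:
            self._seen.popitem(last=False)
        if message_id in self._seen:
            return True
        self._seen[message_id] = now
        # Enforce the size bound by evicting the oldest entry.
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)
        return False
```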
The nice thing about a bloom filter is that you can cover a much longer period of time: you get a probabilistic answer with far lower memory consumption (as far as I understand it). Although, as you both pointed out, its error is in the wrong direction for this problem. A bloom filter produces false positives, so it can claim a message was already processed when it never was, and that message gets dropped. For de-duplication, false negatives (occasionally re-processing a duplicate) make more sense than false positives.
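To make that failure mode concrete, here's a hand-rolled toy bloom filter used for de-duplication (parameters are arbitrary); the comment marks the spot where a false positive would drop a message we never actually processed:

```python
import hashlib

class TinyBloomFilter:
    """A toy bloom filter, just to illustrate the false-positive trade-off."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, message_id):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{message_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, message_id):
        for pos in self._positions(message_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, message_id):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(message_id))

def should_drop(bloom, message_id):
    # A false positive here means a message we never processed gets dropped --
    # exactly the wrong direction for de-duplication.
    if bloom.might_contain(message_id):
        return True
    bloom.add(message_id)
    return False
```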
It would be cool if you could combine the two: a fixed-size identifier cache for recently processed messages to get exact de-duplication within a small window of time, and then a probabilistic approach for messages older than that cache.
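Roughly what I'm picturing, just gluing the two toy sketches above together (not a real implementation, and the older tier still carries the false-positive caveat):

```python
class LayeredDeduplicator:
    """Exact de-duplication inside a recent window, probabilistic beyond it (sketch)."""

    def __init__(self, recent_cache, bloom_filter):
        self.recent = recent_cache   # e.g. the RecentIdCache sketch above
        self.older = bloom_filter    # e.g. the TinyBloomFilter sketch above

    def is_duplicate(self, message_id):
        # Exact answer while the ID is still inside the recent window.
        if self.recent.check_and_add(message_id):
            return True
        # Beyond the window, fall back to the probabilistic check, which can
        # still return false positives for messages it has never seen.
        if self.older.might_contain(message_id):
            return True
        self.older.add(message_id)
        return False
```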
Is there no data structure that can give probabilistic results with false negatives instead?
I'll need to give the link Michael provided a few more reads to grok it properly :)
Please correct me if I've gotten anything wrong!
Thanks,
Kelly