Thanks for your reply, Oscar. The use case is deduplicating messages from Kafka in a Spark Streaming context. Taking a number of bytes as a parameter is not a requirement, but being able to fit in memory indefinitely is. Obviously space (and therefore time) is a limitation for this deduplication process, but if it were true that the Algebird Bloom filter's size were always proportional to the dataset, then it would grow without bound on a streaming dataset. It's good to know that isn't the case.
So now the question becomes: do we have any idea what the upper bounds on size are, and what the false-positive probabilities are, for various parameter settings?
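For what it's worth, the standard Bloom filter sizing formulas answer this in general terms (this is a sketch of the textbook math, not the Algebird API; `bloom_parameters` is a hypothetical helper name): for an expected number of entries n and a target false-positive probability p, the optimal bit width is m = -n ln p / (ln 2)^2 and the optimal number of hashes is k = (m/n) ln 2. Once m and k are fixed, the memory footprint is bounded; inserting more than n entries doesn't grow the filter, it just degrades the actual false-positive rate.

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Hypothetical helper (not the Algebird API): given an expected
    entry count n and target false-positive probability p, return the
    optimal bit-array width m and number of hash functions k using the
    standard Bloom filter formulas. The width is fixed up front, so the
    filter's memory use is bounded regardless of stream length."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits
    k = round((m / n) * math.log(2))                      # hash functions
    return m, k

# Example: 1M expected keys at a 1% false-positive rate.
m, k = bloom_parameters(1_000_000, 0.01)
print(m, k)  # roughly 9.6M bits (~1.1 MiB) and 7 hash functions
```

So for a concrete deduplication window you'd pick n as the number of keys you expect per window and p as the rate of duplicates you can tolerate letting through, and the table of (n, p) → m pairs gives you the memory upper bound directly.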