Huge BloomFilter with capacity for more than 10 billion elements

Alexey Ponkin

Oct 31, 2016, 11:08:54 AM
to algebird
Hi,
Is there any way to use twitter-algebird/BloomFilter with sets of more than 10 billion elements?
If not, could you point me to another Scala Bloom filter implementation, or to a way to extend twitter-algebird/BloomFilter?

P. Oscar Boykin

Oct 31, 2016, 1:39:14 PM
to Alexey Ponkin, algebird
I have not used it with that many, so I can't say, and don't have another recommendation.

The bloomfilter in algebird was a contribution, and has not been very extensively optimized. I can't really speak to how well it performs.

If I were making one that large, I would probably build something custom. Do you need it all in RAM at one time? Maybe an on-disk implementation would do well? What about a distributed implementation where you back the Bloom filter with Redis or something similar (each key can hold one chunk of the Bloom filter's bit array)?
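
To illustrate the chunked layout suggested above, here is a minimal sketch in Python. It is not code from algebird or from any Redis client: the class name `ChunkedBitArray`, the key prefix `bf:chunk:`, and the chunk size are all made up for illustration, and a plain dict stands in for the remote key-value store.

```python
class ChunkedBitArray:
    """Bit array split into fixed-size chunks, one chunk per key.

    Hypothetical sketch: `store` is a plain dict standing in for a
    remote key-value store such as Redis.
    """

    def __init__(self, num_bits, chunk_bits=65536, store=None, prefix="bf:chunk:"):
        if num_bits <= 0 or chunk_bits <= 0 or chunk_bits % 8 != 0:
            raise ValueError("num_bits must be positive, chunk_bits a positive multiple of 8")
        self.num_bits = num_bits
        self.chunk_bits = chunk_bits
        self.store = {} if store is None else store
        self.prefix = prefix

    def _locate(self, index):
        # Map a global bit index to (key of its chunk, bit offset inside the chunk).
        if not 0 <= index < self.num_bits:
            raise IndexError(index)
        chunk_id, offset = divmod(index, self.chunk_bits)
        return f"{self.prefix}{chunk_id}", offset

    def set_bit(self, index):
        key, offset = self._locate(index)
        chunk = self.store.get(key)
        if chunk is None:
            # Chunks are created lazily, so untouched regions cost nothing.
            chunk = bytearray(self.chunk_bits // 8)
            self.store[key] = chunk
        chunk[offset // 8] |= 1 << (offset % 8)

    def get_bit(self, index):
        key, offset = self._locate(index)
        chunk = self.store.get(key)
        if chunk is None:
            return False  # a missing chunk means every bit in it is still 0
        return bool(chunk[offset // 8] & (1 << (offset % 8)))


# Usage: a 10**11-bit array in which only the touched chunk is materialized.
bits = ChunkedBitArray(num_bits=10**11)
bits.set_bit(95_000_000_000)
```

With a real remote store, each `set_bit` or `get_bit` would become a network round trip, so batching the positions for one element into a single request is probably worth considering.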

I'd be happy to see what you wind up doing.

Alexey Ponkin

Nov 2, 2016, 6:41:23 AM
to algebird, alexey...@gmail.com
So I did some research.
There are a lot of interesting Bloom filter implementations (hadoop, breeze, and the interesting bloom-filter-scala), and there is also an interesting blog post comparing different implementations by size and speed (including the algebird version).
Also, since version 2.0 Apache Spark has its own Bloom filter (similar to guava/bloomFilter).
But they all suffer from a huge memory footprint.
For example, a Spark BloomFilter for 150 million elements weighs more than 1 GB.
And now I am confused, since I have a Spark Streaming application that needs to check a condition in real time.
Oscar, can you please tell me more about using a distributed implementation?
Or maybe I need to choose a solution other than a Bloom filter?
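
For a rough idea of how the memory footprint scales, the textbook sizing formulas for a standard Bloom filter are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n elements and target false-positive rate p. The sketch below just evaluates them; the function names are made up for illustration, and real implementations may round differently or add overhead on top.

```python
import math


def optimal_num_bits(n, p):
    """Bits needed for n elements at false-positive rate p (textbook formula)."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_num_hashes(n, m):
    """Number of hash functions that minimizes false positives for m bits, n elements."""
    return max(1, round(m / n * math.log(2)))


def gib(bits):
    """Convert a bit count to GiB."""
    return bits / 8 / 2**30


# Print the estimated size for the element counts mentioned in this thread
# at a few assumed false-positive rates.
for n in (150_000_000, 10_000_000_000):
    for p in (0.03, 0.01, 0.001):
        m = optimal_num_bits(n, p)
        k = optimal_num_hashes(n, m)
        print(f"n={n:>14,} p={p:<6} bits/elem={m / n:5.2f} k={k:2d} size={gib(m):7.2f} GiB")
```

The size depends heavily on the target false-positive rate: at 1% the formula gives about 9.6 bits per element, so roughly 12 GB for 10 billion elements, before any implementation overhead.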

On Monday, October 31, 2016 at 20:39:14 UTC+3, P. Oscar Boykin wrote:

Alexey Ponkin

Nov 6, 2016, 11:09:14 AM
to algebird, alexey...@gmail.com
If somebody is interested in how the story ended: I wrote a small library for working with a remote Bloom filter (stored in Redis, thanks Oscar). It has only two simple methods: put an element into the filter and check membership. https://github.com/ponkin/bloomfilter-store
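
For readers unfamiliar with the two operations, here is a minimal Python sketch of put and membership check over a pluggable bit store. It is not the code of bloomfilter-store, whose API is not shown in this thread: the class and method names are hypothetical, an in-memory bytearray stands in for Redis, and the bit positions are derived with double hashing (h1 + i * h2) over a SHA-256 digest, which is just one common choice.

```python
import hashlib


class InMemoryBits:
    """Local stand-in for a remote bit store (hypothetical interface)."""

    def __init__(self, num_bits):
        self._bytes = bytearray((num_bits + 7) // 8)

    def set_bit(self, i):
        self._bytes[i // 8] |= 1 << (i % 8)

    def get_bit(self, i):
        return bool(self._bytes[i // 8] & (1 << (i % 8)))


class BloomFilter:
    """Bloom filter with two operations: put and might_contain."""

    def __init__(self, num_bits, num_hashes, bits=None):
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = InMemoryBits(num_bits) if bits is None else bits

    def _positions(self, item):
        # Double hashing: derive all k positions from two 64-bit values.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never 0
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def put(self, item):
        for pos in self._positions(item):
            self.bits.set_bit(pos)

    def might_contain(self, item):
        # False means definitely absent; True means present or a false positive.
        return all(self.bits.get_bit(pos) for pos in self._positions(item))


# Usage: items are passed as bytes.
bf = BloomFilter(num_bits=1 << 20, num_hashes=7)
bf.put(b"user:42")
seen = bf.might_contain(b"user:42")
```

Swapping `InMemoryBits` for an object that issues the same two operations against a remote store would keep the filter logic unchanged.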

On Wednesday, November 2, 2016 at 13:41:23 UTC+3, Alexey Ponkin wrote:

Mayur Kaloge

Aug 9, 2017, 3:57:13 AM
to algebird, alexey...@gmail.com
To use this Redis Bloom filter, on how many servers do you need to install Redis? My actual concern is whether I can put the Redis Bloom filter on one server and call it from my Spark job whenever needed.
Thank you.

Alexey Ponkin

Aug 10, 2017, 4:59:16 AM
to algebird, alexey...@gmail.com
That is exactly how we used it: one Redis server, called from the Spark cluster.

On Wednesday, August 9, 2017 at 10:57:13 UTC+3, Mayur Kaloge wrote:

Mayur Kaloge

Aug 17, 2017, 2:39:14 AM
to algebird, alexey...@gmail.com
Thank you, Alexey.

Darshan Manek

Sep 8, 2017, 5:48:34 AM
to algebird
Hi Alexey,
I am trying to use your code for my Bloom filter implementation, but I am facing the issues described below:

1. When I use the Scala 2.10 platform, I provide finagle-core2.10, finagle-redis2.10, and util-core2.10 as external libraries, but at runtime I get a NoSuchMethodException in the com.twitter.finagle.resolver class.
2. To avoid this I switched to the 2.11 versions of all of them, but now I get a NoClassDefFoundError for the class com.twitter.finagle.stats.StatsReceiver. When I extracted the 2.11 finagle jar I could not find this class, although it is present in the 2.10 version.

So I am stuck on this problem and don't know whether to use the 2.10 or the 2.11 version, or whether I am completely on the wrong path.
Your help will be greatly appreciated.

Thank you.