Hey Bloomers,
I'm really interested in the Bloom system and its focus on the set as a basic data type. By the way, if what i'm asking has an obvious answer in the Hadoop jungle, don't hesitate to answer my question with 'you're just asking for X implemented in Y so ask mailing-list Z'.
So here's the problem statement:
Suppose I have a large group of small sets. For this example lets say that each set can have up to 5 elements.
These sets are composed of strings that all relate to a certain topic, for example:
pets = {cat, dog, bird, rabbit, snake}
domestic = {cat, dog, goat, cow}
house = {chair, table, cat, bedroom}
furniture = {chair, table, bed, couch}
carry = {backpack, laptop, phone, dog}
get_stolen = {seat, backpack, girlfriend}
etc.
I hold a personal set of up-to-5 elements:
Abhay = {cat, dog, backpack, laptop, girlfriend}
My problem statement is that I want to create a list of sets ordered by ‘relevance’. Suppose that the space of sets is 10s of millions, and that I want to quickly/cheaply recompute the relevance when I add, remove, change an element in my set. Relevance here means not only sets with highest coverage (importance) but also sets that combine uncommon elements (uniqueness). In the example, the overlap is:
Abhay pets = {cat, dog}
Abhay domestic = {cat, dog}
Abhay house = {}
Abhay furniture = {}
Abhay carry = {backpack, laptop, dog}
Abhay get_stolen = {backpack, girlfriend}
The “abhay carry” overlap is relevant for its importance, but the “abhay get_stolen” is relevant for its uniqueness.
————————————————
My current attempt at this problem is to use three tiers:
1. Inverted index of elements to quickly retrieve list of sets that contain a single element
2. Merging the indices from above, using memoization along the way to speed up results
3. Precomputed set coverage for all 5-choose-4 subsets of my personal set (so if I remove or change an element the system doesn't have to do step 2)
--------------------------------------------------
The reason why Bloom interests me for this problem is that everything in my system is fundamentally a set or set-of-sets. I also have a lot of concurrency problems with using old data in a constantly shifting environment. Thankfully, I have lower constraints on how correct the results are. I'm curious whether anyone has advice or direction for me in this problem, using Bloom (or otherwise).