Distributed real-time set coverage system

41 views

Skip to first unread message

Abhay Agarwal

unread,

Apr 3, 2014, 12:08:22 AM4/3/14

to bloom...@googlegroups.com

Hey Bloomers,

I'm really interested in the Bloom system and its focus on the set as a basic data type. By the way, if what i'm asking has an obvious answer in the Hadoop jungle, don't hesitate to answer my question with 'you're just asking for X implemented in Y so ask mailing-list Z'.

So here's the problem statement:

Suppose I have a large group of small sets. For this example lets say that each set can have up to 5 elements.

These sets are composed of strings that all relate to a certain topic, for example:
pets = {cat, dog, bird, rabbit, snake}
domestic = {cat, dog, goat, cow}
house = {chair, table, cat, bedroom}
furniture = {chair, table, bed, couch}
carry = {backpack, laptop, phone, dog}
get_stolen = {seat, backpack, girlfriend}
etc.

I hold a personal set of up-to-5 elements:
Abhay = {cat, dog, backpack, laptop, girlfriend}

My problem statement is that I want to create a list of sets ordered by ‘relevance’. Suppose that the space of sets is 10s of millions, and that I want to quickly/cheaply recompute the relevance when I add, remove, change an element in my set. Relevance here means not only sets with highest coverage (importance) but also sets that combine uncommon elements (uniqueness). In the example, the overlap is:

Abhay pets = {cat, dog}
Abhay domestic = {cat, dog}
Abhay house = {}
Abhay furniture = {}
Abhay carry = {backpack, laptop, dog}
Abhay get_stolen = {backpack, girlfriend}

The “abhay carry” overlap is relevant for its importance, but the “abhay get_stolen” is relevant for its uniqueness.

————————————————

My current attempt at this problem is to use three tiers:
1. Inverted index of elements to quickly retrieve list of sets that contain a single element
2. Merging the indices from above, using memoization along the way to speed up results
3. Precomputed set coverage for all 5-choose-4 subsets of my personal set (so if I remove or change an element the system doesn't have to do step 2)

--------------------------------------------------

The reason why Bloom interests me for this problem is that everything in my system is fundamentally a set or set-of-sets. I also have a lot of concurrency problems with using old data in a constantly shifting environment. Thankfully, I have lower constraints on how correct the results are. I'm curious whether anyone has advice or direction for me in this problem, using Bloom (or otherwise).

Reply all

Reply to author

Forward

0 new messages