Subsets and Supersets vs. SINTER

180 views
Skip to first unread message

mb

unread,
Sep 29, 2010, 12:10:18 AM9/29/10
to Redis DB
My app matches jobs to workers according to shared skills. I naively
thought I could use SINTER to do the match, but since skills are
asymmetric between jobs and workers, it only works in one direction.

Simplified example - one job which requires one skill, one worker who
has two skills:

SADD JOB:j1:SKILLS skill1
SADD SKILL:skill1:JOBS j1

SADD WORKER:w1:SKILLS skill1
SADD SKILL:skill1:WORKERS w1
SADD WORKER:w1:SKILLS skill2
SADD SKILL:skill2:WORKERS w1

There could be a large number of jobs and a large number of workers.
For a given job I want to find all the eligible workers, and SINTER
does the trick nicely:

SINTER SKILL:skill1:WORKERS SKILL:skill2:WORKERS
--> 1. "w1"

This doesn't work in the other direction. Given a worker with
multiple skills, how to find the jobs she can do?

SINTER SKILL:skill1:JOBS SKILL:skill2:JOBS
--> (nil)

I can go both directions in Python with set.issubset() and
set.issuperset(), but only if I pull all the data into the client or
iterate over all the jobs.

Can this be done cleanly and quickly in Redis?

Josiah Carlson

unread,
Sep 29, 2010, 12:45:26 AM9/29/10
to redi...@googlegroups.com
Asking "Is A a subset of B" is a fancy way of saying A - B -> <empty
set>. Superset is merely the reverse, "Is A a superset of B" is the
same as B - A -> <empty set>. You can use SDIFF to calculate either.

That said, checking jobs is going to be expensive with your current
scheme, especially when you run into jobs with multiple skills, and
people with multiple skills. It can be done *much* faster with Redis.
That is, there is a simple way of calculating "get me all jobs that
user A can do" and "get me all users that can perform job Z". Think
it over for a while. If you need a hint or are pressed for time and
want the answer, ping this thread.

Regards,
- Josiah

> --
> You received this message because you are subscribed to the Google Groups "Redis DB" group.
> To post to this group, send email to redi...@googlegroups.com.
> To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.
>
>

mb

unread,
Oct 2, 2010, 5:12:19 AM10/2/10
to Redis DB
Josiah, many thanks for helping think this through. You're right,
the challenge is to get all jobs a worker can do each time the
given set of skills changes for the worker, given a large number
of jobs each with different sets of required skills.

If a job's set of required skills is a subset of the worker's skills,
it's a match.

I've come up with several inelegant and inefficient solutions, that
involve:

(a) pulling large amounts of data back to the client, or
(b) iterating over many jobs to SDIFF against the worker's skill set,
or
(c) storing hashes of skill combos, which explodes beyond ~10 skills

Certainly I'm missing something obvious, I could use a hint!

mb



On Sep 28, 9:45 pm, Josiah Carlson <josiah.carl...@gmail.com> wrote:
> Asking "Is A a subset of B" is a fancy way of saying A - B -> <empty
> set>.  Superset is merely the reverse, "Is A a superset of B" is the
> same as B - A -> <empty set>.  You can use SDIFF to calculate either.
>
> That said, checking jobs is going to be expensive with your current
> scheme, especially when you run into jobs with multiple skills, and
> people with multiple skills.  It can be done *much* faster with Redis.
>  That is, there is a simple way of calculating "get me all jobs that
> user A can do" and "get me all users that can perform job Z".   Think
> it over for a while.  If you need a hint or are pressed for time and
> want the answer, ping this thread.
>
> Regards,
>  - Josiah
>
>
>
>
>
>
>
> On Tue, Sep 28, 2010 at 9:10 PM, mb <doit...@gmail.com> wrote:
> > My app matches jobs to workers according to shared skills.  I naively
> > thought I could use SINTER to do the match, but since skills are
> > asymmetric between jobs and workers, it only works in one direction.
>
> > Simplified example - one job which requires one skill, oneworkerwho
> > has two skills:
>
> > SADD JOB:j1:SKILLS skill1
> > SADD SKILL:skill1:JOBS j1
>
> > SADDWORKER:w1:SKILLS skill1
> > SADD SKILL:skill1:WORKERS w1
> > SADDWORKER:w1:SKILLS skill2

Josiah Carlson

unread,
Oct 2, 2010, 12:41:46 PM10/2/10
to redi...@googlegroups.com
Hint 1: ZSETs are your friend.
Hint 2: Change the way you structure your data. (don't use sets of skills)

I'm going to put a bunch of blank lines, then I'm going to provide the
answer (I'm going to be out much of this weekend, so I won't be able
to help otherwise). If you want to figure it out, use the hint. If
you want the answer, scroll down :)


...

...

The answer is as follows:

Rather than storing for each job the set of skills necessary to
perform the job, instead store ZSETs for each skill. Each skill zset
includes all of the jobs that require that skill, with their scores
all being 1 (if you are using Redis master for the 2.2 release, you
can use plain sets for this, as we are going to be using zinterstore,
which can take sets as an input, which default the scores to 0). You
also have a single zset with all of the jobs and the number of skills
that it requires. To be more clear, for joba that requires skill1 and
skill2, and jobb that requires skill2, skill3, and skill4:

ZADD skill1 1 joba
ZADD skill2 1 joba
ZADD jobs 2 joba
ZADD skill2 1 jobb
ZADD skill3 1 jobb
ZADD skill4 1 jobb
ZADD jobs 3 jobb

Say that we have someone with 3 skills: skill1, skill2, skill3. To
find out if there are any available jobs for them:
ZUNIONSTORE results 3 skill1 skill2 skill3 AGGREGATE SUM
results -> {joba: 2, jobb: 2}

ZINTERSTORE results 2 results jobs WEIGHTS -1 1 AGGREGATE SUM
results -> {joba: 0, jobb: 1}

Every job in the results ZSET with a score of 0 is able to be done
with the person you just queried for. You can trim off all jobs with
scores >= 1 and keep results for that person, or you can calculate it
on the fly whenever you need it. Either way is fine. Heck, you could
store it, then every day, calculate the difference and email out the
new jobs to the user.

You *can* perform the entire operation with a single ZUNIONSTORE
(weights for the skills are all -1, weight for jobs is 1) with the
same result semantics, but you always end up with a set as large as
your 'jobs' zset. It's mostly personal choice.


Regards,
- Josiah

Josiah Carlson

unread,
Oct 2, 2010, 12:44:59 PM10/2/10
to redi...@googlegroups.com
Sorry, mistype. Scores default to 1 for SETs in ZUNIONSTORE/ZINTERSTORE.

- Josiah

mb

unread,
Oct 3, 2010, 9:30:21 PM10/3/10
to Redis DB
Thanks very much, Josiah. That's an ingenious use of zsets,
and I would never have gotten there with hints alone!

This solution performs well for testing and prototyping.
With 10,000 jobs and 1,000 workers, ZUNIONSTORE/ZINTERSTORE
can find jobs for one worker in about 50 ms.

As the dataset grows, however, performance drops dramatically.
With 100,000 jobs and 10,000 workers, it takes 2,100 ms to find
jobs for one worker. Blocking other Redis calls for that long
won't work.

My use cases involve 30-40 skills per worker, and 2-10 skills
per job.

I'm guessing this solution uses a lot of cycles doing work
that the problem doesn't require. For example, I only need
to show the jobs that match a worker's skills. Since I don't
need to compute or store the exact score for every combination,
it would be more efficient to stop comparing a job as soon
as one skill doesn't match.

I appreciate your help on this, I now have enough to complete
the prototype. Though a more efficient approach will be
needed if the app is successful.

Thanks again for your analysis and recommendations, I've
learned a lot!

mb
> > On Sat, Oct 2, 2010 at 2:12 AM, mb <doit...@gmail.com> wrote:
> >> Josiah, many thanks for helping think this through.  You're right,
> >> the challenge is to get all jobs aworkercan do each time the
> >> given set of skills changes for theworker, given a large number
> >> of jobs each with different sets of required skills.
>
> >> If a job's set of required skills is a subset of theworker'sskills,
> >> it's a match.
>
> >> I've come up with several inelegant and inefficient solutions, that
> >> involve:
>
> >> (a) pulling large amounts of data back to the client, or
> >> (b) iterating over many jobs to SDIFF against theworker'sskill set,

Josiah Carlson

unread,
Oct 4, 2010, 2:54:34 AM10/4/10
to redi...@googlegroups.com
On Sun, Oct 3, 2010 at 6:30 PM, mb <doi...@gmail.com> wrote:
> Thanks very much, Josiah.  That's an ingenious use of zsets,
> and I would never have gotten there with hints alone!

Thank you, I'm glad to help. :)

> This solution performs well for testing and prototyping.
> With 10,000 jobs and 1,000 workers, ZUNIONSTORE/ZINTERSTORE
> can find jobs for one worker in about 50 ms.
>
> As the dataset grows, however, performance drops dramatically.
> With 100,000 jobs and 10,000 workers, it takes 2,100 ms to find
> jobs for one worker.  Blocking other Redis calls for that long
> won't work.
>
> My use cases involve 30-40 skills per worker, and 2-10 skills
> per job.

If you are ok with the same or greater wall-clock time, without
blocking other operations, you can chunk your jobs into (perhaps)
groups of 1k-10k (as you tested), and do them chunk by chunk; that is,
check skill1:chunk1, skill2:chunk1, ... to look for a job. Then move
onto skill1:chunk2, skill2:chunk2, ..., unioning the result into the
result from the first chunk. Since the smaller unions/intersections
finish much faster, you won't be blocking everyone else, and you can
still end up with one list with everything.

> I'm guessing this solution uses a lot of cycles doing work
> that the problem doesn't require.  For example, I only need
> to show the jobs that match a worker's skills.  Since I don't
> need to compute or store the exact score for every combination,
> it would be more efficient to stop comparing a job as soon
> as one skill doesn't match.

Depending on the total number of skills you have, you could perform
differences instead of unions, and it might be faster...

Say that a person has skill1, skill3, skill4... but there are skills 1..5 .

SDIFFSTORE ok jobs skill2 skill5

Technically, anything that is in 'ok' is a job that the person can do,
as we've excluded those jobs that he can't do. Straight up set
arithmetic. Good eye about probably doing more work than was
necessary :)

In this updated version, "jobs" is the set of all jobs, with 'skillX'
being the set of all jobs that require skillX.

> I appreciate your help on this, I now have enough to complete
> the prototype.  Though a more efficient approach will be
> needed if the app is successful.
>
> Thanks again for your analysis and recommendations, I've
> learned a lot!

You are welcome, it was a fun problem to solve :)

Regards,
- Josiah

Reply all
Reply to author
Forward
0 new messages