Matching

Alexis EidelMan

Mar 19, 2013, 8:36:20 AM
to liam...@googlegroups.com
Hi,

About the matching: every user has probably noticed that this step requires more time than any other. As the computation time is at least O(n^2) while the other steps are probably O(n), the relative weight of the matching function increases with the database size n.

I'm opening this discussion to collect ideas for decreasing the computation time of the matching function.

For now, I can see three of them:

1 - Use a more precise definition of the groups to match.

The computation time is O(n^2) but, in fact, it depends only on the size of the population we want to match. If we want to match p men with p women (see the example given in the demo.yml files of Liam2), the computation time is, say, A x p x p. If we manage to split them into two subgroups, the computation time becomes A x p x (p/2), as for each woman, her partner is looked for among only p/2 candidates. That is a limited but interesting idea.
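
To make the arithmetic concrete, here is a minimal sketch (toy data and a made-up age-based score, not the Liam2 code) that just counts score evaluations with and without splitting the pool into two age bands:

import numpy as np

# toy illustration of point 1: count score evaluations (one per
# (woman, man) pair considered) with and without splitting the pool
rng = np.random.RandomState(0)
p = 1000
women_age = rng.randint(18, 91, size=p)
men_age = rng.randint(18, 91, size=p)

full_evals = p * p  # every woman scored against every man

# split both sexes into two age bands and match within each band only
split_evals = 0
for young in (True, False):
    w = women_age[(women_age < 40) == young]
    m = men_age[(men_age < 40) == young]
    split_evals += len(w) * len(m)

print(full_evals)   # 1000000
print(split_evals)  # always less; about p * p / 2 when the bands are balanced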

2 - Use the fact that there is, in general, a small number of distinct values in the cost matrix.

I need to think a little more about it, but for example, in demo05.yml the matching uses only age, with ages between 18 and 90 for both men and women; that means there are only 73 x 73 possible values for the distance between a man and a woman. In demo06.yml the distance also depends on workstate, which multiplies the set of possible values by 4.
Taking this into account, we can probably reduce the computation time a lot. For example, for each woman we can look for the minimum over the 73 x 2 possible values of the distance and then take a random man from that group (if that group is not empty, otherwise from the group with the next best value).
That way the computation time no longer depends on the number of men we want to match.

To go further, we should think about matching on a quantitative variable. In that case, a suggested option is to transform it into a qualitative variable.
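
For illustration, a minimal sketch of this second idea under toy assumptions (random ages and an age-proximity score; nothing here is the actual Liam2 implementation):

import numpy as np

# toy sketch of point 2: men are grouped into cells by age, so each woman
# is scored against at most 73 distinct values instead of against every man
rng = np.random.RandomState(0)
n = 5000
men_age = rng.randint(18, 91, size=n)
women_age = rng.randint(18, 91, size=n)

ages = np.arange(18, 91)                          # the 73 possible cells
counts = np.bincount(men_age, minlength=91)[18:]  # men left in each cell
men_by_age = {a: list(np.where(men_age == a)[0]) for a in ages}
for a in ages:
    rng.shuffle(men_by_age[a])                    # for the random pick

match = np.empty(n, dtype=int)
for k in range(n):
    score = -abs(ages - women_age[k])    # toy score: age proximity
    score[counts == 0] = -10 ** 9        # ignore empty cells
    best = int(score.argmax())           # best non-empty cell
    counts[best] -= 1
    match[k] = men_by_age[ages[best]].pop()  # random man from that cell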

3 - Draw a random sample of candidates

That option is rather simple to implement and the computation time can easily be controlled. We can define a number p and, for each woman, draw a random sample of size p and then select the best match among these p men.
The computation time is O(n x p).
Of course, the resulting matching can be far from the optimal one (in any case, optimal matching is another issue), but that technique can be useful during development, and it is not so bad since p can go to infinity (or simply p greater than n).
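
A minimal sketch of this third option (again with toy data and a made-up score):

import numpy as np

# toy sketch of point 3: each woman is scored against a random sample of
# p men only, so the total cost is O(n * p) instead of O(n^2)
rng = np.random.RandomState(0)
n, p = 5000, 50
men_age = rng.randint(18, 91, size=n)
women_age = rng.randint(18, 91, size=n)

available = np.arange(n)          # men not matched yet
match = np.empty(n, dtype=int)
for k in range(n):
    size = min(p, len(available))
    sample = rng.choice(available, size=size, replace=False)
    scores = -abs(men_age[sample] - women_age[k])   # toy score
    best = sample[int(scores.argmax())]
    match[k] = best
    available = available[available != best]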

Gaëtan de Menten

Mar 19, 2013, 11:44:10 AM
to liam...@googlegroups.com

On 19/03/2013 13:36, Alexis EidelMan wrote:

> About the matching: every user has probably noticed that this step
> requires more time than any other.

I somehow doubt that because I think that in practice users often match
only a small subset of their population.

> As the computation time is at least O(n^2)
> while the other steps are probably O(n), the relative weight of the
> matching function increases with the database size n.

Indeed, but it only increases with the size of the "matching pool".

> I'm opening this discussion to collect ideas for decreasing the
> computation time of the matching function.
>
> For now, I can see three of them:
>
> 1 - Use a more precise definition of the groups to match.
>
> The computation time is O(n^2) but, in fact, it depends only on the
> size of the population we want to match. If we want to match p men with
> p women (see the example given in the demo.yml files of Liam2), the
> computation time is, say, A x p x p. If we manage to split them into
> two subgroups, the computation time becomes A x p x (p/2), as for each
> woman, her partner is looked for among only p/2 candidates. That is a
> limited but interesting idea.

That would be ideal (and can already be done by the modellers) if
possible, but the problem is that subgroups will be totally disjoint.
Even if one can find subgroups that are much more likely to be matched
together, splitting them like that would prevent some rare matches from
happening. Given that the current solution is hardly perfect, this might
produce better matches anyway.

> 2 - Use the fact that there is, in general, a small number of distinct
> values in the cost matrix.
>
> I need to think a little more about it, but for example, in demo05.yml
> the matching uses only age, with ages between 18 and 90 for both men
> and women; that means there are only 73 x 73 possible values for the
> distance between a man and a woman. In demo06.yml the distance also
> depends on workstate, which multiplies the set of possible values by 4.
> Taking this into account, we can probably reduce the computation time a
> lot. For example, for each woman we can look for the minimum over the
> 73 x 2 possible values of the distance and then take a random man from
> that group (if that group is not empty, otherwise from the group with
> the next best value).
> That way the computation time no longer depends on the number of men we
> want to match.
>
> To go further, we should think about matching on a quantitative
> variable. In that case, a suggested option is to transform it into a
> qualitative variable.

Interesting idea, but even on the limited set of models where this could
be applied, I am unsure this would work in practice because when the
"best" group is empty, you would have to try the second best, etc, and
finding and checking all those groups in turn might kill performance,
especially if you have many empty groups or many groups with very few
people in them, which I fear would often be the case in real models. A
quick estimate gives 191844 categories for our model and I guess most of
them only contain very few individuals. Do you have any idea how to find
the best "non empty" group efficiently?

> 3 - Draw a random sample of candidates
>
> That option is rather simple to implement and the computation time can
> easily be controlled. We can define a number p and, for each woman,
> draw a random sample of size p and then select the best match among
> these p men.
> The computation time is O(n x p).
> Of course, the resulting matching can be far from the optimal one (in
> any case, optimal matching is another issue), but that technique can be
> useful during development, and it is not so bad since p can go to
> infinity (or simply p greater than n).

Simple, I like it. As a fallback when the "matching pool" is too large,
this would be great, and it might even give decent results if n is large
enough.

Gaëtan



Gijs Dekkers

Apr 25, 2013, 7:52:17 AM
to liam...@googlegroups.com

...

> 3 - Draw a random sample of candidates
>
> That option is rather simple to implement and the computation time can
> easily be controlled. We can define a number p and, for each woman,
> draw a random sample of size p and then select the best match among
> these p men.
> The computation time is O(n x p).
> Of course, the resulting matching can be far from the optimal one (in
> any case, optimal matching is another issue), but that technique can be
> useful during development, and it is not so bad since p can go to
> infinity (or simply p greater than n).
The problem of Gaëtan's disjoint subgroups (see point 1) can be resolved by this. But I think one could gain additional efficiency by not sampling purely randomly but rather on 'group-matching' characteristics.
 
Just thinking out loud: usually in a matching procedure, the matching for each woman is based on a set of variables using a logit, right? And if there is more than 1 potential partner with exactly the same characteristics, then a random draw is used to 'decide' who gets the girl. So it is a matching over a large number of potential partners, followed by a random draw over a small number of potential partners (if any). But what would happen if this order would be reversed? We would select a group of potential partners using group matching. This matching would be done on the basis of overall characteristics (for example, level of education of the male potential partner and whatever other characteristics), I guess like they would do it in ABM. And then individual partners would be selected from the specific group of potential partners by a simple random draw.
The results will be worse than with full matching, for sure. But the results of the classical matching procedure are not very good either (I think there was a dynacan paper on this) and the advantage would be that we could introduce some ABM-kind of recursity in the procedure.
Or is it just a lousy idea?
 
Gijs

Gaëtan de Menten

Apr 25, 2013, 9:45:26 AM
to liam...@googlegroups.com

On 25/04/2013 13:52, Gijs Dekkers wrote:
>
> 3 - Draw a random sample of candidates
>
> That option is rather simple to implement and the computation time
> can easily be controlled. We can define a number p and, for each
> woman, draw a random sample of size p and then select the best match
> among these p men.
> The computation time is O(n x p).
> Of course, the resulting matching can be far from the optimal one (in
> any case, optimal matching is another issue), but that technique can
> be useful during development, and it is not so bad since p can go to
> infinity (or simply p greater than n).
>
> The problem of Gaëtan's disjoint subgroups (see point 1) can be resolved
> by this.

How exactly?

> But I think one could gain additional efficiency by not
> sampling purely randomly but rather on 'group-matching' characteristics.
> Just thinking out loud: usually in a matching procedure, the matching
> for each woman is based on a set of variables using a logit, right?

Currently, only the "pure"/raw score is used to sort individuals. So it
really depends on what you use in the score expression.

> And
> if there is more than 1 potential partner with exactly the same
> characteristics, then a random draw is used to 'decide' who gets the
> girl.

No, in that case, the first one (with the lowest id) "gets the girl".

> So it is a matching over a large number of potential partners,
> followed by a random draw over a small number of potential partners (if
> any).

> But what would happen if this order would be reversed? We would
> select a group of potential partners using group matching. This matching
> would be done on the basis of overall characteristics (for
> example, level of education of the male potential partner and whatever
> other characteristics), I guess like they would do it in ABM. And then
> individual partners would be selected from the specific group of
> potential partners by a simple random draw.

I am not sure I get your "select a group of potential partners using
group matching"? If I understand correctly, this is basically what
Alexis suggested in point 2...

Say you have a working woman aged between 30 and 40 with a high level of
education: you would determine the best group(s) for her (for example,
working men aged 30-50 with either a medium or high education level),
and pick from those groups randomly.

Is this what you meant? If so, it is equivalent to what Alexis
suggested, and it will suffer (IMO) from the same issue: what do you do
when the target group(s) is empty? (but see below)

> The results will be worse than with full matching, for sure. But the
> results of the classical matching procedure are not very good either (I
> think there was a dynacan paper on this)

Again supposing I understood your method correctly, the results would be
exactly the same as with the current method; it would only differ in
speed. With the additional thinking since Alexis' initial proposal, I
think it would probably be faster on average (potentially much faster,
especially with a hybrid approach -- see point 4 below). But the
question in that case is: is it worth optimizing for speed something
which has not proved to be a bottleneck on actual data yet? It will be a
problem someday, but I would personally wait for that day.

4 - Hybrid partition/best score (mix of option 2 and the "current"
algorithm)

A hybrid approach would probably be best: only partition and do the
grouping on those variables which have very few possible values (boolean
fields are obvious candidates), and then, instead of taking a random man
in the target group, compute the score expression only for the men in
that group (or groups) and take the best-scoring one. Maybe this is what
Alexis envisioned all along, but this seems more realistic than
partitioning on all the variables, which would give many groups to look
into, and thus more cases of the "empty group problem". It would
probably still be much more efficient than the current approach.
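
For illustration, a minimal sketch of this hybrid under toy assumptions (the boolean 'working' field, the data and the age score are all made up):

import numpy as np

# toy sketch of point 4: partition only on a boolean field, then compute
# the full score within the partition and take the best-scoring man
rng = np.random.RandomState(0)
n = 2000
men_age = rng.randint(18, 91, size=n)
men_working = rng.rand(n) < 0.6
women_age = rng.randint(18, 91, size=n)
women_working = rng.rand(n) < 0.5

taken = np.zeros(n, dtype=bool)
match = np.empty(n, dtype=int)
for k in range(n):
    # candidates: unmatched men in the same partition as this woman
    cand = np.where((men_working == women_working[k]) & ~taken)[0]
    if len(cand) == 0:
        # empty partition: fall back to all remaining men
        cand = np.where(~taken)[0]
    scores = -abs(men_age[cand] - women_age[k])  # full score, small group
    best = cand[int(scores.argmax())]
    match[k] = best
    taken[best] = True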

I wonder if a fifth variant wouldn't give good results on average with
acceptable performance:

5 - Draw a random sample of candidates (both men and women)

You start by drawing random samples of size N (where N is determined
experimentally to have acceptable performance) which include half men
and half women (N/2 of each), you do a perfect match within that sample,
then draw another sample from the remaining population to match; rinse
and repeat until the whole population is done.
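
A minimal sketch of this fifth variant (toy data; a greedy best-score rule stands in for the full matching within each block):

import numpy as np

# toy sketch of point 5: cut both pools into blocks of N/2 women and
# N/2 men and run the matching within each block only: O(n * N)
rng = np.random.RandomState(0)
n, half = 6000, 250                      # N = 500 -> 250 of each sex
men_age = rng.randint(18, 91, size=n)    # real data would be shuffled
women_age = rng.randint(18, 91, size=n)  # before being cut into blocks

match = np.empty(n, dtype=int)
for start in range(0, n, half):
    block = range(start, min(start + half, n))
    left = list(block)                   # men of this block, still free
    for k in block:                      # best-score match inside block
        scores = -abs(men_age[left] - women_age[k])
        match[k] = left.pop(int(scores.argmax()))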

> and the advantage would be that
> we could introduce some ABM-kind of recursity in the procedure.

Uh??? You lost me here!


Gaëtan

Alexis EidelMan

Apr 26, 2013, 7:08:03 AM
to liam...@googlegroups.com, g...@plan.be
The matching issue is important on two sides.
First, computation time: even if it is acceptable, matching seems to be the greediest step in Liam2, and its weight increases with the dataset size.
Second, and just as importantly, it is a crucial point from the statistical point of view: we expect the matching to be able to reproduce the joint statistics we observe.
I'm not really aware of any literature on matching procedures, even though the topic interests me a lot. If you have anything, please do not hesitate to tell me.
I read something about DYNACAN, but it seems to me that they use the notion of stable marriage rather than optimal matching.
Like Gaëtan, I don't know much about ABM.

Alexis

Alexis EidelMan

Aug 7, 2013, 11:38:27 AM
to liam...@googlegroups.com, g...@plan.be
Hi everyone,

I'm re-opening this discussion about matching techniques. I've recently implemented method 2 from my first mail.
The results are quite astonishing.
To answer an issue raised by Gaëtan at the time: I deal with cells (groups of individuals sharing the same matching values) the way we usually deal with individuals in table2, and I don't lose time on empty cells.

In the first stage, I match "women" (= table1) with cells (groups of individuals with the same matching values) from table2.
I remove cells from the pool when they become empty, so the best match is always a valid one. For that, I keep a counter for each cell to know whether it is empty or not.
In the second stage, I select men randomly from each cell and give them to the women who selected that cell.

I used pandas in my code (even if the main loop uses only numpy). It can give an idea; the attached code, on a theoretical example, should help to convince you.

The quadratic problem becomes a linear one, and as the size grows, it matters.

Alexis


import sys
import time

import numpy as np
import pandas as pd


# NB: this is a method of the matching class; _rewrite_score (not shown)
# rewrites the score expression so that it can be eval'ed with 'temp'
# bound to the current woman and 'table2'/'cells' to the candidates
def evaluate(self, orderby, method):
    index_init = self.table1.index
    if orderby is not None:
        table1 = self.table1.sort(orderby)
    else:
        table1 = self.table1.loc[np.random.permutation(index_init)]
    index_init = table1.index

    table1 = table1.fillna(0)
    table2 = self.table2.fillna(0)

    if len(table1) > len(table2):
        print("WARNING: the left table must be the smaller one; "
              "express the score the other way round and swap the "
              "tables. For now, table1 is truncated to the size of "
              "table2.")
        table1 = table1[:len(table2)]
    index_modif = table1.index

    score_str, vars = _rewrite_score(self.score_str, 'temp', 'table2',
                                     table1.columns.tolist(),
                                     table2.columns.tolist())
    n = len(table1)

    if method == 'cells':
        # group men by their matching values: one cell per combination
        groups2 = table2.groupby(vars)
        cells_ini = pd.DataFrame(groups2.groups.keys(), columns=vars)
        score_str, vars = _rewrite_score(self.score_str, 'temp', 'cells',
                                         table1.columns.tolist(), vars)
        size = pd.DataFrame(groups2.size(), columns=['size'])
        if len(size) != len(cells_ini):
            raise Exception('mismatch between cells and their sizes: '
                            'there is probably a problem with null values')
        cells_ini = cells_ini.merge(size, left_on=vars, right_index=True,
                                    how='left')
        cells_ini['id'] = cells_ini.index
        # convert to numpy for the inner loop
        table1 = np.array(table1, dtype=np.int32)
        cells = np.array(cells_ini, dtype=np.int32)
        nvar = len(vars) - 1
        # sanity check: no cell should be empty at the start
        assert cells[:, nvar + 1].min() > 0

    match = np.empty(n, dtype=int)
    percent = 0
    start = time.clock()
    if method == 'cells':
        for k in xrange(n):
            temp = table1[k]
            # score of the current woman against each remaining cell
            score = eval(score_str)
            idx = score.argmax()
            match[k] = cells[idx, nvar + 2]
            # one man less in that cell; drop the cell once it is empty
            cells[idx, nvar + 1] -= 1
            if cells[idx, nvar + 1] == 0:
                cells = np.delete(cells, idx, 0)
            # update progress bar
            percent_done = (k * 100) / n
            to_display = percent_done - percent
            if to_display:
                chars_to_write = list("." * to_display)
                offset = 9 - (percent % 10)
                while offset < to_display:
                    chars_to_write[offset] = '|'
                    offset += 10
                sys.stdout.write(''.join(chars_to_write))
            percent = percent_done

    match = pd.Series(match, index=index_modif).sort_index()
    if method == 'cells':
        # second stage: within each cell, hand out actual men to the
        # women who selected that cell
        match_ini = match.copy()
        match_count = match.value_counts()
        for group in match_count.index:
            nb_selected = match_count[group]
            keys_group = cells_ini.loc[group, vars].tolist()
            match[match_ini == group] = \
                groups2.groups[tuple(keys_group)][:nb_selected]
    print 'time spent in the matching loop:', time.clock() - start

    assert match.nunique() == len(match)
    return match
(attachment: test_matching.py)

Gaëtan de Menten

Sep 23, 2013, 11:34:22 AM
to liam...@googlegroups.com

Hi,

Sorry for the really slow reply... Holidays and health issues don't help...

On 07/08/2013 17:38, Alexis EidelMan wrote:

> I'm re-opening this discussion about matching techniques. I've recently
> implemented method 2 from my first mail.
> The results are quite astonishing.
> To answer an issue raised by Gaëtan at the time: I deal with cells
> (groups of individuals sharing the same matching values) the way we
> usually deal with individuals in table2, and I don't lose time on empty
> cells.
>
> In the first stage, I match "women" (= table1) with cells (groups of
> individuals with the same matching values) from table2.
> I remove cells from the pool when they become empty, so the best match
> is always a valid one. For that, I keep a counter for each cell to know
> whether it is empty or not.

*facepalm*... Sorry for being stupid. This is great. The initial setup
is a bit expensive, but I guess it quickly pays off. The question now is
whether it is fast enough on small to medium datasets to become the
default implementation, or whether we should keep the old one (I am not
sure if we should choose the implementation automatically via a
heuristic or via an option).

> In the second stage, I select men randomly from each cell and give them
> to the women who selected that cell.
>
> I used pandas in my code (even if the main loop uses only numpy). It
> can give an idea; the attached code, on a theoretical example, should
> help to convince you.
>
> The quadratic problem becomes a linear one, and as the size grows, it
> matters.

Indeed.

Note that we could be even faster (for very large datasets) by grouping
"women" (set1) too: we would not need to recompute the score for each
woman with the same characteristics; we would just need to keep track of
how many men are left in the best group, and call argmax again only when
there are none left. It might not be worth the added complexity, but in
case someone ever finds your new algorithm too slow, it is an option
that should be remembered.
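
For illustration, a minimal sketch of that refinement under toy assumptions (age cells on both sides and a made-up score; this is not the liam2 API):

import numpy as np

# toy sketch: both sexes grouped into age cells; the score is computed
# once per *cell* of women, and argmax is re-run only while the current
# best cell of men still runs out of candidates
rng = np.random.RandomState(0)
ages = np.arange(18, 91)
men_left = np.bincount(rng.randint(18, 91, size=5000), minlength=91)[18:]
women_count = np.bincount(rng.randint(18, 91, size=5000), minlength=91)[18:]

pairs = []                          # (woman_age, man_age, how_many)
for i in range(len(ages)):
    todo = women_count[i]
    score = -abs(ages - ages[i])    # one score vector per cell of women
    while todo > 0:
        masked = np.where(men_left > 0, score, -10 ** 9)
        best = int(masked.argmax())
        take = min(todo, men_left[best])
        men_left[best] -= take
        todo -= take
        pairs.append((ages[i], ages[best], take))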

> def evaluate(self, orderby, method):
> ...

That code does not seem to apply to liam2 directly? Have you integrated
that code in your copy of liam2 already?

Gaetan