Hi evreryone,
I re-open that discussion about matching techniques. I've recently implemented the method 2 of my first mail.
Results are quite astonishing.
To
answer an issue raised by Gaëtan at that time, I deal with cells as we
usually deal with individual in table2 and I don't lose time with empty
cell.
In the first stage, I match "women" (=table1) with cells (group of individuals with same matching value) from the table2.
I
remove cells from the basket when empty so the best match is always.
Indeed, I have a counter for each cell to control if it's empty or not.
In a second stage, I select randomly people from each group of men and give them to women who had select that cell.
I used pandas in my code (even if the loop use only numpy). It can give an idea. The attached code on a theoretical example can help to convince.
.
The quadratic problem turns to be a linear one, and when size raises it does matter.
Alexis
def evaluate(self, orderby, method):
table2 = self.table2
index_init = self.table1.index
if orderby is not None:
table1 = self.table1.sort(orderby)
else:
table1 = self.table1.loc[np.random.permutation(index_init)]
index_init = table1.index
table2 = self.table2.fillna(0)
table1 = self.table1.fillna(0)
if len(table1)>len(table2):
print ("WARNING : La table de gauche doit être la plus petite, "\
"traduire le score dans l'autre sens et changer l'ordre." \
"pour l'instant table1 est reduite à la taille de table2. ")
table1 = table1[:len(table2)]
index_modif = table1.index
score_str, vars = _rewrite_score(self.score_str, 'temp', 'table2', table1.columns.tolist(), table2.columns.tolist())
n = len(table1)
if method=='cells':
groups2 = table2.groupby(vars)
cells_ini = pd.DataFrame(groups2.groups.keys(),columns =vars)
score_str, vars = _rewrite_score(self.score_str, 'temp', 'cells', table1.columns.tolist(), vars)
size = pd.DataFrame(groups2.size(), columns = ['size'])
if len(size) != len(cells_ini):
raise Exception('Oups, il y a un problème de valeurs nulles a priori')
cells_ini = cells_ini.merge(size, left_on=vars, right_index=True, how='left')
cells_ini['id'] = cells_ini.index
# conversion en numpy
table1 = np.array(table1, dtype=np.int32)
cells = np.array(cells_ini, dtype=np.int32)
#definition de la boucle
nvar = len(vars)-1
match = np.empty(n, dtype=int)
percent = 0
start = time.clock()
#check
assert cells[:,nvar+1].min() > 0
if method=='cells':
for k in xrange(n):
# real_match_cells(k,cells)
temp = table1[k]
try:
score = eval(score_str)
except:
pdb.set_trace()
try:
idx = score.argmax()
except:
pdb.set_trace()
idx2 = cells[idx,nvar+2]
match[k] = idx2
cells[idx,nvar+1] -= 1
if cells[idx,nvar+1]==0:
cells = np.delete(cells, idx, 0)
# update progress bar
percent_done = (k * 100) / n
to_display = percent_done - percent
if to_display:
chars_to_write = list("." * to_display)
offset = 9 - (percent % 10)
while offset < to_display:
chars_to_write[offset] = '|'
offset += 10
sys.stdout.write(''.join(chars_to_write))
percent = percent_done
match = pd.Series(match, index = index_modif).sort_index()
if method == 'cells':
match_ini = match.copy()
match_count = match.value_counts()
for group in match_count.index:
nb_selected = match_count[group]
keys_group = cells_ini.loc[group,vars].tolist()
try:
match[match_ini==group] = groups2.groups[tuple(keys_group)][:nb_selected]
except:
pdb.set_trace()
print 'temps dédié au real_matching :', time.clock() - start
assert match.nunique() == len(match)
return match