Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

pandas dataframe, find duplicates and add suffix

1,058 views

Skip to first unread message

zlju...@gmail.com

unread,

Mar 28, 2017, 5:00:55 PM3/28/17

In dataframe

import pandas as pd

data = {'model': ['first', 'first', 'second', 'second', 'second', 'third', 'third'],
'dtime': ['2017-01-01_112233', '2017-01-01_112234', '2017-01-01_112234', '2017-01-01_112234', '2017-01-01_112234', '2017-01-01_112235', '2017-01-01_112235'],
}
df = pd.DataFrame(data, index = ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg', 'f.jpg', 'g.jpg'], columns=['model', 'dtime'])

print(df.head(10))

model dtime
a.jpg first 2017-01-01_112233
b.jpg first 2017-01-01_112234
c.jpg second 2017-01-01_112234
d.jpg second 2017-01-01_112234
e.jpg second 2017-01-01_112234
f.jpg third 2017-01-01_112235
g.jpg third 2017-01-01_112235

within model, there are duplicate dtime values.
For example, rows d and e are duplicates of the c row.
Row g is duplicate of the f row.

For each duplicate (within model) I would like to add suffix (starting from 1) to the dtime value. Something like this:

model dtime
a.jpg first 2017-01-01_112233
b.jpg first 2017-01-01_112234
c.jpg second 2017-01-01_112234
d.jpg second 2017-01-01_112234-1
e.jpg second 2017-01-01_112234-2
f.jpg third 2017-01-01_112235
g.jpg third 2017-01-01_112235-1

How to do that?

Pavol Lisy

unread,

Mar 30, 2017, 8:23:25 AM3/30/17

> --
> https://mail.python.org/mailman/listinfo/python-list
>

I am not expert, just played a little...

This one could work:

gb = df.groupby([df.model, df.dtime])
df.dtime = df.dtime + gb.cumcount().apply(lambda a:str(-a) if a else '')

this one is probably more readable:
df.dtime = df.dtime + [str(-a) if a else '' for a in gb.cumcount()]

I don't know which one is better in memory consumption and/or speed.

This small dataframe gave me:

%timeit -r 5 df.dtime + gb.cumcount().apply(lambda a:str(-a) if a else '')
1000 loops, best of 5: 387 µs per loop

%timeit -r 5 df.dtime + [str(-a) if a else '' for a in gb.cumcount()]
1000 loops, best of 5: 324 µs per loop

PL.

0 new messages