Unable to use regex to drop columns

SELVA MUTHU KUMARAN SATHAPPAN

Oct 23, 2019, 1:13:02 AM

to modin-dev

Hello Everyone,

I started using modin recently.

Although read_csv reads CSV files faster, I am not able to drop columns from a dataframe based on a regex.

I am trying to follow the approaches from Stack Overflow (https://stackoverflow.com/questions/56806528/use-regex-to-remove-exclude-columns-from-dataframe-python), but none of these options work in modin.

As I deal with a large number of files, I can't write individual drop statements. My code looks like this:

import glob

import modin.pandas as pd

filenames = sorted(glob.glob('*.csv'))
df_list = []
for f in filenames:
    print(f)
    t = vars()['df_' + f] = pd.read_csv(f, low_memory=False)
    # Keep only the columns whose name does not contain 'Unnamed'
    # (note: I tried other options from SO)
    t = t[t.columns.difference(t.filter(like='Unnamed').columns)]
    df_list.append(t)

When I view the dataframes in df_list, I still see column names such as `Unnamed: 0`.

Can you help?

Thanks

Selva

Devin Petersohn

Oct 23, 2019, 2:29:44 PM

to SELVA MUTHU KUMARAN SATHAPPAN, modin-dev

Hi Selva,

I was able to reproduce the issue locally:

In [1]: import modin.pandas as pd

In [2]: df = pd.read_csv("2e16x2e6.csv.gz")

In [3]: df
Out[3]:
Unnamed: 0 Unnamed: 0.1 col0 col1 col2 col3 col4 col5 col6 col7 col8 col53 col54 col55 col56 col57 col58 col59 col60 col61 col62 col63
0 0 0 90 32 83 38 31 31 95 72 55 67 2 2 74 22 11 68 60 64 84 52
1 1 1 60 14 0 53 76 56 74 24 54 29 57 20 35 73 28 6 96 6 75 77
2 2 2 38 94 7 11 42 50 38 30 11 2 86 2 66 47 97 24 35 30 61 66
3 3 3 76 60 49 57 79 21 26 69 30 38 59 89 99 23 65 48 2 73 79 30
4 4 4 12 90 71 50 0 43 70 17 64 3 57 39 77 68 45 91 6 59 25 55
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
65531 65531 65531 30 13 28 1 8 37 97 33 41 42 21 19 15 22 23 59 11 1 39 49
65532 65532 65532 34 25 5 22 19 8 13 45 63 95 72 16 95 22 96 71 80 94 3 30
65533 65533 65533 78 12 86 13 55 56 26 15 16 47 0 25 56 4 79 57 50 20 17 13
65534 65534 65534 69 38 55 35 21 67 35 9 28 64 27 99 37 39 42 7 97 74 82 65
65535 65535 65535 84 67 48 95 21 49 51 49 18 41 68 52 85 42 14 96 99 88 10 15

[65536 rows x 66 columns]

In [4]: df[df.columns.difference(df.filter(like='Unnamed').columns)]
Out[4]:
col0 col1 col10 col11 col12 col13 col14 col15 col16 col17 col18 col57 col58 col59 col6 col60 col61 col62 col63 col7 col8 col9
0 90 32 14 15 71 10 46 84 81 25 4 22 11 68 95 60 64 84 52 72 55 93
1 60 14 9 1 12 30 86 99 55 30 45 73 28 6 74 96 6 75 77 24 54 45
2 38 94 65 87 29 26 96 19 81 71 98 47 97 24 38 35 30 61 66 30 11 16
3 76 60 68 13 19 94 90 58 93 59 81 23 65 48 26 2 73 79 30 69 30 99
4 12 90 79 90 37 69 40 66 88 91 51 68 45 91 70 6 59 25 55 17 64 78
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
65531 30 13 48 29 53 37 55 12 66 50 59 22 23 59 97 11 1 39 49 33 41 47
65532 34 25 86 97 41 71 25 77 88 43 12 22 96 71 13 80 94 3 30 45 63 68
65533 78 12 8 31 0 22 22 64 77 5 52 4 79 57 26 50 20 17 13 15 16 20
65534 69 38 66 26 40 55 83 84 58 3 93 39 42 7 35 97 74 82 65 9 28 22
65535 84 67 6 0 0 24 88 90 74 58 15 42 14 96 51 99 88 10 15 49 18 53

[65536 rows x 64 columns]

In [5]: df.columns
Out[5]:
Index(['Unnamed: 0', 'Unnamed: 0.1', 'col0', 'col1', 'col2', 'col3', 'col4',
       'col5', 'col6', 'col7', 'col8', 'col9', 'col10', 'col11', 'col12',
       'col13', 'col14', 'col15', 'col16', 'col17', 'col18', 'col19', 'col20',
       'col21', 'col22', 'col23', 'col24', 'col25', 'col26', 'col27', 'col28',
       'col29', 'col30', 'col31', 'col32', 'col33', 'col34', 'col35', 'col36',
       'col37', 'col38', 'col39', 'col40', 'col41', 'col42', 'col43', 'col44',
       'col45', 'col46', 'col47', 'col48', 'col49', 'col50', 'col51', 'col52',
       'col53', 'col54', 'col55', 'col56', 'col57', 'col58', 'col59', 'col60',
       'col61', 'col62', 'col63'],
      dtype='object')

It is a metadata issue, where we are dropping the columns in the data, but not in the metadata tracked separately. It should be a simple fix, thanks for the report! I will open an issue on the GitHub repo to track this.

Devin


Devin Petersohn

Oct 23, 2019, 2:48:44 PM

to SELVA MUTHU KUMARAN SATHAPPAN, modin-dev

So in my sleep-deprived state, I have just realized that I was checking the columns of the original dataframe, not the result of the indexing. If you check the columns of the result of the indexing, they are correct:

In [8]: df[df.columns.difference(df.filter(like='Unnamed').columns)].columns
Out[8]:
Index(['col0', 'col1', 'col10', 'col11', 'col12', 'col13', 'col14', 'col15',
       'col16', 'col17', 'col18', 'col19', 'col2', 'col20', 'col21', 'col22',
       'col23', 'col24', 'col25', 'col26', 'col27', 'col28', 'col29', 'col3',
       'col30', 'col31', 'col32', 'col33', 'col34', 'col35', 'col36', 'col37',
       'col38', 'col39', 'col4', 'col40', 'col41', 'col42', 'col43', 'col44',
       'col45', 'col46', 'col47', 'col48', 'col49', 'col5', 'col50', 'col51',
       'col52', 'col53', 'col54', 'col55', 'col56', 'col57', 'col58', 'col59',
       'col6', 'col60', 'col61', 'col62', 'col63', 'col7', 'col8', 'col9'],
      dtype='object')

Because this is the case, I cannot reproduce your issue. Is there additional information that would help me reproduce the behavior you're seeing?

Devin
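
For anyone following along, here is a minimal sketch of the check described above. It assumes plain pandas as a stand-in for modin.pandas (Modin aims to mirror the pandas API, but this sketch was not run against Modin), and the sample frame and its values are made up for illustration. It uses a regex mask on the column names rather than columns.difference, which also keeps the original column order.

```python
import pandas as pd  # assumption: plain pandas standing in for modin.pandas

# Hypothetical sample frame with the kind of stray index columns
# that show up as 'Unnamed: ...' after read_csv.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Unnamed: 0.1": [0, 1],
    "col0": [90, 60],
    "col1": [32, 14],
})

# Boolean mask marking the columns whose name matches the regex.
unnamed = df.columns.str.contains(r"^Unnamed")

# Selecting columns returns a new DataFrame; df itself is left untouched.
cleaned = df.loc[:, ~unnamed]

print(list(df.columns))       # the original frame still has the 'Unnamed' columns
print(list(cleaned.columns))  # the result of the selection does not
```

The point is that the selection has to be assigned to a name and the columns checked on that result, since the original frame is not modified in place.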
