Locate value changes in multiple columns and find order in which they changed

arm...@uchicago.edu

unread,

Jan 11, 2018, 1:13:24 PM1/11/18

to PyData

Hello, All,

I am working with a dataset where I want to identify locations where values change. Specifically, I have a series of columns (named "color1", "color2"... to "color36") and all but one are populated with the names of two colors - one of the colors is always silver. One column in the dataset will have only 1 color in it: silver. In the example dataset below, "color1" happens to be the only-silver column. The column next to it,"color10" has teal and silver. "color33" has purple and silver (though you can't see the row with silver in this example image).

There are multiple subjects (column = "Subjects") in this dataset, and each subject completed several versions of the task (column = "ExperimentName), though in the example image you see only 1 subject (91) and 1 task (FacFac). Each task contains the same # of color columns.

I need to identify the order in which the colors change from ___ to silver, per Subject per ExperimentName.

Essentially, what I think I need to do is find the first instance of a 'silver' in each column, get its index, and then create a column that places the name of the original column (e.g. color3, color35) in order by its index (because the index will indicate when it was clicked in relation to the other colors.) The problem is that this dataset includes multiple participants and experiments, so I can't just find the first instance of a silver in a given column (that will tell me only when participant 91 for experiment FacFac had a color change); I have to find the first instance per Subject per ExperimentName.

The second problem is that I'm very new to programming and python, and am struggling with wrapping my head around this conceptually as well as syntactically.

What I have done so far is to create a table of Booleans where True (1) = Silver and False (0) = any other color.

If anyone has any ideas and/or advice, I would very much appreciate it. I want to try and accomplish this in Python to improve my skills at it and not resort to point-and-click in Excel! Thank you SO much in advance!

This is the example dataset:

Here is the relevant code so far:

data = pd.read_excel("Data.xlsx")

data.filter(like='color').isin(['silver'])

boolSilver = data.filter(like='color').isin(['silver']).astype(int)

chris bertinato

unread,

Jan 12, 2018, 10:43:14 AM1/12/18

to PyData

The notebook for all of this is here: https://gist.github.com/anonymous/ba85be3b972b15f6c8674de8ab70554f

To work per experiment and per subject you can use groupby:

grouped = df.groupby(['ExperimentName', 'Subject'])

which will give you a list of tuples, the second element of which is the dataframe for each grouping.

One way to find the first occurrence of 'silver' as you have them mapped here as boolean values, is to use .idxmax(), which return the index of the first occurrence of the maximum value:

df['color10'].idxmax()

For the table that you had given, this would return 9 when mapped to boolean values with any other color mapped to 0 and 'silver' mapped to 1.

You can then iterate over all of the relevant columns, find the first occurrence of max value, and make a table of occurrence index and column name. Sort that list and you've got an ordering for your columns.

for group in grouped:
    cols = group[1].filter(regex='color1.', axis=1).reset_index(drop=True)
    order = cols.apply(lambda s: s.idxmax()).sort_values()
    print(order.index.tolist())

Stepping through:

cols = group[1].filter(regex='color1.', axis=1).reset_index(drop=True)

The filter application here just returns the columns that match the regular expression which excludes 'color1'. After the filter, the index may not necessarily need to be reset since the indices are only used to sort the list of column names.

order = cols.apply(lambda s: s.idxmax()).sort_values()

The lambda function is applied to each column, since axis=0 by default. The result is a series with values of the index returned by idxmax() and index of column names. The series is sorted from least to greatest value.

You can choose what to do with series from here on.

Dennis O'Brien

unread,

Jan 16, 2018, 4:00:42 AM1/16/18

to PyData

Hi,

There are several ways to interpret your question, each subtly different, so it would be great to get a minimal example of input and expected output.

Here's an example of constructing a very pared down version of your dataset. I'm using the categorical type for the colors, but it's not necessary here.

In [1]: import numpy as np


In [2]: import pandas as pd


In [3]: from pandas.api.types import CategoricalDtype


In [4]: colors = ['silver', 'teal', 'purple']


In [5]: color_category = CategoricalDtype(categories=colors, ordered=False)


In [6]: df = pd.DataFrame(data={
    ...:     'color0': pd.Series(['silver'] * 10).astype(color_category), 
    ...:     'color1': pd.Series(np.random.choice(colors, 10)).astype(color_category),
    ...:     'color2': pd.Series(np.random.choice(colors, 10)).astype(color_category),
    ...:     'color3': pd.Series(np.random.choice(colors, 10)).astype(color_category),
    ...: })


In [6]: df
Out[6]: 
   color0  color1  color2  color3
0  silver    teal  silver    teal
1  silver    teal  silver  silver
2  silver  silver    teal  silver
3  silver    teal  silver    teal
4  silver  purple    teal    teal
5  silver  purple  silver    teal
6  silver    teal    teal    teal
7  silver    teal    teal    teal
8  silver  purple  purple  silver
9  silver  purple  purple    teal

So if we wanted the answer the question "is the value in this row, column different from the same row, previous column, we could use apply(..., axis=1) to operate on each row, and shift to move the series by 1.

In [7]: df.apply(lambda r: r != r.shift(1), axis=1)

Out[7]: 
   color0  color1  color2  color3
0    True    True    True    True
1    True    True    True   False
2    True   False    True    True
3    True    True    True    True
4    True    True    True   False
5    True    True    True    True
6    True    True   False   False
7    True    True   False   False
8    True    True   False    True
9    True    True   False    True

But I'm still not sure that is what you are looking for. If you could provide example input and ideal output, that would help clarify the problem.