Computed columns in DataFrame

Allen Windhorn

unread,

Feb 9, 2018, 3:43:31 PM2/9/18

to PyData

I have a large dataframe (600,000 lines x 44 columns) produced by a data acquisition device. All the columns share the same uniform fp numeric index and there should be no missing values. Now I want to add computed columns based on the existing ones, e.g.:

Cc1 = C1*C19+C2*C20+C3*C21
Cc2 = C4*C22+C5*C23+C6*C24
...
Cc6 = C16*C34+C17*C35+C18*C36
Ctot = Cc1+Cc2+Cc3+Cc4+Cc5+Cc6

plus some formulas like:

Cr1 = sqrt(C1^2+C2^2+C3*2)

where the Cnn are existing columns.

It seems that this should be elementary, but I can't find it in the literature I have, which is mostly about demographic data. Since it's a very large database, I can't use loops, and I would prefer to do it inline, as I don't really want to proliferate dataframes. What is the best (fastest) way to do this? Preferably all at once instead of column by column.

I tried the following just as a test:

df = df.assign(PPh1A = pdat['Gen Voltage Phase 1A']*pdat['Gen Current Phase 1A'])

with the result:

>>> pdat_t = pdat_t.assign(PPh1A = pdat['Gen Voltage Phase 1A']*pdat['Gen Current Phase 1A'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\windhoal\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\frame.py", line 2501, in assign
    data[k] = v
File "C:\Users\windhoal\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
    self._set_item(key, value)
File "C:\Users\windhoal\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\frame.py", line 2397, in _set_item
    value = self._sanitize_column(key, value)
File "C:\Users\windhoal\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\frame.py", line 2547, in _sanitize_column
    value = reindexer(value)
File "C:\Users\windhoal\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\frame.py", line 2539, in reindexer
    raise e
ValueError: cannot reindex from a duplicate axis

Thanks for any help you can give me.

Paul Hobson

unread,

Feb 9, 2018, 4:00:08 PM2/9/18

to pyd...@googlegroups.com

How does the "pdat" dataframe relate to the "df" dataframe?

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Allen Windhorn

unread,

Feb 9, 2018, 5:39:27 PM2/9/18

to PyData

Paul,

On Friday, February 9, 2018 at 3:00:08 PM UTC-6, Paul Hobson wrote:

How does the "pdat" dataframe relate to the "df" dataframe?

Sorry, they are effectively the same -- I was trying the same thing with a smaller test database, got the same message.

Regards,
Allen

Paul Hobson

unread,

Feb 9, 2018, 5:49:14 PM2/9/18

to pyd...@googlegroups.com

Hmmm. I think you might need to whittle down a reproducible example. Your example operations should work as a expected:

import pandas

df = pandas.DataFrame({
    'A': [-1.006, -0.973, -1.027, 0.922, 1.291, 0.699, -2.02 ],
    'B': [ 0.773, -1.545, 0.159, -0.541, -0.083, 0.48 , 0.138],
    'C': [ 1.369, -0.18 , 1.104, 0.197, 1.185, 0.313, -0.739]
})

df = df.assign(D=df['A'] + df['B'] * df['C'])
df

       A      B      C         D
0 -1.006  0.773  1.369  0.052237
1 -0.973 -1.545 -0.180 -0.694900
2 -1.027  0.159  1.104 -0.851464
3  0.922 -0.541  0.197  0.815423
4  1.291 -0.083  1.185  1.192645
5  0.699  0.480  0.313  0.849240
6 -2.020  0.138 -0.739 -2.121982

--

Pietro Battiston

unread,

Feb 9, 2018, 5:51:37 PM2/9/18

to pyd...@googlegroups.com

Il giorno ven, 09/02/2018 alle 08.02 -0800, Allen Windhorn ha scritto:
> I have a large dataframe (600,000 lines x 44 columns) produced by a
> data acquisition device. All the columns share the same uniform fp
> numeric index and there should be no missing values.

Is the index unique?

> [...]

> ValueError: cannot reindex from a duplicate axis
>

(This suggests it is not)

Pietro

Allen Windhorn

unread,

Feb 12, 2018, 10:01:42 AM2/12/18

to PyData

On Friday, February 9, 2018 at 4:49:14 PM UTC-6, Paul Hobson wrote:

Hmmm. I think you might need to whittle down a reproducible example. Your example operations should work as a expected:

import pandas

df = pandas.DataFrame({
    'A': [-1.006, -0.973, -1.027, 0.922, 1.291, 0.699, -2.02 ],
    'B': [ 0.773, -1.545, 0.159, -0.541, -0.083, 0.48 , 0.138],
    'C': [ 1.369, -0.18 , 1.104, 0.197, 1.185, 0.313, -0.739]
})

df = df.assign(D=df['A'] + df['B'] * df['C'])

Yes, this works, so there must be something else wrong. I'll try truncating the database until it works, maybe there will be a clue. Thanks.

Allen

Allen Windhorn

unread,

Feb 12, 2018, 10:02:55 AM2/12/18

to PyData

On Friday, February 9, 2018 at 4:51:37 PM UTC-6, toobaz wrote:

> ...Is the index unique?

Ought to be, but I will check. Thanks.

Regards,
Allen

Allen Windhorn

unread,

Feb 13, 2018, 2:35:29 PM2/13/18

to PyData

On Friday, February 9, 2018 at 4:49:14 PM UTC-6, Paul Hobson wrote:

Hmmm. I think you might need to whittle down a reproducible example. Your example operations should work as a expected:

import pandas

df = pandas.DataFrame({
    'A': [-1.006, -0.973, -1.027, 0.922, 1.291, 0.699, -2.02 ],
    'B': [ 0.773, -1.545, 0.159, -0.541, -0.083, 0.48 , 0.138],
    'C': [ 1.369, -0.18 , 1.104, 0.197, 1.185, 0.313, -0.739]
})

df = df.assign(D=df['A'] + df['B'] * df['C'])
df
       A      B      C         D
0 -1.006  0.773  1.369  0.052237
1 -0.973 -1.545 -0.180 -0.694900
2 -1.027  0.159  1.104 -0.851464
3  0.922 -0.541  0.197  0.815423
4  1.291 -0.083  1.185  1.192645
5  0.699  0.480  0.313  0.849240
6 -2.020  0.138 -0.739 -2.121982

The following works:

# Import data processing and graphing packages
import os,sys
from math import *
import numpy as np
import pandas as pd
# import scipy
import matplotlib.pyplot as plt
from matplotlib import style
style.use(['ggplot', 'fast'])
plt.ion()   # Plot interactively
#
# Create pandas DataFrame
tm = range(26)
tm = np.array(tm)*0.003
#
sf1 = [ 0.        , 0.24862903, 0.48164355, 0.68440969, 0.84419323,
        0.95095939, 0.99800299, 0.98236957, 0.90504096, 0.77087357,
        0.58829346, 0.36876708, 0.12608123, -0.12452282, -0.36730654,
        -0.58702251, -0.76987205, -0.90437174, -0.9820747 , -0.99810098,
        -0.95144409, -0.8450342 , -0.68555411, -0.48301955, -0.25015019,
        0.]
#
sf2 = [ 0.8660254 , 0.71451664, 0.51813449, 0.28921224, 0.04212676,
        -0.20760439, -0.44429747, -0.6530876 , -0.82086223, -0.93708473,
        -0.99445602, -0.98937306, -0.92215507, -0.7970235 , -0.62183691,
        -0.40759746, -0.16775989, 0.0826134 , 0.32779837, 0.55239681,
        0.74230338, 0.8855915 , 0.97326232, 0.9998099 , 0.96356699,
        0.8660254]
#
sf3 = [-0.8660254 , -0.96314567, -0.99977804, -0.97362193, -0.88631999,
       -0.743355 , -0.55370551, -0.32928197, -0.08417872, 0.16621115,
       0.40616257, 0.62060598, 0.79607383, 0.92154632, 0.98914345,
       0.99461997, 0.93763194, 0.82175834, 0.65427633, 0.44570417,
       0.2091407 , -0.04055731, -0.28770822, -0.51679035, -0.7134168,
       -0.8660254]
#
pdat = pd.DataFrame({'sf1': sf1, 'sf2': sf2, 'sf3': sf3}, index = tm)
# Add columns
pdat['sf1_2'] = np.power(pdat.loc[:,'sf1'],2)
pdat['sf2_2'] = np.power(pdat.loc[:,'sf2'],2)
pdat['sf3_2'] = np.power(pdat.loc[:,'sf3'],2)
pdat['sfssq'] = pdat.loc[:,'sf1_2']+pdat.loc[:,'sf2_2']+pdat.loc[:,'sf3_2']
meansq = np.mean(pdat.loc[:,'sfssq'])/3
rms = sqrt(meansq)
print 'Mean Square = ', meansq, ', RMS = ', rms
pdat.plot()
--------
I have 14 sets of three columns to do this to, plus product columns
(for power) plus filtering, plus some miscellaneous calculations, all on
data sets approaching a million rows. It would of course be more
elegant to do it in one go, but this will get the job done.

Thanks everyone for the help.

Regards,
Allen

Reply all

Reply to author

Forward