apply converters on multiple columns

1,860 views

Skip to first unread message

massimo di stefano

unread,

Nov 30, 2013, 7:08:41 PM11/30/13

to pyd...@googlegroups.com

Hi,

i was wondering if/how is possible to apply a converter function on more than one column, just like "parse_dates" does.

i've a dataset of hundred thousands rows that looks like :

https://gist.github.com/anonymous/7714935

i'm read it using :

###

import pandas as pd
import datetime as dt

filename = 'https://gist.github.com/anonymous/7714935/raw/19e1a3810b564616b2990651e1e91d11c185b9a4/data.dat'

def parse(UTCDate,UTCTime):
    return dt.datetime.strptime(UTCDate, '%m/%d/%Y') + dt.timedelta(seconds=float(UTCTime))

names=['Remote','Master','AntHgt','UTCDate','UTCTime','Week','Date','GPSTime','LatD','LatM','LatS','LonD','LonM','LonS','H-Ell','H-MS',
'Easting','Northing','X-LL','Y-LL','Z-LL','SDEast','SDNorth','SDHeight','StdDev','HgtDiff','H-Sep ','AmbDrift',' Q','PDOP',
'NS','L1Rms','SolType','VNorth','VEast','VUp','AccNrth','AccEast','AccUp','Easting','Northing','Grid-Z']

df = pd.read_csv(filename, delimiter=r'\s+', skiprows=25, index_col='date', parse_dates={'date':['UTCDate','UTCTime']}, names=names, date_parser=parse)

###

unfortunately this code it is splitting the "Longitude" and "Latitude" columns in 6 columns :

'LatD','LatM','LatS','LonD','LonM','LonS'

so i cannot use my converter dms2deg1 :

###

    def dms2deg1(latitude):
        # latitude = '54 35 56.46'  
        deg, min, sec = latitude.split()
        sgn = float(deg) / abs(float(deg))
        return sgn * (abs(float(deg)) + (float(min) + (float(sec)/60) ) / 60 )

###

using the previouse read_csv command i should try something like :

###
    def dms2deg(deg,min,sec):
        sgn = float(deg) / abs(float(deg))
        return sgn * (abs(float(deg)) + (float(min) + (float(sec)/60) ) / 60 )

###

but this takes multiple coumns as input ... and i do not know if/how can be done in pandas.


i was also trying to avoid to pass manually the names list to readcsv, 
the file has a header on line 24, i was thinking to genarate the names list with :


###

    import linecache
    headerline = linecache.getline(r, 24)
    header = " ".join(header.split()).split(' ')

###

but this will not work with the previous code because "Latitude" and "Longitude" are no more columns ... instead i have :

'LatD','LatM','LatS','LonD','LonM','LonS'

i posted also on Stack Overflow :

http://stackoverflow.com/questions/20296430/read-csv-in-pandas-how-to-use-a-specific-row-as-header

thanks a lot for any help!

Massimo.

Scott Lasley

unread,

Dec 1, 2013, 7:13:26 PM12/1/13

to pyd...@googlegroups.com

Is something like this OK ?

import pandas as pd

import linecache

import datetime as dt

def parse(UTCDate,UTCTime):

return dt.datetime.strptime(UTCDate, '%m/%d/%Y') + dt.timedelta(seconds=float(UTCTime))

filename = 'data.dat.txt'

names = linecache.getline(filename, 24)

names = names.replace('Latitude', 'LatD LatM LatS')

names = names.replace('Longitude', 'LonD LonM LonS')

df = pd.read_csv(filename, delimiter=r'\s+', skiprows=25, index_col='date',

parse_dates={'date':['UTCDate','UTCTime']}, date_parser=parse,

names=names.split())

df['Lat'] = pd.np.sign(df['LatD']) * \

(abs(df['LatD'].astype(pd.np.float64)) + \

df['LatM'].astype(pd.np.float64) + \

df['LatS'].astype(pd.np.float64) / 60.) / 60.

df['Lon'] = pd.np.sign(df['LonD']) * \

(abs(df['LonD'].astype(pd.np.float64)) + \

df['LonM'].astype(pd.np.float64) + \

df['LonS'].astype(pd.np.float64) / 60.) / 60.

# for some reason df.drop(['LatD', 'LatM', 'LatS', 'LonD', 'LonM', 'LonS'], axis=1, inplace=True)

# throws an IndexError

# IndexError: index 36 is out of bounds for axis 0 with size 36

# with pandas 0.13.0rc1 from github, so delete the columns manually

del df['LatD']

del df['LatM']

del df['LatS']

del df['LonD']

del df['LonM']

del df['LonS']

Reply all

Reply to author

Forward

0 new messages