apply converters on multiple columns

1,855 views
Skip to first unread message

massimo di stefano

unread,
Nov 30, 2013, 7:08:41 PM11/30/13
to pyd...@googlegroups.com
Hi,

i was wondering if/how is possible to apply a converter function on more than one column, just like "parse_dates" does.

i've a dataset of hundred thousands rows that looks like :

https://gist.github.com/anonymous/7714935


i'm read it using :


###
import pandas as pd
import datetime as dt

filename = 'https://gist.github.com/anonymous/7714935/raw/19e1a3810b564616b2990651e1e91d11c185b9a4/data.dat'

def parse(UTCDate,UTCTime):
return dt.datetime.strptime(UTCDate, '%m/%d/%Y') + dt.timedelta(seconds=float(UTCTime))

names=['Remote','Master','AntHgt','UTCDate','UTCTime','Week','Date','GPSTime','LatD','LatM','LatS','LonD','LonM','LonS','H-Ell','H-MS',
'Easting','Northing','X-LL','Y-LL','Z-LL','SDEast','SDNorth','SDHeight','StdDev','HgtDiff','H-Sep ','AmbDrift',' Q','PDOP',
'NS','L1Rms','SolType','VNorth','VEast','VUp','AccNrth','AccEast','AccUp','Easting','Northing','Grid-Z']

df = pd.read_csv(filename, delimiter=r'\s+', skiprows=25, index_col='date', parse_dates={'date':['UTCDate','UTCTime']}, names=names, date_parser=parse)

###

unfortunately this code it is splitting the "Longitude" and "Latitude" columns in 6 columns :

'LatD','LatM','LatS','LonD','LonM','LonS'

so i cannot use my converter
dms2deg1 :

###

def dms2deg1(latitude):
#
latitude = '54 35 56.46'
deg, min, sec =
latitude.split()
sgn = float(deg) / abs(float(deg))
return sgn * (abs(float(deg)) + (float(min) + (float(sec)/60) ) / 60 )

###

using the previouse read_csv command i should try something like :

###
def dms2deg(deg,min,sec):
sgn = float(deg) / abs(float(deg))
return sgn * (abs(float(deg)) + (float(min) + (float(sec)/60) ) / 60 )

###

but this takes multiple coumns as input ... and i do not know if/how can be done in pandas.


i was also trying to avoid to pass manually the names list to readcsv,
the file has a header on line 24, i was thinking to genarate the names list with :


###

import linecache
headerline = linecache.getline(r, 24) header = " ".join(header.split()).split(' ')
###

but this will not work with the previous code because "Latitude" and "Longitude" are no more columns ... instead i have :

'LatD','LatM','LatS','LonD','LonM','LonS'

i posted also on Stack Overflow :

http://stackoverflow.com/questions/20296430/read-csv-in-pandas-how-to-use-a-specific-row-as-header

thanks a lot for any help!

Massimo.

Scott Lasley

unread,
Dec 1, 2013, 7:13:26 PM12/1/13
to pyd...@googlegroups.com
Is something like this OK ?

import pandas as pd
import linecache
import datetime as dt

def parse(UTCDate,UTCTime):
    return dt.datetime.strptime(UTCDate, '%m/%d/%Y') + dt.timedelta(seconds=float(UTCTime))

filename = 'data.dat.txt'
names = linecache.getline(filename, 24)
names = names.replace('Latitude', 'LatD LatM LatS')
names = names.replace('Longitude', 'LonD LonM LonS')

df = pd.read_csv(filename, delimiter=r'\s+', skiprows=25, index_col='date', 
                 parse_dates={'date':['UTCDate','UTCTime']}, date_parser=parse,
                 names=names.split())
df['Lat'] = pd.np.sign(df['LatD']) * \
            (abs(df['LatD'].astype(pd.np.float64)) + \
                 df['LatM'].astype(pd.np.float64) + \
                 df['LatS'].astype(pd.np.float64) / 60.) / 60.
df['Lon'] = pd.np.sign(df['LonD']) * \
            (abs(df['LonD'].astype(pd.np.float64)) + \
                 df['LonM'].astype(pd.np.float64) + \
                 df['LonS'].astype(pd.np.float64) / 60.) / 60.
# for some reason df.drop(['LatD', 'LatM', 'LatS', 'LonD', 'LonM', 'LonS'], axis=1, inplace=True)
# throws an IndexError
# IndexError: index 36 is out of bounds for axis 0 with size 36
# with pandas 0.13.0rc1 from github, so delete the columns manually
del df['LatD']
del df['LatM']
del df['LatS']
del df['LonD']
del df['LonM']
del df['LonS']
Reply all
Reply to author
Forward
0 new messages