USGS NWIS download to Pandas DataFrame?

221 views
Skip to first unread message

Collin Bode

unread,
Aug 18, 2013, 5:10:17 PM8/18/13
to ul...@googlegroups.com
I'm new to ulmo.  Just started playing with it today.  I am interested in using it to pull NWIS data instead of my usual wget webscrape-and-parse.  However, I do not see a parameter "as_dataframe" like the other data source functions have. 

code:
import numpy as np

import pandas as pd

import ulmo

usgs_eldercreek = ulmo.usgs.nwis.get_site_data('11475560',service='instantaneous',start='2012-10-01',end='2013-08-01')
 
Am I missing something?   Is there a function to convert the data dictionary into a pandas.Dataframe? 
Otherwise a nice set of code. Thanks,
Collin Bode, Eel River CZO

Dharhas Pothina

unread,
Aug 19, 2013, 11:41:53 AM8/19/13
to ul...@googlegroups.com

So there is no simple/good way to encapsulate all the information returned by the nwis request as a simple pandas dataframe or at least we haven't yet thought of one. What we usually do is the following:

>>> import ulmo
>>> import pandas as pd
>>> usgs_eldercreek = ulmo.usgs.nwis.get_site_data('11475560',service='instantaneous',start='2012-10-01',end='2013-08-01') 

now usgs_eldercreek is a python dictionary. if you look at the keys

>>> usgs_eldercreek.keys()
['00065:00011', '00010:00011', '00060:00011']

say you are interested in the parameter code '00065:00011' then

>>> df = pd.DataFrame(usgs_eldercreek['00065:00011']['values']).set_index('datetime')

as an fyi there are are also two other interfaces to get nwis data in ulmo, one is ulmu.cuahsi.wof that hits CUAHSI's reflection of NWIS data through a wateroneflow web service. The other is ulmo.usgs.nwis.hdf5 that implements a local cache of the data in hdf5 files and is very useful if you are regularly downloading data since it remembers when you last downloaded and only gets new stuff (this can be overridden)

- dharhas



--
You received this message because you are subscribed to the Google Groups "ulmo" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ulmo+uns...@googlegroups.com.
To post to this group, send email to ul...@googlegroups.com.
Visit this group at http://groups.google.com/group/ulmo.
For more options, visit https://groups.google.com/groups/opt_out.

Andy Wilson

unread,
Aug 19, 2013, 12:19:50 PM8/19/13
to ul...@googlegroups.com

To add onto that, you can also convert to a real datetime index that allows pandas' timeseries slicing and resampling and such (it'll be an index of string objects if you don't):

df = pd.DataFrame(usgs_eldercreek['00060:00011']['values'])
df.index = pd.DatetimeIndex(df.datetime)



An as_dataframe argument for the nwis module has been requested a few times now so it seems like it's an important missing feature. We could just convert the 'values' key to a properly indexed dataframe so get_site_data wouldn't return dataframe per se, but you'd get a dict with an immediately useful dataframe attached to it on the 'values' key.


-andy


Dharhas Pothina

unread,
Aug 19, 2013, 12:21:30 PM8/19/13
to ul...@googlegroups.com

yeah I was thinking about suggesting that as the most obvious way to do it.

Dharhas Pothina

unread,
Aug 22, 2013, 9:46:17 AM8/22/13
to ul...@googlegroups.com

forwarding to group ...

---------- Forwarded message ----------
From: Dharhas Pothina <dha...@gmail.com>
Date: Thu, Aug 22, 2013 at 8:45 AM
Subject: Re: [ulmo] USGS NWIS download to Pandas DataFrame?
To: Collin Bode <kali...@gmail.com>



Aah timezones.... I would stay away from daylight savings as much as possible. If you look at the data you are retrieving (i.e. '2012-10-01T00:00:00-07:00') it is in UTC-7.00 so I would convert the USGS data to UTC-8.00 rather than US/Pacific. US/Pacific has daylight savings embedded in it. Of course the best practice is to always use UTC and only localize of output/display purposes but in practice converting to a fixed UTC offset is often good enough.

In your case it looks like Pandas autoconverts the date to UTC as a naive datetime object (ie with no timezone) --> Timestamp('2012-10-01 07:00:00', tz=None)

now pytz (the package pandas uses under the hood for timezones doesn't directly have UTC offsets but we can work around that with this custom function

import pytz

def get_timezone(timezone_string):
    """return timezone for a given string this is a thin wrapper
    around pytz.timezone() that also interprets 'UTC-n' as UTC minus n
    hours
    """
    if 'UTC-' in timezone_string:
        minutes = int(timezone_string[3:]) * 60
        return pytz.FixedOffset(minutes)
    else:
        return pytz.timezone(timezone_string)

now you can define UTC-8 as:

utc_8 = get_timezone('UTC-8')

and then your original date conversion will change to:

df.index = pd.DatetimeIndex(df.datetime)
df = df.tz_localize('UTC').tz_convert(utc_8)

hope that helps.

- d 


On Wed, Aug 21, 2013 at 5:38 PM, Collin Bode <kali...@gmail.com> wrote:

Okay, I have folded your suggestions into a function, since I need to do this for each parameter at a station.  Also, I realized the 'values' are objects, not float64, so I had to type them as well.  Otherwise I couldn't plot them or do other operations.  

def usgs2df(usgs_waterml,parameter):

df = pd.DataFrame(usgs_waterml[parameter]['values'])

df.index = pd.DatetimeIndex(df.datetime)

df = df.tz_localize('US/Pacific')

df[['value']] = df[['value']].astype(float)

return df


# Pull Elder Creek into WaterML format

usgs_elder = ulmo.usgs.nwis.get_site_data('11475560',service='iv',start='2012-10-01',end='2013-08-01')

# Using Dharhas/Andy response

eq = usgs2df(usgs_elder,'00060:00011') # discharge

eg = usgs2df(usgs_elder,'00065:00011') # gage height

et = usgs2df(usgs_elder,'00010:00011') # water temperature


This works unless I cross daylight savings time.  Then it will throw an error.  Looks like I am running into timezone ugliness.  The timestamp has the UTC-offset.  My own data is in PST without daylight savings, i.e. UTC-8:00.  To compare the two datasets I need to somehow define the timezone or offsets in the USGS data. Any thoughts on dealing with this?


Collin Bode

unread,
Aug 22, 2013, 3:10:01 PM8/22/13
to ul...@googlegroups.com
Thanks Dharhas that did help.  I thought pandas was making a naive datetime object from the text and stripping the timezone information away from it.  I did not realize it was performing a full convert automatically to UTC.  It really should set the timezone to 'UTC' when it does that.

For other USGS nwis users - the USGS data is in local time with daylight savings included.  If you download the text versions, you will see a timezone column.  Here is a snippet:
 USGS    11475560    2011-11-06 01:00    PDT            2.3    P    3.17    P
 USGS    11475560    2011-11-06 01:00    PST            2.1    P    3.16    P
 USGS    11475560    2011-11-06 01:15    PDT            2.1    P    3.16    P
 USGS    11475560    2011-11-06 01:15    PST            2.0    P    3.15    P
 USGS    11475560    2011-11-06 01:30    PDT            2.1    P    3.16    P
 USGS    11475560    2011-11-06 01:30    PST            2.0    P    3.15    P
 USGS    11475560    2011-11-06 01:45    PDT            2.1    P    3.16    P
 USGS    11475560    2011-11-06 01:45    PST            2.0    P    3.15    P
 Note: PDT and PST are interwoven in the text file, due to sorting on the datetime column.  Ugly.
 
The WaterML text format does the right thing and uses UTC-offset format Here is the same data:

 {'datetime': '2011-11-06T01:00:00-07:00', 'qualifiers': 'A', 'value': '3.17'},
 {'datetime': '2011-11-06T01:15:00-07:00', 'qualifiers': 'A', 'value': '3.16'},
 {'datetime': '2011-11-06T01:30:00-07:00', 'qualifiers': 'A', 'value': '3.16'},
 {'datetime': '2011-11-06T01:45:00-07:00', 'qualifiers': 'A', 'value': '3.16'},
 {'datetime': '2011-11-06T01:00:00-08:00', 'qualifiers': 'A', 'value': '3.16'},
 {'datetime': '2011-11-06T01:15:00-08:00', 'qualifiers': 'A', 'value': '3.15'},
 {'datetime': '2011-11-06T01:30:00-08:00', 'qualifiers': 'A', 'value': '3.15'},
 {'datetime': '2011-11-06T01:45:00-08:00', 'qualifiers': 'A', 'value': '3.15'},

So here's my current, functional code (requires pytz import).


def get_timezone(timezone_string):

if 'UTC-' in timezone_string:

minutes = int(timezone_string[3:]) * 60

return pytz.FixedOffset(minutes)

else:

return pytz.timezone(timezone_string)

def usgs2df(usgs_waterml,parameter):

# Using Dharhas/Andy response

df = pd.DataFrame(usgs_waterml[parameter]['values'])

df.index = pd.DatetimeIndex(df.datetime)

utc_8 = get_timezone('UTC-8')

df = df.tz_localize('utc').tz_convert(utc_8)

df[['value']] = df[['value']].astype(float)

return df


# Pull Elder Creek into WaterML format

usgs_eldercreek = ulmo.usgs.nwis.get_site_data('11475560',service='iv',start='2012-10-01',end='2013-08-01')

# Call function for each parameter

eq = usgs2df(usgs_elder,'00060:00011') # discharge (q) cubic feet per second

eg = usgs2df(usgs_elder,'00065:00011') # gage height (feet)

et = usgs2df(usgs_elder,'00010:00011') # water temperature (celsius)


Reply all
Reply to author
Forward
0 new messages