CFSR grib2 data access is slow


sc...@allvertum.com

Mar 14, 2013, 3:02:36 PM
to py...@googlegroups.com
This dataset is stored in monthly files at hourly temporal resolution:

http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/<YYYYMM>/wnd10m.gdas.<YYYYMM>.grb2.html

Described here:
http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2.html

There are 3 variables I am interested in:
U-component_of_wind [time = 745][height_above_ground = 1][lat = 576][lon = 1152]
V-component_of_wind [time = 745][height_above_ground = 1][lat = 576][lon = 1152]
time [time = 745]


As an example, I would like to extract an hourly time series at a single grid point (index lat=10, lon=10). I am coding this in Python with pydap as follows:

import numpy as np
from pydap.client import open_url

dap_u10m = open_url('http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2?U-component_of_wind[0:1:744][0][10][10]')
dap_v10m = open_url('http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2?V-component_of_wind[0:1:744][0][10][10]')
dap_time = open_url('http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2?time[0:1:744]')
#
# Why is this necessary?  I already sub-setted the data above.
u10m = dap_u10m['U-component_of_wind']
v10m = dap_v10m['V-component_of_wind']
time = dap_time['time']

rad2dgr  = 180. / np.pi   # radians -> degrees
wspd_out = np.zeros(745, np.dtype('float32'))
wdir_out = np.zeros(745, np.dtype('float32'))

for hr in xrange(745):
    u10ma = u10m.array[hr, 0, 0, 0]
    v10ma = v10m.array[hr, 0, 0, 0]
    wspd_out[hr] = np.sqrt(u10ma*u10ma + v10ma*v10ma)
    wdir_out[hr] = np.arctan2(u10ma, v10ma)*rad2dgr + 180
    del u10ma, v10ma

# cur_yr, cur_mm, cur_dd are set earlier in the full script (here 2009/12)
date_out = [str(cur_yr)+'/'+str(cur_mm).rjust(2, '0')+'/'+str(cur_dd).rjust(2, '0')
            +' '+str(int(hrs/60.)).rjust(2, '0')+':00' for hrs in time]
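For reference, the `arctan2(u, v)*rad2dgr + 180` expression gives the meteorological convention (the direction the wind blows *from*, degrees clockwise from north). A minimal offline check with made-up u/v values:

```python
import numpy as np

rad2dgr = 180.0 / np.pi

def wind_dir(u, v):
    # Meteorological direction: degrees the wind blows FROM,
    # measured clockwise from north.
    return np.arctan2(u, v) * rad2dgr + 180.0

print(wind_dir(0.0, -5.0))   # northerly wind -> 360 (== 0) degrees
print(wind_dir(5.0, 0.0))    # westerly wind  -> 270 degrees
```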
       

This method is rather slow and I was wondering if I am doing something wrong?  Any help is very much appreciated.

Thank you,

Scott 

Rich Signell

Mar 14, 2013, 4:07:42 PM
to py...@googlegroups.com
Scott,

It's really slow to read a time series at a single point from aggregations of most gridded model output, because the records are written one time step at a time as the model plods along, with time the slowest-varying dimension. So when you extract a time series at one location, you read one value, skip past a whole 2-D field of data, read one more value, and so on.
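A back-of-the-envelope sketch of that read pattern, assuming 32-bit values and the grid sizes from Scott's post:

```python
# Shape of one monthly CFSR variable: [time, height, lat, lon]
ntime, nlat, nlon = 745, 576, 1152
bytes_per_value = 4          # assuming 32-bit floats

# A single-point time series needs one value per time step...
values_wanted = ntime
# ...but each value lives in its own 576x1152 field, so the reader
# must seek past (or decode) an entire field between useful values.
values_per_field = nlat * nlon
skip_per_step = (values_per_field - 1) * bytes_per_value

print("useful values        :", values_wanted)
print("bytes skipped/step   :", skip_per_step)
print("useful fraction read :", values_wanted / float(ntime * values_per_field))
```

Roughly one value in 660,000 read is actually wanted, which is why single-point extraction is so punishing.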

If it weren't GRIB, this could potentially be fixed on the provider side by flipping the dimensions, or by rechunking the data along the time dimension if it were NetCDF4 or HDF5, but that is rarely done.
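GRIB records can't be rechunked in place, but the effect Rich describes can be seen with plain NumPy: reordering so that time varies fastest makes a single-point series contiguous in memory (toy grid sizes, purely illustrative):

```python
import numpy as np

# Toy stand-in for a [time, lat, lon] variable, time-major as on the server
a = np.zeros((745, 4, 8), dtype='f4')

# A point time series strides over whole fields: 4*8*4 = 128 bytes apart
series = a[:, 0, 0]
print(series.strides)        # (128,)

# Reordered so time varies fastest, the same series is contiguous
b = np.ascontiguousarray(a.transpose(1, 2, 0))   # [lat, lon, time]
print(b[0, 0, :].strides)    # (4,)
```

On disk the same idea is what NetCDF4/HDF5 chunking along the time dimension buys you: each point's series becomes a few large sequential reads instead of thousands of tiny strided ones.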

Just for comparison, I tried this with netCDF4-python, and yes, it takes a really long time: 538 seconds (nearly 9 minutes) to extract 745 values! (I extracted a different point, in case the server was caching, but the same number of time steps.)

In [1]: import netCDF4
import time
url='http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2'
nc=netCDF4.Dataset(url)
tic=time.time()
u=nc.variables['U-component_of_wind'][0:745,0,40,40]
toc=time.time()
print ('elapsed time= %f seconds' % (toc-tic))

Out [1]: elapsed time= 538.030953 seconds

Just to confirm that it's a server side problem, I also tried
accessing the data using the NCO "ncks" program, and it took a long
time also:

rsignell@gam:/usgs/data2/rsignell/data/bathy$ time ncks -O -d lat,12
-d lon,20 -d height_above_ground,0 -v U-component_of_wind
'http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2'
test.nc

real 7m7.268s
user 0m0.012s
sys 0m0.020s


-Rich
> --
> You received this message because you are subscribed to the Google Groups
> "pydap" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pydap+un...@googlegroups.com.
> To post to this group, send email to py...@googlegroups.com.
> Visit this group at http://groups.google.com/group/pydap?hl=en.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
Dr. Richard P. Signell (508) 457-2229
USGS, 384 Woods Hole Rd.
Woods Hole, MA 02543-1598

Roberto De Almeida

Mar 14, 2013, 4:35:29 PM
to py...@googlegroups.com
Rich, thanks for the analysis, you beat me to it. :)

Scott, there's an interesting article here about chunking as a solution to the problem you're facing:


Though like Rich said, there's not much you can do, it's a server problem.

I also simplified your code: you only need to call open_url once, and there's no need for the loop (as a rule of thumb, avoid loops in Python whenever possible; they're slow):


--Rob
Roberto De Almeida, PhD

sc...@allvertum.com

Mar 14, 2013, 7:28:00 PM
to py...@googlegroups.com
Thank you, Rich and Roberto, for your suggestions.

If I run it using this code:

dap_file = open_url('http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2')
u10m  = dap_file['U-component_of_wind'][:, 0, 10, 10]
v10m  = dap_file['V-component_of_wind'][:, 0, 10, 10]
timed = dap_file['time'][:]
wspd_out = np.sqrt(u10m**2 + v10m**2)
wdir_out = np.arctan2(u10m, v10m) * 180. / np.pi + 180

It never finishes and times out at record 505. This is why I have to loop through the data. Here is my crude way of doing it, which still takes a while:
dap_file = open_url('http://nomads.ncdc.noaa.gov/thredds/dodsC/cfsr1hr/200912/wnd10m.gdas.200912.grb2')
u10m  = dap_file['U-component_of_wind']
v10m  = dap_file['V-component_of_wind']
timed = dap_file['time']
nhrs  = timed.shape[0]
rad2dgr = 180. / np.pi

step = 24                             # one day's worth of records per request
for idx1 in range(0, nhrs, step):
    idx     = min(idx1 + step, nhrs)
    dayofmo = idx1 // 24 + 1
    u10ma    = u10m.array[idx1:idx, 0, 10, 10]
    v10ma    = v10m.array[idx1:idx, 0, 10, 10]
    wspd_out = np.sqrt(u10ma**2 + v10ma**2)
    wdir_out = np.arctan2(u10ma, v10ma)*rad2dgr + 180
    date_out = ['2009/12/'+str(dayofmo)+' '+str(int(hrs/60.)).rjust(2, '0')+':00'
                for hrs in timed[idx1:idx]]
    del u10ma, v10ma, wspd_out, wdir_out
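The daily-slab pattern above can be factored into a small helper. Here `get_slab` is a placeholder for any callable that fetches a half-open index range (for example `lambda i, j: u10m.array[i:j, 0, 10, 10]`); the sketch is exercised below with an in-memory array instead of a live server:

```python
import numpy as np

def fetch_in_chunks(get_slab, n, step=24):
    """Fetch [i, min(i+step, n)) slabs and concatenate them.

    Keeping each request small avoids the server-side timeouts seen
    when one request asks for the whole month at once."""
    parts = [get_slab(i, min(i + step, n)) for i in range(0, n, step)]
    return np.concatenate(parts)

# Offline stand-in for an OPeNDAP variable
data = np.arange(745, dtype='f4')
out = fetch_in_chunks(lambda i, j: data[i:j], data.size)
print(out.size)              # 745, with no overlapping or missing records
```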

Roy Mendelssohn - NOAA Federal

Mar 14, 2013, 10:57:38 PM
to py...@googlegroups.com, Richard Signell
Hi Scott:

Could you do me a favor? Look at the THREDDS catalog for:


or that has base URL at:


and extract a time series from that and let us know the timings.  Could you also do the same for:


and let us know the timings.


Thanks a lot,

-Roy M.


**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
1352 Lighthouse Avenue
Pacific Grove, CA 93950-2097

e-mail: Roy.Men...@noaa.gov (Note new e-mail address)
voice: (831)-648-9029
fax: (831)-648-8440
www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected" 
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.

sc...@allvertum.com

Mar 15, 2013, 2:20:30 PM
to py...@googlegroups.com
Roy,

Tried the following per your request:

dataset = open_url("http://oceanwatch.pfeg.noaa.gov/thredds/dodsC/FNMOC/navgem/halfdegree/pressure/6hr/best")                          
pres   = dataset['pres_msl'][:,45,45]

For us to download the entire timeseries at 1 point:  13.6 seconds

dataset = open_url("http://oceanwatch.pfeg.noaa.gov/thredds/dodsC/satellite/BA/ssta/5day")
bassta  = dataset['BAssta'][:,0,35,35]

This does not work:

Traceback (most recent call last):
  File "navgem_opendap_test.py", line 9, in <module>
    bassta    = dataset['BAssta'][:,0,35,35]
  File "/usr/local/lib/python2.7/dist-packages/pydap/model.py", line 662, in __getitem__
    var.data = var.data[slice_]
  File "/usr/local/lib/python2.7/dist-packages/pydap/proxy.py", line 118, in __getitem__
    data = data2 = DapUnpacker(xdrdata, dataset).getvalue()
  File "/usr/local/lib/python2.7/dist-packages/pydap/xdr.py", line 119, in getvalue
    out.append(self.getvalue())
  File "/usr/local/lib/python2.7/dist-packages/pydap/xdr.py", line 143, in getvalue
    out = numpy.fromstring(self._buf[i:j], dtype=dtype)
ValueError: string size must be a multiple of element size

Scott




Roy Mendelssohn - NOAA Federal

Mar 15, 2013, 2:25:15 PM
to py...@googlegroups.com
Thanks. I may have given the wrong base URL for the NetCDF file (I was curious how aggregated NetCDF compares with aggregated GRIB). Rich also looked at these; see:


The main point is that something is going on at NCDC; the extract shouldn't be that slow. We know the folks at NCDC, and this info has been relayed. Hopefully the situation can be rectified so that you don't see the problems you were seeing, and you don't need to loop over a request where one shouldn't be needed.

Thanks,

-Roy



sc...@allvertum.com

Mar 15, 2013, 2:50:43 PM
to py...@googlegroups.com
Many thanks for reporting this!  Hopefully, it will be resolved. 


Scott


sc...@allvertum.com

Mar 15, 2013, 2:52:10 PM
to py...@googlegroups.com
Thank you for reporting this.  Hopefully, it will be resolved.

Scott