Subset with maps


fredri...@gmail.com

Apr 15, 2013, 5:16:03 PM
to py...@googlegroups.com
Hi,

I'm working with a data set that consists of length, degrees and data. The length is approx. 100.000.000 points, degrees ~180 points, and the data is a value in each point (matrix (100.000.000 x 180) ~ 18 billion points).

I need to get the data with respect to the length, eg. data[data.length > 100 & data.length < 200]

The problem I'm facing is that pydap seems to download the complete length vector, pick out the indices between 100 and 200, and then return the correct data.
This is time-consuming and a waste of memory for such a huge dataset. 

Is there a way of getting the data without downloading the whole length vector?

I'm running an OPeNDAP Hyrax server (it can be replaced with something else), and am planning to create a Python user application to visualize the data.

I have tried both netCDF and HDF5 (do any of you recommend either of these formats, or something else, for my data?)

I've also looked into pytables, but find the lack of client/server capabilities troublesome (I see that pydap once supported this)

Conceptual example:
length=1,2,3,4,5
data = 10,20,30,40,50

What happens:
data[length>2&length<5]
downloads length:
length=1,2,3,4,5
Finds indices
length>2&length<5 = 3,4 = index 2,3
Downloads data
data[2:4] = 30,40

What I want:
data[length>2&length<5]
data = 30,40
I.e. the server does all the work.

Thanks!

Best regards
Fredrik

Chris Barker - NOAA Federal

Apr 15, 2013, 5:31:07 PM
to py...@googlegroups.com
On Mon, Apr 15, 2013 at 2:16 PM, <fredri...@gmail.com> wrote:

> I'm working with a data set that consists of length, degrees and data. The
> length is approx. 100.000.000 points, degrees ~180 points, and the data is a
> value in each point (matrix (100.000.000 x 180) ~ 18 billion points).
>
> I need to get the data with respect to the length, eg. data[data.length >
> 100 & data.length < 200]

is length sorted? i.e do you need data[100:200] -- if so, you need to
ask for it that way.

if not, and you need the items in data where the length value is
between 100 and 200, but that could be any arbitrary indexes, then you
are doing what in numpy terms is called "fancy indexing", and OpenDAP
does not support that. You'll either need to download the entire
length vector and then subset, or do a loop and download each item
with a single call -- which will be very slow, too.

There has been some talk of extending OpenDAP to support requesting
an arbitrary sequence of indexes, but I don't know if that's getting
anywhere. You could probably do it with a server-side function, but
you'd have to extend your server to support it.

> Conceptual example:
> length=1,2,3,4,5
> data = 10,20,30,40,50
>
> What happens:
> data[length>2&length<5]

in this case, you want:

data[2:4]

instead, but that only works if length is indeed sorted, as in this example.
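
A minimal sketch of how a value range becomes a plain slice when length is sorted, using the conceptual example from the question. The lists stand in for the remote variables, and slice_for_range is a hypothetical helper name (not part of pydap) built on the standard-library bisect module:

```python
from bisect import bisect_left, bisect_right

def slice_for_range(sorted_values, low, high):
    """Return (start, stop) such that sorted_values[start:stop]
    holds exactly the values v with low < v < high.
    Only valid when sorted_values is in ascending order."""
    start = bisect_right(sorted_values, low)   # first index with value > low
    stop = bisect_left(sorted_values, high)    # first index with value >= high
    return start, stop

length = [1, 2, 3, 4, 5]
data = [10, 20, 30, 40, 50]

start, stop = slice_for_range(length, 2, 5)
subset = data[start:stop]
print(start, stop, subset)  # 2 4 [30, 40]
```

If length is not sorted, the matching indexes are not contiguous and no single slice can express them, which is the "fancy indexing" case described above.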

> What I want:
> data[length>2&length<5]
> data = 30,40
> I.e. the server does all the work.

if you don't know the structure of length enough to compute those
indices, you can't do that.

I would look at server-side functions, though -- Hyrax has support, I
think, for things like latitude>some_value; you may be able to apply
that here.

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Roy Mendelssohn - NOAA Federal

Apr 15, 2013, 5:38:50 PM
to py...@googlegroups.com
Actually, OPeNDAP as a protocol can make exactly the kind of request Fredrik is seeking, but not through the grid interface. OPeNDAP sequences, which unfortunately are not supported by most OPeNDAP servers, support just such constraint expressions. Our ERDDAP server, while not needing OPeNDAP, is also an OPeNDAP server and supports sequences and constraint expressions. To see an example of a large in situ data set served by ERDDAP, look at:

http://upwell.pfeg.noaa.gov/erddap/search/index.html?page=1&itemsPerPage=1000&searchFor=GTSPP
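
For illustration, a DAP2 constraint expression on a Sequence combines a projection (which variables to return) with selections (relational filters that the server evaluates). The sketch below only builds the query string. The base URL and the variable names are hypothetical, and whether a given server accepts such a request depends on its Sequence support:

```python
from urllib.parse import quote

def constraint_expression(projection, selections):
    """Build a DAP2-style constraint expression:
    comma-separated projection, then one '&'-separated clause per selection."""
    parts = [",".join(projection)]
    parts.extend(selections)
    return "&".join(parts)

def encode_for_url(ce):
    """Percent-encode characters such as '<' and '>' but keep the
    separators that structure the expression."""
    return quote(ce, safe=",&=.")

# Hypothetical dataset URL and variable names, for illustration only.
base_url = "http://example.org/opendap/track"
ce = constraint_expression(
    ["position", "data"],
    ["position>100", "position<200"],
)
request_url = base_url + ".dods?" + encode_for_url(ce)
print(ce)
print(request_url)
```

With a request like this the filtering happens on the server, so only the matching rows travel over the network.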

-Roy
> --
> You received this message because you are subscribed to the Google Groups "pydap" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to pydap+un...@googlegroups.com.
> To post to this group, send email to py...@googlegroups.com.
> Visit this group at http://groups.google.com/group/pydap?hl=en.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
1352 Lighthouse Avenue
Pacific Grove, CA 93950-2097

e-mail: Roy.Men...@noaa.gov (Note new e-mail address)
voice: (831)-648-9029
fax: (831)-648-8440
www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.

Roberto De Almeida

Apr 15, 2013, 7:58:23 PM
to py...@googlegroups.com
Hi, Fredrik.

On Mon, Apr 15, 2013 at 2:16 PM, <fredri...@gmail.com> wrote:
> I'm working with a data set that consists of length, degrees and data. The length is approx. 100.000.000 points, degrees ~180 points, and the data is a value in each point (matrix (100.000.000 x 180) ~ 18 billion points).
>
> I need to get the data with respect to the length, eg. data[data.length > 100 & data.length < 200]
>
> The problem I'm facing, is that it seems that pydap downloads the complete length vector, picks out the indexes between 100 and 200, and return the correct data.
> This is time-consuming and a waste of memory for such a huge dataset.
>
> Is there a way of getting the data without downloading the whole length vector?
 
Can you post your script and give more details about your data? If the server is not public, can you send me the DDS response of the dataset? What do you mean by data[data.length > 100 & data.length < 200]? If your data is a 2D matrix (length, degrees) and you are referring to the indexes, you can do as Chris suggested:

  from pydap.client import open_url
  dataset = open_url(url)
  data = dataset.array[100:200,:]

This will download only 100 × ~180 points of data.
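
A quick back-of-the-envelope check of the sizes involved, assuming 8-byte double-precision values (an assumption; the stored type has not been stated at this point in the thread):

```python
BYTES_PER_VALUE = 8  # assumed: 64-bit floats

n_length = 100_000_000
n_degrees = 180

slice_bytes = 100 * n_degrees * BYTES_PER_VALUE        # data[100:200, :]
length_bytes = n_length * BYTES_PER_VALUE              # the full length vector
full_bytes = n_length * n_degrees * BYTES_PER_VALUE    # the whole 2D array

print(f"slice:  {slice_bytes / 1e3:.0f} kB")     # slice:  144 kB
print(f"length: {length_bytes / 2**20:.0f} MiB")  # length: 763 MiB
print(f"full:   {full_bytes / 1e9:.0f} GB")       # full:   144 GB
```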

If, on the other hand, you want to filter on values of length (not indexes), one solution is to sort the data by length, as Chris suggested, and create a support vector with the values. You can then request this vector, extract the proper indexes and request the 2D data. Or, as Roy suggested, you can put your data in tabular form and use constraint expressions for Sequences.

--Rob



 
 
 



--
Roberto De Almeida, PhD

fredri...@gmail.com

Apr 16, 2013, 3:49:27 AM
to py...@googlegroups.com, fredri...@gmail.com
Hi,
and thank you for quick answers!

I see that my example was not very good, and that there were misunderstandings.

Rewritten example:
I'm working with a data set that consists of position, degrees and data. The position is sorted, but does not have a uniform step size. The position has approx. 100.000.000 points. The degrees have ~180 points, sorted but also without a uniform step size, and the data is a value in each point (matrix (100.000.000 x 180) ~ 18 billion points).
I need to get the data for all degrees in between selected positions.

Example position vector:
[0, 1.23, 1.5, 1.94, 2.94 ....]
Example degrees vector:
[-179.3, -178.3, .... 179.9]

Sample netcdf file with netcdf4-python:
import netCDF4
import numpy

f = netCDF4.Dataset('test.nc', 'w', format='NETCDF4')

# 'position' is unlimited so rows can be appended; 'degrees' is fixed at 180.
f.createDimension('position', None)
f.createDimension('degrees', 180)
f.createVariable('position', 'f8', ('position',))
f.createVariable('degrees', 'f8', ('degrees',))
f.createVariable('data', 'f8', ('position', 'degrees'), chunksizes=(100, 180))

lengthSampleSize = 1000  # up to 100000000
f.variables['position'].units = 'm'
f.variables['degrees'].units = 'deg'
f.variables['data'].units = 'm'

# Non-uniformly spaced coordinates: a regular grid plus small random noise.
for i in range(180):
    f.variables['degrees'][i] = -179 + 2*i + numpy.random.randn()/10

for i in range(lengthSampleSize):
    f.variables['position'][i] = 10 + i + numpy.random.randn()/10
    f.variables['data'][i, :] = numpy.ones(180)*numpy.random.randn()
    if i % 10000 == 0:
        f.sync()  # flush to disk periodically
f.close()

Simple script for reading the data:
from pydap.client import open_url

# url is a placeholder for the dataset address on the server (not shown here)
dataset = open_url(url)

dData = dataset['data']
dPosition = dataset['position']
dDegrees = dataset['degrees']

position = dPosition[(dPosition > 100) & (dPosition < 200)]
degrees = dDegrees[:]
data = dData[(dData.position > 100) & (dData.position < 200)]

The plan is to sit on a local network with a 1 Gbit switch. When I test with a smaller dataset, the response is really fast. However, when the position vector has 100 000 000 double-precision values, it is ~763 MB large. I could hold this vector in the user application, but that is not preferable. It also takes time to search through this vector and send the results back to the server.

Speed is a big concern here. How the data is structured, which server is used, the type of database and so on are all open. All I need is a way of hosting this data as fast as possible. There will be no writes, only really simple queries like the ones stated above, with the possibility of several users (up to ten) at a time.

The user application works like this: the user browses back and forth through the data with "next" and "previous" buttons. This should be really fast, and I plan to implement caching to hold a subset of the data in the region the user is looking at. The user can also jump to a position, leaving the current "region". This can take some time, as the application needs to update the cache to the new subset.

There are several different types of data (not mentioned in the example) for each position and degree. I also need to create several interpolated data sets, so that the user can look at different "zoom" levels. The reason I mention this is that there might be a different solution to this problem, using something other than OPeNDAP.
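
One way to avoid holding the full position vector in the client could be a two-level lookup: download a strided subsample of position once (DAP2 hyperslabs allow a stride), then refine each boundary with one small contiguous request. This is only a sketch under assumptions. fetch is a hypothetical stand-in for a remote request, find_slice is an invented helper name, and a local list plays the role of the server:

```python
from bisect import bisect_left, bisect_right

def fetch(remote, start, stop, step=1):
    """Stand-in for a remote hyperslab request such as position[start:stop:step].
    Here 'remote' is just a local list; a real client would issue one request."""
    return remote[start:stop:step]

def find_slice(remote_position, n, low, high, stride):
    """Locate (start, stop) with low < position[i] < high for start <= i < stop,
    downloading only a strided coarse index plus two small refinement windows.
    Assumes remote_position is sorted ascending and has n elements."""
    coarse = fetch(remote_position, 0, n, stride)    # about n / stride values

    def refine(value, bisect):
        # The exact index lies within one stride of the coarse hit.
        block = max(bisect(coarse, value) - 1, 0)
        lo = block * stride
        hi = min(lo + stride, n)
        window = fetch(remote_position, lo, hi)      # at most 'stride' values
        return lo + bisect(window, value)

    return refine(low, bisect_right), refine(high, bisect_left)

# Local stand-in for the remote, sorted, non-uniform position vector.
position = [10 + i + ((i * 37) % 7) / 10 for i in range(1000)]

start, stop = find_slice(position, len(position), 100, 200, stride=50)
selected = position[start:stop]   # in practice: request data[start:stop, :]
```

With 100 000 000 positions and a stride of 10 000, the coarse index would be 10 000 values (about 80 kB at 8 bytes each) and each refinement window at most 10 000 values. The coarse index could be fetched once and kept in the application's cache.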

I could also use the url with constraint expressions as stated on the opendap web page, http://docs.opendap.org/index.php/UserGuideOPeNDAPMessages, like this:
but this does not seem to be supported.


"if the server is not public can you send me the DDS response of the dataset?" - How do I do this?

I will look further into your answers, Chris and Roy.

Thank you for your time!

Best regards,
Fredrik

