Data Access URL


Todd King

Jun 29, 2010, 5:41:11 PM
to hdmc-dataaccess
Hi all -

One aspect of data access is how to express a URL in a way that allows
the expression of simple queries for data. There are two basic needs
in retrieving data: the first is to specify a time range of interest, and
the second is to request a format for the data. The SPASE Registry and
the DataShop File Finder can return a list of URLs for Granules/Files
that match a time constraint. With a list of URLs, a service or
application can retrieve each file and then assimilate the data. This
is not always convenient or readily supported by applications (for
example, IDL or MATLAB). A better alternative is to have the
retrieval, assimilation and format conversion (if necessary)
performed by a service. The SPASE Downloader does assemble the
selected granules into Zip files. While this provides a rudimentary
delivery system, it's insufficient for application integration. An
approach such as the DataShop Data Grabber provides uniform data
streams. The essential requirements for data access, which should allow easy
embedding and command-line use, can be summarized as:

1) REST based API

2) Parameterized filters (start/stop times)

3) Selectable Format


A recommended approach would be similar to the one used by the Twitter search
API (http://dev.twitter.com/doc/get/search) or the HELM approach
(http://helm.gsfc.nasa.gov/development/HelmWebServices.html), where
filters are parameters and the desired format is expressed as a file
name extension.

For example:

http://localhost/get/data.cdf?start=2001&end=2006

which will return "data" in CDF format, filtered to include only data
between 2001 and 2006.

One benefit of this approach is that different formats can be cached
(or pre-generated) and delivered with the HTTP protocol. For example,
a URL like:

http://localhost/get/data.tab

would return an ASCII table version of "data". This file could actually
exist on disk or be generated on demand by the "get" service.
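
As a rough illustration of how such a "get" service might dispatch on the
requested extension (a sketch only; the cache location, converters and helper
names are all hypothetical, not part of any existing system):

# Sketch of a "get" service dispatching on the requested file extension.
# The converters are trivial stand-ins for real CDF/ASCII writers.
import os

def to_ascii_table(records):
    return "\n".join("%s\t%s" % (t, v) for t, v in records).encode()

def to_cdf(records):
    raise NotImplementedError("a real service would call a CDF writer here")

CONVERTERS = {"tab": to_ascii_table, "cdf": to_cdf}
CACHE_DIR = "/tmp/get-cache"   # hypothetical location for pre-generated copies

def handle_get(resource, extension, records, start=None, end=None):
    """Serve a cached rendition if one exists, else generate it on demand."""
    cached = os.path.join(CACHE_DIR, "%s_%s_%s.%s" % (resource, start, end, extension))
    if os.path.exists(cached):
        return open(cached, "rb").read()
    subset = [(t, v) for t, v in records
              if (start is None or t >= start) and (end is None or t <= end)]
    return CONVERTERS[extension](subset)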

Comments? Thoughts?

Joe Hourcle

Jun 30, 2010, 5:43:03 AM
to hdmc-da...@googlegroups.com

On Jun 29, 2010, at 5:41 PM, Todd King wrote:

> Hi all -
>
> One aspect of data access is how to express a URL in a way that allows
> the expression of simple queries for data. There are two basic needs
> in retrieving data: the first is to specify a time range of interest, and
> the second is to request a format for the data. The SPASE Registry and
> the DataShop File Finder can return a list of URLs for Granules/Files
> that match a time constraint. With a list of URLs, a service or
> application can retrieve each file and then assimilate the data. This
> is not always convenient or readily supported by applications (for
> example, IDL or MATLAB). A better alternative is to have the
> retrieval, assimilation and format conversion (if necessary)
> performed by a service. The SPASE Downloader does assemble the
> selected granules into Zip files. While this provides a rudimentary
> delivery system, it's insufficient for application integration. An
> approach such as the DataShop Data Grabber provides uniform data
> streams. The essential requirements for data access, which should allow easy
> embedding and command-line use, can be summarized as:
>
> 1) REST based API
>
> 2) Parameterized filters (start/stop times)
>
> 3) Selectable Format

[trimmed]

> For example:
>
> http://localhost/get/data.cdf?start=2001&end=2006
>
> which will return "data" in CDF format, filtered to include only data
> between 2001 and 2006.

[trimmed]

> Comments? Thoughts?


First, I'll start by saying that there's no possible way that I can support
this, because the data volumes that result are just too large to be useful.

I explained it to Tom Narock when he was working on SPASE-QL --
I *do*not* want to go straight from a query to data because I want people
to be able to get metadata about the granules and determine if they
really want to download the data or not.

I won't even get into SDO data, as you give an example of 2001 to 2006,
so let's use EIT -- it was observing at the time, and it's a 15-year-old
instrument, so it doesn't really gather that much in comparison to modern
telescopes.

(I'm assuming that 'end=2006' means to include 2006)


mysql> select count(*), sum(file_size) from filemgmt join observation on id_filemgmt=id_obs where observation.id_instrume=3 and date_obs between '2001-01-01' and '2007-01-01';
+----------+----------------+
| count(*) | sum(file_size) |
+----------+----------------+
|   222413 |   402519099584 |
+----------+----------------+
1 row in set (2.35 sec)

That size is in bytes -- so over 200k images in 374.5GB

How about STEREO/SECCHI? They launched in 2006, so they could
possibly respond to the query, too. We'll be fair, and just look at
one telescope:

mysql> select count(*), sum(filesize) from vso_view where detector='EUVI' and source='STEREO_A' and date_obs between '2001-01-01' and '2007-01-01';
+----------+------------------+
| count(*) | sum(filesize)    |
+----------+------------------+
|     4852 | 41078888.4057617 |
+----------+------------------+
1 row in set (1.14 sec)

Ah ... much smaller ... under 5k images. Of course, the filesize is
in kB, so it's still 39GB (and that's less than 60 days' worth of
images). 2007 alone was 378187 images in 2.87TB.

Hinode was also launched in 2006 -- SOT took over 1.1TB in the last
75 days before the end of the year.

...

All of that being said, I *do* have mechanisms for requesting lots of
data at once, and serving tarballs on demand -- but it's a two step
process -- first the query, and then if you like the results, you can
order them.

I can give a URL to automatically search the data:

http://sdac.virtualsolar.org/cgi/vsoui?timerange=20010101-20070101;instrument=eit

(but it'll hit a limit of 1000 records ... the idea originally being that it
was too excessive of a search ... if people really wanted the data, they
could break it down into multiple searches for about a day's worth of data
at a time)

STEREO searches are limited to 4000 records:

http://sdac.virtualsolar.org/cgi/vsoui?timerange=20010101-20070101;detector=stereo_a.secchi.euvi

Hinode does a trick where, if it hits its limit, it switches to an
alternate mode where the records returned correspond to an hour's worth
of records of a given filter and observing mode:

http://sdac.virtualsolar.org/cgi/vsoui?timerange=20010101-20070101;instrument=sot

(so the 5321 records returned correspond to the 488118 images)

To support the multiple-record returns, I'll return an hour's worth of
SOT data in a given filter/processing as a tarball with a URL:

http://sdac.virtualsolar.org/cgi-bin/gethinode?pptid=SP4D:6302A;date=2006-10-20T22

The 'gethinode' CGI generates tarballs on the fly; they're never written
out to disk, so I don't have to deal with cleaning up afterwards. The
only problem is, if your client doesn't support the Content-Disposition
header, the files might not get saved with useful names.
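
The mechanism is nothing fancy; a minimal Python sketch of the same idea
(not the actual 'gethinode' CGI -- the file paths and archive name here are
made up) would be:

# Stream a tarball to the client without writing it to disk, and suggest a
# filename via Content-Disposition.  Paths below are hypothetical.
import sys
import tarfile

def stream_tarball(paths, archive_name="sot_2006-10-20T22.tar"):
    out = sys.stdout.buffer
    out.write(b"Content-Type: application/x-tar\r\n")
    out.write(('Content-Disposition: attachment; filename="%s"\r\n\r\n'
               % archive_name).encode())
    # "w|" is tarfile's streaming mode: no seeking, no temp file on disk
    with tarfile.open(fileobj=out, mode="w|") as tar:
        for path in paths:
            tar.add(path)

if __name__ == "__main__":
    stream_tarball(["/data/sot/example1.fits", "/data/sot/example2.fits"])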

It's even worse for SDO/AIA and HMI data, where the keys to the data are
a sequence, so someone who's really foolish *could* try to just give a
really large window ... but as it is, some of the returns are already
3GB if you don't specify that you support Rice compression (which most
tools don't):

http://vso.tuc.noao.edu/cgi-bin/drms_test/drms_export.cgi?series=aia_lev1;record=171_1055234411-1055234999

That'll get a 10 minute set of AIA 171 Angstrom images ... it's 50
images, 64MB each, so more than 3GB ... or the same data, but as FITS
files with Rice compressed data, so only about 10MB per image, 500MB
total:

http://vso.tuc.noao.edu/cgi-bin/drms_test/drms_export.cgi?series=aia_lev1;record=171_1055234411-1055234999;compress=rice

...


And the comment about doing conversion on the fly -- we're having a hard
enough time supporting FITS for SDO/AIA and HMI, because we're tied into
their data system (which the scientists made ... and which just isn't
intended for bulk exporting of data). The data's stored in FITS, but
without any scientific metadata, so we have to recombine the data w/ the
headers, and possibly uncompress the data, too.

I'd like to see tools support the HTTP 503 response code w/ a Retry-After
header, so I can give a message that's effectively 'ask me again in 10
minutes'.
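
The client side of that is only a few lines; a rough Python sketch of the
behavior I'd like to see (the URL is a placeholder, and this assumes
Retry-After is given in seconds rather than as an HTTP date):

# Honor an HTTP 503 + Retry-After instead of treating it as a hard failure.
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(url).read()
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise
            wait = int(err.headers.get("Retry-After", "600"))  # "ask me again in 10 minutes"
            time.sleep(wait)
    raise RuntimeError("service still busy after %d attempts" % max_attempts)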

I was looking into using the 'Sparse Bag' option of the BagIt protocol
[1], but there just isn't enough support for the format yet. (It'd allow
you to send a tarball with what you had, plus a list of additional URLs
to download.)

The other problem that we're seeing with the generation of large tarballs
is that, because we're not writing them out to disk first and we don't
know for sure that the file's going to be bytewise the same if it's
regenerated, we can't support the Range HTTP header to allow continuation
of interrupted downloads -- so if you got 2.8GB of that 3GB file ...
you're starting all over again.

We've been looking into serving 'metalink' files [2], which most download
managers support (v3, at least ... v4 was just recently released as an
RFC), so the browsers would just have a list of possibly thousands of
files to download, and they'd only have to retry the individual files
that were incomplete, not the whole tarball.

...

And all of that being said -- as the majority of the data you're dealing
with is time series in CDF, is there any reason you're not looking at
either adding a connector to OPeNDAP [3] or using the IVOA Table Access
Protocol [4] before we go designing yet another protocol?

-Joe

ps. CGI-style key-value pairs kinda suck when dealing w/ boolean logic.
If you noticed, in my EUVI example the parameters were:
timerange=20010101-20070101;detector=stereo_a.secchi.euvi
and not:
timerange=20010101-20070101;detector=euvi;source=stereo_a
You can try the second one, and you'll see why. (It was to be able to
support the input from the webform interface, where someone might ask
for all data from one observatory + another instrument from another.)

[1] http://tools.ietf.org/html/draft-kunze-bagit-03 ; https://wiki.ucop.edu/display/Curation/BagIt
[2] http://en.wikipedia.org/wiki/Metalink ; http://tools.ietf.org/html/rfc5854
[3] http://opendap.org/
[4] http://www.ivoa.net/Documents/TAP/


Todd King

Jun 30, 2010, 1:20:08 PM
to hdmc-da...@googlegroups.com
Hi Joe -

Thanks. A very informative e-mail. The detail and examples are great.

One point you made was:

> -- but it's a two step process --
> first the query, and then
> if you like the results, you can order them.

I agree with you. I call the two steps "discovery" and "delivery".
The Registry working group is exploring the "discovery" part. This working
group is looking at the "delivery" part. Some coordination is needed.
As you pointed out, there have to be practical limits, and as part
of the "discovery" there should be some indication of size/volume.
And on the delivery side there need to be safeguards for blind
requests. Most of the delivery issues are self-correcting.
That is, a user will only wait so long for a delivery. If the
request is excessive, a user will discover that and adjust the
request. What is "excessive" has changed over time. With
increased bandwidth, "excessive" has gone from MB to GB.

> The other problem that we're seeing with generation of large tarballs
> is that because
> we're not writing them out to disk first, and we don't know for sure
> that the file's going to
> be bytewise the same if it's regenerated

Same thing for us. "metalinks" is an interesting solution.
It's a topic worth discussing in more detail.

> either adding
> a connector to OPeNDAP [3] or using the IVOA Table Access Protocol [4],
> before we go designing yet another protocol?

I have looked at OPeNDAP and thought the same thing.
There could be a SPASE connector. It's worth looking at.
Can OPeNDAP work for solar data? I would prefer a more
universal solution, but perhaps we need one solution for
time series and another for images.

The URLs you gave as examples are good real-world illustrations of
a REST approach. The issue of informing the user when a request
cannot be serviced is a good point. We've "solved" it by including
a file called "error.txt" in the zip file to explain why a delivery
could not be completed. Establishing some standard would be useful
since a user would know where to look for additional information.
We also include an "acknowledgement.txt" file for affirmation requests.

-Todd-

Todd King

Jun 30, 2010, 2:26:37 PM
to hdmc-da...@googlegroups.com
Hi all -

Here's an attempt to summarize the APIs for data delivery that have
been mentioned in this thread.

DataShop (get_data.pl)
req: Type of request. GET_DATA returns data; LIST_URL returns a list of
URLs. (required)
dsid: The unique identifier for the resource (data). (required)
t1: Start date of interval: Format Y,M,D,H,M,S
t2: End date of interval: Format Y,M,D,H,M,S
type: Semantic type of data. Controls output format. (required)
index: The index of the dataset within a resource. (required)
channel: A series within the data set. Similar to parameter.

SPASE Downloader (VMO)
id : SPASE Resource ID (required)
startdate : Start date of interval. ISO-8601 format.
stopdate : Stop date of interval. ISO-8601 format.

VSO API (universal?)
pptid : Identifier of the resource.
timerange : Date range of interval. Format start-end, in YYYYMMDD format.
date : Single event instance. YYYYMMDD format.
series : A series within the data set. Similar to parameter.
record : Range of records. Format start-end.
compress : Compression method (e.g. rice)
instrument : Instrument name.
detector : Detector name.
observatory : Implied by the service name (e.g. http://localhost/cgi-bin/gethinode).

Using the intersection of these three APIs and adopting SPASE terms results
in:

resourceid: The unique identifier for the resource (data). (required)
startdate : Start date of interval. ISO-8601 format.
stopdate : Stop date of interval. ISO-8601 format.
parameter : A series within the data set. Comma separated list.
format : The desired format for the delivery.

Additional options could include:

startrecord : Start record of an interval.
stoprecord : End record of an interval.
compress : Compression method (e.g. rice)

Each option could have a short name, such as "id" for "resourceid".

"format" is similar to DataShop "Type" since semantic type implies are
particular
structure to the data. In this context it's more like the SPASE use of
"Format".

A "startdate" or "stopdate" without the other specifies a single event (VSO
date).
A "startrecord" or "stoprecord" without the other specifies a single record.

Instrument, detector and observatory are assumed to be clearly resolved
with the "resourceid".

I do like the concise way the VSO expresses timerange and record. However,
using "-" as a delimiter in timerange conflicts with using ISO-8601 times.

The suggested service name is "download".

Downloading an entire resource would have the form:

http://localhost/download?resourceid=SPASE://VxO/NumericalData/Example

What is returned is still to be decided. It could be a zip or tarball
containing multiple files (like SPASE and VSO), a single file in the
requested format (DataShop), or ???

-Todd-

Joe Hourcle

Jun 30, 2010, 9:31:51 PM
to hdmc-da...@googlegroups.com

On Jun 30, 2010, at 1:20 PM, Todd King wrote:

> Hi Joe -
>
> Thanks. A very informative e-mail. The detail and examples are great.
>
> One point you made was:
>
>> -- but it's a two step process --
>> first the query, and then
>> if you like the results, you can order them.
>
> I agree with you. I call the two steps "discovery" and "delivery".

No making new terms!

If we're following the OAIS model, there are three steps:

Finding
Ordering
Retrieval

Your 'discovery' step is Finding, while your 'delivery' is a combination
of both Ordering + Retrieval.

If we're following the FRBR tasks:

Find
Identify
Select
Obtain

In that case, they don't differentiate between Ordering vs. Retrieval,
as it's all just "Obtain" (unless you consider 'Select' to be 'Ordering',
but I personally consider 'Identify' and 'Select' to be tasks that OAIS
treats as external to the system it's modeling).

So, as a use case for the VSO:

Finding:
    user submits their query to the VSO, and gets back a list of records
Identify / Select:
    user looks over the records, and decides which ones they want to get
Ordering:
    user sends their request to the VSO and how they want to get the data,
    and the VSO responds with either a list of URLs or a message of when
    it'll be ready for them to pick up (and then the URL of where it's
    staged is sent to them via e-mail)
Retrieval:
    user follows the URLs to download the files.

It's possible that the 'Finding' step could also be enhanced using either
'browse' with Helioviewer or the HEK (or HELM, etc.)


> The Registry working group is exploring the "discovery" part. This
> working
> group is looking at the "delivery" part. Some coordination is needed.
> As you pointed out there have to be practical limits and as part
> of the "discovery" there should be some indication of size/volume.
> And on the delivery side there need to be safeguards for blind
> requests. Most of the delivery issues are self-correcting.
> That is, a user will only wait so long for a delivery. If the
> request is excessive, a user will discover that and adjust the
> request. What is "excessive" has changed over time. With
> increased bandwidth "excessive" has gone from MB to GB.

But there's a cost to the data provider, even if the connection is aborted.
Did I have to stage the data, forcing me to age something else out
of my cache? Did their request require us to load from tape, or
pull data from another mirror?

>> The other problem that we're seeing with generation of large tarballs
>> is that because
>> we're not writing them out to disk first, and we don't know for sure
>> that the file's going to
>> be bytewise the same if it's regenerated
>
> Same thing for us. "metalinks" is an interesting solution.
> It's a topic worth discussing in more detail.
>
>> either adding
>> a connector to OPeNDAP [3] or using the IVOA Table Access Protocol
>> [4],
>> before we go designing yet another protocol?
>
> I have looked at OPeNDAP and thought the same thing.
> There could be a SPASE connector. It's worth looking at.
> Can OPeNDAP work for solar data? I would prefer a more
> universal solution, but perhaps we need one solution for
> time series and another for images.

I don't think it'll work for images -- it's just not designed for that
sort of thing (or spectra, or anything else other than time series,
really).

IVOA has different specifications for images vs. spectra vs. tables,
but they make the assumption that coordinates are in RA/Dec, and we
don't have that luxury. E.g.:

http://example.org/cgi-bin/VOimq?POS=180.567,-30.45&SIZE=0.0125

I think it makes sense -- you use different sorts of parameters to
search for images than you would for time series (and as they're
passing off to cutout services and/or subsetting, the way you reduce
the data is different, too).


> The URLs you gave as examples are good real world illustrations of
> a REST approach. The issue of informing the user when a request
> cannot be serviced is a good point. We've "solved" it by including
> a file called "error.txt" in the zip file to explain why a delivery
> could not be completed. Establishing some standard would be useful
> since a user would know where to look for additional information.
> We also include a "acknowledgement.txt" file for affirmation requests.

If we're going to follow a standard, I'd suggest BagIt, as it has many
of the items that we need:

http://tools.ietf.org/html/draft-kunze-bagit-04

(and oops, I had linked to an older spec the last time)

Basically, BagIt can be a directory structure on physical media (e.g. a
CD), or any form of multi-file archive format (zip, tar, stuffit, etc.)

Within a Bag, you can include any metadata files, but the standard calls for:

bagit.txt
    (a file to declare the BagIt version & file encoding)
manifest-*.txt
    (where * is the name of the checksum used; there can be multiple)
tagmanifest-*.txt
    (similar to manifest-*.txt, but for the metadata files)
fetch.txt
    (a list of URLs to download to complete the Bag)
bag-info.txt
    (metadata for the Bag)

We could either extend the standard by defining extra files we need
(e.g. 'errors.txt'), or by adding SPASE-specific keywords to bag-info.txt
(e.g. 'Acknowledgements').
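
For illustration, a hypothetical bag for one of our deliveries might look
something like this (the file names, sizes and checksums are made up; the
fetch.txt line follows the spec's "URL LENGTH FILENAME" form):

example-delivery/
    bagit.txt            (BagIt-Version + Tag-File-Character-Encoding declarations)
    bag-info.txt         (could carry SPASE-specific keywords, e.g. Acknowledgements)
    manifest-md5.txt     (one "<checksum> data/<file>" line per payload file)
    fetch.txt            (e.g. "http://localhost/get/granule-002.cdf 1048576 data/granule-002.cdf")
    errors.txt           (a possible extension: why granule-003 couldn't be delivered)
    data/
        granule-001.cdf  (payload file that was included directly)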


-Joe H.


ps. I don't consider my stuff to be REST -- REST suggests that you're
accessing a document using its primary key, and that you're using HTTP
methods other than GET to manipulate the object. All I have is a plain,
boring, old-fashioned CGI that returns stuff. If my stuff is REST, then
a directory full of files is REST.

Jon Vandegriff

Jul 2, 2010, 2:39:25 PM
to hdmc-da...@googlegroups.com
Todd,

Thanks for creating this summary of the various APIs. This discussion is a good starting point for this group. Like Joe has pointed out, our current thinking is not scoped to handle large amounts of image data. The SDO problem is so huge that in my mind it is a separate category altogether. Nevertheless, although my own focus is certainly with lower volume timeseries data, I think it is important to realize that data volumes for time series are not going down as a function of time, so we should not paint ourselves into a corner with a design that can't handle larger volumes. But, we also have to be realistic with the resources that we have to solve this problem, so focusing on this lower volume time series data is appropriate at this stage.

Coming up with a common API for (lower-volume, timeseries) data requests would be very useful and would go a long way to addressing what this Google group is designed to do. I think your overview shows that it should be possible. Part of the issue is getting all the stakeholders (existing service providers) to agree to the process and commit themselves to using a common API.

What we need is really a low-layer API for getting to the numeric content of the data in such a way that science systems can be built on top of this access in an efficient way. We do not have to include all the fancy items in the low level API, since more sophisticated things can be built on top, as long as they can be done efficiently.

One suggestion for thinking about data requests is to think of the request (I think this is the "Order" in Joe's OAIS terminology) as having several components:
1. the source data to use (dataset identifier)
2. identifying what you want done to the data (subsetting according to a time range, filtering, averaging, merging with other source data)
3. identifying how you want the data to be delivered (one single file with everything, daily files, what file format, how to bundle the files, or maybe just stream the data instead, or if the data is too big, have it staged for later download)
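
To make that concrete, here is a rough sketch of how such an order might be
structured (the field names and values are invented for illustration, not a
proposed schema):

# Hypothetical structure for an "Order", following the three components above.
order = {
    "source": {
        "resourceid": "spase://VxO/NumericalData/Example",
    },
    "processing": {
        "startdate": "2006-01-01T00:00:00Z",
        "stopdate":  "2006-01-02T00:00:00Z",
        "average":   None,          # e.g. an averaging window, if any
        "merge_with": [],           # other source data to merge in
    },
    "delivery": {
        "format":  "cdf",
        "bundle":  "single-file",   # or "daily-files", "tarball", "stream"
        "staging": "direct",        # or "stage-for-later-download"
    },
}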

Things that make an API complicated are when you start having more than one of something, like multiple input time ranges, or perhaps merging multiple datasets, or even multiple output formats.

Also, for integration with the VxO's, the source data is fuzzier - a VxO search already ends up with not just a dataset identifier (or identifiers), but a list of URLs, and probably a set of time ranges and other constraints.

You could even think about splitting the process up so that the original "Order" does not provide the data, but a response indicating whether or not the Order will be accepted, and if so, how long it will take, and what kind of protocol is appropriate for obtaining the resulting data. (We would need some new terms if we did this!)


Jon V.

Joe Hourcle

Jul 2, 2010, 3:13:42 PM
to hdmc-da...@googlegroups.com

On Jul 2, 2010, at 2:39 PM, Jon Vandegriff wrote:

> Todd,
>
> Thanks for creating this summary of the various APIs. This
> discussion is a
> good starting point for this group. Like Joe has pointed out, our
> current
> thinking is not scoped to handle large amounts of image data. The SDO
> problem is so huge that in my mind it is a separate category
> altogether.
> Nevertheless, although my own focus is certainly with lower volume
> timeseries data, I think it is important to realize that data
> volumes for
> time series are not going down as a function of time, so we should
> not paint
> ourselves into a corner with a design that can't handle larger
> volumes. But,
> we also have to be realistic with the resources that we have to
> solve this
> problem, so focusing on this lower volume time series data is
> appropriate at
> this stage.

Agreed.

> Coming up with a common API for (lower-volume, timeseries) data
> requests
> would be very useful and would go a long way to addressing what this
> Google
> group is designed to do. I think your overview shows that it should be
> possible. Part of the issue is getting all the stakeholders (existing
> service providers) to agree to the process and commit themselves to
> using a
> common API.

Still Agree.

Possibly losing agreement here, but I'll explain.

Yes, when we designed the VSO, we stripped it down pretty bare, and that
made it easier to implement and to get lots of data in quickly -- but the
decision to build our own, new methods, rather than building on existing
community standards, means we have to do everything, and there are
a lot of scenarios that we just didn't plan for.

I had looked at Z39.50 (a protocol for searching library catalogs), and
we thought there'd be no reason to deal with that amount of complexity
... but so many of the extensions in there would be beneficial to the
scientists (saving queries to replay them later; we had to re-implement
our own way of saving results) or to people writing new UIs (asking
the services what extensions they support).

...

For the issue with time formats, we have a simple solution for the VSO --
it's always start & end time in a single format. That's not to say that a
new UI couldn't allow people to specify it using other methods and
handle the translation to the preferred method.

In the case of units -- we use Angstrom internally for the VSO, but
the UI can accept other formats (so long as they convert it to GHz, keV
or Angstrom), and the data providers can respond with whatever they
want. (Although, I admit, it *does* make it harder on the UIs to handle
sorting, as some might return cm, nm, MHz, MeV, etc ... but I do have
some perl modules that handle the translation.)
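
For what it's worth, that translation is simple enough to sketch. A rough
Python equivalent (not the actual VSO perl modules) for normalizing a
provider's reported value to Angstrom might be:

# Normalize a (value, unit) pair reported by a data provider to Angstrom.
# Conversions used: E[keV] = 12.398 / lambda[Angstrom], nu = c / lambda.
HC_KEV_ANGSTROM = 12.398        # h*c in keV * Angstrom
C_ANGSTROM_PER_S = 2.998e18     # speed of light in Angstrom / s

def to_angstrom(value, unit):
    unit = unit.lower()
    if unit in ("angstrom", "a"):
        return value
    if unit == "nm":
        return value * 10.0
    if unit == "cm":
        return value * 1.0e8
    if unit == "kev":
        return HC_KEV_ANGSTROM / value
    if unit == "mev":
        return HC_KEV_ANGSTROM / (value * 1000.0)
    if unit == "ghz":
        return C_ANGSTROM_PER_S / (value * 1.0e9)
    if unit == "mhz":
        return C_ANGSTROM_PER_S / (value * 1.0e6)
    raise ValueError("unhandled unit: %s" % unit)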

> Also, for integration with the VxO's, the source data is fuzzier -
> a VxO
> search already ends up with not just a dataset identifier (or
> identifiers),
> but a list of URLs, and probably a set of time ranges and other
> constraints.
>
> You could even think about splitting the process up so that the
> original
> "Order" does not provide the data, but a response indicating
> whether or not
> the Order will be accepted, and if so, how long it will take, and
> what kind
> of protocol is appropriate for obtaining the resulting data. (We
> would need
> some new terms if we did this!)


See :
Data Transfer Negotiation within the Virtual Solar Observatory
http://lwsde.gsfc.nasa.gov/Hourcle_VSO_27Oct04.pdf

I might need to change the response, to deal with the case where I
need to
return more than one URL for a given record identifier (either because
it's served as multiple files, or it's so large I have to segment it)

I've also added a few new subtypes since this was written:

http://vso.nascom.nasa.gov/API/VSO_API.html#getdata

(and can probably clean up the deprecated stuff that I know we've
removed)

It might be worth adding fields for an estimate of the time to wait
before retrieving (or the time when you should worry if you haven't
gotten a response back).


-Joe


Jon Vandegriff

Jul 2, 2010, 3:44:10 PM
to hdmc-da...@googlegroups.com
Joe,

That is speedy feedback - it's hard to keep up with you on this and the SPASE list!
My comments are inserted below.

I am very receptive to disagreements here (here meaning the business about designing the right kind of lower level API). This does seem to be a thorny issue because if you design a simple API underneath that is too simple, it cannot efficiently support the aggregations you end up wanting to do later on. So we somehow need to think ahead about how the low level API might be used in the future and make sure that time consuming operations are not called in a tight loop.

As far as standards go, the OPeNDAP standard seems to have what we need. But I have never used it myself, so we are initiating an exploration of what it would take to emit data using the DAP standard. It's easy if your data is in files, but it's not clear what to do if your data is coming out of a service. You have to write your own software that either speaks OPeNDAP or plugs into something that does. Then the data request API like the one we have been talking about would live on top of the OPeNDAP protocol. The plain old DAP (Data Access Protocol) could also serve as the streaming protocol, which would allow OPeNDAP clients (of which there are many) to interpret the data.

[trimmed]

> See :
>        Data Transfer Negotiation within the Virtual Solar Observatory
>        http://lwsde.gsfc.nasa.gov/Hourcle_VSO_27Oct04.pdf


I'll take a look at this later - maybe next week.

 

Todd King

Jul 2, 2010, 4:16:52 PM
to hdmc-da...@googlegroups.com
Hi -

Interesting information at the links. It'll take a while to read through,
but it looks like there are a lot of well-thought-out concepts.

With regards to Jon's

> You could even think about splitting the process up so that the

> original "Order" does not provide the data...

Just to clarify things a little...
Are we looking at the HPDE system as a whole?

For example, the Registry service supports query and certain select
functions. That is, if a Resource ID is submitted to the Registry service,
I can retrieve the metadata for the resource. Part of the metadata is a
list of URLs for the Granules which comply with certain constraints (like
a time range). The "data access" service would rely on the Registry
service to obtain metadata. The function of the "data access" service is
to do all the transforms and packaging on the data (subsetting according
to a time range, filtering, averaging, merging with other source data,
converting format, generating the delivery package).

Is this everyone's understanding of the focus of the "data access" service
and the role of other services in the HPDE?

-Todd-



Joe Hourcle

Jul 2, 2010, 4:25:14 PM
to hdmc-da...@googlegroups.com

On Jul 2, 2010, at 3:44 PM, Jon Vandegriff wrote:

> Joe,
>
> That is speedy feedback - its hard to keep up with you on this and
> the SPASE
> list!
> My comments are inserted below.

I'm not avoiding 'real work' on the day before a 3 day weekend.
(why won't anyone believe me?)

Besides, years of mudding helps with the typing speed.


>>> What we need is really a low-layer API for getting to the numeric
>>> content of
>>> the data in such a way that science systems can be built on top
>>> of this
>>> access in an efficient way. We do not have to include all the
>>> fancy items in
>>> the low level API, since more sophisticated things can be built
>>> on top, as
>>> long as they can be done efficiently.

I'll just comment on this one -- it's a matter of figuring out which
stuff can be safely layered on top, and which stuff is just better to
do as early as possible.

An example -- we have a 'near' keyword in the VSO IDL client, which
allows you to do:

IDL> list = vso_search( near='2010-06-08 12:45:00', inst='aia', wave=171)

... and it'll return the closest record to that time.

If you try sending that to older data providers that don't support it ...
they'll try to send back 2 hrs worth of results around that time ... now,
we *could* then have the UI look for the closest match, and trash the rest
of the results, but that's really inefficient if we can get the necessary
support at the data catalog.

(which um ... is basically what you said ... an issue of efficiency)
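
The client-side fallback itself would be trivial -- something like this
sketch (the record structure is hypothetical, not what the IDL client
actually returns) -- it's the wasted transfer and the load on the provider
that's the problem:

# Client-side 'near' fallback: pick the single record closest to the
# requested time and discard the rest.  Record fields are made up.
from datetime import datetime

def closest_record(records, target):
    return min(records, key=lambda rec: abs(rec["time"] - target))

records = [
    {"time": datetime(2010, 6, 8, 12, 42), "url": "http://localhost/aia_1242.fits"},
    {"time": datetime(2010, 6, 8, 12, 47), "url": "http://localhost/aia_1247.fits"},
]
print(closest_record(records, datetime(2010, 6, 8, 12, 45)))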


> I am very receptive to disagreements here (here meaning the
> business about
> designing the right kind of lower level API). This does seem to be
> a thorny
> issue because if you design a simple API underneath that is too
> simple, it
> cannot efficiently support the aggregations you end up wanting to do
> later
> on. So we somehow need to think ahead about how the low level API
> might be
> used in the future and make sure that time consuming operations are
> not
> called in a tight loop.

Agreed.

> As far as standards go, the OPeNDAP standard seems to have what we
> need. But
> I have never used it myself, so we are initiating an exploration of
> what it
> would take to emit data using the DAP standard. It's easy if your
> data is in
> files, but it's not clear what to do if your data is coming out of a
> service.
> You have to write your own software that either speaks OPeNDAP or
> plugs into
> something that does. Then the data request API like the one we have
> been
> talking about would live on top of the OPeNDAP protocol. The plain
> old DAP
> (Data Access Protocol) could also serve as the streaming protocol,
> which
> would allow OPeNDAP clients (of which there are many) to interpret
> the data.

I think the files vs. streams thing is going to be an issue we need to
sort out.

I assume streaming will become more prevalent, as a way to deal with
serving customized responses to users to reduce the bandwidth demand
from large data volume investigations.

(and sorry, no experience personally w/ OPeNDAP ... but they seem to
have some outreach efforts, as they've had a booth at AGU the last
couple of years, I think)

-Joe

Joseph B. Gurman

Jul 2, 2010, 4:36:22 PM
to hdmc-da...@googlegroups.com
> Besides, years of mudding helps with the typing speed.

Or leads to RSI.

Joe Hourcle

Jul 2, 2010, 4:53:51 PM
to hdmc-da...@googlegroups.com

On Jul 2, 2010, at 4:36 PM, Joseph B. Gurman wrote:

>> Besides, years of mudding helps with the typing speed.
>
> Or leads to RSI.

Oh no!

And then we'd write really strange programming languages!

Ick.

-Joe

Joseph B. Gurman

Jul 2, 2010, 4:57:25 PM
to hdmc-da...@googlegroups.com
....with really gnarly hands.

I didn't say you'd get ITTVis, after all.
----
"I love deadlines. I love the whooshing sound they make as they go by."

- Douglas Adams, 1952 - 2001

Joseph B. Gurman, Solar Physics Laboratory, NASA Goddard Space Flight Center, Greenbelt MD 20771 USA
