Storing timeseries data


David Stansby

Jan 10, 2023, 2:38:46 PM
to pyhc...@googlegroups.com
Hi all,

I have been funded by a NumFOCUS small development grant to investigate what the most appropriate object is for sunpy to store timeseries data in going forward. This comes with the context of:
  1. sunpy's current choice, pandas.DataFrame, having a number of drawbacks
  2. there not being a one-size-fits-all data format used in solar physics for timeseries data (compared to imaging data, where FITS is a relatively common standard)
  3. historically less development work having been done on sunpy's support for timeseries data (compared to imaging data)

Although I'm funded to do this for sunpy, I really want some input from the PyHC community on any advice or recommendations for choosing a suitable container. Perhaps a common data container for Python is something we could start standardising on as a community eventually?

I have made some notes here: https://hackmd.io/@5tUXI9ejSAmY27nKY7SvAA/HJsh8qMPs. I would be grateful for any feedback anyone has on:

  • User requirements I've missed off the list
  • If you're using a data container for timeseries data, are there any particularly good or bad bits about that data container I should be aware of?
  • Am I missing any options in my list of data containers? So far I have:
  • astropy.timeseries.TimeSeries
  • pandas.DataFrame
  • xarray.DataArray (or xarray.Dataset)
  • numpy.ndarray
  • ndcube
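
For a concrete feel for the ergonomics, here is a toy sketch (made-up two-point data, nothing instrument-specific) constructing the same series in three of the candidates:

    import numpy as np
    import pandas as pd
    import xarray as xr
    import astropy.units as u
    from astropy.time import Time
    from astropy.timeseries import TimeSeries

    times = ["2023-01-10T00:00:00", "2023-01-10T00:01:00"]
    flux = np.array([1.0, 2.0])

    # pandas: time lives in the index; units and metadata need a wrapper
    df = pd.DataFrame({"flux": flux}, index=pd.to_datetime(times))

    # astropy: columns carry units, and time is a proper astropy Time
    ts = TimeSeries(time=Time(times), data={"flux": flux * u.W / u.m**2})

    # xarray: time is a coordinate, metadata lives in attrs
    da = xr.DataArray(flux, coords={"time": pd.to_datetime(times)},
                      dims="time", attrs={"units": "W / m2"})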

Any other comments or suggestions outside those prompts very welcome too!

All the best,
David



Jonathan Niehof

Jan 10, 2023, 3:42:54 PM
to pyhc...@googlegroups.com
Hi David--

I presume you're asking about the object representation in memory (not storage on disc), and that this would change TimeSeries from have-a DataFrame to have-a something else. Or is the door open to have TimeSeries be-a something, or even disappear in favour of directly using something else?


The main thing we could find consensus on was that easy degradation to more basic objects (e.g. ndarray) is essential.

SpacePy adds a wrinkle on an existing container in your list: we have dmarray, which is-a ndarray, and SpaceData, which is-a dict, the addition being metadata in attached attributes. Subclassing ndarray certainly has its pluses and minuses.
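
A minimal sketch of how that looks in practice (toy values; dmarray and SpaceData are the real spacepy.datamodel classes):

    import numpy as np
    from spacepy import datamodel as dm

    # dmarray is-a ndarray carrying metadata in .attrs
    bx = dm.dmarray([1.0, 2.0, 3.0], attrs={"units": "nT"})

    # SpaceData is-a dict (also with .attrs) holding dmarrays
    sd = dm.SpaceData(attrs={"mission": "example"})
    sd["Bx"] = bx

    # and the easy degradation back to a plain ndarray:
    raw = np.asarray(sd["Bx"])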


-- 
Jonathan Niehof, Ph.D.
UNH Space Science Center
106 Morse Hall, 8 College Rd., Durham, NH 03824
(603) 862-0649

Alec Engell

Jan 10, 2023, 3:48:00 PM
to Jonathan Niehof, pyhc...@googlegroups.com
This may be of interest: we use both TimescaleDB and Parquet in a complementary way.
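
As a flavour of the Parquet half, a minimal pandas round trip (toy data and file name; needs the pyarrow or fastparquet engine installed):

    import pandas as pd

    df = pd.DataFrame({"flux": [1.0, 2.0]},
                      index=pd.to_datetime(["2023-01-10", "2023-01-11"]))
    df.to_parquet("flux.parquet")               # columnar, compressed on disk
    roundtrip = pd.read_parquet("flux.parquet")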


--
Alec Engell
Principal Scientist
aen...@nextgenfed.com
NextGen Federal Systems, LLC
HUBZone Certified #44547
Bozeman, MT 59715

Baptiste Cecconi

Jan 10, 2023, 4:04:49 PM
to Alec Engell, Jonathan Niehof, pyhc...@googlegroups.com
Hi all, 

I'm interested in this discussion. 
I'm leading the development of a module for reading low frequency radio astronomy dynamic spectra (which is a timeseries of spectra). 
We plan to integrate with SunPy eventually. At the moment we are planning to use xarray.DataArray, but if a better (or more standard) option comes out of this discussion, we can switch gears. Our choice of xarray was driven by how simple it makes providing a plot interface to our users.
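
For example, a dynamic spectrum sketch with placeholder values, roughly the shape of our data:

    import numpy as np
    import pandas as pd
    import xarray as xr

    times = pd.date_range("2023-01-10", periods=100, freq="s")
    freqs = np.linspace(10.0, 40.0, 64)          # e.g. MHz
    power = np.random.rand(100, 64)              # placeholder spectra

    da = xr.DataArray(power,
                      coords={"time": times, "frequency": freqs},
                      dims=("time", "frequency"),
                      attrs={"units": "arbitrary"})
    da.plot()   # a labelled pcolormesh, essentially for free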

Cheers,
Baptiste 



Alec Engell

Jan 10, 2023, 4:09:51 PM
to Baptiste Cecconi, Jonathan Niehof, pyhc...@googlegroups.com
Please check out this discussion. It may be helpful. 

I'm at a conference so can't get into it now, but I can set up a meeting with some experts at a future PyHC meeting.

Baptiste Cecconi

Jan 10, 2023, 4:31:10 PM
to Alec Engell, Jonathan Niehof, pyhc...@googlegroups.com
Thanks Alec, 

I'll have a deeper look later. After a first glance: Parquet is fast because it uses Parquet files (a specific file format). In my case this is not really an option, since the data is already distributed as CDF, FITS or other custom binary formats (for older collections) from the archives and data centres…

However, I may be wrong so I'm still listening :-) 

Cheers
Baptiste


Alec Engell

Jan 10, 2023, 4:46:14 PM
to Baptiste Cecconi, Jonathan Niehof, pyhc...@googlegroups.com
Understood. Check out kerchunk and fsspec. We are working on an active NASA grant with Anaconda on them.
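
A tiny fsspec sketch (placeholder URL) of the lazy remote-access idea:

    import fsspec

    # file-like access to remote data without downloading all of it
    with fsspec.open("https://example.com/archive/data.cdf", "rb") as f:
        header = f.read(8)    # read only the bytes you actually need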


Eric Grimes

Jan 10, 2023, 5:09:30 PM
to David Stansby, pyhc...@googlegroups.com
Hi David,

Regarding your notes:

PySPEDAS uses PyTplot internally. From my review, the requirements in your notes that aren't currently met by PySPEDAS+PyTplot are:

- Handle different time scales:
-- PyTplot uses UTC
- Support for storing out-of-memory datasets
-- should be able to add this; the datasets we work with tend to be small, so this hasn't been requested yet

Not sure about this one:
- Have a way to store an observer coordinate alongside the time index
-- You can do this with 2 tplot variables (one for the observer coordinates and one for the data), but I'm not sure if this meets your requirement
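
A sketch of that two-variable approach with toy values (store_data/get_data are the real PyTplot entry points):

    import numpy as np
    from pytplot import store_data, get_data

    t = np.array([0.0, 60.0, 120.0])              # seconds since epoch

    # one tplot variable for the measurement...
    store_data("flux", data={"x": t, "y": np.array([1.0, 2.0, 3.0])})

    # ...and a second for the observer position on the same time axis
    store_data("obs_pos", data={"x": t, "y": np.random.rand(3, 3)})

    flux = get_data("flux")     # namedtuple-like, with .times and .y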

Cheers,
Eric


alexis....@gmail.com

Jan 10, 2023, 5:27:37 PM
to David Stansby, pyhc...@googlegroups.com
Hello David,

For Speasy I tried several times to use something existing, but had some issues:
1/ With pandas:
 - doesn't support ND data well
 - no real support for metadata (last time I checked)
 - not always the best performance

2/ With xarray:
- this issue https://github.com/pydata/xarray/issues/2233, where basically a coordinate can't depend on another, which means we can't use it for spectrometer data with a variable energy table, for example.

So I'm still "stuck" with my own "naive" implementation (based on numpy arrays). The actual benefit of this solution is that it fits our needs, and I can easily provide methods to export to other containers based on user requests.
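
To give an idea, a stripped-down sketch of the kind of container I mean (illustrative only, not Speasy's actual class):

    from dataclasses import dataclass, field
    import numpy as np
    import pandas as pd

    @dataclass
    class TimeSeriesVar:
        # illustrative only: not Speasy's actual implementation
        time: np.ndarray                  # datetime64 array
        values: np.ndarray                # shape (n_time, ...), so ND is fine
        meta: dict = field(default_factory=dict)

        def to_dataframe(self) -> pd.DataFrame:
            # export on demand (only sensible for 1D/2D values)
            return pd.DataFrame(self.values, index=self.time)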

I would be happy to join for a common effort to find/improve and use a common time series container in PyHC projects!

BR,
Alexis.

Eric Grimes

Jan 10, 2023, 6:15:44 PM
to David Stansby, pyhc...@googlegroups.com
Hi David,

I just released a notebook that shows what all of this looks like in PySPEDAS+PyTplot:


Hope this helps!

Eric

Rebecca Ringuette

Jan 10, 2023, 7:20:32 PM
to Eric Grimes, David Stansby, pyhc...@googlegroups.com
Hi all,

Not meaning to throw a wrench in the whole idea, but Kamodo uses functions to represent data. One of the unique challenges we have with model outputs (basically a timeseries with spatial components of 3+ dimensions) is that they often do not fit in memory. So, we have come up with a way to functionalize each dataset and only load the pieces needed for the user's request into memory. An advantage to this functionalization choice is that Kamodo naturally handles differing time and spatial grids, even in different coordinate systems, and can handle large model outputs with ease. At the heart of the code, we use numpy arrays (ndarray) to feed the pieces of the model data to SciPy or custom interpolators and build from there. When needed, we typically store pre-processing calculations in netCDF files. We also have our own metadata structure with several custom functions built on top of it, so I'm not expecting a standard container to satisfy my requirements either. Finally, we use UTC timestamps and measure time in hours since midnight of the date of the first file (again, not a standard representation).
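
To illustrate the general pattern (not Kamodo's actual internals; read_slab here is a hypothetical helper that reads only the requested time range from file):

    import numpy as np
    from scipy.interpolate import interp1d

    def functionalize(filename, variable):
        # return f(t) that loads only the slab of data the request needs
        def f(t):
            t = np.atleast_1d(t)
            times, values = read_slab(filename, variable, t.min(), t.max())
            return interp1d(times, values, axis=0)(t)
        return f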
Looking at your notes, most of the issues you are facing are exactly what we face in Kamodo with model outputs. CCMC's Kamodo has some capability built into the flythrough function to automatically output the result to either a tab- or comma-separated file or a netCDF4 file. See https://github.com/nasa/Kamodo and https://www.youtube.com/playlist?list=PLBWJQ5-pik_yBBcrpDRPM2hLluh-jreFa for more details.

Alexis, Kamodo easily handles variables that depend on other variables through function composition. Take a look at the core Kamodo documentation at https://ensemblegovservices.github.io/kamodo-core/ for some ideas. I work on CCMC's Kamodo, which is built on the core Kamodo package linked, but I can likely help if you have questions. You might also play with this notebook: https://github.com/nasa/Kamodo/blob/master/docs/notebooks/DataFunctionalization.ipynb

David, the idea of a standardized container for PyHC has been bounced around before, but always with the same conclusion. Each package has chosen a container that it finds suitable to its purpose, even creating a new type of container when a standard one is not suitable. Attempting to push the packages to a uniform solution will be a dividing step - one that I do not recommend taking. An alternative is to work towards writing (and maintaining) adapters between the various data containers used across the main PyHC packages. This is the work that Sandy Antunes and colleagues took on over this past summer, making some progress; more work remains to be done, but this is likely the best path forward. I'm hoping to work on this for PlasmaPy and Kamodo at the next PyHC spring meeting. Happy hunting!

Rebecca Ringuette


David Stansby

Jan 20, 2023, 12:55:29 PM
to pyhc...@googlegroups.com
Thanks for all the responses, all super helpful! Replies to some of the points raised below:

> I presume you're asking about the object representation in memory (not a storage on disc)

Yep

> and that this would change TimeSeries from have-a dataframe to have-a something else. Or is the door open to have TimeSeries be-a something, or even disappear in favour of directly using something else?

The door is very much open to all of these! I think TimeSeries appeared because nothing existing was suitable and we needed to write wrappers around pandas to deal with units and metadata (and other things, but those are the big ones). We might end up doing the same thing again, but the Python world of data containers has definitely moved on in the last ten years so we thought it was worth evaluating our choice, with all options open.

> PySPEDAS uses PyTplot internally

What class specifically does PySPEDAS use? I looked at https://pytplot.readthedocs.io/en/latest/ but couldn't obviously find any API docs for a data container.

> Kamodo uses functions to represent data

Surely at some point you have to load the data values into some other data structure? Do you just use a numpy array for that? Similarly to the above question, is there any API documentation anywhere for the class or function used to load data in Kamodo? I searched for `Functionalize_Dataset`, which is what's in the notebook you posted, but couldn't find any mentions of it in the Kamodo docs.

> Attempting to push the packages to a uniform solution will be a dividing step

I definitely disagree with this - while there are certainly some downsides to developing something uniform, I wouldn't say it's dividing (I certainly wouldn't want to force anyone to use anything!). I don't doubt the work involved in creating something that's widely applicable would be large, but I think the payoff would be even larger. Maintaining one thing is always easier than maintaining n > 1 things (n**2 things if converters between each structure have to be maintained!), and a common interface to data for users who come from historically separate 'domains' would be huge. I have never used it (because it's not Pythonic enough for my tastes ;), but my impression is that {Py}SPEDAS has done a really great job here in bringing lots of datasets from different areas into a common package/data interface.
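
To put rough numbers on that: with n = 5 containers, maintaining pairwise converters means n(n-1) = 20 one-way conversions (10 if each converter is bidirectional), whereas converging on a single common structure needs at most 2n = 10 to/from adapters - and ideally none.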

Cheers,
David


Rebecca Ringuette

Jan 20, 2023, 2:58:55 PM
to David Stansby, pyhc...@googlegroups.com
Hi David,

Interesting comments. I will be the first to confirm that CCMC's Kamodo documentation is completely useless at the moment. Correcting that problem is a current project. I have created a youtube channel of tutorials as an intermediate step while that work is in progress (don't worry, it'll stay afterwards, too). The link is at the bottom of the installation instructions file (also a high priority issue). Under the hood, Kamodo loads the data into numpy arrays (ndarray) and takes care of metadata separately. The rest of this email is meant to be a thoughtful contribution to the conversation, honestly a stream of consciousness, so please take it as such.

What you are suggesting for data structures quite honestly isn't even done for file formats, particularly because of the metadata issue. Operationally, file formats such as FITS, CDF, netCDF4, HDF5 and similar are all reasonable choices for a variety of reasons, including ease of accessibility from multiple programming languages. If I recall the conversation right, the decision on file formats was not to choose one as the standard, but to concede on this idea and instead provide a few ideal file formats as options, plus a standard/framework on how best to represent the metadata to improve metadata searching capabilities. I'm saying this to point out that an ideal data container may not be achievable for reasons similar to the problems encountered in the standard file format discussions, particularly the metadata issue but also as a matter of practice in each domain.

With all of that said, maybe this is something we can do in Python. I definitely agree on the n**2 issue, but how much of it can we avoid? Take SWxTREC for example. They wrote an adapter for each data interface (e.g. the interface the archives present to their users) to create a uniform interface for their users for data across archives. I believe the goal is to write an adapter to each archive to present all data to their users for the entire space weather discipline, even extending into neighboring disciplines. This is similar to what Earth Science is doing for their archives, but on a simpler level. On the software side, each of the core PyHC packages has tackled a similar problem for the datasets it supports and has put in sizable work to present a single interface and method of access to its users. If we mimic what each package has done for its supported datasets, then the final desired result would be a package sitting on top of the PyHC packages that presents a uniform interface and single data object for all space weather analysis. Then, only one adapter per package would be needed to hook into this new layer instead of the n**2 adapters currently required. There is work being done on the commercial side for this, which is summarized towards the end of this paper (https://doi.org/10.1016/j.asr.2022.05.012). To summarize, I am convinced that the issue is more than that of a single data container for all, but rather a single interface without sacrificing capability and flexibility. Maybe a good step would be for packages to develop widgets for a sample set of functionalities (e.g. a flythrough widget for Kamodo, a coordinate converter for SpacePy, etc.), which can then be hosted in the same memory as the others. A simplistic approach would likely use numpy arrays as the data container to communicate between them. I don't have a good solution for how to handle the metadata of each data object behind the scenes, but there needs to be a system for that. I put a few ideas here (doi: 10.22541/essoar.167214348.80256157/v1), but it may be much more than you are looking for.

Rebecca Ringuette


Russell Stoneback

Jan 20, 2023, 3:00:07 PM
to pyhc...@googlegroups.com
Thanks David,

I'm late to the data discussion, but pysat currently builds upon both pandas and xarray to provide users with access to data. For us, the common element is the underlying DatetimeIndex functionality.

>> Attempting to push the packages to a uniform solution will be a dividing step

> I definitely disagree with this - while there are certainly some downsides to developing something uniform, I wouldn't say it's dividing (I certainly wouldn't want to force anyone to use anything!). I don't doubt the work involved in creating something that's widely applicable would be large, but I think the payoff would be even larger. Maintaining one thing is always easier than maintaining n > 1 things (n**2 things if converters between each structure have to be maintained!), and a common interface to data for users who come from historically separate 'domains' would be huge.

It may be cheaper to develop, but a single common solution for all users pushes the costs onto users. Though pysat supports xarray, we've also kept pandas because it is easier to work with for 1D data. By necessity, xarray is more complicated, as it supports multidimensional data. And while xarray works with all kinds of data, for some users the files are way too large to load; that type of workload calls for different data design choices, so those users need something else. The variety of formats and data types reflects the variety of user needs.
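
To illustrate with toy data why we keep both:

    import numpy as np
    import pandas as pd
    import xarray as xr

    times = pd.date_range("2023-01-20", periods=4, freq="min")

    # 1D: a pandas Series on a DatetimeIndex is hard to beat
    density = pd.Series([5.0, 5.2, 4.9, 5.1], index=times)
    smoothed = density.rolling(2).mean()

    # ND: xarray carries the extra dimensions explicitly
    counts = xr.DataArray(np.zeros((4, 32, 16)),
                          coords={"time": times},
                          dims=("time", "energy", "pitch_angle"))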


> a common interface to data for users who come from historically separate 'domains' would be huge.

There are both benefits and costs to a common user interface. Fortunately, a common interface doesn't have to exist at the file or data format level. If it is created at a higher level, those who want a common interface can use it without imposing additional requirements on users that don't/can't.

Cheers,
Russell

-----------------------------
Dr. Russell Stoneback
Stoneris





Rebecca Ringuette

Jan 20, 2023, 3:08:39 PM
to Russell Stoneback, pyhc...@googlegroups.com
Hi Russell!

One thing you missed is Alec's earlier comment about using dask and bokeh to resolve large-data issues. NextGen is working on that. It may have wound up on a different part of the thread, though.

Rebecca Ringuette


Jan Gieseler

Mar 10, 2023, 6:43:49 AM
to David Stansby, pyhc...@googlegroups.com

Hi all,

Coming back to this discussion, I was wondering if David (or someone else) has followed the changes coming in pandas 2.0, which among other things introduces pyarrow as a backend instead of numpy (e.g. https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i)? I'm not sure whether this will address some of the constraints pandas.DataFrame has had so far (like memory usage)...
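
For example (pandas >= 2.0; the file name is just a placeholder):

    import pandas as pd

    # opt in to Arrow-backed dtypes when reading
    df = pd.read_csv("flux.csv", dtype_backend="pyarrow")

    # or per-Series
    s = pd.Series([1.0, None, 3.0], dtype="float64[pyarrow]")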

Cheers,
Jan

Davin E. LARSON

Mar 10, 2023, 3:07:21 PM
to Jan Gieseler, David Stansby, pyhc...@googlegroups.com
Hi David,
Sorry about coming into this discussion very late... And thanks for asking the question. The decisions that get made will affect usability - at least for me.

I've dealt with data processing and instrument design and testing for many years. Having started with IDL (possibly before Python existed), I am only very slowly transitioning to Python. This transition is hampered by my lack of time to write a suitable solution within Python, or my inability to find one.

As an instrument developer, one of the things that is essential is the ability to plot data in real time (or at least no more than a few seconds delayed). This means dealing with data streams, not just data files. It requires efficient storage of data in real time and maintaining data structures of indeterminate length, where the data size grows in real time as new data is generated.

Many years ago I wrote a software package in IDL (TPLOT and its associated routines) that eventually grew into SPEDAS. It has now evolved into a giant (messy) collection of IDL routines. Over the years I modified TPLOT and associated routines so that it could handle real-time data streams and meet the requirement stated above, yet still be functional when analyzing older, archived data (by archived data I mean anything more than a few minutes old). TPLOT was designed so that it is trivial to get access to the underlying data, modify it, and create new data parameters for display. It seems to be popular for that reason only. (It is not well written or well documented.)

As instrument development progresses from real-time testing to analysis of historical data files, I have been forced to stay within the IDL environment, because that software has already been developed by the time the instrument is launched. Transitioning to a different software package (Python) is too high a hurdle to get over by the time the instrument has been built.

So within IDL I have a working solution, but I don't know of an adequate solution within Python.  If anyone here knows of a real time plotting/analysis solution I would be grateful to hear about it.

In summary, I would encourage the ability to use a data object that allows (efficient/speedy) real-time updates.
Also: I recommend using UNIX time as the underlying numerical representation of time. Every computer operating system in the world already uses it.
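
As a sketch of what I mean by efficient real-time updates (a toy grow-by-doubling buffer keyed on UNIX timestamps; nothing TPLOT-specific):

    import time
    import numpy as np

    class StreamBuffer:
        # amortized O(1) append for a stream of indeterminate length
        def __init__(self, capacity=1024):
            self.t = np.empty(capacity)       # UNIX timestamps (float seconds)
            self.y = np.empty(capacity)
            self.n = 0

        def append(self, value):
            if self.n == len(self.t):         # full: double the capacity
                self.t = np.concatenate([self.t, np.empty_like(self.t)])
                self.y = np.concatenate([self.y, np.empty_like(self.y)])
            self.t[self.n] = time.time()
            self.y[self.n] = value
            self.n += 1

        def view(self):
            # zero-copy views of the filled portion, ready for plotting
            return self.t[:self.n], self.y[:self.n]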

Thanks for your consideration,
Davin Larson. (Space Sciences Lab/ Berkeley)



--
Space Sciences Laboratory

Rebecca Ringuette

Mar 10, 2023, 5:18:18 PM
to Davin E. LARSON, Jan Gieseler, David Stansby, pyhc...@googlegroups.com

Lei Cai

Mar 13, 2023, 4:28:34 AM
to pyhc-list
Hi David,

Sorry that I noticed your interesting discussion so late.

Please let me introduce Dataset and Variable in GeospaceLAB (https://github.com/JouleCai/geospacelab) for managing time-series data. The classes Dataset and Variable can be used as universal data containers for any observational or modeling data in space physics. They are similar to xarray.Dataset and xarray.DataArray, but are lightweight and more flexible for subclassing and customizing. A Dataset object contains one or more variables (e.g., DATETIME, GEO_LAT, GEO_LON, N_e, T_e, ...). A Variable object contains a numpy array and associated attributes such as unit, label, name, and visual attributes. The Variable object has an attribute "depends" to map its dependencies (other Variable objects) within the same dataset. This can resolve the dependency of a variable along different axes, or multiple dependencies along the same axis.
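
A stripped-down sketch of the "depends" idea (illustrative only, not the real GeospaceLAB classes):

    import numpy as np

    class Variable:
        # sketch only; the real class carries far more attributes
        def __init__(self, value, unit="", depends=None):
            self.value = np.asarray(value)
            self.unit = unit
            self.depends = depends or {}    # axis index -> variable name

    class Dataset(dict):
        pass

    ds = Dataset()
    ds["DATETIME"] = Variable(np.arange(3.0), unit="s")
    ds["ENERGY"] = Variable(np.random.rand(3, 16), unit="eV",
                            depends={0: "DATETIME"})   # table varies in time
    ds["FLUX"] = Variable(np.random.rand(3, 16),
                          depends={0: "DATETIME", 1: "ENERGY"})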

GeospaceLAB has successfully used them as base classes for managing various data products, such as OMNI, DMSP SSJ/SSM, SuperMAG indices, EISCAT incoherent scatter radar data, and Swarm observations. The data structures are further used for developing the visualization module in GeospaceLAB (currently Matplotlib-based). When a dataset and its variables are well defined, the variables can be viewed easily (see examples 2 and 3 on the GitHub homepage).

Best regards,
Lei