7.5 billion data points. How to HDF5 that?


ga...@meetnexi.com

Aug 18, 2016, 12:43:40 PM
to h5py
We're new HDF5 users and I wanted to get an opinion or suggestion on structuring the data we're working with.

We want to import 7.5 billion time-series data points that are currently stored in 6500 separate CSV files.
Each data point contains a time stamp and three related 10-digit floating-point values.

We'll be parsing this huge dataset to create groups of much smaller datasets to work with. The smaller datasets will contain windows of signal data around specific events. These windows will be around 200,000 data points each. We'll create around 25 groups each containing roughly 20 event windows. So when accessing the huge set above, we'll only be pulling a few hundred thousand consecutive points at a time.

I'm confident that we can handle the groups of event data. But some suggestions on how to create the huge source signal dataset would be greatly appreciated.

I assume we'll have to use chunking and extendible datasets (which we haven't learned yet).
I'm also wondering whether it's better to create a separate dataset for each column of data, create one dataset for the timestamps and one dataset for the 3-dimensional signal data, or combine everything into a single dataset.
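
For concreteness, here's a rough sketch of the three layouts in question (the file name, dataset names, and row count below are just placeholders):

import h5py

n = 1000  # stand-in for the real number of rows

with h5py.File('layout_sketch.hdf5', 'w') as hf:
    # Option A: one dataset per column
    g = hf.create_group('per_column')
    g.create_dataset('timestamps', shape=(n,), dtype='f8')
    g.create_dataset('CurrentA', shape=(n,), dtype='f8')
    g.create_dataset('CurrentB', shape=(n,), dtype='f8')
    g.create_dataset('VoltageA', shape=(n,), dtype='f8')

    # Option B: timestamps in one dataset, the three signals in an (n, 3) dataset
    g2 = hf.create_group('split_signal')
    g2.create_dataset('timestamps', shape=(n,), dtype='f8')
    g2.create_dataset('signals', shape=(n, 3), dtype='f8')

    # Option C: everything in a single (n, 4) dataset
    hf.create_dataset('combined', shape=(n, 4), dtype='f8')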

Thanks.

As a preliminary exercise, this is how we are populating an HDF5 file from one of these CSV files. Realizing that we can't just append HDF5 data by looping through all the data files, we figured that we were ripe for some guidance.

import h5py
import numpy as np

FILE_HDF5_SIGNAL = 'external_signal.hdf5'

# creating the hdf5 file for the master signal file
files = np.loadtxt('location_001_ivdata_129.txt', delimiter=',', skiprows=23, usecols=(0, 1, 2, 3), dtype=float).T
X_Values = files[0]
Current_A = np.asarray(files[1], 'f8')
Current_B = np.asarray(files[2], 'f8')
Voltage_A = np.asarray(files[3], 'f8')

with h5py.File(FILE_HDF5_SIGNAL, 'w') as hf:
    g1 = hf.create_group('SIGNAL')
    g1.attrs['Instances'] = 'NXinstance'
    g1.create_dataset('timestamps', data=X_Values)
    g1.create_dataset('CurrentA', data=Current_A)
    g1.create_dataset('CurrentB', data=Current_B)
    g1.create_dataset('VoltageA', data=Voltage_A)
 

Bryan Lajoie

Aug 19, 2016, 12:11:22 PM
to h5py
Hi Gabe,

I am fairly new to HDF5 myself, but thought I would try to help:

For your first point, you are correct, you cannot append to an HDF5 file.
You will have to write a script that takes in a directory containing all of your CSV files (or takes in a file listing the paths to all of your CSV files).
Then, as you loop through each file, dump the data into a single HDF5 file.

So your ideal dataset would be 4 columns wide by 7.5 billion rows?
I assume you only ever want to access your data by rows, not by columns?
All populated, with no missing/empty entries, correct?

hf.create_dataset("table", shape=(4,7500000000), dtype=np.float64, compression='gzip', chunks=(4,256))

Your data will then be stored in 29,296,875 chunks of shape (4, 256), each compressed separately on disk.
I would buffer your input data into (4, 256) blocks and then load each block into the HDF5 file separately.

Something like this:

import numpy as np
import h5py

hf = h5py.File('example.hdf5', 'w')
hf.create_dataset("table", shape=(4, 7500000000), dtype=np.float64, compression='gzip', chunks=(4, 256))
hf["table"].shape

data_block = np.random.rand(4, 256)
data_block.shape

# load data
for row in xrange(0, 256 * 1000, 256):
    hf["table"][:, row:row + 256] = np.random.rand(4, 256)

# get data back
np.nonzero(hf["table"][:, 0:256000])
np.sum(hf["table"][:, 0:256000])
hf.close()

Pierre Complex

Aug 22, 2016, 9:39:45 AM
to h5py
Hi,


On Friday, August 19, 2016 at 6:11:22 PM UTC+2, Bryan Lajoie wrote:
For your first point, you are correct, you cannot append to an HDF5 file.
You will have to write a script that takes in a directory containing all of your CSV files (or takes in a file listing the paths to all of your CSV files).
Then, as you loop through each file, dump the data into a single HDF5 file.

An HDF5 file can be extended in two ways. One is simply to create an additional
dataset in which to store the additional data. The other is to use resizable datasets.

Doing so requires creating the dataset with a given maximum dimension (see
https://www.hdfgroup.org/HDF5/doc/RM/RM_H5S.html#Dataspace-CreateSimple for the
HDF5 library and http://docs.h5py.org/en/latest/high/dataset.html#resizable-datasets for
h5py).
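
For illustration, a minimal h5py sketch of the resizable-dataset route (the file name, chunk shape, and block size here are arbitrary choices):

import h5py
import numpy as np

with h5py.File('resizable_sketch.hdf5', 'w') as hf:
    # maxshape=(None, 4) leaves the first dimension unlimited
    dset = hf.create_dataset('table', shape=(0, 4), maxshape=(None, 4),
                             chunks=(1024, 4), dtype='f8')
    block = np.random.rand(1024, 4)   # one incoming block of rows
    # grow the dataset by the size of the block, then write into the new rows
    dset.resize((dset.shape[0] + block.shape[0], 4))
    dset[-block.shape[0]:, :] = block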
 
Pierre

ga...@meetnexi.com

Aug 23, 2016, 1:00:20 AM
to h5py
I really appreciate the responses. Getting a different perspective certainly helps refine the process.

@Bryan Lajoie:
Great point about structuring by rows!
I'm not so sure about loading and assigning data that way, though. It appears that you create the entire numpy array before populating the HDF5 dataset. I think the enormity of 7.5B points may have been missed: we are talking about 325 GB of uncompressed data. I'm pretty sure we have no choice but to progressively resize an extendible dataset. I also have questions about choosing a good balance between chunk size and the resulting size of the chunk B-tree.

@Pierre Complex:
I'm very curious about the other approach you propose for extending a dataset. As I said above, I'm assuming that we'll have to use a resizable dataset. But you mentioned a simple-sounding option of creating additional datasets that aren't intrinsically resizable. Considering that we are breaking up ordered time-series data, do you have any idea how this style would work when you need to pull a section of data that starts in one dataset and finishes in another?

Thanks and regards!

Pierre Complex

Aug 23, 2016, 3:48:46 AM
to h5py
On Tuesday, August 23, 2016 at 7:00:20 AM UTC+2, ga...@meetnexi.com wrote:
@Pierre Complex:
I'm very curious about the other approach you propose for extending a dataset. As I said above, I'm assuming that we'll have to use a resizable dataset. But you mentioned a simple-sounding option of creating additional datasets that aren't intrinsically resizable. Considering that we are breaking up ordered time-series data, do you have any idea how this style would work when you need to pull a section of data that starts in one dataset and finishes in another?

Resizable datasets are really not a problem. If you enable compression, chunking is required anyway. Resizable datasets just need a suitable maximum-dimensions setting (in h5py, None means unlimited), and you then call the "resize" method on the dataset to extend it.

With split datasets, you would have to manage the access logic yourself.
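
For illustration, a rough sketch of that access logic for a window that spans two consecutive datasets (the dataset names and row offsets are made up):

import numpy as np
import h5py

with h5py.File('split_sketch.hdf5', 'r') as hf:
    d1 = hf['part_001']              # holds global rows 0 .. n1-1
    d2 = hf['part_002']              # holds the rows that follow
    n1 = d1.shape[0]
    start, stop = 950000, 1150000    # global row range of the wanted window
    pieces = []
    if start < n1:                   # part of the window lives in the first dataset
        pieces.append(d1[start:min(stop, n1)])
    if stop > n1:                    # the rest lives in the second dataset
        pieces.append(d2[max(start - n1, 0):stop - n1])
    window = np.concatenate(pieces)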

ga...@meetnexi.com

Aug 24, 2016, 5:25:36 AM
to h5py
Thanks for the feedback. I've created a model to work from. 

If anyone sees any glaring issues, please let me know.


#-------------------------------------------------------------------------------
# Concept model for reading a series of CSV files of unknown length and combining
# them into a continuous, chunked HDF5 file. Relies on the CSV filenames being t%.txt,
# where % is a consecutive integer for each respective file.
# -Gabe
#-------------------------------------------------------------------------------
import h5py
import numpy as np
import sys

#--INITIALIZE VARIABLES--
ChunkSize = (4, 4)
StartFileNum = 23
EndFileNum = 25

#--INITIALIZE HDF FILE--
f = h5py.File('testfile.hdf5', 'w')

for x in range(StartFileNum, EndFileNum + 1):

    #--READ CSV--
    inputfile = 't%i.txt' % x
    files = np.loadtxt(inputfile, delimiter=',', usecols=(0, 1, 2, 3), dtype='float64')

    #--PIPE CSV TO NUMPY ARRAY--
    arr = files[:]
    LenCSV = arr.shape[0]
    print(inputfile, 'is length:', LenCSV)

    if x == StartFileNum:
        #--INITIALIZE HDF DATASET. Compressing for expected large dataset--
        dset1 = f.create_dataset('timetrace', (LenCSV, 4), maxshape=(None, 4), chunks=ChunkSize, dtype=np.float64, compression='gzip')
        #--FILL HDF WITH CSV--
        dset1[:] = arr
        LenHDF = LenCSV
    if x > StartFileNum:
        #--new CSV will have been read...
        #--RESIZE FOR APPEND--
        LenHDFnew = LenHDF + LenCSV
        dset1.resize((LenHDFnew, 4))
        #--APPEND CSV TO END OF RESIZED HDF--
        dset1[LenHDF:LenHDFnew, :] = arr
        LenHDF = LenHDFnew

print(dset1[...])
print('Final shape:', dset1.shape)
f.close()  # make sure the file is flushed and closed
 

Pierre Complex

Aug 24, 2016, 8:30:26 AM
to h5py
Hello,

I see no problems when reading your code. You should be aware, though, that chunk size has an impact on
performance, and your choice (4 by 4) is really small.

See https://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/ and https://www.hdfgroup.org/HDF5/doc/H5.user/Chunking.html

In short, you must experiment with chunk size and cache size. From h5py's documentation:

"It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB"

P

Gabe Krause

Aug 24, 2016, 2:55:04 PM
to h5...@googlegroups.com
Absolutely on the chunk size! I made ChunkSize easily accessible so I could experiment.

I anticipate analyzing contiguous sets of data that are [120000,4]. So, I'll probably set the chunk size to be between [30000,4] and [60000,4] when I read in the actual dataset. 
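
As a quick sanity check of those candidates against the 10 KiB to 1 MiB guideline (4 float64 columns at 8 bytes each):

30000 * 4 * 8   # =   960,000 bytes, about 937.5 KiB -- inside the guideline
60000 * 4 * 8   # = 1,920,000 bytes, about 1.83 MiB  -- a bit above it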

Thanks for the link!

Gabe Krause  Chief Technical Officer  NEXI  707-737-4273

