We're new HDF5 users, and we'd like opinions or suggestions on how to structure the data we're working with.
We want to import 7.5 billion time series data points that are currently stored in 6500 separate CSV files.
Each data point contains a timestamp and three related 10-digit floating-point values.
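For scale, here is a rough size estimate. It assumes every value, including the timestamp, is stored as an 8-byte float64; that is an assumption on our part, although 10 significant digits would not fit in a float32, which holds only about 7.

```python
# Back-of-the-envelope size of the raw signal, assuming every value
# (timestamp included) is stored as an 8-byte float64.
n_points = 7_500_000_000   # total data points, from the description above
n_columns = 4              # timestamp + three signal values
bytes_per_value = 8        # float64 ('f8'), assumed

total_bytes = n_points * n_columns * bytes_per_value
total_gb = total_bytes / 1e9  # decimal gigabytes

window_points = 200_000       # one event window
window_mb = window_points * n_columns * bytes_per_value / 1e6

print(f"full signal: {total_gb:.0f} GB, one window: {window_mb:.1f} MB")
# prints: full signal: 240 GB, one window: 6.4 MB
```

So the full signal is far too large to hold in memory at once, while a single event window is small.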
We'll be parsing this huge dataset to create groups of much smaller datasets to work with. The smaller datasets will contain windows of signal data around specific events. These windows will be around 200,000 data points each. We'll create around 25 groups each containing roughly 20 event windows. So when accessing the huge set above, we'll only be pulling a few hundred thousand consecutive points at a time.
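The access pattern we have in mind looks roughly like the sketch below. Slicing an h5py dataset reads only the requested range from disk, so pulling one window should not require loading the whole signal. The helper `read_window`, the demo file, and the index values are hypothetical stand-ins, not our real data.

```python
import os
import tempfile

import h5py
import numpy as np


def read_window(path, name, center, half_width):
    """Return the points around index `center` of dataset `name`.

    Only the requested slice is read from the file.
    """
    with h5py.File(path, 'r') as hf:
        dset = hf[name]
        start = max(center - half_width, 0)
        stop = min(center + half_width, dset.shape[0])
        return dset[start:stop]


# Tiny stand-in for the real signal file (hypothetical data).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_signal.hdf5')
with h5py.File(demo_path, 'w') as hf:
    hf.create_dataset('SIGNAL/CurrentA', data=np.arange(1_000_000, dtype='f8'))

# A 200,000-point window centered on a made-up event index.
window = read_window(demo_path, 'SIGNAL/CurrentA',
                     center=500_000, half_width=100_000)
print(window.shape, window[0], window[-1])
```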
I'm confident that we can handle the groups of event data. But some suggestions on how to create the huge source signal dataset would be greatly appreciated.
I assume we'll have to use chunking and extendible datasets (which we haven't learned yet).
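From the h5py documentation, a chunked, resizable dataset seems to come down to passing `chunks` and `maxshape` to `create_dataset` and calling `resize` before each write. A minimal sketch, where the chunk length and the demo file name are arbitrary placeholders rather than tuned values:

```python
import os
import tempfile

import h5py
import numpy as np

CHUNK_ROWS = 100_000  # hypothetical chunk length, not a tuned value

demo_path = os.path.join(tempfile.mkdtemp(), 'extendible_demo.hdf5')
with h5py.File(demo_path, 'w') as hf:
    # Start empty; maxshape=(None,) lets the first axis grow without limit.
    dset = hf.create_dataset('CurrentA', shape=(0,), maxshape=(None,),
                             chunks=(CHUNK_ROWS,), dtype='f8')
    for block in (np.ones(250_000), np.zeros(50_000)):
        old_len = dset.shape[0]
        dset.resize((old_len + block.shape[0],))  # grow the dataset
        dset[old_len:] = block                    # write into the new tail
    final_shape = dset.shape
    final_chunks = dset.chunks
    final_maxshape = dset.maxshape

print(final_shape, final_chunks, final_maxshape)
# prints: (300000,) (100000,) (None,)
```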
I'm also wondering whether it's better to create a separate dataset for each column of data, create one dataset for the timestamps and one for the three-component signal data, or combine everything into a single dataset.
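To make the three options concrete, here is a small sketch that writes the same made-up data in each layout and reads the same window back. The group names (`per_column`, `two_datasets`, `combined`) and the compound field names are placeholders chosen for this example.

```python
import os
import tempfile

import h5py
import numpy as np

n = 1_000
t = np.arange(n, dtype='f8')
signal = np.random.default_rng(0).random((n, 3))  # CurrentA, CurrentB, VoltageA
names = ('CurrentA', 'CurrentB', 'VoltageA')

demo_path = os.path.join(tempfile.mkdtemp(), 'layout_demo.hdf5')
with h5py.File(demo_path, 'w') as hf:
    # Option 1: one 1-D dataset per column.
    g = hf.create_group('per_column')
    g.create_dataset('timestamps', data=t)
    for i, name in enumerate(names):
        g.create_dataset(name, data=signal[:, i])

    # Option 2: timestamps plus a single (n, 3) signal dataset.
    g = hf.create_group('two_datasets')
    g.create_dataset('timestamps', data=t)
    g.create_dataset('signal', data=signal)

    # Option 3: everything in one dataset with a compound (record) dtype.
    row_type = np.dtype([('timestamp', 'f8')] + [(name, 'f8') for name in names])
    rows = np.empty(n, dtype=row_type)
    rows['timestamp'] = t
    for i, name in enumerate(names):
        rows[name] = signal[:, i]
    hf.create_dataset('combined/rows', data=rows)

# Read the same 100-point window of CurrentA from each layout.
with h5py.File(demo_path, 'r') as hf:
    w1 = hf['per_column/CurrentA'][100:200]
    w2 = hf['two_datasets/signal'][100:200, 0]
    w3 = hf['combined/rows'][100:200]['CurrentA']

print(np.array_equal(w1, w2), np.array_equal(w1, w3))
```

All three return the same values. As far as we can tell, the difference is mainly in how much gets read when only one column is needed: the per-column layout touches just that column, while the combined layout reads whole rows.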
Thanks.
As a preliminary exercise, this is how we are populating an HDF5 file from one of these CSV files. Once we realized that we can't simply append to these datasets by looping through all the data files, we figured we were ripe for some guidance.
import h5py
import numpy as np

FILE_HDF5_SIGNAL = 'external_signal.hdf5'

# Load one CSV file: skip the first 23 lines, keep the first four columns,
# and transpose so that each column becomes one row of the array.
files = np.loadtxt('location_001_ivdata_129.txt', delimiter=',', skiprows=23,
                   usecols=(0, 1, 2, 3), dtype=float).T
X_Values = files[0]
Current_A = np.asarray(files[1], 'f8')
Current_B = np.asarray(files[2], 'f8')
Voltage_A = np.asarray(files[3], 'f8')

# Create the HDF5 file for the master signal; mode 'w' overwrites an existing file.
with h5py.File(FILE_HDF5_SIGNAL, 'w') as hf:
    g1 = hf.create_group('SIGNAL')
    g1.attrs['Instances'] = 'NXinstance'
    # Fixed-size datasets: their shape is set by the data passed in here.
    g1.create_dataset('timestamps', data=X_Values)
    g1.create_dataset('CurrentA', data=Current_A)
    g1.create_dataset('CurrentB', data=Current_B)
    g1.create_dataset('VoltageA', data=Voltage_A)
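What we think the multi-file version would have to look like is sketched below: create each dataset empty and resizable, then grow it as each CSV file is loaded. This is a guess at the approach, not something we have run on the real data. The function `append_csv_files`, the chunk length, and the two tiny demo files are hypothetical; the demo files have no header, so `skiprows=0` is used there. The paths are assumed to be passed in time order.

```python
import os
import tempfile

import h5py
import numpy as np

NAMES = ('timestamps', 'CurrentA', 'CurrentB', 'VoltageA')


def append_csv_files(csv_paths, hdf5_path, skiprows=23, chunk_rows=100_000):
    """Append the first four columns of each CSV file to resizable datasets."""
    with h5py.File(hdf5_path, 'w') as hf:
        group = hf.create_group('SIGNAL')
        dsets = [group.create_dataset(name, shape=(0,), maxshape=(None,),
                                      chunks=(chunk_rows,), dtype='f8')
                 for name in NAMES]
        for path in csv_paths:  # assumed to be in time order
            columns = np.loadtxt(path, delimiter=',', skiprows=skiprows,
                                 usecols=(0, 1, 2, 3), dtype=float, ndmin=2).T
            old_len = dsets[0].shape[0]
            new_len = old_len + columns.shape[1]
            for dset, values in zip(dsets, columns):
                dset.resize((new_len,))          # grow, then fill the new tail
                dset[old_len:new_len] = values


# Two tiny header-less CSV files standing in for the real ones.
tmp_dir = tempfile.mkdtemp()
part_a = np.arange(20, dtype=float).reshape(5, 4)
part_b = np.arange(20, 32, dtype=float).reshape(3, 4)
csv_paths = []
for i, part in enumerate((part_a, part_b)):
    csv_path = os.path.join(tmp_dir, f'demo_{i:03d}.txt')
    np.savetxt(csv_path, part, delimiter=',')
    csv_paths.append(csv_path)

out_path = os.path.join(tmp_dir, 'appended_signal.hdf5')
append_csv_files(csv_paths, out_path, skiprows=0, chunk_rows=4)

with h5py.File(out_path, 'r') as hf:
    timestamps = hf['SIGNAL/timestamps'][:]
    voltage_a = hf['SIGNAL/VoltageA'][:]
print(timestamps.shape, timestamps)
```

Only one CSV file's worth of data is in memory at a time, which is the part that matters for 6500 files.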