Creating a 16TB dataset is slow: due to zero-filling?

Alex Gittens

Mar 31, 2016, 8:54:01 PM
to h5py
I'm trying to write a 16TB hdf5 file using h5py in parallel mode:

rowchunk = 1024
colchunk = 8*numProcs

report("Creating file and dataset")

fout = h5py.File(join(datapath, "atmosphere-chunked.hdf5"), "w", driver="mpio", comm=MPI.COMM_WORLD)
rows = fout.create_dataset("rows", (numRows, numCols), dtype=np.float64, chunks=(colchunk, rowchunk))

reportbarrier("Finished creating file and dataset")
reportbarrier("Writing T")

report and reportbarrier just print the message with the current time (the latter calls an MPI barrier first).

This produces a 16TB file (I can see it in the directory), but it then never reaches the reportbarrier("Finished creating file and dataset") line. I've been running it for more than 10 minutes now. I imagine it's zero-filling the chunks of the rows dataset, since I read somewhere that HDF5 does this by default for parallel I/O. I'm going to completely fill the matrix, so I have no need for zero-filling. Is there a way to turn this off using the high-level interface?

Or if not, how can I accomplish this using the low-level interface?

Also, the file's access time hasn't been updated since it was created more than 10 minutes ago, so what's going on? I'm still assuming zero-filling, but shouldn't that continually change the access time?

Thanks,
Alex

Alex Gittens

Mar 31, 2016, 9:55:10 PM
to h5py
Got it:

report("Creating dataset")
#rows = fout.create_dataset("rows", (numRows, numCols), dtype=np.float64, chunks=(colchunk, rowchunk))

# create the rows dataset using the low-level api so I can force it to not do zero-filling, then convert to a high level object
spaceid = h5py.h5s.create_simple((numRows, numCols))
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
plist.set_chunk((rowchunk, colchunk))
datasetid = h5py.h5d.create(fout.id, "rows", h5py.h5t.NATIVE_DOUBLE, spaceid, plist)
rows = h5py.Dataset(datasetid)

reportbarrier("Finished creating dataset")

This finishes creating the file in about 1 minute 20 seconds.
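
For anyone who wants to try the low-level approach above without an MPI setup, here is a minimal serial sketch. It is not the code from the original job: the helper name create_dataset_without_fill, the tiny 64 x 32 shape, the (16, 8) chunking, and the in-memory "core" driver are all assumptions made for illustration. It also encodes the dataset name to bytes, which the low-level h5py API expects under Python 3. The sketch only shows that the fill-time setting ends up in the dataset creation property list and that a fully written dataset reads back correctly; it says nothing about timing on a parallel filesystem.

```python
import h5py
import numpy as np


def create_dataset_without_fill(h5file, name, shape, chunks):
    """Create a chunked float64 dataset whose fill time is set to 'never'.

    Hypothetical helper for illustration; it mirrors the low-level calls
    used in the post above.
    """
    spaceid = h5py.h5s.create_simple(shape)
    plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    # Ask HDF5 not to write fill values into newly allocated chunks.
    plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
    plist.set_chunk(chunks)
    # The low-level API takes the dataset name as bytes under Python 3.
    datasetid = h5py.h5d.create(
        h5file.id, name.encode("utf-8"), h5py.h5t.NATIVE_DOUBLE, spaceid, plist
    )
    # Wrap the low-level id in a high-level Dataset object.
    return h5py.Dataset(datasetid)


if __name__ == "__main__":
    # Assumption: an in-memory file (core driver, no backing store) so the
    # sketch leaves nothing on disk. The original job used driver="mpio".
    with h5py.File("fill_time_demo.h5", "w", driver="core", backing_store=False) as f:
        rows = create_dataset_without_fill(f, "rows", (64, 32), (16, 8))
        data = np.arange(64 * 32, dtype=np.float64).reshape(64, 32)
        # Every element is written, so no fill values are ever needed.
        rows[...] = data
        dcpl = rows.id.get_create_plist()
        print("fill time is NEVER:", dcpl.get_fill_time() == h5py.h5d.FILL_TIME_NEVER)
        print("round trip ok:", np.array_equal(rows[...], data))
```

Since elements that are never written hold undefined contents when the fill time is "never", this only makes sense when every element gets written, as in the original question.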