On Aug 21, 2009, at 2:40 AM, Markus Nordenstam wrote:
>
> Hello Jim, Chris (fire's looking good), and Mike!
>
> It's been a while since I looked at GTO, and I'm currently using a
> simple, home-brew format called EMP simply because I needed to get
> certain features (such as lazy file I/O) done very quickly and didn't
> have time to investigate how easy that would be using GTO.
Markus! You beat me to the EMP name! I was thinking about that for
this multimedia format (which would be GTO based). Now I have to call
it something boring!
>
> However, the software I'm working on (Naiad) would love to be part of
> the "community" in as many ways as possible. I know the Tippett guys
> are planning on open-sourcing their GTO import Naiad-NOP and that's
> really exciting -- but I think perhaps the EMP format itself could
> evolve and I can see three different scenarios:
Do you have some whitepaper or something on EMP we can read? Or even a
basic description?
> 1) EMP format switches to GTO under the hood -- the API to access the
> data in the EMP would not change, but being a full-fledged GTO, one
> could also access the data using the standard GTO approach
Are your users heavily invested in the EMP API? Can we see it?
Maybe there's a happy compromise or maybe we can help you have a
second output method using GTO.
> 2) EMP format switches to HDF5 under the hood, leveraging its built-in
> threaded I/O capabilities
I've been looking at HDF too. We could make a parallel reader on top
of the existing one without too much of an issue or adopt HDF as the
format on disk. I think I sent you an email a while back about real
(measured) issues with parallel I/O cross platform and how each method
we tried succeeded or failed. Basically, there is no right way to do
it: it's always a function of the I/O latency, bandwidth, the CPUs, and
kernel. So for example parallel I/O over NFS to a SAN has completely
different requirements from parallel I/O over fibre channel to a RAID.
There are actually other ways to saturate the network (like tsunami).
But these require a specialized server.
> 3) EMP format continues evolving on its own, should (1) or (2) not be
> feasible for what I need to do.
>
> So, the two critical things EMPs need to do are:
>
> a) threaded file I/O - the simplest way I can think of would be to say
> "here is a 'read/write request' which can be arbitrarily simple or
> complex, and please now perform this request using 'n' threads". How
> would you express this using the existing/proposed GTO API?
We'd need a class on top of Gto::Reader which opened the file N times
for N threads and shared header information. There are definitely
requirements on the way the file is laid out to make it work
efficiently. I haven't tested HDF performance for this on real
network + hardware combinations. For rv we've seen it run on a number
of different configurations and it requires tuning for each
configuration + file format.
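
A minimal sketch of the "open the file N times, share the header" idea, written in Python against a toy file layout rather than the real Gto::Reader API. The header format, `write_demo_file`, `read_header`, and `parallel_read` are all hypothetical names made up for illustration; a real GTO file has a much richer header.

```python
import os
import struct
import tempfile
import threading

# Hypothetical toy header: (record_count, record_size). Not the GTO header.
HEADER = struct.Struct("<II")

def write_demo_file(path, records):
    """Write a toy file: fixed header followed by equal-sized records."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(records), len(records[0])))
        for r in records:
            f.write(r)

def read_header(path):
    """Parse the header once; the result is shared read-only by all threads."""
    with open(path, "rb") as f:
        return HEADER.unpack(f.read(HEADER.size))

def parallel_read(path, n_threads):
    """Read every record using n_threads, one file handle per thread."""
    count, size = read_header(path)
    results = [None] * count

    def worker(tid):
        # Contiguous slice of records for this thread.
        first = tid * count // n_threads
        last = (tid + 1) * count // n_threads
        # Each thread opens its own handle, so seeks never interfere.
        with open(path, "rb") as f:
            f.seek(HEADER.size + first * size)
            for i in range(first, last):
                results[i] = f.read(size)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Giving each thread a contiguous range only pays off if the file is laid out so that related data is adjacent, which is the layout requirement mentioned above. Whether this is faster than a single reader still depends on the storage and network underneath.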
>
> b) partial reads - this perhaps mostly comes down to how I choose to
> store the data as opposed to the GTO format. But the idea is that I
> can choose to read only specific parts of the GTO... This is possible
> now, isn't it?
Yes, you can also do pure random access where you specifically ask for
data instead of being asked about it. Right now the random access is
limited to whole objects on the coarse side and single properties on
the granular side. Since you have a lot of particles we may need to do
something like Nick is doing at unnamed company with their particles
(bucket components) or allow for random access inside a property
(which would only be an issue when zlib compressed, although
multi-threading that might eliminate those issues).
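
One way random access inside a zlib-compressed property could work is to compress the property in independent blocks, so a sub-range only touches the blocks it overlaps. The sketch below is an assumption about how that might look, not something GTO does today; `BLOCK`, `compress_blocks`, and `read_range` are hypothetical names.

```python
import zlib

BLOCK = 1024  # hypothetical block size, in elements (not a GTO setting)

def compress_blocks(values, elem_size):
    """Compress a flat bytes buffer as a list of independent zlib blocks."""
    step = BLOCK * elem_size
    return [zlib.compress(values[i:i + step]) for i in range(0, len(values), step)]

def read_range(blocks, elem_size, first, count):
    """Return elements [first, first + count), inflating only the needed blocks."""
    if count <= 0:
        return b""
    lo = first // BLOCK                  # first block that overlaps the range
    hi = (first + count - 1) // BLOCK    # last block that overlaps the range
    raw = b"".join(zlib.decompress(blocks[b]) for b in range(lo, hi + 1))
    start = (first - lo * BLOCK) * elem_size
    return raw[start:start + count * elem_size]
```

Because the blocks are independent, they could also be inflated by separate threads, at the cost of a somewhat worse compression ratio than one stream over the whole property.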
BTW, one HDF feature which I think we might want to implement in GTO
is a tuple property which is user defined. That might give you better
locality of data for parallel I/O and tree structures in the file.
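
To illustrate the locality point, here is a small sketch of a user-defined tuple stored interleaved, so that a range of particles is a single contiguous read instead of one read per property. The field layout and the names `PARTICLE`, `pack_particles`, and `read_particles` are assumptions for illustration only, not an existing GTO or HDF API.

```python
import struct

# Hypothetical tuple layout: position (3 floats), velocity (3 floats), id (uint32).
PARTICLE = struct.Struct("<3f3fI")

def pack_particles(particles):
    """Interleave (position, velocity, id) tuples into one byte buffer."""
    return b"".join(PARTICLE.pack(*p, *v, pid) for p, v, pid in particles)

def read_particles(buf, first, count):
    """Decode `count` particles starting at `first` from one contiguous slice."""
    start = first * PARTICLE.size
    chunk = buf[start:start + count * PARTICLE.size]
    out = []
    for fields in PARTICLE.iter_unpack(chunk):
        out.append((fields[0:3], fields[3:6], fields[6]))
    return out
```

With separate position, velocity, and id properties, the same range would need three seeks into three different regions of the file.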
-Jim