GTO and EMP

Markus Nordenstam

Aug 21, 2009, 5:40:56 AM

to open-gto-discussion

Hello Jim, Chris (fire's looking good), and Mike!

It's been a while since I looked at GTO, and I'm currently using a
simple, home-brew format called EMP simply because I needed to get
certain features (such as lazy file I/O) done very quickly and didn't
have time to investigate how easy that would be using GTO.

However, the software I'm working on (Naiad) would love to be part of
the "community" in as many ways as possible. I know the Tippett guys
are planning on open-sourcing their GTO import Naiad-NOP and that's
really exciting -- but I think perhaps the EMP format itself could
evolve and I can see three different scenarios:

1) EMP format switches to GTO under the hood -- the API to access the
data in the EMP would not change, but being a full-fledged GTO, one
could also access the data using the standard GTO approach

2) EMP format switches to HDF5 under the hood, leveraging its built-in
threaded I/O capabilities

3) EMP format continues evolving on its own, should (1) or (2) not be
feasible for what I need to do.

So, the two critical things EMPs need to do are:

a) threaded file I/O - the simplest way I can think of would be to say
"here is a 'read/write request' which can be arbitrarily simple or
complex, and please now perform this request using 'n' threads". How
would you express this using the existing/proposed GTO API?

b) partial reads - this perhaps mostly comes down to how I choose to
store the data as opposed to the GTO format. But the idea is that I
can choose to read only specific parts of the GTO... This is possible
now, isn't it?

Either way, GTO will definitely be supported in Naiad, out of the box,
no matter what -- if only for mesh/particle data.. But the reason I'm
on this discussion is simply to see if my "default" EMP format could
perhaps also be a GTO (but just using our own 'EMP' protocol...)

(Just to be clear, when I say 'GTO', I mean Open-GTO 1.0 or GTO 4 or
whatever you guys want to call it)

cheers
Marcus

Christopher Horvath

Aug 21, 2009, 12:41:44 PM

to open-gto-discussion

Hey Marcus! It's fantastic to hear from you!

Can you clearly list what your primary design and technical priorities
for EMP are, so we can include them (or keep them in mind for the
future) when making any changes to the next revision of GTO?

The big changes we have in front of us are primarily streaming and a
cleanup/expansion of our protocols. I feel like we'd benefit from
being able to specify multiple protocols for an object, but it's not
an essential feature, and could be potentially confusing.

Another area that needs fairly intensive discussion and design is
"referenced" GTO files. In other words, when you have an extremely
large scene that is broken up into pieces, you probably don't want to
store all of those pieces in a single file, but there need to be
relationships between objects in different files (or in "difference"
files representing time samples). How are those things expressed?
Right now I have lots of ad-hoc solutions that often
involve string attributes with file path names, and this is very very
fragile. I feel like this will probably affect what you're doing,
because the size of outputs from giant "exotic matter" simulations
will certainly be too big to put into a single file. This is
essentially the 'non-local-relationship' problem that I'm sure you
still have scars from back here.

Chris

On Aug 21, 2:40 am, Markus Nordenstam <markus.nordens...@gmail.com>
wrote:

Jim Hourihan

Aug 21, 2009, 1:15:31 PM

to open-gto-...@googlegroups.com

On Aug 21, 2009, at 2:40 AM, Markus Nordenstam wrote:

>
> Hello Jim, Chris (fire's looking good), and Mike!
>
> It's been a while since I looked at GTO, and I'm currently using a
> simple, home-brew format called EMP simply because I needed to get
> certain features (such as lazy file I/O) done very quickly and didn't
> have time to investigate how easy that would be using GTO.

Marcus! You beat me to the EMP name! I was thinking about that for
this multimedia format (which would be GTO based). Now I have to call
it something boring!

>
> However, the software I'm working on (Naiad) would love to be part of
> the "community" in as many ways as possible. I know the Tippett guys
> are planning on open-sourcing their GTO import Naiad-NOP and that's
> really exciting -- but I think perhaps the EMP format itself could
> evolve and I can see three different scenarios:

Do you have some whitepaper or something on EMP we can read? Or even a
basic description?

> 1) EMP format switches to GTO under the hood -- the API to access the
> data in the EMP would not change, but being a full-fledged GTO, one
> could also access the data using the standard GTO approach

Are your users heavily invested in the EMP API? Can we see it?
Maybe there's a happy compromise or maybe we can help you have a
second output method using GTO.

> 2) EMP format switches to HDF5 under the hood, leveraging its built-in
> threaded I/O capabilities

I've been looking at HDF too. We could make a parallel reader on top
of the existing one without too much of an issue or adopt HDF as the
format on disk. I think I sent you email a while back about real
(measured) issues with parallel I/O cross platform and how each method
we tried succeeded or failed. Basically, there is no right way to do
it: it's always a function of the I/O latency, bandwidth, the CPUs, and
kernel. So for example parallel I/O over NFS to a SAN has completely
different requirements from parallel I/O over fibre channel to a RAID.

There are actually other ways to saturate the network (like tsunami).
But these require a specialized server.

> 3) EMP format continues evolving on its own, should (1) or (2) not be
> feasible for what I need to do.
>
> So, the two critical things EMPs need to do are:
>
> a) threaded file I/O - the simplest way I can think of would be to say
> "here is a 'read/write request' which can be arbitrarily simple or
> complex, and please now perform this request using 'n' threads". How
> would you express this using the existing/proposed GTO API?

We'd need a class on top of Gto::Reader which opened the file N times
for N threads and shared header information. There are definitely
requirements on the way the file is laid out to make it work
efficiently. I haven't tested HDF performance for this on real
network + hardware combinations. For rv we've seen it run on a number
of different configurations, and it requires tuning for each
configuration + file format.
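
To make the reader-per-thread idea concrete, here is a minimal Python sketch. It does not use Gto::Reader or any other GTO call: `parallel_read`, `requests`, and the (offset, size) request form are hypothetical stand-ins, and it assumes the header has already been parsed once into a list of byte ranges that every thread can share. Each worker opens its own handle on the file, so no seek position is shared between threads.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def parallel_read(path, requests, n_threads):
    """Service (offset, size) requests with n_threads workers.

    `requests` plays the role of shared header information (hypothetical).
    Results are returned in the same order as the requests.
    """
    indexed = list(enumerate(requests))
    # Deal the requests round-robin so each worker gets its own slice.
    slices = [indexed[i::n_threads] for i in range(n_threads)]
    results = [None] * len(requests)

    def run(work):
        # One private file handle per worker: the file is opened N times.
        with open(path, "rb") as f:
            for index, (offset, size) in work:
                f.seek(offset)
                results[index] = f.read(size)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(run, slices))  # consuming the iterator surfaces worker errors
    return results


def demo(n_threads=4):
    """Write known bytes to a temp file, read them back in 4 KiB chunks."""
    payload = bytes(range(256)) * 64
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    try:
        chunk = 4096
        requests = [(off, chunk) for off in range(0, len(payload), chunk)]
        return b"".join(parallel_read(path, requests, n_threads)) == payload
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Whether this beats a single sequential read depends on the storage and network behind the file, so the thread count is best treated as a tuning parameter rather than a fixed setting.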

>
> b) partial reads - this perhaps mostly comes down to how I choose to
> store the data as opposed to the GTO format. But the idea is that I
> can choose to read only specific parts of the GTO... This is possible
> now, isn't it?

Yes, you can also do pure random access where you specifically ask for
data instead of being asked about it. Right now the random access is
limited to whole objects on the coarse side and single properties on
the granular side. Since you have a lot of particles we may need to do
something like Nick is doing at unnamed company with their particles
(bucket components) or allow for random access inside a property
(which would only be an issue when zlib compressed although multi-
threading that might eliminate those issues).
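
Random access inside a property could look roughly like the following, assuming the property data is stored uncompressed as fixed-width little-endian elements. `read_elements`, `data_offset`, and the toy 16-byte header are assumptions made for illustration; they are not part of the GTO API or file layout.

```python
import io
import struct


def read_elements(stream, data_offset, width, first, count, fmt="f"):
    """Read `count` elements starting at element `first`.

    Each element is `width` values of struct format `fmt`, little-endian,
    stored back to back starting at byte `data_offset` (hypothetical layout).
    """
    elem_size = struct.calcsize("<" + fmt) * width
    stream.seek(data_offset + first * elem_size)
    raw = stream.read(count * elem_size)
    values = struct.unpack("<%d%s" % (count * width, fmt), raw)
    # Regroup the flat values into one tuple per element.
    return [values[i:i + width] for i in range(0, len(values), width)]


def demo():
    """Five float[3] positions behind a 16-byte stand-in header."""
    flat = []
    for i in range(5):
        flat.extend((float(i), i + 0.5, i + 0.25))
    blob = b"\0" * 16 + struct.pack("<15f", *flat)
    # Fetch elements 2 and 3 without touching the rest of the property.
    return read_elements(io.BytesIO(blob), 16, 3, first=2, count=2)


if __name__ == "__main__":
    print(demo())
```

The offset arithmetic is what stops working once the property is stored as a single zlib stream: the byte position of element `first` can no longer be computed without decompressing what precedes it.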

BTW, one HDF feature which I think we might want to implement in GTO
is a tuple property which is user defined. That might give you better
locality of data for parallel I/O and tree structures in the file.
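
One way to picture the locality argument, assuming a tuple property means interleaved fixed-size records (similar in spirit to HDF5 compound datatypes): the sketch below computes which byte ranges must be read to fetch a run of particles under each layout. The function names, the property widths, and the 4-byte value size are hypothetical.

```python
def ranges_separate(n_particles, widths, first, count, value_size=4):
    """Byte ranges needed when each property is its own contiguous array."""
    ranges = []
    base = 0
    for width in widths:
        elem = width * value_size
        ranges.append((base + first * elem, count * elem))
        base += n_particles * elem  # the next property starts after this array
    return ranges


def ranges_tuple(widths, first, count, value_size=4):
    """Byte range needed when all properties are interleaved per particle."""
    record = sum(widths) * value_size
    return [(first * record, count * record)]


if __name__ == "__main__":
    # position (3 floats), velocity (3 floats), mass (1 float)
    widths = [3, 3, 1]
    print(ranges_separate(1000, widths, first=100, count=10))
    print(ranges_tuple(widths, first=100, count=10))
```

Both layouts read the same number of bytes for the same particles, but the interleaved form needs one contiguous range where the separate-array form needs one range per property, which is the kind of locality that may help a bucketed or parallel reader.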

-Jim
