Possible improvements to the measurement group

Darren Dale

Jan 26, 2010, 4:37:53 PM

to ma...@googlegroups.com

This discussion started in a thread titled "quick comment about
multidimensional hdf5 arrays".

Matt Newville wrote:
> If I understand correctly, some names and layout will be mandated,
> perhaps as
> Entry/Positioners/
> and
> Entry/Detectors/
> (All names here being negotiable, decorating with "NX_" as desired).

I have been using:

/entry1/measurement/scalar_data   # signals and scanned positioners
/entry1/measurement/positioners   # starting positions of every motor on the beamline
/entry1/measurement/mca1

That is, the raw data is contained in a measurement group, and what
you are calling positioners and detectors are collected into
scalar_data. I mention it only as a reminder of how I have been doing
it for some time; the organization you suggest up to this point is
similar in its essentials.

> If that is already the case, I don't see that allowing
> Entry/Detectors/Canberra Ge Elem #3/
> with attributes including (type='MCA', npts=2048, dimension=1)
>
> is all that much better than
> Entry/Detectors/mca003/
> with attributes including
> (label='Canberra MED Ge Element #3, APS DetectorPool GE2',
> npts=2048, dimension=1)

I think the latter still needs a type attribute, but otherwise I am with you.

> Having to deal with whitespace, punctuation, (and unicode?) in dataset
> names seems like a bad idea to me. The detector label is meant to be
> read by a human, but the dataset name will need to be traversed by the
> reading program.

I agree, it's a bad idea. In practice, though, we have a lot of spec
data files here at the lab that contain spaces in the names of
positioners or counters, and I would have a hard time arguing to
someone that "you can't use names with spaces" when hdf5 in fact
allows it. Discouraging them is ok, but disallowing them means
enforcing...

> The First approach would require looking for the
> required 'type' attribute (which must be one of some pre-defined list
> of valid types) to understand what the data was, and guarantees a
> user-defined label. The Second approach means the dataset name itself
> tells the type (and the dataset name cannot be absent!), and has a
> user-defined label as an attribute to help understand the values in
> the dataset.

Ok.
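
To make the two approaches concrete, here is a minimal sketch in plain
Python. Dicts stand in for HDF5 datasets and their attributes. The list
of valid types, the name pattern, and both helper functions are
hypothetical and not part of any agreed format.

```python
import re

# Hypothetical list of predefined detector types (not from any agreed spec).
VALID_TYPES = {"MCA", "ROI", "IonChamber"}

# Hypothetical naming convention for the second approach: "mca003" -> MCA.
NAME_PATTERN = re.compile(r"^(mca|roi|ic)(\d+)$")
PREFIX_TO_TYPE = {"mca": "MCA", "roi": "ROI", "ic": "IonChamber"}


def type_from_attribute(attrs):
    """First approach: the name is free-form, so 'type' is required."""
    dtype = attrs.get("type")
    if dtype not in VALID_TYPES:
        raise ValueError(f"missing or unknown detector type: {dtype!r}")
    return dtype


def type_from_name(name):
    """Second approach: the dataset name itself encodes the type."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return PREFIX_TO_TYPE[match.group(1)]


# Stand-ins for the detectors group under each approach.
first = {"Canberra Ge Elem #3": {"type": "MCA", "npts": 2048, "dimension": 1}}
second = {
    "mca003": {
        "label": "Canberra MED Ge Element #3, APS DetectorPool GE2",
        "npts": 2048,
        "dimension": 1,
    }
}

types_first = {name: type_from_attribute(attrs) for name, attrs in first.items()}
types_second = {name: type_from_name(name) for name in second}
```

Either way the reader ends up with the same type; the difference is
whether a missing attribute or a nonconforming name is the failure mode.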

>>> If I typically collect 3 ion chamber intensities and 40 ROIs in
>>> addition to 4 x 2048 MCAs at each point in the scan, how many detector
>>> datasets should I have? I can see how it could be 1, 2, 44 or 47:
>>> 1: 1 array of (npts, 43+4*2048)
>>
>> Yuck.
>>
>>> 2: 1 array of (npts,43) 1 array of (npts, 4, 2048)
>>> 44: 43 arrays of (npts,) 1 array of (npts, 4, 2048)
>>> 47: 43 arrays of (npts,) 4 arrays of (npts, 2048)
>>
>> I think 44 or 47 makes sense. Armando suggested that if you have
>> independent calibrations for each of your 4 MCAs, it makes sense to
>> separate them. I agree.
>
> That's a good point, and I can certainly see how 47 datasets looks
> like the most reasonable solution here.
>
> Well, except that this assumes that calibration information is best
> held in attributes, which may or may not be the case. It may be best
> to store an (npts, 4, 2048) array for the data and a (4, 2048) array
> for the MCA Energies at each Channel.

In which case, maybe it makes sense for the detector to be a group,
and the spectra and the calibration would be datasets in that group.
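
A minimal sketch of that layout, with nested dicts standing in for HDF5
groups and datasets. The names mca1..mca4, spectra, and energies, the
labels, and the tiny array sizes are placeholders for illustration, not
a proposal.

```python
NPTS, NCHAN = 3, 8  # tiny stand-ins for npts and the 2048 channels


def shape(nested):
    """Return the shape of a rectangular nested list, like an array shape."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def make_detector(label):
    """Build one detector group: attributes plus two datasets."""
    return {
        "attrs": {"type": "MCA", "label": label},
        "spectra": [[0] * NCHAN for _ in range(NPTS)],  # (npts, nchan)
        "energies": [0.0] * NCHAN,                       # (nchan,)
    }


detectors = {f"mca{i}": make_detector(f"element {i}") for i in range(1, 5)}

for name, group in detectors.items():
    # Each spectrum must line up with its own group's energy axis.
    assert shape(group["spectra"]) == (NPTS, len(group["energies"]))
```

Keeping the calibration next to the spectra means each of the 4 MCAs
carries its own energy axis, which matches the argument for separating
them.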

> I think this just expresses the inherent tension of when something is
> an attribute and when it is data. My bias is to use attributes
> sparingly, and use them for labels, flags, and descriptive integers
> (dimensions, etc) that describe or modify the data contained in the
> dataset. If an attribute is required for all Detectors or Positions,
> I tend to think something may be wrong.

Ok.

> But perhaps, especially given the "intrinsic_dimensionality"
> confusion, we should postpone the Detectors discussion and limit the
> topic to how to organize the Positioners Group.
>
>>> OK, I got it now. Thanks, and sorry for adding noise. If detectors
>>> are stored with their "intrinsic dimension" at each point in the scan
>>> (seems to be the consensus), wouldn't the shape of the data array be
>>> self-describing?
>>
>> Only for the case where any scan is flattened into (npts,...) rather
>> than (nx,ny,...). If datasets are structured according to the latter
>> scheme, which appears to be the normal way for nexus, then you need
>> more information to know whether a 3-dimensional dataset, taken out of
>> context, is a 2D scan of spectra, a 3D scan of a counter, or a 1D scan
>> of images.
>
> I'm definitely in favor of flattening the positioners from (nx,
> ny,...) to (npts,...). But, as above, what is less clear to me is how
> to store the Positioner data beyond that. One could have
>
> Entry/Positioners/Positions (npts, NPositioners)
> Entry/Positioners/PositionLabels (NPositioners)
> Entry/Positioners/PositionAddrs (NPositioners)
>
> at each Point i in the scan, the values of all relevant positioners
> are recorded. All their labels and addresses are also stored. As a
> bonus, the need for the Positioners Group seems weak, which could
> flatten the structure.

Or save it in an hdf5 table. (I'm not advocating for this.)

> I have the sense that some would prefer
> Entry/Positioners/Position1 (npts,) attributes: label, address
> Entry/Positioners/Position2 (npts,) attributes: label, address
> ...
> Entry/Positioners/PositionN (npts,) attributes: label, address
>
> At this point, I have a slight preference for the first variation, as
> it has predictable names for datasets, 2-D array of numerical data,
> and replaces attributes which are required by every dataset with
> "attribute arrays".

I would prefer that a positioner's data and attributes (like units)
be encapsulated in a single entity: a dataset. Plus, I think it
would be more difficult to work interactively with such monolithic
data structures. I guess it would be possible to extend phynx so that
it inspected an array of position labels and made those labels
available for tab completion and for dictionary-like access, but I
still have a very strong preference for the second approach, with more
descriptive labels than Position1 (we already know it is a position,
since it's in the Positioners group).
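
For comparison, a minimal sketch of the two layouts in plain Python,
with dicts standing in for HDF5 groups and datasets. The motor names,
addresses, and values are invented for illustration, and by_label is a
hypothetical helper of the kind a reader would need for the first
variation; it is not the phynx API.

```python
# First variation: one 2-D array plus parallel "attribute arrays".
monolithic = {
    "Positions": [[0.0, 10.0], [0.5, 10.0], [1.0, 10.0]],  # (npts, NPositioners)
    "PositionLabels": ["theta", "sample x"],
    "PositionAddrs": ["motor:1", "motor:2"],
}


def by_label(group):
    """Rebuild per-positioner columns from the monolithic layout."""
    columns = zip(*group["Positions"])  # transpose rows into columns
    return {
        label: {"data": list(column), "attrs": {"address": addr}}
        for label, addr, column in zip(
            group["PositionLabels"], group["PositionAddrs"], columns
        )
    }


# Second variation: one dataset per positioner, carrying its own attributes
# and named by a descriptive label rather than Position1, Position2, ...
per_dataset = {
    "theta": {"data": [0.0, 0.5, 1.0], "attrs": {"address": "motor:1"}},
    "sample x": {"data": [10.0, 10.0, 10.0], "attrs": {"address": "motor:2"}},
}

# Both layouts hold the same information; the second needs no rebuild step.
assert by_label(monolithic) == per_dataset
```

The information content is identical, so the choice is about which
layout is easier to work with interactively.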

> But we should probably think about how to move from long email thread
> to actual written documents. Google Wave doesn't seem quite ready for
> us (or vice versa). Perhaps a wiki or group-wide Google Doc?

Wave would be my first choice (by far), but that suggestion was not
well received. I don't have a preference between the other two; they
are both good suggestions. Someone (Carlos?) was uncomfortable with
requiring a Google account. Is that necessary for Google Docs? I have
a wiki at http://dale.chess.cornell.edu/chess-wiki, which is
write-protected to prevent spam.