"killer" data format?

Jeremy

Apr 26, 2011, 7:59:10 AM
to Open Stimulus Developers
Hi all,

As we are all aware, there are quite a few great data collection tools
out there (pick your favorite :-) ), and there are also quite a few
strong stats packages for analyses (like R, matlab, or scipy, and lots
of psychologists use SPSS). But there's not so much in between, as far
as I know. It would be cool to have something like E-Prime's
E-DataAid, but open source and able to work with output from whatever tools
best meet your needs for data collection. For this, and perhaps for
other reasons too, I'm thinking it would be nice to have a
standardized data format.

A standard in the sense I am talking about is completely platform and
software agnostic. It's a description, not an implementation. NIfTI is
an example of a standard for saving imaging data.

What would it take to arrive at a defined, open standard for psych /
human neuro data output? What are the pros and cons? What would a
"killer data format" (.kdf) look like? If it seems worthwhile, what
considerations would there be for such a standard (e.g., relatively
human-readable and .txt based, relatively gloppy and .xml based, etc.)?

Pros: If a standardized format were enabled as an option by enough
data collection packages, it could facilitate the development and
sharing of tools for data reduction and quality control. Some
consensus on what "should" be recorded is also some indication of best
practices for data management. It might also be helpful for archiving
or data accession. Once there's a .kdf specification, people could
write translators, e.g., to arrive at data that can be read by R,
python, matlab, or whatever.

Cons: It's some work to define a standard, let alone implement it for a
given software package.

Some preliminary thoughts:
- a data file will generally have information about the specific data
collection event, or RUN (software version, experiment script version,
date & time, random seed, maybe experiment condition(s) / subject
group, experimenter, etc), about the SUBJECT (subj number, handedness,
maybe age, sex, gave consent or assent), and then there are generally
a series of trials of some sort.

- a TRIAL generally consists of an onset time, a set of stimuli
(possibly each having their own onset times and durations within the
trial), subject response(s) consisting of response + response time,
and so on. The usual stuff.

- YAML seems pretty good as a general purpose standard container for
data (http://yaml.org/). It's data oriented, and a lot easier on the
eyes than xml. It's been around for about 10 years, and if you believe
wikipedia it looks like it has support in lots of languages, including
perl, C, python, C++, matlab, R, and others. So using YAML would
sidestep the need to write a file parser in most cases, while allowing
the output to just look sensible to most people. It's also quite
grep-able and amenable to writing custom parsers. So it's more
structured than plain text, less gloppy than xml, and implemented in
many languages.

The idea would be to standardize the way that standard info is saved
into a data file. There should be some provision for saving
nonstandard things, and for extending an original standard.

Thoughts?

--Jeremy

Jeremy Gray

Apr 26, 2011, 10:11:03 AM
to Open Stimulus Developers
Here's a more concrete example of what I'm thinking about, to help ground the discussion. Say you have a file 'data.kdf' that contains the text between the start-of-file and end-of-file markers:

----start file----
run_info:
    date: 2011 04 26
    script:
        name:  example_MR_script.py
        sha1_digest: 044db3cbb2b27a09ce6bbb2a1d9988a5e4cc1571
        start_time: 09:19.45.230
    scanner:
        whole_brain_TR_sec: 2.000
        scanner: MRRC timtrioa
   
subject_info:
    subj_ID: 0123x
    sex: male
    age: 23
    group: trained
    consenter: JRG
   
instructions_001:
    abs_onset: 0.042
    duration: 12.221
   
trial_0001:
    abs_onset: 12.321
    name: silly trial
    stimulus: press 2
    key_response: 1
    correct: False
    key_response_RT: 0.654
    post_trial_ITI: 5.000

trial_0002:
    abs_onset: 18.345
    name: silly trial
    stimulus: press 2
    key_response: 2
    correct: True
    key_response_RT: 0.556
    post_trial_ITI: 3.000
   
----end of file----

It's quite readable, and it is easy to load as a data structure in your language of choice. In python, with just two lines of code, your data is in memory and addressable:
>>> import yaml
>>> data = yaml.safe_load(open('/Users/jgray/Desktop/data.kdf'))
>>> data['run_info']['scanner']['whole_brain_TR_sec']
2.0
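
As a rough sketch of the kind of translator mentioned earlier, the loaded structure could be flattened to one CSV row per trial for R, SPSS, or a spreadsheet. The helper name `trials_to_csv` and the convention that trial keys start with `trial_` are assumptions for illustration, not part of any agreed spec; the dict literal is an abbreviated stand-in for what the session above would hold in `data`.

```python
import csv
import io

# Abbreviated stand-in for the dict loaded from data.kdf.
data = {
    "subject_info": {"subj_ID": "0123x", "group": "trained"},
    "trial_0001": {"abs_onset": 12.321, "key_response": 1,
                   "correct": False, "key_response_RT": 0.654},
    "trial_0002": {"abs_onset": 18.345, "key_response": 2,
                   "correct": True, "key_response_RT": 0.556},
}

def trials_to_csv(data, prefix="trial_"):
    """Return CSV text with one row per trial, ordered by trial key."""
    trial_keys = sorted(k for k in data if k.startswith(prefix))
    # Collect field names in first-seen order so trials with extra fields fit.
    fields = []
    for key in trial_keys:
        for name in data[key]:
            if name not in fields:
                fields.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["subj_ID", "trial"] + fields,
                            restval="", lineterminator="\n")
    writer.writeheader()
    for key in trial_keys:
        row = {"subj_ID": data["subject_info"]["subj_ID"], "trial": key}
        row.update(data[key])
        writer.writerow(row)
    return out.getvalue()

csv_text = trials_to_csv(data)
print(csv_text)
```

Run-level and subject-level fields are repeated on every row here only because flat tables have nowhere else to put them; the nested file itself stores them once.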

In terms of this example (YAML just for illustrative purposes), what I mean by arriving at a specification would include decisions about:
- whether to use YAML or something else
- what encoding (utf-8 or something else)
- canonical key names to use for the associative arrays / python dicts
- default units to use for timing (e.g., seconds versus milliseconds)
- date & time format to use
- whether repeated things, like trials, should actually be written out as a list / array / ordered sequence (which YAML supports), or whether they should just be associative arrays / dicts with keys that can ensure unambiguous ordering in subsequent processing (as is the case above).
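
To make the last point concrete, here is a minimal sketch (with made-up trial contents) of how the two layouts behave in python. Dict keys only recover the running order if they sort correctly, which is why the example file uses zero-padded names; a list carries its order with it.

```python
# Layout A: trials as an associative array. Zero-padded keys sort in run order.
padded = {"trial_0010": "j", "trial_0002": "b", "trial_0001": "a"}
ordered = [padded[key] for key in sorted(padded)]
print(ordered)  # ['a', 'b', 'j']

# Without padding, plain string sorting puts trial_10 before trial_2.
unpadded_keys = sorted(["trial_1", "trial_2", "trial_10"])
print(unpadded_keys)  # ['trial_1', 'trial_10', 'trial_2']

# Layout B: trials as a sequence. Order is implicit, no key convention needed.
trials = [{"name": "a"}, {"name": "b"}, {"name": "j"}]
names_in_order = [trial["name"] for trial in trials]
print(names_in_order)  # ['a', 'b', 'j']
```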

Plus, a spec would include some thoughts about which fields are required for a data file to be considered "up to spec", which are suggested but not required, which are forbidden, and so on. These are fairly boring decisions individually, but the set of them is what defines a standard.
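
To illustrate what checking a file against such a spec could look like, here is a minimal sketch of a checker. The particular required, suggested, and forbidden key names are placeholders chosen for the example (`subject_name` in particular is invented); a real spec would have to settle them.

```python
# Hypothetical rule sets: placeholders, not an agreed standard.
REQUIRED = {"run_info": ["date", "script"], "subject_info": ["subj_ID"]}
SUGGESTED = {"subject_info": ["sex", "age"]}
FORBIDDEN = {"subject_info": ["subject_name"]}  # e.g. identifying details

def check_spec(data):
    """Return (errors, warnings) for a loaded data file."""
    errors, warnings = [], []
    for section, keys in REQUIRED.items():
        if section not in data:
            errors.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in data[section]:
                errors.append(f"missing required key: {section}.{key}")
    for section, keys in SUGGESTED.items():
        for key in keys:
            if key not in data.get(section, {}):
                warnings.append(f"missing suggested key: {section}.{key}")
    for section, keys in FORBIDDEN.items():
        for key in keys:
            if key in data.get(section, {}):
                errors.append(f"forbidden key present: {section}.{key}")
    return errors, warnings

# A deliberately incomplete file: no script info, no age.
example = {
    "run_info": {"date": "2011 04 26"},
    "subject_info": {"subj_ID": "0123x", "sex": "male"},
}
errors, warnings = check_spec(example)
print(errors)    # ['missing required key: run_info.script']
print(warnings)  # ['missing suggested key: subject_info.age']
```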

--Jeremy

Sebastiaan Mathot

Apr 27, 2011, 5:30:30 AM
to Open Stimulus Developers
Hi Jeremy,

I would certainly support any standardized format in OpenSesame, at
least if it's fairly easy to implement, which it seems it will be.

To be honest, I'm not sure whether a standardized format is really
necessary, as the output of most experiments is so simple that a .csv
file will do. But on the other hand, there is no reason _not_ to have
one, and there might be some benefits, so I'm all for it!

Regarding the format, it seems like you've pretty much got it covered.
Perhaps you could make a full specification, so that we can discuss it
here and anyone can have his or her say?

Regards,
Sebastiaan

Jonathan Peirce

Apr 27, 2011, 5:47:48 AM
to open-sti...@googlegroups.com
Open Behavioural Format? (*.obf isn't currently used by any software as far as I can find). That would convey that this is suitable for behavioural data - I think you don't want EEG, single units and other such recordings in flat-text formats like this.

1. need for a new format

I'm not overly interested in what the actual file syntax is (yaml, xml...). I think the harder stuff related to this might be generating a structure that is a) easy enough to read for a simple experiment b) flexible enough to store the data of a complex one. So far the PsychoPy approach has been to output csv/xlsx files to achieve (a) and python native ('pickle') files to achieve (b). I think the solution you're suggesting is on the (b) end of the scale (for now you'd have to write some script to visualise anything from your example), but is certainly more readable than python pickle files. Actually, PsychoPy can also output log files, because some people like data organised chronologically, rather than in the logical structure of the experiment. I personally find those files harder to analyse so never use them.

What do other packages do? (these are not much more than guesses)
    - Psychtoolbox: the user simply handles this themselves?
    - Presentation: a log of events
    - PyEPL: a log of events
    - E-Prime: proprietary binary format, with E-DataAid to convert to others (e.g. excel)
    - OpenSesame: ?

2. how to structure it

So, if the harder part is choosing a structure, what will be the issues here? Most experiments revolve around the concept of trials of different types that are looped over. That's easy enough to handle. Your example pretty much does it.

What about something that wasn't in the standard loop (e.g. a single datapoint that precedes or follows the main loop of experiments)? What about nested loops (multiple blocks of multiple trials)? Or maybe the format shouldn't care *why* a particular event (e.g. trial) occurred, only *when* it occurred, so it shouldn't care about the existence of loops and whether or not they nest etc.

The second issue is what to do with an experiment that doesn't fit into 'trials'. For example, recording the keypresses of a subject viewing an ambiguous stimulus wouldn't be done in trials as such, and the resulting output would be a series of events rather than an individual response.
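
One way to picture the event-based alternative: a trial-structured file can always be flattened into a chronological list of events, whereas free-running keypress data only fits the second form. A rough sketch, using the key names from Jeremy's example, an invented event layout (`time`, `kind`, `value`), and the assumption that response times are measured from trial onset:

```python
def trials_to_events(trials):
    """Flatten trial dicts into a time-sorted list of events."""
    events = []
    for trial in trials:
        onset = trial["abs_onset"]
        events.append({"time": onset, "kind": "stimulus",
                       "value": trial["stimulus"]})
        if "key_response_RT" in trial:
            # Assumption: RT is relative to the trial's abs_onset.
            events.append({"time": round(onset + trial["key_response_RT"], 3),
                           "kind": "keypress",
                           "value": trial["key_response"]})
    return sorted(events, key=lambda event: event["time"])

# Trials given out of order on purpose; the event list is sorted by time.
trials = [
    {"abs_onset": 18.345, "stimulus": "press 2",
     "key_response": 2, "key_response_RT": 0.556},
    {"abs_onset": 12.321, "stimulus": "press 2",
     "key_response": 1, "key_response_RT": 0.654},
]
events = trials_to_events(trials)
for event in events:
    print(event["time"], event["kind"], event["value"])

# Free-viewing data (made-up values) needs no trial wrapper at all.
free_viewing = [
    {"time": 3.2, "kind": "keypress", "value": "left"},
    {"time": 7.9, "kind": "keypress", "value": "right"},
]
```

Going the other way, from events back to trials, needs extra information about which events belong together, which is roughly the cost of the format not knowing about loops.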

3. a new package for viewing/exporting and batch analysis (akin to E-DataAid)

This sounds interesting, but could snowball. I guess the question would be how to focus on the aspects that aren't already done by other packages (like excel or SPSS). Going down that route will always result in complaints that the viewer can't do <insert feature of excel/spss that hasn't been implemented yet>. Also it potentially stops users/students from learning how to use those more-general packages. If they can run the repeated-measures ANOVA in the new data viewer, they don't learn to use a stats package and then get stuck when the viewer can't do a mixed-design ANOVA. So I think it could be dangerous.

*But* there are some things that might be very useful here, like having a viewer to:
    - export to other different formats (right now a PsychoPy 'psydat' file can then be reopened and saved to .csv or .xlsx but I bet nobody ever did that because they have to write a script for it)
    - combine data from multiple runs, or repeat an analysis over multiple files
    - export a copy of the original experiment file. I was planning for PsychoPy to start saving the experiment inside the data file. Again, users will never make use of that unless they can load the file into a viewer and see a button that says something like "Export Experiment File"
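
The "combine data from multiple runs" item is the kind of batch step such a viewer could wrap. A minimal sketch, assuming each run has already been loaded into a dict shaped like the example earlier in the thread; the function name `summarise_runs`, the second subject ID, and the choice of summary (accuracy and mean RT on correct trials) are illustrative only.

```python
from statistics import mean

def summarise_runs(runs, prefix="trial_"):
    """Return one summary dict per run: subject, accuracy, mean correct RT."""
    summaries = []
    for run in runs:
        trials = [run[k] for k in sorted(run) if k.startswith(prefix)]
        correct_rts = [t["key_response_RT"] for t in trials if t["correct"]]
        n_correct = sum(1 for t in trials if t["correct"])
        summaries.append({
            "subj_ID": run["subject_info"]["subj_ID"],
            "n_trials": len(trials),
            "accuracy": n_correct / len(trials) if trials else None,
            "mean_correct_RT": mean(correct_rts) if correct_rts else None,
        })
    return summaries

# Two made-up runs standing in for two loaded data files.
runs = [
    {"subject_info": {"subj_ID": "0123x"},
     "trial_0001": {"correct": False, "key_response_RT": 0.654},
     "trial_0002": {"correct": True, "key_response_RT": 0.556}},
    {"subject_info": {"subj_ID": "0124x"},
     "trial_0001": {"correct": True, "key_response_RT": 0.5},
     "trial_0002": {"correct": True, "key_response_RT": 0.7}},
]
summaries = summarise_runs(runs)
for summary in summaries:
    print(summary)
```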


Apologies, that all got a bit long-winded! ;-)
Jon
-- 
Dr. Jonathan Peirce
Nottingham Visual Neuroscience

http://www.peirce.org.uk/

