Hi all,
As we are all aware, there are quite a few great data collection tools
out there (pick your favorite :-) ), and there are also quite a few
strong stats packages for analyses (like R, MATLAB, or SciPy, and lots
of psychologists use SPSS). But there's not so much in between, as far
as I know. It would be cool to have something like E-Prime's
E-DataAid, but open source and able to work with output from whatever tools
best meet your needs for data collection. For this, and perhaps for
other reasons too, I'm thinking it would be nice to have a
standardized data format.
A standard in the sense I am talking about is completely platform- and
software-agnostic. It's a description, not an implementation. NIfTI is
an example of a standard for saving imaging data.
What would it take to arrive at a defined, open standard for psych /
human neuro data output? What are the pros and cons? What would a
"killer data format" (.kdf) look like? If it seems worthwhile, what
considerations would there be for such a standard (e.g., relatively
human-readable and .txt based, relatively gloppy and .xml based, etc.)?
Pros: If a standardized format were enabled as an option by enough
data collection packages, it could facilitate the development and
sharing of tools for data reduction and quality control. Some
consensus on what "should" be recorded is also some indication of best
practices for data management. It might also be helpful for archiving
or data accession. Once there's a .kdf specification, people could
write translators, e.g., to arrive at data that can be read by R,
Python, MATLAB, or whatever.
Cons: It's some work to define a standard, let alone implement it for a
given software package.
Some preliminary thoughts:
- a data file will generally have information about the specific data
collection event, or RUN (software version, experiment script version,
date & time, random seed, maybe experiment condition(s) / subject
group, experimenter, etc), about the SUBJECT (subj number, handedness,
maybe age, sex, gave consent or assent), and then there are generally
a series of trials of some sort.
- a TRIAL generally consists of an onset time, a set of stimuli
(possibly each having their own onset times and durations within the
trial), subject response(s) consisting of response + response time,
and so on. The usual stuff.
- YAML seems pretty good as a general purpose standard container for
data (http://yaml.org/). It's data oriented, and a lot easier on the
eyes than XML. It's been around for about 10 years, and if you believe
Wikipedia it looks like it has support in lots of languages, including
Perl, C, Python, C++, MATLAB, R, and others. So using YAML would
sidestep the need to write a file parser in most cases, while allowing
the output to just look sensible to most people. It's also quite
grep-able and amenable to writing custom parsers. So it's more
structured than plain text, less gloppy than XML, and implemented in
many languages.
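To make the RUN / SUBJECT / TRIAL idea a bit more concrete, here is a rough Python sketch of what such a record might look like and how it could be written out as YAML-style text. All field names (software_version, random_seed, rt, and so on) and the class names are placeholders I made up for illustration, not a proposed spec. The tiny emitter below only handles a small YAML subset (nested mappings, lists, and scalars written JSON-style); in practice one would presumably use an existing YAML library instead.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Stimulus:
    name: str
    onset: float     # seconds, relative to trial onset (assumed unit)
    duration: float  # seconds


@dataclass
class Trial:
    onset: float  # seconds since the start of the run (assumed unit)
    stimuli: List[Stimulus] = field(default_factory=list)
    response: Optional[str] = None
    rt: Optional[float] = None  # response time in seconds


@dataclass
class DataFile:
    run: Dict[str, Any]      # info about the data collection event
    subject: Dict[str, Any]  # info about the subject
    trials: List[Trial] = field(default_factory=list)


def to_yaml_lines(value, indent=0):
    """Emit a dict or list as lines of a small YAML subset.

    Scalars (and empty containers) are written with json.dumps, which
    yields valid YAML flow scalars. Keys are assumed to be plain strings.
    """
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}{key}:")
                lines.extend(to_yaml_lines(item, indent + 1))
            else:
                lines.append(f"{pad}{key}: {json.dumps(item)}")
    else:  # a list
        for item in value:
            if isinstance(item, (dict, list)) and item:
                nested = to_yaml_lines(item, indent + 1)
                # Put the dash on the first line of the nested block.
                lines.append(f"{pad}- {nested[0].lstrip()}")
                lines.extend(nested[1:])
            else:
                lines.append(f"{pad}- {json.dumps(item)}")
    return lines


example = DataFile(
    run={"software_version": "0.1", "script_version": "1.0", "random_seed": 12345},
    subject={"number": 1, "handedness": "right"},
    trials=[
        Trial(
            onset=2.0,
            stimuli=[Stimulus("fixation", 0.0, 0.5)],
            response="left",
            rt=0.512,
        )
    ],
)

text = "\n".join(to_yaml_lines(asdict(example)))
print(text)
```

The point is only that the three top-level sections map naturally onto nested mappings and a list of trials, which is the kind of structure YAML handles without much ceremony.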
The idea would be to standardize the way that standard info is saved
into a data file. There should be some provision for saving
nonstandard things, and for extending an original standard.
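One possible way to leave room for nonstandard things, sketched in Python: keep a set of agreed-upon keys, and tuck anything else under a single clearly marked extension key so that standard readers can ignore it safely. The key names here (onset, stimuli, response, rt, extra) and the function name are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical set of "standard" trial keys; a real spec would define these.
STANDARD_TRIAL_KEYS = {"onset", "stimuli", "response", "rt"}


def normalize_trial(record, standard_keys=STANDARD_TRIAL_KEYS, extension_key="extra"):
    """Return a copy of record with nonstandard keys grouped under extension_key."""
    standard = {k: v for k, v in record.items() if k in standard_keys}
    extras = {k: v for k, v in record.items() if k not in standard_keys}
    if extras:
        standard[extension_key] = extras
    return standard


trial = {"onset": 1.0, "rt": 0.4, "pupil_diameter": 3.2}
normalized = normalize_trial(trial)
print(normalized)
```

A later version of the standard could then promote a commonly used extension key into the standard set without breaking older files.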
Thoughts?
--Jeremy