Re: Open Behavioral Format [was "killer" data format?]

Jeremy Gray

Apr 27, 2011, 4:17:18 PM
to open-sti...@googlegroups.com
Hi all,

thanks for the comments. I'm replying to both at once here. (long!)

Sebastiaan wrote:
> To be honest, I'm not sure whether a standardized format is really necessary, as the output of most experiments is so simple that a .csv file will do.

let me elaborate on my motivation. basically, .csv (or .txt, or yaml) in itself is great, but is too unconstrained to allow a general purpose abstraction layer between an experiment and subsequent processing. you can put things in pretty much any order you like, and label them or not label them just so long as you know what is what. different packages could do things differently, or do it exactly the same except for using different labels. or might assume different units (seconds? milliseconds?). so a bare .csv file will be ambiguous to a general purpose behavioral-data-parser, which could not make safe assumptions about what was intended. so the idea is that additional constraints (i.e., a format, beyond .csv) would be useful: they would allow general purpose / automated parsing.

I think it's nice for a data file to be semi-self-documenting (for archival purposes), which is at odds with things like easy import into SPSS, where the focus would tend to be on the analysis of mean accuracy across multiple trials, or median RT. in SPSS, you don't want all the data. so it's nice to record everything at run time, archive it, and extract only the things you need for a given analysis in SPSS.

in my lab, my half-baked solution has been ad-hoc: perl or python scripts, or excel macros, to parse data files. this is a step up from cutting and pasting things, or paying RAs to do things. but ad-hoc solutions are time consuming and error-prone. e.g., missing values can cause havoc if not handled with care (.csv is more robust to this than tab-separated values, but ....). so it's nice to have a pipe-line for doing this stuff, in part so it's explicit and inspectable, and you can redo it quickly and easily.

so, as part of a pipe-line, I'm thinking about the possibility of a general purpose data-viewing tool that only requires something like a directory name, and from there it can slurp in all relevant data files, discover their internal structure (such as experimental conditions and loops), and then do some basic but useful things with it. and enable the user to do some further things. Jon's point is well taken, that even slightly advanced stuff would be best left for a real stats package. the kind of tool I'm thinking about would make it easier to format the data for import into such a package, not to replace such a package.

so the motivation for a standard format is in large part to make data-viewing tools be more widely useful (and not just specific to PsychoPy, for example). to elaborate, some things for such a viewer could include:

- automatic computation and display of descriptive stats (checking for outliers, skewness, faster-than-plausible reaction times, etc), speed-accuracy trade-offs within subject (or variation between subjects). and perhaps some less-obvious things like reliability, checking that the same experiment script was used to generate all of the data files that are being treated the same way for analyses. and so on.

- allow the user some way to interact with the data and extract things that can't be inferred automatically, eg, signal detection d' and beta, which depend on two conditions (target present & target absent trial types)

- data formatting for SPSS import, for batches of subjects, eg, saving some but not all of the variables, such as only median RT

- generating trial-type onset and duration vectors for SPM, FSL, and so on, potentially with corresponding behavioral covariates (such as RT on each trial, potentially after some transformation, e.g., centered / de-meaned)
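
As a rough illustration of the signal-detection item above, d' and beta could be computed from trial counts along these lines. This is only a sketch, not part of any proposed format: the function name `dprime_and_beta`, the example counts, and the log-linear correction (adding 0.5 to each cell) are assumptions made for the example.

```python
import math
from statistics import NormalDist


def dprime_and_beta(hits, misses, false_alarms, correct_rejections):
    """Return (d', beta) for a yes/no detection task, given trial counts."""
    # Log-linear correction: adding 0.5 per cell keeps both rates strictly
    # between 0 and 1, where the z-transform is finite. (An assumption; other
    # corrections exist.)
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(fa_rate)
    d_prime = z_hit - z_fa
    # beta is the likelihood ratio of the two distributions at the criterion.
    beta = math.exp((z_fa ** 2 - z_hit ** 2) / 2)
    return d_prime, beta


# Hypothetical counts from target-present and target-absent trials.
d_prime, beta = dprime_and_beta(hits=40, misses=10, false_alarms=10, correct_rejections=40)
```

The counts themselves are the part that cannot be inferred automatically: the user has to say which trial types are "target present" and which are "target absent" before hits and false alarms can be tallied.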
 
> But on the other hand, there is no reason _not_ to have one, and there might be some benefits, so I'm all for!

:-)
 
> Regarding the format, it seems like you pretty much got it covered. Perhaps you could make a full specification, so that we can discuss it here and anyone can have his or her say?

sure, it's in progress. I'm working on some of the situations Jon brought up, like loops of loops. there are several trade-offs that I did not anticipate, so am trying to at least flag such things as I think of them.

Jonathan Peirce wrote:
> Open Behavioural Format? (*.obf isn't currently used by any software as far as I can find). That would convey that this is suitable for behavioural data - I think you don't want EEG, single units and other such recordings in flat-text formats like this.

sounds great.
 
> 1. need for a new format
>
> I'm not overly interested in what the actual file syntax is (yaml, xml...). I think the harder stuff related to this might be generating a structure that is a) easy enough to read for a simple experiment b) flexible enough to store the data of a complex one.

very much agreed.
 
> So far the PsychoPy approach has been to output csv/xlsx files to achieve (a) and python native ('pickle') files to achieve (b). I think the solution you're suggesting is on the (b) end of the scale (for now you'd have to write some script to visualise anything from the below), but is certainly more readable than python pickle files.

nicely put. I'd characterize the aim as being to streamline the passage to post-processing, in a way that's discoverable from the data file, to the extent possible. and ideally: work for both simple and complex situations, be platform / language agnostic, and be human readable.

and yes, there would definitely be real work put into doing anything more with the data, like actually building a data-viewer tool (real work!). an OBF format by itself is not that useful. by analogy, nifti is an MR image data format, which is all fine and well, but it does not help me except that there are several good viewers out there. without having some formats, people would be less inclined to write viewers, or you'd have your local, hacked-in-house viewer.

I'm tempted to have a go at writing a data viewer (as you can probably guess!). given a good format, it's possible other people would too. they will probably write better ones than I can. which is all good.
 
> Actually, PsychoPy can also output log files, because some people like data organised chronologically, rather than in the logical structure of the experiment. I personally find those files harder to analyse so never use them.

actually, I should look at the log file format more closely. I think aiming for near-chronological is good for a general format. to be rather long winded here: for some kinds of analyses, temporal stuff is absolutely critical, like for analyzing imaging data in terms of behavioral covariates. also, order info is relevant if counterbalancing the order of conditions within a run is done. (e.g., consider trying to verify that an undergrad got the counterbalancing right in his or her senior thesis, when it was your grad student who was supervising them day to day, and the undergrad graduated a year ago -- you want an explicit record, ideally right there in the data file, not stored in some notes files somewhere, or in someone's head). Or doing a re-analysis of trial effects based on the stimulus that preceded each item, based on a reviewer demand for such an analysis that you did not anticipate at run time. all this probably does not strictly require a chronological order in the data file, as long as it's reconstructable. but a general data format does need some way to preserve timing info, and to me chronological order is intuitive (and maybe strangely reassuring to see in the data file). maybe I'm just biased.
 
> What do other packages do? (these are not much more than guesses)
>     - Psychtoolbox: the user simply handles this themselves?
>     - Presentation: a log of events
>     - PyEPL uses a log of events
>     - eprime: proprietary binary format, with E-DataAid to convert to others (e.g. excel)
>     - OpenSesame: ?

psyscope is basically a log file, trial by trial

> 2. how to structure it
>
> So, if the harder part is choosing a structure, what will be the issues here? Most experiments revolve around the concept of trials of different types that are looped over. That's easy enough to handle. Your form below pretty much does it
>
> What about something that wasn't in the standard loop (e.g. a single datapoint that precedes or follows the main loop of experiments)?

a single data point is even easier than a loop of trials. in brief, each data point in a yaml file just needs to have a unique label. Example: a subject gives consent only once. the data record could be something like the following

consent:
    stimulus: Press 'y' to agree to participate.
    response:
        key: y
 
> What about nested loops (multiple blocks of multiple trials)?

for loops of loops, the unique tag for a given data point could be something like "outerLoopName.003+innerLoopName.027", which would provide a unique identifier for YAML and would be reconstructable into sequences, and extensible to loops of loops of loops.

yaml would give you a dict entry like data["outerLoopName.index+innerLoopName.index"]. your data viewer would then have to parse those keys, to reconstruct lists of lists. this format would allow a data viewer to do that parsing without ambiguity. PsychoPy or OpenSesame would just have to know how to construct such tags, based on loop names and the number of repetitions within a loop, which seems easy enough.
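
One way a viewer could do that key parsing, as a sketch. The helper names (`parse_tag`, `nest_by_loops`, `as_lists`) and the sample records are made up for illustration, and the code assumes that every loop tag in a file has the same nesting depth.

```python
import re

# One loop level looks like "<loopName>.<zero-padded index>".
TAG_PART = re.compile(r"^(?P<loop>.+)\.(?P<index>\d+)$")


def parse_tag(tag):
    """Split 'outer.003+inner.027' into [('outer', 3), ('inner', 27)]."""
    levels = []
    for piece in tag.split("+"):
        match = TAG_PART.match(piece.strip())
        if match is None:
            raise ValueError(f"not a loop tag: {tag!r}")
        levels.append((match.group("loop"), int(match.group("index"))))
    return levels


def nest_by_loops(data):
    """Rebuild nested dicts keyed by loop index from flat tagged keys."""
    nested = {}
    for tag, record in data.items():
        try:
            indices = [index for _, index in parse_tag(tag)]
        except ValueError:
            continue  # e.g. 'consent': a single data point outside any loop
        level = nested
        for index in indices[:-1]:
            level = level.setdefault(index, {})
        level[indices[-1]] = record
    return nested


def as_lists(nested, depth):
    """Convert `depth` levels of index-keyed dicts into lists ordered by index."""
    if depth == 0:
        return nested
    return [as_lists(nested[i], depth - 1) for i in sorted(nested)]


# Roughly what a YAML parser might hand back for a small two-level experiment.
data = {
    "consent": {"response": {"key": "y"}},
    "block.001+trial.001": {"RT.ms": 512},
    "block.001+trial.002": {"RT.ms": 640},
    "block.002+trial.001": {"RT.ms": 498},
}
blocks = as_lists(nest_by_loops(data), depth=2)
```

Here `blocks` is a list of blocks, each a list of trial records in index order. Zero-padding the indices in the tag is not needed for the parsing itself, since the indices are compared as integers.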

I think there's another way to do it, too. you can tell yaml to join the next item onto a specific list, so that you can have intervening stuff between list items. that looks to be possible. not sure yet whether that would make life easier or harder for developers, versus for end users (experimenters). or which would be more intuitive and clear. tradeoffs.
 
> Or maybe the format shouldn't care *why* a particular event (e.g. trial) occurred, only *when* it occurred, so it shouldn't care about the existence of loops and whether or not they nest etc.

I think it's probably possible to preserve everything, and it seems desirable to do so. there could indeed be several ways to achieve this; I can envision at least two general approaches, one of them being the above trick. another would be to just have every item be named 'trial', and then it has an attribute 'conditions', which is a list of all the relevant loop names. this is quite like the psyscope approach.
 
> The second issue is what to do with an experiment that doesn't fit into 'trials' e.g. recording the keypresses of a subject viewing an ambiguous stimulus wouldn't be done in trials as such, and the resulting output would be a series of events, rather than an individual response.

it's possible to use lists to save multiple data points within a given trial. here's an example with multiple mouse clicks within a single trial, given as lists (x, y, RT):

trial.0003:
    onset: 24.43
    type: <label>
    response:
        type: mouse
        x: [10, 20, 30, 40, 50] # list
        y: [20, 30, 40, 50, 100] # list
        RT.ms: [543, 1033, 3449, 5467, 6587] # list; all lists should be the same length
        click_on: release
        button: left
        sample: clicks
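
A sketch of how a reader of such a record might check the "same length" rule and turn the parallel lists into one record per click. The dict mirrors what a YAML parser would plausibly return for the response above; the function name `clicks_from_response` is an assumption for this example.

```python
def clicks_from_response(response):
    """Turn parallel x / y / RT.ms lists into one dict per click."""
    xs, ys, rts = response["x"], response["y"], response["RT.ms"]
    # The format asks for equal-length lists; fail loudly if they are not.
    if not (len(xs) == len(ys) == len(rts)):
        raise ValueError("x, y and RT.ms must all have the same length")
    return [{"x": x, "y": y, "RT.ms": rt} for x, y, rt in zip(xs, ys, rts)]


trial = {
    "onset": 24.43,
    "response": {
        "type": "mouse",
        "x": [10, 20, 30, 40, 50],
        "y": [20, 30, 40, 50, 100],
        "RT.ms": [543, 1033, 3449, 5467, 6587],
        "click_on": "release",
        "button": "left",
        "sample": "clicks",
    },
}

clicks = clicks_from_response(trial["response"])
# Time between successive clicks, in milliseconds.
inter_click_ms = [b["RT.ms"] - a["RT.ms"] for a, b in zip(clicks, clicks[1:])]
```

Keeping the raw lists in the file and deriving per-click records on load means the file stays compact while the viewer still gets a convenient structure.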
 
> 3. a new package for viewing/exporting and batch analysis (akin to E-DataAid)
>
> This sounds interesting, but could snowball. I guess the question would be how to focus on the aspects that aren't already done by other packages (like excel or SPSS). Going down that route will always result in complaints that the viewer can't do <insert feature of excel/spss that hasn't been implemented yet>. Also it potentially stops users/students from learning how to use those more-general packages. If they can run the repeated-measures ANOVA in the new data viewer, they don't learn to use a stats package and then get stuck when the viewer can't do a mixed-design ANOVA. So I think it could be dangerous.

good considerations! I had not really thought about that part at all. I agree that it's not necessary to provide real stats, and it could even be undesirable to do so. and it's less work to not include that, so I love it :-)
 
> *But* there are some things that might be very useful here, like having a viewer to:
>     - export to other formats (right now a psychopy 'psydat' file can then be reopened and saved to .csv or .xlsx, but I bet nobody ever did that because they have to write a script for it)
>     - combine data from multiple runs, or repeat an analysis over multiple files
>     - export a copy of the original experiment file. I was planning for PsychoPy to start saving the experiment inside the data file. Again, users will never make use of that unless they can load the file into a viewer and see a button that says something like "Export Experiment File"

yes, I agree. a viewer that aims to provide import / inspect / transform / export for single or multiple data files is much more along the lines of what I think could be useful. I would like a way to streamline the conversion of my "raw" data (as output by whatever program I use to get the data) into something useful for an existing stats or imaging analysis package.

I'll post a draft of an Open Behavioral Format, hopefully within a day or two.

--Jeremy


Jeremy Gray

Apr 29, 2011, 2:29:39 PM
to open-sti...@googlegroups.com
Hi all,

gee, writing a format is more work than I thought. I'm attaching a file with my notes (400+ lines of notes). These are followed in the file by a demo (100 lines). I put a copyright on this just because I saw that the YAML people put a copyright on YAML. maybe it looks more official that way, or something.

I've convinced myself that an OBF is worth doing, so I ended up putting some real time into trying to think of things that could go wrong, or be done different ways, and to explain it. Hopefully you can see the basic idea, and not get overwhelmed with my blather. There's a lot that is not there; it's still very underspecified (especially concerning the representation of values). There are a lot of ways it could go, and having a lot of flexibility might be what is actually best at this point.

So I'm at the point where a) feedback would be helpful, and b) I need to actually write an OBF parser and play around with how it works.

The file is plain text in YAML syntax, which means you can try out the simple examples (but not the ones that need an OBF parser, as distinct from a YAML parser). If you save the attached file as 'data.obf', then you can play a bit. in python:
>>> import yaml
>>> with open('data.obf') as f:
...     data = yaml.safe_load(f)  # safe_load: current PyYAML requires an explicit Loader for yaml.load
...
>>> data['lots_of_mouse_clicks']['response']['RT.ms']
<list>
>>> data.keys()
<stuff>
>>> data['debriefing']['response']
<a list of multi-line strings>

The point is just to show that it's not only possible, but actually pretty easy, to handle some fairly complicated data.

A final note, esp for Jon: I played around with saving a script into a data file using base64-encoding. I plopped it into the data file as a single string (23,000-ish characters), and YAML happily did its thing. seems promising.
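
A sketch of that base64 round trip, using only the standard library. The record layout (`experiment_script`, `encoding`, `content`) and the toy script text are hypothetical; the message above does not say how the string was keyed in the file.

```python
import base64


def encode_script(source_text):
    """Return the script as one ASCII string, safe to store as a single value."""
    return base64.b64encode(source_text.encode("utf-8")).decode("ascii")


def decode_script(encoded):
    """Recover the original script text from its base64 form."""
    return base64.b64decode(encoded.encode("ascii")).decode("utf-8")


# A stand-in for an experiment script: newlines, quotes and colons are the
# kind of characters that would otherwise need care inside a data file.
script = "trials = 10\n# note: 'quotes' and key: value text\nprint(trials)\n"

record = {
    "experiment_script": {
        "encoding": "base64",
        "content": encode_script(script),
    }
}
restored = decode_script(record["experiment_script"]["content"])
```

Because the base64 alphabet contains only letters, digits, `+`, `/` and `=`, the encoded string has no newlines, quotes, or colons for the file syntax to trip over. The cost is that the script is no longer human readable inside the data file, and the string is about a third longer than the original.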

Next up: a parser.

best,

--Jeremy
data.obf

Jonathan Peirce

May 3, 2011, 1:06:57 PM
to open-sti...@googlegroups.com
Hi Jeremy,

I don't have time to look at this in detail (quite a bit to do before
going to the vision science society meeting on thursday) but I had a
look at the specimen file at the end of the document.

What springs to my mind is that the code specifying the 'stimulus' might
be better as 'condition' (or 'parameters'?) and give all the parameters
that defined the trial type (ie that varied from one trial to another)::

loop.1, trial.0002:
    onset: 18.345
    tag: type1
    condition:
        text: press 2
        pos: [-2,0]
    response:
        key: 3
        correct: False
        RT: 0.444

loop.1, trial.0003:
    onset: 20.500
    tag: type2
    condition:
        text: press 2
        pos: [+2,0]
    response:
        key: 3
        correct: False
        RT: 0.563

Also, I wondered whether, instead of the "loop.1 + trial.003" as the tag
you could use a parameter within your trial objects that gave the hierarchy::

trial:
    hierarchy: blocks.1, trials.1
    onset: 5.064
    ...

trial:
    hierarchy: blocks.1, trials.2
    onset: 8.321
    ...
(actually, I'm not sure under that scheme what the outer name would be,
instead of trial)
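
A sketch of how a reader could use such a per-trial parameter to regroup trials. It sidesteps the open question of the outer name by assuming the trials arrive as a list of records, which is an assumption and not something proposed above. The key is spelled `hierarchy` here, the third record is invented to show a second block, and `parse_hierarchy` is a hypothetical helper.

```python
from collections import defaultdict


def parse_hierarchy(text):
    """Turn 'blocks.1, trials.2' into [('blocks', 1), ('trials', 2)]."""
    levels = []
    for part in text.split(","):
        name, _, index = part.strip().rpartition(".")
        levels.append((name, int(index)))
    return levels


trials = [
    {"hierarchy": "blocks.1, trials.1", "onset": 5.064},
    {"hierarchy": "blocks.1, trials.2", "onset": 8.321},
    {"hierarchy": "blocks.2, trials.1", "onset": 30.1},  # invented for the example
]

# Group trial onsets by the index of the outer loop.
onsets_by_block = defaultdict(list)
for trial in trials:
    (_, block_index), (_, trial_index) = parse_hierarchy(trial["hierarchy"])
    onsets_by_block[block_index].append(trial["onset"])
```

With the hierarchy stored inside each record, the records can stay in chronological order in the file and the loop structure is recovered only when it is needed.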

Jon

--
Dr. Jonathan Peirce
Nottingham Visual Neuroscience

http://www.peirce.org.uk/



Yaroslav Halchenko

May 3, 2011, 1:56:04 PM
to open-sti...@googlegroups.com
Hey Jeremy -- great to see this effort going forward. Silly question:
is the document/spec available somewhere under GIT?

On Fri, 29 Apr 2011, Jeremy Gray wrote:

> Hi all,
> gee, writing a format is more work than I thought. I'm attaching a file
> with my notes (400+ lines of notes). These are followed in the file by
> a demo (100 lines). I put a copyright on this just because I saw that
> the YAML people put a copyright on YAML. maybe it looks more official
> that way, or something.

having the copyright is the right way ;-) but with

Copyright:
- (c) 2011 Jeremy R. Gray
- This document may be freely copied, as long as it is not modified.

I cannot even forward you any in-text spell-fixes (if there would be any ;)
)... why bother with such restrictions instead of just releasing it under
some appropriate license... e.g. CC BY-SA 3.0 ?

--
=------------------------------------------------------------------=
Keep in touch www.onerussian.com
Yaroslav Halchenko www.ohloh.net/accounts/yarikoptic

Jeremy Gray

May 3, 2011, 2:28:28 PM
to open-sti...@googlegroups.com
Hey Jon (and all),

thanks for the feedback. my last email was quite a brain dump, my apologies.

it will be a lot easier to discuss things via examples, and examples will be more fun once there's a parser, which there almost is. just needs documentation. I'll pass around a link to a git repo when the time comes. It's parsing 1000 trials in about 1.6 seconds, about 1 second of which is the YAML conversion -- there's a C library (libyaml) which people say can speed up the conversion by a factor of 7, which would be great.

Jon, I agree that having 'conditions' or 'parameters' or some such within the trial (instead of the tag) might be a nice way to go, that's quite like how PsyScope does things (in essence). it looks simpler in some ways, and is potentially more explicit to read. It would be easy to support the 'parameters' style you describe, and / or allow that as an alternate, equivalent style. I'll have to think carefully about the other suggestions.

the biggest evolution in my thinking is that an OBF (in the sense of a formal definition, as distinct from a parser) might best be as general as possible: just define how a header looks, how trial info works (nothing but key:value pairs), and a defined procedure to implement "take action X if the key matches expression Y".

That is, there could mostly just be sets of 'conventions' for what actions to take given specific key words, without formalizing those into The One Format To Rule Them All. People might differ quite a lot in terminology and needs, eg, psychophysics to fMRI to social psych. Using conventions would mean that you could just use a parser without worrying about stuff and it will probably just work most of the time: the parser would just discover what is in the data file and tell you, and that would tend to be identical in structure across all subjects in a given experiment. so even without defined conventions things would tend to work out. But for further smoothing the user experience, there could also be OpenSesame conventions, PsychToolBox conventions, PsychoPy conventions, fMRI conventions, and so on. And you could roll your own "custom" conventions for a given experiment. I have the infrastructure for such 'conventions' in the parser, as a dict of regex: function-reference pairs. You just have to create the functions that the regex will trigger, and what goes in them could be more or less anything you can do in python (or your language of choice).
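
A sketch of what a dict of regex: function-reference pairs could look like. This is not the code from the actual parser; the two convention functions, the `CONVENTIONS` table, and the sample record are invented for illustration, and only flat (non-nested) records are handled.

```python
import re


def ms_to_seconds(key, value):
    """Hypothetical convention: keys ending in '.ms' are converted to seconds."""
    new_key = key[: -len(".ms")] + ".s"
    if isinstance(value, list):
        return new_key, [v / 1000 for v in value]
    return new_key, value / 1000


def parse_truth(key, value):
    """Hypothetical convention: 'correct' is normalized to a real boolean."""
    return key, str(value).strip().lower() in ("true", "yes", "1")


# "Take action X if the key matches expression Y."
CONVENTIONS = {
    re.compile(r"\.ms$"): ms_to_seconds,
    re.compile(r"^correct$"): parse_truth,
}


def apply_conventions(record, conventions=CONVENTIONS):
    """Return a new dict with the first matching convention applied per key."""
    result = {}
    for key, value in record.items():
        for pattern, action in conventions.items():
            if pattern.search(key):
                key, value = action(key, value)
                break  # at most one convention per key
        result[key] = value
    return result


raw = {"RT.ms": 444, "correct": "False", "key": 3}
cleaned = apply_conventions(raw)
```

Keys that match no pattern pass through unchanged, which fits the idea that the parser should mostly just work even when no conventions are defined. A package-specific or experiment-specific set of conventions would then be just another dict passed as `conventions`.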

Yarik: thanks for helping out on the copyright :-) I'm always more interested in writing the code, but I should think about the license carefully too. I'll look into CC BY SA 3, thanks for the suggestion.

I'll try to make a new git project for the spec and parser (and eventually a writer). If git bends my brain too much, I'll just make a new branch in my clone of PsychoPy and pass the link around.

--Jeremy

Jeremy Gray

May 3, 2011, 4:06:26 PM
to open-sti...@googlegroups.com
here's the link to a new project on github (my first, probably some rough edges):
    https://github.com/jeremygray/obf
release early and often, they say. so here it is, warts and all.

to try the parser:
$ python obf.py

--Jeremy

Jeremy Gray

Sep 8, 2011, 11:53:16 AM
to open-sti...@googlegroups.com
Hi all,

Seems like others have been working on a solution to the same general problem:
http://www.frontiersin.org/neuroinformatics/10.3389/fninf.2011.00016/abstract

"Here, we propose a simple format, the “open metaData Markup Language” (odML), for collecting and exchanging metadata in an automated, computer-based fashion. In odML arbitrary metadata information is stored as extended key–value pairs in a hierarchical structure. Central to odML is a clear separation of format and content, i.e., neither keys nor values are defined by the format.... We started to define such terminologies for neurophysiological data, but aim at a community driven extension and refinement of the proposed definitions."

These features sound very much like the OBF format, including hierarchical key-value pairs, plus the extensibility and flexibility. convergent evolution in action.

--Jeremy