Extending the specification to cover rich phenotypic data

Chris Gorgolewski

Sep 19, 2016, 10:16:44 PM
to bids-discussion
Following our previous discussions about data dictionaries for questionnaires, I have written up a section of the specification summarizing the changes (this will be included in the upcoming 1.0.1-rc1). I also added the ability to split the phenotypic information into multiple .tsv files (as is done in NKI Enhanced, for example). Please let me know what you think:

From: https://docs.google.com/document/d/1HFUkAEE-pB-angVcYe6pf_-fVf4sCpOHKesUvfb8Grc/edit#heading=h.pi5iigxxt8vy

If the dataset includes multiple sets of participant-level measurements (for example, responses from multiple questionnaires), they can be split into individual files separate from participants.tsv. Those measurements should be kept in the phenotype/ folder and end with the .tsv extension. They can include an arbitrary set of columns, but one of them has to be participant_id, with values matching sub-<participant_label>.
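A minimal sketch of how the participant_id requirement could be checked, assuming labels consist of letters and digits only. The function name check_phenotype_tsv and the label pattern are illustrative assumptions, not part of the specification text:

```python
import csv
import io
import re

# Assumed label pattern: "sub-" followed by letters and digits only.
PARTICIPANT_ID = re.compile(r"sub-[A-Za-z0-9]+")


def check_phenotype_tsv(text):
    """Return a list of problems found in the contents of a phenotype .tsv file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if "participant_id" not in (reader.fieldnames or []):
        return ["missing required column: participant_id"]
    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        value = row["participant_id"] or ""
        if not PARTICIPANT_ID.fullmatch(value):
            problems.append(f"line {line_no}: bad participant_id {value!r}")
    return problems


# Hypothetical contents of phenotype/acds_adult.tsv.
example = "participant_id\tadhd_b\tadhd_c_dx\nsub-01\t1\t2\nsub-02\t2\t2\n"
print(check_phenotype_tsv(example))
```

The check only looks at the file's own contents; comparing the labels against the sub-* folders actually present in the dataset would be a separate step.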


As with all other tabular data, these additional phenotypic information files can be accompanied by a JSON file describing the columns in detail (see Section 4.2). In addition to the column descriptions, a section describing the measurement tool (as a whole) can be added under the name "MeasurementToolMetadata". This section consists of two keys: "Description", a free text description of the tool, and "TermURL", a link to an entity in an ontology corresponding to this tool. For example (content of phenotype/acds_adult.json):


{
   "MeasurementToolMetadata": {
       "Description": "Adult ADHD Clinical Diagnostic Scale V1.2",
       "TermURL": "http://www.cognitiveatlas.org/task/id/trm_5586ff878155d"
   },
   "adhd_b": {
       "Description": "B. CHILDHOOD ONSET OF ADHD (PRIOR TO AGE 7)",
       "Levels": {
           "1": "YES",
           "2": "NO"
       }
   },
   "adhd_c_dx": {
       "Description": "As child met A, B, C, D, E and F diagnostic criteria",
       "Levels": {
           "1": "YES",
           "2": "NO"
       }
   }
}


Please note that in this example "MeasurementToolMetadata" includes information about the questionnaire and "adhd_b" and "adhd_c_dx" correspond to individual columns.
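To illustrate how a reader of such a file might tell the two kinds of entries apart, here is a rough sketch that separates the "MeasurementToolMetadata" section from the column descriptions and uses "Levels" to decode a coded response. The helper names split_sidecar and decode are hypothetical, and the sidecar is a shortened copy of the example above:

```python
import json

# Shortened copy of the phenotype/acds_adult.json example.
sidecar_text = """
{
  "MeasurementToolMetadata": {
    "Description": "Adult ADHD Clinical Diagnostic Scale V1.2",
    "TermURL": "http://www.cognitiveatlas.org/task/id/trm_5586ff878155d"
  },
  "adhd_b": {
    "Description": "B. CHILDHOOD ONSET OF ADHD (PRIOR TO AGE 7)",
    "Levels": {"1": "YES", "2": "NO"}
  }
}
"""


def split_sidecar(sidecar):
    """Separate the tool-level section from the per-column descriptions."""
    tool = sidecar.get("MeasurementToolMetadata", {})
    columns = {k: v for k, v in sidecar.items() if k != "MeasurementToolMetadata"}
    return tool, columns


def decode(columns, column, raw_value):
    """Map a coded response to its label via "Levels"; fall back to the raw value."""
    levels = columns.get(column, {}).get("Levels", {})
    return levels.get(str(raw_value), raw_value)


tool, columns = split_sidecar(json.loads(sidecar_text))
print(tool["Description"])
print(decode(columns, "adhd_b", 1))
```

One consequence of keeping the tool section at the same level as the columns is that a column could never itself be named MeasurementToolMetadata.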


In addition to the keys available to describe columns in all tabular files (LongName, Description, Levels, Units, and TermURL), the participants.json file as well as the phenotypic files can also include column descriptions with a Derivative field that, when set to true, indicates that the values in the corresponding column are a transformation of values from other columns (for example, a summary score based on a subset of items in a questionnaire).
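As a sketch of what a derived column could look like in practice: the column name adhd_total and the scoring rule (a count of items answered "1") are made up for illustration and are not taken from the questionnaire itself.

```python
import json

# Items from the example sidecar that feed the (hypothetical) summary score.
ITEMS = ["adhd_b", "adhd_c_dx"]


def add_summary_score(rows, items=ITEMS, name="adhd_total"):
    """Add a derived column counting the items answered "1" (YES)."""
    for row in rows:
        row[name] = str(sum(row[item] == "1" for item in items))
    return rows


def describe_derived(sidecar, name, description):
    """Return a copy of the sidecar with the new column flagged as derived."""
    updated = dict(sidecar)
    updated[name] = {"Description": description, "Derivative": True}
    return updated


rows = add_summary_score([
    {"participant_id": "sub-01", "adhd_b": "1", "adhd_c_dx": "2"},
    {"participant_id": "sub-02", "adhd_b": "1", "adhd_c_dx": "1"},
])
sidecar = describe_derived({}, "adhd_total", "Number of items answered YES")
print(json.dumps(sidecar, indent=2))
```

Python's True is serialized as the JSON literal true, which is the value the Derivative field expects.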

Satrajit Ghosh

Sep 22, 2016, 8:58:06 AM
to bids-discussion
hi chris,

how would this handle the following scenario: combined phenotype file with multiple measurements and fields. how do the columns get associated with the appropriate measurement tool? one suggestion would be to group column headings associated with a measurement tool under the measurement tool. the other is to associate a column heading with a measurement tool.

also it seems that one can have one json for all phenotypic files at a level higher in the layout. multiple json dictionaries should be allowed, but it should also be able to represent the entire dictionary in json dataset.

how will questionnaires where multiple selections are allowed be handled in the tsv file?

how will a mismatch between the term URL and information entered locally be handled? the intent of a term from an ontology may be to describe the term, its units, etc., and perhaps more importantly to ensure harmonization across datasets. will the validator check this? so ideally the term URL should be sufficient for a lot of things.

further, if the intent for "Levels" is to capture the types of responses, something should also indicate types of questions. here is a spec that we use to drive our mobile app questions that specifies the type of question and values: https://github.com/satra/voiceup-mdd/blob/master/specs/specs_160913.json#L3273 - and our responses are stored as json.

for a more comprehensive list you can look at the NDA dictionaries, which still require parsing, but a starting point. i'll let nolan pitch in here, but the details of how such information is captured has been worked on quite a bit in a couple of different efforts.

cheers,

satra

Auer, Tibor

Sep 22, 2016, 9:32:42 AM
to bids-di...@googlegroups.com

Hi,

 

It seems to me that we are heading towards NIDM-E…

 

Vale,

Tibor

 

Auer, Tibor (Ph '99)

--
You received this message because you are subscribed to the Google Groups "bids-discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bids-discussi...@googlegroups.com.
To post to this group, send email to bids-di...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bids-discussion/9bf075b9-90d5-45b7-83fc-8a1c0d3c6725%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris Gorgolewski

Sep 22, 2016, 11:10:34 AM
to bids-discussion
On Thu, Sep 22, 2016 at 5:58 AM, Satrajit Ghosh <satraji...@gmail.com> wrote:
> hi chris,
>
> how would this handle the following scenario: combined phenotype file with multiple measurements and fields. how do the columns get associated with the appropriate measurement tool? one suggestion would be to group column headings associated with a measurement tool under the measurement tool. the other is to associate a column heading with a measurement tool.

Right now (as you probably noticed) the MeasurementToolMetadata field is on the same level as the column descriptions. We could group columns into MeasurementTool groups, but this extra layer of hierarchy would break backwards compatibility (remember that data dictionaries were part of 1.0.0). This might be a good idea for 2.0.0, but for 1.*.* maybe an easier option would be to enforce one MeasurementTool per .tsv file, so that there could only be one MeasurementToolMetadata per .json file? WDYT? Splitting the .tsv files by measurement tool seems pretty natural anyway.

> also it seems that one can have one json for all phenotypic files at a level higher in the layout. multiple json dictionaries should be allowed, but it should also be able to represent the entire dictionary in json dataset.

I haven't thought about how the hierarchical rule would apply to the phenotypic data (since all of it is on the group level). I don't see a nice solution, but I also don't see much of a need: there is nothing I can think of that two data dictionaries would share, assuming there is one and only one .json file per measurement tool.

> how will questionnaires where multiple selections are allowed be handled in the tsv file?

That's a good question. 1) I have not thought about it 2) how is it handled in NDA or ISATab? 3) maybe we can hold off on deciding this until we have data that we can use as an example 4) all of the above

(see what I did there? ;)

> how will a mismatch between the term URL and information entered locally be handled? the intent of a term from an ontology may be to describe the term, its units, etc., and perhaps more importantly to ensure harmonization across datasets. will the validator check this? so ideally the term URL should be sufficient for a lot of things.

As you know, finding such mismatches automatically is really hard. It would be useful to think of some ontology-based fuzzy matching algorithm for checking whether TermURLs are roughly correct. The least we could do is check that the TermURL resolves.
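A rough sketch of that minimal check, assuming the sidecar layout from the example above. This is not what the validator does today; the function names are hypothetical, and the example run below only performs the offline syntax check, leaving the network lookup as an optional helper:

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def iter_term_urls(sidecar):
    """Yield (key, url) for every top-level section that carries a TermURL."""
    for key, section in sidecar.items():
        if isinstance(section, dict) and "TermURL" in section:
            yield key, section["TermURL"]


def is_well_formed(url):
    """Syntactic check only: an http(s) scheme and a host are present."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def url_resolves(url, timeout=10):
    """Network check (not called below): True if a HEAD request succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are all OSError subclasses.
        return False


sidecar = {
    "MeasurementToolMetadata": {
        "TermURL": "http://www.cognitiveatlas.org/task/id/trm_5586ff878155d"
    },
    "adhd_b": {"Description": "example column", "TermURL": "not a url"},
    "adhd_c_dx": {"Description": "example column without a TermURL"},
}
malformed = [key for key, url in iter_term_urls(sidecar) if not is_well_formed(url)]
print(malformed)
```

Some servers reject HEAD requests, so a real check would probably need to fall back to GET before reporting a URL as broken.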
 
> further, if the intent for "Levels" is to capture the types of responses, something should also indicate types of questions. here is a spec that we use to drive our mobile app questions that specifies the type of question and values: https://github.com/satra/voiceup-mdd/blob/master/specs/specs_160913.json#L3273 - and our responses are stored as json.

Levels is intended to capture the coding scheme for responses. I am not sure what you mean by "types of questions". There is a question/column-level TermURL that could serve such a purpose.

> for a more comprehensive list you can look at the NDA dictionaries, which still require parsing, but a starting point. i'll let nolan pitch in here, but the details of how such information is captured has been worked on quite a bit in a couple of different efforts.

I'll have a look at NDA and ISATab.

Thanks for the feedback!

Best,
Chris 
