Resolve comments: Naming scheme of data folders

Felix Schönbrodt

Sep 20, 2019, 10:50:23 AM
to Psych-Data-Standards
Hi all,

in yesterday's developer call, we noticed that we should move on and try to resolve some open issues in the Google doc. I'll tackle a first one here, which is very relevant for the validator app:

The spec doc has accumulated some comments concerning the naming scheme of the data folders.
(Tagging Wolf Vanpaemel, Ian Hussey, Melissa Kline, Eoghan Ó Carragáin, Brett Buttliere)

I think there is consensus about the meaning of the files:
  • "source": Original, unaltered data in any format (e.g. video files)
  • "raw": Also original, unaltered data (e.g., untransformed, unaggregated, no outlier exclusions, etc.), BUT in a form that complies with the specs (i.e., tsv-file with specific file name, etc.).

In some cases the differences between both will be larger (e.g., a source video file gets manually coded; scorings are in /raw_data), sometimes smaller (e.g., the survey software returns a source csv-file, which only needs minimal adjustment to get a compliant tsv-file in /raw_data).

So the open issue is just about the naming (unless you see a need to discuss the meaning, or I misunderstood something). As we try to move toward a consensus, I'll try to summarize the ideas and arguments here.

(I agree with Wolf's comments that we should separate the "what" (meaning) and the "where" (folder names) in the description).

Desiderata that have been brought up for the folder names:
  • nice alphabetization in the folder structure (first source, then raw; data-folders should "stick" together)
  • intuitively understandable for users - names should reflect the meaning
  • consistency with existing naming schemes. Specifically, the definitions from the official recommendations of the German Psych Society (https://www.dgps.de/fileadmin/documents/Empfehlungen/Data_Management_eng.pdf; disclaimer: I was an author of that):
    • "First, a distinction should be made between raw data and primary data. Raw data are the original [potentially non-digital] record; for instance, checkmarks on a paper questionnaire, drawings, or audio and video recordings. Primary data are the first transfer of raw data into a digital format; for instance, code “1” for a “yes”, [or the digitized video file] etc. Thus, primary data in psychology are completely unaltered (i.e., not transformed, aggregated, etc.)."
    • That means both "source" and "raw" are primary data in that sense; sometimes primary and raw coincide (when data are collected electronically)


Name suggestions for "source" folder
  • source_data (current)
  • data_source (better alphabetization)
  • data_primary (mostly consistent with German guidelines)
  • unmodified_data

Name suggestions for "raw" folder
  • raw_data (current)
  • data_raw (better alphabetization)
  • data_primary_structured (consistent with German guidelines)
  • unmodified_formatted

My personal favorite is a hierarchical structure with:

  • data/1-unmodified <- can have arbitrary subfolder structure
  • data/2-unmodified_formatted    <- only this is checked by validator
  • data/3-processed
(Numbers introduced for proper alphabetization)
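
A quick sketch of why the prefixes matter, using only the three folder names above (Python is used purely for illustration): a plain alphabetical sort of the unprefixed names puts "processed" first, while the numbered names sort in workflow order.

```python
# Folder names from the hierarchical suggestion above, with and without prefixes
unprefixed = ["unmodified", "unmodified_formatted", "processed"]
prefixed = ["1-unmodified", "2-unmodified_formatted", "3-processed"]

# File browsers typically list folders alphabetically, much like sorted() does
print(sorted(unprefixed))  # "processed" ends up before both "unmodified" folders
print(sorted(prefixed))    # order matches the workflow: 1, 2, 3
```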


Alternatively, if we stick to the guideline definitions:
  • data/primary <- can have arbitrary subfolder structure
  • data/primary_formatted <- only this is checked by validator
  • data/processed

The second set automatically gives proper alphabetization, which also reflects (at least my) workflow:

1. Throw all original data files in "primary"
2. Write an R script that transforms these data into compliant tsv-files, save in "primary_formatted"
3. Make intermediate data summaries, etc: save in "processed".
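
Step 2 could look roughly like the sketch below. It is written in Python rather than R, and it only covers the simplest case mentioned above, where the survey software already returns a csv-file. The "_data.tsv" file-name suffix and the helper names csv_to_tsv and format_primary are assumptions for illustration, not something the spec prescribes.

```python
import csv
from pathlib import Path


def csv_to_tsv(source_file: Path, target_dir: Path) -> Path:
    """Rewrite one comma-separated file as tab-separated, leaving values untouched."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming rule for illustration: <original stem>_data.tsv
    target_file = target_dir / f"{source_file.stem}_data.tsv"
    with source_file.open(newline="", encoding="utf-8") as src, \
            target_file.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in csv.reader(src):
            writer.writerow(row)
    return target_file


def format_primary(data_dir: Path) -> list[Path]:
    """Convert every csv-file under data/primary into data/primary_formatted."""
    primary = data_dir / "primary"
    formatted = data_dir / "primary_formatted"
    return [csv_to_tsv(f, formatted) for f in sorted(primary.rglob("*.csv"))]
```

Calling format_primary(Path("data")) would leave "primary" untouched and write the converted files next to it, so the original files stay available for anyone who wants to redo the conversion.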

What do you think?
Felix

Melissa Kline

Sep 20, 2019, 11:07:58 AM
to Felix Schönbrodt, Psych-Data-Standards
Just to throw a perspective into the mix, there was a healthy contingent (including I think Tal Yarkoni?) who argued for the removal of *spec controlled* folder naming for anything other than raw/source/unmodified/unformatted data, on the argument that the tools and validator only need to know about this folder (i.e. whatever primary/formatted/unmodified/first in the reproducible analytic pipeline chain is called, the validator will verify its contents). 

I'm very mixed on this question myself!

The counterargument to eliminating 'unmodified-formatted/primary-structured' IMO would be the role of Psych-DS in encouraging good practice and comparability across datasets. That is, identifying the first 'usable' state of the data is special for many applications, and, if we get a great transition to machine-readable, Psych-DS compliant data *generation* (i.e., from tools that we write for our experiments), then those raw/source data will be compliant already, and the 'source' folder would be empty. Similarly, if we have lots of people making (a) experiments that write directly to compliant data or (b) 'converters' that take standard but not Psych-DS formats (e.g. Qualtrics, your favorite eyetracker, etc.) and produce legal Psych-DS, it would be desirable for them all to write their output to a standardly named folder.

Terminology is a *major* challenge here, and one of the main things that user-testers at SIPS struggled to understand (what goes in 'source'? what goes in 'raw'?). Even 'unmodified-formatted' (which I otherwise like), I suspect may seem contradictory to users encountering the folder in the wild.  

Felix Henninger

Sep 20, 2019, 1:39:12 PM
to psych-data...@googlegroups.com

Hej everyone,

thanks for your awesome thoughts! Here are some more, less thought-out ones :-)

[Melissa: ...] there was a healthy contingent (including I think Tal Yarkoni?) who argued for the removal of *spec controlled* folder naming for anything other than raw/source/unmodified/unformatted data, on the argument that the tools and validator only need to know about this folder (i.e. whatever primary/formatted/unmodified/first in the reproducible analytic pipeline chain is called, the validator will verify its contents).

For what it's worth, I've been thinking along similar lines, in that my opinion is that the metadata standard shouldn't be too strongly tied to the folder naming, and be able to (potentially) exist separately. I don't think that means giving up on the folder structure, however (see below).

More specifically, I think it would be useful to make explicit in the metadata which files the information applies to, rather than leaving that implicit (e.g. by default, the top-level dataset_description.json would include an entry like "applies_to": "./raw_data/**/*_data.tsv", signalling that the metadata applies to all files in raw_data ending in _data.tsv, subfolders included).
I would imagine that this would make it much easier for tools that just understand JSON-LD to deal with the data (surely there's a standard JSON-LD key for this already that I just haven't found yet?).
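
To make the idea a bit more concrete, here is a rough Python sketch of how a tool might resolve such an entry. Note that "applies_to" is only the key name floated above, not an existing Psych-DS or JSON-LD property, and files_covered is a made-up helper name.

```python
import json
from pathlib import Path

# Default taken from the example entry above; purely illustrative
DEFAULT_PATTERN = "./raw_data/**/*_data.tsv"


def files_covered(dataset_root: Path) -> list[Path]:
    """List the files that the top-level metadata claims to describe."""
    description_file = dataset_root / "dataset_description.json"
    description = json.loads(description_file.read_text(encoding="utf-8"))
    # "applies_to" is a hypothetical key, see the note above
    pattern = description.get("applies_to", DEFAULT_PATTERN)
    # Path.glob() expects a relative pattern, so drop a leading "./"
    pattern = pattern.removeprefix("./")
    return sorted(p for p in dataset_root.glob(pattern) if p.is_file())
```

With the glob in the metadata itself, a tool would not need to know anything about folder naming conventions to find the files that the description covers.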

The counterargument to eliminating 'unmodified-formatted/primary-structured' IMO would be the role of Psych-DS in encouraging good practice and comparability across datasets.

I agree, and I don't think that's incompatible with the above: I personally wouldn't give up the project structure as part of the standard. In my view it would also be great to (independently) enforce a folder structure, and only give the Psych-DS stamp to datasets that meet this part of the standard too.

Terminology is a *major* challenge here, and one of the main things that user-testers at SIPS struggled to understand (what goes in 'source'? what goes in 'raw'?). Even 'unmodified-formatted' (which I otherwise like), I suspect may seem contradictory to users encountering the folder in the wild.  

Yeah, as a non-native speaker, the raw/source distinction isn't intuitive to me, but I don't have better ideas (I like Ian's suggestion of unmodified; unprocessed, maybe, might be another alternative?). I've added some comments to the doc to this end. I'm also with Felix S. in that I like the hierarchical structure with a top-level data directory.

Ok, so much for my unprocessed raw thoughts, straight from the source if you will 🙊 Kind regards, and have a great weekend y'all!


-Felix

Brett Buttliere

Sep 23, 2019, 4:07:55 AM
to Felix Henninger, psych-data...@googlegroups.com
Hello All, 

Good work, I think this is important, though I think I need to get to a meeting to hear more specifics.

As per questions above: If you generate a first level Data folder, with subfolders for e.g., any distinction you want (original, source, cleaned) and use a 'final/pull from' datafolder where all analyses pull from, it shouldn't be too large a problem and one knows where to look then. Or even just using that Data folder as the main hub and then subfolders e.g., scans, first, cleaned, etc. 

One thing I will say in general is that if you build a validator, I think we should be clear on the conventions etc. first, otherwise you might find yourself reprogramming it 2 or 10 times. Maybe just build it for individual files, or projects? I see the value in a validator, but it seems like part of that 20% value that comes with 80% of the time.

I submitted the attached paper to AMPPS some weeks ago. I think the DS standard might find good value in encouraging standard variable names and labels for psychological variables that are in nearly every dataset, e.g., condition, participant id, gender. There are many standards, yes, but there is much value in standardizing variable labels (as much as possible), as I think you will see.

All feedback welcome, 

Best, 
Brett

The case for setting some standard variable labels when developing new scales.4.docx

Melissa Kline

Sep 23, 2019, 10:43:03 AM
to Brett Buttliere, Felix Henninger, Psych-Data-Standards
Thanks for these Brett! 

You're right that users will be able to name subfolders anything they want inside the data/ folder. The most important thing, from the validator's perspective, is that it knows that it can *ignore* a folder called source (or something...) which contains non-TSV data. 

Since this specification is on the one hand, standardization down to machine-readable levels, and on the other hand, designed for users that may not already have strong instincts on e.g. what folder naming will set them up for success later on, we are skating a little bit of a line here, between recommending just enough to make the spec work and giving broader recommendations for useful/positive patterns that the spec *doesn't* touch. 

There are 2 reasons I think it'll be useful to have a standard name for 'first Psych-DS formatted version'. Both of them really just come down to the fact that the researchers I know (and I do this as well sometimes) have a really, really strong pattern of creating multiple versions of data files as part of the cleaning & analysis plan:

(1) Many users will start with some data source that's really idiosyncratic - e.g. printed sheets of paper on which participants have circled their responses. We want them to save this data! (E.g. by scanning them in so they are digitally archived.) So, their first step will be to upload that archive, whatever it is. Their second step will be to produce a compliant dataset, and IMO we want to tell them *exactly* where to put it. Putting it at the top level of data/ seems bad, because this doesn't set up a good pattern of clearly separating different versions (if any) of the data. So, having specific boxes tagged "put your first non-compliant version here, put your first Psych-DS compliant stuff there" seems useful from the perspective of training users how to use Psych-DS effectively.

(2) Tool authors, who get to create Psych-DS compliant data from the get-go, want to give the user back their data somewhere that's predictable/encapsulated and easy to find. If the other 'bad' kinds of data didn't exist, they'd probably be putting it in a 'raw/source' folder to indicate it's the initial output, but we've stolen that for non-compliant files!

The 'formatted' suffix makes me nervous because people 'format' all kinds of things (like creating indices from variables, dropping variables, removing outlier data....), but maybe it's the best way to indicate the meaning we have, e.g. 

primary-formatted
primary-unformatted

How do people feel about that general pattern? [Word for initial data, sans cleaning, data transformations, etc]-[word indicating whether or not Psych-DS compliant]



****

Other notes, just to keep the thread above focused on folder naming: 

(The logic for starting with the validator is to prevent the problems you mentioned with conventions diverging slightly. In its final form, the validator essentially *is* the specification, because whatever it says, goes. Then for instance, a script which builds Psych-DS datasets as output should finish with a check that the resulting dataset really does validate, to ensure that it's passing a 'good' dataset off to the user of that script. We've just started getting more eyes on the validator and are already discovering many places where the tech spec document turns out to be not specific enough!)

(On variable names: We've discussed this in the past, and there are a ton of people on the list interested in variable name standardization. The way I personally think about this is that the Psych-DS spec will give you a place to *write down* a set of standardized variable names, and enforce them against a data file - the JSON data_description metadata file - but the variable names themselves are not *part* of the spec. At least not yet :) )
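
As a rough illustration of "write down and enforce", here is a minimal Python sketch that compares a tsv-file's header row with a list of declared variable names. How the declared names would actually be stored in the metadata file is not settled here, so the function simply takes them as a list; undeclared_columns is a made-up name.

```python
import csv
from pathlib import Path


def undeclared_columns(data_file: Path, declared: list[str]) -> list[str]:
    """Return the column names in the tsv header that the metadata does not declare."""
    with data_file.open(newline="", encoding="utf-8") as handle:
        # The first row of the tsv-file is assumed to hold the variable names
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in header if name not in declared]
```

An empty result would mean every column in the data file is accounted for; the names themselves stay entirely up to the researcher.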

Melissa Kline

Oct 4, 2019, 7:57:56 AM
to Brett Buttliere, Felix Henninger, Psych-Data-Standards
Returning to this as we come up on the developer meeting, since I'd like to settle this issue! I'd like to propose the following, and hear if anyone has any objections!

- Files inside a folder called raw/ or raw_*/ are ignored by the spec
- EXCEPTION, files inside a folder called raw_psychds/ are checked by the spec

This means that the simplest possible data folder containing raw might look like

data/
--somedatafile_data.tsv
--raw/
----unformattedfiles.jpg

Advantages -

This lets all users use the term 'raw' to denote the initial format of the data (or to use more descriptive terms, like raw_qualtrics, raw_photos, raw_videos), while still allowing for the case where we want to denote raw data that *is* psych-DS compliant.

Moves the issue of 'good patterns for users' more firmly out of the spec and into 'how to use the spec' territory (this contradicts my previous argument :) ). We can suggest a default place to have users put the first version of their psych-DS formatted data, giving clear instructions without actually baking it into the spec.

Disadvantages -

Aesthetically, I don't love baking the name 'psychDS' into the spec, but I can get over it

Confusing to users to have the nested pattern of ignore/don't ignore? I think we can deal with this by recommending the validator give a warning whenever it ignores a file, e.g. "147 data files checked, 35 files in raw_qualtrics/ have been ignored"
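
Here is a rough Python sketch of how the ignore rule and the warning could work together. It is not the actual validator, and it rests on two assumptions that the proposal above leaves open: a file counts as ignored if *any* of its parent folders is raw/ or raw_*/ (so a raw_psychds/ nested inside raw/ is still ignored), and every file that is not ignored counts as "checked".

```python
from collections import Counter
from pathlib import Path
from typing import Optional


def ignoring_folder(relative_path: Path) -> Optional[str]:
    """Return the first parent folder that causes the file to be ignored, if any."""
    for folder in relative_path.parts[:-1]:
        if folder == "raw_psychds":
            continue  # the one raw_* folder that is still checked
        if folder == "raw" or folder.startswith("raw_"):
            return folder
    return None


def summarize(data_dir: Path) -> str:
    """Count checked and ignored files and build a warning-style summary line."""
    checked = 0
    ignored = Counter()
    for path in sorted(data_dir.rglob("*")):
        if not path.is_file():
            continue
        folder = ignoring_folder(path.relative_to(data_dir))
        if folder is None:
            checked += 1
        else:
            ignored[folder] += 1
    parts = [f"{checked} data files checked"]
    parts += [f"{count} files in {name}/ have been ignored"
              for name, count in sorted(ignored.items())]
    return ", ".join(parts)
```

Reporting the ignored folders by name, as in the example message above, should make it obvious to users when a folder was skipped because of its name rather than because its contents passed.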

Others?




Ó Carragáin, Eoghan

Oct 4, 2019, 8:45:24 AM
to Melissa Kline, Brett Buttliere, Felix Henninger, Psych-Data-Standards

Hi all,

Apologies, I won’t make the dev call today but will try to join in future.

 

From a cursory reading of this thread, it seems to be a question of where psych-ds sits in terms of “convention over configuration” (https://en.wikipedia.org/wiki/Convention_over_configuration). It would be nice to agree on a set of standard folder names, but that may be a barrier to entry for some adopters. I wonder if the spec could: a) clearly define the discrete types of data (i.e. source vs raw etc.) and state which ones are relevant for validation purposes; b) suggest a ‘reasonable convention’ of ‘sensible defaults’ for people who don’t have a strong preference (most newcomers may just follow this convention); c) have a mechanism in the metadata file to ‘configure’ non-standard folder names, i.e. someone could designate a folder called “source” as relating to the Psych-ds notion of raw-formatted. A validator could look first for the configuration, and failing that check for the convention? Felix, I think this is similar to what you are suggesting with some sort of "applies_to" property below?
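
A minimal Python sketch of that lookup order, configuration first and convention second. The metadata key "raw_formatted_folder" and the default folder name "raw_data" are placeholders made up for this example; neither is defined by the spec.

```python
import json
from pathlib import Path

# Placeholder convention; the thread has not settled on a folder name
DEFAULT_CHECKED_FOLDER = "raw_data"


def checked_folder(dataset_root: Path) -> Path:
    """Find the folder a validator should check: configuration first, then convention."""
    description_file = dataset_root / "dataset_description.json"
    if description_file.is_file():
        description = json.loads(description_file.read_text(encoding="utf-8"))
        # "raw_formatted_folder" is a hypothetical configuration key
        configured = description.get("raw_formatted_folder")
        if configured:
            return dataset_root / configured
    # No usable configuration found: fall back to the conventional name
    return dataset_root / DEFAULT_CHECKED_FOLDER
```

Newcomers who follow the default never need to touch the configuration, while projects with an established layout only have to add one line to the metadata file.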

 

Apologies, though, I haven’t really digested the issue enough to be sure this really makes sense, so feel free to disregard!

 

Best,
Eoghan

Erin Buchanan

Oct 4, 2019, 9:38:17 AM
to Ó Carragáin, Eoghan, Melissa Kline, Brett Buttliere, Felix Henninger, Psych-Data-Standards

I guess I am on board with a mix of ideas … I like the idea of raw and source being in different folders, but I was 100% on board with everything Felix S. said – having numbers or alphabetical order is very appealing. I might agree with this because it mostly matches what I do.

 

If I think about an end user who is trying this for the first time, having a set of common rules would be very helpful. Oh, you want me to name it “source/swiss_cheese”? Great, I’ll do that. From my experience, people like black and white instructions when they are novices. See you guys in a few.

 

erin

James Green

Oct 10, 2019, 6:02:35 PM
to Erin Buchanan, Ó Carragáin, Eoghan, Melissa Kline, Brett Buttliere, Felix Henninger, Psych-Data-Standards
So I've been mostly out of the loop on psych-ds since SIPS, but still actually very interested. I was prompted to come back to this tonight, because I'm setting up a new project folder for an old project I'm tidying up. My 2c

- If we use numbers, then the order of precedence is explicit, even if your project folder is sorted by date, or if you can't remember the exact definitions. So that seems more foolproof.
- Controversially (and there may be good reasons not to), I was wondering whether the non-formatted whatever could perhaps be 0. It's then not obviously part of a 1, 2, 3 scheme, but still fits in if required.

Best wishes,
James

