Hi everyone - despite the silence on the mailing list for the past little while, I have some Psych-DS updates! As usual, here's the summary, followed by details:
(1) "Road Testers" - updates, issues?
(2) I have a new job! (And a related question for BIDS people...)
(3) CSV or TSV?
(4) What's next?
Reminder: Feel free to respond directly to this email thread! In addition, where there are bigger projects to spin off, there are related github issues linked that we can use for discussion.
Second reminder: All are welcome to participate in these conversations, even if you've just joined the group! Please make sure you take a look at our code of conduct.
(1) "Road Testers" - updates, issues?
We now have 3 (soon to be 4) datasets! They come from a variety of sub-domains and, most importantly, were constructed by different teams, which means we're likely to find inconsistencies that help us understand where Psych-DS is ambiguous or poorly documented.
* If you have a dataset you've been thinking of adding, please do so! The more the merrier!
* If you have already contributed a dataset (THANK YOU!), this would be a good time to check out the How-To document again: If you used it, were there any points that were confusing? Either way, please add what you've learned, and what tools you used! A good possible reader to keep in mind is a scientist (including a graduate student) who is used to working in Excel/SPSS/not the command line, but who is excited to get on the Psych-DS train: this document should hopefully help them feel *more* excited and prepared to get started.
* If you are looking for a way to help out, this would be a good moment to look through the existing datasets for inconsistencies (especially in dataset_description.json). If you find possible things to discuss, you can open an issue in the example-datasets repository.
(2) New job for MK
There will be a more official announcement coming shortly, but I have accepted a research scientist position at the Center for Open Science! The project I'm working on involves a large collaboration that will be conducting a significant number of replications in the social sciences, and I'm hopeful that Psych-DS will be a good fit for some of our data management needs.
I've also talked briefly with the OSF development team - right now, they are focusing on ways to create better metadata for *individual files*, and trying to stay away from too much domain-specific stuff. In other words, the immediate goal is to provide ways for people who already have data in OSF to tag the files that are uploaded in the system.
BIDS people - have you had any discussion about integration with OSF? This seems like it might be primarily work on OSF's end, to properly ingest a directory that already comes with structured info, but I'm curious to know if you've had this discussion at all. In the very long term, just as OSF supports a lot of different storage integrations, it would be great if they supported common specification formats from particular disciplines.
(3) CSV or TSV?
Patrick brought up an issue that I've also discussed with a few other people: TSV may be a barrier to using Psych-DS, because (unlike CSV), it's not in the drop-down list of formats to save from an Excel spreadsheet, and it requires the user to fiddle directly with file extension naming (save as a *.txt, convert to *.tsv). I will recap how we initially made this decision, but I do think it would be good to step back now that we've tried things out and think about re-evaluating. All possible solutions have "pain points", so the goal here is to decide which ones we hate the least!
- We originally chose TSV for 2 reasons: this is the format for tabular data that BIDS uses, and BIDS chose it because tab separators lead to fewer encoding issues (because humans put commas all over natural text). Once people get their data into this format, they are probably going to be happy, and have fewer troubles. On the other hand, for users who have never paid any attention to file extensions, TSV is a totally unfamiliar format, and somewhat hard to access. At least with Excel, for instance, producing a CSV requires 1 click in a drop-down menu, while producing a TSV requires a click, and then using your right-click to change the extension (Mac), or getting into the terminal (Windows). There is a good chunk of our potential users who will bounce off this step.
Road testers - how was this for you? How do you think it would be for other users?
- In contrast, CSV is more widely used. Our potential users are more likely to have encountered it 'in the wild', and many programs already have the option to output or download as CSV, without requiring further fiddling.
The major downside here is that humans are more likely to include a comma than a tab as *actual data*, which means that programs that encode and read CSV have to include careful rules/quoting to avoid clobbering the data. So, once a user has converted their data to CSV, it's somewhat more likely that they'll run into a technical problem they have to solve further down the road.
- Why not just let people use both? We could indeed make this call, but the primary downside, and it's a big one, is that it makes almost every program we write to deal with Psych-DS directories more complicated. Among other things, you need logic checking whether a file is TSV or CSV, asking users which kind of output they want, etc. Essentially, this doubles the list of ways things can go wrong, in exchange for flexibility on the user end.
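To make that extra complexity concrete, here is a minimal Python sketch of the kind of branching a tool would need if both formats were allowed. The names `DELIMITERS`, `delimiter_for`, and `read_rows` are made up for illustration and are not part of any existing Psych-DS tooling; the sketch also assumes the file extension can be trusted, which is itself an assumption a real tool would have to check.

```python
import csv
import io
from pathlib import PurePath

# Hypothetical lookup table: one entry per tabular format the tool accepts.
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def delimiter_for(filename):
    """Pick a delimiter from the file extension, or fail loudly."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in DELIMITERS:
        raise ValueError(f"Unsupported tabular format: {filename}")
    return DELIMITERS[suffix]


def read_rows(filename, text):
    """Parse file contents (already read into `text`) as a list of rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter_for(filename))
    return list(reader)


# The same data arrives in two shapes and must end up identical.
rows_from_csv = read_rows("study_data.csv", "id,score\n1,5\n")
rows_from_tsv = read_rows("study_data.tsv", "id\tscore\n1\t5\n")
```

Every reader, writer, and validator check would need to route through something like this, and that is before handling a file whose extension and contents disagree.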
I'd love the community to weigh in on this, so I've set up a straw poll here. Feel free to make your arguments either way as well, if you'd like to influence this vote :)
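For anyone who wants to see the comma-versus-tab trade-off in action before voting, here is a small Python sketch using the standard `csv` module. The example row and the helper `write_row` are invented for illustration; this is not taken from any of the road-test datasets.

```python
import csv
import io

# A made-up row where a free-text answer contains a comma.
row = ["p01", "I agree, mostly", "5"]


def write_row(fields, delimiter):
    """Serialize one row with the given delimiter and return it as text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue()


as_csv = write_row(row, ",")   # the comma-containing field gets wrapped in quotes
as_tsv = write_row(row, "\t")  # no quoting needed, the field contains no tab

# A program that splits on commas without honoring quotes breaks the row apart.
naive_fields = as_csv.strip().split(",")

# A reader that follows the quoting rules recovers the original three fields.
parsed_fields = next(csv.reader(io.StringIO(as_csv)))

print(repr(as_csv))
print(repr(as_tsv))
print(len(naive_fields), len(parsed_fields))
```

The CSV version only round-trips if every program that touches the file applies the same quoting rules, while the TSV version of this particular row survives even a naive split on tabs. A tab or line break inside a field would cause the equivalent problem for TSV; it is just less common in hand-entered data.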
(4) What's next?
We should 'refresh' the roadmap that's sitting on the main repository page. I know we have a number of people who've expressed interest in getting started on the validator - can you re-identify yourselves in this email thread, and we can find a time to have a call?
More generally, I'd like for us to identify some specific 2019 goals to work towards! My list includes:
* A functioning validator
* Submitting a journal article! I think the actual writing will be straightforward, because much of the text from the actual specification document can be reused, but we should flesh out the checklist of what-all Psych-DS needs to have ready for what will hopefully be an influx of new users who read that article :)
If you got all the way to the end of this email, you get a gold star!
All the best,
Melissa