Updates and a straw poll!

Melissa Kline

Feb 3, 2019, 9:27:33 AM
to Psych-Data-Standards
Hi everyone - despite the silence on the mailing list for the past little while, I have some Psych-DS updates! As usual, here's the summary, followed by details:

(1) "Road Testers" - updates, issues? 
(2) I have a new job! (And a related question for BIDS people...)
(3) CSV or TSV? Take the poll!
(4) What's next?

Reminder: Feel free to respond directly to this email thread! In addition, where there are bigger projects to spin off, there are related github issues linked that we can use for discussion. 

Second reminder: All are welcome to participate in these conversations, even if you've just joined the group! Please make sure you take a look at our code of conduct.

(1) Example-Datasets & Road Testers

We now have 3 (soon to be 4) datasets! They come from a variety of sub-domains and, most importantly, were constructed by different teams, which means we're likely to find inconsistencies that help us understand where Psych-DS is ambiguous or poorly documented.

* If you have a dataset you've been thinking of adding, please do so! The more the merrier!

* If you have already contributed a dataset (THANK YOU!), this would be a good time to check out the How-To document again: If you used it, were there any points that were confusing? Either way, please add what you've learned, and what tools you used! A good possible reader to keep in mind is a scientist (including a graduate student) who is used to working in Excel/SPSS/not the command line, but who is excited to get on the Psych-DS train: this document should hopefully help them feel *more* excited and prepared to get started. 

* If you are looking for a way to help out, this would be a good moment to look through the existing datasets for inconsistencies (especially in dataset_description.json). If you find possible things to discuss, you can open an issue in the example-datasets repository. 

(2) New job for MK

There will be a more official announcement coming shortly, but I have accepted a research scientist position at the Center for Open Science! The project I'm working on involves a large collaboration that will be conducting a significant number of replications in the social sciences, and I'm hopeful that Psych-DS will be a good fit for some of our data management needs. 

I've also talked briefly with the OSF development team - right now, they are focusing on ways to create better metadata for *individual files*, and trying to stay away from too much domain-specific stuff. So in other words, right now the immediate goal is to provide ways for people who already have data in OSF to tag the files that are uploaded in the system. 

BIDS people - have you had any discussion about integration with OSF? This seems like it might be primarily a job of work on OSF's end to properly ingest a directory that already comes with structured info, but I'm curious to know if you've had this discussion at all. In the very long term, just as OSF supports a lot of different storage integrations, it would be great if they supported common specification formats from particular disciplines. 

(3) CSV or TSV?

Patrick brought up an issue that I've also discussed with a few other people: TSV may be a barrier to using Psych-DS, because (unlike CSV), it's not in the drop-down list of formats to save from an Excel spreadsheet, and it requires the user to fiddle directly with file extension naming (save as a *.txt, convert to *.tsv). I will recap how we initially made this decision, but I do think it would be good to step back now that we've tried things out and think about re-evaluating. All possible solutions have "pain points", so the goal here is to decide which ones we hate the least! 

- We originally chose TSV for 2 reasons: it is the format that BIDS uses for tabular data, and BIDS chose it because tab separators lead to fewer encoding issues (because humans put commas all over natural text). Once people get their data into this format, they are probably going to be happy, and have fewer troubles. On the other hand, for users who have never paid any attention to file extensions, TSV is a totally unfamiliar format, and somewhat hard to access. At least with Excel, for instance, producing a CSV requires 1 click in a drop-down menu, while producing a TSV requires a click to save as tab-delimited text, and then using your right-click to change the extension (Mac), or getting into the terminal (Windows). There is a good chunk of our potential users who will bounce off this step.

Road testers - how was this for you? How do you think it would be for other users?

- In contrast, CSV is more widely used. Our potential users are more likely to have encountered it 'in the wild', and many programs already have the option to output or download as CSV, without requiring further fiddling.  

The major downside here is that humans are more likely to include a comma than a tab as *actual data*, which means that programs that encode and read CSV have to include careful rules/quoting to avoid clobbering the data. So, once a user has converted their data to CSV, it's somewhat more likely that they'll run into a technical problem they have to solve further down the road. 

- Why not just let people use both? We could indeed make this call, but the primary downside, and it's a big one, is that it makes almost every program we write to deal with Psych-DS directories more complicated. Among other things, you need logic checking whether a file is TSV or CSV, asking users which kind of output they want, etc. Essentially, this doubles the list of ways things can go wrong, in exchange for flexibility on the user end. 
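To make the comma-quoting problem described above concrete, here is a minimal sketch using Python's standard csv module. The sample row and the helper names write_rows and read_rows are made up for illustration; nothing here is part of the Psych-DS specification.

```python
import csv
import io


def write_rows(rows, delimiter):
    """Serialise rows to delimited text, quoting a field only when it needs it."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(text, delimiter):
    """Parse delimited text back into a list of rows, honouring quotes."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# A free-text response that contains a comma, as natural text often does.
rows = [["id", "response"], ["1", "Yes, but only sometimes"]]

csv_text = write_rows(rows, ",")   # the response field has to be quoted
tsv_text = write_rows(rows, "\t")  # no quoting needed for this row

# Splitting on commas by hand ignores the quotes and breaks the field apart.
naive_fields = csv_text.splitlines()[1].split(",")
```

A quote-aware reader recovers the original two fields from csv_text, while the naive split produces three pieces. The TSV version of the same row contains no quote characters at all, which is the sense in which tabs cause fewer problems (a response containing a literal tab would of course need the same care).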

I'd love the community to weigh in on this, so I've set up a straw poll here. Feel free to make your arguments either way as well, if you'd like to influence this vote :)

(4) What's next? 

We should 'refresh' the roadmap that's sitting on the main repository page. I know we have a number of people who've expressed interest in getting started on the validator - can you re-identify yourselves in this email thread, and we can find a time to have a call? 

More generally, I'd like for us to identify some specific 2019 goals to work towards! My list includes:

* A functioning validator

* Submitting a journal article! I think the actual writing will be straightforward, because much of the text from the actual specification document can be reused, but we should flesh out the checklist of what-all Psych-DS needs to have ready for what will hopefully be an influx of new users who read that article :)

If you got all the way to the end of this email, you get a gold star!

All the best,

Melissa

Ruben Arslan

Feb 3, 2019, 9:57:40 AM
to Melissa Kline, Psych-Data-Standards
Melissa, congratulations on the new job! 

Just one very brief point:
If we are going to think about Excel implementations, CSV has a big drawback.
In many European countries that use the comma as the decimal separator, Excel's CSV export uses semicolons as the field separator by default. This often makes both exporting and importing in Excel frustrating when files move across countries.
This might be an argument for TSV? We can easily link to a drag and drop CSV to TSV converter.
Do all Excel distributions default to UTF-8 now? Because that was another commonplace problem when you allow plaintext formats from Excel (but maybe that changed?) and it's very hard to fix once it happens.
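A converter along these lines could be quite small. The following is only a sketch, assuming the caller already knows which separator the input uses (";" for the European-style export, "," otherwise); the function name delimited_to_tsv is hypothetical and not an existing tool.

```python
import csv
import io


def delimited_to_tsv(text, delimiter=","):
    """Re-write delimited text as tab-separated text.

    Field contents are left untouched, so a decimal comma such as 0,52
    stays exactly as the author typed it.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    output = io.StringIO()
    writer = csv.writer(output, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return output.getvalue()


# A semicolon-separated export with decimal commas, as described above.
european_export = "rt;accuracy\n0,52;1\n0,61;0\n"
tsv_text = delimited_to_tsv(european_export, delimiter=";")
```

The sketch deliberately does not guess the separator or touch the encoding; both would need to be handled (or asked of the user) in a real drag-and-drop tool.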

--
You received this message because you are subscribed to the Google Groups "Psych-Data-Standards" group.
To unsubscribe from this group and stop receiving emails from it, send an email to psych-data-stand...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/psych-data-standards/CAF%3DPoJOAG6R9SG5K7rkedLt2AJHaz59ZWxXPyUc3LJ7%3D%3DsvuoA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Russ Poldrack

Feb 3, 2019, 10:28:10 AM
to Melissa Kline, Psych-Data-Standards
hey Melissa - congrats on the new position!
We have been in touch with OSF folks at various points about better integration between OSF and OpenNeuro, but I don't think we have specifically discussed BIDS (Chris can clarify if I'm wrong about that).
My philosophy has always been that the #1 priority is reducing barriers to adoption. So in some sense you are polling the wrong people - you really want to know whether the requirement to use TSV (which I agree is a well-motivated one, and prevents all sorts of hairballs with CSV parsing) will prevent people from using the standard. I don't get the feel that it has been much of a problem for BIDS, but then again most of our users are not working in Excel!
cheers
russ




--
Russell A. Poldrack
Albert Ray Lang Professor of Psychology
Professor (by courtesy) of Computer Science
Bldg. 420, Jordan Hall
Stanford University
Stanford, CA 94305

pold...@stanford.edu
http://www.poldracklab.org/

Rickard Carlsson

Feb 3, 2019, 10:28:10 AM
to Ruben Arslan, Melissa Kline, Psych-Data-Standards
Hi,

Congrats!

I also agree with Ruben on CSV. This has been one of the major challenges both for me and when working with students / colleagues. I’ve still not found a way to make things work properly without changing my locale settings in OS X from Swedish to pseudo-American.

Best,
Rickard Carlsson

Tal Yarkoni

Feb 3, 2019, 11:02:30 AM
to Rickard Carlsson, Ruben Arslan, Melissa Kline, Psych-Data-Standards

Congratulations, Melissa!

Re: csv vs tsv, I agree with the comments above. Like Russ, I'm generally on the "emphasize adoption" side. But in this case I don't think the cost of using tsv is very high. It's worth keeping in mind that there are other things people will have to do to make their datasets compliant besides saving with the right extension. As obstacles go, renaming the file extension from .txt to .tsv in Excel seems to me like a much lower bar than formatting a JSON file properly. Put differently, if a user stalls at "save the file as a .tsv" (and I don't doubt some will), I think there will almost certainly be other failure points anyway, and the solution is to develop good GUI-based tools that guide them through the process. E.g., if a user is already going to be relying on a JSON-generation tool, it would be trivial for that tool to read in .txt files and fix the extension on output. So in practice, I think enforcing .tsv is not actually going to reduce adoption meaningfully given that we're not budging on JSON, strict file-naming conventions, etc.
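As a rough sketch of how small that extension-fixing step could be, assuming the tool is pointed at a dataset directory and treats any .txt file whose first line contains a tab as tabular data (the function name rename_tab_delimited_txt and that heuristic are assumptions for illustration, not an existing Psych-DS tool):

```python
from pathlib import Path


def rename_tab_delimited_txt(directory):
    """Rename tab-delimited .txt files under `directory` to .tsv.

    Returns the list of new paths. Files whose first line has no tab
    (for example free-text notes) are left alone, and nothing is
    overwritten if a .tsv with the same name already exists.
    """
    renamed = []
    # sorted() collects the paths first, so renaming does not disturb the walk.
    for path in sorted(Path(directory).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        first_line = text.split("\n", 1)[0]
        if "\t" not in first_line:
            continue
        target = path.with_suffix(".tsv")
        if target.exists():
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

A GUI tool could call something like this on the folder a user drags in, then report which files were renamed.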

Tal

Chris Gorgolewski

Feb 3, 2019, 11:17:56 AM
to Tal Yarkoni, Rickard Carlsson, Ruben Arslan, Melissa Kline, Psych-Data-Standards
Congrats Melissa!

It's worth noting that Excel has other issues when it comes to exporting text data - it uses a wrong new line character: https://nicercode.github.io/blog/2013-04-30-excel-and-line-endings/. We had to add a new test for that in the validator.
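For anyone curious what fixing this looks like, here is a minimal sketch of line-ending normalisation. It is not the BIDS validator's actual test, and the function name normalize_newlines is made up; it simply assumes the file is read as raw bytes and that both CR-only and CRLF endings should become LF.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF.

    CRLF is handled first so that it becomes a single LF rather than two.
    Caveat: a carriage return inside a quoted field would be converted too.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def has_bare_carriage_returns(data: bytes) -> bool:
    """Report whether any CR is not immediately followed by LF."""
    return b"\r" in data.replace(b"\r\n", b"")
```

A check like has_bare_carriage_returns could drive a warning, with normalize_newlines offered as the fix.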

Best,
Chris

Melissa Kline

Feb 3, 2019, 11:23:07 AM
to Chris Gorgolewski, Tal Yarkoni, Rickard Carlsson, Ruben Arslan, Psych-Data-Standards
I am very familiar with that line-ending nonsense, it's the worst! I agree it would be nice to poll potential users (rather than ourselves) - maybe the PSA would be okay with us sending around a poll? In addition to the TSV/CSV question, we could ask about Excel use, and familiarity with file extensions generally (modern Mac OS hides them by default, for instance. Often my undergrads don't know about them.) 

My suspicion is that Excel - including all kinds of old versions & localizations - is going to be non-negotiable, and we need to figure out how to manage its quirks, rather than ban it. And, hopefully, use Psych-DS to introduce people to better tools!


-----
Melissa Kline
Postdoctoral Fellow
Harvard Psychology/MIT Brain & Cognitive Sciences

I check email about 2-3 times a day. If you need something from me in the next 4 hours please call or text me on my cell phone!

Rickard Carlsson

Feb 3, 2019, 11:23:08 AM
to Tal Yarkoni, Ruben Arslan, Melissa Kline, Psych-Data-Standards
Agree 100 % with Tal!

Melissa Kline

Feb 3, 2019, 2:01:32 PM
to Patrick S. Forscher, Psych-Data-Standards
Thanks Patrick! Ruben, you mentioned a drag & drop converter, did you have something specific in mind? I agree that if we stick with TSV, we'll need to be prepared to support people around that step. 

-m



On Sun, Feb 3, 2019 at 1:59 PM Patrick S. Forscher <fors...@uark.edu> wrote:
Let me join the chorus in saying that’s awesome, Melissa! COS should feel lucky to have you. 😄

I raised the csv issue on Github, so maybe it’s worthwhile for me to elaborate on my reasoning.

Just to lay out where I’m coming from, I’m above the median psych researcher in tech-savviness but probably not by much (70th percentile maybe). That said, I have never worked with or seen .tsv files and I had to puzzle through for a while to figure out how to get my data file saved with a .tsv extension. I often use Excel for format conversion, and while Excel saves in a tab-delimited .txt format, it does not save in .tsv, nor do any of the tools I regularly use have this as a default (e.g., Qualtrics, Inquisit). Yes, I can use the command line to modify the extension, and the process is not hard if you’re familiar with the command line, but I don’t regularly interact with the command line.

I’ve also taught a number of new grad students, and this experience has taught me that they vary a lot in their comfort & familiarity with tech. For example, we usually need to devote ~a week of tech support time to helping students install R.

I hadn’t thought of the different .csv standards across countries, so maybe that would be a reason to stick with .tsv. However, I worry that most of the people on this list are beyond even my 70th percentile in tech savviness and so may not realize that the barrier imposed by the unfamiliar extension is actually a significant one. Even allowing tab-delimited .txt files would help, as these are often referenced and supported in the Windows software ecosystem.

All that said, maybe the .tsvs are the right way to go — but if this is what we do we’ll need to be very thoughtful about how to reduce frictions for non-tech-savvy users.

- Patrick

--
Patrick S Forscher
Assistant Professor
Department of Psychological Science
University of Arkansas

Chris Gorgolewski

Feb 3, 2019, 2:25:35 PM
to Melissa Kline, Patrick S. Forscher, psych-data...@googlegroups.com
I'm afraid that because of the line ending issue some conversion will be necessary even if you pick CSV. The issue, however, only affects Excel on Mac - it would be good to know what percentage of your target community would use that tool combination to curate their datasets.

Marietta Papadatou-Pastou

Feb 4, 2019, 7:40:40 AM
to Chris Gorgolewski, Melissa Kline, Patrick S. Forscher, psych-data...@googlegroups.com
Dear Melissa,

Congrats on your new post and thank you for managing us!

I agree that some languages (Greek in my case) use commas as the decimal character, which complicates matters.

I wonder if we can get in touch with Microsoft and OpenOffice people and ask them to include TSV in their dropdown menu in future versions. Open Science is becoming bigger and bigger, so they might consider adopting this proposal.

Best,
Marietta




--
Dr Marietta Papadatou-Pastou CPsychol CSci AFBPsS
National and Kapodistrian University of Athens

Peder Isager

Feb 4, 2019, 7:40:40 AM
to Chris Gorgolewski, Melissa Kline, Patrick S. Forscher, psych-data...@googlegroups.com
Congrats on the position Melissa! 

Just FYI, I have implemented psych-DS for a data directory I am currently populating as part of an ongoing project. I can't share this on github yet because we have some proprietary data in there (non-open access journal articles etc.), but I will try to update relevant documents with anything relevant that might come up while curating the dataset. One issue I'm currently uncertain about is what to do when the project has several independent components. That is, there will be a pilot dataset in addition to the main dataset, and the pilot has several files and directories connected to it. I considered simply making a pilot/ directory in the top level structure, but I'm not sure if I should then treat pilot/ as a self-contained psych-DS directory with its own top level structure or not. Is there any information in the spec document about this that I might have missed?

Regarding TSV vs CSV, I agree with Tal. I think that if a user not familiar with the file formats and logic of the specification decides to get on board, figuring out how to generate a TSV will be a small part of the total learning curve. 

From a user's perspective, one potential issue I noticed while working with TSV in Excel was that Excel (sometimes) decides to launch an inquisition against the format. It would constantly prompt me to change the file format whenever I would save or close the TSV file inside Excel. I explicitly had to save the file and then exit by clicking "continue without saving", which I imagine could be a source of frustration for some users. 


Best,
Peder

Melissa Kline

Apr 4, 2019, 7:51:58 PM
to Alicia Hofelich Mohr, Marietta Papadatou-Pastou, Chris Gorgolewski, Patrick S. Forscher, Psych-Data-Standards
(Sorry, catching up on old threads as I settle into the new job!)

In principle there's nothing wrong with the .txt except that it will tend to get used for a wider range of things, while .tsv files are more regular, so it's useful to specify that this is one of those 'special' kinds of text files :)  Conversely, users might get confusing errors if they have non-table txt files sitting around and the Psych-DS validator tries to 'eat' them. 

(And the logic for just using one extension is to make it easier for the validator - fewer cases to check - and easier for people to keep track of - fewer rules.) 
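To illustrate why a single extension keeps the validator simple, here is a rough sketch of the kind of check it might run. This is an assumption about how such a validator could work, not a description of an existing one; the names should_validate and check_tsv_text are hypothetical.

```python
from pathlib import Path


def should_validate(filename):
    """Only .tsv files are treated as data tables; other text files are skipped."""
    return Path(filename).suffix.lower() == ".tsv"


def check_tsv_text(text):
    """Return a list of problems found in tab-separated text (empty if none).

    The only rule checked here is that every row has as many
    tab-separated fields as the header row.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty piece after a trailing newline
    if not lines:
        return ["file is empty"]
    expected = lines[0].count("\t") + 1
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        found = line.count("\t") + 1
        if found != expected:
            problems.append(
                f"line {number}: expected {expected} columns, found {found}"
            )
    return problems
```

With one extension, a stray notes.txt is never handed to check_tsv_text, so users do not see confusing column-count errors about files that were never meant to be tables.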

It may indeed be that allowing multiple kinds of delimiters is the best case in the long term, but in the short term, the hope is it will be useful to get people on the same page (and then we can scaffold them into (a) thinking about what their delimiters are and (b) recording what their delimiters are). 



 

On Mon, Feb 4, 2019 at 9:16 AM Alicia Hofelich Mohr <hofe...@umn.edu> wrote:
Congrats, Melissa! Your new position sounds like a perfect fit!

What is the issue with people saving files as tab-delimited .txt rather than .tsv? From a curation/longevity perspective, aren't these files exactly the same? I get that it's not as transparent from the file extension (and therefore one may have to coax Excel into opening it), but are there other issues? The census and many other repositories keep data files (whether csv, tsv, or ascii) as .txt and include documentation on delimiter specifics. We could just encourage tab-delimited rather than comma-delimited files and not worry too much about the extension (.tsv/.txt).  

Best,
Alicia




--
Alicia Hofelich Mohr, Ph.D.
Research Support Services Coordinator
College of Liberal Arts LATIS
University of Minnesota | 612-626-8456
Pronouns: she, her, hers

Melissa Kline

Apr 4, 2019, 8:03:54 PM
to Psych-Data-Standards
I'm migrating some topics to github issues to help us keep track of things from these threads.  

There is already an issue open about working with psych-DS when you wish to keep the data directory private but others public on OSF: https://github.com/psych-ds/psych-DS/issues/19

For the issue with Excel clobbering your text files (I experience this with both TSV and CSV), our user manuals should warn people about that confusing 'warning' popup that leads you to convert files to xlsx: https://github.com/psych-ds/psych-DS/issues/20