Post SIPS2019 welcome & roundup

Melissa Kline

Jul 13, 2019, 4:03:38 AM

to Psych-Data-Standards

Dear all - here comes one of my (apparently) standard mega-long emails. As always, here's the summary:

(0) Vacation

(1) THANK YOU and welcome to new members

(2) Brief summary of SIPS2019 (Attendees: chime in with things I forgot)

(3) Clarifying the scope of Psych-DS

(4) Immediate tasks

(5) Additional notes/roundup (With bonus animal picture for getting this far)

(0) Vacation

I am on one! I'll be very sporadically available for about the next week & a half; thanks in advance for your patience and for helping each other if you know the answer to something! Please open issues/ask questions liberally :)

(1) THANK YOU & welcome

SIPS went amazingly, and it's thanks to all the people (both current contributors and new faces) who put in a TON of time to learn, critique & move the specification forward. There were ~30 people on the conference slack channel, and I've counted at least 10 new members on this list so far. To the new folks (and everyone): I apologize for the out-of-dateness of the README, but it's still a good place to get oriented to materials if you are just arriving. In particular, please take a minute to look at the code of conduct.

(2) SIPS2019 Summary

We had two psych-ds events: a session focused on introducing the spec and brainstorming use cases and problems, and then a second hackathon that really got down to business! You can see the hackathon document here. The third big point of interest from SIPS2019 is that there were a number of projects that are pre-emptively planning to head toward using Psych-DS (!) Fortunately, many of those projects have at least one person who's now on this list and interested in pulling a usable product over the finish line - THANK YOU - and we should do our best to keep track of who all these folks are. In particular, the open research documentation (Lakens & DeBruine) hack deals with a specification for scientific project folders writ large, where the psych-ds specification can serve to describe the data component/folder in particular. See section (3) for more on this, but it's important to remember that merging psych-ds into other standards is a *great* possible outcome - the goal is consensus on a system that works for the research community.

- At the brainstorm hackathon, we broke into 3 groups to discuss a series of prompts aimed at identifying who wants to use this specification and why. Their notes are in the hack document, and I wanted to particularly mention an idea for a teaching document, which people can work on *right now* (and for which we'll have a real issue when I'm not on vacation..) if so inclined: a 'why does the specification want me to do this' list that pairs each requirement with questions/explanation, for instance why the spec is asking me to format my data files as TSV. This point-by-point summary of the specification will serve a bunch of coherent purposes, and making sure it meets the needs of newer users** is a big one. This session also resulted in the beginnings of a list of 'converters' to cover common use cases we may want to write, such as "make my raw qualtrics output/mechanical turk downloads/whatever into a psych-ds folder system", "make my psych-ds folders into a website that Google will crawl/ an OSF project that lets me control privacy settings of the data component separate from the rest", etc.

- At the second hackathon, we had a group of newer users** who dove right in to continue this work, and who left a lot of comments on the specification. It would be super useful for people to take a look at these comments, even if just to add '+1' if someone else is confused about the same things as you are. We also had our first real, in-person developer meeting to lay out the architecture of the validator app(s). We need a longer summary of this meeting (in particular, some comments/outcomes are in the slack channel, which not everyone here has access to; we need to migrate these to the issue list at psych-ds/psych-ds), but the basic upshot is that we are planning on getting started writing it! See (4) for details below.

**I mean people who are taking the time to dive in and shape the specification to what they actually need for their scientific work, point out what isn't clear and what features are necessary. If you're in this group (if you are on this list, and your primary 'thing' isn't writing the validator, this is you), any ideas for a name? Test-drivers?

(3) Psych-DS vision

An important outcome of this meeting (IMO) is that as psych-DS grows, we need to keep a 'mission statement' in mind. Writing this should be a community effort, but here are some initial thoughts about 'the point' of this project. (Please disagree freely!)

- Users are actually central here: Working scientists, especially small teams/single people who are making the many 'small' datasets across psychology. We want to prioritize clear communication and consensus building so that people working in this field can work with each other.

- What psych-ds is trying to describe is really 'scientific data in context'. It's not just a packet of 'data' of some kind. That is, one thing that makes this effort different from some other standards (less like frictionlessdata.io, more like BIDS) is that we don't count a pdf article as 'data' - we really mean measurements of behavior or other social scientific data that are generated for the research, and which subsequently get analyzed and interpreted.

- Psych-ds tries to encourage best practices. The schema should 'nudge' you into other good practices, so that if you want to try for (e.g.) a computationally reproducible workflow on your next project, you're in a good position to do so.

- The spec is syntactic first, with only enough 'meaning' baked in to let us build that skeleton. It can support semantics (e.g. things like an ontology for variables or topics) but we need to focus on implementing a simple structure that works first. BIDS has a robust extension system that applies the skeleton to new kinds of meaningful data & workflows; this may be a good model to work with.

- Finally, it's OK if the specification part of Psych-DS eventually merges into some other specification! We aren't trying to 'get our way' with an exact filenaming scheme; we are trying to come to a consensus as a scientific community on a set of tools & patterns that support our work. There are a lot of other efforts in this direction, and this is GREAT! In fact, here's a conference on this topic: http://www.researchobject.org/ . In my view we are in a useful 'thousand flowers blooming' phase, but if we find we can build psych-ds into other projects eventually, that's great.

(4) Immediate tasks for programmers and non programmers.

There's a LOT we can do in this space; we can do a bunch of it in parallel, but this project is sorely in need of a roadmap so we can solve some chicken-and-egg problems. I could use feedback on this rough plan, but will also generate some specific issues for discussion once not on vacation :D

- The specification needs updating! I've archived version 0.1.0 so it's safe, renamed the existing file v 0.2.0, and put it into 'suggest text' mode. We should suggest all the changes needed to make it match the discussions at SIPS2019, so we can discuss them with the whole listserv and nail down those decisions. While some programming on the validator can proceed without knowing the exact rules, a lot of it is blocked, so this comes first!

- We need to prioritize our documentation needs, so that people who find this project can actually interact with it. (The GitHub README is first; next is 'how you can contribute'. Beyond that, we have some other options that our hackathon-ers already started writing; let's crystallize these into some concrete docs!)

- A big change: We tentatively decided to kill the 'processed-data' top level folder, in favor of a single data folder. Psych-DS can't "know" which is the first version of a dataset, and it became clear that the notion of source/raw/primary data (i.e. a jpg, formerly source_data) vs. the first time a validate-able TSV exists (formerly raw_data) is really hard to nail down. *However*, people who write code that writes out psych-DS data should probably have a special place to put their output, one that also gets validated (so it shouldn't go in raw). Let's migrate this discussion to a GitHub issue.

- We made a basic plan for writing the validator!! There are two main needs: A browser location people can visit and point at a folder on their hard drive, which will tell the user whether the folder passes validation, WITHOUT UPLOADING DATA TO A SERVER. Unfortunately, this means that a (non local) shiny app is out, (though a shinyapp that runs on the second thing would probably be useful, see next). Felix Henninger started a prototype here.

- The second is a software package in some language that the community knows well. This is definitely R, for this community! A python validator is likely to also come into play, but R is the clear winner for what our users are using. Eventually, this should be a package that has functions like psych-ds.validate(filepath) that returns useful error messages and/or success report.

- As much as possible, these two implementations should rely on the same shared code. Where requirements are regular expressions, the regex can be shared! We started a repository, here. However, lots of the requirements can't be expressed that way (for instance, 'the dataset_description.json file is a legal schema.org Dataset'). Ideally, this would get wrapped up in a shared node.js package (I am probably mis-describing this) that could be packaged up and shared across languages, but we don't have anyone willing & able to 'own' such a thing right now.

- Instead, we'll need to carefully define features/requirements so we can try to get the R package and browser implementation to align, which will involve boiling the tech spec down to a list of specific points, and determining which go in the regex and which don't. This is the part we can't do until we finish the spec updates!
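To make the split between regex rules and everything else concrete, here is a minimal sketch in Python (used purely for illustration; the plan above is R plus a browser version). Everything in it is an assumption: the filename pattern is made up rather than taken from the spec, the names `REGEX_RULES`, `check_description` and `validate` are hypothetical, and checking for `"@type": "Dataset"` is far weaker than verifying a legal schema.org Dataset. The point is only the shape: regex rules live in a shareable table, non-regex checks are ordinary functions, and the validator returns a list of messages.

```python
import json
import re
from pathlib import Path

# Hypothetical rule table, NOT taken from the Psych-DS spec. Each entry pairs
# a message with a regex that every file path under data/ must match. A table
# like this is the part that could be shared across implementations.
REGEX_RULES = [
    ("files under data/ must be lowercase .tsv files (illustrative rule)",
     re.compile(r"^data/([a-z0-9_-]+/)*[a-z0-9_-]+\.tsv$")),
]


def check_description(root: Path) -> list[str]:
    """Non-regex check: the description file exists, parses, and declares a Dataset."""
    path = root / "dataset_description.json"
    if not path.is_file():
        return ["dataset_description.json is missing"]
    try:
        meta = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        return [f"dataset_description.json is not valid JSON: {err}"]
    # Simplification: only the declared type is checked, not the full schema.
    if not isinstance(meta, dict) or meta.get("@type") != "Dataset":
        return ['dataset_description.json must be an object with "@type": "Dataset"']
    return []


def validate(root) -> list[str]:
    """Return a list of error messages; an empty list means the folder passed."""
    root = Path(root)
    errors = check_description(root)
    data_dir = root / "data"
    if not data_dir.is_dir():
        errors.append("top-level data/ folder is missing")
        return errors
    # Apply every regex rule to each file path, relative to the project root.
    for file in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rel = file.relative_to(root).as_posix()
        for message, pattern in REGEX_RULES:
            if not pattern.match(rel):
                errors.append(f"{rel}: {message}")
    return errors
```

Calling `validate("path/to/project")` and printing each returned message gives the 'useful error messages and/or success report' behaviour in its simplest form. Nothing is sent anywhere, since it only reads the local folder.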

(5) Other notes

Here are some other upcoming activities we should get to. If you can help out with any of these, please drop a note here!

- Can anyone go to the 'research objects' conference described above? I probably cannot, but it would be super useful to have a connection with this group and find out how psych-ds fits into this ecosystem.

- Now that we'll be developing code in earnest, we need to establish some guidelines for how we contribute code to each repository. I favor pull requests with a smaller group of people identified as mergers for each repo, but am open to suggestions or pointers to existing patterns for doing this well. (AS ALWAYS: If you are on this list and you don't know what a pull request is, don't stress! If we adopt this, there will be (1) good instructions on how to do it and (2) lots of roads for contribution that don't require using it)

- Our website is functional thanks to Kirstie W. magicking it into existence before MozFest, but could use an expansion and more content. Make yourself known if interested in this kind of thing!

- We need to make all manner of GitHub issues relating to the SIPS work, both from the psych-ds sessions and mentions in other projects. We need to come up with a pattern to help keep these issues coherent (tag them with SIPS2019?)

- Coding is cool, but onboarding materials are the most important. We should do some high-level plotting about how people (researchers, labs, people writing code for psych research) will arrive at psych-ds, and what they need to know to get started on their journeys.

Welcome to the end of the email! Your reward is a cockatiel wearing a straw hat.

Ian Hussey

Jul 15, 2019, 7:58:49 AM

to Psych-Data-Standards

>- We made a basic plan for writing the validator!! There are two main needs: A browser location people can visit and point at a folder on their hard drive, which will tell the user whether the folder passes validation, WITHOUT UPLOADING DATA TO A SERVER. Unfortunately, this means that a (non local) shiny app is out, (though a shinyapp that runs on the second thing would probably be useful, see next). Felix Henninger started a prototype here.

>- The second is a software package in some language that the community knows well. This is definitely R, for this community! A python validator is likely to also come into play, but R is the clear winner for what our users are using. Eventually, this should be a package that has functions like psych-ds.validate(filepath) that returns useful error messages and/or success report.

Just for my own understanding, I took a few minutes to check whether a hosted Shiny app really can't map client-side files without uploading them. I can't see a way to do it, although I can't say for certain that it's technically impossible.

There are some approximations of this however. I took a few minutes to write a Shiny app that prints a list of the files in a directory (without uploading those files anywhere). The Shiny app itself has to run locally, but it can be downloaded and run using a single line of code.

Code to run in RStudio to download and run proof of concept app:

shiny::runUrl('https://github.com/ianhussey/validator-R/archive/master.zip')

Of course, for many, even this one line of code might be a barrier to use. Perhaps it still has some value within the development/use of a validator R package?
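For comparison, the 'list the files locally, upload nothing' step on its own is small in most languages. A minimal Python sketch, assuming nothing about how validator-R is written (`list_files` is a hypothetical helper name):

```python
from pathlib import Path


def list_files(folder) -> list[str]:
    """Return the sorted paths of all files under `folder`, relative to it."""
    root = Path(folder)
    # rglob("*") walks the folder recursively on the local disk only.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

A list like this is the natural input for whatever filename checks a validator package ends up running.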

Best,

Ian
