Cyberling post

Richard Littauer

Sep 28, 2011, 8:45:36 AM

to Reproducible Linguistics Project

I've drafted a cyberling blog post about this and my previous work. What do you think? Sound good? Hopefully, we can get more help from the readership there.

-R

===

As part of the National Science Foundation initiative DataONE, the Data Observation Network for Earth, the use and characteristics of the social network and repository site myExperiment.org were analysed. myExperiment is a repository for scientific workflows, the vast majority of which were built in workflow workbenches such as Taverna or Kepler. Workflows are mostly used in bioinformatics. In the words of Goble and de Roure: "Workflows [6] provide (1) a systematic and automated means of conducting analyses across diverse datasets and applications; (2) a way of capturing this process so that results can be reproduced and the method can be reviewed, validated, repeated, and adapted; (3) a visual scripting interface so that computational scientists can create these pipelines without low-level programming concern; and (4) an integration and access platform for the growing pool of independent resource providers so that computational scientists need not specialize in each one. The workflow is thus becoming a paradigm for enabling science on a large scale by managing data preparation and analysis pipelines, as well as the preferred vehicle for computational knowledge extraction." (Goble and de Roure, from The Fourth Paradigm - Data Intensive Scientific Discovery.)

This study is still ongoing, although some of its results can be read at the Open Notebook kept by the student intern (me). More information will be uploaded as the research continues. Some of the initial findings are already interesting:

Workflows are most often downloaded by members of the site, showing that a community can grow up around cyberinfrastructure repositories.
Complex workflows are more commonly downloaded, which suggests that the more a workflow does, the more often it is reused.
Workflow documentation and citation can lead to greater workflow use.

In Linguistics (and similar social sciences), there are no standard 'workflow workbenches' that non-programmers can use to develop, run, and share their workflows. However, as Linguistics becomes an increasingly data-intensive science, computational linguists are using computational pipelines to facilitate their main work. On some occasions, this code can be uploaded as a supplement: the Journal of Experimental Linguistics is a good example of a journal that strives to provide the supplementary material needed for reuse and reproducibility.

However, supplementary code applies only to single journal articles. While open access and open source projects are common (to an extent) in Linguistics and the Social Sciences, there is not as yet a single repository for code of any sort: workflows, pipelines, project-based code, code used in a publication, or code useful in non-publishable or published research. As such, the purpose of this post is to call for participation in setting up such a repository; in setting up an open access journal that can cover ground in reproducible, data-intensive research that the JEL does not cover; in developing a workflow workbench architecture for interoperability of Linguistics data and research; and in promoting the use of pipelines in research. This work is currently in its very early stages, and any help would be appreciated. One way to get involved is to join the dedicated listserv.
