Hi,
Over the past few days we have had occasion to discuss, in various
places, the main reproducibility feature of OpenRefine: extracting
operation histories as JSON blobs (I call these "workflows") and
reapplying them afterwards.
This feature is crucial (to me at least) but it is also very brittle and
would deserve to be improved in so many ways, or even redesigned from
the ground up. Can we use this thread to flesh out what this should
ideally look like? Perhaps even without reference to the existing system
if you feel like it is so bad that it is not even worth mentioning!
Here are some basic things off the top of my head (stealing ideas from
others):
- Workflows should be versioned, because the behavior of operations and
expressions changes as we improve them. This is something Owen has
been preaching for a long time and it would really remove a lot of
pain in the decisions around breaking changes. If a user applies a
workflow created in OpenRefine 3.2 in a 3.5 instance, we want to be able
to warn them / forbid the operation / translate the discrepancies
automatically if possible.
- Exposing JSON to the user in this way is rude: it is not appropriate
as a workflow language because it is hard to edit. What other syntaxes
can we think about? A fluent-style API perhaps? YAML, like the Common
Workflow Language (https://www.commonwl.org/)?
- Could we build on existing standards for workflows to avoid
reinventing the wheel? Our workflows are specific in that they apply to
tabular data, with a specific data model (reconciled cells, for
instance), but there could be scope for interoperability with other
languages.
- It should be possible to capture the importing stage in a workflow,
not just the operations that follow it. Similarly it could make sense
to capture the exporting stage as well, such that a workflow could go
from an input file to an output file. Representing isolated series of
operations is also a valid need, so that should be catered for as well.
- We need much better error handling. A workflow expects a certain
structure from a project. If that structure is not present, at the
moment the operations are still applied in sequence: the ones which fail
are skipped (and the following ones are still applied!). Fixing this
might mean letting workflows declare their initial requirements (the
columns they depend on, for instance).
- We should be able to rename the column dependencies of a workflow.
- We should be able to apply a workflow to a subset of rows using facets.
- We need better tools to visualize and edit workflows - some of this is
on my roadmap this year (https://hackmd.io/OWW43gzUSs62_wMKrKz2NA?view).
- Users should be able to define new operations using workflows, pretty
much like the "common transforms" menu offers various "operations" which
in fact all use the same underlying transform operation with different
GREL expressions. As a user I would like to define other operations in
the same way and make them easily accessible from the UI: for instance,
say I often need to split a column of names into (given name, family
name) via a complicated set of rules: I want to be able to create one
workflow for that and add it as a common transform in the menu.
- I want to be able to execute workflows outside OpenRefine in scalable
environments (such as Spark): that will be possible soon.
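
To make the versioning and error-handling points above a bit more
concrete, here is a minimal Python sketch. It is not OpenRefine code and
does not reflect the current JSON format: every name in it (Workflow,
Operation, created_with, reads, writes, check, run) is made up for
illustration. The idea is that each operation declares the columns it
reads and writes, so the workflow can compute its initial requirements
and refuse to run at all, rather than skipping failed steps and carrying
on. A version mismatch is simply treated as an error here; warning or
translating would be the other options.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Row = Dict[str, str]


class WorkflowError(Exception):
    """Raised when a workflow cannot be applied safely."""


@dataclass
class Operation:
    description: str
    reads: List[str]              # columns this step depends on
    writes: List[str]             # columns this step creates or overwrites
    apply: Callable[[Row], Row]   # pure function from one row to one row


@dataclass
class Workflow:
    created_with: str             # hypothetical field, e.g. "3.2"
    operations: List[Operation] = field(default_factory=list)

    def required_columns(self) -> List[str]:
        """Columns that must exist before the first operation runs."""
        produced: Set[str] = set()
        required: List[str] = []
        for op in self.operations:
            for col in op.reads:
                if col not in produced and col not in required:
                    required.append(col)
            produced.update(op.writes)
        return required


def check(workflow: Workflow, columns: Set[str], running_version: str) -> List[str]:
    """Return a list of human-readable problems (empty if none)."""
    problems = []
    if workflow.created_with != running_version:
        problems.append(
            f"created with {workflow.created_with}, running {running_version}"
        )
    missing = [c for c in workflow.required_columns() if c not in columns]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    return problems


def run(workflow: Workflow, rows: List[Row], running_version: str) -> List[Row]:
    """Apply every operation, or none at all if the checks fail."""
    columns = set(rows[0]) if rows else set()  # assumes uniform rows
    problems = check(workflow, columns, running_version)
    if problems:
        raise WorkflowError("; ".join(problems))
    result = [dict(r) for r in rows]
    for op in workflow.operations:
        result = [op.apply(r) for r in result]
    return result


# Example: the "split a column of names" case, heavily simplified.
split_name = Operation(
    description="split name into given name / family name",
    reads=["name"],
    writes=["given name", "family name"],
    apply=lambda row: {
        **row,
        "given name": row["name"].split(" ", 1)[0],
        "family name": row["name"].split(" ", 1)[-1],
    },
)
upper_family = Operation(
    description="uppercase family name",
    reads=["family name"],
    writes=["family name"],
    apply=lambda row: {**row, "family name": row["family name"].upper()},
)
wf = Workflow(created_with="3.5", operations=[split_name, upper_family])
```

With this, wf.required_columns() only lists "name", since "family name"
is produced by an earlier step, and run() raises before touching any row
if "name" is absent or the versions differ.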
I initially thought about adding issues for all of these but that is
only useful if we go for incremental improvements on the existing
system. It is totally worth stepping back from isolated issues to design
something radically different.
Looking forward to reading your ideas!
Antonin