Reproducibility wishlist - how should JSON workflows be improved?

Antonin Delpeuch (lists)

Jul 4, 2020, 4:31:30 PM
to openref...@googlegroups.com
Hi,

Over the past few days we have had the occasion to discuss at various
places the main reproducibility feature of OpenRefine: extracting
operation histories as JSON blobs (I call these "workflows") and
reapplying them afterwards.

This feature is crucial (to me at least) but it is also very brittle and
would deserve to be improved in so many ways, or even redesigned from
the ground up. Can we use this thread to flesh out what this should
ideally look like? Perhaps even without reference to the existing system
if you feel like it is so bad that it is not even worth mentioning!

Here are some basic things off the top of my head (stealing ideas from others):

- Workflows should be versioned, because the behavior of the operations
and expressions changes as we improve them. This is something Owen has
been advocating for a long time and it would really remove a lot of
pain in the decisions around breaking changes. If a user applies a
workflow created in OpenRefine 3.2 in a 3.5 instance, we want to be able
to warn them / forbid the operation / translate the discrepancies
automatically if possible.

- Exposing JSON to the user in this way is rude: it is not appropriate
as a workflow language (it is hard to edit). What other syntaxes can we
think about? A fluent-style API perhaps? YAML like the Common Workflow
Language (https://www.commonwl.org/)?

- Could we build on existing standards for workflows to avoid
reinventing the wheel? Our workflows are specific in that they apply to
tabular data, with a specific data model (reconciled cells, for
instance), but there could be scope for interoperability with other
languages.

- It should be possible to capture the importing stage in a workflow,
not just the operations that follow it. Similarly it could make sense
to capture the exporting stage as well, such that a workflow could go
from an input file to an output file. Representing isolated series of
operations is also a valid need, so that should be catered for as well.

- We need much better error handling. A workflow expects a certain
structure from a project. If it is not present, at the moment the
operations will be applied in sequence, the ones which fail will be
skipped (and the following ones will still be applied!). This might mean
letting workflows declare their initial requirements (the columns they
depend on, for instance).

- We should be able to rename the column dependencies of a workflow.

- We should be able to apply a workflow to a subset of rows using facets.

- We need better tools to visualize and edit workflows - some of this is
on my roadmap this year (https://hackmd.io/OWW43gzUSs62_wMKrKz2NA?view)

- Users should be able to define new operations using workflows, pretty
much like the "common transforms" menu offers various "operations" which
in fact all use the same underlying transform operation with different
GREL expressions. As a user I would like to define other operations in
the same way and make them easily accessible from the UI: for instance,
say I often need to split a column of names into (given name, family
name) via a complicated set of rules: I want to be able to create one
workflow for that and add it as a common transform in the menu.

- I want to be able to execute workflows outside OpenRefine in scalable
environments (such as Spark): that will be possible soon.
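
To make the points above about declared requirements, renaming column
dependencies and error handling concrete, here is a minimal Python sketch.
Everything in it is an assumption for illustration: the `Workflow` wrapper,
its `required_columns` field and the `apply_operation` callback do not exist
in OpenRefine. Only the `columnName` key is borrowed from the JSON that
OpenRefine exports today.

```python
from dataclasses import dataclass, field


class WorkflowError(Exception):
    """Raised when a workflow cannot be applied safely."""


@dataclass
class Workflow:
    # Hypothetical wrapper: the current JSON export is a bare list of operations.
    required_columns: list                        # columns the operations depend on
    operations: list = field(default_factory=list)


def rename_dependency(workflow, old, new):
    """Return a copy of the workflow that expects column `new` instead of `old`."""
    def swap(value):
        return new if value == old else value

    # Simplification: only the "columnName" key is rewritten here; real
    # operations may refer to columns under other keys or inside expressions.
    ops = [{key: swap(value) if key == "columnName" else value
            for key, value in op.items()} for op in workflow.operations]
    return Workflow([swap(c) for c in workflow.required_columns], ops)


def apply_workflow(workflow, columns, apply_operation):
    """Check requirements up front, then stop at the first failing operation."""
    missing = [c for c in workflow.required_columns if c not in columns]
    if missing:
        raise WorkflowError(f"missing columns: {missing}")
    for index, op in enumerate(workflow.operations):
        try:
            apply_operation(op)
        except Exception as exc:
            # Do not skip and carry on: later steps may depend on this one.
            raise WorkflowError(f"operation {index} failed: {exc}") from exc
```

The design choice illustrated is to fail before touching the project when a
required column is absent, and to abort rather than continue when a step
fails midway.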

I initially thought about adding issues for all of these but that is
only useful if we go for incremental improvements on the existing
system. It is totally worth stepping back from isolated issues to design
something radically different.

Looking forward to reading your ideas!

Antonin


Thad Guidry

Jul 4, 2020, 7:23:18 PM
to openref...@googlegroups.com
Thank you Antonin for writing this.

Have you downloaded and tried out Apache NiFi yet?  I really think you should, just to get perspective.

Apache NiFi supports Templates
and versioning of Dataflows
and has a REST API

I've often imagined if OpenRefine was built on top of it or rather could talk to Apache NiFi and leverage parts of it?
OpenRefine might even expose some of NiFi's Processors as direct replacements for our own functions?

Lastly, a word about EASY. :-)
User input taken from our survey often describes our OpenRefine UI as EASY.
Prior to Gridworks development, I showed David the Pentaho Spoon editor (similar to NiFi in a way) and he was impressed. The data preview of rows it provides on each Pentaho workflow step was even part of the inspiration for the data preview we have in the OpenRefine Expression editor. But David thought that Pentaho was far too complex to build on for what users often needed to do, which was cleaning small datasets quickly. We decided we needed a much simpler tool, custom built with Java so it could run on the 3 OSes, that could work like ETL tools but didn't have to worry about workflows that much, and that could run in the browser so Mac and Linux users could also contribute to Freebase. :)

Btw, Hadoop had only been out for a couple of years, and the thought of allowing OpenRefine to run a bit "differently" was always in his mind.

me: yeah. That's why we're still running COMMUNITY (free) Talend and Pentaho
10:35 PM David: +1
  if i can only figure out how to get refine to work on hadoop
10:36 PM me: right on a graph backbone ?
  that'd be cool
  well, I mean, for Google and you.
  and probably Yahoo too
10:37 PM David: hmm, it'd be very interesting... well, save that for a rainy day
 me: +1

Now, within this thread, we are definitely thinking of changing the simple ideas that inspired OpenRefine and that our users think of as EASY, so we'll need to be careful to keep it feeling EASY for our users when dealing with workflows.
But I'm glad that we are trying to think more broadly, that's something that David and I always did on our late night chats.  Pondering the "what-ifs".


Antonin Delpeuch (lists)

Jul 5, 2020, 2:41:18 AM
to openref...@googlegroups.com
I like NiFi, I think it is indeed something worth keeping in mind when
designing UIs to represent and manipulate workflows.

I think OpenRefine's strength lies in the fact that the primary UI lets
users manipulate the data directly, without having to think about what
they are doing in terms of workflows. So I am keen to introduce a
graph-based workflow UI similar to NiFi (and many other tools by the
way), adapted to OpenRefine's data model (as in the roadmap shared
earlier). But it should not be the primary way to interact with the
tool: we want to keep the simplicity of interactive data cleaning that
we have now.

By keeping the interactive data manipulation from the grid view as the
default UI, we would also avoid some drawbacks with visual programming,
where:
- the user needs to come up with a planar layout for the nodes and
edges, even if this layout is actually irrelevant for execution - this
can be useful to organize a workflow logically, but can also be a bit
distracting;
- manipulating boxes and arrows is a bit complicated: it is hard to
build UIs that are completely intuitive for this, using mouse and keyboard.
But we are also making it harder for ourselves in other ways. Since the
graph should be generated from a series of operations automatically,
that means having an algorithm to produce the graph layout directly
(thankfully this is not very hard for our data model, we can use the
sequence of operations to guide the process).
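
As a rough illustration of deriving a layout from the sequence of
operations, the Python sketch below assigns each operation to a layer (a
column in a left-to-right drawing). The `reads` and `writes` keys are
hypothetical annotations listing the columns an operation uses and
produces; OpenRefine's JSON does not carry them in this form. For brevity
the sketch only tracks "reads what an earlier step wrote" dependencies.

```python
def layers(operations):
    """Return one layer number per operation, following the original order.

    An operation with no dependencies sits in layer 0; otherwise it sits one
    layer after the deepest operation whose output it reads.
    """
    last_writer = {}   # column name -> index of the operation that last wrote it
    depth = []
    for index, op in enumerate(operations):
        deps = [last_writer[c] for c in op["reads"] if c in last_writer]
        depth.append(1 + max((depth[d] for d in deps), default=-1))
        for column in op["writes"]:
            last_writer[column] = index
    return depth
```

Operations that end up in the same layer are independent of each other in
this simplified model, so they can be drawn side by side.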

Antonin
> --
> You received this message because you are subscribed to the Google
> Groups "OpenRefine Development" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to openrefine-de...@googlegroups.com
> <mailto:openrefine-de...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/openrefine-dev/CAChbWaNBMuAvubM0w%2B1zW1zYgnZ6zz8N%3D%2BfNCPP5x4ggjvkjBQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/openrefine-dev/CAChbWaNBMuAvubM0w%2B1zW1zYgnZ6zz8N%3D%2BfNCPP5x4ggjvkjBQ%40mail.gmail.com?utm_medium=email&utm_source=footer>.

Thad Guidry

Jul 5, 2020, 6:29:29 AM
to openref...@googlegroups.com
I am happy to see that you and I see things the same way,
which does not deviate from the original Refine vision.

Thanks,

Tom Morris

Jul 5, 2020, 3:28:20 PM
to openref...@googlegroups.com
OpenRefine doesn't have "workflows" and thinking of the operation undo history in those terms is bound to lead you (and others) astray. I know I've said that before, but I'm going to keep repeating it until it is understood.

As for reproducibility, I see several different aspects:

1. Documentation of the transformation process so that it can be reproduced later. The English version of the operation history can form an automated starting point for writing this down so it could be published in a journal.
2. Replication of the exact same transformation process on the exact same data using the exact same tooling to demonstrate that you get the exact same results. The current JSON could be published to FigShare for this purpose.
3. Reproduction of the transform on a "similar" (tbd) data set to demonstrate equivalent results.

Of course, often what people want to do is (4) extend / modify the transform for either a different data set or to use the transform as the starting point for additional processing.

OpenRefine operation histories can be used for #1 & #2 and occasionally #3, if the data is similar enough, the phase of the moon is correct, and you don't really care about error handling or undetected data corruption. They are completely ill-suited to #4 for the many reasons you listed (amongst others).

Modularity and parameterization are two key things which would be needed in a workflow language that are lacking in the current setup. They'd address many of the feature requests on your list.
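
A minimal sketch of what parameterization could mean in practice, in Python. The template format (`parameters`, `{column}` placeholders) and the `instantiate` helper are assumptions made up for this example; only the `op`, `columnName` and `expression` keys follow the JSON that OpenRefine exports today.

```python
# Hypothetical reusable module: an operation list whose column is a parameter.
TRIM_TEMPLATE = {
    "parameters": ["column"],
    "operations": [
        {"op": "core/text-transform",
         "columnName": "{column}",
         "expression": "value.trim()"},
    ],
}


def instantiate(template, **arguments):
    """Return a concrete list of operations with every placeholder filled in."""
    missing = set(template["parameters"]) - set(arguments)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")

    def fill(value):
        # Naive textual substitution, applied recursively; the template
        # itself is left untouched.
        if isinstance(value, str):
            for name, argument in arguments.items():
                value = value.replace("{" + name + "}", argument)
            return value
        if isinstance(value, dict):
            return {key: fill(item) for key, item in value.items()}
        if isinstance(value, list):
            return [fill(item) for item in value]
        return value

    return fill(template["operations"])
```

For example, `instantiate(TRIM_TEMPLATE, column="family name")` yields an operation list that targets the "family name" column, and the same template can be reused for any other column.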

Compatibility and versioning are important, but the example given is backwards:

> If a user applies a workflow created in OpenRefine 3.2 in a 3.5 instance, we want to be able to warn them / forbid the operation / translate the discrepancies automatically if possible.

A v4.2 workflow should run unchanged on a v4.5 instance, full stop. Versioning is to protect against the opposite case of attempting to run a v4.5 workflow on a v4.3 execution engine. Incompatibilities need to be considered serious bugs to protect the investment in the ecosystem of workflows created.
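
The rule above can be written down in a few lines of Python. This is only a sketch under the assumption that workflows carry a dotted numeric version string; `can_run` is a hypothetical helper, not an OpenRefine function.

```python
def parse_version(text):
    """Turn "4.10" into (4, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in text.split("."))


def can_run(workflow_version, engine_version):
    """An engine runs workflows written for its own version or any older one."""
    return parse_version(workflow_version) <= parse_version(engine_version)
```

With this check a v4.2 workflow is accepted by a v4.5 engine, while a v4.5 workflow is rejected by a v4.3 engine.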

Language syntax is important for a language designed to be edited by users. JSON sucks for this. We could trivially switch to or add YAML since it is a superset of JSON and JSON can be read as valid YAML, but that wouldn't make it suck much less. CWL is based on YAML, but was optimized for consuming workflow engines rather than workflow authors, so it's got a huge learning curve. (One of the primary language designers worked on my last engineering team, so I'm very familiar with CWL.) Having said that, given that there are approximately a bazillion existing workflow languages, and more being created every day, it'd be hard to justify investing in inventing yet another one.

Error handling is just a basic hygiene thing, so not really something that needs to be discussed. The fact that OpenRefine's Apply Undo History doesn't have any is one key way that you can tell it was never intended to even pretend to be a workflow solution.

Yes, import / export need to be included in the set of operations. I thought there was an issue for at least the import piece, but I can't find it. A workflow solution would also need to be able to work with multiple files (which are currently considered separate OpenRefine "projects"). An example was discussed here, but there are many other use cases scattered across the 2011-2012 issues. A good starting point for requirements collection might be surveying the relevant issues, enumerating the use cases, then ranking them.

I don't understand this piece:

> - We need better tools to visualize and edit workflows - some of this is on my roadmap this year (https://hackmd.io/OWW43gzUSs62_wMKrKz2NA?view)

Why would you build on an unsound foundation after you just enumerated all the ways that it's broken? Certainly, though, a reasonable set of editing and visualization tools should be part of the decision making criteria in selecting a workflow language.

One key thing to think about when discussing workflow languages/engines is how the choices interact with the scalability goals and backend implementation. Some combinations will be natural fits, while other combinations might not work at all.

Tom

Tom Morris

Jul 5, 2020, 3:37:06 PM
to openref...@googlegroups.com
On Sun, Jul 5, 2020 at 3:28 PM Tom Morris <tfmo...@gmail.com> wrote:
> OpenRefine doesn't have "workflows" and thinking of the operation undo history in those terms is bound to lead you (and others) astray.

Antonin asked:

> I would be curious to know what it was designed for, actually! I cannot think of another use for this dialog :)

The way that I think of it is as an unfinished Proof of Concept which can be used as a tool to help collect user requirements. This is obscured by the fact that the hiatus between when the original developers introduced the POC and when the new developers are considering implementing an actual workflow solution spans the better part of a decade.

If you've got a JSON operation history to support your Undo/Redo, it's trivial to toss it in a text dialog along with an "Apply" button, but that doesn't make it a workflow engine.

Tom

Tom Morris

Jul 5, 2020, 3:46:38 PM
to openref...@googlegroups.com
On Sat, Jul 4, 2020 at 7:23 PM Thad Guidry <thadg...@gmail.com> wrote:

> Prior to Gridworks development, [...]

I always thought that Metaweb Gridworks (which later became Google Refine) grew out of David's various bulk editing experiments while on MIT's Simile project.

Tom

Antonin Delpeuch (lists)

Jul 5, 2020, 3:49:51 PM
to openref...@googlegroups.com
Thanks Tom!

On 05/07/2020 21:28, Tom Morris wrote:
> OpenRefine doesn't have "workflows" and thinking of the operation undo
> history in those terms is bound to lead you (and others) astray. I know
> I've said that before, but I'm going to keep repeating it until it is
> understood.

I don't mind calling these JSON things something other than "workflow" if
your eyes bleed when I use this term. I avoid saying just "JSON" because
it can refer to all sorts of things (the JSON importer, for instance).

I am glad to see that we are pretty much on the same page for everything
else though!

>
> Why would you build on an unsound foundation after you just enumerated
> all the ways that it's broken? Certainly, though, a reasonable set of
> editing and visualization tools should be part of the decision making
> criteria in selecting a workflow language.

I think it makes sense to introduce visualization tools for the
operation history even without talking about reproducibility at all.
As a user, when I have a project with a big pile of operations in the
history, I would like to understand it better than using the textual
descriptions we have. If the software also allows me to reorder
independent steps in the history, it really helps if I can tell from the
visualization which ones can be reordered, so I can easily predict if I
will be able to undo an operation that is not the last one I have
applied, for instance (just by looking at the connectivity of the
network of dependencies).
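
The "can I undo a step that is not the last one" question can be sketched
in Python as a reachability check over column dependencies. As before, the
`reads` and `writes` keys are hypothetical annotations, not fields of the
current JSON, and the check is deliberately conservative.

```python
def dependents(operations, index):
    """Indices of later operations that directly or indirectly use the
    columns produced by the operation at `index`."""
    tainted = set(operations[index]["writes"])
    found = []
    for later in range(index + 1, len(operations)):
        op = operations[later]
        if tainted & set(op["reads"]):
            found.append(later)
            # Whatever this step writes now also depends on `index`.
            tainted |= set(op["writes"])
    return found


def can_undo(operations, index):
    """A step can be undone in isolation if nothing later builds on it."""
    return not dependents(operations, index)
```

In a visualization, `dependents` corresponds to everything reachable
downstream from a node, which is what a user would read off the graph.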

I see this as a stepping stone towards something actually usable as a
workflow language: once we understand the visualizations that work, we
will be in a better place to make them truly editable, saveable,
reusable, and so on.

Antonin

Tom Morris

Jul 5, 2020, 4:47:49 PM
to openref...@googlegroups.com
On Sun, Jul 5, 2020 at 3:49 PM Antonin Delpeuch (lists) <li...@antonin.delpeuch.eu> wrote:
> On 05/07/2020 21:28, Tom Morris wrote:
>> OpenRefine doesn't have "workflows" and thinking of the operation undo
>> history in those terms is bound to lead you (and others) astray. I know
>> I've said that before, but I'm going to keep repeating it until it is
>> understood.
>
> I don't mind calling these JSON things something else than "workflow" if
> your eyes bleed when I use this term. I avoid saying just "JSON" because
> it can refer to all sorts of things (the JSON importer, for instance).

The dialog is labeled "Extract Operation History" and the Java class is HistoryEntry.
I think either "operation history" or "undo history" works. There's no need, from my
point of view, to mention the syntax. The word "history" is key to understanding the
design center.

Terminology is important because it carries with it context and semantic baggage. 
Undo histories have push/pop semantics and if people are pushing the boundaries, 
they might ask if they can undo an operation in the middle. When discussing workflows,
people ask things like whether it supports branching, scatter/gather, conditionals, etc.
Entirely different beasts.
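
For readers less familiar with the distinction, push/pop semantics look
roughly like the generic Python sketch below. It is not OpenRefine's
HistoryEntry code; the class and method names are made up for illustration.

```python
class UndoHistory:
    """A linear history: entries can only be added or removed at the end."""

    def __init__(self):
        self.done = []     # applied entries, most recent last
        self.undone = []   # entries that can still be redone

    def apply(self, entry):
        self.done.append(entry)
        self.undone.clear()   # a new operation discards the redo branch

    def undo(self):
        entry = self.done.pop()
        self.undone.append(entry)
        return entry

    def redo(self):
        entry = self.undone.pop()
        self.done.append(entry)
        return entry
```

Nothing in this structure knows about branching, conditionals or
dependencies between entries, which is the gap between a history and a
workflow.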

Tom

Martin Magdinier

Jul 8, 2020, 11:51:17 AM
to openref...@googlegroups.com

I see countless users using history as a workflow feature. It makes sense for us to follow our users in that direction and provide them with a proper system.

I agree with everything mentioned previously in terms of design. I will only suggest that we:
* Include #368, the option to add comments (notes) to the operations (like commenting code).
* Take into account project dependencies with the usage of the cell.cross() function 
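
One way such dependencies could be detected is by scanning the expressions in an operation history for calls to cross, as in the Python sketch below. The regular expression is a heuristic written for this example (it assumes the project name is passed as a string literal), and `cross_project_dependencies` is a hypothetical helper.

```python
import re

# Matches cell.cross("Project", ...) as well as cross(cell, "Project", ...)
# and captures the project name given as a quoted string literal.
CROSS_CALL = re.compile(r"""\bcross\(\s*(?:\w+\s*,\s*)?["']([^"']+)["']""")


def cross_project_dependencies(operations):
    """Return the sorted names of other projects referenced through cross()."""
    projects = set()
    for op in operations:
        # Operations without an expression (a column rename, say) are skipped.
        projects.update(CROSS_CALL.findall(op.get("expression", "")))
    return sorted(projects)
```

A history could then declare these project names up front, so that applying it elsewhere fails early when a referenced project is missing.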

Martin



Antonin Delpeuch (lists)

Sep 11, 2020, 7:06:04 AM
to openref...@googlegroups.com
Here is one interesting example of a JSON history being offered as a tool:
https://hangingtogether.org/?p=8222

Quoting them:
> it finds the VIAF or LCCN in each person’s Wikidata page (if available), and uses these to create a column with links to their WorldCat Identities pages. It then pulls in the number of library holdings of works by or about each person and populates this information as a new column

This is probably a good stress-test for reproducibility: how can we make
such use cases more natural?

Antonin

On 08/07/2020 17:51, Martin Magdinier wrote:
>
> I see countless of users using history as a workflow feature. It makes
> sense for us to follow our users in that direction and provide them with
> a proper system.
>
> I agree with everything mentioned previously in terms of design. I will
> only suggest to 
> * Include #368 <https://github.com/OpenRefine/OpenRefine/issues/368> the
> option to add comments (notes) to the operations (like commenting a code).
> * Take into account project dependencies with the usage of the
> cell.cross() function 
