Interesting question!
The latency issue is also something I struggled with for the Spark-based
executor in OpenRefine. Arguably it's a bit easier for Spark since it
can be running on the same machine, but still, there is a lot of
overhead and you feel it when using the tool interactively. For now the
solution I have is to let users opt in to the Spark executor and use a
local, low-latency one by default.
With the new architecture one could try to write a BigQuery executor for
OpenRefine, which could also be selected manually by the user when that
makes sense. I don't think we would get very far with that, because we
would need to translate Java closures to BigQuery's API, which seems
nearly impossible.
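To illustrate why, here is a hedged sketch (using Python and SQLite in place of Java and BigQuery, and not actual OpenRefine code): a local executor simply runs an arbitrary closure on every cell, whereas a SQL backend only accepts SQL text, so each closure would have to be recognised and rewritten, which is only feasible for the few closures that happen to map onto built-in SQL functions.

```python
import sqlite3

# An arbitrary host-language closure: trivial for a local executor to
# apply row by row, but completely opaque to a SQL engine.
transform = lambda v: v.strip().upper()
cells = [" apple ", "pear  "]
local = [transform(v) for v in cells]

# Translating it by hand only works when it maps onto built-in SQL
# functions; this one happens to become UPPER(TRIM(...)), but a closure
# calling, say, a reconciliation service or a regex library does not.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (val TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [(v,) for v in cells])
translated = [r[0] for r in con.execute("SELECT UPPER(TRIM(val)) FROM t")]

assert local == translated == ["APPLE", "PEAR"]
```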
I have zero experience with BigQuery, but after a brief look it seems
that they offer a SQL interface. If that is the only interface they
offer, I have the same concern as with Spark SQL: it seems difficult to
translate many OpenRefine operations to SQL in an efficient way. For
instance, what would the Fill down operation look like?
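To make that concern concrete, here is a hedged sketch (Python with SQLite standing in for any SQL backend; nothing here is BigQuery-specific or actual OpenRefine code) contrasting Fill down as a single local pass with one way to express it in SQL, a correlated subquery that rescans preceding rows for the last non-NULL value:

```python
import sqlite3

# Miniature "Fill down": blank cells inherit the nearest value above them.
rows = [(1, "fruit"), (2, None), (3, None), (4, "veg"), (5, None)]

# A local, row-by-row executor does this in one trivial pass with state:
def fill_down(values):
    last, out = None, []
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

# SQL has no "previous row" state; a correlated subquery scanning
# backwards for the last non-NULL value works, but rescans earlier rows
# for every output row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
sql_result = [r[0] for r in con.execute("""
    SELECT (SELECT t2.val FROM t AS t2
            WHERE t2.id <= t.id AND t2.val IS NOT NULL
            ORDER BY t2.id DESC LIMIT 1)
    FROM t ORDER BY t.id
""")]

assert fill_down([v for _, v in rows]) == sql_result \
    == ["fruit", "fruit", "fruit", "veg", "veg"]
```

Engines with `LAST_VALUE(... IGNORE NULLS)` window functions can avoid the quadratic rescan where that syntax is supported, but either way every such operation would need its own hand-crafted SQL rewrite.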
For SQL-based backends it feels like we would need to make significant
changes to the user experience, for instance replacing the records mode
by something quite different. As you know I am all for replacing the
records mode, but it looks a bit daunting to me. I would say that it
should be done with a lot of care for the end-user experience and not be
dictated by whatever execution backend we want to bring in.
Best,
Antonin