Can I fetch data from Google BigQuery?

98 views
Skip to first unread message

Daniel Lidio Pereira do Sacramento

unread,
Mar 17, 2021, 4:29:53 PM3/17/21
to OpenRefine
Hi folks!

I'm new to OpenRefine, I was using Trifacta Dataprep for cleaning and transforming data before but now I want to try new ways to work with data.

In my company we use BigQuery to store all of our clients data. I tried to find a way to connect to GBQ but I was unnable to find an answer.

Is not possible to connect OpenRefine to BigQuery?

Thanks in advance!

Thad Guidry

unread,
Mar 17, 2021, 5:33:01 PM3/17/21
to openr...@googlegroups.com
Hi Daniel,

Yes, I've worked with several clients before that simply download their tables as data from BigQuery into multiple files and then load each one individually into OpenRefine as necessary (giving OpenRefine sufficient memory on the Java heap to work with them).

We support loading files that are GZIP btw, so you can choose the compress option when exporting the data from BQ.

We don't support the BQ API directly.
And because you cannot save results from BQ to a local file via the API or the bq command-line tool, you will need to use the Cloud Console to download and export.
https://cloud.google.com/bigquery/docs/writing-results#downloading-saving-results-console

So you will have to perform these steps:
1. Query via interactive or batch
2. Create temp/permanent tables from those queries
3. Export the tables as necessary

Good luck!
And welcome to OpenRefine!


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/bb8f7fa7-f71d-4e93-87bc-a14bef0ed3f5n%40googlegroups.com.

Tom Morris

unread,
Mar 18, 2021, 7:09:06 PM3/18/21
to openr...@googlegroups.com
A BigQuery backend for OpenRefine is something I thought about many
years ago, which BigQuery first came out, and I still think it might
be an interesting combination.

You can see some of the previous discussions here:
https://groups.google.com/g/openrefine/search?q=bigquery
One of the former developers was concerned about latency:
https://groups.google.com/g/openrefine/c/kTB5lFhHZ6E/m/qHCjwC5e-RQJ
but obviously the 100ms vs few seconds advantage that the current
implementation has over BigQuery for small datasets quickly shifts the
other way to an advantage for BigQuery at a few seconds vs 10s of
seconds (or worse) for OpenRefine with big data sets.

What type of connection to BigQuery are you looking for? Native use of
the API to transform in place or the ability to easily extract the
results that match a query or something else entirely?

Tom

Antonin Delpeuch (lists)

unread,
Mar 19, 2021, 4:00:57 AM3/19/21
to openr...@googlegroups.com
Interesting question!

The latency issue is also something I struggled with for the Spark-based
executor in OpenRefine. Arguably it's a bit easier for Spark since it
can be running on the same machine, but still, there is a lot of
overhead and you feel it when using the tool interactively. For now the
solution I have is to let users opt-in to the Spark executor and use a
local, low-latency one by default.

With the new architecture one could try to write a BigQuery executor for
OpenRefine, that could also be selected manually by the user when that
makes sense. I don't think we would go very far with that because we
would need to translate Java closures to BigQuery's API, which seems
nearly impossible.

I have zero experience with BigQuery, but after a brief look it seems
that they offer a SQL interface. If that is the only interface they
offer, I have the same concern as with Spark SQL: it seems difficult to
translate many OpenRefine operations to SQL in an efficient way. For
instance, what would the Fill down operation look like?

For SQL-based backends it feels like we would need to make significant
changes to the user experience, for instance replacing the records mode
by something pretty different. As you know I am all in for replacing the
records mode, but it looks a bit daunting to me. I would say that it
should be done with a lot of care for the end-user experience and not be
dictated by whatever execution backend we want to bring in.

Best,
Antonin

Antonin Delpeuch (lists)

unread,
Mar 19, 2021, 4:12:30 AM3/19/21
to openr...@googlegroups.com
Another project, of course, is simply to add support for BigQuery in our
existing "database" extension (which implements all the SQL
integration). That means being able to run a SQL query on that backend
and download the results in OpenRefine to work on them locally.

It's probably doable, but I am not sure how useful that is compared to
downloading the query results manually and loading them in OpenRefine.

Antonin
Reply all
Reply to author
Forward
0 new messages