Using GitHub Actions for data processing


Felix Lohmeier

Jul 9, 2021, 7:31:14 AM
to OpenRefine
I've been playing around with automating OpenRefine via GitHub Actions. Has anyone here had any experience using it for data processing? It seems useful to me for periodically transforming and enriching smaller public datasets. Here's a little experiment, inspired by the "git scraping" approach:
https://github.com/opencultureconsulting/openrefine-github-action/blob/main/.github/workflows/openrefine.yml
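
The basic shape is roughly this (a simplified sketch, not the exact file from the repo; the transform step below is just a placeholder for the actual OpenRefine batch call):

name: openrefine
on:
  workflow_dispatch:        # manual start
  schedule:
    - cron: '0 5 * * 1'     # or on a schedule, e.g. weekly
jobs:
  refine:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Transform the data with OpenRefine
        # placeholder: the real step runs OpenRefine in batch mode
        # (e.g. via openrefine-client) against the files in the repository
        run: ./transform.sh   # hypothetical helper script
      - name: Commit the results back to the repository ("git scraping")
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add -A
          git commit -m "update data" || echo "nothing to commit"
          git push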

Thad Guidry

Jul 9, 2021, 8:37:25 AM
to openr...@googlegroups.com
Hi Felix,

It's sometimes more work to use IaaS platforms that have resource constraints and lack integrated tooling and software, depending on a user's expectations for transformation, enrichment, and analysis.
I think it would depend on a user's needs and their skill set, especially with Linux and shell. Some may be more versed in SQL, Python, or Java.
If they are indeed transforming and enriching data, and large amounts of it, then perhaps a broader platform with integrated machine learning and analytics like Google Dataflow with FlexRS and Vertex AI with Notebooks might fit them better and give them much more control and a unified platform that is already tied together. Unlike GitHub Actions, where you would have to split the work across multiple workflows to stay within the resource constraints offered for free:

GitHub-hosted Runners:
Hardware specification for Windows and Linux virtual machines:

  • 2-core CPU
  • 7 GB of RAM memory
  • 14 GB of SSD disk space
and then you have to deal with ephemeral storage and caching between the runners and the /github/workflow directory of the Docker container filesystem.
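
For example, anything you want to keep between job runs has to be wired through the cache action (or committed back to the repo), roughly like this (the data/ directory and the cache key here are just hypothetical):

  - uses: actions/cache@v2
    with:
      path: data/                                    # hypothetical directory to persist between runs
      key: refine-data-${{ hashFiles('config/**') }}
      restore-keys: refine-data-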

At the end of the day, GitHub Actions is just IaaS and has limits as with any service.
I think it's great that you're continuing to learn and use your deep skill set with Linux and shell, and that you're able to make IaaS platforms like GitHub Actions, which offer a limited amount of resources for free, bend to your will.
Nicely Done!





Duketiana Travels & Tours

Jul 9, 2021, 8:57:00 AM
to openr...@googlegroups.com
Hmmm, I just went through a training like that.


Felix Lohmeier

Jul 11, 2021, 6:48:37 AM
to OpenRefine
Many thanks for the valuable advice, Thad, always much appreciated!

If GitHub Actions remains free for public repositories, then that seems like an accessible form of automation because many already use GitHub. The GitHub Actions workflow file is just a small experiment. One could imagine an OpenRefine extension "Automate this project" that creates a GitHub repository after authentication and sets up a GitHub Actions workflow based on the import metadata and operation history stored in the project. It would only need a little additional information (e.g. about the desired export format), which could be prompted for in a dialog. For public datasets that don't exceed the limits (14 GB disk space, 100 MB file size), this could be very useful.
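
Under the hood, such a generated workflow would mostly boil down to a few openrefine-client calls, something like this sketch (it assumes an OpenRefine server is already running, e.g. from the Docker image, and that input.csv and history.json were exported from the project; the file names are illustrative):

  - name: Apply the stored operation history
    run: |
      openrefine-client --create input.csv            # project name defaults to the file name, "input"
      openrefine-client --apply history.json input
      openrefine-client --export --output=result.tsv input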

I can already imagine this for some of my clients in libraries. Once set up, they can upload new source data themselves via GitHub or even replace the operations history file. Each execution produces an artifact with the OpenRefine project that can be used for debugging or expanding the operations history. Workflows with GitHub Actions can be started manually, based on events (push, pull request) or on a regular basis.
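
In the workflow file that is just the on: block, for example:

on:
  workflow_dispatch:        # start manually from the GitHub UI
  push:
    paths:
      - 'input/**'          # e.g. when new source data is uploaded
  schedule:
    - cron: '0 5 * * *'     # or on a regular basis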

Of course, this only works well for certain data sets. I find the "Flat Data" approach of the GitHub Developer Experience team very attractive: https://octo.github.com/projects/flat-data

I'll experiment further and report back when I can show an example for a real-world use case.

Ernesto Ortiz

Jul 12, 2021, 9:54:06 AM
to OpenRefine
Hello,

A few weeks ago I configured a data pipeline with GitHub Actions + OpenRefine + Google Drive for a client. The client had been doing the process manually: getting the files from GDrive, applying OpenRefine recipes, and then uploading the files back to GDrive. With GitHub Actions, I was able to automate that whole process with the library https://github.com/opencultureconsulting/openrefine-client. I added some additional logic to create reports, validate that the OpenRefine recipes were applied correctly, and avoid issues with corrupted projects, and the client is very happy. Basically, they went from processing around 100 files weekly to 100 in less than 1 minute.
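
Stripped down, the job looks roughly like this (the Drive sync tool, the remote name and the file names are placeholders, the reporting/validation steps are left out, and it assumes an OpenRefine server is running for openrefine-client to talk to):

  - name: Download the source files from Google Drive
    run: rclone copy gdrive:incoming input/           # e.g. rclone with a service account; "gdrive" is a placeholder remote
  - name: Apply the OpenRefine recipe
    run: |
      mkdir -p output
      openrefine-client --create input/data.csv
      openrefine-client --apply recipe.json data
      openrefine-client --export --output=output/data.tsv data
  - name: Upload the results back to Google Drive
    run: rclone copy output/ gdrive:processed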

I will say that the major limitations are:
Regards,
Ernesto