How to Create a Schema and Validate CSV or Excel Values Against it?

532 views
Skip to first unread message

Tibaut Houzanme

unread,
Oct 5, 2017, 3:05:30 PM10/5/17
to OpenRefine
Greetings,

I am curious if OpenRefine has this functionality: To allow me to create a schema (list of possible values for each column entry) and validate csv or Excel files against those values. 
If this is possible, would you be kind enough to point me towards the documentation o how-to videos, or even perhaps a quick explanation?

Thanks much,
Tibaut

Owen Stephens

unread,
Oct 6, 2017, 4:58:28 AM10/6/17
to OpenRefine
There is no built in functionality to achieve this that I'm aware of. However, there are a couple of ways you could potentially do this:

Use 'cross':
  • set up a project with a column containing your list of possible values you are going to check against (for convenience I'll call this project 'schema' and let it contain a single column 'ValidValues')
  • set up the project you want to validate
  • In the project/column you want validate, choose Facet->Custom Text Facet
  • In the expression box enter GREL:
    cell.cross("schema","ValidValues").length()>0
  • You will get a facet which contains 'true' for matched (valid) values and 'false' for non-matched (non valid) values
  • More information on the 'cross' function is available at https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions
Use 'inArray'
  • 'inArray' is a function added by the GOKb Utilities extension so you need to install that first:
    • Download the extension from https://github.com/ostephens/refine-gokbutils (click on 'clone or download' and download the zip)
    • Unzip the downloaded file - you should get a folder called refine-gokbutils-master or similar
    • Drop this folder into the OpenRefine 'extensions' folder
    • Start/Restart OpenRefine
  • In the project/column you want validate, choose Facet->Custom Text Facet
  • In the expression box enter GREL like: 
    inArray(["list","of","valid","values"],value)
  • This will give 'true' if the value in the cell is in the list (i.e. valid value), or false otherwise
Owen

Antonin Delpeuch (lists)

unread,
Oct 6, 2017, 5:13:50 AM10/6/17
to openr...@googlegroups.com
There has been some interest around data validation in OpenRefine in the
past.

As Martin mentioned recently [1], some forks/extensions of OpenRefine
have implemented their own validation features, such as GoKB [2].

But we don't have anything generic yet. I guess it would be interesting
to brainstorm that. I am currently hacking my own solution for Wikidata
but it could probably be made a bit more reusable if we give it some
thought.

I imagine there could be a generic way for an OverlayModels to emit some
"Issues" about the current state of the data, and these issues would be
reported in some panel of OpenRefine. Implementing this sort of
abstraction in OpenRefine would let extensions report issues in a
consistent way.

This thread also reminds me that OpenRefine has currently no support for
column metadata, which would be required to implement support for Data
Packages (#778 [3] and #1096 [4]). So maybe we should not go down the
OverlayModel route but rather add support for that and implement a more
restricted notion of validation (namely, just checking a cell value
against the declared type of the column).

Cheers,
Antonin

[1] :
https://github.com/OpenRefine/OpenRefine/issues/1213#issuecomment-316354869
[2] : https://www.youtube.com/watch?v=0wAY94hc7Gw
[3] : https://github.com/OpenRefine/OpenRefine/issues/778
[4] : https://github.com/OpenRefine/OpenRefine/issues/1096

On 06/10/2017 09:58, Owen Stephens wrote:
> There is no built in functionality to achieve this that I'm aware of.
> However, there are a couple of ways you could potentially do this:
>
> *Use 'cross':*
>
> * set up a project with a column containing your list of possible
> values you are going to check against (for convenience I'll call
> this project 'schema' and let it contain a single column 'ValidValues')
> * set up the project you want to validate
> * In the project/column you want validate, choose Facet->Custom Text Facet
> * In the expression box enter GREL:
> |
> cell.cross("schema","ValidValues").length()>0
> |
> * You will get a facet which contains 'true' for matched (valid)
> values and 'false' for non-matched (non valid) values
> * More information on the 'cross' function is available
> at https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions
>
> *Use 'inArray'*
>
> * 'inArray' is a function added by the GOKb Utilities extension so you
> need to install that first:
> o Download the extension
> from https://github.com/ostephens/refine-gokbutils (click on
> 'clone or download' and download the zip)
> o Unzip the downloaded file - you should get a folder called
> refine-gokbutils-master or similar
> o Drop this folder into the OpenRefine 'extensions' folder
> o Start/Restart OpenRefine
> * In the project/column you want validate, choose Facet->Custom Text Facet
> * In the expression box enter GREL like: 
> |
> inArray(["list","of","valid","values"],value)
> |
> * This will give 'true' if the value in the cell is in the list (i.e.
> valid value), or false otherwise
>
> Owen
>
> On Thursday, October 5, 2017 at 8:05:30 PM UTC+1, Tibaut Houzanme wrote:
>
> Greetings,
>
> I am curious if OpenRefine has this functionality: To allow me to
> create a schema (list of possible values for each column entry) and
> validate csv or Excel files against those values. 
> If this is possible, would you be kind enough to point me towards
> the documentation o how-to videos, or even perhaps a quick explanation?
>
> Thanks much,
> Tibaut
>
> --
> You received this message because you are subscribed to the Google
> Groups "OpenRefine" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to openrefine+...@googlegroups.com
> <mailto:openrefine+...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Owen Stephens

unread,
Oct 6, 2017, 10:14:41 AM10/6/17
to OpenRefine
Apologies to Tibaut for hijacking the thread :)

Thanks Antonin,

The way GOKb integrated data validation was to have an additional 'tab' in the lefthand panel alongside the existing Filters/Facets and Undo/Redo. This flagged errors and warnings, and then offered tools to help fix (e.g. link to pre-programmed facets that isolated the non-valid rows etc.) - see [1]

I really like the idea of having this standardised way of reporting/fixing issues which could then have various rulesets plugged into it for different validators (maybe a bit like the reconciliation service currently works?). I liked the panel approach taken by GOKb, but I'm biased having been involved in that project! There's code on github in relation to this[2]
[2] https://github.com/k-int/gokb-phase1/search?utf8=✓&q=validationpanel&type=
Reply all
Reply to author
Forward
0 new messages