How do I find values that contain only a-z, A-Z, and Unicode characters?


Yogesh

Jul 30, 2017, 11:49:17 PM
to OpenRefine
I am trying to correct the names of states, but quite a few contain digits and other non-letter characters (e.g. 1-9, /, (, &, etc.). How can I find all the facet values with such nonsensical content? Thanks in advance.


Ettore Rizza

Jul 31, 2017, 1:37:25 AM
to OpenRefine
You can use a text filter on your column with this expression (don't forget to check the 'regular expression' box after you have copy-pasted it):

[^\p{L}-]

It means: find every character that is not a Unicode letter or a hyphen. The filter therefore keeps only the cells that contain at least one such character.
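For anyone who wants to check the logic outside OpenRefine, here is a minimal Python sketch of what that character class tests. It is only an illustration, not how OpenRefine evaluates the filter: Python's built-in `re` module has no `\p{L}`, so the sketch checks the Unicode general category instead. The function name `has_non_letter` and the sample values are made up for this example.

```python
import unicodedata

def has_non_letter(value: str) -> bool:
    """Return True if value contains any character that is neither
    a Unicode letter (general category L*) nor a hyphen."""
    return any(
        ch != "-" and not unicodedata.category(ch).startswith("L")
        for ch in value
    )

# Hypothetical sample values, not taken from the original data set.
samples = [
    "Goa",
    "Qu\u00e9bec",              # precomposed e-acute, still a letter
    "Baden-W\u00fcrttemberg",   # hyphen is allowed by the pattern
    "Texas1",
    "N/A",
    "Puerto Rico",
]

flagged = [s for s in samples if has_non_letter(s)]
print(flagged)  # ['Texas1', 'N/A', 'Puerto Rico']
```

Note that "Puerto Rico" is flagged because a space is neither a letter nor a hyphen.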

You can also play with Facet -> Customized facets -> Unicode char-code facet. According to the documentation, "that generates a distribution of which unicode characters are used in a particular column. That distribution will allow you to spot outliers, meaning characters that are used infrequently and that might suggest encoding issues. You can use that char distribution facet to 'scan' and inspect the values for yourself and then use fix their encoding with the 'reinterpret' GREL function."
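To give a rough idea of what such a character distribution looks like, here is a small Python sketch that counts how often each character occurs in a column and lists the rarest ones first. This is an assumption-laden illustration of the idea, not OpenRefine's actual implementation of the facet; `char_distribution` and the sample column are hypothetical.

```python
from collections import Counter

def char_distribution(values):
    """Count how often each character occurs across all values."""
    counts = Counter()
    for value in values:
        counts.update(value)  # updating with a string counts its characters
    return counts

# Hypothetical column with one suspicious value.
column = ["Goa", "Goa", "Go@", "Goa"]
dist = char_distribution(column)

# Rarest characters first: these are the outliers worth inspecting.
for ch, n in sorted(dist.items(), key=lambda item: item[1]):
    print(f"U+{ord(ch):04X} {ch!r} occurs {n} time(s)")
```

In this sample the character `@` appears only once, so it shows up at the top of the list.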


Hope this helps.

Yogesh

Jul 31, 2017, 11:23:24 AM
to OpenRefine
Hello Ettore,

Thank you once again. However, my results include 

Not Applicable
Puerto Rico
Hong Kong 
etc 

How can I exclude such two-word values with spaces from the filter? (PS: I am a newbie at regex too.)

Yogesh

Thad Guidry

Jul 31, 2017, 2:56:46 PM
to OpenRefine
We actually have a Customized Facet -> Unicode char-code facet

It helps for that use case of finding strings in a column that have Unicode characters.

Then you can apply whatever additional facets or perform GREL expressions on your included or excluded Unicode char strings.

-Thad



Ettore Rizza

Aug 1, 2017, 5:46:43 AM
to OpenRefine
"How can I exclude such two-word values with spaces from the filter? (PS: I am a newbie at regex too.)"

By simply adding a blank space (or the symbol \s, which also covers tabs and line breaks) inside the expression:

 [^\p{L}-\s]

Now it means "find any cell of the column that contains something that is not a Unicode letter, a hyphen, or a space."
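Here is a hedged Python sketch of the extended check, to show the effect on the values mentioned above. It assumes the filter uses Java regex semantics, where `\s` by default matches only space, tab, newline, vertical tab, form feed, and carriage return. Python's `re` has no `\p{L}`, so the sketch tests the Unicode category directly. The names `has_disallowed` and `JAVA_WHITESPACE` are made up for this example.

```python
import unicodedata

# Assumption: the characters that \s matches by default in a Java regex.
JAVA_WHITESPACE = " \t\n\x0b\f\r"

def has_disallowed(value: str, allow_whitespace: bool = True) -> bool:
    """Return True if value contains a character that is not a Unicode
    letter, a hyphen, or (optionally) a whitespace character."""
    for ch in value:
        if ch == "-" or unicodedata.category(ch).startswith("L"):
            continue
        if allow_whitespace and ch in JAVA_WHITESPACE:
            continue
        return True
    return False

samples = ["Not Applicable", "Puerto Rico", "Hong Kong", "Texas1", "N/A"]

still_flagged = [s for s in samples if has_disallowed(s)]
print(still_flagged)  # ['Texas1', 'N/A']
```

With `allow_whitespace=False` the function behaves like the first expression, so "Puerto Rico" would be flagged again.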