How to Remove Special Characters?

2,311 views
Skip to first unread message

MRB

unread,
Jun 13, 2019, 10:45:07 AM6/13/19
to OpenRefine
Greetings,

I have a lot of 'junk' in our text fields , how can I use OR to remove non-printable characters to clean up this data. It isn't a common transform.

I'm wondering if I need to do a foreach loop and inside test for the characcer value and if it's character code is not between 32-128 then delete it...

Any suggestions and sample GREL would be appreciated.

Thanks, Mark

Actual text sample:

Managers want a “super user� “role� in

should be: Managers want a super user role in

Owen Stephens

unread,
Jun 13, 2019, 10:51:55 AM6/13/19
to OpenRefine
Hi Mark

You could use the GREL replace function for this, using a regular expression to match (or not match) a range of unicode characters. So for example:

value.replace(/[^\u0020-\u007F]/,"")

Would remove any characters that were not part of the Basic Latin set (see https://jrgraphix.net/r/Unicode/0020-007F)

Owen

Owen Stephens

unread,
Jun 13, 2019, 10:54:08 AM6/13/19
to OpenRefine
Oh - also worth mentioning - it would be my guess that the examples you give are create when smart quotes in the original data have not been interpreted correctly on import. One thing it may be worth doing, if it is an option, is re-importing the original data and forcing the import to use UTF-8 - that may help you get cleaner data into OpenRefine in the first place - but of course this may not be possible or the mis-encoding may have originated somewhere earlier in the chain than the OpenRefine import

Owen

MRB

unread,
Jun 13, 2019, 10:55:59 AM6/13/19
to OpenRefine
I tried a to Text transform and remove html characters transform - neither worked.

Owen Stephens

unread,
Jun 13, 2019, 11:37:41 AM6/13/19
to OpenRefine
Hi Mark

To use the "replace" function I mention you need to:

Choose "Edit Cells" from the column dropdown
Choose "Transform"
In the "Expression" box put the GREL:

value.replace(/[^\u0020-\u007F]/,"")

Click "OK"

This would remove any characters not in the latin basic set


If you were able to share the original data on here then I could tell you if the problem has been caused by import to OpenRefine or if it is already in the data file

Owen

MRB

unread,
Jun 13, 2019, 12:46:00 PM6/13/19
to OpenRefine


On Thursday, June 13, 2019 at 10:45:07 AM UTC-4, MRB wrote:

MRB

unread,
Jun 13, 2019, 12:52:52 PM6/13/19
to OpenRefine
Thank you for the formula Owen!
.  It works well - except it's removing the CR/LF  sequence - or newline character,  I'll see if I can fix that.  

I also want to create a facet that would be true/false if the field contains a non-printable character so I can get and idea of the extent of these issues - I'll look for a contains function or something...

Also, the characters were in the data prior to OR import - they were causing an import into a database to fail ...

Thank you for you help!

-Mark


On Thursday, June 13, 2019 at 10:45:07 AM UTC-4, MRB wrote:

Owen Stephens

unread,
Jun 13, 2019, 1:16:45 PM6/13/19
to OpenRefine
I think in terms of the non-printable aspect - I'm not sure that these characters are non-printable - I think they would print fine? However they were created, what you have now is a number of valid UTF-8 characters instead of the original characters.

To find the extent of the problem I'd suggest using a Customised Facet->Unicode char-code facet
This will give you a facet where you can narrow to rows which contain characters in a particular range.

In terms of solving the problem it's a difficult one to resolve. You can try using an approach like the one I've already suggested, or you could identify specific character combinations that have been caused and fix them using more specific "replace" expressions - e.g.

value.replace("“","\"").replace("�","\"")

If the only problems are the smart quotes you might only need 2 replace statements (or 4 if there were single and double smart quotes in the original). But if the problems are more varied in nature then using a broad brush approach such as character ranges maybe the way to go

Owen

MRB

unread,
Jun 13, 2019, 2:27:10 PM6/13/19
to OpenRefine
Your suggestion of changing the import to use UTF-8 worked really well too!  That cleared up most issuse I've seen. Thanks again!  -Mark


On Thursday, June 13, 2019 at 10:45:07 AM UTC-4, MRB wrote:

MRB

unread,
Jun 13, 2019, 4:58:35 PM6/13/19
to OpenRefine
When I enforced the import to UTF-8, the original gibberish in my 1sr post translated to quotes. I need to read up on smart quotes, translation formats and code pages ... thanks again Owen.

MRB

unread,
Jun 14, 2019, 4:01:44 PM6/14/19
to OpenRefine
If I need to replace a " - double quote with an empty string - how would I do that?  Because the delimiters in the replace function appear to be double quotes?


On Thursday, June 13, 2019 at 10:45:07 AM UTC-4, MRB wrote:

Thad Guidry

unread,
Jun 14, 2019, 4:50:43 PM6/14/19
to openr...@googlegroups.com
Use single quote as your preferred wrapping method ...anytime in GREL that you need to deal with double quotes inside or vice-versa.

replace('"', '*')



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/bffa4e68-880b-4469-9b52-af3d9d6cc017%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mark Bustin

unread,
Jun 14, 2019, 5:15:40 PM6/14/19
to openr...@googlegroups.com

Owen Stephens

unread,
Jun 15, 2019, 5:09:40 AM6/15/19
to OpenRefine
Just for completeness - you can also use a backslash to "escape" the character you want to replace:

value.replace("\"","*")

Also remember if they are smart quotes, this won't work as smart quotes are different characters to the straight up-and-down ascii inverted commas or apostrophes. In the upcoming OpenRefine v3.2 we've added a new "Common transform" to replace all smart quotes with their ascii equivalents, but in the meantime the following expression will do the job:

value.replace(/[‘’‚‛‹›‚]/,"'").replace(/[“”«»„]/,"\"")

Owen


On Friday, June 14, 2019 at 9:50:43 PM UTC+1, Thad Guidry wrote:
Use single quote as your preferred wrapping method ...anytime in GREL that you need to deal with double quotes inside or vice-versa.

replace('"', '*')



On Fri, Jun 14, 2019 at 3:01 PM MRB <mrbu...@gmail.com> wrote:
If I need to replace a " - double quote with an empty string - how would I do that?  Because the delimiters in the replace function appear to be double quotes?

On Thursday, June 13, 2019 at 10:45:07 AM UTC-4, MRB wrote:
Greetings,

I have a lot of 'junk' in our text fields , how can I use OR to remove non-printable characters to clean up this data. It isn't a common transform.

I'm wondering if I need to do a foreach loop and inside test for the characcer value and if it's character code is not between 32-128 then delete it...

Any suggestions and sample GREL would be appreciated.

Thanks, Mark

Actual text sample:

Managers want a “super user� “role� in

should be: Managers want a super user role in

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages