Replacing diacritic (accent) characters

Thad Guidry

Apr 16, 2017, 11:59:33 AM

to openrefine

Hello OpenRefine Community!

Many of us who work with open data struggle with accented characters (diacritics) and need to replace them with their plain equivalents for reconciling, equivalence checking, or general English usage. For example:

Before: Marc Márquez

After: Marc Marquez

I've added a simple recipe to the Jython tutorial that can replace diacritic characters.

https://github.com/OpenRefine/OpenRefine/wiki/Recipes#replacing-diacritic-accent-characters

Note: The original strings need to be Unicode (UTF-8), so make sure your data is encoded properly when you first import it into OpenRefine.

(This might land as a new Common Transformation in OpenRefine later on, depending on what works best for our community. I'm unsure whether Apache Commons Lang3's StringUtils.stripAccents or the recipe's use of Python's unidecode library will work better. So far, unidecode seems to convert better, but I'd like to gather opinions.)
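For anyone who wants to try the basic idea outside OpenRefine, here is a minimal sketch in plain Python. It is not the wiki recipe and it does not use unidecode. It relies only on the standard library's unicodedata module: decompose each character (NFD), then drop the combining marks. The function name strip_accents is just an illustrative choice.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD).

    Hypothetical helper for illustration; not the recipe from the wiki.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Marc Márquez"))  # prints: Marc Marquez
```

Characters that have no canonical decomposition, such as "ø", pass through this sketch unchanged, so it only covers accents that Unicode represents as a base letter plus a mark. That limitation is one thing to keep in mind when comparing this kind of approach with a transliteration library.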

Happy Easter! (In all languages, with or without diacritics!)

-Thad
