Diacritics and OpenRefine

Thad Guidry

Aug 2, 2017, 10:54:03 AM

to openrefine

I had previously posted this on our developer mailing list but forgot to also CC our wider community, so here it is again.

While investigating use cases for replacing diacritics (regular mailing list thread), I found out that there might be a better way to compare strings across languages: http://stackoverflow.com/a/5697575

Evidently, this Java class can be used for this kind of equivalence checking...

http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Collator.html

I can see perhaps using a numerical slider or some other widget to move across the different strengths.

Any other thoughts or ideas on where and how to best incorporate this into OpenRefine, and what visualization feedback, facets, or dialogs might work well with that class? Quick and dirty drawings snapped with your cell phone are acceptable :)

-Thad

+ThadGuidry

Ettore Rizza

Aug 4, 2017, 1:12:22 PM

to OpenRefine

Hi Thad,

I'm not sure I understood all the implications of this Java class. But if the goal is to transform Unicode letters into ASCII, I am interested. The fingerprint() function does that, but it also changes the order of the tokens. So I often use Jython and the Unicode module to turn "été" into "ete". Is that what we're talking about?
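
For readers who have not seen the approach described above, here is a minimal sketch of diacritic stripping that keeps token order intact. It assumes the "Unicode module" refers to Python's standard unicodedata module, and it is written for standard Python 3; the function name strip_diacritics is only illustrative, and running it inside OpenRefine's Jython would need some adaptation.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks while leaving token order untouched."""
    # NFD splits a precomposed letter such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character that is not a nonspacing mark (category Mn).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into NFC.
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("été"))         # ete
print(strip_diacritics("Zoë Åström"))  # Zoe Astrom
```

Unlike fingerprint(), this leaves case, punctuation, and word order exactly as they were.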

Thad Guidry

Aug 4, 2017, 3:17:37 PM

to OpenRefine

Yep, that's the idea.

So bikeshedding...

The goal is to cover the simple, common cases that we all seem to need often, and to make it much easier for OpenRefine users to convert to simple strings without having to worry about GREL or Jython unless they really need more power. Behind the scenes, there would be a smart class with rules that knows what humans typically expect from a conversion of diacritics, with special attention paid to German diacritics.

1. So what do you think ?

2. What languages often irritate you when working with diacritics ?

3. Do you need more sliders or widgets than equivalency checking or Levenshtein-like similarity comparison ?

4. Do you use other criteria beyond what ICU4J does? Its documentation says that, in the Unicode Collation Algorithm (UCA), there are 5 different levels of strength used in comparisons:

PRIMARY strength: Typically, this is used to denote differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character.
SECONDARY strength: Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). Other differences between letters can also be considered secondary differences, depending on the language. A secondary difference is ignored when there is a primary difference anywhere in the strings.
TERTIARY strength: Upper and lower case differences in characters are distinguished at tertiary strength (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs from the base form on the tertiary strength (such as "A" and "Ⓐ"). Another example is the difference between large and small Kana. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings.
QUATERNARY strength: When punctuation is ignored (see Ignoring Punctuations in the User Guide) at PRIMARY to TERTIARY strength, an additional strength level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a PRIMARY, SECONDARY or TERTIARY difference. The QUATERNARY strength should only be used if ignoring punctuation is required.
IDENTICAL strength: When all other strengths are equal, the IDENTICAL strength is used as a tiebreaker. The Unicode code point values of the NFD form of each string are compared, just in case there is no difference. For example, Hebrew cantillation marks are only distinguished at this strength. This strength should be used sparingly, since two strings that differ only in code point values are an extremely rare occurrence. Using this strength substantially decreases the performance for both comparison and collation key generation APIs. This strength also increases the size of the collation key.

Unlike the JDK, ICU4J's Collator deals only with 2 decomposition modes, the canonical decomposition mode and one that does not use any decomposition. The compatibility decomposition mode, java.text.Collator.FULL_DECOMPOSITION is not supported here. If the canonical decomposition mode is set, the Collator handles un-normalized text properly, producing the same results as if the text were normalized in NFD. If canonical decomposition is turned off, it is the user's responsibility to ensure that all text is already in the appropriate form before performing a comparison or before getting a CollationKey.

5. What would be a dream facet, widget, extension, etc... to help you deal with diacritics more easily, and how would it function in your perfect world view ?

6. Can we help the user more with getting the text into the appropriate form prior to comparison? How would that ideally work, and which common transformations could be offered in a drop-down?

That's what we want to know, so that we can scope it and plan for it better.

For me, it would be a widget that exposes those 5 levels and a canonical decomposition ON/OFF option... but then what else?
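
To make the strength levels more concrete, here is a rough sketch of how the first three could behave for equivalence checks. It uses only Python's standard unicodedata module and is not ICU4J's Collator: it ignores locale rules, punctuation handling, and ordering, and the names comparison_key and equivalent are made up for illustration.

```python
import unicodedata

def _strip_marks(text):
    # Drop combining marks from text that is already decomposed.
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def comparison_key(text, strength="tertiary"):
    """Approximate key: strings with equal keys count as equivalent."""
    # Canonical decomposition ON: always compare in NFD.
    decomposed = unicodedata.normalize("NFD", text)
    if strength == "primary":
        # Base letters only: ignore accents and case.
        return _strip_marks(decomposed).casefold()
    if strength == "secondary":
        # Accents matter, case does not.
        return unicodedata.normalize("NFD", decomposed.casefold())
    if strength == "tertiary":
        # Accents and case both matter.
        return decomposed
    raise ValueError("unsupported strength: " + strength)

def equivalent(a, b, strength="tertiary"):
    return comparison_key(a, strength) == comparison_key(b, strength)

print(equivalent("Été", "ete", "primary"))    # True
print(equivalent("Été", "été", "secondary"))  # True
print(equivalent("été", "ete", "secondary"))  # False
print(equivalent("Été", "été", "tertiary"))   # False
```

A slider in the UI could map its positions onto the strength argument, with the decomposition toggle deciding whether the NFD step runs at all.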

-Thad

+ThadGuidry

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joe Wicentowski

Aug 4, 2017, 3:35:39 PM

to OpenRefine

Super interesting! Briefly, I can think of 3 places where these options would be useful:

1. Sort - advanced options beyond "a-z"

2. Cluster and edit - tweaking sensitivity

3. Transform cell - normalizing unicode form or stripping diacritics for normalization purposes

I'm not sure how to express these in UI forms, but I'll follow the conversation and chime in when I think I have something to contribute.

Thad Guidry

Aug 4, 2017, 5:55:40 PM

to OpenRefine

Yes, that's the kind of brainstorming I'm looking for. Thanks Joe.

I like 1 and 2 ... good ideas ... we'll make sure to include those when we get to the design.

For 3... that's what we are bikeshedding now in this discussion (called text normalization, or just normalization)

Currently,

"été".reinterpret("ascii")

results in

��t��

but through some newly proposed GREL commands that could use the ICU4J Java library, we could do much smarter character replacement and transliteration (not translation... you still need a linguist or Google Translate for that :) )
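
The garbled reinterpret("ascii") result above is what you would expect if the UTF-8 bytes of the string are decoded again as ASCII. The sketch below reproduces that effect in plain Python; it assumes this is roughly what reinterpret() does internally, which the thread does not confirm, and the function name reinterpret_as_ascii is invented for the example.

```python
def reinterpret_as_ascii(text):
    """Encode as UTF-8, then decode the same bytes as ASCII."""
    utf8_bytes = text.encode("utf-8")
    # Bytes outside the ASCII range each become U+FFFD, the replacement character.
    return utf8_bytes.decode("ascii", errors="replace")

# "é" takes two bytes in UTF-8, so each one turns into two replacement characters.
print(len("été".encode("utf-8")))   # 5
print(reinterpret_as_ascii("été"))  # ��t��
```

Changing the encoding never maps "é" to "e"; that takes a rule-based conversion such as normalization or transliteration.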

Ettore, to make you think more... :) ... some details here: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Transliterator.html

"ete".transliterate("English-French")

results in

été

"été".normalize()

results in

ete

Transliteration and normalization are different things in the world of data, and in ICU4J each has its own options. We might expose those options as GREL parameters on those 2 methods, or perhaps on additional methods; I don't know quite yet, and it depends on community needs. It's similar to what we do for reinterpret(), which takes an encoding parameter like "UTF8" or "ASCII". You have to tell a computer HOW you want to reinterpret and with WHAT OPTIONS.

Joe, I guess you're looking for the proposed normalize() method via a quick menu option?
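
As a rough idea of what "options" could mean for the proposed normalize(), here is a sketch in plain Python. Nothing here is an existing GREL or ICU4J API: the parameter names german and compatibility and the GERMAN_RULES table are assumptions made for illustration, and a real implementation would presumably delegate to ICU4J.

```python
import unicodedata

# Hypothetical rule table for the German special case mentioned in this thread.
GERMAN_RULES = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def normalize(text, german=False, compatibility=False):
    """Strip diacritics, optionally with German rules or compatibility folding."""
    if german:
        # Compose first so that "u" + U+0308 is seen as a single "ü".
        composed = unicodedata.normalize("NFC", text)
        text = "".join(GERMAN_RULES.get(ch, ch) for ch in composed)
    # NFKD also unfolds compatibility characters such as the "ﬁ" ligature.
    form = "NFKD" if compatibility else "NFD"
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize("été"))                       # ete
print(normalize("Müller"))                    # Muller
print(normalize("Müller", german=True))       # Mueller
print(normalize("ﬁn", compatibility=True))    # fin
```

The same input gives different results depending on the options, which is why each option would have to be an explicit parameter or menu choice.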

-Thad

+ThadGuidry
