YEAP that's the idea.
So bikeshedding...
So, for the simple common cases that we all seem to need often: make it much easier for OpenRefine users to convert to simple strings. Behind the scenes there would be a smart class with rules that knows what humans typically expect to see happen when diacritics are converted (with special attention paid to German diacritics and their conventional transliterations), without the user having to worry about GREL or Jython unless they really need more power.
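To make that concrete, here's a minimal Java sketch of what I mean by a "smart class" (the class name `DiacriticFolder` and the mapping table are just illustrative, not an implementation proposal): German umlauts get their conventional two-letter transliterations first, then a generic NFD strip removes the remaining combining marks.

```java
import java.text.Normalizer;
import java.util.Map;

// Illustrative sketch only: language-aware rules run before a generic strip.
public class DiacriticFolder {
    // German convention: ä -> ae, ö -> oe, ü -> ue, ß -> ss.
    private static final Map<Character, String> GERMAN = Map.of(
        'ä', "ae", 'ö', "oe", 'ü', "ue",
        'Ä', "Ae", 'Ö', "Oe", 'Ü', "Ue",
        'ß', "ss");

    public static String fold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            sb.append(GERMAN.getOrDefault(c, String.valueOf(c)));
        }
        // NFD separates base characters from combining marks; drop the marks.
        String decomposed = Normalizer.normalize(sb.toString(), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }
}
```

So `fold("Müller")` gives "Mueller" (German rule), while `fold("café")` gives "cafe" (generic strip) — whereas a naive NFD-only strip would turn "Müller" into "Muller", which is not what a German speaker expects.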
1. So what do you think?
2. What languages often irritate you when working with diacritics?
3. Do you need more sliders or widgets than equivalence checking or Levenshtein-like similarity comparison?
4. Do you use other criteria beyond what ICU4J offers? Quoting the ICU4J documentation: following the Unicode Collation Algorithm (UCA), there are 5 different levels of strength used in comparisons:
- PRIMARY strength: Typically, this is used to denote differences between base characters (for example, "a" < "b"). It is the strongest difference. For example, dictionaries are divided into different sections by base character.
- SECONDARY strength: Accents in the characters are considered secondary differences (for example, "as" < "às" < "at"). Other differences between letters can also be considered secondary differences, depending on the language. A secondary difference is ignored when there is a primary difference anywhere in the strings.
- TERTIARY strength: Upper and lower case differences in characters are distinguished at tertiary strength (for example, "ao" < "Ao" < "aò"). In addition, a variant of a letter differs from the base form on the tertiary strength (such as "A" and "Ⓐ"). Another example is the difference between large and small Kana. A tertiary difference is ignored when there is a primary or secondary difference anywhere in the strings.
- QUATERNARY strength: When punctuation is ignored (see Ignoring Punctuation in the User Guide) at PRIMARY to TERTIARY strength, an additional strength level can be used to distinguish words with and without punctuation (for example, "ab" < "a-b" < "aB"). This difference is ignored when there is a PRIMARY, SECONDARY or TERTIARY difference. The QUATERNARY strength should only be used if ignoring punctuation is required.
- IDENTICAL strength: When all other strengths are equal, the IDENTICAL strength is used as a tiebreaker. The Unicode code point values of the NFD form of each string are compared, in case there is no difference at the other levels. For example, Hebrew cantillation marks are only distinguished at this strength. This strength should be used sparingly, as a difference only at the code point level between two strings is extremely rare. Using this strength substantially decreases the performance of both the comparison and collation key generation APIs, and it also increases the size of the collation key.
Unlike the JDK, ICU4J's Collator deals only with 2 decomposition modes, the canonical decomposition mode and one that does not use any decomposition. The compatibility decomposition mode, java.text.Collator.FULL_DECOMPOSITION is not supported here. If the canonical decomposition mode is set, the Collator handles un-normalized text properly, producing the same results as if the text were normalized in NFD. If canonical decomposition is turned off, it is the user's responsibility to ensure that all text is already in the appropriate form before performing a comparison or before getting a CollationKey.
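If you want to play with those strength levels yourself, here's a minimal sketch using the JDK's `java.text.Collator` (ICU4J's `com.ibm.icu.text.Collator` exposes the same `setStrength`/`setDecomposition` calls and adds QUATERNARY, but it isn't in the stdlib; the `StrengthDemo`/`equalAt` names are just illustrative):

```java
import java.text.Collator;
import java.util.Locale;

// Compare two strings for equality at a chosen collation strength,
// with canonical decomposition on so un-normalized input is handled.
public class StrengthDemo {
    public static boolean equalAt(int strength, String a, String b) {
        Collator c = Collator.getInstance(Locale.FRENCH);
        c.setStrength(strength);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c.compare(a, b) == 0;
    }
}
```

So "côte" vs "cote" compare equal at PRIMARY (accents ignored) but not at SECONDARY, while "côte" vs "Côte" compare equal at SECONDARY (case ignored) but not at TERTIARY. That's the kind of behavior a widget exposing these levels would be surfacing.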
5. What would be a dream facet, widget, extension, etc. to help you deal with diacritics more easily, and how would it function in your perfect world view?
6. Can we help the user more with getting the text into the appropriate form prior to comparison? How would that ideally work? What common transformations could be offered in a dropdown?
That's what we want to know... so that we can scope it and plan for it better.
For me, it's a widget that exposes those 5 strength levels plus an ON/OFF option for canonical decomposition... but then what else?
-Thad