Bug#649729: uniq: merges obscure Cyrillic characters

Alex Shinn

Feb 3, 2012, 1:40:02 AM

The problem is in strcoll/strxfrm as described in:

http://unix.stackexchange.com/questions/17198/where-has-my-uniq-or-sort-u-line-gone-with-some-unicode-characters

$ LANG=en_US.UTF-8 perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*",
strxfrm($_)) foreach @ARGV' a b c А В Г Ѯ Ѻ Ѳ
a c010801020
b d010801020
c e010801020
А 2cbb10801090
В 2cdb10801090
Г 2ceb10801090
Ѯ 101010102c6b102c6b
Ѻ 101010102c6b102c6b
Ѳ 101010102c6b102c6b

The Latin and common Cyrillic characters each get a distinct key,
but the rare characters (Ѯ, Ѻ, Ѳ) all transform to the same key,
so they compare as equal. The same happens for Japanese kana, but
not for kanji.
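
To check whether a given system is affected, one can group strings by
their strxfrm key and look for groups with more than one member. The
following is only a sketch in Python rather than Perl; the helper name
colliding_groups is made up for illustration, and what it reports
depends entirely on the locale data installed on the machine.

```python
import locale
from collections import defaultdict


def colliding_groups(strings, key=locale.strxfrm):
    """Return lists of distinct strings that share the same collation key.

    `key` defaults to locale.strxfrm, so the result depends on the
    LC_COLLATE locale that is active when the function is called.
    """
    groups = defaultdict(list)
    for s in strings:
        if s not in groups[key(s)]:
            groups[key(s)].append(s)
    # Only groups with at least two different strings are collisions.
    return [members for members in groups.values() if len(members) > 1]


# Example use (assumes the en_US.UTF-8 locale is installed):
#   locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
#   print(colliding_groups(["a", "b", "c", "А", "В", "Г", "Ѯ", "Ѻ", "Ѳ"]))
```

On a system showing the behavior above, the three rare characters
would be expected to come back as one group; on a system with complete
collation data the list should be empty.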

As the link states, it's pretty clearly a bug - the correct behavior
would be to sort the unknown characters after all known characters
and consider them distinct. As a workaround, adding values for
all characters to every locale file in /usr/share/i18n/locales/ should
work.
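
For scripts that cannot wait for a locale fix, the "consider them
distinct" half of this can be approximated in user code. The sketch
below is an assumption-laden illustration in Python, not what uniq or
sort do: it deduplicates on the exact text and uses the raw string
only as a tie-breaker when collation keys are equal. It does not move
unknown characters after the known ones. The name sort_unique is
hypothetical.

```python
import locale


def sort_unique(lines, collate_key=locale.strxfrm):
    """Sort in locale order without merging lines whose text differs.

    Duplicates are removed by exact string equality, not by collation
    key, so characters with identical keys are still kept apart.
    """
    # The second tuple element breaks ties between equal collation keys.
    return sorted(set(lines), key=lambda s: (collate_key(s), s))
```

Called as sort_unique(["Ѯ", "Ѻ", "Ѳ", "Ѯ"]) this should return three
lines regardless of whether the active locale gives them distinct keys.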

--
Alex




Bob Proulx

Feb 3, 2012, 2:10:02 AM

forcemerge 139861 649729
thanks

Alex Shinn wrote:
> As the link states, it's pretty clearly a bug - the correct behavior
> would be to sort the unknown characters after all known characters
> and consider them distinct. As a workaround, adding values for
> all characters to every locale file in /usr/share/i18n/locales/ should
> work.

See also this issue:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=139861

It is a known deficiency in coreutils that the utilities are not
multibyte aware. The following can be found in the upstream source
package TODO file.

  Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
  multibyte aware. The problem is that I want to avoid duplicating
  significant blocks of logic, yet I also want to incur only minimal
  (preferably `no') cost when operating in single-byte mode.

Some vendors have hacked in patches to make the utilities multibyte
aware but none of those patches have been considered clean enough to
incorporate into the upstream source yet. Debian's maintainer has
stated that he does not want to diverge from upstream this radically.
The patches are very messy and incomplete. The best course of action
would be to get this resolved upstream with the functionality properly
integrated. Until then this remains a known deficiency.

Bob