Bug#649729: uniq: merges obscure Cyrillic characters
Alex Shinn  
Newsgroups: linux.debian.bugs.dist
From: Alex Shinn <alexsh...@gmail.com>
Date: Fri, 03 Feb 2012 07:40:02 +0100
Local: Fri, Feb 3 2012 1:40 am
Subject: Bug#649729: uniq: merges obscure Cyrillic characters
The problem is in strcoll/strxfrm as described in:

http://unix.stackexchange.com/questions/17198/where-has-my-uniq-or-so...

$ LANG=en_US.UTF-8 perl -C255 -MPOSIX -le 'print "$_ ", unpack("h*",
strxfrm($_)) foreach @ARGV' a b c А В Г Ѯ Ѻ Ѳ
a c010801020
b d010801020
c e010801020
А 2cbb10801090
В 2cdb10801090
Г 2ceb10801090
Ѯ 101010102c6b102c6b
Ѻ 101010102c6b102c6b
Ѳ 101010102c6b102c6b

The Latin and common Cyrillic characters all have different values,
but the rare characters all convert to the same collation element.
The same happens for Japanese kana, but not for kanji.

As the link states, it's pretty clearly a bug - the correct behavior
would be to sort the unknown characters after all known characters
and consider them distinct.  As a workaround, adding values for
all characters to every locale file in /usr/share/i18n/locales/ should
work.
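
The "consider them distinct" behavior described above can be sketched
as a tie-break: compare by the locale transform first, and fall back to
the raw string when the transform reports a tie. This is only a minimal
illustration in Python, not how uniq or glibc is implemented. The names
tie_broken_key, uniq and collapsing_xfrm are hypothetical, and
collapsing_xfrm is a stand-in that imitates the collapsed keys shown in
the strxfrm output rather than calling the real locale.

```python
import locale
from itertools import groupby


def tie_broken_key(line, xfrm=locale.strxfrm):
    """Collation key that falls back to code points when xfrm() ties."""
    return (xfrm(line), line)


def uniq(lines, key):
    """Collapse adjacent lines whose keys compare equal, keeping the first."""
    return [next(group) for _, group in groupby(lines, key=key)]


def collapsing_xfrm(line):
    # Hypothetical stand-in for the affected transform: the three rare
    # characters all map to one key, every other string maps to itself.
    return "" if line in {"Ѯ", "Ѻ", "Ѳ"} else line


if __name__ == "__main__":
    lines = ["a", "a", "Ѯ", "Ѻ", "Ѳ"]
    # Comparing by the collapsed key alone merges the rare characters.
    print(uniq(lines, key=collapsing_xfrm))  # ['a', 'Ѯ']
    # Adding the raw string as a tie-break keeps them distinct.
    print(uniq(lines, key=lambda s: tie_broken_key(s, collapsing_xfrm)))
    # ['a', 'Ѯ', 'Ѻ', 'Ѳ']
```

With a real locale, tie_broken_key(line) would use locale.strxfrm after
a call to locale.setlocale; whether the keys then tie depends on the
installed locale data.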

--
Alex



 
Bob Proulx  
Newsgroups: linux.debian.bugs.dist
From: Bob Proulx <b...@proulx.com>
Date: Fri, 03 Feb 2012 08:10:02 +0100
Local: Fri, Feb 3 2012 2:10 am
Subject: Bug#649729: uniq: merges obscure Cyrillic characters

forcemerge 139861 649729
thanks

Alex Shinn wrote:
> As the link states, it's pretty clearly a bug - the correct behavior
> would be to sort the unknown characters after all known characters
> and consider them distinct.  As a workaround, adding values for
> all characters to every locale file in /usr/share/i18n/locales/ should
> work.

See also this issue:

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=139861

It is a known deficiency in coreutils that the utilities are not
multibyte aware.  The following can be found in the upstream source
package TODO file.

  Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
    multibyte aware.  The problem is that I want to avoid duplicating
    significant blocks of logic, yet I also want to incur only minimal
    (preferably `no') cost when operating in single-byte mode.
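
As a rough illustration of what "multibyte aware" means here: in UTF-8
a single character can occupy more than one byte, so a tool that works
byte by byte sees different lengths and boundaries than one that works
character by character. The sketch below is in Python and the helper
name count_bytes_and_chars is hypothetical; it is not taken from
coreutils.

```python
def count_bytes_and_chars(text, encoding="utf-8"):
    """Return (byte count, character count) for text in the given encoding."""
    return len(text.encode(encoding)), len(text)


if __name__ == "__main__":
    # ASCII letters take one byte each; these Cyrillic letters take two.
    for sample in ("abc", "АВГ", "ѮѺѲ"):
        print(sample, count_bytes_and_chars(sample))
```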

Some vendors have hacked in patches to make the utilities multibyte
aware but none of those patches have been considered clean enough to
incorporate into the upstream source yet.  Debian's maintainer has
stated that he does not want to diverge from upstream this radically.
The patches are very messy and incomplete.  The best course of action
would be to get this resolved upstream with the functionality properly
integrated.  Until then this remains a known deficiency.

Bob
