
Gawk match() and numbers in scientific notation


Hermann Peifer

May 6, 2008, 7:16:01 AM
Hi,


I am somewhat puzzled by match() results for numbers in scientific
notation. See below.

$ cat testdata
100
100e-3
100E3

I am wondering what kind of uppercase character is matched in record
2:

$ gawk '{print $1,match($1,/[A-Z]/)}' testdata
100 0
100e-3 4
100E3 4

I am also wondering about the non-matches in these cases:

$ gawk 'BEGIN{print match(100e-3,/[A-Z]/)}'
0
$ gawk 'BEGIN{print match(100E3,/[A-Z]/)}'
0

What is the logic behind the matches and non-matches?

I am using gawk 3.1.5 and LANG=en_US.UTF-8

Thanks in advance, Hermann

pk

May 6, 2008, 7:28:06 AM
On Tuesday 6 May 2008 13:16, Hermann Peifer wrote:

> Hi,
>
>
> I am somewhat puzzled by match() results for numbers in scientific
> notation. See below.
>
> $ cat testdata
> 100
> 100e-3
> 100E3
>
> I am wondering what kind of uppercase character is matched in record
> 2:
>
> $ gawk '{print $1,match($1,/[A-Z]/)}' testdata
> 100 0
> 100e-3 4
> 100E3 4

Try this with LC_ALL=C (always a good idea when working with
locale-sensitive data):

$ LC_ALL=C gawk '{print $1,match($1,/[A-Z]/)}' testdata
100 0
100e-3 0
100E3 4

> I am also wondering about the non-matches in these cases:
>
> $ gawk 'BEGIN{print match(100e-3,/[A-Z]/)}'
> 0
> $ gawk 'BEGIN{print match(100E3,/[A-Z]/)}'
> 0

Unlike before, here you are asking awk to do number-to-string conversion.
The problem is (I think) that the numbers lose the Es when they are
converted to strings:

$ gawk 'BEGIN{printf "%s\n", 100E3}'
100000
$ gawk 'BEGIN{printf "%s\n", 100e-3}'
0.1

Compare with this (which also shows the influence of the locale):

$ gawk 'BEGIN{print match("100E3",/[A-Z]/)}'
4
$ gawk 'BEGIN{print match("100e-3",/[A-Z]/)}'
4
$ LC_ALL=C gawk 'BEGIN{print match("100E3",/[A-Z]/)}'
4
$ LC_ALL=C gawk 'BEGIN{print match("100e-3",/[A-Z]/)}'
0
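
The difference between the two cases can be sketched as follows, assuming any POSIX awk and the C locale. A numeric constant is turned into a string before match() sees it, so the exponent letter is already gone, whereas a string constant (or an input field) is matched as written. The function name show_conversion is only illustrative.

```shell
show_conversion() {
    LC_ALL=C awk 'BEGIN {
        print match(100E3, /[A-Z]/)     # number becomes "100000": prints 0
        print match("100E3", /[A-Z]/)   # string keeps its E: prints 4
        print 100e-3 ""                 # concatenation forces the string form: 0.1
    }'
}
show_conversion
```

Fields read from a file behave like the quoted strings here, which would explain why the testdata run sees the letter and the BEGIN block does not.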


Of course, I stand to be corrected for anything wrong I could have written.

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.

Hermann Peifer

May 7, 2008, 10:11:38 AM
On May 6, 1:28 pm, pk <p...@pk.invalid> wrote:
> On Tuesday 6 May 2008 13:16, Hermann Peifer wrote:
>
> > I am also wondering about the non-matches in these cases:
>
> > $ gawk 'BEGIN{print match(100e-3,/[A-Z]/)}'
> > 0
> > $ gawk 'BEGIN{print match(100E3,/[A-Z]/)}'
> > 0
>
> Unlike before, here you are asking awk to do number-to-string conversion.
> The problem is (I think) that the numbers lose the Es when they are
> converted to strings:
>
> $ gawk 'BEGIN{printf "%s\n", 100E3}'
> 100000
> $ gawk 'BEGIN{printf "%s\n", 100e-3}'
> 0.1
>
> Compare with this (which also shows the influence of the locale):
>
> $ gawk 'BEGIN{print match("100E3",/[A-Z]/)}'
> 4
> $ gawk 'BEGIN{print match("100e-3",/[A-Z]/)}'
> 4
> $ LC_ALL=C gawk 'BEGIN{print match("100E3",/[A-Z]/)}'
> 4
> $ LC_ALL=C gawk 'BEGIN{print match("100e-3",/[A-Z]/)}'
> 0
>

Thanks. This seems to explain the matching logic.

Hermann

Ed Morton

May 7, 2008, 10:18:57 AM

On 5/6/2008 6:16 AM, Hermann Peifer wrote:
> Hi,
>
>
> I am somewhat puzzled by match() results for numbers in scientific
> notation. See below.
>
> $ cat testdata
> 100
> 100e-3
> 100E3
>
> I am wondering what kind of uppercase character is matched in record
> 2:
>
> $ gawk '{print $1,match($1,/[A-Z]/)}' testdata

There may not be an uppercase character matching. [A-Z] represents the list of
characters in between the character A and the character Z in your locale - that
does NOT mean it has to be upper case characters. For example, your locale might
consider characters ordered as:

aAbBcCdDeEfF....zZ

so "e" would sit between "A" and "Z". That's why you should use character
classes instead of specific ranges, e.g.:

gawk '{print $1,match($1,/[[:upper:]]/)}' testdata
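
As a small self-contained check of the character-class form, here is a sketch that feeds the three test records through a pipe instead of a file. It assumes an awk that supports POSIX character classes; the function name upper_pos is made up for the example.

```shell
upper_pos() {
    # prints each record with the position of its first uppercase letter (0 if none)
    LC_ALL=C awk '{ print $1, match($1, /[[:upper:]]/) }'
}
printf '100\n100e-3\n100E3\n' | upper_pos
```

With the class, only the record containing a real uppercase E should report a match (position 4); the other two report 0.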

Regards,

Ed.

Janis

May 7, 2008, 10:59:10 AM
On 7 Mai, 16:18, Ed Morton <mor...@lsupcaemnt.com> wrote:
> On 5/6/2008 6:16 AM, Hermann Peifer wrote:
>
> > Hi,
>
> > I am somewhat puzzled by match() results for numbers in scientific
> > notation. See below.
>
> > $ cat testdata
> > 100
> > 100e-3
> > 100E3
>
> > I am wondering what kind of uppercase character is matched in record
> > 2:
>
> > $ gawk '{print $1,match($1,/[A-Z]/)}' testdata
>
> There may not be an uppercase character matching. [A-Z] represents the list of
> characters in between the character A and the character Z in your locale

Can you provide some reference for that definition?

I thought that ranges like [A-Z] depend on the _coding_ of the
character set (IOW, on the code values of the characters), and
not on the locale.

So that in case you have a different code set than ISO Latin 1,
ASCII, or similar, e.g. like EBCDIC (where there may be other
characters spread in between the letter code positions) you'd
get unexpected results.

So, because of that, your suggestion below is valid anyway, but
I'd like to be sure about what you wrote if that is in fact true.

Janis

Ed Morton

May 7, 2008, 11:20:16 AM

On 5/7/2008 9:59 AM, Janis wrote:
> On 7 Mai, 16:18, Ed Morton <mor...@lsupcaemnt.com> wrote:
>
>>On 5/6/2008 6:16 AM, Hermann Peifer wrote:
>>
>>
>>>Hi,
>>
>>>I am somewhat puzzled by match() results for numbers in scientific
>>>notation. See below.
>>
>>>$ cat testdata
>>>100
>>>100e-3
>>>100E3
>>
>>>I am wondering what kind of uppercase character is matched in record
>>>2:
>>
>>>$ gawk '{print $1,match($1,/[A-Z]/)}' testdata
>>
>>There may not be an uppercase character matching. [A-Z] represents the list of
>>characters in between the character A and the character Z in your locale
>
>
> Can you provide some reference for that definition?

From the GNU awk user guide
(http://www.gnu.org/software/gawk/manual/gawk.html#Character-Lists):

... For example, in the default C locale, `[a-dx-z]' is equivalent to
`[abcdxyz]'. Many locales sort characters in dictionary order, and in these
locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]'; instead it might
be equivalent to `[aBbCcDdxXyYz]', for example.

> I thought that ranges like [A-Z] depend on the _coding_ of the
> character set (IOW, on the code values of the characters), and
> not depending on the locale.
>
> So that in case you have a different code set than ISO Latin 1,
> ASCII, or similar, e.g. like EBCDIC (where there may be other
> characters spread in between the letter code positions) you'd
> get unexpected results.
>
> So, because of that, your suggestion below is valid anyway, but
> I'd like to be sure about what you wrote if that is in fact true.

Sounds to me like it's dependent on locale, but if what you're describing above is
something different from locale, then you may be right. Whatever the "thingy" is
that causes the difference, using character classes is the right approach.

Ed.

pk

May 7, 2008, 11:25:24 AM
On Wednesday 7 May 2008 17:20, Ed Morton wrote:

> From the GNU awk user guide
> (http://www.gnu.org/software/gawk/manual/gawk.html#Character-Lists):
>
> ... For example, in the default C locale, `[a-dx-z]' is equivalent to
> `[abcdxyz]'. Many locales sort characters in dictionary order, and in
> these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
> instead it might be equivalent to `[aBbCcDdxXyYz]', for example.

Is there a way to explicitly print out that information (or, better, the
entire collating sequence in use)? I've been looking for a method to do
that for a long time, but I have found no complete answer.

Ed Morton

May 7, 2008, 11:37:01 AM

On 5/7/2008 10:25 AM, pk wrote:
> On Wednesday 7 May 2008 17:20, Ed Morton wrote:
>
>
>>From the GNU awk user guide
>>(http://www.gnu.org/software/gawk/manual/gawk.html#Character-Lists):
>>
>>... For example, in the default C locale, `[a-dx-z]' is equivalent to
>>`[abcdxyz]'. Many locales sort characters in dictionary order, and in
>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>
>
> Is there a way to explicitly print out that information (or, better, the
> entire collating sequence in use)? I've been looking for a method to do
> that for long time, but I have found no complete answer.
>

I expect you could use the ord() and chr() functions described here:

http://www.gnu.org/software/gawk/manual/gawk.html#Ordinal-Functions

to do something like:

for (i=ord("a");i<=ord("z");i++) {
    print chr(i)
}
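
Note that ord() and chr() are not built into awk; they are library functions from the gawk manual. A rough stand-in, assuming single-byte characters and the C locale, can be sketched with a lookup table built from sprintf("%c"). The name code_range and its two arguments are invented for this example.

```shell
code_range() {
    # prints the characters whose numeric codes lie between those of $1 and $2
    LC_ALL=C awk -v from="$1" -v to="$2" 'BEGIN {
        for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i   # char -> code table
        for (i = ord[from]; i <= ord[to]; i++) printf "%c", i
        print ""
    }'
}
code_range a e
```

A loop of this kind walks numeric code values, so it lists characters in code order rather than in the locale's collating order.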

Regards,

Ed.

pk

May 7, 2008, 12:04:24 PM
On Wednesday 7 May 2008 17:37, Ed Morton wrote:

>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>
>>
>> Is there a way to explicitly print out that information (or, better, the
>> entire collating sequence in use)? I've been looking for a method to do
>> that for long time, but I have found no complete answer.
>>
>
> I expect you could use the ord() and chr() functions described here:
>
> http://www.gnu.org/software/gawk/manual/gawk.html#Ordinal-Functions
>
> to do something like:
>
> for (i=ord("a");i<=ord("z");i++) {
> print chr(i)
> }

Take this scenario:

$ cat file
100e3
$ echo $LC_ALL
en_GB
$ awk '/[A-Z]/' file
100e3
$ LC_ALL=C awk '/[A-Z]/' file
$

(or, perhaps more elegant,
$ awk '/[[:upper:]]/' file
$ )

It seems that the functions you point to use the mere numeric character
values and don't take the locale into account. Using the proposed code for the
ord() and chr() functions, a loop to print the sequence from "A" to "Z"
always yields

A
B
C
...
Z

under many different locales, even en_GB which, as seen above, clearly
expands [A-Z] differently.

In fact, my question is not awk-specific, and is generically about how
collating sequences affect the interpretation of bracket expressions, and
thus influence how programs like grep, sort, awk, etc. work.
What I'm looking for is a command which, ideally, behaves as follows:

$ LC_ALL=C <command> '[A-C]'
ABC

$ LC_ALL=en_GB <command> '[A-C]'
AaBbCc # or whatever it's expanded to

and, ideally, also something like

$ <command> -a
# prints the entire current collating sequence, according to current locale

Of course, I don't know whether such a command exists, or even whether it's
possible to gather that information in some other way.

I'm setting the followup for this discussion to comp.unix.shell, since this
is not awk-specific anymore.

Hermann Peifer

May 7, 2008, 1:50:11 PM
Ed Morton wrote:
>
> On 5/6/2008 6:16 AM, Hermann Peifer wrote:
>> Hi,
>>
>>
>> I am somewhat puzzled by match() results for numbers in scientific
>> notation. See below.
>>
>> $ cat testdata
>> 100
>> 100e-3
>> 100E3
>>
>> I am wondering what kind of uppercase character is matched in record
>> 2:
>>
>> $ gawk '{print $1,match($1,/[A-Z]/)}' testdata
>
> There may not be an uppercase character matching. [A-Z] represents the list of
> characters in between the character A and the character Z in your locale - that
> does NOT mean it has to be upper case characters. For example, your locale might
> consider characters ordered as:
>
> aAbBcCdDeEfF....zZ

You are right: in my locale en_GB.UTF-8, [A-Z] matches all upper and
lower case letters (including accented letters), except lower case a. In
turn, [a-z] matches all upper/lower case letters, except upper case Z.


>
> so "e" would sit between "A" and "Z". That's why you should use character
> classes instead of specific ranges, e.g.:
>
> gawk '{print $1,match($1,/[[:upper:]]/)}' testdata
>

I will do so. Thanks, Hermann

Ed Morton

May 7, 2008, 2:03:32 PM

As the other part of this thread continues over at comp.unix.shell, I came up
with this script which you can run to see which characters are contained in
which character lists (REs actually):

$ cat rechars.awk
# Prints every character that matches a given RE.
# Originally created to print all characters in a given character list.
#
# usage:
# LC_ALL=C awk -v re="[a-z]" -f rechars.awk
# LC_ALL=en_GB awk -v re="[a-z]" -f rechars.awk
# awk -v re="[[:upper:]]" -f rechars.awk
#
BEGIN{
    for (i=0;i<=1000;i++)
        chars[sprintf("%c",i)]
    for (c in chars)
        if (c ~ re)
            s=s c
    print re"="s
}
$ awk -v re="[A-Z]" -f rechars.awk
[A-Z]=ABCDEFGHIJKLMNOPQRSTUVWXYZ

so you can play with that if you're curious about which characters match in
specific locales...

Ed.

schuler...@googlemail.com

May 7, 2008, 2:16:35 PM

Why is LC_COLLATE instead of LC_ALL in the above case not enough?
--
Steffen

pk

May 7, 2008, 2:27:53 PM
On Wednesday 7 May 2008 20:16, schuler...@googlemail.com wrote:

> Why is LC_COLLATE instead of LC_ALL in the above case not enough?

This is another thing that confuses me. I tried that during my experiments,
but didn't want to mention it to avoid putting too many irons in the fire.
Look:

$ cat file1
e
$ LC_COLLATE=C awk '/[A-Z]/' file1
e
$ LC_ALL=C awk '/[A-Z]/' file1
$
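
One possible explanation, offered as an assumption rather than a diagnosis: if LC_ALL is already set in the environment (as the earlier `echo $LC_ALL` output suggests), it takes precedence over every individual LC_* variable, so setting LC_COLLATE alone changes nothing. The sketch below uses the standard locale utility to show which LC_COLLATE value is in effect; the function name collate_under_lc_all is made up.

```shell
collate_under_lc_all() {
    # $1: value for LC_ALL, $2: value for LC_COLLATE
    # prints the LC_COLLATE line that locale(1) reports for that combination
    env LC_ALL="$1" LC_COLLATE="$2" locale | grep '^LC_COLLATE='
}
collate_under_lc_all C en_GB.UTF-8
```

On a system with the POSIX locale utility this should report LC_COLLATE="C"; the double quotes mark a value that is implied by another variable rather than set directly.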

Hermann Peifer

May 7, 2008, 2:39:44 PM

Thanks for this one. I do however not think that sprintf("%c",i) makes
much sense with i > 255. As far as I can see: for values between 0 and
255, printf prints the (control) characters from ASCII and ISO-8859-1,
but with i > 255, the same series of characters is printed again, see here:

$ awk 'BEGIN{printf "%c\n",65}'
A
$ awk 'BEGIN{printf "%c\n",65+256}'
A
$ awk 'BEGIN{printf "%c\n",65+256+256}'
A
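
Given the wrap-around shown above, one way to avoid duplicate candidates is to keep the loop within the single-byte codes 1..255 and append matches in code order. This is only a sketch for the C locale, not a replacement for the posted script; the function name rechars is borrowed from the script's file name.

```shell
rechars() {
    # $1 is a regex; prints every single-byte character (codes 1..255) that matches it
    LC_ALL=C awk -v re="$1" 'BEGIN {
        for (i = 1; i <= 255; i++) {
            c = sprintf("%c", i)
            if (c ~ re) s = s c
        }
        print re "=" s
    }'
}
rechars '[A-C]'
```

Because matches are appended while the loop counts upwards, the output order no longer depends on the traversal order of an awk array.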

Hermann

schuler...@googlemail.com

May 7, 2008, 4:16:24 PM
On May 7, 5:25 pm, pk <p...@pk.invalid> wrote:
[snip]

> Is there a way to explicitly print out that information (or, better, the
> entire collating sequence in use)? I've been looking for a method to do
> that for long time, but I have found no complete answer.
[snip]

Here is the collating sequence of the alphabetic ([:alpha:]) letters,
obtained with gawk and sort for the locales C, en_US, and en_US.UTF-8
(using Fedora Core 7 and the current gawk from CVS):

$ cat collating_chars.sh
#!/bin/bash
# collating sequence of the alphabetic letters
# in an externally specified locale
gawk 'BEGIN {
    for (i = 1; i <= 32767; ++i) {
        c = sprintf("%c", i)
        if (c ~ /[[:alpha:]]/)
            a[c] = 1
    }
    for (c in a)
        print c
}' |
sort |
gawk '{
    printf "%s", $0
}
END {
    print ""
}'
$ LANG=C LC_ALL=C ./collating_chars.sh
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
$ LANG=en_US LC_ALL=en_US ./collating_chars.sh
µaAáÁàÀâÂåÅäÄãêæÆbBcCçÇdDðÐeEéÉèÈêÊëËfFgGhHiIíÍìÌîÎïÏjJkKlLmMnNñÑoOóÓòÒôÔöÖõÕøØºpPqQrRsSßtTuUúÚùÙûÛüÜvVwWxXyYýÝÿzZþÞ
$ LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 ./collating_chars.sh
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
$
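
The ordering in these listings comes from sort, which applies the collation rules of the current locale. A reduced sketch of that step, pinned to the C locale so the result does not depend on which locales are installed (the function name c_sorted is invented here):

```shell
c_sorted() {
    # sorts stdin by byte value (C locale) and joins the lines into one
    LC_ALL=C sort | tr -d '\n'
    echo
}
printf 'b\nA\na\nB\n' | c_sorted
```

In the C locale the uppercase letters come first (ABab); rerunning the same pipeline with a different LC_ALL is a quick way to compare orderings, though the result then depends on the system's locale data.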


--
Steffen

Ed Morton

May 7, 2008, 10:48:37 PM

I just picked a big number since I don't know what the biggest number is for all
charsets, then created the array so the characters could wrap around with no
impact to the output.

Ed.

Ed Morton

May 7, 2008, 10:49:51 PM

It may be enough, but LC_ALL works for this and is less typing.

Ed.


pk

May 8, 2008, 5:25:06 AM

This actually extends Ed's original idea, adding sort ordering (a useful
thing, thanks).
What still stumps me is that in my locale (en_GB.utf8) the
expressions '[a-z]', '[[:lower:]]', and '[[:alpha:]]' all DO match
lowercase accented characters; nonetheless, using either of them with your
script does not show accented characters. The same happens with en_US.utf8,
as can be seen in your output. I used gawk-3.1.6a from CVS.

Hermann Peifer

May 8, 2008, 9:58:39 AM
On May 8, 11:25 am, pk <p...@pk.invalid> wrote:
> What still stumps me is that in my locale (en_GB.utf8) the
> expressions '[a-z]', '[[:lower:]]', and '[[:alpha:]]' all DO match
> lowercase accented characters; nonetheless, using either of them with your
> script does not show accented characters.

The collating_chars.sh script uses sprintf("%c", i).

This function works fine for: i = 0 ... 127

For locales with a codeset of ISO-8859-something, it will also work
fine for i = 128 ... 255.

Your locale's codeset is utf8, so '[a-z]', '[[:lower:]]', and
'[[:alpha:]]' DO include accented characters. The script is simply not
working properly, because of sprintf() limitations. At least this is
how I understand the issue.
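
The limitation can also be seen from the other direction. In UTF-8 an accented letter such as U+00E9 (e with acute) is stored as two bytes, so a %c that emits one byte per value cannot produce it. A minimal sketch, assuming the bytes are given as octal escapes so the example does not depend on the terminal encoding; utf8_bytes is a made-up name.

```shell
utf8_bytes() {
    # counts the bytes of each input line by forcing the C locale
    LC_ALL=C awk '{ print length($0) }'
}
printf '\303\251\n' | utf8_bytes   # the UTF-8 encoding of U+00E9
```

The count is 2 for the accented letter and 1 for a plain ASCII letter, which is consistent with the script finding only unaccented letters in the UTF-8 locale.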

Hermann


pk

May 8, 2008, 10:22:59 AM

Yes, this is also my understanding. Non-utf8 locales work fine because they
have useful characters in the 0-255 range, while utf8 uses a different
encoding and does not necessarily have useful characters in that range.
Moreover, the tests show that characters repeat themselves every 256 (or,
maybe, it is %c that is limited to these values only).

This is why I'm looking for a solution that can gather the information by
making effective use of the sources where localized features are defined.

For example, under Linux it seems that each locale is defined by a series of
files under /usr/share/i18n/locales (on my system at least).
A quick look at those files reveals that they are where the various
LC_COLLATE, LC_NUMERIC etc. values are defined (I must say that I'm still a
bit puzzled about the syntax used, but that can be solved by carefully
reading the docs and the standard, I hope).
As an example, the file /usr/share/i18n/locales/en_GB (could not find a
en_GB.utf8 file) has inside it:

...
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE
...

The file /usr/share/i18n/locales/iso14651_t1 includes the file
iso14651_t1_common, which finally has (some comments removed):

LC_COLLATE

# Déclaration des systèmes d'écriture / Declaration of scripts
script <SPECIAL>
script <LATIN>
script <TIFINAGH>
script <ARABINT>
script <ARABFOR>
script <HEBREU>
script <GREC>
script <CYRIL>
script <ARMENIAN>

# Déclaration des symboles internes / Declaration of internal symbols
#
# SYMB N° Expl.
#
collating-symbol <RES-1>
#
# <ARABINT>/<ARABFOR>
#
#
collating-symbol <ANO> # 2 normal --> voir/see <MIN>
collating-symbol <AIS> # 3 isol.
collating-symbol <AFI> # 4 final
collating-symbol <AII> # 5 initial
collating-symbol <AME> # 6 medial/m<e'>dian
...[snip]...
# Ordre des symboles internes / Order of internal symbols
#
# SYMB. N°
#
<RES-1>
...[snip]...
order_start <SPECIAL>;forward;backward;forward;forward,position
#
# Tout caractère non précisément défini sera considéré comme caractère
# spécial et considéré uniquement au dernier niveau.
#
# Any character not precisely specified will be considered as a special
# character and considered only at the last level.
# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
#
# SYMB. N° GLY
#
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _
<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>
<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)
<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>
<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -
<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,
<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;
<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :
<U0021> IGNORE;IGNORE;IGNORE;<U0021> # 41 !
<U00A1> IGNORE;IGNORE;IGNORE;<U00A1> # 42 ¡
<U003F> IGNORE;IGNORE;IGNORE;<U003F> # 43 ?
<U00BF> IGNORE;IGNORE;IGNORE;<U00BF> # 44 ¿
<U002F> IGNORE;IGNORE;IGNORE;<U002F> # 45 /
...
...etc.etc.

As far as I understand, these files are read by the commands "locale-gen"
and "localedef" and used to generate a binary form of locale information,
located (on my system) under /usr/lib/locale, which is what is used by the
various locale-sensitive programs (perhaps through libc) at runtime.

Either way, my point is that, since the localization info is defined
somewhere, there should be a way to extract and display that information.
The command "locale", with its various options, seems unable to provide the
kind of information I'm interested in (unless I overlooked something, which
may entirely be possible).

$ locale -c LC_COLLATE -k
LC_COLLATE
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="UTF-8"

To anybody who might reply: if you feel that the discussion is OT on
comp.lang.awk, feel free to reply on comp.unix.shell or whatever group you
deem appropriate. Thanks.

Hermann Peifer

May 8, 2008, 11:46:54 AM
On May 8, 4:22 pm, pk <p...@pk.invalid> wrote:
>
> For example, under linux it seems that each locale is defined by a series of
> files under /usr/share/i18n/locales (on my system at least).
> A quick look at those files reveals that they are where the various
> LC_COLLATE, LC_NUMERIC etc. values are defined (I must say that I'm still a
> bit puzzled about the syntax used, but that can be solved by carefully
> reading the docs and the standard, I hope).
> As an example, the file /usr/share/i18n/locales/en_GB (could not find a
> en_GB.utf8 file) has inside it:
>

My /usr/share/i18n/locales/en_GB has also inside it:

LC_CTYPE
copy "i18n"

i18n in turn defines the character classes: upper, lower, etc., each
with a list of the Unicode version 5.0.0 code points that are
included in that character class.

Hermann

pk

May 8, 2008, 12:11:28 PM
On Thursday 8 May 2008 17:46, Hermann Peifer wrote:

> My /usr/share/i18n/locales/en_GB has also inside it:
>
> LC_CTYPE
> copy "i18n"

Yes, mine too. I just extracted a sample section.

> i18n in return defines the character classes: upper, lower, etc. Each
> one with a list of  Unicode version 5.0.0 code points, that are
> included in a given character class.

However, that file is included by many other files in that directory
(notably, not by POSIX).

$ grep -l 'copy "i18n"' /usr/share/i18n/locales/* | wc -l
99

Does that mean that all these 99 locales have all the same characters in
their upper, alpha, etc. classes?

And, even if it is so, that helps only for POSIX character classes; how does
one answer the question "what characters match, eg, '[A-Z]' under a given
locale?". I think the answer to this lies in the understanding of
LC_COLLATE.

I guess I'll try reading the standard documentation for localization when I
have more time. It scares me a little bit because its language is not (to
me at least) very simple or clear (unlike other parts of the standard);
however, I think it's an obligatory step to understand the whole thing
better.

Thanks for your persisting interest in this discussion!

Hermann Peifer

May 8, 2008, 12:45:28 PM
On May 8, 6:11 pm, pk <p...@pk.invalid> wrote:
> On Thursday 8 May 2008 17:46, Hermann Peifer wrote:
>
> > My /usr/share/i18n/locales/en_GB has also inside it:
>
> > LC_CTYPE
> > copy "i18n"
>
> Yes, mine too. I just extrapolated a sample section.
>
> > i18n in return defines the character classes: upper, lower, etc. Each
> > one with a list of  Unicode version 5.0.0 code points, that are
> > included in a given character class.
>
> However, that file is included by many other files in that directory
> (notably, not by POSIX).
>
> $ grep -l 'copy "i18n"' /usr/share/i18n/locales/* | wc -l
> 99
>
> Does that mean that all these 99 locales have all the same characters in
> their upper, alpha, etc. classes?
>

I would think so. All these 99 locales are Unicode-aware, so why
should they differ in their understanding of what is an upper case or
lower case character?

This is defined in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
(and related files in this directory).

Code point U+0041 is e.g. defined as:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

Property value "Lu" stands for upper case letter and all Unicode-aware
locales will observe these properties (e.g. via the character class
definitions in file i18n).
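
For readers who want to pull these properties out themselves, here is a sketch that splits a UnicodeData.txt record on semicolons. It relies on the file's field layout (field 1 is the code point, field 3 the general category, field 14 the simple lowercase mapping); the function name unicode_props is invented for the example.

```shell
unicode_props() {
    # field 1: code point, field 3: general category, field 14: simple lowercase mapping
    awk -F';' '{ print "U+" $1, $3, "lower=U+" $14 }'
}
echo '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;' | unicode_props
```

For the record quoted above this prints U+0041 Lu lower=U+0061, i.e. an uppercase letter whose lowercase partner is U+0061.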


> And, even if it is so, that helps only for POSIX character classes; how does
> one answer the question "what characters match, eg, '[A-Z]' under a given
> locale?". I think the answer to this lies in the understanding of
> LC_COLLATE.
>

I guess you are right but I don't have a clear idea myself. I will
also have to read more in the documentation.

Hermann

Hermann Peifer

May 8, 2008, 1:21:58 PM

Big numbers just don't work as expected. As also confirmed by pk in this
thread: printf "%c", i basically prints the character of i%256.

Hermann

Janis Papanagnou

May 8, 2008, 4:29:32 PM
pk wrote:
>
> And, even if it is so, that helps only for POSIX character classes; how does
> one answer the question "what characters match, eg, '[A-Z]' under a given
> locale?".

Yes. Why does /[A-C]/ (in locale en_GB and de_DE) match the five characters
65 A
66 B
67 C
98 b
99 c

> I think the answer to this lies in the understanding of
> LC_COLLATE.

Meanwhile, I fear, that there's either a bug or a fundamental flaw in the
conception of the locales (at least in conjunction with regexp character
ranges). The above result doesn't make the least sense to me.

Janis

Hermann Peifer

May 8, 2008, 4:49:38 PM
Janis Papanagnou wrote:
> pk wrote:
>>
>> And, even if it is so, that helps only for POSIX character classes;
>> how does
>> one answer the question "what characters match, eg, '[A-Z]' under a given
>> locale?".
>
> Yes. Why does /[A-C]/ (in locale en_GB and de_DE) match the five characters
> 65 A
> 66 B
> 67 C
> 98 b
> 99 c
>

What I know from testing with the en_GB.UTF-8 locale: the order of
characters seems to be aAbBcCdD...

So [A-C] matches the 5 mentioned characters, but not a lowercase a.
Accented characters in this range are also matched.
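
As a toy model of this reading (an assumption about how the range behaves, not a description of the actual regex implementation), a range can be treated as the slice of a flat collating sequence between its two end points. The sequence string and the name range_in_sequence below are made up for illustration.

```shell
range_in_sequence() {
    # $1: an assumed collating sequence, $2 and $3: the range end points
    awk -v seq="$1" -v from="$2" -v to="$3" 'BEGIN {
        start = index(seq, from)
        stop = index(seq, to)
        print substr(seq, start, stop - start + 1)
    }'
}
range_in_sequence aAbBcCdD A C
```

With the sequence aAbBcCdD, the slice from A to C is AbBcC: the five characters from the question, with lowercase a left out because it sorts before A.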

Hermann

Steffen Schuler

May 9, 2008, 2:51:38 AM
schuler...@googlemail.com wrote:
> On May 7, 5:25 pm, pk <p...@pk.invalid> wrote:
> [snip]
>> Is there a way to explicitly print out that information (or, better, the
>> entire collating sequence in use)? I've been looking for a method to do
>> that for long time, but I have found no complete answer.
> [snip]
>
> Here the collating sequence of the alphabetical ([:alpha:]) letters
> using
> gawk and sort for the locales C, en_US, and en_US.UTF-8
> (using Fedora Core 7 and the current gawk from CVS):
[snip]

Perhaps the result of the GLIBC functions iswalpha and wprintf together
with GNU sort explains better the collating of the alphabetic characters
for the 3 locales C, en_US, en_US.utf8 (my current system: Debian
GNU/Linux testing (lenny)):

$ cat collate.c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>   /* declares iswalpha() */
#include <stdlib.h>

int
main(int argc, char **argv)
{
    wint_t i;

    if (argc != 2) {
        fprintf(stderr, "usage: ./collate MAX_CHAR_CODE\n");
        exit(1);
    }

    /* No setlocale() call here, so iswalpha() uses the "C" locale. */
    for (i = 1; i <= atoi(argv[1]); ++i)
        if (iswalpha(i))
            wprintf(L"%c\n", i);
    return 0;
}
$ cc -o collate collate.c
$ export LC_ALL=C; ./collate 255 | sort | tr -d '\n'; echo
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
$ export LC_ALL=en_US; ./collate 255 | sort | tr -d '\n'; echo
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
$ # ^^^^^^^...: only Latin letters
$ export LC_ALL=en_US.UTF-8; ./collate 65535 | sort | tr -d '\n'; echo
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
$


There seem to be some inconsistencies between gawk and glibc locale
usage. Especially in the C-code sample only Latin letters are listed for
LC_ALL=en_US contrary to the similar gawk sample.

Besides, the collating order may be different from the order of the
character codes --- see:

http://tinyurl.com/6gdraw

I'm really interested whether there are any further locale bugs in gawk
(for trying to fix them).
Perhaps I will have more time for the research this weekend.

--
Steffen

Hermann Peifer

May 9, 2008, 3:44:54 AM
Janis Papanagnou wrote:
> pk wrote:
>>
>> And, even if it is so, that helps only for POSIX character classes;
>> how does
>> one answer the question "what characters match, eg, '[A-Z]' under a given
>> locale?".
>
> Yes. Why does /[A-C]/ (in locale en_GB and de_DE) match the five characters
> 65 A
> 66 B
> 67 C
> 98 b
> 99 c
>

I think the *.src files here have some nice human-readable definitions
about how character ranges like [A-C] are expanded under a given locale.

http://unicode.org/cldr/data/common/posix/

According to the LC_COLLATE section in de_DE.UTF-8.src, [A-C] is
expanded to the below characters, in the given order (sorry for the long
list).

Hermann

...
<seven>
<eight>
<nine>
<a>
<FULLWIDTH_LATIN_SMALL_LETTER_A>
--- start of range [A-C] ---
<A>
<FULLWIDTH_LATIN_CAPITAL_LETTER_A>
<MODIFIER_LETTER_SMALL_A>
<FEMININE_ORDINAL_INDICATOR>
<LATIN_SUBSCRIPT_SMALL_LETTER_A>
<MODIFIER_LETTER_CAPITAL_A>
<LATIN_SMALL_LETTER_A_WITH_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_BREVE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_TILDE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_TILDE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_HOOK_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_HOOK_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_TILDE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_TILDE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_HOOK_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_HOOK_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_CARON>
<LATIN_CAPITAL_LETTER_A_WITH_CARON>
<LATIN_SMALL_LETTER_A_WITH_RING_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE>
<ANGSTROM_SIGN>
<LATIN_SMALL_LETTER_A_WITH_RING_ABOVE_AND_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE_AND_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_DIAERESIS>
<LATIN_CAPITAL_LETTER_A_WITH_DIAERESIS>
<LATIN_SMALL_LETTER_A_WITH_DIAERESIS_AND_MACRON>
<LATIN_CAPITAL_LETTER_A_WITH_DIAERESIS_AND_MACRON>
<LATIN_SMALL_LETTER_A_WITH_TILDE>
<LATIN_CAPITAL_LETTER_A_WITH_TILDE>
<LATIN_SMALL_LETTER_A_WITH_DOT_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_DOT_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_DOT_ABOVE_AND_MACRON>
<LATIN_CAPITAL_LETTER_A_WITH_DOT_ABOVE_AND_MACRON>
<LATIN_SMALL_LETTER_A_WITH_OGONEK>
<LATIN_CAPITAL_LETTER_A_WITH_OGONEK>
<LATIN_SMALL_LETTER_A_WITH_MACRON>
<LATIN_CAPITAL_LETTER_A_WITH_MACRON>
<LATIN_SMALL_LETTER_A_WITH_HOOK_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_HOOK_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_DOUBLE_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_DOUBLE_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_INVERTED_BREVE>
<LATIN_CAPITAL_LETTER_A_WITH_INVERTED_BREVE>
<LATIN_SMALL_LETTER_A_WITH_DOT_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_DOT_BELOW>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_DOT_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_DOT_BELOW>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_DOT_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_DOT_BELOW>
<LATIN_SMALL_LETTER_A_WITH_RING_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_RING_BELOW>
<MODIFIER_LETTER_CAPITAL_AE>
<LATIN_SMALL_LETTER_AE>
<LATIN_CAPITAL_LETTER_AE>
<LATIN_SMALL_LETTER_AE_WITH_ACUTE>
<LATIN_CAPITAL_LETTER_AE_WITH_ACUTE>
<LATIN_SMALL_LETTER_AE_WITH_MACRON>
<LATIN_CAPITAL_LETTER_AE_WITH_MACRON>
<LATIN_SMALL_LETTER_A_WITH_RIGHT_HALF_RING>
<LATIN_LETTER_SMALL_CAPITAL_A>
<LATIN_SMALL_LETTER_A_WITH_STROKE>
<LATIN_CAPITAL_LETTER_A_WITH_STROKE>
<LATIN_SMALL_LETTER_A_WITH_RETROFLEX_HOOK>
<LATIN_LETTER_SMALL_CAPITAL_AE>
<LATIN_SMALL_LETTER_TURNED_AE>
<MODIFIER_LETTER_SMALL_TURNED_AE>
<LATIN_SMALL_LETTER_TURNED_A>
<MODIFIER_LETTER_SMALL_TURNED_A>
<LATIN_SMALL_LETTER_ALPHA>
<MODIFIER_LETTER_SMALL_ALPHA>
<LATIN_SMALL_LETTER_ALPHA_WITH_RETROFLEX_HOOK>
<LATIN_SMALL_LETTER_TURNED_ALPHA>
<MODIFIER_LETTER_SMALL_TURNED_ALPHA>
<b>
<FULLWIDTH_LATIN_SMALL_LETTER_B>
<B>
<FULLWIDTH_LATIN_CAPITAL_LETTER_B>
<MODIFIER_LETTER_SMALL_B>
<MODIFIER_LETTER_CAPITAL_B>
<LATIN_SMALL_LETTER_B_WITH_DOT_ABOVE>
<LATIN_CAPITAL_LETTER_B_WITH_DOT_ABOVE>
<LATIN_SMALL_LETTER_B_WITH_DOT_BELOW>
<LATIN_CAPITAL_LETTER_B_WITH_DOT_BELOW>
<LATIN_SMALL_LETTER_B_WITH_LINE_BELOW>
<LATIN_CAPITAL_LETTER_B_WITH_LINE_BELOW>
<LATIN_LETTER_SMALL_CAPITAL_B>
<LATIN_SMALL_LETTER_B_WITH_STROKE>
<LATIN_CAPITAL_LETTER_B_WITH_STROKE>
<MODIFIER_LETTER_CAPITAL_BARRED_B>
<LATIN_LETTER_SMALL_CAPITAL_BARRED_B>
<LATIN_SMALL_LETTER_B_WITH_MIDDLE_TILDE>
<LATIN_SMALL_LETTER_B_WITH_PALATAL_HOOK>
<LATIN_SMALL_LETTER_B_WITH_HOOK>
<LATIN_CAPITAL_LETTER_B_WITH_HOOK>
<LATIN_SMALL_LETTER_B_WITH_TOPBAR>
<LATIN_CAPITAL_LETTER_B_WITH_TOPBAR>
<c>
<FULLWIDTH_LATIN_SMALL_LETTER_C>
<C>
--- End of range [A-C] --
<FULLWIDTH_LATIN_CAPITAL_LETTER_C>
<MODIFIER_LETTER_SMALL_C>
<LATIN_SMALL_LETTER_C_WITH_ACUTE>
<LATIN_CAPITAL_LETTER_C_WITH_ACUTE>
...

pk

unread,
May 9, 2008, 4:24:00 AM5/9/08
to
On Friday 9 May 2008 09:44, Hermann Peifer wrote:

> I think the *.src files here have some nice human-readable definitions
> about how character ranges like [A-C] are expanded under a given locale.
>
> http://unicode.org/cldr/data/common/posix/
>
> According to the LC_COLLATE section in de_DE.UTF-8.src, [A-C] is
> expanded to the below characters, in the given order (sorry for the long
> list).

Yeah, thanks for the link. It seems that reading those files requires the
same syntax knowledge that is required to read /usr/share/i18n files, so,
once I get that one, I'll have more sources to read :-)

pk

unread,
May 9, 2008, 4:32:37 AM5/9/08
to

Yes, I noticed that, but I could not explain it.

> Besides, the collating order may be different from the order of the
> character codes --- see:
>
> http://tinyurl.com/6gdraw

This is something that can immediately be noticed by looking at the en_US
output above.

> I'm really interested whether there are any further locale bugs in gawk
> (for trying to fix them).
> Perhaps I have more time for the research at the current weekend.

I'm not sure I can follow you on this, mostly due to my lack of in-depth
libc/wchar knowledge. However, I'll keep this program and come back to it
when my understanding of the subject is better.

Thanks!

Janis

unread,
May 9, 2008, 5:08:34 AM5/9/08
to
On 9 Mai, 09:44, Hermann Peifer <pei...@gmx.eu> wrote:
> Janis Papanagnou wrote:
> > pk wrote:
>
> >> And, even if it is so, that helps only for POSIX character classes;
> >> how does
> >> one answer the question "what characters match, eg, '[A-Z]' under a given
> >> locale?".
>
> > Yes. Why does /[A-C]/ (in locale en_GB and de_DE) match the five characters
> > 65 A
> > 66 B
> > 67 C
> > 98 b
> > 99 c
>
> I think the *.src files here have some nice human-readable definitions
> about how character ranges like [A-C] are expanded under a given locale.
>
> http://unicode.org/cldr/data/common/posix/
>
> According to the LC_COLLATE section in de_DE.UTF-8.src, [A-C] is
> expanded to the below characters, in the given order (sorry for the long
> list).

Thanks. I understand the technical reason why the results
are created the way we see them; but I think the implicit(!)
and well hidden semantics of an expression like /[A-Z]/ are
not very useful, to say the least.

I suppose it's also disputable whether character classes like
[[:alpha:]] shall match non-English letters in a en_GB locale
or e.g. accent diacritical characters in a de_DE locale. What
is the intended rationale with locales in such cases; shall
they cover multilingual data? Shall [[:lower:]] in a de_DE
locale match a (French) á but not a (Greek) small omega (if
the latter is not in the codepage)?

Janis

>
> Hermann
>
> ...[snip list]

Hermann Peifer

unread,
May 10, 2008, 4:58:52 AM5/10/08
to
Janis wrote:
>
> Thanks. I understand the technical reason why the results
> are created the way we see them; but I think the implicit(!)
> and well hidden semantics of an expression like /[A-Z]/ are
> not very useful, to say the least.
>
> I suppose it's also disputable whether character classes like
> [[:alpha:]] shall match non-English letters in a en_GB locale
> or e.g. accent diacritical characters in a de_DE locale. What
> is the intended rationale with locales in such cases; shall
> they cover multilingual data? Shall [[:lower:]] in a de_DE
> locale match a (French) á but not a (Greek) small omega (if
> the latter is not in the codepage)?
>

From a semantics point of view, it is also surprising that, when using
de_DE.UTF-8 locale, e.g. <LATIN_SMALL_LETTER_A_WITH_ACUTE> is included
in the range [A-C], but <LATIN_SMALL_LETTER_C_WITH_ACUTE> is not, as it
sorts after uppercase C.

In summary: using character ranges in combination with a non-POSIX
locale is a good recipe for surprising results. To a somewhat lesser
degree, this is also true for character classes.

Hermann

pk

unread,
May 10, 2008, 5:52:19 AM5/10/08
to
On Saturday 10 May 2008 10:58, Hermann Peifer wrote:

> From a semantics point of view, it is also surprising that, when using
> de_DE.UTF-8 locale, e.g. <LATIN_SMALL_LETTER_A_WITH_ACUTE> is included
> in the range [A-C], but <LATIN_SMALL_LETTER_C_WITH_ACUTE> is not, as it
> sorts after uppercase C.

Well, considering the general strangeness of the thing, from a certain point
of view this makes sense to me. If [A-C] expands to AaBbCc, the "c" is not
matched because it comes after the "C" (which is the last character in the
bracket expression). On the other hand, if [A-C] expands to aAbBcC,
then "a" is not matched, again because it's outside the range.

> In summary: using character ranges in combination with a non-POSIX
> locale is a good recipe for surprising results. To a somewhat lesser
> degree, this is also true for character classes.

Agreed. That's why I'm trying to find a way to print character classes and
collating sequences, so that when someone comes asking "why doesn't sort
work correctly" or "why doesn't grep/awk/sed, etc. match what I expect",
they can be instructed to run the (non-existent yet) command or script to
see for themselves what the programs' idea of what they want to do is.

pk

unread,
May 10, 2008, 5:55:35 AM5/10/08
to
On Saturday 10 May 2008 11:52, pk wrote:

> Well, considering the general strangeness of the thing, from a certain
> point of view this makes sense to me. If [A-C] expands to AaBbCc, the "c"
> is not matched because it comes after the "C" (which is the last character
> in the bracket expression). On the other hand, if [A-C] expands to aAbBcC,
> then "a" is not matched, again because it's outside the range.

I should better have said:

"If the current collating sequence is AaBbCc"...and similarly for the second
part.
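
The two cases can be tried without installing any locale. This is only a sketch: range_members is a made-up helper name, and it treats a range as "everything between the two endpoints in a given collating sequence", which simplifies what a real regex library does with collation weights.

```shell
#!/bin/sh
# range_members SEQ LO HI (hypothetical helper):
# print the characters of the collating sequence SEQ from LO through HI,
# i.e. what a range [LO-HI] would cover if SEQ were the collation order.
range_members() {
    LC_ALL=C awk -v seq="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        start = index(seq, lo)      # position of the range start
        end   = index(seq, hi)      # position of the range end
        if (start && end >= start)
            print substr(seq, start, end - start + 1)
    }'
}

range_members AaBbCc A C   # prints AaBbC  ("c" sorts after "C", so it is out)
range_members aAbBcC A C   # prints AbBcC  ("a" sorts before "A", so it is out)
```

The second call gives the same five characters (A, B, C, b, c) that were reported earlier for /[A-C]/ in en_GB and de_DE.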

Janis Papanagnou

unread,
May 10, 2008, 9:07:10 AM5/10/08
to
Hermann Peifer wrote:
> Janis wrote:
>
>>
>> Thanks. I understand the technical reason why the results
>> are created the way we see them; but I think the implicit(!)
>> and well hidden semantics of an expression like /[A-Z]/ are
>> not very useful, to say the least.
>>
>> I suppose it's also disputable whether character classes like
>> [[:alpha:]] shall match non-English letters in a en_GB locale
>> or e.g. accent diacritical characters in a de_DE locale. What
>> is the intended rationale with locales in such cases; shall
>> they cover multilingual data? Shall [[:lower:]] in a de_DE
>> locale match a (French) á but not a (Greek) small omega (if
>> the latter is not in the codepage)?
>>
>
> From a semantics point of view, it is also surprising that, when using
> de_DE.UTF-8 locale, e.g. <LATIN_SMALL_LETTER_A_WITH_ACUTE> is included
> in the range [A-C], but <LATIN_SMALL_LETTER_C_WITH_ACUTE> is not, as it
> sorts after uppercase C.

Yes, it would be less surprising if [A-C] were taken to define the base
letters, with their diacritical variants included.

Does anybody have experience, in this context, with the equivalence class
syntax [=...=]? (I haven't used them yet, but there's a chance
that equivalence classes might provide some more natural solutions
with locales.)

Janis

Hermann Peifer

unread,
May 10, 2008, 2:10:19 PM5/10/08
to
pk wrote:
> On Saturday 10 May 2008 10:58, Hermann Peifer wrote:
>
> Agreed. That's why I'm trying to find a way to print character classes and
> collating sequences, so that when someone comes asking "why doesn't sort
> work correctly" or "why doesn't grep/awk/sed, etc. match what I expect",
> they can be instructed to run the (non-existent yet) command or script to
> see for themselves what the programs' idea of what they want to do is.
>

I combined some code from Ed's and Steffen's scripts and added
/usr/bin/printf in the middle part. Unlike bash's builtin printf or
gawk's (s)printf: /usr/bin/printf is able to convert Unicode code point
values into chars.

The script is far away from being smart, efficient, or anything like
that, but it seems to work with Unicode-aware locales.

$ LC_ALL=en_GB.UTF-8 ./collating_chars.sh [A-C]
AáÁàÀăắằẵẳặĂẮẰẴẲẶâấầẫẩậÂẤẦẪẨẬǎǍåǻÅǺäǟÄǞãÃȧǡȦǠąĄāĀảẢȁȀȃȂạẠḁḀẚªæǽǣÆǼǢbBḃḂḅḄḇḆɓƁcC

$ LC_ALL=da_DK.UTF-8 ./collating_chars.sh [A-C]
AaÁáÀàĂẮẰẴẲẶăắằẵẳặÂẤẦẪẨẬâấầẫẩậǍǎǺǻǞǟÃãȦǠȧǡĄąĀāẢảȀȁȂȃẠạḀḁẚªBbḂḃḄḅḆḇƁɓC

The example shows that in the da_DK.UTF-8 locale, the [A-C] range expands to
fewer characters than in en_GB (and most other) locales. The reason is
that A_RING and AE_LIGATURE characters sort after Z, according to Danish
sorting rules.

Hermann


$ cat collating_chars.sh
#!/bin/bash
# collate sequence of characters which belong
# to a given character range or class
# based on Ed's and Steffen's code
# seems to work for Unicode locales
#
# usage:
# collating_chars.sh [A-Z]
# collating_chars.sh [[:upper:]]

gawk 'BEGIN {

    for (i=0;i<=32767;i++) {

        num = sprintf("%X", i)
        l = length(num)

        # construct a format that is
        # understood by /usr/bin/printf

        if (i < 16)
            num = "\\\\x0" num
        else if (i < 128)
            num = "\\\\x" num
        else if (l == 2)
            num = "\\\\u00" num
        else if (l == 3)
            num = "\\\\u0" num
        else
            num = "\\\\u" num

        # exclude C1 control chars
        # as printf doesnt like them

        if (i < 128 || i > 159)
            print num
    }
}' |

# Print Unicode chars with /usr/bin/printf
while read num ; do /usr/bin/printf "$num\n" ; done | sort |

# Collate characters which are matched by the given range or class
gawk -v re=$1 '$1 ~ re { s = s $1 } END { print s }'
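
The zero-padding branches in the first stage could probably be collapsed by letting printf widths do the padding. A minimal sketch, assuming any POSIX awk; cp_escape is a hypothetical helper name, and it does not filter out the C1 control range (128 to 159) the way the script above does.

```shell
#!/bin/sh
# cp_escape N (hypothetical helper): print the escape sequence that
# /usr/bin/printf would need for code point N (given in decimal).
cp_escape() {
    awk -v i="$1" 'BEGIN {
        if (i < 128)
            printf "\\x%02X\n", i    # ASCII: two hex digits, zero-padded
        else
            printf "\\u%04X\n", i    # beyond ASCII: four hex digits, zero-padded
    }'
}

cp_escape 10     # prints \x0A
cp_escape 229    # prints \u00E5
cp_escape 7840   # prints \u1EA0
```

Note that this emits a single backslash, so a loop consuming it would have to use "read -r"; the script above doubles the backslashes instead because plain "read" strips one level.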

Hermann Peifer

unread,
May 10, 2008, 2:31:22 PM5/10/08
to
Hermann Peifer wrote:
>
> $ LC_ALL=en_GB.UTF-8 ./collating_chars.sh [A-C]
> AáÁàÀăắằẵẳặĂẮẰẴẲẶâấầẫẩậÂẤẦẪẨẬǎǍåǻÅǺäǟÄǞãÃȧǡȦǠąĄāĀảẢȁȀȃȂạẠḁḀẚªæǽǣÆǼǢbBḃḂḅḄḇḆɓƁcC
>
>
> $ LC_ALL=da_DK.UTF-8 ./collating_chars.sh [A-C]
> AaÁáÀàĂẮẰẴẲẶăắằẵẳặÂẤẦẪẨẬâấầẫẩậǍǎǺǻǞǟÃãȦǠȧǡĄąĀāẢảȀȁȂȃẠạḀḁẚªBbḂḃḄḅḆḇƁɓC
>
> The example shows that in the da_DK.UTF-8 locale, the [A-C] range expands to
> fewer characters than in en_GB (and most other) locales. The reason is
> that A_RING and AE_LIGATURE characters sort after Z, according to Danish
> sorting rules.
>

I just noticed: A umlaut characters are also not part of the Danish
[A-C] range. Furthermore, Danish sorts AaBbCc, whereas en_GB sorts
aAbBcC. These are just a few more reasons why character ranges are best
avoided when working in non-POSIX locales.

Hermann
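
For comparison, here is what the same kind of range covers in the POSIX locale. This is a reduced sketch, not the script from this thread: ascii_members is a made-up name, it uses plain awk, and it only walks the printable ASCII characters, so it says nothing about any other locale.

```shell
#!/bin/sh
# ascii_members RE (hypothetical helper): list the printable ASCII
# characters that match the bracket expression RE under LC_ALL=C.
ascii_members() {
    LC_ALL=C awk -v re="$1" 'BEGIN {
        for (i = 33; i < 127; i++) {
            c = sprintf("%c", i)     # code point -> character
            if (c ~ re)
                s = s c              # keep the characters that match
        }
        print s
    }'
}

ascii_members '[A-C]'   # prints ABC
ascii_members '[x-z]'   # prints xyz
```

With LC_ALL=C the range is just the run of byte values between the endpoints, which is why no lowercase or accented letters show up.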

Steffen Schuler

unread,
May 10, 2008, 3:56:00 PM5/10/08
to
> gawk -v re=$1 '$1 ~ re { s = s $1 } END { print s }'
             ^^
better use: re="$1"

Nice script. Works fine. I tested a lot with it and adapted it to my
personal suites.

--
Steffen

Hermann Peifer

unread,
May 10, 2008, 5:14:44 PM5/10/08
to
Steffen Schuler wrote:
>> # Collate characters which are matched by the given range or class
>> gawk -v re=$1 '$1 ~ re { s = s $1 } END { print s }'
> ^^
> better use: re="$1"

Thanks for the hint.

>
> Nice script. Works fine. I tested a lot with it and adapted it to my
> personal suites.
>

Thanks again. I am however really surprised that no serious programmer
has ever invented a much better tool to do the same thing (and more, and
faster). So much time has been invested in all these locale definitions
and nobody is interested in (providing) a tool that could easily show
what character ranges and classes exactly mean in a given locale?

Puzzled. Hermann

pk

unread,
May 11, 2008, 5:17:15 AM5/11/08
to
On Saturday 10 May 2008 20:10, Hermann Peifer wrote:

> I combined some code from Ed's and Steffen's scripts and added
> /usr/bin/printf in the middle part. Unlike bash's builtin printf or
> gawk's (s)printf: /usr/bin/printf is able to convert Unicode code point
> values into chars.
>
> The script is far away from being smart, efficient, or anything like
> that, but it seems to work with Unicode-aware locales.

>[snip]

Great job, really!

I'm still trying to put together something that reads the locale files
directly, but that involves reading libc documentation for wc/mb etc. +
posix docs for locales (probably some of that is not strictly needed, but
since it's something that I should have done anyway, I'm just seizing the
opportunity). This of course is not to say that your script is not
useful...on the contrary instead, it's really neat and smart.

A big thank you!

Cesar Rabak

unread,
May 11, 2008, 9:50:15 AM5/11/08
to
Hermann Peifer escreveu:
[snipped]

> Thanks again. I am however really surprised that no serious programmer
> has ever invented a much better tool to do the same thing (and more, and
> faster). So much time has been invested in all these locale definitions
> and nobody is interested in (providing) a tool that could easily show
> what character ranges and classes exactly mean in a given locale?
>

Hermann,

I'm afraid this wasn't seen as necessary, as the script would be a sort of
analysis of the pasta by dissolving it in vinegar to separate the egg from
the flour instead of reading the package's label.

Those sequences are specified, aren't they?

Hermann Peifer

unread,
May 11, 2008, 11:27:57 AM5/11/08
to

Oi Cesar,

It's true: the sequences are specified. But it might not be everyone's
hobby to read through dozens (hundreds) of locale definition source
files in order to find out what the exact details and differences of a
character range like [A-C] in various Unicode locales are.

Anyway: I see that there are at least 3 users of my script (Steffen, pk
and me). It looks like it was worth writing it.

;-) Hermann

Hermann Peifer

unread,
May 13, 2008, 6:41:09 AM5/13/08
to


For those who might be interested: here is another variation of the
script. Reading an additional local file is perhaps not the smartest
solution, but it guarantees that all relevant Unicode code points are
covered. (I promise this will be my last OT posting in this thread ;-)

Hermann

$ cat collate_rechars.sh
#!/bin/bash
# Collate a sorted list of chars that belong to a range or class
# Based on Ed's and Steffen's code, works only for Unicode locales
#
# Usage:
# LC_ALL=en_GB.UTF-8 collate_rechars.sh [A-Z]
# LC_ALL=da_DK.UTF-8 collate_rechars.sh [[:upper:]]
#
# For this script you need to have a local copy of
# http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

gawk -F";" '

# Construct a format for chars in ASCII range
#
NR < 129 { print "\\\\x" substr($1,3,2) ; next }

# Construct a format for other code points in UnicodeData.txt
# Exclude C1 control chars and some other code points
# as /usr/bin/printf reports errors for them
#
NR > 160 && $1 !~ /^(D800|DB7F|DB80|DBFF|DC00|DFFF)$/ {
    printf "\\\\U%08s\n", $1 }' UnicodeData.txt |

# Print chars with /usr/bin/printf and sort them
#
while read f ; do /usr/bin/printf "$f\n" ; done | sort |

# Collate chars for the given character range or class
#
gawk -v re="$1" '$1 ~ re { s = s $1 } END { print re "=" s }'

Cesar Rabak

unread,
May 16, 2008, 11:11:13 AM5/16/08
to
Hermann Peifer escreveu:

> Cesar Rabak wrote:
>> Hermann Peifer escreveu:
>> [snipped]
>>
>>> Thanks again. I am however really surprised that no serious
>>> programmer has ever invented a much better tool to do the same thing
>>> (and more, and faster). So much time has been invested in all these
>>> locale definitions and nobody is interested in (providing) a tool
>>> that could easily show what character ranges and classes exactly mean
>>> in a given locale?
>>>
>> Hermann,
>>
>> I'm afraid this wasn't seen as necessary, as the script would be a sort
>> of analysis of the pasta by dissolving it in vinegar to separate the egg
>> from the flour instead of reading the package's label.
>>
>> Those sequences are specified, aren't they?
>
> Oi Cesar,
>

Olá Hermann,

> It's true: the sequences are specified. But it might not be everyone's
> hobby to read through dozens (hundreds) of locale definition source
> files in order to find out what the exact details and differences of a
> character range like [A-C] in various Unicode locales are.

I do know what you mean!

>
> Anyway: I see that there are at least 3 users of my script (Steffen, pk
> and me). It looks like it was worth writing it.
>
> ;-) Hermann

Sure! :-D

Hermann Peifer

unread,
May 24, 2008, 12:25:58 PM5/24/08
to
pk wrote:
> On Wednesday 7 May 2008 20:16, schuler...@googlemail.com wrote:
>
>> Why is LC_COLLATE instead of LC_ALL in the above case not enough?
>
> This is another thing that confuses me. I tried that during my experiments,
> but didn't want to mention it to avoid putting too many irons in the fire.

I just found the footnote below in the GNU coreutils documentation.

Hermann

--- snip ---
Note that setting only LC_COLLATE has two problems. First, it is
ineffective if LC_ALL is also set. Second, it has undefined behavior if
LC_CTYPE (or LANG, if LC_CTYPE is unset) is set to an incompatible value.
--- snip ---

http://www.gnu.org/software/coreutils/manual/coreutils.html#fn-2
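
The first point of the footnote (LC_COLLATE is ineffective while LC_ALL is set) can be checked with sort. A small sketch; sort_demo is a hypothetical helper name, and en_US.UTF-8 is only an example value that does not even need to be installed, because LC_ALL=C takes precedence either way.

```shell
#!/bin/sh
# sort_demo (hypothetical helper): sort four letters with both variables set.
# LC_ALL overrides LC_COLLATE, so sort falls back to plain byte order.
sort_demo() {
    printf '%s\n' b A a B | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort
}

sort_demo   # prints A, B, a, b on separate lines (byte order, capitals first)
```

To see the effect of LC_COLLATE on its own, LC_ALL would have to be unset or empty in the environment first.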
