I am somewhat puzzled by match() results for numbers in scientific
notation. See below.
$ cat testdata
100
100e-3
100E3
I am wondering what kind of uppercase character is matched in record
2:
$ gawk '{print $1,match($1,/[A-Z]/)}' testdata
100 0
100e-3 4
100E3 4
I am also wondering about the non-matches in these cases:
$ gawk 'BEGIN{print match(100e-3,/[A-Z]/)}'
0
$ gawk 'BEGIN{print match(100E3,/[A-Z]/)}'
0
What is the logic behind the matches and non-matches?
I am using gawk 3.1.5 and LANG=en_US.UTF-8
Thanks in advance, Hermann
> Hi,
>
>
> I am somewhat puzzled with match() results for numbers in scientific
> notation. See below.
>
> $ cat testdata
> 100
> 100e-3
> 100E3
>
> I am wondering what kind of uppercase character is matched in record
> 2:
>
> $ gawk '{print $1,match($1,/[A-Z]/)}' testdata
> 100 0
> 100e-3 4
> 100E3 4
Try this with LC_ALL=C (always a good idea when working with
locale-sensitive data):
$ LC_ALL=C gawk '{print $1,match($1,/[A-Z]/)}' testdata
100 0
100e-3 0
100E3 4
> I am also wondering about the non-matches in these cases:
>
> $ gawk 'BEGIN{print match(100e-3,/[A-Z]/)}'
> 0
> $ gawk 'BEGIN{print match(100E3,/[A-Z]/)}'
> 0
Unlike before, here you are asking awk to do number-to-string conversion.
The problem is (I think) that the numbers lose the Es when they are
converted to strings:
$ gawk 'BEGIN{printf "%s\n", 100E3}'
100000
$ gawk 'BEGIN{printf "%s\n", 100e-3}'
0.1
Compare with this (which also shows the influence of the locale):
$ gawk 'BEGIN{print match("100E3",/[A-Z]/)}'
4
$ gawk 'BEGIN{print match("100e-3",/[A-Z]/)}'
4
$ LC_ALL=C gawk 'BEGIN{print match("100E3",/[A-Z]/)}'
4
$ LC_ALL=C gawk 'BEGIN{print match("100e-3",/[A-Z]/)}'
0
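A minimal sketch of the difference, assuming any POSIX awk and forcing LC_ALL=C so that the locale plays no part. The helper names match_const and match_field are made up for illustration. A numeric constant is converted to a string before match() sees it, while an input field keeps its original text:

```shell
# Hypothetical helpers, not part of the original commands.
# Numeric constant: awk converts 100E3 to the string "100000" first.
match_const() { LC_ALL=C awk 'BEGIN { print match(100E3, /E/) }'; }
# Input field: $1 still holds the text exactly as it was read.
match_field() { printf '%s\n' "$1" | LC_ALL=C awk '{ print match($1, /E/) }'; }

match_const          # prints 0
match_field 100E3    # prints 4
```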
Of course, I stand to be corrected for anything wrong I could have written.
--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.
Thanks. This seems to explain the matching logic.
Hermann
On 5/6/2008 6:16 AM, Hermann Peifer wrote:
> Hi,
>
>
> I am somewhat puzzled with match() results for numbers in scientific
> notation. See below.
>
> $ cat testdata
> 100
> 100e-3
> 100E3
>
> I am wondering what kind of uppercase character is matched in record
> 2:
>
> $ gawk '{print $1,match($1,/[A-Z]/)}' testdata
There may not be an uppercase character matching. [A-Z] represents the list of
characters in between the character A and the character Z in your locale - that
does NOT mean it has to be upper case characters. For example, your locale might
consider characters ordered as:
aAbBcCdDeEfF....zZ
so "e" would sit between "A" and "Z". That's why you should use character
classes instead of specific ranges, e.g.:
gawk '{print $1,match($1,/[[:upper:]]/)}' testdata
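As a small sketch of that suggestion (the function name first_upper is hypothetical, and the awk in use is assumed to support POSIX character classes), the class form should give the same answer whatever the collating order of the current locale is:

```shell
# Hypothetical helper: position of the first uppercase letter, 0 if none.
first_upper() {
    printf '%s\n' "$1" | awk '{ print match($1, /[[:upper:]]/) }'
}

first_upper 100e-3   # prints 0: "e" is not in [:upper:]
first_upper 100E3    # prints 4
```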
Regards,
Ed.
Can you provide some reference for that definition?
I thought that ranges like [A-Z] depend on the _coding_ of the
character set (IOW, on the code values of the characters), and
not depending on the locale.
So that in case you have a different code set than ISO Latin 1,
ASCII, or similar, e.g. like EBCDIC (where there may be other
characters spread in between the letter code positions) you'd
get unexpected results.
So, because of that, your suggestion below is valid anyway, but
I'd like to be sure about what you wrote if that is in fact true.
Janis
On 5/7/2008 9:59 AM, Janis wrote:
> On 7 Mai, 16:18, Ed Morton <mor...@lsupcaemnt.com> wrote:
>
>>On 5/6/2008 6:16 AM, Hermann Peifer wrote:
>>
>>
>>>Hi,
>>
>>>I am somewhat puzzled with match() results for numbers in scientific
>>>notation. See below.
>>
>>>$ cat testdata
>>>100
>>>100e-3
>>>100E3
>>
>>>I am wondering what kind of uppercase character is matched in record
>>>2:
>>
>>>$ gawk '{print $1,match($1,/[A-Z]/)}' testdata
>>
>>There may not be an uppercase character matching. [A-Z] represents the list of
>>characters in between the character A and the character Z in your locale
>
>
> Can you provide some reference for that definition?
From the GNU awk user guide
(http://www.gnu.org/software/gawk/manual/gawk.html#Character-Lists):
... For example, in the default C locale, `[a-dx-z]' is equivalent to
`[abcdxyz]'. Many locales sort characters in dictionary order, and in these
locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]'; instead it might
be equivalent to `[aBbCcDdxXyYz]', for example.
> I thought that ranges like [A-Z] depend on the _coding_ of the
> character set (IOW, on the code values of the characters), and
> not depending on the locale.
>
> So that in case you have a different code set than ISO Latin 1,
> ASCII, or similar, e.g. like EBCDIC (where there may be other
> characters spread in between the letter code positions) you'd
> get unexpected results.
>
> So, because of that, your suggestion below is valid anyway, but
> I'd like to be sure about what you wrote if that is in fact true.
Sounds to me like it's dependent on locale but if what you're describing above is
something different from locale, then you may be right. Whatever the "thingy" is
that causes the difference, using character classes is the right approach.
Ed.
> From the GNU awk user guide
> (http://www.gnu.org/software/gawk/manual/gawk.html#Character-Lists):
>
> ... For example, in the default C locale, `[a-dx-z]' is equivalent to
> `[abcdxyz]'. Many locales sort characters in dictionary order, and in
> these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
> instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
Is there a way to explicitly print out that information (or, better, the
entire collating sequence in use)? I've been looking for a method to do
that for a long time, but I have found no complete answer.
On 5/7/2008 10:25 AM, pk wrote:
> On Wednesday 7 May 2008 17:20, Ed Morton wrote:
>
>
>>From the GNU awk user guide
>>(http://www.gnu.org/software/gawk/manual/gawk.html#Character-Lists):
>>
>>... For example, in the default C locale, `[a-dx-z]' is equivalent to
>>`[abcdxyz]'. Many locales sort characters in dictionary order, and in
>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>
>
> Is there a way to explicitly print out that information (or, better, the
> entire collating sequence in use)? I've been looking for a method to do
> that for long time, but I have found no complete answer.
>
I expect you could use the ord() and chr() functions described here:
http://www.gnu.org/software/gawk/manual/gawk.html#Ordinal-Functions
to do something like:
for (i=ord("a");i<=ord("z");i++) {
print chr(i)
}
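Note that ord() and chr() are library functions from the manual, not awk built-ins, so they have to be defined first. A self-contained sketch of the same loop, assuming ASCII input only and using a made-up wrapper name code_range, could build the lookup table inline. It walks character code values, so it shows code order rather than the locale's collating order:

```shell
# Hypothetical wrapper: print the characters from $1 to $2 by code value.
code_range() {
    LC_ALL=C awk -v lo="$1" -v hi="$2" 'BEGIN {
        # Build a character -> code table for 7-bit ASCII.
        for (i = 1; i < 128; i++) ord[sprintf("%c", i)] = i
        # Walk the numeric codes between the two endpoints.
        for (i = ord[lo]; i <= ord[hi]; i++) printf "%c", i
        print ""
    }'
}

code_range a e   # prints abcde
```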
Regards,
Ed.
>>>these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]';
>>>instead it might be equivalent to `[aBbCcDdxXyYz]', for example.
>>
>>
>> Is there a way to explicitly print out that information (or, better, the
>> entire collating sequence in use)? I've been looking for a method to do
>> that for long time, but I have found no complete answer.
>>
>
> I expect you could use the ord() and chr() functions described here:
>
> http://www.gnu.org/software/gawk/manual/gawk.html#Ordinal-Functions
>
> to do something like:
>
> for (i=ord("a");i<=ord("z");i++) {
> print chr(i)
> }
Take this scenario:
$ cat file
100e3
$ echo $LC_ALL
en_GB
$ awk '/[A-Z]/' file
100e3
$ LC_ALL=C awk '/[A-Z]/' file
$
(or, perhaps more elegant,
$ awk '/[[:upper:]]/' file
$ )
It seems that the functions you point out use the mere numeric character
values and don't take locale into account. Using the proposed code for the
ord() and chr() functions, a loop to print the sequence from "A" to "Z"
always yields
A
B
C
...
Z
under many different locales, even en_GB which, as seen above, clearly
expands [A-Z] differently.
In fact, my question is not awk-specific, and is generically about how
collating sequences affect the interpretation of bracket expressions, and
thus influence how programs like grep, sort, awk, etc. work.
What I'm looking for is a command which, ideally, behaves as follows:
$ LC_ALL=C <command> '[A-C]'
ABC
$ LC_ALL=en_GB <command> '[A-C]'
AaBbCc # or whatever it's expanded to
and, ideally, also something like
$ <command> -a
# prints the entire current collating sequence, according to current locale
Of course, I don't know whether such a command exists, or even whether it's
possible to gather that information in some other way.
I'm setting the followup for this discussion to comp.unix.shell, since this
is not awk-specific anymore.
You are right: in my locale en_GB.UTF-8, [A-Z] matches all upper and
lower case letters (including accented letters), except lower case a.
Conversely, [a-z] matches all upper/lower case letters, except upper case Z.
>
> so "e" would sit between "A" and "Z". That's why you should use character
> classes instead of specific ranges, e.g.:
>
> gawk '{print $1,match($1,/[[:upper:]]/)}' testdata
>
I will do so. Thanks, Hermann
As the other part of this thread continues over at comp.unix.shell, I came up
with this script which you can run to see which characters are contained in
which character lists (REs actually):
$ cat rechars.awk
# Prints every character that matches a given RE.
# Originally created to print all characters in a given character list.
#
# usage:
# LC_ALL=C awk -v re="[a-z]" -f rechars.awk
# LC_ALL=en_GB awk -v re="[a-z]" -f rechars.awk
# awk -v re="[[:upper:]]" -f rechars.awk
#
BEGIN{
for (i=0;i<=1000;i++)
chars[sprintf("%c",i)]
for (c in chars)
if (c ~ re)
s=s c
print re"="s
}
$ awk -v re="[A-Z]" -f rechars.awk
[A-Z]=ABCDEFGHIJKLMNOPQRSTUVWXYZ
so you can play with that if you're curious about which characters match in
specific locales...
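A variant sketch, under the assumption that only 7-bit ASCII matters: since "for (c in chars)" visits elements in an unspecified order, looping over the codes directly keeps the output in code order. The wrapper name rechars and its locale-first argument order are made up for illustration:

```shell
# Hypothetical wrapper: rechars LOCALE RE
# Prints every ASCII character (codes 1..127) that matches RE under LOCALE.
rechars() {
    LC_ALL="$1" awk -v re="$2" 'BEGIN {
        for (i = 1; i <= 127; i++) {
            c = sprintf("%c", i)
            if (c ~ re) s = s c
        }
        print re "=" s
    }'
}

rechars C '[A-C]'        # prints [A-C]=ABC
rechars en_GB '[A-C]'    # result depends on the installed locale and the awk version
```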
Ed.
Why is LC_COLLATE instead of LC_ALL in the above case not enough?
--
Steffen
> Why is LC_COLLATE instead of LC_ALL in the above case not enough?
This is another thing that confuses me. I tried that during my experiments,
but didn't want to mention it to avoid putting too many irons in the fire.
Look:
$ cat file1
e
$ LC_COLLATE=C awk '/[A-Z]/' file1
e
$ LC_ALL=C awk '/[A-Z]/' file1
$
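One possible explanation, assuming LC_ALL was still set to en_GB in that shell as in the earlier transcript: POSIX gives LC_ALL precedence over the individual LC_* variables, which in turn take precedence over LANG, so LC_COLLATE=C has no effect while LC_ALL is set. A sketch of that lookup order (the function name effective_collate is hypothetical; an empty value counts as unset):

```shell
# Hypothetical helper: effective_collate LC_ALL_VALUE LC_COLLATE_VALUE LANG_VALUE
# Mimics the POSIX precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    elif [ -n "$3" ]; then
        echo "$3"
    else
        echo "C"    # assumption: the implementation default is the C locale
    fi
}

effective_collate en_GB C en_GB   # prints en_GB: LC_ALL wins over LC_COLLATE
effective_collate "" C en_GB      # prints C: LC_COLLATE wins once LC_ALL is unset
```

Running "locale" with the same variables set shows the values actually in effect.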
Thanks for this one. I do however not think that sprintf("%c",i) makes
much sense with i > 255. As far as I can see: for values between 0 and
255, printf prints the (control) characters from ASCII and ISO-8859-1,
but with i > 255, the same series of characters is printed again, see here:
$ awk 'BEGIN{printf "%c\n",65}'
A
$ awk 'BEGIN{printf "%c\n",65+256}'
A
$ awk 'BEGIN{printf "%c\n",65+256+256}'
A
Hermann
Here is the collating sequence of the alphabetic ([:alpha:]) letters,
using gawk and sort, for the locales C, en_US, and en_US.UTF-8
(using Fedora Core 7 and the current gawk from CVS):
$ cat collating_chars.sh
#!/bin/bash
# collating sequence of the alphabetic letters
# in an externally specified locale
gawk 'BEGIN {
for (i= 1; i <= 32767; ++i) {
c = sprintf("%c", i)
if (c ~ /[[:alpha:]]/)
a[c] = 1
}
for (c in a)
print c
}' |
sort |
gawk '{
printf "%s", $0
}
END {
print ""
}'
$ LANG=C LC_ALL=C ./collating_chars.sh
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
$ LANG=en_US LC_ALL=en_US ./collating_chars.sh
µaAáÁàÀâÂåÅäÄãêæÆbBcCçÇdDðÐeEéÉèÈêÊëËfFgGhHiIíÍìÌîÎïÏjJkKlLmMnNñÑoOóÓòÒôÔöÖõÕøØºpPqQrRsSßtTuUúÚùÙûÛüÜvVwWxXyYýÝÿzZþÞ
$ LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 ./collating_chars.sh
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
$
--
Steffen
I just picked a big number since I don't know what the biggest number is for all
charsets, then created the array so the characters could wrap around with no
impact to the output.
Ed.
It may be enough, but LC_ALL works for this and less typing.
Ed.
This actually extends Ed's original idea, adding sort ordering (a useful
thing, thanks).
What still stumps me is that in my locale (en_GB.utf8) the
expressions '[a-z]', '[[:lower:]]', and '[[:alpha:]]' all DO match
lowercase accented characters; nonetheless, using either of them with your
script does not show accented characters. The same happens with en_US.utf8,
as can be seen in your output. I used gawk-3.1.6a from CVS.
The collating_chars.sh script uses sprintf("%c", i).
This function works fine for: i = 0 ... 127
For locales with a codeset of ISO-8859-something, it will also work
fine for i = 128 ... 255.
Your locale's codeset is utf8, so '[a-z]', '[[:lower:]]', and
'[[:alpha:]]' DO include accented characters. The script is simply not
working properly, because of sprintf() limitations. At least this is
how I understand the issue.
Hermann
Yes, this is also my understanding. Non-utf8 locales work fine because they
have useful characters in the 0-255 range, while utf8 uses a different
encoding and does not necessarily have useful characters in that range.
Moreover, the tests show that characters repeat themselves every 256 (or,
maybe, it is %c that is limited to these values only).
This is why I'm looking for a solution that can gather the information by
making effective use of the sources where localized features are defined.
For example, under linux it seems that each locale is defined by a series of
files under /usr/share/i18n/locales (on my system at least).
A quick look at those files reveals that they are where the various
LC_COLLATE, LC_NUMERIC etc. values are defined (I must say that I'm still a
bit puzzled about the syntax used, but that can be solved by carefully
reading the docs and the standard, I hope).
As an example, the file /usr/share/i18n/locales/en_GB (could not find a
en_GB.utf8 file) has inside it:
...
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"
END LC_COLLATE
...
The file /usr/share/i18n/locales/iso14651_t1 includes the file
iso14651_t1_common, which finally has (some comments removed):
LC_COLLATE
# Déclaration des systèmes d'écriture / Declaration of scripts
script <SPECIAL>
script <LATIN>
script <TIFINAGH>
script <ARABINT>
script <ARABFOR>
script <HEBREU>
script <GREC>
script <CYRIL>
script <ARMENIAN>
# Déclaration des symboles internes / Declaration of internal symbols
#
# SYMB N° Expl.
#
collating-symbol <RES-1>
#
# <ARABINT>/<ARABFOR>
#
#
collating-symbol <ANO> # 2 normal --> voir/see <MIN>
collating-symbol <AIS> # 3 isol.
collating-symbol <AFI> # 4 final
collating-symbol <AII> # 5 initial
collating-symbol <AME> # 6 medial/m<e'>dian
...[snip]...
# Ordre des symboles internes / Order of internal symbols
#
# SYMB. N°
#
<RES-1>
...[snip]...
order_start <SPECIAL>;forward;backward;forward;forward,position
#
# Tout caractère non précisément défini sera considéré comme caractère
# spécial et considéré uniquement au dernier niveau.
#
# Any character not precisely specified will be considered as a special
# character and considered only at the last level.
# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
#
# SYMB. N° GLY
#
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _
<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>
<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)
<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>
<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -
<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,
<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;
<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :
<U0021> IGNORE;IGNORE;IGNORE;<U0021> # 41 !
<U00A1> IGNORE;IGNORE;IGNORE;<U00A1> # 42 ¡
<U003F> IGNORE;IGNORE;IGNORE;<U003F> # 43 ?
<U00BF> IGNORE;IGNORE;IGNORE;<U00BF> # 44 ¿
<U002F> IGNORE;IGNORE;IGNORE;<U002F> # 45 /
...
...etc.etc.
As far as I understand, these files are read by the commands "locale-gen"
and "localedef" and used to generate a binary form of locale information,
located (on my system) under /usr/lib/locale, which is what is used by the
various locale-sensitive programs (perhaps through libc) at runtime.
Either way, my point is that, since the localization info is defined
somewhere, there should be a way to extract and display that information.
The command "locale", with its various options, seems unable to provide the
kind of information I'm interested in (unless I overlooked something, which
may entirely be possible).
$ locale -c LC_COLLATE -k
LC_COLLATE
collate-nrules=4
collate-rulesets=""
collate-symb-hash-sizemb=1
collate-codeset="UTF-8"
To anybody who might reply: if you feel that the discussion is OT on
comp.lang.awk, feel free to reply on comp.unix.shell or whatever group you
deem appropriate. Thanks.
My /usr/share/i18n/locales/en_GB has also inside it:
LC_CTYPE
copy "i18n"
i18n in turn defines the character classes: upper, lower, etc. Each
one with a list of Unicode version 5.0.0 code points, that are
included in a given character class.
Hermann
> My /usr/share/i18n/locales/en_GB has also inside it:
>
> LC_CTYPE
> copy "i18n"
Yes, mine too. I just extrapolated a sample section.
> i18n in return defines the character classes: upper, lower, etc. Each
> one with a list of Unicode version 5.0.0 code points, that are
> included in a given character class.
However, that file is included by many other files in that directory
(notably, not by POSIX).
$ grep -l 'copy "i18n"' /usr/share/i18n/locales/* | wc -l
99
Does that mean that all these 99 locales have all the same characters in
their upper, alpha, etc. classes?
And, even if it is so, that helps only for POSIX character classes; how does
one answer the question "what characters match, eg, '[A-Z]' under a given
locale?". I think the answer to this lies in the understanding of
LC_COLLATE.
I guess I'll try reading the standard documentation for localization when I
have more time. It scares me a little bit because its language is not (to
me at least) very simple or clear (unlike other parts of the standard),
however I think it's an obligatory step to understand the whole thing
better.
Thanks for your persisting interest in this discussion!
I would think so. All these 99 locales are Unicode-aware, so why
should they differ in their understanding of what is an upper case or
lower case character?
This is defined in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
(and related files in this directory).
Code point U+0041 is e.g. defined as:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
Property value "Lu" stands for upper case letter and all Unicode-aware
locales will observe these properties (e.g. via the character class
definitions in file i18n).
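As a quick sketch of how such a record can be pulled apart with awk (the function name unicode_fields is made up): UnicodeData.txt is semicolon-separated, so the code point is field 1, the general category is field 3, and the simple lowercase mapping is field 14.

```shell
# Hypothetical helper: print code point, general category and
# simple lowercase mapping of one UnicodeData.txt record.
unicode_fields() {
    printf '%s\n' "$1" | awk -F';' '{ print $1, $3, $14 }'
}

unicode_fields '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'
# prints: 0041 Lu 0061
```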
> And, even if it is so, that helps only for POSIX character classes; how does
> one answer the question "what characters match, eg, '[A-Z]' under a given
> locale?". I think the answer to this lies in the understanding of
> LC_COLLATE.
>
I guess you are right but I don't have a clear idea myself. I will
also have to read more in the documentation.
Hermann
Big numbers just don't work as expected. As also confirmed by pk in this
thread: printf "%c", i basically prints the character of i%256.
Hermann
Yes. Why does /[A-C]/ (in locale en_GB and de_DE) match the five characters
65 A
66 B
67 C
98 b
99 c
> I think the answer to this lies in the understanding of
> LC_COLLATE.
Meanwhile, I fear, that there's either a bug or a fundamental flaw in the
conception of the locales (at least in conjunction with regexp character
ranges). The above result doesn't make the least sense to me.
Janis
What I know from testing with the en_GB.UTF-8 locale: the order of
characters seems to be aAbBcCdD...
So [A-C] matches the 5 mentioned characters, but not a lowercase a.
Accented characters in this range are also matched.
Hermann
Perhaps the result of the GLIBC functions iswalpha and wprintf together
with GNU sort explains better the collating of the alphabetic characters
for the 3 locales C, en_US, en_US.utf8 (my current system: Debian
GNU/Linux testing (lenny)):
$ cat collate.c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <stdlib.h>
int
main(int argc, char **argv)
{
wint_t i;
if (argc != 2) {
fprintf(stderr, "usage: ./collate MAX_CHAR_CODE\n");
exit(1);
}
for (i = 1; i <= atoi(argv[1]); ++i)
if (iswalpha(i))
wprintf(L"%c\n", i);
return 0;
}
$ cc -o collate collate.c
$ export LC_ALL=C; ./collate 255 | sort | tr -d '\n'; echo
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
$ export LC_ALL=en_US; ./collate 255 | sort | tr -d '\n'; echo
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
$ # ^^^^^^^...: only Latin letters
$ export LC_ALL=en_US.UTF-8; ./collate 65535 | sort | tr -d '\n'; echo
aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
$
There seem to be some inconsistencies between gawk and glibc locale
usage. Especially in the C-code sample only Latin letters are listed for
LC_ALL=en_US contrary to the similar gawk sample.
Besides, the collating order may be different from the order of the
character codes --- see:
http://tinyurl.com/6gdraw
I'm really interested whether there are any further locale bugs in gawk
(for trying to fix them).
Perhaps I have more time for the research at the current weekend.
--
Steffen
I think the *.src files here have some nice human-readable definitions
about how character ranges like [A-C] are expanded under a given locale.
http://unicode.org/cldr/data/common/posix/
According to the LC_COLLATE section in de_DE.UTF-8.src, [A-C] is
expanded to the below characters, in the given order (sorry for the long
list).
Hermann
...
<seven>
<eight>
<nine>
<a>
<FULLWIDTH_LATIN_SMALL_LETTER_A>
--- start of range [A-C] ---
<A>
<FULLWIDTH_LATIN_CAPITAL_LETTER_A>
<MODIFIER_LETTER_SMALL_A>
<FEMININE_ORDINAL_INDICATOR>
<LATIN_SUBSCRIPT_SMALL_LETTER_A>
<MODIFIER_LETTER_CAPITAL_A>
<LATIN_SMALL_LETTER_A_WITH_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_BREVE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_TILDE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_TILDE>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_HOOK_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_HOOK_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_TILDE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_TILDE>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_HOOK_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_HOOK_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_CARON>
<LATIN_CAPITAL_LETTER_A_WITH_CARON>
<LATIN_SMALL_LETTER_A_WITH_RING_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE>
<ANGSTROM_SIGN>
<LATIN_SMALL_LETTER_A_WITH_RING_ABOVE_AND_ACUTE>
<LATIN_CAPITAL_LETTER_A_WITH_RING_ABOVE_AND_ACUTE>
<LATIN_SMALL_LETTER_A_WITH_DIAERESIS>
<LATIN_CAPITAL_LETTER_A_WITH_DIAERESIS>
<LATIN_SMALL_LETTER_A_WITH_DIAERESIS_AND_MACRON>
<LATIN_CAPITAL_LETTER_A_WITH_DIAERESIS_AND_MACRON>
<LATIN_SMALL_LETTER_A_WITH_TILDE>
<LATIN_CAPITAL_LETTER_A_WITH_TILDE>
<LATIN_SMALL_LETTER_A_WITH_DOT_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_DOT_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_DOT_ABOVE_AND_MACRON>
<LATIN_CAPITAL_LETTER_A_WITH_DOT_ABOVE_AND_MACRON>
<LATIN_SMALL_LETTER_A_WITH_OGONEK>
<LATIN_CAPITAL_LETTER_A_WITH_OGONEK>
<LATIN_SMALL_LETTER_A_WITH_MACRON>
<LATIN_CAPITAL_LETTER_A_WITH_MACRON>
<LATIN_SMALL_LETTER_A_WITH_HOOK_ABOVE>
<LATIN_CAPITAL_LETTER_A_WITH_HOOK_ABOVE>
<LATIN_SMALL_LETTER_A_WITH_DOUBLE_GRAVE>
<LATIN_CAPITAL_LETTER_A_WITH_DOUBLE_GRAVE>
<LATIN_SMALL_LETTER_A_WITH_INVERTED_BREVE>
<LATIN_CAPITAL_LETTER_A_WITH_INVERTED_BREVE>
<LATIN_SMALL_LETTER_A_WITH_DOT_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_DOT_BELOW>
<LATIN_SMALL_LETTER_A_WITH_BREVE_AND_DOT_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_BREVE_AND_DOT_BELOW>
<LATIN_SMALL_LETTER_A_WITH_CIRCUMFLEX_AND_DOT_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_CIRCUMFLEX_AND_DOT_BELOW>
<LATIN_SMALL_LETTER_A_WITH_RING_BELOW>
<LATIN_CAPITAL_LETTER_A_WITH_RING_BELOW>
<MODIFIER_LETTER_CAPITAL_AE>
<LATIN_SMALL_LETTER_AE>
<LATIN_CAPITAL_LETTER_AE>
<LATIN_SMALL_LETTER_AE_WITH_ACUTE>
<LATIN_CAPITAL_LETTER_AE_WITH_ACUTE>
<LATIN_SMALL_LETTER_AE_WITH_MACRON>
<LATIN_CAPITAL_LETTER_AE_WITH_MACRON>
<LATIN_SMALL_LETTER_A_WITH_RIGHT_HALF_RING>
<LATIN_LETTER_SMALL_CAPITAL_A>
<LATIN_SMALL_LETTER_A_WITH_STROKE>
<LATIN_CAPITAL_LETTER_A_WITH_STROKE>
<LATIN_SMALL_LETTER_A_WITH_RETROFLEX_HOOK>
<LATIN_LETTER_SMALL_CAPITAL_AE>
<LATIN_SMALL_LETTER_TURNED_AE>
<MODIFIER_LETTER_SMALL_TURNED_AE>
<LATIN_SMALL_LETTER_TURNED_A>
<MODIFIER_LETTER_SMALL_TURNED_A>
<LATIN_SMALL_LETTER_ALPHA>
<MODIFIER_LETTER_SMALL_ALPHA>
<LATIN_SMALL_LETTER_ALPHA_WITH_RETROFLEX_HOOK>
<LATIN_SMALL_LETTER_TURNED_ALPHA>
<MODIFIER_LETTER_SMALL_TURNED_ALPHA>
<b>
<FULLWIDTH_LATIN_SMALL_LETTER_B>
<B>
<FULLWIDTH_LATIN_CAPITAL_LETTER_B>
<MODIFIER_LETTER_SMALL_B>
<MODIFIER_LETTER_CAPITAL_B>
<LATIN_SMALL_LETTER_B_WITH_DOT_ABOVE>
<LATIN_CAPITAL_LETTER_B_WITH_DOT_ABOVE>
<LATIN_SMALL_LETTER_B_WITH_DOT_BELOW>
<LATIN_CAPITAL_LETTER_B_WITH_DOT_BELOW>
<LATIN_SMALL_LETTER_B_WITH_LINE_BELOW>
<LATIN_CAPITAL_LETTER_B_WITH_LINE_BELOW>
<LATIN_LETTER_SMALL_CAPITAL_B>
<LATIN_SMALL_LETTER_B_WITH_STROKE>
<LATIN_CAPITAL_LETTER_B_WITH_STROKE>
<MODIFIER_LETTER_CAPITAL_BARRED_B>
<LATIN_LETTER_SMALL_CAPITAL_BARRED_B>
<LATIN_SMALL_LETTER_B_WITH_MIDDLE_TILDE>
<LATIN_SMALL_LETTER_B_WITH_PALATAL_HOOK>
<LATIN_SMALL_LETTER_B_WITH_HOOK>
<LATIN_CAPITAL_LETTER_B_WITH_HOOK>
<LATIN_SMALL_LETTER_B_WITH_TOPBAR>
<LATIN_CAPITAL_LETTER_B_WITH_TOPBAR>
<c>
<FULLWIDTH_LATIN_SMALL_LETTER_C>
<C>
--- End of range [A-C] --
<FULLWIDTH_LATIN_CAPITAL_LETTER_C>
<MODIFIER_LETTER_SMALL_C>
<LATIN_SMALL_LETTER_C_WITH_ACUTE>
<LATIN_CAPITAL_LETTER_C_WITH_ACUTE>
...
> I think the *.src files here have some nice human-readable definitions
> about how character ranges like [A-C] are expanded under a given locale.
>
> http://unicode.org/cldr/data/common/posix/
>
> According to the LC_COLLATE section in de_DE.UTF-8.src, [A-C] is
> expanded to the below characters, in the given order (sorry for the long
> list).
Yeah, thanks for the link. It seems that reading those files requires the
same syntax knowledge that is required to read /usr/share/i18n files, so,
once I get that one, I'll have more sources to read :-)
Yes, I noticed that, but I could not explain it.
> Besides, the collating order may be different from the order of the
> character codes --- see:
>
> http://tinyurl.com/6gdraw
This is something that can immediately be noticed by looking at the en_US
output above.
> I'm really interested whether there are any further locale bugs in gawk
> (for trying to fix them).
> Perhaps I have more time for the research at the current weekend.
I'm not sure I can follow you on this, mostly due to my lack of in-depth
libc/wchar knowledge. However, I'll keep this program and come back to it
when my understanding of the subject is better.
Thanks!
Thanks. I understand the technical reason why the results
are created the way we see them; but I think the implicit(!)
and well hidden semantics of an expression like /[A-Z]/ are
not very useful, to say the least.
I suppose it's also disputable whether character classes like
[[:alpha:]] shall match non-English letters in a en_GB locale
or e.g. accent diacritical characters in a de_DE locale. What
is the intended rationale with locales in such cases; shall
they cover multilingual data? Shall [[:lower:]] in a de_DE
locale match a (French) á but not a (Greek) small omega (if
the latter is not in the codepage)?
Janis
>
> Hermann
>
> ...[snip list]
From a semantics point of view, it is also surprising that, when using
de_DE.UTF-8 locale, e.g. <LATIN_SMALL_LETTER_A_WITH_ACUTE> is included
in the range [A-C], but <LATIN_SMALL_LETTER_C_WITH_ACUTE> is not, as it
sorts after uppercase C.
In summary: using character ranges in combination with a non-POSIX
locale is a good recipe for surprising results. To a somewhat lesser
degree, this is also true for character classes.
Hermann
> From a semantics point of view, it is also surprising that, when using
> de_DE.UTF-8 locale, e.g. <LATIN_SMALL_LETTER_A_WITH_ACUTE> is included
> in the range [A-C], but <LATIN_SMALL_LETTER_C_WITH_ACUTE> is not, as it
> sorts after uppercase C.
Well, considering the general strangeness of the thing, from a certain point
of view this makes sense to me. If [A-C] expands to AaBbCc, the "c" is not
matched because it comes after the "C" (which is the last character in the
bracket expression). On the other hand, if [A-C] expands to aAbBcC,
then "a" is not matched, again because it's outside the range.
> In summary: using character ranges in combination with a non-POSIX
> locale is a good recipe for surprising results. To a somewhat lesser
> degree, this is also true for character classes.
Agreed. That's why I'm trying to find a way to print character classes and
collating sequences, so that when someone comes asking "why doesn't sort
work correctly" or "why doesn't grep/awk/sed. etc. match what I expect",
they can be instructed to run the (non-existent yet) command or script to
see for themselves what the programs' idea of what they want to do is.
> Well, considered the general strangeness of the thing, from a certain
> point of view this makes sense to me. If [A-C] expands to AaBbCc, the "c"
> is not matched because it comes after the "C" (which is the last character
> in the bracket expression). On the other hand, if [A-C] expands to aAbBcC,
> then "a" is not matched, again because it's outside the range.
I should better have said:
"If the current collating sequence is AaBbCc"...and similarly for the second
part.
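To make that concrete, here is a toy sketch that treats a collating sequence as a plain string and cuts out everything between the two endpoints. It is only a model of the idea, not how the regex library actually evaluates ranges, and the name expand_range is made up:

```shell
# Hypothetical helper: expand_range SEQUENCE LOW HIGH
# Prints the part of SEQUENCE from LOW through HIGH, inclusive.
expand_range() {
    LC_ALL=C awk -v seq="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        start = index(seq, lo)
        end = index(seq, hi)
        # Print nothing if an endpoint is missing or the range is reversed.
        if (start && end && start <= end)
            print substr(seq, start, end - start + 1)
    }'
}

expand_range AaBbCc A C   # prints AaBbC  ("c" falls outside)
expand_range aAbBcC A C   # prints AbBcC  ("a" falls outside)
```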
Yes, it would be less surprising to assume [A-C] as defining the
representatives base for the respective diacritical characters.
Has anybody experience, in this context, with equivalence classes
syntax [=...=]? (I haven't used them yet, but there's a chance
that equivalence classes might provide some more natural solutions
with locales.)
Janis
I combined some code from Ed's and Steffen's scripts and added
/usr/bin/printf in the middle part. Unlike bash's builtin printf or
gawk's (s)printf: /usr/bin/printf is able to convert Unicode code point
values into chars.
The script is far from being smart, efficient, or anything like
that, but it seems to work with Unicode-aware locales.
$ LC_ALL=en_GB.UTF-8 ./collating_chars.sh [A-C]
AáÁàÀăắằẵẳặĂẮẰẴẲẶâấầẫẩậÂẤẦẪẨẬǎǍåǻÅǺäǟÄǞãÃȧǡȦǠąĄāĀảẢȁȀȃȂạẠḁḀẚªæǽǣÆǼǢbBḃḂḅḄḇḆɓƁcC
$ LC_ALL=da_DK.UTF-8 ./collating_chars.sh [A-C]
AaÁáÀàĂẮẰẴẲẶăắằẵẳặÂẤẦẪẨẬâấầẫẩậǍǎǺǻǞǟÃãȦǠȧǡĄąĀāẢảȀȁȂȃẠạḀḁẚªBbḂḃḄḅḆḇƁɓC
The example shows that in da_DK.UTF-8 locale, the [A-C] range expands to
fewer characters than in en_GB (and most other) locales. The reason is
that A_RING and AE_LIGATURE characters sort after Z, according to Danish
sorting rules.
Hermann
$ cat collating_chars.sh
#!/bin/bash
# collate sequence of characters which belong
# to a given character range or class
# based on Ed's and Steffen's code
# seems to work for Unicode locales
#
# usage:
# collating_chars.sh [A-Z]
# collating_chars.sh [[:upper:]]
gawk 'BEGIN {
for (i=0;i<=32767;i++) {
num = sprintf("%X", i)
l = length(num)
# construct a format that is
# understood by /usr/bin/printf
if (i < 16)
num = "\\\\x0" num
else if (i < 128)
num = "\\\\x" num
else if (l == 2)
num = "\\\\u00" num
else if (l == 3)
num = "\\\\u0" num
else
num = "\\\\u" num
# exclude C1 control chars
# as printf doesnt like them
if (i < 128 || i > 159)
print num
}
}' |
# Print Unicode chars with /usr/bin/printf
while read num ; do /usr/bin/printf "$num\n" ; done | sort |
# Collate characters which are matched by the given range or class
gawk -v re="$1" '$1 ~ re { s = s $1 } END { print s }'
I just noticed: A umlaut characters are also not part of the Danish
[A-C] range. Furthermore, Danish sorts AaBbCc, whereas en_GB sorts
aAbBcC. These are just a few more reasons why character ranges are best
avoided when working in non-POSIX locales.
Hermann
Nice script. Works fine. I tested a lot with it and adapted it to my
personal suites.
--
Steffen
Thanks for the hint.
>
> Nice script. Works fine. I tested a lot with it and adapted it to my
> personal suites.
>
Thanks again. I am however really surprised that no serious programmer
has ever invented a much better tool to do the same thing (and more, and
faster). So much time has been invested in all these locale definitions
and nobody is interested in (providing) a tool that could easily show
what character ranges and classes exactly mean in a given locale?
Puzzled. Hermann
> I combined some code from Ed's and Steffen's scripts and added
> /usr/bin/printf in the middle part. Unlike bash's builtin printf or
> gawk's (s)printf: /usr/bin/printf is able to convert Unicode code point
> values into chars.
>
> The script is far away from being smart, efficient, or anything like
> that, but it seems to work with Unicode-aware locales.
>[snip]
Great job, really!
I'm still trying to put together something that reads the locale files
directly, but that involves reading the libc documentation for wc/mb etc.
plus the POSIX docs for locales (probably some of that is not strictly
needed, but since it's something I should have done anyway, I'm just
seizing the opportunity). This of course is not to say that your script is
not useful. On the contrary, it's really neat and smart.
A big thank you!
> Thanks again. I am however really surprised that no serious programmer
> has ever invented a much better tool to do the same thing (and more, and
> faster). So much time has been invested in all these locale definitions
> and nobody is interested in (providing) a tool that could easily show
> what character ranges and classes exactly mean in a given locale?
>
Hermann,
I'm afraid this wasn't seen as necessary, as the script would be a sort of
analysis of the pasta by dissolving it in vinegar to separate the egg from
the flour instead of reading the package's label.
Those sequences are specified, aren't they?
Oi Cesar,
It's true: the sequences are specified. But it might not be everyone's
hobby to read through dozens (hundreds) of locale definition source
files in order to find out what the exact details and differences of a
character range like [A-C] in various Unicode locales are.
Anyway: I see that there are at least 3 users of my script (Steffen, pk
and me). It looks like it was worth writing it.
;-) Hermann
For those who might be interested: here is another variation of the
script. Reading an additional local file is perhaps not the smartest
solution, but it guarantees that all relevant Unicode code points are
covered. (I promise this to be my last OT posting in this thread ;-)
Hermann
$ cat collate_rechars.sh
#!/bin/bash
# Collate a sorted list of chars that belong to a range or class
# Based on Ed's and Steffen's code, works only for Unicode locales
#
# Usage:
# LC_ALL=en_GB.UTF-8 collate_rechars.sh [A-Z]
# LC_ALL=da_DK.UTF-8 collate_rechars.sh [[:upper:]]
#
# For this script you need to have a local copy of
# http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
gawk -F";" '
# Construct a format for chars in ASCII range
#
NR < 129 { print "\\\\x" substr($1,3,2) ; next }
# Construct a format for other code points in UnicodeData.txt
# Exclude C1 control chars and some other code points
# as /usr/bin/printf reports errors for them
#
NR > 160 && $1 !~ /^(D800|DB7F|DB80|DBFF|DC00|DFFF)$/ {
printf "\\\\U%08s\n", $1 }' UnicodeData.txt |
# Print chars with /usr/bin/printf and sort them
#
while read f ; do /usr/bin/printf "$f\n" ; done | sort |
# Collate chars for the given character range or class
#
gawk -v re="$1" '$1 ~ re { s = s $1 } END { print re "=" s }'
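The "%08s" format above relies on the 0 flag zero-padding a string. In C's printf that flag is not defined for %s, so awk implementations may differ here. A sketch that pads by hand instead is shown below; pad_codepoint is a hypothetical name, and the input is assumed to be the hex code point from the first field of UnicodeData.txt.

```shell
# pad_codepoint HEX
# Hypothetical helper: print \U followed by HEX left-padded with
# zeros to 8 digits, without using the 0 flag on %s.
pad_codepoint() {
    awk -v hex="$1" 'BEGIN {
        # prepend as many zeros as are missing from 8 hex digits
        printf "\\U%s%s\n", substr("00000000", 1, 8 - length(hex)), hex
    }'
}

p1=$(pad_codepoint 00C5)
p2=$(pad_codepoint 1F600)
printf '%s\n%s\n' "$p1" "$p2"
```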
Olá Hermann,
> It's true: the sequences are specified. But it might not be everyone's
> hobby to read through dozens (hundreds) of locale definition source
> files in order to find out what the exact details and differences of a
> character range like [A-C] in various Unicode locales are.
I do know what you mean!
>
> Anyway: I see that there are at least 3 users of my script (Steffen, pk
> and me). It looks like it was worth writing it.
>
> ;-) Hermann
Sure! :-D
I just found the footnote below in the GNU documentation.
Hermann
--- snip ---
Note that setting only LC_COLLATE has two problems. First, it is
ineffective if LC_ALL is also set. Second, it has undefined behavior if
LC_CTYPE (or LANG, if LC_CTYPE is unset) is set to an incompatible value.
--- snip ---
http://www.gnu.org/software/coreutils/manual/coreutils.html#fn-2
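The first point of the footnote can be checked with sort. A minimal sketch, assuming a POSIX shell: LC_ALL takes precedence over LC_COLLATE, so the LC_COLLATE value below is ignored and the result is in plain byte order. The locale name en_US.UTF-8 is only an example and does not even need to be installed for this to hold.

```shell
# LC_ALL overrides LC_COLLATE, so sort uses the C collation here
# no matter what LC_COLLATE says.
sorted=$(printf '%s\n' b A a B |
    LC_ALL=C LC_COLLATE=en_US.UTF-8 sort | tr -d '\n')
printf '%s\n' "$sorted"
```

In C collation all uppercase ASCII letters come before the lowercase ones, which is why the uppercase pair ends up first.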