I could not find an appropriate newsgroup for the tr utility, so I am
posting here. :-)
I have a question about the tr utility. From [1] we know that "The -C
operand is added, and the description of the -c operand is changed to
align with the IEEE P1003.2b draft standard".
My question is: What's the difference between the "-c" and "-C" options
in SUSv3 for the tr utility? Would you be so kind as to give me an
example to show the difference (e.g. "tr -dC ..." and "tr -dc ...")?
Thank you very much for your time!
Best Regards,
Xie Bo
Reference
[1] http://www.opengroup.org/onlinepubs/000095399/utilities/tr.html
> My question is: What's the difference between the "-c" and "-C" options
> in SUSv3 for the tr utility?
The difference is that -c complements the set of byte values given in
string1, whereas -C complements the set of (possibly multibyte)
characters given in string1.
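As a baseline, here is a minimal sketch for the C locale, where every
character is a single byte, so the two options should give the same
result. The sample input and the variable names byte_wise and char_wise
are made up for illustration:

```shell
# In the C locale each character is one byte, so complementing by
# value (-c) and by character (-C) should select the same set.
byte_wise=$(printf 'abc123\n' | LC_ALL=C tr -cd '0-9')
char_wise=$(printf 'abc123\n' | LC_ALL=C tr -Cd '0-9')
echo "$byte_wise"
echo "$char_wise"
```

The two options can only differ once string1 contains multibyte
characters.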
> Would you be so kind as to give me an example to show
> the difference (e.g. "tr -dC ..." and "tr -dc ...")?
First let's obtain some multibyte characters in UTF-8 encoding:
$ ae_utf8=$(echo 'áé' | iconv -f 8859-1 -t UTF-8)
Now ae_utf8 contains two two-byte characters:
$ echo "$ae_utf8" | LC_ALL=C od -c
0000000 303 241 303 251 \n
0000005
The 303 241 sequence is a-acute, and 303 251 is e-acute.
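If the terminal is not using ISO 8859-1, the same byte sequence can be
built directly from the octal values shown above. This is just a sketch
that avoids iconv; the variable name byte_count is made up for
illustration:

```shell
# Emit a-acute (303 241), e-acute (303 251) and a newline as raw bytes,
# using the octal escapes that printf accepts in its format string.
printf '\303\241\303\251\n' | LC_ALL=C od -c

# Two two-byte characters plus the newline make five bytes in total.
byte_count=$(printf '\303\241\303\251\n' | wc -c | tr -d ' ')
echo "$byte_count"
```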
Put just the a-acute into another variable:
$ a_utf8=$(echo 'á' | iconv -f 8859-1 -t UTF-8)
$ echo "$a_utf8" | LC_ALL=C od -c
0000000 303 241 \n
0000003
Now let's try the two different tr options with a UTF-8 locale:
$ echo "$ae_utf8" | LC_ALL=en_US.UTF-8 tr -Cd "$a_utf8" | LC_ALL=C od -c
0000000 303 241
0000002
This deleted the e-acute two-byte character and the newline, leaving
just the a-acute two-byte character.
$ echo "$ae_utf8" | LC_ALL=en_US.UTF-8 tr -cd "$a_utf8" | LC_ALL=C od -c
0000000 303 241 303
0000003
This deleted just the 251 byte and the newline. It treated the $a_utf8
string as containing byte values instead of (multibyte) characters.
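The byte-wise result can be reproduced without a UTF-8 locale by
spelling out the two bytes of a-acute as octal escapes and running tr
in the C locale. This is only a sketch of the -c behaviour described
above, not of -C, and the variable name kept is made up for
illustration:

```shell
# Keep only the byte values 303 and 241; delete every other byte.
# LC_ALL=C forces byte-wise handling, so the leading 303 byte of
# e-acute survives while its 251 byte and the newline are deleted.
kept=$(printf '\303\241\303\251\n' \
        | LC_ALL=C tr -cd '\303\241' \
        | LC_ALL=C od -An -to1 \
        | tr -d ' \n')
echo "$kept"    # remaining bytes in octal, run together: 303241303
```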
(The above was executed using /usr/xpg6/bin/tr on Solaris 10.)
--
Geoff Clare <net...@gclare.org.uk>