
Ask for help about SUSv3 tr utility "-C" option


xiebo...@gmail.com

Jul 27, 2005, 7:46:35 AM

Hello all,

I could not find a more appropriate newsgroup for the tr utility, so I
am posting here. :-)

I have a question about the tr utility. From [1] we know that "The -C
operand is added, and the description of the -c operand is changed to
align with the IEEE P1003.2b draft standard".

My question is: what is the difference between the "-c" and "-C" options
of the tr utility in SUSv3? Would you be so kind as to give me an example
showing the difference (e.g. "tr -dC ..." versus "tr -dc ...")?

Thank you very much for your time!

Best Regards,
Xie Bo

Reference
1. http://www.opengroup.org/onlinepubs/000095399/utilities/tr.html

Geoff Clare

Jul 28, 2005, 8:55:12 AM

xiebo...@gmail.com wrote, on Wed, 27 Jul 2005 04:46:35 -0700:

> My question is: What's the difference between "-c" and "-C" options in
> SUSv3 for tr utility?

The difference lies in what gets complemented: -c complements the set
of byte values in string1, whereas -C complements the set of (possibly
multibyte) characters in string1.

> Would you be so kind to give me a example to show
> the difference (e.g. "tr -dC ..." and "tr -dc ..." )?

First let's obtain some multibyte characters in UTF-8 encoding:

$ ae_utf8=$(echo 'áé' | iconv -f 8859-1 -t UTF-8)

Now ae_utf8 contains two two-byte characters:

$ echo "$ae_utf8" | LC_ALL=C od -c
0000000 303 241 303 251 \n
0000005

The 303 241 sequence is a-acute, and 303 251 is e-acute.

Put just the a-acute into another variable:

$ a_utf8=$(echo 'á' | iconv -f 8859-1 -t UTF-8)
$ echo "$a_utf8" | LC_ALL=C od -c
0000000 303 241 \n
0000003
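
The echo | iconv steps above assume the terminal is using ISO 8859-1.
If that cannot be relied on, the same bytes can probably be built
directly from octal escapes. This is only a sketch: the ae_dump and
a_dump names are mine, and od -An -to1 piped through xargs is used
just to get a compact one-line octal dump.

```shell
# Build the UTF-8 byte sequences directly, independent of the terminal encoding.
ae_utf8=$(printf '\303\241\303\251')   # a-acute followed by e-acute
a_utf8=$(printf '\303\241')            # a-acute only

# Dump each value (plus the newline added by printf '%s\n') as octal bytes.
# xargs with no command just normalizes the whitespace onto one line.
ae_dump=$(printf '%s\n' "$ae_utf8" | LC_ALL=C od -An -to1 | xargs)
a_dump=$(printf '%s\n' "$a_utf8" | LC_ALL=C od -An -to1 | xargs)

echo "$ae_dump"   # 303 241 303 251 012
echo "$a_dump"    # 303 241 012
```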

Now let's try the two different tr options with a UTF-8 locale:

$ echo "$ae_utf8" | LC_ALL=en_US.UTF-8 tr -Cd "$a_utf8" | LC_ALL=C od -c
0000000 303 241
0000002

This deleted the e-acute two-byte character and the newline, leaving
just the a-acute two-byte character.

$ echo "$ae_utf8" | LC_ALL=en_US.UTF-8 tr -cd "$a_utf8" | LC_ALL=C od -c
0000000 303 241 303
0000003

This deleted just the 251 byte and the newline. It treated the $a_utf8
string as containing byte values instead of (multibyte) characters.

(The above was executed using /usr/xpg6/bin/tr on Solaris 10.)
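
The byte-wise (-c) result should be reproducible on any tr by running
it in the C locale, where every byte is its own character. A minimal
sketch, assuming a tr that accepts octal escapes in string1 (the out
variable name is mine). Note that the -C result depends on the
implementation: a tr without multibyte support (GNU coreutils tr, as
far as I know) may treat -C the same as -c.

```shell
# Feed the bytes for a-acute, e-acute and newline to tr in the C locale.
# -c complements the byte set {303, 241}; -d deletes everything in that
# complement, i.e. the 251 byte and the newline.
out=$(printf '\303\241\303\251\n' \
      | LC_ALL=C tr -cd '\303\241' \
      | LC_ALL=C od -An -to1 | xargs)

echo "$out"   # 303 241 303
```

The lone trailing 303 byte is what shows the byte-wise treatment: it
is the first half of the e-acute character, kept because 303 also
occurs in string1.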

--
Geoff Clare <net...@gclare.org.uk>
