substitution for octal chars

Lao Ming

Jul 7, 2010, 10:44:57 PM
When I download a text file containing octal characters

e.g. \342\200\231

as in: can\342\200\231t or don\342\200\231t

is there a way to replace these with their ASCII equivalents
from the shell with sed, perl or awk?
Thanks.

Janis Papanagnou

Jul 8, 2010, 4:13:58 AM
Lao Ming wrote:

I fear there might not be an ASCII equivalent if some encoding
of a different character set has been used here instead of ASCII.
You'll first have to find out which encoding was used. Then the
program iconv may help you convert the data.

Janis

Ben Bacarisse

Jul 8, 2010, 8:23:26 AM
Lao Ming <laomi...@gmail.com> writes:

The example is a useful one. \342\200\231 is the UTF-8 encoding of
U+2019, the "right single quotation mark", which Unicode recommends as
the character to use for an apostrophe. It is therefore very likely
that the file is UTF-8 encoded.

When you say the file contains octal characters it is not clear if you
are showing us the octal values for the characters or whether the file
really has the backslash followed by the three digits. In other words,
does \342\200\231 represent 3 or 12 octets?
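
A quick way to see the difference between the two cases is to count bytes. This is only a minimal sketch using printf and wc; the sample word comes from the original post, and the variable names are made up for the illustration:

```shell
# Case 1: printf expands the octal escapes in its format string,
# so the pipe carries three raw bytes between "can" and "t".
raw=$(printf 'can\342\200\231t' | wc -c)

# Case 2: with %s the argument is passed through untouched,
# so the pipe carries twelve literal characters (backslashes and digits).
literal=$(printf '%s' 'can\342\200\231t' | wc -c)

# $(( )) strips the padding that some wc implementations add.
echo "raw: $((raw)) bytes, literal: $((literal)) bytes"   # raw: 7 bytes, literal: 16 bytes
```

On a real file, something like `od -c my-input-file | head` tells the two apart as well: raw bytes show up as single 342 200 231 columns, while literal escapes show up as a backslash followed by individual digit characters.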

If (as is likely) it is the former then iconv (with //translit) is the
place to start. You may run into trouble when there are characters in
the file that have no obvious ASCII equivalent, but that is another
problem.

iconv --from=utf-8 --to=ascii//translit my-input-file
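
If the file turns out to hold the 12-character form instead, the escapes would need to be turned back into bytes before iconv can do anything useful. A rough sketch with sed and printf follows. The name decode_octal_escapes is made up for this example, and the sketch assumes the text contains no other backslash sequences, since printf's %b would expand those too (\n, \t and so on).

```shell
# Hypothetical helper: turn literal \ddd escapes on stdin into raw bytes.
decode_octal_escapes() {
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # sed rewrites \ddd as \0ddd, the octal form that %b understands;
        # printf then expands each escape into a single byte.
        printf '%b\n' "$(printf '%s\n' "$line" |
            sed 's/\\\([0-7][0-7][0-7]\)/\\0\1/g')"
    done
}

# "don" + 3 decoded bytes + "t" + newline
printf '%s\n' 'don\342\200\231t' | decode_octal_escapes | wc -c   # 8
```

The decoded stream can then be piped into the same iconv command, for example `decode_octal_escapes < my-input-file | iconv -f utf-8 -t ascii//translit`. Note that //translit is an extension and not every iconv implementation supports it.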

--
Ben.