Dealing with non-ASCII characters in SED

Kompu Kid

Jun 13, 2009, 1:28:54 PM

Hello All:

I have a SED script that strips the non-ASCII characters in foreign
texts I download.

When they are characters like o with umlauts (o with two dots on top)
that I can see in my BSD environment, I simply match these characters and
replace them with regular o's.

However, recently I have been running into characters represented in
the following fashion:

\x9f
\x87a
\x9e

etc.

I note that Vi treats each of these "\x9f"-style sequences as one
character.

My SED script calls a text file that has a list of changes:
s/ö/o/
s/ü/u/
etc.

I am unable to type in these characters that start with "\x"s in the
same manner in this file.

How do I create these in the text file using Vi in a manner that the
SED would interpret them correctly?

Deguza

Moi

Jun 13, 2009, 1:44:17 PM

I am not quite sure what you mean. It might be that vi(m) interprets
these byte values as being part of UTF-8 sequences (in which case your
search and replace should also operate on the _whole_ sequence).

Please repost with (part of) a hexdump of the file.
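
A hex dump of this kind can be produced with od(1). The sketch below is only an illustration: the sample bytes are an assumption (the word quoted later in the thread, with 0xC4 0x9F in the middle), and dump_hex is a hypothetical helper name, not something from the original post.

```shell
# dump_hex: print stdin as hex bytes, one byte per field.
#   -An   suppress the offset column
#   -tx1  show each byte as two hex digits
dump_hex() {
    od -An -tx1
}

# Octal escapes are used because printf's \xHH is not guaranteed by POSIX.
# \304\237 is the byte pair 0xC4 0x9F.
printf 'Ba\304\237dad\n' | dump_hex
```

If the two bytes c4 9f show up next to each other, the file is most likely UTF-8 and any replacement has to match both bytes together.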

HTH,
AvK

Aaron W. Hsu

Jun 13, 2009, 2:12:14 PM

Kompu Kid <deg...@hotmail.com> writes:

>I have a SED script that strips the non-ASCII characters in foreign
>texts I download.

You may also want to consider tr(1).

>I am unable to type in these characters that start with "\x"s in the
>same manner in this file.

>How do I create these in the text file using Vi in a manner that the
>SED would interpret them correctly?

In Vi (at least in Vim) you can do something like ^Vx[0-9a-f][0-9a-f]
in insert mode (that is, Ctrl-V, then x) to create a character with the
two hexadecimal digits you follow the command with.
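
If entering the bytes in the editor proves awkward, the rule file can also be written from the shell with printf, which sidesteps the typing problem entirely. This is a rough sketch under a few assumptions: the input is UTF-8, the byte pair 0xC4 0x9F should become a plain g, and the names rules and fix are made up for the illustration.

```shell
# Temporary file holding the sed rules; removed when the shell exits.
rules=$(mktemp)
trap 'rm -f "$rules"' EXIT

# Write the rule with printf so the raw bytes 0xC4 0x9F land in the file.
# Octal escapes (\304\237) are used because printf's \xHH is not portable.
# The trailing g flag replaces every occurrence on a line, not just the first.
printf 's/\304\237/g/g\n' > "$rules"

# LC_ALL=C makes sed treat its input as plain bytes, so the two-byte
# sequence is matched literally whatever the terminal locale is.
fix() {
    LC_ALL=C sed -f "$rules"
}

printf 'Ba\304\237dad\n' | fix    # prints Bagdad
```

More rules can be appended to the same file with further printf lines, one per byte sequence to replace.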

--
Aaron W. Hsu <arc...@sacrideo.us> | <http://www.sacrideo.us>
"Government is the great fiction, through which everybody endeavors to
live at the expense of everybody else." -- Frederic Bastiat
+++++++++++++++ ((lambda (x) (x x)) (lambda (x) (x x))) ++++++++++++++

Bill Marcum

Jun 13, 2009, 5:28:17 PM

["Followup-To:" header set to comp.unix.shell.]

You could process the files with iconv or recode before running them
through sed. And you could use tr to translate any "unknown" characters,
for example: tr '\200-\377' '.'
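
To illustrate what the tr command above does to UTF-8 input, here is a small sketch. It assumes GNU or BSD tr (both pad the short second set with its last character), and the function names are hypothetical. tr works on single bytes here, so one accented letter stored as two bytes turns into two dots.

```shell
# Replace every byte in the range 0x80-0xFF with a dot.
# LC_ALL=C keeps tr working on single bytes regardless of the locale.
mask_non_ascii() {
    LC_ALL=C tr '\200-\377' '.'
}

# Variant: delete those bytes instead of masking them.
drop_non_ascii() {
    LC_ALL=C tr -d '\200-\377'
}

printf 'Ba\304\237dad\n' | mask_non_ascii   # prints Ba..dad
printf 'Ba\304\237dad\n' | drop_non_ascii   # prints Badad
```

Neither variant recovers the base letter, so this is more useful for spotting or discarding unknown bytes than for transliteration.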

Kompu Kid

Jun 13, 2009, 10:56:20 PM

I think the text *is* in UTF-8.

Here is what one word looks like:

- How it looks in vi: BaD\x9fdad
- How it looks viewed through a Google Chrome browser on a PC: Bağdad
(I guess it means "Bagdad").

Deguza

Stephane CHAZELAS

Jun 14, 2009, 12:26:00 PM

2009-06-13, 19:56(-07), Kompu Kid:
[...]

> I think the text *is* in UTF-8.
>
> Here is what one word:
>
> - How it looks in vi: BaD\x9fdad

> - How it looks viewed through a Google Chrome browser on a PC: Bağdad

> (I guess it means "Bagdad").

[...]

If it's utf8, then

recode -f u8..flat

should do.

If you don't have recode, you could try:

PERLIO=:utf8 perl '-MUnicode::Normalize decompose' -pe '
$_=decompose($_);s/\pM//g'

You'd need a reasonably recent version of perl though.

$ print 'Ba\xC4\x9Fdad' | recode u8..dump-with-names
UCS2 Mne Description

0042 B latin capital letter b
0061 a latin small letter a
011F g( latin small letter g with breve
0064 d latin small letter d
0061 a latin small letter a
0064 d latin small letter d
000A LF line feed (lf)

$ print 'Ba\xC4\x9Fdad' | recode -f u8..flat
Bagdad

$ print 'Ba\xC4\x9Fdad' | PERLIO=:utf8 perl '-MUnicode::Normalize decompose' -pe '
pipe quote> $_=decompose($_);s/\pM//g'
Bagdad

--
Stéphane