I have a SED script that strips the non-ASCII characters in foreign
texts I download.
When the characters are ones I can see in my BSD environment, such as
o with an umlaut (o with two dots on top), I simply search for them
and replace them with regular o's.
However, recently I have been running into characters represented in
the following fashion:
\x9f
\x87a
\x9e
etc.
I note that Vi treats each of these "\x.." sequences as one
character.
My SED script calls a text file that has a list of changes:
s/ö/o/
s/ü/u/
etc.
I am unable to type in these characters that start with "\x"s in the
same manner in this file.
How do I create these in the text file using Vi in a manner that the
SED would interpret them correctly?
Deguza
I am not quite sure what you mean. It might be that vi(m) interprets
these byte values as being part of UTF-8 sequences. (In that case your
search and replace should also operate on the _whole_ sequence.)
Please repost with (part of) a hexdump of the file.
HTH,
AvK
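To illustrate the point about whole sequences, here is a minimal Python sketch. It assumes the character in question is stored as the two bytes 0xC4 0x9F; the variable names are made up for the example.

```python
# Assumed raw bytes of one word from the file: 0xC4 0x9F together
# encode a single non-ASCII character in UTF-8.
raw = b"Ba\xc4\x9fdad"

# A hexdump-like view, one pair of hex digits per byte (Python 3.8+).
print(raw.hex(" "))

# Decoded as UTF-8, the two bytes collapse into one character.
text = raw.decode("utf-8")
print(f"{len(raw)} bytes, {len(text)} characters")

# Decoded as Latin-1, every byte becomes a character of its own,
# which is roughly what an editor unaware of UTF-8 would show.
print(len(raw.decode("latin-1")))
```

A rule that matches only the \x9f byte would leave the preceding 0xC4 byte behind, so the replacement has to cover both bytes.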
>I have a SED script that strips the non-ASCII characters in foreign
>texts I download.
You may also want to consider tr(1).
>I am unable to type in these characters that start with "\x"s in the
>same manner in this file.
>How do I create these in the text file using Vi in a manner that the
>SED would interpret them correctly?
In Vi you can do something like [0-9a-f][0-9a-f] to create a character
with the two hexadecimal digits you follow the command with.
--
Aaron W. Hsu <arc...@sacrideo.us> | <http://www.sacrideo.us>
"Government is the great fiction, through which everybody endeavors to
live at the expense of everybody else." -- Frederic Bastiat
+++++++++++++++ ((lambda (x) (x x)) (lambda (x) (x x))) ++++++++++++++
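For comparison, a character-for-character mapping in the spirit of tr(1) can be sketched in Python. The table below is hypothetical and only mirrors the s/ö/o/ and s/ü/u/ rules from the question, plus one more entry; it assumes the text has already been decoded to a string.

```python
# Hypothetical mapping, mirroring the list of sed rules in the question.
table = str.maketrans({"ö": "o", "ü": "u", "ğ": "g"})

def replace_listed(text: str) -> str:
    """Replace each character listed in the table; leave the rest alone."""
    return text.translate(table)

print(replace_listed("Bağdad, Köln, Zürich"))  # -> Bagdad, Koln, Zurich
```

The catch is the same as with the list of sed rules: every character has to be listed by hand.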
I think the text *is* in UTF-8.
Here is one word:
- How it looks in vi: BaD\x9fdad
- How it looks viewed through a Google Chrome browser on a PC: Bağdad
(I guess it means "Bagdad").
Deguza
If it's utf8, then
recode -f u8..flat
should do.
If you don't have recode, you could try:
PERLIO=:utf8 perl '-MUnicode::Normalize decompose' -pe '
$_=decompose($_);s/\pM//g'
You'd need a reasonably recent version of perl though.
$ print 'Ba\xC4\x9Fdad' | recode u8..dump-with-names
UCS2 Mne Description
0042 B latin capital letter b
0061 a latin small letter a
011F g( latin small letter g with breve
0064 d latin small letter d
0061 a latin small letter a
0064 d latin small letter d
000A LF line feed (lf)
$ print 'Ba\xC4\x9Fdad' | recode -f u8..flat
Bagdad
$ print 'Ba\xC4\x9Fdad' | PERLIO=:utf8 perl '-MUnicode::Normalize decompose' -pe '
pipe quote> $_=decompose($_);s/\pM//g'
Bagdad
--
Stéphane
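The decompose-and-strip idea used in the perl one-liner above can also be sketched in Python with the standard unicodedata module. This is only a rough equivalent, not a drop-in replacement, and the function name is made up for the example.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose characters, then drop the combining marks."""
    # NFD splits e.g. "g with breve" into "g" plus a combining breve.
    decomposed = unicodedata.normalize("NFD", text)
    # Categories Mn, Mc and Me are the "Mark" classes, like \pM in perl.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

print(strip_marks("Bağdad"))  # -> Bagdad
```

Characters that have no decomposition, such as ø or ß, pass through unchanged, so some non-ASCII text may still remain after this step.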