sed for removing non ASCII characters

quarkLore

Mar 6, 2009, 4:31:19 AM
I know the tr solution for the problem but I have to use sed hence the
question here. I have to remove all the bytes from 128 - 255 in my
text file. For which I have following command:

sed 's/[\200-\377]//' text_file

This removes other characters which should not be removed. The reason
I *think* is the range specified does not take \200 as an octal value
but as three different characters - '\' '2' '0' . Any suggestions of
how to achieve this using sed? I have searched on google for the
solutions but either they solve different problem all together or
suggest usage of tr.

pk

Mar 6, 2009, 4:41:27 AM
On Friday 6 March 2009 10:31, quarkLore wrote:

> I know the tr solution for the problem but I have to use sed hence the
> question here.

Uhm...this is comp.lang.awk.

> I have to remove all the bytes from 128 - 255 in my text file. For which I
> have following command:
>
> sed 's/[\200-\377]//' text_file
>
> This removes other characters which should not be removed. The reason
> I *think* is the range specified does not take \200 as an octal value
> but as three different characters - '\' '2' '0' . Any suggestions of
> how to achieve this using sed? I have searched on google for the
> solutions but either they solve different problem all together or
> suggest usage of tr.

If you have a sed that understands \xnnn (like eg GNU sed), you can do

sed 's/[\x80-\xff]//' text_file

also, prefixing the value with \o or \d may work (tested with GNU sed):

sed 's/[\o200-\o377]//' text_file

sed 's/[\d128-\d255]//' text_file

pk

Mar 6, 2009, 4:48:00 AM
On Friday 6 March 2009 10:41, pk wrote:

> sed 's/[\x80-\xff]//' text_file

And in all variations, you probably need a /g at the end:

sed 's/[\x80-\xff]//g' text_file
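
A minimal sketch of the hex-range approach above, assuming GNU sed (the \xHH escapes are a GNU extension). The helper name strip_high_bytes is hypothetical. Forcing the C locale is an added assumption, not something stated in the thread: in a UTF-8 locale sed may treat a multibyte character as a single unit, so a byte range may not match as intended.

```shell
# Assumes GNU sed; \x80 and \xff are GNU escape extensions.
# LC_ALL=C asks sed to treat the input as single bytes.
strip_high_bytes() {
    LC_ALL=C sed 's/[\x80-\xff]//g' "$@"
}

# "é" is the two bytes \303\251 in UTF-8; both fall in 128-255.
printf 'caf\303\251 au lait\n' | strip_high_bytes
# prints: caf au lait
```

Called with file names, the helper passes them on to sed; called with none, it filters standard input.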

Ed Morton

Mar 6, 2009, 9:29:27 AM
On Mar 6, 3:31 am, quarkLore <agarwal.prat...@gmail.com> wrote:
> I know the tr solution for the problem but I have to use sed hence the
> question here.

This is comp.lang.AWK. You can get good sed answers at
comp.unix.shell.

Ed.

Edward Rosten

Mar 11, 2009, 11:12:27 AM
On 6 Mar, 09:31, quarkLore <agarwal.prat...@gmail.com> wrote:
> I know the tr solution for the problem but I have to use sed hence the
> question here. I have to remove all the bytes from 128 - 255 in my
> text file. For which I have following command:
>
> sed 's/[\200-\377]//' text_file
>
> This removes other characters which should not be removed. The reason
> I *think* is the range specified does not take \200 as an octal value
> but as three different characters - '\' '2' '0' . Any suggestions of
> how to achieve this using sed?

s/[^ -~]//g

or, more on topic:

{ gsub(/[^ -~]/, "")}1
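
A sketch of the printable-range idea above wrapped as shell helpers. The names printable_only and printable_and_tabs are hypothetical. Note that [^ -~] deletes everything outside the range space through tilde, so tabs and other control characters are removed as well, not only bytes 128-255. LC_ALL=C is an assumption made here so that the range is evaluated by byte value.

```shell
# Keep only the bytes from space (32) through tilde (126).
printable_only() {
    LC_ALL=C awk '{ gsub(/[^ -~]/, "") } 1' "$@"
}

# Variant that also keeps tab characters.
printable_and_tabs() {
    LC_ALL=C awk '{ gsub(/[^\t -~]/, "") } 1' "$@"
}

# The tab and the two bytes of "é" are dropped by the first helper.
printf 'a\tb\303\251c\n' | printable_only
# prints: abc
```

The second helper may be the better fit for tab-separated data, since the first one silently joins the fields.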


> I have searched on google for the
> solutions but either they solve different problem all together or
> suggest usage of tr.

Why do you have to use sed?

-Ed
--
(You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)

/d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage

pekka.le...@gmail.com

Mar 23, 2014, 4:36:46 PM
The following worked for me (warning: it overwrites the file):
sudo apt-get install ssed
ssed -i 's/[\d128-\d255]//g' MYFILE

Hermann Peifer

Mar 23, 2014, 10:13:34 PM
On 2014-03-23 21:36, pekka.le...@gmail.com wrote:
> Following worked for me (warning, overwrites the file)
> sudo apt-get install ssed
> ssed -i 's/[\d128-\d255]//g' MYFILE
>

Funny enough, I was just about to look for (and ask the group about) a
similar approach in AWK. I remember that earlier this millennium, I
already asked about an [[:ascii:]] character class, which doesn't seem
to exist [1]. Occasionally, I would like to have such a class, e.g. for
the below hand-made transliteration of Portuguese UTF-8 street names
into UPPERCASE US-ASCII [2].

Hermann

PS for Pekka: I guess you know that there are non-ASCII chars beyond
position 255 (obviously not in single-byte encodings, but in others).

[1] https://groups.google.com/forum/#!topic/comp.lang.awk/yT2AhTy_0hk

[2]

{
# Do replacements according
# to some non-iconv logic

$0 = toupper($0)
gsub(/[ÀÁÂÃÄª]/, "A")
gsub(/Ç/, "C")
gsub(/[ÈÉÊ]/, "E")
gsub(/Í/, "I")
gsub(/Ñ/, "N")
gsub(/[ÓÔÕÖ°º]/, "O")
gsub(/[ÚÜ]/, "U")
gsub(/ß/, "SS")

# Handle all the rest
# gsub(/[^[:ascii:]]/, "") # ???

print
}


Hermann Peifer

Mar 25, 2014, 12:53:46 AM
On 2014-03-24 3:13, Hermann Peifer wrote:
>
> # Handle all the rest
> # gsub(/[^[:ascii:]]/, "") # ???
>

Silly me: gsub(/[^\x00-\x7F]/, "")

Hermann

david...@averinformatics.com

Aug 19, 2014, 3:35:23 PM

>
> This removes other characters which should not be removed. The reason
> I *think* is the range specified does not take \200 as an octal value
> but as three different characters - '\' '2' '0' . Any suggestions of
> how to achieve this using sed? I have searched on google for the
> solutions but either they solve different problem all together or
> suggest usage of tr.

This is not a sed answer, but it may help someone who ends up here.

Similar replacement techniques:
tr -cs '[a-zA-Z0-9",\n-_)(*&^%$#@)!]' ' ' < inputfilename.txt > outputFileFIXED.txt

grep -P '[^\x00-\x7f]' inputfilename.txt > outputFileFIXED.txt
(Note that the grep line only selects the lines that contain non-ASCII characters; it does not remove anything from them.)


I have been trying to get the following to work, and maybe somebody with more experience can comment. It is meant to replace the tr line:
awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' inputfile.txt > CONVERTED_FILE.txt

-- test with this
echo 'test:;,.+=/?<>a-z`A-Z~0-9\"!@#$%^&*_\[](){}' | awk '{ sub("[^a-z`A-Z~0-9:;,.+=/?\"!@#$%^&*_<>\[](){}", ""); print }'
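
One possible reason the awk attempt above misbehaves, offered as a sketch rather than a definitive fix: inside a bracket expression a ] is only literal when it comes first (right after the ^), so the \[] in the middle closes the list early; and sub replaces only the first match on each line, where gsub replaces all of them. The helper name keep_listed is hypothetical, and LC_ALL=C is assumed so that ranges such as a-z go by byte value.

```shell
# Delete every byte that is NOT in the listed set.
# "]" is placed first inside the brackets so it is taken literally,
# and a regexp literal (/.../) avoids string-escape surprises.
keep_listed() {
    LC_ALL=C awk '{ gsub(/[^][a-zA-Z0-9"!@#$%^&*|_(){}]/, ""); print }' "$@"
}

# ";" and the two bytes of "é" are not in the set, so they are removed.
printf 'ab[c]d;e\303\251(f)\n' | keep_listed
# prints: ab[c]de(f)
```

Spaces are not in the set either, so this sketch deletes them; add a space inside the brackets if they should survive.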

igor__

Aug 19, 2014, 10:24:21 PM


On Tue, 19 Aug 2014, david...@averinformatics.com wrote:

>
>>
>> This removes other characters which should not be removed. The reason
>> I *think* is the range specified does not take \200 as an octal value
>> but as three different characters - '\' '2' '0' . Any suggestions of
>> how to achieve this using sed? I have searched on google for the
>> solutions but either they solve different problem all together or
>> suggest usage of tr.
>

<snip>

> This can replace the TR line
> awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' inputfile.txt > CONVERTED_FILE.txt
>
> -- test with this
> echo 'test:;,.+=/?<>a-z`A-Z~0-9\"!@#$%^&*_\[](){}' | awk '{ sub("[^a-z`A-Z~0-9:;,.+=/?\"!@#$%^&*_<>\[](){}", ""); print }'

Some (most?) awk implementations are not fully binary safe; in particular,
they don't handle \0 as you might expect in this sub, so the above script
wouldn't be portable. It should work perfectly with gawk, but would
fail on \0 in mawk, for example.


Hermann Peifer

Aug 20, 2014, 1:59:10 PM
On 2014-08-19 21:35, david...@averinformatics.com wrote:
>
>
> I have been trying to get this to work.. and maybe somebody with more experience can comment:
> This can replace the TR line
> awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' inputfile.txt > CONVERTED_FILE.txt
>

Range interpretation depends on the gawk version you are using, and on
your locale; see this info in the GAWK manual:

> A.8 Regexp Ranges and Locales: A Long Sad Story
http://www.gnu.org/software/gawk/manual/gawk.html#Ranges-and-Locales

The latest GAWK manual in Git Repo says:

Some utilities that match regular expressions provide a non-standard
`[:ascii:]' character class; `awk' does not. However, you can simulate
such a construct using `[\x00-\x7F]'. This matches all values
numerically between zero and 127, which is the defined range of the
ASCII character set. Use a complemented character list
(`[^\x00-\x7F]') to match any single-byte characters that are not in
the ASCII range.


Hermann
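
A sketch of the complemented-range idiom quoted above, written with octal escapes on the assumption that not every awk accepts \x escapes in a regexp. The name ascii_only is hypothetical. The range starts at \001 rather than \000 to sidestep the NUL handling concern raised earlier in the thread, and LC_ALL=C is assumed so that the data is processed byte by byte.

```shell
# Remove every byte outside 1-127 (octal \001-\177).
ascii_only() {
    LC_ALL=C awk '{ gsub(/[^\001-\177]/, ""); print }' "$@"
}

# "ã" is the two bytes \303\243 in UTF-8; both are removed.
printf 'S\303\243o Jo\303\243o\n' | ascii_only
# prints: So Joo
```

This strips accented letters outright; transliterating them first, as in the street-name script earlier in the thread, keeps more of the text readable.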