
Using grep to find non-ASCII text


Noob

Oct 3, 2014, 7:47:00 AM
Hello,

I've downloaded ~100 files encoded in Windows-1251.
https://en.wikipedia.org/wiki/Windows-1251
> Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character
> encoding, designed to cover languages that use the Cyrillic script.

(CP1251 matches ASCII for values less than 128.)

I am trying to use grep to find lines containing "non-ASCII" characters,
i.e. values 128-255.

According to the following discussion, I should be able to use PCRE
in GNU grep (the -P option), as in

grep -P "[\x80-\xFF]" file

but this does not work for me :-(

https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters-in-unix

$ hexdump.exe -C test.txt
00000000  54 45 53 54 0a 4e 61 6d  65 3a 20 cf f3 f1 f2 ee  |TEST.Name: .....|
00000010  3b 0a                                             |;.|
00000012
$ grep -P "[\x80-\xFF]" test.txt
$ echo $?
1

There clearly are values >= 128 in the file. What am I doing wrong?
(I should note that I am using Cygwin, not a "real" env.)

Perhaps there are other tools, better suited for this task?
(awk might be useful, but I've never used it for serious work.)

Regards.

Underactive Moth

Oct 3, 2014, 8:17:36 AM
I suspect your locale setting is affecting the way grep
determines the characters in its input. Try:
LANG=C grep -P "[\x80-\xFF]" test.txt
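
A minimal sketch of that fix, assuming GNU grep (the -P line also assumes it was built with PCRE support). It recreates the 18 bytes from the hexdump so the result can be checked without the original files. In a UTF-8 locale the bytes cf f3 f1 f2 ee do not form valid UTF-8, which is most likely why the original command printed nothing. LC_ALL is used here instead of LANG because LC_ALL overrides LANG, so it works even when LC_ALL is already set in the environment. The second grep is an alternative, not something from this thread.

```shell
# Recreate the sample from the hexdump (octal escapes are portable in printf).
printf 'TEST\nName: \317\363\361\362\356;\n' > test.txt

# The suggested fix: in the C locale grep treats every byte as one character.
# Wrapped in "if" so a grep without -P support only prints an error.
if LC_ALL=C grep -n -P '[\x80-\xFF]' test.txt; then
    echo "non-ASCII bytes found"
fi

# Alternative without -P: in the C locale, any byte that is neither
# printable ASCII nor whitespace matches. Note that this also flags
# ASCII control characters, not only bytes 128-255.
LC_ALL=C grep -n '[^[:print:][:space:]]' test.txt
```

Both commands should report line 2 of test.txt. The matched line is printed as raw CP1251 bytes, so it may look garbled in a UTF-8 terminal.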

Noob

Oct 3, 2014, 8:43:03 AM
On 03/10/2014 14:17, Underactive Moth wrote:
Yeees! You nailed it! You win a free hug.

(LANG is set to en_US.UTF-8 here.)

Locales, charsets, code pages, LANG, LC_ALL, Unicode, UCS, UTF...
All this stuff has my head spinning.

The only way I found to display the file contents in my terminal is:
process file | iconv -f CP1251
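
A sketch of the same idea applied to a whole file, assuming an iconv that knows the CP1251 name (glibc's does). The file names are made up for illustration. Passing -t UTF-8 makes the output encoding explicit instead of depending on the current locale. Once a UTF-8 copy exists, tools running under en_US.UTF-8 can read it directly.

```shell
# Sample file in CP1251 (same bytes as test.txt in the first post).
printf 'TEST\nName: \317\363\361\362\356;\n' > sample-cp1251.txt

# Convert to UTF-8; the original file is left untouched.
iconv -f CP1251 -t UTF-8 sample-cp1251.txt > sample-utf8.txt

# Each of the five Cyrillic letters takes two bytes in UTF-8,
# so the copy is five bytes larger than the original.
wc -c sample-cp1251.txt sample-utf8.txt

# Displays correctly in a UTF-8 terminal.
cat sample-utf8.txt
```

For many files the iconv line can go inside a for loop over the file names, writing each converted copy to a separate directory.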

I couldn't find any way to set my terminal to CP1251.
(I suppose other tools expect to output UTF-8.)

Regards.
