
locate lines with 8-bit character(s) in a text file


Harry

Jul 3, 2014, 2:18:59 PM
I have a .csv file that gets uploaded to an Oracle database.
If any 8-bit character is present, it causes problems in the application using the database.
So I would like to run a grep/awk search on the .csv file before the database upload.

Any advice on how I could achieve this?

E.g., instead of grepping for a known string, I want something like grep/awk [8-bit chars] file.

$ grep -iH Cytometry file.csv
file.csv:Pathology,Pathology,Pathology,,,0,0,0,,0,0,,0,,CW▒Flow▒Cytometry,1,y

Harry

Jul 3, 2014, 9:33:48 PM
Harry wrote...

>E.g., instead of grepping for a known string, I want something like grep/awk [8-bit chars] file.
>
>$ grep -iH Cytometry file.csv
>file.csv:Pathology,Pathology,Pathology,,,0,0,0,,0,0,,0,,CW▒Flow▒Cytometry,1,y

Never mind...

Found the answer on the web.

$ LC_ALL=C command -p awk '/[\200-\377]/' file.csv
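
For reference, here is a variant of the one-liner above that also reports where the matches are. It is only a minimal sketch and not part of the original answer: the function name find_8bit is made up for illustration, and it assumes an awk that accepts octal escapes inside a regular expression, as the command above already does.

```shell
# find_8bit is a hypothetical helper name, not something from the thread.
# It prints file:line:content for every line holding a byte >= 0x80 and
# returns 0 if at least one such line was found, 1 otherwise.
find_8bit() {
    # LC_ALL=C makes awk compare raw bytes instead of decoding multibyte text.
    LC_ALL=C awk '
        /[\200-\377]/ { printf "%s:%d:%s\n", FILENAME, FNR, $0; found = 1 }
        END { exit !found }
    ' "$@"
}
```

Something like `find_8bit file.csv || echo "no 8-bit bytes found"` could then run before the upload step.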

Wayne

Jul 4, 2014, 1:03:56 AM
On 7/3/2014 2:18 PM, Harry wrote:
> I have a .csv file that gets uploaded to an Oracle database.
> If any 8-bit character is present, it causes problems in the application using the database.
> So I would like to run a grep/awk search on the .csv file before the database upload.
>
> Any advice on how I could achieve this?

alias nonascii='LC_ALL=C grep -q "[^[:print:]]"'

if nonascii some-file; then ...; fi

--
Wayne
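
Two things to watch with the alias above: under LC_ALL=C, "[^[:print:]]" also matches control characters such as tabs and carriage returns, not only 8-bit bytes, and bash by default does not expand aliases in non-interactive scripts. A minimal sketch of a function form that matches only bytes 0x80-0xFF follows. The name has_8bit is hypothetical, and the pattern relies on the C locale classifying every byte below 0x80 as either printable or control.

```shell
# has_8bit is a hypothetical name. Exit status 0 means the file contains
# at least one byte in the range 0x80-0xFF; 1 means no such byte was found.
has_8bit() {
    # In the C locale, bytes 0x00-0x7F are all either [:print:] or [:cntrl:],
    # so the negated class is left with only the high-bit bytes.
    LC_ALL=C grep -q '[^[:print:][:cntrl:]]' "$1"
}
```

It is used the same way as the alias: `if has_8bit some-file; then ...; fi`.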

Chris Nehren

Jul 4, 2014, 2:00:01 PM
On 2014-07-03, Harry scribbled these curious markings:
> I have a .csv file that gets uploaded to an Oracle database.
> If any 8-bit character is present, it causes problems in the
> application using the database. So I would like to run a
> grep/awk search on the .csv file before the database upload.
>
> Any advice on how I could achieve this?

The better solution is to fix the application so that it can
function in a world of 8-bit characters. Or, judging from the
fact that you said 'Oracle', beg and plead with the vendor to fix
it, as it probably isn't open source.

--
Chris Nehren

Harry

Jul 4, 2014, 10:42:09 PM
Chris Nehren wrote...

>The better solution is to fix the application so that it can
>function in a world of 8-bit characters. Or, judging from the
>fact that you said 'Oracle', beg and plead with the vendor to fix
>it, as it probably isn't open source.

It's a data mismatch rather than an application issue.
The 8-bit characters were probably introduced by copy-and-paste
errors on Windows, where some 8-bit characters ended up in place
of the space character.

frank.w...@gmail.com

Jul 5, 2014, 5:26:26 AM
From harryooopotter:
>It's a data mismatch rather than an application issue.
>The 8-bit characters were probably introduced by copy-and-paste
>errors on Windows, where some 8-bit characters ended up in place
>of the space character.

Then, as another option, it might be sufficient to replace
all "\xA0" with "\x20" in all source files. A separate 'grep'
check costs an extra read of each file, which is about what a
'sed' pass would cost anyway, so the replacement could be done
in that single pass.

Frank
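
A minimal sketch of that replacement, assuming the stray character really is the single byte 0xA0 (the no-break space in Latin-1 and Windows-1252), which the thread does not confirm. It uses 'tr' rather than the 'sed' mentioned above, because "\xA0"-style escapes in 'sed' are a GNU extension while octal escapes in 'tr' are standard. The function name nbsp_to_space is made up for illustration.

```shell
# nbsp_to_space is a hypothetical helper. It copies the file named by $1
# to the path in $2, turning each 0xA0 byte (octal 240) into an ASCII space.
nbsp_to_space() {
    # LC_ALL=C keeps tr working on single bytes regardless of the user's locale.
    LC_ALL=C tr '\240' ' ' < "$1" > "$2"
}
```

If the file is UTF-8, a no-break space is the two-byte sequence 0xC2 0xA0, and this byte-wise replacement would leave the 0xC2 behind, so it is worth checking the encoding first.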

Chris Nehren

Jul 5, 2014, 10:59:52 AM
On 2014-07-05, Harry scribbled these curious markings:
... ouch. I'm sorry. I would suggest putting some sort of
sanitization filter on the input going forward so this kind of
thing can't happen again.

--
Chris Nehren