How to find and replace non-breaking spaces in a text file


Amin Farajian

Jun 30, 2012, 7:57:22 AM
to Persian Computing
Hi all,
I am trying to find non-breaking spaces in a UTF-8 encoded text file that contains some Persian text. How can I do this using bash or Perl (bash is preferred)? Additionally, I want to replace these characters with a standard space. How can I do that?

Bests,
M. Amin

Behnam Esfahbod

Jun 30, 2012, 2:49:28 PM
to Amin Farajian, Persian Computing
Amin,

NBSP is <0xC2, 0xA0> in UTF-8.

Finding NBSP in a file:
grep -P '\xc2\xa0' sample.txt

Replacing it with something else:
sed 's/\xc2\xa0/something_else/g' sample.txt
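
For the original question (replacing NBSP with a standard space), here is a minimal sketch of the same find-and-replace idea. It assumes a POSIX shell with grep and sed. The file names nbsp_demo.txt and nbsp_demo.fixed.txt are made up for the demonstration, and the bytes are built with octal escapes (\302\240 is 0xC2 0xA0) because not every printf or sed understands \xHH. LC_ALL=C is set so that both tools match raw bytes regardless of the current locale.

```shell
# NBSP as raw bytes: 0xC2 0xA0, written as octal escapes for POSIX printf.
nbsp=$(printf '\302\240')

# Hypothetical demo file: one line with two NBSPs, one line without any.
printf 'foo\302\240bar\302\240baz\nplain line\n' > nbsp_demo.txt

# Find: print the lines that contain an NBSP, with their line numbers.
LC_ALL=C grep -n "$nbsp" nbsp_demo.txt

# Replace: turn every NBSP into an ordinary space.
# The trailing "g" makes sed replace all matches on a line, not just the first.
LC_ALL=C sed "s/$nbsp/ /g" nbsp_demo.txt > nbsp_demo.fixed.txt
```

Writing to a second file keeps the original untouched, so the result can be checked before anything is overwritten.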

-Behnam





--
    '     بهنام اسفهبد
    '     Behnam Esfahbod
   '      http://behnam.esfahbod.info
  *  ..   Persian Internet Society
 *  `  *  http://persian-isoc.org
  * o *   3E7F B4B6 6F4C A8AB 9BB9 7520 5701 CA40 259E 0F8B

Behnam Esfahbod

Jun 30, 2012, 4:13:21 PM
to Amin Farajian, Persian Computing
Amin jan,

(CCing the list)

First, what do you mean by "no break space"? Do you mean "faasele-ye majaazi", which is named U+200C ZERO WIDTH NON-JOINER (ZWNJ) in Unicode? The thing is that there is no U+00A0 NBSP character in the attached text file.


I also noticed that your file has a BOM (https://en.wikipedia.org/wiki/Byte_order_mark). Have a look at the Unicode FAQ about the BOM in UTF encodings: http://unicode.org/faq/utf_bom.html

<0xC2, 0xA0> is the UTF-8 encoding for character U+00A0. You can find the algorithm online (like https://en.wikipedia.org/wiki/UTF-8) or you can use some application or website (like fileformat.info) to get the UTF-8 byte sequence for each character, like: http://www.fileformat.info/info/unicode/char/a0/index.htm
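
To make the algorithm concrete, here is a small sketch of the two-byte case in shell arithmetic. The function name utf8_two_bytes is hypothetical, and it only covers code points from U+0080 to U+07FF, which is enough for NBSP and for the basic Arabic block; three-byte characters such as ZWNJ follow the same idea with one more continuation byte.

```shell
# Hypothetical helper: print the UTF-8 bytes of a code point in U+0080..U+07FF.
# Layout of a two-byte sequence: 110xxxxx 10xxxxxx
utf8_two_bytes() {
    cp=$(( $1 ))                    # accepts hex input such as 0x00A0
    lead=$(( 0xC0 | (cp >> 6) ))    # top 5 bits go into the leading byte
    cont=$(( 0x80 | (cp & 0x3F) ))  # low 6 bits go into the continuation byte
    printf '%02X %02X\n' "$lead" "$cont"
}

utf8_two_bytes 0x00A0   # NBSP, prints: C2 A0
utf8_two_bytes 0x0627   # ARABIC LETTER ALEF, prints: D8 A7
```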

Anyway, the commands I noted can be used on UTF-8 files, as long as you pass them the escaped UTF-8 byte sequence of the character from bash. You may want to try them out with a visible character like U+0627 ARABIC LETTER ALEF ( http://www.fileformat.info/info/unicode/char/0627/index.htm ), then use them with ZWNJ, NBSP or BOM.
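
As a rough sketch of that experiment, the snippet below builds a small test file and checks it for ALEF, ZWNJ, NBSP and the BOM in turn. It assumes a POSIX shell with grep and sed. The file names probe.txt and probe.nobom.txt are made up, and the byte sequences are written as octal escapes: ALEF is D8 A7, ZWNJ is E2 80 8C, NBSP is C2 A0 and the UTF-8 BOM is EF BB BF.

```shell
# Byte sequences as octal escapes, so any POSIX printf can produce them.
alef=$(printf '\330\247')      # U+0627 ARABIC LETTER ALEF = D8 A7
zwnj=$(printf '\342\200\214')  # U+200C ZWNJ               = E2 80 8C
nbsp=$(printf '\302\240')      # U+00A0 NBSP               = C2 A0
bom=$(printf '\357\273\277')   # U+FEFF as a UTF-8 BOM     = EF BB BF

# Hypothetical probe file: BOM, ALEF, ZWNJ, ALEF, NBSP, ALEF, newline.
printf '%s%s%s%s%s%s\n' "$bom" "$alef" "$zwnj" "$alef" "$nbsp" "$alef" > probe.txt

# Start with the visible character, then the invisible ones.
# Each command prints the number of lines containing that sequence.
LC_ALL=C grep -c "$alef" probe.txt
LC_ALL=C grep -c "$zwnj" probe.txt
LC_ALL=C grep -c "$nbsp" probe.txt
LC_ALL=C grep -c "$bom"  probe.txt

# Strip the BOM, but only at the very start of the first line.
LC_ALL=C sed "1s/^$bom//" probe.txt > probe.nobom.txt
```

Anchoring the BOM pattern to the start of line 1 avoids touching a U+FEFF that might legitimately appear later in the text.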

Best,
-Behnam