
UTF-8, ascii and awk lookups


Sivaram Neelakantan

Sep 28, 2018, 2:22:43 PM

When I tried to use one file as a lookup in awk, everything worked
except the first field. Since I had already been bitten by the
UNIX/DOS line-ending issue, I checked the following.
--8<---------------cut here---------------start------------->8---

$ file columns_from_dataset.txt
columns_from_dataset.txt: UTF-8 Unicode (with BOM) text
$ file p_generic.txt
p_generic.txt: ASCII text

--8<---------------cut here---------------end--------------->8---

I thought that was fine and went ahead, but found it still failing. So
I checked the bytes in it:
--8<---------------cut here---------------start------------->8---
$ head -1 columns_from_dataset.txt |od -bc
0000000 357 273 277 163 153 171 167 012
357 273 277 s k y w \n

$ head -1 p_generic.txt |od -bc
0000000 163 153 171 012
s k y \n
--8<---------------cut here---------------end--------------->8---
I don't know where those 3 bytes came from, and I believe they might be
the cause. I fired up Emacs and placed the cursor on the 's' in the
dataset.txt file, and got only this (C-x =), which matches the above.

Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0

I receive these files from different systems (mac, AIX and linux) and of
course XLS from WIN10. How do I fix this kind of weirdness in
scripts? I currently only check linefeeds and fix those.


sivaram
--

Janis Papanagnou

Sep 28, 2018, 3:50:14 PM
Are you asking where the bytes 's', 'k', and 'y' come from?
I'd say they obviously come from your data file.
What exactly do you consider to be weird?
Above you say you have issues with the first field, and your file
obviously has byte markers ("UTF-8 Unicode (with BOM) text") in it.
Do you maybe just want to remove those markers before processing?
Generally, if you have problems with data formats from different
sources, then convert them when importing them to your native system.

Janis


Ben Bacarisse

Sep 28, 2018, 4:51:01 PM
Sivaram Neelakantan <nsivar...@gmail.com> writes:

> When I tried to use one file as a look up in awk, everything worked
> except the first field. Since I had already been bitten by UNIX/DOS
> file endings issue, I checked the following.
>
>
> $ file columns_from_dataset.txt
> columns_from_dataset.txt: UTF-8 Unicode (with BOM) text

This is key...

> $ file p_generic.txt
> p_generic.txt: ASCII text
>
>
> I thought that was fine and went ahead and found it still failing. So
> I checked bytes in it
>
> $ head -1 columns_from_dataset.txt |od -bc
> 0000000 357 273 277 163 153 171 167 012
> 357 273 277 s k y w \n
>
> $ head -1 p_generic.txt |od -bc
> 0000000 163 153 171 012
> s k y \n
>
> Don't know where those 3 bytes came from and I believe that might be
> the cause.

It's a Byte Order Mark (or BOM). It's a character that is supposed to
be ignored but can be used to tell if a file is big- or little-endian.
This only matters for UTF-16 and UTF-32 (AKA UCS4). There is only one
byte order for UTF-8, so it's pointless to have a BOM and, in fact, it
is not considered correct to write UTF-8 that has one. However, it's
not unusual for one to remain after a file is converted, and some software
generates fresh UTF-8 files with one because it acts like an "I'm
Unicode" flag.

> I fired up Emacs and placed the cursor on 's' in
> dataset.txt file and got only this(C-x = ) which matches the above.
>
> Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0

That's odd. Are you sure the cursor is on the 's' and not on an almost
invisibly thin character preceding it? It's easier to see if you use a
block cursor.

> I receive these files from different systems mac, AIX and linux and of
> course XLS from WIN10. How do I fix these kind of weirdness in
> scripts? I currently only check linefeeds and fix those.

You need to remove that rogue BOM.
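
Something like this is usually enough (a rough sketch, assuming the file
really does start with the three bytes EF BB BF; the embedded printf just
produces those bytes, which keeps the sed expression portable, and
fixed.txt is only a placeholder name):

    # strip a leading UTF-8 BOM from line 1 only
    sed "1s/^$(printf '\357\273\277')//" columns_from_dataset.txt > fixed.txt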

--
Ben.

Helmut Waitzmann

Sep 28, 2018, 9:53:01 PM
Sivaram Neelakantan <nsivar...@gmail.com>:
>When I tried to use one file as a look up in awk, everything worked
>except the first field. Since I had already been bitten by UNIX/DOS
>file endings issue, I checked the following.
>--8<---------------cut here---------------start------------->8---
>
>$ file columns_from_dataset.txt
>columns_from_dataset.txt: UTF-8 Unicode (with BOM) text
^^^
|
There is a hint: -------------------------------'

[…]

>$ head -1 columns_from_dataset.txt |od -bc
>0000000 357 273 277 163 153 171 167 012
> 357 273 277 s k y w \n

>Don't know where those 3 bytes came from and I believe that might be
>the cause.

BOM is the byte order mark, see also Wikipedia:

<https://en.wikipedia.org/wiki/Byte_Order_Mark#top>

As UTF-16 is a 16‐bit encoding, whereas in the Unix world we've got
8‐bit bytes, each 16‐bit character is represented by a pair of
8‐bit bytes.

Now, there's the question: should the lower 8 of the 16 bits make up
the first of the two bytes and the upper 8 bits make up the second of
the two bytes (also called little endian), or should the order of the
two bytes be reversed (also called big endian)?

To indicate to the receiver of the file which of these two byte
orders is used, software puts a special 16‐bit Unicode character,
the so‐called byte order mark, at the beginning of the file.

The byte order mark is the 16‐bit value

0xFEFF (hexadecimal) =
65279 (decimal)

And because the byte‐swapped value
0xFFFE (hexadecimal) =
65534 (decimal)

is not a valid Unicode character, when receiving a UTF-16 encoded
file one can determine the intended byte order of the 16‐bit
characters by looking at the first two bytes at the beginning of
the file.

Now, as your file is made up of UTF-8 encoded Unicode characters,
the byte order mark will get encoded in the three bytes

357 273 277 (octal) =
239 187 191 (decimal) =
0xEF 0xBB 0xBF (hexadecimal),

as can be seen in the output of the command above.
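
(A quick sanity check, assuming Iconv is available: re-encoding just those
three bytes as UTF-16BE gives back the 0xFEFF mark.)

    printf '\357\273\277' | iconv -f UTF-8 -t UTF-16BE | od -An -t x1
    # expected output: fe ff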

>I fired up Emacs and placed the cursor on 's' in dataset.txt file
>and got only this(C-x = ) which matches the above.
>
>Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0

Yes. Emacs will either hide the byte order mark from you or
discard it after using it to investigate the byte order.

>I receive these files from different systems mac, AIX and linux and of
>course XLS from WIN10. How do I fix these kind of weirdness in
>scripts? I currently only check linefeeds and fix those.

If there is Recode installed on your system, you could do

recode -- UTF-8..UTF-16BE,UTF-16.. < \
columns_from_dataset.txt | od -bc

or, using Iconv, you could do

iconv -f UTF-8 -t UTF-16BE < \
columns_from_dataset.txt | iconv -f UTF-16 | od -bc

to receive the data encoded in the character set according to your
locale setting.

Jorgen Grahn

Sep 28, 2018, 10:43:20 PM
On Fri, 2018-09-28, Helmut Waitzmann wrote:
> Sivaram Neelakantan <nsivar...@gmail.com>:
...
>>I fired up Emacs and placed the cursor on 's' in dataset.txt file
>>and got only this(C-x = ) which matches the above.
>>
>>Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0
>
> Yes. Emacs will either hide the byte order mark from you or
> discard it after using it to investigate the byte order.

Note though (as someone already wrote) that UTF-8 /has/ no byte
order, so Emacs using the BOM to investigate the byte order makes no
sense. The Emacs manual should document what it really does with that
"character".

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Sivaram Neelakantan

Sep 29, 2018, 4:08:19 AM
On Fri, Sep 28 2018,Janis Papanagnou wrote:


[snipped 35 lines]

> Are you asking where the bytes 's', 'k', and 'y' come from?
> I'd say; obviously they come from your data file.
> What exactly you consider to be weird?

Weird because the assoc array lookup in awk fails to match the row;
only the first row, everything else works fine.

> Above you say you have issues with the first field, and your file
> has obviously byte markers ("UTF-8 Unicode (with BOM) text") in it.
> Do you maybe just want to remove those markers before processing?

Yes

> Generally; if you have problems with data formats from different
> sources then convert them when importing them to your native system.
>

Right, then would this be considered a bug in GNU gawk? I save files
as CSV, remove CRLF and generally had things working. I didn't know
what was happening till I investigated every step I did.


sivaram
--

Sivaram Neelakantan

Sep 29, 2018, 4:12:15 AM
On Fri, Sep 28 2018,Ben Bacarisse wrote:


[snipped 27 lines]

>> Don't know where those 3 bytes came from and I believe that might be
>> the cause.
>
> It's a Byte Order Mark (or BOM). It's a character that is supposed to
> be ignored by can be used to tell if a file is big- or little-endian.
> This only matters for UTF-16 and UTF-32 (AKA UCS4). There is only one
> byte order for UTF-8 so it's pointless to have a an BOM and, in fact, it
> is not consider correct to write UTF-8 that has one. However, it's
> not unusual for it remain after a file in converted, and some software
> generated fresh UTF-8 file with them because it acts like an "I'm
> Unicode" flag.
>
>> I fired up Emacs and placed the cursor on 's' in
>> dataset.txt file and got only this(C-x = ) which matches the above.
>>
>> Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0
>
> That's odd. Are you sure the cursor in on the 's' and not on an almost
> invisibly thin character preceding it? It's easier to see if you use a
> block cursor.

Yes, column=0 with the cursor in the place shown above.

>
>> I receive these files from different systems mac, AIX and linux and of
>> course XLS from WIN10. How do I fix these kind of weirdness in
>> scripts? I currently only check linefeeds and fix those.
>
> You need to remove that rogue BOM.

If I do remove them, what would happen to the French chars in
the file?

sivaram
--

Sivaram Neelakantan

Sep 29, 2018, 4:17:41 AM
On Fri, Sep 28 2018,Helmut Waitzmann wrote:


[snipped 53 lines]

>
> Now, as your file is made up of UTF-8 encoded Unicode characters,
> the byte order mark will get encoded in the three bytes
>
> 357 273 277 (octal) =
> 239 187 191 (decimal) =
> 0xEF 0xBB 0xBF (hexadecimal),
>
> as can be seen in the output of the command above.
>
>>I fired up Emacs and placed the cursor on 's' in dataset.txt file
>>and got only this(C-x = ) which matches the above.
>>
>>Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0
>
> Yes. Emacs will either hide the byte order mark from you or
> discard it after using it to investigate the byte order.
>
>>I receive these files from different systems mac, AIX and linux and of
>>course XLS from WIN10. How do I fix these kind of weirdness in
>>scripts? I currently only check linefeeds and fix those.
>
> If there is Recode installed on your system, you could do
>
> recode -- UTF-8..UTF-16BE,UTF-16.. < \
> columns_from_dataset.txt | od -bc
>
> or, using Iconv, you could do
>
> iconv -f UTF-8 -t UTF-16BE < \
> columns_from_dataset.txt | iconv -f UTF-16 | od -bc
>
> to receive the data encoded in the character set according to your
> locale setting.

Now, I'm a bit worried; broadly, I'm going to receive files from
different countries with their data in their native languages. The
database columns p_generic is the only one in ASCII; the rest are
going to be some UTF* files. What should be the base conversion to
CSV for all of them to make it work with awk and bash? Not asking for
code, just an outline of what should be checked before attempting a
lookup.

sivaram
--

Jorgen Grahn

Sep 29, 2018, 6:13:07 AM
On Sat, 2018-09-29, Sivaram Neelakantan wrote:
> On Fri, Sep 28 2018,Helmut Waitzmann wrote:
...
>> If there is Recode installed on your system, you could do
>>
>> recode -- UTF-8..UTF-16BE,UTF-16.. < \
>> columns_from_dataset.txt | od -bc
>>
>> or, using Iconv, you could do
>>
>> iconv -f UTF-8 -t UTF-16BE < \
>> columns_from_dataset.txt | iconv -f UTF-16 | od -bc
>>
>> to receive the data encoded in the character set according to your
>> locale setting.

I'm not sure the examples above are correct: I don't see why they
would remove the BOM, and taking that extra step via UTF-16 is a
problem if there are characters which don't fit in UTF-16. I meant to
ask Helmut earlier, but didn't.

> Now, I'm a bit worried; broadly, I'm going to receive files from
> different countries with their data in their native languages. The
> database columns p_generic is the only one in ASCII. Rest are going
> to be some UTF* files. What should be the base conversion to CSV for
> all of them to make it work with awk and bash? Not asking for code,
> just an outline of what should be checked for before attempting a look
> up.

How about this?

An UTF-8 text can be seen as ASCII with some garbage characters in it.

An XML parser can parse UTF-8 without knowing anything about UTF-8, since
the <tag> stuff is still ASCII, and characters like <, >, = and / don't
appear in the encoded Unicode. That's an explicit feature of UTF-8.

The same should be true for CSV files: the separators (comma,
whitespace or whatever) are still ASCII in an UTF-8 CSV text.
(CSV can be a problematic format to parse because there are so many
variants, but that's a different story.)

/Except/, if the BOM is present you have a problem. The BOM becomes an
extra token prepended to the first CSV "word". And that's why the
UTF-8 BOM is both useless /and/ evil.

So without knowing your exact needs, I propose that you (a sketch of
the whole pipeline follows this list):
- convert to UTF-8 using iconv (or recode)
- remove the BOM if it's present
- do your CSV processing, knowing that the columns you find will be
encoded as UTF-8.
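
Roughly, as a sketch only (the source encoding and the file names are
placeholders you would have to fill in for each sender):

    # 1. normalise whatever you received to UTF-8 (UTF-16 is just an example)
    iconv -f UTF-16 -t UTF-8 received_file > data.utf8
    # 2. drop a leading BOM if one is still there
    sed "1s/^$(printf '\357\273\277')//" data.utf8 > data.csv
    # 3. ordinary CSV/awk processing on the cleaned file
    awk -F, '{ print $1 }' data.csv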

Janis Papanagnou

Sep 29, 2018, 7:42:47 AM
On 29.09.2018 06:08, Sivaram Neelakantan wrote:
> On Fri, Sep 28 2018,Janis Papanagnou wrote:
[...]
> weird because the assoc array look up in awk fails to match the row.
> only the first row, everything else works fine
>
>> Above you say you have issues with the first field, and your file
>> has obviously byte markers ("UTF-8 Unicode (with BOM) text") in it.
>> Do you maybe just want to remove those markers before processing?
>
> Yes
>
>> Generally; if you have problems with data formats from different
>> sources then convert them when importing them to your native system.
>
> Right, then would this be considered a bug in gnu gawk? I save files
> as CSV, remove CRLF and generally had things working. I didn't know
> what was happening till I investigated every step I did.

Erm, no, it's not a bug. Awk, like other tools, does no heuristic data
analysis before processing the data - this cannot be reliable in the
general case, and it's impractical for tools that work as filters.
Tools also usually assume the system-native line-ending convention.

The data provider is the only one who has knowledge about the data and
who can control it to adjust encodings and its format, specifically if
data is taken from one system domain and processed in another one.

Janis

Janis Papanagnou

Sep 29, 2018, 7:51:51 AM
As explained in another post, the BOM defines whether characters
of size 16 or 32 bits are byte-wise encoded as ABCD or DCBA, or
AB or BA (each letter A-D formally representing a byte here).

UTF-8 is a byte encoding; the basic units are bytes and these are
_sequentially_ arranged in cases where more than one byte is
necessary (as in French). The format is unambiguous; there's
no need to define a byte order or insert spurious binary metadata
in front of the textual "payload" data.
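
For instance (a quick check, assuming a UTF-8 locale and terminal), a
French e-acute is simply two bytes in sequence; no byte order, no BOM:

    printf 'é' | od -An -t x1
    # expected output: c3 a9   (U+00E9 as the two-byte UTF-8 sequence)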

Janis

Janis Papanagnou

Sep 29, 2018, 8:06:01 AM
On 29.09.2018 06:17, Sivaram Neelakantan wrote:
>
> Now, I'm a bit worried; broadly, I'm going to receive files from
> different countries with their data in their native languages.

No problem. You can use UTF-8 encoding generally for all languages.

> The database columns p_generic is the only one in ASCII.

ASCII data is only a "special case", since it can be represented by
UTF-8 (or other encodings like ISO Latin-15) without change of format.
In other words, it's a subset of these common encodings, and there's
no need to handle it differently.

> Rest are going
> to be some UTF* files. What should be the base conversion to CSV for
> all of them to make it work with awk and bash?

Just to make sure; UTF-8 is a character encoding, while CSV is a "data
structure", an encoding, on top.

The issue you have is obviously that you get data from different sources
with different character encodings. Either you have the authority to
define the encoding _standards_ for the provided data (if there isn't
already a standard defined), or you have to bite the bullet and ask every
data provider what encoding standard he uses, and you would then have to
add a preprocessing step (depending on the data source) to create one
uniform encoding - I'd suggest UTF-8 - before you do your (awk-)processing.
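
That preprocessing step can be as small as a per-source wrapper; a rough
sketch only (the source labels, encodings and variable names are made up,
you'd substitute whatever each provider actually uses):

    # normalise one incoming file to UTF-8, depending on who sent it;
    # $source and $file are whatever your import script already knows
    case "$source" in
        win_export)  iconv -f UTF-16 -t UTF-8 "$file" ;;
        legacy_aix)  iconv -f ISO-8859-1 -t UTF-8 "$file" ;;
        *)           cat "$file" ;;            # assume it is already UTF-8
    esac > "${file}.utf8"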

Janis

Ivan Shmakov

Sep 29, 2018, 9:57:58 AM
>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:

[...]

>> I fired up Emacs and placed the cursor on 's' in dataset.txt file
>> and got only this(C-x = ) which matches the above.

>> Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0

> That's odd. Are you sure the cursor in on the 's' and not on an
> almost invisibly thin character preceding it? It's easier to see if
> you use a block cursor.

I'm pretty sure that Emacs removes stray BOMs when reading UTF-8
files. Why, I've just tried it with 25.1 from Debian 9, and it does
exactly that. (A BOM read from a UTF-16-LE file was unaffected.)

[...]

--
FSF associate member #7257 http://am-1.org/~ivan/ np. Tristesse -- MRT

Ben Bacarisse

Sep 29, 2018, 11:51:08 AM
Sivaram Neelakantan <nsivar...@gmail.com> writes:

> On Fri, Sep 28 2018,Ben Bacarisse wrote:
>
>
> [snipped 27 lines]
>
>>> Don't know where those 3 bytes came from and I believe that might be
>>> the cause.
>>
>> It's a Byte Order Mark (or BOM). It's a character that is supposed to
>> be ignored by can be used to tell if a file is big- or little-endian.
>> This only matters for UTF-16 and UTF-32 (AKA UCS4). There is only one
>> byte order for UTF-8 so it's pointless to have a an BOM and, in fact, it
>> is not consider correct to write UTF-8 that has one. However, it's
>> not unusual for it remain after a file in converted, and some software
>> generated fresh UTF-8 file with them because it acts like an "I'm
>> Unicode" flag.
>>
>>> I fired up Emacs and placed the cursor on 's' in
>>> dataset.txt file and got only this(C-x = ) which matches the above.
>>>
>>> Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0
>>
>> That's odd. Are you sure the cursor in on the 's' and not on an almost
>> invisibly thin character preceding it? It's easier to see if you use a
>> block cursor.
>
> Yes, Column=0 in the place where the cursor as shown abov

Now I look again, it's even more odd. Not the column=0 (Emacs uses a
complex algorithm to estimate the column number), but point=1 and Char: s
suggest that the BOM has been removed.

If I make a file with only a BOM and an 's', I see either

Char:  (65279, #o177377, #xfeff, file ...) point=1 of 2 (0%) column=0
Char: s (115, #o163, #x73) point=2 of 2 (50%) column=0

depending on where the cursor is.

>>> I receive these files from different systems mac, AIX and linux and of
>>> course XLS from WIN10. How do I fix these kind of weirdness in
>>> scripts? I currently only check linefeeds and fix those.
>>
>> You need to remove that rogue BOM.
>
> If I do remove them, what would happen to the french chars in
> the file?

Nothing. The BOM is not needed in UTF-8 files, but AWK will simply
consider it to be part of the first field of the first line.
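
You can see the effect in isolation with something like this (just a
sketch of the failure mode, not your actual script; either a bytewise awk
or a UTF-8-aware gawk should report the mismatch):

    # the first key is really "<EF><BB><BF>sky", so a lookup for "sky" misses
    printf '\357\273\277sky\n' |
    awk '{ if ($1 == "sky") print "match"; else print "no match" }'
    # prints: no match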

--
Ben.

Ben Bacarisse

Sep 29, 2018, 11:55:31 AM
Ivan Shmakov <iv...@siamics.net> writes:

>>>>>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>
> [...]
>
> >> I fired up Emacs and placed the cursor on 's' in dataset.txt file
> >> and got only this(C-x = ) which matches the above.
>
> >> Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0
>
> > That's odd. Are you sure the cursor in on the 's' and not on an
> > almost invisibly thin character preceding it? It's easier to see if
> > you use a block cursor.
>
> I'm pretty sure that Emacs removes stray BOMs when reading UTF-8
> files. Why, I've just tried it with 25.1 from Debian 9, and it does
> exactly that. (A BOM read from a UTF-16-LE file was
> unaffected.)

Mine does not, I just tested with emacs -Q to see if I have some setting
to leave it there but no, the BOM is still there at point=1 and the s at
point=2. This is with emacs 25.2.2.

--
Ben.

Ben Bacarisse

Sep 29, 2018, 12:07:43 PM
Ben Bacarisse <ben.u...@bsb.me.uk> writes:

> Sivaram Neelakantan <nsivar...@gmail.com> writes:
<snip>
>> Yes, Column=0 in the place where the cursor as shown abov
>
> Now I look again, it's even more odd. Not the column=0 (Emacs uses a
> complex algorithm to estimate the column number) but point=1 and Char: s
> suggests that the BOM has been removed.

No, ignore that!

Emacs does indeed silently not show you the BOM, as Ivan said. My
mistake. It's there in the file, but emacs is being "helpful" by not
showing it to you. I saw a BOM because I added one, and a second BOM is
treated as an ordinary character in the file.

--
Ben.

Ben Bacarisse

Sep 29, 2018, 12:10:00 PM
Sivaram Neelakantan <nsivar...@gmail.com> writes:
<snip>
> Now, I'm a bit worried; broadly, I'm going to receive files from
> different countries with their data in their native languages.

The language does not matter, only the encoding. If you get files using
lots of different encodings, you should convert them all to UTF-8 with
no BOM (which might mean you need to remove it yourself). If you get to
decide what encoding is used for the files that get sent to you, ask for
UTF-8 files with no BOM (it's often an option on export to include or
exclude it).

--
Ben.

Ben Bacarisse

Sep 29, 2018, 12:14:17 PM
Argh! No, you are right, emacs is in fact just not showing me a leading
BOM. I had ended up with a file with *two* BOMs, which is why I thought
emacs was showing it.

The golden rule is always to use something like hd or od to be sure what
you are looking at, since editors might try to be "helpful". Ironically, I
had so many issues like this that I wrote a UTF-8 dump program to
visualise such files in all sorts of ways, but I forgot to use it!

--
Ben.

Helmut Waitzmann

Sep 29, 2018, 12:56:08 PM
Jorgen Grahn <grahn...@snipabacken.se>:
>On Sat, 2018-09-29, Sivaram Neelakantan wrote:
>> On Fri, Sep 28 2018,Helmut Waitzmann wrote:
>...
>>> If there is Recode installed on your system, you could do
>>>
>>> recode -- UTF-8..UTF-16BE,UTF-16.. < \
>>> columns_from_dataset.txt | od -bc
>>>
>>> or, using Iconv, you could do
>>>
>>> iconv -f UTF-8 -t UTF-16BE < \
>>> columns_from_dataset.txt | iconv -f UTF-16 | od -bc
>>>
>>> to receive the data encoded in the character set according to your
>>> locale setting.
>
>I'm not sure the examples above are correct: I don't see why they
>would remove the BOM, and taking that extra step via UTF-16 is a
>problem if there are characters which don't fit in UTF-16. I meant to
>ask Helmut earlier, but didn't.

The examples remove the BOM (at least on my Debian Linux), because
they take advantage of the difference between the UTF-16
encoding, which uses a BOM, and the UTF-16BE and UTF-16LE
encodings, which do not.

I guess that the data were provided by someone or something that
recoded the original UTF-16 encoded data using – depending on the
system – either

iconv -f UTF-16BE -t UTF-8

or

iconv -f UTF-16LE -t UTF-8

rather than

iconv -f UTF-16 -t UTF-8

That is, the recoder made wrong assumptions about the source
encoding: UTF-16BE or UTF-16LE (without a BOM) rather than
UTF-16 (with a BOM), and therefore produced wrong UTF-8 encoded data.

My examples try to roll the wrong recoding back, then recode using
the right source encoding: UTF-16 to get correct UTF-8 encoded
data.

Compare the output of the following commands (one should
produce a UTF-8 encoded BOM, the other should produce garbage)

printf '%s\n' 'Hello, world!' |
iconv -f UTF-8 -t UTF-16 |
iconv -f UTF-16LE -t UTF-8 |
od -t x1u1o1c

printf '%s\n' 'Hello, world!' |
iconv -f UTF-8 -t UTF-16 |
iconv -f UTF-16BE -t UTF-8 |
od -t x1u1o1c

with the output of the command

printf '%s\n' 'Hello, world!' |
iconv -f UTF-8 -t UTF-16 |
iconv -f UTF-16 -t UTF-8 |
od -t x1u1o1c

Helmut Waitzmann

Sep 29, 2018, 2:16:39 PM
Sivaram Neelakantan <nsivar...@gmail.com>:
>When I tried to use one file as a look up in awk, everything worked
>except the first field. Since I had already been bitten by UNIX/DOS
>file endings issue, I checked the following.
>--8<---------------cut here---------------start------------->8---
>
>$ file columns_from_dataset.txt
>columns_from_dataset.txt: UTF-8 Unicode (with BOM) text
>$ file p_generic.txt
>p_generic.txt: ASCII text
>
>--8<---------------cut here---------------end--------------->8---
>
>I thought that was fine and went ahead and found it still failing. So
>I checked bytes in it
>--8<---------------cut here---------------start------------->8---
>$ head -1 columns_from_dataset.txt |od -bc
>0000000 357 273 277 163 153 171 167 012
> 357 273 277 s k y w \n
>
>$ head -1 p_generic.txt |od -bc
>0000000 163 153 171 012
> s k y \n
>--8<---------------cut here---------------end--------------->8---
>Don't know where those 3 bytes came from and I believe that might be
>the cause.

Those 3 bytes came from a wrong recoding that used UTF-16BE, UTF-16LE,
UTF-32BE or UTF-32LE as the source encoding rather than UTF-16 or
UTF-32, respectively.

[…]

>I receive these files from different systems mac, AIX and linux and of
>course XLS from WIN10. How do I fix these kind of weirdness in
>scripts?

As soon as the wrong recoding is fixed, the spurious BOMs will
disappear.

To investigate the source of the wrong UTF-8 encoding, you might
ask your data sources to send you the data in their native
encoding, not recoding it.

Then, you could post some sample data:

head -n 1 | od -t x1c

Brian Patrie

Sep 29, 2018, 5:18:10 PM
On 2018-09-29 04:57, Ivan Shmakov wrote:
> I'm pretty sure that Emacs removes stray BOMs when reading UTF-8
> files. Why, I've just tried it with 25.1 from Debian 9, and it does
> exactly that. (A BOM read from a UTF-16-LE file was unaffected.)

Removing stray U+FEFF strikes me as odd, borderline broken behaviour, as
that character has meaning (zero-width non-breaking space) other than as
a BOM.

Sivaram Neelakantan

Sep 29, 2018, 5:31:33 PM
Thanks, that was odd, trying to replicate what you got. :)

sivaram
--

Sivaram Neelakantan

Sep 29, 2018, 5:42:35 PM
On Sat, Sep 29 2018,Helmut Waitzmann wrote:


[snipped 36 lines]

>
> As soon as the wrong recoding is fixed, the spurious BOMs will
> disappear.
>
> To investigate the source of the wrong UTF-8 encoding, you might
> ask your data sources to send you the data in their native
> encoding, not recoding it.
>
> Then, you could post some sample data:
>
> head -n 1 | od -t x1c

Well, no one's going to DTRT to fix this, as no one's bothered to get
me files I can work with. Your sound advice is not what is going to
be heeded.

OT: I'm reminded of the late Erik Naggum's post about the Perl
programmers' way of fixing things; instead of fixing things upstream,
I'm going to jump through hoops doing that weird unnecessary 'work'
that shows progress. For every problem there's going to be another
shell script instead of simplifying things.

oh well, it pays the bills. <shrug>


Back on topic, would it be enough to pass the 3 bytes to 'tr' to fix
this instead of iconv or recode? I would like to base my scripts on
standard tools available across systems.

sivaram
--

Helmut Waitzmann

Sep 29, 2018, 9:44:43 PM
Sivaram Neelakantan <nsivar...@gmail.com>:
>On Sat, Sep 29 2018,Helmut Waitzmann wrote:

>[snipped 36 lines]
>
>>
>> As soon as the wrong recoding is fixed, the spurious BOMs will
>> disappear.
>>
>> To investigate the source of the wrong UTF-8 encoding, you might
>> ask your data sources to send you the data in their native
>> encoding, not recoding it.
>>
>> Then, you could post some sample data:
>>
>> head -n 1 | od -t x1c
>
>Well, no one's going to DTRT to fix this as no one's bothered to get
>me files I can work with. Your sound advice is not what is going to
>be heeded

So the files you get are encoded in UTF-8 and have a leading UTF-8
encoded BOM?

[…]

>Back on topic, would it be enough to pass the 3 bytes to 'tr' to fix
>this instead of iconv or recode?

I'm afraid the UTF-8 encoded BOM is just a symptom of a wrong
recoding rather than a superfluous character.

For example, the command

printf '%b' 'skyw\n' |
iconv -t UTF-32 |
# wrong recoding from UTF-32 to UTF-8:
iconv -f UTF-32LE -t UTF-8 |
od -t co1x1

yields a spurious UTF-8 encoded BOM at the beginning of the
output:

0000000 357 273 277 s k y w \n
357 273 277 163 153 171 167 012
ef bb bf 73 6b 79 77 0a
0000010

whereas the command

printf '%b' 'skyw\n' |
iconv -t UTF-32 |
# correct recoding from UTF-32 to UTF-8:
iconv -f UTF-32 -t UTF-8 |
od -t co1x1

does not:

0000000 s k y w \n
163 153 171 167 012
73 6b 79 77 0a
0000005

Also, as the BOM carries information about the endianness of the
encoding, I won't just throw it away but rather let

iconv -f UTF-32

pay attention to it.

A fix of that problem would be to roll that wrong recoding back:

iconv -f UTF-8 -t UTF-32BE

I'm not totally sure about the "-t" option parameter, "UTF-32BE".
Maybe, that "UTF-32LE", "UCS-4", "UCS-4BE" or "UCS-4LE" would be a
better choice.

Then do it right:

iconv -f UTF-32 -t UTF-8

The two commands may be connected by a pipe.

The following shell command line shows how the correcting
recoding eliminates the spurious BOM:

# Create a wrong recoded sample like above:
printf '%b' 'skyw\n' |
iconv -t UTF-32 |
# wrong recoding from UTF-32 to UTF-8:
iconv -f UTF-32LE -t UTF-8 |
# Now recode it to correct it:
iconv -f UTF-8 -t UTF-32BE |
iconv -f UTF-32 -t UTF-8 |
od -t co1x1

yields the correct output:

0000000 s k y w \n
163 153 171 167 012
73 6b 79 77 0a
0000005

>I would like to base my scripts on standard tools available
>across systems.

I agree, but as far as I know, Iconv is a POSIX standard tool,
with the restriction, though, that the names of the encodings are
implementation‐defined.

Ben Bacarisse

Sep 30, 2018, 3:06:47 AM
Emacs appears to treat only a leading BOM as special. Others (even an
immediately following one) are "visible" in the file. (I say "visible"
because the default rendering appears to be a barely visible thin
space.)

Also, Emacs does not remove the BOM in the sense of deleting it from the
file; it just does not display an initial BOM.

--
Ben.

Jorgen Grahn

Sep 30, 2018, 7:25:42 AM
On Sat, 2018-09-29, Helmut Waitzmann wrote:
> Jorgen Grahn <grahn...@snipabacken.se>:
>>On Sat, 2018-09-29, Sivaram Neelakantan wrote:
>>> On Fri, Sep 28 2018,Helmut Waitzmann wrote:
>>...
>>>> If there is Recode installed on your system, you could do
>>>>
>>>> recode -- UTF-8..UTF-16BE,UTF-16.. < \
>>>> columns_from_dataset.txt | od -bc
>>>>
>>>> or, using Iconv, you could do
>>>>
>>>> iconv -f UTF-8 -t UTF-16BE < \
>>>> columns_from_dataset.txt | iconv -f UTF-16 | od -bc
>>>>
>>>> to receive the data encoded in the character set according to your
>>>> locale setting.
>>
>>I'm not sure the examples above are correct: I don't see why they
>>would remove the BOM, and taking that extra step via UTF-16 is a
>>problem if there are characters which don't fit in UTF-16. I meant to
>>ask Helmut earlier, but didn't.
>
> The examples remove the BOM (at least on my Debian Linux), because
> they take opportunity of the difference between the UTF-16
> encoding, which uses a BOM, and the UTF-16BE and UTF-16LE
> encodings, which do not.

Ah, of course.

Part of what I wrote was based on a misunderstanding: I thought the
BOM was actually some zero-width space character, abused as a byte
order mark under the assumption "the user won't see it anyway".

I see now (in the Wikipedia article) that it's explicitly a BOM in
Unicode, i.e. removing it doesn't alter the text itself. Then it
makes sense for iconv to remove it.

The Wikipedia article also mentions what I believe is the OP's
problem: "Not using a BOM allows text to be backwards-compatible
with some software that is not Unicode-aware. Examples include
programming languages that permit non-ASCII bytes in string literals
but not at the start of the file."

[snip]

Janis Papanagnou

Sep 30, 2018, 7:38:49 AM
On 29.09.2018 19:42, Sivaram Neelakantan wrote:
> [...]
>
> Back on topic, would it be enough to pass the 3 bytes to 'tr' to fix
> this instead of iconv or recode? I would like to base my scripts on
> standard tools available across systems.

I don't see how deleting characters with tr would give you correct
results if you don't want to accidentally delete payload data with
the same code. But you can use other tools like sed, or simply
just tail -c +4.
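
For example, a guarded version of the tail idea (only a sketch; the file
names are placeholders, and dd, od and tail are used because they are
commonly available):

    # skip the first three bytes only when they really are the UTF-8 BOM
    first3=$(dd if=input.txt bs=1 count=3 2>/dev/null | od -An -t x1 | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        tail -c +4 input.txt > input.nobom
    else
        cp input.txt input.nobom
    fi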

Janis


Ralf Damaschke

Sep 30, 2018, 4:14:47 PM
Sivaram Neelakantan wrote:

> Back on topic, would it be enough to pass the 3 bytes to 'tr' to fix
> this instead of iconv or recode? I would like to base my scripts on
> standard tools available across systems.

In your OP you mentioned awk. You may delete the extra BOM by starting
the awk script with 'FNR == 1 { sub(/^\357\273\277/, "") }'.
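
For instance, if the lookup is roughly of this shape (the script below is
only a guess at what the original looks like), that one extra rule makes
the first key match again:

    awk '
        FNR == 1  { sub(/^\357\273\277/, "") }   # drop a leading UTF-8 BOM
        FNR == NR { seen[$1]; next }             # first file: remember the keys
        $1 in seen                               # second file: print matching lines
    ' p_generic.txt columns_from_dataset.txt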

Jorgen Grahn

Sep 30, 2018, 4:16:17 PM
Weird: that's what I stated in this thread the other day, then
retracted after reading <https://en.wikipedia.org/wiki/Byte_order_mark>
which describes U+FEFF as only a byte order mark.

I only read the intro, though. Under "Usage" the article says:

If the BOM character appears in the middle of a data stream,
Unicode says it should be interpreted as a "zero-width
non-breaking space" (inhibits line-breaking between
word-glyphs). In Unicode 3.2, this usage is deprecated in favor of
the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be
only used as a BOM.

So they /did/ mess up originally. (Twice, if you count adding a BOM
code to begin with).