Sivaram Neelakantan <nsivar...@gmail.com>:
>When I tried to use one file as a look up in awk, everything worked
>except the first field. Since I had already been bitten by UNIX/DOS
>file endings issue, I checked the following.
>--8<---------------cut here---------------start------------->8---
>
>$ file columns_from_dataset.txt
>columns_from_dataset.txt: UTF-8 Unicode (with BOM) text
                                               ^^^
                                                |
There is a hint: -------------------------------'
[…]
>$ head -1 columns_from_dataset.txt |od -bc
>0000000 357 273 277 163 153 171 167 012
> 357 273 277 s k y w \n
>Don't know where those 3 bytes came from and I believe that might be
>the cause.
BOM is the byte order mark, see also Wikipedia:
<https://en.wikipedia.org/wiki/Byte_Order_Mark#top>
As UTF-16 represents Unicode text in 16‐bit units, whereas in the
Unix world, we've got 8‐bit characters, each 16‐bit unit is
represented by a pair of 8‐bit bytes.
Now, there's the question: Should the lower 8 of the 16 bits make up
the first of the two bytes and the upper 8 of the 16 bits
make up the second of the two bytes (also called little
endian), or should the order of the two bytes be reversed
(also called big endian)?
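As a minimal sketch of the two orderings (split16 is only a name made up
for this illustration), shell arithmetic can split a 16‐bit value into
its two bytes and print them both ways:

```shell
# split16 is a hypothetical helper; it expects a value in 0..0xFFFF.
split16() {
    v=$(($1))
    lo=$((v & 0xFF))   # lower 8 of the 16 bits
    hi=$((v >> 8))     # upper 8 of the 16 bits
    printf 'little endian: %02x %02x\n' "$lo" "$hi"
    printf 'big endian:    %02x %02x\n' "$hi" "$lo"
}

split16 0x0073   # the letter "s" as a 16-bit value
```

For the letter "s" this shows 73 00 for little endian and 00 73 for
big endian.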
To indicate to the receiver of the file which of the two
possible byte orders is used, software would
put a special 16‐bit Unicode character, the so‐called byte order
mark, at the beginning of the file.
The byte order mark is the 16‐bit value
0xFEFF (hexadecimal) =
65279 (decimal)
And because the byte‐swapped value
0xFFFE (hexadecimal) =
65534 (decimal)
is not a valid Unicode character, when receiving a UTF-16‐encoded
file, one can determine the intended byte order of the 16‐bit
units by looking at the first two bytes at the beginning of
the file.
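That check could be sketched in shell roughly as follows. The function
name utf16_byte_order is made up here, and the sketch only looks at the
first two bytes on standard input, so it knows nothing about UTF-8 or
UTF-32:

```shell
# utf16_byte_order is a hypothetical helper, reading from stdin.
utf16_byte_order() {
    # First two bytes as hex digits without separators, e.g. "feff"
    sig=$(od -An -tx1 -N2 | tr -d ' \n')
    case $sig in
        feff) echo big-endian ;;      # BOM stored as FE FF
        fffe) echo little-endian ;;   # BOM stored byte-swapped as FF FE
        *)    echo unknown ;;         # no UTF-16 byte order mark found
    esac
}

# Demo on generated data: a big-endian BOM followed by the letter "s"
printf '\376\377\000s' | utf16_byte_order
```

On a real file one would call it as utf16_byte_order < somefile.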
Now, as your file is made up of UTF-8 encoded Unicode characters,
the byte order mark will get encoded in the three bytes
357 273 277 (octal) =
239 187 191 (decimal) =
0xEF 0xBB 0xBF (hexadecimal),
as can be seen in the output of the command above.
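To see where those three bytes come from, here is a small sketch of the
UTF-8 encoding rule for code points that need three bytes. The name
utf8_octal is invented for this example, and it deliberately ignores
the one-, two- and four-byte cases:

```shell
# utf8_octal is a hypothetical helper; it only handles code points
# in the three-byte range 0x0800..0xFFFF.
utf8_octal() {
    cp=$(($1))
    b1=$((0xE0 | (cp >> 12)))           # 1110xxxx: top 4 bits
    b2=$((0x80 | ((cp >> 6) & 0x3F)))   # 10xxxxxx: middle 6 bits
    b3=$((0x80 | (cp & 0x3F)))          # 10xxxxxx: low 6 bits
    printf '%o %o %o\n' "$b1" "$b2" "$b3"
}

utf8_octal 0xFEFF   # prints: 357 273 277
```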
>I fired up Emacs and placed the cursor on 's' in dataset.txt file
>and got only this(C-x = ) which matches the above.
>
>Char: s (115, #o163, #x73) point=1 of 520 (0%) column=0
Yes. Emacs will either hide the byte order mark from you or
discard it after using it to investigate the byte order.
>I receive these files from different systems mac, AIX and linux and of
>course XLS from WIN10. How do I fix these kind of weirdness in
>scripts? I currently only check linefeeds and fix those.
If there is Recode installed on your system, you could do
recode -- UTF-8..UTF-16BE,UTF-16.. < \
columns_from_dataset.txt | od -bc
or, using Iconv, you could do
iconv -f UTF-8 -t UTF-16BE < columns_from_dataset.txt | \
iconv -f UTF-16 | od -bc
to get the data, with the byte order mark removed, encoded in the
character set according to your locale setting.
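If neither tool is at hand, and since you are using awk anyway, a
filter along the following lines might do for your scripts. This is
only a sketch: clean_text is a made‐up name, it assumes the input is
UTF-8 or plain ASCII, and it handles just the two problems mentioned in
this thread (a leading UTF-8 byte order mark and DOS line endings):

```shell
# clean_text is a hypothetical helper, not part of any tool named above.
# LC_ALL=C makes awk treat the input as plain bytes, so the three
# BOM bytes can be matched by their octal values.
clean_text() {
    LC_ALL=C awk '
        FNR == 1 { sub(/^\357\273\277/, "") }   # drop a leading UTF-8 BOM
                 { sub(/\r$/, ""); print }      # drop a trailing CR
    ' "$@"
}

# Demo on generated data resembling the first line of your file
printf '\357\273\277skyw\r\n' | clean_text | od -c
```

With a file it would be used as clean_text columns_from_dataset.txt,
and the od output should then start with the "s" rather than with
357 273 277.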