
Guessing the encoding from a BOM


Steven D'Aprano

Jan 15, 2014, 9:13:55 PM
I have a function which guesses the likely encoding used by text files by
reading the BOM (byte order mark) at the beginning of the file. A
simplified version:


def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    else:
        return default


The idea is that you can call the function with a file name and a default
encoding to return if one can't be guessed. I want to provide a default
value for the default argument (a default default), but one which will
unconditionally fail if you blindly go ahead and use it.

E.g. I want to either provide a default:

enc = guess_encoding_from_bom("filename", 'latin1')
f = open("filename", encoding=enc)


or I want to write:

enc = guess_encoding_from_bom("filename")
if enc == something:
    # Can't guess, fall back on an alternative strategy
    ...
else:
    f = open("filename", encoding=enc)


If I forget to check the returned result, I should get an explicit
failure as soon as I try to use it, rather than silently returning the
wrong results.

What should I return as the default default? I have four possibilities:

(1) 'undefined', which is a standard encoding guaranteed to
raise an exception when used;

(2) 'unknown', which best describes the result, and currently
there is no encoding with that name;

(3) None, which is not the name of an encoding; or

(4) Don't return anything, but raise an exception. (But
which exception?)


Apart from option (4), here are the exceptions you get from blindly using
options (1) through (3):

py> 'abc'.encode('undefined')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/undefined.py", line 19, in encode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

py> 'abc'.encode('unknown')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: unknown

py> 'abc'.encode(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encode() argument 1 must be str, not None


At the moment, I'm leaning towards option (1). Thoughts?



--
Steven

Ben Finney

Jan 15, 2014, 10:47:00 PM
to pytho...@python.org
Steven D'Aprano <steve+comp....@pearwood.info> writes:

> enc = guess_encoding_from_bom("filename")
> if enc == something:
>     # Can't guess, fall back on an alternative strategy
>     ...
> else:
>     f = open("filename", encoding=enc)
>
>
> If I forget to check the returned result, I should get an explicit
> failure as soon as I try to use it, rather than silently returning the
> wrong results.

Yes, agreed.

> What should I return as the default default? I have four possibilities:
>
> (1) 'undefined', which is a standard encoding guaranteed to
> raise an exception when used;

+0.5. This describes the outcome of the guess.

> (2) 'unknown', which best describes the result, and currently
> there is no encoding with that name;

+0. This *better* describes the outcome, but I don't think adding a new
name is needed nor very helpful.

> (3) None, which is not the name of an encoding; or

−1. This is too much like a real result and doesn't adequately indicate
the failure.

> (4) Don't return anything, but raise an exception. (But
> which exception?)

+1. I'd like a custom exception class, sub-classed from ValueError.

--
\ “I love to go down to the schoolyard and watch all the little |
`\ children jump up and down and run around yelling and screaming. |
_o__) They don't know I'm only using blanks.” —Emo Philips |
Ben Finney

Chris Angelico

Jan 16, 2014, 12:01:56 AM
to pytho...@python.org
On Thu, Jan 16, 2014 at 1:13 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'

I'd swap the order of these two checks. If the file starts FF FE 00
00, your code will guess that it's UTF-16 and begins with a U+0000.
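
A minimal sketch of the swapped order, assuming the rest of the function stays as posted. The name guess_from_sig is hypothetical (it is not in the original code); it takes the already-read bytes so it can be tried without a file on disk:

```python
def guess_from_sig(sig, default):
    # Hypothetical helper: same checks as the posted function, but the
    # four-byte UTF-32 BOMs are tested before the two-byte UTF-16 ones,
    # because FF FE 00 00 also starts with the UTF-16 BOM FF FE.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    else:
        return default
```

With this order, a signature of FF FE 00 00 is reported as UTF-32, while FF FE followed by anything else still falls through to UTF-16.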

ChrisA

Ethan Furman

Jan 16, 2014, 12:40:23 AM
to pytho...@python.org
On 01/15/2014 07:47 PM, Ben Finney wrote:
> Steven D'Aprano writes:
>>
>> (4) Don't return anything, but raise an exception. (But
>> which exception?)
>
> +1. I'd like a custom exception class, sub-classed from ValueError.

+1

--
~Ethan~

Steven D'Aprano

Jan 16, 2014, 1:45:38 AM
Good catch, thank you.


--
Steven

Steven D'Aprano

Jan 16, 2014, 1:55:16 AM
On Thu, 16 Jan 2014 14:47:00 +1100, Ben Finney wrote:

> Steven D'Aprano <steve+comp....@pearwood.info> writes:
>
>> enc = guess_encoding_from_bom("filename")
>> if enc == something:
>>     # Can't guess, fall back on an alternative strategy
>>     ...
>> else:
>>     f = open("filename", encoding=enc)
>>
>>
>> If I forget to check the returned result, I should get an explicit
>> failure as soon as I try to use it, rather than silently returning the
>> wrong results.
>
> Yes, agreed.
>
>> What should I return as the default default? I have four possibilities:
>>
>> (1) 'undefined', which is a standard encoding guaranteed to
>> raise an exception when used;
>
> +0.5. This describes the outcome of the guess.
>
>> (2) 'unknown', which best describes the result, and currently
>> there is no encoding with that name;
>
> +0. This *better* describes the outcome, but I don't think adding a new
> name is needed nor very helpful.

And there is a chance -- albeit a small chance -- that someday the std
lib will gain an encoding called "unknown".


>> (4) Don't return anything, but raise an exception. (But
>> which exception?)
>
> +1. I'd like a custom exception class, sub-classed from ValueError.

Why ValueError? It's not really an "invalid value" error, it's more of a
"my heuristic isn't good enough" failure. (Maybe the file starts with
another sort of BOM which I don't know about.)

If I go with an exception, I'd choose RuntimeError, or a custom error
that inherits directly from Exception.
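
A rough sketch of the exception route, under stated assumptions: the class name EncodingGuessError and the function guess_encoding_from_sig are made up for illustration, and the function takes the first four bytes directly instead of a file name so that the example is self-contained:

```python
class EncodingGuessError(Exception):
    """Raised when no known BOM is found (hypothetical name)."""


def guess_encoding_from_sig(sig):
    # Hypothetical variant of the posted function that works on the
    # first four bytes of the file. UTF-32 is tested first so that
    # FF FE 00 00 is not mistaken for UTF-16.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    raise EncodingGuessError('no recognised BOM in %r' % (sig,))


try:
    enc = guess_encoding_from_sig(b'plain text')
except EncodingGuessError:
    enc = 'latin1'  # the caller's own fallback strategy goes here
```

A caller who forgets the try block gets the failure at the point of the guess, not later when the encoding is finally used.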



Thanks to everyone for the feedback.



--
Steven

Ethan Furman

Jan 16, 2014, 2:29:15 AM
to pytho...@python.org
From the docs [1]:
============================

exception RuntimeError

Raised when an error is detected that doesn’t fall in any
of the other categories. The associated value is a string
indicating what precisely went wrong.

It doesn't sound like RuntimeError is any more informative than Exception or AssertionError, and to my mind at least is
usually close to catastrophic in nature [2].

I'd say a ValueError subclass because, while not strictly an error, these are values you don't know how to deal with.
But either that or plain Exception, just not RuntimeError.

--
~Ethan~


[1] http://docs.python.org/3/library/exceptions.html#RuntimeError
[2] verified by a (very) brief grep of the sources

Björn Lindqvist

Jan 16, 2014, 1:01:51 PM
to Steven D'Aprano, pytho...@python.org
2014/1/16 Steven D'Aprano <steve+comp....@pearwood.info>:
> def guess_encoding_from_bom(filename, default):
>     with open(filename, 'rb') as f:
>         sig = f.read(4)
>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>         return 'utf_16'
>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>         return 'utf_32'
>     else:
>         return default

You might want to add the utf8 bom too: '\xEF\xBB\xBF'.

> (4) Don't return anything, but raise an exception. (But
> which exception?)

I like this option the most because it is the most "fail fast". If you
return 'undefined' the error might happen hours later or not at all in
some cases.


--
mvh/best regards Björn Lindqvist

Chris Angelico

Jan 16, 2014, 1:06:16 PM
to pytho...@python.org
On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjo...@gmail.com> wrote:
> 2014/1/16 Steven D'Aprano <steve+comp....@pearwood.info>:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>>     else:
>>         return default
>
> You might want to add the utf8 bom too: '\xEF\xBB\xBF'.

I'd actually rather not. It would tempt people to pollute UTF-8 files
with a BOM, which is not necessary unless you are MS Notepad.

ChrisA

Tim Chase

Jan 16, 2014, 1:50:12 PM
to pytho...@python.org
On 2014-01-17 05:06, Chris Angelico wrote:
> > You might want to add the utf8 bom too: '\xEF\xBB\xBF'.
>
> I'd actually rather not. It would tempt people to pollute UTF-8
> files with a BOM, which is not necessary unless you are MS Notepad.

If the intent is to just sniff and parse the file accordingly, I get
enough of these junk UTF-8 BOMs at $DAY_JOB that I've had to create
utility-openers much like Steven is doing here. It's particularly
problematic for me in combination with csv.DictReader, where I go
looking for $COLUMN_NAME and get KeyError exceptions because it wants
me to ask for $UTF_BOM+$COLUMN_NAME for the first column.
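
To make the symptom concrete, here is a small sketch using made-up sample data (the bytes in raw and the helper first_fieldname are illustrative, not from anyone's real files). It compares the plain 'utf-8' codec with the standard 'utf-8-sig' codec, which drops a leading signature if one is present:

```python
import csv
import io

# Bytes as an export might arrive: UTF-8 signature, then the CSV data.
raw = b'\xef\xbb\xbfname,qty\r\nwidget,3\r\n'


def first_fieldname(data, encoding):
    # Decode the bytes with the given codec and report the first header
    # that csv.DictReader sees.
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline='')
    return csv.DictReader(text).fieldnames[0]


print(repr(first_fieldname(raw, 'utf-8')))      # BOM stays glued to the name
print(repr(first_fieldname(raw, 'utf-8-sig')))  # signature stripped by the codec
```

Since 'utf-8-sig' also reads files that have no signature at all, it may be a reasonable default for this kind of utility-opener.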

-tkc



Albert-Jan Roskam

Jan 16, 2014, 2:37:29 PM
to Chris Angelico, pytho...@python.org
--------------------------------------------
On Thu, 1/16/14, Chris Angelico <ros...@gmail.com> wrote:

Subject: Re: Guessing the encoding from a BOM
To:
Cc: "pytho...@python.org" <pytho...@python.org>
Date: Thursday, January 16, 2014, 7:06 PM

On Fri, Jan 17, 2014 at 5:01 AM, Björn Lindqvist <bjo...@gmail.com> wrote:
> 2014/1/16 Steven D'Aprano <steve+comp....@pearwood.info>:
>> def guess_encoding_from_bom(filename, default):
>>     with open(filename, 'rb') as f:
>>         sig = f.read(4)
>>     if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
>>         return 'utf_16'
>>     elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
>>         return 'utf_32'
>>     else:
>>         return default
>
> You might want to add the utf8 bom too: '\xEF\xBB\xBF'.

I'd actually rather not. It would tempt people to pollute UTF-8 files
with a BOM, which is not necessary unless you are MS Notepad.


===> Can you elaborate on that? Unless your UTF-8 files will only contain ASCII characters, I do not understand why you would not want a UTF-8 BOM.

Btw, isn't "read_encoding_from_bom" a better function name than "guess_encoding_from_bom"? I thought the point of BOMs was that there would be no more need to guess?

Thanks!

Albert-Jan

Chris Angelico

Jan 16, 2014, 7:14:17 PM
to pytho...@python.org
On Fri, Jan 17, 2014 at 6:37 AM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
> Can you elaborate on that? Unless your utf-8 files will only contain ascii characters I do not understand why you would not want a bom utf-8.

It's completely unnecessary, and could cause problems (the BOM is
actually whitespace, albeit zero-width, so it could effectively indent
the first line of your source code). UTF-8 specifies the byte order
as part of the protocol, so you don't need to mark it.

ChrisA

Steven D'Aprano

Jan 16, 2014, 8:18:39 PM
Because the UTF-8 signature -- it's not actually a Byte Order Mark -- is
not really necessary. Unlike UTF-16 and UTF-32, there is no platform
dependent ambiguity between Big Endian and Little Endian systems, so the
UTF-8 stream of bytes is identical no matter what platform you are on.

If the UTF-8 signature was just unnecessary, it wouldn't be too bad, but
it's actually harmful. Pure-ASCII text encoded as UTF-8 is still pure
ASCII, and so backwards compatible with old software that assumes ASCII.
But the same pure-ASCII text encoded as UTF-8 with a signature looks like
a binary file.
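
A quick illustration of both points, using only the standard codecs. The variable names are just for this sketch:

```python
text = 'abc'

plain = text.encode('utf-8')        # no signature: identical to ASCII
signed = text.encode('utf-8-sig')   # three signature bytes prepended

# UTF-16 output depends on the byte order; UTF-8 output does not.
little = text.encode('utf-16-le')
big = text.encode('utf-16-be')

plain.decode('ascii')               # fine: pure ASCII bytes
try:
    signed.decode('ascii')
    signed_is_ascii = True
except UnicodeDecodeError:
    # The signature bytes EF BB BF are outside the ASCII range.
    signed_is_ascii = False
```

Here plain is b'abc', signed is b'\xef\xbb\xbfabc', and the two UTF-16 results differ only in which byte of each pair comes first.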


> Btw, isn't "read_encoding_from_bom" a better function name than
> "guess_encoding_from_bom"? I thought the point of BOMs was that there
> would be no more need to guess?

Of course it's a guess. If you see a file that starts with 0000FEFF, is
that a UTF-32 text file, or a binary file that happens to start with two
nulls followed by FEFF?

--
Steven

Tim Chase

Jan 16, 2014, 8:40:05 PM
to pytho...@python.org
On 2014-01-17 11:14, Chris Angelico wrote:
> UTF-8 specifies the byte order
> as part of the protocol, so you don't need to mark it.

You don't need to mark it when writing, but some idiots use it
anyway. If you're sniffing a file for purposes of reading, you need
to look for it and remove it from the actual data that gets returned
from the file--otherwise, your data can see it as corruption. I end
up with lots of CSV files from customers who have polluted it with
Notepad or had Excel insert some UTF-8 BOM when exporting. This
means my first column-name gets the BOM prefixed onto it when the
file is passed to csv.DictReader, grr.

-tkc



Rustom Mody

Jan 17, 2014, 12:08:23 AM
And it's part of the standard:
Table 2.4 here
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

Mark Lawrence

Jan 17, 2014, 4:10:32 AM
to pytho...@python.org
On 17/01/2014 01:40, Tim Chase wrote:
> On 2014-01-17 11:14, Chris Angelico wrote:
>> UTF-8 specifies the byte order
>> as part of the protocol, so you don't need to mark it.
>
> You don't need to mark it when writing, but some idiots use it
> anyway. If you're sniffing a file for purposes of reading, you need
> to look for it and remove it from the actual data that gets returned
> from the file--otherwise, your data can see it as corruption. I end
> up with lots of CSV files from customers who have polluted it with
> Notepad or had Excel insert some UTF-8 BOM when exporting. This
> means my first column-name gets the BOM prefixed onto it when the
> file is passed to csv.DictReader, grr.
>
> -tkc
>

My code that used to handle CSV files from M$ Money had to allow for a
single NUL byte right at the end of the file. Thankfully I've now moved
on to gnucash.

Slight aside, any chance of changing the subject of this thread, or even
ending the thread completely? Why? Every time I see it I picture
Inspector Clouseau, "A BOM!!!" :)

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Chris Angelico

Jan 17, 2014, 4:43:59 AM
to pytho...@python.org
On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence <bream...@yahoo.co.uk> wrote:
> Slight aside, any chance of changing the subject of this thread, or even
> ending the thread completely? Why? Every time I see it I picture Inspector
> Clouseau, "A BOM!!!" :)

Special delivery, a berm! Were you expecting one?

ChrisA

Mark Lawrence

Jan 17, 2014, 4:47:33 AM
to pytho...@python.org
By coincidence I'm just off to collect a special delivery, of what I
don't yet know.

Chris Angelico

Jan 17, 2014, 4:58:28 AM
to pytho...@python.org
On Fri, Jan 17, 2014 at 8:47 PM, Mark Lawrence <bream...@yahoo.co.uk> wrote:
> On 17/01/2014 09:43, Chris Angelico wrote:
>>
>> On Fri, Jan 17, 2014 at 8:10 PM, Mark Lawrence <bream...@yahoo.co.uk>
>> wrote:
>>>
>>> Slight aside, any chance of changing the subject of this thread, or even
>>> ending the thread completely? Why? Every time I see it I picture
>>> Inspector
>>> Clouseau, "A BOM!!!" :)
>>
>>
>> Special delivery, a berm! Were you expecting one?
>>
>> ChrisA
>>
>
> By coincidence I'm just off to collect a special delivery, of what I don't
> yet know.

Did you write a script to buy you something for a dollar off eBay
every day? Day six gets interesting, as I understand it.

ChrisA

Pete Forman

Jan 17, 2014, 11:26:28 AM
It would have been nice if there were an eighth encoding scheme defined
there, UTF-8NB, which would be UTF-8 with BOM not allowed.
--
Pete Forman

Rustom Mody

Jan 17, 2014, 11:30:25 AM
On Friday, January 17, 2014 9:56:28 PM UTC+5:30, Pete Forman wrote:
If you or I break a standard then, well, we broke a standard.
If Microsoft breaks a standard the standard is obliged to change.

Or as the saying goes, everyone is equal though some are more equal.

Chris Angelico

Jan 17, 2014, 11:33:44 AM
to pytho...@python.org
On Sat, Jan 18, 2014 at 3:26 AM, Pete Forman <petef4...@gmail.com> wrote:
> It would have been nice if there was an eighth encoding scheme defined
> there UTF-8NB which would be UTF-8 with BOM not allowed.

Or call that one UTF-8, and the one with the marker can be UTF-8-MS-NOTEPAD.

ChrisA

Pete Forman

Jan 17, 2014, 11:46:45 AM
Endian detection: Does my BOM look big in this?

--
Pete Forman

Chris Angelico

Jan 17, 2014, 11:50:02 AM
to pytho...@python.org
On Sat, Jan 18, 2014 at 3:30 AM, Rustom Mody <rusto...@gmail.com> wrote:
> If you or I break a standard then, well, we broke a standard.
> If Microsoft breaks a standard the standard is obliged to change.
>
> Or as the saying goes, everyone is equal though some are more equal.

https://en.wikipedia.org/wiki/800_pound_gorilla

Though Microsoft has been losing weight over the past decade or so,
just as IBM before them had done (there was a time when IBM was *the*
800lb gorilla, pretty much, but definitely not now). In Unix/POSIX
contexts, Linux might be playing that role - I've seen code that
unwittingly assumes Linux more often than, say, assuming FreeBSD - but
I haven't seen a huge amount of "the standard has to change, Linux
does it differently", possibly because the areas of Linux-assumption
are areas that aren't standardized anyway (eg additional socket
options beyond the spec).

The one area where industry leaders still heavily dictate to standards
is the web. Fortunately, it usually still results in usable standards
documents that HTML authors can rely on. Usually. *twiddles fingers*

ChrisA

Ethan Furman

Jan 17, 2014, 12:40:19 PM
to pytho...@python.org
LOL!

--
~Ethan~

Tim Chase

Jan 17, 2014, 1:43:28 PM
to pytho...@python.org
On 2014-01-17 09:10, Mark Lawrence wrote:
> Slight aside, any chance of changing the subject of this thread, or
> even ending the thread completely? Why? Every time I see it I
> picture Inspector Clouseau, "A BOM!!!" :)

In discussions regarding BOMs, I regularly get the "All your base"
meme from a couple years ago stuck in my head: "Somebody set us up the
bomb!" (usually in reference to clients sending us files with the
aforementioned superfluous UTF-8 BOMs in them)

-tkc




Rotwang

Jan 17, 2014, 3:22:51 PM
On 17/01/2014 18:43, Tim Chase wrote:
> On 2014-01-17 09:10, Mark Lawrence wrote:
>> Slight aside, any chance of changing the subject of this thread, or
>> even ending the thread completely? Why? Every time I see it I
>> picture Inspector Clouseau, "A BOM!!!" :)
>
> In discussions regarding BOMs, I regularly get the "All your base"
> meme from a couple years ago stuck in my head: "Somebody set us up the
> bomb!"

ITYM "Somebody set up us the bomb".

Gregory Ewing

Jan 18, 2014, 4:41:00 AM
A berm? Is that anything like a shrubbery?

--
Greg

Chris Angelico

Jan 18, 2014, 5:10:57 AM
to pytho...@python.org