[Python-ideas] TextIOWrapper callable encoding parameter


Rurpy

Jun 11, 2012, 10:42:46 AM
to python...@python.org
Here is another issue that came up in my ongoing
adventure porting to Python3...

Executive summary:
==================

There is no good way to read a text file when the
encoding has to be determined by reading the start
of the file. A long-winded version of that follows.
Scroll down to the "Proposal" section to skip it.

Problem:
========

When one opens a text file for reading, one must specify
(explicitly or by default) an encoding which Python will
use to convert the raw bytes read into Python strings.
This means one must know the encoding of a file before
opening it, which is usually the case, but not always.

Plain text files have no metadata giving their encoding,
so sometimes it may not be known and some of the file must
be read and a guess made. Other data like HTML pages, XML
files or Python source code have encoding information inside
them, but that too requires reading the start of the file
without knowing the encoding in advance.

I see three ways in general in Python 3 currently to attack
this problem, but each has some severe drawbacks:

1. The most straightforward way to handle this is to open
the file twice, first in binary mode or with latin1 encoding
and again in text mode after the encoding has been determined.
This of course has a performance cost since the data is read
twice. Further, it can't be used if the data source is a
pipe, socket or other non-rewindable source. This
includes sys.stdin when it comes from a pipe.

2. Alternatively, with a little more expertise, one can rewrap
the open binary stream in a TextIOWrapper to avoid a second
OS file open. The standard library's tokenize.open()
function does this:

    def open(filename):
        buffer = builtins.open(filename, 'rb')
        encoding, lines = detect_encoding(buffer.readline)
        buffer.seek(0)
        text = TextIOWrapper(buffer, encoding, line_buffering=True)
        text.mode = 'r'
        return text

This too seems to read the data twice and of course the
seek(0) prevents this method also from being usable with
pipes, sockets and other non-seekable sources.

3. Another method is to simply leave the file open in
binary mode, read bytes data, and manually decode it to
text. This seems to be the only option when reading from
non-rewindable sources like pipes and sockets, etc.
But then one loses all the advantages of having
a text stream even though one wants to be reading text!
And if one tries to hide this, one ends up reimplementing
a good part of TextIOWrapper!
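
A minimal sketch of method 3, assuming the encoding has already been
worked out from the first bytes: the stream stays binary and an
incremental decoder from the codecs module does the decoding by hand.
The name iter_decoded is made up for illustration.

```python
import codecs
import io


def iter_decoded(binary_stream, encoding, chunk_size=4096):
    """Yield decoded text from a binary stream, one chunk at a time."""
    # An incremental decoder keeps partial multi-byte sequences between
    # calls, so a character split across two chunks still decodes.
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = binary_stream.read(chunk_size)
        if not chunk:
            # Flush the decoder; this raises if the input ended mid-character.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            return
        text = decoder.decode(chunk)
        if text:
            yield text


# A tiny chunk size forces characters to straddle chunk boundaries.
sample = "日本語テキスト".encode("utf-8")
decoded = "".join(iter_decoded(io.BytesIO(sample), "utf-8", chunk_size=2))
```

Even this much has no readline(), no newline translation and no
tell()/seek() support, which is the part of TextIOWrapper one would
end up rewriting.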

I believe these problems could be addressed with a fairly
simple and clean modification of the io.TextIOWrapper
class...

Proposal
========
The following is a logical description; I don't mean to
imply that the code must follow this outline exactly.
It is based on looking at _pyio; I hope the C code is
equivalent.

1. Allow io.TextIOWrapper's encoding parameter to be a
callable object in addition to a string or None.

2. In __init__(), if the encoding parameter was callable,
record it as an encoding hook and leave encoding set to
None.

3. The places in io.TextIOWrapper that currently read
undecoded data from the internal buffer object and decode
it (only the methods read() and _read_chunk(), I think) would
be modified to do so in this way:

4. Read data from the buffer object as is done now.

5. If the encoding has been set, get a decoder if necessary
and continue on as usual.

6. If the encoding is None, call the encoding callable
with the data just read and the buffer object.

7. The callable will examine the data, possibly using the
buffer object's peek method to look further ahead in the
file. It returns the name of an encoding.

8. io.TextIOWrapper will get the encoding and record it,
and set up the decoder the same way as if the encoding name
had been received as a parameter, decode the read data and
continue on as usual.

9. In other non-read paths where encoding needs to be known,
raise an error if it is still None.
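
The steps above can be approximated today without touching
TextIOWrapper, at least as a rough sketch. The helper below
(open_with_hook is a hypothetical name, not part of io) peeks at the
buffered bytes, hands them to the hook together with the buffer
object, and only then builds the TextIOWrapper. Unlike the proposal
it calls the hook eagerly at wrap time rather than on the first read,
and it is not the modified _pyio code.

```python
import io


def open_with_hook(raw, encoding_hook, **kwargs):
    """Wrap a binary stream as text; encoding_hook(data, buffer) picks the encoding.

    Hypothetical helper: an approximation of the proposal built from
    existing io pieces, not the proposed TextIOWrapper change itself.
    """
    if isinstance(raw, io.BufferedReader):
        buffer = raw
    else:
        buffer = io.BufferedReader(raw)
    # peek() fills the internal buffer with at most one raw read and
    # returns the buffered bytes without consuming them, so no seek(0)
    # is needed afterwards.
    data = buffer.peek(1)
    encoding = encoding_hook(data, buffer)
    if encoding is None:
        raise LookupError("unable to determine encoding")
    return io.TextIOWrapper(buffer, encoding=encoding, **kwargs)
```

Because nothing is rewound, the same call should also work when raw is
the read end of a pipe, as long as the declaration arrives in the first
raw read.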

Were io.TextIOWrapper modified this way, it would offer:

* Better performance since there is no need to reread data

* Read data is decoded after being examined so the stream
is usable with serial datasources like pipes, sockets, etc.

* User code is simplified and clearer; there is better
separation of concerns. For example, the code in the
"Problem" section could be written:

    stream = open(filename, encoding=detect_encoding)
    ...
    def detect_encoding(data, buffer):
        # This is still basically the same function as
        # in the code in the "Problem" section.
        # ... look for a Python coding declaration in the
        # first two lines of the 'data' bytes object ...
        if not found_encoding:
            raise Error("unable to determine encoding")
        return found_encoding

I have modified a copy of the _pyio module as described and
the changes required seemed unsurprising and relatively
few, though I am sure there are subtleties and other
considerations I am missing. Hence this post seeking
feedback...

_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

Rurpy

Jun 11, 2012, 11:06:18 AM
to python...@python.org
As a followup, here are some timing data that seem to confirm
a modest increase in speed as a result of implementing the
callable encoding parameter I proposed (although that would
not be the main reason for wanting to do it.) These are just
for illustration. (Among many other reasons, _pyio benchmarks
are not very useful.)

I read four short test files using four methods for determining
the test file's encoding. The test files are a simplified model
of a Python coding declaration (always on the first line in our case,
with no BOM present [*1]) followed by mixed English and Japanese
text.

Method 0 (reopen0):
Use the encoding callable I am proposing.

    def reopen0(fname):
        def hook(data, buf):
            return get_encoding(data)
        t = io.open(fname, encoding=hook)

Method 1 (reopen1):
Open in binary to determine encoding, then rewrap in a
TextIOWrapper with the correct encoding.

    def reopen1(fname):
        b = io.open(fname, 'rb')
        line = b.readline()
        enc = get_encoding(line)
        b.seek(0)
        t = io.TextIOWrapper(b, enc, line_buffering=True)
        t.mode = 'r'

Method 2 (reopen2):
Open in binary to determine encoding, then reopen in text mode
with correct encoding.

    def reopen2(fname):
        b = io.open(fname, 'rb')
        line = b.readline()
        enc = get_encoding(line)
        t = io.open(fname, encoding=enc)

Method 3 (reopen3):
Open in text mode (latin1) to determine encoding, then reopen
in text mode with correct encoding.

    def reopen3(fname):
        f = io.open(fname, encoding='latin1')
        line = f.readline()
        enc = get_encoding(line)
        t = io.open(fname, encoding=enc)

The same get_encoding() function is used in all methods [*1].

The input test data are all small files (because we want
to measure encoding detection, not how fast read() runs.)
Each has a python/emacs coding declaration in the first line.

test.utf8 -- Tiny python program with coding declaration
and single print statement in main() function that prints
a short word (literal) in Japanese. Encoding is utf-8
(122 bytes).
test.sjis -- Identical to test.utf8 but sjis encoding
(111 bytes).
test2.utf8 -- A python coding declaration followed by
approximately 50 long lines with mixed English and
Japanese (4274 bytes).
test2.sjis -- Identical to test2.utf8 but sjis encoding
(3401 bytes).

Results:
---------------------------------------------------------
$ python3 bm.py test.utf8
test.utf8 / reopen0: total time (10000 reps) was 1.188323
test.utf8 / reopen1: total time (10000 reps) was 1.490757
test.utf8 / reopen2: total time (10000 reps) was 1.766081
test.utf8 / reopen3: total time (10000 reps) was 2.141996
$ python3 bm.py test.sjis
test.sjis / reopen0: total time (10000 reps) was 1.175914
test.sjis / reopen1: total time (10000 reps) was 1.471780
test.sjis / reopen2: total time (10000 reps) was 1.764444
test.sjis / reopen3: total time (10000 reps) was 2.122550
$ python3 bm.py test2.utf8
test2.utf8 / reopen0: total time (10000 reps) was 1.690255
test2.utf8 / reopen1: total time (10000 reps) was 1.996235
test2.utf8 / reopen2: total time (10000 reps) was 2.278798
test2.utf8 / reopen3: total time (10000 reps) was 2.727867
$ python3 bm.py test2.sjis
test2.sjis / reopen0: total time (10000 reps) was 1.841388
test2.sjis / reopen1: total time (10000 reps) was 2.147142
test2.sjis / reopen2: total time (10000 reps) was 2.426701
test2.sjis / reopen3: total time (10000 reps) was 2.873278
----------------------------------------------------------

Here is what happens when a test data file is piped
into a program using the four methods above:

$ cat test.utf8 | python3 stdin.py reopen0
read 102 characters

$ cat test.utf8 | python3 stdin.py reopen1
got exception: [Errno 29] Illegal seek

$ cat test.utf8 | python3 stdin.py reopen2
read 0 characters

$ cat test.utf8 | python3 stdin.py reopen3
read 0 characters
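
The reopen1 failure above can be reproduced without a shell pipeline.
The sketch below uses os.pipe() as a stand-in for piped stdin (the
stdin.py script itself is not shown in this thread, so this is not
that script): after the first line has been read, the stream reports
itself as non-seekable and seek(0) raises.

```python
import os

# Create a pipe and write a small "file" into it, then close the write
# end so the reader sees EOF after the data.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"# -*- coding: utf-8 -*-\nprint('hi')\n")
os.close(write_fd)

with open(read_fd, "rb") as buffer:
    first_line = buffer.readline()      # enough to find the declaration
    can_rewind = buffer.seekable()      # False for a pipe
    try:
        buffer.seek(0)                  # what tokenize.open() and reopen1 do
        seek_error = None
    except OSError as exc:              # io.UnsupportedOperation is an OSError too
        seek_error = exc
```

The exact exception type and message may differ between Python
versions and platforms; the sketch only relies on it being an OSError.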

----
[*1] Here is the get_encoding function used above. It is
a simplified toy reader for the Python source encoding line. Toy,
in that it looks at only one line, doesn't consider a BOM,
etc. Its purpose was to allow me to sanity check the benefits
of having a callable encoding parameter.

    def get_encoding(line):
        if isinstance(line, bytes):
            nlpos = line.index(b'\n')
            mo = ENC_PATTERN_B.search(line, 0, nlpos)
            if not mo:
                return None
            enc = mo.group(1).decode('latin1')
        else:
            nlpos = line.index('\n')
            mo = ENC_PATTERN_S.search(line, 0, nlpos)
            if not mo:
                return None
            enc = mo.group(1)
        return enc
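
The two patterns used by get_encoding are not shown in the post. One
plausible pair, assuming they follow the regular expression given in
PEP 263 for coding declarations, would be the following; the actual
definitions used for the benchmark may well have differed.

```python
import re

# Assumed definitions, modelled on the PEP 263 expression
# "coding[:=]\s*([-\w.]+)"; not taken from the original benchmark code.
ENC_PATTERN_S = re.compile(r"coding[:=]\s*([-\w.]+)")    # for str input
ENC_PATTERN_B = re.compile(rb"coding[:=]\s*([-\w.]+)")   # for bytes input

# Both Emacs-style and vim-style declarations are matched.
emacs_style = ENC_PATTERN_B.search(b"# -*- coding: utf-8 -*-\n")
vim_style = ENC_PATTERN_S.search("# vim: set fileencoding=sjis :\n")
```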

Nick Coghlan

Jun 11, 2012, 11:10:47 AM
to Rurpy, python...@python.org

Immediate thought: it seems like it would be easier to offer a way to inject data back into a buffered IO object's internal buffer.

--
Sent from my phone, thus the relative brevity :)

Eric Snow

Jun 11, 2012, 11:11:50 AM
to Rurpy, python...@python.org
On Mon, Jun 11, 2012 at 8:42 AM, Rurpy <ru...@yahoo.com> wrote:
> Here is another issue that came up in my ongoing
> adventure porting to Python3...
>
> Executive summary:
> ==================
>
> There is no good way to read a text file when the
> encoding has to be determined by reading the start
> of the file.  A long-winded version of that follows.
> Scroll down the the "Proposal" section to skip it.

FWIW, the import system does an encoding check on Python source files
that is somewhat related. See
http://www.python.org/dev/peps/pep-0263/.

-eric

Stephen J. Turnbull

Jun 11, 2012, 12:24:20 PM
to Nick Coghlan, Rurpy, python...@python.org
Nick Coghlan writes:

> Immediate thought: it seems like it would be easier to offer a way to
> inject data back into a buffered IO object's internal buffer.

ungetch()?

If you're only interested in the top of the file (see below), I would
suggest allowing only one bufferful, and then simply rewinding the
buffer pointer once you're done. This is one strategy used by Emacsen
for encoding detection (for the reason pointed out by Rurpy: not all
streams are rewindable).

But is that really "easier"? It might be more general, but you still
need to reinitialize the encoding (ie, from the trivial "binary" to
whatever is detected), with all the hair that comes with that.
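
The io module has no ungetch-style call, but the "inject data back"
idea can be imitated one layer down. The sketch below is only an
illustration under that assumption: ReplayReader and rewrap are
hypothetical names, and the raw stream simply serves the bytes that
were already consumed before continuing with the original source.

```python
import io


class ReplayReader(io.RawIOBase):
    """Raw stream that serves `head` first, then continues from `source`."""

    def __init__(self, head, source):
        super().__init__()
        self._head = bytes(head)
        self._source = source

    def readable(self):
        return True

    def readinto(self, b):
        if self._head:
            # Hand back the already-consumed bytes before touching source.
            chunk, self._head = self._head[:len(b)], self._head[len(b):]
            b[:len(chunk)] = chunk
            return len(chunk)
        return self._source.readinto(b)

    def close(self):
        if not self.closed:
            self._source.close()
        super().close()


def rewrap(source, sniff, head_size=1024):
    """Read up to head_size bytes, let sniff(head) name the encoding, replay them."""
    head = source.read(head_size)
    encoding = sniff(head)
    replay = io.BufferedReader(ReplayReader(head, source))
    return io.TextIOWrapper(replay, encoding=encoding)
```

The detection window is whatever head_size allows rather than the
size of an internal buffer, at the cost of an extra layer in the
stream stack.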

> > Executive summary:
> > ==================
> >
> > There is no good way to read a text file when the
> > encoding has to be determined by reading the start
> > of the file. A long-winded version of that follows.
> > Scroll down the the "Proposal" section to skip it.

This may be insufficiently general. Specifically, both Emacsen and vi
allow specification of editor configuration variables at the bottom of
the file as well as the top. I don't know whether vi allows encoding
specs at the bottom, but Emacsen do (but only for files).

I wouldn't recommend paying much attention to what Emacsen actually
*do* when initializing a stream (it's, uh, "baroque").

Victor Stinner

Jun 12, 2012, 5:48:08 PM
to python...@python.org
> 1.  The most straight-forward way to handle this is to open
> the file twice, first in binary mode or with latin1 encoding
> and again in text mode after the encoding has been determined
> This of course has a performance cost since the data is read
> twice.  Further, it can't be used if the data source is a
> from a pipe, socket or other non-rewindable source.  This
> includes sys.stdin when it comes from a pipe.

Some months ago, I proposed to automatically detect if a file contains
a BOM and use it to set the encoding. Various methods were proposed
but there was no real consensus. One proposal was to use a codec
(e.g. "bom") which uses the BOM if it is present, and so doesn't need to
read the file twice.
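
For illustration only, BOM sniffing can also be done on the buffer
rather than in a codec. The sketch below is not the "bom" codec from
that earlier discussion; encoding_from_bom is a hypothetical helper
that peeks at the first bytes and maps a BOM to a standard codec that
consumes it.

```python
import codecs
import io

# Longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(buffer, default="utf-8"):
    """Return a codec name based on a BOM at the start of `buffer`, else `default`."""
    # peek() does not consume anything; it may return fewer than 4 bytes
    # if the underlying raw read came back short.
    head = buffer.peek(4)[:4]
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


buffer = io.BufferedReader(io.BytesIO("héllo".encode("utf-16")))
encoding = encoding_from_bom(buffer)
text = io.TextIOWrapper(buffer, encoding=encoding).read()
```

The "utf-16", "utf-32" and "utf-8-sig" codecs strip the BOM themselves,
so the wrapped stream never shows it to the caller.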

For the pipe issue: it depends where the encoding specification is. If
the encoding is written at the end of your "file" (stream), you have
to store the whole stream content (a few MB or maybe much more?) in
memory. If it is in the first lines, you have to store these lines in
a buffer. It's not easy to decide on the threshold.

I don't like the codec approach because the codec is disconnected from
the stream. For example, the codec doesn't know the current position
in the stream, nor can it read a few more bytes forward or backward. If you
open the file in "append" mode, you are not writing at the beginning
but at the end of the file. You may also seek to an arbitrary position
before the first read...

There are also some special cases. For example, when a text file is
opened in write mode, the file is seekable and the file position is
not zero, TextIOWrapper calls encoder.setstate(0) to not write the BOM
in the middle of the file. (See also Lib/test/test_io.py for related
tests.)
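
The effect of that setstate(0) call can be seen with an incremental
encoder on its own. This small sketch uses the utf-8-sig codec so the
result does not depend on byte order; it shows the encoder behaviour
only, not the TextIOWrapper code path.

```python
import codecs

# A fresh incremental encoder emits the BOM with its first output.
encoder = codecs.getincrementalencoder("utf-8-sig")()
with_bom = encoder.encode("abc")

# Resetting the state to 0 tells the encoder the start of the stream is
# already behind it, so no BOM is written.
encoder = codecs.getincrementalencoder("utf-8-sig")()
encoder.setstate(0)
without_bom = encoder.encode("abc")
```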

> 2.  Alternatively, with a little more expertise, one can rewrap
> the open binary stream in a TextIOWrapper to avoid a second
> OS file open.

That's my favorite method because you have full control over the
stream. (I wrote tokenize.open). But yes, it does not work on
non-seekable streams (e.g. pipes).

> This too seems to read the data twice and of course the
> seek(0) prevents this method also from being usable with
> pipes, sockets and other non-seekable sources.

Does it really matter? You usually need to read only a few bytes to get the encoding.

> 9. In other non-read paths where encoding needs to be known,
>  raise an error if it is still None.

Why not read data until the encoding is known instead?

> I have modified a copy the _pyio module as described and
> the changes required seemed unsurprising and relatively
> few, though I am sure there are subtleties and other
> considerations I am missing.  Hence this post seeking
> feedback...

Can you post the modified module somewhere so I can play with it?

Victor

Victor Stinner

Jun 12, 2012, 6:13:50 PM
to python...@python.org
2012/6/11 Nick Coghlan <ncog...@gmail.com>:
> Immediate thought: it seems like it would be easier to offer a way to inject
> data back into a buffered IO object's internal buffer.

BufferedReader already has a useful peek() method to read data
without changing the position.
http://docs.python.org/library/io.html#io.BufferedReader.peek

It's not perfect ("The number of bytes returned may be less or more
than requested.") but better than nothing.
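
A short illustration of that behaviour, using an in-memory stream as
a stand-in for a real file:

```python
import io

source = b"# -*- coding: utf-8 -*-\nprint('hi')\n"
buffer = io.BufferedReader(io.BytesIO(source))

peeked = buffer.peek(10)             # may return more (or fewer) than 10 bytes
position_after_peek = buffer.tell()  # peek() does not advance the position
everything = buffer.read()           # a later read still sees the peeked bytes
```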

Victor

Antoine Pitrou

Jun 13, 2012, 4:25:21 AM
to python...@python.org
On Tue, 12 Jun 2012 01:10:47 +1000
Nick Coghlan <ncog...@gmail.com> wrote:
> Immediate thought: it seems like it would be easier to offer a way to
> inject data back into a buffered IO object's internal buffer.

Except that it would be limited by buffer size, which is not
necessarily something you have control over.

Regards

Antoine.

Rurpy

Jun 13, 2012, 11:46:01 AM
to python...@python.org
On 06/11/2012 10:24 AM, Stephen J. Turnbull wrote:
> > Nick Coghlan writes:
> >
> > > Immediate thought: it seems like it would be easier to offer a way to
> > > inject data back into a buffered IO object's internal buffer.
> >
> > ungetch()?

What would be the TextIOWrapper api for that?

> > If you're only interested in the top of the file (see below), I would
> > suggest allowing only one bufferfull, and then simply rewinding the
> > buffer pointer once you're done. This is one strategy used by Emacsen
> > for encoding detection (for the reason pointed out by Rurpy: not all
> > streams are rewindable).
> >
> > But is that really "easier"? It might be more general, but you still
> > need to reinitialize the encoding (ie, from the trivial "binary" to
> > whatever is detected), with all the hair that comes with that.

I don't think there is any hair involved. In at least
the _pyio version of TextIOWrapper, initializing the
encoding (in the read path) consists of calling
self._get_decoder(). One needs to move the few places
where that is called now to nearby places that are
after the raw buffer has been read but before it is
decoded. Some consideration may be needed
for raising errors at the old locations when the
callable encoding hook is not being used (to
maintain complete backwards compatibility; not sure
that is necessary), but I wouldn't call that hairy.
Of course there may be other factors I am missing...

> > > > Executive summary:
> > > > ==================
> > > >
> > > > There is no good way to read a text file when the
> > > > encoding has to be determined by reading the start
> > > > of the file. A long-winded version of that follows.
> > > > Scroll down the the "Proposal" section to skip it.
> >
> > This may be insufficiently general. Specifically, both Emacsen and vi
> > allow specification of editor configuration variables at the bottom of
> > the file as well as the top. I don't know whether vi allows encoding
> > specs at the bottom, but Emacsen do (but only for files).
> >
> > I wouldn't recommend paying much attention to what Emacsen actually
> > *do* when initializing a stream (it's, uh, "baroque").

Looking only at the beginning of an input stream is
general enough for a large class of problems including
tokenizing python source code.

Rurpy

Jun 13, 2012, 11:56:02 AM
to python...@python.org
On 06/12/2012 03:48 PM, Victor Stinner wrote:
>> >> 1. The most straight-forward way to handle this is to open
>> >> the file twice, first in binary mode or with latin1 encoding
>> >> and again in text mode after the encoding has been determined
>> >> This of course has a performance cost since the data is read
>> >> twice. Further, it can't be used if the data source is a
>> >> from a pipe, socket or other non-rewindable source. This
>> >> includes sys.stdin when it comes from a pipe.
> >
> > Some months ago, I proposed to automatically detect if a file contains
> > a BOM and uses it to set the encoding. Various methods were proposed
> > but there was no real consensus. One proposition was to use a codec
> > (e.g. "bom") which uses the BOM if it is present, and so don't need to
> > reread the file twice.
> >
> > For the pipe issue: it depends where the encoding specification is. If
> > the encoding is written at the end of your "file" (stream), you have
> > to store the whole stream content (few MB or maybe much more?) into
> > memory. If it is in the first lines, you have to store these lines in
> > a buffer. It's not easy to decide for the threshold.

That's always a problem. When trying to determine a
character encoding one may have to read the entire file
because it could consist of all ascii characters except
the very last one. (And of course there is no guarantee
one can determine *the* encoding at all).
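
A small made-up example of that worst case: everything but the last
character is ASCII, so every prefix decodes identically under several
encodings and only the final bytes tell them apart.

```python
# Illustrative data: ASCII text followed by one non-ASCII character.
data = b"plain ascii text until the very end: " + "é".encode("utf-8")

# The all-ASCII prefix gives no hint; these decodings all agree.
prefix = data[:-2]
prefix_is_ambiguous = (
    prefix.decode("ascii") == prefix.decode("utf-8") == prefix.decode("latin-1")
)

# Only the tail distinguishes the candidates.
as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("latin-1")
```

Both decodings succeed, which is also why there is no guarantee of
finding *the* encoding: latin-1 accepts any byte sequence.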

Nevertheless, I think there is a very large class of
problems that can be usefully handled by looking at a
limited amount of data at the start of a file (or stream).

The Python coding declaration is one example (obviously
picked hoping it would have some resonance here.)

The buffer object used by TextIOWrapper already reads the
start of the stream and buffers the first few lines, so
why not take advantage of that rather than repeating the
work?

One of the things I am not sure about is if there are
cases when the buffered read returns, say, only one
line, as might happen with tty input.

> > I don't like the codec approach because the codec is disconnected from
> > the stream. For example, the codec doesn't know the current position
> > in stream nor can read a few more bytes forward or backward. If you
> > open the file in "append" mode, you are not writing at the beginning
> > but at the end of the file. You may also seek at an arbitrary position
> > before the first read...
> >
> > There are also some special cases. For example, when a text file is
> > opened in write mode, the file is seekable and the file position is
> > not zero, TextIOWrapper calls encoder.setstate(0) to not write the BOM
> > in the middle of the file. (See also Lib/test/test_io.py for related
> > tests.)

A callable encoding parameter would not be terribly useful
with a file opened in write or append mode, but its behavior
would be predictable: a write would result in an error
because the encoding hadn't been set. A read in the middle
of the file would work the same way as at the beginning.
This is probably not very useful, but is consistent.

Of course one could choose to implement a callable encoding
parameter such that some or all of these paths are detected
at open and declared illegal then. One could prohibit the
encoding call after a seek though I'm not sure there is any
point to that.

>> >> 2. Alternatively, with a little more expertise, one can rewrap
>> >> the open binary stream in a TextIOWrapper to avoid a second
>> >> OS file open.
> >
> > That's my favorite method because you have the full control on the
> > stream. (I wrote tokenize.open). But yes, it does not work on
> > non-seekable streams (e.g. pipes).
> >
>> >> This too seems to read the data twice and of course the
>> >> seek(0) prevents this method also from being usable with
>> >> pipes, sockets and other non-seekable sources.
> >
> > Does it really matter? You usually need to read few bytes to get the encoding.

It certainly matters if input is from a pipe. Quoting from
my other message:

$ cat test.utf8 | python3 stdin.py reopen1
got exception: [Errno 29] Illegal seek

The whole point of my suggestion was that you've already
read those few bytes -- but by the time you have access
to them, you've already been forced to choose an encoding.
My suggestion simply defers that encoding setting until
after you've had a chance to look at the bytes.

>> >> 9. In other non-read paths where encoding needs to be known,
>> >> raise an error if it is still None.
> >
> > Why not reading data until you the encoding is known instead?

That's how I do it now -- open file in binary mode
and read it, buffer it, determine encoding, and henceforth
decode the bytes data "by hand" to text.

But that's an awful lot like what TextIOWrapper does, yes?
Why can't I use TextIOWrapper instead of rewriting it myself?
(Yes, I know I can reopen or rewrap the binary stream but
as I said, that loses the one-pass processing which breaks
pipes.)

>> >> I have modified a copy the _pyio module as described and
>> >> the changes required seemed unsurprising and relatively
>> >> few, though I am sure there are subtleties and other
>> >> considerations I am missing. Hence this post seeking
>> >> feedback...
> >
> > Can you post the modified somewhere so I can play with it?

I put a diff against the Python-3.2.3 _pyio.py file at:

http://pastebin.com/kZHmcBdm

Much of the diff is just moving existing stuff around.
The note at the bottom says:

| It is in no way supposed to be a serious patch.
|
| It was the minimal changes I could make in order to
| see if my suggestion to allow a callable encoding parameter
| in TextIOWrapper was feasible, and allow some timing tests.
|
| I am quite sure it will not pass Python's tests.
|
| It does, I hope, give some idea of the nature and scale of the
| code changes needed to implement a callable encoding parameter.