File look ahead and "ungetc"

Aaron Watters

unread,

Jul 1, 1997, 3:00:00 AM7/1/97

to

From: Richard Everson <r...@camelot.mssm.edu>
>The Python file object doesn't have the equivalent of C's getc and
>ungetc (because they'd be inefficient?). I'd really like to be able
>to look ahead (by a single character) on pipes as well as on real
>files. Am I missing something? Is there a well known class that
>permits this?

>"Enhancement" of the file object to include getc and ungetc methods
>seems relatively straight-forward, but non-standard: has this been
>discussed and rejected?

Oh, on this newsgroup everything has been discussed by someone
and rejected by someone else...

I don't know about issues for/against getc/ungetc. My feeling is that
in Python it would only be needed for forking subprocesses, since it'd
be easy in a single (multithreaded) process to shove the character
into a structure someplace (even using a file wrapper, eg, ipwp 130).
Maybe getc/ungetc should be added, afaik, but in the meantime
perhaps you could give us an idea what you wish to do at a higher
level so we can suggest a workaround or an alternative approach?
-- Aaron Watters
===
joke_version = fortune_cookie()[:-1] + ", in bed."

Thierry Bousch

unread,

Jul 3, 1997, 3:00:00 AM7/3/97

to

Aaron Watters (a...@dante.mh.lucent.com) wrote:

: I don't know about issues for/against getc/ungetc. My feeling is that

: in Python it would only be needed for forking subprocesses, since it'd
: be easy in a single (multithreaded) process to shove the character
: into a structure someplace (even using a file wrapper, eg, ipwp 130).

Of course, you can always create a new class to implement the missing
method, but:

(1) Everything will be slower because you're no longer using
builtin objects and C functions, but user-defined classes.

(2) The file interface is supposed to provide the same functionality and
abstraction level as stdio. Since stdio provides a ungetc() method,
so should do the Python interface.

(3) It requires cooperation from *all* the components of the program
(that is, not only the parts you wrote yourself). This can be
very hard to verify. I'm sure there are some C extensions which
require standard files, not simply objects which behave like files.
Are you going to reassign sys.stdin to "fix" the functions which
can only read from standard input? What about snippets of code like
"if type(f)<>FileType: raise TypeError", etc.

: Maybe getc/ungetc should be added, afaik, but in the meantime

: perhaps you could give us an idea what you wish to do at a higher
: level so we can suggest a workaround or an alternative approach?

Well, typically it's to implement some kind of lexical analysis.
For instance, to read a number from standard input.

Thierry

Aaron Watters

unread,

Jul 3, 1997, 3:00:00 AM7/3/97

to

me:

: Maybe getc/ungetc should be added, afaik, but in the meantime
: perhaps you could give us an idea what you wish to do at a higher
: level so we can suggest a workaround or an alternative approach?

Thierry:
... many valid points omitted, this valid point left in:

>Well, typically it's to implement some kind of lexical analysis.
>For instance, to read a number from standard input.

One could argue that relying on the filesystem for
buffering chars is not ideal. Whenever possible I tend to
read entire files as strings and use the strings. (For
line formated files, maybe one line at a time.) This
gives complete flexibility and high performance to the
Python program and reduces kernel system call overhead.
Of course if the file is too large, you have a problem,
but these days the file has to be pretty large to be large.
I suspect many C programs would be written differently
too if C were better at handling strings.

You are right, of course, that something like this won't
work if you need to pass a fd down to a C extension
module or if you need to fork a subprocess in the middle
of reading a file, but I don't recall every having to do
either of these myself.

-- Aaron Watters
===
Beware of people who have "smarter
than everyone else" on their job description.

Donn Cave

unread,

Jul 3, 1997, 3:00:00 AM7/3/97

to

bousch%linott...@topo.math.u-psud.fr (Thierry Bousch) writes:
...

| (2) The file interface is supposed to provide the same functionality and
| abstraction level as stdio. Since stdio provides a ungetc() method,
| so should do the Python interface.

Note though that ungetc()'s standard behavior isn't fully specified -
[from a Digital UNIX man page] ``One character of push back is
guaranteed ...; however, if [ungetc()] is called too many times
on the same stream without an intervening read or file-positioning
operation on that stream, industry standards specify that [it]
may fail. (Applications do not encounter this failure on Digital
UNIX systems. ...''

I'd guess that the odds are small that anyone would be bit by this, since
the 1-character situation is much more common and push back of more than
a handful of characters is arguably not a sane programming practice. But
my point is that ungetc() isn't on a par with fgets() et al., with respect
to standards.

| (3) It requires cooperation from *all* the components of the program
| (that is, not only the parts you wrote yourself). This can be
| very hard to verify. I'm sure there are some C extensions which
| require standard files, not simply objects which behave like files.
| Are you going to reassign sys.stdin to "fix" the functions which
| can only read from standard input? What about snippets of code like
| "if type(f)<>FileType: raise TypeError", etc.

That's a problem anyway, in my opinion. If the lack of ungetc() helps
motivate people to re-code these dependencies, then there's a silver
lining in that cloud after all. There seem to be differing thoughts
on how ideally this kind of issue should be approached, but from a
practical present-day perspective one option is to look for the member
attributes that are supposed to define the interface, and use them.
For an example, there are a couple of modules, e.g., socketmodule, that
find and use the fileno() method of a parameter, rather than requiring
the parameter to be of a specific type. With Python buffered I/O,
that kind of thing would allow string files and all kinds of great
stuff. It's not always practical, e.g., if the module is an interface
to an external library that itself requires a FILE *, but gratuitous
dependencies on stdio are avoidable.

Several times I've idly thought about a distribute-with-Python buffered
I/O system, possibly based on something like Chris Torek's version of stdio.
One reason is to make it usable with select() (because with a standardized
implementation you could check for buffered data), but it could also make
things like ungetc() fully standard.

Donn Cave, University Computing Services, University of Washington
do...@u.washington.edu

Thierry Bousch

unread,

Jul 3, 1997, 3:00:00 AM7/3/97

to

Aaron Watters (a...@dante.mh.lucent.com) wrote:

: One could argue that relying on the filesystem for

: buffering chars is not ideal.

The operating system is absolutely not involved here. The role of stdio is
precisely to provide buffers, in order to minimize the number of I/O system
calls. In particular, ungetc() just puts back characters into these buffers,
so that the next getc() or fread() will get them first. This requires
absolutely no kernel support.

: Whenever possible I tend to
: read entire files as strings and use the strings.

I can see two disadvantages to this. The first one is that it requires more
memory, because the whole file must be present in memory while it is parsed.
The second one is that you cannot tokenize the file as soon as it arrives
(if you're reading from a socket), but you must wait for the file to be
completely transferred before parsing it.

: (For line formated files, maybe one line at a time.) This gives

: complete flexibility and high performance to the Python program and
: reduces kernel system call overhead.

There's no such overhead thanks to stdio, see above. The only overhead is
in the Python interpreter...

: Of course if the file is too large, you have a problem,

: but these days the file has to be pretty large to be large.
: I suspect many C programs would be written differently
: too if C were better at handling strings.

That might be true, unfortunately Python is not so good at handling
individual characters. Lexical analysis is definitely not something
doable in Python with reasonable efficiency, whereas it's
straightforward in C. Not having ungetc() won't improve things.

: You are right, of course, that something like this won't

: work if you need to pass a fd down to a C extension
: module or if you need to fork a subprocess in the middle
: of reading a file, but I don't recall every having to do
: either of these myself.

You could have some parsing routines written in C, and some others written
in Python (the non-time-critical ones...) for instance.

Thierry

Thierry Bousch

unread,

Jul 4, 1997, 3:00:00 AM7/4/97

to

Donn Cave (do...@u.washington.edu) wrote:

: Note though that ungetc()'s standard behavior isn't fully specified -

: [from a Digital UNIX man page] ``One character of push back is
: guaranteed ...; however, if [ungetc()] is called too many times
: on the same stream without an intervening read or file-positioning
: operation on that stream, industry standards specify that [it]
: may fail. (Applications do not encounter this failure on Digital
: UNIX systems. ...''

The ISO C standard defines the behaviour of ungetc(): it must accept
at least one character of push back, as your man page indicates.

I think the key points are the following:

(1) It is a standard function, part of the standard C libraries,
with consistent behaviour on all operating systems;

(2) The return value indicates whether ungetc() succeeded or not;
so if it fails, we can raise an exception.

(3) There are no cases of "undefined behaviour"; either the character is
pushed back or it isn't, but it will never cause a crash.

: But my point is that ungetc() isn't on a par with fgets() et al., with
: respect to standards.

Get a copy of the ISO C standard then. ungetc() is defined in section
4.9.7.11, not far after fgets().

Thierry

Fredrik Lundh

unread,

Jul 4, 1997, 3:00:00 AM7/4/97

to

Thierry wrote:
>The second one is that you cannot tokenize the file as soon as it arrives
>(if you're reading from a socket), but you must wait for the file to be
>completely transferred before parsing it.

So? stdio doesn't work on sockets anyway, on the vast majority of
computers. On the other hand, Python file objects work just fine
with sockets.

> (1) It is a standard function, part of the standard C libraries,
> with consistent behaviour on all operating systems;

What on earth makes you think that Python should be defined by
the ISO C standard? If you need better performance when parsing
stuff from Python, adding a relatively obscure, and hardly ever used
stdio function to the clean and small python file interface is definitely
not the solution.

Could we please focus on Python on this list, and stop pretending it's
C or C++?

Cheers /F

Thierry Bousch

unread,

Jul 4, 1997, 3:00:00 AM7/4/97

to

Fredrik Lundh (fredri...@ivab.se) wrote:

: So? stdio doesn't work on sockets anyway, on the vast majority of

: computers. On the other hand, Python file objects work just fine
: with sockets.

It is evident from Python-1.4/Objects/fileobject.c that the
Python file module is built immediately atop of stdio. If stdio doesn't
work with sockets, neither will the Python file objects.

: What on earth makes you think that Python should be defined by
: the ISO C standard?

The person I was responding to claimed that ungetc() was not standard and
had lots of operating system dependencies. If it were true, these would
be serious reasons for not putting it in the standard file interface.
But it is not the case: ungetc() is very portable and one can be confident
that all C libraries will implement it correctly, if only because it is
mandated by an international standard.

Now, I have obviously never claimed that Python should follow blindly the C
interfaces and specifications. I'm simply saying that it's important to
understand the C library, because (i) Python depends on it, and (ii) because
it's always instructive to see how other programmer communities have solved
certain problems.

: If you need better performance when parsing stuff from Python, adding a

: relatively obscure, and hardly ever used stdio function to the clean and
: small python file interface is definitely not the solution.

Sure, ungetc() is not used all the time. In the SAML project, which is
roughly 15000 lines of C, there are 3 lexers and ungetc() is used 15 times.
It's just a very useful function when you need it.

And it's a rather severe case of NIH syndrome to pretend that the Python
file interface is "clean". It ignores I/O errors and happily mixes stdio
operations with lower-level functions which operate directly on file
descriptors (like isatty or ftruncate, which should be part of the "os"
module).

Thierry

Donn Cave

unread,

Jul 5, 1997, 3:00:00 AM7/5/97

to

bousch%linott...@topo.math.u-psud.fr (Thierry Bousch) writes:
|Fredrik Lundh (fredri...@ivab.se) wrote:
|: So? stdio doesn't work on sockets anyway, on the vast majority of
|: computers. On the other hand, Python file objects work just fine
|: with sockets.
|
| It is evident from Python-1.4/Objects/fileobject.c that the
| Python file module is built immediately atop of stdio. If stdio doesn't
| work with sockets, neither will the Python file objects.

Which I'd argue is the case, in both respects. Neither file object nor
stdio buffered I/O is all there for sockets or pipes. It's fine in certain
applications, but where there are two or more such devices active at the
same time, it's not so great. I use the system I/O for this, with buffering
in Python where I can make it compatible with select() - not the fastest
thing, but it probably doesn't cost as much as the subsequent parsing.

|: What on earth makes you think that Python should be defined by
|: the ISO C standard?
|
| The person I was responding to claimed that ungetc() was not standard and
| had lots of operating system dependencies. If it were true, these would
| be serious reasons for not putting it in the standard file interface.
| But it is not the case: ungetc() is very portable and one can be confident
| that all C libraries will implement it correctly, if only because it is
| mandated by an international standard.

For what it's worth, I stand by my claim on this. The man page I cited
demonstrates that ungetc()'s behavior is not standard - the vendor in this
case allows ungetc() to succeed when, according the standard, it should
fail. Now that seems like a petty thing to complain about, but it means
that on some platforms, maybe most, a programmer can rely on a documented,
useful behavior that's not standard. You're right, code that conforms to
the C standard is portable, and if your code isn't portable because it
relies on non-standard behavior behavior then it's your fault. Technically.
But it's not like fgets(), where the legal standard behavior actually more
or less corresponds to standard practice, and doesn't omit a useful and
commonly implemented feature.

If I get around to writing a general purpose buffered I/O module for
Python, I'll include an unread(), I think that would be very useful.

aaron watters

unread,

Jul 7, 1997, 3:00:00 AM7/7/97

to

Thanks to Thierry for correcting my misrepresentation of
stdio, and he's right that python file objects are based on stdio, and
I agree that there may be no good reason to omit getc/ungetc...

Aaron Watters (a...@dante.mh.lucent.com) wrote:
: Whenever possible I tend to
: read entire files as strings and use the strings.

bousch%linott...@topo.math.u-psud.fr observed:

>I can see two disadvantages to this. The first one is that it requires more
>memory, because the whole file must be present in memory while it is parsed.

>The second one is that you cannot tokenize the file as soon as it arrives
>(if you're reading from a socket), but you must wait for the file to be
>completely transferred before parsing it.

Yes, absolutely. Whether this matters depends on the problem
at hand. If you are using kwParsing, for example, you could
hack the lexer object to try to parse a prefix using regexen or
whatever, and on failure, but not end of file, wait for more input
using select (possibly combined with some sort of sanity
condition).

The fellow who started this thread was reading
a "line formatted" file, so in that common case I think the correct solution
was to use f.readline() to get the line as a string, and then branch
on its first character, or whatever. I've never personally had any
need to use ungetc, and I can't remember ever using it, even in
C. Nonetheless, it might not hurt to add it, perhaps as a dynamically
loadable extension function. Adding functions to the core increases
Python's size, of course, so obscure functions should be left in
loadable extension modules, I think. - Aaron Watters
=-_
a win a way a win a way ,,,
(background to "A lion sleeps tonight")

Fredrik Lundh

unread,

Jul 7, 1997, 3:00:00 AM7/7/97

to

>bousch%linott...@topo.math.u-psud.fr (Thierry Bousch) writes:
>|Fredrik Lundh (fredri...@ivab.se) wrote:
>|: So? stdio doesn't work on sockets anyway, on the vast majority of
>|: computers. On the other hand, Python file objects work just fine
>|: with sockets.
>|
>| It is evident from Python-1.4/Objects/fileobject.c that the
>| Python file module is built immediately atop of stdio. If stdio doesn't
>| work with sockets, neither will the Python file objects.

And it's evident that the socket makefile function work perfectly fine
on platforms where stdio doesn't support sockets; I use it everyday.
Maybe Guido wrote his own stdio package? No, he made sure the object
returned by makefile implemented the Python file interface.

I say it again: Python is all about interfaces, not types. If you're
ignoring that, you're not really programming Python.

>| But it is not the case: ungetc() is very portable and one can be confident
>| that all C libraries will implement it correctly, if only because it is
>| mandated by an international standard.

With respect to the stdio-based fileobject implementation, yes. With
respect to everything else, no way. Adding ungetc behaviour to the
file interface imposes additional overhead in all other file interface
implementations.

Besides, I cannot recall ever using ungetc in my life. A quick grep among
a lot of code, including a lot of stuff from the net, shows zero references
to ungetc. So why make a simple interface more complicated by adding
an obscure function that's hardly ever used. Keep Python clean!

Cheers /F

Thomas Bellman

unread,

Jul 7, 1997, 3:00:00 AM7/7/97

to

In article <970707074...@arnold.image.ivab.se>,
Fredrik Lundh <fredri...@ivab.se> writes:

> Besides, I cannot recall ever using ungetc in my life. A quick grep among
> a lot of code, including a lot of stuff from the net, shows zero references
> to ungetc. So why make a simple interface more complicated by adding
> an obscure function that's hardly ever used. Keep Python clean!

ungetc() is very common in parsers. When I grepped just a few
programs from the net, i found several references to ungetc().
The Python parser code implements its own ungetc() (and names it
tok_backup()), probably since it doesn't always have a FILE* to
read the source from.

If we don't have ungetc() on file objects, it means that parsers
written in Python needs to do it's own buffer management, on top
of file objects. I wouldn't be suprised if it slowed down a
parser by a factor of two or three...

Then of course it's a question if parsers are an important enough
application for Python to warrant the extra baggage on file
objects. On that, I don't have an opinion (despite my sig).

--
Thomas Bellman, Lysator Academic Computer Club, Linköping University
"Adde parvum parvo magnus acervus erit" ! Sweden ; +46-13 177780
(From The Mythical Man-Month) ! bel...@lysator.liu.se

Donn Cave

unread,

Jul 8, 1997, 3:00:00 AM7/8/97

to

nou...@nohost.nodomain (Thomas) writes:

| In article <5pm5fv$d...@nntp6.u.washington.edu> do...@u.washington.edu (Donn Cave) writes:
|
| If I get around to writing a general purpose buffered I/O module for
| Python, I'll include an unread(), I think that would be very useful.
|

| Have you considered using (or cloning) sfio ("safe fast I/O") from
| AT&T? It looks like it's pretty well done.
|
| Of course, Tcl also has a high-quality portable buffered I/O package
| that could be used from Python and that works on UNIX, Windows, Mac,
| and OS/2.

Sfio is new to me, but it does look interesting. It solves the select()
problem by providing its own routine, which would probably work for me.
Its sfungetc() implementation is interesting - sfio has "string" streams,
and stream "stacks" that combine several streams into one, and to store
unget data it stacks a string stream on top of the file.

There are some things that would need to be fixed. I see that the current
version has added a seriously misguided feature where the SHELL environment
variable is used for sfpopen().

I tried building python against it, using the compatibility stdio.h, and
the resulting image was about the same size.

Tcl's I/O also looks interesting. I paged through some of the source
this morning. When I said ``If I get around to writing ...'', I was
kind of joking, because I never get around to anything, and here's an
example of one of the reasons - it's too much work to do anything useful.
Here's over 5K lines of code, and it wouldn't surprise me if it could
grow to 10K with some serious optimization. (fileobject.c is < 1K.)

On the other hand, assuming it does behave the same across platforms,
that's a step forward, while sfio wouldn't help much there (I don't know
how well it would work anywhere but UNIX, which is all its authors claim.)
Tcl solves the select() problem by reporting the amount of buffered data,
which isn't as easy to use as sfio's approach but is arguably more flexible.

Well, life is complicated. Maybe it would be an interesting exercise to
crank out an sfiomodule.c with support for all the interesting things,
possibly using the ExtensionClass system to plug in sfio's stream discipline
extension options for Python users who'd like to subclass an I/O stream.

George H Harth IV

unread,

Jul 13, 1997, 3:00:00 AM7/13/97

to

>ungetc() is very common in parsers. When I grepped just a few
>programs from the net, i found several references to ungetc().
>The Python parser code implements its own ungetc() (and names it
>tok_backup()), probably since it doesn't always have a FILE* to
>read the source from.

Not that it matters I suppose, but I have my parsers "primed". That is,
if I need to know what the next character is going to be I just look at the
cFutureChar variable. I never need to unget.

When I GetNextChar(), I do something like:

cCurrentChar = cFutureChar;
cFutureChar = ReadCharFromInputStream();

It does require that the parser personally allocate storage for one extra
character, but that's pretty minimal and I think simpler than the unget
approach IMHO.

--
George H Harth, IV
gha...@ideashop.com
http://www.ideashop.com
http://www.infinet.com/~gharth

Fredrik Lundh

unread,

Jul 14, 1997, 3:00:00 AM7/14/97

to

Thierry Bousch wrote in article <5qd3km$9...@linotte.republique.fr>...
>: And it's evident that the socket makefile function work perfectly fine

>: on platforms where stdio doesn't support sockets; I use it everyday.
>

>Complain to your vendor.

Huh? Why compain when it works perfectly?

> Now do the same exercise with tcdrain(). Hell, nobody is using it, we
should
> remove it from the posix module.

Better add it first, then.

/F
(not sure we're using the same language, really)

Thierry Bousch

unread,

Jul 14, 1997, 3:00:00 AM7/14/97

to

Fredrik Lundh (fredri...@ivab.se) wrote:

: And it's evident that the socket makefile function work perfectly fine
: on platforms where stdio doesn't support sockets; I use it everyday.

Complain to your vendor.

: Maybe Guido wrote his own stdio package? No, he made sure the object

: returned by makefile implemented the Python file interface.

What is the "Python file interface"?

The _fileobject defined in Lib/win/socket.py doesn't support seek(), tell(),
isatty() nor truncate(), whereas these methods are supported by the builtin
file objects. Do you consider these functions to be part of the "file"
interface?

Now you'll tell me that they would be trivial to implement on "socket file"
objects. This is certainly true, but my point is that *currently* these
methods do not exist, and nobody complained about that. So, why should you
complain if ungetc (for instance) is supported by some file objects but
not all?

: With respect to the stdio-based fileobject implementation, yes. With

: respect to everything else, no way. Adding ungetc behaviour to the
: file interface imposes additional overhead in all other file interface
: implementations.

Oh come on. Adding ungetc() is *trivial* on all buffered file implementations.
It doesn't require any change to the existing functions -- you just have to
put back the character(s) into the buffer. Here is how you'd do it for
_fileobject's in Lib/win/socket.py:

def ungetc(self, pushback):
if type(pushback) <> type(""):
raise ValueError, "ungetc: not a string"
self._rbuf = pushback + self._rbuf

Total amount of code: four lines, and it allows unlimited pushback.

: Besides, I cannot recall ever using ungetc in my life. A quick grep among

: a lot of code, including a lot of stuff from the net, shows zero references
: to ungetc. So why make a simple interface more complicated by adding
: an obscure function that's hardly ever used. Keep Python clean!

Now do the same exercise with tcdrain(). Hell, nobody is using it, we should

remove it from the posix module.

Thierry

Thierry Bousch

unread,

Jul 14, 1997, 3:00:00 AM7/14/97

to

George H Harth IV (gha...@ideashop.com) wrote:

: When I GetNextChar(), I do something like:

: cCurrentChar = cFutureChar;
: cFutureChar = ReadCharFromInputStream();

It will (obviously) work, but it doubles the number of operations necessary
to access each character.

: It does require that the parser personally allocate storage for one extra

: character, but that's pretty minimal and I think simpler than the unget
: approach IMHO.

Simpler? Maybe, if you have a monolithic parser. Otherwise, it requires
cooperation from all its components to use the "future char" as well as the
file object. Personally, I'd call this a kludge; these two elements should
be part of a single object representing the lexer state (this is what other
people have proposed, by using inheritance).

With ungetc() we could simply reuse the "file" object, without modification.

Thierry