I will use an example of badly written code to show several of the problems that arise when people
forget to take care of the details:
char *asctime(const struct tm *timeptr)
{
static const char wday_name[7][3] = {
"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat" };
static const char mon_name[12][3] = {
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec" };
static char result[26];
sprintf(result, "%.3s %.3s%3d %.2d:%.2d:%.2d %d\n",
wday_name[timeptr->tm_wday],
mon_name[timeptr->tm_mon],
timeptr->tm_mday, timeptr->tm_hour,
timeptr->tm_min, timeptr->tm_sec,
1900 + timeptr->tm_year);
return result;
}
We can see here many errors here that are typical of badly written C programs:
(1) No error checking at all. The above code will crash or corrupt memory at the slightest error in
its input. This is a common error that leads to brittle software: software that crashes too often
for anybody's taste. It can be argued that this function supposes that the specifications forbid
bad inputs, but in my opinion it is better always to provide room for errors.
(2) No provision for error return. We see that this program will always return the buffer, even in
the case of an error. Obviously for some inputs the program will crash before it reaches the return
statement, but for many inputs, it will just access undefined memory locations leading to wrong results.
(3) The use of snprintf, instead of sprintf is recommended. Sprintf doesn't allow to specify the
length of the buffer, so it is very error prone. It is interesting that this code appears in a book
about C99. C99 introduced snprintf, but the authors apparently forgot its existence.
(4) The sprintf specification is "%.3s %.3s%3d %.2d:%.2d:%.2d %d\n" but the buffer "result" is
dimensioned to 26 positions. This is wrong.
How much buffer space we would need to protect asctime from buffer overflows in the worst
case?
This is very easy to calculate. We know that in all cases, %d can't output more characters than
the maximum numbers of characters an integer can hold. This is INT_MAX, and taking into account the
possible negative sign we know that:
Number of digits N = 1 + ceil(log10((double)INT_MAX));
For a 32 bit system this is 11, for a 64 bit system this is 21.
In the asctime format specification there are 5 %d format specifications, meaning that we have as
the size for the buffer the expression:
26+5*N bytes
In a 32 bit system this is 26+55=81.
This is a worst case oversized buffer, since we have already counted some of those digits in the
original calculation, where we have allowed for 3+2+2+2+4 = 13 characters for the digits. A
tighter calculation can be done like this:
Number of characters besides specifications (%d or %s) in the string: 6.
Number of %d specs 5
Total = 6+5*11 = 61 + terminating zero 62.
The correct buffer size for a 32 bit system is 62.
The problems of buffer overflows (that can happen in all programming languages by the way, they are
not specific to C) in this case are the result of a bad calculation of the maximum buffer required.
(5) This function returns a pointer to a static buffer. This is best avoided in multithreaded
applications since all those functions can never be made reentrant
In the next installment of this series I will propose another function that fixes most problems and
retains the interface of asctime().
Aside from the potential buffer overflows that everyone here should know
about, given how voiceful you have been about this issue over the years,
this function is also a good example of code that is not thread safe.
Use of a static buffer is of course the problem. Does lcc-win support the
__thread extension to make this buffer thread specific ? This would be a
quick fix for the thread issue, yet not a complete one since the thread
specific result buffer should not be referenced beyond the end of the thread
lifetime.
Regarding the potential buffer overflow, does lcc-win produce a warning when
compiling this code ? There are many reasons to do so, and the simple
advice to use snprintf instead of sprintf would be a good start.
--
Chqrlie
While a good C programmer can avoid many unsafe constructs out of
habit, it is *far more effective* to remove them entirely from the
programmer model. (Which is what has occurred in the past 30 years of
language development.) Learning to write good C requires: 1) a mindset
which is open enough or creative enough to develop or adopt methods to
defeat the inherent fragility of C; and 2) 10 or 20 years' experience.
> We can see here many errors here that are typical of badly written C programs:
> ...
> The problems of buffer overflows (that can happen in all programming languages by the way, they are
> not specific to C)
They *cannot* happen in all languages.
Jacob, eventually you will have to face the fact that, for most tasks,
C is not an appropriate general purpose language today. If you wish to
make a personal exception and develop your own code, as securely and
robustly as you can, then go ahead; but claims like the above don't
lend you much credibility as a commentator on the diversity of general
purpose languages and three decades of progress since C's birth.
C has its place; but that is a small and shrinking subset of what
people actually do, with almost no overlap any more with general
applications programming. (I say this as a C programmer of more than
20 years' experience.)
There is NO WAY to avoid "buffer overflows" in ANY language.
Of course the consequences of buffer overflows will be different
(a dialog box showing "Index error", or a message "This program... etc")
but essentially errors of indexing tables can happen in ANY language.
Unsafe constructs can be mostly eliminated in C using the enhancements
I introduced in lccwin. Counted strings + container library
eliminate most of the reasons for buffer overflows and simplify
the language
>> We can see here many errors here that are typical of badly written C programs:
>> ...
>> The problems of buffer overflows (that can happen in all programming languages by the way, they are
>> not specific to C)
>
> They *cannot* happen in all languages.
>
Please tell me what happens in language X when you do:
table tab(50) (or the equivalent in the language syntax)
tab[8654] = 67;
Of course you will NOT randomly access some memory but you
will get an error. The same when you use the vector object
in the lcc-win container library.
Well, if you include the possiblility of bugs in the language
implementation, I suppose that is true.
> Of course the consequences of buffer overflows will be different
> (a dialog box showing "Index error", or a message "This program... etc")
> but essentially errors of indexing tables can happen in ANY language.
(snip)
> Please tell me what happens in language X when you do:
> table tab(50) (or the equivalent in the language syntax)
> tab[8654] = 67;
> Of course you will NOT randomly access some memory but you
> will get an error. The same when you use the vector object
> in the lcc-win container library.
PL/I has ON units and the SUBSCRIPTRANGE condition. If enabled
(likely not by default) it will test subscripts and execute
the specified ON unit to handle it.
In Java the test cannot be disabled. With try/catch the program
can gain control and handle it as appropriate.
-- glen
In lccwin you can setup a __try/__except block, and index the array
safely if the array accessing function throws an exception...
Actually lccwin is the same, but in "extended" C. This extensions
are available in MSVC also.
Obviously I am not satisfied with the state of C right now. The objective
of the extensions was precisely to show how C could be improved.
Yes, I mean you can avoid their *consequences*, which is, after all,
the important objective.
> (a dialog box showing "Index error", or a message "This program... etc")
> but essentially errors of indexing tables can happen in ANY language.
The consequences in C are arbitrarily catastrophic, including
segfaults, silent corruption of heap or arbitrary data structures, and
(most notoriously) security holes. Nearly all modern language runtimes
eliminate ALL those side-effects of C buffer overflows. I would even
consider this part of the meaning of "modern". (The only remaining
risk is an implementation bug, as glen points out.)
Of course you can still make addressing errors in other runtimes, but
rather than a silent and/or corrupting effect, you will see a benign
runtime error which can be debugged.
Would you argue against memory protection, since, "with sufficient
care," the programmer can simply not write code that violates process
space? Yet we *have* memory protection and prefer to see a segfault
than silent writing into another process' space. Runtime protections
extend this concept since programmers are never infallible.
Furthermore: debugging bugs in usual C-style memory micromanagement is
difficult, time consuming, and extremely costly - unless the
programmer has sufficient experience to protect himself with useful
abstractions - and IMHO those less experienced should protect
themselves by using environments where these failure modes have been
eliminated.
For example, debug the heap corruption bug(s) in this code from the
real world (not mine):
http://www.telegraphics.com.au/~toby/freesearch.tar.bz2
Then rewrite in a modern language and compare...
> Unsafe constructs can be mostly eliminated in C using the enhancements
> I introduced in lccwin. Counted strings + container library
> eliminate most of the reasons for buffer overflows and simplify
> the language
I thought we were discussing C. Add enough parachutes, eliminate
unsafe constructs - pointers in particular - and you have firstly a
more modern runtime that is safer to use, and secondly, it's not C. My
criticisms are directed at C.
> >> We can see here many errors here that are typical of badly written C programs:
> >> ...
> >> The problems of buffer overflows (that can happen in all programming languages by the way, they are
> >> not specific to C)
>
> > They *cannot* happen in all languages.
>
> Please tell me what happens in language X when you do:
>
> table tab(50) (or the equivalent in the language syntax)
> tab[8654] = 67;
>
> Of course you will NOT randomly access some memory but you
> will get an error. The same when you use the vector object
> in the lcc-win container library.
That is by design. It is easier to debug a reliably generated runtime
error than the sinister failure modes of C bugs as enumerated above.
Some time ago, I was considering the possibility of a C compiler
targeting JVM. With the Java protections included, that pretty
much avoids the problems. It is, then, not the C language but the
popular implementations that are the problem. (I haven't verified
that it could be done yet, though.) Note that the C language standard
makes going outside arrays illegal, so all those programs are not
valid in standard C.
> Of course you can still make addressing errors in other runtimes, but
> rather than a silent and/or corrupting effect, you will see a benign
> runtime error which can be debugged.
Maybe it is Fortran's fault. Is that where the idea of not checking
array accesses came from?
> Would you argue against memory protection, since, "with sufficient
> care," the programmer can simply not write code that violates process
> space? Yet we *have* memory protection and prefer to see a segfault
> than silent writing into another process' space. Runtime protections
> extend this concept since programmers are never infallible.
> Furthermore: debugging bugs in usual C-style memory micromanagement is
> difficult, time consuming, and extremely costly - unless the
> programmer has sufficient experience to protect himself with useful
> abstractions - and IMHO those less experienced should protect
> themselves by using environments where these failure modes have been
> eliminated.
Many years ago, while running OS/2 on an 80286 machine, I had my
program set up to allocate a segment selector from OS/2 with the
appropriate limit. That meant hardware trap going outside an
array for either read or write. It helped much in debugging.
> For example, debug the heap corruption bug(s) in this code from the
> real world (not mine):
> http://www.telegraphics.com.au/~toby/freesearch.tar.bz2
> Then rewrite in a modern language and compare...
(snip)
-- glen
> Please tell me what happens in language X when you do:
> table tab(50) (or the equivalent in the language syntax)
> tab[8654] = 67;
Doing the equivalent in one of my languages does nothing special; you just
get an array of 8653 void values, followed by a 67 value.
That's because the table has a flexible upper (but not lower) bound:
Assigning to index -8654 will generate a bounds error (but in language Y,
the lower bound will just be moved).
Assuming to index 2000000000 will generate a memory allocation error.
In general many languages behave more sensibly than C.
--
Bartc
You can do this with lccwin and the arraylist object, but it is normally
disabled.
> Assigning to index -8654 will generate a bounds error (but in language
> Y, the lower bound will just be moved).
>
You see?
> Assuming to index 2000000000 will generate a memory allocation error.
>
I hope so...
> In general many languages behave more sensibly than C.
>
C can behave that way if you want. At least lccwin can.
> (1) No error checking at all.
True. 25 years later programmers now know better (I hoped so.)
> (2) No provision for error return.
Sorry, but you completely missed the point here.
This very remark, along with the 5th, are against the specifications,
not against code writing as are the other ones. The distinction is
important, but it might get you a good point: buffer overflows are
accidents, and as such happen because of a conjunction of factors. And
_both_ implementations _and_ specifications might be held for
responsible. Yet when it comes to code writing, it is quite too late to
argue against one determinate function specifications: if you do not
like it, just avoid it and use another: in C there are various ways to
code something.
If we look a bit into history, the '84 Standard library was provided
because of compatibility with (then) existing code, and out of a clear
requisite from the (then) users of the language (K&R only specifies the
language itself, what is now the 6 clause.)
asctime() was provided in the Research implementations as a quick hack
to provide a defined output for a timestamp, and was retained in the
Standard based on actual practice and widespread use. In the 1986
version (and all subsequent) of the Standard, there is a proposed
alternative, namely strftime(), which might fit better your needs.
Or not.
Furthermore in this particular case, assuming correct input, I fail to
see the need for some error return. If you feel _every_ C function
should have an explicit, additional, error *return* path, well, ehem, I
am not sure your idea of the language is the most prevalent: at least, I
fail to see it accepted everywhere.
Finally, according to the Standard, NULL might be an acceptable return
value (the out-of-range values are widely recognized as undefined
behaviour cases, see DR217 for the details.) And it definitively looks
like an error indicator in this case; of course, not much uses of
asctime() guard themselves against NULL returned value; but how many
expect the string to be up-to-62-char long?
> (3) The use of snprintf, instead of sprintf is recommended.
In 1990? <grin>
I concur that if you hire a programmer in 2010 and he gives you the
quoted code, he definitevely has bad habits. Fire him!
Now, if you are believing the code that is written in the C Standard is
to be cut-and-pasted into an implementation, I very much disagree with
you. The C Standard is a *specification*. It describes how the
implementation should behave, NOT how you as an implementer should make
it working. And it purposely uses sprintf, in a way to quickly and
clearly describe the behaviour: for example, emphazing that the output
string should be 25+1-character long, terminated with a \n, with the
hour number at positions 11 and 12.
And there are code out there which relies on that.
In fact, I would rather recommend using a simple ito2a conversion to
deal with the integer-to-character conversions (like was done in V7,
with range-checking added): this way spares the printf() overhead, and
most importantly preserves the intent of the specifications,
particularly about the layout of the returned string.
> (4) The sprintf specification is "%.3s %.3s%3d %.2d:%.2d:%.2d %d\n" but
> the buffer "result" is dimensioned to 26 positions. This is wrong.
I disagree.
Please remember he *tm data is supposed to be a Gregorian calendar date;
as such, there are up to 61 seconds in a minute, and only positive
values, up to 60 minutes in an hour and only positive values etc. The
tm_year number (offset by 1900) should be between -318 and some upper
limit you could imagine, but 1000 looks overwhelming to me.
If you argue about out-of-range values, you might also check the
coherency of the datas, like whether tm_wday matches the Gregorian date,
whether tm_mday is in range for the given month and year etc. BUT these
are NOT the specifications for asctime(): asctime() is here to show a
brocken-down calendar time as a 25+1-character ASCII string ending with
\n; if the calendar time is garbage, one solution might be to output ***
(like done usually in production environements) thus preserving the
apparent specifications, rather than to strech them (arbitrarily.)
> How much buffer space we would need to protect asctime from buffer
> overflows in the worst case?
If you use snprintf, 0. Or am I missing something? Belt and suspenders?
> This is very easy to calculate.
I am not sure it is that easy; for example your explanations are not
deterministic, you had to special case the architecture width. It is
certainly possible to do better computations using sizeof(int)*CHAR_BIT
*31/100 +1 +1 (since you are apparently looking after an upper limit,
not the "right" value.)
> We know that in all cases, %d can't output more characters than
> the maximum numbers of characters an integer can hold.
If you are after out-of-range values, you better have to guard yourself
against out-of-interval values for the indices tm_wday and tm_mon...
And once you code against those two intervals, why not coding against
the other ones?
> (5) This function returns a pointer to a static buffer. This is best
> avoided in multithreaded applications since all those functions can
> never be made reentrant
Again, the specification for the asctime() is to return a static object,
which allows to greatly simplify the memory management of lightweight
applications. Bigger ones always can defer to strftime(), but it comes
at the cost of requiring memory mangement, there is no free lunch.
If your library implementation is to be run on a multi-threaded
environment, you certainly can return a TLS object, which will avoid
some forms of multithread concurrency, while still being a "static"
object when considered from the Standard's point of view: all shared
library implementation do this, so it is hardly rocket science.
If your implementation provides good garbage collection, you could
alternatively return an allocated object, which would then be different
on every invocation of asctime() (note the "may overwrite" in 7.23.3p1,
which is not an absolute requirement.)
Antoine
On the other hand, an implementation certainly *could* use the
exact code from the standard, and I'm not convinced that there's
much reason not to do so. It works correctly (by definition) in all
cases for which the behavior is defined, and it's the programmer's
responsibility to avoid the cases in which the behavior is undefined.
Of course implementers are free to add whatever error checking
they like, as long as the defined behavior remains the same.
I'd say that the *only* reason asctime() is still in the standard
is to cater to old code. The vast majority of existing asctime()
calls probably use the current time, which won't be a problem for
several thousand years. strftime() is clearly more flexible, though
it's slightly more difficult to use. My personal preference would
be to deprecate it rather than fix it. Apart from any problems
with undefined behavior, its output format is US-centric, and the
trailing newline is just silly.)
The latest C201X draft does add some additional normative wording,
explaining and slightly expanding the argument values for which its
behavior is undefined. (For example, in C90 and C99, tm_min==99
has well defined behavior; in the C201X, it's UB.)
> In fact, I would rather recommend using a simple ito2a conversion to
> deal with the integer-to-character conversions (like was done in V7,
> with range-checking added): this way spares the printf() overhead, and
> most importantly preserves the intent of the specifications,
> particularly about the layout of the returned string.
I doubt that any uses of asctime() are performance-critical. But
sure, if you can avoid the overhead of sprintf(), that's not a bad
thing.
[...]
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"