fgets - design deficiency: no efficient way of finding last character read


John Reye

Apr 11, 2012, 12:43:17 PM
Hello,

The last character read from fgets(buf, sizeof(buf), inputstream) is:
'\n'
OR
any character x, when no '\n' was encountered in sizeof(buf)-1
consecutive chars, or when x is the last char of the inputstream

***How can one EFFICIENTLY determine if the last character is '\n'??
"Efficiently" means: don't use strlen!!!

I only come up with the strlen method, which - to me - says that fgets
has a bad design.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char buf[6];
    FILE *fp = stdin;
    while (fgets(buf, sizeof(buf), fp)) {
        printf((buf[strlen(buf)-1] == '\n')
                   ? "Got a line which ends with newline: %s"
                   : "no newline: %s",
               buf);
    }
    return EXIT_SUCCESS;
}



A well-designed fgets function should return the length of characters
read, should it not??

Please surprise me, that there is a way of efficiently determining the
number of characters read. ;)
I've thought of ftell, but I think that does not work with stdin.

Because right now, I think that fgets really seems useless.
Why is the standard C library so inefficient?
Do I really have to go about designing my own library? ;)

Thanks for tips and pointers

Regards,
J.

Rupert Swarbrick

Apr 11, 2012, 1:43:07 PM
John Reye <jono...@googlemail.com> writes:
> A well-designed fgets function should return the length of characters
> read, should it not??
>
> Please surprise me, that there is a way of efficiently determining the
> number of characters read. ;)
> I've thought of ftell, but I think that does not work with stdin.

I'm intrigued. What application do you have where you read extremely
long lines from stdin using fgets? This seems an odd thing to do: I
can't think of any text-based formats where lines are extremely
long. For binary formats, use fread and (oh, look!):

FREAD(3)

...

RETURN VALUE
fread() and fwrite() return the number of items successfully read
or written (i.e., not the number of characters). If an error
occurs, or the end-of-file is reached, the return value is a
short item count (or zero).


It seems that the standard library isn't so badly designed after all...

<snip>
> Do I really have to go about designing my own library? ;)

No.


Rupert

Ben Pfaff

Apr 11, 2012, 1:53:48 PM
Rupert Swarbrick <rswar...@gmail.com> writes:

> I'm intrigued. What application do you have where you read extremely
> long lines from stdin using fgets? This seems an odd thing to do: I
> can't think of any text-based formats where lines are extremely
> long.

It's fairly common for machine-generated HTML and XML (which are
text-based formats) to be single, very-long lines.

John Reye

Apr 11, 2012, 1:59:08 PM
On Apr 11, 7:43 pm, Rupert Swarbrick <rswarbr...@gmail.com> wrote:
> I'm intrigued. What application do you have where you read extremely
> long lines from stdin using fgets?
Actually I was using fgets, to read into a buffer. If the buffer is
not large enough to fit an entire line (i.e. one including '\n'), I
doubled the buffer and read the remaining chars. (stdin is just an
example that shows that I cannot abuse ftell to determine the length
read... you know: ftell-after-fgets minus ftell-before-fgets).

I thought fgets would be a good function to use, since it
automatically stops when it encounters '\n'.
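(Editorial note: the buffer-doubling approach John describes can be sketched roughly as follows. This is a minimal sketch, not code from the thread; the helper name `read_full_line` and the starting capacity are invented for illustration. It still pays the strlen() cost John objects to, which is exactly his complaint.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one full line from fp, growing the buffer as needed.
 * Returns a malloc'd string (caller frees), or NULL on EOF or
 * allocation failure. Hypothetical helper, named for this sketch. */
char *read_full_line(FILE *fp)
{
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);
    if (!buf) return NULL;

    while (fgets(buf + len, (int)(cap - len), fp)) {
        len += strlen(buf + len);      /* the cost the OP objects to */
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                /* got a complete line */
        if (len + 1 == cap) {          /* buffer was filled: double it */
            char *tmp = realloc(buf, cap * 2);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        } else {
            return buf;                /* EOF hit, line has no '\n' */
        }
    }
    if (len > 0) return buf;           /* final line without '\n' */
    free(buf);
    return NULL;
}
```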

>        fread() and fwrite() return the number of items successfully read
>        or written (i.e., not the number of characters).  If an error
>        occurs, or the end-of-file is reached, the return value is a
>        short item count (or zero).

Yes... probably fread is a better way of handling it!
I want a buffer to hold the complete line, and then continue reading
lines.

***What is more efficient?
If I use fread, I'll probably overshoot beyond the '\n'.
Is it more efficient to rewind via fseek and fread the overshoot into
the beginning of the buffer;
OR is it more efficient to copy the overshoot to the beginning of the
buffer and then fread the remainder?

Thanks.
J.

John Reye

Apr 11, 2012, 2:09:52 PM
On Apr 11, 7:53 pm, Ben Pfaff <b...@cs.stanford.edu> wrote:
> Rupert Swarbrick <rswarbr...@gmail.com> writes:
> It's fairly common for machine-generated HTML and XML (which are
> text-based formats) to be single, very-long lines.

Correct, but I would not read those huge lines, because the '\n' is
not the logical divider.

However, I want a nice routine (which I have to code myself) that
uses realloc to adjust a buffer to fit everything up to the '\n'.
The C standard lib does not have anything like this - so I have to code
it myself.

I bet C++ has something useful that one could use. It seems that many
went into C++, to make it the huge bloated monster that it is! But
still seems worth a look, to relieve me from having to handle this
stuff at the basic level. Alternative: I need to develop my own
library of useful C routines.

pete

Apr 11, 2012, 2:17:34 PM
It is not uncommon for programmers to write
their own getline() function.

http://www.mindspring.com/~pfilandr/C/get_line/
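(Editorial note: POSIX.1-2008 later standardized a getline(3) with exactly this shape, and it returns the byte count the OP is asking for. It is not ISO C, so this sketch assumes a POSIX system such as glibc; the demo function name is invented here.)

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Print each line's length using POSIX getline(3), which returns the
 * number of bytes read directly -- no strlen() needed. */
void show_line_lengths(FILE *fp)
{
    char *line = NULL;
    size_t cap = 0;
    ssize_t n;                        /* chars read, '\n' included */

    while ((n = getline(&line, &cap, fp)) != -1)
        printf("%zd chars, %s trailing newline\n", n,
               (n > 0 && line[n - 1] == '\n') ? "with" : "without");
    free(line);                       /* getline mallocs/reallocs */
}
```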

--
pete

Keith Thompson

Apr 11, 2012, 2:49:14 PM
Have you measured the performance cost of calling strlen()?

I haven't done so myself, so the following is largely speculation,
but I strongly suspect that the time to call strlen() is going to
be *much* less than the time to read the data. Yes, an fgets-like
function could return additional information, either the length
of the string or a pointer to the end of it, and that would save a
little time, but I'm not convinced it would be a significant benefit.

And there would be some small but non-zero overhead in returning the
extra information. In a lot of cases, the caller isn't going to use
that information (perhaps it's going to traverse the string anyway).

*Measure* before you decide that fgets is "useless".

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Kaz Kylheku

Apr 11, 2012, 3:13:26 PM
On 2012-04-11, John Reye <jono...@googlemail.com> wrote:
> ***How can one EFFICIENTLY determine if the last character is '\n'??
> "Efficiently" means: don't use strlen!!!

There is no way to know where the last character of a string is if you
do not know the length explicitly, or else implicitly (scan the string
looking for the null terminator).

> I only come up with the strlen method, which - to me - says that fgets
> has a bad design.

The newline can be missing only in two situations. One is that the buffer isn't
large enough to hold the line. In that case, some non-newline character is
written into the next-to-last element of the buffer and a null terminator
into the last element. If you set the next-to-last byte to zero before
calling fgets, you can detect that this situation has happened by finding
a non-zero byte there.

The second situation is that the last line of the stream has been read,
but fails to be newline terminated.

If you want to detect this situation, you only need to check whether
end-of-file has been reached. That is to say, keep calling fgets until it
returns NULL. Then go back to the most recently retrieved line and check
whether the newline is there or not, with the help of strlen, or
strchr(line, '\n'), etc.

So as you can see, you don't have to scan every single line.
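(Editorial note: the check-only-at-EOF idea might look like this in code. A sketch, not from the thread; the tiny 6-byte buffer just mirrors the OP's example, and the function name is invented. Only one strchr() call happens, after the whole stream is consumed.)

```c
#include <stdio.h>
#include <string.h>

/* Consume the stream chunk by chunk with fgets, remembering only the
 * most recent chunk. Return 1 if the final line of the stream lacked a
 * terminating '\n', 0 otherwise (including an empty stream). */
int final_line_missing_newline(FILE *fp)
{
    char buf[6];
    char last[6] = "";

    while (fgets(buf, sizeof buf, fp))
        memcpy(last, buf, sizeof buf);   /* remember most recent chunk */

    /* the single scan: no per-line strlen/strchr was needed */
    return last[0] != '\0' && strchr(last, '\n') == NULL;
}
```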

James Kuyper

Apr 11, 2012, 4:36:22 PM
On 04/11/2012 12:43 PM, John Reye wrote:
> Hello,
>
> The last character read from fgets(buf, sizeof(buf), inputstream) is:
> '\n'
> OR
> any character x, when no '\n' was encountered in sizeof(buf)-1
> consecutive chars, or when x is the last char of the inputstream
>
> ***How can one EFFICIENTLY determine if the last character is '\n'??

That's relatively easy - so long as you don't need to know where the
'\n' is.

> "Efficiently" means: don't use strlen!!!
>
> I only come up with the strlen method, which - to me - says that fgets
> has a bad design.

The following approach uses strchr() rather than strlen(), so it
technically meets your specification. However, I presume you would have
the same objections to strchr() as you do to strlen(). I'd like to point
out, however, that it uses strchr() only once per file, which seems
efficient enough for me. If you're doing so little processing per file
that a single call to strchr() per file adds significantly to the total
processing load, I'd be more worried about the costs associated with
fopen() and fclose() than those associated with strchr().

The key point is that a successful call to fgets() can fail to read in
an '\n' character only if fgets() meets the end of the input file, or
the end of your buffer, both of which can be checked for quite
efficiently. If it reaches the end of your buffer, there's one and only
one place where the '\n' character can be, if one was read in.
Therefore, it's only at the end of the file that a search is required.

> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main(int argc, char *argv[])
> {
> char buf[6];
> FILE *fp = stdin;

buf[(sizeof buf)-1] = 1; // any non-zero value will do.

> while (fgets(buf, sizeof(buf), fp)) {
        const char *prefix =
            (buf[(sizeof buf)-1] == '\0' && buf[(sizeof buf)-2] != '\n'
            || feof(fp) && !strchr(buf, '\n')) ? "no " : "";

        printf("Got a line which ends with %snewline: %s\n",
            prefix, buf);

        buf[(sizeof buf)-1] = 1;
> }
>
>
> return EXIT_SUCCESS;
> }
>
>
>
> A well-designed fgets function should return the length of characters
> read, should it not??
>
> Please surprise me, that there is a way of efficiently determining the
> number of characters read. ;)
> I've thought of ftell, but I think that does not work with stdin.
>
> Because right now, I think that fgets really seems useless.
> Why is the standard C library so inefficient?

Measure the inefficiency before deciding whether or not it's useless.
You may be surprised.

> Do I really have to go about designing my own library? ;)

You don't need an entire library; a function equivalent to fgets() that
calls getc() and provides the information you're looking for wouldn't be
too difficult to write, and should compile fairly efficiently.
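(Editorial note: such a replacement might be sketched like this. Not a drop-in: the name `user_fgets`, the `long` return type, and the -1 convention are choices made for this sketch, not anything from the standard library.)

```c
#include <stdio.h>

/* fgets-like reader that returns the number of characters stored
 * (excluding the '\0'), or -1 if nothing could be read. Assumes
 * size > 1, like the typical fgets usage in this thread. */
long user_fgets(char *buf, int size, FILE *fp)
{
    long n = 0;
    int c;

    while (n < size - 1 && (c = getc(fp)) != EOF) {
        buf[n++] = (char)c;
        if (c == '\n')          /* stop after the newline, like fgets */
            break;
    }
    buf[n] = '\0';
    if (n == 0)
        return -1;              /* EOF or error before any character */
    return n;                   /* caller gets the length for free */
}
```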

Rupert Swarbrick

Apr 11, 2012, 4:39:26 PM
Thanks for this. It neatly uses the O(1) access at the end of the string
and gets around the OP's problem brilliantly. I like it!

Rupert

John Reye

Apr 11, 2012, 6:14:26 PM
On Apr 11, 9:13 pm, Kaz Kylheku <k...@kylheku.com> wrote:
thanks for your comment

On Apr 11, 10:36 pm, James Kuyper <jameskuy...@verizon.net> wrote:
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
>
> > int main(int argc, char *argv[])
> > {
> >   char buf[6];
> >   FILE *fp = stdin;
>
>     buf[(sizeof buf)-1] = 1;    // any non-zero value will do.
>
> >   while (fgets(buf, sizeof(buf), fp)) {
>
>         const char *prefix =
>             (buf[(sizeof buf)-1] == '\0' && buf[(sizeof buf)-2] != '\n'
>             || feof(fp) && !strchr(buf, '\n')) ? "no " : "";
>
>         printf("Got a line which ends with %snewline: %s\n",
>             prefix, buf);
>
>         buf[(sizeof buf)-1] = 1;
>
> >   }
>
> >   return EXIT_SUCCESS;
> > }


Thanks for that! It's really good! :)


>
>
> > Do I really have to go about designing my own library? ;)
>
> You don't need an entire library; a function equivalent to fgets() that
> calls getc() and provides the information you're looking for wouldn't be
> too difficult to write, and should compile fairly efficiently.

Hmmm... I think fread() is more efficient than continuous getc().

Does this make sense?

For some context:
I think that when writing a getline function (that uses realloc)...
i.e. size_t getline(char **ptr_to_inner_buf, FILE *fp) ... where
ptr_to_inner_buf is set to an internal buffer that holds bytes until
'\n', or any char x if EOF...

then realizing that getline function by repeatedly calling getc() is
less efficient THAN using fread to get a number of bytes, scan for
'\n' and place a '\0' in the following byte. Before the next call to
fread, I could scan any overshoot (beyond '\n'... putting back the
char overwritten by '\0' via a tmp) for '\n' and if I find it... again
set '\0' and adjust ptr_to_inner_buf (see function declaration).
Otherwise I copy the overshoot to the very beginning of the buffer,
and fread the delta needed to fill the entire buffer.
If no '\n' in the buffer, I realloc and fread the delta. Etc. etc.

So is fread() more efficient than continuous getc()? Or am I wrong?
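(Editorial note: of the two variants John asks about, "copy the overshoot" is the portable one, since fseek cannot move backwards on pipes or stdin. A rough sketch of that variant follows; it uses a fixed-size static buffer for brevity, omits the realloc growth step John describes, and all names are invented here.)

```c
#include <stdio.h>
#include <string.h>

enum { CAP = 64 };

static char buf[CAP];
static size_t have = 0;               /* bytes currently buffered */

/* Copy the next '\n'-terminated line (newline stripped) into 'line'
 * (at least CAP+1 bytes). Returns 1 on success, 0 at EOF. Lines longer
 * than CAP are truncated in this simplified sketch. */
int next_line(FILE *fp, char *line)
{
    for (;;) {
        char *nl = memchr(buf, '\n', have);
        if (nl) {
            size_t len = (size_t)(nl - buf);
            memcpy(line, buf, len);
            line[len] = '\0';
            have -= len + 1;
            memmove(buf, nl + 1, have);  /* overshoot to the front */
            return 1;
        }
        if (have == CAP || feof(fp) || ferror(fp)) {
            if (have == 0)
                return 0;                /* nothing left */
            memcpy(line, buf, have);     /* final unterminated line */
            line[have] = '\0';
            have = 0;
            return 1;
        }
        have += fread(buf + have, 1, CAP - have, fp);
    }
}
```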

Thanks.

John Reye

Apr 11, 2012, 7:07:37 PM
Even though James Kuyper showed a nice way of determining if the
string contains '\n', I still feel that fgets has a RETURN VALUE that
simply shouts "deficiency!".

char * fgets ( char * str, int num, FILE * stream );
Return Value
On success, the function returns the same str parameter. etc.

Why on earth return an identical pointer most of the time???
Returning a count of the number of bytes read would have been a far
better choice for the return value, wouldn't it?

James Kuyper

Apr 11, 2012, 7:19:44 PM
It's not clear to me that what you're saying is any different between an
implementation-written implementation of fgets() and a user-written
user_fgets() replacement function that makes repeated calls to getc().
They both have to do pretty much the same things you mentioned. It is
true that fgets() could take advantage of OS-specific features that a
portable user_fgets() could not; but I didn't recognize any suggestion
of that possibility in what you were saying.

> So fread() more efficient than continous getc(). Or am I wrong?

"The byte input functions" ( fgets, fread, fscanf, getc, getchar, scanf,
vfscanf, and vscanf - 7.21.1p5) "read characters from the stream as if
by successive calls to the fgetc function." (7.21.3p11)

The reason why the fgetc() function and the getc() function-like macro
both exist is because getc() can eliminate the function call overhead
nominally associated with fgetc(). I say "nominally" because
sufficiently aggressive optimizer that is closely integrated with the C
standard library could remove that overhead even when using fgetc().

Typical implementations of getc() basically just move a pointer though a
buffer, triggering buffer refills when needed. As long as the file is
buffered, all the complicated stuff happens only during the refills.
Off-hand, I'd expect user_fgets() to be able to achieve similar
performance to that of fgets(), at least when reading buffered streams.
The execution time should be dominated by the calls to the OS-specific
function which actually fills the buffer, and the total number of such
calls should be the same in either case.

If it matters, I suppose you could try testing it. user_fgets()
shouldn't be very difficult to write; I might try it myself in the
unlikely event that I get enough spare time anytime soon.

James Kuyper

Apr 11, 2012, 7:24:21 PM
Many of the C standard library functions would have been more useful if
they'd returned a pointer to the end of a string or buffer, rather than
to its beginning. I chalk it up to inexperience (with C, that is) by the
people who invented C. A decent respect for the need to retain backwards
compatibility means that we can't undo those bad design decisions - but
that doesn't prevent the creation of new functions with similar
functionality and a more useful return value.

BartC

Apr 11, 2012, 7:44:54 PM


"John Reye" <jono...@googlemail.com> wrote in message
news:c7e4f460-42ee-4f6c...@w5g2000vbp.googlegroups.com...
You're right. A quick test reading files using fgets() showed that a
following strlen() was adding 10-15% to runtime.

This is for a program doing nothing else except reading all the lines, and
for files already cached in memory, so the overheads will be less in real
programs, especially for mainly small files as line-oriented files tend to
be. So it's not that big a deal. And you can easily write your own version.

--
Bartc

William Ahern

Apr 11, 2012, 7:47:18 PM
The designer(s) of fgets() may have been backward looking instead of forward
looking; not intent on making a composable routine--which works well with ad
hoc buffer parsing code--but rather one which works conveniently with the
pre-existing string routines--i.e. read a string then pass that string to
some other string routine which will lazily determine string length while
processing it.

BartC

Apr 11, 2012, 8:23:40 PM
"William Ahern" <wil...@wilbur.25thandClement.com> wrote in message
news:6phh59-...@wilbur.25thandClement.com...
Except that fgets() can return NULL on error. That makes it harder to use
the return value unchecked.

--
Bartc

William Ahern

Apr 11, 2012, 8:47:01 PM
BartC <b...@freeuk.com> wrote:
> "William Ahern" <wil...@wilbur.25thandClement.com> wrote in message
> news:6phh59-...@wilbur.25thandClement.com...
<snip>
> > The designer(s) of fgets() may have been backward looking instead of
> > forward looking; not intent on making a composable routine--which works
> > well with ad hoc buffer parsing code--but rather one which works
> > conveniently with the pre-existing string routines--i.e. read a string
> > then pass that string to some other string routine which will lazily
> > determine string length while processing it.

> Except that fgets() can return NULL on error. That makes it harder to use
> the return value unchecked.

I didn't mean _literally_ read and pass to another routine, without checking
the return value. But, point taken. Although, IME the typical usage is
`while (fgets()) { ... }', which makes it convenient to use in the very
common cases where I/O errors are ignored or treated similarly to EOF.

Eric Sosman

Apr 11, 2012, 10:50:10 PM
On 4/11/2012 12:43 PM, John Reye wrote:
> Hello,
>
> The last character read from fgets(buf, sizeof(buf), inputstream) is:
> '\n'
> OR
> any character x, when no '\n' was encountered in sizeof(buf)-1
> consecutive chars, or when x is the last char of the inputstream
>
> ***How can one EFFICIENTLY determine if the last character is '\n'??
> "Efficiently" means: don't use strlen!!!

Kaz' method is pretty slick. However, the time for strlen() is
likely to be insignificant compared to the time for the I/O itself.

> A well-designed fgets function should return the length of characters
> read, should it not??

IMHO that would be a more useful return value than the one fgets()
actually delivers, but this is scarcely the only unfortunate choice
to be found in the Standard library. For example, strcpy() and strcat()
"know" where their output strings end and could return that information
instead of echoing back a value the caller already has. In another
thread we've just rehashed the gotchas of <ctype.h> for the umpty-
skillionth time. No doubt other folks have their own pet peeves.

Tell me, though: Are you using a QWERTY keyboard, despite all its
drawbacks? Legend[*] has it that QWERTY was chosen *on purpose* to
slow down typists in the days when too much speed led to mechanical
jams. On today's keyboards that's not a problem -- So, are you still
using a nineteenth-century keyboard layout? If so, ponder your reasons
for not changing to something more modern, and see if those reasons
shed any light on why people still put up with the Standard Warts And
All Library.

[*] Wikipedia disputes the legend, but a Wikipedia page is only
as good as its most recent editor.

--
Eric Sosman
eso...@ieee-dot-org.invalid

lawrenc...@siemens.com

Apr 12, 2012, 9:57:39 AM
Eric Sosman <eso...@ieee-dot-org.invalid> wrote:
>
> Legend[*] has it that QWERTY was chosen *on purpose* to
> slow down typists in the days when too much speed led to mechanical
> jams. On today's keyboards that's not a problem -- So, are you still
> using a nineteenth-century keyboard layout? If so, ponder your reasons
> for not changing to something more modern, and see if those reasons
> shed any light on why people still put up with the Standard Warts And
> All Library.
>
> [*] Wikipedia disputes the legend, but a Wikipedia page is only
> as good as its most recent editor.

Perhaps the best way to describe it is that the layout was chosen to
maximize speed given the mechanical limitations of the device. Typing
faster doesn't help if you constantly have to stop to clear jams. Think
of it as managing response time to optimize throughput. :-)
--
Larry Jones

I'm getting disillusioned with these New Years. -- Calvin

Nobody

Apr 12, 2012, 9:49:06 PM
On Wed, 11 Apr 2012 15:14:26 -0700, John Reye wrote:

> So fread() more efficient than continous getc(). Or am I wrong?

Maybe, maybe not. getc() is allowed to be implemented as a macro, so
a getc() loop could end up as little more than memcpy().

However: if the C library is thread-safe (which may be a compiler option),
it will end up locking the stream for each call, which will definitely be
worse than a single fread().

In GNU libc 1.x, getc was a light-weight macro. This changed in 2.x due to
thread safety, but it has _unlocked versions of many of the stdio
functions, e.g. fgetc_unlocked:

// libio.h:

#define _IO_getc_unlocked(_fp) \
(_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) \
? __uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++)

// bits/stdio.h:

# ifdef __USE_MISC
/* Faster version when locking is not necessary. */
__STDIO_INLINE int
getc_unlocked (FILE *__fp)
{
return _IO_getc_unlocked (__fp);
}
# endif /* misc */

With the right switches (e.g. disabling thread safety or
-Dgetc=getc_unlocked) and sufficient optimisation, a getc() loop could
realistically be limited by memory bandwidth.

Nobody

Apr 12, 2012, 10:08:01 PM
On Wed, 11 Apr 2012 22:50:10 -0400, Eric Sosman wrote:

> Tell me, though: Are you using a QWERTY keyboard, despite all its
> drawbacks? Legend[*] has it that QWERTY was chosen *on purpose* to
> slow down typists in the days when too much speed led to mechanical
> jams. On today's keyboards that's not a problem -- So, are you still
> using a nineteenth-century keyboard layout?

A related issue (which clearly isn't legend) is that nearly all computer
keyboards still have the staggered layout of a mechanical typewriter.

And unlike a completely different layout, eliminating the stagger would be
a fairly minor incompatibility (you'd still be using the same finger for
each letter).

Ben Pfaff

Apr 12, 2012, 11:31:22 PM
Eric Sosman <eso...@ieee-dot-org.invalid> writes:

> Tell me, though: Are you using a QWERTY keyboard, despite all its
> drawbacks? Legend[*] has it that QWERTY was chosen *on purpose* to
> slow down typists in the days when too much speed led to mechanical
> jams. On today's keyboards that's not a problem -- So, are you still
> using a nineteenth-century keyboard layout? If so, ponder your reasons
> for not changing to something more modern, and see if those reasons
> shed any light on why people still put up with the Standard Warts And
> All Library.

The same topic came up here back in 2002. Here's a new copy of
what I posted back then:

Have you used a mechanical typewriter? I have. These things have
an array of letterforms on spokes[1] arranged in a half-circular
pattern in the body of the typewriter. When you hit a key, one of
them lunges forward to the place where the letter should go (the
"cursor position") and strikes the paper through the ribbon.

Now, if there's only of these spokes in motion, there's no
problem. But there's a mutual exclusion problem: if more than one
of them is in motion at once, e.g., one going out and another
coming back, then they'll hit one another and you'll have to take
a moment to disentangle them by hand, which is annoying and
possibly messy. It's a race condition that you will undoubtedly
be bitten by quickly in real typing.

The problem is exacerbated if the letterforms for common digraphs
have adjacent spokes. This is because the closer two spokes are,
the easier they can hit one another: if the spokes are at
opposite ends of the array, then they can only hit at the point
where they converge at the cursor, but if they are adjacent then
they'll hit as soon as they start moving.

One solution, of course, is to introduce serialization through
use of locking: allow only one key to be depressed at a
time. Unfortunately, that reduces parallelism, because many
digraphs that you want to type in the real world do not have
adjacent spokes, even if you just put the keys in alphabetical
order.

The adopted solution, of using a QWERTY layout, is not a real
solution to the problem. Instead, it reduces the chances of the
race condition by putting keys for common digraphs, and therefore
their spokes, far away from each other. You can still jam the
mechanism and have to untangle the spokes, but it happens less
often, at least for English text. This in fact helps you to type
*faster*, not slower, because you don't have to stop so often to
deal with jammed-together spokes.

To conclude: mechanical QWERTY typewriters are at the same time
an example of optimization for the common case and inherently
flawed because of the remaining race condition. This is a great
example of a tradeoff that you should not make when you design a
program!

[1] I don't know any of the proper vocabulary here. I was about 8
years old when I used the one we had at home, and it was thrown
out as obsolete soon after.

santosh

Apr 13, 2012, 3:24:22 AM
On Apr 13, 8:31 am, Ben Pfaff <b...@cs.stanford.edu> wrote:
> Eric Sosman <esos...@ieee-dot-org.invalid> writes:
> >     Tell me, though: Are you using a QWERTY keyboard, despite all its
> > drawbacks?  Legend[*] has it that QWERTY was chosen *on purpose* to
> > slow down typists in the days when too much speed led to mechanical
> > jams.  On today's keyboards that's not a problem -- So, are you still
> > using a nineteenth-century keyboard layout? If so, ponder your reasons
> > for not changing to something more modern, and see if those reasons
> > shed any light on why people still put up with the Standard Warts And
> > All Library.
>
> The same topic came up here back in 2002.  Here's a new copy of
> what I posted back then:
>
> Have you used a mechanical typewriter? I have. <snip>

Yes, we've still got one! A 1932 manufactured Remington. Used it a lot
during the 90s. It's still in excellent condition except that I'm
unable to acquire a ribbon anywhere.

<Good explanation!>

BartC

Apr 13, 2012, 6:57:02 AM


"santosh" <santo...@gmail.com> wrote in message
news:2261d25f-a1f2-4413...@s10g2000pbc.googlegroups.com...
I've got an Underwood No. 5 - from 1931. I use it for addressing envelopes,
as it would probably take a week of trial and error (and a wastepaper bin
full of trashed envelopes) to do it on my laser printer.

--
Bartc

Jorgen Grahn

Apr 13, 2012, 10:51:47 AM
On Wed, 2012-04-11, John Reye wrote:
> On Apr 11, 7:53 pm, Ben Pfaff <b...@cs.stanford.edu> wrote:
>> Rupert Swarbrick <rswarbr...@gmail.com> writes:
>> It's fairly common for machine-generated HTML and XML (which are
>> text-based formats) to be single, very-long lines.
>
> Correct, but I would not read those huge lines, because the '\n' is
> not the logical divider.

True; you use fgets() if you can expect lines to be reasonably short.
Even then you have to handle the case where one extra-long line doesn't
fit into your buffer.

For the original complaint, I can see many scenarios where you:

fgets()
parse the line (in whatever way your application needs)
during the parsing, discover the lack of an \n
try to fgets() again and restart the parsing

> I however want a nice routine (which I have to code myself), which
> uses realloc, to adjust a buffer to fit everything until the '\n'.
> C standard lib does not have anything like this - so I have to code it
> myself.
>
> I bet C++ has something useful that one could use. It seems that many
> went into C++, to make it the huge bloated monster that it is! But
> still seems worth a look, to relieve me from having to handle this
> stuff at the basic level.

This is the second time you've made this threat. You can't do *both*
that and insult the C++ programmers here with things like "huge
bloated monster".

Anyway, I can't see that with your extreme attention to micro-
inefficiencies, you would be comfortable with C++ -- it solves these
things with dynamic memory allocations. Do you really prefer that to
a strlen() in an already warm data cache?

(And do you really believe a strlen() is significant compared to the
I/O that preceded it?)

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

lawrenc...@siemens.com

Apr 13, 2012, 10:39:02 AM
Ben Pfaff <b...@cs.stanford.edu> wrote:
>
> To conclude: mechanical QWERTY typewriters are at the same time
> an example of optimization for the common case and inherently
> flawed because of the remaining race condition. This is a great
> example of a tradeoff that you should not make when you design a
> program!

I would assert that it's a great example of the kind of tradeoffs you
frequently *have* to make when designing programs! :-)
--
Larry Jones

Hello, local Navy recruitment office? Yes, this is an emergency... -- Calvin

jono...@googlemail.com

Oct 31, 2013, 7:26:14 PM
On Wednesday, April 11, 2012 10:36:22 PM UTC+2, James Kuyper wrote:
> > ***How can one EFFICIENTLY determine if the last character is '\n'??
>
> That's relatively easy - so long as you don't need to know where the
> '\n' is.
>
Please note that this can be optimized a bit (and improved in understandability) like this:

const char *prefix =
(buf[(sizeof buf)-1] == '\0'
? buf[(sizeof buf)-2] != '\n'
: feof(fp) && !strchr(buf, '\n'))
? "no "
: "";


Or like this (here I've switched "no " and ""):
const char *prefix =
(buf[(sizeof buf)-1] == '\0'
? buf[(sizeof buf)-2] == '\n'
: !feof(fp) || strchr(buf, '\n'))
? ""
: "no ";



Note that pure boolean logic like this
(a && b) || (!a && c)
can be optimized in C to this:
a ? b : c


Ahh... the beauty and confusion of Boolean Logic!
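(Editorial note: the identity holds for boolean (0/1) operands and is easy to confirm exhaustively; the function name below is invented for this sketch.)

```c
/* Return 1 iff (a && b) || (!a && c) equals (a ? b : c) for every
 * 0/1 assignment of a, b, c. Note the ternary only matches the pure
 * boolean form when b and c are themselves 0 or 1. */
int identity_holds(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++)
                if (((a && b) || (!a && c)) != (a ? b : c))
                    return 0;
    return 1;
}
```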

jono...@googlemail.com

Nov 1, 2013, 2:59:22 AM
On Wednesday, April 11, 2012 10:36:22 PM UTC+2, James Kuyper wrote:
> > ***How can one EFFICIENTLY determine if the last character is '\n'??
>
> That's relatively easy - so long as you don't need to know where the
> '\n' is.
>
> > "Efficiently" means: don't use strlen!!!
> >
> > I only come up with the strlen method, which - to me - says that fgets
> > has a bad design.
>
> The following approach uses strchr() rather than strlen(), so it
> technically meets your specification. However, I presume you would have
> the same objections to strchr() as you do to strlen(). I'd like to point
> out, however, that it uses strchr() only once per file, which seems
> efficient enough for me.
In fact, strchr() is called AT MOST once per file. It can happen that it is not called at all! This is when fgets(buf, sizeof(buf), fp) returns as its _last_ non-NULL pointer (pointing at buf), a string for which the following holds: strlen(buf) + 1 == sizeof(buf)

The reason: this last non-NULL return from fgets, does not yet set EOF (since buf could be filled completely). The next call from fgets will return NULL, and set EOF; but then we don't even go into the while loop, so in this case we would call strchr zero times... since on the last iteration through the while-loop, we'd get feof to be 0 (false) if it gets called at all.


[PS: Relating to my previous post - Note that when using my code suggestion, we NEVER call feof on an iteration for which buf is filled in its entirety]

> If you're doing so little processing per file
> that a single call to strchr() per file adds significantly to the total
> processing load, I'd be more worried about the costs associated with
> fopen() and fclose() than those associated with strchr().
>
> The key point is that a successful call to fgets() can fail to read in
> an '\n' character only if fgets() meets the end of the input file, or
> the end of your buffer, both of which can be checked for quite
> efficiently. If it reaches the end of your buffer, there's one and only
> one place where the '\n' character can be, if one was read in.
> Therefore, it's only at the end of the file that a search is required.
>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> >
> > int main(int argc, char *argv[])
> > {
> >     char buf[6];
> >     FILE *fp = stdin;
>
> buf[(sizeof buf)-1] = 1; // any non-zero value will do.
>
> > while (fgets(buf, sizeof(buf), fp)) {
/*
> const char *prefix =
>     (buf[(sizeof buf)-1] == '\0' && buf[(sizeof buf)-2] != '\n'
>      || feof(fp) && !strchr(buf, '\n')) ? "no " : "";
*/
// improvement (see prev. post)
const char *prefix =
    (buf[(sizeof buf)-1] == '\0'
     ? buf[(sizeof buf)-2] != '\n'
     : feof(fp) && !strchr(buf, '\n'))
    ? "no "
    : "";

Malcolm McLean

unread,
Nov 1, 2013, 5:07:31 AM11/1/13
to
On Friday, April 13, 2012 8:24:22 AM UTC+1, santosh wrote:
>
> Yes, we've still got one! A 1932 manufactured Remington. Used it a lot
> during the 90s. It's still in excellent condition except that I'm
> unable to acquire a ribbon anywhere.
>
My great-aunt was a secretary with the admiralty during the war. I inherited
her typewriter, which I used to type out my first book (a maths textbook
for 11 year olds). Eventually it gummed up. The man in the typewriter repair
shop refused to take any money for fixing it.
My parents eventually chucked it out, together with the manuscript of the book.
The irony is that my father is a terrible hoarder, and all the cellars and
sheds are stuffed with junk.


jono...@googlemail.com

unread,
Nov 13, 2013, 4:24:49 PM11/13/13
to
On Friday, November 1, 2013 12:26:14 AM UTC+1, jono...@googlemail.com wrote:
> Please note that this can be optimized a bit (and improved in understandability) like this:
>
> const char *prefix =
>     (buf[(sizeof buf)-1] == '\0'
>      ? buf[(sizeof buf)-2] != '\n'
>      : feof(fp) && !strchr(buf, '\n'))
>     ? "no "
>     : "";

Ah what the heck. Actually neither strlen() nor strchr() is needed.
We can always do this in constant time, irrespective of the length of buf.

Here's how:

const char *prefix =
    (buf[sizeof(buf)-1] == '\0'
     ? buf[sizeof(buf)-2] != '\n'
     : feof(fp)
    ) ? "no " : ""

The reason is this:
fgets will NEVER cause EOF to be set, if a line contains '\n'.
Even if the last character of a file is '\n', then the last non-null-returning call of fgets will deliver this '\n', but not yet set EOF.

fgets only ever causes EOF to be set in one of these conditions:
1) when returning NULL, because of having encountered EOF. Here buf delivers no string and is unchanged. {{Just take care that it can also return NULL on a read error, so NULL is not unique to EOF.}}
2) when returning non-NULL, AND delivering a string in buf, that does not fully fill buf (the element buf[sizeof(buf)-1] being untouched by fgets: '\0' being set before the end of buf), AND if the string in buf does NOT contain a '\n'.

Case 2) occurs only when reading a file that does NOT end in '\n', and where the final character, when eventually read by fgets, is part of a string that does not fully fill buf (i.e. where the terminating '\0' is not at buf[sizeof(buf)-1]).

Additional: if the string fully fills buf ->
If we read a file that does NOT end in '\n', and the final character, when eventually read by fgets, results in a string that fully fills buf (with buf[sizeof(buf)-1] being '\0'), then that call of fgets returns non-NULL and does not yet set EOF. Only the next call of fgets will cause it to return NULL and simultaneously set EOF.


Which all just goes to show: stdio.h and its functions are not as simple as they seem.


Nevertheless: fgets is still deficient if you want to QUICKLY determine the length of the delivered string (or get a pointer to its end).

Barry Schwarz

unread,
Nov 13, 2013, 8:21:51 PM11/13/13
to
On Wed, 13 Nov 2013 13:24:49 -0800 (PST), jono...@googlemail.com
wrote:

>On Friday, November 1, 2013 12:26:14 AM UTC+1, jono...@googlemail.com wrote:
>> Please note that this can be optimized a bit (and improved in understandability) like this:
>
>> const char *prefix =
>> (buf[(sizeof buf)-1] == '\0'
>> ? buf[(sizeof buf)-2] != '\n'
>> : feof(fp) && !strchr(buf, '\n'))
>> ? "no "
>> : "";
>>
>
>Ah what the heck. Actually neither strlen() nor strchr() is needed.
>We can always do this in constant time, irrespective of the length of buf.
>
>Here's how:
>
>const char *prefix =
>(buf[sizeof(buf)-1] == '\0'
> ? buf[sizeof(buf)-2] != '\n'
> : feof(fp)
>) ? "no " : ""

It would work a whole lot better if it compiled without error. On the
other hand, even after adding the missing semicolon, the algorithm
fails in those cases where fgets reads a string significantly shorter
than buf.

--
Remove del for email

James Kuyper

unread,
Nov 13, 2013, 8:50:20 PM11/13/13
to
On 11/13/2013 08:21 PM, Barry Schwarz wrote:
> On Wed, 13 Nov 2013 13:24:49 -0800 (PST), jono...@googlemail.com
> wrote:
...
>> const char *prefix =
>> (buf[sizeof(buf)-1] == '\0'
>> ? buf[sizeof(buf)-2] != '\n'
>> : feof(fp)
>> ) ? "no " : ""
>
> It would work a whole lot better if it compiled without error. On the
> other hand, even after adding the missing semicolon, the algorithm
> fails in those cases where fgets reads a string significantly shorter
> than buf.

If it reads a string significantly shorter than buf, then
buf[sizeof(buf)-1] == '\0' will be false, and the string pointed at by
prefix will depend upon the value of feof(fp). If feof(fp), then the
input stream ended without a terminating newline, and prefix will be "no
"; otherwise, the line ended with a newline, and prefix will be "".
Except for the missing semi-colon, that looks right to me. How does your
analysis differ from mine?
--
James Kuyper

Barry Schwarz

unread,
Nov 14, 2013, 1:29:46 AM11/14/13
to
If the input is significantly shorter than the buffer, the contents of
the last two elements are residual from a previous read.

Ike Naar

unread,
Nov 14, 2013, 1:40:08 AM11/14/13
to
You've probably missed that the call to fgets() was preceded by

buf[(sizeof buf)-1] = 1; // any non-zero value will do.
Ben Bacarisse

unread,
Nov 14, 2013, 6:46:06 AM11/14/13
to
Ike Naar <i...@iceland.freeshell.org> writes:
<snip>
>>>> On Wed, 13 Nov 2013 13:24:49 -0800 (PST), jono...@googlemail.com
>>>> wrote:
<snip>
>>>>> const char *prefix =
>>>>> (buf[sizeof(buf)-1] == '\0'
>>>>> ? buf[sizeof(buf)-2] != '\n'
>>>>> : feof(fp)
>>>>> ) ? "no " : ""
<snip>
> You've probably missed that the call to fgets() was preceded by
>
> buf[(sizeof buf)-1] = 1; // any non-zero value will do.

All of this is complicated by the fact that the buffer contents are
indeterminate if a read error occurs. If you want absolute portable
certainty you must test for ferror before looking in the buffer. The
original post is old enough that I don't have easy access to the whole
thread so this may have already been covered. Sorry if it has.

--
Ben.

Tim Rentsch

unread,
Nov 14, 2013, 7:15:22 AM11/14/13
to
If a read error occurred then fgets() returned null. The test is
done only if the value returned by fgets() is non-null.

James Kuyper

unread,
Nov 14, 2013, 7:17:35 AM11/14/13
to
The code above was intended as a replacement for code that occurred
inside the body of a loop that I wrote, which starts:

while(fgets(buf, sizeof(buf), fp))
{
...
}

Thus, the body is never entered if there is a read error.
My original code was not intended to be complete; had it been, the
while() loop would have been followed by

if(ferror(fp))
{
// Error Handling
}
else
{
// End of File handling
}

Also, I would ordinarily have written the while condition as
while(fgets(buf, sizeof buf, fp) == buf)

My reasons for preferring that form are surprisingly controversial.
--
James Kuyper

jono...@googlemail.com

unread,
Nov 14, 2013, 12:47:45 PM11/14/13
to
On Thursday, November 14, 2013 1:17:35 PM UTC+1, James Kuyper wrote:
> Also, I would ordinarily have written the while condition as
>
> while(fgets(buf, sizeof buf, fp) == buf)
>
>
Why?
>
> My reasons for preferring that form are surprisingly controversial.

What reasons?
(Note the endless loop when buf = NULL!)

Eric Sosman

unread,
Nov 14, 2013, 12:58:38 PM11/14/13
to
s/endless loop/undefined behavior/, as per 7.1.4p1. Also,
James Kuyper is not the sort of damp-eared newbie who writes
`sizeof pointer_variable' to get the size of the pointed-at thing.

--
Eric Sosman
eso...@comcast-dot-net.invalid

James Kuyper

unread,
Nov 14, 2013, 1:56:17 PM11/14/13
to
On 11/14/2013 12:47 PM, jono...@googlemail.com wrote:
> On Thursday, November 14, 2013 1:17:35 PM UTC+1, James Kuyper wrote:
>> Also, I would ordinarily have written the while condition as
>>
>> while(fgets(buf, sizeof buf, fp) == buf)
>>
>>
> Why?
>>
>> My reasons for preferring that form are surprisingly controversial.
>
> What reasons?

For many functions with return values, the set of values that can be
represented in the return type can be partitioned into three subsets:
A) Values indicating successful operation
B) Values indicating some kind of a problem
C) Values that should not be returned.

If you assume that nothing can go wrong, checking whether the return
value is in set A is the same as checking whether the return value is
not in set B. People routinely write code based upon that assumption,
testing whichever condition is easier to express or evaluate.
I don't like making such assumptions. Among other possibilities:

1. I typed the wrong function name.
2. I remembered incorrectly which return values are in each of the three
categories.
3. The code got linked to a different library than it should have been,
containing a different function with the same name.
4. Some other part of the code contains a defect rendering the behavior
of the code undefined, and the first testable symptom of that fact is
that this particular function returns a value it's not supposed to be
able to return.
5. Other.

Therefore, I prefer to write my code so that values in set C get treated
the same way as values in set B. The problem I'm trying to deal with is
very rare, so I don't bother doing this if doing so would make the code
significantly more complicated. However, adding "== buf" falls below my
threshold for "significantly more complicated".

It was argued, when I explained this preference before, that if anything
was going so badly wrong that a standard library function was returning
a value is should not have been able to return, there was nothing useful
I could do about it - the problem could be arbitrarily bad, rendering
anything I tried to do about it pointless.

While it is technically true that any one of the problems I
mentioned above could have arbitrarily bad consequences, that's not the
right way to bet. That argument reflects an all-or-nothing attitude that
is out of sync with reality. In my experience, most such defects, if
they don't cause your program to fail immediately, will allow it to keep
running, at least for a little while, with something close to what would
have been the standard-defined behavior, if it hadn't been for the
defect. During that window of opportunity, it's possible to detect the
invalid return value and handle it as an error condition, making it
easier to investigate the actual cause of the problem.

It was also argued that my preferred approach produces code that is
obscure and counter-intuitive. To my mind if(func()==expected_value) is
just as clear as if(func()!=error_value). However, to an extent that
reflects the fact that I'm very familiar with what the expected result
is. If the expected result is not very familiar to a given programmer,
because it's normally ignored, then my approach might seem very obscure.

> (Note the endless loop when buf = NULL!)

buf was an array, not a pointer. Undefined behavior could, in principle,
cause buf==NULL to be true, but it's far harder to come up with any
reasonable low-cost alternative to cover that extremely unlikely
possibility.

James Kuyper

unread,
Nov 14, 2013, 2:09:20 PM11/14/13
to
On 11/14/2013 01:56 PM, James Kuyper wrote:
...
> I don't like making such assumptions. Among other possibilities:
>
> 1. I typed the wrong function name.
> 2. I remembered incorrectly which return values are in each of the three
> categories.
> 3. The code got linked to a different library than it should have been,
> containing a different function with the same name.
> 4. Some other part of the code contains a defect rendering the behavior
> of the code undefined, and the first testable symptom of that fact is
> that this particular function returns a value it's not supposed to be
> able to return.

I was working so hard to cover more plausible possibilities that I
forgot to include the simplest possibility: that the function is
implemented incorrectly, so that it can in fact return a value it's not
supposed to be able to return.
For standard library functions, that's rather implausible - but it's
still not impossible. It's more plausible if the defect comes up only in
obscure situations that might not have been properly tested.


Malcolm McLean

unread,
Nov 14, 2013, 2:20:45 PM11/14/13
to
On Thursday, November 14, 2013 6:56:17 PM UTC, James Kuyper wrote:
> On 11/14/2013 12:47 PM, jono...@googlemail.com wrote:
>
> For many functions with return values, the set of values that can be
> represented in the return type can be partitioned into three subsets:
>
> A) Values indicating successful operation
> B) Values indicating some kind of a problem
> C) Values that should not be returned.
>
> If you assume that nothing can go wrong, checking whether the return
> value is in set A is the same as checking whether the return value is
> not in set B. People routinely write code based upon that assumption,
> testing whichever condition is easier to express or evaluate.
>
> I don't like making such assumptions. Among other possibilities:
>
But in this case the NULL return from fgets() isn't an error return, it's
expected as part of normal control flow on correct input.

The other issue is that if(fgets(buff, 1024, fp) == buff) implies that it's
somehow important that fgets() return a pointer to a memory buffer. In fact
this is a frozen design glitch, it should be returning 0 on success or EOF
on failure.

Keith Thompson

unread,
Nov 14, 2013, 2:26:22 PM11/14/13
to
[...]

Interesting approach. It seems (to me) slightly obscure because I tend
not to think about fgets() returning its first argument, since it's not
a particularly useful value to return.

A more paranoid approach would be:

if ((result = fgets(buf, sizeof buf, fp)) == buf) {
/* ok */
}
else if (result == NULL) {
/* end-of-file or error */
}
else {
/* THIS SHOULD NEVER HAPPEN, print a stern warning and abort */
}

But unless you're writing a test suite for the standard library,
checking for illegal results from standard library functions probably
isn't worth the extra effort.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

James Kuyper

unread,
Nov 14, 2013, 2:56:25 PM11/14/13
to
On 11/14/2013 02:26 PM, Keith Thompson wrote:
> James Kuyper <james...@verizon.net> writes:
...
Yes, a pointer to (or perhaps, just after?) the last character written
to the buffer might be more useful, in some circumstances. The same
issue of relatively useless return values is ubiquitous in the
string-handling functions.

> A more paranoid approach would be:
>
> if ((result = fgets(buf, sizeof buf, fp)) == buf) {
> /* ok */
> }
> else if (result == NULL) {
> /* end-of-file or error */
> }
> else {
> /* THIS SHOULD NEVER HAPPEN, print a stern warning and abort */
> }
>
> But unless you're writing a test suite for the standard library,
> checking for illegal results from standard library functions probably
> isn't worth the extra effort.

Yes, going that far would exceed my threshold for "significantly more
complicated". For a third party library that was notorious for being
poorly implemented, such an approach might be more reasonable (assuming
that you had to use it, despite that notoriety).

jono...@googlemail.com

unread,
Nov 14, 2013, 4:11:34 PM11/14/13
to
On Thursday, November 14, 2013 8:56:25 PM UTC+1, James Kuyper wrote:
> On 11/14/2013 02:26 PM, Keith Thompson wrote:
> > [...]
> > Interesting approach. It seems (to me) slightly obscure because I tend
> > not to think about fgets() returning its first argument, since it's not
> > a particularly useful value to return.
>
> Yes, a pointer to (or perhaps, just after?) the last character written
> to the buffer might be more useful, in some circumstances. The same
> issue of relatively useless return values is ubiquitous in the
> string-handling functions.

How about this:
fgets2 should return a pointer to the final '\0' if it was written, or else return NULL if feof() or ferror() is set.

(Oh and fgets2(buf, n, fp) should definitely not write a '\0' if n = 1, but instead write nothing and then always return NULL).

But in any case, it's too late to change the standardized fgets().

GNU recommends using getline instead.
http://www.gnu.org/software/libc/manual/html_node/Line-Input.html


Comparing GNU's implementation of fgets to Dinkumware's is interesting! Dinkumware's fgets actually uses memchr(pt, '\n', len) to locate a '\n' and follows this by memcpy(s, pt, m), meaning that it iterates over the same buffer two times: first searching, then copying. Slow.
But it is very nicely readable I must say!

GNU introduced low-level functions.
For fgets()
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofgets.c;hb=HEAD
they use _IO_getline() which looks ummm... nice (but is less readable I think)
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iogetline.c;hb=HEAD
but to my shock also has memchr() followed by memcpy(). Slow.

At least GNU's memcpy copies wordwise on longword boundaries.
https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memchr.c
I'm not sure if Dinkumware does this.

If I were to roll my own library implementation, I'd do one that searches the stream's internal buffer (reading longwords and accessing the bytes), and immediately copies the longword if neither EOF nor found '\n'... etc.
i.e. iterating over stuff only once, and doing nice alignment etc.
Something like that.

Malcolm McLean

unread,
Nov 14, 2013, 4:23:20 PM11/14/13
to
On Thursday, November 14, 2013 9:11:34 PM UTC, jono...@googlemail.com wrote:
>
> If I were to roll my own library implementation, I'd do one that searches
> the stream's internal buffer (reading longwords and accessing the bytes),
> and immediately copies the longword if neither EOF nor found '\n'... etc.
>
> i.e. iterating over stuff only once, and doing nice alignment etc.
> Something like that.
>
You can certainly return a quad from the buffer, then test for '\n' using
four masks and comparators. But EOF is harder to code.
Of course it depends on whether you expect to be in an environment which makes
much use of physical input streams or not. If you expect most input to be
via Unix like pipes and so on, it makes sense to optimise fgets(). But not if
you're reading from a keyboard.

