
fgets vs gets


Lily

May 30, 2003, 11:12:55 AM
Hello,

I am unclear about how I can use fgets to achieve the same result as using
gets.

I was coding a command line interpreter program, and I used gets. But it
complains with the following:

/tmp/ccDLLJkR.o: In function `main':
/tmp/ccDLLJkR.o(.text+0x27): the `gets' function is dangerous and should not
be used.

I just ran the executable, ignored the above message, and got the
expected and required output.

I was told that using:
fgets(buf, sizeof buf, stdin) would achieve the same result
as gets(buf), but when I tried it, it didn't.

What/where could I have made a mistake?

Lily


Emmanuel Delahaye

May 30, 2003, 11:36:59 AM
In 'comp.lang.c', "Lily" <umwo...@cc.umanitoba.ca> wrote:

> I am unclear of how I can use fgets to achieve the same result as using
> gets.

So you need some light shed on this, because the use of gets() is forbidden
in any real application (one with monkey users).



> I was coding a command line interpreter program, and I used gets. But it
> complains the following:
>
> /tmp/ccDLLJkR.o: In function `main':
> /tmp/ccDLLJkR.o(.text+0x27): the `gets' function is dangerous and should
> not be used.

Absolutely.



> I just ran the executable and ignore the above message and I could get
> the expected and required output.

All is fine as long as you use the program in the way it was designed. But if
by chance a monkey typist types more characters than expected, you get
undefined behaviour, because the gets() function doesn't know the
maximum number of characters expected.

It simply swallows the whole stdin buffer up to '\n' and spits the whole
lot into the buffer whose address was passed, regardless of the number of
characters it writes. If the buffer is too small, it's UB. A serious bug,
and also a serious security hole. There is no portable way to prevent it.
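To make that concrete, here is a minimal sketch (illustrative only, not code
from this thread) of replacing gets() with fgets() on a fixed-size buffer:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[64];

    /* gets(buf) would happily write past the end of buf on a long
       line: undefined behaviour.  fgets() is told the buffer size
       and stops in time, keeping the '\n' if it fitted. */
    if (fgets(buf, sizeof buf, stdin) != NULL) {
        char *p = strchr(buf, '\n');
        if (p != NULL)
            *p = '\0';   /* strip the newline that gets() used to eat */
        printf("read: \"%s\"\n", buf);
    }
    return 0;
}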



> I was told that using:
> fgets(buf, sizeof buf, stdin) would achieve the same result
> as gets(buf), but when I try it, it wouldn't.
>
> What/where could I have done a mistake?

You have forgotten to post the non-working code. Also, you should read the
FAQ that explains the dangers of gets() and the correct use of fgets().

--
-ed- emdel at noos.fr
The C-language FAQ: http://www.eskimo.com/~scs/C-faq/top.html
C-library: http://www.dinkumware.com/htm_cl/index.html
FAQ de f.c.l.c : http://www.isty-info.uvsq.fr/~rumeau/fclc/

"Clearly your code does not meet the original spec."
"You are sentenced to 30 lashes with a wet noodle."
-- Jerry Coffin in a.l.c.c++

Hallvard B Furuseth

May 30, 2003, 12:07:43 PM
Lily wrote:

> I am unclear of how I can use fgets to achieve the same result as using
> gets.

Declare a big enough buffer, pass the size of that buffer as an argument
to fgets(), remove the trailing newline from the buffer after calling
fgets().

Finally, decide what to do if the input string is too long for the
buffer (which means you won't receive a trailing newline in the input
buffer). Alternatives:
- Ignore the problem, like gets() did. If someone inputs an overly long
  line, it will be read as two lines. (Whereas with gets() the program
  just misbehaved in some random way.)
- realloc() the buffer to have room for e.g. twice as much data,
  and call fgets() again to read the rest of the line - starting where
  the previous string leaves off in the buffer.
- Ignore the remaining characters on the line:
  loop and do getchar() until it returns EOF or '\n'.
  (A sketch of this last alternative follows below.)
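For instance, a rough sketch of an fgets-based reader that uses the last
alternative (illustrative only, not code from this thread):

#include <stdio.h>
#include <string.h>

/* Read one line into buf.  Strips the '\n' if the whole line fitted,
   otherwise discards the rest of the over-long line.  Returns 0 on
   EOF or read error, 1 otherwise. */
static int get_line(char *buf, size_t size, FILE *fp)
{
    char *nl;
    int ch;

    if (fgets(buf, (int)size, fp) == NULL)
        return 0;                 /* EOF or read error */

    nl = strchr(buf, '\n');
    if (nl != NULL)
        *nl = '\0';               /* complete line: drop the newline */
    else
        while ((ch = getc(fp)) != EOF && ch != '\n')
            ;                     /* line too long: throw away the rest */
    return 1;
}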

--
Hallvard

Eric Sosman

May 30, 2003, 12:06:55 PM
Lily wrote:
>
> I was told that using:
> fgets(buf, sizeof buf, stdin) would achieve the same result
> as gets(buf), but when I try it, it wouldn't.

fgets() deposits the trailing '\n' character in
`buf', but gets() discards it. See Questions 12.23
and 7.1 in the comp.lang.c Frequently Asked Questions
(FAQ) list

http://www.eskimo.com/~scs/C-faq/top.html

... and if that doesn't solve your problem, post the
actual code.

--
Eric....@sun.com

CBFalconer

May 30, 2003, 12:36:18 PM
Lily wrote:
>
> I am unclear of how I can use fgets to achieve the same result
> as using gets.
>
> I was coding a command line interpreter program, and I used
> gets. But it complains the following:
>
> /tmp/ccDLLJkR.o: In function `main':
> /tmp/ccDLLJkR.o(.text+0x27): the `gets' function is dangerous
> and should not be used.

And it is absolutely right. For a safe and easily used
alternative, try ggets, available at:

<http://cbfalconer.home.att.net/download/ggets.zip>

--
Chuck F (cbfal...@yahoo.com) (cbfal...@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!


Dann Corbit

May 30, 2003, 3:04:55 PM
"CBFalconer" <cbfal...@yahoo.com> wrote in message
news:3ED784CA...@yahoo.com...

> Lily wrote:
> >
> > I am unclear of how I can use fgets to achieve the same result
> > as using gets.
> >
> > I was coding a command line interpreter program, and I used
> > gets. But it complains the following:
> >
> > /tmp/ccDLLJkR.o: In function `main':
> > /tmp/ccDLLJkR.o(.text+0x27): the `gets' function is dangerous
> > and should not be used.
>
> And it is absolutely right. For a safe and easily used
> alternative, try ggets, available at:
>
> <http://cbfalconer.home.att.net/download/ggets.zip>

Another alternative:

http://home.att.net/~jackklein/ctips01.html#safe_gets

--
C-FAQ: http://www.eskimo.com/~scs/C-faq/top.html
"The C-FAQ Book" ISBN 0-201-84519-9
C.A.P. FAQ: ftp://cap.connx.com/pub/Chess%20Analysis%20Project%20FAQ.htm

bd

May 30, 2003, 4:13:29 PM
On Fri, 30 May 2003 10:12:55 +0000, Lily wrote:

> Hello,
>
> I am unclear of how I can use fgets to achieve the same result as using
> gets.
>
> I was coding a command line interpreter program, and I used gets. But it
> complains the following:
>
> /tmp/ccDLLJkR.o: In function `main':
> /tmp/ccDLLJkR.o(.text+0x27): the `gets' function is dangerous and should not
> be used.

Indeed - you can overrun buffers easily with gets.

> I just ran the executable and ignore the above message and I could get the
> expected and required output.
>
> I was told that using:
> fgets(buf, sizeof buf, stdin) would achieve the same result
> as gets(buf), but when I try it, it wouldn't.
>
> What/where could I have done a mistake?

Try stripping the trailing newline:
if(strlen(buf) && buf[strlen(buf)-1] == '\n')
    buf[strlen(buf)-1] = '\0';

Simon Biber

May 30, 2003, 4:31:27 PM
"bd" <bdo...@bd-home-comp.no-ip.org> wrote:
> Try stripping the trailing newline:
> if(strlen(buf) && buf[strlen(buf)-1] == '\n')
>     buf[strlen(buf)-1] = '\0';

Ick - three calls to strlen when one would suffice.
Gratuitous inefficiency. I don't think it's premature
optimisation to use:
char *p = strchr(buf, '\n');
if(p != NULL) *p = '\0';

--
Simon.


Emmanuel Delahaye

May 30, 2003, 5:19:22 PM
In 'comp.lang.c', "bd" <bdo...@bd-home-comp.no-ip.org> wrote:

> Try stripping the trailing newline:
> if(strlen(buf) && buf[strlen(buf)-1] == '\n')
>     buf[strlen(buf)-1] = '\0';

I'm very sorry to have to say it, but this is probably the ugliest way I
have ever seen of writing:

char *p = strchr (buf, '\n');
if (p)
{
    *p = 0;
}

bd

May 30, 2003, 5:21:20 PM

True, but mine reduces the modification to other parts of the code. Also,
it only gets trailing \ns, but come to think of it that's not strictly
required. You could do:
#define CHOMP(s) \
    { size_t s##_temp = strlen(s); \
      if(s##_temp && s[s##_temp - 1] == '\n') \
          s[s##_temp - 1] = '\0'; \
    }

Eric Sosman

May 30, 2003, 5:29:58 PM
Emmanuel Delahaye wrote:
>
> In 'comp.lang.c', "bd" <bdo...@bd-home-comp.no-ip.org> wrote:
>
> > Try stripping the trailing newline:
> > if(strlen(buf) && buf[strlen(buf)-1] == '\n')
> >     buf[strlen(buf)-1] = '\0';
>
> I'm very sorry to have to say it, but this is probably the ugliest way I
> ever seen of writing:
>
> char *p = strchr (buf, '\n');
> if (p)
> {
> *p = 0;
> }

An even more compact possibility is

buf[strcspn(buf,"\n")] = '\0';

... but I think this is a case where brevity is the
soul of something other than wit.

--
Eric....@sun.com

Simon Biber

May 30, 2003, 7:14:55 PM
"bd" <bdo...@bd-home-comp.no-ip.org> wrote:
> On Sat, 31 May 2003 06:31:27 +1000, Simon Biber wrote:
> > "bd" <bdo...@bd-home-comp.no-ip.org> wrote:
> >> Try stripping the trailing newline:
> >> if(strlen(buf) && buf[strlen(buf)-1] == '\n')
> >>     buf[strlen(buf)-1] = '\0';
> >
> > Ick - three calls to strlen when one would suffice.
> > Gratuitous inefficiency. I don't think it's premature
> > optimisation to use:
> > char *p = strchr(buf, '\n');
> > if(p != NULL) *p = '\0';
>
> True, but mine reduces the modification to other parts of the code.

What modification? The variable declaration? In C99 you can declare variables
anywhere -- and if you're still catering for obsolete C compilers, then just
add braces around the two lines. This little pointer is likely to be
optimised into a register, whereas your version cannot generally be
optimised to the same degree, because each function call could have side
effects.

> Also, it only gets trailing \ns, but come to think
> of it that's not strictly required.

Indeed, the string constructed by fgets will only ever contain zero or one
newline character, never more. (If that weren't the case, there is strrchr -
but since that must first find the end of the string and then work back, it
must be slower.)

> You could do:
> #define CHOMP(s) \
> { size_t s##_temp = strlen(s); \
> if(s##_temp && s[s##_temp - 1] == '\n') \
> s[s##_temp - 1] = '\0'; \
> }

The strchr would work equally well in such a macro and IMHO is more elegant
because it has only one test in the if statement.
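For concreteness, the strchr flavour of such a macro might look like this
(an illustrative sketch, not code from the thread):

#include <string.h>

#define CHOMP(s) \
    do { char *chomp_p_ = strchr((s), '\n'); \
         if (chomp_p_ != NULL) *chomp_p_ = '\0'; } while (0)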

--
Simon.


CBFalconer

May 30, 2003, 10:34:24 PM
Dann Corbit wrote:
> "CBFalconer" <cbfal...@yahoo.com> wrote in message
> > Lily wrote:
> > >
> > > I am unclear of how I can use fgets to achieve the same result
> > > as using gets.
> > >
> > > I was coding a command line interpreter program, and I used
> > > gets. But it complains the following:
> > >
> > > /tmp/ccDLLJkR.o: In function `main':
> > > /tmp/ccDLLJkR.o(.text+0x27): the `gets' function is dangerous
> > > and should not be used.
> >
> > And it is absolutely right. For a safe and easily used
> > alternative, try ggets, available at:
> >
> > <http://cbfalconer.home.att.net/download/ggets.zip>
>
> Another alternative:
>
> http://home.att.net/~jackklein/ctips01.html#safe_gets

I started my ggets development with 'what attracts newbies to
gets', and the answer was 'simplicity'. No worries about \n
detection. No worries about actual length (HAH). Just go and get
that string. So I tried to match those qualities in ggets.

1. It always ends at the next \n. The \n is always removed.
2. It can also be used on non-stdin files. ggets<->fggets
3. However, unlike gets<->fgets, the \n is always removed.
4. I would have liked to return a pointer, but that runs into
   the need for multiple error possibilities (EOF, file error,
   memory shortage). So I returned 0 for OK, EOF, or positive.
5. No argument initialization is ever needed. OK, the file
   for fggets needs to be open.

What can go wrong? AFAICT the only possibility is that the user
fails to eventually free the returned string and has a memory
leak. The same applies to strdup, when implemented. At any rate
the result is a program failure, not a potential system failure.
Another possibility is that the caller omits the & in front of the
char* pointer passed. Shades of scanf. Only two things to
remember, and the compiler should flag the missing &.
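Assuming the interface the description above implies -- roughly
int ggets(char **ln), returning 0 for OK, EOF at end of input, or a
positive error code -- typical use would look something like this
(a sketch only; the header name is an assumption, see the zip for the
real thing):

#include <stdio.h>
#include <stdlib.h>
#include "ggets.h"   /* assumed header name from the ggets.zip package */

int main(void)
{
    char *line;
    int rc;

    /* ggets allocates the buffer itself and strips the '\n';
       0 means success, EOF means end of input, positive means error. */
    while ((rc = ggets(&line)) == 0) {
        puts(line);
        free(line);  /* the one thing the caller must remember to do */
    }
    return (rc == EOF) ? 0 : EXIT_FAILURE;
}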

Malcolm

May 31, 2003, 6:03:16 AM

"Simon Biber" <sbi...@optushome.com.au> wrote in message

>
> Ick - three calls to strlen when one would suffice.
> Gratuitous inefficiency. I don't think it's premature
> optimisation to use:
> char *p = strchr(buf, '\n');
> if(p != NULL) *p = '\0';
>
Also the newline is there for a reason - just stripping it out is likely to
replace UB on buffer overflow with a defined-behaviour bug, as a partial
input is treated as a whole line, and then the next input function also gets
an invalid line from stdin.
Why is a defined-behaviour bug generally worse than UB? UB is undefined by
the C standard, but it can be defined by the platform to do something
sensible - for instance detecting the illegal memory access and terminating
the program with an error message. Functionally, a crash is no different to
the computer blowing a fuse.
However, if we have defined but incorrect behaviour, the platform is
forbidden from detecting it. The program is forced to give the incorrect
output, whether that be a gas bill for $200 for Mrs Bloggs when really she
owes $100 (much more annoying than a bill for $234,000,000,000.02), or too
big a dose of radiation in a computerised piece of medical equipment.


Paul Hsieh

May 31, 2003, 7:07:22 AM
In article <3ED802C4...@yahoo.com>, cbfal...@yahoo.com says...

> Dann Corbit wrote:
> > "CBFalconer" <cbfal...@yahoo.com> wrote in message
> > > Lily wrote:
> > > > I am unclear of how I can use fgets to achieve the same result
> > > > as using gets.
> > > >
> > > > I was coding a command line interpreter program, and I used
> > > > gets. But it complains the following:
> > > >
> > > > /tmp/ccDLLJkR.o: In function `main':
> > > > /tmp/ccDLLJkR.o(.text+0x27): the `gets' function is dangerous
> > > > and should not be used.
> > >
> > > And it is absolutely right. For a safe and easily used
> > > alternative, try ggets, available at:
> > >
> > > <http://cbfalconer.home.att.net/download/ggets.zip>
> >
> > Another alternative:
> >
> > http://home.att.net/~jackklein/ctips01.html#safe_gets

Presizing just propagates the memory management problem back to the programmer.
Setting a maximum input length just turns a programming problem into an end-
user problem.



> I started my ggets development with 'what attracts newbies to
> gets', and the answer was 'simplicity'. No worries about \n
> detection. No worries about actual length (HAH). Just go and get
> that string. So I tried to match those qualities in ggets.
>
> 1. It always ends at the next \n. The \n is always removed.
> 2. It can also be used on non stdin files. ggets<->fggets
> 3. However, unlike gets<->fgets, \n is always removed,

fgets does not remove the '\n'. Furthermore fgets ignores any encountered '\0'
characters, and only terminates on one of '\n', EOF, or the length limit.

> 4. I would have liked to return a pointer, but that runs into
> the need for multiple error possibilities (EOF, file error,
> memory shortage). So I returned 0 for OK, EOF, or positive.
> 5. No argument initialization is ever needed. OK, the file
> for fggets needs to be open.
>
> What can go wrong? AFAICT the only possibility is that the user
> fails to eventually free the returned string and has a memory
> leak.

You call realloc roughly (strlen(output) / 128) times. Some weaker C library
implementations perform this very badly and will artificially run out of memory
far before all the heap memory is really consumed. You should use increasing
powers of 2, and if you are that worried, realloc at the end to shrink it to
the minimum required memory. This will guarantee that at most
log_2(strlen(output)) calls to realloc are made.

See breadln, or bgets in the "Better String Library" here:

http://bstring.sourceforge.net/

> [...] The same applies to strdup, when implemented.

Well strdup is kind of trivial -- I don't think it's one of the "severe problem"
functions in the C library. In any event, these are only two functions, and
neither performs in-place operations. The other C library string functions are
not so easily replaced by analogous methods.

--
Paul Hsieh
http://bstring.sourceforge.net/
http://www.pobox.com/~qed/

pete

May 31, 2003, 7:21:25 AM
Paul Hsieh wrote:

> Well strdup is kind of trivial --
> I don't think its one of the "severe problem"
> functions in the C library.

It's not even one of the functions in the standard C library.

--
pete

Simon Biber

May 31, 2003, 7:39:06 AM
"Malcolm" <mal...@55bank.freeserve.co.uk> wrote:

> "Simon Biber" <sbi...@optushome.com.au> wrote:
> > Ick - three calls to strlen when one would suffice.
> > Gratuitous inefficiency. I don't think it's premature
> > optimisation to use:
> > char *p = strchr(buf, '\n');
> > if(p != NULL) *p = '\0';
>
> Also the newline is there for a reason - just stripping it
> out is likely to replace ub on buffer overflow with a
> defined-behaviour bug as a partial input is treated as
> whole, and then the next input function also gets an
> invalid line from stdin.

It's easy enough to add

char *p = strchr(buf, '\n');
if(p != NULL) *p = '\0'; else exit(EXIT_FAILURE);

I think that's heaps better than UB.

--
Simon.


Hallvard B Furuseth

May 31, 2003, 7:42:42 AM
Simon Biber wrote:

> It's easy enough to add
> char *p = strchr(buf, '\n');
> if(p != NULL) *p = '\0'; else exit(EXIT_FAILURE);

That will fail if the last line in the file does not end with newline.

--
Hallvard

Simon Biber

May 31, 2003, 7:56:24 AM

True. It assumes you have a properly formatted text
file, with a newline after every line. The C standard
leaves it implementation-defined whether a text stream
requires a terminating newline character or not.

Perhaps it would work if you changed it to:

while(fgets(buf, sizeof buf, stdin)) {
    char *p = strchr(buf, '\n');
    if(p != NULL) {
        *p = '\0';
    } else {
        if(!feof(stdin)) {
            fputs("Input line too long\n", stderr);
            exit(EXIT_FAILURE);
        }
    }
    ...
}

Of course, if it is possible to read a line in stages
and process it that way, it is far preferable to the
abort. Or if you need to handle arbitrary-length lines,
perhaps something like CBFalconer's ggets is in order
-- it will read any length line into a realloced buffer.

--
Simon.


CBFalconer

May 31, 2003, 9:26:09 AM
Paul Hsieh wrote:
> cbfal...@yahoo.com says...
> > Dann Corbit wrote:
> > > "CBFalconer" <cbfal...@yahoo.com> wrote in message
> > > > Lily wrote:
>
> > > > > I am unclear of how I can use fgets to achieve the same
> > > > > result as using gets.
> > > > >
> > > > > I was coding a command line interpreter program, and I
> > > > > used gets. But it complains the following:
> > > > >
> > > > > /tmp/ccDLLJkR.o: In function `main':
> > > > > /tmp/ccDLLJkR.o(.text+0x27): the `gets' function is
> > > > > dangerous and should not be used.
> > > >
> > > > And it is absolutely right. For a safe and easily used
> > > > alternative, try ggets, available at:
> > > >
> > > > <http://cbfalconer.home.att.net/download/ggets.zip>
> > >
> > > Another alternative:
> > >
> > > http://home.att.net/~jackklein/ctips01.html#safe_gets
>
> Presizing just propagates the memory management problem back to
> the programmer. Setting a maximum input length just turns a
> programming problem into an end-user problem.

>
> > I started my ggets development with 'what attracts newbies to
> > gets', and the answer was 'simplicity'. No worries about \n
> > detection. No worries about actual length (HAH). Just go and
> > get that string. So I tried to match those qualities in ggets.
> >
> > 1. It always ends at the next \n. The \n is always removed.
> > 2. It can also be used on non stdin files. ggets<->fggets
> > 3. However, unlike gets<->fgets, \n is always removed,
>
> fgets does not remove the '\n'. Furthermore fgets ignores any
> encountered '\0' characters, and only terminates on one of '\n',
> EOF, or the length limit.

As I pointed out in 3. Maybe I was unclear.

>
> > 4. I would have liked to return a pointer, but that runs into
> > the need for multiple error possibilities (EOF, file error,
> > memory shortage). So I returned 0 for OK, EOF, or positive.
> > 5. No argument initialization is ever needed. OK, the file
> > for fggets needs to be open.
> >
> > What can go wrong? AFAICT the only possibility is that the user
> > fails to eventually free the returned string and has a memory
> > leak.
>
> You call realloc roughly (strlen(output) / 128) times. Some
> weaker C library implementations perform this very badly and will
> artificially run out of memory far before all the heap memory is
> really consumed. You should use increasing powers of 2, and if
> you are that worried, realloc at the end to shrink it to the
> minimum required memory. This will guarantee that at most
> log_2(strlen(output)) calls to realloc are made.

This is perfectly possible, without changing the interface one
iota. I deliberately chose otherwise because, at least for me,
the normal use is with interactive input or text files, where such
long lines are possible but highly unlikely. I made allowances
for the internal actions of at least one malloc package (mine).
If anything this is (to me) an argument for inclusion in the
standard library package, where the internal action can be suited
to the actual malloc available.

>
> See breadln, or bgets in the "Better String Library" here:
>
> http://bstring.sourceforge.net/

I am changing the subject to better reflect my point.

Malcolm

May 31, 2003, 11:16:40 AM

"Simon Biber" <sbi...@optushome.com.au> wrote in message
>
> Or if you need to handle arbitrary-length lines,
> perhaps something like CBFalconer's ggets is in order
> -- it will read any length line into a realloced buffer.
>
That's the real answer. The problem is that, even on a modern PC with 4GB of
swap space, you can't guarantee that your input won't be the one to run the
computer out of memory, so you still have to handle the error case.


karl malbrain

May 31, 2003, 1:15:04 PM
That's VIRTUAL MEMORY, not "swap space". We have a ROLLOUT command for
that.
karl m, ACM SIGOPS.

"Malcolm" <mal...@55bank.freeserve.co.uk> wrote in message
news:bbagra$g4j$1...@newsg1.svr.pol.co.uk...

Paul Hsieh

May 31, 2003, 5:33:03 PM
In article <3ED8A217...@yahoo.com>, cbfal...@yahoo.com says...

> Paul Hsieh wrote:
> > You call realloc roughly (strlen(output) / 128) times. Some
> > weaker C library implementations perform this very badly and will
> > artifically run out of memory far before all the heap memory is
> > really consumed. You should use increasing powers of 2, and if
> > you are that worried, realloc at the end to shrink it to the
> > minimum required memory. This will guarantee that at most
> > log_2(strlen(output)) calls to realloc are made.
>
> This is perfectly possible, without changing the interface one
> iota. I deliberately chose otherwise because, at least for me,
> the normal use is with interactive input or text files, where such
> long lines are possible but highly unlikely.

Yeah, 640K ought to be enough for anyone. There is no significant disadvantage
to using doubling sizes. An additional issue is that a realloc can cost an
additional memcpy (if the adjacent tail memory is not available). Since memcpy
is basically O(n), your copy performance can be as bad as:

O (strlen(output) ^ 2 / (2*128))

while the power of 2 solution can only be as bad as:

O (strlen(output) * 2)

For an input that is 50K, for example, my solution would be roughly 100 times
faster in the worst case scenarios of realloc copying.

On the flip side, if your inputs are really small enough, then you are being
wasteful of memory for the average case, meaning that you have to worry about
how many such inputs you accumulate at once.

Using powers of 2 is a no-lose scenario, and you should probably consider it
more seriously. It's not any harder to program. In your case, it's just:
initialize some counter d, double it with d += d, then use (d - 16) for your
actual length so that it fits nicely with your reallocation scheme.
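A sketch of the doubling strategy being advocated here (illustrative only,
not the bgets or ggets source):

#include <stdio.h>
#include <stdlib.h>

/* Read one line, growing the buffer by powers of two, so only about
   log2(n) reallocs and O(n) total copying are needed.  Returns a
   malloc'd string without the '\n', or NULL on EOF or out of memory. */
static char *read_line_doubling(FILE *fp)
{
    size_t cap = 16, len = 0;
    char *buf = malloc(cap), *tmp;
    int ch;

    if (buf == NULL)
        return NULL;
    while ((ch = getc(fp)) != EOF && ch != '\n') {
        if (len + 1 == cap) {
            cap *= 2;                    /* double, don't add a fixed step */
            tmp = realloc(buf, cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
        buf[len++] = (char)ch;
    }
    if (ch == EOF && len == 0) {         /* nothing read at all */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    tmp = realloc(buf, len + 1);         /* optional final shrink */
    return tmp != NULL ? tmp : buf;
}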

CBFalconer

Jun 1, 2003, 12:00:28 AM
Paul Hsieh wrote:
> In article <3ED8A217...@yahoo.com>, cbfal...@yahoo.com says...
> > Paul Hsieh wrote:
>
> > > You call realloc roughly (strlen(output) / 128) times. Some
> > > weaker C library implementations perform this very badly and will
> > > artificially run out of memory far before all the heap memory is
> > > really consumed. You should use increasing powers of 2, and if
> > > you are that worried, realloc at the end to shrink it to the
> > > minimum required memory. This will guarantee that at most
> > > log_2(strlen(output)) calls to realloc are made.
> >
> > This is perfectly possible, without changing the interface one
> > iota. I deliberately chose otherwise because, at least for me,
> > the normal use is with interactive input or text files, where such
> > long lines are possible but highly unlikely.
>
> Yeah, 640K ought to be enough for anyone. There is no significant
> disadvantage for using doubling sizes. An additional issue is that
> a realloc can cost an additional memcpy (if the adjacent tail memory
> is not available.) Since memcpy is basically O(n), your copy
> performance can be as bad as:
>
... snip ...

>
> Using powers of 2 is a no-lose scenario, and you should probably
> consider it more seriously. Its not any harder to program.
... snip ...

Of course it is no harder to program. If you look at my hashlib
system I do just that.

However there is a quite likely scenario in which the doubling (or
any other ratio) system can be extremely wasteful. It applies
whenever the expanded memory area is non-contiguous with the old,
and the doubling system is the only thing making demands on the
malloc system. There are also other awkward cases.

The previous allocation has to exist concurrently with the new.
The previously released blocks have sizes 1, 2, 4, 8, ....
N, which sum to less than 2N. Thus this space cannot be used
for the new allocation, and must be dumped into the free space
pool. Nothing in the system (the doubler is the only thing making
demands) will be able to use this space. Thus the effective
memory use expands both the actual allocation and the free space
by roughly a factor of 2. Things will break much faster.

A sufficiently smart realloc system could work around this by and
large. It would not be a large change to my nmalloc system to
handle it, but the non-contiguous areas will always break things
sooner or later. And most users won't be using my malloc package.

As I said earlier, I designed ggets for the most likely
applications, and a multi-megabyte single text line is not one I
anticipate arising often. strlen efficiency also falls off
rapidly with long lines, yet it seems adequate for the majority of
uses. If you have an application where the ggets realloc actually
is a bind, as shown by measurements, then there is a good reason
to modify it FOR THAT APPLICATION. Meanwhile my every instinct
says my approach is better.

Richard Heathfield

Jun 1, 2003, 4:15:09 AM
CBFalconer wrote:

<snip>

> As I said earlier, I designed ggets for the most likely
> applications, and a multi-megabyte single text line is not one I
> anticipate arising often. strlen efficiency also falls off
> rapidly with long lines, yet it seems adequate for the majority of
> uses.

Right, but for really long lines, you'd keep track, right?

> If you have an application where the ggets realloc actually
> is a bind, as shown by measurements, then there is a good reason
> to modify it FOR THAT APPLICATION.

Absolutely. So far, I agree entirely.

> Meanwhile my every instinct
> says my approach is better.

Here, however, I must disagree. I think your way is simpler on the face of
it, but harder to house-keep properly. As you know, I have written a
similar function, although it is modelled more on fgets than on gets, which
makes the housekeeping considerably simpler, at the expense of putting a
couple of odd-looking &s into the args list. Now, I can understand why you
might like ggets more than fgetline -- after all, you wrote ggets -- but I
am not yet convinced that your approach - a new buffer every time, with no
way (IIRC) for the user to limit the size of that buffer - is /better/ than
giving the choices of supplying one's own reallocable buffer pointer and an
upper limit beyond which the user doesn't want the buffer to grow (as a
security measure).
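As an illustration of that kind of interface -- a caller-supplied
reallocable buffer plus an upper growth limit -- something along these
lines (a hypothetical signature, not Richard Heathfield's actual
fgetline):

#include <stdio.h>
#include <stdlib.h>

/* *pbuf and *psize describe a caller-owned growable buffer; limit caps
   how large it may become.  Returns the length read, or -1 on EOF,
   out of memory, or when the limit would be exceeded. */
static long get_bounded_line(char **pbuf, size_t *psize, size_t limit, FILE *fp)
{
    size_t len = 0;
    int ch;

    while ((ch = getc(fp)) != EOF && ch != '\n') {
        if (len + 2 > *psize) {
            size_t newsize = *psize ? *psize * 2 : 64;
            char *tmp;
            if (newsize > limit)
                newsize = limit;
            if (len + 2 > newsize)
                return -1;          /* caller's limit reached */
            tmp = realloc(*pbuf, newsize);
            if (tmp == NULL)
                return -1;          /* out of memory */
            *pbuf = tmp;
            *psize = newsize;
        }
        (*pbuf)[len++] = (char)ch;
    }
    if (len == 0 && ch == EOF)
        return -1;                  /* end of input */
    if (*psize < len + 1) {         /* empty line read into an empty buffer */
        char *tmp = realloc(*pbuf, len + 1);
        if (tmp == NULL)
            return -1;
        *pbuf = tmp;
        *psize = len + 1;
    }
    (*pbuf)[len] = '\0';
    return (long)len;
}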

--
Richard Heathfield : bin...@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton

John

Jun 1, 2003, 5:49:25 AM

"Richard Heathfield" <dont...@address.co.uk.invalid> wrote in message
news:bbccmc$567$1...@titan.btinternet.com...

<snip>

> might like ggets more than fgetline -- after all, you wrote ggets -- but I
> am not yet convinced that your approach - a new buffer every time, with no
> way (IIRC) for the user to limit the size of that buffer - is /better/ than
> giving the choices of supplying one's own reallocable buffer pointer and an
> upper limit beyond which the user doesn't want the buffer to grow (as a
> security measure).

Much of what you say makes sense, but why an upper limit? I have
one that returns an error if realloc fails; isn't that enough?

John


Richard Heathfield

Jun 1, 2003, 8:15:40 AM
John wrote:

>
> "Richard Heathfield" <dont...@address.co.uk.invalid> wrote in message
> news:bbccmc$567$1...@titan.btinternet.com...
>
> <snip>
>
>> might like ggets more than fgetline -- after all, you wrote ggets -- but
>> I am not yet convinced that your approach - a new buffer every time, with
>> no way (IIRC) for the user to limit the size of that buffer - is /better/
>> than giving the choices of supplying one's own reallocable buffer pointer
>> and an upper limit beyond which the user doesn't want the buffer to grow
>> (as a security measure).
>
> Much of what you say makes sense but why an upper limit?

It gives the programmer (optional) control over the size of input line that
the program will accept. If I don't care, I will pass (size_t)-1 for that
param. Typically, I choose 32768 as a default. In circumstances where it
mattered, I would provide an override option.

> I have
> one that returns a error if realloc fails, isn't that enough?

Well, it's certainly necessary. Whether it is sufficient is, I suppose, a
matter of taste. It's clearly not a cut-and-dried issue that everyone here
agrees on, or Chuck would have addressed it in ggets(), wouldn't he? :-)

CBFalconer

Jun 1, 2003, 9:04:02 AM

If ggets fails for memory shortage, it will return a partial line
and a positive error indicator. The user is perfectly free to
free that line (or other things) and carry on.

BTW, in a separate message I justified the linear, rather than
exponential, growth of the buffer FOR THIS APPLICATION.

At the same time it is not a case of use ggets or nothing - your
routine (forget the name offhand) is still available, when those
characteristics are required. One can get input with various
primitives, including getc, fgets, scanf, and even fread if
desired. They are all tools - reach for the appropriate one.
(For the unwary reader, don't reach for gets.)

James Antill

Jun 1, 2003, 8:09:22 PM
On Sat, 31 May 2003 21:33:03 +0000, Paul Hsieh wrote:

> In article <3ED8A217...@yahoo.com>, cbfal...@yahoo.com says...
>> Paul Hsieh wrote:
>> > You call realloc roughly (strlen(output) / 128) times. Some
>> > weaker C library implementations perform this very badly and will
>> > artificially run out of memory far before all the heap memory is
>> > really consumed. You should use increasing powers of 2, and if
>> > you are that worried, realloc at the end to shrink it to the
>> > minimum required memory. This will guarantee that at most
>> > log_2(strlen(output)) calls to realloc are made.
>>
>> This is perfectly possible, without changing the interface one
>> iota. I deliberately chose otherwise because, at least for me,
>> the normal use is with interactive input or text files, where such
>> long lines are possible but highly unlikely.
>
> Yeah, 640K ought to be enough for anyone. There is no significant disadvantage
> for using doubling sizes. An additional issue is that a realloc can cost an
> additional memcpy (if the adjacent tail memory is not available.) Since memcpy
> is basically O(n), your copy performance can be as bad as:
>
> O (strlen(output) ^ 2 / (2*128))
>
> while the power of 2 solution can only be as bad as:
>
> O (strlen(output) * 2)

Err, no. If the input is 17 bytes long, and you start at 1, worst case
is...

1 (input, len=1 sz=1)
1 (realloc copy, len=1 sz=2)
1 (input, len=2 sz=2)
2 (realloc copy, len=2 sz=4)
2 (input, len=4 sz=4)
4 (realloc copy, len=4 sz=8)
4 (input, len=8, sz=8)
8 (realloc copy, len=8, sz=16)
8 (input, len=16 sz=16)
16 (realloc copy, len=16, sz=32)
1 (input, len=17 sz=32)

And at the end your length is 17 and you've done 48 copies
(48 being > 17 * 2) -- and this picture only gets _worse_ with bigger
numbers. Note that at the end of it you are wasting 46% of your space
(and the waste percentage _stays constant_ as the numbers increase).
That's also discounting the per call overhead of realloc().
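The bookkeeping above is easy to check mechanically; a throwaway sketch
(illustrative only, not code from the thread) that reproduces the count:

#include <stdio.h>

/* Start at capacity 1, double on each realloc, and charge both the
   bytes read in and the bytes copied by each realloc. */
int main(void)
{
    unsigned long input = 17, len = 0, cap = 1, copied = 0, take;

    take = (input < cap) ? input : cap;
    copied += take;                  /* first fill */
    len = take;
    while (len < input) {
        copied += len;               /* realloc copies the current contents */
        cap *= 2;
        take = input - len;
        if (take > cap - len)
            take = cap - len;
        copied += take;              /* newly read bytes */
        len += take;
    }
    printf("input=%lu bytes, bytes moved=%lu, final capacity=%lu\n",
           input, copied, cap);      /* prints 17, 48, 32 */
    return 0;
}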

Really the only sane way to do this is to have separate chunks of memory:
copies are always O(n) and waste is always (chunk size / 2).
If you did it properly you could even share the chunks over multiple
connections.

Someone really needs to write something like that... (see sig. :)

--
James Antill -- ja...@and.org
Need an efficent and powerful string library for C?
http://www.and.org/vstr/

CBFalconer

Jun 1, 2003, 9:09:17 PM
James Antill wrote:
>
... snip ...

>
> Err, no. If the input is 17 bytes long, and you start at 1,
> worst case is...
>
> 1 (input, len=1 sz=1)
> 1 (realloc copy, len=1 sz=2)
> 1 (input, len=2 sz=2)
> 2 (realloc copy, len=2 sz=4)
> 2 (input, len=4 sz=4)
> 4 (realloc copy, len=4 sz=8)
> 4 (input, len=8, sz=8)
> 8 (realloc copy, len=8, sz=16)
> 8 (input, len=16 sz=16)
> 16 (realloc copy, len=16, sz=32)
> 1 (input, len=17 sz=32)
>
> And at the end your length is 17 and you've done 48 copies
> (48 being > 17 * 2) -- and this picture only gets _worse_ with
> bigger numbers. Note that at the end of it you are wasting 46%
> of your space (and the waste percentage _stays constant_ as the
> numbers increase). That's also discounting the per call
> overhead of realloc().
>
> Really the only sane way to do this is to have seperate chunks
> of memory, copies are always O(n) and waste is always (chunk
> size / 2). If you did it properly you could even share the
> chunks over multiple connections.

I suggest you read the code in question first. It doesn't work
like that. If it did I would agree that it was foolish.

James Antill

Jun 1, 2003, 10:56:15 PM
On Mon, 02 Jun 2003 01:09:17 +0000, CBFalconer wrote:

> I suggest you read the code in question first. It doesn't work
> like that. If it did I would agree that it was foolish.

No, your code doesn't, but I replied to Paul, who is using power-of-2
increments on expanding the buffer (with a minimum of 8, but that
doesn't really matter).
Ggets uses an initial size of 112 and increments of 128.

So for 241 bytes of input (sizes go 112 + n * 128), worst case you do...

 112 (input, len=112 sz=112)
 112 (realloc copy, len=112 sz=240)
 128 (input, len=240 sz=240)
 240 (realloc copy, len=240 sz=368)
   1 (input, len=241 sz=368)

...which copies 593 bytes (246% of the work) of data and wastes 35% of
space. For 497 bytes of input the run continues after the realloc to
368 with...

 128 (input, len=368 sz=368)
 368 (realloc copy, len=368 sz=496)
 128 (input, len=496 sz=496)
 496 (realloc copy, len=496 sz=624)
   1 (input, len=497 sz=624)

Your waste percentage goes down, just like Vstr, however with
just 497 bytes of input ggets copies, worst case, 1713 bytes which is
344% of the work (but only wastes 21% of space).

CBFalconer

Jun 2, 2003, 6:29:52 AM
James Antill wrote:
> On Mon, 02 Jun 2003 01:09:17 +0000, CBFalconer wrote:
>
> > I suggest you read the code in question first. It doesn't work
> > like that. If it did I would agree that it was foolish.
>
> No, you're code doesn't, but I replied to Paul, who is using
> power of 2 increments on expanding the buffer (with a minimum of
> 8, but that doesn't really matter).
>
> Ggets uses an initial size of 112 and increments of 128.

Now I KNOW you have read the source :-)

>
> So for 241 bytes of input, worst case you do...
>
> 112 + (n * 128)
>
> 112 (input, len=112 sz=112)
> 112 (realloc copy, len=112 sz=240)
> 128 (input, len=240 sz=240)
> 240 (realloc copy, len=240 sz=368)
> 1 (input, len=241 sz=368)
>
> 128
> 368
> 128
> 496
> 1
>
> ...which copies 593 bytes (246% of the work) of data and wastes
> 35% of space. Your waste percentage goes down, just like Vstr,
> however with just 497 bytes of input ggets copies, worst case,
> 1713 bytes which is 344% of the work (but only wastes 21% of

> space.

This is only true if realloc copies everything each time. This
depends on many things. The copy is about equal in cost to a
strlen call. For a suitable malloc/free/realloc package the
expected number of actual copies may well be zero. For an example
see my nmalloc package for DJGPP.

Also see my earlier posting on the trap of the doubling
mechanism. At any rate, all this discussion is about the internal
implementation, which can be freely altered without affecting the
ggets interface. It was designed that way.

So once your profile runs show that the bind is in ggets, feel
free to replace the implementation. In many cases a suitable
change would be one #define in the ggets source.

Paul Hsieh

Jun 2, 2003, 8:23:18 AM
"James Antill" <james-...@and.org> wrote:
> On Sat, 31 May 2003 21:33:03 +0000, Paul Hsieh wrote:
> > Since memcpy is basically O(n), your copy performance can be
> > as bad as:
> >
> > O (strlen(output) ^ 2 / (2*128))
> >
> > while the power of 2 solution can only be as bad as:
> >
> > O (strlen(output) * 2)
>
> Err, no. If the input is 17 bytes long, and you start at 1, worst case
> is...

I don't understand how to read the strange table you posted here. In
any event, it's clearly wrong. The sequence of memory allocations for
a 17-character input (to bgets) is as follows:

malloc(8); /* from a cstr2bstr ("") */
realloc(16); /* from a balloc (9) */
realloc(32); /* from a balloc (17) */

The maximum amount of extra copying is 8 + 16 for a total of 24
characters. Note that 24 < 2 * 17. It does not get *worse* for
bigger numbers -- the property of total copying and/or memory
allocations < strlen(output) * 2 is maintained for any sized input.

Of course, unlike other solutions, the bstring library is short,
readable, easy to compile and test, and is known to work on multiple
platforms (hint, hint). Playing around with it takes very little
effort. Besides being documented as having this property, the above
observation is trivial to duplicate (add some printfs in cstr2bstr,
and balloc, and write a trivial dozen-line main.)

CBFalconer's solution has the problem that for a 17-byte input, 95
additional bytes are allocated, making it 6.5 times more than
necessary. While his memory efficiency improves as the input size
increases, his copying becomes much, much worse.

In concrete terms, for inputs of size > 512, ggets will copy more
bytes than bgets, and for inputs of size < 64, ggets will
allocate more memory than bgets. For inputs of size > 1024, ggets
will perform *more* allocations than bgets. His best argument is that
asymptotically, the amount of memory allocated is 50% tighter for
ggets over bgets. But asymptotically his algorithm cannot even
complete in a reasonable amount of time. So in comparison to bgets,
ggets only looks favorable for inputs in the range of 64 to 512
characters, and even then bgets is never worse than twice as bad
(except in number of reallocations, but that is never more than 4
extra.)

Personally, I don't think CBFalconer's intuition is justified in light
of this.

Paul Hsieh

Jun 2, 2003, 8:58:57 AM
CBFalconer <cbfal...@yahoo.com> wrote:

> Paul Hsieh wrote:
> > Using powers of 2 is a no-lose scenario, and you should probably
> > consider it more seriously. Its not any harder to program.
> ... snip ...
>
> Of course it is no harder to program. If you look at my hashlib
> system I do just that.
>
> However there is a quite likely scenario in which the doubling (or
> any other ratio) system can be extremely wasteful. It applies
> whenever the expanded memory area is non-contiguous with the old,
> and the doubling system is the only thing making demands on the
> malloc system. There are also other awkward cases.

You neglected to realize that by using powers of 2, there are far
*fewer* actual occurrences of realloc. While each of the reallocs in
ggets has a higher chance of not requiring a memcpy than each one in
bgets, yours are far more numerous, and thus you will be throwing
dice more often, even though they are somewhat loaded. In the not so
atypical case of the allocation being adjacent to nothing but empty
space (think about it, it's a succession of reallocs after a malloc)
the bgets solution just ends up working better.



> The previous allocation has to exist concurrently with the new.
> The sum of all the previous releases is of sizes 1, 2, 4, 8, ....
> N, which sum up to less than 2N. Thus this space cannot be used
> for the new allocation, and must be dumped into the free space
> pool. Nothing in the system (the doubler is the only thing making
> demands) will be able to use this space. Thus the effective
> memory use both expands the actual allocation, and the free space,
> by roughly a factor of 2. Things will break much faster.

Things will break at most twice as fast as any other solution. The
lesson of the last 30 years of computing is that this is about 18
months. You also neglected to point out that the situations where
this doubled amount of memory usage in bgets is a problem correspond
to situations where ggets has serious performance problems.



> A sufficiently smart realloc system could work around this by and
> large. It would not be a large change to my nmalloc system to
> handle it, but the non-contiguous areas will always break things
> sooner or later. And most users won't be using my malloc package.

Me thinks you don't understand all aspects of the issue. The
contiguousness/fragmentation is a problem that bothers all good
allocation schemes with roughly an equal occurrence of degenerate
cases. But the other issue is the sheer *number* of reallocations.
Some poor heap algorithms accumulate strange internal structures that
just get slower and slower as more and more reallocations are made in
the system (two of the compilers on my system have this problem.)

Trust me, your solution will *shred* the heaps on many systems if you
have large amounts of input, whereas mine seems to survive by sheer
virtue of minimizing the number of times it calls realloc on difficult
to deal with sizes.



> As I said earlier, I designed ggets for the most likely
> applications, and a multi-megabyte single text line is not one I
> anticipate arising often.

So you feel ok with publicizing this non-scalable solution as an
answer to other people's problems, for which you cannot possibly know
the size of their input? As I responded to James Antill, your
solution has the additional problem of allocating far too much memory
for many many very small inputs. Your solution only works really well
for a very specific range of input sizes that isn't very good for
either large or small inputs.

> [...] strlen efficiency also falls off
> rapidly with long lines, yet it seems adequate for the majority of
> uses.

Well, I beg to differ on this point. Calling strlen is always the
slowest approach, and any string manipulation limited algorithm will
show this to be a serious performance hit (or strcat, which is really
just a strlen + strcpy.)

> [...] If you have an application where the ggets realloc actually
> is a bind, as shown by measurements, then there is a good reason
> to modify it FOR THAT APPLICATION. Meanwhile my every instinct
> says my approach is better.

Oddly, I would not recommend that bgets be modified for any reason.
Though, you may prefer breadln, which will perform better if you have
a bufferable input stream (isatty() returns false, or you just know
the stream is continuous.)

Dan Pop

Jun 2, 2003, 10:12:22 AM
In <3ed8984f$0$17187$afc3...@news.optusnet.com.au> "Simon Biber" <sbi...@optushome.com.au> writes:

>"Hallvard B Furuseth" wrote:
>> Simon Biber wrote:
>>
>> > It's easy enough to add
>> > char *p = strchr(buf, '\n');
>> > if(p != NULL) *p = '\0'; else exit(EXIT_FAILURE);
>>
>> That will fail if the last line in the file does
>> not end with newline.
>
>True. It assumes you have a properly formatted text
>file, with a newline after every line. The C standard
>leaves it implementation-defined whether a text stream
>requires a terminating newline character or not.

Therefore, well written code can never assume that the last line in a
text file is properly terminated.

I really like implementations that ignore the last line if it's not
newline-terminated.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan...@ifh.de

Eric Sosman

Jun 2, 2003, 10:51:31 AM
CBFalconer wrote:
>
> I started my ggets development with 'what attracts newbies to
> gets', and the answer was 'simplicity'. No worries about \n
> detection. No worries about actual length (HAH). Just go and get
> that string. So I tried to match those qualities in ggets.
> [...]

> 4. I would have liked to return a pointer, but that runs into
> the need for multiple error possibilities (EOF, file error,
> memory shortage). So I returned 0 for OK, EOF, or positive.

My getline() function returns a pointer, or NULL on
EOF or error or memory shortage. This doesn't mean the
caller cannot tell the three conditions apart, just that
a little extra code is necessary:

if ((ptr = getline(stream)) == NULL) {
    if (feof(stream))
        /* we reached end-of-input */;
    else if (ferror(stream))
        /* an I/O error occurred */;
    else
        /* we ran out of memory */;
}

In many instances I find it really doesn't matter
whether a failure occurred because of memory shortage
or because of I/O error; the program's future course
will have one "success" path and one "failure" path.
When that's the case, I wind up writing just a
single error test:

while ((ptr = getline(stream)) != NULL) {
    /* digest an input line */
}
if (! feof(stream))
    return A_FAILURE_CODE;
/* end-of-input processing */
return A_SUCCESS_CODE;

Like Beauty, Simplicity is the beholder's.

--
Eric....@sun.com

Dan Pop

Jun 2, 2003, 11:00:21 AM

Certainly not on systems with lazy swap space allocation, unless they
have a per-process memory usage limit. Digital UNIX was particularly
fun in this respect: when the system started to run out of virtual memory
it started to kill randomly selected processes, to make more memory
available to the processes to which it had foolishly guaranteed that
memory, without even trying to check that it was actually available!

On Linux, things are somewhat better, because malloc() checks that enough
virtual memory is available at the time it is called. However, if the
program doesn't quickly use it, another program may "steal" it and we
have the same problem as above.

The "cure" is eager swap allocation, but it has its own problems, too,
mostly caused by programs overallocating memory. Lazy swap space
allocation (especially without any kind of checking) is ideal for sparse
arrays: you just treat them as ordinary arrays, the space allocated but
not used doesn't consume any resource, except for virtual memory address
space.

John

Jun 2, 2003, 11:48:55 AM

"Dan Pop" <Dan...@cern.ch> wrote in message
news:bbfoq5$buk$1...@sunnews.cern.ch...

> In <bbci88$869bb$1...@ID-97861.news.dfncis.de> "John" <John....@btinternet.com> writes:
>
<snip>

> >
> >Much of what you say makes sense but why an upper limit? I have
> >one that returns a error if realloc fails, isn't that enough?
>
> Certainly not on systems with lazy swap space allocation, unless they
> have a per-process memory usage limit. Digital UNIX was particularly
> fun in this respect: when the system started to run out of virtual memory
> it started to kill randomly selected processes, to make more memory
> available to the process it has foolishly guaranteed that this memory is
> available, without even trying to check!
>
> On Linux, things are somewhat better, because malloc() checks that enough
> virtual memory is available at the time it is called. However, if the
> program doesn't quickly use it, another program may "steal" it and we
> have the same problem as above.
>
> The "cure" is eager swap allocation, but it has its own problems, too,
> mostly caused by programs overallocating memory. Lazy swap space
> allocation (especially without any kind of checking) is ideal for sparse
> arrays: you just treat them as ordinary arrays, the space allocated but
> not used doesn't consume any resource, except for virtual memory address
> space.

Ok, thanks for the explanation. What is a sparse array?

John

CBFalconer

Jun 2, 2003, 12:00:21 PM
Paul Hsieh wrote:
> CBFalconer <cbfal...@yahoo.com> wrote:
>
> > ... snip ...

>
> > As I said earlier, I designed ggets for the most likely
> > applications, and a multi-megabyte single text line is not one I
> > anticipate arising often.
>
> So you feel ok with publicizing this non-scalable solution as an
> answer to other people's problems, for which you cannot possibly know
> the size of their input? As I responded to James Antill, your
> solution has the additional problem of allocating far too much memory
> for many many very small inputs. Your solution only works really well
> for a very specific range of input sizes that isn't very good for
> either large or small inputs.

Well, we obviously disagree and are making no progress towards
common ground. Incidentally, ggets' present implementation
reallocs at exit, so it does not waste any memory. This is always
a reduction, and a realloc that can't do that efficiently is
broken IMO. At any rate both my and your views are now available
for posterity :-)

CBFalconer

Jun 2, 2003, 1:13:01 PM
Dan Pop wrote:
> "John" <John....@btinternet.com> writes:
> >
> > <snip>

> >
> >Much of what you say makes sense but why an upper limit? I have
> >one that returns a error if realloc fails, isn't that enough?
>
> Certainly not on systems with lazy swap space allocation, unless
> they have a per-process memory usage limit. Digital UNIX was
> particularly fun in this respect: when the system started to run
> out of virtual memory it started to kill randomly selected
> processes, to make more memory available to the process it has
> foolishly guaranteed that this memory is available, without even
> trying to check!
>
> On Linux, things are somewhat better, because malloc() checks
> that enough virtual memory is available at the time it is called.
> However, if the program doesn't quickly use it, another program
> may "steal" it and we have the same problem as above.
>
> The "cure" is eager swap allocation, but it has its own problems,
> too, mostly caused by programs overallocating memory. Lazy swap
> space allocation (especially without any kind of checking) is
> ideal for sparse arrays: you just treat them as ordinary arrays,
> the space allocated but not used doesn't consume any resource,
> except for virtual memory address space.

All these problems arise on machines where multiple processes
exist, contrary to the basic assumptions of C. A further
possibility is that the actual use of the allocated memory can
block, just like an i/o operation, until it can be satisfied.
This too would be unexpected, and have its own set of problems.

CBFalconer

Jun 2, 2003, 1:13:04 PM
John wrote:
> "Dan Pop" <Dan...@cern.ch> wrote in message
>
... snip ...

> >
> > The "cure" is eager swap allocation, but it has its own problems,
> > too, mostly caused by programs overallocating memory. Lazy swap
> > space allocation (especially without any kind of checking) is
> > ideal for sparse arrays: you just treat them as ordinary arrays,
> > the space allocated but not used doesn't consume any resource,
> > except for virtual memory address space.
>
> Ok, thanks for the explanation. What is a sparse array?

An array where a significant portion is unused. If you have an
array[10000] of something, and only use indices 0..10, 200..210,
9000..9010 for example, there is no need for actual storage for
the areas not yet used. The remainder can be considered to be
0-initialized even if it doesn't exist, by suitable games within the
virtual memory manager.

CP/M implemented sparse files (not arrays). I believe some Unices
do likewise. One more feature that BillyBoy lost in MsDos. Very
useful for databases.

CBFalconer

Jun 2, 2003, 1:13:09 PM

Fair enough. I think the point is that the main line is handled
simply, and the frills are available if needed.

CBFalconer

Jun 2, 2003, 1:13:11 PM
Dan Pop wrote:
> "Simon Biber" wrote:
>
... snip ...

> >
> > True. It assumes you have a properly formatted text
> > file, with a newline after every line. The C standard
> > leaves it implementation-defined whether a text stream
> > requires a terminating newline character or not.
>
> Therefore, well written code can never assume that the last line
> in a text file is properly terminated.
>
> I really like implementations that ignore the last line if it's
> not newline-terminated.

My first ggets implementation did just that, but it was pointed
out to me that this can leave data unrecoverable and reduces
usability. After all, the user may not be able to revise the
input file generation. See, I can be convinced my interface is
imperfect :-)

Richard Heathfield

Jun 2, 2003, 1:36:05 PM
John wrote:

>
> "Dan Pop" <Dan...@cern.ch> wrote in message
> news:bbfoq5$buk$1...@sunnews.cern.ch...

<snip>


>>
>> The "cure" is eager swap allocation, but it has its own problems, too,
>> mostly caused by programs overallocating memory. Lazy swap space
>> allocation (especially without any kind of checking) is ideal for sparse
>> arrays: you just treat them as ordinary arrays, the space allocated but
>> not used doesn't consume any resource, except for virtual memory address
>> space.
>
> Ok, thanks for the explanation. What is a sparse array?

A sparse array is an "array" in which most of the elements are not used. As
an illustration, consider a typical spreadsheet, which offers you
thousands, if not millions, of cells, but of course most of them are
normally unused. It would be a little wasteful (to say the least!) to
allocate memory for all of them, all of the time.

A Google search for "sparse array" should prove fruitful.

John

Jun 2, 2003, 2:27:20 PM

"Richard Heathfield" <dont...@address.co.uk.invalid> wrote in message
news:bbg1u4$i49$3...@hercules.btinternet.com...

> John wrote:
>
> >
> > "Dan Pop" <Dan...@cern.ch> wrote in message
> > news:bbfoq5$buk$1...@sunnews.cern.ch...
>
> <snip>
> >>
> >> The "cure" is eager swap allocation, but it has its own problems, too,
> >> mostly caused by programs overallocating memory. Lazy swap space
> >> allocation (especially without any kind of checking) is ideal for sparse
> >> arrays: you just treat them as ordinary arrays, the space allocated but
> >> not used doesn't consume any resource, except for virtual memory address
> >> space.
> >
> > Ok, thanks for the explanation. What is a sparse array?
>
> A sparse array is an "array" in which most of the elements are not used. As
> an illustration, consider a typical spreadsheet, which offers you
> thousands, if not millions, of cells, but of course most of them are
> normally unused. It would be a little wasteful (to say the least!) to
> allocate memory for all of them, all of the time.
>
> A Google search for "sparse array" should prove fruitful.

Yes I did - I found a few derogatory remarks about sparse arrays at some
URLs. Overly complicated to implement and a performance hit were two
complaints.

I've never seen or used an implementation, so I can't comment.

John


Eric Sosman

Jun 2, 2003, 3:36:21 PM
CBFalconer wrote:
>
> All these problems arise on machines where multiple processes
> exist, contrary to the basic assumptions of C. A further
> possibility is that the actual use of the allocated memory can
> block, just like an i/o operation, until it can be satisfied.
> This too would be unexpected, and have its own set of problems.

Blocking on a memory access has been with us for
more than forty years, and common for more than thirty.
Anyone who finds it "unexpected" hasn't been paying
close attention.

--
Eric....@sun.com

CBFalconer

unread,
Jun 2, 2003, 8:01:51 PM6/2/03
to
John wrote:
> "Richard Heathfield" <dont...@address.co.uk.invalid> wrote
> >
> > <snip>

> >
> > A sparse array is an "array" in which most of the elements are
> > not used. As an illustration, consider a typical spreadsheet,
> > which offers you thousands, if not millions, of cells, but of
> > course most of them are normally unused. It would be a little
> > wasteful (to say the least!) to allocate memory for all of
> > them, all of the time.
> >
> > A Google search for "sparse array" should prove fruitful.
>
> Yes I did, a few derogatory remarks in some URL's about sparse
> arrays. Overly complicated to implement and a performance hit
> were two complaints.
>
> I've never seen or used an implementation so can't comment.

If you ever implemented a linked list kept in order of some
integer field, you have used a form of sparse array.
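
For instance, here is a minimal sketch of that idea (hypothetical names,
not code from any posted package): keep only the populated elements as
(index, value) nodes in a list sorted by index.

#include <stdio.h>
#include <stdlib.h>

struct snode {              /* one populated element of the sparse array */
    long index;             /* logical array index */
    double value;           /* stored value */
    struct snode *next;     /* next populated element, in index order */
};

/* Return the value at 'index'; elements never stored read as 0.0. */
double sparse_get(const struct snode *head, long index)
{
    while (head != NULL && head->index < index)
        head = head->next;
    return (head != NULL && head->index == index) ? head->value : 0.0;
}

/* Store 'value' at 'index', keeping the list sorted by index.
   Returns 0 on success, -1 on allocation failure. */
int sparse_set(struct snode **headp, long index, double value)
{
    struct snode **pp = headp, *n;

    while (*pp != NULL && (*pp)->index < index)
        pp = &(*pp)->next;
    if (*pp != NULL && (*pp)->index == index) {
        (*pp)->value = value;
        return 0;
    }
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->index = index;
    n->value = value;
    n->next = *pp;
    *pp = n;
    return 0;
}

int main(void)
{
    struct snode *arr = NULL;   /* logically huge, physically tiny */

    if (sparse_set(&arr, 1000000L, 3.5) != 0)
        return EXIT_FAILURE;
    printf("%g %g\n", sparse_get(arr, 1000000L), sparse_get(arr, 7L));
    return 0;
}

Lookups are O(n) in the number of stored elements, which is one source of
the "performance hit" complaints; a tree or hash table trades extra
complexity for speed.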

CBFalconer

unread,
Jun 2, 2003, 8:01:53 PM6/2/03
to

The 'unexpected' would be the length of the block. People expect
to block for i/o, or for cpu time. The normal 'memory block' is
basically swapping time, while this would wait (possibly
indefinitely) for processes to release memory.

James Antill

unread,
Jun 2, 2003, 10:04:22 PM6/2/03
to
q...@pobox.com (Paul Hsieh) wrote in message news:<796f488f.03060...@posting.google.com>...

> "James Antill" <james-...@and.org> wrote:
> > On Sat, 31 May 2003 21:33:03 +0000, Paul Hsieh wrote:
> > > Since memcpy is basically O(n), your copy performance can be
> > > as bad as:
> > >
> > > O (strlen(output) ^ 2 / (2*128))
> > >
> > > while the power of 2 solution can only be as bad as:
> > >
> > > O (strlen(output) * 2)
> >
> > Err, no. If the input is 17 bytes long, and you start at 1, worst case
> > is...
>
> I don't understand how to read the strange table you posted here. In
> any event, it's clearly wrong. The sequence of memory allocations for
> a 17 character input (to bgets) is as follows:
>
> malloc(8); /* from a cstr2bstr ("") */
> realloc(16); /* from a balloc (9) */
> realloc(32); /* from a balloc (17) */
>
> The maximum amount of extra copying is 8 + 16 for a total of 24
> characters. Note that 24 < 2 * 17. It does not get *worse* for
> bigger numbers -- the property of total copying and/or memory
> allocations < strlen(output) * 2 is maintained for any sized input.

Sure, that's the _overhead_, not the cost (which is what I was talking
about). By your math, data is input into a Vstr in O(1) time, which is
not the way I'd look at it (for a start, it means mmap() must have an
initial cost of negative O(n)).
And by "worse" I meant that the amount of copying grows in the same
proportion.

So at 17 you have...

malloc(8)
realloc(16)
realloc(32)

...where you input 17, alloc 32 and could copy 24 and at 65 you have...

malloc(8)
realloc(16)
realloc(32)
realloc(64)
realloc(128)

...where you input 65, alloc 128 and could copy 120.
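
For reference, a minimal sketch of the power-of-two growth scheme being
discussed (hypothetical names; this is neither bgets nor Vstr code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append 'n' bytes from 'src' to a growing buffer described by
   '*bufp' (data), '*lenp' (bytes used) and '*capp' (bytes allocated),
   doubling the capacity as needed.  Returns 0, or -1 on failure. */
int grow_append(char **bufp, size_t *lenp, size_t *capp,
                const char *src, size_t n)
{
    if (*lenp + n > *capp) {
        size_t newcap = (*capp != 0) ? *capp : 8;
        char *tmp;

        while (newcap < *lenp + n)
            newcap *= 2;               /* powers of two, as above */
        tmp = realloc(*bufp, newcap);  /* may copy the old data once */
        if (tmp == NULL)
            return -1;
        *bufp = tmp;
        *capp = newcap;
    }
    memcpy(*bufp + *lenp, src, n);
    *lenp += n;
    return 0;
}

int main(void)
{
    char *buf = NULL;
    size_t len = 0, cap = 0;

    if (grow_append(&buf, &len, &cap, "seventeen bytes..", 17) != 0)
        return EXIT_FAILURE;
    printf("len=%lu cap=%lu\n", (unsigned long)len, (unsigned long)cap);
    free(buf);
    return 0;
}

If the 17 bytes arrive in pieces, the capacity goes 8 -> 16 -> 32 and
realloc can copy at most 8 + 16 = 24 old bytes along the way -- the
figures the two readings of "cost" above start from.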

Dan Pop

unread,
Jun 3, 2003, 7:55:27 AM6/3/03
to
In <3EDB7FDA...@yahoo.com> CBFalconer <cbfal...@yahoo.com> writes:

>Dan Pop wrote:
>> "Simon Biber" wrote:
>>
>... snip ...
>> >
>> > True. It assumes you have a properly formatted text
>> > file, with a newline after every line. The C standard
>> > leaves it implementation-defined whether a text stream
>> > requires a terminating newline character or not.
>>
>> Therefore, well written code can never assume that the last line
>> in a text file is properly terminated.
>>
>> I really like implementations that ignore the last line if it's
>> not newline-terminated.
>
>My first ggets implementation did just that, but it was pointed
>out to me that this can leave data unrecoverable and reduces
>usability. After all, the user may not be able to revise the
>input file generation.

You mean, he may not have access to a text editor to add the missing
newline in the file?

>See, I can be convinced my interface is imperfect :-)

Unfortunately it *was* perfect! (in this respect, at least :-)

Your first implementation simply "encouraged" the user to get the *broken*
program generating the file fixed. Your revised implementation provides
no such incentive, therefore more broken software will remain broken.

CBFalconer

unread,
Jun 3, 2003, 9:00:26 AM6/3/03
to
Dan Pop wrote:
> CBFalconer <cbfal...@yahoo.com> writes:
> >Dan Pop wrote:
> >> "Simon Biber" wrote:
> >>
> >... snip ...
> >> >
> >> > True. It assumes you have a properly formatted text
> >> > file, with a newline after every line. The C standard
> >> > leaves it implementation-defined whether a text stream
> >> > requires a terminating newline character or not.
> >>
> >> Therefore, well written code can never assume that the last line
> >> in a text file is properly terminated.
> >>
> >> I really like implementations that ignore the last line if it's
> >> not newline-terminated.
> >
> >My first ggets implementation did just that, but it was pointed
> >out to me that this can leave data unrecoverable and reduces
> >usability. After all, the user may not be able to revise the
> >input file generation.
> ------------^^^^^^^^^^

>
> You mean, he may not have access to a text editor to add the
> missing newline in the file?

Yes. Remember pipes? If you carry your attitude everywhere, even
a text editor won't be able to access that last line, so don't
recommend adding something like sed to the pipe.

Dan Pop

unread,
Jun 3, 2003, 9:12:39 AM6/3/03
to
In <bbg4v7$95d07$1...@ID-97861.news.dfncis.de> "John" <John....@btinternet.com> writes:


>Yes I did, a few derogatory remarks in some URL's about sparse arrays.
>Overly complicated to implement and a performance hit were two complaints.

That's precisely why you want a system with lazy swap allocation and no
checks in malloc for dealing with sparse arrays: you simply pretend that
they're regular arrays and everything is fine.
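
As a rough illustration (behaviour is entirely system-dependent; on an
eager-allocation system the call below may simply fail):

#include <stdio.h>
#include <stdlib.h>

#define N 100000000L    /* 100 million doubles: ~800 MB of address space */

int main(void)
{
    /* With lazy swap allocation the untouched elements cost only address
       space; only the pages actually written ever get backed by memory. */
    double *a = calloc(N, sizeof *a);

    if (a == NULL) {
        fputs("allocation failed\n", stderr);
        return EXIT_FAILURE;
    }
    a[12345] = 3.14;
    a[98765432] = 2.71;
    printf("%g %g %g\n", a[12345], a[98765432], a[0]);
    free(a);
    return 0;
}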

Dan Pop

unread,
Jun 3, 2003, 9:15:32 AM6/3/03
to

OTOH, blocking on a memory access because there is no more virtual
memory available on the system has never been common practice.

Dan Pop

unread,
Jun 3, 2003, 9:22:05 AM6/3/03
to

C makes no basic assumptions on that. It simply ignores their presence.

>A further
>possibility is that the actual use of the allocated memory can
>block, just like an i/o operation, until it can be satisfied.
>This too would be unexpected, and have its own set of problems.

It's a cure worse than the problem. Imagine what happens to a system where
several such processes are simultaneously running.

The best thing is to send a signal to the process and let it decide how to
proceed. If it keeps trying, unconditionally kill it.

Dan Pop

unread,
Jun 3, 2003, 10:45:36 AM6/3/03
to
In <3EDC96E0...@yahoo.com> CBFalconer <cbfal...@yahoo.com> writes:

>Dan Pop wrote:
>> CBFalconer <cbfal...@yahoo.com> writes:
>> >Dan Pop wrote:
>> >> "Simon Biber" wrote:
>> >>
>> >... snip ...
>> >> >
>> >> > True. It assumes you have a properly formatted text
>> >> > file, with a newline after every line. The C standard
>> >> > leaves it implementation-defined whether a text stream
>> >> > requires a terminating newline character or not.
>> >>
>> >> Therefore, well written code can never assume that the last line
>> >> in a text file is properly terminated.
>> >>
>> >> I really like implementations that ignore the last line if it's
>> >> not newline-terminated.
>> >
>> >My first ggets implementation did just that, but it was pointed
>> >out to me that this can leave data unrecoverable and reduces
>> >usability. After all, the user may not be able to revise the
>> >input file generation.
>> ------------^^^^^^^^^^
>>
>> You mean, he may not have access to a text editor to add the
>> missing newline in the file?
>
>Yes. Remember pipes? If you carry your attitude everywhere, even
>a text editor won't be able to access that last line, so don't
>recommend adding something like sed to the pipe.

OK, use a binary file editor instead.

The pipe issue is trivially fixed with a binary filter.

Eric Sosman

unread,
Jun 3, 2003, 11:49:57 AM6/3/03
to

Not so trivial, considering that a filter based on binary
streams may introduce extraneous zero bytes at the end of the
data. If it does so, there's trouble: you've got a final
"line" of zero bytes, still lacking a terminating '\n'. By
hypothesis the downstream program for whose benefit the filter
was introduced will ignore the extra trash, but some other
programs may not do so -- so this just introduces another
irregularity into the programmer's environment ("Should I or
should I not put a binary filter into the middle of this
pipeline?"). That may offer an excuse for blame-shifting,
but doesn't seem to me to be a good solution to the problem.

IMHO, best practice on handling text streams comes down
to the same advice that prevails in many other areas: be
lenient in what you accept and strict in what you generate.
That is, if the implementation hands you a final line with
no terminating '\n' you should do your best to accept it
(possibly issuing a warning), and all the lines you generate
yourself should be '\n'-terminated. The question of what to
do when copying lines really depends on whether the copy is
supposed to be "representationally faithful" or "semantically
faithful;" in the former case you should neither add nor
delete any data, while in the latter you should generate
well-formed lines even if you receive an '\n'-less one.
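
A small fgets-based sketch of that policy (fixed-size buffer only to keep
it short -- an over-long input line would come through in pieces): accept
a final line that lacks '\n', but terminate every line written.

#include <stdio.h>
#include <string.h>

/* Copy 'in' to 'out' line by line: a last line with no terminating '\n'
   is still accepted, and every line written out is '\n'-terminated. */
static void copy_lines(FILE *in, FILE *out)
{
    char buf[4096];

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);

        if (len > 0 && buf[len - 1] == '\n')
            buf[len - 1] = '\0';     /* strip the newline if one arrived */
        fprintf(out, "%s\n", buf);   /* always emit a well-formed line */
    }
}

int main(void)
{
    copy_lines(stdin, stdout);
    return 0;
}

This is the "semantically faithful" variant; a "representationally
faithful" copy would instead pass the data through unchanged.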

--
Eric....@sun.com

Dan Pop

unread,
Jun 4, 2003, 7:16:37 AM6/4/03
to
In <3EDCC3A5...@sun.com> Eric Sosman <Eric....@sun.com> writes:

>Dan Pop wrote:
>>
>> In <3EDC96E0...@yahoo.com> CBFalconer <cbfal...@yahoo.com> writes:
>>
>> >Dan Pop wrote:
>> >>
>> >> You mean, he may not have access to a text editor to add the
>> >> missing newline in the file?
>> >
>> >Yes. Remember pipes? If you carry your attitude everywhere, even
>> >a text editor won't be able to access that last line, so don't
>> >recommend adding something like sed to the pipe.
>>
>> OK, use a binary file editor instead.
>>
>> The pipe issue is trivially fixed with a binary filter.
>
> Not so trivial, considering that a filter based on binary
>streams may introduce extraneous zero bytes at the end of the
>data.

Not as long as it's used as a filter.

Eric Sosman

unread,
Jun 4, 2003, 10:51:33 AM6/4/03
to

Where does the Standard specify a difference between
behavior when "used as a filter" and any other behavior?
As far as I can tell, output to *any* binary stream may
have an arbitrary number of zero bytes appended:

7.19.2 Streams
/3/ A binary stream [...] may, however, have an
implementation-defined number of null characters
appended to the end of the stream.

There doesn't seem to be any special dispensation for
output to pipes, nor for hoses, straws, or bit buckets.

--
Eric....@sun.com

Dan Pop

unread,
Jun 4, 2003, 12:53:42 PM6/4/03
to

Nevertheless, people with a clue know the reason for that specification.
You could also get one from the C89 Rationale:

Even the specification of binary streams requires some changes to
accommodate a wide range of systems. Because many systems do not keep
track of the length of a file to the nearest byte, an arbitrary number
of characters may appear on the end of a binary stream directed to
a file.

James Antill

unread,
Aug 15, 2003, 5:44:38 PM8/15/03
to
On Sat, 31 May 2003 21:33:03 +0000, Paul Hsieh wrote:

> In article <3ED8A217...@yahoo.com>, cbfal...@yahoo.com says...
>> Paul Hsieh wrote:
>> > You call realloc roughly (strlen(output) / 128) times. Some
>> > weaker C library implementations perform this very badly and will
>> > artificially run out of memory far before all the heap memory is
>> > really consumed. You should use increasing powers of 2, and if
>> > you are that worried, realloc at the end to shrink it to the
>> > minimum required memory. This will guarantee that at most
>> > log_2(strlen(output)) calls to realloc are made.
>>
>> This is perfectly possible, without changing the interface one
>> iota. I deliberately chose otherwise because, at least for me,
>> the normal use is with interactive input or text files, where such
>> long lines are possible but highly unlikely.
>
> Yeah, 640K ought to be enough for anyone. There is no significant disadvantage
> for using doubling sizes. An additional issue is that a realloc can cost an
> additional memcpy (if the adjacent tail memory is not available.) Since memcpy
> is basically O(n), your copy performance can be as bad as:
>
> O (strlen(output) ^ 2 / (2*128))
>
> while the power of 2 solution can only be as bad as:
>
> O (strlen(output) * 2)

Err, no. If the input is 17 bytes long, and you start at 1, worst case
is...

1 (input, len=1 sz=1)
1 (realloc copy, len=1 sz=2)
1 (input, len=2 sz=2)
2 (realloc copy, len=2 sz=4)
2 (input, len=4 sz=4)
4 (realloc copy, len=4 sz=8)
4 (input, len=8, sz=8)
8 (realloc copy, len=8, sz=16)
8 (input, len=16 sz=16)
16 (realloc copy, len=16, sz=32)
1 (input, len=17 sz=32)

And at the end your length is 17 and you've done 48 copies
(48 being > 17 * 2) -- and this picture only gets _worse_ with bigger
numbers. Note that at the end of it you are wasting 46% of your space
(and the waste percentage _stays constant_ as the numbers increase).
That's also discounting the per call overhead of realloc().

Really the only sane way to do this is to have separate chunks of memory;
copies are then always O(n) and waste is always (chunk size / 2).
If you did it properly you could even share the chunks over multiple
connections.

Someone really needs to write something like that... (see sig. :)
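
Something along these lines (hypothetical names; a sketch, not the Vstr
implementation): data lives in a chain of fixed-size chunks, so appending
never copies what is already stored.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 128   /* arbitrary chunk size for the sketch */

struct chunk {
    struct chunk *next;
    size_t used;                 /* bytes used in this chunk */
    char data[CHUNK_SIZE];
};

struct chainbuf {
    struct chunk *head, *tail;   /* chain of chunks holding the data */
};

/* Append n bytes; existing data is never copied, so total copying is O(n)
   and wasted space is bounded by one chunk.  Returns 0, or -1 on failure. */
int chain_append(struct chainbuf *cb, const char *src, size_t n)
{
    while (n > 0) {
        struct chunk *t = cb->tail;
        size_t room, take;

        if (t == NULL || t->used == CHUNK_SIZE) {
            t = malloc(sizeof *t);
            if (t == NULL)
                return -1;
            t->next = NULL;
            t->used = 0;
            if (cb->tail != NULL)
                cb->tail->next = t;
            else
                cb->head = t;
            cb->tail = t;
        }
        room = CHUNK_SIZE - t->used;
        take = (n < room) ? n : room;
        memcpy(t->data + t->used, src, take);
        t->used += take;
        src += take;
        n -= take;
    }
    return 0;
}

int main(void)
{
    struct chainbuf cb = { NULL, NULL };

    if (chain_append(&cb, "hello, ", 7) != 0 ||
        chain_append(&cb, "world", 5) != 0)
        return EXIT_FAILURE;
    printf("%lu bytes in first chunk\n", (unsigned long)cb.head->used);
    while (cb.head != NULL) {      /* release the chunks */
        struct chunk *next = cb.head->next;
        free(cb.head);
        cb.head = next;
    }
    return 0;
}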

--
James Antill -- ja...@and.org
Need an efficient and powerful string library for C?
http://www.and.org/vstr/

CBFalconer

unread,
Aug 15, 2003, 10:35:45 PM8/15/03
to
James Antill wrote:
>
... snip ...

>
> Err, no. If the input is 17 bytes long, and you start at 1, worst
> case is...
>
> 1 (input, len=1 sz=1)
> 1 (realloc copy, len=1 sz=2)
> 1 (input, len=2 sz=2)
> 2 (realloc copy, len=2 sz=4)
> 2 (input, len=4 sz=4)
> 4 (realloc copy, len=4 sz=8)
> 4 (input, len=8, sz=8)
> 8 (realloc copy, len=8, sz=16)
> 8 (input, len=16 sz=16)
> 16 (realloc copy, len=16, sz=32)
> 1 (input, len=17 sz=32)
>
> And at the end your length is 17 and you've done 48 copies

The copy overhead is minimal, even if each realloc requires a
complete copy. You are measuring individual char copies. A
decent realloc will attempt to avoid any copying anyhow, by
searching for free space adjacent to the original.

The major overhead will be in the realloc calls themselves,
especially if the realloc has to go to the system for the space.
Again, a decent realloc will avoid this.

As ever, first get the system working CORRECTLY, then measure to
see if optimizations are worthwhile.

From the subject line, this seems to be somehow descended from my
ggets package, which uses a different strategy. It is intended to
be efficient for most normal inputs, but cope with abnormal
conditions. Go to my site, download section, for the actual
source.

Malcolm

unread,
Aug 16, 2003, 5:54:49 AM8/16/03
to

"CBFalconer" <cbfal...@yahoo.com> wrote in message

>
> As ever, first get the system working CORRECTLY, then measure to
> see if optimizations are worthwhile.
>
ggets() is designed to handle input, either from stdin or from a file. In
the case of stdin, efficiency is unimportant as long as the response delay
is not noticeable. In the case of a file, it is likely that any realloc()
overheads will be dwarfed by physical disk access times. Even if we have
some sort of virtual filing system, it is unlikely that dealing with input
would be a critical part of the code.
