If I am doing this from scratch, what is the best practice for allocating
a buffer size for the input line?
I guess open the file, scan once to determine the buffer size, then rewind and
start reading. Has this already been done or do I need to code this from
scratch?
(My project is open source, so I can utilize GPL licensed code, if necessary.)
C89-compatible code is preferred.
Mark.
--
Mark Hobley
Linux User: #370818 http://markhobley.yi.org/
Well, if you know how big your lines are, or know a reasonable
maximum, you can just use:
char buffer[1024];
fgets(buffer, sizeof buffer, file);
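In loop form, reading a whole file might look like this (a minimal
sketch; the file name is only an example):

#include <stdio.h>

int main(void)
{
    char buffer[1024];
    FILE *file = fopen("input.txt", "r");   /* example file name */

    if (file == NULL)
        return 1;
    while (fgets(buffer, sizeof buffer, file))
        fputs(buffer, stdout);   /* each call gets at most one line */
    fclose(file);
    return 0;
}

Lines longer than 1023 characters simply come back in pieces.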
> C89-compatible code is preferred.
>
Otherwise, Chuck Falconer has a function called ggets() on his
website that handles memory allocation and all that. I don't
remember the link, but Google will find it.
Richard Heathfield also has such a beast, according to the
comments in Chuck's code. Given that Richard is still around
and Chuck is not, you may be better off with that.
In either case, they're very easy functions to use.
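From memory, ggets() reads from stdin, allocates the line for you,
and strips the trailing newline; usage is roughly as below. (The
interface here is recalled, not checked, so verify it against the
actual ggets.h before relying on it.)

#include <stdio.h>
#include <stdlib.h>
#include "ggets.h"   /* assumed header name */

int main(void)
{
    char *ln;

    while (ggets(&ln) == 0) {   /* 0 on success, if memory serves */
        puts(ln);               /* line comes back without its '\n' */
        free(ln);               /* caller frees each line */
    }
    return 0;
}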
--
Andrew Poelstra
http://www.wpsoftware.net/andrew
> I want to read a text file a line at a time from within a C program. Are there
> some available functions or code already written that does this or do I need
> to code from scratch?
<snip>
> (My project is open source, so I can utilize GPL licensed code, if necessary.)
glibc (the GNU C library, which gcc normally links against) includes
getline(). If you can't use gcc and link against glibc you might be
able to use the source (though extracting parts of the library might
be fiddly).
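For reference, the getline() interface is used like this (POSIX
rather than C89, but minimal where it is available):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *line = NULL;   /* getline() allocates and grows this */
    size_t cap = 0;      /* current capacity, updated by getline() */

    while (getline(&line, &cap, stdin) != -1)
        fputs(line, stdout);
    free(line);
    return 0;
}

(On some systems you may need to define _POSIX_C_SOURCE 200809L or
similar before the includes to get the declaration.)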
<snip>
--
Ben.
There are some.
> If I am doing this from scratch, what is the best practice for allocating
> a buffer size for the input line?
Good question!
> I guess open the file, scan once to determine the buffer size, then rewind and
> start reading. Has this already been done or do I need to code this from
> scratch?
That's a very expensive way to do it. Reading is usually much more expensive
than, say, copying in memory. If you can make reasonable guesses about buffer
sizes, you should be able to do pretty well.
Have a look at fgets(), which reads a string of at most a given
length. If a line is too long for it, you can call fgets() again
to get the rest of the line.
Do you need to keep multiple lines in memory, or do you just need to look
at each one? A typical strategy I'll use for "look at each item in turn"
is basically this:
size_t line_len = 256;
char *line_data = malloc(line_len);

while (fgets(line_data, line_len, stdin)) {
    char *s;
    size_t this_line_len = strlen(line_data);

    while (this_line_len > 0 && line_data[this_line_len - 1] != '\n') {
        /* no newline yet: double the buffer and read some more,
           starting on top of the old terminating '\0' */
        s = malloc(line_len * 2);
        memcpy(s, line_data, this_line_len + 1);
        free(line_data);
        line_data = s;
        if (!fgets(line_data + this_line_len, line_len + 1, stdin))
            break;               /* EOF: last line had no '\n' */
        line_len *= 2;
        this_line_len = strlen(line_data);
    }
}
This omits quite a bit of error checking, but the basic idea is, you
pick a buffer size, and use it, and if it's not big enough, you increase
the buffer size, reallocate, then keep using that larger buffer. In
most cases, you'll probably never even reallocate once.
-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet...@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
> If I am doing this from scratch, what is the best practice for allocating
> a buffer size for the input line?
The simplest method is to start with a guess for the length of the
longest line and allocate that much. Now you use fgets() to read in
a line and check if it ends in a '\n' - if it does, everything is
ok, but if it doesn't, the line was too long to fit into the buffer
you started off with. In that case you increase the size of the
buffer, e.g. by doubling it, using realloc(), and try to read the
rest of the line by calling fgets() again (but with the first
argument pointing into the buffer where the last try stopped).
Then repeat the test for the final '\n' and keep increasing the
buffer size as necessary. If you don't run out of memory you end
up with a buffer that contains the complete line.
The only special case you may have to consider is that the last
line of a file may not end with a '\n', in which case what fgets()
reads in won't contain that character either - but if you then try
to read at the very end of the file fgets() will return NULL, so
it's possible to check for that condition.
> I guess open the file, scan once to determine the buffer size, then rewind
> and start reading.
I guess reading the file twice just to find out the length of the
longest line is too much work.
> Has this already been done or do I need to code this from
> scratch?
Probably everyone who has been faced with the problem of reading
lines of arbitrary length has written such a function at least
once;-) Here's something I found looking through my files (although
with quite a number of changes to the original, so be wary, I may
have broken it!):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN_GUESS 128

int
read_line( FILE * fp,
           char ** line )
{
    static char *buf = NULL;
    static size_t buf_len = LEN_GUESS;
    char *p = buf;
    size_t rem_len = buf_len;

    if ( ! fp || ! line )
        return -1;                 /* bad argument(s) */

    if ( ! buf && ! ( buf = p = malloc( buf_len ) ) )
        return -1;                 /* running out of memory */

    *buf = '\0';

    while ( 1 )
    {
        size_t len;
        size_t off;
        char *tmp;

        if ( ! fgets( p, rem_len, fp ) )
        {
            if ( ferror( fp ) )
                return -1;         /* read failure */
            break;                 /* EOF, possibly without final '\n' */
        }

        len = strlen( p );
        if ( len > 0 && p[ len - 1 ] == '\n' )
            break;                 /* got a complete line */

        /* Line so far didn't fit: double the buffer. The offset at
           which to continue reading must be taken before the
           realloc(), since realloc() may move the buffer. */
        off = ( p - buf ) + len;
        if ( ! ( tmp = realloc( buf, 2 * buf_len ) ) )
            return -1;             /* running out of memory */

        buf = tmp;
        p = buf + off;
        rem_len += buf_len - len;
        buf_len *= 2;
    }

    *line = buf;
    return feof( fp ) ? 1 : 0;     /* indicate if EOF has been reached */
}
Note that it's, of course, not thread-safe. And when you call it
again the last line returned will be overwritten. When you don't
need to call the function anymore you should free() the returned
pointer.
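A sketch of how it might be called (my example, not part of the
original code):

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    FILE *fp = fopen( "input.txt", "r" );   /* example file name */
    char *line = NULL;
    int r;

    if ( ! fp )
        return 1;
    while ( ( r = read_line( fp, &line ) ) >= 0 )
    {
        fputs( line, stdout );
        if ( r == 1 )        /* EOF reached */
            break;
    }
    free( line );            /* free the (static) buffer once, at the end */
    fclose( fp );
    return 0;
}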
> (My project is open source, so I can utilize GPL licensed code, if
> necessary.) C89-compatible code is preferred.
Use it for whatever you want if it fits your needs (but better check
carefully that it works; it's not my tested version, I just checked
that it compiles!). And, of course, there are quite a number of ways
it could be improved; it's more meant to give you a better idea of
how it could be done.
Regards, Jens
--
\ Jens Thoms Toerring ___ j...@toerring.de
\__________________________ http://toerring.de
It is not uncommon for C programmers
to write their own getline function.
Mine is called get_line.
int get_line(char **lineptr, size_t *n, FILE *stream);
--
pete
I just use a fixed size, big enough for text files that are line-oriented.
I've just checked and I'm using a 2KB buffer, but it could be much higher if
memory allows.
If the lines are longer than that sort of size, the file probably isn't
line-oriented and could do with a different approach. (Or might use a
different newline convention from that expected. Either way, you have a file
that is not in the right format.)
> I guess open the file, scan once to determine the buffer size, then rewind
> and
> start reading. Has this already been done or do I need to code this from
> scratch?
For files that might work (although pedants might say that by the second
read, someone could have written a longer line to the file). For devices
such as consoles I'm not sure it would work.
--
Bartc
--- news://freenews.netfront.net/ - complaints: ne...@netfront.net ---
"still around" meaning that Richard still posts here in comp.lang.c;
Chuck used to, but hasn't lately.
> In either case, they're very easy functions to use.
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
> "Mark Hobley" <markh...@hotpop.donottypethisbit.com> wrote in message
> news:i29l77-...@neptune.markhobley.yi.org...
>>I want to read a text file a line at a time from within a C program. Are
>>there
>> some available functions or code already written that does this or do I
>> need
>> to code from scratch?
>>
>> If I am doing this from scratch, what is the best practice for allocating
>> a buffer size for the input line?
>
> I just use a fixed size, big enough for text files that are line-oriented.
>
> I've just checked and I'm using a 2KB buffer, but it could be much higher if
> memory allows.
>
> If the lines are longer than that sort of size, the file probably isn't
> line-oriented and could do with a different approach. (Or might use a
> different newline convention from that expected. Either way, you have a file
> that is not in the right format.)
I have two CSV files I'm using at the moment whose longest lines have
2201 and 2306 bytes, and one old one with a 10155-byte line. It's hard
to put an upper limit on what is reasonable. Today's absurd is
tomorrow's "pah!".
<snip>
--
Ben.
Yes, I wrote a piece of code to do just that and incorporated in it
helpful input from other people on comp.lang.c.
http://codewiki.wikispaces.com/xbuf.c
The section on reading lines shows what you are looking for and also
why the code was needed, i.e. problems with other solutions.
James
and what does your program do?
> > I guess open the file, scan once to determine the buffer size, then rewind
> > and
> > start reading. Has this already been done or do I need to code this from
> > scratch?
>
> For files that might work (although pedants might say that by the second
> read, someone could have written a longer line to the file). For devices
> such as consoles I'm not sure it would work.
>
> --
> Bartc
>
> --- news://freenews.netfront.net/ - complaints: n...@netfront.net ---
>> I've just checked and I'm using a 2KB buffer, but it could be much higher
>> if
>> memory allows.
>>
>> If the lines are longer than that sort of size, the file probably isn't
>> line-oriented and could do with a different approach. (Or might use a
>> different newline convention from that expected. Either way, you have a
>> file
>> that is not in the right format.)
>
> and what does your program do?
On input of:
abcdefghijklmnopqrstuvwxyz
and a buffer size (for fgets()) of 10 characters, it just splits the lines:
abcdefghi
jklmnopqr
stuvwxyz
After adding a few more lines of code, it truncates to:
abcdefghi
which seems better (line-oriented doesn't mean free-format). Perhaps it
should also signal when a truncation has occurred.
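Something along these lines would do the signalling (a sketch; the
function and variable names are just illustrative):

#include <stdio.h>
#include <string.h>

/* Reads one line into a fixed buffer; discards the rest of an
   over-long line and reports the truncation via *truncated.
   Returns 0 at EOF (or error), 1 otherwise. */
int get_fixed_line(char *buf, int size, FILE *fp, int *truncated)
{
    size_t len;
    int c;

    *truncated = 0;
    if (!fgets(buf, size, fp))
        return 0;
    len = strlen(buf);
    if (len > 0 && buf[len - 1] != '\n' && !feof(fp)) {
        *truncated = 1;
        while ((c = getc(fp)) != EOF && c != '\n')
            ;   /* swallow the remainder of the line */
    }
    return 1;
}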
What seems wrong is to let the input file dictate to you some ridiculous
'line length' of perhaps a billion characters, and to go along with that.
--
Bartc
--- news://freenews.netfront.net/ - complaints: ne...@netfront.net ---
>> I've just checked and I'm using a 2KB buffer, but it could be much higher
>> if
>> memory allows.
>>
>> If the lines are longer than that sort of size, the file probably isn't
>> line-oriented and could do with a different approach.
> I have two CSV files I'm using at the moment whose longest lines have
> 2201 and 2306 bytes and one old one with a 10155 byte line. It's hard
> to put an upper limit on what is reasonable. Today's absurd is
> tomorrow's "pah!".
The text file format is being abused then. This sounds like an export from a
database or spreadsheet. It's not text, unless you're used to reading pages
60 feet wide.
If you already have code for a flexible getline(), then just use it.
Otherwise the next step up from a hard-coded size is a one-time allocated
buffer which remains the same size. Bung 20KB (or 200KB) in there, and have
done with it. But there is always going to be some file or other which is
going to cause a problem.
You may find the getfline function in
http://home.tiac.net/~cri_a/san/source_code/utl/
useful. getfline.c is in the src directory and getfline.h is in
the include directory. The code is ANSI C89 and has a BSD-style
license.
Richard Harter, c...@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
It's not much to ask of the universe that it be fair;
it's not much to ask but it just doesn't happen.
What seems wrong to me is to let limitations in the program impose
some arbitrary limit on line length, when the input format you're
trying to process imposes no such limit.
If a file format specifies a maximum line length, then by all means go
with that (and ideally report an error for any line that exceeds the
limit, unless the format specification says that characters past the
maximum are quietly ignored). If it doesn't, then handling
arbitrarily long lines is better than imposing *any* limit other than
what's imposed by available memory.
And if the file format doesn't impose a maximum length but you're
unwilling to handle very long lines, IMHO you should at least report
an internal error if you see a line longer than you can handle.
> "Ben Bacarisse" <ben.u...@bsb.me.uk> wrote in message
> news:0.e23765669a04066f4532.2010...@bsb.me.uk...
>> "bartc" <ba...@freeuk.com> writes:
>>
>>> "Mark Hobley" <markh...@hotpop.donottypethisbit.com> wrote in message
>>> news:i29l77-...@neptune.markhobley.yi.org...
>>>>I want to read a text file a line at a time from within a C program. Are
>
>>> I've just checked and I'm using a 2KB buffer, but it could be much
>>> higher if
>>> memory allows.
>>>
>>> If the lines are longer than that sort of size, the file probably isn't
>>> line-oriented and could do with a different approach.
>
>> I have two CSV files I'm using at the moment whose longest lines have
>> 2201 and 2306 bytes and one old one with a 10155 byte line. It's hard
>> to put an upper limit on what is reasonable. Today's absurd is
>> tomorrow's "pah!".
>
> The text file format is being abused then. This sounds like an export
> from a database or spreadsheet. It's not text, unless you're used to
> reading pages 60 feet wide.
The structure is line-oriented. It should be read in text mode and a
line ends when you see '\n'. I call that a text file.
> If you already have code for a flexible getline(), then just
> it. Otherwise the next step up from a hard-coded size is a one-time
> allocated buffer which remains the same size. Bung 20KB (or 200KB) in
> there, and have done with it.
These solutions work, of course. I was just disputing the idea that
there is some maximum line length beyond which something stops being a
text file.
<snip>
--
Ben.
No need for limits.
1) Read the entire file into one buffer using fread(), realloc()ing
when needed.
2) Make a second pass over the buffer: find the line endings, handle
\r\n, replace them by '\0', and save the beginnings of the lines in an
array of pointers, realloc()ing when needed.
3) Make a third pass: process each line, searching for commas,
replacing them by '\0' and saving pointers to the beginnings,
realloc()ing when needed.
Steps 2 and 3 need to take care of quoting / escaping.
Steps 1, 2 and 3 _can_ be combined into one state machine.
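For step 1, something like this works (a sketch; quoting/escaping and
the later passes are left out, and the names are mine):

#include <stdio.h>
#include <stdlib.h>

/* Read the whole stream into one buffer, doubling it as needed. */
char *slurp(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap), *tmp;

    if (!buf)
        return NULL;
    while ((n = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += n;
        if (len == cap) {   /* buffer full: double it */
            if (!(tmp = realloc(buf, cap *= 2))) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}

The second pass then walks buf, replacing each '\n' (and any '\r'
before it) with '\0' and recording where the lines start.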
HTH,
AvK
OK, but then be prepared for your getline() function to actually need to be
a getfile() function with some input, and to potentially grab most of the
memory in your system, or even to bring down the program (if a giant file
uses the wrong newline format for example).
--
Bartc
Or don't read an entire line into memory at a time. For example,
if you're reading an XML file -- well, you should be using an
XML parser that somebody else has already written. But if you're
writing an XML parser for some reason, it might make more sense to
read and store input until you see a '<' or '>' rather than '\n'.
I've seen XML files with extremely long lines, but not with extremely
long tag names.
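A fixed-buffer sketch of that kind of delimiter-driven reading (the
function name is illustrative; a growing-buffer version would follow
the same doubling pattern shown earlier in the thread):

#include <stdio.h>

/* Like fgets(), but stops at 'delim' instead of '\n'; the delimiter
   is kept in the buffer. Returns the length read, or -1 at EOF with
   nothing read. */
int read_until(char *buf, int size, int delim, FILE *fp)
{
    int c = EOF;
    int n = 0;

    while (n < size - 1 && (c = getc(fp)) != EOF) {
        buf[n++] = (char)c;
        if (c == delim)
            break;
    }
    buf[n] = '\0';
    return (n == 0 && c == EOF) ? -1 : n;
}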
But yes, sometimes it does make sense to read entire lines into memory
at once, even if they might be inordinately long.
>>> What seems wrong to me is to let limitations in the program impose
>>> some arbitrary limit on line length, when the input format you're
>>> trying to process imposes no such limit.
>>
>> OK, but then be prepared for your getline() function to actually need
>> to be a getfile() function with some input
> Or don't read an entire line into memory at a time. For example,
> if you're reading an XML file -- well, you should be using an
> XML parser that somebody else has already written. But if you're
> writing an XML parser for some reason, it might make more sense to
> read and store input until you see a '<' or '>' rather than '\n'.
> I've seen XML files with extremely long lines, but not with extremely
> long tag names.
I think XML is one of those text formats (like C source files and
HTML) which are not really line-oriented; newline is just another
whitespace character.
In that case, if you don't use a dedicated file reader as you've suggested,
you can't really use simple line-input.
--
Bartc
Quibble: C preprocessor directives are line-oriented. And a C
compiler is allowed to impose a maximum line length on source files.
> In that case, if you don't use a dedicated file reader as you've
> suggested, you can't really use simple line-input.
Sure you can, as long as your simple line-input can handle arbitrarily
long lines (and you have enough memory to store them). Admittedly
it might not be the ideal solution.