
Dynamic C Strings (libclc?)


Walter Faxon

Apr 23, 2003, 2:38:28 AM
In the May 2003 issue of C/C++ Users Journal, Daniel Lawrence has
written an article, "An Idea for Dynamic C Strings", which outlines a
number of C-compatible string functions that do bounds-checking and
allow for automatic string growth. The string data type provided is a
standard \0-terminated char*, the value of which can be passed to any
function expecting char*. But new functions are supplied such as
string_cat(), string_sprintf(), and string_getline(), which will size
their resulting malloc'ed/realloc'ed strings, avoiding buffer
overflow.

The method is simple: each dynamic string address is used as a key to
a lookup table. (Lawrence's implementation is a plain array which he
admits is suboptimal, but even with a hashed table, memoizing the last
reference or so might prove effective.) The table notes the allocated
size of each string, so the routines know when a string must be
expanded. Strings not controlled by the system are not in the table,
and so can be recognized and dealt with correctly.

Lawrence has a version of this library at his company's web site,
www.alphazed.co.uk/software/toolbox/. A demo application is at
www.cuj.com/code and www.alphazed.co.uk/software/path/.

Something like this facility should be in a recognized standard
library! Perhaps libclc.

-- Walter

Malcolm

Apr 23, 2003, 1:45:16 PM

"Walter Faxon" <wfa...@gis.net> wrote in message
>
[ new string library snipped]

> Something like this facility should be in a recognized standard
> library! Perhaps libclc.
>
Almost everyone at some time has an idea for a new C string library. Even
Microsoft announced one recently. This is a job for a standards body. The
problem is that all code becomes dependent on the new library, and also
harder to understand for programmers brought up on the standard library.


Michael B. Allen

Apr 23, 2003, 3:24:05 PM

Also, I think it's a bad idea to have any predefined format for a string
ADT, or to assume that the object was allocated in a certain way or can
be resized, etc. There are so many different encodings that it is much
better to just create functions for manipulating chunks of memory that
encode a string. The user would know in advance that the string was
encoded in one format or in a class (e.g. multibyte locale-dependent
encoding vs. 16-bit encoding) and choose functions appropriately
(e.g. mbs* vs. wcs*).

In practice string ADTs are overrated. They are performance killers.

Mike

--
A program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes the potential for it to be applied to tasks that are
conceptually similar and, more important, to tasks that have not
yet been conceived.

Walter Faxon

Apr 24, 2003, 12:39:48 AM
"Michael B. Allen" <mba...@ioplex.com> wrote in message news:<pan.2003.04.23.15....@ioplex.com>...

> On Wed, 23 Apr 2003 13:45:16 -0400, Malcolm wrote:
>
> > "Walter Faxon" <wfa...@gis.net> wrote in message
> >>
> [ new string library snipped]
> >> Something like this facility should be in a recognized standard
> >> library! Perhaps libclc.
> >>
> > Almost everyone at some time has an idea for a new C string library.
> > Even Microsoft announced one recently. This is a job for a standards
> > body. The problem is that all code becomes dependent on the new library,
> > and also harder to understand for programmers brought up on the standard
> > library.
>
> Also, I think it's a bad idea to have any predefined format for a string
> ADT or to assume that the object was allocated in a certain way or can
> be resized etc. There are so many different encodings it is much better
> to just create functions for manipulating chunks of memory that encode
> a string. The user would know in advance that the string was encoded in
> one format or in a class (e.g. multibyte locale dependant encoding vs. 16
> bit encoding) and choose functions appropriately (e.g. mbs* vs. wcs*).
>
> In practice string ADTs are overrated. They are performace killers.
>
> Mike


I agree that C string ADTs can cause performance problems. They can
also speed up program development and make certain error-prone
functions more secure. If subsequent profiling shows they are a
significant drain on performance, they can be replaced, as any part of
a program can.

Lawrence's method features plain C strings (\0-terminated char*). Any
routine which just reads or writes to char* can continue to do so.
Only the ADT routines, e.g., the replacements for the often
problematic strcat(), sprintf(), and fgets() (a function often
mentioned in this group), take advantage of the ability to resize
them. The only important things to remember for strings under his
library's control are: (1) strings allocated by his library should be
freed by it (unless at program termination), and (2) don't make copies
of a string pointer (or internal pointers) if you expect to later call
a potentially resizing routine on it. This is at least much simpler
than always having to keep track of string buffer sizes by hand.

The standard C library is a very low-level library and has not had a
significant change in years. The commercial libraries tend to be at a
much higher level and are all incompatible with each other. Standards
bodies standardize what most people are already doing, maybe fixing
some problems and inconsistencies. So what standards body is
anointing a new next-level C string library?

Maybe we should just bite the bullet and go straight to C++.

-- Walter

Mamadou Moustapha Kante

Apr 24, 2003, 4:24:16 AM
Walter Faxon wrote:

Effectively, the standard C library is low-level, and what gives C its
performance is that it is a low-level language. So if we want fast
programs, I don't think a string library is a good thing; but why not,
for those who just want a program that doesn't break down, because
everyone knows that manipulating char* is not easy and can cause many
runtime errors. But what we shouldn't forget is that what we like in C
is its ability to let us manipulate pointers and program whatever we
want; if we move toward these kinds of libraries, we give that up.
tafiscobar.

Dan Pop

Apr 24, 2003, 8:06:53 AM

>The standard C library is a very low-level library

This is a deliberate choice: C itself is a (relatively) low-level
programming language.

>and has not had a significant change in years.

Don't expect any to happen.

>The commercial libraries tend to be at a
>much higher level and are all incompatible with each other. Standards
>bodies standardize what most people are already doing, maybe fixing
>some problems and inconsistencies. So what standards body is
>anointing a new next-level C string library?

None. In principle, the ISO C committee could do that, but in practice
they won't.

Any higher level string library will have to make some trade offs. But
whatever trade offs are optimal for one application may be sub-optimal
for the next application. Therefore, each application is free to define
its own high level string library that matches its own requirements in an
optimal manner.

In practice, most applications can happily live with the low-level C
strings, so there never has been a strong pressure for introducing
higher level strings.

>Maybe we should just bite the bullet and go straight to C++.

That's certainly a sensible choice if you want to program at a higher
level than the one offered by C. Much more sensible than trying to raise
C's level.

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan...@ifh.de

Michael B. Allen

Apr 24, 2003, 4:13:52 PM
On Thu, 24 Apr 2003 00:39:48 -0400, Walter Faxon wrote:

> significant change in years. The commercial libraries tend to be at a
> much higher level and are all incompatible with each other. Standards

The problem is that in many cases strings are not treated like objects
to be passed around, created, and destroyed frequently.

For example, in network servers strings are usually referenced or decoded
in-situ. You don't want to copy little bits out of network messages into
allocated chunks of memory that then need to be freed.

For string-processing-intensive applications this in-place technique is
superior. If it is done properly it is also safer, but that depends on
the specific problem.

If you want a higher-level conceptually easier experience then I think
you want C++.

If a library for handling strings were ever introduced into standard C,
I would think it would focus more on manipulation of internationalized
strings. For example, functions that can operate on strings regardless
of encoding, along with a mechanism for converting between the various
encodings.

Dan Pop

Apr 25, 2003, 7:25:00 AM
In <pan.2003.04.24.16....@ioplex.com> "Michael B. Allen" <mba...@ioplex.com> writes:

>If a library for handling strings were ever introduced into standard C
>I would think it would focus more on manipulation of internationalized
>strings. For example, functions that can operate on strings regardless
>of encoding along with a mechnaism for converting between the various
>encodings.

C99 adds tons of new functions for this very purpose. Unfortunately,
the thing seems to be severely plagued by the "design by committee"
syndrome.

Paul Hsieh

May 13, 2003, 6:38:12 PM
This architecture has some problems:

1. What if a parameter is given with an integer added to a base char *
address? The lookup for it will either be necessarily expensive (by
essentially looking through *all* addresses for the base and realizing
it will cover the given address), or yield a no-match, in which case,
none of the safety semantics will work.

2. If an operation implicitly performs a realloc, then the original
pointer may become part of the free space in the heap (since a realloc
typically may not be able to simply expand the buffer in place). Thus
you need to somehow update every place in your data space which makes
reference to your string -- certainly there's nothing in the language
that will help you here, and the problem listed above compounds this
one quite severely (i.e., you can't even try to walk your data
segment looking for pointers to update.)

3. What happens if a string is freed and then a future allocation is
made which overlaps the old pointer? Does the look-up table just get
gummed up with obsolete addresses, with more potential false hits
related to the scenario described in #1 above?

4. A lot of the problem with the C library is that it's incomplete and
doesn't do much to use any kind of type checking in the compiler to
help assist the programmer in getting their string manipulations
right. From the functions listed, it appears as though this won't do
anything to alleviate that issue.

5. What if the char buffer has been declared from a char[]? Then no
realloc is possible, and therefore the safety semantics cannot be
applied.

Personally, I am not impressed with this approach, as I don't think it
can be made to work completely correctly in general. The extreme
requirement of using char[]/char * as the basic string type is just
too much. I think the right answer is to simply create a standard ADT
which substantially mitigates all of the typical problems. I have my
own proposal on how to do that here:

http://bstring.sourceforge.net/

The right answer is good char * interoperability (as much as is
possible given the problems with using char *'s as a string type),
good functionality (comparable to other languages which advertise good
string support), and good performance (no sweat since I avoid scanning
for '\0' to figure out the length of a string), which I believe
bstrlib delivers on in all cases.

--
Paul Hsieh
http://bstring.sourceforge.net/

Walter Faxon

May 14, 2003, 2:17:02 AM
q...@pobox.com (Paul Hsieh) wrote in message news:<796f488f.03051...@posting.google.com>...

<snip>

Hi, Paul.

I heartily approve of any attempt to develop and promulgate a decent
standard dynamic string ADT for C. Your bstring library seems an
excellent start. All professional C programmers should at least take
a look at it:

http://bstring.sourceforge.net/

-- Walter

"The most important data type in computing is character data."

Michael B Allen

May 14, 2003, 5:28:36 AM
On Wed, 14 May 2003 02:17:02 -0400, Walter Faxon wrote:

> I heartily approve of any attempt to develop and promulgate a decent
> standard dynamic string ADT for C. Your bstring library seems an

I have two words for you:

pointer arithmetic

It's the most valuable and powerful feature in C. I do not think that
C should have a string type that tries to manage memory for you. In
Java strings are major performance killers and the benefits aren't even
that great. IMO it is superior to think of the memory representing a
string *as* the "ADT" and simply provide functions for operating on
that object. For example, a network server would likely just decode
strings in the buffer passed to read(). Why bother to allocate memory,
copy it out, do stuff, and then free it over and over? You don't have
to do that. The memory holding the string *is* the ADT and you can do
that because of pointer arithmetic.

I would much rather see string functions that abstract
internationalization across disparate platforms. For example, Windows
permits the user to compile in ANSI or wide-character mode using simple
typedefs like:

#ifdef _UNICODE
typedef wchar_t TCHAR;
#define TEXT(s) L ## s
#else
typedef unsigned char TCHAR;
#define TEXT(s) s
#endif

This idea could be extended to permit programs to be written once
and compiled on both Unix and Windows using 8-bit or wide character
encoding. So you could write a program on Unix that could be compiled
using wchar_t, UTF-8, ASCII, or whatever locale encoding, and then copy
it over to Windows and compile it as wchar_t (UTF-16LE) or ANSI (the
locale encoding). I've been thinking about this a lot lately and so far
it seems very doable.

If there's going to be any forward movement in the area of string handling
in C it would have to focus on cross platform internationalization. If
you want an "ADT" use C++. That's exactly the kind of thing it was
designed for.

Mike

Paul Hsieh

May 14, 2003, 2:43:11 PM
Michael B Allen <mba...@ioplex.com> wrote:
> On Wed, 14 May 2003 02:17:02 -0400, Walter Faxon wrote:
> > I heartily approve of any attempt to develop and promulgate a decent
> > standard dynamic string ADT for C. Your bstring library seems an
>
> I have two words for you:
>
> pointer arithmetic

bstrings are just wrapped char * buffers. The bstring library is
completely interoperable with ordinary char * semantics, as a result.
So if you want to do pointer arithmetic, that's fine; it will work.
You can even re-wrap your results fairly easily, if it makes sense for
you to do that. The bstring library is aliasing aware (unlike the
standard C char * library) so this will not lead to problems.

The bstring library also comes with support for all of the standard
string manipulation functions from C and most from other languages,
meaning that grungy low level access to the wrapped char buffer is
usually unnecessary.



> It's the most valuable and powerful feature in C. I do not think that
> C should have a string type that tries to manage memory for you. In
> Java strings are major performance killers and the benefits aren't even
> that great.

Java's strings are all Unicode, and very few processors have really
excellent 16-bit data type support. But in any event, Unicode strings
just end up chewing twice as much memory. My own testing shows that
bstrings incur relatively little overhead cost, and of course, in
many cases will obliterate the standard C char * library in
performance.

> [...] IMO it is superior to think of the memory representing a
> string *as* the "ADT" and simply provide functions for operating on
> that object. For example, a network server would likely just decode
> strings in the buffer passed to read(). Why bother to allocate memory,
> copy it out, do stuff, and then free it over and over? You don't have
> to do that. The memory holding the string *is* the ADT and you can do
> that because of pointer arithmetic.

For sensitive real time and high performance network applications, I
would recommend looking at the Vstr library:

http://www.and.org/vstr/

which is architected for "zero-copy" operations even for things like
string concatenation. Its radically different implementation will
allow it to do really well on some kinds of benchmarks, however, it
will fall short on others. Either way, the standard C char * library
approach is not the optimal answer in modern programming environments.

> I would much rather see string functions that abstract
> internationalization across disparate platforms.

This would give you all the performance of Java's strings. But in any
event it would be hard to match the functionality of Java because it
uses unicode for *all* strings, including constant ones just declared
in your program between a pair of quotes. You need to augment the C
standard to truly address internationalization support. (Yes, I know
that Windows compilers do have extensions for this.)

> [...] If you want an "ADT" use C++. That's exactly the kind of thing it was
> designed for.

bstring comes with a C++ interface as well. However, C++ offers very
little concrete advantage over C for this kind of data type (automatic
bounds checking is the only thing truly different.) Hence it would
almost be a crime to make a C++ ADT without also making a C
implementation.

Michael B Allen

May 14, 2003, 4:17:41 PM
On Wed, 14 May 2003 14:43:11 -0400, Paul Hsieh wrote:

> Michael B Allen <mba...@ioplex.com> wrote:
>> On Wed, 14 May 2003 02:17:02 -0400, Walter Faxon wrote:
>> > I heartily approve of any attempt to develop and promulgate a decent
>> > standard dynamic string ADT for C. Your bstring library seems an
>>
>> I have two words for you:
>>
>> pointer arithmetic
>
> bstring's are just wrapped char * buffers. The bstring library is
> completely interoperable with ordinary char * semantics, as a result. So
> if you want to do pointer arithmetic, that's fine, it will work. You can

I think you missed the point. I'll repeat it:

A string "ADT" is unnecessary in C. This is because C supports pointer
arithmetic, which permits in-situ string manipulation, whereas a string
"ADT" suggests that memory to hold (and thus abstract) strings is
frequently allocated and freed. This incurs performance penalties
similar to those of string handling in languages that do not support
pointer arithmetic.

> But in any event, unicode strings just end up chewing twice as much
> memory.

Unicode has many forms. UTF-8 for example is an 8-bit multibyte
encoding. Also don't expect your library to be popular outside the US.

> My own testing shows that
> bstrings incurr relatively little overhead cost, and of course, in many
> cases will obliterate the standard C char * library in performance.

Your bstrlib.c file contains 27 separate calls to *alloc functions. This
suggests that your string type is immutable and that operations on
strings have a tendency to create new strings. This is precisely my
point about Java's string handling. Java's string performance is not
poor because it uses 16-bit data units -- it's primarily because it
creates and destroys so many little objects so frequently.

If you have a good string performance test and benchmarks to show your
string handling technique is competitive then I suggest you make that
data available. If you need a good test, try creating a function for
extracting strings from a line of text in CSV (comma-separated values)
format. Then compare its performance to this csv_row_parse function:

http://www.ioplex.com/~miallen/libmba/dl/src/csv.c

Mike

Bjørn Augestad

May 14, 2003, 5:23:07 PM
Michael B Allen wrote:
[snip]

>
> A string "ADT" is unnecessary in C. This is because C supports pointer
> arithmetic which permits in-situ string manipulation whereas a string
> "ADT" suggests that memory to hold (and thus abstrct) strings is
> frequently allocated and freed. This incurs performance penalties
> similar to that of string handling in languages that do not support
> pointer artihmetic.
>

Performance isn't always needed, and the use of regular C strings can be
quite error-prone. It's way too easy to overwrite buffers or introduce
off-by-n bugs; ask any cracker knocking on your firewall.

Sometimes performance is needed, sometimes it isn't. Some people use
profilers to see when and where performance is most needed. ;-)

Choice is good, and libclc should IMHO have room for a string ADT. We
almost have a good set of "low level" string functions now, and I see
nothing wrong in adding a "high level" string ADT as well.

The hardest part in designing a string ADT, and I'm serious here, is to
find a proper name for it. :-)

How many would like to see a string ADT in libclc? Anyone who doesn't
want it included?

[snip]


--
boa

libclc home: http://libclc.sourceforge.net

Chris Torek

May 14, 2003, 7:20:21 PM
In article <%sywa.3724$Y67.2...@juliett.dax.net>
Bjørn Augestad <b...@metasystems.no> writes:
>Performance isn't always needed, and the use of regular C strings can be
>quite error-prone. It's way too easy to overwrite buffers or introduce
>off-by-n bugs; ask any cracker knocking on your firewall. ...

If we judge by the desktop software with the largest market
penetration, users have chosen "get the wrong answer fast" over
"get the right answer more slowly" approximately four to one.

Fortunately, not all markets are quite so foolish. :-) Then again,
when it comes to pacemakers, getting the right answer too slowly
may be just as bad as getting the wrong answer quickly.

>Sometimes performance is needed, sometimes it isn't. Some people use
>profilers to see when and where performance is most needed. ;-)

Indeed.
--
In-Real-Life: Chris Torek, Wind River Systems (BSD engineering)
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)
email: forget about it http://67.40.109.61/torek/ (for the moment)
Reading email is like searching for food in the garbage, thanks to spammers.

Paul Hsieh

May 16, 2003, 5:33:15 AM
Michael B Allen <mba...@ioplex.com> wrote:
> On Wed, 14 May 2003 14:43:11 -0400, Paul Hsieh wrote:
> > Michael B Allen <mba...@ioplex.com> wrote:
> >> On Wed, 14 May 2003 02:17:02 -0400, Walter Faxon wrote:
> >> > I heartily approve of any attempt to develop and promulgate a decent
> >> > standard dynamic string ADT for C. Your bstring library seems an
> >>
> >> I have two words for you:
> >>
> >> pointer arithmetic
> >
> > bstring's are just wrapped char * buffers. The bstring library is
> > completely interoperable with ordinary char * semantics, as a result. So
> > if you want to do pointer arithmetic, that's fine, it will work. You can
>
> I think you missed the point. I'll repeat it;
>
> A string "ADT" is unnecessary in C. This is because C supports pointer
> arithmetic which permits in-situ string manipulation

Are you unaware of the phenomenon of the "buffer overflow"? It's not
just a bug that is resolved by educating your programmer -- it's a
*scourge* that has infected countless software projects. It's the #1
bug, and it's everywhere. If you look a little closer at bstring you
will see that it's designed not to take away any flexibility or
functionality of the string manipulation capabilities that C has by
default. It is just a library geared at making string manipulation
much safer, and along the way making it easier and in some cases
faster.

> [...] whereas a string
> "ADT" suggests that memory to hold (and thus abstract) strings is
> frequently allocated and freed.

What a string "ADT" suggests to you is interesting, but has no bearing
on the actual contents of the bstring library.

> [...] This incurs performance penalties
> similar to those of string handling in languages that do not support
> pointer artihmetic.
>
> > But in any event, unicode strings just end up chewing twice as much
> > memory.
>
> Unicode has many forms. UTF-8 for example is an 8-bit multibyte
> encoding. Also don't expect your library to be popular outside the US.

Well, that's ok, programming in general is not popular outside the US.
Seriously though, adding internationalization support is just not a
condition I would be able to satisfy in my library while also
satisfying complete interoperability with the existing char *
semantics.



> > My own testing shows that
> > bstrings incurr relatively little overhead cost, and of course, in many
> > cases will obliterate the standard C char * library in performance.
>
> Your bstrlib.c file contains 27 separate calls to *alloc functions.

Oh great -- you are arguing via grep | wc. My library is not that
hard to understand. Try reading the code or the documentation and you
will see for yourself -- the *alloc functions will be called with
exponentially diminishing frequency as you perform more and more
operations on the string. In fact, on a 32 bit system, a given string
buffer will never be realloced more than 32 times. Though, in reality
it is generally rare for any typical string to be realloced more than
a couple times. I'll let you figure out how I pulled this "miracle"
off.

> [...] This
> suggests that your string type is immutable and that operations on
> strings have a tendency to create new strings.

In the early sections of the documentation, I clearly point out that
the ->data portion of the bstring (which contains a '\0' terminated
char * embodiment of the string) is considered visible to the
programmer, and is writable. Using bstring solely as an abstracted
type is at the discretion of the programmer.

As long as you are playing with grep | wc, why don't you figure out
how many functions *return* bstrings? Those are the only ones that
actually create new strings.

> [...] This is precisely my
> point about Java's string handling. Java's string performance is not
> poor because it uses a 16bit data units -- it's primarily because it
> creates and destroys so many little object frequently.

No ... the point of OO programming is that if you say
object.modify(parameters) then the object is mutated without creating
a new one. bstring has additional mechanisms for creating trivial
static constant objects that don't require any allocation at all, so
that an absolute minimum of actual allocations need to ever take
place.



> If you have a good string performance test and benchmarks to show your
> string handling technique is competitive then I suggest you make that
> data available.

I didn't arrive at my performance claims in that way. Instead I have
run performance profilers on string intensive applications I have
written to determine where the overhead is. Earlier versions of the
bstring library had an unnecessary bottleneck or two, current versions
do not. Obviously, like most libraries, you can abuse it, or find
some corner where the overhead dominates. However, programmers
who possess the skill and are concerned with performance should
not run into those problems.

> [...] If you need a good test try creating a function for
> extracting strings from a line of text in CSV (comma separated values)
> format. Then compare it's performance to this csv_row_parse function:
>
> http://www.ioplex.com/~miallen/libmba/dl/src/csv.c

bstring is not a string parsing library. But that said, since it is
substantially interoperable with C's char * semantics, there's no way
I could lose such a contest. Though you might be a little
disappointed that my source might end up looking quite similar to that
source.

You don't have any sort of "main" or sample data in this program, so I
don't know the conditions under which you would measure the results of
such a challenge. That said, I'm not sure it would be that
challenging to beat your solution. I see you are reading from a file
one character at a time ... are you sure you want to lay this gauntlet
down? (You might like to do a little research on exactly who I am
before you answer that.) Of course, I wouldn't be appealing to
anything in bstring to beat the performance of that implementation.

http://www.pobox.com/~qed/

James Antill

May 16, 2003, 5:01:02 PM
On Wed, 14 May 2003 16:17:41 -0400, Michael B Allen wrote:

> On Wed, 14 May 2003 14:43:11 -0400, Paul Hsieh wrote:
>
>> Michael B Allen <mba...@ioplex.com> wrote:
>>> On Wed, 14 May 2003 02:17:02 -0400, Walter Faxon wrote:
>>> > I heartily approve of any attempt to develop and promulgate a decent
>>> > standard dynamic string ADT for C. Your bstring library seems an
>>>
>>> I have two words for you:
>>>
>>> pointer arithmetic
>>
>> bstring's are just wrapped char * buffers. The bstring library is
>> completely interoperable with ordinary char * semantics, as a result. So
>> if you want to do pointer arithmetic, that's fine, it will work. You can
>
> I think you missed the point. I'll repeat it;
>
> A string "ADT" is unnecessary in C. This is because C supports pointer
> arithmetic which permits in-situ string manipulation whereas a string
> "ADT" suggests that memory to hold (and thus abstrct) strings is
> frequently allocated and freed. This incurs performance penalties
> similar to that of string handling in languages that do not support
> pointer artihmetic.

There are two types of C programmers: those who write using a string ADT
and those who write buffer overflows.



> If you need a good test try creating a function for
> extracting strings from a line of text in CSV (comma separated values)
> format. Then compare it's performance to this csv_row_parse function:
>
> http://www.ioplex.com/~miallen/libmba/dl/src/csv.c

And you appear to be of the buffer overflow variety...

char buf[2];
char *rows[1];

memset(buf, 1, sizeof(buf));
if (csv_row_parse("a", 1, buf, 1, rows, 1, 0))
{
    puts(buf);
    puts(rows[0]);
}

if (!buf[1])
    fprintf(stderr, "Overwrote buf[1]\n");

...as Paul said, from a performance POV, the biggest problem with the
_fread version is that it calls fgetc() per char.
Also, your csv parsing requires that everything is copied into a
pre-allocated amount of space, with a pre-allocated number of fields, and
_fails silently_ when either of these isn't true. This isn't something
I'd do.
And I'm pretty sure you don't want to trim the trailing spaces when trim
isn't set, but who knows.

I'm pretty sure vstr can easily perform better in both space and time,
esp. given normal usage where vstr would do a mmap() and then never
allocate any space.

James Antill

May 16, 2003, 6:56:47 PM
On Fri, 16 May 2003 17:01:02 -0400, James Antill wrote:

> char buf[2];
> char *rows[1];
>
> memset(buf, 1, sizeof(buf));
> if (csv_row_parse("a", 1, buf, 1, rows, 1, 0))

Ack, bad copy and paste, this should be...

if (csv_row_parse("a", 1, buf, 1, rows, 1, 0) != -1)

> {
> puts(buf);
> puts(rows[0]);
> }
>

Michael B Allen

May 17, 2003, 3:14:03 AM
On Fri, 16 May 2003 17:01:02 -0400, James Antill wrote:

> There are two types of C programers, those who write using a string ADT
> and those that write buffer overflows.

Unfortunately I think you're right.

>> If you need a good test try creating a function for
>> extracting strings from a line of text in CSV (comma separated values)
>> format. Then compare it's performance to this csv_row_parse function:
>>
>> http://www.ioplex.com/~miallen/libmba/dl/src/csv.c
>
> And you appear to be of the buffer overflow variety...
>
> char buf[2];
> char *rows[1];
>
> memset(buf, 1, sizeof(buf));
> if (csv_row_parse("a", 1, buf, 1, rows, 1, 0)) {
> puts(buf);
> puts(rows[0]);
> }
>
> if (!buf[1])
> fprintf(stderr, "Overwrote buf[1]\n");

I doubt bstring, vstr, or whatever string would help prevent this type
of overflow. You would still have to parse the CSV like anyone else. In
fact there's a lot going on here. It's harder than it looks. I think if
you used a string adt for each string in each row your code would be
so much more complex that the number of overall bugs would end up being
greater. If it didn't have a buffer overflow it would fail to parse
the content correctly. At least all of the problems with this code are
fixable with simple tweaks.

> ...as paul said, from a performance POV, the biggest problem with the
> _fread version is that it calls fgetc() per char.

Kudos for finding a buffer overrun in my code but shame on you for
bringing fgetc into a discussion about string adts (and for being
incorrect to boot).

> Also your csv parsing requires that everything is copied into a
> pre-allocated amount of space, with a pre-allocated number of fields,

So it would be better to not have any limits? How many legit CSV files
have a line of more than 8K or a variable number of columns?

> and _fails silently_ when either of these aren't true. This isn't
> something I'd do.

This is debatable. Maybe you only need the first three columns of data
in a fifty column table.

> And I'm pretty sure you don't want to trim the postfix spaces when trim
> isn't set, but who knows.

True. That's just a bug (albeit trivial to fix). There's no hiding this
code is a little green. Thanks for spotting that.

> I'm pretty sure vstr can easily perform better in both space and time,
> esp. given normal usage where vstr would do a mmap() and then never
> allocate any space.

What's mmap? Maybe I use windows. Can I use that with shared memory? Maybe
I'm reading the file from a socket.

I can't believe you guys are going to claim that you could write a csv
parser using some kind of string adt that would be faster than just
parsing into a single buffer on the stack. I really don't see how that's
possible.

Mike

Walter Faxon

May 17, 2003, 9:23:14 AM
Michael B Allen <mba...@ioplex.com> wrote in message news:<pan.2003.05.17.03...@ioplex.com>...

<snip>

> I can't believe you guys are going to claim that you could write a csv
> parser using some kind of string adt that would be faster than just
> parsing into a single buffer on the stack. I really don't see how that's
> possible.
>
> Mike

Not referencing the current bstring, but it is perfectly doable to have
the ADT recognize two types of strings: regular resizable dynamic
strings, and stack-allocated nonresizable strings (with storage class
"auto"). Maybe an extra test or two (a tiny cost compared to I/O
overhead), but with a solid library, a lot safer. "Safety first!"
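For what it's worth, the two-type idea can be sketched in a few lines. Everything below (names, fields, growth policy) is hypothetical -- it is not taken from bstring, vstr, or the CUJ article:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string ADT that distinguishes resizable heap strings
 * from fixed-size caller-supplied (e.g. stack) buffers. */
typedef struct {
    char  *buf;
    size_t len;    /* bytes used, excluding the '\0' */
    size_t cap;    /* bytes available, including room for the '\0' */
    int    fixed;  /* nonzero: buf is not ours to realloc/free */
} dstr;

static int dstr_init_heap(dstr *s, size_t cap)
{
    s->buf = malloc(cap);
    if (!s->buf)
        return -1;
    s->buf[0] = '\0';
    s->len = 0; s->cap = cap; s->fixed = 0;
    return 0;
}

static void dstr_init_fixed(dstr *s, char *buf, size_t cap)
{
    buf[0] = '\0';
    s->buf = buf; s->len = 0; s->cap = cap; s->fixed = 1;
}

/* Append: grows heap strings, refuses to overflow fixed ones. */
static int dstr_cat(dstr *s, const char *src)
{
    size_t n = strlen(src);
    if (s->len + n + 1 > s->cap) {
        if (s->fixed)
            return -1;               /* the "extra test" mentioned above */
        size_t ncap = (s->len + n + 1) * 2;
        char *nbuf = realloc(s->buf, ncap);
        if (!nbuf)
            return -1;
        s->buf = nbuf; s->cap = ncap;
    }
    memcpy(s->buf + s->len, src, n + 1);
    s->len += n;
    return 0;
}
```

A fixed string simply fails the append instead of overflowing, while a heap string reallocates -- one flag check per operation, which is indeed tiny next to I/O cost.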

-- Walter

James Antill

May 17, 2003, 1:31:59 PM
On Sat, 17 May 2003 03:14:03 -0400, Michael B Allen wrote:

> On Fri, 16 May 2003 17:01:02 -0400, James Antill wrote:
>
>> There are two types of C programmers, those who write using a string ADT
>> and those that write buffer overflows.
>
> Unfortunately I think you're right.

I'm not really sure how to take this. Maybe you just admitted that you
don't mind creating buffer overflows ... but I presume there's a more
subtle point you wanted to make, that I can't see.

>>> If you need a good test try creating a function for
>>> extracting strings from a line of text in CSV (comma separated values)
>>> format. Then compare it's performance to this csv_row_parse function:
>>>
>>> http://www.ioplex.com/~miallen/libmba/dl/src/csv.c
>>
>> And you appear to be of the buffer overflow variety...
>>
>> char buf[2];
>> char *rows[1];
>>
>> memset(buf, 1, sizeof(buf));
>> if (csv_row_parse("a", 1, buf, 1, rows, 1, 0)) {
>> puts(buf);
>> puts(rows[0]);
>> }
>>
>> if (!buf[1])
>> fprintf(stderr, "Overwrote buf[1]\n");
>
> I doubt bstring, vstr, or whatever string would help prevent this type
> of overflow. You would still have to parse the CSV like anyone else. In

The help is that, at the minimum the code goes from...

if ((src > start) && /* actual test */
    rn &&            /* room in row */
    bn)              /* room in buffer */
{
    row[r] = buf;
    buf[j] = 0;
}

...which is easy to get wrong, if you have to keep doing the checks. To
the code being like...

if (src_off > 0)
{
    if (mystr_terminate(buf))
        mystr_row_add(rows, buf);
}

...where the tests are repeated by the ADT, and not the programmer. The
vstr one would probably be even simpler, like...

if (pos != beg_pos)
    vstr_sects_add(rows, beg_pos, pos - beg_pos);

...where the one simple call deals with: allocation (or truncation), error
returns ... and, because termination isn't needed, you get that for free.

> fact there's a lot going on here. It's harder than it looks. I think if
> you used a string adt for each string in each row your code would be
> so much more complex that the number of overall bugs would end up being
> greater. If it didn't have a buffer overflow it would fail to parse
> the content correctly. At least all of the problems with this code are
> fixable with simple tweaks.

This is a fallacy; humans are terrible at doing a "simple" thing many
times ... but can often do a "complex" thing well once.
This is what not having a string ADT does: you reduce the one "complex"
implementation to a bunch of "simple" things, _at each usage_. At which
point, no matter how good, the human programmer is bound to get one simple
thing wrong.

>> ...as paul said, from a performance POV, the biggest problem with the
>> _fread version is that it calls fgetc() per char.
>
> Kudos for finding a buffer overrun in my code but shame on you for
> bringing fgetc into a discussion about string adts (and for being
> incorrect to boot).

fgetc() _should_ be in the discussion ... strings don't magically appear
in programs, they come from IO. For example in the csv case an obvious
usage is to parse the entire csv file and then get all the data in the
"foo" column. This is non-trivial with your API, as you need to know the
length of each line ... and the number of fields. You can do some kinds of
guessing, and reallocating to try and get it right. But that would be
repeated every time you used the interface, and leans more towards the
complex than the simple.

As for the performance csv_row_fread() does...

for ( ;; ) {
    if (bn-- == 0) {
        [snip ... ]
    }
    if ((*buf = fgetc(in)) == EOF) {
        if (ferror(in)) {
            [snip ... ]
        }
        *buf = '\0';
        break;
    }
    if (*buf != '\n') {
        inrow = 1;
    } else if (inrow) {
        break;
    }
    buf++;
}

...which does call fgetc() for every character, which _will_ kill
performance.
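For comparison, a chunked version might look like the sketch below. This is illustrative only (the names are invented, and whether it actually beats buffered fgetc() is disputed later in the thread) -- it just counts newline-terminated records from block reads; extracting a record that spans two reads would need carry-over logic that's omitted here:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch: scan newline-terminated records out of block reads,
 * instead of calling fgetc() once per character. */
static size_t count_lines_block(FILE *in, char *buf, size_t bn)
{
    size_t lines = 0, got;

    while ((got = fread(buf, 1, bn, in)) > 0) {
        const char *p = buf, *end = buf + got, *nl;

        /* memchr() lets the library scan a whole block at once. */
        while ((nl = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p = nl + 1;
        }
    }
    return lines;
}
```

The per-character function call goes away; the trade-off is that field boundaries no longer align with read boundaries, which is exactly the bookkeeping fgetc() hides.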

>> Also your csv parsing requires that everything is copied into a
>> pre-allocated amount of space, with a pre-allocated number of fields,
>
> So it would be better to not have any limits? How many legit CSV files
> have a line of more than 8K or a variable number of columns?

Sure, a lot of csv files are going to be much smaller ... so 8K as an
upper bound is fine. So you allocate 8K per line, and waste about 7.8k
per line ... or you re-copy the parsed line so you can reclaim the
memory (massively error prone).
And then you still lose out that one time when someone has a csv with
a massive description in it.

>> and _fails silently_ when either of these aren't true. This isn't
>> something I'd do.
>
> This is debatable. Maybe you only need the first three columns of data
> in a fifty column table.

Maybe you do, but that's fairly trivial to add either explicitly to
the API or on top of it (removing things that are there is easy,
adding things that aren't is impossible).

>> And I'm pretty sure you don't want to trim the postfix spaces when trim
>> isn't set, but who knows.
>
> True. That's just a bug (albeit trivial to fix). There's no hiding this
> code is a little green. Thanks for spotting that.

I think you also want to do...

case ST_START:
    if (isspace(*src)) {
        if (!trim) {
            buf[j++] = *src; bn--;
            t = j; /* JA: add this line */
        }
        break;

...or you will also always trim whitespace-only entries.

>> I'm pretty sure vstr can easily perform better in both space and time,
>> esp. given normal usage where vstr would do a mmap() and then never
>> allocate any space.
>
> What's mmap? Maybe I use windows. Can I use that with shared memory? Maybe
> I'm reading the file from a socket.

Vstr doesn't need mmap(), it'll work fine just using an IO primitive
like fread() if you want. SYSV shared memory would also be fine,
assuming you don't change the data from another process and expect
anything magic to happen.
Vstr was designed for use with sockets, so that's obviously fine.

> I can't believe you guys are going to claim that you could write a csv
> parser using some kind of string adt that would be faster than just
> parsing into a single buffer on the stack. I really don't see how that's
> possible.

Here's a rough estimate of the copying for your version...

1. fgetc() => DISK/etc. to FILE *
2. fgetc() => FILE * to buf
3. csv_row_parse() => buf(src) to buf(dest)
4. row[x] = buf => &buf to row

...whereas here's the copying for the normal case[1] in the Vstr
version...

1. mmap()/readv()/etc. => DISK/etc. to Vstr
2. row[x] = pos/len => Vstr section

...so I'm pretty sure that Vstr will be faster than what you have now
without trying. Now in theory you can make your csv routines more
complicated and get to only having the 2 stage copying, in the normal
case but it'll be more complex (much more so when finding the "")[1]
... and you'll probably need to change the API so that you can read in
large chunks for use over several lines/calls. And if you need to have
csv's as a part of some data from a socket, things will be even more
complex.

So yes, if you...

. assume that the line lengths are < X
. assume that the number of fields are < Y
. take ownership of all the data from the FILE *
. parse a line at a time, and then throw the data away

...then it's possible the stack based solution will be faster.

[1] The normal case being any valid line without a "" in the middle of a
quoted string, as then you need to alter the data in some way ... but
that doesn't happen much, IME.

--
James Antill -- ja...@and.org
Need an efficent and powerful string library for C?
http://www.and.org/vstr/

Michael B Allen

May 17, 2003, 6:47:27 PM
On Sat, 17 May 2003 13:31:59 -0400, James Antill wrote:

>> I doubt bstring, vstr, or whatever string would help prevent this type
>> of overflow. You would still have to parse the CSV like anyone else. In
>
> The help is that, at the minimum the code goes from...
>
> if ((src > start) && /* actual test */
> rn && /* room in row */
> bn) /* room in buffer */
> {
> row[r] = buf;
> buf[j] = 0;
> }
>
> ...which is easy to get wrong, if you have to keep doing the checks. To
> the code being like...

Actually the checks only occurred in two places. The off-by-one error
you found was in the tail check (or the lack of it) after it exited the
loop to get the last buffer, in the unlikely event buffer space ran out
in the middle of the loop. And after fixing the code (new csv.c there
now) you can see the tail check is largely gone (replaced with: if bn ==
0, return error).

> if (src_off > 0)
> {
> if (mystr_terminate(buf))
> mystr_row_add(rows, buf);
> }

Nope. See above. If you write a state machine parser like this correctly
it should only check each condition once. I just did a crummy job the
first time around. But all of this is just a distraction I think. I
should have picked something that does not require complex parsing.

The point is that if you use one of the string adts to hold the elements
of each row I think your code would run into problems with complexity,
freeing individual strings, performance, etc. I chose the csv example
because it was optimal in how it handled strings, which is to say
it doesn't use any generic management. You really can't beat this
example. You should have made your case with a different example that
benefited from using a string ADT.

> ...where the tests are repeated by the ADT, and not the programer. The
> vstr one would probably be even simpler, like...
>
> if (pos != beg_pos)
> vstr_sects_add(rows, beg_pos, pos - beg_pos);
>
> ...where the one simple call deals with: allocation (or truncation),
> error returns ... and, because termination isn't needed, you get that
> for free.

Presumably I would have to write this vstr_sects_add function? I could
just write a function to do the:

row[r++] = buf; rn--;
buf[j] = '\0'; buf += j + 1; bn -= j + 1;

part. This still doesn't save you from doing parsing. You could still
get pos wrong.

>>> ...as paul said, from a performance POV, the biggest problem with the
>>> _fread version is that it calls fgetc() per char.
>>
>> Kudos for finding a buffer overrun in my code but shame on you for
>> bringing fgetc into a discussion about string adts (and for being
>> incorrect to boot).
>
> fgetc() _should_ be in the discussion ... strings don't magically
> appear

First, the fgetc call was in a different function in the same source
file. It calls the function we're using as an example. Also, I could
mmap the CSV file beforehand and skip the csv_line_read function.

> ...which does call fgetc() for every character, which _will_ kill
> performance.

You keep saying this and you're completely WRONG. Fgetc is buffered
IO. The first call traps and reads into the file's userspace
buffer. It will use more CPU due to function call overhead, but no more
so than if you called isspace once more, and it would not be slower as
far as throughput is concerned.

>>> Also your csv parsing requires that everything is copied into a
>>> pre-allocated amount of space, with a pre-allocated number of fields,
>>
>> So it would be better to not have any limits? How many legit CSV files
>> have a line of more than 8K or a variable number of columns?
>
> Sure, a lot of csv files are going to be much smaller ... so 8K as an
> upper bound is fine. So you allocate 8K per line, and waste about 7.8k
> per line ... or you re-copy the parsed line so you can reclaim the
> memory (massively error prone).
> And then you still lose out that one time when someone has a csv with
> a massive description in it.

Why on earth would you read the entire file into memory?! That's totally
unnecessary and bad programming regardless of how you handle strings. In
my example it was always my intention to have *one* 8192 byte buffer
allocated on the stack and used to call with read, csv_row_parse,
some_fn(row[]), and repeat.

>> True. That's just a bug (albeit trivial to fix). There's no hiding this
>> code is a little green. Thanks for spotting that.
>
> I think you also want do...

I just reviewed the csv module and redistributed the package it came
from. You've made your case about safety and it suffices to say the code
had quite a few problems. It was only recently added to the package and
never used seriously in a production environment. It's the only module
in the package that did not have a test program and ironically it was
the one that needed one the most.

>>> I'm pretty sure vstr can easily perform better in both space and
>>> time,
>>> esp. given normal usage where vstr would do a mmap() and then never
>>> allocate any space.
>>
>> What's mmap? Maybe I use windows. Can I use that with shared memory?
>> Maybe I'm reading the file from a socket.
>
> Vstr doesn't need mmap(), it'll work fine just using an IO primitive
> like fread() if you want. SYSV shared memory would also be fine,
> assuming you don't change the data from another process and expect
> anything magic to happen.

IOW you can't put these strings in shared memory because after all the
point of using shared memory would be for multiple processes to be able
to operate on that data.

> Vstr was designed for use with sockets, so that's obviously fine.

Q: Does it treat strings as blobs of memory or use a user defined
character unit or is it hog-pinned to char * like bstring?

>> I can't believe you guys are going to claim that you could write a csv
>> parser using some kind of string adt that would be faster than just
>> parsing into a single buffer on the stack. I really don't see how
>> that's possible.
>
> Here's a rough estimate of the copying for your version...
>
> 1. fgetc() => DISK/etc. to FILE *
> 2. fgetc() => FILE * to buf
> 3. csv_row_parse() => buf(src) to buf(dest)
> 4. row[x] = buf => &buf to row

This is cheesy hand-waving. Again, try writing a test program that
compares fgetc to fread. Also, csv_row_fread is a convenience
function. You're not required to use it. Shame on you. I could easily use
fread and call csv_row_parse (but again it's not necessary) and I could
mmap the file beforehand as well.

Mike

Paul Hsieh

May 18, 2003, 12:50:26 AM
Michael B Allen <mba...@ioplex.com> wrote:
> On Fri, 16 May 2003 17:01:02 -0400, James Antill wrote:
> > There are two types of C programmers, those who write using a string ADT
> > and those that write buffer overflows.
>
> Unfortunately I think you're right.

Programmers who write buffer overflows probably cost the industry billions.
Software houses that have figured this out have decided it's cheaper to
completely switch away from the C language, which is an expense in and of itself.



> >> If you need a good test try creating a function for
> >> extracting strings from a line of text in CSV (comma separated values)
> >> format. Then compare it's performance to this csv_row_parse function:
> >>
> >> http://www.ioplex.com/~miallen/libmba/dl/src/csv.c
> >
> > And you appear to be of the buffer overflow variety...

Yeah, I didn't bother to read too deeply into the source to discover this. I
just stopped reading as soon as I knew I could obliterate his performance with
relative ease; he has no serious or credible point. And if he seriously
wants to make a challenge of this he will supply a main and a sample file (he
also has to supply mba/msgno.h and defines.h -- I don't know what AMSG or
PMNO are) so that I can demonstrate his folly on his terms.

For what it's worth, he has *changed* his file from the version he
originally posted.

> > char buf[2];
> > char *rows[1];
> >
> > memset(buf, 1, sizeof(buf));
> > if (csv_row_parse("a", 1, buf, 1, rows, 1, 0)) {
> > puts(buf);
> > puts(rows[0]);
> > }
> >
> > if (!buf[1])
> > fprintf(stderr, "Overwrote buf[1]\n");
>
> I doubt bstring, vstr, or whatever string would help prevent this type
> of overflow.

bstring, via its documentation or source, is not that difficult to understand.
The bstring library has *specific* support for high-performance generic
stream I/O, and of course, it's completely safe. To my understanding, Vstr at
least has file I/O support.

Unless there are bugs in it that I haven't sussed out yet, no well defined
use of the bstring API (short of direct access to the wrapped string buffer)
can lead to a buffer overflow. Use it properly (which is an easy thing to
do) and you *can't* have these kinds of problems.

> [...] You would still have to parse the CSV like anyone else. In
> fact there's a lot going on here. It's harder than it looks. I think if
> you used a string adt for each string in each row your code would be
> so much more complex that the number of overall bugs would end up being
> greater. If it didn't have a buffer overflow it would fail to parse
> the content correctly. At least all of the problems with this code are
> fixable with simple tweaks.

These are some really odd claims. I've never encountered an algorithm where
having an intrinsic buffer overflow is a requirement. The point of bstring is
that I don't believe safe string manipulation code costs any significant
performance, or is particularly difficult to implement.



> > ...as paul said, from a performance POV, the biggest problem with the
> > _fread version is that it calls fgetc() per char.
>
> Kudos for finding a buffer overrun in my code but shame on you for
> bringing fgetc into a discussion about string adts (and for being
> incorrect to boot).

An interesting statement. Would you please like to put your money where your
mouth is? Supply a main and a sample file and I will demonstrate what I mean.



> > Also your csv parsing requires that everything is copied into a
> > pre-allocated amount of space, with a pre-allocated number of fields,
>
> So it would be better to not have any limits? How many legit CSV files
> have a line of more than 8K or a variable number of columns?

Infinite, actually. That's how hackers break into systems: because they know
that there is no shortage of people who believe things like this. But an
alternative question you might ask is how many legit CSV files need far less
than 8K? Probably a lot. But you are going to spend the 8K anyway, right?



> > and _fails silently_ when either of these aren't true. This isn't
> > something I'd do.
>
> This is debatable. Maybe you only need the first three columns of data
> in a fifty column table.

Maybe you should at least document this. This is a problem because it's not
obvious except to someone who knows or studies the source code.

> > And I'm pretty sure you don't want to trim the postfix spaces when trim
> > isn't set, but who knows.
>
> True. That's just a bug (albeit trivial to fix). There's no hiding this
> code is a little green. Thanks for spotting that.
>
> > I'm pretty sure vstr can easily perform better in both space and time,
> > esp. given normal usage where vstr would do a mmap() and then never
> > allocate any space.
>
> What's mmap? Maybe I use windows. Can I use that with shared memory? Maybe
> I'm reading the file from a socket.

Then bstring is the perfect library for your problem. mmap is some UNIX based
file -> memory mapping mechanism. It probably leads to impressive performance
for certain scenarios.



> I can't believe you guys are going to claim that you could write a csv
> parser using some kind of string adt that would be faster than just
> parsing into a single buffer on the stack. I really don't see how that's
> possible.

Who says you can't place a bstring on the stack? If you can't see it, perhaps
it's because you haven't made any effort to understand bstring.

Look, what exactly does this code do? Are you just taking any old generic
*.csv file and filling in a square array of strings holding each entry? I'm
not really familiar with the format of *.csv files, but it looks like you
are making all sorts of provisions for the quote character as well.
Interestingly, you are using a platform-independent (and actually slow) call
to detect whitespace (isspace) for a platform-specific file format.

Give me a description and I will show you how it's done in bstring without any
risk of buffer overflow, with good performance, and in a form that's possibly
easier to understand.

http://www.pobox.com/~qed/

James Antill

May 18, 2003, 3:19:09 AM
On Sat, 17 May 2003 18:47:27 -0400, Michael B Allen wrote:

> On Sat, 17 May 2003 13:31:59 -0400, James Antill wrote:
>
>>> I doubt bstring, vstr, or whatever string would help prevent this type
>>> of overflow. You would still have to parse the CSV like anyone else. In
>>
>> The help is that, at the minimum the code goes from...
>>
>> if ((src > start) && /* actual test */
>> rn && /* room in row */
>> bn) /* room in buffer */
>> {
>> row[r] = buf;
>> buf[j] = 0;
>> }
>>
>> ...which is easy to get wrong, if you have to keep doing the checks. To
>> the code being like...
>
> Actually the checks only occurred in two places. The off-by-one error

No, they have to occur before every write to buf ... and after every
write the accounting information has to be updated.
You laid the code out specifically so that the test would only need to
physically appear twice, but the programmer needs to be aware (and make
sure) that logically they are before every call.

> you found was in the tail check (or the lack of it) after it exited the
> loop to get the last buffer in the unlikely event buffer space ran out
> in the middle of the loop. And after fixing the code (new csv.c there
> now) you can see the tail check is largely gone (replaced with if bn ==
> 0 return error).

Errm... the tail check is still there, and just has an extra if check in
the version in the 0.6.1 tar ball (the web site isn't updated BTW.)

>> if (src_off > 0)
>> {
>> if (mystr_terminate(buf))
>> mystr_row_add(rows, buf);
>> }
>
> Nope. See above. If you write a state machine parser like this correctly
> it should only check each condition once. I just did a crummy job the
> first time around. But all of this is just a distraction I think. I

What can I say... You picked the problem and the code.
Also I _almost_ never see "C style" string code done as a well-done state
machine that only has/needs one check, and so is right.
I do see a lot of code that has multiple checks, and gets one wrong,
though. And from the version you had at the start of the discussion to
this version you did fixes (just for buffer overruns) in _four_ separate
places. And you _still_ missed the one on line 97.

> should have picked something that does not require complex parsing.

Because complex parsing is often wrong?
And getting things wrong without a string ADT means a buffer overflow?

> The point is that if you use one of the string adts to hold the elements
> of each row I think your code would run into problems with complexity,
> freeing individual strings, performance, etc. I chose the csv example

Only the very worst string ADTs would require a string per field, but
even those would probably be better than what you had ... I'll take a
memory leak over a buffer overflow any day.

> because it was optimal in how it handled strings which is to say
> it doesn't use any generic management. You really can't beat this
> example. You should have made your case with a different example that
> benefited from using a string ADT.

Why, your example is just as good for my purposes ... and it's _your
example_. Which is always nice.

>> ...where the tests are repeated by the ADT, and not the programer. The
>> vstr one would probably be even simpler, like...
>>
>> if (pos != beg_pos)
>> vstr_sects_add(rows, beg_pos, pos - beg_pos);
>>
>> ...where the one simple call deals with: allocation (or truncation),
>> error returns ... and, because termination isn't needed, you get that
>> for free.
>
> Presumably I would have to write this vstr_sets_add function? I could

I'm reading your code to the point that I'm telling you where the bugs
are, you could at least read my documentation...

http://www.and.org/vstr/functions.html#vstr_sects_add()

> part. This still doesn't save you from doing parsing. You could still
> get pos wrong.

True, you can always get something wrong but ...

1. If you get pos wrong you'll see it on _almost every_ code path, when
you get the size wrong you only see it when you've filled the space
perfectly.

2. You already have pos in your code (buf), so I'm not adding anything
that could be wrong ... I'm just taking things away.

3. You are more likely to have a non-exploitable hole, if you have an
error on a code path you don't test.

> First, the fgetc call was in a different function in the same source
> file. It calls the function we're using as an example. Also, I could
> mmap the CSV file beforehand and skip the csv_line_read function.

It's your code... the only thing that did the IO as well was the fread
version. I didn't see anything on the documentation page saying don't use
this function it's slow ... but then there wasn't anything about trim
either, so I guess I should have assumed it's incomplete.

Sure you can mmap(), but it requires a lot of extra work.

>> ...which does call fgetc() for every character, which _will_ kill
>> performance.
>
> You keep saying this and you're completely WRONG. Fgetc is buffered
> IO. The first call traps and reads into the file's userspace
> buffer. It will use more CPU due to function call overhead but no more
> so than if you called isspace once more and it would not be slower as
> far as throughput is concerned.

I know fgetc() will normally read in blocks from the disk but you are
still calling a function per character ... and isspace() is a macro on
every platform I've seen.

> Why on earth would you read the entire file into memory?! That's totally
> unnecessary and bad programming regardless of how you handle strings. In
> my example it was always my intention to have *one* 8192 byte buffer
> allocated on the stack and used to call with read, csv_row_parse,
> some_fn(row[]), and repeat.

Because often you need to keep the data around to do something with it,
the string library doesn't require it ... but if you do the string library
will make your life a lot easier.

> I just reviewed the csv module and redistributed the package it came
> from. You've made your case about safety and it suffices to say the code
> had quite a few problems. It was only recently added to the package and
> never used seriously in a production environment. It's the only module
> in the package that did not have a test program and ironically it was
> the one that needed one the most.

Well you hadn't ... you'd released the code.

> IOW you can't put these strings in shared memory because after all the
> point of using shared memory would be for multiple processes to be able
> to operate on that data.

I defy you to show how you can put a (char *) string in shared memory and
update it in one process while using it in another.
If you use locking or some other kind of ownership rule, then it passes
my test ... you won't be updating it (and it's worth noting that much less
would go wrong if you accidentally did update the shared memory -- assuming
only the data is in shared memory, and I can't think of a sane reason for
putting the entire string object there).

>> Vstr was designed for use with sockets, so that's obviously fine.
>
> Q: Does it treat strings as blobs of memory or use a user defined
> character unit or is it hog-pinned to char * like bstring?

Again, I do have code and documentation on the web...

http://www.and.org/vstr/design.html

Holger Hasselbach

May 18, 2003, 6:51:28 AM
Walter Faxon wrote in message news:<57958c65.03051...@posting.google.com>...

> Not referencing the current bstring, but perfectly doable is to have
> the ADT recognize two types of strings: regular resizable dynamic
> strings, and stack-allocated nonresizable strings (with storage class
> "auto"). Maybe an extra test or two (a tiny cost compared to I/O
> overhead) but with a solid library, a lot safer. "Safety first!"

That's it, and besides bstring and vstr, here is another idea for a
string ADT, the way I've implemented it.

When you analyse string functionality in general, you will find that
it is composed of 3 stages:

StringCopy (Dst, Src)
{
    Len = strlen(Src);      // stage 1
    Dst = malloc(Len + 1);  // stage 2a
    if (Dst)                // stage 2b
        strcpy(Dst, Src);   // stage 3
}

stage 1: Calculate the necessary destination length
stage 2: Make sure the necessary length is available
stage 3: Build the Dst = f(Srcs)

Stage 2 is the thing most of the discussion is about. Let us call this
the "allocation scheme". Unfortunately, the Standard library's string
functions are built on the premise of exchanging stages 1 and 2, saying:
"Calculate stage 1 in advance or guess the necessary length and hope
it fits. If not, then live with the buffer overflow." It neither
provides nor expects any allocation scheme at all.

More unfortunately, when the string functions become more and more
complex, both stages 1 and 3 can become arbitrarily complex, too.
Think about a function for multiple substitution:

StringMultiSubst(Dst, HtmlSrc, 2, "ä", "&auml;", "ö", "&ouml;");

When you want different allocation schemes (dynamic or static
storage), you have to split each function into 2 (one for stage 1 and
one for stage 3), making the usage a horror; do it like the
Standard library does (hope it fits); or write the same function
multiple times, once for each scheme ... hoping there is no need to
maintain all the redundancy in the future.

But we are talking about ADTs, so why not provide a callback for
stage 2, stored in the ADT? The callback will allocate the required
memory size, freeing the previous object first if there is any. Maybe
there must be another callback for freeing it without a new
allocation, or it can be done when a length of 0 is required, or
however.

For stage 2 there is a unique Reserve ADT function:

StringReserve(Object, Length)

* Check if the actual allocation provides Length+1 characters of
  storage
* If not, and /if there is a callback/, call it to reserve Length+1
* If there are now fewer than Length+1 characters, throw an error

Note the /if there is a callback/ - the support for static storage is
automagically built in by simply providing no callback.

The general string functionality now looks like this:

StringCopy (DstObject, Src)
{
    Len = strlen(Src);               // stage 1
    StringReserve(DstObject, Len);   // stage 2a
    if (NoError)                     // stage 2b
        strcpy(DstObject->Str, Src); // stage 3
}

Currently, I use 3 different allocation schemes with this: Static
storage, dynamic storage with malloc(), and a blocked storage, using
large character blocks for multiple strings instead of one memory
block per string, which is good as a permanent storage of large
amounts of short strings or as a temporary storage with automatic
garbage collection.


Holger

James Antill

May 18, 2003, 11:49:34 AM
On Sat, 17 May 2003 21:50:26 -0700, Paul Hsieh wrote:

> Michael B Allen <mba...@ioplex.com> wrote:
>> [...] You would still have to parse the CSV like anyone else. In
>> fact there's a lot going on here. It's harder than it looks. I think if
>> you used a string adt for each string in each row your code would be
>> so much more complex that the number of overall bugs would end up being
>> greater. If it didn't have a buffer overflow it would fail to parse
>> the content correctly. At least all of the problems with this code are
>> fixable with simple tweaks.
>
> These are some really odd claims. I've never encountered an algorithm where
> having an intrinsic buffer overflow is a requirement. The point of bstring is
> that I don't believe it costs any significant performance, or is
> particularly difficult to implement safe string manipulation code.

I presume he meant that you can make an error either way, and if you are
protected from buffer overflows then the error will manifest as a failure
to parse correctly.

>> Kudos for finding a buffer overrun in my code but shame on you for
>> bringing fgetc into a discussion about string adts (and for being
>> incorrect to boot).
>
> An interesting statement. Would you please like to put your money where your
> mouth is? Supply a main and a sample file and I will demonstrate what I mean.

There's a t17csv.c in the 0.6.1 libmba, also I just did...


// #include "mba/msgno.h"
// #include "defines.h"

#define msgno_msg(x) strerror(x) /* ok here... */

#define PMNO(msgno) (fprintf(stderr, \
"%s:%u:%s: %s\n", \
__FILE__, __LINE__, __FUNCTION__, msgno_msg(msgno)))
#define PMNF(msgno, fmt, args...) (fprintf(stderr, \
"%s:%u:%s: %s" fmt "\n", \
__FILE__, __LINE__, __FUNCTION__, msgno_msg(msgno), ## args))


#define AMSG(fmt, args...) (fprintf(stderr, \
" %s:%u:%s: " fmt "\n", \
__FILE__, __LINE__, __FUNCTION__, ## args))


...and then main can be as simple as...

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char buf[1024];
    char *rows[128];

    memset(buf, 'x', sizeof(buf));
    if (csv_row_parse("\"aaaa\",bbbb", 10, buf, 5, rows, 128, 0) != -1)
    {
        printf("%x%x%x%x %x %x%x%x%x %x%x\n",
               buf[0], buf[1], buf[2], buf[3],
               buf[4],
               buf[5], buf[6], buf[7], buf[8],
               buf[9], buf[10]);
        printf("%p\n%p\n%p\n%p\n", rows[0], rows[1], rows[2], rows[3]);
    }

    exit(EXIT_SUCCESS);
}


> mmap is some UNIX based
> file -> memory mapping mechanism. It probably leads to impressive performance
> for certain scenarios.

Most of the performance comes from sharing data, thus using less memory.

>> I can't believe you guys are going to claim that you could write a csv
>> parser using some kind of string adt that would be faster than just
>> parsing into a single buffer on the stack. I really don't see how that's
>> possible.
>
Who says you can't place a bstring on the stack? If you can't see it, perhaps
it's because you haven't made any effort to understand bstring.

I presume he meant putting the string and data on the stack, but then
adding a possibly arbitrary amount of data to it; if I understand
bstring, it would try to free() the stack pointer.

> Look, what exactly does this code do? Are you just taking any old generic
> *.csv file and filling in a square array of strings holding each entry? I'm
> not really familiar with the format of *.csv files, but it looks like you
> are making all sorts of provisions for the quote character as well.
> Interestingly you are using a platform-independent (and actually slow) call
> to detect whitespace (isspace) for a platform-specific file format.
>
> Give me a description and I will show you how it's done in bstring without any
> risk of buffer overflow, with good performance, and possibly easier to
> understand how it works.

CSV doesn't really have a well-defined standard, but files often
follow the format...

"abcd, xyz",abcd,xyz

...where that is three fields. You generate a single '"' by putting
two "". Eg.

"abcd"",xyz"

...is one field parsed out as...

abcd,"xyz

...the places where implementations differ are whether you allow
non-alphanumeric characters without quoting (whether the quoting uses
' or " ... or even both), and whether you allow/disregard spaces
between elements and the comma/BOL/EOL. Michael's always uses " as
the quote and does whitespace trimming or not based on the trim
parameter (modulo bugs), but always disregards whitespace around
quoted elements.
He also checks for a bunch of error conditions, which might make it
somewhat slower.

Paul Hsieh

May 18, 2003, 5:11:21 PM
wfa...@gis.net (Walter Faxon) wrote:
> Not referencing the current bstring, but perfectly doable is to have
> the ADT recognize two types of strings: regular resizable dynamic
> strings, and stack-allocated nonresizable strings (with storage class
> "auto"). Maybe an extra test or two (a tiny cost compared to I/O
> overhead) but with a solid library, a lot safer. "Safety first!"

And I thought this newsgroup was just filled with buffer overflowers
and language lawyers unable to see 2 ft past the end of their noses. I
guess I was thinking of comp.std.c. Your elucidating this in just
the right way has helped me solve the most serious design
shortcoming of bstring.

Bstring, in fact, has always had support for statically declared or
artificially constructed (with the buffer from the stack or whatever)
and read-only strings. But there was never any enforcement of the
required user behavior (i.e., don't try to write to or destroy a
read-only bstring via bstring functions, since they may attempt to
realloc/free the data buffer.) But as it turns out, the bstring
mutation functions already have a sanity check in them that is trivial
to alias with a "read-only" protection state.

I have now implemented a write protection mechanism that costs
exactly nothing over the previous version! Now statically declared
bstrings are not mutable at all, and constructed bstrings are set to
write-protect mode by default. You can still switch these constructed
bstrings to write-allow mode if you know what you are doing; writing
to statically declared bstrings is not portable/well defined in any
case. All gain, no pain.

This is important because both destroying and writing to a bstring
which has been allocated on the stack aren't well defined. The only
guard against performing these illegal actions was user diligence --
but the same could be said about buffer overflows, which bstring was
designed to solve! Now it is automatically taken care of, making it
just that much safer.

Thank you for making this issue a little clearer in my mind so I could
see what I needed to do to solve the only serious remaining safety
problem with bstring (i.e., mutating a bstring that you are not
supposed to be able to mutate.)

http://www.pobox.com/~qed/

Grant Jacobs

May 18, 2003, 10:13:49 PM

Paul Hsieh wrote:

>Thank you for making this issue a little clearer in my mind so I could
>see what I needed to do to solve the only serious remaining safety
>problem with bstring (i.e., mutating a bstring that you are not
>supposed to be able to mutate.)
>
>
>

Reading this, I just can't help writing... never say "only" when it
comes to bugs or design flaws - it'll come back to bite you :-) (nip)
After all, that's what you've just been through...

Michael B Allen

May 19, 2003, 3:35:19 AM
Ok. I'll accept Paul's challenge.

I think James has a better chance with his Vstr though, so I'd like to
see a contribution from him as well. Not that I think he can beat this
code either but he makes fewer bold statements which is a good sign. I'm
not going to "put my money where my mouth is" but you will have the
satisfaction (or disgrace) attributed to conducting this challenge in
a public forum.

So. I have created a test program and another program to generate the csv
data. The game will be to convert the csv data to rows of strings exactly
like the code I posted previously (minus buffer overruns). The strings
may never be accessed but they have to exist as they would be accessed in
a real program so no trickery that takes advantage of this. We want to
test string creation and manipulation; not printf. Presumably you will
use my implementation as a base as you do not want to try and reproduce
all of the parsing as csv can be a little complicated. Substitute your
string ADT where you feel it suits the problem and helps make the code
more secure, faster, easier to understand, better, etc. Or completely
rewrite your own, but it has to successfully process all the input.

However, I'm still trying to figure out a good way to generate enough
data with enough variety of strings to make the program run long
enough, so I'll have to post the csvgen.c and my csvtest.c
implementation tomorrow. I also have another bug in my code to weed out.

Mike

Paul Hsieh

May 19, 2003, 4:42:01 PM
Michael B Allen <mba...@ioplex.com> wrote:
> Ok. I'll accept Paul's challenge.
>
> I think James has a better chance with his Vstr though, so I'd like to
> see a contribution from him as well.

Not likely. This whole program comes down to overhead, and Vstr pays
all sorts of additional overhead costs in exchange for advantages for
large string manipulations. I think his best chance is on quoted
entries with embedded double quotes which are larger than the L1 cache
(the breaking point would probably be somewhat lower than this, but we
couldn't be sure without direct measurement) of the CPU. But even
then, it's not clear to me that the overhead of parsing without direct
access to a char * buffer wouldn't just kill Vstr's performance no
matter what.

> [...] Not that I think he can beat this
> code either but he makes fewer bold statements which is a good sign. I'm
> not going to "put my money where my mouth is" but you will have the
> satisfaction (or disgrace) attributed to conducting this challenge in
> a public forum.
>
> So. I have created a test program and another program to generate the csv
> data. The game will be to convert the csv data to rows of strings exactly
> like the code I posted previously (minus buffer overruns). The strings
> may never be accessed but they have to exist as they would be accessed in
> a real program so no trickery that takes advantage of this. We want to
> test string creation and manipulation; not printf. Presumably you will
> use my implementation as a base as you do not want to try and reproduce
> all of the parsing as csv can be a little complicated. Substitute your
> string ADT where you feel it suits the problem and helps make the code
> more secure, faster, easier to understand, better, etc. Or completely
> rewrite your own but it has to successfully process all the input.

I have read James' description of the CSV files, and found some
sources on it on the net (most importantly from www.wotsit.org). It
turns out I had a few of these lying around on my machine as well. So
I just wrote one from scratch myself.

This is entirely a parsing problem that has little to do with memory
management and has very low risk of buffer overflowing (though somehow
you managed to do it.) Nevertheless the initial input format is a
bstring (since bstring has some convenient buffer safe file reading
semantics) as is the output format (just as a matter of
demonstration.)



> However, I'm still trying to figure out a good way to generate enough
> data with enough variety of strings to make the program run long enough
> though so I'll have to post the csvgen.c and my csvtest.c implementation
> tomorrow. I also have another bug in my code to weed out.

Well, in the meantime you can look at my first crack at this. I don't
guarantee that it's correct, but it seems to be able to parse all the
sample CSV files included in the archive:

http://www.pobox.com/~qed/bcsv.zip

Be warned, I made the same "fixed number of columns" mistake that you
did, since I started with your API as I was implementing the program.
I did not realize until I was too far down the process that the number
of columns is not meta data that you somehow have a priori (as amd.csv
demonstrates.) In a future version, I will hopefully fix this, but in
its current incarnation, it should not have any less functionality
than yours.

The code is also a little bit longer than yours since it has been
written with performance in mind. In the end I believe this
implementation should beat yours by virtue of the fact that it does
not *copy* the entries character by character, and it uses code
position to track the state rather than calling a switch after a for
loop, and reduces redundant checks of the "trim" variable.

http://www.pobox.com/~qed/optimize.html
http://www.pobox.com/~qed/asmexample.html

James Antill

May 19, 2003, 10:04:33 PM
q...@pobox.com (Paul Hsieh) wrote in message news:<796f488f.03051...@posting.google.com>...
> Michael B Allen <mba...@ioplex.com> wrote:
> > Ok. I'll accept Paul's challenge.
> >
> > I think James has a better chance with his Vstr though, so I'd like to
> > see a contribution from him as well.
>
> Not likely. This whole program comes down to overhead, and Vstr pays
> all sorts of additional overhead costs in exchange for advantages for
> large string manipulations.

If you think of bstring as a wrapper around a (char *), then think of
Vstr as a wrapper around (struct iovec *) ... so for reading I can
mostly just walk the iovec list using the vstr_iter_*() functions
which are all inlined.
And with the way the Vstr API works I'd only need to alter the input
if I replaced a "" with a " (which doesn't happen often in real input).

Sure, Vstr has searching functions, and you might write it that way if
you didn't care as much about performance (as it'd be easier) ... and
yes that'd probably be slower.

> I have read James' description of the CSV files, and found some
> sources on it on the net (most importantly from www.wotsit.org). It
> turns out I had a few of these lying around on my machine as well. So
> I just wrote one from scratch myself.
>
> This is entirely a parsing problem that has little to do with memory
> management and has very low risk of buffer overflowing (though somehow
> you managed to do it.) Nevertheless the initial input format is a
> bstring (since bstring has some convenient buffer safe file reading
> semantics) as is the output format (just as a matter of
> demonstration.)

Yeh, apart from the input phase your code could go almost unaltered to
use a normal (char *).

> > However, I'm still trying to figure out a good way to generate enough
> > data with enough variety of strings to make the program run long enough
> > though so I'll have to post the csvgen.c and my csvtest.c implementation
> > tomorrow. I also have another bug in my code to weed out.
>
> Well in the mean time you can look at my first crack at this. I don't
> guarantee that its correct, but it seems to be able to parse all the
> sample CSV files included in the archive:
>
> http://www.pobox.com/~qed/bcsv.zip

OMG Michael's might have had overflows but I'd rather debug/maintain
his version than yours (although you handle more cases, I think).
And the original state machine version that he has on his page is very
nice.

> Be warned, I made the same "fixed number of columns" mistake that you
> did, since I started with your API as I was implementing the program.
> I did not realize until I was too far down the process that the number
> of columns is not meta data that you somehow have a priori (as amd.csv
> demonstrates.) In a future version, I will hopefully fix this, but in
> its current incarnation, it should not have any less functionality
> than yours.

It also doesn't deal with multiple lines of input, so I can't compare
it to the libmba code.

> The code is also a little bit longer than yours since it has been
> written with performance in mind. In the end I believe this
> implementation should beat yours by virtue of the fact that it does
> not *copy* the entries character by character, and it uses code
> position to track the state rather than calling a switch after a for
> loop, and reduces redundant checks of the "trim" variable.

I should probably find time to write a vstr version, as it should be
readable and fast (mainly due to not having to do any copies of the
data -- see the progression, copy lots: slow, copy in blocks: faster,
copy none: safest/fastest).

Michael B Allen

May 20, 2003, 2:27:32 AM
On Mon, 19 May 2003 16:42:01 -0400, Paul Hsieh wrote:

> Michael B Allen <mba...@ioplex.com> wrote:
>> Ok. I'll accept Paul's challenge.
>>
>

> Well in the mean time you can look at my first crack at this. I don't
> guarantee that its correct, but it seems to be able to parse all the
> sample CSV files included in the archive:
>
> http://www.pobox.com/~qed/bcsv.zip

Ok, as promised, here's my program (csvtest) and another program (csvgen)
to generate the input file:

http://users.erols.com/mballen/csvtest/

Build with:

nmake -f Makefile.msvc

Also, I do not have Watcom C so if you would like me to use optimization
flags with cl that are different from the ones specified in the
Makefile.msvc that I provided, please provide your own makefile for msvc.

> Be warned, I made the same "fixed number of columns" mistake that you
> did, since I started with your API as I was implementing the program. I

Yes, that's fine. It's my API. If you would like to provide the input data
that's fine too of course, but you will need to write a program to generate
it because I really don't feel like downloading a huge file like that.

> The code is also a little bit longer than yours since it has been
> written with performance in mind. In the end I believe this

A little bit? Man, the bold (and rude) statements just never cease. Just
from glancing at your code you're not going to win on simplicity
alone. Aren't we ultimately comparing performance here? I mean, you said
you would "obliterate" the speed of my code.

> implementation should beat yours by virtue of the fact that it does not
> *copy* the entries character by character, and it uses code position to

FYI a *copy* is just an extra variable assignment.

Mike

Paul Hsieh

May 20, 2003, 4:04:18 AM
james-...@and.org (James Antill) wrote:

> q...@pobox.com (Paul Hsieh) wrote:
> > Michael B Allen <mba...@ioplex.com> wrote:
> > > Ok. I'll accept Paul's challenge.
> > >
> > > I think James has a better chance with his Vstr though, so I'd like to
> > > see a contribution from him as well.
> >
> > Not likely. This whole program comes down to overhead, and Vstr pays
> > all sorts of additional overhead costs in exchange for advantages for
> > large string manipulations.
>
> If you think of bstring as a wrapper around a (char *), [...]

For reading, and direct character access, there's no point in doing
anything else. The code used is actually a natural usage within
bstring context. I believe I said I would do it this way earlier in
the thread.

The mere act of parsing CSV files is unlikely to be the sole desire of
a programmer using such a routine. Since the final strings are
packaged into struct tagbstring's they are perfect for just about any
of the next logical operations you would expect in a DB application
using bstrlib. For example, since biseq() dramatically outperforms
strcmp() in most cases (because of retaining and checking the lengths
first) this means that exact searches, for example, will perform much
better.

I am not a database guy, so perhaps someone can suggest what some
likely operations on a DB are.

> [...] then think of
> Vstr as a wrapper around (struct iovec *) ... so for reading I can
> mostly just walk the iovec list using the vstr_iter_*() functions
> which are all inlined.

Right, but then I would guess that you have to stick with a wrapping
for/switch implementation like Michael did. My 7-label goto mechanism
doesn't easily map into an iterator. But of course, I think much of
the performance of my solution comes from this mechanism.

Think about it -- between characters Michael's solution performs the
outer-for loop (which does 3 variable inc/decs and 5 comparisons), a
switch, then possibly a number of if()-else's before even accessing
the next character then deciding which action to take, then setting a
new state. My solution will have a for and a single if() between most
characters. (The state is implicit with the line of code, and other
mode flags are hoisted out of the inner loops.) Additionally, noting
that Michael always copies for some reason (but ignoring his fgetc
nonsense, which should be easily fixed), the performance difference
between his solution and mine should be pretty high (2x minimum, but
likely much more, I'd guess.)

Michael, of course, could switch to a solution more like mine while
sticking with char * buffers, and only lose on all the operations
*after* the CSV file has been parsed. But I don't think that is
really the point. Extracting performance from a given problem relies
on the skill of the programmer. The language mechanisms by themselves
won't imbue you with this ability.

> And with the way the Vstr API works I'd only need to alter the input
> if I replaced a "" with a " (which doesn't happen often in real input).

Oh. I actually thought that you might opt to use Vstr's high
performance insert and delete mechanisms to beat both Michael and me on
this one aspect of the problem. It can be applied as a last-stage
operation, so I am surprised you chose not to do this -- do you have
a restriction that you cannot modify a Vstr that you are currently
iterating through?

In any event, my solution has the exact same properties. It's
completely read-only except for when double quote cases are detected.
Even then, the modification is performed in-place since we know that
this shrinks the string.



> Sure Vstr has searching functions, and you might write them in that if
> didn't care as much about performance (as it'd be easier) ... and yes
> that'd probably be slower.

Yeah, bstrlib could do that too. But that's the whole point of
this exercise -- it's not really doing much string *manipulation* and
there is little risk of buffer overflow, because no operation ever
expands any buffer. It's just a trivial problem where in-place
manipulation and referencing is clearly the best solution.

He could have just as easily asked us to reverse the words in a string
-- once again, neither Vstr's nor bstrlib's automatic memory
management mechanisms have anything to contribute on such a problem.
I would argue that bstring has the advantage that it is designed
precisely to interoperate with char * manipulations; however, clearly
my lack of deep understanding of Vstr (I hadn't thought of the idea of
an inlined iterator) prevents me from categorically claiming that
bstrlib's direct interoperability leads to a truly better solution
than Vstr in practice.



> > I have read James' description of the CSV files, and found some
> > sources on it on the net (most importantly from www.wotsit.org). It
> > turns out I had a few of these lying around on my machine as well. So
> > I just wrote one from scratch myself.
> >
> > This is entirely a parsing problem that has little to do with memory
> > management and has very low risk of buffer overflowing (though somehow
> > you managed to do it.) Nevertheless the initial input format is a
> > bstring (since bstring has some convenient buffer safe file reading
> > semantics) as is the output format (just as a matter of
> > demonstration.)
>
> Yeh, apart from the input phase your code could go almost unaltered to
> use a normal (char *).

Well, a key difference is that lengths for each entry are tracked
(the trailing '\0's really only being added so that I can printf the
results.) The result being that the output is ready for more serious
string manipulation (where bstring can excel.)



> > > However, I'm still trying to figure out a good way to generate enough
> > > data with enough variety of strings to make the program run long enough
> > > though so I'll have to post the csvgen.c and my csvtest.c implementation
> > > tomorrow. I also have another bug in my code to weed out.
> >
> > Well in the mean time you can look at my first crack at this. I don't
> > guarantee that its correct, but it seems to be able to parse all the
> > sample CSV files included in the archive:
> >
> > http://www.pobox.com/~qed/bcsv.zip
>
> OMG Michael's might have had overflows but I'd rather debug/maintain
> his version than yours (although you handle more cases, I think).
> And the original state machine version that he has on his page is very
> nice.

If you are interested in some C language bashers, listen to Todd
Proebsting's talk from the afternoon session here:

http://ll2.ai.mit.edu/

Part of what he says is that parsing is a harder problem than people
are willing to admit, and I think he is right. Only in the most
trivial cases can you implement a correct parser that is fast,
correct/safe, and elegant. I chose to compromise on the third. As
for maintainability, wouldn't you also have to consider the actual
number of bugs in the code itself?

Keep in mind that it has been very difficult to get a precise
statement of the problem. I simply changed it to the most general
statement of the problem: "correctly parse any CSV file (possibly
including defective ones generated by Borland's Paradox)". If I am
told the cases where I can compromise on correctness, perhaps I can
cut out a number of cases, or branches from my code to make it more
"maintainable". (Perhaps I could overclock my PC, eat raw meat,
neglect to wear seatbelts, take up smoking and give recreational
bullfighting a try right afterwards.)

Besides being limited to a predetermined maximum number of columns, I
just noticed that Michael's solution clearly does not correctly handle
the case of a \n embedded in quotes. He's just placed that '\n' test
in the outer loop. If the outer for loop terminates due to a *src ==
'\0', he also seems to have missed the case of reporting an error when
an opened '"' has not been closed.

So how do we resolve this apples versus oranges scenario? Should
Michael fix his solution, or should I break mine?

> > Be warned, I made the same "fixed number of columns" mistake that you
> > did, since I started with your API as I was implementing the program.
> > I did not realize until I was too far down the process that the number
> > of columns is not meta data that you somehow have a priori (as amd.csv
> > demonstrates.) In a future version, I will hopefully fix this, but in
> > its current incarnation, it should not have any less functionality
> > than yours.
>
> It also doesn't deal with multiple lines of input, so I can't compare
> it to the libmba code.

Huh? The first version I uploaded could read multiple lines, so I
don't know exactly what you are referring to. In any event I have
updated it, so it now supports arbitrary number of columns, and can be
easily mapped to a single data structure to hold the completely parsed
CSV file if so desired (though the demo code doesn't do this.) Did
you notice a bug? Am I handling empty lines wrong or something?



> > The code is also a little bit longer than yours since it has been
> > written with performance in mind. In the end I believe this
> > implementation should beat yours by virtue of the fact that it does
> > not *copy* the entries character by character, and it uses code
> > position to track the state rather than calling a switch after a for
> > loop, and reduces redundant checks of the "trim" variable.
>
> I should probably find time to write a vstr version, as it should be
> readable and fast (mainly due to not having to do any copies of the
> data -- see the progression, copy lots: slow, copy in blocks: faster,
> copy none: safest/fastest).

Come on and join the fun!

Paul Hsieh

May 20, 2003, 7:46:55 AM
Michael B Allen <mba...@ioplex.com> wrote:
> On Mon, 19 May 2003 16:42:01 -0400, Paul Hsieh wrote:
> > Michael B Allen <mba...@ioplex.com> wrote:
> >> Ok. I'll accept Paul's challenge.
> >
> > Well in the mean time you can look at my first crack at this. I don't
> > guarantee that its correct, but it seems to be able to parse all the
> > sample CSV files included in the archive:
> >
> > http://www.pobox.com/~qed/bcsv.zip
>
> Ok, as promised, here's my program (csvtest) and another program (csvgen)
> to generate the input file:
>
> http://users.erols.com/mballen/csvtest/
>
> Build with:
>
> nmake -f Makefile.msvc
>
> Also, I do not have Watcom C so if you would like me to use optimization

> flags with cl that are different from the ones specified in the
> Makefile.msvc that I provided, please provide your own makefile for msvc.

Just -O2 or whatever. The only compiler optimizations that will have
any relevance in my code are better register usage and sub-expression
elimination. These are usually both enabled even at the lowest
optimization setting above "no optimizations."

Testing both programs under WATCOM C/C++ using your hoist on a 43MB
input shows that my version was only between 50 and 70% faster. I
changed it to a much smaller file (just the first 23 lines of your
suggested out.csv, or about 40KB), and simply repeated the same
parsing 10000 times in a tight loop and it rose to about 130% faster.

Profiling shows that my version of the program spends nearly 40% of
its time in fread(). Your program is spending about 60% of its time
in fgetc(). So it is difficult to remove I/O from the equation in
this test.



> > Be warned, I made the same "fixed number of columns" mistake that you
> > did, since I started with your API as I was implementing the program. I
>
> Yes, that's fine. It's my API. If you would like to provide the input data
> that's fine too of course, but you will need to write a program to generate
> it because I really don't feel like downloading a huge file like that.

Well I have since changed my program to just do everything with the
maximal amount of generality and correctness. There is no maximum
column parameter in the current version. For test files, how about
something like test7.csv or teste.csv which are included in my
archive?



> > The code is also a little bit longer than yours since it has been
> > written with performance in mind. In the end I believe this
>
> A little bit? Man, the bold (and rude) statements just never cease. Just
> from glancing at your code you're not going to win on simplicity
> alone. Aren't we ultimately comparing performance here? I mean, you said
> you would "obliterate" the speed of my code.

Well it turns out that I came out on the low end of obliterate.

> > implementation should beat yours by virtue of the fact that it does not
> > *copy* the entries character by character, and it uses code position to
>
> FYI a *copy* is just an extra variable assignment.

Well, in the case of your 300MB test file, it's 300 million
assignments. The real point, though, is that you are essentially
generating two data vectors with interleaved access. For CPUs, and/or
hard disks with auto-detect prefetch logic built in (i.e., all of them
these days) this will prevent it from working as effectively as
possible (you always want to minimize the number of streams of
uncached data that you are working on at once.)

Michael B Allen

May 20, 2003, 2:09:12 PM
On Tue, 20 May 2003 07:46:55 -0400, Paul Hsieh wrote:

> http://www.pobox.com/~qed/bcsv.zip

When I run this it chews up all the memory on my machine and prints one
row. It appears that bread is reading the entire file into memory. I can
understand why you would want to change the size of the input file :)
Is there an update to this?

Mike

Paul Hsieh

May 20, 2003, 5:43:49 PM
Michael B Allen <mba...@ioplex.com> wrote:

No, the parsing grammar given suggests that in theory you may need to
read the whole file just to parse it properly. A quoted string can
enclose the '\n' terminator, so you can't correctly parse each "row"
of a CSV file by trying to read it in one line at a time. And the
space required for holding the parsed CSV will, in general, be about
the same as that for the string read (which is modified in place),
which I assumed was the really general problem being solved.
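The point about embedded newlines can be illustrated with a minimal quote-aware scanner. This is a hypothetical sketch (the function name and structure are mine, not Paul's actual parser):

```c
#include <stdio.h>

/* Hypothetical sketch (not Paul's parser): skip one CSV record,
   honouring quotes.  A '\n' inside a quoted field does not end the
   record, which is why fgets()-style line-at-a-time reading cannot
   parse CSV correctly.  Returns 1 if a record was consumed, 0 at EOF. */
int csv_skip_record(FILE *f)
{
    int c, in_quotes = 0;
    while ((c = getc(f)) != EOF) {
        if (c == '"')
            in_quotes = !in_quotes;   /* a "" escape toggles twice */
        else if (c == '\n' && !in_quotes)
            return 1;                 /* record ends only outside quotes */
    }
    return 0;
}
```

Feeding this `a,"x\ny",b` shows the embedded `\n` does not terminate the record.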

If you need a way to *stream* through the file, that obviously would
be a lot different (though now I can see how that might be a real
world scenario, if, for example, you only wanted to retain a sub-set
of the DB -- as I said, I'm not a DB person, so these things aren't
immediately obvious to me), but interestingly, using the bStream
functions could bring a lot more useful bstring functionality into
it.

This is why I should have waited for the test conditions *before* I
wrote any code. Anyhow, the program should work just fine on smaller
files that can be read into memory.

Paul Hsieh

May 20, 2003, 10:25:34 PM
q...@pobox.com (Paul Hsieh) wrote in message news:<796f488f.03052...@posting.google.com>...

> Michael B Allen <mba...@ioplex.com> wrote:
> > On Tue, 20 May 2003 07:46:55 -0400, Paul Hsieh wrote:
> > > http://www.pobox.com/~qed/bcsv.zip
> >
> > When I run this it chews up all the memory on my machine and prints one
> > row. It appears that bread is reading the entire file into memory. I can
> > understand why you would want to change the size of the input file :)
> > Is there an update to this?

There is now. The latest version has a slightly simpler state
machine and can chew through files in a stream-like fashion with no
compromise on correctness. If a single entry is larger than the
available memory, then there is nothing I can do about that, but
it can handle anything short of that.

The bstring library finally gets some sort of real test out of this
as well. Using a bStream allows the input to be accumulated in a
safe and fast way.

Michael B Allen

May 21, 2003, 2:18:40 AM
Ok,

Here is my final submission. By my calculations it is 23% faster than
yours. Run unbcsv.exe vs. bcsv.exe:

http://users.erols.com/mballen/csvtest/unbcsv.zip

But then I think you have lost focus of the problem. The objective was to
prove to me that your bstring ADT could perform as well as or supposedly
better than my very plain and simple char * version, as you boldly claimed:

"My own testing shows that bstrings incur relatively little overhead
cost, and of course, in many cases will obliterate the standard C
char * library in performance."

The challenge was:

"Substitute your string ADT where you feel it suits the problem
and helps make the code more secure, faster, easier to understand,
better, etc."

But you didn't really *use* your bstring ADT. You cannot even claim it
contributed to a more secure implementation as you can see my version does
precisely the same thing without using bstring at all. Bstring amounted
to nothing but overhead.

This was not a contest to see who could write a CSV parser that might as
well have been assembly. I don't know what ever made you think it was.
You've spectacularly illustrated why most programmers believe heavy
optimization is not always a good thing. In this case you brought out
a donkey to race in the Kentucky Derby.

Ironically, by your own hand, you have provided what is probably a very
good metric for how bstring contributed to the CSV problem. In fact, using
bstring for this purpose resulted in a 29% decrease in performance.

Have a nice day,
Mike

PS: Your code folds the current row into the next if there's space
trailing the end quote on the last row element. Also it completely
chokes on extra carriage returns.

Mark Gordon

May 21, 2003, 9:40:01 AM
On 20 May 2003 01:04:18 -0700
q...@pobox.com (Paul Hsieh) wrote:

<snip>

> The mere act of parsing CSV files is unlikely to be the sole desire of
> a programmer using such a routine. Since the final strings are
> packaged into struct tagbstring's they are perfect for just about any
> of the next logical operations you would expect in a DB application
> using bstrlib. For example, since biseq() dramatically outperforms
> strcmp() in most cases (because of retaining and checking the lengths
> first) this means that exact searches, for example, will perform much
> better.
>
> I am not a database guy, so perhaps someone can suggest what some
> likely operations on a DB are.
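The length-first comparison described above can be sketched as follows. The struct and function names here are hypothetical illustrations, not bstrlib's actual `biseq()` code:

```c
#include <string.h>  /* memcmp */
#include <stddef.h>  /* size_t */

/* Hypothetical length-prefixed string, in the spirit of the biseq()
   argument above -- NOT bstrlib's actual type. */
struct lstr {
    size_t len;
    const char *data;
};

/* Equality test that rejects on length in O(1) before touching bytes,
   an early exit that strcmp() on bare char * cannot make. */
int lstr_eq(const struct lstr *a, const struct lstr *b)
{
    if (a->len != b->len)
        return 0;                /* cheap early exit */
    return memcmp(a->data, b->data, a->len) == 0;
}
```

For exact-match searches over many fields, most candidates differ in length and are rejected without any byte comparison at all.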

In the application I maintain, the most likely processing of a CSV file
is of the form:
Read and decode line
Validate against information in another database
Generally use some of the data to do lookups in other databases
Write to our database

The processing being 1 line at a time, so you only need to hold one
line. Actual CSV files can be in the 100K+ sometimes, at other times
under 1K.
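The one-record-at-a-time pipeline described here can be sketched as below. The `validate()`/`store()` helpers are hypothetical stand-ins for the database steps, and the sketch assumes records without embedded newlines (so `fgets()` is adequate) and lines shorter than the buffer:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers standing in for the database steps. */
static int stored;
static int validate(const char *rec) { return rec[0] != '\0'; }
static void store(const char *rec)   { (void)rec; stored++; }

/* One record at a time: memory use is bounded by the longest line,
   not the file size, so a 100K+ import doesn't hurt other users. */
int import_csv(FILE *f)
{
    char line[4096];
    while (fgets(line, sizeof line, f)) {
        line[strcspn(line, "\n")] = '\0';  /* decode: strip newline */
        if (validate(line))                /* validate against other data */
            store(line);                   /* write to our database */
    }
    return stored;
}
```

This is the bounded-memory property Mark is after: the buffer size, not the file size, caps per-import memory.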

CSV is good for porting data between systems, but very bad for internal
usage. Holding the entire file in memory is bad because you then have
unbounded memory requirements on a system that might have another
hundred users doing things, although most would not be importing large
CSV files.
--
Mark Gordon
Paid to be a Geek & a Senior Software Developer
Currently looking for a new job commutable from Slough, Berks, U.K.
Although my email address says spamtrap, it is real and I read it.

James Antill

May 21, 2003, 7:25:23 PM
q...@pobox.com (Paul Hsieh) wrote in message news:<796f488f.03052...@posting.google.com>...

> james-...@and.org (James Antill) wrote:
> > And with the way the Vstr API works I'd only need to alter the input
> > if I replaced a "" with a " (which doesn't happen often in real input).
>
> Oh. I actually thought that you might opt to use Vstr's high
> performance insert and delete mechanisms to beat both Michael and I on
> this one aspect of the problem. It can be applied as a last stage
> operation, so I am surprised you choose not to do this -- do you have

Well, there isn't any point in copying when no copying needs to be
done.

> a restriction that you cannot modify a Vstr that you are currently
> iterating through?

Yes.

http://www.and.org/vstr/functions.html#vstr_iter_fwd_beg()



> In any event, my solution has the exact same properties. It's
> completely read-only except for when double quote cases are detected.
> Even then, the modification is performed in-place since we know that
> this shrinks the string.
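The in-place shrink described in the quote can be sketched like this (a hypothetical function, not Paul's actual code): since collapsing each `""` escape to `"` never lengthens the string, a write cursor trailing the read cursor in the same buffer is safe.

```c
/* Hypothetical sketch (not Paul's actual code): collapse each ""
   escape to a single " in place.  The output is never longer than
   the input, so no second buffer and no allocation is needed. */
void unescape_quotes(char *s)
{
    char *w = s;                 /* write cursor trails read cursor */
    for (; *s; s++) {
        *w++ = *s;
        if (s[0] == '"' && s[1] == '"')
            s++;                 /* drop the second quote of the pair */
    }
    *w = '\0';
}
```

This is the modification-only-on-escape behaviour being claimed: input with no doubled quotes is copied onto itself unchanged, and the buffer only ever shrinks.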

Well you were probably confused by my message, as I meant alter the
Vstr ... which sometimes means alter part of the Vstr data, but not
always. As an extreme, think about...

perl -e 'print "\"" . 'x' x (1024 * 1024 * 8) . \
("\"" x 3) . "\n";' > /tmp/bad.csv

...here a Vstr with mmap()'d data will start with...

node[1] = "\"xxx" ... "xxx\"\"\"";

...and end with...

node[1] = "xxx" ... "xxx"
node[2] = "\"";

...with no writing of actual data, at all. On the other hand,
anything using a single (char *) as a store will have to re-write
the entire block of memory.
In the above example Vstr takes 1% of the time of bstring (note
again I say that's extreme -- and Vstr's worst case is a repeat of
"x\"\"", but even here it seems to win).

> I would argue that bstring has the advantage that it is designed
> precisely to interoperate with char * manipulations, however, clearly

vstr_export_cstr_ptr() is there for a reason, and it's pretty efficient
(definitely efficient enough, unless you are moving data around and
changing it all the time ... but if you are doing that you need to work
with Vstr's as the primitive type :).

> my lack of deep understanding of Vstr (I hadn't thought of the idea of
> an inlined iterator) prevents me from categorically claiming that
> bstrlib's direct interoperability leads to a truly better solution
> than Vstr in practice.

My guess is that scarily hand-optimized code like your csv parser will
perform slightly better with a single chunk of memory, due to cache
issues. But it will almost certainly...

1. Have corner cases that are very bad.

2. Have much worse bugs.

...of course Vstr also has a much richer API, but then I'm biased.

> If you are interested in some C language bashers, listen to Todd
> Proebsting's talk from the afternoon session here:
>
> http://ll2.ai.mit.edu/
>
> Part of what he says is that parsing is a harder problem than people
> are willing to admit, and I think he is right. Only in the most
> trivial cases can you implement a correct parser that is fast,
> correct/safe, and elegant. I chose to compromise on the third. As
> for maintainability, wouldn't you also have to consider the actual
> number of bugs in the code itself?

Yes, I'd consider bugs ... but I'd also consider bugs I don't know
about. Michael's code had bugs, but I'm fairly confident that
I could easily do an audit and catch most/all the bugs. Your code
might be bug free ... but I don't know.

> > It also doesn't deal with multiple lines of input, so I can't compare
> > it to the libmba code.
>
> Huh? The first version I uploaded could read multiple lines, so I
> don't know exactly what you are referring to. In any event I have
> updated it, so it now supports arbitrary number of columns, and can be
> easily mapped to a single data structure to hold the completely parsed
> CSV file if so desired (though the demo code doesn't do this.) Did
> you notice a bug? Am I handling empty lines wrong or something?

The output for the first version didn't print more than the first line.
The version I have now does.

> > I should probably find time to write a vstr version, as it should be
> > readable and fast (mainly due to not having to do any copies of the
> > data -- see the progression, copy lots: slow, copy in blocks: faster,
> > copy none: safest/fastest).
>
> Come on and join the fun!

http://www.and.org/vstr/csv_vstr.c

James Antill

May 21, 2003, 7:42:15 PM
Michael B Allen <mba...@ioplex.com> wrote in message news:<pan.2003.05.21.02....@ioplex.com>...

> Ok,
>
> Here is my final submission. By my calculations it is 23% faster than
> yours. Run unbcsv.exe vs. bcsv.exe:
>
> http://users.erols.com/mballen/csvtest/unbcsv.zip

Well this unbcsv version is faster than my vstr version...

http://www.and.org/vstr/csv_vstr.c

...but I'm not sure how much of that is the solution and how much the
string ADT (and even so, I'll take my version for 99.9999% of the times
I'd need one). The libmba version, which is a similar solution, is
slower than the vstr version (even after fgetc() => getc_unlocked()),
but the semantics are slightly different (I half wrote the csv parser
for this group and half because I thought it might be a nice addition
to Vstr or an addon library).

And as I said the corner case of a large field with a "" in it
is 100x faster in Vstr.

> But then I think you have lost focus of the problem. The objective was to
> prove to me that your bstring ADT could perform as well as or supposedly
> better than my very plain and simple char * version, as you boldly claimed:
>
> "My own testing shows that bstrings incur relatively little overhead
> cost, and of course, in many cases will obliterate the standard C
> char * library in performance."
>
> The challenge was:
>
> "Substitute your string ADT where you feel it suits the problem
> and helps make the code more secure, faster, easier to understand,
> better, etc."

This is what was originally said by you...

``A string "ADT" is unnecessary in C. This is because C supports pointer
arithmetic which permits in-situ string manipulation whereas a string
"ADT" suggests that memory to hold (and thus abstract) strings is
frequently allocated and freed. This incurs performance penalties
similar to that of string handling in languages that do not support
pointer arithmetic.''

So my statement back would be...

``You _can_ use string ADTs in C and get at least the same
performance ... but also have security from buffer overflows''

> But you didn't really *use* your bstring ADT. You cannot even claim it
> contributed to a more secure implementation as you can see my version does
> precisely the same thing without using bstring at all. Bstring amounted
> to nothing but overhead.

But you can't do the same as Vstr did ... I don't write to the string's
data at all, I don't need to. The other valid point is that you always do
more than just parse the fields out of a CSV file ... you often alter the
data before and after, and it's those things which will still be safe in
even the bstring version.
