What is "TP_NUM_C

Robbie Hatley

unread,

Apr 10, 2016, 11:59:54 PM4/10/16

to

I spent most of today struggling to fix a bug in a function,
a regular-expression substitution function based on the GNU C
library's regex facilities. (Though, that library is pissing me off,
so I may end up going to the new C++ regex functionality instead.)

The point where the bug occurred looked something like this:

std::string rhregex::Substitute(std::string const & Text...)
{
...hundreds of lines of code...
cerr
<< "In rhregex::Substitute(), about to set Text2 equal to Text."
<< endl;
std::string Text2 = Text; // CATASTROPHIC PROGRAM CRASH!!!

Now, there's nothing wrong with that code itself. It just copies
the contents of one string (via a const ref function parameter)
to a second string. Legal, harmless, safe, impossible to crash.

AND YET, the program was crashing at that point consistently!
The error messages sometimes read:

segmentation fault, core dumped

And on other occasions read:

0 [main] renamefiles 6040 C:\cygwin64\rhe\bin\util\renamefiles.exe:
*** fatal error - Internal error: TP_NUM_C_BUFS too small: 50
Hangup

(And on one occasion the "error message" consisted of several hundred
pages of absolute gibberish: graphics characters, smudges, planet
symbols, file listings from my current directory, the contents of
a couple of C++ source files, all jumbled up. Talk about "undefined
behavior"!!! That was the most "undefined" result I've ever got from
running *any* computer program!)

So... what are "TP_NUM_C_BUFS"?
And why is it a problem that there's only 50 of them?
And why were they making my program crash?

Now, for information purposes, I *did* finally make the program
stop crashing. It was a process of elimination, commenting-out
larger and larger portions of the code, starting from the crash
point and working UP towards the top of the function. The CAUSE
turned out to be the very first line of the function:

regex_t RegEx; // ACTUAL CAUSE OF PROGRAM CRASH

To make the program stop crashing, I replaced that line of code
with the following:

regex_t * RegEx = NULL;
RegEx = reinterpret_cast<regex_t *>(malloc(sizeof(regex_t)));
assert(RegEx);

So obviously the regex_t type has the buggy undocumented "feature"
that you can't safely define it as an auto variable on the stack;
it *must* be malloc'ed on the heap instead. I wish this factoid
had made it into the GNU C-lib manual.

(And this explains all the other crashes, messages, and gibberish:
stack corruption.)

But the nagging question remains: What is "TP_NUM_C_BUFS"???

--
Cheers,
Robbie Hatley
Midway City, CA, USA
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
http://www.well.com/user/lonewolf/
https://www.facebook.com/robbie.hatley

Alf P. Steinbach

unread,

Apr 11, 2016, 12:48:02 AM4/11/16

to

On 11.04.2016 05:59, Robbie Hatley wrote:
>
> [snip]

> The point where the bug occurred looked something like this:
>
> std::string rhregex::Substitute(std::string const & Text...)
> {
> ...hundreds of lines of code...
> cerr
> << "In rhregex::Substitute(), about to set Text2 equal to Text."
> << endl;
> std::string Text2 = Text; // CATASTROPHIC PROGRAM CRASH!!!

Probably `Text` has been initialized with 0 or `nullptr`.

That arms an UB bomb.

Other causes include a buffer overflow somewhere (ditto).

> [snip] The CAUSE

> turned out to be the very first line of the function:
>
> regex_t RegEx; // ACTUAL CAUSE OF PROGRAM CRASH
>
> To make the program stop crashing, I replaced that line of code
> with the following:
>
> regex_t * RegEx = NULL;
> RegEx = reinterpret_cast<regex_t *>(malloc(sizeof(regex_t)));
> assert(RegEx);

Unless `RegEx` is a POD type this introduces more UB. And if it doesn't,
then its only effects are to (1) add a time consuming operation, (2)
hide the real problem, and (3) possibly introducing a memory leak or
other such issue.

The idea of searching systematically for the problem is good.

But most likely the problem is in the calling code, not in the function
itself.

> But the nagging question remains: What is "TP_NUM_C_BUFS"???

It's just the g++ way of saying, "got you... muwhahaha!"

Cheers & hth.,

- Alf

Daniel

unread,

Apr 11, 2016, 2:05:37 AM4/11/16

to

On Sunday, April 10, 2016 at 11:59:54 PM UTC-4, Robbie Hatley wrote:
>
> But the nagging question remains: What is "TP_NUM_C_BUFS"???
>

It indicates stack corruption. It may be a cygwin issue (similar issues have
been reported by others.)

Daniel

Paavo Helde

unread,

Apr 11, 2016, 5:58:49 AM4/11/16

to

On 11.04.2016 6:59, Robbie Hatley wrote:
>
> I spent most of today struggling to fix a bug in a function,
> a regular-expression substitution function based on the GNU C
> library's regex facilities. (Though, that library is pissing me off,
> so I may end up going to the new C++ regex functionality instead.)
>
> The point where the bug occurred looked something like this:
>
> std::string rhregex::Substitute(std::string const & Text...)
> {
> ...hundreds of lines of code...
> cerr
> << "In rhregex::Substitute(), about to set Text2 equal to Text."
> << endl;
> std::string Text2 = Text; // CATASTROPHIC PROGRAM CRASH!!!
>
> Now, there's nothing wrong with that code itself. It just copies
> the contents of one string (via a const ref function parameter)
> to a second string. Legal, harmless, safe, impossible to crash.
>
> AND YET, the program was crashing at that point consistently!
> The error messages sometimes read:
>
> segmentation fault, core dumped
>
> And on other occasions read:
>
> 0 [main] renamefiles 6040 C:\cygwin64\rhe\bin\util\renamefiles.exe:
> *** fatal error - Internal error: TP_NUM_C_BUFS too small: 50
> Hangup

Looks like an Undefined Behavior bug expressed. Probably some memory
corruption somewhere.

>
> Now, for information purposes, I *did* finally make the program
> stop crashing.
>

> regex_t RegEx; // ACTUAL CAUSE OF PROGRAM CRASH
>
> To make the program stop crashing, I replaced that line of code
> with the following:
>
> regex_t * RegEx = NULL;
> RegEx = reinterpret_cast<regex_t *>(malloc(sizeof(regex_t)));
> assert(RegEx);
>
> So obviously the regex_t type has the buggy undocumented "feature"
> that you can't safely define it as an auto variable on the stack;
> it *must* be malloc'ed on the heap instead. I wish this factoid
> had made it into the GNU C-lib manual.

No, this is not the reason and you did not fix the issue. All you did
was to move the memory corruption to some other place, hiding the bug.
An hidden bug is much worse than a clear bug.

Suggesting to review your code to find out the actual bug. Most probably
you are passing invalid arguments to the library.

It is not excluded that the bug is in the library itself, the CygWin
ports are not very widely used or tested, so there is some chance (say
1%) that the bug is not in your code.

HTH
Paavo

Robbie Hatley

unread,

Apr 11, 2016, 2:24:30 PM4/11/16

to

On 4/11/2016 2:58 AM, Paavo Helde wrote:

> No, this is not the reason and you did not fix the issue. All you did was
> to move the memory corruption to some other place, hiding the bug.
> An hidden bug is much worse than a clear bug. Suggesting to review your code
> to find out the actual bug. Most probably you are passing invalid arguments
> to the library.

I want to say it can't be so, but that is a possibility, because the library
I'm using (Cygwin C Library) is not written by any one person or company,
but is a hodge-podge of code stolen from Red Hat, BSD, and FSF/GNU, along
with some written by the Cygwin folks themselves. And while Cygwin's
C Lib comes with a "Red Hat C Standard Library" html documentention file,
it doesn't cover their Regex stuff. GNU's C Library Manual covers it, but
it quite vague in many places (such as, doesn't tell you if a function
allocates memory for an input variable, or if you should allocate it yourself,
and if so, whether it should be stack, static, dynamic, or whatever).
And the Regex stuff in Cygwin's C Lib was actually taken from BSD, so
I suppose I should hunt down a BSD C Lib manual. Sigh. What a mess.

But I just learned that there *is* a better way (see below)....

> It is not excluded that the bug is in the library itself, the CygWin
> ports are not very widely used or tested, so there is some chance (say 1%)
> that the bug is not in your code.

Sounds like a good reason to avoid the C Lib in C++ code.

Especially since I just learned this morning (from reading Stroustrup's
"The C++ Programming Language 4th Ed.", which I just got my hands on;
the version I WAS using was the 2000 edition) that I can replace
the entire function I was creating -- a regex substitution function --
with the following, using the new C++ 2011 std lib:

// #include <regex.h> // No, DON'T include this piece of $#%^!
#include <regex> // Let's try THIS header instead.
regex_replace(.......); // Already written for me. Cool! :D

Replaces 343 lines of code with... well... 0. I can erase my entire
homebrew regex module, seeing as how C++ 2011 std lib already
has easy-to-use regex match, search, replace tools. Cool.

Well, it was any interesting exercise in tracing an elusive memory
corruption bug (which is clearly what it was/is). Perhaps I'll pursue
debugging my buggy C regex module later as an exercise. But for
any serious usage I think I'll use C++ regex instead.

I *still* wonder what the heck "TP_NUM_C_BUFS" is, though. :-)

Paavo Helde

unread,

Apr 11, 2016, 2:45:13 PM4/11/16

to

Good to hear! Yes of course it is better to use "builtin" C++ regex
whenever possible. And I am still pretty sure the actual bug was in some
of those deleted 343 lines (and deleting them was a good way to get rid
of the bug!)

> Well, it was any interesting exercise in tracing an elusive memory
> corruption bug (which is clearly what it was/is). Perhaps I'll pursue
> debugging my buggy C regex module later as an exercise. But for
> any serious usage I think I'll use C++ regex instead.
>
>
> I *still* wonder what the heck "TP_NUM_C_BUFS" is, though. :-)

According to Google, this is a constant defined in some cygtls.h header
file (e.g.
https://sourceforge.net/u/earnie/msys2/msys2/ci/01e8fa7372b33b9766bb5dd3e059cfab695bbf90/tree/winsup/cygwin/cygtls.h).
"tls" means thread local storage, a known source of problems and
confusion on Windows. But I believe your issue had nothing to do with
the thread local storage, it's just that this code bothered to do some
consistency checks and noticed the memory corruption.

Cheers
Paavo

Robbie Hatley

unread,

Apr 11, 2016, 3:18:49 PM4/11/16

to

On 4/10/2016 9:47 PM, Alf P. Steinbach wrote:

> Probably `Text` has been initialized with 0 or `nullptr`.
> That arms an UB bomb.

At the point where the program always crashed, Text is a
std::string const & parameter to my "substitute" function,
and it refers to a std::string containing "randip.c".

Drastically simplified call chain:

int main ()
{
...
CallingFunc();
...
return 0;
}

int CallingFunc ()
{
...
std::string source = "randip.c";
std::string pat = "and";
std::string repl = "or";
std::string result = substitute(source, pat, repl, 'g');
...
return 0;
}

std::string substitute
(
std::string const & Text,
std::string const & Pat,
std::string const & Repl,
char Flag // 'g' for global
)
{
regex_t Regex; // line which appears to have corrupted stack
... hundreds of lines of code...
std::string Text2 = Text; // accesses previously corrupted stack???
... hundreds more lines of code...
return Text2;

}

> Other causes include a buffer overflow somewhere (ditto).

I don't make much use of fixed-length buffers in C++ code,
preferring std::string over char[], vector<MyClass> over MyClass[], etc.
So it's not something as simple as stick 73 objects in a 50-object array.

Or if it is, it's in library code. Perhaps that's what the error msg,
"TP_NUM_C_BUFS too small: 50" was all about. :::shrugs:::

> Unless `RegEx` is a POD type this introduces more UB. And if it doesn't,
> then its only effects are to (1) add a time consuming operation,
> (2) hide the real problem,

The "regex_t" type from the BSD and GNU C libs's "regex.h" headers
is poorly documented, so it's hard to know what's in it. The documentation
says "This struct contains several more fields than are documented here,
but you don't need to know about those, as only library internal functions
should access them". So GNU's attitude is, "You don't want to know; and if
you do, bugger off."

But I'm curious: why would dynamically allocating memory for an object
which later dynamically allocates still more memory for its own use
be a problem? Seems straightforward to me. Since this is a C library,
I doubt regex_t has "member functions"; the struct itself is likely POD,
but it definitely has associated functions in the library for dynamic memory
allocation.

> and (3) possibly introducing a memory leak or other such issue.

A function "regfree" is provided for freeing memory allocated by
objects of type regex_t (or more accurately, by their associated functions),
but the documentation says "regfree doesn't free the object itself",
implicitly implying that you malloced the regex_t object. And indeed,
when I did that, the bug seemed to go away. But creating and freeing
a regex_t object is a hassle:

// Make the object:

regex_t * RegEx = NULL;

RegEx = (regex_t *)malloc(sizeof(regex_t));
assert(RegEx);
... use the object....
// Free the object:
regfree(RegEx);
assert(RegEx);
free(RegEx);

> The idea of searching systematically for the problem is good.
> But most likely the problem is in the calling code, not in the
> function itself.

I suppose it's possible. I'll also look there, and see if I'm doing
anything funky. But I don't think it's as simple as sending wrong stuff
to substitute(), as those arguments are very straightforward.

> > But the nagging question remains: What is "TP_NUM_C_BUFS"???
>
> It's just the g++ way of saying, "got you... muwhahaha!"

:-)

Well, the Cygwin C regex module appears to be the work of Henry Spencer
of the University of Toronto, who contributed it to UC Berkeley,
who used it in BSD, from whence Cygwin borrowed and modified it,
stirring in stuff from Red Hat and FSF/GNU.

So maybe it was Henry Spencer, or Richard Stallman, or someone else,
who thought, "HEY, I know, I'll name the error code for when xxxxxxx
happens 'TP_NUM_C_BUFS too low'; that'll confuse the hell out of
our users! MUAH-HA-HA-HA-HA!!!"

Anyway, long-term fix is to go over to using header <regex> (C++ 2011)
instead of header <regex.h> (inherited from C). It already has a
regex_replace() function, making my module obsolete except as an
exercise in troubleshooting obscure memory corruption bugs. (I just
learned of the existence of the 2011 C++ std lib regex functionality
this very morning.)

Scott Lurndal

unread,

Apr 11, 2016, 4:08:28 PM4/11/16

to

Robbie Hatley <see.m...@for.my.address> writes:
>
>On 4/11/2016 2:58 AM, Paavo Helde wrote:
>
>
>> No, this is not the reason and you did not fix the issue. All you did was
>> to move the memory corruption to some other place, hiding the bug.
>> An hidden bug is much worse than a clear bug. Suggesting to review your code
>> to find out the actual bug. Most probably you are passing invalid arguments
>> to the library.
>
>
>I want to say it can't be so, but that is a possibility, because the library
>I'm using (Cygwin C Library) is not written by any one person or company,
>but is a hodge-podge of code stolen from Red Hat, BSD, and FSF/GNU, along
>with some written by the Cygwin folks themselves. And while Cygwin's
>C Lib comes with a "Red Hat C Standard Library" html documentention file,
>it doesn't cover their Regex stuff. GNU's C Library Manual covers it, but
>it quite vague in many places (such as, doesn't tell you if a function
>allocates memory for an input variable, or if you should allocate it yourself,
>and if so, whether it should be stack, static, dynamic, or whatever).
>And the Regex stuff in Cygwin's C Lib was actually taken from BSD, so
>I suppose I should hunt down a BSD C Lib manual. Sigh. What a mess.
>

http://pubs.opengroup.org/onlinepubs/9699919799/functions/regexec.html
http://pubs.opengroup.org/onlinepubs/9699919799/

Richard

unread,

Apr 11, 2016, 4:39:36 PM4/11/16

to

[Please do not mail me a copy of your followup]

Robbie Hatley <see.m...@for.my.address> spake the secret code
<QOOdnblOB4atvJbK...@giganews.com> thusly:

>The point where the bug occurred looked something like this:
>
>std::string rhregex::Substitute(std::string const & Text...)
>{
> ...hundreds of lines of code...
> cerr
> << "In rhregex::Substitute(), about to set Text2 equal to Text."
> << endl;
> std::string Text2 = Text; // CATASTROPHIC PROGRAM CRASH!!!

> [...]

I'm going to take a step back here to add a viewpoint that hasn't yet
been expressed in this thread. I agree that it is unlikely that the
GNU C Library Regex facility requires that it's values be allocated on
the heap in order to function properly. That would be unusual for the
any Standard Library facility. They're designed intentionally so that
you "only pay for what you use".

TL;DR: Monster methods are a source of bugs. Refactor them into
smaller units that your brain can reason about with confidence.

If this code is easy to test in isolation, I should first write a
unit test for it that reproduces the problem. What you've written
suggests that this is a "monster method". Writing an isolated test
for it is likely to be difficult because it almost certainly doesn't
have "hundreds of lines of code" that execute in a linear fashion.
That means lots of branching and conditionally executed code. So
writing a test is likely to be difficult. If not, certainly write one
that reproduces the problem.

That "hundreds of lines of code" is really bugging me. Chances are
that there is another "hundreds of lines of code" after the part
you're showing us. Compose Method is your friend here.[1]
(See below for an alternative: Replace Method with Method Object.)
Most likely your "hundreds of lines of code" is introducing lots and
lots of local variables throughout. Almost certainly it was decided
that reusing local variables was preferable to each chunk of code using
only it's own variables.

Code like this makes it difficult to apply Compose Method because there
is lots of shared state (the variables) between various chunks of code.
They aren't coupled through the values of the variables because each
chunk overwrites the value with it's own local initialization, but
they are coupled through the names of the variables that are reused.
In order to break this up, you'll want to refactor each isolated use of
a reused variable into a use of a distinct variable. Unfortunately
there isn't any good tool for C++ that can do this automatically and
reliably, so you'll need to do it by hand.

Here's where you should lean on your version control system really
heavily. Each time you decouple a reused variable you'll want to
decouple a single reuse instance and then make a commit into your
version control system for it. Yes, this may result in 50+ commits
for these several hundred lines of code, but so what? Commits are
free. From the way you describe it, I assume this code has 0% unit
test coverage. Small, slow steps are what you need in order to bring
some sanity to this monster method. When you transform code in small
steps anyone can follow the version control history and see what was
done and how it was done. When a huge method like this is refactored
in one gigantic step, then looking at the diffs becomes pointless.
The diff viewer will say "several hundred lines of code were removed
and replaced with several hundred other lines of code", making for an
unintelligible change. When looking at the diff, the average person
will be hard pressed to determine if the change was for the better or
not in terms of bugs and preserved functionality.

Once we've reduced the coupling of various chunks of code, we can
start to Extract Method on the various chunks in order to achieve
Compose Method. Here, there are good tools to help: CLion for
Linux/Mac and VS2015/ReSharper for C++ for Windows. They all have
pretty good automated Extract Method functionality that handles the
dataflow needed to properly deduce input arguments, return types and
their proper signatures. (Sorry, VisualAssist fairs pretty poorly
here and I can't recomment it as it causes too much manual fixup after
the fact.) If they don't deduce the types to best practices (passing
std::string input args by value instead of const reference), then
applying Change Signature on the extracted function gets you where you
want to be.

Why do Compose Method? We do it because large methods with lots of
local variables are nearly impossible to reason about with confidence.
Breaking a monster method up into a series of calls to smaller methods
that we named with intention revealing names[2] let's us see the
forest instead of the trees.

Once things are broken down into small chunks we can reason about
easily, we often spot the small error that caused the bug in the first
place. It was "hiding in plain sight" inside that monster method.

If you end up with very large argument lists because all these
extracted chunks are communicating large amounts of state between
them, then Replace Method with Method Object[3] may be a better
alternative to Compose Method. With a Method Object, almost everything
that was local state in the monster method becomes instance variables in
the new object. This instance state doesn't need to be passed around
between methods. We apply the same process as in Compose Method to
break the monster method into smaller chunks.

There's something nice that can happen when we use a Method Object
instead of Compose Method on an existing object. The new Method
Object can make all of it's extracted methods public, increasing the
visible surface area that we can unit test from the outside. The
Method Object is visible to us internally for testing purposes, but
isn't exposed to clients as part of our public API anymore than the
local variables in the monster method were exposed publicly.

[1] <https://www.industriallogic.com/xp/refactoring/composeMethod.html>
[2] <http://c2.com/cgi/wiki?IntentionRevealingNames>
[3] <http://refactoring.com/catalog/replaceMethodWithMethodObject.html>
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
The Terminals Wiki <http://terminals.classiccmp.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Robbie Hatley

unread,

Apr 12, 2016, 12:16:16 PM4/12/16

to

For comparison purposes, here are the C++ and C versions of my
"Substitute" function, as stand-alone C++ programs, dependent
only on std libraries (and the GNU C regex library in the case
of the C-style version).

First, the C++ version.
Total code lines: 38.
Total time spent writing & debugging: 60 minutes.

#include <iostream>
#include <regex>
using std::cout;
using std::cerr;
using std::endl;
std::string Substitute
(
std::string const & Pattern,
std::string const & Replacement,

std::string const & Text,

std::string const & Flags // If Flags contains 'g', do global replace;
) // otherwise, replace first occurrence only.
{
std::regex_constants::match_flag_type RegexFlags {};
if (std::string::npos == Flags.find_first_of("g"))
{
RegexFlags = std::regex_constants::format_first_only;
}
std::regex RegEx {Pattern};
std::string Result = std::regex_replace(Text, RegEx, Replacement, RegexFlags);
return Result;
} // end function rhregex::Substitute()
int main (int Beren, char ** Luthien)
{
std::string Pattern {Luthien[1]};
std::string Replacement {Luthien[2]};
std::string Text {Luthien[3]};
std::string Flags {Luthien[4]};
if (5 != Beren)
{
cerr
<< "Error in regex-c-test.cpp: this program takes 4 arguments," << endl
<< "but you only typed " << Beren << ". Aborting program." << endl;
exit(666);
}
std::string Result = Substitute(Pattern, Replacement, Text, Flags);
std::cout << Result << std::endl;
}

Next, the C-style version.
Total code lines: 322.
Total time spent writing & debugging: many hours

#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>
#include <regex.h>
#undef NDEBUG
#include <assert.h>
#include <errno.h>
#undef BLAT_ENABLE
#ifdef BLAT_ENABLE
#define BLAT(X) std::cerr << X << std::endl;
#else
#define BLAT(X)
#endif

using std::cout;
using std::cerr;
using std::setw;
using std::setfill;
using std::endl;

struct RegExException
{
RegExException(void) : msg(std::string("")) {}
RegExException(char const * Msg) : msg(std::string(Msg)) {}
RegExException(std::string const & Msg) : msg(Msg) {}
std::string msg;
};

std::string Substitute
(
std::string const & Pattern,
std::string const & Replacement,

std::string const & Text,

char Flag
)
{
BLAT("\nJust entered Substitute() with following parameters:")
BLAT("Pattern = " << Pattern)
BLAT("Replacement = " << Replacement)
BLAT("Text = " << Text)
BLAT("Flag = " << Flag << "\n")

int Result {};
std::string ReturnString {};

regex_t * RegEx = NULL;
RegEx = reinterpret_cast<regex_t *>(malloc(sizeof(regex_t)));

assert(NULL != RegEx);

// Make variable ReplacementCopy to hold copy of Replacement.
// (We can't alter Replacement because it is a const reference.)
BLAT("\nIn Substitute(), about to copy Replacement to new string ReplacementCopy.")
std::string ReplacementCopy {Replacement};
BLAT("In Substitute(), finished copying Text to TextCopy.\n")

// Make variable TextCopy to hold copy of Text.
// (We can't alter Text because it is a const reference.)
BLAT("\nIn Substitute(), about to copy Text to new string TextCopy.")
std::string TextCopy {Text};
BLAT("In Substitute(), finished copying Text to TextCopy.\n")

// COMPILE REGULAR EXPRESSION:
BLAT("\nIn Substitute(), about to call regcomp().")
Result = regcomp(RegEx, Pattern.c_str(), REG_EXTENDED);
BLAT("In Substitute(), just returned from regcomp().\n")

// If regcomp() failed, throw exception:
if (0 != Result)
{
BLAT("In Substitute(), regcomp() failed so about to throw exception.")
char ErrorBuffer[505];
size_t ErrorSize = regerror(Result, RegEx, ErrorBuffer, 500);
std::string ErrorMessage =
std::string("regcomp() failed.\nError message from regerror():\n")
+ std::string(ErrorBuffer) + "\n";
if (ErrorSize > 500)
{
ErrorMessage += "(Message truncated to 500 characters.)\n";
}
RegExException up (ErrorMessage);
regfree(RegEx); // Free the (internal) dynamically-allocated memory associated with RegEx.
free(RegEx);
BLAT("In Substitute(), at bottom of \"regcomp() failed so throw exception\" section;")
BLAT("ErrorMessage = " << ErrorMessage)
throw up;
}

// Allocate Matches:
BLAT("\nIn Substitute(), about to allocate Matches.")
regmatch_t * Matches = NULL;
Matches = reinterpret_cast<regmatch_t*>(malloc((RegEx->re_nsub + 5)*sizeof(regmatch_t)));
assert(NULL != Matches);
BLAT("In Substitute(), just allocated Matches.")

// EXECUTE REGULAR EXPRESSION:
// Run regexec(), which will look for matches to the RE in Text:
BLAT("In Substitute(), about to call regexec().")
Result = regexec(RegEx, Text.c_str(), 10, Matches, 0);
BLAT("In Substitute(), called regexec().")

// Don't regfree(RegEx) here; wait until we see if there was an error,
// so that we can feed RegEx to regerror().

// If there was an error, get error message, free RegEx and Matches, and throw an exception:
if (REG_NOERROR != Result && REG_NOMATCH != Result)
{
BLAT("In Substitute(), at top of \"regexec() failed so throw exception\" section;")
BLAT("Result = " << Result)
char ErrorBuffer[505] = {'\0'};
size_t ErrorSize = regerror(Result, RegEx, ErrorBuffer, 500);
std::string ErrorMessage =
std::string("regexec() failed.\nError message from regerror():\n")
+ std::string(ErrorBuffer) + "\n";
if (ErrorSize > 500)
{
ErrorMessage += "(Message truncated to 500 characters.)\n";
}
RegExException up (ErrorMessage);
regfree(RegEx); // Free the (internal) dynamically-allocated memory associated with RegEx.
free(RegEx);
free(Matches);
BLAT("In Substitute(), at bottom of \"regexec() failed so throw exception\" section;")
BLAT("ErrorMessage = " << ErrorMessage)
throw up;
}

BLAT("In Substitute(), after \"if (exec error) throw exception\" section;")
BLAT("just above \"if (no match)\" section.")

// If no match, free all dynamically-allocated memory and return TextCopy.
if (REG_NOMATCH == Result)
{
BLAT("In Substitute(), at top of \"no-match\" section; about to free RegEx and Matches.")
regfree(RegEx); // Free the (internal) dynamically-allocated memory associated with RegEx.
free(RegEx);
free(Matches);
BLAT("In Substitute(), at bottom of \"no-match\" section; about to return TextCopy:")
BLAT("TextCopy = " << TextCopy)
return TextCopy;
}

// If we get to here, we have matches. Grab number of submatches (if any)
// from RegEx.re_nsub and free RegEx:
size_t NumMatches = RegEx->re_nsub;
BLAT("In Substitute, in \"we have matches\" section, NumMatches = " << NumMatches)
regfree(RegEx);
free(RegEx);
BLAT("In Substitute, at bottom of \"we have matches\" section, just ran regfree(RegEx) and free(RegEx).")

// Expand backreferences in Replacement; store expanded version in ReplacementCopy:
for ( size_t i = 1 ; i <= NumMatches ; ++i )
{
// While there are instances of backreference i in Replacement, expand those:
std::ostringstream SS ("");
std::string::size_type Index;
SS << "\\" << setw(1) << i;
BLAT("In Substitute(), inside top of backreference-expansion for loop;")
BLAT("SS.str() = " << SS.str())
while (std::string::npos != (Index = ReplacementCopy.find(SS.str())))
{
BLAT("In Substitute(), just inside top of while(backreferences exist) loop,")
BLAT("Index = " << Index)
if // If there was a match for parenthetical group i...
(
Matches[i].rm_so > -1
&&
Matches[i].rm_eo > -1
&&
Matches[i].rm_so < static_cast<long>(TextCopy.size())
&&
Matches[i].rm_eo <= static_cast<long>(TextCopy.size())
&&
Matches[i].rm_eo > Matches[i].rm_so
)
{
BLAT("In Substitute(), inside if (match exists to backreference).")
// Expand current instance of backreferrence i:
ReplacementCopy.replace
(
Index,
2,
Text.substr
(
std::string::size_type(Matches[i].rm_so),
std::string::size_type(Matches[i].rm_eo - Matches[i].rm_so)
)
);
} // end if (there was a match for parenthetical group i)
else // Otherwise, current backreference is unused, so erase it:
{
BLAT("In Substitute(), inside else (no match to backreference).")
ReplacementCopy.erase(Index, 2);
} // end else (current backreference is unused)
} // end while (unexpanded backreference i instances exist in ReplacementCopy)
} // end for (each submatch, i = 1 through n)

// Replace first match of Pattern in TextCopy:
std::string::size_type RepPos = std::string::size_type(Matches[0].rm_so);
std::string::size_type RepLen = std::string::size_type(Matches[0].rm_eo - Matches[0].rm_so);
BLAT("In Substitute(), about to do replacement on TextCopy with these parameters:")
BLAT("TextCopy = " << TextCopy)
BLAT("RepPos = " << RepPos)
BLAT("RepLen = " << RepLen)
BLAT("ReplacementCopy = " << ReplacementCopy)
TextCopy.replace(RepPos, RepLen, ReplacementCopy);
BLAT("TextCopy after replacement = " << TextCopy)

// If doing global replacement, recurse until ALL instances of Pattern in TextCopy are
// replaced with Replacement. But don't do global replacement if the character '^'
// appears in a context other than a litteral or a character-list inversion. Otherwise,
// we'd violate the user's request that a replacement be done ONLY at the beginning of a
// line.

std::string::size_type index = Pattern.find('^');

if
(
index < std::string::npos
&&
'[' != Pattern[index-1]
&&
'\\' != Pattern[index-1]
)
{
BLAT("In Substitute; about to set flag to 'h' because of unescaped ^.")
Flag = 'h';
}

// Similarly, if the character '$' appears in Pattern in a non-litteral context,
// don't do global replacement:

index = Pattern.find('$');

if
(
index < std::string::npos
&&
'\\' != Pattern[index-1]
)
{
BLAT("In Substitute; about to set flag to 'h' because of unescaped $.")
Flag = 'h';
}

BLAT("In Substitute, just above global recursion section. Flag = " << Flag)

// If Flag is still 'g', do global replacement by recursing this function:
if ('g' == Flag)
{
BLAT("In Substitute, just inside \"if ('g' == Flag)\" section;")
BLAT("TextCopy = " << TextCopy);
// Now here things get very, very tricky! It would be tempting to do this:
// return Substitute(Pattern, Replacement, TextCopy, 'g');
// Tempting... but disastrous! if Pattern matches a substring of Replacement,
// then this would recurse forever! (Actually, it would recurse till it overflows
// the stack and crashes the system.) So we must split TextCopy into "processed" and
// "unprocessed" chunks, and pass only the unprocessed chunk to Substitute at the
// next recursive level down, then re-glom the chucks and return the result:
std::string::size_type FirstChunkSize = Matches[0].rm_so + ReplacementCopy.size();

BLAT("In Substitute, about to recurse;")
BLAT("TextCopy.substr(0, FirstChunkSize ) = ")
BLAT(TextCopy.substr(0, FirstChunkSize))
BLAT("TextCopy.substr(FirstChunkSize , std::string::npos) = ")
BLAT(TextCopy.substr(FirstChunkSize , std::string::npos))

// RECURSE!!!
ReturnString =
// substring, starting at 0, size = FirstChunkSize:
TextCopy.substr(0, FirstChunkSize)
+
Substitute // RECURSE!!!
(
Pattern,
Replacement,
// substring, starting at FirstChunkSize, size = unlimited:
TextCopy.substr(FirstChunkSize, std::string::npos),
'g'
);

BLAT("In Substitute() at bottom of Global section;")
BLAT("just returned from recursion;")
BLAT("ReturnString = " << ReturnString)
}

// Otherwise, just replace the first match:
else
{
ReturnString = TextCopy;
BLAT("In Substitute(), at bottom of \"else ('g' != Flag\" section;")
BLAT("Didn't recurse;")
BLAT("ReturnString = " << ReturnString)
}

BLAT("In Substitute(), about to free Matches.")

// Free Matches:
free(Matches);

// Return ReturnString:
BLAT("In Substitute(), freed Matches, at bottom, about to return;")
BLAT("ReturnString = " << ReturnString)
return ReturnString;
} // end function Substitute()

int main (int Beren, char ** Luthien)
{
std::ios_base::sync_with_stdio();
std::string Pattern {Luthien[1]};
std::string Replacement {Luthien[2]};
std::string Text {Luthien[3]};
char Flag {Luthien[4][0]};
if (5 != Beren)
{
cerr << "Error in regex-c-test.cpp: this program takes 4 arguments," << endl;
cerr << "but you only typed " << Beren << ". Aborting program." << endl;
exit(666);
}
std::string Result = Substitute(Pattern, Replacement, Text, Flag);
std::cout << Result << std::endl;

Paavo Helde

unread,

Apr 12, 2016, 1:54:03 PM4/12/16

to

The above is UB, you must check Beren value earlier.

> if (5 != Beren)
> {
> cerr
> << "Error in regex-c-test.cpp: this program takes 4 arguments,"
> << endl
> << "but you only typed " << Beren << ". Aborting program."
> << endl;
> exit(666);
> }
> std::string Result = Substitute(Pattern, Replacement, Text, Flags);
> std::cout << Result << std::endl;

Inconsistent usage of std:: with cerr and cout.

> }
>
>
>
> Next, the C-style version.
> Total code lines: 322.
> Total time spent writing & debugging: many hours
>
> #include <iostream>
> #include <iomanip>
> #include <string>
> #include <sstream>
> #include <regex.h>

> #undef NDEBUG
> #include <assert.h>
> #include <errno.h>
> #undef BLAT_ENABLE
> #ifdef BLAT_ENABLE
> #define BLAT(X) std::cerr << X << std::endl;
> #else
> #define BLAT(X)
> #endif
>
> using std::cout;
> using std::cerr;
> using std::setw;
> using std::setfill;
> using std::endl;
>
> struct RegExException
> {
> RegExException(void) : msg(std::string("")) {}

A C-ism.

> RegExException(char const * Msg) : msg(std::string(Msg)) {}
> RegExException(std::string const & Msg) : msg(Msg) {}
> std::string msg;
> };

An exception class should be derived from std::exception. Also, why not
just use std::runtime_error?

>
> std::string Substitute
> (
> std::string const & Pattern,
> std::string const & Replacement,
> std::string const & Text,
> char Flag
> )
> {
> BLAT("\nJust entered Substitute() with following parameters:")
> BLAT("Pattern = " << Pattern)
> BLAT("Replacement = " << Replacement)
> BLAT("Text = " << Text)
> BLAT("Flag = " << Flag << "\n")
>
> int Result {};
> std::string ReturnString {};
>
> regex_t * RegEx = NULL;
> RegEx = reinterpret_cast<regex_t *>(malloc(sizeof(regex_t)));

This trickery should not be needed for anything.

> assert(NULL != RegEx);

This assumes that '#undef NDEBUG' is not lost after any code
reorganizations. What's wrong with 'new RegEx()' instead of malloc() (in
a hypothetical scenario when you actually need dynamic allocation)?

This is error-prone and fragile. What's wrong with

std::vector Matches(RegEx->re_nsub + 5) ?

> BLAT("In Substitute(), just allocated Matches.")
>
> // EXECUTE REGULAR EXPRESSION:
> // Run regexec(), which will look for matches to the RE in Text:
> BLAT("In Substitute(), about to call regexec().")
> Result = regexec(RegEx, Text.c_str(), 10, Matches, 0);

The passed nmatch (10) does not correspond to the number of allocated
Matches (RegEx->re_nsub+5). If the allocated size is smaller, then the
regexec function might write in the memory after the array, corrupting
the memory. If it is larger, then you access uninitialized Matches
records later below.

If the allocated size of Matches is larger than 10, then here you will
access uninitialized Matches records if i>=10.

UB, you need to check for Beren earlier.

Robbie Hatley

unread,

Apr 12, 2016, 2:55:26 PM4/12/16

to

On 4/11/2016 2:58 AM, Paavo Helde wrote:

> ...All you did was to move the memory corruption to some other
> place, hiding the bug....

It turns out, that's exactly correct. I found the problem.

In my original code (designed to run on the DJGPP platform)
I had the following:

// ALLOCATE MATCHES:
regmatch_t Matches[10];

// EXECUTE REGULAR EXPRESSION:
Result = regexec(&RegEx, Text.c_str(), 10, Matches, 0);

But when refactoring to work on Cygwin, I also decided to unlimit
the max number of parenthetical subexpressions, so I did this:

// ALLOCATE MATCHES:
size_t NumParSubExp = RegEx.re_nsub;
regmatch_t * Matches =
(regmatch_t*)malloc((NumParSubExp + 1)*sizeof(regmatch_t));

But I forgot and left the execution section unchanged!!!

// EXECUTE REGULAR EXPRESSION:
Result = regexec(&RegEx, Text.c_str(), 10, Matches, 0);

Oh, my. That now writes 10 items to a buffer which, if no
parentheses were actually used in the regular expression,
is big enough only for 1 item, thus corrupting the stack.

So I did the following changes, and now it no longer fails
(so far), so I think I've squashed the bug:

// ALLOCATE MATCHES:
size_t NumParSubExp = RegEx.re_nsub;
regmatch_t * Matches =
(regmatch_t*)malloc((NumParSubExp + 1)*sizeof(regmatch_t));

// EXECUTE REGULAR EXPRESSION:
Result = regexec(&RegEx, Text.c_str(), NumParSubExp + 1, Matches, 0);

I must have looked at that (obviously wrong) code 500 times in
the last few days and I didn't see the glaring "10" sitting
right there. A simple array over-run, basically. A matter of
having the number of C buffers too low to hold the requested stuff.

So the error message "TP_NUM_C_BUFS too low" meant *exactly* what
it said: my number of C buffers was too low.

Paavo Helde

unread,

Apr 12, 2016, 3:12:44 PM4/12/16

to

On 12.04.2016 21:55, Robbie Hatley wrote:
>
> On 4/11/2016 2:58 AM, Paavo Helde wrote:
>
>> ...All you did was to move the memory corruption to some other
>> place, hiding the bug....
>
> It turns out, that's exactly correct. I found the problem.

[...]

> So the error message "TP_NUM_C_BUFS too low" meant *exactly* what
> it said: my number of C buffers was too low.

You are missing a smiley here ;-)

Robbie Hatley

unread,

Apr 12, 2016, 3:41:00 PM4/12/16

to

On 4/12/2016 10:53 AM, Paavo Helde wrote:

> The above is UB, you must check Beren value earlier.

Yep, found, fixed.

> Inconsistent usage of std:: with cerr and cout.

Doesn't hurt anything, but does look weird; now fixed.

> What's wrong with 'new RegEx()' instead of malloc()?

I suppose that would work. It just never occurred to me to use a C++-style
memory allocator with a C-style struct.

>> // Allocate Matches:
>> ... malloc ...

>
> This is error-prone and fragile. What's wrong with
>
> std::vector Matches(RegEx->re_nsub + 5) ?

I suppose that would work. But again, it never occurred to me to
put C-style structs in a std::vector. Interesting idea. Would
probably be safer. But I think it needs to be:

std::vector Matches(NumParSubExp + 1);

Which would be exactly the right length. Match slot 0 is always
for the whole match, and the other slots are for matches to
parenthetical subexpressions.

>> // EXECUTE REGULAR EXPRESSION:
>> // Run regexec(), which will look for matches to the RE in Text:
>> BLAT("In Substitute(), about to call regexec().")
>> Result = regexec(RegEx, Text.c_str(), 10, Matches, 0);
>
> The passed nmatch (10) does not correspond to the number of allocated
> Matches (RegEx->re_nsub+5). If the allocated size is smaller, then the
> regexec function might write in the memory after the array, corrupting

> the memory....

BINGO. I'm pretty sure that was the whole cause of the problem.
With the code like that, the function (even in its new stand-alone
program, as opposed to the program I was calling it from before)
fails reliably with error message "abort called" at point of
regexec(), if given the following inputs:

Text: "randip.c"
Pattern: "and"
Replacement: "or"

(Whereas with other inputs, it actually works, or fails in other
ways, some of them spectacular, due to the unpredictable results
of stack corruption.)

Anyway, I prefer my C++ version; much more concise.
Upgraded code follows:

#include <iostream>
#include <regex>

using std::cout;
using std::cerr;
using std::endl;

std::string Substitute
(
std::string const & Pattern,
std::string const & Replacement,
std::string const & Text,
std::string const & Flags // If Flags contains 'g', do global replace;
) // otherwise, replace first occurrence only.
{
std::regex_constants::match_flag_type RegexFlags {};
if (std::string::npos == Flags.find_first_of("g"))
{
RegexFlags = std::regex_constants::format_first_only;
}
std::regex RegEx {Pattern};
std::string Result = std::regex_replace(Text, RegEx, Replacement, RegexFlags);
return Result;
} // end function rhregex::Substitute()

int main (int Beren, char ** Luthien)
{

if (5 != Beren)
{
cerr

<< "Error in regex-cpp-test.cpp: this program takes 4 arguments," << endl

<< "but you only typed " << Beren << ". Aborting program." << endl;
exit(666);
}

std::string Pattern {Luthien[1]};
std::string Replacement {Luthien[2]};
std::string Text {Luthien[3]};
std::string Flags {Luthien[4]};

std::string Result = Substitute(Pattern, Replacement, Text, Flags);

cout << Result << endl;

Paavo Helde

unread,

Apr 12, 2016, 4:14:45 PM4/12/16

to

On 12.04.2016 22:40, Robbie Hatley wrote:
>
> On 4/12/2016 10:53 AM, Paavo Helde wrote:
>
>> The above is UB, you must check Beren value earlier.
>
> Yep, found, fixed.
>
>> Inconsistent usage of std:: with cerr and cout.
>
> Doesn't hurt anything, but does look weird; now fixed.
>
>> What's wrong with 'new RegEx()' instead of malloc()?
>
> I suppose that would work. It just never occurred to me to use a C++-style
> memory allocator with a C-style struct.

C++ does not know or care that the struct is defined in a C library
header, and C does not know or care how the memory for the struct is
allocated.

>
>>> // Allocate Matches:
>>> ... malloc ...
>>
>> This is error-prone and fragile. What's wrong with
>>
>> std::vector Matches(RegEx->re_nsub + 5) ?
>
> I suppose that would work. But again, it never occurred to me to
> put C-style structs in a std::vector. Interesting idea. Would
> probably be safer.

C++ does not know or care that the struct is defined in a C library
header, and C does not know or care how the memory for the struct is
allocated.

>
> std::vector Matches(NumParSubExp + 1);

Actually I forgot the template parameter earlier, the right syntax is

std::vector<regmatch_t> Matches(NumParSubExp + 1);

[...]

> BINGO. I'm pretty sure that was the whole cause of the problem.
> With the code like that, the function (even in its new stand-alone
> program, as opposed to the program I was calling it from before)
> fails reliably with error message "abort called"

No, it does not (fail reliably). Try with some other compiler and you
will see. The proper wording would be "appears to fail in quite
predictably way with my compiler version and with my compiler options".

Welcome to the wonderful world of C and C++ Undefined Behavior!

Robbie Hatley

unread,

Apr 12, 2016, 8:23:24 PM4/12/16

to

On 4/12/2016 1:14 PM, Paavo Helde wrote:

> No, it does not (fail reliably). Try with some other compiler and you will see.
> The proper wording would be "appears to fail in quite predictably way with my
> compiler version and with my compiler options".

The stated result (computer just says "Abort called" and ends program) depends
on these exact inputs:

Text: "randip.c"
Pattern: "and"
Replacement: "or"

Change anything just slightly, and it either doesn't fail, or fails in a very
different way (such as, instead of saying "abort called", says "segmentation
fault, core dumped", or says "internal error: TP_NUM_C_BUFS too low: 50",
or suddenly spews 500 pages of complete gibberish; I've seen this program
do all of those the last few days).

> Welcome to the wonderful world of C and C++ Undefined Behavior!

Sort of like going on a bad drug trip. :-/

I think the reason the results of stack corruption is so unpredictable
and varied is, when you over-write the stack, the results depend on exactly
*what* gets clobbered. Clobber the wrong thing, and instead of saying something
relatively sane like "segmentation fault", the computer suddenly spews
a bunch of directory listings, or spews the contents of random C++ source files.
(It did both of those, yesterday.) I suppose stack corruption could even
cause worse things (over-writing disk files, damaging hardware, etc).

Which is why I don't use C for complex programs. Too dangerous. I'll leave
that to people more skilled than me.

C++ is better in that regard, what with std::vector<int> instead of

int sprat[50] = {0};
...857 lines of code here...
int LoopLimit = 87;
for ( int i = 0 ; i < LoopLimit ; ++i )
{
sprat[i] = SomeDamnFunc(Input[i]);
}
(Ooops! We just corrupted 4*37 bytes of the stack!)

Yes, I've seen professional programmers at work make *exactly* that
blunder, almost verbatim. C makes shooting one's self in the foot
in that way easy. I convinced the guys to convert to C++ and use
std::vector, std::map, std::list instead of arrays. Got rid of
a lot of bugs that way. (That was back in 2001-2006 when I was
actually heavily involved in doing C++ programming for a living.
But I digress.)

But Perl is even better in that regard. :D Even harder to screw things up.

However, I'm trying to stay proficient in all 3 of those languages
(while also learning Python as well), so that involves practice with
things which are less "safe", such as fixed-length arrays on the stack,
dynamic allocation with malloc & free, etc, in addition to the the safer
approaches of C++. Which is part of why I do certain things the old-fashioned
way in the C version of my Substitute() function. Playing with fire and
trying not to get burned.

woodb...@gmail.com

unread,

Apr 12, 2016, 10:35:36 PM4/12/16

to

On Tuesday, April 12, 2016 at 7:23:24 PM UTC-5, Robbie Hatley wrote:

Robbie, please don't swear here.

Brian
Ebenezer Enterprises - Making programming fun again.
http://webEbenezer.net

Daniel

unread,

Apr 12, 2016, 11:04:59 PM4/12/16

to

On Tuesday, April 12, 2016 at 10:35:36 PM UTC-4, woodb...@gmail.com wrote:
> On Tuesday, April 12, 2016 at 7:23:24 PM UTC-5, Robbie Hatley wrote:
>
> Robbie, please don't swear here.
>

Brian's right, in the context of Robbie's post, it's inappropriate to have "Damn" as part of the function name.

It should be used in the name of the loop limit variable

int SomeDamnLoopLimit = 87;

as it's this that causes the annoyance and surprise.

Best regards,
Daniel

Rosario19

unread,

Apr 13, 2016, 1:13:47 AM4/13/16

to

On Mon, 11 Apr 2016 21:45:01 +0300, Paavo Helde wrote:

>Good to hear! Yes of course it is better to use "builtin" C++ regex
>whenever possible. And I am still pretty sure the actual bug was in some
>of those deleted 343 lines (and deleting them was a good way to get rid
>of the bug!)

i'm different of you
while all you fear bug of your code, i like to find and correct bug of
my code, and see my code to run in every places

what i can not like is bug or difficult of use / understand / correct
of code, i not wrote [example code library]
possibly i not too much experice...
"chi fa da se fa per tre"

Paavo Helde

unread,

Apr 13, 2016, 3:04:26 AM4/13/16

to

On 13.04.2016 3:23, Robbie Hatley wrote:
> I suppose stack corruption could even
> cause worse things (over-writing disk files, damaging hardware, etc).

Standard example is sending nasty e-mails to your boss.

> Which is why I don't use C for complex programs. Too dangerous. I'll leave
> that to people more skilled than me.
>
> C++ is better in that regard, what with std::vector<int> instead of

C++ is inherently not much safer, one can always still write code like

std::sort(vec.begin(), vec.begin()+87);

I think it's just that the style and habits used in C++ are different,
so one is strongly inclined to write safer constructs like

std::sort(vec.begin(), vec.end());

Cheers
Paavo