Best way to find and replace substrings in a char buffer?

Matthias Hofmann

unread,

Mar 24, 2011, 2:52:52 AM3/24/11

to

Hello everybody!

I need to perform find and replace of certain substrings in a text that was
loaded from a text file into a char buffer, not into a std::string or
std::vector. So I thought of a function such as the following, which I found
at
http://www.linuxquestions.org/questions/programming-9/replace-a-substring-with-another-string-in-c-170076/

// Begin code.

#include <stdio.h>
#include <string.h>

char *replace_str(char *str, char *orig, char *rep)
{
static char buffer[4096];
char *p;

if(!(p = strstr(str, orig))) // Is 'orig' even in 'str'?
return str;

strncpy(buffer, str, p-str); // Copy characters from 'str' start to 'orig'
st$
buffer[p-str] = '\0';

sprintf(buffer+(p-str), "%s%s", rep, p+strlen(orig));

return buffer;
}

int main(void)
{
puts(replace_str("Hello, world!", "world", "Miami"));

return 0;
}

// End code.

The problem with this function is that it replaces only one occurrence of
the substring, so it would have to be called repeatedly until it returns
NULL, thus searching parts of the text many times over for nothing. Besides,
and this is even worse, it uses a fixed size buffer, which makes it rather
unpracticable for texts of arbitrary length.

Then I found the following function at
http://stackoverflow.com/questions/1494399/how-do-i-search-find-and-replace-in-an-stl-string

// Begin code.

template<class T> int inline findAndReplace(T& source, const T& find, const
T& replace)
{
int num=0;
T::size_t fLen = find.size();
T::size_t rLen = replace.size();
for (T::size_t pos=0; (pos=source.find(find, pos))!=T::npos; pos+=rLen)
{
num++;
source.replace(pos, fLen, replace);
}
return num;
}

// End code.

It looks like it works, but I wonder why it is a template - is there any
other STL container besides std::string that has a replace() function? I
think the following is better, which I found at the same site, as it uses a
std::string:

//params find and replace cannot be NULL
void FindAndReplace( std::string& source, const char* find, const char*
replace )
{
//ASSERT(find != NULL);
//ASSERT(replace != NULL);
size_t findLen = strlen(find);
size_t replaceLen = strlen(replace);
size_t pos = 0;

//search for the next occurrence of find within source
while ((pos = source.find( it, pos)) != std::string::npos)
{
//replace the found string with the replacement
source.replace( pos, findLen, replace );

//the next line keeps you from searching your replace string,
//so your could replace "hello" with "hello world"
//and not have it blow chunks.
pos += replaceLen;
}
}

I feel like using this function, but the problem is that I am not sure
whether there is any reason not to construct a std::string from a char
buffer that was loaded from a text file, like "\0" characters for example -
is possible for a "\0" character to show up in the middle of a text file,
such messing up my find and replace function?

Unfortunately, std::vector does not seem to have a replace function,
otherwise I would construct an object of that type from the char buffer. So
would it be alright to use a std::string in this case?

--
Matthias Hofmann
Anvil-Soft, CEO
http://www.anvil-soft.com - The Creators of Toilet Tycoon
http://www.anvil-soft.de - Die Macher des Klomanagers

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Ulrich Eckhardt

unread,

Mar 25, 2011, 12:54:45 AM3/25/11

to

Matthias Hofmann wrote:
> I need to perform find and replace of certain substrings in a text that
> was loaded from a text file into a char buffer, not into a std::string or
> std::vector.

Well, in that case I'd suggest to first fix the container type, i.e. at
least use a vector<char>.

> So I thought of a function such as the following

[...]

> char *replace_str(char *str, char *orig, char *rep)

Horrible. Even for straight C code, there is a const keyword that should be
used to document what is modified here and what isn't.

> {
> static char buffer[4096];

This has two implications:
1. Two concurrent calls (i.e. from different threads) will wreak havoc on
each other.
2. Two consecutive calls from one thread will overwrite the result of the
former.

In order to be safe, you can't get around either modifying the data in place
or copying it.

> sprintf(buffer+(p-str), "%s%s", rep, p+strlen(orig));

This is so horribly inefficient, all the pointers and sizes are known, it
could use memcpy() much better.

> The problem with this function is that it replaces only one occurrence of
> the substring, so it would have to be called repeatedly until it returns
> NULL, thus searching parts of the text many times over for nothing.

Even worse, consider the case that you use this to replace '\n' with '\r\n',
then with '\r\r\n', then '\r\r\r\n'...

> Besides, and this is even worse, it uses a fixed size buffer, which makes
> it rather unpracticable for texts of arbitrary length.

Full acknowledge.

> template<class T> int inline findAndReplace(T& source, const T& find,
> const T& replace)
> {
> int num=0;
> T::size_t fLen = find.size();
> T::size_t rLen = replace.size();
> for (T::size_t pos=0; (pos=source.find(find, pos))!=T::npos;
> pos+=rLen)
> {
> num++;
> source.replace(pos, fLen, replace);
> }
> return num;
> }
>
> // End code.
>
> It looks like it works

Yep, looks correct, but slow.

> but I wonder why it is a template - is there any
> other STL container besides std::string that has a replace() function?

There is a

typedef std::basic_string<char> string;

somewhere in the standard lib. Another class that has a replace function is
std::wstring. Anyhow, what is the problem with that function being a
template?

> I think the following is better, which I found at the same site, as it
> uses a std::string:
>
> //params find and replace cannot be NULL
> void FindAndReplace( std::string& source, const char* find, const char*
> replace )
> {
> //ASSERT(find != NULL);
> //ASSERT(replace != NULL);
> size_t findLen = strlen(find);
> size_t replaceLen = strlen(replace);
> size_t pos = 0;
>
> //search for the next occurrence of find within source
> while ((pos = source.find( it, pos)) != std::string::npos)
> {
> //replace the found string with the replacement
> source.replace( pos, findLen, replace );
>
> //the next line keeps you from searching your replace string,
> //so your could replace "hello" with "hello world"
> //and not have it blow chunks.
> pos += replaceLen;
> }
> }

This seems to use "it" but never to define that...or maybe its too early in
the morning for me to find. In any case, the approach looks much like the
template function above, although better documented. However, it also
suffers from possible slowness, which is cute because the author tried to
avoid unnecessary copying by passing in a reference but completely failed
at the algorithm.

> I feel like using this function, but the problem is that I am not sure
> whether there is any reason not to construct a std::string from a char
> buffer that was loaded from a text file, like "\0" characters for example
> - is possible for a "\0" character to show up in the middle of a text
> file, such messing up my find and replace function?

General rule: If you expect text, verify that you got text. You can always
have file corruption, users renaming their GINAWI~1.MPG to THESIS.TXT etc.

> Unfortunately, std::vector does not seem to have a replace function,
> otherwise I would construct an object of that type from the char buffer.
> So would it be alright to use a std::string in this case?

Let me first explain what I meant with slowness above, then get to
the "right" container...

If you have the string "abc" and replace "b" with "xx", you leave the "a"
alone, then move the "c" one position back and then copy the "xx" into the
space in between. If you have "axxc" and replace the "xx" with a "b", you
move the "c" one position up front and then copy the "b" in the space in
between. In both cases, and that's what makes it slow, you have to touch
all the characters after the replacement position.

This means that the implementation is an O(m*n) algorithm, with n being the
number of characters in the string and m being the number of times the
substring is found. I had this very algorithm (just written differently)
and I was using it for a long time without problems on files that are
typically not larger that 10k. Then, some day someone came up with a file
that had 200k. The execution time with 10k was not noticeable (let's say it
was 500ms). With 200k and an O(n*n) algorithm (m was a fraction of n in my
data), it turned out to be 400 times as much, 20s, which is a lot of time
to wait for a program's response.

Anyhow, a better approach is this:
1. Allocate a second buffer of sufficient[1] length.
2. Search for the sequence to replace.
3. Append anything before that to the buffer.
4. Append the replacement to the buffer.
5. Continue with 2.

For the buffer, you could use a vector<char> and preallocate using
reserve().

Note: Copying from one buffer to the other isn't that time-consuming, unless
you do it often. So just moving the data from a char* to a vector or string
shouldn't keep you. For convenience, if you do the replacement, it would be
better to convert to the desired container type at the same time. Of
course, it would be best to avoid any conversions, but that is more of a
general approach.

Good luck!

Uli

[1] If you have any data about the file's content, here's where you can use
it to avoid reallocations and copying. Otherwise, just assume the same
length as the input. Or, you could ignore this optimization and use e.g.
deque<char>, which doesn't copy any data around when it allocates new
storage.

--
Domino Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

olekk

unread,

Mar 25, 2011, 1:02:37 AM3/25/11

to

On Mar 24, 1:52 am, "Matthias Hofmann" <hofm...@anvil-soft.com> wrote:
> The problem with this function is that it replaces only one occurrence of
> the substring, so it would have to be called repeatedly until it returns
> NULL, thus searching parts of the text many times over for nothing. Besides,
> and this is even worse, it uses a fixed size buffer, which makes it rather
> unpracticable for texts of arbitrary length.
>

only slight changes to the function are enought to make it replace all
the occurences of susbstring instead of calling it N times. Like this:
#include <stdio.h>
#include <string.h>

char *replace_str(char *str, char *orig, char *rep)
{
static char buffer[4096];
char *p;

char *begin = str; //points on the beginning of string
int origLen = strlen(orig);

buffer[0] = '\0';
while ( p = strstr(begin, orig))
{
strncat(buffer, begin, p-begin);
strcat(buffer, rep);
begin = p+origLen;
}
strncat(buffer, begin, strlen(str) - (begin - str));
return buffer;
}
int main()

{
puts(replace_str("Hello, world!", "world", "Miami"));

puts(replace_str("Hello, world and new world world!", "world",
"Miami"));
puts(replace_str("Hello, Capital Letter World!", "world",
"Miami"));
return 0;
}

you can also dynamically allocate buffer to fit the size and instead
of returning buffer just copy it to orig (it modifise original
although).

> It looks like it works, but I wonder why it is a template - is there any
> other STL container besides std::string that has a replace() function? I
> think the following is better, which I found at the same site, as it uses a
> std::string:

std::string is not a container itself, it is a typedef for a
basic_string instantiation, so it works for every basic_string or
container that has replace in it. Moreover, if you want to use other
container, for ex. vector of strings and replace one string with
another in this vector you can just use std::replace algorithm which
gets forward iterators as a parameters, smth like that

std::vector<std::string> my_init_text =
get_string_vector_from_file(...); //
std::string orig("World");
std::string rep("Miami");
std::replace(my_init_text.begin(), my_init_text.end(), orig, rep);

the same function works for any container that supports forward
iterators

> Unfortunately, std::vector does not seem to have a replace function,
> otherwise I would construct an object of that type from the char buffer. So
> would it be alright to use a std::string in this case?

see replace alg. above

Olek

--

Matthias Hofmann

unread,

Mar 25, 2011, 1:02:00 AM3/25/11

to

I did some further research on the internet to find out whether the standard
gives any gurantees with regards to "\0" characters in a std::string, but I
found no such information. Take a look at the following code for example:

#include <iostream>
#include <string>

using namespace std;

int main()
{
char buffer[] = "Banana\0Potato";

string str( buffer, sizeof buffer - 1 );

for ( string::iterator it = str.begin();
it != str.end(); ++it )
{
cout << *it << endl;
}

return 0;
}

On my compiler, the output I get is

Banana Potato

which shows that the "\0" character in the middle of the string does not
stop the iteration. This is exactly what I want, but is there any guarantee
for such behaviour in the standard? And is it safe to assume that
std::string::replace() will work properly in case it encounters a "\0"
character?

{ Mod factoid: yes, it's guaranteed. -mod/sk }

Matthias Hofmann

unread,

Mar 28, 2011, 7:29:44 PM3/28/11

to

"Ulrich Eckhardt" <ulrich....@dominolaser.com> schrieb im Newsbeitrag
news:ikcr58-...@satorlaser.homedns.org...
> Matthias Hofmann wrote:

>> //params find and replace cannot be NULL
>> void FindAndReplace( std::string& source, const char* find, const char*
>> replace )
>> {
>> //ASSERT(find != NULL);
>> //ASSERT(replace != NULL);
>> size_t findLen = strlen(find);
>> size_t replaceLen = strlen(replace);
>> size_t pos = 0;
>>
>> //search for the next occurrence of find within source
>> while ((pos = source.find( it, pos)) != std::string::npos)
>> {
>> //replace the found string with the replacement
>> source.replace( pos, findLen, replace );
>>
>> //the next line keeps you from searching your replace string,
>> //so your could replace "hello" with "hello world"
>> //and not have it blow chunks.
>> pos += replaceLen;
>> }
>> }
>
> This seems to use "it" but never to define that...or maybe its too early
> in
> the morning for me to find. In any case, the approach looks much like the
> template function above, although better documented. However, it also
> suffers from possible slowness, which is cute because the author tried to
> avoid unnecessary copying by passing in a reference but completely failed
> at the algorithm.

Obviously the author has not tried to compile his code, as "it" is in fact
not defined. He must have meant "source". I have corrected this mistake and
I am now using his function in my project, which works fine:

void FindAndReplace( std::string& str,
const std::string& find, const std::string& repl )
{
// Get the length of the substring to be found.
std::string::size_type flen = find.length();

// Get the length of the replacement substring.
std::string::size_type rlen = repl.length();

//
// Find and replace each occurrence
// of the substring to be found with
// the replacement substring.
//
for ( std::string::size_type pos = 0;
( pos = str.find( find, pos ) ) !=
std::string::npos; pos += rlen )
str.replace( pos, flen, repl );
}

It could be documented better, but its main caveat is the performance
penalty incurred. I don't know how std::string:replace() is implemented, but
I assume that it pretty much behaves the way you are describing it further
below in your reply.

>> I feel like using this function, but the problem is that I am not sure
>> whether there is any reason not to construct a std::string from a char
>> buffer that was loaded from a text file, like "\0" characters for example
>> - is possible for a "\0" character to show up in the middle of a text
>> file, such messing up my find and replace function?
>
> General rule: If you expect text, verify that you got text. You can always
> have file corruption, users renaming their GINAWI~1.MPG to THESIS.TXT etc.

How can you verify whether you got text? The file extension is just a hint,
and they are not always used on some systems like Linux. One could check
whether the file contains only ASCII graphic symbols, but what if the text
is based on some exotic code page? And finally, it may be a stupid question,
but would it be possible for a "\0" character to show up in the middle of
text file? At the moment I am constructing a std::string from my char buffer
like this:

// "size" is the buffer size in bytes.
std::string buffer( buffer, size );

Is there anything wrong with that? It should work even when actually opening
an MPG file, although in this case, substring find and replace is likely to
find nothing except some meta tags in the file header maybe.

> If you have the string "abc" and replace "b" with "xx", you leave the "a"
> alone, then move the "c" one position back and then copy the "xx" into the
> space in between. If you have "axxc" and replace the "xx" with a "b", you
> move the "c" one position up front and then copy the "b" in the space in
> between. In both cases, and that's what makes it slow, you have to touch
> all the characters after the replacement position.
>
> This means that the implementation is an O(m*n) algorithm, with n being
> the
> number of characters in the string and m being the number of times the
> substring is found. I had this very algorithm (just written differently)
> and I was using it for a long time without problems on files that are
> typically not larger that 10k. Then, some day someone came up with a file
> that had 200k. The execution time with 10k was not noticeable (let's say
> it
> was 500ms). With 200k and an O(n*n) algorithm (m was a fraction of n in my
> data), it turned out to be 400 times as much, 20s, which is a lot of time
> to wait for a program's response.
>
> Anyhow, a better approach is this:
> 1. Allocate a second buffer of sufficient[1] length.
> 2. Search for the sequence to replace.
> 3. Append anything before that to the buffer.
> 4. Append the replacement to the buffer.
> 5. Continue with 2.

Olek's reply contains an alternative version of the first function in my
original post, and it does pretty much what you are suggesting:

#include <iostream>

using namespace std;

char *FindAndReplace( char *str, const char *find, const char *repl )

{
static char buffer[4096];
char *p;

char *begin = str;
int flen = strlen( find );

// Note that buffer may still contain data
// from a previous call to this function.
buffer[0] = '\0';

while ( p = strstr( begin, find ) )
{
strncat( buffer, begin, p - begin );
strcat( buffer, repl );
begin = p + flen;
}

strncat( buffer, begin, strlen( str ) - ( begin - str ) );

return buffer;
}

int main()
{
cout << FindAndReplace( "Hello, world!", "world", "Miami" ) << endl;
cout << FindAndReplace( "Hello, world and new world world!", "world",
"Miami") << endl;
cout << FindAndReplace( "Hello, Capital Letter World!", "world",
"Miami" ) << endl;

return 0;
}

On my compiler, this give sme the following output:

Hello, Miami!
Hello, Miami and new Miami Miami!
Hello, Capital Letter World!

This reminds me that neither my current solution using
std::string::replace() nor the code above offers a case insensitive find and
replace method, but that's not a big issue for me right now. And if I
understand things right, the code above has O( n ) complexity, hasn't it?
It's only drawback is the static sized buffer it uses, but as soon as I got
this fixed, I might use that function rather than the one using
std::string::replace().

--
Matthias Hofmann
Anvil-Soft, CEO
http://www.anvil-soft.com - The Creators of Toilet Tycoon
http://www.anvil-soft.de - Die Macher des Klomanagers

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Matthias Hofmann

unread,

Mar 28, 2011, 7:28:59 PM3/28/11

to

"olekk" <sasha...@gmail.com> schrieb im Newsbeitrag
news:93c53e03-5c4e-447a...@o21g2000prh.googlegroups.com...

> only slight changes to the function are enought to make it replace all
> the occurences of susbstring instead of calling it N times. Like this:
> #include <stdio.h>
> #include <string.h>
>
>
> char *replace_str(char *str, char *orig, char *rep)
> {
> static char buffer[4096];
> char *p;
> char *begin = str; //points on the beginning of string
> int origLen = strlen(orig);
>
> buffer[0] = '\0';
> while ( p = strstr(begin, orig))
> {
> strncat(buffer, begin, p-begin);
> strcat(buffer, rep);
> begin = p+origLen;
> }
> strncat(buffer, begin, strlen(str) - (begin - str));
> return buffer;
> }
> int main()
> {
> puts(replace_str("Hello, world!", "world", "Miami"));
> puts(replace_str("Hello, world and new world world!", "world",
> "Miami"));
> puts(replace_str("Hello, Capital Letter World!", "world",
> "Miami"));
> return 0;
> }

Thanks a lot, this function really does the trick! :-)

> you can also dynamically allocate buffer to fit the size and instead
> of returning buffer just copy it to orig (it modifise original
> although).

Dynamic allocation of the buffer is harder because the size is not known in
advance, unless you first count the number of occurrences and multiply it
with the difference in length of the string to be found and the one used for
replacement.

> std::string is not a container itself, it is a typedef for a
> basic_string instantiation, so it works for every basic_string or
> container that has replace in it. Moreover, if you want to use other
> container, for ex. vector of strings and replace one string with
> another in this vector you can just use std::replace algorithm which
> gets forward iterators as a parameters, smth like that
>
> std::vector<std::string> my_init_text =
> get_string_vector_from_file(...); //
> std::string orig("World");
> std::string rep("Miami");
> std::replace(my_init_text.begin(), my_init_text.end(), orig, rep);
>
> the same function works for any container that supports forward
> iterators

So why does std::string have a replace() function? Is it more efficient than
the std::replace() function? Would there be any advantage for me if I
decided to use a std::vector instead of a std::string?

--
Matthias Hofmann
Anvil-Soft, CEO
http://www.anvil-soft.com - The Creators of Toilet Tycoon
http://www.anvil-soft.de - Die Macher des Klomanagers

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Chris Morley

unread,

Mar 29, 2011, 9:42:52 PM3/29/11

to

> I need to perform find and replace of certain substrings in a text that
> was
> loaded from a text file into a char buffer, not into a std::string or
> std::vector. So I thought of a function such as the following, which I
> found

Can you use tr1 extensions, if so why not simply use tr1::regex_replace? One
form of the function returns a std::string and it is trivial to build an
input string from the text file using "string ( const char * s, size_t n )"
or assign.

If you don't know how to use regex then this shouldn't be a barrier, it is
pretty easy and there is plenty of help on the internet... some
confusing/bad but some is well written. I use
regular-expressions-cheat-sheet-v2.pdf (google it) to remind me of syntax I
infrequently use.

Chris

--

Ulrich Eckhardt

unread,

Mar 29, 2011, 9:44:58 PM3/29/11

to

Matthias Hofmann wrote:
> "Ulrich Eckhardt" <ulrich....@dominolaser.com> schrieb im Newsbeitrag
> news:ikcr58-...@satorlaser.homedns.org...

>> Anyhow, a better approach is this:
>> 1. Allocate a second buffer of sufficient[1] length.
>> 2. Search for the sequence to replace.
>> 3. Append anything before that to the buffer.
>> 4. Append the replacement to the buffer.
>> 5. Continue with 2.
>
> Olek's reply contains an alternative version of the first function in my
> original post, and it does pretty much what you are suggesting:

[...]

> char *FindAndReplace( char *str, const char *find, const char *repl )
> {
> static char buffer[4096];
> char *p;
> char *begin = str;
> int flen = strlen( find );
>
> // Note that buffer may still contain data
> // from a previous call to this function.
> buffer[0] = '\0';
>
> while ( p = strstr( begin, find ) )
> {
> strncat( buffer, begin, p - begin );
> strcat( buffer, repl );
> begin = p + flen;
> }
>
> strncat( buffer, begin, strlen( str ) - ( begin - str ) );
>
> return buffer;
> }

[...]

> And if I understand things right, the code above has O( n ) complexity,
> hasn't it?

No, unfortunately not. The problem is the strcat/strncat functions, both of
which first scan through the buffer to find the terminating zero. This
means you repeatedly scan through the same data while replacing, making it
O(n*n). You could easily replace it with a sliding pointer like 'begin', or
just use std::string which knows its own beginning and end.

Further, but I wouldn't bother too much with that aspect, you also depend on
the performance of strstr() or string::find..(), which has O(n*m) in the
naive implementation, so if you only replace small substrings it shouldn't
matter. Text searching is a whole topic for algorithm research, I'll leave
researching there to you, I also don't know enough to give you any advise
there.

Lastly, you asked about internal zeroes: C-APIs like strlen, strcat, strstr
doesn't work with those, just as they don't work with data that is missing
a terminating zero. std::string does work with it though, it doesn't
interpret the data it contains in any way. For that reason, you can also
use it to handle non-textual data.

Good luck!

Uli

--
Domino Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

olekk

unread,

Mar 29, 2011, 9:53:56 PM3/29/11

to

> Dynamic allocation of the buffer is harder because the size is not known in
> advance, unless you first count the number of occurrences and multiply it
> with the difference in length of the string to be found and the one used for
> replacement.

you should not count the number of occurences in advance. Just
allocate the buffer with malloc and when you don't have space in the
buffer to append another portion of new string, increase it's capacity
using realloc.

> So why does std::string have a replace() function? Is it more efficient than
> the std::replace() function? Would there be any advantage for me if I
> decided to use a std::vector instead of a std::string?

string is an array of chars. and using replace in a string you are
replacing not single emenets, but one subarray withing the original
array with the other subarray.
but when you are using replace algorithm with forward iterators you
are replacing all single elements that match the pattern. for example,
if you have smth like that

char * strings[]={"Hello", "World", "and", "NewWorld"};
std::vector<std::string> s(&strings[0], &strings[4]);
std::replace(s.begin(), s.end(), "World", "Miami");

it won't replace "NewWorld" with "NewMiami" and this is definitely not
what you want.

--

olekk

unread,

Mar 29, 2011, 9:55:35 PM3/29/11

to

> Dynamic allocation of the buffer is harder because the size is not known in
> advance, unless you first count the number of occurrences and multiply it
> with the difference in length of the string to be found and the one used for
> replacement.

This is how dynamic allocation can be done. I haven't testet this
carefully, so there might be errors in calculating needed size for new
buffer, but you'll get the idea
#include <stdio.h>
#include <malloc.h>
#include <string.h>
#include <assert.h>

#define CHECK_BUFFER(buffer,capacity,length,append_length) \
if ( (length) + (append_length) > (capacity)) { \
printf("Buffer Capacity %d is not enought for %d\n", (capacity),
(length) + (append_length));\
capacity = (capacity)*2 > (length) + (append_length) ?
(capacity*2) : (length) + 2*(append_length);\
printf("New Capacity is %d\n", (capacity));\
buffer = (char*)realloc(buffer, (capacity)); \
assert(buffer!=NULL);\
}

char *replace_str(char *str, char *orig, char *rep)
{

int buffer_capacity = 10;
int buffer_length = 0;
char* buffer = (char*)malloc(sizeof(char)*buffer_capacity);
buffer[0] = '\0';

char *p;
char *begin = str; //points on the beginning of string
int origLen = strlen(orig);

int rep_length = strlen(rep);

while ( p = strstr(begin, orig))
{

CHECK_BUFFER(buffer, buffer_capacity, buffer_length, p-begin);
strncat(buffer, begin, p-begin);
buffer_length+=(p-begin);

CHECK_BUFFER(buffer, buffer_capacity, buffer_length, rep_length);
strcat(buffer, rep);
buffer_length+=rep_length;
begin = p+origLen;
}
CHECK_BUFFER(buffer, buffer_capacity, buffer_length, strlen(str) -

(begin - str));
strncat(buffer, begin, strlen(str) - (begin - str));

//ATTENTION : buffer is not freed!!!
return buffer;
}

int main()
{
char * str = replace_str("Hello, world!", "world", "Miami");
puts(str);
free(str);// not that wierd as it might look like,
// strdup for example also returns buffer that must be freed
return 0;
}

--

Matthias Hofmann

unread,

Apr 6, 2011, 7:30:00 PM4/6/11

to

"Chris Morley" <chris....@lineone.net> schrieb im Newsbeitrag
news:4d919111$0$12155$fa0f...@news.zen.co.uk...

> Can you use tr1 extensions, if so why not simply use tr1::regex_replace?
> One
> form of the function returns a std::string and it is trivial to build an
> input string from the text file using "string ( const char * s, size_t
> n )"
> or assign.
>
> If you don't know how to use regex then this shouldn't be a barrier, it is
> pretty easy and there is plenty of help on the internet... some
> confusing/bad but some is well written. I use
> regular-expressions-cheat-sheet-v2.pdf (google it) to remind me of syntax
> I
> infrequently use.

Unfortunately, I am not familiar with regular expressions. I guess this is
an extension from the boost library, isn't it?

--
Matthias Hofmann
Anvil-Soft, CEO
http://www.anvil-soft.com - The Creators of Toilet Tycoon
http://www.anvil-soft.de - Die Macher des Klomanagers

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Matthias Hofmann

unread,

Apr 6, 2011, 7:29:24 PM4/6/11

to

"olekk" <sasha...@gmail.com> schrieb im Newsbeitrag

news:f67e12e4-cbf3-4cb4...@18g2000prd.googlegroups.com...

>> Dynamic allocation of the buffer is harder because the size is not known
>> in
>> advance, unless you first count the number of occurrences and multiply it
>> with the difference in length of the string to be found and the one used
>> for
>> replacement.
> you should not count the number of occurences in advance. Just
> allocate the buffer with malloc and when you don't have space in the
> buffer to append another portion of new string, increase it's capacity
> using realloc.

I also thought of using realloc(), but the problem is that this function
does not simply append memory to an existing buffer. Instead, it allocates a
larger buffer, copies the data from the old buffer to the larger one,
destroys the old buffer and returns a pointer to the larger one. You can
imagine what that means for the performance of the algorithm.

--
Matthias Hofmann
Anvil-Soft, CEO
http://www.anvil-soft.com - The Creators of Toilet Tycoon
http://www.anvil-soft.de - Die Macher des Klomanagers

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Matthias Hofmann

unread,

Apr 6, 2011, 7:31:34 PM4/6/11

to

"Ulrich Eckhardt" <ulrich....@dominolaser.com> schrieb im Newsbeitrag

news:7af868-...@satorlaser.homedns.org...

> The problem is the strcat/strncat functions, both of
> which first scan through the buffer to find the terminating zero. This
> means you repeatedly scan through the same data while replacing, making it
> O(n*n). You could easily replace it with a sliding pointer like 'begin',
> or
> just use std::string which knows its own beginning and end.
>
> Further, but I wouldn't bother too much with that aspect, you also depend
> on
> the performance of strstr() or string::find..(), which has O(n*m) in the
> naive implementation, so if you only replace small substrings it shouldn't
> matter. Text searching is a whole topic for algorithm research, I'll leave
> researching there to you, I also don't know enough to give you any advise
> there.

So I should probably implement everything myself, i.e. write a function that
uses neither string functions from the C Standard Library nor
string::find(). Using one ore more sliding pointers should make it possible
to avoid scanning through the same data over and over again.

> Lastly, you asked about internal zeroes: C-APIs like strlen, strcat,
> strstr
> doesn't work with those, just as they don't work with data that is missing
> a terminating zero. std::string does work with it though, it doesn't
> interpret the data it contains in any way. For that reason, you can also
> use it to handle non-textual data.

This is another reason to implement the entire find and replace function
myself, without resorting to any standard string functions. Anyway, I am
surprised to learn that text searching is such a wide field...

--
Matthias Hofmann
Anvil-Soft, CEO
http://www.anvil-soft.com - The Creators of Toilet Tycoon
http://www.anvil-soft.de - Die Macher des Klomanagers

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Jens Schmidt

unread,

Apr 7, 2011, 4:07:40 AM4/7/11

to

Matthias Hofmann wrote:

> I also thought of using realloc(), but the problem is that this function
> does not simply append memory to an existing buffer. Instead, it allocates
> a larger buffer, copies the data from the old buffer to the larger one,
> destroys the old buffer and returns a pointer to the larger one.

Unless we have architectures with sparsely populated virtual address spaces,
it can't do anything else.

> You can imagine what that means for the performance of the algorithm.

There are several possible methods to call realloc as rarely as possible.
Selecting one depends on other factors, because finding the correct
parameter for realloc may as inefficient as calling it multiple times.
--
Greetings,
Jens Schmidt

Martin B.

unread,

Apr 7, 2011, 7:37:49 PM4/7/11

to

On 07.04.2011 10:07, Jens Schmidt wrote:
> Matthias Hofmann wrote:
>
>> I also thought of using realloc(), but the problem is that this function
>> does not simply append memory to an existing buffer. Instead, it allocates
>> a larger buffer, copies the data from the old buffer to the larger one,
>> destroys the old buffer and returns a pointer to the larger one.
>
> Unless we have architectures with sparsely populated virtual address spaces,
> it can't do anything else.
>

I'm not sure I follow you here. Wouldn't it be highly dependent on the actual implementation whether the [m|re]alloc functions returned address ranges that could be enlarged without moving stuff around?

Might be slightly related, at least I found it an interesting read:
[A Collaborative Latency Reducing Enhancement Of The C/C++ malloc API]
http://mallocv2.wordpress.com/

cheers,
Martin

--

olekk

unread,

Apr 7, 2011, 7:34:40 PM4/7/11

to

> I also thought of using realloc(), but the problem is that this function
> does not simply append memory to an existing buffer. Instead, it allocates a
> larger buffer, copies the data from the old buffer to the larger one,
> destroys the old buffer and returns a pointer to the larger one. You can
> imagine what that means for the performance of the algorithm.

You can apply different approaches about how and when to increase the
buffer size. You can double the buffer capacity each time it needs to
be expanded, like I showed in CHECK_BUFFER macro. If you choose
appropriate heuristic, amortized costs of increasing array's capacity
will be acceptable

--

Jens Schmidt

unread,

Apr 8, 2011, 10:18:02 PM4/8/11

to

Martin B. wrote:

> On 07.04.2011 10:07, Jens Schmidt wrote:
>> Matthias Hofmann wrote:
>>
>>> I also thought of using realloc(), but the problem is that this function
>>> does not simply append memory to an existing buffer. Instead, it
>>> allocates a larger buffer, copies the data from the old buffer to the
>>> larger one, destroys the old buffer and returns a pointer to the larger
>>> one.
>>
>> Unless we have architectures with sparsely populated virtual address
>> spaces, it can't do anything else.
>
> I'm not sure I follow you here. Wouldn't it be highly dependent on the
> actual implementation whether the [m|re]alloc functions returned address
> ranges that could be enlarged without moving stuff around?

Yes, sure. But how can an implementation guarantee enlargement without
any move? Only by having enough free virtual address space after each
allocated memory region.
Assuming a typical program where the average string is very short, say 10
characters, combined with a quality implementation allowing files of several
gigabytes in one allocated region, then typical virtual memory use will be
/very/ sparse.
I know of no architecture where this can be implemented efficiently. The few
cases where such an implementation outperforms the usual one involve easily
improved algorithms. So all implementations move the data around instead.
Again: they can't do anything else.
--
Greetings,
Jens Schmidt