[Boost-users] best tool in Boost for (massive) string replacement?

alfC

Sep 23, 2010, 6:11:20 PM
to boost...@lists.boost.org
With all the tools available in Boost, and coming from a different
background, it is hard for me to choose the best tool in Boost for
massive string replacement.

The problem is the following: I have a map from the strings to be
replaced to their replacements:

std::map<string,string> rep;
rep["\\alpha"] = "a";
rep["\\beta"] = "b";
...

Let's say there are about 100 of these. I also have an input/output
file (a few thousand lines) where I would like to do all these
replacements. What is the best tool in Boost to do this:
Spirit, Regex, Tokenizer, or StringAlgorithm?

My only approach so far is Regex and the implementation is very crude.
I read the file line by line and do a loop over the replacement keys
for each line. It is not even exploiting the fact that I have a map of
replacements (compared to an array of replacements). It seems very
slow.

(Yes, it is like a 'sed' unix command replacement but with hundreds of
replacement strings)

Thank you,
Alfredo
_______________________________________________
Boost-users mailing list
Boost...@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users

michi7x7

Sep 24, 2010, 12:45:12 PM
to boost...@lists.boost.org
On 24.09.2010 00:11, alfC wrote:
> With all the tools available in Boost, and coming from a different
> background, it is hard for me to choose the best tool in Boost for
> massive string replacement.
>
> The problem is the following: I have a map from the strings to be
> replaced to their replacements:
>
> std::map<string,string> rep;
> rep["\\alpha"] = "a";
> rep["\\beta"] = "b";
> ...
>
> Let's say there are about 100 of these. I also have an input/output
> file (a few thousand lines) where I would like to do all these
> replacements. What is the best tool in Boost to do this:
> Spirit, Regex, Tokenizer, or StringAlgorithm?
>
> My only approach so far is Regex and the implementation is very crude.
> I read the file line by line and do a loop over the replacement keys
> for each line. It is not even exploiting the fact that I have a map of
> replacements (compared to an array of replacements). It seems very
> slow.
>
> (Yes, it is like a 'sed' unix command replacement but with hundreds of
> replacement strings)
>
> Thank you,
> Alfredo
Hi,

Spirit should do that quite fast, but of course it's difficult to know
in advance which tool is fastest.
Spirit code can be optimized at compile time, so make sure to use a
static const map, if possible.

Regards,

michi7x7

Anthony Foiani

Sep 25, 2010, 8:13:07 PM
to boost...@lists.boost.org, alfC
alfC <alfredo...@gmail.com> writes:

> My only approach so far is Regex and the implementation is very crude.
> I read the file line by line and do a loop over the replacement keys
> for each line. It is not even exploiting the fact that I have a map of
> replacements (compared to an array of replacements). It seems very
> slow.

Depending on where you want to spend your runtime (setup cost
v. per-line cost), and how much memory you have available...

It might be faster to build a single regex that has all your targets
as alternates, then use the match data to map to the correct replacement.

In Perl, it'd go something like this:

    # establish mapping from target to replacement.
    my %reps = ( '\alpha' => 'a',
                 '\beta'  => 'b',
                 '\gamma' => 'g' );

    # create a regular expression consisting of all targets, using
    # alternation:
    my $re = join '|', map { quotemeta $_ } keys %reps;

    # now loop over the data:
    while ( my $line = <STDIN> )
    {
        # every time the regex matches, capture what matched into $1 and
        # then replace it by looking up the target in the %reps map.
        $line =~ s/($re)/$reps{$1}/g;
        print $line;
    }

A rough translation into Boost can be found here:

http://scrye.com/~tkil/boost/regex/multi-rep.cpp

It will still fail if any of your target strings contain "\E"
literally in them; I couldn't find any obvious "quotemeta" replacement
in boost::regex.
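
For readers without the linked file at hand, here is a minimal sketch of the same single-alternation idea using only the C++11 standard library (std::regex rather than boost::regex, so it is not the code at the URL above). The names escape_regex, build_alternation and replace_all are hypothetical helpers invented for this illustration. It assumes all keys are non-empty, escapes each metacharacter individually instead of relying on \Q...\E, and sorts longer keys first because ECMAScript alternation takes the first alternative that matches, not the longest.

```cpp
#include <algorithm>
#include <map>
#include <regex>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> ssmap;

// Hypothetical helper: backslash-escape every ECMAScript metacharacter
// so that a key is matched literally.
std::string escape_regex(const std::string& s) {
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : s) {
        if (meta.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// Build one regex with all keys as alternates, longest key first so
// that a key which is a prefix of another cannot shadow it.
std::regex build_alternation(const ssmap& dict) {
    std::vector<std::string> keys;
    for (const auto& kv : dict) keys.push_back(kv.first);
    std::sort(keys.begin(), keys.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });
    std::string pattern;
    for (const auto& k : keys) {
        if (!pattern.empty()) pattern += '|';
        pattern += escape_regex(k);
    }
    return std::regex(pattern);
}

// Walk over all matches, copying the unmatched text through and
// substituting the mapped value for every matched key.
std::string replace_all(const std::string& input, const ssmap& dict,
                        const std::regex& re) {
    if (dict.empty()) return input;  // an empty pattern would match everywhere
    std::string out;
    auto last = input.begin();
    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it) {
        out.append(last, (*it)[0].first);  // text before this match
        out += dict.at(it->str());         // replacement for the matched key
        last = (*it)[0].second;
    }
    out.append(last, input.end());         // trailing text after last match
    return out;
}
```

The regex is built once (the setup cost) and then reused for every line, so the per-line work is a single scan instead of one pass per key.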

There are ways to get fancier with it, but I started running into
version incompatibilities. In particular, current implementations of
boost::regex allow the replacement formatter to be any arbitrary
functor, and my subroutine 'replace' turns into this:

    // ssmap is assumed to be std::map<std::string, std::string>.
    struct find_replacement
    {
        const ssmap & dict_;
        find_replacement( const ssmap & dict ) : dict_( dict ) {}
        // Boost.Regex calls a unary formatter with the match_results
        // object, so take the match and look up its matched text.
        std::string operator()( const boost::smatch & m ) const
        { return dict_.at( m.str() ); }
    };

    const std::string
    replace( const std::string & input,
             const ssmap & dict,
             const boost::regex & re )
    {
        find_replacement fr( dict );
        return boost::regex_replace( input, re, fr );
    }

Happy hacking,
t.

Eric Niebler

Sep 25, 2010, 9:04:56 PM
to boost...@lists.boost.org
On 9/23/2010 6:11 PM, alfC wrote:
> With all the tools available in Boost, and coming from a different
> background, it is hard for me to choose the best tool in Boost for
> massive string replacement.
>
> The problem is the following: I have a map from the strings to be
> replaced to their replacements:
>
> std::map<string,string> rep;
> rep["\\alpha"] = "a";
> rep["\\beta"] = "b";
> ...
>
> Let's say there are about 100 of these. I also have an input/output
> file (a few thousand lines) where I would like to do all these
> replacements. What is the best tool in Boost to do this:
> Spirit, Regex, Tokenizer, or StringAlgorithm?

None of the above. Use Boost.Xpressive. The complete solution is below:

    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive_static.hpp>
    #include <boost/xpressive/regex_actions.hpp>

    using namespace std;
    using namespace boost::xpressive;

    int main()
    {
        std::map<std::string, std::string> rep;

        rep["\\alpha"] = "a";
        rep["\\beta"] = "b";
        rep["\\gamma"] = "g";
        rep["\\delta"] = "d";

        local<std::string const *> pstr;
        sregex const rx = (a1 = rep)[pstr = &a1];

        std::string str("\\alpha \\beta \\gamma \\delta");
        std::cout << regex_replace(str, rx, *pstr) << std::endl;
    }

The regex (a1 = rep) takes the keys in the rep map and builds a search
trie (http://en.wikipedia.org/wiki/Trie) out of them. When the trie
matches, the attribute a1 receives the value associated with the
matching key. The semantic action [pstr = &a1] assigns the address of
the value to the local variable pstr.

The call to regex_replace uses the lambda expression *pstr as the
replacement.
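
To make the trie idea concrete, here is a hand-rolled sketch in plain standard C++. It is not how Xpressive implements (a1 = rep) internally; TrieNode and ReplacementTrie are hypothetical names invented for this illustration, and the longest-match policy and the skipping of empty keys are assumptions of the sketch.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// One trie node per character of a key; 'terminal' marks the end of a key.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> next;
    bool terminal = false;
    std::string value;  // replacement text, valid only when terminal
};

class ReplacementTrie {
public:
    explicit ReplacementTrie(const std::map<std::string, std::string>& dict) {
        for (const auto& kv : dict) {
            if (kv.first.empty()) continue;  // assumption: ignore empty keys
            TrieNode* node = &root_;
            for (char c : kv.first) {
                std::unique_ptr<TrieNode>& child = node->next[c];
                if (!child) child.reset(new TrieNode);
                node = child.get();
            }
            node->terminal = true;
            node->value = kv.second;
        }
    }

    // At each input position, follow the trie as far as possible and
    // substitute the longest key found; otherwise copy one character.
    std::string replace(const std::string& input) const {
        std::string out;
        std::size_t i = 0;
        while (i < input.size()) {
            const TrieNode* node = &root_;
            const TrieNode* best = nullptr;
            std::size_t best_len = 0;
            for (std::size_t j = i; j < input.size(); ++j) {
                auto it = node->next.find(input[j]);
                if (it == node->next.end()) break;
                node = it->second.get();
                if (node->terminal) { best = node; best_len = j - i + 1; }
            }
            if (best) { out += best->value; i += best_len; }
            else      { out += input[i]; ++i; }
        }
        return out;
    }

private:
    TrieNode root_;
};
```

Constructing ReplacementTrie from the rep map and calling replace on each line scans the text once, and the work at each position is bounded by the longest key rather than by the number of keys.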

HTH,

--
Eric Niebler
BoostPro Computing
http://www.boostpro.com
