any find and replace, one-pass algorithm processing multiple search strings at once?

JKdr

unread,

Sep 30, 2014, 5:06:34 AM9/30/14

to

Input would be:

1) the strings that need to be searched for (or regex patterns) and
their substitutions

2) a stream (from a file) and its encoding

I spent some time trying to find an implementation of such an
algorithm, which I think shouldn't be a big deal at least in the case
that you don't use regex, just strings to find, but all implementations
I have found touch the input stream more than once

I think the data strategies of such algorithms is based on
graphs/decision trees, which are used in state machines and similar
algorithms

Unless I am missing something

Any ideas you could shared?

lbrtchx
(comp.lang.c++)

Jorgen Grahn

unread,

Sep 30, 2014, 5:44:32 AM9/30/14

to

On Tue, 2014-09-30, JKdr wrote:
> Input would be:
>
> 1) the strings that need to be searched for (or regex patterns) and
> their substitutions
>
> 2) a stream (from a file) and its encoding
>
> I spent some time trying to find an implementation of such an
> algorithm, which I think shouldn't be a big deal at least in the case
> that you don't use regex, just strings to find, but all implementations
> I have found touch the input stream more than once

Which ones /have/ you found?

> I think the data strategies of such algorithms is based on
> graphs/decision trees, which are used in state machines and similar
> algorithms
>
> Unless I am missing something
>
> Any ideas you could shared?

Not really ... but I /have/ looked into the simpler case "search for
multiple strings at once" and there are several algorithms. Follow
the links from

http://en.wikipedia.org/wiki/Knuth.Morris.Pratt_algorithm

I've been meaning to implement this in C++: the case where the same
set of search strings are used over and over again can be
precalculated well, and encapsulated as an object.

On the other hand, I suspect std::regex might perform well. Is it
one-pass enough for your purposes?

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

JKdr

unread,

Sep 30, 2014, 8:23:03 AM9/30/14

to

On 09/30/2014 07:00 AM, Malcolm McLean wrote:

> On Tuesday, September 30, 2014 10:08:28 AM UTC+1, JKdr wrote:
>> Input would be:
>>
>> 1) the strings that need to be searched for (or regex patterns) and
>> their substitutions
>>
>> 2) a stream (from a file) and its encoding
>>
>> I spent some time trying to find an implementation of such an
>> algorithm, which I think shouldn't be a big deal at least in the case
>> that you don't use regex, just strings to find, but all implementations
>> I have found touch the input stream more than once
>>

>> I think the data strategies of such algorithms is based on
>> graphs/decision trees, which are used in state machines and similar
>> algorithms
>>

> You need to build a suffix tree, on your search strings. Usually it's done the other
> way round, build the suffix tree with the data and search on the terms,
> but the method can work in reverse.
>
> Then it's a bit of a fiddle to hold the characters in a buffer, until they match
> or don't match your end of string character, which is the terminal of your
> suffix tree.

I had something very similar in mind. String search algorithms tend to
be messy (repeatedly looping though both data). What seems to be great
theoretically sometimes is not practical, could you point me to
algorithms like the ones you describe, preferably with tests? ;-)

Thank you
lbrtchx

JKdr

unread,

Sep 30, 2014, 11:31:42 AM9/30/14

to

On 09/30/2014 08:54 AM, Siri Crews wrote:
> That's called a finite state transducer

Yeap! Those kinds of problems have been studied to exhaustion

http://scholar.google.com/scholar?start=10&q="finite+state+transducer"+text

and there are implementations of such logic ...

http://en.wikipedia.org/wiki/Finite_state_transducer

http://jsalatas.ictpro.gr/java-fst-framework-api-review/

I was (very expectedly indeed ;-)) trying to tackle text
processing/corpora research problems with such FST-like strategies

Thank you very much,
lbrtchx