P0638R0: std::crochemore_perrin_searcher. Anyone willing to push this in Toronto?

138 views
Skip to first unread message

Ed Schouten

unread,
Jun 27, 2017, 2:51:42 AM6/27/17
to std-pr...@isocpp.org
Hi there,

The last mailing included a short proposal that I wrote, called
"P0638R0: Crochemore-Perrin search algorithm for std::search":

https://wg21.link/p0638r0

In essence, this proposal extends std::search() to include an
additional algorithm that runs both in linear time and is in-place.
Due to these properties, it can even be marked constexpr. This not
only allows you to fully initialise searchers at compile-time, it can
even perform matching.

My question is, as I won't be going to Toronto for the WG21 meeting,
is anyone interested in presenting this proposal there to the LEWG?
Thanks!

--
Ed Schouten <e...@nuxi.nl>
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717

zamaz...@gmail.com

unread,
Jul 3, 2017, 11:51:23 PM7/3/17
to ISO C++ Standard - Future Proposals
Hello.

I work on search algorithms in Boost.Algorithm. And i have some thoughts about your proposal:
1) Do you have comparison between naive search vs Crochemore-Perrin vs Boyer-Moore? I have benchmarked Crochemore-Perrin vs Boyer-Moore, and i can say that Crochemore-Perrin (as known as Two-way search algorithm (TW)) is always slower. And it's without comparison between Musser-Nishanov algorithm vs TW (Musser-Nishanov is faster than BM).
2) TW requires Random Access Iterators. BM too.

So can you explain me please, why we should add this algorithm to standard library?

Useful links:
3) https://github.com/boostorg/algorithm/pull/25

вторник, 27 июня 2017 г., 9:51:42 UTC+3 пользователь Ed Schouten написал:

Ville Voutilainen

unread,
Jul 4, 2017, 12:09:21 AM7/4/17
to ISO C++ Standard - Future Proposals
On 4 July 2017 at 06:51, <zamaz...@gmail.com> wrote:
> Hello.
>
> I work on search algorithms in Boost.Algorithm. And i have some thoughts
> about your proposal:
> 1) Do you have comparison between naive search vs Crochemore-Perrin vs
> Boyer-Moore? I have benchmarked Crochemore-Perrin vs Boyer-Moore, and i can
> say that Crochemore-Perrin (as known as Two-way search algorithm (TW)) is
> always slower. And it's without comparison between Musser-Nishanov algorithm
> vs TW (Musser-Nishanov is faster than BM).
> 2) TW requires Random Access Iterators. BM too.
>
> So can you explain me please, why we should add this algorithm to standard
> library?

Crochemore-Perrin is a constant-space algorithm, whereas Boyer-Moore isn't. That
would seem nice-to-have for uses that do not wish to have the dynamic
allocation.
In other words, Boyer-Moore would win every benchmark you throw at these
algorithms, except the one where you can't run Boyer-Moore at all. :)

zamaz...@gmail.com

unread,
Jul 4, 2017, 12:33:18 AM7/4/17
to ISO C++ Standard - Future Proposals
Yeah, i know that TW is constant-space algorithm. But is enough this feature for including into Standard? IMHO, no (maybe i didn't know some details).

About possibility for using as constexpr. There is proposal about constexpr_allocator, which can make all STL containers constexpr by default, so BM and Boyer-Moore-Horspool (BMH) will become also constexpr.

Maybe in a day i can show you some benchmarks between naive search and TW algorithms (i think, TW wins. But how many percents TW will win? And also TW has a lot of restrictions, which naive search has not).

I think, that current situation with search algorithms isn't so terrible - BM and BMH are fast enough. Yep, there are faster algorithms, but BM and BMH are widely-used.

вторник, 4 июля 2017 г., 7:09:21 UTC+3 пользователь Ville Voutilainen написал:

Ville Voutilainen

unread,
Jul 4, 2017, 12:46:14 AM7/4/17
to ISO C++ Standard - Future Proposals
On 4 July 2017 at 07:33, <zamaz...@gmail.com> wrote:
> Yeah, i know that TW is constant-space algorithm. But is enough this feature
> for including into Standard? IMHO, no (maybe i didn't know some details).

Seems handy to have when BM can't be used. But, here's a question: is
there some reason
why a string search can't use TW internally? It can't use BM due to
the preprocessing,
but why not TW? Why would it need to be a separate searcher?

> Maybe in a day i can show you some benchmarks between naive search and TW
> algorithms (i think, TW wins. But how many percents TW will win? And also TW
> has a lot of restrictions, which naive search has not).

Data is always good to have.

> I think, that current situation with search algorithms isn't so terrible -
> BM and BMH are fast enough. Yep, there are faster algorithms, but BM and BMH
> are widely-used.

Speed superior to BM/BMH is not the reason to add TW, semi-obviously.

zamaz...@gmail.com

unread,
Jul 4, 2017, 12:52:46 AM7/4/17
to ISO C++ Standard - Future Proposals
Seems handy to have when BM can't be used. But, here's a question: is 
there some reason 
why a string search can't use TW internally? It can't use BM due to 
the preprocessing, 
but why not TW? Why would it need to be a separate searcher? 

We don't talk about std::string only. std::search is general algorithm, so it can used for no only std::string. In this case we must support not only Random access iterators (BM and BMH), but also other iterators (now we do it with naive search). TW requires Random Access iterators. Reason of appearing searchers in Standard is give to programmer possibility to choose search algorithm (by default we use naive search, because it works even on Forward Iterators).

вторник, 4 июля 2017 г., 7:46:14 UTC+3 пользователь Ville Voutilainen написал:

Ville Voutilainen

unread,
Jul 4, 2017, 12:59:36 AM7/4/17
to ISO C++ Standard - Future Proposals
On 4 July 2017 at 07:52, <zamaz...@gmail.com> wrote:
>> Seems handy to have when BM can't be used. But, here's a question: is
>> there some reason
>> why a string search can't use TW internally? It can't use BM due to
>> the preprocessing,
>> but why not TW? Why would it need to be a separate searcher?
>
>
> We don't talk about std::string only. std::search is general algorithm, so
> it can used for no only std::string. In this case we must support not only
> Random access iterators (BM and BMH), but also other iterators (now we do it
> with naive search). TW requires Random Access iterators. Reason of appearing
> searchers in Standard is give to programmer possibility to choose search
> algorithm (by default we use naive search, because it works even on Forward
> Iterators).

So you're now answering your own question about why TW should be
standardized? :)

zamaz...@gmail.com

unread,
Jul 4, 2017, 1:02:17 AM7/4/17
to ISO C++ Standard - Future Proposals
Before benchmarks between Naive search vs TW i can't agree with appearing TW in Standard. Also i think that there are more efficient algorithm than TW with similar requirements. We should research it before including.

вторник, 4 июля 2017 г., 7:59:36 UTC+3 пользователь Ville Voutilainen написал:

Ville Voutilainen

unread,
Jul 4, 2017, 1:05:54 AM7/4/17
to ISO C++ Standard - Future Proposals
On 4 July 2017 at 08:02, <zamaz...@gmail.com> wrote:
> Before benchmarks between Naive search vs TW i can't agree with appearing TW
> in Standard. Also i think that there are more efficient algorithm than TW
> with similar requirements. We should research it before including.

Ok, fully agreed on that. In general, having a constant-space searcher
(although a naive one is such
a searcher already) seems attractive, *if* it can provide an actual
benefit. Whether TW is that
solution is indeed something worth exploring.

Ed Schouten

unread,
Jul 4, 2017, 3:20:42 AM7/4/17
to std-pr...@isocpp.org
Hi there!

Thanks for taking the time to look into my proposal. Let me try to
respond to all of your messages in one go.

2017-07-04 6:09 GMT+02:00 Ville Voutilainen <ville.vo...@gmail.com>:
> Crochemore-Perrin is a constant-space algorithm, whereas Boyer-Moore isn't. That
> would seem nice-to-have for uses that do not wish to have the dynamic
> allocation.
> In other words, Boyer-Moore would win every benchmark you throw at these
> algorithms, except the one where you can't run Boyer-Moore at all. :)

Exactly. This is not a race to find the fastest search algorithm. It's
about providing a somewhat decent string searching algorithm that
works well on resource-constrained systems.

2017-07-04 6:46 GMT+02:00 Ville Voutilainen <ville.vo...@gmail.com>:
> On 4 July 2017 at 07:33, <zamaz...@gmail.com> wrote:
> Seems handy to have when BM can't be used. But, here's a question: is
> there some reason why a string search can't use TW internally? It can't use BM due to
> the preprocessing, but why not TW? Why would it need to be a separate searcher?

So the problem with all of the existing searchers is that they are
defined to only do comparisons for equality. As Two-Way's
preprocessing algorithm depends on having a total order on the
alphabet, my proposal adds a searcher that may depend on std::less<>.
I have to confess, I am not an expert when it comes to benchmarking
string search algorithms. Question: are there some kind of
standardised test vectors that represent realistic workloads any of
you want me to use?

Anyway, as a quick experiment, I benchmarked the running time of
various string searching algorithms when given the following input:

Needle: 'A' * 999 + 'B'
Haystack: 'A' * 1000000

This means we'll be searching for a kilobyte-sized needle in a
megabyte-sized haystack, where the needle and haystack are structured
in such a way that a naive algorithm would perform very poorly. There
will obviously be no full match.

When running string searchers on FreeBSD 12.0 in VirtualBox on my 2,8
GHz Intel i7 Macbook Pro (late 2013 model), I observe the following
average running times for searching (i.e., not counting the
preprocessing phase), when built with Clang 4.0 at -O3:

Naive std::search(): 629.4 milliseconds (avg of 100 iterations)
libc++'s std::experimental::boyer_moore_searcher: 5.916 milliseconds
(avg of 10000 iterations)
libc++'s std::experimental::boyer_moore_horspool_searcher: 3.494
milliseconds (avg of 10000 iterations)
My implementation of std::crochemore_perrin_searcher: 1.079
milliseconds (avg of 10000 iterations)

zamaz...@gmail.com

unread,
Jul 4, 2017, 6:45:24 AM7/4/17
to ISO C++ Standard - Future Proposals
Thank you for the brief explanation.

First: BM and BMH uses strong heuristics for searching. Not only simple comparisons.

Second: Your benchmark is not from real world. It's very-very specific. I have Good workable benchmark (provided by SMART). And later i will give you my results. 

вторник, 4 июля 2017 г., 10:20:42 UTC+3 пользователь Ed Schouten написал:

Alexander Zaitsev

unread,
Jul 4, 2017, 7:38:18 AM7/4/17
to ISO C++ Standard - Future Proposals
Ok, i have benchmarked it just know. I have used SMART tool (tool was written by scientists. SMART is project about researching search algorithms).

Results are attached as 'results.zip' archive.

My system:
i7-3630QM, 12 Gib RAM, SSD, Kubuntu 17.04. CPU resources are free as much as possible during tests.
Test program you can find here: https://github.com/smart-tool (you must patch smart-gui for full compatibility with Qt5. I did it locally, sorry)

Short explanation:
BF is Brute-Force, naive algorithm
BM is Boyer-Moore
TW is Two-Way or Crochemore-Perrin.

My summary:
Results is unfortunately bad and i can't agree with accepting it into Standard. 

My suggestion: I am working on search algorithms in Boost.Algorithm. Boost.Algorithm is very good place for testing new search algorithms. At first i suggest you write proposal to Bost.Algorithm, i will be happy to help you with it. And after our testing if your algorithm is really good, we will add it to Boost.Algorithm, release it. And after 2-3 releases we can write really full proposal (Sorry, your proposal is too short, without benchmarks, etc. ) and send it to Comitee. 

There are a lot of algorithms for searching, and Two-Way isn't one of the best (even in this specific situation).

вторник, 4 июля 2017 г., 10:20:42 UTC+3 пользователь Ed Schouten написал:
results.zip

Ed Schouten

unread,
Jul 4, 2017, 11:56:01 AM7/4/17
to std-pr...@isocpp.org
Alexander,

2017-07-04 13:38 GMT+02:00 Alexander Zaitsev <zamaz...@gmail.com>:
> Results is unfortunately bad and i can't agree with accepting it into
> Standard.

I agree with you that the results of the Two-Way algorithm are
obviously not as good as Boyer-Moore, but we are clearly in
disagreement about what to conclude from your measurements.

Looking at the results of the benchmarks you provided, it seems that
the Two-Way algorithm performs rather similar to the naïve algorithm.
It looks like it doesn't really matter which one you pick for these
specific benchmarks. Sometimes it's faster, sometimes it's slower.
That said, none of the benchmark results you presented correspond with
degenerate cases. It's either the English/Italian dictionary or
randomly generated strings, right?

Because of this, I think the Two-Way algorithm truly shows its
strength: you've shown that it works equally well for common use,
except that we know it can't ever regress to a quadratic running time.
This is exactly what my proposal tries to cover. A string-searching
algorithm that:

- is in-place,
- runs in linear time,
- doesn't perform worse,
- is relatively compact, at least when compared to the
Boyer-Moore(-Horspool) searchers.

> There are a lot of algorithms for searching, and Two-Way isn't one of the
> best (even in this specific situation).

Sure. Lots of other algorithms and even variations on Two-Way exist.
Libraries like glibc, musl, etc. also combine Two-Way with a skip
table, but technically speaking, that makes the algorithm no longer
in-place. The size of the skip table depends on the size of the
alphabet.

What kind of in-place, worst-case linear time algorithm would you
propose instead?
Reply all
Reply to author
Forward
0 new messages