A case insensitive substring search?

James Moe

unread,

Jun 28, 2016, 4:17:13 PM6/28/16

to

Hello,
A case insensitive substring search: How hard can it be?.
I searched for it. There are lots of case insensitive string searches,
but very few case insensitive substring searches. Nothing that I could
understand.
I came up with the following. It seemed that search_n() was perfect
for a substring search; I just had to supply an appropriate comparison
function. which is where I stalled.

bool icompare (unsigned char a, unsigned char b)
{
return (toupper(a) == toupper(b));
}

bool strstricmp (string & s1, string & sub2)
{
string::iterator pos;

pos = search_n(s1.begin, s1.end(),
1,
sub2,
icompare());
return (pos != s1.end());
}

The main problem is what to supply icompare() as parameters? Probably
the wrong choice.

This was a cakewalk in C. Creating a C++-worthy function is not so easy.

--
James Moe
jmm-list at sohnen-moe dot com
Think.

Ben Bacarisse

unread,

Jun 28, 2016, 6:48:36 PM6/28/16

to

James Moe <jimoe...@sohnen-moe.com> writes:

> A case insensitive substring search: How hard can it be?.
> I searched for it. There are lots of case insensitive string searches,
> but very few case insensitive substring searches. Nothing that I could
> understand.
> I came up with the following. It seemed that search_n() was perfect
> for a substring search; I just had to supply an appropriate comparison
> function. which is where I stalled.

You need search, not search_n. And if you are using modern C++ you can
pass a lambda for the comparison function so there is no need for an
extra named function (though that's not a big deal).

<snip>
--
Ben.

Nobody

unread,

Jun 28, 2016, 7:38:24 PM6/28/16

to

On Tue, 28 Jun 2016 13:16:55 -0700, James Moe wrote:

> A case insensitive substring search: How hard can it be?.

Substring search is provided by the std::basic_string::find() method.
Comparisons are performed using the eq() member of the traits class,
so if you use a traits class where eq() is case-insensitive, you get a
case-insensitive substring comparison.

But that requires using an explicit specialisation of std::basic_string
rather than just using std::string or std::wstring (which use
std::char_traits implicitly). A cast would almost certainly work, but
(AFAIK) it's not guaranteed.

Whatever method you choose, there's the issue that it isn't always
possible to perform a case-insensitive comparison character-by-character.

The classic example of where this fails is that the German "sharp s"
character ("ß") doesn't have an equivalent upper-case character; the
upper-case equivalent is "SS". There's no way to handle this situation
using only the standard library, where case conversion is assumed to map
one character to one character.

James Moe

unread,

Jun 29, 2016, 1:25:56 AM6/29/16

to

On 06/28/2016 03:48 PM, Ben Bacarisse wrote:
> You need search, not search_n.
>

Tried that. No luck.

bool strstricmp (string & s1, string & sub2)
{
string::iterator pos;

pos = search(s1.begin, s1.end(),
sub2.begin(), sub2.end(),
icompare);
return (pos != s1.end());
}
Complains that there is no match for the predicate.

> And if you are using modern C++ you can [...]
>
Alas, no. g++ 4.7.3.

Ben Bacarisse

unread,

Jun 29, 2016, 5:56:32 AM6/29/16

to

James Moe <jimoe...@sohnen-moe.com> writes:

> On 06/28/2016 03:48 PM, Ben Bacarisse wrote:
>> You need search, not search_n.
>>
> Tried that. No luck.
> bool strstricmp (string & s1, string & sub2)
> {
> string::iterator pos;
>
> pos = search(s1.begin, s1.end(),
> sub2.begin(), sub2.end(),
> icompare);
> return (pos != s1.end());
> }
> Complains that there is no match for the predicate.

Seems odd. I'd expect more problems to come from the missing () after
s1.begin. When I fix that, it works here.

You should consider making s1 and sub2 references to const.

>> And if you are using modern C++ you can [...]
>>
> Alas, no. g++ 4.7.3.

The [...] went on to suggest using an anonymous function. I thought
these were introduced in gcc 4.5.

--
Ben.

Öö Tiib

unread,

Jun 29, 2016, 7:20:10 AM6/29/16

to

On Wednesday, 29 June 2016 12:56:32 UTC+3, Ben Bacarisse wrote:
> James Moe <jimoe...@sohnen-moe.com> writes:
>
> > On 06/28/2016 03:48 PM, Ben Bacarisse wrote:
> >> You need search, not search_n.
> >>
> > Tried that. No luck.
> > bool strstricmp (string & s1, string & sub2)
> > {
> > string::iterator pos;
> >
> > pos = search(s1.begin, s1.end(),
> > sub2.begin(), sub2.end(),
> > icompare);
> > return (pos != s1.end());
> > }
> > Complains that there is no match for the predicate.
>
> Seems odd. I'd expect more problems to come from the missing () after
> s1.begin. When I fix that, it works here.

May be that James has some user-defined things named 'search'
and/or 'string'. With 'std::string' and 'std::search' his code will be
rejected because of missing '()' after 's1.begin'.

>
> You should consider making s1 and sub2 references to const.
>
> >> And if you are using modern C++ you can [...]
> >>
> > Alas, no. g++ 4.7.3.
>
> The [...] went on to suggest using an anonymous function. I thought
> these were introduced in gcc 4.5.

Yes, lambdas were available since gcc 4.5.

Öö Tiib

unread,

Jun 29, 2016, 9:23:01 AM6/29/16

to

On Wednesday, 29 June 2016 02:38:24 UTC+3, Nobody wrote:
> On Tue, 28 Jun 2016 13:16:55 -0700, James Moe wrote:
>
> > A case insensitive substring search: How hard can it be?.
>
> Substring search is provided by the std::basic_string::find() method.
> Comparisons are performed using the eq() member of the traits class,
> so if you use a traits class where eq() is case-insensitive, you get a
> case-insensitive substring comparison.
>
> But that requires using an explicit specialisation of std::basic_string
> rather than just using std::string or std::wstring (which use
> std::char_traits implicitly). A cast would almost certainly work, but
> (AFAIK) it's not guaranteed.
>
> Whatever method you choose, there's the issue that it isn't always
> possible to perform a case-insensitive comparison character-by-character.
>
> The classic example of where this fails is that the German "sharp s"
> character ("ß") doesn't have an equivalent upper-case character; the
> upper-case equivalent is "SS".

So some capitalized word in text (say "RUSSEN") may be one German
word ("rußen") or other German word ("Russen") or same-looking
word or name from some other language?

> There's no way to handle this situation using only the standard
> library, where case conversion is assumed to map one character
> to one character.

If the issue is like I described above then it can't be handled with
whatever software and even for real human operator it is possible
to construct ambiguous input that the operator can't decide. If
treating "ß" and "ss" as equivalents is good enough hack around
it then however it is doable.

Ralf Goertz

unread,

Jun 29, 2016, 11:16:20 AM6/29/16

to

Am Wed, 29 Jun 2016 06:22:48 -0700 (PDT)
schrieb Öö Tiib <oot...@hot.ee>:

> So some capitalized word in text (say "RUSSEN") may be one German word
> ("rußen") or other German word ("Russen") or same-looking word or name
> from some other language?

If there is going to be an ambiguity like in your example a capitalized
ß becomes SZ which is what it really is. AFAIK the origin of the ß is a
ligature of ſ (long s) and z. Another more common example of this
ambiguity would be „Maße“ (measures). It will become „MASZE“ not „MASSE“
(mass).

James Moe

unread,

Jun 29, 2016, 1:48:30 PM6/29/16

to

On 06/28/2016 10:25 PM, James Moe wrote:
> pos = search(s1.begin, s1.end(),
>
Bzzt! User error.
That should be "s1.begin()", not "s1.begin".

Alf P. Steinbach

unread,

Jun 29, 2016, 3:34:13 PM6/29/16

to

On 29.06.2016 19:24, Stefan Ram wrote:
> r...@zedat.fu-berlin.de (Stefan Ram) writes:

>> Nobody <nob...@nowhere.invalid> writes:
>>> There's no way to handle this situation
>>> using only the standard library, where case conversion is assumed to map
>>> one character to one character.

>> It was always possible to use a custom character encoding, where
>> »SS« is represented by a single code point (character); it still
>> can be rendered »SS«, which is a question of I/O.
>
> And there even is a precedent for it: »"\n"« (one character),
> which sometimes is output as »"\015\012"« (two characters).

But that would break the Unix-land convention for narrow streams, that
(in Unix-land) they only shuffle bytes, with no interpretation.

Also keep in mind that in Unix-land the narrow streams are now usually
UTF-8 encoded, with a variable number of bytes per Unicode code point.

But I like the idea of the uppercase sharp S. I didn't know about it
until your posting. Thanks!

Cheers!,

- Alf

Nobody

unread,

Jul 1, 2016, 2:52:49 AM7/1/16

to

On Wed, 29 Jun 2016 04:33:05 +0000, Stefan Ram wrote:

> Nobody <nob...@nowhere.invalid> writes:
>> There's no way to handle this situation
>>using only the standard library, where case conversion is assumed to map
>>one character to one character.
>

> No way?

>
> It was always possible to use a custom character encoding

I should have been more clear: there's no way to handle this situation
using only the standard library's notion of case (either the std::ctype
facet or the features inherited from C in <cctype>) for an arbitrary
locale.

I daresay that you can create custom locales or just avoid using anything
related to locales and create your own alternatives. But that's likely to
create as many problems as it solves.

Öö Tiib

unread,

Jul 1, 2016, 4:47:07 AM7/1/16

to

Thanks for explaining. Seemingly it still remains possible to prepare
a "test case" about what even human operator can not be 100%
sure if it is German „Maße“ or name "Masze" or word "masze"
from some other language in all caps.