case insensitive match in unicode

SunWuKung

unread,

Mar 27, 2006, 4:31:17 AM3/27/06

to

I would need to do case insensitive match against a field that contains
text in different languages - Greek, Hungarian, Arabic etc.
The db encoding is UTF8.

So far I found no way to achieve that. I tried converting both strings
to the same case and using ~* , but neither worked.

Does anybody no a way to do this?

Thanks for the help.
Balázs

Martijn van Oosterhout

unread,

Mar 27, 2006, 4:48:29 AM3/27/06

to

On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
> I would need to do case insensitive match against a field that contains
> text in different languages - Greek, Hungarian, Arabic etc.
> The db encoding is UTF8.
>
> So far I found no way to achieve that. I tried converting both strings
> to the same case and using ~* , but neither worked.

Oh, tricky. Firstly, case-insensetive means different things to
different locales. For example, in Turkish 'i' is not the lowecase
version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
don't specify a platform either. Locale support varies wildly by
platform.

What you probably want it some kind of accent-insensetive match that
mean that é, è, ë, e, É, È, E and Ë are all considered to match
eachother. The way you do that is by converting unicode to a particular
normal form and then comparing. Unfortunatly, I don't think PostgreSQL
supplies such a function right now.

However, some server-side procedural languages can do this. If you can
find one (possibly Perl) that can do the conversion, you can create a
function to do the mapping.

Have a nice day,
--
Martijn van Oosterhout <kle...@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

signature.asc

SunWuKung

unread,

Mar 27, 2006, 5:45:05 AM3/27/06

to

In article <20060327094...@svana.org>, kle...@svana.org says...

> On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
> > I would need to do case insensitive match against a field that contains
> > text in different languages - Greek, Hungarian, Arabic etc.
> > The db encoding is UTF8.
> >
> > So far I found no way to achieve that. I tried converting both strings
> > to the same case and using ~* , but neither worked.
>
> Oh, tricky. Firstly, case-insensetive means different things to
> different locales. For example, in Turkish 'i' is not the lowecase
> version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
> don't specify a platform either. Locale support varies wildly by
> platform.
>
> What you probably want it some kind of accent-insensetive match that
> mean that é, è, ë, e, É, È, E and Ë are all considered to match
> eachother. The way you do that is by converting unicode to a particular
> normal form and then comparing. Unfortunatly, I don't think PostgreSQL
> supplies such a function right now.
>
> However, some server-side procedural languages can do this. If you can
> find one (possibly Perl) that can do the conversion, you can create a
> function to do the mapping.
>
> Have a nice day,
>

This sounds like a very interesting concept.
It wouldn't be 'case insensitive' just insensitive.

The way I imagine it now is a special case of the ~ function.
I create matchgroups in a table and check each character if it is in the
group. If it is I will replace the character with the group in [éÉE],
[oóOÓ??] and do a regexp with that.

What do you think?
B.

Martijn van Oosterhout

unread,

Mar 27, 2006, 6:40:37 AM3/27/06

to

On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> This sounds like a very interesting concept.
> It wouldn't be 'case insensitive' just insensitive.
>
> The way I imagine it now is a special case of the ~ function.
> I create matchgroups in a table and check each character if it is in the
> group. If it is I will replace the character with the group in [éÉE],
> [oóOÓ??] and do a regexp with that.

No need to reinvent the wheel. ICU provides a range of services to deal
with this. For example the following filter in ICU:

NFD; [:Nonspacing Mark:] Remove; NFC.

Will remove all accents from characters. And it works for all Unicode
characters. With a bit more thinking you can work with case variations
also.

There is also a locale-independant case-mapping module there plus
various locale specific ones also.

http://icu.sourceforge.net/userguide/Transform.html
http://icu.sourceforge.net/userguide/caseMappings.html
http://icu.sourceforge.net/userguide/normalization.html

signature.asc

SunWuKung

unread,

Apr 6, 2006, 5:12:26 PM4/6/06

to

In article <20060327114...@svana.org>, kle...@svana.org says...

> On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> > This sounds like a very interesting concept.
> > It wouldn't be 'case insensitive' just insensitive.
> >
> > The way I imagine it now is a special case of the ~ function.
> > I create matchgroups in a table and check each character if it is in the
> > group. If it is I will replace the character with the group in [éÉE],
> > [oóOÓ??] and do a regexp with that.
>
> No need to reinvent the wheel. ICU provides a range of services to deal
> with this. For example the following filter in ICU:
>
> NFD; [:Nonspacing Mark:] Remove; NFC.
>
> Will remove all accents from characters. And it works for all Unicode
> characters. With a bit more thinking you can work with case variations
> also.
>
> There is also a locale-independant case-mapping module there plus
> various locale specific ones also.
>
> http://icu.sourceforge.net/userguide/Transform.html
> http://icu.sourceforge.net/userguide/caseMappings.html
> http://icu.sourceforge.net/userguide/normalization.html
>
> Have a nice day,
>

Thanks, I looked at this and it looks like something that would indeed
solve the problem.
However I was so far unable to figure out how could I use this from
within Postgres. If you have experience with it could you give me an
example?

Thanks
Balázs

Martijn van Oosterhout

unread,

Apr 7, 2006, 9:46:01 AM4/7/06

to

On Thu, Apr 06, 2006 at 11:12:26PM +0200, SunWuKung wrote:
> > There is also a locale-independant case-mapping module there plus
> > various locale specific ones also.
> >
> > http://icu.sourceforge.net/userguide/Transform.html
> > http://icu.sourceforge.net/userguide/caseMappings.html
> > http://icu.sourceforge.net/userguide/normalization.html
> >
> > Have a nice day,
> >
> Thanks, I looked at this and it looks like something that would indeed
> solve the problem.
> However I was so far unable to figure out how could I use this from
> within Postgres. If you have experience with it could you give me an
> example?

There are some unofficial ICU patches but I doubt they're still
up-to-date. I don't personally use it though maybe someone else here
does...

signature.asc

Mike Rylander

unread,

Apr 7, 2006, 10:41:15 AM4/7/06

to

I was looking into creating a Pg function wrapper to some of the ICU
stuff, but, to be perfectly honest, I couldn't find an actual API
reference for ICU.

In any case, you can do this with PL/Perl:

CREATE FUNCTION strip_nonspacing_marks ( text ) RETURNS text AS $func$
use Unicode::Normalize;
use Encode;

my $string = NFD( decode( utf8 => shift() ) );
$string =~ s/\p{Mn}+//ogsm;

return NFC($string);
$func$ LANGUAGE 'plperl' STRICT;

It's untested and won't be as fast as ICU, but it should get the job
done. Hope it helps!

>
> Thanks
> Balázs
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

--
Mike Rylander
mryl...@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Balazs...@t-online.hu

unread,

Apr 13, 2006, 8:02:28 AM4/13/06

to

This perl function doesn't work for me.
I keep getting a
Cannot decode string with wide characters at
/usr/lib/perl5/5.8.7/i386-linux/Encode.pm line 166 error

Do you have any suggestion?
Thanks
Balázs