Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

case insensitive match in unicode

42 views
Skip to first unread message

SunWuKung

unread,
Mar 27, 2006, 4:31:17 AM3/27/06
to
I would need to do case insensitive match against a field that contains
text in different languages - Greek, Hungarian, Arabic etc.
The db encoding is UTF8.

So far I found no way to achieve that. I tried converting both strings
to the same case and using ~* , but neither worked.

Does anybody no a way to do this?

Thanks for the help.
Balázs

Martijn van Oosterhout

unread,
Mar 27, 2006, 4:48:29 AM3/27/06
to
On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
> I would need to do case insensitive match against a field that contains
> text in different languages - Greek, Hungarian, Arabic etc.
> The db encoding is UTF8.
>
> So far I found no way to achieve that. I tried converting both strings
> to the same case and using ~* , but neither worked.

Oh, tricky. Firstly, case-insensetive means different things to
different locales. For example, in Turkish 'i' is not the lowecase
version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
don't specify a platform either. Locale support varies wildly by
platform.

What you probably want it some kind of accent-insensetive match that
mean that é, è, ë, e, É, È, E and Ë are all considered to match
eachother. The way you do that is by converting unicode to a particular
normal form and then comparing. Unfortunatly, I don't think PostgreSQL
supplies such a function right now.

However, some server-side procedural languages can do this. If you can
find one (possibly Perl) that can do the conversion, you can create a
function to do the mapping.

Have a nice day,
--
Martijn van Oosterhout <kle...@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

signature.asc

SunWuKung

unread,
Mar 27, 2006, 5:45:05 AM3/27/06
to
In article <20060327094...@svana.org>, kle...@svana.org says...

> On Mon, Mar 27, 2006 at 11:31:17AM +0200, SunWuKung wrote:
> > I would need to do case insensitive match against a field that contains
> > text in different languages - Greek, Hungarian, Arabic etc.
> > The db encoding is UTF8.
> >
> > So far I found no way to achieve that. I tried converting both strings
> > to the same case and using ~* , but neither worked.
>
> Oh, tricky. Firstly, case-insensetive means different things to
> different locales. For example, in Turkish 'i' is not the lowecase
> version of 'I'. Maybe you've chosen a locale that doesn't do UTF-8? You
> don't specify a platform either. Locale support varies wildly by
> platform.
>
> What you probably want it some kind of accent-insensetive match that
> mean that é, è, ë, e, É, È, E and Ë are all considered to match
> eachother. The way you do that is by converting unicode to a particular
> normal form and then comparing. Unfortunatly, I don't think PostgreSQL
> supplies such a function right now.
>
> However, some server-side procedural languages can do this. If you can
> find one (possibly Perl) that can do the conversion, you can create a
> function to do the mapping.
>
> Have a nice day,
>
This sounds like a very interesting concept.
It wouldn't be 'case insensitive' just insensitive.

The way I imagine it now is a special case of the ~ function.
I create matchgroups in a table and check each character if it is in the
group. If it is I will replace the character with the group in [éÉE],
[oóOÓ??] and do a regexp with that.

What do you think?
B.

Martijn van Oosterhout

unread,
Mar 27, 2006, 6:40:37 AM3/27/06
to
On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> This sounds like a very interesting concept.
> It wouldn't be 'case insensitive' just insensitive.
>
> The way I imagine it now is a special case of the ~ function.
> I create matchgroups in a table and check each character if it is in the
> group. If it is I will replace the character with the group in [éÉE],
> [oóOÓ??] and do a regexp with that.

No need to reinvent the wheel. ICU provides a range of services to deal
with this. For example the following filter in ICU:

NFD; [:Nonspacing Mark:] Remove; NFC.

Will remove all accents from characters. And it works for all Unicode
characters. With a bit more thinking you can work with case variations
also.

There is also a locale-independant case-mapping module there plus
various locale specific ones also.

http://icu.sourceforge.net/userguide/Transform.html
http://icu.sourceforge.net/userguide/caseMappings.html
http://icu.sourceforge.net/userguide/normalization.html

signature.asc

SunWuKung

unread,
Apr 6, 2006, 5:12:26 PM4/6/06
to
In article <20060327114...@svana.org>, kle...@svana.org says...

> On Mon, Mar 27, 2006 at 12:45:05PM +0200, SunWuKung wrote:
> > This sounds like a very interesting concept.
> > It wouldn't be 'case insensitive' just insensitive.
> >
> > The way I imagine it now is a special case of the ~ function.
> > I create matchgroups in a table and check each character if it is in the
> > group. If it is I will replace the character with the group in [éÉE],
> > [oóOÓ??] and do a regexp with that.
>
> No need to reinvent the wheel. ICU provides a range of services to deal
> with this. For example the following filter in ICU:
>
> NFD; [:Nonspacing Mark:] Remove; NFC.
>
> Will remove all accents from characters. And it works for all Unicode
> characters. With a bit more thinking you can work with case variations
> also.
>
> There is also a locale-independant case-mapping module there plus
> various locale specific ones also.
>
> http://icu.sourceforge.net/userguide/Transform.html
> http://icu.sourceforge.net/userguide/caseMappings.html
> http://icu.sourceforge.net/userguide/normalization.html
>
> Have a nice day,
>
Thanks, I looked at this and it looks like something that would indeed
solve the problem.
However I was so far unable to figure out how could I use this from
within Postgres. If you have experience with it could you give me an
example?

Thanks
Balázs

Martijn van Oosterhout

unread,
Apr 7, 2006, 9:46:01 AM4/7/06
to
On Thu, Apr 06, 2006 at 11:12:26PM +0200, SunWuKung wrote:
> > There is also a locale-independant case-mapping module there plus
> > various locale specific ones also.
> >
> > http://icu.sourceforge.net/userguide/Transform.html
> > http://icu.sourceforge.net/userguide/caseMappings.html
> > http://icu.sourceforge.net/userguide/normalization.html
> >
> > Have a nice day,
> >
> Thanks, I looked at this and it looks like something that would indeed
> solve the problem.
> However I was so far unable to figure out how could I use this from
> within Postgres. If you have experience with it could you give me an
> example?

There are some unofficial ICU patches but I doubt they're still
up-to-date. I don't personally use it though maybe someone else here
does...

signature.asc

Mike Rylander

unread,
Apr 7, 2006, 10:41:15 AM4/7/06
to

I was looking into creating a Pg function wrapper to some of the ICU
stuff, but, to be perfectly honest, I couldn't find an actual API
reference for ICU.

In any case, you can do this with PL/Perl:

CREATE FUNCTION strip_nonspacing_marks ( text ) RETURNS text AS $func$
use Unicode::Normalize;
use Encode;

my $string = NFD( decode( utf8 => shift() ) );
$string =~ s/\p{Mn}+//ogsm;

return NFC($string);
$func$ LANGUAGE 'plperl' STRICT;

It's untested and won't be as fast as ICU, but it should get the job
done. Hope it helps!

>
> Thanks
> Balázs
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>


--
Mike Rylander
mryl...@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Balazs...@t-online.hu

unread,
Apr 13, 2006, 8:02:28 AM4/13/06
to
This perl function doesn't work for me.
I keep getting a
Cannot decode string with wide characters at
/usr/lib/perl5/5.8.7/i386-linux/Encode.pm line 166 error

Do you have any suggestion?
Thanks
Balázs

0 new messages