Case-insensitive select

RVince

unread,

Aug 31, 2011, 8:29:13 AM8/31/11

to

I'm using Postgres and the end-users are complaining that the search
features provided by the web app are case-sensitive (via Postgres). Is
there a way to do a case-insensitive version of select statements? Not
all statements are of the form "=" or "like" but often of the form
"...where lastname >= 'smith';" so the I don;t think the '~*' operator
will work here for me in all cases. Is there a way to do this type of
case-insensitive comparison? Is there a way to designate certain
tables or columns be compared on a case-insensitive basis ? Thanks
RVince

Mladen Gogala

unread,

Aug 31, 2011, 9:19:03 AM8/31/11

to

First of all, PostgreSQL supports ILIKE operator, which is case
insensitive. Second, ~* will have a problem with indexes, which limits
its usefulness. Third, PostgreSQL support standard function like "upper"
and "lower". Your comparison can be rewritten like
where lower(lastname) >= 'smith', which will be case insensitive.
PostgreSQL also allows you to create an index on lower(lastname), which
will probably be used, unless the number of Smiths in your name
collection isn't too big for that.

--
http://mgogala.byethost5.com

Matthew Woodcraft

unread,

Aug 31, 2011, 3:42:40 PM8/31/11

to

RVince <rvin...@gmail.com> wrote:
> Is there a way to designate certain tables or columns be compared on
> a case-insensitive basis ? Thanks RVince

There's a case-insensitive string type in contrib (the "Additional
Supplied Modules"):

http://www.postgresql.org/docs/9.0/static/citext.html

-M-

Mladen Gogala

unread,

Aug 31, 2011, 7:00:28 PM8/31/11

to

On Wed, 31 Aug 2011 20:42:40 +0100, Matthew Woodcraft wrote:

> There's a case-insensitive string type in contrib (the "Additional
> Supplied Modules"):

There are unpleasant surprises with that, like the fact that unique index
of that type is not case insensitive. Standard varchar data type,
combined with a function index or a trigger should be more than adequate.

--
http://mgogala.byethost5.com

Laurenz Albe

unread,

Sep 1, 2011, 2:56:31 AM9/1/11

to

Mladen Gogala wrote:
>> There's a case-insensitive string type in contrib (the "Additional
>> Supplied Modules"):

> There are unpleasant surprises with that, like the fact that unique index
> of that type is not case insensitive. Standard varchar data type,
> combined with a function index or a trigger should be more than adequate.

What you say seems to contradict citext's documentation:

The citext data type allows you to eliminate calls to lower in SQL
queries, and allows a primary key to be case-insensitive.

Can you explain?

Yours,
Laurenz Albe

Mladen Gogala

unread,

Sep 1, 2011, 8:57:52 AM9/1/11

to

You skipped a passage in the documentation:

"If you declare a column as UNIQUE or PRIMARY KEY, the implicitly
generated index is case-sensitive. So it's useless for case-insensitive
searches, and it won't enforce uniqueness case-insensitively. "

--
http://mgogala.byethost5.com

Laurenz Albe

unread,

Sep 1, 2011, 10:40:58 AM9/1/11

to

Mladen Gogala wrote:
[about citext contrib]

>>> There are unpleasant surprises with that, like the fact that unique
>>> index of that type is not case insensitive. Standard varchar data type,
>>> combined with a function index or a trigger should be more than
>>> adequate.

>> What you say seems to contradict citext's documentation:
>>
>> The citext data type allows you to eliminate calls to lower in SQL
>> queries, and allows a primary key to be case-insensitive.
>>
>> Can you explain?

> You skipped a passage in the documentation:

>
> "If you declare a column as UNIQUE or PRIMARY KEY, the implicitly
> generated index is case-sensitive. So it's useless for case-insensitive
> searches, and it won't enforce uniqueness case-insensitively. "

Oh, I see. I think you got confused here, because the sentence you
quote is a description of a drawback of the "standard approach" using
"WHERE lower(col) = LOWER(?)", *not* a drawback of citext.

Yours,
Laurenz Albe

Mladen Gogala

unread,

Sep 1, 2011, 5:26:13 PM9/1/11

to

On Thu, 01 Sep 2011 16:40:58 +0200, Laurenz Albe wrote:

> Oh, I see. I think you got confused here, because the sentence you quote
> is a description of a drawback of the "standard approach" using "WHERE
> lower(col) = LOWER(?)", *not* a drawback of citext.

Seems you're right, I was mistaken.

mgogala=# create temporary table t1(col1 citext, constraint t1_pk primary
key(col1));
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t1_pk"
for table "t1"
CREATE TABLE
mgogala=# insert into t1 values('Laurenz Albe');
INSERT 0 1
mgogala=# insert into t1 values('laurenz albe');
ERROR: duplicate key value violates unique constraint "t1_pk"
DETAIL: Key (col1)=(laurenz albe) already exists.
mgogala=# insert into t1 values('LAURENZ ALBE');
ERROR: duplicate key value violates unique constraint "t1_pk"
DETAIL: Key (col1)=(LAURENZ ALBE) already exists.
mgogala=#

However, the citext doesn't allow me to limit the column size:

mgogala=# create temporary table t1(col1 citext(10), constraint t1_pk
primary key(col1));
ERROR: type modifier is not allowed for type "citext"
LINE 1: create temporary table t1(col1 citext(10), constraint t1_pk ...

It seems to me that citext is a good replacement for the type text, not
for the type varchar(n). At any rate, the type citext is useful but
rather nonstandard. I would still use the standard varchar type for the
vast majority of cases and use trigger and/or functional indexes to
resolve the case insensitive search. Case insensitive search was the
problem for which functional indexes were invented. Cost based optimizers
were having trouble with the functional indexes, Oracle's until version
11 and the extended statistics, because the normal statistics gathering
procedures usually do not collect data on lower(col1) or something
similar to it. The trick to make it perform better was to add a column,
named something like lower_col1 and populate it by a trigger. That would
help the optimizer, at the expense of violating the relational rules and
storing the duplicate information. That column could then be indexed
normally. This method still provides the best performance, on all
databases I know of, with the notable exception of Oracle 11i.

--
http://mgogala.byethost5.com

Laurenz Albe

unread,

Sep 2, 2011, 4:08:19 AM9/2/11

to

Mladen Gogala wrote:
> It seems to me that citext is a good replacement for the type text, not
> for the type varchar(n).

Well, it's called "citext" and not "civarchar" for a reason.
But you can easily

CREATE DOMAIN civarchar20 AS citext
CONSTRAINT civarchar20_len CHECK(length(VALUE) <= 20);

to get that.

> I would still use the standard varchar type for the
> vast majority of cases and use trigger and/or functional indexes to
> resolve the case insensitive search.

It's everybody's choice.

> The trick to make it perform better was to add a column,
> named something like lower_col1 and populate it by a trigger. That would
> help the optimizer, at the expense of violating the relational rules and
> storing the duplicate information. That column could then be indexed
> normally. This method still provides the best performance, on all
> databases I know of, with the notable exception of Oracle 11i.

Of course it also wastes space space, and triggers don't come for
free either.

I doubt that using an index based on an expression would perform worse
in PostgreSQL, but I don't have the time to run a performance test.

At any rate statistics for such indexes are collected (look for rows
in pg_stats that belong to the index), and these statistice are exactly
the same as would be collected for the redundant column, so why should
there be a difference?

Yours,
Laurenz Albe

Laurenz Albe

unread,

Sep 2, 2011, 4:08:19 AM9/2/11

to

Mladen Gogala wrote:
> It seems to me that citext is a good replacement for the type text, not
> for the type varchar(n).

Well, it's called "citext" and not "civarchar" for a reason.
But you can easily

CREATE DOMAIN civarchar20 AS citext
CONSTRAINT civarchar20_len CHECK(length(VALUE) <= 20);

to get that.

> I would still use the standard varchar type for the

> vast majority of cases and use trigger and/or functional indexes to
> resolve the case insensitive search.

It's everybody's choice.

> The trick to make it perform better was to add a column,

> named something like lower_col1 and populate it by a trigger. That would
> help the optimizer, at the expense of violating the relational rules and
> storing the duplicate information. That column could then be indexed
> normally. This method still provides the best performance, on all
> databases I know of, with the notable exception of Oracle 11i.

Of course it also wastes space space, and triggers don't come for

Mladen Gogala

unread,

Sep 2, 2011, 9:03:56 AM9/2/11

to

On Fri, 02 Sep 2011 10:08:19 +0200, Laurenz Albe wrote:

> At any rate statistics for such indexes are collected (look for rows in
> pg_stats that belong to the index), and these statistice are exactly the
> same as would be collected for the redundant column, so why should there
> be a difference?

The problem is in the histograms, which are collected per column, not per
index. You don't have histograms on lower(col1) unless it's in a
separate column. Without histograms, the optimizer can only make an
assumption, usually the uniform distribution of values, and use the
number of keys in the index divided by the number of rows in the table
for the selectivity estimate. That is not necessarily accurate and can
lead to bad performance. If the values in the "citext" column are skewed,
there are many people named "Smith" in the population, you will get an
index search when that is inappropriate.

--
http://mgogala.byethost5.com

Laurenz Albe

unread,

Sep 5, 2011, 3:45:13 AM9/5/11

to

Mladen Gogala wrote:
>> At any rate statistics for such indexes are collected (look for rows in
>> pg_stats that belong to the index), and these statistice are exactly the
>> same as would be collected for the redundant column, so why should there
>> be a difference?

> The problem is in the histograms, which are collected per column, not per
> index. You don't have histograms on lower(col1) unless it's in a
> separate column.

Why do you claim that without a test?
This is PostgreSQL 9.1:

CREATE TABLE test(id serial primary key, val varchar(20) NOT NULL);

DO $$BEGIN
FOR i IN 1..10000 LOOP
INSERT INTO test(val) VALUES (substr(random()::text, 3));
END LOOP;
END;$$;

CREATE INDEX test_val_ind ON test(lower(val));

SHOW default_statistics_target;

default_statistics_target
---------------------------
100
(1 row)

ANALYZE test;

SELECT array_length(histogram_bounds, 1) FROM pg_stats
WHERE tablename = 'test_val_ind';

array_length
--------------
101
(1 row)

Yours,
Laurenz Albe

Mladen Gogala

unread,

Sep 5, 2011, 6:53:28 AM9/5/11

to

On Mon, 05 Sep 2011 09:45:13 +0200, Laurenz Albe wrote:

>
> Why do you claim that without a test? This is PostgreSQL 9.1:
>
> CREATE TABLE test(id serial primary key, val varchar(20) NOT NULL);
>
> DO $$BEGIN
> FOR i IN 1..10000 LOOP
> INSERT INTO test(val) VALUES (substr(random()::text, 3));
> END LOOP;
> END;$$;
>
> CREATE INDEX test_val_ind ON test(lower(val));
>
> SHOW default_statistics_target;
>
> default_statistics_target
> ---------------------------
> 100
> (1 row)
>
> ANALYZE test;
>
> SELECT array_length(histogram_bounds, 1) FROM pg_stats
> WHERE tablename = 'test_val_ind';
>
> array_length
> --------------
> 101
> (1 row)
>
> Yours,
> Laurenz Albe

I don't understand? What does this prove? That there is a histogram on
lower(val)? How does your example prove that?

--
http://mgogala.byethost5.com

Laurenz Albe

unread,

Sep 5, 2011, 10:13:44 AM9/5/11

to

Mladen Gogala wrote:
>> CREATE TABLE test(id serial primary key, val varchar(20) NOT NULL);

>> CREATE INDEX test_val_ind ON test(lower(val));

>> SELECT array_length(histogram_bounds, 1) FROM pg_stats

>> WHERE tablename = 'test_val_ind';
>>
>> array_length
>> --------------
>> 101
>> (1 row)

> I don't understand? What does this prove? That there is a histogram on

> lower(val)? How does your example prove that?

I thought that was clear, sorry.

There is a histogram in the index statistics, and the index is on lower(val).
So yes, there is a histogram on lower(val), used whenever the planner
considers the index.

The index has only one "column", namly lower(val).

SELECT attname FROM pg_stats WHERE tablename = 'test_val_ind';

attname
---------
lower
(1 row)

pg_statistic (and its derived view pg_stats) has entries for all
indexable columns and index espressions.

Yours,
Laurenz Albe

Mladen Gogala

unread,

Sep 5, 2011, 10:21:25 PM9/5/11

to

On Mon, 05 Sep 2011 16:13:44 +0200, Laurenz Albe wrote:

Hmmm yes, you're right, I ran your test on 9.0.2, the version available
for Fedora 14 and it does have histogram on index, in additon to the
histogram on the column itself. I must admit I didn't think of using the
index name as a table name in pg_stats. Oracle histograms are related to
table columns, not to indexes, which seems logical to me.

mgogala=# CREATE TABLE test(id serial primary key, val varchar(20) NOT
NULL);
NOTICE: CREATE TABLE will create implicit sequence "test_id_seq" for
serial column "test.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey"
for table "test"
CREATE TABLE
mgogala=# DO $$BEGIN
mgogala$# FOR i IN 1..10000 LOOP
mgogala$# INSERT INTO test(val) VALUES (substr(random()::text, 3));
mgogala$# END LOOP;
mgogala$# END;$$;
DO
mgogala=#
mgogala=# CREATE INDEX test_val_ind ON test(lower(val));
CREATE INDEX
mgogala=# ANALYZE test;
ANALYZE
mgogala=# SELECT array_length(histogram_bounds, 1) FROM pg_stats
mgogala-# WHERE tablename = 'test_val_ind';

array_length
--------------
101
(1 row)

mgogala=# SELECT attname FROM pg_stats WHERE tablename = 'test_val_ind';

attname
---------
lower
(1 row)

mgogala=# SELECT array_length(histogram_bounds, 1) FROM pg_stats
mgogala-# where tablename='test' and attname='val';

array_length
--------------
101
(1 row)

mgogala=# select version();

version

--------------------------------------------------------------------------------
---------------------------
PostgreSQL 9.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.5.1
20100924
(Red Hat 4.5.1-4), 32-bit
(1 row)

--
http://mgogala.byethost5.com