First of all, PostgreSQL supports ILIKE operator, which is case
insensitive. Second, ~* will have a problem with indexes, which limits
its usefulness. Third, PostgreSQL support standard function like "upper"
and "lower". Your comparison can be rewritten like
where lower(lastname) >= 'smith', which will be case insensitive.
PostgreSQL also allows you to create an index on lower(lastname), which
will probably be used, unless the number of Smiths in your name
collection isn't too big for that.
There's a case-insensitive string type in contrib (the "Additional
Supplied Modules"):
http://www.postgresql.org/docs/9.0/static/citext.html
-M-
> There's a case-insensitive string type in contrib (the "Additional
> Supplied Modules"):
There are unpleasant surprises with that, like the fact that unique index
of that type is not case insensitive. Standard varchar data type,
combined with a function index or a trigger should be more than adequate.
> There are unpleasant surprises with that, like the fact that unique index
> of that type is not case insensitive. Standard varchar data type,
> combined with a function index or a trigger should be more than adequate.
What you say seems to contradict citext's documentation:
The citext data type allows you to eliminate calls to lower in SQL
queries, and allows a primary key to be case-insensitive.
Can you explain?
Yours,
Laurenz Albe
You skipped a passage in the documentation:
"If you declare a column as UNIQUE or PRIMARY KEY, the implicitly
generated index is case-sensitive. So it's useless for case-insensitive
searches, and it won't enforce uniqueness case-insensitively. "
>> What you say seems to contradict citext's documentation:
>>
>> The citext data type allows you to eliminate calls to lower in SQL
>> queries, and allows a primary key to be case-insensitive.
>>
>> Can you explain?
> You skipped a passage in the documentation:
>
> "If you declare a column as UNIQUE or PRIMARY KEY, the implicitly
> generated index is case-sensitive. So it's useless for case-insensitive
> searches, and it won't enforce uniqueness case-insensitively. "
Oh, I see. I think you got confused here, because the sentence you
quote is a description of a drawback of the "standard approach" using
"WHERE lower(col) = LOWER(?)", *not* a drawback of citext.
Yours,
Laurenz Albe
> Oh, I see. I think you got confused here, because the sentence you quote
> is a description of a drawback of the "standard approach" using "WHERE
> lower(col) = LOWER(?)", *not* a drawback of citext.
Seems you're right, I was mistaken.
mgogala=# create temporary table t1(col1 citext, constraint t1_pk primary
key(col1));
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t1_pk"
for table "t1"
CREATE TABLE
mgogala=# insert into t1 values('Laurenz Albe');
INSERT 0 1
mgogala=# insert into t1 values('laurenz albe');
ERROR: duplicate key value violates unique constraint "t1_pk"
DETAIL: Key (col1)=(laurenz albe) already exists.
mgogala=# insert into t1 values('LAURENZ ALBE');
ERROR: duplicate key value violates unique constraint "t1_pk"
DETAIL: Key (col1)=(LAURENZ ALBE) already exists.
mgogala=#
However, the citext doesn't allow me to limit the column size:
mgogala=# create temporary table t1(col1 citext(10), constraint t1_pk
primary key(col1));
ERROR: type modifier is not allowed for type "citext"
LINE 1: create temporary table t1(col1 citext(10), constraint t1_pk ...
It seems to me that citext is a good replacement for the type text, not
for the type varchar(n). At any rate, the type citext is useful but
rather nonstandard. I would still use the standard varchar type for the
vast majority of cases and use trigger and/or functional indexes to
resolve the case insensitive search. Case insensitive search was the
problem for which functional indexes were invented. Cost based optimizers
were having trouble with the functional indexes, Oracle's until version
11 and the extended statistics, because the normal statistics gathering
procedures usually do not collect data on lower(col1) or something
similar to it. The trick to make it perform better was to add a column,
named something like lower_col1 and populate it by a trigger. That would
help the optimizer, at the expense of violating the relational rules and
storing the duplicate information. That column could then be indexed
normally. This method still provides the best performance, on all
databases I know of, with the notable exception of Oracle 11i.
Well, it's called "citext" and not "civarchar" for a reason.
But you can easily
CREATE DOMAIN civarchar20 AS citext
CONSTRAINT civarchar20_len CHECK(length(VALUE) <= 20);
to get that.
> I would still use the standard varchar type for the
> vast majority of cases and use trigger and/or functional indexes to
> resolve the case insensitive search.
It's everybody's choice.
> The trick to make it perform better was to add a column,
> named something like lower_col1 and populate it by a trigger. That would
> help the optimizer, at the expense of violating the relational rules and
> storing the duplicate information. That column could then be indexed
> normally. This method still provides the best performance, on all
> databases I know of, with the notable exception of Oracle 11i.
Of course it also wastes space space, and triggers don't come for
free either.
I doubt that using an index based on an expression would perform worse
in PostgreSQL, but I don't have the time to run a performance test.
At any rate statistics for such indexes are collected (look for rows
in pg_stats that belong to the index), and these statistice are exactly
the same as would be collected for the redundant column, so why should
there be a difference?
Yours,
Laurenz Albe
Well, it's called "citext" and not "civarchar" for a reason.
But you can easily
CREATE DOMAIN civarchar20 AS citext
CONSTRAINT civarchar20_len CHECK(length(VALUE) <= 20);
to get that.
> I would still use the standard varchar type for the
> vast majority of cases and use trigger and/or functional indexes to
> resolve the case insensitive search.
It's everybody's choice.
> The trick to make it perform better was to add a column,
> named something like lower_col1 and populate it by a trigger. That would
> help the optimizer, at the expense of violating the relational rules and
> storing the duplicate information. That column could then be indexed
> normally. This method still provides the best performance, on all
> databases I know of, with the notable exception of Oracle 11i.
Of course it also wastes space space, and triggers don't come for
> At any rate statistics for such indexes are collected (look for rows in
> pg_stats that belong to the index), and these statistice are exactly the
> same as would be collected for the redundant column, so why should there
> be a difference?
The problem is in the histograms, which are collected per column, not per
index. You don't have histograms on lower(col1) unless it's in a
separate column. Without histograms, the optimizer can only make an
assumption, usually the uniform distribution of values, and use the
number of keys in the index divided by the number of rows in the table
for the selectivity estimate. That is not necessarily accurate and can
lead to bad performance. If the values in the "citext" column are skewed,
there are many people named "Smith" in the population, you will get an
index search when that is inappropriate.
> The problem is in the histograms, which are collected per column, not per
> index. You don't have histograms on lower(col1) unless it's in a
> separate column.
Why do you claim that without a test?
This is PostgreSQL 9.1:
CREATE TABLE test(id serial primary key, val varchar(20) NOT NULL);
DO $$BEGIN
FOR i IN 1..10000 LOOP
INSERT INTO test(val) VALUES (substr(random()::text, 3));
END LOOP;
END;$$;
CREATE INDEX test_val_ind ON test(lower(val));
SHOW default_statistics_target;
default_statistics_target
---------------------------
100
(1 row)
ANALYZE test;
SELECT array_length(histogram_bounds, 1) FROM pg_stats
WHERE tablename = 'test_val_ind';
array_length
--------------
101
(1 row)
Yours,
Laurenz Albe
>
> Why do you claim that without a test? This is PostgreSQL 9.1:
>
> CREATE TABLE test(id serial primary key, val varchar(20) NOT NULL);
>
> DO $$BEGIN
> FOR i IN 1..10000 LOOP
> INSERT INTO test(val) VALUES (substr(random()::text, 3));
> END LOOP;
> END;$$;
>
> CREATE INDEX test_val_ind ON test(lower(val));
>
> SHOW default_statistics_target;
>
> default_statistics_target
> ---------------------------
> 100
> (1 row)
>
> ANALYZE test;
>
> SELECT array_length(histogram_bounds, 1) FROM pg_stats
> WHERE tablename = 'test_val_ind';
>
> array_length
> --------------
> 101
> (1 row)
>
> Yours,
> Laurenz Albe
I don't understand? What does this prove? That there is a histogram on
lower(val)? How does your example prove that?
>> CREATE INDEX test_val_ind ON test(lower(val));
>> SELECT array_length(histogram_bounds, 1) FROM pg_stats
>> WHERE tablename = 'test_val_ind';
>>
>> array_length
>> --------------
>> 101
>> (1 row)
> I don't understand? What does this prove? That there is a histogram on
> lower(val)? How does your example prove that?
I thought that was clear, sorry.
There is a histogram in the index statistics, and the index is on lower(val).
So yes, there is a histogram on lower(val), used whenever the planner
considers the index.
The index has only one "column", namly lower(val).
SELECT attname FROM pg_stats WHERE tablename = 'test_val_ind';
attname
---------
lower
(1 row)
pg_statistic (and its derived view pg_stats) has entries for all
indexable columns and index espressions.
Yours,
Laurenz Albe
Hmmm yes, you're right, I ran your test on 9.0.2, the version available
for Fedora 14 and it does have histogram on index, in additon to the
histogram on the column itself. I must admit I didn't think of using the
index name as a table name in pg_stats. Oracle histograms are related to
table columns, not to indexes, which seems logical to me.
mgogala=# CREATE TABLE test(id serial primary key, val varchar(20) NOT
NULL);
NOTICE: CREATE TABLE will create implicit sequence "test_id_seq" for
serial column "test.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey"
for table "test"
CREATE TABLE
mgogala=# DO $$BEGIN
mgogala$# FOR i IN 1..10000 LOOP
mgogala$# INSERT INTO test(val) VALUES (substr(random()::text, 3));
mgogala$# END LOOP;
mgogala$# END;$$;
DO
mgogala=#
mgogala=# CREATE INDEX test_val_ind ON test(lower(val));
CREATE INDEX
mgogala=# ANALYZE test;
ANALYZE
mgogala=# SELECT array_length(histogram_bounds, 1) FROM pg_stats
mgogala-# WHERE tablename = 'test_val_ind';
array_length
--------------
101
(1 row)
mgogala=# SELECT attname FROM pg_stats WHERE tablename = 'test_val_ind';
attname
---------
lower
(1 row)
mgogala=# SELECT array_length(histogram_bounds, 1) FROM pg_stats
mgogala-# where tablename='test' and attname='val';
array_length
--------------
101
(1 row)
mgogala=# select version();
version
--------------------------------------------------------------------------------
---------------------------
PostgreSQL 9.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.5.1
20100924
(Red Hat 4.5.1-4), 32-bit
(1 row)