Full text search performance with Postgres


David Huang

Jan 4, 2010, 2:42:46 PM1/4/10
to trac-...@googlegroups.com
Hi, I had been running Trac 0.11 with a PostgreSQL database for a while, and recently upgraded it to 0.11.6. After the upgrade, the full text search is much slower than it was in 0.11, and while it's searching, Postgres is using all the CPU. I believe this is due to the changes for ticket #2530, which changed the search to include ticket custom fields. I noticed the ticket_custom table didn't have any indexes, so I added one on the ticket column, but that didn't seem to help any. I'm no Postgres guru, so adding the index is all I could think of. Does anyone have any ideas on how to speed up the search? A search for a single word used to take a few seconds with 0.11; it now takes over a minute.

Philip Semanchuk

Jan 4, 2010, 3:34:05 PM1/4/10
to trac-...@googlegroups.com

Hi David,
I could be wrong, but this is more of a generic SQL problem than a
Postgres problem. You need to find out what columns are being accessed
during a full text search in order to decide where an index might
help. You might find that easiest to do with Trac.

Alternatively (and here's where it gets Postgres-specific), you could
turn on Postgres logging via postgresql.conf with the log_statement
param set to "all". You'll need to set some other params, too, like
log_destination, etc. My Postgres skills are a bit rusty and I don't
have this stuff in my head anymore, so check the manual for details.
Setting log_statement=all will cause every query executed to be
written to the log. This will kill your performance, so you don't want
to leave this on any longer than you have to.
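As a rough sketch, the relevant postgresql.conf settings look something like this (parameter names as I remember them from the 8.x manual; double-check against the docs for your version):

```
# postgresql.conf -- temporary settings to capture every SQL statement.
log_destination = 'stderr'   # where log output is sent
logging_collector = on       # collect stderr into log files (8.3+)
log_directory = 'pg_log'     # log file location, relative to the data dir
log_statement = 'all'        # log every statement -- slow, revert when done
```

Reload or restart Postgres after changing these, and remember to set log_statement back to its default once you've captured the query.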

If you log all queries, you'll get to see the exact SQL that's
executed when you do a full-text search. Copy and paste that SQL into
a text file, and then use psql (Postgres' interactive command line) to
play around with it. You can add an index to a column and re-run the
query to see if the run time improves. (Beware of caching which can
skew your results.)
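That experiment loop in psql might look like the following sketch (the index name here is made up for illustration; substitute the actual logged query):

```sql
-- Time the query as-is. Run it a few times: the first run pays the
-- cost of cold caches and will look artificially slow.
EXPLAIN ANALYZE
SELECT ... ;   -- paste the logged search query here

-- Try an index on a candidate column, then re-run the EXPLAIN ANALYZE
-- and compare the "Total runtime" lines.
CREATE INDEX ticket_custom_ticket_idx ON ticket_custom (ticket);

-- Drop it again if it doesn't help.
DROP INDEX ticket_custom_ticket_idx;
```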


Hope this helps
Philip

David Huang

Jan 4, 2010, 6:42:36 PM1/4/10
to trac-...@googlegroups.com

On Jan 4, 2010, at 2:34 PM, Philip Semanchuk wrote:
> Hi David,
> I could be wrong, but this is more of a generic SQL problem than a
> Postgres problem. You need to find out what columns are being accessed
> during a full text search in order to decide where an index might
> help. You might find that easiest to do with Trac.

The SQL being executed is in the Trac source at: http://trac.edgewall.org/browser/branches/0.11-stable/trac/ticket/web_ui.py?rev=7569#L202

Enabling Postgres logging gives me this as the actual statement (for a search for "added"):

SELECT DISTINCT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
FROM ticket a
LEFT JOIN ticket_change b ON a.id = b.ticket
LEFT OUTER JOIN ticket_custom c ON (a.id = c.ticket)
WHERE (b.field='comment' AND (b.newvalue ILIKE E'%added%' ESCAPE '/')) OR
(summary ILIKE E'%added%' ESCAPE '/' OR keywords ILIKE E'%added%' ESCAPE '/' OR description ILIKE E'%added%' ESCAPE '/' OR reporter ILIKE E'%added%' ESCAPE '/' OR cc ILIKE E'%added%' ESCAPE '/' OR CAST(id AS text) ILIKE E'%added%' ESCAPE '/') OR
(c.value ILIKE E'%added%' ESCAPE '/')

And an EXPLAIN ANALYZE of the query plan shows:

Unique (cost=18018.72..18270.50 rows=6525 width=414) (actual time=63595.824..64390.470 rows=931 loops=1)
-> Sort (cost=18018.72..18046.70 rows=11190 width=414) (actual time=63595.812..64217.818 rows=27921 loops=1)
Sort Key: a.summary, a.description, a.reporter, a.type, a.id, a."time", a.status, a.resolution
Sort Method: external merge Disk: 18072kB
-> Hash Left Join (cost=852.02..15159.20 rows=11190 width=414) (actual time=125.952..56215.314 rows=27921 loops=1)
Hash Cond: (a.id = c.ticket)
Filter: (((b.field = 'comment'::text) AND (b.newvalue ~~* '%added%'::text)) OR (a.summary ~~* '%added%'::text) OR (a.keywords ~~* '%added%'::text) OR (a.description ~~* '%added%'::text) OR (a.reporter ~~* '%added%'::text) OR (a.cc ~~* '%added%'::text) OR ((a.id)::text ~~* '%added%'::text) OR (c.value ~~* '%added%'::text))
-> Merge Left Join (cost=0.00..5291.49 rows=46973 width=535) (actual time=0.165..466.826 rows=48773 loops=1)
Merge Cond: (a.id = b.ticket)
-> Index Scan using ticket_pkey on ticket a (cost=0.00..1199.98 rows=6525 width=416) (actual time=0.082..30.423 rows=6510 loops=1)
-> Index Scan using ticket_change_ticket_idx on ticket_change b (cost=0.00..3488.03 rows=46973 width=123) (actual time=0.061..200.220 rows=46971 loops=1)
-> Hash (cost=475.34..475.34 rows=30134 width=6) (actual time=125.049..125.049 rows=30155 loops=1)
-> Seq Scan on ticket_custom c (cost=0.00..475.34 rows=30134 width=6) (actual time=0.048..59.032 rows=30155 loops=1)
Total runtime: 64486.821 ms

I don't really know how to interpret that, but it's not using the index I added on ticket_custom.ticket. I expected it to, since the query is joining ticket.id = ticket_custom.ticket.

Philip Semanchuk

Jan 4, 2010, 10:31:14 PM1/4/10
to trac-...@googlegroups.com

On Jan 4, 2010, at 6:42 PM, David Huang wrote:

>
> On Jan 4, 2010, at 2:34 PM, Philip Semanchuk wrote:
>> Hi David,
>> I could be wrong, but this is more of a generic SQL problem than a
>> Postgres problem. You need to find out what columns are being
>> accessed
>> during a full text search in order to decide where an index might
>> help. You might find that easiest to do with Trac.
>
> The SQL being executed is in the Trac source at: http://trac.edgewall.org/browser/branches/0.11-stable/trac/ticket/web_ui.py?rev=7569#L202
>
> Enabling Postgres logging gives me this as the actual statement (for
> a search for "added"):

Good on ya for finding the query and digging into the Postgres
logging. The EXPLAIN ANALYZE is helpful too, although reading the
output is an art at which I never surpassed the level of novice.
Still, I can make a few stabs at what you sent, but take my
observations with the same credibility that you'd lend to Sir
Bedevere's observation, "...and that, my liege, is how we know the
Earth to be banana shaped".

First of all, your sort is happening on disk ("Sort Method: external
merge Disk: 18072kB"). It looks like that's about 11% of your time.
You can increase the work_mem parameter (I think) to try to get that
sort to happen in memory. The disk sort might have been happening
before, though, so we can't point to it as the perpetrator of the
sudden performance decrease.
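A sketch of trying that from psql (the 32MB figure is an arbitrary guess; the sort spilled about 18MB to disk, so anything comfortably above that ought to be enough):

```sql
-- Raise work_mem for this session only, then re-run the query and
-- check whether the "external merge" sort moves into memory.
SET work_mem = '32MB';
EXPLAIN ANALYZE
SELECT DISTINCT a.summary, a.description, a.reporter,
       a.type, a.id, a.time, a.status, a.resolution
FROM ticket a
LEFT JOIN ticket_change b ON a.id = b.ticket
LEFT OUTER JOIN ticket_custom c ON (a.id = c.ticket)
WHERE ...;   -- same WHERE clause as the logged query above
```

Because SET only affects the current session, this is a safe way to experiment before touching postgresql.conf.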

I'm not sure about the use of the index on ticket_custom.ticket.
Postgres is using a hash or building one on the fly ("Hash
(cost=475.34..475.34 rows=30134 width=6...") which I guess would allow
for O(1) lookups in ticket_custom. Postgres might decide that's
cheaper than using the index, especially if the hash can be
constructed in memory rather than loading from disk the pages that
describe the index on that table.

Nevertheless, if I read this correctly (and I'm really not confident
of that, in case I haven't already overstated that point), the
hash/index on ticket_custom doesn't really matter. The query is spending
most of the time that's not lost to the sort in the hash left join
("Hash Left Join (cost=852.02..15159.20 rows=11190 width=414) (actual
time=125.952..56215.314"). If you add up the "actual time" of the
items after that, they're a pittance compared to the 56215ms. I think
"actual time" is supposed to be multiplied by the number of loops, but
in each case loops==1.

My guess, then, is that the Filter step chews up the lion's share of
the time, and to me that makes intuitive sense. I don't think there's
a way for any database to shortcut this -- the rows have to be loaded
from disk and searched for the string in question. Note that six
fields from two tables are ORed:

(a.summary ~~* '%added%'::text) OR
(a.keywords ~~* '%added%'::text) OR
(a.description ~~* '%added%'::text) OR
(a.reporter ~~* '%added%'::text) OR
(a.cc ~~* '%added%'::text) OR
((a.id)::text ~~* '%added%'::text) OR
(c.value ~~* '%added%'::text)

Postgres has to search each of these columns (from two different
tables) for each row until it finds a hit. That's a lot of disk
traffic, especially if these fields are large.

So that's my hypothesis. One way to test it would be to hack
web_ui.py, revert the query in question to a code revision where you
didn't see the big performance hit, and run this again with the
EXPLAIN ANALYZE output and see how it differs from this.

Hope this helps
Philip

PS - let me know if you want to know how sheep's bladders can be used
to prevent earthquakes

David Huang

Jan 5, 2010, 2:00:36 PM1/5/10
to trac-...@googlegroups.com

On Jan 4, 2010, at 9:31 PM, Philip Semanchuk wrote:
> My guess, then, is that the Filter step chews up the lion's share of
> the time, and to me that makes intuitive sense. I don't think there's
> a way for any database to shortcut this -- the rows have to be loaded
> from disk and searched for the string in question. Note that six
> fields from two tables are ORed:

Ah, OK... yeah, I think I see what's going on now. Before ticket_custom was joined in, it was only doing the expensive "ILIKE" substring search on some semi-reasonable number of rows (in my DB, about 49000). After adding the JOIN to ticket_custom, it needs to search 345000 rows, even though the bulk of the ILIKEs are on fields from the main ticket table (which only has 6500 rows). The columns from the ticket table are all the same for the same ticket.id, so it's searching a lot more tickets than it needs to.
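You can see that blow-up directly with a quick count (a sketch; on the data described above it should come out around the 345,000 rows mentioned):

```sql
-- Each (ticket, change) pair is further multiplied by that ticket's
-- custom-field rows, so the ILIKE filter runs over far more rows
-- than exist in any one of the three tables.
SELECT count(*)
FROM ticket a
LEFT JOIN ticket_change b ON a.id = b.ticket
LEFT OUTER JOIN ticket_custom c ON (a.id = c.ticket);
```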

It seems like a more efficient way to do the query would be to do separate SELECTs for the three tables that are being searched, and UNION the results together. E.g.,

SELECT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
FROM ticket a
WHERE (summary ILIKE E'%added%' ESCAPE '/' OR keywords ILIKE E'%added%' ESCAPE '/' OR description ILIKE E'%added%' ESCAPE '/' OR reporter ILIKE E'%added%' ESCAPE '/' OR cc ILIKE E'%added%' ESCAPE '/' OR CAST(id AS text) ILIKE E'%added%' ESCAPE '/')
UNION
SELECT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
FROM ticket a
INNER JOIN ticket_change b ON a.id = b.ticket
WHERE (b.field='comment' AND (b.newvalue ILIKE E'%added%' ESCAPE '/'))
UNION
SELECT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
FROM ticket a
INNER JOIN ticket_custom c ON (a.id = c.ticket)
WHERE (c.value ILIKE E'%added%' ESCAPE '/')

An EXPLAIN ANALYZE on that gives:

HashAggregate (cost=4220.53..4229.10 rows=857 width=414) (actual time=2395.736..2397.835 rows=931 loops=1)
-> Append (cost=0.00..4203.39 rows=857 width=414) (actual time=5.265..2382.928 rows=1432 loops=1)
-> Seq Scan on ticket a (cost=0.00..594.74 rows=258 width=414) (actual time=5.262..707.284 rows=316 loops=1)
Filter: ((summary ~~* '%added%'::text) OR (keywords ~~* '%added%'::text) OR (description ~~* '%added%'::text) OR (reporter ~~* '%added%'::text) OR (cc ~~* '%added%'::text) OR ((id)::text ~~* '%added%'::text))
-> Hash Join (cost=898.55..3041.12 rows=598 width=414) (actual time=64.893..1497.745 rows=1116 loops=1)
Hash Cond: (b.ticket = a.id)
-> Seq Scan on ticket_change b (cost=0.00..1774.60 rows=598 width=4) (actual time=0.070..1401.898 rows=1116 loops=1)
Filter: ((newvalue ~~* '%added%'::text) AND (field = 'comment'::text))
-> Hash (cost=463.58..463.58 rows=6558 width=414) (actual time=64.695..64.695 rows=6511 loops=1)
-> Seq Scan on ticket a (cost=0.00..463.58 rows=6558 width=414) (actual time=0.013..20.594 rows=6511 loops=1)
-> Nested Loop (cost=0.00..558.96 rows=1 width=414) (actual time=174.906..174.906 rows=0 loops=1)
-> Seq Scan on ticket_custom c (cost=0.00..550.67 rows=1 width=4) (actual time=174.900..174.900 rows=0 loops=1)
Filter: (value ~~* '%added%'::text)
-> Index Scan using ticket_pkey on ticket a (cost=0.00..8.27 rows=1 width=414) (never executed)
Index Cond: (a.id = c.ticket)
Total runtime: 2399.341 ms

I think that gives equivalent results (the number of rows is the same in my case), while only taking a bit over 2 seconds, vs. 64 seconds with the previous query. Does changing the query to do UNIONs sound like a good idea as a general patch to Trac? While I don't think UNIONs are obscure, I don't actually know whether the various DBs that Trac supports handle them well.

Christian Boos

Jan 5, 2010, 2:15:37 PM1/5/10
to trac-...@googlegroups.com
On 1/5/2010 8:00 PM, David Huang wrote:
> On Jan 4, 2010, at 9:31 PM, Philip Semanchuk wrote:
>
>> My guess, then, is that the Filter step chews up the lion's share of
>> the time, and to me that makes intuitive sense. I don't think there's
>> a way for any database to shortcut this -- the rows have to be loaded
>> from disk and searched for the string in question. Note that six
>> fields from two tables are ORed:
>>
> Ah, OK... yeah, I think I see what's going on now. Before ticket_custom was joined in, it was only doing the expensive "ILIKE" substring search on some semi-reasonable number of rows (in my DB, about 49000). After adding the JOIN to ticket_custom, it needs to search 345000 rows, even though the bulk of the ILIKEs are on fields from the main ticket table (which only has 6500 rows). The columns from the ticket table are all the same for the same ticket.id, so it's searching a lot more tickets than it needs to.
>
> It seems like a more efficient way to do the query would be to do separate SELECTs for the three tables that are being searched, and UNION the results together. E.g.,
>
> SELECT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
> FROM ticket a
> WHERE (summary ILIKE E'%added%' ESCAPE '/' OR keywords ILIKE E'%added%' ESCAPE '/' OR description ILIKE E'%added%' ESCAPE '/' OR reporter ILIKE E'%added%' ESCAPE '/' OR cc ILIKE E'%added%' ESCAPE '/' OR CAST(id AS text) ILIKE E'%added%' ESCAPE '/')
> UNION
> SELECT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
> FROM ticket a
> INNER JOIN ticket_change b ON a.id = b.ticket
> WHERE (b.field='comment' AND (b.newvalue ILIKE E'%added%' ESCAPE '/'))
> UNION
> SELECT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
> FROM ticket a
> INNER JOIN ticket_custom c ON (a.id = c.ticket)
> WHERE (c.value ILIKE E'%added%' ESCAPE '/')
>
> An EXPLAIN ANALYZE on that gives:
>
> ...

> Total runtime: 2399.341 ms
>
> I think that gives equivalent results (the number of rows is the same in my case), while only taking a bit over 2 seconds, vs. 64 seconds with the previous query. Does changing the query to do UNIONs sound like a good idea as a general patch to Trac? While I don't think UNIONs are obscure, I don't actually know whether the various DBs that Trac supports handle them well.

The UNION solution would be fine, thanks for going through this issue!

Another solution you could investigate could be along the lines of:

SELECT a.summary,a.description,a.reporter, a.type,a.id,a.time,a.status,a.resolution
FROM ticket a
WHERE a.id in (
    SELECT a.id
    FROM ticket a
    WHERE (summary ILIKE E'%added%' ESCAPE '/' OR keywords ILIKE E'%added%' ESCAPE '/' OR description ILIKE E'%added%' ESCAPE '/' OR reporter ILIKE E'%added%' ESCAPE '/' OR cc ILIKE E'%added%' ESCAPE '/' OR CAST(id AS text) ILIKE E'%added%' ESCAPE '/')
    UNION
    SELECT b.ticket
    FROM ticket_change b
    WHERE (b.field='comment' AND (b.newvalue ILIKE E'%added%' ESCAPE '/'))
    UNION
    SELECT c.ticket
    FROM ticket_custom c
    WHERE (c.value ILIKE E'%added%' ESCAPE '/')
)

(untested)

-- Christian


David Huang

Jan 5, 2010, 9:59:45 PM1/5/10
to trac-...@googlegroups.com

On Jan 5, 2010, at 1:15 PM, Christian Boos wrote:
> Another solution you could investigate could be along the lines of:

That works too, and seems just as efficient. I've created a ticket with a patch: #8935

Thanks :)

CM Lubinski

Jan 6, 2010, 10:35:32 AM1/6/10
to Trac Users
Excellent work, David and Christian!

That's something our organization has struggled with but never had
enough time to flesh out.

Thank you very much!
