MySQL issues with long multiple keyword queries

David Dwiggins

unread,

Jun 30, 2010, 10:35:01 PM6/30/10

to ResourceSpace

Every so often I'm getting a query that just hammers my MySQL database, and basically stops ResourceSpace in its tracks. I wasn't really sure what was generating them, but a while ago I was able to catch one in the act. It looked like this (spaced out for legibility):

select distinct k1.hit_count+k2.hit_count+k3.hit_count+k4.hit_count+k5.hit_count+k6.hit_count+k7.hit_count+k8.hit_count+k9.hit_count+k10.hit_count+k11.hit_count+k12.hit_count+k13.hit_count score, r.ref, r.resource_type, r.has_image, r.is_transcoding, r.hit_count, r.creation_date, r.rating, r.user_rating, r.user_rating_count, r.user_rating_total, r.file_extension, r.preview_extension, r.image_red, r.image_green, r.image_blue, r.thumb_width, r.thumb_height, r.archive, r.access, r.colour_key, r.created_by, r.file_modified, r.file_checksum, r.request_count, r.new_hit_count, r.expiry_notification_sent, r.preview_tweaks, r.file_path ,null group_access, null user_access ,r.field12 ,r.field8 ,r.field3 ,r.field88 ,r.field98 ,r.field73
from resource r
join resource_keyword k1 on k1.resource=r.ref and (k1.keyword='75214' )
join resource_keyword k2 on k2.resource=r.ref and (k2.keyword='75215' )
join resource_keyword k3 on k3.resource=r.ref and (k3.keyword='74796' )
join resource_keyword k4 on k4.resource=r.ref and (k4.keyword='75052' )
join resource_keyword k5 on k5.resource=r.ref and (k5.keyword='75216' )
join resource_keyword k6 on k6.resource=r.ref and (k6.keyword='75217' )
join resource_keyword k7 on k7.resource=r.ref and (k7.keyword='75218' )
join resource_keyword k8 on k8.resource=r.ref and (k8.keyword='75052' )
join resource_keyword k9 on k9.resource=r.ref and (k9.keyword='75219' )
join resource_keyword k10 on k10.resource=r.ref and (k10.keyword='75141' )
join resource_keyword k11 on k11.resource=r.ref and (k11.keyword='74732' )
join resource_keyword k12 on k12.resource=r.ref and (k12.keyword='74733' )
join resource_keyword k13 on k13.resource=r.ref and (k13.keyword='74669' )
where resource_type in (1,2,3,4,5,6) and archive='0' and r.ref>0
group by r.ref
order by score DESC, user_rating DESC, hit_count DESC, field12 DESC,r.ref DESC
limit 50000

Running this query will effectively hang my system for long periods of time.

By working backward through the list of keywords, I was able to figure out the search terms, which were

"keyword"
"mass"
"st"
"dorchester"
"front"
"columbia"
"hancock"
"carriages"
"road"
"stables"
"525-527"
"rd"

I tried a simpler search based on these terms (Columbia Hancock Carriages, I think), and discovered that we had a resource with the caption:

"Horses and carriages in front of Columbia Road Stables, 525-527 Columbia Rd. at Hancock St., Dorchester, Mass."

Presumably someone had this complete title, and dropped it into the search box. I just tried that, and sure enough it killed the system again.

I'm thinking maybe we need to do some work on optimizing the way the system handles searches for many keywords at once. I think the number of joins is killing it, and the fact that it has to do a group by/distinct operation at the same time probably isn't helping. I wonder if the temporary table method I have applied successfully in special search situations could be used more generally here. Or some other method of doing this search in multiple steps rather than all at once.

This also brings me back to the question of whether it would make sense to integrate a full text search engine like Lucene/Solr (or the PHP Lucene port), since I think these problems have already been solved over there. (That would also help move toward other search features that people have been longing for, like word stemming,etc.)

In any case, no immediate solution, but figured I'd share my detective work in case anyone else had feedback.

-David

Tom Gleason

unread,

Jun 30, 2010, 10:46:11 PM6/30/10

to resour...@googlegroups.com

Hi David,

The many-keyword problem has been shared on this list before but I haven't seen it actually happen.

how big is your resource_keyword table?

Tom

--
You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To post to this group, send email to resour...@googlegroups.com.
To unsubscribe from this group, send email to resourcespac...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/resourcespace?hl=en.

--
Tom Gleason, PHP Developer
DBA Impressive Design

Exploring ResourceSpace at:
http://resourcespace.blogspot.com

Dan Huby

unread,

Jul 1, 2010, 7:38:09 AM7/1/10

to ResourceSpace

> I'm thinking maybe we need to do some work on optimizing the way the system
> handles searches for many keywords at once.

It's a long standing issue and I've personally been unable to resolve
it, except for to come up with a temporary solution, i.e. the script /
pages/tools/mysql_timeout.php which kills queries that are taking a
long time to execute - the user then gets an error message. This is
set up as a cron job, i.e.:

* * * * * cd /var/www/pages/tools; php mysql_timeout.php

The problem only seems to affect some of the systems, and for those
systems queries that cause the issue are usually quite rare. I think
it might be when several keywords that are widely used (maybe in a
large percentage of all resources) are strung together in a free text
search, causing an exponential growth of the number of rows in the
join. But MySQL should be able to handle that.

I find it odd that queries go from being near-instantaneous to never
completing, with nothing in between.

It's also odd that it only affects free text searches and you can
build up extensive queries using several widely used keywords using
Advanced Search with no issues at all.

Someone I spoke to had a potential solution which might be useful,
which is to rearrange the join so the least-used (i.e. most
restrictive) keywords are included first. In theory this means the
remaining joins would have less to join with - we'd have reduced the
number of resources significantly early on. This assumes that the
order of the joins is important to MySQL which might not be the case.
It seems like the sort of optimisation MySQL should be doing itself,
or may already be doing. And it'd require quite a rewrite of the
do_search() function just to try it out, plus working out which
keywords are the most / least used will add some overhead (although
maybe we could return hit_count from the keyword table when resolving
keywords and piggy-back something already being done).

The do_search() function has remained largely the same since version
1.0 and I think could definitely use a fresh look by someone other
than myself.

David Dwiggins

unread,

Jul 1, 2010, 8:10:41 PM7/1/10

to resour...@googlegroups.com

Tom - we have about 269,000 entries in our keywords table, and about 3.6 million in our resource_keyword table.

I agree that this problem is not frequent, and the system works great for normal (2-3 word) searches. But it's a problem because it seems to get in the way of other queries. Not sure why this is, or if there might be a way to configure MySQL to multitask better.

In any case, as Dan implies, I think this is likely due to poor query optimization in MySQL -- just like subqueries are basically unusable, I think the system may not be smart enough to understand how to more quickly narrow down the search.

I poked around a bit, and it appears that running "analyze table" on a table can help MySQL make better decisions about the order to join things in. I tried this on the resource_keyword table, and it helped a simpler query quite a bit, but the one referenced in this thread is still slow to the point of unusability. (There seems to be a tipping point -- when I start stripping off terms, it begins to work around 7 or 8, but it increases rapidly.)

I will keep this in the back of my mind -- I suspect reading up on MySQL query optimization would be helpful.

-David

David Dwiggins

unread,

Jul 1, 2010, 8:28:13 PM7/1/10

to resour...@googlegroups.com

Another clue: when I run the original query and then run a "show full processlist" command, the query appears to be sitting in the "statistics" state. According to the MySQL documentation, this means that "The server is calculating statistics to develop a query execution plan. If a thread is in this state for a long time, the server is probably disk-bound performing other work."

So I'm wondering if the problem isn't really executing the query itself so much as trying to figure out what order it should do the joins in for optimal performance. If that's the case, I wonder if there is indexing or analysis that can be done so that MySQL doesn't spend so much time on this.

Hmm...

-DD

Dan Huby

unread,

Jul 2, 2010, 9:04:12 AM7/2/10

to ResourceSpace

Some possible directions for investigation.

- Are we all using the MyISAM storage engine? Does InnoDB perform
better?

- We could reduce the index size on resource_keywords by setting a
prefix limit, i.e. index by the first 5 characters (or so) only. You
normally only use prefix limits for blobs but they work for varchars
too. This may speed up the join. It shouldn't be significantly slower
when looking up a keyword as there should normally only be a handful
of keywords with the first five characters.

- As I mentioned above I think intelligently reordering the
resource_keyword joins may be a potential solution, assuming MySQL is
failing to do this internally. This probably wouldn't be a complete
fix as using several very widely used keywords _alone_ (with no rarely
used keyword to restrict the set) would still cause a problem.

Jeff Harmon

unread,

Jul 14, 2010, 11:11:37 PM7/14/10

to ResourceSpace

We are hiring a MySQL performance specialist to look at queries,
schema, and set-up, and will gladly share the results of their audit.
We hope to get movement on this over the next couple weeks.

- Jeff Harmon
Colorhythm LLC

Jeff Harmon

unread,

Jul 14, 2010, 11:15:24 PM7/14/10

to ResourceSpace

forgive my ignorance, but is:

* * * * * cd /var/www/pages/tools; php mysql_timeout.php

correct? i'm confused why the "cd" command is used rather than "wget"
in this context

also, why does the path have "; php " in it? is that right? we are
trying to set this up post haste, as our queries sometimes bring the
entire system down for everyone!

- Jeff Harmon

Tom Gleason

unread,

Jul 14, 2010, 11:27:34 PM7/14/10

to resour...@googlegroups.com

* * * * * cd /var/www/pages/tools; php mysql_timeout.php

I think this means that every minute cron Changes Directory to tools and then runs the timeout script?

--
You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To post to this group, send email to resour...@googlegroups.com.
To unsubscribe from this group, send email to resourcespac...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/resourcespace?hl=en.

Tom Gleason

unread,

Jul 14, 2010, 11:33:36 PM7/14/10

to resour...@googlegroups.com

wget is for retrieving files from the internet, cron runs local programs on a periodic basis
http://www.linuxweblog.com/crotab-tutorial

so this cron job is two commands, cd to tools, then run the script via php on the command line.

I would think it could also be set up as
* * * * * php /var/www/resourcespace/pages/tools/mysql_timeout.php or similar.
(not sure about that)

but in any case, [/var/www/pages/tools] should be the path on the server to the mysql_timeout.php script.

Tom Gleason

unread,

Jul 14, 2010, 11:41:34 PM7/14/10

to resour...@googlegroups.com

For people like me used to using VIM:

export VISUAL=vim

changes the editor to VIM before running 
crontab -e

Dan Huby

unread,

Jul 15, 2010, 6:50:52 AM7/15/10

to ResourceSpace

> I would think it could also be set up as
> * * * * * php /var/www/resourcespace/pages/tools/mysql_timeout.php or
> similar.
> (not sure about that)

I think the reason I didn't do it this way was that file paths are
relative to the current working directory when running PHP from the
command line, so you have to change directory first otherwise some
parts of ResourceSpace might look for files in the wrong place.

Dan

Jeff Harmon

unread,

Jul 15, 2010, 6:35:07 PM7/15/10

to ResourceSpace

Thank you, gentlemen! Will implement post haste. Apparently we are
topping 15,000,000 entries in our keyword table...

- Jeff Harmon
Colorhythm LLC

Dan Huby

unread,

Jul 15, 2010, 8:02:22 PM7/15/10

to ResourceSpace

On 15 July, 19:35, Jeff Harmon <jeffreyhhar...@gmail.com> wrote:
> Thank you, gentlemen! Will implement post haste. Apparently we are
> topping 15,000,000 entries in our keyword table...

keyword or resource_keyword?

If the keyword table is this big it is probably due to prefix/infix
searching being enabled.

Tom Gleason

unread,

Jul 15, 2010, 8:23:37 PM7/15/10

to resour...@googlegroups.com

no, that's the resource_keyword table, but it is because of partial indexing. One partial indexed field is proving an 80/20 rule.

I'm looking into it. I think it could have a lot to do with not including underscores as config separators, which might help in chunking it down a bit more.

In any case, after some cost/benefit analysis, I'm hesitant to recommend partial indexing without very careful consideration.

--
You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To post to this group, send email to resour...@googlegroups.com.
To unsubscribe from this group, send email to resourcespac...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/resourcespace?hl=en.

Jeff Harmon

unread,

Jul 16, 2010, 9:57:59 PM7/16/10

to ResourceSpace

if we were to revamp the search system with lucent or sphinx, would
that presumably solve these sorts of problems? i can't tell if that's
a naive notion, because enormous tables are enormous tables under
anyone's search algorithm.

- Jeff Harmon
Colorhythm LLC

Tom Gleason

unread,

Jul 16, 2010, 10:47:11 PM7/16/10

to resour...@googlegroups.com

It's a possibility, perhaps, but it is not a simple mod. I'm not familiar with the alternative search engines so maybe Dwiggins can chime in about the possibility.

Whatever might be done there should also definitely not be a requirement, because the system works well as it is for most people and your odd configuration (not using "_" as a config separator) still needs to be reconsidered.

--

You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To post to this group, send email to resour...@googlegroups.com.
To unsubscribe from this group, send email to resourcespac...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/resourcespace?hl=en.

Jeff Harmon

unread,

Jul 17, 2010, 12:52:07 AM7/17/10

to ResourceSpace

i'll be returning the underscore as a separator, but the build has
60,000 resources, and we need to be able to accommodate far more than
that. dwiggins is also having some issues with a large resource set,
and our search engine doesn't handle full text search, stemming, etc.
so if migrating to a more robust search engine not only addresses some
(presently rare) scalability issues, but also offers more of the kinds
of results expected by an increasingly savvy and google-cozy userbase,
it would be worth it.

- Jeff Harmon
Colorhythm LLC

On Jul 16, 3:47 pm, Tom Gleason <theorysav...@gmail.com> wrote:
> It's a possibility, perhaps, but it is not a simple mod. I'm not familiar
> with the alternative search engines so maybe Dwiggins can chime in about the
> possibility.
>
> Whatever might be done there should also definitely not be a requirement,
> because the system works well as it is for most people and your odd
> configuration (not using "_" as a config separator) still needs to be
> reconsidered.
>

> On Fri, Jul 16, 2010 at 5:57 PM, Jeff Harmon <jeffreyhhar...@gmail.com>wrote:
>
>
>
> > if we were to revamp the search system with lucent or sphinx, would
> > that presumably solve these sorts of problems? i can't tell if that's
> > a naive notion, because enormous tables are enormous tables under
> > anyone's search algorithm.
>
> > - Jeff Harmon
> > Colorhythm LLC
>
> > On Jul 15, 1:23 pm, Tom Gleason <theorysav...@gmail.com> wrote:
> > > no, that's the resource_keyword table, but it is because of partial
> > > indexing. One partial indexed field is proving an 80/20 rule.
>
> > > I'm looking into it. I think it could have a lot to do with not including
> > > underscores as config separators, which might help in chunking it down a
> > bit
> > > more.
>
> > > In any case, after some cost/benefit analysis, I'm hesitant to recommend
> > > partial indexing without very careful consideration.
>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "ResourceSpace" group.
> > To post to this group, send email to resour...@googlegroups.com.
> > To unsubscribe from this group, send email to

> > resourcespac...@googlegroups.com<resourcespace%2Bunsu...@googlegroups.com>
> > .

David Dwiggins

unread,

Jul 17, 2010, 2:18:30 AM7/17/10

to resour...@googlegroups.com

I've been looking at this a little bit. I do think it might be difficult to entirely replace the SQL-based display of results, partly because of the need to support custom resource permissions. (Most search engines I'm familiar with get their performance boost by doing most of the work at index time -- but there's no way to pre-index every possible user/custom permission combination.)

That said, as Jeff noted, full text search engines like Lucene/Solr, Sphinx, etc. do offer better performance for some types of searches, and typically support advanced search features that we would take a lot of wheel reinvention to add to the ResourceSpace base.

I also agree with Tom that I think retaining the ability to run ResourceSpace "out of the box" without another software installation (such as the a java-based search tool like Solr or a standalone tool like Sphinx) is important. So I think there are two ways to do that: 1: use a search tool that is native to PHP (such as the zend Lucene port) or 2: continue to support the existing search tool in addition to any more advanced add-ons.

Option 2 is problematic, because it implies that any new search add-on (such as specialized keywords like !duplicates) would have to be implemented in both the traditional search and the full-text based search. I think that's likely to cause confusion. And even option 1 has some issues, because you still have to have a way to apply permissions within the search.

Here's another thought that might point toward a longer term solution: I've been very happy with how temporary tables have worked out for the limited uses we've put them to. (such as the duplicate search). I've been hesitant to rely on them too much, since I don't know if there might be compatibility issues with older versions of MySQL (or if we ever decided to try to support another database engine.) But they are amazingly useful, and have worked beautiful on the systems I've tested on. If we feel like they work well enough to become a standard tool, I think we could simplify the search queries by always creating search results sets in a temporary table. This would allow the segments of complex queries (like a 15 keyword query, for example) to be executed separately against an increasingly small temporary table of possible hits. I think the result would be vastly improved performance for this sort of query.

This would also open up the possibility of supporting other search tools via plugins. The plugin could be called to populate the initial temporary table with matching resources, and then the built in search function could make a final pass to make sure that all of the returned results match the security profile of the person executing the search. (Alternatively, a plugin could be called after the regular search executes to *add* things to the result set before it was displayed, which might also be useful.

Note that it would be possible to do this same thing using a regular table. But because regular tables persist between MySQL sessions, there would have to be some sort of garbage collection to get rid of result sets that had sat around too long. Temporary tables are much neater, because MySQL gets rid of them internally as soon as they are no longer needed.

I'm curious if anyone has run into problems using temporary tables so far. (Right now, when enabled, they are used for the duplicate resource search and to support improved wildcard searching.) If not, maybe this is a solution worth considering...

-David

To unsubscribe from this group, send email to resourcespac...@googlegroups.com.

Jeff Harmon

unread,

Jul 17, 2010, 10:08:04 PM7/17/10

to ResourceSpace

your logic looks sound to me. we haven't migrated the temp tables
efforts to our larger customers yet, but in our dev build it has
worked without a hitch and quite speedily. perhaps we should test it
against a very large db and see how it goes? i think you've probably
already done that. temp tables are supported in MySQL 3.23 and later,
so compatibility isn't a concern i don't think.

since we are hiring mysql performance specialists to perform an audit
of RS, maybe we should ask them their opinion on this, too, before
moving ahead? david, i'll follow up off-list with you on this.

Jeff Harmon

unread,

Jul 20, 2010, 9:07:57 PM7/20/10

to ResourceSpace

We returned the underscore to the separator list, and turned off pre/
infixing for archive contents (argh) as well as a couple of other
strays that were using it incorrectly, then reindexed, and the
resource_keyword table went from 16M entries to 3.5M! and now
performance is snappy as well. Thanks to Tom for this counsel on

Tom Gleason

unread,

May 19, 2011, 6:31:18 PM5/19/11

to resour...@googlegroups.com

I've implemented a temp table-based solution for the keyword joins.

It appears that the lack of distinct results on the resource_keyword joins was a major cause of problems. I noticed performance degrade significantly once I had a lot of large PDFs fully indexed.

On my own 9,000,000 resource_keyword table, (which is really only a couple hundred pdf books, mostly christian books), I was seeing mysql crashes for searches 3+ words such as "augustin city of god"

After the temp tables improvement, a search for

"augustin city of god All Things and All Movements of the Mind, Chose to Have the Temple of Quiet Outside the Gates"

returned results in about 2 seconds.

but...there are almost 600,000 words in this particular document. Scalability for full-text searching is obviously an issue that needs more attention in future development.

The resource_keyword table could be reduced in size if duplicate keywords for the same resource were not indexed. (I'm not sure if there is a special reason for logging them all, other than the 'position' or frequency which could eventually help with search results, but I don't think anything is implemented along those lines, and those stats could potentially be consolidated somewhere).

Anyway, for those interested in this problem, please test at r2710

There was also a complex refactor of highlighting/word wrapping to accomodate underscores much better as word breaks.

--
You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To post to this group, send email to resour...@googlegroups.com.
To unsubscribe from this group, send email to resourcespac...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/resourcespace?hl=en.

--
Tom Gleason, PHP Developer

ResourceSpace Support Services
https://www.buildadam.com

Tom Gleason

unread,

May 19, 2011, 6:50:51 PM5/19/11

to resour...@googlegroups.com

Another improvement related to full-text indexing is the improved comparison of strings when saving resource metadata.

It used to be that if you just wanted to change the title of a resource via a single Edit, the full text would be reindexed because the comparison didn't work.

Now, there should be fewer edits made and it should work much faster because of better string comparison. The most poignant example is where there is a long extracted text field.

Dan Huby

unread,

May 19, 2011, 8:31:22 PM5/19/11

to ResourceSpace

On May 19, 7:31 pm, Tom Gleason <t...@buildadam.com> wrote:
> The resource_keyword table could be reduced in size if duplicate keywords
> for the same resource were not indexed. (I'm not sure if there is a special
> reason for logging them all, other than the 'position' or frequency which
> could eventually help with search results, but I don't think anything is
> implemented along those lines, and those stats could potentially be
> consolidated somewhere).

It's for future support for exact multiple-keyword string matches
using quotes (as with Google) - e.g. "temple of quiet". The reason the
position in the sequence is logged with each resource_keyword match so
we can spot keywords that appear next to each other.

It would be fairly complicated to add that support - keyword groups of
quoted strings would need to be identified and appropriate additional
conditions added to the search query, e.g. if the first four keywords
were grouped using quotes then we'd add the condition:

and k2.position=k1.position+1 and k3.position=k2.position+1 and
k4.position=k3.position+1

For example a search for

"temple of quiet" outside "the gates"

...would need to group the first three keywords and the last two only,
so the condition becomes:

and k2.position=k1.position+1 and k3.position=k2.position+1 and
k4.position=k3.position+1 (first group - 'quiet' must appear after
'of', also 'of' must appear after 'temple')
and k5.position=k4.position+1 (second group, 'gates' must appear
after 'the')

Not sure if the 'k' aliases are right, it's been a while since I
looked at the search code, but hopefully you get the idea.

I think supporting a single set of quotes around the whole search term
would be pretty straightforward, the tough(er) bit is identifying
multiple/partial groups and applying the conditions only to the
keywords within the groups, as per the second example above.

The data is there - but not the code, so if added this would 'just
work' with everyone's existing index data.

Anyone fancy a challenge? :)

Dan

Dan Huby

unread,

May 19, 2011, 8:33:31 PM5/19/11

to ResourceSpace

On May 19, 7:31 pm, Tom Gleason <t...@buildadam.com> wrote:

> I've implemented a temp table-based solution for the keyword joins.

By the way are these improvements enabled by default?

I think where there is an obvious improvement with no drawbacks then
it should be enabled by default. David's temp tables for wildcard
searching (I think it was) may be another one to enable by default.

Dan

Tom Gleason

unread,

May 19, 2011, 8:55:10 PM5/19/11

to resour...@googlegroups.com

On my pdf search plugin ( https://www.buildadam.com/muse2/pdfsearch/ in development, looking for funding still, and there are demos on https://buidadam.com/resourcespace )...

I support contiguous keywords to narrow the result to several words in order. I don't use the same keyword table as RS, though, currently, but would like to, and the fact that there is a 'position' on resource keywords could help me out.

I basically just relied upon the insertion position, since all the OCR'd or pdftotext'd keywords are inserted with LOAD DATA LOCAL INFILE and are therefore in order.

So, this is a problem I'm likely to be looking at.

The improvements to RS require $use_temp_tables = true; and this is off by default, because I think David thought his work was too experimental at the time. All temp table experiments are controlled by this config.
I think it works pretty well; but possibly too early to tell since I just did it this morning.

The development definitely adds a large amount of headroom (unless I've overlooked something) and I think there could be more room for improvement, since there are still limits that go up rapidly at a certain point, but they are much higher now.

Tom

--

You received this message because you are subscribed to the Google Groups "ResourceSpace" group.
To post to this group, send email to resour...@googlegroups.com.
To unsubscribe from this group, send email to resourcespac...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/resourcespace?hl=en.

Tom Gleason

unread,

May 19, 2011, 9:21:12 PM5/19/11

to resour...@googlegroups.com

Sorry, the link to my demo site was spelled incorrectly, so I'm reposting that.

https://buildadam.com/resourcespace

Tom Gleason

unread,

May 19, 2011, 9:57:16 PM5/19/11

to resour...@googlegroups.com

I've been reading a lot about NoSQL databases lately, and there is a section in the O'Reilly book on Cassandra that was particularly interesting.

Basically, it said that denormalized data is considered 'normal', since it is the joins in a traditional RDBMS-based application that cause scalability problems, and you design your tables based on the queries you need to achieve quickly. It's not that a different DBMS is needed, but the strategy of denormalizing and accepting that consistency is going to be the responsibility of the application. More writes, but faster reads. It seems like the big data applications use this strategy, and just spend more time processing the data to be able to have it at hand in specially-designed tables for quick access without joins.

I've wondered if denormalizing the keywords somehow could provide more scalability (especially when the time comes for contiguous word searches). Not that I have an idea about how to achieve it, but based on experience, some denormalization strategy seems like it would be necessary.

In my pdfsearch plugin, I haven't seen the same scalability issues, and I store all the keywords in the same table as the data about those keywords; meaning, I don't care that the same keyword is in the table many times.

I have a table that stores the page sizes of each page for each pdf, and then a pdf_words table that stores the resource ref, keyword, and coordinates. It's not exactly the same problem, though.

Denormalization was the solution for another scalability issue we had: configurable display fields, where we were able to remove an arbitrary number of joins from the search (and remember that searches are running all the time in RS...in the background for autocompletes, in the collection panel, on every search page turn...) by duplicating some into the resource table itself; then, there were only a few joins to do for permissions and such. There is a lot of headroom and query time saved based on that development.

Tom Gleason

unread,

May 20, 2011, 8:14:13 PM5/20/11

to resour...@googlegroups.com

I've noticed some longer queries around the Find Similar feature when moving to the View page...nothing too bad yet, but I thought I'd note that looks like another potential area for scalability focus.

function get_resource_top_keywords()

Tom Gleason

unread,

May 20, 2011, 8:26:57 PM5/20/11

to resour...@googlegroups.com

another note:
I recommend:
$wildcard_always_applied =false;
This causes a lot of mysql crashes.

Dan Huby

unread,

May 20, 2011, 9:23:33 PM5/20/11

to ResourceSpace

On May 20, 9:26 pm, Tom Gleason <t...@buildadam.com> wrote:
> another note:
> I recommend:
> $wildcard_always_applied =false;
> This causes a lot of mysql crashes.

That's why the default is off, and there's a big warning above the
setting in the config.default.php file:
# WARNING - this option could cause search performance issues due to
the hugely expanded searches that will be performed.

A few things are experimental or for special cases.

There's an automatic suggestion feature ($suggest_threshold) that
causes some pretty slow queries too - again off by default with a
warning in the config.

Tom Gleason

unread,

May 25, 2011, 8:59:05 PM5/25/11

to resour...@googlegroups.com

I just wanted to make one more note on this topic as a reminder if the
issue comes up again.

The temp tables solution (which basically is just compacting the
joins) still in theory may not be as good as David's initial proposal.

It was simply much easier to implement because it uses the same basic
code/query structure, just more compacted intermediate results. So, in
theory, similar scalability issues will apply to the temp table
solution, but the ceiling should be higher now.

So, if we start seeing any problems with my temp table solution, then
we shouldn't forget about David's initial idea.

Reply all

Reply to author

Forward