Quick question about Plans-on-Rails database

38 views
Skip to first unread message

Alex Cohn

unread,
Jul 17, 2012, 7:07:25 PM7/17/12
to grinnellplan...@googlegroups.com, grinne...@gmail.com
What's the expected backing data store for Plans on Rails? I'm trying to decide whether or not to purchase some reserved capacity with Amazon, and I'd rather not prepay for a MySQL server that will sit unused.

Alex Cohn

Ian Young

unread,
Jul 17, 2012, 7:29:11 PM7/17/12
to grinnellplan...@googlegroups.com
Well, for starters it'll be MySQL, so's we can run in parallel with PHP for a bit. After that... I was intentionally leaving it open. We'd be fine with staying on MySQL, but if someone is excited to move to Postgres or something, it wouldn't be too hard. I don't intend on letting anything engine-specific happen in that project, so we should be living the sparkling Rails dream of db independence.

Ian

On Tue, Jul 17, 2012 at 4:07 PM, Alex Cohn <al...@alexcohn.com> wrote:
What's the expected backing data store for Plans on Rails? I'm trying to decide whether or not to purchase some reserved capacity with Amazon, and I'd rather not prepay for a MySQL server that will sit unused.

Alex Cohn

--
You received this message because you are subscribed to the Google Groups "GrinnellPlans Development" group.
To post to this group, send email to grinnellplan...@googlegroups.com.
To unsubscribe from this group, send email to grinnellplans-deve...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/grinnellplans-development?hl=en.

Ian Young

unread,
Jul 17, 2012, 7:31:04 PM7/17/12
to grinnellplan...@googlegroups.com
If it helps, I should clarify that I don't expect us to switch to NoSQL or anything crazy like that. Those technologies are fun, but it would be a disservice to future maintainability and code accessibility, and I don't think Plans has the kind of scaling issues that would justify them.

Alex Cohn

unread,
Jul 17, 2012, 10:50:20 PM7/17/12
to grinnellplan...@googlegroups.com
Thanks for the clarification - I guess the question I was really asking was "Are you moving to NoSQL?". It sounds like the devs would be happy sticking with MySQL for the medium-term; I'm not opposed to that either.

As for scalability: a gzipped MySQL dump of the production database fits in less than 30 MB, and our database master has only 613 MB of RAM (it's an AWS t1.micro) - we're not exactly pushing MySQL's limits here. Quicklove is the only feature that regularly tests the limits of the database server; I'm writing a patch to fix that in PHP, and I'd be much happier if the Rails rewrite did something a bit more elegant than a LIKE SQL query.

Alex

Shitanshu Aggarwal

unread,
Jul 18, 2012, 5:54:28 AM7/18/12
to grinnellplan...@googlegroups.com
I cannot imagine a situation where we would fully need to move away from a relational database. A relational schema should be able to support at least twenty more generations of grinnellians without too much trouble; we might need to move to a small host in a couple years and a large host in 15 years. What do you think?

A case could be made for using a data store that specializes in search by way of MapReduce. However, this is an expensive solution because of the initial engineering work and the operational overhead of managing multiple packages later. I think we might be able to fix the problem by optimizing our MySQL configuration more, and looking into building an inverted index for quicklove. Since RDS does MySQL 5.1 and above only, we have stored procedures. So something like (http://code.google.com/p/inverted-index/wiki/Introduction) could be a fun experiment.

Do we document the specifications of our production fleet somewhere? -- I'm specifically interested in the database.


Shitanshu Aggarwal

Ian Young

unread,
Jul 18, 2012, 12:00:32 PM7/18/12
to grinnellplan...@googlegroups.com
Agreed. MapReduce-centric data stores are powerful for searching and data aggregation, but other than Quicklove/Search, those things aren't very central to the core business of Plans.

For search stuff, my plan was always to use something like [ThinkingSphinx](http://freelancing-god.github.com/ts/en/). It's a small extra bit of configuration, which is too bad, but should be well-suited to our needs otherwise.

Shitanshu Aggarwal

unread,
Jul 20, 2012, 2:24:27 AM7/20/12
to grinnellplan...@googlegroups.com
So do we have a game plan to do this? I can lead the effort to solve the problem in our PHP branch.

Alex, we'll need to measure the pain before we can do anything about it. Can you get us some metrics?
CM, how many hours of work do you think is it to set up Solr with an initial not-updated-live dump of plans?
Ian, do you see any downside in plans-dev spending some time solving this, given the fact that we are maybe trying to deprecate this code?

I like Solr, and I'd like to explore it more.

Shitanshu

Ian Young

unread,
Jul 20, 2012, 3:22:14 AM7/20/12
to grinnellplan...@googlegroups.com
Given the slow pace of the Rails rewrite at the moment, I don't want
to discourage you from making incremental improvements on the existing
code (particularly if you're excited about doing so). Personally, I'd
be a bit apprehensive about how difficult it will be to set up Solr to
work (well) with PHP in both development and production, but I could
be estimating pessimistically. See what Alex says about the pain
levels - if the existing system is good enough for now, I'd channel
your energies towards implementing it on the Rails project instead,
which will probably be a much more pleasant experience.

Ian

Alex Cohn

unread,
Jul 20, 2012, 3:48:55 AM7/20/12
to grinnellplan...@googlegroups.com
I have started work on a really naïve solution to this problem. I
created a new table with two columns-planlover_id and planlovee_id and
wrote some code to update it on plan changes. The code works, but I
couldn't get search.php to cooperate and use the table - doctrine was
being a pain. If someone could update search.php to use my
poor-man's-index for the time being, I'd really appreciate it.

As for how much pain it is causing, I am running a MySQL replication
slave solely to handle searches-it's a fair amount of work to keep
running, as MySQL replication likes to inexplicably stop working about
once a month. I think searches take a write lock on the plans table,
which can cause large numbers of threads to block, leading to the
database server locking up for minutes at a time. Not fun.

If anyone would like any specific metrics, I am happy to provide them.
Alex

Shitanshu Aggarwal

unread,
Jul 20, 2012, 4:01:05 AM7/20/12
to grinnellplan...@googlegroups.com
[1] What does the slow query log look like? What are the top offenders?
[2] What is the latency for search, and for the average plans request?
[3] What causes the replication to break occasionally?
[4] So we have some read-traffic going to the slave?
[5] Are we using InnoDB? What is the size of our InnodbBufferPool, and what's our cache miss rate? What constraint are we hitting; CPU, Disk, Memory or contention?

Alex Cohn

unread,
Jul 20, 2012, 4:42:20 AM7/20/12
to grinnellplan...@googlegroups.com
On Fri, Jul 20, 2012 at 4:01 AM, Shitanshu Aggarwal <saggar...@gmail.com> wrote:
[1] What does the slow query log look like? What are the top offenders?
I've turned on slow query logging with long_query_time=1, and will report back in a day or so. In general, I'm not aware of any queries that have the same impact as fulltext searches. 
[2] What is the latency for search, and for the average plans request?
I'm not sure. I know anecdotally that quicklove is the only feature I use that feels noticeably slow. Any suggestions on how to collect these data? 
[3] What causes the replication to break occasionally?
Most recently, replication has failed to restart after instances crashed due to power failures in the US-EAST-1 facility. The servers would come back up, but their replication thread would get stuck. I'm not sure what caused failures before this.
[4] So we have some read-traffic going to the slave?
Yes. The Configuration.php file on the production server looks like this:
if (basename($_SERVER['PHP_SELF']) == "search.php") {
define('MYSQL_HOST', 'localhost');
} else {
define('MYSQL_HOST', master');
}
search.php never writes, so this is safe. The slave is running with read-only on, so any writes would fail, anyway. 
There are actually three different replicas of the database: two t1.micro EC2 instances in different availability zones in master-master, with only one receiving traffic at a time, and the search replica on the frontend m1.small instance. The database itself is on instance storage, which is local to the VM host, I think.
[5] Are we using InnoDB? What is the size of our InnodbBufferPool, and what's our cache miss rate? What constraint are we hitting; CPU, Disk, Memory or contention?
All tables are MyISAM. I'm not sure what constraint we're hitting. How would you tell?

Alex Cohn

unread,
Jul 20, 2012, 4:46:25 AM7/20/12
to grinnellplan...@googlegroups.com
Quick clarification: Only the slave used to run searches is using instance storage. t1.micro instances (which are used for the master pair) don't come with any local storage, and must use network-attached storage.

Shitanshu Aggarwal

unread,
Jul 20, 2012, 5:11:59 AM7/20/12
to grinnellplan...@googlegroups.com
[1] Cool. This info should be very helpful.
[2] You will have to modify the format string of the access log in the Apache config. 





--
You received this message because you are subscribed to the Google Groups "GrinnellPlans Development" group.
To post to this group, send email to grinnellplan...@googlegroups.com.
To unsubscribe from this group, send email to grinnellplans-deve...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/grinnellplans-development?hl=en.



--
Shitanshu Aggarwal

Tom Baldwin

unread,
Jul 20, 2012, 5:12:50 AM7/20/12
to grinnellplan...@googlegroups.com

Ian Young said:

> Given the slow pace of the Rails rewrite at the moment, I don't want
> to discourage you from making incremental improvements on the existing
> code (particularly if you're excited about doing so). Personally, I'd
> be a bit apprehensive about how difficult it will be to set up Solr to
> work (well) with PHP in both development and production, but I could
> be estimating pessimistically. See what Alex says about the pain
> levels - if the existing system is good enough for now, I'd channel
> your energies towards implementing it on the Rails project instead,
> which will probably be a much more pleasant experience.


I agree - Plans should benefit from a solid indexed search backend in the future, but for the PHP branch I think it is important to approach this problem in ways that don't require extra dependencies, or a heart transplant on the data store.

To briefly sum up the problem - as I understand it, the main problems with the MySQL implementation re: search are

1. the cruftastic query used for searching, issue 138: http://code.google.com/p/grinnellplans/issues/detail?id=138
2. the fact that that query is a MySQL LIKE, which does not use indexing (as opposed to MySQL MATCH... AGAINST)

That first bit is just due to Doctrine doing a poor job constructing a concise query. We could do a better job with a string substitution, probably - there's no rule that we need to use the ORM for everything -

The second bit is limited by the apparent fact that MySQL does not index square brackets - so MATCH... AGAINST is unsuitable for planlove searches. Alex mentions this in issue 233, but I'm still looking in the MySQL docs for the reason - Alex, can you cite it? Not indexing punctuation marks is a sensible default for most use cases, I guess, but maybe it can be overridden.

Here's another idea: just change to the MATCH... AGAINST syntax anyway. As in: when I search for my planlove, the database is queried for all plans containing my username, bracketed or not, taking advantage of the FULLTEXT index. There would be some false positives from wherever my name appears unbracketed, but the PHP code that generates the snippets would only pick up the bracketed ones. That would be a minuscule performance cost compared to the current brute-force approach, yeah?

I think we can solve this without much pain and suffering - but if the problems run deeper than this, i may be wrong -

tk

Shitanshu Aggarwal

unread,
Jul 20, 2012, 5:26:25 AM7/20/12
to grinnellplan...@googlegroups.com
Hey Alex,

[1] Cool. This info should be very helpful.

[2] You will have to modify the format string of the access log in the Apache config. 
[2-1] This package will need to be installed and the format string will have to configured with all the cool information we are looking for. Duration is specified by %D, and the requested URL is specified by %U. There might be other golden nuggets in there. http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
[2-2] Then we'll need to do some simple aggregation using standard linux utilities. This will give us latencies on a per page level.

[3] Can't do much about natural disasters, can we? :)

[4] Cool. Thanks for clarifying.

[5] I guess once we have some metrics in place, we can start the InnoDB vs MyISAM discussion for our use case. Speaking of MyISAM, a quick google search highlights is capability to do fulltext searches efficiently. Maybe this could help quicklove a little.

Shitanshu
--
Shitanshu Aggarwal

Shitanshu Aggarwal

unread,
Jul 20, 2012, 5:45:23 AM7/20/12
to grinnellplan...@googlegroups.com
re: [2-2]
Here are some promising projects:
--
Shitanshu Aggarwal

Ian Young

unread,
Jul 20, 2012, 2:11:32 PM7/20/12
to grinnellplan...@googlegroups.com
On Fri, Jul 20, 2012 at 2:12 AM, Tom Baldwin <tbal...@uoregon.edu> wrote:

 That first bit is just due to Doctrine doing a poor job constructing a concise query. We could do a better job with a string substitution, probably - there's no rule that we need to use the ORM for everything -

Definitely. ORMs are great, but if it's getting in our way here, there's nothing wrong with writing a hand-crafted query for a very specific use case like this.
 

The second bit is limited by the apparent fact that MySQL does not index square brackets - so MATCH... AGAINST is unsuitable for planlove searches. Alex mentions this in issue 233, but I'm still looking in the MySQL docs for the reason - Alex, can you cite it? Not indexing punctuation marks is a sensible default for most use cases, I guess, but maybe it can be overridden.

Here's another idea: just change to the MATCH... AGAINST syntax anyway. As in: when I search for my planlove, the database is queried for all plans containing my username, bracketed or not, taking advantage of the FULLTEXT index. There would be some false positives from wherever my name appears unbracketed, but the PHP code that generates the snippets would only pick up the bracketed ones. That would be a minuscule performance cost compared to the current brute-force approach, yeah?

I think that sounds like an excellent idea.

So everyone knows - FULLTEXT indexes are not only specific to MySQL, they're actually specific to MyISAM tables, and can be a gigantic pain in the ass. I definitely wouldn't choose to use them on a new project. But, since the PHP app is already pretty tightly coupled to MySQL (and AFAIK we're on MyISAM), this seems like a good cheap way to semi-optimize the problem. Having a fulltext index will marginally improve the quality and performance of non-quicklove search results as well, though there may not be enough traffic there to be a concern.

Alex's simple homegrown solution would work fine too. I say go with whichever is least effort.


Ian

Alex Cohn

unread,
Jul 20, 2012, 2:31:23 PM7/20/12
to grinnellplan...@googlegroups.com
On Fri, Jul 20, 2012 at 2:11 PM, Ian Young <ian.gr...@gmail.com> wrote:


On Fri, Jul 20, 2012 at 2:12 AM, Tom Baldwin <tbal...@uoregon.edu> wrote:

 That first bit is just due to Doctrine doing a poor job constructing a concise query. We could do a better job with a string substitution, probably - there's no rule that we need to use the ORM for everything -

Definitely. ORMs are great, but if it's getting in our way here, there's nothing wrong with writing a hand-crafted query for a very specific use case like this.

If we're willing to bypass the ORM, a relatively simple query like "select accounts.username, plans.edit_text from planlove left join accounts on planlove.lover_id=accounts.userid left join plans on accounts.userid=plans.user_id where planlove.lovee_id=$idcookie" will return everything search.php needs, except maybe a column or two. (I'm not sure if the second join condition would be faster as "accounts.userid=plans.user_id" or as "planlove.lover_id=plans.user_id" - the first seems more natural and they're almost equivalent semantically)


The second bit is limited by the apparent fact that MySQL does not index square brackets - so MATCH... AGAINST is unsuitable for planlove searches. Alex mentions this in issue 233, but I'm still looking in the MySQL docs for the reason - Alex, can you cite it? Not indexing punctuation marks is a sensible default for most use cases, I guess, but maybe it can be overridden.

No indexing of punctuation, and nowhere to configure it. MySQL FULLTEXT search doesn't have many knobs you can turn. See http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html

Here's another idea: just change to the MATCH... AGAINST syntax anyway. As in: when I search for my planlove, the database is queried for all plans containing my username, bracketed or not, taking advantage of the FULLTEXT index. There would be some false positives from wherever my name appears unbracketed, but the PHP code that generates the snippets would only pick up the bracketed ones. That would be a minuscule performance cost compared to the current brute-force approach, yeah?

I think that sounds like an excellent idea.

So everyone knows - FULLTEXT indexes are not only specific to MySQL, they're actually specific to MyISAM tables, and can be a gigantic pain in the ass. I definitely wouldn't choose to use them on a new project. But, since the PHP app is already pretty tightly coupled to MySQL (and AFAIK we're on MyISAM), this seems like a good cheap way to semi-optimize the problem. Having a fulltext index will marginally improve the quality and performance of non-quicklove search results as well, though there may not be enough traffic there to be a concern.

I looked in to using FULLTEXT indexes. It isn't a bad idea, but there are a couple huge limitations. First, anyone with a username below the fulltext index's ft_min_word_len limit (defaults to 4), would be completely unable to use quicklove. There are 51 two- and three-character username accounts. Second, as Ian pointed out, it's not portable, and I want to retain the flexibility to move to InnoDB tables. Regular, non-quicklove searches seem well-suited to fulltext indexes, though.


Alex's simple homegrown solution would work fine too. I say go with whichever is least effort.

If someone's willing to wrangle search.php to use my homegrown poor-man's-index, go for it. The last time I touched that code, I managed to make a foreach loop infinitely. I kid you not.


Alex

Alex Cohn

unread,
Jul 20, 2012, 2:54:11 PM7/20/12
to grinnellplan...@googlegroups.com
On Fri, Jul 20, 2012 at 5:26 AM, Shitanshu Aggarwal <saggar...@gmail.com> wrote:
Hey Alex,

[1] Cool. This info should be very helpful.
Preliminary results: the master (which handles everything except search.php) hasn't logged any queries that took more than one second to run. The slave has logged hundreds of slow queries, all of which look like this:
# Time: 120720 18:17:49
# User@Host: plans[plans] @ localhost []
# Query_time: 3.640993  Lock_time: 0.000105 Rows_sent: 62  Rows_examined: 16815
SET timestamp=1342808269;
SELECT a.userid AS a__userid, a.username AS a__username, a.created AS a__created, a.password AS a__password, a.email AS a__email, a.pseudo AS a__pseudo, a.login AS a__login, a.changed AS a__changed, a.poll AS a__poll, a.group_bit AS a__group_bit, a.spec_message AS a__spec_message, a.grad_year AS a__grad_year, a.edit_cols AS a__edit_cols, a.edit_rows AS a__edit_rows, a.webview AS a__webview, a.notes_asc AS a__notes_asc, a.user_type AS a__user_type, a.show_images AS a__show_images, a.guest_password AS a__guest_password, a.is_admin AS a__is_admin, p.id AS p__id, p.user_id AS p__user_id, p.plan AS p__plan, p.edit_text AS p__edit_text FROM accounts a LEFT JOIN plans p ON a.userid = p.user_id WHERE p.edit_text LIKE '%\\[cohnalex\\]%' ORDER BY a.username;

The query time ranges between about three to about five seconds for all of them. 

[2] You will have to modify the format string of the access log in the Apache config. 
[2-1] This package will need to be installed and the format string will have to configured with all the cool information we are looking for. Duration is specified by %D, and the requested URL is specified by %U. There might be other golden nuggets in there. http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
[2-2] Then we'll need to do some simple aggregation using standard linux utilities. This will give us latencies on a per page level.

Logging enabled. 

[3] Can't do much about natural disasters, can we? :)

No, but the instance, the OS, and the MySQL server all managed to restart themselves. Why can't replication do the same? (I think what happened one time was that the master rashed, and the slave was trying to read past the end of a log file that didn't get fully written to disk on the master, instead of moving on to the next log entry. Had to manually restart replication by mysqldumping the database from the master and restoring it to the slave. (it's technically master-master; I'm using the terms master and slave to mean "instance that gets queries from production" and "instance that sits idly most of the time and occasionally makes backups to S3")

[4] Cool. Thanks for clarifying.

[5] I guess once we have some metrics in place, we can start the InnoDB vs MyISAM discussion for our use case. Speaking of MyISAM, a quick google search highlights is capability to do fulltext searches efficiently. Maybe this could help quicklove a little.

See discussion elsewhere in this thread. 

Thanks for all your help, everyone!

Alex

Shitanshu Aggarwal

unread,
Jul 21, 2012, 7:31:59 AM7/21/12
to grinnellplan...@googlegroups.com
I. Metrics & Benchmarks (Alex)

[1] We should now aggregate the slow query log using mysqldumpslow. I'm interested in the average slow query times.
[1-1] http://dev.mysql.com/doc/refman/5.0/en/mysqldumpslow.html
[2] Any word on the Apache access log aggregation?
[3] We need to benchmark MATCH...AGAINST. General steps:
[3-1] If the column containing the plan is not already a TEXT column, copy the plans table into a new table, change the column to be TEXT on the new table.
[3-2] Construct a MATCH...AGAINST to simulate a planlove query about ten times.
[3-3] If created new table in [3-1], drop the table.

II. Gameplan (Team)

The options are:
[1] Pre processing the plan and extracting the quicklove relations for quick access later. This technique is simple but doesn't improve full search. The preprocessing work has already been done.
[2] We could tap into MyISAM's one major specialty, FULLTEXTSEARCH. This will improve both quicklove and search latencies, but we'll have to do some extra work for Doctrine to do MATCH...AGAINST, or we can skip Doctrine on this one.
[2-1] https://github.com/beberlei/DoctrineExtensions/blob/master/lib/DoctrineExtensions/Query/Mysql/MatchAgainst.php
[2-2] https://groups.google.com/forum/?fromgroups#!topic/doctrine-user/adHyk-gACic
[3] A combination of [1] and [2] where we do quicklove using the pre processed edge data, and do full text search using MATCH...AGAINST. Estimate will be the sum of estimates for [1] and [2].
[4] Set up Solr, work a client into our PHP code and use it. This technique helps both quicklove and search /substantially/, does require code modification, and could potentially be carried over to the RAILS branch (Sunspot). Estimated to be around around 5 hours of initial work. ( 2 to set up Solr, 3 to do everything else; thanks for the estimate CM! ).

Alex, since you worked on this primarily, could you estimate [1] in hours?
Tom, could you estimate [2] in hours? The Doctrine bit will probably add an hour.


Shitanshu


--
You received this message because you are subscribed to the Google Groups "GrinnellPlans Development" group.
To post to this group, send email to grinnellplan...@googlegroups.com.
To unsubscribe from this group, send email to grinnellplans-deve...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/grinnellplans-development?hl=en.



--
Shitanshu Aggarwal

Alex Cohn

unread,
Jul 21, 2012, 6:26:35 PM7/21/12
to grinnellplan...@googlegroups.com
On Sat, Jul 21, 2012 at 7:31 AM, Shitanshu Aggarwal <saggar...@gmail.com> wrote:
I. Metrics & Benchmarks (Alex)

[1] We should now aggregate the slow query log using mysqldumpslow. I'm interested in the average slow query times.
Here's the output from the mysqldumpslow when run on the server that handles search.php's logfile:

Reading mysql slow query log from mysql/ip-10-6-127-173-slow.log
Count: 1040  Time=3.88s (4037s)  Lock=0.00s (0s)  Rows=26.1 (27173), plans[plans]@localhost
  SELECT a.userid AS a__userid, a.username AS a__username, a.created AS a__created, a.password AS a__password, a.email AS a__email, a.pseudo AS a__pseudo, a.login AS a__login, a.changed AS a__changed, a.poll AS a__poll, a.group_bit AS a__group_bit, a.spec_message AS a__spec_message, a.grad_year AS a__grad_year, a.edit_cols AS a__edit_cols, a.edit_rows AS a__edit_rows, a.webview AS a__webview, a.notes_asc AS a__notes_asc, a.user_type AS a__user_type, a.show_images AS a__show_images, a.guest_password AS a__guest_password, a.is_admin AS a__is_admin, p.id AS p__id, p.user_id AS p__user_id, p.plan AS p__plan, p.edit_text AS p__edit_text FROM accounts a LEFT JOIN plans p ON a.userid = p.user_id WHERE p.edit_text LIKE 'S' ORDER BY a.username

There have still not been any non-search.php queries that took more than a second to run. I'm dropping slow_query_time on that server to 0.25 seconds.

[2] Any word on the Apache access log aggregation?
The second of your suggested aggregation apps looks promising. I'll work on it tonight.

[3] We need to benchmark MATCH...AGAINST. General steps:
[3-1] If the column containing the plan is not already a TEXT column, copy the plans table into a new table, change the column to be TEXT on the new table.
[3-2] Construct a MATCH...AGAINST to simulate a planlove query about ten times.
[3-3] If created new table in [3-1], drop the table.

The edit_text column is type TEXT, and the fulltext index exists. Anyone want to write a SQL query for me to run? 

Alex

Ian Young

unread,
Jul 21, 2012, 7:47:32 PM7/21/12
to grinnellplan...@googlegroups.com
SELECT MATCH (plans.plan) AGAINST ("youngian") AS relevance FROM plans
ORDER BY relevance;

Tom Baldwin

unread,
Jul 21, 2012, 10:10:14 PM7/21/12
to grinnellplan...@googlegroups.com

On 20 Jul 2012, at 11:31 , Alex Cohn wrote:
> On Fri, Jul 20, 2012 at 2:11 PM, Ian Young <ian.gr...@gmail.com> wrote:
>> On Fri, Jul 20, 2012 at 2:12 AM, Tom Baldwin <tbal...@uoregon.edu> wrote:
>>> The second bit is limited by the apparent fact that MySQL does not index square brackets - so MATCH... AGAINST is unsuitable for planlove searches. Alex mentions this in issue 233, but I'm still looking in the MySQL docs for the reason - Alex, can you cite it? Not indexing punctuation marks is a sensible default for most use cases, I guess, but maybe it can be overridden.
>
> No indexing of punctuation, and nowhere to configure it. MySQL FULLTEXT search doesn't have many knobs you can turn. See http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html.
>
> [...]
>
> I looked in to using FULLTEXT indexes. It isn't a bad idea, but there are a couple huge limitations. First, anyone with a username below the fulltext index's ft_min_word_len limit (defaults to 4), would be completely unable to use quicklove. There are 51 two- and three-character username accounts.

I hadn't known about the word limit. If that's true, then your poor-man's index doesn't seem so naive any more. In that case, I like the idea of using your index for planlove searches instead of the FULLTEXT index. Non-planlove searches could use it, I guess, but I wonder whether the non-planlove search volume is even worth that trouble right now.

> Second, as Ian pointed out, it's not portable, and I want to retain the flexibility to move to InnoDB tables. Regular, non-quicklove searches seem well-suited to fulltext indexes, though.

Can't argue with that.

Aside... the docs you linked do mention changing what characters are 'word characters' - by providing a custom charset XML file. I assume you looked into this and found what the catch was (no access on production, maybe... or generally being a PITA... just guessin')

> If someone's willing to wrangle search.php to use my homegrown poor-man's-index, go for it. The last time I touched that code, I managed to make a foreach loop infinitely. I kid you not.

Oh, PHP. I'll take a crack at it, but only out of curiosity - I'm not confident I'll fare any better.

tk

Alex Cohn

unread,
Jul 21, 2012, 11:41:00 PM7/21/12
to grinnellplan...@googlegroups.com
On Sat, Jul 21, 2012 at 10:10 PM, Tom Baldwin <tbal...@uoregon.edu> wrote:
>
> I looked in to using FULLTEXT indexes. It isn't a bad idea, but there are a couple huge limitations. First, anyone with a username below the fulltext index's ft_min_word_len limit (defaults to 4), would be completely unable to use quicklove. There are 51 two- and three-character username accounts.

I hadn't known about the word limit. If that's true, then your poor-man's index doesn't seem so naive any more. In that case, I like the idea of using your index for planlove searches instead of the FULLTEXT index. Non-planlove searches could use it, I guess, but I wonder whether the non-planlove search volume is even worth that trouble right now.

Non-planlove search has been used 606 times in the last 165 hours, compared to 5061 planlove searches. (for my own future reference: grep -E "GET \/search\.php\?mysearch=[^& ]+ HTTP\/1" /var/log/apache2/access.log | wc -l)
Interesting aside: many of those were probably intended to be planlove searches, as they were searching for a valid username.

> Second, as Ian pointed out, it's not portable, and I want to retain the flexibility to move to InnoDB tables. Regular, non-quicklove searches seem well-suited to fulltext indexes, though.

Can't argue with that.

I just noticed that the docs for MySQL 5.6 claim it has support for fulltext indexes on InnoDB tables. Plans is using the Ubuntu 12.04 distribution of MySQL 5.5 currently, but I'm not opposed to installing my own if we decide to switch to InnoDB. 

Aside... the docs you linked do mention changing what characters are 'word characters' - by providing a custom charset XML file. I assume you looked into this and found what the catch was (no access on production, maybe... or generally being a PITA... just guessin')
 
Looked in to it. I have access to the database server now, but I am strongly considering moving the database hosting to Amazon RDS for improved reliability in availability zone failures (there have been several in the past couple months) and it doesn't appear that you can make that change without root access to the server (which RDS doesn't provide). Also, I'm not sure adding square brackets to a custom charset would handle cases where many usernames are planloved without any separating punctuation (like this: "[cohnalex][baldwint][youngian][aggarwal]" ) correctly - if square brackets are allowed to be part of a word, then that would appear to be only one word, and therefore a search would miss all of them.

Alex

Tom Baldwin

unread,
Jul 22, 2012, 5:39:57 AM7/22/12
to grinnellplan...@googlegroups.com
Ok, here's a patch to use the planlove index. Let me know if it works for y'all.

On 21 Jul 2012, at 20:41 , Alex Cohn wrote:
> Non-planlove search has been used 606 times in the last 165 hours, compared to 5061 planlove searches. (for my own future reference: grep -E "GET \/search\.php\?mysearch=[^& ]+ HTTP\/1" /var/log/apache2/access.log | wc -l)
> Interesting aside: many of those were probably intended to be planlove searches, as they were searching for a valid username.

Interesting! Thanks for sharing that data. It certainly motivates indexing planlove if nothing else.

tk
indexed_planlove.patch
Reply all
Reply to author
Forward
0 new messages