Since you can't add a UNIQUE constraint to a populated table if existing entries already violate it, I think we need to do something like the following.
The migration code below generates a new cleaned-up version of the table with a new constraint [UNIQUE KEY `url` (`url`,`post_id`)] and no dups.
Then, a change needs to be made to LinkMySQLDAO->insert() to use INSERT IGNORE so that any attempted insert of duplicate info is ignored, rather than causing an error.
I have this change in my (long-suffering) 'favorites' branch but it should be incorporated sooner as a separate thing.
I could do it in a new branch and attempt the NEW pull handling :)
But, I would not be able to get to that for a while. So I thought I would just propose the change, see if it makes sense to the db experts (is there a better way to do this?), and see if Gina or anyone else wants to just incorporate it.
-----------migration file-----
CREATE TABLE `links_temp` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` varchar(255) NOT NULL,
`expanded_url` varchar(255) NOT NULL,
`title` varchar(255) NOT NULL,
`clicks` int(11) NOT NULL DEFAULT '0',
`post_id` bigint(11) DEFAULT NULL,
`network` varchar(10) NOT NULL DEFAULT 'twitter',
`is_image` tinyint(4) NOT NULL DEFAULT '0',
`error` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`,`post_id`),
KEY `is_image` (`is_image`),
KEY `post_id` (`post_id`,`network`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
-- GROUP BY collapses duplicate (post_id, url) rows, keeping one row from each
INSERT INTO links_temp SELECT * FROM tu_links GROUP BY post_id, url;
DROP TABLE tu_links;
RENAME TABLE links_temp TO tu_links;
-----change to LinkMySQLDAO->insert()----
$q = " INSERT IGNORE INTO #prefix#links ";
// $q = " INSERT INTO #prefix#links ";
---------------------------
-Amy
Yes, we should avoid duplicates in the links table.
I'm not a database expert, but UNIQUE keys and INSERT IGNORE make me a
little nervous, because it feels brute-force and requires MySQL to
update the index on every insert. Perhaps we should enforce uniqueness
in the application rather than as a DB constraint, like we do in
PostMySQLDAO::addPost. That method has an if (!$this->isPostInDB($vals))
check before the insertion logic.
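For links, that might look something like this (just a sketch; isLinkInDB() doesn't exist yet and the parameters are guesses):

    // hypothetical guard in LinkMySQLDAO::insert(), mirroring PostMySQLDAO::addPost
    if (!$this->isLinkInDB($vals['url'], $vals['post_id'])) {
        $q = " INSERT INTO #prefix#links ";
        // ...existing insertion logic...
    }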
Or maybe I'm being crazy.
Anyone have opinions/insight on enforcing uniqueness in the data
structure vs in app code?
Personally, I favor uniqueness constraints in the schema when non-uniqueness in the data would indicate a problem. That way the model logic can detect the issue when it tries to insert, know that something went wrong upstream (since there shouldn't be duplicates if all went well), and give more informative error logging.
Also, I personally like to push as much work to the database as possible; I tend to assume it's better at optimization than I am. ;)
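For example (a sketch, assuming PDO with ERRMODE_EXCEPTION, not our actual DAO code), the insert path could tell an expected duplicate apart from a real failure:

    try {
        $stmt->execute($vals);
    } catch (PDOException $e) {
        // 1062 is MySQL's duplicate-key error code
        if ($e->errorInfo[1] == 1062) {
            error_log('Duplicate link skipped: ' . $vals[':url']);
        } else {
            throw $e; // a genuine problem, not a duplicate
        }
    }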
--Mickey
It is, though. If we add the unique key, the performance hit would be
on each link insert during the crawl process, and we really want to
save time there (especially in light of the RSS/web-based crawl convo
earlier today). My links table has 2.2M rows. That will be a
significant unique index.
Mark W, if I remember correctly, we talked about this awhile ago and
you said a MySQL consultant said that generally non-unique keys are
better with uniqueness enforced at the app level. Am I remembering
that right? Correct me if I'm misquoting.
However, 'NoSQL'/key-value data stores (like App Engine's Datastore/BigTable) tend to take a different approach: there is less constraint support at the data store level (which allows the data store implementation to be more scalable), so more of this logic stays at the app level by necessity.
Perhaps some of this approach has trickled back into the RDBMS community? I look forward to hearing what Mark W learned from the MySQL consultant :).
The RSS discussion might be somewhat of a red herring.
I believe that what we have to figure out is how to spawn off the background crawler process and then just send an immediate 200 status response to the RSS reader. That is, I think the actual crawler should run asynchronously to the web request.
Of course, we don't want the crawler to be 'too slow' overall, but I think that is a somewhat different issue.
We could consider splitting up the crawler into multiple activities. E.g. the link expansion & geocoding could run independently of other crawler tasks. These could be separate php scripts that wouldn't even need to be coordinated with the other crawler activities.
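For instance, the pieces could just be separate cron entries (purely illustrative; these script names don't exist yet):

    # run crawler subtasks independently of one another
    */15 * * * * php /path/to/thinkup/crawl.php
    */30 * * * * php /path/to/thinkup/expand_links.php
    0 * * * * php /path/to/thinkup/geocode.php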
This was more or less my thought; if it's the crawler that's going to
be primarily writing to the table, I'm not too worried about adding a
bit of overhead to its writes, since the user presumably isn't waiting
for it to finish.
Gina, I don't suppose that would be worth a bit of disk-space and speed
benchmarking with your very large table, would it? (My own database
isn't nearly large enough to get a useful benchmark.)
> The RSS discussion might be somewhat of a red herring.
> I believe that what we have to figure out is how to spawn off the background crawler process and then just send an immediate 200 status response to the RSS reader. That is, I think the actual crawler should run asynchronously to the web request.
Having the crawler script the user sees schedule an 'at' job is one
possible way of doing this, though it's not necessarily portable.
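Something like this, say (sketch; the path is made up, and 'at' has to be installed and permitted for the web server's user):

    // queue the crawler to run right away, outside the web request
    exec('echo "php /path/to/thinkup/crawl.php" | at now');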
--Mickey
Short version: the RSS URL will request the crawler URL using cURL, with a
short timeout. This spawns the crawler in a separate process, and
allows the RSS URL to return a valid RSS feed in a timely manner.
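Roughly like this (a sketch; the URL and timeout value are placeholders, not the final code):

    // fire-and-forget: kick off the crawler, but only wait 1 second for it
    $ch = curl_init('http://example.com/thinkup/crawler/run.php');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 1); // give up quickly; the crawler keeps going
    curl_exec($ch); // times out by design
    curl_close($ch);
    // ...then emit the RSS feed right away

The crawler script would need ignore_user_abort(true) so it keeps running after the cURL connection drops.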
- Guillaume
Yes, exactly. And this is the approach I'm tending toward...
Quick story. My followers table has 20M rows. In a recent db
migration, I had to drop an index and add a new one (to add the
network field to the mix). I started the process and waited over 6
hours, during which time all my tables were locked, making the webapp
basically unreachable. Finally I had to restart my server (not
kidding) and just gave up.
When you start dealing with big data, separating the schema from the
actual tables is much more scalable/manageable. Mark pointed me
towards this really great article about how they separated schema from
data for FriendFeed:
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
> On Wed, Sep 1, 2010 at 3:34 PM, Jared Hatfield <jjha...@gmail.com> wrote:
>> However, the obvious downside would be a possible performance hit, but I
>> don't think that is a concern in this instance.
>
> It is, though. If we add the unique key, the performance hit would be
> on each link insert during the crawl process, and we really want to
> save time there (especially in light of the RSS/web-based crawl convo
> earlier today). My links table has 2.2M rows. That will be a
> significant unique index.
>
> Mark W, if I remember correctly, we talked about this awhile ago and
> you said a MySQL consultant said that generally non-unique keys are
> better with uniqueness enforced at the app level. Am I remembering
> that right? Correct me if I'm misquoting.
>
More specifically, unique keys across more than one table. I think either way we'd want an index, because we'd need to query the table to check for the URL before an insert. So I think the question is what performs better: an index plus a query check from the app, or a unique index?
-m
I changed my test code to take that into account: http://pastie.org/1133982
- Guillaume
> Quick story. My followers table has 20M rows. In a recent db migration, I
> had to drop an index and add a new one (to add the network field to the
> mix). I started the process and waited over 6 hours, during which time all
> my tables were locked, making the webapp basically unreachable. Finally I
> had to restart my server (not kidding) and just gave up. When you start
> dealing with big data, separating the schema from the actual tables is much
> more scalable/manageable. Mark pointed me towards this really great article
> about how they separated schema from data for FriendFeed:
> http://bret.appspot.com/entry/how-friendfeed-uses-mysql
Absolutely. Just to be clear, I'm not arguing against an index, I'm
asking if *uniqueness* can/should be enforced in the application, like
we currently do for posts.
That said, the FriendFeed method of decoupling schemata from data and
enforcing it in the app is very interesting to me, because I don't
want to lock my tables for 6 hours to change indexes. My thoughts are
on that track on a smaller scale (unique index vs index).
Yup, that settles it. I modified the code a bit (the unique index
should be on url, post_id, and network, as posts on different networks
could have the same ID) and added in timing for building the new index
out of curiosity.
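In SQL terms, assuming the two-column key from Amy's migration is already in place, the change is something like:

    ALTER TABLE tu_links
      DROP INDEX `url`,
      ADD UNIQUE KEY `url` (`url`,`post_id`,`network`);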
On my 2.2M row links table, I get:
Using 100000 links for test...
Time for application-level uniqueness checks: 45 seconds.
Time to build new unique index: 729 seconds.
Time for db-level uniqueness checks: 16 seconds.
So, it takes triple the time to do this in the app. I'm
convinced--we'll go with the unique key and use insert ignore. Thanks
all!
> Back up your links table before you run that!
> On my install, it gives this:
> $ php test.php
> Using 3091 links for test...
> Time for application-level uniqueness checks: 3 seconds.
> Time for db-level uniqueness checks: 1 seconds.
> On my 2.2M row links table, I get:
>
> Using 100000 links for test...
> Time for application-level uniqueness checks: 45 seconds.
> Time to build new unique index: 729 seconds.
> Time for db-level uniqueness checks: 16 seconds.
>
> So, it takes triple the time to do this in the app. I'm
> convinced--we'll go with the unique key and use insert ignore. Thanks
> all!
It's good to have that benchmark; it sounds like the right way to go for now.
That said... I too thought that FriendFeed article was really interesting. (For those who did not read it: they basically create separate MySQL tables for their indexes. This addresses issues with schema and index changes more than insertions per se.)
This approach is (probably) not as relevant to ThinkUp's current incarnation, but it might be as it grows.
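The flavor of the idea, very roughly (my own made-up sketch, not their actual schema):

    -- entities hold opaque serialized blobs; each "index" is its own table
    CREATE TABLE entities (
        id BINARY(16) NOT NULL PRIMARY KEY,
        body MEDIUMBLOB
    ) ENGINE=InnoDB;

    CREATE TABLE index_user_id (
        user_id INT NOT NULL,
        entity_id BINARY(16) NOT NULL,
        PRIMARY KEY (user_id, entity_id)
    ) ENGINE=InnoDB;

Since each index is just a table, a new one can be built in the background instead of locking the real data for hours.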
-Amy