small tu_links table issue


Amy

Sep 1, 2010, 4:50:45 PM
to think...@googlegroups.com

I noticed that I had lots of duplicate entries in my tu_links table (different ids, but duplicate post_id/url pairs), which can result in duplicate displayed items.
This situation is probably not super common [if you have not been abusing the crawler like I have], but it is possible with crawler crashes, etc. It would be good to add a database table constraint so it can't happen.

Since you can't add a UNIQUE constraint to a populated table if existing entries already violate it, I think we need to do something like the following.

The migration code below generates a new cleaned-up version of the table with a new constraint [UNIQUE KEY `url` (`url`,`post_id`)] and no dups.

Then, a change needs to be made to LinkMySQLDAO->insert() to use INSERT IGNORE so that any attempted insert of duplicate info is ignored, rather than causing an error.

I have this change in my (long-suffering) 'favorites' branch but it should be incorporated sooner as a separate thing.
I could do it in a new branch and attempt the NEW pull handling :)
But, I would not be able to get to that for a while. So I thought I would just propose the change, see if it makes sense to the db experts (is there a better way to do this?), and see if Gina or anyone else wants to just incorporate it.


-----------migration file-----
CREATE TABLE `links_temp` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` varchar(255) NOT NULL,
`expanded_url` varchar(255) NOT NULL,
`title` varchar(255) NOT NULL,
`clicks` int(11) NOT NULL DEFAULT '0',
`post_id` bigint(11) DEFAULT NULL,
`network` varchar(10) NOT NULL DEFAULT 'twitter',
`is_image` tinyint(4) NOT NULL DEFAULT '0',
`error` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`,`post_id`),
KEY `is_image` (`is_image`),
KEY `post_id` (`post_id`,`network`)
) ENGINE=MyISAM AUTO_INCREMENT=14 DEFAULT CHARSET=utf8;


INSERT INTO links_temp SELECT * FROM tu_links GROUP BY post_id, url;
DROP TABLE tu_links;
RENAME TABLE links_temp TO tu_links;

-----change to LinkMySQLDAO->insert()----

$q = " INSERT IGNORE INTO #prefix#links ";
// $q = " INSERT INTO #prefix#links ";

---------------------------

-Amy

Gina Trapani

Sep 1, 2010, 6:21:42 PM
to think...@googlegroups.com
On Wed, Sep 1, 2010 at 1:50 PM, Amy <amy.j...@gmail.com> wrote:
>
> I noticed that I had lots of duplicate entries in my tu_links table (different ids, but duplicate post_id/url pairs), which can result in duplicate displayed items.
> This situation is probably not super common [if you have not been abusing the crawler like I have] but is possible with crawler crashes etc.  It would be good to add a database table constraint so it can't happen.

Yes, we should avoid duplicates in the links table.

I'm not a database expert, but UNIQUE keys and INSERT IGNORE make me a
little nervous, because it feels brute force and requires MySQL to
update the index on every insert. Perhaps we should enforce this in
the application instead of as a DB constraint, like we do in
PostMySQLDAO::addPost. That method has an if
(!$this->isPostInDB($vals)) check before the insertion logic.
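For concreteness, here is a minimal runnable sketch of that app-level pattern, using PDO with SQLite as a stand-in for the real table; the insertLink()/isLinkInDB() names and the trimmed-down schema are illustrative assumptions, not ThinkUp's actual API:

```php
<?php
// App-level uniqueness: check for an existing row before inserting,
// in the spirit of the isPostInDB() check in PostMySQLDAO::addPost.
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE links (id INTEGER PRIMARY KEY AUTOINCREMENT,
                               url TEXT, post_id INTEGER)");

function isLinkInDB(PDO $db, $url, $post_id) {
    $stmt = $db->prepare("SELECT id FROM links WHERE url = ? AND post_id = ?");
    $stmt->execute(array($url, $post_id));
    return $stmt->fetch() !== false;
}

function insertLink(PDO $db, $url, $post_id) {
    if (!isLinkInDB($db, $url, $post_id)) {   // costs an extra SELECT per insert
        $stmt = $db->prepare("INSERT INTO links (url, post_id) VALUES (?, ?)");
        $stmt->execute(array($url, $post_id));
    }
}

insertLink($db, 'http://example.com/a', 42);
insertLink($db, 'http://example.com/a', 42);  // duplicate, silently skipped
echo $db->query("SELECT COUNT(*) FROM links")->fetchColumn();  // prints 1
```

The cost being debated is that extra SELECT round-trip on every insert, versus letting a unique index reject the duplicate inside MySQL.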

Or maybe I'm being crazy.

Anyone have opinions/insight on enforcing uniqueness in the data
structure vs in app code?

--
http://ginatrapani.org
http://twitter.com/ginatrapani

Michael Louis Thaler

Sep 1, 2010, 6:29:24 PM
to think...@googlegroups.com

Personally, I favor uniqueness requirements in schemata when nonuniqueness in the data would be indicative of a problem. That way the model logic can also detect the issue when it tries to insert, know that a problem occurred before the insert happened (since there shouldn't be duplicates if all went well), and give more informative error logging.

Also, I personally like to push as much work to the database as possible--I tend to assume it's better at optimization than I am. ;)

--Mickey


Jared Hatfield

Sep 1, 2010, 6:34:34 PM
to thinkupapp
I also agree the constraint should be in the database.  It is almost always assumed that the data is more important than the application, and accurate data is always a good thing.  Putting the constraint on the database end rather than enforcing it only within the application is the preferred approach because then it can't be violated by accident or by an unknown error.

However, the obvious downside would be a possible performance hit, but I don't think that is a concern in this instance.

Jared

Gina Trapani

Sep 1, 2010, 6:50:43 PM
to think...@googlegroups.com
On Wed, Sep 1, 2010 at 3:34 PM, Jared Hatfield <jjha...@gmail.com> wrote:
> However, the obvious downside would be a possible performance hit, but I
> don't think that is a concern in this instance.

It is, though. If we add the unique key, the performance hit would be
on each link insert during the crawl process, and we really want to
save time there (especially in light of the RSS/web-based crawl convo
earlier today). My links table has 2.2M rows. That will be a
significant unique index.

Mark W, if I remember correctly, we talked about this awhile ago and
you said a MySQL consultant said that generally non-unique keys are
better with uniqueness enforced at the app level. Am I remembering
that right? Correct me if I'm misquoting.

--
http://ginatrapani.org
http://twitter.com/ginatrapani

Amy

Sep 2, 2010, 9:29:48 AM
to think...@googlegroups.com
I find this discussion very interesting. Like Mickey and Jared, I too had always thought it was good practice to "push down" data integrity management to the database level as much as possible, which allows more decoupling of data integrity from the app level, especially since data sources might potentially be used by more than one app. In general I think this viewpoint does make a lot of sense.

However, 'NoSql'/key-value data stores (like App Engine's Datastore/BigTable), tend to have a different approach-- there tends to be less constraint support at the data store level (which allows the data store implementation to be more scalable etc.), so more of this logic tends to stay at the app level by necessity.

Perhaps some of this approach has trickled back into the rdbms community? I look forward to hearing what Mark W learned from the mysql consultant :).

Amy

Sep 2, 2010, 9:55:31 AM
to think...@googlegroups.com
A couple more thoughts on this.
In general I think it is 'okay' to have asynchronous background processes (like the crawler) spend a fair bit of time doing (useful) writes. The place where one really needs to pay attention to timing is the synchronous client requests. As long as these client requests are doing only reads of the links table, they should be okay.

The RSS discussion might be somewhat of a red herring.
I believe that what we have to figure out is how to spawn off the background crawler process and then just send an immediate 200 status response to the rss reader. That is, I think the actual crawler should run asynchronously to the web request.

Of course, we don't want the crawler to be 'too slow' overall, but I think that is a somewhat different issue.

We could consider splitting up the crawler into multiple activities. E.g. the link expansion & geocoding could run independently of other crawler tasks. These could be separate php scripts that wouldn't even need to be coordinated with the other crawler activities.

Michael Louis Thaler

Sep 2, 2010, 10:27:21 AM
to think...@googlegroups.com
On Thu, Sep 2, 2010 at 9:55 AM, Amy <amy.j...@gmail.com> wrote:
> A couple more thoughts on this.
> In general I think it is 'okay' to have asynchronous background processes (like the crawler) spend a fair bit of time doing (useful) writes.  The place where one really needs to pay attention to timing is the synchronous client requests.  As long as these client requests are doing only reads of the links table, they should be okay.

This was more or less my thought; if it's the crawler that's going to
be primarily writing to the table, I'm not too worried about adding a
bit of overhead to its writes, since the user presumably isn't waiting
for it to finish.

Gina--I don't suppose that would be worth doing a bit of disk space
and speed benchmarking on with your very large table, would it? (My
own database isn't nearly large enough to get a useful benchmark.)

> The RSS discussion might be somewhat of a red herring.
> I believe that what we have to figure out is how to spawn off the background crawler process and then just send an immediate 200 status response to the rss reader.  That is, I think the actual crawler should run asynchronously to the web request.

Having the crawler script the user sees schedule an at job is a
possible way of doing this, though it's not necessarily portable.

--Mickey

--
Michael Louis Thaler
Vassar College '09
Phone: 201-632-1674

Guillaume Boudreau

Sep 2, 2010, 10:44:47 AM
to think...@googlegroups.com
This problem has been solved in the other thread:
http://groups.google.com/group/thinkupapp/msg/0e76ff7e33203b45

Short version: RSS URL will request the crawler URL using cURL, with a
short timeout. This will spawn the crawler process in a separate
thread, and allow the RSS URL to return a valid RSS feed in a timely
manner.
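
A rough sketch of that approach, with an illustrative crawler URL and timeout (not the actual ThinkUp endpoint):

```php
<?php
// RSS endpoint side: fire the crawler over HTTP with a short timeout,
// then return the feed immediately. The timeout abandons *our* wait,
// not the crawler itself (the crawler script should call
// ignore_user_abort(true) so it keeps running after we hang up).
$ch = curl_init('http://example.com/thinkup/run-crawler.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 1);   // wait at most ~1 second
curl_exec($ch);                         // a timeout error here is expected
curl_close($ch);
// ...now emit the RSS feed with a 200 status...
```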

- Guillaume

Gina Trapani

Sep 2, 2010, 11:18:58 AM
to think...@googlegroups.com
On Thu, Sep 2, 2010 at 6:29 AM, Amy <amy.j...@gmail.com> wrote:
> However, 'NoSql'/key-value data stores (like App Engine's Datastore/BigTable), tend to have a different approach-- there tends to be less constraint support at the data store level (which allows the data store implementation to be more scalable etc.), so more of this logic tends to stay at the app level by necessity.

Yes, exactly. And this is the approach I'm tending toward....

Quick story. My followers table has 20M rows. In a recent db
migration, I had to drop an index and add a new one (to add the
network field to the mix). I started the process and waited over 6
hours, during which time all my tables were locked, making the webapp
basically unreachable. Finally I had to restart my server (not
kidding) and just gave up.

When you start dealing with big data, separating the schema from the
actual tables is much more scalable/manageable. Mark pointed me
towards this really great article about how they separated schema from
data for FriendFeed:

http://bret.appspot.com/entry/how-friendfeed-uses-mysql

Mark Wilkie

Sep 2, 2010, 11:33:54 AM
to think...@googlegroups.com

On Sep 1, 2010, at 6:50 PM, Gina Trapani wrote:

> On Wed, Sep 1, 2010 at 3:34 PM, Jared Hatfield <jjha...@gmail.com> wrote:
>> However, the obvious downside would be a possible performance hit, but I
>> don't think that is a concern in this instance.
>
> It is, though. If we add the unique key, the performance hit would be
> on each link insert during the crawl process, and we really want to
> save time there (especially in light of the RSS/web-based crawl convo
> earlier today). My links table has 2.2M rows. That will be a
> significant unique index.
>
> Mark W, if I remember correctly, we talked about this awhile ago and
> you said a MySQL consultant said that generally non-unique keys are
> better with uniqueness enforced at the app level. Am I remembering
> that right? Correct me if I'm misquoting.
>


More specifically, unique keys across more than one table. I think either way we'd want an index, because we'd need to query the table to check for the URL before an insert. So I think the question is what performs better: an index plus a query check from the app, or a unique index?
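
Sketched as SQL, the two options look like this (index names and placeholders are illustrative):

```sql
-- Option A: plain index, with the duplicate check done by the app.
ALTER TABLE tu_links ADD KEY url_post (url, post_id);
-- The app first runs:  SELECT id FROM tu_links WHERE url = ? AND post_id = ?
-- and only inserts if no row came back.

-- Option B: unique index, with the duplicate check done inside MySQL.
ALTER TABLE tu_links ADD UNIQUE KEY url_post (url, post_id);
-- The app always runs:  INSERT IGNORE INTO tu_links ... 
-- and duplicate rows are silently dropped.
```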

-m

Guillaume Boudreau

Sep 2, 2010, 11:48:45 AM
to think...@googlegroups.com
Mark points out something important in his latest post: even if you do
validation in the application, you'll need an index on the same
columns, since you don't want to query a 2M-row table without using
an index each time you do an insert.

I changed my test code to take that into account: http://pastie.org/1133982

- Guillaume

Trevor Bramble

Sep 2, 2010, 11:59:28 AM
to think...@googlegroups.com
On 09/02/2010 08:18 AM, Gina Trapani wrote:
> Quick story. My followers table has 20M rows. In a recent db
> migration, I had to drop an index and add a new one (to add the
> network field to the mix). I started the process and waited over 6
> hours, during which time all my tables were locked, making the webapp
> basically unreachable. Finally I had to restart my server (not
> kidding) and just gave up.
>
> When you start dealing with big data, separating the schema from the
> actual tables is much more scalable/manageable. Mark pointed me
> towards this really great article about how they separated schema from
> data for FriendFeed:
>
> http://bret.appspot.com/entry/how-friendfeed-uses-mysql
One of the systems I maintained a few years ago had enormous data sets as well, and not as normalized as I'd have liked to see. When faced with halting due to table locks I found that it was related to time spent on inserts, as you did. After some research, we migrated the afflicted tables from MyISAM (all of the tables in my decreasingly fresh install of ThinkUp use this engine) to InnoDB*.

The switch resulted in slightly higher system utilization, but completely eliminated the locking trouble. I'd suggest that's worth exploring here.

* http://en.wikipedia.org/wiki/InnoDB
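
For reference, the conversion itself is a one-liner per table, though note that it rewrites the whole table and will itself take a while (and hold locks) on tables this size:

```sql
-- Switch a write-heavy table from MyISAM (table-level locks) to
-- InnoDB (row-level locks). This rebuilds the table; back it up
-- and test on a copy first.
ALTER TABLE tu_links ENGINE=InnoDB;
```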

Cheers,
Trevor

Gina Trapani

Sep 2, 2010, 12:03:00 PM
to think...@googlegroups.com
On Thu, Sep 2, 2010 at 8:48 AM, Guillaume Boudreau
<guillaume...@danslereseau.com> wrote:
> even if you do
> validation in the application, you'll need an index on the same
> columns, since you don't want to query a 2M rows table without using
> an index each time you do an insert.

Absolutely. Just to be clear, I'm not arguing against an index, I'm
asking if *uniqueness* can/should be enforced in the application, like
we currently do for posts.

That said, the FriendFeed method of decoupling schemata from data and
enforcing it in the app is very interesting to me, because I don't
want to lock my tables for 6 hours to change indexes. My thoughts are
on that track on a smaller scale (unique index vs index).

Guillaume Boudreau

Sep 1, 2010, 8:17:15 PM
to think...@googlegroups.com
The performance hit in MySQL from adding a unique key on two columns is probably far less than doing a SELECT before each INSERT.
I'd go with the MySQL index.

It also ensures the data stays valid even if other processes, users, or parts of the code insert data into that table without using the code that checks for uniqueness.

A small test on your big links table should put the matter to rest: http://pastie.org/1132738

Back up your links table before you run that!

On my install, it gives this:

$ php test.php
Using 3091 links for test...
Time for application-level uniqueness checks: 3 seconds.
Time for db-level uniqueness checks: 1 seconds.

- Guillaume

Gina Trapani

Sep 2, 2010, 4:34:38 PM
to think...@googlegroups.com
On Wed, Sep 1, 2010 at 5:17 PM, Guillaume Boudreau
<guillaume...@danslereseau.com> wrote:
> A small test on your big links table should put the matter to rest:
> http://pastie.org/1132738

Yup, that settles it. I modified the code a bit (the unique index
should be on url, post_id, and network, as posts on different networks
could have the same ID) and added in timing for building the new index
out of curiosity.

On my 2.2M row links table, I get:

Using 100000 links for test...
Time for application-level uniqueness checks: 45 seconds.
Time to build new unique index: 729 seconds.
Time for db-level uniqueness checks: 16 seconds.

So, it takes triple the time to do this in the app. I'm
convinced--we'll go with the unique key and use insert ignore. Thanks
all!
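
Assuming the deduplicated table produced by the migration at the top of the thread, the agreed-on change would look something like this (adjust the key name to match your schema):

```sql
-- The unique key now includes network, since posts on different
-- networks can share the same post id. Duplicate url/post_id/network
-- rows must be removed first or the ALTER will fail.
ALTER TABLE tu_links ADD UNIQUE KEY url (url, post_id, network);
```

Combined with the INSERT IGNORE change to LinkMySQLDAO->insert() shown earlier, duplicate inserts become silent no-ops.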

> Backup your links table before you run that!
> On my install, it gives this:
> $ php test.php
> Using 3091 links for test...
> Time for application-level uniqueness checks: 3 seconds.
> Time for db-level uniqueness checks: 1 seconds.

--
http://ginatrapani.org
http://twitter.com/ginatrapani

Amy

Sep 3, 2010, 1:16:30 PM
to think...@googlegroups.com

On 2 Sep, 2010, at 3:34 PM, Gina Trapani wrote:

> On my 2.2M row links table, I get:
>
> Using 100000 links for test...
> Time for application-level uniqueness checks: 45 seconds.
> Time to build new unique index: 729 seconds.
> Time for db-level uniqueness checks: 16 seconds.
>
> So, it takes triple the time to do this in the app. I'm
> convinced--we'll go with the unique key and use insert ignore. Thanks
> all!

It's good to have that benchmark, it sounds like the right way to go for now.

That said... I too thought that the FriendFeed article was really interesting. (For those who did not read it: they basically create separate MySQL tables for their indexes, which addresses issues with schema and index changes more than insertions per se.)
This approach is (probably) not as relevant to ThinkUp's current incarnation, but it might be as it grows.

-Amy
