Pinning a table into the InnoDB/XtraDB buffer pool

ba...@percona.com

unread,

Jan 3, 2009, 2:32:51 PM1/3/09

to Percona Discussion

What do you think about the possibility to pin a table (or an index --
if you want the "table" you pin the PRIMARY index) into the buffer
pool so its pages don't get replaced by something else?

Peter Zaitsev

unread,

Jan 4, 2009, 2:21:34 PM1/4/09

to ba...@percona.com, Percona Discussion

Baron,

I think this would be very good feature but in general I would like to
see a bit more broad notion of priorities attached to the index pages.
This can range from "never replace"=pinned to be first candidates
for replacement.

For example full table scans can place pages with low priority, so
large table scans do not wipe out buffer pool. Non-leaf pages also
can get higher priority than leaf pages etc.

--
Peter Zaitsev, CEO, Percona Inc.
Tel: +1 888 401 3401 ext 501
24/7 Emergency Line +1 888 401 3401 ext 911
Our Services: http://www.percona.com/services.html
Our Blog: http://www.mysqlperformanceblog.com/

Baron Schwartz

unread,

Jan 4, 2009, 6:24:50 PM1/4/09

to Percona Discussion

Peter,

How complex do you think these are?

Peter Zaitsev wrote:
> Baron,
>
> I think this would be very good feature but in general I would like to
> see a bit more broad notion of priorities attached to the index pages.
> This can range from "never replace"=pinned to be first candidates
> for replacement.
>
> For example full table scans can place pages with low priority, so
> large table scans do not wipe out buffer pool. Non-leaf pages also
> can get higher priority than leaf pages etc.
>
>
>
>
> Saturday, January 3, 2009, 11:32:51 AM, you wrote:
>
>
>> What do you think about the possibility to pin a table (or an index --
>> if you want the "table" you pin the PRIMARY index) into the buffer
>> pool so its pages don't get replaced by something else?
>
>
>
>

--
Baron Schwartz, Director of Consulting, Percona Inc.
Tel: +1 888 401 3401 ext 507 (Timezone: EST5EDT; GMT-5/GMT-4)
Skype: baron.schwartz AIM: bps7j

Peter Zaitsev

unread,

Jan 4, 2009, 6:56:35 PM1/4/09

to Baron Schwartz, Percona Discussion

Baron,

Yasufumi would be good to check. In general I assume some information
will need to be tracked in the buffer pool to track priority of pages
and other information. I bet there is extra space for it on the page
(for example both checksum slots can be used because checksum is only
counted on write our)

When we will need to change the main replacement policy from LRU to be
something smarter.

Finally interface needs to be added to set priority for pages which
are being read (if it is some pinned tables or something else).
Pinning tables also has other issue - we need to store this pinned
flag somewhere if mapping is permanent (ie MyISAM key cache allocation
is not)

BTW looking at midpoint insertion strategy in MyISAM could be also
interesting for ideas for Innodb.

Sunday, January 4, 2009, 3:24:50 PM, you wrote:

> Peter,

> How complex do you think these are?

> Peter Zaitsev wrote:
>> Baron,
>>
>> I think this would be very good feature but in general I would like to
>> see a bit more broad notion of priorities attached to the index pages.
>> This can range from "never replace"=pinned to be first candidates
>> for replacement.
>>
>> For example full table scans can place pages with low priority, so
>> large table scans do not wipe out buffer pool. Non-leaf pages also
>> can get higher priority than leaf pages etc.
>>
>>
>>
>>
>> Saturday, January 3, 2009, 11:32:51 AM, you wrote:
>>
>>
>>> What do you think about the possibility to pin a table (or an index --
>>> if you want the "table" you pin the PRIMARY index) into the buffer
>>> pool so its pages don't get replaced by something else?
>>
>>
>>
>>

--
Peter Zaitsev, CEO, Percona Inc.
Tel: +1 888 401 3401 ext 501

Don MacAskill

unread,

Jan 4, 2009, 7:02:29 PM1/4/09

to percona-d...@googlegroups.com

You know, I've been thinking along these same lines a bit myself, too...

Our use case is probably pretty common - we have most of our row level
data sitting in memcached (with 95-99% hit rates), but we look up those
rows using indexes. And I hope those indexes stay in RAM.

I've been wondering for awhile (but have no data to support this theory)
if when we do access a "cold" row (aka memcached doesn't have it for
whatever reason), whether that row data displaces something more useful
in my buffer pool (like a chunk of an index). Because having it live in
RAM on the DB box *and* in RAM on the memcached box is just a waste,
obviously.

Ideally, I'd like to be able to flag indexes (and possibly tables that,
for whatever reason, memcached can't easily cache) so that other data
doesn't easily displace them...

Of course, like I said, this is only a theory with no hard data - but it
seems like "educating" MySQL on what data is important for it to retain
can only be helpful, not harmful, no?

Don

Baron Schwartz

unread,

Jan 4, 2009, 7:41:02 PM1/4/09

to percona-d...@googlegroups.com

Right. And as Peter wrote, the distinction between leaf and internal nodes
matters. If your rows are short, let's say 100 bytes, then a well-packed
InnoDB leaf page is going to hold 16384 * (15/16) / 100 or so rows, minus a
little because of various types of overhead. (15/16 is because that's the
page fill factor, which would also be good to be able to customize so you
can build completely packed indexes!) So reading one row might actually
bring 150 unwanted rows into the buffer pool.

As I understand it, this notion of wasted space because of page-level
caching is part of why Falcon can theoretically give better cache fill, with
dual page and row caching, but as I also understand it, they had to disable
that feature entirely to get any kind of performance in benchmarks. So
perhaps double-caching is not the answer, and customizable page priorities
is another approach to try. (Note that I am not a Falcon expert and may be
grossly mis-stating this.)

--
Baron Schwartz, Director of Consulting, Percona Inc.

Tel: +1 888 401 3401 ext 507 (Timezone: EST5EDT; GMT-5/GMT-4)
Skype: baron.schwartz AIM: bps7j

EvgeniyStepchenko

unread,

Jan 5, 2009, 11:02:21 AM1/5/09

to Percona Discussion

Hello!

I just created new blueprint on the LaunchPad for this feature request
- https://blueprints.launchpad.net/percona-xtradb/+spec/pin-table-in-buffer.

Yasufumi thinks the compatibility of the datafile may be the biggest
problem in the implementation.

Is it possible to create supplement file with our extended data, like
pinning flag for the table?

Baron Schwartz

unread,

Jan 5, 2009, 11:04:12 AM1/5/09

to percona-d...@googlegroups.com

> Is it possible to create supplement file with our extended data, like
> pinning flag for the table?

If we do that, can we modify the undo-slots patch so there are no backwards
compatibility problems?

Peter Zaitsev

unread,

Jan 5, 2009, 11:17:24 AM1/5/09

to EvgeniyStepchenko, Percona Discussion

EvgeniyStepchenko,

I think we need to check how such things are done.

I think in 5.1 you can have various options specified in CREATE TABLE
syntax.

The hack many tables uses (like FEDERATED) is using special table
comment.

> Is it possible to create supplement file with our extended data, like
> pinning flag for the table?

--
Peter Zaitsev, CEO, Percona Inc.
Tel: +1 888 401 3401 ext 501

Yasufumi

unread,

Jan 5, 2009, 10:47:38 PM1/5/09

to Percona Discussion

Baron,

Before the development, I'd like to understand your true problem.
If you want the more efficient datacache strategy simply,
the "pinning a table" may not be so useful for the efficiency.

Not to flood the useful pages, we may also have the another simple
methods.
("pinning selected table" may be complicated as its effect...)

For example, OracleDB don't give LRU priority to "(whole) scanned"
datapages.
So, I never have used pinning table of OracleDB's.
I think "pinning table" is special function for specific cases.

Currently, my simplest first idea is
giving different LRU priority to "leaf pages" and "not leaf pages
(root, branch)".
InnoDB datapages have flag shows the level in the btree (0:leaf,
max:root).
So, I think we can implement simply by using also an another LRU list
(for not leaf pages)
without any change of the datafile format or compatibility.

How do you like it?

Baron Schwartz

unread,

Jan 5, 2009, 11:02:18 PM1/5/09

to percona-d...@googlegroups.com

Hi Yasufumi,

There are two things. Here is one example: a customer wants to have a hot
standby master-master replication, but does not want to send any reads to
the standby machine. Since it has replication's write workload but not the
read, its buffer pool is not ready to support reads after a failover. I
have already measured the difference in the buffer pool with a write-only
workload and with a read-write workload, and it is significant. It might be
useful to pin the most important tables in the buffer pool so the write
workload doesn't push them out.

I admit that pinning is a bad solution for this, and I think actually the
"right" solution is to mirror the read workload onto the standby server. I
don't like pinning either.

The second thing is what you pointed out -- your idea makes a lot of sense.

I think the best test is this: benchmark on a workload that has a mixture of
a) single-row lookups by PK and other indexes, and b) full table and index
scans. But this requires you to actually make the change you suggested :)

Baron

--

Peter Zaitsev

unread,

Jan 5, 2009, 11:28:49 PM1/5/09

to Yasufumi, Percona Discussion

Yasufumi,

> For example, OracleDB don't give LRU priority to "(whole) scanned"
> datapages.
> So, I never have used pinning table of OracleDB's.
> I think "pinning table" is special function for specific cases.

This is only one of the reasons. There are other cases when you simply
want to have real time access to the table such as it is in memory.

> Currently, my simplest first idea is
> giving different LRU priority to "leaf pages" and "not leaf pages
> (root, branch)".
> InnoDB datapages have flag shows the level in the btree (0:leaf,
> max:root).
> So, I think we can implement simply by using also an another LRU list
> (for not leaf pages)
> without any change of the datafile format or compatibility.

What does data file have to do with it at all ?

Honestly I think it is best to split notion of the priority for the
pages and when how this priority is set.

This would allow to do a lot of cool stuff easily... For example for
batch jobs (even not full table scans) we may want to lower their
priority so pages touched by these jobs are not high priority. Leaf,
Full Table scans are also different ways to set priority.

--
Peter Zaitsev, CEO, Percona Inc.
Tel: +1 888 401 3401 ext 501

Baron Schwartz

unread,

Jan 5, 2009, 11:39:07 PM1/5/09

to percona-d...@googlegroups.com, Yasufumi

> This would allow to do a lot of cool stuff easily... For example for
> batch jobs (even not full table scans) we may want to lower their
> priority so pages touched by these jobs are not high priority. Leaf,
> Full Table scans are also different ways to set priority.

Right, mk-table-checksum queries are not full scans (if you're wisely using
--chunksize and if you have a usable index) but I wouldn't want them to
pollute the cache if I could help it. Right now I can't.

I think we can talk about a lot of complex ways to do this. All models are
wrong but some models are useful. Is there a simple/elegant/whatever model
that has a lot of benefits because it approximates real-world needs better
than just LRU?

I think straightforward leaf/nonleaf is probably worth looking into.

Peter Zaitsev

unread,

Jan 6, 2009, 12:26:50 AM1/6/09

to Baron Schwartz, percona-d...@googlegroups.com, Yasufumi

Baron,

I think it is worth checking what others are using. MyISAM ? Maria ?
PostgreSQL ? PBXT ? What file caches use in Linux ? Solaris ?

Some interesting links are here:

http://www.informatik.uni-trier.de/~ley/db/dbimpl/buffer.html

http://en.wikipedia.org/wiki/Cache_algorithms

http://en.wikipedia.org/wiki/Page_replacement_algorithm

http://en.wikipedia.org/wiki/Adaptive_Replacement_Cache

> I think we can talk about a lot of complex ways to do this. All models are
> wrong but some models are useful. Is there a simple/elegant/whatever model
> that has a lot of benefits because it approximates real-world needs better
> than just LRU?

> I think straightforward leaf/nonleaf is probably worth looking into.

--

Yasufumi

unread,

Jan 6, 2009, 4:23:01 AM1/6/09

to Percona Discussion

Peter,

I may understand your problem. :)

I think there are several properties of implementation.

[volatile or static?]
volatile: implement as new command like "FLUSH TABLE"?
(difficulty) new plugin specific command implementation (?)

static: implement as new attribution of "CREATE TABLE"?
(difficulty) where the attribution is stored?
we must change the datafile format to store the new
attribution
(or it may be internal table? are there any problem?)

* Adding the new attribution to the entries of the online dictionary
cache
may not cause compatibility problem of datafiles.

[passive or active?]
passive: give high priority to the specified tables/indexes.

active: + read the all pages of the specified tables/indexes at the
time.

* "active" may have more things (exception cases?) to consider than
"passive"

* Anyway, I think the all of read/create pages must look up the
dictionary
to check the attribution and set the new flag of the page.

Peter Zaitsev

unread,

Jan 6, 2009, 5:43:16 PM1/6/09

to Yasufumi, Percona Discussion

Yasufumi,

> I think there are several properties of implementation.

> [volatile or static?]
> volatile: implement as new command like "FLUSH TABLE"?
> (difficulty) new plugin specific command implementation (?)

> static: implement as new attribution of "CREATE TABLE"?
> (difficulty) where the attribution is stored?
> we must change the datafile format to store the new
> attribution
> (or it may be internal table? are there any problem?)

We need to check what 5.1 offers as storing table properties in .frm
file - I think it was greatly extended.

See for example:

http://dev.mysql.com/doc/refman/5.0/en/federated-use.html

Federated engine adds CONNECTION option. I guess cache policy should
be done similar way.

> * Adding the new attribution to the entries of the online dictionary
> cache
> may not cause compatibility problem of datafiles.

Note this is what we say on per table specification. We also may have
it dynamic on connection/query basics.

> [passive or active?]
> passive: give high priority to the specified tables/indexes.

> active: + read the all pages of the specified tables/indexes at the
> time.

> * "active" may have more things (exception cases?) to consider than
> "passive"

What do you mean by reading all pages by tables/indexes ? I think
this is rather prefetch which is independent from anything else.

If table is pinned we can also do prefetch for it (I think we need a
command to be able to prefetch table anyway)

> * Anyway, I think the all of read/create pages must look up the
> dictionary
> to check the attribution and set the new flag of the page.

Yeah. This is something what needs to be efficient. Though really it
can be handled similar way as compression properties or
compact/redundant format exist now - there is already some space to
store options in Innodb

Reply all

Reply to author

Forward