
Wikipedia 'pagelinks' - InnoDB Bulk Insert


froggy000

Jan 14, 2007, 7:48:36 PM
I'm trying to import a dump of the most interesting Wikipedia tables.
Specifically page.sql (560MB), categorylinks.sql (584MB) and
pagelinks.sql (3GB) from [download.wikimedia.org].

I've successfully imported both page.sql and categorylinks.sql in about
30 minutes each but I haven't had the same success with pagelinks.sql.
I've allotted as much memory as is available (1.5GB) but it fills up and
CPU usage drops to about 2-5% with gratuitous disk usage. I'm going on
12 hours here so I thought I'd check and see if this is normal and if
there's anything I can do to improve my import speed.

A couple of notes. pagelinks.sql begins with:

DROP TABLE IF EXISTS `pagelinks`;
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL default '0',
`pl_namespace` int(11) NOT NULL default '0',
`pl_title` varchar(255) binary NOT NULL default '',
UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
KEY `pl_namespace` (`pl_namespace`,`pl_title`)
) TYPE=InnoDB;

/*!40000 ALTER TABLE `pagelinks` DISABLE KEYS */;

--
-- Dumping data for table `pagelinks`
--

LOCK TABLES `pagelinks` WRITE;
INSERT INTO `pagelinks` VALUES
(11327,0,'!'),(33640,0,'!'),(64486,0,'!'),...

Importing with: mysql -u username -p schema < pagelinks.sql

Running on a Core 2 Duo 6600 under Windows 2k3 Server with 2GB RAM.

mysql.ini is:

[mysqld]
#Port number to use for connections.
port=3306
#Path to installation directory. All paths are usually resolved
#relative to this.
basedir=C:/Program Files/MySQL/MySQL Server 5.0
#Path to the database root
datadir=D:/MySQL
#If no specific storage engine/table type is defined in an SQL-Create
#statement the default type will be used.
default-storage-engine=innodb
#The bigger you set this the less disk I/O is needed to access data in
#tables. On a dedicated database server you may set this parameter up to
#80% of the machine physical memory size. Do not set it too large,
#though, because competition of the physical memory may cause paging in
#the operating system.
innodb_buffer_pool_size=1500M
#Size of a memory pool InnoDB uses to store data dictionary information
#and other internal data structures. A sensible value for this might be
#2M, but the more tables you have in your application the more you will
#need to allocate here. If InnoDB runs out of memory in this pool, it
#will start to allocate memory from the operating system, and write
#warning messages to the MySQL error log.
innodb_additional_mem_pool_size=8M
#Paths to individual datafiles and their sizes.
innodb_data_file_path=ibdata1:1000M:autoextend:max:10000M
#How much to increase datafile size by when full.
innodb_autoextend_increment=512M
#The common part of the directory path for all InnoDB datafiles. Leave
#this empty if you want to split the data files onto different drives.
innodb_data_home_dir=D:/MySQL/InnoDB
#Directory path to InnoDB log files.
innodb_log_group_home_dir=D:/MySQL/InnoDB
#Size of each log file in a log group in megabytes. Sensible values
#range from 1M to 1/n-th of the size of the buffer pool specified below,
#where n is the number of log files in the group. The larger the value,
#the less checkpoint flush activity is needed in the buffer pool, saving
#disk I/O. But larger log files also mean that recovery will be slower
#in case of a crash. The combined size of log files must be less than 4
#GB on 32-bit computers. The default is 5M.
innodb_log_file_size=300M
#The size of the buffer which InnoDB uses to write log to the log files
#on disk. Sensible values range from 1M to 8M. A big log buffer allows
#large transactions to run without a need to write the log to disk until
#the transaction commit. Thus, if you have big transactions, making the
#log buffer big will save disk I/O.
innodb_log_buffer_size=8M
#Specifies when log files are flushed to disk.
innodb_flush_log_at_trx_commit=1
#Helps in performance tuning in heavily concurrent environments.
innodb_thread_concurrency=4
#Max packet length to send to or receive from the server.
max_allowed_packet=128M

#The size of the buffer used for index blocks. Increase this to get
#better index handling (for all reads and multiple writes) to as much as
#you can afford; 64M on a 256M machine that mainly runs MySQL is quite
#common.
key_buffer_size=64M

[mysql]
#Max packet length to send to or receive from the server (client setting).
max_allowed_packet=128M

Any help/insight would be greatly appreciated.

Axel Schwenke

Jan 15, 2007, 2:46:35 PM
"froggy000" <chris....@gmail.com> wrote:
> I'm trying to import a dump of the most interesting Wikipedia tables.
> Specifically page.sql (560MB), categorylinks.sql (584MB) and
> pagelinks.sql (3GB) from [download.wikimedia.org].

This is a 3GB uncompressed SQL dump?

> I've successfully imported both page.sql and categorylinks.sql in about
> 30 minutes each but I haven't had the same success with pagelinks.sql.
> I've allotted as much memory as is available (1.5GB) but it fills up and
> CPU usage drops to about 2-5% with gratuitous disk usage.

Seems your system is I/O bound now. This was foreseeable with only
1.5GB for InnoDB's buffer pool.

> A couple notes. pagelinks.sql begins with:
>
> DROP TABLE IF EXISTS `pagelinks`;
> CREATE TABLE `pagelinks` (
> `pl_from` int(8) unsigned NOT NULL default '0',
> `pl_namespace` int(11) NOT NULL default '0',
> `pl_title` varchar(255) binary NOT NULL default '',
> UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
> KEY `pl_namespace` (`pl_namespace`,`pl_title`)
> ) TYPE=InnoDB;

Bah! Bad one!

I assume we're talking about 3GB of uncompressed SQL dump. Since it
uses the multi-value INSERT syntax, there is not much overhead from
SQL in the dump. Assuming 6 digits for each of the numbers and k
characters in the VARCHAR() column, we have 6+6+k+7 characters per row
in the dump and get 4+4+k bytes of raw data from it. Estimating k~30,
this leads to approx. 2.3GB of raw data. But the rows are small at 38
bytes, and I guess InnoDB adds another 10 bytes of overhead per row,
so we're back at roughly 3GB for the rows alone.
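
Spelled out with those estimates (these are not measured numbers, just
the assumptions k~30 and ~10 bytes of InnoDB overhead per row):

dump:    6+6+30+7 = 49 chars/row   ->  3GB / 49  = ~61 million rows
raw:     4+4+30   = 38 bytes/row   ->  61M * 38  = ~2.3GB raw data
InnoDB:  38 + ~10 = ~48 bytes/row  ->  61M * 48  = ~2.9GB, i.e. ~3GB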

Now for indexes: the first index is on *all* columns in the table
and therefore duplicates the amount of memory used. The second index
duplicates the bigger part of the data again. So overall you will
have approx. 8GB of active data in InnoDB (when you create an index
on a table you can consider the whole table "active"). With just
1.5GB of memory this will clearly become slow.
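
In rough numbers (again based on the same estimates, not on measured
sizes):

rows:                                           ~3GB
UNIQUE KEY (pl_from, pl_namespace, pl_title):   ~3GB again
KEY (pl_namespace, pl_title):                   ~2GB more
                                                -----
active data:                                    ~8GB vs. a 1.5GB buffer pool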

You may try removing both indexes from the table definition, loading
the raw data, and then creating the indexes afterwards. The following
line
> /*!40000 ALTER TABLE `pagelinks` DISABLE KEYS */;

does that for the second index already, but not for the first one.
But I guess index creation will still take very long.
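
A sketch of what that would look like (the CREATE TABLE is edited by
hand in pagelinks.sql; the ALTER TABLE runs after the load and will
itself take a while):

CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL default '0',
`pl_namespace` int(11) NOT NULL default '0',
`pl_title` varchar(255) binary NOT NULL default ''
) TYPE=InnoDB;

-- after all INSERTs have finished, build the indexes:
ALTER TABLE `pagelinks`
ADD UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
ADD KEY `pl_namespace` (`pl_namespace`,`pl_title`);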


There is another workaround: switch the engine to MyISAM. MyISAM uses
much less memory than InnoDB. You should then put your memory into the
key_buffer instead of the innodb_buffer_pool.

Wikipedia uses InnoDB because they have a lot of writes. Unless you
want to run a Wikipedia system yourself, MyISAM should be good.
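
A sketch of that route: change the last line of the CREATE TABLE in
pagelinks.sql and shift the memory in mysql.ini (the sizes below are
only an illustration, not tuned values):

) TYPE=MyISAM;

#give most of the RAM to MyISAM's key cache instead of the InnoDB
#buffer pool
key_buffer_size=1200M
innodb_buffer_pool_size=64M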


XL
--
Axel Schwenke, Support Engineer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
