Writing large files is slow

Ivan Krasilnikov

unread,

Dec 16, 2010, 11:01:52 PM12/16/10

to vim...@vim.org

Hello,

Recently I was using vim to edit large, multi-gigabyte text files and
I noticed that writing a file back to disk takes significantly longer
time than reading it.

Here's the amount of time it took vim (v7.3.081) on my machine (24G RAM)
to open a file with N 20-byte long lines and execute :wq command with
default options and disabled swap files via -n flag. Also shown is
the number of entries inserted into mf_hash table (explained below).

        N             real     user    sys items in mf_hash at peak
1000000 (19Mb)      0.58     0.33   0.16 5967
2000000 (38Mb)      1.02     0.57   0.19 11931
4000000 (76Mb)      2.33     1.48   0.39 23860
8000000 (153Mb)     5.48     3.31   1.23 47716
16000000 (305Mb)    13.79     9.77   1.40 95429
32000000 (610Mb)    66.10    59.55   2.64 190855
64000000 (1.2Gb)   329.30   311.52   7.45 381707
128000000 (2.4Gb) 1366.13 1224.87 17.96 763410

Just reading a large file without writing it back is much faster:

        N   real   user   sys
1000000   0.23   0.17 0.05
2000000   0.44   0.31 0.12
4000000   0.88   0.70 0.17
8000000   1.72   1.35 0.36
16000000   4.13   2.69 1.41
32000000   7.22   5.58 1.60
64000000 14.62 11.11 3.44
128000000 27.49 22.13 5.24

Profiler shows that much of the time during write is spent in mf_find_hash(),
which implements a fixed 64-bucket chained hash table mf_hash, and which is
called once for every line written. This number of buckets is inadequate for
the number of items inserted into this hash table for large files.

Attached is a patch which makes mf_hash a dynamically growing hashtable
with a fixed maximum load factor. Here's the runtime of my write test
case with this patch:

             max load factor = 1   max load factor = 64
        N    real   user    sys     real   user    sys
1000000    0.46   0.25   0.11     0.52   0.34   0.08
2000000    0.98   0.50   0.24     0.93   0.56   0.18
4000000    1.98   1.06   0.39     1.81   1.08   0.42
8000000    4.08   2.17   0.71     3.97   2.25   0.81
16000000    8.24   4.29   1.47     7.83   4.74   1.53
32000000   16.85   8.60   3.05    16.78   9.45   3.26
64000000   32.62 17.16   5.55    32.98 19.08   5.94
128000000   66.71 34.38 11.27    66.51 38.09 12.40

Other values of load factor up to ~ 256 seem also good to me.

--
Ivan Krasilnikov

mf_hash.patch

StarWing

unread,

Dec 17, 2010, 1:14:51 AM12/17/10

to vim_dev

> mf_hash.patch
> 17KViewDownload

Why not use hashtable in hashtab.c for data storage?

Ivan Krasilnikov

unread,

Dec 17, 2010, 6:22:19 AM12/17/10

to vim_dev, StarWing

On Dec 17, 9:14 am, StarWing <weasl...@gmail.com> wrote:
> Why not use hashtable in hashtab.c for data storage?

It's only for string keys. In mf_hash the key is blocknr_T.

Ivan Krasilnikov

unread,

Dec 20, 2010, 1:23:04 AM12/20/10

to Bram Moolenaar, vim...@vim.org

So, Bram, what do you think of this patch? Will it see the light of day?

Bram Moolenaar

unread,

Dec 20, 2010, 5:50:28 AM12/20/10

to Ivan Krasilnikov, vim...@vim.org

Ivan Krasilnikov wrote:

> So, Bram, what do you think of this patch? Will it see the light of day?

It's in my todo list to have a look at it. The numbers you show look
promising.

This goes to the core of storing the edited text and hash mechanisms may
have ways to fail. How are we going to torture-test this?

--
If you put 7 of the most talented OSS developers in a room for a week
and asked them to fix a bug in a spreadsheet program, in 1 week
you'd have 2 new mail readers and a text-based web browser.

/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///

Bram Moolenaar

unread,

Dec 30, 2010, 8:38:39 AM12/30/10

to Ivan Krasilnikov, vim...@vim.org

Ivan Krasilnikov wrote:

It looks good, but this is quite a big change to an important part of
the code. We need a few good tests for this.

The most tricky function is mf_hash_grow(). It's only invoked for big
files. Well, a buffer with lots of lines, the lines can be short. With
a script adding lines with the line number, and then checking that the
right number can be read back, should cover the basic case.

The requirement to have the "next" and "previous" pointers in a specific
position isn't very clean. Perhaps it's not too inconvenient to
use mf_hashitem_T in those places and access its members through a few
macros (with type casts).

An alternative is to do everything with bhdr_T and use a type cast for
NR_TRANS only. That simplifies things, although it still has the
requirement that the two structures start with those three specific
items.

For code layout: please put { in a separate line in a few places.
Keep the lines shorter than 80 characters.

--
Be thankful to be in a traffic jam, because it means you own a car.

Reply all

Reply to author

Forward