substitutions in a very large file


googler

Aug 16, 2009, 9:44:30 PM
to vim_use
I have a very large text file. It has only one line, of the form
shown below.
itemA itemB itemC ......

Basically itemA, itemB etc are character strings of some arbitrary
length and they are separated by a space.

I wanted to modify the file so that each line will have only one item.
The next item after a certain item will be placed in the next line. So
this is fairly easy with a ":s/ /\r/g" command. The problem is when I
have about a million of such items, this takes very long. It was going
on for about two hours and then I killed the vim session. Then I tried
to achieve the same result writing a small perl script (using split),
and I was surprised that it took less than 10 seconds to finish.

Why such a huge difference in the time taken by two methods? Any
comments? Any way I could have made things faster in the vim method?

Thanks.

Thomas Adam

Aug 16, 2009, 9:55:38 PM
to vim...@googlegroups.com, pinak...@yahoo.com
2009/8/17 googler <pinak...@yahoo.com>:

Did this file have syntax highlighting on? If so, turn it off.

Dare you provide an example of what the file looks like? As there's
every chance your substitution was wrong to begin with.

-- Thomas Adam

googler

Aug 16, 2009, 10:05:42 PM
to vim_use
> Did this file have syntax highlighting on?   If so, turn it off.
>
> Dare you provide an example of what the file looks like?  As there's
> every chance your substitution was wrong to begin with.

Syntax highlighting is not on. It's a plain text file with a .txt
extension.

I already said what it looks like. The substitution is correct. It
works on smaller files of same format.

Tim Chase

Aug 16, 2009, 10:06:37 PM
to vim...@googlegroups.com
> I have a text file that is very large. It has only one line which is
> of the form like below.
> itemA itemB itemC ......
>
> Basically itemA, itemB etc are character strings of some arbitrary
> length and they are separated by a space.
>
> I wanted to modify the file so that each line will have only one item.
> The next item after a certain item will be placed in the next line. So
> this is fairly easy with a ":s/ /\r/g" command. The problem is when I
> have about a million of such items, this takes very long. It was going
> on for about two hours and then I killed the vim session. Then I tried
> to achieve the same result writing a small perl script (using split),
> and I was surprised that it took less than 10 seconds to finish.

I'd use the standard "tr" utility available on most *nix boxes:

tr ' ' '\n' <in.txt >out.txt

which should be about as fast as it gets (eliminating Vim
altogether).

There are a number of things in Vim that could be causing
slowness: syntax highlighting, paren matching, undo levels, etc.

Dr. Chip maintains a "largefile" script that may help:

http://www.vim.org/scripts/script.php?script_id=1506

-tim

Charles Campbell

Aug 17, 2009, 10:25:39 AM
to vim...@googlegroups.com
There are several things that vim is/may be doing that perl processing
is not:

* syntax highlighting
* undo
* checking for autocmd events, and executing associated code when
they're triggered
* swapfile/backup work
* folding
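
A minimal sketch of switching these off by hand before running the substitution is shown here. The function name SplitLineOnSpaces is hypothetical, and whether this closes the gap with Perl was not measured in this thread, so treat it as something to experiment with rather than a guaranteed fix.

```vim
" Hypothetical helper (not from the thread): run the substitution from the
" original post on the current line with some expensive features disabled.
function! SplitLineOnSpaces()
  " Remember the global options we are about to change.
  let l:save_ul = &undolevels
  let l:save_ei = &eventignore
  try
    syntax off                 " no syntax highlighting (re-enable with :syntax on)
    set undolevels=-1          " do not record undo information
    set eventignore=all        " do not trigger autocommands
    setlocal noswapfile        " no swap file for this buffer
    setlocal nofoldenable      " no folding in this window
    " Paren matching: :NoMatchParen only exists if the matchparen plugin loaded.
    if exists(':NoMatchParen') == 2
      NoMatchParen
    endif
    " Same substitution as in the original post; the 'e' flag only
    " suppresses the error message when the line contains no space.
    s/ /\r/ge
  finally
    " Restore the saved options even if the substitution is interrupted.
    let &undolevels = l:save_ul
    let &eventignore = l:save_ei
  endtry
endfunction
```

With the cursor on the long line, this would be run as `:call SplitLineOnSpaces()`. Note that the change cannot be undone afterwards, since undo recording is off while it runs.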

The LargeFile.vim script that Tim mentions turns several of these time
consumers off for "large" files, the definition for which you may
customize by setting the "g:LargeFile" variable.
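
As an illustration, the threshold could be set in a vimrc along these lines. The value 10 is arbitrary, and the unit is assumed here to be megabytes; check the documentation that ships with your version of the plugin before relying on it.

```vim
" Assumption: g:LargeFile is a file-size threshold in megabytes.
" Set it in your vimrc so it is defined before the plugin loads.
let g:LargeFile = 10
```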

I don't think LargeFile is switching off paren matching -- that's probably
a good idea (thanks, Tim!) for me to add (a blindspot of mine as I
have paren matching turned off by default, anyway).

Regards,
Chip Campbell

Jay Heyl

Aug 17, 2009, 12:36:51 PM
to vim...@googlegroups.com
On Sun, Aug 16, 2009 at 6:44 PM, googler <pinak...@yahoo.com> wrote:
> Why such a huge difference in the time taken by two methods? Any
> comments? Any way I could have made things faster in the vim method?

I ran into a similar problem recently when I accidentally did effectively the reverse of what you're trying to do -- I turned a multi-thousand line file into one long line. And it took forever. 

In my case, ignoring my error, the slowdown was due to Vim maintaining undo data. The memory used by Vim grew to over 500MB and the swap file grew similarly. I'm assuming Vim was writing the line(s) changed to the swap file after each change. In my case that line kept getting longer and longer, meaning more and more data that had to be written. In your case it would start off huge and probably get smaller/faster as it made more changes.

As others have stated, Vim might not be the best tool for this particular job, but if you want to use it you might try turning off undo and backup.

  -- Jay

googler

Aug 17, 2009, 12:53:45 PM
to vim_use

On Aug 17, 9:25 am, Charles Campbell <Charles.E.Campb...@nasa.gov>
wrote:
> The LargeFile.vim script that Tim mentions turns several of these time
> consumers off for "large" files, the definition for which you may
> customize by setting the "g:LargeFile" variable.
>
> I don't think LargeFile is switching off paren matching -- that's probably
> a good idea (thanks, Tim!) for me to add (a blindspot of mine as I
> have paren matching turned off by default, anyway).

Thanks, this will be a useful tool, although I can't repeat the
earlier operation because the large files no longer exist. Perhaps it
would be a good idea to have some way to let the user know whether the
file is currently in LargeFile mode.

Charles Campbell

Aug 17, 2009, 4:43:57 PM
to vim...@googlegroups.com
In version 5b, there are two echomsg's used:

***note*** handling a large file
***note*** stopped large file handling

So a :messages command should let you know. Since syntax highlighting
is off during large file handling, it's often obvious anyway.

Regards,
Chip Campbell

Tony Mechelynck

Aug 26, 2009, 6:22:10 AM
to vim...@googlegroups.com

I haven't had a look at that LargeFile plugin, but I suppose that "large
file mode" status can be checked as a buffer-local boolean expression.
In that case, Googler, it would be possible (for the user, not the
plugin) to set up a custom 'statusline' displaying, let's say, [LF]
after the file name if in Large File mode.
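
A rough sketch of such a 'statusline' follows. The variable name b:my_large_file_flag and the helper LargeFileFlag() are both hypothetical; substitute whatever buffer-local flag the plugin actually sets, if it sets one.

```vim
" Hypothetical flag name: replace b:my_large_file_flag with the
" buffer-local variable the plugin really uses, if any.
function! LargeFileFlag()
  return exists('b:my_large_file_flag') && b:my_large_file_flag ? '[LF]' : ''
endfunction

" File name, then the [LF] marker, then modified/readonly flags,
" with line and column numbers right-aligned.
set statusline=%f\ %{LargeFileFlag()}\ %m%r%=%l,%c
" Always show the status line, even with a single window.
set laststatus=2
```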

Best regards,
Tony.
--
"Apathy is not the problem, it's the solution"
