Deleting duplicate lines _without_ sorting


Stefan Klein

Jul 6, 2011, 5:02:24 AM
to vim...@googlegroups.com
Hi vim users,

I've got a longish SQL script with duplicate inserts :/
I'd like to remove those without sorting the whole file.
It's possible to match the lines by a pattern.

One solution might be to insert the line number at the end of the line,
sort the file,
delete duplicate lines ignoring the line number,
move the line number to the start of the line,
sort,
remove the line number.

Is there a simpler way?

thanks,
Stefan

Karol Samborski

Jul 6, 2011, 5:26:46 AM
to vim...@googlegroups.com, st.fa...@googlemail.com
Hi Stefan,

I would record a simple macro like this one (before you start recording,
place the cursor on the first column of the first line you want to check):
qa      -- start recording macro 'a'
y$      -- yank the whole line
j       -- move the cursor one line down
:,$g/   -- start a :global command that runs from here to the end of the
           file, to delete every duplicate of the yanked line
<C-r>"  -- paste the yanked line as the pattern
/d<CR>  -- tell Vim to delete the matching lines (<CR> means press Enter)
<C-o>   -- jump back to the previous position (the line just below the
           one that was checked)
q       -- stop recording

and then 999@a should remove all duplicate lines.

I tested it on a simple file, but I can't promise it will always be
correct. Just try it.
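
For reference, here is a rough scripted equivalent of the macro, as a sketch only. The function name DedupeWithGlobal is made up for illustration. One deliberate difference: in the macro the yanked text is used as a regular expression as-is, so characters like . or * act as wildcards and the pattern can also match part of a longer line. The sketch matches whole lines literally by using \V and escape().

```vim
" Hypothetical helper: for each line, delete all later exact copies with :global.
function! DedupeWithGlobal() abort
  let l:lnum = 1
  " line('$') is re-read on every pass because the buffer shrinks.
  while l:lnum < line('$')
    " '^' and '\$' anchor the match to the whole line, \V makes the text
    " in between literal, and \C forces a case-sensitive comparison.
    " escape() protects backslashes and the '/' delimiter.
    let l:pat = '^\C\V' . escape(getline(l:lnum), '\/') . '\$'
    " Same idea as :,$g/pattern/d in the macro, starting one line below.
    execute 'silent ' . (l:lnum + 1) . ',$global/' . l:pat . '/delete _'
    let l:lnum += 1
  endwhile
endfunction
```

Run it with :call DedupeWithGlobal(). It keeps the first occurrence of every line, including the first empty line.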

Best Regards,
Karol Samborski


Tony Mechelynck

Jul 6, 2011, 5:30:57 AM
to vim...@googlegroups.com, Stefan Klein

Well, maybe, but probably not as fast, since its processing time would be
on the order of the square of the number of lines: simply write a
function with a double loop. The outer loop examines each line i from 1
to $ in turn; the inner loop compares it with every following line, from
$ down to i+1, and deletes the later line if the two are equal. Scanning
forward in the outer loop and backward in the inner loop ensures that
line numbers don't change before you have finished using them. But the
end-of-loop test for the outer loop must recompute line('$') at every
iteration.
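
A minimal sketch of such a function, in case it helps. The name DedupeDoubleLoop is only an illustration, and lines are compared with ==#, i.e. exact, case-sensitive equality.

```vim
" Hypothetical helper: O(N^2) double loop, keeps the first copy of each line.
function! DedupeDoubleLoop() abort
  let l:i = 1
  " line('$') is re-evaluated on every pass because deletions shrink the buffer.
  while l:i < line('$')
    let l:text = getline(l:i)
    " Walk backward from the last line down to i+1, so deleting line j
    " never shifts a line that still has to be compared.
    let l:j = line('$')
    while l:j > l:i
      if getline(l:j) ==# l:text
        execute l:j . 'delete _'
      endif
      let l:j -= 1
    endwhile
    let l:i += 1
  endwhile
endfunction
```

Call it with :call DedupeDoubleLoop() on the buffer you want to clean up.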


Best regards,
Tony.
--
hundred-and-one symptoms of being an internet addict:
124. You begin conversations with, "Who is your internet service provider?"

Karol Samborski

Jul 6, 2011, 5:35:13 AM
to vim...@googlegroups.com, Stefan Klein, antoine.m...@gmail.com
You are right, but if it is only one file and the macro will never be
used as a script to do this automatically, it should be fast
enough ;)

Best Regards,
Karol

Karol Samborski

Jul 6, 2011, 5:44:11 AM
to vim...@googlegroups.com, Stefan Klein, antoine.m...@gmail.com
2011/7/6 Karol Samborski <edv....@gmail.com>:

Oops, I'm sorry, I thought you were responding to me...
I must read everything twice before I start to write ;)

Regards,
Karol

Stefan Klein

Jul 6, 2011, 7:24:33 AM
to vim...@googlegroups.com
Hi Karol,

Thank you, I hadn't thought of using :global from the current position to the end; this really does the trick.
Since I can identify the duplicate lines by a regexp and the dups aren't nested, I used a search instead of <C-o>.

regards,
Stefan

rameo

Jul 7, 2011, 3:29:09 AM
to vim_use
Hi Stefan,

I found this one a few weeks ago.
It does the job without sorting.

:g/^/kl \|if search('^'.escape(getline('.'),'\.*[]^$/').'$','bW') \|'ld

Regards,
Rameo

Tim Chase

Jul 7, 2011, 7:20:20 AM
to vim...@googlegroups.com
If the file is large, but the number of resulting unduplicated
lines is manageably small (say a couple megs), you can do it in
O(N) rather than O(N^2) or O(N*M) where N is the number of lines
and M the number of duplicates. Just store each line in a dict
as you process them and then delete them if you encounter them again:

:let a={}|g/^/let k=getline('.')|if has_key(a,k)|d|else|let a[k]=1|endif
:unlet a

This has the advantage that you can do transforms on the key if
you want, changing the "let k=getline('.')" to strip off
leading/trailing whitespace, normalize the case, etc.
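
As a sketch of such a key transform, here is the same idea written out as a function. The function name and the particular transform (trim surrounding whitespace, lowercase) are assumptions for illustration. The key gets a fixed prefix because, as far as I know, older Vim versions reject an empty dictionary key, which an empty line would otherwise produce.

```vim
" Hypothetical helper: one pass over the buffer, remembering seen keys in a dict.
function! DedupeByKey() abort
  let l:seen = {}
  let l:lnum = 1
  while l:lnum <= line('$')
    " Example transform (an assumption): strip leading/trailing whitespace
    " and fold case, so '  Foo ' and 'foo' count as the same line.
    let l:text = substitute(getline(l:lnum), '^\s\+\|\s\+$', '', 'g')
    " The '=' prefix keeps the key non-empty even for blank lines.
    let l:key = '=' . tolower(l:text)
    if has_key(l:seen, l:key)
      " Seen before: delete and stay on the same line number.
      execute l:lnum . 'delete _'
    else
      let l:seen[l:key] = 1
      let l:lnum += 1
    endif
  endwhile
endfunction
```

With :call DedupeByKey() the first line of each group is kept in its original spelling; only the comparison uses the transformed key.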

Hope this gives you another option,

-tim


Ben Fritz

Jul 7, 2011, 11:41:05 AM
to vim_use
As a refinement of your original method, I expected this to work:

1. Insert line number at BEGINNING of the line
2. ":sort /^\d\+ / u" to sort, removing duplicate lines, ignoring the
line number at the beginning of the line
3. ":sort n" to sort on line number
4. remove line number

But, it seems the 'u' flag does not ignore the /pattern/ as the sort
itself does. I expected u and /pattern/ to work together since u and i
work together just fine. Does anyone know if this is intentional?

Anyway, even if you cannot automatically remove duplicate lines with
the u flag, it saves the step of moving the line number around.
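
A sketch of those four steps as a Vim function, with the duplicate removal done by an explicit loop after the first sort instead of relying on the u flag. The function name is hypothetical. It assumes :sort keeps lines with equal sort keys in their original order, so that the first occurrence survives; if that does not hold in your Vim version, duplicates are still removed but the surviving copy may not be the first one.

```vim
" Hypothetical helper: number, sort on content, drop repeats, restore order.
function! DedupeViaNumberedSort() abort
  " 1. Prefix every line with its line number and a space.
  %substitute/^/\=line('.') . ' '/
  " 2. Sort on the text after the number (the matched prefix is skipped).
  sort /^\d\+ /
  " Walk bottom-up and drop a line whose text equals that of the line above.
  let l:lnum = line('$')
  while l:lnum > 1
    let l:this = substitute(getline(l:lnum), '^\d\+ ', '', '')
    let l:prev = substitute(getline(l:lnum - 1), '^\d\+ ', '', '')
    if l:this ==# l:prev
      execute l:lnum . 'delete _'
    endif
    let l:lnum -= 1
  endwhile
  " 3. Sort numerically on the prefix to restore the original order.
  sort n
  " 4. Remove the line numbers again.
  %substitute/^\d\+ //
endfunction
```

Run it with :call DedupeViaNumberedSort(). Note that empty lines count as duplicates of each other here as well.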

rameo

Jul 8, 2011, 3:26:18 AM
to vim_use
Have you checked the one I mentioned above?

The one above works in menu.vim (the pipe symbol has to be escaped
there).
To make it work on the command line, the backslash before each pipe
symbol has to be removed:

g/^/kl |if search('^'.escape(getline('.'),'\.*[]^$/').'$','bW') |'ld

The only problem I've found after using it for a few weeks is that it
also removes duplicate empty lines.

(I would be happy also to know what the code exactly does)

PS: In my menu I added let @/ = '' to avoid leftover search highlighting.
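
For what it's worth, here is one reading of the one-liner. :g/^/ runs the rest on every line. kl sets mark l on the current line. search(..., 'bW') looks backward, without wrapping around the end of the file, for an earlier line whose whole text equals the current one; escape() neutralises the listed regex characters. If such a line is found, 'ld deletes the marked line. The mark is needed because search() moves the cursor. Empty lines are removed too because every empty line after the first one matches ^$. Below is a spelled-out sketch of the same logic. The function name is hypothetical, and it uses line numbers plus the n flag of search() instead of a mark.

```vim
" Hypothetical helper: delete a line if an identical line exists above it.
function! DedupeBackwardSearch() abort
  let l:lnum = 1
  while l:lnum <= line('$')
    " Start at column 1 so the backward search only sees earlier lines.
    call cursor(l:lnum, 1)
    " Escape list copied from the one-liner; note that '~' is not in it.
    let l:pat = '^' . escape(getline(l:lnum), '\.*[]^$/') . '$'
    " b = search backward, n = do not move the cursor, W = do not wrap.
    if search(l:pat, 'bnW') > 0
      " An identical earlier line exists: delete this one, keep l:lnum.
      execute l:lnum . 'delete _'
    else
      let l:lnum += 1
    endif
  endwhile
endfunction
```

To keep repeated empty lines, one option would be to skip the search when getline(l:lnum) is empty.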