deleting existing duplicates

John Hunter

Jul 12, 1999, 3:00:00 AM

I have a mail group with about 1000 articles, half of which are
duplicates. I asked about deleting these but no one knew how (I've
since set nnmail-treat-duplicates to 'delete').

I could write a perl script which identified duplicates and removed
them from my hard drive. Would GNUS handle this ok, and update the
summary buffer appropriately?

John

Karl Kleinpaste

Jul 12, 1999, 3:00:00 AM

John Hunter <jdhu...@nitace.bsd.uchicago.edu> writes:
> I could write a perl script which identified duplicates and removed
> them from my hard drive. Would GNUS handle this ok, and update the
> summary buffer appropriately?

Gnus would probably become unhappy, to the extent that Gnus' normal
absolute control of both article existence and overview management
would stop being synchronized, unless you intend to delete the
relevant overview lines as well. That is, you have to be very
complete, if you intend to mess with these things outside Gnus.
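
A minimal sketch of how one might check that synchronization after touching files by hand. It assumes the nnml layout discussed in this thread: one file per article, named by its article number, and a .overview file whose lines start with that number followed by a tab. The function name and sample data are hypothetical.

```perl
use strict;
use warnings;

# Compare article file names against overview article numbers.
# Returns two array refs: numbers listed in the overview with no file,
# and article files with no overview line.
sub overview_mismatches {
    my ($files, $overview_lines) = @_;
    # Only purely numeric names are treated as article files.
    my %on_disk = map { $_ => 1 } grep { /^\d+$/ } @$files;
    my %listed;
    for my $line (@$overview_lines) {
        my ($num) = split /\t/, $line, 2;
        $listed{$num} = 1 if defined $num && $num =~ /^\d+$/;
    }
    my @stale   = sort { $a <=> $b } grep { !$on_disk{$_} } keys %listed;
    my @missing = sort { $a <=> $b } grep { !$listed{$_} } keys %on_disk;
    return (\@stale, \@missing);
}

# Hypothetical group: article 3 was deleted by hand, article 4 has no entry.
my ($stale, $missing) = overview_mismatches(
    [ '1', '2', '4', '.overview' ],
    [ "1\tfirst", "2\tsecond", "3\tthird" ],
);
print "overview entries without a file: @$stale\n";
print "files without an overview entry: @$missing\n";
```

If both lists come back empty, the directory and the overview at least agree on which articles exist.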

Better would be to use `M-s' in *Summary* to locate the
"Gnus-Warning:" header in the affected articles. Or alternatively,
one of the limit commands off of `/' (hit `/ C-h' for a quick summary)
might give you the needed effect. Mark the relevant articles thus
found, then `B DEL' them all.

Paul Stevenson

Jul 12, 1999, 3:00:00 AM

John Hunter <jdhu...@nitace.bsd.uchicago.edu> writes:

> I have a mail group with about 1000 articles, half of which are
> duplicates. I asked about deleting these but no one knew how (I've
> since set nnmail-treat-duplicates to 'delete').

If you respool all your messages through the backend then
nnmail-treat-duplicates will catch them:

In the Summary buffer,

M-x gnus-uu-mark-all

will process mark all the messages, then 'B r' will respool them. It
will ask you for a backend and maybe a server name, but you can just hit
enter for the default since you only want them to end up back where they
were.

Paul

John Hunter

Jul 15, 1999, 3:00:00 AM

Excellent idea and I did it but it didn't work because the Message IDs
look like this:

Message-ID: <1ru34owg8w.fsf@totally-fudged-out-message-id>

and are not the same in the duplicate files. However the Date:, Lines:
and other headers are identical which I could use to sift them out
with a perl script.
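
A rough sketch of that idea, assuming the duplicates really do agree on Date:, Lines:, From: and To: and differ only in Message-ID. The function name and the sample message are made up for illustration.

```perl
use strict;
use warnings;

# Build a comparison key from the Date:, Lines:, From: and To: headers,
# ignoring Message-ID entirely. Only the header block (text before the
# first blank line) is scanned; folded continuation lines are ignored.
sub header_key {
    my ($text) = @_;
    my %h;
    for my $line (split /\n/, $text) {
        last if $line eq '';    # blank line ends the headers
        if ($line =~ /^(Date|Lines|From|To):\s*(.*?)\s*$/i) {
            my $name = lc $1;
            # Keep the first occurrence of each header.
            $h{$name} = $2 unless exists $h{$name};
        }
    }
    return join "\0", map { defined $h{$_} ? $h{$_} : '' } qw(date lines from to);
}

# Hypothetical message; single quotes keep the @ signs literal.
my $first = join "\n",
    'Message-ID: <aaa.fsf@totally-fudged-out-message-id>',
    'Date: Mon, 12 Jul 1999 10:15:02 +0200',
    'From: sender@example.org',
    'To: reader@example.org',
    'Lines: 14',
    '',
    'body text';
(my $second = $first) =~ s/aaa/bbb/;    # same mail, different Message-ID

print header_key($first) eq header_key($second) ? "duplicate\n" : "distinct\n";
```

Two articles with equal keys are only candidates; it is worth eyeballing a few pairs before deleting anything.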

Thanks for the idea, though. I thought it would work!

Apropos the problem with .overview: if I delete the duplicate entries
in .overview and the articles on the hard drive, would Gnus be happy?

Thanks,
John

>>>>> "Paul" == Paul Stevenson <sp...@mail.phy.ornl.gov> writes:

Paul> If you respool all your messages through the backend then
Paul> nnmail-treat-duplicates will catch them:

Kai Großjohann

Jul 15, 1999, 3:00:00 AM

John Hunter <jdhu...@nitace.bsd.uchicago.edu> writes:

> Excellent idea and I did it but it didn't work because the Message IDs
> look like this:
>
> Message-ID: <1ru34owg8w.fsf@totally-fudged-out-message-id>
>
> and are not the same in the duplicate files.

Do the dupes have a Gnus-Warning header? That could be used to
weed them out.

kai
--
Life is hard and then you die.

John Hunter

Jul 16, 1999, 3:00:00 AM

>>>>> "Kai" == Kai Großjohann <Kai.Gro...@CS.Uni-Dortmund.DE> writes:

Kai> Do the dupes have a Gnus-Warning header? That could be used
Kai> to weed them out.

No; as far as I know these warnings are based on the Message ID, which
for whatever reason differed between the duplicates in my mail
articles. Most of the mail articles came from Germany; I don't know
if this is the source of the Message ID problem.

I have written a perl script which *works*. I use nnml so it is only
useful for that backend. It takes dirnames as arguments and goes
through all these directories looking for duplicate mail (same sender,
same recipient, same number of lines and same arrival time to the
second). It then goes through and removes the duplicate articles and
cleans up .overview by removing the lines corresponding to the file.
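
For illustration, the .overview cleanup step can be sketched on its own. This assumes each overview line starts with the article number followed by a tab; the function name and sample lines are hypothetical. Comparing the whole first field, rather than matching a prefix, keeps the removal of article 12 from also hitting article 123.

```perl
use strict;
use warnings;

# Return the overview lines whose first tab-separated field (the article
# number) is NOT among the deleted article numbers.
sub prune_overview {
    my ($lines, $deleted) = @_;
    my %gone = map { $_ => 1 } @$deleted;
    return grep {
        my ($num) = split /\t/, $_, 2;
        !(defined $num && $gone{$num});
    } @$lines;
}

# Hypothetical overview content with truncated fields.
my @overview = (
    "12\tRe: report\tsomeone",
    "123\tRe: report\tsomeone",
    "124\tunrelated\tsomeone else",
);
my @kept = prune_overview(\@overview, [12]);
print scalar(@kept), " lines kept\n";
```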

It accepts a couple of options: '-h' gives the usage, '-v' is verbose,
and '-i' is inquire only, so it lists the duplicate and original pairs
and you can confirm the script is working before letting it kill all
your emails. GNUS seems perfectly happy with the outcome. I suggest
you quit GNUS before executing the script though.

I include the script below in case any poor sucker has my problem. As usual,
put it in your executable path and make it executable. Use the -h
option to get usage.

John Hunter

--perl script begins here--
#!/usr/bin/perl -w

use strict;

use Getopt::Std;
use vars qw($opt_i $opt_h $opt_v);
my ($inquire_only, $verbose, $dir, %seen);

getopts('ihv');

if ($opt_h || $#ARGV == -1) {
print_help();
exit(-1);
}

if ($opt_i) {
$inquire_only = 1;
}

if ($opt_v || $opt_i) {
$verbose = 1;
}

foreach $dir (@ARGV) {
if (-d $dir) {
$dir = path_strip_slash($dir) . "/";
print "\nChecking $dir\n" if $verbose;
my (%email_lines, %email_sender,
%email_recipient,$file,%match_file);
#duplicates are only matched within one directory, so forget earlier dirs
%seen = ();

foreach $file (dir_list_files($dir)) {
#ignore .overview, backups, subdirectories and binaries
next if ($file eq '.overview' || $file =~ m/~$/ || -d "$dir$file" || -B "$dir$file");
open(MAILFILE,"<$dir$file") or die "can't open $dir$file: $!\n";
my $date;
my $lines;
my $sender;
my $recipient;

while (<MAILFILE>) {
last if m/^\s*$/; #a blank line ends the headers; don't scan the body
if (m/^(Date:\s+)(.*)\s*\n/) {
$date = $2;
}
if (m/^(Lines:\s+)(\d+)[\D]*/) {
$lines = $2;
}
if (m/^(From:\s+)(.*)/) {
$sender = $2;
}
if (m/^(To:\s+)(.*)/) {
$recipient = $2;
}
last if ($date && $lines && $sender && $recipient);
} #while mailfile

close(MAILFILE);
if ($date && $lines && $sender && $recipient) {
#compare as plain strings; addresses contain regexp metacharacters
if ($seen{$date} && $email_lines{$seen{$date}} == $lines &&
$email_sender{$seen{$date}} eq $sender &&
$email_recipient{$seen{$date}} eq $recipient) {
#seen it before; add to duplicate array
$match_file{$file} = $seen{$date};
} else {
#this email hasn't been seen before; store the relevant info
$seen{$date} = $file;
$email_lines{$file} = $lines;
$email_sender{$file} = $sender;
$email_recipient{$file} = $recipient;
}
} #if we have all relevant info
} #foreach file in the dir

#remove the duplicates; clean up .overview
my @delete_lines;
my $delete_regexp;
foreach $file (sort keys %match_file) {
print "I've seen $dir$file in $dir$match_file{$file}\n" if $inquire_only;
#delete from .overview the line whose article number is exactly $file
#(anchoring on the tab keeps "12" from also matching "123")
$delete_regexp = "^" . quotemeta($file) . "\\t";

unless ($inquire_only) {
print "unlinking $dir$file which is identical to $dir$match_file{$file}\n" if $verbose;
if (unlink("$dir$file")) {
push @delete_lines , $delete_regexp;
}
else {
print "Can't remove $dir$file\n" if $verbose;
}
} #unless
}

unless ($inquire_only || !@delete_lines) {
print "cleaning up .overview\n" if $verbose;
#only read .overview when there is something to remove from it
my $cleaned_overview_text = delete_lines_that_match("$dir.overview", \@delete_lines);
open(OUTPUT,">$dir.overview") or die "can't write $dir.overview: $!\n";
print OUTPUT $cleaned_overview_text;
close(OUTPUT);
}

} #foreach dir
else {
#this isn't a dir
}
}

sub print_help {
print <<"USAGETEXT";
Usage: $0 dir1 [dir2 dir3...]
Options:
-h print this help message
-i inquire: print duplicate filenames only; change nothing.
-v verbose; give all extra info

Example: $0 -i Bugs
USAGETEXT
}

sub delete_lines_that_match {
my ($filename, $array_of_regexps) = @_;
my $output_text = '';
open(INPUT,"<$filename") || die "can't open $filename: $!\n";
while (<INPUT>) {
my ($regexp, $line);
$line = $_;
my $match = 0;
foreach $regexp (@$array_of_regexps) {
if ($line =~ m/$regexp/) {
$match = 1;
}
} #foreach rexep in array
$output_text .= "$line" unless $match;
} #while lines of input left
close(INPUT);
return $output_text;
} #sub delete_lines_that_match

sub dir_list_files {
my @dir_files;
my $dir = shift || '.';
opendir(DIR, "$dir") or die "can't read directory $dir: $!\n";
my @full_list = readdir(DIR);
foreach (@full_list) {
#exclude . and .. (anchored so names merely ending in a dot are kept)
push @dir_files, $_ unless m/^\.\.?$/;
}
closedir(DIR);
return @dir_files;
}

sub path_strip_slash {
my $filename = shift;
chop $filename if $filename =~ m#.*/$#;
return $filename
}
--perl script ends here--