Newsgroups: gnu.emacs.gnus
From: John Hunter <jdhun...@nitace.bsd.uchicago.edu>
Date: 1999/07/16
Subject: Re: deleting existing duplicates

>>>>> "Kai" == Kai Großjohann <Kai.Grossjoh...@CS.Uni-Dortmund.DE> writes:

    Kai> Do the dupes have a Gnus-Warning header?  That could be used
    Kai> to weeding them out.

No.  As far as I know those warnings are based on the Message-ID, and
for whatever reason the Message-IDs of the duplicates in question
differ from those of the originals.  Most of the mail articles came
from Germany; I don't know whether that is the source of the
Message-ID problem.

I have written a Perl script which *works*.  I use nnml, so it is only
useful for that backend.  It takes directory names as arguments and
goes through each of them looking for duplicate mail (same sender,
same recipient, same number of lines and same Date: header to the
second).  It then removes the duplicate articles and cleans up
.overview by removing the lines corresponding to the deleted files.
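
As a side note, the duplicate test described above can also be sketched
with a single composite key.  This is only an illustration, not the way
the script below does it: the helper name dup_key and the sample
articles are made up, and the four fields are assumed to have been
parsed from the headers already.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join the four header fields into one lookup key.
# "\0" is the separator because it does not occur in a header line.
sub dup_key {
    my (%h) = @_;
    return join "\0", map { defined $h{$_} ? $h{$_} : '' } qw(date from to lines);
}

# Made-up sample data standing in for parsed nnml article files.
my @articles = (
    { file => 1, date => 'Fri, 16 Jul 1999 03:00:00 +0200',
      from => 'a@example.org', to => 'b@example.org', lines => 12 },
    { file => 2, date => 'Fri, 16 Jul 1999 03:00:00 +0200',
      from => 'a@example.org', to => 'b@example.org', lines => 12 },
    { file => 3, date => 'Fri, 16 Jul 1999 03:00:01 +0200',
      from => 'a@example.org', to => 'b@example.org', lines => 12 },
);

my (%first_seen, %duplicate_of);
foreach my $art (@articles) {
    my $key = dup_key(%$art);
    if (exists $first_seen{$key}) {
        # Same date, sender, recipient and line count as an earlier file.
        $duplicate_of{ $art->{file} } = $first_seen{$key};
    } else {
        $first_seen{$key} = $art->{file};
    }
}

# Prints "2 duplicates 1" for the sample data above.
print "$_ duplicates $duplicate_of{$_}\n"
    foreach sort { $a <=> $b } keys %duplicate_of;
```

Article 3 differs by one second in its date, so it is not reported.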

It accepts a few options: '-h' prints the usage, '-v' is verbose, and
'-i' is inquire only: it lists the duplicate and original pairs so
you can confirm the script is working before letting it kill all your
emails.  GNUS seems perfectly happy with the outcome.  I suggest you
quit GNUS before executing the script, though.

I include the script below in case any poor sucker has my problem.  As usual,
put it in your executable path and make it executable. Use the -h
option to get usage.

John Hunter

--perl script begins here--
#!/usr/bin/perl -w

use strict;

use Getopt::Std;
use vars qw($opt_i $opt_h $opt_v);
my ($inquire_only, $verbose, $dir, %seen);

getopts('ihv');

if ($opt_h || $#ARGV == -1) {
  print_help();
  exit(-1);
}

if ($opt_i) {
  $inquire_only = 1;
}

if ($opt_v || $opt_i) {
  $verbose = 1;
}

foreach $dir (@ARGV) {
  if (-d $dir) {
    $dir = path_strip_slash($dir) . "/";
    print "\nChecking $dir\n" if $verbose;
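    #%seen is kept per directory so that a date remembered from one
    #directory is never looked up in another directory's tables
    my (%seen, %email_lines, %email_sender,
        %email_recipient,$file,%match_file);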

    foreach $file (dir_list_files($dir)) {
      #ignore .overview, backups and binaries
      #(-B needs the directory prefix, or it tests a file in the cwd)
      next if ($file =~ m/^\.overview$/ || $file =~ m/~$/ || -B "$dir$file");
      open(MAILFILE,"<$dir$file") or die "can't open $dir$file: $!\n";
      my $date;
      my $lines;
      my $sender;
      my $recipient;

      while (<MAILFILE>) {
        if (m/^(Date:\s+)(.*)\s*\n/) {
          $date = $2;
        }
        if (m/^(Lines:\s+)(\d+)[\D]*/) {
          $lines = $2;
        }
        if (m/^(From:\s+)(.*)/) {
          $sender = $2;
        }
        if (m/^(To:\s+)(.*)/) {
          $recipient = $2;
        }
        last if ($date && $lines && $sender && $recipient);
      } #while mailfile

      close(MAILFILE);
      if ($date && $lines && $sender && $recipient) {
        #compare with eq: addresses contain regexp metacharacters
        if ($seen{$date} && $email_lines{$seen{$date}} == $lines &&
            $email_sender{$seen{$date}} eq $sender &&
            $email_recipient{$seen{$date}} eq $recipient) {
          #seen it before; add to duplicate array
          $match_file{$file} = $seen{$date};
        } else {
          #this email hasn't been seen before; store the relevant info
          $seen{$date} = $file;
          $email_lines{$file} = $lines;
          $email_sender{$file} = $sender;
          $email_recipient{$file} = $recipient;
        }
      } #if we have all relevant info
    } #foreach file in the dir

    #remove the duplicates; clean up .overview
    my @delete_lines;
    my $delete_regexp;
    foreach $file (sort keys %match_file) {
      print "I've seen $dir$file in $dir$match_file{$file}\n" if $inquire_only;
      #delete from .overview the line whose article number is exactly $file
      #(the number is followed by a tab, so "12" must not match "123")
      $delete_regexp = "^" . quotemeta($file) . "\\t";

      unless ($inquire_only) {
        print "unlinking $dir$file which is identical to $dir$match_file{$file}\n" if $verbose;
        if (unlink("$dir$file")) {
          push @delete_lines , $delete_regexp;
        }
        else {
          print "Can't remove $dir$file\n" if $verbose;
        }
      } #unless
    }

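    unless ($inquire_only || not (scalar @delete_lines)) {
      print "cleaning up .overview\n" if $verbose;
      #read .overview only when something was actually deleted
      my $cleaned_overview_text = delete_lines_that_match("$dir.overview", \@delete_lines);
      $cleaned_overview_text = '' unless defined $cleaned_overview_text;
      open(OUTPUT,">$dir.overview") or die "can't write $dir.overview: $!\n";
      print OUTPUT $cleaned_overview_text;
      close(OUTPUT);
    }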

  } #if dir is a directory
  else {
    #this isn't a dir; skip it
  }
} #foreach dir

sub print_help {
  print <<"USAGETEXT";
Usage: $0 dir1 [dir2 dir3...]
Options:
  -h     print this help message
  -i     inquire: print duplicate filenames only; change nothing.  
  -v     verbose; give all extra info

Example: $0 -i Bugs
USAGETEXT

}

sub delete_lines_that_match {
  my ($filename, $array_of_regexps) = @_;
  my $output_text = '';
  open(INPUT,"<$filename") || die "can't open $filename: $!\n";
  while (<INPUT>) {
    my ($regexp, $line);
    $line = $_;
    my $match = 0;
    foreach $regexp (@$array_of_regexps) {
      if ($line =~ m/$regexp/) {
        $match = 1;
        last;
      }
    } #foreach regexp in array
    $output_text .=  "$line" unless $match;
  } #while lines of input left
  close(INPUT);
  return $output_text;
} #sub delete_lines_that_match

sub dir_list_files  {
  my @dir_files;
  my $dir = shift || '.';
  opendir(DIR, "$dir") or die "can't read directory $dir: $!\n";
  my @full_list = readdir(DIR);
  foreach (@full_list) {
    #exclude . and .. (anchored, so names merely ending in a dot are kept)
    push @dir_files, $_ unless m/^\.\.?$/;
  }
  closedir(DIR);
  return @dir_files;
}

sub path_strip_slash {
  my $filename = shift;
  chop $filename if $filename =~ m#.*/$#;
  return $filename
}

--perl script ends here--
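
One caveat on the .overview cleanup: a line should be dropped only when
its whole article number matches, not when it merely begins with the
same digits.  A minimal sketch of that idea follows, assuming the
overview lines are tab-separated with the article number first.  The
helper name filter_overview and the sample lines are made up for
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the overview lines whose article number
# (the text before the first tab) is not among the deleted numbers.
sub filter_overview {
    my ($lines, $deleted) = @_;
    my %gone = map { $_ => 1 } @$deleted;
    return grep {
        my ($number) = split /\t/, $_, 2;
        !$gone{$number};
    } @$lines;
}

# Made-up overview lines; real ones carry more tab-separated fields.
my @overview = (
    "12\tfirst subject\tsender\@example.org\n",
    "123\tsecond subject\tsender\@example.org\n",
    "124\tthird subject\tsender\@example.org\n",
);

# Deleting article 12 must leave articles 123 and 124 alone.
my @kept = filter_overview(\@overview, [12]);
print @kept;
```

A hash lookup on the number also avoids building one regexp per deleted
file.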

 