detecting and deleting duplicates of emails


Stephen Eglen

Jul 9, 2016, 3:03:50 AM
to mu-discuss
Hi,
I often end up with multiple copies of messages (e.g. one in my Sent folder, one when I am on the CC: line).

What's the suggested route for finding and deleting duplicate emails?  Some lisp (which I can try)?  I am aware of mu4e-headers-skip-duplicates, but that only hides duplicates in the headers view rather than deleting them.

Thanks, Stephen

Dirk-Jan C. Binnema

Jul 9, 2016, 3:16:21 AM
to mu-di...@googlegroups.com
If you have mu-guile, you can do:

mu find-dups

which prints a message, followed by its duplicates prefixed with 'dup: ',
which you can then process with shell tools.

There's also a --delete option - but be careful!
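
A rough sketch of post-processing that output in Python rather than with shell tools. It assumes only what is described above: each message is printed on its own line, and each of its duplicates follows on a line starting with 'dup:'. The function name `group_duplicates` and the sample paths are hypothetical, not part of mu.

```python
def group_duplicates(lines, prefix="dup:"):
    """Group output lines into (original, [duplicates]) pairs.

    Assumed format: a message path on its own line, followed by zero or
    more lines that start with `prefix`.
    """
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(prefix):
            if groups:  # skip a stray duplicate line with no message before it
                groups[-1][1].append(line[len(prefix):].strip())
        else:
            groups.append((line, []))
    return groups


# Hypothetical sample standing in for captured `mu find-dups` output.
sample = [
    "/home/user/Maildir/inbox/cur/msg1",
    "dup: /home/user/Maildir/sent/cur/msg1-copy",
]
for original, dups in group_duplicates(sample):
    for dup in dups:
        print(f"{dup}\t(duplicate of {original})")
```

Listing the pairs first and reviewing them is a gentler step than going straight to --delete.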

Kind regards,
Dirk.

--
Dirk-Jan C. Binnema Helsinki, Finland
e:dj...@djcbsoftware.nl w:www.djcbsoftware.nl
pgp: D09C E664 897D 7D39 5047 A178 E96A C7A1 017D DA3C

Stephen Eglen

Jul 9, 2016, 11:02:34 AM
to mu-discuss


On Saturday, July 9, 2016 at 8:16:21 AM UTC+1, djcb wrote:

> On Saturday Jul 09 2016, Stephen Eglen wrote:
>
>> Hi,
>> I often end up with multiple copies of messages (e.g. one in my Sent
>> folder, one when I am on the CC: line).
>>
>> What's the suggested route for finding and deleting duplicate emails?  Some
>> lisp (which I can try)?  I am aware of the mu4e-headers-skip-duplicates
>
> If you have mu-guile, you can do:
>
>    mu find-dups

Thank you very much!  I will recompile to get mu-guile.

Stephen 

Stephen Eglen

Jul 9, 2016, 12:05:48 PM
to mu-discuss
Hi again Dirk,

I have find-dups now working, but it came up with only one match, when I was expecting hundreds!

I think I found the problem.  Two files that the Emacs interface shows as duplicates (same message-id) are detected as different by find-dups because they have different md5sums.  The critical diff between an example pair of messages is:

85c85
< X-TUID: GE/tqaGtaBX6
---
> X-TUID: oeJ7xKDXCQxT

I can't find out much about this TUID field, except that it turns up a lot alongside mbsync in Google results (and mbsync is what I am using to sync messages).

I can presumably hack around this in the short term by having "md5sum" in the .scm file return a constant value, but that seems risky...
Do you know what that header might be?

Stephen

Dirk-Jan C. Binnema

Jul 9, 2016, 4:56:56 PM
to mu-di...@googlegroups.com
Hi Stephen,

On Saturday Jul 09 2016, Stephen Eglen wrote:

> Hi again Dirk,
>
> I have find-dups now working, but it came up with only one match, when I
> was expecting hundreds!
>
> I think I found the problem. Two files that the Emacs interface shows as
> duplicates (same message-id) are detected as different by find-dups because
> they have different md5sums. The critical diff between an example pair of
> messages is:
>
> 85c85
> < X-TUID: GE/tqaGtaBX6
> ---
> > X-TUID: oeJ7xKDXCQxT
>
> I can't find much out about this TUID field, except that it appears a lot
> with mbsync in google (and that is what I am using to sync messages).
>
> I can presumably hack around this in the short-term by having "md5sum" in
> the .scm file return a constant value, but that seems risky...
> do you know what that header might be?

Yeah, it's good to feel a bit uncomfortable about such potentially
harmful things!

Perhaps you can delete the X-TUID headers from these messages, using "sed
-i"? "find-dups" first considers the message-id and size of the message, and
uses md5sum only if those are the same.
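
The two-stage comparison described above could be sketched in Python roughly as follows. This is an illustration of the idea only, not the actual find-dups.scm code; the function name `find_duplicates` is hypothetical, and it takes a plain list of file paths rather than querying the mu database.

```python
import hashlib
import os
from collections import defaultdict
from email.parser import BytesHeaderParser


def find_duplicates(paths):
    """Return lists of paths that look like copies of the same message."""
    # Stage 1: cheap grouping by (Message-ID, file size).
    candidates = defaultdict(list)
    for path in paths:
        with open(path, "rb") as fh:
            msgid = BytesHeaderParser().parse(fh)["Message-ID"]
        if msgid is None:
            continue  # no Message-ID header, nothing to match on
        key = (str(msgid).strip(), os.path.getsize(path))
        candidates[key].append(path)

    # Stage 2: confirm each candidate group with an MD5 of the whole file.
    duplicates = []
    for group in candidates.values():
        if len(group) < 2:
            continue
        by_digest = defaultdict(list)
        for path in group:
            with open(path, "rb") as fh:
                by_digest[hashlib.md5(fh.read()).hexdigest()].append(path)
        duplicates.extend(g for g in by_digest.values() if len(g) > 1)
    return duplicates
```

With this scheme, two copies that differ only in an X-TUID value of the same length pass stage 1 but are split apart in stage 2, which matches the behaviour reported earlier in the thread.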

Stephen Eglen

Jul 9, 2016, 5:57:58 PM
to mu-discuss
Thanks, but I'd rather not delete this TUID field if mbsync is using it.  
notes that it is "Temporary UID".

Daniele Pizzolli

Jul 10, 2016, 5:26:43 AM
to mu-di...@googlegroups.com
On Sat, Jul 09 2016, Dirk-Jan C. Binnema wrote:

> Hi Stephen,
>
> On Saturday Jul 09 2016, Stephen Eglen wrote:
>
>> Hi again Dirk,
>>
>> I have find-dups now working, but it came up with only one match, when I
>> was expecting hundreds!
>>
>> I think I found the problem. Two files that the Emacs interface shows as
>> duplicates (same message-id) are detected as different by find-dups because
>> they have different md5sums. The critical diff between an example pair of
>> messages is:
>>
>> 85c85
>> < X-TUID: GE/tqaGtaBX6
>> ---
>> > X-TUID: oeJ7xKDXCQxT
>>
>> I can't find much out about this TUID field, except that it appears a lot
>> with mbsync in google (and that is what I am using to sync messages).
>>
>> I can presumably hack around this in the short-term by having "md5sum" in
>> the .scm file return a constant value, but that seems risky...
>> do you know what that header might be?
>
> Yeah, it's good to feel a bit uncomfortable about such potentially
> harmful things!
>
> Perhaps you can delete the X-TUID headers from these messages, using "sed
> -i"? "find-dups" first considers the message-id and size of message, and
> uses md5sum only if those are the same.

Hello,

been there done that... at least on a copy of the messages.

I found this Python snippet in my old scripts dir. The purpose is to
reduce the email to a canonical form before calculating a hash (not
sure if the result will still be a valid message by some standard).
Those were the tricks I used. I am also interested to know if there
is a better way to do it!

import re

def normalize_message(msg):
    # On my file system there is no \r\n but only \n.
    # On the remote server there is \r\n while using fetch RFC822.
    # Go for the short one.
    msg = msg.replace('\r\n', '\n')

    # Remove application headers, usually found only on local messages
    # TODO: limit the removal to the headers!
    re_tuid = re.compile(r'^X-TUID: .*\n', re.MULTILINE)
    re_offl = re.compile(r'^X-OfflineIMAP: .*\n', re.MULTILINE)

    msg = re_tuid.sub('', msg)
    msg = re_offl.sub('', msg)

    # It seems that some messages, maybe those sent by the Roundcube web
    # interface, are saved by OfflineIMAP without the last newline, but I
    # am not sure, so add the ending newline if not present.
    # endswith() also copes with an empty message, unlike msg[-1].
    if not msg.endswith('\n'):
        msg += '\n'

    return msg
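
One possible way to address the TODO above (limiting the removal to the headers) is to split the message at the first blank line and strip the local-only headers from the header block alone. This is a minimal sketch under stated assumptions: `canonical_md5` is a hypothetical name, it works on raw bytes, and it assumes the X-TUID and X-OfflineIMAP headers are never folded across lines.

```python
import hashlib
import re

# Local-only headers, taken from the snippet above; header names are
# case-insensitive, hence IGNORECASE.
LOCAL_HEADERS = re.compile(
    rb"^(?:X-TUID|X-OfflineIMAP):.*\n", re.MULTILINE | re.IGNORECASE
)


def canonical_md5(raw: bytes) -> str:
    """MD5 of a message after dropping local-only headers from the header block."""
    text = raw.replace(b"\r\n", b"\n")
    # The header block ends at the first blank line; the body is left untouched.
    head, _sep, body = text.partition(b"\n\n")
    # partition() leaves the last header without its newline; restore it so
    # the pattern can also match a local header in last position.
    head = LOCAL_HEADERS.sub(b"", head + b"\n")
    return hashlib.md5(head + b"\n" + body).hexdigest()


copy_a = b"Message-ID: <1@example>\nX-TUID: GE/tqaGtaBX6\n\nHello\n"
copy_b = b"Message-ID: <1@example>\nX-TUID: oeJ7xKDXCQxT\n\nHello\n"
print(canonical_md5(copy_a) == canonical_md5(copy_b))
```

A body line that happens to start with "X-TUID:" is preserved, so two messages whose bodies differ only in such a line still hash differently.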

Kind regards,
Daniele

Stephen Eglen

Jul 13, 2016, 1:50:39 PM
to mu-discuss, d...@toel.it
Thanks Daniele for this.  My approach was to write a simple shell script called "md5sum1" that looks like:

  #!/bin/sh
  perl -ne 'print unless /^X-TUID:.*/' "$1" | md5sum

which I then call in find-dups.scm.

I also have another tweak to find-dups.scm so that, rather than searching the whole database, you can search a subset of files.  This is often quicker and less worrying when you --delete!  I'll put a patch on GitHub.

Stephen Eglen

Jul 13, 2016, 2:06:39 PM
to mu-discuss, d...@toel.it

Jeff Templon

Jun 14, 2018, 9:23:08 AM
to mu-discuss
Any idea why mu in homebrew doesn't have the guile stuff?  Does anyone have scriptology for deduplication that does not rely on the guile extension?

JT
