On Sat, Jul 09 2016, Dirk-Jan C. Binnema wrote:
> Hi Stephen,
>
> On Saturday Jul 09 2016, Stephen Eglen wrote:
>
>> Hi again Dirk,
>>
>> I have find-dups now working, but it came up with only one match, when I
>> was expecting hundreds!
>>
>> I think I found the problem. Two files that the Emacs interface says are
>> duplicates (same message-id) are detected as different by find-dups because
>> they have different md5sums. The critical diff between an example pair of
>> messages is:
>>
>> 85c85
>> < X-TUID: GE/tqaGtaBX6
>> ---
>>> X-TUID: oeJ7xKDXCQxT
>>
>> I can't find much out about this TUID field, except that it appears a lot
>> with mbsync in google (and that is what I am using to sync messages).
>>
>> I can presumably hack around this in the short-term by having "md5sum" in
>> the .scm file return a constant value, but that seems risky...
>> do you know what that header might be?
>
> Yeah, it's good to feel a bit uncomfortable about such potentially
> harmful things!
>
> Perhaps you can delete the X-TUID headers from these messages, using "sed
> -i"? "find-dups" first considers the message-id and size of message, and
> uses md5sum only if those are the same.

Hello,

Been there, done that... at least on a copy of the messages.

I found this Python snippet in my old scripts directory. Its purpose is
to reduce the email to a canonical form before calculating a hash (I am
not sure whether the result is still a valid message by any standard).
These were the tricks I used. I am also interested to know if there
is a better way to do it!

import re

def normalize_message(msg):
    # On my file system there is no \r\n but only \n.
    # On the remote server there is \r\n while using fetch RFC822.
    # Go for the short one.
    msg = msg.replace('\r\n', '\n')
    # Remove application headers, usually found only on local messages.
    # TODO: limit the removal to the headers!
    re_tuid = re.compile(r'^X-TUID: .*\n', re.MULTILINE)
    re_offl = re.compile(r'^X-OfflineIMAP: .*\n', re.MULTILINE)
    msg = re_tuid.sub('', msg)
    msg = re_offl.sub('', msg)
    # It seems that some messages, maybe those sent by the Roundcube web
    # interface, are saved by OfflineIMAP without the last newline, but I
    # am not sure, so add the ending newline if not present.
    # endswith() is also safe on an empty string, unlike msg[-1].
    if not msg.endswith('\n'):
        msg += '\n'
    return msg
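
The TODO above (touching only the headers) could be handled roughly as
in the sketch below. It is untested on real mailboxes. The names
normalize_headers_only and message_digest are made up for illustration,
and it assumes that the header block ends at the first empty line, that
neither header is folded across several lines, and that the message is
already decoded to a str.

```python
import hashlib
import re

# Headers that, per this thread, are added locally by mbsync / OfflineIMAP.
LOCAL_HEADERS = re.compile(r'^(?:X-TUID|X-OfflineIMAP): .*\n', re.MULTILINE)

def normalize_headers_only(msg):
    """Hypothetical variant: strip local headers from the header block only."""
    msg = msg.replace('\r\n', '\n')
    if not msg.endswith('\n'):
        msg += '\n'
    # Assumption: the first empty line separates headers from body.
    idx = msg.find('\n\n')
    if idx == -1:
        head, body = msg, ''
    else:
        # Keep the newline that ends the last header line in `head`;
        # `body` then starts with the blank separator line.
        head, body = msg[:idx + 1], msg[idx + 1:]
    return LOCAL_HEADERS.sub('', head) + body

def message_digest(msg):
    """MD5 of the normalized text, for comparison purposes only."""
    return hashlib.md5(normalize_headers_only(msg).encode('utf-8')).hexdigest()
```

With this, two copies that differ only in their X-TUID header should
give the same digest, while a line in the body that happens to start
with "X-TUID: " is left alone.
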
Kind regards,
Daniele