Removing Duplicate E-mail Messages From A Mailbox

Occasionally your mail delivery scheme might hiccup, leaving you with duplicate copies of email messages sitting in your mailboxes. I find this happens if something goes wrong with fetchmail: if you kill the fetchmail process before it has expunged the deleted email from the remote POP3 server, the next time you run fetchmail it downloads a second copy of each email. This is a simple process that I came up with to remove duplicate email messages from a maildir-format mailbox.

As a bit of background, a maildir mailbox is a small directory tree:

$ du .boxes.xml-dev
4       .boxes.xml-dev/tmp
124     .boxes.xml-dev/new
52340   .boxes.xml-dev/cur
52920   .boxes.xml-dev
$

Hierarchy is represented by components of the mailbox name separated by dots, so the mailbox above is called xml-dev and it sits inside the boxes mailbox. Messages are files in either the new or cur directories. Transport agents place messages into the new directory. When a user agent opens a mailbox it moves all the messages from new to cur. If you're accessing your mail through an IMAP server like Courier-IMAP, the IMAP server will deal with this for you.
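To get a quick picture of where the messages in a mailbox currently sit, something like the following rough sketch may help. The function name maildir_counts is made up for illustration, and it assumes the directory you pass it has the usual tmp, new and cur subdirectories.

```shell
# Print how many message files each maildir subdirectory holds.
# Usage: maildir_counts .boxes.xml-dev
maildir_counts() {
    for sub in tmp new cur; do
        # find copes with mailboxes too large for a shell glob;
        # tr strips the padding some wc implementations add
        n=$(find "$1/$sub" -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "$sub" "$n"
    done
}
```

If new shows anything other than 0, the mailbox has deliveries that no user agent has picked up yet.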

  1. Make sure there's nothing sitting in the new subdirectory.

    $ ls new
    $ 
    
    If there are messages in the new subdirectory, open the mailbox in a user agent to get it to move them into cur.

  2. See how many messages you have:

    $ ls cur | wc -l
        842
    $ 
    

  3. Check they all have Message-IDs (the count should match the one from step 2):

    $ for i in cur/*; do reformail -x Message-ID: <"$i"; done | wc -l
        842
    $ 
    

  4. See how many you have if you filter out duplicate Message-IDs:

    $ for i in cur/*; do reformail -x Message-ID: <"$i"; done | sort -u | wc -l
        698
    $ 
    

  5. See how many we're going to delete:

    $ rm -f /tmp/dups
    $ for i in cur/*; do reformail -D 20000 /tmp/dups <"$i" && echo "$i"; done | wc -l
        144
    $ expr 698 + 144
    842
    $ 
    
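    reformail -D succeeds only when it has already seen the message's Message-ID, so the loop prints just the duplicates. If this total doesn't match you should increase the 20000, which is the size of the Message-ID cache kept in /tmp/dups: reformail isn't remembering enough Message-IDs to spot all the duplicates.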

  6. Delete the messages and check things look right afterwards:

    $ rm -f /tmp/dups
    $ for i in cur/*; do reformail -D 20000 /tmp/dups <"$i" && rm "$i"; done
    $ ls cur | wc -l
        698
    $ 
    

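As a cross-check on the count in step 5 that doesn't depend on reformail, here is a rough sketch that finds duplicates with awk instead. This is a swapped-in technique, not what reformail does internally. The function names msgid and list_duplicates are made up for illustration, and the sketch assumes each Message-ID header sits on a single unfolded line, that messages use plain LF line endings, and that file names contain no newlines.

```shell
# Print the Message-ID header value of one message file (nothing if absent).
# Only the header block, up to the first blank line, is examined.
msgid() {
    awk '
        /^$/ { exit }                       # blank line: end of headers
        tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")        # drop the header name
            print
            exit
        }
    ' "$1"
}

# Print the path of every message in DIR/cur whose Message-ID already
# appeared in an earlier file (in shell glob order). Nothing is deleted.
list_duplicates() {
    for f in "$1"/cur/*; do
        [ -f "$f" ] || continue
        id=$(msgid "$f")
        if [ -n "$id" ]; then
            printf '%s\t%s\n' "$id" "$f"
        fi
    done | awk '{
        i = index($0, "\t")                 # split at the first tab only
        id = substr($0, 1, i - 1)
        path = substr($0, i + 1)
        if (seen[id]++) print path          # second and later sightings
    }'
}
```

Run from inside the mailbox directory, `list_duplicates . | wc -l` should give the same number as step 5 if the assumptions above hold for your mail. If the two disagree, I'd look at the messages involved before deleting anything.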

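If deleting outright makes you nervous, one option is to move the duplicates somewhere else first and remove them later. The sketch below is only an illustration: the function name quarantine and the backup directory in the usage comment are hypothetical, and it assumes one path per line on standard input with no newlines in file names.

```shell
# Move every path read from standard input into the directory given as $1,
# creating that directory if needed. Stops at the first failed move.
quarantine() {
    mkdir -p "$1" || return 1
    while IFS= read -r path; do
        mv -- "$path" "$1/" || return 1
    done
}

# Possible usage from inside the mailbox directory, in place of the rm loop:
#   rm -f /tmp/dups
#   for i in cur/*; do reformail -D 20000 /tmp/dups <"$i" && echo "$i"; done |
#       quarantine "$HOME/mail-dups"
```

Keep the backup directory outside the maildir tree so that no mail program mistakes it for a mailbox.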
$Id: mail-duplicates.html,v 1.1 2003/04/22 10:32:59 mhw Exp $