mail_deduplicate package¶
Expose package-wide elements.
- mail_deduplicate.HASH_HEADERS: tuple[str, ...] = ('Date', 'From', 'To', 'Subject', 'MIME-Version', 'Content-Type', 'Content-Disposition', 'User-Agent', 'X-Priority', 'Message-ID')¶
Default ordered list of headers to use to compute the unique hash of a mail.
By default we choose to exclude:
Cc: since mailman apparently sometimes trims list members from the Cc header to avoid sending duplicates. This means that copies of mail reflected back from the list server will have a different Cc to the copy saved by the MUA at send-time.
Bcc: because copies of the mail saved by the MUA at send-time will have Bcc, but copies reflected back from the list server won’t.
Reply-To: since a mail could be Cc’d to two lists with different Reply-To munging options set.
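To illustrate why excluding Cc, Bcc and Reply-To matters, here is a minimal sketch of hashing only the selected headers. The function name `sketch_header_hash`, the choice of SHA-224 and the way headers are joined are assumptions made for illustration; they are not taken from the package's source.

```python
import hashlib
from email import message_from_string

# Same header names as HASH_HEADERS above; the hashing scheme is an assumption.
HASH_HEADERS = (
    "Date", "From", "To", "Subject", "MIME-Version", "Content-Type",
    "Content-Disposition", "User-Agent", "X-Priority", "Message-ID",
)


def sketch_header_hash(raw_mail: str) -> str:
    """Hash only the selected headers, in order, ignoring everything else."""
    msg = message_from_string(raw_mail)
    parts = []
    for name in HASH_HEADERS:
        value = msg.get(name)
        if value is not None:
            parts.append(f"{name}: {value.strip()}")
    return hashlib.sha224("\n".join(parts).encode("utf-8")).hexdigest()


# Copy saved by the MUA at send-time: still carries Cc and Bcc.
sent_copy = (
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "From: alice@example.com\n"
    "To: list@example.com\n"
    "Cc: bob@example.com\n"
    "Bcc: carol@example.com\n"
    "Subject: Hello\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    "Body\n"
)
# Copy reflected back by the list server: no Cc/Bcc, munged Reply-To.
list_copy = (
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "From: alice@example.com\n"
    "To: list@example.com\n"
    "Reply-To: list@example.com\n"
    "Subject: Hello\n"
    "Message-ID: <1@example.com>\n"
    "\n"
    "Body\n"
)
same_hash = sketch_header_hash(sent_copy) == sketch_header_hash(list_copy)
```

Both copies end up with the same hash because the headers that differ between them are not part of the hashed set.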
- mail_deduplicate.MINIMAL_HEADERS_COUNT = 4¶
Below this value, we consider not having enough headers to compute a solid hash.
- mail_deduplicate.DEFAULT_SIZE_THRESHOLD = 512¶
Default size threshold in bytes.
Since we’re ignoring the Content-Length header by default because of mailing-list effects, we introduced a limit on the allowed difference between the sizes of the message payloads.
If this is exceeded, a warning is issued and the messages are not considered duplicates, because this could point to message corruption somewhere, or a false positive.
Note
Headers are not counted towards this threshold, because many headers can be added by mailing list software such as mailman, or even by the process of sending the mail through various MTAs.
One copy could have been stored by the sender’s MUA prior to sending, without any Received: headers, and another copy could be reflected back via a Cc-to-self mechanism or mailing list server.
This threshold has to be large enough to allow for footers added by mailing list servers.
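The size check described above can be sketched as follows. The helper names `body_size` and `within_size_threshold` are hypothetical, and measuring the payload as UTF-8 bytes is an assumption; only the 512-byte value comes from this page.

```python
from email import message_from_string

SIZE_THRESHOLD = 512  # same value as DEFAULT_SIZE_THRESHOLD


def body_size(raw_mail: str) -> int:
    """Size of the payload only; headers are deliberately left out."""
    payload = message_from_string(raw_mail).get_payload()
    return len(payload.encode("utf-8"))


def within_size_threshold(raw_a: str, raw_b: str,
                          threshold: int = SIZE_THRESHOLD) -> bool:
    """True when the two payloads differ by at most `threshold` bytes."""
    return abs(body_size(raw_a) - body_size(raw_b)) <= threshold


mua_copy = "Subject: Hi\n\nHello world\n"
# Extra Received: headers and a short list footer must not break the match.
list_copy = (
    "Received: from mta.example.com\n"
    "Received: from list.example.com\n"
    "Subject: Hi\n\nHello world\n-- \nlist footer\n"
)
# A much larger body hints at corruption or a false positive.
corrupted = "Subject: Hi\n\n" + "x" * 2000
```

Here `within_size_threshold(mua_copy, list_copy)` holds despite the extra headers, while the oversized `corrupted` copy is rejected.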
- mail_deduplicate.DEFAULT_CONTENT_THRESHOLD = 768¶
Default content threshold in bytes.
As above, we similarly generate unified diffs of duplicates and ensure that the diff is not greater than a certain size, to limit false positives.
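As a rough illustration of measuring a unified diff, the sketch below counts the bytes of added and removed lines with the standard `difflib` module. The function `diff_size` and the exact counting rule are assumptions; the package may measure the diff differently.

```python
import difflib
from itertools import islice

CONTENT_THRESHOLD = 768  # same value as DEFAULT_CONTENT_THRESHOLD


def diff_size(lines_a: list, lines_b: list) -> int:
    """Count the bytes of added and removed lines in a unified diff."""
    diff = difflib.unified_diff(lines_a, lines_b, lineterm="", n=0)
    total = 0
    # The first two lines are the '---' and '+++' file headers: skip them.
    for line in islice(diff, 2, None):
        if line.startswith("@@"):
            continue  # hunk header, not content
        # Drop the leading '+' or '-' marker before measuring.
        total += len(line[1:].encode("utf-8"))
    return total


original = ["hello", "world"]
with_footer = ["hello", "world", "-- footer"]
footer_cost = diff_size(original, with_footer)
is_duplicate_candidate = footer_cost <= CONTENT_THRESHOLD
```

A short mailing-list footer stays well below the threshold, so the two bodies would still be treated as candidates for the same duplicate set.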
- mail_deduplicate.TIME_SOURCES = frozenset({'ctime', 'date-header'})¶
Methods used to extract a mail’s canonical timestamp:
date-header: sourced from the message’s Date header.
ctime: sourced from the email’s file on the filesystem. Only available for maildir sources.
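The two time sources can be sketched with the standard library alone. The function `sketch_timestamp` and its signature are hypothetical; using `os.path.getctime` for the `ctime` source is an assumption about how the file timestamp would be read.

```python
import os
import tempfile
from email import message_from_string
from email.utils import parsedate_to_datetime
from typing import Optional

TIME_SOURCES = frozenset({"ctime", "date-header"})


def sketch_timestamp(raw_mail: str, time_source: str = "date-header",
                     path: Optional[str] = None) -> float:
    """Return a POSIX timestamp from the chosen source."""
    if time_source not in TIME_SOURCES:
        raise ValueError(f"Unknown time source: {time_source}")
    if time_source == "ctime":
        if path is None:
            raise ValueError("ctime requires the path of the mail file")
        return os.path.getctime(path)
    # date-header: parse the RFC 5322 date found in the message.
    date_value = message_from_string(raw_mail)["Date"]
    return parsedate_to_datetime(date_value).timestamp()


raw = "Date: Thu, 01 Jan 1970 00:01:00 +0000\nSubject: Hi\n\nbody\n"
from_header = sketch_timestamp(raw)

# Write the mail to a temporary file to demonstrate the ctime source.
with tempfile.NamedTemporaryFile("w", suffix=".eml", delete=False) as handle:
    handle.write(raw)
from_ctime = sketch_timestamp(raw, "ctime", handle.name)
os.unlink(handle.name)
```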
- exception mail_deduplicate.TooFewHeaders[source]¶
Bases: Exception
Not enough headers were found to produce a solid hash.
- exception mail_deduplicate.SizeDiffAboveThreshold[source]¶
Bases: Exception
Difference in mail size is greater than `threshold <https://kdeldycke.github.io/mail-deduplicate/mail_deduplicate.html#mail_deduplicate.DEFAULT_SIZE_THRESHOLD>`_.
- exception mail_deduplicate.ContentDiffAboveThreshold[source]¶
Bases: Exception
Difference in mail content is greater than `threshold <https://kdeldycke.github.io/mail-deduplicate/mail_deduplicate.html#mail_deduplicate.DEFAULT_CONTENT_THRESHOLD>`_.
- class mail_deduplicate.Config(**kwargs)[source]¶
Bases: object
Holds global configuration.
Validates configuration parameter types and values.
- default_conf = {'action': None, 'content_threshold': 768, 'dry_run': False, 'export': None, 'export_append': False, 'export_format': 'mbox', 'force_unlock': False, 'hash_body': None, 'hash_headers': ('Date', 'From', 'To', 'Subject', 'MIME-Version', 'Content-Type', 'Content-Disposition', 'User-Agent', 'X-Priority', 'Message-ID'), 'hash_only': False, 'input_format': False, 'regexp': None, 'show_diff': False, 'size_threshold': 512, 'strategy': None, 'time_source': None}¶
Submodules¶
mail_deduplicate.action module¶
- mail_deduplicate.action.DELETE_DISCARDED = 'delete-discarded'¶
Define all available action IDs.
- mail_deduplicate.action.copy_mails(dedup, mails)[source]¶
Copy provided mails to a brand new box or an existing one.
- Return type:
- mail_deduplicate.action.move_mails(dedup, mails)[source]¶
Move provided mails to a brand new box or an existing one.
- Return type:
- mail_deduplicate.action.delete_mails(dedup, mails)[source]¶
Remove provided mails in-place, from their original boxes.
- Return type:
- mail_deduplicate.action.copy_selected(dedup)[source]¶
Copy all selected mails to a brand new box.
- Return type:
- mail_deduplicate.action.copy_discarded(dedup)[source]¶
Copy all discarded mails to a brand new box.
- Return type:
- mail_deduplicate.action.move_selected(dedup)[source]¶
Move all selected mails to a brand new box.
- Return type:
- mail_deduplicate.action.move_discarded(dedup)[source]¶
Move all discarded mails to a brand new box.
- Return type:
- mail_deduplicate.action.delete_selected(dedup)[source]¶
Remove in-place all selected mails, from their original boxes.
- Return type:
- mail_deduplicate.action.delete_discarded(dedup)[source]¶
Remove in-place all discarded mails, from their original boxes.
- Return type:
- mail_deduplicate.action.ACTIONS = {'copy-discarded': <function copy_discarded>, 'copy-selected': <function copy_selected>, 'delete-discarded': <function delete_discarded>, 'delete-selected': <function delete_selected>, 'move-discarded': <function move_discarded>, 'move-selected': <function move_selected>}¶
Map action IDs to their implementations.
mail_deduplicate.cli module¶
- mail_deduplicate.cli.validate_regexp(ctx, param, value)[source]¶
Validate and compile regular expression provided as parameters to the CLI.
- class mail_deduplicate.cli.MdedupCommand(*args, version=None, extra_option_at_end=True, populate_auto_envvars=True, **kwargs)[source]¶
Bases: ExtraCommand
List of extra parameters:
- Parameters:
version (str | None) – allows a version string to be set directly on the command. Will be passed to the first instance of ExtraVersionOption parameter attached to the command.
extra_option_at_end (bool) – reorders all parameters attached to the command, by moving all instances of ExtraOption at the end of the parameter list. The original order of the options is preserved among themselves.
populate_auto_envvars (bool) – forces all parameters to have their auto-generated environment variables registered. This addresses the shortcoming of click which only evaluates them dynamically. By forcing their registration, the auto-generated environment variables get displayed in the help screen, fixing click#2483 issue.
By default, these Click context settings are applied:
auto_envvar_prefix = self.name (Click feature)
Auto-generate environment variables for all options, using the command ID as prefix. The prefix is normalized to be uppercased and all non-alphanumerics replaced by underscores.
help_option_names = ("--help", "-h") (Click feature)
Allow help screen to be invoked with either --help or -h options.
show_default = True (Click feature)
Show all default values in help screen.
Additionally, these Cloup context settings are set:
align_option_groups = False (Cloup feature)
show_constraints = True (Cloup feature)
show_subcommand_aliases = True (Cloup feature)
Click Extra also adds its own context_settings:
show_choices = None (Click Extra feature)
If set to True or False, will force that value on all options, so we can globally show or hide choices when prompting a user for input. Only makes sense for options whose prompt property is set.
Defaults to None, which will leave all options untouched, and let them decide their own show_choices setting.
show_envvar = None (Click Extra feature)
If set to True or False, will force that value on all options, so we can globally enable or disable the display of environment variables in help screen.
Defaults to None, which will leave all options untouched, and let them decide their own show_envvar setting. The rationale being that discoverability of environment variables is enabled by the --show-params option, which is active by default on extra commands. So there is no need to overload the help screen.
This addresses the click#2313 issue.
To override these defaults, you can pass your own settings with the context_settings parameter:
    @extra_command(
        context_settings={
            "show_default": False,
            ...
        }
    )
mail_deduplicate.deduplicate module¶
- mail_deduplicate.deduplicate.STATS_DEF = {'mail_copied': 'Number of mails copied from their original mailbox to another.', 'mail_deleted': 'Number of mails deleted from their mailbox in-place.', 'mail_discarded': 'Number of mails discarded from the final selection.', 'mail_duplicates': 'Number of duplicate mails (sum of mails in all duplicate sets with at least 2 mails).', 'mail_found': 'Total number of mails encountered from all mail sources.', 'mail_hashes': 'Number of unique hashes.', 'mail_moved': 'Number of mails moved from their original mailbox to another.', 'mail_rejected': 'Number of mails rejected individually because they were unparsable or did not have enough metadata to compute hashes.', 'mail_retained': 'Number of valid mails parsed and retained for deduplication.', 'mail_selected': 'Number of mails kept in the final selection on which the action will be performed.', 'mail_skipped': 'Number of mails ignored in the selection step because the whole set they belong to was skipped.', 'mail_unique': 'Number of unique mails (which where automatically added to selection).', 'set_deduplicated': 'Number of valid sets on which the selection strategy was successfully applied.', 'set_single': 'Total number of sets containing only a single mail with no applicable strategy. They were automatically kept in the final selection.', 'set_skipped_content': 'Number of sets skipped from the selection process because they were too dissimilar in content.', 'set_skipped_encoding': 'Number of sets skipped from the selection process because they had encoding issues.', 'set_skipped_size': 'Number of sets skipped from the selection process because they were too dissimilar in size.', 'set_skipped_strategy': 'Number of sets skipped from the selection process because the strategy could not be applied.', 'set_total': 'Total number of duplicate sets.'}¶
All tracked statistics and their definition.
- mail_deduplicate.deduplicate.BODY_HASHERS = {'normalized': <function <lambda>>, 'raw': <function <lambda>>, 'skip': <function <lambda>>}¶
Method used to hash the body of mails.
- class mail_deduplicate.deduplicate.DuplicateSet(hash_key, mail_set, conf)[source]¶
Bases: object
A set of mails sharing the same hash.
Implements all the safety checks required before we can apply any selection strategy.
Load-up the duplicate set of mail and freeze pool.
Once loaded-up, the pool of parsed mails is considered frozen for the rest of the duplicate set’s life. This allows aggressive caching of lazy instance attributes depending on the pool content.
- property newest_timestamp¶
Returns the newest timestamp among all mails in the set.
- property oldest_timestamp¶
Returns the oldest timestamp among all mails in the set.
- property biggest_size¶
Returns the biggest size among all mails in the set.
- property smallest_size¶
Returns the smallest size among all mails in the set.
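Because the pool is frozen once loaded, aggregate values such as these four properties can be computed once and cached. The sketch below shows the idea with `functools.cached_property`; the classes `SketchMail` and `SketchDuplicateSet` are hypothetical stand-ins, and the real class may cache differently.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass(frozen=True)
class SketchMail:
    """Hypothetical stand-in for a parsed mail."""
    timestamp: float
    size: int


class SketchDuplicateSet:
    def __init__(self, mails):
        # Freeze the pool so cached values can never go stale.
        self.pool = frozenset(mails)

    @cached_property
    def newest_timestamp(self) -> float:
        return max(mail.timestamp for mail in self.pool)

    @cached_property
    def oldest_timestamp(self) -> float:
        return min(mail.timestamp for mail in self.pool)

    @cached_property
    def biggest_size(self) -> int:
        return max(mail.size for mail in self.pool)

    @cached_property
    def smallest_size(self) -> int:
        return min(mail.size for mail in self.pool)


dup_set = SketchDuplicateSet([SketchMail(10.0, 100), SketchMail(20.0, 300)])
```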
- check_differences()[source]¶
Ensures all mails differ only within the limits imposed by the size and content thresholds.
Compare all mails of the duplicate set with each other, both in size and content. Raise an error if we’re not within the limits imposed by the threshold settings.
- diff(mail_a, mail_b)[source]¶
Return difference in bytes between two mails’ normalized body.
Todo
Rewrite the diff algorithm to not rely on naive unified diff result parsing.
- class mail_deduplicate.deduplicate.Deduplicate(conf)[source]¶
Bases: object
Load-up messages, search for duplicates, apply selection strategy and perform the action.
Similar messages sharing the same hash are grouped together in a DuplicateSet.
- add_source(source_path)[source]¶
Registers a source of mails, validates and opens it.
Duplicate sources of mails are not allowed, as when we perform the action, we use the path as a unique key to tie a mail back to its source.
- Return type:
- hash_all()[source]¶
Browse all mails from all registered sources, compute hashes and group mails by hash.
Displays a progress bar as the operation might be slow.
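The grouping step performed by hash_all() can be pictured as a plain dictionary keyed by hash. This is a minimal sketch: `group_by_hash` and the tuple-based mails are hypothetical, and the progress bar is left out.

```python
from collections import defaultdict


def group_by_hash(mails, hash_func):
    """Group mails sharing the same hash; each group is one candidate set."""
    groups = defaultdict(set)
    for mail in mails:
        groups[hash_func(mail)].add(mail)
    return dict(groups)


# Hypothetical mails as (mail_id, message_id) pairs from two sources.
mails = [
    ("inbox/1", "<a@example.com>"),
    ("archive/7", "<a@example.com>"),
    ("inbox/2", "<b@example.com>"),
]
# For illustration only, the Message-ID alone plays the role of the hash.
groups = group_by_hash(mails, hash_func=lambda mail: mail[1])
duplicate_sets = {key: pool for key, pool in groups.items() if len(pool) > 1}
```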
mail_deduplicate.mail module¶
- class mail_deduplicate.mail.DedupMail(message=None)[source]¶
Bases: object
Message with deduplication-specific properties and utilities.
Extends standard library’s mailbox.Message, and shouldn’t be used directly, but composed with mailbox.Message sub-classes.
Initialize a pre-parsed Message instance the same way the default factory in Python’s mailbox module does.
- add_box_metadata(box, mail_id)[source]¶
Post-instantiation utility to attach to mail some metadata derived from its parent box.
Called right after the __init__() constructor.
This allows the mail to carry its own information on its origin box and index.
- property uid¶
Unique ID of the mail.
- property timestamp¶
Compute the normalized canonical timestamp of the mail.
Sourced from the message’s Date header by default. In the case of maildir, can be sourced from the email’s file from the filesystem.
Warning
ctime does not refer to creation time on POSIX systems, but rather the last time the inode data changed.
Todo
Investigate what mailbox.MaildirMessage.get_date() does and if we can use it.
- property size¶
Returns canonical mail size.
Size is computed as the length of the message body, i.e. the payload of the mail stripped of all its headers, not from the mail file persisting on the filesystem.
Todo
Allow customization of the way the size is computed, by getting the file size instead for example:
    size = os.path.getsize(mail_file)
- property body_lines¶
Return a normalized list of lines from message’s body.
- property subject¶
Normalized subject.
Only used for debugging and human-friendly logging.
- hash_key()[source]¶
Returns the canonical hash of a mail.
Caution
This method hasn’t been made explicitly into a cached property in order to reduce the overall memory footprint.
- property hash_raw_body¶
Returns the canonical body hash of a mail.
- property hash_normalized_body¶
Returns the normalized body hash of a mail.
- property canonical_headers¶
Returns the full list of all canonical headers names and values in preparation for hashing.
- pretty_canonical_headers()[source]¶
Renders a table of headers names and values used to produce the mail’s hash.
Caution
This method hasn’t been made explicitly into a cached property in order to reduce the overall memory footprint.
Returns a string ready to be printed.
mail_deduplicate.mail_box module¶
Patch Python’s standard library mail box constructors.
Python’s `mailbox module <https://docs.python.org/3.11/library/mailbox.html>`_ needs some tweaks and sane defaults.
- mail_deduplicate.mail_box.build_box_constructors()[source]¶
Build our own mail constructors for each subclass of mailbox.Mailbox.
Gather all constructors defined by the standard Python library and augment them with our DedupMail class.
Only augment subclasses of the mailbox.Mailbox interface. Ignore mailbox.Mailbox itself, as well as all classes whose name starts with an underscore.
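The construction described above might look like the following sketch, which relies only on the standard `mailbox`, `inspect` and `functools` modules. `SketchDedupMixin` is a hypothetical stand-in for DedupMail, and pairing each box class with the `<Name>Message` class of the same name is an assumption. Note that in CPython `mbox`, `MMDF` and `Babyl` inherit from `mailbox.Mailbox` through private intermediate classes, so the sketch accepts any public subclass.

```python
import inspect
import mailbox
from functools import partial


class SketchDedupMixin:
    """Hypothetical stand-in for DedupMail."""


def sketch_box_constructors() -> dict:
    constructors = {}
    for name, klass in inspect.getmembers(mailbox, inspect.isclass):
        if not issubclass(klass, mailbox.Mailbox):
            continue
        # Skip the abstract interface and private helper classes.
        if klass is mailbox.Mailbox or name.startswith("_"):
            continue
        # Assumption: each box type has a matching '<Name>Message' class.
        message_class = getattr(mailbox, f"{name}Message")
        # Compose the mixin with the format-specific message class.
        factory = type(f"{name}DedupMail", (SketchDedupMixin, message_class), {})
        # create=False: never silently create a missing mailbox.
        constructors[name.lower()] = partial(klass, factory=factory, create=False)
    return constructors


constructors = sketch_box_constructors()
box_type_ids = sorted(constructors)
```

The resulting keys line up with the box type IDs listed in BOX_TYPES.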
- mail_deduplicate.mail_box.BOX_TYPES = {'babyl': functools.partial(<class 'mailbox.Babyl'>, factory=<class 'mail_deduplicate.mail_box.BabylDedupMail'>, create=False), 'maildir': functools.partial(<class 'mailbox.Maildir'>, factory=<class 'mail_deduplicate.mail_box.MaildirDedupMail'>, create=False), 'mbox': functools.partial(<class 'mailbox.mbox'>, factory=<class 'mail_deduplicate.mail_box.mboxDedupMail'>, create=False), 'mh': functools.partial(<class 'mailbox.MH'>, factory=<class 'mail_deduplicate.mail_box.MHDedupMail'>, create=False), 'mmdf': functools.partial(<class 'mailbox.MMDF'>, factory=<class 'mail_deduplicate.mail_box.MMDFDedupMail'>, create=False)}¶
Mapping between supported box type IDs and their constructors.
- mail_deduplicate.mail_box.BOX_STRUCTURES = {'file': {'babyl', 'mbox', 'mmdf'}, 'folder': {'maildir', 'mh'}}¶
Categorize each box type into its structure type.
- mail_deduplicate.mail_box.MAILDIR_SUBDIRS = frozenset({'cur', 'new', 'tmp'})¶
List of required sub-folders defining a properly structured maildir.
- mail_deduplicate.mail_box.autodetect_box_type(path)[source]¶
Auto-detect the format of the mailbox located at the provided path.
Returns a box type as indexed in the BOX_TYPES dictionary above.
If the path is a file, then it is considered as an mbox. Else, if the provided path is a folder and features the expected sub-directories, it is parsed as a maildir.
- Return type: str
Note
Future finer autodetection heuristics should be implemented here.
- Some ideas:
single mail from a
maildir
plain text mail content
- other mailbox formats supported in Python’s standard library:
MH
Babyl
MMDF
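The detection rule documented for autodetect_box_type (a file is an mbox, a folder with the required sub-directories is a maildir) can be sketched as below. The function `sketch_autodetect` and the `ValueError` raised for unrecognized paths are assumptions for illustration.

```python
import tempfile
from pathlib import Path

MAILDIR_SUBDIRS = frozenset({"cur", "new", "tmp"})


def sketch_autodetect(path: Path) -> str:
    """Return 'mbox' for a file, 'maildir' for a properly structured folder."""
    if path.is_file():
        return "mbox"
    if path.is_dir():
        subdirs = {child.name for child in path.iterdir() if child.is_dir()}
        if MAILDIR_SUBDIRS.issubset(subdirs):
            return "maildir"
    raise ValueError(f"Unrecognized mail source type: {path}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    mbox_file = root / "inbox.mbox"
    mbox_file.write_text("")
    maildir = root / "Maildir"
    for sub in MAILDIR_SUBDIRS:
        (maildir / sub).mkdir(parents=True)
    detected = (sketch_autodetect(mbox_file), sketch_autodetect(maildir))
```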
- mail_deduplicate.mail_box.open_box(path, box_type=False, force_unlock=False)[source]¶
Open a mail box.
Returns a list of boxes, one per sub-folder. All are locked, ready for operations.
If box_type is provided, forces the opening of the box in the specified format. Else, defaults to autodetection.
- mail_deduplicate.mail_box.lock_box(box, force_unlock)[source]¶
Lock an opened box, and allow for forced unlocking.
Returns the locked box.
mail_deduplicate.strategy module¶
Strategy definitions.
- mail_deduplicate.strategy.select_older(duplicates)[source]¶
Select all older duplicates.
Discards the newest, i.e. the subset sharing the most recent timestamp.
- mail_deduplicate.strategy.select_oldest(duplicates)[source]¶
Select all the oldest duplicates.
Discards the newer ones, i.e. all mails of the duplicate set except those sharing the oldest timestamp.
- mail_deduplicate.strategy.select_newer(duplicates)[source]¶
Select all newer duplicates.
Discards the oldest, i.e. the subset sharing the most ancient timestamp.
- mail_deduplicate.strategy.select_newest(duplicates)[source]¶
Select all the newest duplicates.
Discards the older ones, i.e. all mails of the duplicate set except those sharing the newest timestamp.
- mail_deduplicate.strategy.select_smaller(duplicates)[source]¶
Select all smaller duplicates.
Discards the biggest, i.e. the subset sharing the biggest size.
- mail_deduplicate.strategy.select_smallest(duplicates)[source]¶
Select all the smallest duplicates.
Discards the bigger ones, i.e. all mails of the duplicate set except those sharing the smallest size.
- mail_deduplicate.strategy.select_bigger(duplicates)[source]¶
Select all bigger duplicates.
Discards the smallest, i.e. the subset sharing the smallest size.
- mail_deduplicate.strategy.select_biggest(duplicates)[source]¶
Select all the biggest duplicates.
Discards the smaller ones, i.e. all mails of the duplicate set except those sharing the biggest size.
- mail_deduplicate.strategy.select_matching_path(duplicates)[source]¶
Select all duplicates whose file path matches the regular expression provided via the --regexp parameter.
- mail_deduplicate.strategy.select_non_matching_path(duplicates)[source]¶
Select all duplicates whose file path doesn’t match the regular expression provided via the --regexp parameter.
- mail_deduplicate.strategy.select_one(duplicates)[source]¶
Randomly select one duplicate, and discard all others.
- mail_deduplicate.strategy.select_all_but_one(duplicates)[source]¶
Randomly discard one duplicate, and select all others.
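To make the difference between the "-est" and "-er" strategies concrete, here is a minimal sketch of two time-based selections. The `Mail` tuple and the `sketch_` functions are hypothetical; the real strategies take a DuplicateSet rather than a plain collection.

```python
from collections import namedtuple

# Hypothetical minimal mail record.
Mail = namedtuple("Mail", ["name", "timestamp"])


def sketch_select_newest(pool):
    """Keep only the mails sharing the newest timestamp."""
    newest = max(mail.timestamp for mail in pool)
    return {mail for mail in pool if mail.timestamp == newest}


def sketch_select_older(pool):
    """Keep every mail except those sharing the newest timestamp."""
    newest = max(mail.timestamp for mail in pool)
    return {mail for mail in pool if mail.timestamp < newest}


pool = {Mail("a", 1), Mail("b", 2), Mail("c", 3), Mail("d", 3)}
newest = sketch_select_newest(pool)
older = sketch_select_older(pool)
```

With this pool, the two selections are complementary: `newest` holds `c` and `d`, while `older` holds `a` and `b`.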
- mail_deduplicate.strategy.SELECT_NEWEST = 'select-newest'¶
Time-based strategies.
- mail_deduplicate.strategy.SELECT_BIGGEST = 'select-biggest'¶
Size-based strategies.
- mail_deduplicate.strategy.SELECT_NON_MATCHING_PATH = 'select-non-matching-path'¶
Location-based strategies.
- mail_deduplicate.strategy.SELECT_ALL_BUT_ONE = 'select-all-but-one'¶
Quantity-based strategies.
- mail_deduplicate.strategy.STRATEGY_ALIASES = frozenset({('select-all-but-one', 'discard-one'), ('select-bigger', 'discard-smallest'), ('select-biggest', 'discard-smaller'), ('select-matching-path', 'discard-non-matching-path'), ('select-newer', 'discard-oldest'), ('select-newest', 'discard-older'), ('select-non-matching-path', 'discard-matching-path'), ('select-older', 'discard-newest'), ('select-oldest', 'discard-newer'), ('select-one', 'discard-all-but-one'), ('select-smaller', 'discard-biggest'), ('select-smallest', 'discard-bigger')})¶
Groups strategy aliases and their definitions.
Aliases are great usability features, as they help users reason about the selection operators according to their own mental models.
- mail_deduplicate.strategy.get_method_id(strategy_id)[source]¶
Transform strategy ID to its method ID.
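One plausible reading of this transformation is sketched below: resolve a `discard-*` alias to its `select-*` counterpart, then turn dashes into underscores to obtain a function name. Both steps are assumptions, as is the name `sketch_method_id`; only the alias pairs come from STRATEGY_ALIASES above, and just a subset is reproduced here.

```python
# Subset of the (strategy, alias) pairs listed in STRATEGY_ALIASES.
STRATEGY_ALIASES = frozenset({
    ("select-newest", "discard-older"),
    ("select-older", "discard-newest"),
    ("select-one", "discard-all-but-one"),
})


def sketch_method_id(strategy_id: str) -> str:
    """Map a strategy ID or one of its aliases to a Python function name."""
    alias_to_strategy = {alias: strategy for strategy, alias in STRATEGY_ALIASES}
    # Assumption: aliases resolve to their 'select-*' definition first.
    canonical = alias_to_strategy.get(strategy_id, strategy_id)
    # Assumption: the method ID is the strategy ID with underscores.
    return canonical.replace("-", "_")


from_alias = sketch_method_id("discard-older")
from_strategy = sketch_method_id("select-one")
```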