mail_deduplicate package
Expose package-wide elements.
Submodules
mail_deduplicate.action module
- mail_deduplicate.action.export_box(dedup)
Context manager for export box operations.
- mail_deduplicate.action.copy_mails(dedup, mails)
Copy provided mails to a brand new box or an existing one.
- mail_deduplicate.action.move_mails(dedup, mails)
Move provided mails to a brand new box or an existing one.
- mail_deduplicate.action.delete_mails(dedup, mails)
Remove provided mails in-place, from their original boxes.
- mail_deduplicate.action.copy_selected(dedup)
Copy all selected mails to a brand new box.
- mail_deduplicate.action.copy_discarded(dedup)
Copy all discarded mails to a brand new box.
- mail_deduplicate.action.move_selected(dedup)
Move all selected mails to a brand new box.
- mail_deduplicate.action.move_discarded(dedup)
Move all discarded mails to a brand new box.
- mail_deduplicate.action.delete_selected(dedup)
Remove in-place all selected mails, from their original boxes.
- mail_deduplicate.action.delete_discarded(dedup)
Remove in-place all discarded mails, from their original boxes.
- class mail_deduplicate.action.Action(*values)
Bases: Enum
Define all available action IDs.
- COPY_SELECTED = 'copy-selected'
- COPY_DISCARDED = 'copy-discarded'
- MOVE_SELECTED = 'move-selected'
- MOVE_DISCARDED = 'move-discarded'
- DELETE_SELECTED = 'delete-selected'
- DELETE_DISCARDED = 'delete-discarded'
mail_deduplicate.cli module
- mail_deduplicate.cli.DEFAULT_HASH_HEADERS: tuple[str, ...] = ('Date', 'From', 'To', 'Subject', 'MIME-Version', 'Content-Type', 'Content-Disposition', 'User-Agent', 'X-Priority', 'Message-ID')
Default ordered list of headers to use to compute the unique hash of a mail.
By default we choose to exclude:
  - CC: since mailman apparently sometimes trims list members from the CC header to avoid sending duplicates. This means that copies of a mail reflected back from the list server will have a different CC from the copy saved by the MUA at send-time.
  - BCC: because copies of the mail saved by the MUA at send-time will have BCC, but copies reflected back from the list server won't.
  - Reply-To: since a mail could be CC'd to two lists with different Reply-To munging options set.
- class mail_deduplicate.cli.Config
Bases: TypedDict
Holds global configuration.
- input_format: BoxFormat | None
- force_unlock: bool
- hash_headers: tuple[str, ...]
- minimal_headers: int
- hash_body: BodyHasher
- hash_only: bool
- size_threshold: int
- content_threshold: int
- show_diff: bool
- strategy: Strategy
- time_source: TimeSource
- regexp: re.Pattern | None
- action: Action
- export: Path | None
- export_format: BoxFormat
- export_append: bool
- dry_run: bool
- mail_deduplicate.cli.normalize_headers(ctx, param, value)
Validate headers provided as parameters to the CLI.
Headers are case-insensitive in Python's implementation, so we normalize them to lower-case.
We then deduplicate them, while preserving order.
Mail headers are expected to be composed of ASCII characters between 33 and 126 (both inclusive) according to RFC 5322.
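The three steps above (lower-casing, validation, order-preserving deduplication) can be sketched as follows. This is a minimal illustration, not the package's implementation: the name normalize_header_ids is hypothetical, it takes a plain iterable of names rather than Click's (ctx, param, value) callback arguments, and raising ValueError is an assumption about error reporting.

```python
def normalize_header_ids(values):
    """Lower-case, validate and deduplicate header names, keeping first-seen order.

    Hypothetical helper: the real callback receives Click's (ctx, param, value).
    """
    normalized = []
    for raw in values:
        header = raw.lower()
        # Accept only characters in the ASCII range 33..126, both inclusive.
        if not header or any(not 33 <= ord(char) <= 126 for char in header):
            raise ValueError(f"Invalid header name: {raw!r}")
        normalized.append(header)
    # dict.fromkeys() drops repeats while preserving insertion order.
    return tuple(dict.fromkeys(normalized))
```

For instance, passing ["Date", "FROM", "date"] keeps a single lower-case date entry followed by from, while a name containing a space is rejected.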
- mail_deduplicate.cli.compile_regexp(ctx, param, value)
Validate and compile the regular expression provided as a parameter to the CLI.
- class mail_deduplicate.cli.MdedupCommand(*args, version=None, extra_option_at_end=True, populate_auto_envvars=True, **kwargs)
Bases: ExtraCommand
List of extra parameters:
- Parameters:
  - version (str | None): allows a version string to be set directly on the command. Will be passed to the first instance of ExtraVersionOption parameter attached to the command.
  - extra_option_at_end (bool): reorders all parameters attached to the command, by moving all instances of ExtraOption to the end of the parameter list. The original order of the options is preserved among themselves.
  - populate_auto_envvars (bool): forces all parameters to have their auto-generated environment variables registered. This addresses a shortcoming of click, which only evaluates them dynamically. By forcing their registration, the auto-generated environment variables get displayed in the help screen, fixing the click#2483 issue. On Windows, environment variable names are case-insensitive, so we normalize them to uppercase.
By default, these Click context settings are applied:
  - auto_envvar_prefix = self.name (Click feature): Auto-generate environment variables for all options, using the command ID as prefix. The prefix is normalized to be uppercased and all non-alphanumerics replaced by underscores.
  - help_option_names = ("--help", "-h") (Click feature): Allow the help screen to be invoked with either the --help or -h option.
  - show_default = True (Click feature): Show all default values in the help screen.
Additionally, these Cloup context settings are set:
  - align_option_groups = False (Cloup feature)
  - show_constraints = True (Cloup feature)
  - show_subcommand_aliases = True (Cloup feature)
Click Extra also adds its own context_settings:
  - show_choices = None (Click Extra feature): If set to True or False, will force that value on all options, so we can globally show or hide choices when prompting a user for input. Only makes sense for options whose prompt property is set. Defaults to None, which will leave all options untouched, and let them decide of their own show_choices setting.
  - show_envvar = None (Click Extra feature): If set to True or False, will force that value on all options, so we can globally enable or disable the display of environment variables in the help screen. Defaults to None, which will leave all options untouched, and let them decide of their own show_envvar setting. The rationale is that discoverability of environment variables is enabled by the --show-params option, which is active by default on extra commands, so there is no need to overload the help screen. This addresses the click#2313 issue.
To override these defaults, you can pass your own settings with the context_settings parameter:
    @command(
        context_settings={
            "show_default": False,
            ...
        }
    )
mail_deduplicate.deduplicate module
- class mail_deduplicate.deduplicate.StatDef(description: str, category: str)
Bases: NamedTuple
Definition of a statistic with its description and category.
Create new instance of StatDef(description, category)
- class mail_deduplicate.deduplicate.Stat(*values)
Bases: Enum
All tracked statistics and their definition.
- MAIL_FOUND = ('Total number of mails encountered from all mail sources.', 'mail')
- MAIL_REJECTED = ('Number of mails rejected individually because they were unparsable or did not have enough metadata to compute hashes.', 'mail')
- MAIL_RETAINED = ('Number of valid mails parsed and retained for deduplication.', 'mail')
- MAIL_HASHES = ('Number of unique hashes.', 'mail')
- MAIL_UNIQUE = ('Number of unique mails (which were automatically added to selection).', 'mail')
- MAIL_DUPLICATES = ('Number of duplicate mails (sum of mails in all duplicate sets with at least 2 mails).', 'mail')
- MAIL_SKIPPED = ('Number of mails ignored in the selection step because the whole set they belong to was skipped.', 'mail')
- MAIL_DISCARDED = ('Number of mails discarded from the final selection.', 'mail')
- MAIL_SELECTED = ('Number of mails kept in the final selection on which the action will be performed.', 'mail')
- MAIL_COPIED = ('Number of mails copied from their original mailbox to another.', 'mail')
- MAIL_MOVED = ('Number of mails moved from their original mailbox to another.', 'mail')
- MAIL_DELETED = ('Number of mails deleted from their mailbox in-place.', 'mail')
- SET_TOTAL = ('Total number of duplicate sets.', 'set')
- SET_SINGLE = ('Total number of sets containing only a single mail with no applicable strategy. They were automatically kept in the final selection.', 'set')
- SET_SKIPPED_ENCODING = ('Number of sets skipped from the selection process because they had encoding issues.', 'set')
- SET_SKIPPED_SIZE = ('Number of sets skipped from the selection process because they were too dissimilar in size.', 'set')
- SET_SKIPPED_CONTENT = ('Number of sets skipped from the selection process because they were too dissimilar in content.', 'set')
- SET_SKIPPED_STRATEGY = ('Number of sets skipped from the selection process because the strategy could not be applied.', 'set')
- SET_DEDUPLICATED = ('Number of valid sets on which the selection strategy was successfully applied.', 'set')
- class mail_deduplicate.deduplicate.Stats
Bases: object
Type-safe statistics counter using Stat enum keys.
- exception mail_deduplicate.deduplicate.SizeDiffAboveThreshold
Bases: Exception
Difference in mail size is greater than threshold.
- exception mail_deduplicate.deduplicate.ContentDiffAboveThreshold
Bases: Exception
Difference in mail content is greater than threshold.
- class mail_deduplicate.deduplicate.BodyHasher(*values)
Bases: StrEnum
Enumeration of available body hashing methods.
- SKIP = 'skip'
- RAW = 'raw'
- NORMALIZED = 'normalized'
- class mail_deduplicate.deduplicate.DuplicateSet(hash_key, mail_set, conf)
Bases: object
A set of mails sharing the same hash.
Implements all the safety checks required before we can apply any selection strategy.
Load up the duplicate set of mails and freeze the pool.
Once loaded, the pool of parsed mails is considered frozen for the rest of the duplicate set's life. This allows aggressive caching of lazy instance attributes depending on the pool content.
- conf
Configuration shared from the main deduplication process.
- pool: frozenset[DedupMailMixin]
Pool referencing all duplicated mails and their attributes.
- check_differences()
Ensures all mails differ within the limits imposed by the size and content thresholds.
Compare all mails of the duplicate set with each other, both in size and content. Raise an error if we're not within the limits imposed by the threshold settings.
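The pairwise comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: mails are reduced to plain body strings, size is taken as the string length, the two exception classes are local stand-ins for the ones documented above, and content_diff is a hypothetical helper that counts characters on the added and removed lines of a unified diff.

```python
import difflib
from itertools import combinations


class SizeDiffAboveThreshold(Exception):
    """Local stand-in for the exception of the same name."""


class ContentDiffAboveThreshold(Exception):
    """Local stand-in for the exception of the same name."""


def content_diff(body_a, body_b):
    """Count characters found on added or removed lines of a unified diff."""
    diff_lines = list(
        difflib.unified_diff(
            body_a.splitlines(), body_b.splitlines(), lineterm="", n=0
        )
    )
    # The first two lines are the '---' and '+++' file headers: skip them.
    return sum(len(line) - 1 for line in diff_lines[2:] if line[0] in "+-")


def check_differences(bodies, size_threshold, content_threshold):
    """Compare every pair of bodies and raise if a threshold is exceeded."""
    for body_a, body_b in combinations(bodies, 2):
        if abs(len(body_a) - len(body_b)) > size_threshold:
            raise SizeDiffAboveThreshold
        if content_diff(body_a, body_b) > content_threshold:
            raise ContentDiffAboveThreshold
```

Comparing each pair once with itertools.combinations keeps the check symmetric, and a single offending pair is enough to reject the whole set.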
- diff(mail_a, mail_b)
Return difference in bytes between two mails' normalized body.
Todo
Rewrite the diff algorithm to not rely on naive unified diff result parsing.
- class mail_deduplicate.deduplicate.Deduplicate(conf)
Bases: object
Load up messages, search for duplicates, apply selection strategy and perform the action.
Similar messages sharing the same hash are grouped together in a DuplicateSet.
- CLEANUP_ATTRS: tuple[str, ...] = ('canonical_headers', 'body_lines', 'subject')
Attributes to remove from mails after categorization to free memory.
- sources: dict[str, Mailbox]
Index of mail sources by their full, normalized path, so we can refer to them in Mail instances. This also has the nice side effect of naturally deduplicating the sources themselves.
- conf
Configuration shared across the deduplication process.
- add_source(source_path)
Registers a source of mails, validates and opens it.
Duplicate sources of mails are not allowed, as when we perform the action, we use the path as a unique key to tie a mail back to its source.
- hash_all()
Browse all mails from all registered sources, compute hashes and group mails by hash.
Displays a progress bar as the operation might be slow.
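The grouping step of hash_all() can be pictured with the sketch below. It is only an illustration: mails are modelled as plain dictionaries of lower-cased headers, HASH_HEADERS is a hypothetical three-header subset, and SHA-256 is an assumed digest, since this page does not state which hash function the package uses.

```python
import hashlib
from collections import defaultdict

# Hypothetical subset of headers taking part in the hash.
HASH_HEADERS = ("date", "from", "subject")


def hash_key(headers):
    """Serialize the selected headers and return their hex digest."""
    serialized = "\n".join(
        f"{name}: {headers[name]}" for name in HASH_HEADERS if name in headers
    )
    # SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def group_by_hash(mails):
    """Group mails sharing the same hash, one list per hash."""
    groups = defaultdict(list)
    for mail in mails:
        groups[hash_key(mail)].append(mail)
    return dict(groups)
```

Headers left out of HASH_HEADERS do not influence the grouping, which is why two copies of a mail differing only in such a header end up in the same group.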
- static cleanup_mail_attrs(mail, attrs)
Remove cached attributes from mail to free memory.
- build_sets()
Build the selected and discarded sets from each duplicate set.
We apply the selection strategy one duplicate set at a time to keep memory footprint low and make the log easier to read.
mail_deduplicate.mail module
- exception mail_deduplicate.mail.TooFewHeaders
Bases: Exception
Not enough headers were found to produce a solid hash.
- class mail_deduplicate.mail.TimeSource(*values)
Bases: Enum
Enumeration of all supported mail timestamp sources.
- DATE_HEADER = 'date-header'
Timestamp sourced from the message's Date header.
- CTIME = 'ctime'
Timestamp is from the email's file on the filesystem.
Attention
Only available for maildir sources.
- mail_deduplicate.mail.ADDRESS_HEADERS = frozenset({'bcc', 'cc', 'delivered-to', 'disposition-notification-to', 'envelope-to', 'from', 'original-recipient', 'reply-to', 'resent-bcc', 'resent-cc', 'resent-from', 'resent-reply-to', 'resent-sender', 'resent-to', 'return-path', 'sender', 'to', 'x-envelope-from', 'x-envelope-to', 'x-original-to'})
Headers that contain email addresses.
Hint
Headers from which quotes should be discarded. E.g.:
"Bob" <bob@example.com>
should hash to the same thing as:
Bob <bob@example.com>
Attention
These IDs should be kept lower-case, because they are compared to those provided to the -h/--hash-header option, which are carried by the hash_headers property of the configuration.
- class mail_deduplicate.mail.DedupMailMixin(message=None)
Bases: Message
Message with deduplication-specific properties and utilities.
Extends standard library's mailbox.Message, and shouldn't be used directly, but composed with mailbox.Message sub-classes.
Initialize a Message instance.
- path: str
Real filesystem location of the mail.
Returns the individual mail's file for folder-based box types (maildir & co.), but returns the whole box path for file-based boxes (mbox & co.). Only used by regexp-based selection strategies.
- add_box_metadata(box, mail_id)
Post-instantiation utility to attach to mail some metadata derived from its parent box.
Called right after the __init__() constructor.
This allows the mail to carry its own information on its origin box and index.
- property parsed_date: float | None
Parse the mail's date header into a float timestamp.
Returns None if the mail has no valid date header.
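A Date header can be turned into a float timestamp with the standard library alone, as in the sketch below. The function name parse_date_header is hypothetical and the use of email.utils.parsedate_to_datetime is an assumption: the package may parse the header differently.

```python
from email.utils import parsedate_to_datetime


def parse_date_header(value):
    """Return the header's POSIX timestamp, or None if it cannot be parsed.

    Hypothetical helper, not the property's actual implementation.
    """
    if not value:
        return None
    try:
        return parsedate_to_datetime(value).timestamp()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparsable input,
        # newer ones raise ValueError.
        return None
```

Returning None instead of raising lets callers treat a missing header and a malformed one the same way.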
- property timestamp: float | None
Compute the normalized canonical timestamp of the mail.
Sourced from the message's Date header by default. In the case of maildir, can be sourced from the email's file on the filesystem.
Warning
ctime does not refer to creation time on POSIX systems, but rather the last time the inode data changed.
Todo
Investigate what mailbox.MaildirMessage.get_date() does and if we can use it.
- property size: int
Returns canonical mail size.
Size is computed as the length of the message body, i.e. the payload of the mail stripped of all its headers, not as the size of the mail file stored on the filesystem.
Todo
Allow customization of the way the size is computed, by getting the file size instead for example:
    size = os.path.getsize(mail_file)
- hash_key()
Returns the canonical hash of a mail.
Caution
This method hasn't been made explicitly into a cached property in order to reduce the overall memory footprint.
- property canonical_headers: tuple[tuple[str, str], ...]
Returns the full list of all canonical header names and values in preparation for hashing.
- pretty_canonical_headers()
Renders a table of header names and values used to produce the mail's hash.
Caution
This method hasn't been explicitly made into a cached property in order to reduce the overall memory footprint.
Returns a string ready to be printed.
- serialized_headers()
Serialize the canonical headers into a single string ready to be hashed.
At this point we should have an absolute minimum of headers.
Caution
This method hasn't been explicitly made into a cached property in order to reduce the overall memory footprint.
- normalized_header_values(header_id)
Returns all normalized values of a header.
Values are cleaned up into their canonical form.
- normalize_subject(subject)
Strip Re:/Fwd: and [list-name] prefixes from Subject.
This cleans up prefixes automatically added by mailing list software, since the mail could have been CC'd to multiple lists, in which case it will receive a different prefix for each.
- normalize_content_type(value)
Normalize Content-Type by stripping parameters.
Removes everything after the semicolon, keeping only the MIME type. E.g., text/plain; charset=utf-8 becomes text/plain.
Apparently list servers actually munge Content-Type, e.g. by stripping the quotes from charset="us-ascii". Section 5.1 of RFC 2045 says that either form is valid (and they are equivalent).
Additionally, with multipart/mixed, boundary delimiters can vary by recipient. We need to allow for duplicates coming from multiple recipients, since for example you could be signed up to the same list twice with different addresses. Or maybe someone bounces you a load of mail, some of which is from a mailing list you're both subscribed to: then it's still useful to be able to eliminate duplicates.
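Dropping the parameters amounts to keeping what precedes the first semicolon, as in this minimal sketch. It is a standalone function written for illustration, not the method itself, and trimming surrounding whitespace is an assumption.

```python
def normalize_content_type(value):
    """Keep only the MIME type, discarding any ';'-separated parameters."""
    return value.split(";", 1)[0].strip()
```

With this, the quoted and unquoted charset forms, as well as differing multipart boundaries, all reduce to the same value.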
- normalize_date(value)
Normalize Date to YYYY-MM-DD format.
Date timestamps can differ by seconds or hours for various reasons, so let's only honour the date for now and normalize them to UTC timezone.
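The conversion to a UTC calendar date can be sketched with the standard library as follows. This is an assumption-laden illustration: the standalone function, the use of email.utils.parsedate_to_datetime, and the choice to treat headers without a usable UTC offset as UTC are not taken from the package.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def normalize_date(value):
    """Render a Date header as a YYYY-MM-DD day in the UTC timezone."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # Assumption: a header without a usable offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d")
```

Converting to UTC before truncating means two copies stamped in different timezones still map to the same day when they denote the same instant.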
- normalize_address_header(value)
Normalize address headers by removing quotes and collapsing whitespace.
E.g., "Bob" <bob@example.com> becomes Bob <bob@example.com>.
Remove quotes in any headers that contain addresses to ensure a quoted name is hashed to the same value as an unquoted one.
Danger
This may not be the cleanest way to normalize email addresses. E.g. "Robert \"Bob\" becomes Robert \Bob\, but this shouldn't matter for hashing purposes as we're just trying to get a good heuristic. Refs: #847 and #846.
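Removing quotes and collapsing whitespace can be done in one expression, as in the sketch below. It is a standalone illustration of the heuristic, not the method's actual code.

```python
def normalize_address_header(value):
    """Drop every double quote, then collapse runs of whitespace."""
    return " ".join(value.replace('"', "").split())
```

As the note above warns, this is deliberately crude: escaped quotes inside a display name lose their quote characters but keep their backslashes.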
- normalize_message_id(value)
Normalize Message-ID header by stripping angle brackets.
E.g., <unique-id@example.com> becomes unique-id@example.com.
- strip_angle_brackets(value)
Strip angle brackets from a value if it's a single bracketed item.
Only strips if the value matches <something> with no commas.
Note
Sometimes email.parser strips the <> brackets from a To: header which has a single address. I have seen this happen for only one mail in a duplicate pair. I'm not sure why (presumably the parser uses email.utils.unquote somewhere in its code path which was only triggered by that mail and not its sister mail), but to be safe, we should always strip the <> brackets to avoid this difference preventing duplicate detection.
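The single-item rule can be expressed with a full-match regular expression, as sketched below. The exact pattern is an assumption: besides commas, it also refuses nested brackets, which the description above does not spell out.

```python
import re

# Assumed pattern: one bracketed item, free of commas and inner brackets.
SINGLE_BRACKETED = re.compile(r"<([^<>,]+)>")


def strip_angle_brackets(value):
    """Return the bracket content for a lone <item>, else the value as-is."""
    match = SINGLE_BRACKETED.fullmatch(value.strip())
    return match.group(1) if match else value
```

Lists of addresses and values carrying a display name are left untouched, so only the lone bracketed form is unified with its bare counterpart.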
mail_deduplicate.mail_box module
Utilities to read and write mail boxes in various formats.
Based on Python's standard library mailbox module.
- mail_deduplicate.mail_box.make_dedup_mail(name, base)
Create a DedupMail class for a mailbox message type.
- class mail_deduplicate.mail_box.MaildirDedupMail(message=None)
Bases: DedupMailMixin, MaildirMessage
Initialize a MaildirMessage instance.
- class mail_deduplicate.mail_box.mboxDedupMail(message=None)
Bases: DedupMailMixin, mboxMessage
Initialize an mboxMMDFMessage instance.
- class mail_deduplicate.mail_box.MHDedupMail(message=None)
Bases: DedupMailMixin, MHMessage
Initialize an MHMessage instance.
- class mail_deduplicate.mail_box.BabylDedupMail(message=None)
Bases: DedupMailMixin, BabylMessage
Initialize a BabylMessage instance.
- class mail_deduplicate.mail_box.MMDFDedupMail(message=None)
Bases: DedupMailMixin, MMDFMessage
Initialize an mboxMMDFMessage instance.
- class mail_deduplicate.mail_box.BoxStructure(*values)
Bases: Enum
Box structures can be file-based or folder-based.
- FOLDER = 1
- FILE = 2
- class mail_deduplicate.mail_box.BoxFormat(base_class, structure, message_class)
Bases: Enum
IDs of all the supported box formats and their metadata.
Each entry is associated to:
  - their original base class,
  - the structure they implement (file-based or folder-based),
  - the custom message factory class to use.
From these, we can derive the proper constructor with our own custom DedupMail factory.
Hint
This could be extended in the future to add support for other mailbox formats and sources, like Gmail accounts, IMAP servers, etc.
- MAILDIR = (<class 'mailbox.Maildir'>, BoxStructure.FOLDER, <class 'mail_deduplicate.mail_box.MaildirDedupMail'>)
- MBOX = (<class 'mailbox.mbox'>, BoxStructure.FILE, <class 'mail_deduplicate.mail_box.mboxDedupMail'>)
- MH = (<class 'mailbox.MH'>, BoxStructure.FOLDER, <class 'mail_deduplicate.mail_box.MHDedupMail'>)
- BABYL = (<class 'mailbox.Babyl'>, BoxStructure.FILE, <class 'mail_deduplicate.mail_box.BabylDedupMail'>)
- MMDF = (<class 'mailbox.MMDF'>, BoxStructure.FILE, <class 'mail_deduplicate.mail_box.MMDFDedupMail'>)
- property constructor
Return a constructor for this box format with our custom message factory.
- mail_deduplicate.mail_box.FOLDER_FORMATS = (BoxFormat.MAILDIR, BoxFormat.MH)
Box formats implementing a folder-based structure.
Is a tuple to keep natural order defined by BoxFormat.
- mail_deduplicate.mail_box.FILE_FORMATS = (BoxFormat.MBOX, BoxFormat.BABYL, BoxFormat.MMDF)
Box formats implementing a file-based structure.
Is a tuple to keep natural order defined by BoxFormat.
- mail_deduplicate.mail_box.MAILDIR_SUBDIRS = frozenset({'cur', 'new', 'tmp'})
List of required sub-folders defining a properly structured maildir.
- mail_deduplicate.mail_box.autodetect_box_type(path)
Auto-detect the format of the mailbox located at the provided path.
Returns a box type as indexed in the BOX_TYPES dictionary above.
If the path is a file, then it is considered as an mbox. Else, if the provided path is a folder and features the expected sub-directories, it is parsed as a maildir.
Todo
Future finer autodetection heuristics should be implemented here. Some ideas:
  - single mail from a maildir
  - plain text mail content
  - other mailbox formats supported in Python's standard library: MH, Babyl, MMDF
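The file-versus-folder heuristic described above can be sketched as follows. This is an illustration only: autodetect_format is a hypothetical name, it returns plain strings rather than BoxFormat members, and raising ValueError for unrecognized paths is an assumption.

```python
from pathlib import Path

# Same sub-folder names as the MAILDIR_SUBDIRS constant documented above.
MAILDIR_SUBDIRS = frozenset({"cur", "new", "tmp"})


def autodetect_format(path):
    """Guess a box format from the filesystem layout found at path."""
    path = Path(path)
    if path.is_file():
        # Any regular file is considered an mbox.
        return "mbox"
    if path.is_dir():
        subdirs = {child.name for child in path.iterdir() if child.is_dir()}
        if MAILDIR_SUBDIRS.issubset(subdirs):
            return "maildir"
    # Assumption: unrecognized layouts are reported as errors.
    raise ValueError(f"Unrecognized mail source: {path}")
```

A folder lacking any of the three sub-directories is not treated as a maildir, which is where the finer heuristics listed in the Todo would fit.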
- mail_deduplicate.mail_box.open_box(path, box_format=None, force_unlock=False)
Open a mail box.
Returns a list of boxes, one per sub-folder. All are locked, ready for operations.
If box_format is provided, forces the opening of the box in the specified format. Else, defaults to autodetection.
- mail_deduplicate.mail_box.lock_box(box, force_unlock)
Lock an opened box, allowing for forced unlocking.
Returns the locked box.
- mail_deduplicate.mail_box.FOLDER_FORMAT_CLASSES = frozenset({<class 'mailbox.MH'>, <class 'mailbox.Maildir'>})
Base classes of folder-based box formats.
mail_deduplicate.strategy module
Strategy definitions.
- mail_deduplicate.strategy.log_selection(message_template)
Decorator to log selection criteria.
- mail_deduplicate.strategy.select_older(duplicates)
Select all older duplicates.
Discards the newest ones, i.e. the subset sharing the most recent timestamp.
- mail_deduplicate.strategy.select_oldest(duplicates)
Select all the oldest duplicates.
Discards the newer ones, i.e. all mails of the duplicate set but those sharing the oldest timestamp.
- mail_deduplicate.strategy.select_newer(duplicates)
Select all newer duplicates.
Discards the oldest ones, i.e. the subset sharing the most ancient timestamp.
- mail_deduplicate.strategy.select_newest(duplicates)
Select all the newest duplicates.
Discards the older ones, i.e. all mails of the duplicate set but those sharing the newest timestamp.
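The difference between the comparative and superlative variants is easiest to see on a toy example. The sketch below is an illustration under stated assumptions: it takes a plain mapping of mail IDs to timestamps instead of a DuplicateSet, and returns a set of IDs.

```python
def select_older(timestamps):
    """Keep every mail except those sharing the most recent timestamp."""
    newest = max(timestamps.values())
    return {mail for mail, stamp in timestamps.items() if stamp < newest}


def select_oldest(timestamps):
    """Keep only the mails sharing the oldest timestamp."""
    oldest = min(timestamps.values())
    return {mail for mail, stamp in timestamps.items() if stamp == oldest}
```

With timestamps {"a": 1, "b": 2, "c": 3, "d": 3}, select_older keeps a and b, whereas select_oldest keeps only a. The newer and newest variants mirror this with the comparisons reversed.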
- mail_deduplicate.strategy.select_smaller(duplicates)
Select all smaller duplicates.
Discards the biggest ones, i.e. the subset sharing the biggest size.
- mail_deduplicate.strategy.select_smallest(duplicates)
Select all the smallest duplicates.
Discards the bigger ones, i.e. all mails of the duplicate set but those sharing the smallest size.
- mail_deduplicate.strategy.select_bigger(duplicates)
Select all bigger duplicates.
Discards the smallest ones, i.e. the subset sharing the smallest size.
- mail_deduplicate.strategy.select_biggest(duplicates)
Select all the biggest duplicates.
Discards the smaller ones, i.e. all mails of the duplicate set but those sharing the biggest size.
- mail_deduplicate.strategy.select_matching_path(duplicates)
Select all duplicates whose file path matches the regular expression provided via the --regexp parameter.
- mail_deduplicate.strategy.select_non_matching_path(duplicates)
Select all duplicates whose file path doesn't match the regular expression provided via the --regexp parameter.
- mail_deduplicate.strategy.select_one(duplicates)
Randomly select one duplicate, and discard all others.
- mail_deduplicate.strategy.select_all_but_one(duplicates)
Randomly discard one duplicate, and select all others.
- class mail_deduplicate.strategy.Strategy(*values)
Bases: Enum
Selection strategies to apply on a set of duplicate mails.
Each strategy in the Enum points to the function implementing the selection logic, by way of the strategy_function() method.
Strategies whose member value is a string are simply aliases to other strategies, pointing to the name of the function implementing the logic. The other members have integer values, to indicate their function ID is to be derived from the member name. This arrangement allows each member to have its own existence without being hidden by the aliasing mechanism of Enum.
Aliases are great usability features to represent inverse operations. They help users to better reason about the selection operators depending on their mental models.
- SELECT_OLDER = 1
- SELECT_OLDEST = 2
- SELECT_NEWER = 3
- SELECT_NEWEST = 4
- DISCARD_NEWEST = 'select_older'
- DISCARD_NEWER = 'select_oldest'
- DISCARD_OLDEST = 'select_newer'
- DISCARD_OLDER = 'select_newest'
- SELECT_SMALLER = 5
- SELECT_SMALLEST = 6
- SELECT_BIGGER = 7
- SELECT_BIGGEST = 8
- DISCARD_BIGGEST = 'select_smaller'
- DISCARD_BIGGER = 'select_smallest'
- DISCARD_SMALLEST = 'select_bigger'
- DISCARD_SMALLER = 'select_biggest'
- SELECT_MATCHING_PATH = 9
- SELECT_NON_MATCHING_PATH = 10
- DISCARD_NON_MATCHING_PATH = 'select_matching_path'
- DISCARD_MATCHING_PATH = 'select_non_matching_path'
- SELECT_ONE = 11
- SELECT_ALL_BUT_ONE = 12
- DISCARD_ALL_BUT_ONE = 'select_one'
- DISCARD_ONE = 'select_all_but_one'
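The aliasing arrangement described for Strategy can be reproduced in miniature as below. This is a sketch under assumptions: only four members are shown, the two selection functions are placeholders, and the FUNCTIONS registry is a hypothetical lookup table, since this page does not show how strategy_function() resolves names.

```python
from enum import Enum


def select_older(duplicates):
    """Placeholder standing in for the real selection function."""
    return "older"


def select_oldest(duplicates):
    """Placeholder standing in for the real selection function."""
    return "oldest"


# Hypothetical registry mapping function IDs to implementations.
FUNCTIONS = {"select_older": select_older, "select_oldest": select_oldest}


class Strategy(Enum):
    SELECT_OLDER = 1
    SELECT_OLDEST = 2
    # String values name the function the alias points to.
    DISCARD_NEWEST = "select_older"
    DISCARD_NEWER = "select_oldest"

    def strategy_function(self):
        """Resolve the function from the value (alias) or the member name."""
        if isinstance(self.value, str):
            function_id = self.value
        else:
            function_id = self.name.lower()
        return FUNCTIONS[function_id]
```

Because every member has a distinct value, Enum keeps DISCARD_NEWEST and SELECT_OLDER as separate members, yet both resolve to the same function.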