CLI parameters
Help screen
$ mdedup --help
Usage: mdedup [OPTIONS] MAIL_SOURCE_1 MAIL_SOURCE_2 ...
Deduplicate mails from multiple sources.
Process:
- Step #1: load mails from their sources.
- Step #2: compute the canonical hash of each mail based on its headers (and
optionally its body), and regroup mails sharing the same hash.
- Step #3: apply a selection strategy on each subset of duplicate mails.
- Step #4: perform an action on all selected mails.
- Step #5: report statistics.
Positional arguments:
MAIL_SOURCE_1 MAIL_SOURCE_2 ...
Mail sources to deduplicate. Can be a single mail box or a list of mails.
Mail sources (step #1):
-i, --input-format [babyl|maildir|mbox|mh|mmdf]
Force all provided mail sources to be parsed in the specified format. If
not set, the format of each source is auto-detected independently.
Auto-detection only supports the maildir and mbox formats. Use this option
to open other box formats, or to bypass unreliable detection.
-u, --force-unlock
Remove the lock on mail source opening if one is found.
Hashing (step #2):
-h, --hash-header Header-ID
Headers to use to compute each mail's hash. Repeat the option to set an
ordered list of headers. Header IDs are case-insensitive. Repeated entries
are ignored. [default: Date, From, To, Subject, MIME-Version, Content-Type,
Content-Disposition, User-Agent, X-Priority, Message-ID]
-b, --hash-body [normalized|raw|skip]
Method used to hash the body of mails. Defaults to skip, which doesn't hash
the body at all: it is the fastest method, and a header-based hash should be
sufficient to determine duplicate sets. raw uses the body as is (slow).
normalized pre-processes the body before hashing by removing all line breaks
and spaces (slowest). [default: skip]
-H, --hash-only
Compute and display the internal hashes used to identify duplicates. Do not
perform any selection or action.
Deduplication (step #3):
Process each set of mails sharing the same hash and apply the selection
--strategy. Fine-grained checks on size and content are performed beforehand.
If differences are above safety levels, the whole duplicate set will be
skipped. Limits can be set via the --size-threshold and --content-threshold
options.
-s, --strategy [discard-all-but-one|discard-bigger|discard-biggest|discard-matching-path|discard-newer|discard-newest|discard-non-matching-path|discard-older|discard-oldest|discard-one|discard-smaller|discard-smallest|select-all-but-one|select-bigger|select-biggest|select-matching-path|select-newer|select-newest|select-non-matching-path|select-older|select-oldest|select-one|select-smaller|select-smallest]
Selection strategy to apply within a subset of duplicates. If not set,
duplicates will be grouped and counted but all skipped: the selection will
be empty and no action will be performed. A description of each strategy is
available further down this help screen.
-t, --time-source [ctime|date-header]
Source of a mail's time reference used in time-sensitive strategies.
[default: date-header]
-r, --regexp REGEXP
Regular expression on a mail's file path. Applies to the real, individual
mail location for folder-based boxes (maildir, mh). For file-based boxes
(babyl, mbox, mmdf), applies to the whole box's path, as all mails are
packed into one single file. Required by the discard-matching-path,
discard-non-matching-path, select-matching-path and
select-non-matching-path strategies.
-S, --size-threshold BYTES
Maximum difference allowed in size between mails sharing the same hash. The
whole subset of duplicates will be skipped if at least one pair of mails
exceeds the threshold. Set to 0 to enforce strictness and apply selection
strategy on the subset only if all mails are exactly the same. Set to -1 to
allow any difference and apply the strategy whatever the differences.
[default: 512]
-C, --content-threshold BYTES
Maximum difference allowed in content between mails sharing the same hash.
The whole subset of duplicates will be skipped if at least one pair of mails
exceeds the threshold. Set to 0 to enforce strictness and apply selection
strategy on the subset only if all mails are exactly the same. Set to -1 to
allow any difference and apply the strategy whatever the differences.
[default: 768]
-d, --show-diff
Show the unified diff of duplicates not within thresholds.
Action (step #4):
-a, --action [copy-discarded|copy-selected|delete-discarded|delete-selected|move-discarded|move-selected]
Action performed on the selected mails. Defaults to copy-selected as it is
the safest: it only reads the mail sources and creates a brand new mail box
with the selection results. [default: copy-selected]
-E, --export MAIL_BOX_PATH
Location of the destination mail box to which deduplicated mails are copied
or moved. Required by the copy-selected, copy-discarded, move-selected and
move-discarded actions.
-e, --export-format [babyl|maildir|mbox|mh|mmdf]
Format of the mail box to which deduplicated mails will be exported. Only
affects the copy-selected, copy-discarded, move-selected and move-discarded
actions. [default: mbox]
--export-append
If the destination mail box already exists, add mails to it instead of
interrupting (default behavior). Affects the copy-selected, copy-discarded,
move-selected and move-discarded actions.
-n, --dry-run
Do not perform any action, but act as if it were performed and report which
action would have been carried out otherwise.
Other options:
--time / --no-time
Measure and print elapsed execution time. [default: no-time]
--color, --ansi / --no-color, --no-ansi
Strip out all colors and all ANSI codes from output. [default: color]
-C, --config CONFIG_PATH
Location of the configuration file. Supports local paths with glob patterns,
and remote URLs. [default: ~/.config/mdedup/*.{toml,yaml,yml,json,ini,xml}]
--show-params
Show all CLI parameters, their provenance, defaults and value, then exit.
-v, --verbosity LEVEL
Either CRITICAL, ERROR, WARNING, INFO, DEBUG. [default: INFO]
--version
Show the version and exit.
--help
Show this message and exit.
Available strategies:
[select-all-but-one|discard-one]
Randomly discard one duplicate, and select all others.
[select-bigger|discard-smallest]
Select all bigger duplicates. Discards the smallest ones, i.e. the subset
sharing the smallest size.
[select-biggest|discard-smaller]
Select all the biggest duplicates. Discards the smaller ones, i.e. all mails
of the duplicate set but those sharing the biggest size.
[select-matching-path|discard-non-matching-path]
Select all duplicates whose file path matches the regular expression
provided via the --regexp parameter.
[select-newer|discard-oldest]
Select all newer duplicates. Discards the oldest ones, i.e. the subset
sharing the oldest timestamp.
[select-newest|discard-older]
Select all the newest duplicates. Discards the older ones, i.e. all mails of
the duplicate set but those sharing the newest timestamp.
[select-non-matching-path|discard-matching-path]
Select all duplicates whose file path doesn't match the regular expression
provided via the --regexp parameter.
[select-older|discard-newest]
Select all older duplicates. Discards the newest ones, i.e. the subset
sharing the most recent timestamp.
[select-oldest|discard-newer]
Select all the oldest duplicates. Discards the newer ones, i.e. all mails of
the duplicate set but those sharing the oldest timestamp.
[select-one|discard-all-but-one]
Randomly select one duplicate, and discard all others.
[select-smaller|discard-biggest]
Select all smaller duplicates. Discards the biggest ones, i.e. the subset
sharing the biggest size.
[select-smallest|discard-bigger]
Select all the smallest duplicates. Discards the bigger ones, i.e. all mails
of the duplicate set but those sharing the smallest size.
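The time-based strategies in the help screen can be read as simple filters over one set of duplicates. The Python sketch below is illustrative only and is not taken from mdedup's source: the `Mail` record and the two function names are hypothetical, and timestamps are plain numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mail:
    """Hypothetical stand-in for one mail of a duplicate set."""

    path: str
    timestamp: float


def select_newest(duplicates):
    """select-newest: keep only the mails sharing the newest timestamp."""
    newest = max(mail.timestamp for mail in duplicates)
    return [mail for mail in duplicates if mail.timestamp == newest]


def select_newer(duplicates):
    """select-newer: keep everything except the mails sharing the oldest timestamp."""
    oldest = min(mail.timestamp for mail in duplicates)
    return [mail for mail in duplicates if mail.timestamp > oldest]


duplicates = [Mail("a", 1.0), Mail("b", 1.0), Mail("c", 2.0), Mail("d", 3.0)]
selected = select_newest(duplicates)
# Whatever is not selected is treated as discarded in this sketch.
discarded = [mail for mail in duplicates if mail not in selected]
```

With these four mails, `select_newest` keeps only `d`, while `select_newer` keeps `c` and `d` and drops the two mails sharing the oldest timestamp. The size-based strategies follow the same pattern with a size attribute in place of the timestamp.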
Options
mdedup
Deduplicate mails from multiple sources.
mdedup [OPTIONS] MAIL_SOURCE_1 MAIL_SOURCE_2 ...
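Deduplication rests on step #2 of the process: mails are grouped by a hash computed from an ordered list of headers and, optionally, the body. The sketch below shows the idea in Python. It is not mdedup's implementation: the `canonical_hash` helper, the choice of SHA-256 and the `name: value` serialization are all assumptions made for illustration, and only non-multipart messages are handled.

```python
import email
import hashlib


def canonical_hash(message, headers=("Date", "From", "To", "Subject"), hash_body="skip"):
    """Hypothetical helper: hash selected headers, and optionally the body."""
    # Header IDs are case-insensitive and repeated entries are ignored;
    # the order of first appearance is preserved.
    ordered = list(dict.fromkeys(name.lower() for name in headers))
    parts = [f"{name}: {message.get(name, '')}" for name in ordered]
    if hash_body != "skip":
        # Assumption: the message is not multipart, so the payload is a string.
        body = message.get_payload()
        if hash_body == "normalized":
            # Remove all line breaks and spaces before hashing.
            body = "".join(body.split())
        parts.append(body)
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


first = email.message_from_string("From: a@example.com\nSubject: Hi\n\nHello world\n")
second = email.message_from_string("From: a@example.com\nSubject: Hi\n\nHello\n world\n")

same_headers = canonical_hash(first) == canonical_hash(second)
same_raw_body = canonical_hash(first, hash_body="raw") == canonical_hash(second, hash_body="raw")
```

Here the two messages share a header-only hash, and also a `normalized` hash, but differ under `raw` because their bodies are wrapped differently.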
Options
- -i, --input-format <input_format>
Force all provided mail sources to be parsed in the specified format. If not set, the format of each source is auto-detected independently. Auto-detection only supports the maildir and mbox formats. Use this option to open other box formats, or to bypass unreliable detection.
- Options:
babyl | maildir | mbox | mh | mmdf
- -u, --force-unlock
Remove the lock on mail source opening if one is found.
- -h, --hash-header <Header-ID>
Headers to use to compute each mail’s hash. Repeat the option to set an ordered list of headers. Header IDs are case-insensitive. Repeated entries are ignored.
- -b, --hash-body <hash_body>
Method used to hash the body of mails. Defaults to skip, which doesn’t hash the body at all: it is the fastest method, and a header-based hash should be sufficient to determine duplicate sets. raw uses the body as is (slow). normalized pre-processes the body before hashing by removing all line breaks and spaces (slowest).
- Options:
normalized | raw | skip
- -H, --hash-only
Compute and display the internal hashes used to identify duplicates. Do not perform any selection or action.
- -s, --strategy <strategy>
Selection strategy to apply within a subset of duplicates. If not set, duplicates will be grouped and counted but all skipped: the selection will be empty and no action will be performed. A description of each strategy is available at the end of the help screen.
- Options:
discard-all-but-one | discard-bigger | discard-biggest | discard-matching-path | discard-newer | discard-newest | discard-non-matching-path | discard-older | discard-oldest | discard-one | discard-smaller | discard-smallest | select-all-but-one | select-bigger | select-biggest | select-matching-path | select-newer | select-newest | select-non-matching-path | select-older | select-oldest | select-one | select-smaller | select-smallest
- -t, --time-source <time_source>
Source of a mail’s time reference used in time-sensitive strategies.
- Options:
ctime | date-header
- -r, --regexp <REGEXP>
Regular expression on a mail’s file path. Applies to the real, individual mail location for folder-based boxes (maildir, mh). For file-based boxes (babyl, mbox, mmdf), applies to the whole box’s path, as all mails are packed into one single file. Required by the discard-matching-path, discard-non-matching-path, select-matching-path and select-non-matching-path strategies.
- -S, --size-threshold <BYTES>
Maximum difference allowed in size between mails sharing the same hash. The whole subset of duplicates will be skipped if at least one pair of mails exceeds the threshold. Set to 0 to enforce strictness and apply the selection strategy on the subset only if all mails are exactly the same. Set to -1 to allow any difference and apply the strategy whatever the differences.
- -C, --content-threshold <BYTES>
Maximum difference allowed in content between mails sharing the same hash. The whole subset of duplicates will be skipped if at least one pair of mails exceeds the threshold. Set to 0 to enforce strictness and apply the selection strategy on the subset only if all mails are exactly the same. Set to -1 to allow any difference and apply the strategy whatever the differences.
- -d, --show-diff
Show the unified diff of duplicates not within thresholds.
- -a, --action <action>
Action performed on the selected mails. Defaults to copy-selected as it is the safest: it only reads the mail sources and creates a brand new mail box with the selection results.
- Options:
copy-discarded | copy-selected | delete-discarded | delete-selected | move-discarded | move-selected
- -E, --export <MAIL_BOX_PATH>
Location of the destination mail box to which deduplicated mails are copied or moved. Required by the copy-selected, copy-discarded, move-selected and move-discarded actions.
- -e, --export-format <export_format>
Format of the mail box to which deduplicated mails will be exported. Only affects the copy-selected, copy-discarded, move-selected and move-discarded actions.
- Options:
babyl | maildir | mbox | mh | mmdf
- --export-append
If the destination mail box already exists, add mails to it instead of interrupting (default behavior). Affects the copy-selected, copy-discarded, move-selected and move-discarded actions.
- -n, --dry-run
Do not perform any action, but act as if it were performed and report which action would have been carried out otherwise.
- --time, --no-time
Measure and print elapsed execution time.
- --color, --ansi, --no-color, --no-ansi
Strip out all colors and all ANSI codes from output when --no-color or --no-ansi is set.
- -C, --config <CONFIG_PATH>
Location of the configuration file. Supports local paths with glob patterns, and remote URLs.
- --show-params
Show all CLI parameters, their provenance, defaults and value, then exit.
- -v, --verbosity <LEVEL>
Either CRITICAL, ERROR, WARNING, INFO, DEBUG.
- Options:
CRITICAL | ERROR | WARNING | INFO | DEBUG
- --version
Show the version and exit.
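The --size-threshold rule described above (skip the whole subset as soon as one pair of mails differs by more than the limit) can be sketched as a pairwise check. This is a minimal illustration and not mdedup's code: `within_size_threshold` is a hypothetical name, and the content threshold is left out because this page does not say how content differences are measured.

```python
from itertools import combinations


def within_size_threshold(sizes, threshold=512):
    """Return True if a duplicate set may be processed, False if it must be skipped.

    `sizes` holds the size in bytes of each mail sharing the same hash.
    """
    if threshold < 0:
        # -1 allows any difference between mails.
        return True
    # The set is skipped if at least one pair exceeds the threshold.
    # With a threshold of 0, all sizes must be strictly identical.
    return all(abs(a - b) <= threshold for a, b in combinations(sizes, 2))


accepted = within_size_threshold([1000, 1200, 1500])  # largest gap is 500 bytes
skipped = not within_size_threshold([1000, 1600])     # gap of 600 bytes
```

Comparing only the smallest and largest size gives the same answer; the pairwise form is kept here because it mirrors the wording of the option.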
Arguments
- MAIL_SOURCE_1 MAIL_SOURCE_2 ...
Optional argument(s)
Mail sources to deduplicate. Can be a single mail box or a list of mails.
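The five box formats accepted by --input-format have counterparts in Python's standard `mailbox` module, which makes it easy to inspect a source before deduplicating it. The sketch below is an assumption-laden illustration, not mdedup's loader: `open_source` is a hypothetical helper, and its fallback (a directory is treated as maildir, anything else as mbox) is only a guess at how auto-detection might behave.

```python
import mailbox
from pathlib import Path

# Format names from --input-format mapped to standard library classes.
BOX_TYPES = {
    "babyl": mailbox.Babyl,
    "maildir": mailbox.Maildir,
    "mbox": mailbox.mbox,
    "mh": mailbox.MH,
    "mmdf": mailbox.MMDF,
}


def open_source(path, input_format=None):
    """Hypothetical helper: open a mail source, guessing the format if not forced."""
    path = Path(path)
    if input_format is None:
        # Assumed heuristic: directories are maildir, plain files are mbox.
        input_format = "maildir" if path.is_dir() else "mbox"
    # create=False avoids silently creating an empty box on a wrong path.
    return BOX_TYPES[input_format](str(path), create=False)
```

Calling `len(open_source("archive.mbox"))` on an existing mbox file would then report how many mails it holds; the file name is only an example.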
Environment variables
- MDEDUP_INPUT_FORMAT
Provide a default for -i
- MDEDUP_FORCE_UNLOCK
Provide a default for -u
- MDEDUP_HASH_HEADER
Provide a default for -h
- MDEDUP_HASH_BODY
Provide a default for -b
- MDEDUP_HASH_ONLY
Provide a default for -H
- MDEDUP_STRATEGY
Provide a default for -s
- MDEDUP_TIME_SOURCE
Provide a default for -t
- MDEDUP_REGEXP
Provide a default for -r
- MDEDUP_SIZE_THRESHOLD
Provide a default for -S
- MDEDUP_CONTENT_THRESHOLD
Provide a default for -C
- MDEDUP_SHOW_DIFF
Provide a default for -d
- MDEDUP_ACTION
Provide a default for -a
- MDEDUP_EXPORT
Provide a default for -E
- MDEDUP_EXPORT_FORMAT
Provide a default for -e
- MDEDUP_EXPORT_APPEND
Provide a default for --export-append
- MDEDUP_DRY_RUN
Provide a default for -n
- MDEDUP_TIME
Provide a default for --time
- MDEDUP_COLOR
Provide a default for --color
- MDEDUP_CONFIG
Provide a default for --config
- MDEDUP_SHOW_PARAMS
Provide a default for --show-params
- MDEDUP_VERBOSITY
Provide a default for --verbosity
- MDEDUP_VERSION
Provide a default for --version
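Each variable above only provides a default, which suggests that a value given on the command line still wins. The Python sketch below models that reading; it is an assumption drawn from the wording of this page rather than from mdedup's code, and `resolve_option` is a hypothetical helper.

```python
import os


def resolve_option(cli_value, env_name, builtin_default, environ=os.environ):
    """Hypothetical resolution order: command line, then environment, then default."""
    if cli_value is not None:
        # An explicit command-line value is assumed to take precedence.
        return cli_value
    # Otherwise the environment variable, if set, replaces the built-in default.
    return environ.get(env_name, builtin_default)


fake_env = {"MDEDUP_EXPORT_FORMAT": "maildir"}
from_env = resolve_option(None, "MDEDUP_EXPORT_FORMAT", "mbox", environ=fake_env)
from_cli = resolve_option("mh", "MDEDUP_EXPORT_FORMAT", "mbox", environ=fake_env)
```

In this model `from_env` resolves to `maildir` and `from_cli` to `mh`; with neither set, the built-in `mbox` default documented for --export-format would apply.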