CLI parameters

Help screen

$ mdedup --help
Usage: mdedup [OPTIONS] MAIL_SOURCE_1 MAIL_SOURCE_2
              ...

  Deduplicate mails from multiple sources.

  Process:
  ● Step #1: load mails from their sources.
  ● Step #2: compute the canonical hash of each mail based on their headers (and
             optionally their body), and regroup mails sharing the same hash.
  ● Step #3: apply a selection strategy on each subset of duplicate mails.
  ● Step #4: perform an action on all selected mails.
  ● Step #5: report statistics.

Positional arguments:
  MAIL_SOURCE_1 MAIL_SOURCE_2 ...
     Mail sources to deduplicate. Can be a single mail box or a list of mails.

Mail sources (step #1):
  -i, --input-format [babyl|maildir|mbox|mh|mmdf]
     Force all provided mail sources to be parsed in the specified format. If
     not set, auto-detect the format of sources independently. Auto-detection
     only supports maildir and mbox format. Use this option to open up other box
     format, or bypass unreliable detection.

  -u, --force-unlock
     Remove the lock on mail source opening if one is found.

Hashing (step #2):
  -h, --hash-header Header-ID
     Headers to use to compute each mail's hash. Must be repeated multiple times
     to set an ordered list of headers. Header IDs are case-insensitive.
     Repeating entries are ignored.  [default: Date, From, To, Subject, MIME-
     Version, Content-Type, Content-Disposition, User-Agent, X-Priority,
     Message-ID]

  -b, --hash-body [normalized|raw|skip]
     Method used to hash the body of mails. Defaults to skip, which doesn't hash
     the body at all: it is the fastest method and header-based hash should be
     sufficient to determine duplicate set. raw use the body as it is (slow).
     normalized pre-process the body before hashing, by removing all line breaks
     and spaces (slowest).  [default: skip]

  -H, --hash-only
     Compute and display the internal hashes used to identify duplicates. Do not
     performs any selection or action.

Deduplication (step #3):
  Process each set of mails sharing the same hash and apply the selection
  --strategy. Fine-grained checks on size and content are performed beforehand.
  If differences are above safety levels, the whole duplicate set will be
  skipped. Limits can be set via the --size-threshold and --content-threshold
  options.
  -s, --strategy [discard-all-but-one|discard-bigger|discard-biggest|discard-matching-path|discard-newer|discard-newest|discard-non-matching-path|discard-older|discard-oldest|discard-one|discard-smaller|discard-smallest|select-all-but-one|select-bigger|select-biggest|select-matching-path|select-newer|select-newest|select-non-matching-path|select-older|select-oldest|select-one|select-smaller|select-smallest]
     Selection strategy to apply within a subset of duplicates. If not set,
     duplicates will be grouped and counted but all be skipped, selection will
     be empty, and no action will be performed. Description of each strategy is
     available further down that help screen.

  -t, --time-source [ctime|date-header]
     Source of a mail's time reference used in time-sensitive strategies.
     [default: date-header]

  -r, --regexp REGEXP
     Regular expression on a mail's file path. Applies to real, individual mail
     location for folder-based boxed (maildir, mh). But for file-based boxes
     (babyl, mbox, mmdf), applies to the whole box's path, as all mails are
     packed into one single file. Required in discard-matching-path, discard-
     non-matching-path, select-matching-path and select-non-matching-path
     strategies.

  -S, --size-threshold BYTES
     Maximum difference allowed in size between mails sharing the same hash. The
     whole subset of duplicates will be skipped if at least one pair of mail
     exceeds the threshold. Set to 0 to enforce strictness and apply selection
     strategy on the subset only if all mails are exactly the same. Set to -1 to
     allow any difference and apply the strategy whatever the differences.
     [default: 512]

  -C, --content-threshold BYTES
     Maximum difference allowed in content between mails sharing the same hash.
     The whole subset of duplicates will be skipped if at least one pair of mail
     exceeds the threshold. Set to 0 to enforce strictness and apply selection
     strategy on the subset only if all mails are exactly the same. Set to -1 to
     allow any difference and apply the strategy whatever the differences.
     [default: 768]

  -d, --show-diff
     Show the unified diff of duplicates not within thresholds.

Action (step #4):
  -a, --action [copy-discarded|copy-selected|delete-discarded|delete-selected|move-discarded|move-selected]
     Action performed on the selected mails. Defaults to copy-selected as it is
     the safest: it only reads the mail sources and create a brand new mail box
     with the selection results.  [default: copy-selected]

  -E, --export MAIL_BOX_PATH
     Location of the destination mail box to where to copy or move deduplicated
     mails. Required in copy-selected, copy-discarded, move-selected and move-
     discarded actions.

  -e, --export-format [babyl|maildir|mbox|mh|mmdf]
     Format of the mail box to which deduplication mails will be exported to.
     Only affects copy-selected, copy-discarded, move-selected and move-
     discarded actions.  [default: mbox]

  --export-append
     If destination mail box already exists, add mails into it instead of
     interrupting (default behavior). Affect copy-selected, copy-discarded,
     move-selected and move-discarded actions.

  -n, --dry-run
     Do not perform any action but act as if it was, and report which action
     would have been performed otherwise.

Other options:
  --time / --no-time
     Measure and print elapsed execution time.  [default: no-time]

  --color, --ansi / --no-color, --no-ansi
     Strip out all colors and all ANSI codes from output.  [default: color]

  -C, --config CONFIG_PATH
     Location of the configuration file. Supports glob pattern of local path and
     remote URL.  [default: ~/.config/mdedup/*.{toml,yaml,yml,json,ini,xml}]

  --show-params
     Show all CLI parameters, their provenance, defaults and value, then exit.

  -v, --verbosity LEVEL
     Either CRITICAL, ERROR, WARNING, INFO, DEBUG.  [default: INFO]

  --version
     Show the version and exit.

  -h, --help
     Show this message and exit.

Available strategies:
  [select-all-but-one|discard-one]
     Randomly discard one duplicate, and select all others.

  [select-bigger|discard-smallest]
     Select all bigger duplicates. Discards the smallests, i.e. the subset
     sharing the smallest size.

  [select-biggest|discard-smaller]
     Select all the biggest duplicates. Discards the smallers, i.e. all mail of
     the duplicate set but those sharing the biggest size.

  [select-matching-path|discard-non-matching-path]
     Select all duplicates whose file path match the regular expression provided
     via the --regexp parameter.

  [select-newer|discard-oldest]
     Select all newer duplicates. Discards the oldest, i.e. the subset sharing
     the most ancient timestamp.

  [select-newest|discard-older]
     Select all the newest duplicates. Discards the olders, i.e. all mail of the
     duplicate set but those sharing the newest timestamp.

  [select-non-matching-path|discard-matching-path]
     Select all duplicates whose file path doesn't match the regular expression
     provided via the --regexp parameter.

  [select-older|discard-newest]
     Select all older duplicates. Discards the newests, i.e. the subset sharing
     the most recent timestamp.

  [select-oldest|discard-newer]
     Select all the oldest duplicates. Discards the newers, i.e. all mail of the
     duplicate set but those sharing the oldest timestamp.

  [select-one|discard-all-but-one]
     Randomly select one duplicate, and discards all others.

  [select-smaller|discard-biggest]
     Select all smaller duplicates. Discards the biggests, i.e. the subset
     sharing the biggest size.

  [select-smallest|discard-bigger]
     Select all the smallest duplicates. Discards the biggers. i.e. all mail of
     the duplicate set but those sharing the smallest size.

Options

mdedup

Deduplicate mails from multiple sources.

Process:
● Step #1: load mails from their sources.
● Step #2: compute the canonical hash of each mail based on its headers (and optionally its body), and regroup mails sharing the same hash.
● Step #3: apply a selection strategy on each subset of duplicate mails.
● Step #4: perform an action on all selected mails.
● Step #5: report statistics.

mdedup [OPTIONS] MAIL_SOURCE_1 MAIL_SOURCE_2 ...
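The regrouping described in step #2 can be pictured with a short Python sketch. It is only an illustration of the idea: `group_by_hash` and its `hash_func` parameter are hypothetical names, not part of mdedup, and dropping groups that hold a single mail is an assumption.

```python
from collections import defaultdict


def group_by_hash(mails, hash_func):
    """Regroup mails sharing the same hash (hypothetical helper)."""
    groups = defaultdict(list)
    for mail in mails:
        groups[hash_func(mail)].append(mail)
    # Assumption: a group with a single member holds no duplicate.
    return {key: group for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    # Plain strings stand in for mails, str.lower for the hash function.
    print(group_by_hash(["a", "A", "b"], str.lower))
```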

Options

-i, --input-format <input_format>

Force all provided mail sources to be parsed in the specified format. If not set, the format of each source is auto-detected independently. Auto-detection only supports the maildir and mbox formats. Use this option to open other box formats, or to bypass unreliable detection.

Options:

babyl | maildir | mbox | mh | mmdf
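As a rough sketch of how a forced format differs from auto-detection, the five formats map naturally onto the classes of Python's standard `mailbox` module. The `open_source` helper and its detection heuristic (a directory is a maildir, a regular file is an mbox) are assumptions made for this example, not a description of mdedup's internals.

```python
import mailbox
from pathlib import Path

# Format identifiers mapped to the standard library mailbox classes.
FACTORIES = {
    "babyl": mailbox.Babyl,
    "maildir": mailbox.Maildir,
    "mbox": mailbox.mbox,
    "mh": mailbox.MH,
    "mmdf": mailbox.MMDF,
}


def open_source(path, input_format=None):
    """Open a mail source, guessing its format when none is forced."""
    path = Path(path)
    if input_format is None:
        # Assumed heuristic: directory => maildir, regular file => mbox.
        input_format = "maildir" if path.is_dir() else "mbox"
    # create=False: never create a box that does not already exist.
    return FACTORIES[input_format](str(path), create=False)
```

With this sketch, `open_source("./inbox.mmdf", "mmdf")` would be the only way to read an MMDF box, since the guess alone would treat the file as an mbox.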

-u, --force-unlock

Remove the lock when opening a mail source, if one is found.

-h, --hash-header <Header-ID>

Headers to use to compute each mail’s hash. Repeat the option to set an ordered list of headers. Header IDs are case-insensitive. Repeated entries are ignored.
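A minimal sketch of a header-based hash, assuming the selected headers are concatenated in order and fed to a digest. The `header_hash` function, the shortened default header list and the choice of SHA-224 are assumptions for the example; only the case-insensitivity and the handling of repeated entries come from the description above.

```python
import hashlib
from email.message import EmailMessage

# Shortened list, for illustration only.
EXAMPLE_HEADERS = ("Date", "From", "To", "Subject")


def header_hash(mail, headers=EXAMPLE_HEADERS):
    """Hash a mail from an ordered list of header IDs (hypothetical)."""
    ordered = []
    for header_id in headers:
        key = header_id.lower()  # header IDs are case-insensitive
        if key not in ordered:   # repeated entries are ignored
            ordered.append(key)
    digest = hashlib.sha224()
    for key in ordered:
        value = mail.get(key, "")  # a missing header counts as empty
        digest.update(f"{key}: {value}\n".encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    mail = EmailMessage()
    mail["From"] = "alice@example.com"
    mail["Subject"] = "Hello"
    print(header_hash(mail, ["Subject", "SUBJECT", "From"]))
```

Two mails that differ only in headers outside the list produce the same hash, which is what lets them land in the same duplicate set.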

-b, --hash-body <hash_body>

Method used to hash the body of mails. Defaults to skip, which doesn’t hash the body at all: it is the fastest method, and a header-based hash should be sufficient to identify duplicate sets. raw uses the body as it is (slow). normalized pre-processes the body before hashing by removing all line breaks and spaces (slowest).

Options:

normalized | raw | skip
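The three methods can be contrasted in a few lines of Python. This is a sketch under assumptions: `body_digest` is a hypothetical helper, SHA-224 is an arbitrary digest choice, and "line breaks and spaces" is read literally as the characters `\r`, `\n` and the space.

```python
import hashlib
import re


def body_digest(body, method="skip"):
    """Return a digest of the body according to the chosen method."""
    if method == "skip":
        return ""  # the body does not contribute to the hash
    if method == "normalized":
        # Remove all line breaks and spaces before hashing.
        body = re.sub(r"[ \r\n]+", "", body)
    elif method != "raw":
        raise ValueError(f"unknown method: {method}")
    return hashlib.sha224(body.encode("utf-8")).hexdigest()
```

Under normalized, "Hello world\r\n" and "Helloworld" yield the same digest, while raw keeps them distinct.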

-H, --hash-only

Compute and display the internal hashes used to identify duplicates. Does not perform any selection or action.

-s, --strategy <strategy>

Selection strategy to apply within a subset of duplicates. If not set, duplicates will be grouped and counted but all skipped: the selection will be empty and no action will be performed. A description of each strategy is available in the help screen above.

Options:

discard-all-but-one | discard-bigger | discard-biggest | discard-matching-path | discard-newer | discard-newest | discard-non-matching-path | discard-older | discard-oldest | discard-one | discard-smaller | discard-smallest | select-all-but-one | select-bigger | select-biggest | select-matching-path | select-newer | select-newest | select-non-matching-path | select-older | select-oldest | select-one | select-smaller | select-smallest
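The difference between the comparative (bigger) and superlative (biggest) strategies is easy to miss. The sketch below follows the strategy descriptions of the help screen, using a hypothetical mapping of mail identifiers to sizes in bytes; the function names are illustrative and not mdedup's API.

```python
def select_bigger(sizes):
    """Select all duplicates except those sharing the smallest size."""
    smallest = min(sizes.values())
    return {mail_id for mail_id, size in sizes.items() if size > smallest}


def select_biggest(sizes):
    """Select only the duplicates sharing the biggest size."""
    biggest = max(sizes.values())
    return {mail_id for mail_id, size in sizes.items() if size == biggest}


if __name__ == "__main__":
    sizes = {"a": 100, "b": 120, "c": 150}
    print(sorted(select_bigger(sizes)))   # ['b', 'c']
    print(sorted(select_biggest(sizes)))  # ['c']
```

The time-based strategies (newer, newest, older, oldest) follow the same pattern with timestamps in place of sizes.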

-t, --time-source <time_source>

Source of a mail’s time reference used in time-sensitive strategies.

Options:

ctime | date-header
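As an illustration of the two time sources, the following sketch derives a timestamp either from the file's ctime or from the Date header. `mail_timestamp` is a hypothetical helper, and the exact way mdedup reads either value is not documented here.

```python
import os
from email.message import Message
from email.utils import parsedate_to_datetime


def mail_timestamp(mail, path, time_source="date-header"):
    """Return a POSIX timestamp for a mail (hypothetical helper)."""
    if time_source == "ctime":
        # Taken from the file system, not from the mail content.
        return os.path.getctime(path)
    if time_source == "date-header":
        # Parsed from the mail's Date header.
        return parsedate_to_datetime(mail["Date"]).timestamp()
    raise ValueError(f"unknown time source: {time_source}")


if __name__ == "__main__":
    mail = Message()
    mail["Date"] = "Thu, 01 Jan 1970 00:00:10 +0000"
    print(mail_timestamp(mail, "unused"))  # 10.0
```

ctime may be the more useful reference when Date headers are missing or unreliable, at the cost of depending on how the files were copied around.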

-r, --regexp <REGEXP>

Regular expression on a mail’s file path. For folder-based boxes (maildir, mh), it applies to the real location of each individual mail. For file-based boxes (babyl, mbox, mmdf), it applies to the path of the whole box, as all mails are packed into one single file. Required by the discard-matching-path, discard-non-matching-path, select-matching-path and select-non-matching-path strategies.
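A path-based selection could look like the sketch below. Whether the expression is matched anywhere in the path or anchored at its start is not stated above; this example assumes a search anywhere in the path, and both function names are hypothetical.

```python
import re


def select_matching_path(paths, regexp):
    """Keep the paths in which the regular expression is found."""
    pattern = re.compile(regexp)
    return [path for path in paths if pattern.search(path)]


def select_non_matching_path(paths, regexp):
    """Keep the paths in which the regular expression is not found."""
    pattern = re.compile(regexp)
    return [path for path in paths if not pattern.search(path)]


if __name__ == "__main__":
    paths = ["/home/u/Maildir/cur/1", "/home/u/Archive/cur/1"]
    print(select_matching_path(paths, r"/Archive/"))
```

For an mbox, every mail of the box shares the same path, so such a strategy selects either all of its mails or none of them.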

-S, --size-threshold <BYTES>

Maximum difference allowed in size between mails sharing the same hash. The whole subset of duplicates will be skipped if at least one pair of mails exceeds the threshold. Set to 0 to enforce strictness and apply the selection strategy on the subset only if all mails are exactly the same. Set to -1 to allow any difference and apply the strategy whatever the differences.
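The pairwise rule can be sketched as follows, assuming the sizes of one duplicate set are given as a list of byte counts. `within_size_threshold` is a hypothetical name; the default of 512 is the one shown in the help screen.

```python
from itertools import combinations


def within_size_threshold(sizes, threshold=512):
    """Tell whether every pair of mails stays within the threshold."""
    if threshold < 0:
        return True  # -1 allows any difference
    # One pair exceeding the threshold is enough to skip the whole set.
    return all(abs(a - b) <= threshold for a, b in combinations(sizes, 2))


if __name__ == "__main__":
    print(within_size_threshold([1000, 1400]))       # True: 400 <= 512
    print(within_size_threshold([1000, 1600]))       # False: 600 > 512
    print(within_size_threshold([1000, 1001], 0))    # False: strict mode
    print(within_size_threshold([1, 1_000_000], -1))  # True: check disabled
```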

-C, --content-threshold <BYTES>

Maximum difference allowed in content between mails sharing the same hash. The whole subset of duplicates will be skipped if at least one pair of mails exceeds the threshold. Set to 0 to enforce strictness and apply the selection strategy on the subset only if all mails are exactly the same. Set to -1 to allow any difference and apply the strategy whatever the differences.

-d, --show-diff

Show the unified diff of duplicates not within thresholds.
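For readers unfamiliar with the format, a unified diff of two mail bodies can be produced with Python's standard `difflib` module. This sketch only shows the kind of output to expect; `unified_body_diff` and its labels are hypothetical, and the text mdedup actually compares may differ.

```python
import difflib


def unified_body_diff(body_a, body_b, name_a="mail A", name_b="mail B"):
    """Return the unified diff of two bodies as a single string."""
    lines = difflib.unified_diff(
        body_a.splitlines(keepends=True),
        body_b.splitlines(keepends=True),
        fromfile=name_a,
        tofile=name_b,
    )
    return "".join(lines)


if __name__ == "__main__":
    print(unified_body_diff("Hello\nWorld\n", "Hello\nworld\n"))
```

Identical bodies produce an empty string, so the diff is only informative for duplicates that actually differ.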

-a, --action <action>

Action performed on the selected mails. Defaults to copy-selected as it is the safest: it only reads the mail sources and creates a brand new mail box with the selection results.

Options:

copy-discarded | copy-selected | delete-discarded | delete-selected | move-discarded | move-selected

-E, --export <MAIL_BOX_PATH>

Location of the destination mail box to which deduplicated mails are copied or moved. Required by the copy-selected, copy-discarded, move-selected and move-discarded actions.

-e, --export-format <export_format>

Format of the mail box to which deduplicated mails will be exported. Only affects the copy-selected, copy-discarded, move-selected and move-discarded actions.

Options:

babyl | maildir | mbox | mh | mmdf

--export-append

If the destination mail box already exists, add mails to it instead of interrupting, which is the default behavior. Affects the copy-selected, copy-discarded, move-selected and move-discarded actions.
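How the three export options interact can be sketched with the standard `mailbox` module. This is an illustration under assumptions: `export_selected` is a hypothetical function, and raising `FileExistsError` merely stands in for the interruption mentioned above.

```python
import mailbox
import os

FACTORIES = {
    "babyl": mailbox.Babyl,
    "maildir": mailbox.Maildir,
    "mbox": mailbox.mbox,
    "mh": mailbox.MH,
    "mmdf": mailbox.MMDF,
}


def export_selected(mails, export_path, export_format="mbox", append=False):
    """Copy mails to a destination box (hypothetical helper)."""
    if os.path.exists(export_path) and not append:
        # Stand-in for the default behavior: refuse to touch an existing box.
        raise FileExistsError(export_path)
    box = FACTORIES[export_format](export_path, create=True)
    try:
        for mail in mails:
            box.add(mail)
        box.flush()  # write pending changes to disk
    finally:
        box.close()
```

Calling the function twice on the same path fails the second time unless `append=True` is passed, mirroring the role of --export-append.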

-n, --dry-run

Do not perform any action, but act as if it were performed and report which action would have been carried out.

--time, --no-time

Measure and print elapsed execution time.

--color, --ansi, --no-color, --no-ansi

Colorize output with ANSI codes (--color, --ansi), or strip out all colors and all ANSI codes from output (--no-color, --no-ansi).

-C, --config <CONFIG_PATH>

Location of the configuration file. Supports glob patterns for local paths, as well as remote URLs.

--show-params

Show all CLI parameters, their provenance, defaults and value, then exit.

-v, --verbosity <LEVEL>

One of CRITICAL, ERROR, WARNING, INFO or DEBUG.

Options:

CRITICAL | ERROR | WARNING | INFO | DEBUG

--version

Show the version and exit.

-h, --help

Show this message and exit.

Arguments

MAIL_SOURCE_1 MAIL_SOURCE_2 ...

Optional argument(s). Mail sources to deduplicate. Can be a single mail box or a list of mails.

Environment variables

MDEDUP_INPUT_FORMAT

Provide a default for -i

MDEDUP_FORCE_UNLOCK

Provide a default for -u

MDEDUP_HASH_HEADER

Provide a default for -h

MDEDUP_HASH_BODY

Provide a default for -b

MDEDUP_HASH_ONLY

Provide a default for -H

MDEDUP_STRATEGY

Provide a default for -s

MDEDUP_TIME_SOURCE

Provide a default for -t

MDEDUP_REGEXP

Provide a default for -r

MDEDUP_SIZE_THRESHOLD

Provide a default for -S

MDEDUP_CONTENT_THRESHOLD

Provide a default for -C

MDEDUP_SHOW_DIFF

Provide a default for -d

MDEDUP_ACTION

Provide a default for -a

MDEDUP_EXPORT

Provide a default for -E

MDEDUP_EXPORT_FORMAT

Provide a default for -e

MDEDUP_EXPORT_APPEND

Provide a default for --export-append

MDEDUP_DRY_RUN

Provide a default for -n

MDEDUP_TIME

Provide a default for --time

MDEDUP_COLOR

Provide a default for --color

MDEDUP_CONFIG

Provide a default for --config

MDEDUP_SHOW_PARAMS

Provide a default for --show-params

MDEDUP_VERBOSITY

Provide a default for --verbosity

MDEDUP_VERSION

Provide a default for --version

MDEDUP_HELP

Provide a default for --help
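"Provide a default" means the environment variable is only consulted when the option is absent from the command line. The sketch below illustrates that precedence; `resolve_option` is a hypothetical helper written for this example, and the way mdedup itself resolves values may involve further sources such as the configuration file.

```python
import os


def resolve_option(cli_value, env_name, default, environ=None):
    """Pick an option value: command line first, then environment."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value  # an explicit option always wins
    # Fall back on the environment variable, then on the built-in default.
    return environ.get(env_name, default)


if __name__ == "__main__":
    env = {"MDEDUP_STRATEGY": "select-newest"}
    print(resolve_option(None, "MDEDUP_STRATEGY", None, env))
    print(resolve_option("select-one", "MDEDUP_STRATEGY", None, env))
```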