Revision history for Email-Abuse-Investigator

0.05	Sat Mar 28 11:52:05 EDT 2026

  Bug fixes

  - Fixed _extract_and_resolve_urls() discarding the registrar abuse
    contact for URL hosts that cannot be resolved to an IP at analysis
    time.  Previously, when _resolve_host() returned undef, _whois_ip()
    was skipped entirely and the host was recorded with abuse=>'(unknown)',
    which caused abuse_contacts() to produce no contact for that host even
    though a domain WHOIS record (and therefore a registrar abuse address)
    existed.  _extract_and_resolve_urls() now falls back to a domain WHOIS
    lookup on the registrable parent of the host when the IP WHOIS yields
    no abuse address.  A new private helper _parse_domain_whois_abuse()
    performs this lookup without the full overhead of _analyse_domain().
    Combined with the protocol-relative URL fix below, this means that the
    badshamart.com spam campaign (PBS Health News / prostate supplement)
    now correctly produces a registrar abuse contact in abuse_contacts()
    even though all four badshamart.com URL hosts were unresolvable.
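
    The fallback order described above can be sketched roughly as follows.
    This is an illustration only, not the module's code: abuse_for_url_host()
    and its resolve/whois_ip/whois_domain callbacks are hypothetical names,
    and the registrable parent is approximated by the last two labels of the
    host, which is wrong for suffixes such as co.uk.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fallback logic.  The three lookups are
# passed in as code references so the sketch runs without network access.
sub abuse_for_url_host {
    my ($host, %lookup) = @_;

    my $ip    = $lookup{resolve}->($host);
    my $abuse = defined $ip ? $lookup{whois_ip}->($ip) : undef;

    if (!defined $abuse) {
        # ASSUMPTION: naive registrable parent (last two labels only).
        my @labels = split /\./, lc $host;
        my $parent = @labels > 2 ? join('.', @labels[-2, -1]) : lc $host;
        $abuse = $lookup{whois_domain}->($parent);
    }

    return defined $abuse ? $abuse : '(unknown)';
}
```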

  - Fixed _extract_http_urls() not extracting protocol-relative URLs
    (scheme-omitted form //domain/path).  These are used in spam messages
    as tracking pixels and click-redirect links, e.g.:
      <img src="//badshamart.com/o/2516/19142/347/US" ...>
    The leading // was not matched by either the https?:// absolute-URL
    regex or the HTML::LinkExtor filter, which also required a full scheme.
    Both passes now recognise the //domain form and normalise it to
    https://domain before adding it to the URL list.  The regex pass
    anchors the match to whitespace, quotes, or = to avoid false positives
    on CSS path segments and HTML comments.
    Discovered via a real spam message (PBS Health News / badshamart.com)
    where three click-redirect hrefs and one tracking-pixel src all used
    protocol-relative URLs, causing badshamart.com to be entirely absent
    from embedded_urls() and therefore from abuse_contacts().
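
    A minimal sketch of the regex pass, assuming a stand-alone helper.  The
    name extract_protocol_relative_urls() and the exact pattern are
    illustrative and are not taken from the module; the sketch also keeps
    the path when normalising, which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's _extract_http_urls().
sub extract_protocol_relative_urls {
    my ($text) = @_;
    my @urls;

    # Require start of string, whitespace, a quote, or '=' immediately
    # before the '//' so that 'http://...' and path segments such as
    # 'a//b.css' are not picked up.
    while ($text =~ m{(?:^|(?<=[\s"'=]))//([A-Za-z0-9][A-Za-z0-9.-]*\.[A-Za-z]{2,})(/[^\s"'<>]*)?}g) {
        my $host = lc $1;
        my $path = defined $2 ? $2 : '';
        push @urls, "https://$host$path";   # normalise to an absolute URL
    }

    return @urls;
}
```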

  - Fixed duplicate Salesforce Marketing Cloud comment block in
    %PROVIDER_ABUSE.  A leftover comment fragment introduced during 0.03
    appeared immediately before the real Salesforce entries, causing
    cosmetic confusion in the source.  Removed the orphaned fragment.

  - Fixed two stale references to Mail::Message::Abuse in the SUPPORT POD
    section: the perldoc command example and the CPAN Testers Dependencies
    URL both still named the old module.  Both now correctly reference
    Email::Abuse::Investigator.

  New features

  - Added Blogger/Blogspot and Google Sites to the built-in provider table
    alongside the existing Google entries:
      blogspot.com       -> abuse@google.com
      blogger.com        -> abuse@google.com
      sites.google.com   -> abuse@google.com
    Blogspot is one of the most commonly abused free hosting platforms for
    spam landing pages.  Subdomains (e.g. ruseriver.blogspot.com) are
    resolved to blogspot.com by the existing subdomain-stripping logic.
    Note: google.com is in %TRUSTED_DOMAINS and is therefore excluded from
    the domain intelligence pipeline; these entries are effective via the
    URL-host and account-provider lookup routes in abuse_contacts().

  - Documented that the {logger} constructor slot may be populated by
    Object::Configure from a configuration file, allowing log output to
    be routed through any Log::* compatible logger rather than STDERR.

0.04	Fri Mar 27 22:01:05 EDT 2026

  Bug fixes

  - Fixed abuse_contacts() silently discarding discovery routes that resolve
    to an address already seen.  When the same abuse address is found via
    multiple routes (e.g. Google as both the sending ISP via rDNS and the
    owner of a blogspot.com URL in the body), the second and subsequent
    roles are now accumulated rather than dropped.  Each hashref in the
    returned list gains a 'roles' arrayref holding the individual role
    strings, and 'role' (singular) is set to their join(' and ', ...) for
    backward compatibility.  The dry-run footer in submit_abuse_report.pl
    now reflects this: a merged entry shows both roles on one line and the
    total line reads "N recipients (M contact routes merged)" when merging
    has occurred.
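
    The merging behaviour can be sketched as below.  The 'role' and 'roles'
    keys follow the description above; the helper name
    merge_contact_routes() and the 'address' key are assumptions made for
    the example.

```perl
use strict;
use warnings;

# Hypothetical helper: merge contact routes that share one abuse address.
sub merge_contact_routes {
    my (@routes) = @_;
    my (%by_address, @merged);

    for my $route (@routes) {
        my $key = lc $route->{address};
        if (my $seen = $by_address{$key}) {
            # Same address reached by another route: keep the extra role.
            push @{ $seen->{roles} }, $route->{role};
        }
        else {
            my $entry = {
                address => $route->{address},
                roles   => [ $route->{role} ],
            };
            $by_address{$key} = $entry;
            push @merged, $entry;      # preserve first-seen order
        }
    }

    # Singular 'role' is the joined form, kept for backward compatibility.
    $_->{role} = join(' and ', @{ $_->{roles} }) for @merged;

    return @merged;
}
```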

  - Fixed _decode_multipart() not recursing into nested multipart/* parts.
    A message with Content-Type: multipart/mixed containing a nested
    multipart/alternative (a common structure for HTML+plaintext mail) had
    its body silently discarded, causing embedded_urls() to find no URLs
    and abuse_contacts() to miss all URL-host contacts.  _decode_multipart()
    now detects nested multipart/* parts, extracts the inner boundary from
    the Content-Type header, and recurses to decode the inner container.

  - Fixed abuse_contacts() section 4 (account provider lookup) incorrectly
    matching the domain of an @ sign appearing in a display name rather than
    the actual addr-spec.  A From: header of the form:
      "evil@gmail.com" <real@hotmail.com>
    was matching gmail.com instead of hotmail.com.  The addr-spec is now
    extracted from the rightmost angle-bracket pair before the domain is
    parsed; without angle brackets the whole value is used as before.
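
    One way to express the rightmost-angle-bracket rule is sketched below.
    addr_spec_domain() is a hypothetical name, and the pattern is a
    simplification that does not attempt full RFC 5322 parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the domain of the addr-spec in a From: value.
sub addr_spec_domain {
    my ($from) = @_;

    # The greedy .* makes this capture the contents of the rightmost
    # <...> pair; with no angle brackets the whole value is used.
    my $addr = $from =~ /.*<([^<>]*)>/s ? $1 : $from;

    return $addr =~ /\@([A-Za-z0-9.-]+)\s*$/ ? lc($1) : undef;
}
```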

  New features

  - Added implausible_timezone (MEDIUM, weight 2) risk flag.  Numeric
    timezone offsets in the Date: header are now validated against the
    real-world range of +1400 (Line Islands) to -1200 (Baker Island).
    Offsets outside that range, or with a minutes field >= 60, raise this
    flag.  Positive and negative bounds are checked separately; a symmetric
    limit would wrongly accept values such as -1300.
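
    The asymmetric bounds check can be sketched as follows.  The function
    name implausible_timezone() mirrors the flag name but is otherwise
    hypothetical, as is the pattern used to find the offset at the end of
    the Date: value.

```perl
use strict;
use warnings;

# Hypothetical check: true if the numeric offset in a Date: header is
# outside +1400 .. -1200 or has a minutes field of 60 or more.
sub implausible_timezone {
    my ($date_header) = @_;

    # Offset at the end of the value, optionally followed by a comment
    # such as "(EDT)".  No numeric offset means nothing to validate here.
    my ($sign, $hh, $mm) =
        $date_header =~ /([+-])(\d{2})(\d{2})\s*(?:\([^)]*\))?\s*$/
        or return 0;

    return 1 if $mm >= 60;

    my $minutes = $hh * 60 + $mm;
    my $limit   = $sign eq '+' ? 14 * 60 : 12 * 60;   # separate bounds

    return $minutes > $limit ? 1 : 0;
}
```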

  - Added Blogger/Blogspot and Google Sites to the built-in provider table:
      blogspot.com       -> abuse@google.com
      blogger.com        -> abuse@google.com
      sites.google.com   -> abuse@google.com
    Blogspot subdomains (e.g. ruseriver.blogspot.com) are handled by the
    existing subdomain-stripping logic.

  - Added ActiveCampaign to the built-in provider table:
      activecampaign.com  -> abuse@activecampaign.com
      ac-tinker.com       -> abuse@activecampaign.com  (tracking domain)

0.03	Fri Mar 27 19:54:32 EDT 2026

  Bug fixes

  - Fixed spurious abuse reports being sent to the registrar or ISP of the
    message recipient.  Bulk mailers routinely embed the recipient's email
    address in the message body (personalisation footers, unsubscribe
    confirmations, "this email was sent to you@example.com" lines).
    _extract_and_analyse_domains() was collecting domains from the body
    without first excluding the To: and Cc: recipients, causing innocent
    parties to receive abuse reports.  The To:, Cc:, and Received: "for"
    envelope-recipient domains are now built into an exclusion set --
    including their registrable eTLD+1 parents -- before any body or header
    scanning takes place.
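
    A rough sketch of building such an exclusion set is shown below.
    recipient_exclusions() is a hypothetical name, and the registrable
    parent is approximated by the last two labels of the domain; a real
    implementation would need a public-suffix list to handle suffixes such
    as com.au correctly.

```perl
use strict;
use warnings;

# Hypothetical helper: build a set of recipient domains (plus a naive
# registrable parent) to exclude from body and header domain scanning.
sub recipient_exclusions {
    my (@recipient_addresses) = @_;
    my %exclude;

    for my $addr (@recipient_addresses) {
        my ($domain) = lc($addr) =~ /\@([a-z0-9.-]+)/ or next;
        $exclude{$domain} = 1;

        # ASSUMPTION: last two labels stand in for the eTLD+1 parent.
        my @labels = split /\./, $domain;
        $exclude{ join '.', @labels[-2, -1] } = 1 if @labels > 2;
    }

    return \%exclude;
}
```

    Domains found in the body would then be filtered with something like
    grep { !$exclude->{$_} } @candidates before any WHOIS lookup is made.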

  - Fixed "no abuse contacts could be determined" when analysing email
    sent via Salesforce Marketing Cloud (ExactTarget).  Three separate
    causes were identified and corrected:

    1. Salesforce Marketing Cloud was absent from the built-in provider
       table.  Added salesforce.com, mc.salesforce.com, exacttarget.com,
       and et.exacttarget.com, all mapping to abuse@salesforce.com.

    2. Non-routable hostnames such as iad4s13mta756.xt.local (injected
       by Salesforce's MTA into the Message-ID) were passing through the
       domain collection pipeline and consuming a WHOIS lookup slot that
       could never return an actionable result.  The $record closure in
       _extract_and_analyse_domains() now rejects any domain whose TLD is
       not at least two alphabetic characters, and explicitly rejects the
       pseudo-TLDs .local, .internal, .lan, .localdomain, and .arpa.

    3. When a message carries multiple DKIM-Signature headers (common
       with ESPs: the first signs for the customer domain, the second
       for the ESP infrastructure), _parse_auth_results_cached() took
       only the first d= tag and stopped.  It now collects all d= domains
       and sets dkim_domain to whichever one has a hit in the provider
       table -- identifying the actionable ESP -- falling back to the
       first if none match.  All collected domains are fed into the
       domain analysis pipeline via the new dkim_domains arrayref in the
       auth results hashref.
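
    Causes 2 and 3 can be sketched as two small helpers.  Both names
    (is_routable_domain, choose_dkim_domain) and the provider-table
    argument are assumptions for illustration; the module performs these
    steps inside its own $record closure and _parse_auth_results_cached().

```perl
use strict;
use warnings;

my %PSEUDO_TLD = map { $_ => 1 } qw(local internal lan localdomain arpa);

# Hypothetical filter: reject names that can never yield a useful WHOIS.
sub is_routable_domain {
    my ($domain) = @_;
    my ($tld) = lc($domain) =~ /\.([^.]+)\.?$/ or return 0;
    return 0 unless $tld =~ /^[a-z]{2,}$/;    # at least two letters
    return $PSEUDO_TLD{$tld} ? 0 : 1;
}

# Hypothetical selector: collect every d= domain from the DKIM-Signature
# headers and prefer the first one present in the provider table.
sub choose_dkim_domain {
    my ($provider_table, @signature_headers) = @_;

    my @domains;
    for my $sig (@signature_headers) {
        push @domains, lc($1) if $sig =~ /(?:^|;)\s*d\s*=\s*([^;\s]+)/;
    }

    my ($hit) = grep { exists $provider_table->{$_} } @domains;

    # Fall back to the first d= domain when no provider matches.
    return (defined $hit ? $hit : $domains[0], \@domains);
}
```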

  - The --dry-run output of submit_abuse_report.pl now appends a compact
    recipient summary at the foot of the report:

        Total: 2 recipients

          abuse@tpg.com.au (Sending ISP)
          abuse@godaddy.com (Domain registrar for firmluminary.com)

    Previously only the count was shown.  The summary allows a user to
    confirm at a glance who would receive reports without scrolling back
    through the full numbered table.

  - submit_abuse_report now produces fully RFC 5965 (ARF) compliant
    messages.  The MIME structure changed from multipart/mixed (two parts)
    to multipart/report; report-type=feedback-report (three parts):
      Part 1  text/plain                 human-readable abuse report
      Part 2  message/feedback-report    ARF machine-readable metadata
      Part 3  message/rfc822             original spam message verbatim
    The feedback-report part includes Feedback-Type, Version, User-Agent,
    Source-IP, Original-Mail-From, Original-Rcpt-To, Arrival-Date,
    Reported-Domain, Reported-Uri (one per URL), and Authentication-Results.

0.02	Fri Mar 27 19:04:37 EDT 2026
  - Added bin/submit_abuse_report

0.01	Fri Mar 27 14:23:09 EDT 2026
        First draft
