fighting email spam

from HTYP, the free directory anyone can edit if they can prove to me that they're not a spambot

Overview

This page is about fighting spam received via email.

Types of Techniques

Email-spam-fighting techniques can be broken down into the following categories:

  • Pre-send: prevent spammers from getting your "real" address
    • Harvesting control: one major source of spammable email addresses is web pages; you can prevent or control such harvesting in a number of different ways; see section below
    • Domain control:
      • "per-org unique addresses": see section below
      • "time-stamped addresses": a powerful harvesting control technique; see section below
    • Harvest poisoning: provide loads and loads of useless data (in a format which only a 'bot would be likely to find), thus making the spammers' database mostly chaff.
      • There is a program called sugarplum which does this for Linux, but I'm not sure where to find more information about it
  • Post-send: filters and such for catching the mail after it has been sent but before you have to read it
    • Most decent web hosts now include spam-fighting tools with their email service. These tend to include sender verification against lists of known spammers, and thus do a much better job than training-based filters.
    • Most decent email clients (including Thunderbird and Evolution) also include spam fighting tools, but they tend to be error-prone.
    • 3rd-party services are available which prevent email from reaching you unless the sender has passed a "humanity test" (most commonly a "CAPTCHA"); this is a problematic solution at best as it may prevent you from receiving legitimate automated messages and is a nuisance for legitimate senders.
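
The harvest-poisoning idea from the pre-send list above can be sketched roughly as follows. This is only an illustration, not how sugarplum works: the function names are made up for this example, and it assumes you would rather generate addresses under the reserved ".invalid" top-level domain than risk inventing an address that belongs to a real person.

```python
import random
import string


def chaff_addresses(count, seed=None):
    """Return `count` well-formed but undeliverable email addresses."""
    rng = random.Random(seed)  # a seed makes the output repeatable for testing
    letters = string.ascii_lowercase
    addresses = []
    for _ in range(count):
        local = "".join(rng.choice(letters) for _ in range(rng.randint(5, 10)))
        host = "".join(rng.choice(letters) for _ in range(rng.randint(4, 8)))
        # ".invalid" is reserved and never resolves, so no real mailbox is hit.
        addresses.append(f"{local}@{host}.invalid")
    return addresses


def chaff_html(count, seed=None):
    """Wrap the chaff in mailto: links inside a block hidden from human readers."""
    links = "\n".join(
        f'<a href="mailto:{addr}">{addr}</a>'
        for addr in chaff_addresses(count, seed)
    )
    return f'<div style="display: none">\n{links}\n</div>'


if __name__ == "__main__":
    print(chaff_html(3))
```

The trade-off is that a harvester could trivially drop every ".invalid" address, so this particular choice favors safety over effectiveness.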

Per-Organization Unique Addresses

A significant source of spammable email addresses is disreputable organizations who sell their emailing lists; if you have your own domain name for email and the ability to set up a "catch-all", you can invent a new email address whenever you need to give one out. If that address starts receiving spam, then there are simple techniques to have it automatically discarded. Furthermore, if the address is named in such a way that you can easily identify the organization to whom it was given (e.g. "nameoforganization@yourdomain.com"), you can also keep track of which organizations have been less than discreet with the information you have given them.
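
As a minimal sketch of the bookkeeping involved (assuming a catch-all mailbox already exists), the following shows one way to derive an address from an organization's name, work out which organization a received address was given to, and decide whether to discard mail for it. The domain, the function names, and the blocked set are placeholders invented for this example.

```python
import re

DOMAIN = "yourdomain.com"   # placeholder: your own catch-all domain
BLOCKED = {"shadyshop"}     # hypothetical: addresses that have started receiving spam


def address_for(org_name, domain=DOMAIN):
    """Build a unique address whose local part names the organization."""
    local = re.sub(r"[^a-z0-9]+", "", org_name.lower())
    if not local:
        raise ValueError("organization name has no usable characters")
    return f"{local}@{domain}"


def org_from_address(address, domain=DOMAIN):
    """Return the organization tag for one of our addresses, or None."""
    local, _, host = address.lower().rpartition("@")
    if host != domain or not local:
        return None
    return local


def should_discard(recipient, blocked=BLOCKED, domain=DOMAIN):
    """True if the recipient address was handed to an organization we now block."""
    return org_from_address(recipient, domain) in blocked
```

For example, `address_for("Shady Shop, Inc.")` gives `shadyshopinc@yourdomain.com`, and any spam later sent to that address points back to the organization it was given to.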

Web Page Harvesting Control

Email addresses which are published on a web page are a particularly likely target for spam, as any "findable" page on the web will inevitably be patrolled by spambots looking for email addresses for their spam databases. The one advantage we have here is that this process is highly automated, which allows for a number of prevention techniques.

Some things to remember:

  1. While some spambots may be fairly sophisticated, many of them may be written by novice script-kiddies; such bots should be relatively easy to fool.
  2. While most workarounds are very easy to defeat, doing so probably won't be worth the spambot programmer's time; the small increase in the number of harvested addresses simply wouldn't justify the extra coding (and possibly support, if the spambot is being sold or traded in the "underground" malware market).

simple HTML obfuscation

A few simple techniques which are easily worked around by spambot programmers, but which should at least cut down the volume of spam:

  • Instead of the "@" and "." characters, use the HTML entities &#64; and &#46; respectively. These will copy-and-paste properly into an email program, as well as being visually indistinguishable from plain "@" and "." on a web browser. (Note: on this wiki, you can use the email template to disguise email addresses this way.)
  • Insert HTML markup around the "@" and ".", such as <i> (italics), <b> (bold), <span class="whatever"> (style sheet markup; no visible effect unless you define the "whatever" class in CSS). This requires the spambot to filter out all HTML tags before searching for email addresses on the page.

An example of the above two techniques combined (using the HTYP email template): spam1spam@spamhtypspam.spamorg
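
The two techniques can be sketched in a few lines. This is not the HTYP email template, just an illustration with made-up function names; it uses the decimal character references &#64; (for "@") and &#46; (for ".") and assumes the input is a plain address.

```python
import html


def obfuscate_entities(address):
    """Replace "@" and "." with decimal HTML character references."""
    return address.replace("@", "&#64;").replace(".", "&#46;")


def obfuscate_with_markup(address, tag="span"):
    """Additionally wrap each "@" and "." in a harmless HTML element."""
    pieces = []
    for ch in address:
        if ch == "@":
            pieces.append(f"<{tag}>&#64;</{tag}>")
        elif ch == ".":
            pieces.append(f"<{tag}>&#46;</{tag}>")
        else:
            pieces.append(html.escape(ch))  # keep any other character HTML-safe
    return "".join(pieces)


if __name__ == "__main__":
    print(obfuscate_entities("user@example.org"))
    print(obfuscate_with_markup("user@example.org", tag="i"))
```

A browser decodes the references back to the original characters, which is why the result still displays and copies as a normal address.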

  • Insert "invisible" characters within the email address using <span style="display: none">. Example: spam2SPAM@SPAMhtypSPAM.SPAMorg
    • This technique has the disadvantage that it does not copy-and-paste properly, but if the inserted text is something obvious like " REMOVE ME ", you can instruct people to remove the extra text before emailing. Unfortunately, such instructions tend to get overlooked. If you include illegal characters (such as spaces) in the inserted text, you may actually be able to prevent the email from being sent until it is "fixed" by the user. Some experimentation is probably called for here.
    • The above example also includes a "mailto:" link; some users can click on such links to open their email programs, but many computers are not configured to allow this to work. This opens the issue of whether or not to "obfuscate" the address as shown in the email link. A poorly-written spambot might be HTML-unaware and find the unobfuscated address inside the <a href="mailto:..."> tag – or it might be set to strip out all HTML before looking, in which case the unobfuscated address would be missed. As with the above, some experimentation is probably called for here.
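
The hidden-text technique can be sketched like this. The function names are invented for the example, and `scraper_view` is only a stand-in for a spambot that strips tags without honoring CSS; real harvesters may behave differently.

```python
import html
import re


def obfuscate_hidden(address, filler="SPAM"):
    """Surround each "@" and "." with filler text that CSS hides from readers."""
    hidden = f'<span style="display: none">{html.escape(filler)}</span>'
    pieces = []
    for ch in address:
        if ch in "@.":
            pieces.append(hidden + ch + hidden)
        else:
            pieces.append(html.escape(ch))
    return "".join(pieces)


def scraper_view(markup):
    """What a naive tag-stripping harvester sees: tags gone, hidden text kept."""
    return re.sub(r"<[^>]*>", "", markup)


if __name__ == "__main__":
    markup = obfuscate_hidden("spam2@htyp.org")
    print(markup)
    print(scraper_view(markup))
```

Passing a filler such as " REMOVE ME " (with spaces) instead of "SPAM" gives the variant in which the pasted address is invalid until the user fixes it.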

domain-dependent techniques

If you have your own domain name and can control the redirection of email, then some much more powerful techniques become available.

  • Insert a date in the address, like this: spam320241121@htyp.org (this example uses HTYP's emaildated template; a normal-looking address is displayed by the browser). This requires that you are able to either (a) set up a "catch-all" email account on your domain, so that all email not otherwise redirected goes to a valid email account, or (b) dynamically configure your email handler to accept email sent to an address that changes every day.
    • In this example, the displayed address appears normal, but the mailto: address has a string of extra digits appended to the name. These digits change every day, so if your email address happens to get picked up by a spambot on, say, November 21, 2024, every copy harvested that day will carry that particular date. All you have to do is put in a redirect for that date's address (in this case, spam320241121@yourdomain.com) and send those emails straight to purgatory (or your worst enemy, or the spam-reporting address of your choice).
    • For the displayed address, you can use any of the simple obfuscation techniques described above or include the date; it's kind of a trade-off between friendly-looking addresses and spam-prevention.
  • Assign a new address for each display of the web page: This is much the same as the above technique but provides a higher degree of filtering. It may also help pollute spam databases, thus rendering them less effective. (It is not, however, easy to implement as a MediaWiki template. It would be pretty easy to include the time after the date, which would accomplish much the same thing, but MediaWiki's "CURRENTTIME" variable prints the time with a colon, and I don't know if that might cause problems with email routing; further investigation is needed. --Woozle 10:31, 12 October 2006 (EDT))
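
A rough sketch of the date-stamping scheme, outside of MediaWiki: one function builds the address for a given day, another recovers the date from a received address, and a third decides whether to discard the message. The function names and the blocked-dates set are assumptions made for illustration; the actual redirect would be configured in whatever mail handler you use.

```python
import re
from datetime import date, datetime

# Hypothetical: days on which a spambot is known to have harvested the page.
BLOCKED_DATES = {date(2024, 11, 21)}


def dated_address(local, domain, today=None):
    """Append today's date (YYYYMMDD) to the local part of the address."""
    today = today or date.today()
    return f"{local}{today:%Y%m%d}@{domain}"


def stamp_date(address):
    """Return the date embedded just before the "@", or None if there is none."""
    match = re.search(r"(\d{8})@", address)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:  # eight digits that do not form a real date
        return None


def should_discard(recipient, blocked=BLOCKED_DATES):
    """True if the recipient address carries a date we have blocked."""
    return stamp_date(recipient) in blocked
```

With these definitions, `dated_address("spam3", "htyp.org", date(2024, 11, 21))` yields `spam320241121@htyp.org`. A per-view variant could append a time formatted without a colon (for instance with `%H%M%S`), which would sidestep the routing concern mentioned above, though that is untested here.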

obfuscation with images

Another way to prevent harvesting is to show your email address only as an image, never as text. This is primarily useful for individual users with the necessary image-editing software (most operating systems are shipped with sufficient software to create a usable image of this sort); it does not scale well to applications with multiple email addresses or directories which must be updated frequently. Most web-servers do have the capability of generating text-based images on the fly (using imagemagick or its equivalent), but this would require some CGI coding. It is also a nuisance for users, who must accurately transcribe email addresses (which often contain essentially arbitrary mixes of letters and numbers) from one window (the browser) to another (email client) without the convenience of copy-and-paste.

site-based webmail

Rather than giving out email addresses, a web site could provide links to an on-site message-sending page which looks up email addresses internally based on an index sent as part of the URL (e.g. http://mysite.org/webmail?recipient=3). This could be combined with the image-obfuscation technique to give site visitors the choice of using either the web interface (single-click convenience, but webmail interfaces are often poorly designed; a topic for another page...) or their preferred email client.

While this would completely prevent spambots from harvesting addresses directly, automatic harvesting could still take place if the webmail page automatically sends anything to the sender's address and if it includes a valid email address in such automatic messages. Such replies need to be carefully designed to prevent such harvesting without unnecessarily hampering communication with legitimate emailers. This is only likely to be a problem, however, on high-traffic sites with large numbers of email addresses to harvest; smaller sites aren't worth the trouble of custom code to handle the webmail interface.
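
The lookup step might look something like the sketch below, which assumes a URL of the form shown earlier (webmail?recipient=3). The recipient table, names, and addresses are hypothetical, and actually sending the message is left out; the point is only that the address stays on the server and that the automatic acknowledgement mentions a name rather than an address.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical server-side table: index -> (display name, real address).
RECIPIENTS = {
    3: ("Site contact", "contact@example.org"),
}


def resolve_recipient(url, recipients=RECIPIENTS):
    """Return (display name, address) for the index in the URL, or None."""
    values = parse_qs(urlparse(url).query).get("recipient", [])
    if len(values) != 1:
        return None
    try:
        index = int(values[0])
    except ValueError:
        return None
    return recipients.get(index)


def confirmation_text(display_name):
    """Acknowledgement for the sender; it names the recipient but never the address."""
    return f"Your message to {display_name} has been sent."


if __name__ == "__main__":
    found = resolve_recipient("http://mysite.org/webmail?recipient=3")
    if found:
        name, _address = found
        print(confirmation_text(name))
```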

overkill prevention

  • HashCash "is a denial-of-service counter measure tool. Its main current use is to help hashcash users avoid losing email due to content based and blacklist based anti-spam systems."