SpamFerret

from HTYP, the free directory anyone can edit if they can prove to me that they're not a spambot
Revision as of 00:56, 15 July 2009 by Woozle (talk | contribs) (→‎Tables: patterns.isDiff)
Jump to navigation Jump to search

Navigation

{{#lst:MediaWiki|navbar}}: extensions / fighting spam: SpamFerret

Overview

SpamFerret is my attempt at an improvement over the SpamBlacklist extension. The version posted here works, but you have to manually enter patterns (blacklisted items) into the database. Fortunately this isn't hard to do, but a more friendly interface would be nice (especially some way to take a spam-page, break it up into unique URLs, and let you select which ones to add to the database).

Requirements

SpamFerret requires the mysqli library to be installed/enabled on PHP. (It may be a good idea, later, to rewrite it using the standard mysql library, as some servers don't have mysqli installed; I wasn't aware of this when I started writing SpamFerret. It would also be nice to have the ability to use other database engines besides MySQL via the dbx libraries, but I don't currently have access to these on my main server. -W.)

To-Do

  • If spam filter db cannot be contacted, fall back to a captcha rather than letting spam through
  • Important: Document the installation process (pretty simple; the only non-obvious bit is the database spec)
  • Option to allow banned IP addresses to create account and go to link received by email in order to un-block IP
  • Improved pages for reporting automatic IP suspension, ampersandbots
  • Automated reporting of intercepted spam
  • Web-based tool ("Special" page) for:
    • Entering new patterns
    • Deactivating/modifying disused patterns
    • Testing sample spam against existing or candidate filters (Kiki regex tester shows matches for spam which SpamFerret is letting through...)
  • Log of changes to "patterns" table
  • Management tools for new spam (generate candidate patterns from spam page, allow user to fine-tune and choose which ones to use, and add chosen/tweaked patterns to database)
  • Easy way to automatically revert to chosen revision while showing all edits below filter-creation form, to make it easier to add new patterns
  • Manual reporting tools:
    • basic data viewing, i.e. received spam grouped by IP address, by triggered filter, or in chronological order
    • list patterns least recently used, for possible deactivation
    • whois of all recently promoted domains and create a consolidated list of owners
  • Optional log of complete spam contents, for possible data-mining or filter-training (e.g. to answer the question "if I disable this set of filters, how many spams would have gotten through instead of being caught by the other filters?" or "how effective would this proposed new filter have been at catching the spams received thus far?")

MW Versions

  • SpamFerret (no version number yet) has been used without modification on MediaWiki versions 1.5.5, 1.7.1, 1.8.2, 1.9.3, 1.11.0, and 1.12.

Purpose

The SpamBlacklist extension has a number of shortcomings:

  • Can only handle a limited number of entries before exceeding the maximum string-length it can process, at which point all spam is allowed through
  • Does not keep track of which entries are still being "tried" (to allow for periodic "cleaning" of the list)
  • Does not keep track of offending IP addresses
  • Handles only domains; cannot blacklist by URL path (for partially compromised servers) or "catch phrases" found in spam and nowhere else
  • Does not keep a log of failed spam attempts, so there is no way to measure effectiveness

SpamFerret, on the other hand:

  • is database-driven
  • keeps logs and counts of spam attempts by blacklisting and by IP
  • matches domains ("http://*.domain"), URLs ("http://*.domain/path") and catch-phrases ("helo please to forgive my posting but my children are hungary")
    • can also match patterns, like long lists of links in a certain format

It may also be unsuitable for use on busier wikis, as the checking process (which only happens when an edit is submitted) may take a fair amount of CPU time (checks the entire page once per blacklisted pattern). This shouldn't be a problem for smaller wikis, which are often monitored less frequently than busier wikis and hence are more vulnerable to spam.

Tables

History

  • 2009-07-14 Adding attempts.diff and patterns.isDiff to enable more advanced filtering (e.g. bots which delete most of a section and replace it with a nonsense word)

SQL

<sql> CREATE TABLE `patterns` (

 `ID` INT NOT NULL AUTO_INCREMENT,
 `Pattern` varchar(255) COMMENT "pattern to match (regex or straight text match)",
 `WhenAdded` DATETIME DEFAULT NULL COMMENT "when this entry was added",
 `WhenTried` DATETIME DEFAULT NULL COMMENT "when a spammer last attempted to include this pattern",
 `isActive` BOOL COMMENT "FALSE = do not include in checking",
 `isURL` BOOL COMMENT "TRUE indicates that additional URL-related stats may be collected",
 `isRegex` BOOL COMMENT "TRUE = regex (use preg_match()); FALSE = normal string comparison (use stristr())",
 `isDiff` BOOL COMMENT "TRUE = pattern should be compared to the diff, not the edit contents",
 `Count` INT DEFAULT 0 COMMENT "number of attempts",
 PRIMARY KEY(`ID`)

) ENGINE = MYISAM;

CREATE TABLE `clients` (

 `ID` INT NOT NULL AUTO_INCREMENT,
 `Address` varchar(15) COMMENT 'IP address',
 `WhenFirst` DATETIME COMMENT 'when this IP address first submitted a spam',
 `WhenLast` DATETIME COMMENT 'when this IP address last submitted a spam',
 `Retries` INT DEFAULT NULL COMMENT 'number of spam retries',
 `Count` INT DEFAULT 0 COMMENT 'number of attempts',
 `doBlock` BOOL COMMENT "TRUE = permanent block",
 `Notes` VARCHAR(255) DEFAULT NULL COMMENT "human-added notes",
 PRIMARY KEY(`ID`)

) ENGINE = MYISAM;

CREATE TABLE `attempts` (

 `ID`         INT NOT NULL AUTO_INCREMENT,
 `When`       DATETIME                   COMMENT "timestamp of attempt",
 `ID_Pattern` INT DEFAULT NULL           COMMENT "(patterns.ID) matching pattern found",
 `ID_Client`  INT NOT NULL               COMMENT "(clients.ID) spamming client",
 `IDS_Session` varchar(255)              COMMENT "PHP session ID from session_id()",
 `Code`       varchar(15)                COMMENT "type of attempt",
 `PageServer` varchar(63)                COMMENT "identifier of wiki being attacked (usually domain)",
 `PageName`   varchar(255)               COMMENT "name of page where the spam would have displayed",
 `MatchText`  varchar(255) DEFAULT NULL  COMMENT "optional: text that triggered the filter",
 `didAllow`   BOOL         DEFAULT FALSE COMMENT "edit was allowed",
 `Diff`       MEDIUMTEXT                 COMMENT "difference between original and attempted edit",
 PRIMARY KEY(`ID`)

) ENGINE = MYISAM;

CREATE TABLE `sessions` (

 `ID` INT NOT NULL AUTO_INCREMENT,
 `Address`    varchar(15)           COMMENT "IP address",
 `Username`   varchar(255)          COMMENT "username",
 `Browser`    varchar(255) NOT NULL COMMENT "browser client string",
 `ID_User`    INT      DEFAULT NULL COMMENT "MediaWiki user.user_id",
 `PageServer` varchar(63)  NOT NULL COMMENT "identifier of site logged in to (usually domain)",
 `WhenStart`  DATETIME     NOT NULL COMMENT "when this session began (login)",
 `WhenFinish` DATETIME DEFAULT NULL COMMENT "when this session ended (logout, if any)",
 PRIMARY KEY(`ID`)

) ENGINE = MYISAM; </sql>

  • attempts.Code:
    • NULL = normal filter match
    • "AMP" = ampersandbot (to be eventually superceded by some kind of difference-pattern)
    • "THR" = throttled: too many spam attempts, temporary blacklist of IP address
  • attempts.MatchText: mainly useful for regex filters and (especially) non-listable offenses
  • clients.Retries:
    • Each time a client submits spam, clients.Retries increments...
    • ...unless clients.WhenLast was sufficiently long ago, in which case clients.Retries is reset to 0 (and WhenLast is updated).
    • Each time a client submits non-spam, if clients.Retries is too high and WhenLast is recent enough, the content is refused without checking for a spam match.
  • Net effect is that too many spams within a certain period of time causes an IP to be temporarily blacklisted.
  • attempts.ID_Pattern needs to allow NULL in order to log throttled saves with no pattern match

sessions

I added the sessions table when I thought the MediaWiki software wasn't recording IP addresses anywhere, but then it turns out that the IP address and username is stored in recentchanges for each edit (user ID = rc_user, user name = rc_user_text, IP address = rc_ip). It may turn out to be easier to use sessions for reports and such (can you do a JOIN across databases?), but for now it isn't actually used.

If I were going to use it, I'd probably revise the other tables to point to it instead of recording user info locally.

Data Views

AttemptsEx

Eventually, some Specialpages with reports would be nice, but for now you can see what's being blocked, and where the spam attempts are coming from, with this query conveniently packaged in a stored in a "view": <mysql>CREATE OR REPLACE VIEW `AttemptsEx` AS

 SELECT
   a.ID,
   a.`When` AS WhenDone,
   a.Code,
   a.ID_Pattern,
   p.Pattern,
   a.ID_Client,
   c.Address,
   c.doBlock,
   a.PageServer,
   a.PageName,
   a.MatchText,
   a.didAllow
 FROM
   (
     (attempts AS a
       LEFT JOIN patterns AS p
       ON (a.ID_Pattern = p.ID)
      ) LEFT JOIN clients AS c
        ON (a.ID_Client = c.ID)
    ) ORDER BY a.ID DESC;</mysql>

ClientThrottle

This view is for pre-screening Clients (IP addresses) -- includes a column showing how long since the last spam attempt (in seconds) and another to indicate whether a client is permanently blocked: <mysql>CREATE OR REPLACE VIEW `ClientThrottle` AS

 SELECT
   ID,
   Address,
   WhenFirst,
   WhenLast,
   Count,
   IFNULL(Retries,0) AS Retries,
   TIMESTAMPDIFF(SECOND,WhenLast,NOW()) AS ThrottleTime,
   doBlock
 FROM clients;</mysql>

Code

This consists of the following:

<php>require_once( "$IP/extensions/SpamFerret.php" ); $wgSpamFerretSettings['dbspec'] = /* connection string - see notes below */; $wgSpamFerretSettings['throttle_retries'] = 5; // 5 strikes and you're out $wgSpamFerretSettings['throttle_timeout'] = 86400; // 86400 seconds = 24 hours</php>

  • connection string has the following format: mysql:db_user_name:db_user_password@db_server/spamferret_db_name
    • Example: mysql:spfbot:b0tpa55@myserver.com/spamferretdb
  • SpamFerret.php and data.php still contain some debugging code, most of which I'll clean up later (some of it calls stubbed debug-printout routines which can come in handy when adding features or fixing the inevitable bugs).

Update Log

  • 2007-12-27 see SpamFerret.php code comments
  • 2007-10-13 (1) IP throttling, and (2) logging of ampersandbot attempts (not tested)
    • If an IP address makes more than N spam attempts with no more than T seconds between them, it will not be allowed to post anything until a further T seconds have elapsed without spam.
  • 2007-06-10 Added code to prevent ampersandbot edits; need to add logging of those blocks, but don't have time right now. Also don't know if the ampersandbots trim off whitespace or if that's just how MediaWiki is displaying the changes.
  • 2007-08-30 Current version accommodates some changes to the data.php class library