SpamFerret
Overview
SpamFerret is my attempt at an improvement over the SpamBlacklist extension. The version posted here works, but you have to enter patterns (blacklisted items) into the database manually. Fortunately this isn't hard to do, but a friendlier interface would be nice (especially some way to take a spam page, break it up into unique URLs, and let you select which ones to add to the database).
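Until there is a pattern-entry UI, adding a blacklist entry by hand is a single INSERT against the patterns table described under Design below. The domain and flag values here are purely illustrative:

<mysql>-- blacklist a spamvertised domain (non-regex, URL-type entry)
INSERT INTO patterns (Pattern, WhenAdded, isActive, isURL, isRegex)
VALUES ('spam-pills.example.com', NOW(), TRUE, TRUE, FALSE);</mysql>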
To-Do
- Important: Document the installation process (pretty simple; the only non-obvious bit is the database spec)
- Automated reporting of intercepted spam
- Management tools for new spam (generate candidate patterns from spam page, allow user to fine-tune and choose which ones to use, and add chosen/tweaked patterns to database)
- Manual reporting tools
  - list patterns least recently used, for possible deactivation
  - whois of all recently promoted domains and create a consolidated list of owners
- Optional log of complete spam contents, for possible data-mining or filter-training
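For the "least recently used" report, a first cut could be a query along these lines (a sketch; it assumes a NULL WhenTried means the pattern has never been matched, and MySQL sorts NULLs first in ascending order, so never-tried patterns top the list):

<mysql>SELECT ID, Pattern, WhenTried, Count
FROM patterns
WHERE isActive
ORDER BY WhenTried ASC
LIMIT 20;</mysql>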
MW Versions
- SpamFerret (no version number yet) has been used without modification on MediaWiki versions 1.5.5, 1.7.1, 1.8.2, and 1.9.3
Purpose
The SpamBlacklist extension has a number of shortcomings:
- Can only handle a limited number of entries before exceeding the maximum string-length it can process, at which point all spam is allowed through
- Does not keep track of which entries are still being "tried" (to allow for periodic "cleaning" of the list)
- Does not keep track of offending IP addresses
- Handles only domains; cannot blacklist by URL path (for partially compromised servers) or "catch phrases" found in spam and nowhere else
- Does not keep a log of failed spam attempts, so there is no way to measure effectiveness
SpamFerret, on the other hand:
- is database-driven
- keeps logs and counts of spam attempts by blacklisting and by IP
- matches domains ("http://*.domain"), URLs ("http://*.domain/path") and catch-phrases ("helo please to forgive my posting but my children are hungary")
- can also match patterns, like long lists of links in a certain format
SpamFerret may, however, be unsuitable for busier wikis: the checking process (which only runs when an edit is submitted) can take a fair amount of CPU time, since it scans the entire page once per blacklisted pattern. This shouldn't be a problem for smaller wikis, which are often monitored less frequently than busier wikis and hence are more vulnerable to spam.
Design
<sql>CREATE TABLE `patterns` (
  `ID`        INT NOT NULL AUTO_INCREMENT,
  `Pattern`   VARCHAR(255) COMMENT 'pattern to match (regex)',
  `WhenAdded` DATETIME DEFAULT NULL COMMENT 'when this entry was added',
  `WhenTried` DATETIME DEFAULT NULL COMMENT 'when a spammer last attempted to include this pattern',
  `isActive`  BOOL COMMENT 'if FALSE, do not include in checking',
  `isURL`     BOOL COMMENT 'TRUE indicates that additional URL-related stats may be collected',
  `isRegex`   BOOL COMMENT 'TRUE indicates that the string should not be escaped before feeding to preg_match()',
  `Count`     INT DEFAULT 0 COMMENT 'number of attempts',
  PRIMARY KEY (`ID`)
) ENGINE = MYISAM;

CREATE TABLE `clients` (
  `ID`        INT NOT NULL AUTO_INCREMENT,
  `Address`   VARCHAR(15) COMMENT 'IP address',
  `WhenFirst` DATETIME COMMENT 'when this IP address first submitted a spam',
  `WhenLast`  DATETIME COMMENT 'when this IP address last submitted a spam',
  `Retries`   INT DEFAULT NULL COMMENT 'number of spam retries',
  `Count`     INT DEFAULT 0 COMMENT 'number of attempts',
  PRIMARY KEY (`ID`)
) ENGINE = MYISAM;

CREATE TABLE `attempts` (
  `ID`         INT NOT NULL AUTO_INCREMENT,
  `When`       DATETIME COMMENT 'timestamp of attempt',
  `ID_Pattern` INT NOT NULL COMMENT '(patterns.ID) matching pattern found',
  `ID_Client`  INT NOT NULL COMMENT '(clients.ID) spamming client',
  `Code`       VARCHAR(15) COMMENT 'type of attempt',
  `PageServer` VARCHAR(63) COMMENT 'identifier of wiki being attacked (usually domain)',
  `PageName`   VARCHAR(255) COMMENT 'name of page where the spam would have displayed',
  PRIMARY KEY (`ID`)
) ENGINE = MYISAM;</sql>
- attempts.Code:
  - NULL = normal match
  - "AMP" = ampersandbot (to be eventually superseded by some kind of difference-pattern)
  - "PEST" = too many spam attempts; temporary blacklist of IP address
- clients.Retries:
  - Each time a client submits spam, clients.Retries increments...
  - ...unless clients.WhenLast was sufficiently long ago, in which case clients.Retries is reset to 0 (and WhenLast is updated).
  - Each time a client submits non-spam, if clients.Retries is too high and WhenLast is recent enough, the content is refused without checking for a spam match.
  - The net effect is that too many spams within a certain period of time get an IP address temporarily blacklisted.
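The retry logic above might be sketched in SQL as follows (the one-hour window, retry threshold of 5, and client address are illustrative only; SpamFerret's actual values may differ):

<mysql>-- on each spam submission from a known client:
-- Retries is assigned before WhenLast, so the IF() still sees the old timestamp
UPDATE clients
SET Retries  = IF(WhenLast >= NOW() - INTERVAL 1 HOUR, Retries + 1, 0),
    WhenLast = NOW(),
    Count    = Count + 1
WHERE Address = '192.0.2.1';

-- on a non-spam submission, refuse without pattern-checking if this is TRUE:
SELECT (Retries > 5 AND WhenLast >= NOW() - INTERVAL 1 HOUR) AS isPest
FROM clients
WHERE Address = '192.0.2.1';</mysql>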
Reports
Eventually, some Special pages with reports would be nice, but for now you can see what's being blocked, and where the spam attempts are coming from, with this query – which you can package in a stored procedure, as shown, to make it easier to run (or just use the SELECT statement by itself; note that the mysql command-line client needs the statement delimiter changed, e.g. with DELIMITER //, before it will accept a procedure body containing semicolons):

<mysql>CREATE PROCEDURE Attempts()
BEGIN
  SELECT
    a.ID, a.`When`,
    CONCAT('(', a.ID_Pattern, ') ', p.Pattern) AS Pattern,
    CONCAT('(', a.ID_Client, ') ', c.Address) AS Address,
    PageServer, PageName
  FROM (attempts AS a
    LEFT JOIN patterns AS p ON a.ID_Pattern = p.ID)
    LEFT JOIN clients AS c ON a.ID_Client = c.ID
  ORDER BY a.ID DESC;
END</mysql>

Execute <mysql>CALL Attempts();</mysql> to view the results.
A useful View (note that `When` must be backquoted, since WHEN is a reserved word in MySQL):

<mysql>CREATE OR REPLACE VIEW `AttemptsEx` AS
  SELECT
    a.ID, a.`When`, a.ID_Pattern, p.Pattern,
    a.ID_Client, c.Address, a.PageServer, a.PageName
  FROM ((attempts AS a
    LEFT JOIN patterns AS p ON (a.ID_Pattern = p.ID))
    LEFT JOIN clients AS c ON (a.ID_Client = c.ID))
  ORDER BY a.ID DESC</mysql>
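With the view in place, checking recent activity becomes a one-liner (the row limit is arbitrary):

<mysql>SELECT * FROM AttemptsEx LIMIT 50;</mysql>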
Code
This consists of the following:
- SpamFerret.php goes in the MediaWiki extensions folder
- data.php goes in the "includes" folder because PHP seems to want it there.
- Add these lines to your LocalSettings.php:
  - require_once( "$IP/extensions/SpamFerret.php" );
  - $wgSpamFerretSettings['dbspec'] = 'mysql:db_user_name:db_user_password@db_server/spamferret_db_name';
Both files still contain some debugging code, most of which I'll clean up later (some of it calls stubbed debug-printout routines which can come in handy when adding features or fixing the inevitable bugs).
- 2007-06-10 Added code to prevent ampersandbot edits; need to add logging of those blocks, but don't have time right now. Also don't know if the ampersandbots trim off whitespace or if that's just how MediaWiki is displaying the changes.
- 2007-08-30 Current version accommodates some changes to the data.php class library