<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://htyp.org/mw/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=59.14.174.100</id>
	<title>HTYP - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://htyp.org/mw/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=59.14.174.100"/>
	<link rel="alternate" type="text/html" href="https://htyp.org/mw/index.php?title=Special:Contributions/59.14.174.100"/>
	<updated>2026-06-24T12:27:12Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://htyp.org/mw/index.php?title=SpamFerret&amp;diff=8462</id>
		<title>SpamFerret</title>
		<link rel="alternate" type="text/html" href="https://htyp.org/mw/index.php?title=SpamFerret&amp;diff=8462"/>
		<updated>2007-10-27T14:01:53Z</updated>

		<summary type="html">&lt;p&gt;59.14.174.100: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;
==Overview==&lt;br /&gt;
[[SpamFerret]] is [[User:Woozle|my]] attempt at an improvement over the SpamBlacklist extension. The version posted here works, but you have to manually enter patterns (blacklisted items) into the database. Fortunately this isn&#039;t hard to do, but a more friendly interface would be nice (especially some way to take a spam-page, break it up into unique URLs, and let you select which ones to add to the database).&lt;br /&gt;
===To-Do===&lt;br /&gt;
* &#039;&#039;&#039;Important&#039;&#039;&#039;: Document the installation process (pretty simple; the only non-obvious bit is the database spec)&lt;br /&gt;
* Automated reporting of intercepted spam&lt;br /&gt;
* Management tools for new spam (generate candidate patterns from spam page, allow user to fine-tune and choose which ones to use, and add chosen/tweaked patterns to database)&lt;br /&gt;
* Manual reporting tools&lt;br /&gt;
** list patterns least recently used, for possible deactivation&lt;br /&gt;
** [[whois]] of all recently promoted domains and create a consolidated list of owners&lt;br /&gt;
* Optional log of complete spam contents, for possible data-mining or filter-training&lt;br /&gt;
===MW Versions===&lt;br /&gt;
* [[SpamFerret]] (no version number yet) has been used without modification on MediaWiki versions 1.5.5, 1.7.1, 1.8.2, and 1.9.3&lt;br /&gt;
&lt;br /&gt;
==Purpose==&lt;br /&gt;
The SpamBlacklist extension has a number of shortcomings:&lt;br /&gt;
* Can only handle a limited number of entries before exceeding the maximum string-length it can process, at which point all spam is allowed through&lt;br /&gt;
* Does not keep track of which entries are still being &amp;quot;tried&amp;quot; (to allow for periodic &amp;quot;cleaning&amp;quot; of the list)&lt;br /&gt;
* Does not keep track of offending IP addresses&lt;br /&gt;
* Handles only domains; cannot blacklist by URL path (for partially compromised servers) or &amp;quot;catch phrases&amp;quot; found in spam and nowhere else&lt;br /&gt;
* Does not keep a log of failed spam attempts, so there is no way to measure effectiveness&lt;br /&gt;
&lt;br /&gt;
[[SpamFerret]], on the other hand:&lt;br /&gt;
* is database-driven&lt;br /&gt;
* keeps logs and counts of spam attempts by blacklisting and by IP&lt;br /&gt;
* matches domains (&amp;quot;http://*.domain&amp;quot;), URLs (&amp;quot;http://*.domain/path&amp;quot;) and catch-phrases (&amp;quot;helo please to forgive my posting but my children are hungary&amp;quot;)&lt;br /&gt;
** can also match patterns, like long lists of links in a certain format&lt;br /&gt;
&lt;br /&gt;
SpamFerret may, however, be unsuitable for busier wikis: the check runs only when an edit is submitted, but it scans the entire page once per blacklisted pattern and so may take a fair amount of CPU time. This shouldn&#039;t be a problem for smaller wikis, which are often monitored less frequently than busier wikis and hence are more vulnerable to spam.&lt;br /&gt;
&lt;br /&gt;
==Design==&lt;br /&gt;
&amp;lt;sql&amp;gt;&lt;br /&gt;
CREATE TABLE `patterns` (&lt;br /&gt;
  `ID` INT NOT NULL AUTO_INCREMENT,&lt;br /&gt;
  `Pattern` varchar(255) COMMENT &#039;pattern to match (regex)&#039;,&lt;br /&gt;
  `WhenAdded` DATETIME DEFAULT NULL COMMENT &#039;when this entry was added&#039;,&lt;br /&gt;
  `WhenTried` DATETIME DEFAULT NULL COMMENT &#039;when a spammer last attempted to include this pattern&#039;,&lt;br /&gt;
  `isActive` BOOL COMMENT &#039;if FALSE, do not include in checking&#039;,&lt;br /&gt;
  `isURL` BOOL COMMENT &#039;TRUE indicates that additional URL-related stats may be collected&#039;,&lt;br /&gt;
  `isRegex` BOOL COMMENT &#039;TRUE indicates that the string should not be escaped before feeding to preg_match()&#039;,&lt;br /&gt;
  `Count` INT DEFAULT 0 COMMENT &#039;number of attempts&#039;,&lt;br /&gt;
  PRIMARY KEY(`ID`)&lt;br /&gt;
)&lt;br /&gt;
ENGINE = MYISAM;&lt;br /&gt;
&lt;br /&gt;
CREATE TABLE `clients` (&lt;br /&gt;
  `ID` INT NOT NULL AUTO_INCREMENT,&lt;br /&gt;
  `Address` varchar(15) COMMENT &#039;IP address&#039;,&lt;br /&gt;
  `WhenFirst` DATETIME COMMENT &#039;when this IP address first submitted a spam&#039;,&lt;br /&gt;
  `WhenLast` DATETIME COMMENT &#039;when this IP address last submitted a spam&#039;,&lt;br /&gt;
  `Retries` INT DEFAULT NULL COMMENT &#039;number of spam retries&#039;,&lt;br /&gt;
  `Count` INT DEFAULT 0 COMMENT &#039;number of attempts&#039;,&lt;br /&gt;
  PRIMARY KEY(`ID`)&lt;br /&gt;
)&lt;br /&gt;
ENGINE = MYISAM;&lt;br /&gt;
&lt;br /&gt;
CREATE TABLE `attempts` (&lt;br /&gt;
  `ID` INT NOT NULL AUTO_INCREMENT,&lt;br /&gt;
  `When` DATETIME COMMENT &#039;timestamp of attempt&#039;,&lt;br /&gt;
  `ID_Pattern` INT DEFAULT NULL COMMENT &#039;(patterns.ID) matching pattern found&#039;,&lt;br /&gt;
  `ID_Client` INT NOT NULL COMMENT &#039;(clients.ID) spamming client&#039;,&lt;br /&gt;
  `Code` varchar(15) COMMENT &#039;type of attempt&#039;,&lt;br /&gt;
  `PageServer` varchar(63) COMMENT &#039;identifier of wiki being attacked (usually domain)&#039;,&lt;br /&gt;
  `PageName` varchar(255) COMMENT &#039;name of page where the spam would have displayed&#039;,&lt;br /&gt;
  PRIMARY KEY(`ID`)&lt;br /&gt;
)&lt;br /&gt;
ENGINE = MYISAM;&lt;br /&gt;
&amp;lt;/sql&amp;gt;&lt;br /&gt;
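&lt;br /&gt;
Until a management interface exists, patterns have to be added by hand. A minimal sketch of such an entry follows; the two pattern strings are invented placeholders, and it assumes that a string with isRegex set to FALSE is escaped and matched literally, as the column comment suggests:&lt;br /&gt;
&amp;lt;mysql&amp;gt;-- Hypothetical entries: one spam domain and one catch-phrase, both treated as literal text&lt;br /&gt;
INSERT INTO `patterns` (`Pattern`, `WhenAdded`, `isActive`, `isURL`, `isRegex`)&lt;br /&gt;
VALUES&lt;br /&gt;
  (&#039;spam-example.invalid&#039;, NOW(), TRUE, TRUE, FALSE),&lt;br /&gt;
  (&#039;please to forgive my posting&#039;, NOW(), TRUE, FALSE, FALSE);&amp;lt;/mysql&amp;gt;&lt;br /&gt;
WhenTried and Count are left at their column defaults (NULL and 0) in this sketch.&lt;br /&gt;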
&lt;br /&gt;
* &#039;&#039;&#039;attempts.Code&#039;&#039;&#039;:&lt;br /&gt;
** &#039;&#039;NULL&#039;&#039; = normal filter match&lt;br /&gt;
** &amp;quot;&#039;&#039;&#039;AMP&#039;&#039;&#039;&amp;quot; = [[ampersandbot]] (to be eventually superseded by some kind of difference-pattern)&lt;br /&gt;
** &amp;quot;&#039;&#039;&#039;THR&#039;&#039;&#039;&amp;quot; = throttled: too many spam attempts, temporary blacklist of IP address&lt;br /&gt;
* &#039;&#039;&#039;clients.Retries&#039;&#039;&#039;:&lt;br /&gt;
** Each time a client submits spam, clients.Retries increments...&lt;br /&gt;
** ...unless clients.WhenLast was sufficiently long ago, in which case clients.Retries is reset to 0 (and WhenLast is updated).&lt;br /&gt;
** Each time a client submits non-spam, if clients.Retries is too high &#039;&#039;and&#039;&#039; WhenLast is recent enough, the content is refused without checking for a spam match.&lt;br /&gt;
* Net effect is that too many spams within a certain period of time causes an IP to be temporarily blacklisted.&lt;br /&gt;
* &#039;&#039;&#039;attempts.ID_Pattern&#039;&#039;&#039; needs to allow NULL in order to log throttled saves with no pattern match&lt;br /&gt;
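&lt;br /&gt;
The Retries bookkeeping described above could be expressed as a single statement, sketched below. This is an illustration rather than the code in SpamFerret.php: the client ID 42 is a placeholder, 86400 is the sample throttle_timeout value from the LocalSettings.php lines, and incrementing Count in the same step is an assumption.&lt;br /&gt;
&amp;lt;mysql&amp;gt;-- Run when client 42 (placeholder ID) submits spam.&lt;br /&gt;
-- MySQL generally evaluates single-table UPDATE assignments left to right,&lt;br /&gt;
-- so Retries is computed from the old WhenLast before WhenLast is overwritten.&lt;br /&gt;
UPDATE `clients`&lt;br /&gt;
SET&lt;br /&gt;
  `Retries` = IF(TIMESTAMPDIFF(SECOND, `WhenLast`, NOW()) &amp;gt; 86400, 0, IFNULL(`Retries`, 0) + 1),&lt;br /&gt;
  `Count` = `Count` + 1,&lt;br /&gt;
  `WhenLast` = NOW()&lt;br /&gt;
WHERE `ID` = 42;&amp;lt;/mysql&amp;gt;&lt;br /&gt;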
&lt;br /&gt;
==Reports==&lt;br /&gt;
Eventually, some special pages with reports would be nice, but for now you can see what&#039;s being blocked, and where the spam attempts are coming from, with this query, conveniently packaged in a stored &amp;quot;view&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;mysql&amp;gt;CREATE OR REPLACE VIEW `AttemptsEx` AS&lt;br /&gt;
  SELECT&lt;br /&gt;
    a.ID,&lt;br /&gt;
    a.`When` AS WhenDone,&lt;br /&gt;
    a.Code,&lt;br /&gt;
    a.ID_Pattern,&lt;br /&gt;
    p.Pattern,&lt;br /&gt;
    a.ID_Client,&lt;br /&gt;
    c.Address,&lt;br /&gt;
    a.PageServer,&lt;br /&gt;
    a.PageName&lt;br /&gt;
  FROM&lt;br /&gt;
    (&lt;br /&gt;
      (attempts AS a&lt;br /&gt;
        LEFT JOIN patterns AS p&lt;br /&gt;
        ON (a.ID_Pattern = p.ID)&lt;br /&gt;
       ) LEFT JOIN clients AS c&lt;br /&gt;
         ON (a.ID_Client = c.ID)&lt;br /&gt;
     ) ORDER BY a.ID DESC&amp;lt;/mysql&amp;gt;&lt;br /&gt;
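&lt;br /&gt;
Two example queries against this schema are sketched below, assuming the view and tables exist as written above; the row limits are arbitrary. The first summarizes logged attempts per address, and the second lists the active patterns that have gone longest without a match, which are candidates for deactivation:&lt;br /&gt;
&amp;lt;mysql&amp;gt;-- Addresses with the most logged attempts&lt;br /&gt;
SELECT Address, COUNT(*) AS Attempts, MAX(WhenDone) AS LastAttempt&lt;br /&gt;
FROM AttemptsEx&lt;br /&gt;
GROUP BY Address&lt;br /&gt;
ORDER BY Attempts DESC&lt;br /&gt;
LIMIT 10;&lt;br /&gt;
&lt;br /&gt;
-- Active patterns least recently tried (ascending order puts NULL, i.e. never tried, first)&lt;br /&gt;
SELECT ID, Pattern, WhenAdded, WhenTried, `Count`&lt;br /&gt;
FROM patterns&lt;br /&gt;
WHERE isActive&lt;br /&gt;
ORDER BY WhenTried ASC&lt;br /&gt;
LIMIT 20;&amp;lt;/mysql&amp;gt;&lt;br /&gt;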
&lt;br /&gt;
==Used Internally==&lt;br /&gt;
This view is for pre-screening clients; it includes a column showing how many seconds have passed since the last spam attempt:&lt;br /&gt;
&amp;lt;mysql&amp;gt;CREATE OR REPLACE VIEW `ClientThrottle` AS&lt;br /&gt;
  SELECT&lt;br /&gt;
    ID,&lt;br /&gt;
    Address,&lt;br /&gt;
    WhenFirst,&lt;br /&gt;
    WhenLast,&lt;br /&gt;
    Count,&lt;br /&gt;
    IFNULL(Retries,0) AS Retries,&lt;br /&gt;
    TIMESTAMPDIFF(SECOND,WhenLast,NOW()) AS ThrottleTime&lt;br /&gt;
  FROM clients;&amp;lt;/mysql&amp;gt;&lt;br /&gt;
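&lt;br /&gt;
A pre-screening check against this view might look like the sketch below. The address is a documentation placeholder, 5 and 86400 are the sample throttle_retries and throttle_timeout values, and the exact comparison the extension makes is an assumption:&lt;br /&gt;
&amp;lt;mysql&amp;gt;-- Returns a row only if this address is currently throttled&lt;br /&gt;
SELECT ID, Retries, ThrottleTime&lt;br /&gt;
FROM ClientThrottle&lt;br /&gt;
WHERE Address = &#039;192.0.2.1&#039;&lt;br /&gt;
  AND Retries &amp;gt; 5&lt;br /&gt;
  AND ThrottleTime &amp;lt; 86400;&amp;lt;/mysql&amp;gt;&lt;br /&gt;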
&lt;br /&gt;
==Code==&lt;br /&gt;
This consists of the following:&lt;br /&gt;
* [[SpamFerret.php]] goes in the MediaWiki extensions folder&lt;br /&gt;
* [[User:Woozle/data.php|data.php]] goes in the &amp;quot;includes&amp;quot; folder because [[PHP]] seems to want it there.&lt;br /&gt;
* Add these lines to your [[LocalSettings.php]]:&lt;br /&gt;
&amp;lt;php&amp;gt;require_once( &amp;quot;$IP/extensions/SpamFerret.php&amp;quot; );&lt;br /&gt;
$wgSpamFerretSettings[&#039;dbspec&#039;] = &#039;mysql:&amp;lt;u&amp;gt;db_user_name&amp;lt;/u&amp;gt;:&amp;lt;u&amp;gt;db_user_password&amp;lt;/u&amp;gt;@&amp;lt;u&amp;gt;db_server&amp;lt;/u&amp;gt;/&amp;lt;u&amp;gt;spamferret_db_name&amp;lt;/u&amp;gt;&#039;;&lt;br /&gt;
$wgSpamFerretSettings[&#039;throttle_retries&#039;] = 5;	// 5 strikes and you&#039;re out&lt;br /&gt;
$wgSpamFerretSettings[&#039;throttle_timeout&#039;] = 86400;	// 86400 seconds = 24 hours&amp;lt;/php&amp;gt;&lt;br /&gt;
SpamFerret.php and data.php still contain some debugging code, most of which I&#039;ll clean up later (some of it calls stubbed debug-printout routines which can come in handy when adding features or fixing the inevitable bugs).&lt;br /&gt;
===Update Log===&lt;br /&gt;
* &#039;&#039;&#039;2007-10-13&#039;&#039;&#039; (1) IP throttling, and (2) logging of [[ampersandbot]] attempts (not tested)&lt;br /&gt;
** If an IP address makes more than &amp;lt;u&amp;gt;N&amp;lt;/u&amp;gt; spam attempts with no more than &amp;lt;u&amp;gt;T&amp;lt;/u&amp;gt; seconds between them, it will not be allowed to post anything until a further &amp;lt;u&amp;gt;T&amp;lt;/u&amp;gt; seconds have elapsed without spam.&lt;br /&gt;
* &#039;&#039;&#039;2007-08-30&#039;&#039;&#039; Current version accommodates some changes to the data.php class library&lt;br /&gt;
* &#039;&#039;&#039;2007-06-10&#039;&#039;&#039; Added code to prevent [[ampersandbot]] edits; need to add logging of those blocks, but don&#039;t have time right now. Also don&#039;t know if the ampersandbots trim off whitespace or if that&#039;s just how MediaWiki is displaying the changes.&lt;/div&gt;</summary>
		<author><name>59.14.174.100</name></author>
	</entry>
</feed>