Difference between revisions of "SpamFerret"

from HTYP, the free directory anyone can edit if they can prove to me that they're not a spambot
Jump to navigation Jump to search
(→‎Overview: update: works; needs doc)
(Replaced content with "{{to wooz|MWX/SpamFerret}}")
Tag: Replaced
 
(55 intermediate revisions by 4 users not shown)
Line 1: Line 1:
==Navigation==
+
{{to wooz|MWX/SpamFerret}}
[[computing]]: [[software]]: [[MediaWiki]]: [[fighting spam posts in MediaWiki|fighting spam]]
 
==Overview==
 
[[SpamFerret]] is [[User:Woozle|my]] attempt at an improvement over the SpamBlacklist extension. The version posted here works, but you have to manually enter patterns (blacklisted items) into the database. Fortunately this isn't hard to do, but a more friendly interface would be nice (especially some way to take a spam-page, break it up into unique URLs, and let you select which ones to add to the database).
 
===To-Do===
 
* '''Important''': Document the installation process (pretty simple; the only non-obvious bit is the database spec)
 
* Automated reporting of intercepted spam
 
* Management tools for new spam (generate candidate patterns from spam page, allow user to fine-tune and choose which ones to use, and add chosen/tweaked patterns to database)
 
* Manual reporting tools
 
** list patterns least recently used, for possible deactivation
 
** [[whois]] of all recently promoted domains and create a consolidated list of owners
 
 
 
==Purpose==
 
The SpamBlacklist extension has a number of shortcomings:
 
* Can only handle a limited number of entries before exceeding the maximum string-length it can process, at which point all spam is allowed through
 
* Does not keep track of which entries are still being "tried" (to allow for periodic "cleaning" of the list)
 
* Does not keep track of offending IP addresses
 
* Handles only domains; cannot blacklist by URL path (for partially compromised servers) or "catch phrases" found in spam and nowhere else
 
* Does not keep a log of failed spam attempts, so there is no way to measure effectiveness
 
 
 
[[SpamFerret]] will:
 
* be database-driven
 
* keep logs and counts of spam attempts by blacklisting and by IP
 
* match domains ("http://*.domain"), URLs ("http://*.domain/path") and catch-phrases ("helo please to forgive my posting but my children are hungary")
 
** should also be able to match patterns, like long lists of links in a certain format
 
 
 
It may also be unsuitable for use on busier wikis, as the checking process (which only happens when an edit is submitted) may take a fair amount of CPU time (checks the entire page once per blacklisted pattern). This shouldn't be a problem for smaller wikis, which are often monitored less frequently than busier wikis and hence are more vulnerable to spam.
 
==Design==
 
<sql>
 
CREATE TABLE `patterns` (
 
  `ID` INT NOT NULL AUTO_INCREMENT,
 
  `Pattern` varchar(255) COMMENT 'pattern to match (regex)',
 
  `WhenAdded` DATETIME DEFAULT NULL COMMENT 'when this entry was added',
 
  `WhenTried` DATETIME DEFAULT NULL COMMENT 'when a spammer last attempted to include this pattern',
 
  `isActive` BOOL COMMENT 'if FALSE, do not include in checking',
 
  `isURL` BOOL COMMENT 'TRUE indicates that additional URL-related stats may be collected',
 
  `isRegex` BOOL COMMENT 'TRUE indicates that the string should not be escaped before feeding to preg_match()',
 
  PRIMARY KEY(`ID`)
 
)
 
ENGINE = MYISAM;
 
 
 
CREATE TABLE `clients` (
 
  `ID` INT NOT NULL AUTO_INCREMENT,
 
  `Address` varchar(15) COMMENT 'IP address',
 
  `WhenFirst` DATETIME COMMENT 'when this IP address first submitted a spam',
 
  `WhenLast` DATETIME COMMENT 'when this IP address last submitted a spam',
 
  `Count` INT COMMENT 'number of attempts',
 
  PRIMARY KEY(`ID`)
 
)
 
ENGINE = MYISAM;
 
 
 
CREATE TABLE `attempts` (
 
  `ID` INT NOT NULL AUTO_INCREMENT,
 
  `When` DATETIME COMMENT 'timestamp of attempt',
 
  `ID_Pattern` INT NOT NULL COMMENT '(patterns.ID) matching pattern found',
 
  `ID_Client` INT NOT NULL COMMENT '(clients.ID) spamming client',
 
  `PageServer` varchar(63) COMMENT 'identifier of wiki being attacked (usually domain)',
 
  `PageName` varchar(255) COMMENT 'name of page where the spam would have displayed',
 
  PRIMARY KEY(`ID`)
 
)
 
ENGINE = MYISAM;
 
</sql>
 
==Code==
 
This consists of two files. '''SpamFerret.php''' goes in the MediaWiki extensions folder, and '''data.php''' goes in the "includes" folder because PHP seems to want it there. Both files still contain some debugging code, most of which I'll clean up later (some of it calls stubbed debug-printout routines which can come in handy when adding features or fixing the inevitable bugs).
 
===SpamFerret.php===
 
<php>
 
<?php
 
 
 
# Loader for spam blacklist feature
 
# Include this from LocalSettings.php
 
 
 
if ( defined( 'MEDIAWIKI' ) ) {
 
require('data.php');
 
 
 
global $wgFilterCallback, $wgPreSpamFilterCallback;
 
 
 
if ( $wgFilterCallback ) {
 
$wgPreSpamFilterCallback = $wgFilterCallback;
 
} else {
 
$wgPreSpamFilterCallback = false;
 
}
 
 
 
$wgFilterCallback = 'wfSpamFerretLoader';
 
$wgExtensionCredits['other'][] = array(
 
'name' => 'SpamFerret',
 
        'author' => 'Woozle Staddon',
 
        'url' => 'http://htyp.org/SpamFerret',
 
);
 
 
 
function wfSpamFerretLoader( &$title, $text, $section ) {
 
static $spamObj = false;
 
global $wgSpamFerretSettings, $wgPreSpamFilterCallback;
 
 
 
if ( $spamObj === false ) {
 
$spamObj = new SpamFerret( $wgSpamFerretSettings );
 
}
 
 
 
return $spamObj->filter( $title, $text, $section );
 
}
 
class SpamFerret {
 
var $dbspec;
 
var $previousFilter = false;
 
 
function SpamFerret( $settings = array() ) {
 
global $IP;
 
 
 
foreach ( $settings as $name => $value ) {
 
$this->$name = $value;
 
}
 
}
 
 
 
function filter( &$title, $text, $section ) {
 
global $wgArticle, $wgDBname, $wgMemc, $messageMemc, $wgVersion, $wgOut;
 
global $wgTitle, $wgServer;
 
global $debug;
 
 
 
$fname = 'wfSpamFerretFilter';
 
wfProfileIn( $fname );
 
 
 
# Call the rest of the hook chain first
 
if ( $this->previousFilter ) {
 
$f = $this->previousFilter;
 
if ( $f( $title, $text, $section ) ) {
 
wfProfileOut( $fname );
 
return true;
 
}
 
}
 
// Open the database
 
$dbSpam = new clsDatabase($this->dbspec);
 
$objTblPatterns = new clsDataTable($dbSpam,'patterns');
 
$objDataPatterns = $objTblPatterns->GetData('isActive');
 
/*
 
$debug .= ' objDataPatterns is object:'.is_object($objDataPatterns);
 
$debug .= ' objDataPatterns.Res is object:'.is_object($objDataPatterns->Res);
 
$debug .= ' objDataPatterns.Row is array:'.is_array($objDataPatterns->Row);
 
$debug .= ' objDataPatterns.Res is class '.get_class($objDataPatterns->Res);
 
$debug .= ' objDataPatterns.Res has '.$objDataPatterns->Res->num_rows.' row(s)';
 
*/
 
$debug .= 'TEXT=['.$text.']';
 
while(is_array($objDataPatterns->Row)) {
 
$strPattern = $objDataPatterns->GetValue('Pattern');
 
$isRegex = $objDataPatterns->GetValue('isRegex');
 
$idPattern = $objDataPatterns->GetValue('ID');
 
$debug .= ' PATTERN: ['.$strPattern.']';
 
if ($isRegex) {
 
$debug .= ' IS REGEX';
 
$strPattCk = strtolower($strPattern);
 
$strTextCk = strtolower($text);
 
$isMatch = preg_match('/'.$strPattCk.'/',$strTextCk,$matches);
 
if ($isMatch) {
 
$strMatch = $matches[0];
 
}
 
} else {
 
$debug .= ' IS PLAIN';
 
$strMatch = stristr ($text,$strPattern);
 
$isMatch = ($strMatch != '');
 
}
 
// $debug .= 'ROW: '.DumpArray($objDataPatterns->Row);
 
 
 
if ($isMatch) {
 
$debug .= ' MATCH!';
 
$objDataPatterns->Row = NULL; // stop the search
 
} else {
 
$debug .= ' no match';
 
$objDataPatterns->NextRow(); // keep looking
 
}
 
}
 
 
 
if ( $strMatch != '' ) {
 
// spam cue found; display the matching text and don't allow the edit to be saved:
 
wfDebug( "Match!\n" );
 
 
 
// The string sent to spamPage() will be shown after "The following text is what triggered our spam filter:"
 
EditPage::spamPage( $strMatch );
 
// Log the spam attempt:
 
// -- first, get an ID for the IP address:
 
$strIPAddr = wfGetIP();
 
$sql = 'SELECT * FROM clients WHERE Address="'.$strIPAddr.'"';
 
$objDataClients = $dbSpam->Query($sql);
 
// update or create client record:
 
if (is_object($objDataClients)) {
 
if ($objDataClients->RowCount() > 0) {
 
$isFound = true;
 
}
 
}
 
if ($isFound) {
 
$idClient = $objDataClients->GetValue('ID');
 
$sql = 'UPDATE clients SET WhenLast=NOW(),Count=Count+1 WHERE Address="'.$strIPAddr.'"';
 
$dbSpam->Exec($sql);
 
} else {
 
$sql = 'INSERT INTO clients (Address,WhenFirst,Count) VALUES("'.$strIPAddr.'",NOW(),1)';
 
$dbSpam->Exec($sql);
 
$idClient = $dbSpam->NewID();
 
}
 
$sqlURL = '"'.$dbSpam->SafeParam($wgTitle->getFullURL()).'"';
 
$sqlSrvr = '"'.$dbSpam->SafeParam($wgServer).'"';
 
$sqlPage = '"'.$dbSpam->SafeParam($wgTitle->getPrefixedText()).'"';
 
$sql = 'INSERT INTO attempts (`When`,ID_Pattern,ID_Client,PageServer,PageName) VALUES (NOW(),'.$idPattern.','.$idClient.','.$sqlSrvr.','.$sqlPage.')';
 
$dbSpam->Exec($sql);
 
$retVal = true;
 
} else {
 
// no spam cues found; allow the edit to be saved
 
/*
 
EditPage::spamPage( 'DEBUGGING: '.$debug );
 
$retVal = true;
 
/*/
 
$retVal = false;
 
/**/
 
}
 
 
 
wfProfileOut( $fname );
 
return $retVal;
 
/**/
 
}
 
}
 
 
 
} // end of 'MEDIAWIKI' check
 
?>
 
 
 
</php>
 
===data.php===
 
<php>
 
<?php
 
 
 
# Loader for spam blacklist feature
 
# Include this from LocalSettings.php
 
 
 
/* ===========================
 
*** DATA UTILITY CLASSES ***
 
*/
 
 
 
class clsDatabase {
 
  private $cntOpen; // count of requests to keep db open
 
  private $strType; // type of db (MySQL etc.)
 
  private $strUser; // database user
 
  private $strPass; // password
 
  private $strHost; // host (database server domain-name or IP address)
 
  private $strName; // database (schema) name
 
 
 
  private $Conn; // connection object
 
 
 
  public function __construct($iConn) {
 
    CallEnter('clsDatabase('.get_class($iConn).')');
 
    $this->Init($iConn);
 
    CallExit('clsDatabase()');
 
  }
 
  public function Init($iConn) {
 
// $iConn format: type:user:pass@server/dbname
 
    CallEnter('clsDatabase.Init('.get_class($iConn).')');
 
    $this->cntOpen = 0;
 
    list($part1,$part2) = split('@',$iConn);
 
    list($this->strType,$this->strUser,$this->strPass) = split(':',$part1);
 
    list($this->strHost,$this->strName) = explode('/',$part2);
 
    $this->strType = strtolower($this->strType); // make sure it is lowercased, for comparison
 
    CallExit('clsDatabase.Init()');
 
  }
 
  public function Open() {
 
    CallEnter('clsDatabase.Open()');
 
    if ($this->cntOpen == 0) {
 
// then actually open the db
 
      if ($this->strType == 'mysql') {
 
        $this->Conn = new mysqli($this->strHost,$this->strUser,$this->strPass,$this->strName);
 
      } else {
 
        $this->Conn = dbx_connect($this->strType,$this->strHost,$this->strName,$this->strUser,$this->strPass);
 
      }
 
    }
 
    $this->cntOpen++;
 
    CallExit('clsDatabase.Open()');
 
  }
 
  public function Shut() {
 
    CallEnter('clsDatabase.Shut()');
 
    $this->cntOpen--;
 
    if ($this->cntOpen == 0) {
 
      if ($this->strType == 'mysql') {
 
        $this->Conn->close();
 
      } else {
 
        dbx_close($this->Conn);
 
      }
 
    }
 
    CallExit('clsDatabase.Shut()');
 
  }
 
  public function Query($iSQL) {
 
    CallEnter('clsDatabase.Query('.$iSQL.')');
 
    if ($this->strType == 'mysql') {
 
      $this->Conn->real_query($iSQL);
 
      $objQry = $this->Conn->store_result();
 
    } else {
 
      $objQry = dbx_query($this->Conn,$iSQL,DBX_RESULT_ASSOC);
 
    }
 
    if (is_object($objQry)) {
 
      $objRec = new clsDataItem($objQry);
 
      CallExit('clsDatabase.Query('.$iSQL.') -> new clsDataItem');
 
      return $objRec;
 
    } else {
 
      CallExit('clsDatabase.Query('.$iSQL.') -> NULL (ERROR!)');
 
      return NULL; // empty result set, one way or another
 
    }
 
  }
 
  public function Exec($iSQL) {
 
// MYSQL only
 
    $objQry = $this->Conn->prepare($iSQL);
 
    return $objQry->execute();
 
  }
 
  public function NewID() {
 
    return $this->Conn->insert_id;
 
  }
 
  public function SafeParam($iString) {
 
    return $this->Conn->escape_string($iString);
 
  }
 
}
 
 
 
class clsDataTable {
 
  protected $objDB; // clsDatabase
 
  protected $strName;
 
 
 
  public function __construct($iDB,$iName) {
 
    CallEnter('clsDataTable('.get_class($iDB).','.$iName.')');
 
    $this->objDB = $iDB;
 
    $this->strName = $iName;
 
    $this->objDB->Open();
 
    if (!isset($this->objDB)) {
 
      print 'DATABASE NOT OPEN!';
 
    }
 
    CallExit('clsDataTable()');
 
  }
 
  public function __destruct() {
 
    $this->objDB->Shut();
 
  }
 
  public function GetItem($iID,$iObj=NULL) {
 
    CallEnter('clsDataTable.GetItem('.$iID.')');
 
    $sql = 'SELECT * FROM '.$this->strName.' WHERE ID='.$iID;
 
//    $objQry = dbx_query($this->objDB,$sql,DBX_RESULT_ASSOC);
 
    $objQry = $this->objDB->Query($sql);
 
    if (is_object($iObj)) {
 
      $iObj->Eat($objQry);
 
      CallExit('clsDataTable.GetItem('.$iID.') -> ate object');
 
      return $iObj;
 
    } else {
 
      CallExit('clsDataTable.GetItem('.$iID.') -> from objDB.Query');
 
      return $objQry;
 
    }
 
  }
 
  public function GetData($iFilt) {
 
    CallEnter('clsDataTable.GetData('.$iFilt.')');
 
    $sql = 'SELECT * FROM '.$this->strName.' WHERE ('.$iFilt.')';
 
    $objQry = $this->objDB->Query($sql);
 
    CallExit('clsDataTable.GetData('.$iFilt.')');
 
    return $objQry;
 
  }
 
}
 
 
 
class clsDataItem {
 
  public $Res; // result set
 
  public $Row; // first (presumably the *only*) row data
 
  public $Table; // clsDataTable object
 
  public function __construct($iResults=NULL,$iTable=NULL) {
 
    CallEnter('clsDataItem('.get_class($iResults).')');
 
    if (is_object($iResults)) {
 
      $this->Init($iResults,$iTable);
 
    }
 
    CallExit('clsDataItem()');
 
  }
 
  public function Init($iResults,$iTable=NULL) {
 
    CallEnter('clsDataItem.Init('.get_class($iResults).')');
 
    $this->Res = $iResults;
 
    $this->Table = $iTable;
 
// this works for mysql only:
 
// fetch the first row of the results as an associative array:
 
    $this->Row = $iResults->fetch_assoc();
 
    CallExit('clsDataItem.Init()');
 
  }
 
  public function Eat($iDataItem) {
 
    CallEnter('clsDataItem.Eat('.get_class($iDataItem).')');
 
    $this->Res = $iDataItem->Res;
 
    $this->Table = $iDataItem->Table;
 
    $this->Row = $iDataItem->Row;
 
    CallExit('clsDataItem.Eat()');
 
  }
 
  public function GetValue($iName) {
 
// this works for mysql only
 
    CallEnter('GetValue('.$iName.')');
 
    DumpArray($this->Row);
 
    CallExit('GetValue('.$iName.')');
 
    return $this->Row[$iName];
 
  }
 
  public function RowCount() {
 
    return $this->Res->num_rows;
 
  }
 
  public function NextRow() {
 
// this works for mysql only:
 
// fetch the NEXT row of the results as an associative array:
 
    $this->Row = $this->Res->fetch_assoc();
 
  }
 
}
 
 
 
/* ========================
 
*** UTILITY FUNCTIONS ***
 
*/
 
function Pluralize($iQty,$iSingular,$iPlural='s') {
 
if ($iQty == 1) {
 
return $iSingular;
 
} else {
 
return $iPlural;
 
}
 
}
 
/* ========================
 
*** DEBUGGING FUNCTIONS ***
 
*/
 
define(KS_DEBUG_HTML,0);
 
 
 
// these could later be expanded to create a call-path for errors, etc.
 
function CallEnter($iName) {
 
/*
 
  global $intCallDepth, $debug;
 
  $intCallDepth++;
 
  if (KS_DEBUG_HTML) {
 
    $debug .= '<br><span style="background: #000000;"><b>'.str_repeat('&gt;&gt; ',$intCallDepth).'</b>'.$iName.'</span>';
 
  } else {
 
    $debug .= "\n\n".str_repeat('*',$intCallDepth).$iName;
 
  }
 
/**/
 
}
 
function CallExit($iName) {
 
/*
 
  global $intCallDepth, $debug;
 
  $intCallDepth--;
 
  if (KS_DEBUG_HTML) {
 
    $debug .= '<br><span style="background: #000000; color: yellow;"><b>'.str_repeat('&gt;&gt; ',$intCallDepth).'&lt;&lt; </b>'.$iName.'</span>';
 
  } else {
 
    $debug .= "\n\n".str_repeat('*',$intCallDepth).'<';
 
  }
 
/**/
 
}
 
function DumpArray($iArr) {
 
  global $intCallDepth, $debug;
 
/*
 
  while (list($key, $val) = each($iArr)) {
 
    if (KS_DEBUG_HTML) {
 
      $debug .= '<br><span style="background: #000000; color: green;"><b>'.str_repeat('-- ',$intCallDepth+1).'</b>';
 
      $debug .= " $key => $val";
 
      $debug .= '</span>';
 
    } else {
 
      $debug .= "/ $key => $val /";
 
    }
 
  }
 
/**/
 
}
 
?>
 
</php>
 

Latest revision as of 20:22, 1 May 2022

Wooz-dev-logo.take 1.crop.1200pxw.png This page has been moved to wooz.dev, User:Woozle's coding site.