#2646 - Bayesian spam detection
| Identifier | #2646 |
|---|---|
| Issue type | Feature request or suggestion |
| Title | Bayesian spam detection |
| Status | Open |
| Tags | Roadmap: Over the horizon (custom), Type: Spam (custom) |
| Handling member | Deleted |
| Addon | core |
| Description | When content is deleted for spam, copy the text into a new 'spam' table. Use this, and normal content, for spam/ham detection using a Bayesian algorithm (a minimal formula sketch follows the issue details). Every once in a while (or when the cleanup tool is run), the prior probabilities for keywords would be updated. This can then be integrated as extra signalling for #2384 |
| Steps to reproduce | |
| Related to | |
| Funded? | No |
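For context, here is a minimal sketch of the standard naive Bayes formulation such a feature would presumably use (this formula is not part of the issue text): per-keyword probabilities learned from the spam and ham training sets are combined into a single spam probability for a submission containing words w1…wn:

```latex
P(\mathrm{spam} \mid w_1,\dots,w_n) =
  \frac{P(\mathrm{spam}) \prod_{i=1}^{n} P(w_i \mid \mathrm{spam})}
       {P(\mathrm{spam}) \prod_{i=1}^{n} P(w_i \mid \mathrm{spam})
        + P(\mathrm{ham}) \prod_{i=1}^{n} P(w_i \mid \mathrm{ham})}
```

In practice the products are computed in log space and the per-word probabilities are smoothed, so rare or unseen keywords do not collapse the whole product to zero; the PHP sketch in the comments below does both.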


Comments
For the algorithm to work effectively, it will need to be trained both on spam and on ham. The difficulty is knowing when a piece of content can safely be classified as ham.
Perhaps an hourly scheduled task could train the algorithm on content that is at least X hours old as ham (configurable, with a default of perhaps 72 hours, since content that has not been moderated away as spam within 3 days can reasonably be assumed to be ham in most cases).
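To make the training/classification step concrete, here is a minimal, self-contained sketch of a naive Bayes filter in PHP. It is not an existing library; the class and method names are made up for illustration, and the tokeniser is deliberately crude. It just shows the word-count bookkeeping and smoothed scoring that the comments above and below assume.

```php
<?php
// Minimal naive Bayes text classifier, sketched for illustration only.
class NaiveBayesSpamFilter
{
    /** @var array<string, array<string, int>> word counts per class ('spam'/'ham') */
    private $wordCounts = ['spam' => [], 'ham' => []];

    /** @var array<string, int> number of trained documents per class */
    private $docCounts = ['spam' => 0, 'ham' => 0];

    public function train(string $text, string $class): void
    {
        $this->docCounts[$class]++;
        foreach ($this->tokenise($text) as $word) {
            $this->wordCounts[$class][$word] = ($this->wordCounts[$class][$word] ?? 0) + 1;
        }
    }

    /** Returns P(spam | text), computed in log space with Laplace smoothing. */
    public function spamProbability(string $text): float
    {
        $totalDocs = array_sum($this->docCounts) ?: 1;
        $vocabulary = count($this->wordCounts['spam'] + $this->wordCounts['ham']) ?: 1;
        $words = $this->tokenise($text);
        $logScores = [];

        foreach (['spam', 'ham'] as $class) {
            $classTotal = array_sum($this->wordCounts[$class]);
            $logScore = log(($this->docCounts[$class] + 1) / ($totalDocs + 2)); // class prior
            foreach ($words as $word) {
                $count = $this->wordCounts[$class][$word] ?? 0;
                $logScore += log(($count + 1) / ($classTotal + $vocabulary)); // smoothed likelihood
            }
            $logScores[$class] = $logScore;
        }

        // Convert the two log scores back into a normalised probability.
        $max = max($logScores);
        $spam = exp($logScores['spam'] - $max);
        $ham = exp($logScores['ham'] - $max);
        return $spam / ($spam + $ham);
    }

    /** Crude tokeniser: lower-case words/numbers of 2+ characters. */
    private function tokenise(string $text): array
    {
        preg_match_all('/[\p{L}\p{N}]{2,}/u', mb_strtolower($text), $matches);
        return $matches[0];
    }
}
```

A quick usage example of the sketch above:

```php
$filter = new NaiveBayesSpamFilter();
$filter->train('Cheap pills, click here now!!!', 'spam');
$filter->train('Thanks for the report; fixed on the git master branch', 'ham');
var_dump($filter->spamProbability('cheap pills here')); // well above 0.5
```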
Or, we could go a dynamic route:
* Have two tables... spam and spam_probabilities.
* "spam" is a collection of raw text which has been marked as spam.
* "spam_probabilities" is the training data for the Bayesian algorithm.
* Have a scheduled hook run every hour (but only when new content was recently added) which recalculates spam_probabilities. It does this by treating content which currently exists on the site and is newer than X days (let's say a default of 30) as ham, and all entries in "spam" no older than X days (again, 30 by default, the same number as for ham) as spam. We should also run this every time a new entry is added to spam. And this should also clean out old entries from the spam table (or perhaps let the privacy hooks do that instead, in case an admin wants to increase the number of days to look back).
* When checking for spam, we run the requested content submission through the algorithm to determine if it is likely spam and apply a score if so.
* We should consider other fields as well, not just main body content... like title, SEO keywords, etc. Basically, any text-type field.
* Define a new property on content / resource meta aware hooks which defines the database fields considered "scannable" by the algorithm. These are the fields passed into the training data when we train for spam or ham, and also the fields we look at when determining if something is spam.
* A scheduler hook periodically processes content older than X days (let's say 7, because that's how far back the content deletion field goes on the warnings form) and trains the algorithm on it as "ham". For fresh addon installs it needs to process in small batches so we don't overload the server, so we should also record the timestamp up to which we have trained so far.
* Anywhere we can delete content, add a new tick box that allows us to flag it as spam. When flagged, it will be trained as "spam". Additionally, have a tick box on the warnings form to do the same for any content or posts marked for deletion.
* Persist the training data somewhere, probably as a serialised file instead of the database (because that's how most machine-learning PHP libraries handle training data).
* Add a new antispam heuristic option specifying the threshold at which the Bayesian classifier is confident the content being submitted is "spam". If the threshold is met, we add an amount of spam confidence specified in another option (a rough wiring sketch follows this list).
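Pulling the last few points together, here is a hypothetical wiring sketch that reuses the NaiveBayesSpamFilter class from the earlier comment. Nothing here is an existing API: the function names, option parameters, field list and storage path are invented for illustration, and the real implementation would read them from the antispam options and the "scannable" field property described above.

```php
<?php
// Hypothetical glue code: persist the training data as a serialised file and
// turn the classifier's confidence into an extra spam-confidence score.

const BAYES_TRAINING_PATH = '/path/to/safe/dir/bayes_training.dat'; // placeholder path

function load_bayes_filter(string $path): NaiveBayesSpamFilter
{
    // Training data is persisted as a serialised file, as suggested above.
    if (is_file($path)) {
        $filter = unserialize(
            file_get_contents($path),
            ['allowed_classes' => [NaiveBayesSpamFilter::class]]
        );
        if ($filter instanceof NaiveBayesSpamFilter) {
            return $filter;
        }
    }
    return new NaiveBayesSpamFilter();
}

function save_bayes_filter(NaiveBayesSpamFilter $filter, string $path): void
{
    file_put_contents($path, serialize($filter), LOCK_EX);
}

/**
 * Extra spam-confidence score for a submission.
 *
 * @param string[] $scannable_fields  Values of the fields marked "scannable" by the
 *                                    content hook (body, title, SEO keywords, ...).
 * @param float    $threshold         Admin option: Bayes confidence needed to react.
 * @param int      $confidence_to_add Admin option: score added when the threshold is met.
 */
function bayes_spam_score(array $scannable_fields, float $threshold, int $confidence_to_add): int
{
    $filter = load_bayes_filter(BAYES_TRAINING_PATH);
    $probability = $filter->spamProbability(implode("\n", $scannable_fields));
    return ($probability >= $threshold) ? $confidence_to_add : 0;
}

// When a moderator ticks the "flag as spam" box, the same filter would be
// updated and written back out, e.g.:
//   $filter->train($deleted_text, 'spam');
//   save_bayes_filter($filter, BAYES_TRAINING_PATH);
```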
Unfortunately, there aren't that many maintained PHP libraries out there to handle the Bayes training, so we may need to write our own.