#2646 - Bayesian spam detection
| Identifier | #2646 |
|---|---|
| Issue type | Feature request or suggestion |
| Title | Bayesian spam detection |
| Status | Open |
| Tags | Roadmap: Over the horizon (custom), Type: Spam (custom) |
| Handling member | Deleted |
| Addon | core |
| Description | When content is deleted for spam, copy the text into a new 'spam' table. Use this, and normal content, for spam/ham detection using a Bayesian algorithm (a minimal formula sketch follows the issue details). Every once in a while (or when the cleanup tool is run), the prior probabilities for keywords would be updated. This can then be integrated as extra signalling for #2384 |
| Steps to reproduce | |
| Related to | |
| Funded? | No |
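For context, here is a minimal sketch of the standard naive Bayes formulation such a feature would presumably use (this formula is not part of the issue text): per-keyword probabilities learned from the spam and ham training sets are combined into a single spam probability for a submission containing words w1…wn:

```latex
P(\mathrm{spam} \mid w_1,\dots,w_n) =
  \frac{P(\mathrm{spam}) \prod_{i=1}^{n} P(w_i \mid \mathrm{spam})}
       {P(\mathrm{spam}) \prod_{i=1}^{n} P(w_i \mid \mathrm{spam})
        + P(\mathrm{ham}) \prod_{i=1}^{n} P(w_i \mid \mathrm{ham})}
```

In practice the products are computed in log space and the per-word probabilities are smoothed, so rare or unseen keywords do not collapse the whole product to zero; the PHP sketch in the comments below does both.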


Comments
For the algorithm to work effectively, it will need to be trained both on spam and on ham. The difficulty is knowing when a piece of content can safely be classified as ham.
Perhaps an hourly scheduled task could train the algorithm on content that is at least X hours old as ham (configurable, with a default of perhaps 72 hours, since content that has not been moderated away as spam within 3 days can reasonably be assumed to be ham in most cases).
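To make the training/classification step concrete, here is a minimal, self-contained sketch of a naive Bayes filter in PHP. It is not an existing library; the class and method names are made up for illustration, and the tokeniser is deliberately crude. It just shows the word-count bookkeeping and smoothed scoring that the comments above and below assume.

```php
<?php
// Minimal naive Bayes text classifier, sketched for illustration only.
class NaiveBayesSpamFilter
{
    /** @var array<string, array<string, int>> word counts per class ('spam'/'ham') */
    private $wordCounts = ['spam' => [], 'ham' => []];

    /** @var array<string, int> number of trained documents per class */
    private $docCounts = ['spam' => 0, 'ham' => 0];

    public function train(string $text, string $class): void
    {
        $this->docCounts[$class]++;
        foreach ($this->tokenise($text) as $word) {
            $this->wordCounts[$class][$word] = ($this->wordCounts[$class][$word] ?? 0) + 1;
        }
    }

    /** Returns P(spam | text), computed in log space with Laplace smoothing. */
    public function spamProbability(string $text): float
    {
        $totalDocs = array_sum($this->docCounts) ?: 1;
        $vocabulary = count($this->wordCounts['spam'] + $this->wordCounts['ham']) ?: 1;
        $words = $this->tokenise($text);
        $logScores = [];

        foreach (['spam', 'ham'] as $class) {
            $classTotal = array_sum($this->wordCounts[$class]);
            $logScore = log(($this->docCounts[$class] + 1) / ($totalDocs + 2)); // class prior
            foreach ($words as $word) {
                $count = $this->wordCounts[$class][$word] ?? 0;
                $logScore += log(($count + 1) / ($classTotal + $vocabulary)); // smoothed likelihood
            }
            $logScores[$class] = $logScore;
        }

        // Convert the two log scores back into a normalised probability.
        $max = max($logScores);
        $spam = exp($logScores['spam'] - $max);
        $ham = exp($logScores['ham'] - $max);
        return $spam / ($spam + $ham);
    }

    /** Crude tokeniser: lower-case words/numbers of 2+ characters. */
    private function tokenise(string $text): array
    {
        preg_match_all('/[\p{L}\p{N}]{2,}/u', mb_strtolower($text), $matches);
        return $matches[0];
    }
}
```

A quick usage example of the sketch above:

```php
$filter = new NaiveBayesSpamFilter();
$filter->train('Cheap pills, click here now!!!', 'spam');
$filter->train('Thanks for the report; fixed on the git master branch', 'ham');
var_dump($filter->spamProbability('cheap pills here')); // well above 0.5
```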
Or, we could go a dynamic route:
* Have two tables... spam and spam_probabilities.
* "spam" is a collection of raw text which has been marked as spam.
* "spam_probabilities" is the training data for the Bayesian algorithm.
* Have a scheduled hook run every hour (but only when new content was recently added) which recalculates spam_probabilities. It does this by treating content which currently exists on the site and is newer than X days (let's say a default of 30) as ham, and all entries in "spam" no older than X days (again, 30 by default, the same number as for ham) as spam. We should also run this every time a new entry is added to spam. And this should also clean out old entries from the spam table (or perhaps let the privacy hooks do that instead, in case an admin wants to increase the number of days to look back).
* When checking for spam, we run the requested content submission through the algorithm to determine if it is likely spam and apply a score if so.
* We should consider other fields as well, not just main body content... like title, SEO keywords, etc. Basically, any text-type field.
* Define a new property on content / resource meta aware hooks which defines the database fields considered "scannable" by the algorithm. These are the fields passed into the training data when we train for spam or ham, and also the fields we look at when determining if something is spam.
* A scheduler hook periodically processes content older than X days (let's say 7, because that's how far back the content deletion field goes on the warnings form) and trains the algorithm on it as "ham". For fresh addon installs it needs to process in small batches so we don't overload the server, so we should also record the timestamp up to which we have trained so far.
* Anywhere we can delete content, add a new tick box that allows us to flag it as spam. When flagged, it will be trained as "spam". Additionally, have a tick box on the warnings form to do the same for any content or posts marked for deletion.
* Persist the training data somewhere, probably as a serialised file instead of the database (because that's how most machine-learning PHP libraries handle training data).
* Add a new antispam heuristic option specifying the threshold at which the Bayesian classifier is confident the content being submitted is "spam". If the threshold is met, we add an amount of spam confidence specified in another option (a rough wiring sketch follows this list).
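Pulling the last few points together, here is a hypothetical wiring sketch that reuses the NaiveBayesSpamFilter class from the earlier comment. Nothing here is an existing API: the function names, option parameters, field list and storage path are invented for illustration, and the real implementation would read them from the antispam options and the "scannable" field property described above.

```php
<?php
// Hypothetical glue code: persist the training data as a serialised file and
// turn the classifier's confidence into an extra spam-confidence score.

const BAYES_TRAINING_PATH = '/path/to/safe/dir/bayes_training.dat'; // placeholder path

function load_bayes_filter(string $path): NaiveBayesSpamFilter
{
    // Training data is persisted as a serialised file, as suggested above.
    if (is_file($path)) {
        $filter = unserialize(
            file_get_contents($path),
            ['allowed_classes' => [NaiveBayesSpamFilter::class]]
        );
        if ($filter instanceof NaiveBayesSpamFilter) {
            return $filter;
        }
    }
    return new NaiveBayesSpamFilter();
}

function save_bayes_filter(NaiveBayesSpamFilter $filter, string $path): void
{
    file_put_contents($path, serialize($filter), LOCK_EX);
}

/**
 * Extra spam-confidence score for a submission.
 *
 * @param string[] $scannable_fields  Values of the fields marked "scannable" by the
 *                                    content hook (body, title, SEO keywords, ...).
 * @param float    $threshold         Admin option: Bayes confidence needed to react.
 * @param int      $confidence_to_add Admin option: score added when the threshold is met.
 */
function bayes_spam_score(array $scannable_fields, float $threshold, int $confidence_to_add): int
{
    $filter = load_bayes_filter(BAYES_TRAINING_PATH);
    $probability = $filter->spamProbability(implode("\n", $scannable_fields));
    return ($probability >= $threshold) ? $confidence_to_add : 0;
}

// When a moderator ticks the "flag as spam" box, the same filter would be
// updated and written back out, e.g.:
//   $filter->train($deleted_text, 'spam');
//   save_bayes_filter($filter, BAYES_TRAINING_PATH);
```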
Unfortunately, there aren't that many maintained PHP libraries out there to handle the Bayes training, so we may need to write our own.