View Issue Details

ID: 2646
Project: Composr
Category: core
View Status: public
Last Update: 2024-04-22 17:18
Reporter: Chris Graham
Assigned To: Guest
Priority: normal
Severity: feature
Status: new
Resolution: open
Summary: 2646: Bayesian spam detection
Description: When content is deleted as spam, copy its text into a new 'spam' table. Use this table, together with normal content, for spam/ham detection using a Bayesian algorithm.

Every once in a while (or when the cleanup tool is run), the prior probabilities for keywords would be updated.

This can then be integrated as an extra signal for 2384 (Anti-spam heuristics).
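
As a rough illustration of the kind of Bayesian detection described above, here is a minimal naive Bayes sketch in Python. It is not Composr code: the class name, the tokenizer, and the add-one smoothing are all assumptions made for the example, and a real implementation would read its training text from the proposed 'spam' table and from normal site content.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSketch:
    """Hypothetical spam/ham classifier; names are illustrative only."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one piece of content as 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior probability of the class, smoothed so it is never zero.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing; the extra +1 leaves room for unseen words.
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Updating the prior probabilities "every once in a while" would then amount to rebuilding the counts from the current spam and ham text.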
Tags: Roadmap: Over the horizon, Type: Spam
Time estimation (hours): 20
Sponsorship open


Relationships

related to 2384 (Resolved, Chris Graham): Anti-spam heuristics
related to 2057 (Resolved, Chris Graham): Delete member content on punishment form

Activities

Guest

2016-06-09 02:18

reporter   ~4019

Question. How would one mark content being deleted as being deleted for spam? Would there be a new checkbox or something for staff?

Chris Graham

2016-06-09 02:20

administrator   ~4020

It would be done with 2057: all of the punished member's content would be considered spam if a checkbox was ticked.

PDStig

2016-06-09 02:51

administrator   ~4024

Hmm... what if only a few pieces of content are considered spam? That approach would wipe everything, including legitimate posts. I can see its use for sole spammers, but it may be a problem for the casual "I did it once and I learned from it" spammer.

PDStig

2024-01-06 04:08

administrator   ~8153

Last edited: 2024-01-06 04:10

After doing some research, I think this could be a useful anti-spam feature for v11, as spam continues to rise.

For the algorithm to work effectively, it will need to be trained both on spam and on ham. However, we would need to know when to classify a piece of content as ham.

Perhaps an hourly scheduled task could train the algorithm on content which is X hours old or older, treating it as ham. X would be configurable, perhaps defaulting to 72 hours, as we can reasonably assume that in most cases content which has not been moderated as spam within 3 days is ham.
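
To make the age-based idea concrete, here is a small Python sketch of how such a task might pick out ham candidates. The function name and the record fields (`posted`, `marked_spam`, `trained`) are hypothetical and only stand in for whatever the real content tables would provide.

```python
from datetime import datetime, timedelta, timezone


def select_ham_candidates(items, now, min_age_hours=72):
    """Return items old enough, and unmoderated, to be trained as ham.

    Each item is assumed to be a dict with 'posted' (a datetime),
    'marked_spam' (bool) and 'trained' (bool); these are illustrative names.
    """
    cutoff = now - timedelta(hours=min_age_hours)
    return [
        item for item in items
        if item["posted"] <= cutoff      # at least min_age_hours old
        and not item["marked_spam"]      # never moderated as spam
        and not item["trained"]          # not already used for training
    ]
```

The hourly task would call this with the current time, train on the result, and flag those items as trained so they are not counted twice.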

Or, we could go a dynamic route:

* Have two tables... spam and spam_probabilities.
* "spam" is a collection of raw text which has been marked as spam.
* "spam_probabilities" is the training data for the Bayesian algorithm.
* Have a scheduled hook run every hour (but only when new content was recently added) which recalculates spam_probabilities. It does this by looking at content which currently exists on the site and is newer than X days (say, a default of 30) and treating it as ham. All entries in "spam" no older than X days (again 30 by default, the same number as for ham) are trained as spam. We should also run this every time a new entry is added to "spam". The hook should also clean old entries out of the "spam" table (or perhaps we let the privacy hooks do that instead, in case an admin wants to increase the number of days to look back).
* When checking for spam, we run the requested content submission through the algorithm to determine if it is likely spam and apply a score if so.
* We should consider other fields as well, not just main body content... like title, SEO keywords, etc. Basically, any text-type field.
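
The dynamic route above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the field names (`title`, `body`, `keywords`, `added`), the function names, and the add-one smoothing are invented for the example, and in-memory lists stand in for the "spam" and "spam_probabilities" tables.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

TEXT_FIELDS = ("title", "body", "keywords")  # hypothetical text-type fields


def tokens(item):
    """Tokenize every text-type field of an item, not just the body."""
    text = " ".join(item.get(field, "") for field in TEXT_FIELDS)
    return re.findall(r"[a-z0-9']+", text.lower())


def recalculate(spam_rows, site_content, now, window_days=30):
    """Rebuild the probabilities from rows inside the look-back window.

    Returns (probabilities, kept_spam_rows). probabilities maps each token
    to (P(token | spam), P(token | ham)), both with add-one smoothing.
    """
    cutoff = now - timedelta(days=window_days)
    kept_spam = [row for row in spam_rows if row["added"] >= cutoff]  # prune old spam
    ham = [item for item in site_content if item["added"] >= cutoff]  # recent content is ham

    spam_counts, ham_counts = Counter(), Counter()
    for row in kept_spam:
        spam_counts.update(tokens(row))
    for item in ham:
        ham_counts.update(tokens(item))

    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    probabilities = {
        t: ((spam_counts[t] + 1) / spam_total, (ham_counts[t] + 1) / ham_total)
        for t in vocab
    }
    return probabilities, kept_spam


def spam_log_odds(item, probabilities):
    """Sum log(P(token|spam) / P(token|ham)) over known tokens; above 0 leans spam."""
    return sum(
        math.log(probabilities[t][0] / probabilities[t][1])
        for t in tokens(item)
        if t in probabilities
    )
```

A submission whose log odds come out above some threshold could then contribute to the spam score; the threshold and the size of that contribution are left open here, as they are in the notes above.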

PDStig

2024-04-22 17:18

administrator   ~8659

Due to the strict v11 release timeline, this has been put off to over the horizon (11.1 or later).


Issue History

Date Modified Username Field Change
2016-06-08 01:21 Chris Graham New Issue
2016-06-08 01:21 Chris Graham Tag Attached: Type: Spam
2016-06-08 01:24 Chris Graham Description Updated
2016-06-08 01:24 Chris Graham Relationship added child of 2384
2016-06-08 01:25 Chris Graham Relationship added child of 2057
2016-06-09 02:18 Guest Note Added: 0004019
2016-06-09 02:20 Chris Graham Note Added: 0004020
2016-06-09 02:51 PDStig Note Added: 0004024
2016-10-25 17:43 Chris Graham Relationship deleted child of 2057
2016-10-25 17:43 Chris Graham Relationship deleted child of 2384
2016-10-25 17:43 Chris Graham Relationship added related to 2384
2016-10-25 17:43 Chris Graham Relationship added related to 2057
2024-01-06 03:58 PDStig Tag Attached: Roadmap: v11
2024-01-06 04:08 PDStig Note Added: 0008153
2024-01-06 04:10 PDStig Note Edited: 0008153
2024-04-22 17:18 PDStig Tag Detached: Roadmap: v11
2024-04-22 17:18 PDStig Tag Attached: Roadmap: Over the horizon
2024-04-22 17:18 PDStig Note Added: 0008659