View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
2646 | Composr | core | public | 2016-06-08 01:21 | 2024-04-22 17:18 |
Reporter | Chris Graham | Assigned To | Guest | ||
Priority | normal | Severity | feature | ||
Status | new | Resolution | open | ||
Summary | 2646: Bayesian spam detection | ||||
Description | When content is deleted for spam, copy the text into a new 'spam' table. Use this table, together with normal content, for spam/ham detection using a Bayesian algorithm. Periodically (or whenever the cleanup tool is run), the prior probabilities for keywords would be updated. The result can then be integrated as an extra signal for 2384 | ||||
Tags | Roadmap: Over the horizon, Type: Spam | ||||
Time estimation (hours) | 20 | ||||
Sponsorship open | |||||
related to | 2384 | Resolved | Chris Graham | Anti-spam heuristics |
related to | 2057 | Resolved | Chris Graham | Delete member content on punishment form |
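The description's idea of per-keyword probabilities feeding a spam/ham decision could be sketched roughly as follows. This is a minimal naive Bayes illustration in Python, not the project's implementation: the class name `NaiveBayesSpamFilter`, the tokenizer, and the add-one smoothing are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: lowercase words and digits only.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayesSpamFilter:
    """Illustrative naive Bayes classifier over 'spam' and 'ham' text."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one document under 'spam' or 'ham'."""
        self.token_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        total_docs = self.doc_counts["spam"] + self.doc_counts["ham"]
        if total_docs == 0:
            return 0.5  # No training data yet, so no opinion either way.
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        tokens = tokenize(text)
        log_score = {}
        for label in ("spam", "ham"):
            # Class prior, smoothed so neither class ever has probability 0.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_tokens = sum(self.token_counts[label].values())
            for token in tokens:
                count = self.token_counts[label][token]
                # Add-one smoothing; the extra +1 reserves room for unseen tokens.
                log_p += math.log((count + 1) / (total_tokens + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the log-odds into a probability, clamped to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

A value near 1 would suggest spam and a value near 0 would suggest ham; how that maps onto a score for the heuristics in 2384 is left open here.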
|
Question. How would one mark content being deleted as being deleted for spam? Would there be a new checkbox or something for staff? |
|
It would be done with 2057: all of the punished member's content would be considered spam if a checkbox was ticked. |
|
Hmm... what if only a few pieces of content are considered spam? That approach would wipe everything, including legitimate posts. I can see its use for sole spammers, but it may be a problem for the casual "I did it once and I learned from it" spammer. |
|
Doing some research, this could be a useful anti-spam feature for v11, as spam continues to rise. For the algorithm to work effectively, it will need to be trained on both spam and ham. However, we would need to know when to classify a piece of content as ham. Perhaps an hourly scheduled task could train the algorithm, treating content that is X hours or older as ham (configurable, perhaps defaulting to 72 hours, as we can reasonably assume that in most cases content which has not been moderated as spam within 3 days is ham).
Or, we could go a dynamic route:
* Have two tables: spam and spam_probabilities.
* "spam" is a collection of raw text which has been marked as spam.
* "spam_probabilities" is the training data for the Bayesian algorithm.
* Have a scheduled hook run every hour (but only when new content was recently added) which recalculates spam_probabilities. It does this by looking at content which currently exists on the site and is newer than X days (let's say a default of 30) and treating it as ham. All entries in "spam" no older than X days (again 30 by default, the same number as for ham) are trained as spam. We should also run this every time a new entry is added to spam. It should also clean out old entries from the spam table (or perhaps let the privacy hooks do that instead, in case an admin wants to increase the number of days to look back).
* When checking for spam, we run the submitted content through the algorithm to determine whether it is likely spam, and apply a score if so.
* We should consider other fields as well, not just main body content: title, SEO keywords, etc. Basically, any text-type field. |
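The training-set selection described in this note could look roughly like the sketch below. It is only an illustration in Python under stated assumptions: rows are modelled as `(text, added)` tuples rather than real database tables, and the function names and the `ham_min_age_hours` parameter (which lets the 72-hour idea be combined with the 30-day window) are hypothetical.

```python
from datetime import datetime, timedelta


def combine_fields(fields):
    """Join every non-empty text field (body, title, SEO keywords, ...)."""
    return "\n".join(value for value in fields.values() if value)


def select_training_sets(site_content, spam_rows, now,
                         window_days=30, ham_min_age_hours=0):
    """Pick ham and spam texts for recalculating spam_probabilities.

    site_content and spam_rows are lists of (text, added) tuples, standing
    in for live site content and the proposed "spam" table.
    """
    oldest = now - timedelta(days=window_days)
    # Optionally require ham to have survived moderation for a while.
    ham_cutoff = now - timedelta(hours=ham_min_age_hours)
    ham = [text for text, added in site_content
           if oldest <= added <= ham_cutoff]
    spam = [text for text, added in spam_rows if added >= oldest]
    return ham, spam


def prune_spam_rows(spam_rows, now, window_days=30):
    """Drop spam entries that have aged out of the look-back window."""
    oldest = now - timedelta(days=window_days)
    return [(text, added) for text, added in spam_rows if added >= oldest]
```

With `ham_min_age_hours=0` this follows the dynamic route as written; setting it to 72 would additionally hold back very recent content that moderators may not have reviewed yet.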
|
Due to the strict v11 timeline for release, this has been put off to over the horizon (11.1 or later) |
Date Modified | Username | Field | Change |
---|---|---|---|
2016-06-08 01:21 | Chris Graham | New Issue | |
2016-06-08 01:21 | Chris Graham | Tag Attached: Type: Spam | |
2016-06-08 01:24 | Chris Graham | Description Updated | |
2016-06-08 01:24 | Chris Graham | Relationship added | child of 2384 |
2016-06-08 01:25 | Chris Graham | Relationship added | child of 2057 |
2016-06-09 02:18 | Guest | Note Added: 0004019 | |
2016-06-09 02:20 | Chris Graham | Note Added: 0004020 | |
2016-06-09 02:51 | PDStig | Note Added: 0004024 | |
2016-10-25 17:43 | Chris Graham | Relationship deleted | child of 2057 |
2016-10-25 17:43 | Chris Graham | Relationship deleted | child of 2384 |
2016-10-25 17:43 | Chris Graham | Relationship added | related to 2384 |
2016-10-25 17:43 | Chris Graham | Relationship added | related to 2057 |
2024-01-06 03:58 | PDStig | Tag Attached: Roadmap: v11 | |
2024-01-06 04:08 | PDStig | Note Added: 0008153 | |
2024-01-06 04:10 | PDStig | Note Edited: 0008153 | |
2024-04-22 17:18 | PDStig | Tag Detached: Roadmap: v11 | |
2024-04-22 17:18 | PDStig | Tag Attached: Roadmap: Over the horizon | |
2024-04-22 17:18 | PDStig | Note Added: 0008659 |