View Issue Details

ID: 2384
Project: Composr
Category: core
View Status: public
Last Update: 2019-06-27 17:20
Reporter: Chris Graham
Assigned To: Chris Graham
Priority: normal
Severity: feature
Status: resolved
Resolution: fixed
Summary: 2384: Anti-spam heuristics
Description: There are a number of factors we can use to detect increased likelihood of spam:
1) Posting speed (by looking at when the CSRF token was generated compared to when the form was posted)
2) Closeness to having joined (people may join for bypassing CAPTCHA, getting extra features)
3) Posting links
4) Posting frequency
5) Posting repeat content
6) Using particular keywords ("cialis", ...)
7) Using particular coding ("Times New Roman" [implies a paste], "<font face=" [implies a paste])
8) Use of invalid coding from other software ("[link", ...)
9) Use of paste as opposed to typing
10) Presence of JavaScript (particular calculations could be done and submitted with the form, to know that a real working JavaScript engine was there; perhaps something computationally costly like factorisation; also detection of use of mouse and/or keyboard as a human would)
11) Triggering of the spam blackhole in a form
12) Particular user-agent substrings ("bot", "perl", ...)
13) Missing HTTP headers a real browser will always send: Accept, User-Agent, Cookie, Accept-Language, Accept-Encoding
14) Hits from particular countries (fully configurable)
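
As a rough illustration of factors 12 and 13, here is a minimal sketch of the two request-level checks. The function names are hypothetical (not Composr APIs), the `$server` argument is assumed to be shaped like PHP's `$_SERVER`, and the default substring list just reuses the two examples given above.

```php
<?php
// Hypothetical helper (not a Composr API): lists expected browser headers that are absent.
// $server is assumed to be shaped like PHP's $_SERVER superglobal.
function find_missing_browser_headers(array $server): array
{
    $expected = [
        'Accept' => 'HTTP_ACCEPT',
        'User-Agent' => 'HTTP_USER_AGENT',
        'Cookie' => 'HTTP_COOKIE',
        'Accept-Language' => 'HTTP_ACCEPT_LANGUAGE',
        'Accept-Encoding' => 'HTTP_ACCEPT_ENCODING',
    ];
    $missing = [];
    foreach ($expected as $name => $key) {
        // Treat an empty header the same as a missing one
        if (!isset($server[$key]) || $server[$key] === '') {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Hypothetical helper: case-insensitive search for configured user-agent substrings.
function user_agent_looks_automated(string $user_agent, array $substrings = ['bot', 'perl']): bool
{
    foreach ($substrings as $substring) {
        if (stripos($user_agent, (string) $substring) !== false) {
            return true;
        }
    }
    return false;
}
```

Each positive result would only contribute an increment to the overall rating rather than block outright, since legitimate crawlers also match "bot" and a first-time visitor may legitimately send no Cookie header.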

We can detect these factors and make each one configurable, so that it bumps up the spam certainty rating for a request. The ratings would be cumulative: the increments from all detected factors are added together to give an overall spam rating. That overall rating would then be subject to the approve/block/ban thresholds that already exist.
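
To make the cumulative idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the factor names, the increment values, the cap at 100, and the threshold numbers are not taken from the final implementation.

```php
<?php
// Illustrative only: sums the configured increment for each detected factor.
// Factor names and increments are hypothetical; the cap at 100 assumes a percentage scale.
function total_spam_score(array $detected_factors, array $increments): int
{
    $score = 0;
    foreach ($detected_factors as $factor) {
        // Factors with no configured increment contribute nothing
        $score += $increments[$factor] ?? 0;
    }
    return min($score, 100);
}

// Illustrative only: maps an overall score onto the approve/block/ban thresholds.
function action_for_score(int $score, int $approve_at, int $block_at, int $ban_at): string
{
    if ($score >= $ban_at) {
        return 'ban';
    }
    if ($score >= $block_at) {
        return 'block';
    }
    if ($score >= $approve_at) {
        return 'approve';
    }
    return 'allow';
}

$increments = ['posted_link' => 20, 'recently_joined' => 15, 'fast_post' => 25];
$score = total_spam_score(['posted_link', 'recently_joined'], $increments); // 20 + 15 = 35
$action = action_for_score($score, 30, 60, 90); // 'approve': held for staff approval
```

With these example numbers no single factor reaches the approval threshold on its own, but two together do, which is the point of adding them up.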

Our LAME_SPAM_HACK hack-attack signal can be removed, and the code for that integrated into this new system.

It would all be configurable: all the time factors and all the different spam certainty increments (including configuration per detected spammy keyword).
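
A minimal sketch of what per-keyword increments and a configurable time factor could look like. The function names, the keyword map, and the minimum-seconds value are hypothetical, and in practice the values would come from site configuration rather than being hard-coded.

```php
<?php
// Hypothetical helper: adds up the configured increment for every keyword found in the post.
// $keyword_increments maps keyword => spam certainty increment.
function keyword_spam_increment(string $post, array $keyword_increments): int
{
    $total = 0;
    foreach ($keyword_increments as $keyword => $increment) {
        // Cast because PHP converts numeric-looking array keys to integers
        if (stripos($post, (string) $keyword) !== false) {
            $total += $increment;
        }
    }
    return $total;
}

// Hypothetical helper for factor 1: compares when the form's CSRF token was generated
// with when the form was posted. Both arguments are Unix timestamps.
function posted_too_fast(int $token_generated_at, int $posted_at, int $min_seconds): bool
{
    return ($posted_at - $token_generated_at) < $min_seconds;
}

$increment = keyword_spam_increment('Buy CIALIS now', ['cialis' => 40, 'casino' => 25]); // 40
$too_fast = posted_too_fast(1000, 1002, 5); // true: posted 2 seconds after the form was generated
```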
Additional Information: Here's some simple temporary code in use on our own sites in an unofficial capacity, a small subset of what this final system would do...

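    require_code('antispam');
    // Members who joined within this many hours are treated like guests
    $hours_like_guest = 2;
    $post = post_param('post', '');
    $is_new_or_guest = is_guest() || $GLOBALS['FORUM_DRIVER']->get_member_join_timestamp(get_member()) > time() - 60 * 60 * $hours_like_guest;
    // Look for an HTML link or a Comcode URL tag in the submitted post
    $has_link = (strpos($post, '<a ') !== false) || (strpos($post, '[url') !== false);
    if ($is_new_or_guest && $has_link) {
        // Report at exactly the approval threshold (the option is a percentage, so convert to 0.0-1.0)
        handle_perceived_spammer_by_confidence(get_ip_address(), floatval(get_option('spam_approval_threshold')) / 100.0, 'internal checks', false);
    }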
Tags: Type: Spam
Time estimation (hours): 16
Sponsorship open

Relationships

related to 2057 (Resolved, Chris Graham): Delete member content on punishment form
related to 2646 (Not Assigned, Guest): Bayesian spam detection

Activities

Chris Graham

2016-06-08 00:58

administrator   ~4007

Use of the contact forms is a concern. We should log everything going into them so that factors 4 and 5 above (posting frequency and repeat content) can work for these.

Guest

2016-06-09 02:20

reporter   ~4021

Last edited: 2016-06-09 02:21

True. But what about guest forum posting and guest support ticket / feedback submission as well?

Chris Graham

2016-06-09 02:23

administrator   ~4022

Fair point. It would need to track through somehow then. Maybe a punish link in the staff actions, like we do for forum posts - and track through content_type and content_id from that.

PDStig

2016-06-09 02:48

administrator   ~4023

Last edited: 2016-06-09 02:53

If the punish links could also be tied in to the warning/punishment form similar to forum posts, i.e. the content being punished is rendered as a link or as a render box Tempcode (or Comcode) in the message field (similar to how my new reports addon works), that could further enhance the usefulness of punish links elsewhere.

But agreed. Technically, virtually any form of content can be submitted by guests... if permissions allow for it. Therefore, there needs to be a pipe for all content.

Guest

2016-06-14 13:48

reporter   ~4037

We also should have a privilege to avoid the spam heuristic system.

Chris Graham

2016-10-25 17:50

administrator   ~4472

Ok, so I'm reading the comments more carefully than I did originally, as I am now implementing this.

I don't really agree with much of the discussion; it's tangential to the issue and more related to 2057 and #2374 and 375, which will be considered separately.

The main issue discussed seems to be how we can do posting-frequency detection for guests, as all combined guest postings go under a single ID. However, I think there's no real issue, because guests get the CAPTCHA, or we'd generally limit guest posting access (who'd want guests submitting news, for example?). So we can implement posting-frequency detection for non-guests only, and still have a whole diverse set of other techniques that do work on guests (CAPTCHA, but also all the other heuristics). We couldn't really track guests anyway; people could use Tor (and so have rotating IPs and session IDs).
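
A minimal sketch of posting-frequency detection restricted to non-guests, under stated assumptions: the function name is hypothetical, the post log is a plain in-memory array rather than a database query, and the guest member ID defaulting to 1 is an assumption about the forum driver.

```php
<?php
// Hypothetical helper: true if a non-guest member has made more than $max_posts posts
// within the last $window_seconds. $post_log is a list of ['member_id' => int, 'time' => int].
// The default guest ID of 1 is an assumption, not something stated in this issue.
function exceeds_posting_frequency(int $member_id, array $post_log, int $now, int $window_seconds, int $max_posts, int $guest_id = 1): bool
{
    // All guests share one ID, so a frequency count for them would be meaningless
    if ($member_id === $guest_id) {
        return false;
    }
    $count = 0;
    foreach ($post_log as $entry) {
        if ($entry['member_id'] === $member_id && $entry['time'] >= $now - $window_seconds) {
            $count++;
        }
    }
    return $count > $max_posts;
}

$log = [
    ['member_id' => 5, 'time' => 100],
    ['member_id' => 5, 'time' => 110],
    ['member_id' => 5, 'time' => 120],
    ['member_id' => 7, 'time' => 115],
];
$flagged = exceeds_posting_frequency(5, $log, 130, 60, 2); // true: 3 posts in the last 60 seconds
```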

Duplicate content submission can work on the guest ID with no issue - because different guests are not legitimately going to be posting the same content.

We do need to make sure heuristics do work effectively for contact forms though.

Chris Graham

2016-10-25 17:52

administrator   ~4473

Oh, also, I think what I was getting at earlier was how we know what is duplicate content, and I suggested a mechanism using the report system for that.
That isn't really necessary. I've implemented a system that can query via meta-data provided in the CMA hooks, over a time range for a particular submitter ID. That's simpler and better than trying to do it through reporting, because it works without any reporting needing to happen.
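
For illustration only, the shape of that check might look something like the sketch below. It is not the actual implementation: the function names are hypothetical, the recent submissions are passed in as a plain array instead of being queried through the CMA hooks, and the whitespace/case normalisation is an assumption.

```php
<?php
// Hypothetical helper: collapses whitespace and case so trivially altered copies still match.
function normalise_for_comparison(string $content): string
{
    return strtolower(trim((string) preg_replace('/\s+/', ' ', $content)));
}

// Hypothetical helper: true if the same submitter posted equivalent content within the time window.
// $recent_submissions is a list of ['submitter' => int, 'time' => int, 'content' => string].
function is_repeat_content(string $new_content, int $submitter_id, array $recent_submissions, int $now, int $window_seconds): bool
{
    $needle = normalise_for_comparison($new_content);
    foreach ($recent_submissions as $submission) {
        if ($submission['submitter'] !== $submitter_id) {
            continue; // Different submitter ID
        }
        if ($submission['time'] < $now - $window_seconds) {
            continue; // Outside the time range
        }
        if (normalise_for_comparison($submission['content']) === $needle) {
            return true;
        }
    }
    return false;
}
```

Because the lookup is keyed on submitter ID, it also works for the shared guest ID, as noted above.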

Chris Graham

2019-06-27 17:20

administrator   ~5998

For reference, W3C have a document explaining non-CAPTCHA anti-spam techniques:
https://www.w3.org/TR/turingtest/

The TLDR is that we now do everything we can that isn't awful in some way, but it's still a good reference.

Issue History

Date Modified Username Field Change
2016-04-08 13:32 Chris Graham New Issue
2016-06-08 00:17 Chris Graham Tag Attached: Type: Spam
2016-06-08 00:36 Chris Graham Summary Link spammer detection => Anti-spam heuristics
2016-06-08 00:36 Chris Graham Description Updated
2016-06-08 00:55 Chris Graham Time estimation (hours) 3 => 16
2016-06-08 00:55 Chris Graham Description Updated
2016-06-08 00:55 Chris Graham Additional Information Updated
2016-06-08 00:58 Chris Graham Note Added: 0004007
2016-06-08 01:15 Chris Graham Description Updated
2016-06-08 01:23 Chris Graham Description Updated
2016-06-08 01:24 Chris Graham Relationship added parent of 2646
2016-06-09 02:20 Guest Note Added: 0004021
2016-06-09 02:21 Guest Note Edited: 0004021 View Revisions
2016-06-09 02:23 Chris Graham Note Added: 0004022
2016-06-09 02:48 PDStig Note Added: 0004023
2016-06-09 02:48 PDStig Note Edited: 0004023
2016-06-09 02:53 PDStig Note Edited: 0004023
2016-06-14 13:48 Guest Note Added: 0004037
2016-10-25 17:43 Chris Graham Relationship deleted parent of 2646
2016-10-25 17:43 Chris Graham Relationship added related to 2646
2016-10-25 17:44 Chris Graham Relationship added related to 2057
2016-10-25 17:50 Chris Graham Note Added: 0004472
2016-10-25 17:52 Chris Graham Note Added: 0004473
2016-10-26 21:00 Chris Graham Status Not Assigned => Resolved
2016-10-26 21:00 Chris Graham Resolution open => fixed
2016-10-26 21:00 Chris Graham Assigned To => Chris Graham
2019-06-27 17:20 Chris Graham Note Added: 0005998