Bugs can be tricky

Rating: 5 (Liked by Adam Edington)
#4704 (In Topic #949)
This is going to be a very technical post. I'll keep it as simple as reasonably possible, though. I thought it would be interesting to share, and writing this down also helps crystallise my thinking.

Today I had an issue with Google Search Console showing server errors for a client's site, yet manual tests showed Google could access the site fine, and nothing was logged as an error on the server. The client's search ranking is unlikely to be affected by this, but the mere possibility of that happening would set off alarm bells for most people.

Here's what the problem was…

Months ago, there was a bug in our HTML minification on the client's site (code that automatically reduces the size of the HTML, coming in v11), which led to some invalid URLs. Even though this was fixed long ago, was a very minor bug, and was only present for a short time, Google remembered some of these invalid URLs.

Certain invalid URLs in Composr CMS can trigger a "hack-attack alert", because they look like a hacker is trying to compromise the system.
Enough of these alerts, and the connecting IP address is banned. It's an important security technique of ours which has blocked many malicious bots over the years from wasting system resources.
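The alert-threshold banning described above can be sketched roughly like this in Python. The function name, the threshold value, and the in-memory storage are illustrative assumptions, not Composr's actual code:

```python
# Hypothetical sketch of threshold-based auto-banning: once an IP has
# triggered enough hack-attack alerts, it gets banned. The threshold
# and data structures here are assumptions for illustration only.
from collections import Counter

ALERT_THRESHOLD = 5  # assumed number of alerts before a ban kicks in

alert_counts: Counter = Counter()
banned_ips: set = set()

def record_hack_attack_alert(ip: str) -> bool:
    """Record one alert against an IP; return True if the IP is now banned."""
    alert_counts[ip] += 1
    if alert_counts[ip] >= ALERT_THRESHOLD:
        banned_ips.add(ip)
    return ip in banned_ips
```

In a real system the counts and bans would live in a database with expiry, but the decision logic is this simple at heart.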

But a banned IP address should not result in a server error (a '500' error, technically); it should result in some kind of access-denied error code. Well, the server in question is behind a firewall, so the IP addresses seen by the web server are those of the gateway router, not of the end user. At some point the web server remaps them to the real IP address, but only after server-level bans have been checked. This meant that bans were running in Composr instead of at the server level; Composr-level banning is a secondary layer of ban security that we have but that usually isn't needed. These bans were being thrown out as Composr "critical errors", which are implemented with the 500 error code. That's an oversight which I've now fixed.

But if Google rendered the page fine, how could it be banned? Well, only a single IP address from Google was banned, and hence requests were only rejected some very small percentage of the time. That's not something you'd pick up on from looking at Google Search Console errors, though (Google Search Console provides very limited information).

But we actually prevent Google from being banned automatically. We're not stupid enough to simply ban any IP address that visits malicious-looking URLs, as that would itself be a vulnerability, allowing an attacker to get a site kicked off Google. We check whether an IP belongs to Google by doing a reverse DNS lookup, looking to see if the resolved hostname is that of an important crawler (we can't trust the user-agent header). On the particular client server involved, though, reverse DNS lookups returned hostnames with a trailing dot, while on previously tested machines they did not. This threw off the check. (We can't just do a substring check, as that would itself be vulnerable, but our check has been hardened.)
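The general technique here is forward-confirmed reverse DNS (FCrDNS), which is what Google itself recommends for verifying Googlebot. A minimal sketch, with an illustrative whitelist and function names that are assumptions rather than Composr's real code:

```python
# Sketch of forward-confirmed reverse DNS crawler verification. The
# whitelist and function names are illustrative assumptions.
import socket

TRUSTED_CRAWLER_DOMAINS = ('googlebot.com', 'google.com')  # assumed whitelist

def hostname_matches_whitelist(hostname: str) -> bool:
    # Some resolvers return fully-qualified names with a trailing dot
    # ("crawl-66-249-66-1.googlebot.com."); normalise before comparing.
    hostname = hostname.rstrip('.')
    # A plain substring check would accept "googlebot.com.evil.example",
    # so require an exact match or a proper subdomain of a trusted domain.
    return any(hostname == d or hostname.endswith('.' + d)
               for d in TRUSTED_CRAWLER_DOMAINS)

def is_trusted_crawler(ip: str) -> bool:
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname_matches_whitelist(hostname):
        return False
    # Forward-confirm: the claimed hostname must resolve back to the
    # same IP, since anyone controlling their own reverse DNS zone can
    # claim any hostname they like.
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    return ip in forward_ips
```

The trailing-dot bug described above lives entirely in the normalisation step: without the `rstrip('.')`, a perfectly legitimate `googlebot.com.` hostname fails the suffix check.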

Further complicating things, the banned user agent that was logged was not Googlebot; it was the Google AdSense crawler. Why would Google Search Console be showing errors relating to AdSense? Well, Google tasks its machines with multiple jobs. So while we banned a machine acting as an AdSense crawler, that same machine also acted as a regular Google crawler.

I have now added automated tests covering our bot detection via IP whitelisting and DNS whitelisting. And I've made the DNS whitelisting configurable via an overridable text file for v11.

To aid future debugging, I have added a Health Check for detecting (with the latest whitelists and code) whether any crawler IPs have been banned. This is now one of about 200 checks we run to make sure something isn't badly screwed up on a Composr website. That's way more checks than any human can reasonably stay on top of, or even really know about, which is why I love this Health Check system (coming in v11, although a version of it is available as a v10 addon).
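A check like this might look roughly as follows; the function name and the PASS/FAIL result format are hypothetical, made up for illustration rather than taken from the actual Health Check system:

```python
# Hypothetical sketch of a "no crawlers banned" health check: re-test
# every currently banned IP against the latest crawler-detection logic
# and flag any that should never have been banned.

def check_no_crawlers_banned(banned_ips, is_trusted_crawler):
    """Return ('PASS'|'FAIL', message) for the given ban list.

    `is_trusted_crawler` is the site's current crawler-verification
    function (e.g. a forward-confirmed reverse DNS check).
    """
    wrongly_banned = [ip for ip in banned_ips if is_trusted_crawler(ip)]
    if wrongly_banned:
        return ('FAIL', 'Crawler IPs are banned: ' + ', '.join(wrongly_banned))
    return ('PASS', 'No crawler IPs are banned')
```

The key design point is that the check runs the *latest* detection code against historical bans, so a ban issued under old, buggy logic (like the trailing-dot bug above) gets surfaced as soon as the logic is fixed.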

Here's a summary of the curve balls which made this so difficult to debug:
  1. Googlebot going to URLs that weren't even linked from anywhere
  2. 'Server errors' for what was actually a ban situation
  3. Google 'banned' but still able to access the site fine when tested from Google Search Console
  4. Google banned even though Google could not be banned
  5. The Google search crawler affected, even though it was the Google AdSense crawler that got banned

And this, ladies and gentlemen, is why I sometimes really struggle to get my billable hours in! About 5 hours of my day went on something I only found out about this morning (via a Google email alert). I don't charge clients for Composr bugs, however esoteric they may be.

#4706
Wow! Intense. Good work getting that solved.
