#6160 - Scalability considerations for block_side_stats

  • By
  • Added
  • 12 views
Identifier #6160
Issue type Trivial issue (does not break functionality)
Title Scalability considerations for block_side_stats
Status Open
Handling member Deleted
Version 10.0.43
Addon General / Uncategorised
Description Currently, the side stats block offers page hit counters for today, this week, and this month. The code for all three of these takes the current time, subtracts the given time interval, then does a COUNT in the MySQL table that logs hits.

Over the past few days, I've had a massive influx of new web scraping (which I understand many are dealing with this year as corporations attempt to train their AI models on public websites). My website, which actually only gets <100 valid visitors per day according to my Matomo analytics, has logged half a million hits in the last 24 hours, nearly 2 million hits over the last week, and nearly 5 million hits over the last month according to Composr's stats.

I started seeing my website fail to load altogether due to MariaDB's max connections limit being surpassed. Looking at the active processes in MariaDB, I saw that many of the running processes were simply selecting counts from the hit table-- mostly the "this month" query (which I could see via the timestamp used in the query). When I ran one of the "this month" queries manually, it took several seconds to complete. It makes sense that hundreds of simultaneous connections all attempting to do this (while the hits are still being updated, and thus MariaDB's internal cache constantly invalidated) would overwhelm the database server.

For the time being, I've mitigated this issue by disabling the "this month" counter (since that one took the longest to count), as well as at the network level by blocking the most egregious scraper in my firewall. However, given that some websites might ideally get a massive amount of traffic, this incident's raised questions about the implementation of the stat counters.

For me, the purpose of "Hits this week:" and "Hits this month:" are to provide a general order of magnitude. The specific number is interesting, but what people actually see when they glance at the number is mostly if it's in the hundreds, thousands, etc. Therefore, I don't think "Hits this week:" and "Hits this month:" actually need to be constantly updated (by taking a reading of "hits between X days ago from this very second" every time like Composr does now).

Instead, it would be much more scalable if, for example, "Hits this week:" and "Hits this month:" calculated the number of hits over the last X days ONE time per day, then cached/stored that value. That way, every single website visitor that day would only have to select a single number from the database, rather than performing a COUNT of potentially millions of rows. This would cut down on database load and increase webpage load times.
Steps to reproduce

1. Enable the stats module and add block_side_stats to a page

2. Receive massive amounts of traffic from poorly designed web scrapers

3. Observe heavy database load causing website failure

Additional information I've checked and confirmed the code relevant to this hasn't changed yet in v11 or newer versions of v10 than what I'm running.
Funded? No
The system will post a comment when this issue is modified (e.g., status changes). To be notified of this, click "Enable comment notifications".

Rating

Unrated