#6336 - Stats are still failing to generate on high-traffic sites
| Identifier | #6336 |
|---|---|
| Issue type | Major issue (breaks an entire feature) |
| Title | Stats are still failing to generate on high-traffic sites |
| Status | Open |
| Handling member | Deleted |
| Version | 11 beta8 |
| Addon | stats |
| Description | Despite optimisation, pre-processing of statistics is still failing on composr.app, probably because of all the guest traffic it has been receiving. We need to optimise even more. |
| Funded? | No |

Comments
My approach is to flatten the data that we store in the database. Instead of dumping serialized data into p_data for each bucket/interval, we will flatten out the keys. Every key-value pair (data point) will be its own row in the database.
Pros:
- Much lower memory use, since we are no longer selecting serialized dumps of p_data (these can easily be multiple MBs).
- The flat key structure means we can select groups of data points using LIKE `keys||to||select||%` instead of loading entire serialized dumps, running unserialize on them, and filtering for the data points we want. Since this happens directly in SQL, we can also select keys in batches (e.g., 100 at a time) to avoid OOM errors.
Cons:
- Many more rows in the database (but they will be smaller)
- Many more SQL queries involved (but that's mainly on the scheduler; graphs won't see that much of an increase due to selecting all data points that we need together with a wildcard LIKE statement)
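To make the flattening idea concrete, here is a minimal sketch in Python/SQLite (Composr itself is PHP; the table and column names `stats_preprocessed_flat`, `p_bucket`, `p_key`, and `p_value` are hypothetical illustrations, not the actual schema). Each data point is its own small row, and a subtree of keys is selected with a prefix LIKE, in batches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stats_preprocessed_flat (
        p_bucket TEXT,   -- e.g. 'views_monthly' (illustrative)
        p_key TEXT,      -- hierarchical key, segments joined with '||'
        p_value INTEGER
    )
""")

# Each key-value pair is its own small row instead of part of one
# big serialized p_data blob.
rows = [
    ("views_monthly", "country||US||page||home", 120),
    ("views_monthly", "country||US||page||docs", 45),
    ("views_monthly", "country||DE||page||home", 30),
]
conn.executemany("INSERT INTO stats_preprocessed_flat VALUES (?, ?, ?)", rows)

def select_subtree(prefix, batch_size=100):
    """Yield all data points under a key prefix, fetched in batches
    so we never hold the whole result set in memory at once."""
    offset = 0
    while True:
        batch = conn.execute(
            "SELECT p_key, p_value FROM stats_preprocessed_flat "
            "WHERE p_key LIKE ? ORDER BY p_key LIMIT ? OFFSET ?",
            (prefix + "%", batch_size, offset),
        ).fetchall()
        if not batch:
            break
        yield from batch
        offset += batch_size

us_points = list(select_subtree("country||US||"))
print(us_points)
# [('country||US||page||docs', 45), ('country||US||page||home', 120)]
```

Note that a prefix LIKE (no leading wildcard) can use an index on `p_key`, which is what keeps graph loads cheap despite the much larger row count.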
I am testing the following changes:
Processing times have been reduced to about 4-5 minutes. I will continue to monitor the changes.
Let's modify the approach by simplifying the process significantly:
- Get rid of pivots in the database. They will still exist internally, but we will store everything (once) with a simple date_and_time timestamp.
- Get rid of the delta table (again). It won't be needed if we take the simple approach above.
Pros:
- Reduced preprocessed table size (it will probably still be large, but hopefully not as large as it is now).
- We can potentially get rid of preprocessed_flat and combine everything into one table. I'm not certain about this yet.
- If we process stats in this manner, then we will have very simple SQL queries that we can execute in batches. This will reduce the burden on the scheduler hook.
Cons:
- Data will not be sorted into pivots in the database. We will have to process this during runtime when loading a graph. This shouldn't be too difficult to do.
- We will still have a large database table. But I don't think this can be avoided without risking PHP OOM errors or complex queries (which would significantly slow down processing). We may just need to warn people that the stats addon requires a significant amount of database storage.
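The runtime pivoting mentioned above can be sketched as follows (a Python illustration, not the actual PHP implementation; the row shape and the choice of monthly bucketing are assumptions). Rows come back from the batched SQL queries with a plain timestamp, and are only grouped into intervals when a graph is loaded:

```python
from collections import defaultdict
from datetime import datetime, timezone

# (date_and_time, key, value) rows as they might come back from
# the batched SQL queries; the shape is an illustrative assumption.
rows = [
    (datetime(2024, 1, 5, tzinfo=timezone.utc), "page||home", 10),
    (datetime(2024, 1, 20, tzinfo=timezone.utc), "page||home", 7),
    (datetime(2024, 2, 2, tzinfo=timezone.utc), "page||home", 4),
]

def pivot_by_month(rows):
    """Aggregate timestamped data points into per-month totals at runtime,
    instead of storing a pre-pivoted copy per interval in the database."""
    totals = defaultdict(int)
    for ts, key, value in rows:
        totals[(ts.strftime("%Y-%m"), key)] += value
    return dict(totals)

print(pivot_by_month(rows))
# {('2024-01', 'page||home'): 17, ('2024-02', 'page||home'): 4}
```

The trade-off is exactly the one stated above: the database stores each point once, and the CPU cost of bucketing moves to graph-load time, where the data volume per request is small.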