#6336 - Stats are still failing to generate on high-traffic sites
| Identifier | #6336 |
|---|---|
| Issue type | Major issue (breaks an entire feature) |
| Title | Stats are still failing to generate on high-traffic sites |
| Status | Open |
| Handling member | Deleted |
| Version | 11 beta8 |
| Addon | stats |
| Description | Despite optimisation, pre-processing of statistics is still failing on composr.app, probably because of all the guest traffic it has been receiving. We need to optimise even more. |
| Funded? | No |

Comments
My approach is to flatten the data that we store in the database. Instead of dumping serialized data into p_data for each bucket/interval, we will flatten out the keys. Every key-value pair (data point) will be its own row in the database.
Pros:
- Much lower memory use, since we are no longer selecting serialized dumps of p_data (these can easily be multiple MBs).
- The flat key structure means we can select groups of data points using LIKE `keys||to||select||%` instead of loading entire serialized dumps, running unserialize on them, and filtering for the data points we want. Since this happens directly in SQL, we can also select keys in batches (e.g., 100 at a time) to avoid OOM errors.
Cons:
- Many more rows in the database (but they will be smaller)
- Many more SQL queries involved (but that's mainly on the scheduler; graphs won't see that much of an increase due to selecting all data points that we need together with a wildcard LIKE statement)
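To make the flattening idea concrete, here is a minimal sketch in Python/SQLite (Composr itself is PHP; the table and column names `stats_preprocessed_flat`, `p_bucket`, `p_key`, and `p_value` are hypothetical illustrations, not the actual schema). Each data point is its own small row, and a subtree of keys is selected with a prefix LIKE, in batches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stats_preprocessed_flat (
        p_bucket TEXT,   -- e.g. 'views_monthly' (illustrative)
        p_key TEXT,      -- hierarchical key, segments joined with '||'
        p_value INTEGER
    )
""")

# Each key-value pair is its own small row instead of part of one
# big serialized p_data blob.
rows = [
    ("views_monthly", "country||US||page||home", 120),
    ("views_monthly", "country||US||page||docs", 45),
    ("views_monthly", "country||DE||page||home", 30),
]
conn.executemany("INSERT INTO stats_preprocessed_flat VALUES (?, ?, ?)", rows)

def select_subtree(prefix, batch_size=100):
    """Yield all data points under a key prefix, fetched in batches
    so we never hold the whole result set in memory at once."""
    offset = 0
    while True:
        batch = conn.execute(
            "SELECT p_key, p_value FROM stats_preprocessed_flat "
            "WHERE p_key LIKE ? ORDER BY p_key LIMIT ? OFFSET ?",
            (prefix + "%", batch_size, offset),
        ).fetchall()
        if not batch:
            break
        yield from batch
        offset += batch_size

us_points = list(select_subtree("country||US||"))
print(us_points)
# [('country||US||page||docs', 45), ('country||US||page||home', 120)]
```

Note that a prefix LIKE (no leading wildcard) can use an index on `p_key`, which is what keeps graph loads cheap despite the much larger row count.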
I am testing the following changes:
Processing times have been reduced to about 4-5 minutes. I will continue to monitor the changes.
Let's modify the approach by simplifying the process significantly:
- Get rid of pivots in the database. They will still exist internally, but we will store everything (once) with a simple date_and_time timestamp.
- Get rid of the delta table (again). It won't be needed if we take the simple approach above.
Pros:
- Reduced preprocessed table size (it will probably still be large, but hopefully not as large as it is now).
- We can potentially get rid of preprocessed_flat and combine everything into one table. I'm not certain about this yet.
- If we process stats in this manner, then we will have very simple SQL queries that we can execute in batches. This will reduce the burden on the scheduler hook.
Cons:
- Data will not be sorted into pivots in the database. We will have to process this during runtime when loading a graph. This shouldn't be too difficult to do.
- We will still have a large database table. But I don't think this can be avoided without risking PHP OOM errors or complex queries (which would significantly slow down processing). We may just need to warn people that the stats addon requires a significant amount of database storage.
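The runtime pivoting mentioned above can be sketched as follows (a Python illustration, not the actual PHP implementation; the row shape and the choice of monthly bucketing are assumptions). Rows come back from the batched SQL queries with a plain timestamp, and are only grouped into intervals when a graph is loaded:

```python
from collections import defaultdict
from datetime import datetime, timezone

# (date_and_time, key, value) rows as they might come back from
# the batched SQL queries; the shape is an illustrative assumption.
rows = [
    (datetime(2024, 1, 5, tzinfo=timezone.utc), "page||home", 10),
    (datetime(2024, 1, 20, tzinfo=timezone.utc), "page||home", 7),
    (datetime(2024, 2, 2, tzinfo=timezone.utc), "page||home", 4),
]

def pivot_by_month(rows):
    """Aggregate timestamped data points into per-month totals at runtime,
    instead of storing a pre-pivoted copy per interval in the database."""
    totals = defaultdict(int)
    for ts, key, value in rows:
        totals[(ts.strftime("%Y-%m"), key)] += value
    return dict(totals)

print(pivot_by_month(rows))
# {('2024-01', 'page||home'): 17, ('2024-02', 'page||home'): 4}
```

The trade-off is exactly the one stated above: the database stores each point once, and the CPU cost of bucketing moves to graph-load time, where the data volume per request is small.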