In progress.

My approach is to flatten the data that we store in the database. Instead of dumping serialized data into p_data for each bucket/interval, we will flatten out the keys. Every key-value pair (data point) will be its own row in the database.
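
As a rough sketch of what that flattened layout could look like (using the stats_preprocessed_flat name mentioned later; the column names and types are assumptions, not the project's actual schema, and MySQL syntax is assumed):

```sql
-- Hypothetical flattened layout: one row per data point instead of one
-- serialized p_data blob per bucket/interval.
CREATE TABLE stats_preprocessed_flat (
    p_bucket   INT UNSIGNED NOT NULL,   -- which bucket the point belongs to
    p_interval INT UNSIGNED NOT NULL,   -- which interval within that bucket
    p_key      VARCHAR(255) NOT NULL,   -- flattened key path, e.g. 'posts||by_forum||3'
    p_value    BIGINT       NOT NULL,   -- the data point itself
    PRIMARY KEY (p_bucket, p_interval, p_key)
);
```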

Pros:
- Much lower memory use, since we are no longer selecting whole p_data dumps; these can easily be multiple MB each
- The flat key structure means we can select just the groups of data points we want using `LIKE 'keys||to||select||%'`, instead of loading entire dumps, running unserialize on them, and picking out the data points we need. Since this is plain SQL, we can also select keys in batches (e.g. 100 at a time) to avoid OOM (see the query sketch after the Cons list).

Cons:
- Many more rows in the database (but they will be smaller)
- Many more SQL queries involved (but that is mainly on the scheduler side; graphs won't see much of an increase, since all of the data points they need can be selected together with a single wildcard LIKE statement, as sketched below)
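
For illustration, the graph-side selection might look something like this (MySQL-flavoured, reusing the hypothetical columns from the sketch above; the bucket/interval values and the keyset-style batching are assumptions, not the project's actual queries):

```sql
-- Pull a whole group of related data points with one prefix match,
-- instead of unserializing a multi-MB p_data blob.
SELECT p_key, p_value
FROM stats_preprocessed_flat
WHERE p_bucket = 42
  AND p_interval = 7
  AND p_key LIKE 'keys||to||select||%';

-- The same selection in batches of 100 keys to avoid OOM, paginating on p_key.
SELECT p_key, p_value
FROM stats_preprocessed_flat
WHERE p_bucket = 42
  AND p_interval = 7
  AND p_key LIKE 'keys||to||select||%'
  AND p_key > ''   -- replace '' with the last p_key returned by the previous batch
ORDER BY p_key
LIMIT 100;
```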

My initial implementation was terrible. The scheduler would sometimes take over an hour to generate statistics. This was because of two problems:
  • Every data point required about 5 SQL queries
  • The initial indexing method was very slow

I am testing the following changes:
  • I put the delta table and behaviour back in
  • stats_preprocessed and stats_preprocessed_flat now use a dedicated p_id column as the PRIMARY KEY. This is a hash of all of the other columns (except p_value), which makes merging data much faster.
  • I implemented batched SQL processing. Instead of about 5 queries per data point, we now run roughly 5 queries per 100 data points. Part of this was thanks to the new p_id column (a sketch of both changes follows this list).
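
Below is a rough sketch of how the p_id column and the batched writes could fit together (MySQL-flavoured; the column names, the MD5 hashing, and the "add the new value on merge" rule are all assumptions for illustration, not the project's actual implementation):

```sql
-- p_id is a hash of every column except p_value, so the same logical data
-- point always maps to the same primary key and can be merged with an upsert.
CREATE TABLE stats_preprocessed_flat (
    p_id       CHAR(32)     NOT NULL PRIMARY KEY,  -- e.g. MD5 of the other columns
    p_bucket   INT UNSIGNED NOT NULL,
    p_interval INT UNSIGNED NOT NULL,
    p_key      VARCHAR(255) NOT NULL,
    p_value    BIGINT       NOT NULL
);

-- Batched write: one multi-row upsert per ~100 data points instead of
-- several single-row queries per data point.
INSERT INTO stats_preprocessed_flat (p_id, p_bucket, p_interval, p_key, p_value)
VALUES
    (MD5(CONCAT_WS('||', 42, 7, 'posts||total')),       42, 7, 'posts||total',       1523),
    (MD5(CONCAT_WS('||', 42, 7, 'posts||by_forum||3')), 42, 7, 'posts||by_forum||3', 87)
    -- ...up to roughly 100 rows per statement
ON DUPLICATE KEY UPDATE p_value = p_value + VALUES(p_value);
```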

Processing times have been reduced to about 4-5 minutes. I will continue to monitor the changes.