View Issue Details

IDProjectCategoryView StatusLast Update
3797Composrcore_database_driverspublic2020-02-13 20:40
ReporterChris Graham Assigned ToChris Graham  
PrioritynormalSeverityfeature 
Status resolvedResolutionfixed 
Summary3797: Cleanup queries so can_arbitrary_groupby/remove_duplicate_rows
DescriptionIn various places we use can_arbitrary_groupby and/or remove_duplicate_rows to strip out duplicated rows coming out of SQL queries.

This is because doing JOINs on tables will multiply up rows on the main table being queried, if the joined table has multiple matching records.
For example, if you are using a JOIN to check group access to a category, and a user is in multiple groups with access, you'll get as many rows back as there are groups with access.

We do it using JOINs because MySQL did not support subqueries until 4.1 in 2003. Webhosts are notorious for running old versions of MySQL, but at this point it would be shocking for a host to be running a version this old.

So we should alter the DB drives to remove the concept of can_arbitrary_groupby (which is a non-compliant MySQL hack), and assume subquery support is always there (I think this is already done in v11). Then instead of JOINs to check things like category membership or group access, we use IN or EXISTS clauses in the queries, or subqueries that select a field value (parentheses).

main_multi_content needs changes too. This currently doesn't use can_arbitrary_groupby/remove_duplicate_rows because of the complexity in this code. Instead it has it's own deduplication technique, and always reads records from offset 0, manually skipping records up until $start.
TagsRoadmap: v11
Attach Tags
Time estimation (hours)4
Sponsorship open

Sponsor

Date Added Member Amount Sponsored

Relationships

related to 3795 ResolvedChris Graham Composr website (compo.sr) Category replication and panels 
related to 3732 ResolvedChris Graham Composr Smarter DB sorting for multi-lang sites 
related to 2249 ResolvedChris Graham Composr Further metadata improvements 

Activities

Chris Graham

2019-11-17 16:39

administrator   ~6155

Also look at the export_* tasks, and try and make sure they don't have to load all rows into memory before starting streaming out a spreadsheet.

Chris Graham

2020-02-09 23:23

administrator   ~6383

Last edited: 2020-02-09 23:27

This is more complex than I thought.

In many cases we are doing JOINs that can bind 0-to-many rows because we want to get some results back from the join. So we cannot simply switch to EXISTS or IN clauses.

A better solution is probably just to be more careful about what fields we SELECT, so that we're only selecting stuff covered in the GROUP BY clause. That will only work when we don't want to SELECT something we explicitly know might be duplicated up.

Chris Graham

2020-02-09 23:25

administrator   ~6384

Last edited: 2020-02-09 23:25

MySQL error message is like:
"Expression 28 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'cms.p.id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by"

When doing "SELECT * FROM cms11_f_topics t JOIN cms11_f_posts p WHERE t.id=1 GROUP BY t.id;"

But this is fine:
"SELECT t.id FROM cms11_f_topics t JOIN cms11_f_posts p WHERE t.id=1 GROUP BY t.id;"

Guest

2020-02-10 00:45

reporter   ~6385

Unfortunately sometimes we really do want to just throw away extra matches. Here's an interesting query pattern that works...

"SELECT t.id FROM cms11_f_topics t JOIN cms11_f_posts p ON p.id=(SELECT MIN(p.id) FROM cms11_f_posts p WHERE t.id=p.p_topic_id);"

In this I am explicitly saying which of multiple rows to join to (the one with the lowest ID).

Chris Graham

2020-02-10 15:23

administrator   ~6386

For reference, I am using all the following techniques to solve this...

IN
EXISTS
() subqueries
Joins selecting on a unique ID found via a minimiser/maximiser check
GROUP BY
DISTINCT * & COUNT(DISTINCT *)

Which one to use really depends on the situation.

Chris Graham

2020-02-13 20:40

administrator   ~6393

This is now implemented. Turned out way more complex than I imagined, and took more like a week. Significant parts of the galleries code needed to be reworked, the XML DB implementation needed significant improvements to support the more sophisticated syntaxes, unit testing of that was needed, and work was needed to fix SQL compatibility on how we treat ORDER BY (anything in ORDER BY must also be selected). Generally a lot of stuff needed cleaning up, as I got into the guts of a lot of different systems and saw many issues.

The good news is some of the work towards 3354 was done in the gallery refactoring (as images and videos needed to be queried together, so it made sense to create an API for that, which will also work as an API for querying all content types at once).

Issue History

Date Modified Username Field Change
2019-04-03 19:57 Chris Graham New Issue
2019-04-03 19:58 Chris Graham Relationship added related to 3795
2019-06-27 19:00 Chris Graham Tag Attached: Roadmap: v11
2019-11-17 16:39 Chris Graham Note Added: 0006155
2019-11-26 22:03 Chris Graham Relationship added related to 3732
2020-02-09 23:23 Chris Graham Note Added: 0006383
2020-02-09 23:25 Chris Graham Note Added: 0006384
2020-02-09 23:25 Chris Graham Note Edited: 0006384
2020-02-09 23:27 Chris Graham Note Edited: 0006383
2020-02-10 00:45 Guest Note Added: 0006385
2020-02-10 14:22 Chris Graham Description Updated
2020-02-10 15:23 Chris Graham Note Added: 0006386
2020-02-12 03:06 Chris Graham Relationship added related to 2249
2020-02-13 20:40 Chris Graham Assigned To => Chris Graham
2020-02-13 20:40 Chris Graham Status Not Assigned => Resolved
2020-02-13 20:40 Chris Graham Resolution open => fixed
2020-02-13 20:40 Chris Graham Note Added: 0006393