We don't actually need to use a UUID (aka GUID) per se. I've done a deep dive into unique identifier formats.
Features commonly available in different formats:
1) Sortable by time to establish (potentially) sub-millisecond order
2) Massive pool of possible values to reduce chance of collision at a global scale (e.g. for a social network, or otherwise large networks of machines communicating with each other across a common ID domain)
3) Identification of machine creating ID
4) Secure booking of IDs by a ticketing server if we do not trust nodes to generate appropriate IDs on their own
5) Speed of generation
6) Efficient insertion into B-tree indexes, always at end of latest record block
7) Efficient retrieval of recent records, always in the same few record block(s)
8) Hard to guess what new records may be being created.
Most of these do not apply to Composr:
1) Ordering by the add_date columns should be fine for us (millisecond precision).
2) As we are scoped to a single site (and not a massive population-scale social networking site) we do not need a massive identifier pool.
3) We do not need this particularly, although might be a nice-to-have.
4) Over-complex, we can reasonably trust any server in a Composr cloud.
5) While this may have been an issue when UUIDs were first specified in the 80s, I don't think it's an issue nowadays.
6/7) This is important. We need the sequence to be ROUGHLY sequential.
8) Nice to have.
If we can fit into 64 bits we can just do as an integer on a 64 bit machine, which makes things far easier for us, and far more efficient (no storing long strings).
We can even avoid making changes to our database layer or remove auto_increment from the schema. We could simply do a wrap around query_insert calls to retrofit the existing code to create the IDs...
create_unique_db_id could be a null-op unless we enable a new config option.
In terms of the actual format:
Bit 1: Always 1, as we always want positive numbers
Bits 2-33 (32 bits): Seconds since 2022 (good for 2^32/60/60/24/365=136 years)
Bits 34-64 (31 bits): Random component (good for 2^31=2147483648=2.1 billion unique values within any second, making collisions unlikely)
Working out the exact chance of a collision after repeated trials using https://www.statology.org/binomial-distribution-calculator/
p, Probability of failure on a given trial: 1/2147483648=0.00000000046566129
n, Number of trials: 1000000 (1 million)
k, Number of failures: 1
P(X>=1) = 0.00047 = 0.047%
i.e. After 1 million IDs generated within a 1-second time window of another ID, there's only 0.047% of a collision.
If we put in numbers for 1 billion trials, there's a 37% chance of a collision.
I think that's about reasonable, failing once per billion records.
It's also worth noting we could consider making the GUIDs mandatory, and then massively simplify the cms_merge importer - which would help a lot with long term maintenance.
One thing to consider is that we will need to make sure all DB Drivers return the 'id' field for insert queries even if it is explicitly set (i.e. not coming from auto-increment). Currently this is not consistent, and I know for sure the XML DB driver does not.
Features commonly available in different formats:
1) Sortable by time to establish (potentially) sub-millisecond order
2) Massive pool of possible values to reduce chance of collision at a global scale (e.g. for a social network, or otherwise large networks of machines communicating with each other across a common ID domain)
3) Identification of machine creating ID
4) Secure booking of IDs by a ticketing server if we do not trust nodes to generate appropriate IDs on their own
5) Speed of generation
6) Efficient insertion into B-tree indexes, always at end of latest record block
7) Efficient retrieval of recent records, always in the same few record block(s)
8) Hard to guess what new records may be being created.
Most of these do not apply to Composr:
1) Ordering by the add_date columns should be fine for us (millisecond precision).
2) As we are scoped to a single site (and not a massive population-scale social networking site) we do not need a massive identifier pool.
3) We do not need this particularly, although might be a nice-to-have.
4) Over-complex, we can reasonably trust any server in a Composr cloud.
5) While this may have been an issue when UUIDs were first specified in the 80s, I don't think it's an issue nowadays.
6/7) This is important. We need the sequence to be ROUGHLY sequential.
8) Nice to have.
If we can fit into 64 bits we can just do as an integer on a 64 bit machine, which makes things far easier for us, and far more efficient (no storing long strings).
We can even avoid making changes to our database layer or remove auto_increment from the schema. We could simply do a wrap around query_insert calls to retrofit the existing code to create the IDs...
$GLOBALS['SITE_DB']->query_insert('table', create_unique_db_id(['a' => 123]));
create_unique_db_id could be a null-op unless we enable a new config option.
In terms of the actual format:
Bit 1: Always 1, as we always want positive numbers
Bits 2-33 (32 bits): Seconds since 2022 (good for 2^32/60/60/24/365=136 years)
Bits 34-64 (31 bits): Random component (good for 2^31=2147483648=2.1 billion unique values within any second, making collisions unlikely)
Working out the exact chance of a collision after repeated trials using https://www.statology.org/binomial-distribution-calculator/
p, Probability of failure on a given trial: 1/2147483648=0.00000000046566129
n, Number of trials: 1000000 (1 million)
k, Number of failures: 1
P(X>=1) = 0.00047 = 0.047%
i.e. After 1 million IDs generated within a 1-second time window of another ID, there's only 0.047% of a collision.
If we put in numbers for 1 billion trials, there's a 37% chance of a collision.
I think that's about reasonable, failing once per billion records.
References:
https://en.wikipedia.org/wiki/Universally_unique_identifier
https://medium.com/geekculture/the-wild-world-of-unique-identifiers-uuid-ulid-etc-17cfb2a38fce
https://segment.com/blog/a-brief-history-of-the-uuid/
https://www.ietf.org/id/draft-peabody-dispatch-new-uuid-format-04.html
https://stackoverflow.com/questions/56611375/innodb-clustered-index-performance-when-using-random-values-as-primary-key
Add we should add an automated test to ensure create_unique_db_id is used whenever necessary, or perhaps a dev-mode check.