feeds justification
- 3 minutes read - 468 wordsI realized I’ve left out a major part from my sequence of previous feed-related posts: a justification for why we should bother with a separate feed_sync_id
.
So let’s give it a shot!
The fundamental problem is:
AUTO_INCREMENT
ids are assigned in insertion order, but become visible to other threads in commit order.
To see how this causes a problem, consider the interactions and visibilities between three transactions to the same database:
t0: TRX0: BEGIN; INSERT INTO kv (ns, k, v) VALUES ("-", "k0", "v0"); COMMIT;
t1: TRX1: BEGIN; INSERT INTO kv (ns, k, v) VALUES ("-", "k1", "v1");
t2: TRX2: BEGIN; INSERT INTO kv (ns, k, v) VALUES ("-", "k2", "v2");
t3: TRX0: SELECT MAX(id) FROM kv;
t4: TRX2: COMMIT;
t5: TRX0: SELECT MAX(id) FROM kv;
t6: TRX1: COMMIT;
t7: TRX0: SELECT MAX(id) FROM kv;
Here, we have two transactions that both insert a new kv
record.
The database has to assign an id
value to each of those records, because we might be creating other associations to those records in our application code.
But other threads – TRX0
in this case – shouldn’t be able to see those records until we COMMIT
, and so indeed the SELECT
at t=t3
might return 1
.
Next, we commit TRX2
.
This was the third insert the database saw, so we expect MAX(id)
to be 3
, and indeed, it is, as verified at t5
.
Finally, we commit TRX1
.
Consumers will now be able to see an id=2
record – hooray!
Except: if we applied a feed-style consumption based on id
instead of feed_sync_id
, we have introduced the possibility of skipping records.
To see how this happens, consider a feed-style fetch w/ id
at t=t5
above.
We will run a query like
SELECT *
FROM kv
WHERE id > :cursor
ORDER BY id
LIMIT :limit
and fetch
KV(id=1, ns=-, k=k0, v=v0)
KV(id=3, ns=-, k=k2, v=v2)
Note that there is no KV(id=2)
record committed yet, so this response is correct!
The problem comes from subsequent fetches: In particular, we will record 3
as the cursor to use for the next fetch.
This will prevent the subsequent fetch (with cursor:=3
) from ever observing the KV(id=2)
record!
This is pretty bad, and that’s why we pay the penalty of storing a separately assigned feed_sync_id
column.
schema
After all the updates, I figure it’d be helpful to give a reference for the current schema we’ve built up over all the blog posts mentioned above:
CREATE TABLE `kv` (
`id` bigint(22) NOT NULL AUTO_INCREMENT,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`feed_sync_id` bigint(22) DEFAULT NULL,
`shard` int(11) DEFAULT '0',
`ns` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
`k` varchar(255) COLLATE utf8_bin NOT NULL,
`v` longblob NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `u_ns_k` (`ns`,`k`),
UNIQUE KEY `u_fsi` (`feed_sync_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;