feeds justification

June 11, 2019 - 3 minutes read - 468 words

I realized I’ve left out a major part from my sequence of previous feed-related posts: a justification for why we should bother with a separate feed_sync_id. So let’s give it a shot! The fundamental problem is:

AUTO_INCREMENT ids are assigned in insertion order, but become visible to other threads in commit order.

To see how this causes a problem, consider the interactions and visibilities between three transactions to the same database:

t0: TRX0: BEGIN; INSERT INTO kv (ns, k, v) VALUES ("-", "k0", "v0"); COMMIT;
t1: TRX1: BEGIN; INSERT INTO kv (ns, k, v) VALUES ("-", "k1", "v1");
t2: TRX2: BEGIN; INSERT INTO kv (ns, k, v) VALUES ("-", "k2", "v2");
t3: TRX0: SELECT MAX(id) FROM kv;
t4: TRX2: COMMIT;
t5: TRX0: SELECT MAX(id) FROM kv;
t6: TRX1: COMMIT;
t7: TRX0: SELECT MAX(id) FROM kv;

Here, we have two transactions that both insert a new kv record. The database has to assign an id value to each of those records, because we might be creating other associations to those records in our application code. But other threads – TRX0 in this case – shouldn’t be able to see those records until we COMMIT, and so indeed the SELECT at t=t3 might return 1.

Next, we commit TRX2. This was the third insert the database saw, so we expect MAX(id) to be 3, and indeed, it is, as verified at t5.

Finally, we commit TRX1. Consumers will now be able to see an id=2 record – hooray!

Except: if we applied a feed-style consumption based on id instead of feed_sync_id, we have introduced the possibility of skipping records. To see how this happens, consider a feed-style fetch w/ id at t=t5 above. We will run a query like

SELECT *
FROM kv
WHERE id > :cursor
ORDER BY id
LIMIT :limit

and fetch

KV(id=1, ns=-, k=k0, v=v0)
KV(id=3, ns=-, k=k2, v=v2)

Note that there is no KV(id=2) record committed yet, so this response is correct! The problem comes from subsequent fetches: In particular, we will record 3 as the cursor to use for the next fetch. This will prevent the subsequent fetch (with cursor:=3) from ever observing the KV(id=2) record!

This is pretty bad, and that’s why we pay the penalty of storing a separately assigned feed_sync_id column.

schema

After all the updates, I figure it’d be helpful to give a reference for the current schema we’ve built up over all the blog posts mentioned above:

CREATE TABLE `kv` (
   `id` bigint(22) NOT NULL AUTO_INCREMENT,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
   `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
   `feed_sync_id` bigint(22) DEFAULT NULL,
   `shard` int(11) DEFAULT '0',
   `ns` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
   `k` varchar(255) COLLATE utf8_bin NOT NULL,
   `v` longblob NOT NULL,
   PRIMARY KEY (`id`),
   UNIQUE KEY `u_ns_k` (`ns`,`k`),
   UNIQUE KEY `u_fsi` (`feed_sync_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Apr 2019 absolutely minimal OLTP to OLAP pipeline
Jan 2019 feed sequences
Oct 2018 sharded feeds
Jun 2018 mysql feeds
Jan 2019 watching progress in mysql

schema

Related posts