Transacter and Intent/Result
At SQ, we had a family of `Transacter` interfaces:
```java
public interface Transacter {
  void transaction(Consumer<Session> work);

  interface Session {
    DSLContext dsl();
  }
}
```
where `DSLContext` is a jOOQ concept & is the handle for doing work against the database.
Then this would be used in some class like
```java
public class KvDao {
  @Inject Transacter transacter;

  void put(String ns, String k, String v) {
    transacter.transaction(session -> {
      KvRecord record = new KvRecord(ns, k, v);
      session.dsl().insertInto(KV).set(record).execute();
    });
  }
}
```
The parameters `ns`, `k`, and `v` are shorthand for `namespace`, `key`, and `value`, respectively.
We’ll show how these might be used together shortly!
`Transacter` was essentially the only way we ever interacted with the underlying database.
Implementations of this interface handled a few things:
- getting a connection from the connection pool, if the current thread didn’t already have one
- retrying optimistic lock exceptions with some bounded, nondeterministic retry interval
- perhaps most importantly: managing `BEGIN` and `COMMIT`/`ROLLBACK` commands, or equivalently: `Transacter` manages transactions (sketched below)
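What might an implementation look like? Here’s a minimal sketch, assuming JDBC & jOOQ. `JdbcTransacter` and its details (the dialect, the exception handling) are invented for illustration, and the connection-pool reuse & optimistic-lock retry behaviors from the list above are omitted:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.function.Consumer;
import javax.sql.DataSource;
import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

public class JdbcTransacter implements Transacter {
  private final DataSource dataSource;

  public JdbcTransacter(DataSource dataSource) {
    this.dataSource = dataSource;
  }

  @Override
  public void transaction(Consumer<Session> work) {
    try (Connection conn = dataSource.getConnection()) {
      conn.setAutoCommit(false); // statements now run inside one transaction (BEGIN)
      try {
        DSLContext dsl = DSL.using(conn, SQLDialect.MYSQL); // dialect chosen arbitrarily
        work.accept(() -> dsl); // Session has a single method, so a lambda suffices
        conn.commit();          // COMMIT on success
      } catch (RuntimeException e) {
        conn.rollback();        // ROLLBACK on any failure, then rethrow
        throw e;
      }
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }
}
```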
situation: chained calls, mixed data sources
One place this could break down pretty easily was when database updates and API calls (e.g. to third party services, or other internal microservices, or whatever) were interleaved:
```java
public class KvDao {
  @Inject Transacter transacter;
  @Inject Gson gson;

  void complexPut(String k, String v) {
    transacter.transaction(session -> {
      KvRecord r0 = new KvRecord("intent", k, v);
      session.dsl().insertInto(KV).set(r0).execute();

      // doSomeApiCall stands in for any external call: a third-party
      // service, another internal microservice, or whatever.
      Object result = doSomeApiCall(v);

      KvRecord r1 = new KvRecord("result", k, gson.toJson(result));
      session.dsl().insertInto(KV).set(r1).execute();
    });
  }
}
```
We’re using the `ns=intent` and `ns=result` records to persist a representation of the data we need to make the call & what the call returns, respectively.
sidebar: intents & results
Why even bother writing two records above? Why not a single `kv` record with `intent` and `result` subfields, or a better data model?
The guiding principle here is: “Write a new record whenever new information becomes known.”
Without abiding by this principle, we can’t know for sure when we discovered a new bit of information, short of a detailed analysis of the data schema and the code that interacts with it.
Furthermore, once we get in the business of updating records in-place, we’ll probably eventually end up losing some information.
Often this manifests with a data model (call it `foos`) that has some `state` field – perhaps `INITIAL` transitions to `IN_PROGRESS` before `COMPLETE`; as we progress:
```sql
-- t=t1
INSERT INTO foos (id, ..., state) VALUES ('fredbob', ..., 'INITIAL');
-- t=t2
UPDATE foos SET state = 'IN_PROGRESS' WHERE id = 'fredbob';
-- t=t3
UPDATE foos SET state = 'COMPLETE' WHERE id = 'fredbob';
-- t=t4
```
Even assuming appropriately defined `foos.{created_at,updated_at}` fields, we can’t know what time we transitioned to `IN_PROGRESS`, or even whether that happened!
This presents problems for analyzing data after-the-fact & massively complicates training machine learning models on this data, as they critically depend on a precise understanding of what happened & when to avoid time leakage.
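Contrast that with an append-only shape, where each transition gets its own row. A sketch, using a hypothetical `foo_events` table (not a schema from the original system):

```sql
-- t=t1
INSERT INTO foo_events (foo_id, state, created_at) VALUES ('fredbob', 'INITIAL', NOW());
-- t=t2
INSERT INTO foo_events (foo_id, state, created_at) VALUES ('fredbob', 'IN_PROGRESS', NOW());
-- t=t3
INSERT INTO foo_events (foo_id, state, created_at) VALUES ('fredbob', 'COMPLETE', NOW());
```

Every transition survives with its own timestamp, so “when did we enter `IN_PROGRESS`, and did we at all?” becomes a simple query.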
Furthermore: frequently, the “intent” information is also precisely the information needed to resume execution when `doSomeApiCall()` recovers.
Especially when that data is expensive to recompute (or perhaps even no longer available!), a saved `intent` is a compelling & simple alternative.
problem 1: exceptions => rollbacks
Suppose we can reach the database but find ourselves partitioned away from `doSomeApiCall()` & the method call throws an exception. What happens then?
Well, remember, we automatically started a transaction & inserted the first record in that context.
However, the exception triggers a `ROLLBACK` command to be sent to the database, so we never commit the `ns=intent` record!
This is the correct behavior, but it was a major surprise for me to wrap my head around – everything I’d ever played with apparently had autocommit turned on, or I’d simply missed it. (Probably this behavior wouldn’t have been a surprise if I’d taken a database fundamentals class, but I really struggled with this.)
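In terms of what the database actually sees, the failing `complexPut` reduces to something like the following (a sketch; the literal statements jOOQ emits will differ):

```sql
BEGIN;
INSERT INTO kv (ns, k, v) VALUES ('intent', 'some-key', 'some-value');
-- doSomeApiCall() throws here...
ROLLBACK;  -- ...so the intent row is discarded, never visible to any other connection
```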
problem 2: unbounded delays
It turns out the exception case might not even be the worst problem here: what happens if `doSomeApiCall()` simply never terminates, or takes a very long time?
I’d argue that this is even worse, because now, in addition to other connections not being able to see the `ns=intent` record, we’re also holding a transaction open for an arbitrarily long time.
In a write-ahead log (WAL)-based system – which is pretty much anything with meaningful ACID guarantees – we may need to iterate over the WAL for any database interaction.
In a high-traffic system without [wastefully] significant headroom, this can quickly become catastrophic, as the extra work incurred can easily saturate an IO or CPU quota or limit, rendering the database server non-responsive!
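This failure mode is at least easy to observe. In Postgres, for example, long-open transactions show up directly in `pg_stat_activity` (MySQL has an analogue in `information_schema.innodb_trx`):

```sql
-- Connections currently holding a transaction open, oldest first.
SELECT pid, now() - xact_start AS open_for, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY open_for DESC;
```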
the fix: separate transactions
How do we fix it, assuming we want to store `intent` regardless of whether the API call succeeds or fails?
We need to split the `intent` vs `result` `INSERT`s into two separate transactions:
```java
public class KvDao {
  @Inject Transacter transacter;
  @Inject Gson gson;

  void complexPut(String k, String v) {
    // STEP I: persist the intent in its own transaction.
    transacter.transaction(session -> {
      KvRecord r0 = new KvRecord("intent", k, v);
      session.dsl().insertInto(KV).set(r0).execute();
    });

    // STEP II: make the external call outside any transaction.
    Object result = doSomeApiCall(v);

    // STEP III: persist the result in a second transaction.
    transacter.transaction(session -> {
      KvRecord r1 = new KvRecord("result", k, gson.toJson(result));
      session.dsl().insertInto(KV).set(r1).execute();
    });
  }
}
```
failure modes
The failure modes are now:
- step I fails, nothing gets written, and the entire operation can be retried.
- step II fails, but we’ve saved the `intent`, so we can resume this operation at a later time (a recovery sketch follows below).
- step III fails, and we’re probably going to have a duplicate `doSomeApiCall(v)` call.
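For the step II failure, “resume this operation at a later time” might look like a periodic job that re-drives orphaned intents. A minimal sketch, where `findOrphanedIntents()` is a hypothetical query for `ns=intent` rows with no matching `ns=result` row, and the `KvRecord` getters are assumed:

```java
public class KvRecoveryJob {
  @Inject Transacter transacter;
  @Inject Gson gson;

  // Re-drive steps II & III for every saved intent that never got a result.
  // findOrphanedIntents() (hypothetical) selects kv rows with ns='intent'
  // whose key has no corresponding ns='result' row.
  void recover() {
    for (KvRecord intent : findOrphanedIntents()) {
      Object result = doSomeApiCall(intent.getV()); // step II, retried
      transacter.transaction(session -> {           // step III
        KvRecord r1 = new KvRecord("result", intent.getK(), gson.toJson(result));
        session.dsl().insertInto(KV).set(r1).execute();
      });
    }
  }
}
```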
The final case here is most interesting, so let’s dig in a bit more.
First off, if at all possible, one should probably avoid having step III within this code path entirely, and instead rely on `doSomeApiCall` to (eventually) notify your application that it happened, and do the step III work there.
But not all APIs support webhooks & in that case, we’re probably going to end up retrying step II. Many (good) APIs accept an idempotence token to detect this situation, permitting the API to deduplicate those events and no-op the second time.

We might also be able to do some list/search/find operation to check that the API isn’t already aware of the information/events in question, but this is somewhat of a last resort, as race conditions are still possible. (These can also be mitigated with some kind of lease mechanism to enforce a single owner of that record until some timeout has elapsed & we’re confident there isn’t another in-flight request… but this is a lot of finicky machinery & overhead compared to the API deduplicating on a unique idempotence token.)
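Concretely, step II with an idempotence token might look something like this sketch; the endpoint, header name, & token scheme are assumptions about a hypothetical API, not any particular service:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ApiCaller {
  private final HttpClient client = HttpClient.newHttpClient();

  String doSomeApiCall(String k, String v) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.example.com/things")) // hypothetical endpoint
        // Derive the token from our own key: a retried step II sends the same
        // token, so the provider can deduplicate & no-op the second call.
        .header("Idempotency-Key", "complexPut-" + k)
        .POST(HttpRequest.BodyPublishers.ofString(v))
        .build();
    return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
  }
}
```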
If all of that fails, we might actually end up with multiple API calls in flight. This is not great, but it tends to arise in primitive/early-stage systems or those not overly sensitive to duplicates.