Evolution of a Get Endpoint
My work project started out pretty simple: there was a GetXyz endpoint that just looked up the xyz record in the database by a unique key & returned it.
How complicated could it be?
It was a straightforward generalization of the old GetAbc functionality, and it was really only used by oncall engineers through an admin console, so it shouldn’t have been too big of a deal.
Ok, but then it outgrew its original datastore, so a separate service had to be created.
We thought about migrating clients, but at the time it seemed faster to just implement once on our side & have clients continue calling us, essentially encapsulating the separate service as an implementation detail.
But as part of the traffic swing, we figured it should support either the local or the remote read path.
All fine and good, but then if a record is missing, is it missing from the local datastore or the remote datastore? We could of course check with a raw database query, but that’s difficult to script & a bit harder to share in a Slack conversation or runbook, so we introduced a GetDbXyz endpoint that only looked in the local datastore.
The GetDbXyz API, meaning “look in the local datastore,” was implemented in both the original service and the new service; the original GetXyz API could then call either service’s GetDbXyz depending on whether it wanted a “local” or a “remote” lookup.
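To make that concrete, here’s a rough sketch of the shape of the thing. This is hypothetical Go with made-up types and names (XyzRecord, XyzService, and so on), and the “try local first, then fall back to remote” ordering is my assumption for illustration, not necessarily what the real service did:

```go
package xyz

import (
	"context"
	"errors"
)

// Hypothetical stand-ins for the real record type and not-found error.
type XyzRecord struct {
	Key  string
	Data []byte
}

var ErrNotFound = errors.New("xyz: not found")

// XyzService is implemented by both the original service and the new
// dedicated one; GetDbXyz means "look only in your own datastore".
type XyzService interface {
	GetDbXyz(ctx context.Context, key string) (*XyzRecord, error)
}

// Server is the original service. GetXyz keeps the old API surface, so
// clients never need to know a second service exists.
type Server struct {
	local  XyzService // backed by our own database
	remote XyzService // RPC client for the dedicated service
}

// GetXyz supports both read paths during the traffic swing: try the local
// datastore, and if the record isn't there, ask the remote service.
func (s *Server) GetXyz(ctx context.Context, key string) (*XyzRecord, error) {
	rec, err := s.local.GetDbXyz(ctx, key)
	if err == nil {
		return rec, nil
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, err
	}
	return s.remote.GetDbXyz(ctx, key)
}
```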
The next problem was that even the remote datastore in the dedicated service could only reasonably handle around a week’s worth of xyz records, but it was useful to have 90-180 days of xyz records available for oncall analysis. So we consumed the feed of xyz records into a high-capacity KV store (think DynamoDB or BigTable). Since the “remote or local read path in the endpoint” trick had worked so well in the past, we implemented the KV lookup as another read path in our GetXyz endpoint.
And, in fact, we implemented it as the first path, because we figured most lookups would be in the 7d-180d window, so we might as well go straight to the KV store.
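Continuing the hypothetical sketch from above (same caveats: invented names, not the real code), the revised GetXyz becomes an ordered chain of read paths with the KV store tried first:

```go
// kvGet is a stand-in for looking a record up in the high-capacity KV
// store that the xyz feed is consumed into.
func (s *Server) kvGet(ctx context.Context, key string) (*XyzRecord, error) {
	// ... real KV client call elided ...
	return nil, ErrNotFound
}

// GetXyz, revised: the KV path goes first on the theory that most lookups
// land in the 7d-180d window, then the remote service, then the local DB.
func (s *Server) GetXyz(ctx context.Context, key string) (*XyzRecord, error) {
	paths := []func(context.Context, string) (*XyzRecord, error){
		s.kvGet,
		s.remote.GetDbXyz,
		s.local.GetDbXyz,
	}
	for _, lookup := range paths {
		rec, err := lookup(ctx, key)
		if err == nil {
			return rec, nil
		}
		if !errors.Is(err, ErrNotFound) {
			return nil, err
		}
		// Not found on this path; fall through to the next one.
	}
	return nil, ErrNotFound
}
```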
A couple of problems became apparent here. First, we needed the same kind of dedicated endpoint to debug whether a record was in the “remote” datastore or the long-term “kv” store, & to decode what we stored there. So we introduced a GetKvXyz endpoint.
Second: basically by chance – maybe cloud interconnectivity was slower than local services in the same rack, or maybe some bug in initialization or connection management, who knows – the “kv lookup” path was actually slower than the “remote” path. And from time to time the remote -> kv consumption fell behind a bit – bad sharding strategies, library upgrade problems, cloud proxy connectivity issues, all sorts of strangeness.
Because our GetXyz endpoint tried the kv path first & that path was proving to be the least reliable, one of our customers actually switched over to the GetDbXyz endpoint for the near-real-time lookups their application needed.
Anyway, all of this stuff ran for years with very little modification – occasionally a reshard to support ever-increasing data volumes, but otherwise pretty low maintenance.
We eventually removed the local-style lookups & the associated endpoint.
Until one day, almost 7 years after the original framework was laid out, we got a user complaining about 404s when looking up their records. But we had every indication that those records had existed – a client must have received them, and we had other downstream systems indicating they did. What could have gone wrong?
Well, it turns out that to cut costs on the kv storage path, we had done an optimization: instead of storing the entire list of user-facing identifiers along with the internal database record identifiers backing them, we stripped out the user-facing identifiers so we could fill them back in later from the internal records. All well and good, except that sometimes the list had “synthetic” entries with no internal database records behind them. But we didn’t realize that before we shipped the feature that generated the synthetic entries.
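For flavor, here’s roughly the shape of that optimization as another hypothetical sketch – kvEntry, lookupInternal, and friends are invented for illustration, not the actual schema:

```go
package xyz

import "context"

// Hypothetical shapes for what the KV path stored per list item.
type kvEntry struct {
	UserFacingID string // the identifier customers look things up by
	InternalID   string // backing database record; empty for "synthetic" entries
}

type internalRecord struct {
	ID           string
	UserFacingID string
}

// lookupInternal stands in for fetching the backing database record.
func lookupInternal(ctx context.Context, id string) (*internalRecord, error) {
	// ... real lookup elided ...
	return &internalRecord{ID: id}, nil
}

// compressForKv is the cost optimization: strip the user-facing IDs before
// writing, on the theory that they can be filled back in at read time.
func compressForKv(entries []kvEntry) []kvEntry {
	out := make([]kvEntry, len(entries))
	for i, e := range entries {
		out[i] = kvEntry{InternalID: e.InternalID}
	}
	return out
}

// rehydrate fills the user-facing IDs back in at read time.
func rehydrate(ctx context.Context, entries []kvEntry) ([]kvEntry, error) {
	for i, e := range entries {
		if e.InternalID == "" {
			// A synthetic entry: there is no internal record to consult,
			// so the user-facing ID stripped at write time is simply gone;
			// that is what eventually surfaced as 404s.
			continue
		}
		rec, err := lookupInternal(ctx, e.InternalID)
		if err != nil {
			return nil, err
		}
		entries[i].UserFacingID = rec.UserFacingID
	}
	return entries, nil
}
```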
It sorta makes sense: we would only see the bug if we were doing a debugging task on an Xyz that was more than 7d old but less than 6mo old, and it used the synthetic-entry-generating feature. Usually oncall/debugging investigations happen long before the 7d database truncation window, and the synthetic-entry-generating feature just wasn’t in super common use for ~4 years.
Point of clarification: I implemented the storage optimization and the synthetic-entry-generating feature, so I should have realized the interaction. But alas, it was early COVID-19 times, we all thought we were gonna die of acute toilet paper deficiency, and then I tried my hand at being a manager, and then I ended up taking a break from this company for a couple of years. I’m not gonna beat myself up over it too much I guess.
Anyway, that’s how you end up with 3 Get endpoints for the same data message!