# constraint violations in tests

I had some test code that looked something like:

@Test public void a() {
    db.insert(0, "a");
    assertThat(db.query(1).getKey()).isEqualTo("a");
}

@Test public void b() {
    db.insert(0, "a");
    db.insert(1, "b");
    assertThat(db.query(2).getKey()).isEqualTo("b");
}

@Test public void c() {
    assertThat(db.query(3)).isNull();
}


This had some nice properties:

1. pretty straightforward which test is touching which records
2. pretty clear what you expect to see in the database

There's a bunch of coincidences here that make this work, though:

• some JUnit magic that clears the database before running each test
• no other tests running concurrently
• the database starting out empty, so the auto-assigned ids are predictable

But at some point it stopped working, and I got constraint violation errors about a duplicate id=1.

The 1/2/3 values are accidental complexity -- I have to assign them in my test code, even though that's normally the database's job! The first two tests would be easy to fix with something like:

long id = db.insertReturningId("a");
assertThat(db.query(id).getKey()).isEqualTo("a");


but what about the third? One approach is to just do a bunch of queries:

long id = 1;
while (db.query(id) != null) {
    id++;
}
assertThat(db.query(id)).isNull();


but that test always passes, so is it a useful test?

Some other ideas for that third test:

• introduce a new primary key, like a UUID?
• query a random id?
• use -1?
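A sketch of a couple of those ideas, with a hypothetical in-memory stand-in for `db` (the real schema isn't shown here, so `FakeDb` and its methods are placeholders):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class AbsentRowSketch {
    // Hypothetical stand-in for the real data access layer: the "database"
    // assigns ids itself, like insertReturningId above.
    static class FakeDb {
        private final Map<Long, String> rows = new HashMap<>();
        private long nextId = 1;

        long insertReturningId(String key) {
            rows.put(nextId, key);
            return nextId++;
        }

        String query(long id) {
            return rows.get(id);
        }
    }

    public static void main(String[] args) {
        FakeDb db = new FakeDb();
        long id = db.insertReturningId("a");

        // Probe past the highest id the database has handed out: with a
        // sequence-assigned key, id + 1 can't exist yet.
        assert db.query(id + 1) == null;

        // The random-id variant: overwhelmingly likely to be absent, but it
        // can (rarely) collide with a real row, which makes the test flaky.
        long randomId = ThreadLocalRandom.current().nextLong(1_000_000, Long.MAX_VALUE);
        System.out.println("query(" + randomId + ") = " + db.query(randomId));
    }
}
```

The probe-past-the-last-id trick only works if ids really are assigned by an increasing sequence, which is itself one of the coincidences above.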

# multiple thread test cases in java

Some of my day job lately has been trying to get a better handle on threads in Java. In particular, we had a really weird race condition resulting from multiple threads processing an input data stream and causing some unexpected conflicts in our data access layer.

To try to replicate the problem in tests, I started with something like this:
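A minimal sketch of both pieces (class names, the pool size, and the delays here are placeholders):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SleepyRunnableDemo {
    // A Runnable that sleeps a pseudo-random amount, then prints that it finished.
    static class SleepyRunnable implements Runnable {
        private final String name;

        SleepyRunnable(String name) {
            this.name = name;
        }

        @Override
        public void run() {
            try {
                Thread.sleep((long) (Math.random() * 100));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println(name + " finished");
        }
    }

    // Submit n runnables to a pool and wait for them all to finish.
    static boolean runAll(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (int i = 0; i < n; i++) {
            pool.submit(new SleepyRunnable("r" + i));
        }
        pool.shutdown();
        try {
            return pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        if (runAll(3)) {
            System.out.println("Completely finished");
        }
    }
}
```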

This is just a Runnable that waits a while, and then prints that it finished. Submitting three of them to a thread pool and waiting for them all to complete outputs

r2 finished
r0 finished
r1 finished
Completely finished


as we'd probably expect.

This isn't a very sophisticated use of threads, but it was actually already enough to reproduce our concurrency bug pretty reliably in a local development environment!

# big-ish personal storage

I was curious what it would cost to have something like 5-30TB of storage space on hand. The obvious choice is between buying some hard drives and paying for an S3 bucket or something.

The amazon costs are pretty straightforward: you pick a region, they tell you the cost. Starting with their pricing table:

aws = [
    {"level": "standard", "c_millidollars": 23},
    {"level": "infrequent", "c_millidollars": 12.5},
    {"level": "one_zone", "c_millidollars": 10},
    {"level": "glacier", "c_millidollars": 4},
]


we can say

gbs = 12000
for level in aws:
    dollars_per_gb = level["c_millidollars"] / 1000
    dollars = gbs * dollars_per_gb
    print("{} ==> ${} * {} =${}/month".format(
        level["level"], dollars_per_gb, gbs, dollars
    ))


and get back

standard ==> $0.023 * 12000 =$276.0/month
infrequent ==> $0.0125 * 12000 =$150.0/month
one_zone ==> $0.01 * 12000 =$120.0/month
glacier ==> $0.004 * 12000 =$48.0/month


Figuring out the hard drive cost, I started from some raw data:

raw = [
    {"slug": "toshiba_x300_4tb", "c": 11299, "tb": 4, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B013JPKUU2"},
    {"slug": "toshiba_x300_5tb", "c": 12999, "tb": 5, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B013JPLKQK"},
    {"slug": "toshiba_x300_6tb", "c": 17505, "tb": 6, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B013JPLKJC"},
    {"slug": "toshiba_x300_8tb", "c": 23973, "tb": 8, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B074BTZ2YJ"},
    {"slug": "seagate_barracuda_4tb", "c": 9399, "tb": 4, "link": "https://www.amazon.com/Seagate-BarraCuda-3-5-Inch-Internal-ST4000DM004/dp/B071WLPRHN"},
]
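Assuming the "c" field is the price in cents, each drive's cost per terabyte works out like:

```python
# (slug, price in cents, capacity in TB) from the raw data above
raw_drives = [
    ("toshiba_x300_4tb", 11299, 4),
    ("toshiba_x300_5tb", 12999, 5),
    ("toshiba_x300_6tb", 17505, 6),
    ("toshiba_x300_8tb", 23973, 8),
    ("seagate_barracuda_4tb", 9399, 4),
]

for slug, cents, tb in raw_drives:
    dollars_per_tb = cents / 100.0 / tb
    print("{:<22} ${:.2f}/TB".format(slug, dollars_per_tb))
```

The Seagate comes out cheapest per terabyte by a good margin.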


This gives us a figure of $29.38/TB for hard drives. Just to blow up our hard drive costs a bit, we can assume some drive-level redundancy. Maybe we want to use ZFS raidz2 (but maybe we don't); one alternative is hardware or maybe even software RAID. Let's assume we want to do 5 disks with 2 parity (something like RAID6). Just sticking with 4TB drives for a sec, we'd be buying 20TB of raw disk for 12TB of usable capacity. Using our "average-priced" drives, that'd be $587.70 in upfront cost for hard drives.

Interestingly, there are some ongoing costs for this option, too: the electricity for the computer the drives plug into. Assuming a 200-watt machine and San Francisco electricity prices ($0.201/kWh), we'd be looking at

1 month @ 200 watts = 200 watts * 730 hours
                    = 146,000 watt-hours
                    = 146 kWh
                    ==> $29.35/month
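A rough breakeven sketch, pitting the upfront drive cost plus monthly electricity against the S3 tiers from earlier (numbers copied from the calculations above):

```python
# Rough breakeven: upfront drive cost + monthly electricity vs. S3 monthly cost.
upfront_drives = 587.70              # 5x 4TB drives, RAID6-style, 12TB usable
electricity_per_month = 146 * 0.201  # 146 kWh/month at SF prices

s3_monthly = {"standard": 276.0, "infrequent": 150.0, "one_zone": 120.0, "glacier": 48.0}

for tier, monthly in s3_monthly.items():
    saving = monthly - electricity_per_month
    if saving <= 0:
        print("{}: S3 is cheaper every month".format(tier))
    else:
        months = upfront_drives / saving
        print("{}: drives pay for themselves after {:.1f} months".format(tier, months))
```

Against standard S3 pricing the drives pay for themselves in a few months; against glacier it takes a couple of years.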


# sfpark api

I recently came to learn of the SFpark API, which lets one make queries like:

LAT=37.787702
LONG=-122.407796
curl "http://api.sfpark.org/sfpark/rest/availabilityservice?lat=${LAT}&long=${LONG}&radius=0.25&uom=mile&response=json" | pbcopy
pbpaste | jq '.AVL[] | select (.TYPE | contains("OFF"))'


and get a response including records like:

{
  "TYPE": "OFF",
  "OSPID": "950",
  "NAME": "Union Square Garage",
  "DESC": "333 Post Street",
  "INTER": "Geary between Stockton & Powell",
  "TEL": "(415) 397-0631",
  "OPHRS": {
    "OPS": {
      "FROM": "7 Days/Wk",
      "BEG": "24 Hrs/Day"
    }
  },
  "OCC": "381",
  "OPER": "670",
  "PTS": "1",
  "LOC": "-122.407447946,37.7876789151"
}
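The jq filter above is easy to mirror in Python once the response is parsed; a sketch against a trimmed, made-up response of the same shape:

```python
import json

# A trimmed, made-up response with the same shape as the AVL payload above.
payload = json.loads("""
{
  "AVL": [
    {"TYPE": "ON", "NAME": "Post Street (on-street)"},
    {"TYPE": "OFF", "OSPID": "950", "NAME": "Union Square Garage"}
  ]
}
""")

# Equivalent of: jq '.AVL[] | select (.TYPE | contains("OFF"))'
off_street = [rec for rec in payload["AVL"] if "OFF" in rec["TYPE"]]
for rec in off_street:
    print(rec["NAME"])
```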


Pretty cool!

# regex in java

I can never remember how to do java regexes with Pattern.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("x_(yy|zz)");
String input = "x_yy";
boolean m = pattern.asPredicate().test(input);
System.out.println("matches: " + m);

Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    System.out.println("whole matched string: " + matcher.group(0));
    System.out.println("matched group: " + matcher.group(1));
}


I also learned a really cool thing IntelliJ can do with pattern matches!

# guava RangeMap

We had some code like this:

NavigableMap<Double, Integer> PRIORITY_MULTIPLIERS = new TreeMap<Double, Integer>() {{
    put(LOW_THRESHOLD, 1);    // [20k..40k)  => 1
    put(MEDIUM_THRESHOLD, 2); // [40k..100k) => 2
    put(HIGH_THRESHOLD, 3);   // [100k..+∞)  => 3
}};

Integer getPriorityMultiplier(Double v) {
    if (v < LOW_THRESHOLD) {
        return 0;
    }
    return PRIORITY_MULTIPLIERS.floorEntry(v).getValue();
}


I replaced it with a RangeMap that uses Range instances as keys:

RangeMap<Double, Integer> PRIORITY_MULTIPLIERS = ImmutableRangeMap.<Double, Integer>builder()
    .put(Range.lessThan(LOW_THRESHOLD), 0)
    .put(Range.closedOpen(LOW_THRESHOLD, MEDIUM_THRESHOLD), 1)
    .put(Range.closedOpen(MEDIUM_THRESHOLD, HIGH_THRESHOLD), 2)
    .put(Range.atLeast(HIGH_THRESHOLD), 3)
    .build();

Integer getPriorityMultiplier(Double v) {
    return PRIORITY_MULTIPLIERS.get(v);
}


1. ImmutableRangeMap explicitly checks for overlapping ranges, so specifying Range.atMost for the first range would fail at the time of instantiation.
2. Puts declaration & endpoint behavior (open vs closed) together, instead of partly in the Map declaration and partly in the map.floorEntry call.
3. Simple to test that the entire range is covered:

@Test public void multipliersForEntireRange() {
    Set<Range<Double>> ranges = PRIORITY_MULTIPLIERS.asMapOfRanges().keySet();
    ImmutableRangeSet<Double> rangeSet = ImmutableRangeSet.copyOf(ranges);
    assertThat(rangeSet.encloses(Range.all())).isTrue();
}


# osquery

I've known about osquery for a while, but recently spent some time digging around in it.

The basic idea is to provide a consistent SQL interface to a bunch of system data, instead of learning idiosyncrasies of individual commands (which themselves vary across operating systems). My hacker buddy Sharvil has worked on osquery a fair bit and explained to me a couple years ago that it actually uses sqlite's virtual table functionality to provide the interface -- it's a fascinating and brilliant project!

Concretely, instead of running uptime(1) and getting either

osx ➜ uptime
8:48  up 17:27, 10 users, load averages: 4.57 4.74 5.08
linux ➜ uptime
16:52:12 up 192 days,  9:59,  1 user,  load average: 0.00, 0.00, 0.00


you can run:

osquery> SELECT * FROM uptime;
+------+-------+---------+---------+---------------+
| days | hours | minutes | seconds | total_seconds |
+------+-------+---------+---------+---------------+
| 2    | 1     | 18      | 38      | 177518        |
+------+-------+---------+---------+---------------+


We can also query for system information (shown here in line mode)

osquery> .mode line
osquery> SELECT * FROM system_info;
hostname = traviscj.local
uuid = [redacted]
cpu_type = x86_64h
cpu_subtype = Intel x86-64h Haswell
cpu_brand = Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
cpu_physical_cores = 4
cpu_logical_cores = 8
physical_memory = 17179869184
hardware_vendor = Apple Inc.
hardware_model = MacBookPro11,2
hardware_version = 1.0
hardware_serial = [redacted]
computer_name = traviscj
local_hostname = traviscj


and even perform HTTP requests with the curl table:

osquery> SELECT url, response_code FROM curl WHERE url IN ('https://traviscj.com/_status.json', 'https://pauljoos.com/_status.json');
+-----------------------------------+---------------+
| url                               | response_code |
+-----------------------------------+---------------+
| https://pauljoos.com/_status.json | 200           |
| https://traviscj.com/_status.json | 200           |
+-----------------------------------+---------------+


I tried to reproduce a simple ps wrapper I have:

#!/bin/sh
# a simple script to show java processes without extremely long classpaths

ps aux | grep '[b]in/java' | grep -v $0 | awk '{ print $2" "$11" "$NF }'


and ended up with

SELECT pid
     , SUBSTR(cmdline,
         (WITH char_index(i, ch) AS (
            values(1, '')
            UNION ALL SELECT i+1, substr(cmdline, i, 1) AS r FROM char_index WHERE r != ''
          )
          SELECT MAX(i) FROM char_index WHERE char_index.ch = ' ')
       ) AS java_main_class
FROM processes WHERE name = 'java';


which produces

osquery> SELECT pid, SUBSTR(cmdline, (WITH char_index(i, ch) AS (
...>     values(1, '')
...>     UNION ALL SELECT i+1, substr(cmdline, i, 1) AS r FROM char_index WHERE r != ''
...> )
...> SELECT MAX(i) FROM char_index WHERE char_index.ch = ' ')) as java_main_class FROM processes WHERE name = 'java';
+-------+----------------------------+
| pid   | java_main_class            |
+-------+----------------------------+
| 2736  | com.company.app1.FirstApp  |
| 2737  | com.company.app2.SecondApp |
+-------+----------------------------+


This is, to me, some world-class SQL trickery. Stepping through, the WITH ... SELECT clause works like this.

osquery> WITH char_index(i, ch) AS (
...>   values(1, '')
...>   UNION ALL SELECT i+1, SUBSTR("foo bar baz", i, 1) as r FROM char_index WHERE r != ''
...> ) SELECT * FROM char_index;
+----+----+
| i  | ch |
+----+----+
| 1  |    |
| 2  | f  |
| 3  | o  |
| 4  | o  |
| 5  |    |
| 6  | b  |
| 7  | a  |
| 8  | r  |
| 9  |    |
| 10 | b  |
| 11 | a  |
| 12 | z  |
+----+----+


Adding the WHERE char_index.ch = ' ' gets us just the spaces:

osquery> WITH char_index(i, ch) AS (
...>   values(1, '')
...>   UNION ALL SELECT i+1, SUBSTR("foo bar baz", i, 1) as r FROM char_index WHERE r != ''
...> ) SELECT * FROM char_index WHERE char_index.ch = ' ';
+---+----+
| i | ch |
+---+----+
| 5 |    |
| 9 |    |
+---+----+


and of course adding MAX(i) gives the highest such index, 9. My expression then passes that entire thing into a SUBSTR(var, idx) clause to get the last element, and applies it to cmdline instead of a static string.
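For contrast, outside SQL this "substring after the last space" extraction is a one-liner; the recursive CTE is doing the work of a reverse find:

```python
# The same kind of cmdline as in the osquery output above (path/classpath made up).
cmdline = "/usr/bin/java -cp some:long:classpath com.company.app1.FirstApp"

# The recursive CTE computes the index just past the last space; SUBSTR from
# there yields the final token. In Python that's a reverse find...
start = cmdline.rfind(" ") + 1
java_main_class = cmdline[start:]
assert java_main_class == "com.company.app1.FirstApp"

# ...or, more directly, a right-split on the last space:
assert cmdline.rsplit(" ", 1)[-1] == "com.company.app1.FirstApp"
```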

There's an extensive schema listing on osquery's website describing all the tables available in the latest version of osquery.

# profitably wrong

I've come to realize that my career so far has been built on being "profitably wrong." I think this is interesting because the usual approaches are being "profitably fast" (optimizing) or "profitably better" (improving), and most people think of any kind of wrongness as being a terrible thing. But sometimes the best way to optimize or improve is approximating!

The definition of "profitably" has changed as I've worked on different things, as has the specific type of "wrongness". A couple of specific ways that accepting "wrongness" has been profitable for me:

• being simpler: approximating very complex neural models with very simple mathematical models
• being faster: re-using outdated matrix factorizations to still improve a Newton iterate toward a solution
• being tractable: naive bayes makes an independence assumption that is demonstrably false, but can still be useful

This type of idea seems to pop up everywhere for me:

• HOGWILD: a parallel stochastic gradient descent algorithm that gets unreal speed by disregarding classic concurrency correctness mechanisms (read: locks) but works anyway
• bloom filter: gives up the ability to always answer correctly to instead operate in a fixed, small amount of memory.
• HyperLogLog: similarly gives an approximate unique count in a fixed, small amount of memory overhead. it's used in redis and at google.
• p2 algorithm: estimates quantiles without storing all observations.
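To make that trade concrete, here's a toy Bloom filter (not from any of the papers above; a real one would size the bit array from the expected element count and target error rate):

```python
import hashlib

class ToyBloomFilter:
    """Fixed memory; answers 'definitely absent' or 'probably present'."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a single int

    def _positions(self, item):
        # Derive num_hashes bit positions from salted hashes of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256("{}:{}".format(salt, item).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = ToyBloomFilter()
bf.add("hogwild")
print(bf.might_contain("hogwild"))  # True -- added items are always found
```

Lookups for items never added can return a false positive, but never a false negative: that's exactly the "wrongness" being traded for fixed memory.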

These ideas were all initially hard for me to wrap my head around, after spending several years learning how to prove that this sorting algorithm or that one actually sorts a list. But I'm completely sold on their utility and have come to see it as my own little niche.

## DAYDREAM

What if you knew a secret to solving whatever problem you work on? What good is a secret? It's good if it helps you solve that problem better than other people working on it.

What's the worst that happens if you're wrong? What usually happens when you're wrong?

## APPROXIMATION

One way to invent secrets is by approximating reality -- inventing a model. It's not good enough to just invent a model though. That model must have some predictive power.

The key in each of my examples is that predictive power:

• simple neural models exhibit some properties of much more complex, chemically-based models.
• a Newton-like algorithm that can estimate a better iterate.

## QUANTIFY THE ERROR IN YOUR APPROXIMATION

It's important to understand the errors you might make with your approximation. At a very basic level, you need to understand whether the approximation incurs more cost than it earns to decide whether to use it or not.

One starting place for this is simply writing down the eventual outcome for false positives and false negatives. Another great approach is calculating an exact solution for an easier (relatively!) problem and comparing against the approximation.

## FIND THE APPROXIMATION'S SWEET SPOT

All approximations have some range of validity, outside of which they aren't a reasonable model of reality anymore. Sometimes, they're very well-defined and carefully explored; HOGWILD, for example, needs a certain sparsity structure to avoid destroying the effectiveness of the parallelism with very frequent concurrent writes. Other times, they're simply not well-understood (megacomplex systems under study in medicine, for example) or obscured by an assumption of the model: assuming linearity of some portion of a function, assuming an incompressible fluid, etc. It's important to have some way of knowing whether your predictions/approximations are adding value or simply garbage.

## FIX IT LATER / EVENTUAL CORRECTNESS

Sometimes, approximations apply even to problems that seem like they warrant the utmost correctness. Think money movement or automated vehicle control systems. In practice, though, lots of applications like these really just rely on some notion of eventual correctness. (For varying definitions of "eventual", of course!) When you think from the "what's the worst that could happen" or the "crap, it happened anyway because of something else" angles, it turns out that you'd just issue a refund or steer back into your lane or whatever. And, in fact, the "profitably wrong" mindset might enable you to decide on that corrective action sooner rather than later!

## WHEN TO AVOID / WARNINGS

Sometimes, correctness is king. If you're moving money you might be able to tolerate "wrongness" in your "fraud/not fraud likelihood" calculation, but you probably aren't willing to tolerate the same on your balance adjustment method. Even if you're controlling a spacecraft, a single wacky control recommendation might be okay, as long as there's a bulletproof watchdog/contingency plan when you get several in a row. Hopefully, you know which type of system you're working on!

It's tempting to think that because it's only an approximation, you can be sloppy in your implementation of it. But bugs in approximating code are much harder to nail down: did your code make a mistake because of a bug, or because the model (your simplifying assumptions) was wrong? My world-weary, hard-won advice is that it's worth as much rigor as you can muster in approximating code.

I hope I've inspired you to embrace the idea of having an error rate! I'd love to hear from people about ways you:

1. have been profitably wrong?
2. have tried to be profitably wrong and failed?

# todobackend.com

TodoBackend.com defines a simple API and has a frontend that works given any backend URL.

Particularly interesting to me:

# dancing bull pricing structure

Dancing Bull is a great Korean BBQ restaurant that opened nearby recently. They have this pricing scheme:

Note that the unlimited bulgogi is \$23/person, but the bulgogi a la carte is \$25/grill (with each additional order after that costing \$9). This raises an obvious question: which is the better deal? Spoiler: it depends on both how many people you have and how hungry they are.

First, we had to clarify how it actually worked. It's not clear from the menu, but the "\$25/grill" includes the first order of bulgogi. Introducing $P$ for the number of people, $G$ for the number of grills, and $O$ for the total number of orders, we would pay $$UL = 23\cdot P$$ for the unlimited price scheme vs $$ALC = 25\cdot G + 9\cdot ( O - G)$$ for the a la carte scheme.

Let's take an example: If we have $P=4$ people eating $O=6$ orders of bulgogi (with a single grill), we can pay:

• unlimited: $23\cdot 4 = 92$ dollars (\$23 per person)
• a la carte: $25\cdot 1 + 9\cdot 5 = 70$ dollars (\$17.50 per person)

We're interested in finding the breakeven number of orders, where the a la carte order costs more than the unlimited order. If we're feeling less hungry than that breakeven value, we should just order a la carte.

To actually find it, we seek an $O$ that satisfies $ALC = UL$, so $$ALC = UL \implies 25\cdot G + 9\cdot ( O - G) = 23\cdot P \implies O = { {(23\cdot P - 25 \cdot G)} \over 9} + G$$

This is trivial to express in python:

def breakeven_orders(people, grills):
    """
    orders such that unlimited == a_la_carte

    23*p == 25*g + 9*(total_orders - grills)
    """
    return (23.*people - 25.*grills) / 9. + grills


and we can use that function to make a table for ourselves:
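The table below comes from a loop like this sketch (the column widths and the star-for-impossible-combinations rule are my approximations of the original formatting):

```python
def breakeven_orders(people, grills):
    # orders such that unlimited == a la carte: 23*p == 25*g + 9*(o - g)
    return (23. * people - 25. * grills) / 9. + grills

print("breakeven orders (orders-per-person) for (#grills, # people)")
print(" " * 8 + "".join("G={}".format(g).ljust(17) for g in range(1, 5)))
for p in range(1, 17):
    cells = []
    for g in range(1, 5):
        if g > p:
            # more grills than people: marked * in the table
            cells.append("*".ljust(17))
        else:
            o = breakeven_orders(p, g)
            cells.append(">{} (>{:.2f})".format(int(o), o / p).ljust(17))
    print("P={}".format(p).ljust(8) + "".join(cells))
```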

--------------------------------------------------------------------------------
breakeven orders (orders-per-person) for (#grills, # people)
G=1               G=2              G=3              G=4
P=1     >0 (>0.78)        *                *                *
P=2     >3 (>1.67)        >1 (>0.78)       *                *
P=3     >5 (>1.96)        >4 (>1.37)       >2 (>0.78)       *
P=4     >8 (>2.11)        >6 (>1.67)       >4 (>1.22)       >3 (>0.78)
P=5     >11 (>2.20)       >9 (>1.84)       >7 (>1.49)       >5 (>1.13)
P=6     >13 (>2.26)       >11 (>1.96)      >10 (>1.67)      >8 (>1.37)
P=7     >16 (>2.30)       >14 (>2.05)      >12 (>1.79)      >10 (>1.54)
P=8     >18 (>2.33)       >16 (>2.11)      >15 (>1.89)      >13 (>1.67)
P=9     >21 (>2.36)       >19 (>2.16)      >17 (>1.96)      >15 (>1.77)
P=10    >23 (>2.38)       >22 (>2.20)      >20 (>2.02)      >18 (>1.84)
P=11    >26 (>2.39)       >24 (>2.23)      >22 (>2.07)      >21 (>1.91)
P=12    >28 (>2.41)       >27 (>2.26)      >25 (>2.11)      >23 (>1.96)
P=13    >31 (>2.42)       >29 (>2.28)      >27 (>2.15)      >26 (>2.01)
P=14    >34 (>2.43)       >32 (>2.30)      >30 (>2.17)      >28 (>2.05)
P=15    >36 (>2.44)       >34 (>2.32)      >33 (>2.20)      >31 (>2.08)
P=16    >39 (>2.44)       >37 (>2.33)      >35 (>2.22)      >33 (>2.11)

• ">X (>Y)" indicates unlimited is cheaper for >X total orders (>Y orders-per-person)
• Extra grills = faster ingestion!
• Not sure how many grills unlimited usually comes with. Probably 1 grill/4 people?

## notes & questions

• I went with 4 friends and ended up getting 7 orders of bulgogi.
• I went with 6 friends and ended up getting (I think?) 9 orders of bulgogi (with 2 grills for speed).
• It amazes me how much I didn't (and still don't!) want to do the initial \$25 grill order, even though it's just the constant term and saves money in the long run.
• Pricing one scheme in terms of grills and another in terms of people feels like a dirty trick, even though it's clear why they're doing that. It makes it much harder to compare sanely with mental/napkin math: it tricks you into thinking a la carte costs more, even though you're not going to buy as many of them.
• I find the previous point extremely hard to reconcile with the presence of a \$9 additional order. Why does the brain want to skip over that part?
• An earlier version of this post incorrectly calculated the example $ALC$ value, having used $P-G$ instead of $O-G$. Thanks for pointing it out, Lauren!