Posts

multiple thread test cases in java

May 15, 2018

Some of my day job lately has been trying to get a better handle on threads in Java. In particular, we had a really weird race condition resulting from multiple threads processing an input data stream and causing some unexpected conflicts in our data access layer.

To try to replicate the problem in tests, I started with something like this:

private static Runnable runWithDelay(String name, long millis) {
  return () -> {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
    System.out.println(name + " finished");
  };
}

This is just a Runnable that waits a while, and then prints that it finished. Then we can test with something like:

public static void main(String[] args) throws InterruptedException {
  Thread t0 = new Thread(runWithDelay("r0", 200));
  Thread t1 = new Thread(runWithDelay("r1", 300));
  Thread t2 = new Thread(runWithDelay("r2", 100));

  t0.start(); t1.start(); t2.start();

  t0.join(); t1.join(); t2.join();

  System.out.println("Completely finished");
}

which outputs

r2 finished
r0 finished
r1 finished
Completely finished

as we'd probably expect.

This isn't a very sophisticated use of threads, but it was actually already enough to reproduce our concurrency bug pretty reliably in a local development environment!

big-ish personal storage

Apr 25, 2018

I was curious what it would cost to have something like 5-30TB of storage space on hand. The obvious choices are between buying some hard drives and paying for an S3 bucket or something.

The amazon costs are pretty straightforward: you pick a region, they tell you the cost. Starting with their pricing table:

aws = [
    {"level": "standard", "c_millidollars": 23},
    {"level": "infrequent", "c_millidollars": 12.5},
    {"level": "one_zone", "c_millidollars": 10},
    {"level": "glacier", "c_millidollars": 4}
]

we can say

gbs = 10000
for level in aws:
    dollars_per_gb = level["c_millidollars"] / 1000
    dollars = gbs * dollars_per_gb
    print("{} ==> ${} * {} = ${}/month".format(
      level["level"], dollars_per_gb, gbs, dollars
    ))

and get back

standard ==> $0.023 * 12000 = $276.0/month
infrequent ==> $0.0125 * 12000 = $150.0/month
one_zone ==> $0.01 * 12000 = $120.0/month
glacier ==> $0.004 * 12000 = $48.0/month

Figuring out the hard drive cost, I started from some raw data:

raw = [
  {"slug": "toshiba_x300_4tb", "c": 11299, "tb": 4, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B013JPKUU2"},
  {"slug": "toshiba_x300_5tb", "c": 12999, "tb": 5, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B013JPLKQK"},
  {"slug": "toshiba_x300_6tb", "c": 17505, "tb": 6, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B013JPLKJC"},
  {"slug": "toshiba_x300_8tb", "c": 23973, "tb": 8, "link": "https://www.amazon.com/Toshiba-Performance-Desktop-Internal-HDWE150XZSTA/dp/B074BTZ2YJ"},
  {"slug": "seagate_barracuda_4tb", "c":  9399, "tb": 4, "link": "https://www.amazon.com/Seagate-BarraCuda-3-5-Inch-Internal-ST4000DM004/dp/B071WLPRHN"},
  {"slug": "seagate_barracuda_5tb", "c": 18857, "tb": 5, "link": "https://www.amazon.com/Seagate-Barracuda-2-5-Inch-Internal-ST5000LM000/dp/B01M0AADIX"},
]

This gives us a figure of $29.38/TB for hard drives.

Just to blow up our hard drive costs a bit, we can assume some drive-level redundancy. Maybe we want to use ZFS 2 (but maybe we don't). One alternative is using hardware or maybe even software RAID. Let's assume we want to do 5 disks with 2 parity (something like RAID6). Just sticking with 4TB drives for a sec, we'd be buying 20TB for 12TB of capacity.

Using our "average-priced" drives, that'd be $587.70 upfront cost for hard drives.

Interestingly, there's some ongoing costs for this option, too: the electricity for the computer the drives plug into. Assuming a 200-watt machine and San Francisco electricity prices ($0.201 / kWh), we'd be looking at

1 month @ 200 watts = 200 watts * 730 hours = 146,000 watt-hours = 146 kWh
==> $29.35

sfpark api

Apr 24, 2018

I recently came to learn of the SFpark API, which lets one make queries like:

LAT=37.787702
LONG=-122.407796
curl "http://api.sfpark.org/sfpark/rest/availabilityservice?lat=${LAT}&long=${LONG}&radius=0.25&uom=mile&response=json" | pbcopy
pbpaste | jq '.AVL[] | select (.TYPE | contains("OFF"))'

and get a response including records like:

{
  "TYPE": "OFF",
  "OSPID": "950",
  "NAME": "Union Square Garage",
  "DESC": "333 Post Street",
  "INTER": "Geary between Stockton & Powell",
  "TEL": "(415) 397-0631",
  "OPHRS": {
    "OPS": {
      "FROM": "7 Days/Wk",
      "BEG": "24 Hrs/Day"
    }
  },
  "OCC": "381",
  "OPER": "670",
  "PTS": "1",
  "LOC": "-122.407447946,37.7876789151"
}

Pretty cool!

regex in java

Feb 23, 2018

I can never remember how to do java regexes with Pattern.

Pattern pattern = Pattern.compile("x_(yy|zz)");
String input = "x_yy";
boolean m = pattern.asPredicate().test(input);
System.out.println("matches: " + m);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
  System.out.println("whole matched string: " + matcher.group(0));
  System.out.println("matched group: " + matcher.group(1));
}

I also learned a really cool thing IntelliJ can do with pattern matches!

java regex

guava RangeMap

Jan 26, 2018

We had some code like this:

NavigableMap<Double, Integer> PRIORITY_MULTIPLIERS = new TreeMap<Double, Integer>() {{ 
  put(LOW_THRESHOLD, 1);    // (20k..40k)  => 1
  put(MEDIUM_THRESHOLD, 2); // [40k..100k) => 2
  put(HIGH_THRESHOLD, 3);   // [100k..+∞)  => 3
}};

Integer getPriorityMultiplier(Double v) {
  if (v < LOW_THRESHOLD) {
    return 0;
  }
  return PRIORITY_MULTIPLIERS.floorEntry(v).getValue()
}

I replaced with a RangeMap that uses Range instances as keys:

RangeMap<Double, Integer> PRIORITY_MULTIPLIERS = ImmutableRangeMap.<Double, Long>builder()
      .put(Range.lessThan(LOW_THRESHOLD), 0)
      .put(Range.closedOpen(LOW_THRESHOLD, MEDIUM_THRESHOLD), 1)
      .put(Range.closedOpen(MEDIUM_THRESHOLD, HIGH_THRESHOLD), 2)
      .put(Range.atLeast(HIGH_THRESHOLD), 3)
      .build();
Integer getPriorityMultiplier(Double v) {
  return rangePriorityMultipliers.get(v);
}

Advantages here being:

  1. ImmutableRangeMap explicitly checks for overlapping ranges, so specifying Range.atMost for the first range would fail at the time of instantiation.
  2. Puts declaration & endpoint behavior (open vs closed) together, instead of partly in the Map declaration and partly in the map.floorEntry call.
  3. Simple to test that entire range is covered:

    @Test public void multipliersForEntireRange() {
      Set<Range<Double>> ranges = PRIORITY_MULTIPLIERS.asMapOfRanges().keySet();
      ImmutableRangeSet<Double> rangeSet = ImmutableRangeSet.copyOf(ranges);
      assertThat(rangeSet.encloses(Range.all())).isTrue();
    }
    

osquery

Jan 12, 2018

I've known about osquery for a while, but recently spent some time digging around in it.

The basic idea is to provide a consistent SQL interface to a bunch of system data, instead of learning idiosyncrasies of individual commands (which themselves vary across operating systems). My hacker buddy Sharvil has worked on osquery a fair bit and explained to me a couple years ago that it actually uses sqlite's virtual table functionality to provide the interface -- it's a fascinating and brilliant project!

Concretely, instead of running uptime(1) and getting either

osx ➜ uptime
 8:48  up 17:27, 10 users, load averages: 4.57 4.74 5.08
linux ➜ uptime
 16:52:12 up 192 days,  9:59,  1 user,  load average: 0.00, 0.00, 0.00

you can run:

osquery> SELECT * FROM uptime;
+------+-------+---------+---------+---------------+
| days | hours | minutes | seconds | total_seconds |
+------+-------+---------+---------+---------------+
| 2    | 1     | 18      | 38      | 177518        |
+------+-------+---------+---------+---------------+

We can also query for system information (shown here in line mode)

osquery> .mode line
osquery> SELECT * FROM system_info;
          hostname = traviscj.local
              uuid = [redacted]
          cpu_type = x86_64h
       cpu_subtype = Intel x86-64h Haswell
         cpu_brand = Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
cpu_physical_cores = 4
 cpu_logical_cores = 8
   physical_memory = 17179869184
   hardware_vendor = Apple Inc.
    hardware_model = MacBookPro11,2
  hardware_version = 1.0
   hardware_serial = [redacted]
     computer_name = traviscj
    local_hostname = traviscj

and even perform HTTP requests with the curl table:

osquery> SELECT url, response_code FROM curl WHERE url IN ('https://traviscj.com/_status.json', 'https://pauljoos.com/_status.json');
+-----------------------------------+---------------+
| url                               | response_code |
+-----------------------------------+---------------+
| https://pauljoos.com/_status.json | 200           |
| https://traviscj.com/_status.json | 200           |
+-----------------------------------+---------------+

I tried to reproduce a simple ps wrapper I have:

#!/bin/sh
# a simple script to show java processes without extremely long classpaths

ps aux | grep '[b]in/java' | grep -v $0 | awk '{ print $2" "$11" "$NF }'

and ended up with

SELECT pid
     , SUBSTR(cmdline, 
         (WITH char_index(i, ch) AS (values(1, '') UNION ALL SELECT i+1, substr(cmdline, i, 1) AS r FROM char_index WHERE r != '')
         SELECT MAX(i) FROM char_index WHERE char_index.ch = ' ')
       ) as java_main_class
FROM processes WHERE name = 'java';

which produces

osquery> SELECT pid, SUBSTR(cmdline, (WITH char_index(i, ch) AS (
    ...>     values(1, '')
    ...>     UNION ALL SELECT i+1, substr(cmdline, i, 1) AS r FROM char_index WHERE r != ''
    ...> )
    ...> SELECT MAX(i) FROM char_index WHERE char_index.ch = ' ')) as java_main_class FROM processes WHERE name = 'java';
+-------+----------------------------+
| pid   | java_main_class            |
+-------+----------------------------+
| 2736  | com.company.app1.FirstApp  |
| 2737  | com.company.app2.SecondApp |
+-------+----------------------------+

This is, to me, some world-class SQL trickery. Stepping through, the WITH ... SELECT clause works like this.

osquery> WITH char_index(i, ch) AS (
    ...>   values(1, '')
    ...>   UNION ALL SELECT i+1, SUBSTR("foo bar baz", i, 1) as r FROM char_index WHERE r != ''
    ...> ) SELECT * FROM char_index;
+----+----+
| i  | ch |
+----+----+
| 1  |    |
| 2  | f  |
| 3  | o  |
| 4  | o  |
| 5  |    |
| 6  | b  |
| 7  | a  |
| 8  | r  |
| 9  |    |
| 10 | b  |
| 11 | a  |
| 12 | z  |
+----+----+

Adding the WHERE char_index.ch = ' ' gets us just the spaces:

osquery> WITH char_index(i, ch) AS (
    ...>   values(1, '')
    ...>   UNION ALL SELECT i+1, SUBSTR("foo bar baz", i, 1) as r FROM char_index WHERE r != ''
    ...> ) SELECT * FROM char_index WHERE char_index.ch = ' ';
+---+----+
| i | ch |
+---+----+
| 5 |    |
| 9 |    |
+---+----+

and of course adding MAX(i) gives the highest such index, 9. My expression then passes that entire thing into a SUBSTR(var, idx) clause to get the last element, and applies it to cmdline instead of a static string.

There's an extensive schema listing on OSQuery's website describing all the tables available in the latest version of OSQuery.

profitably wrong

Jan 9, 2018

I've come to realize that my career so far has been built on being "profitably wrong." I think this is interesting because the usual approaches are being "profitably fast" (optimizing) or "profitably better" (improving), and most people think of any kind of wrongness as being a terrible thing. But sometimes the best way to optimize or improve is approximating!

The definitions of "profitably" has changed as I've worked on different things, as has the specific type of "wrongness". A couple specific ways accepting "wrongness" have been profitable for me include:

  • being simpler: approximating very complex neural models with very simple mathematical models
  • being faster: re-using outdated matrix factorizations to still improve a Newton iterate toward a solution
  • being tractable: naive bayes makes an independence assumption that is demonstrably false, but can still be useful

This type of idea seems to pop up everywhere for me:

  • HOGWILD: a parallel stochastic gradient descent algorithm that gets unreal speed by disregarding classic concurrency correctness mechanisms (read: locks) but works anyway
  • bloom filter: gives up the ability to always answer correctly to instead operate in a fixed, small amount of memory.
  • HyperLogLog: similarly gives an approximate unique count in a fixed, small amount of memory overhead. it's used in redis and at google.
  • p2 algorithm: estimates quantiles without storing all observations.

These ideas were all initially hard for me to wrap my head around, after spending several years learning how to prove that this sorting algorithm or that one actually sorts a list. But I'm completely sold on their utility and have come to see it as my own little niche.

DAYDREAM

What if you knew a secret to solving whatever problem you work on? What good is a secret? It's good if it helps you solve that problem better than other people working on it.

What's the worst that happens if you're wrong? What usually happens when you're wrong?

APPROXIMATION

One way to invent secrets is by approximating reality -- inventing a model. It's not good enough to just invent a model though. That model must have some predictive power.

The key in each of my examples is that predictive power:

  • simple neural models exhibit some properties of much more complex, chemically-based models.
  • a Newton-like algorithm that can estimate a better iterate.

QUANTIFY THE ERROR IN YOUR APPROXIMATION

It's important to understand the errors you might make with your approximation. At a very basic level, you need to understand whether the approximation incurs more cost than it earns to decide whether to use it or not.

One starting place for this is simply writing down the eventual outcome for false positives and false negatives. Another great approach is calculating an exact solution for an easier (relatively!) problem and comparing against the approximation.

FIND THE APPROXIMATION'S SWEET SPOT

All approximations have some range of validity, outside of which they aren't a reasonable model of reality anymore. Sometimes, they're very well-defined and carefully explored; HOGWILD, for example, needs a certain sparsity structure to avoid destroying the effectiveness of the parallelism with very frequent concurrent writes. Other times, they're simply not well-understood (megacomplex systems under study in medicine, for example) or obscured by an assumption of the model: assuming linearity of some portion of a function, assuming an incompressible fluid, etc. It's important to have some way of knowing whether your predictions/approximations are adding value or simply garbage.

FIX IT LATER / EVENTUAL CORRECTNESS

Sometimes, approximations apply even to problems that seem like they warrant the utmost correctness. Think money movement or automated vehicle control systems. On the contrary, lots of applications like these really just rely on some notion of eventual correctness. (For varying definitions of "eventual", of course!) When you think from the "what's the worst that could happen" or the "crap, it happened anyway because of something else" angles, it turns out that you'd just issue a refund or steer back into your lane or whatever. And, in fact, the "profitably wrong" mindset might enable you to decide on that corrective action sooner than later!

WHEN TO AVOID / WARNINGS

Sometimes, correctness is king. If you're moving money you might be able to tolerate "wrongness" in your "fraud/not fraud likelihood" calculation, but you probably aren't willing to tolerate the same on your balance adjustment method. Even if you're controlling a spacecraft, a single wacky control recommendation might be okay, as long as there's a bulletproof watchdog/contingency plan when you get several in a row. Hopefully, you know which type of system you're working on!

It's tempting to think because it's only an approximation, you be sloppy in your implementation of it. But bugs in approximating code are much harder to nail down: did your code make a mistake because of a bug, or because the model (your simplifying assumptions) was wrong? My world-weary, hard-won advice is that it's worth as much rigor as you can muster in approximating code.

I hope I've inspired you to embrace the idea of having an error rate! I'd love to hear from people about ways you:

  1. have been profitably wrong?
  2. have tried to be profitably wrong and failed?

todobackend.com

Jan 5, 2018

TodoBackend.com defines a simple API and has a frontend that works given any backend URL.

Particularly interesting to me:

dancing bull pricing structure

Jan 4, 2018

Dancing Bull is a great Korean BBQ restaurant that opened nearby recently. They have this pricing scheme:

Note that the unlimited bulgogi is \$23/person, but the bulgogi a la carte is \$25/grill (with each additional orders after that costing \$9). This raises an obvious question: which is the better deal?

Spoiler: it depends on both how many people you have and how hungry they are.

First, we had to clarify how it actually worked. It's not clear from the menu, but the "\$25/grill" includes the first order of bulgogi. Introducing $P$ for the number of people, $G$ for the number of grills, and $O$ for the total number of orders, we would pay $$UL = 23\cdot P$$ for the unlimited price scheme vs $$ALC = 25\cdot G + 9\cdot ( O - G)$$ for the a la carte scheme.

Let's take an example: If we have $P=4$ people eating $O=6$ orders of bulgogi (with a single grill), we can pay:

  • unlimited: $23\cdot 4 = 92$ dollars (\$23 per person)
  • a la carte: $25\cdot 1 + 9\cdot5 = 61$ dollars (\$15.25 per person)

We're interested in finding the breakeven number of orders, where the a la carte order costs more than the unlimited order. If we're feeling less hungry than that breakeven value, we should just order a la carte.

To actually find it, we seek an $O$ that satisfies $ALC = UL$, so $$ALC = UL \implies 25\cdot G + 9\cdot ( O - G) = 23\cdot P \implies O = { {(23\cdot P - 25 \cdot G)} \over 9} + G$$

This is trivial to express in python:

def breakeven_orders(people, grills):
    """
    grills such that unlimited == a_la_carte

    23*p == 25*g + 9*(total_orders - grills)
    """
    return (23.*people - 25.*grills) / 9. + grills

and we can use that function to make a table for ourselves:

--------------------------------------------------------------------------------
breakeven orders (orders-per-person) for (#grills, # people)
        G=1               G=2              G=3              G=4                
P=1     >0 (>0.78)        *                *                *                
P=2     >3 (>1.67)        >1 (>0.78)       *                *                
P=3     >5 (>1.96)        >4 (>1.37)       >2 (>0.78)       *                
P=4     >8 (>2.11)        >6 (>1.67)       >4 (>1.22)       >3 (>0.78)        
P=5     >11 (>2.20)       >9 (>1.84)       >7 (>1.49)       >5 (>1.13)        
P=6     >13 (>2.26)       >11 (>1.96)      >10 (>1.67)      >8 (>1.37)        
P=7     >16 (>2.30)       >14 (>2.05)      >12 (>1.79)      >10 (>1.54)        
P=8     >18 (>2.33)       >16 (>2.11)      >15 (>1.89)      >13 (>1.67)        
P=9     >21 (>2.36)       >19 (>2.16)      >17 (>1.96)      >15 (>1.77)        
P=10    >23 (>2.38)       >22 (>2.20)      >20 (>2.02)      >18 (>1.84)        
P=11    >26 (>2.39)       >24 (>2.23)      >22 (>2.07)      >21 (>1.91)        
P=12    >28 (>2.41)       >27 (>2.26)      >25 (>2.11)      >23 (>1.96)        
P=13    >31 (>2.42)       >29 (>2.28)      >27 (>2.15)      >26 (>2.01)        
P=14    >34 (>2.43)       >32 (>2.30)      >30 (>2.17)      >28 (>2.05)        
P=15    >36 (>2.44)       >34 (>2.32)      >33 (>2.20)      >31 (>2.08)        
P=16    >39 (>2.44)       >37 (>2.33)      >35 (>2.22)      >33 (>2.11)        
  • ">X (>Y)" indicates unlimited is cheaper for >X total orders (>Y orders-per-person)
  • Extra grills = faster ingestion!
  • Not sure how many grills unlimited usually comes with. Probably 1 grill/4 people?

notes & questions

  • I went with 4 friends and ended up getting 7 orders of bulgogi.
  • I went with 6 friends and ended up getting (I think?) 9 orders of bulgogi (with 2 grills for speed).
  • It amazes me how much I didn't (and still don't!) want to do the initial \$25 grill order, even though it's just the constant term and saves money in the long run.
  • Pricing one scheme in terms of grills and another in terms of people feels like a dirty trick, even through it's clear why they're doing that. It makes it much harder to compare sanely with mental/napkin math: it tricks you into thinking a la carte costs more, even though you're not going to buy as many of them.
  • I find the previous point extremely hard to reconcile with the presence of a \$9 additional order. Why does the brain want to skip over that part?
  • An earlier version of this post incorrectly calculated the example $ALC$ value, having used $P-G$ instead of $O-G$. Thanks for pointing it out, Lauren!

everything has a failure rate

Jan 3, 2018

I was happily riding in to work this morning on my motorcycle. Periodically while riding, a spark of terror that I've forgotten to buckle my backpack to the bike's frame hits me. So frequently, in fact, that I've grown used to reaching behind me to feel for the backpack in a spare second when I don't need to clutch.

Today was different:

I reached for the backpack but felt nothing.

I quickly signaled to pull over to the side and looked over my shoulder:

For the uninitiated, it's supposed to look like this:

I sped back home on the route I'd taken, mentally taking inventory of all the things in my backpack. My medications, my nice headphones, my work laptop. All the things on the laptop. Thoughts raced through my mind: If I can't find the backpack, who do I need to get in touch with at work to invalidate any credentials stored on it? (I wasn't that worried about this -- the laptops at Square are pretty locked down and we don't really store anything very secret on them -- but working here for nearly 4 years has given me some good operational security practices!) I wonder when the last backup completed and how much (if anything) I'll lose if it's gone. I try to decide whether I need to get in touch with Square's Trust & Security or IT team first, then realize that I'm not even sure I have either phone number stored in my phone.

I remember that my laptop's lock screen has my phone number and email. I hold out hope that my phone will ring or that a morning runner or neighbor will have found it and be waiting at the main entrance of my apartment building. All the while, I'm scanning the road for the backpack.

One of my primary maxims is

Everything has a failure rate.

I realize that this applies even to the process of securing the buckles of the backpack. I kick myself, because I have adopted the shisa kanko practice of pointing to the closed garage door and each buckle before I get on the bike -- but I neglected to do so this particular morning because I had so consciously closed the garage door.

With every turn in the route I become a bit more despondent and resigned. How long is it going to take to set everything up on the laptop again? It's early -- before 7 in the morning -- how soon can I get a replacement?

As I approach my apartment, I hold out hope that the backpack is hiding behind the last car -- perhaps it bumped off as my rear tire rolled over the curb. When that hypothesis is dreadfully disproven, a flash of hope washes over me that perhaps a neighbor recognized my name and put it inside the building before hurrying about their day. But no. There's no backpack I can see in the hallway through the floor-to-ceiling windows framing the glass door.

Fully crestfallen and expecting no relief, I hold out for the last hope -- maybe it fell off when I wheeled the bike out of the garage. It's a long shot, but it's either there or gone. I put the bike in neutral, kick the side-stand down, and jump off the bike. As I walk around to open the garage door, I notice something about the bike doesn't look quite right. And there it is:

subscribe via RSS

Powered by Olark