filter vs spec (draft)

Jan 18, 2017

Consider a silly data model to store data about cities like

message City {
  optional string city_name = 1;
  optional string state = 2;
  optional int32 population = 3;
  optional int32 year_founded = 4;
  // ... presumably others :-)

and some sample data like:

  {"city_name": "Portland", "state": "OR", "population": ...},
  {"city_name": "Portland", "state": "ME", "population": ...},
  {"city_name": "Springfield", "state": "FL", "population": ...},
  {"city_name": "Springfield", "state": "IL", "population": ...},
  {"city_name": "Springfield", "state": "CO", "population": ...}

There are some useful entities we can define: (DRAFT NB: don't read too much into the matcher vs filter lingo.)

  1. a specifier is a data object that tells you have to find the specific instance you are looking for, and lets you answer questions about existence. An example specifier message for our city model might be

    message CitySpecifier {
      required string city_name = 1;
      required string state = 2;


      "city_name": "Seattle",
      "state": "WA"

    is an example specifier instance.

    Note that I've used required here, which is against protobuf best practices, but makes it clear that CitySpecifiers like {"city_name": "Seattle"} or {"state": "WA"} should be rejected.

    A specifier is what answers the question of whether two things are the same. Another way to look at it is that these are the fields we would want to impose a unique key constraint on.

  2. a matcher is one way to say "Show me all the things with these properties". Suppose the data object is

    message CityMatcher {
      optional string city_name = 1;
      optional string state = 2;

    This has four types of queries:

    {"city_name": "Springfield"}
    {"state": "CO"}
    {"city_name": "Springfield", "state": "CO"}

    the first three of which map, respectively, to SQL where clauses like

    WHERE 1;
    WHERE city_name = 'Springfield';
    WHERE state = 'CO';

    The fourth presents an interesting predicament: Is

    1. WHERE city_name = 'Springfield' AND state = 'CO';
    2. WHERE city_name = 'Springfield' OR state = 'CO';

    a better choice for the meaning of {"city_name": "Springfield", "state": "CO"}?

    I think the choice is clear: (2) is a bad choice!
    1. We already have a way represent the set where the name is Springfield (that is, {"city_name": "Springfield"}), and a way to represent the set where the state is CO ({"state": "CO"}). These primitives can be combined either by a union of the where-clause (which indeed gives WHERE city_name = 'Springfield' OR state = 'CO') or the database layer can use the UNION operation to compute the union of returned records. So using {"city_name": "Springfield", "state": "CO"} to represent (2) would be a redundant functionality.
    2. (2) leaves us with no way of of representing the intersection of cities where the name is Springfield and the state is CO. Resolving requires either application-level code or a temporary-table-type approach that implements (1) anyway, and thus provides no reduction in complexity.

    The where-clause union is best served by introducing a compound matcher, like

    message CompoundCityMatcher {
      repeated CityMatcher city_matcher = 1;

    where city_matcher entries are reduced with a logical OR operation.

  3. a filter is a way to say "Show me all the things with these properties". So we might say have a data object like

    message CityFilter {
      repeated string city_name = 1;
      repeated string state = 2;

    Note that the fields in this object are repeated, so we can include lots of them, like

      "city_name": ["Portland", "Chicago"],
      "state": ["WA"]

    It doesn't make any sense for a city to be named both Portland and Chicago, so the clear choice to join query atoms within the same field is a logical OR -- otherwise, all filters with multiple values would return the empty set! By the same logic that we applied to CityMatcher, it is clear that different fields should be joined with AND.

    The key is then:

  • a given filter field matches to something like the SQL city_name IN ('Portland', 'Chicago')
  • multiple filter fields are joined with a logical AND, so our first CityFilter would be city_name IN ('Portland', 'Chicago') AND state IN ('WA').
  • omit the where-clause for empty fields.

    In a similar fashion to the CompoundCityMatcher, we can define a CompoundCityFilter message:

    message CompoundCityFilter {
      repeated CityFilter city_filter = 1;

    which is also OR-reduced.

    There's another problem with assuming that two different fields are joined with a logical OR. The number of rows quickly explodes as we start adding in more city_name or state. Even if the actual data transfer imposed is okay, this puts the burden to add some logic outside the database layer that does database-type-things of selecting and filtering records, but can't benefit from the indices and query-ability of the database proper.

    There's another (minor) subtlety regarding null vs empty filter fields. Essentially, what do you do with a CityFilter like {"city_name": ["Portland", "Chicago"], "state": []}?

    • WHERE city_name in ('Portland','Chicago') AND state in ()?
    • WHERE city_name in ('Portland','Chicago')?

    TODO: answer this question

The clever will look at the above data models and say: "Ah! But CityFilter is just the same thing as a CitySpecifier where each field is a singleton list!! We have no need for a CitySpecifier!" While true, the burden is then on all client code that would use CitySpecifier to include a validation that each field is exactly singleton. Furthermore, all future programmers reading such code (yourself included!) must continually bear the cognitive overhead of whether or not a given CityFilter is being used as a CityFilter or a CitySpecifier, and all future executions of the code must bear either the performance overhead of running that validation, or the risk of violating it.

It's interesting to note that, in spite of the proto+SQL terms in which I've formulated this argument, the general principals hold more generally. See, for example, Django's Complex lookups with Q objects documentation:

Keyword argument queries – in filter(), etc. – are “AND”ed together. If you need to execute more complex queries (for example, queries with OR statements), you can use Q objects.

korean beef recipe

Dec 31, 2016


  • 1/4 cup brown sugar, packed
  • 1/4 cup reduced sodium soy sauce
  • 2 tsp sesame oil
  • 1/2 tsp crushed red-pepper flakes
  • 1/4 tsp ground ginger
  • 1 tbsp vegetable oil
  • 3 cloves garlic, minced
  • 1 pound ground beef
  • 2 green onions, thinly sliced.
  • 1/4 tsp sesame seeds


  1. Mix brown sugar, soy sauce, sesame oil, red pepper flakes, ginger.
  2. Sizzle garlic in the vegetable oil, then add beef and cook until brown.
  3. Stir in soy sauce mixture.
  4. Serve immediately.

Summarized from Korean Beef Bowl on Damn Delicious, itself summarized from Korean Beef and Rice.

christmas commute

Dec 29, 2016

I did a 725-mile drive yesterday from Prosser, WA to San Francisco, CA. I captured the daylight hours with a GoPro timelapse and turned it into a video!

This was the actual loop we took, including the trip up to Washington. We spent two nights in Oregon -- one in Grants Pass and one in Portland.

My GoPro mount held the pictures upside down, so I transformed all of them with a quick

for f in upside-down/*; do
    convert $f -rotate 180 $(basename $f)

and then transformed all of the images into an MP4 with

brew install ffmpeg
ffmpeg -framerate 24 -pattern_type glob -i '*.JPG' -c:v libx264 -pix_fmt yuv420p out.mp4

I learned that I would do several things differently for the next timelapse:

  1. find a soundtrack
  2. angle the gopro up more and/or crop the video down to get rid of the dashboard
  3. use a smaller photo interval, probably 1, 2, or 5 seconds.
  4. use the "capture setting" to flip it upside down instead of flipping it "in post".
  5. set up the correct time, date, and resolution for the pictures.
  6. Probably act like an actual photographer and lower the resolution at night in the hope that we could still catch some short exposures.

CBCL: The Common Business Communication Language

Dec 12, 2016

I recently came across McCarthy's CBCL paper on Hacker News. He presents a Lisp notation as a format for sharing requests between different software on different computers. He calls out

    (IS-FLIGHT x) 

where $\epsilon x$ is an operator maps a predicate to a value matching the predicate, and considers it an improvement on the "iota" definite description operator, which seems to be an operator that returns the unique value matching the predicate. It seems interesting to me that the unique match operator is not considered as useful as the arbitrary match operator -- unique keys actually seem critical to a lot of my work with data models. Even more fascinating, though, is that this looks like an anonymous function!

I did a bit more reading, and it turns out that those definite descriptions are probably actually a reference to the domain relational calculus introduced by Lacroix and Pirotte.

McCarthy is very dismissive of XML as an inferior format to Lisp notation, but I am a bit less convinced that XML's addition of a schema doesn't add anything. Enumerating possible values a given request can take on -- that is, a specification -- seems very fruitful in the distributed systems work that I have done, to say little of how nice a standardized, portable, relatively compact, battle-tested specification and serialization format is. But perhaps McCarthy doesn't mention it because he is assuming that the Lisp notation could also have a schema/specification overlayed on top of it, and which was itself a Lisp notation definition. It almost seems like variant of the robustness principle, which has of course been refuted.

Finally, I'm interested in applying these lessons to the things that came after McCarthy's 1999 update. In particular, protocol buffers. Here, we find an even stronger schema and a ton of other features not exactly relevant to the CBCL proposal, but interesting none-the-less. I was especially struck by the description of the ADJECTIVE modifier for specifying "extra" details about a particular type of object, and the similarity to protobuf optional fields. Specifically,

The point of ADJECTIVE is that a program not understanding YELLOW could nevertheless understand that #2 pencils were called for, and could reply that they don't have any pencils, if that were so.

ends up sounding a lot (but not exactly) like an optional protobuf field--perhaps more akin to using the protobuf API to return a response that encodes "I matched all the fields I understood, but also found a field I didn't understand".

oblique programming strategies

Dec 7, 2016

Ever since I found out about it (probably on Hacker News), the idea of Oblique Strategies has fascinated me. The first editions are going on ebay for \$2500-\$3300 bucks, which I think is incredible. If you're curious and impatient, you can check out this list on github.

One recent sleepness night, I made a list of "oblique programming strategies" on my phone, transcribed here. They are not as starkly polished as Eno's version (unsurprisingly), but might be useful to you!

  • Solve the easiest possible problem in the dumbest possible way.
  • Write a test for it.
  • Is there a better name for this thing?
  • Can we move work between query time (when we need the answer) and ingest time (when we see the data that eventually informs the answer)?
  • Is it easier in a relational data store? A KV Store? A column store? A document store? A graph store?
  • Can performance be improved by batching many small updates?
  • Can clarity be improved by transforming a single update to more smaller updates?
  • Can we more profitably apply a functional or declarative or imperative paradigm to the existing design?
  • Can we profitably apply a change from synchronous to asynchronous, or vice versa?
  • Can we profitably apply an inversion of control, moving logic between many individual call sites, a static central definition, and a reflectively defined description of the work to be done?
  • Is it faster with highly mutable or easier with completely immutable data structures?
  • Is it easier on the client side or the server side?
  • List the transitive closure of fields in a data model. Regroup them to make the most sense for your application. Do you have the same data model?
  • Is it better to estimate it quickly or compute it slowly?
  • What semantics do you need? Should it be ordered? Transactional? Blocking?
  • Can you do it with a regex? Do you need to bite the bullet and make a real parser? Can you avoid parsing by using a standardized format? (A few to get you started: s-expressions/XML/protobuf/JSON/yaml/msgpack/capn/avro/edn.)
  • What is the schema for this data? Is the schema holding you back?
  • Draw a state diagram according to the spec.
  • Draw a state diagram according to the data.
  • Draw a data flow ($dX/dy$)
  • Draw a timeline ($dX/dt$)
  • How would you do it in Haskell? C? Javascript?
  • Instead of doing something, emit an object.
  • Instead of emitting an object, do something.
  • Store all of it.
  • Truncate the old stuff.
  • Write the API you wish existed.
  • Make an ugly version where all the things work.
  • Make a gorgeous version that doesn't do anything.
  • Can you codegen the boilerplate?
  • Enumerate all the cases.
  • What happens if you do it all offline / precompute everything? What happens if you recompute every time? Can you cache it?
  • Can you build an audit log?
  • Think like a tree: ignore the book-keeping details and find the cleanest representation.
  • Think like a stack: zoom in to the book-keeping details and ignore the structure.
  • Replace your implementation with an implementation that computes how much work the real implementation does for that problem.
  • What is the steady state?

motorcycle learnings at 250 miles

Dec 2, 2016

Two new learnings at the 250mi point:

  1. The backpack straps down to the back!

    The balance of the bike feels so much better with the backpack low and riding without weight on my shoulders is so much more comfortable.

  2. Ear Plugs are amazing! I noticed earplugs being worn on The Long Way Round and decided to try it out. It is really surprising how relevant noises (other cars, etc) make it through fine, but how much nicer it is to not have the wind noise!

Useful homebrew formula

Dec 1, 2016

Last week I was telling Paul about things I have brew installed on my work laptop. I pulled a full list as I was doing some upgrades and stuff this morning.

Glancing over it a bit, here are some of my favorites/most usefuls:

  • autojump for a usage-adjusting cd -- given a partial string it just goes to the most-used directory with that prefix. I have autojump aliased to j to shorten it up even more.
  • colordiff to make diffs/grep/etc nicer to look at in the terminal.
  • coreutils/dateutils is nice because it installs the GNU versions of a lot of standard linux commands like du and ls and date that work a bit strange/differently on OSX.
  • jq is amazing — lets you query JSON objects.
  • ledger to manage bank balances in plain text.
  • newsbeuter to subscribe to RSS feeds.
  • osquery lets you do SQL-like queries to find out about the system it’s running on (sharvil did a bunch of work on it, actually…)
  • terminal-notifier to pop up notifications from terminal/cron jobs.
  • tmux to have long-running terminal windows without having terminal windows open.
  • watch instead of "some command"+uparrow+enter+uparrow+enter+....
  • wget as shorthand for “curl http://xyz/tuv.json > tuv.json” (old habits…),
  • youtube-dl for downloading youtube videos.

KLR-650 first month & american gumball

Nov 30, 2016

Yesterday I finally got the title for my motorcycle and realized that I'd owned it for exactly one month!

So far, I haven't done anything very adventureous -- still getting used to riding again. I had a couple things that were making me nervous about it:

  1. The bike gets warm (>1/2 way on the heat gauge, but still significantly below the red) tooling around, usually on my commute home in the evening.

  2. There is a sputtering or almost backfiring when I am coasting down hills, which is several times each direction.

Turns out that both are normal! Hooray!

I was thinking about other things I want out of life and ended up coming across the American Gumball Rally website. In highschool, Beals and I used to watch "documentaries" about people on the Gumball 3000 and Mischief runs, their expensive cars, and their (pretty crazy) disregard for speed limits. I pitched Beals on doing the Nevada America Gumball Run in 2017, and we immediately got into whether his Focus or my Focus got to wear a racing stripe and do the honors, when Beals dropped a bomb on me:

too bad you don't have your f250 still, that would be an epic ride

It turns out my parents do still have the '91 Ford F250 to which Beals was referring. There's one catch: the truck is in Prosser, WA, and I'm in San Francisco, CA, and Beals is in Portland, OR. Okay, so, we'll take on a road trip:

Correction: two catches. The second is that the truck transmission and alternator are dodgy. And per catch (A), we have 14+ hours through Oregon and Nevada. We'll just have to bring some way to get ourselves out of trouble if it breaks down somewhere. Something that could fit in a pickup. Something like

So yeah -- that's the plan:

  1. Drive the bike to Prosser.
  2. Load the bike on the truck.
  3. Drive to Las Vegas with the truck and the bike.
  4. Do the rally.
  5. Drive the truck back to Prosser.
  6. Ride the bike back to San Francisco.

Boom -- problem solving.

stocks and options from 30k feet

Oct 14, 2016

One of my friends at work asked me if I had any book recommendations for learning about stocks and options. Mentally, I break trading down into two general classes of trading: index-type and "exotic" trading. By exotic trading, I mean picking individual stocks/options and actively trading. This runs counter to the more conservative buy-and-hold, index-based, hands-off approach.

For the exotic trading, I learned most of what I know from a class with Professor W.E. Olmstead and his book, Options for the Beginner and Beyond: Unlock the Opportunities and Minimize the Risks. For the option-uninitiated, the basic idea is that instead of buying or selling stocks directly, you buy and sell contracts that give you the right (but not obligation) to buy or sell the stock at a particular price by a particular date. That's a mouthful and options are indeed subtle beasts, but they allow the flexibility to either hedge risks you want less exposure to, or increase/leverage exposure to risks you do want to take.

Here's one pretty basic example, called the "Covered Call" strategy. Suppose you:

  • buy 100 shares of SQ (currently trading at \$11.12 per stock share or $1112 total)
  • sell 1 DEC16-16 \$13 SQ call contract (currently trading at \$.25 per option share, or \$25 per contract)

You get the $25 account credit when you sell the initial shares; your eventual profit vs loss on this trade breaks down as follows:

  • if the stock price is above $13, the option is likely to be exercised, which means you will be paid \$13 per share for each share of the contract ($1300 overall) and the counterparty takes ownership of the shares.
  • If the stock price is below $13, the option is likely to stay unexercised, so you keep the 100 shares of the stock and the original $25.

This sale can be repeated as often as the option expirations are available, which is usually every month. Exceptions happen in both directions, though: some (especially newer) stocks have less frequent option expirations, while some (extremely high-volume) stocks have more frequent option expirations -- as often as weekly! Furthermore, in the unexercised case, you don't need to buy the stock again since you still have it from the last round. With enough iterations of this, one can even pay off the original stock purchase purely from writing the covered calls.

That's not to say this is easy, though! If the market considered it impossible for the stock to go that high, nobody would buy the option contract. So the trick to this strategy is finding a stock and predicting the movement, and actually being correct.

As discussed above, you can use Google Finance to view the stock price history and the stock option price chain:

SQ options chain on 2016-10-14

These quotes, however, can be delayed; most reputable option trading companies will provide an up-to-the-second view of the option chains.

Turns out that finding, predicting, and being correct are pretty hard, and a fair bit of distraction, though. So instead, I mostly opt for the boring route: investing in index funds and riding the standard stock-market growth curve. As best as I can tell, this is pretty close to the view espoused by I admittedly haven't read the book): Invest a steady stream of earnings in an index fund with as low of management fee as possible, and let it ride as long as possible. Index funds are a kind of "synthetic" stock that tracks multiple other stocks. For example, QQQ tracks the performance of the Nasdaq-100 index, the DIA tracks the performance of the Dow Jones Industrial Average, and SPY tracks the performance of the S&P 500.

Disclaimers: The SQ stock quotes listed above are just illustrative examples and not trading recommendations, though I am long SQ as of the post date of this article.

streaks vs statistical streaks

Sep 26, 2016

Hacker News et al are obsessed with streaks, but I think they have some problems:

  1. A single regression resets to zero.

  2. There's not an easy way to gradually ramp up your streak-commitment over time.

I prefer a different approach: statistical streaks.

Suppose I made a commitment to do something differently on 2016-08-26, and did it for the next 5 days; then my 30-day statistical streak avg = 0.166, but my 5-day statistical streak avg = 1.0.

If I then have one off-day (5 consecutive days then one failure day) the 30-day statistical streak avg is still 0.166, but the 5-day statistical streak avg drops to 0.80.

If I pick it back up the next day, it goes to 0.2 and 0.80 for the 30- and 5-day statistical streak averages (respectively), while if I fail again, it drops to 0.1667 and 0.6 for the 30- and 5- variants, respectively.

Here's a tiny bit of python to demonstrate these scenarios:

    assumed_history = 0
    explicit_recent_history = [1,1,1,1,1,0,0]

    for window_days in [30, 5]:
        # lazy...
        history = [assumed_history]*(window_days)+explicit_recent_history   
        recent_history = history[-window_days:]
        success_days = sum(recent_history)

        print window_days, float(success_days) / window_days , recent_history

subscribe via RSS

Powered by Olark