Posts

smart arguments

Feb 21, 2019

It's easy to love the terseness one can get out of a command-line tool with a "pass by order" convention. For example, maybe I run something like

deploy myapp sea2 production

to deploy an application called myapp to the second production datacenter near Seattle. This works great, except now I need to remember the order of arguments is: deploy [app] [datacenter] [environment].

The typical solve for this problem is introducing named arguments, so you'd end up with

deploy --app myapp --datacenter sea2 --environment production

This no longer depends on the order, because we're explicit about which value we're passing for each argument. The command itself is a LOT longer though!

In [41]: len("deploy myapp sea2 production")
Out[41]: 28

In [42]: len("deploy --app myapp --datacenter sea2 --environment production")
Out[42]: 61

We can improve this a bit with short named arguments, like

deploy -a myapp -d sea2 -e production

In [43]: len("deploy -a myapp -d sea2 -e production")
Out[43]: 37

but we still need to remember that the flags are -a, -d, and -e, and which value each of them expects. Indeed, we might even have some code that checks that the value of environment is either "staging" or "production"!

So what if we turned that on its head? We can use our error checking code to automagically figure out which argument value corresponds to which argument!

formalism

  • you know you need N args, a_0 ... a_(N-1).
    • In the above example, a_0=myapp, a_1=sea2, a_2=production.
  • each arg value a_k has a qualifier predicate q_k that returns true for a value v if v is valid for arg a_k
  • last v wins (e.g. http://a.com + http://b.com prefers the second)
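
To make the formalism concrete, here's a minimal Python sketch of the idea (resolve, its signature, and the predicates dictionary are names I'm making up for illustration, not any existing library):

def resolve(values, predicates):
    # Map each positional value to the argument(s) whose qualifier
    # predicate accepts it. If several values qualify for the same
    # argument, the last one wins.
    resolved = {}
    for v in values:
        for name, qualifies in predicates.items():
            if qualifies(v):
                resolved[name] = v
    return resolved

The examples below just define different predicate sets to feed into it.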

example 1

Suppose we want a markdown link generator, which takes two args:

  • a url (like https://traviscj.com/blog)
    • qualifier predicate: "starts with http"
    • qualifier predicate: "contains ://"
  • a slug (like myblog)
    • qualifier predicate: !url
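
As a sketch (reusing the hypothetical resolve helper above, and folding the two url predicates into a single function):

def is_url(v):
    # "starts with http" and "contains ://"
    return v.startswith("http") and "://" in v

def is_slug(v):
    # anything that isn't a url
    return not is_url(v)

resolve(["https://traviscj.com/blog", "myblog"], {"url": is_url, "slug": is_slug})
# => {"url": "https://traviscj.com/blog", "slug": "myblog"}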

example 2

Another nice example is if our arguments are a combination of app, datacenter, and environment, with values like these:

  • app: a / b / c / ...
  • datacenter: sea1 / sea2 / sea3 / ... / ord1 / ord2 / ord3 / ...
  • environment: staging / production

We can define simple predicates for the latter two:

  • is_datacenter(a) := a.startswith("sea") or a.startswith("ord")
  • is_environment(a) := a in ["staging", "production"]

The app arg is a bit trickier, though, because the engineers might name things however they want! The ideal case would be some internal app that maintains a list of valid applications, but failing that, we could also declare something like

# is_app(a) := not is_datacenter(a) and not is_environment(a)
def is_app(a):
  return not is_datacenter(a) and not is_environment(a)
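
With those three predicates in hand (assuming is_datacenter and is_environment are written as Python functions matching the definitions above, plus the hypothetical resolve helper from the formalism section), the original positional command sorts itself out, no matter what order the values arrive in:

resolve(["myapp", "sea2", "production"],
        {"app": is_app, "datacenter": is_datacenter, "environment": is_environment})
# => {"app": "myapp", "datacenter": "sea2", "environment": "production"}
# resolve(["production", "myapp", "sea2"], ...) gives the same result.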

example 3

md link generator w/ alt text

like example 1, but with one more arg:

  • a link text

    • qualifier predicate: the slug arg exists and this value is longer than the slug
    • qualifier predicate: contains any whitespace
  • test impl in Scratch_kFFXI4

ambiguity resolutions

One trouble we might have is a given argument value qualifying for multiple arguments. When that happens, we can (roughly in order of increasing danger):

  • blow up (see the sketch below)
  • ask for resolution (ie prompt)
  • assume a default value
  • guess
  • loop over possibilities
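
A minimal sketch of the "blow up" option, again building on the hypothetical resolve helper from above:

def resolve_strict(values, predicates):
    # Like resolve, but refuse to guess: fail when a value qualifies for
    # zero or multiple arguments, or when an argument gets no value at all.
    resolved = {}
    for v in values:
        matches = [name for name, qualifies in predicates.items() if qualifies(v)]
        if len(matches) != 1:
            raise ValueError("ambiguous or unqualified value %r: %r" % (v, matches))
        resolved[matches[0]] = v
    missing = set(predicates) - set(resolved)
    if missing:
        raise ValueError("no value qualified for: %s" % sorted(missing))
    return resolved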

other extensions

  • disqualifier predicates

feed sequences

Jan 8, 2019

In the mysql feeds post, I mentioned that the publisher could do

SELECT MAX(feed_sync_id)+1 
FROM kv

to find the next feed_sync_id during the publishing process, but this is actually a really bad idea. (And I knew it at the time, so forgive me for selling lies...)

Republishing

Before we jump into the problematic scenario, I'd like to motivate it with a tiny bit of background.

The republish operation is extremely useful when consumers need to receive updates. It is also extremely simple! A query like

UPDATE kv
SET v = "jim"
WHERE k = "fred";

is just amended to include feed_sync_id = null, as in:

UPDATE kv
SET v = "jim"
  , feed_sync_id = null
WHERE k = "fred";

By updating feed_sync_id in the same query (or at least the same transaction) as the other updates, we can lean on the database's atomicity guarantee: either both fields get updated successfully, or a rollback occurs and the application blocks further processing until an update succeeds. From there, we can guarantee that the feed mechanism will deliver the final update with at-least-once semantics.

at most once?

Consider the following scenario:

t=0: consumer:  initialize with cursor=""
t=1: publisher: KV(k="fred", v="bob", feed_sync_id=None, version=1)
t=2: publisher: KV(k="fred", v="bob", feed_sync_id=1, version=2)
t=3: consumer:  consumes KV(fsi=1, version=2); updates cursor=1
t=4: publisher: KV(k="fred", v="jim", feed_sync_id=None, version=3)
t=5: publisher: KV(k="fred", v="jim", feed_sync_id=1, version=4)
t=6: consumer:  fetches KV w/ fsi > 1; nothing to do!

To summarize, we've republished the last record (by feed_sync_id) after the consumer already consumed it, so the consumer misses the subsequent update entirely!

That is, the consumer does not see version=4 of KV(k="fred")!

sequence-based updates

To rule this out, we need to ensure that the feed_sync_id monotonically increases, even with arbitrarily republished records.

Introduce a sequences table, like:

CREATE TABLE `sequences` (
  id BIGINT(20) NOT NULL AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  value BIGINT(20) NOT NULL,
  version BIGINT(20) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `u_name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

where value stores the maximal feed sync id assigned, for a sequence identified by name. The inclusion of the name field allows this table to support multiple feed publishers from a single database instance. The version column can be used for optimistic locking, or simply omitted.

The publish operation will now need to

  1. find the next kv record to publish

    next_id = sql("""
        SELECT id
        FROM kv
        WHERE feed_sync_id IS NULL
        LIMIT 1
    """)
    
  2. query sequences, like

    seq = sql("""
              SELECT *
              FROM sequences
              WHERE sequences.name = :sn
              """, 
              sn=sequence_name)
    next_fsi = seq.value + 1
    
  3. use next_fsi as the feed_sync_id for the next kv record we publish

    sql("""
        UPDATE kv
        SET feed_sync_id = :fsi
        WHERE id = :id
        """, 
        fsi=next_fsi, 
        id=next_id)
    
  4. finally, advance the sequences.value (assuming we actually assigned next_fsi):

    sql("""
        UPDATE sequences
        SET value = :mf
        WHERE name = :sn
        """,
        mf=next_fsi, 
        sn=name)
    

Critically, (1)-(4) should occur within a single transaction; otherwise a feed fetch for a given cursor might return different records!
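
Putting the pieces together, here is a sketch of the whole publish step in the same sql() pseudo-code style as above (the transaction() helper is hypothetical, standing in for whatever BEGIN/COMMIT mechanism the application uses):

with transaction():
    # (1) find the next kv record to publish
    next_id = sql("SELECT id FROM kv WHERE feed_sync_id IS NULL LIMIT 1")
    # (2) read the sequence and compute the next feed_sync_id
    seq = sql("SELECT * FROM sequences WHERE name = :sn", sn=sequence_name)
    next_fsi = seq.value + 1
    # (3) assign it to the record we're publishing
    sql("UPDATE kv SET feed_sync_id = :fsi WHERE id = :id", fsi=next_fsi, id=next_id)
    # (4) advance the sequence so the next publish starts above next_fsi;
    #     with the version column, this could also add "AND version = :expected"
    #     (and "version = version + 1" in SET) as the optimistic-locking guard
    #     mentioned above, to protect against concurrent publishers.
    sql("UPDATE sequences SET value = :mf WHERE name = :sn", mf=next_fsi, sn=sequence_name)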

pattern: extract mutating variables to State class

Jan 7, 2019

I've come to really like the pattern of extracting local or instance variables into their own State class, and writing an algorithm in terms of that State.

A really trivial (if contrived) example of this would be a "sum cubes" algorithm; something like this:

public class CubeSummer {
  private final PrintStream printStream;
  private long i;
  private long sum;

  public CubeSummer() {
    this.printStream = System.out;
    reset();
  }

  public long sumCubes(long limit) {
    for (; i<limit; i++) {
      sum += i * i * i;
      display();
    }
    return sum;
  }

  public void reset() {
    i = 0;
    sum = 0;
  }

  public void display() {
    printStream.print("i = " + i + ";");
    printStream.println(" sum = " + sum);
  }
}

What's happened here is we've essentially promoted some variables that should arguably be local variables into instance variables, to avoid needing to pass them around explicitly. But in the process, we've also mixed long-lived state with short-lived state, broken thread safety, and forced clients of this class into a particular calling convention (either instantiate an instance per sumCubes call or carefully respect thread safety and call #reset().)

Put a different way: the lifecycles of i and sum are not very clear -- are they scoped to a particular sumCubes operation? Or across many invocations? Or something else entirely? Is this class thread-safe? Who is responsible for resetting the per-request state -- the sumCubes method, or its caller? If it is the sumCubes method, is that reset behavior guaranteed even when something terrible (an exception) happens inside the method?

To make this clearer, we can pull the i & sum variables out into a new state class like

public static class CubeSumState {
  public long i = 0;
  public long sum = 0;
}

With those definitions in mind, our implementation becomes:

public long sumCubes(long limit) {
  CubeSumState state = new CubeSumState();
  for (; state.i < limit; state.i++) {
    state.sum += state.i * state.i * state.i;
    display(state);
  }
  return state.sum;
}

Now the lifecycle of state is very clear -- it's exactly one invocation of sumCubes! By removing the "per-request" state from the CubeSummer class, we've also made it thread-safe and safe to use as a singleton.

We haven't bothered with this here, but if it was useful, we could incorporate validation logic into the setters of our State object to ensure that we never enter an invalid state.

Also note that we're not strictly bound to our custom debugging output anymore: we can replace the display() with something like

public class CubeSummer {
  // ...
  private final Gson gson;

  public void display(CubeSumState state) {
    printStream.println(gson.toJson(state));
  }
}

This generality comes with an immense benefit: the output is precisely a structured log of all the relevant state, which can be used for subsequent (re)interpretation or as a checkpoint for later resumption. It also automatically "adapts" if we introduce new fields into CubeSumState! This approach has similar benefits for debugging, since all state and mutations are focused within a single object.

This example is, of course, a bit artificial: a pair of longs wouldn't normally require a custom object to represent them. But the distinction between state that is

  • long-lived / reused across many invocations
  • short-lived / used for a single invocation

is an extremely useful one in practice.

edit 2019-01-09: thanks Wouter for recommending some great improvements here wrt naming + correctness; nobody who has worked with him would be surprised at his rigor :-)

unreasonable wish -- an sql interface to websites

Jan 6, 2019

I was searching Amazon Prime for some Earl Grey tea for my wife. I got these results

[image: Prime Pantry search results for "earl grey tea"]

This was just a basic search for earl grey tea in the Prime Pantry store.

I would love to

SELECT *
FROM products p
WHERE p.prime_pantry = 1
  AND p.keywords = 'earl grey tea'
  AND p.packaging = 'bulk'

I get that this'll never happen, but man... it'd be nice.

more light, better pictures

Jan 5, 2019

We were trying to take some Christmas pictures but feeling a bit unsatisfied with the way the pictures were coming out:

I wanted some flash or something, but the batteries on all the "better" cameras were already dead, so I went looking for an alternative. We tried to get the overhead light turned on, but couldn't find the switch for that.

The next-closest source of light was a sheet music lamp on my mom's piano! Borrowing that for the cause, I was able to get this one instead:

Pretty happy with that!

watching progress in mysql

Jan 4, 2019

Pretty frequently at work, I end up polling a database with some command like

SELECT MAX(id) FROM my_tbl;
SELECT MAX(id) FROM my_tbl;
SELECT MAX(id) FROM my_tbl;
SELECT MAX(id) FROM my_tbl;
-- .... ad nauseam ....

I've eventually noticed a few patterns I use pretty consistently:

estimate ETA by including NOW()/UNIX_TIMESTAMP()

Generally, the point of hovering over the table is to get an estimate of when it will finish/catch up/whatever. For that, you generally want to include a timestamp in the query output, so when you come back a while later, you can know exactly how much time has elapsed. I might transform the above query into

SELECT
      NOW()
    , UNIX_TIMESTAMP()
    , MAX(feed_sync_id)
FROM kv;

Here, I include NOW() for a human-readable time, UNIX_TIMESTAMP() to make it easier to estimate the duration between a pair of query results by avoiding the need to add in hr*60*60 + min*60, and of course the actual value I'm interested in.

roll the starting point into subsequent queries

If I run

mysql root@localhost:traviscj_development> SELECT NOW(), UNIX_TIMESTAMP(), MAX(feed_sync_id) FROM kv;
+---------------------+------------------+-------------------+
| NOW()               | UNIX_TIMESTAMP() | MAX(feed_sync_id) |
+---------------------+------------------+-------------------+
| 2019-01-03 20:03:18 | 1546545798       | 15                |
+---------------------+------------------+-------------------+
1 row in set
Time: 0.048s

and I know I need to get through a million feed_sync_ids, I might further add a couple of clauses like

1e6 - MAX(feed_sync_id) AS remaining
UNIX_TIMESTAMP() - 1546545798 AS elapsed

which can be combined to estimate an ETA if useful.
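
For example, with two polls in hand, the ETA is just the remaining work divided by the observed rate. A quick Python sketch (the first sample is the query result above; the second one is made up):

t0, fsi0 = 1546545798, 15       # first poll: UNIX_TIMESTAMP(), MAX(feed_sync_id)
t1, fsi1 = 1546546398, 2015     # a later (hypothetical) poll
target = 1e6

rate = (fsi1 - fsi0) / float(t1 - t0)   # feed_sync_ids per second
remaining = target - fsi1
print("eta: %.0f seconds" % (remaining / rate))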

structured lookback in subsequent queries

We can even take it one step further:

CREATE TEMPORARY TABLE poll_metrics (
   metric VARCHAR(255) NOT NULL DEFAULT 'default',
   ts TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
   value DOUBLE NOT NULL DEFAULT '0'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This has some nice properties!

  1. it's a temporary table, so it doesn't interfere with anything else or have very much risk of becoming HUGE or anything like that -- it's not even visible to other clients of that database!
  2. it will automatically populate any fields we leave out (via the DEFAULT clauses), which is handy if we're only gathering a single metric...
  3. but it allows us to collect multiple metrics if we want to!

We can easily populate this from querying a different table, as simply as

INSERT INTO poll_metrics (value) 
SELECT 13*(SELECT MAX(feed_sync_id) FROM kv);

or as explicitly as we want:

INSERT INTO poll_metrics (metric, ts, value) 
SELECT
    "kv_prog"
  , NOW(3)
  , (SELECT MAX(feed_sync_id) FROM kv);

Together, those two queries would result in:

mysql root@localhost:traviscj_development> SELECT * FROM poll_metrics;
+---------+----------------------------+-------+
| metric  | ts                         | value |
+---------+----------------------------+-------+
| default | 2019-01-03 20:30:42.693000 | 195.0 |
| kv_prog | 2019-01-03 20:30:59.090000 |  15.0 |
+---------+----------------------------+-------+
2 rows in set
Time: 0.005s

All of it ends up stored in a table, ready for whatever further querying we need.

personal finance resources

Jan 3, 2019

I wanted to say a tiny bit about personal finance stuff, as I feel like I've gotten a fairly good handle on it.

Some great resources to start getting up to speed include:

index investments

Ideally you want to own a lot of different companies, so you've diversified your asset ownership. One of the easiest ways to do this is through index investing, where the index "tracks" a broad swath of the market, like "All S&P 500 Companies" or "All stock market companies".

My preferred way to do this these days is via the Vanguard funds through E*Trade, like:

These are great funds because of their really low expense ratios (0.04%, 0.04%, 0.05%, respectively) and lack of brokerage fees (so you don't pay E*Trade to put them on!).

E*Trade does have a bit of a learning curve though, even when you're just trading index ETFs: you have to pick the asset allocations by buying and selling individual stocks, you have to do stuff (like buy more stocks!) with your dividends, you need to watch out that you're not buying ETFs with commissions, and so forth. You also need to be making big enough deposits to buy entire shares of the ETFs you want to buy; a single share of VOO, for example, will set you back about $230 as I write this!

All of that is a bit simpler with a "robo-advisor" like Betterment. (Forgive me, this is a referral link! This one is a non-referral link if you'd prefer that.) Here, you simply pick the ratio of stocks vs bonds and a risk tolerance and then Betterment balances your portfolio toward that target, reinvesting dividends in more shares, and efficiently rebalancing and even harvesting tax losses!

side investments

Once you've got a bit of a grasp on the "basics", you can also branch out a bit into some more exotic investments.

LendingClub is one interesting example. They allow borrowers to apply for loans and investors to fund those loans, where a single loan is funded by potentially many investors.

And of course, Coinbase remains a solid bet for your cryptocurrency investment/gambling needs.

final thoughts

My parting advice is to seek out a VERY strong understanding of the tax implications of your trading activity before doing it for real. I did a lot of paper trading that neglected the impact of taxes on the overall return of the strategy or trade; you really need to understand those impacts to make the correct investment decisions for you!

I should also disclose that I hold some long positions in everything I've mentioned here, and that you should consult a paid professional before taking any of the above advice!

espps are free money

Jan 2, 2019

ESPPs give employees an opportunity to buy the company stock at a discount. In both of the examples I'm aware of, the companies give a 15% discount on the LESSER of the price on the grant date and the price on the purchase date. The purchase dates are every six months, while the grants I've seen are either 12 or 24 months.

We can analyze this mathematically by breaking it into three cases. For concreteness, let's look at ADBE for a grant date of 2019-01-02, with the stock currently trading at $224.27/share.

First, let's consider the case of the stock being exactly the same price on the purchase date. The lesser price is then $224.27, and an ESPP enrollee will be able to purchase at $224.27*85% = $190.63. Those shares will still be worth $224.27, though, so the enrollee has made $33.64 per share they purchase.

Next, consider the case of the stock decreasing in value by the purchase date. Again, for concreteness, let's assume it goes down to $200/share. Now the lesser price is $200 and the purchase price is $200*85% = $170. The enrollee still makes $30/share!

Finally, consider the case of the stock increasing in value by the purchase date. For concreteness, let's assume it goes up to $250/share. The lesser price is now the original price of $224.27, so the enrollee can purchase at the $190.63/share price point as in the first case. There's a critical difference, though: each share they purchase is now worth $250, so the enrollee makes $250 - $190.63 = $59.37/share this time!
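
The three cases are easy to double-check with a few lines of Python (a quick sketch, using the same numbers as above):

def espp_gain_per_share(grant_price, purchase_date_price, discount=0.15):
    # Purchase at a 15% discount off the lesser of the two prices; the gain
    # is what the share is worth on the purchase date minus what we paid.
    purchase_price = min(grant_price, purchase_date_price) * (1 - discount)
    return purchase_date_price - purchase_price

for price in (224.27, 200.00, 250.00):
    print("price %.2f -> gain %.2f/share" % (price, espp_gain_per_share(224.27, price)))
# price 224.27 -> gain 33.64/share
# price 200.00 -> gain 30.00/share
# price 250.00 -> gain 59.37/share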

At this point, you're probably wondering: what's the catch? What trick did traviscj pull on me?

There isn't one, but there is an important caveat: this assumes you immediately sell after the purchase occurs. You might or might not want to do this! Several things play into this: Do you value holding the stock? (e.g. you want to collect the dividends, etc) What are the tax implications of selling vs holding? Selling immediately will trigger the short term capital gains tax! (since you haven't held them for longer than a year) Holding for at least a year will qualify the earnings for long term capital gains tax, which is generally more favorable. But this incurs the risk of the stock moving against you within the holding duration.

DISCLAIMER: I am not a tax professional! Please consult one before taking action on this information!

tradeoffs

Jan 1, 2019

Life is (just) a series of tradeoffs.
- DawnAnn "Mom" Johnson

I heard this so many times growing up that I think I actually stopped hearing it. But recently, I realized that it's never been more pertinent than to my current day-to-day life as a software engineer.

The fundamental tradeoff we make on the Risk Systems team is trading off between

  • how much fraud we catch (and symmetrically, miss!)
  • how many false positives we incur to catch that much fraud

Those false positives running rampant can inhibit growth (in terms of interested customers) of the product we're trying to protect, but letting too much fraud through can make a product too expensive (and even liable to be shut down!)

These consequences are pretty dire, but any project with a machine learning underpinning -- and especially those with an adversarial aspect! -- probably finds itself in a similar standoff.

A level below that, there's a more detailed level of distributed systems tradeoffs, like: If our datastore is unavailable, would we rather

  • be unavailable, and perhaps not respond to some client?
  • be available, but permit our own records to be lossy?

Another level even further below that, there's tradeoffs around how to spend the team's time! Should we

  • attempt to overengineer a bulletproof, multi-datacenter, cloud-native solution, but risk never finishing it?
  • ship the smallest thing that could possibly work, and risk having to throw it away later?

There's an interesting meta-tradeoff here, which is: Should we

  • expend extra effort into not committing to a given path? (ie: hedge)
  • not bother, paying that cost later if we need to? (ie: gamble)

I don't have answers for any of these, really. Every option I've listed has been the perfect answer for some project and the death knell for another.

And I think that was always Mom's point: It's just a tradeoff! So

  1. be aware of and open to the opposite approach
  2. be mindful of the tradeoff you're making
  3. be on the lookout for times to seek a different tradeoff

sharded feeds

Oct 29, 2018

Suppose that:

  1. our humble KV feed sees a lot of traffic.
  2. someone needs to consume our KV feed with multiple threads.

data

The first step is to introduce a notion of "shards" into our data model:

ALTER TABLE `kv`
  ADD COLUMN `shard` INT(11) DEFAULT '0',
  ADD INDEX `k_fsi_s` (`feed_sync_id`, `shard`);

publishing

We don't need to change the publishing side until publishing itself becomes too slow for a single thread; multi-threaded publishing introduces a lot of complications, so let's hold off on it for now.

serving

SELECT *
FROM kv
WHERE kv.feed_sync_id > :fsi
  AND kv.shard = :shard
ORDER BY kv.feed_sync_id
LIMIT :limit

which is supported by an API like

/_feeds/fetch/kv?after=3&limit=5&shard=3

We can also support client-side shard configurations with a shard_count argument:

/_feeds/fetch/kv?after=3&limit=5&shard=3&shard_count=4

consumption

We need to update our cursors table to include a shard column as well:

ALTER TABLE cursors
  ADD COLUMN `shard` INT(11) DEFAULT '0',
  ADD COLUMN `shard_count` INT(11) DEFAULT '1',
  ADD UNIQUE KEY `u_name_shard` (`name`,`shard`),
  DROP KEY `u_name`;

data vs consumer shards

We haven't discussed the relationship between the number of threads a particular consumer uses (the shard_count value it provides to the API) and the range of shard values in the data model. Coupling the two together is completely sufficient for a prototype, and can even work well for a surprising amount of traffic!

But at some point, when a system has grown to several consumers, and especially when those consumers need to do different amounts of work or have different locking requirements or performance characteristics, it can become useful to decouple the two.

The basic ideas are to

  1. write the data with a relatively large number of data shards DS (distinct shard values; between 2**9 and 2**12 are common in examples I've worked with),
  2. consume with relatively fewer consumer shards CS (e.g. <=2**5),
  3. query by mapping a given cs_i to a range of [DS_min, DS_max) values (sketched below).
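
Here's a minimal Python sketch of that mapping (the function name and the assumption that DS divides evenly by CS are mine):

def data_shard_range(consumer_shard, consumer_shard_count, data_shard_count):
    # Map consumer shard cs_i to its half-open range [DS_min, DS_max) of
    # data shard values, assuming data_shard_count is a multiple of
    # consumer_shard_count so the ranges divide evenly.
    width = data_shard_count // consumer_shard_count
    ds_min = consumer_shard * width
    return ds_min, ds_min + width

# e.g. with 2**10 data shards and 2**3 consumer shards, consumer 3 owns [384, 512)
print(data_shard_range(3, 2**3, 2**10))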

I've still pulled one punch here: we haven't said anything about what to actually set that new shard field to when writing new records! That'll be in the next post.
