
osquery

Jan 12, 2018

I've known about osquery for a while, but recently spent some time digging around in it.

The basic idea is to provide a consistent SQL interface to a bunch of system data, instead of learning the idiosyncrasies of individual commands (which themselves vary across operating systems). My hacker buddy Sharvil has worked on osquery a fair bit and explained to me a couple years ago that it actually uses SQLite's virtual table functionality to provide the interface -- it's a fascinating and brilliant project!

Concretely, instead of running uptime(1) and getting either

osx ➜ uptime
 8:48  up 17:27, 10 users, load averages: 4.57 4.74 5.08
linux ➜ uptime
 16:52:12 up 192 days,  9:59,  1 user,  load average: 0.00, 0.00, 0.00

you can run:

osquery> SELECT * FROM uptime;
+------+-------+---------+---------+---------------+
| days | hours | minutes | seconds | total_seconds |
+------+-------+---------+---------+---------------+
| 2    | 1     | 18      | 38      | 177518        |
+------+-------+---------+---------+---------------+

We can also query for system information (shown here in line mode):

osquery> .mode line
osquery> SELECT * FROM system_info;
          hostname = traviscj.local
              uuid = [redacted]
          cpu_type = x86_64h
       cpu_subtype = Intel x86-64h Haswell
         cpu_brand = Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
cpu_physical_cores = 4
 cpu_logical_cores = 8
   physical_memory = 17179869184
   hardware_vendor = Apple Inc.
    hardware_model = MacBookPro11,2
  hardware_version = 1.0
   hardware_serial = [redacted]
     computer_name = traviscj
    local_hostname = traviscj

and even perform HTTP requests with the curl table:

osquery> SELECT url, response_code FROM curl WHERE url IN ('https://traviscj.com/_status.json', 'https://pauljoos.com/_status.json');
+-----------------------------------+---------------+
| url                               | response_code |
+-----------------------------------+---------------+
| https://pauljoos.com/_status.json | 200           |
| https://traviscj.com/_status.json | 200           |
+-----------------------------------+---------------+

I tried to reproduce a simple ps wrapper I have:

#!/bin/sh
# a simple script to show java processes without extremely long classpaths
# (awk fields: $2 = pid, $11 = command, $NF = last argument, i.e. the main class)

ps aux | grep '[b]in/java' | grep -v $0 | awk '{ print $2" "$11" "$NF }'

and ended up with

SELECT pid
     , SUBSTR(cmdline, 
         (WITH char_index(i, ch) AS (values(1, '') UNION ALL SELECT i+1, substr(cmdline, i, 1) AS r FROM char_index WHERE r != '')
         SELECT MAX(i) FROM char_index WHERE char_index.ch = ' ')
       ) as java_main_class
FROM processes WHERE name = 'java';

which produces

osquery> SELECT pid, SUBSTR(cmdline, (WITH char_index(i, ch) AS (
    ...>     values(1, '')
    ...>     UNION ALL SELECT i+1, substr(cmdline, i, 1) AS r FROM char_index WHERE r != ''
    ...> )
    ...> SELECT MAX(i) FROM char_index WHERE char_index.ch = ' ')) as java_main_class FROM processes WHERE name = 'java';
+-------+----------------------------+
| pid   | java_main_class            |
+-------+----------------------------+
| 2736  | com.company.app1.FirstApp  |
| 2737  | com.company.app2.SecondApp |
+-------+----------------------------+

This is, to me, some world-class SQL trickery. Stepping through, the WITH ... SELECT clause works like this.

osquery> WITH char_index(i, ch) AS (
    ...>   values(1, '')
    ...>   UNION ALL SELECT i+1, SUBSTR("foo bar baz", i, 1) as r FROM char_index WHERE r != ''
    ...> ) SELECT * FROM char_index;
+----+----+
| i  | ch |
+----+----+
| 1  |    |
| 2  | f  |
| 3  | o  |
| 4  | o  |
| 5  |    |
| 6  | b  |
| 7  | a  |
| 8  | r  |
| 9  |    |
| 10 | b  |
| 11 | a  |
| 12 | z  |
+----+----+

Adding the WHERE char_index.ch = ' ' gets us just the spaces:

osquery> WITH char_index(i, ch) AS (
    ...>   values(1, '')
    ...>   UNION ALL SELECT i+1, SUBSTR("foo bar baz", i, 1) as r FROM char_index WHERE r != ''
    ...> ) SELECT * FROM char_index WHERE char_index.ch = ' ';
+---+----+
| i | ch |
+---+----+
| 5 |    |
| 9 |    |
+---+----+

and of course adding MAX(i) gives the highest such index, 9. My expression then passes that entire thing into a SUBSTR(var, idx) clause to get the last element, and applies it to cmdline instead of a static string.
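As a quick sanity check (my own demo here, not part of the original session), feeding that index straight into SUBSTR picks off the last token:

osquery> SELECT SUBSTR('foo bar baz', 9) AS last_token;
+------------+
| last_token |
+------------+
| baz        |
+------------+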

There's an extensive schema listing on osquery's website describing all the tables available in the latest version of osquery.

profitably wrong

Jan 9, 2018

I've come to realize that my career so far has been built on being "profitably wrong." I think this is interesting because the usual approaches are being "profitably fast" (optimizing) or "profitably better" (improving), and most people think of any kind of wrongness as being a terrible thing. But sometimes the best way to optimize or improve is approximating!

The definition of "profitably" has changed as I've worked on different things, as has the specific type of "wrongness". A couple of specific ways that accepting "wrongness" has been profitable for me:

  • being simpler: approximating very complex neural models with very simple mathematical models
  • being faster: re-using outdated matrix factorizations to still improve a Newton iterate toward a solution
  • being tractable: naive Bayes makes an independence assumption that is demonstrably false, but can still be useful

This type of idea seems to pop up everywhere for me:

  • HOGWILD: a parallel stochastic gradient descent algorithm that gets unreal speed by disregarding classic concurrency correctness mechanisms (read: locks) but works anyway
  • bloom filter: gives up the ability to always answer correctly to instead operate in a fixed, small amount of memory (see the sketch after this list).
  • HyperLogLog: similarly gives an approximate unique count in a fixed, small amount of memory overhead. it's used in redis and at google.
  • p2 algorithm: estimates quantiles without storing all observations.
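
To make one of these concrete, here's a toy Bloom filter in python -- my own minimal sketch for illustration, not any particular library's implementation:

import hashlib

class BloomFilter:
    """A toy Bloom filter: no false negatives, occasional false positives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _indexes(self, item):
        # derive num_hashes bit positions by salting a single hash function
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (salt, item)).encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # False is guaranteed correct; True is only *probably* correct
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter()
bf.add("hogwild")
print(bf.might_contain("hogwild"))      # True
print(bf.might_contain("hyperloglog"))  # False, with high probability

A definite "no" costs a few bit lookups in fixed memory; the price is that a "yes" is only probably right -- exactly the profitable kind of wrong.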

These ideas were all initially hard for me to wrap my head around, coming after several years spent learning how to prove that this sorting algorithm or that one actually sorts a list. But I'm completely sold on their utility, and I've come to see this approach as my own little niche.

DAYDREAM

What if you knew a secret to solving whatever problem you work on? What good is a secret? It's good if it helps you solve that problem better than other people working on it.

What's the worst that happens if you're wrong? What usually happens when you're wrong?

APPROXIMATION

One way to invent secrets is by approximating reality -- inventing a model. It's not good enough to just invent a model, though. That model must have some predictive power.

The key in each of my examples is that predictive power:

  • simple neural models exhibit some properties of much more complex, chemically-based models.
  • a Newton-like algorithm with a stale matrix factorization can still predict a better iterate.
  • naive Bayes, despite its false independence assumption, still extracts enough signal to be useful.

QUANTIFY THE ERROR IN YOUR APPROXIMATION

It's important to understand the errors you might make with your approximation. At a very basic level, you need to understand whether the approximation incurs more cost than it earns, in order to decide whether to use it.

One starting place for this is simply writing down the eventual outcome for false positives and false negatives. Another great approach is calculating an exact solution for an easier (relatively!) problem and comparing against the approximation.

FIND THE APPROXIMATION'S SWEET SPOT

All approximations have some range of validity, outside of which they aren't a reasonable model of reality anymore. Sometimes, they're very well-defined and carefully explored; HOGWILD, for example, needs a certain sparsity structure to avoid destroying the effectiveness of the parallelism with very frequent concurrent writes. Other times, they're simply not well-understood (megacomplex systems under study in medicine, for example) or obscured by an assumption of the model: assuming linearity of some portion of a function, assuming an incompressible fluid, etc. It's important to have some way of knowing whether your predictions/approximations are adding value or simply garbage.

FIX IT LATER / EVENTUAL CORRECTNESS

Sometimes, approximations apply even to problems that seem like they warrant the utmost correctness. Think money movement or automated vehicle control systems. In practice, lots of applications like these really just rely on some notion of eventual correctness. (For varying definitions of "eventual", of course!) When you think from the "what's the worst that could happen" or the "crap, it happened anyway because of something else" angles, it turns out that you'd just issue a refund or steer back into your lane or whatever. And, in fact, the "profitably wrong" mindset might enable you to decide on that corrective action sooner rather than later!

WHEN TO AVOID / WARNINGS

Sometimes, correctness is king. If you're moving money you might be able to tolerate "wrongness" in your "fraud/not fraud likelihood" calculation, but you probably aren't willing to tolerate the same on your balance adjustment method. Even if you're controlling a spacecraft, a single wacky control recommendation might be okay, as long as there's a bulletproof watchdog/contingency plan when you get several in a row. Hopefully, you know which type of system you're working on!

It's tempting to think that because it's only an approximation, you can be sloppy in your implementation of it. But bugs in approximating code are much harder to nail down: did your code make a mistake because of a bug, or because the model (your simplifying assumptions) was wrong? My world-weary, hard-won advice is that approximating code deserves as much rigor as you can muster.

I hope I've inspired you to embrace the idea of having an error rate! I'd love to hear from people about ways you:

  1. have been profitably wrong?
  2. have tried to be profitably wrong and failed?

todobackend.com

Jan 5, 2018

TodoBackend.com defines a simple API and has a frontend that works given any backend URL.

Particularly interesting to me:

dancing bull pricing structure

Jan 4, 2018

Dancing Bull is a great Korean BBQ restaurant that opened nearby recently. They have this pricing scheme:

[photo of the menu]

Note that the unlimited bulgogi is \$23/person, but the bulgogi a la carte is \$25/grill (with each additional order after that costing \$9). This raises an obvious question: which is the better deal?

Spoiler: it depends on both how many people you have and how hungry they are.

First, we had to clarify how it actually worked. It's not clear from the menu, but the "\$25/grill" includes the first order of bulgogi. Introducing $P$ for the number of people, $G$ for the number of grills, and $O$ for the total number of orders, we would pay $$UL = 23\cdot P$$ for the unlimited price scheme vs $$ALC = 25\cdot G + 9\cdot ( O - G)$$ for the a la carte scheme.

Let's take an example: If we have $P=4$ people eating $O=6$ orders of bulgogi (with a single grill), we can pay:

  • unlimited: $23\cdot 4 = 92$ dollars (\$23 per person)
  • a la carte: $25\cdot 1 + 9\cdot 5 = 70$ dollars (\$17.50 per person)

We're interested in finding the breakeven number of orders, beyond which the a la carte order costs more than the unlimited one. If we're feeling less hungry than that breakeven value, we should just order a la carte.

To actually find it, we seek an $O$ that satisfies $ALC = UL$, so $$ALC = UL \implies 25\cdot G + 9\cdot ( O - G) = 23\cdot P \implies O = { {(23\cdot P - 25 \cdot G)} \over 9} + G$$

This is trivial to express in python:

def breakeven_orders(people, grills):
    """
    orders such that unlimited == a_la_carte

    23*people == 25*grills + 9*(orders - grills)
    """
    return (23.*people - 25.*grills) / 9. + grills

and we can use that function to make a table for ourselves:

--------------------------------------------------------------------------------
breakeven orders (orders-per-person) for (#grills, # people)
        G=1               G=2              G=3              G=4                
P=1     >0 (>0.78)        *                *                *                
P=2     >3 (>1.67)        >1 (>0.78)       *                *                
P=3     >5 (>1.96)        >4 (>1.37)       >2 (>0.78)       *                
P=4     >8 (>2.11)        >6 (>1.67)       >4 (>1.22)       >3 (>0.78)        
P=5     >11 (>2.20)       >9 (>1.84)       >7 (>1.49)       >5 (>1.13)        
P=6     >13 (>2.26)       >11 (>1.96)      >10 (>1.67)      >8 (>1.37)        
P=7     >16 (>2.30)       >14 (>2.05)      >12 (>1.79)      >10 (>1.54)        
P=8     >18 (>2.33)       >16 (>2.11)      >15 (>1.89)      >13 (>1.67)        
P=9     >21 (>2.36)       >19 (>2.16)      >17 (>1.96)      >15 (>1.77)        
P=10    >23 (>2.38)       >22 (>2.20)      >20 (>2.02)      >18 (>1.84)        
P=11    >26 (>2.39)       >24 (>2.23)      >22 (>2.07)      >21 (>1.91)        
P=12    >28 (>2.41)       >27 (>2.26)      >25 (>2.11)      >23 (>1.96)        
P=13    >31 (>2.42)       >29 (>2.28)      >27 (>2.15)      >26 (>2.01)        
P=14    >34 (>2.43)       >32 (>2.30)      >30 (>2.17)      >28 (>2.05)        
P=15    >36 (>2.44)       >34 (>2.32)      >33 (>2.20)      >31 (>2.08)        
P=16    >39 (>2.44)       >37 (>2.33)      >35 (>2.22)      >33 (>2.11)        
  • ">X (>Y)" indicates unlimited is cheaper for >X total orders (>Y orders-per-person)
  • Extra grills = faster ingestion!
  • Not sure how many grills unlimited usually comes with. Probably 1 grill/4 people?
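
For the curious, here is roughly how such a table could be generated (a sketch that assumes the breakeven_orders function above; I'm guessing the asterisks mark the cases with more grills than people):

import math

def print_breakeven_table(max_people=16, max_grills=4):
    print("breakeven orders (orders-per-person) for (#grills, # people)")
    print("\t" + "\t".join("G=%d" % g for g in range(1, max_grills + 1)))
    for p in range(1, max_people + 1):
        cells = []
        for g in range(1, max_grills + 1):
            if g > p:
                cells.append("*")  # more grills than people
            else:
                o = breakeven_orders(p, g)
                cells.append(">%d (>%.2f)" % (math.floor(o), o / p))
        print("P=%d\t%s" % (p, "\t".join(cells)))

print_breakeven_table()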

notes & questions

  • I went with 4 friends and ended up getting 7 orders of bulgogi.
  • I went with 6 friends and ended up getting (I think?) 9 orders of bulgogi (with 2 grills for speed).
  • It amazes me how much I didn't (and still don't!) want to do the initial \$25 grill order, even though it's just the constant term and saves money in the long run.
  • Pricing one scheme in terms of grills and another in terms of people feels like a dirty trick, even though it's clear why they're doing that. It makes it much harder to compare sanely with mental/napkin math: it tricks you into thinking a la carte costs more, even though you're not going to buy as many of them.
  • I find the previous point extremely hard to reconcile with the presence of a \$9 additional order. Why does the brain want to skip over that part?
  • An earlier version of this post incorrectly calculated the example $ALC$ value, having used $P-G$ instead of $O-G$. Thanks for pointing it out, Lauren!

everything has a failure rate

Jan 3, 2018

I was happily riding in to work this morning on my motorcycle. Periodically while riding, I'm hit by a spark of terror that I've forgotten to buckle my backpack to the bike's frame -- so frequently, in fact, that I've grown used to reaching behind me to feel for the backpack in a spare second when I don't need to clutch.

Today was different:

I reached for the backpack but felt nothing.

I quickly signaled to pull over to the side and looked over my shoulder:

[photo: the back of the bike, no backpack]

For the uninitiated, it's supposed to look like this:

[photo: the backpack buckled to the bike's frame]

I sped back home on the route I'd taken, mentally taking inventory of all the things in my backpack. My medications, my nice headphones, my work laptop. All the things on the laptop. Thoughts raced through my mind: If I can't find the backpack, who do I need to get in touch with at work to invalidate any credentials stored on it? (I wasn't that worried about this -- the laptops at Square are pretty locked down and we don't really store anything very secret on them -- but working here for nearly 4 years has given me some good operational security practices!) I wonder when the last backup completed and how much (if anything) I'll lose if it's gone. I try to decide whether I need to get in touch with Square's Trust & Security or IT team first, then realize that I'm not even sure I have either phone number stored in my phone.

I remember that my laptop's lock screen has my phone number and email. I hold out hope that my phone will ring or that a morning runner or neighbor will have found it and be waiting at the main entrance of my apartment building. All the while, I'm scanning the road for the backpack.

One of my primary maxims is

Everything has a failure rate.

I realize that this applies even to the process of securing the buckles of the backpack. I kick myself, because I have adopted the shisa kanko practice of pointing to the closed garage door and each buckle before I get on the bike -- but I neglected to do so this particular morning because I had so consciously closed the garage door.

With every turn in the route I become a bit more despondent and resigned. How long is it going to take to set everything up on the laptop again? It's early -- before 7 in the morning -- how soon can I get a replacement?

As I approach my apartment, I hold out hope that the backpack is hiding behind the last car -- perhaps it bumped off as my rear tire rolled over the curb. When that hypothesis is dreadfully disproven, a flash of hope washes over me that perhaps a neighbor recognized my name and put it inside the building before hurrying about their day. But no. There's no backpack I can see in the hallway through the floor-to-ceiling windows framing the glass door.

Fully crestfallen and expecting no relief, I hold out for the last hope -- maybe it fell off when I wheeled the bike out of the garage. It's a long shot, but it's either there or gone. I put the bike in neutral, kick the side-stand down, and jump off the bike. As I walk around to open the garage door, I notice something about the bike doesn't look quite right. And there it is:

[photo of the backpack]

extracting the last few dollars from Visa/MC gift cards

Jan 2, 2018

Once in a while, I get a prepaid gift card like this:

[photo of a prepaid gift card]

They're -- like money -- great! But they do have one downside: it's hard to spend the change once you've spent down most of the value.

Originally this was going to be a quick "just redeem it with Amazon's gift card" post, but a bit of study revealed that it isn't quite that simple:

  1. the authorized amount can prevent the full transaction from going through, so if it's the first transaction at Amazon, you might need to leave $1.00 to cover the reserve.
  2. the article mentions a $5 minimum, but the initial amount & card selection on Amazon doesn't seem to have a problem with lower amounts, so YMMV I guess.

The article also mentions a different solution that I should have thought of: apparently some banks will redeem the value of the card into your account. You just need to show up at a branch of your bank with proper identification.

things you don't learn in grad school

Jan 1, 2018

  1. oncall exists.
  2. the data is never available how you need it.
  3. unit tests matter.
  4. pushing code after 3pm is risky if you're not willing to work all night. (courtesy Paul Joos)

url encoding safe relational operators

Nov 28, 2017

I was catching up on Jesse Wilson's blog this morning, and in particular his post URL Encoding Is Material. In that article, he says

My advice

If you’re defining your own URLs, you’ll save a lot of trouble by avoiding characters like <, >, {, }, +, ^, &, |, and ;.

I came across this problem in one of my projects at work: I wanted a page to display some recent records with 1) a name field provided by the user and 2) a value field that matches $l \leq \text{value} \leq b$, where $l$ and $b$ are given by the user. I also wanted that page to have a URL representing that condition, so that it could be easily shared with teammates and so forth. When I started working on it, I didn't have Jesse's wisdom, and so I naively came up with a URL structure something like

https://service/matches.html#0.5<=theName<=0.6

I reveled in my brilliance and sent a link to one of my co-workers, who promptly told me the page just gave an error. So I clicked the link myself, and saw something like this:

[screenshot: the matches page showing an error]

Something had turned the URL into https://service/matches.html#0.5%3C=theName%3C=0.6!

At this point I came across some similar advice somewhere on the Internet and re-thought the use of <=. I considered separating with slashes or something, but the project actually had two of these conditions in the URL, so that seemed a bit unwieldy:

https://service/matches.html#0.5/nameA/0.6/0.7/nameB/0.8

This has the same feel as a program with a long list of positional arguments -- it requires the user to remember the convention of "lower bound, then name, then upper bound". It's also not very flexible -- it would be much harder to support a condition with < instead of <=, or to completely omit one side of the bound. So I wanted to keep the conditions separated by /, but have something else to denote the separation between the bound and the name.

I initially considered using magic strings like leq/lt for that separation, but this also seemed a bit unsatisfying -- what if the name for some record contained leq or lt? After mulling it over a bit, my Fortran training kicked in. Fortran defines relational operators like

.LT.
.LE.

and so forth. This approach resulted in URLs like:

https://service/matches.html#0.5.leq.nameA.leq.0.6/0.7.leq.nameB.leq.0.8

which, while a bit wordy, worked pretty well. And I was finally able to share links with impunity:

[screenshot: the matches page rendering correctly]
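
For fun, here's a minimal sketch of parsing such a fragment in python (hypothetical helper names, not the actual project's code):

import re

# One condition looks like "0.5.leq.nameA.leq.0.6"; conditions are
# separated by "/". The dotted operators survive URL encoders untouched.
CONDITION = re.compile(
    r"^(?P<lower>[\d.]+?)\.(?P<lo_op>lt|leq)\."
    r"(?P<name>\w+)"
    r"\.(?P<hi_op>lt|leq)\.(?P<upper>[\d.]+)$"
)

def parse_fragment(fragment):
    """Parse '0.5.leq.nameA.leq.0.6/0.7.leq.nameB.leq.0.8' into dicts."""
    conditions = []
    for part in fragment.split("/"):
        match = CONDITION.match(part)
        if not match:
            raise ValueError("bad condition: %r" % part)
        conditions.append(match.groupdict())
    return conditions

print(parse_fragment("0.5.leq.nameA.leq.0.6/0.7.leq.nameB.leq.0.8"))

Keeping lt and leq as alternatives in the pattern is what makes the "condition with < instead of <=" case easy to support later.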

snack buying

Nov 12, 2017

I got this text message from my father-in-law:

Ok. Soda 1.50 Chocolate bar 1.00 Gum 0.10 Jelly Bean .05

Have to buy exactly 14 items and spend $10

At least one of each.

There are 4 diff combos. (Order: soda, chocolate, gum and jelly bean)

Got 5-2-3-4 3-5-4-2

Can’t get others. Any ideas? :)

I whipped up this solution to check in python:

#!/usr/bin/python3
# created_at: 2017-11-13 13:53:19 -0800

def main():
    solutions = 0
    # prices are in cents; upper bounds are the most of each item $10
    # could buy: 6 sodas, 10 chocolate bars, 100 gums, 200 jelly beans
    for si in range(1,7):
        for ci in range(1,11):
            for gi in range(1,101):
                for ji in range(1,201):
                    total = si*150 + ci*100 + gi*10 + ji*5
                    items = si + ci + gi + ji
                    if total == 1000 and items == 14:
                        print(si,ci,gi,ji, total, items)
                        solutions += 1
    print(solutions)

if __name__ == "__main__":
    main()

The core of this solution is the check

total = si*150 + ci*100 + gi*10 + ji*5
items = si + ci + gi + ji
if total == 1000 and items == 14:
    print(si,ci,gi,ji, total, items)
    solutions += 1

in the innermost for-loop. It checks that we spent exactly $10 (1,000 cents) and got exactly 14 items.

I used 7, 11, 101, 201 as the (exclusive) upper bounds here to limit the search, because if we bought purely one item, we'd be limited to 6, 10, 100, 200 of each product, respectively. This implies that we search $$6 \cdot 10 \cdot 100 \cdot 200 = 1,200,000$$ possible solutions! This was based on an initial misreading of the question -- I didn't notice the "14 items" constraint. If we restrict the upper bound to 14 for the gum & jelly bean counts, we only have to examine $$6 \cdot 10 \cdot 14 \cdot 14 = 11,760$$ solutions!

In any case, the above code outputs:

3 5 4 2 1000 14
5 2 3 4 1000 14
2

so his original two solutions are the only two!
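
As an aside, the tighter bound makes the brute force almost instant. Here's a sketch of the same check with itertools (equivalent logic, not the original script):

from itertools import product

def solutions():
    # the 14-item constraint caps the gum and jelly bean counts at 14
    for si, ci, gi, ji in product(range(1, 7), range(1, 11),
                                  range(1, 15), range(1, 15)):
        if si*150 + ci*100 + gi*10 + ji*5 == 1000 and si + ci + gi + ji == 14:
            yield (si, ci, gi, ji)

print(list(solutions()))  # [(3, 5, 4, 2), (5, 2, 3, 4)]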

I was thinking about trying to make a less brute-force solution with octave's linear programming routines, like:

A = [
  150, 100, 10, 5;
    1,   1,  1, 1;
]
b = [
  1000;
  14;
]
lb = [1,1,1,1]; ub = [inf, inf, inf, inf];

glpk([1,0,0,0], A, b, lb, ub, 'SS', 'IIII')
glpk([-1,0,0,0], A, b, lb, ub, 'SS', 'IIII')

which returns

octave:55> glpk([1,0,0,0], A, b, lb, ub, 'SS', 'IIII')'
ans =
   3   5   4   2
octave:56> glpk([-1,0,0,0], A, b, lb, ub, 'SS', 'IIII')'
ans =
   5   2   3   4

This is a bit unsatisfying because we had to tweak the objective function to get the two different solutions, so there's no proof that we can't get more solutions, which is what the original problem was asking. But it's still a nice way to get some initial solutions.

We can also modify the problem definition to check if there are any solutions with 4 sodas:

A = [
  150, 100, 10, 5;
    1,   1,  1, 1;
    1,   0,  0, 0;
]
b = [1000;14;4];
glpk([0,0,0,0], A, b, lb, ub, 'SSS', 'IIII')'

which returns

octave:51> glpk([0,0,0,0], A, b, lb, ub, 'SSS', 'IIII')'
ans =
    NA    NA    NA    NA

so we can conclude that there aren't any. We can establish the same for 1 and 2 sodas.

We could do better with a custom tree search algorithm, but I'll save that for another post :-)

even the ceo doesn't get an office

Nov 6, 2017

Square is a fantastic place to work, and has been for my entire time here. One thing I'm not such a fan of, though, is the open office layout we have -- and share with many other startups.

Somehow I have the distinct memory of thinking, "Wow, even the CEO doesn't get an office here! That's really cool!" I had an important realization today, though. It's not that the CEOs of companies with open offices don't get an office -- it's that they don't want one.

