Posts

relational java explorations with sqlite3 vtab

May 5, 2017

One idea I have been mulling over lately is exposing a Java codebase relationally, with queries like:

SELECT *
FROM classes c
JOIN annotations a
  ON a.class_id = c.id
WHERE a.name = 'MyAnnotationClass';

The idea here is that you could search a codebase by pretending that you had tables for a whole bunch of things like:

  • classes
  • annotations
  • variables
  • literals
  • methods

This would be useful because instead of relying on an IDE like IntelliJ to do "find usages" operations, you could actually script those interactions. Admittedly, Java has some pretty sophisticated reflection APIs to find this sort of stuff too, but exploring the structure this way seems like it could be much easier than writing Java code to traverse those trees.

I told sharvil about this idea and he immediately responded with a link to SQLite's vtab API. There's a Dr. Dobb's article on using it as well.
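
I haven't built any of this, but to make the idea concrete, here's a toy sketch of a virtual classes table in Python using apsw's virtual-table support. The java-introspection side is entirely hand-waved as a static list, and the module and column names are hypothetical:

import apsw

# Pretend output of some java introspection step (entirely hand-waved).
CLASSES = [(1, "FooService"), (2, "BarController")]

class ClassesModule:
    def Create(self, db, modulename, dbname, tablename, *args):
        return "CREATE TABLE x(id INTEGER, name TEXT)", ClassesTable()
    Connect = Create

class ClassesTable:
    def BestIndex(self, constraints, orderbys):
        return None  # no index support: every query is a full scan
    def Open(self):
        return ClassesCursor()
    def Disconnect(self):
        pass
    Destroy = Disconnect

class ClassesCursor:
    def Filter(self, indexnum, indexname, constraintargs):
        self.pos = 0
    def Eof(self):
        return self.pos >= len(CLASSES)
    def Rowid(self):
        return CLASSES[self.pos][0]
    def Column(self, col):
        return self.Rowid() if col == -1 else CLASSES[self.pos][col]
    def Next(self):
        self.pos += 1
    def Close(self):
        pass

conn = apsw.Connection(":memory:")
conn.createmodule("javaclasses", ClassesModule())
cursor = conn.cursor()
cursor.execute("CREATE VIRTUAL TABLE classes USING javaclasses()")
print(list(cursor.execute("SELECT id, name FROM classes WHERE name LIKE '%Service'")))

A real version would implement BestIndex so that WHERE clauses could be pushed down to the introspection layer instead of always scanning.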

toy/life data models

Apr 19, 2017

I've been experimenting a lot with some kinda "toy" data models based on random things that I wish there were a database to query, but there isn't. For example:

  • What was my average arrival time this week?
  • How much of my equity has vested before a certain date?
  • When was the last time we had spaghetti for dinner?

I've been doing this with flat JSON files. This is a bit of an odd choice for me; I actually love schematizing data models in protobuf and MySQL and designing proper indices for the data models I work on during work hours.

The ones I work on off-hours are a bit different, though. I want to quickly add a new record when I order something on Amazon, or my wife asks me to do something, or I think of some quick hack to do something differently. I don't want to run a "real" database server, because that would mean running a MySQL server on the internet and connecting to it (which complicates or restricts accessing that database in a safe way). Even if I did, that database (without a ton more work) wouldn't have multiple staged backups, let alone a solid replication strategy. Because I'm trying to be scrappy with my updates, the possibility of a catastrophic update is very real -- there's nobody around I can ask to review my roll plans.

I want to edit things at work or at home in the morning and see those updates in the other space. I want to be able to roll back to points in time and undo really big commits. I want to do bulk updates -- none of these tables have more than like 50 records -- in a text editor with a multi-line cursor.

These use cases are actually (surprisingly?) all really well served by JSON + Git/Dropbox. Git gives a certain type of transactionality and time-travel query capability. Dropbox provides a replication mechanism across all of my devices and other folks have done all the hard work of inventing UIs that work appropriately on each given device.

Unfortunately, this does mean that the querying support is not as developed as in a proper relational store like mysql or sqlite3. But that's okay. jq gets a big part of the way there, and some small python scripts get most of the rest of the way. Once there's a large enough corpus of records and/or enough schema stability to want a "real" database, it's pretty easy to simply import the JSON records into that database.
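
For example, here's a minimal sketch of that import for a flat JSON array of records (the file and table names here are just illustrative):

import json
import sqlite3

with open('data/dinners.json') as f:
    records = json.load(f)

# Columns are just the union of keys across all records.
columns = sorted({key for record in records for key in record})

conn = sqlite3.connect('dinners.db')
conn.execute('CREATE TABLE dinners (%s)' % ', '.join(columns))
conn.executemany(
    'INSERT INTO dinners VALUES (%s)' % ', '.join('?' * len(columns)),
    [tuple(record.get(col) for col in columns) for record in records])
conn.commit()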

Let's do a quick example! Periodically, my wife texts me asking what I want for dinner. I wanted to record a small "menu" of these, so I popped open a new file in TextMate, selected JSON mode, and entered a couple of quick entries:

[
  {"main_course": "korean beef"},
  {"main_course": "greek food"},
  {"main_course": "thai food"},
  {"main_course": "meat loaf"}
]

I can get a quick list of just the main courses with a jq command like

$ pbpaste | jq 'map(.main_course)'
[
  "korean beef",
  "greek food",
  "thai food",
  "meat loaf"
]

or pick a random entry in python with something like

import json
import random

with open('data/dinners.json') as dinners_file:
    dinners = json.load(dinners_file)
    print(random.choice(dinners)['main_course'])

which might output something like

greek food

If I want a mysql-cli-style table of records, I can replace the final line with

print(tabulate.tabulate(dinners))

(after a quick pip3 install tabulate and an import tabulate, of course.)

The beauty of this approach is that if we can't come to an agreement among those options, and I start googling around on recipe sites and come across a super-great-sounding tamale pie recipe, I can just add it in, along with the recipe link:

$ git diff
diff --git a/data/dinners.json b/data/dinners.json
index 60f8f6b..abb04d2 100644
--- a/data/dinners.json
+++ b/data/dinners.json
@@ -2,5 +2,6 @@
   {"main_course": "korean beef"},
   {"main_course": "greek food"},
   {"main_course": "thai food"},
-  {"main_course": "meat loaf"}
+  {"main_course": "meat loaf"},
+  {"main_course": "tamale pie", "recipe_url": "http://www.seriouseats.com/recipes/2015/01/best-tamale-pie-braised-skirt-steak-beef-charred-corn-brown-butter-cornbread-crust-recipe.html"}
 ]

I didn't need to run any

ALTER TABLE dinners 
ADD COLUMN recipe_url VARCHAR(255) DEFAULT NULL;

or anything like that. If I make a mistake in some update, I don't need to delete and re-insert a record -- I just edit the file. If I shut the lid on my laptop and ride in to work, the new entry is waiting for me on my work laptop when I get there.

These "data models" are also extremely useful as mock data when writing mustache templates or React components.

JSON files definitely aren't perfect for this use case. The format is a bit verbose (christ, the quotes) and finicky (diff-minimizing trailing commas, where art thou?). As a database, it's worse in almost every respect than a real relational/KV/document store -- there are no ACID guarantees, no command line, everything is slow, common operations are a bit wordy, there's no phpMyAdmin, etc. But there's great support in almost every scripting or compiled language for loading JSON, so you have a lot of freedom to put together different parts ("I think I'll write this transformation in Clojure and that one in Haskell -- wait, no, MATLAB!"), and there are no servers to configure and secure, or clients to install.

In similar experiments, I have created some basic mysql tables or used a local Redis/mysql/sqlite3/mongodb. SQLite3 is an incredible, insanely battle-tested piece of software. Redis has a pretty comfortable command line and some great functionality, like HyperLogLog support and TTLs on key-value pairs. MongoDB fits a pretty similar "early experimentation" use case and has a great query language. MySQL, of course, is also a very solid workhorse, especially with the InnoDB storage engine. They all absolutely have their use cases -- nearly everything!! -- in contrast to flat JSON files, which are barely even appropriate for mine! In most cases, any of these would be much better suited for "bigger" data, especially once any concurrency gets involved. But it's hard to argue with the immediacy of this approach, and it's easy to get frustrated when homebrew breaks the development redis instance on your work laptop, or when the todos you added last night are sitting in the mysql on your home laptop.

I hope I've convinced you there can be at least some utility in this kind of thing! I'm planning on describing some of the tooling I've built up for this, like tools for inserting new records, extracting field sets, adding fields, getting distinct values, and doing aggregations and group-bys. So stay tuned!
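
As a teaser, here's roughly the shape those scripts take -- a sketch, not the actual tooling -- for two of those: distinct values and a count-style group-by:

import collections
import json

with open('data/dinners.json') as f:
    dinners = json.load(f)

# distinct values of a field
print(sorted({d['main_course'] for d in dinners}))

# group-by + count: how many entries mention each main course
counts = collections.Counter(d['main_course'] for d in dinners)
for main_course, n in counts.most_common():
    print(main_course, n)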

appendix: starter ideas (personal)

  • account_balance_snapshots: balances held in various accounts on certain dates, because I'm too lazy to keep up with ledger-cli but still get annoyed at how slow Mint is.
  • donations: organizations I've donated to, dates, and amounts.
  • interesting_papers: links to PDFs I want to read but haven't yet.
  • travels: flight confirmation numbers, to/from airports (by IATA code), airlines, flight ids, departure times, arrival times, seat assignments.
  • stock_{purchases,sales,vestings}: tables defining stock transactions from equity compensation.

appendix: starter ideas (work)

  • app_nodes: hostnames, data centers, and environments by application. We have a service discovery, err, service for this, but updates are relatively rare and
  • team_members: phone numbers, email addresses, slack ids, birthdays, etc. for people on my team. Again, we have some portals at work for (at least subsets of) this data, but a local copy still comes in handy for the rest of the data, or when internet/VPN access is problematic.

_useful

Apr 3, 2017

I often create a directory/file called _useful/_useful.org/_useful.txt. I have one in my Dropbox, for example, that contains:

  1. My apartment lease
  2. My car/motorcycle insurance details
  3. A text file with my vehicle plate numbers/VINs/insurance policy numbers.

At work, I have one with

  1. the top visited links for logs/metrics/admin interfaces for the services I work with most
  2. a list of links of "typical" or "exemplar" things
    • links to our internal tool views for typical payments, merchants, etc
    • typical size (in bytes) of various protobuf messages we use a lot, size of 1M messages, #messages in 1MB/GB
  3. common coding idioms, like several variants of @RunWith that we use in various cases in our test code.
  4. useful commands for doing stuff (curl/SQL/plain old shell)

Plain text is great for all the reasons it usually is. But it's especially useful here (see what I did there?) because the file loads much faster than Google Docs or wiki pages, it's grep-able, it's trivial to copy to a new machine, there's no fuss about futzing with the document to get it to format properly, and so forth.

The naming convention is useful because it naturally sorts lexicographically to the top of most macOS/iOS file lists without using a character that's special to the shell (which would complicate the aforementioned grep-ability).

.sqliterc

Mar 31, 2017

I came across Better output from sqlite3 command line. His .sqliterc file did not work for me, but the simpler

.mode "column"
.headers on

did work nicely.

I also found out that brew install sqlite3 does not install the sqlite3 binary to $PATH (the formula is keg-only, since macOS ships its own sqlite3), which stinks.

The sqlite3 environment is still much less pleasant than the mysql cli. A few things on my wishlist:

  • tab completion of tables, fields, keywords, and functions.
  • nicer .schema output -- I've just gotten really used to reading schemas in mysql's output format

It sounds like maybe apsw can be something useful? Or maybe Navicat? (but that's expensive)

sprat: multiplayer solitaire

Mar 29, 2017

I took a quick pass at describing the rules of a card game I grew up playing, Sprat:

Each player (or team) has one deck of cards. Initial setup is 4 cards face up (the "personal piles"), 13 cards (top card face up, the others face down) in the "sprat deck", and the remaining cards in the "flip deck".

The center of the table is the space for the "ace piles". A new ace pile can be started by any player with any ace; any player can play the next card of the same suit on any ace pile.

Each player can move cards/stacks of cards within their personal piles as long as each card decreases the face value by 1 and alternates color. (solitaire style)

Each player flips through their flip deck 3 cards at a time and can play the top card on any ace pile or their personal piles. (again, solitaire style)

The round is over when the first player eliminates their sprat deck. Rounds usually take 2-15 minutes, with the higher end being very rare for >= 3 players. The round score is then (player's cards in ace piles) - 2*(player's remaining sprat deck).

One game consists of several rounds; the game is over when the first player reaches 100 points.
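
To make the scoring concrete, here's the arithmetic as a couple of (hypothetical) Python functions:

def round_score(ace_pile_cards, sprat_remaining):
    # A player's cards in the ace piles minus twice their leftover sprat deck.
    return ace_pile_cards - 2 * sprat_remaining

def game_over(total_scores):
    # The game ends once any player reaches 100 points.
    return any(score >= 100 for score in total_scores)

print(round_score(20, 4))  # => 12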

I wrote a basic (read: not battle-hardened) implementation of this as a semi-real-time game on github/traviscj/sprat. It's barely good enough to play, but I've had a bit of fun playing it with some (non-local) family!

When I was writing up the code and trying to describe the rules to someone else, I discovered an identical game called Nertz, itself a variant of Canfield.

idea: transaction ordering in ledger-cli

Mar 27, 2017

I love ledger-cli but keeping it in sync with my bank statements drives me crazy. The problem is that the transactions can end up with an essentially arbitrary ordering, and the order they clear (and even the date they clear) is not necessarily under my direct control.

One answer to this is: stop caring about the ordering of your transactions! That's a decent answer, except that not addressing the ordering issue means that you can only ever have "end-of-day" consistency. This means you need a different report to reconcile the transactions.

It also sidesteps some more fundamental concerns:

  • It doesn't help me figure out if a given transaction is already accounted for in the running transaction total.
  • I want my near-term forecast to give me a strong guarantee that I won't overdraft the account.

I had a quick idea about this. What if, instead of having a "clear date", transactions had a "clearing window" -- the earliest date and the latest date that a given transaction is expected to clear?

This doesn't solve the "put it in order" problem, but the software could construct pathological orderings and generate error conditions to show potential cashflow problems.
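
For instance, here's a sketch of what flagging the most pathological ordering could look like (the representation and names are mine, not ledger-cli's): clear every debit at the earliest end of its window and every credit at the latest, then check the running minimum:

from datetime import date

# (amount, earliest_clear, latest_clear); negative amounts are debits.
txns = [(-500, date(2017, 3, 27), date(2017, 3, 29)),
        (+400, date(2017, 3, 28), date(2017, 3, 31))]

def worst_case_low(opening_balance, txns):
    # Pessimistic schedule: debits clear as early as possible, credits as late.
    events = sorted((earliest if amount < 0 else latest, amount)
                    for amount, earliest, latest in txns)
    balance = low = opening_balance
    for _, amount in events:
        balance += amount
        low = min(low, balance)
    return low

print(worst_case_low(200, txns))  # => -300: some clearing order could overdraft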

basic http requests and server handlers in clojure

Feb 1, 2017

I've been playing around a tiny bit with clojure, and wanted to document the process of doing something pretty basic: making a project that can request my website over HTTPS.

The first step is pretty easy:

brew install leiningen
lein new random-experimentation
cd random-experimentation

and add [clj-http "3.4.1"] to project.clj.

The HTTPS part is a bit trickier. The JDK doesn't include the Let's Encrypt certificates. But I found a simple script, install-letsencrypt-in-jdk.sh, that can set them up.

wget https://gist.githubusercontent.com/Firefishy/109b0f1a90156f6c933a50fe40aa777e/raw/15926be913682876ae68bb4f71e489bc53feaae3/install-letsencrypt-in-jdk.sh
sudo bash install-letsencrypt-in-jdk.sh $(...)

With that in place, it's possible to start a lein repl and make a request:

$ lein repl
user=> (require '[clj-http.client :as client])
user=> (client/get "https://traviscj.com/_status.json")
{:request-time 877, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 1}, :streaming? true, :chunked? false, :reason-phrase "OK", :headers {"Server" "nginx/1.10.0 (Ubuntu)", "Date" "Thu, 02 Feb 2017 03:55:23 GMT", "Content-Type" "application/json", "Content-Length" "27", "Last-Modified" "Mon, 01 Aug 2016 02:31:03 GMT", "Connection" "close", "ETag" "\"579eb467-1b\"", "Strict-Transport-Security" "max-age=15768000", "Accept-Ranges" "bytes"}, :orig-content-encoding nil, :status 200, :length 27, :body "{\n  \"site\":\"traviscj.com\"\n}", :trace-redirects ["https://traviscj.com/_status.json"]}
user=> (def status (client/get "https://traviscj.com/_status.json"))
#'user/status
user=> (:body status)
"{\n  \"site\":\"traviscj.com\"\n}"

With a bit more repl-experimentation and judicious reloading, we can make a small script that applies a similar GET to a list of sites:

(ns random-experimentation.core
  (:require [clj-http.client :as client]))

(def sites ["traviscj.com"
            "pauljoos.com"
            "jdbeals.com"
            "thingsworthkeeping.com"
            "bovineherdhealth.com"])

(defn get-status-url [site]
  (str "https://" site "/_status.json"))

(defn get-status-json [site]
  (client/get (get-status-url site) {:as :json}))

(defn get-status [site]
  [site (:status (get-status-json site))])

;; lein run -m random-experimentation.core
(defn -main []
  (println (map get-status sites)))

and use that within a new:

(use 'random-experimentation.core)

or existing:

(use 'random-experimentation.core :reload)

repl session.

I also tried out spinning up an HTTP server and responding to a request. This is also pretty easy; one must add a ring dependency to project.clj and :require an additional namespace

    [ring.adapter.jetty :as j]

and define a handler:

(defn handler [request]
  (println request)
  {:status 200
   :headers {"Content-Type" "text/html"}
   :body (str "Hello World blah!" (:uri request))})

and call j/run-jetty with that handler (and a particular port, of course):

(defn -main []
  (j/run-jetty handler {:port 3001}))

Pretty slick!

use the `locale.currency` library function

Jan 20, 2017

I had stupidly written a pair of functions like this:

def comma_sep(amount):
    thousands = int(amount/1000)
    remainder = int(amount%1000)
    if thousands > 0:
        return comma_sep(thousands)+","+"%.3d"%(remainder)
    else:
        return str(remainder)
def friendly_from_cents(total_cents):
    pos = total_cents >= 0
    total_cents = abs(total_cents)
    cents = int(total_cents%100)
    dollars = int(total_cents/100)
    friendly = "$"+comma_sep(dollars)+"."+str(cents)
    if pos:
        return friendly
    else:
        return "-" + friendly

This is stupid, because python has built-in support to localize currency:

import locale
locale.setlocale(locale.LC_ALL, 'en_US')
locale.currency(-1800, grouping=True)  # note: takes dollars, not cents -- pass total_cents / 100.0

I think it's tempting to fall into these traps, because it always starts simpler: I had some total_cents variable and just wanted to display it quick and dirty. The complexity built up gradually, until I had something pretty complicated. Oops.

More generally, I violated the "batteries included" idiom of Python -- which is something I actually try to follow. Kinda embarrassing.

filter vs spec (draft)

Jan 18, 2017

Consider a silly data model to store data about cities like

message City {
  optional string city_name = 1;
  optional string state = 2;
  optional int32 population = 3;
  optional int32 year_founded = 4;
  // ... presumably others :-)
}

and some sample data like:

[
  {"city_name": "Portland", "state": "OR", "population": ...},
  {"city_name": "Portland", "state": "ME", "population": ...},
  {"city_name": "Springfield", "state": "FL", "population": ...},
  {"city_name": "Springfield", "state": "IL", "population": ...},
  {"city_name": "Springfield", "state": "CO", "population": ...}
]

There are some useful entities we can define: (DRAFT NB: don't read too much into the matcher vs filter lingo.)

  1. a specifier is a data object that tells you how to find the specific instance you are looking for, and lets you answer questions about existence. An example specifier message for our city model might be

    message CitySpecifier {
      required string city_name = 1;
      required string state = 2;
    }
    

    while

    {
      "city_name": "Seattle",
      "state": "WA"
    }
    

    is an example specifier instance.

    Note that I've used required here, which is against protobuf best practices, but makes it clear that CitySpecifiers like {"city_name": "Seattle"} or {"state": "WA"} should be rejected.

    A specifier is what answers the question of whether two things are the same. Another way to look at it is that these are the fields we would want to impose a unique key constraint on.

  2. a matcher is one way to say "Show me all the things with these properties". Suppose the data object is

    message CityMatcher {
      optional string city_name = 1;
      optional string state = 2;
    }
    

    This has four types of queries:

    {}
    {"city_name": "Springfield"}
    {"state": "CO"}
    {"city_name": "Springfield", "state": "CO"}
    

    the first three of which map, respectively, to SQL where clauses like

    WHERE 1;
    WHERE city_name = 'Springfield';
    WHERE state = 'CO';
    

    The fourth presents an interesting predicament: Is

    1. WHERE city_name = 'Springfield' AND state = 'CO';
    2. WHERE city_name = 'Springfield' OR state = 'CO';

    a better choice for the meaning of {"city_name": "Springfield", "state": "CO"}?

    I think the choice is clear: (2) is a bad choice!
    1. We already have a way to represent the set where the name is Springfield (that is, {"city_name": "Springfield"}), and a way to represent the set where the state is CO ({"state": "CO"}). These primitives can be combined either by a disjunction of the where clauses (which indeed gives WHERE city_name = 'Springfield' OR state = 'CO') or by having the database layer use the UNION operation to combine the returned records. So using {"city_name": "Springfield", "state": "CO"} to represent (2) would be redundant functionality.
    2. (2) leaves us with no way of representing the intersection of cities where the name is Springfield and the state is CO. Resolving that requires either application-level code or a temporary-table-type approach that implements (1) anyway, and thus provides no reduction in complexity.

    The where-clause union is best served by introducing a compound matcher, like

    message CompoundCityMatcher {
      repeated CityMatcher city_matcher = 1;
    }
    

    where city_matcher entries are reduced with a logical OR operation.

  3. a filter is a way to say "Show me all the things with these properties". So we might have a data object like

    message CityFilter {
      repeated string city_name = 1;
      repeated string state = 2;
    }
    

    Note that the fields in this object are repeated, so we can include lots of them, like

    {
      "city_name": ["Portland", "Chicago"],
      "state": ["WA"]
    }
    

    It doesn't make any sense for a city to be named both Portland and Chicago, so the clear choice to join query atoms within the same field is a logical OR -- otherwise, all filters with multiple values would return the empty set! By the same logic that we applied to CityMatcher, it is clear that different fields should be joined with AND.

    The key rules are then (see the code sketch at the end of this list):

  • a given filter field matches to something like the SQL city_name IN ('Portland', 'Chicago')
  • multiple filter fields are joined with a logical AND, so our first CityFilter would be city_name IN ('Portland', 'Chicago') AND state IN ('WA').
  • omit the where-clause for empty fields.

    In a similar fashion to the CompoundCityMatcher, we can define a CompoundCityFilter message:

    message CompoundCityFilter {
      repeated CityFilter city_filter = 1;
    }
    

    which is also OR-reduced.

    There's another problem with assuming that two different fields are joined with a logical OR: the number of rows quickly explodes as we start adding more city_name or state values. Even if the data transfer imposed is okay, this pushes the burden of database-type things -- selecting and filtering records -- outside the database layer, where they can't benefit from the indices and query-ability of the database proper.

    There's another (minor) subtlety regarding null vs empty filter fields. Essentially, what do you do with a CityFilter like {"city_name": ["Portland", "Chicago"], "state": []}?

    • WHERE city_name in ('Portland','Chicago') AND state in ()?
    • WHERE city_name in ('Portland','Chicago')?

    TODO: answer this question
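
Here's the sketch promised above: a hypothetical translation of a CityFilter-style dict into a parameterized where clause following those rules. (It arbitrarily treats an empty list like an omitted field, which is one possible answer to the TODO.)

def filter_to_where(city_filter):
    # Build (where_sql, params) from a dict of field -> list of values.
    clauses, params = [], []
    for field, values in sorted(city_filter.items()):
        if not values:  # empty field: no constraint on this field
            continue
        clauses.append('%s IN (%s)' % (field, ', '.join('?' * len(values))))
        params.extend(values)
    return (' AND '.join(clauses) or '1', params)

print(filter_to_where({'city_name': ['Portland', 'Chicago'], 'state': ['WA']}))
# => ('city_name IN (?, ?) AND state IN (?)', ['Portland', 'Chicago', 'WA'])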

The clever will look at the above data models and say: "Ah! But a CitySpecifier is just the same thing as a CityFilter where each field is a singleton list!! We have no need for a CitySpecifier!" While true, the burden is then on all client code that would use the CitySpecifier to include a validation that each field is exactly a singleton. Furthermore, all future programmers reading such code (yourself included!) must continually bear the cognitive overhead of whether a given CityFilter is being used as a CityFilter or as a CitySpecifier, and all future executions of the code must bear either the performance overhead of running that validation, or the risk of violating it.

It's interesting to note that, in spite of the proto+SQL terms in which I've formulated this argument, the same principles hold more generally. See, for example, Django's Complex lookups with Q objects documentation:

Keyword argument queries – in filter(), etc. – are “AND”ed together. If you need to execute more complex queries (for example, queries with OR statements), you can use Q objects.

korean beef recipe

Dec 31, 2016

Ingredients:

  • 1/4 cup brown sugar, packed
  • 1/4 cup reduced sodium soy sauce
  • 2 tsp sesame oil
  • 1/2 tsp crushed red-pepper flakes
  • 1/4 tsp ground ginger
  • 1 tbsp vegetable oil
  • 3 cloves garlic, minced
  • 1 pound ground beef
  • 2 green onions, thinly sliced.
  • 1/4 tsp sesame seeds

Directions:

  1. Mix brown sugar, soy sauce, sesame oil, red pepper flakes, ginger.
  2. Sizzle garlic in the vegetable oil, then add beef and cook until brown.
  3. Stir in soy sauce mixture.
  4. Serve immediately.

Summarized from Korean Beef Bowl on Damn Delicious, itself summarized from Korean Beef and Rice.
