toy/life data models
I’ve been experimenting a lot with some kinda “toy” data models based on random things I wish I had a database to query, but don’t. For example:
- What was my average arrival time this week?
- How much of my equity has vested before a certain date?
- When was the last time we had spaghetti for dinner?
I’ve been doing this with flat JSON files. This is a bit of an odd choice for me; I actually love schematizing data in protobuf and MySQL and designing proper indices for the data models I work on during work hours.
The ones I work on off-hours are a bit different, though. I want to quickly add a new record in if I order something on Amazon or my wife asks me to do something or I think of some quick hack to do something differently. I don’t want to run a “real” database server, because that would mean running a MySQL server on the internet and connecting to it (which complicates or restricts accessing that database in a safe way.) Even if I did, that database (without a ton more work) wouldn’t have multiple staged backups, let alone a solid replication strategy. Because I’m trying to be scrappy with my updates, the possibility of a catastrophic update is very real – there’s nobody around I can ask to review my roll plans.
I want to edit things at work or at home in the morning and see those updates in the other space. I want to be able to roll back to points in time and undo really big commits. I want to do bulk updates – none of these tables have more than like 50 records – in a text editor with a multi-line cursor.
These use cases are actually (surprisingly?) all really well served by JSON + Git/Dropbox. Git gives a certain type of transactionality and time-travel query capability. Dropbox provides a replication mechanism across all of my devices and other folks have done all the hard work of inventing UIs that work appropriately on each given device.
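For example, getting at a past state of one of these files is just a git show &lt;revision&gt;:&lt;path&gt; away; a quick sketch (the revision and path here are only illustrative):
import json
import subprocess

# Ask git for the file as it existed a few commits ago
# (illustrative revision and path; any revision git understands works).
result = subprocess.run(
    ["git", "show", "HEAD~5:data/dinners.json"],
    capture_output=True, text=True, check=True,
)
old_records = json.loads(result.stdout)
print(len(old_records), "records, five commits ago")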
Unfortunately, this setup does mean that querying support is not as developed as in a proper relational store like MySQL or SQLite3.
But that’s okay. jq gets a big part of the way there, and some small Python scripts get most of the rest of the way.
Once there’s a large enough corpus of records and/or enough schema stability to want a “real” database, it’s pretty easy to simply import the JSON records into that database.
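A minimal sketch of that import, using Python’s built-in sqlite3 module (the file, table, and column names here are hypothetical):
import json
import sqlite3

# Load the flat-file "table" (hypothetical path/columns for illustration).
with open('data/dinners.json') as f:
    records = json.load(f)

conn = sqlite3.connect('life.db')
conn.execute(
    "CREATE TABLE IF NOT EXISTS dinners (main_course TEXT, recipe_url TEXT)"
)
conn.executemany(
    "INSERT INTO dinners (main_course, recipe_url) VALUES (?, ?)",
    [(r.get("main_course"), r.get("recipe_url")) for r in records],
)
conn.commit()
conn.close()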
Let’s do a quick example! Periodically, my wife texts me asking what I want for dinner. I wanted to record a small “menu” of these, so I popped open a new file in TextMate, selected JSON mode, and entered a couple of quick entries:
[
  {"main_course": "korean beef"},
  {"main_course": "greek food"},
  {"main_course": "thai food"},
  {"main_course": "meat loaf"}
]
I can get a quick list of just the main courses with a jq command like
$ pbpaste | jq 'map(.main_course)'
[
"korean beef",
"greek food",
"thai food",
"meat loaf"
]
or pick a random entry in Python with something like
import json
import random

with open('data/dinners.json') as dinners_file:
    dinners = json.load(dinners_file)

print(random.choice(dinners)['main_course'])
which might output something like
greek food
If I want a mysql-cli-style table of records, I can add an import tabulate and replace the final line with
print(tabulate.tabulate(dinners))
(after a quick pip3 install tabulate, of course.)
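If you want column headers too, tabulate can pull them from the dict keys; a quick sketch of the standalone version, against the same dinners file:
import json
import tabulate

with open('data/dinners.json') as dinners_file:
    dinners = json.load(dinners_file)

# headers="keys" turns the dict keys (main_course, etc.) into column headers.
print(tabulate.tabulate(dinners, headers="keys"))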
The beauty of this approach is that if we can’t come to an agreement among those options and I start googling around on recipe sites and come across a super-great-sounding tamale pie recipe, I can just add it in along with the recipe link:
$ git diff
diff --git a/data/dinners.json b/data/dinners.json
index 60f8f6b..abb04d2 100644
--- a/data/dinners.json
+++ b/data/dinners.json
@@ -2,5 +2,6 @@
{"main_course": "korean beef"},
{"main_course": "greek food"},
{"main_course": "thai food"},
- {"main_course": "meat loaf"}
+ {"main_course": "meat loaf"},
+ {"main_course": "tamale pie", "recipe_url": "http://www.seriouseats.com/recipes/2015/01/best-tamale-pie-braised-skirt-steak-beef-charred-corn-brown-butter-cornbread-crust-recipe.html"}
]
I didn’t need to run any
ALTER TABLE dinners
ADD COLUMN recipe_url VARCHAR(255) DEFAULT NULL;
or anything like that. If I make a mistake in some query I don’t need to delete and re-insert a new record – I just edit the file. If I shut the lid on my laptop and ride in to work, the new entry is waiting for me on my work laptop when I get there.
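The only real cost is that anything reading these records has to treat recipe_url as optional; in Python that’s a dict.get away. A quick sketch against the same file:
import json

with open('data/dinners.json') as dinners_file:
    dinners = json.load(dinners_file)

for dinner in dinners:
    # Older records have no recipe_url, so .get() falls back to a placeholder.
    print(dinner["main_course"], "->", dinner.get("recipe_url", "(no recipe yet)"))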
These “data models” are also extremely useful as mock data when writing mustache templates or React components.
JSON files definitely aren’t perfect for this use case. The format is a bit verbose (christ, the quotes) and finicky (diff-minimizing trailing commas, where art thou?). As a database, it’s worse in almost every respect than a real relational/KV/document store – there are no ACID guarantees, there’s no command line, everything is slow, common operations are a bit wordy, there’s no phpMyAdmin, etc. But there’s great support in almost every scripting or compiled language for loading JSON, so you have a lot of freedom to put together different parts (“I think I’ll write this transformation in Clojure and that one in Haskell – wait, no, MATLAB!”), and there are no servers to configure and secure, or clients to install.
In similar experiments, I have created some basic MySQL tables or used a local Redis/MySQL/SQLite3/MongoDB. SQLite3 is an incredible, insanely battle-tested piece of software. Redis has a pretty comfortable command line and some great functionality like the HLL support and TTLs on a key-value pair. MongoDB fits a pretty similar “early experimentations” use case and has a great query language. Of course MySQL is also a very solid workhorse, especially with the InnoDB storage engine. They all absolutely have their use cases: nearly everything!! – in contrast to flat JSON files, which are really barely even appropriate for my use cases! – and in most cases any of these would be much better suited for “bigger” data, especially when any concurrency gets involved. But it’s hard to argue with the immediacy of this approach, and it’s easy to get frustrated when Homebrew breaks the development Redis instance on your work laptop or the todos you added last night are sitting in the MySQL on your home laptop.
I hope I’ve convinced you there can be at least some utility in this kind of thing! I’m planning on describing some of the tooling I’ve built up for this, like tools for inserting new records, extracting field sets, adding fields, getting distinct values, aggregations and group-by’s. So stay tuned!
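As a small teaser, most of those end up being a handful of lines over the loaded list; for example, a distinct-values-with-counts (a group-by, basically) sketch over the dinners file:
import collections
import json

with open('data/dinners.json') as dinners_file:
    dinners = json.load(dinners_file)

# Roughly "SELECT main_course, COUNT(*) ... GROUP BY main_course", flat-file style.
counts = collections.Counter(d["main_course"] for d in dinners)
for main_course, n in counts.most_common():
    print(main_course, n)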
appendix: starter ideas (personal)
account_balance_snapshots
: balances held in various accounts on certain dates, because I’m too lazy to keep up with ledger-cli but still get annoyed at how slow Mint is.

donations
: organizations I’ve donated to, dates, and amounts.

interesting_papers
: links to PDFs I want to read but haven’t yet.

travels
: flight confirmation numbers, to/from airports (by IATA code), airlines, flight ids, departure times, arrival times, seat assignments.

stock_{purchases,sales,vestings}
: tables defining stock transactions from equity compensation.
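To make that last one concrete: the “how much of my equity has vested before a certain date?” question from the top of this post becomes a filter plus a sum over a stock_vestings file. (The file name and fields below are entirely made up for illustration.)
import datetime
import json

# Hypothetical stock_vestings records, e.g. {"vest_date": "2016-03-15", "shares": 120}
with open('data/stock_vestings.json') as f:
    vestings = json.load(f)

cutoff = datetime.date(2017, 1, 1)
vested = sum(
    v["shares"]
    for v in vestings
    if datetime.datetime.strptime(v["vest_date"], "%Y-%m-%d").date() <= cutoff
)
print(vested, "shares vested on or before", cutoff)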
appendix: starter ideas (work)
app_nodes
: hostnames, data centers, and environments by application. We have a service discovery, err, service for this, but updates are relatively rare and

team_members
: phone numbers, email addresses, Slack IDs, birthdays, etc. for people on my team. Again, we have some portals at work for (at least subsets of) this data, but it still comes in handy for the rest of the data, and internet/VPN is problematic.