logging
In grad school, I spent a lot of time writing code that read output from nonlinear optimization solvers and tried to do useful things with it. A much better way to do that is called "structured logging", an idea I experimented with a bit during grad school. It has also been coming up in my working life, so I wanted to delve into it a bit deeper. For a quick introduction, check out Thoughts on Logging. For a much longer introduction, see The Log: What every software engineer should know about real-time data's unifying abstraction.
The reason this became interesting in my graduate work was that we had one set of experimental scripts that would output information (like how many QP solver iterations, matrix-vector products, error observed, or whatever) into a flat CSV file. This works great for writing the initial version, because it’s so simple!
But eventually we wanted to do some moderately complex stuff with these data files, so I had some analysis scripts that pulled things out of the CSVs and generated some charts. Later still, we wanted to try a new QP solver without rewriting the analysis scripts, so we just made the new QP solver output the same structure of CSV output.
When we decided that we needed to track something else, we just added it to the CSV output. The analysis would have looked obviously wrong without a matching change to the analysis script, so it was easy to remember to update that as well. But what about the old QP solver? Well, we thought we wouldn't be needing that again, so no sense wasting time updating it!
Of course, history is a cruel friend, and we did end up running the old QP solver again on some new data. And the analysis scripts were silently broken until we noticed and fixed them.
There’s a whole other problem too: We didn’t have any headers in the CSV file, so they were not self-describing. (Probably this was FORTRAN’s fault?)
To give a concrete example, the code might have output something like
it= 5, inner_it= 34 err: 1.2345e-3
or a CSV file like
5,34,1.2345e-3
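To make the fragility concrete, here is a minimal sketch of what reading those two formats back looks like. The field names and the regex are my guesses at what such a parser would contain, not the actual scripts:

```python
import csv
import io
import re

# Parsing the unstructured line needs a format-specific regex that
# breaks the moment anyone tweaks the print statement.
line = "it= 5, inner_it= 34 err: 1.2345e-3"
m = re.match(r"it=\s*(\d+), inner_it=\s*(\d+) err: (\S+)", line)
it, inner_it, err = int(m.group(1)), int(m.group(2)), float(m.group(3))

# The headerless CSV is easier to split, but the meaning of each
# column lives only in the reader's head.
row = next(csv.reader(io.StringIO("5,34,1.2345e-3")))
assert (int(row[0]), int(row[1]), float(row[2])) == (it, inner_it, err)
```

Both parsers work today, and both break silently if a column is added, dropped, or reordered.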
Structured logging applied to this problem, I think, would have the programs outputting some JSON like
{
"iteration_number": 5,
"inner_iterations": 34,
"error": 1.2345e-3
}
In practice, this would be flattened down to a single line:
{"iteration_number": 5, "inner_iterations": 34, "error": 1.2345e-3}
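Emitting one such record per iteration is only a line or two of code. A sketch in Python (the field names follow the example above; the one-object-per-line layout is often called "JSON Lines"):

```python
import json

# Per-iteration record, matching the example fields above.
record = {"iteration_number": 5, "inner_iterations": 34, "error": 1.2345e-3}

# json.dumps produces one compact line, easy to append to a log file;
# json.loads round-trips it back into a dict.
line = json.dumps(record)
parsed = json.loads(line)
assert parsed == record
```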
Of course, all of these options in isolation contain exactly the same data. The differences lie in how resilient to change they are, and how well we could:
- parse output to do some extra calculation on it
- reconstruct output from an old version of the code
One JSON blob per iteration is about the same amount of work to output, and perhaps slightly harder to read back in than a CSV, but much, much easier than the completely unstructured logging. Parsing JSON is extremely easy in almost all languages: Java and Python in particular have great support, and MATLAB has support libraries as well. (No guarantees about FORTRAN.) And as long as you keep your output routine up to date, there'll never be any confusion like "The third column of this CSV looks like a double… I wonder if that's the error, or whether this was that one weird run where the code was outputting CPU time instead?" because the data describes itself.
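That resilience shows up when formats drift. A sketch of reading a mixed log, where an older run never emitted the extra field (hypothetical field names again):

```python
import json

# Two log lines: a newer one with an "error" field, and one from an
# older run that never emitted it.
lines = [
    '{"iteration_number": 1, "inner_iterations": 40, "error": 2.0e-1}',
    '{"iteration_number": 1, "inner_iterations": 38}',
]

records = [json.loads(line) for line in lines]

# .get() turns a missing field into an explicit None, instead of the
# silently shifted column you would get from a headerless CSV.
errors = [rec.get("error") for rec in records]
```

The old records still parse; the missing field is visible rather than silently corrupting the analysis.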
Late in my PhD I took this even further and loaded the structured JSON output into a relational database table (call it iterations). Then I could do things like
SELECT problem_name, AVG(inner_iterations) FROM iterations WHERE iteration_number = 1 GROUP BY problem_name;
which output a table like
|--------------|------------------------|
| problem_name | AVG(inner_iterations)  |
|--------------|------------------------|
| problem_1    | 123.45                 |
| problem_2    | 234.56                 |
| problem_3    | 345.67                 |
| problem_4    | 456.78                 |
|--------------|------------------------|
or run a query like
SELECT iteration_number, AVG(inner_iterations) FROM iterations GROUP BY iteration_number;
which output a table like
|------------------|------------------------|
| iteration_number | AVG(inner_iterations)  |
|------------------|------------------------|
| 1                | 503                    |
| 2                | 434                    |
| 3                | 418                    |
| 4                | 342                    |
| 5                | 309                    |
| 6                | 196                    |
| 7                | 113                    |
|------------------|------------------------|
Disclaimer: This is all made-up data.
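The whole pipeline fits in a few lines. A sketch using SQLite and more made-up data (the iterations table name and all values are illustrative, not from the actual experiments):

```python
import json
import sqlite3

# Made-up per-iteration log, one JSON object per line.
log = """\
{"problem_name": "problem_1", "iteration_number": 1, "inner_iterations": 120}
{"problem_name": "problem_1", "iteration_number": 2, "inner_iterations": 90}
{"problem_name": "problem_2", "iteration_number": 1, "inner_iterations": 130}
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iterations "
    "(problem_name TEXT, iteration_number INTEGER, inner_iterations INTEGER)"
)
for line in log.splitlines():
    rec = json.loads(line)
    conn.execute(
        "INSERT INTO iterations VALUES (?, ?, ?)",
        (rec["problem_name"], rec["iteration_number"], rec["inner_iterations"]),
    )

# SQLite's AVG plays the role of the mean in the queries above.
rows = conn.execute(
    "SELECT problem_name, AVG(inner_iterations) FROM iterations "
    "WHERE iteration_number = 1 GROUP BY problem_name ORDER BY problem_name"
).fetchall()
# rows == [('problem_1', 120.0), ('problem_2', 130.0)]
```

Because the log is self-describing, the loader never has to guess which column is which.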
The point is: Structuring the data made it really easy and much less error-prone to build reports.