filter vs spec (draft)
- 5 minutes read - 982 wordsConsider a silly data model to store data about cities like
message City {
optional string city_name = 1;
optional string state = 2;
optional int32 population = 3;
optional int32 year_founded = 4;
// ... presumably others :-)
}
and some sample data like:
[
{"city_name": "Portland", "state": "OR", "population": ...},
{"city_name": "Portland", "state": "ME", "population": ...},
{"city_name": "Springfield", "state": "FL", "population": ...},
{"city_name": "Springfield", "state": "IL", "population": ...},
{"city_name": "Springfield", "state": "CO", "population": ...}
]
There are some useful entities we can define: (DRAFT NB: don’t read too much into the matcher vs filter lingo.)
-
a specifier is a data object that tells you have to find the specific instance you are looking for, and lets you answer questions about existence. An example specifier message for our city model might be
message CitySpecifier { required string city_name = 1; required string state = 2; }
while
{ "city_name": "Seattle", "state": "WA" }
is an example specifier instance.
Note that I’ve used
required
here, which is against protobuf best practices, but makes it clear thatCitySpecifier
s like{"city_name": "Seattle"}
or{"state": "WA"}
should be rejected.A specifier is what answers the question of whether two things are the same. Another way to look at it is that these are the fields we would want to impose a unique key constraint on.
-
a matcher is one way to say “Show me all the things with these properties”. Suppose the data object is
message CityMatcher { optional string city_name = 1; optional string state = 2; }
This has four types of queries:
{} {"city_name": "Springfield"} {"state": "CO"} {"city_name": "Springfield", "state": "CO"}
the first three of which map, respectively, to SQL where clauses like
WHERE 1; WHERE city_name = 'Springfield'; WHERE state = 'CO';
The fourth presents an interesting predicament: Is
WHERE city_name = 'Springfield' AND state = 'CO';
WHERE city_name = 'Springfield' OR state = 'CO';
a better choice for the meaning of
{"city_name": "Springfield", "state": "CO"}
?I think the choice is clear: (2) is a bad choice!
-
We already have a way represent the set where the name is
Springfield
(that is,{"city_name": "Springfield"}
), and a way to represent the set where the state isCO
({"state": "CO"}
). These primitives can be combined either by a union of the where-clause (which indeed givesWHERE city_name = 'Springfield' OR state = 'CO'
) or the database layer can use theUNION
operation to compute the union of returned records. So using{"city_name": "Springfield", "state": "CO"}
to represent (2) would be a redundant functionality. -
(2) leaves us with no way of of representing the intersection of cities where the name is
Springfield
and the state isCO
. Resolving requires either application-level code or a temporary-table-type approach that implements (1) anyway, and thus provides no reduction in complexity.
The where-clause union is best served by introducing a compound matcher, like
message CompoundCityMatcher { repeated CityMatcher city_matcher = 1; }
where
city_matcher
entries are reduced with a logicalOR
operation. -
a filter is a way to say “Show me all the things with these properties”. So we might say have a data object like
message CityFilter { repeated string city_name = 1; repeated string state = 2; }
Note that the fields in this object are
repeated
, so we can include lots of them, like{ "city_name": ["Portland", "Chicago"], "state": ["WA"] }
It doesn’t make any sense for a city to be named both
Portland
andChicago
, so the clear choice to join query atoms within the same field is a logicalOR
– otherwise, all filters with multiple values would return the empty set! By the same logic that we applied toCityMatcher
, it is clear that different fields should be joined withAND
.The key is then:
-
a given filter field matches to something like the SQL
city_name IN ('Portland', 'Chicago')
-
multiple filter fields are joined with a logical
AND
, so our firstCityFilter
would becity_name IN ('Portland', 'Chicago') AND state IN ('WA')
. -
omit the where-clause for empty fields.
In a similar fashion to the
CompoundCityMatcher
, we can define aCompoundCityFilter
message:message CompoundCityFilter { repeated CityFilter city_filter = 1; }
which is also
OR
-reduced.There’s another problem with assuming that two different fields are joined with a logical
OR
. The number of rows quickly explodes as we start adding in morecity_name
orstate
. Even if the actual data transfer imposed is okay, this puts the burden to add some logic outside the database layer that does database-type-things of selecting and filtering records, but can’t benefit from the indices and query-ability of the database proper.There’s another (minor) subtlety regarding null vs empty filter fields. Essentially, what do you do with a
CityFilter
like{"city_name": ["Portland", "Chicago"], "state": []}
?WHERE city_name in ('Portland','Chicago') AND state in ()
?WHERE city_name in ('Portland','Chicago')
?
TODO: answer this question
The clever will look at the above data models and say: “Ah! But CityFilter
is just the same thing as a CitySpecifier
where each field is a singleton list!! We have no need for a CitySpecifier
!”
While true, the burden is then on all client code that would use CitySpecifier
to include a validation that each field is exactly singleton.
Furthermore, all future programmers reading such code (yourself included!) must continually bear the cognitive overhead of whether or not a given CityFilter
is being used as a CityFilter
or a CitySpecifier
, and all future executions of the code must bear either the performance overhead of running that validation, or the risk of violating it.
It’s interesting to note that, in spite of the proto+SQL terms in which I’ve formulated this argument, the general principals hold more generally. See, for example, Django’s Complex lookups with Q objects documentation:
Keyword argument queries – in filter(), etc. – are “AND”ed together. If you need to execute more complex queries (for example, queries with OR statements), you can use Q objects.