Primary Purpose
The primary purpose of the schema redesign is to eliminate what Bill Karwin
calls the “Entity-Attribute-Value” anti-pattern in his book SQL
Antipatterns. This is where we basically store arbitrary dictionaries of
attributes in the various Info tables, which I have since discovered is a
terrible idea for all the reasons Karwin articulates.
The particular difficulty with following this within the other LNT design goals
is that we still have the desire to allow users to report very flexible
structured data.
Currently, the best idea to have to resolve this conflict is that we will
construct tables on the fly. I suspect many SQL experts might also regard that
as an anti- or scary- pattern, but it seems like the best option to me. If one
thinks of LNT as trying to be a general purpose product, then the idea of
creating tables is notionally related to the specialization (instantiation) of
the general purpose product for one’s particular test suites.
However, we certainly also want to limit the degree to which we create or modify
tables. Having the test submission mechanism having to modify the table any time
a user reported a new key would certainly be superflous.
Thus, my current plan is to follow what Karwin calls the “Semistructured Data”
pattern. What we will do is add an arbitrary blob field (to be JSON or perhaps
BSON data). We will basically expect that any fields that are required (or
almost always used) to be put in the actual table schema, but any time we see
additional fields we can handle them by just placing them in the BLOB field.
We will probably allow users to migrate fields to and from the schema. This
gives us a good amount of flexibility (and an easy path to eliminate the JSON
field if need be). We may require users to do this before they can do anything
but just see the data associated with a run. For example, if they want to use
one of the reported fields as an axis.
Secondary Purpose
One additional painful part of the current schema design is that we use separate
tests to represent the status aspect of other tests. This is nice and flexible,
but makes the UI code very painful. Especially, some things like making a graph
of the test values for all tests which passed become incredibly complex.
The plan is to handle this problem by also constructing the Sample tables on the
fly, and allowing the test suite to define the keys that go into a sample. Thus,
any one sample will reflect all of the statistics that were reported for that
test.
- This has many advantages:
- We can start using types for the columns (e.g., easy to start reporting hash
of produced binaries, for example).
- The performance impact of adding new sample values should be much lower than
in the old schema.
- The database explicitly models the fact that sample values were produced from
a single run, whereas before sample values and status could not technically
be correlated correctly.
- We eliminate the need to mangle subtest/test result key information into the
test name, which is a big win.
- The has some disadvantages, however:
- Poorly models suites where different tests reported different test results.
- Poorly models situations where we want to support a large number of test
results and that set might change frequently or dynamically (e.g., suppose we
reported one test for each LLVM Statistic counter).
However, at least for our current usage this scheme should work well enough and
be substantially faster than the old scheme.
This will probably mean that we have to do a bit of work (similar to what we had
to do for parameter sets) to handle what the UI for this should look
like. However, we should have better infrastructure for defining how the UI
should handle things in the metadata.
Other Antipatterns In Use
The “status kind” field uses and suffers from a view of the problems mentioned
in Chapter 11. 31 Flavors. It would probably be good to move being foreign key
references into an auxiliary table. This also reduces some of my reservations
about making that field required / part of every test.
Conveniently, this can also be done without actually changing the status kind
values, which makes migration easy.