Real world data can be difficult to model without adding complexity
that would be out of place in a simple cloud system. This article describes one method to shoehorn
a many-to-one relationship into a simple database, and what happens when it
goes wrong.
Imagine you are the purveyor of an email marketing system which is
effectively a flat file (think spreadsheet like), rather than a more
sophisticated database management tool.
This is fine, simple in concept and implementation, but with limitations
that you can’t really have a clean many-to-one relationship. You could have multiple fields if you want to
record, say, colour options. You could
have a colour1 field, colour2 field etc., or just a logical field for each
colour, so as many fields as there are options.
You could have one large field and put all the selected options in
separated by some character, perhaps a pipe or semicolon. All these might work, but there are
non-trivial issues with consistent data input/import and with selections
(queries) that then return the right data.
If the data consists of logical values (yes/no, on/off, 1/0), then
it is relatively simple to store the values as a string, so if the options are
Red, Orange, Yellow and Blue, you could, for example, store this as “1001” if
Red and Blue have been selected, and Orange and Yellow have not. You would also need a lookup table somewhere
to map selections to their place in this string, but overall
this is space efficient and the queries will run very fast. Even better, this can be stored as a bit
string, which will use about one eighth of the space and run even faster.
Slightly technical paragraph follows: In this instance, our binary
value for Red would be binary 1, (2 to the power 0) and the 1000 for Blue would
have a value of 8 (2 to the power 3), so our storage of Red and Blue, 1001,
would look like the decimal value 9 (8+1). Given that we can assign a value to
our offsets in the bit string, we can actually pass
the value for a particular bit to the check box in the HTML web form used for
either data input or creating a selection query, and then receive that value on
submit if the checkbox is checked. So red’s return value would be 1, Orange 2, Yellow 4, and
Blue 8 (they don’t return anything if they are not checked). Minimal coding required, and all this can be
hidden from the end-user.
As one database field can be
used to store a group of selectable values, we can have multiple fields and
groups, so one for colour, another for size, perhaps another for location, or
any group of selectable values the user cares to add as they customise their
flat file. A standard integer field may be 32 bits long, which should be
enough for almost all purposes. A web page with a group of more than 32
checkboxes is too much.
Updates to such fields should be quick, we just have
to amend bit values to 1 for those selected, and to 0 for those not
selected. Our query form will use the
lookup table to create a mask that will pick out the selected values most
efficiently. Computers are good at
binary. Many to one, done.
Problems start to arise when the user wants to change their group
of values. They wish to add Green and
Violet, and nobody is buying Yellow, so they want to take that out. You could have implemented the lookup table
as just a simple list, which would be appealing as you could then use the same
list order as the presentation order of the options on the screen. Removing yellow, insertions of new options
or a change in order is now a tricky whole file operation, as you need to amend
that group field in every record and shuffle the subsequent bits
along. If you have implemented the
lookup as an index with a numeric offset into the bitstring then you could just
leave yellow to fester by removing the lookup entry, but you may still wish to
bite the bit (ha ha) and tidy up any unused options less you run out of space
later. The programmer behind all this has to cope with fields being removed, fields being added,
possibly some name (field label) changes and changes of presentation
order. Doubtless this will be well
tested, because get it wrong and all those user’s customer’s preferences will
be mis-mapped.
Get it wrong they did. I
have rectified this situation for a customer who was fortunate to have copies
of the data before and after it all got a bit mixed up when they added some new
choices in a location option group. Menorca
was added as the fourth option in the middle of the location set, but all the
existing bit values were not adjusted and Menorca ended
up pointing at what should be the New York values instead. All the rest of the location values were
similarly misplaced down to Zurich, which was then seemingly shunned by all the
customers. Their group of language
choices were also shuffled.
The purveyor of the system declined to accept that their system
had done this, but then they were about to sell the company. They may also have broken data protection
law, at least in the UK. I couldn’t
replicate the issue, so presumably they had fixed the bug, if not the data. What was interesting was the minimal level of
on-line support, and that there was, apparently, no community in which you
could ask if others had seen this happen.
It was as if the application was so good that such facilities were not
required.
Here are some broad lessons one could draw from this:
·
When evaluating systems, check out the web support
pages. If there is no sign of bugs or
feature requests, no community pages with problems and work-arounds,
then caveat emptor.
·
Take and keep your own backups. Insidious issues can damage your data for
some time before you notice. You may need to repair your data using those aged
backups.
·
Normalisation is data science, not rocket science. Nevertheless, the effort and complexity
involved will be too much for someone who just wants to get their mailing
out. When we cheat
we create an anomaly somewhere, and we have probably moved the complexity and
risk into some program code to try to manage this.