The Victors will be Declarative
The variety of databases used in industry has flourished in the last decade, with a host of new and innovative approaches that go beyond the traditional relational model. Alongside this development there has been a vocal NoSQL movement, which sought to provide data to programmes in a simpler manner, hoping to overcome what is sometimes known as the “object-relational impedance mismatch”.
Many of these new databases, such as graph databases of which DataChemist is one, enable approaches to traversing and understanding data which were extremely difficult or infeasible using standard SQL database management systems. These developments are critical for our increasingly rich and cross-linked world of data.
However, we are at risk of forgetting the lessons of the past, and thereby repeating its mistakes. There was a time when there really was no SQL, and then SQL came along and crushed the competition. It’s worth remembering why SQL won last time, and what that victory should teach us about the future of graph databases, and of databases in general.
When there was no SQL
The first computers were largely calculating machines. For each calculation, information was loaded manually along with its programme, using switch-boards, switches and keypads, or entered on punch-cards. Eventually, however, the volumes of data became large, and storage devices which could hold reasonable amounts of information were developed.
As storage capacities and requirements grew, and data became more permanent, techniques of data storage became increasingly important. At first, almost all data storage management was completely bespoke: each programme or series of programmes managed the meaning and content of its data stores in custom subroutines, organising the data into blocks, fields or records by hand. Older programmers may remember the rather spartan “databases” used in Fortran, known as Common Blocks, in which blocks of memory were shared, sliced and diced between various subroutines:
“Blank COMMON blocks need not be the same length in different program units. However, a named COMMON block must be exactly the same length wherever it appears. This means that some knowledge about how the computer stores information is necessary. That is, the programmer must know how much storage each variable or array takes in order to ensure that the named COMMON blocks are the same length.”
Fortran’s Common Block structure neatly exposes the dangers of leaving the storage model implicit. If two subroutines declare the block with different type ascriptions or different lengths, they will read the same bytes in two entirely incompatible ways. The programmer must remain personally vigilant to keep meaning from turning into garbage.
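The hazard is easy to reproduce in any language that lets two routines disagree about a block’s layout. Fortran isn’t shown here, but the following Python sketch (using the standard struct module; the routine names are purely illustrative) plays out the same failure: one routine packs a shared byte block as integers, and another unpacks the very same bytes as floats.

```python
import struct

# A shared "common block": 12 raw bytes, with no type information attached.
block = bytearray(12)

def writer(block):
    # One routine treats the block as three 4-byte integers.
    struct.pack_into("iii", block, 0, 1, 2, 3)

def careless_reader(block):
    # Another routine, with a different idea of the layout, reads
    # the very same bytes as three 4-byte floats: garbage comes out.
    return struct.unpack_from("fff", block, 0)

writer(block)
print(careless_reader(block))  # tiny denormal floats, nothing like 1.0, 2.0, 3.0
```

Nothing in the shared block itself records which interpretation is correct; only the discipline of the programmers keeps the two routines in agreement.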
As the 1960s closed and the 1970s arrived, there were any number of ways of solving this problem. Sometimes it was solved with relatively simple libraries of subroutines, but a new approach, using purpose-built databases, was coming into fashion.
An example of the new approach was the MUMPS database which was developed at the Massachusetts General Hospital in Boston during 1966 and 1967. This was a hierarchical structured database, allowing the hospital to organise the vast number of datapoints in more manageable and memorable ways, exploiting the fact that their hierarchical data-structure could be used for taxonomic classification. Over time, MUMPS even incorporated ACID (Atomic, Consistent, Isolated, Durable) transactional properties, making it effective as a multiuser persistent data storage platform.
Querying and altering MUMPS databases was achieved with an interpreted programming language which had some shorthands for extracting and modifying data in its hierarchical structure. The MUMPS query language was extremely procedural in character, even including a GOTO statement. The design of the language, coming very early in the history of computer science, was quite ad hoc, and it shows. It was flexible, but too flexible: it allowed neither the database nor the programmer to see or exploit important logical structure.
And then there was light
And then the landscape changed entirely. In 1974 at IBM, Donald Chamberlin and Raymond Boyce, building on ideas developed by Ted Codd, began an implementation of what was to become SQL. Codd’s idea was to use relational algebra to describe data abstractly, in such a way that storage could be decoupled from description and queries could be expressed in a uniform, consistent, mathematical and declarative manner.
This approach meant that data was carefully modelled with a schema, and the data, which conformed to that schema by design, was interrogated with a query. The logical structure was put in the foreground, and the algorithmic structure - the way the programme would actually obtain the data - was left as an implementation detail for the database management system.
Query optimisers could take declarative statements and, using statistics about the relations, their sizes and the shape of the query, create plans which carried out the actual algorithmic extraction of data. These plans were often much faster than a naive developer would manage, far more robust against changes in the schema, and ultimately required less maintenance and less effort to write. Further, the algorithmic properties of relational algebra were carefully explored, leading to approaches with manageable upper bounds on computational complexity - queries that were likely to complete in reasonable time, given reasonable database sizes.
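As a loose illustration of what a planner does - not how any real optimiser is implemented - the Python sketch below expresses a single join declaratively and lets a crude cost rule choose between two physical plans. All relation and function names here are invented for the example.

```python
# Two tiny "relations": employees(name, dept_id) and departments(dept_id, dept_name).
employees = [("alice", 1), ("bob", 2), ("carol", 1)]
departments = [(1, "engineering"), (2, "sales")]

def nested_loop_join(left, right):
    # O(|left| * |right|): perfectly fine for tiny inputs.
    return [(name, dname) for (name, d) in left for (d2, dname) in right if d == d2]

def hash_join(left, right):
    # O(|left| + |right|): builds a hash table on one side first.
    table = {d2: dname for (d2, dname) in right}
    return [(name, table[d]) for (name, d) in left if d in table]

def plan(left, right):
    # A crude cost-based choice, standing in for a real query planner:
    # both plans compute the same relation, only the algorithm differs.
    return hash_join if len(left) * len(right) > 16 else nested_loop_join

result = plan(employees, departments)(employees, departments)
```

The point of the sketch is the separation: the query (join employees to departments on dept_id) is fixed, while the algorithm that answers it is chosen by the system, invisibly to the developer.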
As the 80s and 90s rolled in, while MUMPS, and some of the other procedural data stores of the 60s and 70s still supported some users, they were almost entirely eclipsed by SQL. The advantages of the declarative approach were just too many to ignore and the old style of ad hoc procedural data processing was relegated to the dustbin of history.
Getting it wrong... again
In computing, the one thing that stays the same is that nothing stays the same. In the last two decades the playing field has shifted violently yet again. While many challenger databases are still based on SQL, a host of NoSQL databases are growing in popularity and eating up share of the expanding database market.
Some of these have declarative schemas (MongoDB), some have declarative query languages (Cypher). Very few, if any, have taken the careful approach that SQL realised for relational databases. The integrated simplicity of a model of the data, together with a declarative language over which to query it and conceptually simple computational properties, is virtually absent.
As data complexity increases, it is not clear that there will be a victor as dominant as SQL was in the last wave. However, there will definitely be a winnowing of contenders - there are simply too many vendors, too many options and too much complexity.
Among those, the schema-less variety is doomed: this approach demands procedural manipulation, making data migration extremely difficult and expensive, and causing software to grow into an unmaintainable morass of complex procedures over which it is difficult to reason. Also doomed are those that opt for a schema but imagine it is fine to tack on a randomly assembled programming language as a query interface - a poorly thought-out fix for a grab-bag of problems, with no overarching design philosophy or mathematical clarity.
The victors will have a simple but flexible data model with mathematical precision and polish, and a query language to match, allowing the database management system to deal with the details of implementation on modern hardware - including both parallelism and distribution - unburdening the developer and producing modular, composable and maintainable systems. Graphs with well-defined logical schemas and mathematically precise recursive query languages will certainly be among the more likely contenders.
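A flavour of what such a recursive query language computes can be sketched in a few lines of Python: the Datalog-style reachability rules in the comments below are evaluated as a least fixed point over an edge relation. The edge relation and all names are illustrative only, not any particular product’s query language.

```python
# A small directed graph as an edge relation: a -> b -> c -> d.
edges = {("a", "b"), ("b", "c"), ("c", "d")}

def reachable(edges):
    # Declarative reading, in Datalog style:
    #   reach(X, Y) :- edge(X, Y).
    #   reach(X, Z) :- reach(X, Y), edge(Y, Z).
    # Evaluated here as a least fixed point: keep adding derived
    # pairs until nothing new appears.
    reach = set(edges)
    while True:
        new = {(x, z) for (x, y) in reach for (y2, z) in edges if y == y2}
        if new <= reach:
            return reach
        reach |= new

print(sorted(reachable(edges)))
```

The recursion is bounded and guaranteed to terminate because the result grows monotonically within a finite space of pairs - exactly the kind of computational property a well-designed query language can promise and a bolted-on procedural language cannot.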
The developers of DataChemist have learned from this lesson of history and will be in that number.
November 2nd, 2018