The Elephant In The Room
We have often highlighted in the pages of this magazine the seemingly never-ending sequence of fashion waves hitting the software industry. We have discussed in 2020 the one impacting type systems, and more recently, another one concerning object-oriented programming. What is important in both cases is not the details of the technologies per se, but the social mechanisms at play in both situations. In this edition, we will encounter yet another example of such a clash, this time concerning relational database management systems, or RDBMS for short.
Because the 2010s were also the moment in which a whole generation of JavaScript-addicted startups decided to ditch decades of knowledge into the garbage bin, deciding that for whatever reason, storing chunks of JSON into a non-transactional database exposing port 27017 was the way to the future. Spoiler alert: it was not, and the 2020s appear to show, at least at first glance, a return to sanity.
Not only that, but if we pay attention, it seems like the 2020s is already the decade of PostgreSQL. Which, let us be honest, is good news, no matter how you look at it.
The best qualifier that this author could find to define PostgreSQL and its historical impact is the word “triumph.” Growing in the shadow of many more commercially successful systems, PostgreSQL has stood both the test of time and the Wisconsin Benchmarks, becoming the reference yardstick for a whole category of products. Its reliability, dependability, speed, and efficiency, are all out of the charts and add up to what is undoubtedly one of the most refined software packages available in the open-source world today.
To add insult to injury (injury to other commercial and even open-source competitors, that is), PostgreSQL is free-as-in-beer, and free-as-in-freedom, and even better, there is no dystopian corporation driving it from behind. (At least not yet. Fingers crossed.)
Looking back to the road that leads to the podman pull
command that brings a fresh copy of PostgreSQL 16 to your computer, the first thing that strikes this observer is how short it is, particularly when compared to other branches of the software field. The concept of modern RDBMS is at least 20 or 30 years younger than the rest of the industry. IBM engineers were very much involved in its creation, but as it was often the case throughout history, IBM (the corporation) was among the last to benefit from it.
The RDBMS field is so young, we can actually see it grow through the pages of computer magazines of the 1980s. BYTE Magazine had its first issue dedicated to databases in November 1981, and then another one in October 1984. Dr. Dobb’s Journal did not feature an article about databases until 1984 (“Solutions to Quirks in dBASE II” by Gene Head, on page 353 of Volume 9, number 91) and did not have many more throughout the decade; actually most of them were authored by the same Gene Head, and talk about the same dBASE program.
(It is hard to understate the importance that Ashton Tate’s dBASE had in the world of CP/M and MS-DOS microcomputers. On page 234 of the aforementioned October 1984 issue of BYTE we see it featured in a comparison table among countless other database systems whose names history has long forgotten since. The saint office trinities of WordStar, Lotus 1-2-3, and dBASE, followed by the 1990s’ GUI-based equivalents of WordPerfect, Quattro Pro, and Paradox, were all crushed by the hegemony of Microsoft Word, Excel, and Access before the turn of the millenium.)
The concept of RDBMS was born, as the legend says, with Edgar “Ted” Codd, of whom in 2023 we simultaneously commemorated 100 years of his birth (August 19th, 1923) and the 20 years of his death (April 18th, 2003). He proposed a radical idea in 1970: separating the format in which data is stored from that in which users access it. Shocking. Or, if you look at it closely, another example of the adage that says, “No problem resists an intermediate layer of abstraction.”
(Another part of the legend states that Larry Ellison read Codd’s paper, had an epiphany, and single-handedly wrote the first version of Oracle. Kids: read more scientific papers.)
Contrary to public belief, Codd was not the inventor of the SQL language per se; his paper (“A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, Vol. 13, No. 6, June 1970) describes the relational model in mathematical terms inspired by Cantor‘s set theory, only suggesting a query language called “Alpha” and, making happy a whole generation of computer scientists, not providing any implementation details. The invention of SQL itself must be traced back to other two IBM engineers, Don Chamberlin and Ray Boyce.
Federico Biancuzzi and Shane Warden interviewed Don Chamberlin for their 2009 O’Reilly book “Masterminds of Programming”, where he describes on page 227 the work that led to the language:
Based in part on our earlier experience with the “query game,” Ray and I developed an initial proposal for a relational query langugae named SEQUEL (an acronym for “Structured English Query Language”), and published the proposal in a 16-page paper at the annual ACM SIGFIDET (precursor to SIGMOD) conference in May 1974 (the same conference that featured the famous debate between Ted Codd and Charles Bachman.) Shortly after publication of that initial paper, Ray Boyce died suddenly and tragically from the effects of a brain aneurysm.
(Chamberlin’s fascination with data query languages would take him to author one of the first drafts of XQuery, a query language for XML, at the end of the 1990s.)
Codd’s paper inspired intelligent people inside and outside of IBM: inside, it begat System R. Outside of IBM, at the now legendary University of Berkeley in California, another team led by a certain Michael Stonebraker would use a PDP/11, the Unix operating system, and the C programming language, to create a RDBMS they called “INGRES” (yes, it was an acronym.)
Spoiler alert: the “gres” particle in both words “Ingres” and “PostgreSQL” are related. Stay with me for a minute.
During the second half of the 1970s there were, then, two major languages you could use to query databases: SQL inside IBM, and QUEL outside (based on Codd’s Alpha). Why did then PHP programmers not have to worry about QUEL injection attacks in 2002? Two reasons: (first) IBM decided to commercialize System R as a product called DB/2. The strength and impact of IBM, coupled with (second) its standardization by the ANSI committee in 1986, made SQL a de facto standard overnight (something similar was happening with the IBM PC more or less at the same time) leaving INGRES in an offside position against its bigger competitors, Larry Ellison’s Oracle and (to a lesser degree) Informix.
But RDBMS require more than just a query language; C. J. Date introduced the concept of referential integrity in 1981, and Jim Gray the idea of transactions and ACID in 1983–no, not the acid that Steve Jobs was recommending to Bill Gates to drop, but the one you read about in database books, about atomicity, consistency, isolation, and durability.
By that time, however, Stonebraker was working on the successor of INGRES, a product aptly prefixed with the “Post” moniker. The foresight of Stonebraker was to extend its support of data types and referential integrity rules to other things than just strings and numbers, thus introducing abstract data types into the mix. This opened the door to one of the most groundbreaking and useful features of Postgres, its management of geographical and image data.
Fast-forward to the 1990s, and Postgres became PostgreSQL in 1995, and then got its own Geocities-inspired website in 1996 because the Internet was the next big thing.
At the beginning of the 2000s, PostgreSQL was still the lesser database, as demonstrated by the popularity of the “LAMP” stack, whereby the “M” stood for MySQL. We have to thank modern capitalism for raising PostgreSQL to prominence: Sun Microsystems bought MySQL in 2008, and Oracle swallowed Sun Microsystems in 2010. By doing so, Oracle effectively owned MySQL, which triggered two major events: the fork of MariaDB, and PostgreSQL seizing the moment.
Let us not pay attention to all those developers screaming since 2010 that NoSQL was the future and that relational databases sucked (apparently that is the technical term). During the past decade there has been too much irrationality and wheel reinvention going on; in short, a new world disorder. Instead, let us do a per saltum and arrive to our post-pandemic, global-warming, smartphone-filled, and AI-fueled world, to realize that PostgreSQL has won.
- SQLite, the most widely deployed embedded RDBMS on the planet, claims to be inspired by PostgreSQL.
- Kubernetes administrators can deploy and manage PostgreSQL instance with various operators, such as CrunchyData, StackGres, or CloudNative-PG. They can even replace the etcd default store with PostgreSQL.
- CockroachDB and YugabyteDB claim to be “a drop-in replacement” or a “protocol-compatible” database with PostgreSQL.
- The FerretDB team built a MongoDB-compatible database on top of PostgreSQL.
- Startups like Steampipe and Timescale use PostgreSQL.
- If machine learning is your thing, then you should check out PostgresML or Google’s own AlloyDB.
- Some people use it as a “pub/sub” queue, similar to what Kafka, RabbitMQ, or even Redis offer.
- InfoWorld claims that PostgreSQL is everywhere. If InfoWorld says it, it must be true.
- And if all of this is not enough, there are long lists of PostgreSQL-derived databases available online.
As much as I think that this enthusiasm for PostgreSQL is a good thing, I am pretty certain that Mr. Stonebraker does not entirely agree. One of the mantras he repeats in every interview and in every paper is, “one size does not fit all.” It is important to tailor database engines to individual situations and problems, something he himself has done in his subsequent work in the past 25 years.
These days, the term “database” refers to a large array of problems, paradigms, and products: full-text, data warehouses, data lakes, cloud databases, vector databases, column-oriented databases like ClickHouse, OLAP vs OLTP, embedded, serverless, key-value stores like Berkeley DB or LotusDB, wide column, mobile, document-oriented, relational, graph, search engine, multimodel, etc…
Even NoSQL databases have found their niche: come to mind the canonical examples of Redis as a session and cache store for web applications, the über-popular MongoDB, the venerable Realm Atlas, and the ubiquitous browser-based IndexedDB API (although I still think the Web SQL database API was undeniably superior.)
After 50 years of dramatic evolution, database systems have morphed into a variety of shapes, each suitable to different kinds of problems. Having said that, it is hard to deny the fact that PostgreSQL is an excellent choice for a long list of business endeavors; it is no surprise that it enjoys its current success, and it is hard not to recommend it in most cases.
Cover photo by mana5280 on Unsplash.