PostgreSQL - Not Just Relational
Ryan poses an interesting question in PGSQL Phriday #005, which I think nicely demonstrates the extensibility of PostgreSQL, something I've leveraged a lot on various projects.
Ryan mentions that it was extensions like PostGIS and hstore which brought him to PostgreSQL initially. For me it was TSearch2 (yeah, it was around 2008; it later got merged into core), then taking advantage of PostGIS, PL/Proxy, PGQ, JSONB and more over the years.
At the time we'd been using Oracle's text search to index street names, so that CRM agents could report potholes to the highways team. I'd call Oracle's text search flaky at best and not very feature rich; we had a number of CRM outages due to corrupted indexes which took ages to rebuild. On top of this, the Oracle DBAs feared the license review.
We also wanted to expand the functionality to cover other aspects of our CRM integrations and our intranet in general. I built simple prototypes with both Apache Lucene and PostgreSQL. PostgreSQL ended up winning because it could do other things too and handled concurrency well (at least for us). This was well before the NoSQL buzzword bingo, so the fact that PostgreSQL was a relational database which could also do other things was significant, and still is.
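To give a flavour of the PostgreSQL side of that prototype, here's a minimal sketch using today's built-in full text search (back then it was the TSearch2 contrib module); the table and column names are made up for illustration:

```sql
-- Minimal sketch of full text search over street names; the schema is made up
-- for illustration, and uses today's built-in tsvector support rather than the
-- old TSearch2 module we actually used back then.
CREATE TABLE street (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    search_vec tsvector GENERATED ALWAYS AS (to_tsvector('english', name)) STORED
);

CREATE INDEX street_search_idx ON street USING GIN (search_vec);

-- Match a free-text search from the CRM front end against street names.
SELECT id, name
  FROM street
 WHERE search_vec @@ plainto_tsquery('english', 'high street');
```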
As the project went on, this choice really began to pay off. The real advantage of PostgreSQL is that you can mix and match these diverse features in a single query. This let us do things like combining full text search and geographic search, e.g. find the library nearest to me.
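As a hedged sketch of what that kind of combined query looks like (the table, columns and coordinates here are made up, not the real schema):

```sql
-- Mixing full text search and PostGIS in one query; schema is illustrative.
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE place (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    geom       geometry(Point, 4326) NOT NULL,
    search_vec tsvector GENERATED ALWAYS AS (to_tsvector('english', name)) STORED
);

CREATE INDEX place_geom_idx   ON place USING GIST (geom);
CREATE INDEX place_search_idx ON place USING GIN  (search_vec);

-- "Find me the library nearest to me": text search filters the candidates,
-- the KNN operator (<->) orders them by distance from my location.
SELECT id, name
  FROM place
 WHERE search_vec @@ plainto_tsquery('english', 'library')
 ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-2.5879, 51.4545), 4326)
 LIMIT 1;
```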
Relational Or Not
It's significant that relational is still here and still dominant; I think this is a pretty good indicator that, as a model, it works and works well for lots of people. In reality, I'm struggling to think of a project where our data didn't fit well into a PostgreSQL world.
For the most non-relational data I've had to deal with, PostgreSQL has worked just fine, by leveraging features like arrays, PostGIS, PGQ and JSONB.
I've used PostgreSQL very successfully to store large volumes of time-series IoT data. PostgreSQL had no real issues storing the data or dealing with the volume. We used tricks like arrays to reduce the impact of row overhead. Date ranges were super handy for storing and querying the data.
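A rough sketch of that array and range trick, with an assumed schema rather than the real one:

```sql
-- Hedged sketch of the array / range trick for time-series data: one row per
-- sensor per window, samples packed into an array to reduce per-row overhead.
-- Table and column names are assumptions for illustration.
CREATE TABLE sensor_readings (
    sensor_id bigint    NOT NULL,
    period    tstzrange NOT NULL,   -- the time window this row covers
    readings  numeric[] NOT NULL,   -- one element per sample in the window
    PRIMARY KEY (sensor_id, period)
);

-- GiST index so range predicates (overlaps, contains) stay cheap.
CREATE INDEX sensor_readings_period_idx ON sensor_readings USING GIST (period);

-- Which rows overlap the window I'm interested in?
SELECT sensor_id, period, readings
  FROM sensor_readings
 WHERE sensor_id = 42
   AND period && tstzrange('2023-05-01 00:00+00', '2023-05-01 06:00+00');
```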
An insurance quoting system was probably the least relational and arguably the most 'document' centric. We made heavy use of JSONB, but the data still broke down into a number of key relations. Workflow systems tend to follow the 'document' approach more, but often there is still value in breaking the data into at least a small number of relations.
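A hedged illustration of that 'document plus relations' shape, with a made-up schema:

```sql
-- The quote itself lives in JSONB, but the keys we join and report on are
-- ordinary columns. This schema is invented, not the actual system.
CREATE TABLE customer (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE quote (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customer (id),
    created_at  timestamptz NOT NULL DEFAULT now(),
    document    jsonb NOT NULL      -- the 'document' part of the quote
);

CREATE INDEX quote_document_idx ON quote USING GIN (document);

-- Containment queries on the document, joined back to the relational side.
SELECT c.name, q.id, q.document ->> 'product' AS product
  FROM quote q
  JOIN customer c ON c.id = q.customer_id
 WHERE q.document @> '{"product": "home", "status": "issued"}';
```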
I also find that JSONB is a great modern solution to the dreaded Entity-Attribute-Value (EAV) pattern, which I'm thankful I've not had to touch in a long time now.
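As a sketch of what that swap looks like, assuming an illustrative item table rather than any real schema:

```sql
-- Swapping a classic EAV table for a JSONB column; names are illustrative.
-- Instead of rows of (entity_id, attribute, value), keep the fixed columns
-- relational and push the sparse attributes into JSONB.
CREATE TABLE item (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    attributes jsonb NOT NULL DEFAULT '{}'
);

CREATE INDEX item_attributes_idx ON item USING GIN (attributes);

-- What would have been a self-join per attribute in EAV is a single predicate.
SELECT id, name
  FROM item
 WHERE attributes @> '{"colour": "red", "size": "large"}';
```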
Hazelcast
Hazelcast is the most non-relational data-store that I've used heavily. I first started using it mostly as a distributed caching and messaging layer in a monitoring platform I spent a while working on. Being an in-memory distributed data-store, it's rather different to PostgreSQL, and primarily Key-Value (KV) based.
I also worked on a TV Schedule and Video-On-Demand API layer which used Hazelcast as its main data-store. This system didn't really hold much data, having only 64GiB of RAM across 8 nodes to hold the data and process requests. However, there were a number of strange use cases where aggregates had to be computed per request, consuming almost all of the dataset.
Having all this data in RAM at the API layer helped significantly with these oddities, but that wouldn't preclude using PostgreSQL for this use case. The problem domain was hard to optimise in general, primarily because it had been poorly designed and poorly thought through at the architecture level.
PostgreSQL would have had a number of features we could have used to deal with these complexities. I'm fairly certain it would have been quicker to iterate in SQL than having to hand-code distributed in-memory queries. With PostgreSQL we would also have gained indexes, which I'm sure would have been more efficient than essentially performing every query as an in-memory sequential scan.
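Purely as a hypothetical sketch, this is roughly how one of those per-request aggregates might have looked as indexed SQL; the schema is invented for illustration only:

```sql
-- Hypothetical schedule table; table and columns are assumptions, not the
-- real system's model.
CREATE TABLE schedule_entry (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    channel_id int         NOT NULL,
    starts_at  timestamptz NOT NULL,
    ends_at    timestamptz NOT NULL,
    title      text        NOT NULL
);

CREATE INDEX schedule_channel_time_idx ON schedule_entry (channel_id, starts_at);

-- "How many programmes per channel over the next 24 hours" becomes an
-- index-assisted aggregate rather than a scan over the whole distributed map.
SELECT channel_id, count(*) AS programmes
  FROM schedule_entry
 WHERE starts_at >= now()
   AND starts_at <  now() + interval '24 hours'
 GROUP BY channel_id;
```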
When I first started that project, even with an in-memory data-store, performance was an issue. The team wanted to rebuild the API on MongoDB, driven primarily by a lack of knowledge of Hazelcast and the thinking that MongoDB would somehow magically fix the problems we faced. The reality was that the in-memory data model had been very poorly designed, with very little thought for how it was going to be queried. It took a few weeks to refactor the data model, resulting in 100x performance improvements, and further use of distributed queries and the like improved things even more over the following months.
Thanks
In general, I always think the biggest challenge is understanding the problem domain you are working with; rarely does the perfect solution exist, and there will always be tradeoffs.
I personally prefer to focus on simplicity and the fewest moving parts possible. As such, tools like PostgreSQL are more useful to me, since I can use them for more than one thing.
I also think we get far too hung up on labelling and pigeonholing the tools we use in IT: if it works, what's the problem? There are often significant advantages to using fewer general purpose tools, rather than lots of domain specific tools. Don't forget that nails and screws are two methods of solving the same problem, but each needs a different tool.
You really don't need another database, just use PostgreSQL ;)
Thoughts
Feel free to message me on Mastodon: @intrbiz@bergamot.social