The perfect database

It was the year 2001. I barely started out with PHP and eventually MySQL. A client had this wish, to log and analyze its web traffic (a substantial amount of it), and because of a clustered environment, and also lack of detail with existing solutions the only available solution then was to develop our own analytics package with php and mysql. While at first we didn’t have problems scaling, we have quickly grown the log data and eventually hit barriers as disk space availability, concurrency issues (MyISAM …) and then query speed itself for analyzing data (sum, count, avg …).

A few years back, when we were still trying to save that sinking ship, I actually started to write a mysql-backed statistics collection daemon, which took some of the load out of the computation, by doing in-memory statistics of each log query that passed it, and putting the statistics into a rolling-average database. Sure, it wasn’t 100% reliable counting power outages and wouldn’t pass an ACID test to save it’s life – but it was working with what we had.

While this approach is very technical, it is also very specific. I wasn’t trying to solve the worlds problems – I was trying to save my own. While we eventually gave up (there’s this neat thing now we call Google Analytics), we still do some of our own very specific and detailed analytics. It isn’t much of a problem, when you take a step back and figure out what you really need. For us it was a data size and frequency problem, but we didn’t realize until much later, that the data we collected was mostly irrelevant and will probably never be looked at in retrospective. If data like this drives business decisions, I find that life goes on even if you reduce the data to a fraction of what it once was, and even then it is probably too big for what people need.

Sure, a MapReduce algorithm to process all the data and collect meaningful statistics is useful for almost anyone, but unless you’re analyzing DNA strands or carrier grade network tracing (afaik, there’s some EU directives that force ISP’s to collect about 2 years worth of all ISP traffic) you most probably don’t need what you set out to have – but then again, who doesn’t want a fault-tolerant distributed database with amazing storage, indexing, processing and search performances.

I just find it that reality means you will end up with a desk piled full of post-it notes and will never achieve database nirvana, since the mighty Chtulhu will eat all your disk space, turn all your CPUs into glowsticks and dance on the ashes of your datacentre. That’s about as realistic as “The Perfect Database”.

This article is written as a response on I want a new datastore by Jeremy Zawodny.

While I have you here...

It would be great if you buy one of my books:

Buying a copy supports me writing more about similar topics.

For business inqueries, send me an email. I'm available for consultany/freelance work. See my page for more detail..

The perfect database

While I have you here...

Want to stay up to date with new posts?