MySQL optimization and diagnostics tactics

We have all been there: that server with mysterious load spikes, which come from nowhere and leave just as they came - quickly and unexplained. Sometimes these problems can be explained by slow queries, but given enough time, development, and optimization, you will run into scenarios that don't register in the slow query log.

It is my opinion that any kind of database load problem can be traced back to development. It's a general extension of "You are writing unoptimized queries" to "You're calling your optimized queries too often", or maybe "You're generating too much traffic to the database", or "The code you wrote does not scale to the number of users you have". But picking up on those problems in production systems with high load and high availability requirements is tricky.

My current best option is using tcpdump to capture traffic in specific intervals, especially when the database is experiencing high load I wish it didn't. But saving tcpdump data is only part of the solution, a first step. This is where Percona Toolkit comes in. With its pt-query-digest, the tcpdump capture can be parsed to find out which operations cause the load on the database. The queries are ranked by their total execution time, and the report includes how many times each query was called and how long it took to run during the time tcpdump was capturing.
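As a concrete starting point, here is a minimal sketch of that workflow, using the capture flags suggested in the pt-query-digest documentation; the interface name, packet count, and file paths are assumptions you should adjust to your environment:

```bash
# Capture MySQL traffic on port 3306 with full packets (-s 65535),
# hex output (-x) and precise timestamps (-tttt), which is the format
# pt-query-digest expects for --type tcpdump.
tcpdump -i eth0 -s 65535 -x -nn -q -tttt -c 100000 port 3306 > mysql.tcp.txt

# Parse the capture and rank queries by total execution time.
pt-query-digest --type tcpdump mysql.tcp.txt > digest-report.txt
```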

This is all fine and well if you can afford to run tcpdump for long periods of time. Given the "production" and "high availability" parts of your MySQL deployments, however, this might not be the ideal way. Running tcpdump for even a few minutes might bog down your network traffic, as well as raise CPU and disk utilization on the database server significantly, causing issues for your customers or users. You can run tcpdump remotely on one of your many worker nodes and have much less impact on the database server. Or you can run it on the database server and pipe the output to a remote machine, so you don't increase the load on the database server's disks while saving the tcpdump data. And, especially for big clients, or clients that have the proper network equipment, port mirroring is a good option.
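For the piping variant, something like the following sketch keeps the capture off the database server's disks; the collector hostname and target path are assumptions:

```bash
# Capture locally but write the data to a remote collector over SSH,
# so the database server's disks see no extra write load.
tcpdump -s 65535 -x -nn -q -tttt port 3306 \
  | ssh collector 'cat > /data/captures/mysql-$(date +%s).tcp.txt'
```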

Personally, I can't run tcpdump for long periods of time on a specific database server. Not only do I not get usable data from it, but, like slow queries, the problematic queries in that specific system are a small percentage of the overall queries. Only certain tasks or actions from the applications trigger spikes on the database server, and they occur at pretty much random intervals. My goal is to debug the load during these spikes.

My tactic for this was simple. Judging by the graphs that display those spikes we don't like, we define a threshold value for the CPU utilization at which to start our tcpdump process. With a little modification of this script that calculates CPU usage, we start tcpdump whenever we cross this threshold value.
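Our exact script isn't reproduced here, but a minimal sketch of the idea looks like this, assuming a Linux host that can read /proc/stat; the threshold, packet count, and output path are illustrative assumptions:

```bash
#!/bin/bash
# Start a short tcpdump capture whenever CPU utilization crosses a threshold.

THRESHOLD=70      # percent CPU utilization that triggers a capture (assumption)
PACKETS=100000    # packets to capture per spike (assumption)

cpu_usage() {
  # Sample the aggregate "cpu" line of /proc/stat twice, one second apart,
  # and compute the busy percentage over that interval.
  read -r _ u1 n1 s1 i1 _ < /proc/stat
  sleep 1
  read -r _ u2 n2 s2 i2 _ < /proc/stat
  local busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
  local total=$(( busy + (i2 - i1) ))
  echo $(( 100 * busy / total ))
}

while true; do
  if [ "$(cpu_usage)" -ge "$THRESHOLD" ]; then
    # Same capture flags as above, so each file feeds straight into
    # pt-query-digest --type tcpdump afterwards.
    tcpdump -s 65535 -x -nn -q -tttt -c "$PACKETS" port 3306 \
      > "/var/tmp/mysql-spike-$(date +%s).tcp.txt"
  fi
done
```

Each spike then leaves behind a capture file that pt-query-digest can analyze after the fact.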

Using this tactic, we already fixed two problems during the first hour this was tested: incorrectly set indexes for specific queries that were fast enough not to reach the slow query log, but were being run several thousand times at specific intervals and always ended up at the top of the list.
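The table and query below are made up for illustration, but this is roughly how such a finding can be confirmed once pt-query-digest points at a query - EXPLAIN shows whether it actually hits an index:

```bash
# Hypothetical example: a query surfaced by the digest, checked with EXPLAIN.
# If "key" comes back NULL, the query scans rows instead of using an index.
mysql mydb -e "EXPLAIN SELECT * FROM sessions WHERE user_id = 42\G"
```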

We have identified several other problems which indicate design issues in the applications using this database - not necessarily related to indexes, but to poor normalization that ends up in overly complex queries. Some issues could be solved by smaller changes in design, while others would be better handled outside the database, by using Redis or another data structure server that supports intelligent sorting. Either way, the important thing to remember is that this approach sheds light on what some of the problems are, so that resources can be directed towards fixes or new implementations.
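As a hedged illustration of what "outside the database" can mean here: Redis sorted sets keep members ordered by score, so a "top N" read that would otherwise be a complex MySQL query becomes a cheap range lookup. The key and member names are invented:

```bash
# Maintain a ranking in a Redis sorted set instead of sorting in MySQL.
redis-cli ZADD leaderboard 1500 "user:42"
redis-cli ZADD leaderboard 900 "user:7"

# Read the top 10 members with their scores, already sorted by Redis.
redis-cli ZREVRANGE leaderboard 0 9 WITHSCORES
```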

Hopefully, this is a first step toward analyzing load on your database server that takes you beyond what you find in the slow query log. For us, it has been just that.

- Tit Petric
