
Outperforming MySQL by hand
There is this SQL table, right? This is the part where you knowingly nod and prepare yourself for a long story about how a database is only as good as the person designing it. Let’s call this table an access log. For the sake of argument, let’s also say it’s 30GB in size, has 100 million rows, and its only index is the auto_increment primary key ID field. Your task, should you choose to accept it, is to retrieve the last hour of entries from this log table.
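A minimal sketch of the naive attempt, assuming the table is called `access_log` and carries a `stamp` datetime column (all the names here are illustrative):

```sql
-- Filter on the unindexed stamp column: MySQL has nothing to work
-- with except a full scan over all 100 million rows.
SELECT * FROM access_log
 WHERE stamp >= NOW() - INTERVAL 1 HOUR;
```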
Yes, this explains why you needed to KILL this query. OK, let’s think about this… if we only have the PK index over ID, let’s try to use that. MySQL uses an index for sorting under some conditions; maybe that’s possible here too. I know the stamp field is only k-sortable within a 60-second range, but we can generally assume it has the same order as the ID. So let’s try that.
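Something along these lines, sticking with the illustrative schema from above:

```sql
-- Same filter, but ordered by the primary key, so MySQL can resolve
-- the ORDER BY through the PK index instead of a filesort.
SELECT * FROM access_log
 WHERE stamp >= NOW() - INTERVAL 1 HOUR
 ORDER BY id DESC;
```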
And the EXPLAIN confirms we are using the PK for sorting (bye bye, filesort). But that’s not really the problem here: we’re still going over every row to resolve the WHERE clause, checking each stamp value. Don’t even try running the query. After 5 minutes, which is totally unreasonable, the only thing left to do is KILL it off. So, given a few assumptions, the only way to resolve the query efficiently is to use the PK in the WHERE clause. Can we find the ID that marks the start of our data and use it when fetching the rows?
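One way to grab that boundary ID, sketched under the same illustrative schema: walk the PK backwards and let the LIMIT stop the scan at the first row older than our window:

```sql
-- MySQL walks the PK in descending ID order, checking stamp for each
-- row. Every row from the last hour fails the condition; the first
-- row at or before the cutoff matches, LIMIT 1 is satisfied, and the
-- scan stops without touching the rest of the table.
SELECT id, stamp FROM access_log
 WHERE stamp <= NOW() - INTERVAL 1 HOUR
 ORDER BY id DESC
 LIMIT 1;
```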
OK, why does getting the ID work so “fast”? A few things:
- It only reads the ‘stamp’ field (and not * like before)
- The order is resolved by Primary Key
- Given the descending order, MySQL only reads rows in descending ID order until the LIMIT condition is met, without processing the rest of the table.
There are more rows being analyzed than EXPLAIN tells us, but we can be sure (within the constraints of reason and logic) that this is how it works. Now we’re just left to query the table by ID:
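A sketch, with a made-up placeholder for the boundary ID returned above:

```sql
-- PK range scan: MySQL touches only the rows from the last hour.
SELECT * FROM access_log
 WHERE id > 99987654  -- hypothetical ID from the previous query
 ORDER BY id DESC;
```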
Yay! Great success! Getting the recent interval is a breeze. But let’s say you want some historical data from a month ago. First, we need to find the ID.
- You know the stamp column has the same sort order as the ID column.
- You can efficiently query by ID, but not by stamp.
Given this information, you can implement your own “index scan” that gives you the appropriate ID value for a given timestamp. Since the PK index is sorted, you can use a simple binary search to query and subdivide the data. I’m providing a representation of a few steps below, just to illustrate how “fast” the data gets reduced down to a manageable size.
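Here’s what those probes might look like, assuming roughly 100 million rows and a made-up target timestamp of '2015-06-01 00:00:00' (the IDs and the date are illustrative):

```sql
-- Each probe is a point lookup on the PK index. The >= / LIMIT 1
-- form also survives gaps in the auto_increment sequence.

-- Probe the midpoint of [1, 100000000]:
SELECT id, stamp FROM access_log
 WHERE id >= 50000000 ORDER BY id LIMIT 1;
-- stamp is newer than the target -> continue in [1, 50000000]

SELECT id, stamp FROM access_log
 WHERE id >= 25000000 ORDER BY id LIMIT 1;
-- stamp is older than the target -> continue in [25000000, 50000000]

SELECT id, stamp FROM access_log
 WHERE id >= 37500000 ORDER BY id LIMIT 1;
-- ...and so on. The candidate range halves with every probe, so about
-- 27 point lookups pin down the boundary ID in a 100-million-row table.
```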
Once your range is down to a few thousand rows, the query returns the valid ID in record time, far from the few minutes it takes without this “hack”. I don’t know about you, but getting specific chunks of data out of a multi-million-row table with sub-second query times, based on fields that are not indexed… I like it.
If your entries are not at least k-sortable, this approach won’t help you. I’m logging things like response times in milliseconds, and since those values are not sorted at all, querying them requires a full table scan, an index or, ideally, an external database better suited for the task. But that’s another story…