Outperforming MySQL by hand

There is this SQL table, right?

This is the part where you knowingly nod and prepare yourself for a long story about how a database is only as good as the person designing it. Let's call this table an access log. For the sake of argument, let's also say it's 30GB in size, has 100 million rows, and that its only index is the auto_increment primary key on the ID field. Your task, should you choose to accept it, is to retrieve the last hour of entries from this log table.

set @a = DATE_SUB(now(), interval 1 hour);
select * from log_requests_ava where stamp>=@a order by stamp desc;

Just look at the EXPLAIN output for a second:

[screenshot: EXPLAIN output]

Yes, this explains why you needed to KILL this query. OK, let's think about this: if the only index we have is the PK on ID, let's try to use that. MySQL can use an index for sorting under some conditions; maybe that's possible here too. I know the stamp field is only roughly sorted, k-sortable within a 60-second range, but we can generally assume it follows the same order as the ID. So let's try this:

set @a = DATE_SUB(now(), interval 1 hour);
select * from log_requests_ava where stamp >= @a order by id desc;

[screenshot: EXPLAIN output]

And the EXPLAIN confirms we are using the PK for sorting (bye bye filesort). But that's not really the problem here: we're still going over every row to resolve the WHERE clause, checking each stamp value. Don't even try running the query. After five minutes, which is totally unreasonable, the only thing left to do is KILL it off.

So, given a few assumptions, the only way to resolve the query efficiently is to use the PK in the WHERE clause. Can we find the ID that marks the start of our data, and then use that ID when fetching the rows?

set @a = DATE_SUB(now(), interval 1 hour);
select id from log_requests_ava where stamp <= @a order by id desc limit 0,1;

[screenshots: EXPLAIN output and query timing]

OK, why does getting the ID work so "fast"? A few things:

  1. It only needs the id and stamp fields (and not * like before)
  2. The ordering is resolved by the primary key
  3. Given the descending order, MySQL walks the rows in descending ID order only until the LIMIT condition is met, instead of processing the rest of the table.

There are more rows being examined than EXPLAIN tells us, but we can be reasonably sure (within the constraints of reason and logic) that this is how it works. Now we're just left to query the table by ID:
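The intuition behind that third point can be sketched with a toy simulation. This is plain Python over a list, not MySQL internals, and all the numbers here are made up, but it shows why a descending scan with LIMIT 1 only ever touches the newest rows:

```python
from datetime import datetime, timedelta

# Toy model of the log table: (id, stamp) pairs, one row per second,
# ids and stamps sharing the same order (the article's key assumption).
base = datetime(2014, 1, 7, 0, 0, 0)
rows = [(i, base + timedelta(seconds=i)) for i in range(100_000)]

cutoff = base + timedelta(seconds=95_000)  # our "one hour ago" boundary

# Walk the "PK index" backwards; stop at the first row where stamp <= cutoff.
# This is the LIMIT 1 condition being met -- the rest of the table is skipped.
examined = 0
boundary_id = None
for row_id, stamp in reversed(rows):
    examined += 1
    if stamp <= cutoff:
        boundary_id = row_id
        break

print(boundary_id, examined)  # → 95000 5000
```

Only 5,000 of the 100,000 rows were touched: exactly the rows newer than the cutoff, plus the boundary row itself.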

set @a = 101732435;
select * from log_requests_ava where id > @a order by id desc;

[screenshots: EXPLAIN output and query timing]

Yay! Great success! Getting the recent interval is a breeze. But let's say you want some historical data from a month ago. First, we need to find the ID.

set @a = DATE_SUB(now(), interval 1 month);
select id from log_requests_ava where stamp <= @a order by id desc limit 0,1;

Well, we're screwed. I know I have about five months of data, so this query processes about 20% of the complete data set, assuming uniform distribution of entries. It took over five minutes to finish; honestly, I expected worse. If we can also limit by ID, we can resolve most of that 20% with the index.

set @b = '2014-01-07 08:36:56'; -- I selected a fixed point
select id from log_requests_ava where id < 77015836 and stamp <= @b order by id desc limit 0,1;

This is faster because most of the query is resolved with the primary key. MySQL skips that good 20% of the table using the PK index, uses the PK for ordering, and only has to inspect stamp for the few tens of thousands of rows below the given ID.

[screenshots: EXPLAIN output and query timing]

Since the ID is sequential, the difference between two IDs matches the number of rows between them:

set @a = 76976421;
set @b = 77015836;

select * from log_requests_ava where id >= @a and id < @b order by id desc;

-- 39415 rows in set (0.11 sec)
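That row count falls straight out of arithmetic, assuming a gap-free auto_increment sequence (no deleted rows):

```python
a = 76976421
b = 77015836

# The query uses a half-open interval: id >= a and id < b.
# With gap-free sequential IDs, that interval holds exactly b - a rows,
# matching the 39415 rows the query returned.
print(b - a)  # → 39415
```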

Querying this interval is blazing fast, since the PK index efficiently resolves both the WHERE clause and the sorting. The challenge is getting the ID value for a given stamp quickly; MySQL can't resolve that without an index on stamp. But you can cheat: you have information that MySQL does not, and you can use it to your advantage.

  1. You know the stamp column has the same sorting order as the ID column does.
  2. You can efficiently query by ID but not by stamp

Given this information, you can implement your own "index scan" that returns the appropriate ID value for a given timestamp. Since the PK index is sorted, a simple binary search can subdivide the ID range with a handful of cheap point queries. I'm providing a representation of a few steps just to illustrate how fast the data gets cut down to a manageable size.

[illustration: binary search steps narrowing down the ID range]
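The client-side binary search could be sketched like this. This is an illustration under assumptions, not the article's actual code: SQLite stands in for MySQL, the table is tiny, IDs are gap-free, and `id_for_stamp` is a hypothetical helper name. It probes one row per step, so ~27 point lookups cover a 100-million-row range:

```python
import sqlite3
from datetime import datetime, timedelta

# Stand-in for the real table: SQLite instead of MySQL, 100k rows instead of 100M.
db = sqlite3.connect(":memory:")
db.execute("create table log_requests_ava (id integer primary key, stamp text)")
base = datetime(2014, 1, 7, 0, 0, 0)
db.executemany(
    "insert into log_requests_ava values (?, ?)",
    [(i, (base + timedelta(seconds=i)).isoformat(" ")) for i in range(1, 100_001)],
)

def id_for_stamp(db, target):
    """Binary-search the PK range for the last id whose stamp <= target.

    Each probe is a single primary-key lookup; nothing between the probes
    is ever scanned. Relies on stamp having the same order as id.
    """
    lo, hi = db.execute("select min(id), max(id) from log_requests_ava").fetchone()
    while lo < hi:
        mid = (lo + hi + 1) // 2
        (stamp,) = db.execute(
            "select stamp from log_requests_ava where id = ?", (mid,)
        ).fetchone()
        if stamp <= target:
            lo = mid       # mid is still inside the interval; search above it
        else:
            hi = mid - 1   # mid is past the cutoff; search below it
    return lo

cutoff = (base + timedelta(seconds=86_400)).isoformat(" ")  # "one day ago"
boundary = id_for_stamp(db, cutoff)
print(boundary)  # → 86400
```

Against a real MySQL table with gaps in the id sequence, each probe would instead be `select stamp from log_requests_ava where id <= mid order by id desc limit 1`, so a missing id still resolves to the nearest existing row.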

When the remaining range is down to a few thousand rows, the query returns the valid ID in record time, a far cry from the minutes it takes without this "hack". I don't know about you, but getting specific chunks of data out of a multi-million row table in sub-second time, filtered on fields that are not indexed... I like it.

[screenshot: query timing]

If your entries are not at least k-sortable, this approach won't help you. I'm also logging things like response time in milliseconds, and since those values are not sorted at all, querying them requires a full table scan, an index, or, ideally, an external database better suited for the task. But that's another story...

There is this SQL table, right? ...

- Tit Petric
