Compression Options in MySQL (Part 2)

8 March, 2019 by awie

Swiss cheese File system

In one of my previous posts, I started a series on data compression options with MySQL. The first post focused on the more traditional compression options like InnoDB Barracuda page compression and MyISAM packing. With this second part, I’ll discuss a newer compression option, InnoDB transparent page compression with punch holes available since 5.7. First, I’ll describe the transparent page compression method and how it works. Then I’ll present similar results as in the first post.

InnoDB transparent page compression

Before we can discuss transparent page compression, we must understand how InnoDB accesses its data pages. To access an InnoDB page, you need to know the tablespace (the file) and the offset of the page within the tablespace file. The offset is the tough part with data compression. If you just compress pages and concatenate them one after the other, the offsets will no longer be at known intervals. InnoDB Barracuda page compression solves the problem by asking the DBA to guess the compression ratio of the pages with the compressed block size setting. For example, you have to tell InnoDB to use a disk block size of 8KB if you think the compression ratio will be around 2. Transparent page compression uses another approach, sparse files.

Sparse files 101

A sparse file is a file with holes in it. Even though a sparse file may be very large, if there are a lot of holes in it, it may end up using a small amount of storage. On almost every Linux system, the /var/log/lastlog file is sparse:

yves@ThinkPad-P51:/var/log$ ls -lah lastlog
-rw-rw-r-- 1 root utmp 18M jan 5 16:09 lastlog
yves@ThinkPad-P51:/var/log$ du -hs lastlog
56K lastlog

While the ls command reports an apparent size of 18MB, the du command tells us the file actually uses only 56KB. Most of the space in the file is actually unallocated. When you access a sparse file, the filesystem has to map the actual physical offsets in the file with the logical offsets seen by the application. A logical offset is no longer directly the number of bytes since the beginning of the file.

Now that we understand a bit what sparse files are, let’s talk about the punch hole aspect. When you write something to disk, you can use the fallocate call to free up, punch, part of it. The freed/punched portion is thus a hole in the file, and the filesystem can later reuse the hole to store something else. Let’s follow a simplified view of the steps required to write a transparently compressed InnoDB page.

InnoDB using sparse files

Figure 1: InnoDB Transparent page compression

In figure 1, an in memory 16KB InnoDB page with 14KB of data is going to be written to disk. As part of the write process, the data is compressed to 6KB and the page is written to the disk. Once written, InnoDB uses the fallocate call to release the 10KB of unused space. Since only full blocks are release, only 8KB is really freed. The remaining space unreleased space (2KB) is just zeroed. The freed space will be reused, either for the same file or by another one. For simplicity, let’s assume the space is reused by the same InnoDB file.

Figure 2: File system layout

If there is no immediate reuse, a portion of the InnoDB file will look like the top file layout of figure 2. The pages (numbers) are still sequentially laid out but there are holes in between. As the file system gets full, it will start to reuse the freed space so eventually, the file layout will look like the bottom one. If you notice, in the bottom layout, the pages are no longer in sequential order. There are consequences to that: the notion of disk sequential access is gone. The most stunning example is a simple file copy on a spinning device. While copying a 1GB regular file may take only 30 seconds, the copy of a 1GB sparse file can take much longer, up to 30 minutes in the worst cases. The impact on physical backup tools, like Percona Xtrabackup, are thus important. Normally physical backups are much faster than logical ones (ex: mysqldump), but with sparse files, it may no longer be true.

MySQL impacts

There are also consequences of the use of sparse files on the design of a MySQL database server. The added random operations increase the importance of using SSD/Flash based storage. Also some settings must be considered with a different perspective:

innodb_flush_neighbors should be 0 since 1 is a cheat geared toward sequential operations
innodb_read_ahead_threshold, normally set to 56, this means when 56 pages of an extent have been scanned, the next extent is read sequentially ahead of time. To be really useful, the next extent should be read before the remaining 8 pages of the current extent are read. Since sequential operations are slower, maybe this value should be lowered a little. The drawback is an increased possibility of a read ahead without use.
innodb_random_read_ahead is a wilder setting, it would be a good idea to experiment with this for your workload

There are likely to be other affected settings.

Review of the test procedure

Just to refresh memories, I am using two datasets for the basic benchmarks. The first, Wikipedia, consists in about 1B rows of Wikipedia access logs. It is moderately compressible. The second dataset, o1543, is from the defunct Percona cloud tool project. It has only 77M rows but they are much wider with 134 columns. The o1543 dataset is highly compressible.

On these two datasets, the following steps were executed:

insert the rows: record time, final size and amount of data written
large range select, record the time
20k updates, record the time to and total bytes written

Results

Final sizes

Figure 3, Innodb transparent page compression final sizes

One of the most critical metrics with compression is the final dataset size, as shown in figure 3. The possibility to use larger InnoDB pages is a big thing with transparent page compression. Larger pages allow for more repetitive patterns to be present within a page, and that improves the compression ratio. Results using page sizes of 16KB, 32KB and 64KB are shown. The uncompressed results are used as references, transparent compression (TC) using Lz4 and Zlib are the actual compressed datasets. First, we see that larger page sizes barely affect the size of the uncompressed dataset (I16, I32 and I64). Since the datasets were inserted in primary key order, the only possible impact is the filling factor of the pages. When InnoDB fills a page in PK order, even when the innodb_fill_factor is set to 100, it always leaves 1KB free per 16KB. With an amount of free space that scales with the page size, the final size doesn’t change much.

The impacts of larger page sizes on the compression ratio are important. The most drastic example is with the o1543 dataset and Zlib compression. While with a 16KB page, the compression ratio was already decent, at 3.65, it grows to an amazing 8.7 (I16/I64TCZlib) with pages of 64KB. Larger page sizes have also a positive impact on the compression ratio of the Wikipedia dataset. The original compression ratio with Zlib and 16KB pages is 2.4 and it grows to 3.4 with 64KB pages. Datasets compressed with Lz4 behave similarly to the Zlib ones but the compression ratio are slightly lower.

Overall, the I64TCZlib results for the Wikipedia dataset is the most compressed form we have so far. For the o1543 dataset, the MyISAMPacked compressed size is still slightly smaller but is read-only.

Insertion time

Figure 4, InnoDB transparent page compression insert times

We normally expect compression to add an overhead but here, the insertion speed improves with larger page sizes (figure 4). The reason is likely to be because we are using spinning disks. Spinning disks have a high latency so doing larger IO operations helps. The time overhead of compression with transparent page compression hovers between 10 and 17%. That’s much less than 60% overhead we observed for the Barracuda table compression in the previous post for the Wikipedia dataset (InnoDBCmp8k/InnoDB). We can conclude the insert rates, when inserts are in PK order, are not much affected by transparent page compression. If you are mostly inserting data, it is a nice win.

Data written by inserts

Figure 5, total amount of data written during the inserts

The amount of data written is not much affected by the transparent compression and the larger page sizes (figure 5) . That’s reasonable as many of the writes are not compressed, only the final write to the tablespace is. Neither the writes to the double write buffer, or to the InnoDB log files, or for the tablespace pre-allocation, are compressed. The differences we see are essentially the same as the ones for the final sizes. Only the uncompressed results do not fit that view but these are rather small deviations.

Range selects

Figure 6, time to complete a long range scan

The range select benchmarks are really a means of testing the decompression overhead. As you notice in figure 6, the time variations are not large. For the Wikipedia dataset, the faster range select is I64TCLz4, and it completed in 788 seconds. That’s almost two minutes slower than the faster results using InnoDB Barracuda compression (block_size=4KB). How can we explain such results? If the freed space is reused, transparent compression causes sequential operations to become random ones. The time should increase. Without space reuse, the storage layer will merge many small reads into a sequential one, and then discard the holes. Effectively, the disk will read the same amount of data, compressed or not. The only difference will come from decompression. Lz4 is extremely fast while Zlib is slower.

Going back to the Wikipedia dataset, it took the exact same time, 830s, for I16, I16TCLz4 and I32TCLz4. That seems to indicate there was no space reuse. With the xfs xfs_bmap tool on a TC compressed file, I listed the blocks used. Here is the command I used and the first lines of the output (with blocks of 512 bytes):

root@LabPS57kvm_1:/tmp# xfs_bmap /var/lib/mysql/test/query_class_metrics.ibd | more
/var/lib/mysql/test/query_class_metrics.ibd:
0: [0..31]: 1013168..1013199
1: [32..39]: 1014544..1014551
2: [40..63]: hole
3: [64..71]: 1016976..1016983
4: [72..95]: hole
5: [96..103]: 1017008..1017015
6: [104..127]: hole
7: [128..135]: 1016880..1016887
8: [136..159]: hole
9: [160..167]: 1016912..1016919
10: [168..191]: hole
...

We have the list:

0..31: 16 KB tablespace header, apparently not compressed
32..39: 4KB TC compressed page, 8 sectors of compressed data
40..63: 12KB hole (24 sectors)
…and so on

So the layout actually looks indeed like the filesystem with no reuse case (top layout) of figure 2. When InnoDB extends the tablespace, it of course proceeds by entire pages. The filesystem will try, as much as possible, to allocate continuous blocks. Initially, the tablespace increases one page at a time but rapidly it grows by extent of 64 pages. The space reuse will start only when there are no more continuous areas large enough to satisfy the allocation requests. Until then, the filesystem still performs mostly sequential operations. The performance characteristics will thus change once the freed blocks start to be reused. On a smaller server, I continued to insert data well after the filesystem would have been full without the holes. The insertion rate fell by about half but the read performance appeared unchanged.

The times of the range selects for the o1543 dataset are more predictable. In all cases, larger pages increase performance. That kind of makes sense – InnoDB needs less IOPS. With Lz4, InnoDB spends less time to decompress the pages than it would need to read the complete uncompressed pages. The opposite is also true for Zlib. The Lz4 results are the fastest, Zlib the slowest, and in between we have the uncompressed results.

20k updates time

Figure 7, time needed to perform 20k updates

Intuitively, I was expecting the larger pages to slow down the updates. Similarly, I was also expecting Lz4 compressed pages to be slower than uncompressed pages, but faster than the ones compressed with Zlib. The above figure shows the times to perform approximately 20k single row updates for both datasets. We performed the updates to the Wikipedia dataset in small separate transactions, while we used a single large update statement for the o1543 dataset.

While the compression algorithm assumption appears to hold true, the one for the page sizes is plainly wrong. Of course, the storage consists of spinning disks so the latency of random IO dominates. The important factor becomes the number of levels in the b-tree of the table. In the root node of the b-tree and all intermediate nodes, bigger pages mean more pointers to the next level. More pointers causes a bigger fan-out –ratio of nodes between levels – and fewer levels. Bigger pages also cause fewer leaf level pages which in turn require less upper level node pages.

Let’s dive a bit more into this topic. The Wikipedia dataset table has an int unsigned primary key. Considering InnoDB always leaves 1KB free in a page and, along with the primary key, each entry in a node (non-leaf) has an extra 9 bytes for the pointer to the next level page. Let’s do some math:

Total number of pages with 16KB pages = 112.6GB / (15KB) = 7871311 pages
Max number of rows in the non-leaf pages for 16KB pages and an int PK = (16 * 1024)/(4 (int PK) + 9 (ptr)) = 1260 rows/pages
Minimum number of pages in the first level above the leaf = 7871311 / 1260 = 6247 pages
Minimum number of pages at the next level = 6247 / 1260 = 5 pages
Root page = 1

Of course, our calculations are an approximation. With a 16KB page size, there are three levels above the leaves for a total of 6253 pages and a size of 98MB. It thus requires 6253 IOPS to warm up the buffer pool with the all nodes. A SATA 7200 rpm disk delivers at best 120 IOPS (one per rotation) so that’s about 51 second. Now, let’s redo the same calculations but with a page size of 32KB:

Total number of pages with 32KB pages = 110.7GB / (31KB) = 3744431 pages
Max number of rows in the non-leaf pages for 32KB pages and an int PK = (32 * 1024)/(4 (int PK) + 9 (ptr)) = 2520 rows/pages
Minimum number of pages in the first level above the leaf = 3744431 / 2520 = 1486 pages
Root page = 1

Using 32KB pages, we have one level less and only 1487 node pages for a combined size of 47MB. To warm up, the buffer pool we have to load at least the node pages, an operation requiring only a quarter of the IOPS compared to when 16KB pages were used. That’s where most of the performance gains come from. The reduced number of IOPS more than compensates for the longer time to read a large page. Again, in this setup, we used spinning disks.

Bytes written per update

Figure 8, average bytes written per update

Now, the last set of results concerns the number of bytes written per update statement (figure 8). There is a big price to pay when you want to use larger InnoDB pages, the write amplification is huge. The number of bytes approximately scales roughly with the page size. The worse case is the I64 result, about 192KB written for a single row update of an integer field (Wikipedia). If your database workload includes a large number of small single row updates, you should avoid expensive flash devices with 64KB InnoDB pages as you’ll burn your devices rapidly.

Operational considerations for larger InnoDB pages and TC

When is it good idea to use transparent compression? When should you use a larger InnoDB page size? One valid use case is a database storing large quantities of operational metrics, like the o1543 dataset. The compression ratio will be fantastic and the performance penalty limited, at least until the filesystem starts reusing the holes.

If you collect data from a large number of devices and you are likely struggling with TBs of highly compressible data, transparent compression might be an interesting option. The only issue I see, but it is a major one, is how to backup large sparse files. InnoDB transparent page compression with punch holes is an interesting solution but, unless I am missing something, it has a somewhat limited scope. There are other compression options with similar compression ratios and less drawbacks.

In this post we explored a feature available since MySQL 5.7, InnoDB transparent compression with punch holes. Performance-wise, we have an interesting solution which offers excellent compression ratio, especially when larger page sizes are used. The transparent compression with punch holes technique suffers from its foundations, sparse files. Backing up very large sparse files is a slow and IO intensive process. Instead of performing large sequential IO operations, the backup process will require millions of small random IO operations.

So far we have discussed the traditional approaches to compression in MySQL (previous post) and Innodb transparent page compression. The next post of the series on data compression with MySQL will introduce the ZFS filesystem. ZFS externalizes the compression to the filesystem in a way that is pretty similar to InnoDB transparent page compression, but the ZFS b-tree file structure removes the inconvenience of sparse files.

Stay tuned, more results are coming.

Percona XtraBackup 8.0.5 Is Now Available

8 March, 2019 by awie

Percona XtraBackup 8.0

Percona is glad to announce the release of Percona XtraBackup 8.0.5 on March 4, 2019. Downloads are available from our download site and from apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, it drives down backup costs while providing unique features for MySQL backups.

Percona XtraBackup 8.0.5 introduces the support of undo tablespaces created using the new syntax (CREATE UNDO TABLESPACE) available since MySQL 8.0.14. Percona XtraBackup also supports the binary log encryption introduced in MySQL 8.0.14.

Two new options were added to xbstream. Use the --decompress option with xbstream to decompress individual qpress files. With the --decompress-threads option, specify the number of threads to apply when decompressing. Thanks to Rauli Ikonen for this contribution.

This release of Percona XtraBackup is a General Availability release ready for use in a production environment.

All Percona software is open-source and free.

Please note the following about this release:

The deprecated innobackupex has been removed. Use the xtrabackup command to back up your instances: $ xtrabackup --backup --target-dir=/data/backup
When migrating from earlier database server versions, backup and restore and using Percona XtraBackup 2.4 and then use mysql_upgrade from MySQL 8.0.x
If using yum or apt repositories to install Percona XtraBackup 8.0.5, ensure that you have enabled the new tools repository. You can do this with the percona-release enable tools release command and then install the percona-xtrabackup-80 package.

New Features

PXB-1548: Percona XtraBackup enables updating the ib_buffer_pool file with the latest pages present in the buffer pool using the --dump-innodb-buffer-pool option. Thanks to Marcelo Altmann for contribution.
PXB-1768: Added support for undo tablespaces created with the new MySQL 8.0.14 syntax.
PXB-1781: Added support for binary log encryption introduced in MySQL 8.0.14.
PXB-1797: For xbstream, two new options were added. The --decompress option enables xbstream to decompress individual qpress files. The --decompress-threads option controls the number of threads to apply when decompressing. Thanks to Rauli Ikonen for this contribution.

Bugs Fixed

Using --lock-ddl-per-table caused the server to scan all records of partitioned tables which could lead to the “out of memory” error. Bugs fixed PXB-1691 and PXB-1698.
When Percona XtraBackup was started run with the --slave-info, incorrect coordinates were written to the xtrabackup_slave_info file. Bug fixed PXB-1737
Percona XtraBackup could crash at the prepare stage when making an incremental backup if the variable innodb-rollback-segments was changed after starting the MySQL Server. Bug fixed PXB-1785.
The full backup could fail when Percona Server was started with the --innodb-encrypt-tables parameter. Bug fixed PXB-1793.

Other bugs fixed: PXB-1632, PXB-1715, PXB-1770, PXB-1771, PXB-1773.

Troubleshooting MySQL Crashes Webinar: Q&A

17 February, 2018 by awie

Troubleshooting MySQL Crashes

In this blog, I will provide answers to the Q & A for the Troubleshooting MySQL Crashes webinar.

First, I want to thank everybody for attending our January 25, 2018, webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I was unable to answer fully during the webinar.

Q: I have the 600 seconds “Long semaphore wait” assertion failure / crashing issue following DDL queries, sometimes on the master, sometimes just the slaves. Any hints for troubleshooting these? How can I understand what semaphore holding threads are doing?

A: These are hardest errors to troubleshoot. Especially because in some cases (like long-running

CHECK TABLE

commands) long semaphore waits could be expected and appropriate behavior. If you see long semaphore waits when performing DDL operations, it makes sense to consider using pt-online-schema-change or gh-ost utilities. Also, check the list of supported online DDL operations in the MySQL User Reference Manual.

But if you want to know how to analyze such messages, let’s check the output from page #17 in the slide deck used in the webinar:

2018-01-19T20:38:43.381127Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
a writer (thread id 139965530261248) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: fffffffff0000000
Last time read locked in file ibuf0ibuf.cc line 3454
Last time write locked in file /mnt/workspace/percona-server-5.7-binaries-release/label_exp/
debian-wheezy-x64/percona-server-5.7.14-8/storage/innobase/btr/btr0btr.cc line 177
2018-01-19T20:38:43.381143Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139965135804160 has waited at buf0buf.cc line 4196 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4f257d33c0 created in file hash0hash.cc line 353
a writer (thread id 139965345621760) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0buf.cc line 4196
Last time write locked in file ...

The line

--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:

Shows that some transaction was waiting for a semaphore. The code responsible for this wait is located on line 3454 in file

ibuf0ibuf.cc

. I received this crash when I ran Percona Server for MySQL version 5.7.14-8. Therefore, to check what this code is doing, I need to use Percona Server 5.7.14-8 source code:

sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/ibuf/ibuf0ibuf.cc
...
3454 btr_pcur_open(ibuf->index, ibuf_entry, PAGE_CUR_LE, mode, &pcur, &mtr);
...

A few lines above in the same file contain function definition and comment:

3334 /** Buffer an operation in the insert/delete buffer, instead of doing it
3335 directly to the disk page, if this is possible.
3336 @param[in] mode BTR_MODIFY_PREV or BTR_MODIFY_TREE
3337 @param[in] op operation type
3338 @param[in] no_counter TRUE=use 5.0.3 format; FALSE=allow delete
3339 buffering
3340 @param[in] entry index entry to insert
3341 @param[in] entry_size rec_get_converted_size(index, entry)
3342 @param[in,out] index index where to insert; must not be unique
3343 or clustered
3344 @param[in] page_id page id where to insert
3345 @param[in] page_size page size
3346 @param[in,out] thr query thread
3347 @return DB_SUCCESS, DB_STRONG_FAIL or other error */
3348 static MY_ATTRIBUTE((warn_unused_result))
3349 dberr_t
3350 ibuf_insert_low(
3351 ulint mode,
3352 ibuf_op_t op,
3353 ibool no_counter,
3354 const dtuple_t* entry,
3355 ulint entry_size,
3356 dict_index_t* index,
3357 const page_id_t& page_id,
3358 const page_size_t& page_size,
3359 que_thr_t* thr)
3360 {
...

The first line of the comment gives us an idea that InnoDB tries to insert data into change buffer.

Now, let’s check the next line from the error log file:

S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/buf/buf0buf.cc
...
1446 /* If PFS_SKIP_BUFFER_MUTEX_RWLOCK is defined, skip registration
1447 of buffer block rwlock with performance schema.
1448
1449 If PFS_GROUP_BUFFER_SYNC is defined, skip the registration
1450 since buffer block rwlock will be registered later in
1451 pfs_register_buffer_block(). */
1452
1453 rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);
...

And again let’s check what this function is doing:

1402 /********************************************************************//**
1403 Initializes a buffer control block when the buf_pool is created. */
1404 static
1405 void
1406 buf_block_init(

Even without knowledge of how InnoDB works internally, by reading only these comments I can guess that a thread waits for some global InnoDB lock when it tries to insert data into change buffer. The solution for this issue could be either disabling change buffer, limiting write concurrency, upgrading or using a software solution that allows you to scale writes.

Q: For the page cleaner messages, when running app using replication we didn’t get them. After switching to PXC we started getting them. Something we should look at particular to PXC to help resolve this?

A: Page cleaner messages could be a symptom of starving IO activity. You need to compare Percona XtraDB Cluster (PXC) and standalone server installation and check how exactly the write load increased.

Q: Hi, I have one question, we have a query we were joining on

BLOB

TEXT

fields that is causing system locks and high CPU alerts and causing a lot of system locks, can you please suggest how can we able to make it work? Can you please send the answer in a text I missed some information?

A: If you are joining on

BLOB

TEXT

fields you most likely don’t use indexes. This means that InnoDB has to perform a full table scan. It increases IO and CPU activity by itself, but also increases the number of locks that InnoDB has to set to resolve the query. Even if you have partial indexes on the

BLOB

and

TEXT

columns, mysqld has to compare full values for the equation, so it cannot use index only to resolve

ON

clause. It is a best practice to avoid such kinds of

JOIN

s. You can use surrogate integer keys, for example.

Q: Hi, please notice that “MySQL server has gone away” is the worst one, in my opinion, and there was no mention about that ….can you share some tips on this? Thank you.
Both MySQL from Oracle and Percona error log does not help on that, by the way …

A: “

MySQL Server has gone away

” error maybe the result of a crash. In this case, you need to handle it like any other crash symptom. But in most cases, this is a symptom of network failure. Unfortunately, MySQL doesn’t have much information why connection failures happen. Probably because, from mysqld’s point of view, a problematic network only means that the client unexpectedly disconnected after a timeout, and the client still waiting for a response receives “

MySQL Server has gone away

”. I discussed these kinds of errors in my “Troubleshooting hardware resource usage” webinar. A good practice for situations when you see this kind of error often is don’t leave idle connections open for a long time.

Q: I see that a lot of work is doing hard investigation about some possibilities of what is going wrong….is there a plan at development roadmap on improve error log output messages? If you can comment on that …

A: Percona Engineering does a lot for better diagnostics. For example, Percona Server for MySQL has an extended slow log file format, and Percona Server for MySQL 5.7.20 introduced a new

innodb_print_lock_wait_timeout_info

variable that allows log information about all InnoDB lock wait timeout errors (manual). More importantly, it logs not only blocked transaction, but also locking transaction. This feature was requested at lp:1657737 for one of our Percona Support customers and is now implemented

Oracle MySQL Engineering team also does a lot for better error logging. The start of these improvements happened in version 5.7.2, when variable log_error_verbosity was introduced. Version 8.0.4 added much better tuning control. You can read about it in the Release Notes.

Q: Hello, you do you using strace to find what exactly table have problems in case there is not clear information in mysql error log?

A: I am not a big fan of

strace

when debugging mysqld crashes, but Percona Support certainly uses this tool. I myself prefer to work with

strace

when debugging client issues, such as trying to identify why Percona XtraBackup behaves incorrectly.

Thanks everybody for attending the webinar. You can find the slides and recording of the webinar at the Troubleshooting MySQL Crashes web page.

Dealing with corrupted InnoDB data

22 January, 2016 by awie

Data corruption! It can happen. Maybe because of a bug or storage problem that you didn’t expect, or MySQL crashes when a page checksum’s result is different from what it expected. Either way, corrupted data can and does occur. What do you do then?

Let’s look at the following example and see what can be done when you face this situation.

We have some valuable data:

> select * from t limit 4;
+---+--------+
| i | c      |
+---+--------+
| 1 | Miguel |
| 2 | Angel  |
| 3 | Miguel |
| 4 | Angel  |
+---+--------+
> select count(*) from t;
+----------+
| count(*) |
+----------+
|  2097152 |
+----------+

One day the query you usually run fails and your application stops working. Even worse, it causes the crash already mentioned:

> select * from t where i=2097151;
ERROR 2006 (HY000): MySQL server has gone away

Usually this is the point when panic starts. The error log shows:

2016-01-13 08:01:48 7fbc00133700 InnoDB: uncompressed page, stored checksum in field1 2912050650, calculated checksums for field1: crc32 1490770609, innodb 1549747911, none 3735928559, stored checksum in field2 1670385167, calculated checksums for field2: crc32 1490770609, innodb 2416840536, none 3735928559, page LSN 0 130051648, low 4 bytes of LSN at page end 1476903022, page number (if stored to page already) 4651, space id (if created with >= MySQL-4.1.1 and stored already) 7
InnoDB: Page may be an index page where index id is 22
InnoDB: (index "PRIMARY" of table "test"."t")
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 4651.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 4651.
InnoDB: You may have to recover from a backup.
2016-01-13 08:01:48 7fbc00133700 InnoDB: Page dump in ascii and hex (16384 bytes):
len 16384; hex ad925dda0000122b0000122affffffff0000000007c06e4045bf00000000000000000
[...]

OK, our database is corrupted and it is printing the page dump in ASCII and hex. Usually, the recommendation is to recover from a backup. In case you don’t have one, the recommendation would be the same as the one given by the error log. When we hit corruption, first thing we should try is dumping the data and then re-importing to another server (if possible). So, how we can read a corrupted TABLE and avoid the crash? In most cases, the

innodb_force_recovery

option will help us. It has values from 1 to 6. They are documented here:

http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html

The idea is to start with 1. If that doesn’t work, proceed to 2. If it fails again, then go to 3 . . . until you find a value that allows you to dump the data. In this case I know that the problem is a corrupted InnoDB page, so a value of 1 should be enough:

“Lets the server run even if it detects a corrupt page. Tries to make SELECT * FROM tbl_name jump over corrupt index records and pages, which helps in dumping tables.”

We add

innodb_force_recovery=1

and restart the service. Now it’s time to try and dump our data with

mysqldump

. If the corruption is even worse you need to keep trying different modes. For example, I have this error:

> create table t2 like t;
> insert into t2 select * from t;
ERROR 1034 (HY000): Incorrect key file for table 't'; try to repair it
> insert into t2 select * from t;
ERROR 1712 (HY000): Index t is corrupted

innodb_force_recovery=1

doesn’t work here. It doesn’t allow me to dump the data:

# mysqldump -uroot -pmsandbox --port 5623 -h 127.0.0.1 --all-databases > dump.sql
Error: Couldn't read status information for table t ()

but in my test server, it seems that

innodb_force_recovery=3

helps.

This procedure sounds good and usually works. The problem is that the feature is mostly broken after 5.6.15.

innodb_force_recovery

values greater or equal 4 won’t allow the database to start:

2015-07-08 10:25:25 315 [ERROR] Unknown/unsupported storage engine: InnoDB
2015-07-08 10:25:25 315 [ERROR] Aborting

Bug are reported and verified here: https://bugs.mysql.com/bug.php?id=77654

That means that if you have Insert Buffer, Undo Log or Redo log corruption (values 4, 5 and 6) you can’t continue. What to do?

You can install a older version of MySQL (previous to 5.6.15) to use higher values of
```
innodb_force_recovery
```
. Modes 4, 5 and 6 can corrupt your data (even more) so they are dangerous. If there are no backups this is our only option, so my recommendation would be to make a copy of the data we have now and then proceed with higher values of
```
innodb_force_recovery
```
.

If you are using Percona Server,
```
innodb_corrupt_table_action
```
can be used to dump the data. You can use the value “salvage”. When the option value is salvage, XtraDB allows read access to a corrupted tablespace, but ignores corrupted pages.

https://www.percona.com/doc/percona-server/5.6/reliability/innodb_corrupt_table_action.html

If you can’t still dump your data, then you should try more advance solutions like Undrop for InnoDB. Also, it would be good idea to start planning to create regular database backups.

Advanced Query Tuning in MySQL 5.6 and MySQL 5.7 Webinar: Q&A

31 August, 2015 by awie

Thank you for attending my July 22 webinar titled “Advanced Query Tuning in MySQL 5.6 and 5.7” (my slides and a replay available here). As promised here is the list of questions and my answers (thank you for your great questions).

Q: Here is the explain example:

mysql> explain extended select id, site_id from test_index_id where site_id=1
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test_index_id
         type: ref
possible_keys: key_site_id
          key: key_site_id
      key_len: 5
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Using index

why is site_id a covered index for the query, given the fact that a) we are selecting “id”, b) key_site_id only contains site_id?

As the table is InnoDB, all secondary keys will always contain primary key (“id”); in this case the secondary index will contain all needed information to satisfy the above query and key_site_id will be “covered index”

Q: Applications change over time. Do you suggest doing a periodic analysis of indexes that are being used and drop the ones that are not? If yes, any suggestions as to tackle that?

Yes, that is a good idea. Usually it can be done easily with Percona toolkit or Performance_schema in MySQL 5.6

Enable slow query log and log every query, then use Pt-index-usage tool
Or use the following query (as suggested by FromDual blog post):

SELECT object_schema, object_name, index_name
  FROM performance_schema.table_io_waits_summary_by_index_usage
 WHERE index_name IS NOT NULL
   AND count_star = 0
 ORDER BY object_schema, object_name;

Q: Does the duplicate index is found on 5.6/5.7 will that causes an performance impact to the db while querying?

Duplicate keys can have negative impact on selects:

MySQL can get confused and choose a wrong index
Total index size can grow, which can cause MySQL to run out of RAM

Q: What is the suggested method to measure performance on queries (other than the slow query log) so as to know where to create indexes?

Slow query log is most common method. In MySQL 5.6 you can also use Performance Schema and use events_statements_summary_by_digest table.

Q: I’m not sure if this was covered in the webinar but… are there any best-practices for fulltext indexes?

That was not covered in this webinar, however, I’ve done a number of presentations regarding Full Text Indexes. For example: Creating Geo Enabled Applications with MySQL 5.6

Q: What would be the limit on index size or number of indexes you can defined per table?

There are no limits on Index size on disk, however, it will be good (performance wise) to have active indexes fit in RAM.

In InnoDB there are a number of index limitations, i.e. a table can contain a maximum of 64 secondary indexes.

Q: If a table has two columns you would like to sum, can you have that sum indexed as a calculated index? To add to that, can that calculated index have “case when”?

Just to clarify, this is only a feature of MySQL 5.7 (not released yet).

Yes, it is documented now:

CREATE TABLE triangle (
  sidea DOUBLE,
  sideb DOUBLE,
  sidec DOUBLE AS (SQRT(sidea * sidea + sideb * sideb))
);

Q: I have noticed that you created indexes on columns like DayOfTheWeek with very low cardinality. Shouldn’t that be a bad practice normally?

Yes, you are right! Unless, you are doing queries like “select count(*) from … where DayOfTheWeek = 7” those indexes may not be very useful.

Q: I saw an article that if you don’t specify a primary key upfront mysql / innodb creates one in the background (hidden). Is it different from a primary key itself, if most of the where fields that are used not in the primary / semi primary key? And is there a way to identify the tables with the hidden primary key indexes?

The “hidden” primary key will be 6 bytes, which will also be appended (duplicated) to all secondary keys. You can create an INT primary key auto_increment, which will be smaller (if you do not plan to store more than 4 billion rows). In addition, you will not be able to use the hidden primary key in your queries.

The following query (against information_schema) can be used to find all tables without declared primary key (with “hidden” primary key):

SELECT tables.table_schema, tables.table_name, tables.table_rows
FROM information_schema.tables
LEFT JOIN (
  SELECT table_schema, table_name
  FROM information_schema.statistics
  GROUP BY table_schema, table_name, index_name
  HAVING
    SUM(
      CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
    ) = COUNT(*)
) puks
ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name
WHERE puks.table_name IS NULL
AND tables.table_type = 'BASE TABLE' AND engine='InnoDB'

You may also use mysql.innodb_index_stats table to find rows with the hidden primary key:

Example:

mysql> select * from mysql.innodb_index_stats;
+---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name      | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | n_diff_pfx01 | 96         | 1           | DB_ROW_ID                         |
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | n_leaf_pages | 1          | NULL        | Number of leaf pages in the index |
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | size         | 1          | NULL        | Number of pages in the index      |
+---------------+------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+

Q: You are using the alter table to create index, but how does mysql sort the data for creating the index? isn’t it uses temp table for that?

That is a very good question: the behavior of the “alter table … add index” has changed over time. As documented in Overview of Online DDL:

Historically, many DDL operations on InnoDB tables were expensive. Many ALTER TABLE operations worked by creating a new, empty table defined with the requested table options and indexes, then copying the existing rows to the new table one-by-one, updating the indexes as the rows were inserted. After all rows from the original table were copied, the old table was dropped and the copy was renamed with the name of the original table.
MySQL 5.5, and MySQL 5.1 with the InnoDB Plugin, optimized CREATE INDEX and DROP INDEX to avoid the table-copying behavior. That feature was known as Fast Index Creation

When MySQL uses “Fast Index Creation” operation it will create a set of temporary files in MySQL’s tmpdir:

To add a secondary index to an existing table, InnoDB scans the table, and sorts the rows using memory buffers and temporary files in order by the values of the secondary index key columns. The B-tree is then built in key-value order, which is more efficient than inserting rows into an index in random order.

Q: How good is InnoDB deadlocks on 5.7 comparing to 5.6 version. Is that based on parameters setup?

InnoDB deadlocks discussion is outside of the scope of this presentation. Valerii Kravchuk and Nilnandan Joshi did an excellent talk at Percona Live 2015 (slides available): Understanding Innodb Locks and Deadlocks

Q: What is the performance impact of generating a virtual column for a table having 66 Million records and generating the index. And how would you go about it? Do you have any suggestions on how to re organize indexes on the physical disk?

As MySQL 5.7 is not released yet, behavior of the virtual columns may change. The main question here is: will it be online operations to a) add a virtual column (as this is only metadata change it should be very light operation anyway). b) add index on that virtual column. In the labs released it was not online, however this can change.

Thank you again for attending.

The post Advanced Query Tuning in MySQL 5.6 and MySQL 5.7 Webinar: Q&A appeared first on MySQL Performance Blog.

ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam

31 August, 2015 by awie

Say hello to David Murphy, lead DBA and MongoDB Master at ObjectRocket (a Rackspace company). David works on sharding, tool building, very large-scale issues and high-performance MongoDB architecture. Prior to ObjectRocket he was a MySQL/NoSQL architect at Electronic Arts. David enjoys large-scale operational tool building, high performance OS and database tuning. He is also a core code contributor to MongoDB. He’ll be speaking next month at Percona Live Amsterdam, which runs Sept. 21-13. Enter promo code “BlogInterview” at registration to save €20!

Tom: David, your 3-hour tutorial is titled “Mongo Sharding from the trench: A Veterans field guide.” Did your experience in working with vast amounts of data at Rackspace give you a unique perspective, in view, that now puts you into a position to help people just getting started? Can you give a couple examples?

David: I think this has been something organically I grew into from the days of supporting Cpanel type MySQL instances to today. I have worked for a few verticals from hosts to advertising to gaming, finally entering into the platform service. The others give me a host of knowledge around how customer need systems to work, and then the number and range of workloads we see at Rackspace re-enforces this.

ObjectRocket’s David Murphy

Many times the unique perspective comes with the scale such as someone calling up a single node to the multi-terabyte range. When they go to “shard” they can find the process that is normally very light and unnoticeable to most Mongo sharding can severally lock the metadata for an extended time. In other cases, the “balancer” might not be able to keep up with the amount of working being asked of it.

Toward the smaller end of the spectrum, having seen so many workloads from big to small. I can see similar thought processes and trends. When this happens having worked with some many of these workloads, and honestly having learned along the evolution of mongo helps me explain to clients the good, bad, and the hairy. Many times discussions come down to people not using connection pooling, non-indexed sorting, or complex operators such as $in, $nin, and more. In these cases, I can talk to people about the balance of using these concepts and when they will become bigger issues for them. My goal is to give them the enough knowledge to help determine when it is correct to use development resource to fix and issue, and when it’s manageable and that development could be better spent elsewhere.

Tom: The title of your tutorial also sounds like the perfect title of a book. Do you have any for one?

David: What an excellent question! I have thought about this. However, I think the goal of a book if I can find the time to do it. A working title might be “Mongo from the trenches: Surviving the minefield to get ahead”. I think the book might be broken into three sections: “When should you use or not user Mongo”, “Schema and Operatorators in the NoSQL world”, “Sharding”. I would do this as this could be a great mini book on its own the community really could use a level of depth similar to the MySQL 5.0 certification guides. I liked these books as it helped someone understand all the bits of what to consider with your schema design and how it affects the application as much as the database hosts. Then in the second half more administration geared it took those same schema and design choices to help you manage them with confidence.

In the end, Mongo is a good product that works well for most people as it matures we need more and discussion. On topics such as what should you monitor, how you should predict issues, and how valuable are regular audits. Especially in an ecosystem where it’s easy to spin something up, launch it, and move on to the next project.

Tom: When and why would you recommend using MongoDB instead of MySQL?

David: I am glad I mentioned this is worthy of a book already, as it is such a complex topic and one that gets me very excited.

I feel there is a bit or misinformation on both sides of this field. Many in the MySQL camp of experts know when someone says they can’t get more than 1000 TPS via MySQL. 9 out of 10 times and design, not a technology issue, the Mongo crowd love this and due to inherit sharding nature of Mongo they can sidestep these types of issues. Conversely in the Mongo camp you will hear how bad the SQL standard is, however, omitting transactions for a moment, the same types of operations exist in MySQL and Mongo. There are some interesting powers in the Mongo aggregation. However, SQL is more powerful and just as complex as some map reduce jobs and aggregations I have written.

As to your question, MySQL will always win in regards to repeatable reads to the database in a transaction. There is some talk of limited transactions in Mongo. However, these will likely not become global and cluster wide anytime soon if ever. I don’t trust floats in Mongo for financials; it’s not that Mongo doesn’t do them but rather JavaScript type floats are present. Sometimes you need to store data as a 64-bit integer and do math in the app to make it a high precision float. MySQL, on the other hand, has excellent support for precision.

Another area is simply looking at the history of Mongo and MySQL. Mongo until WiredTiger and RocksDB were very similar to MyISAM from a locking behavior and support perspective. With the advent of the new storage system, we will-will see major leaps forward in types of flows you will want in Mongo. With the writer lock issue is gone, and locking between the systems is becoming more and more similar making deciding which much harder.

The news is not all use. However, subdocuments and array support in Mongo is amazing there are so many things I can do in Mongo that even in bitwise SET/ENUM operators I could not do. So if you need that type of system, or you want to create a semi denormalize for of a view in the database. Mongo can do this with ease and on the fly. MySQL, on the other hand, would take careful planning and need whole tables updated. In this regard I feel more people could use Mongo and is ability to have a versioned document schema allowing more incremental changes to documents. With new code releases, allowing the application to read old version and “upgrade” them to the latest form. Removing a whole flurry of maintenance related pains that RDBMs have to the frustration of developers who just want to launch the new product.

The last thing I would want to say here is you need not choose, why not use both. Mongo can be very powerful for keeping a semi denormalized version of the data that is nimble to allow fast application or system updates and features. Leaving MySQL for a very specific workload that need the precision are simple are not expected to have schema changes. I am a huge fan of keeping the transactional portions in MySQL, and the rest in Mongo. Allowing you to scale quickly up and down the build of your data needs, and more slowly change the parts that need to be 100% consistent all of the time with no room for eventual consistency.

Tom: What another session(s) are you most looking forward to besides your own at Percona Live Amsterdam?

David: There are a few that are near and dear to me.

“Turtles all the way down: tuning Linux for database workloads” looks like a great one. It is one view I have always had, and DBA’s should be DBA’s, SysAdmins, and Storage people rolled into one. That way they can understand the impacts of the application down to the blocks the database reads.

“TokuDB internals” is another one. I have used TokuDB in MySQL and Mongo to some degree but as it has never had in-depth documentation. A topic like that is a great way to fill any gaps for experienced and new people alike.

“Database Reliability Engineering” looks like a great talk from a great speaker.

As an InnoDB geek, I like the idea around “Understanding InnoDB locks: case studies.”

I see a huge amount of potential for MaxScale if anyone else is curious, “Anatomy of a Proxy Server: MaxScale Internals” should be good for R/W splits and split writing type cases.

Finally, one of my favorite people is Charity as she always is so energetic and can get to the heart of the matter. If you are not going to “Upgrade your database: without losing your data, your perf or your mind” you are missing out!

Tom: Thanks for speaking with me, David! Is there anything else you’d like to add: either about Rackspace or Percona Live Amsterdam?

David: In regards to Rackspace, I urge everyone to check out the Data Services group. We handle everything from Redis to Hadoop with a goal of augmenting your groups or providing experts to help keep your uptime as high as possible. With options for dedicated hosts to platform type services, there is something that helps everyone. Rackspace is not just a cloud company but a real support company that provides amazing hardware to use, or support for other hardware location that is growing rapidly.

With Percona Amsterdam, everyone should come the group of speakers is simply amazing, I for one am excited by so many topics because they are all so compelling. Outside of that you will it hard find another a gathering of database experts with multiple technologies under their belt and who truly believe in the move to picking the right technology for the right use case.

The post ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam appeared first on MySQL Performance Blog.

Update on the InnoDB double-write buffer and EXT4 transactions

29 June, 2015 by awie

In a post, written a few months ago, I found that using EXT4 transactions with the “data=journal” mount option, improves the write performance significantly, by 55%, without putting data at risk. Many people commented on the post mentioning they were not able to reproduce the results and thus, I decided to further investigate in order to find out why my results were different.

So, I ran sysbench benchmarks on a few servers and found when the InnoDB double-write buffer limitations occur and when they don’t. I also made sure some of my colleagues were able to reproduce the results. Basically, in order to reproduce the results you need the following conditions:

Spinning disk (no SSD)
Enough CPU power
A dataset that fits in the InnoDB buffer pool
A continuous high write load with many ops waiting for disk

Using the InnoDB double write buffer on an SSD disk somewhat prevents us from seeing the issue, something good performance wise. That comes from the fact that the latency of each write operation is much lower. That makes sense, the double-writer buffer is an area of 128 pages on disk that is used by the write threads. When a write thread needs to write a bunch of dirty pages to disk, it first writes them sequentially to free slots in the double write buffer in a single iop and then, it spends time writing the pages to their actual locations on disk using typically one iop per page. Once done writing, it releases the double-write buffer slots it was holding and another thread can do its work. The presence of a raid controller with a write cache certainly helps, at least until the write cache is full. Thus, since I didn’t tested with a raid controller, I suspect a raid controller write cache will delay the apparition of the symptoms but if the write load is sustained over a long period of time, the issue with the InnoDB double write buffer will appear.

So, to recapitulate, on a spinning disk, a write thread needs to hold a lock on some of the double-write buffer slots for at least a few milliseconds per page it needs to write while on a SSD disk, the slots are released very quickly because of the low latency of the SSD storage. To actually stress the InnoDB double-write buffer on a SSD disk, one must push much more writes.

That leads us to the second point, the amount of CPU resources available. At first, one of my colleague tried to reproduce the results on a small EC2 instance and failed. It appeared that by default, the sysbench oltp.lua script is doing quite a lot of reads and those reads saturate the CPU, throttling the writes. By lowering the amount of reads in the script, he was then able to reproduce the results.

For my benchmarks, I used the following command:
sysbench --num-threads=16 --mysql-socket=/var/lib/mysql/mysql.sock --mysql-database=sbtest --mysql-user=root --test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=50000000 --oltp-test-mode=complex --mysql-engine=innodb --db-driver=mysql --report-interval=60 --max-requests=0 --max-time=3600 run

Both servers used were metal boxes with 12 physical cores (24 HT). With less CPU resources, I suggest adding the following parameters:
--oltp-point-selects=1 --oltp-range-size=1 --oltp-index-updates=10

So that the CPU is not wasted on reads and enough writes are generated. Remember we are not doing a generic benchmarks, we are just stressing the InnoDB double-write buffer.

In order to make sure something else isn’t involved, I verified the following:

Server independence, tried on 2 physical servers and one EC2 instance, Centos 6 and Ubuntu 14.04
MySQL provided, tried on MySQL community and Percona Server
MySQL version, tried on 5.5.37 and 5.6.23 (Percona Server)
Varied the InnoDB log file size from 32MB to 512MB
The impacts of the number of InnoDB write threads (1,2,4,8,16,32)
The use of Linux native asynchronous iop
Spinning and SSD storage

So, with all those verifications done, I can maintain that if you are using a server with spinning disks and a high write load, using EXT4 transactions instead of the InnoDB double write buffer yields to an increase in throughput of more than 50%. In an upcoming post, I’ll show how the performance stability is affected by the InnoDB double-write buffer under a high write load.

Appendix: the relevant part of the my.cnf
innodb_buffer_pool_size = 12G innodb_write_io_threads = 8 # or else in {1,2,4,8,16,32} innodb_read_io_threads = 8 innodb_flush_log_at_trx_commit = 0 # must be 0 or 2 to really stress the double write buffer innodb_log_file_size = 512M # or 32M, 64M innodb_log_files_in_group = 2 innodb_file_per_table innodb_flush_method=O_DIRECT # or O_DSYNC innodb_buffer_pool_restore_at_startup=300 # On 5.5.x, important to warm up the buffer pool #innodb_buffer_pool_load_at_startup=ON # on 5.6, important to warm up the buffer pool #innodb_buffer_pool_dump_at_shutdown=ON # on 5.6, important to warm up the buffer pool, skip-innodb_doublewrite # or commented out innodb_flush_neighbor_pages=none # or area for spinning

The post Update on the InnoDB double-write buffer and EXT4 transactions appeared first on MySQL Performance Blog.

Getting EXPLAIN information from already running queries in MySQL 5.7

29 June, 2015 by awie

When a new version of MySQL is about to be released we read a lot of blog posts about the performance and scalability improvements. That’s good but sometimes we miss some small features that can help us a lot in our day-to-day tasks. One good example is the blog post that Aurimas wrote about a new small feature in MySQL 5.6 that I didn’t know about until I read it: the Automatic InnoDB transaction log file size change. How cool is that?

I plan to write a series of blog posts that will show some of those small new features in MySQL 5.7 that are going to be really useful. I’m going to start with EXPLAIN FOR CONNECTION.

This feature allows us to run an EXPLAIN for an already running statement. Let’s say that you find a query that has been running for a long time and you want to check why that could be happening. In 5.7 you can just ask MySQL to EXPLAIN the query that a particular connection is running and get the execution path. You can use it if the query is a SELECT, DELETE, INSERT, REPLACE or UPDATE. Won’t work if the query is a prepared statement though.

Let me show you an example of how it works.

We have a long running join.

mysql [localhost] {msandbox} ((none)) > show processlist G
*************************** 1. row ***************************
     Id: 9
   User: msandbox
   Host: localhost
     db: employees
Command: Query
   Time: 49
  State: Sending data
   Info: select count(*) from employees, salaries where employees.emp_no = salaries.emp_no

Let’s see the execution plan for the query:

mysql [localhost] {msandbox} ((none)) > explain for connection 9 G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299540
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2803840
     filtered: 100.00
        Extra: Using where; Using join buffer (Block Nested Loop)

The join between those tables is not using any index at all so there is some room for improvement here

Conclusion

You can use this feature to see why a query is running for too long and based on the info decide how to fix it and how to proceed. This is going to be a very useful feature for DBAs who want to diagnose performance problems and slow queries.

The post Getting EXPLAIN information from already running queries in MySQL 5.7 appeared first on MySQL Performance Blog.

Innodb redo log archiving

10 April, 2014 by awie

Percona Server 5.6.11-60.3 introduces a new “log archiving” feature. Percona XtraBackup 2.1.5 supports “apply archived logs.” What does it mean and how it can be used?

Percona products propose three kinds of incremental backups. The first is full scan of data files and comparison the data with backup data to find some delta. This approach provides a history of changes and saves disk space by storing only data deltas. But the disadvantage is a full-data file scan that adds load to the disk subsystem. The second kind of incremental backup avoids extra disk load during data file scans.

The idea is in reading only changed data pages. The information about what specific pages were changed is provided by the server itself which writes files with the information during work. It’s a good alternative but changed-pages tracking adds some small load. And Percona XtraBackup’s delta reading leads to non-sequential disk io. This is good alternative but there is one more option.

The Innodb engine has a data log. It writes all operations which modify database pages to log files. This log is used in the case of unexpected server terminating to recover data. The Innodb log consists of the several log files which are filled sequentially in circular. The idea is to save those files somewhere and apply all modifications from archived logs to backup data files. The disadvantage of this approach is in using extra disk space. The advantage is there is no need to do an “explicit” backup on the host server. A simple script could sit and wait for logs to appear then scp/netcat them over to another machine.

But why not use good-old replication? Maybe replication does not have such performance as logs recovering but it is more controlled and well-known. Archived logs allows you to do any number of things with them from just storing them to doing periodic log applying. You can not recover from a ‘DROP TABLE’, etc with replication. But with this framework one could maintain the idea of “point in time” backups.

So the “archived logs” feature is one more option to organize incremental backups. It is not widely used as it was issued not so far and there is not A good understanding of how it works and how it can be used. We are open to any suggestions about its suggest improvements and use cases. The subject of this post is to describe how it works in depth. As log archiving is closely tied with innodb redo logs the internals of redo logs will be covered too. This post would be useful not only for DBA but also for Software Engineers because not only common principles are considered but the specific code too, and knowledge from this post can be used for further MySQL code exploring and patching.

What is the innodb log and how it is written?

Let’s remember what are innodb logs, why they are written, what they are used for.

The Innodb engine has buffer pool. This is a cache of database pages. Any changes are done on page in buffer pool, then page is considered as “dirty,” which means it must be flushed, and pushed to the flush list which is processed periodically by special thread. If pages are not flushed to disk and server is terminated unexpectedly the changes will be lost. To avoid this innodb writes changes to redo log and recover data from redo log during start. This technique allows to delay buffer pool pages flushing. It can increase performance because several changes of one page can be accumulated in memory and then flushed by one io. Except that flushed pages can be grouped to decrease the number of non-sequential io’s. But the down-side of this approach is time for data recovering. Let’s consider how this log is stored, generated and used for data recovering.

Log files

Redo log consists of a several log files which are treated as a circular buffer. The number and the size of log files can be configured. Each log file has a header. The description of this header can be found in “storage/innobase/include/log0log.h” by “LOG_GROUP_ID” keyword.

Each log file contains log records. Redo log records are written sequentially by log blocks of OS_FILE_LOG_BLOCK_SIZE size which is equal to 512 bytes by default and can be changed with innodb option. Each record has its LSN. LSN is a “Log Sequence Number” – the number of bytes written to log from the log creation to the certain log record. Each block consists of header, trailer and log records.

Log blocks

Let’s consider log block header. The first 4 bytes of the header is log block number. The block number is very similar as LSN but LSN is measured in bytes and block number is measured by OS_FILE_LOG_BLOCK_SIZE. Here is the simple formula how LSN is converted to block number:

return(((ulint) (lsn / OS_FILE_LOG_BLOCK_SIZE) & 0x3FFFFFFFUL) + 1);

This formula can be found in log_block_convert_lsn_to_no() function. The next two bytes is the number of bytes used in the block. The next two bytes is the offset of the first MTR log record in this block. What is MTR will be described below. Currently it can be considered as a synonym of bunch of log records which are gathered together as a description of some logical operation. For example it can be a group of log records for inserting new row to some table. This field is used when there are records of several MTR’s in one block. The next four bytes is a checkpoint number. The trailer is four bytes of log block checksum. The above description can be found in “storage/innobase/include/log0log.h” by “LOG_BLOCK_HDR_NO” keyword.

Before writing to disk log blocks must be somehow formed and stored. And the question is:

How log blocks are stored in memory and on disk?

Where log blocks are stored before flushing to disk and how they are written and flushed?

Global log object and log buffer

The answer to the first part of the question is log buffer. Server holds very important global object log_sys in memory. It contains a lot of useful information about logging state. Log buffer is pointed by log_sys->buf pointer which is initialized in log_init(). I would highlight the following log_sys fields that are used for work with log buffer and flushing:

log_sys->buf_size – the size of log buffer, can be set with innodb-log-buffer-size variable, the default value is 8M;
log_sys->buf_free – the offset from which the next log record will be written;
log_sys->max_buf_free – if log_sys->buf_free is greater then this value log buffer must be flushed, see log_free_check();
log_sys->buf_next_to_write – the offset of the next log record to write to disk;
log_sys->write_lsn – the LSN up to which log is written;
log_sys->flushed_to_disk_lsn – the LSN up to which log is written and flushed;
log_sys->lsn – the last LSN in log buffer;

So log_sys->buf_next_to_write is between 0 and log_sys->buf_free, log_sys->write_lsn is equal or less log_sys->lsn, log_sys->flushed_to_disk_lsn is less or equal to log_sys->write_lsn.

The relationships for those fields can be easily traced with debugged by setting up watchpoints.

Ok, we have log buffer, but how do log records come to this buffer?

Where log records come from?

Innodb has special objects that allow you to gather redo log records for some operations in one bunch before writing them to log buffer. These objects are called “mini-transactions” and corresponding functions and data types have “mtr” prefix in the code. The objects itself are described in mtr_t “c” structure. The most interesting fields of this structure are the following:

mtr_t::log – contains log records for the mini-transaction,
mtr_t::memo – contains pointers to pages which are changed or locked by the mini-transaction, it is used to push pages to flush list and release locks after logs records are copied to log buffer in mtr_commit() (see mtr_memo_pop_all() called in mtr_commit()).

mtr_start() function initializes an object of mtr_t type and mtr_commit() writes log records from mtr_t::log to log_sys->buf + log_sys->buf_free. So the typical sequence of any operation which changes data is the following:

mtr_start(); // initialize mtr object some_ops... // operations on data which are logged in mtr_t::log mtr_commit(); // write logged operations from mtr_t::log to log buffer log_sys->buf

page_cur_insert_rec_write_log() is a good example of how mtr records can be written and mtr::memo can be filled. The low-level function which writes data to log buffer is log_write_low(). This function is invoked inside of mtr_commit() and not only copy the log records from mtr_t object to log buffer log_sys->buf but also creates a new log blocks inside of log_sys->buf, fills their header, trailer, calculates checksum.

So log buffer contains log blocks which are sequentially filled with log records which are grouped in “mini-transactions” which logically can be treated as some logical operation over data which consists of a sequence of mini-operations(log records).

As log records are written sequentially in log buffer one mini-transaction and even one log record can be written in two neighbour blocks. That is why the header field which would contain the offset of the first MTR in the block is necessary to calculate the point from which log records parsing can be started. This field was described in 2.2.

So we have a buffer of log blocks in a memory. How is data from this buffer written to disk? The mysql documentation says that this depends on innodb_flush_log_at_trx_commit option. There can be three cases depending on the value of this option. Let’s consider each of them.

Writing log buffer to disk: innodb_flush_log_at_trx_commit is 1 or 2.

The first two cases is when innodb_flush_log_at_trx_commit is 1 or 2. In these cases flush log records are written for 2 and flushed for 1 on each transaction commit. If innodb_flush_log_at_trx_commit is 2 log records are flushed periodically by special thread which will be considered later. The low-level function which writes log records from buffer to file is log_group_write_buf(). But in the most cases it is not called directly but it is called from more high level log_write_up_to(). For the current case the calling stack is the following:

(trx_commit_in_memory() or trx_commit_complete_for_mysql() or trx_prepare() e.t.c)-> trx_flush_log_if_needed()-> trx_flush_log_if_needed_low()-> log_write_up_to()-> log_group_write_buf().

It is quite easy to find the higher levels of calling stack, just set up breakpoint on log_group_write_buf() and execute any sql query that modifies innodb data. For example for the simple “insert” sql query the higher levels of calling stack are the following:

mysql_execute_command()-> trans_commit_stmt()-> ha_commit_trans()-> TC_LOG_DUMMY::commit()-> ha_commit_low()-> innobase_commit()-> trx_commit_complete_for_mysql()-> trx_flush_log_if_needed()-> ... .

log_io_complete() callback is invoked when i/o is finished for log files (see fil_aio_wait()). log_io_complete() flushes log files if this is not forbidden by innodb_flush_method or innodb_flush_log_at_trx_commit options.

Writing log buffer to disk: innodb_flush_log_at_trx_commit is equal to 0

The third case is when innodb_flush_log_at_trx_commit is equal to 0. For this case log buffer is NOT written to disk on transaction commit, it is written and flushed periodically by separate thread “srv_master_thread”. If innodb_flush_log_at_trx_commit = 0 log files are flushed in the same thread by the same calls. The calling stack is the following:

srv_master_thread()-> (srv_master_do_active_tasks() or srv_master_do_idle_tasks() or srv_master_do_shutdown_tasks())-> srv_sync_log_buffer_in_background()-> log_buffer_sync_in_background()->log_write_up_to()->... .

Special cases for logs flushing

While log_io_complete() do flushing depending on innodb_flush_log_at_trx_commit value among others log_write_up_to() has it’s own flushing criteria. This is flush_to_disk function argument. So it is possible to force log files flushing even if innodb_flush_log_at_trx_commit = 0. Here are examples of such cases:

1) buf_flush_write_block_low()
Each page contains information about the last applied LSN(buf_flush_write_block_low::newest_modification), each log record is a description of change on certain page. Imagine we flushed some changed pages but log records for these pages were not flushed and server goes down. After starting the server some pages will have the newest modifications, but some of them were not flushed and the correspondent log records are lost too. We will have inconsistent database in this case. That is why log records must be flushed before the pages they refer.

2) srv_sync_log_buffer_in_background()
As it was described above this function is called periodically by special thread and forces flushing.

3) log_checkpoint()
When checkpoint is made log files must be reliably flushed.

4) The special handlerton innobase_flush_logs() which can be called through ha_flush_logs() from mysql server.
For example ha_flush_logs() is called from MYSQL_BIN_LOG::reset_logs() when “RESET MASTER” or “RESET SLAVE” are executed.

5) srv_master_do_shutdown_tasks() – on shutdown, ha_innobase::create() – on table creating, ha_innobase::delete_table() – on table removing, innobase_drop_database() – on all database tables removing, innobase_rename_table() – on table rename e.t.c

If log files are treated as circular buffer what happens when the buffer is overflown?

Briefly. Innodb has a mechanism which allows you to avoid overflowing. It is called “checkpoints.” The checkpoint is a state when log files are synchronized with data files. In this case there is no need to keep the history of changes before checkpoint because all pages with the last modifications LSN less or equal to checkpoint LSN are flushed and the log files space from the last written LSN to the last checkpoint LSN can be reused. We will not describe a checkpoint process here because it is a separate interesting subject. The only thing we need to know is when checkpoint happens all pages with modification LSN less or equal to checkpoint LSN are reliably flushed.

How archived logs are written by server.

So the log contains information about page changes. But as we said, log files are the circular buffer. This means that they occupy fixed disk size and the oldest records can be rewritten by the newest ones as there are points when data files are synchronized with log files called checkpoints and there is no need to store the previous history of log records to guarantee database consistency. The idea is to save somewhere all log records to have the possibility of applying them to backuped data to have some kind if incremental backup. For example if we want to have an archive of log records. As log consists of log files it is reasonable to store log records in such files too, and these files are called “archived logs.”

Archived log files are written to the directory which can be set with special innodb option. Each file has the same size as innodb log size and the suffix of each archived file is the LSN from which it is started.

As well as log writing system log archiving system stores its data in global log_sys object. Here are the most valuable fields in log_sys from my point of view:

log_sys->archive_buf, log_sys->archive_buf_size – logs archive buffer and its size, log records are copied from log buffer log_sys->buf to this buffer before writing to disk;
log_sys->archiving_phase – the current phase of log archiving: LOG_ARCHIVE_READ when log records are being copied from log_sys->buf to log_sys->archive_buf, LOG_ARCHIVE_WRITE when log_sys->archive_buf is being written to disk;
log_sys->archived_lsn – the LSN to which log files are written;
log_sys->next_archived_lsn – the LSN to which write operations was invoked but not yet finished;
log_sys->max_archived_lsn_age – the maximum difference between log_sys->lsn and log_sys->archived_lsn, if this difference exceeds the log are being archived synchronously, i.e. the difference is decreased;
log_sys->archive_lock – this is rw-lock which is used for synchronizing LOG_ARCHIVE_WRITE and LOG_ARCHIVE_READ phases, it is x-locked on LOG_ARCHIVE_WRITE phase.

So how is data copied from log_sys->buf to log_sys->archived_buf? log_archive_do() is used for this. It is not only set the proper state for archived log fields in log_sys but also invokes log_group_read_log_seg() with corresponding arguments which not only copy data from log buffer to archived log buffer but also invokes asynchronous write operation for archived log buffer. log_archive_do() can wait until io operations are finished using log_sys->archive_lock if corresponding function parameter is set.

The main question is on what circumstances log_archive_do() is invoked, i.e. when log records are being written to archived log files. The first call stack is the following:

log_free_check()-> log_check_margins()-> log_archive_margin()-> log_archive_do().

Here is text of log_free_check() with comments:

/*********************************************************************// Checks if there is need for a log buffer flush or a new checkpoint, and does this if yes. Any database operation should call this when it has modified more than about 4 pages. NOTE that this function may only be called when the OS thread owns no synchronization objects except the dictionary mutex. */ UNIV_INLINE void log_free_check(void) /*================*/ {

#ifdef UNIV_SYNC_DEBUG
ut_ad(sync_thread_levels_empty_except_dict());
#endif /* UNIV_SYNC_DEBUG */

if (log_sys->check_flush_or_checkpoint) {

log_check_margins(); } }

log_sys->check_flush_or_checkpoint is set when there is no enough free space in log buffer or it is time to do checkpoint or any other bound case. log_archive_margin() is invoked only if the limit if the difference between log_sys->lsn and log_sys->archived_lsn is exceeded. Let’s refer to this difference as archived lsn age.

One more call log_archive_do() is from log_open() when archived lsn age exceeds some limit. log_open() is called on each mtr_commit(). And for this case archived logs are written synchronously.

The next synchronous call is from log_archive_all() during shutdown.

Summarizing all above archived logs begins to be written when the log buffer is full enough to be written or when checkpoint happens or when the server is in the process of shut down. And there is no any delay between writing to archive log buffer and writing to disk. I mean there is no way to say that archived logs must be written once a second as it is possible for redo logs with innodb_flush_log_at_trx_commit = 0. As soon as data is copied to the buffer the write operation is invoked immediately for this buffer. Archived log buffer is not filled on each mtr_commit() so it does not slow down the usual logging process. The exception is when there are a lot of io operations what can be the reason of archive log age is too big. The result of big archive log age is the synchronous archived logs writing during mtr_commit(). Memory to memory copying is quite fast operation that is why the data is copied to archived log buffer and is written to disk asynchronously minimizing delays which can be caused by logs archiving.

PS: Here is another call stack for writing archived log buffer to archived log files:

log_io_complete()->log_io_complete_archive()->log_archive_check_completion_low()->log_archive_groups().

I propose to explore this stack yourself.

Logs recovery process, how it is started and works inside. Archived logs applying.

So we discovered how innodb redo logging works, and how redo logs are archived. And the last uncovered thing is how recovery works and how archived logs are applied. These two processes are very similar – that is why they are discussed in one section of this post.

The story begins with innobase_start_or_create_for_mysql() which is invoked from innobase_init(). The following trident in innobase_start_or_create_for_mysql() can be used to search the relevant code:

if (create_new_db) { ... } else if (srv_archive_recovery) { ... } else { ... }

The second condition and the last one is the place from which archived logs applying and innodb logs recovery processes correspondingly start. These two blocks wrap two pairs of functions:

recv_recovery_from_archive_start() recv_recovery_from_archive_finish()
and
recv_recovery_from_checkpoint_start() recv_recovery_from_checkpoint_finish()

And all the magic happens in these pairs. As well as global log_sys object for redo logging there is global recv_sys object for innodb recovery and archived logs applying. It is created and initialized in recv_sys_create() and recv_sys_init() functions correspondingly. The following fields if recv_sys object are the most important from my point:

recv_sys->limit_lsn – the LSN up to which recovery should be made, this value is initialized with the maximum value of uint64_t(see #define LSN_MAX) for the recovery process and with certain value which is passed as an argument of recv_recovery_from_archive_start() function and can be set via xtrabackup option for log applying;
recv_sys->parse_start_lsn – the LSN from which logs parsing is started, for the the logs recovery this value equals to the last checkpoint LSN, for logs applying this is last applied LSN;
recv_sys->scanned_lsn – the LSN up to which log files are scanned;
recv_sys->recovered_lsn – the LSN up to which log records are applied, this value <= recv_sys->scanned_lsn;

The first thing that must be done for starting recovery process is to find out the point in log files where the recovery must be started from. This is the last checkpoint LSN. recv_find_max_checkpoint() proceed this. As we can see in log_group_checkpoint() the following code writes checkpoint info into two places in the first log file depending on the checkpoint number:

/* We alternate the physical place of the checkpoint info in the first log file */

if ((log_sys->next_checkpoint_no & 1) == 0) { write_offset = LOG_CHECKPOINT_1; } else { write_offset = LOG_CHECKPOINT_2; }

So recv_find_max_checkpoint() reads checkpoint info from both places and selects the latest checkpoint.

The same idea is applied for logs, too, but the last applied LSN instead of last checkpoint LSN must be found. Here is the call stack for reading last applied LSN:

innobase_start_or_create_for_mysql()-> open_or_create_data_files()-> fil_read_first_page().

The last applied LSN is stored in the first page of data files in (min|max)_flushed_lsn fields(see FIL_PAGE_FILE_FLUSH_LSN offset). These values are written in fil_write_flushed_lsn_to_data_files() function on server shutdown.

So the main difference between logs applying and recovery process at this stage is the manner of calculating LSN from which log records will be read. For logs applying the last flushed LSN is used but for recovery process it is the last checkpoint LSN. Why does this difference take place? Logs can be applied periodically. Assume we gather archived logs and apply them once an hour to have fresh backup. After applying the previous bunch of log files there can be unfinished transactions. For the recovery process any unfinished transactions are rolled back to have consistent db state at server starting. But for the logs applying process there is no need to roll back them because any unfinished transactions can be finished during the next logs applying.

After calculating the start LSN the sequence of actions is the same for both recovering and applying. The next step is reading and parsing log records. See recv_group_scan_log_recs() which is invoked from recv_recovery_from_checkpoint_start_func() for logs recovering and recv_recovery_from_archive_start()->log_group_recover_from_archive_file() for logs applying.

The first we read log records to some buffer and then invoke recv_scan_log_recs() to parse them. recv_scan_log_recs() checks each log block on consistency(checksum + comparing the log block number written in log block with log block number calculated from log block LSN) and other edge cases and copy it to parsing buffer recv_sys->buf with recv_sys_add_to_parsing_buf() function. The parsing buffer is then parsed by recv_parse_log_recs(). Log records are stored in hash table recv_sys->addr_hash. The key for this hash table is calculated basing on space id and page number pair. This pair refers to the page to which log records must be applied. The value of the hash table is object of recv_addr_t type. recv_addr_t type contains rec_list field which is the list of log records for applying to the (space id, page num) page (see recv_add_to_hash_table().

After parsing and storing log record in hash table recv_sys->addr_hash log records are applied. The function which is responsible for log records applying is recv_apply_hashed_log_recs(). It is invoked from recv_scan_log_recs() if there is no enough memory to store log records and at the end of recovering/applying process. For each element of recv_sys->addr_hash, i.e. for each DB page which must be changed with log records recv_recover_page() is invoked. It can be invoked as from recv_apply_hashed_log_recs() in the case if page is already in buffer pool of from buf_page_io_complete() on io completion, i.e. just after page was read from storage. Applying log records on page read completion is necessary and very convenient. Assume log records have not yet applied as we had enough memory to store the whole recovery log records. But we want for example to boot DB dictionary. I this case any records that concern to the pages of the dictionary will be applied to those pages just after reading them from storage to buffer pool.

The function which applies log records to the certain page is recv_recover_page_func(). It gets the list of log records for the certain page from recv_sys->addr_hash hash table, for each element of this list it compares the lsn of last page changes with the LSN of the record, and if the former is greater the later it applies log record to the page.

After applying all log records from archived logs xtrabackup writes last applied LSN to (min|max)_flushed LSN fields of each data file and finishes execution. The logs recovery process rollbacks all unfinished transactions unless this is forbidden with innodb-force-recovery parameter.

Conclusion

We covered the processes of redo logs writing and recovery in depth. These are very important processes as they provide data consistency on crashes. These two processes became a base for logs archiving and applying features. As log records can describe any data changes the idea is to store these records somewhere and then apply them to backups for organizing some kind of incremental backup.

The features were implemented a short time ago and currently they are not widely used. So if you have something to say about them you are welcome to comment for discussion.

The post Innodb redo log archiving appeared first on MySQL Performance Blog.

InnoDB adaptive flushing in MySQL 5.6: checkpoint age and io capacity

31 October, 2013 by awie

In MySQL 5.6 InnoDB has a dedicated thread (page_cleaner) that’s responsible for performing flushing operations. Page_cleaner performs flushing of the dirty pages from the buffer pool based on two factors:
– access pattern – the least recently used pages will be flushed by LRU flusher from LRU_list when buffer pool has no free pages anymore;
– age – the oldest modified non-flushed pages are part of flush_list structure and will be flushed by flush_list flusher based on several heuristics.

There is a good overview of the page_cleaner and also here you may find some details about flushing in MySQL 5.6. Below I describe several additional aspects of the flush_list flushing that was not really covered yet.

flush_list flushing and checkpoint age

The amount of the aged pages that is possible to keep in the flush_list is limited by the combined size of the innodb log files. So the main purpose of the flush_list flushing is to flush pages from this list with such a rate that will also always allow enough free space in the log files. On the other hand, too aggressive flushing means less write combining, unnecessary load on the I/O subsystem, in the end undoing performance benefits of having larger redo logs. In MySQL 5.6 the amount of pages to flush is calculated in the InnoDB adaptive routine based on the current checkpoint age with the following formula:

 
percentage of the IO capacity that should be used for flushing =
        ((srv_max_io_capacity / srv_io_capacity) * (lsn_age_factor * sqrt(lsn_age_factor))) / 7.5;

We modeled that formula in R and found that it’s possible to improve it such a way that the curve becomes more flat and as a result flushing becomes less aggressive. That new formula is enabled in Percona Server 5.6 by default.

flush_list flushing and io_capacity

InnoDB provides two variables that allow the control of the background flushing rate – innodb_io_capacity and innodb_io_capacity_max. There is quite a detailed description for these vars. However there are several things that are not really covered in the documentation:

– innodb_io_capacity_max is the most important variable in case of adaptive flushing as only that variable actually limiting the flushing rate. See above formula and charts.

– innodb_io_capacity is used for limiting IO operations during merging of the insert buffer and flushing in cases of server inactivity/shutdown.

For practical needs, the above means the following:

– if the MySQL server is in an active state (serving user requests) you need to adjust innodb_io_capacity_max to increase/decrease flushing rate.
– if the MySQL server is in an idle state or performing shutdown flushing of the pages from flush_list will be limited by innodb_io_capacity value only.

– if change_buffering is ON and server is in active state it will allow to use either 5% of innodb_io_capacity or vary rate from 5% to 55% if more than 50% of insert buffer size was already used.
– if change_buffering is ON and server is idle it will use 100% of innodb_io_capacity for merge operations

The post InnoDB adaptive flushing in MySQL 5.6: checkpoint age and io capacity appeared first on MySQL Performance Blog.