Dealing with corrupted InnoDB data


MySQLData corruption! It can happen. Maybe because of a bug or storage problem that you didn’t expect, or MySQL crashes when a page checksum’s result is different from what it expected. Either way, corrupted data can and does occur. What do you do then?

Let’s look at the following example and see what can be done when you face this situation.

We have some valuable data:

> select * from t limit 4;
| i | c      |
| 1 | Miguel |
| 2 | Angel  |
| 3 | Miguel |
| 4 | Angel  |
> select count(*) from t;
| count(*) |
|  2097152 |

One day the query you usually run fails and your application stops working. Even worse, it causes the crash already mentioned:

> select * from t where i=2097151;
ERROR 2006 (HY000): MySQL server has gone away

Usually this is the point when panic starts. The error log shows:

2016-01-13 08:01:48 7fbc00133700 InnoDB: uncompressed page, stored checksum in field1 2912050650, calculated checksums for field1: crc32 1490770609, innodb 1549747911, none 3735928559, stored checksum in field2 1670385167, calculated checksums for field2: crc32 1490770609, innodb 2416840536, none 3735928559, page LSN 0 130051648, low 4 bytes of LSN at page end 1476903022, page number (if stored to page already) 4651, space id (if created with >= MySQL-4.1.1 and stored already) 7
InnoDB: Page may be an index page where index id is 22
InnoDB: (index "PRIMARY" of table "test"."t")
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 4651.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also
InnoDB: about forcing recovery.
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 4651.
InnoDB: You may have to recover from a backup.
2016-01-13 08:01:48 7fbc00133700 InnoDB: Page dump in ascii and hex (16384 bytes):
 len 16384; hex ad925dda0000122b0000122affffffff0000000007c06e4045bf00000000000000000

OK, our database is corrupted and it is printing the page dump in ASCII and hex. Usually, the recommendation is to recover from a backup. In case you don’t have one, the recommendation would be the same as the one given by the error log. When we hit corruption, first thing we should try is dumping the data and then re-importing to another server (if possible). So, how we can read a corrupted TABLE and avoid the crash? In most cases, the 


  option will help us. It has values from 1 to 6. They are documented here:

The idea is to start with 1. If that doesn’t work, proceed to 2. If it fails again, then go to 3 . . . until you find a value that allows you to dump the data. In this case I know that the problem is a corrupted InnoDB page, so a value of 1 should be enough:

“Lets the server run even if it detects a corrupt page. Tries to make SELECT * FROM tbl_name jump over corrupt index records and pages, which helps in dumping tables.”

We add


 and restart the service. Now it’s time to try and dump our data with


. If the corruption is even worse you need to keep trying different modes. For example, I have this error:

> create table t2 like t;
> insert into t2 select * from t;
ERROR 1034 (HY000): Incorrect key file for table 't'; try to repair it
> insert into t2 select * from t;
ERROR 1712 (HY000): Index t is corrupted


 doesn’t work here. It doesn’t allow me to dump the data:

# mysqldump -uroot -pmsandbox --port 5623 -h --all-databases > dump.sql
Error: Couldn't read status information for table t ()

but in my test server, it seems that



This procedure sounds good and usually works. The problem is that the feature is mostly broken after 5.6.15.


 values greater or equal 4 won’t allow the database to start:

2015-07-08 10:25:25 315 [ERROR] Unknown/unsupported storage engine: InnoDB
2015-07-08 10:25:25 315 [ERROR] Aborting

Bug are reported and verified here:

That means that if you have Insert Buffer, Undo Log or Redo log corruption (values 4, 5 and 6) you can’t continue. What to do?

  • You can install a older version of MySQL (previous to 5.6.15) to use higher values of

    . Modes 4, 5 and 6 can corrupt your data (even more) so they are dangerous. If there are no backups this is our only option, so my recommendation would be to make a copy of the data we have now and then proceed with higher values of




  • If you are using Percona Server,

      can be used to dump the data. You can use the value “salvage”. When the option value is salvage, XtraDB allows read access to a corrupted tablespace, but ignores corrupted pages.

If you can’t still dump your data, then you should try more advance solutions like Undrop for InnoDB. Also, it would be good idea to start planning to create regular database backups.    :)

Read more at:

Advanced Query Tuning in MySQL 5.6 and MySQL 5.7 Webinar: Q&A

Thank you for attending my July 22 webinar titled “Advanced Query Tuning in MySQL 5.6 and 5.7” (my slides and a replay available here). As promised here is the list of questions and my answers (thank you for your great questions).

Q: Here is the explain example:

mysql> explain extended select id, site_id from test_index_id where site_id=1
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test_index_id
         type: ref
possible_keys: key_site_id
          key: key_site_id
      key_len: 5
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Using index

why is site_id a covered index for the query, given the fact that a) we are selecting “id”, b) key_site_id only contains site_id?

As the table is InnoDB, all secondary keys will always contain primary key (“id”); in this case the secondary index will contain all needed information to satisfy the above query and key_site_id will be “covered index”

Q: Applications change over time. Do you suggest doing a periodic analysis of indexes that are being used and drop the ones that are not? If yes, any suggestions as to tackle that?

Yes, that is a good idea. Usually it can be done easily with Percona toolkit or Performance_schema in MySQL 5.6

  1. Enable slow query log and log every query, then use Pt-index-usage tool
  2. Or use the following query (as suggested by FromDual blog post):
SELECT object_schema, object_name, index_name
  FROM performance_schema.table_io_waits_summary_by_index_usage
 WHERE index_name IS NOT NULL
   AND count_star = 0
 ORDER BY object_schema, object_name;

Q: Does the duplicate index is found on 5.6/5.7 will that causes an performance impact to the db while querying?

Duplicate keys can have negative impact on selects:

  1. MySQL can get confused and choose a wrong index
  2. Total index size can grow, which can cause MySQL to run out of RAM

Q: What is the suggested method to measure performance on queries (other than the slow query log) so as to know where to create indexes?

Slow query log is most common method. In MySQL 5.6 you can also use Performance Schema and use events_statements_summary_by_digest table.

Q: I’m not sure if this was covered in the webinar but… are there any best-practices for fulltext indexes?

That was not covered in this webinar, however, I’ve done a number of presentations regarding Full Text Indexes. For example: Creating Geo Enabled Applications with MySQL 5.6

Q: What would be the limit on index size or number of indexes you can defined per table?

There are no limits on Index size on disk, however, it will be good (performance wise) to have active indexes fit in RAM.

In InnoDB there are a number of index limitations, i.e. a table can contain a maximum of 64 secondary indexes.

Q:  If a table has two columns you would like to sum, can you have that sum indexed as a calculated index? To add to that, can that calculated index have “case when”?

Just to clarify, this is only a feature of MySQL 5.7 (not released yet).

Yes, it is documented now:

CREATE TABLE triangle (
  sidea DOUBLE,
  sideb DOUBLE,
  sidec DOUBLE AS (SQRT(sidea * sidea + sideb * sideb))

Q: I have noticed that you created indexes on columns like DayOfTheWeek with very low cardinality. Shouldn’t that be a bad practice normally?

Yes, you are right! Unless, you are doing queries like “select count(*) from … where DayOfTheWeek = 7” those indexes may not be very useful.

Q: I saw an article that if you don’t specify a primary key upfront mysql / innodb creates one in the background (hidden). Is it different from a primary key itself, if most of the where fields that are used not in the primary / semi primary key? And is there a way to identify the tables with the hidden primary key indexes?

The “hidden” primary key will be 6 bytes, which will also be appended (duplicated) to all secondary keys. You can create an INT primary key auto_increment, which will be smaller (if you do not plan to store more than 4 billion rows). In addition, you will not be able to use the hidden primary key in your queries.

The following query (against information_schema) can be used to find all tables without declared primary key (with “hidden” primary key):

SELECT tables.table_schema, tables.table_name, tables.table_rows
FROM information_schema.tables
  SELECT table_schema, table_name
  FROM information_schema.statistics
  GROUP BY table_schema, table_name, index_name
      CASE WHEN non_unique = 0 AND nullable != 'YES' THEN 1 ELSE 0 END
    ) = COUNT(*)
) puks
ON tables.table_schema = puks.table_schema AND tables.table_name = puks.table_name
WHERE puks.table_name IS NULL
AND tables.table_type = 'BASE TABLE' AND engine='InnoDB'

You may also use mysql.innodb_index_stats table to find rows with the hidden primary key:


mysql> select * from mysql.innodb_index_stats;
| database_name | table_name | index_name      | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | n_diff_pfx01 | 96         | 1           | DB_ROW_ID                         |
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | n_leaf_pages | 1          | NULL        | Number of leaf pages in the index |
| test          | t1         | GEN_CLUST_INDEX | 2015-08-08 20:48:23 | size         | 1          | NULL        | Number of pages in the index      |

Q: You are using the alter table to create index, but how does mysql sort the data for creating the index? isn’t it uses temp table for that?

That is a very good question: the behavior of the “alter table … add index” has changed over time. As documented in Overview of Online DDL:

Historically, many DDL operations on InnoDB tables were expensive. Many ALTER TABLE operations worked by creating a new, empty table defined with the requested table options and indexes, then copying the existing rows to the new table one-by-one, updating the indexes as the rows were inserted. After all rows from the original table were copied, the old table was dropped and the copy was renamed with the name of the original table.

MySQL 5.5, and MySQL 5.1 with the InnoDB Plugin, optimized CREATE INDEX and DROP INDEX to avoid the table-copying behavior. That feature was known as Fast Index Creation

When MySQL uses “Fast Index Creation” operation it will create a set of temporary files in MySQL’s tmpdir:

To add a secondary index to an existing table, InnoDB scans the table, and sorts the rows using memory buffers and temporary files in order by the values of the secondary index key columns. The B-tree is then built in key-value order, which is more efficient than inserting rows into an index in random order.

Q: How good is InnoDB deadlocks on 5.7 comparing to 5.6 version. Is that based on parameters setup?

InnoDB deadlocks discussion is outside of the scope of this presentation. Valerii Kravchuk and Nilnandan Joshi did an excellent talk at Percona Live 2015 (slides available): Understanding Innodb Locks and Deadlocks

Q: What is the performance impact of generating a virtual column for a table having 66 Million records and generating the index. And how would you go about it? Do you have any suggestions on how to re organize indexes on the physical disk?

As MySQL 5.7 is not released yet, behavior of the virtual columns may change.  The main question here is: will it be online operations to a) add a virtual column (as this is only metadata change it should be very light operation anyway). b) add index on that virtual column. In the labs released it was not online, however this can change.

Thank you again for attending.

The post Advanced Query Tuning in MySQL 5.6 and MySQL 5.7 Webinar: Q&A appeared first on MySQL Performance Blog.

Read more at:

ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam

Say hello to David Murphy, lead DBA and MongoDB Master at ObjectRocket (a Rackspace company). David works on sharding, tool building, very large-scale issues and high-performance MongoDB architecture. Prior to ObjectRocket he was a MySQL/NoSQL architect at Electronic Arts. David enjoys large-scale operational tool building, high performance OS and database tuning. He is also a core code contributor to MongoDB. He’ll be speaking next month at Percona Live Amsterdam, which runs Sept. 21-13. Enter promo code “BlogInterview” at registration to save €20!

Tom: David, your 3-hour tutorial is titled “Mongo Sharding from the trench: A Veterans field guide.” Did your experience in working with vast amounts of data at Rackspace give you a unique perspective, in view, that now puts you into a position to help people just getting started? Can you give a couple examples?

David: I think this has been something organically I grew into from the days of supporting Cpanel type MySQL instances to today. I have worked for a few verticals from hosts to advertising to gaming, finally entering into the platform service. The others give me a host of knowledge around how customer need systems to work, and then the number and range of workloads we see at Rackspace re-enforces this.

ObjectRocket's David Murphy talks MongoDB & Percona Live Amsterdam

ObjectRocket’s David Murphy

Many times the unique perspective comes with the scale such as someone calling up a single node to the multi-terabyte range. When they go to “shard” they can find the process that is normally very light and unnoticeable to most Mongo sharding can severally lock the metadata for an extended time. In other cases, the “balancer” might not be able to keep up with the amount of working being asked of it.

Toward the smaller end of the spectrum, having seen so many workloads from big to small. I can see similar thought processes and trends. When this happens having worked with some many of these workloads, and honestly having learned along the evolution of mongo helps me explain to clients the good, bad, and the hairy. Many times discussions come down to people not using connection pooling, non-indexed sorting, or complex operators such as $in, $nin, and more. In these cases, I can talk to people about the balance of using these concepts and when they will become bigger issues for them. My goal is to give them the enough knowledge to help determine when it is correct to use development resource to fix and issue, and when it’s manageable and that development could be better spent elsewhere.


Tom: The title of your tutorial also sounds like the perfect title of a book. Do you have any for one?

David: What an excellent question! I have thought about this. However, I think the goal of a book if I can find the time to do it. A working title might be “Mongo from the trenches: Surviving the minefield to get ahead”. I think the book might be broken into three sections:  “When should you use or not user Mongo”,  “Schema and Operatorators in the NoSQL world”, “Sharding”. I would do this as this could be a great mini book on its own the community really could use a level of depth similar to the MySQL 5.0 certification guides.  I liked these books as it helped someone understand all the bits of what to consider with your schema design and how it affects the application as much as the database hosts. Then in the second half more administration geared it took those same schema and design choices to help you manage them with confidence.

In the end, Mongo is a good product that works well for most people as it matures we need more and discussion. On topics such as what should you monitor, how you should predict issues, and how valuable are regular audits. Especially in an ecosystem where it’s easy to spin something up, launch it, and move on to the next project.


Tom: When and why would you recommend using MongoDB instead of MySQL?

David: I am glad I mentioned this is worthy of a book already, as it is such a complex topic and one that gets me very excited.

I feel there is a bit or misinformation on both sides of this field. Many in the MySQL camp of experts know when someone says they can’t get more than 1000 TPS via MySQL. 9 out of 10 times and design, not a technology issue,  the Mongo crowd love this and due to inherit sharding nature of Mongo they can sidestep these types of issues. Conversely in the Mongo camp you will hear how bad the  SQL standard is, however, omitting transactions for a moment, the same types of operations exist in MySQL and Mongo.  There are some interesting powers in the Mongo aggregation. However, SQL is more powerful and just as complex as some map reduce jobs and aggregations I have written.

As to your question, MySQL will always win in regards to repeatable reads to the database in a transaction. There is some talk of limited transactions in Mongo. However, these will likely not become global and cluster wide anytime soon if ever.  I don’t trust floats in Mongo for financials; it’s not that Mongo doesn’t do them but rather JavaScript type floats are present. Sometimes you need to store data as a 64-bit integer and do math in the app to make it a high precision float. MySQL, on the other hand, has excellent support for precision.

Another area is simply looking at the history of Mongo and MySQL.  Mongo until WiredTiger and  RocksDB were very similar to MyISAM from a locking behavior and support perspective. With the advent of the new storage system, we will-will see major leaps forward in types of flows you will want in Mongo. With the writer lock issue is gone, and locking between the systems is becoming more and more similar making deciding which much harder.

The news is not all use. However, subdocuments and array support in Mongo is amazing there are so many things  I can do in Mongo that even in bitwise SET/ENUM operators I could not do. So if you need that type of system, or you want to create a semi denormalize for of a view in the database. Mongo can do this with ease and on the fly. MySQL, on the other hand, would take careful planning and need whole tables updated.  In this regard I feel more people could use Mongo and is ability to have a versioned document schema allowing more incremental changes to documents. With new code  releases, allowing the application to read old version and “upgrade” them to the latest form. Removing a whole flurry of maintenance related pains that RDBMs have to the frustration of developers who just want to launch the new product.

The last thing I would want to say here is you need not choose, why not use both. Mongo can be very powerful for keeping a semi denormalized version of the data that is nimble to allow fast application or system updates and features. Leaving MySQL for a very specific workload that need the precision are simple are not expected to have schema changes.  I am a huge fan of keeping the transactional portions in MySQL, and the rest in Mongo. Allowing you to scale quickly up and down the build of your data needs, and more slowly change the parts that need to be 100% consistent all of the time with no room for eventual consistency.


Tom: What another session(s) are you most looking forward to besides your own at Percona Live Amsterdam?

David: There are a few that are near and dear to me.

Turtles all the way down: tuning Linux for database workloads” looks like a great one. It is one view I have always had, and DBA’s should be DBA’s,  SysAdmins, and Storage people rolled into one. That way they can understand the impacts of the application down to the blocks the database reads.

TokuDB internals” is another one. I have used TokuDB in MySQL and Mongo to some degree but as it has never had in-depth documentation. A topic like that is a great way to fill any gaps for experienced and new people alike.

Database Reliability Engineering” looks like a great talk from a great speaker.

As an InnoDB geek, I like the idea around “Understanding InnoDB locks: case studies.”

I see a huge amount of potential for MaxScale if anyone else is curious, “Anatomy of a Proxy Server: MaxScale Internals” should be good for R/W splits and split writing type cases.

Finally, one of my favorite people is Charity as she always is so energetic and can get to the heart of the matter. If you are not going to “Upgrade your database: without losing your data, your perf or your mind” you are missing out!


Tom: Thanks for speaking with me, David! Is there anything else you’d like to add: either about Rackspace or Percona Live Amsterdam?

David: In regards to Rackspace, I urge everyone to check out the Data Services group.  We handle everything from Redis to Hadoop with a goal of augmenting your groups or providing experts to help keep your uptime as high as possible. With options for dedicated hosts to platform type services, there is something that helps everyone. Rackspace is not just a cloud company but a real support company that provides amazing hardware to use, or support for other hardware location that is growing rapidly.

With Percona Amsterdam, everyone should come the group of speakers is simply amazing, I for one am excited by so many topics because they are all so compelling. Outside of that you will it hard find another a gathering of database experts with multiple technologies under their belt and who truly believe in the move to picking the right technology for the right use case.

The post ObjectRocket’s David Murphy talks about MongoDB, Percona Live Amsterdam appeared first on MySQL Performance Blog.

Read more at:

Update on the InnoDB double-write buffer and EXT4 transactions

In a post, written a few months ago, I found that using EXT4 transactions with the “data=journal” mount option, improves the write performance significantly, by 55%, without putting data at risk. Many people commented on the post mentioning they were not able to reproduce the results and thus, I decided to further investigate in order to find out why my results were different.

So, I ran sysbench benchmarks on a few servers and found when the InnoDB double-write buffer limitations occur and when they don’t. I also made sure some of my colleagues were able to reproduce the results. Basically, in order to reproduce the results you need the following conditions:

  • Spinning disk (no SSD)
  • Enough CPU power
  • A dataset that fits in the InnoDB buffer pool
  • A continuous high write load with many ops waiting for disk

Using the InnoDB double write buffer on an SSD disk somewhat prevents us from seeing the issue, something good performance wise. That comes from the fact that the latency of each write operation is much lower. That makes sense, the double-writer buffer is an area of 128 pages on disk that is used by the write threads. When a write thread needs to write a bunch of dirty pages to disk, it first writes them sequentially to free slots in the double write buffer in a single iop and then, it spends time writing the pages to their actual locations on disk using typically one iop per page. Once done writing, it releases the double-write buffer slots it was holding and another thread can do its work. The presence of a raid controller with a write cache certainly helps, at least until the write cache is full. Thus, since I didn’t tested with a raid controller, I suspect a raid controller write cache will delay the apparition of the symptoms but if the write load is sustained over a long period of time, the issue with the InnoDB double write buffer will appear.

So, to recapitulate, on a spinning disk, a write thread needs to hold a lock on some of the double-write buffer slots for at least a few milliseconds per page it needs to write while on a SSD disk, the slots are released very quickly because of the low latency of the SSD storage. To actually stress the InnoDB double-write buffer on a SSD disk, one must push much more writes.

That leads us to the second point, the amount of CPU resources available. At first, one of my colleague tried to reproduce the results on a small EC2 instance and failed. It appeared that by default, the sysbench oltp.lua script is doing quite a lot of reads and those reads saturate the CPU, throttling the writes. By lowering the amount of reads in the script, he was then able to reproduce the results.

For my benchmarks, I used the following command:

sysbench --num-threads=16 --mysql-socket=/var/lib/mysql/mysql.sock
--mysql-database=sbtest --mysql-user=root
--test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=50000000
--oltp-test-mode=complex --mysql-engine=innodb --db-driver=mysql
--report-interval=60 --max-requests=0 --max-time=3600 run

Both servers used were metal boxes with 12 physical cores (24 HT). With less CPU resources, I suggest adding the following parameters:


So that the CPU is not wasted on reads and enough writes are generated. Remember we are not doing a generic benchmarks, we are just stressing the InnoDB double-write buffer.

In order to make sure something else isn’t involved, I verified the following:

  • Server independence, tried on 2 physical servers and one EC2 instance, Centos 6 and Ubuntu 14.04
  • MySQL provided, tried on MySQL community and Percona Server
  • MySQL version, tried on 5.5.37 and 5.6.23 (Percona Server)
  • Varied the InnoDB log file size from 32MB to 512MB
  • The impacts of the number of InnoDB write threads (1,2,4,8,16,32)
  • The use of Linux native asynchronous iop
  • Spinning and SSD storage

So, with all those verifications done, I can maintain that if you are using a server with spinning disks and a high write load, using EXT4 transactions instead of the InnoDB double write buffer yields to an increase in throughput of more than 50%. In an upcoming post, I’ll show how the performance stability is affected by the InnoDB double-write buffer under a high write load.

Appendix: the relevant part of the my.cnf

innodb_buffer_pool_size = 12G
innodb_write_io_threads = 8 # or else in {1,2,4,8,16,32}
innodb_read_io_threads = 8
innodb_flush_log_at_trx_commit = 0 # must be 0 or 2 to really stress the double write buffer
innodb_log_file_size = 512M # or 32M, 64M
innodb_log_files_in_group = 2
innodb_flush_method=O_DIRECT # or O_DSYNC
innodb_buffer_pool_restore_at_startup=300 # On 5.5.x, important to warm up the buffer pool
#innodb_buffer_pool_load_at_startup=ON # on 5.6, important to warm up the buffer pool
#innodb_buffer_pool_dump_at_shutdown=ON # on 5.6, important to warm up the buffer pool,
skip-innodb_doublewrite # or commented out
innodb_flush_neighbor_pages=none # or area for spinning

The post Update on the InnoDB double-write buffer and EXT4 transactions appeared first on MySQL Performance Blog.

Read more at:

Getting EXPLAIN information from already running queries in MySQL 5.7

When a new version of MySQL is about to be released we read a lot of blog posts about the performance and scalability improvements. That’s good but sometimes we miss some small features that can help us a lot in our day-to-day tasks. One good example is the blog post that Aurimas wrote about a new small feature in MySQL 5.6 that I didn’t know about until I read it: the Automatic InnoDB transaction log file size change. How cool is that?

I plan to write a series of blog posts that will show some of those small new features in MySQL 5.7 that are going to be really useful. I’m going to start with EXPLAIN FOR CONNECTION.

This feature allows us to run an EXPLAIN for an already running statement. Let’s say that you find a query that has been running for a long time and you want to check why that could be happening. In 5.7 you can just ask MySQL to EXPLAIN the query that a particular connection is running and get the execution path. You can use it if the query is a SELECT, DELETE, INSERT, REPLACE or UPDATE. Won’t work if the query is a prepared statement though.

Let me show you an example of how it works.

We have a long running join.

mysql [localhost] {msandbox} ((none)) > show processlist G
*************************** 1. row ***************************
     Id: 9
   User: msandbox
   Host: localhost
     db: employees
Command: Query
   Time: 49
  State: Sending data
   Info: select count(*) from employees, salaries where employees.emp_no = salaries.emp_no

Let’s see the execution plan for the query:

mysql [localhost] {msandbox} ((none)) > explain for connection 9 G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299540
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2803840
     filtered: 100.00
        Extra: Using where; Using join buffer (Block Nested Loop)

The join between those tables is not using any index at all so there is some room for improvement here :)


You can use this feature to see why a query is running for too long and based on the info decide how to fix it and how to proceed. This is going to be a very useful feature for DBAs who want to diagnose performance problems and slow queries.

The post Getting EXPLAIN information from already running queries in MySQL 5.7 appeared first on MySQL Performance Blog.

Read more at:

Innodb redo log archiving

Innodb redo log archivingPercona Server 5.6.11-60.3 introduces a new “log archiving” feature. Percona XtraBackup 2.1.5 supports “apply archived logs.” What does it mean and how it can be used?

Percona products propose three kinds of incremental backups. The first is full scan of data files and comparison the data with backup data to find some delta. This approach provides a history of changes and saves disk space by storing only data deltas. But the disadvantage is a full-data file scan that adds load to the disk subsystem. The second kind of incremental backup avoids extra disk load during data file scans.

The idea is in reading only changed data pages. The information about what specific pages were changed is provided by the server itself which writes files with the information during work. It’s a good alternative but changed-pages tracking adds some small load. And Percona XtraBackup’s delta reading leads to non-sequential disk io. This is good alternative but there is one more option.

The Innodb engine has a data log. It writes all operations which modify database pages to log files. This log is used in the case of unexpected server terminating to recover data. The Innodb log consists of the several log files which are filled sequentially in circular. The idea is to save those files somewhere and apply all modifications from archived logs to backup data files. The disadvantage of this approach is in using extra disk space. The advantage is there is no need to do an “explicit” backup on the host server. A simple script could sit and wait for logs to appear then scp/netcat them over to another machine.

But why not use good-old replication? Maybe replication does not have such performance as logs recovering but it is more controlled and well-known. Archived logs allows you to do any number of things with them from just storing them to doing periodic log applying. You can not recover from a ‘DROP TABLE’, etc with replication. But with this framework one could maintain the idea of “point in time” backups.

So the “archived logs” feature is one more option to organize incremental backups. It is not widely used as it was issued not so far and there is not A good understanding of how it works and how it can be used. We are open to any suggestions about its suggest improvements and use cases. The subject of this post is to describe how it works in depth. As log archiving is closely tied with innodb redo logs the internals of redo logs will be covered too. This post would be useful not only for DBA but also for Software Engineers because not only common principles are considered but the specific code too, and knowledge from this post can be used for further MySQL code exploring and patching.

What is the innodb log and how it is written?

Let’s remember what are innodb logs, why they are written, what they are used for.

The Innodb engine has buffer pool. This is a cache of database pages. Any changes are done on page in buffer pool, then page is considered as “dirty,” which means it must be flushed, and pushed to the flush list which is processed periodically by special thread. If pages are not flushed to disk and server is terminated unexpectedly the changes will be lost. To avoid this innodb writes changes to redo log and recover data from redo log during start. This technique allows to delay buffer pool pages flushing. It can increase performance because several changes of one page can be accumulated in memory and then flushed by one io. Except that flushed pages can be grouped to decrease the number of non-sequential io’s. But the down-side of this approach is time for data recovering. Let’s consider how this log is stored, generated and used for data recovering.

Log files

Redo log consists of a several log files which are treated as a circular buffer. The number and the size of log files can be configured. Each log file has a header. The description of this header can be found in “storage/innobase/include/log0log.h” by “LOG_GROUP_ID” keyword.

Each log file contains log records. Redo log records are written sequentially by log blocks of OS_FILE_LOG_BLOCK_SIZE size which is equal to 512 bytes by default and can be changed with innodb option. Each record has its LSN. LSN is a “Log Sequence Number” – the number of bytes written to log from the log creation to the certain log record. Each block consists of header, trailer and log records.

Log blocks

Let’s consider log block header. The first 4 bytes of the header is log block number. The block number is very similar as LSN but LSN is measured in bytes and block number is measured by OS_FILE_LOG_BLOCK_SIZE. Here is the simple formula how LSN is converted to block number:

return(((ulint) (lsn / OS_FILE_LOG_BLOCK_SIZE) & 0x3FFFFFFFUL) + 1);

This formula can be found in log_block_convert_lsn_to_no() function. The next two bytes is the number of bytes used in the block. The next two bytes is the offset of the first MTR log record in this block. What is MTR will be described below. Currently it can be considered as a synonym of bunch of log records which are gathered together as a description of some logical operation. For example it can be a group of log records for inserting new row to some table. This field is used when there are records of several MTR’s in one block. The next four bytes is a checkpoint number. The trailer is four bytes of log block checksum. The above description can be found in “storage/innobase/include/log0log.h” by “LOG_BLOCK_HDR_NO” keyword.

Before writing to disk log blocks must be somehow formed and stored. And the question is:

How log blocks are stored in memory and on disk?

Where log blocks are stored before flushing to disk and how they are written and flushed?

Global log object and log buffer

The answer to the first part of the question is log buffer. Server holds very important global object log_sys in memory. It contains a lot of useful information about logging state. Log buffer is pointed by log_sys->buf pointer which is initialized in log_init(). I would highlight the following log_sys fields that are used for work with log buffer and flushing:

log_sys->buf_size – the size of log buffer, can be set with innodb-log-buffer-size variable, the default value is 8M;
log_sys->buf_free – the offset from which the next log record will be written;
log_sys->max_buf_free – if log_sys->buf_free is greater then this value log buffer must be flushed, see log_free_check();
log_sys->buf_next_to_write – the offset of the next log record to write to disk;
log_sys->write_lsn – the LSN up to which log is written;
log_sys->flushed_to_disk_lsn – the LSN up to which log is written and flushed;
log_sys->lsn – the last LSN in log buffer;

So log_sys->buf_next_to_write is between 0 and log_sys->buf_free, log_sys->write_lsn is equal or less log_sys->lsn, log_sys->flushed_to_disk_lsn is less or equal to log_sys->write_lsn.

The relationships for those fields can be easily traced with debugged by setting up watchpoints.

Ok, we have log buffer, but how do log records come to this buffer?

Where log records come from?

Innodb has special objects that allow you to gather redo log records for some operations in one bunch before writing them to log buffer. These objects are called “mini-transactions” and corresponding functions and data types have “mtr” prefix in the code. The objects itself are described in mtr_t “c” structure. The most interesting fields of this structure are the following:

mtr_t::log – contains log records for the mini-transaction,
mtr_t::memo – contains pointers to pages which are changed or locked by the mini-transaction, it is used to push pages to flush list and release locks after logs records are copied to log buffer in mtr_commit() (see mtr_memo_pop_all() called in mtr_commit()).

mtr_start() function initializes an object of mtr_t type and mtr_commit() writes log records from mtr_t::log to log_sys->buf + log_sys->buf_free. So the typical sequence of any operation which changes data is the following:

mtr_start(); // initialize mtr object
some_ops... // operations on data which are logged in mtr_t::log
mtr_commit(); // write logged operations from mtr_t::log to log buffer log_sys->buf

page_cur_insert_rec_write_log() is a good example of how mtr records can be written and mtr::memo can be filled. The low-level function which writes data to log buffer is log_write_low(). This function is invoked inside of mtr_commit() and not only copy the log records from mtr_t object to log buffer log_sys->buf but also creates a new log blocks inside of log_sys->buf, fills their header, trailer, calculates checksum.

So log buffer contains log blocks which are sequentially filled with log records which are grouped in “mini-transactions” which logically can be treated as some logical operation over data which consists of a sequence of mini-operations(log records).

As log records are written sequentially in log buffer one mini-transaction and even one log record can be written in two neighbour blocks. That is why the header field which would contain the offset of the first MTR in the block is necessary to calculate the point from which log records parsing can be started. This field was described in 2.2.

So we have a buffer of log blocks in a memory. How is data from this buffer written to disk? The mysql documentation says that this depends on innodb_flush_log_at_trx_commit option. There can be three cases depending on the value of this option. Let’s consider each of them.

Writing log buffer to disk: innodb_flush_log_at_trx_commit is 1 or 2.

The first two cases is when innodb_flush_log_at_trx_commit is 1 or 2. In these cases flush log records are written for 2 and flushed for 1 on each transaction commit. If innodb_flush_log_at_trx_commit is 2 log records are flushed periodically by special thread which will be considered later. The low-level function which writes log records from buffer to file is log_group_write_buf(). But in the most cases it is not called directly but it is called from more high level log_write_up_to(). For the current case the calling stack is the following:

(trx_commit_in_memory() or
trx_commit_complete_for_mysql() or
trx_prepare() e.t.c)->

It is quite easy to find the higher levels of calling stack, just set up breakpoint on log_group_write_buf() and execute any sql query that modifies innodb data. For example for the simple “insert” sql query the higher levels of calling stack are the following:

trx_flush_log_if_needed()-> ... .

log_io_complete() callback is invoked when i/o is finished for log files (see fil_aio_wait()). log_io_complete() flushes log files if this is not forbidden by innodb_flush_method or innodb_flush_log_at_trx_commit options.

Writing log buffer to disk: innodb_flush_log_at_trx_commit is equal to 0

The third case is when innodb_flush_log_at_trx_commit is equal to 0. For this case log buffer is NOT written to disk on transaction commit, it is written and flushed periodically by separate thread “srv_master_thread”. If innodb_flush_log_at_trx_commit = 0 log files are flushed in the same thread by the same calls. The calling stack is the following:

(srv_master_do_active_tasks() or srv_master_do_idle_tasks() or srv_master_do_shutdown_tasks())->
log_buffer_sync_in_background()->log_write_up_to()->... .

Special cases for logs flushing

While log_io_complete() do flushing depending on innodb_flush_log_at_trx_commit value among others log_write_up_to() has it’s own flushing criteria. This is flush_to_disk function argument. So it is possible to force log files flushing even if innodb_flush_log_at_trx_commit = 0. Here are examples of such cases:

1) buf_flush_write_block_low()
Each page contains information about the last applied LSN(buf_flush_write_block_low::newest_modification), each log record is a description of change on certain page. Imagine we flushed some changed pages but log records for these pages were not flushed and server goes down. After starting the server some pages will have the newest modifications, but some of them were not flushed and the correspondent log records are lost too. We will have inconsistent database in this case. That is why log records must be flushed before the pages they refer.

2) srv_sync_log_buffer_in_background()
As it was described above this function is called periodically by special thread and forces flushing.

3) log_checkpoint()
When checkpoint is made log files must be reliably flushed.

4) The special handlerton innobase_flush_logs() which can be called through ha_flush_logs() from mysql server.
For example ha_flush_logs() is called from MYSQL_BIN_LOG::reset_logs() when “RESET MASTER” or “RESET SLAVE” are executed.

5) srv_master_do_shutdown_tasks() – on shutdown, ha_innobase::create() – on table creating, ha_innobase::delete_table() – on table removing, innobase_drop_database() – on all database tables removing, innobase_rename_table() – on table rename e.t.c

If log files are treated as circular buffer what happens when the buffer is overflown?

Briefly. Innodb has a mechanism which allows you to avoid overflowing. It is called “checkpoints.” The checkpoint is a state when log files are synchronized with data files. In this case there is no need to keep the history of changes before checkpoint because all pages with the last modifications LSN less or equal to checkpoint LSN are flushed and the log files space from the last written LSN to the last checkpoint LSN can be reused. We will not describe a checkpoint process here because it is a separate interesting subject. The only thing we need to know is when checkpoint happens all pages with modification LSN less or equal to checkpoint LSN are reliably flushed.

How archived logs are written by server.

So the log contains information about page changes. But as we said, log files are the circular buffer. This means that they occupy fixed disk size and the oldest records can be rewritten by the newest ones as there are points when data files are synchronized with log files called checkpoints and there is no need to store the previous history of log records to guarantee database consistency. The idea is to save somewhere all log records to have the possibility of applying them to backuped data to have some kind if incremental backup. For example if we want to have an archive of log records. As log consists of log files it is reasonable to store log records in such files too, and these files are called “archived logs.”

Archived log files are written to the directory which can be set with special innodb option. Each file has the same size as innodb log size and the suffix of each archived file is the LSN from which it is started.

As well as log writing system log archiving system stores its data in global log_sys object. Here are the most valuable fields in log_sys from my point of view:

log_sys->archive_buf, log_sys->archive_buf_size – logs archive buffer and its size, log records are copied from log buffer log_sys->buf to this buffer before writing to disk;
log_sys->archiving_phase – the current phase of log archiving: LOG_ARCHIVE_READ when log records are being copied from log_sys->buf to log_sys->archive_buf, LOG_ARCHIVE_WRITE when log_sys->archive_buf is being written to disk;
log_sys->archived_lsn – the LSN to which log files are written;
log_sys->next_archived_lsn – the LSN to which write operations was invoked but not yet finished;
log_sys->max_archived_lsn_age – the maximum difference between log_sys->lsn and log_sys->archived_lsn, if this difference exceeds the log are being archived synchronously, i.e. the difference is decreased;
log_sys->archive_lock – this is rw-lock which is used for synchronizing LOG_ARCHIVE_WRITE and LOG_ARCHIVE_READ phases, it is x-locked on LOG_ARCHIVE_WRITE phase.

So how is data copied from log_sys->buf to log_sys->archived_buf? log_archive_do() is used for this. It is not only set the proper state for archived log fields in log_sys but also invokes log_group_read_log_seg() with corresponding arguments which not only copy data from log buffer to archived log buffer but also invokes asynchronous write operation for archived log buffer. log_archive_do() can wait until io operations are finished using log_sys->archive_lock if corresponding function parameter is set.

The main question is on what circumstances log_archive_do() is invoked, i.e. when log records are being written to archived log files. The first call stack is the following:


Here is text of log_free_check() with comments:

Checks if there is need for a log buffer flush or a new checkpoint, and does
this if yes. Any database operation should call this when it has modified
more than about 4 pages. NOTE that this function may only be called when the
OS thread owns no synchronization objects except the dictionary mutex. */

#endif /* UNIV_SYNC_DEBUG */

if (log_sys->check_flush_or_checkpoint) {


log_sys->check_flush_or_checkpoint is set when there is no enough free space in log buffer or it is time to do checkpoint or any other bound case. log_archive_margin() is invoked only if the limit if the difference between log_sys->lsn and log_sys->archived_lsn is exceeded. Let’s refer to this difference as archived lsn age.

One more call log_archive_do() is from log_open() when archived lsn age exceeds some limit. log_open() is called on each mtr_commit(). And for this case archived logs are written synchronously.

The next synchronous call is from log_archive_all() during shutdown.

Summarizing all above archived logs begins to be written when the log buffer is full enough to be written or when checkpoint happens or when the server is in the process of shut down. And there is no any delay between writing to archive log buffer and writing to disk. I mean there is no way to say that archived logs must be written once a second as it is possible for redo logs with innodb_flush_log_at_trx_commit = 0. As soon as data is copied to the buffer the write operation is invoked immediately for this buffer. Archived log buffer is not filled on each mtr_commit() so it does not slow down the usual logging process. The exception is when there are a lot of io operations what can be the reason of archive log age is too big. The result of big archive log age is the synchronous archived logs writing during mtr_commit(). Memory to memory copying is quite fast operation that is why the data is copied to archived log buffer and is written to disk asynchronously minimizing delays which can be caused by logs archiving.

PS: Here is another call stack for writing archived log buffer to archived log files:


I propose to explore this stack yourself.

Logs recovery process, how it is started and works inside. Archived logs applying.

So we discovered how innodb redo logging works, and how redo logs are archived. And the last uncovered thing is how recovery works and how archived logs are applied. These two processes are very similar – that is why they are discussed in one section of this post.

The story begins with innobase_start_or_create_for_mysql() which is invoked from innobase_init(). The following trident in innobase_start_or_create_for_mysql() can be used to search the relevant code:

if (create_new_db) {
} else if (srv_archive_recovery) {
} else {

The second condition and the last one is the place from which archived logs applying and innodb logs recovery processes correspondingly start. These two blocks wrap two pairs of functions:




And all the magic happens in these pairs. As well as global log_sys object for redo logging there is global recv_sys object for innodb recovery and archived logs applying. It is created and initialized in recv_sys_create() and recv_sys_init() functions correspondingly. The following fields if recv_sys object are the most important from my point:

recv_sys->limit_lsn – the LSN up to which recovery should be made, this value is initialized with the maximum value of uint64_t(see #define LSN_MAX) for the recovery process and with certain value which is passed as an argument of recv_recovery_from_archive_start() function and can be set via xtrabackup option for log applying;
recv_sys->parse_start_lsn – the LSN from which logs parsing is started, for the the logs recovery this value equals to the last checkpoint LSN, for logs applying this is last applied LSN;
recv_sys->scanned_lsn – the LSN up to which log files are scanned;
recv_sys->recovered_lsn – the LSN up to which log records are applied, this value <= recv_sys->scanned_lsn;

The first thing that must be done for starting recovery process is to find out the point in log files where the recovery must be started from. This is the last checkpoint LSN. recv_find_max_checkpoint() proceed this. As we can see in log_group_checkpoint() the following code writes checkpoint info into two places in the first log file depending on the checkpoint number:

/* We alternate the physical place of the checkpoint info in the first
log file */


if ((log_sys->next_checkpoint_no & 1) == 0) {
write_offset = LOG_CHECKPOINT_1;
} else {
write_offset = LOG_CHECKPOINT_2;

So recv_find_max_checkpoint() reads checkpoint info from both places and selects the latest checkpoint.

The same idea is applied for logs, too, but the last applied LSN instead of last checkpoint LSN must be found. Here is the call stack for reading last applied LSN:


The last applied LSN is stored in the first page of data files in (min|max)_flushed_lsn fields(see FIL_PAGE_FILE_FLUSH_LSN offset). These values are written in fil_write_flushed_lsn_to_data_files() function on server shutdown.

So the main difference between logs applying and recovery process at this stage is the manner of calculating LSN from which log records will be read. For logs applying the last flushed LSN is used but for recovery process it is the last checkpoint LSN. Why does this difference take place? Logs can be applied periodically. Assume we gather archived logs and apply them once an hour to have fresh backup. After applying the previous bunch of log files there can be unfinished transactions. For the recovery process any unfinished transactions are rolled back to have consistent db state at server starting. But for the logs applying process there is no need to roll back them because any unfinished transactions can be finished during the next logs applying.

After calculating the start LSN the sequence of actions is the same for both recovering and applying. The next step is reading and parsing log records. See recv_group_scan_log_recs() which is invoked from recv_recovery_from_checkpoint_start_func() for logs recovering and recv_recovery_from_archive_start()->log_group_recover_from_archive_file() for logs applying.

The first we read log records to some buffer and then invoke recv_scan_log_recs() to parse them. recv_scan_log_recs() checks each log block on consistency(checksum + comparing the log block number written in log block with log block number calculated from log block LSN) and other edge cases and copy it to parsing buffer recv_sys->buf with recv_sys_add_to_parsing_buf() function. The parsing buffer is then parsed by recv_parse_log_recs(). Log records are stored in hash table recv_sys->addr_hash. The key for this hash table is calculated basing on space id and page number pair. This pair refers to the page to which log records must be applied. The value of the hash table is object of recv_addr_t type. recv_addr_t type contains rec_list field which is the list of log records for applying to the (space id, page num) page (see recv_add_to_hash_table().

After parsing and storing log record in hash table recv_sys->addr_hash log records are applied. The function which is responsible for log records applying is recv_apply_hashed_log_recs(). It is invoked from recv_scan_log_recs() if there is no enough memory to store log records and at the end of recovering/applying process. For each element of recv_sys->addr_hash, i.e. for each DB page which must be changed with log records recv_recover_page() is invoked. It can be invoked as from recv_apply_hashed_log_recs() in the case if page is already in buffer pool of from buf_page_io_complete() on io completion, i.e. just after page was read from storage. Applying log records on page read completion is necessary and very convenient. Assume log records have not yet applied as we had enough memory to store the whole recovery log records. But we want for example to boot DB dictionary. I this case any records that concern to the pages of the dictionary will be applied to those pages just after reading them from storage to buffer pool.

The function which applies log records to the certain page is recv_recover_page_func(). It gets the list of log records for the certain page from recv_sys->addr_hash hash table, for each element of this list it compares the lsn of last page changes with the LSN of the record, and if the former is greater the later it applies log record to the page.

After applying all log records from archived logs xtrabackup writes last applied LSN to (min|max)_flushed LSN fields of each data file and finishes execution. The logs recovery process rollbacks all unfinished transactions unless this is forbidden with innodb-force-recovery parameter.


We covered the processes of redo logs writing and recovery in depth. These are very important processes as they provide data consistency on crashes. These two processes became a base for logs archiving and applying features. As log records can describe any data changes the idea is to store these records somewhere and then apply them to backups for organizing some kind of incremental backup.

The features were implemented a short time ago and currently they are not widely used. So if you have something to say about them you are welcome to comment for discussion.

The post Innodb redo log archiving appeared first on MySQL Performance Blog.

Read more at:

InnoDB adaptive flushing in MySQL 5.6: checkpoint age and io capacity

In MySQL 5.6 InnoDB has a dedicated thread (page_cleaner) that’s responsible for performing flushing operations. Page_cleaner performs flushing of the dirty pages from the buffer pool based on two factors:
access pattern  –  the least recently used pages will be flushed by LRU flusher from LRU_list when buffer pool has no free pages anymore;
age – the oldest modified non-flushed pages are part of flush_list structure and will be flushed by flush_list flusher based on several heuristics.

There is a good overview of the page_cleaner and also here you may find some details about flushing in MySQL 5.6. Below I describe several additional aspects of the flush_list flushing that was not really covered yet.

flush_list flushing and checkpoint age

The amount of the aged pages that is possible to keep in the flush_list is limited by the combined size of the innodb log files. So the main purpose of the flush_list flushing is to flush pages from this list with such a rate that will also always allow enough free space in the log files. On the other hand, too aggressive flushing means less write combining, unnecessary load on the I/O subsystem, in the end undoing performance benefits of having larger redo logs.  In MySQL 5.6 the amount of pages to flush is calculated in the InnoDB adaptive routine based on the current checkpoint age with the following formula:

percentage of the IO capacity that should be used for flushing =
        ((srv_max_io_capacity / srv_io_capacity) * (lsn_age_factor * sqrt(lsn_age_factor))) / 7.5;

We modeled that formula in R and found that it’s possible to improve it such a way that the curve becomes more flat and as a result flushing becomes less aggressive. That new formula is enabled in Percona Server 5.6 by default.


flush_list flushing and io_capacity

InnoDB provides two variables that allow the control of the background flushing rate – innodb_io_capacity and innodb_io_capacity_max. There is quite a detailed description for these vars. However there are several things that are not really covered in the documentation:

innodb_io_capacity_max is the most important variable in case of adaptive flushing as only that variable actually limiting the flushing rate. See above formula and charts.

innodb_io_capacity is used for limiting IO operations during merging of the insert buffer and flushing in cases of server inactivity/shutdown.

For practical needs, the above means the following:

– if  the MySQL server is in an active state (serving user requests) you need to adjust innodb_io_capacity_max to increase/decrease flushing rate.
– if the MySQL server is in an idle state or performing shutdown flushing of the pages from flush_list will be limited by innodb_io_capacity value only.

– if change_buffering is ON and server is in active state it will allow to use either 5% of innodb_io_capacity or vary rate from 5% to 55%  if more than 50% of insert buffer size was already used.
– if change_buffering is ON and server is idle it will use 100% of innodb_io_capacity for merge operations

The post InnoDB adaptive flushing in MySQL 5.6: checkpoint age and io capacity appeared first on MySQL Performance Blog.

Read more at:

InnoDB Full-text Search in MySQL 5.6: Part 3, Performance

This is part 3 of a 3 part series covering the new InnoDB full-text search features in MySQL 5.6. To catch up on the previous parts, see part 1 or part 2

Some of you may recall a few months ago that I promised a third part in my InnoDB full-text search (FTS) series, in which I’d actually take a look at the performance of InnoDB FTS in MySQL 5.6 versus traditional MyISAM FTS. I hadn’t planned on quite such a gap between part 2 and part 3, but as they say, better late than never. Recall that we have been working with two data sets, one which I call SEO (8000-keyword-stuffed web pages) and the other which I call DIR (800K directory records), and we are comparing MyISAM FTS in MySQL 5.5.30 versus InnoDB FTS in MySQL 5.6.10.

For reference, although this is not really what I would call a benchmark run, the platform I’m using here is a Core i7-2600 3.4GHz, 32GiB of RAM, and 2 Samsung 256GB 830 SSDs in RAID-0. The OS is CentOS 6.4, and the filesystem is XFS with dm-crypt/LUKS. All MySQL settings are their respective defaults, except for innodb_ft_min_token_size, which is set to 4 (instead of the default of 3) to match MyISAM’s default ft_min_word_len.

Also, recall that the table definition for the DIR data set is:

CREATE TABLE dir_test (
  full_name VARCHAR(100),
  details TEXT

The table definition for the SEO data set is:

CREATE TABLE seo_test (
 title VARCHAR(255),

Table Load / Index Creation

First, let’s try loading data and creating our FT indexes in one pass – i.e., we’ll create the FT indexes as part of the original table definition itself. In particular, this means adding “FULLTEXT KEY (full_name, details)” to our DIR tables and adding “FULLTEXT KEY (title, body)” to the SEO tables. We’ll then drop these tables, drop our file cache, restart MySQL, and try the same process in two passes: first we’ll load the table, and then we’ll do an ALTER to add the FT indexes. All times in seconds.

Engine Data Set one-pass (load) two-pass (load, alter)
MyISAM SEO 3.91 3.96 (0.76, 3.20)
InnoDB SEO 3.777 7.32 (1.53, 5.79)
MyISAM DIR 43.159 44.93 (6.99, 37.94)
InnoDB DIR 330.76 56.99 (12.70, 44.29)

Interesting. For MyISAM, we might say that it really doesn’t make too much difference which way you proceed, as the numbers from the one-pass load and the two-pass load are within a few percent of each other, but for InnoDB, we have mixed behavior. With the smaller SEO data set, it makes more sense to do it in a one-pass process, but with the larger DIR data set, the two-pass load is much faster.

Recall that when adding the first FT index to an InnoDB table, the table itself has to be rebuilt to add the FTS_DOC_ID column, so I suspect that the size of the table when it gets rebuilt has a lot to do with the performance difference on the smaller data set. The SEO data set fits completely into the buffer pool, the DIR data set does not. That also suggests that it’s worth comparing the time required to add a second FT index (this time we will just index each table’s TEXT/MEDIUMTEXT field). While we’re at it, let’s look at the time required to drop the second FT index as well. Again, all times in seconds.

Engine Data Set FT Index Create Time FT Index Drop Time
MyISAM SEO 6.34 3.17
InnoDB SEO 3.26 0.01
MyISAM DIR 74.96 37.82
InnoDB DIR 24.59 0.01

InnoDB wins this second test all around. I’d attribute InnoDB’s win here partially to not having to rebuild the whole table with second (and subsequent) indexes, but also to the fact that at least some the InnoDB data was already in the buffer pool from when the first FT index was created. Also, we know that InnoDB generally drops indexes extremely quickly, whereas MyISAM requires a rebuild of the .MYI file, so InnoDB’s win on the drop test isn’t surprising.

Query Performance

Recall the queries that were used in the previous post from this series:

1. SELECT id, title, MATCH(title, body) AGAINST ('arizona business records'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_{myisam,innodb} ORDER BY 3
2. SELECT id, title, MATCH(title, body) AGAINST ('corporation commission forms'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_{myisam,innodb} ORDER BY 3 DESC
   LIMIT 5;
3. SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson +arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_{myisam,innodb} ORDER BY 3 DESC LIMIT 5;
4. SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_{myisam,innodb} ORDER BY 3 DESC LIMIT 5;
5. SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_{myisam,innodb} ORDER BY 3 DESC LIMIT 1;

The queries were run consecutively from top to bottom, a total of 10 times each. Here are the results in tabular format:

Query # Engine Min. Execution Time Avg. Execution Time Max. Execution Time
1 MyISAM 0.007953 0.008102 0.008409
1 InnoDB 0.014986 0.015331 0.016243
2 MyISAM 0.001815 0.001893 0.001998
2 InnoDB 0.001987 0.002077 0.002156
3 MyISAM 0.000748 0.000817 0.000871
3 InnoDB 0.670110 0.676540 0.684837
4 MyISAM 0.001199 0.001283 0.001372
4 InnoDB 0.055479 0.056256 0.060985
5 MyISAM 0.008471 0.008597 0.008817
5 InnoDB 0.624305 0.630959 0.641415

Not a lot of variance in execution times for a given query, so that’s good, but InnoDB is always coming back slower than MyISAM. In general, I’m not that surprised that MyISAM tends to be faster; this is a simple single-threaded, read-only test, so none of the areas where InnoDB shines (e.g., concurrent read/write access) are being exercised here, but I am quite surprised by queries #3 and #5, where InnoDB is just getting smoked.

I ran both versions of query 5 with profiling enabled, and for the most part, the time spent in each query state was identical between the InnoDB and MyISAM versions of the query, with one exception.

InnoDB: | Creating sort index | 0.626529 |
MyISAM: | Creating sort index | 0.014588 |

That’s where the bulk of the execution time is. According to the docs, this thread state means that the thread is processing a SELECT which required an internal temporary table. Ok, sure, that makes sense, but it doesn’t really explain why InnoDB is taking so much longer, and here’s where things get a bit interesting. If you recall part 2 in this series, query 5 actually returned 0 results when run against InnoDB with the default configuration because of the middle initial “B”, and I had to set innodb_ft_min_token_size to 1 in order to get results back. For the sake of completeness, I did that again here, then restarted the server and recreated my FT index. The results? Execution time dropped by 50% and ‘Creating sort index’ didn’t even appear in the query profile:

mysql [localhost] {msandbox} (test): SELECT id, full_name, MATCH(full_name, details) AGAINST
('"Thomas B Smith"' IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 1;
| id    | full_name      | score             |
| 62633 | Thomas B Smith | 32.89915466308594 |
1 row in set (0.31 sec)
mysql [localhost] {msandbox} (test): show profile;
| Status                  | Duration |
| starting                | 0.000090 |
| checking permissions    | 0.000007 |
| Opening tables          | 0.000017 |
| init                    | 0.000034 |
| System lock             | 0.000012 |
| optimizing              | 0.000008 |
| statistics              | 0.000027 |
| preparing               | 0.000012 |
| FULLTEXT initialization | 0.304933 |
| executing               | 0.000008 |
| Sending data            | 0.000684 |
| end                     | 0.000006 |
| query end               | 0.000006 |
| closing tables          | 0.000011 |
| freeing items           | 0.000019 |
| cleaning up             | 0.000003 |

Hm. It’s still slower than MyISAM by quite a bit, but much faster than before. The reason it’s faster is because it found an exact match and I only asked for one row, but if I change LIMIT 1 to LIMIT 2 (or limit N>1), then ‘Creating sort index’ returns to the tune of roughly 0.5 to 0.6 seconds, and ‘FULLTEXT initialization’ remains at 0.3 seconds. So this answers another lingering question: there is a significant performance impact to using a lower innodb_ft_min_token_size (ifmts), and it can work for you or against you, depending upon your queries and how many rows you’re searching for. The time spent in “Creating sort index” doesn’t vary too much (maybe 0.05s) between ifmts=1 and ifmts=4, but the time spent in FULLTEXT initialization with ifmts=4 was typically only a few milliseconds, as opposed to the 300ms seen here.

Finally, I tried experimenting with different buffer pool sizes, temporary table sizes, per-thread buffer sizes, and I also tried changing from Antelope (ROW_FORMAT=COMPACT) to Barracuda (ROW_FORMAT=DYNAMIC) and switching character sets from utf8 to latin1, but none of these made any difference. The only thing which seemed to provide a bit of a performance improvement was upgrading to 5.6.12. The execution times for the InnoDB FTS queries under 5.6.12 were about 5-10 percent faster than with 5.6.10, and query #2 actually performed a bit better under InnoDB than MyISAM (average execution time 0.00075 seconds faster), but other than that, MyISAM still wins on raw SELECT performance.

Three blog posts later, then, what’s my overall take on InnoDB FTS in MySQL 5.6? I don’t think it’s great, but it’s serviceable. The performance for BOOLEAN MODE queries definitely leaves something to be desired, but I think InnoDB FTS fills a need for those people who want the features and capabilities of InnoDB but can’t modify their existing applications or who just don’t have enough FTS traffic to justify building out a Sphinx/Solr/Lucene-based solution.

The post InnoDB Full-text Search in MySQL 5.6: Part 3, Performance appeared first on MySQL Performance Blog.

Read more at: