A schema change inconsistency with Galera Cluster for MySQL

I recently worked on a case where one node of a Galera cluster had its schema desynchronized from the other nodes, even though the Total Order Isolation (TOI) method was in effect for schema changes. Let’s see what happened.

Background

For those of you who are not familiar with how Galera can perform schema changes, here is a short recap:

  • Two methods are available, depending on the value of the wsrep_OSU_method setting. Both have benefits and drawbacks, but they are not the main topic of this post.
  • With TOI (Total Order Isolation), a DDL statement is performed at the same point in the replication flow on all nodes, giving strong guarantees that the schema is always identical on all nodes.
  • With RSU (Rolling Schema Upgrade), a DDL statement is not replicated to the other nodes. Until the DDL statement has been executed on all nodes, the schema is not consistent everywhere (so you should be careful not to break replication).

You can look at the official documentation here.

If you read the section on TOI carefully, you will see that “[…] TOI transactions will never fail certification and are guaranteed to be executed.” But also that “The system replicates the TOI query before execution and there is no way to know whether it succeeds or fails. Thus, error checking on TOI queries is switched off.”

Confusing? Not really. It simply means that with TOI, a DDL statement will always pass certification. But if for some reason, the DDL statement fails on one of the nodes, it will not be rolled back on the other nodes. This opens the door for schema inconsistencies between nodes.
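Since the behavior depends entirely on wsrep_OSU_method, a quick sanity check before running any DDL is to verify which method is in effect on the node (the output below assumes the default setting):

pxc1> show global variables like 'wsrep_OSU_method';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| wsrep_OSU_method | TOI   |
+------------------+-------+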

A test case

Let’s create a table on a 3-node Percona XtraDB Cluster 5.6 cluster and insert a few rows:

pxc1> create table t (id int not null auto_increment primary key, c varchar(10));
pxc1> insert into t (c) values ('aaaa'),('bbbb');

Then on node 3, let’s introduce a schema change on t that can make other schema changes fail:

pxc3> set global wsrep_OSU_method=RSU;
pxc3> alter table t add d int;
pxc3> set global wsrep_OSU_method=TOI;

As the schema change was done on node 3 with RSU, it is not replicated to the other nodes.

Now let’s try another schema change on node 1:

pxc1> alter table t add d varchar(10);
Query OK, 0 rows affected (0,14 sec)
Records: 0  Duplicates: 0  Warnings: 0

Apparently everything went well, and indeed on nodes 1 and 2 we have the correct schema:

pxc2> show create table t\G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `c` varchar(10) DEFAULT NULL,
  `d` varchar(10) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1

But on node 3, the statement failed, so the schema has not been changed:

pxc3> show create table t\G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `c` varchar(10) DEFAULT NULL,
  `d` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=latin1

The error is visible in the error log of node 3:

2014-07-18 10:37:14 9649 [ERROR] Slave SQL: Error 'Duplicate column name 'd'' on query. Default database: 'repl_test'. Query: 'alter table t add d varchar(10)', Error_code: 1060
2014-07-18 10:37:14 9649 [Warning] WSREP: RBR event 1 Query apply warning: 1, 200
2014-07-18 10:37:14 9649 [Warning] WSREP: Ignoring error for TO isolated action: source: 577ffd51-0e52-11e4-a30e-4bde3a7ad3f2 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 3 trx_id: -1 seqnos (l: 17, g: 200, s: 199, d: 199, ts: 7722177758966)

But of course it is easy to miss. And then a simple INSERT can trigger a shutdown on node 3:

pxc2> insert into t (c,d) values ('cccc','dddd');
Query OK, 1 row affected (0,00 sec)

will trigger this on node 3:

2014-07-18 10:42:27 9649 [ERROR] Slave SQL: Column 2 of table 'repl_test.t' cannot be converted from type 'varchar(10)' to type 'int(11)', Error_code: 1677
2014-07-18 10:42:27 9649 [Warning] WSREP: RBR event 3 Write_rows apply warning: 3, 201
2014-07-18 10:42:27 9649 [Warning] WSREP: Failed to apply app buffer: seqno: 201, status: 1
	 at galera/src/trx_handle.cpp:apply():340
[...]
2014-07-18 10:42:27 9649 [Note] WSREP: Received NON-PRIMARY.
2014-07-18 10:42:27 9649 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 201)
2014-07-18 10:42:27 9649 [Note] WSREP: Received self-leave message.
2014-07-18 10:42:27 9649 [Note] WSREP: Flow-control interval: [0, 0]
2014-07-18 10:42:27 9649 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2014-07-18 10:42:27 9649 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 201)
2014-07-18 10:42:27 9649 [Note] WSREP: RECV thread exiting 0: Success
2014-07-18 10:42:27 9649 [Note] WSREP: recv_thread() joined.
2014-07-18 10:42:27 9649 [Note] WSREP: Closing replication queue.
2014-07-18 10:42:27 9649 [Note] WSREP: Closing slave action queue.
2014-07-18 10:42:27 9649 [Note] WSREP: bin/mysqld: Terminated.

Conclusion

As with regular MySQL, schema changes are challenging with Galera. Some subtleties can create a lot of trouble if you are not aware of them. So before running DDL statements, make sure you fully understand how the TOI and RSU methods work.
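One cheap safeguard is to watch the error logs for ignored TOI errors, since – as shown above – the message is logged even though the client that issued the DDL sees no error. A minimal sketch (adjust the error log path to your setup):

$ grep "Ignoring error for TO isolated action" /var/lib/mysql/*.err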

Percona XtraDB Cluster 5.6.19-25.6 is now available

Percona is glad to announce the new release of Percona XtraDB Cluster 5.6 on July 21st, 2014. Binaries are available from the downloads area or from our software repositories. We’re also happy to announce that Ubuntu 14.04 LTS users can now download, install, and upgrade Percona XtraDB Cluster 5.6 from Percona’s software repositories.

Based on Percona Server 5.6.19-67.0 including all the bug fixes in it, Galera Replicator 3.6, and on Codership wsrep API 25.6, Percona XtraDB Cluster 5.6.19-25.6 is now the current General Availability release. All of Percona‘s software is open-source and free, and all the details of the release can be found in the 5.6.19-25.6 milestone at Launchpad.

New Features:

  • Percona XtraDB Cluster now supports storing the Primary Component state to disk by setting the pc.recovery variable to true (see the configuration sketch after this list). The Primary Component can then recover automatically when all nodes that were part of the last saved state reestablish communications with each other. This feature can be used for automatic recovery from full cluster crashes, such as in the case of a data center power outage, and for graceful full cluster restarts without the need to explicitly bootstrap a new Primary Component.
  • When joining the cluster, the state message exchange provides us with gcache seqno limits. That information is now used to choose a donor: IST is attempted first and, only if that is not possible, SST. The wsrep_sst_donor setting is still honored, and donor selection is also segment aware.
  • An asynchronous replication slave thread was stopped when the node tried to apply the next replication event while in a non-primary state, but it would then remain stopped even after the node successfully re-joined the cluster. A new variable, wsrep_restart_slave, has been implemented, which controls whether the MySQL slave should be restarted automatically when the node re-joins the cluster.
  • The handling of install messages and install state messages has been improved to make group forming more stable when many nodes are joining the cluster.
  • A new wsrep_evs_repl_latency status variable has been implemented which provides the group communication replication latency information.
  • Node consistency issues with foreign key grammar have been fixed. This fix introduces two new variables: wsrep_slave_FK_checks and wsrep_slave_UK_checks. These variables are set to TRUE and FALSE respectively by default. They control whether Foreign Key and Unique Key checking is done for applier threads.
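To illustrate the first feature above, here is a minimal my.cnf sketch enabling Primary Component recovery; the provider path is an assumption and varies by platform and version:

[mysqld]
wsrep_provider = /usr/lib64/libgalera_smm.so
# persist the Primary Component state to disk so the cluster can
# recover automatically after a full outage
wsrep_provider_options = "pc.recovery=true"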

Bugs Fixed:

  • Fixed a race condition in Foreign Key processing that could cause an assertion. Bug fixed #1342959.
  • The restart sequence in scripts/mysql.server would fail to capture and return if the start call failed to start the server. As a result, a restart could occur that failed upon start-up, and the script would still return 0 as if it worked without any issues. Bug fixed #1339894.
  • Updating a unique key value could cause the server to hang if a slave node had enabled parallel slaves. Bug fixed #1280896.
  • Percona XtraDB Cluster has implemented threadpool scheduling fixes. Bug fixed #1333348.
  • garbd was returning an incorrect return code, i.e., when garbd was already started, the return code was 0. Bug fixed #1308103.
  • rsync SST would silently fail on joiner when the rsync server port was already taken. Bug fixed #1099783.
  • When gmcast.listen_addr was configured to a certain address, the local connection point for outgoing connections was not bound to the listen address. This would happen if the OS has multiple interfaces with IP addresses in the same subnet. The OS would pick the wrong IP for a local connection point and other nodes would see connections originating from an IP address which was not listened to. Bug fixed #1240964.
  • An issue with re-setting galera provider (in wsrep_provider_options) has been fixed. Bug fixed #1260283.
  • The variable wsrep_provider_options couldn’t be set at runtime if no provider was loaded. Bug fixed #1260290.
  • Percona XtraDB Cluster couldn’t be built with Bison 3.0. Bug fixed #1262439.
  • MySQL wasn’t correctly handling the wsrep error raised when the maximum writeset size was exceeded. Bug fixed #1270920.
  • Fixed the issue which caused a node to hang/fail when SELECTs/SHOW STATUS was run after FLUSH TABLES WITH READ LOCK was used on a node with wsrep_causal_reads set to 1 while there was a DML on other nodes. Bug fixed #1271177.
  • The lowest group communication layer (EVS) would fail to handle the situation properly when a large number of nodes suddenly started recognizing each other. Bugs fixed #1271918 and #1249805.
  • Percona XtraBackup SST would fail if the progress option was used with a large number of files. Bug fixed #1294431.

NOTE: When performing an upgrade from an older 5.6 version on Debian/Ubuntu systems, in order to upgrade the Galera package correctly, you’ll need to pin the Percona repository and run: apt-get install percona-xtradb-cluster-56. This is required because older Galera deb packages have an incorrect version number. The correct wsrep_provider_version after upgrade should be 3.6(r3a949e6).

This release contains 50 fixed bugs. The complete list of fixed bugs can be found in our release notes.

Release notes for Percona XtraDB Cluster 5.6.19-25.6 are available in our online documentation along with the installation and upgrade instructions.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Percona XtraDB Cluster Errata can be found in our documentation.

[UPDATE 2014-07-24]: Package Percona-XtraDB-Cluster-client-56-5.6.19-25.6.824.el6.x86_64.rpm has been updated to resolve the conflict with the Percona-XtraDB-Cluster-devel package.

Reference architecture for a write-intensive MySQL deployment

We designed Percona Cloud Tools (both the hardware and software setup) to handle a very intensive MySQL write workload. For example, we already observe inserts of 1 billion+ data points per day. So I wanted to share what kind of hardware we use to achieve this result.

Let me describe what we use, and later I will explain why.

Server:

  • Chassis: Supermicro SC825TQ-R740LPB 2U Rackmount Chassis
  • Motherboard: Supermicro X9DRI-F dual socket
  • CPU: Dual Intel Xeon Ivy Bridge E5-2643v2 (6x 3.5GHz cores, 12x HT cores, 25M L3)
  • Memory: 256GB (16x 16GB 256-bit quad-channel) ECC registered DDR3-1600
  • Raid: LSI MegaRAID 9260-4i 4-port 6Gb/s hardware RAID controller, 512M buffer
  • MainStorage: PCIe SSD HGST FlashMAX II 4.8TB
  • Secondary Storage (OS, logs): RAID 1 over 2x 3TB hard drives

Software:

When selecting hardware for your application, you need to look at many aspects – typically you’re looking for a solution you already have experience working with and that has also proved to be the most efficient option. For us it has been as follows:

Cloud vs Bare Metal
We have experience hosting hardware at a data center, as well as the cash for upfront investments in hardware, so we decided to go with physical self-hosted hardware instead of the cloud. Going this route also gave us maximum flexibility in choosing a hardware setup that was the most optimal for our application, rather than selecting one of the stock options.

Scale Up vs Scale Out
We have designed the system from scratch to be able to utilize multiple servers through sharding – so our main concern is choosing the most optimal configuration for a single server and provisioning servers as needed. In addition to raw performance, we also need to consider power usage and the overhead of managing many servers, which typically makes slightly higher-end hardware worth it.

Resource Usage
Every application uses resources in different ways, so an optimal configuration will be different depending on your application. Yet all applications draw on the same set of resources, and you need to consider each of them. Typically you want to plan for all of your resources to be substantially used, while providing some margin for spikes and maintenance.

CPU

  • Our application processes a lot of data and uses the TokuDB storage engine which uses a lot of CPU for compression, so we needed powerful CPUs.
  • Many MySQL functions are not parallel – think of executing a single query or an ALTER TABLE – so we’re going for a CPU with faster cores rather than a larger number of cores. The resulting configuration, with 2 sockets giving 12 cores and 24 threads, is good enough for our workloads.
  • Lower end CPUs such as Xeon E3 have very attractive price/performance but only support 32GB of memory which was not enough for our application.

Memory

  • For database boxes memory is mainly used as a cache, so depending on your application you may be better off investing in memory or storage for optimal performance. Check out this blog post for more details.
  • Accessing data in memory is much faster than on even the fastest flash storage, so it is still important.
    For our workload, having recent data in memory is very important, so we get as much “cheap” memory as we can, populating all 16 slots with 16GB DIMMs, which have an attractive cost per GB at this point.

Storage
There are multiple uses for the storage, so there are many variables to consider:

  • Bandwidth
    • We need to be able to access data on the storage device quickly and with a stable response time. The HGST FlashMax II has been able to meet these very demanding needs.
  • Endurance
    • When using flash storage you need to worry about endurance – how much of a beating from writes the flash storage can handle before it wears out. Some low-cost MLC SSDs would wear out within weeks if written to at maximum speed. The HGST FlashMax II has an endurance rating of 10 petabytes written for a random workload, and 30 petabytes written for a sequential workload.
    • We also use the TokuDB storage engine, which significantly reduces the amount of writes compared to InnoDB.
  • Durability
    • Does the storage provide true durability – with data guaranteed to be persisted once a write is acknowledged at the operating system level – even when power goes down, or is loss possible?
      We do not want to risk database corruption in case of a power failure, so we were looking for a storage solution that guarantees durability.
      The HGST FlashMax II guarantees durability, and this has been confirmed by our stress tests.
  • Size
    • To scale with your application’s storage demands you need to scale both the number of IO operations the storage can handle and the storage size. For flash storage it is often size that becomes the limiting factor.
      The HGST FlashMax II’s 4.8TB capacity is the best available on the market, which allows us to go “all flash” and achieve very quick access to our whole data set.
  • Secondary Storage
    • Not every application need requires the properties of flash storage.
    • We have secondary storage with conventional drives for operating system and logs.
      A sequential read/write pattern works well with low-cost conventional drives, and also allows us to extend the flash lifetime by having it handle fewer writes.
    • We’re using RAID with a BBU for secondary storage to be able to have fully durable binary logs without paying a high performance penalty.

Why PCIe SSD over SATA SSD?
There are arguments that SATA SSDs provide good enough performance for MySQL and that there is no need for PCIe. While these arguments are valid in one dimension, there are several more to consider.

First, as I said, a PCIe SSD still provides the best absolute response time, and that is an important factor for the end-user experience in SaaS systems like Percona Cloud Tools.
Second, consider maintenance operations like backups, ALTER TABLEs or slave setups. While these operations are boring and do not get as much attention as response time or throughput in benchmarks, they are still operations that DBAs perform basically daily, and it is very important to finish a backup or ALTER TABLE in a predictable time, especially in the 3-4TB data size range. And this is where PCIe SSDs perform much better than SATA SSDs. For SATA SSDs, especially bigger ones, write endurance is another point of concern.

Why TokuDB engine?
The TokuDB engine is the best when it comes to insert operations on a huge dataset, and a few more factors make it a no-brainer:

  • TokuDB compression is a huge win. I estimate that into this storage (FlashMAX II, 4.8TB) we will fit about 20-30TB of raw data.
  • TokuDB is SSD friendly, as it performs far fewer data writes per INSERT operation than InnoDB, which greatly extends the lifetime of the SSD (which is, well, expensive to say the least).

Why TokuDB hates Transparent HugePages

If you try to install the TokuDB storage engine on a modern Linux distribution it might fail with the following error message:

2014-07-17 19:02:55 13865 [ERROR] TokuDB will not run with transparent huge pages enabled.
2014-07-17 19:02:55 13865 [ERROR] Please disable them to continue.
2014-07-17 19:02:55 13865 [ERROR] (echo never > /sys/kernel/mm/transparent_hugepage/enabled)

You might be curious why TokuDB refuses to start with Transparent HugePages. Are they not a good thing… allowing smaller kernel page tables and fewer TLB misses when accessing data in the buffer pool? I was curious, so I asked Tim Callaghan this very question.

This problem originates with TokuDB using the jemalloc memory allocator, which uses a particular trick to deal with memory fragmentation. The classical problem with memory allocators is fragmentation – if you allocate, say, a 2MB chunk from the operating system (typically using mmap), then as the process runs it is likely that some of that 2MB memory block will become free, but not all of it, hence it can’t be given back to the operating system completely. jemalloc uses a clever trick to give back portions of memory allocated in such a way, through the madvise(…, MADV_DONTNEED) call.

Now what happens when you use Transparent HugePages? In this case the operating system (and the CPU, really) works with pages of a much larger size, which can only be unmapped from the address space in their entirety – and that does not work when smaller objects are freed, producing smaller free “holes.”

As a result, without being able to free memory efficiently, the amount of allocated memory may grow unbounded until the process starts to swap out – and in the end is killed by the “out of memory” killer, at least under some workloads. This is not the behavior you want to see from a database server. As such, requiring huge pages to be disabled is the better choice.

Having said that, this is a pretty crude requirement/solution – disabling huge pages for the complete operating system image to make one application work, while others might be negatively impacted. I hope that with future jemalloc versions/kernel releases there will be a solution where jemalloc simply prevents huge page usage for its allocations.
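For reference, disabling them system-wide typically looks like this (a sketch; the exact sysfs path varies between distributions and kernels):

# immediate, but lost on reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# persistent: add this parameter to the kernel command line in the bootloader configuration
transparent_hugepage=never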

Using jemalloc and its approach of removing pages from the resident space also makes TokuDB look a lot different from a typical MySQL instance running InnoDB, at least from a process standpoint. With InnoDB, VSZ and RSS are often close. In fact, we often monitor VSZ to ensure it is not excessively large, to avoid the danger of the process starting to swap actively or being killed by the OOM killer. TokuDB, however, can often look like this:

[root@smt1 mysql]# ps aux | grep mysqld
mysql 14604 21.8 50.6 12922416 4083016 pts/0 Sl Jul17 1453:27 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --log-error=/var/lib/mysql/smt1.pz.percona.com.err --pid-file=/var/lib/mysql/smt1.pz.percona.com.pid
root 28937 0.0 0.0 103244 852 pts/2 S+ 10:38 0:00 grep mysqld

In this case TokuDB is run with defaults on an 8GB system – it takes approximately 50% of memory in terms of RSS size, yet the VSZ of the process is over 12GB – a lot more than the memory available.

This is completely fine for TokuDB. If I had not disabled Transparent HugePages, though, my RSS would be a lot closer to VSZ, causing intense swapping or even the process being killed by the OOM killer.

In addition to explaining this to me, Tim Callaghan was also kind enough to share some links on this issue from other companies such as Oracle, NuoDB, Splunk, SAP, SAP(2), which provide more background information on this topic.

DBaaS, OpenStack and Trove 101: Introduction to the basics

We’ll be publishing a series of posts on OpenStack and Trove over the next few weeks, diving into their usage and purpose. For readers who are already familiar with these technologies, there should be no doubt as to why we are incredibly excited about them, but for those who aren’t, consider this a small introduction to the basics and concepts.

What is Database as a Service (DBaaS)?
In a nutshell, DBaaS – as it is frequently referred to – is a loose moniker for the concept of providing a managed cloud-based database environment accessible by users, applications or developers. Its aim is to provide a full-fledged database environment while minimizing the administrative turmoil and pains of managing the surrounding infrastructure.

Real life example: Imagine you are working on a new application that has to be accessible from multiple regions. Building and maintaining a large multiregion setup can be very expensive. Furthermore, it introduces additional complexity and strain on your system engineers once timezones start to come into play. The challenge of having to manage machines in multiple datacenters won’t simplify your release cycle, nor increase your engineers’ happiness.

Let’s take a look at some of the questions DBaaS could answer in a situation like this:

– How do I need to size my machines, and where should I locate them?
Small environments require less computing power and can be a good starting point, although this also means they may not be as well-prepared for future growth. Buying larger-scale and more expensive hardware and hosting can be very expensive and a big stumbling block for a brand new development project. Hosting machines in multiple DCs could also introduce administrative difficulties, like having different SLAs and potential issues setting up WAN or VPN communications. DBaaS introduces an abstraction layer, so these considerations aren’t yours but those of the company offering it, while you get to reap all the rewards.

– Who will manage my environment from an operational standpoint?
Staffing considerations and taking on the required knowledge to properly maintain a production database are often either temporarily swept under the rug or, when the situation turns out badly, a cause for the untimely demise of quite a few young projects. Rather than think about how long ago you should have applied that security patch, wouldn’t it be nice to just focus on managing the data itself, and be otherwise confident that the layers beyond it are managed responsibly?

– Have a sudden need to scale out?
Once you’re up and running, enjoying the success of a growing user base, your environment will need to scale accordingly. Rather than thinking long and hard about the many options available, as well as the logistics attached to those changes, your DBaaS provider could handle this transparently.

Popular public options: Here are a few names of public services you may have come across already that fall under the DBaaS moniker:

– Amazon RDS
– Rackspace cloud databases
– Microsoft SQLAzure
– Heroku
– Clustrix DBaaS

What differentiates these services from a standard remote database is the abstraction layer that fully automates their backend, while still offering an environment that is familiar to what your development team is used to (be it MySQL, MongoDB, Microsoft SQLServer, or otherwise). A big tradeoff to using these services is that you are effectively trusting an external company with all of your data, which might make your legal team a bit nervous.

Private cloud options?
What if you could offer your team the best of both worlds? Or even provide a similar type of service to your own customers? Over the years, a lot of platforms have been popping up to allow effective management and automation of virtual environments such as these, allowing you to effectively “roll your own” DBaaS. To get there, there are two important layers to consider:

  • Infrastructure Management, also referred to as Infrastructure-as-a-Service (IaaS), focusing on the logistics of spinning up virtual machines and keeping their required software packages running.
  • Database Management, referred to above as DBaaS, transparently coordinating multiple database instances to work together and present themselves as a single, coherent data repository.

Examples of IaaS products:
– OpenStack
– OpenQRM

Example of DBaaS:
– Trove

Main Advantages of DBaaS
For reference, the main reasons why you might want to consider using an existing DBaaS are as follows:

– Reduced database management costs

DBaaS reduces the amount of maintenance you need to perform on isolated DB instances. You offload the system administration of hardware, OS and database to either a dedicated service provider or, in the case where you are rolling your own, allow your database team to more efficiently manage and scale the platform (public vs private DBaaS).

– Simplifies certain security aspects

If you are opting to use a DBaaS platform, the responsibility of worrying about this or that patch being applied falls to your service provider, and you can generally assume that they’ll keep your platform secure from the software perspective.

– Centralized management

One system to rule them all. A guarantee of no nasty surprises concerning that one ancient server that should have been replaced years ago, but you never got around to it. As a user of DBaaS, all you need to worry about is how you interface with the database itself.

– Easy provisioning

Scaling of the environment happens transparently, with minimal additional management.

– Choice of backends

Typically, DBaaS providers offer you a choice of a multitude of database flavors, so you can mix and match according to your needs.

Main Disadvantages
– Reduced visibility of the backend

Releasing control of the backend requires a good amount of trust in your DBaaS provider. There is limited or no visibility into how backups are run and maintained, which configuration modifications are applied, or even when and which updates will be implemented. Just as you offload your responsibilities, you in turn need to rely on an SLA contract.

– Potentially harder to recover from catastrophic failures

Similarly to the above, unless your service providers have maintained thorough backups on your behalf, the lack of direct access to the host machines means that it could be much harder to recover from database failure.

– Reduced performance for specific applications

There’s a good chance that you are working on a shared environment. This means the amount of workload-specific performance tuning options is limited.

– Privacy and Security concerns

Although it is much easier to maintain and patch your environment, having a centralized system also means you’re more prone to potential attacks targeting your dataset. Whichever provider you go with, make sure you are intimately aware of the measures they take to protect you from such attacks, and of what is expected from your side to help keep it safe.

Conclusion: While DBaaS is an interesting concept that introduces a completely new way of approaching an application’s database infrastructure, and can bring enterprises easily scalable and financially flexible platforms, it should not be considered a silver bullet. Some big tradeoffs need to be considered carefully from the business perspective, and any move there should be accompanied with careful planning and investigation of options.

Embracing the immense flexibility these platforms offer, though, opens up a lot of interesting perspectives too. More and more companies are looking at ways to roll their own “as-a-Service”, provisioning completely automated hosted platforms for customers on-demand, and abstracting their management layers to allow them to be serviced by smaller, highly focused technical teams.

Stay tuned: Over the next few weeks we’ll be publishing a series of posts focusing on the combination of two technologies that allow for this type of flexibility: OpenStack and Trove.

Putting MySQL Fabric to Use: July 30 webinar

Martin and I have recently been blogging together about MySQL Fabric (in case you’ve missed this, you can find the first post of the series here), and on July 30th, we’re going to be presenting a webinar on this topic titled “Putting MySQL Fabric to Use.”

The focus of the webinar is to help you get started quickly on this technology, so we’ll include very few slides (mostly just a diagram or two) and then jump straight into shared screen mode, with lots of live console and source code examples.

In order to make the best use of time, we won’t show you how to install and configure MySQL Fabric. However, we can point you to a few resources to help you get ready and even follow our examples as we go:

  • The official manual is an obvious starting point
  • Our second post in the series includes configuration instructions
  • This git repo contains the test environment we’ll use to run our demos. Specifically, we’ll use the sharding branch, so if you intend to follow our examples as we go, we recommend checking that one out (see the sketch after this list).
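For reference, that checkout is just the usual git workflow (the repository URL is elided here, as in the original post; <repo-url> and <repo-dir> are placeholders):

$ git clone <repo-url>
$ cd <repo-dir>
$ git checkout sharding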

If you’re interested, you can register for this webinar here, and if there’s something specific you’d like to see (we had a request for PHP examples in the comments to our last post), feel free to post it as a comment. We can’t promise we’ll be able to cover all requests during the webinar, but we’ll incorporate examples into the repo as time allows.

Hope to see you then!

Monitoring MySQL flow control in Percona XtraDB Cluster 5.6

Monitoring flow control in a Galera cluster is very important. If you do not, you will not understand why writes may sometimes be stalled. Percona XtraDB Cluster 5.6 provides 2 status variables for such monitoring: wsrep_flow_control_paused and wsrep_flow_control_paused_ns. Which one should you use?

What is flow control?

Flow control does not exist in regular MySQL replication, only in Galera replication. It is simply the mechanism nodes use when they are not able to keep up with the write load: to keep replication synchronous, the node that is starting to lag instructs the other nodes that writes should be paused for some time so it does not fall too far behind.

If you are not familiar with this notion, you should read this blogpost.

Triggering flow control and graphing it

For this test, we’ll use a 3-node Percona XtraDB Cluster 5.6 cluster. On node 3, we will adjust gcs.fc_limit so that flow control is triggered very quickly and then we will lock the node:

pxc3> set global wsrep_provider_options="gcs.fc_limit=1";
pxc3> flush tables with read lock;

Now we will use sysbench to insert rows on node 1:

$ sysbench --test=oltp --oltp-table-size=50000 --mysql-user=root --mysql-socket=/tmp/pxc1.sock prepare

Because of flow control, writes will be stalled and sysbench will hang. So after some time, we will release the lock on node 3:

pxc3> unlock tables;

During the whole process, wsrep_flow_control_paused and wsrep_flow_control_paused_ns are recorded every second with mysqladmin ext -i1. We can then build a graph of the evolution of both variables:

[Graph: wsrep_flow_control_paused and wsrep_flow_control_paused_ns on pxc3]

While we can clearly see on both graphs when flow control was triggered, it is much easier to tell when flow control stopped with wsrep_flow_control_paused_ns. This would be even more obvious if we had had several timeframes during which flow control was in effect.

Conclusion

Monitoring a server is obviously necessary if you want to be able to catch issues. But you need to look at the right metrics. So don’t be scared if you see that wsrep_flow_control_paused is not 0: it simply means that flow control has been triggered at some point since the server started up. If you want to know what is happening right now, prefer wsrep_flow_control_paused_ns.
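For a quick look without graphing, you can also query both counters directly (the values below are illustrative only):

pxc1> show global status like 'wsrep_flow_control_paused%';
+------------------------------+-------------+
| Variable_name                | Value       |
+------------------------------+-------------+
| wsrep_flow_control_paused    | 0.134760    |
| wsrep_flow_control_paused_ns | 12896437801 |
+------------------------------+-------------+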

Why %util number from iostat is meaningless for MySQL capacity planning

Earlier this month I wrote about vmstat iowait cpu numbers, and some of the comments I got advocated using the util% reported by the iostat tool instead. I find this number even more useless for MySQL performance tuning and capacity planning.

Now let me start by saying this is a really tricky and deceptive number. Many DBAs reporting that their systems have a very busy IO subsystem say the util% in iostat was above 99%, and therefore believe this number is a good indicator of an overloaded IO subsystem.

Indeed – when your IO subsystem is busy, up to its full capacity, the utilization should be very close to 100%. However, it is perfectly possible for the IO subsystem – and MySQL with it – to have plenty more capacity even when utilization shows 100%, as I will demonstrate with an example.

Before that, though, let’s see what the iostat manual page has to say on this topic – from the man page we can read:

%util

Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

Which says right here that the number is useless for pretty much any production database server that is likely to be running RAID, Flash/SSD, SAN or cloud storage (such as EBS) capable of handling multiple requests in parallel.

Let’s look at the following illustration. I will run sysbench on a system with rather slow storage, a data size larger than the buffer pool, and a uniform distribution, to put pressure on the IO subsystem. I will use a read-only benchmark here as it keeps things simpler…

sysbench --num-threads=1 --max-requests=0 --max-time=6000000 --report-interval=10 --test=oltp --oltp-read-only=on --db-driver=mysql --oltp-table-size=100000000 --oltp-dist-type=uniform --init-rng=on --mysql-user=root --mysql-password= run

I’m seeing some 9 transactions per second, while disk utilization from iostat is at nearly 96%:

[ 80s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 171.82ms (95%)
[ 90s] threads: 1, tps: 9.20, reads/s: 128.80, writes/s: 0.00 response time: 157.72ms (95%)
[ 100s] threads: 1, tps: 9.00, reads/s: 126.00, writes/s: 0.00 response time: 215.38ms (95%)
[ 110s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 141.39ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 127.90 0.70 4070.40 28.00 31.87 1.01 7.83 7.52 96.68

This makes a lot of sense – with a single-threaded read workload the drive is only used to fetch the data needed by the queries, which will not add up to 100% as some extra time is also needed to process the query on the MySQL side and pass the result set back to sysbench.

So 96% utilization and 9 transactions per second – this is close to full system capacity, with less than 5% of device time to spare, right?

Let’s run a benchmark with more concurrency – 4 threads at a time; we’ll see…

[ 110s] threads: 4, tps: 21.10, reads/s: 295.40, writes/s: 0.00 response time: 312.09ms (95%)
[ 120s] threads: 4, tps: 22.00, reads/s: 308.00, writes/s: 0.00 response time: 297.05ms (95%)
[ 130s] threads: 4, tps: 22.40, reads/s: 313.60, writes/s: 0.00 response time: 335.34ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 295.40 0.90 9372.80 35.20 31.75 4.06 13.69 3.38 100.01

So we’re seeing 100% utilization now, but what is interesting is that we were able to reclaim much more than the less-than-5% that utilization suggested was left: system throughput increased about 2.5x.

Finally let’s do the test with 64 threads – this is more concurrency than exists at the storage level, which on this system is conventional hard drives in RAID…

[ 70s] threads: 64, tps: 42.90, reads/s: 600.60, writes/s: 0.00 response time: 2516.43ms (95%)
[ 80s] threads: 64, tps: 42.40, reads/s: 593.60, writes/s: 0.00 response time: 2613.15ms (95%)
[ 90s] threads: 64, tps: 44.80, reads/s: 627.20, writes/s: 0.00 response time: 2404.49ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 601.20 0.80 19065.60 33.60 31.73 65.98 108.72 1.66 100.00

In this case we’re getting 4.5x the throughput of a single thread at 100% utilization. We’re also getting almost double the throughput of the run with 4 threads, where 100% utilization was also reported. This makes sense – there are 4 drives which can work in parallel, and with many outstanding requests they are able to optimize their seeks better, hence giving a bit more than 4x.

So what have we seen? A system that was reported as 96% utilized could still be driven to deliver 4.5x the throughput – so it had plenty of extra IO capacity. More powerful storage might have a significantly greater ability to run requests in parallel, so it is quite possible to see 10x or more headroom after utilization% starts to be reported close to 100%.

So if utilization% is not very helpful, what can we use to understand our database IO capacity better? First let’s look at the performance reported from those sysbench runs. If we look at the 95% response times, 1 thread and 4 threads are relatively close, growing just from about 150ms to 250-300ms. This is the number I really like to look at: if the system is able to respond to queries with a response time not significantly higher than it has at a concurrency of 1, it is not overloaded. I like using 3x as a multiplier – i.e., when the 95% response time spikes to more than 3x of the single-concurrency value (here, beyond roughly 450-500ms), the system might be getting into overload.

With 64 threads the 95% response time is 15-20x of the one we see with a single thread, so it is surely overloaded.

Do we have anything reported by iostat which we can use in a similar way? It turns out we do! Check out the “await” column, which tells us how long the requester had to wait for an IO request to be serviced. With single concurrency it is 7.8ms, which is what these drives can do for random IO and is as good as it gets. With 4 threads it is 13.7ms – less than double the best possible, so also good enough… With a concurrency of 64, however, it is 108ms, which is over 10x of what this device can deliver with no waiting, and which is surely a sign of overload.

A couple words of caution. First, do not look at svctm, which is not designed with parallel processing in mind. You can see in our case it actually gets better with high concurrency, while in reality the database had to wait a lot longer for the requests it submitted. Second, iostat mixes reads and writes together in single statistics, which specifically for databases – and especially on RAID with a BBU – might be like mixing apples and oranges: writes should go to the writeback cache and be acknowledged essentially instantly, while reads only complete when the actual data can be delivered. The tool pt-diskstats from Percona Toolkit breaks them apart and so can be much more useful for storage tuning for database workloads. Some of the recent operating systems also ship with a sysstat/iostat which breaks out await into r_await and w_await, which is much more useful.
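On systems shipping a recent sysstat, you can see that split directly (a sketch; the exact column set varies by sysstat version):

$ iostat -x 1
# in the extended device report, compare the r_await and w_await
# columns instead of the combined await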

A final note – I used a read-only workload on purpose. When it comes to writes, things can be even more complicated: the MySQL buffer pool can make intensive writes more efficient, plus group commit might be able to commit a lot of transactions with a single disk write. Still, the same basic logic applies.

Summary: The takeaway is simple – util% only tells you the share of time during which the device had at least one operation to deal with; it says nothing about remaining parallel capacity, so it does not reflect actual utilization for the majority of modern IO subsystems. You may have a lot of storage IO capacity left even when utilization is close to 100%.

Percona Server with TokuDB (beta): Installation, configuration

My previous post was an introduction to the TokuDB storage engine, aimed at explaining the basics of its design and how it differs from InnoDB/XtraDB. This post is all about motivating you to give it a try and have a look for yourself. Percona Server does not officially support TokuDB as of today, though the guys on the development team are working hard on this and the first GA release of Percona Server with TokuDB is looming on the horizon. However, there’s a beta version available now. For the installation tests in this post I’ve used the latest version of Percona Server alongside the accompanying TokuDB complement, which was published last week.

Installing Percona Server with TokuDB on a sandbox

One of the tools Percona Support Engineers really love is Giuseppe Maxia’s MySQL Sandbox. It allows us to set up a sandbox running a MySQL instance of our choice, and makes executing multiple ones for testing purposes very easy. Whenever a customer reaches us with a problem happening on a particular version of MySQL or Percona Server that we can reproduce, we quickly spin off a new sandbox and test it ourselves, so it’s very handy. I’ll use one here to explore this beta version of Percona Server with TokuDB, but if you prefer you can install it the regular way using a package from our apt experimental or yum testing repositories.

We start by downloading the tarballs from here: TokuDB’s plugin has been packaged in its own tarball, so there are two to download. Once you get them, decompress both and create a unified working directory, compressing it again to create a single tarball we’ll use as the source to create our sandbox:

[nando@test01 ~]# tar zxvf Percona-Server-5.6.17-rel66.0-608.Linux.x86_64.tar.gz
[nando@test01 ~]# tar zxvf Percona-Server-5.6.17-rel66.0-608.TokuDB.Linux.x86_64.tar.gz
[nando@test01 ~]# tar cfa Percona-Server-5.6.17-rel66.0-608.Linux.x86_64.tar.gz Percona-Server-5.6.17-rel66.0-608.Linux.x86_64/

Before going ahead, verify whether you have Transparent HugePages enabled, as TokuDB won’t run if they are. See this documentation page for an explanation of what this is and how to disable it on Ubuntu. On my CentOS test server it was defined in a slightly different place, and I’ve done the following to temporarily disable it:

[nando@test01]# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
[nando@test01]# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

We’re now ready to create our sandbox. The following command should be enough (I’ve chosen to run Percona Server on port 5617, you can use any other available one):

[nando@test01 ~]# make_sandbox Percona-Server-5.6.17-rel66.0-608.Linux.x86_64.tar.gz -- --sandbox_directory=tokudb --sandbox_port=5617

If the creation process goes well you will see something like the following at the end:

.... sandbox server started
Your sandbox server was installed in $HOME/sandboxes/tokudb

You should now be able to access the MySQL console on the sandbox with the default credentials; if you cannot, check the log in $HOME/sandboxes/tokudb/data/msandbox.err:

[nando@test01 ~]# mysql --socket=/tmp/mysql_sandbox5617.sock -umsandbox -pmsandbox

Alternatively, you can make use of the “use” script located inside the sandbox directory, which employs the same credentials (configured in the client section of the configuration file my.sandbox.cnf):

[nando@test01 ~]# cd sandboxes/tokudb/
[nando@test01 tokudb]# ./use

The first thing to check is whether TokuDB is listed as an available storage engine:

mysql> SHOW ENGINES;
+--------------------+---------+----------------------------------------------------------------------------+--------------+------+------------+
| Engine             | Support | Comment                                                                    | Transactions | XA   | Savepoints |
+--------------------+---------+----------------------------------------------------------------------------+--------------+------+------------+
| (...)              | (...)   | (...)                                                                      | (...)        | (...)| (...)      |
| TokuDB             | YES     | Tokutek TokuDB Storage Engine with Fractal Tree(tm) Technology             | YES          | YES  | YES        |
| InnoDB             | DEFAULT | Percona-XtraDB, Supports transactions, row-level locking, and foreign keys | YES          | YES  | YES        |
| (...)              | (...)   | (...)                                                                      | NO           | (...)| (...)      |
+--------------------+---------+----------------------------------------------------------------------------+--------------+------+------------+

If that’s not the case, you may need to load the plugins manually – I had to do so in my sandbox; you may not need to if you’re installing it from a package in a fresh setup:

mysql> INSTALL PLUGIN tokudb SONAME 'ha_tokudb.so';

TokuDB should now figure in the list of supported ENGINES but you still need to activate the related plugins:

mysql> INSTALL PLUGIN tokudb_file_map SONAME 'ha_tokudb.so';
mysql> INSTALL PLUGIN tokudb_fractal_tree_info SONAME 'ha_tokudb.so';
mysql> INSTALL PLUGIN tokudb_fractal_tree_block_map SONAME 'ha_tokudb.so';
mysql> INSTALL PLUGIN tokudb_trx SONAME 'ha_tokudb.so';
mysql> INSTALL PLUGIN tokudb_locks SONAME 'ha_tokudb.so';
mysql> INSTALL PLUGIN tokudb_lock_waits SONAME 'ha_tokudb.so';

Please note the INSTALL PLUGIN action results in permanent changes and thus is required only once. No modifications to MySQL’s configuration file are required to have those plugins load in subsequent server restarts.

Now you should see not only the main TokuDB plugin but also the add-ons to the INFORMATION_SCHEMA:

mysql> SHOW PLUGINS;
+-------------------------------+----------+--------------------+--------------+---------+
| Name                          | Status   | Type               | Library      | License |
+-------------------------------+----------+--------------------+--------------+---------+
| (...)                         | (...)    | (...)              | (...)        | (...)   |
| TokuDB                        | ACTIVE   | STORAGE ENGINE     | ha_tokudb.so | GPL     |
| TokuDB_trx                    | ACTIVE   | INFORMATION SCHEMA | ha_tokudb.so | GPL     |
| TokuDB_locks                  | ACTIVE   | INFORMATION SCHEMA | ha_tokudb.so | GPL     |
| TokuDB_lock_waits             | ACTIVE   | INFORMATION SCHEMA | ha_tokudb.so | GPL     |
| TokuDB_file_map               | ACTIVE   | INFORMATION SCHEMA | ha_tokudb.so | GPL     |
| TokuDB_fractal_tree_info      | ACTIVE   | INFORMATION SCHEMA | ha_tokudb.so | GPL     |
| TokuDB_fractal_tree_block_map | ACTIVE   | INFORMATION SCHEMA | ha_tokudb.so | GPL     |
+-------------------------------+----------+--------------------+--------------+---------+

We are now ready to create our first TokuDB table – the only different thing to do here is to specify TokuDB as the storage engine to use:

mysql> CREATE TABLE test.Numbers (id INT PRIMARY KEY, number VARCHAR(20)) ENGINE=TokuDB;

Note some unfamiliar files lying in the datadir; the details surrounding those are certainly good material for future posts:

[nando@test01]# ls ~/sandboxes/tokudb/data
auto.cnf                   _test_Numbers_main_3_2_19.tokudb
ibdata1                    _test_Numbers_status_3_1_19.tokudb
ib_logfile0                tokudb.directory
ib_logfile1                tokudb.environment
log000000000000.tokulog25  __tokudb_lock_dont_delete_me_data
msandbox.err               __tokudb_lock_dont_delete_me_environment
mysql                      __tokudb_lock_dont_delete_me_logs
mysql_sandbox5617.pid      __tokudb_lock_dont_delete_me_recovery
performance_schema         __tokudb_lock_dont_delete_me_temp
tc.log                     tokudb.rollback
test

Configuration: what’s really important

As noted by Vadim long ago, “Tuning of TokuDB is much easier than InnoDB, there’re only a few parameters to change, and actually out-of-box things running pretty well”:

mysql> show variables like 'tokudb_%';
+---------------------------------+------------------+
| Variable_name                   | Value            |
+---------------------------------+------------------+
| tokudb_alter_print_error        | OFF              |
| tokudb_analyze_time             | 5                |
| tokudb_block_size               | 4194304          |
| tokudb_cache_size               | 522651648        |
| tokudb_check_jemalloc           | 1                |
| tokudb_checkpoint_lock          | OFF              |
| tokudb_checkpoint_on_flush_logs | OFF              |
| tokudb_checkpointing_period     | 60               |
| tokudb_cleaner_iterations       | 5                |
| tokudb_cleaner_period           | 1                |
| tokudb_commit_sync              | ON               |
| tokudb_create_index_online      | ON               |
| tokudb_data_dir                 |                  |
| tokudb_debug                    | 0                |
| tokudb_directio                 | OFF              |
| tokudb_disable_hot_alter        | OFF              |
| tokudb_disable_prefetching      | OFF              |
| tokudb_disable_slow_alter       | OFF              |
| tokudb_disable_slow_update      | OFF              |
| tokudb_disable_slow_upsert      | OFF              |
| tokudb_empty_scan               | rl               |
| tokudb_fs_reserve_percent       | 5                |
| tokudb_fsync_log_period         | 0                |
| tokudb_hide_default_row_format  | ON               |
| tokudb_init_flags               | 11403457         |
| tokudb_killed_time              | 4000             |
| tokudb_last_lock_timeout        |                  |
| tokudb_load_save_space          | ON               |
| tokudb_loader_memory_size       | 100000000        |
| tokudb_lock_timeout             | 4000             |
| tokudb_lock_timeout_debug       | 1                |
| tokudb_log_dir                  |                  |
| tokudb_max_lock_memory          | 65331456         |
| tokudb_pk_insert_mode           | 1                |
| tokudb_prelock_empty            | ON               |
| tokudb_read_block_size          | 65536            |
| tokudb_read_buf_size            | 131072           |
| tokudb_read_status_frequency    | 10000            |
| tokudb_row_format               | tokudb_zlib      |
| tokudb_tmp_dir                  |                  |
| tokudb_version                  | tokudb-7.1.7-rc7 |
| tokudb_write_status_frequency   | 1000             |
+---------------------------------+------------------+
42 rows in set (0.00 sec)

The most important of the tokudb_ variables is arguably tokudb_cache_size. The test server where I ran these tests (test01) has a little less than 1GB of memory, and as you can see above TokuDB is “reserving” half (50%) of it for itself. That’s the default behavior but, of course, you can change it. And you must do so if you are also going to have InnoDB tables on your server – you should not overcommit memory between the InnoDB and TokuDB engines. Shlomi Noach wrote a good post explaining the main TokuDB-specific variables and what they do. It’s definitely worth a read.
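If you do run both engines side by side, a minimal my.cnf sketch could look like the following; the values are illustrative assumptions for a box with plenty of memory, not recommendations:

[mysqld]
# split memory explicitly between the engines so they do not overcommit
innodb_buffer_pool_size = 2G
tokudb_cache_size       = 4G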

I hope you have fun testing Percona Server with TokuDB! If you run into any problems worth reporting, please let us know.

Failover with the MySQL Utilities – Part 1: mysqlrpladmin

MySQL Utilities are a set of tools provided by Oracle to perform many kinds of administrative tasks. When GTID-replication is enabled, 2 tools can be used for slave promotion: mysqlrpladmin and mysqlfailover. We will review mysqlrpladmin (version 1.4.3) in this post.

Summary

  • mysqlrpladmin can perform manual failover/switchover when GTID-replication is enabled.
  • You need to have your servers configured with --master-info-repository=TABLE, or to add the --rpl-user option, for the tool to work properly (see the configuration sketch after this list).
  • The check for errant transactions is failing in the current GA version (1.4.3) so be extra careful when using it or watch bug #73110 to see when a fix is committed.
  • There are some limitations, for instance the inability to pre-configure the list of slaves in a configuration file or the inability to check that the tool will work well without actually doing a failover or switchover.
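For the second item in the list above, a minimal my.cnf sketch (the setting must be present on all the servers the tool manages):

[mysqld]
# required by mysqlrpladmin unless --rpl-user is passed on the command line
master-info-repository = TABLE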

Failover vs switchover

mysqlrpladmin can help you promote a slave to be the new master when the master goes down, and then automate replication reconfiguration after this slave promotion. There are 2 separate scenarios: unplanned promotion (failover) and planned promotion (switchover). Beyond the terminology, this has implications for the way you have to execute the tool.

Setup for this test

To test the tool, our setup will be a master with 2 slaves, all using GTID replication. mysqlrpladmin can show us the current replication topology with the health command:

$ mysqlrpladmin --master=root@localhost:13001 --discover-slaves-login=root health
# Discovering slaves for master at localhost:13001
# Discovering slave at localhost:13002
# Found slave: localhost:13002
# Discovering slave at localhost:13003
# Found slave: localhost:13003
# Checking privileges.
#
# Replication Topology Health:
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+
# ...done.

As you can see, we have to specify how to connect to the master (no surprise) but instead of listing all the slaves, we can let the tool discover them.

Simple failover scenario

What will the tool do when performing failover? Essentially we will give it the list of slaves and the list of candidates and it will:

  • Run a few sanity checks
  • Elect a candidate to be the new master
  • Make the candidate as up-to-date as possible by making it a slave of all the other slaves
  • Configure replication on all the other slaves to make them replicate from the new master

After killing -9 the master, let’s try failover:

$ mysqlrpladmin --slaves=root:@localhost:13002,root:@localhost:13003 --candidates=root@localhost:13002 failover

This time, the master is down so the tool has no way to automatically discover the slaves. Thus we have to specify them with the --slaves option.

However we get an error:

# Checking privileges.
# Checking privileges on candidates.
ERROR: You must specify either the --rpl-user or set all slaves to use --master-info-repository=TABLE.

The error message is clear, but it would have been nice to get these details when running the health command (maybe as a warning instead of an error). That would allow you to check beforehand that the tool can run smoothly, rather than discovering in the middle of an emergency that you have to look at the documentation to find which option is missing.

Let’s choose to specify the replication user:

$ mysqlrpladmin --slaves=root:@localhost:13002,root:@localhost:13003 --candidates=root@localhost:13002 --rpl-user=repl:repl failover
# Checking privileges.
# Checking privileges on candidates.
# Performing failover.
# Candidate slave localhost:13002 will become the new master.
# Checking slaves status (before failover).
# Preparing candidate for failover.
# Creating replication user if it does not exist.
# Stopping slaves.
# Performing STOP on all slaves.
# Switching slaves to new master.
# Disconnecting new master as slave.
# Starting slaves.
# Performing START on all slaves.
# Checking slaves for errors.
# Failover complete.
#
# Replication Topology Health:
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13002  | MASTER  | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+
# ...done.

Simple switchover scenario

Let’s now restart the old master and configure it as a slave of the new master; this can conveniently be done with mysqlreplicate, another tool from the MySQL Utilities.
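For our setup, the call could look like this (a sketch; double-check the connection strings against your environment):

$ mysqlreplicate --master=root@localhost:13002 --slave=root@localhost:13001 --rpl-user=repl:repl

With the old master replicating again, if we want to promote it back, we can run: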

$ mysqlrpladmin --master=root@localhost:13002 --new-master=root@localhost:13001 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet switchover
# Discovering slave at localhost:13001
# Found slave: localhost:13001
# Discovering slave at localhost:13003
# Found slave: localhost:13003
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

Notice that the master is available in this case, so we can use the --discover-slaves-login option. Also notice that we can tune the verbosity of the tool with --quiet or --verbose, or even log the output to a file with --log.
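For instance, to keep a record of a health check (the log file path is arbitrary):

$ mysqlrpladmin --master=root@localhost:13001 --discover-slaves-login=root --log=/tmp/rpladmin.log health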

We also used --demote-master to make the old master a slave of the new master. Without this option, the old master will be isolated from the other nodes.

Extension points

In general, performing a switchover/failover at the database level is one thing, but informing the other components of the application stack that something has changed is most often necessary for the application to keep working correctly.

This is where the extension points are handy: you can execute a script before switchover/failover with --exec-before and after switchover/failover with --exec-after.

For instance with these simple scripts:

# cat /usr/local/bin/check_before
#!/bin/bash
# Record the replication status of node1 just before the switchover
/usr/local/mysql5619/bin/mysql -uroot -S /tmp/node1.sock -Ee 'SHOW SLAVE STATUS' > /tmp/before
# cat /usr/local/bin/check_after
#!/bin/bash
# Record the replication status of node1 just after the switchover
/usr/local/mysql5619/bin/mysql -uroot -S /tmp/node1.sock -Ee 'SHOW SLAVE STATUS' > /tmp/after

We can execute:

$ mysqlrpladmin --master=root@localhost:13001 --new-master=root@localhost:13002 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet --exec-before=/usr/local/bin/check_before --exec-after=/usr/local/bin/check_after switchover

Looking at /tmp/before and /tmp/after, we can see that our scripts have been executed. /tmp/before is empty because localhost:13001 was still the master, and thus had no slave status, when check_before ran:

# cat /tmp/before
# cat /tmp/after
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: localhost
                  Master_User: repl
                  Master_Port: 13002
[...]

If the external script does not seem to work, using --verbose can be useful to diagnose the issue.

What about errant transactions?

We already mentioned that errant transactions can create lots of issues when a new master is promoted in a cluster running GTIDs. So the question is: how does mysqlrpladmin behave when there is an errant transaction?

Let’s create an errant transaction:

# On localhost:13003
mysql> CREATE DATABASE test2;
mysql> FLUSH LOGS;
mysql> SHOW BINARY LOGS;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000001 |     69309 |
| mysql-bin.000002 |   1237667 |
| mysql-bin.000003 |       617 |
| mysql-bin.000004 |       231 |
+------------------+-----------+
mysql> PURGE BINARY LOGS TO 'mysql-bin.000004';

Purging the binary logs ensures the other servers will never be able to fetch the errant transaction from localhost:13003. Now let’s try to promote localhost:13003 as the new master:

$ mysqlrpladmin --master=root@localhost:13001 --new-master=root@localhost:13003 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet switchover
[...]
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| host       | port   | role    | state  | gtid_mode  | health                                                                                                                                                                                                                                                                                              |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| localhost  | 13003  | MASTER  | UP     | ON         | OK                                                                                                                                                                                                                                                                                                  |
| localhost  | 13001  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  |
| localhost  | 13002  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Oops! The tool does not check for errant transactions, and the promotion leaves both remaining servers unable to replicate. This is a major issue, as you cannot run failover/switchover reliably with GTID replication if errant transactions are not correctly detected.

The documentation suggests errant transactions should be checked, and a quick look at the code confirms that the check is intended, but it does not work! So it has been reported as bug #73110.
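Until a fix is available, you can run the check manually before promoting a slave: a transaction is errant if it has been executed on the candidate but not on the master. Here is a sketch of the idea using the GTID functions shipped in MySQL 5.6 (the GTID sets are placeholders to copy from each server):

# On the master and on the candidate slave, collect the executed GTID sets
mysql> SELECT @@GLOBAL.gtid_executed;

# Subtract the master's set from the candidate's set:
# a non-empty result reveals errant transactions
mysql> SELECT GTID_SUBTRACT('<candidate_gtid_executed>', '<master_gtid_executed>');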

Some limitations

Apart from the missing errant transaction check, I also noticed a few limitations:

  • You cannot use a configuration file listing all the slaves. This becomes tedious once you have a large number of slaves; in such a case, you should write a wrapper script around mysqlrpladmin to generate the right command for you (see the sketch after this list).
  • The slave election process is either automatic or relies on the order of the servers given in the --candidates option. This is not very sophisticated.
  • It would be useful to have a --dry-run mode that would validate that everything is configured correctly, but without actually failing/switching over. This is something MHA does, for instance.
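As an illustration of the wrapper idea, here is a minimal sketch that builds the --slaves list from a plain-text file with one user:pass@host:port entry per line (the file name and format are my own convention):

#!/bin/bash
# Join the entries of /etc/mysql/slaves.txt with commas to build the --slaves list
SLAVES=$(paste -sd, /etc/mysql/slaves.txt)
# Use the first entry as the preferred candidate
mysqlrpladmin --slaves="$SLAVES" --candidates="$(head -1 /etc/mysql/slaves.txt)" --rpl-user=repl:repl failover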

Conclusion

mysqlrpladmin is a very good tool to help you perform manual failover/switchover in a cluster using GTID replication. The main caveat at this point is the failing check for errant transactions, which requires a lot of care before executing the tool.
