How To Test and Deploy Kubernetes Operator for MySQL(PXC) in OSX/macOS?

In this blog post, I’m going to show you how to test Kubernetes locally on OSX/macOS. Testing Kubernetes without access to a cloud provider in a local lab is not as easy as it sounds, and I’d like to share some of my experiences from this adventure. For those already experienced with the VirtualBox and Vagrant combination, I can tell you that it doesn’t work here. Since Kubernetes requires virtualization, setting up another virtual environment inside VirtualBox causes several issues. After trying to bring up a cluster for a day or two, I gave up on my traditional lab and found that Kubernetes has an alternative solution called minikube.

Installation

If your OSX/macOS system doesn’t have brew, I strongly recommend installing it. My macOS version at the time of this post was 10.14.3 (18D109).

$ brew update && brew install kubectl && brew cask install docker minikube virtualbox
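Before moving on, it doesn’t hurt to confirm the tools landed where expected (a quick sanity check; the version numbers you see will differ):

$ kubectl version --client
$ minikube version
$ VBoxManage --version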

Once minikube is installed, we’ll need to start the virtual environment that is required to run our operator.

I’m starting my minikube environment with 4 GB of memory, since our Percona XtraDB Cluster (PXC) will have 3 MySQL nodes + 1 ProxySQL pod.

$ minikube start --memory 4096
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f604.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  minikube v0.35.0 on darwin (amd64)
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f525.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  Creating virtualbox VM (CPUs=2, Memory=4096MB, Disk=20000MB) ...
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f4f6.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  "minikube" IP address is 192.168.99.100
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f433.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  Configuring Docker as the container runtime ...
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em;max-height: 1em" />  Preparing Kubernetes environment ...
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f69c.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  Pulling images required by Kubernetes v1.13.4 ...
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f680.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  Launching Kubernetes v1.13.4 using kubeadm ...
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/231b.png" alt="⌛" class="wp-smiley" style="height: 1em;max-height: 1em" />  Waiting for pods: apiserver proxy etcd scheduler controller addon-manager dns
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f511.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  Configuring cluster permissions ...
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f914.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  Verifying component health .....
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f497.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  kubectl is now configured to use "minikube"
<img src="https://s.w.org/images/core/emoji/11.2.0/72x72/1f3c4.png" alt="?" class="wp-smiley" style="height: 1em;max-height: 1em" />  Done! Thank you for using minikube!

We’re now ready to install Percona XtraDB Cluster on Kubernetes.

Setup

Clone the Kubernetes Operator for MySQL repository and switch into it.

$ git clone -b release-0.2.0 https://github.com/percona/percona-xtradb-cluster-operator
Cloning into 'percona-xtradb-cluster-operator'...
remote: Enumerating objects: 191, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (114/114), done.
remote: Total 10321 (delta 73), reused 138 (delta 67), pack-reused 10130
Receiving objects: 100% (10321/10321), 17.04 MiB | 3.03 MiB/s, done.
Resolving deltas: 100% (3526/3526), done.
Checking out files: 100% (5159/5159), done.
$ cd percona-xtradb-cluster-operator

Here we have to make the following modifications for this operator to work on OSX/macOS.

  1. Reduce memory allocation for each pod.
  2. Reduce CPU usage for each pod.
  3. Change the topology type (because we want to run all PXC instances on one node).
$ sed -i.bak 's/1G/500m/g' deploy/cr.yaml
$ grep "memory" deploy/cr.yaml
        memory: 500m
      #   memory: 500m
        memory: 500m
      #   memory: 500m
$ sed -i.bak 's/600m/200m/g' deploy/cr.yaml
$ grep "cpu" deploy/cr.yaml
        cpu: 200m
      #   cpu: "1"
        cpu: 200m
      #   cpu: 700m
$ grep "topology" deploy/cr.yaml
      topologyKey: "kubernetes.io/hostname"
    #   topologyKey: "failure-domain.beta.kubernetes.io/zone"
$ sed -i.bak 's|kubernetes.io/hostname|none|g' deploy/cr.yaml
$ grep "topology" deploy/cr.yaml
      topologyKey: "none"
    #   topologyKey: "failure-domain.beta.kubernetes.io/zone"

We’re now ready to deploy our PXC via the operator.

$ kubectl apply -f deploy/crd.yaml
customresourcedefinition.apiextensions.k8s.io/perconaxtradbclusters.pxc.percona.com created
customresourcedefinition.apiextensions.k8s.io/perconaxtradbbackups.pxc.percona.com created
$ kubectl create namespace pxc
namespace/pxc created
$ kubectl config set-context $(kubectl config current-context) --namespace=pxc
Context "minikube" modified.
$ kubectl apply -f deploy/rbac.yaml
role.rbac.authorization.k8s.io/percona-xtradb-cluster-operator created
rolebinding.rbac.authorization.k8s.io/default-account-percona-xtradb-cluster-operator created
$ kubectl apply -f deploy/operator.yaml
deployment.apps/percona-xtradb-cluster-operator created
$ kubectl apply -f deploy/secrets.yaml
secret/my-cluster-secrets created
$ kubectl apply -f deploy/configmap.yaml
configmap/pxc created
$ kubectl apply -f deploy/cr.yaml
perconaxtradbcluster.pxc.percona.com/cluster1 created

Here we’re ready to monitor the progress of our deployment.

$ kubectl get pods
NAME                                               READY   STATUS              RESTARTS   AGE
cluster1-pxc-node-0                                0/1     ContainerCreating   0          86s
cluster1-pxc-proxysql-0                            1/1     Running             0          86s
percona-xtradb-cluster-operator-5857dfcb6c-g7bbg   1/1     Running             0          109s

If any of the pods has difficulty reaching the Running state, you can inspect it with kubectl describe:

$ kubectl describe pod cluster1-pxc-node-0
Name:               cluster1-pxc-node-0
Namespace:          pxc
Priority:           0
.
..
...
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  3m47s (x14 over 3m51s)  default-scheduler  pod has unbound immediate PersistentVolumeClaims
  Normal   Scheduled         3m47s                   default-scheduler  Successfully assigned pxc/cluster1-pxc-node-0 to minikube
  Normal   Pulling           3m45s                   kubelet, minikube  pulling image "perconalab/pxc-openshift:0.2.0"
  Normal   Pulled            118s                    kubelet, minikube  Successfully pulled image "perconalab/pxc-openshift:0.2.0"
  Normal   Created           117s                    kubelet, minikube  Created container
  Normal   Started           117s                    kubelet, minikube  Started container
  Warning  Unhealthy         89s                     kubelet, minikube  Readiness probe failed:

At this stage, we’re ready to verify our cluster as soon as we see the following output (READY 1/1):

$ kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-pxc-node-0                                1/1     Running   0          7m38s
cluster1-pxc-node-1                                1/1     Running   0          4m46s
cluster1-pxc-node-2                                1/1     Running   0          2m25s
cluster1-pxc-proxysql-0                            1/1     Running   0          7m38s
percona-xtradb-cluster-operator-5857dfcb6c-g7bbg   1/1     Running   0          8m1s

In order to connect to this cluster, we’ll need to deploy a client pod for shell access.

$ kubectl run -i --rm --tty percona-client --image=percona:5.7 --restart=Never -- bash -il
If you don't see a command prompt, try pressing enter.
bash-4.2$ mysql -h cluster1-pxc-proxysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3617
Server version: 5.5.30 (ProxySQL)
Copyright (c) 2009-2019 Percona LLC and/or its affiliates
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> \s
--------------
mysql  Ver 14.14 Distrib 5.7.25-28, for Linux (x86_64) using  6.2
Connection id:		3617
Current database:	information_schema
Current user:		root@cluster1-pxc-proxysql-0.cluster1-pxc-proxysql.pxc.svc.cluste
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server version:		5.5.30 (ProxySQL)
Protocol version:	10
Connection:		cluster1-pxc-proxysql via TCP/IP
Server characterset:	latin1
Db     characterset:	utf8
Client characterset:	latin1
Conn.  characterset:	latin1
TCP port:		3306
Uptime:			14 min 1 sec
Threads: 1  Questions: 3  Slow queries: 0
--------------

A few things to remember:

  • Secrets for this setup are under deploy/secrets.yaml; you can decode a value via the command below (a matching encoding sketch follows after this list)
$ echo -n '{secret}' |base64 -D

  • To reconnect to the client shell
$ kubectl run -i --tty percona-client --image=percona:5.7 -- sh

  • To redeploy the pod, delete it first and then repeat the steps above without any configuration changes
$ kubectl delete -f deploy/cr.yaml

  • To stop and delete the minikube virtual environment
$ minikube stop

$ minikube delete
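As mentioned in the first bullet above, you can also go the other way and generate a new base64-encoded value to place into deploy/secrets.yaml before re-applying it. A minimal sketch, where the password is just a placeholder:

$ echo -n 'MyNewRootPassw0rd' | base64
$ kubectl apply -f deploy/secrets.yaml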

Credits


Photo by frank mckenna on Unsplash

Upcoming Webinar Wed 2/6: Percona Software News and Roadmap Update

Join Percona CEO Peter Zaitsev as he presents Percona Software News and Roadmap Update on Wednesday, February 6, 2019, at 11:00 AM PST (UTC-8) / 2:00 PM EST (UTC-5).

Register Now

Come and listen to Percona CEO Peter Zaitsev discuss what’s new in Percona open source software. Topics include Percona Server for MySQL and MongoDB, Percona XtraBackup, Percona Toolkit, Percona XtraDB Cluster and Percona Monitoring and Management.

During this webinar, Peter will talk about newly released features in Percona software. He will also show a few quick demos and share with you highlights from the Percona open source software roadmap.

Peter will also talk about new developments in Percona commercial services and finish with a Q&A.

Register today to join Peter for his Percona Software News and Roadmap Update.

ProxySQL 1.4.14 and Updated proxysql-admin Tool

ProxySQL 1.4.14, released by ProxySQL, is now available for download in the Percona Repository along with an updated version of Percona’s proxysql-admin tool.

ProxySQL is a high-performance proxy, currently for MySQL and database servers in the MySQL ecosystem (such as Percona Server for MySQL and MariaDB). It acts as an intermediary for client requests seeking resources from the database. René Cannaò created ProxySQL for DBAs as a means of solving complex replication topology issues.

The ProxySQL 1.4.14 source and binary packages available from the Percona download page for ProxySQL include ProxySQL Admin – a tool developed by Percona to configure Percona XtraDB Cluster nodes into ProxySQL. Docker images for release 1.4.14 are available as well. You can download the original ProxySQL from GitHub. GitHub hosts the documentation in the wiki format.

This release introduces an improvement in how proxysql-admin works with the --max-connections option. In previous releases, this option was always set to 1000. Now, proxysql_galera_checker uses the value of the --max-connections option set by the user either on the command line (proxysql-admin --max-connections) or in the configuration file.

If the user doesn’t set this option, it defaults to 1000.
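For example, assuming a typical proxysql-admin setup, the value can be passed on the command line when enabling the configuration (a hedged sketch; the value shown is illustrative):

$ proxysql-admin --enable --max-connections=2000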

Improvements

  • PSQLADM-130: Every time a node is removed and then added back, the proxysql_galera_checker script restores the custom value of the --max-connections option set using proxysql-admin --max-connections.
  • The --syncusers option of proxysql-admin now supports MariaDB. Thanks to Jonas Kint (@jonaskint) for this contribution.

ProxySQL is available under the open source GPLv3 license.

ProxySQL Native Support for Percona XtraDB Cluster (PXC)

ProxySQL in its versions up to 1.x did not natively support Percona XtraDB Cluster (PXC). Instead, it relied on the flexibility offered by the scheduler. This approach allowed users to implement their own preferred way to manage the ProxySQL behaviour in relation to the Galera events.

From version 2.0 we can use native ProxySQL support for PXC. The mechanism to activate native support is very similar to the one already in place for group replication.

In brief, it is based on the table [runtime_]mysql_galera_hostgroups, and the information needed is mostly the same:

  • writer_hostgroup: the hostgroup ID that refers to the WRITER
  • backup_writer_hostgroup: the hostgroup ID referring to the hostgroup that will contain the candidate writer servers
  • reader_hostgroup: the reader hostgroup ID, containing the list of servers that need to be taken into consideration
  • offline_hostgroup: the hostgroup ID that will eventually contain the writer that is put OFFLINE
  • active: True[1]/False[0], whether this configuration should be used or not
  • max_writers: the maximum number of writers you want to have at the same time. In a sane setup this should always be 1, but if you want multiple writers you can define it up to the number of nodes.
  • writer_is_also_reader: if true [1], the writer will NOT be removed from the reader HG
  • max_transactions_behind: the wsrep_local_recv_queue length after which the node will be set OFFLINE. This must be set carefully, observing the node’s behaviour (a quick way to check the queue follows after this list).
  • comment: I suggest putting some meaningful notes here to identify what is what.
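To get a feel for a reasonable max_transactions_behind value, watch the receive queue on the PXC nodes under normal load; a minimal check:

SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';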

Given the above, let us see what we need to do in order to have a working Galera native solution.
I will have three servers:

192.168.1.205 (Node1)
192.168.1.21  (Node2)
192.168.1.231 (Node3)

As the set of hostgroups, I will have:

Writer  HG-> 100
Reader  HG-> 101
BackupW HG-> 102
offHG   HG-> 9101

To set it up

Servers first:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,1000);

Then the galera settings:

insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) values (100,102,101,9101,0,1,1,16);

As usual, if we want to have R/W split, we need to define the rules for it:

insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,'windmills','app_test',100,1,3,'^SELECT.*FOR UPDATE',1);
insert into mysql_query_rules (rule_id,proxy_port,schemaname,username,destination_hostgroup,active,retries,match_digest,apply) values(1041,6033,'windmills','app_test',101,1,3,'^SELECT.*@@',1);
save mysql query rules to disk;
load mysql query rules to run;

Then there is another important variable: the server version. Please do yourself a good service and NEVER use the default.

update global_variables set variable_value='5.7.0' where variable_name='mysql-server_version';
LOAD MYSQL VARIABLES TO RUNTIME;SAVE MYSQL VARIABLES TO DISK;

Finally activate the whole thing:

save mysql servers to disk;
load mysql servers to runtime;

One thing to note before we go ahead. In the list of servers I had:

  1. Filled only the READER HG
  2. Used the same weight

This is because of the election mechanism ProxySQL will use to identify the writer, and the (many) problems that may be attached to it.

For now let us go ahead and see what happens when I load this information to runtime.

Before running the above commands:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

After:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 1000   | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 501        |
| 1000   | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
| 1000   | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 546        |
| 1000   | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0	 | 0	   | 0           | 0	   | 0                 | 0               | 0               | 467        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

mysql> select * from runtime_mysql_galera_hostgroups\G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                active: 0  <----------- note this
            max_writers: 1
  writer_is_also_reader: 1
max_transactions_behind: 16
                comment: NULL
1 row in set (0.01 sec)

As we can see, ProxySQL has taken the nodes from my READER group and distributed them, adding one to the writer hostgroup and two to the backup_writer hostgroup.

But – there is a but – wasn’t my rule set with Active=0? Indeed it was, and I assume this is a bug (issue #1902).

The other thing we should note is that ProxySQL elected node 3 (192.168.1.231) as the writer.
As I said before, what should we do IF we want to have a specific node as the preferred writer?

We need to modify its weight. So, say we want to have node 1 (192.168.1.205) as the writer, we will need something like this:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,10000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);

Doing that will give us:

+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 10000  | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 2209       |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 508        |
| 10000  | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 2209       |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 546        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 508        |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

If you noticed, given we had set the WEIGHT of node 1 higher, this node will also become the most utilized for reads.
We probably do not want that, so let us modify its reader weight.

update mysql_servers set weight=10 where hostgroup_id=101 and hostname='192.168.1.205';
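As with any change to mysql_servers, remember that the update only takes effect once it is loaded to runtime (and saved to disk if you want it to survive a restart); the same pattern used earlier applies:

load mysql servers to runtime;
save mysql servers to disk;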

At this point, if we trigger a failover with set global wsrep_reject_queries=all; on node 1,
ProxySQL will take action and will elect another node as writer:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 562        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 588        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 588        |
| 10000  | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 468        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Node 3 (192.168.1.231) is the new writer and node 1 is in the special OFFLINE group.
Let us see now what happens IF we put node 1 back.

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 10000  | 100       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
| 10000  | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 449        |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 532        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 569        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Oops! The READER has come back with the HIGHEST weight and, as such, it will be the most used node once more. To fix it, we need to re-run the update as before.

But is there a way to avoid this? In short, the answer is NO!
This, in my opinion, is BAD and is worth a feature request, because it can really bring a node to its knees.

Now this is not the only problem. There is another point that is probably worth discussing, which is the fact that ProxySQL currently does FAILOVER/FAILBACK.

Failover is obviously something we want to have, but failback is another discussion. The point is, once the failover is complete and the cluster has redistributed the incoming requests, doing a failback is an impacting operation that can be disruptive too.

If all nodes are treated as equal, there is no real way to prevent it, while if YOU set a node to be the main writer, something can be done. Let us see what and how.
Say we have:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.205',101,3306,1000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.21',101,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.1.231',101,3306,100);
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 1000   | 100       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 613        |
| 10     | 101       | 192.168.1.205 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 470        |
| 100    | 102       | 192.168.1.231 | 3306     | ONLINE | 0        | 0        | 0      | 0	  | 0           | 0	  | 0                 | 0               | 0               | 558        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 613        |
+--------+-----------+---------------+----------+--------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

Let us put the node down:
set global wsrep_reject_queries=all;

And check:

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 519        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	      | 0           | 0	      | 0                 | 0               | 0               | 506        |
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 506        |
| 1000   | 9101      | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 527        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

We can now manipulate the weight in the special OFFLINE group and see what happens:

update mysql_servers set weight=10 where hostgroup_id=9101 and hostname='192.168.1.205';

Then I put the node up again:
set global wsrep_reject_queries=none;

+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | ConnFree | ConnOK | ConnERR | MaxConnUsed | Queries | Queries_GTID_sync | Bytes_data_sent | Bytes_data_recv | Latency_us |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+
| 100    | 100       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       | 192.168.1.231 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 537        |
| 100    | 101       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 573        |
| 10     | 101       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 458	|
| 100    | 102       | 192.168.1.21  | 3306     | ONLINE  | 0        | 0        | 0      | 0	   | 0           | 0	   | 0                 | 0               | 0               | 573	|
| 10     | 102       | 192.168.1.205 | 3306     | ONLINE  | 0        | 0        | 0      | 0       | 0           | 0       | 0                 | 0               | 0               | 458        |
+--------+-----------+---------------+----------+---------+----------+----------+--------+---------+-------------+---------+-------------------+-----------------+-----------------+------------+

That’s it, the node is back but with no service interruption.

At this point we can decide whether to make this node a reader like the others, or wait and plan a proper time of day when we can put it back as the writer, while in the meantime it takes a bit of load to warm its buffer pool.

The other point – and important information – is: what is ProxySQL currently checking on Galera? From reading the code, the proxy will trap the following:

  • read_only
  • wsrep_local_recv_queue
  • wsrep_desync
  • wsrep_reject_queries
  • wsrep_sst_donor_rejects_queries
  • primary_partition

Plus the standard sanity checks on the node.

Finally to monitor the whole situation we can use this:

mysql> select * from mysql_server_galera_log order by time_start_us desc limit 10;
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
| hostname      | port | time_start_us    | success_time_us | primary_partition | read_only | wsrep_local_recv_queue | wsrep_local_state | wsrep_desync | wsrep_reject_queries | wsrep_sst_donor_rejects_queries | error |
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+
| 192.168.1.231 | 3306 | 1549982591661779 | 2884            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982591659644 | 2778            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982591658728 | 2794            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982586669233 | 2827            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982586663458 | 5100            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982586658973 | 4132            | YES               | NO        | 0                      | 4                 | NO           | YES                  | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982581665317 | 3084            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.21  | 3306 | 1549982581661261 | 3129            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.205 | 3306 | 1549982581658242 | 2786            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
| 192.168.1.231 | 3306 | 1549982576661349 | 2982            | YES               | NO        | 0                      | 4                 | NO           | NO                   | NO                              | NULL  |
+---------------+------+------------------+-----------------+-------------------+-----------+------------------------+-------------------+--------------+----------------------+---------------------------------+-------+

As you can see above the log table keeps track of what is changed. In this case, it reports that node 1 has wsrep_reject_queries activated, and it will continue to log this until we set wsrep_reject_queries=none.

Conclusions

ProxySQL’s native Galera integration is a useful feature to manage any Galera implementation, no matter whether it’s Percona XtraDB Cluster (PXC), MariaDB Cluster or MySQL/Galera.

The generic approach is obviously a good thing, though it may miss some specific extensions, like the performance_schema pxc_cluster_view table we have in PXC.

I’ve already objected about the failover/failback, and I am here again to remind you: whenever you do a controlled failover REMEMBER to change the weight to prevent an immediate failback.

This is obviously not possible in the case of a real failover, and, for instance, a simple temporary eviction will cause two downtimes instead of only one. Some environments are fine with that, others not so much.

Personally I think there should be a FLAG in the configuration, such that we can decide if failback should be executed or not.

 

Percona XtraDB Cluster 5.7.25-31.35 Is Now Available

Percona is glad to announce the release of Percona XtraDB Cluster 5.7.25-31.35 on February 28, 2019. Binaries are available from the downloads section or from our software repositories.

This release of Percona XtraDB Cluster includes the support of Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.7.25-31.35 is now the current release, based on the following:

All Percona software is open-source and free.

Bugs Fixed

  • PXC-2346: mysqld could crash when executing mysqldump --single-transaction while the binary log is disabled. This problem was also reported in PXC-1711, PXC-2371, and PXC-2419.
  • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Percona XtraDB Cluster 5.6.43-28.32 Is Now Available

Percona is glad to announce the release of Percona XtraDB Cluster 5.6.43-28.32 on February 28, 2019. Binaries are available from the downloads section or from our software repositories.

This release of Percona XtraDB Cluster includes the support of Ubuntu 18.10 (Cosmic Cuttlefish). Percona XtraDB Cluster 5.6.43-28.32 is now the current release, based on the following:

All Percona software is open-source and free.

Bugs Fixed

  • PXC-2388: In some cases, DROP FUNCTION function_name was not replicated.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Want IST Not SST for Node Rejoins? We Have a Solution!

What if we told you that there is a sure way to get IST, not SST, for node rejoins? You can guarantee that a rejoining node uses IST. Sound interesting? Keep reading.

Normally, when a node is taken out of the cluster for a short period of time (for maintenance or shutdown), the gcache on the other nodes of the cluster helps donate the missing write-set(s) when the node rejoins. This approach works if you have configured a large enough gcache, or if the downtime is short enough. Neither approach is ideal, especially for a production cluster. Also, a larger gcache for the server’s lifetime means blocking a larger amount of disk space when the same job can be done with relatively less disk space.

Re-configuring the gcache on a potential DONOR node before downtime requires a node shutdown. (Dynamic resizing of the gcache is not possible, or rather not needed now.) Restoring it back to the original size needs another shutdown. So that’s “three shutdowns” for a single downtime. No way… not acceptable with busy production clusters and the possibility of more errors.

Introducing “gcache.freeze_purge_at_seqno”

Given this pain point, we are introducing gcache.freeze_purge_at_seqno in Percona XtraDB Cluster 5.7.20. This controls the purging of the gcache, thereby retaining more data to facilitate IST when the node rejoins.

All transactions in the Galera cluster world are assigned a unique global sequence number (seqno). Tracking happens using this seqno (wsrep_last_applied, wsrep_last_committed, wsrep_replicated, wsrep_local_cached_downto, etc.). wsrep_local_cached_downto represents the sequence number down to which the gcache has been purged. Say wsrep_local_cached_downto = N; then the gcache has data from [N, wsrep_replicated] and has purged data from [1, N).
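To see where a node currently stands with respect to these counters, you can query them directly (a quick check; the exact set of wsrep_ status variables varies slightly between versions):

SHOW GLOBAL STATUS WHERE Variable_name IN
('wsrep_last_applied', 'wsrep_last_committed', 'wsrep_local_cached_downto');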

gcache.freeze_purge_at_seqno takes three values:

  1. -1 (default): no freeze, the purge operates as normal.
  2. x (should be a valid seqno in gcache): freeze purging of write-sets >= x. The best way to select x is to use the wsrep_last_applied value as an indicator from the node that you plan to shut down. (x = wsrep_last_applied * 0.9; retain this extra 10% margin to trick the safety gap heuristic algorithm of IST.)
  3. now: freeze purging of write-sets >= the smallest seqno currently in gcache. Instant freeze of gcache-purge. (If tracking x (above) is difficult, simply use “now” and you are good.)

Set this on an existing node of the cluster (that will continue to be part of the cluster and can act as potential DONOR). This node continues to retain the write-sets, thereby allowing the restarting node to rejoin using IST. (You can feed the said node as a preferred DONOR through wsrep_sst_donor while restarting the said rejoining node.)

Remember to set it back to -1 once the node rejoins. This avoids hogging space on the DONOR beyond the said timeline. On the next purge cycle, all the old retained write-sets are freed as well (reclaiming the space back to original).

Note:

To find out the existing value of gcache.freeze_purge_at_seqno, query wsrep_provider_options:
select @@wsrep_provider_options;
To set gcache.freeze_purge_at_seqno:
set global wsrep_provider_options="gcache.freeze_purge_at_seqno = now";
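Putting it together, a typical flow on the DONOR side might look like this (a minimal sketch; the use of “now” and the surrounding steps are illustrative assumptions):

-- on a node that stays up and can act as potential DONOR, before stopping the node under maintenance:
set global wsrep_provider_options="gcache.freeze_purge_at_seqno = now";
-- ... restart the other node; it should rejoin using IST ...
-- once the restarted node is back and Synced, release the freeze so gcache purging resumes:
set global wsrep_provider_options="gcache.freeze_purge_at_seqno = -1";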

Why should you use it?

  • gcache grows dynamically (using existing pagestore mechanism) and shrinks once the user sets it back to -1. This means you only use disk-space when needed.
  • No restart is needed. The user can concentrate on the maintenance node only.
  • No complex math or understanding of seqno involved (simply use “now”).
  • Less prone to error, as SST is one of the major error-prone areas with the cluster.

So why wait? Give it a try! It is part of Percona XtraDB Cluster 5.7.20 onwards, and it helps you get IST, not SST, for node rejoins.

Note: If you need more information about gcache, check here and here.

ProxySQL 1.4.5 and Updated proxysql-admin Tool Now in the Percona Repository

ProxySQL 1.4.5, released by ProxySQL, is now available for download in the Percona Repository along with an updated version of Percona’s proxysql-admin tool.

ProxySQL is a high-performance proxy, currently for MySQL and its forks (like Percona Server for MySQL and MariaDB). It acts as an intermediary for client requests seeking resources from the database. René Cannaò created ProxySQL for DBAs as a means of solving complex replication topology issues.

The ProxySQL 1.4.5 source and binary packages available at https://percona.com/downloads/proxysql include ProxySQL Admin – a tool, developed by Percona to configure Percona XtraDB Cluster nodes into ProxySQL. Docker images for release 1.4.5 are available as well: https://hub.docker.com/r/percona/proxysql/. You can download the original ProxySQL from https://github.com/sysown/proxysql/releases.

This release fixes the following bugs in ProxySQL Admin:

Usability improvements:

  • #PSQLADM-6: If the cluster node goes offline, the proxysql_node_monitor script now sets the node status as OFFLINE_HARD, and does not remove it from the ProxySQL database. Also, logging is consistent regardless of the cluster node online status.
  • #PSQLADM-30: Validation was added for the host priority file.
  • #PSQLADM-33: Added --proxysql-datadir option to run the proxysql-admin script with a custom ProxySQL data directory.
  • Also, a BATS test suite was added for proxysql-admin testing.

Bug fixes:

  • Fixed #PSQLADM-5: PXC mode specified with proxysql-admin using the --mode parameter was not persistent.
  • Fixed #PSQLADM-8: ProxySQL high CPU load took place when mysqld was hanging.

ProxySQL is available under the open source GPLv3 license.

Optimizing PXC Xtrabackup State Snapshot Transfer

State Snapshot Transfer (SST) at a glance

PXC uses a protocol called State Snapshot Transfer to provision a node joining an existing cluster with all the data it needs to synchronize.  This is analogous to cloning a slave in asynchronous replication:  you take a full backup of one node and copy it to the new one, while tracking the replication position of the backup.

PXC automates this process using scriptable SST methods.  The most common of these methods is the xtrabackup-v2 method which is the default in PXC 5.6.  Xtrabackup generally is more favored over other SST methods because it is non-blocking on the Donor node (the node contributing the backup).

The basic flow of this method is:

  • The Joiner:
    • joins the cluster
    • Learns it needs a full SST and clobbers its local datadir (the SST will replace it)
    • prepares for a state transfer by opening a socat on port 4444 (by default)
    • The socat pipes the incoming files into the datadir/.sst directory
  • The Donor:
    • is picked by the cluster (could be configured or be based on WAN segments)
    • starts a streaming Xtrabackup and pipes the output of that via socat to the Joiner on port 4444.
    • Upon finishing its backup, it sends an indication of this, and the final Galera GTID of the backup, to the Joiner
  • The Joiner:
    • Records all changes from the Donor’s backup’s GTID forward in its gcache (and overflow pages, this is limited by available disk space)
    • runs the --apply-log phase of Xtrabackup on the donor’s backup
    • Moves the datadir/.sst directory contents into the datadir
    • Starts mysqld
    • Applies all the transactions it needs (Joining and Joined states just like IST does it)
    • Moves to the ‘Synced’ state and is done.

There are a lot of moving pieces here, and nothing is really tuned by default.  On larger clusters, SST can be quite scary because it may take hours or even days.  Any failure can mean starting over again from the start.

This blog will concentrate on some ways to make a good dent in the time SST can take.  Many of these methods are trade-offs and may not apply to your situation.  Further, there may be other ways I haven’t thought of to speed things up, so please share what you’ve found that works!

The Environment

I am testing SST on a PXC 5.6.24 cluster in AWS.  The nodes are c3.4xlarge and the datadirs are RAID-0 over the two ephemeral SSD drives in that instance type.  These instances are all in the same region.

My simulated application is using only node1 in the cluster and is sysbench OLTP with 200 tables with 1M rows each.  This comes out to just under 50G of data.  The test application runs on a separate server with 32 threads.

The PXC cluster itself is tuned to best practices for InnoDB and Galera performance.

Baseline

In my first test the cluster is a single member (receiving workload) and I am  joining node2.  This configuration is untuned for SST.  I measured the time from when mysqld started on node2 until it entered the Synced state (i.e., fully caught up).  In the log, it looked like this:

150724 15:59:24 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
... lots of other output ...
2015-07-24 16:48:39 31084 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 4647341)

Doing some math on the above, we find that the SST took 51 minutes to complete.

--use-memory

One of the first things I noticed was that the --apply-log step on the Joiner was very slow.  Anyone who uses Xtrabackup a lot will know that --apply-log will be a lot faster if you give it some extra RAM to use while making the backup consistent via the --use-memory option.  We can set this in our my.cnf like this:

[sst]
inno-apply-opts="--use-memory=20G"

The [sst] section is a special one understood only by the xtrabackup-v2 script.  inno-apply-opts allows me to specify arguments to innobackupex when it runs.

Note that this change only applies to the Joiner (i.e., you don’t have to put it on all your nodes and restart them to take advantage of it).

This change immediately makes a huge improvement to our above scenario (node2 joining node1 under load) and the SST now takes just over 30 minutes.

wsrep_slave_threads

Another slow part of getting to Synced is how long it takes to apply transactions up to realtime after the backup is restored and in place on the Joiner.  We can improve this throughput by increasing the number of apply threads on the Joiner to make better use of the CPU.  Prior to this, wsrep_slave_threads was set to 1, but if I increase it to 32 (there are 16 cores on this instance type) my SST now takes 25m 32s.
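For reference, wsrep_slave_threads is a normal server option, so it can be set in my.cnf on the Joiner and persist across the restart; a small sketch using the value from the test above:

[mysqld]
wsrep_slave_threads = 32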

Compression

xtrabackup-v2 supports adding a compression process into the datastream.  On the Donor it compresses and on the Joiner it decompresses.  This allows you to trade CPU for transfer speed.  If your bottleneck turns out to be network transport and you have spare CPU, this can help a lot.

Further, I can use pigz instead of gzip to get parallel compression, but theoretically any compression utility can work as long as it can compress and decompress standard input to standard output.  I install the ‘pigz’ package on all my nodes and change my my.cnf like this:

[sst]
inno-apply-opts="--use-memory=20G"
compressor="pigz"
decompressor="pigz -d"

Both the Joiner and the Donor must have the respective decompressor and compressor settings or the SST will fail with a vague error message (not actually having pigz installed will do the same thing).

By adding compression, my SST is down to 21 minutes, but there’s a catch.  My application performance starts to take a serious nose-dive during this test.  Pigz is consuming most of the CPU on my Donor, which is also my primary application node.  This may or may not hurt your application workload in the same way, but this emphasizes the importance of understanding (and measuring) the performance impact of SST has on your Donor nodes.
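If you still want compression but need to protect the Donor, one option (not something used in the test above, so treat it as a hedged tweak) is to cap the number of threads pigz may use with its -p option:

[sst]
inno-apply-opts="--use-memory=20G"
compressor="pigz -p 4"
decompressor="pigz -p 4 -d"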

Dedicated donor

To alleviate the problem with the application, I now leave node2 up and spin up node3.  Since I’m expecting node2 to normally not be receiving application traffic directly, I can configure node3 to prefer node2 as its donor like this:

[mysqld]
...
wsrep_sst_donor = node2,

When node3 starts, this setting instructs the cluster that node3 is the preferred donor, but if that’s not available, pick something else (that’s what the trailing comma means).

Donor nodes are permitted to fall behind in replication apply as needed without sending flow control.  Sending application traffic to such a node may see an increase in the amount of stale data as well as certification failures for writes (not to mention the performance issues we saw above with node1).  Since node2 is not getting application traffic, moving into the Donor state and doing an expensive SST with pigz compression should be relatively safe for the rest of the cluster (in this case, node1).

Even if you don’t have a dedicated donor, if you use a load balancer of some kind in front of your cluster, you may elect to consider Donor nodes as failing their health checks so application traffic is diverted during any state transfer.
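As a rough illustration of that idea, a health check can treat anything other than Synced (wsrep_local_state = 4) as unhealthy, so a node in the Donor/Desynced state (2) gets pulled from rotation. This is only a sketch along the lines of the stock clustercheck script; the monitoring credentials are placeholders:

#!/bin/bash
# Report healthy only when the local node reports wsrep_local_state = 4 (Synced).
STATE=$(mysql -u monitor -pmonitorpass -N -B \
  -e "SHOW STATUS LIKE 'wsrep_local_state'" 2>/dev/null | awk '{print $2}')
if [ "$STATE" = "4" ]; then
    echo "HTTP/1.1 200 OK"; echo; echo "PXC node is Synced"
    exit 0
else
    echo "HTTP/1.1 503 Service Unavailable"; echo; echo "PXC node is not Synced (state: ${STATE:-unknown})"
    exit 1
fi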

When I brought up node3, with node2 as the donor, the SST time dropped to 18m 33s.

Conclusion

Each of these tunings helped the SST speed, though the later adjustments maybe had less of a direct impact.  Depending on your workload, database size, network and CPU available, your mileage may of course vary.  Your tunings should vary accordingly, but also realize you may actually want to limit (and not increase) the speed of state transfers in some cases to avoid other problems. For example, I’ve seen several clusters get unstable during SST and the only explanation for this is the amount of network bandwidth consumed by the state transfer preventing the actual Galera communication between the nodes. Be sure to consider the overall state of production when tuning your SSTs.


High-load clusters and desynchronized nodes on Percona XtraDB Cluster

There can be a lot of confusion and a lack of planning in Percona XtraDB Cluster deployments with regard to nodes becoming desynchronized for various reasons. This can happen in a few ways.

When I say “desynchronized” I mean a node that is permitted to build up a potentially large wsrep_local_recv_queue while some operation is happening.  For example a node taking a backup would set wsrep_desync=ON during the backup and potentially fall behind replication some amount.

Some of these operations may completely block Galera from applying transactions, while others may simply increase load on the server enough that it falls behind and applies at a reduced rate.

In all the cases above, flow control is NOT used while the node cannot apply transactions, but it MAY be used while the node is recovering from the operation.  For an example of this, see my last blog about IST.

If a cluster is fairly busy, then the flow control that CAN happen when the above operations catch up MAY be detrimental to performance.
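If you want to see whether flow control is already in play, a couple of status counters on any node tell the story (a quick check; counter names can differ slightly between Galera versions):

SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_%';
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';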

Example setup

Let us take my typical 3 node cluster with workload on node1.  We are taking a blocking backup of some kind on node3 so we are executing the following steps:

  1. node3> set global wsrep_desync=ON;
  2. Node3’s “backup” starts, this starts with FLUSH TABLES WITH READ LOCK;
  3. Galera is paused on node3 and the wsrep_local_recv_queue grows some amount
  4. Node3’s “backup” finishes, finishing with UNLOCK TABLES;
  5. node3> set global wsrep_desync=OFF;

During the backup

This includes up through step 3 above.  My node1 is unaffected by the backup on node3; I can see it averaging 5-6k writesets (transactions) per second, as it did before we began:

[graph: node1 write throughput during the backup on node3]

node2 is also unaffected:

[graph: node2 write throughput during the backup on node3]

but node3 is not applying and its queue is building up:

[graph: node3 apply rate stalled and its receive queue growing]

Unlock tables, still wsrep_desync=ON

Let’s examine briefly what happens when node3 is permitted to start applying, but wsrep_desync stays enabled:

[graph: node1 write throughput after UNLOCK TABLES, with wsrep_desync still ON]

node1’s performance is pretty much the same, and node3 is not using flow control yet. However, there is a problem:

[graph: node3 replication queue still growing, not catching up]

It’s hard to notice, but node3 is NOT catching up; instead, it is falling further behind!  We have potentially created a situation where node3 may never catch up.

The PXC nodes were close enough to the red-line of performance that node3 can only apply just about as fast (and somewhat slower until it heats up a bit) as new transactions are coming into node1.

This represents a serious concern in PXC capacity planning:

Nodes do not only need to be fast enough to handle normal workload, but also to catch up after maintenance operations or failures cause them to fall behind.

Experienced MySQL DBAs will realize this isn’t all that different from master/slave replication.

Flow Control as a way to recovery

So here’s the trick:  if we turn off wsrep_desync on node3 now, node3 will use flow control if and only if the incoming replication exceeds node3’s apply rate.  This gives node3 a good chance of catching up, but the tradeoff is reducing write throughput of the cluster.  Let’s see what this looks like in context with all of our steps.  wsrep_desync is turned off at the peak of the replication queue size on node3, around 12:20PM:

[graph: cluster write throughput after wsrep_desync=OFF on node3]

[graph: node3 replication queue beginning to drain after wsrep_desync=OFF]

So at the moment node3 starts utilizing flow control to prevent falling further behind, our write throughput (in this specific environment and workload) is reduced by approximately 1/3rd (YMMV).   The cluster will remain in this state until node3 catches up and returns to the ‘Synced’ state.  This catchup is still happening as I write this post, almost 4 hours after it started and will likely take another hour or two to complete.

I can see a more realtime representation of this by using myq_status on node1, summarizing every minute:

[root@node1 ~]# myq_status -i 1m wsrep
mycluster / node1 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
19:58:47 P   5  3 Sync 0.9ms 3128 2.0M   0   27 213b   0 25.4s   0   0   0 3003k  16k  62%
19:59:47 P   5  3 Sync 1.1ms 3200 2.1M   0   31 248b   0 18.8s   0   0   0 3003k  16k  62%
20:00:47 P   5  3 Sync 0.9ms 3378 2.2M  32   27 217b   0 26.0s   0   0   0 3003k  16k  62%
20:01:47 P   5  3 Sync 0.9ms 3662 2.4M  32   33 266b   0 18.9s   0   0   0 3003k  16k  62%
20:02:47 P   5  3 Sync 0.9ms 3340 2.2M  32   27 215b   0 27.2s   0   0   0 3003k  16k  62%
20:03:47 P   5  3 Sync 0.9ms 3193 2.1M   0   27 215b   0 25.6s   0   0   0 3003k  16k  62%
20:04:47 P   5  3 Sync 0.9ms 3009 1.9M  12   28 224b   0 22.8s   0   0   0 3003k  16k  62%
20:05:47 P   5  3 Sync 0.9ms 3437 2.2M   0   27 218b   0 23.9s   0   0   0 3003k  16k  62%
20:06:47 P   5  3 Sync 0.9ms 3319 2.1M   7   28 220b   0 24.2s   0   0   0 3003k  16k  62%
20:07:47 P   5  3 Sync 1.0ms 3388 2.2M  16   31 251b   0 22.6s   0   0   0 3003k  16k  62%
20:08:47 P   5  3 Sync 1.1ms 3695 2.4M  19   39 312b   0 13.9s   0   0   0 3003k  16k  62%
20:09:47 P   5  3 Sync 0.9ms 3293 2.1M   0   26 211b   0 26.2s   0   0   0 3003k  16k  62%

This reports around 20-25 seconds of flow control every minute, which is consistent with that ~1/3rd of performance reduction we see in the graphs above.

Watching node3 the same way proves it is sending the flow control (FlowC snt):

mycluster / node3 (idx: 2) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
17:38:09 P   5  3 Dono 0.8ms    0   0b   0 4434 2.8M 16m 25.2s  31   0   0 18634  16k  80%
17:39:09 P   5  3 Dono 1.3ms    0   0b   1 5040 3.2M 16m 22.1s  29   0   0 37497  16k  80%
17:40:09 P   5  3 Dono 1.4ms    0   0b   0 4506 2.9M 16m 21.0s  31   0   0 16674  16k  80%
17:41:09 P   5  3 Dono 0.9ms    0   0b   0 5274 3.4M 16m 16.4s  27   0   0 22134  16k  80%
17:42:09 P   5  3 Dono 0.9ms    0   0b   0 4826 3.1M 16m 19.8s  26   0   0 16386  16k  80%
17:43:09 P   5  3 Jned 0.9ms    0   0b   0 4957 3.2M 16m 18.7s  28   0   0 83677  16k  80%
17:44:09 P   5  3 Jned 0.9ms    0   0b   0 3693 2.4M 16m 27.2s  30   0   0  131k  16k  80%
17:45:09 P   5  3 Jned 0.9ms    0   0b   0 4151 2.7M 16m 26.3s  34   0   0  185k  16k  80%
17:46:09 P   5  3 Jned 1.5ms    0   0b   0 4420 2.8M 16m 25.0s  30   0   0  245k  16k  80%
17:47:09 P   5  3 Jned 1.3ms    0   0b   1 4806 3.1M 16m 21.0s  27   0   0  310k  16k  80%

There are a lot of flow control messages (around 30) per minute.  This is a lot of ON/OFF toggles of flow control where writes are briefly delayed rather than a steady “you can’t write” for 20 seconds straight.

It also interestingly spends a long time in the Donor/Desynced state (even though wsrep_desync was turned OFF hours before) and then moves to the Joined state (this has the same meaning as during an IST).

Does it matter?

As always, it depends.

If these are web requests and suddenly the database can only handle ~66% of the traffic, that’s likely a problem, but maybe it just slows down the website somewhat.  I want to emphasize that WRITES are what is affected here.  Reads on any and all nodes should be normal (though you probably don’t want to read from node3 since it is so far behind).

If this were some queue processing that had reduced throughput, I’d expect it to possibly catch up later.

This can only be answered for your application, but the takeaways for me are:

  • Don’t underestimate your capacity requirements
  • Being at the redline normally means you are well past the redline for abnormal events.
  • Plan for maintenance and failure recoveries
  • Where possible, build queuing into your workflows so diminished throughput in your architecture doesn’t generate failures.

Happy clustering!

Graphs in this post courtesy of VividCortex.
