A conversation with 5 Facebook MySQL gurus

A conversation with 6 Facebook MySQL gurusFacebook, the undisputed king of online social networks, has 1.23 billion monthly active users collectively contributing to an ocean of data-intensive tasks – making the company one of the world’s top MySQL users.

A small army of Facebook MySQL experts will be converging on Santa Clara, Calif. next week where several of them are leading sessions at the Percona Live MySQL Conference and Expo. I had the chance to chat virtually with four of them about their sessions: Steaphan Greene, Evan Elias, Shlomo Priymak and Yoshinori MatsunobuMark Callaghan, who spoke at Percona Live last year, also joined our conversation which included a discussion of Facebook’s use of MySQL and other open source technologies.


Tom: What’s Facebook’s view of Open Source?

Steaphan: Facebook was built on open source software, and we still invest heavily in open source today. We understand the power these communities have to drive innovation – they allow us to focus on new challenges, as opposed to reinventing the wheel over and over again. And contributing as much as possible back to open projects is in everyone’s best interest.


Tom: Why MySQL? Wouldn’t NoSQL databases, for example, be better suited for the massive workloads seen at Facebook?

Mark: MySQL is great for many of our important workloads. We make it even better with our expertise in MySQL operations and engineering, and by working with the community and learning from their experience.

Yoshinori: I have not been able to find a transactional NoSQL database better than InnoDB. And it’s easy to understand how MySQL Replication works, which makes much easier to fix problems in production.


 Tom: How does Facebook make MySQL scale?

Steaphan: Sharding, automation, monitoring, and heavy investment in operations and performance engineering.


 Tom: What other things help Facebook run smoothly?

Steaphan: Our completely open culture, and the freedom all engineers here have to try any idea they have.


Tom: What is the top scaling challenge(s) Facebook faces in 2014 – and beyond?

Mark: Our biggest challenge is to make things better (performance, efficiency, availability) in the future at the rate we made things better in the past.

Yoshinori: Availability has improved a lot so far for us. Come to my session at Percona Live to hear about that. For me personally, efficiency is the biggest challenge for 2014 and 2015. This includes reducing space and optimizing for newer-generation hardware.


Tom: Facebook deployed MySQL 5.6 last year – including on critical environments – long before many other large organizations. What prompted such a move so soon? And where there any major concerns?

Steaphan: The same thing that prompts most efforts on the Facebook infra team: We will consider any technology that will help us improve performance, efficiency, or reliability, and we’re willing to accept the risk that sometimes comes along with adopting things like 5.6 very early on. But that’s only half the story here. The other half is that Facebook encourages its engineers to go after big bets like these — in this case it was just one engineer who made this happen. And we had the MySQL engineering talent we needed to work with the Oracle team to get 5.6 ready for production at our scale.

Yoshinori: At Facebook, we have three MySQL teams — Operations, Performance and Engineering. Facebook is one of the very few MySQL users that has internal MySQL developers. We all worked hard to adapt 5.6 to our scale and ensure that it would be production-ready. We found some issues after production deployment, but in many cases we could fix the problem and deployed new MySQL binary within one or two days. When deploying in production, we expected that we encountered MySQL 5.6 specific issues, which was typical when releasing new software. We were just confident that we could fix issues immediately.

Our 5.6 deployment step was not all at once. At first rollout, we disabled most major 5.6 features, such as GTID and binlog checksum. We gradually enabled such features in production.


Tom: Where there any significant issues in that move to MySQL 5.6? Any lessons learned you like to share – along with best practices you’d like to share?

Yoshinori: Performance regression of the CPU intensive replication was a main blocker for some of our applications. I wrote a blog post about this last year. We have several design plans to fix the problem on MySQL side. One of the most effective plans is grouping multiple transactions into one, since the most expensive part is writing to InnoDB system table at transaction commit. This optimization would be done when writing binary log, or by SQL thread. It may take longer time to test and deploy in production. For existing applications, we optimized to group multiple transactions from application side to mitigate the problem.


Tom: Performance monitoring is usually challenging at any organization. How do you do that at Facebook, which has tens of thousands of MySQL instances?

Yoshinori: Top-N monitoring is very important for managing a huge number of instances. Average statistics (for example: average innodb_rows_read across all instances) is not always useful since ~1% of problematic instances won’t noticeably change average numbers. p99 gives better indicators, but in our environment we typically have fewer than 0.1% instances causing problems, in which case p99 is not helpful either. We have several graphical and command-line tools to efficiently list up top-N bad behaving hosts. After listing up bad instances, the way to investigate root cause is pretty straightforward, like what MySQL consultants usually do. Server failure is something we expect and plan for at Facebook. For example, typical MySQL DBA at small companies may not encounter master instance failure during employment, because recent mysqld and H/W are stable enough. At Facebook, master failure is a norm and something the system can accommodate.


Tom: Evan, you and Yoshinori will present on Global Transaction ID (GTID) at Facebook. GTID is very tricky to deploy to an existing large-scale environment – how, and why, did you decide on adopting it?

Evan: Our primary motivations for adopting GTID all relate to either failover or binlog backups. When a master fails, getting replicas in sync with GTID is substantially simpler, faster, and less error-prone than previous methods of diffing binlogs. For backups, GTID is a cornerstone in building cross-datacenter point-in-time recovery, without needing redundant binlog streams from every region.

The “how” question is a bit more involved, and we’ll be covering this in detail during the session. The GTID project was a joint effort between three of Facebook’s MySQL teams. Santosh added new functionality to the MySQL server to make online rollout possible, and Yoshinori improved MHA to seamlessly support GTID-based failover. I added GTID support to all of our other in-house automation, and also scripted the rollout procedure across our many thousands of replica sets. A lot of validation logic and monitoring functionality was involved to ensure the safety of the rollout.


Tom: Shlomo, your session is titled “Under the Hood – MySQL Pool Scanner (MPS).” As you point out in your talk, Facebook has one of the largest MySQL database clusters in the world, comprising thousands of servers across multiple data centers. You must have an army of DBAs – or is there some secret you’d like to share? 

Shlomo: We do have an army, yes — it’s an “army of one.” We have one person on call on the MySQL Operations team at a given time, and they don’t even need to do all that much most days. We built “robots” to do our day to day jobs. The largest and most complex robot we have is MPS, an automated system to do most of the work a DBA might otherwise spend time on, such as replacement of broken or overflowing servers. Among other things, MPS also allows a human to initiate complex bulk operations with a few keystrokes, and it will follow up and complete the operations over the course of days or weeks.

I’ll be describing some of the complex MySQL automation systems we have at Facebook, and how they fit together during my talk.


Tom: Shlomo, what does a typical day look like for you there at Facebook?

Shlomo: The team’s work mostly focuses on maintaining those robots I’ve mentioned, as well as developing new ways to improve the reliability of our databases for Facebook’s users. This year the team also spent a lot of time making sure the new MySQL features such as GTIDs and semi-sync are deeply integrated in our automation. Every day, we work hard to to make ourselves obsolete, but we haven’t gotten there just yet!

On a typical day, I probably spend much of the time coding, mainly in Python. I also spend a significant amount of time working on capacity-related projects, such as thinking of ways to optimize the way we distribute the data across our fleet of servers.
Even after 2.5 years at Facebook, I am still in awe of the number of servers we manage. The typical small-scale maintenance operation at Facebook probably involves more servers than all the companies I’ve previously worked for had, combined. It really is pretty amazing!


Tom:  What are you looking forward to the most at this year’s conference?

Evan: There are plenty of fascinating sessions this year. Just to mention a handful: Jeremy Cole and Davi Arnaut’s session on innodb_ruby, since it’s a very unique way to interactively learn about InnoDB’s internals. Baron Schwartz’s session on using Go with MySQL, as VividCortex is blazing the trail here. Peter Boros and Kenny Gryp’s talk on scalability and benchmarking, which I’m hoping will include recent developments of Percona Playback. Tom Christ’s session on my former project Jetpants, to see how it has evolved over the past year at Tumblr. And several talks by Oracle engineers about upcoming functionality in MySQL 5.7.

Steaphan: In addition to the conference sessions, I look forward to the birds of a feather session with the MySQL team.  Last year, it proved to be a valuable opportunity to engage with those upstream developers who make the changes we care about, and I expect the same this year.


Tom: If you could talk to a DBA or developer on the fence about attending this year’s conference, what would be your top 3-5 reasons for making it over to Santa Clara for this event?

Evan: I’m based in NYC, so I’m traveling a bit further than many of my colleagues, but I can still confidently say that Percona Live is well worth the trip. The MySQL ecosystem is very healthy and constantly evolving, and the conference is the best place to learn about ongoing developments across a wide spectrum of companies and contributors. It’s also a perfect opportunity to personally connect with all of the amazing engineers, DBAs, users, and vendors that make MySQL so unique and compelling.


 The Percona Live MySQL Conference and Expo 2014 runs April 1-4 in Santa Clara, Calif. Use the code “SeeMeSpeak” when registering and save 10 percent. The inaugural Open Source Appreciation Day is on March 31 – this full-day event is free but because space is limited I suggest registering now to reserve your spot.

The post A conversation with 5 Facebook MySQL gurus appeared first on MySQL Performance Blog.

Percona Server 5.6.16-64.2 with TokuDB engine Beta is now available

Percona Server version 5.6.15-64.2 with TokuDB engine

Percona Server version 5.6.16-64.2 with TokuDB engine

Percona is glad to announce the first BETA release of Percona Server 5.6.16-64.2 with TokuDB engine on March 27th, 2014. Downloads are available here and from the Percona Software Repositories.

Based on Percona Server 5.6.16-64.2 including all the features and bug fixes in it, and on TokuDB 7.1.5-rc.4, Percona Server 5.6.16-64.2-tokudb is the first BETA release in the Percona Server 5.6 with TokuDB engine series. All of Percona‘s software is open-source and free, all the details of the release can be found in the 5.6.16-64.2 milestone at Launchpad.

New Feature

  • Installer will now automatically install and enable TokuDB engine on new installations.

More information on how to install and use TokuDB can be found in the documentation. This feature is currently considered BETA quality.

Release notes for Percona Server 5.6.16-64.2 with TokuDB storage engine are available in our online documentation. We did our best to eliminate bugs and problems during the testing this release, but this is a software, so bugs are expected. If you encounter them, please report them to our bug tracking system.

The post Percona Server 5.6.16-64.2 with TokuDB engine Beta is now available appeared first on MySQL Performance Blog.

Innodb redo log archiving

Innodb redo log archivingPercona Server 5.6.11-60.3 introduces a new “log archiving” feature. Percona XtraBackup 2.1.5 supports “apply archived logs.” What does it mean and how it can be used?

Percona products propose three kinds of incremental backups. The first is full scan of data files and comparison the data with backup data to find some delta. This approach provides a history of changes and saves disk space by storing only data deltas. But the disadvantage is a full-data file scan that adds load to the disk subsystem. The second kind of incremental backup avoids extra disk load during data file scans.

The idea is in reading only changed data pages. The information about what specific pages were changed is provided by the server itself which writes files with the information during work. It’s a good alternative but changed-pages tracking adds some small load. And Percona XtraBackup’s delta reading leads to non-sequential disk io. This is good alternative but there is one more option.

The Innodb engine has a data log. It writes all operations which modify database pages to log files. This log is used in the case of unexpected server terminating to recover data. The Innodb log consists of the several log files which are filled sequentially in circular. The idea is to save those files somewhere and apply all modifications from archived logs to backup data files. The disadvantage of this approach is in using extra disk space. The advantage is there is no need to do an “explicit” backup on the host server. A simple script could sit and wait for logs to appear then scp/netcat them over to another machine.

But why not use good-old replication? Maybe replication does not have such performance as logs recovering but it is more controlled and well-known. Archived logs allows you to do any number of things with them from just storing them to doing periodic log applying. You can not recover from a ‘DROP TABLE’, etc with replication. But with this framework one could maintain the idea of “point in time” backups.

So the “archived logs” feature is one more option to organize incremental backups. It is not widely used as it was issued not so far and there is not A good understanding of how it works and how it can be used. We are open to any suggestions about its suggest improvements and use cases. The subject of this post is to describe how it works in depth. As log archiving is closely tied with innodb redo logs the internals of redo logs will be covered too. This post would be useful not only for DBA but also for Software Engineers because not only common principles are considered but the specific code too, and knowledge from this post can be used for further MySQL code exploring and patching.

What is the innodb log and how it is written?

Let’s remember what are innodb logs, why they are written, what they are used for.

The Innodb engine has buffer pool. This is a cache of database pages. Any changes are done on page in buffer pool, then page is considered as “dirty,” which means it must be flushed, and pushed to the flush list which is processed periodically by special thread. If pages are not flushed to disk and server is terminated unexpectedly the changes will be lost. To avoid this innodb writes changes to redo log and recover data from redo log during start. This technique allows to delay buffer pool pages flushing. It can increase performance because several changes of one page can be accumulated in memory and then flushed by one io. Except that flushed pages can be grouped to decrease the number of non-sequential io’s. But the down-side of this approach is time for data recovering. Let’s consider how this log is stored, generated and used for data recovering.

Log files

Redo log consists of a several log files which are treated as a circular buffer. The number and the size of log files can be configured. Each log file has a header. The description of this header can be found in “storage/innobase/include/log0log.h” by “LOG_GROUP_ID” keyword.

Each log file contains log records. Redo log records are written sequentially by log blocks of OS_FILE_LOG_BLOCK_SIZE size which is equal to 512 bytes by default and can be changed with innodb option. Each record has its LSN. LSN is a “Log Sequence Number” – the number of bytes written to log from the log creation to the certain log record. Each block consists of header, trailer and log records.

Log blocks

Let’s consider log block header. The first 4 bytes of the header is log block number. The block number is very similar as LSN but LSN is measured in bytes and block number is measured by OS_FILE_LOG_BLOCK_SIZE. Here is the simple formula how LSN is converted to block number:

return(((ulint) (lsn / OS_FILE_LOG_BLOCK_SIZE) & 0x3FFFFFFFUL) + 1);

This formula can be found in log_block_convert_lsn_to_no() function. The next two bytes is the number of bytes used in the block. The next two bytes is the offset of the first MTR log record in this block. What is MTR will be described below. Currently it can be considered as a synonym of bunch of log records which are gathered together as a description of some logical operation. For example it can be a group of log records for inserting new row to some table. This field is used when there are records of several MTR’s in one block. The next four bytes is a checkpoint number. The trailer is four bytes of log block checksum. The above description can be found in “storage/innobase/include/log0log.h” by “LOG_BLOCK_HDR_NO” keyword.

Before writing to disk log blocks must be somehow formed and stored. And the question is:

How log blocks are stored in memory and on disk?

Where log blocks are stored before flushing to disk and how they are written and flushed?

Global log object and log buffer

The answer to the first part of the question is log buffer. Server holds very important global object log_sys in memory. It contains a lot of useful information about logging state. Log buffer is pointed by log_sys->buf pointer which is initialized in log_init(). I would highlight the following log_sys fields that are used for work with log buffer and flushing:

log_sys->buf_size – the size of log buffer, can be set with innodb-log-buffer-size variable, the default value is 8M;
log_sys->buf_free – the offset from which the next log record will be written;
log_sys->max_buf_free – if log_sys->buf_free is greater then this value log buffer must be flushed, see log_free_check();
log_sys->buf_next_to_write – the offset of the next log record to write to disk;
log_sys->write_lsn – the LSN up to which log is written;
log_sys->flushed_to_disk_lsn – the LSN up to which log is written and flushed;
log_sys->lsn – the last LSN in log buffer;

So log_sys->buf_next_to_write is between 0 and log_sys->buf_free, log_sys->write_lsn is equal or less log_sys->lsn, log_sys->flushed_to_disk_lsn is less or equal to log_sys->write_lsn.

The relationships for those fields can be easily traced with debugged by setting up watchpoints.

Ok, we have log buffer, but how do log records come to this buffer?

Where log records come from?

Innodb has special objects that allow you to gather redo log records for some operations in one bunch before writing them to log buffer. These objects are called “mini-transactions” and corresponding functions and data types have “mtr” prefix in the code. The objects itself are described in mtr_t “c” structure. The most interesting fields of this structure are the following:

mtr_t::log – contains log records for the mini-transaction,
mtr_t::memo – contains pointers to pages which are changed or locked by the mini-transaction, it is used to push pages to flush list and release locks after logs records are copied to log buffer in mtr_commit() (see mtr_memo_pop_all() called in mtr_commit()).

mtr_start() function initializes an object of mtr_t type and mtr_commit() writes log records from mtr_t::log to log_sys->buf + log_sys->buf_free. So the typical sequence of any operation which changes data is the following:


mtr_start(); // initialize mtr object
some_ops... // operations on data which are logged in mtr_t::log
mtr_commit(); // write logged operations from mtr_t::log to log buffer log_sys->buf

page_cur_insert_rec_write_log() is a good example of how mtr records can be written and mtr::memo can be filled. The low-level function which writes data to log buffer is log_write_low(). This function is invoked inside of mtr_commit() and not only copy the log records from mtr_t object to log buffer log_sys->buf but also creates a new log blocks inside of log_sys->buf, fills their header, trailer, calculates checksum.

So log buffer contains log blocks which are sequentially filled with log records which are grouped in “mini-transactions” which logically can be treated as some logical operation over data which consists of a sequence of mini-operations(log records).

As log records are written sequentially in log buffer one mini-transaction and even one log record can be written in two neighbour blocks. That is why the header field which would contain the offset of the first MTR in the block is necessary to calculate the point from which log records parsing can be started. This field was described in 2.2.

So we have a buffer of log blocks in a memory. How is data from this buffer written to disk? The mysql documentation says that this depends on innodb_flush_log_at_trx_commit option. There can be three cases depending on the value of this option. Let’s consider each of them.

Writing log buffer to disk: innodb_flush_log_at_trx_commit is 1 or 2.

The first two cases is when innodb_flush_log_at_trx_commit is 1 or 2. In these cases flush log records are written for 2 and flushed for 1 on each transaction commit. If innodb_flush_log_at_trx_commit is 2 log records are flushed periodically by special thread which will be considered later. The low-level function which writes log records from buffer to file is log_group_write_buf(). But in the most cases it is not called directly but it is called from more high level log_write_up_to(). For the current case the calling stack is the following:


(trx_commit_in_memory() or
trx_commit_complete_for_mysql() or
trx_prepare() e.t.c)->
trx_flush_log_if_needed()->
trx_flush_log_if_needed_low()->
log_write_up_to()->
log_group_write_buf().

It is quite easy to find the higher levels of calling stack, just set up breakpoint on log_group_write_buf() and execute any sql query that modifies innodb data. For example for the simple “insert” sql query the higher levels of calling stack are the following:


mysql_execute_command()->
trans_commit_stmt()->
ha_commit_trans()->
TC_LOG_DUMMY::commit()->
ha_commit_low()->
innobase_commit()->
trx_commit_complete_for_mysql()->
trx_flush_log_if_needed()-> ... .

log_io_complete() callback is invoked when i/o is finished for log files (see fil_aio_wait()). log_io_complete() flushes log files if this is not forbidden by innodb_flush_method or innodb_flush_log_at_trx_commit options.

Writing log buffer to disk: innodb_flush_log_at_trx_commit is equal to 0

The third case is when innodb_flush_log_at_trx_commit is equal to 0. For this case log buffer is NOT written to disk on transaction commit, it is written and flushed periodically by separate thread “srv_master_thread”. If innodb_flush_log_at_trx_commit = 0 log files are flushed in the same thread by the same calls. The calling stack is the following:


srv_master_thread()->
(srv_master_do_active_tasks() or srv_master_do_idle_tasks() or srv_master_do_shutdown_tasks())->
srv_sync_log_buffer_in_background()->
log_buffer_sync_in_background()->log_write_up_to()->... .

Special cases for logs flushing

While log_io_complete() do flushing depending on innodb_flush_log_at_trx_commit value among others log_write_up_to() has it’s own flushing criteria. This is flush_to_disk function argument. So it is possible to force log files flushing even if innodb_flush_log_at_trx_commit = 0. Here are examples of such cases:

1) buf_flush_write_block_low()
Each page contains information about the last applied LSN(buf_flush_write_block_low::newest_modification), each log record is a description of change on certain page. Imagine we flushed some changed pages but log records for these pages were not flushed and server goes down. After starting the server some pages will have the newest modifications, but some of them were not flushed and the correspondent log records are lost too. We will have inconsistent database in this case. That is why log records must be flushed before the pages they refer.

2) srv_sync_log_buffer_in_background()
As it was described above this function is called periodically by special thread and forces flushing.

3) log_checkpoint()
When checkpoint is made log files must be reliably flushed.

4) The special handlerton innobase_flush_logs() which can be called through ha_flush_logs() from mysql server.
For example ha_flush_logs() is called from MYSQL_BIN_LOG::reset_logs() when “RESET MASTER” or “RESET SLAVE” are executed.

5) srv_master_do_shutdown_tasks() – on shutdown, ha_innobase::create() – on table creating, ha_innobase::delete_table() – on table removing, innobase_drop_database() – on all database tables removing, innobase_rename_table() – on table rename e.t.c

If log files are treated as circular buffer what happens when the buffer is overflown?

Briefly. Innodb has a mechanism which allows you to avoid overflowing. It is called “checkpoints.” The checkpoint is a state when log files are synchronized with data files. In this case there is no need to keep the history of changes before checkpoint because all pages with the last modifications LSN less or equal to checkpoint LSN are flushed and the log files space from the last written LSN to the last checkpoint LSN can be reused. We will not describe a checkpoint process here because it is a separate interesting subject. The only thing we need to know is when checkpoint happens all pages with modification LSN less or equal to checkpoint LSN are reliably flushed.

How archived logs are written by server.

So the log contains information about page changes. But as we said, log files are the circular buffer. This means that they occupy fixed disk size and the oldest records can be rewritten by the newest ones as there are points when data files are synchronized with log files called checkpoints and there is no need to store the previous history of log records to guarantee database consistency. The idea is to save somewhere all log records to have the possibility of applying them to backuped data to have some kind if incremental backup. For example if we want to have an archive of log records. As log consists of log files it is reasonable to store log records in such files too, and these files are called “archived logs.”

Archived log files are written to the directory which can be set with special innodb option. Each file has the same size as innodb log size and the suffix of each archived file is the LSN from which it is started.

As well as log writing system log archiving system stores its data in global log_sys object. Here are the most valuable fields in log_sys from my point of view:

log_sys->archive_buf, log_sys->archive_buf_size – logs archive buffer and its size, log records are copied from log buffer log_sys->buf to this buffer before writing to disk;
log_sys->archiving_phase – the current phase of log archiving: LOG_ARCHIVE_READ when log records are being copied from log_sys->buf to log_sys->archive_buf, LOG_ARCHIVE_WRITE when log_sys->archive_buf is being written to disk;
log_sys->archived_lsn – the LSN to which log files are written;
log_sys->next_archived_lsn – the LSN to which write operations was invoked but not yet finished;
log_sys->max_archived_lsn_age – the maximum difference between log_sys->lsn and log_sys->archived_lsn, if this difference exceeds the log are being archived synchronously, i.e. the difference is decreased;
log_sys->archive_lock – this is rw-lock which is used for synchronizing LOG_ARCHIVE_WRITE and LOG_ARCHIVE_READ phases, it is x-locked on LOG_ARCHIVE_WRITE phase.

So how is data copied from log_sys->buf to log_sys->archived_buf? log_archive_do() is used for this. It is not only set the proper state for archived log fields in log_sys but also invokes log_group_read_log_seg() with corresponding arguments which not only copy data from log buffer to archived log buffer but also invokes asynchronous write operation for archived log buffer. log_archive_do() can wait until io operations are finished using log_sys->archive_lock if corresponding function parameter is set.

The main question is on what circumstances log_archive_do() is invoked, i.e. when log records are being written to archived log files. The first call stack is the following:


log_free_check()->
log_check_margins()->
log_archive_margin()->
log_archive_do().

Here is text of log_free_check() with comments:


/*********************************************************************//
Checks if there is need for a log buffer flush or a new checkpoint, and does
this if yes. Any database operation should call this when it has modified
more than about 4 pages. NOTE that this function may only be called when the
OS thread owns no synchronization objects except the dictionary mutex. */
UNIV_INLINE
void
log_free_check(void)
/*================*/
{

#ifdef UNIV_SYNC_DEBUG
ut_ad(sync_thread_levels_empty_except_dict());
#endif /* UNIV_SYNC_DEBUG */

if (log_sys->check_flush_or_checkpoint) {

log_check_margins();
}
}

log_sys->check_flush_or_checkpoint is set when there is no enough free space in log buffer or it is time to do checkpoint or any other bound case. log_archive_margin() is invoked only if the limit if the difference between log_sys->lsn and log_sys->archived_lsn is exceeded. Let’s refer to this difference as archived lsn age.

One more call log_archive_do() is from log_open() when archived lsn age exceeds some limit. log_open() is called on each mtr_commit(). And for this case archived logs are written synchronously.

The next synchronous call is from log_archive_all() during shutdown.

Summarizing all above archived logs begins to be written when the log buffer is full enough to be written or when checkpoint happens or when the server is in the process of shut down. And there is no any delay between writing to archive log buffer and writing to disk. I mean there is no way to say that archived logs must be written once a second as it is possible for redo logs with innodb_flush_log_at_trx_commit = 0. As soon as data is copied to the buffer the write operation is invoked immediately for this buffer. Archived log buffer is not filled on each mtr_commit() so it does not slow down the usual logging process. The exception is when there are a lot of io operations what can be the reason of archive log age is too big. The result of big archive log age is the synchronous archived logs writing during mtr_commit(). Memory to memory copying is quite fast operation that is why the data is copied to archived log buffer and is written to disk asynchronously minimizing delays which can be caused by logs archiving.

PS: Here is another call stack for writing archived log buffer to archived log files:

log_io_complete()->log_io_complete_archive()->log_archive_check_completion_low()->log_archive_groups().

I propose to explore this stack yourself.

Logs recovery process, how it is started and works inside. Archived logs applying.

So we discovered how innodb redo logging works, and how redo logs are archived. And the last uncovered thing is how recovery works and how archived logs are applied. These two processes are very similar – that is why they are discussed in one section of this post.

The story begins with innobase_start_or_create_for_mysql() which is invoked from innobase_init(). The following trident in innobase_start_or_create_for_mysql() can be used to search the relevant code:


if (create_new_db) {
...
} else if (srv_archive_recovery) {
...
} else {
...
}

The second condition and the last one is the place from which archived logs applying and innodb logs recovery processes correspondingly start. These two blocks wrap two pairs of functions:


recv_recovery_from_archive_start()
recv_recovery_from_archive_finish()

and

recv_recovery_from_checkpoint_start()
recv_recovery_from_checkpoint_finish()

And all the magic happens in these pairs. As well as global log_sys object for redo logging there is global recv_sys object for innodb recovery and archived logs applying. It is created and initialized in recv_sys_create() and recv_sys_init() functions correspondingly. The following fields if recv_sys object are the most important from my point:

recv_sys->limit_lsn – the LSN up to which recovery should be made, this value is initialized with the maximum value of uint64_t(see #define LSN_MAX) for the recovery process and with certain value which is passed as an argument of recv_recovery_from_archive_start() function and can be set via xtrabackup option for log applying;
recv_sys->parse_start_lsn – the LSN from which logs parsing is started, for the the logs recovery this value equals to the last checkpoint LSN, for logs applying this is last applied LSN;
recv_sys->scanned_lsn – the LSN up to which log files are scanned;
recv_sys->recovered_lsn – the LSN up to which log records are applied, this value <= recv_sys->scanned_lsn;

The first thing that must be done for starting recovery process is to find out the point in log files where the recovery must be started from. This is the last checkpoint LSN. recv_find_max_checkpoint() proceed this. As we can see in log_group_checkpoint() the following code writes checkpoint info into two places in the first log file depending on the checkpoint number:


/* We alternate the physical place of the checkpoint info in the first
log file */

 

if ((log_sys->next_checkpoint_no & 1) == 0) {
write_offset = LOG_CHECKPOINT_1;
} else {
write_offset = LOG_CHECKPOINT_2;
}

So recv_find_max_checkpoint() reads checkpoint info from both places and selects the latest checkpoint.

The same idea is applied for logs, too, but the last applied LSN instead of last checkpoint LSN must be found. Here is the call stack for reading last applied LSN:


innobase_start_or_create_for_mysql()->
open_or_create_data_files()->
fil_read_first_page().

The last applied LSN is stored in the first page of data files in (min|max)_flushed_lsn fields(see FIL_PAGE_FILE_FLUSH_LSN offset). These values are written in fil_write_flushed_lsn_to_data_files() function on server shutdown.

So the main difference between logs applying and recovery process at this stage is the manner of calculating LSN from which log records will be read. For logs applying the last flushed LSN is used but for recovery process it is the last checkpoint LSN. Why does this difference take place? Logs can be applied periodically. Assume we gather archived logs and apply them once an hour to have fresh backup. After applying the previous bunch of log files there can be unfinished transactions. For the recovery process any unfinished transactions are rolled back to have consistent db state at server starting. But for the logs applying process there is no need to roll back them because any unfinished transactions can be finished during the next logs applying.

After calculating the start LSN the sequence of actions is the same for both recovering and applying. The next step is reading and parsing log records. See recv_group_scan_log_recs() which is invoked from recv_recovery_from_checkpoint_start_func() for logs recovering and recv_recovery_from_archive_start()->log_group_recover_from_archive_file() for logs applying.

The first we read log records to some buffer and then invoke recv_scan_log_recs() to parse them. recv_scan_log_recs() checks each log block on consistency(checksum + comparing the log block number written in log block with log block number calculated from log block LSN) and other edge cases and copy it to parsing buffer recv_sys->buf with recv_sys_add_to_parsing_buf() function. The parsing buffer is then parsed by recv_parse_log_recs(). Log records are stored in hash table recv_sys->addr_hash. The key for this hash table is calculated basing on space id and page number pair. This pair refers to the page to which log records must be applied. The value of the hash table is object of recv_addr_t type. recv_addr_t type contains rec_list field which is the list of log records for applying to the (space id, page num) page (see recv_add_to_hash_table().

After parsing and storing log record in hash table recv_sys->addr_hash log records are applied. The function which is responsible for log records applying is recv_apply_hashed_log_recs(). It is invoked from recv_scan_log_recs() if there is no enough memory to store log records and at the end of recovering/applying process. For each element of recv_sys->addr_hash, i.e. for each DB page which must be changed with log records recv_recover_page() is invoked. It can be invoked as from recv_apply_hashed_log_recs() in the case if page is already in buffer pool of from buf_page_io_complete() on io completion, i.e. just after page was read from storage. Applying log records on page read completion is necessary and very convenient. Assume log records have not yet applied as we had enough memory to store the whole recovery log records. But we want for example to boot DB dictionary. I this case any records that concern to the pages of the dictionary will be applied to those pages just after reading them from storage to buffer pool.

The function which applies log records to the certain page is recv_recover_page_func(). It gets the list of log records for the certain page from recv_sys->addr_hash hash table, for each element of this list it compares the lsn of last page changes with the LSN of the record, and if the former is greater the later it applies log record to the page.

After applying all log records from archived logs xtrabackup writes last applied LSN to (min|max)_flushed LSN fields of each data file and finishes execution. The logs recovery process rollbacks all unfinished transactions unless this is forbidden with innodb-force-recovery parameter.

Conclusion

We covered the processes of redo logs writing and recovery in depth. These are very important processes as they provide data consistency on crashes. These two processes became a base for logs archiving and applying features. As log records can describe any data changes the idea is to store these records somewhere and then apply them to backups for organizing some kind of incremental backup.

The features were implemented a short time ago and currently they are not widely used. So if you have something to say about them you are welcome to comment for discussion.

The post Innodb redo log archiving appeared first on MySQL Performance Blog.

Percona XtraBackup 2.2.1 alpha release is now available

Percona XtraBackup for MySQL

Percona is glad to announce the release of Percona XtraBackup 2.2.1-alpha1 on March 28th 2014. Downloads are available from our download site here. This ALPHA release, will be available in our Debian experimental and CentOS testing repositories.

This is an ALPHA quality release and it is not intended for production. If you want a high quality, Generally Available release, the current Stable version should be used (Percona XtraBackup 2.1.8 in the 2.1 series at the time of writing).

This release contains all of the features and bug fixes in Percona XtraBackup 2.1.8, plus the following:

New Features:

  • Percona XtraBackup has removed the multiple binaries (xtrabackup_56, xtrabackup_55, xtrabackup) and now uses single xtrabackup binary instead for handling backups. Single binary implementation removed the requirement to download server source tarballs and removed different patches which resulted in cleaner code.
  • Percona XtraBackup source layout has been changed to implement the single binary. Percona XtraBackup code can now be found in storage/innobase/xtrabackup.
  • Percona XtraBackup implemented support for Backup Locks.
  • Percona XtraBackup can now store backup history on the server itself in a special table created for that purpose.
  • innobackupex-1.5.1 symlink has been removed, instead innobackupex binary should be used.
  • Percona XtraBackup has removed the build.sh script and it’s now built with CMake.
  • Percona XtraBackup has been rebased on MySQL 5.6.16.

Bugs Fixed:

  • Information about tool version used to take the backup was added by implementing backup history feature. Bug fixed #1133017.
  • If an XtraDB-based binary was used to a backup an InnoDB database, it would convert it to XtraDB by adding the XTRADB_1 marker in the dictionary header page and by adding the SYS_STATS table. Bug fixed #988310.

Release notes with all the bugfixes for Percona XtraBackup 2.2.1 are available in our online documentation. All of Percona‘s software is open source and free, all the details of the release can be found in the 2.2.1 milestone at Launchpad. Bugs can be reported on the launchpad bug tracker.

The post Percona XtraBackup 2.2.1 alpha release is now available appeared first on MySQL Performance Blog.

ScaleArc: Benchmarking with sysbench

ScaleArc recently hired Percona to perform various tests on its database traffic management product. This post is the outcome of the benchmarks carried out by Uday Sawant (ScaleArc) and myself. You can also download the report directly as a PDF here.

The goal of these benchmarks is to identify the potential overhead of the ScaleArc software itself and the potential benefits of caching. The benchmarks were carried out with the trunk version of sysbench. For this reason, we used a very small set of data, so the measurements will be fast, and it’s known that caching has huge benefits when the queries themselves are rather expensive. We decided that we would rather show that benefit with a real-world application, which is coming later is this series. And if you’re in the Silicon Valley area, be sure to join us this evening at the first-ever Open Source Appreciation Day – I’d be happy to discuss the findings presented here in this post. Admission is free but due to limited space you should register now. I’ll also be available throughout the Percona Live MySQL Conference and Expo all this week.

sysbench_image1.2

In this summary graph it’s visible that in terms of throughput (read-only benchmark, which is relevant for read mostly applications), ScaleArc doesn’t have any significant overhead, while caching can have potentially huge benefits.

sysbench_image2

The situation is pretty similar with response times. ScaleArc doesn’t add any significant overhead, and caching can mean huge benefit in terms of response time as well.

In case of this particular workload (which is read only sysbench), using caching means a roughly 3x increase in throughput and a roughly 80% drop in response time.

Overall, ScaleArc is a good product in terms of performance and features as well. I would definitely recommend it.

About ScaleArc for MySQL
ScaleArc for MySQL is a software appliance that drops in transparently between applications and databases to improve application availability and performance. It requires no changes to applications or databases and delivers:

  • Instant scale up – transparent connection pooling and multiplexing, TTL-based transparent caching, surge protection
  • Transparent scale out – read/write split, load balancing, query routing, sharding
  • Automatic high availability – automatic failover
  • Real-time actionable analytics

Benchmarking setup
The client machines are running the benchmarking software like sysbench in case of these benchmarks.

CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

We used 2 clients. The results of the 2 clients are graphed separately, so it’s visible that they put the same amount of workload on the database or ScaleArc software.

Database machines
CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

Running MySQL Community Edition 5.6.15

MySQL configuration

[mysqld]
   max_allowed_packet = 64M
   thread_cache = 256
   query_cache_size = 0
   query_cache_type = 0
   max_connections = 20020
   max_user_connections = 20000
   max_connect_errors = 99999999
   wait_timeout = 28800
   interactive_timeout = 28800
   log-error=/var/lib/mysql/mysql.err
   back_log=60000
   innodb_buffer_pool_size = 3G
   innodb_additional_mem_pool_size = 16M
   innodb_log_buffer_size = 8M
   innodb_flush_log_at_trx_commit = 0
   innodb_flush_method = O_DIRECT
   innodb_open_files = 2000
   innodb_file_per_table
   innodb_log_file_size=2G
   innodb_log_files_in_group=2
   innodb_purge_threads=1
   innodb_max_purge_lag=0
   innodb_support_xa=0
   innodb_locks_unsafe_for_binlog = 1
   innodb_buffer_pool_instances=8
   sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

The buffer pool of the database is intentionally small, so it’s easy to generate a disk-bound workload.

Please note that the following settings are not recommended in production.

innodb_support_xa=0
   innodb_locks_unsafe_for_binlog = 1

 

We used these settings to drive the node to its peak performance, avoiding any possible overhead which might be required on a production system. In typical production settings, these are not set, and binary logging is enabled, which potentially reduces ScaleArc’s overhead further.

ScaleArc software appliances
CPU: 1 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

The machines were running ScaleArc for MySQL 3.0.

Network
The machines were connected using 10G connections.

Measurements
All of the measurements were done with a very small database that completely fits in memory.

--oltp-table-size=10000
  --oltp-tables-count=64

 

In these benchmarks, we expected both the database and ScaleArc to be CPU bound. In case of a disk-bound workload, ScaleArc would shine even more than in this benchmark. If the queries are more expensive (they have to hit storage), the overhead in % is smaller, and in case of caching the query-by-query benefit is bigger.

We measured 3 different setups, both on read-only and read-write cases. These are the following.

  • Direct connection to the database.
  • Connection to the database through ScaleArc, where ScaleArc only acts as a pass-through filter (since it’s a load balancer that speaks the MySQL wire protocol, all the mechanics for that are still in place). Please note that this setup doesn’t make sense in real life. The purpose of this setup is to show the potential overhead of using ScaleArc and uncovering potential limitations of the ScaleArc software itself.
  • Connection to database through ScaleArc, where ScaleArc is allowed to cache. Caching in ScaleArc is TTL (Time To Live)-based caching, meaning that a read query’s results are cached in ScaleArc. If that read query is seen again before expiring, the query is not run again on the database server but rather served from the cache. Once the timer for the cached query expires, the query will be issued on the database again. Caching of course only works for reads, which are not in an explicit transaction (autocommit is on and no START TRANSACTION is issued). Because of that, we used –oltp-skip-trx during cached benchmarks (read-only case). In case of these benchmarks, the TTL was 1 hour, because we wanted to saturate the ScaleArc software while serving cached queries. An 1 hour TTL might be unrealistic for some applications, while for other applications even an 1 day TTL is something they can live with for some queries. In this case, we wanted to measure the cache’s performance, so we wanted the queries to be cached during the entire benchmark run to show the potential gain even in case of very small queries.

TTL-based caching
It’s important to note that the cache’s expiration is controlled by a TTL value – there is no other invalidation, so it’s possible to read stale data when the query results is changed, but the cache is not expired. Reading stale data alone is ok for most applications, it can happen with a regular, asynchronous slave if it’s lagging behind the master (and it always lagging behind somewhat). Otherwise, the cache is pretty similar to MySQL’s query cache, which doesn’t suffer from the stale read problem, but it has a coarse invalidation (if a table is written, the cache entries belonging to the given table are flushed). While the cache is flushed, the query cache mutex is held, which blocks reads even. Because of the mutex, the built-in query cache is a very usual performance bottleneck. ScaleArc’s cache doesn’t suffer from this.

It’s important to note that ScaleArc caches nothing by default. Also, there are other ways to invlidate cache entries apart from waiting for the TTL to expire.

  • API Call based invalidation (you can clear the cache for an entire query pattern rule with one API call)
  • Query comment based invalidation (you can put a comment /*wipe*/ before a query and wipe and refresh the cache)
  • Cache Bypass (you can send a comment /*nocache*/ and bypass the cache for that specific query)

Read-only
Sysbench throughput

sysbench_image6

In the lower region of threads (up to 32), we see that the TPS value significantly drops in case of going through ScaleArc. That’s nothing to be surprised about, the reason for that is network roundtrips. Because ScaleArc is a software appliance, it adds a hop between the database and the application, which introduces latency. If the number of threads is higher (32 and up), this starts to matter less and less, and performance is almost identical which is very impressive. It means that around the optimal degree of parallelism for these machines, ScaleArc introduces very little (barely measurable) overhead.

Sysbench response time
sysbench_image7
This graph contains the response times belonging to the previous benchmarks. This is really hard to read because at 4096 threads, the system is overloaded, and the response time is much more than in the maximum throughput region. Because it’s multiple orders of magnitude higher, the interesting response times are not readable from this graph.

sysbench_image8

The following graph is the same as above, except that the y axis is limited to 250 ms, so the region which is not visible on the graph above is visible here. What we see there regarding the overhead is pretty much the same as we saw in case of the throughput graph, which means that ScaleArc by itself introduces immeasurably low latency (which explains the difference in cases when parallelism is low). Usually applications which are utilizing the database server are using significantly more than one thread (in MySQL a single query always uses a single thread, in other words there is no intra-query parallelism). The latency from 32 threads above is actually somewhat lower when going through ScaleArc (the exact tipping point can be different here based on the number of CPUs). The reason for that is ScaleArc itself uses an event loop to connect to MySQL, so at a high concurrency, and can schedule sending the traffic to MySQL differently. This only matters when otherwise the MySQL server is saturated CPU-wise.

CPU utilization

sysbench_image9

Last but not least, this graph contains the CPU utilization of the different setups. The left-hand side shows the CPU utilization when connecting directly to the database, and the right-hand side shows connecting through ScaleArc. In both cases, the database server’s CPU is the bottleneck. It’s visible that the client node’s CPU is more than 75% idle (only client1 is graphed to improve readability, client2 is practically the same). From 32 threads and up, the blue bar (CPU user%) is relatively high on the database servers, as is the green (CPU sys%). From 64 threads, the idle time is practically 0, until the systems are overloaded. On the right hand side, we can see that ScaleArc at this load still had 50% idle CPU, which means that we could practically do the same benchmark on another set of boxes through the very same ScaleArc, and only then it will be fully utilized. We are talking about 3000 sysbench tps here. One more interesting thing to note is the relatively high system time of ibd. This is also because of the way ScaleArc connects to the database (see the previous paragraph).

[  17s] threads: 64, tps: 3001.98, reads/s: 41962.70, writes/s: 0.00, response time: 35.22ms (95%)

 

These threads are from a single client, which means that ScaleArc could keep up with parsing roughly 84000 statements / second with utilizing half of its CPU, which is impressive. Please note that the ScaleArc software in this case was tuned towards this type of workload, which means we had more query processing threads. In case of caching, we will have more cache handler threads.

Effects of caching on read-only workload
Sysbench throughput

The next set of graphs will compare the cases when cache is used and not used.

sysbench_image11

The preceding TPS graph contains reads / second (because we measured with –oltp-skip-trx), so roughly 42000 reads corresponds to roughly 3000 transactions in the earlier setup (14 reads in a transaction). On the left-hand side of the graph, the cached throughput is visible with green – on the right-hand side, the non-cached throughput is visible with red (direct access) and blue (access through ScaleArc as a pass-through filter). It’s visible that caching improves the speed drastically, but when ScaleArc becomes overloaded (8192 client threads, 4096 from each client), the performance becomes somewhat inconsistent, which is understandable considering how few cores ScaleArc was running on. On the graph, the dots are translucent, which means the colors are brighter in the areas that have more samples. Even in the overloaded case, the majority of the samples are in the region of 100k+ reads / second across two clients, which means that the performance degrades very gracefully even under heavy load.

Sysbench response time

sysbench_image12

Like in the case of a non-cached workload, the response times are not too readable because of the very high response times when the systems are overloaded. But from the overloaded response times visible, it seems like using caching doesn’t make response times worse.

sysbench_Image13

Like in the case of non-cached workload, this graph is the zoomed version of the previous one. Here the maximum of the y axis is 100 ms. From this graph, it’s visible that at lower concurrency and at the optimal throughput, caching actually helps response time. This is understandable, since in case of a cache hit, ScaleArc can serve the results, and the client (in our case here sysbench) doesn’t have to go to the database, so a roundtrip and database processing time is spared. It’s also worth mentioning that the data “comes from memory,” it doesn’t matter if we hit the ScaleArc cache of the database. When the ScaleArc cache is used, the response time is lower because the additional roundtrip to the database and potential database work (like parsing SQL) is avoided. This means that caching can have benefits even if the database fits in the buffer pool. The improvement is always subject to the workload – caching helps the most when it can cache relatively expensive queries like aggregations and queries hitting the storage.

CPU utilization

sysbench_image14

Similarly to the previous case, the preceding graph shows CPU utilization of the various components. In case of the cached workload, the client itself is much more utilized (since it gets responses sooner, it has to generate the traffic faster). With this kind of workload, when using only one client, we would hit the client’s CPU as the performance bottleneck. The database is interesting too. With caching, its CPU is barely used. This is because if a query is served from the cache, it never gets to the database, so the database’s CPU utilization will be lower. In other words, using the cache helps to offload the database. If offloading is visible on ScaleArc’s graphs, when caching is used, the CPU on the server hosting ScaleArc is much more utilized. For this benchmark, the ScaleArc software was tuned to handle a cached workload, which means more cache handler threads.

Read-write
For read-write benchmarks, we had to create oltp_nontran.lua, which is the same sysbench benchmark as oltp.lua, except that it does the reads outside of the transaction and does only the writes in transaction, so caching can have an effect on read. The rest of the benchmarking setup is the same as the read-only case.

Sysbench throughput

sysbench_image15

Similarly to the read-only case, at a low concurrency, the overhead of ScaleArc is coming from the additional network roundtrip. At the optimal concurrency, the overhead is barely measurable (the dots are plotted practically on top of each other).

Sysbench response time

sysbench_image16

sysbench_image17

The case is pretty similar with the response times as in the read-only case. Similarly, the second graph is a zoomed version of the first one, which a 250 ms maximum.

CPU utilization

sysbench_image18

The CPU utilization graph shows that in this case, the database server’s CPU is the bottleneck. What is interesting is that ScaleArc is using less CPU than in the read-only case. This is understandable, since a transaction now contains writes as well, which are expensive on the database side, but they are still just statements to route on the ScaleArc side.

Effects of caching on read-write workload
Measuring caching here is interesting because the workload is no longer read-only of mostly reads. We have a very significant amount of writes.

sysbench_image19

For 30k reads, we get 8,5k writes. It’s expected that caching won’t help as much as in the previous case, because writes can’t be cached and while they are in process, the benchmarking threads can’t proceed with reads. Please note that this means that roughly 25% of the traffic is write, a typical application scaling out with additional slaves for reads doesn’t have this kind of read-to-write ratio.

Sysbench throughput

sysbench_image20

The first graph shows that in terms of total throughput, caching still helps.

Sysbench response time

sysbench_image21

sysbench_image22

Similarly to the read-only case, caching also helps response time, because it reduces the time needed for the read part of the workload.

CPU utilization

sysbench_image24

This test really stresses the database server’s CPU when not caching. With caching on, similarly to the read-only case, the client’s workload increases somewhat (but not as much), and the database server’s CPU usage decreases significantly. In the last row, the CPU utilization of ScaleArc shows that although it’s somewhat higher with caching, it’s still not that much higher.

From these tests it’s visible that caching can still be beneficial even if the write ratio is as high as in this test.

Conclusion
Engineering is always about making the right tradeoffs. If one wants features that needs a protocol-level load balancer like ScaleArc, the price should be paid in the overhead of Layer 7 parsing and decision making. ScaleArc’s engineering team did a great job minimizing this overhead. ScaleArc itself is very well tunable for different workload types (if caching is important, ScaleArc can be tuned for caching – if query rewriting, ScaleArc can be tuned for that).

The post ScaleArc: Benchmarking with sysbench appeared first on MySQL Performance Blog.

How to add an existing Percona XtraDB Cluster to Percona ClusterControl

In my last blog post I explained how to use Percona ClusterControl to create a new Percona XtraDB Cluster from scratch. That’s a good option when you want to create a testing environment in just some mouse clicks. In this case I’m going to show you how to add your existing cluster to Percona ClusterControl so you can manage and monitor it on the web interface.

The environment will be pretty similar, we will have UI, CMON and 3 XtraDB Cluster nodes. The cluster should be already running and Percona ClusterControl also installed.

Adding an existing Cluster

The ClusterControl web interface is empty, there are no clusters on it. To add an existing one we need to click on “Add existing Galera Cluster.” (Click on the image for an enlarged view).

 

 

Add Existing ClusterA new form will be shown pretty similar to the one we saw last time when we were creating a new cluster. We can divide the form in two parts. First we need to give information about our Cluster. The info requires is the Linux distribution and version, IP of PXC nodes and MySQL root passwords. Pretty easy:

 

 

 

 

FromIn the second part we have the SSH configuration. There is one pre-requisite, the UI server should be able to connect to all servers using a SSH key. Therefore, our first step is to create a SSH key pair in our UI server and copy the public one to all other servers.

 

 

It is also necessary to add the private key in the web interface. You can do it using the form shown after clicking on “Add Key Pair”:

Add key pair

Once the key is added, we can verify the access:

Access Check

As we can see here, everything works as expected and all servers are reachable by SSH. The parameter “Create shared SSH key” also needs to be enabled. That option will make ClusterControl to create a new SSH key pair on CMON node so this one can also connect to PXC nodes with passwordless SSH.

Now everything is prepared. We can proceed with the deployment. Just click on “Add cluster” and the installation process will start. While the installation is in progress you will see this notification:

 

 

Deployment notification 

 

Clicking on it we can see the progress of the deployment:

Progress

After some minutes our PXC is shown in the Percona ClusterControl UI:

Cluster

Now we can monitor it, get alerts, clone, run backups and everything from the web interface. You can also add multiple clusters and create new ones.

The post How to add an existing Percona XtraDB Cluster to Percona ClusterControl appeared first on MySQL Performance Blog.

Facebook’s Yoshinori Matsunobu on MySQL, WebScaleSQL & Percona Live

Facebook's Yoshinori Matsunobu

Facebook’s Yoshinori Matsunobu

I spoke with Facebook database engineer Yoshinori Matsunobu here at Percona Live 2014 today about a range of topics, including MySQL at Facebook, the company’s recent move to WebScaleSQL, new MySQL flash storage technologies – and why attending the Percona Live MySQL Conference and Expo each year is very important to him.

Facebook engineers are hosting several sessions at this year’s conference and all have been standing room only. That’s not too surprising considering that Facebook, the undisputed king of online social networks, has 1.23 billion monthly active users collectively contributing to an ocean of data-intensive tasks – making the company one of the world’s top MySQL users. And they are sharing what they’ve learned in doing so this week.

You can read my previous post where I interviewed five Facebook MySQL experts (Steaphan Greene, Evan Elias, Shlomo Priymak, Yoshinori Matsunobu and Mark Callaghan) and below is my short video interview with Yoshi, who will lead his third and final session of the conference today at 12:50 p.m. Pacific time titled, “Global Transaction ID at Facebook.”

The post Facebook’s Yoshinori Matsunobu on MySQL, WebScaleSQL & Percona Live appeared first on MySQL Performance Blog.

Optimizing MySQL Performance: Choosing the Right Tool for the Job

Next Wednesday, I will present a webinar about MySQL performance profiling tools that every MySQL DBA should know.

Application performance is a key aspect of ensuring a good experience for your end users. But finding and fixing performance bottlenecks is difficult in the complex systems that define today’s web applications. Having a method and knowing how to use the tools available can significantly reduce the amount of time between problems manifesting and fixes being deployed.

In the webinar, titled “Optimizing MySQL Performance: Choosing the Right Tool for the Job,” we’ll start with the basic top, iostat, and vmstat then move onto advanced tools like GDB, Oprofile, and Strace.

I’m looking forward to this webinar and invite you to join us April 16th at 10 a.m. Pacific time. You can learn more and also register here to reserve your spot. I also invite you to submit questions ahead of time by leaving them in the comments section below. Thanks for reading and see you next Wednesday!

The post Optimizing MySQL Performance: Choosing the Right Tool for the Job appeared first on MySQL Performance Blog.

OpenSSL heartbleed CVE-2014-0160 – Data leaks make my heart bleed

The heartbleed bug was introduced in OpenSSL 1.0.1 and is present in

  • 1.0.1
  • 1.0.1a
  • 1.0.1b
  • 1.0.1c
  • 1.0.1d
  • 1.0.1e
  • 1.0.1f

The bug is not present in 1.0.1g, nor is it present in the 1.0.0 branch nor the 0.9.8 branch of OpenSSL some sources report 1.0.2-beta is also affected by this bug at the time of writing, however it is a beta product and I would really recommend not to use beta quality releases for something as fundamentally important as OpenSSL in production.

The bug itself is within the heartbeat extension of OpenSSL (RFC6520). The bug allows an attacker to leak the memory in up to 64k chunks, this is not to say the data being leaked is limited to 64k as the attacker can continually abuse this bug to leak data, until they are satisfied with what has been recovered.

At worst the attacker can retrieve the private keys, the implications for which is that the attacker now has the keys to decrypt the encrypted data, as such the only way to be 100% certain of protection against this bug is to first update OpenSSL (>= 1.0.1g) and then revoke and regenerate new keys and certificates, expect to see a tirade of revocations and re-issuing of CA certs and the like in the coming days.

So how does this affect you as a MySQL user?

Taking Percona Server as an example, this is linked against OpenSSL, meaning if you want to use TLS for your client connections and/or your replication connections you’re going to need to have openSSL installed.

You can find your version easily via your package manager for example:

  • rpm -q openssl
  • dpkg-query -W openssl

If you’re running a vulnerable installation of OpenSSL an update will be required.

  • update OpenSSL >= 1.0.1g
  1. 1.0.1e-2+deb7u5 is reported as patched on debian,
  2. 1.0.1e-16.el6_5.7 is reported as patched in RedHat and CentOS
  3. 1.0.1e-37.66 changelogs note this has been patched on Amazon AMI
  • shutdown mysqld
  • regenerate keys and certs used by mysql for TLS connections (revoking the old certs if possible to do so)
  • start mysqld

You can read more about the heartbleed bug at heartbleed.com Redhat Bugzilla Mitre CVE filing Ubuntu Security Notice

The post OpenSSL heartbleed CVE-2014-0160 – Data leaks make my heart bleed appeared first on MySQL Performance Blog.

Heartbleed: Separating FAQ From FUD

If you’ve been following this blog (my colleague, David Busby, posted about it yesterday) or any tech news outlet in the past few days, you’ve probably seen some mention of the “Heartbleed” vulnerability in certain versions of the OpenSSL library.

So what is ‘Heartbleed’, really?

In short, Heartbleed is an information-leak issue. An attacker can exploit this bug to retrieve the contents of a server’s memory without any need for local access. According to the researchers that discovered it, this can be done without leaving any trace of compromise on the system. In other words, if you’re vulnerable, they can steal your keys and you won’t even notice that they’ve gone missing. I use the word “keys” literally here; by being able to access the contents of the impacted service’s memory, the attacker is able to retrieve, among other things, private encryption keys for SSL/TLS-based services, which means that the attacker would be able to decrypt the communications, impersonate other users (see here, for example, for a session hijacking attack based on this bug), and generally gain access to data which is otherwise believed to be secure. This is a big deal. It isn’t often that bugs have their own dedicated websites and domain names, but this one does: http://www.heartbleed.com

Why is it such a big deal?

One, because it has apparently been in existence since at least 2012. Two, because SSL encryption is widespread across the Internet. And three, because there’s no way to know if your keys have been compromised. The best detection that currently exists for this are some Snort rules, but if you’re not using Snort or some other IDS, then you’re basically in the dark.

What kinds of services are impacted?

Any software that uses the SSL/TLS features of a vulnerable version of OpenSSL. This means Apache, NGINX, Percona Server, MariaDB, the commercial version of MySQL 5.6.6+, Dovecot, Postfix, SSL/TLS VPN software (OpenVPN, for example), instant-messaging clients, and many more. Also, software packages that bundle their own vulnerable version of SSL rather than relying on the system’s version, which might be patched. In other words, it’s probably easier to explain what isn’t affected.

What’s NOT impacted?

SSH does not use SSL/TLS, so you’re OK there. If you downloaded a binary installation of MySQL community from Oracle, you’re also OK, because the community builds use yaSSL, which is not known to be vulnerable to this bug. Obviously, any service which doesn’t use SSL/TLS is not going to be vulnerable, either, since the salient code paths aren’t going to be executed. So, for example, if you don’t use SSL for your MySQL connections, then this bug isn’t going to affect your database server, although it probably still impacts you in other ways (e.g., your web servers).

What about Amazon cloud services?

According to Amazon’s security bulletin on the issue, all vulnerable services have been patched, but they still recommend that you rotate your SSL certificates.

Do I need to upgrade Percona Server, MySQL, NGINX, Apache, or other server software?

No, not unless you built any of these and statically-linked them with a vulnerable version of OpenSSL. This is not common. 99% of affected users can fix this issue by upgrading their OpenSSL libraries and cycling their keys/certificates.

What about client-level tools, like Percona Toolkit or XtraBackup?

Again, no. The client sides of Percona Toolkit, Percona XtraBackup, and Percona Cloud Tools (PCT) are not impacted. Moreover, the PCT website has already been patched. Encrypted backups are still secure.

There are some conflicting reports out there about exactly how much information leakage this bug allows. What’s the deal?

Some of the news articles and blogs claim that with this bug, any piece of the server’s memory can be accessed. Others have stated that the disclosure is limited to memory space owned by processes using a vulnerable version of OpenSSL. As far as we are aware, and as reported in CERT’s Vulnerability Notes Database, the impact of the bug is the latter; i.e., it is NOT possible for an attacker to use this exploit to retrieve arbitrary bits of your server’s memory, only bits of memory from your vulnerable services. That said, your vulnerable services could still leak information that attackers could exploit in other ways.

How do I know if I’m affected?

You can test your web server at http://filippo.io/Heartbleed/ – enter your site’s URL and see what it says. If you have Go installed, you can also get a command-line tool from Github and test from the privacy of your own workstation. There’s also a Python implementation. You can also check the version of OpenSSL that’s installed on your servers. If you’re running OpenSSL 1.0.1 through 1.0.1f or 1.0.2-beta, you’re vulnerable. (Side note here: some distributions, such as RHEL/CentOS, have patched the issue without actually updating the OpenSSL version number; on RPM-based systems you can view the changelog to see if it’s been fixed, for example:

rpm -q --changelog openssl | head -2
* Mon Apr 07 2014 Tomáš Mráz <tmraz@redhat.com> 1.0.1e-16.7
- fix CVE-2014-0160 - information disclosure in TLS heartbeat extension

Also, note that versions of OpenSSL prior to 1.0.1 are not known to be vulnerable. If you’re still unsure, we would be happy to assist you in determining the extent of the issue in your environment and taking any required corrective action. Just give us a call.

How can I know if I’ve been compromised?

If you’ve already been compromised, you won’t. However, if you use Snort as an IDS, you can use some rules developed by Fox-IT to detect and defer new attacks; other IDS/IPS vendors may have similar rule updates available. Without some sort of IDS in place, however, attacks can and will go unnoticed.

Are there any exploits for this currently out there?

Currently, yes, there are some proofs-of-concept floating around out there, although before this week, that was uncertain. But given that this is likely a 2-year-old bug, I would be quite surprised if someone, somewhere (you came to my talk at Percona Live last week, didn’t you? Remember what I said about assuming that you’re already owned?) didn’t have a solid, viable exploit.

So, then, what should I do?

Ubuntu, RedHat/CentOS, Amazon, and Fedora have already released patched versions of the OpenSSL library. Upgrade now. Do it now, as in right now. If you’ve compiled your own version of OpenSSL from source, upgrade to 1.0.1g or rebuild your existing source with the -DOPENSSL_NO_HEARTBEATS flag.

Once that’s been done, stop any certificate-using services, regenerate the private keys for any services that use SSL (Apache, MySQL, whatever), and then obtain a new SSL certificate from whatever certificate authority you’re using. Once the new certificates are installed, restart your services. You can also, of course, do the private key regeneration and certificate cycling on a separate machine, and only bring the service down long enough to install the new keys and certificates. Yes, you will need to restart MySQL. Or Apache. Or whatever. Full stop, full start, no SIGHUP (or equivalent).

Unfortunately, that’s still not all. Keeping in mind the nature of this bug, you should also change / reset any user passwords and/or user sessions that were in effect prior to patching your system and recycling your keys. See, for example, the session hijacking exploit referenced above. Also note that Google services were, prior to their patching of the bug, vulnerable to this issue, which means that it’s entirely possible that your Gmail login (or any other Google-related login) could have been compromised.

Can I get away with just upgrading OpenSSL?

NO. At a bare minimum, you will need to restart your services, but in order to really be sure you’ve fixed the problem, you also need to cycle your keys and certificates (and revoke your old ones, if possible). This is the time-consuming part, but since you have no way of knowing whether or not someone has previously compromised your private keys (and you can bet that now that the bug is public, there will be a lot of would-be miscreants out there looking for servers to abuse), the only safe thing to do is cycle them. You wouldn’t leave your locks unchanged after being burgled, would you?

Also note that once you do upgrade OpenSSL, you can get a quick list of the services that need to be restarted by running the following:

sudo lsof | grep ssl | grep DEL

Where can I get help and/or more information?

In addition to the assorted links already mentioned, you can read up on the nuts and bolts of this bug, or various news articles such as this or this. There are a lot of articles out there right now, most of which are characterizing this as a major issue. It is. Test your vulnerability, upgrade your OpenSSL and rotate your private keys and certificates, and then change your passwords.

The post Heartbleed: Separating FAQ From FUD appeared first on MySQL Performance Blog.