Amazon Aurora MySQL Monitoring with Percona Monitoring and Management (PMM)


In this blog post, we’ll review additional Amazon Aurora MySQL monitoring capabilities we’ve added in Percona Monitoring and Management (PMM) 1.7.0. You can see them in action in the MySQL Amazon Aurora Metrics dashboard.

Amazon Aurora MySQL Transaction Commits

This graph looks at the number of commits the Amazon Aurora engine performed, as well as the average commit latency. As you can see from this graph, latency does not always correlate with the number of commits performed and can be quite high in certain situations.

Amazon Aurora MySQL Load

In Percona Monitoring and Management, we often use the concept of “Load” – which roughly corresponds to the number of operations of a type in progress. This graph shows us what statements contribute the most load on the system, as well as what load corresponds to the Amazon Aurora transaction commits (which we observed in the graph before).

Amazon Aurora MySQL Memory Usage


This graph is pretty self-explanatory. It shows how much memory is used by the Amazon Aurora lock manager, as well as the amount of memory Amazon Aurora uses to store the Data Dictionary.

Amazon Aurora MySQL Statement Latency


This graph shows the average latency for the most important types of statements. Latency spikes, as shown in this example, are often indicative of instance overload.

Amazon Aurora MySQL Special Command Counters


Amazon Aurora MySQL allows a number of commands that are not available in standard MySQL. This graph shows the usage of such commands. Regular “unit_test” calls can be seen in the default Amazon Aurora install, and the rest depends on your workload.

Amazon Aurora MySQL Problems


This graph is where you want to see a flat line. It shows different kinds of internal Amazon Aurora MySQL problems, which in normal operation should generally be zero.

I hope you find these Amazon Aurora MySQL monitoring improvements useful. Let us know if there is any other Amazon Aurora information that would be helpful to display!

Update on Percona Platform Lifecycle for Ubuntu “Stable” Versions


This blog post highlights changes to the Percona Platform Lifecycle for Ubuntu “Stable” Versions.

We have recently made some changes to our Percona Platform and Software Lifecycle policy in an effort to more strongly align with upstream Linux distributions. As part of this, we’ve set our timeframe for providing supported builds for Ubuntu “Stable” (non-LTS) releases to nine (9) months. This matches the current Ubuntu distribution upstream policy.

In the future, we will continue to shift as necessary to match the upstream policy specified by Canonical. Along with this, as we did with Debian 9 before, we will only produce 64-bit builds for this platform going forward. It has been our intention for some time to slowly phase out 32-bit builds, as they are rarely downloaded and largely unnecessary on modern systems.

If you have any questions or concerns, please feel free to contact Percona Support or post on our Community Forums.

Troubleshooting MySQL Crashes Webinar: Q&A


In this blog, I will provide answers to the Q&A for the Troubleshooting MySQL Crashes webinar.

First, I want to thank everybody for attending our January 25, 2018, webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I was unable to answer fully during the webinar.

Q: I have the 600-second “Long semaphore wait” assertion failure / crashing issue following DDL queries, sometimes on the master, sometimes just on the slaves. Any hints for troubleshooting these? How can I understand what the semaphore-holding threads are doing?

A: These are the hardest errors to troubleshoot, especially because in some cases (such as long-running CHECK TABLE commands) long semaphore waits can be expected and appropriate behavior. If you see long semaphore waits when performing DDL operations, it makes sense to consider using the pt-online-schema-change or gh-ost utilities. Also, check the list of supported online DDL operations in the MySQL Reference Manual.
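As a hedged illustration, running a DDL change through pt-online-schema-change instead of a direct ALTER might look like this (the database, table and column names are placeholders, not ones from the webinar):

# Validate the change with a dry run first, then repeat with --execute
pt-online-schema-change --alter "ADD COLUMN note VARCHAR(64)" D=mydb,t=mytable --dry-run
pt-online-schema-change --alter "ADD COLUMN note VARCHAR(64)" D=mydb,t=mytable --execute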

But if you want to know how to analyze such messages, let’s check the output from page #17 in the slide deck used in the webinar:

2018-01-19T20:38:43.381127Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
a writer (thread id 139965530261248) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: fffffffff0000000
Last time read locked in file ibuf0ibuf.cc line 3454
Last time write locked in file /mnt/workspace/percona-server-5.7-binaries-release/label_exp/
debian-wheezy-x64/percona-server-5.7.14-8/storage/innobase/btr/btr0btr.cc line 177
2018-01-19T20:38:43.381143Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139965135804160 has waited at buf0buf.cc line 4196 for 321.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4f257d33c0 created in file hash0hash.cc line 353
a writer (thread id 139965345621760) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0buf.cc line 4196
Last time write locked in file ...

The line

--Thread 139970010412800 has waited at ibuf0ibuf.cc line 3454 for 321.00 seconds the semaphore:

shows that a transaction was waiting for a semaphore. The code responsible for this wait is located at line 3454 in the file ibuf0ibuf.cc. I received this crash when I ran Percona Server for MySQL version 5.7.14-8. Therefore, to check what this code is doing, I need to use the Percona Server 5.7.14-8 source code:

sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/ibuf/ibuf0ibuf.cc
...
3454 btr_pcur_open(ibuf->index, ibuf_entry, PAGE_CUR_LE, mode, &pcur, &mtr);
...

A few lines above in the same file, we find the function definition and its comment:

3334 /** Buffer an operation in the insert/delete buffer, instead of doing it
3335 directly to the disk page, if this is possible.
3336 @param[in] mode BTR_MODIFY_PREV or BTR_MODIFY_TREE
3337 @param[in] op operation type
3338 @param[in] no_counter TRUE=use 5.0.3 format; FALSE=allow delete
3339 buffering
3340 @param[in] entry index entry to insert
3341 @param[in] entry_size rec_get_converted_size(index, entry)
3342 @param[in,out] index index where to insert; must not be unique
3343 or clustered
3344 @param[in] page_id page id where to insert
3345 @param[in] page_size page size
3346 @param[in,out] thr query thread
3347 @return DB_SUCCESS, DB_STRONG_FAIL or other error */
3348 static MY_ATTRIBUTE((warn_unused_result))
3349 dberr_t
3350 ibuf_insert_low(
3351 ulint mode,
3352 ibuf_op_t op,
3353 ibool no_counter,
3354 const dtuple_t* entry,
3355 ulint entry_size,
3356 dict_index_t* index,
3357 const page_id_t& page_id,
3358 const page_size_t& page_size,
3359 que_thr_t* thr)
3360 {
...

The first line of the comment gives us an idea that InnoDB is trying to insert data into the change buffer.

Now, let’s check the next line from the error log file:

S-lock on RW-latch at 0x7f4dde2ea310 created in file buf0buf.cc line 1453
sveta@Thinkie:~/mysql_packages/percona-server-5.7.14-8$ vim storage/innobase/buf/buf0buf.cc
...
1446 /* If PFS_SKIP_BUFFER_MUTEX_RWLOCK is defined, skip registration
1447 of buffer block rwlock with performance schema.
1448
1449 If PFS_GROUP_BUFFER_SYNC is defined, skip the registration
1450 since buffer block rwlock will be registered later in
1451 pfs_register_buffer_block(). */
1452
1453 rw_lock_create(PFS_NOT_INSTRUMENTED, &block->lock, SYNC_LEVEL_VARYING);
...

And again let’s check what this function is doing:

1402 /********************************************************************//**
1403 Initializes a buffer control block when the buf_pool is created. */
1404 static
1405 void
1406 buf_block_init(

Even without knowledge of how InnoDB works internally, by reading only these comments I can guess that a thread waits for some global InnoDB lock when it tries to insert data into the change buffer. The solution for this issue could be disabling the change buffer, limiting write concurrency, upgrading, or using a software solution that lets you scale writes.
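As a hedged illustration of the first option, the change buffer can be disabled dynamically (this is standard InnoDB behavior, not something specific to this particular crash):

-- Disable buffering of secondary index changes; 'all' restores the default behavior
SET GLOBAL innodb_change_buffering = 'none';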

Q: Regarding the page cleaner messages: when running the application with standard replication we didn’t get them, but after switching to PXC we started getting them. Is there something particular to PXC that we should look at to help resolve this?

A: Page cleaner messages can be a symptom of a starved IO subsystem. You need to compare the Percona XtraDB Cluster (PXC) and standalone server installations and check how exactly the write load increased.

Q: Hi, I have one question. We have a query that joins on BLOB or TEXT fields, and it is causing system locks and high CPU alerts. Can you please suggest how we can make it work? And can you please send the answer in text form? I missed some of the information.

A: If you are joining on BLOB or TEXT fields, you most likely aren’t using indexes. This means that InnoDB has to perform a full table scan. That increases IO and CPU activity by itself, but it also increases the number of locks that InnoDB has to set to resolve the query. Even if you have partial (prefix) indexes on the BLOB and TEXT columns, mysqld has to compare the full values for the equality check, so it cannot use the index alone to resolve the ON clause. It is best practice to avoid such JOINs. You can use surrogate integer keys, for example; a rough sketch is below.
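As a hypothetical sketch of the surrogate-key idea (table and column names are invented for illustration): store an indexed hash of the long value and join on it, comparing the full value only as a guard against hash collisions.

-- Add an indexed hash column derived from the TEXT column (MySQL 5.7 generated column)
ALTER TABLE articles
  ADD COLUMN body_hash BINARY(20)
    GENERATED ALWAYS AS (UNHEX(SHA1(body))) STORED,
  ADD INDEX idx_body_hash (body_hash);

-- Assuming article_copies has the same generated column, join on the indexed hash;
-- the extra comparison on body protects against hash collisions
SELECT a.id, b.id
FROM articles a
JOIN article_copies b
  ON a.body_hash = b.body_hash AND a.body = b.body;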

Q: Hi, please note that “MySQL server has gone away” is the worst error, in my opinion, and there was no mention of it. Can you share some tips on this? Thank you. By the way, neither the Oracle MySQL nor the Percona Server error log helps with it.

A: The “MySQL server has gone away” error may be the result of a crash. In this case, you need to handle it like any other crash symptom. But in most cases, this is a symptom of network failure. Unfortunately, MySQL doesn’t record much information about why connection failures happen, probably because, from mysqld’s point of view, a problematic network only means that the client unexpectedly disconnected after a timeout, while the client, still waiting for a response, receives “MySQL server has gone away”. I discussed these kinds of errors in my “Troubleshooting hardware resource usage” webinar. When you see this kind of error often, a good practice is not to leave idle connections open for a long time; one way to spot such connections is shown below.
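As a simple, hedged illustration, the standard processlist view can show connections that have been sitting idle for a long time (the 600-second threshold is arbitrary):

-- Connections sleeping for more than 10 minutes; candidates for cleanup or pooling fixes
SELECT id, user, host, db, time
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 600
ORDER BY time DESC;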

Q: I see that a lot of hard investigative work goes into figuring out what might be going wrong. Is there anything on the development roadmap to improve the error log output messages? Could you comment on that?

A: Percona Engineering does a lot for better diagnostics. For example, Percona Server for MySQL has an extended slow log file format, and Percona Server for MySQL 5.7.20 introduced a new innodb_print_lock_wait_timeout_info variable that allows logging information about all InnoDB lock wait timeout errors (see the manual). More importantly, it logs not only the blocked transaction but also the locking transaction. This feature was requested at lp:1657737 by one of our Percona Support customers and is now implemented.
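As a hedged example, enabling it is a one-line dynamic change (check the Percona Server manual for your exact version before relying on this):

-- Log details about both the blocked and the blocking transaction on lock wait timeouts
SET GLOBAL innodb_print_lock_wait_timeout_info = ON;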

The Oracle MySQL Engineering team also does a lot for better error logging. These improvements started in version 5.7.2, when the log_error_verbosity variable was introduced, and version 8.0.4 added much finer-grained control. You can read about it in the Release Notes.
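For example (a minimal illustration, assuming MySQL 5.7.2 or later):

-- 1 = errors only, 2 = errors and warnings, 3 = errors, warnings and notes
SET GLOBAL log_error_verbosity = 3;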

Q: Hello, do you use strace to find out exactly which table has problems when there is no clear information in the MySQL error log?

A: I am not a big fan of strace when debugging mysqld crashes, but Percona Support certainly uses this tool. I myself prefer to work with strace when debugging client issues, such as trying to identify why Percona XtraBackup behaves incorrectly.
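A hedged sketch of that kind of client-side tracing (the backup target directory is a placeholder):

# -f follows forked children, -tt adds microsecond timestamps, -o writes the trace to a file
strace -f -tt -o /tmp/xtrabackup.trace xtrabackup --backup --target-dir=/data/backups/full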

Thanks everybody for attending the webinar. You can find the slides and recording of the webinar at the Troubleshooting MySQL Crashes web page.

This Week in Data with Colin Charles 28: Percona Live, MongoDB Transactions and Spectre/Meltdown Rumble On


Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

In case you missed last week’s column, don’t forget to read the fairly lengthy FOSDEM MySQL & Friends DevRoom summary.

From a Percona Live Santa Clara 2018 standpoint, beyond the tutorials getting picked and scheduled, the talks have also been picked and scheduled (so you were very likely getting acceptance emails from the Hubb.me system by Tuesday). The rejections have not gone out yet but will follow soon. I expect the schedule to go live either today (end of week) or early next week. Cheapest tickets end March 4, so don’t wait to register!

Amazon Relational Database Service has had a lot of improvements in 2017, and the excellent summary from Jeff Barr is worth a read: Amazon Relational Database Service – Looking Back at 2017. Plenty of improvements for the MySQL, MariaDB Server, PostgreSQL and Aurora worlds.

Spectre/Meltdown and its impact are still being discovered. You need to read Brendan Gregg’s amazing post: KPTI/KAISER Meltdown Initial Performance Regressions. And if you visit Percona Live, you’ll see an amazing keynote from him too! Are you still using MyISAM? MyISAM and KPTI – Performance Implications From The Meltdown Fix suggests switching to Aria or InnoDB.

Probably the biggest news this week though? Transactions are coming to MongoDB 4.0. From the site, “MongoDB 4.0 will add support for multi-document transactions, making it the only database to combine the speed, flexibility, and power of the document model with ACID guarantees. Through snapshot isolation, transactions will provide a globally consistent view of data, and enforce all-or-nothing execution to maintain data integrity.”. You want to read the blog post, MongoDB Drops ACID (the title works if you’re an English native speaker, but maybe not quite if you aren’t). The summary diagram was a highlight for me because you can see the building blocks, plus future plans for MongoDB 4.2.

Releases

Link List

Upcoming appearances

  • SCALE16x – Pasadena, California, USA – March 8-11 2018
  • FOSSASIA 2018 – Singapore – March 22-25 2018

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

Why ZFS Affects MySQL Performance


In this blog post, we’ll look at how ZFS affects MySQL performance when the two are used together.

ZFS and MySQL have a lot in common, since they are both transactional software. Both have properties that, by default, favor consistency over performance. By doubling the layers of complexity between the application committing data and the persistent disk, we are logically doubling the amount of work within the whole system and reducing throughput. So, within the ZFS layer, where does the bulk of that extra work really come from?

Consider the comparative test below from a bare metal server. It has a reasonably tuned config (discussed in a separate post; results and scripts are here). These numbers are from sysbench tests on hardware with six SAS drives behind a RAID controller with a write-backed cache. ext4 was configured as software RAID10, while ZFS used the equivalent layout (three pairs of mirrored VDEVs, striped).
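For reference, the update-index workload referred to below can be driven with something like the following sysbench invocation; the parameters here are illustrative, not the exact ones used for these results:

sysbench oltp_update_index \
  --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
  --tables=8 --table-size=1000000 \
  --threads=16 --time=300 --report-interval=1 \
  run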

There are a few obvious observations here, one being that the ZFS results have high variance between the median and the 95th percentile. This indicates regular sharp drops in performance. However, the most glaring thing is that with the write-only update-index workload, overall performance can drop to 50%:

(Charts: sysbench throughput and latency results, ext4 vs. ZFS)

Looking further into the IO metrics for the update-index tests (95th percentile from /proc/diskstats), ZFS’s behavior tells us a few more things.

(Chart: 95th percentile IO metrics for the update-index tests, ext4 vs. ZFS)

  1. ZFS batches writes better, with minimal increases in latency with larger IO size per operation.
  2. ZFS reads are heavily scattered and random – the high response times and low read IOPS and throughput mean significantly more disk seeks.

If we focus on observation #2, there are a number of possible sources of random reads:

  • InnoDB pages that are not in the buffer pool
  • When ZFS records are updated, metadata also has to be read and updated

This means that for updates on cold InnoDB records, multiple random reads are involved that are not present with filesystems like ext4. While ZFS has some tunables for improving synchronous reads, tuning them can be hit and miss when trying to fit specific workloads. For this reason, ZFS introduced the L2ARC, where faster drives are used to cache frequently accessed data and serve reads at low latency.
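As a hedged example, adding an L2ARC device to an existing pool is a single command (the pool and device names are placeholders):

# Attach a fast SSD/NVMe device as a read cache (L2ARC) for the pool
zpool add tank cache /dev/nvme0n1
# The device then shows up under a "cache" section in the pool status
zpool status tank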

In upcoming posts, we’ll look in more detail at how ZFS affects MySQL, at the tests above and the configuration behind them, and at how we can further improve performance from here.

Percona Server 5.5.47-37.7 is now available

Percona is glad to announce the release of Percona Server 5.5.47-37.7 on January 12, 2016. Based on MySQL 5.5.47, including all the bug fixes in it, Percona Server 5.5.47-37.7 is now the current stable release in the 5.5 series.

Percona Server is open-source and free. Details of the release can be found in the 5.5.47-37.7 milestone on Launchpad. Downloads are available here and from the Percona Software Repositories.

Bugs Fixed:

  • Running OPTIMIZE TABLE or ALTER TABLE without the ENGINE clause would silently change table engine if enforce_storage_engine variable was active. This could also result in system tables being changed to incompatible storage engines, breaking server operation. Bug fixed #1488055.

Other bugs fixed: #1179451, #1524763, and #1530102.

Release notes for Percona Server 5.5.47-37.7 are available in our online documentation. Bugs can be reported on the launchpad bug tracker.

Percona Server 5.6.28-76.1 is now available

Percona is glad to announce the release of Percona Server 5.6.28-76.1 on January 12, 2016. Download the latest version from the Percona web site or from the Percona Software Repositories.

Based on MySQL 5.6.28, including all the bug fixes in it, Percona Server 5.6.28-76.1 is the current GA release in the Percona Server 5.6 series. Percona Server is open-source and free – and this is the latest release of our enhanced, drop-in replacement for MySQL. Complete details of this release can be found in the 5.6.28-76.1 milestone on Launchpad.

Bugs Fixed:

  • Clustering secondary index could not be created on a partitioned TokuDB table. Bug fixed #1527730 (DB-720).
  • When enabled, the super-read-only option could break statement-based replication while executing a multi-table update statement on a slave. Bug fixed #1441259.
  • Running OPTIMIZE TABLE or ALTER TABLE without the ENGINE clause would silently change table engine if enforce_storage_engine variable was active. This could also result in system tables being changed to incompatible storage engines, breaking server operation. Bug fixed #1488055.
  • Setting the innodb_sched_priority_purge variable (available only in debug builds) while purge threads were stopped would cause a server crash. Bug fixed #1368552.
  • A small buffer pool size could cause the XtraDB buffer flush thread to spin at 100% CPU. Bug fixed #1433432.
  • Enabling TokuDB with the ps_tokudb_admin script inside a Docker container would cause an error due to insufficient privileges, even when running as root. In order for this script to be usable inside Docker containers, this error has been changed to a warning that the check is impossible. Bug fixed #1520890.
  • InnoDB status would print negative values for spin rounds per wait if the wait count, although accounted as a signed 64-bit integer, did not fit into a signed 32-bit integer. Bug fixed #1527160 (upstream #79703).

Other bugs fixed: #1384595 (upstream #74579), #1384658 (upstream #74619), #1471141 (upstream #77705), #1179451, #1524763 and #1530102.

Release notes for Percona Server 5.6.28-76.1 are available in the online documentation. Please report any bugs on the Launchpad bug tracker.

Prometheus as an Engine for MySQL Monitoring

When I first discovered Graphite years ago, I was very impressed with its monitoring capabilities. Compared to many RRD-based tools that were popular at the time (like Cacti), Graphite separated the captured data and graphs, allowing you to do all kinds of math and transformations while visualizing data. For example, I could plot the relationship between system queries and disk IO, and capture how the system was getting more IO bound over time. It also had reasonably high performance, allowing me to capture high-resolution data for medium-sized systems.

Just last year I discovered Prometheus, and it also impressed me. I think it has the potential to take Graphite’s flexibility to the next level. Though I am in no way a Prometheus expert, I  want to share my understanding and thoughts on it so far.

Data Model

The data model is perhaps what attracted me to Prometheus the most. While it’s not obvious at first, when you do figure it out it has fantastic flexibility.

In the data model used by Whisper and Carbon in Graphite, you will use something like this to store MySQL data:

myapp.store.mysql.db01.status.questions = 5000

You can set up any hierarchy you like, but it has to have a hierarchy.

What Prometheus does instead is allow you to use a set of key-value pairs. The same data shown above could be presented like this:

questions_total{app="myapp", subsystem="store", engine="mysql", host="db01", source="status"} = 5000

(You most likely wouldn’t use this exact structure, but it’s good for illustration.)

The difference between these approaches is that Prometheus gives you multiple dimensions on which you can filter and aggregate, and you can add those dimensions later as you need them (without needing to redesign your tree hierarchy).

These labels are very dynamic, and I can change them in a second. For example, a MySQL server reporting as a “Master” might start reporting as a “Slave” in the very next second, and its data will be aggregated differently.

This is especially important in the modern, often cloud-based and virtualized world. For example, using Prometheus it is very easy to tag servers by their region or availability zones. I can also do things like compute MySQL space usage by both the database and storage engine. The possibilities are endless.
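As a sketch using the illustrative questions_total metric from above (not a real exporter metric name), label-based filtering and aggregation look like this:

# Per-engine query rate, averaged over 5 minutes, across all hosts
sum by (engine) (rate(questions_total[5m]))

# The same metric filtered down to a single host
rate(questions_total{app="myapp", host="db01"}[5m])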

Data Capture

Unlike Graphite – where the main model is push, and the hosts themselves choose what kind of information to push to the monitoring system and at which intervals – with Prometheus you set up “Exporters” that expose the data. It is up to the Prometheus server configuration to choose what data to sample and how frequently.

The clear benefit of Prometheus’ approach is that you can have as many servers as you like pulling the data, so it is very easy to create a development version of your system and play around with it – without affecting production. It also provides a simple pathway to high availability.

(Both the push and pull approaches have their benefits and drawbacks. Brian Brazil wrote an excellent article advertising the pull model of monitoring.)

Prometheus does create a few challenges for me. Unless I want to set up Service Discovery, it is a hassle to monitor any development/test VMs I might spin up (that would otherwise not be open to external access at all). While this isn’t the main use case for Prometheus, it is helpful for me to test the dashboard’s behavior with different operating systems, workloads, etc.

A more significant issue I discovered is dealing with data that can’t be captured from multiple locations, because the act of capturing it changes the data.

Here is a specific example: the events_statements_summary_by_digest table in PERFORMANCE_SCHEMA has a MAX_TIMER_WAIT field that shows the maximum query execution time for each query pattern. If I want to get the maximum query execution time for every minute, for example, I need to truncate the table to reset the statistics so the maximum can be computed again. If I don’t perform that operation, the data becomes meaningless. But if I make the exporter reset the statistics during the poll, I can no longer pull the data from two Prometheus servers.

This is one instance where the combination of the performance schema design and Prometheus’ pull model could be better. I could set up a cron job or an Event to clear out the statistics regularly and get a proper maximum value for every five minutes, but that isn’t an overly convenient solution.
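For illustration, the reset itself and a minimal event-based schedule might look like this (a sketch, assuming the event scheduler is enabled; not necessarily how you would set it up in production):

-- Reset digest statistics so MAX_TIMER_WAIT is recomputed for the next interval
TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;

-- Or schedule it (requires event_scheduler = ON)
CREATE EVENT reset_statement_digests
  ON SCHEDULE EVERY 5 MINUTE
  DO TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;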

Another issue I discovered is that Prometheus doesn’t have any protection from bad (long) samples, or a very good method of detecting them. Let’s imagine that I have a MySQL server and I’m sampling status data every second. For some reason the call to SHOW GLOBAL STATUS took five seconds to execute. The truth is, we don’t really know to which point within those five seconds the SHOW GLOBAL STATUS output corresponds – it might be the very start, it might be the very end. As such, you don’t really know how to process the counters; whatever you do, you’re likely to be wrong. My preference in this case is simply to discard such samples, because even missing one percent of the samples is unlikely to change the whole picture. Constantly questioning whether you really had a couple of seconds where the QPS spiked to ten times the normal rate, or whether it was just an invalid sample, is not something I want to waste a lot of time on!

My preferred approach is to configure the SHOW GLOBAL STATUS capture so that if it takes more than ten percent of the capture interval, it is discarded. For example, with a one-second capture interval I would allow 100ms for the capture. If the system can’t keep up with this, I would rather not fool myself, and instead reduce the capture resolution to around five seconds.
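A rough shell sketch of that rule, assuming GNU coreutils timeout and client credentials in ~/.my.cnf (this is an illustration, not how any particular exporter implements it):

# Allow 100ms (10% of a 1-second interval) for the capture; otherwise discard the sample
if ! timeout 0.1 mysql -e "SHOW GLOBAL STATUS" > /tmp/status.sample 2>/dev/null; then
    echo "sample exceeded 100ms, discarding" >&2
    rm -f /tmp/status.sample
fi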

The only protection Prometheus offers is the scrape_timeout setting, but unfortunately it is limited to one-second resolution at this point. This is too coarse for any high-resolution capture.

Finally, it is also inconvenient to specify different resolutions for different data. In MySQL there is often a lot of data that I want to capture, but the resolution needed differs. For example, SHOW GLOBAL STATUS at one-second resolution is a must. At the same time, capturing table size information from INFORMATION_SCHEMA with one-second resolution would put too much load on MySQL, especially if there are a lot of tables, and that level of resolution isn’t really needed in this case.

An attractive thing about Prometheus is that the Prometheus development team uses it a lot for MySQL monitoring, so the MySQL Exporter is really good. Most MySQL monitoring plugins I find resort to reporting just a few basic statistics, which is not nearly enough for advanced diagnostics. The Prometheus MySQL exporter captures a ton of data and has been adding more in every version.

I also very much like that the Prometheus Exporters are built on the HTTP protocol. This means it is very easy to debug or see what kind of data they capture: you can view it with a plain web browser:

(Screenshot: exporter metrics page viewed in a web browser)

Computational Model

I think the basic operations in Prometheus are pretty intuitive, but if you look at some of the advanced behaviors you’re going to find some inconveniences and some things that are likely to surprise you.

One inconvenience is that Prometheus is mainly designed for working with high-resolution data. If there are holes of more than five minutes (by default) in a time series, it can disappear from the graphs. As I mentioned, for MySQL there is quite a lot of information that makes sense to capture at a resolution lower than every five minutes.

Prometheus functions look into the “past,” and are designed so that the value of a function at any time T, once it can be computed, is not going to change. It all looks clean and logical, but it causes issues when there are holes in the data capture.

As an example, let’s imagine the following five seconds, where the counter is the total number of questions since the start, and some seconds were scraped successfully while others were not (due to a network issue, overload, etc.):

1  –  10
2  –  20
3  –  X
4  –  X
5  –  200

When we capture data of the “counter” type, the most important value is not the actual counter value at a given time but the rate of change of the counter over different time intervals. In this example, the query rate was ten QPS for the one-to-two-second interval, which can be computed directly. But what was the query rate in the three-to-four-second interval? We don’t have exact data, but that is fine: we know there were 180 queries during the two-to-five-second interval, giving us 60 QPS (which we can use for the three-to-four-second interval).

This is NOT, however, how Prometheus will compute it if you use the irate() function (which is supposed to give you the highest resolution possible). When you evaluate irate() at T=4, it doesn’t have access to the T=5 value, even if it is in the database. Instead, it will look back, find the matching previous interval (one through two seconds) and use the corresponding value of ten QPS.

I find this pretty puzzling and inconvenient.

There is also the rate() function, which can be used to get the average rate for a period. Unfortunately, it can’t estimate the rate for a smaller period based on the available data for a longer period. So, for example, if I ask the rate() function to compute the query rate at T=4, looking one second back, it will return no data. This isn’t a big deal when you’re working with data manually, but if you’re building zoomable dashboards it means you can zoom in to the point where the data disappears (rather than stopping at the best resolution that still has data).

Storage

Prometheus has its own high-performance storage system, based in part on LevelDB. It is highly optimized for time series and can achieve a very high level of compression. Be ready, though: every label combination creates a separate time series at the low level and requires a lot of files. This isn’t really a problem with SSD drives and modern file systems, but it is something to look out for.

The capture engine and storage systems are rather efficient. Even though Prometheus does not have built in clustering for “scaling out,” you can reportedly get more than 300K metrics per second captured on a single node. You can also use multiple Prometheus servers as needed.

The problem I found with Prometheus’ storage is that it is very self-contained: you can only use it from Prometheus or access it through the HTTP API. There are no tools at this point to export it for advanced analysis with R, or to dump the whole database into something like JSON format so it can be loaded into a different database engine. Some of these features might already be on the roadmap.

Purging and Aggregation

Retention configuration in Prometheus is pretty spartan. You can set storage.local.retention to the length of time you want to keep the data, but that’s it. You can’t configure it to purge different data at different times. You can run multiple Prometheus instances to achieve this, but it’s quite a hassle. It’s also not possible to instruct Prometheus to automatically build summaries in order to execute low-resolution queries faster.

For example, if I have MySQL’s query rate captured every second but want to view the data over a long time period (e.g., how it changed over the last three months to estimate growth trends), data aggregated at one-hour intervals would be enough for that purpose.

There is support for recording rules to help achieve some of this, but it is not explicit or convenient in my opinion.
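For example, a recording rule could precompute an hourly rate so that long-range dashboards don’t scan raw one-second samples. A sketch in the pre-2.0 rule-file syntax used by the 0.16 series, reusing the illustrative questions_total metric from earlier (treat the exact syntax as an assumption and check the documentation for your version):

# Precompute the 1-hour query rate for long-range queries
instance:questions:rate1h = rate(questions_total[1h])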

Looking at the Prometheus roadmap, some of these issues might not be fixed in Prometheus but achieved through integrating other systems such as InfluxDB (where experimental support already exists).

Purpose

A lot of these limitations make sense if you look at the purpose for which Prometheus was created: getting high-resolution data and being able to provide as much troubleshooting information as possible to its Alerting engine. It is not really designed for storing extensive history. Too bad! I would very much like to get both of those properties in a single system!

Visualization

Prometheus comes with a built-in Expression Browser, which is great for debugging and interactive analysis. It also lets you see what data you actually have in the database. It will disappoint you, however, if you’re looking for beautiful graphs!

(Screenshot: Prometheus Expression Browser output)

This shows I have the information about MySQL query rate from two servers, as well as the available and configured labels.

If I want to pick one server and look at the average rate of queries per five minutes, I can do this:

(Screenshot: Expression Browser graph using rate())

There are some tools available in the graph to choose the time range and resolution.

You should be aware that visualizing data with the rate() function often shows you things that do not exist. In this case, it looks like the number of queries was gradually creeping up. In reality, I had just started the benchmark, so the number of queries jumped almost immediately. This is what the real situation looks like (using irate()):

(Screenshot: the same graph using irate())

As I explained before, irate() does not handle missing data points very well, plus it behaves somewhat bizarrely when you “zoom out” – providing instant rate information at sparse intervals (e.g., the instant rate computed every one second over 60 seconds) rather than smoothing things to averages.

There is also the PromDash tool available for Prometheus, which gives you nicer looking dashboards and supports a lot of Prometheus’ features. Now that Grafana has official support for Prometheus, it is my preferred tool to build dashboards – especially since it supports multiple data sources besides Prometheus.

Summary

I’m very excited about Prometheus. It allows me to get a lot of information easily and use it for performance analysis in benchmarking or troubleshooting. It would be great if it also had a simple integrated solution for long-term storage and trending. I am also looking forward to better integration with Grafana and better documentation on how to create Prometheus-based dashboards – especially with some Prometheus-based examples!

Note: all of the above was written about Prometheus 0.16.1. Prometheus is rapidly evolving, and details may change in newer versions.

OpenSSH CVE-2016-0777: Details and Mitigation


Earlier today, advisories were sent out regarding OpenSSH versions 5.4 through 7.1, informing users about a security bug in the software. In essence, the advisory instructed people to add the UseRoaming no option to their ssh_config file, with a promise of further information to be made available shortly.

 

The post on the security issue at OpenBSD Journal can be seen here: http://undeadly.org/cgi?action=article&sid=20160114142733

Further information detailing the issue and its implications was later released here: http://www.openssh.com/txt/release-7.1p2

The statement below summarized the main issue:

“The matching server code has never been shipped, but the client code was enabled by default and could be tricked by a malicious server into leaking client memory to the server, including private client user keys.”

So what does this all mean? Simply speaking, it means a malicious or compromised server could potentially retrieve the user’s private SSH keys from memory. The stolen keys could then be used to authenticate against other servers.

(2FA helps to protect servers from the use of stolen keys, however this is not in as widespread use as it should be.)

The short summary: pending an update to the software, you can use the following mitigation options to protect yourself:

  1. In your ~/.ssh/config:

     Host *
         UseRoaming no

  2. In the system-wide ssh_config (Linux: /etc/ssh/ssh_config, OS X: /private/etc/ssh/ssh_config), add the same UseRoaming no option.
  3. On each CLI execution:

     ssh -oUseRoaming=no <hostname>

        

Personally, I’ve used a combination of 1 and 2, as a ~/.ssh/config cleanup is often required anyway. Make sure to check that your OpenSSH is correctly configured, and keep watching for updates.

Making Apache Spark Four Times Faster

This is a follow-up to my previous post, Apache Spark with Air ontime performance data.

To recap an interesting point in that post: when using 48 cores on the server, the result was worse than with 12 cores. I wanted to understand why this was the case, so I started digging. My primary suspicion was that Java (I never trust Java) does not handle 100GB of memory well.

There are a few links pointing to potential issues with a huge heap:

http://stackoverflow.com/questions/214362/java-very-large-heap-sizes
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/

Following the last article’s advice, I ran four instances of Spark’s slaves. This is an old technique to better utilize resources, as often (as is well known from old MySQL times) one instance doesn’t scale well.

I added the following to the config:

export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=12
export SPARK_WORKER_MEMORY=25g

The full description of the test can be found in my previous post, Apache Spark with Air ontime performance data.

The results:
(Chart: benchmark results with four Spark worker instances)

Although the results for four instances still don’t scale much after using 12 cores, at least there is no extra penalty for using more.

It could be that the dataset is just not big enough to show the setup’s full potential.

I think this is a clear indication that Java performs much better with a 25GB heap than with 100GB – at least with Oracle’s JDK (there are comments suggesting that a third-party commercial JDK may handle this better).

This is something to keep in mind when running Java-based servers (like Apache Spark) on high-end hardware.