ScaleArc recently hired Percona to perform various tests on its database traffic management product. This post is the outcome of the benchmarks carried out by Uday Sawant (ScaleArc) and myself. You can also download the report directly as a PDF here.
The goal of these benchmarks is to identify the potential overhead of the ScaleArc software itself and the potential benefits of caching. The benchmarks were carried out with the trunk version of sysbench. For this reason, we used a very small set of data, so the measurements will be fast, and it’s known that caching has huge benefits when the queries themselves are rather expensive. We decided that we would rather show that benefit with a real-world application, which is coming later is this series. And if you’re in the Silicon Valley area, be sure to join us this evening at the first-ever Open Source Appreciation Day – I’d be happy to discuss the findings presented here in this post. Admission is free but due to limited space you should register now. I’ll also be available throughout the Percona Live MySQL Conference and Expo all this week.
In this summary graph it’s visible that in terms of throughput (read-only benchmark, which is relevant for read mostly applications), ScaleArc doesn’t have any significant overhead, while caching can have potentially huge benefits.
The situation is pretty similar with response times. ScaleArc doesn’t add any significant overhead, and caching can mean huge benefit in terms of response time as well.
In case of this particular workload (which is read only sysbench), using caching means a roughly 3x increase in throughput and a roughly 80% drop in response time.
Overall, ScaleArc is a good product in terms of performance and features as well. I would definitely recommend it.
About ScaleArc for MySQL
ScaleArc for MySQL is a software appliance that drops in transparently between applications and databases to improve application availability and performance. It requires no changes to applications or databases and delivers:
- Instant scale up – transparent connection pooling and multiplexing, TTL-based transparent caching, surge protection
- Transparent scale out – read/write split, load balancing, query routing, sharding
- Automatic high availability – automatic failover
- Real-time actionable analytics
The client machines are running the benchmarking software like sysbench in case of these benchmarks.
CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
We used 2 clients. The results of the 2 clients are graphed separately, so it’s visible that they put the same amount of workload on the database or ScaleArc software.
CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Running MySQL Community Edition 5.6.15
[mysqld] max_allowed_packet = 64M thread_cache = 256 query_cache_size = 0 query_cache_type = 0 max_connections = 20020 max_user_connections = 20000 max_connect_errors = 99999999 wait_timeout = 28800 interactive_timeout = 28800 log-error=/var/lib/mysql/mysql.err back_log=60000 innodb_buffer_pool_size = 3G innodb_additional_mem_pool_size = 16M innodb_log_buffer_size = 8M innodb_flush_log_at_trx_commit = 0 innodb_flush_method = O_DIRECT innodb_open_files = 2000 innodb_file_per_table innodb_log_file_size=2G innodb_log_files_in_group=2 innodb_purge_threads=1 innodb_max_purge_lag=0 innodb_support_xa=0 innodb_locks_unsafe_for_binlog = 1 innodb_buffer_pool_instances=8 sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
The buffer pool of the database is intentionally small, so it’s easy to generate a disk-bound workload.
Please note that the following settings are not recommended in production.
innodb_support_xa=0 innodb_locks_unsafe_for_binlog = 1
We used these settings to drive the node to its peak performance, avoiding any possible overhead which might be required on a production system. In typical production settings, these are not set, and binary logging is enabled, which potentially reduces ScaleArc’s overhead further.
ScaleArc software appliances
CPU: 1 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
The machines were running ScaleArc for MySQL 3.0.
The machines were connected using 10G connections.
All of the measurements were done with a very small database that completely fits in memory.
In these benchmarks, we expected both the database and ScaleArc to be CPU bound. In case of a disk-bound workload, ScaleArc would shine even more than in this benchmark. If the queries are more expensive (they have to hit storage), the overhead in % is smaller, and in case of caching the query-by-query benefit is bigger.
We measured 3 different setups, both on read-only and read-write cases. These are the following.
- Direct connection to the database.
- Connection to the database through ScaleArc, where ScaleArc only acts as a pass-through filter (since it’s a load balancer that speaks the MySQL wire protocol, all the mechanics for that are still in place). Please note that this setup doesn’t make sense in real life. The purpose of this setup is to show the potential overhead of using ScaleArc and uncovering potential limitations of the ScaleArc software itself.
- Connection to database through ScaleArc, where ScaleArc is allowed to cache. Caching in ScaleArc is TTL (Time To Live)-based caching, meaning that a read query’s results are cached in ScaleArc. If that read query is seen again before expiring, the query is not run again on the database server but rather served from the cache. Once the timer for the cached query expires, the query will be issued on the database again. Caching of course only works for reads, which are not in an explicit transaction (autocommit is on and no START TRANSACTION is issued). Because of that, we used –oltp-skip-trx during cached benchmarks (read-only case). In case of these benchmarks, the TTL was 1 hour, because we wanted to saturate the ScaleArc software while serving cached queries. An 1 hour TTL might be unrealistic for some applications, while for other applications even an 1 day TTL is something they can live with for some queries. In this case, we wanted to measure the cache’s performance, so we wanted the queries to be cached during the entire benchmark run to show the potential gain even in case of very small queries.
It’s important to note that the cache’s expiration is controlled by a TTL value – there is no other invalidation, so it’s possible to read stale data when the query results is changed, but the cache is not expired. Reading stale data alone is ok for most applications, it can happen with a regular, asynchronous slave if it’s lagging behind the master (and it always lagging behind somewhat). Otherwise, the cache is pretty similar to MySQL’s query cache, which doesn’t suffer from the stale read problem, but it has a coarse invalidation (if a table is written, the cache entries belonging to the given table are flushed). While the cache is flushed, the query cache mutex is held, which blocks reads even. Because of the mutex, the built-in query cache is a very usual performance bottleneck. ScaleArc’s cache doesn’t suffer from this.
It’s important to note that ScaleArc caches nothing by default. Also, there are other ways to invlidate cache entries apart from waiting for the TTL to expire.
- API Call based invalidation (you can clear the cache for an entire query pattern rule with one API call)
- Query comment based invalidation (you can put a comment /*wipe*/ before a query and wipe and refresh the cache)
- Cache Bypass (you can send a comment /*nocache*/ and bypass the cache for that specific query)
In the lower region of threads (up to 32), we see that the TPS value significantly drops in case of going through ScaleArc. That’s nothing to be surprised about, the reason for that is network roundtrips. Because ScaleArc is a software appliance, it adds a hop between the database and the application, which introduces latency. If the number of threads is higher (32 and up), this starts to matter less and less, and performance is almost identical which is very impressive. It means that around the optimal degree of parallelism for these machines, ScaleArc introduces very little (barely measurable) overhead.
Sysbench response time
This graph contains the response times belonging to the previous benchmarks. This is really hard to read because at 4096 threads, the system is overloaded, and the response time is much more than in the maximum throughput region. Because it’s multiple orders of magnitude higher, the interesting response times are not readable from this graph.
The following graph is the same as above, except that the y axis is limited to 250 ms, so the region which is not visible on the graph above is visible here. What we see there regarding the overhead is pretty much the same as we saw in case of the throughput graph, which means that ScaleArc by itself introduces immeasurably low latency (which explains the difference in cases when parallelism is low). Usually applications which are utilizing the database server are using significantly more than one thread (in MySQL a single query always uses a single thread, in other words there is no intra-query parallelism). The latency from 32 threads above is actually somewhat lower when going through ScaleArc (the exact tipping point can be different here based on the number of CPUs). The reason for that is ScaleArc itself uses an event loop to connect to MySQL, so at a high concurrency, and can schedule sending the traffic to MySQL differently. This only matters when otherwise the MySQL server is saturated CPU-wise.
Last but not least, this graph contains the CPU utilization of the different setups. The left-hand side shows the CPU utilization when connecting directly to the database, and the right-hand side shows connecting through ScaleArc. In both cases, the database server’s CPU is the bottleneck. It’s visible that the client node’s CPU is more than 75% idle (only client1 is graphed to improve readability, client2 is practically the same). From 32 threads and up, the blue bar (CPU user%) is relatively high on the database servers, as is the green (CPU sys%). From 64 threads, the idle time is practically 0, until the systems are overloaded. On the right hand side, we can see that ScaleArc at this load still had 50% idle CPU, which means that we could practically do the same benchmark on another set of boxes through the very same ScaleArc, and only then it will be fully utilized. We are talking about 3000 sysbench tps here. One more interesting thing to note is the relatively high system time of ibd. This is also because of the way ScaleArc connects to the database (see the previous paragraph).
[ 17s] threads: 64, tps: 3001.98, reads/s: 41962.70, writes/s: 0.00, response time: 35.22ms (95%)
These threads are from a single client, which means that ScaleArc could keep up with parsing roughly 84000 statements / second with utilizing half of its CPU, which is impressive. Please note that the ScaleArc software in this case was tuned towards this type of workload, which means we had more query processing threads. In case of caching, we will have more cache handler threads.
Effects of caching on read-only workload
The next set of graphs will compare the cases when cache is used and not used.
The preceding TPS graph contains reads / second (because we measured with –oltp-skip-trx), so roughly 42000 reads corresponds to roughly 3000 transactions in the earlier setup (14 reads in a transaction). On the left-hand side of the graph, the cached throughput is visible with green – on the right-hand side, the non-cached throughput is visible with red (direct access) and blue (access through ScaleArc as a pass-through filter). It’s visible that caching improves the speed drastically, but when ScaleArc becomes overloaded (8192 client threads, 4096 from each client), the performance becomes somewhat inconsistent, which is understandable considering how few cores ScaleArc was running on. On the graph, the dots are translucent, which means the colors are brighter in the areas that have more samples. Even in the overloaded case, the majority of the samples are in the region of 100k+ reads / second across two clients, which means that the performance degrades very gracefully even under heavy load.
Sysbench response time
Like in the case of a non-cached workload, the response times are not too readable because of the very high response times when the systems are overloaded. But from the overloaded response times visible, it seems like using caching doesn’t make response times worse.
Like in the case of non-cached workload, this graph is the zoomed version of the previous one. Here the maximum of the y axis is 100 ms. From this graph, it’s visible that at lower concurrency and at the optimal throughput, caching actually helps response time. This is understandable, since in case of a cache hit, ScaleArc can serve the results, and the client (in our case here sysbench) doesn’t have to go to the database, so a roundtrip and database processing time is spared. It’s also worth mentioning that the data “comes from memory,” it doesn’t matter if we hit the ScaleArc cache of the database. When the ScaleArc cache is used, the response time is lower because the additional roundtrip to the database and potential database work (like parsing SQL) is avoided. This means that caching can have benefits even if the database fits in the buffer pool. The improvement is always subject to the workload – caching helps the most when it can cache relatively expensive queries like aggregations and queries hitting the storage.
Similarly to the previous case, the preceding graph shows CPU utilization of the various components. In case of the cached workload, the client itself is much more utilized (since it gets responses sooner, it has to generate the traffic faster). With this kind of workload, when using only one client, we would hit the client’s CPU as the performance bottleneck. The database is interesting too. With caching, its CPU is barely used. This is because if a query is served from the cache, it never gets to the database, so the database’s CPU utilization will be lower. In other words, using the cache helps to offload the database. If offloading is visible on ScaleArc’s graphs, when caching is used, the CPU on the server hosting ScaleArc is much more utilized. For this benchmark, the ScaleArc software was tuned to handle a cached workload, which means more cache handler threads.
For read-write benchmarks, we had to create oltp_nontran.lua, which is the same sysbench benchmark as oltp.lua, except that it does the reads outside of the transaction and does only the writes in transaction, so caching can have an effect on read. The rest of the benchmarking setup is the same as the read-only case.
Similarly to the read-only case, at a low concurrency, the overhead of ScaleArc is coming from the additional network roundtrip. At the optimal concurrency, the overhead is barely measurable (the dots are plotted practically on top of each other).
Sysbench response time
The case is pretty similar with the response times as in the read-only case. Similarly, the second graph is a zoomed version of the first one, which a 250 ms maximum.
The CPU utilization graph shows that in this case, the database server’s CPU is the bottleneck. What is interesting is that ScaleArc is using less CPU than in the read-only case. This is understandable, since a transaction now contains writes as well, which are expensive on the database side, but they are still just statements to route on the ScaleArc side.
Effects of caching on read-write workload
Measuring caching here is interesting because the workload is no longer read-only of mostly reads. We have a very significant amount of writes.
For 30k reads, we get 8,5k writes. It’s expected that caching won’t help as much as in the previous case, because writes can’t be cached and while they are in process, the benchmarking threads can’t proceed with reads. Please note that this means that roughly 25% of the traffic is write, a typical application scaling out with additional slaves for reads doesn’t have this kind of read-to-write ratio.
The first graph shows that in terms of total throughput, caching still helps.
Sysbench response time
Similarly to the read-only case, caching also helps response time, because it reduces the time needed for the read part of the workload.
This test really stresses the database server’s CPU when not caching. With caching on, similarly to the read-only case, the client’s workload increases somewhat (but not as much), and the database server’s CPU usage decreases significantly. In the last row, the CPU utilization of ScaleArc shows that although it’s somewhat higher with caching, it’s still not that much higher.
From these tests it’s visible that caching can still be beneficial even if the write ratio is as high as in this test.
Engineering is always about making the right tradeoffs. If one wants features that needs a protocol-level load balancer like ScaleArc, the price should be paid in the overhead of Layer 7 parsing and decision making. ScaleArc’s engineering team did a great job minimizing this overhead. ScaleArc itself is very well tunable for different workload types (if caching is important, ScaleArc can be tuned for caching – if query rewriting, ScaleArc can be tuned for that).