
Palomino Evaluates Amazon’s New High I/O SSD Instances

Amazon Web Services (AWS) recently introduced a new storage layer with SSD as its backend. Because their previous EBS storage has been challenging in the areas of I/O throughput and stability, we were extremely excited to run comparative benchmarks to offer a recommendation for customers considering upgrading to SSD. 

The opportunity to expand I/O throughput horizontally can create a longer runway for sharding (distributing data-sets across multiple MySQL logical clusters in order to reduce write I/O), which can be quite compelling. 

Our extensive tests, described in detail in our Evaluation Report prepared by René Cannao and Laine Campbell (click here for a free PDF), illustrate the potentially enormous gains in throughput, performance and cost for companies scaling MySQL in Amazon. Palomino would strongly recommend that clients consider upgrading to SSD storage. Learn more.

Quick script to get list of all jmx beans

Recently I've needed to get information from running Cassandra processes in order to determine which parameters to monitor. jconsole can be used for this; however, if you're using a host on AWS, or in a bandwidth-limited environment, you might find that it takes far too long, especially if all you want for the moment is a list of all beans and their attributes. jmxterm is a good solution for this - a command-line interface to your JMX port.

So here are a couple of utility scripts that you can use to get all the beans and each of their attributes using jmxterm. You may need to change $jar, $host, and $port in the Perl script to fit your environment, and possibly the domains in the bash script. Then you should be able to simply run the bash script to get a list of each bean and its attributes. I found this useful when looking for the correct syntax to set up some Nagios monitoring for Cassandra. For example, here is a snippet of the output:
org.apache.cassandra.internal
.
.
.
Checking bean org.apache.cassandra.internal:type=FlushWriter
  %0   - ActiveCount (int, r)
  %1   - CompletedTasks (long, r)
  %2   - CurrentlyBlockedTasks (int, r)
  %3   - PendingTasks (long, r)
  %4   - TotalBlockedTasks (int, r)
.
.
.
And I was able to figure out the syntax for a nagios check:
check_jmx!$HOSTADDRESS$!7199!-O org.apache.cassandra.internal:type=FlushWriter -A CurrentlyBlockedTasks
Hopefully, these scripts might be useful to someone else trying to query Cassandra (or any Java process that uses JMX). get_all_bean_info.sh:
#!/bin/bash
 
DOMAINS=(
org.apache.cassandra.db
org.apache.cassandra.internal
org.apache.cassandra.net
org.apache.cassandra.request
)
 
for domain in ${DOMAINS[@]}
do
    echo "-------------------"
    echo $domain
    output=$(./query_jmx.pl $domain 2>/dev/null | tr ' ' '+' | grep '=')
    for line in $output
    do
      bean=$(echo $line | tr '+' ' ')
      echo "Checking bean $bean"
      ./query_jmx.pl $domain $bean 2>/dev/null | grep -v "#" | grep -v "Got domain"
    done
done
query_jmx.pl:
#!/usr/bin/env perl
use strict;
my $jar = "/home/ubuntu/jmxterm-1.0-alpha-4-uber.jar";
my $host = "10.176.66.219";
my $port = 7199;
 
my $domain = shift @ARGV;
my @beans = ();
my $bean;
my $size;
for my $arg (@ARGV) {
  if ($arg =~ /^\w/) {
    push (@beans, $arg);
  } else {
    last;
  }   
}     
$size = @beans;
$bean = join(' ',@beans) if ($size > 0);
open JMX, "| java -jar $jar -n";
print JMX "open $host:$port\n";
print JMX "domain $domain \n";
if (defined $bean && length $bean > 0) {
  print JMX "bean $bean \n";
  print JMX "info \n";
} else {
  print JMX "beans \n";
}
 
print JMX "close\n";
close JMX;

Mystery Solved: Replication lag in InnoDB

 

While running a backup with XtraBackup against a slave server, we noticed that replication was lagging significantly. The root cause wasn't clear, but DML statements coming from replication were hanging for long periods. Replication wasn't always hanging, but it happened so frequently that a 24-hour backup caused replication to lag by 11 hours.

The first hypothesis was that all the writes generated by replication (relay log, binlog, redo log, etc.) were creating too much contention on I/O while XtraBackup was reading the files from disk. However, the redo log wasn't hitting 75% utilization, which meant that InnoDB wasn't doing aggressive flushing - some other contention was causing replication to stall.

After various tests, we found that disabling innodb_auto_lru_dump solved the issue. It wasn't entirely clear what the relation was between the LRU dump and replication lag during backup, but it was very easy to reproduce: enabling the LRU dump at runtime immediately caused replication to lag, and disabling it restored replication back to normal.
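A quick sketch of toggling it at runtime, assuming a Percona Server / XtraDB build where innodb_auto_lru_dump is a dynamic global variable (the value is the dump interval in seconds, and 0 disables it; credentials are placeholders):

mysql -e "SET GLOBAL innodb_auto_lru_dump = 0"   # disable the automatic LRU dump
mysql -e "SET GLOBAL innodb_auto_lru_dump = 60"  # re-enable it, dumping every 60 seconds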

Also, when innodb_auto_lru_dump was enabled we noticed that from time to time the simple command "SHOW ENGINE INNODB STATUS" was hanging for 2-3 minutes.

To attempt to reproduce the issue outside this production environment, we tried to run various benchmarks using sysbench, with and without auto lru dump. The sbtest table (~20GB on disk) was created using the following command:

sysbench --test=oltp --mysql-table-engine=innodb --mysql-user=root --oltp-table-size=100000000 prepare

The InnoDB settings were:

innodb_buffer_pool_size = 10G
innodb_flush_log_at_trx_commit = 2
innodb_thread_concurrency = 0
innodb_flush_method=O_DIRECT
innodb_log_file_size=128M
innodb_file_per_table

 

The various benchmarks were run using:

- read-only workload vs. read-write workload
- small buffer pool vs. large buffer pool (from 2G to 30G)
- small number of threads vs. large number of threads

 

None of the above benchmarks showed any significant difference with the auto LRU dump enabled or disabled. Perhaps these workloads weren't really reproducing the environment where we were seeing issues with the auto LRU dump. We therefore started a new series of benchmarks with only one thread doing mainly writes - this is the workload we expect on a slave used only for replication and backups.

The sysbench workload was modified to perform more writes than reads, yet the result of the benchmark didn't change much - enabling or disabling the LRU dump didn't produce any significant change in performance. The problem with this benchmark was that it generated too many writes and filled the redo log. InnoDB was then doing aggressive flushing, and this bottleneck was hiding any effect caused by the LRU dump.

To prevent the redo log from filling too quickly, we changed the workload to read a lot of pages, reduced the buffer pool from 30G to 4G, restarted mysqld before every test, and prewarmed the buffer pool with:

select sql_no_cache count(*), sum(length(c)) FROM sbtest where id between 1 and 20000000;

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=1000 run

 

innodb_auto_lru_dump=0:    transactions: (7.26 per sec.)

innodb_auto_lru_dump=1:    transactions: (6.93 per sec.)

 

This was not a huge difference, but we finally saw some effect of the auto_lru_dump.

It became apparent that the number of transactions per second in the above benchmark was really low because the number of random reads from disk was the bottleneck. To remove this bottleneck, we removed innodb_flush_method=O_DIRECT (therefore using the default flush method), and then ran the following to load the whole table into the OS cache (not into the buffer pool).

 

dd if=sbtest/sbtest.ibd of=/dev/null bs=1M

 

To prevent the redo log from filling up, we also changed the innodb_log_file_size from 128M to 1G.

With these changes - always using a buffer pool of 4G, restarting mysqld before each test, and prewarming the buffer pool with "select sql_no_cache count(*), sum(length(c)) FROM sbtest where id between 1 and 20000000" - we reran the same test, changing the number of requests:

10K transactions:

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=10000 run

 

innodb_auto_lru_dump=0:    transactions: (243.22 per sec.)

innodb_auto_lru_dump=1:    transactions: (230.62 per sec.)

 

50K transactions:

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=50000 run

 

innodb_auto_lru_dump=0:    transactions: (194.31 per sec.)

innodb_auto_lru_dump=1:    transactions: (175.69 per sec.)

 

 

With innodb_auto_lru_dump=1, performance drops by 5-10%!

 

After this, we wanted to run a completely different test with no writes, only reads.

innodb_auto_lru_dump didn't show any difference when sysbench was executed with a read-only workload, and we believe the reason is simply that sysbench wasn't changing many pages in the buffer pool. The easiest way to change pages in the buffer pool is to perform a full scan of a large table with a small buffer pool. We set innodb_flush_method=O_DIRECT, since otherwise reads from the OS cache were too fast and we couldn't detect any effect of innodb_auto_lru_dump. With innodb_buffer_pool_size=4G, and restarting mysqld after each test, this was the result of a full table scan:

 

With innodb_auto_lru_dump=0 :

mysql> select sql_no_cache count(*), sum(length(c)) FROM sbtest;
+-----------+----------------+
| count(*)  | sum(length(c)) |
+-----------+----------------+
| 100000000 |      145342938 |
+-----------+----------------+
1 row in set (3 min 27.22 sec)

 

 

With innodb_auto_lru_dump=1 :

mysql> select sql_no_cache count(*), sum(length(c)) FROM sbtest;
+-----------+----------------+
| count(*)  | sum(length(c)) |
+-----------+----------------+
| 100000000 |      145342938 |
+-----------+----------------+
1 row in set (3 min 38.43 sec)

 

Again, innodb_auto_lru_dump=1 affects performance, increasing the execution time by ~5%.

It is also important to note that innodb_auto_lru_dump seems to affect performance only for some specific workload scenarios. In fact, the majority of the benchmarks we ran weren't showing any performance effect caused by innodb_auto_lru_dump.

 

Optimizing your MySQL Tuning Methodology

 

Optimizing your queries

There are two general methods for creating a query plan for a query. A rule-based optimizer goes through your query, sees if you're doing X or Y, and then optimizes for method A or B accordingly. Rule-based optimizers grow rather large, as there are many possible cases to account for. Sometime in the past couple of decades, rule-based optimizers fell out of favour and cost-based optimizers took over instead. At the moment the query is submitted, the optimizer does some simple calculations of the costs of various ways of executing the query and picks the one that's probably best. It turns out the simplicity of the optimizer, combined with the "it generally works" nature of the beast, means it totally wins the popularity contest.

 

My Tuning Methodology, the Overview

With this in mind, I wanted to evaluate my own tuning methodology. At first blush, I thought my method rather coarse and unrefined. I generally look at every variable in my.cnf, do a quick search if I don't know what it is, set it to something that “seems about right,” and then go to the next one.

 

My tuning experience is almost exclusively in high-volume shops, which isn’t representative of the many different MySQL installations of different sizes in the world. In high-volume environments, there are no good rules of thumb for tunings, because if you followed any such rules, your MySQLs would crash into the dust. So this rather naïve approach is actually a good start. Because there's that word again. Rule.

 

You can tune the my.cnf by having several rules of thumb handy: a very long ruleset of "if this, then that," either documented or as a mental checklist. This works well on MySQL installations that are typical, but it isn't my methodology. Mine is iterative and experimentally based.

 

My Tuning Methodology, the Details

So now I've got a my.cnf that I’ve touched line-by-line and picked values that seemed reasonable. Anyone who's ever administered a MySQL knows this my.cnf is going to be a disaster if left unchecked. But here's where my magic begins and why my methodology works in high-volume environments where no rules of thumb apply.

 

I place the my.cnf onto a machine that is taking representative load. That is, it receives queries much like it would if it were performing its production role. There are a lot of ways of doing this, but my favorite is to put the slave into a production pool at a small fraction (perhaps 1%) of the workload of a currently-production slave. 

 

Next, I point trending software at the slave. What do I monitor? That's easy. EVERYTHING. Every single statistic I can think of, I graph, from the OS and DBMS. Everything sar would ever tell you. Everything from "show global status". Everything in log files I can turn into numbers.

 

Then I measure the "black-box expected throughput" of the machine. That is, it's getting a fraction of the query traffic, so is it performing obviously better than the other hosts performing full work? Say this machine is supposed to be able to answer its typical queries in 250ms. Is it doing it in at most 250ms, and perhaps better? If so, I increase the load on the machine. I keep increasing the load until it starts to struggle. Say I get it up to 150% normal load, and it's now taking 260ms to answer its queries. Now I know it's "stressed."

 

At this point, I start to look at every statistic I've graphed and look for things at their limits. Is CPU maxed out? Is disk I/O at a plateau? Are read_rnd_next handler stats off-the-charts? I find it useful to have the same metrics coming from an existing reference slave, but sometimes they are obviously in a bad state (like if read_rnd_next is pushing 100,000,000/s for example, we probably need some indexes).

 

From here, I begin to get an idea of tunings or, more precisely, "experiments I can run." Maybe the tables opened is going crazy. So I tune table_open_cache, restart the server, and see if the graph corrects (secondary concern) and if my server is responding better, say 10ms faster per query (primary concern, which is most-correctly measured as "does the application seem happy with this server in its current state?"). Maybe disk I/O is crazy, but InnoDB free pages is nothing. That suggests an experiment to increase innodb_buffer_pool_size.

 

These are perhaps obvious examples, but there can be some subtler and surprising tuning choices to be made when you start viewing graphs of statistics you don't even know the meaning of. For example, suppose you didn't know what handler_read_rnd_next meant. You look at it and see it's 10,000x larger than handler_read_next which is 50x larger than handler_read_first. You start looking up what those stats mean, et voila, you've made a useful discovery.
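As a rough sketch of how those counters can be pulled for graphing or a quick look (the host name is a placeholder; mysqladmin's -r/-i options report deltas at a fixed interval):

# absolute values, once
mysql -h slave01 -e "SHOW GLOBAL STATUS LIKE 'Handler_read%'"
# per-10-second deltas, suitable for feeding to trending software
mysqladmin -h slave01 -r -i 10 extended-status | grep -i 'Handler_read'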

 

At this point, I've formulated a hypothesis, which sounds a bit like this: "The graphs indicate that my system is doing suboptimal X. There is a tuning A that affects X. Thus I will change its current value from M to N, and I expect it will perform more optimally in X." I make the change, and verify the evidence on the graphs. Most importantly, the primary concern is that the application, the thing using my DBMS, should perform better. If the graphs indicate everything is better, but the application is performing worse, we must accept the failed experiment!

 

This leads us to an anecdote about InnoDB tuning.

 

Why Experimental Evidence Matters

At a previous company, I was running several 100% InnoDB MySQL servers in a particular slave query pool, and was getting a particular performance characteristic. I was looking at my.cnf and noted innodb_buffer_pool_size was rather small (2GB out of 24GB total available RAM) and that I had an extremely low cache hit rate. I formulated the obvious hypothesis: that I should increase the buffer pool to improve my deplorable cache hit rate. Upon increasing the innodb_buffer_pool_size to 6GB (I believe in changing things in small increments), however, I discovered that although my cache hit rate was much nicer, my query latency was also up (and the application was clearly suffering).

 

The first thing to do was tune it back to 2GB to ensure things returned to status quo. Then I formulated a new experiment which made no sense, but I had to try: I tuned innodb_buffer_pool_size to 1GB... Performance INCREASED over the original 2GB setting!

 

Further experimentation settled on about a 750MB innodb_buffer_pool_size being optimal for the machines in this particular query pool (that is, smaller values began leading to worse performance again).

 

This was EXTREMELY counter-intuitive, as it should be to you if you're a MySQL DBA. It took me a long time to accept the truth of this situation and move on. I formulated many other hypotheses in the process of trying to understand this phenomenon. MyISAM? Bad trending data? Matter/antimatter collision? Practical joke by my co-workers who were in the datacenter siphoning off the blue smoke from my machines?

 

To add to the mystery, I observed two additional phenomena: First, in other slave pools, increased innodb_buffer_pool_size indeed had the expected result: better cache hit rate and better performance. Second, several months later these machines were replaced with a new set of hardware that was configured similarly in RAM/disk, but with AMD processors instead of Intel. Upon loading them up with MySQL and putting them into the pool, they performed extremely badly compared to their peers. It took almost as long to discover the obvious - that increasing the innodb_buffer_pool_size decreased latency on this configuration - as it had originally taken to let go of that intuition. But the FACTS, that is, the graphs coming out of the trending software, didn't lie. On this new configuration, we could see performance increase as innodb_buffer_pool_size was tuned up. I believe we settled on about 18GB of the total 24GB RAM as the optimal tuning.

 

Wrapping up the Experiment, Standardize on your Configuration

At some point - after staring at graphs, modifying parameters, altering tables, changing sysctl.conf settings, remounting filesystems - it starts to be more difficult to suggest tunings that will have a great effect. Keep in mind that the business has its own goals. After a while, getting another 5% performance out of your servers isn't worth the same to the business as, say, helping the developers push out the next release. So recognize that although you could spend the next several months staring at graphs and performing experiments (which are fun), at this point you should be satisfied with your increased performance and get back to the daily grind. Roll your tunings out to the pool of DBMSes, and monitor the cluster to be sure that, on the whole, it's performing as well as the individual test machine was. (This is important! Sometimes a subset of the machines will behave differently than you expect - be prepared for it!) If so, move on to something else more important. Don't worry: changes in workload, hardware configuration, or application code will necessitate revisiting this procedure in the future.

 

An overview of Riak

 

At PalominoDB, we constantly evaluate new technologies and database options for our clients’ environments. NoSQL databases are especially popular right now, and Riak is an increasingly-recommended option for highly available, fault-tolerant scenarios. Moss Gross attended an introductory workshop, and shares his findings here. For more on Riak, please see the Basho wiki.

What is Riak?

Some of the key features of Riak include:

- license is apache2

- key-value store

- masterless

- distributed

- fault-tolerant

- eventually consistent (Given a sufficiently long time, all nodes in the system will agree. Depending on your requirements, this could be a determining factor on whether you should use Riak)

- highly available

- scalable

- each node has one file

- supports map/reduce for spreading out query processing among multiple processes

- The database itself is written in Erlang.  There are currently clients written in Ruby, Java, Javascript, Python and many other languages.

 

Key Concept Definitions

 

Node:  One running instance of the Riak binary, which is an Erlang virtual machine.

Cluster:  A group of Riak nodes.  A cluster is almost always in a single datacenter.

Riak Object: One fundamental unit of replication.  An object includes a key, a value, and may include one or more pieces of metadata.

Value:  Since it is a binary blob, it can be any format.  The content-type of the value is given in the object's metadata.

Metadata:  Additional headers attached to a Riak Object.  Types of metadata include content-type, links, secondary indexes

Vector Clock: An opaque indicator that establishes the temporality of a Riak object.  This is used internally by Riak, and the actual timestamp is not intended to be read by the user from the vector clock.

Bucket: Namespace.  This can be thought of as roughly analogous to a table, although Riak buckets otherwise behave quite differently from tables.  They indicate a group of Riak objects that share a configuration.

Link:  Unidirectional pointer from one Riak object to another.  Since an object can have multiple links, bi-directional behavior is possible.  These links use the HTTP RFC.

Secondary Index:  An index used to tag an object for a fast lookup. This is useful for one-to-many relationships

 

 

More Key Concepts

 

Riak uses consistent hashing, which can be thought of as a ring onto which all possible Riak objects are mapped; the key space allows up to 2^160 values.

 

Partition: Logical division of the ring.  This corresponds to logical storage units on the disk.  The number of partitions must be a power of two, and in practice should generally be very high.  The default number of partitions is 64, and this is considered a very small number.

 

 

Operations:

 

To insert data, the basic operation is a PUT request to one of the hosts.  The bucket is indicated in the address, and metadata is added via headers to the request.  For example: curl -v http://host:8091/buckets/training/keys/my-first-object -X PUT -H "content-type: text/plain" -d "My first key"

 

To insert data without a key, use a POST request.   To retrieve an object, use GET, and to delete, use DELETE.
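A hedged sketch of those verbs against the same training bucket used in the PUT example above (host and port are carried over from that example):

# store an object and let Riak generate the key
curl -v -X POST -H "content-type: text/plain" -d "some value" http://host:8091/buckets/training/keys
# fetch an object
curl -v http://host:8091/buckets/training/keys/my-first-object
# delete an object
curl -v -X DELETE http://host:8091/buckets/training/keys/my-first-object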

 

Riak doesn't have any inherent locking.  This must be handled in the application layer.

 

Administration:

 

Configuration:

Riak has two configuration files, located in /etc/ by default.

 

vm.args: identifies the node to itself and to the other nodes in the cluster.  The name of the node is of the form 'name@foo', where 'name' is a string and 'foo' can be an IP or a hostname; it must resolve to a machine, via /etc/hosts, DNS, etc.

 

app.config:  identifies the ringstate directory and the addresses and ports that the node listens on.

 

 

Logs:

 

There are four main logs

 

console.log:  All the INFO, WARN, and ERR messages from the node.

crash.log: crash dumps from the Erlang VM

error.log:  just the ERR messages from the node

run_erl.log: logs the start and stop of the master process

 

Diagnostics:

 

Things to check for proper operation:

 

Locally, from the machine the node is on, you can run the command 'riak-admin status'. This gives one minute's stats for the node by name, plus the cluster status:

 

Cluster Status:

nodename: 'xxxx' (compare to what the rest of the cluster thinks it should be)

connected_nodes (verify it's what's expected)

ring_members (includes OOC members)

ring_ownership (the numbers should show a general balance of indexes across all of the nodes; if one is significantly different, that indicates a problem)

 

Storage Backend

 

Riak is very versatile in that you have many choices as to what to use for your storage backend.  Some of the possibilities include memory, Bitcask (a homegrown solution written in Erlang, which keeps just a hashmap of keys in memory), LevelDB (which allows secondary indexing, uses compaction, and does not require you to store all keys in memory), or a combination.  Basho actually recommends using different storage engines for different buckets if that seems more appropriate.

 

Creating backups in Riak is relatively straightforward; a rough command sketch follows the steps:

 

1. stop node

2. copy data directory (if not in memory)

3. start node back up

4. let read repair handle the rest
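A rough sketch of those steps for a single node using an on-disk backend such as Bitcask (the data paths and service commands are assumptions; adjust them for your installation):

riak stop
tar czf /backups/riak-$(hostname)-$(date +%F).tar.gz /var/lib/riak/bitcask /var/lib/riak/ring
riak start
# read repair reconciles any writes this node missed while it was down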

 

 

Monitoring and Security

 

- There is no built-in security in Riak - it's up to the administrator to add access restrictions via http authentication, firewalls, etc. The commercial product, Riak CS, has ACLs and security built-in however.

- JMX monitoring is built-in and can be enabled in the application configuration - just specify the port.

- MapReduce can be enabled by a setting in app.config.  JavaScript or Erlang can be used.

 

Querying connection information in MongoDB

As with all other database platforms, daily administration and troubleshooting of MongoDB often begins with analyzing database activity.  It can be helpful to see all connections from within MongoDB, which you can do by running:

   db.$cmd.sys.inprog.findOne({$all:1});
or 
   db.currentOp(true);

where the "true" makes the command include idle connections.  But often there are so many connections that unless you capture the output in a script file or something similar, it's kind of useless.

Thanks to a hint given by Scott Hernandez in the mongodb-user forum, we can use the line

   db.currentOp(true).inprog.forEach(function(o){if( <filter criteria> ) printjson(o)});

to show a subset that is more manageable.  For instance, we might want to show all connections from a particular application server:

   db.currentOp(true).inprog.forEach(function(o){if(o.client.indexOf("10.0.1.77") != -1 ) printjson(o)});

or from the MongoDB logs we’ll see a particular connectionId and want to know where it came from:

   db.currentOp(true).inprog.forEach(function(o){if(o.connectionId == 8606 ) printjson(o)});

This will then show all the connection info for that connectionId:

   { 
   "opid" : 7117752, 
   "active" : false, 
   "lockType" : "read", 
   "waitingForLock" : false, 
   "op" : "query", 
   "ns" : "player_data.player_answerchoice", 
   "query" : { 
   "$query" : {  
   "poll_id" : ObjectId("4f58e08db3e93217c2000008") 
   } 
   }, 
   "client" : "10.0.1.77:59542",  
   "desc" : "conn", 
   "threadId" : "0x49e31940", 
   "connectionId" : 8606, 
   "numYields" : 0 
   }

Testing and Analyzing Performance with Benchmarks

Generic benchmark tools can be very useful for testing performance on your system. These benchmark tools normally have a set of predefined workloads, but often they don't match your specific workload in useful ways.

One of the best ways to reproduce your workload is to have a good sense of the application that uses the database and how it manages requests to the database. If this is not an option, it is also possible to analyze traffic and to find the most common queries, and use those to define the most common workload.

You can analyze traffic in many ways, from tcpdump to general log, from binlog (only for DML statements) to slow query log.

Afterwards, it is possible to analyze the captured queries with pt-query-digest (or the obsolete mk-query-digest) to find the most common and/or heaviest queries.
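For example, a hedged sketch of digesting a slow query log, or traffic captured with tcpdump (file names and paths are placeholders):

pt-query-digest /var/lib/mysql/slow.log > /tmp/slow-digest.txt
tcpdump -s 65535 -x -nn -q -tttt -i any -c 100000 port 3306 > /tmp/mysql.tcp.txt
pt-query-digest --type tcpdump /tmp/mysql.tcp.txt > /tmp/tcpdump-digest.txt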

In the system we analyze here, the workload was mainly write intensive and involved just 4 tables:

  • tableA was receiving single-row INSERT statements;
  • for each insert on tableA, on average 200 INSERTs were performed on the other 3 tables, distributed as follows: 100 on tableB, 95 on tableC, 5 on tableD (to be more specific, for each INSERT on tableB there is an INSERT on either tableC or tableD).

 

The system also receives SELECT statements, but in very small numbers, and these are very simple primary key lookups.

To simulate the workload, we wrote a simple Perl script that spawns a certain number of threads performing the DML statements, and other threads performing the SELECT statements.

At regular intervals, the script prints statistics and progress.
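The Perl script itself isn't reproduced here, but the idea can be sketched in a few lines of shell; the host, schema, table and mix below are illustrative placeholders, and a real load generator would use persistent connections rather than one mysql process per statement:

DBHOST=127.0.0.1; DB=test; WRITERS=4; READERS=2; ITERATIONS=1000
for w in $(seq $WRITERS); do
  ( for i in $(seq $ITERATIONS); do
      # in the real workload, each tableA insert triggers ~200 inserts on tableB/C/D
      mysql -h $DBHOST $DB -e "INSERT INTO tableA (payload) VALUES (UUID())"
    done ) &
done
for r in $(seq $READERS); do
  ( for i in $(seq $ITERATIONS); do
      mysql -h $DBHOST $DB -e "SELECT * FROM tableA WHERE id = $((RANDOM + 1))"
    done ) &
done
wait
echo "all client threads finished"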

The benchmark test was executed in a setup with 2 hosts: one host where the client was running, and another host where the servers were running.

The RDBMSes tested were MariaDB 5.2.3 with TokuDB 5.2.7 and InnoDB, and Percona Server 5.5.20.

Additionally, Percona Server 5.5.20 was tested as multiple instances running on the same host.

 

The goal of the first benchmark test was to compare TokuDB against InnoDB for this specific workload.

We executed MariaDB with TokuDB with the following (simple) config file:

[mysqld] 
user=mysql 
table_open_cache=1024 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
datadir=/localfio/datadir
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M 
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/tokudb 
 

 

We found the performance of InnoDB significantly better than TokuDB in this instance, though this test - where the dataset fits almost entirely in memory - does not show the real power of TokuDB, which excels at insertion rates at scale. Because these tables have very few indexes, TokuDB's Fractal Tree indexes didn't provide much benefit. Furthermore, the benchmarks were running on FusionIO, which meant that InnoDB performance didn't degrade as much as it would on spinning disks. We excluded TokuDB from the next benchmark tests because they are all cases that are not well-suited to TokuDB's strengths.

We temporarily abandoned MariaDB, and tested Percona 5.5.20 with the following config file:

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=2G
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 

 

We tried various innodb_flush_method settings, and the graphs show that O_DIRECT performs slightly better than the default fsync(), even if the benchmark shows a weird bootstrap phase. We also tried ALL_O_DIRECT, which performed badly.

 

Additionally, we tried innodb_log_block_size=4096 instead of the default 512, but nothing changed: insert rate wasn't affected.

 

One of the goals of this benchmark was to test if running multiple mysqld instances on the same host performs better than a single mysqld instance.

On this specific hardware, the answer seems to be yes. With 8 mysqld instances configured with the same config file listed below (but different paths and ports), throughput is significantly higher. Note that innodb_buffer_pool_size was set to 256M to try to stress the IO subsystem.

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 
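For the record, a hedged sketch of how the eight instances could be launched from per-instance config files (the paths follow the datadir naming above but are assumptions):

for i in 0 1 2 3 4 5 6 7; do
  mysqld_safe --defaults-file=/localfio/MULTI/db0$i/my.cnf &
done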
 

 

All the above tests were executed using 36 client connections for writes and 36 client connections for reads.

 

We then ran a new cycle of tests, but instead of using 36 x 2 connections, we used 80 x 2 (80 for writes and 80 for reads).

 

 

With 80 connections, throughput was higher than with 36 connections, but at nearly regular intervals we found performance dropping. This seems independent of the size of the buffer pool.

It is interesting to note that with only one mysqld instance, FusionIO was performing at 4.7k – 4.8k IOPS, while with 8 mysqld instances FusionIO was performing at 27k – 29k IOPS. As expected, with a small buffer pool performance tends to slowly degrade when the data doesn't fit in memory.

We tried various values of innodb_write_io_threads, but this didn't make any difference, since most writes were going to the redo log rather than the tablespaces.

To better analyze the throughput, we reduced the sample time to 10 seconds and reran the test:

 

 

It is clear that throughput drops from time to time, and for a nearly constant amount of time. While the test was running, we tried to monitor the mysqld instances, but there was no clear indication of why they were stalling. The Redo Log wasn't anywhere close to full and InnoDB wasn't performing aggressive flushing. The amount of data read from disk was pretty low but the amount of data written was spiking. Yet, the writes weren't coming from InnoDB.

The reason for the stalls became apparent when we analyzed the content of /proc/meminfo: the Linux Virtual Memory (VM) subsystem was performing dirty pages flushing!

We changed dirty_background_ratio from 10 (the default) to 1, and reran the test.

sysctl -w vm.dirty_background_ratio=1

 

Throughput is now way more stable, although performance has dropped by 2.8%. It is interesting to note that throughput drops at nearly the same time no matter the value of dirty_background_ratio.

A quick analysis of the MySQL source code shows that binlogs are synced to disk when they are closed, so the drops in throughput may be caused by the flush of binary logs.

We then raised vm.dirty_background_ratio up to 10 (the default value) and lowered max_binlog_size from 1G to 64M.
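Applied at runtime, those two knobs look roughly like this (persist them in /etc/sysctl.conf and my.cnf for anything beyond a test):

sysctl -w vm.dirty_background_ratio=10
mysql -e "SET GLOBAL max_binlog_size = 67108864"   # 64M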

 

 

Throughput doesn't drop as drastically as in the two previous tests, but goes up and down at more regular intervals.

At the end of this test, performance with max_binlog_size=64M is ~4% lower than the initial test with max_binlog_size=1G (in both cases, vm.dirty_background_ratio=10).

The last setup of 8 instances with a 256M buffer pool each and max_binlog_size=64M was then compared with a new setup:  4 instances with a 512M buffer pool each (2GB total in both cases) and max_binlog_size=64M:

 

 

An interesting outcome of this last test is that total throughput rose by around 4% (recovering what was originally lost by using binlogs of 64M) and that the total number of IOPS dropped to ~16k, leaving room for more IO in case of a different workload.

We then ran a new test using only 2 mysqld instances. It confirms what was already easy to guess from running a similar test with only one mysqld instance: a lower number of mysqld instances can't fully utilize the IO capacity and therefore delivers lower throughput.

 

Conclusions (most of them are as expected) for this specific workload and on this specific hardware:

- O_DIRECT performs slightly better than the default fsync for innodb_flush_method.
- A high number of clients provides more throughput than a smaller number of clients; not enough tests were performed to find the optimal number of clients.
- Throughput drops when the data doesn't fit in the buffer pool.
- A high number of mysqld instances running on the same server is able to better utilize the number of IOPS that FusionIO can provide (it would probably be a very bad idea to run multiple mysqld instances on the same spinning disk or array).
- The sync of binlogs during binlog rotation can stall the system; lowering dirty_background_ratio or max_binlog_size stabilizes the throughput.

Securing MySQL on Linux Systems

 

MySQL is one of the most, if not the most, popular relational databases chosen by internet-based startups in the last decade. Although it is a very robust platform and offers many of the features necessary to support the database needs of today's internet giants, it also suffers from some security issues that must be addressed prior to production use.

This article will discuss the necessary steps to secure a basic MySQL installation and additionally cover more advanced topics for general database security as would be applied to general production environments. 

Functionality

The following general statements apply to this security discussion:

This is a discussion of MySQL as installed on Linux; Windows, OSX, Solaris, and BSD installations may differ and the differences are not covered here.

MySQL will be handling generic web traffic as is commonly found on PHP, Python, Perl, and Java web applications. 

The standard, and included, MySQL administration tools will be used as necessary for daily contact with the database server. Non-standard tools are not discussed. 

Any remote connections not done over SSH, either explicitly or via tunnel, are assumed to be a security risk and are not advised, as they lack encryption and allow passwords to be read in transit by potential attackers. 
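For example, a typical SSH tunnel that forwards a local port to MySQL on the database host (host names, users and ports are placeholders):

ssh -N -L 3307:127.0.0.1:3306 dbadmin@db01.example.com &
mysql -h 127.0.0.1 -P 3307 -u appuser -p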

 

Pre-Installation Security Recommendations

Before installing any software it is a good idea to harden the operating system. This includes filesystem security features:

Encrypted filesystems to prevent inspection of data by unauthorized parties

Intrusion detection systems like Tripwire, which watch for changes to critical system files 

Port lockdown via firewall software to prevent access to system services

Strong root and OS-user passwords

Disallowing OS login by application users: set to “nologin” or “/bin/false” 

Setting up sudo for privileged accounts and requiring a password for sudo access

Running scheduled rootkit detection scripts

Running an operating system level Access Control List process: SELinux extensions or AppArmor. These programs restrict system or server processes from accessing data and resources that are not explicitly defined. Due to general misunderstanding, or lack of desire to maintain the access controls, these programs are very often disabled by system administrators. 

Post Install Security 

After MySQL is installed, whether via RPM, Deb package, or other means, there are different approaches to securing the initial database. The first option is to execute the script provided with MySQL, named "mysql_secure_installation". This will go through the following steps, which can also be taken manually if you prefer:

Checking for the existence of a root password, and setting one if not found

mysql> SELECT * FROM mysql.user WHERE User='root';

mysql> UPDATE mysql.user SET Password=PASSWORD('your password here') WHERE User='root';

Delete anonymous users

mysql> DELETE FROM mysql.user WHERE User='';

Removing remote access for the root account

mysql> DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');

Removing the default “test” database

mysql> DROP database test;

Reloading the user privilege tables

mysql> FLUSH PRIVILEGES;

 

By default, MySQL runs via "mysqld_safe", which, contrary to the name, does not make MySQL safer from a security standpoint aside from ensuring that the mysqld process is not running as the root user. The mysqld_safe script describes its functionality as follows: "Script to start the MySQL daemon and restart it if it dies unexpectedly". As such, if one attempts to run mysqld as the root user it will complain and refuse to start. If you are troubleshooting the mysqld process and want to run it without mysqld_safe, you can run it as follows, via sudo. Your binary location may differ from /usr/local/bin/mysqld, so replace as necessary.

$>  sudo -u mysql /usr/local/bin/mysqld

 

MySQL Configuration Considerations

There are several configuration settings that can further secure the database during operation. These settings will be found in the /etc/my.cnf or /etc/mysql/my.cnf file depending on the version of Linux being used. 

 

old-passwords: this allows MySQL to create and authenticate users via the outdated and insecure password hashing from version 4 and older. Using this in production is strongly discouraged.

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_old-passwords

sql-mode: this allows the administrator to run MySQL in various operating modes. There are many options but for security the recommended minimum setting is “NO_AUTO_CREATE_USER” or “TRADITIONAL”

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_sql-mode

skip-grant-tables: this starts the mysqld process without any authentication tables, which is useful if one needs to recover or reset a lost root password. However, if you see this option enabled on a production server that is not undergoing recovery it should be seen as a critical security issue and dealt with immediately. 

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_skip-grant-tables

bind-address: this can be used to restrict network and socket access to specific interfaces, thus ensuring that traffic can only originate through the desired interface. An example is a server with multiple network interfaces (eth0, eth1, eth2) that resides on multiple subnets; in some architectures the database will answer only to application servers or users on a specific subnet and thus the mysqld process needs to be restricted to listen for connections on only the required subnet. 

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_bind-address

 

Data Encryption

MySQL offers in-row data encryption functions. These can be used to ensure that, even in the event of a security breach, an attacker cannot access the data in the database tables without the encryption key. Although not MySQL-specific functionality, the SQL functions AES_ENCRYPT and AES_DECRYPT, along with a multitude of hash functions, can be employed for in-row data encryption. Read more about this topic here: http://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html
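A minimal illustration of the round trip (the passphrase and data are placeholders; key management, which is the hard part, is not shown):

mysql -e "SELECT HEX(AES_ENCRYPT('secret data', 'my_passphrase'))"
mysql -e "SELECT AES_DECRYPT(AES_ENCRYPT('secret data', 'my_passphrase'), 'my_passphrase')"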

 

Database User and Connection Security Considerations

MySQL users, by definition, can connect to the database server and access data. How much or how little access your users are granted determines how large a security risk they pose if their credentials fall into an attacker's hands. The most insecure user that can be created is a user with wildcard remote host connection ability (the ability to connect from any location, any host, any IP address) who has the SUPER and GRANT options; this user can create other users and has the ability to shut down the database server process. As such, it is wise to limit the amount of access that a user is granted when creating accounts. Here are some things to keep in mind (a sample restricted account follows the list):

Limit user access to specific schemas: granting global access to a user allows the user to access all current and future schemas as opposed to a single defined schema. 

Limit originating users connections from specific IP addresses or subnets instead of allowing connections from any host or any network. 

Where possible limit users to connect only from localhost (127.0.0.1) and advise ssh tunnels to be utilized to gain access to localhost. 

Do not grant SUPER privileges to non-administrative users

Do not grant replication privileges to users that are not replication processes.

When creating accounts for replication processes, limit them to only the replication privileges.

When possible, do not use hostnames for host connection privileges, instead specify IP addresses. This removes the risk of DNS spoofing and removes the requirement for a DNS lookup during connection initiation which saves time and resources.

Require strong passwords for all users and rotate passwords on a defined schedule.

If running connections over the internet without a VPN, SSH tunnel, or other encrypted means, consider using SSL for all connections. Otherwise passwords can be seen in transit by attackers, similar to FTP. 
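As a sample of the restricted account described above (schema name, subnet and password are placeholders):

mysql -u root -p -e "CREATE USER 'appuser'@'10.0.1.%' IDENTIFIED BY 'a-strong-password'"
mysql -u root -p -e "GRANT SELECT, INSERT, UPDATE, DELETE ON appdb.* TO 'appuser'@'10.0.1.%'"
mysql -u root -p -e "FLUSH PRIVILEGES"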

 


Redis Persistence

Clients often ask us about the benefits of using key-value stores, such as Redis, for high-volume environments. One of the key benefits that is often cited is the durability offered by Redis persistence (as described in detail here).

Developer Salvatore ‘antirez’ Sanfilippo delves into the topic on his blog. Here are some key questions you should consider as you evaluate Redis in your own environment.

Redis replication (and one of two methods of persistence) is achieved using the AOF (append-only file), which logs all statements that modify data. In the example in the article, there is a DELETE issued on a non-existent key, and that statement doesn't get logged nor replicated to a slave Redis. But in real life, masters and slaves get out-of-sync, and it can be handy to have a statement that does nothing on the master, but when replicated to the slave, has the tangible effect of moving the master and slave closer toward convergence. It seems at least it should be an option to have "no-effect" statements replicate from the master to the slave.

When the AOF gets too large, an AOF rewrite occurs. This is the minimal set of statements needed to log to reproduce the data set as it is in memory:

You may wonder what happens to data that is written to the server while the rewrite is in progress. This new data is simply also written to the old (current) AOF file, and at the same time queued into an in-memory buffer, so that when the new AOF is ready we can write this missing part inside it, and finally replace the old AOF file with the new one.

So what happens when we run out of RAM? That would be a very interesting behaviour to have defined unambiguously. He also doesn't talk about how long it takes to write the in-memory delta to the AOF before the swap-over. I suspect during that time, the server will be largely unresponsive (or should be, to preserve data integrity). On a busy server, there might be millions of entries in the in-memory delta buffer once the new AOF is finished writing.

If you think Redis having persistence means you can serve data from disk, you'd be wrong. The point of persistence is just to get the in-memory-only dataset back after a crash or restart.

AOF rewrites are generated only using sequential I/O operations, so the whole dump process is efficient even with rotational disks (no random I/O is performed). This is also true for RDB snapshots generation. The complete lack of Random I/O accesses is a rare feature among databases, and is possible mostly because Redis serves read operations from memory, so data on disk does not need to be organized for a random access pattern, but just for a sequential loading on restart.

So Redis really is just (a fast) memcached but with some persistence methods.

There is an option for fsync'ing data to the AOF in various ways. Be careful, the "appendfsync everysec" setting is actually worst-case every TWO seconds. It's only every second on average.
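For what it's worth, the fsync policy can be inspected and changed on a running server with redis-cli (CONFIG GET/SET are standard commands in recent 2.x releases):

redis-cli CONFIG GET appendfsync
redis-cli CONFIG SET appendfsync everysec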

When you set appendfsync always, Redis still doesn't do an fsync after every write. If there are multiple threads writing, it'll batch the writes together doing what he calls "group commit." That is, every thread that performs a write in the current event loop will get written at the same time at the end of the event loop. I can't think of any downside to this, as I don't think clients get their response that data was written until the end of the event loop.

Restarting Redis requires either re-loading an RDB (Redis snapshot) or replaying AOF transactions to get to the state before the server was stopped. Redis is an in-memory database, so as you might expect, the start-up times are fairly onerous.

Redis server will load an RDB file at the rate of 10 ~ 20 seconds per gigabyte of memory used, so loading a dataset composed of tens of gigabytes can take even a few minutes.

That's the best case, as we note from the following:

Loading an AOF file [takes] twice as long per gigabyte in Redis 2.6, but of course if a lot of writes reached the AOF file after the latest compaction it can take longer (however, Redis in the default configuration triggers a rewrite automatically if the AOF size reaches 200% of the initial size).

It isn't pretty, but I'm really glad the author is giving so much transparency here. An optimisation mentioned is to run a Redis slave and have it continue serving the application while the Redis master is restarted.

The author notes that while, in high-volume environments, a traditional RDBMS can in theory serve reads from the moment it's started, in practice that can cause the database to become fairly unresponsive as the disks seek wildly to pull in the required data. He further notes that Redis, once it starts serving reads, serves them at full speed.

At Palomino, we are dedicated to bringing rigor in benchmarking and analysis to the DBMSs that are our core competencies. In a future post, watch for us to test Redis and put some solid numbers behind some of this. We are interested in seeing how Redis performs during AOF rewrites and how long it takes to start up on typical modern hardware at typical modern loads.

The Postgres-XC 1.0 beta release overview

 

When I heard about this project a year ago, I was really excited about it. Many cluster-wide projects based on Postgres have been developed very slowly, based on older (e.g. Postgres-R http://www.postgres-r.org/) or proprietary (e.g. Greenplum) versions. The features that this project hopes to achieve are ambitious, as we'll detail in this post. And best of all, this project is based on PostgreSQL 9.1, which is really up to date (at the moment of writing this post, it is the latest stable version).

If you are interested in a serious project for scaling your PostgreSQL architecture horizontally, you may visit the official website at http://postgres-xc.sourceforge.net/ and take a look.

For those who are interested, there will be a tutorial at PgCon this year.  As a brief overview, I will try to give a broad idea of the project for those who want to get involved.

 

What Postgres-XC can do:

  • Support multi-master architecture. Data nodes can contain part or all of the data of a relationship. 
  • Transparent view to application from any master. The application only needs to interact with the cluster through coordinators.
  • Distribute/Replicate per relation (replication, round robin (the default if no unique column is specified), by hash (the default if a unique column is specified), by modulo, or a set to a group or node)
  • Parallel transaction execution among cluster nodes.

 

What Postgres-XC cannot do:

  • Support triggers (may be supported in future releases).
  • Distribute a table with more than one parent. 

 

Before you start:

  You need to install recent versions of Flex and Bison. That's important because in the latest tarball, ‘./configure’ won't raise an error if they are missing; the error will only appear once you execute ‘make’. You will also need readline-dev and zlib-dev (not mandatory, but strongly recommended).

According to the documentation, Postgres-XC should be compatible with Linux platforms based upon Intel x86_64 CPU. 

The documentation needs to be improved, so we advise you to try the steps directly and read the help prompted by the commands. For example, the initdb command in the documentation is incomplete: “--nodename” is mandatory in this version. This project is new and has only a few contributors to date, but we hope its community keeps growing. Most importantly, it is great that a beta release was launched earlier than we expected.
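For example, a hedged sketch of initializing a coordinator data directory with the mandatory node name (the path and node name are placeholders):

initdb -D /usr/local/pgxc/data_coord1 --nodename coord1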

 

Elements of architecture

 

  • GTM (Global Transaction Manager)

+ Realize that I say only GTM, not GTMs: only one GTM can be the manager. For redundancy, you have the GTM-Standby, and to improve performance and failover, GTM-Proxies.

+ The GTM serializes all the transaction processing and can limit overall scalability if you implement it naively. It should not be used across slow or wide-area networks, and it is recommended to have the fewest possible switches between the GTM and the coordinators. The proxies reduce the interaction with the GTM and improve performance. 

+ Uses MVCC technology. That means it will still use the same concurrency control as Postgres.

+ This is the first thing you will need to configure. If you set up everything in the same machine, you will need to create a separate folder for the configuration and files.

 

  • Coordinator/s

+ Interface for applications (like a Postgres backend). 

+ It has its own pooler (yes, and I think this is great, avoiding more complexity in big environments).

+ Doesn’t store any data. The queries are executed in the datanodes, but...

+ … it has its own data folder (for global catalogs).

  • Datanode/s

+ Stores the data.

+ It receives the request with a GXID (Global Transaction ID) and a Global Snapshot, to allow requests from several coordinators.

Both Datanodes and Coordinators use their own data directory, so keep this in mind if you are setting both up on the same machine. 

Configuring several GTM-Proxies will improve scalability, shrinking the I/O on the GTM. Plus, you can configure the GTM-Standby to avoid a SPOF in the general manager. The GTM not only provides the GXID, it also receives the node registration (you can trace your log or check the file inside the gtm folder called register.node; it's binary but readable) and, most importantly, it holds the snapshot of the current status of all the transactions.

Coordinators can point to all the datanodes, and multiple coordinators can point to the same datanode (similar to Oracle RAC, though we're not sure whether all the features included in that solution will be available in Postgres-XC). Coordinators connect to the proxies (if you have configured them) or to the main GTM.
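As a very rough sketch of bringing the pieces up on a single machine - the flags are as we recall them from the 1.0 beta documentation, so verify them against gtm_ctl --help and pg_ctl --help; directories and ports are placeholders:

gtm_ctl start -Z gtm -D /usr/local/pgxc/data_gtm
pg_ctl start -Z datanode -D /usr/local/pgxc/data_datanode1 -o "-p 15432"
pg_ctl start -Z coordinator -D /usr/local/pgxc/data_coord1 -o "-p 5432"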

Hope you enjoyed the read. We are preparing more cool stuff about this amazing solution, keep in touch!
