Blog

Chef Cookbooks for HBase on CentOS Released

At Palomino, we've been hard at work building the Palomino Cluster Tool. Its goal is to let you build realistically-sized[1] and functionally-configured[2] distributed databases in a matter of hours instead of the days or weeks it takes at present. Today marks another milestone toward that goal as we release our Chef Cookbook for building HBase on CentOS!

 

Background

Riot Games was kind enough to open source their Chef Cookbook for building a Hadoop cluster. Although the code wasn't in a state that would produce a functional cluster, and it was almost entirely undocumented, it was a great start.

Recently I was tasked with building an HBase cluster on CentOS using Chef. Although I've written a Cookbook (three times!) to do so, my code was never fully general: it could build a cluster, but only with hard-coded configuration parameters, or it produced a cluster running in an unrealistic, non-production configuration.

Using the Riot Games Cookbook and the lessons I'd learned in the past, I whipped it into shape. I not only modified it to produce a functional cluster in a non-Riot environment, but also to build HBase on top of that! The diff contains over 800 changes, along with documentation on how to use it.

 

Source Code

Here you can find the newest Chef Cookbook for HBase on CentOS. Here you can find the original Ansible Playbooks for HBase on Ubuntu. If you would like to use this code to build your own cluster, you are encouraged to join the mailing list to get help and advice from your peers.

 

Notes

[1] A distributed database can be tested functionally by installing on a single machine, but when it comes time to run benchmarks, or to discover the other 90% of functionality that only appears in a distributed setup, you will want to have the database installed on many machines, preferably dozens.

[2] Many projects seem to stop short of installing the database in a way that would let you benchmark it. Perhaps there are shortcuts taken like putting all database files into /tmp, or disabling logging, or removing tricky/subtle components in the interest of simplicity. The Palomino Cluster Tool provides you with a cluster that's actually ready for production. Sure, you still have to edit the configurations a little, but a good base generic configuration is provided.

Bulk Loading Options for Cassandra

Cassandra's bulk loading interfaces are most useful in two cases: an initial migration or restore from another datastore/cluster, and regular ETL. Bulk loading assumes it's important to you to avoid loading via the Thrift interface, for example because you haven't integrated a client library yet or because throughput is critical.

There are two alternative techniques for bulk loading into Cassandra: "copy-the-sstables" and sstableloader. Copying the sstables is a filesystem-level operation, while sstableloader uses Cassandra's internal streaming system. Neither is without disadvantages; the best choice depends on your specific use case. If you are using counter columnfamilies, neither method has been extensively tested and you are safer writing via Thrift.

The key to understanding bulk-loading throughput is that it depends significantly on the nature of the operation, on the configuration of the source and target clusters, and on factors like the number of sstables, sstable size, and tolerance for potentially duplicated data. Notably, but not significantly, sstableloader in 1.1 is slightly improved over the (freshly rewritten) version in 1.0. [1]

Below are good cases for and notable aspects of each strategy, along with a brief command sketch after each list.

Copy-the-sstables/"nodetool refresh" can be useful if:

  1. Your target cluster is not running, or if it is running, is not sensitive to latency from bulk loading at "top speed" and associated operations.
  2. You are willing to de-duplicate sstable names manually, or have a tool to do so, and are willing to figure out where to copy them in any non copy-all-to-all case. You are willing to run cleanup and/or major compaction, and you understand that some disk space is wasted until you do. [2]
  3. You don't want to deal with the potential failure modes of streaming, which are especially bad in non-LAN deploys including EC2.
  4. You are restoring in a case where RF=N, because you can just copy one node's data to all nodes in the new RF=N cluster and start the cluster without bootstrapping (auto_bootstrap: false in  cassandra.yaml).
  5. The sstables you want to import are a different version than the target cluster currently creates. Example: trying to sstableload -hc- (1.0) sstables into a -hd- (1.1) cluster is reported to not work. [3]
  6. You have your source sstables in something like S3, which can easily parallelize copies to all target nodes. S3<->EC2 transfer is fast and free, close to the best case for the inefficiency of the copy stage.
  7. You want to increase RF on a running cluster, and are ok with running cleanup and/or major compaction after you do.
  8. You want to restore from a cluster with RF=[x] to a cluster whose RF is the same or smaller and whose size is a multiple of [x]. Example: restoring a 9 node RF=3 cluster to a 3 node RF=3 cluster, you copy 3 source nodes worth of sstables to each target node.
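A minimal sketch of the copy-the-sstables approach, assuming a Cassandra 1.1-style data layout and illustrative keyspace/columnfamily names and paths (you still have to de-duplicate filenames yourself if sstables from several source nodes would collide):

# run on each target node; paths and names are illustrative
rsync -av /backups/node1/MyKeyspace/MyCF/ /var/lib/cassandra/data/MyKeyspace/MyCF/
nodetool -h localhost refresh MyKeyspace MyCF    # picks up the copied sstables without a restart

If the target cluster is not running yet, just copy the files into place and start Cassandra; the refresh step is only needed on a live node.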

sstableloader/JMX "bulkload" can be useful if:

  1. You have a running target cluster, and want the bulk loading to respect, for example, streaming throttle limits.
  2. You don't have access to the data directory on your target cluster, and/or JMX to call "refresh" on it.
  3. Your replica placement strategy on the target cluster is so different from the source that the overhead of understanding where to copy sstables to is unacceptable, and/or you don't want to call cleanup on a superset of sstables.
  4. You have limited network bandwidth between the source of the sstables and the target(s). In this case, copying a superset of sstables around is especially inefficient.
  5. Your infrastructure makes it easy to temporarily copy sstables to a set of sstableloader nodes, or to nodes on which you call "bulkLoad" via JMX. These nodes are either non-cluster-member hosts which can otherwise participate in the cluster as pseudo-members from an access perspective, or cluster members with sufficient headroom to bulk load.
  6. You can tolerate the potential data duplication and/or operational complexity which results from the fragility of streaming. LAN is best case here. A notable difference between "bulkLoad" and sstableloader is that "bulkLoad" does not have sstableloader's "--ignores" option, which means you can't tell it to ignore replica targets on failure. [4]
  7. You understand that, because it uses streaming, streams on a per-sstable basis, and respects the streaming throughput cap, your performance is bounded in its ability to parallelize or burst, despite being "bulk" loading.
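A minimal sketch of the sstableloader path, with illustrative hosts and a staging directory laid out as <Keyspace>/<ColumnFamily>/ (this is the 1.1-era tool's syntax; details vary by version):

# stream the staged sstables to their proper replicas in a running cluster
bin/sstableloader -d 10.0.0.1,10.0.0.2 /staging/MyKeyspace/MyCF

On a cluster member that already has the sstables locally, the equivalent "bulkLoad" JMX operation can be invoked instead, with the caveats noted above.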

Jetpants Usage and Installation Gotchas

If you regularly manage enormous data sets, you've probably heard about Tumblr's exciting new toolkit called Jetpants, which automates common processes for MySQL data management, most notably in the area of rebalancing shards for more efficient scaling. In evaluating and using Jetpants in various environments, we identified some interesting installation and usage gotchas, and we've documented those here - let us know if you find other issues worth exploring.  

Jetpants Installation/Usage Gotchas for Ubuntu 12.04:

Jetpants has to date only been tested on RHEL/CentOS distributions, though working with the author (Evan Elias of Tumblr, thank you very much!) I was able to get it running on Ubuntu 12.04 and Mint Maya/13. There will be more work to do, since RHEL/CentOS report service status ("service mysql status" output, see later in this post) differently than Ubuntu/Mint, and there may be other differences as well.

For your Jetpants console workstation, install Ruby 1.9.3 (1.9.1 will also be installed), and check that /etc/alternatives/ruby points at ruby1.9.3. Also be sure you have installed the MySQL client libraries.

# apt-get install ruby1.9.3 rubygems libmysqlclient-dev
# ln -sf /usr/bin/ruby1.9.3 /etc/alternatives/ruby
# ln -sf /usr/bin/gem1.9.3 /etc/alternatives/gem

After that, run gem install jetpants. You may get scary warnings about an inability to convert ASCII-8BIT to UTF-8 characters during documentation installation, as below:

# gem install jetpants

Building native extensions.  This could take a while...
Fetching: sequel-3.39.0.gem (100%)
Fetching: net-ssh-2.5.2.gem (100%)
Fetching: coderay-1.0.7.gem (100%)
Fetching: slop-3.3.3.gem (100%)
Fetching: method_source-0.8.gem (100%)
Fetching: pry-0.9.10.gem (100%)
Fetching: highline-1.6.15.gem (100%)
Fetching: colored-1.2.gem (100%)
Fetching: jetpants-0.7.4.gem (100%)
Successfully installed mysql2-0.3.11
Successfully installed sequel-3.39.0
Successfully installed net-ssh-2.5.2
Successfully installed coderay-1.0.7
Successfully installed slop-3.3.3
Successfully installed method_source-0.8
Successfully installed pry-0.9.10
Successfully installed highline-1.6.15
Successfully installed colored-1.2
Successfully installed jetpants-0.7.4
10 gems installed
Installing ri documentation for mysql2-0.3.11...
Installing ri documentation for sequel-3.39.0...
Installing ri documentation for net-ssh-2.5.2...
unable to convert "\xE7" from ASCII-8BIT to UTF-8 for lib/net/ssh/authentication/pageant.rb, skipping
Installing ri documentation for coderay-1.0.7...
Installing ri documentation for slop-3.3.3...
Installing ri documentation for method_source-0.8...
Installing ri documentation for pry-0.9.10...
Installing ri documentation for highline-1.6.15...
Installing ri documentation for colored-1.2...
Installing ri documentation for jetpants-0.7.4...
Installing RDoc documentation for mysql2-0.3.11...
Installing RDoc documentation for sequel-3.39.0...
Installing RDoc documentation for net-ssh-2.5.2...
unable to convert "\xE7" from ASCII-8BIT to UTF-8 for lib/net/ssh/authentication/pageant.rb, skipping
Installing RDoc documentation for coderay-1.0.7...
Installing RDoc documentation for slop-3.3.3...
Installing RDoc documentation for method_source-0.8...
Installing RDoc documentation for pry-0.9.10...
Installing RDoc documentation for highline-1.6.15...
Installing RDoc documentation for colored-1.2...
Installing RDoc documentation for jetpants-0.7.4...

Be sure you've created /var/www writable by the user you'll execute Jetpants as, and be sure you have an /etc/jetpants.yaml similar to the template provided on the Jetpants website.

And now you're ready to run Jetpants!

$ jetpants 

Tasks:

  jetpants activate_slave            # turn a standby slave into an active slave

  jetpants clone_slave               # clone a standby slave

  jetpants console                   # Jetpants interactive console

  jetpants destroy_slave             # remove a standby slave from its pool

  jetpants help [TASK]               # Describe available tasks or one specific task

  jetpants pools                     # display a full summary of every pool in the topology

  jetpants pools_compact             # display a compact summary (master, name, and size) of every pool in the topology

  jetpants promotion                 # perform a master promotion, changing which node is the master of a pool

  jetpants pull_slave                # turn an active slave into a standby slave

  jetpants rebuild_slave             # export and re-import data set on a standby slave

  jetpants regen_config              # regenerate the application configuration

  jetpants shard_cutover             # truncate the current last shard range, and add a new shard after it

  jetpants shard_offline             # mark a shard as offline (not readable or writable)

  jetpants shard_online              # mark a shard as fully online (readable and writable)

  jetpants shard_read_only           # mark a shard as read-only

  jetpants shard_split               # shard split step 1 of 4: spin up child pools with different portions of data set

  jetpants shard_split_child_reads   # shard split step 2 of 4: move reads to child shards

  jetpants shard_split_child_writes  # shard split step 3 of 4: move writes to child shards

  jetpants shard_split_cleanup       # shard split step 4 of 4: clean up data that replicated to wrong shard

  jetpants summary                   # display information about a node in the context of its pool

  jetpants weigh_slave               # change the weight of an active slave

As mentioned before, Jetpants is currently only well-tested on RHEL/CentOS distributions, and RHEL/CentOS report service status ("service mysql status" output) differently than Ubuntu/Mint. Below is what Jetpants requires, followed by the actual output:

/sbin/service mysql status # check if mysql daemon running; output must include the string 'not running' if mysql is not running

This is how it works on both Ubuntu 12.04 and Mint Maya/13:

# service mysql stop
mysql stop/waiting

# service mysql status
mysql stop/waiting
# service mysql start
mysql start/running, process 31346
# service mysql status
mysql start/running, process 31346

I decided to check the return code, but it's always 0 regardless of the service state, so we must parse the output in Jetpants. See the following output:

# service mysql status ; echo $?
mysql stop/waiting
0
# service mysql start
mysql start/running, process 31629
# service mysql status ; echo $?
mysql start/running, process 31629
0

To support Debian-alikes, Jetpants must also check for the other string, since the strings output by "service" make it impossible to check for a single substring across distros.
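A distro-tolerant check could look like the following shell sketch (illustrative only, not actual Jetpants code): RHEL/CentOS init scripts print output containing 'not running' when mysqld is down, while Ubuntu/Mint upstart prints 'mysql stop/waiting' and exits 0 either way.

status=$(service mysql status 2>&1)
if echo "$status" | grep -Eq 'not running|stop/waiting'; then
  echo "mysql is down"
else
  echo "mysql is running"
fi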

Over the coming days, I will be identifying other CentOS/RHEL-specific code and writing patches to make it work on at least Ubuntu 12.04 as well. In the future, if you're trying to port Jetpants to another distribution, you'll be able to use my pull request as a template for the areas that are distribution-specific.

When running Jetpants in EC2, there are some more gotchas (or perhaps they're just "be sure to follow best-practices"):

1. You must allow root login to your cluster. Generate a privkey/pubkey pair and allow root logins. 

     a. Edit ~root/.ssh/authorized_keys and remove the "command=..." portion that prints an error and logs you out.

     b. Edit /etc/cloud/cloud.cfg and set disable_root: 0

     c. Edit /etc/ssh/sshd_config and set PermitRootLogin without-password

2. You must be sure your master/slave replication topology uses only IP addresses. If your slaves have, e.g., "Master_Host: ec2-5-2-3-8.compute.amazonaws.com", then Jetpants will fail. IP addresses are how you want your cluster configured in production anyway, so do it, but it may surprise you if you're trying to build a small test cluster.

3. You must allow root login from localhost on all your MySQL instances. In other words, this command must return valid output on your slaves: mysql -e 'show slave status\G'. Note that you can put root login credentials in ~root/.my.cnf, as sketched below, rather than allow passwordless MySQL root logins.
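A minimal sketch of such a ~root/.my.cnf (the password value is a placeholder; keep the file readable by root only):

cat > /root/.my.cnf <<'EOF'
[client]
user=root
password=YOUR_ROOT_PASSWORD
EOF
chmod 600 /root/.my.cnf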

In general, it's good to remember that Jetpants is a set of utilities for controlling an actual large MySQL cluster, so it will not work so well for tiny little test deploys where you've cut corners, and this is how it should be. Here are more Jetpants Environment Requirements/Limitations, from the documentation:

1. Need pigz installed on your cluster. If you use the Palomino Cluster Tool to build your shards, pigz will be installed for you.

2. No auto_increment on sharded tables. Must have ID generator.

3. PK of all sharded tables should start with the shard key. This will vastly improve performance of long-running processes like shard splits.

4. Port :3306 everywhere. One instance per node.

5. (Probably) doesn't work with MyISAM.

Couchbase rebalance freeze issue

We came across a Couchbase bug during a rebalance while upgrading online to 1.8.1 from 1.8.0.  

Via the UI, we upgraded our first node, re-added it to the cluster, and then kicked off the rebalance. It was progressing fine, then stopped around 48% for all nodes. The TAP and disk queues were quiet and there were no servers in pending rebalance. The upgraded node was able to service requests, but with only a small percentage of the items relative to the other nodes. The cluster as a whole did not suffer in performance during this issue, though there are some spikes in CPU during any rebalance.

We decided to stop the rebalance, wait a few minutes, and then rebalance again; it started moving again, progressing beyond where it had been. It then stopped again, this time at 75%. We let it sit for 7 minutes, then hit Stop Rebalance and Rebalance once more. This time it did not progress at all.

Couchbase support pointed to a bug where rebalancing can hang if there are empty vbuckets. This is fixed in 2.0. The workaround is to populate each bucket with a minimum of 2048 items with a short time-to-live (TTL >= (10 minutes per upgrade + (2 x rebalance_time)) x num_nodes) so that every vbucket has something in it. We populated all buckets successfully and were able to restart the rebalance process, which then completed fine.

Reference:

http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-getting-started-upgrade-online.html

Distributed Counter Performance in HBase - Part 1

Recently I was tasked with setting up an HBase cluster to test against Amazon's DynamoDB offering. The client had found that DynamoDB worked well for up to 10k updates/sec, but was concerned about the cost. I set up a 7-node HBase cluster in the following configuration:

  • Node1: NameNode, HMaster, Zookeeper
  • Nodes 2-7: DataNodes, RegionServers
Every node was configured with the following hardware:
  • 32GB RAM
  • 4x Intel Xeon E7320 2.13GHz
  • 4x SAS SEAGATE  ST3300657SS drives (JBOD configuration, no RAID)
  • 4x 1Gb Ethernet NICs in 2x2 bonded interfaces (only one used by HBase)
HBase was configured (almost) as follows:
  • hbase-env:
    • export HBASE_HEAPSIZE=8000
    • export HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
    • export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" 
    • export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
    • export HBASE_MASTER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
    • export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
    • export HBASE_THRIFT_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
    • export HBASE_ZOOKEEPER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
  • hbase-site:
    • dfs.support.append =  true
    • hbase.cluster.distributed =  true
    • hbase.tmp.dir =  /var/lib/hbase/tmp
    • hbase.client.write.buffer =  8388608
    • hbase.regionserver.handler.count =  20
The Thrift interface was started on the HMaster for a client program to connect and do work.

 

A single table was created with (almost) the following definition:

 

{NAME => 'timelesstest', DEFERRED_LOG_FLUSH => 'true', FAMILIES => [{NAME => 'family', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

 

A simple Python program using the happybase HBase client was written to increment a single counter in that table as fast as possible. At first, we could only achieve around 700 increments per second, no matter how many client programs we ran. Looking at the HBase console, it was apparent that only one node was doing any work (which is expected, since a single row is served by a single RegionServer), but we expected better than 700 increments per second.

 

I did some analysis of the cluster. The CPU, disk, and RAM footprints were all very low. The only indication the HBase cluster was doing work at all was that the three RegionServers hosting the table timelesstest were doing about 10x the interrupts/second of the others. Then I looked at the network: about 150KB/sec was being transferred in and out of the HMaster and the primary RegionServer for timelesstest, and about 380KB/sec for the replica RegionServers. At this point, I guessed that HBase was committing every single transaction coming through. The documentation pointed out a setting I could change, "hbase.regionserver.optionallogflushinterval", but tweaks to that value did nothing.

 

As I was speaking with Ryan Rawson, one of the original HBase committers, he pointed out that the setting only applies to a given table if DEFERRED_LOG_FLUSH is set at the table level, which the documentation didn't make very clear[1]. So I simply ALTERed timelesstest to enable deferred log flushing, et voilà! The number of increments per second a single client could achieve skyrocketed to 2k/sec. We added six more clients to the mix and achieved a sustained 10.5k increments/sec, spiking at times up to 11.5k increments/sec.
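For reference, the table-level change can be made from the hbase shell along the following lines (a sketch only; with the 0.92-era shell the table generally has to be disabled first, and the exact alter syntax for table attributes varies by version):

echo "disable 'timelesstest'
alter 'timelesstest', METHOD => 'table_att', DEFERRED_LOG_FLUSH => 'true'
enable 'timelesstest'" | hbase shell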

 

Since we were incrementing only a single counter, this means we were only using about 1/6th of the power of the cluster, since there were 6 RegionServers present. We will soon be running more tests, with thousands of counters and perhaps dozens of clients. I suspect we will quickly run into a bottleneck on the Thrift server and may need to open that bottleneck up to achieve the theoretical max of about 60k increments/sec on this cluster (if that is actually achievable, which it may not be given the 3x replication factor for the table).

 

Stay tuned!

 

[1] I let the HBase documentation guys know about the confusion. Hopefully future HBase administrators won't have to worry about this gotcha anymore.

Benchmarking NDB Against InnoDB on a Write-Intensive Workload

Last month, we evaluated Amazon's new SSD offerings with an extensive series of performance benchmarks.

As a followup, we've prepared a second series of benchmarks that specifically explore performance issues with a write-intensive workload on both NDB and InnoDB storage engines.

Download a free PDF of our findings, and as always, we welcome your feedback, comments and questions below.

Cascade Replication and Delayed servers on PostgreSQL 9.2

PostgreSQL 9.2 will be released with a new replication feature: cascading replication. It improves streaming across wide networks by shortening the distance between servers. For example: a master server in San Francisco, a slave in Chicago, and a cascaded slave in Barcelona.
It has other benefits as well: for example, when you need to add nodes to the cluster, you can take backups from a slave without directly affecting the master in terms of I/O or traffic, while also minimizing the connections to it.
The setup is easy and intuitive, just like setting up a slave from a master, with the difference that primary_conninfo points to the slave server.
Things to consider when using cascading replication:
  • One of the 9.0 replication limitations -fixed in 9.1- is that a vacuum of a big table on the master could affect (cancel) queries being executed on the slave. The fix is enabled with hot_standby_feedback on the slave server and adds more communication between the servers (if you have a very slow connection, you may want to disable it).
  • If you have synchronous replication, it only applies between the master and one slave; the cascaded slave won't be affected by this setup (a cascaded server cannot be synchronous).
  • The control function pg_xlog_replay_pause (which stops applying incoming WAL records) affects only the server on which you execute it. If that server has cascaded servers, those servers will continue streaming and applying the records (so you can occasionally have a=c but b!=a, even though replication follows the a->b->c chain), because the function pauses the replay, not the streaming shipping. If you want to stop cascading to all downstream hosts (the streaming, not only the replay), just stop the slave or change primary_conninfo on the cascaded servers. That is useful if you run BI queries on the cascaded servers and don't want them affected by writes: you can delay replication completely until your reports finish.
  • If you want to use archiving, it must be set up from the master server. Archiving is necessary if you want to improve availability in case of a long slave downtime, if you want a delayed server, or if you want incremental backups for PITR in case of disaster.
In our test we'll have 5 servers, which I'll label with their port number and data directory to avoid confusion. You could run postgres on the same port on different hosts, but just to make the difference clear I will use a different port for each:
  • Master (8888, data_master)
  • Slave (8889, data_slave)
  • Cascade slave (8890, data_cascade_slave)
  • Delayed standby server using omnipitr-restore (8891, data_delayed)
  • Delayed hot-standby slave using a handmade script (8892, data_streaming_delayed)
To take a backup for replication, first create a user with replication grants and add it to pg_hba.conf in the master. From psql, run:
    CREATE USER repl WITH REPLICATION PASSWORD 'password';

pg_hba.conf:
local   replication     repl                            md5
host    replication     repl        127.0.0.1/32            md5
host    replication      repl        0.0.0.0/0               md5
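After editing pg_hba.conf, reload the master so the new entries take effect; a minimal sketch, assuming the /opt/pg92beta layout used in the rest of this post:

bin/pg_ctl -D data_master reload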

*NOTE: if you want to set up several slaves in one move, make the backup locally on the master, compress it, and then ship it to the slaves. Backup-and-compress command:
bin/pg_basebackup -Ft -p8888 -Urepl -Ddata_slave -z -Z9


Servers configuration
Master (8888) configuration:
postgresql.conf:
wal_level = hot_standby
max_wal_senders=3
wal_keep_segments = 30
archive_mode=on
archive_command = 'cp %p /opt/pg92beta/archives_master/%f'
pg_hba.conf:
local   replication     repl                                md5
host    replication     repl        127.0.0.1/32            md5
host    replication      repl        0.0.0.0/0               md5
max_wal_senders should be at least the number of slaves connected to this server (cascaded servers don't count, because they connect to the other slaves). If you don't know how many slaves you'll need, you can set a higher value (10, for example). That won't affect performance, since a wal sender process only runs when a slave is actually connected.
wal_keep_segments is optional, but it is very convenient to keep a large number of segments available on the master for updating the slaves. If a slave suffers a short downtime, it can catch up directly via streaming replication instead of via the archived files. Keep in mind that this is not a time value: for a high-load server, 30 segments may represent just a few hours of activity, while on a low-activity server it may be a few days. archive_mode and archive_command are in charge of copying those segments to another destination, to be used for incremental backups or to update slaves that were down for a long period.
Slave (8889) Configuration:
postgresql.conf:
wal_level = hot_standby
max_wal_senders=3
wal_keep_segments = 30
hot_standby = on
hot_standby_feedback = on
recovery.conf:
standby_mode = on
primary_conninfo = 'user=postgres host=localhost port=8888'
trigger_file=/tmp/promote_me

This is a classic setup of an asynchronous slave server using streaming replication. No delay is specified; it just replicates as fast as it can. The hot_standby_feedback option lets the slave communicate with the master, so that long-running queries on the slave are not broken by vacuums executing on the master.
Cascade slave (8890) configuration:
recovery.conf:
standby_mode = on
primary_conninfo = 'user=postgres host=localhost port=8889'
Configuration of this host is like any other slave; the difference is that primary_conninfo points to the slave server from which you took the backup.
Delayed Slave (8891) configuration:
recovery.conf:
restore_command = '/opt/pg92beta/omniti/bin/omnipitr-restore -D /opt/pg92beta/data_delayed -l /opt/pg92beta/omniti-restore.log -s /opt/pg92beta/archives_master -w 300 %f %p'
archive_cleanup_command = '/opt/pg92beta/bin/pg_archivecleanup /opt/pg92beta/archives_master %r'
standby_mode = on
You can download omnipitr-restore at https://github.com/omniti-labs/omnipitr ; there is also some documentation on using OmniPITR with a delay.
restore_command is in charge of applying the corresponding segments, and the "-w 300" option is the "wait" that implements the delay. The server stays in standby mode, which means it cannot be queried.

Streaming delayed hot-standby slave (8892) configuration:
Configuration is almost the same as the other slaves; just keep an eye on postgresql.conf:
#...same configuration as the other slaves...
hot_standby_feedback = off   #not necessary because the server is delayed

Script to delay replication using recovery_target_time:
#!/bin/bash
 
HOME_PG=/opt/pg92beta
DATA_PG=$HOME_PG/data_streaming_delayed
DELAYED_DATE=$(date "+%Y-%m-%d %H:%M:%S %Z" --date="5 minutes ago")
RECOVERY_FILE_LOCATION=$DATA_PG/recovery.conf
RECOVERY_FILE_LOCATION_DONE=$DATA_PG/recovery.done
CONTROL_COMMAND=$HOME_PG/control_streaming_delayed.sh
PORT=8892
 
 
res=$($CONTROL_COMMAND stop)
 
if [ -f $RECOVERY_FILE_LOCATION_DONE ] ; then
 mv $RECOVERY_FILE_LOCATION_DONE $RECOVERY_FILE_LOCATION
fi
 
sed -i "s/.*recovery_target_time[ |\=].*/recovery_target_time =\'$DELAYED_DATE\'/g" $RECOVERY_FILE_LOCATION
 
#Taken from the doc:
# If this recovery target is not the desired stopping point,
# then shutdown the server, change the recovery target settings
# to a later target and restart to continue recovery.
res=$($CONTROL_COMMAND start)
control_streaming_delayed.sh is just a one-line script (--mode=fast is required so queries are stopped when the server is restarted):
su postgres -c "bin/pg_ctl -D data_streaming_delayed --mode=fast -l logfile_streaming_delayed $1"
This configuration has its cons:
The "delay" window isn't exact. Why? You need to cron the script -which changes recovery_target_time- every x minutes (a sample crontab entry is sketched below), and during that period the delay keeps growing until the script runs again. So, if you configure "30 minutes ago" and your script runs every 5 minutes, your delay will actually be between 30 and 35 minutes. The script executes a stop/start, which is necessary to load the new configuration. DO NOT RUN A RELOAD, because the slave would stop recovery mode. Because of this restart, keep it in mind if you want to run queries on that server.
But its advantages are:
You can query the server, and you can simply stop refreshing recovery_target_time while you run your reports against it. Because replay is paused, your queries will also perform better.
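The periodic refresh can be driven from cron; a minimal sketch of root's crontab, using the 5-minute interval and the script name that appear in this post (the log path is illustrative):

*/5 * * * * /opt/pg92beta/change_recovery_target_time.sh >> /opt/pg92beta/change_recovery_target_time.log 2>&1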

After every start, you will find something like:
LOG:  database system was shut down in recovery at 2012-07-04 11:14:39 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/1E000020
LOG:  consistent recovery state reached at 0/1E010B48
LOG:  database system is ready to accept read only connections
LOG:  recovery stopping before commit of transaction 14024, time 2012-07-04 11:14:18.319328+00
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.

Do not execute pg_xlog_replay_resume() if you want to keep the server recovering. If you execute it, the server will replay the last records and start as a standalone server, which isn't our objective: that function catches up to the last record received and continues applying records until something stops the replay.
Let’s see an example:
root@lapp /opt/pg92beta# bin/psql -p8888 -Upostgres
psql (9.2beta2)
Type "help" for help.
 
postgres=# insert into check_time values (now());
INSERT 0 1
postgres=# select * from check_time ;
            i
----------------------------
2012-07-04 11:19:47.667041
2012-07-04 11:43:23.794384
(2 rows)
 
postgres=# \q
root@lapp /opt/pg92beta# ./change_recovery_target_time.sh
root@lapp /opt/pg92beta# tail data_streaming_delayed/pg_log/postgresql-2012-07-04_11
postgresql-2012-07-04_110653.log  postgresql-2012-07-04_110853.log  postgresql-2012-07-04_111440.log  postgresql-2012-07-04_114335.log
postgresql-2012-07-04_110808.log  postgresql-2012-07-04_111315.log  postgresql-2012-07-04_111954.log
root@lapp /opt/pg92beta# tail data_streaming_delayed/pg_log/postgresql-2012-07-04_114335.log
LOG:  database system was shut down in recovery at 2012-07-04 11:43:33 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/1E000020
LOG:  consistent recovery state reached at 0/1E012734
LOG:  database system is ready to accept read only connections
LOG:  recovery stopping before commit of transaction 14027, time 2012-07-04 11:43:23.797137+00
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.
root@lapp /opt/pg92beta# bin/psql -p8892 -Upostgres
psql (9.2beta2)
Type "help" for help.
 
postgres=# select * from check_time ;
            i
----------------------------
2012-07-04 11:19:47.667041
(1 row)
 
postgres=# \q

For large amounts of data, you may want longer delays (1 day, 6 hours) because of the time it takes to start an instance. Based on some experience, 1 day should be enough to detect problems or to work with reports.
References:
http://pgxn.org/search?q=delay&in=docs
http://pgxnclient.projects.postgresql.org/install.html
http://vibhorkumar.wordpress.com/2011/11/15/new-replication-and-recovery-features-in-postgresql-9-1/

Palomino Evaluates Amazon’s New High I/O SSD Instances

Amazon Web Services (AWS) recently introduced a new storage layer with SSD as its backend. Because their previous EBS storage has been challenging in the areas of I/O throughput and stability, we were extremely excited to run comparative benchmarks to offer a recommendation for customers considering upgrading to SSD. 

The opportunity to expand I/O throughput horizontally can create a longer runway for sharding (distributing data-sets across multiple MySQL logical clusters in order to reduce write I/O), which can be quite compelling. 

Our extensive tests, described in detail in our Evaluation Report prepared by René Cannao and Laine Campbell (click here for a free PDF), illustrate the potentially enormous gains in throughput, performance and cost for companies scaling MySQL in Amazon. Palomino would strongly recommend that clients consider upgrading to SSD storage. Learn more.

Quick script to get list of all jmx beans

Recently I've needed to get information from running Cassandra processes, in order to determine which parameters to monitor. jconsole can be used for this; however, if you're using a host on AWS, or in a bandwidth-limited environment, you might find that it takes way too long, especially if all you want for the moment is a list of all beans and their attributes. jmxterm is a good solution for this: a command-line interface to your JMX port.

So here are a couple of utility scripts that you can use to get all the beans and each of their attributes using jmxterm. You may need to change $jar, $host, and $port in the Perl script to fit your environment, and also perhaps the domains in the bash script. Then you should be able to simply run the bash script to get a list of each bean and its attributes. I found this useful when looking for the correct syntax to set up some Nagios monitoring for Cassandra. For example, here is a snippet of the output:
org.apache.cassandra.internal
.
.
.
Checking bean org.apache.cassandra.internal:type=FlushWriter
  %0   - ActiveCount (int, r)
  %1   - CompletedTasks (long, r)
  %2   - CurrentlyBlockedTasks (int, r)
  %3   - PendingTasks (long, r)
  %4   - TotalBlockedTasks (int, r)
.
.
.
And I was able to figure out the syntax for a nagios check:
check_jmx!$HOSTADDRESS$!7199!-O org.apache.cassandra.internal:type=FlushWriter -A CurrentlyBlockedTasks
Hopefully these scripts will be useful to someone else trying to query Cassandra (or any Java process that uses JMX).
get_all_bean_info.sh:
#!/bin/bash
 
DOMAINS=(
org.apache.cassandra.db
org.apache.cassandra.internal
org.apache.cassandra.net
org.apache.cassandra.request
)
 
for domain in ${DOMAINS[@]}
do
    echo "-------------------"
    echo $domain
    output=$(./query_jmx.pl $domain 2>/dev/null | tr ' ' '+' | grep '=')
    for line in $output
    do
      bean=$(echo $line | tr '+' ' ')
      echo "Checking bean $bean"
      ./query_jmx.pl $domain $bean 2>/dev/null | grep -v "#" | grep -v "Got domain"
    done
done
query_jmx.pl:
#!/usr/bin/env perl
use strict;
my $jar = "/home/ubuntu/jmxterm-1.0-alpha-4-uber.jar";
my $host = "10.176.66.219";
my $port = 7199;
 
my $domain = shift @ARGV;
my @beans = ();
my $bean;
my $size;
for my $arg (@ARGV) {
  if ($arg =~ /^\w/) {
    push (@beans, $arg);
  } else {
    last;
  }   
}     
$size = @beans;
$bean = join(' ',@beans) if ($size > 0);
open JMX, "| java -jar $jar -n";
print JMX "open $host:$port\n";
print JMX "domain $domain \n";
if (defined $bean && length $bean > 0) {
  print JMX "bean $bean \n";
  print JMX "info \n";
} else {
  print JMX "beans \n";
}
 
print JMX "close\n";
close JMX;

Mystery Solved: Replication lag in InnoDB

 

While running a backup with XtraBackup against a slave server, we noticed that replication was lagging significantly. The root cause wasn't clear, but DML statements from replication were hanging for a long time. Replication wasn't always hanging, but it happened so frequently that a 24-hour backup caused replication to lag by 11 hours.

The first hypothesis was that all the writes generated by replication (relay log, binlog, redo log, etc.) were creating too much I/O contention while XtraBackup was reading the files from disk. But the redo log wasn't hitting 75% utilization, which meant that InnoDB wasn't doing aggressive flushing; some other contention was causing replication to stall.

After various tests, we found that disabling innodb_auto_lru_dump solved the issue. It wasn't entirely clear what the relation was between the LRU dump and replication lag during the backup, but it was very easy to reproduce: enabling the LRU dump at runtime immediately caused replication to lag, and disabling it restored replication back to normal.
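For reference, toggling it for a quick test can be done at runtime; a minimal sketch, assuming Percona Server, where innodb_auto_lru_dump is a dynamic global variable (0 disables the dump, a positive value is the dump interval in seconds):

mysql -e "SET GLOBAL innodb_auto_lru_dump = 0;"    # disable the periodic LRU dump
mysql -e "SET GLOBAL innodb_auto_lru_dump = 60;"   # re-enable it, dumping every 60 seconds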

Also, when innodb_auto_lru_dump was enabled we noticed that from time to time the simple command "SHOW ENGINE INNODB STATUS" was hanging for 2-3 minutes.

To attempt to reproduce the issue outside this production environment, we tried to run various benchmarks using sysbench, with and without auto lru dump. The sbtest table (~20GB on disk) was created using the following command:

sysbench --test=oltp --mysql-table-engine=innodb --mysql-user=root --oltp-table-size=100000000 prepare

The InnoDB settings were:

innodb_buffer_pool_size = 10G

innodb_flush_log_at_trx_commit = 2

innodb_thread_concurrency = 0

innodb_flush_method=O_DIRECT

innodb_log_file_size=128M

innodb_file_per_table

 

The various benchmarks were run using:

- read-only workload vs read-write workload;

- small buffer pool vs large buffer pool (from 2G to 30G)

- small number of threads vs large number of threads

 

None of the above benchmarks showed any significant difference with auto lru dump enabled or disabled. Perhaps these workloads were not really reproducing our environment where we were getting issues with auto lru dump. We therefore started a new series of benchmarks with only one thread doing mainly writes - this is the workload we expect in a slave used only for replication and backups.

The sysbench workload was modified to perform more writes than reads, yet the result of the benchmark didn't change much: enabling or disabling the LRU dump didn't produce any significant change in performance. The problem with this benchmark was that it generated too many writes and filled the redo log. InnoDB was then doing aggressive flushing, and this bottleneck was hiding any effect caused by the LRU dump.

To prevent the redo log from filling too quickly, we had to change the workload to read a lot of pages, shrink the buffer pool from 30G to 4G, and always restart mysqld between tests with the buffer pool prewarmed using:

select sql_no_cache count(*), sum(length(c)) FROM sbtest where id between 1 and 20000000;

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=1000 run

 

innodb_auto_lru_dump=0:    transactions: (7.26 per sec.)

innodb_auto_lru_dump=1:    transactions: (6.93 per sec.)

 

This was not a huge difference, but we finally saw some effect of the auto_lru_dump.

It became apparent that the number of transactions per second in the above benchmark was really low because the number of random reads from disk was the bottleneck. To remove this bottleneck, we removed innodb_flush_method=O_DIRECT (therefore using the default flush method), and then ran the following to load the whole table into the OS cache (not into the buffer pool).

 

dd if=sbtest/sbtest.ibd of=/dev/null bs=1M

 

To prevent the redo log from filling up, we also changed the innodb_log_file_size from 128M to 1G.

With these changes - always using a buffer pool of 4G, restarting mysqld before each test, and prewarming the buffer pool with "select sql_no_cache count(*), sum(length(c)) FROM sbtest where id between 1 and 20000000" - we reran the same test, changing the number of requests:

10K transactions:

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=10000 run

 

innodb_auto_lru_dump=0:    transactions: (243.22 per sec.)

innodb_auto_lru_dump=1:    transactions: (230.62 per sec.)

 

50K transactions:

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=50000 run

 

innodb_auto_lru_dump=0:    transactions: (194.31 per sec.)

innodb_auto_lru_dump=1:    transactions: (175.69 per sec.)

 

 

With innodb_auto_lru_dump=1, performance drops by 5-10%!

 

After this, we wanted to run a completely different test with no writes, only reads.

innodb_auto_lru_dump didn't show any difference when sysbench was executed with a read-only workload, and we believe the reason is simply that sysbench wasn't changing many pages in the buffer pool. The easiest way to churn pages in the buffer pool is to perform a full scan of a large table with a small buffer pool. We set innodb_flush_method=O_DIRECT, since otherwise reads from the OS cache were too fast and we couldn't detect any effect of innodb_auto_lru_dump. With innodb_buffer_pool_size=4G, and restarting mysqld after each test, this was the result of a full table scan:

 

With innodb_auto_lru_dump=0 :

mysql> select sql_no_cache count(*), sum(length(c)) FROM sbtest;
+-----------+----------------+
| count(*)  | sum(length(c)) |
+-----------+----------------+
| 100000000 |      145342938 |
+-----------+----------------+
1 row in set (3 min 27.22 sec)

 

 

With innodb_auto_lru_dump=1 :

mysql> select sql_no_cache count(*), sum(length(c)) FROM sbtest;
+-----------+----------------+
| count(*)  | sum(length(c)) |
+-----------+----------------+
| 100000000 |      145342938 |
+-----------+----------------+
1 row in set (3 min 38.43 sec)

 

Again, innodb_auto_lru_dump=1 affects performance, increasing the execution time by ~5%.

It is also important to note that innodb_auto_lru_dump seems to affect performance only in some specific workload scenarios. In fact, the majority of the benchmarks we ran didn't show any performance effect caused by innodb_auto_lru_dump.

 
