Tools

Nagios Check Calculated on Mysql Server Variables

Nagios Check For Calculating Based on Mysql Server Variables

Recently we needed to make a change for a client to one of our mysql monitoring tools so I thought it would be a good opportunity to highlight the tool and discuss some of the changes that I made.

You can access the tool on Github from our public repository here.

Before this change if you were using either the "varcomp" or "lastrun-varcomp" modes, it would only return a WARNING if your criteria for comparison were exceeded. In the new version, both WARNING and CRITICAL states can return to nagios. Here is an example: Let's say you want to alert on maximum connections, but the number of maximum connections is different for different hosts. Instead of writing distinct per/host checks, you can use this check to do a simple calculation and alert on the result of that calculation. In this case you want to send a warning when your connections exceeds 75% of max connections and alert when it reaches 80%. The entry in your nagios config file might look something like this:

mysql_health_check.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ --mode varcomp --expression "Threads_connected / max_connections * 100" --comparison_warning="> '$ARG3$'" --comparison_critical "> '$ARG4$' --shortname percent_max_connections

Where the expression flag uses the names that are returned from any of "FULL PROCESSLIST", "ENGINE INNODB STATUS", "GLOBAL VARIABLES", "GLOBAL STATUS", or "SLAVE STATUS". The comparison_warning and comparison_critical flags are going to be evaluated in perl, so ensure that it is a valid perl expression (in this case you could use either > or -gt. You'll definitely need to test out your commands with a few different use cases to ensure you have good syntax. When a varcomp or lastrun-varcomp check is run, the results are kept in a local cache file against which you can make comparisons. So, to give a ridiculous example, if you want to ensure that the number of open table definitions didn't increase too much between samples, you could do something like this

mysql_health_check.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ --mode lastrun-varcomp --expression "current{Open_table_definitions} - lastrun{Open_table_definitions}" --comparison_warning="> 10" --comparison_critical "> 20" --shortname increase_in_open_table_defs

Where the notation current{} and lastrun{} refer to which sample time you want. You can see more details on all the features of the script in the README on the plugin page. We welcome any comments and suggestions.

Put Opsview Hosts Into Downtime via the Shell

Recently a client of ours who used opsview to manage their resources needed to place some of their hosts into downtime in conjunction with some other cron-scheduled tasks. In order to implement that functionality, I created this simple script that should work with most installations of opsview, or with a few modifications, can be modified to be used with other, similar REST interfaces. To use, modify the 5 variables at the top of the script as necessary. The url and username are what come with the default installation of opsview. Modify CURL if it's in a different place on your system. Then, to use, for example: opsview_rest_api_downtime.sh -p Pa5sw0rd -h host_name_in_opsview -c create -t 2 Where host_name is the hostname as defined in opsview, not necessarily the same as its actual hostname.
#!/bin/bash
#
# create or delete downtime for a single host using opsview curl rest api
 
CURL=/usr/bin/curl
OPSVIEW_HOSTNAME="opsview.example.com"
USERNAME=apiuser
URL="/rest/downtime"
hours_of_downtime=2
 
usage()
{
    echo "Usage: $0 -p <opsview apiuser password> -h <host> -c (create|delete) [-t <hours_of_downtime>]"
    exit 1
}
 
while getopts p:h:t:c: opt
do
    case $opt in 
      p) password=$OPTARG;;
      h) host=$OPTARG;;
      t) hours_of_downtime=$OPTARG;;
      c) command=$OPTARG;;
      \?) usage;;
    esac
done
 
 
if [ "x$password" = "x" ] || [ "x$host" = "x" ] || [ "x$command" = "x" ]
then
    usage
fi
 
# LOGIN
 
token_response=`$CURL -s -H 'Content-Type: application/json' https://$OPSVIEW_HOSTNAME/rest/login -d "{\"username\":\"$USERNAME\",\"password\":\"$password\"}"`
token=`echo $token_response | cut -d: -f 2 | tr -d '"{}'`
if [ ${#token} -ne 40 ]
then
    echo "$0: Invalid apiuser login. Unable to $command downtime."
    exit 1
fi
 
 
if [ "$command" = "create" ]
then
    # create downtime - POST
    starttime=`date +"%Y/%m/%d %H:%M:%S"` 
    endtime=`date +"%Y/%m/%d %H:%M:%S" -d "$hours_of_downtime hours"`
    comment="$0 api call"
    data="{\"starttime\":\"$starttime\",\"endtime\":\"$endtime\",\"comment\":\"$comment\"}"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" https://$OPSVIEW_HOSTNAME$URL?host=$host -d "$data"`
    exit_status=$?
else
    # delete downtime - DELETE
    params="host=$host"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" -X DELETE https://$OPSVIEW_HOSTNAME$URL?$params`
    exit_status=$?
fi
echo "$result" | grep $host > /dev/null
host_in_output=$?
if [ "$exit_status" -ne "0" ] || [ "$host_in_output" -ne "0" ]
then
  echo "Unable to $command downtime for $host.  Result of call:"
  echo $result
  exit 1
fi

Chef Cookbooks for HBase on CentOS Released

At Palomino, we've been hard at work building the Palomino Cluster Tool. Its goal is to let you build realistically-sized[1] and functionally-configured[2] distributed databases in a matter of hours instead of days or weeks as it is at present. Today marks another milestone toward that goal as we release our Chef Cookbook for building HBase on CentOS!

 

Background

Riot Games was kind enough to open source their Chef Cookbook for building a Hadoop cluster. Although the code wasn't in a state that would produce a functional cluster, and it was almost entirely undocumented, it was a great start.

Recently I was tasked with building an HBase cluster on CentOS using Chef. Although I've written a Cookbook (three times!) to do so, my code was never fully optimized. It could build a cluster, but only with hard-coded configuration parameters, or it produced a cluster that was running in a non-realistic non-production configuration.

Using the Riot Games Cookbook and the lessons I'd learned in the past, I whipped it into shape. I not only modified it to produce a functional cluster in a non-Riot environment, but also to build HBase on top of that! There are over 800 changes in the diff and documentation on how to use it.

 

Source Code

Here you can find the newest Chef Cookbook for HBase on CentOS. Here you can find the original Ansible Playbooks for HBase on Ubuntu. If you would like to use this code to build your own cluster, you are encouraged to join the mailing list to get help and advice from your peers.

 

Notes

[1] A distributed database can be tested functionally by installing on a single machine, but when it comes time to run benchmarks, or to discover the other 90% of functionality that only appears in a distributed setup, you will want to have the database installed on many machines, preferably dozens.

[2] Many projects seem to stop short of installing the database in a way that would let you benchmark it. Perhaps there are shortcuts taken like putting all database files into /tmp, or disabling logging, or removing tricky/subtle components in the interest of simplicity. The Palomino Cluster Tool provides you with a cluster that's actually ready for production. Sure, you still have to edit the configurations a little, but a good base generic configuration is provided.

Productivity Squared: Character Graphs in-Terminal

In the course of a large cluster database administrator's job, there are dozens of times a week it can be useful to visualise some data. You're constantly working with machines that have hundreds of databases, directories, files, log files with often millions of entries each.

Wouldn't it be nice if you could visualise these situations?

  • You have dozens of directories full of files. What's the relative size of each?
  • You have a logfile with 1.5M entries. You grep out "ERROR" and that is 50,000 timestamped entries. What times of day have the most errors?
  • You have a process list with 30,000 entries, comprised of 7 unique commands. What is the relative frequency of each command?
  • You have a database with 10,000 tables, each having several million rows. Which are the 30 biggest, and what is their relative size?
  • You have slow query logs spanning a couple of weeks. You wish to know the days and times-of-day that had the most slow queries.

Typically, you can use awk, sort, grep -c, and uniq -c to give some output that contains "keys" and "values" and you can eyeball it to get a general sense. In fact, if the output is decimal, the number of digits is a base-10 logarithmic graph of the relative sizes of things. But sometimes that isn't good enough. For example, a 7x difference will not be obvious from eyeballing a bunch of numbers.

You can copy the output, paste into a CSV, load it up in Libre Office Calc, and generate graphs if you want to spend another couple minutes every command to generate the graphs. Wouldn't it be nice to just do it right there in the terminal where you already are?

If you did web searches for this sort of thing, up until a couple of weeks ago, the best you'd've found was a page referencing an amazing awk script that takes a list of numeric keys and presents an ASCII histogram. However, it couldn't take non-numeric keys, and it could not help with several of the use-cases presented above.

Enter the Palomino-written "distribution." It will take a large input, munge it, tokenise it, tally it, and give you beautiful in-terminal graphs with ANSI colour coding, unicode partial-width characters, and more. So go to the project page linked, download it, and play with it. You'll find plenty of examples, and using a little imagination, you will begin to get a different view of the log files you're constantly working with.

Happy administering those database clusters!

Exploring Configuration Management with Ansible

 

What is Ansible?

Ansible is a configuration management and deployment system, like Puppet, Capistrano, Fabric, and Chef. Its aim is to be radically simple and let you use your existing scripts to help with cluster configuration and software deployment whenever possible. Here are the ways that Ansible differentiates itself.

Simplicity

Ansible does not include a client/server architecture with pull-based clients (although in more recent versions, it does include pull-based configuration and deployment). Rather, it uses pre-existing network infrastructure: SSH. Every company has SSH installed on their cluster servers, and Ansible simply rides on top of this infrastructure to get the code and configuration out to the nodes.

Language Agnostic

You can write modules for Ansible in whatever language you desire. You simply tie into its API and go. If you have wanted to use configuration management tools like Chef, but felt put off by the need to learn Ruby, this would be an ideal choice for you.

Configuration Management and Deployment

Ansible playbooks can run steps in a defined order between roles, allowing you to be sure that, for example, the database is set up before the WWW servers start and attempt to create their database schemas. 

Why Recommend Ansible?

Ansible fills a very useful niche. It is good for companies that have grown to the point where configuration management and deploy tools would save time and help manage complexity, but not large enough to hire specialised Ruby-speaking configuration management experts.

It is good for companies that already have invested effort into an SSH infrastructure for inter-node communication. Because Ansible piggybacks onto your already-existing SSH communications infrastructure, there is no need to build large, complex server infrastructure and schedule client runs.

Finally, Ansible is good for companies that don't have a lot of time to invest into learning a new tool. There is no need to spend weeks reading tutorials and watching videos from conferences; the tool is radically simple. A few hours with the documentation, and you're off and running.

If you'd like to hear more about how Palomino can help you with installing and configuring Ansible, feel free to contact us! If you have experience with Ansible, feel free to comment, and let's start a dialogue.

Quick script to get list of all jmx beans

Recently I've needed to get information from running Cassandra processes, in order to determine which parameters to monitor. jconsole can be used for this, however if you're using a host on AWS, or in a bandwidth-limited environment, you might find that it takes way too long, especially if all you want to do for the moment is get a list of all beans and their attributes. jmxterm is a good solution for this - a command line interface to your jmx port. So here's a couple of utility scripts that you can use to get all the beans and each of their attributes using jmxterm. You may need to change $jar, $host, and $port in the perl script to fit your environment, and also maybe change the domains in the bash script. Then you should be able to simply run the bash script to get a list of each bean and its attributes. I found this useful when looking for the correct syntax to set up some nagios monitoring for cassandra. For example, here is a snippet of the output:
org.apache.cassandra.internal
.
.
.
Checking bean org.apache.cassandra.internal:type=FlushWriter
  %0   - ActiveCount (int, r)
  %1   - CompletedTasks (long, r)
  %2   - CurrentlyBlockedTasks (int, r)
  %3   - PendingTasks (long, r)
  %4   - TotalBlockedTasks (int, r)
.
.
.
And I was able to figure out the syntax for a nagios check:
check_jmx!$HOSTADDRESS$!7199!-O org.apache.cassandra.internal:type=FlushWriter -A CurrentlyBlockedTask
Hopefully, these scripts might be useful to someone else trying to query cassandra (or any java process that uses jmx). get_all_bean_info.sh:
#!/bin/bash
 
DOMAINS=(
org.apache.cassandra.db
org.apache.cassandra.internal
org.apache.cassandra.net
org.apache.cassandra.request
)
 
for domain in ${DOMAINS[@]}
do
    echo "-------------------"
    echo $domain
    output=$(./query_jmx.pl $domain 2>/dev/null | tr ' ' '+' | grep '=')
    for line in $output
    do
      bean=$(echo $line | tr '+' ' ')
      echo "Checking bean $bean"
      ./query_jmx.pl $domain $bean 2>/dev/null | grep -v "#" | grep -v "Got domain"
    done
done
query_jmx.pl:
#!/usr/bin/env perl
use strict;
my $jar = "/home/ubuntu/jmxterm-1.0-alpha-4-uber.jar";
my $host = "10.176.66.219";
my $port = 7199;
 
my $domain = shift @ARGV;
my @beans = ();
my $bean;
my $size;
for my $arg (@ARGV) {
  if ($arg =~ /^\w/) {
    push (@beans, $arg);
  } else {
    last;
  }   
}     
$size = @beans;
$bean = join(' ',@beans) if ($size > 0);
open JMX, "| java -jar $jar -n";
print JMX "open $host:$port\n";
print JMX "domain $domain \n";
if (defined $bean && length $bean > 0) {
  print JMX "bean $bean \n";
  print JMX "info \n";
} else {
  print JMX "beans \n";
}
 
print JMX "close\n";
close JMX;

Continuous Deployment Architecture For Ruby on Rails

 

We have seen agile development become more popular in recent years thanks in part to the evolution of continuous integration environments like Ruby on Rails.  These development frameworks leverage quick testing and the ability to easily deploy over clusters. As deployments happen more often, we look for ways to minimize the disruption to users. Cluster architecture already contains the components we need, with multiple servers for each role, allowing us to update a subset of the system while the rest serve the business.

We recently worked with the DevOps team at one of our clients to develop an architecture that allows them to run rolling changes through the cluster using the Ruby on Rails framework. There are two principal components of change: application code on the application servers and database structure on the database servers. The application servers can easily be targeted specifically through deploy.rb. The database side of things, however, is a bit more complicated.

In order to have zero downtime, half the application servers are taken offline and updated, then the pair of master/slave standby databases have the DDLs applied to them. Traffic is then switched to these servers.

HAProxy is put in between the application and the MySQL databaseservers to act as a router for database traffic (see http://palominodb.com/blog/2011/12/01/using-haproxy-mysql-failovers for specifics), providing the ability to flip the active and standby roles of the two server pairs (as well as providing a High Availability solution). Since there are slave relationships, we needed to be able to pause replication through the deployment application (Capistrano).  We were able to accomplish this by adding a file, deployrake.util, to the Rake subsystem under lib/tasks:

namespace :deployutil do

 desc 'Checks the replication status of the primary database'

 task :replication_test => :environment do

   # find the state of replication

    mysql_res = ActiveRecord::Base.connection.execute("SHOW SLAVE STATUS")

   mysql_res.each_hash do |row|

     if row['Slave_IO_Running'] == "Yes" and row['Slave_SQL_Running']

== "Yes" and row['Seconds_Behind_Master'].to_s == "0"

       puts "ReplicationGood"

     elsif row['Seconds_Behind_Master'].blank?

       puts "ReplicationBroken"

     else

       puts "ReplicationBehind_" + row['Seconds_Behind_Master'].to_s

     end

   end

   mysql_res.free

 end

 

 task :start_slave => :environment do

   ActiveRecord::Base.connection.execute("START SLAVE")

 end

 task :stop_slave => :environment do

   ActiveRecord::Base.connection.execute("STOP SLAVE")

 end

 

We can then create tasks in deploy.rb to call these

 desc "shows the current passive slave replication status"

 task :get_slave_replication_status, :roles => :cron do

   dbrails_env = fetch(:rails_env) + '_passiveslave'

   # find the state of replication

   set :slave_replication_status, capture("cd #{latest_release} ;

RAILS_ENV=#{dbrails_env} rake deployutil:replication_test").chomp

 end

 

 desc "stop passive slave replication"

 task :stop_passiveslave_repl, :roles => :cron do

   dbrails_env = fetch(:rails_env) + '_passiveslave'

   run "cd #{latest_release} ; RAILS_ENV=#{dbrails_env} rake

deployutil:stop_slave"

 end

 etc....

We also want to be able to limit changes to specific databases so that the changes won't go into the bin logs and propagate when the slaves are turned back on.  See http://palominodb.com/blog/2011/11/21/rails-and-database-session-variables for details on how to do this through an extension of ActiveRecord.  A word of caution here: setting sql_log_bin=0 to skip logging these changes will invalidate using the binlogs for point in time recovery.

You will need a full backup after the change.

Building libMemcached RPMs

A client running CentOS 5.4 Amazon EC2 instances needed the latest libMemcached version installed. With the inclusion of the "make rpm" target, libMemcached makes it easy to build the libMemcached RPMs by doing the following:

Spin up a new CentOS Amazon EC2 instance,

As root on the new instance:

yum install @development-tools
yum install fedora-packager
/usr/sbin/useradd makerpm

Now change to the makerpm user and build the RPMs:

su - makerpm
rpmdev-setuptree
tree
wget http://launchpad.net/libmemcached/1.0/1.0.2/+download/libmemcached-1.0.2.tar.gz
tar -zxf libmemcached-1.0.2.tar.gz
./configure && make rpm
find . -name '*rpm*'

References:
http://libmemcached.org
http://fedoraproject.org/wiki/How_to_create_an_RPM_package
https://launchpad.net/libmemcached/+download

Using HAProxy for MySQL failovers

There are a number of solutions available for MySQL architecture that provide automatic failover, but the vast majority involve potentially significant and complex changes to existing configurations. Fortunately, HAProxy can be leveraged for this purpose with a minimum of impact to the current system.  Alex Williams posted this clever solution a couple years ago in his blog.  Here we take a closer look at the details of implementing it into an already existing system.

HAProxy is a freeware load balancer, proxy, and high availability application for the TCP and HTTP protocols.  Since it was built mostly to handle web traffic, it has robust rule writing using HTTP components to check on the health of systems.  The key to Alex's solution was creating xinetd daemons on the database servers that send out HTTP messages.  The HTTP check to determine database statuses works like this:

  1. The HAProxy server sends HTTP requests to the database servers at configured intervals to specified ports
  2. The /etc/services file on the database servers maps those ports to services in their /etc/xinetd.d directory
  3. The services can call any specified script, so we build scripts that connect to the databases and checks for whatever conditions we choose.

The services then return an OK or a Service Unavailable response per the conditions.  Code for these scripts is in included in the Alex's article.

Our database configuration for this implementation is two pairs of Master-Slave databases in an Active-Passive relationship with Master-Master replication between the sets.  Siloing the passive Master-Slave provides a hot spare as well as continuous up-time during deployments provided we have a means of swapping the active/passive roles of each pair.  To accomplish the latter, we built two HAProxy configuration files, haproxy_pair1.cfg and haproxy_pair2.cfg, the only difference between the two being which Master-Slave pair is active.  Having the two files indicate which pair is active also allows immediate visibility of the current configuration.

Just as in Alex's sample configuration, our web application uses two DNS entries, appmaster and appslave to call writes and reads respectively.  The IPs for these addresses are then attached to our HAProxy server, allowing HAProxy to bind to them when it starts and then route them to the appropriate database server.

On our HAProxy server we then customized the /etc/init.d/haproxy script to handle an additional parameter of "flipactive" which provides us the capability to swap the database pairs:

flipactive() {
        # first detect which cfg file haproxy is using
        ps_out=`ps -ef |grep haproxy|grep cfg`
        # if pair2 then use pair1
        if [[ "$ps_out" =~ pair2 ]]  then
             # the -sf does a friendly reread of the config file
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair1.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        # if pair1 then use pair2
        else
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair2.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        fi
}

The rest of the configuration details are covered pretty well in Alex's article as well as in the HAProxy documentation.

We successfully developed and implemented this technique with the devops team at SlideShare last month to build rolling DDL scripting to multiple databases using Capistrano.  It allowed them to have explicit control over which database was being updated, thereby giving them the means necessary to update one database while other served the current application code.

Rails and Database Session Variables

Ruby's ActiveRecord provides a very robust means of manipulating database table structures especially in the context of automated deployments such as Capistrano.

When there is a master-slave database configuration, it may not always be prudent to perform DDL statements on the master and let those propagate through to the slave, especially in high-volume sites where Rails deployments may involve multiple migrations since replication lag may occur and present some significant problems.  In addition, when no downtime is specified, a DDL rollout may break the current application code, so a targeted deployment may be more prudent.

In MySQL, the solution would be to set the session variable SQL_LOG_BIN = 0 which causes subsequent statements not to get written to the binary log and therefore won't get replicated to slaves.  Unfortunately, the given connection options of ActiveRecord do not accommodate the setting of database options.  However, one way to accomplish setting this variable would be to have developers explicitly set it in the migration files:

class AddAccounts < ActiveRecord::Migration

   def self.up

     execute("set SESSION sql_log_bin=0")

     create_table :api_accounts do |t|

       t.column "user_id", :int

       t.column "name", :text

       t.column "account",:text

       t.column "created_at", :datetime

       t.column "updated_at", :datetime

     end

   end

   def self.down

     execute("set SESSION sql_log_bin=0")

     drop_table :api_accounts

  end

end

But since this would be a hassle for developers and is easily overlooked, we can leverage Capistrano's architecture to monkey patch the ActiveRecord::Migration class so that this variable is set whenever the migrations are invoked.  So we constructed a file, config/initializers/active_record.rb:

#opens the ActiveRecord::Migration class
#use alias_method to add functionality to the ActiveRecord::Migration.migrate method
class ActiveRecord::Migration
  class << self
     alias_method :migrate_conn, :migrate
     def migrate(direction)
        ActiveRecord::Base.connection_pool.with_connection do |conn|
        @connection = conn
        connection.execute("SET SQL_LOG_BIN=0")
        @connection = nil
        end
        migrate_conn(direction)
     end
  end
end
Note that the database user that invokes the migrations needs to have the SUPER privilege granted to it in order to be able to set this variable.

We successfully developed and implemented this technique with the devops team at SlideShare last month to build rolling DDL scripting to multiple databases using Capistrano.  It allowed them to have explicit control over which database was being updated, thereby giving them the means necessary to update one database while the other served the current application code.     

 

Syndicate content
Website by Digital Loom