Tools

Continuous Deployment Architecture For Ruby on Rails

 

We have seen agile development become more popular in recent years thanks in part to the evolution of continuous integration environments like Ruby on Rails.  These development frameworks leverage quick testing and the ability to easily deploy over clusters. As deployments happen more often, we look for ways to minimize the disruption to users. Cluster architecture already contains the components we need, with multiple servers for each role, allowing us to update a subset of the system while the rest serve the business.

We recently worked with the DevOps team at one of our clients to develop an architecture that allows them to run rolling changes through the cluster using the Ruby on Rails framework. There are two principal components of change: application code on the application servers and database structure on the database servers. The application servers can easily be targeted specifically through deploy.rb. The database side of things, however, is a bit more complicated.

In order to have zero downtime, half the application servers are taken offline and updated, then the pair of master/slave standby databases have the DDLs applied to them. Traffic is then switched to these servers.

HAProxy is put in between the application and the MySQL databaseservers to act as a router for database traffic (see http://palominodb.com/blog/2011/12/01/using-haproxy-mysql-failovers for specifics), providing the ability to flip the active and standby roles of the two server pairs (as well as providing a High Availability solution). Since there are slave relationships, we needed to be able to pause replication through the deployment application (Capistrano).  We were able to accomplish this by adding a file, deployrake.util, to the Rake subsystem under lib/tasks:

namespace :deployutil do

 desc 'Checks the replication status of the primary database'

 task :replication_test => :environment do

   # find the state of replication

    mysql_res = ActiveRecord::Base.connection.execute("SHOW SLAVE STATUS")

   mysql_res.each_hash do |row|

     if row['Slave_IO_Running'] == "Yes" and row['Slave_SQL_Running']

== "Yes" and row['Seconds_Behind_Master'].to_s == "0"

       puts "ReplicationGood"

     elsif row['Seconds_Behind_Master'].blank?

       puts "ReplicationBroken"

     else

       puts "ReplicationBehind_" + row['Seconds_Behind_Master'].to_s

     end

   end

   mysql_res.free

 end

 

 task :start_slave => :environment do

   ActiveRecord::Base.connection.execute("START SLAVE")

 end

 task :stop_slave => :environment do

   ActiveRecord::Base.connection.execute("STOP SLAVE")

 end

 

We can then create tasks in deploy.rb to call these

 desc "shows the current passive slave replication status"

 task :get_slave_replication_status, :roles => :cron do

   dbrails_env = fetch(:rails_env) + '_passiveslave'

   # find the state of replication

   set :slave_replication_status, capture("cd #{latest_release} ;

RAILS_ENV=#{dbrails_env} rake deployutil:replication_test").chomp

 end

 

 desc "stop passive slave replication"

 task :stop_passiveslave_repl, :roles => :cron do

   dbrails_env = fetch(:rails_env) + '_passiveslave'

   run "cd #{latest_release} ; RAILS_ENV=#{dbrails_env} rake

deployutil:stop_slave"

 end

 etc....

We also want to be able to limit changes to specific databases so that the changes won't go into the bin logs and propagate when the slaves are turned back on.  See http://palominodb.com/blog/2011/11/21/rails-and-database-session-variables for details on how to do this through an extension of ActiveRecord.  A word of caution here: setting sql_log_bin=0 to skip logging these changes will invalidate using the binlogs for point in time recovery.

You will need a full backup after the change.

Building libMemcached RPMs

A client running CentOS 5.4 Amazon EC2 instances needed the latest libMemcached version installed. With the inclusion of the "make rpm" target, libMemcached makes it easy to build the libMemcached RPMs by doing the following:

Spin up a new CentOS Amazon EC2 instance,

As root on the new instance:

yum install @development-tools
yum install fedora-packager
/usr/sbin/useradd makerpm

Now change to the makerpm user and build the RPMs:

su - makerpm
rpmdev-setuptree
tree
wget http://launchpad.net/libmemcached/1.0/1.0.2/+download/libmemcached-1.0.2.tar.gz
tar -zxf libmemcached-1.0.2.tar.gz
./configure && make rpm
find . -name '*rpm*'

References:
http://libmemcached.org
http://fedoraproject.org/wiki/How_to_create_an_RPM_package
https://launchpad.net/libmemcached/+download

Using HAProxy for MySQL failovers

There are a number of solutions available for MySQL architecture that provide automatic failover, but the vast majority involve potentially significant and complex changes to existing configurations. Fortunately, HAProxy can be leveraged for this purpose with a minimum of impact to the current system.  Alex Williams posted this clever solution a couple years ago in his blog.  Here we take a closer look at the details of implementing it into an already existing system.

HAProxy is a freeware load balancer, proxy, and high availability application for the TCP and HTTP protocols.  Since it was built mostly to handle web traffic, it has robust rule writing using HTTP components to check on the health of systems.  The key to Alex's solution was creating xinetd daemons on the database servers that send out HTTP messages.  The HTTP check to determine database statuses works like this:

  1. The HAProxy server sends HTTP requests to the database servers at configured intervals to specified ports
  2. The /etc/services file on the database servers maps those ports to services in their /etc/xinetd.d directory
  3. The services can call any specified script, so we build scripts that connect to the databases and checks for whatever conditions we choose.

The services then return an OK or a Service Unavailable response per the conditions.  Code for these scripts is in included in the Alex's article.

Our database configuration for this implementation is two pairs of Master-Slave databases in an Active-Passive relationship with Master-Master replication between the sets.  Siloing the passive Master-Slave provides a hot spare as well as continuous up-time during deployments provided we have a means of swapping the active/passive roles of each pair.  To accomplish the latter, we built two HAProxy configuration files, haproxy_pair1.cfg and haproxy_pair2.cfg, the only difference between the two being which Master-Slave pair is active.  Having the two files indicate which pair is active also allows immediate visibility of the current configuration.

Just as in Alex's sample configuration, our web application uses two DNS entries, appmaster and appslave to call writes and reads respectively.  The IPs for these addresses are then attached to our HAProxy server, allowing HAProxy to bind to them when it starts and then route them to the appropriate database server.

On our HAProxy server we then customized the /etc/init.d/haproxy script to handle an additional parameter of "flipactive" which provides us the capability to swap the database pairs:

flipactive() {
        # first detect which cfg file haproxy is using
        ps_out=`ps -ef |grep haproxy|grep cfg`
        # if pair2 then use pair1
        if [[ "$ps_out" =~ pair2 ]]  then
             # the -sf does a friendly reread of the config file
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair1.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        # if pair1 then use pair2
        else
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair2.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        fi
}

The rest of the configuration details are covered pretty well in Alex's article as well as in the HAProxy documentation.

We successfully developed and implemented this technique with the devops team at SlideShare last month to build rolling DDL scripting to multiple databases using Capistrano.  It allowed them to have explicit control over which database was being updated, thereby giving them the means necessary to update one database while other served the current application code.

Rails and Database Session Variables

Ruby's ActiveRecord provides a very robust means of manipulating database table structures especially in the context of automated deployments such as Capistrano.

When there is a master-slave database configuration, it may not always be prudent to perform DDL statements on the master and let those propagate through to the slave, especially in high-volume sites where Rails deployments may involve multiple migrations since replication lag may occur and present some significant problems.  In addition, when no downtime is specified, a DDL rollout may break the current application code, so a targeted deployment may be more prudent.

In MySQL, the solution would be to set the session variable SQL_LOG_BIN = 0 which causes subsequent statements not to get written to the binary log and therefore won't get replicated to slaves.  Unfortunately, the given connection options of ActiveRecord do not accommodate the setting of database options.  However, one way to accomplish setting this variable would be to have developers explicitly set it in the migration files:

class AddAccounts < ActiveRecord::Migration

   def self.up

     execute("set SESSION sql_log_bin=0")

     create_table :api_accounts do |t|

       t.column "user_id", :int

       t.column "name", :text

       t.column "account",:text

       t.column "created_at", :datetime

       t.column "updated_at", :datetime

     end

   end

   def self.down

     execute("set SESSION sql_log_bin=0")

     drop_table :api_accounts

  end

end

But since this would be a hassle for developers and is easily overlooked, we can leverage Capistrano's architecture to monkey patch the ActiveRecord::Migration class so that this variable is set whenever the migrations are invoked.  So we constructed a file, config/initializers/active_record.rb:

#opens the ActiveRecord::Migration class
#use alias_method to add functionality to the ActiveRecord::Migration.migrate method
class ActiveRecord::Migration
  class << self
     alias_method :migrate_conn, :migrate
     def migrate(direction)
        ActiveRecord::Base.connection_pool.with_connection do |conn|
        @connection = conn
        connection.execute("SET SQL_LOG_BIN=0")
        @connection = nil
        end
        migrate_conn(direction)
     end
  end
end
Note that the database user that invokes the migrations needs to have the SUPER privilege granted to it in order to be able to set this variable.

We successfully developed and implemented this technique with the devops team at SlideShare last month to build rolling DDL scripting to multiple databases using Capistrano.  It allowed them to have explicit control over which database was being updated, thereby giving them the means necessary to update one database while the other served the current application code.     

 

How to recreate an InnoDB table after the tablespace has been removed

Does your error log ever get flooded with errors like this one?

 

[ERROR] MySQL is trying to open a table handle but the .ibd file for
table my_schema/my_logging_table doesnot exist.
Have you deleted the .ibd file from thedatabase directory under
the MySQL datadir, or have you used DISCARD TABLESPACE?
See http://dev.mysql.com/doc/refman/5.0/en/innodb-troubleshooting.html how you can resolve the problem.

 

No? That is great!

 

We had a case where, in order to quickly solve a disk space issue, a SysAdmin decided to remove the biggest file in the filesystem, and of course this was an InnoDB table used for logging.

That is, he ran:

shell> rm /var/lib/mysql/my_schema/my_logging_table.ibd

He could have run TRUNCATE TABLE, but that's another story.

 

The results were not ideal:

  1. The table did not exist anymore.

  2. Errors in the application while trying to write to the table.

  3. MySQL flooding the error log.

 

The solution for this problem is to:

  • run DISCARD TABLESPACE ( InnoDB will remove insert buffer entries for that tablespace);

  • run DROP TABLE ( InnoDB will complaint that the .ibd file doesn't exist, but it will remove it from the internal data dictionary );

  • recover the CREATE TABLE statement from the latest backup ( you have backups, right? );

  • issue the CREATE TABLE statement to recreate the table.

 

Example:

mysql> ALTER TABLE my_logging_table DISCARD TABLESPACE;

Query OK, 0 rows affected (0.05 sec)

 

In the error log you will see something like:

InnoDB: Error: cannot delete tablespace 251
InnoDB: because it is not found in the tablespace memory cache.
InnoDB: Warning: cannot delete tablespace 251 in DISCARD TABLESPACE.
InnoDB: But let us remove the insert buffer entries for this tablespace.

 

mysql> DROP TABLE my_logging_table;

Query OK, 0 rows affected (0.16 sec)

 

In the error log you will see something like:

InnoDB: Error: table 'my_schema/my_logging_table'
InnoDB: in InnoDB data dictionary has tablespace id 251,
InnoDB: but tablespace with that id or name does not exist. Have
InnoDB: you deleted or moved .ibd files?
InnoDB: This may also be a table created with CREATE TEMPORARY TABLE
InnoDB: whose .ibd and .frm files MySQL automatically removed, but the
InnoDB: table still exists in the InnoDB internal data dictionary.
InnoDB: Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/innodb-troubleshooting.html
InnoDB: for how to resolve the issue.
InnoDB: We removed now the InnoDB internal data dictionary entry
InnoDB: of table `my_schema/my_logging_table`.

 

And finally:

mysql> CREATE TABLE `my_logging_table` (
(... omitted ...)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Query OK, 0 rows affected (0.05 sec)

 

Of course, the final step - is a stern talking to with the SysAdmin.

Getting a List of Users From the MySQL General Log

From time to time, organizations want to know if there are any users that are not used.  For clients using MySQL 5.1 and up, that can handle a 1% overhead for general logging, we will enable the general log for a period of time and analyze the users that connect.

Note: we have some extremely busy clients, and we very rarely have a problem turning the general log on, other than making sure we're rotating and compressing logs so we do not run out of disk space.

Once we have the logs, I run this little perl tool I made -- I call it genlog_users.pl:

#!/usr/bin/perl

 

my $infile=$ARGV[0];

my %seen=();

my @uniq=();

 

open (INPUT, "<$infile");

while (<INPUT>) {

  my $line=$_;

  if ($line=~/Connect/) {

    if ($line=~/(\S*@\S*)/) { push(@uniq, $1) unless $seen{$1}++; }

 } # end if line matches Connect

}

close INPUT;

open (OUTPUT, ">>..users.txt");

$,="\n";

print OUTPUT (keys %seen);

print OUTPUT ("\n");

close OUTPUT;

 

----------

I hope it is useful for whoever stumbles on this; I know it has been useful for me in the past -- it's just doing some string matching, and I bet if I used Python it would be done in half the lines, but it's already fewer than 20 lines, so it's pretty small to begin with.

PalominoDB Percona Live: London Slides are up!

 

Percona Live: London was a rousing success for PalominoDB.  I was sad that I could not attend, but I got a few people who sent "hellos" to me via my coworkers.  But on to the most important stuff -- slides from our presentations are online!

René Cannao spoke about MySQL Backup and Recovery Tools and Techniques (description) slides (PDF)

 

Jonathan delivered a 3-hour tutorial about Advanced MySQL Scaling Strategies for Developers (description) slides (PDF)

Enjoy!

Automatic Downtimes in Nagios Without Using Time Periods

Monitoring systems are a great thing, and we rely heavily on Nagios here at PalominoDB.  We also rely heavily on xtrabackup for (mostly) non-locking, "warm" backups of MySQL.  In order to get a consistent backup with a proper binlog position on a slave, xtrabackup stops replication for a short period of time.  If the monitoring system catches this, the pager will go off, and usually in the middle of the night.

Nagios can have specific windows when it is on or off for a particular host or service, but you have to remember that when you change the time the backup runs.  I prefer to call a script from cron, just before I call the backup script, so that I can easily see that they are related.

Note that this script works even if you have Nagios password protected with .htaccess, because wget allows for a username and password.  This script is not perfect - there is not a lot of error checking, and log files are hard-coded.  But it does what it needs to.

The script is called like this:

./scheduleDowntimeByService service length host

Sample cron entries (note that the service name depends on what you are checking and the service names themselves):

00 5 * * 0 /home/palominodb/bin/scheduleDowntimeByService.pl MySQL+Replication+Lag 3600 hostname

 

00 5 * * 0 /home/palominodb/bin/scheduleDowntimeByService.pl MySQL+Slave+IO+Thread 3600 hostname

 

00 5 * * 0 /home/palominodb/bin/scheduleDowntimeByService.pl MySQL+Slave+SQL+Thread 3600 hostname

Here is the script itself:

 

cat scheduleDowntimeByService.pl 

#!/usr/bin/perl

use strict;

use POSIX;

 

my $service = shift @ARGV;

my $length = shift @ARGV;

my $host = shift @ARGV;

 

unless ($length ) {

        $length = 3600;

}

 

 

my $startTime = time();

my $endTime = $startTime + $length;

 

my $nagios_start = POSIX::strftime("%m-%d-%Y %H:%M:00", localtime($startTime));

my $nagios_end = POSIX::strftime("%m-%d-%Y %H:%M:00", localtime($endTime));

 

$nagios_start =~ s@:@%3A@g;

$nagios_start =~ s@-@%2D@g;

$nagios_start =~ s@ @%20@g;

$nagios_end =~ s@:@%3A@g;

$nagios_end =~ s@-@%2D@g;

$nagios_end =~ s@ @%20@g;

 

my $URL = 'https://monitoring.company.com/nagios//cgi-bin/cmd.cgi?fixed=1&start_time=' .  $nagios_start . '&end_time=' . $nagios_end . '&cmd_typ=56&cmd_mod=2&host=' . $host . '&service=' . $service . '&com_data=Backups&btnSubmit=Commit';

 

my $cmd = "/usr/bin/wget --quiet --user=user --password=PASS -O /tmp/nagios_downtime.html '$URL'"

;

open ( L, ">>/tmp/nagios_downtime.log" );

print L print $cmd . "\n";

print L `$cmd`;

close L;

Exporting the mysql.slow_log table into slow query log format

Using pt-query-digest is an excellent way to perform a SQL review. However, sometimes you don't have access to the slow_query_log_file. For example, when MySQL runs on Amazon RDS, the slow_query_log_file is unavailable (see the RDS FAQ). To get around this, export the mysql.slow_log table.

To export the data, run the following SQL command from an instance that has database access. The data is exported into a MySQL slow-log format, which pt-query-digest can analyze:

mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT( '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H%i%s'), '\n', '# User@Host: ', user_host, '\n', '# Query_time: ', TIME_TO_SEC(query_time),  '  Lock_time: ', TIME_TO_SEC(lock_time), '  Rows_sent: ', rows_sent, '  Rows_examined: ', rows_examined, '\n', sql_text, ';' ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log

The -s (--silent) option is used to suppress echoing of the query.

The -r (--raw) option is used to disable character escaping making \n into an actual newline, otherwise it's echoed as '\n'.

Once the export is complete, run pt-query-digest to do the analysis. A simple review command is:

pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt

query-digest.txt is now ready for review.

Important Caveats When Using pt-online-schema-change

 

Using pt-online-schema-change is a great way to change a table without downtime or switching over to a secondary database first.  While using it on production systems, we found some interesting issues to consider.

 

1) There were foreign_keys referencing the table we were altering.  There are two options to handle this.  They are specified with "--update-foreign-keys-method rebuild_constraints|drop_old_table".  One is to rebuild the tables with foreign keys.  This does lock the tables, and will block reads and writes while it is happening.  I chose to use drop_old_table.  This method disables foreign key checks, then drops the original table and renames the new table before reenabling foreign key checks.  Normally, pt-online-schema-change just renames the tables, then drops the old table, so locking is minimal.  We did find that there was locking while dropping the original table and renaming the new table.  This is something to be cautious with, especially with large tables.  In theory, you could run an online change to drop the foreign key constraints on the child tables prior to working on the parent table.

 

2) When testing, this ran in 15 minutes against an unladen database and took over 24 minutes against production.  You could add --sleep n, where n is a number of seconds, if you need it to add less load to production.  

 

3) One limit we also saw is that pt-online-schema-change will only work on tables with at least one unique single column index, such as a single column primary key.  Often partitioned tables don't have a unique single column index.

 

 

Here is some of the output of a sample pt-online-schema-change...

 
 
## Install pt-online-schema-change and pre-requisites...
wget percona.com/get/pt-online-schema-change
chmod 755 pt-online-schema-change
yum install perl-DBI
yum install perl-DBD-MySQL
 
time ./pt-online-schema-change h=127.0.0.1,D=my_schema,t=my_table -P3306 --alter "add column signature varbinary(255) NULL DEFAULT NULL; add column signature_version varchar(4) NULL DEFAULT NULL" --child-tables fk_table1,fk_table2 --update-foreign-keys-method drop_old_table -pXXXX --bin-log
# 2011-10-14T03:20:28 ./pt-online-schema-change started
# 2011-10-14T03:20:28 USE `my_schema`
# 2011-10-14T03:20:28 Alter table my_table using temporary table __tmp_my_table
# 2011-10-14T03:20:28 Checking if table my_table can be altered
# 2011-10-14T03:20:28 SHOW TRIGGERS FROM `my_schema` LIKE 'my_table'
# 2011-10-14T03:20:28 Table my_table can be altered
# 2011-10-14T03:20:28 Chunk column id, index PRIMARY
# 2011-10-14T03:20:28 Chunked table my_table into 25685 chunks
# 2011-10-14T03:20:28 User-specified child tables: fk_table1, fk_table2
# 2011-10-14T03:20:28 Starting online schema change
# 2011-10-14T03:20:28 CREATE TABLE `my_schema`.`__tmp_my_table` LIKE `my_schema`.`my_table`
# 2011-10-14T03:20:28 ALTER TABLE `my_schema`.`__tmp_my_table` add column signature varbinary(255) NULL DEFAULT NULL
# 2011-10-14T03:20:29 ALTER TABLE `my_schema`.`__tmp_my_table`  add column signature_version varchar(4) NULL DEFAULT NULL
# 2011-10-14T03:20:30 Shared columns: id, col1, col2, col3, col4, col5
# 2011-10-14T03:20:30 Calling OSCCaptureSync::capture()
# 2011-10-14T03:20:30 CREATE TRIGGER mk_osc_del AFTER DELETE ON `my_schema`.`my_table` FOR EACH ROW DELETE IGNORE FROM `my_schema`.`__tmp_my_table` WHERE `my_schema`.`__tmp_my_table`.id = OLD.id
# 2011-10-14T03:20:30 CREATE TRIGGER mk_osc_upd AFTER UPDATE ON `my_schema`.`my_table` FOR EACH ROW REPLACE INTO `my_schema`.`__tmp_my_table` (id, col1, col2, col3, col4, col5) VALUES (NEW.id, NEW.col1, NEW.col2, NEW.col3, NEW.col4, NEW.col5)
# 2011-10-14T03:20:30 CREATE TRIGGER mk_osc_ins AFTER INSERT ON `my_schema`.`my_table` FOR EACH ROW REPLACE INTO `my_schema`.`__tmp_my_table` (id, col1, col2, col3, col4, col5) VALUES(NEW.id, NEW.col1, NEW.col2, NEW.col3, NEW.col4, NEW.col5)
# 2011-10-14T03:20:30 Calling CopyRowsInsertSelect::copy()
Copying rows:   1% 24:52 remain
Copying rows:   3% 25:01 remain
...
Copying rows:  96% 00:43 remain
Copying rows:  98% 00:15 remain
# 2011-10-14T03:44:14 Calling OSCCaptureSync::sync()
# 2011-10-14T03:44:14 Renaming foreign key constraints in child table
# 2011-10-14T03:44:14 SET foreign_key_checks=0
# 2011-10-14T03:44:14 DROP TABLE `my_schema`.`my_table`
# 2011-10-14T03:44:49 RENAME TABLE `my_schema`.`__tmp_my_table` TO `my_schema`.`my_table`
# 2011-10-14T03:44:49 Calling CopyRowsInsertSelect::cleanup()
# 2011-10-14T03:44:49 Calling OSCCaptureSync::cleanup()
# 2011-10-14T03:44:49 DROP TRIGGER IF EXISTS `my_schema`.`mk_osc_del`
# 2011-10-14T03:44:49 DROP TRIGGER IF EXISTS `my_schema`.`mk_osc_ins`
# 2011-10-14T03:44:49 DROP TRIGGER IF EXISTS `my_schema`.`mk_osc_upd`
# 2011-10-14T03:44:49 ./pt-online-schema-change ended, exit status 0
 
real    24m20.777s
user    0m3.936s
sys     0m1.216s

Syndicate content
Website by Digital Loom