Blog

Insights from the PgCon 2013

PgCon 2013 was attended by 256 people from across the globe.  Attendees had the opportunity to enjoy tutorials, talks and an excellent unconference (this last deserves a special mention).

I gave a talk on full text search using Sphinx and Postgres (you can find the slides at http://t.co/lgFoLq37EC, and all of the talks were recorded).  The quality of the talks was in general quite good, but I don't want to repeat what you will find in other posts.

The unconference ran quite late into the evening. You can find its schedule, as well as the minutes of some of the talks that happened (and others that didn't), here.

There was a special emphasis on the pluggable storage feature, although most agree that it will be a very difficult feature to implement in the next few versions. A related topic was the Foreign Data Wrapper enhancements.

The pluggable storage discussion was extended afterwards. The main reason everybody agrees with this feature is that a storage API would allow companies to contribute code rather than maintaining forks of the project.

There was also a long hallway discussion about migrations using pg_upgrade.

The replication features discussed were bi-directional replication and logical replication.

The full text search unconference discussion was pretty interesting. Oleg Bartunov and Alexander showed some really interesting upcoming work on optimizing GIN indexes. According to their benchmarks, Postgres could improve performance significantly.

There were a lot of discussions I missed, due to the wide number of tracks and "hall spots". But the majority of attendees I heard from agreed that the unconference was quite exciting and generated many new ideas.

Supporting Feminism in Technology - Part 1

I've been contemplating the topic of feminism and misogyny in the technology field a great deal of late.  This blog post is a culmination of a significant amount of thought and reflection on the topic of women in technology.  At Palomino, I focus a lot on the values around bringing underserved populations into technology.  Women, people from working class or impoverished backgrounds, people who are gay, lesbian or transgendered, and people of Latino and African American backgrounds are traditionally highly underrepresented in the US technological workforce.  Palomino, even being a woman-owned business, is not exempt from this issue.  One of my goals is to build up not just Palomino's DBA and engineer population to reflect higher percentages of these populations, but to support more people in the entire community in having these opportunities.  I'd like to focus on gender in this conversation, though I have much to say on the topics of race and class as well.  In fact, they are all interrelated.


To dig into the topic, one of the first things to consider, and that people ask, is why this is a big deal.  And at the top level, without any time put into consideration, I can see why someone might feel this way.  After all, if a DBA is good, why does their gender, race or background matter?  And, if you are simply considering the output of an individual or organization, this is pretty true.  But, there is more.  My hope is that people who focus on the importance of open source software and free access to technology would understand the importance of building larger populations of women engineers and administrators, but that has not proven to be true.


The fact is, that on a macro level, one of the largest ways to get more people from underserved populations into jobs such as database administration, infrastructure architecture and software engineering is to provide them with mentors and role models who have already broken through the barriers to make it.  And, we do exist.  Palomino and Blue Gecko were both built by women.  Oracle's Ace and Ace Director list has about 12 (of 390) women on it.  I have been meeting more people of color and women working on senior teams of clients.  I do see potential role models and mentors out there.  But, don't get too excited.  There are still plenty of  opportunities for improving how we build welcoming workplaces for the talented and diverse engineers already out there making their way in the field.  


There's also the selfish part of the equation.  More often than not, when I find women and people of color in the wild, with successful records as engineers and operators, they tend to be extremely good at their jobs.  They tend to be excellent communicators, good with clients, detailed with project planning and highly technically competent.  Is this because of more innate talent?  No.  It is because the amount of willpower, inner strength, self-confidence and chutzpah required for these people to succeed is much higher than it is for the predominant demographic of engineers.


1. Mindfulness in Language and Communication - I find this is particularly true in remote workforces, such as Palomino, where the entire culture is often built around word choice and expression of ideas.  There are obvious cases, such as how often people start email threads with "Gentlemen".  Then, there is simply the propagation of cultures around masculinity, or "brogramming".  This is more delicate.  After all, there are plenty of women, myself included, who enjoy conversations around traditionally masculine pursuits and endeavors.  I lean more towards not an exclusion of topic, but an inclusion and mindfulness of those who might feel left out.  Do you ask women about their favorite sports teams?  Do you keep an eye out for folks who might retreat from certain topics and adjust your conversations accordingly?  Shutting off social conversation is not generally helpful, but as leaders in organizations, it is a responsibility to help guide conversations to be as inclusive and supportive as possible to all staff.  And of course, any traditionally sexist, racist or classist conversations need to be privately nipped in the bud immediately as a matter of course.  Creating space for other conversations outside of traditionally masculine ones to occur is also critical.  Ask people who are not from the dominant race/class/gender in your organization about their weekends and pastimes.  Don't assume a woman is interested in knitting, but give her a chance to express what she likes.  She might surprise you and your team with the diverse range of interests that might be brought up.


2. Examine the Gendered Roles and Behaviors - Go to most tech sites and look at their team pages.  I'm willing to bet that if you are looking at client facing positions that require emotional intelligence and empathy, you will find more women than in the technical fields.  Palomino is no exception.  Our project and account management teams are all female.  Our office manager is male, however.  Ultimately, I don't recommend the policing of the gender of individual roles, but I do believe it's important to examine key expectations and behaviors around staff.  For instance, it is common practice to assume engineers and administrators do not have the emotional/social capacity to interact with users/clients.  So, organizations put account managers or project managers in between, who are often female and thus considered more socially and emotionally adept.  Rarely is it considered a priority to encourage the technical staff to step up, improve their soft skills such as empathy and to interact directly with the client base.  Instead, we build a culture of mothering, which is harmful to all parties involved.


Additionally, do we value the roles that are more empathetic, client facing and emotionally intelligent?  People always discuss how hard it is to retain and find good DBAs, and their salaries, power and "catering to" reflect this in the organization.  While a good PM may not be as hard to find, they are still just as valuable to an organization.  Do you take these roles for granted, or do you also make them feel as important, valued and encouraged as your more technical staff?  Do you let mediation fall to these same people, or do you encourage all staff to develop their skills in negotiation and conflict resolution?  


3. The Devil is in the Details - At the recent Percona Live conference, T-Shirts were given to all attendees.  When asked if there were women's sizes, the organizers stated they were unisex.  Unisex is not actually unisex.  It is men's, and not designed for women's bodies.  These details, while not large individually, add up to a feeling of being an add-on; just as much as lack of kosher meals, or wheelchair ramps far from the main entrance can cause one to feel like an afterthought.  Take the extra step to define and socialize your diversity policy and your code of conduct.  O'Reilly has a great code of conduct at http://oreilly.com/conferences/code-of-conduct.html.  Note that defining the code of conduct or the diversity policy is not enough.  You need to talk to people about these things and engage them.  When you are discussing policies around employees, or evaluating a new client, think about how this fits into your policies.  When you are planning a company offsite, organizing a conference or writing a blog post, think about these policies.  Who will be involved or affected by your choices?  What can you do to make them feel more included?  Take the time to really think about this.


4. Recruiting - This is a challenging position, and one that I've had to consider for quite some time.  At Palomino, I'd say we get perhaps 1 out of 20 applicants who are women via our natural model of letting people come to us via word of mouth.  That is obviously a horrible ratio.  Too often, people just say "well, if women don't interview, how can we hire them?".  That's a cop-out.  Most hiring managers know that you don't get A players from a passive recruiting strategy.  This is just as true for getting women to interview for technical positions.  You need to spend time going to events such as the ADA Initiative Unconference (http://sf.adacamp.org), Women Powering Technology Summit (http://www.witi.com) and sponsoring, speaking and getting involved.  There are numerous meetups, from Girls Who Code in NYC to Girls in Tech in Las Vegas and Women in Tech in SF.  Additionally, you should be going through LinkedIn to find women and contacting them.  Even if they are not interested, by building a network that includes more and more women, you are improving the possibility that you will find the right women for your organization.  Get out there and speak at meetups, start some introductory courses for women coming out of college and continue to build that network.  There is no reason to stay at a 5% rate of interviews, but you have to work!


This is part 1 in 2 parts.  I'd like to focus next on some ways in which dialogue around the conversations can go wrong, and how to discuss and respond to conversations around feminism and misogyny in a constructive manner.  I do look forward to feedback and conversations around the topic, and I thank you for your time in reading and considering this.

Benchmarking Postgres on AWS 4,000 PIOPs EBS instances

Introduction

Disk I/O is frequently the performance bottleneck with relational databases. With AWS recently releasing 4,000 PIOPs EBS volumes, I wanted to do some benchmarking with pgbench and PostgreSQL 9.2. Prior to this release the maximum available I/O capacity was 2,000 IOPs per volume. EBS IOPs are read and written in 16KB chunks, with their performance limited by both the I/O capacity of the EBS volumes and the network bandwidth between an EC2 instance and the EBS network. My goal isn't to provide a PostgreSQL tuning guide, an EC2 tuning guide, or a database deathmatch complete with graphs; I'll just be displaying what kind of performance is available out-of-the-box without substantive tuning. In other words, this is an exploratory benchmark, not a comparative benchmark. I would have liked to compare the performance of 4,000 PIOPs EBS volumes with 2,000 PIOPs EBS volumes, but I ran out of time, so that will have to wait for a following post.

Setup

Region

I conducted my testing in AWS' São Paulo region. One benefit of testing in sa-east-1 is that spot prices for larger instances are (anecdotally) more stable than in us-east. Unfortunately, sa-east-1 doesn't have any cluster compute (CC) instances available. CC instances have twice the bandwidth to the EBS network than non-CC EC2 instances. That additional bandwidth allows you to construct larger software RAID volumes. My cocktail napkin calculations show that it should be possible to reach 50,000 PIOPs on an EBS-backed CC instance without much of a problem.

EC2 instances

I tested with three EC2 instances: an m1.large from which to run pgbench, an m2.2xlarge with four EBS volumes, and an m1.xlarge with one EBS volume. All EBS volumes are 400GB with 4,000 provisioned IOPs. The m1.large instance was an on-demand instance; the other instances  — the pgbench target database servers — were all spot instances with a maximum bid of $0.05. (In one case our first spot instance was terminated, and we had to rebuild it). Some brief testing showed that having an external machine driving the benchmark was critical for the best results.

Operating System

All EC2 instances are running Ubuntu 12.10. A custom sysctl.conf tuned the Sys V shared memory as well as set swappiness to zero and memory overcommit to two.

kernel.shmmax = 13355443200
kernel.shmall = 13355443200
vm.swappiness = 0
vm.overcommit_memory = 2

Packages

The following packages were installed via apt-get:

  • htop
  • xfsprogs
  • debian-keyring
  • mdadm
  • postgresql-9.2
  • postgresql-contrib-9.2

In order to install the postgresql packages a pgdb.list file containing

deb http://apt.postgresql.org/pub/repos/apt/ squeeze-pgdg main

was placed in /etc/apt/sources.list.d and the following commands were run:

gpg --keyserver pgp.mit.edu --recv-keys ACCC4CF8
gpg --armor --export ACCC4CF8 | apt-key add -
apt-get update

RAID and Filesystems

For the one volume instance, I simply created an XFS file system and mounted it on /mnt/benchmark.

mkdir /mnt/benchmark
mkfs.xfs /dev/xvdf
mount -t xfs /dev/xvdf /mnt/benchmark
echo "/dev/xvdf    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab

For the four volume instance it was only slightly more involved. mkfs.xfs analyzes the underlying disk objects and determines the appropriate stripe unit and stripe width values. Below are the commands for assembling a four volume mdadm software RAID array that is mounted on boot (assuming you've attached the EBS volumes to your EC2 instance). Running dpkg-reconfigure rebuilds the initrd image so the array is assembled at boot.

mkdir /mnt/benchmark
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
mkfs.xfs /dev/md0
echo "/dev/md0    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab
dpkg-reconfigure mdadm

Benchmarking

pgbench is a utility included in the postgresql-contrib-9.2 package. It approximates the TPC-B benchmark and can be looked at as a database stress test whose output is measured in transactions per second. It involves a significant amount of disk I/O with transactions that run for relatively short amounts of time. vacuumdb was run before each pgbench iteration. For each database server, pgbench was run mimicking 16 clients, 32 clients, 48 clients, 64 clients, 80 clients, and 96 clients. At each of those client values, pgbench iterated ten times, in steps of 100, from 100 to 1,000 transactions per client. It's important to realize that pgbench's stress test is not typical of a web application workload; most consumer-facing web applications could achieve much higher rates than those mentioned here. The only pgbench results against AWS/EBS volumes that I'm aware of (or could quickly Google) are from early 2012 and, at their best, achieve rates 50% slower than the lowest rates found here. I drove the benchmark using a very small, very unfancy bash script. An example of the pgbench command line would be:

pgbench -h $DBHOST -j4 -r -Mextended -n -c48 -t600 -U$DBUSER
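
The driver script itself isn't reproduced in this post, but a minimal sketch of the kind of loop described above might look like the following (DBHOST, DBUSER and DBNAME are placeholders, and the exact script used may well have differed):

#!/bin/bash
# Hypothetical pgbench driver: loop over the client counts and the
# transactions-per-client steps described above. DBHOST, DBUSER and
# DBNAME are placeholders for the target server, role and database.
DBHOST=10.0.0.10
DBUSER=postgres
DBNAME=pgbench

for clients in 16 32 48 64 80 96
do
    for txns in 100 200 300 400 500 600 700 800 900 1000
    do
        # vacuum before each pgbench iteration, as noted above
        vacuumdb -h $DBHOST -U $DBUSER -z $DBNAME
        pgbench -h $DBHOST -U $DBUSER -j4 -r -Mextended -n \
            -c$clients -t$txns $DBNAME >> pgbench_${clients}c_${txns}t.log
    done
done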

m1.xlarge with single 4,000 PIOPs volume

The maximum transaction volume for this instance was reached when running below 48 concurrent clients and under 500 transactions per client. While the transaction throughput never dropped precipitously at any point, loads outside of that range exhibited varying performance. Even at its worst, though, this instance handled between 600-700 transactions/second.

m2.2xlarge with four 4,000 PIOPs volumes

I was impressed; at no point did the benchmark stress this instance — the tps rate was between 1700-1900 in most situations, with peaks up to 2200 transactions per second. If I were asked to blindly size a "big" PostgreSQL database server running on AWS, this is probably where I would start. It's not so large that you have operational issues like worrying about MTBFs for ten-volume RAID arrays or trying to snapshot 4TB of disk space, but it is large enough to absorb a substantial amount of traffic.

Graphs and Tabular Data

single-4K-volume tps

The spread of transactions/second irrespective of the number of clients.

[Figure: box plot of transactions per second, single 4K volume]

Data grouped by number of concurrent clients, with each bar representing an increase of 100 transactions per client, ranging from 100 to 1,000.

[Figure: bar graph of transactions per second grouped by concurrent clients, single 4K volume]

Progression of tps by individual level of concurrency. The x-axis tick marks measure single pgbench runs from 100 transactions per client to 1,000 transactions per client.

[Figure: six subgraphs of transactions per second, one per level of concurrency, single 4K volume]

Raw tabular data

txns/client    100   200   300   400   500   600   700   800   900  1000
16 clients    1455  1283  1183   653  1197   533   631  1009   923   648
32 clients    1500  1242  1232   757   747   630  1067   665   688   709
48 clients     281   864   899   705  1029   749   736   593   766   641
64 clients     944  1281   704  1010   739   596   778   662   820   612
80 clients     815   893  1055   809   597   801   684   708   736   663
96 clients     939   889   774   772   798   682   725   662   776   708

four-4,000-PIOPs-volumes tps

Again, a box plot of the data with a y-axis of transactions/second.

[Figure: box plot of transactions per second, four 4,000 PIOPs volumes]

Grouped by number of concurrent clients, between 100 and 1,000 transactions per client.

[Figure: bar graph of transactions per second grouped by concurrent clients, four 4,000 PIOPs volumes]

TPS by number of concurrent clients. The x-axis ticks mark pgbench runs progressing from 100 transactions per client to 1,000 transactions per client.

[Figure: six subgraphs of transactions per second, one per level of concurrency, four 4,000 PIOPs volumes]

Tabular data m2.2xlarge with four 4,000 PIOPs EBS volumes

txns/client    100   200   300   400   500   600   700   800   900  1000
16 clients    1487  1617  1877  1415  1388  1882  1897  1771  1267  1785
32 clients    1804  2083  2160  1791  1259  1997  2230  1501  1717  1918
48 clients    1810  2152  1296  1951  2117  1775  1709  1803  1817  1847
64 clients    1810  1580  1568  2056  1811  1784  1849  1909  1942  1658
80 clients    1802  2044  1467  2142  1645  1896  1933  1740  1821  1851
96 clients    1595  1403  2047  1731  1783  1859  1708  1896  1751  1801

PalominoDB at an industry event near you!

Find the Palomino Team at an event near you in 2013!

New York’s Effective MySQL Meetup Group, March 12, 2013

As New York’s only active MySQL meetup group, NY Effective MySQL states its purpose is to share practical education for MySQL DBAs, Developers and Architects.  At their next meeting, on March 12 at 6:30 PM, Laine Campbell, CEO & Principal of PalominoDB, will be the evening’s presenter, and her topic will be "RDS Pitfalls. Ways it's going to screw you. (And not in the nice way)".  Speaking from her own experience, Laine will explain the Amazon RDS offering, its patterns and anti-patterns, and its gotchas and idiosyncrasies.  To learn more about the NY group and Laine’s presentation, please click here.

NYC* Tech Day - Wednesday, March 20, 2013

Join NYC* Tech Day and take a deep dive into Apache Cassandra™, the massively scalable NoSQL database! This two-track event will feature over 14 interactive sessions, delivered by Apache Cassandra experts. Come see our CTO Jay Edwards at the Meet the Experts area. Or just drop by our table and talk to Jay and PDB staff. For more info click here!

Percona Live MySQL Conference in Santa Clara April 22-25, 2013

PalominoDB will once again host a booth at this year’s Percona Live event.  With 110 sessions and over 90 speakers, Percona promises a fantastic event that you won't want to miss!  This year several of Palomino’s own will be presenters. Read on....

In order of their appearance:

On the first day of the conference, April 22nd, Rene Cannao will kick off a full-day tutorial beginning at 9:30 AM with Part 1 of Ramp-Tutorial for MySQL Cluster - Scaling with Continuous Availability.  After a lunch break, Rene will continue with Part 2 beginning at 1:30.  Rene, a Senior Operational DBA at Palomino, will guide attendees through a hands-on experience in the installation, configuration management and tuning of MySQL Cluster.  To see the agenda of topics covered in this exceptional offering, please click here.

Also on April 22nd from 9:30 AM - 12:30 PM, Jay Edwards and Ben Black will be presenting an in-depth tutorial: MySQL Patterns in Amazon - Make the Cloud Work For You.  Jay and Ben will show you how to build your MySQL environment in the cloud -- how to maintain it -- how to grow it -- and how to deal with failure.  You may want to get there early to be sure you get a seat!  Want more info on this hot topic?  Check out more on this topic.

Meet our European Team lead, Vladimir Fedorkov, on April 23rd at 2:20 PM, when his topic will be MySQL Query Anti-Patterns That Can Be Moved to Sphinx.  Vlad will be discussing how to handle query bottlenecks that can result from increases in dataset size and traffic.  Click here to find out more.

Also on the 23rd, at 4:50 PM, Ben Black will be back to speak on MySQL Administration in Amazon RDS.  This should be a great session for those attendees new to this tool, as Ben will cover common tasks in RDS and gotchas for DBAs who are new to RDS.  Check out more on this topic.

On April 24th at 1PM Mark Filipi will present Maximizing SQL Reviews and Tuning with pt-query-digest.  pt-query-digest is one of the more valuable components of the Percona Toolkit, and as one of our operational DBAs, Mark will be approaching his topic with an eye to real world experiences.  Read more about it by following this link.

Also on the 24th at 1:50 PM, Ben Black and David Turner will tag-team the topic Online Schema Changes for Maximizing Uptime.  Together they will  cover common operations implemented in production and how you can minimize downtime and customer impact.  Here’s a link for more info on this.

PGCon May 2013

One of our Operational Database Administrators, Emanuel Calvo, will be a presenter at the PGCon 2013 PostgreSQL Conference on May 23, 2013 at 1 PM.  His topic will be Sphinx and Postgres - Full Text Search extension.  Emanuel will be discussing how to integrate these two tools for optimum performance and reliability. Want to learn more about Emanuel and the conference? Click Here

Velocity June 2013 Web Performance and Operation Conference

Velocity 2013 will be held in Santa Clara from June 18 through the 20th. Organizers tout this event  as “the best place on the planet for web ops and performance professionals like you to learn from your peers, exchange ideas with experts, and share best practices and lessons learned.” Palomino’s Laine Campbell and Jay Edwards will both be presenters on the conference’s opening day, June 18th.

Laine Campbell, PalominoDB’s CEO, will be offering Using Amazon Web Services for MySQL at Scale on Tuesday, June 18 at 1:30 PM.  The session promises “a deep dive into the AWS offerings for running MySQL at scale.”  Laine will guide attendees through various options for running MySQL at high volumes at Amazon Web Services. More on Laine’s presentation is available through this link.

Later that same day at 3:30 PM, Jay Edwards, Palomino’s CTO, will present  Managing PostgreSQL with Ansible in EC2.  He will discuss how and why Ansible is a next generation configuration management system that includes deployment and ad hoc support as well. Find out more on Jay’s topic and presentation here. 

Couchbase Smart Client Failure Scenarios

The Couchbase Java smart client communicates directly with the cluster to maintain awareness of data location. To do so, it gathers information about the cluster architecture starting from a manually maintained list of all the nodes. The smart client configuration is done within the Java code and does not have a pre-designated file, while the Moxi configuration is generally installed at /opt/moxi/etc/moxi-cluster.cfg.

Assuming the smart client is on a separate server from the affected node there are two situations where communication between the client and a specific node might be interrupted.

In the first scenario, a node may fail. If so, the rest of the cluster will detect that from standard heartbeat checks, which are built in to Couchbase, and map its data to the replica nodes. The smart client is informed of the remappings and should be able to find all identified data again. There are known bugs with some client versions (e.g. 1.0.3) -- if you experience timeouts with the client, be sure you’re using the latest build. We also recommend that you use autofailover and that you test your email alerts. You must manually rebalance after recovery; this does not happen on its own.

In the second and more common scenario a network or DNS outage has occurred. If a node is unreachable by one or more clients, yet all nodes can still talk to that node, there is no built-in mechanism for the cluster to remap data from that node to other nodes.

Additionally, there is no built-in mechanism for the smart client to reroute traffic so you will experience timeouts in this situation.  When the network issue resolves the client should stop presenting errors.

Consider scripting a heartbeat check, using the Couchbase CLI, to run on your app servers, and specify failover procedures.
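
As a rough sketch of that idea (the node address, credentials, install path, and the alerting action below are all placeholders to adapt to your environment):

#!/bin/bash
# Minimal heartbeat sketch: ask a Couchbase node for its server list from
# this app server's point of view. NODE, ADMIN, PASSWORD and the alert
# action are placeholders; run it from cron on each app server.
NODE=couchbase-node1.example.com
ADMIN=Administrator
PASSWORD=secret

if ! /opt/couchbase/bin/couchbase-cli server-list \
        -c $NODE:8091 -u $ADMIN -p $PASSWORD > /dev/null 2>&1
then
    # Node unreachable from this client: alert (and/or kick off your
    # documented failover procedure) here.
    echo "Couchbase node $NODE unreachable from $(hostname)" | \
        mail -s "Couchbase heartbeat failure" oncall@example.com
fi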

Nagios Check Calculated on MySQL Server Variables

Recently we needed to make a change for a client to one of our MySQL monitoring tools, so I thought it would be a good opportunity to highlight the tool and discuss some of the changes that I made.

You can access the tool on Github from our public repository here.

Before this change, if you were using either the "varcomp" or "lastrun-varcomp" modes, the check would only return a WARNING if your criteria for comparison were exceeded. In the new version, both WARNING and CRITICAL states can be returned to Nagios. Here is an example: let's say you want to alert on maximum connections, but the number of maximum connections is different for different hosts. Instead of writing distinct per-host checks, you can use this check to do a simple calculation and alert on the result of that calculation. In this case you want to send a warning when your connection count exceeds 75% of max connections and go critical when it reaches 80%. The entry in your nagios config file might look something like this:

mysql_health_check.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ --mode varcomp --expression "Threads_connected / max_connections * 100" --comparison_warning="> '$ARG3$'" --comparison_critical="> '$ARG4$'" --shortname percent_max_connections

Where the expression flag uses the names that are returned from any of "FULL PROCESSLIST", "ENGINE INNODB STATUS", "GLOBAL VARIABLES", "GLOBAL STATUS", or "SLAVE STATUS". The comparison_warning and comparison_critical flags are going to be evaluated in Perl, so ensure that each is a valid Perl expression (in this case you could use either > or gt). You'll definitely need to test out your commands with a few different use cases to ensure you have good syntax. When a varcomp or lastrun-varcomp check is run, the results are kept in a local cache file against which you can make comparisons. So, to give a ridiculous example, if you want to ensure that the number of open table definitions didn't increase too much between samples, you could do something like this:

mysql_health_check.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ --mode lastrun-varcomp --expression "current{Open_table_definitions} - lastrun{Open_table_definitions}" --comparison_warning="> 10" --comparison_critical "> 20" --shortname increase_in_open_table_defs

Where the notation current{} and lastrun{} refer to which sample time you want. You can see more details on all the features of the script in the README on the plugin page. We welcome any comments and suggestions.
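
Before wiring the check into Nagios, it can also help to run it by hand with concrete values (the host, credentials and thresholds below are only examples, and this assumes the plugin follows the usual Nagios exit-code convention):

# Manual test from the shell; host, credentials and thresholds are examples.
./mysql_health_check.pl -H db1.example.com -u monitor -p 's3cret' \
    --mode varcomp \
    --expression "Threads_connected / max_connections * 100" \
    --comparison_warning="> 75" \
    --comparison_critical="> 80" \
    --shortname percent_max_connections
echo $?   # 0 = OK, 1 = WARNING, 2 = CRITICAL under the Nagios convention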

Put Opsview Hosts Into Downtime via the Shell

Recently a client of ours who used opsview to manage their resources needed to place some of their hosts into downtime in conjunction with some other cron-scheduled tasks. In order to implement that functionality, I created this simple script, which should work with most installations of opsview or, with a few modifications, can be adapted to other, similar REST interfaces. To use it, modify the 5 variables at the top of the script as necessary. The url and username are what come with the default installation of opsview. Modify CURL if it's in a different place on your system. Then invoke it, for example:

opsview_rest_api_downtime.sh -p Pa5sw0rd -h host_name_in_opsview -c create -t 2

Here host_name_in_opsview is the hostname as defined in opsview, not necessarily the same as its actual hostname.
#!/bin/bash
#
# create or delete downtime for a single host using opsview curl rest api
 
CURL=/usr/bin/curl
OPSVIEW_HOSTNAME="opsview.example.com"
USERNAME=apiuser
URL="/rest/downtime"
hours_of_downtime=2
 
usage()
{
    echo "Usage: $0 -p <opsview apiuser password> -h <host> -c (create|delete) [-t <hours_of_downtime>]"
    exit 1
}
 
while getopts p:h:t:c: opt
do
    case $opt in 
      p) password=$OPTARG;;
      h) host=$OPTARG;;
      t) hours_of_downtime=$OPTARG;;
      c) command=$OPTARG;;
      \?) usage;;
    esac
done
 
 
if [ "x$password" = "x" ] || [ "x$host" = "x" ] || [ "x$command" = "x" ]
then
    usage
fi
 
# LOGIN
 
token_response=`$CURL -s -H 'Content-Type: application/json' https://$OPSVIEW_HOSTNAME/rest/login -d "{\"username\":\"$USERNAME\",\"password\":\"$password\"}"`
token=`echo $token_response | cut -d: -f 2 | tr -d '"{}'`
if [ ${#token} -ne 40 ]
then
    echo "$0: Invalid apiuser login. Unable to $command downtime."
    exit 1
fi
 
 
if [ "$command" = "create" ]
then
    # create downtime - POST
    starttime=`date +"%Y/%m/%d %H:%M:%S"` 
    endtime=`date +"%Y/%m/%d %H:%M:%S" -d "$hours_of_downtime hours"`
    comment="$0 api call"
    data="{\"starttime\":\"$starttime\",\"endtime\":\"$endtime\",\"comment\":\"$comment\"}"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" https://$OPSVIEW_HOSTNAME$URL?host=$host -d "$data"`
    exit_status=$?
else
    # delete downtime - DELETE
    params="host=$host"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" -X DELETE https://$OPSVIEW_HOSTNAME$URL?$params`
    exit_status=$?
fi
echo "$result" | grep $host > /dev/null
host_in_output=$?
if [ "$exit_status" -ne "0" ] || [ "$host_in_output" -ne "0" ]
then
  echo "Unable to $command downtime for $host.  Result of call:"
  echo $result
  exit 1
fi
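
For the cron-scheduled use case that motivated this, a small wrapper along the following lines is one option (the paths, password handling and maintenance command are placeholders):

#!/bin/bash
# Example wrapper for cron: put a host into downtime, run a maintenance
# task, then remove the downtime again. The password, host name and the
# maintenance command are placeholders.
API_PASS=Pa5sw0rd
OPSVIEW_HOST=db1-prod    # host name as defined in opsview

/usr/local/bin/opsview_rest_api_downtime.sh -p "$API_PASS" -h "$OPSVIEW_HOST" -c create -t 2

/usr/local/bin/run_nightly_maintenance.sh    # hypothetical maintenance job

/usr/local/bin/opsview_rest_api_downtime.sh -p "$API_PASS" -h "$OPSVIEW_HOST" -c delete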

Benchmarking NDB vs Galera

Inspired by the benchmark in this post, we decided to run some NDB vs Galera benchmarks for ourselves.

We confirmed that NDB does not perform well using m1.large instances. In fact, it’s totally unacceptable -  no setup should ever have a minimum latency of 220ms - so m1.large instances are not an option. Apparently the instances get CPU bound, but CPU utilization never goes above ~50%. Maybe top/vmstat can’t be trusted in this virtualized environment?

So, why not use m1.xlarge instances? This sounds like a better plan!

As in the original post, our dataset is 15 tables of 2M rows each, created with:

./sysbench --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --mysql-table-engine=ndbcluster --mysql-user=user --mysql-host=host1 prepare

Benchmark against NDB was executed with:

for i in 8 16 32 64 128 256
do
    ./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306 --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i run > ndb_2_nodes_$i.txt
done

After we shut down NDB, we started Galera and recreated the tables, but found that sysbench runs were failing. A suggestion from Hingo was to use --oltp-auto-inc=off, which worked.

Our benchmark against Galera was executed with:

for i in 8 16 32 64 128 256
do
    ./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306 --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i --oltp-auto-inc=off run > galera_2_nodes_$i.txt
done

Below are the graphs of average throughput at the end of 10 minutes, and 95% response time.

Galera clearly performs better than NDB with 2 instances!

But things become very interesting when we graph the reports generated every 10 seconds.

Surprised, right? What is that?

Here we see that even though the workload fits completely in the buffer pool, the high number of TPS causes aggressive flushing.

We assume the benchmark in the Galera blog post was CPU bound, while in our benchmark the behavior is I/O bound.

We then added another two nodes (m1.xlarge instances), but kept the dataset at 15 tables x 2M rows, and re-ran the benchmark with NDB and Galera. Performance on Galera got stuck, due to I/O. Actually, with Galera, we found that performance on 4 nodes was worse than with 2 nodes; we assume this is caused by the fact that the whole cluster goes at the speed of the slowest node.

Performance on NDB keeps growing as new nodes are added, so we added another 2 nodes for just NDB (6 nodes total).

The graphs show that NDB scales better than Galera, which is not what we expected to find.

It is perhaps unfair to say that NDB scales better than Galera; rather, NDB's checkpointing causes less stress on I/O than InnoDB's checkpointing, so the bottleneck is in InnoDB and not in Galera itself. To be more precise, the bottleneck is slow I/O.

The following graph shows the performance with 512 threads and 4 nodes (NDB and Galera) or 6 nodes (NDB only). Data was collected every 30 seconds.

"When the Nerds Go Marching In"

Palomino was honored to serve as part of the team of technologists on President Obama's re-election campaign. Atlantic Magazine ran a fascinating piece about Narwhal, the sophisticated data architecture that enabled the campaign to track voters, volunteers and online trends.

Palomino CEO Laine Campbell joined the team in Chicago for the final days of the campaign, ensuring maximum uptime and performance on the MySQL databases. Afterwards, President Obama thanked her for Palomino's contributions.

When Disaster Strikes: Hurricane Sandy

We devoted the Palomino Newsletter this month to the important topic of disaster recovery, in light of the challenges posed by Hurricane Sandy. If you're not already receiving our newsletter, you can subscribe here.


Hurricane Sandy has been on many people's minds of late; mine not least.  Having lived the last 4 years of my life in Manhattan and on the Jersey Shore, the loss of lives, the destruction of homes, business and memories, and the disruption of so much has me in shock.  I grew up in Louisiana, and hurricanes were a way of life.  You didn't do something hoping that a hurricane would not come by.  You assumed a hurricane would come.  At least, that's how I was taught.  That's the mentality I try to bring into my architectures, my process and my planning as well.  So, when hurricane Sandy bore down on the East Coast, my alarm bells started ringing, just as my email started exploding.  Every one of our US-East Amazon customers was in danger.  Who knew when power would go out?  And when it would come back?

Palomino is proud to be an Amazon Web Services consulting partner. That said, we recognize that Amazon has had its share of instability.  A few weeks ago, US-East experienced some significant EBS latency and unavailability.  We've lost availability zones.  We've lost regions.  We've found availability zones inexplicably unpredictable in terms of latency and availability.  Amazon forces us to think resiliently.  Not in preventing disasters, but weathering them, bouncing back, and being ready.  Some say this is an issue with Amazon.  That the unreliability is a drawback.  Perhaps I'm the eternal optimist, but I simply see it as a way to force rigor in anticipating, documenting and practicing our availability and business continuity plans.

None of this is new or incredibly enlightening.  Any operations person worth their salt thinks of failure and what can go wrong, and they think of it often.  So what's the point here?  I thought I'd share the war stories of the weekend to help cast a light on varying degrees of preparation.

 

Client One

Client One contacted us.  They had anticipated the problem and already been preparing to create multi-region EC2 environments; Sandy just accelerated things.  This client is in RDS, Amazon's Relational Database as a Service - in this case MySQL as a service.  RDS is such a convenient tool, until it isn't.  One of the big drawbacks? No cross-region support.  Yes, you can use Multi-AZ replication for Master availability across availability zones.  Yes, you can also create replicas in multiple availability zones.  If you do both of these things, you've got a certain level of fault tolerance in place.  You can still get hurt if your master does a multi-AZ failover.  All of your replicas will break, as RDS doesn't take into account the ability to move manually to the next binlog when a master crashes before closing their binlogs.  Thus, you are without replicas.  Not great.  But you have a working master.  Similarly, you have multiple replicas across AZs, to tolerate those failures.  But cross-region?  Nothing.

So, we had to dump all of our RDS instances and load them into RDS in another region.  Parallel dumps and loads were kicked off, accelerating the very painful process of a logical rebuild of a system.  We used SSD ephemeral storage on EC2 to speed this up as well.  The process still took 2 days.  OpenVPNs had to be set up with mappings for port 3306 to allow replication.  If this hadn't already been in process before Sandy was a threat, we never would have been ready in time.  We still had and have issues.  You can't replicate from RDS in one region to another.  Custom ETL must be created in order to keep each table as in sync as possible.  We'd done this work in a previous plan to move off of RDS, mapping tables to one of three categories - static (read-only), insert only, or upd/del.  Static just needs to be monitored for changes.  Insert only can be kept close to fresh with high water marks and batch inserts.  Transactional requires keys on updated at and created at fields, and confidence in the values in those fields.  Deletes present even bigger problems.  Digging in further is out of scope here, but consider it a future topic.
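
As a very rough illustration of the parallel, per-table dump and load approach (endpoints, credentials and database name are placeholders, and this glosses over views, triggers, schema objects and the consistency caveats discussed above):

#!/bin/bash
# Sketch: dump each table from the source RDS instance and load it into the
# target region's RDS instance, a few tables at a time. SRC, DST, DB and the
# credentials are placeholders; -P4 caps the number of parallel streams.
SRC=mydb.abc123.us-east-1.rds.amazonaws.com
DST=mydb.def456.us-west-2.rds.amazonaws.com
DB=production

mysql -h $SRC -u admin -p'secret' -N -e "SHOW TABLES FROM $DB" | \
    xargs -P4 -I{} sh -c \
    "mysqldump -h $SRC -u admin -p'secret' --single-transaction $DB {} | \
     mysql -h $DST -u admin -p'secret' $DB"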

Summary: Client One was in-process for multi-region disaster recovery (DR).  A fire-drill occurred, and people had to work long, long hours doing tedious work.  But, had Sandy hit their region with the force it hit further north, we'd have been ready.

 

Client Two

Client Two contacted us also.  They had known that they were at risk, but they were small, they were pushing new features and refactoring applications, and DR was far out on their roadmap.  They too, were on RDS.  They could not afford the amount of custom work our larger clients requested, so we had to create a best effort approach.  RDS instances were created in Portland, along with cache servers, transaction engines, web services and the rest of the stack. Amazon Machine Images (AMIs) were kicked out, and we built a dump and copy process across regions.  There would be data loss, up to many hours, if the region went down and never came up.  But they would not be dead in the water.  Data loss can be mitigated by more frequent dumps and copies, but not eliminated completely.

Summary: Client Two had no plans for multi-region DR.  They had taken a conscious risk.  Luckily they had the talent and agility of a small company and could move fast with our help.  Failing over would have hurt, but they'd still be alive.

 

Client Three

We reached out proactively to Client Three. They had put together a multi-region plan for critical systems last year before we started working with them, which included scripts to rapidly build out new clusters of Hadoop-based systems.  It was supposed to just work.  When we started working with Client Three, we’d scheduled our DR review, testing and modernizing for our Q4 checklist.  Too little, too late, right?  Sure enough, things didn't "just work".  It wasn't horrible, but a weekend of cleaning up, rescripting and fixing problems as they arose ensued.  But had we had to fail over? They would've been ready.

Summary: Client Three had anticipated and architected DR, but they hadn’t tested it.  Luckily we had the days before the storm to test and to fix this.  If they hadn't planned at all, I'm not sure we would've made it.

 

It’s also worth remembering that you are not alone in these shared environments.  All weekend, shops were staking claims on instances and storage, and building out.  Rolling out resources got slower, and if you didn't claim, you'd lose out.  This has to be considered in your plans.

 

To recap: Palomino loves AWS.  We're a consulting partner and have helped many clients in many different business models deploy, scale and perform in AWS.  But DR is not a luxury anymore.  It's a necessity.  Architectures have to take multi-AZ and multi-region plans into consideration in the beginning.  Many people use AWS so they save money on hardware.  They get upset when you point out the labor and extra instances needed to guarantee they can weather these storms.  But it's a hard reality.  It's one of the reasons we only recommended RDS in early phases, when downtime is tolerable.  Good configuration management also means you can deploy a skeleton infrastructure in another region; you can explode that to a full-blown install with ease.  But you have to practice, and you have to move fast.  If you think your region can go down, go to DEFCON and push the buttons.  If you're wrong, you can always tear back down.

Anticipate.
Plan.
Build it early.
Automate it.
Test it.
Test it.
Test it.
Test it.

If you haven't been able to donate to the Red Cross or other institutions helping our fellow brothers and sisters in the Northeast and in the Caribbean, please take some time to do so.  Having lost property and cared for loved ones displaced by Katrina, and now hearing so many horror stories from New Jersey and New York, I urge everyone to donate money, donate shelter, donate time and skills if you have them.  

 
