Blog

Supporting Feminism in Technology - Part 1

 

I've been contemplating the topic of feminism and misogyny in the technology field a lot of late.  This blog post is the culmination of a significant amount of thought and reflection on women in technology.  At Palomino, I focus a lot on the values around bringing underserved populations into technology.  Women, people from working-class or impoverished backgrounds, people who are gay, lesbian, or transgender, and people of Latino and African American backgrounds are traditionally highly underrepresented in the US technology workforce.  Palomino, even as a woman-owned business, is not exempt from this issue.  One of my goals is to build up not just Palomino's DBA and engineer population to reflect higher percentages of these populations, but to support more people in the entire community in having these opportunities.  I'd like to focus on gender in this conversation, though I have much to say on the topics of race and class as well.  In fact, they are all interrelated.


To dig into the topic, one of the first things to consider, and that people ask, is why this is a big deal.  And at the top level, without any time put into consideration, I can see why someone might feel that way.  After all, if a DBA is good, why does their gender, race or background matter?  And if you are simply considering the output of an individual or organization, that is largely true.  But there is more.  My hope was that people who focus on the importance of open source software and free access to technology would understand the importance of building larger populations of women engineers and administrators, but that has not proven to be the case.


The fact is that, on a macro level, one of the most effective ways to get more people from underserved populations into jobs such as database administration, infrastructure architecture and software engineering is to provide them with mentors and role models who have already broken through the barriers to make it.  And we do exist.  Palomino and Blue Gecko were both built by women.  Oracle's Ace and Ace Director list has about 12 (of 390) women on it.  I have been meeting more people of color and women working on senior teams of clients.  I do see potential role models and mentors out there.  But don't get too excited.  There are still plenty of opportunities for improving how we build welcoming workplaces for the talented and diverse engineers already out there making their way in the field.


There's also the selfish part of the equation.  More often than not, when I find women and people of color in the wild with successful records as engineers and operators, they tend to be extremely good at their jobs.  They tend to be excellent communicators, good with clients, detailed with project planning and highly technically competent.  Is this because of more innate talent?  No.  It is because the amount of willpower, inner strength, self-confidence and chutzpah required for these people to succeed is much higher than what is required of the predominant demographic of engineers.  So what can organizations actually do?  Here are a few places to start.


1. Mindfulness in Language and Communication - I find this is particularly true in remote workforces, such as Palomino, where the entire culture is often built around word choice and expression of ideas.  There are obvious cases, such as how often people start email threads with "Gentlemen".  Then there is simply the propagation of cultures around masculinity, or "brogramming".  This is more delicate.  After all, there are plenty of women, myself included, who enjoy conversations around traditionally masculine pursuits and endeavors.  I lean not towards excluding topics, but towards inclusion and mindfulness of those who might feel left out.  Do you ask women about their favorite sports teams?  Do you keep an eye out for folks who might retreat from certain topics and adjust your conversations accordingly?  Shutting off social conversation is not generally helpful, but as leaders in organizations, it is our responsibility to help guide conversations to be as inclusive and supportive as possible for all staff.  And of course, any traditionally sexist, racist or classist conversations need to be privately nipped in the bud immediately as a matter of course.  Creating space for conversations outside of traditionally masculine ones to occur is also critical.  Ask people who are not from the dominant race/class/gender in your organization about their weekends and pastimes.  Don't assume a woman is interested in knitting, but give her a chance to express what she likes.  She might surprise you and your team with the diverse range of interests that comes up.


2. Examine Gendered Roles and Behaviors - Go to most tech sites and look at their team pages.  I'm willing to bet that if you are looking at client-facing positions that require emotional intelligence and empathy, you will find more women than in the technical roles.  Palomino is no exception.  Our project and account management teams are all female.  Our office manager is male, however.  Ultimately, I don't recommend policing the gender of individual roles, but I do believe it's important to examine key expectations and behaviors around staff.  For instance, it is common practice to assume engineers and administrators do not have the emotional or social capacity to interact with users and clients.  So organizations put account managers or project managers in between, who are often female and thus considered more socially and emotionally adept.  Rarely is it considered a priority to encourage the technical staff to step up, improve soft skills such as empathy, and interact directly with the client base.  Instead, we build a culture of mothering, which is harmful to all parties involved.


Additionally, do we value the roles that are more empathetic, client facing and emotionally intelligent?  People always discuss how hard it is to find and retain good DBAs, and their salaries, power and the "catering to" they receive reflect this in the organization.  While a good PM may not be as hard to find, they are still just as valuable to an organization.  Do you take these roles for granted, or do you make the people in them feel as important, valued and encouraged as your more technical staff?  Do you let mediation fall to these same people, or do you encourage all staff to develop their skills in negotiation and conflict resolution?


3. The Devil is in the Details - At the recent Percona Live conference, T-shirts were given to all attendees.  When asked if there were women's sizes, the organizers stated they were unisex.  Unisex is not actually unisex.  It is men's, and not designed for women's bodies.  These details, while not large individually, add up to a feeling of being an add-on, just as much as a lack of kosher meals or wheelchair ramps far from the main entrance can make one feel like an afterthought.  Take the extra step to define and socialize your diversity policy and your code of conduct.  O'Reilly has a great code of conduct at http://oreilly.com/conferences/code-of-conduct.html.  Note that defining the code of conduct or the diversity policy is not enough.  You need to talk to people about these things and engage them.  When you are discussing policies around employees, or evaluating a new client, think about how this fits into your policies.  When you are planning a company offsite, organizing a conference or writing a blog post, think about these policies.  Who will be involved or affected by your choices?  What can you do to make them feel more included?  Take the time to really think about this.


4. Recruiting - This is a challenging problem, and one that I've had to consider for quite some time.  At Palomino, I'd say perhaps 1 out of 20 applicants are women via our natural model of letting people come to us through word of mouth.  That is obviously a horrible ratio.  Too often, people just say "well, if women don't interview, how can we hire them?".  That's a cop-out.  Most hiring managers know that you don't get A players from a passive recruiting strategy.  This is just as true for getting women to interview for technical positions.  You need to spend time going to events such as the Ada Initiative's AdaCamp unconference (http://sf.adacamp.org) and the Women Powering Technology Summit (http://www.witi.com), and sponsoring, speaking and getting involved.  There are numerous meetups, from Girls Who Code in NYC to Girls in Tech in Las Vegas and Women in Tech in SF.  Additionally, you should be going through LinkedIn to find women and contacting them.  Even if they are not interested, by building a network that includes more and more women, you are improving the odds that you will find the right women for your organization.  Get out there and speak at meetups, start some introductory courses for women coming out of college and continue to build that network.  There is no reason to stay at a 5% rate of interviews, but you have to put in the work!


This is part 1 of 2.  In part 2, I'd like to focus on some of the ways in which dialogue around these topics can go wrong, and how to discuss and respond to conversations around feminism and misogyny in a constructive manner.  I look forward to feedback and conversations around the topic, and I thank you for your time in reading and considering this.

Benchmarking Postgres on AWS 4,000 PIOPs EBS instances

Introduction

Disk I/O is frequently the performance bottleneck with relational databases. With AWS recently releasing 4,000 PIOPs EBS volumes, I wanted to do some benchmarking with pgbench and PostgreSQL 9.2. Prior to this release the maximum available I/O capacity was 2,000 IOPs per volume. EBS I/Os are read and written in 16 KB chunks, with performance limited by both the I/O capacity of the EBS volumes and the network bandwidth between an EC2 instance and the EBS network. My goal isn't to provide a PostgreSQL tuning guide, an EC2 tuning guide, or a database deathmatch complete with graphs; I'll just be displaying what kind of performance is available out-of-the-box without substantive tuning. In other words, this is an exploratory benchmark, not a comparative benchmark. I would have liked to compare the performance of 4,000 PIOPs EBS volumes with 2,000 PIOPs EBS volumes, but I ran out of time, so that will have to wait for a following post.

Setup

Region

I conducted my testing in AWS' São Paulo region. One benefit of testing in sa-east-1 is that spot prices for larger instances are (anecdotally) more stable than in us-east. Unfortunately, sa-east-1 doesn't have any cluster compute (CC) instances available. CC instances have twice the bandwidth to the EBS network than non-CC EC2 instances. That additional bandwidth allows you to construct larger software RAID volumes. My cocktail napkin calculations show that it should be possible to reach 50,000 PIOPs on an EBS-backed CC instance without much of a problem.
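As a rough sanity check on that napkin math, assuming the 16 KB I/O size mentioned in the introduction:

50,000 IOPs x 16 KB per I/O ≈ 800 MB/s of sustained I/O bandwidth

That is the ballpark throughput a striped set of PIOPs volumes on a CC instance would have to sustain.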

EC2 instances

I tested with three EC2 instances: an m1.large from which to run pgbench, an m2.2xlarge with four EBS volumes, and an m1.xlarge with one EBS volume. All EBS volumes are 400GB with 4,000 provisioned IOPs. The m1.large instance was an on-demand instance; the other instances  — the pgbench target database servers — were all spot instances with a maximum bid of $0.05. (In one case our first spot instance was terminated, and we had to rebuild it). Some brief testing showed that having an external machine driving the benchmark was critical for the best results.

Operating System

All EC2 instances are running Ubuntu 12.10. A custom sysctl.conf tuned the Sys V shared memory as well as set swappiness to zero and memory overcommit to two.

kernel.shmmax = 13355443200
kernel.shmall = 13355443200
vm.swappiness = 0
vm.overcommit_memory = 2

Packages

The following packages were installed via apt-get:

  • htop
  • xfsprogs
  • debian-keyring
  • mdadm
  • postgresql-9.2
  • postgresql-contrib-9.2

In order to install the postgresql packages a pgdb.list file containing

deb http://apt.postgresql.org/pub/repos/apt/ squeeze-pgdg main

was placed in /etc/apt/sources.list.d and the following commands were run:

gpg --keyserver pgp.mit.edu --recv-keys ACCC4CF8
gpg --armor --export ACCC4CF8 | apt-key add -
apt-get update

RAID and Filesystems

For the one volume instance, I simply created an XFS file system and mounted it on /mnt/benchmark.

mkdir /mnt/benchmark
mkfs.xfs /dev/xvdf
mount -t xfs /dev/xvdf /mnt/benchmark
echo "/dev/xvdf    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab

For the four volume instance it was only slightly more involved. mkfs.xfs analyzes the underlying md device and determines appropriate values for the stripe unit and stripe width. Below are the commands for assembling a four-volume mdadm software RAID array that is mounted on boot (assuming you've attached the EBS volumes to your EC2 instance); a quick verification step follows the commands. Running dpkg-reconfigure mdadm rebuilds the initrd image so the array is assembled at boot.

mkdir /mnt/benchmark
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
mkfs.xfs /dev/md0
echo "/dev/md0    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab
dpkg-reconfigure mdadm
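To confirm everything came up as expected (these verification commands are not from the original run, just the usual checks):

mount /mnt/benchmark
cat /proc/mdstat            # array state and member devices
mdadm --detail /dev/md0     # RAID level, chunk size, device count
xfs_info /mnt/benchmark     # stripe unit/width detected by mkfs.xfs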

Benchmarking

pgbench is a utility included in the postgresql-contrib-9.2 package. It approximates the TPC-B benchmark and can be thought of as a database stress test whose output is measured in transactions per second. It involves a significant amount of disk I/O with transactions that run for relatively short amounts of time. vacuumdb was run before each pgbench iteration. For each database server, pgbench was run mimicking 16 clients, 32 clients, 48 clients, 64 clients, 80 clients, and 96 clients. At each of those client values, pgbench iterated ten times in steps of 100 from 100 to 1,000 transactions per client. It's important to realize that pgbench's stress test is not typical of a web application workload; most consumer-facing web applications could achieve much higher rates than those mentioned here. The only pgbench results against AWS/EBS volumes that I'm aware of (or could quickly google) are from early 2012 and, at their best, achieve rates 50% slower than the lowest rates found here. I drove the benchmark using a very small, very unfancy bash script; a sketch follows the example command line below. An example of the pgbench command line would be:

pgbench -h $DBHOST -j4 -r -Mextended -n -c48 -t600 -U$DBUSER
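A minimal sketch of such a driver script, assuming DBHOST and DBUSER are exported and the pgbench tables have already been initialized:

#!/bin/bash
# Hypothetical driver: 16-96 clients, 100-1,000 transactions per client in
# steps of 100, with a vacuumdb before each pgbench iteration.
for clients in 16 32 48 64 80 96
do
    for txns in $(seq 100 100 1000)
    do
        vacuumdb -h "$DBHOST" -U "$DBUSER" --all
        pgbench -h "$DBHOST" -j4 -r -Mextended -n -c "$clients" -t "$txns" -U "$DBUSER" \
            > "results_${clients}c_${txns}t.txt"
    done
done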

m1.xlarge with single 4,000 PIOPs volume

The maximum transaction volume for this instance came when running below 48 concurrent clients and under 500 transactions per client. While the transaction throughput never dropped precipitously at any point, loads outside of that range exhibited varying performance. Even at its worst, though, this instance handled between 600-700 transactions/second.

m2.2xlarge with four 4,000 PIOPs volumes

I was impressed; at no point did the benchmark stress this instance — the tps rate was between 1,700-1,900 in most situations, with peaks up to 2,200 transactions per second. If I were asked to blindly size a "big" PostgreSQL database server running on AWS, this is probably where I would start. It's not so large that you have operational issues like worrying about MTBFs for ten-volume RAID arrays or trying to snapshot 4TB of disk space, but it is large enough to absorb a substantial amount of traffic.

Graphs and Tabular Data

single-4K-volume tps

[Box plot: the spread of transactions/second irrespective of the number of clients.]

[Bar graph: transactions/second grouped by number of concurrent clients, with each bar representing an increase of 100 transactions per client, ranging from 100 to 1,000.]

[Six subgraphs: the progression of tps at each individual level of concurrency. The x-axis tick marks measure single pgbench runs from 100 transactions per client to 1,000 transactions per client.]

Raw tabular data

clients \ txns per client     100    200    300    400    500    600    700    800    900   1000
16                           1455   1283   1183    653   1197    533    631   1009    923    648
32                           1500   1242   1232    757    747    630   1067    665    688    709
48                            281    864    899    705   1029    749    736    593    766    641
64                            944   1281    704   1010    739    596    778    662    820    612
80                            815    893   1055    809    597    801    684    708    736    663
96                            939    889    774    772    798    682    725    662    776    708

four-4,000-PIOPs-volumes tps

[Box plot: transactions/second irrespective of the number of clients.]

[Bar graph: transactions/second grouped by number of concurrent clients, between 100 and 1,000 transactions per client.]

[Six subgraphs: tps by number of concurrent clients. The x-axis ticks mark pgbench runs progressing from 100 transactions per client to 1,000 transactions per client.]

Tabular data m2.2xlarge with four 4,000 PIOPs EBS volumes

clients \ txns per client     100    200    300    400    500    600    700    800    900   1000
16                           1487   1617   1877   1415   1388   1882   1897   1771   1267   1785
32                           1804   2083   2160   1791   1259   1997   2230   1501   1717   1918
48                           1810   2152   1296   1951   2117   1775   1709   1803   1817   1847
64                           1810   1580   1568   2056   1811   1784   1849   1909   1942   1658
80                           1802   2044   1467   2142   1645   1896   1933   1740   1821   1851
96                           1595   1403   2047   1731   1783   1859   1708   1896   1751   1801

PalominoDB at an industry event near you!

Find the Palomino Team at an event near you in 2013!

New York’s Effective MySQL Meetup Group, March 12, 2013

As New York’s only active MySQL meetup group, NY Effective MySQL states its purpose is to share practical education for MySQL DBAs, developers and architects.  At their next meeting, on March 12 at 6:30 PM, Laine Campbell, CEO & Principal of PalominoDB, will be the evening’s presenter, and her topic will be "RDS Pitfalls. Ways it's going to screw you. (And not in the nice way)."  Speaking from her own experience, Laine will explain the Amazon RDS offering, its patterns and anti-patterns, and its gotchas and idiosyncrasies.  To learn more about the NY group and Laine’s presentation, please click here.

NYC* Tech Day - Wednesday, March 20, 2013

Join NYC* Tech Day and take a deep dive into Apache Cassandra™, the massively scalable NoSQL database! This two-track event will feature over 14 interactive sessions, delivered by Apache Cassandra experts. Come see our CTO Jay Edwards at the Meet the Experts area. Or just drop by our table and talk to Jay and PDB staff. For more info click here!

Percona Live MySQL Conference in Santa Clara April 22-25, 2013

PalominoDB will once again host a booth at this year’s Percona Live event.  With 110 sessions and over 90 speakers, Percona Live promises to be a fantastic event that you won't want to miss! This year several of Palomino’s own will be presenters. Read on....

In order of their appearance:

On the first day of the conference, April 22nd, Rene Cannao will kick off a full-day tutorial beginning at 9:30 AM with part 1 of Ramp-Tutorial for MySQL Cluster - Scaling with Continuous Availability.  After a lunch break, Rene will continue with part 2 beginning at 1:30 PM.  Rene, a Senior Operational DBA at Palomino, will guide attendees through a hands-on experience in the installation, configuration management and tuning of MySQL Cluster.  To see the agenda of topics being offered during this exceptional offering, please click here.

Also on April 22nd, from 9:30 AM - 12:30 PM, Jay Edwards and Ben Black will be presenting an in-depth tutorial: MySQL Patterns in Amazon - Make the Cloud Work For You.  Jay and Ben will show you how to build your MySQL environment in the cloud -- how to maintain it, how to grow it, and how to deal with failure.  You may want to get there early to be sure you get a seat!  Want more info on this hot topic?  Check out more on this topic.

Meet our European team lead, Vladimir Fedorkov, on April 23rd at 2:20 PM, when he will present MySQL Query Anti-Patterns That Can Be Moved to Sphinx.  Vlad will be discussing how to handle query bottlenecks that can result from increases in dataset size and traffic.  Click here to find out more.

Also on the 23rd, at 4:50 PM, Ben Black will be back to speak on MySQL Administration in Amazon RDS.  This should be a great session for attendees new to this tool, as Ben will cover common tasks in RDS and gotchas for DBAs that are new to RDS.  Check out more on this topic.

On April 24th at 1:00 PM, Mark Filipi will present Maximizing SQL Reviews and Tuning with pt-query-digest.  pt-query-digest is one of the more valuable components of the Percona Toolkit, and as one of our operational DBAs, Mark will be approaching his topic with an eye to real-world experiences.  Read more about it by following this link.

Also on the 24th at 1:50 PM, Ben Black and David Turner will tag-team the topic Online Schema Changes for Maximizing Uptime.  Together they will  cover common operations implemented in production and how you can minimize downtime and customer impact.  Here’s a link for more info on this.

PGCon May 2013

One of our Operational Database Administrators, Emanuel Calvo, will be a presenter at the PGCon 2013 PostgreSQL conference on May 23, 2013 at 1 PM.  His topic will be Sphinx and Postgres - Full Text Search extension.  Emanuel will be discussing how to integrate these two tools for optimum performance and reliability. Want to learn more about Emanuel and the conference? Click here.

Velocity June 2013 Web Performance and Operation Conference

Velocity 2013 will be held in Santa Clara from June 18 through the 20th. Organizers tout this event  as “the best place on the planet for web ops and performance professionals like you to learn from your peers, exchange ideas with experts, and share best practices and lessons learned.” Palomino’s Laine Campbell and Jay Edwards will both be presenters on the conference’s opening day, June 18th.

Laine Campbell, PalominoDB’s CEO, will be offering Using Amazon Web Services for MySQL at Scale on Tuesday, June 18 at 1:30 PM.  The session promises “a deep dive into the AWS offerings for running MySQL at scale.”  Laine will guide attendees through various options for running MySQL at high volumes on Amazon Web Services. More on Laine’s presentation is available through this link.

Later that same day, at 3:30 PM, Jay Edwards, Palomino’s CTO, will present Managing PostgreSQL with Ansible in EC2.  He will discuss how and why Ansible is a next-generation configuration management system that includes deployment and ad hoc support as well. Find out more on Jay’s topic and presentation here.

Couchbase Smart Client Failure Scenarios

 

The Couchbase Java smart client communicates directly with the cluster to maintain awareness of data location. To do so, it gathers information about the cluster architecture from a manually maintained list of all the nodes. The smart client configuration is done within the Java code and does not have a pre-designated file, while the Moxi configuration is generally installed at /opt/moxi/etc/moxi-cluster.cfg.

Assuming the smart client is on a separate server from the affected node, there are two situations where communication between the client and a specific node might be interrupted.

In the first scenario, a node may fail. If so, the rest of the cluster will detect the failure via the standard heartbeat checks built into Couchbase and map its data to the replica nodes. The smart client is informed of the remappings and should be able to find all identified data again. There are known bugs with some client versions (e.g. 1.0.3) -- if you experience timeouts with the client, be sure you’re using the latest build. We also recommend that you use autofailover and that you test your email alerts. You must manually rebalance after recovery; this does not happen on its own.

In the second and more common scenario, a network or DNS outage has occurred. If a node is unreachable by one or more clients, yet all other nodes can still talk to it, there is no built-in mechanism for the cluster to remap data from that node to other nodes.

Additionally, there is no built-in mechanism for the smart client to reroute traffic, so you will experience timeouts in this situation. When the network issue is resolved, the client should stop presenting errors.

Consider scripting a heartbeat check, using the Couchbase CLI, to run on your app servers, and define failover procedures for when it fails.
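A minimal sketch of such a check, assuming couchbase-cli is installed on the app server; the node name and credentials are placeholders, and the response to a failure is left to your own procedures:

#!/bin/bash
# Hypothetical heartbeat check for a single Couchbase node, run from an app
# server via cron. NODE, ADMIN and PASS are placeholders for your environment.
NODE="couchbase-node1.example.com"
ADMIN="Administrator"
PASS="password"

if ! couchbase-cli server-list -c "$NODE:8091" -u "$ADMIN" -p "$PASS" > /dev/null 2>&1
then
    echo "$(date): $NODE unreachable from $(hostname)" >> /var/log/couchbase_heartbeat.log
    # Alert or kick off your failover procedure here, e.g. page an operator.
fi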

Put Opsview Hosts Into Downtime via the Shell

Recently a client of ours who uses Opsview to manage their resources needed to place some of their hosts into downtime in conjunction with some other cron-scheduled tasks. To implement that functionality, I created this simple script, which should work with most installations of Opsview or, with a few modifications, with other similar REST interfaces. To use it, modify the five variables at the top of the script as necessary. The URL and username are what come with the default installation of Opsview. Modify CURL if curl lives in a different place on your system. Then invoke it, for example, as:

opsview_rest_api_downtime.sh -p Pa5sw0rd -h host_name_in_opsview -c create -t 2

where host_name_in_opsview is the host name as defined in Opsview, not necessarily the same as its actual hostname. A sample cron pairing is shown after the script.
#!/bin/bash
#
# create or delete downtime for a single host using opsview curl rest api
 
CURL=/usr/bin/curl
OPSVIEW_HOSTNAME="opsview.example.com"
USERNAME=apiuser
URL="/rest/downtime"
hours_of_downtime=2
 
usage()
{
    echo "Usage: $0 -p <opsview apiuser password> -h <host> -c (create|delete) [-t <hours_of_downtime>]"
    exit 1
}
 
while getopts p:h:t:c: opt
do
    case $opt in 
      p) password=$OPTARG;;
      h) host=$OPTARG;;
      t) hours_of_downtime=$OPTARG;;
      c) command=$OPTARG;;
      \?) usage;;
    esac
done
 
 
if [ "x$password" = "x" ] || [ "x$host" = "x" ] || [ "x$command" = "x" ]
then
    usage
fi
 
# LOGIN
 
token_response=`$CURL -s -H 'Content-Type: application/json' https://$OPSVIEW_HOSTNAME/rest/login -d "{\"username\":\"$USERNAME\",\"password\":\"$password\"}"`
token=`echo $token_response | cut -d: -f 2 | tr -d '"{}'`
if [ ${#token} -ne 40 ]
then
    echo "$0: Invalid apiuser login. Unable to $command downtime."
    exit 1
fi
 
 
if [ "$command" = "create" ]
then
    # create downtime - POST
    starttime=`date +"%Y/%m/%d %H:%M:%S"` 
    endtime=`date +"%Y/%m/%d %H:%M:%S" -d "$hours_of_downtime hours"`
    comment="$0 api call"
    data="{\"starttime\":\"$starttime\",\"endtime\":\"$endtime\",\"comment\":\"$comment\"}"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" https://$OPSVIEW_HOSTNAME$URL?host=$host -d "$data"`
    exit_status=$?
else
    # delete downtime - DELETE
    params="host=$host"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" -X DELETE https://$OPSVIEW_HOSTNAME$URL?$params`
    exit_status=$?
fi
echo "$result" | grep $host > /dev/null
host_in_output=$?
if [ "$exit_status" -ne "0" ] || [ "$host_in_output" -ne "0" ]
then
  echo "Unable to $command downtime for $host.  Result of call:"
  echo $result
  exit 1
fi
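For example, a crontab pairing the script with a nightly maintenance job might look like this (paths, host name, and password handling are placeholders for illustration):

# Put db01 into two hours of downtime just before nightly maintenance runs.
55 1 * * * /usr/local/bin/opsview_rest_api_downtime.sh -p "$(cat /etc/opsview_api_pass)" -h db01 -c create -t 2
0 2 * * * /usr/local/bin/nightly_maintenance.sh db01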

Benchmarking NDB vs Galera

Inspired by the benchmark in this post, we decided to run some NDB vs Galera benchmarks for ourselves.

We confirmed that NDB does not perform well on m1.large instances. In fact, it’s totally unacceptable - no setup should ever have a minimum latency of 220ms - so m1.large instances are not an option. Apparently the instances become CPU bound, but CPU utilization never goes above ~50%. Maybe top/vmstat can’t be trusted in this virtualized environment?

So, why not use m1.xlarge instances? This sounds like a better plan!

As in the original post, our dataset is 15 tables of 2M rows each, created with:

./sysbench --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --mysql-table-engine=ndbcluster --mysql-user=user --mysql-host=host1 prepare

The benchmark against NDB was executed with:

for i in 8 16 32 64 128 256
do
    ./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306 --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i run > ndb_2_nodes_$i.txt
done

After we shut down NDB, we started Galera and recreated the tables, but found that the sysbench runs were failing. A suggestion from Hingo was to use --oltp-auto-inc=off, which worked.
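For reference, a sketch of the corresponding prepare step for the Galera run (the InnoDB table engine and using --oltp-auto-inc=off at prepare time are assumptions, since only the run commands are shown here):

./sysbench --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --oltp-auto-inc=off --mysql-table-engine=innodb --mysql-user=user --mysql-host=host1 prepare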

The benchmark against Galera was executed with:

for i in 8 16 32 64 128 256
do
    ./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306 --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i --oltp-auto-inc=off run > galera_2_nodes_$i.txt
done

Below are the graphs of average throughput at the end of 10 minutes, and 95% response time.

[Graphs: average throughput after 10 minutes and 95% response time, NDB vs Galera, two nodes.]

Galera clearly performs better than NDB with 2 instances!

But things become very interesting when we graph the reports generated every 10 seconds.

[Graphs: throughput sampled every 10 seconds, NDB vs Galera, two nodes.]

Surprising, right? What is that?

Here we see that even though the workload fits completely in the buffer pool, the high TPS rate causes aggressive flushing.

We assume the benchmark in the Galera blog post was CPU bound, while in our benchmark the behavior is I/O bound.

We then added 2 more nodes (m1.xlarge instances), kept the dataset at 15 tables x 2M rows, and re-ran the benchmark with NDB and Galera. Performance on Galera gets stuck, due to I/O. In fact, with Galera we found that performance on 4 nodes was worse than with 2 nodes; we assume this is caused by the fact that the whole cluster goes at the speed of the slowest node.

Performance on NDB keeps growing as new nodes are added, so we added another 2 nodes for just NDB (6 nodes total).

[Graphs: throughput as nodes are added, NDB (up to 6 nodes) vs Galera (4 nodes).]

The graphs show that NDB scales better than Galera, which is not what we expected to find.

It is perhaps unfair to say that NDB scales better than Galera; rather, NDB's checkpointing causes less stress on I/O than InnoDB's checkpointing, so the bottleneck is in InnoDB and not in Galera itself. To be more precise, the bottleneck is slow I/O.

The following graph shows the performance with 512 threads and 4 nodes (NDB and Galera) or 6 nodes (NDB only). Data was collected every 30 seconds.

"When the Nerds Go Marching In"

Palomino was honored to serve as part of the team of technologists on President Obama's re-election campaign. Atlantic Magazine ran a fascinating piece about Narwhal, the sophisticated data architecture that enabled the campaign to track voters, volunteers and online trends.

Palomino CEO Laine Campbell joined the team in Chicago for the final days of the campaign, ensuring maximum uptime and performance on the MySQL databases. Afterwards, President Obama thanked her for Palomino's contributions.

Chef Cookbooks for HBase on CentOS Released

At Palomino, we've been hard at work building the Palomino Cluster Tool. Its goal is to let you build realistically-sized[1] and functionally-configured[2] distributed databases in a matter of hours instead of the days or weeks it takes at present. Today marks another milestone toward that goal as we release our Chef Cookbook for building HBase on CentOS!

 

Background

Riot Games was kind enough to open source their Chef Cookbook for building a Hadoop cluster. Although the code wasn't in a state that would produce a functional cluster, and it was almost entirely undocumented, it was a great start.

Recently I was tasked with building an HBase cluster on CentOS using Chef. Although I've written a Cookbook (three times!) to do so, my code was never fully optimized. It could build a cluster, but only with hard-coded configuration parameters, or it produced a cluster that was running in a non-realistic non-production configuration.

Using the Riot Games Cookbook and the lessons I'd learned in the past, I whipped it into shape. I not only modified it to produce a functional cluster in a non-Riot environment, but also to build HBase on top of that! There are over 800 changes in the diff and documentation on how to use it.

 

Source Code

Here you can find the newest Chef Cookbook for HBase on CentOS. Here you can find the original Ansible Playbooks for HBase on Ubuntu. If you would like to use this code to build your own cluster, you are encouraged to join the mailing list to get help and advice from your peers.

 

Notes

[1] A distributed database can be tested functionally by installing on a single machine, but when it comes time to run benchmarks, or to discover the other 90% of functionality that only appears in a distributed setup, you will want to have the database installed on many machines, preferably dozens.

[2] Many projects seem to stop short of installing the database in a way that would let you benchmark it. Perhaps there are shortcuts taken like putting all database files into /tmp, or disabling logging, or removing tricky/subtle components in the interest of simplicity. The Palomino Cluster Tool provides you with a cluster that's actually ready for production. Sure, you still have to edit the configurations a little, but a good base generic configuration is provided.

Bulk Loading Options for Cassandra

Cassandra's bulk loading interfaces are most useful in two use cases: initial migration or restore from another datastore/cluster and regular ETL. Bulk loading assumes that it's important to you to avoid loading via the thrift interface, for example because you haven't integrated a client library yet or because throughput is critical. 

There are two alternative techniques used for bulk loading into Cassandra: "copy-the-sstables" and sstableloader. Copying the sstables is a filesystem level operation, while sstableloader utilizes Cassandra's internal streaming system. Neither is without disadvantages; the best choice depends on your specific use case. If you are using Counter columnfamilies, neither method has been extensively tested and you are safer writing via thrift.

The key to understanding bulk-loading throughput is that potential throughput depends significantly on the nature of the operation as well as the configuration of the source and target clusters and things like the number of sstables, sstable size and tolerance for potentially duplicate data. Notably, though not dramatically, sstableloader in 1.1 is slightly improved over the (freshly re-written) version in 1.0. [1]

Below are good cases for and notable aspects of each strategy.

Copy-the-sstables/"nodetool refresh" can be useful if (a sketch follows this list):

  1. Your target cluster is not running, or if it is running, is not sensitive to latency from bulk loading at "top speed" and associated operations.
  2. You are willing to manually de-duplicate sstable names, or have a tool to do so, and are willing to figure out where to copy them to in any non copy-all-to-all case. You are willing to run cleanup and/or major compaction, and understand that some disk space is wasted until you do. [2]
  3. You don't want to deal with the potential failure modes of streaming, which are especially bad in non-LAN deploys including EC2.
  4. You are restoring in a case where RF=N, because you can just copy one node's data to all nodes in the new RF=N cluster and start the cluster without bootstrapping (auto_bootstrap: false in  cassandra.yaml).
  5. The sstables you want to import are a different version than the target cluster currently creates. Example: trying to sstableload -hc- (1.0) sstables into a -hd- (1.1) cluster is reported to not work. [3]
  6. You have your source sstables in something like S3, which can easily parallelize copies to all target nodes. S3<->EC2 transfer is fast and free, which is close to the best case given the inefficiency of the copy stage.
  7. You want to increase RF on a running cluster, and are ok with running cleanup and/or major compaction after you do.
  8. You want to restore from a cluster with RF=[x] to a cluster whose RF is the same or smaller and whose size is a multiple of [x]. Example: restoring a 9 node RF=3 cluster to a 3 node RF=3 cluster, you copy 3 source nodes worth of sstables to each target node.
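A minimal sketch of the copy-the-sstables approach for a single columnfamily, assuming Cassandra 1.1 data paths and placeholder keyspace "ks" and columnfamily "cf" (sstable file names must not collide with ones already on the target node):

rsync -av /backups/ks/cf/ /var/lib/cassandra/data/ks/cf/
nodetool -h localhost refresh ks cf     # pick up the newly copied sstables
nodetool -h localhost cleanup ks        # drop data this node does not own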

sstableloader/JMX "bulkload" can be useful if (a sketch follows this list):

  1. You have a running target cluster, and want the bulk loading to respect for example streaming throttle limits.
  2. You don't have access to the data directory on your target cluster, and/or JMX to call "refresh" on it.
  3. Your replica placement strategy on the target cluster is so different from the source that the overhead of understanding where to copy sstables to is unacceptable, and/or you don't want to call cleanup on a superset of sstables.
  4. You have limited network bandwidth between the source of sstables and the target(s). In this case, copying a superset of sstables around is especially inefficient.
  5. Your infrastructure makes it easy to temporarily copy sstables to a set of sstableloader nodes or nodes on which you call "bulkLoad" via JMX. These nodes are either non-cluster-member hosts which are otherwise able to participate in the cluster as a pseudo-member from an access perspective, or cluster members with sufficient headroom to bulkload.
  6. You can tolerate the potential data duplication and/or operational complexity which results from the fragility of streaming. LAN is best case here. A notable difference between "bulkLoad" and sstableloader is that "bulkLoad" does not have sstableloader's "--ignores" option, which means you can't tell it to ignore replica targets on failure. [4]
  7. You understand that, because it uses streaming, streams on a per-sstable basis, and streaming respects a throughput cap, your performance is bounded in terms of ability to parallelize or burst, despite "bulk" loading.
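And a minimal sketch of the streaming approach, assuming Cassandra 1.1's sstableloader with placeholder host IPs and sstable path:

# Stream the sstables under Keyspace1/Standard1 into a running cluster,
# contacting 10.0.0.1 and 10.0.0.2 to discover the ring.
sstableloader -d 10.0.0.1,10.0.0.2 /path/to/Keyspace1/Standard1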

Couchbase rebalance freeze issue

We came across a Couchbase bug during a rebalance while upgrading online to 1.8.1 from 1.8.0.  

Via the UI, we upgraded our first node, re-added it to the cluster, and then kicked off the rebalance.  It was progressing fine, then stopped around 48% for all nodes.  The tap and disk queues were quiet and there were no servers in pending rebalance.  The upgraded node was able to service requests, but had only a small percentage of the items relative to the other nodes.  The cluster as a whole did not suffer in performance during this issue, though there were some spikes in CPU, as there are during any rebalance.

We decided to stop the rebalance, wait a few minutes, then rebalance again, and we saw it was moving again, progressing beyond where it had been.  It then stopped again, now at 75%.  We let it sit for 7 minutes, then hit Stop Rebalance and Rebalance once more.  This time it did not progress at all.

Couchbase support pointed to a bug where a rebalance can hang if there are empty vbuckets.  This is fixed in 2.0.  The workaround is to populate each bucket with at least 2048 items with a short time to live (TTL >= (10 minutes per upgrade + (2 x rebalance_time)) x num_nodes) so that all vbuckets have something in them.  We then populated all buckets successfully and were able to restart the rebalance process, which completed fine.
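One rough way to script that population (a sketch only, writing through moxi's memcached port with placeholder host, TTL, and key names; a proper client library would do the same job more efficiently):

TTL=3600   # choose a value per the formula above
for i in $(seq 1 2048)
do
    # memcached ASCII protocol: set <key> <flags> <exptime> <bytes>, then the value
    printf 'set warmup_%s 0 %s 1\r\nx\r\n' "$i" "$TTL" | nc couchbase-node1.example.com 11211 > /dev/null
done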

Reference:

http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-getting-started-upgrade-online.html
