Palomino

When Disaster Strikes: Hurricane Sandy

 

We devoted the Palomino Newsletter this month to the important topic of disaster recovery, in light of the challenges posed by Hurricane Sandy. If you're not already receiving our newsletter, you can subscribe here.


Hurricane Sandy has been on many people's minds of late; mine not least.  Having lived the last 4 years of my life in Manhattan and on the Jersey Shore, I am in shock at the loss of lives, the destruction of homes, businesses and memories, and the disruption of so much.  I grew up in Louisiana, and hurricanes were a way of life.  You didn't do something hoping that a hurricane would not come by.  You assumed a hurricane would come.  At least, that's how I was taught.  That's the mentality I try to bring into my architectures, my process and my planning as well.  So, when Hurricane Sandy bore down on the East Coast, my alarm bells started ringing, just as my email started exploding.  Every one of our US-East Amazon customers was in danger.  Who knew when power would go out?  And when it would come back?

Palomino is proud to be an Amazon Web Services consulting partner. That said, we recognize that Amazon has had its share of instability.  A few weeks ago, US-East experienced some significant EBS latency and unavailability.  We've lost availability zones.  We've lost regions.  We've found availability zones inexplicably unpredictable in terms of latency and availability.  Amazon forces us to think resiliently.  Not in preventing disasters, but weathering them, bouncing back, and being ready.  Some say this is an issue with Amazon.  That the unreliability is a drawback.  Perhaps I'm the eternal optimist, but I simply see it as a way to force rigor in anticipating, documenting and practicing our availability and business continuity plans.

None of this is new or incredibly enlightening.  Any operations person worth their salt thinks of failure and what can go wrong, and they think of it often.  So what's the point here?  I thought I'd share the war stories of the weekend to help cast a light on varying degrees of preparation.

 

Client One

Client One contacted us.  They had anticipated the problem and had already been preparing to create multi-region EC2 environments; Sandy just accelerated things.  This client is in RDS, Amazon's Relational Database Service - in this case MySQL as a service.  RDS is such a convenient tool, until it isn't.  One of the big drawbacks? No cross-region support.  Yes, you can use Multi-AZ replication for master availability across availability zones.  Yes, you can also create replicas in multiple availability zones.  If you do both of these things, you've got a certain level of fault tolerance in place.  You can still get hurt if your master does a Multi-AZ failover.  All of your replicas will break, as RDS gives you no way to move them manually to the next binlog when a master crashes before closing its binlog.  Thus, you are without replicas.  Not great.  But you have a working master.  Similarly, you have multiple replicas across AZs to tolerate failures of individual zones.  But cross-region?  Nothing.

So, we had to dump all of our RDS instances and load them into RDS in another region.  Parallel dumps and loads were kicked off, accelerating the very painful process of a logical rebuild of a system.  We used SSD ephemeral storage on EC2 to speed this up as well.  The process still took 2 days.  OpenVPNs had to be set up with mappings for port 3306 to allow replication.  If this hadn't already been in process before Sandy was a threat, we never would have been ready in time.  We had, and still have, issues.  You can't replicate from RDS in one region to another.  Custom ETL must be created in order to keep each table as in sync as possible.  We'd done this work in a previous plan to move off of RDS, mapping tables to one of three categories - static (read-only), insert-only, or update/delete (transactional).  Static tables just need to be monitored for changes.  Insert-only tables can be kept close to fresh with high water marks and batch inserts.  Transactional tables require keys on their updated-at and created-at fields, and confidence in the values in those fields.  Deletes present even bigger problems.  Digging in further is out of scope here, but consider it a future topic.
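
The high-water-mark approach for insert-only tables can be sketched roughly as follows.  This is a minimal illustration, not the ETL we actually ran: it uses Python's built-in sqlite3 in place of two RDS MySQL instances, and the function name, the `events` table and the assumption of a positive, ever-increasing integer key are all hypothetical.

```python
import sqlite3


def sync_insert_only(src, dst, table, key="id", batch_size=1000):
    """Copy rows whose key is above the destination's high-water mark.

    Assumes the table is insert-only and `key` is a positive,
    ever-increasing integer (both are assumptions of this sketch).
    """
    copied = 0
    while True:
        # High-water mark: the largest key already present at the destination.
        (mark,) = dst.execute(
            f"SELECT COALESCE(MAX({key}), 0) FROM {table}"
        ).fetchone()
        # Fetch the next batch of rows the destination has not seen yet.
        rows = src.execute(
            f"SELECT * FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
            (mark, batch_size),
        ).fetchall()
        if not rows:
            return copied
        placeholders = ", ".join("?" * len(rows[0]))
        dst.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        dst.commit()
        copied += len(rows)


def demo():
    """Run two sync passes against in-memory databases."""
    src = sqlite3.connect(":memory:")
    dst = sqlite3.connect(":memory:")
    for db in (src, dst):
        db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    src.executemany(
        "INSERT INTO events VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 8)]
    )
    src.commit()
    first = sync_insert_only(src, dst, "events", batch_size=3)
    # A new row arrives at the source; the next pass copies only that row.
    src.execute("INSERT INTO events VALUES (8, 'row 8')")
    src.commit()
    second = sync_insert_only(src, dst, "events", batch_size=3)
    return first, second
```

Because each pass re-reads the mark from the destination, an interrupted run simply resumes where it left off.  Updates and deletes are invisible to this scheme, which is why those tables need the timestamp-based handling described above.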

Summary: Client One was in process for multi-region disaster recovery (DR).  A fire drill occurred, and people had to work long, long hours doing tedious work.  But, had Sandy hit their region with the force it hit further north, we'd have been ready.

 

Client Two

Client Two contacted us also.  They had known that they were at risk, but they were small, they were pushing new features and refactoring applications, and DR was far out on their roadmap.  They, too, were on RDS.  They could not afford the amount of custom work our larger clients requested, so we had to create a best-effort approach.  RDS instances were created in Portland, along with cache servers, transaction engines, web services and the rest of the stack. Amazon Machine Images (AMIs) were created, and we built a dump-and-copy process across regions.  There would be data loss, up to many hours, if the region went down and never came up.  But they would not be dead in the water.  Data loss can be mitigated by more frequent dumps and copies, but not eliminated completely.
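
A rough way to reason about that exposure, as a sketch under stated assumptions: if each dump captures a consistent snapshot at the moment it starts, and dumps start on a fixed interval, then just before a new copy lands in the other region the newest usable copy is one interval plus one dump plus one transfer old.  The function name and the timings below are hypothetical, not Client Two's numbers.

```python
from datetime import timedelta


def worst_case_data_loss(dump_interval, dump_duration, copy_duration):
    """Upper bound on data lost if the source region vanishes.

    Assumes each dump is a consistent snapshot taken when the dump starts,
    and that a dump is only usable once it is fully copied to the other
    region.
    """
    return dump_interval + dump_duration + copy_duration


# Hypothetical timings: dumps every 4 hours, 30 minutes to dump, 30 to copy.
four_hourly = worst_case_data_loss(
    timedelta(hours=4), timedelta(minutes=30), timedelta(minutes=30)
)
# Doubling the dump frequency shrinks the window but cannot remove it.
two_hourly = worst_case_data_loss(
    timedelta(hours=2), timedelta(minutes=30), timedelta(minutes=30)
)
print(four_hourly, two_hourly)
```

The dump and copy durations set a floor that no dump schedule can get under, which is the sense in which the loss can be mitigated but not eliminated.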

Summary: Client Two had no plans for multi-region DR.  They had taken a conscious risk.  Luckily they had the talent and agility of a small company and could move fast with our help.  Failing over would have hurt, but they'd still be alive.

 

Client Three

We reached out proactively to Client Three. They had put together a multi-region plan for critical systems last year before we started working with them, which included scripts to rapidly build out new clusters of Hadoop-based systems.  It was supposed to just work.  When we started working with Client Three, we’d scheduled our DR review, testing and modernizing for our Q4 checklist.  Too little, too late, right?  Sure enough, things didn't "just work".  It wasn't horrible, but it took a weekend of cleaning up, rescripting and fixing problems as they arose.  But had we had to fail over? They would've been ready.

Summary: Client Three had anticipated and architected DR, but they hadn’t tested it.  Luckily we had the days before the storm to test and to fix this.  If they hadn't planned at all, I'm not sure we would've made it.

 

It’s also worth remembering that you are not alone in these shared environments.  All weekend, shops were staking claims on instances and storage, and building out.  Rolling out resources got slower, and if you didn't claim, you'd lose out.  This has to be considered in your plans.

 

To recap: Palomino loves AWS.  We're a consulting partner and have helped many clients in many different business models deploy, scale and perform in AWS.  But DR is not a luxury anymore.  It's a necessity.  Architectures have to take multi-AZ and multi-region plans into consideration from the beginning.  Many people use AWS to save money on hardware.  They get upset when you point out the labor and extra instances needed to guarantee they can weather these storms.  But it's a hard reality.  It's one of the reasons we only recommend RDS in early phases, when downtime is tolerable.  Good configuration management also means you can deploy a skeleton infrastructure in another region; you can explode that to a full-blown install with ease.  But you have to practice, and you have to move fast.  If you think your region can go down, go to DEFCON and push the buttons.  If you're wrong, you can always tear back down.

Anticipate.
Plan.
Build it early.
Automate it.
Test it.
Test it.
Test it.
Test it.

If you haven't been able to donate to the Red Cross or other institutions helping our fellow brothers and sisters in the Northeast and in the Caribbean, please take some time to do so.  Having lost property and cared for loved ones displaced by Katrina, and now hearing so many horror stories from New Jersey and New York, I urge everyone to donate money, donate shelter, donate time and skills if you have them.  

 

MySQL Conference and Expo - 2012 and our company offsite

Last week was a huge week for PalominoDB.  I will admit to being cautiously optimistic about Percona taking over this conference - one of the biggest parts of the year for the MySQL community.  That being said, the conference was done quite well.  I really applaud Percona for making it happen.  In particular, the dot org pavilion was an excellent addition.  While I was stuck at the booth most of the event, I had a great view of the sheer variety of attendees coming through, and had the privilege of participating in a number of excellent conversations.

I noticed a number of patterns that seemed to prevail among the conversations.  Folks seem eager to move on to MySQL 5.5, finally comfortable with its stability.  Administrators are eager to learn more about MariaDB and Drizzle, and how they can differentiate themselves from the Oracle variants.  Partitioning is more prevalent as datasets grow, and sharding is becoming almost commonplace.  Now the focus is more on the challenging questions around HA - multi-master, synchronous replication and multi-datacenter installations.  People seem more interested in commercial additions around the MySQL ecosystem such as Tungsten, TokuDB, ScaleArc, Scalebase and Clustrix.  

René Cannao, one of our senior administrators, did a great tutorial on understanding performance through measurement, benchmarking and profiling.  Feedback has been great, and we look forward to continuing to evolve our benchmarking and profiling methods.  Please keep an eye out.

Additionally, we were able to announce a very exciting partnership with SkySQL.  PalominoDB focuses on operational excellence by providing top DBAs to our clients.  We dig in to our clients' architectures and improve them, maintain them and help redesign them as they grow.  Our oncall services are top notch, regularly answering pages in under 5 minutes - and always providing a talented and experienced DBA on the other end of the phone.  What we don't do is software support.  It's just not our experience.  We can tune it, run it and grow it, but for clients who need to dig into code, fix bugs and really provide deep internals knowledge in MySQL - SkySQL are who we turn to.   We are also quite excited to help augment SkySQL's excellent services with our own - to create some of the happiest customers out there.

We are thrilled to see our partnership ecosystem grow, our service offerings expand and knowledge of our brand and the quality of services continue to improve.  I can't help but glow with pride at the reputation PalominoDB has built - through our DBAs, our clients and our partners.  Being a part of the MySQL  conference and expo only cemented this pride in community, and pride in our work.  Thank you to everyone who has helped us get there.

Thank you, all of you!

PalominoDB kicks off "Mongo Month" with Exciting News!

 

PalominoDB is pleased to announce the formation of a service partnership with 10gen, the creators of the popular MongoDB database.  Our partnership provides our existing clients with access to a broader community of experts, and offers new clients a trusted partner in a rapidly evolving technology space.

Our team of senior database architects, administrators, and application developers has decades of experience working with clients to build, optimize, and support complex database systems.  We believe MongoDB has a solid place in the open source database ecosystem alongside our workhorse, MySQL, and we are proud to support environments that have both technologies.

Our broad expertise in both relational database technologies and newer key-value and document-oriented storage options situates us perfectly to help clients explore how MongoDB can work in their own environments. We can help you assess your own application needs in areas of performance, scalability, availability, and security, and work with you to optimize your production environment.

“Joining the 10gen Partner program is an exciting step for us,” adds Laine Campbell, founder and principal at PalominoDB. “Our commitment to excellence in emerging database technologies has informed our MongoDB work to date, and we are thrilled to be able to extend that to a broader community in partnership with 10gen.  The ability to interface directly with 10gen’s extensive knowledge of the product in conjunction with our own operational expertise should bring an unsurpassed level of expertise to our clients using, or looking to use, MongoDB.”

PalominoDB will be celebrating "Mongo Month" throughout March, with new blog posts, advice and tools for making the most of MongoDB in your organization. Stay tuned!

Our Pro Bono Program - bringing database excellence to non-profit partners

 

PalominoDB works with a wide range of clients on database management, support and engineering projects. Occasionally, clients find that they haven’t used the full allotment of hours they’ve purchased in a given month, and when that happens, we give them the option to donate those unused hours to our Pro Bono Program. We are pleased to be able to offer our non-profit clients the same level of innovation, service and technical expertise that startups and enterprise-level companies rely on us to provide.

Our latest project is one quite near to our hearts - worldwide literacy. Our friends at Worldreader have the ambitious goal of making digital books available to everyone in the developing world, enabling millions of people to improve their lives. 

As you might imagine, managing the digital assets, the physical e-readers and the program participants involves a lot of data. Worldreader measures success by looking at how people read, what people read, when people read and how those change over time. Managing that data, and synthesizing it into actionable project metrics are core functions that take a great deal of staff and volunteer time. 

PalominoDB helps Worldreader streamline project reporting in order to track and promote engagement with digital texts of all kinds - books, newspapers, magazines and online-only content. The pilot programs in Ghana and Kenya have been hugely successful, and as the Worldreader team heads to Uganda, PalominoDB has renewed our commitment, working with the team at Worldreader over the next year to develop content management tools, purchase and inventory management systems, and provide general database support.

None of this would be possible without our for-profit clients who donate unused hours, and our staff, who contribute time and attention to these projects. Philip Stehlik, Chief Technology Officer at Taulia, explains why the PalominoDB Pro Bono Program appealed to them.  “Taulia works to connect businesses and to improve access to capital for small and medium businesses through Dynamic Discounting. We are excited that donating our unused hours with PalominoDB allowed us to support an entirely new kind of supply chain - getting electronic books to children in the developing world.”

If your non-profit would like to work with PalominoDB’s Pro Bono Program, please get in touch. And if you’re a PalominoDB client (or you think you might want to be!) and want to donate hours to our Pro Bono Program, we’d love to hear from you too. And finally, if you’re an author or publisher who wants to donate books to people hungry for great stories, please get in touch with our friends at Worldreader - they’d love to help more people fall in love with your books.

Live Blogging at MongoSV

Dwight Merriman, CEO of 10gen, speaks about the MongoDB community growing. The conference has doubled in size from 500 to 1100+ attendees.

Eliot Horowitz, CTO of 10gen, demos the MongoDB 2.2 Aggregation Framework, which simplifies aggregating data in MongoDB. He pulls in the MongoDB Twitter feed to populate data and sums using: runCommand({aggregate: … })

The “aggregate” command will be in the nightly builds tonight.

Cooper Bethea, Site Reliability Engineer, Foursquare, speaks on Experiences Deploying MongoDB on AWS.

All data stored in MongoDB
8 production MongoDB clusters
Two of the larger shards:
8 shards of users, 12 shards of check-ins.
Check-ins: ~80 inserts/sec, ~2.5k ops/sec, 45 MB/s outbound at peak.
Users: ~250 updates/sec, ~4k ops/sec, 46 MB/s outbound at peak.
Only one unsharded cluster. The others are fully sharded using replica sets.

All servers in EC2
mongos runs on the mongod instances
Config servers are on three instances
mongod working set contained in RAM
mongod backing store: 4 EBS volumes with RAID0

Problem: fragmentation leads to bloat
mongod RAM footprint grows.
Data size, index size, storage size.

Solution: order replicaset by dataSize + indexSize, uptime DESC. --repair secondary nodes one at a time. Primary nodes require stepDown() which is more delicate.
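
One possible reading of that ordering, as a hedged sketch: sort nodes by dataSize + indexSize, breaking ties by uptime, both descending, and leave primaries until last because they need a stepDown() first.  The node dictionaries and field names below are hypothetical stand-ins, not Foursquare's tooling or a MongoDB API.

```python
def repair_order(nodes):
    """Return nodes in the order they would be repaired, one at a time.

    Each node is a dict with hypothetical keys: "name", "primary",
    "data_size", "index_size" and "uptime".
    """
    def size_then_uptime(node):
        # Largest data + index footprint first; longest uptime breaks ties.
        return (node["data_size"] + node["index_size"], node["uptime"])

    secondaries = [n for n in nodes if not n["primary"]]
    primaries = [n for n in nodes if n["primary"]]
    # Secondaries go first; primaries need a stepDown() before repair.
    return (sorted(secondaries, key=size_then_uptime, reverse=True)
            + sorted(primaries, key=size_then_uptime, reverse=True))
```

Putting primaries last is an assumption of this sketch, based only on the note that they are the more delicate case.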

Problem: EBS performance degrades
Symptoms: ioutil % on one volume > 90
qr/qw counts spike
fault rates > 10 in mongostat
sometimes:  topless counts spike

Solution:
KILL IT! Stop mongoD process if secondary node, stepDown() + stop if primary.
Rebuild from scratch.

How long does it take? ~1 hour
Working set in RAM
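
The degradation symptoms and the kill-and-rebuild response noted above could be expressed along these lines.  This is only a sketch: the function names are hypothetical, the thresholds are the ones from the notes, and treating either symptom alone as sufficient is an assumption of this example rather than something stated in the talk.

```python
def ebs_degraded(io_util_pct_by_volume, faults_per_sec):
    """Flag a node when any one volume is saturated or page faults are high.

    Thresholds (utilization > 90, fault rate > 10) come from the notes;
    combining them with "or" is an assumption of this sketch.
    """
    one_volume_saturated = any(pct > 90 for pct in io_util_pct_by_volume)
    return one_volume_saturated or faults_per_sec > 10


def remediation_steps(is_primary):
    """List the actions from the notes, in order, for a degraded node."""
    steps = []
    if is_primary:
        # A primary must hand off its role before it is stopped.
        steps.append("stepDown()")
    steps.append("stop mongod")
    steps.append("rebuild from scratch")
    return steps
```

For example, a secondary whose four RAID0 volumes report 35, 95, 40 and 38 percent utilization would be flagged and then stopped and rebuilt, with no stepDown() needed.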

Problem: fresh mongoD has not paged in all data
Solution: run queries
db.checkins.find({unused_key:1}).explain()

cat > /dev/null works too, unless your dataset size is larger than RAM.

Last Day at PalominoDB

I have been the Community Liaison and a Senior DBA at PalominoDB for 15 months, and doing remote DBA work for 4 years.  In that time I have learned that "consultant" need not be a dirty word, and that in the DBA world it is actually extremely valuable to have a remote DBA with lots of outside experience, and a team of remote DBAs for when your primary contact is sick or goes on holiday.

As with everything, there are downsides to remote database management.  Even though there is a lot of architecture experience among the remote DBAs I know, we are not often invited to architecture meetings.  This is because time is the unit of currency, and while sitting in an hour-long meeting to give 5 minutes of feedback can save hours down the road, it's hard to see that.  Many clients have gotten around this by having all DDL checked and performed by remote DBAs, and that helps a lot.

There is also no ownership - we can recommend solutions and technologies, but the client makes the actual decision about whether something needs to be done or not.  I look forward to actually owning architecture after 4 years of making "strong recommendations".

Since folks will ask, I have taken a job as a Senior DBA/Architect with Mozilla, starting Monday.  A former co-worker told me about the job; I was not particularly looking for anything, but I was intrigued.

I have said before that it is hard to find a good in-house DBA if you are not a huge company like Facebook or Google or Yahoo, and that is still true.  At Mozilla, they are 100% open and public about their ideas, and they do a lot of behind-the-scenes work.  Sound familiar?

They also allow their developers to develop on whatever platforms work best.  Their biggest database is their crash reporting database (and they do read it, so do submit your crashes).  They have MySQL, PostgreSQL, and MongoDB, and are starting to move some applications around, as developers are not always aware of what platforms will work best.  There is another DBA, so I will not be alone, but I expect to be just as challenged at Mozilla as I have been at PalominoDB and Pythian.

So, to keep up with PalominoDB, you can:

- like the PalominoDB page on Facebook

- follow @palominodb on Twitter

- connect with Laine on LinkedIn

- follow PalominoDB on LinkedIn

- continue to read PalominoDB's blog and Planet MySQL

 

To keep up with me, you can:

- follow @sheeri on Twitter

- read my blog and Planet MySQL

- connect with me on LinkedIn

- subscribe to the OurSQL podcast (details at http://www.oursql.com/?page_id=2)

Announcing PalominoDB's Non Profit Program

I've always had a dream of being able to use what we are doing at PalominoDB not only for our for-profit clients, but for those who go out day to day helping those who need it.  We always try to work with clients who make people's lives better, but there are those whose entire purpose is to provide aid, education and empowerment to those who are disadvantaged or whose freedoms are at risk.  I am constantly inspired by such organizations.  Worldreader (http://blog.worldreader.org/) is a perfect example - they work to provide e-readers to those who have no access to books or libraries in places such as Kenya.  The challenge has always been how to provide support from our team when they are constantly busy.  In a growing organization, resources are tight, and we are blessed with non-stop work from world-class clients.

 

As we've grown to a sustainable size, I've had more time to think about issues such as this, and I would like to announce our newest program at PalominoDB - donation of hours.  A good portion of our clients are on retainer agreements with a monthly minimum.  Sometimes work is light, sometimes it is heavy and some clients just keep us around as insurance and rarely use our hours.  Regardless, we have to staff for a certain workload and thus must enforce the minimum.  I've always found myself frustrated at having to charge for hours not worked, and constantly brainstorm ways to provide maximum benefit to our regular clients.  

 

PalominoDB would now like to announce our donation of hours program.  We are setting up relationships with non-profits who can use our resources, and the hours for this work will be donated by clients who have unused hours and wish to see them go to a good cause, rather than paying the 50% unused hours rate.  Clients can also donate a fixed number of hours per month to the program.  We will start our program with one non-profit, Worldreader, and will look for other organizations who can make effective use of our resources.   Should you know of any deserving organizations, please don't hesitate to let us know!

 

You can see a video of Worldreader's work here.  Please take a look at their donations page here, as there are great opportunities to donate for e-readers or books or to sponsor classes and schools.

PalominoDB Percona Live: London Slides are up!

 

Percona Live: London was a rousing success for PalominoDB.  I was sad that I could not attend, but a few people sent "hellos" to me via my coworkers.  But on to the most important stuff -- slides from our presentations are online!

René Cannao spoke about MySQL Backup and Recovery Tools and Techniques (description) slides (PDF)

 

Jonathan delivered a 3-hour tutorial about Advanced MySQL Scaling Strategies for Developers (description) slides (PDF)

Enjoy!

PalominoDB at PerconaLive

PalominoDB is very excited about our participation in the upcoming PerconaLive conference in London.  We'll have two of our European staff presenting.  On Monday, Jonathan Levin will be doing a tutorial on Advanced MySQL Scaling Strategies for Developers, and on Tuesday René Cannao will be presenting on MySQL Backup and Recovery Tools and Techniques.  René and Jonathan are two of our newer team members, and represent an exciting growth in staff outside of the US; Jonathan is in the UK and René is in Malta.  I know I'm thrilled to get out to London to meet a lot of new folks in our community.

We did get a chance to present at PerconaLive in NYC this year as well, which was quite a lot of fun, and the positive reception Sheeri's session got was gratifying.  Percona has become a huge part of our community and has provided great value in their information share, tools, services and now conferences - including the MySQL Conference & Expo 2012.  Having been involved in professional MySQL consulting and remote support for four years has been quite the adventure, and the fact that so many companies find space to create, share and prosper only shows the viability of open source software and the communities behind it.  

I know we at PalominoDB are quite proud to share space with companies such as Pythian, Blue Gecko, SkySQL and, of course, Percona. We are proud to support open source solutions across the board, and are even more excited to have grown to a place where we have the resources to contribute back to them, to non-profits and to the growth of our clients and every person on our team.  Here's to an exciting and brilliant future, and a great conference!

How Oracle Has Done Nothing to Change MySQL

Last night at the Oracle OpenWorld MySQL Community Reception, there were lots of old and new friends milling about.  It occurred to me that there is one very important thing Oracle has NOT changed about the MySQL world - the rock stars and higher-ups are still readily accessible.

One of the things I love about being in the open source community is that you can have an in-depth conversation with someone, and only later on find out that this person is famous.  For the most part, rock stars and important people are readily accessible.  They stay in the same hotels that attendees do, they take the same elevators, they are not whisked away by bodyguards, and they do not play the "don't you know who I am?" card.

Now, it's not surprising that the community rock stars like Mark Callaghan, Baron Schwartz, Giuseppe Maxia and Sarah Novotny are still as accessible as ever.  However, Ed Screven and Thomas Ulin were also around for the party, and I can confirm that Thomas was one of the last dozen or so to leave (Ronald Bradford and I closed out the party and were the last to leave).

So, kudos to Oracle for not keeping your VIPs locked up in a bunker.  I am very glad to see this aspect of open source culture still thriving.
