Avoid Gotchas While Starting up with MongoDB


When starting to work with a new technology in a development or sandbox environment, we tend to run things using their default settings as much as possible. This is completely understandable, since we're not yet familiar with all of the options. With MongoDB, though, there are some important issues you might face if you leave your setup in the default configuration. Here are some of our experiences getting started with MongoDB on virtual machines, both locally and on some Amazon EC2 instances.



Ensure that the version of MongoDB you're using has the features you're going to use.  For example, if you installed the "mongodb" package from the default Debian repositories as of this writing (v1.4.4), you'd find that some of the commands used for sharding are available even though sharding isn't actually supported in that version, so the incompatibility is masked.  Get the latest release (as of this writing, 2.0.2) by adding the 10gen repository to your sources and installing the package named mongodb-10gen.

Check that your OS version is compatible with what you're going to use it for.  MongoDB recommends a 64-bit OS due to file-size limitations on 32-bit systems.


Why is it taking so long to start up?

If you're on a 64-bit install of version 1.9.2 or later, journaling is enabled by default.  MongoDB may decide it needs to preallocate the journal files, and it waits until the files are allocated before it starts listening on the default port.  This can take on the order of tens of minutes. You definitely don't need this option if you're just kicking the tires to see what MongoDB can do.  Restart with --nojournal.
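If you'd rather keep the setting in the config file than on the command line, the equivalent looks like this (option name as of the 2.0 series; the path is the Debian package default and may differ on your distribution):

```ini
# /etc/mongodb.conf -- skip journaling while evaluating.
# Do not use in production: without the journal, a crash can
# leave the data files in an inconsistent state.
nojournal = true
```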

What type of filesystem is your database directory mounted on?

Many of the popular Linux AMIs in Amazon's list mount the ext3 filesystem by default.  We recommend the ext4 or xfs filesystems for your database directory because of their faster file allocation.  This is especially noticeable when MongoDB starts allocating journal files, as above. If you're using an AWS instance, you'll avoid this problem if you set up a RAID10 filesystem for your data directory, as shown here.


Disk Space

Another issue is disk space.  If you leave settings at their defaults and you're on a VM or a machine with limited disk space, you could start hitting your limits quickly.  Even if you're only starting up a config server, it will take up another 3GB if you're not careful.  Our recommendation is to use the --smallfiles flag while you're starting, stopping, and configuring, until you figure out what you're doing. As an example, we followed this page to create a sharded database on a Debian VM with about 16GB of disk space, and disk usage quickly ballooned to this:

moss@moss-debian:~/mongodb$ du -h .
3.1G ./a/journal
204K ./a/moveChunk/
208K ./a/moveChunk
4.0K ./a/_tmp
3.3G ./a
3.1G ./b/journal
4.0K ./b/_tmp
3.3G ./b
3.1G ./config/journal
4.0K ./config/_tmp
3.3G ./config
9.7G .


Bottom line: use "--smallfiles" in your command line flags or in your /etc/mongodb.conf files until you are actually running in an environment that has the required disk space.
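In config-file form (option name as of the 2.0 series), that looks like:

```ini
# /etc/mongodb.conf -- keep preallocated data files small while experimenting;
# equivalent to starting mongod with --smallfiles
smallfiles = true
```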


Splitting and Balancing

As Jeremy Zawodny outlines in his excellent blog post “MongoDB Pre-Splitting for Faster Data Loading and Importing”, it is important to understand how MongoDB manages shards. By default, MongoDB groups documents into 200MB chunks, maps those chunks to shards, and then moves chunks between shards as the balancer attempts to manage the load. If you’re doing a large data migration, however, this can be tricky. Check Jeremy’s post for some great advice on pre-splitting while maintaining acceptable performance levels.
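As a rough sketch of what pre-splitting looks like in the mongo shell (the database, collection, shard-key values, and shard name here are all hypothetical, and the commands must be run against a mongos in a live sharded cluster):

```javascript
// Hypothetical example: pre-split mydb.users on shard key {uid: 1}
// before a bulk load, so the load doesn't hammer a single shard
// and then trigger a storm of chunk migrations.
db.adminCommand({ split: "mydb.users", middle: { uid: 1000000 } });
db.adminCommand({ split: "mydb.users", middle: { uid: 2000000 } });

// Optionally place one of the resulting chunks on a specific shard:
db.adminCommand({ moveChunk: "mydb.users",
                  find: { uid: 1000000 },
                  to: "shard0001" });
```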


Spaces in Configuration Files

It's not in the documentation anywhere, but another gotcha was in a configuration file - for example a line like this, with multiple values for a single parameter:

configdb =,,

If there are spaces before or after the ",", the setting will not parse.  Ensure it's a single string with no spaces, as above.
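For example (hostnames and ports here are hypothetical), the working and broken forms look like:

```ini
# Works: one comma-separated string, no spaces
configdb = cfg1.example.com:27019,cfg2.example.com:27019,cfg3.example.com:27019

# Fails to parse: spaces after the commas
# configdb = cfg1.example.com:27019, cfg2.example.com:27019, cfg3.example.com:27019
```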

When is MongoDB the right choice for your business? We explore detailed use cases.

As part of PalominoDB’s “Mongo Month”, we’re reviewing use cases for where we’d recommend choosing MongoDB and where we’d consider other technologies.

Because every environment and architecture is different, and every business has unique needs, we work carefully with clients to find the best solution for the particular challenges they face. Sometimes MongoDB or one of the other open-source technologies we build and support is most appropriate; sometimes an RDBMS is most appropriate; and often a hybrid environment makes the most sense for a client’s specific needs.

Our partners at 10gen lay out the more typical use cases, and we find there are often additional factors to consider, such as:

  • Risk tolerance for bugs and unmapped behaviors
  • Availability requirements 
  • Access- and location-based requirements
  • Security requirements
  • Existing skill sets and tooling
  • Existing architecture and infrastructure
  • Growth expectations and the timeline therein
  • Support needs: community, start-up, or enterprise-class?

Below, we’ll review some of the specific use cases our clients face, and we’ll explore how MongoDB and/or other technologies might address these most appropriately.



Archiving

Craigslist is one of the most famous implementations of MongoDB for archiving - in this case, old posts.  Its schemaless design makes it easy for the datastore to grow as the data model evolves.  Because MongoDB does not natively offer some features that other tools like Cassandra do, such as compression, Craigslist has created workaround patterns, such as pre-splitting data to avoid chunks migrating between shards, and using ZFS with LZO compression.

MongoDB and Cassandra both suit this use case well.  The choice in a situation like this is often determined by in-house skillsets and preferences, as well as the actual size and amount of data that needs to be managed (stay tuned for a future post about the complexities of data management at various stages of data volume and scale). Additional considerations for Cassandra (and HBase or any other Java-based DBMS) include JVM management and tuning, which can be quite challenging for operations teams unused to working with this issue.

other solutions:

  • Cassandra

pros: Native compression reduces complexity and growth is easier to manage with new nodes. Cassandra is arguably the better choice for massive scale (both in storage growth and in throughput of inserts) and does multi-datacenter installations well.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.


Content Management

Schema evolution is a huge win in the content management field.  10gen’s production deployments show numerous successful use cases.  MongoDB’s rich querying capabilities and the document store’s ability to support hierarchical models are all perfect for a CMS. For a great real world example, see the extensive documentation on the Wordnik deployment, and how they manage 1.2TB of data across five billion records.

other solutions:

  • MySQL or PostgreSQL w/caching and read distribution

pros: Skillsets are more readily available, and can support existing tools for managing the RDBMS infrastructure.  Reads are easily scaled in known patterns in the relational database world.

cons: Schema modifications can prove expensive and hard to integrate into publishing workflows.  Failover is not automatic. 


Online Gaming

MongoDB works very well with small reads and writes, assuming you manage to keep your working set in RAM.  Combine that with easy schema evolution and the replica set functionality for simple read scaling and failover, and you have a solid case for investigating MongoDB.  In fact, this rule extends to any situation where writes can grow out of hand for a single database instance.  When you find yourself needing to support growth in writes, and those writes are small and numerous, you need to ask whether you want to a) design the writes away (easier said than done), b) functionally partition the workload, or c) shard.  MongoDB’s autosharding is nice, though not perfect, and there are gotchas PalominoDB can assist you with.  Depending on the other variables mentioned earlier, MongoDB might be a solid fit for this part of your workload. Disney’s Interactive Media Group offered a great presentation at the annual MongoSV conference on how they use MongoDB in their environment.

other solutions:

  • MySQL or PostgreSQL with sharding

pros: Skillsets are more readily available, and can support existing tools for managing the RDBMS infrastructure.

cons: Games require significant schema modifications in early iterations, and these can prove expensive and impactful.  Extensive development and QA hours and increased complexity come hand in hand with writing your own sharding.

  • Cassandra

pros: Cassandra handles massive write scale well, adds capacity easily with new nodes, and does multi-datacenter installations well.

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.


Log Centralization

Asynchronous (and speedy!) writes, capped collections for space management and the flexibility of a schemaless database (since attributes in logs tend to pop up like mushrooms after a rainstorm) are often cited as key benefits for using MongoDB to store logs.  Admittedly, one could build a queue to push data asynchronously into an RDBMS, and partitioning plus a few scripts could duplicate the space management features of a capped collection.

MongoDB and Cassandra both do this well.  Cassandra is arguably the better choice for massive scale and works well in a multi-datacenter environment.  However, MongoDB is much easier to use, manage and query. The size of the client, the skill-set on hand and the availability needs will help us here.

other solutions

  • Percona’s XtraDB w/socket handler and MySQL partitioning, XML datatypes

pros: MySQL familiarity, reuse of MySQL infrastructure (backups, monitoring etc...)

cons: Socket Handler is still somewhat buggy, partitioning requires scripts and XML is not easily queried.

  • Cassandra w/TTL data

pros: multi-datacenter support makes this more viable than MongoDB.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.


Queue Implementation

MongoDB implements its internal replication using tailable cursors and capped collections, and the same features are useful to build simple persistent network-based queuing (distributed message-passing) implementations rather than using a “proper” queueing package. One such implementation is described in detail on Captain Codeman's Blog. Building your own queueing mechanism on top of a DBMS can be suboptimal, but one organization did so because they already had MongoDB expertise on-staff and had difficult performance problems with ActiveMQ.
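A minimal sketch of the pattern in the mongo shell (collection name, message fields, and sizes are all hypothetical): a capped collection preserves insertion order and a tailable cursor blocks waiting for new documents, much like `tail -f`.

```javascript
// Create a small capped collection to act as the queue
db.createCollection("work_queue", { capped: true, size: 1048576 });

// Producer: append messages; capped collections preserve insertion order
db.work_queue.insert({ task: "resize-image", payload: "img_1234" });

// Consumer: a tailable, await-data cursor yields documents as they arrive
var cur = db.work_queue.find()
                       .addOption(DBQuery.Option.tailable)
                       .addOption(DBQuery.Option.awaitData);
while (cur.hasNext()) {
  printjson(cur.next());
}
```

Note the limitations relative to a real broker: no acknowledgements, no redelivery, and old messages are silently overwritten once the capped collection wraps around.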

other solutions

  • ActiveMQ, RabbitMQ. 

pros: proper queueing solutions. Known and documented problems or solutions.

cons: more brittle, more difficult to set up, and less performant if your queueing needs are extremely simple.


In summary, MongoDB, either on its own or in a hybrid environment with other technologies, is a wonderful choice for many of the most common use cases and business challenges our clients face. Helping clients make these complex decisions, and working together on the installation, management and optimization of these tools is our core business, and we encourage you to get in touch if you’d like to explore using MongoDB in your own environment.

Considerations on text searches in big fields and planner costs

Imagine the following scenario: you have a table with a small set of records, with columns of tsvector and text data types, but the text fields hold almost 20 megabytes of text or more.

So, what happens? The Postgres planner checks the number of reltuples (tuples) in the relation to calculate the cost of data extraction. If the number of tuples reached by the query is higher than about 20% of the total result set, or the relation has too few blocks, the planner will choose a sequential scan; otherwise random access will be chosen.

But in this case we face a problem: the rows and blocks of the relation do not hold the “total” content. That happens because Postgres uses a technique called TOAST (The Oversized Attribute Storage Technique), which stores wide values (by default, once a row exceeds roughly 2 kilobytes) compressed and apart from the main block. So the relation itself holds only the first bytes of your column data. You can modify some options on how the columns get TOASTed (the level of compression), but the default behaviour is to compress and store apart. This methodology is pretty cool: if you have, for example, 500MB of data in a single field, keeping all of it inline in the same file is not convenient in most cases.

So, while you have a small number of rows, the amount of data to read from disk is quite big relative to the data stored in the relfilenode (the data file). But for FTS we need to dig into all the content, and to make searches fast we need GIN or GiST indexes. These index types store the data inside the index (differently from B-Tree indexes), making it possible to run the search using the index.

So you want to access your data through the index, but the planner estimates that the row count is so small that a sequential scan is cheaper than using the index. The only way to force an index read is *SET enable_seqscan TO off;* before executing your query. This variable is per-session, so you may want to set it inside the transaction and set it back to “on” for other queries.
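One way to keep the override scoped, as a sketch against the example table below: SET LOCAL applies only to the current transaction and reverts automatically at COMMIT or ROLLBACK, so there is nothing to reset for other queries.

```sql
BEGIN;
SET LOCAL enable_seqscan TO off;  -- affects only this transaction
SELECT author, title
  FROM documents_en
 WHERE cont_ftsed @@ to_tsquery('english', 'editor&find');
COMMIT;  -- enable_seqscan reverts to its previous value here
```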

By the way, once your table is populated with a large enough amount of data, you won't need this trick anymore; but if your table is quite static, it can serve as a little fix. Beyond that, this isn't really a performance problem, but this kind of thing shows you how the planner works.

Let’s see an example:

CREATE TABLE documents_en (
   author_id char(2),
   text_id serial,
   author varchar(45) NOT NULL,
   title varchar(127) NOT NULL,
   content text NOT NULL,
   cont_ftsed tsvector,
   PRIMARY KEY (author_id, text_id)
);

CREATE FUNCTION ftsed_upd_en() RETURNS trigger AS $$
BEGIN
  NEW.cont_ftsed := (setweight(to_tsvector('english', coalesce(NEW.title,   '')), 'A') ||
      setweight(to_tsvector('english', coalesce(NEW.author,  '')), 'B') ||
      setweight(to_tsvector('english', coalesce(NEW.content, '')), 'C'));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tgr_ins_ftsed_en BEFORE INSERT ON documents_en
  FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE TRIGGER tgr_upd_ftsed_en BEFORE UPDATE OF author, title, content
ON documents_en
  FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE INDEX textsearch_en ON documents_en USING gin(
   (setweight(to_tsvector('english', coalesce(title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'C'))
);

CREATE INDEX ix_ftsed_en ON documents_en USING GIST(cont_ftsed);

Before continuing, we need to clarify three things: text_id is related to author_id, but just for testing we use it as a serial to avoid collisions in the primary key; we use triggers to automatically populate the tsvector column; and I created two indexes to show both approaches in the examples. IMHO, I don't recommend functional indexes for performance reasons, but they are still an option.

To fill up the table, I made a script to grab all the READMEs on the server and put them into the DB.


 egrep -oi '[a-z]+' "$_FILE_" | egrep '[a-z]{4,}' > $WORDS
 AUTHOR=$(cat $WORDS | sort -u | head -n2 | awk '{printf("%s ", $0) }')
 TITLE=$(cat ${WORDS} | sort | uniq -c | sort -nr | head -n3 | awk '{printf("%s ", $2)}')
 BODY=$(cat $WORDS | sort -u | awk '{printf("%s ", $0) }')

 $PSQL -Upostgres fts -c "INSERT INTO documents_en (author_id, author, title, content) VALUES ('${AUTHOR_ID}', '${AUTHOR}', '${TITLE}','${BODY}' );"

I know, it's a bit tricky, but it's useful for this test. To execute it, you'll need something like:

locate README.txt | xargs ./script

First let’s get some stats from the table:

fts=# select count(*) from documents_en ;
count : 22

fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty : 24 kB (size of the table)

fts=# select relpages from pg_class where relname = 'documents_en';
relpages : 3 (8kB each)

Now, a simple query with EXPLAIN ANALYZE:

fts=# EXPLAIN ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.228..0.464 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime: 0.501 ms
(4 rows)

Oh! It’s using a sequential scan (as we expected), but now let’s add a trick:

fts=# SET enable_seqscan TO off;
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.127..0.191 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime: 0.280 ms
(4 rows)

What happened? The cost is higher, but it's faster? The short answer is: yes, that's possible. Cost is an estimate based on the number of blocks and the type of access; random access is weighted at about 4:1 relative to sequential access. In this particular case the table is small, so the planner estimates that it's cheaper to read it without indexes. BUT that's not the real problem. The real one is that the data has already been TOASTed outside the “data file”, and that data is not considered by the planner in this case.

Another question: can we improve access to this data? The answer is yes, if you don't mind the extra storage. What you can do is:

fts=# alter table documents_en alter column cont_ftsed set storage external;

That will store the data of the cont_ftsed column uncompressed and force it to be kept outside the table (this is not exactly how it works internally, but for didactic purposes we'll describe it this way).

fts=# vacuum  full analyze documents_en;
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.071..0.127 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime:
0.178 ms
(4 rows)

fts=# SET enable_seqscan TO on;
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.067..0.292 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime:
0.329 ms
(4 rows)

In this case we run VACUUM FULL because the SET STORAGE option doesn't rewrite previously stored data; it only affects data stored afterwards. The vacuum forces the change to be applied across the whole table.

OK, what happens if the table gets filled with more data? The planner automatically starts to use the index:

fts=# select pg_size_pretty(pg_relation_size('documents_en'));
1744 kB
(1 row)
fts=# SELECT count(*)  FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
(1 row)

fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                     QUERY PLAN                                                       
Bitmap Heap Scan on documents_en  (cost=4.38..47.02 rows=13 width=31) (actual time=1.091..5.404 rows=45 loops=1)
  Recheck Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
  ->  Bitmap Index Scan on ix_ftsed_en  (cost=0.00..4.37 rows=13 width=0) (actual time=1.003..1.003 rows=125 loops=1)
        Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
Total runtime: 5.489 ms
(6 rows)

SET enable_bitmapscan TO off;
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                         QUERY PLAN                                                          
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..56.50 rows=13 width=31) (actual time=0.307..5.286 rows=45 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
Total runtime: 5.390 ms
(4 rows)

We don’t see a huge difference in time between the two, but we are still facing the question of the relation between cost and real performance.

But don’t forget the functional index! I think you will like the following example:

fts=# analyze;
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                             QUERY PLAN                                                                                                                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..292.20 rows=17 width=7) (actual time=2.960..3878.627 rows=45 loops=1)
  Filter: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 2238
Total runtime: 3878.714 ms
(4 rows)

SET enable_seqscan TO off;
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                                  QUERY PLAN                                                                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents_en  (cost=272.15..326.46 rows=17 width=7) (actual time=0.937..1.026 rows=45 loops=1)
  Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  ->  Bitmap Index Scan on textsearch_en  (cost=0.00..272.15 rows=17 width=0) (actual time=0.916..0.916 rows=45 loops=1)
        Index Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
Total runtime: 1.109 ms
(5 rows)

Again! But this time the difference in execution time is much starker. Not so pretty, huh? So it seems you need to be careful if you run searches on big, TOASTed columns. You may want to check whether your production database needs some of these tweaks to get the maximum performance out of your Postgres instance.

The version used in this post is PostgreSQL 9.2devel on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.2 20111027 (Red Hat 4.6.2-1), 64-bit. I also ran the tests on 9.1 and saw the same behaviour.

Happy hacking!

Performance Optimization for MyISAM Compressed Tables

Database performance optimization is a significant part of the ongoing service we provide for our clients. We recently found that a client had a query they performed regularly that was drastically slowing the system as a whole, and we investigated to see if we could help them resolve the issue.

The symptoms were confusing: though the server configuration was sized appropriately for the available memory, the server was pre-allocating approximately 20GB of RAM immediately on a query, even when the query required nowhere near that allocation, leading to excessive swapping and performance hits.

We explored various options: Were there SET SESSION statements being executed improperly before the query? Were buffer_size parameters set correctly? Were there other variables we hadn’t considered?

As we investigated, we discovered something interesting: the MySQL server (mysqld) sets a default memory-mapping limit for compressed MyISAM tables, myisam_mmap_size, and that default is unnecessarily large (18446744073709547520 bytes). As a result, mysqld maps all of the compressed MyISAM tables into memory. In our case, with 20GB of tables, mysqld was suddenly allocating an unreasonable amount of memory to the process.

This default memory allocation variable was introduced in MySQL 5.1.43, and you can read more about it here.

We immediately tuned this down to a more manageable size for the actual queries being run, and are seeing the expected performance improvements in our client’s queries.
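Assuming the variable in question is myisam_mmap_size (introduced in MySQL 5.1.43), capping it looks like this in my.cnf; the 2GB figure here is illustrative, not a recommendation, and should be sized to your actual compressed tables and queries:

```ini
[mysqld]
# Cap the memory mapped for compressed MyISAM tables
# (default is effectively unlimited: 18446744073709547520 bytes)
myisam_mmap_size = 2147483648
```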

Continuous Deployment Architecture For Ruby on Rails


We have seen agile development become more popular in recent years, thanks in part to frameworks like Ruby on Rails that fit naturally into continuous integration environments.  These frameworks leverage quick testing and the ability to deploy easily across clusters. As deployments happen more often, we look for ways to minimize the disruption to users. A cluster architecture already contains the components we need: with multiple servers for each role, we can update a subset of the system while the rest serves the business.

We recently worked with the DevOps team at one of our clients to develop an architecture that allows them to run rolling changes through the cluster using the Ruby on Rails framework. There are two principal components of change: application code on the application servers and database structure on the database servers. The application servers can easily be targeted specifically through deploy.rb. The database side of things, however, is a bit more complicated.

In order to have zero downtime, half the application servers are taken offline and updated, then the pair of master/slave standby databases have the DDLs applied to them. Traffic is then switched to these servers.

HAProxy is placed between the application and the MySQL database servers to act as a router for database traffic (see for specifics), providing the ability to flip the active and standby roles of the two server pairs (as well as providing a high-availability solution). Since there are slave relationships, we needed to be able to pause replication through the deployment application (Capistrano).  We accomplished this by adding a file, deployutil.rake, to the Rake subsystem under lib/tasks:

namespace :deployutil do

  desc 'Checks the replication status of the primary database'
  task :replication_test => :environment do
    # find the state of replication
    mysql_res = ActiveRecord::Base.connection.execute("SHOW SLAVE STATUS")
    mysql_res.each_hash do |row|
      if row['Slave_IO_Running'] == "Yes" and row['Slave_SQL_Running'] == "Yes" and
         row['Seconds_Behind_Master'].to_s == "0"
        puts "ReplicationGood"
      elsif row['Seconds_Behind_Master'].blank?
        puts "ReplicationBroken"
      else
        puts "ReplicationBehind_" + row['Seconds_Behind_Master'].to_s
      end
    end
  end

  task :start_slave => :environment do
    ActiveRecord::Base.connection.execute("START SLAVE")
  end

  task :stop_slave => :environment do
    ActiveRecord::Base.connection.execute("STOP SLAVE")
  end
end

We can then create tasks in deploy.rb to call these:

  desc "shows the current passive slave replication status"
  task :get_slave_replication_status, :roles => :cron do
    dbrails_env = fetch(:rails_env) + '_passiveslave'
    # find the state of replication
    set :slave_replication_status,
        capture("cd #{latest_release} ; RAILS_ENV=#{dbrails_env} rake deployutil:replication_test").chomp
  end

  desc "stop passive slave replication"
  task :stop_passiveslave_repl, :roles => :cron do
    dbrails_env = fetch(:rails_env) + '_passiveslave'
    run "cd #{latest_release} ; RAILS_ENV=#{dbrails_env} rake deployutil:stop_slave"
  end
We also want to be able to limit changes to specific databases so that the changes won't go into the binlogs and propagate when the slaves are turned back on.  See for details on how to do this through an extension of ActiveRecord.  A word of caution here: setting sql_log_bin=0 to skip logging these changes will invalidate using the binlogs for point-in-time recovery, so you will need a full backup after the change.

Master/Slave Replication In Limited Bandwidth Scenarios

Database replication is quite an important component of the database architectures of many of our clients. Managed properly, it offers real-time redundancy and fault-tolerance, and allows for increased scalability. When replication goes wrong, however, it can turn into a morass of data conflicts, lag and strange little mysteries to distract your DBA team.

PalominoDB recently helped a client out of one of these strange little replication mysteries, and we document it here to help you avoid getting stuck in the same sort of situation.

The symptoms: Our slave database (which we’ll call DB_Slave) was lagging. The logs revealed that the value of Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0.

What we knew: DB_Slave connects to the master (which we’ll call DB_Master), and DB_Master sends binlog events to DB_Slave.

Because this data is pushed from the master rather than pulled by the slave, there was a small window where, in theory, a replication channel could be broken, and DB_Slave wouldn't notice until "slave-net-timeout" seconds (with a default of 3600) passed.

We also knew that Seconds_Behind_Master is calculated not as the difference between the current datetime and the datetime of the binlog event being processed, but as the difference between the timestamp of the last event received from the master's binlog and that of the event currently being executed from the relay log. That means that if DB_Master is not sending binlog events to DB_Slave (for example, due to a network issue), the slave could be lagging behind the master without being aware of it at all.

Further, if DB_Slave then received some binlog events from DB_Master, and realized at that point that it was lagging, it could still fail to notice if DB_Master once again stopped pushing binlog events. 
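A toy Ruby sketch of that calculation (grossly simplified, with hypothetical integer timestamps) shows how a starved slave can report 0:

```ruby
# Toy model: Seconds_Behind_Master compares the timestamp of the newest event
# *received* from the master against the event currently being *executed*
# from the relay log -- not against the current wall-clock time.
def seconds_behind_master(last_received_event_ts, executing_event_ts)
  last_received_event_ts - executing_event_ts
end

# Healthy channel: events stream in and are executed promptly.
puts seconds_behind_master(1000, 999)   # => 1

# Broken network: no new events arrive, so both timestamps point at the same
# (old) event and the slave reports 0 even though the master has moved ahead.
puts seconds_behind_master(1000, 1000)  # => 0
```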

Investigation: Our theory that there was in fact a network issue was supported by evidence from the logs. We saw that SHOW SLAVE STATUS didn't show any progress in Master_Log_File and Read_Master_Log_Pos: that is, DB_Slave was not getting any data from DB_Master.

We also found this in the processlist of DB_Master:

Id         User        Host                db    Command      Time  State           Info
104062295  slave_user  [IP Address]:10124  NULL  Binlog Dump  1140  Writing to net  NULL

The status "Writing to net" means that DB_Master is sending (or trying to send) data to DB_Slave.

In a healthy replication channel, the status would be "Has sent all binlog to slave; waiting for binlog to be updated".


We suspected this might be an issue related to network bandwidth, but when we ran a series of STOP SLAVE / START SLAVE processes, we noticed that the number of threads in status “Writing to net” was increasing, with up to 20 concurrent open threads. Connections simply weren’t being closed. Further investigation revealed that those connections were staying in CLOSE_WAIT status.

Based on our observations we concluded that the limited network bandwidth was at fault, and we needed a solution to compress the transmission between the master and slave databases. Luckily MySQL provides a convenient parameter, which is dynamic and is set on the slave: slave_compressed_protocol=1

Here is how to implement it:


mysql> show global variables like 'slave_compressed_protocol';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| slave_compressed_protocol | OFF   |
+---------------------------+-------+
1 row in set (0.00 sec)

mysql> stop slave;
Query OK, 0 rows affected (0.25 sec)

mysql> set global slave_compressed_protocol=1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like 'slave_compressed_protocol';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| slave_compressed_protocol | ON    |
+---------------------------+-------+

Resolution: We set slave_compressed_protocol=1 and restarted the slave. Replication started catching up at a very surprising speed. Even though it was catching up, we noticed the same behavior we’d noticed in the past: Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0, the io_thread was behind, and we didn’t see a high load on the server. 

Our monitoring graphs showed that network traffic did not increase much, but CPU usage and commands/sec had surprisingly high jumps.

In short, DB_Slave was processing more queries because the IO thread was receiving more data from DB_Master, thanks to compression. 

Live Blogging at MongoSV

Dwight Merriman, CEO of 10gen, speaks about the growth of the MongoDB community. The conference has doubled in size from 500 to 1100+ attendees.

Eliot Horowitz, CTO of 10gen, demos the MongoDB 2.2 Aggregation Framework, which simplifies aggregating data in MongoDB. He pulls in the MongoDB twitter feed to populate data and sums it using: runCommand({aggregate: … })

The "aggregate" command lands in nightly builds tonight.

Cooper Bethea, Site Reliability Engineer at Foursquare, speaks on Experiences Deploying MongoDB on AWS.

All data stored in MongoDB
8 production MongoDB clusters
Two of the larger clusters: 8 shards of users, 12 shards of check-ins.
Checkins: ~80 inserts/sec, ~2.5k ops/sec, 45MB/s outbound at peak.
Users: ~250 updates/sec, ~4k ops/sec, 46MB/s outbound at peak.
Only one unsharded cluster; the others are fully sharded, using replica sets.

All servers in EC2
mongoS is on mongoD instances
mongoCs are on three instances
mongoD working set contained in RAM
MongoD backing store: 4 EBS volumes with RAID0

Problem: fragmentation leads to bloat
The mongoD RAM footprint grows: data size, index size, storage size.

Solution: order the replica set by dataSize + indexSize, uptime DESC. Run --repair on secondary nodes one at a time. Primary nodes require stepDown(), which is more delicate.

Problem: EBS performance degrades
Symptoms: ioutil % on one volume > 90
qr/qw counts spike
fault rates > 10 in mongostat
sometimes:  topless counts spike

KILL IT! Stop mongoD process if secondary node, stepDown() + stop if primary.
Rebuild from scratch.

How long does it take? ~1 hour
Working set in RAM

Problem: fresh mongoD has not paged in all data
Solution: run queries

Catting the data files to /dev/null works too, unless your dataset size is larger than RAM.

Hashing Algorithm in MySQL PASSWORD()

Recently we had a question from a customer: what is the hashing algorithm implemented in PASSWORD() ?

The manual doesn't give a straight answer in either of the two pages that discuss it.

It is enough to dig a bit deeper to find a note specifying that "mysql.user.Password stores SHA1(SHA1(password))".


Instead of blindly trusting the documentation (even though I believe it is correct), I ran some tests and was confused by the first result:

mysql> SELECT PASSWORD("this_is_a_random_string") `pass`\G 

pass: *12E76A751EFA43A177049262A2EE36DA327D8E50

mysql> SELECT SHA1(SHA1("this_is_a_random_string")) `pass`\G 

pass: 9b653fd9fb63e1655786bfa3b3e00b0913dfc177

So it looked like SHA1(SHA1(password)) wasn't PASSWORD(password), at least in this test.

The best documentation ever is the source code, so I read it and understood why my previous test was incorrect: the second SHA1() is applied to the binary data returned by the first SHA1(), not to its hex representation. Therefore in SQL I have to UNHEX() it before applying the second SHA1(). In fact:

mysql> SELECT SHA1(UNHEX(SHA1("this_is_a_random_string"))) `pass`\G 

pass: 12e76a751efa43a177049262a2ee36da327d8e50


So yes, I confirmed that mysql.user.Password stores SHA1(SHA1(password)). I also hope this post is useful for understanding how MySQL implements PASSWORD().
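The same double hash is easy to reproduce outside of SQL. Here is a small Ruby sketch using the standard Digest library; the leading '*' is the marker MySQL prepends to hashes stored in this format:

```ruby
require 'digest'

# MySQL's PASSWORD() (4.1+ scheme): SHA1 over the *binary* SHA1 digest of the
# password, hex-encoded, with a leading '*' marker prepended.
def mysql_password(password)
  '*' + Digest::SHA1.hexdigest(Digest::SHA1.digest(password)).upcase
end

puts mysql_password('this_is_a_random_string')
# => *12E76A751EFA43A177049262A2EE36DA327D8E50
```

Note that the inner digest is passed along as raw bytes, which is exactly what UNHEX() restores in the SQL version above.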

Building libMemcached RPMs

A client running CentOS 5.4 Amazon EC2 instances needed the latest libMemcached version installed. With the inclusion of the "make rpm" target, libMemcached makes it easy to build the libMemcached RPMs by doing the following:

Spin up a new CentOS Amazon EC2 instance,

As root on the new instance:

yum install @development-tools
yum install fedora-packager
/usr/sbin/useradd makerpm

Now change to the makerpm user and build the RPMs:

su - makerpm
tar -zxf libmemcached-1.0.2.tar.gz
cd libmemcached-1.0.2
./configure && make rpm
find . -name '*rpm*'


Using HAProxy for MySQL failovers

There are a number of solutions available for MySQL architecture that provide automatic failover, but the vast majority involve potentially significant and complex changes to existing configurations. Fortunately, HAProxy can be leveraged for this purpose with a minimum of impact to the current system. Alex Williams posted this clever solution a couple of years ago on his blog. Here we take a closer look at the details of implementing it in an already existing system.

HAProxy is a free, open-source load balancer, proxy, and high-availability application for the TCP and HTTP protocols. Since it was built mostly to handle web traffic, it has robust rule writing using HTTP components to check on the health of systems. The key to Alex's solution was creating xinetd daemons on the database servers that send out HTTP messages. The HTTP check to determine database status works like this:

  1. The HAProxy server sends HTTP requests to the database servers at configured intervals to specified ports
  2. The /etc/services file on the database servers maps those ports to services in their /etc/xinetd.d directory
  3. The services can call any specified script, so we build scripts that connect to the databases and check for whatever conditions we choose.

The services then return an OK or a Service Unavailable response per the conditions. Code for these scripts is included in Alex's article.
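As a rough illustration of the shape of such a check script (Alex's article has the real one), here is a minimal Ruby sketch; the `mysqladmin ping` probe and the wording of the responses are our assumptions:

```ruby
# xinetd runs this script and pipes its stdout straight back to HAProxy,
# so we print a complete HTTP response by hand.
def http_response(healthy)
  if healthy
    body = "MySQL is running.\r\n"
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: #{body.bytesize}\r\n\r\n#{body}"
  else
    body = "MySQL is down.\r\n"
    "HTTP/1.1 503 Service Unavailable\r\nContent-Type: text/plain\r\nContent-Length: #{body.bytesize}\r\n\r\n#{body}"
  end
end

# Any probe works here: a ping, a SELECT 1, a replication-lag query...
healthy = system('mysqladmin ping --silent')
print http_response(healthy)
```

HAProxy's httpchk only cares about the status line, so a 200 marks the server up and a 503 marks it down.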

Our database configuration for this implementation is two pairs of Master-Slave databases in an Active-Passive relationship with Master-Master replication between the sets.  Siloing the passive Master-Slave provides a hot spare as well as continuous up-time during deployments provided we have a means of swapping the active/passive roles of each pair.  To accomplish the latter, we built two HAProxy configuration files, haproxy_pair1.cfg and haproxy_pair2.cfg, the only difference between the two being which Master-Slave pair is active.  Having the two files indicate which pair is active also allows immediate visibility of the current configuration.

Just as in Alex's sample configuration, our web application uses two DNS entries, appmaster and appslave, to call writes and reads respectively. The IPs for these addresses are then attached to our HAProxy server, allowing HAProxy to bind to them when it starts and then route them to the appropriate database server.

On our HAProxy server we then customized the /etc/init.d/haproxy script to handle an additional parameter of "flipactive" which provides us the capability to swap the database pairs:

flipactive() {
        # first detect which cfg file haproxy is using
        ps_out=`ps -ef | grep haproxy | grep cfg`
        # if pair2 is active, switch to pair1
        if [[ "$ps_out" =~ pair2 ]]; then
             # reread the config, replacing the old process (pid read from the pidfile)
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair1.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        # if pair1 is active, switch to pair2
        else
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair2.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        fi
}

The rest of the configuration details are covered pretty well in Alex's article as well as in the HAProxy documentation.

We successfully developed and implemented this technique with the devops team at SlideShare last month to build rolling DDL scripting to multiple databases using Capistrano. It allowed them to have explicit control over which database was being updated, thereby giving them the means to update one database while the other served the current application code.
