Blog

Benchmarking NDB Against InnoDB on a Write-Intensive Workload

Last month, we evaluated Amazon's new SSD offerings with an extensive series of performance benchmarks.

As a follow-up, we've prepared a second series of benchmarks that specifically explores performance issues with a write-intensive workload on both the NDB and InnoDB storage engines.

Download a free PDF of our findings, and as always, we welcome your feedback, comments and questions below.

Palomino Evaluates Amazon’s New High I/O SSD Instances

Amazon Web Services (AWS) recently introduced a new storage layer with SSD as its backend. Because their previous EBS storage has been challenging in the areas of I/O throughput and stability, we were extremely excited to run comparative benchmarks to offer a recommendation for customers considering upgrading to SSD. 

The opportunity to expand I/O throughput horizontally can create a longer runway for sharding (distributing data-sets across multiple MySQL logical clusters in order to reduce write I/O), which can be quite compelling. 

Our extensive tests, described in detail in our Evaluation Report prepared by René Cannao and Laine Campbell (click here for a free PDF), illustrate the potentially enormous gains in throughput, performance and cost for companies scaling MySQL in Amazon. Palomino would strongly recommend that clients consider upgrading to SSD storage. Learn more.

Querying connection information in MongoDB

As with all other database platforms, daily administration and troubleshooting of MongoDB often begins with analyzing database activity.  It can be helpful to see all connections from within MongoDB, which you can do by running:

   db.$cmd.sys.inprog.findOne({$all:1});
or 
   db.currentOp(true);

where the “true” makes the command include idle connections.  But often there are so many connections that, unless you capture the output in a script file or something similar, the raw listing is of little use.

Thanks to a hint given by Scott Hernandez in the mongodb-user forum, we can use the line

   db.currentOp(true).inprog.forEach(function(o){if( <filter criteria> ) printjson(o)});

to show a subset that is more manageable.  For instance, we might want to show all connections from a particular application server:

   db.currentOp(true).inprog.forEach(function(o){if(o.client.indexOf("10.0.1.77") != -1 ) printjson(o)});

or from the MongoDB logs we’ll see a particular connectionId and want to know where it came from:

   db.currentOp(true).inprog.forEach(function(o){if(o.connectionId == 8606 ) printjson(o)});

This will then show all the connection info for that connectionId:

   {
      "opid" : 7117752,
      "active" : false,
      "lockType" : "read",
      "waitingForLock" : false,
      "op" : "query",
      "ns" : "player_data.player_answerchoice",
      "query" : {
         "$query" : {
            "poll_id" : ObjectId("4f58e08db3e93217c2000008")
         }
      },
      "client" : "10.0.1.77:59542",
      "desc" : "conn",
      "threadId" : "0x49e31940",
      "connectionId" : 8606,
      "numYields" : 0
   }
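
If one of these operations turns out to be a runaway query, the same opid can be passed to db.killOp() to terminate it; for example, using the opid from the output above:

   db.killOp(7117752);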

Testing and Analyzing Performance with Benchmarks

Generic benchmark tools can be very useful for testing performance on your system. These benchmark tools normally have a set of predefined workloads, but often they don't match your specific workload in useful ways.

One of the best ways to reproduce your workload is to have a good sense of the application that uses the database and how it manages requests to the database. If this is not an option, it is also possible to analyze traffic and to find the most common queries, and use those to define the most common workload.

You can analyze traffic in many ways: from tcpdump to the general log, from the binlog (DML statements only) to the slow query log.

Afterwards, it is possible to analyze the captured queries with pt-query-digest (or the now-obsolete mk-query-digest) to find the most common and/or heaviest queries.
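
For example (a sketch on our side; the file names and paths are illustrative, not from the original setup), a slow query log or a tcpdump capture can be digested like this:

pt-query-digest /var/lib/mysql/slow.log > slow_digest.txt

tcpdump -s 65535 -x -nn -q -tttt -i any port 3306 > mysql_traffic.txt
pt-query-digest --type tcpdump mysql_traffic.txt > tcpdump_digest.txt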

In the system we analyze here, the workload was mainly write intensive and involved just 4 tables:

  • tableA was receiving single-row INSERT statements;
  • for each insert on tableA, on average 200 INSERTs were performed on the other 3 tables, distributed as follows: 100 on tableB, 95 on tableC, 5 on tableD (more specifically, for each INSERT on tableB there is an INSERT on either tableC or tableD).

 

The system also receives SELECT statements, but only in very small numbers, and they are simple primary-key lookups.

To simulate the workload, we wrote a simple Perl script that spawns a number of threads performing the DML statements, and other threads performing the SELECT statements.

At regular intervals, the script prints statistics and progress.

The benchmark test was executed in a setup with 2 hosts: one host where the client was running, and another host where the servers were running.

The RDBMSs tested were MariaDB 5.2.3 (with TokuDB 5.2.7 and InnoDB) and Percona 5.5.20.

Additionally, Percona 5.5.20 was tested as multiple instances running on the same host.

 

The goal of the first benchmark test was to compare TokuDB against InnoDB for this specific workload.

We executed MariaDB with TokuDB with the following (simple) config file:

[mysqld] 
user=mysql 
table_open_cache=1024 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
datadir=/localfio/datadir
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M 
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/tokudb 
 

 

We found the performance of InnoDB significantly better than that of TokuDB in this instance, though this test - where the dataset fits almost entirely in memory - does not show the real power of TokuDB, which excels at insertion rates at scale. Because these tables have very few indexes, TokuDB's Fractal Tree indexes weren't particularly effective. Furthermore, the benchmarks were running on FusionIO, which meant that InnoDB's performance didn't degrade as much as it would on spinning disks. We excluded TokuDB from the subsequent benchmark tests because they all cover cases that are not well-suited to TokuDB's strengths.

We temporarily abandoned MariaDB, and tested Percona 5.5.20 with the following config file:

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=2G
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 

 

We tried various innodb_flush_method settings, and the graphs show that O_DIRECT performs slightly better than the default fsync(), even though the benchmark shows an odd warm-up phase. We also tried ALL_O_DIRECT, which performed badly.

 

Additionally, we tried innodb_log_block_size=4096 instead of the default 512, but nothing changed: insert rate wasn't affected.

 

One of the goals of this benchmark was to test if running multiple mysqld instances on the same host performs better than a single mysqld instance.

On this specific hardware, the answer seems to be yes. With 8 mysqld instances configured with the config file listed below (identical apart from paths and ports), throughput is significantly higher. Note that innodb_buffer_pool_size was set to 256M to try to stress the I/O subsystem.

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 
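
As an illustrative sketch (the directory layout below is ours, not from the original post), each additional instance reuses this file and overrides only its port, datadir and socket; for the second instance, for example:

port=3307
datadir=/localfio/MULTI/db01
socket=/localfio/MULTI/db01/mysql.sock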
 

 

All the above tests were executed using 36 client connections for writes and 36 client connections for reads.

 

We then ran a new cycle of tests, but instead of using 36 x 2 connections, we used 80 x 2 (80 for writes and 80 for reads).

 

 

With 80 connections, throughput was higher than with 36 connections, but at nearly regular intervals we found performance dropping. This seems to be independent of the size of the buffer pool.

It is interesting to note that with only one mysqld instance, FusionIO was performing at 4.7k – 4.8k IOPS, while with 8 mysqld instances FusionIO was performing at 27k – 29k IOPS. As expected, with a small buffer pool performance tends to slowly degrade when the data doesn't fit in memory.

We tried various values of innodb_write_io_threads, but this didn't make any difference, since most writes were going to the redo log rather than to the tablespaces.

To better analyze the throughput, we reduced the sample time to 10 seconds and reran the test:

 

 

It is clear that throughput drops from time to time, and for a nearly constant amount of time. While the test was running, we tried to monitor the mysqld instances, but there was no clear indication of why they were stalling. The Redo Log wasn't anywhere close to full and InnoDB wasn't performing aggressive flushing. The amount of data read from disk was pretty low but the amount of data written was spiking. Yet, the writes weren't coming from InnoDB.

The reason for the stalls became apparent when we analyzed the content of /proc/meminfo: the Linux Virtual Memory (VM) subsystem was performing dirty pages flushing!

We changed dirty_background_ratio from 10 (the default) to 1, and reran the test.

sysctl -w vm.dirty_background_ratio=1
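
To make the setting survive a reboot, the usual approach is to also persist it in /etc/sysctl.conf and reload, along these lines:

echo "vm.dirty_background_ratio = 1" >> /etc/sysctl.conf
sysctl -p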

 

Throughput is now much more stable, although performance has dropped by 2.8%. It is interesting to note that throughput still dips at nearly the same points in time regardless of the value of dirty_background_ratio.

A quick analysis of the MySQL source code shows that binlogs are synced to disk when they are closed; therefore, the drops in throughput may be caused by the flushing of binary logs at rotation time.

We then raised vm.dirty_background_ratio up to 10 (the default value) and lowered max_binlog_size from 1G to 64M.
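
For reference, a minimal way to apply those two changes without a restart (assuming neither setting is pinned elsewhere in the configuration) is:

sysctl -w vm.dirty_background_ratio=10
mysql -e "SET GLOBAL max_binlog_size = 67108864"   # 64M in bytes; equivalently, max_binlog_size=64M in my.cnf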

 

 

Throughput doesn't drop as drastically as in the two previous tests, but goes up and down at more regular intervals.

At the end of this test, performance with max_binlog_size=64M is ~4% lower than the initial test with max_binlog_size=1G (in both cases, vm.dirty_background_ratio=10).

The last setup of 8 instances with a 256M buffer pool each and max_binlog_size=64M was then compared with a new setup:  4 instances with a 512M buffer pool each (2GB total in both cases) and max_binlog_size=64M:

 

 

An interesting outcome of this last test is that total throughput rose by around 4% (recovering what was originally lost by switching to 64M binlogs) and that the total number of IOPS dropped to ~16k, leaving room for more I/O in case of a different workload.

We then ran a new test using only 2 mysqld instances. It shows what was already easy to guess when running a similar test with only one mysqld instance: a lower number of mysqld instances can't fully utilize IO capacity and therefore has lower throughput.

 

Conclusions (most of them as expected) for this specific workload and on this specific hardware:

  • O_DIRECT performs slightly better than the default fsync for innodb_flush_method.
  • A high number of clients provides more throughput than a smaller number of clients; not enough tests were performed to find the optimal number of clients.
  • Throughput drops when the data doesn't fit in the buffer pool.
  • A high number of mysqld instances running on the same server is able to better utilize the IOPS that FusionIO can provide (it would very likely be a bad idea, though, to run multiple mysqld instances on the same spinning disk or array).
  • The sync of the binlog during binlog rotation can stall the system; lowering dirty_background_ratio or max_binlog_size stabilizes the throughput.

The Postgres-XC 1.0 beta release overview

 

When I heard about this project a year ago, I was really excited about it. Many cluster-wide projects based on Postgres have been developed very slowly, based on older (e.g. Postgres-R, http://www.postgres-r.org/) or proprietary (e.g. Greenplum) versions. The features that this project hopes to achieve are ambitious, as we'll detail in this post. And best of all, this project is based on PostgreSQL 9.1, which is really up to date (at the moment of writing this post, it is the latest stable version).

If you are interested in a serious project for scaling your PostgreSQL architecture horizontally, you may visit the official website at http://postgres-xc.sourceforge.net/ and take a look.

For those who are interested, there will be a tutorial at PgCon this year. In the meantime, this post gives a brief overview and a broad idea of the project for those who want to get involved.

 

What Postgres-XC can do:

  • Support a multi-master architecture. Data nodes can contain part or all of the data of a relation.
  • Present a transparent view to the application from any master. The application only needs to interact with the cluster through coordinators.
  • Distribute or replicate per relation: replication, round robin (the default if no unique column is present), by hash (the default if a unique column is present), by modulo, or restricted to a group or node (see the sketch just after this list).
  • Execute transactions in parallel among cluster nodes.
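
As a rough sketch of what per-relation distribution looks like (the table and column names are ours, and the exact clause names should be checked against the beta documentation):

CREATE TABLE countries (code char(2), name text) DISTRIBUTE BY REPLICATION;
CREATE TABLE events (id bigint, payload text) DISTRIBUTE BY HASH (id);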

 

What Postgres-XC cannot do:

  • Support triggers (may be supported in future releases).
  • Distribute a table with more than one parent. 

 

Before you start:

  You need to install recent versions of Flex and Bison. That's important because with the latest tarball, './configure' won't raise an error if they are missing; the error only appears once you execute 'make'. You will also need readline-dev and zlib-dev (not mandatory, but strongly recommended).

According to the documentation, Postgres-XC should be compatible with Linux platforms based upon Intel x86_64 CPU. 

The documentation needs to be improved, so we advise you to try the steps directly and read the help printed by the commands. For example, the initdb command in the documentation is incomplete: "--nodename" is mandatory in this version. This project is new and has only a few contributors to date, but we hope its community keeps growing. Most importantly, it is great that a beta release was launched earlier than we expected.
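
For instance, initializing a coordinator and a datanode data directory looks roughly like this (the paths and node names are ours):

initdb -D /usr/local/pgxc/data_coord1 --nodename coord1
initdb -D /usr/local/pgxc/data_datanode1 --nodename dn1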

 

Elements of architecture

 

  • GTM (Global Transaction Manager)

+ Note that I say GTM, not GTMs: only one GTM can be the manager. For redundancy you have the GTM-Standby, and to improve performance and failover, GTM-Proxies.

+ The GTM serializes all the transaction processing and can limit overall scalability if you deploy it naively. It should not be used across slow or wide networks, and it is recommended to have the fewest possible switches between the GTM and the coordinators. The proxies reduce the interaction with the GTM and improve performance.

+ Uses MVCC. That means it still uses the same concurrency control as Postgres.

+ This is the first thing you will need to configure. If you set up everything on the same machine, you will need to create a separate folder for its configuration and files.

 

  • Coordinator/s

+ Interface for applications (like a Postgres backend). 

+ It has its own pooler (yes, and I think this is great, avoiding more complexity in big environments).

+ Doesn’t store any data. The queries are executed in the datanodes, but...

+ … it has its own data folder (for global catalogs).

  • Datanode/s

+ Stores the data.

+ It receives each request with a GXID (Global Transaction ID) and a Global Snapshot, which allows it to serve requests from several coordinators.

Both datanodes and coordinators use their own data directory, so keep this in mind if you are setting both up on the same machine.

Configuring several GTM-Proxies improves scalability by reducing the I/O on the GTM. Plus, you can configure the GTM-Standby to avoid a SPOF for the global manager. The GTM not only provides the GXID, it also receives node registrations (you can trace your log or check the file called register.node inside the gtm folder; it's binary but readable) and, most importantly, it holds the snapshot of the current status of all the transactions.

Coordinators can point to all the datanodes, and several coordinators can point to the same datanode (similar to Oracle RAC, though we're not sure whether all the features included in that solution will be available for Postgres-XC). Coordinators connect to the proxies (if you have configured them) or to the main GTM.

Hope you enjoyed the read. We are preparing more cool stuff about this amazing solution, keep in touch!

Exploring a new feature of 9.2: Index-only scans

We, like other Postgres DBAs worldwide, have been waiting for the 9.2 release for some time, specifically for the index-only scan feature, which will help reduce I/O by preventing unnecessary access to heap data if you only need data from the index.

Although 9.2 is still in development, it is possible to download a snapshot for testing at http://www.postgresql.org/download/snapshots/ . It's important to note that the feature doesn't add new behaviours; it improves the way existing indexes are used.

How can we test this feature? We created a ‘big’ table with 10 million+ records of random values, and then:

  • Extract a few elements, using a where clause.
  • Use aggregations.
  • Use partitioning plus index only scans.



When is this feature most useful?

  • When you select only columns that are part of the index definition, including those used in the condition part of the query.
  • When a vacuum was executed previously. This is because the scan can skip the “heap fetch” if the TID references a heap (table) page on which all tuples are known to be visible to everybody (src/backend/executor/nodeIndexonlyscan.c).


So, the main table is:


CREATE TABLE lot_of_values AS
  SELECT i,
         clock_timestamp() t1,
         random() r1, random() r2, random() r3,
         clock_timestamp() + (round(random()*1000)::text || ' days')::interval d1
    FROM generate_series(1,10000000) i(i);
ALTER TABLE lot_of_values ADD PRIMARY KEY(i);

CREATE INDEX CONCURRENTLY ON lot_of_values (d1);



Something interesting: due to some improvements in write performance, we noticed that 9.2 loaded the data faster than 9.1.3 (~200k ms on 9.1, ~170k ms on 9.2). The index creation was also slightly faster on 9.2.

The table will contain data like this:

stuff=# \x
Expanded display is on.
stuff=# select * from lot_of_values limit 1;
-[ RECORD 1 ]---------------------
i  | 1
t1 | 2012-04-18 08:37:14.426624+00
r1 | 0.571268450468779
r2 | 0.222371176816523
r3 | 0.72282966086641
d1 | 2012-08-17 08:37:14.426713+00


Ok, let's start with some examples. As we previously explained, we need to select only columns that are in the index; you can't combine columns from 2 different indexes. The next example is a clear fail:


stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
             QUERY PLAN                                
---------------------------------------------
Seq Scan on lot_of_values  (cost=0.00..263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)

stuff=# set enable_seqscan=off;
SET
stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
                                      QUERY PLAN                                       
---------------------------------------------
Seq Scan on lot_of_values  (cost=10000000000.00..10000263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)



We don't have an index on r1, and it isn't part of the index!

The next example also fails to produce an index-only scan, because it selects a column that is defined in a different index (or not defined in any index at all):


stuff=# explain select i,  d1 from lot_of_values where i between 12345 and 23456;
                                         QUERY PLAN                                           
---------------------------------------------
Index Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=12)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



The next example is the correct case:


stuff=# explain select i from lot_of_values where i between 12345 and 23456;
                                           QUERY PLAN                                             
---------------------------------------------
Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=4)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



Also, we can try with a non-pk index:


stuff=# explain select min(d1), max(d1) from lot_of_values ;
                                          QUERY PLAN                                                             
---------------------------------------------
Result  (cost=6.93..6.94 rows=1 width=0)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan Backward using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
(9 rows)

stuff=# explain select min(i), max(i), avg(i) from lot_of_values where i between 1234 and 2345;
                                             QUERY PLAN                                               
---------------------------------------------
Aggregate  (cost=66.90..66.91 rows=1 width=4)
  ->  Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..58.21 rows=1159 width=4)
        Index Cond: ((i >= 1234) AND (i <= 2345))
(3 rows)


The aggregation cases are special.  Index-only scans are not useful for count(*) without a condition, because the scan still needs to check the visibility of every tuple, which makes it expensive. So, if you need to count the entire table, a sequential scan must be performed.

Just for testing purposes, we’ll try to “turn off” the seqscan node, to force an Index-only scan:


stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
               QUERY PLAN                                                       
---------------------------------------------
Aggregate  (cost=213496.00..213496.01 rows=1 width=4) (actual time=57865.943..57865.946 rows=1 loops=1)
  Output: count(i)
  Buffers: shared hit=2380 read=86116
  ->  Seq Scan on public.lot_of_values  (cost=0.00..188496.00 rows=10000000 width=4) (actual time=0.667..30219.806 rows=10000000 loops=1)
        Output: i, t1, r1, r2, r3, d1
        Buffers: shared hit=2380 read=86116
Total runtime: 57866.166 ms
(7 rows)

stuff=# set enable_seqscan=off;
SET
stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
                                                          QUERY PLAN                                    
---------------------------------------------
Aggregate  (cost=351292.03..351292.04 rows=1 width=4) (actual time=64094.544..64094.547 rows=1 loops=1)
  Output: count(i)
  Buffers: shared read=110380
  ->  Index Only Scan using lot_of_values_pkey on public.lot_of_values  (cost=0.00..326292.03 rows=10000000 width=4) (actual time=38.773..35825.761 rows=10000000 loops=1)
        Output: i
        Heap Fetches: 10000000
        Buffers: shared read=110380
Total runtime: 64094.777 ms
(8 rows)



After a vacuum, the plan changes and the cost drops to 262793.04.
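
The vacuum itself is nothing special; something along these lines is enough to populate the visibility map that index-only scans rely on:

stuff=# VACUUM ANALYZE lot_of_values;
VACUUM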

For partitioning, as we expected, this works as well (in this example, we’ll use another table called ‘persons’ with ‘dni’ as PK column):


coches=# explain (analyze true, costs true, buffers true, timing true, verbose true)  select dni from persons where dni between 2100111 and 2110222;
Result  (cost=0.00..168.62 rows=22 width=8) (actual time=61.468..61.468 rows=0 loops=1)
  Output: persons.dni
  Buffers: shared hit=43 read=1
  ->  Append  (cost=0.00..168.62 rows=22 width=8) (actual time=61.442..61.442 rows=0 loops=1)
        Buffers: shared hit=43 read=1
        ->  Seq Scan on persons  (cost=0.00..0.00 rows=1 width=8) (actual time=0.156..0.156 rows=0 loops=1)
              Output: persons.dni
              Filter: (((persons.dni)::bigint >= 2100111) AND ((persons.dni)::bigint <= 2110222))
        ->  Index Only Scan using persons_200_pkey on persons_200 persons  (cost=0.00..8.38 rows=1 width=8) (actual time=0.405..0.405 rows=0 loops=1)
              Output: persons.dni
              Index Cond: ((persons.dni >= 2100111) AND (persons.dni <= 2110222))
              Heap Fetches: 0
              Buffers: shared hit=5

…. LOT OF PARTITIONS ….
Total runtime: 11.045 ms
(114 rows)


Conclusion: this feature adds one of the most exciting performance improvements in Postgres. We have seen improvements of as much as 30% so far, and we look forward to seeing how this scales as 9.2 becomes production-ready.


Automatically Bring Up A New Mongod Node in AWS

One of the most desirable features of MongoDB is its ability to automatically recover from a failed node. Setting this up with AWS instances can raise challenges, depending on how you're addressing your machines. We see many examples online of people using IP addresses in their configurations, like this:
cfg = {
    _id : "rs_a",
    members : [
        {_id : 0, host : "10.2.3.4", priority : 1},
        {_id : 1, host : "10.2.3.5", priority : 1},
        {_id : 2, host : "10.2.3.6", priority : 0}
    ]
}
This is fine for examples, or maybe for a single cluster where you control the hardware. But in the ephemeral world we live in, with virtual instances whose IPs change constantly, it is just not feasible. You need to use hostnames, and have DNS set up. That's easier said than done, so I've put together a proof of concept for automatically assigning hostnames to new instances in AWS. I hope that some of my experiences will be valuable in setting up your MongoDB clusters.
Here is the scenario: a replication set with three nodes set up in AWS using ec2 instances. One of the nodes goes down. You want to be able to bring that same node back - or a replacement node - with a minimum of configuration changes. To make this happen, you need to prepare for this event beforehand. All of this is done on Amazon's default OS, which is based on CentOS.
The MongoDB pages on setting up instances in AWS here and here are very valuable for most of the setup; however, the details of how to set up hostnames could use some elaboration. For one thing, it doesn't really help to put the public domain names into your zone files - if you bring up a new instance, you'll get a new public domain name, and you'll need to go back into your zone file and update it. I wanted to minimize the reconfiguration necessary when bringing a new instance up to replace one that went down.
My next reference is Marius Ducea's very helpful article How To update DNS hostnames automatically for your Amazon EC2 instances. One of the great things I discovered here was the really useful tool Amazon provides to get info about the instance that you're on. Basically, curl http://169.254.169.254/latest/meta-data/ to get a list of available parameters, and then add that parameter on to the end of the url to get its value. I made a little bash function that I used often while going through this:
meta () {
  curl http://169.254.169.254/latest/meta-data/$1
  echo
}
Stick that in your .bashrc or similar file; then "meta" will list all parameters, and e.g. "meta local-ipv4" will give you your local IP.

The first thing you want to set up is your DNS server. Of course you must have redundancy, with at least a primary and a backup, spread across different regions. For the purposes of this demonstration, however, I just had a single DNS server. Obviously, do not do this in production! Here are the significant changes when setting up your bind server:

1. In the "options" block, ensure your "listen-on" block will work for all servers that need DNS. In my case, I set it to "any". You may need to do something more restrictive.
2. "allow-recursion" also needs to be set for the same servers.
3. As in Ducea's article, add your keys and your key blocks. You'll need to copy these keys to your instance, but be careful about leaving copies around, and ensure the keys' permissions are as strict as possible.
4. Add blocks for your internal and external zones. You can probably get by with just an internal zone, but you may find it more convenient to have both zones.
5. Each of the zones will need to reference the key.
6. The purpose of the "controls" block is so that I can use rndc to control my named server. Although not technically necessary, it becomes useful if you need to edit your zone files after any dynamic updates have been made. You'll need to run "rndc freeze" before and "rndc thaw" after; otherwise the leftover .jnl files will make named complain. You'll also need to add an /etc/rndc.conf file, which has at a minimum a "key" and an "options" block. See here for details on setting it up on CentOS systems.

/etc/named.conf:
//
// named.conf
//
// Provided by Red Hat bind package to configure the ISC BIND named(8) DNS
// server as a caching only nameserver (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
options {
// listen-on port 53 { 127.0.0.1; };
 listen-on port 53 { any; };
 listen-on-v6 port 53 { ::1; };
 directory  "/var/named";
 dump-file  "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
 allow-query     { localhost; };
 allow-recursion     { any; };
 recursion yes;
 dnssec-enable yes;
 dnssec-validation yes;
 dnssec-lookaside auto;
 /* Path to ISC DLV key */
 bindkeys-file "/etc/named.iscdlv.key";
};
controls {   
 inet 127.0.0.1 allow { localhost; } 
 keys { palomino.mongo.; }; 
};
key palomino.mongo. {
algorithm HMAC-MD5;
secret "SEcreT FRom Your Generated Key file";
};
logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};
zone "." IN {
 type hint;
 file "named.ca";
};
include "/etc/named.rfc1912.zones";
zone "int.palomino.mongo" IN {
  type master;
  file "int.palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; }; // this should be local network only
  //allow-transfer { 10.160.34.184; };
};
zone "palomino.mongo" IN {
  type master;
  file "palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; };
  //allow-transfer { 10.160.34.184; };
};
-------

In this example, I chose an obviously non-public domain name to avoid confusion. A feature not shown in this example, but which you will need, is allow-transfer lines for communication with your other nameserver(s). Also note that in this case I used the same keys both for dynamic updates from the other hosts and for rndc updates; you may decide to do otherwise. Finally, you may want to add reverse-lookup zones as well.

External Zone File: palomino.mongo.zone
$ORIGIN .
$TTL 86400 ; 1 day
palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
 2012032919 ; serial
 21600      ; refresh (6 hours)
 3600       ; retry (1 hour)
 604800     ; expire (1 week)
 86400      ; minimum (1 day)
 )
 NS dns1.palomino.mongo.
 A local.ip.of.this.instance
$ORIGIN palomino.mongo.
$TTL 60 ; 1 minute
dns1 A public.ip.of.this.instance
Internal Zone File: int.palomino.mongo.zone
$ORIGIN .
$TTL 86400 ; 1 day
int.palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
 2012032917 ; serial
 21600      ; refresh (6 hours)
 3600       ; retry (1 hour)
 604800     ; expire (1 week)
 86400      ; minimum (1 day)
 )
 NS dns1.palomino.mongo.
 A local.ip.of.this.instance
$ORIGIN int.palomino.mongo.
$TTL 60 ; 1 minute
dns1 A local.ip.of.this.instance
You don't need to put any other hosts in here but your DNS server(s); all the other hosts will get in there dynamically. The local IP of this DNS server is what you'll need to bootstrap your DNS when you start your other hosts. Start your DNS server.

Next, for each mongod host, I used a script similar to Ducea's, with the exception that it takes the local IP of the DNS server in addition to the hostname. Also, I insert the DNS server's IP into /etc/resolv.conf. This is probably a bit of a hack; I don't doubt there's a more efficient way to do this. In this example, I am setting the user-data of the ec2 instance to "host:DNS-local-ip", so, for example, --user-data "ar1:10.1.2.3". Hopefully this will make more sense later.

update_dns.sh:
#!/bin/bash
USER_DATA=`/usr/bin/curl -s http://169.254.169.254/latest/user-data`
HOSTNAME=`echo $USER_DATA | cut -d : -f 1`
DNS_IP=`echo $USER_DATA | cut -d : -f 2`
line="nameserver $DNS_IP"
# this should skip if there is no DNS_IP sent
if  ! grep "$line" /etc/resolv.conf 1>/dev/null
then
    sed -i"" -e "/nameserver/i$line" /etc/resolv.conf
fi
DNS_KEY=/etc/named/Kpalomino.mongo.+157+29687.private
DOMAIN=palomino.mongo
#set also the hostname to the running instance
hostname $HOSTNAME.$DOMAIN
PUBIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/public-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone $DOMAIN
update delete $HOSTNAME.$DOMAIN A
update add $HOSTNAME.$DOMAIN 60 A $PUBIP
send
EOF
LOCIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/local-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone int.$DOMAIN
update delete $HOSTNAME.int.$DOMAIN A
update add $HOSTNAME.int.$DOMAIN 60 A $LOCIP
send
EOF
This file can be anywhere; I happened to put it in /etc/named, where I also put the keys. It will be run by root and contains secrets, so you should chmod 700 those three files.

The Mongo instances: I used the excellent guide on the MongoDB site to set up a single instance. Go through the steps of the entire process, including Launching an Instance, Configure Storage, and up to Install and Configure MongoDB. For this example, we are only setting up a Replica Set; sharding will be done at another time. The main thing we want to do here is to create a generic image that you can kick off and that will automatically set its DNS and connect to the rest of the Replica Set, with as little human interaction as possible. As in the example given, in /etc/mongod.conf, the only changes I made were:
fork = true
dbpath=/data
replSet = rs_a
So that when this mongod starts up, it will already be configured to be a member of rs_a. Another option would be to generate these settings dynamically from user-data sent to the instance, similar to the DNS update script above, but in this case I left them hard-coded. You can go ahead and start up mongod, connect to it and ensure it functions.

The final piece of the puzzle is to ensure our DNS update script runs when the host starts up. At first, I had it running as part of /etc/rc.local, as in Ducea's page. However, I soon found out that this means DNS gets updated _after_ mongod has already started, and then mongod couldn't find any of the new hosts. So, to ensure that DNS is updated before mongod starts, I added these lines to /etc/init.d/mongod, right at the beginning of the start() function:
start()
{
  echo "Updating dns: "
  /etc/named/update_dns.sh
  echo -n $"Starting mongod: "
At this point, we are almost ready to save our image. If you did a test and ran mongod, your /data directory will already have journal and/or other files in there, which might have data in them we don't necessarily want on startup. So, stop mongod and remove all your data:
service mongod stop
cd /data
rm -rf *
Now your instance is ready to make an image. You can stop it now (otherwise it would reboot when you create the image). All you need now is the id of your instance. Create your image:
ec2-create-image <instance-id> --name mongod-replset-server --description "Autoboot image of mongod server"
Once you get an image id from that, we can very easily set up our replica set. All we need to bootstrap this is the local IP of your DNS server. For this example, my hostnames will be ar1.palomino.mongo, ar2.palomino.mongo, and ar3.palomino.mongo (public), and ar1.int.palomino.mongo, ar2.int.palomino.mongo, and ar3.int.palomino.mongo . In this example, my DNS server's local IP address is 10.160.121.203
[mgross@dev ~]$ for i in 1 2 3
> do
> ec2-run-instances <ami-image-id> -n 1 -k <your-keypair-name> -d ar$i:10.160.121.203 -t m1.large -z <your-availability-zone> -g <your-security-group-id>
> done
This will launch 3 instances of the mongod server, each expecting to be in a replica set named "rs_a" (from the setting we added to /etc/mongod.conf above). When they start up, right before mongod starts, they'll set the hostnames provided, connect to the DNS server and register their public and private IPs. Now, if you use the nameserver that you set up, you should be able to connect with one of the hostnames, for example ar1.palomino.mongo:

mongo ar1.palomino.mongo

Then set up your replica set:
> rs.config()
null
> cfg = {
...     _id : "rs_a",
...     members : [
...         {_id : 0, host : "ar1.int.palomino.mongo", priority : 1},
...         {_id : 1, host : "ar2.int.palomino.mongo", priority : 1},
...         {_id : 2, host : "ar3.int.palomino.mongo", priority : 0}
...     ]
... }
{
 "_id" : "rs_a",
 "members" : [
 {
 "_id" : 0,
 "host" : "ar1.int.palomino.mongo",
 "priority" : 1
 },
 {
 "_id" : 1,
 "host" : "ar2.int.palomino.mongo",
 "priority" : 1
 },
 {
 "_id" : 2,
 "host" : "ar3.int.palomino.mongo",
 "priority" : 0
 }
 ]
}
> rs.initiate(cfg);
{
 "info" : "Config now saved locally.  Should come online in about a minute.",
 "ok" : 1
}
> rs.status();
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:20:49Z"),
 "myState" : 2,
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 6,
 "stateStr" : "UNKNOWN",
 "uptime" : 2,
 "optime" : {
 "t" : 0,
 "i" : 0
 },
 "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
 "pingMs" : 0,
 "errmsg" : "still initializing"
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 5,
 "stateStr" : "STARTUP2",
 "uptime" : 4,
 "optime" : {
 "t" : 0,
 "i" : 0
 },
 "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
 "pingMs" : 0,
 "errmsg" : "initial sync need a member to be primary or secondary to do our initial sync"
 }
 ],
 "ok" : 1
}
PRIMARY> rs.status();
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:21:46Z"),
 "myState" : 1,
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 59,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
 "pingMs" : 0
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 61,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
 "pingMs" : 0
 }
 ],
 "ok" : 1
}
PRIMARY> 
--------

Now, to demonstrate our automatic new host configuration, let's completely take out one of our hosts and replace it with a brand new instance:
[root@ar1 ~]# shutdown -h now
We can see that ar2 has been elected primary:
[ec2-user@ar2 ~]$ mongo
MongoDB shell version: 2.0.4
connecting to: test
PRIMARY> rs.status()
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:25:24Z"),
 "myState" : 1,
 "syncingTo" : "ar1.int.palomino.mongo:27017",
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 0,
 "state" : 8,
 "stateStr" : "(not reachable/healthy)",
 "uptime" : 0,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:23:38Z"),
 "pingMs" : 0,
 "errmsg" : "socket exception"
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 267,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:25:24Z"),
 "pingMs" : 14
 }
 ],
 "ok" : 1
}
Now, just spin up another instance that will replace ar1:

> ec2-run-instances ami-cda2fa88 -n 1 -k engineering-west -d ar1:10.160.121.203 -t m1.large -z us-west-1b -g mm-database

With no further action, your instance will acquire the new hostname, get polled by the new primary ar2, and become part of the Replica Set again:
PRIMARY> rs.status()
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:34:09Z"),
 "myState" : 1,
 "syncingTo" : "ar1.int.palomino.mongo:27017",
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 77,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
 "pingMs" : 0
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 792,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
 "pingMs" : 19
 }
 ],
 "ok" : 1
}
PRIMARY> 
Now, this exercise had a lot missing from it: most notably multiple DNS servers for redundancy, and sharding. But I hope it was helpful in seeing how you can use DNS to make failover of a MongoDB host a little closer to automatic. I've run through this scenario several times, and I did have some issues the first time I was setting up the replica set on the cluster. Once that got going, though, the failover and promotion of the brand new host seemed to work fine. I welcome your comments.

Considerations about text searches in big fields and planner costs

Imagine the following scenario: you have a table with a small set of records, with columns of tsvector and text data types, but the text fields hold almost 20 megabytes of text or more.

So, what happens? The Postgres planner checks the number of reltuples (tuples) in the relation to estimate the cost of data extraction. If the number of tuples matched by the query is higher than roughly 20% of the total result set, or the number of blocks in the relation is very small, the planner will choose a sequential scan; otherwise random access will be the chosen path.

But in that case we face a problem: the number of rows and blocks in the relation does not reflect the "total" content. That happens because Postgres uses a technique called TOAST (The Oversized Attribute Storage Technique), which compresses oversized attribute data and stores it apart from the main block once a row grows past a size threshold (about 2 kB by default). So, in the relation itself you will only have the first bytes of your column data. You can modify some options about how the columns get TOASTed (the level of compression), but the default behaviour is to compress and store apart. This methodology is pretty cool because, without it, if you had e.g. 500MB of data in a single field, your relation would contain all that data in the same file, and in most cases that is not convenient.
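
One way to see this in practice (using standard Postgres size functions against the documents_en table defined below) is to compare the size of the main relation with the total size including TOAST data and indexes:

SELECT pg_size_pretty(pg_relation_size('documents_en'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('documents_en')) AS with_toast_and_indexes;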

So, while you have a small number of rows, the amount of data to read from disk is quite large relative to the data stored in the relfilenode (the data file). But for FTS we need to dig into all the content, and to make the searches faster we need GIN or GiST indexes. These types of indexes store the data inside the index (differently from B-Tree indexes), making it possible to perform the search using the index.

So, you want to access your data through the index, but the planner estimates that the number of rows is so small that it is better to use a sequential scan instead of the index. The only way to force the index read is to run SET enable_seqscan TO off; before executing your query. This variable is per session, so you might want to set it inside the transaction and set it back to "on" for other queries.
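
If you would rather not remember to switch it back on, SET LOCAL scopes the change to the current transaction only; for example:

BEGIN;
SET LOCAL enable_seqscan TO off;   -- reverts automatically at COMMIT or ROLLBACK
SELECT author, title FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
COMMIT;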

By the way, once your table is populated with a large enough amount of data, you won't need this workaround anymore; but if your table is quite static, it can serve as a small fix. Besides, this doesn't represent a performance problem in itself, but this kind of thing shows you how the planner works.

Let’s see an example:

CREATE TABLE documents_en (
   author_id char(2),
   text_id serial,
   author varchar(45) NOT NULL,
   title varchar(127) NOT NULL,
   content text NOT NULL,
   cont_ftsed tsvector  ,
   PRIMARY KEY (author_id, text_id)
);

CREATE FUNCTION ftsed_upd_en() RETURNS trigger AS $$
DECLARE
BEGIN
NEW.cont_ftsed:=(setweight(to_tsvector('english', coalesce(NEW.title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(NEW.author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(NEW.content, '')), 'C'));
RETURN NEW;
END
$$
LANGUAGE plpgsql;

CREATE TRIGGER tgr_ins_ftsed_en BEFORE INSERT ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE TRIGGER tgr_upd_ftsed_en BEFORE UPDATE OF author,title,content
ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE INDEX textsearch_en ON documents_en USING gin(
   (setweight(to_tsvector('english', coalesce(title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'C'))
);

CREATE INDEX ix_ftsed_en ON documents_en USING GIST(cont_ftsed);


Before continuing, we need to clarify 3 things: text_id is related to author_id, but just for testing we use it as a serial, to avoid collisions in the primary key; we use triggers to automatically populate the tsvector column; and the last thing is the indexes: I created 2 indexes to show the examples. IMHO I don't recommend functional indexes for performance reasons, but they are still an option.

To fill up the table, I made a script to grab all the READMEs on the server and put them into the DB.


#!/bin/bash
FILENAME="$@"
TMPBFF=temp_file
PSQL=/opt/pg92dev/bin/psql
WORDS=words

for _FILE_ in $FILENAME
do
 egrep -oi '[a-z]+' $_FILE_ | egrep -i '[a-z]{4,}' > $WORDS   # keep only words of at least 4 letters
 AUTHOR=$(cat $WORDS | sort -u | head -n2 | awk '{printf("%s ", $0) }')
 AUTHOR_ID=${AUTHOR:0:2}
 TITLE=$(cat ${WORDS} | sort | uniq -c | sort -nr | head -n3 | awk '{printf("%s ", $2)}')
 BODY=$(cat $WORDS | sort -u | awk '{printf("%s ", $0) }')

 $PSQL -Upostgres fts -c "INSERT INTO documents_en (author_id, author, title, content) VALUES ('${AUTHOR_ID}', '${AUTHOR}', '${TITLE}','${BODY}' );"
done


I know it's a bit tricky, but it's useful for this test. To execute it, you'll need something like:

locate README.txt | xargs ./script

First let’s get some stats from the table:

fts=# select count(*) from documents_en ;
count : 22
fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty :  24 kB (size of the table)

fts=# select relpages from pg_class where relname = 'documents_en';
relpages : 3 (8 kB each)

Now, a simple query with an EXPLAIN ANALYZE:

fts=# EXPLAIN ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.228..0.464 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime: 0.501 ms
(4 rows)


Oh! It's using a sequential scan (as we expected), but now let's add a trick:

fts=# SET enable_seqscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.127..0.191 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime: 0.280 ms
(4 rows)

What happened? The cost is higher, but the query is faster? The quick answer is yes, that's possible. Cost is based on estimates of the number of blocks and the type of access, and random access is assigned roughly 4 times the cost of sequential access. In this particular case the table is small, so the planner estimates that it is cheaper to read it without indexes. BUT that's not the real problem; the real one is that the data is already TOASTed outside the "data file", and that data is not considered by the planner in this case.
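
That 4:1 ratio comes from the planner's default page-cost settings, which can be inspected and, on fast random-access storage, tuned per session:

SHOW seq_page_cost;        -- 1 by default
SHOW random_page_cost;     -- 4 by default
SET random_page_cost = 2;  -- a common tweak for SSD/FusionIO-class storage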

Another question is: can we improve access to this data? The answer is YES, if you are not worried about the storage. What you can do is:

fts=# alter table documents_en alter column cont_ftsed set storage external;
ALTER TABLE

That will store the data of the cont_ftsed column uncompressed and force it to be kept outside the table (this is not exactly how it works internally, but for didactic purposes we will describe it this way).

fts=# vacuum  full analyze documents_en;
VACUUM
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.071..0.127 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime: 0.178 ms
(4 rows)

fts=# SET enable_seqscan TO on;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.067..0.292 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime: 0.329 ms
(4 rows)

In this case we used VACUUM FULL because the SET STORAGE option doesn't change data that is already stored; it only takes effect for data stored afterwards. With the full vacuum, we force the change to be applied to the whole table.

Ok, what happens if the table gets filled with more data? The planner automatically starts to use the index:

fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty
----------------
1744 kB
(1 row)
fts=# SELECT count(*)  FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
count
-------
   45
(1 row)


fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                     QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents_en  (cost=4.38..47.02 rows=13 width=31) (actual time=1.091..5.404 rows=45 loops=1)
  Recheck Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
  ->  Bitmap Index Scan on ix_ftsed_en  (cost=0.00..4.37 rows=13 width=0) (actual time=1.003..1.003 rows=125 loops=1)
        Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
Total runtime: 5.489 ms
(6 rows)

fts=# SET enable_bitmapscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                         QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..56.50 rows=13 width=31) (actual time=0.307..5.286 rows=45 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
Total runtime: 5.390 ms
(4 rows)

We don't see a huge difference in time between the two plans, but we are still facing the question of the relationship between cost and real performance.

But, don’t forget the functional index! I think you will like the following example:

fts=# analyze;
ANALYZE
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                             QUERY PLAN                                                                                                                                                                               
------------------------------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..292.20 rows=17 width=7) (actual time=2.960..3878.627 rows=45 loops=1)
  Filter: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 2238
Total runtime: 3878.714 ms
(4 rows)

fts=# SET enable_seqscan TO off;
SET
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                                  QUERY PLAN                                                                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents_en  (cost=272.15..326.46 rows=17 width=7) (actual time=0.937..1.026 rows=45 loops=1)
  Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  ->  Bitmap Index Scan on textsearch_en  (cost=0.00..272.15 rows=17 width=0) (actual time=0.916..0.916 rows=45 loops=1)
        Index Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
Total runtime: 1.109 ms
(5 rows)

Again the planner's estimate and reality diverge, but this time the difference in execution time is far larger. Not so pretty, is it? It seems you need to be careful when searching large columns that get TOASTed. It may be worth checking whether your production database needs some tweaks to get the best performance out of your Postgres instance.
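One rough way to see how much of a table lives outside the main heap (a hint that large, TOASTed columns are in play) is to compare pg_relation_size() with pg_total_relation_size(); note that the difference below also includes index size, so treat it only as a ballpark figure:

   -- Ballpark check: how much of documents_en is stored outside the main heap
   -- (TOAST data plus indexes), versus the heap itself.
   SELECT pg_size_pretty(pg_relation_size('documents_en'))      AS main_heap,
          pg_size_pretty(pg_total_relation_size('documents_en')
                         - pg_relation_size('documents_en'))    AS toast_and_indexes;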

The version used in this post is PostgreSQL 9.2devel on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.2 20111027 (Red Hat 4.6.2-1), 64-bit. I also ran the same tests on 9.1 and observed the same behavior.

Happy hacking!





Master/Slave Replication In Limited Bandwidth Scenarios

Database replication is quite an important component of the database architectures of many of our clients. Managed properly, it offers real-time redundancy and fault-tolerance, and allows for increased scalability. When replication goes wrong, however, it can turn into a morass of data conflicts, lag and strange little mysteries to distract your DBA team.

PalominoDB recently helped a client out of one of these strange little replication mysteries, and we document it here to help you avoid getting stuck in the same sort of situation.

The symptoms: Our slave database (which we’ll call DB_Slave) was lagging. The logs revealed that the value of Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0.

What we knew: DB_Slave connects to the master (which we’ll call DB_Master), and DB_Master sends binlog events to DB_Slave.

Because this data is pushed from the master rather than pulled by the slave, there was a small window where, in theory, a replication channel could be broken and DB_Slave wouldn’t notice until slave_net_timeout seconds (with a default of 3600) had passed.
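As an aside, if a broken channel going unnoticed for that long is a concern, the timeout can be lowered on the slave. This was not part of our fix here; it is just an illustration of the knob involved:

   -- Illustration only (not a change we made in this case): make the slave
   -- notice a dead replication channel sooner than the 3600-second default.
   SET GLOBAL slave_net_timeout = 60;
   -- The new value takes effect when the I/O thread reconnects:
   STOP SLAVE IO_THREAD;
   START SLAVE IO_THREAD;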

We also knew that Seconds_Behind_Master is calculated not as the difference between the current datetime and the datetime of the binlog event being processed, but as the difference between the timestamp of the latest event read from the master's binlog and the timestamp of the event currently being executed from the relay log. That means that if DB_Master is not sending binlog events to DB_Slave (for example, due to a network issue), the slave could be lagging behind the master without being aware of it at all.

Further, if DB_Slave then received some binlog events from DB_Master, and realized at that point that it was lagging, it could still fail to notice if DB_Master once again stopped pushing binlog events. 

Investigation: Our theory that there was in fact a network issue was supported by evidence from the logs. We saw that SHOW SLAVE STATUS showed no progress in Master_Log_File and Read_Master_Log_Pos: that is, DB_Slave was not getting any data from DB_Master.
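For readers following along at home, the fields we were watching come straight from SHOW SLAVE STATUS on the slave; if the read coordinates noted below stop advancing between two consecutive runs, the I/O thread is not receiving anything new from the master:

   -- Run on DB_Slave, a few seconds apart, and compare the output.
   SHOW SLAVE STATUS\G
   -- Fields of interest:
   --   Master_Log_File, Read_Master_Log_Pos       -- how far the I/O thread has read
   --   Relay_Master_Log_File, Exec_Master_Log_Pos -- how far the SQL thread has executed
   --   Seconds_Behind_Master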

We also found this in the processlist of DB_Master:

Id          User         Host                  db     Command       Time   State            Info
104062295   slave_user   [IP Address]:10124    NULL   Binlog Dump   1140   Writing to net   NULL

The status "Writing to net" means that DB_Master is sending (or trying to send) data to DB_Slave.

In a healthy replication channel, the status would be "Has sent all binlog to slave; waiting for binlog to be updated".


We suspected this might be an issue related to network bandwidth, but when we ran a series of STOP SLAVE / START SLAVE cycles, we noticed that the number of threads in the “Writing to net” state kept increasing, with up to 20 concurrent open threads. Connections simply weren’t being closed. Further investigation revealed that those connections were staying in CLOSE_WAIT status.
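To keep track of how many dump threads were piling up without scanning the whole processlist by eye, a query along these lines against the standard information_schema.PROCESSLIST table does the job (run it on the master):

   -- Run on DB_Master: count binlog dump threads that are stuck sending data.
   SELECT COUNT(*) AS stuck_dump_threads
     FROM information_schema.PROCESSLIST
    WHERE COMMAND = 'Binlog Dump'
      AND STATE   = 'Writing to net';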

Based on our observations, we concluded that the limited network bandwidth was at fault, and that we needed a way to compress the traffic between the master and slave databases. Luckily, MySQL provides a convenient parameter for exactly this, which is dynamic and is set on the slave: slave_compressed_protocol=1.

Here is how to implement it:

 

mysql> show global variables like 'slave_compressed_protocol';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| slave_compressed_protocol | OFF   |
+---------------------------+-------+
1 row in set (0.00 sec)

mysql> stop slave;
Query OK, 0 rows affected (0.25 sec)

mysql> set global slave_compressed_protocol=1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like 'slave_compressed_protocol';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| slave_compressed_protocol | ON    |
+---------------------------+-------+

Resolution: We set slave_compressed_protocol=1 and restarted the slave. Replication started catching up at surprising speed. Even as it was catching up, we noticed the same behavior we’d seen before: Seconds_Behind_Master was jumping from 0 to a random high value and then back to 0, the I/O thread was behind, and we didn’t see high load on the server.

As you can see in these graphs, network traffic did not increase much, but CPU usage and commands/sec had surprisingly high jumps.

In short, DB_Slave was processing more queries because the IO thread was receiving more data from DB_Master, thanks to compression. 

Live Blogging at MongoSV

Dwight Merriman, CEO of 10gen, speaks about the growth of the MongoDB community. The conference has doubled in size from 500 to 1100+ attendees.

Eliot Horowitz, CTO of 10gen, demos the MongoDB 2.2 Aggregation Framework, which simplifies aggregating data in MongoDB. He pulls in the MongoDB Twitter feed to populate data and sums it using: runCommand({aggregate: … })

The “aggregate” command will be in nightly builds tonight.

Cooper Bethea, Site Reliability Engineer at Foursquare, speaks on Experiences Deploying MongoDB on AWS.

All data stored in MongoDB
8 production MongoDB clusters
Two of the larger shards:
8 shards of users, 12 shards of check-ins.
Checkins: ~80 inserts/sec, ~2.5k ops/sec, 45MB/s outbound at peak.
Users: ~250 updates/sec, ~4k ops/sec, 46MB/s outbound at peak
Only one unsharded cluster; the others are fully sharded using replica sets.

All servers in EC2
mongoS is on mongoD instances
mongoCs are on three instances
mongoD working set contained in RAM
MongoD backing store: 4 EBS volumes with RAID0

Problem: fragmentation leads to bloat
mongoD RAM footprint grows.
Data size, index size, storage size.

Solution: order replica set members by dataSize + indexSize, uptime DESC. --repair secondary nodes one at a time. Primary nodes require stepDown(), which is more delicate.

Problem: EBS performance degrades
Symptoms: ioutil % on one volume > 90
qr/qw counts spike
fault rates > 10 in mongostat
sometimes:  topless counts spike

Solution:
KILL IT! Stop mongoD process if secondary node, stepDown() + stop if primary.
Rebuild from scratch.

How long does it take? ~1 hour
Working set in RAM

Problem: fresh mongoD has not paged in all data
Solution: run queries
db.checkins.find({unused_key:1}).explain()

cat > /dev/null on the data files works too, unless your data set is larger than RAM.