Blog

The Postgres-XC 1.0 beta release overview

 

When I heard about this project a year ago, I was really excited about it. Many cluster-wide projects based on Postgres have been developed very slowly, based on older (e.g. Postgres-R http://www.postgres-r.org/) or proprietary (e.g. Greenplum) versions. The features that this project hopes to achieve are ambitious, as we'll detail in this post. And best of all, this project is based on PostgreSQL 9.1, which is really up-to-date (at the moment of writing this post, it is the latest stable version).

If you are interested in a serious project for horizontally scaling your PostgreSQL architecture, you may visit the official website at http://postgres-xc.sourceforge.net/ and take a look.

For those who are interested, there will be a tutorial at PgCon this year. For those who want to get involved in the project, here is a brief overview to give you a broad idea of what it is about.

 

What Postgres-XC can do:

  • Support multi-master architecture. Data nodes can contain part or all of the data of a relation. 
  • Present a transparent view of the cluster to the application from any master. The application only needs to interact with the cluster through the coordinators.
  • Distribute or replicate each relation individually: by replication, round robin (the default when no unique column is available), hash (the default when a unique column exists), modulo, or restricted to a specific group or set of nodes - see the sketch after this list.
  • Parallel transaction execution among cluster nodes.
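
As a quick illustration of the distribution options above, this is roughly what the per-relation DDL looks like (the tables are invented for the example, not taken from the Postgres-XC docs):

-- spread rows across datanodes by hash of the key
CREATE TABLE orders (
    order_id   bigint PRIMARY KEY,
    customer   text,
    amount     numeric
) DISTRIBUTE BY HASH (order_id);

-- keep a full copy of a small lookup table on every datanode
CREATE TABLE countries (
    code  char(2) PRIMARY KEY,
    name  text
) DISTRIBUTE BY REPLICATION;

-- no unique column: rows are dealt out to datanodes in turn
CREATE TABLE audit_log (
    logged_at  timestamptz,
    message    text
) DISTRIBUTE BY ROUND ROBIN;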

 

What Postgres-XC cannot do:

  • Support triggers (may be supported in future releases).
  • Distribute a table with more than one parent. 

 

Before you start:

  You need to install recent versions of Flex and Bison. That's important because, with the current tarball, './configure' won't raise an error if they are missing; the error only shows up once you run 'make'. You will also want readline-dev and zlib-dev (not mandatory but strongly recommended).

According to the documentation, Postgres-XC should be compatible with Linux platforms based upon Intel x86_64 CPU. 

The documentation needs improvement, so we advise you to try the steps directly and read the help output of the commands. For example, the initdb command in the documentation is incomplete: "--nodename" is mandatory in this version. This project is new and has only a few contributors to date, but we hope its community keeps growing. Most importantly, it is great that a beta release was launched earlier than we expected.

 

Elements of architecture

 

  • GTM (Global Transaction Manager)

+ Note that I say GTM, not GTMs: only one GTM can be the active manager. For redundancy you have the GTM-Standby, and to improve performance and failover there are GTM-Proxies.

+ The GTM serializes all transaction processing and can limit overall scalability if it is deployed naively. It should not be used across slow or wide-area networks, and it is recommended to have as few network switches as possible between the GTM and the coordinators. The proxies reduce the interaction with the GTM and improve performance.

+ Uses MVCC. That means it still uses the same concurrency control as Postgres.

+ This is the first thing you will need to configure. If you set everything up on the same machine, you will need to create a separate folder for its configuration and files.

 

  • Coordinator/s

+ Interface for applications (like a Postgres backend). 

+ It has its own connection pooler (yes, and I think this is great, since it avoids extra complexity in big environments).

+ Doesn’t store any data. The queries are executed in the datanodes, but...

+ … it has its own data folder (for global catalogs).

  • Datanode/s

+ Stores the data.

+ It receives each request together with a GXID (Global Transaction ID) and a Global Snapshot, which allows it to serve requests from several coordinators.

Both Datanodes and Coordinators use their own data directory, so keep this in mind if you are setting both up on the same machine.
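
For a quick single-machine test, that layout ends up looking something like this (paths and node names are our own; --nodename is the mandatory flag mentioned earlier, and the commands' --help output covers the rest):

# one folder per component
mkdir -p /data/gtm /data/coord1 /data/dn1 /data/dn2

# every coordinator and datanode needs its own data directory
initdb -D /data/coord1 --nodename coord1
initdb -D /data/dn1    --nodename dn1
initdb -D /data/dn2    --nodename dn2

# the GTM keeps its own configuration and status files in /data/gtm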

Configuring several GTM-Proxies improves scalability by shrinking the I/O on the GTM. In addition, you can configure a GTM-Standby so the global manager is not a single point of failure. The GTM not only provides GXIDs; it also receives node registrations (you can trace your log, or check the file called register.node inside the gtm folder - it's binary but readable) and, most importantly, it holds the snapshot of the current status of all transactions.

Coordinators can point to all the datanodes, and several coordinators can even point to the same datanode (much like Oracle RAC, though we're not sure whether all the features of that solution will be available in Postgres-XC). Coordinators connect to the proxies (if you have configured them) or directly to the main GTM.

Hope you enjoyed the read. We are preparing more cool stuff about this amazing solution - stay tuned!

New versions of PgPool released - 3.1.3 & 3.0.7


This essential tool for Postgres architectures continues to improve, and new releases are now available. Both are bugfix versions.
 
For those unfamiliar with the tool, it is middleware that can act as a load balancer, connection pooler*, and/or replication system for PostgreSQL databases. The 3.1.x versions are compatible with Postgres 9.x, and the Pgpool developers took advantage of its streaming replication feature: it allows the tool to balance queries without using the pgpool-replication technique.
 
In the 3.1.3/3.0.7 fixes we have:

  • Allow multi-statement transactions in master/slave mode: since 3.1, transactions opened with BEGIN were also sent to the slave/standby servers. This had undesirable effects when the transaction contained DELETE/INSERT/UPDATE, because standbys cannot execute writes. (3.1.3 fix)
  • Important fixes for failover detection and execution. (3.1.3 fix)
  • Added m4 files to avoid problems when compiling on older operating systems.
  • Fixed hangup on PREPARE errors.
  • Fixed memory leak in reset queries.


If you are running 3.1.x against Postgres 9 databases, we strongly recommend you upgrade PgPool because of the multi-statement transaction fix.

For more information, see http://www.pgpool.net/mediawiki/index.php/Main_Page

* If you need only a connection pooler for Postgres, I prefer PgBouncer http://wiki.postgresql.org/wiki/PgBouncer. It is lightweight, more focused, and simply works, with much less configuration.

MySQL Conference and Expo - 2012 and our company offsite

Last week was a huge week for PalominoDB.  I will admit to having been cautiously optimistic about Percona taking over this conference - one of the biggest parts of the year for the MySQL community.  That being said, the conference was done quite well.  I really applaud Percona for making it happen.  In particular, the dot-org pavilion was an excellent addition.  While I was stuck at the booth most of the event, I had a great view of the sheer variety of attendees coming through, and had the privilege of participating in a number of excellent conversations.

I noticed a number of patterns that seemed to prevail among the conversations.  Folks seem eager to move on to MySQL 5.5, finally comfortable with its stability.  Administrators are eager to learn more about MariaDB and Drizzle, and how they can differentiate themselves from the Oracle variants.  Partitioning is more prevalent as datasets grow, and sharding is becoming almost commonplace.  Now the focus is more on the challenging questions around HA - multi-master, synchronous replication and multi-datacenter installations.  People seem more interested in commercial additions around the MySQL ecosystem such as Tungsten, TokuDB, ScaleArc, Scalebase and Clustrix.  

René Cannao, one of our senior administrators, did a great tutorial on understanding performance through measurement, benchmarking and profiling.  Feedback has been great, and we look forward to continuing to evolve our benchmarking and profiling methods.  Please keep an eye out.

Additionally, we were able to announce a very exciting partnership with SkySQL.  PalominoDB focuses on operational excellence by providing top DBAs to our clients.  We dig into our clients' architectures and improve them, maintain them and help redesign them as they grow.  Our oncall services are top notch, regularly answering pages in under 5 minutes - and always providing a talented and experienced DBA on the other end of the phone.  What we don't do is software support; it's simply not our area of expertise.  We can tune it, run it and grow it, but for clients who need to dig into code, fix bugs and really provide deep internals knowledge of MySQL - SkySQL is who we turn to.   We are also quite excited to help augment SkySQL's excellent services with our own - to create some of the happiest customers out there.

We are thrilled to see our partnership ecosystem grow, our service offerings expand, and knowledge of our brand and the quality of our services continue to improve.  I can't help but glow with pride at the reputation PalominoDB has built - through our DBAs, our clients and our partners.  Being a part of the MySQL Conference and Expo only cemented this pride in community, and pride in our work.  Thank you to everyone who has helped us get there.

Thank you all of you!

Exploring a new feature of 9.2: Index-only scans

We, like other Postgres DBAs worldwide, have been waiting for the 9.2 release for some time, specifically for the index-only scan feature, which will help reduce I/O by preventing unnecessary access to heap data if you only need data from the index.

Although 9.2 is still in development, it is possible to download a snapshot for testing at http://www.postgresql.org/download/snapshots/ . It's important to note that this feature doesn't add new behaviours, but improves the way that indexes are used.

How can we test this feature? We created a 'big' table with 10 million+ records of random values, and then ran a few tests:

  • Extract a few elements, using a where clause.
  • Use aggregations.
  • Use partitioning plus index only scans.



When is this feature most useful?

  • When you select only columns that are part of the index definition, including the columns used in the query's conditions.
  • When a vacuum was executed previously. This is because the scan can skip the "heap fetch" if the TID references a heap (table) page on which all tuples are known to be visible to everybody (src/backend/executor/nodeIndexonlyscan.c).


So, the main table is:


CREATE TABLE lot_of_values AS
  SELECT i,
         clock_timestamp() t1,
         random() r1, random() r2, random() r3,
         clock_timestamp() + (round(random()*1000)::text || ' days')::interval d1
    FROM generate_series(1,10000000) i(i);
ALTER TABLE lot_of_values ADD PRIMARY KEY(i);

CREATE INDEX CONCURRENTLY ON lot_of_values (d1);



Something interesting: thanks to some improvements in write performance, 9.2 showed better timing than 9.1.3 when loading the table (~200k ms on 9.1, ~170k ms on 9.2). Index creation was also slightly faster on 9.2.

The table will contain data like this:

stuff=# \x
Expanded display is on.
stuff=# select * from lot_of_values limit 1;
-[ RECORD 1 ]---------------------
i  | 1
t1 | 2012-04-18 08:37:14.426624+00
r1 | 0.571268450468779
r2 | 0.222371176816523
r3 | 0.72282966086641
d1 | 2012-08-17 08:37:14.426713+00


Ok, let's start with some examples. As we previously explained, the query must reference only columns that are in the index; you can't combine columns from two different indexes. The next example clearly fails:


stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
             QUERY PLAN                                
---------------------------------------------
Seq Scan on lot_of_values  (cost=0.00..263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)

stuff=# set enable_seqscan=off;
SET
stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
                                      QUERY PLAN                                       
---------------------------------------------
Seq Scan on lot_of_values  (cost=10000000000.00..10000263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)



There is no index on r1, and it isn't part of the primary key index either, so no index can be used at all!

The next example is another failure: it selects a column (d1) that is defined in a different index, so we only get a regular index scan:


stuff=# explain select i,  d1 from lot_of_values where i between 12345 and 23456;
                                         QUERY PLAN                                           
---------------------------------------------
Index Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=12)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



The next example is the correct case:


stuff=# explain select i from lot_of_values where i between 12345 and 23456;
                                           QUERY PLAN                                             
---------------------------------------------
Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=4)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



Also, we can try with a non-pk index:


stuff=# explain select min(d1), max(d1) from lot_of_values ;
                                          QUERY PLAN                                                             
---------------------------------------------
Result  (cost=6.93..6.94 rows=1 width=0)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan Backward using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
(9 rows)

stuff=# explain select min(i), max(i), avg(i) from lot_of_values where i between 1234 and 2345;
                                             QUERY PLAN                                               
---------------------------------------------
Aggregate  (cost=66.90..66.91 rows=1 width=4)
  ->  Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..58.21 rows=1159 width=4)
        Index Cond: ((i >= 1234) AND (i <= 2345))
(3 rows)


The aggregation cases are special.  Index-only scans are not useful for count(*) without a condition, because the scan still needs to check the visibility of every tuple, which makes it expensive. So, if you need to count the entire table, a sequential scan will normally be performed.

Just for testing purposes, we’ll try to “turn off” the seqscan node, to force an Index-only scan:


stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
               QUERY PLAN                                                       
---------------------------------------------
Aggregate  (cost=213496.00..213496.01 rows=1 width=4) (actual time=57865.943..57865.946 rows=1 loops=1)
  Output: count(i)
  Buffers: shared hit=2380 read=86116
  ->  Seq Scan on public.lot_of_values  (cost=0.00..188496.00 rows=10000000 width=4) (actual time=0.667..30219.806 rows=10000000 loops=1)
        Output: i, t1, r1, r2, r3, d1
        Buffers: shared hit=2380 read=86116
Total runtime: 57866.166 ms
(7 rows)

stuff=# set enable_seqscan=off;
SET
stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
                                                          QUERY PLAN                                    
---------------------------------------------
Aggregate  (cost=351292.03..351292.04 rows=1 width=4) (actual time=64094.544..64094.547 rows=1 loops=1)
  Output: count(i)
  Buffers: shared read=110380
  ->  Index Only Scan using lot_of_values_pkey on public.lot_of_values  (cost=0.00..326292.03 rows=10000000 width=4) (actual time=38.773..35825.761 rows=10000000 loops=1)
        Output: i
        Heap Fetches: 10000000
        Buffers: shared read=110380
Total runtime: 64094.777 ms
(8 rows)



After a VACUUM, the plan's estimated cost drops to 262793.04, since the visibility map now lets the scan skip most heap fetches.
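
To reproduce that change, it is enough to vacuum the table (that is what marks pages as all-visible in the visibility map) and re-run the same EXPLAIN; a minimal sketch:

VACUUM ANALYZE lot_of_values;

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT count(i) FROM lot_of_values;
-- the Index Only Scan should now report far fewer "Heap Fetches" than the 10000000 above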

For partitioning, as we expected, this works as well (in this example, we’ll use another table called ‘persons’ with ‘dni’ as PK column):


coches=# explain (analyze true, costs true, buffers true, timing true, verbose true)  select dni from persons where dni between 2100111 and 2110222;
Result  (cost=0.00..168.62 rows=22 width=8) (actual time=61.468..61.468 rows=0 loops=1)
  Output: persons.dni
  Buffers: shared hit=43 read=1
  ->  Append  (cost=0.00..168.62 rows=22 width=8) (actual time=61.442..61.442 rows=0 loops=1)
        Buffers: shared hit=43 read=1
        ->  Seq Scan on persons  (cost=0.00..0.00 rows=1 width=8) (actual time=0.156..0.156 rows=0 loops=1)
              Output: persons.dni
              Filter: (((persons.dni)::bigint >= 2100111) AND ((persons.dni)::bigint <= 2110222))
        ->  Index Only Scan using persons_200_pkey on persons_200 persons  (cost=0.00..8.38 rows=1 width=8) (actual time=0.405..0.405 rows=0 loops=1)
              Output: persons.dni
              Index Cond: ((persons.dni >= 2100111) AND (persons.dni <= 2110222))
              Heap Fetches: 0
              Buffers: shared hit=5

…. LOT OF PARTITIONS ….
Total runtime: 11.045 ms
(114 rows)
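
For reference, the kind of inheritance layout assumed in that example looks roughly like this; the names, types and ranges are illustrative, not taken from the original schema:

CREATE TABLE persons (
    dni   bigint PRIMARY KEY,
    name  text
);

CREATE TABLE persons_200 (
    CHECK (dni BETWEEN 2000000 AND 2999999)
) INHERITS (persons);

-- each child table needs its own primary key (or an index on dni)
-- for the per-partition Index Only Scans to show up in the plan
ALTER TABLE persons_200 ADD PRIMARY KEY (dni);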


Conclusion: this feature adds one of the most exciting performance improvements in Postgres. We have seen improvements of as much as 30% so far, and we look forward to seeing how this scales as 9.2 becomes production-ready.


Automatically Bring Up A New Mongod Node in AWS

One of the most desirable features of MongoDB is its ability to automatically recover from a failed node. Setting this up with AWS instances can raise challenges, depending on how you're addressing your machines. We see many examples online of people using IP addresses in their configurations, like this:
cfg = {
    _id : "rs_a",
    members : [
        {_id : 0, host : "10.2.3.4", priority : 1},
        {_id : 1, host : "10.2.3.5", priority : 1},
        {_id : 2, host : "10.2.3.6", priority : 0}
    ]
}
Which is fine for giving examples, or maybe for creating a single cluster where you control the hardware. But in the ephemeral world we live in with virtual instances, where IPs change constantly, this is just not feasible. You need to use hostnames and have DNS set up. Well, that's easier said than done, so I've tried to come up with a proof of concept for automatically assigning hostnames to new instances in AWS. I hope that some of my experiences will be valuable in setting up your MongoDB clusters.
Here is the scenario: a replica set with three nodes set up in AWS using EC2 instances. One of the nodes goes down. You want to be able to bring that same node back - or a replacement node - with a minimum of configuration changes. To make this happen, you need to prepare for this event beforehand. All of this is done on Amazon's default OS, which is based on CentOS.
The MongoDB pages on setting up instances in AWS here and here are very valuable for setting up most of it; however, the details of how to set up hostnames could use some elaboration. For one thing, it doesn't really help to put the public domain addresses into your zone files - if you bring up a new instance, you'll get a new public domain name, and you'll need to go back into your zone file and update it. I wanted to minimize the reconfiguration necessary when bringing a new instance up to replace one that went down.
My next reference is Marius Ducea's very helpful article "How To update DNS hostnames automatically for your Amazon EC2 instances". One of the great things I discovered there was the really useful tool Amazon provides to get info about the instance that you're on. Basically, curl http://169.254.169.254/latest/meta-data/ to get a list of available parameters, and then add a parameter onto the end of the URL to get its value. I made a little bash function that I used often while going through this:
meta () {
  curl http://169.254.169.254/latest/meta-data/$1
  echo
}
Stick that in your .bashrc or similar file, and then "meta" will list all parameters, and e.g. "meta local-ipv4" will give you your local IP.

The first thing you want to set up is your DNS server. Of course you must have redundancy, with at least a primary and a backup, spread across different regions. For the purposes of this demonstration, however, I just had a single DNS server. Obviously, do not do this in production! Here are the significant changes when setting up your bind server:

  1. In the {{options}} block, ensure your "listen-on" block will work for all servers who need DNS. In my case, I set it to "any". You may need to do something more restrictive.
  2. {{allow-recursion}} also needs to be set for the same servers.
  3. As in Ducea's article, add your keys and your key blocks. You'll need to copy these keys to your instance, but be careful about leaving copies around, and ensure the keys' permissions are as strict as possible.
  4. Add blocks for your internal and external zones. You can probably get by with just an internal zone, but you may find it more convenient to have both.
  5. Each of the zones will need to reference the key.
  6. The purpose of the "controls" block is so that I can use rndc to control my named server. Although not technically necessary, it becomes useful if you need to edit your zone files after any dynamic updates have been made: you'll need to run "rndc freeze" before editing and "rndc thaw" afterwards, otherwise the hanging .jnl files will make named complain. You'll also need to add an /etc/rndc.conf file, which has at a minimum a "key" and an "options" block. See here for details on setting it up on CentOS systems.

/etc/named.conf:
//
// named.conf
//
// Provided by Red Hat bind package to configure the ISC BIND named(8) DNS
// server as a caching only nameserver (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
options {
// listen-on port 53 { 127.0.0.1; };
 listen-on port 53 { any; };
 listen-on-v6 port 53 { ::1; };
 directory  "/var/named";
 dump-file  "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
 allow-query     { localhost; };
 allow-recursion     { any; };
 recursion yes;
 dnssec-enable yes;
 dnssec-validation yes;
 dnssec-lookaside auto;
 /* Path to ISC DLV key */
 bindkeys-file "/etc/named.iscdlv.key";
};
controls {   
 inet 127.0.0.1 allow { localhost; } 
 keys { palomino.mongo.; }; 
};
key palomino.mongo. {
algorithm HMAC-MD5;
secret "SEcreT FRom Your Generated Key file";
};
logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};
zone "." IN {
 type hint;
 file "named.ca";
};
include "/etc/named.rfc1912.zones";
zone "int.palomino.mongo" IN {
  type master;
  file "int.palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; }; // this should be local network only
  //allow-transfer { 10.160.34.184; };
};
zone "palomino.mongo" IN {
  type master;
  file "palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; };
  //allow-transfer { 10.160.34.184; };
};
-------

In this example, I chose an obviously non-public domain name to avoid confusion. A feature not shown in this example, but which you will need, is allow-transfer lines for dealing with communication with your other nameserver(s). Also note that in this case I used the same key for both dynamic updates from the other hosts and for rndc updates; you may decide to do otherwise. Finally, you may want to add reverse-lookup zones as well.

External Zone File: palomino.mongo.zone
$ORIGIN .
$TTL 86400 ; 1 day
palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
 2012032919 ; serial
 21600      ; refresh (6 hours)
 3600       ; retry (1 hour)
 604800     ; expire (1 week)
 86400      ; minimum (1 day)
 )
 NS dns1.palomino.mongo.
 A local.ip.of.this.instance
$ORIGIN palomino.mongo.
$TTL 60 ; 1 minute
dns1 A public.ip.of.this.instance
Internal Zone File: int.palomino.mongo.zone
$ORIGIN .
$TTL 86400 ; 1 day
int.palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
 2012032917 ; serial
 21600      ; refresh (6 hours)
 3600       ; retry (1 hour)
 604800     ; expire (1 week)
 86400      ; minimum (1 day)
 )
 NS dns1.palomino.mongo.
 A local.ip.of.this.instance
$ORIGIN int.palomino.mongo.
$TTL 60 ; 1 minute
dns1 A local.ip.of.this.instance
You don't need to put any other hosts in here but your DNS server(s). All the other hosts will get in there dynamically. The local IP of this DNS server is what you'll need to bootstrap your DNS when you start your other hosts. Start your DNS server.

Next, for each mongod host, I used a script similar to Ducea's, with the exception that it will take the local IP of the DNS server in addition to the hostname. Also, I insert the DNS server's IP into /etc/resolv.conf. This is probably a bit of a hack; I don't doubt there's a more efficient way to do this. In this example, I am setting the user-data of the ec2 instance to "host:DNS-local-ip", so, for example --user-data "ar1:10.1.2.3". Hopefully this will make more sense later.

update_dns.sh:
#!/bin/bash
USER_DATA=`/usr/bin/curl -s http://169.254.169.254/latest/user-data`
HOSTNAME=`echo $USER_DATA | cut -d : -f 1`
DNS_IP=`echo $USER_DATA | cut -d : -f 2`
line="nameserver $DNS_IP"
# this should skip if there is no DNS_IP sent
if  ! grep "$line" /etc/resolv.conf 1>/dev/null
then
    sed -i"" -e "/nameserver/i$line" /etc/resolv.conf
fi
DNS_KEY=/etc/named/Kpalomino.mongo.+157+29687.private
DOMAIN=palomino.mongo
#set also the hostname to the running instance
hostname $HOSTNAME.$DOMAIN
PUBIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/public-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone $DOMAIN
update delete $HOSTNAME.$DOMAIN A
update add $HOSTNAME.$DOMAIN 60 A $PUBIP
send
EOF
LOCIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/local-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone int.$DOMAIN
update delete $HOSTNAME.int.$DOMAIN A
update add $HOSTNAME.int.$DOMAIN 60 A $LOCIP
send
EOF
This file can be anywhere; I happened to put it in /etc/named, where I also put the keys. It will be run by root and contains secrets, so you should chmod 700 those three files.

The Mongo instances: I used the excellent guide on the MongoDB site to set up a single instance. Go through the steps of the entire process, including Launching an Instance, Configure Storage, and up to Install and Configure MongoDB. Now, for this example, we are only setting up a Replica Set; sharding will be done at another time. The main thing we want to do here is to create a generic image that you can kick off and that will automatically set its DNS and connect to the rest of the Replica Set, with as little human interaction as possible. As in the example given, in /etc/mongod.conf, the only changes I made were:
fork = true
dbpath=/data
replSet = rs_a
So that when this mongod starts up, it will already be configured to be a member of rs_a. Another change that could be made here would be to generate these values dynamically from user-data sent to the instance, similar to the DNS update script above, but in this case I left them hard-coded. You can go ahead and start up mongod, connect to it and ensure it functions.

The final piece of the puzzle is to ensure our DNS update script runs when the host starts up. At first, I had it running as part of /etc/rc.local, as in Ducea's page. However, I soon found out that this means DNS would get updated _after_ mongod has already started, and then mongod couldn't find any of the new hosts. So, to ensure that DNS is updated before mongod starts, I added these lines to /etc/init.d/mongod, right at the beginning of the start() function:
start()
{
  echo "Updating dns: "
  /etc/named/update_dns.sh
  echo -n $"Starting mongod: "
At this point, we are almost ready to save our image. If you did a test and ran mongod, your /data directory will already have journal and/or other files in there, which might have data in them we don't necessarily want on startup. So, stop mongod and remove all your data:
service mongod stop
cd /data
rm -rf *
Now your instance is ready to make an image. You can stop it now (otherwise it would reboot when you create the image). All you need now is the id of your instance. Create your image:
ec2-create-image <instance-id> --name mongod-replset-server --description "Autoboot image of mongod server"
Once you get an image id from that, we can very easily set up our replica set. All we need to bootstrap this is the local IP of your DNS server. For this example, my hostnames will be ar1.palomino.mongo, ar2.palomino.mongo, and ar3.palomino.mongo (public), and ar1.int.palomino.mongo, ar2.int.palomino.mongo, and ar3.int.palomino.mongo (internal). In this example, my DNS server's local IP address is 10.160.121.203.
[mgross@dev ~]$ for i in 1 2 3
> do
> ec2-run-instances <ami-image-id> -n 1 -k <your-keypair-name> -d ar$i:10.160.121.203 -t m1.large -z <your-availability-zone> -g <your-security-group-id>
> done
This will launch 3 instances of the mongod server, which expect to be in a replica set named "rs_a" (from the setting we added to /etc/mongod.conf above). When they start up, right before mongod starts, they'll set the hostnames provided, connect to the DNS server and register their public and private IPs. Now, if you use the nameserver that you set up, you should be able to connect using one of the hostnames, for example: mongo ar1.palomino.mongo. Then set up your replica set:
> rs.config()
null
> cfg = {
...     _id : "rs_a",
...     members : [
...         {_id : 0, host : "ar1.int.palomino.mongo", priority : 1},
...         {_id : 1, host : "ar2.int.palomino.mongo", priority : 1},
...         {_id : 2, host : "ar3.int.palomino.mongo", priority : 0}
...     ]
... }
{
 "_id" : "rs_a",
 "members" : [
 {
 "_id" : 0,
 "host" : "ar1.int.palomino.mongo",
 "priority" : 1
 },
 {
 "_id" : 1,
 "host" : "ar2.int.palomino.mongo",
 "priority" : 1
 },
 {
 "_id" : 2,
 "host" : "ar3.int.palomino.mongo",
 "priority" : 0
 }
 ]
}
> rs.initiate(cfg);
{
 "info" : "Config now saved locally.  Should come online in about a minute.",
 "ok" : 1
}
> rs.status();
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:20:49Z"),
 "myState" : 2,
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 6,
 "stateStr" : "UNKNOWN",
 "uptime" : 2,
 "optime" : {
 "t" : 0,
 "i" : 0
 },
 "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
 "pingMs" : 0,
 "errmsg" : "still initializing"
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 5,
 "stateStr" : "STARTUP2",
 "uptime" : 4,
 "optime" : {
 "t" : 0,
 "i" : 0
 },
 "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
 "pingMs" : 0,
 "errmsg" : "initial sync need a member to be primary or secondary to do our initial sync"
 }
 ],
 "ok" : 1
}
PRIMARY> rs.status();
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:21:46Z"),
 "myState" : 1,
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 59,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
 "pingMs" : 0
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 61,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
 "pingMs" : 0
 }
 ],
 "ok" : 1
}
PRIMARY> 
-------- Now, to demonstrate our automatic new host configuration, let's completely take out one of our hosts and replace it with a brand new instance:
[root@ar1 ~]# shutdown -h now
We can see that ar2 has been elected primary:
[ec2-user@ar2 ~]$ mongo
MongoDB shell version: 2.0.4
connecting to: test
PRIMARY> rs.status()
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:25:24Z"),
 "myState" : 1,
 "syncingTo" : "ar1.int.palomino.mongo:27017",
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 0,
 "state" : 8,
 "stateStr" : "(not reachable/healthy)",
 "uptime" : 0,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:23:38Z"),
 "pingMs" : 0,
 "errmsg" : "socket exception"
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 267,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:25:24Z"),
 "pingMs" : 14
 }
 ],
 "ok" : 1
}
Now, just spin up another instance that will replace ar1:

> ec2-run-instances ami-cda2fa88 -n 1 -k engineering-west -d ar1:10.160.121.203 -t m1.large -z us-west-1b -g mm-database

With no further action, your instance will acquire the new hostname, get polled by the new primary ar2, and become part of the Replica Set again:
PRIMARY> rs.status()
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:34:09Z"),
 "myState" : 1,
 "syncingTo" : "ar1.int.palomino.mongo:27017",
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 77,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
 "pingMs" : 0
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 792,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
 "pingMs" : 19
 }
 ],
 "ok" : 1
}
PRIMARY> 
Now, this exercise had a lot missing from it: most notably multiple DNS servers for redundancy, and sharding. But I hope that this was helpful in seeing how you can use DNS servers to make failover of a MongoDB host a little closer to automatic. I've run this scenario several times, and I did have some issues the first time I was setting up the replica set on the cluster. Once that got going, though, the failover and promotion of the brand new host worked fine. I welcome your comments.

Avoid Gotchas While Starting up with MongoDB

 

When starting to work with a new technology in a development or sandbox environment, we tend to run things - as much as possible - using their default settings.  This is completely understandable, as we're not familiar with what all of the options are.  With MongoDB, there are some important issues you might face if you leave your setup in the default configuration.  Here are some of our experiences getting started trying out MongoDB on virtual machines, both locally and on some Amazon EC2 instances.

 

Versions

Ensure that the version of MongoDB that you're using has the features you're going to use.  For example, if you installed the "mongodb" package from the default Debian repositories as of this writing (v 1.4.4), you'd find that some of the commands used for sharding are available, but sharding isn't actually supported in that version, so the incompatibility is masked.  Ensure you get the latest (as of this writing, it's 2.0.2) by adding downloads-distro.mongodb.org/repo/ubuntu-upstart to your repos and installing the package named mongodb-10gen.
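
On an Ubuntu or Debian system that looks roughly like this (the key ID and repository line follow the 10gen instructions of the time; verify them against the official docs before use):

# import the 10gen package-signing key and register the repository
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
echo "deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen" | sudo tee /etc/apt/sources.list.d/10gen.list

# install the 10gen build instead of the distribution's outdated "mongodb" package
sudo apt-get update
sudo apt-get install mongodb-10gen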

Check that your OS version is compatible with what you're going to use it for.  MongoDB recommends a 64-bit OS, because the memory-mapped files it uses limit a 32-bit build to roughly 2GB of data.

 

Why is it taking so long to start up?

If you're on a 64-bit install of version 1.9.2+, journaling is enabled by default.  MongoDB may decide it needs to preallocate the journal files, and it waits until the files are allocated before it starts listening on the default port.  This can take on the order of tens of minutes. You definitely don't need this option if you're just kicking the tires and trying to see what MongoDB can do.  Restart with --nojournal.

  • What type of filesystem is your database directory mounted on?

A lot of the popular Linux AMIs available in Amazon's list have the ext3 filesystem mounted by default.  It's recommended to use ext4 or xfs filesystems for your database directory, due to the file allocation speed.  This is especially noticeable if MongoDB starts allocating journal files, as in the above. If you're using an AWS instance, you'll avoid this problem if you set up a RAID10 filesystem for your data directory, as shown here.

 

Disk Space

Another issue is disk space.  If you leave settings at their defaults and you're on a VM or a machine with limited disk space, you could very well start hitting your limits soon.  Even if you are only starting up a configserver, it will end up taking another 3GB if you're not careful.  Our recommendation is to use the --smallfiles flag as you're starting, stopping, and configuring, until you figure out what you're doing. As an example, we followed this page to create a sharded database on a Debian VM with about 16GB of disk space, and it quickly ballooned to this:

moss@moss-debian:~/mongodb$ du -h .
3.1G ./a/journal
204K ./a/moveChunk/test.data
208K ./a/moveChunk
4.0K ./a/_tmp
3.3G ./a
3.1G ./b/journal
4.0K ./b/_tmp
3.3G ./b
3.1G ./config/journal
4.0K ./config/_tmp
3.3G ./config
9.7G .

 

Bottom line: use "--smallfiles" in your command line flags or in your /etc/mongodb.conf files until you are actually running in an environment that has the required disk space.
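
For reference, the equivalent lines in /etc/mongodb.conf would look something like this (ini-style option names as used by the 2.0-era packages; drop them once you move to properly sized hardware):

# keep initial data file allocation small while experimenting
smallfiles = true
# optional: skip journal preallocation entirely in a throwaway sandbox
nojournal = true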

 

Splitting and Balancing

As Jeremy Zawodny outlines in his excellent blog post “MongoDB Pre-Splitting for Faster Data Loading and Importing”, it is important to understand how MongoDB manages shards. By default, documents are grouped into 200MB chunks which are mapped to shards, and MongoDB then moves those chunks between shards as the balancer attempts to manage the load. If you’re doing a large data migration, however, this can be tricky. Check Jeremy’s post for some great advice on pre-splitting while maintaining acceptable performance levels.

 

Spaces in Configuration Files

It's not in the documentation anywhere, but another gotcha was in a configuration file - for example a line like this, with multiple values for a single parameter:

configdb = ip-10-172-31-109.us-west-1.compute.internal,ip-10-170-209-104.us-west-1.compute.internal,ip-10-172-169-38.us-west-1.compute.internal

If you have spaces before or after the ",", the setting will not parse.  Just ensure it's a single string with no spaces as it is above.

When is MongoDB the right choice for your business? We explore detailed use cases.

As part of PalominoDB’s “Mongo Month”, we’re reviewing use cases for where we’d recommend choosing MongoDB and where we’d consider other technologies.

Because every environment and architecture is different, and every business has unique needs, we work carefully with clients to find the best solution for the particular challenges they face. Sometimes MongoDB or one of the other open-source technologies we build and support is most appropriate; sometimes an RDBMS is most appropriate; and often a hybrid environment makes the most sense for a client’s specific needs.

Our partners at 10gen lay out the more typical use cases, and we find there are often additional factors to consider, such as:

  • Risk tolerance for bugs and unmapped behaviors
  • Availability requirements 
  • Access- and location-based requirements
  • Security requirements
  • Existing skill sets and tooling
  • Existing architecture and infrastructure
  • Growth expectations and the timeline therein
  • Support? Community? Start-up? Enterprise Class?

Below, we’ll review some of the specific use cases our clients face, and we’ll explore how MongoDB and/or other technologies might address these most appropriately.

 

Archiving

Craigslist is one of the most famous implementations of MongoDB for archiving - in this case, old posts.  The schemaless design makes it easy for the datastore to grow as the data model evolves.  Because MongoDB does not offer some features, such as compression, that other tools like Cassandra offer natively, Craigslist has created workaround patterns, such as pre-splitting data to avoid chunks migrating between shards, or using ZFS with LZO compression.

MongoDB and Cassandra both suit this use case well.  The choice in a situation like this is often determined by in-house skillsets and preferences, as well as the actual size and amount of data that needs to be managed (stay tuned for a future post about the complexities of data management at various stages of data volume and scale). Additional considerations for Cassandra (and HBase or any other Java-based DBMS) include JVM management and tuning, which can be quite challenging for operations teams unused to working with this issue.

other solutions:

  • Cassandra

pros: Native compression reduces complexity and growth is easier to manage with new nodes. Cassandra is arguably the better choice for massive scale (both in storage growth and in throughput of inserts) and does multi-datacenter installations well.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Content Management

Schema evolution is a huge win in the content management field.  10gen’s production deployments show numerous successful use cases.  MongoDB’s rich querying capabilities and the document store’s ability to support hierarchical models are all perfect for a CMS. For a great real world example, see the extensive documentation on the Wordnik deployment, and how they manage 1.2TB of data across five billion records.

other solutions:

  • MySQL or PostgreSQL w/caching and read distribution

pros: Skillsets are more readily available, and can support existing tools for managing the RDBMS infrastructure.  Reads are easily scaled in known patterns in the relational database world.

cons: Schema modifications can prove expensive and hard to integrate into publishing workflows.  Failover is not automatic. 

 

Online Gaming

MongoDB works very well with small reads and writes, assuming you manage to keep your working set in RAM.  Compounding that with ease of schema evolution and the replica set functionality for easy read scale and failover creates a solid case to investigate MongoDB.  In fact, this rule can be pushed out to any situation where writes can grow out of hand for a single database instance.  When you find yourself needing to support growth of writes, those writes being small and numerous, you need to ask if you want to a) design the writes away (easier said than done), b) functionally partition the workload, or c) shard.  MongoDB’s autosharding is nice, though not perfect - and there are gotchas PalominoDB can assist you with.  Depending on other variables mentioned earlier, MongoDB might be a solid fit for this part of your workload. Disney’s Interactive Media Group offered a great presentation at the annual MongoSV conference on how they use MongoDB in their environment. 

other solutions:

  • MySQL or PostgreSQL with sharding

pros: Skillsets are more readily available, and can support existing tools for managing the RDBMS infrastructure.  

cons: Games require significant schema modifications in early iterations, and these can prove expensive and impactful.  Extensive development and QA hours and increased complexity come hand in hand with writing your own sharding.

  • Cassandra

pros: Arguably the better choice for massive scale (both in storage growth and write throughput), and handles multi-datacenter installations well.

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Log Centralization

Asynchronous (and speedy!) writes, capped collections for space management and the flexibility of a schemaless database (since attributes in logs tend to pop up like mushrooms after a rainstorm) are often cited as key benefits for using MongoDB to store logs.  Admittedly, one could build a queue to push data asynchronously into an RDBMS, and partitioning plus a few scripts could duplicate the space management features of a capped collection.

MongoDB and Cassandra both do this well.  Cassandra is arguably the better choice for massive scale and works well in a multi-datacenter environment.  However, MongoDB is much easier to use, manage and query. The size of the client, the skill-set on hand and the availability needs will help us here.

other solutions

  • Percona’s XtraDB w/socket handler and MySQL partitioning, XML datatypes

pros: MySQL familiarity, reuse of MySQL infrastructure (backups, monitoring etc...)

cons: Socket Handler is still somewhat buggy, partitioning requires scripts and XML is not easily queried.

  • Cassandra w/TTL data

pros: multi-datacenter support makes this more viable than MongoDB.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Queue Implementation

MongoDB implements its internal replication using tailable cursors and capped collections, and the same features are useful to build simple persistent network-based queuing (distributed message-passing) implementations rather than using a “proper” queueing package. One such implementation is described in detail on Captain Codeman's Blog. Building your own queueing mechanism on top of a DBMS can be suboptimal, but one organization did so because they already had MongoDB expertise on-staff and had difficult performance problems with ActiveMQ.
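
As a rough sketch of the idea (the collection and field names here are ours, not from the post): a capped collection preserves insertion order and can be followed with a tailable cursor, much like tail -f on a file.

// create a small capped collection to act as the queue
db.createCollection("work_queue", {capped: true, size: 100 * 1024 * 1024})

// producer: just insert documents
db.work_queue.insert({task: "resize-image", payload: "img123", ts: new Date()})

// consumer: a tailable cursor keeps returning documents as they are appended
var cur = db.work_queue.find()
                       .addOption(DBQuery.Option.tailable)
                       .addOption(DBQuery.Option.awaitData);
while (cur.hasNext()) {
    printjson(cur.next());
}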

other solutions

  • ActiveMQ, RabbitMQ. 

pros: proper queueing solutions. Known and documented problems or solutions.

cons: more brittle, more difficult to set up, and less performant if your queueing needs are extremely simple.

 

In summary, MongoDB, either on its own or in a hybrid environment with other technologies, is a wonderful choice for many of the most common use cases and business challenges our clients face. Helping clients make these complex decisions, and working together on the installation, management and optimization of these tools is our core business, and we encourage you to get in touch if you’d like to explore using MongoDB in your own environment.

Considerations about text searches in big fields and planner costs

Imagine the following scenario: you have a table with a small set of records, with columns of tsvector and text data types, but the text fields hold almost 20 megabytes of text or more.

So, what happens? The Postgres planner checks reltuples (the estimated number of tuples) in the relation to calculate the cost of retrieving the data. If the number of tuples reached by the query is higher than roughly 20% of the table, or the number of blocks in the relation is too small, the planner will choose a sequential scan; otherwise random (index) access is chosen.

But in that case we face a problem: the number of rows and blocks in the relation does not reflect the “total” content. That happens because Postgres uses a technique called TOAST (The Oversized Attribute Storage Technique), which compresses large values and stores them apart from the main block (by default, anything that pushes a row over roughly 2 kB). So, in the relation itself you will have only the first bytes of your column data. You can tune how the columns get TOASTed (for example, the level of compression), but the default behaviour is to compress and store the value out of line. This methodology is pretty cool, because if you have, e.g., 500MB of data in a single field, you don't want your relation to carry all of that data in the same file; in most cases that would not be convenient.
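
If you want to see this for yourself, compare the size of the relation's main file with its total size (the queries below use the documents_en table defined later in this post):

SELECT pg_size_pretty(pg_relation_size('documents_en'))       AS main_file,
       pg_size_pretty(pg_total_relation_size('documents_en')) AS with_toast_and_indexes;

-- and the TOAST relation hanging off the table:
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'documents_en';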

So, while you have a small number of rows, the amount of data to read from disk is quite large compared to what is stored in the relfilenode (the main data file). For FTS we need to dig into all that content, and to make the searches faster we need GIN or GiST indexes. These index types store the lexeme data inside the index (differently from B-Tree indexes), making it possible to resolve the search using the index.

So, you want to access your data through the index, but the planner estimates that the number of rows is so small that a sequential scan is cheaper than using the index. The only way to force the index read is to run *SET enable_seqscan TO off;* before executing your query. This variable is per session, so you might want to set it only inside the transaction (see the sketch below) and set it back to "on" for other queries.
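
A slightly tidier way to scope this, as mentioned above, is SET LOCAL, which reverts automatically at the end of the transaction:

BEGIN;
SET LOCAL enable_seqscan TO off;   -- only affects this transaction
SELECT author, title
  FROM documents_en
 WHERE cont_ftsed @@ to_tsquery('english', 'editor & find');
COMMIT;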

By the way, once your table is populated with a large enough amount of data, you won't need this trick anymore; but if your table is small and fairly static, it can be a useful little fix. Beyond that, this isn't really a performance problem, but this kind of thing shows you how the planner works.

Let’s see an example:

CREATE TABLE documents_en (
   author_id char(2),
   text_id serial,
   author varchar(45) NOT NULL,
   title varchar(127) NOT NULL,
   content text NOT NULL,
   cont_ftsed tsvector  ,
   PRIMARY KEY (author_id, text_id)
);

CREATE FUNCTION ftsed_upd_en() RETURNS trigger AS $$
DECLARE
BEGIN
NEW.cont_ftsed:=(setweight(to_tsvector('english', coalesce(NEW.title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(NEW.author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(NEW.content, '')), 'C'));
RETURN NEW;
END
$$
LANGUAGE plpgsql;

CREATE TRIGGER tgr_ins_ftsed_en BEFORE INSERT ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE TRIGGER tgr_upd_ftsed_en BEFORE UPDATE OF author,title,content
ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE INDEX ix_ftsed_func_en ON documents_en USING gin(
   (setweight(to_tsvector('english', coalesce(title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'C'))
);

CREATE INDEX ix_ftsed_en ON documents_en USING GIST(cont_ftsed);


Before continuing, we need to clarify three things: text_id is related to author_id, but just for testing we use a serial to avoid collisions in the primary key; we use triggers to automatically populate the tsvector column; and finally the indexes: I created two indexes to show the examples. IMHO I don’t recommend the functional index for performance reasons, but it is still an option.

To fill up the table, I made a script to collect all the READMEs on the server and load them into the DB.


#!/bin/bash
FILENAME="$@"
TMPBFF=temp_file
PSQL=/opt/pg92dev/bin/psql
WORDS=words

for _FILE_ in $FILENAME
do
 egrep -oi '[a-z]+' "$_FILE_" | egrep -i '^[a-z]{4,}$' > $WORDS   # keep only words of 4+ letters
 AUTHOR=$(cat $WORDS | sort -u | head -n2 | awk '{printf("%s ", $0) }')
 AUTHOR_ID=${AUTHOR:0:2}
 TITLE=$(cat ${WORDS} | sort | uniq -c | sort -nr | head -n3 | awk '{printf("%s ", $2)}')
 BODY=$(cat $WORDS | sort -u | awk '{printf("%s ", $0) }')

 $PSQL -Upostgres fts -c "INSERT INTO documents_en (author_id, author, title, content) VALUES ('${AUTHOR_ID}', '${AUTHOR}', '${TITLE}','${BODY}' );"
done


I know it's a bit hacky, but it's useful for this test. To execute it, you'll need something like:

locate README.txt | xargs ./script

First let’s get some stats from the table:

fts=# select count(*) from documents_en ;
count : 22
fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty :  24 kB (size of the table)

fts=# select relpages from pg_class where relname = 'documents_en';
relpages : 3 (8 kB each)
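
As an aside, those numbers are enough to roughly reproduce the planner's sequential-scan estimate with the default cost parameters (seq_page_cost = 1.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025); this is a back-of-the-envelope sketch, not the planner's exact code path:

SELECT 3  * 1.0                  -- relpages * seq_page_cost
     + 22 * (0.01 + 0.0025)      -- reltuples * (cpu_tuple_cost + cpu_operator_cost for the filter)
  AS estimated_seq_scan_cost;    -- ≈ 3.27, close to the total cost in the plan below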

Now, a simple query with an EXPLAIN ANALYZE:

fts=# EXPLAIN ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.228..0.464 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime: 0.501 ms
(4 rows)


Oh! It’s using a sequential scan (as we expected), but now let’s add a trick:

fts=# SET enable_seqscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.127..0.191 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime: 0.280 ms
(4 rows)

What happened? The cost is higher, but the query is faster? The short answer is yes, that is possible. Cost is based on estimates of the number of blocks and the type of access; random access is weighted about 4:1 against sequential access (random_page_cost vs seq_page_cost). In this particular case the table is small, so the planner estimates it is cheaper to read it without indexes. BUT that's not the real problem; the real one is that the data is already TOASTed outside the “data file”, and that data is not considered by the planner in this case.

Another question is: can we improve the access to this data? The answer is YES, if you don't mind the extra storage. What you can do is:

fts=# alter table documents_en alter column cont_ftsed set storage external;
ALTER TABLE

That will store the data of the cont_ftsed column uncompressed and outside the table (this is not exactly how it works, but for didactic purposes we'll describe it that way).

fts=# vacuum  full analyze documents_en;
VACUUM
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.071..0.127 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime: 0.178 ms
(4 rows)

fts=# SET enable_seqscan TO on;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.067..0.292 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime: 0.329 ms
(4 rows)

In this case we used VACUUM FULL because the SET STORAGE option doesn't change data that was already stored; it only affects rows written afterwards. With the vacuum, we force the change to be applied across the whole table.

OK, what happens if the table gets filled with more data? The planner will automatically start to use the index:

fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty
----------------
1744 kB
(1 row)
fts=# SELECT count(*)  FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
count
-------
   45
(1 row)


fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                     QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents_en  (cost=4.38..47.02 rows=13 width=31) (actual time=1.091..5.404 rows=45 loops=1)
  Recheck Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
  ->  Bitmap Index Scan on ix_ftsed_en  (cost=0.00..4.37 rows=13 width=0) (actual time=1.003..1.003 rows=125 loops=1)
        Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
Total runtime: 5.489 ms
(6 rows)

fts=# SET enable_bitmapscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                         QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..56.50 rows=13 width=31) (actual time=0.307..5.286 rows=45 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
Total runtime: 5.390 ms
(4 rows)

We don’t see a huge difference in execution time between the two plans, but we are still facing the question of how the estimated cost relates to real performance.

But, don’t forget the functional index! I think you will like the following example:
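(For context: the example relies on an expression, a.k.a. functional, index named textsearch_en over the weighted tsvector expression. Its definition is not repeated in this section, but it presumably resembles the sketch below; GIN is shown, though GiST would also work.)

-- Hypothetical reconstruction of the functional index used below
CREATE INDEX textsearch_en ON documents_en USING gin(
    (setweight(to_tsvector('english', coalesce(title,   '')), 'A') ||
     setweight(to_tsvector('english', coalesce(author,  '')), 'B') ||
     setweight(to_tsvector('english', coalesce(content, '')), 'C'))
);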

fts=# analyze;
ANALYZE
fts=# explain analyze select author_id, text_id from documents_en
   where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..292.20 rows=17 width=7) (actual time=2.960..3878.627 rows=45 loops=1)
  Filter: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 2238
Total runtime: 3878.714 ms
(4 rows)

fts=# SET enable_seqscan TO off;
SET
fts=# explain analyze select author_id, text_id from documents_en
   where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents_en  (cost=272.15..326.46 rows=17 width=7) (actual time=0.937..1.026 rows=45 loops=1)
  Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  ->  Bitmap Index Scan on textsearch_en  (cost=0.00..272.15 rows=17 width=0) (actual time=0.916..0.916 rows=45 loops=1)
        Index Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
Total runtime: 1.109 ms
(5 rows)

Again! But this time the difference in execution time is much bigger. Not so pretty, is it? So it seems you need to be careful when you run searches over big columns that end up TOASTed. It may be worth checking your production database to see whether a few tweaks like these would get more performance out of your Postgres instance.
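If you want to do that check, a catalog query along these lines (just a sketch) lists the TOAST-able columns of your user tables together with their current storage mode, so you can decide where EXTERNAL storage might pay off:

-- Columns whose storage mode allows TOASTing ('x' extended, 'm' main, 'e' external)
SELECT c.relname, a.attname, a.attstorage
  FROM pg_class c
  JOIN pg_attribute a ON a.attrelid = c.oid
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relkind = 'r'
   AND a.attnum > 0
   AND NOT a.attisdropped
   AND a.attstorage IN ('x', 'm', 'e')
   AND n.nspname NOT IN ('pg_catalog', 'information_schema');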

The version used in this post is PostgreSQL 9.2devel on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.2 20111027 (Red Hat 4.6.2-1), 64-bit. I also ran the tests on 9.1 and observed the same behavior.

Happy hacking!





PalominoDB kicks off "Mongo Month" with Exciting News!

 

PalominoDB is pleased to announce the formation of a service partnership with 10gen, the creators of the popular MongoDB database.  Our partnership provides our existing clients with access to a broader community of experts, and offers new clients a trusted partner in a rapidly evolving technology space.

Our team of senior database architects, administrators, and application developers has decades of experience working with clients to build, optimize, and support complex database systems.  We believe MongoDB has a solid place in the open source database ecosystem alongside our workhorse, MySQL, and we are proud to support environments that use both technologies.

Our broad expertise in both relational database technologies and newer key-value and document-oriented storage options situates us perfectly to help clients explore how MongoDB can work in their own environments. We can help you assess your own application needs in areas of performance, scalability, availability, and security, and work with you to optimize your production environment.

“Joining the 10gen Partner program is an exciting step for us,” adds Laine Campbell, founder and principal at PalominoDB. “Our commitment to excellence in emerging database technologies has informed our MongoDB work to date, and we are thrilled to be able to extend that to a broader community in partnership with 10gen.  The ability to interface directly with 10gen’s extensive knowledge of the product, in conjunction with our own operational expertise, should bring an unsurpassed level of expertise to our clients using, or looking to use, MongoDB.”

PalominoDB will be celebrating "Mongo Month" throughout March, with new blog posts, advice and tools for making the most of MongoDB in your organization. Stay tuned!

Performance Optimization for MyISAM Compressed Tables

Database performance optimization is a significant part of the ongoing service we provide for our clients. We recently found that a client had a query they performed regularly that was drastically slowing the system as a whole, and we investigated to see if we could help them resolve the issue.

The symptoms were confusing: though the server configuration was sized appropriately for the available memory, the server was pre-allocating approximately 20GB of RAM immediately on a query, even when the query required nowhere near that allocation, leading to excessive swapping and performance hits.

We explored various options: Were there SET SESSION statements being executed improperly before the query? Were buffer_size parameters set correctly? Were there other variables we hadn’t considered?

As we investigated, we discovered something interesting: The MySQL Server (mysqld) sets a default memory allocation for MyISAM compressed tables, and that default setting is unnecessarily large (18446744073709547520). As a result, it maps into memory all of the MyISAM compressed tables. In our case, with 20GB tables, mysqld was suddenly allocating an unreasonable amount of memory to the process.

This default memory allocation variable was introduced in MySQL 5.1.43, and you can read more about it in the MySQL documentation.

We immediately tuned this down to a more manageable size for the actual queries being run, and are seeing the expected performance improvements in our client’s queries.
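For reference, the variable described above is, to our understanding, myisam_mmap_size (added in MySQL 5.1.43); that name is our assumption, since the original link is no longer available. A minimal sketch of checking it and capping it in the configuration (the value is purely illustrative and depends on your workload):

-- Current limit for memory-mapping compressed MyISAM tables
-- (assumed to be myisam_mmap_size; it is not dynamic, so it must be set
--  in my.cnf / at server startup and requires a restart to change)
SHOW GLOBAL VARIABLES LIKE 'myisam_mmap_size';

-- Illustrative my.cnf entry capping it at 4GB:
-- [mysqld]
-- myisam_mmap_size = 4294967296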
