Emanuel Calvo

Non-deterministic Functions and the Binary Log

I wrote this post because I ran across this issue when debugging why tables with triggers/functions that were not getting replicated to slaves. The problem appears when binlog_format is set up  as STATEMENT. You can check the ‘change log’ checking the following link for more information about default values [1].

If you have non-deterministic functions  [2]  that insert or modify your data with dynamic SQL like this :

<CODE>
DELIMITER $$
CREATE FUNCTION `insert_dynamic_data` (in_param VARCHAR(30)) RETURNS bigint(20)
BEGIN
 ....
 INSERT INTO `table_to_ins` (param) VALUES(in_param);
 ...
 /* You can also add here @@warning_count or row_count() check
     and return -1 to catch errors*/
 return LAST_INSERT_ID();
END
$$
</CODE>

The problem starts when you plan to user “replication” through binary logs events. In that cases, you must consider:

"If a stored program that modifies data is nondeterministic, it is not repeatable. This can result in different data on a master and slave, or cause restored data to differ from the original data."  [3]

When you create a function, by default is NOT DETERMINISTIC, which means that for each execution could the code will be executed again. In the other hand, DETERMINISTIC will check if the parameters are the same and if is that condition comes true, will return the same result as before without executing again the code.

If you want to create the function with the binlog_format in STATEMENT and log_bin activated without the log_bin_trust_function_creators, you will get the following error:

Version 5.1.41:
mysql> DELIMITER $$
mysql> CREATE FUNCTION fx_pru () RETURNS int
    -> BEGIN
    ->  insert into prueba select 'po',round(rand()*100), rand();
    -> RETURN 1;
    -> END
    -> $$
ERROR 1418 (HY000): This function has none of DETERMINISTIC, NO SQL, or READS SQL DATA in its declaration and binary logging is enabled (you *might* want to use the less safe log_bin_trust_function_creators variable)

What happened so? The problem was that the funciton were created previous to activate the log_bin.

You have to ways to fix it: activating binlog_format as ROW or declare in your /etc/my.cnf the log-bin-trust-function-creators which allows insertions in the binary log without enforcing deterministic property. In the specified case, we cannot declare this function as DETERMINISTIC because it uses dynamic SQL (a different parameters, different results).

What happens with the execution of the function? Basically, it creates a lock contention ( if you usually use Nagios, you will see a mysql-lock-contention alarm), but MySQL will not raise any error or alarm, which causes confusion when trying to find the error. Other statements that run outside the function will be executed normally. In the previous example, this function was only for inserts, but every statement inside the function will not work. Derived  from this problem, statements inside the function will not be executed directly.

You wouldn't  see any errors. The only thing you could realize is the lock contention. The lock is generated because the statement hangs a lock into the table trying to reach a commit that never comes.

It is a bug? No, is a feature. It prevents unknown and potentially slow executions from being replicated, causing lagging into the replication process.

I recommend a link about this and related problems with triggers: [4]

1: http://www.filehippo.com/es/download_mysql/changelog/6962/
2: http://en.wikipedia.org/wiki/Nondeterministic_algorithm
3: http://dev.mysql.com/doc/refman/5.1/en/stored-programs-logging.html
4: http://dba.stackexchange.com/questions/321/dynamic-sql-in-mysql-stored-routines

Indexing text columns with GIST or GIN to optimize LIKE, ILIKE using pg_trgm in Postgres 9.1 (Part 1)

 
“The pg_trgm module provides functions and operators for determining the similarity of ASCII alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.”
 
This is the introduction to the official documentation of the extension at [1]. Note, that I used the EXTENSION terminology instead of CONTRIB, as in PostgreSQL 9.1.  Now we’ll use the CREATE EXTENSION command to include this module in the database. This new methodology allows us to manage modules installations/uninstallations with only a few commands.

The idea of this post is to show you how KNN GIST and pg_trgm could be used together to obtain interesting results. First, let’s start with some basic elements of pg_trgm.

Installation

Installing the module is easy. If you are installing through the use of source code, you must compile the module and once you get access to the database execute:

CREATE EXTENSION pg_trgm;

That’s it! Installation complete!

What is a Trigram and how use them?


A trigram is a group of three consecutive characters in a string that can be used to detect the similarity of two words (for example) or the ‘distance’ between them.

When we talk about ‘distance’, 0 means in the same place and 1 is very far. When we talk about similarity, 1 is equal and 0 is totally different. In other words, distance is 1 minus the similarity value. These concepts about distance and similarity, are necessary to start without confusion.

What’s is a trigram? A trigram is a group of three consecutive characters from a string, used to know the similarity between two strings by counting the trigrams they share.

A trigram looks like this:

palominodb=# select show_trgm('PalominoDB CO');
                              show_trgm
-----------------------------------------------------------------------
{"  c","  p"," co"," pa",alo,"co ","db ",ino,lom,min,nod,odb,omi,pal}
(1 row)

In the pg_trgm extension we have functions and operators. show_limit and set_limit are functions used to set up and show the similarity threshold for the % operator. This operator takes the form “string1 % string2” and returns a boolean type (“t” if the similarity is greater than the similarity threshold, otherwise it returns “f”).

palominodb=# select show_limit();
show_limit
       0.4

palominodb=# select set_limit(0.3), show_limit();
set_limit | show_limit
         0.3 |    0.3

In the following example we’ll see the use of each one. Operator % will return true if the similarity of the strings is greater than similarity threshold returned by show_limit function. In this example, both string are equal, in consequence, the operation will return true:

palominodb=# select similarity('Palomino','Palomino')  AS Similarity,
                                  'Palomino'<->'Palomino'              AS distance,
                                  'Palomino' % 'Palomino'              AS SimilarOrNot;
-[ RECORD 1 ]+--
similarity          | 1
distance     | 0
similarornot     | t

Index Support and usage


Now let’s discuss combining GIST or GIN and pg_trgm. pg_trgm includes an operator class to support searches using similarity, like, or ilike operators. GIN and GIST have several differences. If you don’t know which to choose, just remember a few rules: GIN searches quicker than GIST but is slower to update; if you have a write-intensive table use GIST. GIN is better for static data. Please be aware, however, that they don’t support exact matching with the equals operator!  You can do an exact match using like/ilike with no wildcards.  If you want to use the equals operator(=), you must create a standard BTREE index on the pertinent column.

In the following examples, we’ll show a table only with a GIST index. As you can see, if you want to match the exact value with equal operator, it will scan the whole table:

palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto = 'Palomino';
                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on texto_busqueda  (cost=0.00..90.15 rows=1 width=136) (actual time=16.835..16.846 rows=1 loops=1)
  Filter: (texto = 'Palomino'::text)
Total runtime: 17.094 ms
(3 rows)

But, if we use LIKE operator, index scan will be activated:

palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto like 'Palomino';
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using texto_busqueda_texto_idx on texto_busqueda  (cost=0.00..8.27 rows=1 width=136) (actual time=0.374..1.780 rows=1 loops=1)
  Index Cond: (texto ~~ 'Palomino'::text)
Total runtime: 1.979 ms
(3 rows)


palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto like '%Palomino%';
                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Scan using texto_busqueda_texto_idx on texto_busqueda  (cost=0.00..8.27 rows=1 width=136) (actual time=0.171..1.732 rows=1 loops=1)
  Index Cond: (texto ~~ '%Palomino%'::text)
Total runtime: 1.882 ms
(3 rows)



To use an index for  match equal strings, we need to create a BTREE index. But in case of BTREE there is a limitation of 8191 bytes per index row. So, if you have very large text columns you will not allowed to create a BTREE index without using functional indexes.

We get this result because, unlike BTREE indexes, the search string is not left-anchored.

The creation of indexes with the pg_trgm operator class is simple:

CREATE INDEX ON texto_busqueda USING GIST(texto gist_trgm_ops);
or
CREATE INDEX ON texto_busqueda USING GIN(texto gin_trgm_ops);

If you want a more comprehensive understanding of GIST or GIN implementation on Postgres, you can download the the source code and read  src/backend/access/gist/README and src/backend/access/gin/README.


Another useful technique

Combining % operator to get the strings that have a similarity greater than the established threshold and similarity function -that returns the similarity-, we can get ordered from the most similar to the less one discarding all the strings that aren’t similar enough:

palominodb=# SELECT ctid, similarity(texto, 'Palominodb') AS simil
palominodb-#  FROM texto_busqueda
palominodb-#  WHERE texto % 'Palominodb'
palominodb-#  ORDER BY simil DESC;
 ctid  |  simil
--------+----------
(55,3) | 0.666667
(1 row)

The same query, but with its EXPLAIN plan shows that the generated index is used in the condition:

palominodb=# EXPLAIN ANALYZE  SELECT ctid, similarity(texto, 'Palominodb') AS sml
palominodb-#  FROM texto_busqueda
palominodb-#  WHERE texto % 'Palominodb'
palominodb-#  ORDER BY sml DESC;
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Sort  (cost=14.26..14.27 rows=3 width=138) (actual time=3.428..3.437 rows=1 loops=1)
  Sort Key: (similarity(texto, 'Palominodb'::text))
  Sort Method: quicksort  Memory: 17kB
  ->  Bitmap Heap Scan on texto_busqueda  (cost=4.28..14.24 rows=3 width=138) (actual time=3.336..3.383 rows=1 loops=1)
        Recheck Cond: (texto % 'Palominodb'::text)
        ->  Bitmap Index Scan on texto_busqueda_texto_idx  (cost=0.00..4.28 rows=3 width=0) (actual time=3.278..3.278 rows=1 loops=1)
              Index Cond: (texto % 'Palominodb'::text)
Total runtime: 3.578 ms
(8 rows)


Well, this is the first part. Hope you enjoyed the reading and I’ll wait your comments and feedback!

 


[1] http://www.postgresql.org/docs/9.1/static/pgtrgm.html

Postgresql 9.1 - Part 1: General Features

General scope

 
Postgresql 9.1 runs over the theme “features, innovation and extensibility” and it really does. This version was born to overcome Postgresql 9.0 ‘s limitations and known bugs in replication. If you are developing over 9.0, it’s time to think seriously about preparing your code for Postgresql 9.1.
 
The intent of this series of posts are not to be another release features posts. I offer a vision based on my personal experience and focus on the features that I saw exciting for the most of the projects where I’m involved. If you want to read an excellent general article about the new features of this version, web to [2].
 
At the moment of this post, the last PostgreSQL version is 9.1.1 . It includes 11 commits to fix GIST memory leaks, VACUUM improvements, catalog fixes and others. A description of the minor release can be check at [3].
 
The main features included are:

  • Synchronous Replication
  • Foreign Data Support
  • Per Column collation support
  • True SSI (Serializable Snapshot Isolation)
  • Unlogged tables for ephemeral data
  • Writable Common Table Expressions
  • K-nearest-neighbor added to GIST indexes
  • Se-linux integration with the SECURITY LEVEL command
  • Update the PL/Python server-side language
  • To come: PGXN Client for install extensions easily from the command line. More information: http://pgxn.org/faq/  The source will be on https://github.com/pgxn/pgxn-client



Some of these features could be considered minor, but many think they are very cool while using 9.1  in their environments.

Considerations before migrating


If you are an old Pg user, you may already know the migration risks listed on the next page. Still, I advise that you note and carefully learn about these risks. Many users freeze their developments to older versions simply because they didn’t know how to solve new issues. The most notable case is when 8.3 stopped using implicit casts for some datatypes and many queries didn’t work as a result.  



There are some important changes that could affect your queries, so take a pen and note:

  • The default value of standard_conforming_strings is now turned on by default. That means that backslashes are normal characters (which is the SQL standard behavior). So, if you have backslashes in your SQL code, you must add E’’ strings. For example: E’Don’t’
  • Function-style and attribute-style data type casts were disallowed for composite types. If you have code like value_composite.text or text(value_composite), you will need to use CAST or :: operator.
  • Whereas before the checks were skipped, domains are now based on arrays when they are updated, which results in a rechecking of the constraints.
  • String_to_array function returns now an empty array for a zero-length string (before it returned NULL). The same function splits into characters if you use the NULL separator.
  • The inclusion of the INSTEAD OF action for triggers will require you to recheck the logic of your triggers.
  • If you are an actual 9.0 replication user, you may know that in 9.1 you can control the side effects of VACUUM operations during big queries execution and replication. This is a really important improvement. Basically, if you run a big query in the slave server and the master starts a VACUUM process, the slave server can request the master postpone the cleanup of death rows that are being used by the query.


Brief description of main features


Don’t worry about the details, we’ll cover each feature in future posts.

  • Synchronous Replication
    • This feature enhances the durability of the data. Only one server can be synchronous with the master, the rest of the replicated servers will be asynchronous. If the actual synchronous server goes down, another server will become synchronous (using a list of servers in synchronous_standby_names).  Failover is not automatic, so you must use external tools to activate the standby sync server, one of the most popular is pgpool [4].
  • Foreign Data Support
    • The feature of Foreign Data Wrappers has been included since 8.4, but now it is possible to reach data from any database where a plugin exists. Included in the contribs, is a file called file_fwd, which connects CSV files to a linked table. Basically it provides an interface to connect to external data. In my opinion, this is perhaps one of the most useful features of this versions, especially if you have different data sources in your environment.
  • Serializable Snapshot Isolation
    • This new level of serialization is the strictest. Postgres now supports READ COMMITED, REPEATABLE READ (old serializable) and SERIALIZABLE. It uses predicate locking to keep the lock if the write would have an impact on the result. You will not need explicit locks to use this level, due to the automatic protection provided.
  • Unlogged tables
    • Postgres uses the WAL log to have a log of all the data changes to prevent data loss and guarantee consistency in the event of a crash, but it consumes resources and sometimes we have data that we can recover from other sources or that is ephemeral. In these cases, creation of unlogged tables allows the database to have tables without logging into the WAL, reducing the writes to disk. Otherwise, this data will not be replicated, due to the mechanism of replication used by Postgres (through WAL records shipping).
  • Writable Common Table Expressions
    • CTE was included in 8.4 version, but in this version, it was improved to allow you to use writes inside the CTE (WITH clause). This could save a lot of code in your functions.
  • K-nearest-neighbor added to GIST indexes
    • Postgres supports multiple types of indexes; one of them is GiST (Generalized Search Tree). With 9.1, we can define a ‘distance’ for datatypes and use it for with a GiST index. Right now, this feature is implemented for point, pg_trgm contrib and others btree_gist datatypes. The operator for distance is <-> . Another feature you will enjoy is that LIKE and ILIKE operators can use the tgrm index without scanning the whole table.
  • SE-Linux integration
    • Postgres is now the first database to be fully integrated with military security-grade. SECURITY LABEL applies a security label to a database object. This facility is intended to allow integration with label-based mandatory access control (MAC) systems such as SE-Linux instead of the more traditional access control - discretionary with users and groups. (DAC).



References:

[1] http://www.postgresql.org/docs/9.1/static/release-9-1.html
[2] http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.1
[3] http://www.postgresql.org/docs/9.1/static/release-9-1-1.html
[4] http://pgpool.projects.postgresql.org/


Syndicate content
Website by Digital Loom