Beware: Default charset for mysqldump is utf8, regardless of server default charset

 

I ran into this issue a while ago, and was reminded of it again recently.  mysqldump uses a default charset of utf8, even when the default charset of the server is set differently.  Why does this matter?

The problem exists more in the fact that if you have string data that is in latin1 format, you are allowed to put in non-Latin characters. This can lead to lost data, especially when upgrading a major series (e.g. 5.0 to 5.1 or 5.1 to 5.5), because you're supposed to export and import the data.

Also, when importing a backup of an InnoDB table, if there is an error with one of the parts of the INSERT, the whole INSERT statement rolls back.  I have experienced major data loss because the garbled characters cause an error when INSERTed, and it causes perfectly fine data *not* to import because they're in the same INSERT statement as the garbled data.

For example:

First, set variables such on a MySQL server (5.0 or 5.1, I haven't tested on 5.5):

mysql> show global variables like '%char%';

+--------------------------+----------------------------+

| Variable_name            | Value                      |

+--------------------------+----------------------------+

| character_set_client     | latin1                     | 

| character_set_connection | latin1                     | 

| character_set_database   | latin1                     | 

| character_set_filesystem | binary                     | 

| character_set_results    | latin1                     | 

| character_set_server     | latin1                     | 

| character_set_system     | utf8                       | 

| character_sets_dir       | /usr/share/mysql/charsets/ | 

+--------------------------+----------------------------+

8 rows in set (0.00 sec)

 

Then create these tables with data:

 

CREATE TABLE `test_utf8` (

  `kwid` int(10) unsigned NOT NULL default '0',

  `keyword` varchar(80) NOT NULL default ''

) ENGINE=InnoDB DEFAULT CHARSET=utf8;

 

INSERT INTO `test_utf8` VALUES

(1,'watching'),(2,'poet'),(3,'просмотра'),(4,'Поэту');

 

CREATE TABLE `test_latin1` (

  `kwid` int(10) unsigned NOT NULL default '0',

  `keyword` varchar(80) NOT NULL default ''

) ENGINE=InnoDB DEFAULT CHARSET=latin1;

 

INSERT INTO `test_latin1` VALUES

(1,'watching'),(2,'poet'),(3,'просмотра'),(4,'Поэту');

 

Now compare:

mysqldump test > test_export_utf8.sql

mysqldump --default-character-set=latin1 test > test_export_latin1.sql

 

Note that the test export with the default character set of utf8 has mojibake whereas the export with latin1 does not.

 

So be *extremely* careful when using mysqldump - whether for backups or while upgrading.  You can checksum your data before and after  an export/import with mysqldump to be sure that your data is the same.

 

Comments

This is not limited to mysqldump. Even mysql client uses different defaults for version 5.0 (Latin1) and 5.5 (UTF-8) It will also happen if the client uses different collaction than the data. For e.g. Navicat and MySQL Query Browser uses different collation to connect to the DB than the normal mysql> prompt.If you are using "concat" function in update query you are almost certainly going to have the issue like the following at some point.

  • Error Number: 1270
  • Description of error :
  • Illegal mix of collations(utf8_general_ci,COERCIBLE),(latin1_swedish_ci,IMPLICIT),(utf8_general_ci,COERCIBLE) for operation 'concat'
  1. We can change the collation after connection using set command: set @@session.collation_connection='latin1_swedish_ci' ;
  2. adding a line to the [client] section of your config file is also an option: default-character-set=latin1But it will affect all the clients connecting to the DB.
  3. Or while connecting:mysql --default-character-set=latin1

Tue, 09/27/2011 - 06:56

It is true that lots of applications exist that don't care about the charset used for storage on the server.  They will decode and encode the byte-stream client side.  Historically I think that happened before MySQL 4.1 where unicode for storage was not an option.  So app developers needed a workaround in order to create multilangual applications. Maybe also PHP is to blame to some extent (PHP never was particularly good at unicode!).But it is a real mess. The data may work with a specific application and fail with others such as GUI clients used for maintenance and backup tools. I'd recommend anyone hit by this to re-encode the data and do necessary changes in the application.That said: data should back up and restore of course.  Does not 'mysqldump' have an option do define the encoding of the backup file?Peter

Fri, 09/16/2011 - 04:03

I agree with anonymous. But that does not change the fact that we often get customers who have data that doesn't belong in latin1, but still are able to run their business with that data. It happens for at least 80% of our clients. Identifying this early on is crucial to helping them plan for internationalizing their code and data, and safely guiding them to a place where utf8 is across their platforms.

Laine Campbell
Thu, 09/15/2011 - 16:36

If utf8 is a problem, this means there is bad data in a table anyway - here storing utf8 bytes in a latin1 column.   That's doomed to bugginess and failure regardless of mysqldump.

Anonymous
Thu, 09/15/2011 - 16:33

Shlomi,

I absolutely agree that utf8 is good, but setting the default when a different one already exists is bad.

Plus, with folks upgrading (from 5.0 to 5.1 or 5.1 to 5.5) I figured they might want to know.

Sheeri Cabral
Thu, 09/15/2011 - 13:10

Good writing. On a light mode, I would rename title to "Good news! default charset for mysqldump is utf8!".I'm waiting for the time the default charset in mysqld turns into utf8 as well (or has it, and I'm not aware?). Really, how longer shall we remain with latin1/swedish_ci?I think utf8 is a great default, and all mysql binaries should work their way towards it.Of course, I agree with you on your points; it's painful when dump's charset does not match dataset's. I've evidenced the trouble when a customer tried to restore their data once.

Thu, 09/15/2011 - 12:48

Reply

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.

More information about formatting options

By submitting this form, you accept the Mollom privacy policy.
Website by Digital Loom