Beware: Default charset for mysqldump is utf8, regardless of server default charset
I ran into this issue a while ago, and was reminded of it again recently. mysqldump uses a default charset of utf8, even when the default charset of the server is set differently. Why does this matter?
The problem exists more in the fact that if you have string data that is in latin1 format, you are allowed to put in non-Latin characters. This can lead to lost data, especially when upgrading a major series (e.g. 5.0 to 5.1 or 5.1 to 5.5), because you're supposed to export and import the data.
Also, when importing a backup of an InnoDB table, if there is an error with one of the parts of the INSERT, the whole INSERT statement rolls back. I have experienced major data loss because the garbled characters cause an error when INSERTed, and it causes perfectly fine data *not* to import because they're in the same INSERT statement as the garbled data.
For example:
First, set variables such on a MySQL server (5.0 or 5.1, I haven't tested on 5.5):
mysql> show global variables like '%char%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
Then create these tables with data:
CREATE TABLE `test_utf8` (
`kwid` int(10) unsigned NOT NULL default '0',
`keyword` varchar(80) NOT NULL default ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `test_utf8` VALUES
(1,'watching'),(2,'poet'),(3,'просмотра'),(4,'Поэту');
CREATE TABLE `test_latin1` (
`kwid` int(10) unsigned NOT NULL default '0',
`keyword` varchar(80) NOT NULL default ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `test_latin1` VALUES
(1,'watching'),(2,'poet'),(3,'просмотра'),(4,'Поэту');
Now compare:
mysqldump test > test_export_utf8.sql
mysqldump --default-character-set=latin1 test > test_export_latin1.sql
Note that the test export with the default character set of utf8 has mojibake whereas the export with latin1 does not.
So be *extremely* careful when using mysqldump - whether for backups or while upgrading. You can checksum your data before and after an export/import with mysqldump to be sure that your data is the same.
Archives
- May 2013
- March 2013
- February 2013
- January 2013
- December 2012
- November 2012
- September 2012
- August 2012
- July 2012
- June 2012
- May 2012
- April 2012
- March 2012
- February 2012
- January 2012
- December 2011
- November 2011
- October 2011
- September 2011
- August 2011
- July 2011
- June 2011
- May 2011
- April 2011
- March 2011
- February 2011
- January 2011
- December 2010
- November 2010
- October 2010
- September 2010
- November 2009
- March 2008
- November 2007
- October 2007


Comments
This is not limited to mysqldump. Even mysql client uses different defaults for version 5.0 (Latin1) and 5.5 (UTF-8) It will also happen if the client uses different collaction than the data. For e.g. Navicat and MySQL Query Browser uses different collation to connect to the DB than the normal mysql> prompt.If you are using "concat" function in update query you are almost certainly going to have the issue like the following at some point.
It is true that lots of applications exist that don't care about the charset used for storage on the server. They will decode and encode the byte-stream client side. Historically I think that happened before MySQL 4.1 where unicode for storage was not an option. So app developers needed a workaround in order to create multilangual applications. Maybe also PHP is to blame to some extent (PHP never was particularly good at unicode!).But it is a real mess. The data may work with a specific application and fail with others such as GUI clients used for maintenance and backup tools. I'd recommend anyone hit by this to re-encode the data and do necessary changes in the application.That said: data should back up and restore of course. Does not 'mysqldump' have an option do define the encoding of the backup file?Peter
I agree with anonymous. But that does not change the fact that we often get customers who have data that doesn't belong in latin1, but still are able to run their business with that data. It happens for at least 80% of our clients. Identifying this early on is crucial to helping them plan for internationalizing their code and data, and safely guiding them to a place where utf8 is across their platforms.
If utf8 is a problem, this means there is bad data in a table anyway - here storing utf8 bytes in a latin1 column. That's doomed to bugginess and failure regardless of mysqldump.
Shlomi,
I absolutely agree that utf8 is good, but setting the default when a different one already exists is bad.
Plus, with folks upgrading (from 5.0 to 5.1 or 5.1 to 5.5) I figured they might want to know.
Good writing. On a light mode, I would rename title to "Good news! default charset for mysqldump is utf8!".I'm waiting for the time the default charset in mysqld turns into utf8 as well (or has it, and I'm not aware?). Really, how longer shall we remain with latin1/swedish_ci?I think utf8 is a great default, and all mysql binaries should work their way towards it.Of course, I agree with you on your points; it's painful when dump's charset does not match dataset's. I've evidenced the trouble when a customer tried to restore their data once.
Reply