Make Database Backups The Right Way! (for UTF-8)

February 20, 2009

Software, Thoughts

matrix_neo

My website provides Japanese literature, especially manga, and thanks to utf-8 the database is fluent in Japanese and many other languages. Website owners all over the world have adopted the utf8 standard. So if your database relies on utf-8 and you are planning to move away to another server or webhost you have to secure your database and make backups. If this happens MySql comes with a big surprise! MySQL 5.27+ has the tendency to convert international characters to ghibberish when you use

mysqldump -uDBUSER -pDBPASS DATABASE > database.sql

When you use this command chances are high that all international characters like German Umlaute, Traditional Chinese or Furigana are incorrectly encoded. Your blog entries and forum posts become unreadable! This tutorial shows a way to prevent double encoding and makes databse backup with MySql childsplay. You won’t lose your precious utf8 sets!

All you need is PhpMyAdmin (if not avaible, ssh will do the trick). Let’s begin with a problem explanation. Because losing data is bad, but not knowing why it happend and how to prevent it from happening again is even worse.

1. The MySQL Bug

It starts with the mysqldump command (MySQL bug report #28969) which seems to have a weird bug …

If your database tables are encoded in utf8 and one asks mysql to dump (export) the database in utf8 what occurs is double encoding!

e.g. If you enforce utf-8 output with

mysqldump -uUSER -pPASS –default-character-set=utf8 DATABASE > database.sql

the character “ë”, that is represented in utf8 with the bytes C3 AB, is double encoded into C3 83 C2 AB and ë becomes Ã!!

So your “Account manager België” becomes “Account manager België”.

Very frustrating!

2. SOLUTION

To prevent double encoding from destryong your database this workaround has proven very effective: Simply dump the utf-8 database in latin1. latin1 lets MySqlDump export the whole utf-8 database as raw, providing you with the ORIGINAL utf-8 database!. If you use PhpMyAdmin simply export the utf-8 database as latin1. If you have only ssh, type the following command:

mysqldump DATABASE -uDBUSER -pDBPASS –default-character-set=latin1 -r database.sql

The next step is to open the fake latin1 database and to manually correct the database coding by replacing this line somewhere at the beginning:

Replace this line

/*!40101 SET NAMES latin1 */;

with the correct line:

/*!40101 SET NAMES utf8 */;

I use Notepad++ to edit databases, because it can handle large databases and parses much faster than the original windows notepad. You can download this freeware here.

3. Restore the database

If you use PhpMyAdmin simply import the fixed database.sql file using the import tool. If you have ssh enter the following command:

mysql -uDBUSER -pDBPASS NEWDATABASE < database.sql

Note: database.sql is the backup file, which will be imported into NEWDATABASE. DBUSER and DBPASS belong to NEWDATABASE.

Update:

Upon further research, the real solution seems to be to add the following in the /etc/mysql/my.cnf file:

[client]
default-character-set=utf8

[mysqld]
default-character-set=utf8
default-collation = utf8_general_ci

and execute a

# /etc/init.d/mysql restart

afterwards.

Restart a mysql client instance and then check these variables:

mysql> SHOW VARIABLES LIKE ‘character_set%’;
+————————–+—————————-+
| Variable_name | Value |
+————————–+—————————-+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+————————–+—————————-+
8 rows in set (0.00 sec)

All is utf-8 now and a mysqldump command now seems to spit out proper utf-8 (without the dirty trick of specifying –default-character-set=latin1).

Before the addition of the utf-8 lines in the my.cnf file, the results of above query was:

mysql> SHOW VARIABLES LIKE ‘character_set%’;
+————————–+—————————-+
| Variable_name | Value |
+————————–+—————————-+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+————————–+—————————-+
8 rows in set (0.00 sec)

Note: Some php and ruby scripts can be configurated to encode in utf8. If you see this option you should definitely activate this line:

encoding:utf8

After all the hard work a man needs enjoyment:

I hope this will help you to keep your data safe!

silv

No Responses to “Make Database Backups The Right Way! (for UTF-8)”

  1. infyquest Says:

    I want more enjoyment, that’s not enough

    Reply

  2. infyquest Says:

    To be more specific on the mysql version, its 5.0.27+
    We havent heard about this in 5.1.x versions.
    So the latest 5.1.x can be considered safe, unless proven.

    Reply

Leave a Reply

What is 11 + 16 ?
Please leave these two fields as-is:
Are you human?