My website provides Japanese literature, especially manga, and thanks to utf-8 the database is fluent in Japanese and many other languages. Website owners all over the world have adopted the utf8 standard. So if your database relies on utf-8 and you are planning to move away to another server or webhost you have to secure your database and make backups. If this happens MySql comes with a big surprise! MySQL 5.27+ has the tendency to convert international characters to ghibberish when you use
mysqldump -uDBUSER -pDBPASS DATABASE > database.sql
When you use this command chances are high that all international characters like German Umlaute, Traditional Chinese or Furigana are incorrectly encoded. Your blog entries and forum posts become unreadable! This tutorial shows a way to prevent double encoding and makes databse backup with MySql childsplay. You won’t lose your precious utf8 sets!
All you need is PhpMyAdmin (if not avaible, ssh will do the trick). Let’s begin with a problem explanation. Because losing data is bad, but not knowing why it happend and how to prevent it from happening again is even worse.
1. The MySQL Bug
It starts with the mysqldump command (MySQL bug report #28969) which seems to have a weird bug …
If your database tables are encoded in utf8 and one asks mysql to dump (export) the database in utf8 what occurs is double encoding!
e.g. If you enforce utf-8 output with
mysqldump -uUSER -pPASS –default-character-set=utf8 DATABASE > database.sql
the character “ë”, that is represented in utf8 with the bytes C3 AB, is double encoded into C3 83 C2 AB and ë becomes Ã!!
So your “Account manager België” becomes “Account manager België”.
Very frustrating!
2. SOLUTION
To prevent double encoding from destryong your database this workaround has proven very effective: Simply dump the utf-8 database in latin1. latin1 lets MySqlDump export the whole utf-8 database as raw, providing you with the ORIGINAL utf-8 database!. If you use PhpMyAdmin simply export the utf-8 database as latin1. If you have only ssh, type the following command:
mysqldump DATABASE -uDBUSER -pDBPASS –default-character-set=latin1 -r database.sql
The next step is to open the fake latin1 database and to manually correct the database coding by replacing this line somewhere at the beginning:
Replace this line
/*!40101 SET NAMES latin1 */;
with the correct line:
/*!40101 SET NAMES utf8 */;
I use Notepad++ to edit databases, because it can handle large databases and parses much faster than the original windows notepad. You can download this freeware here.
3. Restore the database
If you use PhpMyAdmin simply import the fixed database.sql file using the import tool. If you have ssh enter the following command:
mysql -uDBUSER -pDBPASS NEWDATABASE < database.sql
Note: database.sql is the backup file, which will be imported into NEWDATABASE. DBUSER and DBPASS belong to NEWDATABASE.
Update:
Upon further research, the real solution seems to be to add the following in the /etc/mysql/my.cnf file:
[client]
default-character-set=utf8
…
[mysqld]
default-character-set=utf8
default-collation = utf8_general_ci
…
and execute a
# /etc/init.d/mysql restart
afterwards.
Restart a mysql client instance and then check these variables:
mysql> SHOW VARIABLES LIKE ‘character_set%’;
+————————–+—————————-+
| Variable_name | Value |
+————————–+—————————-+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+————————–+—————————-+
8 rows in set (0.00 sec)
All is utf-8 now and a mysqldump command now seems to spit out proper utf-8 (without the dirty trick of specifying –default-character-set=latin1).
Before the addition of the utf-8 lines in the my.cnf file, the results of above query was:
mysql> SHOW VARIABLES LIKE ‘character_set%’;
+————————–+—————————-+
| Variable_name | Value |
+————————–+—————————-+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+————————–+—————————-+
8 rows in set (0.00 sec)
Note: Some php and ruby scripts can be configurated to encode in utf8. If you see this option you should definitely activate this line:
encoding:utf8
After all the hard work a man needs enjoyment:

I hope this will help you to keep your data safe!
silv





February 21st, 2009 at 3:20 AM
I want more enjoyment, that’s not enough
February 21st, 2009 at 3:23 AM
To be more specific on the mysql version, its 5.0.27+
We havent heard about this in 5.1.x versions.
So the latest 5.1.x can be considered safe, unless proven.