How To Change An Early WPMU Database from latin1 to utf8 Encoding.

I’ve just completed a necessary, but unglamorous, change to Edubuzz.org’s underlying WordPress Multi-User (WPMU) software that has taken me months, although the change itself can be completed in an hour or so. This note is to hopefully help anyone in the same situation avoid some of that wasted time by providing a step-by-step guide.

What’s the back story? On upgrading from WPMU 1.2.5.a to 1.3.3, we got strange  characters appearing throughout existing blog content. That turned out to be a common problem. I guess because it’s to do with encoding of characters in the database, it’s unfamiliar territory for most people and certainly was for me. I tried a quick fix from the WPMU forum, which got rid of the symbols, only to find that I couldn’t create new blogs (also here). Eventually I found that early versions of WPMU had used latin1 encoding within the database in places where the current version expected utf8. We had started with V1.0RC4. A bit more research showed that the solution was to convert the database character set to ensure it used utf8 throughout. The trouble was, attempts to do that weren’t successful, so I ended up changing the wp-config.php file to specify latin1 encoding in the meantime, as follows:

define('DB_CHARSET', 'latin1');

That enabled the site to work properly, without the  symbols appearing, and new blogs were able to be created. I decided at that point to postpone converting the database until I understood what I was doing.

How to decide what to do? I went looking for instructions on what to do, but nothing I tried seemed to work. None were specific to the WPMU situation, although there was one link out from a forum post which I tried. There are a number of places on the web where you can find guidance, but they tended to be very short and to the point, and didn’t explain what was going on. But as well as these “quick fix” solutions (if you can ever call dumping a 1000-blog database and importing it again quick) there were two which proposed a different, much more time-consuming, approach. Both of these, one from the WordPress Codex and the other a phenomenally comprehensive piece from Derek Sivers at O’Reilly, raised the possibility that the dump/import route might not be adequate.

Where can I find out what this latin1 / utf8 stuff is all about? If you want to get your head round this, have a look at Joel Spolsky’s Unicode article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

IT’S NOT THAT HARD.

All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

Do I need to convert from latin1 to utf8 via binary? This is the approach advised at the WordPress Codex, and it’s an approach also found in MySQL documentation from the time of utf8′s introduction. But to do it that way means using ALTER TABLE commands to modify every column of every table that contains text, first to binary, and then to utf8. That might be OK with a single-user blog, but represents a huge amount of work on a 1000-blog database with thousands of tables.

What worked for me? The key point for me was the realisation – eventually – that although the database tables had the default character set of Latin1, they contained data already encoded in utf8. If you’re dealing with an installation of WPMU from the early days, chances are it will be the same as we’ve never changed that. This is a scenario described by Alex King in his Guide to Fixing a MySQL Character Encoding Mismatch. Alex’s article is good, but doesn’t go into detail of the commands to use. Further digging led to more detailed instructions from Mamash (Furled here) :

If you’ve ever used a UTF8 application on a pre-4.1 MySQL server, or never cared about encodings on a 4.1 setup even, you may have a non-UTF8 database containing UTF8 data. While this doesn’t bother most applications (e.g. PHP weblogs), it’s not clean and you can’t sort properly with any non-Western characters.

Mamash’s instructions worked perfectly.

To edit the MySQL dump file I used the VIM editor. In VIM, the command line I used was:
:%s/ENGINE=MyISAM DEFAULT CHARSET=latin1/ENGINE=MyISAM DEFAULT CHARSET=utf8/gI
where: % indicates all lines, s is substitute, g is every occurrence on line and I means “don’t ignore case”. The end switches were probably unnecessary.

Once that’s all done, the wp-config.php file has to be set to:

define('DB_CHARSET', 'utf8');

Anything else I should know? During my earlier trials, I’d found out that Rob Miller’s Now Reading WordPress Plugin, used on www.edubuzz.org, would cause problems. The symptoms were “max key length is 1000 bytes” errors on importing the modified mysql dump file to the new utf8 database.The Plugin has 4 key fields, each of 250 characters, which takes up the full maximum MySQL key length of 1000 bytes. Latin1 encoding just uses 1 byte per character, but utf8 uses up to 3. I’ve reduced each of these fields to 80 characters, from 250, and things seem to be working OK.