PHP+MySQL and Unicode/UTF-8

Most widespread languages support Unicode and/or UTF-8 in at least some form; languages created after 2000 quite commonly support Unicode, but there are always a few things to look out for.

PHP has “native” support for UTF-8; the quotes are there because PHP’s handling of UTF-8 is not strictly standard-compliant.

First, you need to enable a proper locale; the default is “C”.
For the complete list of locales available on your system, try `locale -a`.

Simply run

setlocale(LC_ALL, "en_US.utf8", "en.utf8", "en_US.utf-8", "en.utf-8");

before anything else; this tries to set one of the listed locales, in the order they are listed. setlocale() returns the name of the locale it set, or false if none of them could be set.
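Since setlocale() can fail silently, it may be worth checking its return value. The following is a minimal sketch; the helper name pick_utf8_locale() is made up for this illustration, and it relies on setlocale() also accepting the candidates as an array.

```php
<?php
// Hypothetical helper: try each candidate in order. setlocale() returns the
// name of the first locale it managed to set, or false if none worked.
function pick_utf8_locale(array $candidates)
{
    $chosen = setlocale(LC_ALL, $candidates);
    if ($chosen === false) {
        // Nothing matched; `locale -a` shows what is installed on this system.
        error_log('No UTF-8 locale from the candidate list is available');
    }
    return $chosen;
}

$locale = pick_utf8_locale(["en_US.utf8", "en.utf8", "en_US.utf-8", "en.utf-8"]);
```

If $locale is false, the process keeps whatever locale it had before, which is usually “C”.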

Note that many functions are not UTF-8-aware, e.g. strlen(), strpos(), substr(), etc.; they count bytes, not characters. You are free to use these functions as long as you know exactly what you’re doing: for example, if you embed one string in another and remember its byte offset and its strlen(), you can retrieve it unharmed with substr(). But because substr() knows nothing about UTF-8 characters, you cannot safely cut a string into substrings at arbitrary positions, as you might cut in the middle of a multi-byte character.
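To make the byte-versus-character difference concrete, here is a small sketch. It assumes the mbstring extension is installed (the article itself does not mention it); the mb_* functions are its character-aware counterparts to strlen() and substr().

```php
<?php
// "naïve" written with explicit bytes: the ï is the two-byte sequence C3 AF.
$word = "na\xC3\xAFve";

$bytes = strlen($word);              // counts bytes
$chars = mb_strlen($word, 'UTF-8');  // counts characters

$byteCut = substr($word, 0, 3);              // stops in the middle of ï
$charCut = mb_substr($word, 0, 3, 'UTF-8');  // the first three characters

// Check whether each result is still well-formed UTF-8.
$byteCutValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutValid = mb_check_encoding($charCut, 'UTF-8');
```

Here $bytes is 6 while $chars is 5, and only $charCut is still valid UTF-8.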

To talk to MySQL in UTF-8, you have to set three connection settings right after connecting:

$dbconn = mysql_connect($dbhost, $dbuser, $dbpass);
mysql_query("set names utf8;", $dbconn);
mysql_query("set character set utf8;", $dbconn);
mysql_query("set collation_connection = 'utf8_general_ci';", $dbconn);
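The mysql_* functions used above belong to the original mysql extension, which was removed in PHP 7.0. On newer PHP versions, roughly the same effect can be had with PDO, which is a swapped-in technique and not what the snippet above uses. This is only a sketch: the helper names utf8_dsn() and connect_utf8() are invented here, and it assumes the pdo_mysql extension and a reachable server.

```php
<?php
// Hypothetical helper: build a PDO DSN that asks the MySQL driver to use
// the given character set for the connection.
function utf8_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical helper: open the connection and set the collation explicitly,
// mirroring the third statement of the original snippet.
function connect_utf8(string $host, string $dbname, string $user, string $pass): PDO
{
    $pdo = new PDO(utf8_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on SQL errors
    ]);
    $pdo->exec("SET collation_connection = 'utf8_general_ci'");
    return $pdo;
}
```

A call would look like connect_utf8($dbhost, 'my_database', $dbuser, $dbpass), reusing the variables from the snippet above.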

And of course, all of your MySQL tables, as well as the database itself, should use the correct character set and collation:

create database if not exists `my_database`
default character set = 'utf8' default collate = 'utf8_general_ci';

use `my_database`;

create table if not exists `my_table` (
    `id` int unsigned primary key auto_increment,
    `column` varchar(128) not null,
    ... other definitions ...
) engine = InnoDB default character set = 'utf8' default collate = 'utf8_general_ci';

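Note that with the UTF-8 character set, char and varchar columns can take up to 3 times as much storage as their ASCII counterparts, because MySQL’s utf8 uses up to 3 bytes per character. Also note that UTF-8 itself allows longer sequences (4 bytes per character in the current standard, up to 6 in the original definition), and PHP will pass them through untouched; MySQL’s utf8 cannot store them, while utf8mb4, available since MySQL 5.5.3, can. What exactly happens when you hit this mismatch is beyond the scope of this article.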
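One way to avoid being surprised by this mismatch is to check a string for characters that need 4 bytes in UTF-8 (code points above U+FFFF) before sending it to a 3-byte utf8 column. The function name below is made up for this sketch; it relies only on preg_match() with the /u modifier.

```php
<?php
// Hypothetical helper: true if $str contains a code point above U+FFFF,
// i.e. a character that takes 4 bytes in UTF-8.
// Also returns false when $str is not valid UTF-8, because preg_match() fails.
function has_four_byte_chars(string $str): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $str) === 1;
}

$plain = has_four_byte_chars("caf\xC3\xA9");       // "café": 2-byte é only
$emoji = has_four_byte_chars("\xF0\x9F\x98\x80");  // U+1F600, a 4-byte sequence
```

Here $plain is false and $emoji is true.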

This should get you going; there are many more small things to keep an eye on, such as:

  • When outputting UTF-8 strings to HTML, you need to specify the character set explicitly for htmlentities() (before PHP 5.4 it defaulted to ISO-8859-1):
    $escstr = htmlentities($str, ENT_COMPAT, "utf-8");
  • Your HTML should contain a proper “Content-Type” declaration inside the head tag:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
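To illustrate why the character set argument matters, the sketch below escapes the same UTF-8 string three ways. The sample string is invented for the example; htmlspecialchars() is shown only for comparison, as it escapes the HTML-significant characters and leaves everything else alone.

```php
<?php
$str = "caf\xC3\xA9 <b>";  // "café <b>" encoded as UTF-8

// Correct character set: é becomes one entity.
$entities = htmlentities($str, ENT_COMPAT, 'UTF-8');

// Wrong character set: each byte of é is treated as a separate character.
$mangled = htmlentities($str, ENT_COMPAT, 'ISO-8859-1');

// Escapes only the HTML-significant characters and keeps é as-is.
$special = htmlspecialchars($str, ENT_COMPAT, 'UTF-8');
```

With the wrong character set, the two bytes of é come out as two unrelated entities, which is the typical “Ã©” garbage seen on mis-encoded pages.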

