Most widespread languages support Unicode and/or UTF-8 in at least some form; languages designed after 2000 support Unicode quite commonly, but there are always a few things to watch out for.
PHP has “native” support for UTF-8; the quotes are there because PHP’s UTF-8 handling is not strictly standards-compliant.
First, you need to set a proper locale; the default is “C”.
For the complete list of locales available on your system, run `locale -a`.
Simply call
setlocale(LC_ALL, "en_US.utf8", "en.utf8", "en_US.utf-8", "en.utf-8");
before anything else. This tries each of the listed locales in the order given; setlocale() returns the name of the locale it set, or false if none of them could be set.
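Since setlocale() can fail quietly, it may be worth checking its return value. The following is only a sketch: the helper name set_utf8_locale() is made up for illustration, and logging the failure is an assumption about what you would want to do in that case.

```php
<?php
// Hypothetical helper: try each candidate locale in order and report the result.
function set_utf8_locale(array $candidates)
{
    // setlocale() also accepts an array of locale names; it returns the
    // first one it managed to set, or false if none is available.
    $chosen = setlocale(LC_ALL, $candidates);
    if ($chosen === false) {
        // Assumption: carrying on with the current locale is acceptable,
        // so we only log the problem instead of aborting.
        error_log('No UTF-8 locale available; keeping the current locale');
    }
    return $chosen;
}

$locale = set_utf8_locale(array('en_US.utf8', 'en.utf8', 'en_US.utf-8', 'en.utf-8'));
```

Which of the names succeeds depends on what `locale -a` lists on the machine, so $locale will differ between systems.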
Note that many string functions are not UTF-8-aware, e.g. strlen(), strpos() and substr(): they work on bytes, not characters. You are free to use these functions as long as you know exactly what you’re doing; for example, if you embed one string in another and remember its byte offset and its strlen(), you can retrieve it unharmed with substr(). But because substr() knows nothing about UTF-8 characters, you cannot safely cut a string at an arbitrary position, as the cut might land inside a multi-byte character.
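A small illustration of both cases, the safe one and the unsafe one. It assumes the mbstring extension is loaded (the mb_* functions come from it and are not mentioned above); the test word is written with explicit byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "naïve" with explicit bytes: the ï is the two-byte sequence C3 AF.
$word = "na\xC3\xAFve";

$bytes = strlen($word);              // 6: strlen() counts bytes
$chars = mb_strlen($word, 'UTF-8');  // 5: mb_strlen() counts characters

// Safe: embed the string, remember its byte offset and byte length,
// then cut exactly those bytes back out.
$prefix   = 'id=';
$combined = $prefix . $word . ';rest';
$restored = substr($combined, strlen($prefix), strlen($word));

// Unsafe: a byte-based cut after 3 bytes splits the ï in half.
$broken = substr($word, 0, 3);                  // "na\xC3"
$valid  = mb_check_encoding($broken, 'UTF-8');  // false

// A character-based cut keeps the ï intact.
$first3 = mb_substr($word, 0, 3, 'UTF-8');      // "na\xC3\xAF"
```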
To talk to MySQL in the proper character set, you have to set three MySQL settings on the connection:
$dbconn = mysql_connect($dbhost, $dbuser, $dbpass);
mysql_query("set names utf8;", $dbconn);
mysql_query("set character set utf8;", $dbconn);
mysql_query("set collation_connection = 'utf8_general_ci';", $dbconn);
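The mysql_* functions used above were removed in PHP 7. On newer versions, roughly the same setup can be sketched with the mysqli extension. The helper name connect_utf8() and its parameters are placeholders, and the error handling is an assumption: recent PHP versions make mysqli throw exceptions by default, in which case the checks below are never reached.

```php
<?php
// Hypothetical helper; host, user, password and database name are placeholders.
function connect_utf8($dbhost, $dbuser, $dbpass, $dbname)
{
    $db = new mysqli($dbhost, $dbuser, $dbpass, $dbname);
    if ($db->connect_errno) {
        return false;
    }
    // set_charset() switches the connection character set and also informs
    // the client library, which matters for real_escape_string().
    if (!$db->set_charset('utf8')) {
        $db->close();
        return false;
    }
    return $db;
}
```

Using set_charset() rather than a raw "set names" query is a design choice here: the server-side effect is similar, but the client library only learns about the character set through the API call.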
And of course, all of your MySQL tables, as well as the database itself, should use the correct character set and collation:
create database if not exists `my_database`
default character set = 'utf8' default collate = 'utf8_general_ci';
use `my_database`;
create table if not exists `my_table`
(
`id` int unsigned primary key auto_increment,
`column` varchar(128) not null,
... other definitions ...
) engine = InnoDB default character set = 'utf8' default collate = 'utf8_general_ci';
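Note that using a UTF-8 character set for a table lets char and varchar columns take up to 3 times the space of their single-byte (e.g. ASCII) counterparts, because MySQL’s utf8 uses up to 3 bytes per character; a varchar(128) can therefore need up to 384 bytes. Also note that PHP strings are plain byte strings and can carry longer UTF-8 sequences than that (4 bytes under the current standard, up to 6 under the original UTF-8 definition); what happens when you hit this mismatch is beyond the scope of this article.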
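If you want to know in advance whether a string would run into the 3-byte limit, one possible check is to look for characters above U+FFFF, since those are the ones that need 4 bytes in UTF-8. This is a sketch only; needs_four_bytes() is a made-up name, and it assumes the input is valid UTF-8 (for invalid input preg_match() returns false and the helper reports false).

```php
<?php
// Hypothetical helper: true if $str contains a character outside the
// Basic Multilingual Plane, i.e. one that takes 4 bytes in UTF-8.
function needs_four_bytes($str)
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns 1 on a match and 0 on no match.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $str) === 1;
}

$plain = needs_four_bytes("na\xC3\xAFve");      // false: ï is U+00EF, 2 bytes
$emoji = needs_four_bytes("\xF0\x9F\x98\x80");  // true: U+1F600, 4 bytes

// Worst-case storage for the varchar(128) column from the table above.
$maxBytes = 128 * 3;                            // 384
```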
This should get you going; there are many more small things to keep an eye on, such as:
- When outputting UTF-8 strings to HTML, you need to specify the character set explicitly for htmlentities():
$escstr = htmlentities($str, ENT_COMPAT, "utf-8");
- Your HTML should contain a proper “Content-Type” declaration in the head element:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
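To see why the character set argument of htmlentities() matters, compare the output for the same UTF-8 input with the right and the wrong setting. The wrapper name escape_utf8() is hypothetical; ISO-8859-1 is used as the “wrong” value because it was the default before PHP 5.4.

```php
<?php
// Hypothetical wrapper so the character set is never forgotten.
function escape_utf8($str)
{
    return htmlentities($str, ENT_COMPAT, 'UTF-8');
}

// "naïve" in bold; the ï is written as its two UTF-8 bytes C3 AF.
$input = "<b>na\xC3\xAFve</b>";

// Correct: the two bytes are recognised as one character.
$good = escape_utf8($input);  // &lt;b&gt;na&iuml;ve&lt;/b&gt;

// Wrong character set: each byte is escaped as a separate Latin-1 character.
$bad = htmlentities($input, ENT_COMPAT, 'ISO-8859-1');  // ...na&Atilde;&macr;ve...
```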