In three decades of programming, I’ve never doubted ASCII. When faced with a mysterious bug, faithful single-byte ASCII has never betrayed my trust in its character. But variable-byte Unicode, well, that is a different story.

When localizing software, strange and twisted things happen mysteriously. After the investigation, the actual cause is never exactly Unicode itself. No, it is too internationally respectable for that. Yet Unicode is frequently standing nearby in the dark shadows, lurking time and again. What is this treachery?

The Treachery of Reliable Search

Consider using Regular Expressions to search a data set of artists for “René” to find the René Magritte.

The last character in René is an e-acute. Admittedly, changing the “é” character to a plain “e” may be tempting (Rene), but this approach will not do much for the tens of thousands of non-ASCII characters needed for localization. There are very good reasons why Unicode exists as an international standard. And it is somewhat disrespectful to René to intentionally misspell his name.

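One Unicode representation of e-acute is the single precomposed code point U+00E9. But there is also a two-code-point representation: a plain “e” (U+0065) followed by the combining acute accent (U+0301). The two forms look identical on screen, and the difference has nothing to do with whether the text is stored as UTF-8, UTF-16, or UTF-32. This means that RegEx can fail to match “René” with “René” if the two strings use these different forms of e-acute.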
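To make the mismatch concrete, here is a minimal Python sketch. It assumes the standard `re` and `unicodedata` modules. The variable names and the `nfc` helper are mine, for illustration only. It shows a search missing the decomposed spelling, then succeeding once both sides are normalized to the same form (NFC, the composed form).

```python
import re
import unicodedata

precomposed = "Ren\u00e9"   # "é" stored as the single code point U+00E9
decomposed = "Rene\u0301"   # "e" followed by the combining acute accent U+0301

def nfc(text: str) -> str:
    """Normalize to NFC so that equivalent spellings become identical."""
    return unicodedata.normalize("NFC", text)

artist = decomposed + " Magritte"

# Naive search: the pattern uses U+00E9, the data uses "e" + U+0301.
naive_match = re.search(re.escape(precomposed), artist)

# Normalize both the pattern and the data before searching.
normalized_match = re.search(re.escape(nfc(precomposed)), nfc(artist))

print(precomposed == decomposed)      # False: same glyphs, different code points
print(naive_match is not None)        # False: the search silently misses
print(normalized_match is not None)   # True: both sides now use U+00E9
```

Note that the normalization is a step the caller has to remember for both the pattern and the data. The regex engine does not do it on its own.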

If the dataset has HTML encodings, it gets even more fun. The e-acute could be the named entity “&eacute;”, or the hex character reference for its code point (“&#xE9;”), or the decimal character reference (“&#233;”). Add to this the capitalized e-acute: É. Counting the two raw forms and the three HTML forms, in both lower and upper case, that gives ten variations of the same character to match.
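As a rough illustration of how the ten variations can be collapsed before searching, the sketch below decodes HTML references, normalizes, and folds case. The `canonical` helper and the `variants` list are my own names, and the exact entity spellings in the list are my assumption about what such a dataset might contain.

```python
import html
import unicodedata

def canonical(text: str) -> str:
    """Decode HTML character references, compose to NFC, then fold case."""
    decoded = html.unescape(text)
    return unicodedata.normalize("NFC", decoded).casefold()

variants = [
    "\u00e9", "e\u0301", "&eacute;", "&#xE9;", "&#233;",   # lower case
    "\u00c9", "E\u0301", "&Eacute;", "&#xC9;", "&#201;",   # upper case
]

forms = {canonical(v) for v in variants}
print(len(forms))   # 1: every variant collapses to U+00E9
```

Case folding is only appropriate when the search is meant to be case-insensitive. Drop the `.casefold()` call if É and é should stay distinct.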

One would like to think that foundational tools, such as RegEx, automatically handle this or there is a simple option to turn it on and the problem goes away. Unfortunately, that is not the case, at least not yet.

So beware of normally reliable search utilities to find Unicode characters.

The Treachery of Reliable Tools

Microsoft Excel is a great tool for localization. Put a key in the first column, the phrase to be translated in the second column, and give the third column to the translator for the target language phrase. Excel supports Unicode, so no problem, right? Well…not exactly.

To demonstrate the first problem with Excel and its support for Unicode, export a CSV (Comma Separated Values) file of Asian Unicode text from Excel and then open it back up in Excel. The characters come back as little boxes, the placeholder normally shown when your device doesn’t have a font to display the text. These boxes are known as tofu*.
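I have not verified the cause for the Excel version described here. A commonly reported culprit with CSV round trips is an encoding mismatch: the file carries no marker saying it is UTF-8, so the reader guesses a legacy code page. The sketch below works on that assumption. It writes the CSV from Python with a UTF-8 byte order mark (the `utf-8-sig` codec) and reads it back. The file name and sample rows are hypothetical.

```python
import csv
import os
import tempfile

rows = [
    ["key", "source", "target"],
    ["greeting", "Hello", "안녕하세요"],   # Korean
    ["farewell", "Goodbye", "再见"],       # Chinese
]

path = os.path.join(tempfile.mkdtemp(), "phrases.csv")

# "utf-8-sig" writes the three-byte BOM (EF BB BF) before the data.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

# Confirm the BOM is really at the start of the file.
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with the same codec strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as f:
    round_tripped = list(csv.reader(f))

print(starts_with_bom)
print(round_tripped == rows)
```

Whether a given spreadsheet program honors the BOM is something to test against the version you actually ship with.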

Google appears to be eating Microsoft’s lunch (or at least its tofu) when it comes to successfully handling Unicode. A workaround is to copy the Excel worksheet and paste it into a Google Docs sheet, then export it from there to CSV. No tofu when imported into Unity3D and NGUI.

The second problem I encountered was with Chinese Unicode in Excel and quotes. The translator’s pipeline added a strange quote at the beginning and end of each phrase, invisible in Excel. When pasted into Google Translate, the quote is visible (another i18n point for Google).

Neither Excel nor Word could find the character in search (perhaps related to the search problem described above). When the translated phrases were imported into Unity3D and NGUI, all translated strings were broken.
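When a tool hides a character, one way to expose it is to dump each code point with its Unicode name and category. The sketch below does that. I never identified the exact character the translator’s pipeline added, so the sample uses a zero width space (U+200B) purely as a stand-in. The `describe` and `strip_hidden` helpers are hypothetical names.

```python
import unicodedata

HIDDEN_CATEGORIES = {"Cc", "Cf"}   # control and format characters

def describe(text: str):
    """Return (code point, category, name) for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

def strip_hidden(text: str) -> str:
    """Drop control/format characters. Too blunt for scripts that rely on
    joiners such as U+200D, so review the report before using this."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in HIDDEN_CATEGORIES
    )

sample = "\u200b你好\u200b"   # stand-in: phrase wrapped in invisible characters

report = describe(sample)
hidden = [code for code, category, _ in report if category in HIDDEN_CATEGORIES]

for entry in report:
    print(entry)
print(hidden)
print(strip_hidden(sample) == "你好")
```

Running a report like this over the translator’s column before import would at least show which code points are along for the ride.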

So beware of normally reliable tools to translate Unicode.

* Note: Excel 2011 displays Korean in tofu on my Mac. These issues may have been fixed in a later version of Excel. But Unicode has been around since 1991. Just saying.