Wrestling with UTF-8

Unicode logoI’m working on a project for an international customer base, initially supporting the Spanish and English languages. Having worked on international projects before, I knew that I’d have to make some accommodations, but I was still, in the 21st century, surprised at how un-automatic the process still was to make it all work. The surprises I’m seeing are now less frequent, but I no longer trust that I won’t find another around the next corner.

The project

I’m developing a small patient automation system (the piClinic) for use in limited resource clinics in developing countries. While there is no shortage of Electronic Health Record (EHR) systems, they tend to work best in well-funded and well-supported clinics and hospitals. For everyone else (which is a rather large population) there are virtually no suitable systems, especially for small clinics in countries that do not (yet) need to support the comprehensive (and complex) data collection and reporting requirements for health information in the U.S.

The piClinic system is designed to fill the gap between zero automation and complete EHR systems until that gap can be closed or the clinic grows out of it and becomes able to install a more full-featured system. Given that much of the developing world speaks a language other than English, internationalization is something that needs to be built in from the start and not just bolted on as an afterthought.

To understand this saga, it will help to know that the app is written in PHP with data stored in MySQL. Why I made those choices is the topic for another post.

Excel: Close, but not quite

The UI uses only strings that are defined by variables that are selected based on the user’s language choice. As a web-based app, the default choice is the browser’s language preference. In the code, I have strings defined for each supported language—English and Spanish, for now, but the list is designed to grow.

To manage the list of strings, I created a spreadsheet that consists of columns for:

  • Source code file (in which the strings are used)
  • Symbolic variable name used by the program
  • English text
  • Spanish text.

Each row represents one text object in the UI.

With this arrangement I can just add columns as I add more languages. At some point this might become unwieldy, but for the foreseeable future, it looks like it’ll work.

To create the string files for the code, I have a Python script that reads the .CSV file and writes the various files—one per UI page.

While it was easy to enter the UTF-8 characters that the Spanish text used (after selecting the appropriate keyboard layout), I found that Excel would only begrudgingly output those strings in the .csv as UTF-8. It took a lot of work (menu and dialog option selections) to make that happen each time I made a change and it was not at all automatic.

I eventually gave up and just edited the UTF-8 .csv file in Notepad++ which supports reading and saving in UTF-8 encoding. If I need to do spreadsheet manipulations (e.g. to find duplicates), I can always load it into Excel and wrestle with it after I’m done, but that isn’t part of the routine.

Python and the BOM

Python also needs some encouragement to work with UTF-8 characters with regard to reading and writing them to files. It was here where I learned about BOM (Byte Order Mark) characters. In short, text files can start with a special, invisible character, the BOM character, to tell programs how its text has been encoded. Sometimes you need it, sometimes you don’t. If you’re writing a PHP file, it turns out that you don’t otherwise that character shows up in the HTML produced by the file (as it appears before the <?php script tag). It’s not a problem for the browser, but, well, it’s a problem to me. I want my generated HTML to validate, and that’s an error.

It turns out that the errant characters were the result of the Python script I wrote to create the localized string files. In the Python script, reading and writing the UTF-8 .csv files was fixed by replacing the generic file.open call with codecs.open.

Like so many magic tricks, it’s easy, once you know the magic words.

Getting Python to do that with its own files (i.e. the Python scripts and includes) requires yet more magic. To populate the database tables, I wrote some scripts that included some Python files with UTF-8-encoded strings as  potential data values. To get Python to recognize that these files are UTF-8 encoded, https://www.python.org/dev/peps/pep-0263/ suggests using this comment at the beginning of each Python source file.

# coding=utf-8

And, that seemed to do the trick.

MySQL

Not every step was mysterious. Defining the tables in MySQL was straightforward in that I just defined all the text fields in the database tables to be UTF-8 encoded. So far, that has seemed to work.

I was getting some spurious results in PHPMyAdmin, but I think that was an artifact of the scripts I used to generate the data (i.e. those I used before adding the UTF-8 encoding comment). Now that I’ve cleaned all that up and reloaded them all from scratch, the spurious results seem to have disappeared.

HTML

The HTML generated by the PHP scripts, seemed to do the right thing, more often than not, but just to be sure, I added a Content-Type header to each UI script.

header(‘Content-type: text/html; charset=utf-8’);

Net result

Since these accommodations, the app is fully bilingual and the UI can be changed on-the-fly by overriding the browser language options with a query parameter in the URL. Data is now being displayed as it should be and future development should, hopefully, just be a matter of continuing on and fleshing out the functionality and UI design.

Thanks

I can’t leave without giving a shout out to Mercer University’s Research that Reaches Out program who have funded the hardware development for this project and the Mercer on Mission program who will be funding part of the field test next summer.

 

Leave a Reply