Advanced String Techniques in C++ - Part I: Unicode
by Fredrik Andersson (28 August 2000)
These tutorials (there'll be two of them) will discuss how to implement a few
neat string handling techniques in your applications and games. I've
intentionally stayed clear of references to game development in these tutorials
because I feel that the techniques presented herein win by being presented in a
more general context. After all, a large part of game development is the
development of the foundation driver technology, and there's nothing more
fundamental than string management, is there?
This first tutorial discusses Unicode and localization techniques. What this basically means is how to add support for character sets and languages other than English in a very simple way. The reasons for doing so should be quite obvious to everyone: At the very least, making it possible for non-English players of your games to input data in their native language (for instance, when using network chat modes or perhaps when naming a character in an RPG) is a way of showing them great respect.
A Farewell To Char
I happen to live in Sweden, and despite common belief we neither have polar
bears wandering around the towns nor have we limited our diet to meatballs. What
we do have, however, is an extended Latin alphabet. For the sole purpose of
limiting our communication with the outside world (or so it feels), we've placed
little dots and circles above certain letters in our alphabet, making them
virtually unpronounceable to anyone not living in Northern Europe. Such
characters are a pain to store and transmit electronically since the characters
basically don't exist on anything but Swedish computers.
The reason for this is quite obvious: Characters in computer systems are normally stored as ASCII codes, or a derivative thereof (such as the ANSI codes used by Windows). The problem with these encoding methods is the limited space available for characters; ASCII proper uses only seven bits per character, and even its eight-bit derivatives can define a maximum of 2^8 or 256 characters. This is quite enough for encoding the standard Latin alphabet, the digits, punctuation and a few diacritics, but not the characters of many other languages (just take a look at Chinese or Japanese, whose writing systems use thousands of characters representing complete syllables, words and phrases).
The solution is to increase the number of bits used to store each character, and to do so in a standardized way to allow painless data transfers between systems using different languages. Two such standards exist and are in use today: Multibyte Character Sets (MBCS) and Unicode.
The Mess of MBCS
To be blunt, MBCS is the inferior of the two. The character set is based on the ASCII-friendly char data type, but each character occupies either one or two chars, effectively rendering all your favorite C string functions useless. When parsing an MBCS string (yes, it has to be parsed before it's used!), you must examine the bits of every character you read to determine if the next character in the string is a part of the current character or not, and if it is, how to combine the two to form a human-readable character. While this standard certainly provides you with more than 256 characters, it requires you to use complex (and hence slow) string functions for even the most trivial tasks.
Most modern Windows compilers come with a set of string functions (all of which have the _mbs prefix) that operate on multi-byte character strings. They behave like their regular C counterparts (e.g. _mbslen() implements the same functionality as strlen()). Windows even has a few functions for parsing MBCS strings character-by-character, namely CharNext(), CharPrev() and IsDBCSLeadByte(). Look 'em up in the Win32 API reference for more info.
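To make the parsing burden concrete, here is a minimal sketch of counting the characters in a double-byte string. It assumes a Shift-JIS-style encoding in which bytes 0x81-0x9F and 0xE0-0xFC announce a two-byte character; the helper names IsLeadByte and CountDbcsChars are made up for this illustration, and on Windows you would ask IsDBCSLeadByte() instead of hard-coding the ranges.

```cpp
#include <cstddef>

// Assumption: Shift-JIS-style lead-byte ranges. Real code should ask the
// system (e.g. IsDBCSLeadByte on Windows) instead of hard-coding them.
inline bool IsLeadByte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Counts characters (not bytes) in a null-terminated double-byte string.
inline std::size_t CountDbcsChars(const char* s) {
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char c = static_cast<unsigned char>(*s);
        // A lead byte followed by a trail byte forms one character.
        if (IsLeadByte(c) && s[1] != '\0') {
            s += 2;
        } else {
            s += 1;
        }
        ++count;
    }
    return count;
}
```

Note that even this trivial task needs a branch per character, which is exactly the overhead the wide-character approach avoids.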
Now that you know what MBCS is, don't use it. If you do, you'll be sorry. Instead, read on and discover the wonders of Unicode!
Unicode was invented by Apple and Xerox in the late 80's, and is now maintained by an industry consortium responsible for assigning new character codes etc.
With Unicode as used in this tutorial, every character is encoded as a 16-bit (or 2-byte) quantity (unsigned short in C), thus making available as many as 65 536 characters, enough for the everyday characters of all significant written languages in the world today. (Later versions of the standard also define rarer characters beyond this range, which take two 16-bit units each; they are not covered here.)
C's good old string functions won't work on Unicode strings either, but luckily there's another set of C runtime library functions available for Unicode strings, prefixed with wcs (for "wide character string", not to be confused with the aforementioned mbs function set). You'll find all your old workhorses here, such as wcslen(), which implements behavior equivalent to strlen(). In addition, writing your own functions to operate on Unicode strings is nowhere near as difficult or frustrating as writing MBCS functions, since no parsing is necessary. There's no hassle with any CharNext()-like functions; traversing a string is once again as easy as blindly increasing a pointer and looking for a terminating null character (as is the case with ASCII strings).
The Unicode consortium has defined code points (a code point being the Unicode index of a specific character) for a wide variety of languages, diacritics, special symbols, dingbats, mathematical and scientific symbols etc. They've also reserved quite a bit of room for you to store any custom characters your application might use. And they've been foresighted enough to place the standard ASCII characters at code points 0-127 (with the Western European Latin-1 characters following at 128-255), making ASCII to Unicode translation and comparison a breeze.
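As a small illustration of how little changes, a hand-rolled length function for wide strings is the familiar pointer walk with char swapped for wchar_t. The name MyWcsLen is made up for this sketch; in real code you would simply call wcslen().

```cpp
#include <cstddef>
#include <cwchar>

// Walks the string one wchar_t at a time until the terminating null character.
inline std::size_t MyWcsLen(const wchar_t* s) {
    const wchar_t* p = s;
    while (*p != L'\0') {
        ++p;  // advances by sizeof(wchar_t) bytes, not by one byte
    }
    return static_cast<std::size_t>(p - s);
}
```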
Where's the Catch?
However, Unicode does have a few drawbacks (or rather design issues that you need to be aware of). First, of course, is the fact that Unicode strings occupy twice as much space as ASCII strings, since two bytes are used per character instead of just one. This same fact leads to a few other issues that need to be pointed out: You cannot treat Unicode strings as arrays of bytes, as is perfectly legal with ASCII strings. Instead, you need to treat them as arrays of characters. You must also make sure you're not performing any arithmetic operations on your strings under the assumption that characters occupy only one byte each.
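The byte-versus-character distinction bites most often when sizing buffers. The sketch below shows two idioms that keep the units straight; the constant and function names are made up for the example.

```cpp
#include <cstddef>
#include <cwchar>

const std::size_t kNameChars = 16;   // capacity in characters, not bytes

// Copies src into a fixed-size wide buffer, truncating if necessary.
inline void CopyName(wchar_t (&dst)[kNameChars], const wchar_t* src) {
    // Use the element count, NOT sizeof(dst): sizeof yields bytes, which is
    // kNameChars * sizeof(wchar_t) and would overrun the buffer.
    const std::size_t capacity = sizeof(dst) / sizeof(dst[0]);
    std::wcsncpy(dst, src, capacity - 1);
    dst[capacity - 1] = L'\0';       // wcsncpy does not always terminate
}

// Bytes needed to store a wide string including its terminator,
// e.g. for a memory allocation or a raw file write.
inline std::size_t BytesFor(const wchar_t* s) {
    return (std::wcslen(s) + 1) * sizeof(wchar_t);
}
```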
There are also portability issues to consider: Not all operating systems and compilers have support for Unicode. An operating system without Unicode support isn't that big a problem actually - just make sure you're not using Unicode strings when calling OS functions. On the other hand, your compiler and C runtime library must have explicit support for Unicode in order for you to use it, for reasons that will soon become obvious.
If you're targeting the Windows platform, note that NT and Windows 2000 have full Unicode support (in fact, all NT API functions expect Unicode strings; ASCII is supported by an internal conversion stage), but Windows 95 and 98 have only very rudimentary support for Unicode. We'll take care of that problem a little later.
Trying It At Home
So, how do we take Unicode from concept to reality? Assuming you're on a
platform and compiler that supports it, it's quite simple. So let's pretend
we're using Visual C++ on Windows 2000 for a moment, shall we?
First, you need to inform the C runtime library that you wish to use Unicode. That is done by defining the _UNICODE macro and then including tchar.h (that is, #define _UNICODE followed by #include <tchar.h>) before any other C headers.
The _UNICODE macro tells tchar.h, which is the Unicode header file shipped with the compiler, to work with the wide character type wchar_t, which Visual C++ defines as a 16-bit unsigned short.
This type should be used instead of char for Unicode strings. Since we're running on Windows, we also need to tell the Win32 API we're interested in taking advantage of Unicode. This is done by placing the line #define UNICODE before the inclusion of windows.h.
This line causes Windows to redefine a few of its internal string data types to be 16-bit quantities. It might be a good idea to stick these definitions and inclusions in a common header file included by all program modules, to avoid wreaking havoc if some module is unintentionally not using Unicode.
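A minimal sketch of such a common header might look like this. The file name is made up, and the non-Windows branch is only a stand-in, not part of tchar.h, so that the header still compiles on platforms that do not ship it.

```cpp
// unicode_config.h (hypothetical name): include before anything else.
#ifndef UNICODE_CONFIG_H
#define UNICODE_CONFIG_H

#if defined(_WIN32)
    #ifndef _UNICODE
    #define _UNICODE            // C runtime: TCHAR, _TEXT and _tcs* go wide
    #endif
    #ifndef UNICODE
    #define UNICODE             // Win32 API: functions map to their W versions
    #endif
    #include <tchar.h>
    #include <windows.h>
#else
    // Assumption for illustration only: mimic the two names used in this
    // tutorial on platforms without tchar.h.
    typedef wchar_t TCHAR;
    #define _TEXT(x) L##x
#endif

#endif // UNICODE_CONFIG_H
```

Defining both macros in one place means no module can end up with one of them set and the other not.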
Next in line is the problem of literals. Initializing a char string from an ordinary literal such as "Hello" works perfectly well with any C compiler.
But try initializing a wchar_t string from the same literal, and the compiler will tell you it's an illegal assignment, since you can't assign a char string to an array of 16-bit integers. Now wrap the literal in the _TEXT macro, as in _TEXT("Hello"), and try again.
I bet it'll work perfectly. What's that _TEXT thing and what magic is lurking beneath it? The answer is it's a macro defined in tchar.h, which boils down to #define _TEXT(x) L ## x.
For those of you not very familiar with C's macro system, all this macro does is stick a capital L in front of the string literal (the ## is the preprocessor's token-pasting operator, which just merges the L with the parameter), thus making our literal look like L"Hello" to the compiler.
The magic L is what tells the compiler this is a Unicode literal and not a char->unsigned short conversion. This is why you'll need a Unicode-capable compiler to compile such programs. The same goes for character literals: an ASCII character variable is assigned a literal such as 'A', whereas a Unicode character variable is assigned L'A' or _TEXT('A').
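For reference, the literal forms discussed above look like this when written out directly with the L prefix, using standard wchar_t so the snippet does not depend on tchar.h. The variable names are arbitrary.

```cpp
#include <cwchar>

const char    narrow_text[] = "Hello";    // 8-bit string literal
const wchar_t wide_text[]   = L"Hello";   // wide string literal, note the L
// const wchar_t broken[]   = "Hello";    // error: narrow literal, wide array

const char    narrow_char = 'A';          // 8-bit character literal
const wchar_t wide_char   = L'A';         // wide character literal
```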
To aid in porting applications that use Unicode to non-Unicode platforms, tchar.h contains a few more features. If you don't define _UNICODE prior to including tchar.h, _TEXT will instead be defined along the lines of #define _TEXT(x) x.
This means it does virtually nothing, thus falling back to standard ASCII string functionality. In addition, using the special data type TCHAR (also defined in tchar.h), data type independence can be achieved, since TCHAR is set up to be equal to wchar_t when _UNICODE is defined, and to char when it isn't. This makes a source line such as TCHAR string[] = _TEXT("Hello"); work in both Unicode and non-Unicode environments.
As another aid in porting your applications, tchar.h defines a set of string manipulation macros (all having the _tcs prefix) that expand to either the corresponding ASCII or Unicode functions, depending on whether or not _UNICODE has been defined. As an example, _tcslen expands to wcslen when used in Unicode applications, and strlen in ASCII applications.
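To see the switch in action without depending on tchar.h itself, the sketch below re-creates the mechanism in miniature. The names DEMO_UNICODE, demo_tchar, DEMO_TEXT and demo_tcslen are stand-ins invented for this example; the real header's equivalents are _UNICODE, TCHAR, _TEXT and _tcslen.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

#define DEMO_UNICODE              // comment this out for an ASCII build

#ifdef DEMO_UNICODE
    typedef wchar_t demo_tchar;
    #define DEMO_TEXT(x)  L##x    // paste an L in front of the literal
    #define demo_tcslen   std::wcslen
#else
    typedef char demo_tchar;
    #define DEMO_TEXT(x)  x       // leave the literal untouched
    #define demo_tcslen   std::strlen
#endif

// The same source line compiles in both configurations.
const demo_tchar greeting[] = DEMO_TEXT("Hello, world");

inline std::size_t GreetingLength() {
    return demo_tcslen(greeting);
}
```

Only the single #define at the top changes between the two builds; the code that uses the strings stays the same.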
...And that's about all you need to know to start using Unicode! But as always, operating systems tend to put restrictions upon programmers, and Unicode is no exception...
Speaking Unicode To A Window
As I said earlier, Windows NT and 2000 are built with Unicode in mind, whereas
95/98 aren't. Even though Microsoft did their best (?) to hide this from the
programmers, we must still be cautious under some circumstances.
Internally, the Win32 API maintains two versions of any function that operates on strings in any way, one for Unicode and one for ASCII strings. Take for example the CreateWindow() function, to which the first two arguments are strings (window class and window title). It comes in two flavors, CreateWindowA() and CreateWindowW(), both defined in winuser.h (one of Windows' internal headers).
If you've specified the UNICODE macro before including windows.h, Windows automatically defines CreateWindow to call CreateWindowW (the Unicode function), otherwise it is defined to call CreateWindowA (the ASCII/ANSI version). The same applies to all string processing Win32 API functions (the LPCTSTR data type, by the way, is just Microsoft's way of saying "pointer to a constant string in either Unicode or ASCII format, depending upon whether or not UNICODE has been #defined"). In the same way, there are also Unicode and ASCII versions of many structures.
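The header trick behind this is plain macro aliasing. The sketch below mimics it with a made-up function pair, DescribeA and DescribeW, a made-up switch DEMO_API_UNICODE and a made-up string type DemoStr; only the pattern, not the names, is taken from windows.h.

```cpp
#include <string>

#define DEMO_API_UNICODE          // stands in for UNICODE

// Two real functions, one per string flavor...
inline std::string DescribeA(const char*)    { return "ANSI version called"; }
inline std::string DescribeW(const wchar_t*) { return "Unicode version called"; }

// ...and one generic name that the preprocessor maps onto one of them,
// plus a matching string-pointer type in the spirit of LPCTSTR.
#ifdef DEMO_API_UNICODE
    typedef const wchar_t* DemoStr;
    #define Describe DescribeW
#else
    typedef const char* DemoStr;
    #define Describe DescribeA
#endif

inline std::string CallGeneric() {
    DemoStr title = L"My Window";  // wide literal: matches the flavor chosen above
    return Describe(title);
}
```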
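Nothing stops us from running a UNICODE-compiled application under Windows 95 or 98, but it surely won't work correctly if we start passing Unicode strings to the Win32 API functions, which on those systems require strings to be in ASCII format. There are two ways to get around this limitation: build and ship separate ASCII and Unicode versions of the application, or ship a single Unicode version that detects the platform at run time and converts its strings to ASCII before calling the API on Windows 95/98.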
I prefer the second choice, since working with multiple code bases or build commands is a constant source of headache. In addition, it's much more convenient for the end user to have but one executable that runs on all platforms (in reality, I guess this is more or less expected by today's users of Windows programs). Let's look at an example of such a situation, again involving CreateWindow.
Converting Between Unicode And ASCII
We'll often need functions to convert from Unicode to ASCII and vice versa. Such functions are easy to implement yourself, but you could also use the ones included in the Win32 API, MultiByteToWideChar() and WideCharToMultiByte().
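If you do roll your own, a minimal sketch could look like the following. It assumes the text is limited to code points 0-255, where the Unicode value equals the byte value; anything outside that range is replaced with '?' when narrowing. The function names are invented for this example, and unlike the Win32 functions this sketch knows nothing about code pages.

```cpp
#include <string>

// Widens each byte to a wchar_t. Going through unsigned char avoids
// sign-extending bytes above 0x7F into bogus code points.
inline std::wstring AsciiToUnicode(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (std::string::size_type i = 0; i < narrow.size(); ++i) {
        wide += static_cast<wchar_t>(static_cast<unsigned char>(narrow[i]));
    }
    return wide;
}

// Narrows each wchar_t to a byte; unrepresentable characters become '?'.
inline std::string UnicodeToAscii(const std::wstring& wide) {
    std::string narrow;
    narrow.reserve(wide.size());
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        const unsigned long code = static_cast<unsigned long>(wide[i]);
        narrow += (code <= 0xFF)
                      ? static_cast<char>(static_cast<unsigned char>(code))
                      : '?';
    }
    return narrow;
}
```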
Using The Back Door to Detect Unicode Support
Of course, we need to determine if we're running on a Unicode-compatible version of Windows, because if we are, there's naturally no need to convert strings to ASCII before calling the API functions. For reasons unknown, the Win32 API does not provide a function that directly reports whether or not a particular Windows installation is capable of using Unicode. However, it can be detected with a little helper function of our own.
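One plausible way to write such a check, not necessarily the author's, is to ask which Windows family is running: the NT family implements the wide-character API, while 95/98 largely does not. The sketch relies on the documented behavior of GetVersion(), whose high bit is clear on NT-based systems; the name IsUnicodeOS and the non-Windows fallback are assumptions made for the example.

```cpp
#ifdef _WIN32
    #include <windows.h>
#endif

// Returns true when the wide-character (W) API functions can be used.
inline bool IsUnicodeOS() {
#ifdef _WIN32
    // High bit clear: Windows NT family (NT, 2000 and later).
    // High bit set:   Windows 95/98 family.
    // Note: newer SDKs flag GetVersion() as deprecated.
    return GetVersion() < 0x80000000;
#else
    return true;   // assumption: nothing to fall back to off Windows
#endif
}
```

Since the answer cannot change while the program runs, it is enough to call the function once at startup and cache the result.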
Supporting Two Worlds
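Being armed with such a function, it's a no-brainer to implement the calling of CreateWindow in a way that works on both Unicode and non-Unicode versions of Windows: on a Unicode-capable system, pass the Unicode strings straight to CreateWindowW(); otherwise, convert them to ASCII first and call CreateWindowA().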
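The control flow can be sketched without any real windowing code. In the example below the WindowApi struct and its two function pointers are hypothetical stand-ins for CreateWindowW and CreateWindowA, and the unicode_os flag stands in for the result of the detection function; the narrowing helper simply replaces non-ASCII characters with '?'.

```cpp
#include <string>

// Hypothetical stand-ins for CreateWindowW / CreateWindowA.
struct WindowApi {
    int (*create_wide)(const wchar_t* title);
    int (*create_ansi)(const char* title);
};

// Narrows a wide string for the ANSI path; non-ASCII becomes '?'.
inline std::string NarrowTitle(const std::wstring& wide) {
    std::string narrow;
    for (std::wstring::size_type i = 0; i < wide.size(); ++i) {
        const unsigned long code = static_cast<unsigned long>(wide[i]);
        narrow += (code < 0x80) ? static_cast<char>(code) : '?';
    }
    return narrow;
}

// The program keeps its strings in Unicode and converts only when it must.
inline int CreateTitledWindow(const WindowApi& api, bool unicode_os,
                              const std::wstring& title) {
    if (unicode_os) {
        return api.create_wide(title.c_str());       // NT/2000: pass through
    }
    const std::string narrow = NarrowTitle(title);   // 95/98: convert first
    return api.create_ansi(narrow.c_str());
}
```

Wrapping each string-taking API call in a small function like this keeps the platform check out of the rest of the program.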
Prizes and Penalties
One thing worth noting (particularly since this is going to be used in the context of game development) is the issue of performance.
If you're using Unicode, the Win32 API functions in Windows NT and Windows 2000 will execute faster since they do not have to convert the strings to Unicode before doing the actual work. ASCII strings, by contrast, must first be converted to Unicode because the functions use Unicode internally, and that takes time. The situation is reversed under Windows 95/98.
The next thing you need to implement is the localization functionality that
actually makes use of all this Unicode stuff. What this means is that when
you're about to display a string of text to the user, you first browse a
database to see if that particular string is available in some language selected
by the user.
There are many different ways to accomplish this; one way is to simply use Windows' string table resources for localization, thus defining a string table for each language you wish to support (Windows resources are always stored in Unicode format). This has two obvious drawbacks: Primarily, it makes your application very hard to port, as non-Microsoft platforms (e.g. Linux) have no support for such string table resources. Secondly, it makes it hard to provide support for new languages after the product has been shipped.
One great way of solving these problems is to perform localization the same way it's done in Unreal. Here, you store a file (for instance using regular INI syntax) containing all the strings for a language, like this:
; English language file
OutOfMemoryError=Out of memory!
FileNotFoundError=File not found!

; Swedish language file
OutOfMemoryError=Slut på minne!
FileNotFoundError=Kan inte hitta filen!
Then create a function that loads such a file and looks up strings in it by key.
By replacing all explicit string references with calls to such a function, you achieve full language independence. If a string isn't localized for a particular language (meaning it cannot be found in the language file), it might be good to default to English, so give the function a Default argument for that purpose. In use, each hard-coded string in the program becomes a call that passes the string's key together with its English default text.
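A minimal sketch of such a lookup, assuming the language file has already been read into memory as wide text with one Key=Value pair per line. The class name StringTable, its Load and Localize members and the parsing rules are all inventions for this example, not Unreal's actual format handling.

```cpp
#include <map>
#include <sstream>
#include <string>

class StringTable {
public:
    // Parses "Key=Value" lines; lines without a key or '=' are ignored.
    void Load(const std::wstring& file_contents) {
        std::wistringstream input(file_contents);
        std::wstring line;
        while (std::getline(input, line)) {
            if (!line.empty() && line[line.size() - 1] == L'\r') {
                line.erase(line.size() - 1);         // tolerate CRLF files
            }
            const std::wstring::size_type eq = line.find(L'=');
            if (eq == std::wstring::npos || eq == 0) continue;
            strings_[line.substr(0, eq)] = line.substr(eq + 1);
        }
    }

    // Returns the localized string, or Default when the key is missing.
    std::wstring Localize(const std::wstring& key,
                          const std::wstring& Default) const {
        const std::map<std::wstring, std::wstring>::const_iterator it =
            strings_.find(key);
        return it != strings_.end() ? it->second : Default;
    }

private:
    std::map<std::wstring, std::wstring> strings_;
};
```

A call site then reads along the lines of table.Localize(L"OutOfMemoryError", L"Out of memory!"), so the English text stays visible in the source while the displayed text follows the loaded language file.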
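The string files must of course be written in Unicode format for languages to take advantage of the extended character set; such files can be written with for instance Microsoft Word and Notepad. In fact, Unicode .txt files differ from ASCII .txt files in only two ways: they begin with a two-byte byte-order mark (the code 0xFEFF, which Notepad stores as the bytes FF FE), and every character occupies two bytes instead of one.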
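Telling the two kinds of file apart in a loader comes down to checking the first two bytes for the byte-order mark (FF FE in the little-endian layout that Notepad writes). A minimal sketch, with an invented function name:

```cpp
#include <cstddef>

// True when the buffer starts with the 16-bit little-endian byte-order
// mark, i.e. the code point U+FEFF stored as the bytes FF FE.
inline bool HasUnicodeBom(const unsigned char* data, std::size_t size) {
    return size >= 2 && data[0] == 0xFF && data[1] == 0xFE;
}
```

When the mark is present, skip those two bytes and read the rest of the file as 16-bit characters; when it is absent, treat the file as plain ASCII.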
For more information about Unicode, visit the Unicode consortium's home page at http://www.unicode.org.
Until Next Time...
If you're not fed up with strings yet, there's one more tutorial to take care of.
In the next tutorial, we'll examine the use of string classes for encapsulating all this ASCII/Unicode functionality, plus we'll add some extras to make C++ string management really earn the pluses.
Fredrik Andersson (email@example.com)
Comment: This address is only temporary, I'll soon have another mail address...
Lead Programmer, Herring Interactive