flipcode - Advanced String Techniques in C++

Advanced String Techniques in C++ - Part I: Unicode
by Fredrik Andersson (28 August 2000)

Introduction

These tutorials (there'll be two of them) will discuss how to implement a few neat string handling techniques in your applications and games. I've intentionally stayed clear of references to game development in these tutorials because I feel that the techniques presented herein win by being presented in a more general context. After all, a large part of game development is the development of the foundation driver technology, and there's nothing more fundamental than string management, is there?

This first tutorial discusses Unicode and localization techniques. What this basically means is how to add support for character sets and languages other than English in a very simple way. The reasons for doing so should be quite obvious to everyone: At the very least, making it possible for non-English players of your games to input data in their native language (for instance, when using network chat modes or perhaps when naming a character in an RPG) is a way of showing them great respect.

A Farewell To Char

I happen to live in Sweden, and despite common belief we neither have polar bears wandering around the towns nor have we limited our diet to meatballs. What we do have, however, is an extended Latin alphabet. For the sole purpose of limiting our communication with the outside world (or so it feels), we've placed little dots and circles above certain letters in our alphabet, making them virtually unpronounceable to anyone not living in Northern Europe. Such characters are a pain to store and transmit electronically since the characters basically don't exist on anything but Swedish computers.

The reason for this is quite obvious: Characters in computer systems are normally stored as ASCII codes, or a derivative thereof (such as the ANSI code pages used by Windows). The problem with these encoding methods is the limited space available for characters; ASCII itself uses only seven bits per character, and even its eight-bit derivatives can define a maximum of 2^8 or 256 characters. This is quite enough for encoding the standard Latin alphabet, the digits, punctuation and a few diacritics, but not the exotic characters of non-English languages (just take a look at languages such as Chinese or Japanese Kanji, which have thousands of letters representing complete syllables, words and phrases).

The solution is to increase the number of bits used to store each character, and to do so in a standardized way that allows painless data transfers between systems using different languages. Two such standards exist and are in use today: Multibyte Character Sets (MBCS) and Unicode.

The Mess of MBCS

To be blunt, MBCS is the inferior of the two. The character set is based on the ASCII-friendly char data type, but each character occupies either one or two chars, effectively rendering all your favorite C string functions useless. When parsing an MBCS string (yes, it has to be parsed before it's used!), you must examine the bits of every character you read to determine if the next character in the string is a part of the current character or not, and if it is, how to combine the two to form a human-readable character. While this standard certainly provides you with more than 256 characters, it requires you to use complex (and hence slow) string functions for even the most trivial tasks.

Most modern Windows compilers come with a set of string functions (all of which have the _mbs prefix) that operate on multi-byte character strings. They behave like their regular C counterparts (e.g. _mbslen() implements the same functionality as strlen(), but counts characters rather than bytes). Windows even has a few functions for parsing MBCS strings character-by-character, namely CharNext(), CharPrev() and IsDBCSLeadByte(). Look them up in the Win32 API reference for more info.
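To make the parsing cost concrete, here is a minimal sketch of counting the characters in a double-byte string. It assumes Shift-JIS style lead-byte ranges (0x81-0x9F and 0xE0-0xFC), and both IsLeadByteSketch and CountMbcsChars are hypothetical names made up for this illustration; a real application would ask the OS (for instance through IsDBCSLeadByte()) rather than hard-code the ranges.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if 'b' starts a two-byte character.
// The ranges below are an assumption (Shift-JIS style code page).
bool IsLeadByteSketch(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in a null-terminated double-byte string.
std::size_t CountMbcsChars(const unsigned char *s)
{
    std::size_t count = 0;
    while (*s != 0)
    {
        // A lead byte also consumes the trail byte that follows it, unless
        // the string has been cut off right after the lead byte.
        if (IsLeadByteSketch(*s) && s[1] != 0)
            s += 2;
        else
            s += 1;
        ++count;
    }
    return count;
}
```

For the eight bytes 0x83 0x65 0x83 0x58 0x83 0x67 'O' 'K', this function reports 5 characters where strlen() would report 8. Every single step needs that extra test, which is exactly the overhead described above.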

Now that you know what MBCS is, don't use it. If you do, you'll be sorry. Instead, read on and discover the wonders of Unicode!

Unicode

Unicode was invented by Apple and Xerox in the late 80's, and is now maintained by an industry consortium responsible for assigning new character codes etc.

With Unicode as it is used in this tutorial, every character is encoded as a 16-bit (or 2-byte) quantity (unsigned short in C), thus making available as many as 65 536 characters, enough for the characters of the written languages in common use today.

C's good old string functions won't work on Unicode strings either, but luckily there's another set of C runtime library functions available for Unicode strings, prefixed with wcs (for "wide character string", not to be confused with the aforementioned multi-byte function set). You'll find all your old workhorses here, such as wcslen(), which implements behavior equivalent to strlen(). In addition, writing your own functions to operate on Unicode strings is nowhere near as difficult or frustrating as writing MBCS functions, since no parsing is necessary. There's no hassle with any CharNext()-like functions; traversing a string is once again as easy as blindly increasing a pointer and looking for a terminating null character (as is the case with ASCII strings).
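As a small illustration of how little changes, here is a sketch of a hand-written length function for wide strings next to the runtime's own wcslen(). MyWcsLen and AgreesWithRuntime are hypothetical names; the code makes no assumption about how many bits wchar_t has (16 on Windows, more on some other platforms).

```cpp
#include <cstddef>
#include <cwchar>   // std::wcslen

// Counts the characters in a wide string by walking a pointer until the
// terminating null character, just as one would with a char string.
std::size_t MyWcsLen(const wchar_t *s)
{
    const wchar_t *p = s;
    while (*p != L'\0')
        ++p;
    return static_cast<std::size_t>(p - s);   // pointer difference is in characters
}

// Checks the hand-written loop against the C runtime's wide-string function.
bool AgreesWithRuntime(const wchar_t *s)
{
    return MyWcsLen(s) == std::wcslen(s);
}
```

Note that the pointer subtraction yields a count of characters, not bytes, because the pointers are typed as wchar_t.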

The Unicode consortium has defined code points (a code point being the Unicode index of a specific character) for a wide variety of languages, diacritics, special symbols, dingbats, mathematical and scientific symbols etc. They've also reserved quite a bit of room for you to store any custom characters your application might use. And they've had the foresight to place the standard ASCII characters at code points 0-127 (with the Latin-1 characters following at 128-255), making ASCII to Unicode translation and comparison a breeze.
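Because the ASCII characters keep their numeric values, translating plain ASCII text to Unicode boils down to widening each byte. The following is a minimal sketch; WidenAscii is a hypothetical helper, and it is only correct for 7-bit ASCII input (or Latin-1, if you know that's what the bytes are).

```cpp
#include <string>

// Widens an ASCII string: each byte becomes the code point with the same value.
// Bytes above 127 are zero-extended, which is only meaningful for Latin-1 text.
std::wstring WidenAscii(const std::string &ascii)
{
    std::wstring wide;
    wide.reserve(ascii.size());
    for (std::string::size_type i = 0; i < ascii.size(); ++i)
    {
        // Go through unsigned char so that bytes >= 128 don't sign-extend.
        const unsigned char byte = static_cast<unsigned char>(ascii[i]);
        wide.push_back(static_cast<wchar_t>(byte));
    }
    return wide;
}
```

Text in any other code page needs a real conversion table, which is what the OS conversion functions provide.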

Where's the Catch?

However, Unicode does have a few drawbacks (or rather design issues that you need to be aware of). First, of course, is the fact that Unicode strings occupy twice as much space as ASCII strings, since two bytes are used per character instead of just one. This same fact leads to a few other issues that need to be pointed out: You cannot treat Unicode strings as arrays of bytes, as is perfectly legal with ASCII strings. Instead, you need to treat them as arrays of characters. You must also make sure you're not performing any arithmetic operations on your strings under the assumption that characters occupy only one byte each.
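The classic mistake here is passing sizeof(buffer) where a character count is expected. Below is a sketch of one way to keep the two apart; CountOf and CopyWide are hypothetical helpers written for this example, not library functions.

```cpp
#include <cstddef>

// Number of elements (characters) in a fixed-size array, as opposed to
// sizeof(), which reports bytes.
template <typename T, std::size_t N>
std::size_t CountOf(T (&)[N])
{
    return N;
}

// Copies 'src' into 'dst', which has room for 'capacity' characters (not
// bytes). The result is always null-terminated and truncated if necessary.
void CopyWide(wchar_t *dst, std::size_t capacity, const wchar_t *src)
{
    if (capacity == 0)
        return;
    std::size_t i = 0;
    for (; i + 1 < capacity && src[i] != L'\0'; ++i)
        dst[i] = src[i];
    dst[i] = L'\0';
}
```

Given wchar_t buf[16], CountOf(buf) is 16 while sizeof(buf) is 16 * sizeof(wchar_t). Calling CopyWide(buf, sizeof(buf), ...) would therefore tell the function the buffer is larger than it really is.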

There are also portability issues to consider: Not all operating systems and compilers have support for Unicode. An operating system without Unicode support isn't that big a problem actually - just make sure you're not using Unicode strings when calling OS functions. On the other hand, your compiler and C runtime library must have explicit support for Unicode in order for you to use it, for reasons that will soon become obvious.

If you're targeting the Windows platform, note that NT and Windows 2000 have full Unicode support (in fact, all NT API functions expect Unicode strings; ASCII is supported through an internal conversion stage), but Windows 95 and 98 have only very rudimentary support for Unicode. We'll take care of that problem a little later.

Trying It At Home

So, how do we take Unicode from concept to reality? Assuming you're on a platform and compiler that supports it, it's quite simple. So let's pretend we're using Visual C++ on Windows 2000 for a moment, shall we?

First, you need to inform the C runtime library that you wish to use Unicode. That is done by placing the following lines before any other C headers:


#define _UNICODE	// Tell C we're using Unicode, notice the _
#include <tchar.h>	// Include Unicode support functions
#include <stdlib.h>
#include <string.h>
#include ...

The _UNICODE macro tells tchar.h, which is the Unicode header file shipped with the compiler, to include the following definition of the Unicode character type:


typedef unsigned short wchar_t;

...which should be used instead of char for Unicode strings. Since we're running on Windows, we also need to tell the Win32 API we're interested in taking advantage of Unicode. This is done by placing the following line before the inclusion of windows.h:


#define UNICODE					// No underscore this time
#include <windows.h>
...

This line causes Windows to redefine a few of its internal string data types to be 16-bit quantities. It might be a good idea to stick these definitions and inclusions in a common header file included by all program modules, to avoid wreaking havoc if some module is unintentionally not using Unicode.

Next in line is the problem of literals. The following code would work perfectly well with any C compiler:


char mystring[] = "flipcode";		// ASCII literal assignment

But try the following:


wchar_t mystring[] = "flipcode";	// Unicode literal assignment

The compiler will tell you it's an illegal assignment since you can't assign a string to an array of 16-bit integers. Now try this instead:


wchar_t mystring[] = _TEXT("flipcode");

I bet it'll work perfectly. What's that _TEXT thing and what magic is lurking beneath it? The answer is it's a macro defined in tchar.h. Here it is written out:


#define _TEXT(t) L ## t

For those of you not very familiar with C's macro system, all this macro does is it sticks a capital L in front of the string literal (the ## is a macro concatenation command which just merges the L with the parameter), thus making our initial source line look like this to the compiler:


wchar_t mystring[] = L"flipcode";

The magic L is what tells the compiler this is a Unicode literal and not a char->unsigned short conversion. This is why you'll need a Unicode-capable compiler to compile such programs. The same goes for character literals. Following are character literal assignments with both ASCII and Unicode character variables, respectively:


char mychar = 'A';			// ASCII character literal
wchar_t mywchar = _TEXT('A');		// Unicode character literal

To aid in porting applications that use Unicode to non-Unicode platforms, tchar.h contains a few more features. If you don't define _UNICODE prior to including tchar.h, _TEXT will be defined in the following way:


#define _TEXT(t) t

...which means it does virtually nothing, thus falling back to standard ASCII string functionality. In addition, using the special data type TCHAR (also defined in tchar.h), data type independence can be achieved since TCHAR is set up to be equal to wchar_t when _UNICODE is defined, and char when it isn't. This makes the following source line work in both Unicode and non-Unicode environments:


TCHAR SomeString[] = _TEXT("flipcode");

As another aid in porting your applications, tchar.h defines a set of string manipulation macros (all having the _tcs prefix) that expand to either the corresponding ASCII or Unicode functions, depending on whether or not _UNICODE has been defined. As an example, _tcslen expands to wcslen when used in Unicode applications, and strlen in ASCII applications.
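To show that there is no magic involved, here is a portable sketch of the same mechanism. The names MY_UNICODE, MyTchar, MY_TEXT and my_tcslen are hypothetical stand-ins invented for this example; they are not the real tchar.h definitions, which differ in detail.

```cpp
#include <cstddef>
#include <cstring>   // std::strlen
#include <cwchar>    // std::wcslen

// Hypothetical stand-ins for TCHAR, _TEXT and _tcslen. Define MY_UNICODE
// before this point to switch the whole translation unit to wide characters.
#ifdef MY_UNICODE
    typedef wchar_t MyTchar;
    #define MY_TEXT_IMPL(t) L ## t           // paste an L in front of the literal
    #define MY_TEXT(t)      MY_TEXT_IMPL(t)  // extra level so macro arguments expand first
    #define my_tcslen       std::wcslen
#else
    typedef char MyTchar;
    #define MY_TEXT(t)      t
    #define my_tcslen       std::strlen
#endif

// The same source lines compile in both configurations.
std::size_t TitleLength()
{
    const MyTchar title[] = MY_TEXT("flipcode");
    return my_tcslen(title);
}
```

TitleLength() returns 8 either way; only the size of the array in bytes changes.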

...And that's about all you need to know to start using Unicode! But as always, operating systems tend to put restrictions upon programmers, and Unicode is no exception...

Speaking Unicode To A Window

As I said earlier, Windows NT and 2000 are built with Unicode in mind, whereas 95/98 aren't. Even though Microsoft did their best (?) to hide this from the programmers, we must still be cautious under some circumstances.

Internally, the Win32 API maintains two versions of any function that operates on strings in any way, one for Unicode and one for ASCII strings. Take for example the CreateWindow() function, to which the first two arguments are strings (window class and window title). It comes in two flavors, both defined in winuser.h (one of Windows' internal headers):


HWND CreateWindowA(LPCSTR lpClassName, LPCSTR lpWindowTitle, ...);	// char strings
HWND CreateWindowW(LPCWSTR lpClassName, LPCWSTR lpWindowTitle, ...);	// wchar_t strings
#ifdef UNICODE
#define CreateWindow CreateWindowW
#else
#define CreateWindow CreateWindowA
#endif

If you've specified the UNICODE macro before including windows.h, Windows automatically defines CreateWindow to call CreateWindowW (the Unicode function), otherwise it is defined to call CreateWindowA (the ASCII/ANSI version). The same applies to all string processing Win32 API functions. (The LPCTSTR data type you'll meet in the API documentation, by the way, is just Microsoft's way of saying "pointer to a constant string in either Unicode or ASCII format, depending upon whether or not UNICODE has been #defined"; it resolves to LPCWSTR in the former case and LPCSTR in the latter.) In the same way, there are also Unicode and ASCII versions of many structures.

Nothing stops us from running a UNICODE-compiled application under Windows 95 or 98, but it surely won't work correctly if we start passing Unicode strings to the Win32 API functions, which on those systems require strings to be in ASCII format. There are two ways to get around this limitation:

Maintain one Unicode version and one non-Unicode version of your app.
Convert any Unicode strings to ASCII format before calling a Windows 95/98 Win32 API function.

I prefer the second choice, since working with multiple code bases or build commands is a constant source of headache. In addition, it's much more convenient for the end user to have but one executable that runs on all platforms (in reality, I guess this is more or less expected by today's users of Windows programs). Let's look at an example of such a situation, again involving CreateWindow.

Converting Between Unicode And ASCII

We'll often need functions to convert from Unicode to ASCII and vice versa. Such functions are easy to implement yourself, but you could also use the ones included in the Win32 API:


// Convert an ASCII string to a Unicode string
char AsciiIn[] = "Ascii!";
wchar_t UnicodeOut[1024];		// Size is given in characters, not bytes
MultiByteToWideChar(CP_ACP, 0, AsciiIn, -1, UnicodeOut, 1024);

// Convert a Unicode string to an ASCII string
wchar_t UnicodeIn[] = L"Unicode!";
char AsciiOut[1024];
WideCharToMultiByte(CP_ACP, 0, UnicodeIn, -1, AsciiOut, 1024, NULL, NULL);
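For the "implement it yourself" route, the Unicode-to-ASCII direction is the one that needs a decision, since most Unicode characters have no ASCII equivalent. Here is a minimal portable sketch that substitutes a fallback character; NarrowToAscii is a hypothetical helper, and the conversion is deliberately lossy.

```cpp
#include <string>

// Narrows a wide string to 7-bit ASCII. Characters above 127 have no ASCII
// equivalent, so each of them is replaced by 'fallback'.
std::string NarrowToAscii(const std::wstring &wide, char fallback = '?')
{
    std::string ascii;
    ascii.reserve(wide.size());
    for (std::wstring::size_type i = 0; i < wide.size(); ++i)
    {
        // Compare as unsigned so the test also works where wchar_t is signed.
        const unsigned long code = static_cast<unsigned long>(wide[i]);
        ascii.push_back(code <= 127 ? static_cast<char>(code) : fallback);
    }
    return ascii;
}
```

With this sketch, a string consisting of 'p' followed by the Swedish a-with-ring (code point 0xE5) comes out as "p?". The Win32 functions do a better job because they know the active code page.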

Using The Back Door to Detect Unicode Support

Of course, we need to determine if we're running on a Unicode-compatible version of Windows, because if we are, there's naturally no need to convert strings to ASCII before calling the API functions. For reasons unknown, the Win32 API does not provide a function to determine whether or not a particular Windows installation is capable of using Unicode. However, it can be detected using the following little function:


// Use a harmless Win32 API function to determine if Windows is currently capable of using
// Unicode. Since we're calling the Unicode version (W), the function will fail if called on
// a version of Windows that's not Unicode-compatible.
// It might be a good idea to determine this in the app's initialization phase, and store the
// result in a global boolean variable.
bool IsUnicodeOS()
{
	OSVERSIONINFOW		os;
	memset(&os, 0, sizeof(OSVERSIONINFOW));
	os.dwOSVersionInfoSize = sizeof(OSVERSIONINFOW);
	return (GetVersionExW(&os) != 0);
}

Supporting Two Worlds

Being armed with such a function, it's a no-brainer to implement the calling of CreateWindow in a way that works on both Unicode and non-Unicode versions of Windows:


HWND MyCreateWindow(const TCHAR *ClassName, const TCHAR *WindowTitle, ...)
{
    #ifdef UNICODE		// This is a Unicode program, must see if OS has Unicode
	if (IsUnicodeOS() == false)	// Win95/98, must build ASCII strings
	{
	    char aClassName[1024], aWindowTitle[1024];
	    WideCharToMultiByte(CP_ACP, 0, ClassName, -1, aClassName, 1024, NULL, NULL);
	    WideCharToMultiByte(CP_ACP, 0, WindowTitle, -1, aWindowTitle, 1024, NULL, NULL);
	    return CreateWindowA(aClassName, aWindowTitle, ...);	// Hand the new window back to the caller
	}
	else
    #endif
	{
	    // If we get here, we're either running a Unicode version of the app on a Unicode version
	    // of Windows, or a non-UC version on non-UC Windows.
	    return CreateWindow(ClassName, WindowTitle, ...);	// Use the one defined by Windows
	}
}

Prizes and Penalties

One thing worth noting (particularly since this is going to be used in the context of game development) is the issue of performance.

If you're using Unicode, the Win32 API functions in Windows NT and Windows 2000 will execute faster, since they don't have to convert the strings to Unicode before doing the actual work. ASCII strings get no such benefit: since the functions use Unicode internally, every ASCII string must first be converted to Unicode, and that takes time. The situation is reversed under Windows 95/98.

Localization

The next thing you need to implement is the localization functionality that actually makes use of all this Unicode stuff. What this means is that when you're about to display a string of text to the user, you first browse a database to see if that particular string is available in some language selected by the user.

There are many different ways to accomplish this; one way is to simply use Windows' string table resources for localization, thus defining a string table for each language you wish to support (Windows resources are always stored in Unicode format). This has two obvious drawbacks: Primarily, it makes your application very hard to port, as non-Microsoft platforms (e.g. Linux) have no support for such string table resources. Secondly, it makes it hard to provide support for new languages after the product has been shipped.

One great way of solving these problems is to perform localization the same way it's done in Unreal. Here, you store a file (for instance using regular INI syntax) containing all the strings for a language, like this:

english.str:
OutOfMemoryError=Out of memory!
FileNotFoundError=File not found!

swedish.str:
OutOfMemoryError=Slut på minne!
FileNotFoundError=Kan inte hitta filen!

Then create a function to load such strings:


const TCHAR *LoadLocalizedString(const char *Language, const char *Key, const char *Default);

By replacing all explicit string references with calls to such a function, you achieve full language independence. If a string isn't localized for a particular language (meaning it cannot be found in the language file), it might be good to default to English (hence the Default argument in the function prototype above). Here's how such a function can be used:


Old way: MessageBox(hWnd, "Out of memory!", NULL, MB_OK);
New way: MessageBox(hWnd, LoadLocalizedString("swedish", "OutOfMemoryError", "Out of memory!"), NULL, MB_OK);
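The lookup part of such a function can be sketched in portable C++ as follows. This is only one possible design, not the way Unreal or any particular engine does it: it works on std::wstring and an in-memory table instead of the TCHAR-based prototype above, it leaves out reading and decoding the file from disk, and StringTable, ParseStringTable and Lookup are hypothetical names.

```cpp
#include <istream>
#include <map>
#include <sstream>   // std::wistringstream, handy for feeding test data
#include <string>

typedef std::map<std::wstring, std::wstring> StringTable;

// Parses "Key=Value" lines from 'in'. Lines without a key are skipped.
StringTable ParseStringTable(std::wistream &in)
{
    StringTable table;
    std::wstring line;
    while (std::getline(in, line))
    {
        // Tolerate Windows line endings by dropping a trailing '\r'.
        if (!line.empty() && line[line.size() - 1] == L'\r')
            line.erase(line.size() - 1);

        const std::wstring::size_type eq = line.find(L'=');
        if (eq == std::wstring::npos || eq == 0)
            continue;
        table[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return table;
}

// Returns the localized string for 'key', or 'fallback' if the key is missing.
std::wstring Lookup(const StringTable &table, const std::wstring &key,
                    const std::wstring &fallback)
{
    const StringTable::const_iterator it = table.find(key);
    return it != table.end() ? it->second : fallback;
}
```

Parsing the language file once at startup and keeping the table around means that each later lookup is a map search rather than a file scan, which matters if strings are fetched every frame.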

The string files must of course be written in Unicode format for languages to take advantage of the extended character set; such files can be written with for instance Microsoft Word and Notepad. In fact, Unicode .txt files differ from ASCII .txt files in only two ways:

Unicode .txt files written by these programs start with a 2-byte header (known as the byte order mark), the first byte being 0xFF and the second one 0xFE (for little-endian files). Use these bytes to determine if the text following the header is in Unicode format or not. If no Unicode header is found, the two bytes are of course part of the actual text and are therefore not a header.
And of course, the text in Unicode files is stored in Unicode format, meaning there are two bytes per character for you to read.
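The two rules above translate into very little code. Here is a minimal sketch that works on a buffer already read from disk; HasUnicodeHeader and DecodeTextFile are hypothetical names, only the little-endian header is handled, and text without a header is assumed to be plain one-byte-per-character ASCII.

```cpp
#include <cstddef>
#include <string>

// Returns true if the buffer starts with the little-endian header 0xFF 0xFE.
bool HasUnicodeHeader(const unsigned char *data, std::size_t size)
{
    return size >= 2 && data[0] == 0xFF && data[1] == 0xFE;
}

// Turns the raw bytes of a text file into a wide string.
std::wstring DecodeTextFile(const unsigned char *data, std::size_t size)
{
    std::wstring text;
    if (HasUnicodeHeader(data, size))
    {
        // Skip the header, then read two bytes per character, low byte first.
        // A stray odd byte at the end is ignored.
        for (std::size_t i = 2; i + 1 < size; i += 2)
            text.push_back(static_cast<wchar_t>(data[i] | (data[i + 1] << 8)));
    }
    else
    {
        // No header: the bytes are the text itself, one byte per character.
        for (std::size_t i = 0; i < size; ++i)
            text.push_back(static_cast<wchar_t>(data[i]));
    }
    return text;
}
```

A file saved on a big-endian system would start with 0xFE 0xFF instead and need its bytes swapped; that case is left out of this sketch.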

Further Reading

For more information about Unicode, visit the Unicode consortium's home page at http://www.unicode.org.

Until Next Time...

If you're not fed up with strings yet, there's one more tutorial to take care of that.

In the next tutorial, we'll examine the use of string classes for encapsulating all this ASCII/Unicode functionality, plus we'll add some extras to make C++ string management really earn the pluses.

Fredrik Andersson (f01fan@efd.lth.se)
Comment: This address is only temporary, I'll soon have another mail address...
Lead Programmer, Herring Interactive

Article Series:

Advanced String Techniques in C++ - Part I: Unicode
Advanced String Techniques in C++ - Part II: A Complete String Class