UDN
Character Encoding
- Character Encoding
- Overview
- Text Formats
- UE3 Internal String Representation
- Text files loaded by UE3
- Text files saved by Unreal
- Recommended encoding for text files used by Unreal
- Storing UTF-16 text files in Perforce
- Conversion routines
- ToUpper() and ToLower() Non-Trivial in Unicode
- Notes about C++ source code specific to East Asian encodings
- See Also
Overview
Text Formats
- ASCII
  - characters between 32 and 126 inclusive, and 0, 9, 10 and 13. (P4 type text) (This is validated with a P4 trigger on checkin)
- ANSI
  - ASCII and the current codepage (e.g. Western European high ASCII) (needs to be stored as binary on the P4 server)
- UTF-8
  - a string made up of single bytes, in which multi-byte sequences encode non-ASCII characters (a superset of ASCII) (P4 type Unicode)
- UTF-16
  - a string made up of 2 bytes per character with a BOM (although characters outside the Basic Multilingual Plane, the so-called astral characters, take 4 bytes) (P4 type UTF-16) (This is validated with a P4 trigger on checkin)
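The ASCII definition above is narrow enough to check byte by byte. The following is a minimal sketch in standard C++, assuming the file contents are already in memory; `IsPlainAscii` is a hypothetical name and this is not the actual Perforce trigger.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every byte matches the "ASCII"
// definition above: 32..126 inclusive, plus 0, 9 (tab), 10 (LF) and 13 (CR).
bool IsPlainAscii(const std::string& Bytes)
{
    for (std::size_t i = 0; i < Bytes.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(Bytes[i]);
        const bool bPrintable = (c >= 32 && c <= 126);
        const bool bAllowedControl = (c == 0 || c == 9 || c == 10 || c == 13);
        if (!bPrintable && !bAllowedControl)
        {
            return false;
        }
    }
    return true;
}
```

Any byte with the high bit set fails this check, so ANSI, UTF-8 and UTF-16 files that contain non-ASCII characters are all rejected.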
The case for Binary
Pros | Cons |
---|---|
Internal format is not defined; each file can be loaded no matter what format it is. | Not mergeable. Requires all files of this type to be exclusive checkout. |
| Internal format is not defined; each file could be in a different format. |
| P4 stores the entirety of each version, which can unnecessarily bloat the depot size. |
The case for Text
Pros | Cons |
---|---|
Mergeable. Exclusive checkout is not required. | Very limiting; only ASCII characters allowed. |
The case for UTF-8
Pros | Cons |
---|---|
Simple access to all characters we will ever need. | Has a different memory profile for Asian languages (most East Asian characters take 3 bytes rather than the 2 they take in UTF-16). |
Uses less memory for predominantly ASCII text. | P4 type Unicode is not enabled on our Perforce server. |
Is a superset of ASCII; a plain ASCII string is a perfectly valid UTF-8 string. | String operations are more complicated; the string has to be parsed to do something as simple as a length calculation. |
Still works when the game detects the string is ASCII and outputs it as such. | MSDev doesn't handle anything other than ASCII very well in Asian regions. This is why we validate text as ASCII during checkin. |
If we did have a Unicode enabled server, the files would be mergeable and exclusive checkout would not be required. | |
Can detect whether a string is UTF-8 by parsing it (with or without a BOM). | |
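Two points from the table, that a length calculation requires parsing and that UTF-8 can be recognised by parsing, can be illustrated with one routine. This is a simplified sketch in standard C++ with a hypothetical function name: it checks only the lead-byte and continuation-byte structure and does not reject overlong forms or encoded surrogates, which a strict validator would.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the number of code points in a UTF-8 byte
// string, or -1 if the lead/continuation byte structure is malformed.
long CountUtf8CodePoints(const std::string& Bytes)
{
    long Count = 0;
    std::size_t i = 0;
    while (i < Bytes.size())
    {
        const unsigned char Lead = static_cast<unsigned char>(Bytes[i]);
        std::size_t Trail = 0;
        if (Lead < 0x80)                { Trail = 0; } // plain ASCII
        else if ((Lead & 0xE0) == 0xC0) { Trail = 1; } // 2-byte sequence
        else if ((Lead & 0xF0) == 0xE0) { Trail = 2; } // 3-byte sequence
        else if ((Lead & 0xF8) == 0xF0) { Trail = 3; } // 4-byte sequence
        else { return -1; }                            // stray continuation byte

        // The sequence must not run past the end of the buffer.
        if (Trail > Bytes.size() - i - 1) { return -1; }
        for (std::size_t k = 1; k <= Trail; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(Bytes[i + k]);
            if ((c & 0xC0) != 0x80) { return -1; }     // not a continuation byte
        }
        i += Trail + 1;
        ++Count;
    }
    return Count;
}
```

A plain ASCII string passes with a count equal to its byte length, which matches the "superset of ASCII" row above.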
The case for UTF-16
Pros | Cons |
---|---|
Simple access to all characters we will ever need. | Uses more memory. |
Simple. Memory usage in bytes is twice the number of characters (for characters we use, which are all in the Basic Multilingual Plane). | Difficult to detect this format if it doesn't have a BOM. |
Simple. String operations can split/combine without having to parse the strings. | Does not work when the game detects the string is ASCII and outputs it as such (this is now detected on checkin with the UTF-16 validator). |
Same as the format used in game; no translation, parsing or memory operations required. | MSDev doesn't handle anything other than ASCII very well in Asian regions. This is why we validate text as ASCII during checkin. |
Mergeable. Exclusive checkout is not required. | |
C# uses UTF-16 internally. | |
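Because the formats above are distinguished mainly by their BOM, detection usually starts with the first two or three bytes. A minimal sketch in standard C++ follows; the enum and function names are hypothetical and this is not the engine's loader. It ignores UTF-32, whose little-endian BOM also begins with FF FE.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for the sketch below.
enum class ETextEncoding { Unknown, Utf8Bom, Utf16LE, Utf16BE };

// Returns the byte at index i as an unsigned value.
static unsigned char ByteAt(const std::string& Bytes, std::size_t i)
{
    return static_cast<unsigned char>(Bytes[i]);
}

// Classifies a buffer by its byte order mark, if it has one.
ETextEncoding DetectBom(const std::string& Bytes)
{
    if (Bytes.size() >= 3 && ByteAt(Bytes, 0) == 0xEF &&
        ByteAt(Bytes, 1) == 0xBB && ByteAt(Bytes, 2) == 0xBF)
    {
        return ETextEncoding::Utf8Bom;
    }
    if (Bytes.size() >= 2 && ByteAt(Bytes, 0) == 0xFF && ByteAt(Bytes, 1) == 0xFE)
    {
        return ETextEncoding::Utf16LE;
    }
    if (Bytes.size() >= 2 && ByteAt(Bytes, 0) == 0xFE && ByteAt(Bytes, 1) == 0xFF)
    {
        return ETextEncoding::Utf16BE;
    }
    // No BOM: could be ASCII, ANSI, BOM-less UTF-8 or BOM-less UTF-16.
    return ETextEncoding::Unknown;
}
```

The `Unknown` result is where the "difficult to detect without a BOM" drawback applies: a caller would have to fall back on heuristics or on parsing the data as UTF-8.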
UE3 Internal String Representation
Text files loaded by UE3
Text files saved by Unreal
Recommended encoding for text files used by Unreal
INT and INI files
UTF-16 in either endianness. While the default MBCS encoding for an Asian language (e.g. CP932) will work on Windows, these files also need to be loaded on PS3 and Xbox360, and the conversion code only runs on Windows.
Source code
In general we don't recommend string literals inside C++ or UnrealScript source code; we recommend that this data goes in INT files.
UnrealScript source code
UTF-16 or default Windows encoding. In either case, Japanese and Korean string literals and comments should work correctly because the compiler runs on Windows and generates Unreal packages that store the text in UTF-16 internally. If you use the default Windows encoding, be aware that these .uc files will not work on machines with a different locale.
C++ Source
UTF-8 or default Windows encoding. MSVC, the Xbox360 compiler and gcc should all be happy with UTF-8 encoded source files. Latin-1 encoded files containing characters with the high bit set (for example copyright, trademark or degree symbols) should be avoided in source code where possible, because the encoding will break on systems with different locales. Some instances of this in 3rd party software are unavoidable (e.g. copyright notices), so for MSVC we disable warning 4819, which would otherwise occur when compiling on Asian Windows.
Storing UTF-16 text files in Perforce
- Do not use 'Text'
  - If a UTF-x file is checked in and stored as text, it will be corrupted after syncing.
- If you use 'Binary', mark the files as exclusive checkout
  - People can check in ASCII, UTF-8 or UTF-16 and it will work in the engine.
  - However, binary files cannot be merged, so if the files are not marked as exclusive checkout, changes will be stomped on.
- If you use 'UTF-16', make sure no one checks in a file that isn't UTF-16
  - We have a Perforce trigger that disallows checking in a non-UTF-16 file as UTF-16
    - //depot/UnrealEngine3/Development/Tools/P4Utils/CheckUTF16/
- The 'Unicode' type is UTF-8, and of no use to us here.
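To give a feel for what a "is this really UTF-16?" check might involve, here is a sketch in standard C++. It is an assumption about the kind of test such a validator could perform (BOM present, even byte count, correctly paired surrogates), not the actual CheckUTF16 tool, and `LooksLikeUtf16WithBom` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check: true if the buffer starts with a UTF-16 BOM, has an
// even number of bytes, and every surrogate code unit is correctly paired.
bool LooksLikeUtf16WithBom(const std::string& Bytes)
{
    if (Bytes.size() < 2 || Bytes.size() % 2 != 0)
    {
        return false;
    }
    const unsigned char B0 = static_cast<unsigned char>(Bytes[0]);
    const unsigned char B1 = static_cast<unsigned char>(Bytes[1]);
    bool bLittleEndian = false;
    if (B0 == 0xFF && B1 == 0xFE)      { bLittleEndian = true; }
    else if (B0 == 0xFE && B1 == 0xFF) { bLittleEndian = false; }
    else { return false; }             // no BOM

    bool bExpectLow = false;           // true right after a high surrogate
    for (std::size_t i = 2; i < Bytes.size(); i += 2)
    {
        const unsigned Lo = static_cast<unsigned char>(Bytes[bLittleEndian ? i : i + 1]);
        const unsigned Hi = static_cast<unsigned char>(Bytes[bLittleEndian ? i + 1 : i]);
        const unsigned Unit = (Hi << 8) | Lo;
        const bool bHigh = (Unit >= 0xD800 && Unit <= 0xDBFF);
        const bool bLow = (Unit >= 0xDC00 && Unit <= 0xDFFF);
        // A low surrogate must appear exactly when the previous unit was high.
        if (bExpectLow != bLow)
        {
            return false;
        }
        bExpectLow = bHigh;
    }
    return !bExpectLow;                // must not end on a dangling high surrogate
}
```

A check of this kind catches the common mistake of adding an ASCII or UTF-8 file with the 'UTF-16' file type, since such files normally lack the BOM.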
Conversion routines
- TCHAR_TO_ANSI(str)
- TCHAR_TO_OEM(str)
- ANSI_TO_TCHAR(str)
- TCHAR_TO_UTF8(str)
- UTF8_TO_TCHAR(str)
- typedef TStringConversion<TCHAR,ANSICHAR,FANSIToTCHAR_Convert> FANSIToTCHAR;
- typedef TStringConversion<ANSICHAR,TCHAR,FTCHARToANSI_Convert> FTCHARToANSI;
- typedef TStringConversion<ANSICHAR,TCHAR,FTCHARToOEM_Convert> FTCHARToOEM;
- typedef TStringConversion<ANSICHAR,TCHAR,FTCHARToUTF8_Convert> FTCHARToUTF8;
- typedef TStringConversion<TCHAR,ANSICHAR,FUTF8ToTCHAR_Convert> FUTF8ToTCHAR;
FString String;
...
FTCHARToANSI Convert(*String);
// FTCHARToANSI::Length() returns the number of bytes for the encoded string, excluding the null terminator.
Ar->Serialize((ANSICHAR*)Convert, Convert.Length());
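The byte length of a converted string differs from its character count whenever the target encoding is multi-byte, which is why the length should be taken from the converted data rather than from the source string. The standalone sketch below shows a UTF-16 to UTF-8 conversion in standard C++. It is an illustration only: `Utf16ToUtf8` is a hypothetical function, it does not use the engine's types, and it is not the implementation behind `TCHAR_TO_UTF8`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of UE3: encodes UTF-16 code units as UTF-8.
// Unpaired surrogates are replaced with U+FFFD.
std::string Utf16ToUtf8(const std::u16string& In)
{
    std::string Out;
    for (std::size_t i = 0; i < In.size(); ++i)
    {
        unsigned long Cp = In[i];
        const bool bHigh = (Cp >= 0xD800 && Cp <= 0xDBFF);
        const bool bLow = (Cp >= 0xDC00 && Cp <= 0xDFFF);
        if (bHigh && i + 1 < In.size() && In[i + 1] >= 0xDC00 && In[i + 1] <= 0xDFFF)
        {
            // Combine a surrogate pair into one code point above U+FFFF.
            Cp = 0x10000 + ((Cp - 0xD800) << 10) + (In[i + 1] - 0xDC00);
            ++i;
        }
        else if (bHigh || bLow)
        {
            Cp = 0xFFFD;
        }

        if (Cp < 0x80)
        {
            Out += static_cast<char>(Cp);
        }
        else if (Cp < 0x800)
        {
            Out += static_cast<char>(0xC0 | (Cp >> 6));
            Out += static_cast<char>(0x80 | (Cp & 0x3F));
        }
        else if (Cp < 0x10000)
        {
            Out += static_cast<char>(0xE0 | (Cp >> 12));
            Out += static_cast<char>(0x80 | ((Cp >> 6) & 0x3F));
            Out += static_cast<char>(0x80 | (Cp & 0x3F));
        }
        else
        {
            Out += static_cast<char>(0xF0 | (Cp >> 18));
            Out += static_cast<char>(0x80 | ((Cp >> 12) & 0x3F));
            Out += static_cast<char>(0x80 | ((Cp >> 6) & 0x3F));
            Out += static_cast<char>(0x80 | (Cp & 0x3F));
        }
    }
    return Out;
}
```

A single Japanese character, for example, is one UTF-16 code unit but three UTF-8 bytes, so `Out.size()` is the value to pass to a byte-oriented serializer.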
ToUpper() and ToLower() Non-Trivial in Unicode
- ISO/IEC 8859-1 for English, French, German, Italian, Portuguese and both Spanishes
- ISO/IEC 8859-2 for Polish, Czech and Hungarian
- ISO/IEC 8859-5 for Russian
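Even within a single 8-bit code page, case conversion has exceptions. The sketch below upper-cases one byte of ISO/IEC 8859-1, the first code page in the list above. It is a hypothetical standalone helper in standard C++, not the engine's `ToUpper()`.

```cpp
// Hypothetical helper: upper-cases a single ISO/IEC 8859-1 byte.
unsigned char Latin1ToUpper(unsigned char c)
{
    // Basic Latin letters.
    if (c >= 'a' && c <= 'z')
    {
        return static_cast<unsigned char>(c - 0x20);
    }
    // 0xE0..0xFE are lower-case accented letters whose upper-case forms sit
    // 0x20 lower, except 0xF7, which is the division sign.
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
    {
        return static_cast<unsigned char>(c - 0x20);
    }
    // 0xDF (sharp s) and 0xFF (y with diaeresis) have no single upper-case
    // character inside ISO/IEC 8859-1, so they are returned unchanged.
    return c;
}
```

The two characters left unchanged are examples of why a simple per-character table is not enough once the full Unicode range is in play.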
Notes about C++ source code specific to East Asian encodings
Take care when compiling C++ source code on Windows running with a single-byte character code page (e.g. CP437 United States) if the source code uses an East Asian double-byte character encoding such as CP932 (Japanese), CP936 (Simplified Chinese) or CP950 (Traditional Chinese). These East Asian character encodings use 0x81-0xFE for the first byte and 0x40-0xFE for the second byte. A value of 0x5C in the second byte will be interpreted as a backslash in ASCII/Latin-1, which has a special meaning in C++ (an escape sequence inside a string literal, and line continuation at the end of a line).
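Source files can be scanned for this hazard before they reach the compiler. The sketch below, in standard C++, uses the simplified byte ranges given in the paragraph above (any byte in 0x81-0xFE starts a two-byte character). That is an assumption: real code pages have narrower lead-byte ranges (CP932, for instance, uses 0xA1-0xDF for single-byte half-width katakana), so a production tool would need per-code-page tables. `FindBackslashTrailBytes` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scanner: returns the byte offsets of double-byte characters
// whose second byte is 0x5C, using the simplified lead-byte range 0x81..0xFE.
std::vector<std::size_t> FindBackslashTrailBytes(const std::string& Bytes)
{
    std::vector<std::size_t> Offsets;
    std::size_t i = 0;
    while (i < Bytes.size())
    {
        const unsigned char c = static_cast<unsigned char>(Bytes[i]);
        const bool bLead = (c >= 0x81 && c <= 0xFE);
        if (bLead && i + 1 < Bytes.size())
        {
            if (static_cast<unsigned char>(Bytes[i + 1]) == 0x5C)
            {
                Offsets.push_back(i); // offset of the lead byte
            }
            i += 2;                   // skip lead and trail byte together
        }
        else
        {
            ++i;                      // single-byte character
        }
    }
    return Offsets;
}
```

A genuine backslash that does not follow a lead byte is not reported, so ordinary escape sequences and line continuations are left alone.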
When compiling that source code on single-byte code page Windows, the compiler is unaware of the East Asian double-byte character encoding, and this can cause either a compile error or, worse, a bug in the EXE.
Single-line comments:
These can cause difficult-to-find bugs, or errors caused by a missing line, if an East Asian comment ends with a 0x5c byte.
// EastAsianCharacterCommentThatContains0x5cInTheEndOfComment0x5c'\'
important_function(); /* this line would be connected to above line as part of comment */
Inside a string literal:
This can cause a broken string or an error from an unrecognized 0x5c escape sequence.
printf("EastAsianCharacterThatContains0x5c'\'AndIfContains0x5cInTheEndOfString0x5c'\'");
function();
printf("Compiler recognizes left double quotation mark in this line as the end of string literal that continued from first line, and expected this message is C++ code.");
If the character following 0x5c specifies an escape sequence, the compiler converts the escape sequence to the single character it denotes.
(If it does not, the result is implementation-defined; MSVC removes the 0x5c and warns "unrecognized character escape sequence".)
In the above case, the string ends with a 0x5c backslash and the next character is a double quote, so the escape sequence \" is converted to a double quote in the string data. The compiler then continues building the string until the next double quote or the end of the file, which causes an error.
Examples of dangerous characters:
CP932 (Japanese Shift-JIS) "表" is 0x955C, and many other CP932 characters also contain 0x5C.
CP936 (Simplified Chinese GBK) "乗" is 0x815C, and many other CP936 characters also contain 0x5C.
CP950 (Traditional Chinese Big5) "功" is 0xA55C, and many other CP950 characters also contain 0x5C.
CP949 (Korean, EUC-KR) is OK, because EUC-KR doesn't use 0x5C for the second byte.
UTF-8 without BOM (some text editors describe the BOM as a signature)
Take care when compiling C++ source code on Windows with an East Asian code page, CP949 (Korean), CP932 (Japanese), CP936 (Simplified Chinese) or CP950 (Traditional Chinese), if that source code contains East Asian characters stored as UTF-8. UTF-8 uses three bytes for East Asian characters: 0xE0-0xEF for the first byte, and 0x80-0xBF for the second and third bytes. Without the BOM, East Asian Windows' default encoding reads the three UTF-8 encoded bytes and the following byte as two 2-byte East Asian characters: the first and second bytes are paired to form the first character, and the third byte and the following byte are paired to form the second.
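The pairing described above can be modelled in a few lines of standard C++. This sketch follows the document's simplified description (any byte in 0x81-0xFE is treated as a lead byte that consumes the next byte). It is an assumption for illustration: real code pages restrict which trail bytes are valid, and what a particular compiler does with an invalid trail byte is not modelled. `SplitAsDoubleByte` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: splits a byte string into the units a double-byte
// decoder would see if every byte in 0x81..0xFE swallowed the next byte.
std::vector<std::string> SplitAsDoubleByte(const std::string& Bytes)
{
    std::vector<std::string> Units;
    std::size_t i = 0;
    while (i < Bytes.size())
    {
        const unsigned char c = static_cast<unsigned char>(Bytes[i]);
        const bool bLead = (c >= 0x81 && c <= 0xFE);
        const std::size_t Len = (bLead && i + 1 < Bytes.size()) ? 2 : 1;
        Units.push_back(Bytes.substr(i, Len));
        i += Len;
    }
    return Units;
}
```

Feeding it one three-byte UTF-8 character followed by `*/` yields the units (byte 1, byte 2), (byte 3, `*`) and `/`, so the comment terminator is no longer visible. With two such characters (six bytes) the pairs line up and `*/` survives, which is why only an odd number of characters triggers the problem.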
Problems can occur if the character following the three UTF-8 encoded bytes has a special meaning in string literals or comments, for example:
In an in-line comment:
This causes hard-to-find bugs or errors from missing code if the comment text contains an odd number of East Asian characters and the next character marks the end of the comment.
/*OddNumberOfEastAsianCharacterComment*/ important_function(); /*normal comment*/
The compiler on East Asian code page Windows reads the last byte of the UTF-8 encoded East Asian comment text and the asterisk '*' as a single East Asian character, so the following characters are treated as still part of the comment. In the above case, the compiler removes important_function() because it appears to be part of the comment.
This behavior is very dangerous, and the missing code is difficult to find.
In a single-line comment:
Using a backslash '\' at the end of an East Asian comment causes hard-to-find bugs or errors, because the backslash is absorbed into the preceding character and the next line is not joined to the comment.
// OddNumberOfEastAsianCharacterComment\
description(); /* coder intended this line as comment, by using backslash at the end of above line */
This is a very rare case, because programmers shouldn't intentionally write backslashes '\' at the end of comments.
Inside string literals:
This causes broken strings, errors or warnings when an odd number of UTF-8 encoded East Asian characters are inside a string literal and the following character has a special meaning.
printf("OddNumberOfEastAsiaCharacterString");
printf("OddNumberOfEastAsiaCharacterString%d",0);
printf("OddNumberOfEastAsiaCharacterString\n");
The C++ compiler on East Asian code page Windows interprets the last byte of the UTF-8 encoded East Asian character string and the next character as a single East Asian character. If you are lucky, compiler warning C4819 (if not disabled) or an error will alert you to the problem. If you are unlucky, the string will be broken.
Conclusion
You can use UTF-8 or default Windows encoding for C++ source code, but please be aware of these problems. Again, we don't recommend string literals inside C++ or UnrealScript source. Please make sure to use an East Asian code page as your default if you have to use an East Asian character encoding in C++ source code.
Another good option is to use UTF-8 with a BOM (some text editors describe the BOM as a Unicode signature).
Note
We tested a few compilers with UTF-8 and UTF-16 on 18 Feb 2010.
MSVC for PC and Xbox 360, and gcc or slc for PS3 are able to compile UTF-8-encoded source code (with and without BOM). But UTF-16 (little-endian/big-endian) is supported only by MSVC.
Perforce is able to work with both UTF-16 and UTF-8, but p4 diff displays the BOM in UTF-8 files as a visible character.
External reference: Code Pages Supported by Windows