Notes about C++ Source Code Specific to East Asian Encodings
Both UTF-8 and the default Windows encodings can cause problems with the C++ compiler, as follows:
Default Windows encoding
Take care when compiling C++ source code on Windows running with a single-byte character code page (e.g. CP437 United States) if the source code uses an East Asian double-byte character encoding such as CP932 (Japanese), CP936 (Simplified Chinese), or CP950 (Traditional Chinese).
These East Asian character encodings use 0x81-0xFE for the first byte and 0x40-0xFE for the second byte. A value of 0x5C in the second byte is interpreted as a backslash in ASCII/Latin-1, which has a special meaning in C++: it starts an escape sequence inside a string literal, and it is a line continuation when it is the last character on a line.
When compiling such source code on Windows with a single-byte code page, the compiler is not aware of the East Asian double-byte encoding, which can cause either a compile error or, worse, a bug in the EXE.
In a single-line comment:
A 0x5C byte at the end of an East Asian comment can cause difficult-to-find bugs or errors, because the line after the comment is joined to it and goes missing.
important_function(); /* this line would be connected to above line as part of comment */
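The line-joining effect can be reproduced without any East Asian text in the editor. The following is a minimal sketch, not the compiler's actual implementation: splice_lines is a hypothetical helper that models only the line-splicing step (a backslash directly followed by a newline is deleted), and the sample comment is written byte by byte with the CP932 bytes 0x95 0x5C at its end. It ignores CR LF line endings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified model of C++ line splicing: every backslash that is
// immediately followed by a newline is deleted together with the
// newline, joining two physical lines into one logical line.
// (Hypothetical helper for illustration only.)
std::string splice_lines(const std::string& source) {
    std::string out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n') {
            ++i;  // skip the backslash now, the loop increment skips the newline
            continue;
        }
        out += source[i];
    }
    return out;
}

// Usage sketch: a single-line comment whose last character is the
// double-byte sequence 0x95 0x5C, followed by a line of real code.
void example_splice() {
    const std::string source =
        std::string("// comment \x95\x5C") + "\n" +
        "important_function();\n";
    const std::string spliced = splice_lines(source);
    // The 0x5C byte and the newline are gone, so the call now sits on
    // the comment line and would never be compiled.
    assert(spliced == std::string("// comment \x95") + "important_function();\n");
}
```

Calling example_splice() passes its assertion, which shows that only the first byte of the double-byte character is left and that important_function() has become part of the comment.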
Inside a string literal:
This can cause a broken string or an error from an unrecognized 0x5C escape sequence.
printf("Compiler recognizes left double quotation mark in this line as the end of string literal that continued from first line, and expected this message is C++ code.");
If the character following 0x5C forms a valid escape sequence, the compiler converts the escape sequence to the single character it specifies.
(If it does not, the result is implementation-defined; MSVC removes the 0x5C and warns "unrecognized character escape sequence".)
In the above case, the string ends with a 0x5C backslash followed by a double quote, so the escape sequence \" is converted to a double quote in the string data. The compiler then keeps reading string data until the next double quote or the end of the file, which causes an error.
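To make the runaway string concrete, here is a small sketch of a byte-oriented scanner that looks for the closing quote of a string literal the way a compiler on a single-byte code page would. find_closing_quote is a hypothetical helper, not part of any compiler, and the sample source is an assumption built byte by byte with the CP932 bytes 0x95 0x5C as the last character of the first string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the offset of the quote that closes the string literal whose
// opening quote is at offset `open`, or std::string::npos if none is
// found. Every byte is treated as one character, so a 0x5C trail byte
// is taken for a backslash that escapes the next byte.
// (Hypothetical helper for illustration only.)
std::size_t find_closing_quote(const std::string& src, std::size_t open) {
    for (std::size_t i = open + 1; i < src.size(); ++i) {
        if (src[i] == '\\') {
            ++i;  // the byte after a backslash is part of an escape sequence
            continue;
        }
        if (src[i] == '"') {
            return i;
        }
    }
    return std::string::npos;
}

// Usage sketch: the first literal ends with the bytes 0x95 0x5C.
void example_runaway_string() {
    const std::string src =
        std::string("printf(\"") + "\x95\x5C" + "\");" + "puts(\"x\");";
    // The intended closing quote is at offset 10, but it is consumed as
    // the escape sequence \" and the scanner stops at offset 18, the
    // opening quote of the next literal.
    assert(find_closing_quote(src, 7) == 18);
}
```

The scanner ends the first literal at the quote that was meant to open the second one, so everything in between becomes string data and the rest of the line no longer parses as intended.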
Examples of dangerous characters:
CP932 (Japanese Shift-JIS) "表" is 0x955C, and many other CP932 characters contain 0x5C.
CP936 (Simplified Chinese GBK) "乗" is 0x815C, and many other CP936 characters contain 0x5C.
CP950 (Traditional Chinese Big5) "功" is 0xA55C, and many other CP950 characters contain 0x5C.
CP949 (Korean, EUC-KR) is OK, because EUC-KR does not use 0x5C for the second byte.
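A source file can be checked for such characters before it reaches the compiler. The sketch below is one possible approach under a stated assumption: it treats every byte in 0x81-0xFE as a lead byte, which is the range given in this section, while real code pages use narrower ranges. find_trail_backslashes is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the offsets of every 0x5C byte that is the second byte of a
// double-byte character. Assumption: lead bytes are 0x81-0xFE, the
// range given in this section; real code pages are narrower.
// (Hypothetical helper for illustration only.)
std::vector<std::size_t> find_trail_backslashes(const std::string& bytes) {
    std::vector<std::size_t> hits;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        const bool is_lead = c >= 0x81 && c <= 0xFE;
        if (is_lead && i + 1 < bytes.size()) {
            if (static_cast<unsigned char>(bytes[i + 1]) == 0x5C) {
                hits.push_back(i + 1);
            }
            i += 2;  // consume the whole double-byte character
        } else {
            ++i;     // single-byte character
        }
    }
    return hits;
}

// Usage sketch: 'a', then 0x95 0x5C (one CP932 character), then an
// ordinary ASCII backslash that must not be reported.
void example_scan() {
    const std::string text = std::string("a\x95\x5C") + "\\";
    const std::vector<std::size_t> hits = find_trail_backslashes(text);
    assert(hits.size() == 1);
    assert(hits[0] == 2);
}
```

Only the 0x5C at offset 2 is reported, because it is the second byte of a double-byte character. The plain backslash at offset 3 is left alone.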
UTF-8 without BOM (Some text editors describe BOM as signature)
Take care when compiling C++ source code on Windows with an East Asian code page, CP949 (Korean), CP932 (Japanese), CP936 (Simplified Chinese) or CP950 (Traditional Chinese), if that source code contains East Asian characters stored as UTF-8.
UTF-8 uses three bytes for East Asian characters: 0xE0-0xEF for the first byte and 0x80-0xBF for each of the second and third bytes. Without the BOM, a compiler using the East Asian Windows default encoding reads the three UTF-8 bytes plus the following byte as two 2-byte East Asian characters: the first and second bytes form one character, and the third byte and the following byte form another.
Problems can occur if the character following the UTF-8 encoded three bytes has special meaning in the string literals or comments.
E.g. in an in-line comment:
This causes hard-to-find bugs or errors with missing code if the comment text contains an odd number of East Asian characters and the next character marks the end of the comment.
The compiler on East Asian code page Windows reads the last byte of the UTF-8 encoded East Asian comment text and the asterisk * as a single East Asian character, so the characters that follow are still treated as part of the comment. A call such as important_function() placed right after the comment is then removed, because the compiler treats it as part of the comment.
This behavior is very dangerous and it is difficult to find the missing code.
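The odd/even effect described above can be modelled in a few lines. This is a simplified sketch, not how any particular compiler is implemented: comment_end_visible is a hypothetical helper that pairs every byte in 0x81-0xFE (the lead-byte range assumed in this section) with the byte after it, then checks whether the closing */ is still seen. The test strings are the UTF-8 bytes of U+3042 (0xE3 0x81 0x82) and U+3044 (0xE3 0x81 0x84).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if a scanner that reads the bytes as a double-byte
// encoding still finds the "*/" that ends the comment. Assumption:
// every byte in 0x81-0xFE is a lead byte and swallows the next byte.
// (Hypothetical helper for illustration only.)
bool comment_end_visible(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c >= 0x81 && c <= 0xFE) {
            i += 2;  // lead byte plus whatever byte follows it
            continue;
        }
        if (c == '*' && i + 1 < bytes.size() && bytes[i + 1] == '/') {
            return true;
        }
        ++i;
    }
    return false;
}

// Usage sketch: comment text stored as UTF-8, followed by "*/".
void example_comment_end() {
    // One character (3 bytes): the third byte pairs with '*', so the
    // terminator is lost.
    assert(!comment_end_visible("\xE3\x81\x82*/"));
    // Two characters (6 bytes): the bytes pair up evenly and the
    // terminator survives.
    assert(comment_end_visible("\xE3\x81\x82\xE3\x81\x84*/"));
}
```

With an odd number of three-byte characters the asterisk is swallowed, and the comment runs on until some later */ in the file.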
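In a single-line comment:
A backslash '\' at the end of an East Asian comment causes hard-to-find bugs or errors because the expected line continuation does not happen: the backslash is paired with the last UTF-8 byte of the comment, so the next line is not joined to the comment and is compiled as code.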
description(); /* coder intended this line as comment, by using backslash at the end of above line */
This is a very rare case, because programmers should not intentionally write a backslash '\' at the end of a comment.
Inside string literals:
This causes broken strings, errors or warnings when a string literal contains an odd number of UTF-8 encoded East Asian characters and the following character has a special meaning.
The C++ compiler on East Asian code page Windows interprets the last byte of the UTF-8 encoded East Asian string and the next character as a single East Asian character. If you are lucky, compiler warning "C4819" (if not disabled) or an error will alert you to the problem. If you are unlucky, the string is silently broken.
You can use UTF-8 or the default Windows encoding for C++ source code, but please be aware of these problems. Again, we do not recommend string literals inside C++ source. If you have to use an East Asian character encoding in C++ source code, make sure that your default Windows code page is the matching East Asian code page.
Another good option is to use UTF-8 with a BOM (some text editors describe the BOM as a Unicode signature).
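Whether a file already carries the BOM can be checked from its first three bytes, which for UTF-8 are 0xEF 0xBB 0xBF. The following is a minimal sketch. has_utf8_bom is a hypothetical helper name, and the caller is assumed to have read the file contents into a std::string in binary mode.

```cpp
#include <cassert>
#include <string>

// Returns true if the buffer starts with the UTF-8 byte order mark
// (0xEF 0xBB 0xBF). The buffer is assumed to hold the raw file bytes.
// (Hypothetical helper for illustration only.)
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Usage sketch: the same source text with and without the BOM.
void example_bom() {
    const std::string with_bom = std::string("\xEF\xBB\xBF") + "int x;";
    const std::string without_bom = "int x;";
    assert(has_utf8_bom(with_bom));
    assert(!has_utf8_bom(without_bom));
}
```

A check like this could run as a pre-submit step to flag UTF-8 source files that were saved without the signature.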
We tested a few compilers with UTF-8 and UTF-16 on 18 Feb 2010.
MSVC for PC and Xbox 360, and gcc or slc for PS3 are able to compile UTF-8-encoded source code (with and without BOM).
But UTF-16 (little-endian/big-endian) is supported only by MSVC.
Perforce is able to work with both UTF-16 and UTF-8, but p4 diff displays the BOM in UTF-8 files as a visible character.
External reference: Code Pages Supported by Windows