+@node Character sets
+@section Character sets
+
+Source code character set processing in C and related languages is
+rather complicated. The C standard discusses two character sets, but
+there are really at least four.
+
+The files input to CPP might be in any character set at all. CPP's
+very first action, before it even looks for line boundaries, is to
+convert the file into the character set it uses for internal
+processing. That set is what the C standard calls the @dfn{source}
+character set. It must be isomorphic with ISO 10646, also known as
+Unicode. CPP uses the UTF-8 encoding of Unicode.
+
+At present, GNU CPP does not implement conversion from arbitrary file
+encodings to the source character set. Use of any encoding other than
+plain ASCII or UTF-8, except in comments, will cause errors. Use of
+encodings that are not strict supersets of ASCII, such as Shift JIS,
+may cause errors even if non-ASCII characters appear only in comments.
+We plan to fix this in the near future.
+
+All preprocessing work (the subject of the rest of this manual) is
+carried out in the source character set. If you request textual
+output from the preprocessor with the @option{-E} option, it will be
+in UTF-8.
+
+After preprocessing is complete, string and character constants are
+converted again, into the @dfn{execution} character set. This
+character set is under control of the user; the default is UTF-8,
+matching the source character set. Wide string and character
+constants have their own character set, which is not called out
+specifically in the standard. Again, it is under control of the user.
+The default is UTF-16 or UTF-32, whichever fits in the target's
+@code{wchar_t} type, in the target machine's byte
+order.@footnote{UTF-16 does not meet the requirements of the C
+standard for a wide character set, but the choice of 16-bit
+@code{wchar_t} is enshrined in some system ABIs so we cannot fix
+this.} Octal and hexadecimal escape sequences do not undergo
+conversion; @t{'\x12'} has the value 0x12 regardless of the currently
+selected execution character set. All other escapes are replaced by
+the character in the source character set that they represent, then
+converted to the execution character set, just like unescaped
+characters.
+
+GCC does not permit the use of characters outside the ASCII range, nor
+@samp{\u} and @samp{\U} escapes, in identifiers. We hope this will
+change eventually, but there are problems with the standard semantics
+of such ``extended identifiers'' which must be resolved through the
+ISO C and C++ committees first.
+