gcc/cppinternals.texi

   1 \input texinfo
   2 @setfilename cppinternals.info
   3 @settitle The GNU C Preprocessor Internals
   4
   5 @ifinfo
   6 @dircategory Programming
   7 @direntry
   8 * Cpplib:                      Cpplib internals.
   9 @end direntry
  10 @end ifinfo
  11
  12 @c @smallbook
  13 @c @cropmarks
  14 @c @finalout
  15 @setchapternewpage odd
  16 @ifinfo
  17 This file documents the internals of the GNU C Preprocessor.
  18
  19 Copyright 2000 Free Software Foundation, Inc.
  20
  21 Permission is granted to make and distribute verbatim copies of
  22 this manual provided the copyright notice and this permission notice
  23 are preserved on all copies.
  24
  25 @ignore
  26 Permission is granted to process this file through TeX and print the
  27 results, provided the printed document carries copying permission
  28 notice identical to this one except for the removal of this paragraph
  29 (this paragraph not being relevant to the printed manual).
  30
  31 @end ignore
  32 Permission is granted to copy and distribute modified versions of this
  33 manual under the conditions for verbatim copying, provided also that
  34 the entire resulting derived work is distributed under the terms of a
  35 permission notice identical to this one.
  36
  37 Permission is granted to copy and distribute translations of this manual
  38 into another language, under the above conditions for modified versions.
  39 @end ifinfo
  40
  41 @titlepage
  42 @c @finalout
  43 @title Cpplib Internals
  44 @subtitle Last revised Dec 2000
  45 @subtitle for GCC version 3.0
  46 @author Neil Booth
  47 @page
  48 @vskip 0pt plus 1filll
  49 @c man begin COPYRIGHT
  50 Copyright @copyright{} 2000
  51 Free Software Foundation, Inc.
  52
  53 Permission is granted to make and distribute verbatim copies of
  54 this manual provided the copyright notice and this permission notice
  55 are preserved on all copies.
  56
  57 Permission is granted to copy and distribute modified versions of this
  58 manual under the conditions for verbatim copying, provided also that
  59 the entire resulting derived work is distributed under the terms of a
  60 permission notice identical to this one.
  61
  62 Permission is granted to copy and distribute translations of this manual
  63 into another language, under the above conditions for modified versions.
  64 @c man end
  65 @end titlepage
  66 @page
  67
  68 @node Top, Conventions,, (DIR)
  69 @chapter Cpplib - the core of the GNU C Preprocessor
  70
  71 The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
  72 now implemented as a library, cpplib, so it can be easily shared between
  73 a stand-alone preprocessor, and a preprocessor integrated with the C,
  74 C++ and Objective C front ends.  It is also available for use by other
  75 programs, though this is not recommended as its exposed interface has
  76 not yet reached a point of reasonable stability.
  77
  78 This library has been written to be re-entrant, so that it can be used
  79 to preprocess many files simultaneously if necessary.  It has also been
  80 written with the preprocessing token as the fundamental unit; the
  81 preprocessor in previous versions of GCC would operate on text strings
  82 as the fundamental unit.
  83
  84 This brief manual documents some of the internals of cpplib, and a few
  85 tricky issues encountered.  It also describes certain behaviour we would
  86 like to preserve, such as the format and spacing of its output.
  87
  88 Identifiers, macro expansion, hash nodes, lexing.
  89
  90 @menu
  91 * Conventions::     Conventions used in the code.
  92 * Lexer::           The combined C, C++ and Objective C Lexer.
  93 * Whitespace::      Input and output newlines and whitespace.
  94 * Concept Index::   Index of concepts and terms.
  95 * Index::           Index.
  96 @end menu
  97
  98 @node Conventions, Lexer, Top, Top
  99
 100 cpplib has two interfaces - one is exposed internally only, and the
 101 other is for both internal and external use.
 102
 103 The convention is that functions and types that are exposed to multiple
 104 files internally are prefixed with @samp{_cpp_}, and are to be found in
 105 the file @samp{cpphash.h}.  Functions and types exposed to external
 106 clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}.
 107
 108 We are striving to reduce the information exposed in cpplib.h to the
 109 bare minimum necessary, and then to keep it there.  This makes clear
 110 exactly what external clients are entitled to assume, and allows us to
 111 change internals in the future without worrying whether library clients
 112 are perhaps relying on some kind of undocumented implementation-specific
 113 behaviour.
 114
 115 @node Lexer, Whitespace, Conventions, Top
 116
 117 The lexer is contained in the file @samp{cpplex.c}.  We want to have a
 118 lexer that is single-pass, for efficiency reasons.  We would also like
 119 the lexer to only step forwards through the input files, and not step
 120 back.  This will make future changes to support different character
 121 sets, in particular state or shift-dependent ones, much easier.
 122
 123 This file also contains all information needed to spell a token, i.e. to
 124 output it either in a diagnostic or to a preprocessed output file.  This
 125 information is not exported, but made available to clients through such
 126 functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
 127
 128 The most painful aspect of lexing ISO-standard C and C++ is handling
 129 trigraphs and backslash-escaped newlines.  Trigraphs are processed before
 130 any interpretation of the meaning of a character is made, and unfortunately
 131 there is a trigraph representation for a backslash, so it is possible for
 132 the trigraph @samp{??/} to introduce an escaped newline.
 133
 134 Escaped newlines are tedious because theoretically they can occur
 135 anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
 136 within the characters of an identifier, and even between the @samp{*}
 137 and @samp{/} that terminates a comment.  Moreover, you cannot be sure
 138 there is just one - there might be an arbitrarily long sequence of them.
 139
 140 So the routine @samp{parse_identifier}, which lexes an identifier, cannot
 141 assume that it can scan forwards until the first non-identifier
 142 character and be done with it, because this could be the @samp{\}
 143 introducing an escaped newline, or the @samp{?} introducing the trigraph
 144 sequence that represents the @samp{\} of an escaped newline.  Similarly
 145 for the routine that handles numbers, @samp{parse_number}.  If these
 146 routines stumble upon a @samp{?} or @samp{\}, they call
 147 @samp{skip_escaped_newlines} to skip over any potential escaped newlines
 148 before checking whether they can finish.
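
The following is a minimal sketch of that scanning strategy, not the code in @samp{cpplex.c}.  The names @samp{skip_one_escaped_newline}, @samp{skip_escaped_newlines_sketch} and @samp{scan_identifier_sketch} are hypothetical, and the sketch assumes a NUL-terminated buffer, trigraphs always enabled, and newlines already normalized to a single @samp{\n}.

```c
#include <ctype.h>
#include <stddef.h>

/* If an escaped newline starts at p (a backslash, or the trigraph for
   a backslash, followed immediately by '\n'), return a pointer just
   past it; otherwise return p unchanged.  */
static const char *skip_one_escaped_newline(const char *p)
{
    if (p[0] == '\\' && p[1] == '\n')
        return p + 2;
    if (p[0] == '?' && p[1] == '?' && p[2] == '/' && p[3] == '\n')
        return p + 4;
    return p;
}

/* Skip an arbitrarily long run of escaped newlines.  */
static const char *skip_escaped_newlines_sketch(const char *p)
{
    for (;;) {
        const char *q = skip_one_escaped_newline(p);
        if (q == p)
            return p;
        p = q;
    }
}

/* Copy the identifier starting at src into out (capacity cap, counting
   the terminating NUL), splicing out escaped newlines.  Returns a
   pointer to the first input character that is not part of it.  */
const char *scan_identifier_sketch(const char *src, char *out, size_t cap)
{
    size_t n = 0;
    for (;;) {
        /* Only these two characters can begin an escaped newline.  */
        if (*src == '\\' || *src == '?')
            src = skip_escaped_newlines_sketch(src);
        if (!(isalnum((unsigned char)*src) || *src == '_'))
            break;
        if (n + 1 < cap)
            out[n++] = *src;
        src++;
    }
    if (cap > 0)
        out[n] = '\0';
    return src;
}
```

Given the input @samp{fo}, backslash-newline, @samp{o}, trigraph-newline, backslash-newline, @samp{bar+1}, this sketch yields the identifier @samp{foobar} and stops at the @samp{+}.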
 149
 150 Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
 151 check for a @samp{=} after a @samp{+} character to determine whether it
 152 has a @samp{+=} token; it needs to be prepared for an escaped newline of
 153 some sort.  These cases use the function @samp{get_effective_char},
 154 which returns the first character after any intervening escaped newlines.
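
As an illustration only, the lookahead for @samp{+} might be sketched as below.  The @samp{struct cursor} type and the functions @samp{effective_char} and @samp{lex_plus} are hypothetical, only backslash-newline pairs are recognized (not the trigraph form), and the sketch backs up by one position when the lookahead fails, which the real lexer would prefer to avoid.

```c
/* Hypothetical cursor over a NUL-terminated buffer.  */
struct cursor {
    const char *cur;
};

/* Return the next character after any run of backslash-newline pairs
   and advance past it.  Returns '\0' at end of input without
   advancing past the terminator.  */
static int effective_char(struct cursor *c)
{
    while (c->cur[0] == '\\' && c->cur[1] == '\n')
        c->cur += 2;
    if (*c->cur == '\0')
        return '\0';
    return (unsigned char)*c->cur++;
}

enum tok { TOK_PLUS, TOK_PLUS_EQ, TOK_PLUS_PLUS, TOK_OTHER };

/* Classify the token starting at c->cur, which should be a '+'.  */
enum tok lex_plus(struct cursor *c)
{
    const char *save;
    int ch;

    if (*c->cur != '+')
        return TOK_OTHER;
    c->cur++;
    save = c->cur;
    ch = effective_char(c);
    if (ch == '=')
        return TOK_PLUS_EQ;
    if (ch == '+')
        return TOK_PLUS_PLUS;
    c->cur = save;   /* Not part of this token: un-read it.  */
    return TOK_PLUS;
}
```

With this sketch, a @samp{+} followed by two escaped newlines and then @samp{=} is still classified as @samp{TOK_PLUS_EQ}.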
 155
 156 The lexer needs to keep track of the correct column position,
 157 including counting tabs as specified by the @samp{-ftabstop=} option.
 158 This should be done even within comments; C-style comments can appear in
 159 the middle of a line, and we want to report diagnostics in the correct
 160 position for text appearing after the end of the comment.
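
A rough sketch of tab-aware column counting follows.  It assumes 1-based columns and tab stops every @samp{tabstop} columns, which is one common reading of a tab-stop setting; the functions @samp{advance_column} and @samp{column_of} are hypothetical and not part of cpplib.

```c
#include <stddef.h>

/* Return the 1-based column reached after consuming ch at 1-based
   column col.  A tab moves to the column just after the next multiple
   of tabstop; every other character advances by one.  */
unsigned int advance_column(unsigned int col, int ch, unsigned int tabstop)
{
    if (ch == '\t' && tabstop > 0)
        return ((col - 1) / tabstop + 1) * tabstop + 1;
    return col + 1;
}

/* Column of the character at the given offset within line.  Every
   character before it is counted, including those inside comments.  */
unsigned int column_of(const char *line, size_t offset, unsigned int tabstop)
{
    unsigned int col = 1;
    size_t i;

    for (i = 0; i < offset && line[i] != '\0'; i++)
        col = advance_column(col, line[i], tabstop);
    return col;
}
```

For a line consisting of a tab, the comment @samp{/* c */}, a space and then @samp{x}, this sketch places @samp{x} at column 17 with a tab stop of 8, and at column 13 with a tab stop of 4.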
 161
 162 Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
 163 may be invalid and require a diagnostic.  However, if they appear in a
 164 macro expansion we don't want to complain with each use of the macro.
 165 It is therefore best to catch them during the lexing stage, in
 166 @samp{parse_identifier}.  In both cases, whether a diagnostic is needed
 167 or not is dependent upon lexer state.  For example, we don't want to
 168 issue a diagnostic for re-poisoning a poisoned identifier, or for using
 169 @samp{__VA_ARGS__} in the expansion of a variable-argument macro.
 170 Therefore @samp{parse_identifier} makes use of flags to determine
 171 whether a diagnostic is appropriate.  Since we change state on a
 172 per-token basis, and don't lex whole lines at a time, this is not a
 173 problem.
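
The idea can be sketched as follows.  The @samp{struct lexer_state} fields, the @samp{ident_diag} enumeration and the function @samp{check_identifier} are all hypothetical names chosen for illustration; the actual flags used by @samp{parse_identifier} may be organized differently.

```c
#include <string.h>

/* Hypothetical per-token lexer state.  */
struct lexer_state {
    int va_args_ok;    /* nonzero while lexing a variable-argument
                          macro's expansion */
    int poisoned_ok;   /* nonzero while lexing the identifiers being
                          poisoned */
};

enum ident_diag { DIAG_NONE, DIAG_VA_ARGS, DIAG_POISONED };

/* Decide which diagnostic, if any, an identifier needs in the current
   state.  is_poisoned says whether name was previously poisoned.  */
enum ident_diag check_identifier(const struct lexer_state *st,
                                 const char *name, int is_poisoned)
{
    if (strcmp(name, "__VA_ARGS__") == 0 && !st->va_args_ok)
        return DIAG_VA_ARGS;
    if (is_poisoned && !st->poisoned_ok)
        return DIAG_POISONED;
    return DIAG_NONE;
}
```

Because the check runs once, when the token is lexed, a macro whose stored expansion contains such an identifier does not trigger the diagnostic again on each use.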
 174
 175 Another place where state flags are used to change behaviour is whilst
 176 parsing header names.  Normally, a @samp{<} would be lexed as a single
 177 token.  After a @samp{#include} directive, though, it begins a header
 178 name, lexed as one token up to the nearest @samp{>} character.  Note that
 179 we don't allow the terminators of header names to be escaped; the first
 180 @samp{"} or @samp{>} terminates the header name.
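
A minimal sketch of the rule that the first terminator always ends the header name is given below.  The function @samp{lex_header_name} is hypothetical, and the sketch assumes it is called only when the lexer state says a header name is expected.

```c
#include <stddef.h>

/* Lex a header name starting at src, which must point at '<' or '"'.
   The name, without its delimiters, is copied into out (capacity cap,
   counting the terminating NUL).  Returns a pointer just past the
   terminator, or NULL if src does not start a header name or no
   terminator appears before the end of the line.  out is only
   meaningful on success.  */
const char *lex_header_name(const char *src, char *out, size_t cap)
{
    char term;
    size_t n = 0;

    if (*src == '<')
        term = '>';
    else if (*src == '"')
        term = '"';
    else
        return NULL;

    /* A backslash gets no special treatment: the first terminator
       character ends the name.  */
    for (src++; *src != term; src++) {
        if (*src == '\0' || *src == '\n')
            return NULL;
        if (n + 1 < cap)
            out[n++] = *src;
    }
    if (cap > 0)
        out[n] = '\0';
    return src + 1;
}
```

Applied to @samp{<sys/a\>b.h>}, the sketch produces the name @samp{sys/a\} and leaves @samp{b.h>} unread.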
 181
 182 Interpretation of some character sequences depends upon whether we are
 183 lexing C, C++ or Objective C, and on the revision of the standard in
 184 force.  For example, @samp{@@foo} is a single identifier token in
 185 Objective C, but two separate tokens @samp{@@} and @samp{foo} in C or
 186 C++.  Such cases are handled in the main function @samp{_cpp_lex_token},
 187 based upon the flags set in the @samp{cpp_options} structure.
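
To illustrate the @samp{@@foo} case, here is a small sketch.  The @samp{struct options_sketch} type is a hypothetical stand-in for the relevant flag in @samp{cpp_options}, and @samp{at_token_length} is likewise invented for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the language flags held in cpp_options.  */
struct options_sketch {
    int objc;   /* nonzero when lexing Objective C */
};

/* src points at '@'.  Return the length of the token that starts
   there: the whole "@identifier" in Objective C, otherwise just the
   single '@' character.  */
size_t at_token_length(const struct options_sketch *opts, const char *src)
{
    size_t n = 1;

    if (!opts->objc)
        return 1;
    if (!(isalpha((unsigned char)src[1]) || src[1] == '_'))
        return 1;
    while (isalnum((unsigned char)src[n]) || src[n] == '_')
        n++;
    return n;
}
```

With @samp{objc} set, the input @samp{@@foo bar} gives a four-character token; with it clear, the token is the lone @samp{@@}.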
 188
 189 Note we have almost, but not quite, achieved the goal of not stepping
 190 backwards in the input stream.  Currently @samp{skip_escaped_newlines}
 191 does step back, though with care it should be possible to adjust it so
 192 that this does not happen.  One tricky case is meeting a trigraph when
 193 the command line option @samp{-trigraphs} is not in force but
 194 @samp{-Wtrigraphs} is: we need to warn about it, but then buffer it and
 195 continue to treat it as three separate characters.
 196
 197 @node Whitespace, Concept Index, Lexer, Top
 198
 199 The lexer has been written to treat each of @samp{\r}, @samp{\n},
 200 @samp{\r\n} and @samp{\n\r} as a single newline indicator.  This allows
 201 it to transparently preprocess MS-DOS, Macintosh and Unix files without
 202 their needing to pass through a special filter beforehand.
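
One way to recognize the four forms is sketched below.  The functions @samp{newline_length} and @samp{count_lines} are hypothetical, and the sketch assumes a NUL-terminated buffer; it is not the code of @samp{handle_newline}.

```c
#include <stddef.h>

/* If p points at a newline indicator, return its length in characters
   (1 or 2); otherwise return 0.  */
size_t newline_length(const char *p)
{
    if (*p != '\n' && *p != '\r')
        return 0;
    /* A different newline character directly after the first one
       belongs to the same indicator: "\r\n" or "\n\r".  Two identical
       characters are two separate newlines.  */
    if ((p[1] == '\n' || p[1] == '\r') && p[1] != p[0])
        return 2;
    return 1;
}

/* Count the newline indicators in a NUL-terminated text.  */
unsigned int count_lines(const char *text)
{
    unsigned int lines = 0;

    while (*text != '\0') {
        size_t len = newline_length(text);
        if (len > 0) {
            lines++;
            text += len;
        } else {
            text++;
        }
    }
    return lines;
}
```

In this sketch @samp{\r\n} and @samp{\n\r} each count as one newline, while @samp{\n\n} counts as two.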
 203
 204 We also decided to treat a backslash, either @samp{\} or the trigraph
 205 @samp{??/}, separated from one of the above newline forms by whitespace
 206 only (one or more space, tab, form-feed, vertical tab or NUL characters),
 207 as an intended escaped newline.  The library issues a diagnostic in this
 208 case.
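
The rule might be sketched like this.  The helper @samp{is_blank_sketch} and the function @samp{escaped_newline_at} are hypothetical; the sketch takes an explicit buffer length because NUL counts as whitespace here, and for brevity it recognizes only a literal backslash, not the trigraph form.

```c
#include <stddef.h>

/* The whitespace characters allowed between the backslash and the
   newline.  */
static int is_blank_sketch(char c)
{
    return c == ' ' || c == '\t' || c == '\f' || c == '\v' || c == '\0';
}

/* Test whether buf[pos] is a backslash that begins an escaped
   newline.  On success return 1, store the index of the first newline
   character in *newline_pos, and set *needs_diagnostic when
   whitespace separated the backslash from the newline.  Return 0 and
   leave the outputs untouched otherwise.  */
int escaped_newline_at(const char *buf, size_t len, size_t pos,
                       size_t *newline_pos, int *needs_diagnostic)
{
    size_t i;

    if (pos >= len || buf[pos] != '\\')
        return 0;
    i = pos + 1;
    while (i < len && is_blank_sketch(buf[i]))
        i++;
    if (i >= len || (buf[i] != '\n' && buf[i] != '\r'))
        return 0;
    *newline_pos = i;
    *needs_diagnostic = (i > pos + 1);
    return 1;
}
```

A caller would then pass the newline position on to whatever routine consumes the newline itself, so that the four newline forms are still handled in one place.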
 209
 210 Handling newlines in this way is made simpler by doing it in one place
 211 only.  The function @samp{handle_newline} takes care of all newline
 212 characters, and @samp{skip_escaped_newlines} takes care of all escaping
 213 of newlines, deferring to @samp{handle_newline} to handle the newlines
 214 themselves.
 215
 216 @node Concept Index, Index, Whitespace, Top
 217 @unnumbered Concept Index
 218 @printindex cp
 219
 220 @node Index,, Concept Index, Top
 221 @unnumbered Index of Directives, Macros and Options
 222 @printindex fn
 223
 224 @contents
 225 @bye