gcc/cppinternals.texi

   1 \input texinfo
   2 @setfilename cppinternals.info
   3 @settitle The GNU C Preprocessor Internals
   4
   5 @ifinfo
   6 @dircategory Programming
   7 @direntry
   8 * Cpplib: (cppinternals).      Cpplib internals.
   9 @end direntry
  10 @end ifinfo
  11
  12 @c @smallbook
  13 @c @cropmarks
  14 @c @finalout
  15 @setchapternewpage odd
  16 @ifinfo
  17 This file documents the internals of the GNU C Preprocessor.
  18
  19 Copyright 2000, 2001 Free Software Foundation, Inc.
  20
  21 Permission is granted to make and distribute verbatim copies of
  22 this manual provided the copyright notice and this permission notice
  23 are preserved on all copies.
  24
  25 @ignore
   26 Permission is granted to process this file through TeX and print the
  27 results, provided the printed document carries copying permission
  28 notice identical to this one except for the removal of this paragraph
  29 (this paragraph not being relevant to the printed manual).
  30
  31 @end ignore
  32 Permission is granted to copy and distribute modified versions of this
  33 manual under the conditions for verbatim copying, provided also that
  34 the entire resulting derived work is distributed under the terms of a
  35 permission notice identical to this one.
  36
  37 Permission is granted to copy and distribute translations of this manual
  38 into another language, under the above conditions for modified versions.
  39 @end ifinfo
  40
  41 @titlepage
  42 @c @finalout
  43 @title Cpplib Internals
  44 @subtitle Last revised Jan 2001
  45 @subtitle for GCC version 3.0
  46 @author Neil Booth
  47 @page
  48 @vskip 0pt plus 1filll
  49 @c man begin COPYRIGHT
  50 Copyright @copyright{} 2000, 2001
  51 Free Software Foundation, Inc.
  52
  53 Permission is granted to make and distribute verbatim copies of
  54 this manual provided the copyright notice and this permission notice
  55 are preserved on all copies.
  56
  57 Permission is granted to copy and distribute modified versions of this
  58 manual under the conditions for verbatim copying, provided also that
  59 the entire resulting derived work is distributed under the terms of a
  60 permission notice identical to this one.
  61
  62 Permission is granted to copy and distribute translations of this manual
  63 into another language, under the above conditions for modified versions.
  64 @c man end
  65 @end titlepage
  66 @page
  67
  68 @node Top, Conventions,, (DIR)
  69 @chapter Cpplib - the core of the GNU C Preprocessor
  70
  71 The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
  72 now implemented as a library, cpplib, so it can be easily shared between
  73 a stand-alone preprocessor, and a preprocessor integrated with the C,
  74 C++ and Objective C front ends.  It is also available for use by other
  75 programs, though this is not recommended as its exposed interface has
  76 not yet reached a point of reasonable stability.
  77
  78 This library has been written to be re-entrant, so that it can be used
  79 to preprocess many files simultaneously if necessary.  It has also been
  80 written with the preprocessing token as the fundamental unit; the
  81 preprocessor in previous versions of GCC would operate on text strings
  82 as the fundamental unit.
  83
  84 This brief manual documents some of the internals of cpplib, and a few
  85 tricky issues encountered.  It also describes certain behaviour we would
  86 like to preserve, such as the format and spacing of its output.
  87
   88 Topics covered include identifiers, macro expansion, hash nodes and lexing.
  89
  90 @menu
  91 * Conventions::     Conventions used in the code.
  92 * Lexer::           The combined C, C++ and Objective C Lexer.
  93 * Whitespace::      Input and output newlines and whitespace.
  94 * Hash Nodes::      All identifiers are hashed.
  95 * Macro Expansion:: Macro expansion algorithm.
  96 * Files::           File handling.
  97 * Concept Index::   Index of concepts and terms.
  98 * Index::           Index.
  99 @end menu
 100
 101 @node Conventions, Lexer, Top, Top
 102 @unnumbered Conventions
 103
 104 cpplib has two interfaces - one is exposed internally only, and the
 105 other is for both internal and external use.
 106
 107 The convention is that functions and types that are exposed to multiple
 108 files internally are prefixed with @samp{_cpp_}, and are to be found in
 109 the file @samp{cpphash.h}.  Functions and types exposed to external
 110 clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}.
 111
 112 We are striving to reduce the information exposed in cpplib.h to the
 113 bare minimum necessary, and then to keep it there.  This makes clear
 114 exactly what external clients are entitled to assume, and allows us to
 115 change internals in the future without worrying whether library clients
 116 are perhaps relying on some kind of undocumented implementation-specific
 117 behaviour.
 118
 119 @node Lexer, Whitespace, Conventions, Top
 120 @unnumbered The Lexer
 121
 122 The lexer is contained in the file @samp{cpplex.c}.  We want to have a
 123 lexer that is single-pass, for efficiency reasons.  We would also like
 124 the lexer to only step forwards through the input files, and not step
 125 back.  This will make future changes to support different character
 126 sets, in particular state or shift-dependent ones, much easier.
 127
 128 This file also contains all information needed to spell a token, i.e. to
 129 output it either in a diagnostic or to a preprocessed output file.  This
 130 information is not exported, but made available to clients through such
 131 functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
 132
 133 The most painful aspect of lexing ISO-standard C and C++ is handling
  134 trigraphs and backslash-escaped newlines.  Trigraphs are processed before
 135 any interpretation of the meaning of a character is made, and unfortunately
 136 there is a trigraph representation for a backslash, so it is possible for
 137 the trigraph @samp{??/} to introduce an escaped newline.
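As a rough illustration of why trigraphs complicate matters, the sketch
below maps the third character of a trigraph to its replacement, using
the nine trigraphs defined by ISO C, and then tests for the
trigraph-spelled escaped newline.  The function names are hypothetical
and are not taken from @samp{cpplex.c}; only a bare @samp{\n} is treated
as a newline here.

```c
/* Map the third character of a "??x" trigraph to the character it
   stands for, or return '\0' if "??x" is not a trigraph.  The nine
   mappings are those defined by ISO C.  (Hypothetical helper.)  */
static char trigraph_replacement(char third)
{
    switch (third) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return '\0';
    }
}

/* Return 1 if P points at the trigraph for a backslash immediately
   followed by a newline, i.e. an escaped newline spelt with a
   trigraph.  Short-circuit evaluation keeps every read inside the
   NUL-terminated buffer.  (Hypothetical helper.)  */
static int trigraph_escapes_newline(const char *p)
{
    return p[0] == '?' && p[1] == '?'
        && trigraph_replacement(p[2]) == '\\'
        && p[3] == '\n';
}
```

The test strings for such code are best written with @samp{\?} escapes,
so that a compiler with trigraphs enabled does not convert them inside
the string literal itself.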
 138
 139 Escaped newlines are tedious because theoretically they can occur
 140 anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
 141 within the characters of an identifier, and even between the @samp{*}
  142 and @samp{/} that terminate a comment.  Moreover, you cannot be sure
 143 there is just one - there might be an arbitrarily long sequence of them.
 144
 145 So the routine @samp{parse_identifier}, that lexes an identifier, cannot
 146 assume that it can scan forwards until the first non-identifier
 147 character and be done with it, because this could be the @samp{\}
 148 introducing an escaped newline, or the @samp{?} introducing the trigraph
 149 sequence that represents the @samp{\} of an escaped newline.  Similarly
 150 for the routine that handles numbers, @samp{parse_number}.  If these
 151 routines stumble upon a @samp{?} or @samp{\}, they call
 152 @samp{skip_escaped_newlines} to skip over any potential escaped newlines
 153 before checking whether they can finish.
 154
  155 Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
  156 check for a @samp{=} after a @samp{+} character to determine whether it
  157 has a @samp{+=} token; it needs to be prepared for an escaped newline of
  158 some sort.  These cases use the function @samp{get_effective_char},
  159 which returns the first character after any intervening escaped newlines.
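The idea of looking past escaped newlines can be sketched as follows.
This is a simplified illustration, not the code in @samp{cpplex.c}: the
cursor structure and the @samp{_sketch} names are hypothetical, the
buffer is assumed to be NUL-terminated, and only a plain backslash
followed directly by @samp{\n} is recognised (no trigraphs, no other
newline forms).

```c
/* Hypothetical cursor over a NUL-terminated in-memory buffer.  */
struct cursor {
    const char *cur;
};

/* Skip an arbitrarily long run of backslash-newline pairs.  */
static void skip_escaped_newlines_sketch(struct cursor *c)
{
    while (c->cur[0] == '\\' && c->cur[1] == '\n')
        c->cur += 2;
}

/* Return the next character after any escaped newlines and step past
   it.  At the end of the buffer, return '\0' without advancing.  */
static char get_effective_char_sketch(struct cursor *c)
{
    skip_escaped_newlines_sketch(c);
    if (*c->cur == '\0')
        return '\0';
    return *c->cur++;
}

/* Return 1 if TEXT begins with a "+=" token, allowing escaped
   newlines before the '+' and between the '+' and the '='.  */
static int lexes_as_plus_eq(const char *text)
{
    struct cursor c = { text };

    if (get_effective_char_sketch(&c) != '+')
        return 0;
    return get_effective_char_sketch(&c) == '=';
}
```

With this sketch, a @samp{+} and a @samp{=} separated by two escaped
newlines are still recognised as @samp{+=}, whereas a @samp{+} and a
@samp{=} separated by a space are not.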
 160
 161 The lexer needs to keep track of the correct column position,
 162 including counting tabs as specified by the @samp{-ftabstop=} option.
 163 This should be done even within comments; C-style comments can appear in
 164 the middle of a line, and we want to report diagnostics in the correct
 165 position for text appearing after the end of the comment.
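A minimal sketch of tab-aware column counting is given below.  It
assumes 1-based columns and that a tab advances to the column just after
the next multiple of the tab stop; both are assumptions made for
illustration rather than a statement of what cpplib does, and the
function names are hypothetical.

```c
/* Advance the 1-based column COL over the character CH.  TABSTOP is
   the tab width and is assumed to be at least 1.  */
static unsigned int advance_column(unsigned int col, char ch,
                                   unsigned int tabstop)
{
    if (ch == '\t')
        /* Jump to the column just after the next tab stop.  */
        return col + tabstop - (col - 1) % tabstop;
    return col + 1;
}

/* Return the column of the character that would follow TEXT on its
   line.  Scanning stops at a newline or at the end of the string.
   Comment text is counted like any other text, so positions after a
   comment come out right.  */
static unsigned int column_after(const char *text, unsigned int tabstop)
{
    unsigned int col = 1;

    for (; *text != '\0' && *text != '\n'; text++)
        col = advance_column(col, *text, tabstop);
    return col;
}
```

Under these assumptions, with a tab stop of 8, a token that follows a
single leading tab starts in column 9.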
 166
 167 Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
 168 may be invalid and require a diagnostic.  However, if they appear in a
 169 macro expansion we don't want to complain with each use of the macro.
 170 It is therefore best to catch them during the lexing stage, in
 171 @samp{parse_identifier}.  In both cases, whether a diagnostic is needed
 172 or not is dependent upon lexer state.  For example, we don't want to
 173 issue a diagnostic for re-poisoning a poisoned identifier, or for using
 174 @samp{__VA_ARGS__} in the expansion of a variable-argument macro.
 175 Therefore @samp{parse_identifier} makes use of flags to determine
 176 whether a diagnostic is appropriate.  Since we change state on a
 177 per-token basis, and don't lex whole lines at a time, this is not a
 178 problem.
 179
 180 Another place where state flags are used to change behaviour is whilst
  181 parsing header names.  Normally, a @samp{<} is lexed as a token in its
  182 own right.  After a @samp{#include} directive, though, everything from the
  183 @samp{<} to the nearest @samp{>} is lexed as one header-name token.  Note that
 184 we don't allow the terminators of header names to be escaped; the first
 185 @samp{"} or @samp{>} terminates the header name.
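The rule that the first terminator ends a header name, with no escape
mechanism, might be sketched as below.  The function name is
hypothetical, and stopping the search at the end of the line is an
assumption of this sketch.

```c
#include <stddef.h>

/* If P points at the opening '<' or '"' of a header name, return the
   length of the header-name token including both delimiters.  Return
   0 if P does not start a header name or if no terminator is found
   before the end of the line.  Backslashes are not treated specially,
   so the first matching terminator always ends the name.  */
static size_t header_name_length(const char *p)
{
    char terminator;
    const char *end;

    if (*p == '<')
        terminator = '>';
    else if (*p == '"')
        terminator = '"';
    else
        return 0;

    for (end = p + 1; *end != '\0' && *end != '\n'; end++)
        if (*end == terminator)
            return (size_t) (end - p) + 1;
    return 0;
}
```

A lexer would presumably call something like this only when a state
flag says that a header name is expected; elsewhere @samp{<} keeps its
usual meaning.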
 186
 187 Interpretation of some character sequences depends upon whether we are
 188 lexing C, C++ or Objective C, and on the revision of the standard in
 189 force.  For example, @samp{@@foo} is a single identifier token in
  190 Objective C, but two separate tokens @samp{@@} and @samp{foo} in C or
 191 C++.  Such cases are handled in the main function @samp{_cpp_lex_token},
 192 based upon the flags set in the @samp{cpp_options} structure.
 193
 194 Note we have almost, but not quite, achieved the goal of not stepping
 195 backwards in the input stream.  Currently @samp{skip_escaped_newlines}
 196 does step back, though with care it should be possible to adjust it so
  197 that this does not happen.  One tricky case arises when we meet a
  198 trigraph while the command line option @samp{-trigraphs} is not in
  199 force but @samp{-Wtrigraphs} is: we need to warn about it, but then
  200 buffer it and continue to treat it as three separate characters.
 201
 202 @node Whitespace, Hash Nodes, Lexer, Top
 203 @unnumbered Whitespace
 204
 205 The lexer has been written to treat each of @samp{\r}, @samp{\n},
  206 @samp{\r\n} and @samp{\n\r} as a single newline indicator.  This allows
 207 it to transparently preprocess MS-DOS, Macintosh and Unix files without
 208 their needing to pass through a special filter beforehand.
 209
 210 We also decided to treat a backslash, either @samp{\} or the trigraph
 211 @samp{??/}, separated from one of the above newline indicators by
 212 non-comment whitespace only, as intending to escape the newline.  It
 213 tends to be a typing mistake, and cannot reasonably be mistaken for
 214 anything else in any of the C-family grammars.  Since handling it this
 215 way is not strictly conforming to the ISO standard, the library issues a
 216 warning wherever it encounters it.
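The decision described above can be sketched as a three-way
classification of what follows a backslash.  The names are hypothetical,
and the choice of which characters count as whitespace (space, tab, form
feed and vertical tab) is an assumption of this sketch.

```c
/* Outcome of looking at the text that follows a backslash.  */
enum escape_kind {
    NOT_ESCAPED,            /* no newline follows: an ordinary backslash */
    ESCAPED,                /* newline follows immediately */
    ESCAPED_WITH_WARNING    /* only whitespace before the newline */
};

/* Horizontal whitespace, as assumed by this sketch.  */
static int is_horizontal_space(char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\f' || ch == '\v';
}

/* AFTER_BACKSLASH points just past a backslash in a NUL-terminated
   buffer.  Decide whether that backslash escapes a newline, and
   whether the non-conforming "whitespace in between" form was used,
   in which case the caller should issue a warning.  */
static enum escape_kind classify_backslash(const char *after_backslash)
{
    const char *p = after_backslash;

    while (is_horizontal_space(*p))
        p++;
    if (*p != '\n' && *p != '\r')
        return NOT_ESCAPED;
    return p == after_backslash ? ESCAPED : ESCAPED_WITH_WARNING;
}
```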
 217
 218 Handling newlines like this is made simpler by doing it in one place
 219 only.  The function @samp{handle_newline} takes care of all newline
 220 characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
 221 long sequences of escaped newlines, deferring to @samp{handle_newline}
 222 to handle the newlines themselves.
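A minimal sketch of treating the four newline forms uniformly follows.
It is not the real @samp{handle_newline}; the function names are
hypothetical.  The approach assumed here is that a newline character
directly followed by the other newline character forms a single newline
indicator.

```c
/* If P points at a newline indicator ("\n", "\r", "\r\n" or "\n\r"),
   return a pointer just past it; otherwise return P unchanged.  */
static const char *skip_newline(const char *p)
{
    if (*p == '\n' || *p == '\r') {
        char first = *p++;

        /* The other newline character, if it follows directly,
           belongs to the same newline indicator.  */
        if ((*p == '\n' || *p == '\r') && *p != first)
            p++;
    }
    return p;
}

/* Count the newline indicators in the NUL-terminated string TEXT.  */
static unsigned int count_lines(const char *text)
{
    unsigned int lines = 0;

    while (*text != '\0') {
        const char *next = skip_newline(text);

        if (next != text) {
            lines++;
            text = next;
        } else
            text++;
    }
    return lines;
}
```

Because pairing is greedy, two identical characters in a row, such as
@samp{\n\n}, count as two newlines, while @samp{\r\n} and @samp{\n\r}
each count as one.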
 223
 224 @node Hash Nodes, Macro Expansion, Whitespace, Top
 225 @unnumbered Hash Nodes
 226
 227 When cpplib encounters an "identifier", it generates a hash code for it
 228 and stores it in the hash table.  By "identifier" we mean tokens with
 229 type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
 230 well as keywords, directive names, macro names and so on.  For example,
 231 all of "pragma", "int", "foo" and "__GNUC__" are identifiers and hashed
 232 when lexed.
 233
  234 Each node in the hash table contains various information about the
  235 identifier it represents, for example its length and type.  At any one
  236 time, each identifier falls into exactly one of three categories:
 237
 238 @itemize @bullet
 239 @item Macros
 240
 241 These have been declared to be macros, either on the command line or
  242 with @samp{#define}.  A few, such as @samp{__TIME__}, are builtins
  243 entered in the hash table during initialisation.  The hash node for a
  244 normal macro points to a structure with more information about the
  245 macro, such as whether it is function-like, how many arguments it takes,
  246 and its expansion.  Builtin macros are flagged as special, and their nodes
  247 instead contain an enum indicating which of the builtin macros each one is.
 248
 249 @item Assertions
 250
  251 Assertions are in a separate namespace from macros.  To enforce this, cpp
  252 prepends a @samp{#} character to an assertion's name before hashing it
  253 and entering it in the hash table.  An assertion's node points to a chain
  254 of answers to that assertion.
 255
 256 @item Void
 257
 258 Everything else falls into this category - an identifier that is not
 259 currently a macro, or a macro that has since been undefined with
 260 @samp{#undef}.
 261
 262 When preprocessing C++, this category also includes the named operators,
 263 such as @samp{xor}.  In expressions these behave like the operators they
 264 represent, but in contexts where the spelling of a token matters they
 265 are spelt differently.  This spelling distinction is relevant when they
 266 are operands of the stringizing and pasting macro operators @samp{#} and
 267 @samp{##}.  Named operator hash nodes are flagged, both to catch the
 268 spelling distinction and to prevent them from being defined as macros.
 269 @end itemize
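The namespace trick used for assertions can be illustrated with a small
helper that builds the key under which an assertion would be hashed.
The helper and its name are hypothetical.  The point is only that the
key for an assertion can never equal the spelling of a macro name,
because an identifier cannot begin with @samp{#}.

```c
#include <stdio.h>
#include <stddef.h>

/* Write the hash key for the assertion NAME into BUF, which has room
   for SIZE bytes.  The key is NAME with a '#' prepended.  Return 0 on
   success, or -1 if BUF is too small to hold the whole key.  */
static int assertion_key(const char *name, char *buf, size_t size)
{
    int n = snprintf(buf, size, "#%s", name);

    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

With this sketch, an assertion called @samp{machine} is keyed as
@samp{#machine}, so it cannot collide with a macro that is also called
@samp{machine}.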
 270
  271 All occurrences of an identifier share a single hash node.  Since each
  272 identifier token, after lexing, contains a pointer to its hash node, this
  273 is used to provide rapid lookup of various information.  For example, when
  274 parsing a @samp{#define} directive, CPP flags each argument's identifier
 275 hash node with the index of that argument.  This makes duplicated
 276 argument checking an O(1) operation for each argument.  Similarly, for
 277 each identifier in the macro's expansion, lookup to see if it is an
 278 argument, and which argument it is, is also an O(1) operation.  Further,
 279 each directive name, such as @samp{endif}, has an associated directive
 280 enum stored in its hash node, so that directive lookup is also O(1).
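The following sketch shows how storing an argument index in a shared
hash node makes duplicate-argument checking a constant-time test.  It is
an illustration under assumptions, not cpplib's data structure: the
structure layout, the fixed table size, the hash function and all names
are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 64   /* arbitrary size chosen for this sketch */

struct hash_node {
    struct hash_node *next;   /* next node in the same bucket */
    char *name;               /* spelling of the identifier */
    int arg_index;            /* 0 if not an argument, else 1-based index */
};

struct hash_table {
    struct hash_node *buckets[TABLE_SIZE];
};

/* A simple multiplicative string hash, reduced to a bucket number.  */
static unsigned int hash_string(const char *s)
{
    unsigned int h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char) *s++;
    return h % TABLE_SIZE;
}

/* Return the unique node for NAME, creating it on first sight.
   Return NULL if memory cannot be allocated.  */
static struct hash_node *intern(struct hash_table *t, const char *name)
{
    unsigned int h = hash_string(name);
    struct hash_node *n;

    for (n = t->buckets[h]; n != NULL; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;

    n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->name = malloc(strlen(name) + 1);
    if (n->name == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->name, name);
    n->arg_index = 0;
    n->next = t->buckets[h];
    t->buckets[h] = n;
    return n;
}

/* Record NODE as argument number INDEX (1-based) of the macro being
   defined.  Return 0 on success, or -1 if the identifier is already
   an argument: the check is a single field test on the node.  */
static int save_argument(struct hash_node *node, int index)
{
    if (node->arg_index != 0)
        return -1;
    node->arg_index = index;
    return 0;
}
```

In a scheme like this, the indices would presumably have to be reset
once the definition has been parsed, so that the same identifiers can
serve as arguments of a later macro.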
 281
 282 Later, CPP may also store C front-end information in its identifier hash
 283 table, such as a @samp{tree} pointer.
 284
 285 @node Macro Expansion, Files, Hash Nodes, Top
 286 @unnumbered Macro Expansion Algorithm
 287 @printindex cp
 288
 289 @node Files, Concept Index, Macro Expansion, Top
 290 @unnumbered File Handling
 291 @printindex cp
 292
 293 @node Concept Index, Index, Files, Top
 294 @unnumbered Concept Index
 295 @printindex cp
 296
 297 @node Index,, Concept Index, Top
 298 @unnumbered Index of Directives, Macros and Options
 299 @printindex fn
 300
 301 @contents
 302 @bye