gcc/cppinternals.texi

   1 \input texinfo
   2 @setfilename cppinternals.info
   3 @settitle The GNU C Preprocessor Internals
   4
   5 @ifinfo
   6 @dircategory Programming
   7 @direntry
   8 * Cpplib: (cppinternals).      Cpplib internals.
   9 @end direntry
  10 @end ifinfo
  11
  12 @c @smallbook
  13 @c @cropmarks
  14 @c @finalout
  15 @setchapternewpage odd
  16 @ifinfo
  17 This file documents the internals of the GNU C Preprocessor.
  18
  19 Copyright 2000, 2001 Free Software Foundation, Inc.
  20
  21 Permission is granted to make and distribute verbatim copies of
  22 this manual provided the copyright notice and this permission notice
  23 are preserved on all copies.
  24
  25 @ignore
  26 Permission is granted to process this file through Tex and print the
  27 results, provided the printed document carries copying permission
  28 notice identical to this one except for the removal of this paragraph
  29 (this paragraph not being relevant to the printed manual).
  30
  31 @end ignore
  32 Permission is granted to copy and distribute modified versions of this
  33 manual under the conditions for verbatim copying, provided also that
  34 the entire resulting derived work is distributed under the terms of a
  35 permission notice identical to this one.
  36
  37 Permission is granted to copy and distribute translations of this manual
  38 into another language, under the above conditions for modified versions.
  39 @end ifinfo
  40
  41 @titlepage
  42 @c @finalout
  43 @title Cpplib Internals
  44 @subtitle Last revised Jan 2001
  45 @subtitle for GCC version 3.0
  46 @author Neil Booth
  47 @page
  48 @vskip 0pt plus 1filll
  49 @c man begin COPYRIGHT
  50 Copyright @copyright{} 2000, 2001
  51 Free Software Foundation, Inc.
  52
  53 Permission is granted to make and distribute verbatim copies of
  54 this manual provided the copyright notice and this permission notice
  55 are preserved on all copies.
  56
  57 Permission is granted to copy and distribute modified versions of this
  58 manual under the conditions for verbatim copying, provided also that
  59 the entire resulting derived work is distributed under the terms of a
  60 permission notice identical to this one.
  61
  62 Permission is granted to copy and distribute translations of this manual
  63 into another language, under the above conditions for modified versions.
  64 @c man end
  65 @end titlepage
  66 @page
  67
  68 @node Top, Conventions,, (DIR)
  69 @chapter Cpplib - the core of the GNU C Preprocessor
  70
  71 The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
  72 now implemented as a library, cpplib, so it can be easily shared between
  73 a stand-alone preprocessor, and a preprocessor integrated with the C,
  74 C++ and Objective C front ends.  It is also available for use by other
  75 programs, though this is not recommended as its exposed interface has
  76 not yet reached a point of reasonable stability.
  77
  78 This library has been written to be re-entrant, so that it can be used
  79 to preprocess many files simultaneously if necessary.  It has also been
  80 written with the preprocessing token as the fundamental unit; the
  81 preprocessor in previous versions of GCC would operate on text strings
  82 as the fundamental unit.
  83
  84 This brief manual documents some of the internals of cpplib, and a few
  85 tricky issues encountered.  It also describes certain behaviour we would
  86 like to preserve, such as the format and spacing of its output.
  87
  88 It covers lexing, whitespace, identifier hash nodes and macro expansion.
  89
  90 @menu
  91 * Conventions::     Conventions used in the code.
  92 * Lexer::           The combined C, C++ and Objective C Lexer.
  93 * Whitespace::      Input and output newlines and whitespace.
  94 * Hash Nodes::      All identifiers are hashed.
  95 * Macro Expansion:: Macro expansion algorithm.
  96 * Files::           File handling.
  97 * Index::           Index.
  98 @end menu
  99
 100 @node Conventions, Lexer, Top, Top
 101 @unnumbered Conventions
 102 @cindex interface
 103 @cindex header files
 104
 105 cpplib has two interfaces - one is exposed internally only, and the
 106 other is for both internal and external use.
 107
 108 The convention is that functions and types that are exposed to multiple
 109 files internally are prefixed with @samp{_cpp_}, and are to be found in
 110 the file @samp{cpphash.h}.  Functions and types exposed to external
 111 clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}.  For
 112 historical reasons this is no longer quite true, but we should strive to
 113 stick to it.
 114
 115 We are striving to reduce the information exposed in cpplib.h to the
 116 bare minimum necessary, and then to keep it there.  This makes clear
 117 exactly what external clients are entitled to assume, and allows us to
 118 change internals in the future without worrying whether library clients
 119 are perhaps relying on some kind of undocumented implementation-specific
 120 behaviour.
 121
 122 @node Lexer, Whitespace, Conventions, Top
 123 @unnumbered The Lexer
 124 @cindex lexer
 125 @cindex tokens
 126
 127 The lexer is contained in the file @samp{cpplex.c}.  We want to have a
 128 lexer that is single-pass, for efficiency reasons.  We would also like
 129 the lexer to only step forwards through the input files, and not step
 130 back.  This will make future changes to support different character
 131 sets, in particular state or shift-dependent ones, much easier.
 132
 133 This file also contains all information needed to spell a token, i.e. to
 134 output it either in a diagnostic or to a preprocessed output file.  This
 135 information is not exported, but made available to clients through such
 136 functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
 137
 138 The most painful aspect of lexing ISO-standard C and C++ is handling
 139 trigraphs and backslash-escaped newlines.  Trigraphs are processed before
 140 any interpretation of the meaning of a character is made, and unfortunately
 141 there is a trigraph representation for a backslash, so it is possible for
 142 the trigraph @samp{??/} to introduce an escaped newline.
 143
 144 Escaped newlines are tedious because theoretically they can occur
 145 anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
 146 within the characters of an identifier, and even between the @samp{*}
 147 and @samp{/} that terminates a comment.  Moreover, you cannot be sure
 148 there is just one - there might be an arbitrarily long sequence of them.
 149
 150 So the routine @samp{parse_identifier}, that lexes an identifier, cannot
 151 assume that it can scan forwards until the first non-identifier
 152 character and be done with it, because this could be the @samp{\}
 153 introducing an escaped newline, or the @samp{?} introducing the trigraph
 154 sequence that represents the @samp{\} of an escaped newline.  Similarly
 155 for the routine that handles numbers, @samp{parse_number}.  If these
 156 routines stumble upon a @samp{?} or @samp{\}, they call
 157 @samp{skip_escaped_newlines} to skip over any potential escaped newlines
 158 before checking whether they can finish.
 159
 160 Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
 161 check for a @samp{=} after a @samp{+} character to determine whether it
 162 has a @samp{+=} token; it needs to be prepared for an escaped newline of
 163 some sort.  These cases use the function @samp{get_effective_char},
 164 which returns the first character after any intervening escaped newlines.
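
The idea of looking past escaped newlines can be sketched as follows.
This is a minimal illustration, not cpplib's actual code: the names
@samp{escaped_newline_len} and @samp{effective_char} are hypothetical,
the buffer is assumed to be NUL-terminated, and only @samp{\n} is
treated as a newline.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if S points at a backslash, or at the trigraph
   spelling of a backslash when TRIGRAPHS is nonzero, and a newline
   follows, return the length of the whole escaped newline.
   Otherwise return 0.  */
static size_t
escaped_newline_len (const char *s, int trigraphs)
{
  size_t n;

  if (s[0] == '\\')
    n = 1;
  else if (trigraphs && s[0] == '?' && s[1] == '?' && s[2] == '/')
    n = 3;
  else
    return 0;

  if (s[n] == '\n')
    return n + 1;
  return 0;
}

/* Return the first character at or after S that is not part of an
   escaped newline, skipping arbitrarily many escaped newlines.  */
static char
effective_char (const char *s, int trigraphs)
{
  size_t len;

  assert (s != NULL);
  while ((len = escaped_newline_len (s, trigraphs)) != 0)
    s += len;
  return *s;
}
```
@end verbatim

With this sketch, a lexer that has just seen @samp{+} would compare
the result of @samp{effective_char} on the following text against
@samp{=} rather than inspecting the next raw character.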
 165
 166 The lexer needs to keep track of the correct column position,
 167 including counting tabs as specified by the @samp{-ftabstop=} option.
 168 This should be done even within comments; C-style comments can appear in
 169 the middle of a line, and we want to report diagnostics in the correct
 170 position for text appearing after the end of the comment.
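
Column counting with tabs can be sketched as below.  This is only an
illustration under stated assumptions: @samp{column_at} is a
hypothetical name, columns are 1-based, and a tab is assumed to advance
to the next multiple of the tab stop width.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Return the 1-based column of the character at offset POS within
   LINE, expanding each tab to the next multiple of TABSTOP.  The
   caller must ensure POS does not exceed the length of LINE.  */
static unsigned int
column_at (const char *line, size_t pos, unsigned int tabstop)
{
  unsigned int col = 0;   /* 0-based column of the current character */
  size_t i;

  assert (tabstop > 0);
  for (i = 0; i < pos; i++)
    {
      if (line[i] == '\t')
        col += tabstop - col % tabstop;   /* advance to the next tab stop */
      else
        col++;
    }
  return col + 1;
}
```
@end verbatim

Characters inside a comment are counted like any others, so text that
follows a mid-line comment is still reported at the right column.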
 171
 172 Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
 173 may be invalid and require a diagnostic.  However, if they appear in a
 174 macro expansion we don't want to complain with each use of the macro.
 175 It is therefore best to catch them during the lexing stage, in
 176 @samp{parse_identifier}.  In both cases, whether a diagnostic is needed
 177 or not is dependent upon lexer state.  For example, we don't want to
 178 issue a diagnostic for re-poisoning a poisoned identifier, or for using
 179 @samp{__VA_ARGS__} in the expansion of a variable-argument macro.
 180 Therefore @samp{parse_identifier} makes use of flags to determine
 181 whether a diagnostic is appropriate.  Since we change state on a
 182 per-token basis, and don't lex whole lines at a time, this is not a
 183 problem.
 184
 185 Another place where state flags are used to change behaviour is whilst
 186 parsing header names.  Normally, a @samp{<} is lexed as a token in its
 187 own right.  After a @code{#include} directive, though, everything from
 188 the @samp{<} up to the nearest @samp{>} forms a single token.  Note that
 189 we don't allow the terminators of header names to be escaped; the first
 190 @samp{"} or @samp{>} terminates the header name.
 191
 192 Interpretation of some character sequences depends upon whether we are
 193 lexing C, C++ or Objective C, and on the revision of the standard in
 194 force.  For example, @samp{::} is a single token in C++, but two
 195 separate @samp{:} tokens, and almost certainly a syntax error, in C.
 196 Such cases are handled in the main function @samp{_cpp_lex_token}, based
 197 upon the flags set in the @samp{cpp_options} structure.
 198
 199 Note we have almost, but not quite, achieved the goal of not stepping
 200 backwards in the input stream.  Currently @samp{skip_escaped_newlines}
 201 does step back, though with care it should be possible to adjust it so
 202 that this does not happen.  One tricky case is meeting a trigraph when
 203 the command line option @samp{-trigraphs} is not in force but
 204 @samp{-Wtrigraphs} is: we need to warn about it, but then buffer it
 205 and continue to treat it as three separate characters.
 206
 207 @node Whitespace, Hash Nodes, Lexer, Top
 208 @unnumbered Whitespace
 209 @cindex whitespace
 210 @cindex newlines
 211 @cindex escaped newlines
 212 @cindex paste avoidance
 213 @cindex line numbers
 214
 215 The lexer has been written to treat each of @samp{\r}, @samp{\n},
 216 @samp{\r\n} and @samp{\n\r} as a single new line indicator.  This allows
 217 it to transparently preprocess MS-DOS, Macintosh and Unix files without
 218 their needing to pass through a special filter beforehand.
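
Recognising the four newline indicators can be sketched like this.  The
names @samp{newline_len} and @samp{count_newlines} are hypothetical and
the input is assumed to be NUL-terminated; this is not the code of
@samp{handle_newline}.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* If S points at a newline indicator, return its length in characters
   (1 or 2); otherwise return 0.  A CR/LF pair in either order counts
   as one indicator, but two identical characters are two newlines.  */
static size_t
newline_len (const char *s)
{
  if (s[0] != '\n' && s[0] != '\r')
    return 0;
  if ((s[1] == '\n' || s[1] == '\r') && s[1] != s[0])
    return 2;
  return 1;
}

/* Count the newline indicators in TEXT.  */
static unsigned int
count_newlines (const char *text)
{
  unsigned int lines = 0;
  size_t len;

  assert (text != NULL);
  while (*text != '\0')
    {
      len = newline_len (text);
      if (len != 0)
        lines++;
      else
        len = 1;   /* ordinary character */
      text += len;
    }
  return lines;
}
```
@end verbatim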
 219
 220 We also decided to treat a backslash, either @samp{\} or the trigraph
 221 @samp{??/}, separated from one of the above newline indicators by
 222 non-comment whitespace only, as intending to escape the newline.  It
 223 tends to be a typing mistake, and cannot reasonably be mistaken for
 224 anything else in any of the C-family grammars.  Since handling it this
 225 way is not strictly conforming to the ISO standard, the library issues a
 226 warning wherever it encounters it.
 227
 228 Handling newlines like this is made simpler by doing it in one place
 229 only.  The function @samp{handle_newline} takes care of all newline
 230 characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
 231 long sequences of escaped newlines, deferring to @samp{handle_newline}
 232 to handle the newlines themselves.
 233
 234 Another whitespace issue only concerns the stand-alone preprocessor: we
 235 want to guarantee that re-reading the preprocessed output results in an
 236 identical token stream.  Without taking special measures, this might not
 237 be the case because of macro substitution.  We could simply insert a
 238 space between adjacent tokens, but ideally we would like to keep this to
 239 a minimum, both for aesthetic reasons and because it causes problems for
 240 people who still try to abuse the preprocessor for things like Fortran
 241 source and Makefiles.
 242
 243 The token structure contains a flags byte, and two flags are of interest
 244 here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}.  @samp{PREV_WHITE}
 245 indicates that the token was preceded by whitespace; if this is the case
 246 we need not worry about it incorrectly pasting with its predecessor.
 247 The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and
 248 indicates that paste avoidance by insertion of a space to the left of
 249 the token may be necessary.  Recursively, the first token of a macro
 250 substitution, the first token after a macro substitution, the first
 251 token of a substituted argument, and the first token after a substituted
 252 argument are all flagged @samp{AVOID_LPASTE} by the macro expander.
 253
 254 If a token flagged in this way does not have a @samp{PREV_WHITE} flag,
 255 and the routine @samp{cpp_avoid_paste} determines that it might be
 256 misinterpreted by the lexer if a space is not inserted between it and
 257 the immediately preceding token, then stand-alone CPP's output routines
 258 will insert a space between them.  To avoid excessive spacing,
 259 @samp{cpp_avoid_paste} tries hard to only request a space if one is
 260 likely to be necessary, but for reasons of efficiency it is slightly
 261 conservative and might recommend a space where one is not strictly
 262 needed.
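
The decision described above can be sketched as follows.  This is a
simplified illustration and not the logic of @samp{cpp_avoid_paste}: the
flag values are made up, the helper names @samp{might_paste} and
@samp{needs_paste_space} are hypothetical, tokens are represented by
their spellings, and the operator list is deliberately incomplete.

@verbatim
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative flag bits; the real values are defined by cpplib.  */
#define PREV_WHITE   (1 << 0)
#define AVOID_LPASTE (1 << 1)

static int
is_idchar (char c)
{
  return isalnum ((unsigned char) c) || c == '_';
}

/* Return nonzero if printing LEFT directly followed by RIGHT might be
   lexed back as different tokens.  Deliberately conservative.  */
static int
might_paste (const char *left, const char *right)
{
  static const char *const ops[] = {
    "++", "--", "+=", "-=", "->", "<<", ">>", "<=", ">=", "==", "!=",
    "&&", "||", "&=", "|=", "^=", "*=", "/=", "%=", "##", "//", "/*",
    "::", ".."
  };
  size_t i;
  char a, b;

  assert (left[0] != '\0' && right[0] != '\0');
  a = left[strlen (left) - 1];
  b = right[0];

  if (is_idchar (a) && is_idchar (b))
    return 1;   /* identifiers and numbers would merge */
  if ((a == '.' && isdigit ((unsigned char) b)) || (is_idchar (a) && b == '.'))
    return 1;   /* could form or extend a number */
  for (i = 0; i < sizeof ops / sizeof ops[0]; i++)
    if (ops[i][0] == a && ops[i][1] == b)
      return 1; /* would form a two-character operator or comment */
  return 0;
}

/* Return nonzero if a space should be inserted before the token with
   flags FLAGS and spelling RIGHT, which follows the spelling LEFT.  */
static int
needs_paste_space (unsigned int flags, const char *left, const char *right)
{
  if (flags & PREV_WHITE)
    return 0;   /* whitespace already separates the tokens */
  if (!(flags & AVOID_LPASTE))
    return 0;   /* not at a macro expansion boundary */
  return might_paste (left, right);
}
```
@end verbatim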
 263
 264 Finally, the preprocessor takes great care to ensure it keeps track of
 265 both the position of a token in the source file, for diagnostic
 266 purposes, and where it should appear in the output file, because using
 267 CPP for other languages like assembler requires this.  The two positions
 268 may differ for the following reasons:
 269
 270 @itemize @bullet
 271 @item
 272 Escaped newlines are deleted, so lines spliced in this way are joined to
 273 form a single logical line.
 274
 275 @item
 276 A macro expansion replaces the tokens that form its invocation, but any
 277 newlines appearing in the macro's arguments are interpreted as a single
 278 space, with the result that the macro's replacement appears in full on
 279 the same line that the macro name appeared in the source file.  This is
 280 particularly important for stringification of arguments - newlines
 281 embedded in the arguments must appear in the string as spaces.
 282 @end itemize
 283
 284 The source file location is maintained in the @samp{lineno} member of the
 285 @samp{cpp_buffer} structure, and the column number inferred from the
 286 current position in the buffer relative to the @samp{line_base} buffer
 287 variable, which is updated with every newline whether escaped or not.
 288
 289 TODO: Finish this.
 290
 291 @node Hash Nodes, Macro Expansion, Whitespace, Top
 292 @unnumbered Hash Nodes
 293 @cindex hash table
 294 @cindex identifiers
 295 @cindex macros
 296 @cindex assertions
 297 @cindex named operators
 298
 299 When cpplib encounters an "identifier", it generates a hash code for it
 300 and stores it in the hash table.  By "identifier" we mean tokens with
 301 type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
 302 well as keywords, directive names, macro names and so on.  For example,
 303 all of "pragma", "int", "foo" and "__GNUC__" are identifiers and hashed
 304 when lexed.
 305
 306 Each node in the hash table contains various information about the
 307 identifier it represents, for example its length and type.  At any one
 308 time, each identifier falls into exactly one of three categories:
 309
 310 @itemize @bullet
 311 @item Macros
 312
 313 These have been declared to be macros, either on the command line or
 314 with @code{#define}.  A few, such as @samp{__TIME__}, are builtins
 315 entered in the hash table during initialisation.  The hash node for a
 316 normal macro points to a structure with more information about the
 317 macro, such as whether it is function-like, how many arguments it takes,
 318 and its expansion.  A builtin macro is flagged as special, and its node
 319 instead contains an enum indicating which of the builtin macros it is.
 320
 321 @item Assertions
 322
 323 Assertions are in a separate namespace to macros.  To enforce this, cpp
 324 actually prepends a @code{#} character to the assertion's name before
 325 hashing it and entering it in the hash table.  An assertion's node
 326 points to a chain of answers to that assertion.
 327
 328 @item Void
 329
 330 Everything else falls into this category - an identifier that is not
 331 currently a macro, or a macro that has since been undefined with
 332 @code{#undef}.
 333
 334 When preprocessing C++, this category also includes the named operators,
 335 such as @samp{xor}.  In expressions these behave like the operators they
 336 represent, but in contexts where the spelling of a token matters they
 337 are spelt differently.  This spelling distinction is relevant when they
 338 are operands of the stringizing and pasting macro operators @code{#} and
 339 @code{##}.  Named operator hash nodes are flagged, both to catch the
 340 spelling distinction and to prevent them from being defined as macros.
 341 @end itemize
 342
 343 All uses of a given identifier share the same hash node.  Since each identifier
 344 token, after lexing, contains a pointer to its hash node, this is used
 345 to provide rapid lookup of various information.  For example, when
 346 parsing a @code{#define} statement, CPP flags each argument's identifier
 347 hash node with the index of that argument.  This makes duplicated
 348 argument checking an O(1) operation for each argument.  Similarly, for
 349 each identifier in the macro's expansion, lookup to see if it is an
 350 argument, and which argument it is, is also an O(1) operation.  Further,
 351 each directive name, such as @samp{endif}, has an associated directive
 352 enum stored in its hash node, so that directive lookup is also O(1).
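
The duplicate-argument check can be sketched with a toy hash table.
Everything here is an assumption made for illustration: the structure
layout, the fixed table size, and the names @samp{lookup} and
@samp{save_parameter} are hypothetical and are not cpplib's own.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 64

/* Toy hash node: one per distinct identifier spelling.  */
struct hashnode
{
  const char *name;
  unsigned int arg_index;   /* 0 means "not a macro parameter" */
};

static struct hashnode table[TABLE_SIZE];
static unsigned int nparams;   /* parameters recorded so far */

/* Return the node for NAME, creating it on first use, so that equal
   spellings always share one node.  Returns NULL if the table is
   full.  The caller must keep NAME alive.  */
static struct hashnode *
lookup (const char *name)
{
  unsigned int h = 0, i;
  const char *p;

  for (p = name; *p != '\0'; p++)
    h = h * 31 + (unsigned char) *p;

  for (i = 0; i < TABLE_SIZE; i++)
    {
      struct hashnode *node = &table[(h + i) % TABLE_SIZE];
      if (node->name == NULL)
        {
          node->name = name;
          return node;
        }
      if (strcmp (node->name, name) == 0)
        return node;
    }
  return NULL;
}

/* Record NODE as the next macro parameter.  Return 1 on success, or 0
   if NODE is already a parameter.  The check is one field read.  */
static int
save_parameter (struct hashnode *node)
{
  assert (node != NULL);
  if (node->arg_index != 0)
    return 0;
  node->arg_index = ++nparams;
  return 1;
}
```
@end verbatim

Because the index lives in the node itself, finding out whether an
identifier in the expansion is a parameter, and which one, is again a
single field read once the token's node pointer is in hand.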
 353
 354 @node Macro Expansion, Files, Hash Nodes, Top
 355 @unnumbered Macro Expansion Algorithm
 356
 357 @node Files, Index, Macro Expansion, Top
 358 @unnumbered File Handling
 359 @cindex files
 360
 361 Fairly obviously, the file handling code of cpplib resides in the file
 362 @samp{cppfiles.c}.  It takes care of the details of file searching,
 363 opening, reading and caching, for both the main source file and all the
 364 headers it recursively includes.
 365
 366 The basic strategy is to minimize the number of system calls.  On many
 367 systems, the basic @code{open ()} and @code{fstat ()} system calls can
 368 be quite expensive.  For every @code{#include}-d file, we need to try
 369 all the directories in the search path until we find a match.  Some
 370 projects, such as glibc, pass twenty or thirty include paths on the
 371 command line, so this can rapidly become time consuming.
 372
 373 For a header file we have not encountered before we have little choice
 374 but to do this.  However, it is often the case that the same headers are
 375 repeatedly included, and in these cases we try to avoid repeating the
 376 filesystem queries whilst searching for the correct file.
 377
 378 For each file we try to open, we store the constructed path in a splay
 379 tree.  This path first undergoes simplification by the function
 380 @code{_cpp_simplify_pathname}.  For example,
 381 @samp{/usr/include/bits/../foo.h} is simplified to
 382 @samp{/usr/include/foo.h} before we enter it in the splay tree and try
 383 to @code{open ()} the file.  CPP will then find subsequent uses of
 384 @samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and
 385 save system calls.
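
Textual path simplification of this kind can be sketched as below.
This is a minimal sketch, not the code of
@code{_cpp_simplify_pathname}: the name @samp{simplify_path} is
hypothetical, only @samp{/} is treated as a separator, and symbolic
links are ignored.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplify PATH in place, purely textually: drop empty and "."
   components and cancel each "name/.." pair.  Returns PATH.  */
static char *
simplify_path (char *path)
{
  char *base = path;       /* start of the simplified components */
  char *out;               /* one past the last character written */
  const char *in = path;   /* next unread input character */

  assert (path != NULL);
  if (*path == '\0')
    return path;

  if (*in == '/')
    {
      in++;
      base++;              /* keep the leading slash of an absolute path */
    }
  out = base;

  while (*in != '\0')
    {
      size_t len = strcspn (in, "/");

      if (len == 2 && in[0] == '.' && in[1] == '.')
        {
          /* Find the start of the last component written so far.  */
          char *last = out;
          int last_is_dotdot;

          while (last > base && last[-1] != '/')
            last--;
          last_is_dotdot = (out - last == 2
                            && last[0] == '.' && last[1] == '.');

          if (out > base && !last_is_dotdot)
            out = last > base ? last - 1 : base;   /* cancel "name/.." */
          else if (base == path)
            {
              /* A relative path keeps its leading "..".  */
              if (out > base)
                *out++ = '/';
              *out++ = '.';
              *out++ = '.';
            }
          /* Otherwise "/.." is just "/", so the component is dropped.  */
        }
      else if (len != 0 && !(len == 1 && in[0] == '.'))
        {
          if (out > base)
            *out++ = '/';
          memmove (out, in, len);
          out += len;
        }

      in += len;
      if (*in == '/')
        in++;
    }

  if (out == path)
    *out++ = '.';          /* everything cancelled: current directory */
  *out = '\0';
  return path;
}
```
@end verbatim

Two spellings that simplify to the same string can then share one
entry in whatever structure is keyed on the path.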
 386
 387 Further, it is likely the file contents have also been cached, saving a
 388 @code{read ()} system call.  We don't bother caching the contents of
 389 header files that are re-inclusion protected, and whose re-inclusion
 390 macro is defined when we leave the header file for the first time.  If
 391 the host supports it, we try to map suitably large files into memory,
 392 rather than reading them in directly.
 393
 394 The include paths are internally stored on a null-terminated
 395 singly-linked list, starting with the @code{"header.h"} directory search
 396 chain, which then links into the @code{<header.h>} directory chain.
 397
 398 Files included with the @code{<foo.h>} syntax start the lookup directly
 399 in the second half of this chain.  However, files included with the
 400 @code{"foo.h"} syntax start at the beginning of the chain, but with one
 401 extra directory prepended.  This is the directory of the current file;
 402 the one containing the @code{#include} directive.  Prepending this
 403 directory on a per-file basis is handled by the function
 404 @code{search_from}.
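
The search order can be sketched with a small linked list.  The
structure, the example directory names and the function
@samp{nth_dir_tried} are all hypothetical; the sketch only shows the
order in which directories would be tried, not how @code{search_from}
is written.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct search_dir
{
  const char *name;
  const struct search_dir *next;
};

/* Example chain: two quote directories linking into two bracket
   directories, ending in a null pointer.  */
static const struct search_dir bracket2 = { "/usr/include", NULL };
static const struct search_dir bracket1 = { "/usr/local/include", &bracket2 };
static const struct search_dir quote2 = { "quote-b", &bracket1 };
static const struct search_dir quote1 = { "quote-a", &quote2 };

/* Return the Nth (0-based) directory that would be tried for an
   include found in a file living in CURRENT_DIR, or NULL once the
   chain is exhausted.  ANGLE_BRACKETS is nonzero for <foo.h>.  */
static const char *
nth_dir_tried (const char *current_dir, int angle_brackets, unsigned int n)
{
  const struct search_dir *d;

  if (angle_brackets)
    d = &bracket1;          /* start in the second half of the chain */
  else
    {
      if (n == 0)
        return current_dir; /* directory of the including file first */
      n--;
      d = &quote1;          /* then the whole chain from its start */
    }

  for (; d != NULL; d = d->next)
    {
      if (n == 0)
        return d->name;
      n--;
    }
  return NULL;
}
```
@end verbatim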
 405
 406 Note that a header included with a directory component, such as
 407 @code{#include "mydir/foo.h"} and opened as
 408 @samp{/usr/local/include/mydir/foo.h}, will have the complete path minus
 409 the basename @samp{foo.h} as the current directory.
 410
 411 Enough information is stored in the splay tree that CPP can immediately
 412 tell whether it can skip the header file because of the multiple include
 413 optimisation, whether the file didn't exist or couldn't be opened for
 414 some reason, or whether the header was flagged not to be re-used, as it
 415 is with the obsolete @code{#import} directive.
 416
 417 For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
 418 CPP offers the ability to treat various include file names as aliases
 419 for the real header files with shorter names.  The map from one to the
 420 other is found in a special file called @samp{header.gcc}, stored in the
 421 command line (or system) include directories to which the mapping
 422 applies.  This may be higher up the directory tree than the full path to
 423 the file minus the base name.
 424
 425 @node Index,, Files, Top
 426 @unnumbered Index
 427 @printindex cp
 428
 429 @contents
 430 @bye