gcc/doc/cppinternals.texi

   1 \input texinfo
   2 @setfilename cppinternals.info
   3 @settitle The GNU C Preprocessor Internals
   4
   5 @ifinfo
   6 @dircategory Programming
   7 @direntry
   8 * Cpplib: (cppinternals).      Cpplib internals.
   9 @end direntry
  10 @end ifinfo
  11
  12 @c @smallbook
  13 @c @cropmarks
  14 @c @finalout
  15 @setchapternewpage odd
  16 @ifinfo
  17 This file documents the internals of the GNU C Preprocessor.
  18
  19 Copyright 2000, 2001 Free Software Foundation, Inc.
  20
  21 Permission is granted to make and distribute verbatim copies of
  22 this manual provided the copyright notice and this permission notice
  23 are preserved on all copies.
  24
  25 @ignore
  26 Permission is granted to process this file through Tex and print the
  27 results, provided the printed document carries copying permission
  28 notice identical to this one except for the removal of this paragraph
  29 (this paragraph not being relevant to the printed manual).
  30
  31 @end ignore
  32 Permission is granted to copy and distribute modified versions of this
  33 manual under the conditions for verbatim copying, provided also that
  34 the entire resulting derived work is distributed under the terms of a
  35 permission notice identical to this one.
  36
  37 Permission is granted to copy and distribute translations of this manual
  38 into another language, under the above conditions for modified versions.
  39 @end ifinfo
  40
  41 @titlepage
  42 @c @finalout
  43 @title Cpplib Internals
  44 @subtitle Last revised Jan 2001
  45 @subtitle for GCC version 3.0
  46 @author Neil Booth
  47 @page
  48 @vskip 0pt plus 1filll
  49 @c man begin COPYRIGHT
  50 Copyright @copyright{} 2000, 2001
  51 Free Software Foundation, Inc.
  52
  53 Permission is granted to make and distribute verbatim copies of
  54 this manual provided the copyright notice and this permission notice
  55 are preserved on all copies.
  56
  57 Permission is granted to copy and distribute modified versions of this
  58 manual under the conditions for verbatim copying, provided also that
  59 the entire resulting derived work is distributed under the terms of a
  60 permission notice identical to this one.
  61
  62 Permission is granted to copy and distribute translations of this manual
  63 into another language, under the above conditions for modified versions.
  64 @c man end
  65 @end titlepage
  66 @contents
  67 @page
  68
  69 @node Top, Conventions,, (DIR)
  70 @chapter Cpplib - the core of the GNU C Preprocessor
  71
  72 The GNU C preprocessor in GCC 3.0 has been completely rewritten.  It is
  73 now implemented as a library, cpplib, so it can be easily shared between
  74 a stand-alone preprocessor, and a preprocessor integrated with the C,
  75 C++ and Objective C front ends.  It is also available for use by other
  76 programs, though this is not recommended as its exposed interface has
  77 not yet reached a point of reasonable stability.
  78
  79 This library has been written to be re-entrant, so that it can be used
  80 to preprocess many files simultaneously if necessary.  It has also been
  81 written with the preprocessing token as the fundamental unit; the
  82 preprocessor in previous versions of GCC would operate on text strings
  83 as the fundamental unit.
  84
  85 This brief manual documents some of the internals of cpplib, and a few
  86 tricky issues encountered.  It also describes certain behaviour we would
  87 like to preserve, such as the format and spacing of its output.
  88
  89 The topics covered include lexing, identifiers and their hash nodes, and macro expansion.
  90
  91 @menu
  92 * Conventions::     Conventions used in the code.
  93 * Lexer::           The combined C, C++ and Objective C Lexer.
  94 * Whitespace::      Input and output newlines and whitespace.
  95 * Hash Nodes::      All identifiers are hashed.
  96 * Macro Expansion:: Macro expansion algorithm.
  97 * Files::           File handling.
  98 * Index::           Index.
  99 @end menu
 100
 101 @node Conventions, Lexer, Top, Top
 102 @unnumbered Conventions
 103 @cindex interface
 104 @cindex header files
 105
 106 cpplib has two interfaces - one is exposed internally only, and the
 107 other is for both internal and external use.
 108
 109 The convention is that functions and types that are exposed to multiple
 110 files internally are prefixed with @samp{_cpp_}, and are to be found in
 111 the file @samp{cpphash.h}.  Functions and types exposed to external
 112 clients are in @samp{cpplib.h}, and prefixed with @samp{cpp_}.  For
 113 historical reasons this is no longer quite true, but we should strive to
 114 stick to it.
 115
 116 We are striving to reduce the information exposed in cpplib.h to the
 117 bare minimum necessary, and then to keep it there.  This makes clear
 118 exactly what external clients are entitled to assume, and allows us to
 119 change internals in the future without worrying whether library clients
 120 are perhaps relying on some kind of undocumented implementation-specific
 121 behaviour.
 122
 123 @node Lexer, Whitespace, Conventions, Top
 124 @unnumbered The Lexer
 125 @cindex lexer
 126 @cindex tokens
 127
 128 The lexer is contained in the file @samp{cpplex.c}.  We want to have a
 129 lexer that is single-pass, for efficiency reasons.  We would also like
 130 the lexer to only step forwards through the input files, and not step
 131 back.  This will make future changes to support different character
 132 sets, in particular state or shift-dependent ones, much easier.
 133
 134 This file also contains all information needed to spell a token, i.e. to
 135 output it either in a diagnostic or to a preprocessed output file.  This
 136 information is not exported, but made available to clients through such
 137 functions as @samp{cpp_spell_token} and @samp{cpp_token_len}.
 138
 139 The most painful aspect of lexing ISO-standard C and C++ is handling
 140 trigraphs and backslash-escaped newlines.  Trigraphs are processed before
 141 any interpretation of the meaning of a character is made, and unfortunately
 142 there is a trigraph representation for a backslash, so it is possible for
 143 the trigraph @samp{??/} to introduce an escaped newline.
 144
 145 Escaped newlines are tedious because theoretically they can occur
 146 anywhere - between the @samp{+} and @samp{=} of the @samp{+=} token,
 147 within the characters of an identifier, and even between the @samp{*}
 148 and @samp{/} that terminate a comment.  Moreover, you cannot be sure
 149 there is just one - there might be an arbitrarily long sequence of them.
 150
 151 So the routine @samp{parse_identifier}, which lexes an identifier, cannot
 152 assume that it can scan forwards until the first non-identifier
 153 character and be done with it, because this could be the @samp{\}
 154 introducing an escaped newline, or the @samp{?} introducing the trigraph
 155 sequence that represents the @samp{\} of an escaped newline.  The same
 156 holds for the routine that handles numbers, @samp{parse_number}.  If these
 157 routines stumble upon a @samp{?} or @samp{\}, they call
 158 @samp{skip_escaped_newlines} to skip over any potential escaped newlines
 159 before checking whether they can finish.
 160
 161 Similarly, code in the main body of @samp{_cpp_lex_token} cannot simply
 162 check for a @samp{=} after a @samp{+} character to determine whether it
 163 has a @samp{+=} token; it needs to be prepared for an escaped newline of
 164 some sort.  These cases use the function @samp{get_effective_char},
 165 which returns the first character after any intervening escaped newlines.
 166
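The look-ahead described above can be illustrated with a small model.
The following sketch is written in Python for brevity and is not
cpplib's code: the names @samp{effective_char} and @samp{lex_plus} are
hypothetical, only @samp{\n} is treated as a newline, and trigraph
conversion is assumed to be enabled.

@verbatim
```python
# Illustrative model only; these functions are not part of cpplib.

def skip_escaped_newlines(text, pos):
    # Skip any run of escaped newlines starting at pos.  The backslash
    # may be spelt "\" or as the trigraph "??/" (assumes trigraphs are on).
    while True:
        if text.startswith("\\\n", pos):
            pos += 2
        elif text.startswith("??/\n", pos):
            pos += 4
        else:
            return pos


def effective_char(text, pos):
    # Return (char, next_pos) for the first real character at or after
    # pos, or ("", len(text)) at end of input.
    pos = skip_escaped_newlines(text, pos)
    if pos >= len(text):
        return "", len(text)
    return text[pos], pos + 1


def lex_plus(text, pos):
    # Lex the token that starts with the "+" found at text[pos].
    char, after = effective_char(text, pos + 1)
    if char == "=":
        return "+=", after
    if char == "+":
        return "++", after
    return "+", pos + 1
```
@end verbatim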
 167 The lexer needs to keep track of the correct column position,
 168 including counting tabs as specified by the @samp{-ftabstop=} option.
 169 This should be done even within comments; C-style comments can appear in
 170 the middle of a line, and we want to report diagnostics in the correct
 171 position for text appearing after the end of the comment.
 172
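Counting tabs towards the column position comes down to simple
arithmetic.  The Python sketch below shows one way to do it; the
function name, the 1-based column convention and the default tab stop
of 8 are assumptions of the sketch rather than details taken from
cpplib.  Characters inside a comment are counted like any others.

@verbatim
```python
# Illustrative model only; not taken from cpplib.

def column_after(line, tabstop=8):
    # Return the 1-based column of the position just past `line`.
    # A tab advances to the next multiple of tabstop (counting columns
    # from zero); every other character advances by one column.
    col = 0  # zero-based count of columns consumed so far
    for ch in line:
        if ch == "\t":
            col += tabstop - col % tabstop
        else:
            col += 1
    return col + 1
```
@end verbatim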
 173 Some identifiers, such as @samp{__VA_ARGS__} and poisoned identifiers,
 174 may be invalid and require a diagnostic.  However, if they appear in a
 175 macro expansion we don't want to complain with each use of the macro.
 176 It is therefore best to catch them during the lexing stage, in
 177 @samp{parse_identifier}.  In both cases, whether a diagnostic is needed
 178 or not is dependent upon lexer state.  For example, we don't want to
 179 issue a diagnostic for re-poisoning a poisoned identifier, or for using
 180 @samp{__VA_ARGS__} in the expansion of a variable-argument macro.
 181 Therefore @samp{parse_identifier} makes use of flags to determine
 182 whether a diagnostic is appropriate.  Since we change state on a
 183 per-token basis, and don't lex whole lines at a time, this is not a
 184 problem.
 185
 186 Another place where state flags are used to change behaviour is whilst
 187 parsing header names.  Normally, a @samp{<} would be lexed as a token
 188 on its own.  After a @code{#include} directive, though, everything from
 189 the @samp{<} to the nearest @samp{>} should be lexed as a single
 190 header-name token.  Note that we don't allow the terminators of header
 191 names to be escaped; the first @samp{"} or @samp{>} terminates the name.
 192
 193 Interpretation of some character sequences depends upon whether we are
 194 lexing C, C++ or Objective C, and on the revision of the standard in
 195 force.  For example, @samp{::} is a single token in C++, but two
 196 separate @samp{:} tokens, and almost certainly a syntax error, in C.
 197 Such cases are handled in the main function @samp{_cpp_lex_token}, based
 198 upon the flags set in the @samp{cpp_options} structure.
 199
 200 Note we have almost, but not quite, achieved the goal of not stepping
 201 backwards in the input stream.  Currently @samp{skip_escaped_newlines}
 202 does step back, though with care it should be possible to adjust it so
 203 that this does not happen.  One tricky case arises when we meet a
 204 trigraph while the command line option @samp{-trigraphs} is not in
 205 force but @samp{-Wtrigraphs} is: we need to warn about it, but then
 206 buffer it and continue to treat it as three separate characters.
 207
 208 @node Whitespace, Hash Nodes, Lexer, Top
 209 @unnumbered Whitespace
 210 @cindex whitespace
 211 @cindex newlines
 212 @cindex escaped newlines
 213 @cindex paste avoidance
 214 @cindex line numbers
 215
 216 The lexer has been written to treat each of @samp{\r}, @samp{\n},
 217 @samp{\r\n} and @samp{\n\r} as a single newline indicator.  This allows
 218 it to transparently preprocess MS-DOS, Macintosh and Unix files without
 219 their needing to pass through a special filter beforehand.
 220
 221 We also decided to treat a backslash, either @samp{\} or the trigraph
 222 @samp{??/}, separated from one of the above newline indicators by
 223 non-comment whitespace only, as intending to escape the newline.  Such
 224 whitespace tends to be a typing mistake, and the sequence cannot
 225 reasonably be mistaken for anything else in any of the C-family
 226 grammars.  Since handling it this way is not strictly conforming to the
 227 ISO standard, the library issues a warning wherever it encounters it.
 228
 229 Handling newlines like this is made simpler by doing it in one place
 230 only.  The function @samp{handle_newline} takes care of all newline
 231 characters, and @samp{skip_escaped_newlines} takes care of arbitrarily
 232 long sequences of escaped newlines, deferring to @samp{handle_newline}
 233 to handle the newlines themselves.
 234
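As an illustration of treating all four newline indicators in one
place, here is a minimal Python model.  It borrows the name
@samp{handle_newline} from the text, but the body is a guess at the
idea and not the cpplib routine; @samp{count_lines} is a hypothetical
helper added to exercise it.

@verbatim
```python
# Illustrative model only; not the cpplib implementation.

def handle_newline(text, pos):
    # text[pos] is "\r" or "\n".  Return the index just past the
    # newline indicator that starts there.
    first = text[pos]
    nxt = text[pos + 1:pos + 2]  # "" at end of input
    # A two-character indicator is CR LF or LF CR; a repeated character
    # such as LF LF is two separate newlines.
    if nxt in ("\r", "\n") and nxt != first:
        return pos + 2
    return pos + 1


def count_lines(text):
    # Count newline indicators, whatever convention the file uses.
    lines = 0
    pos = 0
    while pos < len(text):
        if text[pos] in "\r\n":
            pos = handle_newline(text, pos)
            lines += 1
        else:
            pos += 1
    return lines
```
@end verbatim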
 235 Another whitespace issue only concerns the stand-alone preprocessor: we
 236 want to guarantee that re-reading the preprocessed output results in an
 237 identical token stream.  Without taking special measures, this might not
 238 be the case because of macro substitution.  We could simply insert a
 239 space between adjacent tokens, but ideally we would like to keep this to
 240 a minimum, both for aesthetic reasons and because it causes problems for
 241 people who still try to abuse the preprocessor for things like Fortran
 242 source and Makefiles.
 243
 244 The token structure contains a flags byte, and two flags are of interest
 245 here: @samp{PREV_WHITE} and @samp{AVOID_LPASTE}.  @samp{PREV_WHITE}
 246 indicates that the token was preceded by whitespace; if this is the case
 247 we need not worry about it incorrectly pasting with its predecessor.
 248 The @samp{AVOID_LPASTE} flag is set by the macro expansion routines, and
 249 indicates that paste avoidance by insertion of a space to the left of
 250 the token may be necessary.  Recursively, the first token of a macro
 251 substitution, the first token after a macro substitution, the first
 252 token of a substituted argument, and the first token after a substituted
 253 argument are all flagged @samp{AVOID_LPASTE} by the macro expander.
 254
 255 If a token flagged in this way does not have a @samp{PREV_WHITE} flag,
 256 and the routine @samp{cpp_avoid_paste} determines that it might be
 257 misinterpreted by the lexer if a space is not inserted between it and
 258 the immediately preceding token, then stand-alone CPP's output routines
 259 will insert a space between them.  To avoid excessive spacing,
 260 @samp{cpp_avoid_paste} tries hard to only request a space if one is
 261 likely to be necessary, but for reasons of efficiency it is slightly
 262 conservative and might recommend a space where one is not strictly
 263 needed.
 264
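The interaction of the two flags can be sketched as follows.  This
Python model is a deliberate simplification and not the real
@samp{cpp_avoid_paste}: tokens are reduced to a spelling and a flags
value, the flag values are arbitrary, @samp{might_paste} is a
hypothetical stand-in that knows only about identifier characters and
a handful of two-character punctuators, and printing a space for every
@samp{PREV_WHITE} token is an assumption of the sketch.

@verbatim
```python
# Illustrative model only; flag values and helpers are hypothetical.

PREV_WHITE = 1    # token was preceded by whitespace in the source
AVOID_LPASTE = 2  # set by macro expansion: check the left boundary

# Two-character sequences that adjacent tokens could form by accident.
PUNCTUATORS = ("++", "--", "+=", "-=", "->", "<<", ">>", "<=", ">=", "==",
               "!=", "&&", "||", "&=", "|=", "^=", "*=", "/=", "%=", "##",
               "::", "//", "/*")


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def might_paste(left, right):
    # Rough check: could the non-empty spellings left and right be read
    # back as something else if written with no space between them?
    a, b = left[-1], right[0]
    if is_word_char(a) and is_word_char(b):
        return True              # e.g. "x" "y" would become "xy"
    return a + b in PUNCTUATORS  # e.g. "+" "+" would become "++"


def write_tokens(tokens):
    # Join (spelling, flags) pairs, adding a space only where the flags
    # and might_paste ask for one.
    out = []
    prev = None
    for spelling, flags in tokens:
        if prev is not None:
            if flags & PREV_WHITE:
                out.append(" ")
            elif flags & AVOID_LPASTE and might_paste(prev, spelling):
                out.append(" ")
        out.append(spelling)
        prev = spelling
    return "".join(out)
```
@end verbatim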
 265 Finally, the preprocessor takes great care to ensure it keeps track of
 266 both the position of a token in the source file, for diagnostic
 267 purposes, and where it should appear in the output file, because using
 268 CPP for other languages like assembler requires this.  The two positions
 269 may differ for the following reasons:
 270
 271 @itemize @bullet
 272 @item
 273 Escaped newlines are deleted, so lines spliced in this way are joined to
 274 form a single logical line.
 275
 276 @item
 277 A macro expansion replaces the tokens that form its invocation, but any
 278 newlines appearing in the macro's arguments are interpreted as a single
 279 space, with the result that the macro's replacement appears in full on
 280 the same line that the macro name appeared in the source file.  This is
 281 particularly important for stringification of arguments - newlines
 282 embedded in the arguments must appear in the string as spaces.
 283 @end itemize
 284
 285 The source file location is maintained in the @samp{lineno} member of the
 286 @samp{cpp_buffer} structure, and the column number is inferred from the
 287 current position in the buffer relative to the @samp{line_base} buffer
 288 variable, which is updated with every newline whether escaped or not.
 289
 290 TODO: Finish this.
 291
 292 @node Hash Nodes, Macro Expansion, Whitespace, Top
 293 @unnumbered Hash Nodes
 294 @cindex hash table
 295 @cindex identifiers
 296 @cindex macros
 297 @cindex assertions
 298 @cindex named operators
 299
 300 When cpplib encounters an ``identifier'', it generates a hash code for it
 301 and stores it in the hash table.  By ``identifier'' we mean tokens with
 302 type @samp{CPP_NAME}; this includes identifiers in the usual C sense, as
 303 well as keywords, directive names, macro names and so on.  For example,
 304 all of @samp{pragma}, @samp{int}, @samp{foo} and @samp{__GNUC__} are
 305 identifiers and hashed when lexed.
 306
 307 Each node in the hash table contains various information about the
 308 identifier it represents, for example its length and type.  At any one
 309 time, each identifier falls into exactly one of three categories:
 310
 311 @itemize @bullet
 312 @item Macros
 313
 314 These have been declared to be macros, either on the command line or
 315 with @code{#define}.  A few, such as @samp{__TIME__}, are builtins
 316 entered in the hash table during initialisation.  The hash node for a
 317 normal macro points to a structure with more information about the
 318 macro, such as whether it is function-like, how many arguments it takes,
 319 and its expansion.  Builtin macros are flagged as special, and each
 320 instead contains an enum indicating which builtin macro it is.
 321
 322 @item Assertions
 323
 324 Assertions are in a separate namespace to macros.  To enforce this, cpp
 325 prepends a @code{#} character to the assertion's name before hashing it
 326 and entering it in the hash table.  An assertion's node points to a
 327 chain of answers to that assertion.
 328
 329 @item Void
 330
 331 Everything else falls into this category - an identifier that is not
 332 currently a macro, or a macro that has since been undefined with
 333 @code{#undef}.
 334
 335 When preprocessing C++, this category also includes the named operators,
 336 such as @samp{xor}.  In expressions these behave like the operators they
 337 represent, but in contexts where the spelling of a token matters they
 338 are spelt differently.  This spelling distinction is relevant when they
 339 are operands of the stringizing and pasting macro operators @code{#} and
 340 @code{##}.  Named operator hash nodes are flagged, both to catch the
 341 spelling distinction and to prevent them from being defined as macros.
 342 @end itemize
 343
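The three categories, and the @samp{#} prefix that keeps assertions
apart from macros, can be modelled in a few lines.  This Python sketch
is only an illustration: the class and method names are hypothetical,
and dictionaries stand in for cpplib's hash table and answer chains.

@verbatim
```python
# Illustrative model only; names and data structures are hypothetical.
from enum import Enum


class NodeType(Enum):
    VOID = "void"
    MACRO = "macro"
    ASSERTION = "assertion"


class Table:
    def __init__(self):
        self.types = {}    # hashed spelling -> NodeType
        self.answers = {}  # hashed spelling -> list of answers

    def node_type(self, key):
        # Anything never entered, or since undefined, is VOID.
        return self.types.get(key, NodeType.VOID)

    def define(self, name):
        self.types[name] = NodeType.MACRO

    def undef(self, name):
        self.types[name] = NodeType.VOID

    def add_assertion(self, predicate, answer):
        # "#" cannot start an identifier, so this key can never collide
        # with a macro of the same name.
        key = "#" + predicate
        self.types[key] = NodeType.ASSERTION
        self.answers.setdefault(key, []).append(answer)
```
@end verbatim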
 344 All occurrences of an identifier share one hash node.  Since each
 345 identifier token, after lexing, contains a pointer to its hash node,
 346 the node provides rapid lookup of various information.  For example, when
 347 parsing a @code{#define} directive, CPP flags each argument's identifier
 348 hash node with the index of that argument.  This makes duplicated
 349 argument checking an O(1) operation for each argument.  Similarly, for
 350 each identifier in the macro's expansion, lookup to see if it is an
 351 argument, and which argument it is, is also an O(1) operation.  Further,
 352 each directive name, such as @samp{endif}, has an associated directive
 353 enum stored in its hash node, so that directive lookup is also O(1).
 354
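The constant-time duplicate check described above follows directly
from every spelling having exactly one node.  The Python sketch below
illustrates the idea under stated assumptions: a dictionary stands in
for the hash table, and @samp{HashNode}, @samp{IdentifierTable},
@samp{parse_parameters} and @samp{clear_parameters} are hypothetical
names, not cpplib's.

@verbatim
```python
# Illustrative model only; names are hypothetical, not cpplib's.

class HashNode:
    # One node per distinct identifier spelling.
    def __init__(self, name):
        self.name = name
        self.arg_index = None  # set while the name is a macro parameter


class IdentifierTable:
    def __init__(self):
        self.nodes = {}

    def lookup(self, name):
        # Return the unique node for name, creating it on first use.
        node = self.nodes.get(name)
        if node is None:
            node = HashNode(name)
            self.nodes[name] = node
        return node


def clear_parameters(params):
    # Undo the marking once the #define has been processed.
    for node in params:
        node.arg_index = None


def parse_parameters(table, names):
    # Number the parameters of a function-like macro.  Returns the
    # parameter nodes, or None if a name is repeated.
    params = []
    for name in names:
        node = table.lookup(name)
        if node.arg_index is not None:  # already marked: a duplicate
            clear_parameters(params)
            return None
        node.arg_index = len(params)
        params.append(node)
    return params
```
@end verbatim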
 355 @node Macro Expansion, Files, Hash Nodes, Top
 356 @unnumbered Macro Expansion Algorithm
 357
 358 @node Files, Index, Macro Expansion, Top
 359 @unnumbered File Handling
 360 @cindex files
 361
 362 Fairly obviously, the file handling code of cpplib resides in the file
 363 @samp{cppfiles.c}.  It takes care of the details of file searching,
 364 opening, reading and caching, for both the main source file and all the
 365 headers it recursively includes.
 366
 367 The basic strategy is to minimize the number of system calls.  On many
 368 systems, the basic @code{open ()} and @code{fstat ()} system calls can
 369 be quite expensive.  For every @code{#include}-d file, we need to try
 370 all the directories in the search path until we find a match.  Some
 371 projects, such as glibc, pass twenty or thirty include paths on the
 372 command line, so this can rapidly become time consuming.
 373
 374 For a header file we have not encountered before we have little choice
 375 but to do this.  However, it is often the case that the same headers are
 376 repeatedly included, and in these cases we try to avoid repeating the
 377 filesystem queries whilst searching for the correct file.
 378
 379 For each file we try to open, we store the constructed path in a splay
 380 tree.  This path first undergoes simplification by the function
 381 @code{_cpp_simplify_pathname}.  For example,
 382 @samp{/usr/include/bits/../foo.h} is simplified to
 383 @samp{/usr/include/foo.h} before we enter it in the splay tree and try
 384 to @code{open ()} the file.  CPP will then find subsequent uses of
 385 @samp{foo.h}, even as @samp{/usr/include/foo.h}, in the splay tree and
 386 save system calls.
 387
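The saving can be seen in a small model.  In this Python sketch, which
is not cpplib code, a dictionary stands in for the splay tree,
Python's @samp{posixpath.normpath} stands in for
@code{_cpp_simplify_pathname}, and the @samp{opener} callable is a
hypothetical substitute for the real @code{open ()} and
@code{read ()} calls so that attempts can be counted.

@verbatim
```python
# Illustrative model only; a dict replaces the splay tree.
import posixpath


class FileCache:
    # Remember the outcome of every attempt to open a path.
    def __init__(self, opener):
        self.opener = opener  # callable: path -> contents, or None
        self.entries = {}     # simplified path -> contents or None
        self.open_calls = 0   # how often the opener was really invoked

    def lookup(self, path):
        # Simplify first, so different spellings share one entry.
        key = posixpath.normpath(path)
        if key not in self.entries:
            self.open_calls += 1
            self.entries[key] = self.opener(key)
        return self.entries[key]
```
@end verbatim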
 388 Further, it is likely the file contents have also been cached, saving a
 389 @code{read ()} system call.  We don't bother caching the contents of
 390 header files that are re-inclusion protected, and whose re-inclusion
 391 macro is defined when we leave the header file for the first time.  If
 392 the host supports it, we try to map suitably large files into memory,
 393 rather than reading them in directly.
 394
 395 The include paths are internally stored on a null-terminated
 396 singly-linked list, starting with the @code{"header.h"} directory search
 397 chain, which then links into the @code{<header.h>} directory chain.
 398
 399 Files included with the @code{<foo.h>} syntax start the lookup directly
 400 in the second half of this chain.  However, files included with the
 401 @code{"foo.h"} syntax start at the beginning of the chain, but with one
 402 extra directory prepended.  This is the directory of the current file;
 403 the one containing the @code{#include} directive.  Prepending this
 404 directory on a per-file basis is handled by the function
 405 @code{search_from}.
 406
 407 Note that a header included with a directory component, such as
 408 @code{#include "mydir/foo.h"} and opened as
 409 @samp{/usr/local/include/mydir/foo.h}, will have the complete path minus
 410 the basename @samp{foo.h} as the current directory.
 411
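The search order for the two @code{#include} forms can be summarised
in a short model.  This Python sketch is an illustration only: plain
lists stand in for the linked chain, @samp{search_dirs} and
@samp{find_header} are hypothetical names rather than cpplib
functions, and the @samp{exists} callable replaces real filesystem
queries.

@verbatim
```python
# Illustrative model only; not how search_from is written in cpplib.
import posixpath


def search_dirs(quote_chain, bracket_chain, current_file, angled):
    # Return the directories to try, in order, for one #include.
    # The quote chain is searched first and continues into the bracket
    # chain; <...> includes start directly at the bracket chain.
    if angled:
        return list(bracket_chain)
    # "..." includes try the including file's own directory first.
    current_dir = posixpath.dirname(current_file)
    return [current_dir] + list(quote_chain) + list(bracket_chain)


def find_header(name, dirs, exists):
    # Return the first directory/name for which exists() is true.
    for directory in dirs:
        candidate = posixpath.join(directory, name)
        if exists(candidate):
            return candidate
    return None
```
@end verbatim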
 412 Enough information is stored in the splay tree that CPP can immediately
 413 tell whether it can skip the header file because of the multiple include
 414 optimisation, whether the file didn't exist or couldn't be opened for
 415 some reason, or whether the header was flagged not to be re-used, as it
 416 is with the obsolete @code{#import} directive.
 417
 418 For the benefit of MS-DOS filesystems with an 8.3 filename limitation,
 419 CPP offers the ability to treat various include file names as aliases
 420 for the real header files with shorter names.  The map from one to the
 421 other is found in a special file called @samp{header.gcc}, stored in the
 422 command line (or system) include directories to which the mapping
 423 applies.  This may be higher up the directory tree than the full path to
 424 the file minus the base name.
 425
 426 @node Index,, Files, Top
 427 @unnumbered Index
 428 @printindex cp
 429
 430 @bye