2011-03-11 Richard Guenther <rguenther@suse.de>

diff --git a/gcc/doc/cppinternals.texi b/gcc/doc/cppinternals.texi

index a831922..a22ef0d 100644
--- a/gcc/doc/cppinternals.texi
+++ b/gcc/doc/cppinternals.texi
@@ -2,8 +2,10 @@
  @setfilename cppinternals.info
  @settitle The GNU C Preprocessor Internals
  
+@include gcc-common.texi
+
  @ifinfo
-@dircategory Programming
+@dircategory Software development
  @direntry
  * Cpplib: (cppinternals).      Cpplib internals.
  @end direntry
@@ -16,7 +18,8 @@
  @ifinfo
  This file documents the internals of the GNU C Preprocessor.
  
-Copyright 2000, 2001 Free Software Foundation, Inc.
+Copyright 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software
+Foundation, Inc.
  
  Permission is granted to make and distribute verbatim copies of
  this manual provided the copyright notice and this permission notice
@@ -39,15 +42,13 @@ into another language, under the above conditions for modified versions.
  @end ifinfo
  
  @titlepage
-@c @finalout
  @title Cpplib Internals
-@subtitle Last revised December 2001
-@subtitle for GCC version 3.1
+@versionsubtitle
  @author Neil Booth
  @page
  @vskip 0pt plus 1filll
  @c man begin COPYRIGHT
-Copyright @copyright{} 2000, 2001
+Copyright @copyright{} 2000, 2001, 2002, 2004, 2005
  Free Software Foundation, Inc.
  
  Permission is granted to make and distribute verbatim copies of
@@ -66,12 +67,13 @@ into another language, under the above conditions for modified versions.
  @contents
  @page
  
+@ifnottex
  @node Top
  @top
-@chapter Cpplib---the core of the GNU C Preprocessor
+@chapter Cpplib---the GNU C Preprocessor
  
-The GNU C preprocessor in GCC 3.x has been completely rewritten.  It is
-now implemented as a library, @dfn{cpplib}, so it can be easily shared between
+The GNU C preprocessor is
+implemented as a library, @dfn{cpplib}, so it can be easily shared between
  a stand-alone preprocessor, and a preprocessor integrated with the C,
  C++ and Objective-C front ends.  It is also available for use by other
  programs, though this is not recommended as its exposed interface has
@@ -98,8 +100,9 @@ the way they have.
  * Line Numbering::      Tracking location within files.
  * Guard Macros::        Optimizing header files with guard macros.
  * Files::               File handling.
-* Index::               Index.
+* Concept Index::       Index.
  @end menu
+@end ifnottex
  
  @node Conventions
  @unnumbered Conventions
@@ -111,7 +114,7 @@ other is for both internal and external use.
  
  The convention is that functions and types that are exposed to multiple
  files internally are prefixed with @samp{_cpp_}, and are to be found in
-the file @file{cpphash.h}.  Functions and types exposed to external
+the file @file{internal.h}.  Functions and types exposed to external
  clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}.  For
  historical reasons this is no longer quite true, but we should strive to
  stick to it.
@@ -130,7 +133,7 @@ behavior.
  @cindex escaped newlines
  
  @section Overview
-The lexer is contained in the file @file{cpplex.c}.  It is a hand-coded
+The lexer is contained in the file @file{lex.c}.  It is a hand-coded
  lexer, and not implemented as a state machine.  It can understand C, C++
  and Objective-C source code, and has been extended to allow reasonably
  successful preprocessing of assembly language.  The lexer does not make
@@ -226,7 +229,7 @@ foo
  @end smallexample
  
  This is a good example of the subtlety of getting token spacing correct
-in the preprocessor; there are plenty of tests in the test suite for
+in the preprocessor; there are plenty of tests in the testsuite for
  corner cases like this.
  
  The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
@@ -368,8 +371,8 @@ chaining a new token run on to the end of the existing one.
  
  The tokens forming a macro's replacement list are collected by the
  @code{#define} handler, and placed in storage that is only freed by
-@code{cpp_destroy}.  So if a macro is expanded in our line of tokens,
-the pointers to the tokens of its expansion that we return will always
+@code{cpp_destroy}.  So if a macro is expanded in the line of tokens,
+the pointers to the tokens of its expansion that are returned will always
  remain valid.  However, macros are a little trickier than that, since
  they give rise to three sources of fresh tokens.  They are the built-in
  macros like @code{__LINE__}, and the @samp{#} and @samp{##} operators
@@ -469,26 +472,26 @@ enum stored in its hash node, so that directive lookup is also O(1).
  @unnumbered Macro Expansion Algorithm
  @cindex macro expansion
  
-Macro expansion is a surprisingly tricky operation, fraught with nasty
-corner cases and situations that render what you thought was a nifty
-way to optimize the preprocessor's expansion algorithm wrong in quite
-subtle ways.
+Macro expansion is a tricky operation, fraught with nasty corner cases
+and situations that render what you thought was a nifty way to
+optimize the preprocessor's expansion algorithm wrong in quite subtle
+ways.
  
  I strongly recommend you have a good grasp of how the C and C++
  standards require macros to be expanded before diving into this
  section, let alone the code!.  If you don't have a clear mental
  picture of how things like nested macro expansion, stringification and
-token pasting are supposed to work, damage to you sanity can quickly
+token pasting are supposed to work, damage to your sanity can quickly
  result.
  
-@section Internal representation of Macros
+@section Internal representation of macros
  @cindex macro representation (internal)
  
  The preprocessor stores macro expansions in tokenized form.  This
  saves repeated lexing passes during expansion, at the cost of a small
  increase in memory consumption on average.  The tokens are stored
  contiguously in memory, so a pointer to the first one and a token
-count is all we need.
+count is all you need to get the replacement list of a macro.
  
  If the macro is a function-like macro the preprocessor also stores its
  parameters, in the form of an ordered list of pointers to the hash
@@ -502,13 +505,137 @@ the original parameters to the macro, both for dumping with e.g.,
  @option{-dD}, and to warn about non-trivial macro redefinitions when
  the parameter names have changed.
  
-@section Nested object-like macros
-
-@c TODO
+@section Macro expansion overview
+The preprocessor maintains a @dfn{context stack}, implemented as a
+linked list of @code{cpp_context} structures, which together represent
+the macro expansion state at any one time.  The @code{struct
+cpp_reader} member variable @code{context} points to the current top
+of this stack.  The top normally holds the unexpanded replacement list
+of the innermost macro under expansion, except when cpplib is about to
+pre-expand an argument, in which case it holds that argument's
+unexpanded tokens.
+
+When there are no macros under expansion, cpplib is in @dfn{base
+context}.  All contexts other than the base context contain a
+contiguous list of tokens delimited by a starting and ending token.
+When not in base context, cpplib obtains the next token from the list
+of the top context.  If there are no tokens left in the list, it pops
+that context off the stack, and subsequent ones if necessary, until an
+unexhausted context is found or it returns to base context.  In base
+context, cpplib reads tokens directly from the lexer.
+
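The token-fetching loop described above can be sketched as follows.  This is a minimal illustration, not cpplib's code: the names @code{toy_context}, @code{toy_reader} and @code{toy_get_token} are hypothetical, tokens are reduced to plain integers, and the base context is modelled as the one context whose @code{prev} pointer is null.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not cpplib's real declarations.  */
typedef struct toy_context {
    struct toy_context *prev;  /* next context down; NULL marks base context */
    const int *first;          /* next unread token in this context */
    const int *last;           /* one past the final token */
} toy_context;

typedef struct toy_reader {
    toy_context *context;      /* current top of the context stack */
    const int *lexer_cur;      /* stand-in for the lexer's input */
    const int *lexer_end;
} toy_reader;

/* Return the next token, or -1 once the lexer input is exhausted.  */
int toy_get_token(toy_reader *r)
{
    assert(r != NULL && r->context != NULL);
    for (;;) {
        toy_context *c = r->context;
        if (c->prev == NULL) {
            /* Base context: read directly from the lexer.  */
            return r->lexer_cur < r->lexer_end ? *r->lexer_cur++ : -1;
        }
        if (c->first < c->last)
            return *c->first++;   /* context stays on the stack */
        /* Exhausted: pop it and retry with the context below.  */
        r->context = c->prev;
    }
}
```

Note that in this sketch an exhausted context is popped only when the @emph{next} token is requested, so the context that supplied the last token is still on the stack while that token is examined.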
+If it encounters an identifier that is both a macro and enabled for
+expansion, cpplib prepares to push a new context for that macro on the
+stack by calling the routine @code{enter_macro_context}.  When this
+routine returns, the new context will contain the unexpanded tokens of
+the replacement list of that macro.  In the case of function-like
+macros, @code{enter_macro_context} also replaces any parameters in the
+replacement list, stored as @code{CPP_MACRO_ARG} tokens, with the
+appropriate macro argument.  If the standard requires that the
+parameter be replaced with its expanded argument, the argument will
+have been fully macro expanded first.
+
+@code{enter_macro_context} also handles special macros like
+@code{__LINE__}.  Although these macros expand to a single token which
+cannot contain any further macros, for reasons of token spacing
+(@pxref{Token Spacing}) and simplicity of implementation, cpplib
+handles these special macros by pushing a context containing just that
+one token.
+
+The final thing that @code{enter_macro_context} does before returning
+is to mark the macro disabled for expansion (except for special macros
+like @code{__TIME__}).  The macro is re-enabled when its context is
+later popped from the context stack, as described above.  This strict
+ordering ensures that a macro is disabled whilst its expansion is
+being scanned, but that it is @emph{not} disabled whilst any arguments
+to it are being expanded.
+
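The effect of this ordering can be observed from ordinary C code.  The following is a small check rather than anything taken from cpplib; the helper macros @code{STR} and @code{STR_} and the macros @code{id} and @code{self} are made up for the illustration.  @code{STR} fully expands its argument and then stringifies the result, so the expansion can be inspected at run time.

```c
#include <assert.h>
#include <string.h>

#define STR_(x) #x
#define STR(x)  STR_(x)   /* expand x fully, then stringify it */

#define id(x) x
#define self  self + 1

/* id is still enabled while its own argument is pre-expanded,
   so the inner invocation expands too.  */
const char *const nested_text = STR(id(id(1)));

/* self is disabled while its replacement list is scanned,
   so the inner self is left alone.  */
const char *const self_text = STR(self);

void check_ordering(void)
{
    assert(strcmp(nested_text, "1") == 0);
    assert(strcmp(self_text, "self + 1") == 0);
}
```

If a macro were disabled while its arguments were being expanded, the first string would contain an unexpanded @samp{id(1)} instead of @samp{1}.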
+@section Scanning the replacement list for macros to expand
+The C standard states that, after any parameters have been replaced
+with their possibly-expanded arguments, the replacement list is
+scanned for nested macros.  Further, any identifiers in the
+replacement list that are not expanded during this scan are never
+again eligible for expansion in the future, if the reason they were
+not expanded is that the macro in question was disabled.
+
+Clearly this latter condition can only apply to tokens resulting from
+argument pre-expansion.  Other tokens never have an opportunity to be
+re-tested for expansion.  It is possible for identifiers that are
+function-like macros to not expand initially but to expand during a
+later scan.  This occurs when the identifier is the last token of an
+argument (and therefore originally followed by a comma or a closing
+parenthesis in its macro's argument list), and when it replaces its
+parameter in the macro's replacement list, the subsequent token
+happens to be an opening parenthesis (itself possibly the first token
+of an argument).
+
+It is important to note that when cpplib reads the last token of a
+given context, that context still remains on the stack.  Only when
+looking for the @emph{next} token does cpplib pop it off the stack and drop
+to a lower context.  This makes backing up by one token easy, but more
+importantly ensures that the macro corresponding to the current
+context is still disabled when cpplib is considering the last token of
+its replacement list for expansion (or indeed expanding it).  As an
+example, which illustrates many of the points above, consider
  
-@section Function-like macros
+@smallexample
+#define foo(x) bar x
+foo(foo) (2)
+@end smallexample
  
-@c TODO
+@noindent which fully expands to @samp{bar foo (2)}.  During pre-expansion
+of the argument, @samp{foo} does not expand even though the macro is
+enabled, since it has no following parenthesis [pre-expansion of an
+argument only uses tokens from that argument; it cannot take tokens
+from whatever follows the macro invocation].  This still leaves the
+argument token @samp{foo} eligible for future expansion.  Then, when
+re-scanning after argument replacement, the token @samp{foo} is
+rejected for expansion, and marked ineligible for future expansion,
+since the macro is now disabled.  It is disabled because the
+replacement list @samp{bar foo} of the macro is still on the context
+stack.
+
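The example can be checked with a compiler by stringifying the expanded invocation.  This is a sketch under the assumption that the preprocessor behaves as described here; @code{STR}, @code{STR_} and @code{expanded_text} are names invented for the check.  Only the order of the tokens is tested, not the exact spacing.

```c
#include <assert.h>
#include <string.h>

#define STR_(x) #x
#define STR(x)  STR_(x)   /* expand x fully, then stringify it */

#define foo(x) bar x

/* Text of the fully expanded invocation from the example.  */
const char *const expanded_text = STR(foo(foo) (2));

void check_expansion(void)
{
    /* The expansion begins with bar ...  */
    assert(strncmp(expanded_text, "bar", 3) == 0);
    /* ... and the inner foo survives rather than expanding again.  */
    assert(strstr(expanded_text, "foo") != NULL);
}
```

Had the inner @samp{foo} been expanded a second time, the string would contain a second @samp{bar} and no @samp{foo} at all.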
+If instead the algorithm looked for an opening parenthesis first and
+then tested whether the macro were disabled it would be subtly wrong.
+In the example above, the replacement list of @samp{foo} would be
+popped in the process of finding the parenthesis, re-enabling
+@samp{foo} and expanding it a second time.
+
+@section Looking for a function-like macro's opening parenthesis
+Function-like macros only expand when immediately followed by a
+parenthesis.  To do this cpplib needs to temporarily disable macros
+and read the next token.  Unfortunately, because of spacing issues
+(@pxref{Token Spacing}), there can be fake padding tokens in between,
+and if the next real token is not a parenthesis cpplib needs to be
+able to back up that one token as well as retain the information in
+any intervening padding tokens.
+
+Backing up more than one token when macros are involved is not
+permitted by cpplib, because in general it might involve issues like
+restoring popped contexts onto the context stack, which are too hard.
+Instead, searching for the parenthesis is handled by a special
+function, @code{funlike_invocation_p}, which remembers padding
+information as it reads tokens.  If the next real token is not an
+opening parenthesis, it backs up that one token, and then pushes an
+extra context just containing the padding information if necessary.
+
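A much-simplified sketch of this search follows.  It is not the real @code{funlike_invocation_p}: the names are hypothetical, the token stream is a plain array, and the padding information is reduced to a single flag that the caller could use to decide whether to push an extra padding context.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_PADDING, TOY_OPEN_PAREN, TOY_OTHER, TOY_EOF };

/* Hypothetical token stream: an array plus a read position.  */
typedef struct toy_stream {
    const enum toy_type *tokens;
    size_t pos, len;
} toy_stream;

enum toy_type toy_next(toy_stream *s)
{
    return s->pos < s->len ? s->tokens[s->pos++] : TOY_EOF;
}

/* Return 1 if the next real token is an opening parenthesis.
   Otherwise back up exactly one token and return 0.  In both cases
   *saw_padding records whether padding tokens were skipped.  */
int toy_funlike_invocation_p(toy_stream *s, int *saw_padding)
{
    enum toy_type t;

    *saw_padding = 0;
    while ((t = toy_next(s)) == TOY_PADDING)
        *saw_padding = 1;          /* remember the padding */

    if (t == TOY_OPEN_PAREN)
        return 1;

    if (t != TOY_EOF) {
        assert(s->pos > 0);
        s->pos--;                  /* back up the one real token only */
    }
    return 0;
}
```

The sketch never backs up over the padding itself; it keeps the fact that padding was seen, which is the part of the design the text above describes.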
+@section Marking tokens ineligible for future expansion
+As discussed above, cpplib needs a way of marking tokens as
+unexpandable.  Since the tokens cpplib handles are read-only once they
+have been lexed, it instead makes a copy of the token and adds the
+flag @code{NO_EXPAND} to the copy.
+
+For efficiency and to simplify memory management by avoiding having to
+remember to free these tokens, they are allocated as temporary tokens
+from the lexer's current token run (@pxref{Lexing a line}) using the
+function @code{_cpp_temp_token}.  The tokens are then re-used once the
+current line of tokens has been read in.
+
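The copy-and-flag step can be sketched as below.  Everything here is an assumption made for illustration: @code{toy_token}, @code{toy_mark_unexpandable} and the bit value of @code{TOY_NO_EXPAND} are hypothetical, and the caller-supplied scratch slot merely stands in for a temporary token taken from the lexer's token run.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_EXPAND 0x04u    /* hypothetical bit value */

typedef struct toy_token {
    const char *spelling;
    unsigned flags;
} toy_token;

/* Copy *src into scratch storage and flag the copy as unexpandable.
   The original token is never written to.  */
const toy_token *toy_mark_unexpandable(const toy_token *src,
                                       toy_token *scratch)
{
    assert(src != NULL && scratch != NULL);
    *scratch = *src;
    scratch->flags |= TOY_NO_EXPAND;
    return scratch;
}
```

Because the original token is only read, it can stay in read-only storage shared by every expansion of the macro, while the flagged copy lives only as long as the scratch storage does.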
+This might sound unsafe.  However, token runs are not re-used at the
+end of a line if it happens to be in the middle of a macro argument
+list, and cpplib only wants to back up more than one lexer token in
+situations where no macro expansion is involved, so the optimization
+is safe.
  
  @node Token Spacing
  @unnumbered Token Spacing
@@ -516,8 +643,8 @@ the parameter names have changed.
  @cindex spacing
  @cindex token spacing
  
-First, let's look at an issue that only concerns the stand-alone
-preprocessor: we want to guarantee that re-reading its preprocessed
+First, consider an issue that only concerns the stand-alone
+preprocessor: there needs to be a guarantee that re-reading its preprocessed
  output results in an identical token stream.  Without taking special
  measures, this might not be the case because of macro substitution.
  For example:
@@ -546,7 +673,7 @@ expansion, but accidental pasting can occur in many places: both before
  and after each macro replacement, each argument replacement, and
  additionally each token created by the @samp{#} and @samp{##} operators.
  
-Let's look at how the preprocessor gets whitespace output correct
+Now consider how the preprocessor gets whitespace output correct
  normally.  The @code{cpp_token} structure contains a flags byte, and one
  of those flags is @code{PREV_WHITE}.  This is flagged by the lexer, and
  indicates that the token was preceded by whitespace of some form other
@@ -595,11 +722,11 @@ a macro's first replacement token expands straight into another macro.
  
  Here, two padding tokens are generated with sources the @samp{foo} token
  between the brackets, and the @samp{bar} token from foo's replacement
-list, respectively.  Clearly the first padding token is the one we
-should use, so our output code should contain a rule that the first
+list, respectively.  Clearly the first padding token is the one to
+use, so the output code should contain a rule that the first
  padding token in a sequence is the one that matters.
  
-But what if we happen to leave a macro expansion?  Adjusting the above
+But what if the preprocessor leaves a macro expansion?  Adjusting the above
  example slightly:
  
  @smallexample
@@ -664,8 +791,8 @@ lexed on if, for example, there are intervening escaped newlines or
  C-style comments.  For example:
  
  @smallexample
-foo /* A long
-comment */ bar \
+foo /* @r{A long
+comment} */ bar \
  baz
  @result{}
  foo bar baz
@@ -838,7 +965,7 @@ directives outside the main conditional block for the optimization to be
  on.
  
  Note that whilst we are inside the conditional block, @code{mi_valid} is
-likely to be reset to @code{false}, but this does not matter since the
+likely to be reset to @code{false}, but this does not matter since
  the closing @code{#endif} restores it to @code{true} if appropriate.
  
  Finally, since @code{_cpp_lex_direct} pops the file off the buffer stack
@@ -871,7 +998,7 @@ is turned off.
  @cindex files
  
  Fairly obviously, the file handling code of cpplib resides in the file
-@file{cppfiles.c}.  It takes care of the details of file searching,
+@file{files.c}.  It takes care of the details of file searching,
  opening, reading and caching, for both the main source file and all the
  headers it recursively includes.
  
@@ -934,8 +1061,8 @@ command line (or system) include directories to which the mapping
  applies.  This may be higher up the directory tree than the full path to
  the file minus the base name.
  
-@node Index
-@unnumbered Index
+@node Concept Index
+@unnumbered Concept Index
  @printindex cp
  
  @bye