ext/scintilla/doc/Lexer.txt

How to write a Scintilla lexer

A lexer for a particular language determines how a specified range of
text shall be colored.  Writing a lexer is relatively straightforward
because the lexer need only color given text.  The harder job of
determining how much text actually needs to be colored is handled by
Scintilla itself, that is, the lexer's caller.


Parameters

The lexer for language LLL has the following prototype:

    static void ColouriseLLLDoc (
        unsigned int startPos, int length,
        int initStyle,
        WordList *keywordlists[],
        Accessor &styler);

The styler parameter is an Accessor object.  The lexer must use this
object to access the text to be colored.  The lexer gets the character
at position i using styler.SafeGetCharAt(i).

The startPos and length parameters indicate the range of text to be
recolored; the lexer must determine the proper color for all characters
in positions startPos through startPos+length-1.

The initStyle parameter indicates the initial state, that is, the state
at the character before startPos. States also indicate the coloring to
be used for a particular range of text.

Note:  the character at startPos is assumed to start a line, so if a
newline terminates the initStyle state the lexer should enter its
default state (or whatever state should follow initStyle).

The keywordlists parameter specifies the keywords that the lexer must
recognize.  A WordList class object contains methods that simplify
the recognition of keywords.  Present lexers use a helper function
called classifyWordLLL to recognize keywords.  These functions show how
to use the keywordlists parameter to recognize keywords.  This
documentation will not discuss keywords further.


The lexer code

The task of a lexer can be summarized briefly: for each range r of
characters that are to be colored the same, the lexer should call

    styler.ColourTo(i, state)

where i is the position of the last character of the range r.  The lexer
should set the state variable to the coloring state of the character at
position i and continue until the entire text has been colored.

Note 1:  the styler (Accessor) object remembers the i parameter from the
previous call to styler.ColourTo, so the single i parameter suffices to
indicate a range of characters.

Note 2: As a side effect of calling styler.ColourTo(i,state), the
coloring states of all characters in the range are remembered so that
Scintilla may set the initStyle parameter correctly on future calls to
the lexer.
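
The range-filling behaviour described in Notes 1 and 2 can be sketched
with a small stand-in class.  MockStyler below is hypothetical: it is
not Scintilla's Accessor, and StyleAt is an illustrative name.  It only
models the idea that each ColourTo call styles everything from the end
of the previous range up to and including position i.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for Scintilla's Accessor.  It models only the
// range-filling behaviour of ColourTo; everything else is omitted.
class MockStyler {
public:
    explicit MockStyler(const std::string &text)
        : text_(text), styles_(text.size(), 0), next_(0) {}

    // Returns the character at pos, or a space when pos is out of range.
    char SafeGetCharAt(std::size_t pos) const {
        return pos < text_.size() ? text_[pos] : ' ';
    }

    // Styles every position from the end of the previous range up to and
    // including pos.  A pos that is already styled leaves things unchanged.
    void ColourTo(std::size_t pos, int state) {
        assert(pos < text_.size());
        for (; next_ <= pos; ++next_)
            styles_[next_] = state;
    }

    // Illustrative accessor for inspecting the remembered styles.
    int StyleAt(std::size_t pos) const { return styles_[pos]; }

private:
    std::string text_;
    std::vector<int> styles_;
    std::size_t next_;  // first position not yet styled
};
```

With the text "ab#cd", calling ColourTo(1, 0) and then ColourTo(4, 1)
leaves positions 0-1 in state 0 and positions 2-4 in state 1; the
second call never names its starting position.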


Lexer organization

There are at least two ways to organize the code of each lexer.  Present
lexers use what might be called a "character-based" approach: the outer
loop iterates over characters, like this:

  lengthDoc = startPos + length;
  chNext = styler.SafeGetCharAt(startPos);
  for (unsigned int i = startPos; i < lengthDoc; i++) {
    ch = chNext;
    chNext = styler.SafeGetCharAt(i + 1);
    << handle special cases >>
    switch(state) {
      // Handlers examine only ch and chNext.
      // Handlers call styler.ColourTo(i,state) if the state changes.
      case state_1: << handle ch in state 1 >>
      case state_2: << handle ch in state 2 >>
      ...
      case state_n: << handle ch in state n >>
    }
    chPrev = ch;
  }
  styler.ColourTo(lengthDoc - 1, state);
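
As a concrete illustration of the character-based loop, here is a
minimal sketch for a made-up toy language in which '#' starts a comment
that runs to the end of the line.  Everything in it is an assumption
made for illustration: the ToyState values, the colouriseToyDoc name,
and the use of a std::string and a local colourTo lambda in place of the
real Accessor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum ToyState { TOY_DEFAULT = 0, TOY_COMMENT = 1 };  // hypothetical states

// Character-based toy lexer.  Returns one style per character instead of
// calling styler.ColourTo on a real Accessor.
std::vector<int> colouriseToyDoc(const std::string &text, int initStyle) {
    assert(initStyle == TOY_DEFAULT || initStyle == TOY_COMMENT);
    std::vector<int> styles(text.size(), TOY_DEFAULT);
    std::size_t styled = 0;  // first position not yet coloured

    // Local equivalent of styler.ColourTo(pos, state).
    auto colourTo = [&](std::size_t pos, int state) {
        for (; styled <= pos && styled < styles.size(); ++styled)
            styles[styled] = state;
    };

    int state = initStyle;
    char chNext = text.empty() ? ' ' : text[0];
    for (std::size_t i = 0; i < text.size(); i++) {
        char ch = chNext;
        chNext = (i + 1 < text.size()) ? text[i + 1] : ' ';
        switch (state) {
        case TOY_DEFAULT:
            if (ch == '#') {
                if (i > 0) colourTo(i - 1, state);  // close the default range
                state = TOY_COMMENT;
            }
            break;
        case TOY_COMMENT:
            if (ch == '\n') {
                colourTo(i, state);  // the newline belongs to the comment
                state = TOY_DEFAULT;
            }
            break;
        }
    }
    if (!text.empty()) colourTo(text.size() - 1, state);
    return styles;
}
```

The toy handlers never need chNext; it is kept only to mirror the shape
of the loop above.  Passing TOY_COMMENT as initStyle shows how a range
that starts inside a comment is continued correctly.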


An alternative would be to use a "state-based" approach.  The outer loop
would iterate over states, like this:

  lengthDoc = startPos + length;
  for ( unsigned int i = startPos ;; ) {
    char ch = styler.SafeGetCharAt(i);
    int new_state = 0 ;
    switch ( state ) {
      // scanners set new_state if they set the next state.
      case state_1: << scan to the end of state 1 >> break ;
      case state_2: << scan to the end of state 2 >> break ;
      case default_state:
        << scan to the next non-default state and set new_state >>
    }
    styler.ColourTo(i, state);
    if ( i >= lengthDoc ) break ;
    if ( ! new_state ) {
      ch = styler.SafeGetCharAt(i);
      << set state based on ch in the default state >>
    }
  }
  styler.ColourTo(lengthDoc - 1, state);

This approach might seem to be more natural.  State scanners are simpler
than character scanners because less needs to be done.  For example,
there is no need to test for the start of a C string inside the scanner
for a C comment.  This approach also makes it natural to define routines
that could be used by more than one scanner; for example, a
scanToEndOfLine routine.
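
A minimal sketch of the state-based organization, again for a made-up
toy language where '#' starts a comment that runs to the end of the
line.  The names ToyState, scanToEndOfLine (as written here) and
colouriseToyDocByState are assumptions for illustration, and a
std::string stands in for the Accessor.  Each pass of the outer loop
handles one whole range rather than one character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum ToyState { TOY_DEFAULT = 0, TOY_COMMENT = 1 };  // hypothetical states

// Shared helper: index of the next '\n' at or after i, or the last index
// of text if there is none.  Requires i < text.size().
std::size_t scanToEndOfLine(const std::string &text, std::size_t i) {
    assert(i < text.size());
    while (i + 1 < text.size() && text[i] != '\n') i++;
    return i;
}

// State-based toy lexer.  Returns one style per character.
std::vector<int> colouriseToyDocByState(const std::string &text,
                                        int initStyle) {
    std::vector<int> styles(text.size(), TOY_DEFAULT);
    int state = initStyle;
    std::size_t i = 0;
    while (i < text.size()) {
        // Set the state based on the character seen in the default state.
        if (state == TOY_DEFAULT && text[i] == '#') state = TOY_COMMENT;

        std::size_t end = i;  // last position of the current range
        int newState;
        if (state == TOY_COMMENT) {
            end = scanToEndOfLine(text, i);  // comment includes the newline
            newState = TOY_DEFAULT;
        } else {
            // Scan up to, but not including, the next '#'.
            while (end + 1 < text.size() && text[end + 1] != '#') end++;
            newState = TOY_COMMENT;
        }
        for (std::size_t p = i; p <= end; p++) styles[p] = state;
        state = newState;
        i = end + 1;
    }
    return styles;
}
```

The comment scanner contains no test for anything other than the end of
the line, which is the simplification the paragraph above describes.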

However, the special cases handled in the main loop in the
character-based approach would have to be handled by each state scanner,
so both approaches have advantages.  These special cases are discussed
below.

Special case: Lead bytes

Lead bytes are part of DBCS processing for languages such as Japanese
using an encoding such as Shift-JIS. In these encodings, extended
(16-bit) characters are encoded as a lead byte followed by a trail byte.

Lead bytes are rarely of any lexical significance, normally only being
allowed within strings and comments. In such contexts, lexers should
ignore ch if styler.IsLeadByte(ch) returns TRUE.
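
The reason for this check can be seen in a small sketch.  In Shift-JIS
a trail byte may have the same value as an ASCII character such as the
backslash, so a scanner that looks at every byte can mistake half of a
double-byte character for an escape.  The code below is an assumption
for illustration only: isLeadByteSJIS is a hypothetical stand-in for
styler.IsLeadByte with commonly quoted Shift-JIS lead-byte ranges
hard-coded, and the sketch skips the trail byte along with the lead
byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for styler.IsLeadByte, hard-coded for Shift-JIS.
// The ranges are an assumption of this sketch, not taken from Scintilla.
bool isLeadByteSJIS(char ch) {
    unsigned char uch = static_cast<unsigned char>(ch);
    return (uch >= 0x81 && uch <= 0x9F) || (uch >= 0xE0 && uch <= 0xFC);
}

// Counts backslashes that are real characters rather than trail bytes.
std::size_t countRealBackslashes(const std::string &text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); i++) {
        if (isLeadByteSJIS(text[i])) {
            i++;  // skip the trail byte as well as the lead byte
            continue;
        }
        if (text[i] == '\\') count++;
    }
    return count;
}
```

For the four bytes 0x95 0x5C 'a' '\', the function counts one
backslash: the 0x5C that follows the lead byte 0x95 is a trail byte and
is not examined.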

Note: UTF-8 is simpler than Shift-JIS, so no special handling is
applied for it. All bytes of UTF-8 extended characters are >= 128 and
none are lexically significant in programming languages which, so far,
use only characters in ASCII for operators, comment markers, etc.


Special case: Folding

Folding may be performed in the lexer function. It is better to use a
separate folder function as that avoids some troublesome interaction
between styling and folding. The folder function will be run after the
lexer function if folding is enabled. The rest of this section explains
how to perform folding within the lexer function.

During initialization, lexers that support folding set

    bool fold = styler.GetPropertyInt("fold");

If folding is enabled in the editor, fold will be TRUE and the lexer
should call:

    styler.SetLevel(line, level);

at the end of each line and just before exiting.

The line parameter is simply the count of the number of newlines seen.
Its initial value is styler.GetLine(startPos) and it is incremented
(after calling styler.SetLevel) whenever a newline is seen.

The level parameter is the desired indentation level in the low 12 bits,
along with flag bits in the upper four bits. The indentation level
depends on the language.  For C++, it is incremented when the lexer sees
a '{' and decremented when the lexer sees a '}' (outside of strings and
comments, of course).

The following flag bits, defined in Scintilla.h, may be set or cleared
in the level parameter. The SC_FOLDLEVELWHITEFLAG flag is set if the
lexer considers that the line contains nothing but whitespace.  The
SC_FOLDLEVELHEADERFLAG flag indicates that the line is a fold point.
This normally means that the next line has a greater level than the
present line.  However, the lexer may have some other basis for
determining a fold point.  For example, a lexer might create a header
line for the first line of a function definition rather than the last.

The SC_FOLDLEVELNUMBERMASK mask denotes the level number in the low 12
bits of the level parameter. This mask may be used to isolate either
flags or level numbers.
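
The packing of a level number and flag bits into one integer can be
sketched as follows.  The constants are local copies written for this
example; their values are assumed to match SC_FOLDLEVELNUMBERMASK,
SC_FOLDLEVELWHITEFLAG and SC_FOLDLEVELHEADERFLAG, and real code should
take them from Scintilla.h.  The helper names are hypothetical.

```cpp
#include <cassert>

// Local copies for illustration; values assumed to match Scintilla.h.
const int kFoldNumberMask = 0x0FFF;  // SC_FOLDLEVELNUMBERMASK
const int kFoldWhiteFlag  = 0x1000;  // SC_FOLDLEVELWHITEFLAG
const int kFoldHeaderFlag = 0x2000;  // SC_FOLDLEVELHEADERFLAG

// Combines a level number (low 12 bits) with the two flag bits.
int makeLevel(int number, bool white, bool header) {
    assert(number >= 0 && number <= kFoldNumberMask);
    int level = number;
    if (white)  level |= kFoldWhiteFlag;
    if (header) level |= kFoldHeaderFlag;
    return level;
}

// Isolates the level number by keeping only the masked bits.
int levelNumber(int level) { return level & kFoldNumberMask; }

// Isolates the flags by clearing the masked bits.
int levelFlags(int level) { return level & ~kFoldNumberMask; }
```

For example, makeLevel(3, false, true) yields 0x2003, from which
levelNumber recovers 3 and levelFlags recovers 0x2000.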

For example, the C++ lexer contains the following code when a newline is
seen:

  if (fold) {
    int lev = levelPrev;

    // Set the "all whitespace" bit if the line is blank.
    if (visChars == 0)
      lev |= SC_FOLDLEVELWHITEFLAG;

    // Set the "header" bit if needed.
    if ((levelCurrent > levelPrev) && (visChars > 0))
      lev |= SC_FOLDLEVELHEADERFLAG;

    styler.SetLevel(lineCurrent, lev);

    // Reinitialize the folding vars describing the present line.
    lineCurrent++;
    visChars = 0;  // Number of non-whitespace characters on the line.
    levelPrev = levelCurrent;
  }

The following code appears in the C++ lexer just before exit:

  // Fill in the real level of the next line, keeping the current flags
  // as they will be filled in later.
  if (fold) {
    // Mask off the level number, leaving only the previous flags.
    int flagsNext = styler.LevelAt(lineCurrent);
    flagsNext &= ~SC_FOLDLEVELNUMBERMASK;
    styler.SetLevel(lineCurrent, levelPrev | flagsNext);
  }
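
Putting the two fragments together, the following sketch computes one
level value per line for brace-delimited text.  It is a simplification
under stated assumptions: it works on a std::string and returns a
vector instead of calling styler.SetLevel, it starts counting levels at
0, it does not skip strings or comments, and the flag constants are
local copies whose values are assumed to match Scintilla.h.  The name
foldToyDoc is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Local copies for illustration; values assumed to match Scintilla.h.
const int kFoldWhiteFlag  = 0x1000;  // SC_FOLDLEVELWHITEFLAG
const int kFoldHeaderFlag = 0x2000;  // SC_FOLDLEVELHEADERFLAG

// Returns the level (number plus flags) for every line of text,
// including the line that follows the last newline.
std::vector<int> foldToyDoc(const std::string &text) {
    std::vector<int> levels;
    int levelPrev = 0;     // level at the start of the present line
    int levelCurrent = 0;  // level after the characters seen so far
    int visChars = 0;      // non-whitespace characters on the line
    for (std::size_t i = 0; i < text.size(); i++) {
        char ch = text[i];
        if (ch == '{') levelCurrent++;
        else if (ch == '}') levelCurrent--;

        if (ch == '\n') {
            int lev = levelPrev;
            if (visChars == 0)
                lev |= kFoldWhiteFlag;
            if (levelCurrent > levelPrev && visChars > 0)
                lev |= kFoldHeaderFlag;
            levels.push_back(lev);  // stands in for styler.SetLevel

            // Reinitialize the variables describing the present line.
            visChars = 0;
            levelPrev = levelCurrent;
        } else if (!std::isspace(static_cast<unsigned char>(ch))) {
            visChars++;
        }
    }
    // Just before exit: record the level number of the next line.
    levels.push_back(levelPrev);
    return levels;
}
```

For the text "f() {\n  x;\n\n}\n" the first line is marked as a header
at level 0, the body and the closing brace are at level 1, the blank
line carries the whitespace flag, and the line after the final newline
is back at level 0.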


Don't worry about performance

The writer of a lexer may safely ignore performance considerations: the
cost of redrawing the screen is several orders of magnitude greater than
the cost of function calls, etc.  Moreover, Scintilla performs all the
important optimizations; Scintilla ensures that a lexer will be called
only to recolor text that actually needs to be recolored.  Finally, it
is not necessary to avoid extra calls to styler.ColourTo: the styler
object buffers calls to ColourTo to avoid multiple updates of the
screen.

Page contributed by Edward K. Ream