How to write a Scintilla lexer

A lexer for a particular language determines how a specified range of
text shall be colored.  Writing a lexer is relatively straightforward
because the lexer need only color given text.  The harder job of
determining how much text actually needs to be colored is handled by
Scintilla itself, that is, the lexer's caller.


Parameters

The lexer for language LLL has the following prototype:

    static void ColouriseLLLDoc (
        unsigned int startPos, int length,
        int initStyle,
        WordList *keywordlists[],
        Accessor &styler);

The styler parameter is an Accessor object.  The lexer must use this
object to access the text to be colored.  The lexer gets the character
at position i using styler.SafeGetCharAt(i).

The startPos and length parameters indicate the range of text to be
recolored; the lexer must determine the proper color for all characters
in positions startPos through startPos+length-1.

The initStyle parameter indicates the initial state, that is, the state
at the character before startPos. States also indicate the coloring to
be used for a particular range of text.

Note:  the character at startPos is assumed to start a line, so if a
newline terminates the initStyle state the lexer should enter its
default state (or whatever state should follow initStyle).

The keywordlists parameter specifies the keywords that the lexer must
recognize.  A WordList class object contains methods that simplify the
recognition of keywords.  Present lexers use a helper function called
classifyWordLLL to recognize keywords.  These functions show how to use
the keywordlists parameter to recognize keywords.  This documentation
will not discuss keywords further.


The lexer code

The task of a lexer can be summarized briefly: for each range r of
characters that are to be colored the same, the lexer should call

    styler.ColourTo(i, state)

where i is the position of the last character of the range r.  The lexer
should set the state variable to the coloring state of the character at
position i and continue until the entire text has been colored.

Note 1:  the styler (Accessor) object remembers the i parameter in the
previous calls to styler.ColourTo, so the single i parameter suffices to
indicate a range of characters.

Note 2:  as a side effect of calling styler.ColourTo(i, state), the
coloring states of all characters in the range are remembered so that
Scintilla may set the initStyle parameter correctly on future calls to
the lexer.
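
As a rough illustration of notes 1 and 2, the sketch below uses a
hypothetical MockStyler in place of Scintilla's Accessor.  MockStyler,
its members, the state names and colourExample are assumptions made for
this example only; the real Accessor does considerably more.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for Scintilla's Accessor; it models only the
// range-remembering behaviour described in the notes above.
struct MockStyler {
    std::string text;
    std::vector<int> styles;    // one style per character, -1 = uncolored
    unsigned int segStart = 0;  // first position not yet colored

    explicit MockStyler(const std::string &t) : text(t), styles(t.size(), -1) {}

    // Color everything from the end of the previous range through pos.
    void ColourTo(unsigned int pos, int state) {
        for (unsigned int i = segStart; i <= pos && i < styles.size(); i++)
            styles[i] = state;
        if (pos + 1 > segStart) segStart = pos + 1;
    }
};

enum { STATE_DEFAULT = 0, STATE_COMMENT = 1 };

// Colors "x = 1; // note": code in the default state, the rest as comment.
std::vector<int> colourExample() {
    MockStyler styler("x = 1; // note");
    styler.ColourTo(6, STATE_DEFAULT);   // positions 0..6:  "x = 1; "
    styler.ColourTo(13, STATE_COMMENT);  // positions 7..13: "// note"
    return styler.styles;
}
```

Each call names only the last position of its range; the mock supplies
the first position from the previous call, which is the behaviour note 1
describes.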


Lexer organization

There are at least two ways to organize the code of each lexer.  Present
lexers use what might be called a "character-based" approach: the outer
loop iterates over characters, like this:

  lengthDoc = startPos + length ;
  chNext = styler.SafeGetCharAt(startPos);
  for (unsigned int i = startPos; i < lengthDoc; i++) {
    ch = chNext;
    chNext = styler.SafeGetCharAt(i + 1);
    << handle special cases >>
    switch(state) {
      // Handlers examine only ch and chNext.
      // Handlers call styler.ColourTo(i,state) if the state changes.
      case state_1: << handle ch in state 1 >>
      case state_2: << handle ch in state 2 >>
      ...
      case state_n: << handle ch in state n >>
    }
    chPrev = ch;
  }
  styler.ColourTo(lengthDoc - 1, state);
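
To make the outline above concrete, here is a minimal sketch of a
character-based lexer for an invented toy language in which '#' starts a
comment that runs to the end of the line.  MockStyler is a hypothetical
stand-in for Accessor, and ColouriseToyDoc, lexToy and the ST_ names are
likewise assumptions for illustration, not part of Scintilla.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for Scintilla's Accessor (illustration only).
struct MockStyler {
    std::string text;
    std::vector<int> styles;    // -1 means "not yet colored"
    unsigned int segStart = 0;  // first position of the pending range

    explicit MockStyler(const std::string &t) : text(t), styles(t.size(), -1) {}

    char SafeGetCharAt(unsigned int pos, char chDefault = ' ') const {
        return pos < text.size() ? text[pos] : chDefault;
    }
    void ColourTo(unsigned int pos, int state) {
        for (unsigned int i = segStart; i <= pos && i < styles.size(); i++)
            styles[i] = state;
        if (pos + 1 > segStart) segStart = pos + 1;
    }
};

enum { ST_DEFAULT = 0, ST_COMMENT = 1 };

// Toy lexer: the outer loop visits every character once.
void ColouriseToyDoc(unsigned int startPos, int length, int initStyle,
                     MockStyler &styler) {
    int state = initStyle;
    unsigned int lengthDoc = startPos + length;
    styler.segStart = startPos;
    for (unsigned int i = startPos; i < lengthDoc; i++) {
        char ch = styler.SafeGetCharAt(i);
        switch (state) {
        case ST_DEFAULT:
            if (ch == '#') {
                // Close the default range just before the '#'.
                if (i > startPos) styler.ColourTo(i - 1, state);
                state = ST_COMMENT;
            }
            break;
        case ST_COMMENT:
            if (ch == '\n') {
                // The newline ends the comment and is colored with it.
                styler.ColourTo(i, state);
                state = ST_DEFAULT;
            }
            break;
        }
    }
    // Color whatever range is still pending at the end of the text.
    if (length > 0) styler.ColourTo(lengthDoc - 1, state);
}

// Convenience wrapper: lex a whole string and return its styles.
std::vector<int> lexToy(const std::string &text) {
    MockStyler styler(text);
    ColouriseToyDoc(0, static_cast<int>(text.size()), ST_DEFAULT, styler);
    return styler.styles;
}
```

In this sketch the range before a '#' is closed at i - 1, so the '#'
itself falls in the comment range; a lexer for a real language would
pick the boundary that suits its own tokens.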


An alternative would be to use a "state-based" approach.  The outer loop
would iterate over states, like this:

  lengthDoc = startPos+length ;
  for ( unsigned int i = startPos ;; ) {
    char ch = styler.SafeGetCharAt(i);
    int new_state = 0 ;
    switch ( state ) {
      // scanners set new_state if they set the next state.
      case state_1: << scan to the end of state 1 >> break ;
      case state_2: << scan to the end of state 2 >> break ;
      case default_state:
        << scan to the next non-default state and set new_state >>
    }
    styler.ColourTo(i, state);
    if ( i >= lengthDoc ) break ;
    if ( ! new_state ) {
      ch = styler.SafeGetCharAt(i);
      << set state based on ch in the default state >>
    }
  }
  styler.ColourTo(lengthDoc - 1, state);

This approach might seem to be more natural.  State scanners are simpler
than character scanners because less needs to be done.  For example,
there is no need to test for the start of a C string inside the scanner
for a C comment.  This approach also makes it natural to define routines
that could be used by more than one scanner; for example, a
scanToEndOfLine routine.

However, the special cases handled in the main loop in the
character-based approach would have to be handled by each state scanner,
so both approaches have advantages.  These special cases are discussed
below.
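
A state-based lexer might be sketched as follows for an invented toy
language in which '#' starts a comment that runs to the end of the line.
This is only one possible arrangement: MockStyler is a hypothetical
stand-in for Accessor, and scanToEndOfLine, scanToNextComment,
ColouriseToyByState and lexToyByState are illustrative names.  Because
comments in the toy language always end at a newline and startPos is
assumed to start a line, the sketch derives each state from the first
character of the range instead of using initStyle.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for Scintilla's Accessor (illustration only).
struct MockStyler {
    std::string text;
    std::vector<int> styles;    // -1 means "not yet colored"
    unsigned int segStart = 0;

    explicit MockStyler(const std::string &t) : text(t), styles(t.size(), -1) {}

    char SafeGetCharAt(unsigned int pos, char chDefault = ' ') const {
        return pos < text.size() ? text[pos] : chDefault;
    }
    void ColourTo(unsigned int pos, int state) {
        for (unsigned int i = segStart; i <= pos && i < styles.size(); i++)
            styles[i] = state;
        if (pos + 1 > segStart) segStart = pos + 1;
    }
};

enum { ST_DEFAULT = 0, ST_COMMENT = 1 };

// Returns the position of the newline ending the line that contains pos,
// or end - 1 if the text runs out first.
unsigned int scanToEndOfLine(const MockStyler &styler, unsigned int pos,
                             unsigned int end) {
    while (pos + 1 < end && styler.SafeGetCharAt(pos) != '\n') pos++;
    return pos;
}

// Returns the last position before the next '#', or end - 1 if none.
unsigned int scanToNextComment(const MockStyler &styler, unsigned int pos,
                               unsigned int end) {
    while (pos + 1 < end && styler.SafeGetCharAt(pos + 1) != '#') pos++;
    return pos;
}

// Toy lexer: each iteration of the outer loop consumes one whole range.
void ColouriseToyByState(unsigned int startPos, int length, MockStyler &styler) {
    unsigned int end = startPos + length;
    styler.segStart = startPos;
    unsigned int i = startPos;
    while (i < end) {
        int state = (styler.SafeGetCharAt(i) == '#') ? ST_COMMENT : ST_DEFAULT;
        unsigned int last = (state == ST_COMMENT)
                                ? scanToEndOfLine(styler, i, end)
                                : scanToNextComment(styler, i, end);
        styler.ColourTo(last, state);
        i = last + 1;
    }
}

std::vector<int> lexToyByState(const std::string &text) {
    MockStyler styler(text);
    ColouriseToyByState(0, static_cast<unsigned int>(text.size()), styler);
    return styler.styles;
}
```

Note that the comment scanner never looks for a '#', and the default
scanner never looks for a newline: each scanner tests only for what can
end its own state.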

Special case: Lead characters

Lead bytes are part of DBCS processing for languages such as Japanese
using an encoding such as Shift-JIS. In these encodings, extended
(16-bit) characters are encoded as a lead byte followed by a trail byte.

Lead bytes are rarely of any lexical significance, normally only being
allowed within strings and comments. In such contexts, lexers should
ignore ch if styler.IsLeadByte(ch) returns TRUE.
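
The sketch below suggests why skipping matters inside strings: in
Shift-JIS a trail byte may have the value 0x5C, the same as an ASCII
backslash, so a scanner that does not skip it can mistake half of a
character for an escape.  isShiftJisLeadByte and findStringEnd are
hypothetical helpers written for this example; the byte ranges are the
commonly cited Shift-JIS lead-byte ranges and are an assumption here, so
a real lexer should rely on styler.IsLeadByte instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed Shift-JIS lead-byte ranges (illustration only).
bool isShiftJisLeadByte(char ch) {
    unsigned char uch = static_cast<unsigned char>(ch);
    return (uch >= 0x81 && uch <= 0x9F) || (uch >= 0xE0 && uch <= 0xFC);
}

// Given the index of an opening '"', returns the index of the matching
// closing '"', or std::string::npos if the string is unterminated.
std::size_t findStringEnd(const std::string &text, std::size_t open) {
    for (std::size_t i = open + 1; i < text.size(); i++) {
        char ch = text[i];
        if (isShiftJisLeadByte(ch)) {
            i++;  // skip the trail byte; it is not a character of its own
        } else if (ch == '\\') {
            i++;  // skip the escaped character
        } else if (ch == '"') {
            return i;
        }
    }
    return std::string::npos;
}
```

For the bytes '"', 0x95, 0x5C, '"' the function reports the closing
quote at index 3; without the lead-byte test, the 0x5C trail byte would
be read as a backslash and the closing quote would be skipped.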

Note: UTF-8 is simpler than Shift-JIS, so no special handling is
applied for it. All UTF-8 extended characters are >= 128 and none are
lexically significant in programming languages which, so far, use only
characters in ASCII for operators, comment markers, etc.


Special case: Folding

Folding may be performed in the lexer function. It is better to use a
separate folder function as that avoids some troublesome interaction
between styling and folding. The folder function will be run after the
lexer function if folding is enabled. The rest of this section explains
how to perform folding within the lexer function.

During initialization, lexers that support folding set

    bool fold = styler.GetPropertyInt("fold");

If folding is enabled in the editor, fold will be TRUE and the lexer
should call:

    styler.SetLevel(line, level);

at the end of each line and just before exiting.

The line parameter is simply the count of the number of newlines seen.
Its initial value is styler.GetLine(startPos) and it is incremented
(after calling styler.SetLevel) whenever a newline is seen.

The level parameter is the desired indentation level in the low 12 bits,
along with flag bits in the upper four bits. The indentation level
depends on the language.  For C++, it is incremented when the lexer sees
a '{' and decremented when the lexer sees a '}' (outside of strings and
comments, of course).

The following flag bits, defined in Scintilla.h, may be set or cleared
in the level parameter. The SC_FOLDLEVELWHITEFLAG flag is set if the
lexer considers that the line contains nothing but whitespace.  The
SC_FOLDLEVELHEADERFLAG flag indicates that the line is a fold point.
This normally means that the next line has a greater level than the
present line.  However, the lexer may have some other basis for
determining a fold point.  For example, a lexer might create a header
line for the first line of a function definition rather than the last.

The SC_FOLDLEVELNUMBERMASK mask denotes the level number in the low 12
bits of the level parameter. This mask may be used to isolate either
flags or level numbers.
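
The bit layout described above can be sketched as follows.  The
constants are defined locally so that the example stands alone; their
values are given as an assumption about Scintilla.h, and real code
should include that header rather than redefine them.  makeLevel,
levelNumber and levelFlags are hypothetical helper names.

```cpp
#include <cassert>

// Assumed values; real code should take these from Scintilla.h.
const int SC_FOLDLEVELWHITEFLAG  = 0x1000;
const int SC_FOLDLEVELHEADERFLAG = 0x2000;
const int SC_FOLDLEVELNUMBERMASK = 0x0FFF;

// Combine a level number (low 12 bits) with the flag bits above it.
int makeLevel(int number, bool blankLine, bool header) {
    int level = number & SC_FOLDLEVELNUMBERMASK;
    if (blankLine) level |= SC_FOLDLEVELWHITEFLAG;
    if (header)    level |= SC_FOLDLEVELHEADERFLAG;
    return level;
}

// Isolate the level number by keeping only the masked bits.
int levelNumber(int level) { return level & SC_FOLDLEVELNUMBERMASK; }

// Isolate the flags by clearing the masked bits.
int levelFlags(int level) { return level & ~SC_FOLDLEVELNUMBERMASK; }
```

The same mask serves both purposes: applying it yields the number, and
applying its complement yields the flags.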

For example, the C++ lexer contains the following code when a newline is
seen:

  if (fold) {
    int lev = levelPrev;

    // Set the "all whitespace" bit if the line is blank.
    if (visChars == 0)
      lev |= SC_FOLDLEVELWHITEFLAG;

    // Set the "header" bit if needed.
    if ((levelCurrent > levelPrev) && (visChars > 0))
      lev |= SC_FOLDLEVELHEADERFLAG;
    styler.SetLevel(lineCurrent, lev);

    // reinitialize the folding vars describing the present line.
    lineCurrent++;
    visChars = 0;  // Number of non-whitespace characters on the line.
    levelPrev = levelCurrent;
  }

The following code appears in the C++ lexer just before exit:

  // Fill in the real level of the next line, keeping the current flags
  // as they will be filled in later.
  if (fold) {
    // Mask off the level number, leaving only the previous flags.
    int flagsNext = styler.LevelAt(lineCurrent);
    flagsNext &= ~SC_FOLDLEVELNUMBERMASK;
    styler.SetLevel(lineCurrent, levelPrev | flagsNext);
  }
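
The two fragments above can be combined into a small self-contained
sketch that folds on braces.  MockStyler and foldBraces are hypothetical
names, the flag values are assumptions about Scintilla.h, and, to keep
the example short, levels start at 0 and braces inside strings or
comments are not excluded as a real lexer would require.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Assumed values; real code should take these from Scintilla.h.
const int SC_FOLDLEVELWHITEFLAG  = 0x1000;
const int SC_FOLDLEVELHEADERFLAG = 0x2000;
const int SC_FOLDLEVELNUMBERMASK = 0x0FFF;

// Hypothetical stand-in for the fold-level part of Accessor.
struct MockStyler {
    std::vector<int> levels;
    void SetLevel(int line, int level) {
        if (static_cast<std::size_t>(line) >= levels.size())
            levels.resize(static_cast<std::size_t>(line) + 1, 0);
        levels[static_cast<std::size_t>(line)] = level;
    }
    int LevelAt(int line) const {
        return static_cast<std::size_t>(line) < levels.size()
                   ? levels[static_cast<std::size_t>(line)] : 0;
    }
};

// Computes one fold level per line, following the fragments above.
std::vector<int> foldBraces(const std::string &text) {
    MockStyler styler;
    int lineCurrent = 0, levelPrev = 0, levelCurrent = 0, visChars = 0;
    for (char ch : text) {
        if (ch == '{') levelCurrent++;
        else if (ch == '}') levelCurrent--;
        if (ch == '\n') {
            int lev = levelPrev;
            if (visChars == 0)
                lev |= SC_FOLDLEVELWHITEFLAG;   // blank line
            if ((levelCurrent > levelPrev) && (visChars > 0))
                lev |= SC_FOLDLEVELHEADERFLAG;  // next line is deeper
            styler.SetLevel(lineCurrent, lev);
            lineCurrent++;
            visChars = 0;
            levelPrev = levelCurrent;
        } else if (!std::isspace(static_cast<unsigned char>(ch))) {
            visChars++;
        }
    }
    // Just before exit: set the level of the next line, keeping its flags.
    int flagsNext = styler.LevelAt(lineCurrent) & ~SC_FOLDLEVELNUMBERMASK;
    styler.SetLevel(lineCurrent, levelPrev | flagsNext);
    return styler.levels;
}
```

For the text "f() {\n  x;\n\n}\n" the first line becomes a header at
level 0, the two lines inside the braces get level 1 (the blank one with
the whitespace flag set), and the line holding the '}' keeps level 1
because the level only drops for the line that follows it.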


Don't worry about performance

The writer of a lexer may safely ignore performance considerations: the
cost of redrawing the screen is several orders of magnitude greater than
the cost of function calls, etc.  Moreover, Scintilla performs all the
important optimizations; Scintilla ensures that a lexer will be called
only to recolor text that actually needs to be recolored.  Finally, it
is not necessary to avoid extra calls to styler.ColourTo: the styler
object buffers calls to ColourTo to avoid multiple updates of the
screen.

Page contributed by Edward K. Ream