*Matcher (MIT/GNU Scheme Reference Manual)

14.14.1 *Matcher

The matcher language is a declarative language for specifying a matcher procedure. A matcher procedure is a procedure that accepts a single parser-buffer argument and returns a boolean value indicating whether the match it performs was successful. If the match succeeds, the internal pointer of the parser buffer is moved forward over the matched text. If the match fails, the internal pointer is unchanged.

For example, here is a matcher procedure that matches the character ‘a’:

(lambda (b) (match-parser-buffer-char b #\a))
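As a minimal sketch of how such a procedure is used, the following applies it to a parser buffer created with string->parser-buffer and inspects the next unread character with peek-parser-buffer-char. The names match-a and buffer are illustrative only.

```scheme
;; Hypothetical name: match-a is the matcher procedure shown above.
(define match-a
  (lambda (b) (match-parser-buffer-char b #\a)))

(define buffer (string->parser-buffer "abc"))

(match-a buffer)                  ; success: the pointer moves past "a"
(peek-parser-buffer-char buffer)  ; the next unread character is #\b
(match-a buffer)                  ; failure: the pointer is unchanged
(peek-parser-buffer-char buffer)  ; still #\b
```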

Here is another example that matches two given characters, c1 and c2, in sequence:

(lambda (b)
  (let ((p (get-parser-buffer-pointer b)))
    (if (match-parser-buffer-char b c1)
        (if (match-parser-buffer-char b c2)
            #t
            (begin
              (set-parser-buffer-pointer! b p)
              #f))
        #f)))

This code is clear, but it has many details that get in the way of understanding what it does. Here is the same example in the matcher language:

(*matcher (seq (char c1) (char c2)))

This is much simpler and more intuitive. And it generates virtually the same code:

(pp (*matcher (seq (char c1) (char c2))))
-| (lambda (#[b1])
-|   (let ((#[p1] (get-parser-buffer-pointer #[b1])))
-|     (and (match-parser-buffer-char #[b1] c1)
-|          (if (match-parser-buffer-char #[b1] c2)
-|              #t
-|              (begin
-|                (set-parser-buffer-pointer! #[b1] #[p1])
-|                #f)))))
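The generated procedure can be exercised directly. The following is a sketch only: the values given to c1 and c2 are assumed for illustration, and matched-text is a hypothetical helper (not part of MIT/GNU Scheme) that returns the text consumed by a successful match, or #f on failure.

```scheme
;; Assumed example values for the free variables c1 and c2.
(define c1 #\x)
(define c2 #\y)

(define match-c1-c2 (*matcher (seq (char c1) (char c2))))

;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

(matched-text match-c1-c2 "xyz")  ; "xy"
(matched-text match-c1-c2 "xzy")  ; #f

;; After a failed match the pointer is back at the start of the buffer,
;; even though the first character had already been consumed.
(let ((b (string->parser-buffer "xzy")))
  (match-c1-c2 b)
  (peek-parser-buffer-char b))    ; #\x
```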

Now that we have seen an example of the language, it’s time to look at the details. The *matcher special form is the interface between the matcher language and Scheme.

special form: *matcher mexp

The operand mexp is an expression in the matcher language. The *matcher expression expands into Scheme code that implements a matcher procedure.

Here are the predefined matcher expressions. New matcher expressions can be defined using the macro facility (see Parser-language Macros). We will start with the primitive expressions.

matcher expression: char expression
matcher expression: char-ci expression
matcher expression: not-char expression
matcher expression: not-char-ci expression

These expressions match a given character. In each case, the expression operand is a Scheme expression that must evaluate to a character at run time. The ‘-ci’ expressions do case-insensitive matching. The ‘not-’ expressions match any character other than the given one.

matcher expression: string expression
matcher expression: string-ci expression

These expressions match a given string. The expression operand is a Scheme expression that must evaluate to a string at run time. The string-ci expression does case-insensitive matching.

matcher expression: char-set expression

This expression matches a single character that is a member of a given character set. The expression operand is a Scheme expression that must evaluate to a character set at run time.

matcher expression: end-of-input

The end-of-input expression is successful only when there are no more characters available to be matched.
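The primitive expressions described so far can be tried out as follows. This is a sketch, not an exhaustive test; matched-text is a hypothetical helper (not part of MIT/GNU Scheme) that returns the text consumed by a successful match, or #f on failure. It also assumes that an expression without operands, such as end-of-input, is written in parentheses.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

;; Single characters.
(matched-text (*matcher (char #\a)) "abc")         ; "a"
(matched-text (*matcher (char #\a)) "Abc")         ; #f
(matched-text (*matcher (char-ci #\a)) "Abc")      ; "A"
(matched-text (*matcher (not-char #\a)) "xyz")     ; "x"
(matched-text (*matcher (not-char-ci #\a)) "Abc")  ; #f

;; Strings.
(matched-text (*matcher (string "abc")) "abcdef")     ; "abc"
(matched-text (*matcher (string-ci "abc")) "ABCdef")  ; "ABC"

;; Character sets.
(matched-text (*matcher (char-set char-set:numeric)) "42")  ; "4"

;; End of input.
(matched-text (*matcher (seq (string "ab") (end-of-input))) "ab")   ; "ab"
(matched-text (*matcher (seq (string "ab") (end-of-input))) "abc")  ; #f
```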

matcher expression: discard-matched

The discard-matched expression always successfully matches the null string. However, it isn’t meant to be used as a matching expression; it is used for its effect. discard-matched causes all of the buffered text prior to this point to be discarded (i.e. it calls discard-parser-buffer-head! on the parser buffer).

Note that discard-matched may not be used in certain places in a matcher expression. The reason for this is that it deliberately discards information needed for backtracking, so it may not be used in a place where subsequent backtracking will need to back over it. As a rule of thumb, use discard-matched only in the last operand of a seq or alt expression (including any seq or alt expressions in which it is indirectly contained).
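A minimal sketch of the rule of thumb follows. The matcher match-line is hypothetical: it consumes one line of text and then discards it, with discard-matched placed as the last operand of the outermost seq so that nothing can backtrack over it.

```scheme
;; Hypothetical matcher: consumes one line, including its newline, and
;; then releases the buffered text that has been matched.
(define match-line
  (*matcher
   (seq (* (not-char #\newline))
        (char #\newline)
        (discard-matched))))

(define buffer (string->parser-buffer "ab\ncd"))

(match-line buffer)               ; #t
(peek-parser-buffer-char buffer)  ; #\c, the first character of the next line
```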

In addition to the above primitive expressions, there are two convenient abbreviations. A character literal (e.g. ‘#\A’) is a legal primitive expression, and is equivalent to a char expression with that literal as its operand (e.g. ‘(char #\A)’). Likewise, a string literal (e.g. ‘"abc"’) is equivalent to a string expression with that literal as its operand (e.g. ‘(string "abc")’).
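The equivalence of the abbreviated and explicit forms can be checked with a short sketch. As before, matched-text is a hypothetical helper, not part of MIT/GNU Scheme.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

;; Abbreviated form, using literals directly.
(matched-text (*matcher (seq #\A "bc")) "Abcd")                  ; "Abc"

;; Explicit form, using char and string expressions.
(matched-text (*matcher (seq (char #\A) (string "bc"))) "Abcd")  ; "Abc"
```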

Next there are several combinator expressions. These closely correspond to similar combinators in regular expressions. Parameters named mexp are arbitrary expressions in the matcher language.

matcher expression: seq mexp …

This matches each mexp operand in sequence. For example,

(seq (char-set char-set:alphabetic)
     (char-set char-set:numeric))

matches an alphabetic character followed by a numeric character, such as ‘H4’.

Note that if there are no mexp operands, the seq expression successfully matches the null string.
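The seq example can be run as in the following sketch, which also shows the empty seq. The names match-letter-digit and matched-text are hypothetical and used only for illustration.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

(define match-letter-digit
  (*matcher
   (seq (char-set char-set:alphabetic)
        (char-set char-set:numeric))))

(matched-text match-letter-digit "H4")  ; "H4"
(matched-text match-letter-digit "HH")  ; #f

;; With no operands, seq succeeds without consuming any input.
(matched-text (*matcher (seq)) "H4")    ; ""
```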

matcher expression: alt mexp …

This attempts to match each mexp operand in order from left to right. The first one that successfully matches becomes the match for the entire alt expression.

The alt expression participates in backtracking. If one of the mexp operands matches, but the overall match in which this expression is embedded fails, the backtracking mechanism will cause the alt expression to try the remaining mexp operands. For example, if the expression

(seq (alt "ab" "a") "b")

is matched against the text ‘abc’, the alt expression will initially match its first operand, ‘ab’. The second operand of the seq expression will then fail to match, because the next character is ‘c’. This will cause the alt to be restarted, at which time it will match ‘a’; the second operand of the seq expression then matches ‘b’, and the overall match will succeed.

Note that if there are no mexp operands, the alt match will always fail.
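The backtracking behavior described above can be observed with a short sketch. The names match-ab-then-b and matched-text are hypothetical.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

(define match-ab-then-b (*matcher (seq (alt "ab" "a") "b")))

;; The first alternative works without any backtracking.
(matched-text match-ab-then-b "abb")  ; "abb"

;; The first alternative leads to failure, so alt falls back to "a".
(matched-text match-ab-then-b "abc")  ; "ab"

;; With no operands, alt always fails.
(matched-text (*matcher (alt)) "abc") ; #f
```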

matcher expression: * mexp

This matches zero or more occurrences of the mexp operand. (Consequently this match always succeeds.)

The * expression participates in backtracking; if it matches N occurrences of mexp, but the overall match fails, it will backtrack to N-1 occurrences and continue. If the overall match continues to fail, the * expression will continue to backtrack until there are no occurrences left.
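The following sketch illustrates both properties of the * expression: it succeeds even with zero occurrences, and it gives back occurrences when the rest of the match needs them. The matcher names and matched-text are hypothetical.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

(define match-digits (*matcher (* (char-set char-set:numeric))))

(matched-text match-digits "123abc")  ; "123"
(matched-text match-digits "abc")     ; "" (zero occurrences)

;; (* #\a) first consumes all three a's, which leaves no "a" for the
;; string "ab". It then backtracks to two occurrences so "ab" can match.
(define match-as-then-ab (*matcher (seq (* #\a) "ab")))

(matched-text match-as-then-ab "aaab")  ; "aaab"
```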

matcher expression: + mexp

This matches one or more occurrences of the mexp operand. It is equivalent to

(seq mexp (* mexp))

matcher expression: ? mexp

This matches zero or one occurrence of the mexp operand. It is equivalent to

(alt mexp (seq))
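A brief sketch of + and ? in use follows. The matcher names and matched-text are hypothetical and chosen only for this illustration.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

;; One or more digits: unlike *, this fails when there are none.
(define match-digits+ (*matcher (+ (char-set char-set:numeric))))

(matched-text match-digits+ "123abc")  ; "123"
(matched-text match-digits+ "abc")     ; #f

;; An optional sign followed by a single digit.
(define match-signed-digit
  (*matcher
   (seq (? (alt #\+ #\-))
        (char-set char-set:numeric))))

(matched-text match-signed-digit "-7")  ; "-7"
(matched-text match-signed-digit "7")   ; "7"
(matched-text match-signed-digit "x")   ; #f
```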

matcher expression: sexp expression

The sexp expression allows arbitrary Scheme code to be embedded inside a matcher. The expression operand must evaluate to a matcher procedure at run time; the procedure is called to match the parser buffer. For example,

(*matcher
 (seq "a"
      (sexp parse-foo)
      "b"))

expands to

(lambda (#[b1])
  (let ((#[p1] (get-parser-buffer-pointer #[b1])))
    (and (match-parser-buffer-char #[b1] #\a)
         (if (parse-foo #[b1])
             (if (match-parser-buffer-char #[b1] #\b)
                 #t
                 (begin
                   (set-parser-buffer-pointer! #[b1] #[p1])
                   #f))
             (begin
               (set-parser-buffer-pointer! #[b1] #[p1])
               #f)))))

The case in which expression is a symbol is so common that it has an abbreviation: ‘(sexp symbol)’ may be abbreviated as just symbol.
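As a sketch of sexp and its abbreviation, the following embeds a hand-written matcher procedure in a larger matcher. The procedure match-vowel and the helper matched-text are hypothetical; match-parser-buffer-char-in-set is the parser-buffer operation assumed here for matching against a character set.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

;; A hand-written matcher procedure: takes a parser buffer, returns a
;; boolean, and advances the pointer only on success.
(define vowels (string->char-set "aeiou"))
(define (match-vowel b)
  (match-parser-buffer-char-in-set b vowels))

;; Explicit sexp form.
(define match-bvt (*matcher (seq "b" (sexp match-vowel) "t")))

;; Abbreviated form: a bare symbol stands for (sexp symbol).
(define match-bvt-short (*matcher (seq "b" match-vowel "t")))

(matched-text match-bvt "bat")        ; "bat"
(matched-text match-bvt "bxt")        ; #f
(matched-text match-bvt-short "bit")  ; "bit"
```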

matcher expression: with-pointer identifier mexp

The with-pointer expression fetches the parser buffer’s internal pointer (using get-parser-buffer-pointer), binds it to identifier, and then matches the pattern specified by mexp. Identifier must be a symbol.

This is meant to be used in conjunction with sexp, as a way to capture a pointer to a part of the input stream that is outside the sexp expression. An example of the use of with-pointer appears above (see with-pointer example).
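In the matcher language itself, the combination might look like the following sketch. The procedure at-most-three-since is hypothetical: given a pointer, it returns a matcher procedure that consumes nothing and succeeds only if at most three characters have been matched since that pointer. The helper matched-text is likewise hypothetical.

```scheme
;; Hypothetical helper: returns the text consumed by MATCHER when it
;; succeeds on TEXT, or #f when it fails.
(define (matched-text matcher text)
  (let* ((b (string->parser-buffer text))
         (start (get-parser-buffer-pointer b)))
    (and (matcher b)
         (get-parser-buffer-tail b start))))

;; Hypothetical: builds a matcher procedure that inspects the text
;; between START and the current pointer without consuming input.
(define (at-most-three-since start)
  (lambda (b)
    (<= (string-length (get-parser-buffer-tail b start)) 3)))

;; with-pointer binds START to the pointer at the beginning of the
;; digits, so the sexp expression can refer to it.
(define match-short-number
  (*matcher
   (with-pointer start
     (seq (+ (char-set char-set:numeric))
          (end-of-input)
          (sexp (at-most-three-since start))))))

(matched-text match-short-number "123")    ; "123"
(matched-text match-short-number "12345")  ; #f
```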