From: Chris Hanson Date: Fri, 5 May 2017 07:09:14 +0000 (-0700) Subject: Rewrite the regular expression section for Unicode-safe implementation. X-Git-Tag: mit-scheme-pucked-9.2.12~14^2~81 X-Git-Url: https://birchwood-abbey.net/git?a=commitdiff_plain;h=cd333f111a5259435d72fdd21336450ee20a7c60;p=mit-scheme.git Rewrite the regular expression section for Unicode-safe implementation. Also a few small updates here and there. --- diff --git a/doc/ref-manual/scheme.texinfo b/doc/ref-manual/scheme.texinfo index 9cf3f9766..596d60f16 100644 --- a/doc/ref-manual/scheme.texinfo +++ b/doc/ref-manual/scheme.texinfo @@ -226,8 +226,8 @@ Strings Regular Expressions -* Regular-expression procedures:: -* REXP abstraction:: +* Regular S-Expressions:: +* Regsexp Procedures:: Lists diff --git a/doc/ref-manual/strings.texi b/doc/ref-manual/strings.texi index c3b8fed59..089a5ee41 100644 --- a/doc/ref-manual/strings.texi +++ b/doc/ref-manual/strings.texi @@ -783,7 +783,8 @@ string has a longer grapheme-cluster length than the given length. If from the string until it is the correct length; if it is @code{#f} then the string is returned unchanged. The grapheme clusters are removed from the beginning of the string if @code{where} is -@code{leading}, otherwise from the end of the string. +@code{leading}, otherwise from the end of the string. The default +value of this argument is @code{#t}. @end itemize Some examples: @@ -834,7 +835,7 @@ indices). @end example @end deffn -@deffn procedure string-trimmer where trim-char? copy? +@deffn procedure string-trimmer where to-trim copy? @cindex trimming, of string This procedure's arguments are keyword arguments; that is, each argument is a symbol of the same name followed by its value. The @@ -857,10 +858,11 @@ trailing characters, or both. The default value of this argument is @code{both}. @item @findex char-whitespace? 
-@var{trim-char?} is a procedure that accepts a single character -argument and returns a true value for a character that should be -removed by the trimmer, or a false value for a character that should -be retained. The default value of this argument is @code{char-whitespace?}. +@var{to-trim} is either a character, a character set, or more +generally a procedure that accepts a single character argument and +returns a boolean value. The trimmer uses this to identify characters +to remove. The default value of this argument is +@code{char-whitespace?}. @item @var{copy?} is a boolean: if @code{#t}, the trimmer returns an immutable copy of the trimmed string, if @code{#f} it returns a slice. @@ -881,15 +883,15 @@ Some examples: ((string-trimmer) " ABC DEF ") @result{} "ABC DEF" -((string-trimmer 'trim-char? char-numeric? 'where 'leading) +((string-trimmer 'to-trim char-numeric? 'where 'leading) "21 East 21st Street #3") @result{} " East 21st Street #3" -((string-trimmer 'trim-char? char-numeric? 'where 'trailing) +((string-trimmer 'to-trim char-numeric? 'where 'trailing) "21 East 21st Street #3") @result{} "21 East 21st Street #" -((string-trimmer 'trim-char? char-numeric?) +((string-trimmer 'to-trim char-numeric?) "21 East 21st Street #3") @result{} " East 21st Street #" @end example @@ -1153,338 +1155,173 @@ don't distinguish uppercase and lowercase letters. @node Regular Expressions, , Searching and Matching Strings, Strings @section Regular Expressions -MIT/GNU Scheme provides support for using regular expressions to search and -match strings. This manual does not define regular expressions; instead -see @ref{Regexps, , Syntax of Regular Expressions, emacs, The Emacs -Editor}. - -In addition to providing standard regular-expression support, MIT/GNU -Scheme also provides the @acronym{REXP} abstraction. This is an -alternative way to write regular expressions that is easier to read -and understand than the standard notation. 
Regular expressions -written in this notation can be translated into the standard -notation. - -The regular-expression support is a run-time-loadable option. To use -it, execute - -@example -(load-option 'regular-expression) -@end example - -@noindent -once before calling any of the procedures defined here. +MIT/GNU Scheme provides support for matching and searching strings +against regular expressions. This is considerably more flexible than +ordinary string matching and searching, but potentially much slower. +On the other hand it is less powerful than the mechanism described in +@ref{Parser Language}. + +Traditional regular expressions are defined with string patterns +in which characters like @samp{[} and @samp{*} have special meanings. +Unfortunately, the syntax of these patterns is not only baroque but +also comes in many different and mutually-incompatible varieties. As +a consequence we have chosen to specify regular expressions using an +s-expression syntax, which we call a @dfn{regular s-expression}, +abbreviated as @dfn{regsexp}. + +Previous releases of MIT/GNU Scheme provided a regular-expression +mechanism nearly identical to that of GNU Emacs version 18. This +mechanism still exists but is deprecated and will be removed in a +future release. @menu -* Regular-expression procedures:: -* REXP abstraction:: +* Regular S-Expressions:: +* Regsexp Procedures:: @end menu -@node Regular-expression procedures, REXP abstraction, Regular Expressions, Regular Expressions -@subsection Regular-expression procedures -@cindex searching, for regular expression -@cindex regular expression, searching string for - -Procedures that perform regular-expression match and search accept -standardized arguments. @var{Regexp} is the regular expression; it is -either a string representation of a regular expression, or a compiled -regular expression object. @var{String} is the string being matched -or searched.
Procedures that operate on substrings also accept -@var{start} and @var{end} index arguments with the usual meaning. The -optional argument @var{case-fold?} says whether the match/search is -case-sensitive; if @var{case-fold?} is @code{#f}, it is -case-sensitive, otherwise it is case-insensitive. The optional -argument @var{syntax-table} is a character syntax table that defines -the character syntax, such as which characters are legal word -constituents. This feature is primarily for Edwin, so character -syntax tables will not be documented here. Supplying @code{#f} for -(or omitting) @var{syntax-table} will select the default character -syntax, equivalent to Edwin's @code{fundamental} mode. - -@deffn procedure re-string-match regexp string [case-fold? [syntax-table]] -@deffnx procedure re-substring-match regexp string start end [case-fold? [syntax-table]] -These procedures match @var{regexp} against the respective string or -substring, returning @code{#f} for no match, or a set of match registers -(see below) if the match succeeds. Here is an example showing how to -extract the matched substring: +@node Regular S-Expressions, Regsexp Procedures, Regular Expressions, Regular Expressions +@subsection Regular S-Expressions -@example -@group -(let ((r (re-substring-match @var{regexp} @var{string} @var{start} @var{end}))) - (and r - (substring @var{string} @var{start} (re-match-end-index 0 r)))) -@end group -@end example -@end deffn +A regular s-expression is either a character or a string, which +matches itself, or one of the following forms. -@deffn procedure re-string-search-forward regexp string [case-fold? [syntax-table]] -@deffnx procedure re-substring-search-forward regexp string start end [case-fold? [syntax-table]] -Searches @var{string} for the leftmost substring matching @var{regexp}. -Returns a set of match registers (see below) if the search is -successful, or @code{#f} if it is unsuccessful. 
- -@code{re-substring-search-forward} limits its search to the specified -substring of @var{string}; @code{re-string-search-forward} searches all -of @var{string}. -@end deffn - -@deffn procedure re-string-search-backward regexp string [case-fold? [syntax-table]] -@deffnx procedure re-substring-search-backward regexp string start end [case-fold? [syntax-table]] -Searches @var{string} for the rightmost substring matching @var{regexp}. -Returns a set of match registers (see below) if the search is -successful, or @code{#f} if it is unsuccessful. - -@code{re-substring-search-backward} limits its search to the specified -substring of @var{string}; @code{re-string-search-backward} searches all -of @var{string}. -@end deffn - -When a successful match or search occurs, the above procedures return a -set of @dfn{match registers}. The match registers are a set of index -registers that record indexes into the matched string. Each index -register corresponds to an instance of the regular-expression grouping -operator @samp{\(}, and records the start index (inclusive) and end -index (exclusive) of the matched group. These registers are numbered -from @code{1} to @code{9}, corresponding left-to-right to the grouping -operators in the expression. Additionally, register @code{0} -corresponds to the entire substring matching the regular expression. - -@deffn procedure re-match-start-index n registers -@deffnx procedure re-match-end-index n registers -@var{N} must be an exact integer between @code{0} and @code{9} -inclusive. @var{Registers} must be a match-registers object as returned -by one of the regular-expression match or search procedures above. -@code{re-match-start-index} returns the start index of the corresponding -regular-expression register, and @code{re-match-end-index} returns the -corresponding end index. 
-@end deffn - -@deffn procedure re-match-extract string registers n -@var{Registers} must be a match-registers object as returned by one of -the regular-expression match or search procedures above. @var{String} -must be the string that was passed as an argument to the procedure that -returned @var{registers}. @var{N} must be an exact integer between -@code{0} and @code{9} inclusive. If the matched regular expression -contained @var{m} grouping operators, then the value of this procedure -is undefined for @var{n} strictly greater than @var{m}. - -This procedure extracts the substring corresponding to the match -register specified by @var{registers} and @var{n}. This is equivalent -to the following expression: +These forms match one or more characters literally: -@example -@group -(substring @var{string} - (re-match-start-index @var{n} @var{registers}) - (re-match-end-index @var{n} @var{registers})) -@end group -@end example +@deffn {regsexp} char-ci char +Matches @var{char} without considering case. @end deffn -@deffn procedure regexp-group alternative @dots{} -Each @var{alternative} must be a string representation of a regular -expression. The returned value is a new string representation of a -regular expression that consists of the @var{alternative}s combined by -a grouping operator. For example: - -@example -@group -(regexp-group "foo" "bar" "baz") - @result{} "\\(foo\\|bar\\|baz\\)" -@end group -@end example +@deffn {regsexp} string-ci string +Matches @var{string} without considering case. @end deffn -@deffn procedure re-compile-pattern regexp-string -@var{Regexp-string} must be the string representation of a regular -expression. Returns a compiled regular expression object of the -represented regular expression. 
- -Procedures that apply regular expressions, such as -@code{re-string-search-forward}, are sometimes faster when used with -compiled regular expression objects than when used with the string -representations of regular expressions, so applications that reuse -regular expressions may speed up matching and searching by caching the -compiled regular expression objects. However, the regular expression -procedures have some internal caches as well, so this is likely to -improve performance only for applications that use a large number of -different regular expressions before cycling through the same ones -again. +@deffn {regsexp} any-char +Matches one character other than @code{#\newline}. @end deffn -@node REXP abstraction, , Regular-expression procedures, Regular Expressions -@subsection REXP abstraction - -@cindex REXP abstraction -In addition to providing standard regular-expression support, MIT/GNU -Scheme also provides the @acronym{REXP} abstraction. This is an -alternative way to write regular expressions that is easier to read -and understand than the standard notation. Regular expressions -written in this notation can be translated into the standard notation. - -The @acronym{REXP} abstraction is a set of combinators that are -composed into a complete regular expression. Each combinator directly -corresponds to a particular piece of regular-expression notation. For -example, the expression @code{(rexp-any-char)} corresponds to the -@code{.} character in standard regular-expression notation, while -@code{(rexp* @var{rexp})} corresponds to the @code{*} character. +@deffn {regsexp} char-set datum @dots{} +@deffnx {regsexp} inverse-char-set datum @dots{} +Matches one character in (not in) the character set specified by +@code{(char-set @var{datum @dots{}})}. 
+@end deffn -The primary advantages of @acronym{REXP} are that it makes the nesting -structure of regular expressions explicit, and that it simplifies the -description of complex regular expressions by allowing them to be -built up using straightforward combinators. +These forms match no characters, but only at specific locations in the +input string: -@deffn procedure rexp? object -Returns @code{#t} if @var{object} is a @acronym{REXP} expression, or -@code{#f} otherwise. A @acronym{REXP} is one of: a string, which -represents the pattern matching that string; a character set, which -represents the pattern matching a character in that set; or an object -returned by calling one of the procedures defined here. +@deffn {regsexp} line-start +@deffnx {regsexp} line-end +Matches no characters at the start (end) of a line. @end deffn -@deffn procedure rexp->regexp rexp -Converts @var{rexp} to standard regular-expression notation, returning -a newly-allocated string. +@deffn {regsexp} string-start +@deffnx {regsexp} string-end +Matches no characters at the start (end) of the string. @end deffn -@deffn procedure rexp-compile rexp -Converts @var{rexp} to standard regular-expression notation, then -compiles it and returns the compiled result. Equivalent to +These forms match repetitions of a given regsexp. Most of them come +in two forms, one of which is @dfn{greedy} and the other @dfn{shy}. +The greedy form matches as many repetitions as it can, then uses +failure backtracking to reduce the number of repetitions one at a +time. The shy form matches the minimum number of repetitions, then +uses failure backtracking to increase the number of repetitions one at +a time. The shy form is similar to the greedy form except that a +@code{?} is added at the end of the form's keyword. -@example -(re-compile-pattern (rexp->regexp @var{rexp}) #f) -@end example +@deffn {regsexp} ? regsexp +@deffnx {regsexp} ?? regsexp +Matches @var{regsexp} zero or one time. 
@end deffn -@deffn procedure rexp-any-char -Returns a @acronym{REXP} that matches any single character except a -newline. This is equivalent to the @code{.} construct. +@deffn {regsexp} * regsexp +@deffnx {regsexp} *? regsexp +Matches @var{regsexp} zero or more times. @end deffn -@deffn procedure rexp-line-start -Returns a @acronym{REXP} that matches the start of a line. This is -equivalent to the @code{^} construct. +@deffn {regsexp} + regsexp +@deffnx {regsexp} +? regsexp +Matches @var{regsexp} one or more times. @end deffn -@deffn procedure rexp-line-end -Returns a @acronym{REXP} that matches the end of a line. This is -equivalent to the @code{$} construct. -@end deffn +@deffn {regsexp} ** n m regsexp +@deffnx {regsexp} **? n m regsexp +The @var{n} argument must be an exact nonnegative integer. The +@var{m} argument must be either an exact integer greater than or equal +to @var{n}, or else @code{#f}. -@deffn procedure rexp-string-start -Returns a @acronym{REXP} that matches the start of the text being -matched. This is equivalent to the @code{\`} construct. +Matches @var{regsexp} at least @var{n} times and at most @var{m} +times; if @var{m} is @code{#f} then there is no upper limit. @end deffn -@deffn procedure rexp-string-end -Returns a @acronym{REXP} that matches the end of the text being -matched. This is equivalent to the @code{\'} construct. +@deffn {regsexp} ** n regsexp +This is an abbreviation for @code{(** @var{n} @var{n} +@var{regsexp})}. This matcher is neither greedy nor shy since it +matches a fixed number of repetitions. @end deffn -@deffn procedure rexp-word-edge -Returns a @acronym{REXP} that matches the start or end of a word. -This is equivalent to the @code{\b} construct. -@end deffn +These forms implement alternatives and sequencing: -@deffn procedure rexp-not-word-edge -Returns a @acronym{REXP} that matches anywhere that is not the start -or end of a word. This is equivalent to the @code{\B} construct. 
+@deffn {regsexp} alt regsexp @dots{} +Matches one of the @var{regsexp} arguments, trying each in order from +left to right. @end deffn -@deffn procedure rexp-word-start -Returns a @acronym{REXP} that matches the start of a word. -This is equivalent to the @code{\<} construct. +@deffn {regsexp} seq regsexp @dots{} +Matches the first @var{regsexp}, then continues the match with the +next @var{regsexp}, and so on until all of the arguments are matched. @end deffn -@deffn procedure rexp-word-end -Returns a @acronym{REXP} that matches the end of a word. -This is equivalent to the @code{\>} construct. -@end deffn +These forms implement named @dfn{registers}, which store matched +segments of the input string: -@deffn procedure rexp-word-char -Returns a @acronym{REXP} that matches any word-constituent character. -This is equivalent to the @code{\w} construct. -@end deffn +@deffn {regsexp} group key regsexp +The @var{key} argument must be a fixnum, a character, or a symbol. -@deffn procedure rexp-not-word-char -Returns a @acronym{REXP} that matches any character that isn't a word -constituent. This is equivalent to the @code{\W} construct. +Matches @var{regsexp}. If the match succeeds, the matched segment is +stored in the register named @var{key}. @end deffn -The next two procedures accept a @var{syntax-type} argument specifying -the syntax class to be matched against. This argument is a symbol -selected from the following list. Each symbol is followed by the -equivalent character used in standard regular-expression notation. -@code{whitespace} (space character), -@code{punctuation} (@code{.}), -@code{word} (@code{w}), -@code{symbol} (@code{_}), -@code{open} (@code{(}), -@code{close} (@code{)}), -@code{quote} (@code{'}), -@code{string-delimiter} (@code{"}), -@code{math-delimiter} (@code{$}), -@code{escape} (@code{\}), -@code{char-quote} (@code{/}), -@code{comment-start} (@code{<}), -@code{comment-end} (@code{>}). 
+@deffn {regsexp} group-ref key +The @var{key} argument must be a fixnum, a character, or a symbol. -@deffn procedure rexp-syntax-char syntax-type -Returns a @acronym{REXP} that matches any character of type -@var{syntax-type}. This is equivalent to the @code{\s} construct. +Matches the characters stored in the register named @var{key}. It is +an error if that register has not been initialized with a +corresponding @code{group} expression. @end deffn -@deffn procedure rexp-not-syntax-char syntax-type -Returns a @acronym{REXP} that matches any character not of type -@var{syntax-type}. This is equivalent to the @code{\S} construct. -@end deffn - -@deffn procedure rexp-sequence rexp @dots{} -Returns a @acronym{REXP} that matches each @var{rexp} argument in -sequence. If no @var{rexp} argument is supplied, the result matches -the null string. This is equivalent to concatenating the regular -expressions corresponding to each @var{rexp} argument. -@end deffn +@node Regsexp Procedures, , Regular S-Expressions, Regular Expressions +@subsection Regsexp Procedures -@deffn procedure rexp-alternatives rexp @dots{} -Returns a @acronym{REXP} that matches any of the @var{rexp} -arguments. This is equivalent to concatenating the regular -expressions corresponding to each @var{rexp} argument, separating them -by the @code{\|} construct. -@end deffn +The regular s-expression implementation has two parts, like +many other regular-expression implementations: a compiler that +translates the pattern into an efficient form, and one or more +procedures that use that pattern to match or search inputs. -@deffn procedure rexp-group rexp @dots{} -@code{rexp-group} is like @code{rexp-sequence}, except that the result -is marked as a match group. This is equivalent to the @code{\(} -@dots{} @code{\)} construct. +@deffn procedure compile-regsexp regsexp +Compiles @var{regsexp} by translating it into a procedure that +implements the specified matcher. 
@end deffn -The next three procedures in principal accept a single @acronym{REXP} -argument. For convenience, they accept multiple arguments, which are -converted into a single argument by @code{rexp-group}. Note, however, -that if only one @acronym{REXP} argument is supplied, and it's very -simple, no grouping occurs. +The match and search procedures each return a list when they are +successful, and @code{#f} when they fail. The returned list is of the +form @code{(@var{s} @var{e} @var{register} @dots{})}, where @var{s} is +the index at which the match starts, @var{e} is the index at which the +match ends, and each @var{register} is a pair @code{(@var{key} +. @var{contents})} where @var{key} is the register's name and +@var{contents} is the contents of that register as a string. -@deffn procedure rexp* rexp @dots{} -Returns a @acronym{REXP} that matches zero or more instances of the -pattern matched by the @var{rexp} arguments. This is equivalent to -the @code{*} construct. -@end deffn +@deffn procedure regsexp-match-string crse string [start [end]] +The @var{crse} argument must be a value returned by +@code{compile-regsexp}. The @var{string} argument must satisfy +@code{string-in-nfc?}. -@deffn procedure rexp+ rexp @dots{} -Returns a @acronym{REXP} that matches one or more instances of the -pattern matched by the @var{rexp} arguments. This is equivalent to -the @code{+} construct. +Matches @var{string} against @var{crse} and returns the result. @end deffn -@deffn procedure rexp-optional rexp @dots{} -Returns a @acronym{REXP} that matches zero or one instances of the -pattern matched by the @var{rexp} arguments. This is equivalent to -the @code{?} construct. -@end deffn +@deffn procedure regsexp-search-string-forward crse string [start [end]] +The @var{crse} argument must be a value returned by +@code{compile-regsexp}. The @var{string} argument must satisfy +@code{string-in-nfc?}. 
-@deffn procedure rexp-case-fold rexp -Returns a @acronym{REXP} that matches the same pattern as @var{rexp}, -but is insensitive to character case. This has no equivalent in -standard regular-expression notation. +Searches @var{string} from left to right for a match against +@var{crse} and returns the result. @end deffn
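
The compile-then-match workflow documented in the new text can be sketched as follows. This is an illustration only, not text from the commit: it assumes a Scheme image in which @code{compile-regsexp}, @code{regsexp-match-string}, and @code{regsexp-search-string-forward} behave as described above, and the pattern, the register name @code{tail}, and the input strings are hypothetical.

@example
;; Hypothetical pattern: the literal "ab", then zero or more #\c
;; characters captured in a register named TAIL.  The regsexp is
;; ordinary quoted data handed to the compiler.
(define crse
  (compile-regsexp '(seq "ab" (group tail (* #\c)))))

;; Match a string against the compiled pattern.  Per the description
;; above, success yields a list of the form (s e register ...) and
;; failure yields #f.  The input is plain ASCII, so it is already
;; in NFC.
(regsexp-match-string crse "abccc")

;; Search from left to right for the first position where the
;; compiled pattern matches.
(regsexp-search-string-forward crse "xx abcc yy")
@end example

Compiling once and reusing @code{crse} for both calls follows the two-part design the text describes, in which translation of the pattern is separate from the procedures that apply it.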