Strings are sequences of characters. Strings are written as sequences of characters enclosed within quotation marks ("). Within a string literal, various escape sequences represent characters other than themselves. Escape sequences always start with a backslash (\):
\a : alarm, U+0007
\b : backspace, U+0008
\t : character tabulation, U+0009
\n : linefeed, U+000A
\r : return, U+000D
\" : double quote, U+0022
\\ : backslash, U+005C
\| : vertical line, U+007C
\intraline-whitespace* line-ending intraline-whitespace* : nothing
\xhex-scalar-value; : specified character (note the terminating semicolon).
The result is unspecified if any other character in a string occurs after a backslash.
Except for a line ending, any character outside of an escape sequence stands for itself in the string literal. A line ending which is preceded by \intraline-whitespace expands to nothing (along with any trailing intraline whitespace), and can be used to indent strings for improved legibility. Any other line ending has the same effect as inserting a \n character into the string.
Examples:
"The word \"recursion\" has many meanings."
"Another example:\ntwo lines of text"
"Here's text \
   containing just one line"
"\x03B1; is named GREEK SMALL LETTER ALPHA."
The length of a string is the number of characters that it contains. This number is an exact, non-negative integer that is fixed when the string is created. The valid indexes of a string are the exact non-negative integers less than the length of the string. The first character of a string has index 0, the second has index 1, and so on.
Some of the procedures that operate on strings ignore the difference between upper and lower case. The names of the versions that ignore case end with ‘-ci’ (for “case insensitive”).
Implementations may forbid certain characters from appearing in strings. However, with the exception of #\null, ASCII characters must not be forbidden. For example, an implementation might support the entire Unicode repertoire, but only allow characters U+0001 to U+00FF (the Latin-1 repertoire without #\null) in strings.
Implementation note: MIT/GNU Scheme allows any “bitless” character to be stored in a string. In effect this means any character with a Unicode code point, including surrogates. String operations that accept characters automatically strip their bucky bits.
It is an error to pass such a forbidden character to make-string, string, string-set!, or string-fill!, as part of the list passed to list->string, or as part of the vector passed to vector->string, or in UTF-8 encoded form within a bytevector passed to utf8->string. It is also an error for a procedure passed to string-map to return a forbidden character, or for read-string to attempt to read one.
MIT/GNU Scheme supports both mutable and immutable strings. Procedures that mutate strings, in particular string-set! and string-fill!, will signal an error if given an immutable string. Nearly all procedures that return strings return immutable strings; notable exceptions are make-string and string-copy, which always return mutable strings, and string-builder, which gives the programmer the ability to choose mutable or immutable results.
Returns #t if obj is a string, otherwise returns #f.
The make-string procedure returns a newly allocated mutable string of length k. If char is given, then all the characters of the string are initialized to char, otherwise the contents of the string are unspecified.
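The following is a minimal sketch of the behavior described above; the variable name filled is only illustrative. It shows that a string created by make-string can be modified in place:

```scheme
;; make-string returns a fresh mutable string filled with the given character.
(define filled (make-string 3 #\x))

(string? filled)            ; ⇒ #t
(string-length filled)      ; ⇒ 3
(string-set! filled 0 #\y)  ; allowed, because filled is mutable
filled                      ; ⇒ "yxx"
```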
Returns an immutable string whose characters are the concatenation of the characters from the given objects. Each object is converted to characters as if passed to the display procedure.
This is an MIT/GNU Scheme extension to the standard string procedure, which accepts only characters as arguments.
The procedure string* is identical to string but takes a single argument that’s a list of objects, rather than multiple object arguments.
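As an illustration of the conversion rule above (a sketch only; the names s1 and s2 are hypothetical), mixed objects are rendered as display would print them and then concatenated:

```scheme
;; A symbol, a number, a string, and a character are each converted
;; as if by display, then joined into one immutable string.
(define s1 (string 'abc 123 "def" #\!))          ; ⇒ "abc123def!"

;; string* takes the same objects packaged in a single list.
(define s2 (string* (list 'abc 123 "def" #\!)))  ; ⇒ "abc123def!"
```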
Returns the number of characters in the given string.
It is an error if k is not a valid index of string.
The string-ref procedure returns character k of string using zero-origin indexing. There is no requirement for this procedure to execute in constant time.
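A short sketch of zero-origin indexing, using an illustrative string named greeting:

```scheme
(define greeting "hello")

(string-length greeting)   ; ⇒ 5
(string-ref greeting 0)    ; ⇒ #\h  (the first character has index 0)
(string-ref greeting 4)    ; ⇒ #\o  (the last valid index is length - 1)
```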
It is an error if string is not a mutable string or if k is not a valid index of string.
The string-set! procedure stores char in element k of string. There is no requirement for this procedure to execute in constant time.
(define (f) (make-string 3 #\*))
(define (g) "***")
(string-set! (f) 0 #\?) ⇒ unspecified
(string-set! (g) 0 #\?) ⇒ error
(string-set! (symbol->string 'immutable) 0 #\?) ⇒ error
Returns #t if all the strings are the same length and contain exactly the same characters in the same positions, otherwise returns #f.
Returns #t if, after case-folding, all the strings are the same length and contain the same characters in the same positions, otherwise returns #f. Specifically, these procedures behave as if string-foldcase were applied to their arguments before comparing them.
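A brief sketch contrasting the case-sensitive and case-insensitive equality predicates; the last expression spells out the equivalence with string-foldcase stated above:

```scheme
(string=? "abc" "abc")          ; ⇒ #t
(string=? "abc" "ABC")          ; ⇒ #f
(string=? "abc" "abc" "abc")    ; ⇒ #t  (more than two strings are accepted)
(string-ci=? "abc" "ABC")       ; ⇒ #t

;; The -ci predicate behaves as if both arguments were case-folded first.
(string=? (string-foldcase "abc") (string-foldcase "ABC"))  ; ⇒ #t
```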
These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically non-increasing.
These predicates are required to be transitive.
These procedures compare strings in an implementation-defined way. One approach is to make them the lexicographic extensions to strings of the corresponding orderings on characters. In that case, string<? would be the lexicographic ordering on strings induced by the ordering char<? on characters, and if the two strings differ in length but are the same up to the length of the shorter string, the shorter string would be considered to be lexicographically less than the longer string. However, it is also permitted to use the natural ordering imposed by the implementation’s internal representation of strings, or a more complex locale-specific ordering.
In all cases, a pair of strings must satisfy exactly one of string<?, string=?, and string>?, and must satisfy string<=? if and only if they do not satisfy string>? and string>=? if and only if they do not satisfy string<?.
The ‘-ci’ procedures behave as if they applied string-foldcase to their arguments before invoking the corresponding procedures without ‘-ci’.
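Because the ordering itself is implementation-defined, the sketch below checks only the invariants stated above rather than any particular ordering. The helper relations-holding is hypothetical and not part of MIT/GNU Scheme:

```scheme
;; Count how many of the three relations hold for a pair of strings.
;; The rule above requires the answer to be exactly 1 for every pair.
(define (relations-holding a b)
  (+ (if (string<? a b) 1 0)
     (if (string=? a b) 1 0)
     (if (string>? a b) 1 0)))

(relations-holding "apple" "banana")   ; ⇒ 1
(relations-holding "same" "same")      ; ⇒ 1

;; string<=? holds exactly when string>? does not.
(eq? (string<=? "apple" "banana")
     (not (string>? "apple" "banana")))  ; ⇒ #t
```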
If-eq, if-lt, and if-gt are procedures of no arguments (thunks). The two strings are compared: if they are equal, if-eq is applied; if string1 is less than string2, if-lt is applied; otherwise, if string1 is greater than string2, if-gt is applied. The value of the procedure is the value of the thunk that is applied.
string-compare distinguishes uppercase and lowercase letters; string-compare-ci does not.
(define (cheer) (display "Hooray!"))
(define (boo) (display "Boo-hiss!"))
(string-compare "a" "b" (lambda () 'ignore) cheer boo)
-| Hooray!
⇒ unspecified
These procedures apply the Unicode full string uppercasing, lowercasing, titlecasing, and case-folding algorithms to their arguments and return the result. In certain cases, the result differs in length from the argument. If the result is equal to the argument in the sense of string=?, the argument may be returned. Note that language-sensitive mappings and foldings are not used.
The Unicode Standard prescribes special treatment of the Greek letter Σ, whose normal lower-case form is σ but which becomes ς at the end of a word. See UAX #44 (part of the Unicode Standard) for details. However, implementations of string-downcase are not required to provide this behavior, and may choose to change Σ to σ in all cases.
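A small sketch of the case-conversion procedures on ASCII input, followed by one case where the full Unicode mapping changes the length. The name sharp-s is illustrative, and the final comment states what the Unicode full mapping prescribes rather than a verified result:

```scheme
(string-upcase "hello")      ; ⇒ "HELLO"
(string-downcase "HeLLo")    ; ⇒ "hello"
(string-foldcase "HeLLo")    ; ⇒ "hello"

;; U+00DF (LATIN SMALL LETTER SHARP S) is a single character whose
;; full uppercase mapping in Unicode is the two letters "SS".
(define sharp-s (string (integer->char #xDF)))
(string-length sharp-s)                  ; ⇒ 1
(string-length (string-upcase sharp-s))  ; 2 when the full mapping is applied
```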
These procedures return #t if all the letters in the string are lower case or upper case, respectively; otherwise they return #f. The string must contain at least one letter or the procedures return #f.
(map string-upper-case? '("" "A" "art" "Art" "ART")) ⇒ (#f #t #f #f #t)
Returns an immutable copy of the part of the given string between start and end.
Returns a slice of string, restricted to the range of characters specified by start and end. The returned slice will be mutable if string is mutable, or immutable if string is immutable.
A slice is a kind of string that provides a view into another string. The slice behaves like any other string, but changes to a mutable slice are reflected in the original string and vice versa.
(define foo (string-copy "abcde"))
foo ⇒ "abcde"
(define bar (string-slice foo 1 4))
bar ⇒ "bcd"
(string-set! foo 2 #\z)
foo ⇒ "abzde"
bar ⇒ "bzd"
(string-set! bar 1 #\y)
bar ⇒ "byd"
foo ⇒ "abyde"
Returns an immutable string whose characters are the concatenation of the characters in the given strings.
The non-standard procedure string-append* is identical to string-append but takes a single argument that’s a list of strings, rather than multiple string arguments.
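For illustration, a sketch of the two calling conventions side by side:

```scheme
;; Strings passed as separate arguments.
(string-append "foo" "bar" "baz")           ; ⇒ "foobarbaz"
(string-append)                             ; ⇒ ""

;; The same strings passed as a single list.
(string-append* (list "foo" "bar" "baz"))   ; ⇒ "foobarbaz"
```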
It is an error if any element of list is not a character.
The string->list procedure returns a newly allocated list of the characters of string between start and end. list->string returns an immutable string formed from the elements in the list list. In both procedures, order is preserved. string->list and list->string are inverses so far as equal? is concerned.
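A minimal sketch of converting in both directions, including the optional start and end arguments of string->list:

```scheme
(string->list "abc")             ; ⇒ (#\a #\b #\c)
(string->list "hello" 1 3)       ; ⇒ (#\e #\l)
(list->string (list #\a #\b))    ; ⇒ "ab"

;; Round trip: the result is equal? to the original string.
(equal? (list->string (string->list "round trip")) "round trip")  ; ⇒ #t
```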
Returns a newly allocated mutable copy of the part of the given string between start and end.
It is an error if to is not a mutable string or if at is less than zero or greater than the length of to. It is also an error if (- (string-length to) at) is less than (- end start).
Copies the characters of string from between start and end to string to, starting at at. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. This can be achieved without allocating storage by making sure to copy in the correct direction in such circumstances.
(define a "12345")
(define b (string-copy "abcde"))
(string-copy! b 1 a 0 2) ⇒ 3
b ⇒ "a12de"
Implementation note: in MIT/GNU Scheme string-copy! returns the value (+ at (- end start)).
It is an error if string is not a mutable string or if fill is not a character.
The string-fill! procedure stores fill in the elements of string between start and end.
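As a sketch (the variable name dashes is illustrative), filling a sub-range and then the whole string:

```scheme
(define dashes (make-string 5 #\-))

;; Fill only indices 1 and 2 (start is inclusive, end is exclusive).
(string-fill! dashes #\x 1 3)
dashes                         ; ⇒ "-xx--"

;; With start and end omitted, the whole string is filled.
(string-fill! dashes #\*)
dashes                         ; ⇒ "*****"
```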
The next two procedures treat a given string as a sequence of grapheme clusters, a concept defined by the Unicode standard in UAX #29:
It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
This procedure returns the number of grapheme clusters in string. For ASCII strings, this is identical to string-length.
This procedure slices string at the grapheme-cluster boundaries specified by the start and end indices. These indices are grapheme-cluster indices, not normal string indices.
For ASCII strings, this is identical to string-slice.
This procedure returns a list of word break indices for string, ordered from smallest index to largest. Word breaks are defined by the Unicode standard in UAX #29, and generally coincide with what we think of as the boundaries of words in written text.
MIT/GNU Scheme supports the Unicode canonical normalization forms NFC (Normalization Form C) and NFD (Normalization Form D). The reason for these forms is that there can be multiple different Unicode sequences for a given text; these sequences are semantically identical and should be treated equivalently for all purposes. If two such sequences are normalized to the same form, the resulting normalized sequences will be identical.
By default, most procedures that return strings return them in NFC. Notable exceptions are list->string, vector->string, and the utfX->string procedures, which do no normalization, and of course string->nfd.
Generally speaking, NFC is preferred for most purposes, as it is the minimal-length sequence for the variants. Consult the Unicode standard for the details and for information about why one normalization form is preferable for a specific purpose.
These procedures return #t if string is in Unicode Normalization Form C or D respectively. Otherwise they return #f.
Note that if string consists only of code points strictly less than #xC0, then string-in-nfd? returns #t. If string consists only of code points strictly less than #x300, then string-in-nfc? returns #t.
Consequently both of these procedures will return #t for an ASCII string argument.
These procedures convert string into Unicode Normalization Form C or D respectively. If string is already in the correct form, they return string itself, or an immutable copy if string is mutable.
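The sketch below illustrates the two forms with the letter é, which Unicode can represent either as the single code point U+00E9 or as e followed by the combining acute accent U+0301. The names composed and decomposed are illustrative; list->string is used for the decomposed form because, as noted above, it performs no normalization:

```scheme
;; Precomposed form: one code point, already in NFC.
(define composed (string (integer->char #xE9)))

;; Decomposed form: two code points, already in NFD.
(define decomposed (list->string (list #\e (integer->char #x301))))

(string-length composed)                      ; ⇒ 1
(string-length decomposed)                    ; ⇒ 2
(string-in-nfc? composed)                     ; ⇒ #t
(string-in-nfd? decomposed)                   ; ⇒ #t

;; Normalizing either one to the other form makes them identical.
(string-length (string->nfd composed))        ; ⇒ 2
(string=? (string->nfc decomposed) composed)  ; ⇒ #t
```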
It is an error if proc does not accept as many arguments as there are strings and return a single character.
The string-map procedure applies proc element-wise to the elements of the strings and returns an immutable string of the results, in order. If more than one string is given and not all strings have the same length, string-map terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified. If multiple returns occur from string-map, the values returned by earlier returns are not mutated.
(string-map char-foldcase "AbdEgH") ⇒ "abdegh"
(string-map (lambda (c) (integer->char (+ 1 (char->integer c)))) "HAL") ⇒ "IBM"
(string-map (lambda (c k) ((if (eqv? k #\u) char-upcase char-downcase) c)) "studlycaps xxx" "ululululul") ⇒ "StUdLyCaPs"
It is an error if proc does not accept as many arguments as there are strings.
The arguments to string-for-each are like the arguments to string-map, but string-for-each calls proc for its side effects rather than for its values. Unlike string-map, string-for-each is guaranteed to call proc on the elements of the strings in order from the first element(s) to the last, and the value returned by string-for-each is unspecified. If more than one string is given and not all strings have the same length, string-for-each terminates when the shortest string runs out. It is an error for proc to mutate any of the strings.
(let ((v '()))
  (string-for-each (lambda (c) (set! v (cons (char->integer c) v))) "abcde")
  v) ⇒ (101 100 99 98 97)
It is an error if proc does not accept as many arguments as there are strings.
The string-count procedure applies proc element-wise to the elements of the strings and returns a count of the number of true values it returns. If more than one string is given and not all strings have the same length, string-count terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
It is an error if proc does not accept as many arguments as there are strings.
The string-any procedure applies proc element-wise to the elements of the strings and returns #t if proc returns a true value for any of them. If proc never returns a true value, string-any returns #f.
If more than one string is given and not all strings have the same length, string-any terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
It is an error if proc does not accept as many arguments as there are strings.
The string-every procedure applies proc element-wise to the elements of the strings and returns #f if proc returns a false value for any of them. If proc never returns a false value, string-every returns #t.
If more than one string is given and not all strings have the same length, string-every terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
Returns #t if string has zero length; otherwise returns #f.
(string-null? "") ⇒ #t
(string-null? "Hi") ⇒ #f
These procedures return an exact non-negative integer that can be used for storing the specified string in a hash table. Equal strings (in the sense of string=? and string-ci=? respectively) return equal (=) hash codes, and non-equal but similar strings are usually mapped to distinct hash codes.
If the optional argument modulus is specified, it must be an exact positive integer, and the result of the hash computation is restricted to be less than that value. This is equivalent to calling modulo on the result, but may be faster.
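The sketch below checks the properties stated above without depending on any particular hash value. It assumes that the two procedures being described are string-hash and string-hash-ci (the entry headings are not shown here), and the modulus 16 is an arbitrary choice:

```scheme
;; Equal strings hash to equal codes.
(= (string-hash "hello") (string-hash "hello"))         ; ⇒ #t

;; Strings equal under string-ci=? hash to equal codes with the -ci variant.
(= (string-hash-ci "Hello") (string-hash-ci "hELLO"))   ; ⇒ #t

;; With a modulus, the result is an exact integer in the range [0, modulus).
(let ((h (string-hash "hello" 16)))
  (and (exact-nonnegative-integer? h) (< h 16)))        ; ⇒ #t
```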
Equivalent to (substring string 0 end).
Equivalent to (substring string start).
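A short sketch of these equivalences. It assumes the two entries above describe string-head and string-tail respectively, since the entry headings are not shown here:

```scheme
(substring "hello world" 6 11)   ; ⇒ "world"
(substring "hello world" 6)      ; ⇒ "world"  (end defaults to the string length)

;; string-head keeps everything before end; string-tail keeps everything from start.
(string-head "hello world" 5)    ; ⇒ "hello"
(string-tail "hello world" 6)    ; ⇒ "world"
```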
This procedure returns a string builder that can be used to incrementally collect characters and later convert that collection to a string. This is similar to a string output port, but is less general and significantly faster.
The optional buffer-length argument, if given, must be an exact positive integer. It controls the size of the internal buffers that are used to accumulate characters. Larger values make the builder somewhat faster but use more space. The default value of this argument is 16.
The returned string builder is a procedure that accepts zero or one arguments as follows:
Called with the argument empty?, the string builder returns #t if the string being built is empty and #f otherwise.
Called with the argument count, the string builder returns the size of the string being built.
Called with the argument reset!, the string builder discards the string being built and returns to the state it was in when initially created.
The “result” arguments control the form of the returned string. The arguments immutable and mutable are straightforward, specifying the mutability of the returned string. For these arguments, the returned string contains exactly the same characters, in the same order, as were appended to the builder.
However, calling with the argument nfc, or with no arguments, returns an immutable string in Unicode Normalization Form C, exactly as if string->nfc were called on one of the other two result strings.
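A minimal sketch of a typical builder session. The symbol arguments are those named above; passing a character or a string to append it is an assumption based on the builder's stated purpose of collecting characters, and the names builder and result are illustrative:

```scheme
(define builder (string-builder))

(builder 'empty?)          ; ⇒ #t  (nothing appended yet)
(builder #\a)              ; append one character (assumed calling convention)
(builder "bc")             ; append a string (assumed calling convention)
(builder 'count)           ; ⇒ 3

;; With no arguments, the builder returns an immutable NFC string.
(define result (builder))  ; ⇒ "abc"

(builder 'reset!)          ; discard the accumulated characters
(builder 'empty?)          ; ⇒ #t
```

Because result is taken before reset! is called, it keeps the value "abc" even though the builder itself is empty afterwards.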
This procedure’s arguments are keyword arguments; that is, each argument is a symbol of the same name followed by its value. The order of the arguments doesn’t matter, but each argument may appear only once.
These procedures return a joiner procedure that takes multiple strings and joins them together into an immutable string. The joiner returned by string-joiner accepts these strings as multiple string arguments, while string-joiner* accepts the strings as a single list-valued argument.
The joiner produces a result by adding prefix before, suffix after, and infix between each input string, then concatenating everything together into a single string. Each of the prefix, suffix, and infix arguments is optional and defaults to an empty string, so normally at least one is specified.
Some examples:
((string-joiner) "a" "b" "c") ⇒ "abc"
((string-joiner 'infix " ") "a" "b" "c") ⇒ "a b c"
((string-joiner 'infix ", ") "a" "b" "c") ⇒ "a, b, c"
((string-joiner* 'infix ", " 'prefix "<" 'suffix ">") '("a" "b" "c")) ⇒ "<a>, <b>, <c>"
This procedure’s arguments are keyword arguments; that is, each argument is a symbol of the same name followed by its value. The order of the arguments doesn’t matter, but each argument may appear only once.
This procedure returns a splitter procedure that splits a given string into parts, returning a list of the parts. This is done by identifying delimiter characters and breaking the string at those delimiters. The splitting process is controlled by the arguments:
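delimiter: identifies the delimiter characters at which the string is split. The default value of this argument is char-whitespace?.
allow-runs?: a boolean that controls how adjacent delimiters are handled. If allow-runs? is #t, then all of the adjacent delimiters are treated as if they were a single delimiter, and the string is split at the beginning and end of the delimiters. If allow-runs? is #f, then adjacent delimiters are treated as if they were separate with an empty string between them. The default value of this argument is #t.
copier: the procedure used to extract each part from the string. The default value of this argument is string-slice.
copy?: a boolean. A value of #t is equivalent to a copier of substring, while a value of #f is equivalent to a copier of string-slice.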
Some examples:
((string-splitter) "a b c") ⇒ ("a" "b" "c")
((string-splitter) "a\tb\tc") ⇒ ("a" "b" "c")
((string-splitter 'delimiter #\space) "a\tb\tc") ⇒ ("a\tb\tc")
((string-splitter) " a  b  c ") ⇒ ("a" "b" "c")
((string-splitter 'allow-runs? #f) " a  b  c ") ⇒ ("" "a" "" "b" "" "c" "")
This procedure’s arguments are keyword arguments; that is, each argument is a symbol of the same name followed by its value. The order of the arguments doesn’t matter, but each argument may appear only once.
This procedure returns a padder procedure that takes a string and a grapheme-cluster length as its arguments and returns a new string that has been padded to that length. The padder adds grapheme clusters to the string until it has the specified length. If the string’s grapheme-cluster length is greater than the given length, the string may, depending on the arguments, be reduced to the specified length.
The padding process is controlled by the arguments:
where: a symbol, either leading or trailing, which directs the padder to add/remove leading or trailing grapheme clusters. The default value of this argument is leading.
fill-with: the string used as padding. The default value of this argument is " " (a single space character).
clip?: a boolean that controls what happens when the string is longer than the given length. If clip? is #t, grapheme clusters are removed (by slicing) from the string until it is the correct length; if it is #f then the string is returned unchanged. The grapheme clusters are removed from the beginning of the string if where is leading, otherwise from the end of the string. The default value of this argument is #t.
Some examples:
((string-padder) "abc def" 10) ⇒ "   abc def"
((string-padder 'where 'trailing) "abc def" 10) ⇒ "abc def   "
((string-padder 'fill-with "X") "abc def" 10) ⇒ "XXXabc def"
((string-padder) "abc def" 5) ⇒ "c def"
((string-padder 'where 'trailing) "abc def" 5) ⇒ "abc d"
((string-padder 'clip? #f) "abc def" 5) ⇒ "abc def"
These procedures are deprecated and should be replaced by use of string-padder, which is more flexible.
These procedures return an immutable string created by padding string out to length k, using char. If char is not given, it defaults to #\space. If k is less than the length of string, the resulting string is a truncated form of string. string-pad-left adds padding characters or truncates from the beginning of the string (lowest indices), while string-pad-right does so at the end of the string (highest indices).
(string-pad-left "hello" 4) ⇒ "ello"
(string-pad-left "hello" 8) ⇒ "   hello"
(string-pad-left "hello" 8 #\*) ⇒ "***hello"
(string-pad-right "hello" 4) ⇒ "hell"
(string-pad-right "hello" 8) ⇒ "hello   "
This procedure’s arguments are keyword arguments; that is, each argument is a symbol of the same name followed by its value. The order of the arguments doesn’t matter, but each argument may appear only once.
This procedure returns a trimmer procedure that takes a string as its argument and trims that string, returning the trimmed result. The trimming process is controlled by the arguments:
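where: a symbol, one of leading, trailing, or both, which directs the trimmer to trim leading characters, trailing characters, or both. The default value of this argument is both.
to-trim: identifies the characters to be trimmed. The default value of this argument is char-whitespace?.
copier: the procedure used to extract the trimmed result from the string. The default value of this argument is string-slice.
copy?: a boolean. A value of #t is equivalent to a copier of substring, while a value of #f is equivalent to a copier of string-slice.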
Some examples:
((string-trimmer 'where 'leading) " ABC DEF ") ⇒ "ABC DEF "
((string-trimmer 'where 'trailing) " ABC DEF ") ⇒ " ABC DEF"
((string-trimmer 'where 'both) " ABC DEF ") ⇒ "ABC DEF"
((string-trimmer) " ABC DEF ") ⇒ "ABC DEF"
((string-trimmer 'to-trim char-numeric? 'where 'leading) "21 East 21st Street #3") ⇒ " East 21st Street #3"
((string-trimmer 'to-trim char-numeric? 'where 'trailing) "21 East 21st Street #3") ⇒ "21 East 21st Street #"
((string-trimmer 'to-trim char-numeric?) "21 East 21st Street #3") ⇒ " East 21st Street #"
These procedures are deprecated and should be replaced by use of string-trimmer, which is more flexible.
Returns an immutable string created by removing all characters that are not in char-set from: (string-trim) both ends of string; (string-trim-left) the beginning of string; or (string-trim-right) the end of string. Char-set defaults to char-set:not-whitespace.
(string-trim " in the end ") ⇒ "in the end"
(string-trim " ") ⇒ ""
(string-trim "100th" char-set:numeric) ⇒ "100"
(string-trim-left "-.-+-=-" (char-set #\+)) ⇒ "+-=-"
(string-trim "but (+ x y) is" (char-set #\( #\))) ⇒ "(+ x y)"
Returns an immutable string containing the same characters as string except that all instances of char1 have been replaced by char2.