Strings (MIT/GNU Scheme Reference Manual)

6 Strings

Strings are sequences of characters. Strings are written as sequences of characters enclosed within quotation marks ("). Within a string literal, various escape sequences represent characters other than themselves. Escape sequences always start with a backslash (\):

\a : alarm, U+0007
\b : backspace, U+0008
\t : character tabulation, U+0009
\n : linefeed, U+000A
\r : return, U+000D
\" : double quote, U+0022
\\ : backslash, U+005C
\| : vertical line, U+007C
\intraline-whitespace* line-ending intraline-whitespace*
     : nothing
\xhex-scalar-value;
     : specified character (note the terminating semicolon).

The result is unspecified if any other character in a string occurs after a backslash.

Except for a line ending, any character outside of an escape sequence stands for itself in the string literal. A line ending which is preceded by \intraline-whitespace expands to nothing (along with any trailing intraline whitespace), and can be used to indent strings for improved legibility. Any other line ending has the same effect as inserting a \n character into the string.

Examples:

"The word \"recursion\" has many meanings."
"Another example:\ntwo lines of text"
"Here's text \
   containing just one line"
"\x03B1; is named GREEK SMALL LETTER ALPHA."

The length of a string is the number of characters that it contains. This number is an exact, non-negative integer that is fixed when the string is created. The valid indexes of a string are the exact non-negative integers less than the length of the string. The first character of a string has index 0, the second has index 1, and so on.
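As a minimal sketch of how length and indexing interact with the escape sequences described earlier, the following definitions measure a few literals. The variable names are illustrative only.

```scheme
;; Each escape sequence contributes exactly one character.
(define greeting "hello")
(define two-lines "a\nb")            ; a, linefeed, b
(define alpha "\x03B1;")             ; the single character U+03B1

;; A backslash before the line ending removes the line ending and the
;; indentation that follows it.
(define folded "one \
                line")

(string-length greeting)             ; ⇒ 5
(string-ref greeting 0)              ; ⇒ #\h
(string-ref greeting 4)              ; ⇒ #\o  (last valid index is length - 1)
(string-length two-lines)            ; ⇒ 3
(string-length alpha)                ; ⇒ 1
(string=? folded "one line")         ; ⇒ #t
```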

Some of the procedures that operate on strings ignore the difference between upper and lower case. The names of the versions that ignore case end with ‘-ci’ (for “case insensitive”).

Implementations may forbid certain characters from appearing in strings. However, with the exception of #\null, ASCII characters must not be forbidden. For example, an implementation might support the entire Unicode repertoire, but only allow characters U+0001 to U+00FF (the Latin-1 repertoire without #\null) in strings.

Implementation note: MIT/GNU Scheme allows any “bitless” character to be stored in a string. In effect this means any character with a Unicode code point, including surrogates. String operations that accept characters automatically strip their bucky bits.

It is an error to pass such a forbidden character to make-string, string, string-set!, or string-fill!, as part of the list passed to list->string, or as part of the vector passed to vector->string, or in UTF-8 encoded form within a bytevector passed to utf8->string. It is also an error for a procedure passed to string-map to return a forbidden character, or for read-string to attempt to read one.

MIT/GNU Scheme supports both mutable and immutable strings. Procedures that mutate strings, in particular string-set! and string-fill!, will signal an error if given an immutable string. Nearly all procedures that return strings return immutable strings; notable exceptions are make-string and string-copy, which always return mutable strings, and string-builder which gives the programmer the ability to choose mutable or immutable results.

standard procedure: string? obj: Returns #t if obj is a string, otherwise returns #f.

standard procedure: make-string k [char]: The make-string procedure returns a newly allocated mutable string of length k. If char is given, then all the characters of the string are initialized to char, otherwise the contents of the string are unspecified.

extended standard procedure: string object …

procedure: string* objects

Returns an immutable string whose characters are the concatenation of the characters from the given objects. Each object is converted to characters as if passed to the display procedure.

This is an MIT/GNU Scheme extension to the standard string that accepts only characters as arguments.

The procedure string* is identical to string but takes a single argument that’s a list of objects, rather than multiple object arguments.
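A short sketch of the extended behavior described above, assuming the display-style conversion of non-character arguments that the text specifies for MIT/GNU Scheme. Other implementations of string accept only characters, so the last two calls are not portable.

```scheme
;; Standard usage: characters only.
(define from-chars (string #\a #\b #\c))

;; MIT/GNU Scheme extension: each object is converted as if by display.
(define mixed (string "abc" 123 'def))

;; string* takes the objects as a single list.
(define from-list (string* (list "x" 1 #\y)))

from-chars   ; ⇒ "abc"
mixed        ; ⇒ "abc123def"
from-list    ; ⇒ "x1y"
```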

standard procedure: string-length string: Returns the number of characters in the given string.

standard procedure: string-ref string k

It is an error if k is not a valid index of string.

The string-ref procedure returns character k of string using zero-origin indexing. There is no requirement for this procedure to execute in constant time.

standard procedure: string-set! string k char

It is an error if string is not a mutable string or if k is not a valid index of string.

The string-set! procedure stores char in element k of string. There is no requirement for this procedure to execute in constant time.

(define (f) (make-string 3 #\*))
(define (g) "***")
(string-set! (f) 0 #\?)  ⇒  unspecified
(string-set! (g) 0 #\?)  ⇒  error
(string-set! (symbol->string 'immutable) 0 #\?)  ⇒  error

standard procedure: string=? string1 string2 string …: Returns #t if all the strings are the same length and contain exactly the same characters in the same positions, otherwise returns #f.

char library procedure: string-ci=? string1 string2 string …: Returns #t if, after case-folding, all the strings are the same length and contain the same characters in the same positions, otherwise returns #f. Specifically, these procedures behave as if string-foldcase were applied to their arguments before comparing them.

standard procedure: string<? string1 string2 string …

char library procedure: string-ci<? string1 string2 string …

standard procedure: string>? string1 string2 string …

char library procedure: string-ci>? string1 string2 string …

standard procedure: string<=? string1 string2 string …

char library procedure: string-ci<=? string1 string2 string …

standard procedure: string>=? string1 string2 string …

char library procedure: string-ci>=? string1 string2 string …

These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically non-increasing.

These predicates are required to be transitive.

These procedures compare strings in an implementation-defined way. One approach is to make them the lexicographic extensions to strings of the corresponding orderings on characters. In that case, string<? would be the lexicographic ordering on strings induced by the ordering char<? on characters, and if the two strings differ in length but are the same up to the length of the shorter string, the shorter string would be considered to be lexicographically less than the longer string. However, it is also permitted to use the natural ordering imposed by the implementation’s internal representation of strings, or a more complex locale-specific ordering.

In all cases, a pair of strings must satisfy exactly one of string<?, string=?, and string>?, and must satisfy string<=? if and only if they do not satisfy string>? and string>=? if and only if they do not satisfy string<?.

The ‘-ci’ procedures behave as if they applied string-foldcase to their arguments before invoking the corresponding procedures without ‘-ci’.
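The following sketch exercises a few of these predicates. The ordering results assume the lexicographic extension of char<? described above, which is one of the permitted approaches. The equality results do not depend on that assumption.

```scheme
;; Equality is n-ary: every string must match.
(define all-equal (string=? "abc" "abc" "abc"))

;; Ordering, assuming lexicographic comparison by character.
(define fruit-order (string<? "apple" "banana" "cherry"))
(define prefix-less (string<? "app" "apple"))   ; shorter prefix sorts first

;; The -ci variants fold case before comparing.
(define same-ignoring-case (string-ci=? "Hello" "hELLO"))
(define same-exact (string=? "Hello" "hELLO"))

all-equal            ; ⇒ #t
fruit-order          ; ⇒ #t
prefix-less          ; ⇒ #t
same-ignoring-case   ; ⇒ #t
same-exact           ; ⇒ #f
```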

procedure: string-compare string1 string2 if-eq if-lt if-gt

procedure: string-compare-ci string1 string2 if-eq if-lt if-gt

If-eq, if-lt, and if-gt are procedures of no arguments (thunks). The two strings are compared: if they are equal, if-eq is applied; if string1 is less than string2, if-lt is applied; otherwise string1 is greater than string2 and if-gt is applied. The value of the procedure is the value of the thunk that is applied.

string-compare distinguishes uppercase and lowercase letters;
string-compare-ci does not.

(define (cheer) (display "Hooray!"))
(define (boo)   (display "Boo-hiss!"))
(string-compare "a" "b"  (lambda () 'ignore)  cheer  boo)
        -|  Hooray!
        ⇒  unspecified

char library procedure: string-upcase string

char library procedure: string-downcase string

procedure: string-titlecase string

char library procedure: string-foldcase string

These procedures apply the Unicode full string uppercasing, lowercasing, titlecasing, and case-folding algorithms to their arguments and return the result. In certain cases, the result differs in length from the argument. If the result is equal to the argument in the sense of string=?, the argument may be returned. Note that language-sensitive mappings and foldings are not used.

The Unicode Standard prescribes special treatment of the Greek letter Σ, whose normal lower-case form is σ but which becomes ς at the end of a word. See UAX #44 (part of the Unicode Standard) for details. However, implementations of string-downcase are not required to provide this behavior, and may choose to change Σ to σ in all cases.
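A brief sketch of the four case conversions on ASCII input, where the results are unambiguous. The final expression is included only to show where a length change can arise: under full Unicode case mapping the letter ß uppercases to two characters, so the result should be longer than the argument, but no output is asserted for it here.

```scheme
(define upper  (string-upcase "Hello, World"))
(define lower  (string-downcase "Hello, World"))
(define title  (string-titlecase "hello world"))
(define folded (string-foldcase "Hello"))

upper    ; ⇒ "HELLO, WORLD"
lower    ; ⇒ "hello, world"
title    ; ⇒ "Hello World"
folded   ; ⇒ "hello"

;; Compare the lengths before and after uppercasing a string containing ß.
(list (string-length "Straße")
      (string-length (string-upcase "Straße")))
```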

procedure: string-upper-case? string

procedure: string-lower-case? string

These procedures return #t if all the letters in the string are upper case or lower case, respectively; otherwise they return #f. The string must contain at least one letter or the procedures return #f.

(map string-upper-case? '(""    "A"    "art"  "Art"  "ART"))
                       ⇒ (#f    #t     #f     #f     #t)

standard procedure: substring string [start [end]]: Returns an immutable copy of the part of the given string between start and end.

procedure: string-slice string [start [end]]

Returns a slice of string, restricted to the range of characters specified by start and end. The returned slice will be mutable if string is mutable, or immutable if string is immutable.

A slice is a kind of string that provides a view into another string. The slice behaves like any other string, but changes to a mutable slice are reflected in the original string and vice versa.

(define foo (string-copy "abcde"))
foo ⇒ "abcde"

(define bar (string-slice foo 1 4))
bar ⇒ "bcd"

(string-set! foo 2 #\z)
foo ⇒ "abzde"
bar ⇒ "bzd"

(string-set! bar 1 #\y)
bar ⇒ "byd"
foo ⇒ "abyde"

standard procedure: string-append string …

procedure: string-append* strings

Returns an immutable string whose characters are the concatenation of the characters in the given strings.

The non-standard procedure string-append* is identical to string-append but takes a single argument that’s a list of strings, rather than multiple string arguments.
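For illustration, the two forms can be compared directly. The variable names are hypothetical.

```scheme
;; Strings passed as separate arguments.
(define joined (string-append "foo" "bar" "baz"))

;; The same strings passed as one list.
(define joined* (string-append* (list "foo" "bar" "baz")))

joined                     ; ⇒ "foobarbaz"
(string=? joined joined*)  ; ⇒ #t
(string-append)            ; ⇒ ""
```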

standard procedure: string->list string [start [end]]

standard procedure: list->string list

It is an error if any element of list is not a character.

The string->list procedure returns a newly allocated list of the characters of string between start and end. list->string returns an immutable string formed from the elements in the list list. In both procedures, order is preserved. string->list and list->string are inverses so far as equal? is concerned.
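A small sketch of the round trip between strings and lists, including the optional start and end arguments:

```scheme
(define chars (string->list "abcde"))
(define middle (string->list "abcde" 1 3))   ; indexes 1 and 2 only
(define reversed (list->string (reverse chars)))

chars      ; ⇒ (#\a #\b #\c #\d #\e)
middle     ; ⇒ (#\b #\c)
reversed   ; ⇒ "edcba"

;; Converting to a list and back yields an equal? string.
(equal? (list->string (string->list "abcde")) "abcde")   ; ⇒ #t
```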

standard procedure: string-copy string [start [end]]: Returns a newly allocated mutable copy of the part of the given string between start and end.

standard procedure: string-copy! to at from [start [end]]

It is an error if to is not a mutable string or if at is less than zero or greater than the length of to. It is also an error if (- (string-length to) at) is less than (- end start).

Copies the characters of string from between start and end to string to, starting at at. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. This can be achieved without allocating storage by making sure to copy in the correct direction in such circumstances.

(define a "12345")
(define b (string-copy "abcde"))
(string-copy! b 1 a 0 2) ⇒ 3
b ⇒ "a12de"

Implementation note: in MIT/GNU Scheme string-copy! returns the value (+ at (- end start)).
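The overlap rule can be illustrated by copying a string onto itself. This is a sketch with a hypothetical variable name. The result is what the "as if through a temporary string" rule requires.

```scheme
(define s (string-copy "abcdef"))

;; Copy characters 0..3 ("abcd") into s starting at index 2.
;; Source and destination overlap at indexes 2 and 3.
(string-copy! s 2 s 0 4)

s   ; ⇒ "ababcd"
```

A naive front-to-back copy loop would instead overwrite index 2 before reading it and produce "ababab", which is why the copy direction matters.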

standard procedure: string-fill! string fill [start [end]]

It is an error if string is not a mutable string or if fill is not a character.

The string-fill! procedure stores fill in the elements of string between start and end.
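A minimal sketch of a partial fill, using a mutable string from make-string:

```scheme
(define line (make-string 5 #\-))

;; Fill indexes 1 and 2 only; end is exclusive.
(string-fill! line #\* 1 3)

line   ; ⇒ "-**--"
```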

The next two procedures treat a given string as a sequence of grapheme clusters, a concept defined by the Unicode standard in UAX #29:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

procedure: grapheme-cluster-length string

This procedure returns the number of grapheme clusters in string.

For ASCII strings, this is identical to string-length.

procedure: grapheme-cluster-slice string start end

This procedure slices string at the grapheme-cluster boundaries specified by the start and end indices. These indices are grapheme-cluster indices, not normal string indices.

For ASCII strings, this is identical to string-slice.
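The difference between characters and grapheme clusters can be sketched with a base letter followed by a combining accent. The string is built with list->string, which this manual describes as performing no normalization, so the two code points stay separate. The name e-acute is illustrative.

```scheme
;; "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points,
;; one user-perceived character.
(define e-acute (list->string (list #\e (integer->char #x301))))

(string-length e-acute)                 ; ⇒ 2
(grapheme-cluster-length e-acute)       ; ⇒ 1

;; For ASCII input the cluster-based procedures match the ordinary ones.
(grapheme-cluster-length "hello")       ; ⇒ 5
(grapheme-cluster-slice "hello" 1 3)    ; ⇒ "el"
```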

procedure: string-word-breaks string: This procedure returns a list of word break indices for string, ordered from smallest index to largest. Word breaks are defined by the Unicode standard in UAX #29, and generally coincide with what we think of as the boundaries of words in written text.

MIT/GNU Scheme supports the Unicode canonical normalization forms NFC (Normalization Form C) and NFD (Normalization Form D). The reason for these forms is that there can be multiple different Unicode sequences for a given text; these sequences are semantically identical and should be treated equivalently for all purposes. If two such sequences are normalized to the same form, the resulting normalized sequences will be identical.

By default, most procedures that return strings return them in NFC. Notable exceptions are list->string, vector->string, and the utfX->string procedures, which do no normalization, and of course string->nfd.

Generally speaking, NFC is preferred for most purposes, as it is the minimal-length sequence for the variants. Consult the Unicode standard for the details and for information about why one normalization form is preferable for a specific purpose.

procedure: string-in-nfc? string

procedure: string-in-nfd? string

These procedures return #t if string is in Unicode Normalization Form C or D respectively. Otherwise they return #f.

Note that if string consists only of code points strictly less than #xC0, then string-in-nfd? returns #t. If string consists only of code points strictly less than #x300, then string-in-nfc? returns #t. Consequently both of these procedures will return #t for an ASCII string argument.

procedure: string->nfc string
procedure: string->nfd string: These procedures convert string into Unicode Normalization Form C or D respectively. If string is already in the correct form, they return string itself, or an immutable copy if string is mutable.
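As a sketch of how the two forms relate, the following builds a decomposed sequence with list->string (which does no normalization) and converts it in both directions. The variable names are illustrative.

```scheme
;; "e" + U+0301 is the canonical decomposition of U+00E9 (é).
(define decomposed (list->string (list #\e (integer->char #x301))))
(define composed (string->nfc decomposed))

(string-length decomposed)               ; ⇒ 2
(string-length composed)                 ; ⇒ 1
(string-in-nfd? decomposed)              ; ⇒ #t
(string-in-nfc? decomposed)              ; ⇒ #f
(string-length (string->nfd composed))   ; ⇒ 2

;; ASCII strings are already in both forms.
(string-in-nfc? "plain ASCII")           ; ⇒ #t
(string-in-nfd? "plain ASCII")           ; ⇒ #t
```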

standard procedure: string-map proc string string …

It is an error if proc does not accept as many arguments as there are strings and return a single character.

The string-map procedure applies proc element-wise to the elements of the strings and returns an immutable string of the results, in order. If more than one string is given and not all strings have the same length, string-map terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified. If multiple returns occur from string-map, the values returned by earlier returns are not mutated.

(string-map char-foldcase "AbdEgH")  ⇒  "abdegh"

(string-map
 (lambda (c)
   (integer->char (+ 1 (char->integer c))))
 "HAL")                 ⇒  "IBM"

(string-map
 (lambda (c k)
   ((if (eqv? k #\u) char-upcase char-downcase) c))
 "studlycaps xxx"
 "ululululul")          ⇒  "StUdLyCaPs"

standard procedure: string-for-each proc string string …

It is an error if proc does not accept as many arguments as there are strings.

The arguments to string-for-each are like the arguments to string-map, but string-for-each calls proc for its side effects rather than for its values. Unlike string-map, string-for-each is guaranteed to call proc on the elements of the strings in order from the first element(s) to the last, and the value returned by string-for-each is unspecified. If more than one string is given and not all strings have the same length, string-for-each terminates when the shortest string runs out. It is an error for proc to mutate any of the strings.

(let ((v '()))
  (string-for-each
   (lambda (c) (set! v (cons (char->integer c) v)))
   "abcde")
  v)                    ⇒  (101 100 99 98 97)

procedure: string-count proc string string …

It is an error if proc does not accept as many arguments as there are strings.

The string-count procedure applies proc element-wise to the elements of the strings and returns a count of the number of true values it returns. If more than one string is given and not all strings have the same length, string-count terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.

procedure: string-any proc string string …

It is an error if proc does not accept as many arguments as there are strings.

The string-any procedure applies proc element-wise to the elements of the strings and returns #t if any application of proc returns a true value. If no application returns a true value, string-any returns #f.

If more than one string is given and not all strings have the same length, string-any terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.

procedure: string-every proc string string …

It is an error if proc does not accept as many arguments as there are strings.

The string-every procedure applies proc element-wise to the elements of the strings and returns #f if any application of proc returns a false value. If no application returns a false value, string-every returns #t.

If more than one string is given and not all strings have the same length, string-every terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
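A short sketch of string-any and string-every with standard character predicates. The results follow the descriptions above, and the variable names are illustrative.

```scheme
(define has-digit      (string-any char-numeric? "abc1"))
(define no-digit       (string-any char-numeric? "abcd"))
(define all-letters    (string-every char-alphabetic? "abcd"))
(define not-all-letter (string-every char-alphabetic? "abc1"))

has-digit        ; ⇒ #t
no-digit         ; ⇒ #f
all-letters      ; ⇒ #t
not-all-letter   ; ⇒ #f
```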

procedure: string-null? string

Returns #t if string has zero length; otherwise returns #f.

(string-null? "")       ⇒  #t
(string-null? "Hi")     ⇒  #f

procedure: string-hash string [modulus]

procedure: string-hash-ci string [modulus]

These procedures return an exact non-negative integer that can be used for storing the specified string in a hash table. Equal strings (in the sense of string=? and string-ci=? respectively) return equal (=) hash codes, and non-equal but similar strings are usually mapped to distinct hash codes.

If the optional argument modulus is specified, it must be an exact positive integer, and the result of the hash computation is restricted to be less than that value. This is equivalent to calling modulo on the result, but may be faster.
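The guarantees above can be checked without knowing the actual hash values, which are implementation details. A minimal sketch:

```scheme
;; Equal strings hash equally, even when they are distinct objects.
(define h1 (string-hash "hello"))
(define h2 (string-hash (string-copy "hello")))
(= h1 h2)                                               ; ⇒ #t

;; The -ci variant ignores case.
(= (string-hash-ci "Hello") (string-hash-ci "hELLO"))   ; ⇒ #t

;; With a modulus, the result is less than the modulus.
(define bucket (string-hash "hello" 16))
(< bucket 16)                                           ; ⇒ #t
```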

procedure: string-head string end: Equivalent to (substring string 0 end).

procedure: string-tail string start: Equivalent to (substring string start).
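For illustration, splitting a string at a fixed index with these two procedures:

```scheme
(define text "hello world")

(string-head text 5)   ; ⇒ "hello"
(string-tail text 6)   ; ⇒ "world"

;; Head and tail at the same index reassemble the original.
(string=? (string-append (string-head text 5) (string-tail text 5))
          text)        ; ⇒ #t
```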

procedure: string-builder [buffer-length]

This procedure returns a string builder that can be used to incrementally collect characters and later convert that collection to a string. This is similar to a string output port, but is less general and significantly faster.

The optional buffer-length argument, if given, must be an exact positive integer. It controls the size of the internal buffers that are used to accumulate characters. Larger values make the builder somewhat faster but use more space. The default value of this argument is 16.

The returned string builder is a procedure that accepts zero or one arguments as follows:

Given a character argument, the string builder appends that character to the string being built and returns an unspecified value.
Given a string argument, the string builder appends that string to the string being built and returns an unspecified value.
Given no arguments, or one of the “result” arguments (see below), the string builder returns a copy of the string being built. Note that this does not affect the string being built, so immediately calling the builder with no arguments a second time returns a new copy of the same string.
Given the argument empty?, the string builder returns #t if the string being built is empty and #f otherwise.
Given the argument count, the string builder returns the size of the string being built.
Given the argument reset!, the string builder discards the string being built and returns to the state it was in when initially created.

The “result” arguments control the form of the returned string. The arguments immutable and mutable are straightforward, specifying the mutability of the returned string. For these arguments, the returned string contains exactly the same characters, in the same order, as were appended to the builder.

However, calling with the argument nfc, or with no arguments, returns an immutable string in Unicode Normalization Form C, exactly as if string->nfc were called on one of the other two result strings.
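The protocol above can be sketched as follows, assuming the symbol arguments behave as described in this section. The name builder is illustrative.

```scheme
(define builder (string-builder))

(builder #\a)            ; append a single character
(builder "bc")           ; append a whole string

(builder 'count)         ; ⇒ 3
(builder 'empty?)        ; ⇒ #f
(builder)                ; ⇒ "abc"
(builder)                ; ⇒ "abc"  (taking the result does not reset)

(builder 'reset!)        ; discard everything collected so far
(builder 'empty?)        ; ⇒ #t
```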

procedure: string-joiner infix prefix suffix

procedure: string-joiner* infix prefix suffix

The arguments of these procedures are keyword arguments; that is, each argument is a symbol of the same name followed by its value. The order of the arguments doesn’t matter, but each argument may appear only once.

These procedures return a joiner procedure that takes multiple strings and joins them together into an immutable string. The joiner returned by string-joiner accepts these strings as multiple string arguments, while string-joiner* accepts the strings as a single list-valued argument.

The joiner produces a result by adding prefix before, suffix after, and infix between each input string, then concatenating everything together into a single string. Each of the prefix, suffix, and infix arguments is optional and defaults to an empty string, so normally at least one is specified.

Some examples:

((string-joiner) "a" "b" "c")
  ⇒  "abc"

((string-joiner 'infix " ") "a" "b" "c")
  ⇒  "a b c"

((string-joiner 'infix ", ") "a" "b" "c")
  ⇒  "a, b, c"

((string-joiner* 'infix ", " 'prefix "<" 'suffix ">")
 '("a" "b" "c"))
  ⇒  "<a>, <b>, <c>"

procedure: string-splitter delimiter allow-runs? copier copy?

This procedure returns a splitter procedure that splits a given string into parts, returning a list of the parts. This is done by identifying delimiter characters and breaking the string at those delimiters. The splitting process is controlled by the arguments:

delimiter is either a character, a character set, or more generally a procedure that accepts a single character argument and returns a boolean value. The splitter uses this to identify delimiters in the string. The default value of this argument is char-whitespace?.
allow-runs? is a boolean that controls what happens when two or more adjacent delimiters are found. If allow-runs? is #t, then all of the adjacent delimiters are treated as if they were a single delimiter, and the string is split at the beginning and end of the delimiters. If allow-runs? is #f, then adjacent delimiters are treated as if they were separate with an empty string between them. The default value of this argument is #t.
copier is a procedure that accepts three arguments: a string, a start index, and an end index, returning the specified substring as a string. It defaults to string-slice.
copy? is a boolean, for backwards compatibility; instead use copier. A value of #t is equivalent to a copier of substring, while a value of #f is equivalent to a copier of string-slice.

Some examples:

((string-splitter) "a b c")
  ⇒  ("a" "b" "c")

((string-splitter) "a\tb\tc")
  ⇒  ("a" "b" "c")

((string-splitter 'delimiter #\space) "a\tb\tc")
  ⇒  ("a\tb\tc")

((string-splitter) " a  b  c ")
  ⇒  ("a" "b" "c")

((string-splitter 'allow-runs? #f) " a  b  c ")
  ⇒  ("" "a" "" "b" "" "c" "")

procedure: string-padder where fill-with clip?

This procedure returns a padder procedure that takes a string and a grapheme-cluster length as its arguments and returns a new string that has been padded to that length. The padder adds grapheme clusters to the string until it has the specified length. If the string’s grapheme-cluster length is greater than the given length, the string may, depending on the arguments, be reduced to the specified length.

The padding process is controlled by the arguments:

where is a symbol: either leading or trailing, which directs the padder to add/remove leading or trailing grapheme clusters. The default value of this argument is leading.
fill-with is a string that contains exactly one grapheme cluster, which is used as the padding to increase the size of the string. The default value of this argument is " " (a single space character).
clip? is a boolean that controls what happens if the given string has a longer grapheme-cluster length than the given length. If clip? is #t, grapheme clusters are removed (by slicing) from the string until it is the correct length; if it is #f then the string is returned unchanged. The grapheme clusters are removed from the beginning of the string if where is leading, otherwise from the end of the string. The default value of this argument is #t.

Some examples:

((string-padder) "abc def" 10)
  ⇒  "   abc def"

((string-padder 'where 'trailing) "abc def" 10)
  ⇒  "abc def   "

((string-padder 'fill-with "X") "abc def" 10)
  ⇒  "XXXabc def"

((string-padder) "abc def" 5)
  ⇒  "c def"

((string-padder 'where 'trailing) "abc def" 5)
  ⇒  "abc d"

((string-padder 'clip? #f) "abc def" 5)
  ⇒  "abc def"

obsolete procedure: string-pad-left string k [char]

obsolete procedure: string-pad-right string k [char]

These procedures are deprecated and should be replaced by use of string-padder which is more flexible.

These procedures return an immutable string created by padding string out to length k, using char. If char is not given, it defaults to #\space. If k is less than the length of string, the resulting string is a truncated form of string. string-pad-left adds padding characters or truncates from the beginning of the string (lowest indices), while string-pad-right does so at the end of the string (highest indices).

(string-pad-left "hello" 4)             ⇒  "ello"
(string-pad-left "hello" 8)             ⇒  "   hello"
(string-pad-left "hello" 8 #\*)         ⇒  "***hello"
(string-pad-right "hello" 4)            ⇒  "hell"
(string-pad-right "hello" 8)            ⇒  "hello   "

procedure: string-trimmer where to-trim copier copy?

This procedure returns a trimmer procedure that takes a string as its argument and trims that string, returning the trimmed result. The trimming process is controlled by the arguments:

where is a symbol: either leading, trailing, or both, which directs the trimmer to trim leading characters, trailing characters, or both. The default value of this argument is both.
to-trim is either a character, a character set, or more generally a procedure that accepts a single character argument and returns a boolean value. The trimmer uses this to identify characters to remove. The default value of this argument is char-whitespace?.
copier is a procedure that accepts three arguments: a string, a start index, and an end index, returning the specified substring as a string. It defaults to string-slice.
copy? is a boolean, for backwards compatibility; instead use copier. A value of #t is equivalent to a copier of substring, while a value of #f is equivalent to a copier of string-slice.

Some examples:

((string-trimmer 'where 'leading) "    ABC   DEF    ")
  ⇒  "ABC   DEF    "

((string-trimmer 'where 'trailing) "    ABC   DEF    ")
  ⇒  "    ABC   DEF"

((string-trimmer 'where 'both) "    ABC   DEF    ")
  ⇒  "ABC   DEF"

((string-trimmer) "    ABC   DEF    ")
  ⇒  "ABC   DEF"

((string-trimmer 'to-trim char-numeric? 'where 'leading)
 "21 East 21st Street #3")
  ⇒  " East 21st Street #3"

((string-trimmer 'to-trim char-numeric? 'where 'trailing)
 "21 East 21st Street #3")
  ⇒  "21 East 21st Street #"

((string-trimmer 'to-trim char-numeric?)
 "21 East 21st Street #3")
  ⇒  " East 21st Street #"

obsolete procedure: string-trim string [char-set]

obsolete procedure: string-trim-left string [char-set]

obsolete procedure: string-trim-right string [char-set]

These procedures are deprecated and should be replaced by use of string-trimmer which is more flexible.

Returns an immutable string created by removing all characters that are not in char-set from: (string-trim) both ends of string; (string-trim-left) the beginning of string; or (string-trim-right) the end of string. Char-set defaults to char-set:not-whitespace.

(string-trim "  in the end  ")          ⇒  "in the end"
(string-trim "              ")          ⇒  ""
(string-trim "100th" char-set:numeric)  ⇒  "100"
(string-trim-left "-.-+-=-" (char-set #\+))
                                        ⇒  "+-=-"
(string-trim "but (+ x y) is" (char-set #\( #\)))
                                        ⇒  "(+ x y)"

procedure: string-replace string char1 char2: Returns an immutable string containing the same characters as string except that all instances of char1 have been replaced by char2.
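A minimal sketch of character replacement. The original string is left untouched because a new string is returned.

```scheme
(define original "banana")
(define replaced (string-replace original #\a #\o))

replaced   ; ⇒ "bonono"
original   ; ⇒ "banana"
```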