Character Sets (MIT/GNU Scheme Reference Manual)

5.3 Character Sets

MIT/GNU Scheme’s character-set abstraction represents groups of characters, such as letters or digits. A character set may contain any character. Alternatively, a character set can be treated as a set of code points.

Implementation note: MIT/GNU Scheme allows any “bitless” character to be stored in a character set; operations that accept characters automatically strip their bucky bits.

procedure: char-set? object: Returns #t if object is a character set, otherwise it returns #f.

procedure: char-in-set? char char-set: Returns #t if char is in char-set, otherwise it returns #f.

procedure: code-point-in-set? code-point char-set: Returns #t if code-point is in char-set, otherwise it returns #f.

procedure: char-set-predicate char-set: Returns a procedure of one argument that returns #t if its argument is a character in char-set, otherwise it returns #f.
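The membership procedures above can be sketched as follows. The set vowels is a hypothetical example built with char-set, which is described later in this section.

```scheme
;; A small example set built from a string (the name `vowels' is illustrative).
(define vowels (char-set "aeiou"))

(char-set? vowels)                               ; => #t
(char-set? "aeiou")                              ; => #f, a string is not a character set

;; The same membership test, by character and by code point.
(char-in-set? #\e vowels)                        ; => #t
(code-point-in-set? (char->integer #\e) vowels)  ; => #t

;; char-set-predicate packages the membership test as a procedure.
(define vowel? (char-set-predicate vowels))
(vowel? #\o)                                     ; => #t
(vowel? #\z)                                     ; => #f
```

The predicate form is convenient when a procedure such as filter expects a one-argument test.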

procedure: compute-char-set predicate: Calls predicate once on each Unicode code point, and returns a character set containing exactly the code points for which predicate returns a true value.
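As a minimal sketch, compute-char-set can build the set of ASCII hexadecimal digits. This assumes that predicate is called with an integer code point rather than a character object, as the description above suggests; the names ascii-hex-digit-code-point? and hex-digits are illustrative only.

```scheme
;; Assumption: the argument CP is an exact integer code point.
(define (ascii-hex-digit-code-point? cp)
  (or (<= #x30 cp #x39)     ; 0-9
      (<= #x41 cp #x46)     ; A-F
      (<= #x61 cp #x66)))   ; a-f

(define hex-digits
  (compute-char-set ascii-hex-digit-code-point?))

(char-in-set? #\F hex-digits)   ; => #t
(char-in-set? #\g hex-digits)   ; => #f
```

Because predicate is applied across the whole code-point space, it is best kept cheap and free of side effects.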

The next procedures represent a character set as a code-point list, which is a list of code-point range elements. A code-point range is either a Unicode code point, or a pair (start . end) that specifies a contiguous range of code points. Both start and end must be exact nonnegative integers less than or equal to #x110000, and start must be less than or equal to end. The range specifies all of the code points greater than or equal to start and strictly less than end.

procedure: char-set element …

procedure: char-set* elements

Returns a new character set consisting of the characters specified by elements. The procedure char-set takes these elements as multiple arguments, while char-set* takes them as a single list-valued argument; in all other respects these procedures are identical.

An element can take several forms, each of which specifies one or more characters to include in the resulting character set: a character includes itself; a string includes all of the characters it contains; a character set includes its members; or a code-point range includes the corresponding characters.

In addition, an element may be a symbol from the following table, which represents the characters as shown:

Name	Unicode character specification
`alphabetic`	Alphabetic = True
`alphanumeric`	Alphabetic = True | Numeric_Type = Decimal
`cased`	Cased = True
`lower-case`	Lowercase = True
`numeric`	Numeric_Type = Decimal
`unicode`	General_Category != (Cs | Cn)
`upper-case`	Uppercase = True
`whitespace`	White_Space = True
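The element forms described above can be mixed freely in one call. The following sketch builds a hypothetical set of identifier characters twice, once with char-set and once with char-set*, and compares the results with char-set=? (described below).

```scheme
;; Each argument is one element; the comments name the form being used.
(define identifier-chars
  (char-set #\_                ; a character includes itself
            "-?!"              ; a string includes each character it contains
            'alphabetic        ; a symbol from the table above
            '(#x30 . #x3A)))   ; a code-point range: #x30 up to, not including, #x3A

;; The same elements passed as a single list.
(define identifier-chars*
  (char-set* (list #\_ "-?!" 'alphabetic '(#x30 . #x3A))))

(char-set=? identifier-chars identifier-chars*)   ; => #t
(char-in-set? #\9 identifier-chars)               ; => #t
(char-in-set? #\space identifier-chars)           ; => #f
```

char-set* is the natural choice when the elements are computed at run time and already held in a list.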

procedure: char-set->code-points char-set: Returns a code-point list specifying the contents of char-set. The returned list consists of numerically sorted, disjoint, and non-abutting code-point ranges.
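The half-open range convention and the normalized form of the returned list can be illustrated as follows. The names digits, digits-again, and merged are hypothetical, and the sketch deliberately checks properties of the result rather than assuming its exact printed form.

```scheme
;; The range (#x30 . #x3A) covers #x30 through #x39; the end is exclusive.
(define digits (char-set '(#x30 . #x3A)))

(char-in-set? #\9 digits)   ; => #t, #x39 lies inside the range
(char-in-set? #\: digits)   ; => #f, #x3A is the exclusive end

;; A code-point list can be fed back to char-set* to rebuild the set.
(define digits-again
  (char-set* (char-set->code-points digits)))
(char-set=? digits digits-again)   ; => #t

;; Two abutting input ranges denote the same set as one merged range.
(define merged (char-set '(#x30 . #x35) '(#x35 . #x3A)))
(char-set=? digits merged)         ; => #t
```

Since the returned ranges are sorted, disjoint, and non-abutting, the code-point list for merged should contain a single range element.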

procedure: char-set=? char-set-1 char-set-2: Returns #t if char-set-1 and char-set-2 contain exactly the same characters, otherwise it returns #f.

procedure: char-set-invert char-set: Returns a character set that’s the inverse of char-set. That is, the returned character set contains exactly those characters that aren’t in char-set.

procedure: char-set-union char-set …
procedure: char-set-intersection char-set …
procedure: char-set-difference char-set-1 char-set …: These procedures compute the respective set union, set intersection, and set difference of their arguments.
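A short sketch of the set operations, together with char-set-invert and char-set=?, using three small hypothetical sets:

```scheme
(define lower  (char-set '(#x61 . #x7B)))   ; a-z
(define vowels (char-set "aeiou"))
(define digits (char-set '(#x30 . #x3A)))   ; 0-9

(define lower-or-digit (char-set-union lower digits))
(define consonants     (char-set-difference lower vowels))
(define non-digits     (char-set-invert digits))

(char-in-set? #\7 lower-or-digit)   ; => #t
(char-in-set? #\b consonants)       ; => #t
(char-in-set? #\a consonants)       ; => #f
(char-in-set? #\7 non-digits)       ; => #f

;; Intersecting with a superset leaves a set unchanged,
;; and inverting twice returns the original set.
(char-set=? vowels (char-set-intersection lower vowels))   ; => #t
(char-set=? digits (char-set-invert non-digits))           ; => #t
```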

procedure: char-set-union* char-sets
procedure: char-set-intersection* char-sets: These procedures correspond to char-set-union and char-set-intersection but take a single argument that’s a list of character sets rather than multiple character-set arguments.
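The list-taking variants are useful when the sets are collected at run time. A minimal sketch, with parts as a hypothetical list of three overlapping sets:

```scheme
(define parts
  (list (char-set "abc")
        (char-set "bcd")
        (char-set "cde")))

;; Characters appearing in at least one set.
(define in-any (char-set-union* parts))
;; Characters appearing in every set.
(define in-all (char-set-intersection* parts))

(char-set=? in-any (char-set "abcde"))   ; => #t
(char-set=? in-all (char-set "c"))       ; => #t
```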

constant: char-set:alphabetic
constant: char-set:numeric
constant: char-set:whitespace
constant: char-set:upper-case
constant: char-set:lower-case
constant: char-set:alphanumeric: These constants are the character sets corresponding to char-alphabetic?, char-numeric?, char-whitespace?, char-upper-case?, char-lower-case?, and char-alphanumeric? respectively.
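For illustration, the constants can be used directly with char-in-set?, and each agrees with the corresponding character predicate:

```scheme
(char-in-set? #\a char-set:alphabetic)       ; => #t
(char-in-set? #\7 char-set:numeric)          ; => #t
(char-in-set? #\space char-set:whitespace)   ; => #t
(char-in-set? #\Q char-set:upper-case)       ; => #t
(char-in-set? #\q char-set:lower-case)       ; => #t
(char-in-set? #\7 char-set:alphanumeric)     ; => #t
(char-in-set? #\- char-set:alphanumeric)     ; => #f

;; The character predicates give the same answers.
(char-alphabetic? #\a)                       ; => #t
(char-numeric? #\7)                          ; => #t
```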

procedure: 8-bit-char-set? char-set: Returns #t if char-set contains only 8-bit code points (i.e., ISO 8859-1 characters), otherwise it returns #f.
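A brief sketch of 8-bit-char-set?, using integer->char to build a character outside the 8-bit range without depending on source-file encoding:

```scheme
;; Every code point here is below #x100.
(8-bit-char-set? (char-set "abc"))                      ; => #t
(8-bit-char-set? (char-set '(0 . #x100)))               ; => #t

;; U+03BB (Greek small letter lambda) lies outside ISO 8859-1.
(8-bit-char-set? (char-set (integer->char #x3BB)))      ; => #f
(8-bit-char-set? (char-set "abc" (integer->char #x3BB))) ; => #f
```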