-@node Characters, Strings, Numbers, Top
+@node Characters
@chapter Characters
@cindex character (defn)
-Characters are objects that represent printed characters, such as
-letters and digits.
+Characters are objects that represent printed characters such as
+letters and digits. MIT/GNU Scheme supports the full Unicode
+character repertoire.
@menu
-* External Representation of Characters::
-* Comparison of Characters::
-* Miscellaneous Character Operations::
-* Internal Representation of Characters::
-* ISO-8859-1 Characters::
-* Character Sets::
+* Character implementation::
* Unicode::
+* Character Sets::
@end menu
-@node External Representation of Characters, Comparison of Characters, Characters, Characters
-@section External Representation of Characters
-@cindex external representation, for character
-
@cindex #\ as external representation
@findex #\
Characters are written using the notation @code{#\@var{character}} or
-@code{#\@var{character-name}}. For example:
+@code{#\@var{character-name}} or @code{#\x@var{hex-scalar-value}}.
+
+The following standard character names are supported:
@example
@group
-#\a @r{; lowercase letter}
-#\A @r{; uppercase letter}
-#\( @r{; left parenthesis}
-#\space @r{; the space character}
-#\newline @r{; the newline character}
+#\alarm @r{; U+0007}
+#\backspace @r{; U+0008}
+#\delete @r{; U+007F}
+#\escape @r{; U+001B}
+#\newline @r{; the linefeed character, U+000A}
+#\null @r{; the null character, U+0000}
+#\return @r{; the return character, U+000D}
+#\space @r{; the preferred way to write a space, U+0020}
+#\tab @r{; the tab character, U+0009}
@end group
@end example
-@findex #\space
+@findex #\alarm
+@findex #\backspace
+@findex #\delete
+@findex #\escape
@findex #\newline
+@findex #\null
+@findex #\return
+@findex #\space
+@findex #\tab
-@noindent
-Case is significant in @code{#\@var{character}}, but not in
-@code{#\@var{character-name}}. If @var{character} in
-@code{#\@var{character}} is a letter, @var{character} must be followed
-by a delimiter character such as a space or parenthesis. Characters
-written in the @code{#\} notation are self-evaluating; you don't need to
-quote them.
-
-@findex #\U+
-In addition to the standard character syntax, MIT Scheme also supports a
-general syntax that denotes any Unicode character by its scalar value.
-This notation is @code{#\U+@var{scalar-value}}, where @var{scalar-value} is
-a sequence of hexadecimal digits for a valid scalar value. So the above
-examples could also be written like this:
+Here are some additional examples:
@example
@group
-#\U+61 @r{; lowercase letter}
-#\U+41 @r{; uppercase letter}
-#\U+28 @r{; left parenthesis}
-#\U+20 @r{; the space character}
-#\U+0A @r{; the newline character}
+#\a @r{; lowercase letter}
+#\A @r{; uppercase letter}
+#\( @r{; left parenthesis}
+#\ @r{; the space character}
@end group
@end example
+@noindent
+Case is significant in @code{#\@var{character}}, and in
+@code{#\@var{character-name}}, but not in
+@code{#\x@var{hex-scalar-value}}. If @var{character} in
+@code{#\@var{character}} is alphabetic, then any character immediately
+following @var{character} cannot be one that can appear in an
+identifier. This rule resolves the ambiguous case where, for example,
+the sequence of characters @samp{#\space} could be taken to be either
+a representation of the space character or a representation of the
+character @samp{#\s} followed by a representation of the symbol
+@samp{pace}.
+
+Characters written in the @code{#\} notation are self-evaluating.
+That is, they do not have to be quoted in programs.
+
+Some of the procedures that operate on characters ignore the
+difference between upper case and lower case. The procedures that
+ignore case have @samp{-ci} (for ``case insensitive'') embedded in
+their names.
+
@cindex bucky bit, prefix (defn)
@cindex control, bucky bit prefix (defn)
@cindex meta, bucky bit prefix (defn)
@cindex super, bucky bit prefix (defn)
@cindex hyper, bucky bit prefix (defn)
-A character name may include one or more @dfn{bucky bit} prefixes to
-indicate that the character includes one or more of the keyboard shift
-keys Control, Meta, Super, or Hyper (note that the Control bucky bit
-prefix is not the same as the @acronym{ASCII} control key). The bucky
-bit prefixes and their meanings are as follows (case is not
-significant):
+MIT/GNU Scheme allows a character name to include one or more
+@dfn{bucky bit} prefixes to indicate that the character includes one
+or more of the keyboard shift keys Control, Meta, Super, or Hyper
+(note that the Control bucky bit prefix is not the same as the
+@acronym{ASCII} control key). The bucky bit prefixes and their
+meanings are as follows (case is not significant):
@example
@group
@group
#\c-a @r{; Control-a}
#\meta-b @r{; Meta-b}
-#\c-s-m-h-a @r{; Control-Meta-Super-Hyper-A}
-@end group
-@end example
-
-@cindex character, named (defn)
-@cindex name, of character
-The following @var{character-name}s are supported, shown here with their
-@acronym{ASCII} equivalents:
-
-@example
-@group
-Character Name ASCII Name
--------------- ----------
-
-altmode ESC
-backnext US
-backspace BS
-call SUB
-linefeed LF
-page FF
-return CR
-rubout DEL
-space
-tab HT
-@end group
-@end example
-@findex #\altmode
-@findex #\backnext
-@findex #\backspace
-@findex #\call
-@findex #\linefeed
-@findex #\page
-@findex #\return
-@findex #\rubout
-@findex #\space
-@findex #\tab
-
-@noindent
-@cindex newline character (defn)
-@findex #\newline
-In addition, @code{#\newline} is the same as @code{#\linefeed} (but this
-may change in the future, so you should not depend on it). All of the
-standard @acronym{ASCII} names for non-printing characters are supported:
-
-@example
-@group
-NUL SOH STX ETX EOT ENQ ACK BEL
-BS HT LF VT FF CR SO SI
-DLE DC1 DC2 DC3 DC4 NAK SYN ETB
-CAN EM SUB ESC FS GS RS US
-DEL
+#\c-s-m-h-A @r{; Control-Meta-Super-Hyper-A}
@end group
@end example
-@deffn procedure char->name char [slashify?]
+@deffn procedure char->name char
Returns a string corresponding to the printed representation of
-@var{char}. This is the @var{character} or @var{character-name}
-component of the external representation, combined with the appropriate
-bucky bit prefixes.
+@var{char}. This is the @var{character}, @var{character-name}, or
+@code{x@var{hex-scalar-value}} component of the external
+representation, combined with the appropriate bucky bit prefixes.
@example
@group
(char->name #\a) @result{} "a"
-(char->name #\space) @result{} "Space"
+(char->name #\space) @result{} "space"
(char->name #\c-a) @result{} "C-a"
(char->name #\control-a) @result{} "C-a"
@end group
@end example
-
-@findex read
-@var{Slashify?}, if specified and true, says to insert the necessary
-backslash characters in the result so that @code{read} will parse it
-correctly. In other words, the following generates the external
-representation of @var{char}:
-
-@example
-(string-append "#\\" (char->name @var{char} #t))
-@end example
-
-@noindent
-If @var{slashify?} is not specified, it defaults to @code{#f}.
@end deffn
@deffn procedure name->char string
@example
@group
(name->char "a") @result{} #\a
-(name->char "space") @result{} #\Space
+(name->char "space") @result{} #\space
+(name->char "SPACE") @result{} #\space
(name->char "c-a") @result{} #\C-a
(name->char "control-a") @result{} #\C-a
@end group
@end example
@end deffn
-@node Comparison of Characters, Miscellaneous Character Operations, External Representation of Characters, Characters
-@section Comparison of Characters
+@deffn {standard procedure} char? object
+@cindex type predicate, for character
+Returns @code{#t} if @var{object} is a character, otherwise returns
+@code{#f}.
+@end deffn
+
+@deffn {standard procedure} char=? char1 char2
+@deffnx {standard procedure} char<? char1 char2
+@deffnx {standard procedure} char>? char1 char2
+@deffnx {standard procedure} char<=? char1 char2
+@deffnx {standard procedure} char>=? char1 char2
+@cindex equivalence predicate, for characters
@cindex ordering, of characters
@cindex comparison, of characters
-@cindex equivalence predicates, for characters
-
-@deffn procedure char=? char1 char2
-@deffnx procedure char<? char1 char2
-@deffnx procedure char>? char1 char2
-@deffnx procedure char<=? char1 char2
-@deffnx procedure char>=? char1 char2
-@deffnx {procedure} char-ci=? char1 char2
-@deffnx {procedure} char-ci<? char1 char2
-@deffnx {procedure} char-ci>? char1 char2
-@deffnx {procedure} char-ci<=? char1 char2
-@deffnx {procedure} char-ci>=? char1 char2
-@cindex equivalence predicate, for characters
-Returns @code{#t} if the specified characters are have the appropriate
-order relationship to one another; otherwise returns @code{#f}. The
-@code{-ci} procedures don't distinguish uppercase and lowercase letters.
+These procedures return @code{#t} if the results of passing their
+arguments to @code{char->integer} are respectively equal,
+monotonically increasing, monotonically decreasing, monotonically
+non-decreasing, or monotonically non-increasing.
-Character ordering follows these portability rules:
+These predicates are transitive.
+@end deffn
-@itemize @bullet
-@item
-The digits are in order; for example, @code{(char<? #\0 #\9)} returns
-@code{#t}.
+@deffn {standard procedure} char-ci=? char1 char2
+@deffnx {standard procedure} char-ci<? char1 char2
+@deffnx {standard procedure} char-ci>? char1 char2
+@deffnx {standard procedure} char-ci<=? char1 char2
+@deffnx {standard procedure} char-ci>=? char1 char2
+These procedures are similar to @code{char=?} et cetera, but they
+treat upper case and lower case letters as the same. For example,
+@code{(char-ci=? #\A #\a)} returns @code{#t}.
-@item
-The uppercase characters are in order; for example, @code{(char<? #\A
-#\B)} returns @code{#t}.
+Specifically, these procedures behave as if @code{char-foldcase} were
+applied to their arguments before they were compared.
+@end deffn
-@item
-The lowercase characters are in order; for example, @code{(char<? #\a
-#\b)} returns @code{#t}.
-@end itemize
+@deffn {standard procedure} char-alphabetic? char
+@deffnx {standard procedure} char-numeric? char
+@deffnx {standard procedure} char-whitespace? char
+@deffnx {standard procedure} char-upper-case? char
+@deffnx {standard procedure} char-lower-case? char
+These procedures return @code{#t} if their arguments are alphabetic,
+numeric, whitespace, upper case, or lower case characters
+respectively, otherwise they return @code{#f}.
-MIT/GNU Scheme uses a specific character ordering, in which characters
-have the same order as their corresponding integers. See the
-documentation for @code{char->integer} for further details.
+Specifically, they return @code{#t} when applied to characters with
+the Unicode properties Alphabetic, Numeric_Decimal, White_Space,
+Uppercase, or Lowercase respectively, and @code{#f} when applied to
+any other Unicode characters. Note that many Unicode characters are
+alphabetic but neither upper nor lower case.
+@end deffn
-@strong{Note}: Although character objects can represent all of Unicode,
-the model of alphabetic case used covers only @acronym{ASCII} letters,
-which means that case-insensitive comparisons and case conversions are
-incorrect for non-@acronym{ASCII} letters. This will eventually be
-fixed.
+@deffn procedure char-alphanumeric? char
+Returns @code{#t} if @var{char} is either alphabetic or numeric,
+otherwise it returns @code{#f}.
@end deffn
-@node Miscellaneous Character Operations, Internal Representation of Characters, Comparison of Characters, Characters
-@section Miscellaneous Character Operations
+@deffn {standard procedure} digit-value char
+This procedure returns the numeric value (0 to 9) of its argument
+if it is a numeric digit (that is, if @code{char-numeric?} returns @code{#t}),
+or @code{#f} on any other character.
-@deffn procedure char? object
-@cindex type predicate, for character
-Returns @code{#t} if @var{object} is a character; otherwise returns
-@code{#f}.
+@example
+@group
+(digit-value #\3) @result{} 3
+(digit-value #\x0664) @result{} 4
+(digit-value #\x0AE6) @result{} 0
+(digit-value #\x0EA6) @result{} #f
+@end group
+@end example
+@end deffn
+
+@deffn {standard procedure} char->integer char
+@deffnx {standard procedure} integer->char n
+Given a Unicode character, @code{char->integer} returns an exact
+integer between @code{0} and @code{#xD7FF} or between @code{#xE000}
+and @code{#x10FFFF} which is equal to the Unicode scalar value of that
+character. Given a non-Unicode character, it returns an exact integer
+greater than @code{#x10FFFF}.
+
+Given an exact integer that is the value returned by a character when
+@code{char->integer} is applied to it, @code{integer->char} returns
+that character.
+
+Implementation note: MIT/GNU Scheme allows any Unicode code point, not
+just scalar values.
+
+Implementation note: If the argument to @code{char->integer} or
+@code{integer->char} is a constant, the MIT/GNU Scheme compiler will
+constant-fold the call, replacing it with the corresponding result.
+This is a very useful way to denote unusual character constants or
+@acronym{ASCII} codes.
@end deffn
-@deffn procedure char-upcase char
-@deffnx procedure char-downcase char
+@deffn {standard procedure} char-upcase char
+@deffnx {standard procedure} char-downcase char
+@deffnx {standard procedure} char-foldcase char
@cindex uppercase, character conversion
@cindex lowercase, character conversion
@cindex case conversion, of character
-@findex char-ci=?
-Returns the uppercase or lowercase equivalent of @var{char} if
-@var{char} is a letter; otherwise returns @var{char}. These procedures
-return a character @var{char2} such that @code{(char-ci=? @var{char}
-@var{char2})}.
+@cindex case folding, of character
+The @code{char-upcase} procedure, given an argument that is the
+lowercase part of a Unicode casing pair, returns the uppercase member
+of the pair. Note that language-sensitive casing pairs are not used.
+If the argument is not the lowercase member of such a pair, it is
+returned.
+
+The @code{char-downcase} procedure, given an argument that is the
+uppercase part of a Unicode casing pair, returns the lowercase member
+of the pair. Note that language-sensitive casing pairs are not used.
+If the argument is not the uppercase member of such a pair, it is
+returned.
-@strong{Note}: Although character objects can represent all of Unicode,
-the model of alphabetic case used covers only @acronym{ASCII} letters,
-which means that case-insensitive comparisons and case conversions are
-incorrect for non-@acronym{ASCII} letters. This will eventually be
-fixed.
+The @code{char-foldcase} procedure applies the Unicode simple
+case-folding algorithm to its argument and returns the result. Note
+that language-sensitive folding is not used. See
+@uref{http://www.unicode.org/reports/tr44/, UAX #44} (part of the
+Unicode Standard) for details.
+
+Note that many Unicode lowercase characters do not have uppercase
+equivalents.
@end deffn
@deffn procedure char->digit char [radix]
If @var{char} is a character representing a digit in the given
-@var{radix}, returns the corresponding integer value. If you specify
-@var{radix} (which must be an exact integer between 2 and 36 inclusive),
-the conversion is done in that base, otherwise it is done in base 10.
-If @var{char} doesn't represent a digit in base @var{radix},
-@code{char->digit} returns @code{#f}.
+@var{radix}, returns the corresponding integer value. If @var{radix}
+is specified (which must be an exact integer between 2 and 36
+inclusive), the conversion is done in that base, otherwise it is done
+in base 10. If @var{char} doesn't represent a digit in base
+@var{radix}, @code{char->digit} returns @code{#f}.
Note that this procedure is insensitive to the alphabetic case of
@var{char}.
@deffn procedure digit->char digit [radix]
Returns a character that represents @var{digit} in the radix given by
-@var{radix}. @var{Radix} must be an exact integer between 2 and 36
-(inclusive), and defaults to 10. @var{Digit}, which must be an
-exact non-negative integer, should be less than @var{radix}; if
-@var{digit} is greater than or equal to @var{radix}, @code{digit->char}
-returns @code{#f}.
+@var{radix}. The @var{radix} argument, if given, must be an exact
+integer between 2 and 36 (inclusive); it defaults to 10. The
+@var{digit} argument must be an exact non-negative integer strictly
+less than @var{radix}.
@example
@group
@end example
@end deffn
-@node Internal Representation of Characters, ISO-8859-1 Characters, Miscellaneous Character Operations, Characters
-@section Internal Representation of Characters
+@node Character implementation, Unicode, Characters, Characters
+@section Character implementation
@cindex internal representation, for character
@cindex character code (defn)
@cindex bucky bit, of character (defn)
@cindex ASCII character
An MIT/GNU Scheme character consists of a @dfn{code} part and a
-@dfn{bucky bits} part. The MIT/GNU Scheme set of characters can
-represent more characters than @acronym{ASCII} can; it includes
-characters with Super and Hyper bucky bits, as well as Control and Meta.
-Every @acronym{ASCII} character corresponds to some MIT/GNU Scheme
-character, but not vice versa.@footnote{Note that the Control bucky bit
-is different from the @acronym{ASCII} control key. This means that
-@code{#\SOH} (@acronym{ASCII} ctrl-A) is different from @code{#\C-A}.
-In fact, the Control bucky bit is completely orthogonal to the
-@acronym{ASCII} control key, making possible such characters as
-@code{#\C-SOH}.}
-
-MIT/GNU Scheme uses a 21-bit character code with 4 bucky bits. The
-character code contains the Unicode scalar value for the character. This
-is a change from earlier versions of the system, which used the
-@acronym{ISO-8859-1} scalar value, but it is upwards compatible with
-previous usage, since @acronym{ISO-8859-1} is a proper subset of
-Unicode.
+@dfn{bucky bits} part. The code part is a Unicode code point, while
+the bucky bits are an additional set of bits representing shift keys
+available on some keyboards.
+
+There are 4 bucky bits, named @dfn{control}, @dfn{meta}, @dfn{super},
+and @dfn{hyper}. On GNU/Linux systems running a graphical desktop,
+the control bit corresponds to the @key{CTRL} key; the meta bit
+corresponds to the @key{ALT} key; and the super bit corresponds to the
+``windows'' key. On Macos, these are the @key{CONTROL}, @key{OPTION},
+and @key{COMMAND} keys respectively.
+
+Characters with bucky bits are not used much outside of graphical user
+interfaces (e.g. Edwin). They cannot be stored in strings or
+character sets, and aren't read or written by textual I/O ports.
@deffn procedure make-char code bucky-bits
@cindex construction, of character
-Builds a character from @var{code} and @var{bucky-bits}. Both
-@var{code} and @var{bucky-bits} must be exact non-negative integers in
-the appropriate range. Use @code{char-code} and @code{char-bits} to
-extract the code and bucky bits from the character. If @code{0} is
-specified for @var{bucky-bits}, @code{make-char} produces an ordinary
-character; otherwise, the appropriate bits are turned on as follows:
-
+Builds a character from @var{code} and @var{bucky-bits}. The value of
+@var{code} must be a Unicode code point; the value of @var{bucky-bits}
+must be an exact non-negative integer strictly less than @code{16}.
+If @code{0} is specified for @var{bucky-bits}, @code{make-char}
+produces an ordinary character; otherwise, the appropriate bits are
+set as follows:
@example
@group
-1 Meta
-2 Control
-4 Super
-8 Hyper
+1 meta
+2 control
+4 super
+8 hyper
@end group
@end example
For example,
-
@example
@group
(make-char 97 0) @result{} #\a
@end example
@end deffn
-@deffn procedure char-bits char
-@cindex selection, of character component
-@cindex component selection, of character
-Returns the exact integer representation of @var{char}'s bucky bits.
-For example,
-
-@example
-@group
-(char-bits #\a) @result{} 0
-(char-bits #\m-a) @result{} 1
-(char-bits #\c-a) @result{} 2
-(char-bits #\c-m-a) @result{} 3
-@end group
-@end example
-@end deffn
-
@deffn procedure char-code char
-Returns the character code of @var{char}, an exact integer. For
-example,
+Returns the Unicode code point of @var{char}. Note that if @var{char}
+has no bucky bits set, then this is the same value returned by
+@code{char->integer}.
+For example,
@example
@group
(char-code #\a) @result{} 97
(char-code #\c-a) @result{} 97
@end group
@end example
-
-Note that in MIT/GNU Scheme, the value of @code{char-code} is the
-Unicode scalar value for @var{char}.
@end deffn
-@defvr variable char-code-limit
-@defvrx variable char-bits-limit
-These variables define the (exclusive) upper limits for the character
-code and bucky bits (respectively). The character code and bucky bits
-are always exact non-negative integers, and are strictly less than the
-value of their respective limit variable.
-@end defvr
-
-@deffn procedure char->integer char
-@deffnx procedure integer->char k
-@code{char->integer} returns the character code representation for
-@var{char}. @code{integer->char} returns the character whose character
-code representation is @var{k}.
-
-@findex char-ascii?
-@findex char->ascii
-In MIT/GNU Scheme, if @code{(char-ascii? @var{char})} is true, then
-
-@example
-(eqv? (char->ascii @var{char}) (char->integer @var{char}))
-@end example
-
-@noindent
-However, this behavior is not required by the Scheme standard, and
-code that depends on it is not portable to other implementations.
-
-@findex char<=?
-@findex <=
-These procedures implement order isomorphisms between the set of
-characters under the @code{char<=?} ordering and some subset of the
-integers under the @code{<=} ordering. That is, if
-
-@example
-(char<=? a b) @result{} #t @r{and} (<= x y) @result{} #t
-@end example
-
-and @code{x} and @code{y} are in the range of @code{char->integer},
-then
-
-@example
-@group
-(<= (char->integer a)
- (char->integer b)) @result{} #t
-(char<=? (integer->char x)
- (integer->char y)) @result{} #t
-@end group
-@end example
-
-In MIT/GNU Scheme, the specific relationship implemented by these
-procedures is as follows:
+@deffn procedure char-bits char
+@cindex selection, of character component
+@cindex component selection, of character
+Returns the exact integer representation of @var{char}'s bucky bits.
+For example,
@example
@group
-(define (char->integer c)
- (+ (* (char-bits c) #x200000)
- (char-code c)))
-
-(define (integer->char n)
- (make-char (remainder n #x200000)
- (quotient n #x200000)))
+(char-bits #\a) @result{} 0
+(char-bits #\m-a) @result{} 1
+(char-bits #\c-a) @result{} 2
+(char-bits #\c-m-a) @result{} 3
@end group
@end example
-
-This implies that @code{char->integer} and @code{char-code} produce
-identical results for characters that have no bucky bits set, and that
-characters are ordered according to their Unicode scalar values.
-
-Note: If the argument to @code{char->integer} or @code{integer->char} is
-a constant, the compiler will constant-fold the call, replacing it with
-the corresponding result. This is a very useful way to denote unusual
-character constants or @acronym{ASCII} codes.
@end deffn
-@defvr variable char-integer-limit
-The range of @code{char->integer} is defined to be the exact
-non-negative integers that are less than the value of this variable
-(exclusive). Note, however, that there are some holes in this range,
-because the character code must be a valid Unicode scalar value.
+@defvr constant char-code-limit
+This constant is the strict upper limit on a character's @var{code}
+value. It is @code{#x110000} unless some future version of Unicode
+increases the range of code points.
@end defvr
-@node ISO-8859-1 Characters, Character Sets, Internal Representation of Characters, Characters
-@section ISO-8859-1 Characters
-
-MIT/GNU Scheme internally uses @acronym{ISO-8859-1} codes for
-@acronym{I/O}, and stores character objects in a fashion that makes it
-convenient to convert between @acronym{ISO-8859-1} codes and
-characters. Also, character strings are implemented as byte vectors
-whose elements are @acronym{ISO-8859-1} codes; these codes are
-converted to character objects when accessed. For these reasons it is
-sometimes desirable to be able to convert between @acronym{ISO-8859-1}
-codes and characters.
-
-@cindex ISO-8859-1 character (defn)
-@cindex character, ISO-8859-1 (defn)
-Not all characters can be represented as @acronym{ISO-8859-1} codes. A
-character that has an equivalent @acronym{ISO-8859-1} representation is
-called an @dfn{ISO-8859-1 character}.
+@defvr constant char-bits-limit
+This constant is the strict upper limit on a character's
+@var{bucky-bits} value. It is currently @code{#x10} and unlikely to
+change in the future.
+@end defvr
-For historical reasons, the procedures that manipulate
-@acronym{ISO-8859-1} characters use the word ``@acronym{ASCII}'' rather
-than ``@acronym{ISO-8859-1}''.
+@deffn procedure bitless-char? object
+@cindex bitless character
+@cindex character, bitless
+Returns @code{#t} if @var{object} is a character with no bucky bits
+set, otherwise it returns @code{#f} .
+@end deffn
-@deffn procedure char-ascii? char
-Returns the @acronym{ISO-8859-1} code for @var{char} if @var{char} has an
-@acronym{ISO-8859-1} representation; otherwise returns @code{#f}.
+@deffn procedure char-predicate char
+Returns a procedure of one argument that returns @code{#t} if its
+argument is a character @code{char=?} to @var{char}, otherwise it
+returns @code{#f}.
+@end deffn
-In the current implementation, the characters that satisfy this
-predicate are those in which the bucky bits are turned off, and for
-which the character code is less than 256.
+@deffn procedure char-ci-predicate char
+Returns a procedure of one argument that returns @code{#t} if its
+argument is a character @code{char-ci=?} to @var{char}, otherwise it
+returns @code{#f}.
@end deffn
-@deffn procedure char->ascii char
-Returns the @acronym{ISO-8859-1} code for @var{char}. An error
-@code{condition-type:bad-range-argument} is signalled if @var{char}
-doesn't have an @acronym{ISO-8859-1} representation.
-@findex condition-type:bad-range-argument
+@node Unicode, Character Sets, Character implementation, Characters
+@section Unicode
+
+@cindex Unicode
+@cindex Unicode code point
+@cindex Unicode scalar value
+@cindex code point
+@cindex scalar value
+MIT/GNU Scheme implements the full Unicode character repertoire,
+defining predicates for Unicode characters and their associated
+integer values. A @dfn{Unicode code point} is an exact non-negative
+integer strictly less than @code{#x110000}. A @dfn{Unicode scalar
+value} is a Unicode code point that doesn't fall between @code{#xD800}
+inclusive and @code{#xE000} exclusive; in other words, any Unicode
+code point except for the @dfn{surrogate} code points.
+
+@deffn procedure unicode-code-point? object
+Returns @code{#t} if @var{object} is a Unicode code point, otherwise
+it returns @code{#f}.
@end deffn
-@deffn procedure ascii->char code
-@var{Code} must be the exact integer representation of an
-@acronym{ISO-8859-1} code. This procedure returns the character
-corresponding to @var{code}.
+@deffn procedure unicode-scalar-value? object
+Returns @code{#t} if @var{object} is a Unicode scalar value, otherwise
+it returns @code{#f}.
@end deffn
-@node Character Sets, Unicode, ISO-8859-1 Characters, Characters
+@deffn procedure unicode-char? object
+Returns @code{#t} if @var{object} is a character corresponding to a
+Unicode scalar value, otherwise it returns @code{#f}. In other words,
+it is true for any character for which @code{char->integer} returns a
+value satisfying @code{unicode-scalar-value?}.
+
+Note that this is a bit of a misnomer since characters corresponding
+to Unicode ``noncharacter'' scalar values satisfy this predicate.
+@end deffn
+
+@deffn procedure char-general-category char
+@deffnx procedure code-point-general-category code-point
+Returns the Unicode general category of @var{char} (or
+@var{code-point}) as a descriptive symbol:
+
+@multitable @columnfractions .1 .4
+@headitem Category @tab Symbol
+@item Lu
+@tab @code{letter:uppercase}
+@item Ll
+@tab @code{letter:lowercase}
+@item Lt
+@tab @code{letter:titlecase}
+@item Lm
+@tab @code{letter:modifier}
+@item Lo
+@tab @code{letter:other}
+@item Mn
+@tab @code{mark:nonspacing}
+@item Mc
+@tab @code{mark:spacing-combining}
+@item Me
+@tab @code{mark:enclosing}
+@item Nd
+@tab @code{number:decimal-digit}
+@item Nl
+@tab @code{number:letter}
+@item No
+@tab @code{number:other}
+@item Pc
+@tab @code{punctuation:connector}
+@item Pd
+@tab @code{punctuation:dash}
+@item Ps
+@tab @code{punctuation:open}
+@item Pe
+@tab @code{punctuation:close}
+@item Pi
+@tab @code{punctuation:initial-quote}
+@item Pf
+@tab @code{punctuation:final-quote}
+@item Po
+@tab @code{punctuation:other}
+@item Sm
+@tab @code{symbol:math}
+@item Sc
+@tab @code{symbol:currency}
+@item Sk
+@tab @code{symbol:modifier}
+@item So
+@tab @code{symbol:other}
+@item Zs
+@tab @code{separator:space}
+@item Zl
+@tab @code{separator:line}
+@item Zp
+@tab @code{separator:paragraph}
+@item Cc
+@tab @code{other:control}
+@item Cf
+@tab @code{other:format}
+@item Cs
+@tab @code{other:surrogate}
+@item Co
+@tab @code{other:private-use}
+@item Cn
+@tab @code{other:not-assigned}
+@end multitable
+@end deffn
+
+@node Character Sets, , Unicode, Characters
@section Character Sets
@cindex character set
@cindex set, of characters
MIT/GNU Scheme's character-set abstraction is used to represent groups
of characters, such as the letters or digits. A character set may
-contain any Unicode character.
+contain any ``bitless'' character. Alternatively, a character set can
+be treated as a set of code points.
@deffn procedure char-set? object
@cindex type predicate, for character set
-Returns @code{#t} if @var{object} is a character set; otherwise returns
-@code{#f}.
+Returns @code{#t} if @var{object} is a character set, otherwise it
+returns @code{#f}.
@end deffn
-@defvr variable char-set:upper-case
-@defvrx variable char-set:lower-case
-@defvrx variable char-set:alphabetic
-@defvrx variable char-set:numeric
-@defvrx variable char-set:alphanumeric
-@defvrx variable char-set:whitespace
-@defvrx variable char-set:not-whitespace
-@defvrx variable char-set:graphic
-@defvrx variable char-set:not-graphic
-@defvrx variable char-set:standard
-These variables contain predefined character sets. At present, these
-character sets contain only @acronym{ISO-8859-1} characters; in the
-future they will contain all the relevant Unicode characters. To see
-the contents of one of these sets, use @code{char-set->scalar-values}.
-
-@cindex alphabetic character (defn)
-@cindex character, alphabetic (defn)
-@cindex numeric character (defn)
-@cindex character, numeric (defn)
-@cindex alphanumeric character (defn)
-@cindex character, alphanumeric (defn)
-@cindex whitespace character (defn)
-@cindex character, whitespace (defn)
-@cindex graphic character (defn)
-@cindex character, graphic (defn)
-@cindex standard character (defn)
-@cindex character, standard (defn)
-@findex #\space
-@findex #\tab
-@findex #\page
-@findex #\linefeed
-@findex #\return
-@findex #\newline
-@dfn{Alphabetic} characters are the 52 upper and lower case letters.
-@dfn{Numeric} characters are the 10 decimal digits. @dfn{Alphanumeric}
-characters are those in the union of these two sets. @dfn{Whitespace}
-characters are @code{#\space}, @code{#\tab}, @code{#\page},
-@code{#\linefeed}, and @code{#\return}. @var{Graphic} characters are
-the printing characters and @code{#\space}. @var{Standard} characters
-are the printing characters, @code{#\space}, and @code{#\newline}.
-These are the printing characters:
-
-@example
-@group
-! " # $ % & ' ( ) * + , - . /
-0 1 2 3 4 5 6 7 8 9
-: ; < = > ? @@
-A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
-[ \ ] ^ _ `
-a b c d e f g h i j k l m n o p q r s t u v w x y z
-@{ | @} ~
-@end group
-@end example
-@end defvr
+@deffn procedure char-in-set? char char-set
+Returns @code{#t} if @var{char} is in @var{char-set}, otherwise it
+returns @code{#f}.
+@end deffn
-@deffn {procedure} char-upper-case? char
-@deffnx {procedure} char-lower-case? char
-@deffnx {procedure} char-alphabetic? char
-@deffnx {procedure} char-numeric? char
-@deffnx procedure char-alphanumeric? char
-@deffnx {procedure} char-whitespace? char
-@deffnx procedure char-graphic? char
-@deffnx procedure char-standard? object
-These predicates are defined in terms of the respective character sets
-defined above.
+@deffn procedure code-point-in-set? code-point char-set
+Returns @code{#t} if @var{code-point} is in @var{char-set}, otherwise
+it returns @code{#f}.
@end deffn
-@deffn procedure char-set-member? char-set char
-Returns @code{#t} if @var{char} is in @var{char-set}; otherwise returns
+@deffn procedure char-set-predicate char-set
+Returns a procedure of one argument that returns @code{#t} if its
+argument is a character in @var{char-set}, otherwise it returns
@code{#f}.
@end deffn
-@deffn {procedure} char-set=? char-set-1 char-set-2
-Returns @code{#t} if @var{char-set-1} and @var{char-set-2} contain
-exactly the same characters; otherwise returns @code{#f}.
+@deffn procedure compute-char-set predicate
+Calls @var{predicate} once on each Unicode code point, and returns a
+character set containing exactly the code points for which
+@var{predicate} returns a true value.
@end deffn
-@deffn procedure char-set char @dots{}
-@cindex construction, of character set
-Returns a character set consisting of the specified characters. With no
-arguments, @code{char-set} returns an empty character set.
-@end deffn
+@cindex code-point list
+@cindex code-point range
+@cindex code-point range
+The next procedures represent a character set as a @dfn{code-point
+list}, which is a list of @dfn{code-point range} elements. A
+code-point range is either a Unicode code point, or a pair
+@code{(@var{start} . @var{end})} that specifies a contiguous range of
+code points. Both @var{start} and @var{end} must be exact nonnegative
+integers less than or equal to @code{#x110000}, and @var{start} must
+be less than or equal to @var{end}. The range specifies all of the
+code points greater than or equal to @var{start} and strictly less
+than @var{end}.
-@deffn procedure chars->char-set chars
-Returns a character set consisting of @var{chars}, which must be a list
-of characters. This is equivalent to @code{(apply char-set
-@var{chars})}.
-@end deffn
+@deffn procedure char-set element @dots{}
+@deffnx procedure char-set* elements
+Returns a new character set consisting of the characters specified by
+@var{element}s. The procedure @code{char-set} takes these elements as
+multiple arguments, while @code{char-set*} takes them as a single
+list-valued argument; in all other respects these procedures are
+identical.
-@deffn procedure string->char-set string
-Returns a character set consisting of all the characters that occur in
-@var{string}.
+An @var{element} can take several forms, each of which specifies one
+or more characters to include in the resulting character set: a
+(bitless) character includes itself; a string includes all of the characters it
+contains; a character set includes its members; or a code-point range
+includes the corresponding characters.
@end deffn
-@deffn procedure scalar-values->char-set items
-Returns a character set containing the Unicode scalar values described
-by @var{items}. @var{Items} must satisfy
-@code{well-formed-scalar-values-list?}.
+@deffn procedure char-set->code-points char-set
+Returns a code-point list specifying the contents of @var{char-set}.
+The returned list consists of numerically sorted, disjoint, and
+non-abutting code-point ranges.
@end deffn
-@deffn procedure char-set->scalar-values char-set
-Returns a well-formed scalar-values list that describes the Unicode
-scalar values represented by @var{char-set}.
-@end deffn
-
-@deffn procedure well-formed-scalar-values-list? object
-Returns @code{#t} if @var{object} is a well-formed scalar-values list,
-otherwise returns @code{#f}. A well-formed scalar-values list is a
-proper list, each element of which is either a Unicode scalar value or
-a pair of Unicode scalar values. A pair of Unicode scalar values
-represents a contiguous range of Unicode scalar values. The @sc{car}
-of the pair is the inclusive lower limit, and the @sc{cdr} is the
-exclusive upper limit. The lower limit must be less than or equal to
-the upper limit.
+@deffn {procedure} char-set=? char-set-1 char-set-2
+Returns @code{#t} if @var{char-set-1} and @var{char-set-2} contain
+exactly the same characters, otherwise it returns @code{#f}.
@end deffn
@deffn procedure char-set-invert char-set
-Returns a character set consisting of the characters that are not in
-@var{char-set}.
-@end deffn
-
-@deffn procedure char-set-difference char-set1 char-set @dots{}
-Returns a character set consisting of the characters that are in
-@var{char-set1} but aren't in any of the @var{char-set}s.
-@end deffn
-
-@deffn procedure char-set-intersection char-set @dots{}
-Returns a character set consisting of the characters that are in all of
-the @var{char-set}s.
+Returns a character set that's the inverse of @var{char-set}. That
+is, the returned character set contains exactly those characters that
+aren't in @var{char-set}.
@end deffn
@deffn procedure char-set-union char-set @dots{}
-Returns a character set consisting of the characters that are in at
-least one o the @var{char-set}s.
-@end deffn
+@deffnx procedure char-set-intersection char-set @dots{}
+@deffnx procedure char-set-difference char-set-1 char-set @dots{}
+These procedures compute the respective set union, set intersection,
+and set difference of their arguments.
+@end deffn
+
+@deffn procedure char-set-union* char-sets
+@deffnx procedure char-set-intersection* char-sets
+These procedures correspond to @code{char-set-union} and
+@code{char-set-intersection} but take a single argument that's a list
+of character sets rather than multiple character-set arguments.
+@end deffn
+
+@defvr constant char-set:alphabetic
+@defvrx constant char-set:numeric
+@defvrx constant char-set:whitespace
+@defvrx constant char-set:upper-case
+@defvrx constant char-set:lower-case
+@defvrx constant char-set:alphanumeric
+These constants are the character sets corresponding to
+@code{char-alphabetic?}, @code{char-numeric?},
+@code{char-whitespace?}, @code{char-upper-case?},
+@code{char-lower-case?}, and @code{char-alphanumeric?} respectively.
+@end defvr
@deffn procedure 8-bit-char-set? char-set
-Returns @code{#t} if @var{char-set} contains only 8-bit scalar values
-(i.e.@. @acronym{ISO-8859-1} characters), otherwise returns @code{#f}.
-@end deffn
-
-@deffn procedure ascii-range->char-set lower upper
-This procedure is obsolete. Instead use
-
-@example
-(scalar-values->char-set (list (cons @var{lower} @var{upper})))
-@end example
-@end deffn
-
-@deffn procedure char-set-members char-set
-This procedure is obsolete; instead use @code{char-set->scalar-values}.
-
-Returns a newly allocated list of the @acronym{ISO-8859-1} characters
-in @var{char-set}. If @var{char-set} contains any characters outside
-of the @acronym{ISO-8859-1} range, they will not be in the returned
-list.
-@end deffn
-
-@node Unicode, , Character Sets, Characters
-@section Unicode
-
-@cindex Unicode
-MIT/GNU Scheme provides rudimentary support for Unicode characters.
-In an ideal world, Unicode would be the base character set for MIT/GNU
-Scheme. But MIT/GNU Scheme predates the invention of Unicode, and
-converting an application of this size is a considerable undertaking.
-So for the time being, the base character set for strings is
-@acronym{ISO-8859-1}, and Unicode support is grafted on.
-
-This Unicode support was implemented as a part of the @acronym{XML}
-parser (@pxref{XML Support}) implementation. @acronym{XML} uses
-Unicode as its base character set, and any @acronym{XML}
-implementation @emph{must} support Unicode.
-
-@cindex Scalar value, Unicode
-@cindex Unicode character
-@cindex Character, Unicode
-The basic unit in a Unicode implementation is the @dfn{scalar value}. The
-character equivalent of a scalar value is a @dfn{Unicode character}.
-
-@deffn procedure unicode-scalar-value? object
-Returns @code{#t} if @var{object} is a Unicode scalar value. Scalar
-values are implemented as exact non-negative integers. They are further
-limited, by the Unicode standard, to be strictly less than
-@code{#x110000}, with the values @code{#xD800} through @code{#xDFFF},
-@code{#xFFFE}, and @code{#xFFFF} excluded.
-@end deffn
-
-@deffn procedure unicode-char? object
-Returns @code{#t} if @var{object} is a Unicode character, specifically
-if @var{object} is a character with no bucky bits and whose code
-satisfies @code{unicode-scalar-value?}.
-@end deffn
-
-The Unicode implementation consists of these parts:
-
-@itemize @bullet
-@item
-An implementation of @dfn{wide strings}, which are character strings
-that support the full Unicode character set with constant-time access.
-
-@item
-@acronym{I/O} procedures that read and write Unicode characters in
-several external representations, specifically @acronym{UTF-8},
-@acronym{UTF-16}, and @acronym{UTF-32}.
-@end itemize
-
-@menu
-* Wide Strings::
-* Unicode Representations::
-@end menu
-
-@node Wide Strings, Unicode Representations, Unicode, Unicode
-@subsection Wide Strings
-
-@cindex Wide string
-@cindex String, wide
-Wide characters can be combined into @dfn{wide strings}, which are
-similar to strings but can contain any Unicode character sequence. The
-implementation used for wide strings is guaranteed to provide
-constant-time access to each character in the string.
-
-@deffn procedure wide-string? object
-Returns @code{#t} if @var{object} is a wide string.
-@end deffn
-
-@deffn procedure make-wide-string k [unicode-char]
-Returns a newly allocated wide string of length @var{k}. If @var{char}
-is specified, all elements of the returned string are initialized to
-@var{char}; otherwise the contents of the string are unspecified.
-@end deffn
-
-@deffn procedure wide-string unicode-char @dots{}
-Returns a newly allocated wide string consisting of the specified
-characters.
-@end deffn
-
-@deffn procedure wide-string-length wide-string
-Returns the length of @var{wide-string} as an exact non-negative
-integer.
-@end deffn
-
-@deffn procedure wide-string-ref wide-string k
-Returns character @var{k} of @var{wide-string}. @var{K} must be a valid
-index of @var{string}.
-@end deffn
-
-@deffn procedure wide-string-set! wide-string k unicode-char
-Stores @var{char} in element @var{k} of @var{wide-string} and returns an
-unspecified value. @var{K} must be a valid index of @var{wide-string}.
-@end deffn
-
-@deffn procedure string->wide-string string [start [end]]
-Returns a newly allocated wide string with the same contents as
-@var{string}. If @var{start} and @var{end} are supplied, they specify a
-substring of @var{string} that is to be converted. @var{Start} defaults
-to @samp{0}, and @var{end} defaults to @samp{(string-length
-@var{string})}.
-@end deffn
-
-@deffn procedure wide-string->string wide-string [start [end]]
-Returns a newly allocated string with the same contents as
-@var{wide-string}. The argument @var{wide-string} must satisfy
-@code{wide-string?}. If @var{start} and @var{end} are supplied, they
-specify a substring of @var{wide-string} that is to be converted.
-@var{Start} defaults to @samp{0}, and @var{end} defaults to
-@samp{(wide-string-length @var{wide-string})}.
-
-It is an error if any character in @var{wide-string} fails to satisfy
-@code{char-ascii?}.
-@end deffn
-
-@deffn procedure open-wide-input-string wide-string [start [end]]
-Returns a new input port that sources the characters of
-@var{wide-string}. The optional arguments @var{start} and @var{end} may
-be used to specify that the port delivers characters from a substring of
-@var{wide-string}; if not given, @var{start} defaults to @samp{0} and
-@var{end} defaults to @samp{(wide-string-length @var{wide-string})}.
-@end deffn
-
-@deffn procedure open-wide-output-string
-Returns an output port that accepts Unicode characters and strings and
-accumulates them in a buffer. Call @code{get-output-string} on the
-returned port to get a wide string containing the accumulated
-characters.
-@end deffn
-
-@deffn procedure call-with-wide-output-string procedure
-Creates a wide-string output port and calls @var{procedure} on that
-port. The value returned by @var{procedure} is ignored, and the
-accumulated output is returned as a wide string. This is equivalent to:
-
-@example
-@group
-(define (call-with-wide-output-string procedure)
- (let ((port (open-wide-output-string)))
- (procedure port)
- (get-output-string port)))
-@end group
-@end example
-@end deffn
-
-@node Unicode Representations, , Wide Strings, Unicode
-@subsection Unicode Representations
-
-@cindex Unicode external representations
-@cindex external representations, Unicode
-The procedures in this section implement transformations that convert
-between the internal representation of Unicode characters and several
-standard external representations. These external representations are
-all implemented as sequences of bytes, but they differ in their intended
-usage.
-
-@cindex UTF-8
-@cindex UTF-16
-@cindex UTF-32
-@table @acronym
-@item UTF-8
-Each character is written as a sequence of one to four bytes.
-
-@item UTF-16
-Each character is written as a sequence of one or two 16-bit integers.
-
-@item UTF-32
-Each character is written as a single 32-bit integer.
-@end table
-
-@cindex Big endian
-@cindex Little endian
-@cindex Host endian
-@cindex Endianness
-The @acronym{UTF-16} and @acronym{UTF-32} representations may be
-serialized to and from a byte stream in either @dfn{big-endian} or
-@dfn{little-endian} order. In big-endian order, the most significant
-byte is first, the next most significant byte is second, etc.@: In
-little-endian order, the least significant byte is first, etc.@: All of
-the @acronym{UTF-16} and @acronym{UTF-32} representation procedures are
-available in both orders, which are indicated by names containing
-@samp{utfNN-be} and @samp{utfNN-le}, respectively. There are also
-procedures that implement @dfn{host-endian} order, which is either
-big-endian or little-endian depending on the underlying computer
-architecture.
-
-@deffn procedure utf8-string->wide-string string [start [end]]
-@deffnx procedure utf16-be-string->wide-string string [start [end]]
-@deffnx procedure utf16-le-string->wide-string string [start [end]]
-@deffnx procedure utf16-string->wide-string string [start [end]]
-@deffnx procedure utf32-be-string->wide-string string [start [end]]
-@deffnx procedure utf32-le-string->wide-string string [start [end]]
-@deffnx procedure utf32-string->wide-string string [start [end]]
-Each of these procedures converts a byte vector to a wide string,
-treating @var{string} as a stream of bytes encoded in the corresponding
-@samp{utfNN} representation. The arguments @var{start} and @var{end}
-allow specification of a substring; they default to zero and
-@var{string}'s length, respectively.
-@end deffn
-
-@deffn procedure utf8-string-length string [start [end]]
-@deffnx procedure utf16-be-string-length string [start [end]]
-@deffnx procedure utf16-le-string-length string [start [end]]
-@deffnx procedure utf16-string-length string [start [end]]
-@deffnx procedure utf32-be-string-length string [start [end]]
-@deffnx procedure utf32-le-string-length string [start [end]]
-@deffnx procedure utf32-string-length string [start [end]]
-Each of these procedures counts the number of Unicode characters in a
-byte vector, treating @var{string} as a stream of bytes encoded in the
-corresponding @samp{utfNN} representation. The arguments @var{start}
-and @var{end} allow specification of a substring; they default to zero
-and @var{string}'s length, respectively.
-@end deffn
-
-@deffn procedure wide-string->utf8-string string [start [end]]
-@deffnx procedure wide-string->utf16-be-string string [start [end]]
-@deffnx procedure wide-string->utf16-le-string string [start [end]]
-@deffnx procedure wide-string->utf16-string string [start [end]]
-@deffnx procedure wide-string->utf32-be-string string [start [end]]
-@deffnx procedure wide-string->utf32-le-string string [start [end]]
-@deffnx procedure wide-string->utf32-string string [start [end]]
-Each of these procedures converts a wide string to a stream of bytes
-encoded in the corresponding @samp{utfNN} representation, and returns
-that stream as a byte vector. The arguments @var{start}
-and @var{end} allow specification of a substring; they default to zero
-and @var{string}'s length, respectively.
+Returns @code{#t} if @var{char-set} contains only 8-bit code points
+(i.e.@. @acronym{ISO-8859-1} characters), otherwise it returns
+@code{#f}.
@end deffn