From: Chris Hanson <org/chris-hanson/cph>
Date: Sun, 30 Aug 2009 07:08:31 +0000 (-0700)
Subject: Change "code point" to "scalar value" everywhere.  Change "wide char"
X-Git-Tag: 20100708-Gtk~355
X-Git-Url: https://birchwood-abbey.net/git?a=commitdiff_plain;h=3641498176e63df5c18f7df6db5b58f73fc9af7a;p=mit-scheme.git

Change "code point" to "scalar value" everywhere.  Change "wide char"
to "unicode char" everywhere.  Relax the definition of a well-formed
scalar-value list so that it allows overlaps, unsorted elements, and
non-canonical pairs.
---

diff --git a/doc/ref-manual/characters.texi b/doc/ref-manual/characters.texi
index a40f2b363..7938c9ed8 100644
--- a/doc/ref-manual/characters.texi
+++ b/doc/ref-manual/characters.texi
@@ -54,9 +54,9 @@ quote them.
 
 @findex #\U+
 In addition to the standard character syntax, MIT Scheme also supports a
-general syntax that denotes any Unicode character by its code point.
-This notation is @code{#\U+@var{code-point}}, where @var{code-point} is
-a sequence of hexadecimal digits for a valid code point.  So the above
+general syntax that denotes any Unicode character by its scalar value.
+This notation is @code{#\U+@var{scalar-value}}, where @var{scalar-value} is
+a sequence of hexadecimal digits for a valid scalar value.  So the above
 examples could also be written like this:
 
 @example
@@ -331,9 +331,9 @@ In fact, the Control bucky bit is completely orthogonal to the
 @code{#\C-SOH}.}
 
 MIT/GNU Scheme uses a 21-bit character code with 4 bucky bits.  The
-character code contains the Unicode code point for the character.  This
+character code contains the Unicode scalar value for the character.  This
 is a change from earlier versions of the system, which used the
-@acronym{ISO-8859-1} code point, but it is upwards compatible with
+@acronym{ISO-8859-1} scalar value, but it is upwards compatible with
 previous usage, since @acronym{ISO-8859-1} is a proper subset of
 Unicode.
 
@@ -395,7 +395,7 @@ example,
 @end example
 
 Note that in MIT/GNU Scheme, the value of @code{char-code} is the
-Unicode code point for @var{char}.
+Unicode scalar value for @var{char}.
 @end deffn
 
 @defvr variable char-code-limit
@@ -463,7 +463,7 @@ procedures is as follows:
 
 This implies that @code{char->integer} and @code{char-code} produce
 identical results for characters that have no bucky bits set, and that
-characters are ordered according to their Unicode code points.
+characters are ordered according to their Unicode scalar values.
 
 Note: If the argument to @code{char->integer} or @code{integer->char} is
 a constant, the compiler will constant-fold the call, replacing it with
@@ -475,7 +475,7 @@ character constants or @acronym{ASCII} codes.
 The range of @code{char->integer} is defined to be the exact
 non-negative integers that are less than the value of this variable
 (exclusive).  Note, however, that there are some holes in this range,
-because the character code must be a valid Unicode code point.
+because the character code must be a valid Unicode scalar value.
 @end defvr
 
 @node ISO-8859-1 Characters, Character Sets, Internal Representation of Characters, Characters
@@ -691,24 +691,24 @@ parser (@pxref{XML Support}) implementation.  @acronym{XML} uses
 Unicode as its base character set, and any @acronym{XML}
 implementation @emph{must} support Unicode.
 
-@cindex Code point, Unicode
-@cindex Wide character
-@cindex Character, wide
-The basic unit in a Unicode implementation is the @dfn{code point}.  The
-character equivalent of a code point is a @dfn{wide character}.
+@cindex Scalar value, Unicode
+@cindex Unicode character
+@cindex Character, Unicode
+The basic unit in a Unicode implementation is the @dfn{scalar value}.  The
+character equivalent of a scalar value is a @dfn{Unicode character}.
 
-@deffn procedure unicode-code-point? object
-Returns @code{#t} if @var{object} is a Unicode code point.  Code points
-are implemented as exact non-negative integers.  They are further
+@deffn procedure unicode-scalar-value? object
+Returns @code{#t} if @var{object} is a Unicode scalar value.  Scalar
+values are implemented as exact non-negative integers.  They are further
 limited, by the Unicode standard, to be strictly less than
 @code{#x110000}, with the values @code{#xD800} through @code{#xDFFF},
 @code{#xFFFE}, and @code{#xFFFF} excluded.
 @end deffn
 
-@deffn procedure wide-char? object
-Returns @code{#t} if @var{object} is a wide character, specifically if
-@var{object} is a character with no bucky bits and whose code satisfies
-@code{unicode-code-point?}.
+@deffn procedure unicode-char? object
+Returns @code{#t} if @var{object} is a Unicode character, specifically
+if @var{object} is a character with no bucky bits and whose code
+satisfies @code{unicode-scalar-value?}.
 @end deffn
 
 The Unicode implementation consists of three parts:
@@ -725,7 +725,7 @@ several external representations, specifically @acronym{UTF-8},
 
 @item
 An @dfn{alphabet} abstraction, which is an efficient implementation of
-sets of Unicode code points (similar to the @code{char-set}
+sets of Unicode scalar values (similar to the @code{char-set}
 abstraction).
 @end itemize
 
@@ -749,13 +749,13 @@ constant-time access to each character in the string.
 Returns @code{#t} if @var{object} is a wide string.
 @end deffn
 
-@deffn procedure make-wide-string k [wide-char]
+@deffn procedure make-wide-string k [unicode-char]
 Returns a newly allocated wide string of length @var{k}.  If @var{char}
 is specified, all elements of the returned string are initialized to
 @var{char}; otherwise the contents of the string are unspecified.
 @end deffn
 
-@deffn procedure wide-string wide-char @dots{}
+@deffn procedure wide-string unicode-char @dots{}
 Returns a newly allocated wide string consisting of the specified
 characters.
 @end deffn
@@ -770,7 +770,7 @@ Returns character @var{k} of @var{wide-string}.  @var{K} must be a valid
 index of @var{string}.
 @end deffn
 
-@deffn procedure wide-string-set! wide-string k wide-char
+@deffn procedure wide-string-set! wide-string k unicode-char
 Stores @var{char} in element @var{k} of @var{wide-string} and returns an
 unspecified value.  @var{K} must be a valid index of @var{wide-string}.
 @end deffn
@@ -804,7 +804,7 @@ be used to specify that the port delivers characters from a substring of
 @end deffn
 
 @deffn procedure open-wide-output-string
-Returns an output port that accepts wide characters and strings and
+Returns an output port that accepts Unicode characters and strings and
 accumulates them in a buffer.  Call @code{get-output-string} on the
 returned port to get a wide string containing the accumulated
 characters.
@@ -915,37 +915,38 @@ and @var{string}'s length, respectively.
 Applications often need to manipulate sets of characters, such as the
 set of alphabetic characters or the set of whitespace characters.  The
 @dfn{alphabet} abstraction provides an efficient implementation of
-sets of Unicode code points.
+sets of Unicode scalar values.
 
 @deffn procedure alphabet? object
 Returns @code{#t} if @var{object} is a Unicode alphabet, otherwise
 returns @code{#f}.
 @end deffn
 
-@deffn procedure alphabet wide-char @dots{}
-Returns a Unicode alphabet containing the wide characters passed as
+@deffn procedure alphabet unicode-char @dots{}
+Returns a Unicode alphabet containing the Unicode characters passed as
 arguments.
 @end deffn
 
-@deffn procedure code-points->alphabet items
-Returns a Unicode alphabet containing the code points described by
+@deffn procedure scalar-values->alphabet items
+Returns a Unicode alphabet containing the scalar values described by
 @var{items}.  @var{Items} must satisfy
-@code{well-formed-code-points-list?}.
+@code{well-formed-scalar-values-list?}.
 @end deffn
 
-@deffn procedure alphabet->code-points alphabet
-Returns a well-formed code-points list that describes the code points
+@deffn procedure alphabet->scalar-values alphabet
+Returns a well-formed scalar-values list that describes the scalar values
 represented by @var{alphabet}.
 @end deffn
 
-@deffn procedure well-formed-code-points-list? object
-Returns @code{#t} if @var{object} is a well-formed code-points list,
-otherwise returns @code{#f}.  A well-formed code-points list is a
-proper list, each element of which is either a code point or a pair of
-code points.  A pair of code points represents a contiguous range of
-code points.  The @sc{car} of the pair is the lower limit, and the
-@sc{cdr} is the upper limit.  Both limits are inclusive, and the lower
-limit must be strictly less than the upper limit.
+@deffn procedure well-formed-scalar-values-list? object
+Returns @code{#t} if @var{object} is a well-formed scalar-values list,
+otherwise returns @code{#f}.  A well-formed scalar-values list is a
+proper list, each element of which is either a unicode scalar value or a
+pair of unicode scalar values.  A pair of scalar values represents a
+contiguous range of scalar values.  The @sc{car} of the pair is the
+lower limit, and the @sc{cdr} is the upper limit.  Both limits are
+inclusive, and the lower limit must be less than or equal to the upper
+limit.
 @end deffn
 
 @deffn procedure char-in-alphabet? char alphabet
@@ -954,23 +955,23 @@ otherwise returns @code{#f}.
 @end deffn
 
 Character sets and alphabets can be converted to one another, provided
-that the alphabet contains only 8-bit code points.  This is true
-because 8-bit code points in Unicode map directly to
+that the alphabet contains only 8-bit scalar values.  This is true
+because 8-bit scalar values in Unicode map directly to
 @acronym{ISO-8859-1} characters, which is what character sets contain.
 
 @deffn procedure char-set->alphabet char-set
-Returns a Unicode alphabet containing the code points that correspond
+Returns a Unicode alphabet containing the scalar values that correspond
 to characters that are members of @var{char-set}.
 @end deffn
 
 @deffn procedure alphabet->char-set alphabet
 Returns a character set containing the characters that correspond to
-8-bit code points that are members of @var{alphabet}.  (Code points
+8-bit scalar values that are members of @var{alphabet}.  (Scalar values
 outside the 8-bit range are ignored.)
 @end deffn
 
 @deffn procedure string->alphabet string
-Returns a Unicode alphabet containing the code points corresponding to
+Returns a Unicode alphabet containing the scalar values corresponding to
 the characters in @var{string}.  Equivalent to
 
 @example
@@ -979,22 +980,22 @@ the characters in @var{string}.  Equivalent to
 @end deffn
 
 @deffn procedure alphabet->string alphabet
-Returns a newly-allocated string containing the characters
-corresponding to the 8-bit code points in @var{alphabet}.  (Code
-points outside the 8-bit range are ignored.)
+Returns a newly-allocated string containing the characters corresponding
+to the 8-bit scalar values in @var{alphabet}.  (Scalar values outside
+the 8-bit range are ignored.)
 @end deffn
 
 @deffn procedure 8-bit-alphabet? alphabet
-Returns @code{#t} if @var{alphabet} contains only 8-bit code points,
+Returns @code{#t} if @var{alphabet} contains only 8-bit scalar values,
 otherwise returns @code{#f}.
 @end deffn
 
 @deffn procedure alphabet+ alphabet @dots{}
-Returns a Unicode alphabet that contains each code point that is a
+Returns a Unicode alphabet that contains each scalar value that is a
 member of any of the @var{alphabet} arguments.
 @end deffn
 
 @deffn procedure alphabet- alphabet1 alphabet2
-Returns a Unicode alphabet that contains each code point that is a
+Returns a Unicode alphabet that contains each scalar value that is a
 member of @var{alphabet1} and is not a member of @var{alphabet2}.
 @end deffn
diff --git a/src/runtime/genio.scm b/src/runtime/genio.scm
index 6a8792efe..79c12fbb7 100644
--- a/src/runtime/genio.scm
+++ b/src/runtime/genio.scm
@@ -1857,7 +1857,7 @@ USA.
 		     (* (get-byte bv bs 1) #x10000)
 		     (* (get-byte bv bs 2) #x100)
 		     (get-byte bv bs 3))))
-	     (if (unicode-code-point? cp)
+	     (if (unicode-scalar-value? cp)
 		 (begin
 		   (set-input-buffer-start! ib (fix:+ bs 4))
 		   cp)
@@ -1873,7 +1873,7 @@ USA.
 		     (* (get-byte bv bs 2) #x10000)
 		     (* (get-byte bv bs 1) #x100)
 		     (get-byte bv bs 0))))
-	     (if (unicode-code-point? cp)
+	     (if (unicode-scalar-value? cp)
 		 (begin
 		   (set-input-buffer-start! ib (fix:+ bs 4))
 		   cp)
diff --git a/src/runtime/runtime.pkg b/src/runtime/runtime.pkg
index 276e0f79e..95907d045 100644
--- a/src/runtime/runtime.pkg
+++ b/src/runtime/runtime.pkg
@@ -4910,6 +4910,9 @@ USA.
   (files "unicode")
   (parent (runtime))
   (export ()
+	  (error:not-wide-char error:not-unicode-char)
+	  (guarantee-wide-char guarantee-unicode-char)
+	  (wide-char? unicode-char?)
 	  (wide-string->utf16-be-string string->utf16-be-string)
 	  (wide-string->utf16-le-string string->utf16-le-string)
 	  (wide-string->utf16-string string->utf16-string)
@@ -4924,7 +4927,7 @@ USA.
 	  alphabet+
 	  alphabet-
 	  alphabet->char-set
-	  alphabet->code-points
+	  alphabet->scalar-values
 	  alphabet->string
 	  alphabet-predicate
 	  alphabet?
@@ -4944,11 +4947,11 @@ USA.
 	  call-with-utf8-output-string
 	  char-in-alphabet?
 	  char-set->alphabet
-	  code-points->alphabet
 	  combine-utf16-surrogates
 	  error:not-8-bit-alphabet
 	  error:not-alphabet
-	  error:not-unicode-code-point
+	  error:not-unicode-char
+	  error:not-unicode-scalar-value
 	  error:not-utf16-be-string
 	  error:not-utf16-high-surrogate
 	  error:not-utf16-le-string
@@ -4958,14 +4961,14 @@ USA.
 	  error:not-utf32-le-string
 	  error:not-utf32-string
 	  error:not-utf8-string
-	  error:not-well-formed-code-point-list
-	  error:not-wide-char
+	  error:not-well-formed-scalar-value-list
 	  error:not-wide-string
 	  for-all-chars-in-string?
 	  for-any-char-in-string?
 	  guarantee-8-bit-alphabet
 	  guarantee-alphabet
-	  guarantee-unicode-code-point
+	  guarantee-unicode-char
+	  guarantee-unicode-scalar-value
 	  guarantee-utf16-be-string
 	  guarantee-utf16-high-surrogate
 	  guarantee-utf16-le-string
@@ -4975,8 +4978,7 @@ USA.
 	  guarantee-utf32-le-string
 	  guarantee-utf32-string
 	  guarantee-utf8-string
-	  guarantee-well-formed-code-point-list
-	  guarantee-wide-char
+	  guarantee-well-formed-scalar-value-list
 	  guarantee-wide-string
 	  guarantee-wide-string-index
 	  guarantee-wide-substring
@@ -4995,6 +4997,7 @@ USA.
 	  open-utf32-output-string
 	  open-utf8-input-string
 	  open-utf8-output-string
+	  scalar-values->alphabet
 	  split-into-utf16-surrogates
 	  string->alphabet
 	  string->utf16-be-string
@@ -5006,7 +5009,8 @@ USA.
 	  string->utf8-string
 	  string->utf8-string
 	  string->wide-string
-	  unicode-code-point?
+	  unicode-char?
+	  unicode-scalar-value?
 	  utf16-be-string->wide-string
 	  utf16-be-string-length
 	  utf16-be-string-valid?
@@ -5038,8 +5042,7 @@ USA.
 	  utf8-string-length
 	  utf8-string-valid?
 	  utf8-string?
-	  well-formed-code-point-list?
-	  wide-char?
+	  well-formed-scalar-value-list?
 	  wide-string
 	  wide-string->string
 	  wide-string-index?
diff --git a/src/runtime/string.scm b/src/runtime/string.scm
index ad8a94bf2..9114ac238 100644
--- a/src/runtime/string.scm
+++ b/src/runtime/string.scm
@@ -203,7 +203,8 @@ USA.
   (cond ((string? object) (string->utf8-string object))
 	((symbol? object) (symbol-name object))
 	((wide-string? object) (wide-string->utf8-string object))
-	((wide-char? object) (wide-string->utf8-string (wide-string object)))
+	((unicode-char? object)
+	 (wide-string->utf8-string (wide-string object)))
 	(else (%->string object 'UTF8-STRING))))
 
 (define (%->string object caller)
diff --git a/src/runtime/unicode.scm b/src/runtime/unicode.scm
index 289d36bc5..6764160a2 100644
--- a/src/runtime/unicode.scm
+++ b/src/runtime/unicode.scm
@@ -162,21 +162,18 @@ USA.
 
 ;;;; Unicode characters
 
-(define (wide-char? object)
+(define (unicode-char? object)
   (and (char? object)
        (legal-code-32? (char->integer object))))
 
-(define-guarantee wide-char "a Unicode character")
+(define-guarantee unicode-char "a Unicode character")
 
-(define (unicode-code-point? object)
-  (and (%unicode-code-point? object)
-       (not (illegal? object))))
-
-(define (%unicode-code-point? object)
+(define (unicode-scalar-value? object)
   (and (index-fixnum? object)
-       (fix:< object char-code-limit)))
+       (fix:< object char-code-limit)
+       (not (illegal? object))))
 
-(define-guarantee unicode-code-point "a Unicode code point")
+(define-guarantee unicode-scalar-value "a Unicode scalar value")
 
 (define-integrable (legal-code-32? pt)
   (and (fix:< pt char-code-limit)
@@ -213,26 +210,26 @@ Not used at the moment.
 (define-integrable (make-alphabet-low)
   (make-string #x100 (integer->char 0)))
 
-(define-integrable (alphabet-low-ref low code-point)
-  (not (fix:= (fix:and (vector-8b-ref low (fix:lsh code-point -3))
-		       (fix:lsh 1 (fix:and code-point 7)))
+(define-integrable (alphabet-low-ref low scalar-value)
+  (not (fix:= (fix:and (vector-8b-ref low (fix:lsh scalar-value -3))
+		       (fix:lsh 1 (fix:and scalar-value 7)))
 	      0)))
 
-(define-integrable (alphabet-low-set! low code-point)
+(define-integrable (alphabet-low-set! low scalar-value)
   (vector-8b-set! low
-		  (fix:lsh code-point -3)
-		  (fix:or (vector-8b-ref low (fix:lsh code-point -3))
-			  (fix:lsh 1 (fix:and code-point 7)))))
+		  (fix:lsh scalar-value -3)
+		  (fix:or (vector-8b-ref low (fix:lsh scalar-value -3))
+			  (fix:lsh 1 (fix:and scalar-value 7)))))
 
 (define null-alphabet
   (make-alphabet (make-alphabet-low) '#() '#()))
 
 (define (char-in-alphabet? char alphabet)
-  (guarantee-wide-char char 'CHAR-IN-ALPHABET?)
+  (guarantee-unicode-char char 'CHAR-IN-ALPHABET?)
   (guarantee-alphabet alphabet 'CHAR-IN-ALPHABET?)
-  (%code-point-in-alphabet? (char-code char) alphabet))
+  (%scalar-value-in-alphabet? (char-code char) alphabet))
 
-(define (%code-point-in-alphabet? pt alphabet)
+(define (%scalar-value-in-alphabet? pt alphabet)
   (if (fix:< pt #x800)
       (alphabet-low-ref (alphabet-low alphabet) pt)
       (let ((high1 (alphabet-high1 alphabet))
@@ -246,36 +243,31 @@ Not used at the moment.
 			(loop (fix:+ index 1) upper))
 		       (else #t))))))))
 
-(define (well-formed-code-point-list? items)
-  (if (pair? items)
-      (and (well-formed-item? (car items))
-	   (let loop ((a (car items)) (items (cdr items)))
-	     (if (pair? items)
-		 (let ((b (car items))
-		       (items (cdr items)))
-		   (and (well-formed-item? b)
-			(fix:< (if (pair? a) (cdr a) a)
-			       (if (pair? b) (car b) b))
-			(loop b items)))
-		 (null? items))))
-      (null? items)))
+(define (well-formed-scalar-value-list? items)
+  (list-of-type? items well-formed-item?))
 
 (define (well-formed-item? item)
   (if (pair? item)
-      (and (%unicode-code-point? (car item))
-	   (%unicode-code-point? (cdr item))
-	   (fix:< (car item) (cdr item)))
-      (%unicode-code-point? item)))
+      (and (unicode-scalar-value? (car item))
+	   (unicode-scalar-value? (cdr item))
+	   (fix:<= (car item) (cdr item)))
+      (unicode-scalar-value? item)))
 
-(define-guarantee well-formed-code-point-list "a Unicode code-point list")
+(define-guarantee well-formed-scalar-value-list "a Unicode scalar-value list")
 
-(define (code-points->alphabet items)
-  (guarantee-well-formed-code-point-list items 'CODE-POINTS->ALPHABET)
-  (%code-points->alphabet items))
+(define (scalar-values->alphabet items)
+  (guarantee-well-formed-scalar-value-list items 'SCALAR-VALUES->ALPHABET)
+  (%scalar-values->alphabet items))
+
+(define (alphabet . chars)
+  (for-each (lambda (char)
+	      (guarantee-unicode-char char 'ALPHABET))
+	    chars)
+  (%scalar-values->alphabet (map char->integer chars)))
 
-(define (%code-points->alphabet items)
+(define (%scalar-values->alphabet items)
   (receive (low-items high-items)
-      (split-list (canonicalize-code-point-list items) #x800)
+      (split-list (canonicalize-scalar-value-list items) #x800)
     (let ((low (make-alphabet-low)))
       (for-each (lambda (item)
 		  (if (pair? item)
@@ -299,83 +291,13 @@ Not used at the moment.
 		  (vector-set! high2 i (car items)))))
 	  (make-alphabet low high1 high2))))))
 
-(define (canonicalize-code-point-list items)
-  (if (pair? items)
-      (let ((a (car items)))
-	(let loop
-	    ((al (if (pair? a) (car a) a))
-	     (ah (if (pair? a) (cdr a) a))
-	     (items (cdr items)))
-	  (if (pair? items)
-	      (let ((b (car items))
-		    (items (cdr items)))
-		(let ((bl (if (pair? b) (car b) b))
-		      (bh (if (pair? b) (cdr b) b)))
-		  (if (fix:= (fix:+ ah 1) bl)
-		      (loop al bh items)
-		      (cons (if (fix:= al ah) al (cons al ah))
-			    (loop bl bh items)))))
-	      (list (if (fix:= al ah) al (cons al ah))))))
-      '()))
-
-(define (split-list items limit)
-  (let loop ((items items) (low '()))
-    (if (pair? items)
-	(let ((item (car items)))
-	  (cond ((not (pair? item))
-		 (if (fix:< item limit)
-		     (loop (cdr items) (cons item low))
-		     (values low items)))
-		((fix:< (cdr item) limit)
-		 (loop (cdr items) (cons item low)))
-		((fix:<= limit (car item))
-		 (values low items))
-		(else
-		 (values (cons (cons (car item) (fix:- limit 1)) low)
-			 (cons (cons limit (cdr item)) items)))))
-	(values low '()))))
-
-(define (alphabet . chars)
-  (%code-points->alphabet (chars->wfcp-list (remove-duplicate-chars chars))))
-
-(define (remove-duplicate-chars chars)
-  (let ((table (make-eq-hash-table)))
-    (for-each (lambda (char)
-		(guarantee-wide-char char 'REMOVE-DUPLICATE-CHARS)
-		(hash-table/put! table char #t))
-	      chars)
-    (hash-table/key-list table)))
-
-(define (chars->wfcp-list chars)
-  (let ((chars (sort chars char<?)))
-    (if (pair? chars)
-	(let loop
-	    ((pt (char->integer (car chars)))
-	     (chars (cdr chars))
-	     (items '()))
-	  (if (pair? chars)
-	      (let ((pt* (char->integer (car chars)))
-		    (chars (cdr chars)))
-		(if (fix:= pt* (fix:+ pt 1))
-		    (let find-max ((pt* pt*) (chars chars))
-		      (if (pair? chars)
-			  (let ((pt** (char->integer (car chars)))
-				(chars (cdr chars)))
-			    (if (fix:= pt** (fix:+ pt* 1))
-				(find-max pt** chars)
-				(loop pt** chars (cons (cons pt pt*) items))))
-			  (reverse! (cons (cons pt pt*) items))))
-		    (loop pt* chars (cons pt items))))
-	      (reverse! (cons pt items))))
-	'())))
-
-(define (alphabet->code-points alphabet)
-  (guarantee-alphabet alphabet 'ALPHABET->CODE-POINTS)
-  (append! (alphabet-low->code-points (alphabet-low alphabet))
-	   (alphabet-high->code-points (alphabet-high1 alphabet)
+(define (alphabet->scalar-values alphabet)
+  (guarantee-alphabet alphabet 'ALPHABET->SCALAR-VALUES)
+  (append! (alphabet-low->scalar-values (alphabet-low alphabet))
+	   (alphabet-high->scalar-values (alphabet-high1 alphabet)
 				       (alphabet-high2 alphabet))))
 
-(define (alphabet-low->code-points low)
+(define (alphabet-low->scalar-values low)
   (let find-lower ((i 0) (result '()))
     (if (fix:< i #x800)
 	(if (alphabet-low-ref low i)
@@ -397,15 +319,127 @@ Not used at the moment.
 	    (find-lower (fix:+ i 1) result))
 	(reverse! result))))
 
-(define (alphabet-high->code-points lower upper)
+(define (alphabet-high->scalar-values lower upper)
   (let ((n (vector-length lower)))
     (let loop ((i 0) (result '()))
       (if (fix:< i n)
 	  (loop (fix:+ i 1)
-		(cons (cons (vector-ref lower i) (vector-ref upper i))
+		(cons (if (fix:< (vector-ref lower i) (vector-ref upper i))
+			  (cons (vector-ref lower i) (vector-ref upper i))
+			  (vector-ref lower i))
 		      result))
 	  (reverse! result)))))
 
+(define (canonicalize-scalar-value-list items)
+  (if (pair? items)
+      (let ((items
+	     (sort items
+	       (lambda (a b)
+		 (let ((al (if (pair? a) (car a) a))
+		       (ah (if (pair? a) (cdr a) a))
+		       (bl (if (pair? b) (car b) b))
+		       (bh (if (pair? b) (cdr b) b)))
+		   (or (fix:< al bl)
+		       (and (fix:= al bl)
+			    (fix:< ah bh)))))))
+	    (make-item
+	     (lambda (l h)
+	       (if (fix:= l h)
+		   l
+		   (cons l h)))))
+	(let loop
+	    ((al (if (pair? (car items)) (caar items) (car items)))
+	     (ah (if (pair? (car items)) (cdar items) (car items)))
+	     (items (cdr items)))
+	  (if (pair? items)
+	      (let ((bl (if (pair? (car items)) (caar items) (car items)))
+		    (bh (if (pair? (car items)) (cdar items) (car items)))
+		    (items (cdr items)))
+		(if (fix:< (fix:+ ah 1) bl)
+		    (cons (make-item al ah)
+			  (loop bl bh items))
+		    (loop al (fix:max ah bh) items)))
+	      (list (make-item al ah)))))
+      items))
+
+(define (split-list items limit)
+  (let loop ((items items) (low '()))
+    (if (pair? items)
+	(let ((item (car items)))
+	  (cond ((not (pair? item))
+		 (if (fix:< item limit)
+		     (loop (cdr items) (cons item low))
+		     (values low items)))
+		((fix:< (cdr item) limit)
+		 (loop (cdr items) (cons item low)))
+		((fix:<= limit (car item))
+		 (values low items))
+		(else
+		 (values (cons (cons (car item) (fix:- limit 1)) low)
+			 (cons (cons limit (cdr item)) items)))))
+	(values low '()))))
+
+#|
+
+(define (test-canonicalize-scalar-value-list n-items n-iter)
+  (run-cpl-test n-items n-iter canonicalize-scalar-value-list))
+
+(define (test-alphabet->scalar-values n-items n-iter)
+  (run-cpl-test n-items n-iter
+		(lambda (cpl)
+		  (alphabet->scalar-values (scalar-values->alphabet cpl)))))
+
+(define (run-cpl-test n-items n-iter procedure)
+  (do ((i 0 (+ i 1))
+       (failures '()
+		 (let ((cpl (make-test-cpl n-items)))
+		   (guarantee-well-formed-scalar-value-list cpl)
+		   (let ((cpl* (procedure cpl)))
+		     (if (canonical-scalar-value-list? cpl*)
+			 failures
+			 (cons (cons cpl cpl*) failures))))))
+      ((not (< i n-iter))
+       (let ((n-failures (length failures)))
+	 (if (> n-failures 0)
+	     (begin
+	       (write-string "Got ")
+	       (write n-failures)
+	       (write-string " failure")
+	       (if (> n-failures 1)
+		   (write-string "s"))
+	       (write-string " out of ")
+	       (write n-iter)
+	       (newline)
+	       (pp failures)))))))
+
+(define (make-test-cpl n-items)
+  (make-initialized-list n-items
+    (lambda (i)
+      (let loop ()
+	(let ((n (random #x10000)))
+	  (if (unicode-scalar-value? n)
+	      (let ((m (random #x100)))
+		(if (fix:= m 0)
+		    n
+		    (if (unicode-scalar-value? (fix:+ n m))
+			(fix:+ n m)
+			(loop))))
+	      (loop)))))))
+
+(define (canonical-scalar-value-list? items)
+  (and (well-formed-scalar-value-list? items)
+       (if (pair? items)
+	   (let loop ((a (car items)) (items (cdr items)))
+	     (if (pair? items)
+		 (let ((b (car items))
+		       (items (cdr items)))
+		   (and (fix:< (fix:+ (if (pair? a) (cdr a) a) 1)
+			       (if (pair? b) (car b) b))
+			(loop b items)))
+		 #t))
+	   #t)))
+|#
+
 (define (8-bit-alphabet? alphabet)
   (and (fix:= (vector-length (alphabet-high1 alphabet)) 0)
        (let ((low (alphabet-low alphabet)))
@@ -443,7 +477,7 @@ Not used at the moment.
   (let loop ((i 0) (chars '()))
     (if (fix:< i #x100)
 	(loop (fix:+ i 1)
-	      (if (%code-point-in-alphabet? i alphabet)
+	      (if (%scalar-value-in-alphabet? i alphabet)
 		  (cons (integer->char i) chars)
 		  chars))
 	(apply string (reverse! chars)))))
@@ -581,12 +615,12 @@ Not used at the moment.
    (make-vector length
 		(if (if (default-object? char) #f char)
 		    (begin
-		      (guarantee-wide-char char 'MAKE-WIDE-STRING)
+		      (guarantee-unicode-char char 'MAKE-WIDE-STRING)
 		      char)
 		    (integer->char 0)))))
 
 (define (wide-string . chars)
-  (for-each (lambda (char) (guarantee-wide-char char 'WIDE-STRING)) chars)
+  (for-each (lambda (char) (guarantee-unicode-char char 'WIDE-STRING)) chars)
   (%make-wide-string (list->vector chars)))
 
 (define (wide-string-length string)
@@ -607,7 +641,7 @@ Not used at the moment.
 (define (wide-string-set! string index char)
   (guarantee-wide-string string 'WIDE-STRING-SET!)
   (guarantee-wide-string-index index string 'WIDE-STRING-SET!)
-  (guarantee-wide-char char 'WIDE-STRING-SET!)
+  (guarantee-unicode-char char 'WIDE-STRING-SET!)
   (%wide-string-set! string index char))
 
 (define-integrable (%wide-string-set! string index char)
@@ -957,7 +991,7 @@ Not used at the moment.
 	 #x10000))
 
 (define (split-into-utf16-surrogates n)
-  (guarantee-unicode-code-point n 'split-into-utf16-surrogates)
+  (guarantee-unicode-scalar-value n 'split-into-utf16-surrogates)
   (let ((n (fix:- n #x10000)))
     (values (fix:or (fix:and (fix:lsh n -10) #x03FF) #xD800)
 	    (fix:or (fix:and n #x03FF) #xDC00))))
diff --git a/src/xml/turtle.scm b/src/xml/turtle.scm
index c75a845a7..a6db82779 100644
--- a/src/xml/turtle.scm
+++ b/src/xml/turtle.scm
@@ -278,7 +278,7 @@ USA.
 		  char-set:turtle-digit))
 
 (define alphabet:name-start-char
-  (code-points->alphabet
+  (scalar-values->alphabet
    '((#x0041 . #x005A)
      #x005F
      (#x0061 . #x007A)
@@ -297,7 +297,7 @@ USA.
 
 (define alphabet:name-char
   (alphabet+ alphabet:name-start-char
-	     (code-points->alphabet
+	     (scalar-values->alphabet
 	      '(#x002D
 		(#x0030 . #x0039)
 		#x00B7
@@ -308,7 +308,7 @@ USA.
   (alphabet- alphabet:name-start-char (alphabet #\_)))
 
 (define alphabet:character
-  (code-points->alphabet '((#x20 . #x5B) (#x5D . #x10FFFF))))
+  (scalar-values->alphabet '((#x20 . #x5B) (#x5D . #x10FFFF))))
 
 (define alphabet:ucharacter
   (alphabet- alphabet:character (alphabet #\>)))
diff --git a/src/xml/xml-chars.scm b/src/xml/xml-chars.scm
index c0b7881be..0d06ae13c 100644
--- a/src/xml/xml-chars.scm
+++ b/src/xml/xml-chars.scm
@@ -28,7 +28,7 @@ USA.
 (declare (usual-integrations))
 
 (define alphabet:xml-base-char
-  (code-points->alphabet
+  (scalar-values->alphabet
    '((#x0041 . #x005A)
      (#x0061 . #x007A)
      (#x00C0 . #x00D6)
@@ -233,13 +233,13 @@ USA.
      (#xAC00 . #xD7A3))))
 
 (define alphabet:xml-ideographic
-  (code-points->alphabet
+  (scalar-values->alphabet
    '(#x3007
      (#x3021 . #x3029)
      (#x4E00 . #x9FA5))))
 
 (define alphabet:xml-combining-char
-  (code-points->alphabet
+  (scalar-values->alphabet
    '((#x0300 . #x0345)
      (#x0360 . #x0361)
      (#x0483 . #x0486)
@@ -337,7 +337,7 @@ USA.
      #x309A)))
 
 (define alphabet:xml-digit
-  (code-points->alphabet
+  (scalar-values->alphabet
    '((#x0030 . #x0039)
      (#x0660 . #x0669)
      (#x06F0 . #x06F9)
@@ -355,7 +355,7 @@ USA.
      (#x0F20 . #x0F29))))
 
 (define alphabet:xml-extender
-  (code-points->alphabet
+  (scalar-values->alphabet
    '(#x00B7
      #x02D0
      #x02D1
@@ -369,7 +369,7 @@ USA.
      (#x30FC . #x30FE))))
 
 (define alphabet:xml-char
-  (code-points->alphabet
+  (scalar-values->alphabet
    '(#x0009
      #x000A
      #x000D
diff --git a/src/xml/xml-output.scm b/src/xml/xml-output.scm
index 2d84bdc67..925348216 100644
--- a/src/xml/xml-output.scm
+++ b/src/xml/xml-output.scm
@@ -90,7 +90,7 @@ USA.
 
 (define (emit-string string ctx)
   (let ((port (ctx-port ctx)))
-    (for-each-wide-char string
+    (for-each-unicode-char string
       (lambda (char)
 	(write-char char port)))))
 
@@ -413,7 +413,7 @@ USA.
 
 (define (xml-string-columns string)
   (let ((n 0))
-    (for-each-wide-char string
+    (for-each-unicode-char string
       (lambda (char)
 	(set! n
 	      (fix:+ n
@@ -486,7 +486,7 @@ USA.
       (emit-char #\space ctx)))
 
 (define (write-escaped-string string escapes ctx)
-  (for-each-wide-char string
+  (for-each-unicode-char string
     (lambda (char)
       (cond ((assq char escapes)
 	     => (lambda (e)
@@ -499,7 +499,7 @@ USA.
 	    (else
 	     (emit-char char ctx))))))
 
-(define (for-each-wide-char string procedure)
+(define (for-each-unicode-char string procedure)
   (let ((port (open-utf8-input-string string)))
     (let loop ()
       (let ((char (read-char port)))
diff --git a/src/xml/xml-parser.scm b/src/xml/xml-parser.scm
index e441d72f7..7c22d30c6 100644
--- a/src/xml/xml-parser.scm
+++ b/src/xml/xml-parser.scm
@@ -691,7 +691,7 @@ USA.
   (let ((make-ref
 	 (lambda (s r p)
 	   (let ((n (string->number s r)))
-	     (if (not (unicode-code-point? n))
+	     (if (not (unicode-scalar-value? n))
 		 (perror p "Invalid code point" n))
 	     (let ((char (integer->char n)))
 	       (if (not (char-in-alphabet? char alphabet:xml-char))
diff --git a/src/xml/xml-struct.scm b/src/xml/xml-struct.scm
index 766693f40..e6db02d52 100644
--- a/src/xml/xml-struct.scm
+++ b/src/xml/xml-struct.scm
@@ -152,7 +152,7 @@ USA.
   (value canonicalize canonicalize-char-data))
 
 (define (xml-char-data? object)
-  (or (wide-char? object)
+  (or (unicode-char? object)
       (and (or (wide-string? object)
                (and (string? object)
                     (utf8-string-valid? object)))
@@ -166,7 +166,7 @@ USA.
                             'UTF-8))
 
 (define (canonicalize-char-data object)
-  (cond ((wide-char? object)
+  (cond ((unicode-char? object)
 	 (call-with-utf8-output-string
 	   (lambda (port)
 	     (write-char object port))))