Change "code point" to "scalar value" everywhere. Change "wide char"

author Chris Hanson <org/chris-hanson/cph>

Sun, 30 Aug 2009 07:08:31 +0000 (00:08 -0700)

committer Chris Hanson <org/chris-hanson/cph>

Sun, 30 Aug 2009 07:08:31 +0000 (00:08 -0700)
author Chris Hanson <org/chris-hanson/cph>
Sun, 30 Aug 2009 07:08:31 +0000 (00:08 -0700)
committer Chris Hanson <org/chris-hanson/cph>
Sun, 30 Aug 2009 07:08:31 +0000 (00:08 -0700)
diff --git a/doc/ref-manual/characters.texi b/doc/ref-manual/characters.texi

index a40f2b3639edbc604081a8470254f642975aee59..7938c9ed8832d683c18a4428746d628aad0779df 100644 (file)
--- a/doc/ref-manual/characters.texi
+++ b/doc/ref-manual/characters.texi
@@ -54,9 +54,9 @@ quote them.
  
  @findex #\U+
  In addition to the standard character syntax, MIT Scheme also supports a
-general syntax that denotes any Unicode character by its code point.
-This notation is @code{#\U+@var{code-point}}, where @var{code-point} is
-a sequence of hexadecimal digits for a valid code point.  So the above
+general syntax that denotes any Unicode character by its scalar value.
+This notation is @code{#\U+@var{scalar-value}}, where @var{scalar-value} is
+a sequence of hexadecimal digits for a valid scalar value.  So the above
  examples could also be written like this:
  
  @example
@@ -331,9 +331,9 @@ In fact, the Control bucky bit is completely orthogonal to the
  @code{#\C-SOH}.}
  
  MIT/GNU Scheme uses a 21-bit character code with 4 bucky bits.  The
-character code contains the Unicode code point for the character.  This
+character code contains the Unicode scalar value for the character.  This
  is a change from earlier versions of the system, which used the
-@acronym{ISO-8859-1} code point, but it is upwards compatible with
+@acronym{ISO-8859-1} scalar value, but it is upwards compatible with
  previous usage, since @acronym{ISO-8859-1} is a proper subset of
  Unicode.
  
@@ -395,7 +395,7 @@ example,
  @end example
  
  Note that in MIT/GNU Scheme, the value of @code{char-code} is the
-Unicode code point for @var{char}.
+Unicode scalar value for @var{char}.
  @end deffn
  
  @defvr variable char-code-limit
@@ -463,7 +463,7 @@ procedures is as follows:
  
  This implies that @code{char->integer} and @code{char-code} produce
  identical results for characters that have no bucky bits set, and that
-characters are ordered according to their Unicode code points.
+characters are ordered according to their Unicode scalar values.
  
  Note: If the argument to @code{char->integer} or @code{integer->char} is
  a constant, the compiler will constant-fold the call, replacing it with
@@ -475,7 +475,7 @@ character constants or @acronym{ASCII} codes.
  The range of @code{char->integer} is defined to be the exact
  non-negative integers that are less than the value of this variable
  (exclusive).  Note, however, that there are some holes in this range,
-because the character code must be a valid Unicode code point.
+because the character code must be a valid Unicode scalar value.
  @end defvr
  
  @node ISO-8859-1 Characters, Character Sets, Internal Representation of Characters, Characters
@@ -691,24 +691,24 @@ parser (@pxref{XML Support}) implementation.  @acronym{XML} uses
  Unicode as its base character set, and any @acronym{XML}
  implementation @emph{must} support Unicode.
  
-@cindex Code point, Unicode
-@cindex Wide character
-@cindex Character, wide
-The basic unit in a Unicode implementation is the @dfn{code point}.  The
-character equivalent of a code point is a @dfn{wide character}.
+@cindex Scalar value, Unicode
+@cindex Unicode character
+@cindex Character, Unicode
+The basic unit in a Unicode implementation is the @dfn{scalar value}.  The
+character equivalent of a scalar value is a @dfn{Unicode character}.
  
-@deffn procedure unicode-code-point? object
-Returns @code{#t} if @var{object} is a Unicode code point.  Code points
-are implemented as exact non-negative integers.  They are further
+@deffn procedure unicode-scalar-value? object
+Returns @code{#t} if @var{object} is a Unicode scalar value.  Scalar
+values are implemented as exact non-negative integers.  They are further
  limited, by the Unicode standard, to be strictly less than
  @code{#x110000}, with the values @code{#xD800} through @code{#xDFFF},
  @code{#xFFFE}, and @code{#xFFFF} excluded.
  @end deffn
  
-@deffn procedure wide-char? object
-Returns @code{#t} if @var{object} is a wide character, specifically if
-@var{object} is a character with no bucky bits and whose code satisfies
-@code{unicode-code-point?}.
+@deffn procedure unicode-char? object
+Returns @code{#t} if @var{object} is a Unicode character, specifically
+if @var{object} is a character with no bucky bits and whose code
+satisfies @code{unicode-scalar-value?}.
  @end deffn
  
  The Unicode implementation consists of three parts:
@@ -725,7 +725,7 @@ several external representations, specifically @acronym{UTF-8},
  
  @item
  An @dfn{alphabet} abstraction, which is an efficient implementation of
-sets of Unicode code points (similar to the @code{char-set}
+sets of Unicode scalar values (similar to the @code{char-set}
  abstraction).
  @end itemize
  
@@ -749,13 +749,13 @@ constant-time access to each character in the string.
  Returns @code{#t} if @var{object} is a wide string.
  @end deffn
  
-@deffn procedure make-wide-string k [wide-char]
+@deffn procedure make-wide-string k [unicode-char]
  Returns a newly allocated wide string of length @var{k}.  If @var{char}
  is specified, all elements of the returned string are initialized to
  @var{char}; otherwise the contents of the string are unspecified.
  @end deffn
  
-@deffn procedure wide-string wide-char @dots{}
+@deffn procedure wide-string unicode-char @dots{}
  Returns a newly allocated wide string consisting of the specified
  characters.
  @end deffn
@@ -770,7 +770,7 @@ Returns character @var{k} of @var{wide-string}.  @var{K} must be a valid
  index of @var{string}.
  @end deffn
  
-@deffn procedure wide-string-set! wide-string k wide-char
+@deffn procedure wide-string-set! wide-string k unicode-char
  Stores @var{char} in element @var{k} of @var{wide-string} and returns an
  unspecified value.  @var{K} must be a valid index of @var{wide-string}.
  @end deffn
@@ -804,7 +804,7 @@ be used to specify that the port delivers characters from a substring of
  @end deffn
  
  @deffn procedure open-wide-output-string
-Returns an output port that accepts wide characters and strings and
+Returns an output port that accepts Unicode characters and strings and
  accumulates them in a buffer.  Call @code{get-output-string} on the
  returned port to get a wide string containing the accumulated
  characters.
@@ -915,37 +915,38 @@ and @var{string}'s length, respectively.
  Applications often need to manipulate sets of characters, such as the
  set of alphabetic characters or the set of whitespace characters.  The
  @dfn{alphabet} abstraction provides an efficient implementation of
-sets of Unicode code points.
+sets of Unicode scalar values.
  
  @deffn procedure alphabet? object
  Returns @code{#t} if @var{object} is a Unicode alphabet, otherwise
  returns @code{#f}.
  @end deffn
  
-@deffn procedure alphabet wide-char @dots{}
-Returns a Unicode alphabet containing the wide characters passed as
+@deffn procedure alphabet unicode-char @dots{}
+Returns a Unicode alphabet containing the Unicode characters passed as
  arguments.
  @end deffn
  
-@deffn procedure code-points->alphabet items
-Returns a Unicode alphabet containing the code points described by
+@deffn procedure scalar-values->alphabet items
+Returns a Unicode alphabet containing the scalar values described by
  @var{items}.  @var{Items} must satisfy
-@code{well-formed-code-points-list?}.
+@code{well-formed-scalar-values-list?}.
  @end deffn
  
-@deffn procedure alphabet->code-points alphabet
-Returns a well-formed code-points list that describes the code points
+@deffn procedure alphabet->scalar-values alphabet
+Returns a well-formed scalar-values list that describes the scalar values
  represented by @var{alphabet}.
  @end deffn
  
-@deffn procedure well-formed-code-points-list? object
-Returns @code{#t} if @var{object} is a well-formed code-points list,
-otherwise returns @code{#f}.  A well-formed code-points list is a
-proper list, each element of which is either a code point or a pair of
-code points.  A pair of code points represents a contiguous range of
-code points.  The @sc{car} of the pair is the lower limit, and the
-@sc{cdr} is the upper limit.  Both limits are inclusive, and the lower
-limit must be strictly less than the upper limit.
+@deffn procedure well-formed-scalar-values-list? object
+Returns @code{#t} if @var{object} is a well-formed scalar-values list,
+otherwise returns @code{#f}.  A well-formed scalar-values list is a
+proper list, each element of which is either a unicode scalar value or a
+pair of unicode scalar values.  A pair of scalar values represents a
+contiguous range of scalar values.  The @sc{car} of the pair is the
+lower limit, and the @sc{cdr} is the upper limit.  Both limits are
+inclusive, and the lower limit must be less than or equal to the upper
+limit.
  @end deffn
  
  @deffn procedure char-in-alphabet? char alphabet
@@ -954,23 +955,23 @@ otherwise returns @code{#f}.
  @end deffn
  
  Character sets and alphabets can be converted to one another, provided
-that the alphabet contains only 8-bit code points.  This is true
-because 8-bit code points in Unicode map directly to
+that the alphabet contains only 8-bit scalar values.  This is true
+because 8-bit scalar values in Unicode map directly to
  @acronym{ISO-8859-1} characters, which is what character sets contain.
  
  @deffn procedure char-set->alphabet char-set
-Returns a Unicode alphabet containing the code points that correspond
+Returns a Unicode alphabet containing the scalar values that correspond
  to characters that are members of @var{char-set}.
  @end deffn
  
  @deffn procedure alphabet->char-set alphabet
  Returns a character set containing the characters that correspond to
-8-bit code points that are members of @var{alphabet}.  (Code points
+8-bit scalar values that are members of @var{alphabet}.  (Scalar values
  outside the 8-bit range are ignored.)
  @end deffn
  
  @deffn procedure string->alphabet string
-Returns a Unicode alphabet containing the code points corresponding to
+Returns a Unicode alphabet containing the scalar values corresponding to
  the characters in @var{string}.  Equivalent to
  
  @example
@@ -979,22 +980,22 @@ the characters in @var{string}.  Equivalent to
  @end deffn
  
  @deffn procedure alphabet->string alphabet
-Returns a newly-allocated string containing the characters
-corresponding to the 8-bit code points in @var{alphabet}.  (Code
-points outside the 8-bit range are ignored.)
+Returns a newly-allocated string containing the characters corresponding
+to the 8-bit scalar values in @var{alphabet}.  (Scalar values outside
+the 8-bit range are ignored.)
  @end deffn
  
  @deffn procedure 8-bit-alphabet? alphabet
-Returns @code{#t} if @var{alphabet} contains only 8-bit code points,
+Returns @code{#t} if @var{alphabet} contains only 8-bit scalar values,
  otherwise returns @code{#f}.
  @end deffn
  
  @deffn procedure alphabet+ alphabet @dots{}
-Returns a Unicode alphabet that contains each code point that is a
+Returns a Unicode alphabet that contains each scalar value that is a
  member of any of the @var{alphabet} arguments.
  @end deffn
  
  @deffn procedure alphabet- alphabet1 alphabet2
-Returns a Unicode alphabet that contains each code point that is a
+Returns a Unicode alphabet that contains each scalar value that is a
  member of @var{alphabet1} and is not a member of @var{alphabet2}.
  @end deffn
diff --git a/src/runtime/genio.scm b/src/runtime/genio.scm

index 6a8792efedba273842e1bfd6d5464ec6e353da2b..79c12fbb76c93c76c204f3a05e3afe081db22fef 100644 (file)
--- a/src/runtime/genio.scm
+++ b/src/runtime/genio.scm
@@ -1857,7 +1857,7 @@ USA.
                      (* (get-byte bv bs 1) #x10000)
                      (* (get-byte bv bs 2) #x100)
                      (get-byte bv bs 3))))
-            (if (unicode-code-point? cp)
+            (if (unicode-scalar-value? cp)
                  (begin
                    (set-input-buffer-start! ib (fix:+ bs 4))
                    cp)
@@ -1873,7 +1873,7 @@ USA.
                      (* (get-byte bv bs 2) #x10000)
                      (* (get-byte bv bs 1) #x100)
                      (get-byte bv bs 0))))
-            (if (unicode-code-point? cp)
+            (if (unicode-scalar-value? cp)
                  (begin
                    (set-input-buffer-start! ib (fix:+ bs 4))
                    cp)
diff --git a/src/runtime/runtime.pkg b/src/runtime/runtime.pkg

index 276e0f79ee0d933ed06f72dc4d397a5cbbb73649..95907d0454649e5a9c12bd983560cf5f472e991c 100644 (file)
--- a/src/runtime/runtime.pkg
+++ b/src/runtime/runtime.pkg
@@ -4910,6 +4910,9 @@ USA.
    (files "unicode")
    (parent (runtime))
    (export ()
+         (error:not-wide-char error:not-unicode-char)
+         (guarantee-wide-char guarantee-unicode-char)
+         (wide-char? unicode-char?)
           (wide-string->utf16-be-string string->utf16-be-string)
           (wide-string->utf16-le-string string->utf16-le-string)
           (wide-string->utf16-string string->utf16-string)
@@ -4924,7 +4927,7 @@ USA.
           alphabet+
           alphabet-
           alphabet->char-set
-         alphabet->code-points
+         alphabet->scalar-values
           alphabet->string
           alphabet-predicate
           alphabet?
@@ -4944,11 +4947,11 @@ USA.
           call-with-utf8-output-string
           char-in-alphabet?
           char-set->alphabet
-         code-points->alphabet
           combine-utf16-surrogates
           error:not-8-bit-alphabet
           error:not-alphabet
-         error:not-unicode-code-point
+         error:not-unicode-char
+         error:not-unicode-scalar-value
           error:not-utf16-be-string
           error:not-utf16-high-surrogate
           error:not-utf16-le-string
@@ -4958,14 +4961,14 @@ USA.
           error:not-utf32-le-string
           error:not-utf32-string
           error:not-utf8-string
-         error:not-well-formed-code-point-list
-         error:not-wide-char
+         error:not-well-formed-scalar-value-list
           error:not-wide-string
           for-all-chars-in-string?
           for-any-char-in-string?
           guarantee-8-bit-alphabet
           guarantee-alphabet
-         guarantee-unicode-code-point
+         guarantee-unicode-char
+         guarantee-unicode-scalar-value
           guarantee-utf16-be-string
           guarantee-utf16-high-surrogate
           guarantee-utf16-le-string
@@ -4975,8 +4978,7 @@ USA.
           guarantee-utf32-le-string
           guarantee-utf32-string
           guarantee-utf8-string
-         guarantee-well-formed-code-point-list
-         guarantee-wide-char
+         guarantee-well-formed-scalar-value-list
           guarantee-wide-string
           guarantee-wide-string-index
           guarantee-wide-substring
@@ -4995,6 +4997,7 @@ USA.
           open-utf32-output-string
           open-utf8-input-string
           open-utf8-output-string
+         scalar-values->alphabet
           split-into-utf16-surrogates
           string->alphabet
           string->utf16-be-string
@@ -5006,7 +5009,8 @@ USA.
           string->utf8-string
           string->utf8-string
           string->wide-string
-         unicode-code-point?
+         unicode-char?
+         unicode-scalar-value?
           utf16-be-string->wide-string
           utf16-be-string-length
           utf16-be-string-valid?
@@ -5038,8 +5042,7 @@ USA.
           utf8-string-length
           utf8-string-valid?
           utf8-string?
-         well-formed-code-point-list?
-         wide-char?
+         well-formed-scalar-value-list?
           wide-string
           wide-string->string
           wide-string-index?
diff --git a/src/runtime/string.scm b/src/runtime/string.scm

index ad8a94bf2a4ea24f9732fa847a40af5914dc14de..9114ac238ca612e5ed20b7c93d1276343c329743 100644 (file)
--- a/src/runtime/string.scm
+++ b/src/runtime/string.scm
@@ -203,7 +203,8 @@ USA.
    (cond ((string? object) (string->utf8-string object))
         ((symbol? object) (symbol-name object))
         ((wide-string? object) (wide-string->utf8-string object))
-       ((wide-char? object) (wide-string->utf8-string (wide-string object)))
+       ((unicode-char? object)
+        (wide-string->utf8-string (wide-string object)))
         (else (%->string object 'UTF8-STRING))))
  
  (define (%->string object caller)
diff --git a/src/runtime/unicode.scm b/src/runtime/unicode.scm

index 289d36bc58786847a7bd5d429168dd6aaed0a7d4..6764160a2b167a1033d435be9c0dbde5bdbe4367 100644 (file)
--- a/src/runtime/unicode.scm
+++ b/src/runtime/unicode.scm
@@ -162,21 +162,18 @@ USA.
  \f
  ;;;; Unicode characters
  
-(define (wide-char? object)
+(define (unicode-char? object)
    (and (char? object)
         (legal-code-32? (char->integer object))))
  
-(define-guarantee wide-char "a Unicode character")
+(define-guarantee unicode-char "a Unicode character")
  
-(define (unicode-code-point? object)
-  (and (%unicode-code-point? object)
-       (not (illegal? object))))
-
-(define (%unicode-code-point? object)
+(define (unicode-scalar-value? object)
    (and (index-fixnum? object)
-       (fix:< object char-code-limit)))
+       (fix:< object char-code-limit)
+       (not (illegal? object))))
  
-(define-guarantee unicode-code-point "a Unicode code point")
+(define-guarantee unicode-scalar-value "a Unicode scalar value")
  
  (define-integrable (legal-code-32? pt)
    (and (fix:< pt char-code-limit)
@@ -213,26 +210,26 @@ Not used at the moment.
  (define-integrable (make-alphabet-low)
    (make-string #x100 (integer->char 0)))
  
-(define-integrable (alphabet-low-ref low code-point)
-  (not (fix:= (fix:and (vector-8b-ref low (fix:lsh code-point -3))
-                      (fix:lsh 1 (fix:and code-point 7)))
+(define-integrable (alphabet-low-ref low scalar-value)
+  (not (fix:= (fix:and (vector-8b-ref low (fix:lsh scalar-value -3))
+                      (fix:lsh 1 (fix:and scalar-value 7)))
               0)))
  
-(define-integrable (alphabet-low-set! low code-point)
+(define-integrable (alphabet-low-set! low scalar-value)
    (vector-8b-set! low
-                 (fix:lsh code-point -3)
-                 (fix:or (vector-8b-ref low (fix:lsh code-point -3))
-                         (fix:lsh 1 (fix:and code-point 7)))))
+                 (fix:lsh scalar-value -3)
+                 (fix:or (vector-8b-ref low (fix:lsh scalar-value -3))
+                         (fix:lsh 1 (fix:and scalar-value 7)))))
  
  (define null-alphabet
    (make-alphabet (make-alphabet-low) '#() '#()))
  
  (define (char-in-alphabet? char alphabet)
-  (guarantee-wide-char char 'CHAR-IN-ALPHABET?)
+  (guarantee-unicode-char char 'CHAR-IN-ALPHABET?)
    (guarantee-alphabet alphabet 'CHAR-IN-ALPHABET?)
-  (%code-point-in-alphabet? (char-code char) alphabet))
+  (%scalar-value-in-alphabet? (char-code char) alphabet))
  
-(define (%code-point-in-alphabet? pt alphabet)
+(define (%scalar-value-in-alphabet? pt alphabet)
    (if (fix:< pt #x800)
        (alphabet-low-ref (alphabet-low alphabet) pt)
        (let ((high1 (alphabet-high1 alphabet))
@@ -246,36 +243,31 @@ Not used at the moment.
                         (loop (fix:+ index 1) upper))
                        (else #t))))))))
  
-(define (well-formed-code-point-list? items)
-  (if (pair? items)
-      (and (well-formed-item? (car items))
-          (let loop ((a (car items)) (items (cdr items)))
-            (if (pair? items)
-                (let ((b (car items))
-                      (items (cdr items)))
-                  (and (well-formed-item? b)
-                       (fix:< (if (pair? a) (cdr a) a)
-                              (if (pair? b) (car b) b))
-                       (loop b items)))
-                (null? items))))
-      (null? items)))
+(define (well-formed-scalar-value-list? items)
+  (list-of-type? items well-formed-item?))
  
  (define (well-formed-item? item)
    (if (pair? item)
-      (and (%unicode-code-point? (car item))
-          (%unicode-code-point? (cdr item))
-          (fix:< (car item) (cdr item)))
-      (%unicode-code-point? item)))
+      (and (unicode-scalar-value? (car item))
+          (unicode-scalar-value? (cdr item))
+          (fix:<= (car item) (cdr item)))
+      (unicode-scalar-value? item)))
  
-(define-guarantee well-formed-code-point-list "a Unicode code-point list")
+(define-guarantee well-formed-scalar-value-list "a Unicode scalar-value list")
  \f
-(define (code-points->alphabet items)
-  (guarantee-well-formed-code-point-list items 'CODE-POINTS->ALPHABET)
-  (%code-points->alphabet items))
+(define (scalar-values->alphabet items)
+  (guarantee-well-formed-scalar-value-list items 'SCALAR-VALUES->ALPHABET)
+  (%scalar-values->alphabet items))
+
+(define (alphabet . chars)
+  (for-each (lambda (char)
+             (guarantee-unicode-char char 'ALPHABET))
+           chars)
+  (%scalar-values->alphabet (map char->integer chars)))
  
-(define (%code-points->alphabet items)
+(define (%scalar-values->alphabet items)
    (receive (low-items high-items)
-      (split-list (canonicalize-code-point-list items) #x800)
+      (split-list (canonicalize-scalar-value-list items) #x800)
      (let ((low (make-alphabet-low)))
        (for-each (lambda (item)
                   (if (pair? item)
@@ -299,83 +291,13 @@ Not used at the moment.
                   (vector-set! high2 i (car items)))))
           (make-alphabet low high1 high2))))))
  
-(define (canonicalize-code-point-list items)
-  (if (pair? items)
-      (let ((a (car items)))
-       (let loop
-           ((al (if (pair? a) (car a) a))
-            (ah (if (pair? a) (cdr a) a))
-            (items (cdr items)))
-         (if (pair? items)
-             (let ((b (car items))
-                   (items (cdr items)))
-               (let ((bl (if (pair? b) (car b) b))
-                     (bh (if (pair? b) (cdr b) b)))
-                 (if (fix:= (fix:+ ah 1) bl)
-                     (loop al bh items)
-                     (cons (if (fix:= al ah) al (cons al ah))
-                           (loop bl bh items)))))
-             (list (if (fix:= al ah) al (cons al ah))))))
-      '()))
-
-(define (split-list items limit)
-  (let loop ((items items) (low '()))
-    (if (pair? items)
-       (let ((item (car items)))
-         (cond ((not (pair? item))
-                (if (fix:< item limit)
-                    (loop (cdr items) (cons item low))
-                    (values low items)))
-               ((fix:< (cdr item) limit)
-                (loop (cdr items) (cons item low)))
-               ((fix:<= limit (car item))
-                (values low items))
-               (else
-                (values (cons (cons (car item) (fix:- limit 1)) low)
-                        (cons (cons limit (cdr item)) items)))))
-       (values low '()))))
-\f
-(define (alphabet . chars)
-  (%code-points->alphabet (chars->wfcp-list (remove-duplicate-chars chars))))
-
-(define (remove-duplicate-chars chars)
-  (let ((table (make-eq-hash-table)))
-    (for-each (lambda (char)
-               (guarantee-wide-char char 'REMOVE-DUPLICATE-CHARS)
-               (hash-table/put! table char #t))
-             chars)
-    (hash-table/key-list table)))
-
-(define (chars->wfcp-list chars)
-  (let ((chars (sort chars char<?)))
-    (if (pair? chars)
-       (let loop
-           ((pt (char->integer (car chars)))
-            (chars (cdr chars))
-            (items '()))
-         (if (pair? chars)
-             (let ((pt* (char->integer (car chars)))
-                   (chars (cdr chars)))
-               (if (fix:= pt* (fix:+ pt 1))
-                   (let find-max ((pt* pt*) (chars chars))
-                     (if (pair? chars)
-                         (let ((pt** (char->integer (car chars)))
-                               (chars (cdr chars)))
-                           (if (fix:= pt** (fix:+ pt* 1))
-                               (find-max pt** chars)
-                               (loop pt** chars (cons (cons pt pt*) items))))
-                         (reverse! (cons (cons pt pt*) items))))
-                   (loop pt* chars (cons pt items))))
-             (reverse! (cons pt items))))
-       '())))
-
-(define (alphabet->code-points alphabet)
-  (guarantee-alphabet alphabet 'ALPHABET->CODE-POINTS)
-  (append! (alphabet-low->code-points (alphabet-low alphabet))
-          (alphabet-high->code-points (alphabet-high1 alphabet)
+(define (alphabet->scalar-values alphabet)
+  (guarantee-alphabet alphabet 'ALPHABET->SCALAR-VALUES)
+  (append! (alphabet-low->scalar-values (alphabet-low alphabet))
+          (alphabet-high->scalar-values (alphabet-high1 alphabet)
                                        (alphabet-high2 alphabet))))
  
-(define (alphabet-low->code-points low)
+(define (alphabet-low->scalar-values low)
    (let find-lower ((i 0) (result '()))
      (if (fix:< i #x800)
         (if (alphabet-low-ref low i)
@@ -397,15 +319,127 @@ Not used at the moment.
             (find-lower (fix:+ i 1) result))
         (reverse! result))))
  
-(define (alphabet-high->code-points lower upper)
+(define (alphabet-high->scalar-values lower upper)
    (let ((n (vector-length lower)))
      (let loop ((i 0) (result '()))
        (if (fix:< i n)
           (loop (fix:+ i 1)
-               (cons (cons (vector-ref lower i) (vector-ref upper i))
+               (cons (if (fix:< (vector-ref lower i) (vector-ref upper i))
+                         (cons (vector-ref lower i) (vector-ref upper i))
+                         (vector-ref lower i))
                       result))
           (reverse! result)))))
  \f
+(define (canonicalize-scalar-value-list items)
+  (if (pair? items)
+      (let ((items
+            (sort items
+              (lambda (a b)
+                (let ((al (if (pair? a) (car a) a))
+                      (ah (if (pair? a) (cdr a) a))
+                      (bl (if (pair? b) (car b) b))
+                      (bh (if (pair? b) (cdr b) b)))
+                  (or (fix:< al bl)
+                      (and (fix:= al bl)
+                           (fix:< ah bh)))))))
+           (make-item
+            (lambda (l h)
+              (if (fix:= l h)
+                  l
+                  (cons l h)))))
+       (let loop
+           ((al (if (pair? (car items)) (caar items) (car items)))
+            (ah (if (pair? (car items)) (cdar items) (car items)))
+            (items (cdr items)))
+         (if (pair? items)
+             (let ((bl (if (pair? (car items)) (caar items) (car items)))
+                   (bh (if (pair? (car items)) (cdar items) (car items)))
+                   (items (cdr items)))
+               (if (fix:< (fix:+ ah 1) bl)
+                   (cons (make-item al ah)
+                         (loop bl bh items))
+                   (loop al (fix:max ah bh) items)))
+             (list (make-item al ah)))))
+      items))
+
+(define (split-list items limit)
+  (let loop ((items items) (low '()))
+    (if (pair? items)
+       (let ((item (car items)))
+         (cond ((not (pair? item))
+                (if (fix:< item limit)
+                    (loop (cdr items) (cons item low))
+                    (values low items)))
+               ((fix:< (cdr item) limit)
+                (loop (cdr items) (cons item low)))
+               ((fix:<= limit (car item))
+                (values low items))
+               (else
+                (values (cons (cons (car item) (fix:- limit 1)) low)
+                        (cons (cons limit (cdr item)) items)))))
+       (values low '()))))
+\f
+#|
+
+(define (test-canonicalize-scalar-value-list n-items n-iter)
+  (run-cpl-test n-items n-iter canonicalize-scalar-value-list))
+
+(define (test-alphabet->scalar-values n-items n-iter)
+  (run-cpl-test n-items n-iter
+               (lambda (cpl)
+                 (alphabet->scalar-values (scalar-values->alphabet cpl)))))
+
+(define (run-cpl-test n-items n-iter procedure)
+  (do ((i 0 (+ i 1))
+       (failures '()
+                (let ((cpl (make-test-cpl n-items)))
+                  (guarantee-well-formed-scalar-value-list cpl)
+                  (let ((cpl* (procedure cpl)))
+                    (if (canonical-scalar-value-list? cpl*)
+                        failures
+                        (cons (cons cpl cpl*) failures))))))
+      ((not (< i n-iter))
+       (let ((n-failures (length failures)))
+        (if (> n-failures 0)
+            (begin
+              (write-string "Got ")
+              (write n-failures)
+              (write-string " failure")
+              (if (> n-failures 1)
+                  (write-string "s"))
+              (write-string " out of ")
+              (write n-iter)
+              (newline)
+              (pp failures)))))))
+
+(define (make-test-cpl n-items)
+  (make-initialized-list n-items
+    (lambda (i)
+      (let loop ()
+       (let ((n (random #x10000)))
+         (if (unicode-scalar-value? n)
+             (let ((m (random #x100)))
+               (if (fix:= m 0)
+                   n
+                   (if (unicode-scalar-value? (fix:+ n m))
+                       (fix:+ n m)
+                       (loop))))
+             (loop)))))))
+
+(define (canonical-scalar-value-list? items)
+  (and (well-formed-scalar-value-list? items)
+       (if (pair? items)
+          (let loop ((a (car items)) (items (cdr items)))
+            (if (pair? items)
+                (let ((b (car items))
+                      (items (cdr items)))
+                  (and (fix:< (fix:+ (if (pair? a) (cdr a) a) 1)
+                              (if (pair? b) (car b) b))
+                       (loop b items)))
+                #t))
+          #t)))
+|#
+\f
  (define (8-bit-alphabet? alphabet)
    (and (fix:= (vector-length (alphabet-high1 alphabet)) 0)
         (let ((low (alphabet-low alphabet)))
@@ -443,7 +477,7 @@ Not used at the moment.
    (let loop ((i 0) (chars '()))
      (if (fix:< i #x100)
         (loop (fix:+ i 1)
-             (if (%code-point-in-alphabet? i alphabet)
+             (if (%scalar-value-in-alphabet? i alphabet)
                   (cons (integer->char i) chars)
                   chars))
         (apply string (reverse! chars)))))
@@ -581,12 +615,12 @@ Not used at the moment.
     (make-vector length
                 (if (if (default-object? char) #f char)
                     (begin
-                     (guarantee-wide-char char 'MAKE-WIDE-STRING)
+                     (guarantee-unicode-char char 'MAKE-WIDE-STRING)
                       char)
                     (integer->char 0)))))
  
  (define (wide-string . chars)
-  (for-each (lambda (char) (guarantee-wide-char char 'WIDE-STRING)) chars)
+  (for-each (lambda (char) (guarantee-unicode-char char 'WIDE-STRING)) chars)
    (%make-wide-string (list->vector chars)))
  
  (define (wide-string-length string)
@@ -607,7 +641,7 @@ Not used at the moment.
  (define (wide-string-set! string index char)
    (guarantee-wide-string string 'WIDE-STRING-SET!)
    (guarantee-wide-string-index index string 'WIDE-STRING-SET!)
-  (guarantee-wide-char char 'WIDE-STRING-SET!)
+  (guarantee-unicode-char char 'WIDE-STRING-SET!)
    (%wide-string-set! string index char))
  
  (define-integrable (%wide-string-set! string index char)
@@ -957,7 +991,7 @@ Not used at the moment.
          #x10000))
  
  (define (split-into-utf16-surrogates n)
-  (guarantee-unicode-code-point n 'split-into-utf16-surrogates)
+  (guarantee-unicode-scalar-value n 'split-into-utf16-surrogates)
    (let ((n (fix:- n #x10000)))
      (values (fix:or (fix:and (fix:lsh n -10) #x03FF) #xD800)
             (fix:or (fix:and n #x03FF) #xDC00))))
diff --git a/src/xml/turtle.scm b/src/xml/turtle.scm

index c75a845a70d741f63802fad9e83e0af7a1f34dcd..a6db82779bcea72876f42346cf95eaf5557b95ad 100644 (file)
--- a/src/xml/turtle.scm
+++ b/src/xml/turtle.scm
@@ -278,7 +278,7 @@ USA.
                   char-set:turtle-digit))
  
  (define alphabet:name-start-char
-  (code-points->alphabet
+  (scalar-values->alphabet
     '((#x0041 . #x005A)
       #x005F
       (#x0061 . #x007A)
@@ -297,7 +297,7 @@ USA.
  
  (define alphabet:name-char
    (alphabet+ alphabet:name-start-char
-            (code-points->alphabet
+            (scalar-values->alphabet
               '(#x002D
                 (#x0030 . #x0039)
                 #x00B7
@@ -308,7 +308,7 @@ USA.
    (alphabet- alphabet:name-start-char (alphabet #\_)))
  
  (define alphabet:character
-  (code-points->alphabet '((#x20 . #x5B) (#x5D . #x10FFFF))))
+  (scalar-values->alphabet '((#x20 . #x5B) (#x5D . #x10FFFF))))
  
  (define alphabet:ucharacter
    (alphabet- alphabet:character (alphabet #\>)))
diff --git a/src/xml/xml-chars.scm b/src/xml/xml-chars.scm

index c0b7881be0a776bf1cfc163d337b1c47dd67a61b..0d06ae13c6cbe859e5d1f4f15bd52b65e143d910 100644 (file)
--- a/src/xml/xml-chars.scm
+++ b/src/xml/xml-chars.scm
@@ -28,7 +28,7 @@ USA.
  (declare (usual-integrations))
  
  (define alphabet:xml-base-char
-  (code-points->alphabet
+  (scalar-values->alphabet
     '((#x0041 . #x005A)
       (#x0061 . #x007A)
       (#x00C0 . #x00D6)
@@ -233,13 +233,13 @@ USA.
       (#xAC00 . #xD7A3))))
  
  (define alphabet:xml-ideographic
-  (code-points->alphabet
+  (scalar-values->alphabet
     '(#x3007
       (#x3021 . #x3029)
       (#x4E00 . #x9FA5))))
  
  (define alphabet:xml-combining-char
-  (code-points->alphabet
+  (scalar-values->alphabet
     '((#x0300 . #x0345)
       (#x0360 . #x0361)
       (#x0483 . #x0486)
@@ -337,7 +337,7 @@ USA.
       #x309A)))
  
  (define alphabet:xml-digit
-  (code-points->alphabet
+  (scalar-values->alphabet
     '((#x0030 . #x0039)
       (#x0660 . #x0669)
       (#x06F0 . #x06F9)
@@ -355,7 +355,7 @@ USA.
       (#x0F20 . #x0F29))))
  
  (define alphabet:xml-extender
-  (code-points->alphabet
+  (scalar-values->alphabet
     '(#x00B7
       #x02D0
       #x02D1
@@ -369,7 +369,7 @@ USA.
       (#x30FC . #x30FE))))
  
  (define alphabet:xml-char
-  (code-points->alphabet
+  (scalar-values->alphabet
     '(#x0009
       #x000A
       #x000D
diff --git a/src/xml/xml-output.scm b/src/xml/xml-output.scm

index 2d84bdc671163d780616ec2c2e4fc82fe335a2a3..925348216dc8562d0c0cf5357845d0b5a0fc3a05 100644 (file)
--- a/src/xml/xml-output.scm
+++ b/src/xml/xml-output.scm
@@ -90,7 +90,7 @@ USA.
  
  (define (emit-string string ctx)
    (let ((port (ctx-port ctx)))
-    (for-each-wide-char string
+    (for-each-unicode-char string
        (lambda (char)
         (write-char char port)))))
  
@@ -413,7 +413,7 @@ USA.
  
  (define (xml-string-columns string)
    (let ((n 0))
-    (for-each-wide-char string
+    (for-each-unicode-char string
        (lambda (char)
         (set! n
               (fix:+ n
@@ -486,7 +486,7 @@ USA.
        (emit-char #\space ctx)))
  
  (define (write-escaped-string string escapes ctx)
-  (for-each-wide-char string
+  (for-each-unicode-char string
      (lambda (char)
        (cond ((assq char escapes)
              => (lambda (e)
@@ -499,7 +499,7 @@ USA.
             (else
              (emit-char char ctx))))))
  
-(define (for-each-wide-char string procedure)
+(define (for-each-unicode-char string procedure)
    (let ((port (open-utf8-input-string string)))
      (let loop ()
        (let ((char (read-char port)))
diff --git a/src/xml/xml-parser.scm b/src/xml/xml-parser.scm

index e441d72f7c74daa7f1a407abb3457880211c70e7..7c22d30c6f0d42d2e2cf79be83ac14c1165c8808 100644 (file)
--- a/src/xml/xml-parser.scm
+++ b/src/xml/xml-parser.scm
@@ -691,7 +691,7 @@ USA.
    (let ((make-ref
          (lambda (s r p)
            (let ((n (string->number s r)))
-            (if (not (unicode-code-point? n))
+            (if (not (unicode-scalar-value? n))
                  (perror p "Invalid code point" n))
              (let ((char (integer->char n)))
                (if (not (char-in-alphabet? char alphabet:xml-char))
diff --git a/src/xml/xml-struct.scm b/src/xml/xml-struct.scm

index 766693f40ba4bce5621501029d6f5302c28b7075..e6db02d5217dfaefe8f5aa2eec325fd7dc269ec5 100644 (file)
--- a/src/xml/xml-struct.scm
+++ b/src/xml/xml-struct.scm
@@ -152,7 +152,7 @@ USA.
    (value canonicalize canonicalize-char-data))
  
  (define (xml-char-data? object)
-  (or (wide-char? object)
+  (or (unicode-char? object)
        (and (or (wide-string? object)
                 (and (string? object)
                      (utf8-string-valid? object)))
@@ -166,7 +166,7 @@ USA.
                              'UTF-8))
  
  (define (canonicalize-char-data object)
-  (cond ((wide-char? object)
+  (cond ((unicode-char? object)
          (call-with-utf8-output-string
            (lambda (port)
              (write-char object port))))
author	Chris Hanson <org/chris-hanson/cph>
	Sun, 30 Aug 2009 07:08:31 +0000 (00:08 -0700)
committer	Chris Hanson <org/chris-hanson/cph>
	Sun, 30 Aug 2009 07:08:31 +0000 (00:08 -0700)
doc/ref-manual/characters.texi		patch \| blob \| history
src/runtime/genio.scm		patch \| blob \| history
src/runtime/runtime.pkg		patch \| blob \| history
src/runtime/string.scm		patch \| blob \| history
src/runtime/unicode.scm		patch \| blob \| history
src/xml/turtle.scm		patch \| blob \| history
src/xml/xml-chars.scm		patch \| blob \| history
src/xml/xml-output.scm		patch \| blob \| history
src/xml/xml-parser.scm		patch \| blob \| history
src/xml/xml-struct.scm		patch \| blob \| history