From: Chris Hanson Date: Sun, 30 Aug 2009 07:08:31 +0000 (-0700) Subject: Change "code point" to "scalar value" everywhere. Change "wide char" X-Git-Tag: 20100708-Gtk~355 X-Git-Url: https://birchwood-abbey.net/git?a=commitdiff_plain;h=3641498176e63df5c18f7df6db5b58f73fc9af7a;p=mit-scheme.git Change "code point" to "scalar value" everywhere. Change "wide char" to "unicode char" everywhere. Relax the definition of a well-formed scalar-value list so that it allows overlaps, unsorted elements, and non-canonical pairs. --- diff --git a/doc/ref-manual/characters.texi b/doc/ref-manual/characters.texi index a40f2b363..7938c9ed8 100644 --- a/doc/ref-manual/characters.texi +++ b/doc/ref-manual/characters.texi @@ -54,9 +54,9 @@ quote them. @findex #\U+ In addition to the standard character syntax, MIT Scheme also supports a -general syntax that denotes any Unicode character by its code point. -This notation is @code{#\U+@var{code-point}}, where @var{code-point} is -a sequence of hexadecimal digits for a valid code point. So the above +general syntax that denotes any Unicode character by its scalar value. +This notation is @code{#\U+@var{scalar-value}}, where @var{scalar-value} is +a sequence of hexadecimal digits for a valid scalar value. So the above examples could also be written like this: @example @@ -331,9 +331,9 @@ In fact, the Control bucky bit is completely orthogonal to the @code{#\C-SOH}.} MIT/GNU Scheme uses a 21-bit character code with 4 bucky bits. The -character code contains the Unicode code point for the character. This +character code contains the Unicode scalar value for the character. This is a change from earlier versions of the system, which used the -@acronym{ISO-8859-1} code point, but it is upwards compatible with +@acronym{ISO-8859-1} scalar value, but it is upwards compatible with previous usage, since @acronym{ISO-8859-1} is a proper subset of Unicode. @@ -395,7 +395,7 @@ example, @end example Note that in MIT/GNU Scheme, the value of @code{char-code} is the -Unicode code point for @var{char}. +Unicode scalar value for @var{char}. @end deffn @defvr variable char-code-limit @@ -463,7 +463,7 @@ procedures is as follows: This implies that @code{char->integer} and @code{char-code} produce identical results for characters that have no bucky bits set, and that -characters are ordered according to their Unicode code points. +characters are ordered according to their Unicode scalar values. Note: If the argument to @code{char->integer} or @code{integer->char} is a constant, the compiler will constant-fold the call, replacing it with @@ -475,7 +475,7 @@ character constants or @acronym{ASCII} codes. The range of @code{char->integer} is defined to be the exact non-negative integers that are less than the value of this variable (exclusive). Note, however, that there are some holes in this range, -because the character code must be a valid Unicode code point. +because the character code must be a valid Unicode scalar value. @end defvr @node ISO-8859-1 Characters, Character Sets, Internal Representation of Characters, Characters @@ -691,24 +691,24 @@ parser (@pxref{XML Support}) implementation. @acronym{XML} uses Unicode as its base character set, and any @acronym{XML} implementation @emph{must} support Unicode. -@cindex Code point, Unicode -@cindex Wide character -@cindex Character, wide -The basic unit in a Unicode implementation is the @dfn{code point}. The -character equivalent of a code point is a @dfn{wide character}. +@cindex Scalar value, Unicode +@cindex Unicode character +@cindex Character, Unicode +The basic unit in a Unicode implementation is the @dfn{scalar value}. The +character equivalent of a scalar value is a @dfn{Unicode character}. -@deffn procedure unicode-code-point? object -Returns @code{#t} if @var{object} is a Unicode code point. Code points -are implemented as exact non-negative integers. They are further +@deffn procedure unicode-scalar-value? object +Returns @code{#t} if @var{object} is a Unicode scalar value. Scalar +values are implemented as exact non-negative integers. They are further limited, by the Unicode standard, to be strictly less than @code{#x110000}, with the values @code{#xD800} through @code{#xDFFF}, @code{#xFFFE}, and @code{#xFFFF} excluded. @end deffn -@deffn procedure wide-char? object -Returns @code{#t} if @var{object} is a wide character, specifically if -@var{object} is a character with no bucky bits and whose code satisfies -@code{unicode-code-point?}. +@deffn procedure unicode-char? object +Returns @code{#t} if @var{object} is a Unicode character, specifically +if @var{object} is a character with no bucky bits and whose code +satisfies @code{unicode-scalar-value?}. @end deffn The Unicode implementation consists of three parts: @@ -725,7 +725,7 @@ several external representations, specifically @acronym{UTF-8}, @item An @dfn{alphabet} abstraction, which is an efficient implementation of -sets of Unicode code points (similar to the @code{char-set} +sets of Unicode scalar values (similar to the @code{char-set} abstraction). @end itemize @@ -749,13 +749,13 @@ constant-time access to each character in the string. Returns @code{#t} if @var{object} is a wide string. @end deffn -@deffn procedure make-wide-string k [wide-char] +@deffn procedure make-wide-string k [unicode-char] Returns a newly allocated wide string of length @var{k}. If @var{char} is specified, all elements of the returned string are initialized to @var{char}; otherwise the contents of the string are unspecified. @end deffn -@deffn procedure wide-string wide-char @dots{} +@deffn procedure wide-string unicode-char @dots{} Returns a newly allocated wide string consisting of the specified characters. @end deffn @@ -770,7 +770,7 @@ Returns character @var{k} of @var{wide-string}. @var{K} must be a valid index of @var{string}. @end deffn -@deffn procedure wide-string-set! wide-string k wide-char +@deffn procedure wide-string-set! wide-string k unicode-char Stores @var{char} in element @var{k} of @var{wide-string} and returns an unspecified value. @var{K} must be a valid index of @var{wide-string}. @end deffn @@ -804,7 +804,7 @@ be used to specify that the port delivers characters from a substring of @end deffn @deffn procedure open-wide-output-string -Returns an output port that accepts wide characters and strings and +Returns an output port that accepts Unicode characters and strings and accumulates them in a buffer. Call @code{get-output-string} on the returned port to get a wide string containing the accumulated characters. @@ -915,37 +915,38 @@ and @var{string}'s length, respectively. Applications often need to manipulate sets of characters, such as the set of alphabetic characters or the set of whitespace characters. The @dfn{alphabet} abstraction provides an efficient implementation of -sets of Unicode code points. +sets of Unicode scalar values. @deffn procedure alphabet? object Returns @code{#t} if @var{object} is a Unicode alphabet, otherwise returns @code{#f}. @end deffn -@deffn procedure alphabet wide-char @dots{} -Returns a Unicode alphabet containing the wide characters passed as +@deffn procedure alphabet unicode-char @dots{} +Returns a Unicode alphabet containing the Unicode characters passed as arguments. @end deffn -@deffn procedure code-points->alphabet items -Returns a Unicode alphabet containing the code points described by +@deffn procedure scalar-values->alphabet items +Returns a Unicode alphabet containing the scalar values described by @var{items}. @var{Items} must satisfy -@code{well-formed-code-points-list?}. +@code{well-formed-scalar-values-list?}. @end deffn -@deffn procedure alphabet->code-points alphabet -Returns a well-formed code-points list that describes the code points +@deffn procedure alphabet->scalar-values alphabet +Returns a well-formed scalar-values list that describes the scalar values represented by @var{alphabet}. @end deffn -@deffn procedure well-formed-code-points-list? object -Returns @code{#t} if @var{object} is a well-formed code-points list, -otherwise returns @code{#f}. A well-formed code-points list is a -proper list, each element of which is either a code point or a pair of -code points. A pair of code points represents a contiguous range of -code points. The @sc{car} of the pair is the lower limit, and the -@sc{cdr} is the upper limit. Both limits are inclusive, and the lower -limit must be strictly less than the upper limit. +@deffn procedure well-formed-scalar-values-list? object +Returns @code{#t} if @var{object} is a well-formed scalar-values list, +otherwise returns @code{#f}. A well-formed scalar-values list is a +proper list, each element of which is either a unicode scalar value or a +pair of unicode scalar values. A pair of scalar values represents a +contiguous range of scalar values. The @sc{car} of the pair is the +lower limit, and the @sc{cdr} is the upper limit. Both limits are +inclusive, and the lower limit must be less than or equal to the upper +limit. @end deffn @deffn procedure char-in-alphabet? char alphabet @@ -954,23 +955,23 @@ otherwise returns @code{#f}. @end deffn Character sets and alphabets can be converted to one another, provided -that the alphabet contains only 8-bit code points. This is true -because 8-bit code points in Unicode map directly to +that the alphabet contains only 8-bit scalar values. This is true +because 8-bit scalar values in Unicode map directly to @acronym{ISO-8859-1} characters, which is what character sets contain. @deffn procedure char-set->alphabet char-set -Returns a Unicode alphabet containing the code points that correspond +Returns a Unicode alphabet containing the scalar values that correspond to characters that are members of @var{char-set}. @end deffn @deffn procedure alphabet->char-set alphabet Returns a character set containing the characters that correspond to -8-bit code points that are members of @var{alphabet}. (Code points +8-bit scalar values that are members of @var{alphabet}. (Scalar values outside the 8-bit range are ignored.) @end deffn @deffn procedure string->alphabet string -Returns a Unicode alphabet containing the code points corresponding to +Returns a Unicode alphabet containing the scalar values corresponding to the characters in @var{string}. Equivalent to @example @@ -979,22 +980,22 @@ the characters in @var{string}. Equivalent to @end deffn @deffn procedure alphabet->string alphabet -Returns a newly-allocated string containing the characters -corresponding to the 8-bit code points in @var{alphabet}. (Code -points outside the 8-bit range are ignored.) +Returns a newly-allocated string containing the characters corresponding +to the 8-bit scalar values in @var{alphabet}. (Scalar values outside +the 8-bit range are ignored.) @end deffn @deffn procedure 8-bit-alphabet? alphabet -Returns @code{#t} if @var{alphabet} contains only 8-bit code points, +Returns @code{#t} if @var{alphabet} contains only 8-bit scalar values, otherwise returns @code{#f}. @end deffn @deffn procedure alphabet+ alphabet @dots{} -Returns a Unicode alphabet that contains each code point that is a +Returns a Unicode alphabet that contains each scalar value that is a member of any of the @var{alphabet} arguments. @end deffn @deffn procedure alphabet- alphabet1 alphabet2 -Returns a Unicode alphabet that contains each code point that is a +Returns a Unicode alphabet that contains each scalar value that is a member of @var{alphabet1} and is not a member of @var{alphabet2}. @end deffn diff --git a/src/runtime/genio.scm b/src/runtime/genio.scm index 6a8792efe..79c12fbb7 100644 --- a/src/runtime/genio.scm +++ b/src/runtime/genio.scm @@ -1857,7 +1857,7 @@ USA. (* (get-byte bv bs 1) #x10000) (* (get-byte bv bs 2) #x100) (get-byte bv bs 3)))) - (if (unicode-code-point? cp) + (if (unicode-scalar-value? cp) (begin (set-input-buffer-start! ib (fix:+ bs 4)) cp) @@ -1873,7 +1873,7 @@ USA. (* (get-byte bv bs 2) #x10000) (* (get-byte bv bs 1) #x100) (get-byte bv bs 0)))) - (if (unicode-code-point? cp) + (if (unicode-scalar-value? cp) (begin (set-input-buffer-start! ib (fix:+ bs 4)) cp) diff --git a/src/runtime/runtime.pkg b/src/runtime/runtime.pkg index 276e0f79e..95907d045 100644 --- a/src/runtime/runtime.pkg +++ b/src/runtime/runtime.pkg @@ -4910,6 +4910,9 @@ USA. (files "unicode") (parent (runtime)) (export () + (error:not-wide-char error:not-unicode-char) + (guarantee-wide-char guarantee-unicode-char) + (wide-char? unicode-char?) (wide-string->utf16-be-string string->utf16-be-string) (wide-string->utf16-le-string string->utf16-le-string) (wide-string->utf16-string string->utf16-string) @@ -4924,7 +4927,7 @@ USA. alphabet+ alphabet- alphabet->char-set - alphabet->code-points + alphabet->scalar-values alphabet->string alphabet-predicate alphabet? @@ -4944,11 +4947,11 @@ USA. call-with-utf8-output-string char-in-alphabet? char-set->alphabet - code-points->alphabet combine-utf16-surrogates error:not-8-bit-alphabet error:not-alphabet - error:not-unicode-code-point + error:not-unicode-char + error:not-unicode-scalar-value error:not-utf16-be-string error:not-utf16-high-surrogate error:not-utf16-le-string @@ -4958,14 +4961,14 @@ USA. error:not-utf32-le-string error:not-utf32-string error:not-utf8-string - error:not-well-formed-code-point-list - error:not-wide-char + error:not-well-formed-scalar-value-list error:not-wide-string for-all-chars-in-string? for-any-char-in-string? guarantee-8-bit-alphabet guarantee-alphabet - guarantee-unicode-code-point + guarantee-unicode-char + guarantee-unicode-scalar-value guarantee-utf16-be-string guarantee-utf16-high-surrogate guarantee-utf16-le-string @@ -4975,8 +4978,7 @@ USA. guarantee-utf32-le-string guarantee-utf32-string guarantee-utf8-string - guarantee-well-formed-code-point-list - guarantee-wide-char + guarantee-well-formed-scalar-value-list guarantee-wide-string guarantee-wide-string-index guarantee-wide-substring @@ -4995,6 +4997,7 @@ USA. open-utf32-output-string open-utf8-input-string open-utf8-output-string + scalar-values->alphabet split-into-utf16-surrogates string->alphabet string->utf16-be-string @@ -5006,7 +5009,8 @@ USA. string->utf8-string string->utf8-string string->wide-string - unicode-code-point? + unicode-char? + unicode-scalar-value? utf16-be-string->wide-string utf16-be-string-length utf16-be-string-valid? @@ -5038,8 +5042,7 @@ USA. utf8-string-length utf8-string-valid? utf8-string? - well-formed-code-point-list? - wide-char? + well-formed-scalar-value-list? wide-string wide-string->string wide-string-index? diff --git a/src/runtime/string.scm b/src/runtime/string.scm index ad8a94bf2..9114ac238 100644 --- a/src/runtime/string.scm +++ b/src/runtime/string.scm @@ -203,7 +203,8 @@ USA. (cond ((string? object) (string->utf8-string object)) ((symbol? object) (symbol-name object)) ((wide-string? object) (wide-string->utf8-string object)) - ((wide-char? object) (wide-string->utf8-string (wide-string object))) + ((unicode-char? object) + (wide-string->utf8-string (wide-string object))) (else (%->string object 'UTF8-STRING)))) (define (%->string object caller) diff --git a/src/runtime/unicode.scm b/src/runtime/unicode.scm index 289d36bc5..6764160a2 100644 --- a/src/runtime/unicode.scm +++ b/src/runtime/unicode.scm @@ -162,21 +162,18 @@ USA. ;;;; Unicode characters -(define (wide-char? object) +(define (unicode-char? object) (and (char? object) (legal-code-32? (char->integer object)))) -(define-guarantee wide-char "a Unicode character") +(define-guarantee unicode-char "a Unicode character") -(define (unicode-code-point? object) - (and (%unicode-code-point? object) - (not (illegal? object)))) - -(define (%unicode-code-point? object) +(define (unicode-scalar-value? object) (and (index-fixnum? object) - (fix:< object char-code-limit))) + (fix:< object char-code-limit) + (not (illegal? object)))) -(define-guarantee unicode-code-point "a Unicode code point") +(define-guarantee unicode-scalar-value "a Unicode scalar value") (define-integrable (legal-code-32? pt) (and (fix:< pt char-code-limit) @@ -213,26 +210,26 @@ Not used at the moment. (define-integrable (make-alphabet-low) (make-string #x100 (integer->char 0))) -(define-integrable (alphabet-low-ref low code-point) - (not (fix:= (fix:and (vector-8b-ref low (fix:lsh code-point -3)) - (fix:lsh 1 (fix:and code-point 7))) +(define-integrable (alphabet-low-ref low scalar-value) + (not (fix:= (fix:and (vector-8b-ref low (fix:lsh scalar-value -3)) + (fix:lsh 1 (fix:and scalar-value 7))) 0))) -(define-integrable (alphabet-low-set! low code-point) +(define-integrable (alphabet-low-set! low scalar-value) (vector-8b-set! low - (fix:lsh code-point -3) - (fix:or (vector-8b-ref low (fix:lsh code-point -3)) - (fix:lsh 1 (fix:and code-point 7))))) + (fix:lsh scalar-value -3) + (fix:or (vector-8b-ref low (fix:lsh scalar-value -3)) + (fix:lsh 1 (fix:and scalar-value 7))))) (define null-alphabet (make-alphabet (make-alphabet-low) '#() '#())) (define (char-in-alphabet? char alphabet) - (guarantee-wide-char char 'CHAR-IN-ALPHABET?) + (guarantee-unicode-char char 'CHAR-IN-ALPHABET?) (guarantee-alphabet alphabet 'CHAR-IN-ALPHABET?) - (%code-point-in-alphabet? (char-code char) alphabet)) + (%scalar-value-in-alphabet? (char-code char) alphabet)) -(define (%code-point-in-alphabet? pt alphabet) +(define (%scalar-value-in-alphabet? pt alphabet) (if (fix:< pt #x800) (alphabet-low-ref (alphabet-low alphabet) pt) (let ((high1 (alphabet-high1 alphabet)) @@ -246,36 +243,31 @@ Not used at the moment. (loop (fix:+ index 1) upper)) (else #t)))))))) -(define (well-formed-code-point-list? items) - (if (pair? items) - (and (well-formed-item? (car items)) - (let loop ((a (car items)) (items (cdr items))) - (if (pair? items) - (let ((b (car items)) - (items (cdr items))) - (and (well-formed-item? b) - (fix:< (if (pair? a) (cdr a) a) - (if (pair? b) (car b) b)) - (loop b items))) - (null? items)))) - (null? items))) +(define (well-formed-scalar-value-list? items) + (list-of-type? items well-formed-item?)) (define (well-formed-item? item) (if (pair? item) - (and (%unicode-code-point? (car item)) - (%unicode-code-point? (cdr item)) - (fix:< (car item) (cdr item))) - (%unicode-code-point? item))) + (and (unicode-scalar-value? (car item)) + (unicode-scalar-value? (cdr item)) + (fix:<= (car item) (cdr item))) + (unicode-scalar-value? item))) -(define-guarantee well-formed-code-point-list "a Unicode code-point list") +(define-guarantee well-formed-scalar-value-list "a Unicode scalar-value list") -(define (code-points->alphabet items) - (guarantee-well-formed-code-point-list items 'CODE-POINTS->ALPHABET) - (%code-points->alphabet items)) +(define (scalar-values->alphabet items) + (guarantee-well-formed-scalar-value-list items 'SCALAR-VALUES->ALPHABET) + (%scalar-values->alphabet items)) + +(define (alphabet . chars) + (for-each (lambda (char) + (guarantee-unicode-char char 'ALPHABET)) + chars) + (%scalar-values->alphabet (map char->integer chars))) -(define (%code-points->alphabet items) +(define (%scalar-values->alphabet items) (receive (low-items high-items) - (split-list (canonicalize-code-point-list items) #x800) + (split-list (canonicalize-scalar-value-list items) #x800) (let ((low (make-alphabet-low))) (for-each (lambda (item) (if (pair? item) @@ -299,83 +291,13 @@ Not used at the moment. (vector-set! high2 i (car items))))) (make-alphabet low high1 high2)))))) -(define (canonicalize-code-point-list items) - (if (pair? items) - (let ((a (car items))) - (let loop - ((al (if (pair? a) (car a) a)) - (ah (if (pair? a) (cdr a) a)) - (items (cdr items))) - (if (pair? items) - (let ((b (car items)) - (items (cdr items))) - (let ((bl (if (pair? b) (car b) b)) - (bh (if (pair? b) (cdr b) b))) - (if (fix:= (fix:+ ah 1) bl) - (loop al bh items) - (cons (if (fix:= al ah) al (cons al ah)) - (loop bl bh items))))) - (list (if (fix:= al ah) al (cons al ah)))))) - '())) - -(define (split-list items limit) - (let loop ((items items) (low '())) - (if (pair? items) - (let ((item (car items))) - (cond ((not (pair? item)) - (if (fix:< item limit) - (loop (cdr items) (cons item low)) - (values low items))) - ((fix:< (cdr item) limit) - (loop (cdr items) (cons item low))) - ((fix:<= limit (car item)) - (values low items)) - (else - (values (cons (cons (car item) (fix:- limit 1)) low) - (cons (cons limit (cdr item)) items))))) - (values low '())))) - -(define (alphabet . chars) - (%code-points->alphabet (chars->wfcp-list (remove-duplicate-chars chars)))) - -(define (remove-duplicate-chars chars) - (let ((table (make-eq-hash-table))) - (for-each (lambda (char) - (guarantee-wide-char char 'REMOVE-DUPLICATE-CHARS) - (hash-table/put! table char #t)) - chars) - (hash-table/key-list table))) - -(define (chars->wfcp-list chars) - (let ((chars (sort chars charinteger (car chars))) - (chars (cdr chars)) - (items '())) - (if (pair? chars) - (let ((pt* (char->integer (car chars))) - (chars (cdr chars))) - (if (fix:= pt* (fix:+ pt 1)) - (let find-max ((pt* pt*) (chars chars)) - (if (pair? chars) - (let ((pt** (char->integer (car chars))) - (chars (cdr chars))) - (if (fix:= pt** (fix:+ pt* 1)) - (find-max pt** chars) - (loop pt** chars (cons (cons pt pt*) items)))) - (reverse! (cons (cons pt pt*) items)))) - (loop pt* chars (cons pt items)))) - (reverse! (cons pt items)))) - '()))) - -(define (alphabet->code-points alphabet) - (guarantee-alphabet alphabet 'ALPHABET->CODE-POINTS) - (append! (alphabet-low->code-points (alphabet-low alphabet)) - (alphabet-high->code-points (alphabet-high1 alphabet) +(define (alphabet->scalar-values alphabet) + (guarantee-alphabet alphabet 'ALPHABET->SCALAR-VALUES) + (append! (alphabet-low->scalar-values (alphabet-low alphabet)) + (alphabet-high->scalar-values (alphabet-high1 alphabet) (alphabet-high2 alphabet)))) -(define (alphabet-low->code-points low) +(define (alphabet-low->scalar-values low) (let find-lower ((i 0) (result '())) (if (fix:< i #x800) (if (alphabet-low-ref low i) @@ -397,15 +319,127 @@ Not used at the moment. (find-lower (fix:+ i 1) result)) (reverse! result)))) -(define (alphabet-high->code-points lower upper) +(define (alphabet-high->scalar-values lower upper) (let ((n (vector-length lower))) (let loop ((i 0) (result '())) (if (fix:< i n) (loop (fix:+ i 1) - (cons (cons (vector-ref lower i) (vector-ref upper i)) + (cons (if (fix:< (vector-ref lower i) (vector-ref upper i)) + (cons (vector-ref lower i) (vector-ref upper i)) + (vector-ref lower i)) result)) (reverse! result))))) +(define (canonicalize-scalar-value-list items) + (if (pair? items) + (let ((items + (sort items + (lambda (a b) + (let ((al (if (pair? a) (car a) a)) + (ah (if (pair? a) (cdr a) a)) + (bl (if (pair? b) (car b) b)) + (bh (if (pair? b) (cdr b) b))) + (or (fix:< al bl) + (and (fix:= al bl) + (fix:< ah bh))))))) + (make-item + (lambda (l h) + (if (fix:= l h) + l + (cons l h))))) + (let loop + ((al (if (pair? (car items)) (caar items) (car items))) + (ah (if (pair? (car items)) (cdar items) (car items))) + (items (cdr items))) + (if (pair? items) + (let ((bl (if (pair? (car items)) (caar items) (car items))) + (bh (if (pair? (car items)) (cdar items) (car items))) + (items (cdr items))) + (if (fix:< (fix:+ ah 1) bl) + (cons (make-item al ah) + (loop bl bh items)) + (loop al (fix:max ah bh) items))) + (list (make-item al ah))))) + items)) + +(define (split-list items limit) + (let loop ((items items) (low '())) + (if (pair? items) + (let ((item (car items))) + (cond ((not (pair? item)) + (if (fix:< item limit) + (loop (cdr items) (cons item low)) + (values low items))) + ((fix:< (cdr item) limit) + (loop (cdr items) (cons item low))) + ((fix:<= limit (car item)) + (values low items)) + (else + (values (cons (cons (car item) (fix:- limit 1)) low) + (cons (cons limit (cdr item)) items))))) + (values low '())))) + +#| + +(define (test-canonicalize-scalar-value-list n-items n-iter) + (run-cpl-test n-items n-iter canonicalize-scalar-value-list)) + +(define (test-alphabet->scalar-values n-items n-iter) + (run-cpl-test n-items n-iter + (lambda (cpl) + (alphabet->scalar-values (scalar-values->alphabet cpl))))) + +(define (run-cpl-test n-items n-iter procedure) + (do ((i 0 (+ i 1)) + (failures '() + (let ((cpl (make-test-cpl n-items))) + (guarantee-well-formed-scalar-value-list cpl) + (let ((cpl* (procedure cpl))) + (if (canonical-scalar-value-list? cpl*) + failures + (cons (cons cpl cpl*) failures)))))) + ((not (< i n-iter)) + (let ((n-failures (length failures))) + (if (> n-failures 0) + (begin + (write-string "Got ") + (write n-failures) + (write-string " failure") + (if (> n-failures 1) + (write-string "s")) + (write-string " out of ") + (write n-iter) + (newline) + (pp failures))))))) + +(define (make-test-cpl n-items) + (make-initialized-list n-items + (lambda (i) + (let loop () + (let ((n (random #x10000))) + (if (unicode-scalar-value? n) + (let ((m (random #x100))) + (if (fix:= m 0) + n + (if (unicode-scalar-value? (fix:+ n m)) + (fix:+ n m) + (loop)))) + (loop))))))) + +(define (canonical-scalar-value-list? items) + (and (well-formed-scalar-value-list? items) + (if (pair? items) + (let loop ((a (car items)) (items (cdr items))) + (if (pair? items) + (let ((b (car items)) + (items (cdr items))) + (and (fix:< (fix:+ (if (pair? a) (cdr a) a) 1) + (if (pair? b) (car b) b)) + (loop b items))) + #t)) + #t))) +|# + (define (8-bit-alphabet? alphabet) (and (fix:= (vector-length (alphabet-high1 alphabet)) 0) (let ((low (alphabet-low alphabet))) @@ -443,7 +477,7 @@ Not used at the moment. (let loop ((i 0) (chars '())) (if (fix:< i #x100) (loop (fix:+ i 1) - (if (%code-point-in-alphabet? i alphabet) + (if (%scalar-value-in-alphabet? i alphabet) (cons (integer->char i) chars) chars)) (apply string (reverse! chars))))) @@ -581,12 +615,12 @@ Not used at the moment. (make-vector length (if (if (default-object? char) #f char) (begin - (guarantee-wide-char char 'MAKE-WIDE-STRING) + (guarantee-unicode-char char 'MAKE-WIDE-STRING) char) (integer->char 0))))) (define (wide-string . chars) - (for-each (lambda (char) (guarantee-wide-char char 'WIDE-STRING)) chars) + (for-each (lambda (char) (guarantee-unicode-char char 'WIDE-STRING)) chars) (%make-wide-string (list->vector chars))) (define (wide-string-length string) @@ -607,7 +641,7 @@ Not used at the moment. (define (wide-string-set! string index char) (guarantee-wide-string string 'WIDE-STRING-SET!) (guarantee-wide-string-index index string 'WIDE-STRING-SET!) - (guarantee-wide-char char 'WIDE-STRING-SET!) + (guarantee-unicode-char char 'WIDE-STRING-SET!) (%wide-string-set! string index char)) (define-integrable (%wide-string-set! string index char) @@ -957,7 +991,7 @@ Not used at the moment. #x10000)) (define (split-into-utf16-surrogates n) - (guarantee-unicode-code-point n 'split-into-utf16-surrogates) + (guarantee-unicode-scalar-value n 'split-into-utf16-surrogates) (let ((n (fix:- n #x10000))) (values (fix:or (fix:and (fix:lsh n -10) #x03FF) #xD800) (fix:or (fix:and n #x03FF) #xDC00)))) diff --git a/src/xml/turtle.scm b/src/xml/turtle.scm index c75a845a7..a6db82779 100644 --- a/src/xml/turtle.scm +++ b/src/xml/turtle.scm @@ -278,7 +278,7 @@ USA. char-set:turtle-digit)) (define alphabet:name-start-char - (code-points->alphabet + (scalar-values->alphabet '((#x0041 . #x005A) #x005F (#x0061 . #x007A) @@ -297,7 +297,7 @@ USA. (define alphabet:name-char (alphabet+ alphabet:name-start-char - (code-points->alphabet + (scalar-values->alphabet '(#x002D (#x0030 . #x0039) #x00B7 @@ -308,7 +308,7 @@ USA. (alphabet- alphabet:name-start-char (alphabet #\_))) (define alphabet:character - (code-points->alphabet '((#x20 . #x5B) (#x5D . #x10FFFF)))) + (scalar-values->alphabet '((#x20 . #x5B) (#x5D . #x10FFFF)))) (define alphabet:ucharacter (alphabet- alphabet:character (alphabet #\>))) diff --git a/src/xml/xml-chars.scm b/src/xml/xml-chars.scm index c0b7881be..0d06ae13c 100644 --- a/src/xml/xml-chars.scm +++ b/src/xml/xml-chars.scm @@ -28,7 +28,7 @@ USA. (declare (usual-integrations)) (define alphabet:xml-base-char - (code-points->alphabet + (scalar-values->alphabet '((#x0041 . #x005A) (#x0061 . #x007A) (#x00C0 . #x00D6) @@ -233,13 +233,13 @@ USA. (#xAC00 . #xD7A3)))) (define alphabet:xml-ideographic - (code-points->alphabet + (scalar-values->alphabet '(#x3007 (#x3021 . #x3029) (#x4E00 . #x9FA5)))) (define alphabet:xml-combining-char - (code-points->alphabet + (scalar-values->alphabet '((#x0300 . #x0345) (#x0360 . #x0361) (#x0483 . #x0486) @@ -337,7 +337,7 @@ USA. #x309A))) (define alphabet:xml-digit - (code-points->alphabet + (scalar-values->alphabet '((#x0030 . #x0039) (#x0660 . #x0669) (#x06F0 . #x06F9) @@ -355,7 +355,7 @@ USA. (#x0F20 . #x0F29)))) (define alphabet:xml-extender - (code-points->alphabet + (scalar-values->alphabet '(#x00B7 #x02D0 #x02D1 @@ -369,7 +369,7 @@ USA. (#x30FC . #x30FE)))) (define alphabet:xml-char - (code-points->alphabet + (scalar-values->alphabet '(#x0009 #x000A #x000D diff --git a/src/xml/xml-output.scm b/src/xml/xml-output.scm index 2d84bdc67..925348216 100644 --- a/src/xml/xml-output.scm +++ b/src/xml/xml-output.scm @@ -90,7 +90,7 @@ USA. (define (emit-string string ctx) (let ((port (ctx-port ctx))) - (for-each-wide-char string + (for-each-unicode-char string (lambda (char) (write-char char port))))) @@ -413,7 +413,7 @@ USA. (define (xml-string-columns string) (let ((n 0)) - (for-each-wide-char string + (for-each-unicode-char string (lambda (char) (set! n (fix:+ n @@ -486,7 +486,7 @@ USA. (emit-char #\space ctx))) (define (write-escaped-string string escapes ctx) - (for-each-wide-char string + (for-each-unicode-char string (lambda (char) (cond ((assq char escapes) => (lambda (e) @@ -499,7 +499,7 @@ USA. (else (emit-char char ctx)))))) -(define (for-each-wide-char string procedure) +(define (for-each-unicode-char string procedure) (let ((port (open-utf8-input-string string))) (let loop () (let ((char (read-char port))) diff --git a/src/xml/xml-parser.scm b/src/xml/xml-parser.scm index e441d72f7..7c22d30c6 100644 --- a/src/xml/xml-parser.scm +++ b/src/xml/xml-parser.scm @@ -691,7 +691,7 @@ USA. (let ((make-ref (lambda (s r p) (let ((n (string->number s r))) - (if (not (unicode-code-point? n)) + (if (not (unicode-scalar-value? n)) (perror p "Invalid code point" n)) (let ((char (integer->char n))) (if (not (char-in-alphabet? char alphabet:xml-char)) diff --git a/src/xml/xml-struct.scm b/src/xml/xml-struct.scm index 766693f40..e6db02d52 100644 --- a/src/xml/xml-struct.scm +++ b/src/xml/xml-struct.scm @@ -152,7 +152,7 @@ USA. (value canonicalize canonicalize-char-data)) (define (xml-char-data? object) - (or (wide-char? object) + (or (unicode-char? object) (and (or (wide-string? object) (and (string? object) (utf8-string-valid? object))) @@ -166,7 +166,7 @@ USA. 'UTF-8)) (define (canonicalize-char-data object) - (cond ((wide-char? object) + (cond ((unicode-char? object) (call-with-utf8-output-string (lambda (port) (write-char object port))))