From: Taylor R Campbell <campbell@mumble.net>
Date: Sun, 30 Jun 2019 20:52:56 +0000 (+0000)
Subject: Update documentation for floating-point operations.
X-Git-Tag: mit-scheme-pucked-10.1.12~7^2~30
X-Git-Url: https://birchwood-abbey.net/git?a=commitdiff_plain;h=043b4804d99942b3af1afa7b25d6d69458629dab;p=mit-scheme.git

Update documentation for floating-point operations.
---

diff --git a/doc/ref-manual/numbers.texi b/doc/ref-manual/numbers.texi
index 1b592d274..dca0b573e 100644
--- a/doc/ref-manual/numbers.texi
+++ b/doc/ref-manual/numbers.texi
@@ -1703,6 +1703,97 @@ floating-point number.  In MIT/GNU Scheme, all inexact real numbers are
 flonums.  For this reason, constants such as @code{0.} and @code{2.3}
 are guaranteed to be flonums.
 
+MIT/GNU Scheme follows the @acronym{IEEE 754-2008} floating-point
+standard, using binary64 arithmetic for flonums.
+All floating-point values are classified into:
+
+@table @strong
+@item normal
+@cindex floating-point number, normal
+@cindex normal floating-point number
+Numbers of the form
+@iftex
+@tex
+$$r^e (1 + f/r^{p-1})$$
+@end tex
+@end iftex
+@ifnottex
+
+@example
+r^e (1 + f/r^(p-1))
+@end example
+
+@end ifnottex
+where @math{r}, the radix, is a positive integer, here always @math{2};
+@math{p}, the precision, is a positive integer, here always @math{53};
+@math{e}, the exponent, is an integer within a limited range, here
+always @math{-1022} to @math{1023} (inclusive); and @math{f}, the
+fractional part of the significand, is a @math{(p-1)}-bit unsigned
+integer.
+
+@item subnormal
+@cindex floating-point number, subnormal
+@cindex subnormal floating-point number
+@cindex denormal
+Fixed-point numbers near zero that allow for gradual underflow.
+Every subnormal number is an integer multiple of the smallest
+subnormal number.
+Subnormals were also historically called ``denormal''.
+
+@item zero
+@cindex floating-point number, zero
+@cindex zero
+@cindex signed zero
+There are two distinguished zero values, one with ``negative'' sign
+bit and one with ``positive'' sign bit.
+
+The two zero values are considered numerically equal, but serve to
+distinguish paths converging to zero along different branch cuts and
+so some operations yield different results for differently signed
+zero values.
+
+@item infinity
+@vindex +inf.0
+@vindex -inf.0
+@cindex positive infinity (@code{+inf.0})
+@cindex negative infinity (@code{-inf.0})
+@cindex floating-point number, infinite
+@cindex infinity (@code{+inf.0}, @code{-inf.0})
+@cindex extended real line
+There are two distinguished infinity values, negative infinity or
+@code{-inf.0} and positive infinity or @code{+inf.0}, representing
+overflow on the real line.
+
+@item NaN
+@vindex NaN (not a number)
+@vindex +nan.0
+@vindex -nan.0
+@vindex +snan.1
+@vindex -snan.1
+@cindex floating-point number, not a number
+@cindex not a number (NaN, @code{+nan.0})
+@cindex NaN
+There are @math{4 r^{p-2} - 2} distinguished not-a-number values,
+representing invalid operations or uninitialized data, distinguished
+by their negative/positive sign bit, a quiet/signalling bit, and a
+@math{(p-2)}-digit unsigned integer payload which must not be zero for
+signalling NaNs.
+
+@cindex quiet NaN
+@cindex signalling NaN
+@cindex invalid-operation exception
+Arithmetic on @strong{quiet} NaNs propagates them without raising any
+floating-point exceptions.
+In contrast, arithmetic on @strong{signalling} NaNs raises the
+floating-point invalid-operation exception.
+Quiet NaNs are written @code{+nan.123}, @code{-nan.0}, etc.
+Signalling NaNs are written @code{+snan.123}, @code{-snan.1}, etc.
+The notation @code{+snan.0} and @code{-snan.0} is not allowed: what
+would be the encoding for them actually means @code{+inf.0} and
+@code{-inf.0}.
+
+@end table
+
 @deffn procedure flo:flonum? object
 @cindex type predicate, for flonum
 Returns @code{#t} if @var{object} is a flonum; otherwise returns @code{#f}.
@@ -1710,10 +1801,47 @@ Returns @code{#t} if @var{object} is a flonum; otherwise returns @code{#f}.
 
 @deffn procedure flo:= flonum1 flonum2
 @deffnx procedure flo:< flonum1 flonum2
+@deffnx procedure flo:<= flonum1 flonum2
 @deffnx procedure flo:> flonum1 flonum2
+@deffnx procedure flo:>= flonum1 flonum2
+@deffnx procedure flo:<> flonum1 flonum2
 @cindex equivalence predicate, for flonums
+@cindex ordered comparison
+@cindex floating-point comparison, ordered
+@cindex trichotomy
 These procedures are the standard order and equality predicates on
 flonums.  When compiled, they do not check the types of their arguments.
+These predicates raise floating-point invalid-operation exceptions on
+NaN arguments; in other words, they are ``ordered comparisons''.
+When floating-point exception traps are disabled, they return false
+when any argument is NaN.
+
+Every pair of floating-point numbers --- excluding NaN --- exhibits
+ordered trichotomy: they are related either by @code{flo:=},
+@code{flo:<}, or @code{flo:>}.
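+
+For instance, a few comparisons whose results follow from the
+description above (an illustrative sketch, not a transcript from a
+particular release):
+
+@example
+@group
+(flo:< 1. 2.)                  @result{} #t
+(flo:>= 1. 2.)                 @result{} #f
+(flo:= 0. -0.)                 @result{} #t
+(flo:<> 1. 2.)                 @result{} #t
+
+(flo:< 1. +nan.0)              @result{} #f  @r{; (raises invalid-operation)}
+@end group
+@end example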
+@end deffn
+
+@deffn procedure flo:safe= flonum1 flonum2
+@deffnx procedure flo:safe< flonum1 flonum2
+@deffnx procedure flo:safe<= flonum1 flonum2
+@deffnx procedure flo:safe> flonum1 flonum2
+@deffnx procedure flo:safe>= flonum1 flonum2
+@deffnx procedure flo:safe<> flonum1 flonum2
+@deffnx procedure flo:unordered? flonum1 flonum2
+@cindex equivalence predicate, for flonums
+@cindex unordered comparison
+@cindex floating-point comparison, unordered
+@cindex tetrachotomy
+These procedures are the standard order and equality predicates on
+flonums.  When compiled, they do not check the types of their arguments.
+These predicates do not raise floating-point exceptions, and simply
+return false on NaN arguments, except @code{flo:unordered?} which
+returns true iff at least one argument is NaN; in other words, they
+are ``unordered comparisons''.
+
+Every pair of floating-point numbers, including NaNs, exhibits
+unordered tetrachotomy: exactly one of @code{flo:safe=},
+@code{flo:safe<}, @code{flo:safe>}, or @code{flo:unordered?} holds.
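+
+As a sketch of what this description implies (the results below are
+derived from the text above, not copied from a session):
+
+@example
+@group
+(flo:safe< 1. 2.)              @result{} #t
+(flo:safe= 0. -0.)             @result{} #t
+(flo:safe= +nan.0 +nan.0)      @result{} #f
+(flo:safe< 1. +nan.0)          @result{} #f
+
+(flo:unordered? 1. +nan.0)     @result{} #t
+(flo:unordered? 1. 2.)         @result{} #f
+@end group
+@end example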
 @end deffn
 
 @deffn procedure flo:zero? flonum
@@ -1721,6 +1849,83 @@ flonums.  When compiled, they do not check the types of their arguments.
 @deffnx procedure flo:negative? flonum
 Each of these procedures compares its argument to zero.  When compiled,
 they do not check the type of their argument.
+These predicates raise floating-point invalid-operation exceptions on
+NaN arguments; in other words, they are ``ordered comparisons''.
+
+@example
+@group
+(flo:zero? -0.)                @result{} #t
+(flo:negative? -0.)            @result{} #f
+(flo:negative? -1.)            @result{} #t
+
+(flo:zero? 0.)                 @result{} #t
+(flo:positive? 0.)             @result{} #f
+(flo:positive? 1.)             @result{} #t
+
+(flo:zero? +nan.123)           @result{} #f  @r{; (raises invalid-operation)}
+@end group
+@end example
+@end deffn
+
+@deffn procedure flo:normal? flonum
+@deffnx procedure flo:subnormal? flonum
+@deffnx procedure flo:safe-zero? flonum
+@deffnx procedure flo:infinite? flonum
+@deffnx procedure flo:nan? flonum
+Floating-point classification predicates.
+For any flonum, exactly one of these predicates returns true.
+These predicates never raise floating-point exceptions.
+
+@example
+(flo:normal? 1.23)             @result{} #t
+(flo:subnormal? 4e-324)        @result{} #t
+(flo:safe-zero? -0.)           @result{} #t
+(flo:infinite? +inf.0)         @result{} #t
+(flo:nan? -nan.123)            @result{} #t
+@end example
+@end deffn
+
+@deffn procedure flo:finite? flonum
+Equivalent to:
+
+@example
+@group
+(or (flo:safe-zero? @var{flonum})
+    (flo:subnormal? @var{flonum})
+    (flo:normal? @var{flonum}))
+; or
+(and (not (flo:infinite? @var{flonum}))
+     (not (flo:nan? @var{flonum})))
+@end group
+@end example
+
+True for normal, subnormal, and zero floating-point values; false for
+infinity and NaN.
+@end deffn
+
+@deffn procedure flo:classify flonum
+Returns a symbol representing the classification of the flonum, one
+of @code{normal}, @code{subnormal}, @code{zero}, @code{infinity}, or
+@code{nan}.
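+
+For illustration, the symbols one would expect for a value of each
+class, assuming the classification described at the start of this
+section:
+
+@example
+@group
+(flo:classify 1.23)            @result{} normal
+(flo:classify 4e-324)          @result{} subnormal
+(flo:classify -0.)             @result{} zero
+(flo:classify -inf.0)          @result{} infinity
+(flo:classify +nan.123)        @result{} nan
+@end group
+@end example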
+@end deffn
+
+@deffn procedure flo:sign-negative? flonum
+Returns true if the sign bit of @var{flonum} is negative, and false
+otherwise.
+Never raises a floating-point exception.
+
+@example
+@group
+(flo:sign-negative? +0.)       @result{} #f
+(flo:sign-negative? -0.)       @result{} #t
+(flo:sign-negative? -1.)       @result{} #t
+(flo:sign-negative? +inf.0)    @result{} #f
+(flo:sign-negative? +nan.123)  @result{} #f
+
+(flo:negative? -0.)            @result{} #f
+(flo:negative? +nan.123)       @result{} #f  @r{; (raises invalid-operation)}
+@end group
+@end example
 @end deffn
 
 @deffn procedure flo:+ flonum1 flonum2
@@ -1731,24 +1936,27 @@ These procedures are the standard arithmetic operations on flonums.
 When compiled, they do not check the types of their arguments.
 @end deffn
 
-@deffn procedure flo:finite? flonum
-@vindex +inf
-@vindex -inf
-@vindex NaN
-@cindex positive infinity (@code{+inf})
-@cindex negative infinity (@code{-inf})
-@cindex not a number (@code{NaN})
-The @acronym{IEEE} floating-point number specification supports three
-special ``numbers'': positive infinity (@code{+inf}), negative infinity
-(@code{-inf}), and not-a-number (@code{NaN}).  This predicate returns
-@code{#f} if @var{flonum} is one of these objects, and @code{#t} if it
-is any other floating-point number.
-@end deffn
-
 @deffn procedure flo:negate flonum
 This procedure returns the negation of its argument.  When compiled, it
-does not check the type of its argument.  Equivalent to @code{(flo:- 0.
-@var{flonum})}.
+does not check the type of its argument.
+
+This is @emph{not} equivalent to @code{(flo:- 0. @var{flonum})}:
+
+@example
+@group
+(flo:negate 1.2)               @result{} -1.2
+(flo:negate -nan.123)          @result{} +nan.123
+(flo:negate +inf.0)            @result{} -inf.0
+(flo:negate 0.)                @result{} -0.
+(flo:negate -0.)               @result{} 0.
+
+(flo:- 0. 1.2)                 @result{} -1.2
+(flo:- 0. -nan.123)            @result{} -nan.123
+(flo:- 0. +inf.0)              @result{} -inf.0
+(flo:- 0. 0.)                  @result{} 0.
+(flo:- 0. -0.)                 @result{} 0.
+@end group
+@end example
 @end deffn
 
 @deffn procedure flo:abs flonum
@@ -1780,6 +1988,205 @@ This is the flonum version of @code{atan} with two arguments.  When
 compiled, it does not check the types of its arguments.
 @end deffn
 
+@deffn procedure flo:min x1 x2
+@deffnx procedure flo:max x1 x2
+Returns the min or max of two floating-point numbers.
+If either argument is NaN, raises the floating-point invalid-operation
+exception and returns the other one if it is not NaN, or the first
+argument if they are both NaN.
+@end deffn
+
+@deffn procedure flo:min-mag x1 x2
+@deffnx procedure flo:max-mag x1 x2
+Returns the argument that has the smaller or larger magnitude, as in
+minNumMag or maxNumMag of @acronym{IEEE 754-2008}.
+If either argument is NaN, raises the floating-point invalid-operation
+exception and returns the other one if it is not NaN, or the first
+argument if they are both NaN.
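+
+A small sketch contrasting the magnitude-based procedures with
+@code{flo:min} and @code{flo:max} on non-NaN arguments (results
+derived from the definitions above):
+
+@example
+@group
+(flo:min-mag -3. 2.)           @result{} 2.
+(flo:max-mag -3. 2.)           @result{} -3.
+
+(flo:min -3. 2.)               @result{} -3.
+(flo:max -3. 2.)               @result{} 2.
+@end group
+@end example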
+@end deffn
+
+@deffn procedure flo:ldexp x1 x2
+@deffnx procedure flo:scalbn x1 x2
+@code{Flo:ldexp} scales by a power of two; @code{flo:scalbn} scales by
+a power of the floating-point radix.
+@iftex
+@tex
+$$\eqalign{
+  \mathop{\rm ldexp} x \, e &:= x \cdot 2^e, \cr
+  \mathop{\rm scalbn} x \, e &:= x \cdot r^e.
+}$$
+@end tex
+@end iftex
+@ifnottex
+
+@example
+ldexp x e := x * 2^e,
+scalbn x e := x * r^e.
+@end example
+
+@end ifnottex
+In MIT/GNU Scheme, these procedures are the same; they are both
+provided to make it clearer which operation is meant.
+@end deffn
+
+@defvr constant flo:radix
+@defvrx constant flo:radix.
+@defvrx constant flo:precision
+Floating-point system parameters.
+@code{Flo:radix} is the floating-point radix as an integer, and
+@code{flo:precision} is the floating-point precision as an integer;
+@code{flo:radix.} is the floating-point radix as a flonum.
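+
+Given the binary64 format described at the start of this section
+(radix 2, precision 53), these constants would be expected to have
+the following values:
+
+@example
+@group
+flo:radix                      @result{} 2
+flo:radix.                     @result{} 2.
+flo:precision                  @result{} 53
+@end group
+@end example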
+@end defvr
+
+@defvr constant flo:error-bound
+@defvrx constant flo:log-error-bound
+@defvrx constant flo:ulp-of-one
+@defvrx constant flo:log-ulp-of-one
+@code{Flo:error-bound}, sometimes called the machine epsilon, is the
+maximum relative error of rounding to nearest:
+@iftex
+@tex
+$$\max_x {|x - \mathop{\rm fl}(x)| \over |x|} = {1 \over 2 r^{p-1}},$$
+@end tex
+@end iftex
+@ifnottex
+
+@example
+max |x - fl(x)|/|x| = 1/(2 r^(p-1)),
+@end example
+
+@end ifnottex
+where @math{r} is the floating-point radix and @math{p} is the
+floating-point precision.
+
+@code{Flo:ulp-of-one} is the distance from @math{1} to the next larger
+floating-point number, and is equal to @math{1/r^{p-1}}.
+
+@code{Flo:error-bound} is half @code{flo:ulp-of-one}.
+
+@code{Flo:log-error-bound} is the logarithm of @code{flo:error-bound},
+and @code{flo:log-ulp-of-one} is the logarithm of
+@code{flo:ulp-of-one}.
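+
+The relationships above can be checked with a short sketch.
+The last line assumes the default round-to-nearest rounding mode,
+under which adding a quarter of @code{flo:ulp-of-one} to @math{1}
+rounds back to @math{1}:
+
+@example
+@group
+(flo:= flo:error-bound (flo:/ flo:ulp-of-one 2.))
+                               @result{} #t
+(flo:< 1. (flo:+ 1. flo:ulp-of-one))
+                               @result{} #t
+(flo:= 1. (flo:+ 1. (flo:/ flo:ulp-of-one 4.)))
+                               @result{} #t
+@end group
+@end example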
+@end defvr
+
+@deffn procedure flo:ulp flonum
+Returns the distance from @var{flonum} to the next floating-point
+number larger in magnitude with the same sign.
+For zero, this returns the smallest subnormal.
+For infinities, this returns positive infinity.
+For NaN, this returns the same NaN.
+
+@example
+(flo:ulp 1.)                    @result{} 2.220446049250313e-16
+(= (flo:ulp 1.) flo:ulp-of-one) @result{} #t
+@end example
+@end deffn
+
+@defvr constant flo:normal-exponent-max
+@defvrx constant flo:normal-exponent-min
+@defvrx constant flo:subnormal-exponent-min
+Largest and smallest integer exponents of the radix in normal
+and subnormal floating-point numbers.
+
+@itemize @bullet
+@item
+@code{Flo:normal-exponent-max} is the largest integer such
+that @code{(expt flo:radix. flo:normal-exponent-max)} does not
+overflow.
+
+@item
+@code{Flo:normal-exponent-min} is the smallest integer such
+that @code{(expt flo:radix. flo:normal-exponent-min)} is a normal
+floating-point number.
+
+@item
+@code{Flo:subnormal-exponent-min} is the smallest integer such
+that @code{(expt flo:radix. flo:subnormal-exponent-min)} is nonzero;
+that power is also the smallest positive floating-point number.
+@end itemize
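+
+With the binary64 parameters given at the start of this section
+(radix 2, precision 53, normal exponents from @math{-1022} to
+@math{1023}), these constants would be expected to work out as
+follows; the subnormal minimum is @math{-1022 - (53 - 1)}:
+
+@example
+@group
+flo:normal-exponent-max        @result{} 1023
+flo:normal-exponent-min        @result{} -1022
+flo:subnormal-exponent-min     @result{} -1074
+@end group
+@end example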
+@end defvr
+
+@defvr constant flo:largest-positive-normal
+@defvrx constant flo:smallest-positive-normal
+@defvrx constant flo:smallest-positive-subnormal
+The largest positive normal, smallest positive normal, and smallest
+positive subnormal floating-point numbers.
+@end defvr
+
+@defvr constant flo:greatest-normal-exponent-base-e
+@defvrx constant flo:greatest-normal-exponent-base-2
+@defvrx constant flo:greatest-normal-exponent-base-10
+@defvrx constant flo:least-normal-exponent-base-e
+@defvrx constant flo:least-normal-exponent-base-2
+@defvrx constant flo:least-normal-exponent-base-10
+@defvrx constant flo:least-subnormal-exponent-base-e
+@defvrx constant flo:least-subnormal-exponent-base-2
+@defvrx constant flo:least-subnormal-exponent-base-10
+Least and greatest exponents of normal and subnormal floating-point
+numbers, as floating-point numbers.
+For example, @code{flo:greatest-normal-exponent-base-2} is the
+greatest floating-point number such that @code{(expt
+2. flo:greatest-normal-exponent-base-2)} does not overflow and is a
+normal floating-point number.
+@end defvr
+
+@deffn procedure flo:total< x1 x2
+@deffnx procedure flo:total-mag< x1 x2
+@deffnx procedure flo:total-order x1 x2
+@deffnx procedure flo:total-order-mag x1 x2
+These procedures implement the @acronym{IEEE 754-2008} total ordering
+on floating-point values and their magnitudes.
+Here the ``magnitude'' of a floating-point value is a floating-point
+value with positive sign bit and everything else the same; e.g.,
+@code{+nan.123} is the ``magnitude'' of @code{-nan.123} and @code{0.}
+is the ``magnitude'' of @code{-0.}.
+
+@itemize @bullet
+@item
+@code{Flo:total<} returns true if @var{x1} precedes @var{x2}.
+
+@item
+@code{Flo:total-mag<} returns true if the magnitude of @var{x1}
+precedes the magnitude of @var{x2}.
+
+@item
+@code{Flo:total-order} returns @math{-1} if @var{x1} precedes
+@var{x2}, @math{0} if they are the same floating-point value
+(including sign of zero, or sign and payload of NaN), and @math{+1} if
+@var{x1} follows @var{x2}.
+
+@item
+@code{Flo:total-order-mag} returns @math{-1} if the magnitude of
+@var{x1} precedes the magnitude of @var{x2}, etc.
+@end itemize
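+
+A brief sketch of how the total ordering differs from the numerical
+comparisons, with results derived from the @acronym{IEEE 754-2008}
+totalOrder definition (negative zero precedes positive zero, and
+positive NaNs follow positive infinity):
+
+@example
+@group
+(flo:total< -0. 0.)            @result{} #t
+(flo:safe< -0. 0.)             @result{} #f
+(flo:total< +inf.0 +nan.0)     @result{} #t
+(flo:total-mag< -2. 1.)        @result{} #f
+
+(flo:total-order 1. 1.)        @result{} 0
+(flo:total-order -0. 0.)       @result{} -1
+(flo:total-order 0. -0.)       @result{} 1
+(flo:total-order-mag -0. 0.)   @result{} 0
+@end group
+@end example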
+@end deffn
+
+@deffn procedure flo:make-nan negative? quiet? payload
+@deffnx procedure flo:nan-quiet? nan
+@deffnx procedure flo:nan-payload nan
+@code{Flo:make-nan} creates a NaN given the sign bit, quiet bit, and
+payload.
+@var{Negative?} and @var{quiet?} must be booleans, and @var{payload}
+must be an unsigned @math{(p-2)}-bit integer, where @math{p} is the
+floating-point precision.
+If @var{quiet?} is false, @var{payload} must be nonzero.
+
+@example
+@group
+(flo:sign-negative? (flo:make-nan @var{negative?} @var{quiet?} @var{payload}))
+                               @result{} @var{negative?}
+(flo:nan-quiet? (flo:make-nan @var{negative?} @var{quiet?} @var{payload}))
+                               @result{} @var{quiet?}
+(flo:nan-payload (flo:make-nan @var{negative?} @var{quiet?} @var{payload}))
+                               @result{} @var{payload}
+
+(flo:make-nan #t #f 42)        @result{} -snan.42
+(flo:sign-negative? +nan.123)  @result{} #f
+(flo:nan-quiet? +nan.123)      @result{} #t
+(flo:nan-payload +nan.123)     @result{} 123
+@end group
+@end example
+@end deffn
+
 @node Random Numbers,  , Fixnum and Flonum Operations, Numbers
 @section Random Numbers
 @cindex random number