Flonum Operations (MIT/GNU Scheme Reference Manual)

4.7.2 Flonum Operations

A flonum is an inexact real number that is implemented as a floating-point number. In MIT/GNU Scheme, all inexact real numbers are flonums. For this reason, constants such as 0. and 2.3 are guaranteed to be flonums.

MIT/GNU Scheme follows the IEEE 754-2008 floating-point standard, using binary64 arithmetic for flonums. All floating-point values are classified into:

normal

Numbers of the form

r^e (1 + f/r^(p-1))

where r, the radix, is a positive integer, here always 2; p, the precision, is a positive integer, here always 53; e, the exponent, is an integer within a limited range, here always -1022 to 1023 (inclusive); and f, the fractional part of the significand, is a (p-1)-bit unsigned integer.

subnormal

Fixed-point numbers near zero that allow for gradual underflow. Every subnormal number is an integer multiple of the smallest subnormal number. Subnormals were also historically called “denormal”.

zero

There are two distinguished zero values, one with “negative” sign bit and one with “positive” sign bit.

The two zero values are considered numerically equal, but they distinguish paths that converge to zero from opposite sides of a branch cut, so some operations yield different results for differently signed zero values.

infinity

There are two distinguished infinity values, negative infinity or -inf.0 and positive infinity or +inf.0, representing overflow on the real line.

NaN

There are 4 r^{p-2} - 2 distinguished not-a-number values, representing invalid operations or uninitialized data, distinguished by their negative/positive sign bit, a quiet/signalling bit, and a (p-2)-digit unsigned integer payload which must not be zero for signalling NaNs.

Arithmetic on quiet NaNs propagates them without raising any floating-point exceptions. In contrast, arithmetic on signalling NaNs raises the floating-point invalid-operation exception. Quiet NaNs are written +nan.123, -nan.0, etc. Signalling NaNs are written +snan.123, -snan.1, etc. The notations +snan.0 and -snan.0 are not allowed, because the bit patterns they would denote are the encodings of +inf.0 and -inf.0.

procedure: flo:flonum? object: Returns #t if object is a flonum; otherwise returns #f.

procedure: flo:= flonum1 flonum2

procedure: flo:< flonum1 flonum2

procedure: flo:<= flonum1 flonum2

procedure: flo:> flonum1 flonum2

procedure: flo:>= flonum1 flonum2

procedure: flo:<> flonum1 flonum2

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”. When floating-point exception traps are disabled, they return false when any argument is NaN.

Every pair of floating-point numbers — excluding NaN — exhibits ordered trichotomy: they are related either by flo:=, flo:<, or flo:>.
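As a small illustration of ordered trichotomy, the following sketch evaluates the three relations on a few non-NaN pairs. The sample values are arbitrary and chosen only for illustration.

```scheme
;; Exactly one of flo:=, flo:<, flo:> holds for each non-NaN pair.
(flo:< 1. 2.)                  ; ⇒ #t
(flo:= 1. 2.)                  ; ⇒ #f
(flo:> 1. 2.)                  ; ⇒ #f

;; The two zeros are numerically equal, so only flo:= holds.
(flo:= -0. 0.)                 ; ⇒ #t
(flo:< -0. 0.)                 ; ⇒ #f
(flo:> -0. 0.)                 ; ⇒ #f

;; Infinities take part in the ordering like any other non-NaN value.
(flo:< -inf.0 +inf.0)          ; ⇒ #t
```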

procedure: flo:safe= flonum1 flonum2

procedure: flo:safe< flonum1 flonum2

procedure: flo:safe<= flonum1 flonum2

procedure: flo:safe> flonum1 flonum2

procedure: flo:safe>= flonum1 flonum2

procedure: flo:safe<> flonum1 flonum2

procedure: flo:unordered? flonum1 flonum2

These procedures are the standard order and equality predicates on flonums. When compiled, they do not check the types of their arguments. These predicates do not raise floating-point exceptions, and simply return false on NaN arguments, except flo:unordered? which returns true iff at least one argument is NaN; in other words, they are “unordered comparisons”.

Every pair of floating-point values — including NaN — exhibits unordered tetrachotomy: they are related either by flo:safe=, flo:safe<, flo:safe>, or flo:unordered?.
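The tetrachotomy can be sketched as a four-way dispatch. The helper flonum-relation and the symbols it returns are hypothetical names introduced for illustration; they are not part of MIT/GNU Scheme.

```scheme
;; Hypothetical helper: name the one relation that holds between x and y.
;; Uses only the non-trapping comparisons, so NaN arguments are safe.
(define (flonum-relation x y)
  (cond ((flo:safe= x y) 'equal)
        ((flo:safe< x y) 'less)
        ((flo:safe> x y) 'greater)
        ((flo:unordered? x y) 'unordered)))

(flonum-relation 1. 2.)        ; ⇒ less
(flonum-relation -0. 0.)       ; ⇒ equal
(flonum-relation +inf.0 1.)    ; ⇒ greater
(flonum-relation +nan.0 1.)    ; ⇒ unordered
(flonum-relation +nan.0 +nan.0) ; ⇒ unordered
```

Because exactly one of the four relations holds for any pair, the cond never falls through.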

procedure: flo:zero? flonum

procedure: flo:positive? flonum

procedure: flo:negative? flonum

Each of these procedures compares its argument to zero. When compiled, they do not check the type of their argument. These predicates raise floating-point invalid-operation exceptions on NaN arguments; in other words, they are “ordered comparisons”.

(flo:zero? -0.)                ⇒ #t
(flo:negative? -0.)            ⇒ #f
(flo:negative? -1.)            ⇒ #t

(flo:zero? 0.)                 ⇒ #t
(flo:positive? 0.)             ⇒ #f
(flo:positive? 1.)             ⇒ #t

(flo:zero? +nan.123)           ⇒ #f  ; (raises invalid-operation)

procedure: flo:normal? flonum

procedure: flo:subnormal? flonum

procedure: flo:safe-zero? flonum

procedure: flo:infinite? flonum

procedure: flo:nan? flonum

Floating-point classification predicates. For any flonum, exactly one of these predicates returns true. These predicates never raise floating-point exceptions.

(flo:normal? 1.23)             ⇒ #t
(flo:subnormal? 4e-324)        ⇒ #t
(flo:safe-zero? -0.)           ⇒ #t
(flo:infinite? +inf.0)         ⇒ #t
(flo:nan? -nan.123)            ⇒ #t

procedure: flo:finite? flonum

Equivalent to:

(or (flo:safe-zero? flonum)
    (flo:subnormal? flonum)
    (flo:normal? flonum))
; or
(and (not (flo:infinite? flonum))
     (not (flo:nan? flonum)))

True for normal, subnormal, and zero floating-point values; false for infinity and NaN.

procedure: flo:classify flonum: Returns a symbol representing the classification of the flonum, one of normal, subnormal, zero, infinity, or nan.
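The relationship between flo:classify and the five classification predicates can be sketched as follows. The helper classify-by-predicates is a hypothetical name used only for illustration; it is not how flo:classify is necessarily implemented.

```scheme
;; Hypothetical helper: exactly one predicate is true for any flonum,
;; so the order of the clauses does not matter.
(define (classify-by-predicates x)
  (cond ((flo:normal? x) 'normal)
        ((flo:subnormal? x) 'subnormal)
        ((flo:safe-zero? x) 'zero)
        ((flo:infinite? x) 'infinity)
        ((flo:nan? x) 'nan)))

(classify-by-predicates 1.23)      ; ⇒ normal
(classify-by-predicates -0.)       ; ⇒ zero
(classify-by-predicates -inf.0)    ; ⇒ infinity
(classify-by-predicates +nan.123)  ; ⇒ nan

;; The result agrees with the built-in classifier.
(eq? (classify-by-predicates 4e-324)
     (flo:classify 4e-324))        ; ⇒ #t
```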

procedure: flo:sign-negative? flonum

Returns true if the sign bit of flonum is negative, and false otherwise. Never raises a floating-point exception.

(flo:sign-negative? +0.)       ⇒ #f
(flo:sign-negative? -0.)       ⇒ #t
(flo:sign-negative? -1.)       ⇒ #t
(flo:sign-negative? +inf.0)    ⇒ #f
(flo:sign-negative? +nan.123)  ⇒ #f

(flo:negative? -0.)            ⇒ #f
(flo:negative? +nan.123)       ⇒ #f  ; (raises invalid-operation)

procedure: flo:+ flonum1 flonum2
procedure: flo:- flonum1 flonum2
procedure: flo:* flonum1 flonum2
procedure: flo:/ flonum1 flonum2: These procedures are the standard arithmetic operations on flonums. When compiled, they do not check the types of their arguments.

procedure: flo:*+ flonum1 flonum2 flonum3

procedure: flo:fma flonum1 flonum2 flonum3

procedure: flo:fast-fma?

Fused multiply-add: (flo:*+ u v a) computes uv+a correctly rounded, with no intermediate overflow or underflow arising from uv. In contrast, (flo:+ (flo:* u v) a) may have two rounding errors, and can overflow or underflow if uv is too large or too small even if uv + a is normal. Flo:fma is an alias for flo:*+ with the more familiar name used in other languages like C.

Flo:fast-fma? returns true if the implementation of fused multiply-add is supported by fast hardware, and false if it is emulated using Dekker’s double-precision algorithm in software.

(flo:+ (flo:* 1.2e100 2e208) -1.4e308)
                               ⇒ +inf.0  ; (raises overflow)
(flo:*+ 1.2e100 2e208  -1.4e308)
                               ⇒ 1e308
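The single rounding of a fused multiply-add can also be used to recover the rounding error of an ordinary product. The following is a minimal sketch, assuming the default round-to-nearest mode; the names a, p, and err are illustrative only.

```scheme
;; a = 1 + 2^-52, the next flonum after 1.
(define a (flo:+ 1. flo:ulp-of-one))

;; The rounded product.  The exact product is 1 + 2^-51 + 2^-104,
;; and the last term is lost when rounding to 53 bits.
(define p (flo:* a a))

;; Without fusion, the product is rounded before the subtraction,
;; so the difference is zero.
(flo:- (flo:* a a) p)                               ; ⇒ 0.

;; With fusion, a*a - p is formed exactly and rounded once,
;; which recovers the lost term 2^-104.
(define err (flo:*+ a a (flo:negate p)))
(flo:= err (flo:* flo:ulp-of-one flo:ulp-of-one))   ; ⇒ #t
```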

procedure: flo:negate flonum

This procedure returns the negation of its argument. When compiled, it does not check the type of its argument.

This is not equivalent to (flo:- 0. flonum):

(flo:negate 1.2)               ⇒ -1.2
(flo:negate -nan.123)          ⇒ +nan.123
(flo:negate +inf.0)            ⇒ -inf.0
(flo:negate 0.)                ⇒ -0.
(flo:negate -0.)               ⇒ 0.

(flo:- 0. 1.2)                 ⇒ -1.2
(flo:- 0. -nan.123)            ⇒ -nan.123
(flo:- 0. +inf.0)              ⇒ -inf.0
(flo:- 0. 0.)                  ⇒ 0.
(flo:- 0. -0.)                 ⇒ 0.

procedure: flo:abs flonum
procedure: flo:exp flonum
procedure: flo:log flonum
procedure: flo:sin flonum
procedure: flo:cos flonum
procedure: flo:tan flonum
procedure: flo:asin flonum
procedure: flo:acos flonum
procedure: flo:atan flonum
procedure: flo:sinh flonum
procedure: flo:cosh flonum
procedure: flo:tanh flonum
procedure: flo:asinh flonum
procedure: flo:acosh flonum
procedure: flo:atanh flonum
procedure: flo:sqrt flonum
procedure: flo:cbrt flonum
procedure: flo:expt flonum1 flonum2
procedure: flo:erf flonum
procedure: flo:erfc flonum
procedure: flo:hypot flonum1 flonum2
procedure: flo:j0 flonum
procedure: flo:j1 flonum
procedure: flo:jn flonum
procedure: flo:y0 flonum
procedure: flo:y1 flonum
procedure: flo:yn flonum
procedure: flo:gamma flonum
procedure: flo:lgamma flonum
procedure: flo:floor flonum
procedure: flo:ceiling flonum
procedure: flo:truncate flonum
procedure: flo:round flonum
procedure: flo:floor->exact flonum
procedure: flo:ceiling->exact flonum
procedure: flo:truncate->exact flonum
procedure: flo:round->exact flonum: These procedures are flonum versions of the corresponding procedures. When compiled, they do not check the types of their arguments.

procedure: flo:expm1 flonum
procedure: flo:log1p flonum: Flonum versions of expm1 and log1p with restricted domains: flo:expm1 is defined only on inputs bounded below log(2) in magnitude, and flo:log1p is defined only on inputs bounded below 1 - sqrt(1/2) in magnitude. Callers must use (- (flo:exp x) 1) or (flo:log (+ 1 x)) outside these ranges.
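One way to respect these domain restrictions is to dispatch on the magnitude of the argument. The helpers flo-expm1-full and flo-log1p-full below are hypothetical names, not part of MIT/GNU Scheme, and the sketch assumes the bounds are exactly as stated above.

```scheme
;; Hypothetical wrapper: use flo:expm1 only where |x| < log(2).
;; flo:safe< returns false on NaN without trapping, so a NaN
;; argument falls through to flo:exp.
(define (flo-expm1-full x)
  (if (flo:safe< (flo:abs x) (flo:log 2.))
      (flo:expm1 x)
      (flo:- (flo:exp x) 1.)))

;; Hypothetical wrapper: use flo:log1p only where |x| < 1 - sqrt(1/2).
(define (flo-log1p-full x)
  (if (flo:safe< (flo:abs x) (flo:- 1. (flo:sqrt .5)))
      (flo:log1p x)
      (flo:log (flo:+ 1. x))))

(flo-expm1-full 0.)            ; ⇒ 0.
(flo-log1p-full 0.)            ; ⇒ 0.
(flo-expm1-full 5.)            ; same as (flo:- (flo:exp 5.) 1.)
```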

procedure: flo:atan2 flonum1 flonum2: This is the flonum version of atan with two arguments. When compiled, it does not check the types of its arguments.

procedure: flo:signed-lgamma x

Returns two values,

m = log(|Gamma(x)|)

and

s = sign(Gamma(x)),

respectively a flonum and an exact integer either -1 or 1, so that

Gamma(x) = s * e^m.
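The identity above can be sketched in code as follows. The helper gamma-from-signed-lgamma is a hypothetical name for illustration; for large x the exponential overflows, which is the usual reason to work with m and s directly.

```scheme
;; Hypothetical helper: rebuild Gamma(x) from the two returned values.
(define (gamma-from-signed-lgamma x)
  (call-with-values
      (lambda () (flo:signed-lgamma x))
    (lambda (m s)
      ;; s is an exact -1 or 1, so convert it before multiplying flonums.
      (flo:* (exact->inexact s) (flo:exp m)))))

(gamma-from-signed-lgamma 5.)   ; close to 24., since Gamma(5) = 4!
(gamma-from-signed-lgamma -.5)  ; negative, since Gamma(-1/2) < 0
```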

procedure: flo:min x1 x2
procedure: flo:max x1 x2: Returns the min or max of two floating-point numbers. If either argument is NaN, raises the floating-point invalid-operation exception and returns the other one if it is not NaN, or the first argument if they are both NaN.

procedure: flo:min-mag x1 x2
procedure: flo:max-mag x1 x2: Returns the argument that has the smallest or largest magnitude, as in minNumMag or maxNumMag of IEEE 754-2008. If either argument is NaN, raises the floating-point invalid-operation exception and returns the other one if it is not NaN, or the first argument if they are both NaN.

procedure: flo:ldexp x1 x2

procedure: flo:scalbn x1 x2

Flo:ldexp scales by a power of two; flo:scalbn scales by a power of the floating-point radix.

ldexp x e := x * 2^e,
scalbn x e := x * r^e.

In MIT/GNU Scheme, these procedures are the same; they are both provided to make it clearer which operation is meant.

procedure: flo:logb x

For nonzero finite x, returns floor(log(|x|)/log(r)) as an exact integer, where r is the floating-point radix.

For all other inputs, raises invalid-operation and returns #f.
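Together, flo:logb and flo:ldexp can split a flonum into a significand and an exponent. The following is a sketch under two assumptions that the text above does not state explicitly: that the second argument of flo:ldexp is an exact integer, and that x is a nonzero normal number. The helper flo-decompose is a hypothetical name.

```scheme
;; Hypothetical helper: return m and e with x = m * 2^e and 1 <= |m| < 2.
;; Assumes flo:ldexp takes an exact integer exponent.
(define (flo-decompose x)
  (let ((e (flo:logb x)))
    (values (flo:ldexp x (- e)) e)))

;; 10. = 1.25 * 2^3, so m and e should be 1.25 and 3, and scaling m
;; back up by 2^e should reproduce 10. exactly.
(call-with-values (lambda () (flo-decompose 10.))
  (lambda (m e)
    (flo:= (flo:ldexp m e) 10.)))
```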

procedure: flo:nextafter x1 x2

Returns the next floating-point number after x1 in the direction of x2.

(flo:nextafter 0. -1.)         ⇒ -4.9406564584124654e-324
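A few further cases, sketched here for illustration, show how flo:nextafter relates to the spacing of flonums near 1 and near zero.

```scheme
;; Stepping up from 1. lands exactly one ulp above it.
(flo:= (flo:nextafter 1. 2.)
       (flo:+ 1. flo:ulp-of-one))                    ; ⇒ #t

;; Stepping up from zero gives the smallest positive subnormal.
(flo:nextafter 0. 1.)          ; ⇒ 4.9406564584124654e-324

;; Stepping up and then back down returns to the starting point.
(flo:= (flo:nextafter (flo:nextafter 1. 2.) 0.) 1.)  ; ⇒ #t
```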

procedure: flo:copysign x1 x2

Returns a floating-point number with the magnitude of x1 and the sign of x2.

(flo:copysign 123. 456.)       ⇒ 123.
(flo:copysign +inf.0 -1.)      ⇒ -inf.0
(flo:copysign 0. -1.)          ⇒ -0.
(flo:copysign -0. 0.)          ⇒ 0.
(flo:copysign -nan.123 0.)     ⇒ +nan.123

constant: flo:radix
constant: flo:radix.
constant: flo:precision: Floating-point system parameters. Flo:radix is the floating-point radix as an integer, and flo:precision is the floating-point precision as an integer; flo:radix. is the floating-point radix as a flonum.

constant: flo:error-bound

constant: flo:log-error-bound

constant: flo:ulp-of-one

constant: flo:log-ulp-of-one

Flo:error-bound, sometimes called the machine epsilon, is the maximum relative error of rounding to nearest:

max |x - fl(x)|/|x| = 1/(2 r^(p-1)),

where r is the floating-point radix and p is the floating-point precision.

Flo:ulp-of-one is the distance from 1 to the next larger floating-point number, and is equal to 1/r^{p-1}.

Flo:error-bound is half flo:ulp-of-one.

Flo:log-error-bound is the logarithm of flo:error-bound, and flo:log-ulp-of-one is the logarithm of flo:ulp-of-one.
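The difference between the two constants shows up when adding them to 1. This is a minimal sketch, assuming the default round-to-nearest, ties-to-even rounding mode.

```scheme
;; The error bound is half the ulp of one.
(flo:= flo:error-bound (flo:/ flo:ulp-of-one 2.))   ; ⇒ #t

;; Adding a full ulp to 1. gives a different flonum.
(flo:= (flo:+ 1. flo:ulp-of-one) 1.)                ; ⇒ #f

;; 1 + error-bound lies exactly halfway between 1. and its successor;
;; the tie is resolved to the even significand, which is 1.
(flo:= (flo:+ 1. flo:error-bound) 1.)               ; ⇒ #t
```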

procedure: flo:ulp flonum

Returns the distance from flonum to the next floating-point number larger in magnitude with the same sign. For zero, this returns the smallest subnormal. For infinities, this returns positive infinity. For NaN, this returns the same NaN.

(flo:ulp 1.)                    ⇒ 2.220446049250313e-16
(= (flo:ulp 1.) flo:ulp-of-one) ⇒ #t

constant: flo:normal-exponent-max

constant: flo:normal-exponent-min

constant: flo:subnormal-exponent-min

Largest and smallest integer exponents of the radix in normal and subnormal floating-point numbers.

Flo:normal-exponent-max is the largest integer such that (expt flo:radix. flo:normal-exponent-max) does not overflow.
Flo:normal-exponent-min is the smallest integer such that (expt flo:radix. flo:normal-exponent-min) is a normal floating-point number.
Flo:subnormal-exponent-min is the smallest integer such that (expt flo:radix. flo:subnormal-exponent-min) is nonzero; the result of that expression is also the smallest positive floating-point number.

constant: flo:largest-positive-normal
constant: flo:smallest-positive-normal
constant: flo:smallest-positive-subnormal: Smallest and largest normal and subnormal numbers in magnitude.
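These constants sit at the boundaries between the classes described at the start of this section. The checks below are an illustrative sketch that uses only procedures documented here.

```scheme
;; The smallest subnormal is the ulp of zero.
(flo:= flo:smallest-positive-subnormal (flo:ulp 0.))            ; ⇒ #t
(flo:classify flo:smallest-positive-subnormal)                  ; ⇒ subnormal

;; The smallest normal is the boundary between the two classes:
;; one step toward zero from it is already subnormal.
(flo:classify flo:smallest-positive-normal)                     ; ⇒ normal
(flo:classify (flo:nextafter flo:smallest-positive-normal 0.))  ; ⇒ subnormal

;; The largest normal is still finite.
(flo:finite? flo:largest-positive-normal)                       ; ⇒ #t
```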

constant: flo:greatest-normal-exponent-base-e
constant: flo:greatest-normal-exponent-base-2
constant: flo:greatest-normal-exponent-base-10
constant: flo:least-normal-exponent-base-e
constant: flo:least-normal-exponent-base-2
constant: flo:least-normal-exponent-base-10
constant: flo:least-subnormal-exponent-base-e
constant: flo:least-subnormal-exponent-base-2
constant: flo:least-subnormal-exponent-base-10: Least and greatest exponents of normal and subnormal floating-point numbers, as floating-point numbers. For example, flo:greatest-normal-exponent-base-2 is the greatest floating-point number such that (expt 2. flo:greatest-normal-exponent-base-2) does not overflow and is a normal floating-point number.

procedure: flo:total< x1 x2

procedure: flo:total-mag< x1 x2

procedure: flo:total-order x1 x2

procedure: flo:total-order-mag x1 x2

These procedures implement the IEEE 754-2008 total ordering on floating-point values and their magnitudes. Here the “magnitude” of a floating-point value is a floating-point value with positive sign bit and everything else the same; e.g., +nan.123 is the “magnitude” of -nan.123 and 0.0 is the “magnitude” of -0.0.

The total ordering has little to no numerical meaning and should be used only when an arbitrary choice of total ordering is required for some non-numerical reason.

Flo:total< returns true if x1 precedes x2.
Flo:total-mag< returns true if the magnitude of x1 precedes the magnitude of x2.
Flo:total-order returns -1 if x1 precedes x2, 0 if they are the same floating-point value (including sign of zero, or sign and payload of NaN), and +1 if x1 follows x2.
Flo:total-order-mag returns -1 if the magnitude of x1 precedes the magnitude of x2, etc.
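The sketch below contrasts the total ordering with the numerical comparison and uses flo:total< as a sort predicate. The list contents are arbitrary; sort is the standard MIT/GNU Scheme sorting procedure.

```scheme
;; The numerical comparison treats the two zeros as equal ...
(flo:< -0. 0.)                 ; ⇒ #f
;; ... but the total ordering places -0. before 0.
(flo:total< -0. 0.)            ; ⇒ #t
(flo:total-order -0. 0.)       ; ⇒ -1
(flo:total-order 1. 1.)        ; ⇒ 0

;; Magnitudes ignore the sign bit.
(flo:total-order-mag -1. 1.)   ; ⇒ 0

;; Sorting works even when NaN is present: a NaN with positive
;; sign bit follows +inf.0.
(sort (list +nan.0 1. -inf.0 0. -0. +inf.0) flo:total<)
;; ⇒ (-inf.0 -0. 0. 1. +inf.0 +nan.0)
```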

procedure: flo:make-nan negative? quiet? payload

procedure: flo:nan-quiet? nan

procedure: flo:nan-payload nan

Flo:make-nan creates a NaN given the sign bit, quiet bit, and payload. Negative? and quiet? must be booleans, and payload must be an unsigned (p-2)-bit integer, where p is the floating-point precision. If quiet? is false, payload must be nonzero.

(flo:sign-negative? (flo:make-nan negative? quiet? payload))
                               ⇒ negative?
(flo:nan-quiet? (flo:make-nan negative? quiet? payload))
                               ⇒ quiet?
(flo:nan-payload (flo:make-nan negative? quiet? payload))
                               ⇒ payload

(flo:make-nan #t #f 42)        ⇒ -snan.42
(flo:sign-negative? +nan.123)  ⇒ #f
(flo:nan-quiet? +nan.123)      ⇒ #t
(flo:nan-payload +nan.123)     ⇒ 123