|
|
Mathemetical ConstantsFrequently Asked Questions (FAQs) |
Because of conflicting requirements from users, applications, hardware, compilers and language.
After more than a year of effort involving dozens of suggestions from Boost developers and others, there appear to be disadvantages to all proposed interfaces, however ingenious.
If all you want is double constants, simply #include header file constants.hpp.
Because 40 decimal digits are needed to ensure that internal
floating point values are defined correctly after rounding for
floating-point hardware up to
128
significand bits.
(For interval constants, twice as many as significand digits are needed
to ensure that decimal digit strings are exactly representable C++
values).
If the Standard C float.h is available, the macro
FLT_MANT_DIG may provide the number of digits (usually radix 2) in the significand (mantissa) that limits precision (but not the maximum or
minimum values : these constrained mainly by the maximum exponent).
(Macro digits and numeric_limits<FPType>::digits include any
hidden or implicit significand bits as these contribute to precision,
so it is not sufficient to count the bits in the floating point layout.)
/* The number of base FLT_RADIX digits in the mantissa. */
#define _DBL_RADIX 2 /* exponent radix */
#define DBL_MANT_DIG 53 /* # of bits in mantissa */
#define LDBL_MANT_DIG 64 /* # of bits in mantissa */
The same values are available using Standard C++ <limits>, for example as:
int double_radix = std::numeric_limits< double>::radix; // 2 for IEEE FP
int double_mantissa_digits = std::numeric_limits<double>::digits;
// Usually 53 for IEEE 64-bit double.
int double_digits10 = std::numeric_limits<long double>::digits10;
// 15 decimal digits guaranteed output correctly for 64-bit double.
(But the 128-bit representation used by the AIX OS warns that that this value, defined as 106-bit for AIX long double, may be not tell the whole story – the actual number of bits of precision can vary. There are even worse difficulties in calculating interval values for non-standard floating point formats and values given may not be fully reliable.)
To ensure no loss of accuracy when reading in, all floating point constants should be specified with at least a few decimal digits MORE than the value of ISO Standard 14882:1998 C++ Library limits for std::numeric_limits<type>:: which are only the number guaranteed to be correct on output . For example std::numeric_limits<double>::digits10 (equivalent to Standard C Library float.h DBL_DIG) is 15 for double using IEEE 754/IEC559 standard 64-bit real, 11-bit exponent, 52+1 = 53 bit significand (mantissa) using Motorola 680X0 or Intel 8087 floating point coprocessor as incorporated in all Pentium chips. So double constants must be specified with at least 17 decimal digits. (Even many more decimal digits may be required to ensure that interval values are exactly representable – see below).
William Kahan in his report on the status of IEEE754 in 1996 gives two significant digits values for each format:
The lower value is at least
floor ((n - 1) * log10(2) significant digits,
std::numeric_limits< >::digits10)
and the upper,
ceil(1 + n * log10(2)) significant digits,
where n is number of significand bits (std::numeric_limits<
>::digits).
For float, floor (24 –1) *
0.301 = 6 &
ceil(1 + 24 *
0.301) = 9,
& for double 15 and 17, extended (long double?) 18 & 21, quadruple 33 & 36). For example, for float:
If a decimal string with at most the lower significant digits (6) is converted to float precision and the converted back to the same number of significant digits (6), then the string should match the original 6 digits. This is
int digits10 = int(floor(-1) std::numeric_limits<float>::digits * log10(2.))
that should be the same as
std::numeric_limits<float>::digits10
and macro value FLT_DIG.
If a float value is converted to a decimal string with at least the upper number of significant digits (9 for float), and then converted back to float, then the final number must match the original float number. For example, using Standard C++
#include <iostream>
#include <limits>
double const log10Two = 0.3010299956639811952137388947; // log10(2.)
pi = 3.1415926535897932384626433832795F;
int sig_digits10 =
int( ceil(1 + std::numeric_limits<float>::digits * 0.3010299956639811952137388947));
std::cout.precision(decimal_digits);
std::cout <<"float pi = " << pi << " using " <<
sig_digits10 << " decimal digits" << endl;
(But note that MACRO calculations only use integer arithmetic and cannot use floating-point functions ceil and floor. An approximation using only integer arithmetic is
2 + MANT_DIG * 30103 /100000
that gives the expected result for the significand digits listed below).
Note the asymmetry: the upper
value is the number of digits that are significant on output and are required
for accurate input, the lower value is the number guaranteed
correct on output. The practically useful precision
is given by bits/log10(2.)
| Floating point Type |
Often used for C/C++ type | Total bits |
Significand bits (+ 1 if an implicit bit) |
guaranteed digits10 decimal digits |
significant decimal digits |
| IEEE single | float | 32 | 23 + 1 = 24 | 6 | 9 |
| VAXF | float | 32 | 23 + 1 = 24 | 6 | 9 |
| IBM G/390 short | float | 32 | 24 | 6 | 9 |
| IEEE single extended | ? | >=43 | >=32 | 7 | 11 |
| IEEE double | double | 64 | 52 + 1 = 53 | 15 | 17 |
| VAXG | double | 64 | 52 + 1 = 53 | 15 | 17 |
| IBM G/390 long | double | 64 | 56 | 15 | 17 |
| VAXD | long double | 64 | 56 | 15 | 17 |
| IEEE double extended | long double | 80 | >=64 | 19 | 21 |
| Sparc doubleextended (x86) | long double | 80 | 64 | 18 | 21 |
| AIX quad | long double | 128 | 106 | 31 | 33 |
| NTL quad | ‘quad’ |
128 | 106 | 31 | 33 |
| IBM G/390 extended | long double | 128 | 112 | 33 | 35 |
| VAXH | long double | 128 | 112 + 1 = 113 | 34 | 36 |
| IEEE quadruple | long double | 128 | 112 + 1 = 113 | 34 | 36 |
| Sparc double extended | long double | 128 | 112 + 1 = 113 | 34 | 36 |
| signed fractional | ? | 127 | 128 | 38 | 40 |
| unsigned fractional | ? | 128 | 128 | 38 | 40 |
| unsigned fractional | ? | 128 | 128 + 1 = 129 | 38 | 40 |
Semi-exhaustive tests using MSVC++.NET (compiler version 7.1) with C++ float and double types confirm that these numbers of significant digits can be output and re-input as expected.
For example, the following code should not fail the assert test for any normalised float value:
float a = 1.F;
float aa = 0;
std::stringstream s;
s.precision(9);
s << std::scientific << a;
s >> aa;
assert (a != aa);
All the constants here are believed reliable to 40 decimal digits and so should meet all these floating point formats, including C++ long double extended precision if the compiler uses IEEE 80-bit real, 15-bit exponent, 64-bit significand (or mantissa) format for which constants will need 21 decimal digits, and will even meet the requirements for 128-bit real arithmetic providing significand (mantissa) accuracy of 113-bit used by the Sun Sharc processor, and the AIX OS 106-bit, and floating-point types which can be emulated in software using, for example, C++ NTL quad_float class by Victor Shoup (derived from the doubledouble class of Keith Briggs).
The Intel Itanium IA-64 128-bit quad_precision is planned to be implemented in hardware (with 112-bit significand, 15 exponent, and sign), slightly more precise than 36 decimal digit constants required by other formats above. See www.intel.com , ADAG.pdf for IA-64 Application Developers Architecture Guide.
If all 128 bits are used as a fraction or significand (perhaps without sign and with an implicit most significant bit), 40 decimal digits would still be sufficient.
Floating-point hardware is unlikely to provide higher precision for some time to come. (For applications that require a higher precision, several software solutions are available, for example NTL, MPFUN, GMP, LIDIA - see references below).
Use a extended precision arithmetic or specialised Number Theoretic Package like Victor Shoup's NTL package, used to calculate these constants, or MPFUN, GNU GMP or LIDIA.
To ensure that the value is 'exactly representable' for the floating point type. This is necessary because the C++ Standard does not specify any rounding requirement. The values must not need any rounding to be read into the internal floating point representation.
Sadly, but perhaps inevitably, the ISO C++ standard 14882:1998 [2.13.3.1] states:
" [...] If the scaled value is in the range of representable values for its type, the result is the scaled value if representable else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. [...] "
The phrase “implementation-defined manner” implies that the compiler is not required to round to the nearest internal representation, although all compilers used so far appear to round as expected.
The effect of this is that if the number is not exactly representable (for example, if radix = 2 then 10. is not, but 2. is) then different implementations may convert the same decimal digits to binary values that differ by one least significant bit (or digit if radix is not two, for example ten or sixteen). The constants interval upper and lower limits are these two values, differing by at most one least significant bit (but the identical if the number, for example 2, is exactly representable). To ensure that they are converted from decimal digits to the internal format, these interval limits must be exactly representable. Many decimal digits may be required to achieve this – nearly the number of bits in the significand.
So with sufficient decimal digits, compilers should convert constants from decimal digit characters to their internal floating-point format within 1 least significant bit (or digit).
Not all compilers achieve this and so some programs, for example the Cephes mathematical functions library, have provided constants (especially polynomial coefficients) in hexadecimal format (also proposed for future version of C and C++). This ensures that the values are stored exactly, but also requires different hexadecimal values for each processor and floating-point format.
The method of providing decimal character digits is more portable but it does rely on compilers making the conversion from decimal character digits to internal floating-point binary format conforming to the C++ Standard. In both cases, the results of storage and computation depend on the floating-point representation and so none are fully portable, as explained below.
(The unusually large number of decimal digits might cause overflow of non-compliant compilers that should accept, but ignore, unlimited numbers of digits. The only workaround for constant values is to round the values by hand or to use hexadecimal constants for the intervals. This package does not attempt to cater for this non-conforming behaviour.)
To ensure that they are exactly representable in the appropriate floating point type. They also differ for different implementations of floating point types, for example long double may be stored in 64, 80 or 128 bit format.
MACROs may pollute the global namespace, and are politically incorrect, but may be most efficient, for example in embedded systems where every byte is precious, and can be used in other languages, like C.
Because they are said to offer most scope for compiler optimisation (though some evidence suggests that the best compilers can produce 'optimal' code from any representation..
No, except for smallest-interval values of constants. (Support for other formats could be added). NTL quad_float (similar to doubledouble as representation in 128 bits using 106 significand bits) interval values is an example of extension to a user-defined type (UDT) for which interval values have been calculated.
Because existing compilers only use native built-in floating point precision, often only 64-bit with a guaranteed accuracy of only 15 decimal digits. Using these constants gives the best chance of portability.
For example, it may be a little better to use third_pi rather than pi /3. This is because the single constants are produced with a higher precision than the floating-point hardware and/or software that the compiler is also certain to use for its computation.
The constants are entirely fixed at compile time, so there is no code to call functions, for example, log, nor the need to include header files which define those functions, for example <cmath>. This may speed compilation and reduce file accesses and dependency, and facilitate optimisation. When debugging code (often with fewer optimisation options) the constant is never calculated at run-time. Behaviour when debugging is independent of compiler type, version and compile options like debug/release and optimisation..
There is no evaluation of the function at program start time, nor wasted evaluation if the constant is never used, so programs may be slightly quicker.
Portability is improved because as there is no compile or run-time evaluation of constants, there can be no difference caused by different versions of functions, for example in different version of C math libraries, nor in different version of hardware floating point evaluation. Although in reality, this is rather unlikely to be a problem, the effort and time wasted proving that it is not a problem is likely to be significant.
The most potentially serious loss of accuracy occurs with transcendental functions: most seriously, the exponentiation function x raised to power y, pow(x, y) for x**y or x^y where y may not be an integer.
Because only a few are available, they are often not the right size, are not available on all processors, and most important are sometimes not accurate enough. Using these constants gives the best chance of portability.
Intel X86 floating-point hardware, for example, includes instructions for providing floating-point constants pi and log base 2 (10). These are accurate to 64 significand bits, including guard and sticky bits, which is higher than for any constant read into a register or memory by the compiler, even from a more precise constant decimal digit string.
Internally, the Intel X86 floating-point hardware holds all numbers in extended precision real format (64 significand bits compared to external double of 53 significand bits). Extra bits are mainly to support rounding mode correctly. Calculations performed using floating point register to register operations will be performed with this 64-bit accuracy. But operations that use external memory will have lower accuracy. The compiler makes the choice between register and memory, so expecting completely portable accuracy is probably too optimistic. Using accurate literal constants probably gives the best chance of portability (but not necessarily the best possible accuracy).
File builtin_constants.h which provides a templated function for X86 hardware pi, and builtin_constants.cpp provides a demo, but these are NOT recommended.
The disadvantage of this is that it is less portable, but may give higher accuracy under certain conditions, but varying with compiler and compiler optimisation and other options.
For the Intel X86/Pentium extended precision significand is 64-bit, double is only 53-bit, and memory values are always 53-bit (plus sign and exponent to total 64 bits). (See www.intel.com 27064004.pdf for 80C187 80-bit Math Coprocessor datasheet).
One method of ensuring that the IEEE extended double precision floating-point (typically the Intel X86/Pentium) is not used is to alter the appropriate bit in the processor “control word”. This can only be done in assembler. See X86ControlWord.cpp for an example. C99 promises to provide a portable way of setting the precision.
The default value of the code word bit is NOT extended for Microsoft Windows, but IS extended for gcc/Linux.
The third way to fix the problem is to 'force' all intermediate floating point results into memory. This is not an 'ideal' fix, since it may not be fully equivalent to 53-bit precision (because of double rounding), but it is reported to work (although a full proof of correctness in this case is not available). This 'force into memory' method is also used in the Boost Interval library.
NTL's quad_float code ensures use of 53-bit significands on all platforms by storing intermediate results in local variables declared to be 'volatile'. This is the solution to the problem that NTL uses if it detects the problem and can't fix it using the GNU 'asm' hack mentioned above. This solution should work on any platform that faithfully implements 'volatile' according to the ANSI C standard. See www.shoup.net for more details.
Note that MSVC++ compilers up to version 8 implement long double exactly as double, IEEE 64-bit floating-point format. (This compiler does NOT use IEEE 80-bit extended precision available within the Intel Pentium or X86 floating point co-processor processor family instruction set).
C/C++ Pre-processor should calculate any long integers using 36 bits (P J Plauger, The Standard C Library, p 78) but none of the constants in constants.h are currently of integer type. (Some are floating point versions of integer values, for example: unity, zero, two, ten…).
No. An important reason is that floating point types may use
different representation and thus accuracy. For example, long
double may be 64, 80 or 128-bit, and even 64-bit representations may
have different splits between significand (controlling accuracy) and
exponent (controlling range).
But using these math constants will probably give the best portability.
Alas complexity of floating point is such that they cannot be absolutely guaranteed to give same result even if both use hardware conforming to IEEE 754/IEC559 floating-point specification. This is likely be a significant problem with iterative calculations at limit accuracy are involved, perhaps leading failureto converge or converging a to different solution. These are most troublesome when trying to prove the equivalence of different systems for computer validation. Because can predict that differences may arise one should not pessimistically assume that different output means that either system is faulty, nor assume that the systems are not equivalent.
Because this is seen as a C-style solution and unsuitable for taking
advantage of C++.
If this convention is desired, then the macros
can be used to assign values, for example:
const long pi_l = BOOST_PI;
To get the advantages of functions, like double pi(), but for clarity and convenience to write :
double area = pi * radius * radius; // pi rather than pi()
But there are several downsides:
Because this would rely on the compiler casting down to double or float which might lose accuracy through rounding, and this might vary with the optimisation. Interval constants are of course very different for different floating-point types.
For most values, Victor Shoup's NTL has been used to compute (and/or cross-check) the values in this file with a significand of 300 bits (about 100 decimal digits) and values are output as 40 decimal digit strings (or more digits if exactly representable interval limit values). There is no typing or editing to introduce errors.
The numeric string values in
the file constants.h are checked either one or both of two ways:
1 Values are consistent with values calculated by the compiler
pre-processor.
2 Comparison with values in references tables: for example in
D. E. Knuth, Appendix A (scanned
in from the printed book and converted to a text file), or
M. Abramowitz & Irene A. Stegun.
Some values are be tested with straightforward checks against compiler pre-processor calculated values, for example comparing:
if (twoPi != 2. * pi) // may have a problem?
or C macro assert (which causes program termination if it fails - a more noticeable conclusion, but one that does not allow one to continue testing other constants).
assert(twoPi == 2. * pi)
but many naively formulated tests appear to fail to the extent that the two values differ by the machine least significant significand bit, so instead it is better to test the absolute relative difference between values thus:
if (fabs(twoPi - 2.L * pi)/ twoPi >= epsilon) // Do have a problem with accuracy!
See D. E. Knuth and Alberto Squassabia for more guidance on comparison of reals.
(Epsilon is the smallest floating-point representation of the least-significant significand bit - see below.)
The equivalent using the C macro assert is:
assert(fabs(sqrtTwo * sqrtTwo - 2.)/2. <= eps);
It is best to choose the constant to divide by to make the test relative to the constant value, rather than recomputing the check value.
Typical epsilon for type double for Intel X86 IEEE754/IEC559 hardware is 2.22e-016.
Minimum values specified by ISO
14882:1998 section 18.2.1 are defined as equivalent to the C Standard
constants:
#define FLT_EPSILON <float rvalue <=
10^(-5)>
#define DBL_EPSILON <double rvalue <= 10^(-9)>
#define LDBL_EPSILON <long double rvalue <= 10^(-9)>
The macros yield the smallest X of
type float, double or long double such that 1.0 + X != 1.0.
C++ versions of these are
provided by the ISO 14882:1998 Standard numeric limits template, for
example:
std::numeric_limits<float>::epsilon();
std::numeric_limits<double>::epsilon()
std::numeric_limits<long double>::epsilon();
The constants smallest or closest interval upper and lower limits
should meet the following, shown as an example for pi:
static const double pi_d = 3.1415926535897932384626433832; // closest.
static const double pi_d_l = 3.141592502593994140625; // lower
&
static const double pi_d_u = 3.1415927410125732421875; // upper
limit.
(pi_d_l == pi_d) || (pi_d_u == pi_d) // must equal upper or lower, or
both
and
float ul_pi = pi_d_l + pi_d_u; // upper + lower
(ul_pi < 2.f * pi_d_u) && (ul_pi > 2.f * pi_d_l) // only one bit different
Physical constants (like Planck's constant h) and their uncertainty change whereas mathematical constants (like pi) do not. Mixing them is therefore unwise. A collection of C++ physical constants values and their uncertainty may be provided in future.
http:/www.hetp.u-net.com/public/faq.html, Revised 15 May 2005
© Copyright Paul A. Bristow 2002-2005. All Rights Reserved.