C++ Boost

Mathemetical Constants

Frequently Asked Questions (FAQs)


Why are there so many different formats?
Why are so many decimal digits provided?
What if I want even more accuracy?
Why are there so very many digits decimal digits for interval constants?
Why are the number of decimal digits different for float, double and long double interval constants.
Why are MACROs not sufficient and/or why are MACROs provided in a C++ library?
Why are functions returning the value of constants provided?
Does the library assume IEEE 754/IEC559 floating point?
Why not let the compiler calculate constants?
Why not use floating-point hardware constants?
Why not use the full accuracy of Intel floating-point hardware?
Can I expect complete portability with exactly the same result on all systems?
Why are there so many constants which can be calculated from others (like 2 pi)?
Why are there just three constants named like pi_f, pi and pi_l?
Why are functions like pi() which return constants provided?
Why are there also what appear to be floating point types like pi but are really functions ?
Why are the constants not defined in hexadecimal?
Why are the constants not defined as long double?
How are values of constants validated?
Why are there no physical constants?

Why are there so may different formats?

Because of conflicting  requirements from users, applications, hardware, compilers and language.

After more than a year of effort involving dozens of suggestions from Boost developers and others, there appear to be disadvantages to all proposed interfaces, however ingenious.

If all you want is double constants, simply #include header file constants.hpp.

Why are so many decimal digits provided?

Because 40 decimal digits are needed to ensure that internal floating point values are defined correctly after rounding for floating-point hardware up to 128 significand bits.
(For interval constants, twice as many as significand digits are needed to ensure that decimal digit strings are exactly representable C++ values).

If the Standard C float.h is available, the macro FLT_MANT_DIG may provide the number of digits (usually radix 2) in the significand (mantissa) that limits precision (but not the maximum or minimum values : these constrained mainly by the maximum exponent).
(Macro digits and numeric_limits<FPType>::digits include any hidden or implicit significand bits as these contribute to precision, so it is not sufficient to count the bits in the floating point layout.)

/* The number of base FLT_RADIX digits in the mantissa. */
#define _DBL_RADIX 2 /* exponent radix */
#define DBL_MANT_DIG 53 /* # of bits in mantissa */
#define LDBL_MANT_DIG 64 /* # of bits in mantissa */

The same values are available using Standard C++ <limits>, for example as:

int double_radix = std::numeric_limits< double>::radix; // 2 for IEEE FP
int double_mantissa_digits = std::numeric_limits<double>::digits;
// Usually 53 for IEEE 64-bit double.
int double_digits10 = std::numeric_limits<long double>::digits10;
// 15 decimal digits guaranteed output correctly for 64-bit double.

(But the 128-bit representation used by the AIX OS warns that that this value, defined as 106-bit for AIX long double, may be not tell the whole story – the actual number of bits of precision can vary.  There are even worse difficulties in calculating interval values for non-standard floating point formats and values given may not be fully reliable.)

To ensure no loss of accuracy when reading in, all floating point constants should be specified with at least a few decimal digits MORE than the value of ISO Standard 14882:1998 C++ Library limits for std::numeric_limits<type>:: which are only the number guaranteed to be correct on output . For example std::numeric_limits<double>::digits10 (equivalent to Standard C Library float.h DBL_DIG) is 15 for double using IEEE 754/IEC559 standard 64-bit real, 11-bit exponent, 52+1 = 53 bit significand  (mantissa) using Motorola 680X0 or Intel 8087 floating point coprocessor as incorporated in all Pentium chips. So double constants must be specified with at least 17 decimal digits. (Even many more decimal digits may be required to ensure that interval values are exactly representable – see below).

William Kahan in his report on the status of IEEE754 in 1996 gives two significant digits values for each format:

The lower value is at least
   floor ((n - 1) * log10(2)  significant digits, std::numeric_limits< >::digits10)
and the upper,
   ceil(1 + n * log10(2))  significant digits,
where n is number of significand bits (std::numeric_limits< >::digits).

For float, floor (24 –1) * 0.301  = 6 &
               ceil(1 + 24 * 0.301)     =   9,

& for double 15 and 17, extended (long double?) 18 & 21, quadruple 33 & 36). For example, for float:

If a decimal string with at most the lower significant digits (6) is converted to float precision and the converted back to the same number of significant digits (6), then the string should match the original 6 digits. This is

  int digits10 = int(floor(-1) std::numeric_limits<float>::digits * log10(2.))

that should be the same as

  std::numeric_limits<float>::digits10

and macro value   FLT_DIG.

If a float value is converted to a decimal string with at least the upper number of significant digits (9 for float), and then converted back to float, then the final number must match the original float number. For example, using Standard C++

#include <iostream>
#include <limits>
double const log10Two = 0.3010299956639811952137388947; // log10(2.)
pi = 3.1415926535897932384626433832795F;
int sig_digits10 =
int( ceil(1 + std::numeric_limits<float>::digits * 0.3010299956639811952137388947));
std::cout.precision(decimal_digits);
std::cout <<"float pi = " << pi << " using " <<
sig_digits10 << " decimal digits" << endl;

(But note that MACRO calculations only use integer arithmetic and cannot use floating-point functions ceil and floor. An approximation using only integer arithmetic is

2 + MANT_DIG * 30103 /100000

that gives the expected result for the significand digits listed below).

Note the asymmetry: the upper value is the number of digits that are significant on output and are required for accurate input, the lower value is the number guaranteed correct on output. The practically useful precision is given by bits/log10(2.)
 

Floating point Type

 
Often used for C/C++ type Total
bits
Significand
bits
(+ 1 if an implicit bit)
guaranteed digits10
decimal
digits
significant
decimal
digits
IEEE single float 32 23 + 1 = 24 6 9
VAXF float 32 23 + 1 = 24 6 9
IBM G/390 short float 32 24 6 9
IEEE single extended ? >=43 >=32 7 11
IEEE double double 64 52 + 1 = 53 15 17
VAXG double 64 52 + 1 = 53 15 17
IBM G/390 long double 64 56 15 17
VAXD long double 64 56 15 17
IEEE double extended long double 80 >=64 19 21
Sparc doubleextended (x86) long double 80 64 18 21
AIX quad long double 128 106 31 33
NTL quad ‘quad’
128 106 31 33
IBM G/390 extended long double 128 112 33 35
VAXH long double 128 112 + 1 = 113 34 36
IEEE quadruple long double 128 112 + 1 = 113 34 36
Sparc double extended long double 128 112 + 1 = 113 34 36
signed fractional ? 127 128 38 40
unsigned fractional ? 128 128 38 40
unsigned fractional ? 128 128 + 1 = 129 38 40

Semi-exhaustive tests using MSVC++.NET (compiler version 7.1) with C++ float and double types confirm that these numbers of significant digits can be output and re-input as expected.

For example, the following code should not fail the assert test for any normalised float value:

float a = 1.F;
float aa = 0;
std::stringstream s;
s.precision(9);
s << std::scientific << a;
s >> aa;
assert (a != aa);

All the constants here are believed reliable to 40 decimal digits and so should meet all these floating point formats, including C++ long double extended precision if the compiler uses IEEE 80-bit real, 15-bit exponent, 64-bit significand (or mantissa) format for which constants will need 21 decimal digits, and will even meet the requirements for 128-bit real arithmetic providing significand (mantissa) accuracy of 113-bit used by the Sun Sharc processor, and the AIX OS 106-bit, and floating-point types which can be emulated in software using, for example, C++ NTL quad_float class by Victor Shoup (derived from the doubledouble class of Keith Briggs).

The Intel Itanium IA-64 128-bit quad_precision is planned to be implemented in hardware (with 112-bit significand, 15 exponent, and sign), slightly more precise than 36 decimal digit constants required by other formats above. See www.intel.com , ADAG.pdf for IA-64 Application Developers Architecture Guide.

If all 128 bits are used as a fraction or significand (perhaps without sign and with an implicit most significant bit), 40 decimal digits would still be sufficient.

Floating-point hardware is unlikely to provide higher precision for some time to come. (For applications that require a higher precision, several software solutions are available, for example NTL, MPFUN, GMP, LIDIA - see references below).

What if I want even more accuracy?

Use a extended precision arithmetic or specialised Number Theoretic Package like Victor Shoup's NTL package, used to calculate these constants, or MPFUN, GNU GMP or LIDIA.

Why are there so very, very many decimal digits for interval constants?

To ensure that the value is 'exactly representable' for the floating point type.  This is necessary because the C++ Standard does not specify any rounding requirement.  The values must not need any rounding to be read into the internal floating point representation.

Sadly, but perhaps inevitably, the ISO C++ standard 14882:1998 [2.13.3.1] states:

 " [...] If the scaled value is in the range of representable values for its type, the result is the scaled value if representable else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner. [...] "

 The phrase “implementation-defined manner” implies that the compiler is not required to round to the nearest internal representation, although all compilers used so far appear to round as expected.

The effect of this is that if the number is not exactly representable (for example, if radix = 2 then 10. is not, but 2. is) then different implementations may convert the same decimal digits to binary values that differ by one least significant bit (or digit if radix is not two, for example ten or sixteen). The constants interval upper and lower limits are these two values, differing by at most one least significant bit (but the identical if the number, for example 2, is exactly representable).  To ensure that they are converted from decimal digits to the internal format, these interval limits must be exactly representable. Many decimal digits may be required to achieve this – nearly the number of bits in the significand.

So with sufficient decimal digits, compilers should convert constants from decimal digit characters to their internal floating-point format within 1 least significant bit (or digit).

Not all compilers achieve this and so some programs, for example the Cephes mathematical functions library, have provided constants (especially polynomial coefficients) in hexadecimal format (also proposed for future version of C and C++).  This ensures that the values are stored exactly, but also requires different hexadecimal values for each processor and floating-point format.

The method of providing decimal character digits is more portable but it does rely on compilers making the conversion from decimal character digits to internal floating-point binary format conforming to the C++ Standard.  In both cases, the results of storage and computation depend on the floating-point representation and so none are fully portable, as explained below.

(The unusually large number of decimal digits might cause overflow of non-compliant compilers that should accept, but ignore, unlimited numbers of digits. The only workaround for constant values is to round the values by hand or to use hexadecimal constants for the intervals.  This package does not attempt to cater for this non-conforming behaviour.)

Why are the number of decimal digits different for float, double and long double interval constants.

To ensure that they are exactly representable in the appropriate floating point type.  They also differ for different implementations of floating point types, for example long double may be stored in 64, 80 or 128 bit format.

Why are MACROs not sufficient and/or why are MACROs provided in a C++ library?

MACROs may pollute the global namespace, and are politically incorrect, but may be most efficient, for example in embedded systems where every byte is precious, and can be used in other languages, like C.

Why are functions returning the value of constants provided?

Because they are said to offer most scope for compiler optimisation (though some evidence suggests that the best compilers can produce 'optimal' code from any representation.. 

Does the library assume IEEE 754/IEC559 floating point?

No, except for smallest-interval values of constants.  (Support for other formats could be added).  NTL quad_float (similar to doubledouble as representation in 128 bits using 106 significand bits) interval values is an example of extension to a user-defined type (UDT) for which interval values have been calculated.

Why not let the compiler calculate constants?

Because existing compilers only use native built-in floating point precision, often only 64-bit with a guaranteed accuracy of only 15 decimal digits. Using these constants gives the best chance of portability.

For example, it may be a little better to use third_pi rather than pi /3. This is because the single constants are produced with a higher precision than the floating-point hardware and/or software that the compiler is also certain to use for its computation.

The constants are entirely fixed at compile time, so there is no code to call functions, for example, log, nor the need to include header files which define those functions, for example <cmath>.  This may speed compilation and reduce file accesses and dependency, and facilitate optimisation. When debugging code (often with fewer optimisation options) the constant is never calculated at run-time. Behaviour when debugging is independent of compiler type, version and compile options like debug/release and optimisation..

There is no evaluation of the function at program start time, nor wasted evaluation if the constant is never used, so programs may be slightly quicker.

Portability is improved because as there is no compile or run-time evaluation of constants, there can be no difference caused by different versions of functions, for example in different version of C math libraries, nor in different version of hardware floating point evaluation. Although in reality, this is rather unlikely to be a problem, the effort and time wasted proving that it is not a problem is likely to be significant.

The most potentially serious loss of accuracy occurs with transcendental functions: most seriously, the exponentiation function x raised to power y, pow(x, y) for x**y or x^y where y may not be an integer.

Why not use floating-point hardware constants?

Because only a few are available, they are often not the right size, are not available on all processors, and most important are sometimes not accurate enough. Using these constants gives the best chance of portability.

Intel X86 floating-point hardware, for example, includes instructions for providing floating-point constants pi and log base 2 (10). These are accurate to 64 significand bits, including guard and sticky bits, which is higher than for any constant read into a register or memory by the compiler, even from a more precise constant decimal digit string.

Internally, the Intel X86 floating-point hardware holds all numbers in extended precision real format (64 significand bits compared to external double of 53 significand bits). Extra bits are mainly to support rounding mode correctly. Calculations performed using floating point register to register operations will be performed with this 64-bit accuracy.  But operations that use external memory will have lower accuracy.  The compiler makes the choice between register and memory, so expecting completely portable accuracy is probably too optimistic. Using accurate literal constants probably gives the best chance of portability (but not necessarily the best possible accuracy).

File builtin_constants.h which provides a templated function for X86 hardware pi, and builtin_constants.cpp provides a demo, but these are NOT recommended.

Why not use the full accuracy of Intel Floating-point hardware?

The disadvantage of this is that it is less portable, but may give higher accuracy under certain conditions, but varying with compiler and compiler optimisation and other options.

For the Intel X86/Pentium extended precision significand is 64-bit, double is only 53-bit, and memory values are always 53-bit (plus sign and exponent to total 64 bits).  (See www.intel.com  27064004.pdf for 80C187 80-bit Math Coprocessor datasheet).

One method of ensuring that the IEEE extended double precision floating-point (typically the Intel X86/Pentium) is not used is to alter the appropriate bit in the processor “control word”.  This can only be done in assembler. See X86ControlWord.cpp for an example. C99 promises to provide a portable way of setting the precision.

The default value of the code word bit is NOT extended for Microsoft Windows, but IS extended for gcc/Linux.

The third way to fix the problem is to 'force' all intermediate floating point results into memory.  This is not an 'ideal' fix, since it may not be fully equivalent to 53-bit precision (because of double rounding), but it is reported to work (although a full proof of correctness in this case is not available). This 'force into memory' method is also used in the Boost Interval library.

NTL's quad_float code ensures use of 53-bit significands on all platforms by storing intermediate results in local variables declared to be 'volatile'. This is the solution to the problem that NTL uses if it detects the problem and can't fix it using the GNU 'asm' hack mentioned above. This solution should work on any platform that faithfully implements 'volatile' according to the ANSI C standard.  See www.shoup.net for more details.

Note that MSVC++ compilers up to version 8 implement long double exactly as double, IEEE 64-bit floating-point format. (This compiler does NOT use IEEE 80-bit extended precision available within the Intel Pentium or X86 floating point co-processor processor family instruction set).

C/C++ Pre-processor should calculate any long integers using 36 bits (P J Plauger, The Standard C Library, p 78) but none of the constants in constants.h are currently of integer type. (Some are floating point versions of integer values, for example: unity, zero, two, ten…).

Can I expect complete portability with exactly the same result on all systems?

No.  An important reason is that floating point types may use different representation and thus accuracy.  For example, long double may be 64, 80 or 128-bit, and even 64-bit representations may have different splits between significand (controlling accuracy) and exponent (controlling range).
But using these math constants will probably give the best portability.

Alas complexity of floating point is such that they cannot be absolutely guaranteed to give same result even if both use hardware conforming to IEEE 754/IEC559 floating-point specification. This is likely be a significant problem with iterative calculations at limit accuracy are involved, perhaps leading failureto converge or converging a to different solution. These are most troublesome when trying to prove the equivalence of different systems for computer validation. Because can predict that differences may arise one should not pessimistically assume that different output means that either system is faulty, nor assume that the systems are not equivalent.

Why are there so many constants which can be calculated from others (like 2pi)?

  1. Because interval values are also provided.
  2. Because it is convenient to have frequently used constants like two_pi defined.
  3. To avoid loss of accuracy in some constants, for example 4/3 * pi (though IEEE754 floating-point should compute 2 * pi without loss).
  4. Because non-radix-2 systems may lose accuracy in calculation (radix 10 systems exist - possible future systems might use f radix 12, 13, 42? ;-)

Why are there not just three constants named like pi_f, pi and pi_l?

Because this is seen as a C-style solution and unsuitable for taking advantage of C++.
 If this convention is desired, then the macros can be used to assign values, for example:

const long pi_l = BOOST_PI;

Why are functions like double pi() which return constants provided?

  1. Because some compilers may be able to produce smaller and/or faster code.
     (For example, note that MSVC 7 Standard edition only inlines  "__inline", so this will produce slower and longer code).
  2. It provides a framework whereby users can plug in special implementation and hardware-specific versions.

Why are there also what appear to be floating point types like pi but are really functions ?

To get the advantages of functions, like double pi(), but for clarity and convenience to write :

double area = pi * radius * radius; // pi rather than pi()

But there are several downsides:

Why are the constants not defined in hexadecimal?

  1. Because the C++ ISO14882:1998 standard does not permit this (though future revisions might, like the forthcoming C Standard revision).
  2. Because it is not portable between different floating point formats.
  3. Because 'exactly representable' decimal digits strings achieve the same effect (for standard conformant C++ compilers) in a more C++ friendly way.

Why are the constants not simply defined as long double?

Because this would rely on the compiler casting down to double or float which might lose accuracy through rounding, and this might vary with the optimisation. Interval constants are of course very different for different floating-point types.

How are values of constants validated?

For most values, Victor Shoup's NTL has been used to compute (and/or cross-check) the values in this file with a significand of 300 bits (about 100 decimal digits) and values are output as 40 decimal digit strings (or more digits if exactly representable interval limit values).  There is no typing or editing to introduce errors.

The numeric string values in the file constants.h are checked either one or both of two ways:
1 Values are consistent with values calculated by the compiler pre-processor.
2 Comparison with values in references tables: for example in D. E. Knuth, Appendix A (scanned in from the printed book and converted to a text file), or M. Abramowitz & Irene A. Stegun.

Some values are be tested with straightforward checks against compiler pre-processor calculated values, for example comparing:

  if (twoPi != 2. * pi) // may have a problem?

or C macro assert (which causes program termination if it fails - a more noticeable conclusion, but one that does not allow one to continue testing other constants).

  assert(twoPi == 2. * pi)

but many naively formulated tests appear to fail to the extent that the two values differ by the machine least significant significand bit, so instead it is better to test the absolute relative difference between values thus:

  if (fabs(twoPi - 2.L * pi)/ twoPi >= epsilon)  // Do have a problem with accuracy!

See D. E. Knuth and Alberto Squassabia for more guidance on comparison of reals.

(Epsilon is the smallest floating-point representation of the least-significant significand bit - see below.)

The equivalent using the C macro assert is:

  assert(fabs(sqrtTwo * sqrtTwo - 2.)/2. <= eps);

It is best to choose the constant to divide by to make the test relative to the constant value, rather than recomputing the check value.

Typical epsilon for type double for Intel X86 IEEE754/IEC559 hardware is 2.22e-016.

Minimum values specified by ISO 14882:1998 section 18.2.1 are defined as equivalent to the C Standard constants:

  #define FLT_EPSILON <float rvalue <= 10^(-5)>
  #define DBL_EPSILON <double rvalue <= 10^(-9)>
  #define LDBL_EPSILON <long double rvalue <= 10^(-9)>

The macros yield the smallest X of type float, double or long double such that 1.0 + X != 1.0.

C++ versions of these are provided by the ISO 14882:1998 Standard numeric limits template, for example:

  std::numeric_limits<float>::epsilon();
  std::numeric_limits<double>::epsilon()
  std::numeric_limits<long double>::epsilon();

The constants smallest or closest interval upper and lower limits should meet the following, shown as an example for pi:

  static const double pi_d = 3.1415926535897932384626433832; // closest.
  static const double pi_d_l = 3.141592502593994140625; // lower &
  static const double pi_d_u = 3.1415927410125732421875; // upper limit.

  (pi_d_l == pi_d) || (pi_d_u == pi_d) // must equal upper or lower, or both

and

float ul_pi = pi_d_l + pi_d_u; // upper + lower
 (ul_pi < 2.f * pi_d_u) && (ul_pi > 2.f * pi_d_l) // only one bit different

Why are there no physical constants?

Physical constants (like Planck's constant h) and their uncertainty change whereas mathematical constants (like pi) do not. Mixing them is therefore unwise.  A collection of C++ physical constants values and their uncertainty may be provided in future.


http:/www.hetp.u-net.com/public/faq.html, Revised 15 May 2005

© Copyright Paul A. Bristow 2002-2005. All Rights Reserved.

Valid HTML 4.01!