2. Basic Numerical Methods
Abstract
2.1. Basics of Floating Point Computation

2.1.1. Rounding error analysis. A floating point number system consists of numbers $x$ that can be represented as

$$x = m \cdot \beta^{e}, \qquad m = \pm d_0.d_1 d_2 \ldots d_t, \tag{2.1.1}$$

where $m$ is the mantissa, $e$ the exponent, $t$ the number of digits carried in the mantissa, and $0 \le d_i < \beta$. The integer $\beta$ is the base (usually $\beta = 2$). The mantissa is usually normalized so that $d_0 \ne 0$, and the exponent satisfies $L \le e \le U$. Hence, the floating point number system is characterized by the set $(\beta, t, L, U)$.

The result of an arithmetic operation on floating point numbers cannot, in general, be represented exactly as a floating point number. Rounding errors arise because the computer can represent only a subset $F$ of the real numbers; the elements of this subset are referred to as floating point numbers. Error estimates can be expressed in terms of the unit roundoff $u$, which for the floating point system (2.1.1) may be defined as $u = \frac{1}{2}\beta^{1-t}$ and equals the maximum relative error in storing a number. (Sometimes $u$ is defined as the smallest floating point number $\epsilon$ such that $fl(1 + \epsilon) > 1$. However, with this definition the precise value of $u$ varies even among different implementations of the standard IEEE arithmetic.)
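As a rough illustration of the unit roundoff defined above, the following Python sketch evaluates u = (1/2)β^(1−t) for the floats of the running interpreter and checks it against one stored number. It rests on assumptions not stated in the text: that Python floats are IEEE binary64 with round-to-nearest-even, that t counts all significant digits of the mantissa (53 for binary64, as reported by `sys.float_info.mant_dig`), and that the alternative definition refers to the smallest ε with fl(1 + ε) > 1. The helper name `unit_roundoff` is hypothetical.

```python
import sys
from fractions import Fraction


def unit_roundoff(beta: int, t: int) -> Fraction:
    """Return u = (1/2) * beta**(1 - t) as an exact rational number."""
    return Fraction(1, 2) * Fraction(beta) ** (1 - t)


# Assumption: Python floats are IEEE binary64, and t counts all significant
# digits of the mantissa (53 for binary64, including the implicit leading bit).
beta = sys.float_info.radix
t = sys.float_info.mant_dig
u = unit_roundoff(beta, t)

# Relative error committed when the decimal value 1/10 is stored as a float.
exact = Fraction(1, 10)
stored = Fraction(0.1)  # exact rational value of the stored float
rel_err = abs(stored - exact) / exact

# Assumption: round-to-nearest-even. Then 1 + u is a tie and rounds back to 1,
# while the next float above u already pushes the rounded sum past 1.
u_float = float(u)
just_above_u = u_float * (1.0 + sys.float_info.epsilon)

print("u =", u_float)
print("relative error of 0.1 within u:", rel_err <= u)
print("fl(1 + u) == 1:", 1.0 + u_float == 1.0)
print("fl(1 + just_above_u) > 1:", 1.0 + just_above_u > 1.0)
```

Exact rational arithmetic via `Fraction` is used so that the comparison of the storage error against u is not itself affected by rounding. The last two checks hint at why the alternative definition is sensitive to the rounding rule: the smallest ε with fl(1 + ε) > 1 depends on how ties are resolved.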