Sierpinski (<1K) THE ART OF FUNCTIONS


Sierpinski (<1K) Functions of a Single Variable
A  function f is a single-valued mapping from a set X (the  domain  of f) to a set Y (the  range of f). Functions of interest to programmers include sin,cos,tan,Ö, and log.

    There are four basic ways to computaionally evaluate functions:
(i) Using a Look Up Table, possibly with some form of iterpolation (see Look Up Tables).
(ii) Using a "bit-shifting" algorithm similar to those above for multiplication and division that "builds" the result.
(iii) Iterative Methods using a sequence that approaches the desired value (see Reccurance Relations).
(iv) Evaluating an Approximation Function (typically a polynomial) which is close to the desired function over the required range.

Look Up Tables
 Lookup tables ( LUTs) provide fast but often imprecise evaluation of functions. They can take up a lot of memory but may vastly improove performance. Designing look up tables and the code to access them is an exercise in balancing storage and speed against accuracy. To discuss look up tables in general we will take as a simple example the problem of evaluating sin(x) where x is measured in 256ths of a rotation for any x Î Z255.
    We note first of all that if we know sin(x) over the range [0,64] then we can readily calculate sin(x) for any x Î Z255. If we define a function s11:Z255 ® Z64 by
s11(x)=x if 0 < x < 64
=128-x if  64 £ x < 128
=x-128 if 128 £ x < 192
=256-x if 192 £ x < 256

then |sin(x)|=|sin(s11(x))| " x Î Z255.
    This means that we need only tabulate sin(x) for x Î Z64.
    Now suppose we only have memory available for 33 entries in our sine LUT. Since we cannot tabulate sin(0),sin(1),sin(2),...sin(64) we tabulate sin(0),sin(2),sin(4),...sin(64) and we combine s11 with s12(x) = [x/2| to give a contraction function s1(x)=[s11(x)/2] which converts our functional argument into a LUT index. If we want to know sin(x) for some x Î Z255 we use the s1(x)th element of our lookup table.
    If each of our LUT entries is just 8 bits wide then each entry can only contain an integer in the range 0 to 255. Since sin(x) is a postive fraction for x Î Z64 we store "scaled up" sines and tabulate S[i]  = Min{[28sin(2i)+(½)|,255} over i Î Z32. All that remains is to calculate the sign of our result. If we define
s22(y,x)=y if 0 £ x < 128
=-y if 128 £ x < 256

and set s2(x,y)=2-8s22(x,y) then our estimation of sin(x) for x Î Z255 is simply s2(x,S[s1(x)]).
We refer to s2 as an  expansion function, it converts our tabulated integers into final values.

    Just how accurate is this method of approximating sin(x)?
    Suppose we want to calculate sin(5). Since we have only tabulated sin for even arguments we will actually return sin(4) giving an absolute error of -0.02. But this assumes that sin(4) is evaluated exactly and this will be subject to an absolute error of 2-9 due to the limited width of the LUT entry.
    This latter error is a  thinness or  quantisation error due to the limited accuracy of the LUT entries. These are minimised by tabulating a function with as small a variation as possible, scaled and translated to fully use the given LUT entry width.
    The former is a  spareseness error and is due to the limited number of LUT entries. These can be reduced by tabulating as small a segment of the desired function as possible and by some form of interpolation on the tabulated values.
    Note however that the linear function s(x) = x/64 lies below and within 0.21052 of sin(x) over [0,64] and if we exploit this by tabulating S[i] = [29(sin(2i)-i/32)| for i Î Z32 we can reduce the absolute error in sin(2i) from 2-9 to 2-11.

    In general, we approximate a desired function F: X ® Y via an Nentry LUT by F(x) =f2(F[f1(x)],F[h(f1(x))],x)
where f1 is a contraction function mapping X ® ZN;
    f2 is an expansion function mapping ZM ZM X ® Y;
    F[i] denotes the ith element of the table (assumed in ZM).
    (This is not truly general since it is possible to use more than two LUT entries for a given evaluation, eg. fitting a quadratic to three tabled points.)

    In choosing the contraction function f1 we first look for a representative portion of X with respect to F. This is a subset X* of X such that if we know F over X* we can readily extend to X (in the sin example Z64 was a representative portion of Z255 with respect to sin).  f1 is then the composition of two maps:
f11:X ® X* defined so F(x) is easily derived from  F(f11(x)) " x Î X.
f12:X* ® ZN.

    The expansion function is also the composition of two maps:
    f21, maps ZM ZM X to Y and attempts to regain the accuracy lost due to f12;
    f22, maps Y X ® Y and attempts to compensate for f11.

    The art of LUT definition lies in choosing expansion and contraction functions that minimise the total error (arising from thinness and sparseness errors) without requiring excessive calculation or memory.

ThinnessErrors
     To minimise the thinness errors we tabulate the deviation of F from a (usually fairly crude) approximation function f:X ® Y for F over X* rather than F itslf, preferably scaled up to use the full width of the table entries. The simplest choice of this  base function f is a constant, obvious choices being either the minimum or the median value taken by F over X*.
    We then tabulate F[i] = [K(F(f1-1(i))-f(f1-1(i))|    " i Î ZN
where K may be chosen so that F[i]max is as large as possible within ZM.
    Our sin example used base functions of s(x)=0 (giving an absolute error in sin(2i) of 2-9), and s(x) = x/64 giving an absolute error in sin(2i) of 2-11. (occuring at (switching to radians)  x=cos1(2/p) = 0.28 giving a difference of approximately 0.21051).
    Any base function used must be catered for in the first part of the expansion function. When s(x)=0 we can get away with s22(y1,y2,x) = 2-8y1 but we must compensate for s(x)=x/64 with s22(y1,y2,x) = 2-10y1+2-6s1(x).
    A suitable base function over X*=[a,b] is often given by g(x)=f(a)+(x- a)(f(b)-f(a))/(b-a). The maximal deviation of g from f then occurs at x=f'1((f(b)-f(a))/(b-a)).



Sparseness Errors

    These are usually a greater problem than thinness errors. To halve thinness areas it suffices to add a single bit to each LUT entry. To halve sparseness errors one must double the number of entries.
    If X* = [a,a+b] and we map it to ZN by f12(x) = [N(x-a)/b] then the sparseness errors are less than f(x+b/N)-f(x) where x is that point in [a,a+b] where f(x) has maximal absolute gradient - approximately bf'(x)/N for small b/N.
    In the sin example our sparseness errors are high. b/N is 2 and x=0 so our sparseness errors are potentially S(2)-S(0). Since we will only actually want S(n) for integral n our worst sparseness error will actually be when S(1) evaluates to S(0)=0 instead of &0.0324.
    We can reduce sparseness errors  by incorporating linear interpolation of the LUT values into f22. To do this we need an approximator G(x) for the gradient f'(x) at out tabulated point. The interpolated value from a base evaluation f(a) is then given by g(a+x) = f(a) + xG(a). G(a) can be obtained from a sperate LUT for the derivative function, or obtained from the LUT of the function itself. One approximator for f'(a) is given by G(a)=(f(a+b)- f(a))/(ba) where b is the smallest value such that f(a+b) may be evaluated from the table distinct from f(a). This particular strategum will be refered to here as  naive linear interpolation.
    Naive linear interpolation is very easy for s12(x)=[x/2], s(x)=0. We simply set
s22(y1,y2,x)=2-8y1 if s11(x) is even
=2-9(y1+y2) if s11(x) is odd

With this choice of G then for x<b and provided b is small so that we can apply Taylor series:
f(a+x)-g(a+x)»f(a)+xf '(a)+x2f "(a)/2+o(x3)-f(a)-bf '(a)-b2f "(a)/2o(b3)
=f'(a)x(x-b)/2 + o(x3,b3).

and the maximal error occurs near x=b/2 and is approximately f "(a)b2/8. This provides a quicker error estimation than calculating f(x)g(x) where x=f '-1((f(b)-f(a))/(b-a)) as described under Thinness Errors.

    In our example the error involved in calculating sin(2i+1) from sin(2i) and sin(2i+2) is -(2p/256)2sin(2i)/2 < 2-15p2 < &.001A.
    In summary, if  our sin LUT is set up as S[i]=[29(sin(2i)-i/32)| for iÎZ32 and used by
S(x)=h(x)(2-10S[s1(x)]+2-6s1(x)) if x is even
=h(x)(2-11(S[s1(x)]+S[s1(x)+1])+2-6s1(x)+2-7) if x is odd

where h(x) =  1 if 0<x<128 , -1 if 128<x<255;
s1(x)=[x]  if 0 £ x<64
=[128-x] if 64 £  x <128
=[x-128] if 128 £ x <192
=[256-x] if 192 £ x <256

then S(x) is accurate to approximately +&.002A.

    Suppose now we can have 65 bytes in the LUT. Should we double the width of each entry or the number of entries?
    The former eliminates all the thinness errors because the width of the LUT entry equals the required result width. Because of this, another advantage is that s(x)=0 is just as good as s(x)=x/64.
    The latter eliminates all the sparseness errors because the length of the table then equals the length of X*.
    But if we are using linear interpolation the sparseness errors are small compared to the thinness ones so it is better to "widen" the LUT than "lengthen" it in this as in many other cases.
    Note that this is not the optimal way to code sine LUTs. We can do better than the G described above by exploiting the fact that the derivative of sine is cosine. This is described in Sine and Cosine Functions below.

    Look up table methods are considered in depth later on. We will here consider how the other three methods may be used to evaluate some of the more common functions needed by games programmers.



Bit-Shifting Methods
"And let your speach be Yea, yea; Nay, nay; and whatsoever is more is of the evil one" --- Matthew 5:37
    We will use the notation x{i} to denote the value (0 or 1) of the bit of the operand x corresponding to 2i. Thus x = åi=-¥+¥ x{i}2i.
There are two basic sorts of bit shifting algorithm to evaluate y=f(x).

(a) Operand-Based

We build up y working from the bits of x. This approach requires an identity of the form f(a+b)=g(a,b,f(a),f(b)). A typical algorithm would be:
(i) Set y0=f(0); i=<lowest bit position for x>. s=x{i}2i
(ii) If x{i}=1 then yi+1 = g(s,2i,yi,f(2i)),s=s+2i else yi+i = yi.
(iii) i=i+1. Repeat (ii) unless x{j}=0 " j ³ i.

    If g is of the form g(f(a),f(b)) then we do not need to maintain s. This approach usually requires a small lookup table for f(2i).

(b) Result-Based We build up the bits of y exploiting the monotinicity of f-1.
    If f-1 increases we would use:
(i) Set y0=0; N=<maximal posible bit position for f(x))
(ii) If f-1(yn+2N-n) ³ x then yn+1= yn+2N-n else yn+1= yn
(iii) i=i-1. Repeat (ii) unless i<<lowest desired bit posion for y>
 This method is equivalent to a binary search for the value of y solving ¦-1(y)=x.

Iterative Methods
We iterate some sequence or sequences a number of times, where one sequence tends to the desired result. See the chapter on Recurrence Relations.

Approximation Functions
We evaluate a different function, typically a polynomial, that approximates the desired function over the desired domain. We may use different approximation functions over different parts of the domain.

For any function p(x) which is meant to approximate f(x) over a given range [a,b] we can construct an error function e(x)=p(x)-f(x) which we would like to be as small as possible over [a,b].

If p is constrained to be a particular type of function (say, a polynomial of degree n) then  we want the function of that type that "minimises" |e(x)| over [a,b] (rather than naively taking the first n terms of the Taylor series, for example). Unfortunately it is not immediately clear what it means to "minimise" a function over a range. The two most common criteria for error minimilisation are minimising E¥ = MaxxÎ[a,b]|e(x)| (the  minimax approximator) and minimising E2 =  òabe(x)2 dx (the  least squares approximator).  The former gives the smallest maximal deviation of p from f, while the latter minimises the "overall" deviation.

    We will show below how to find the minimax and least square polynomials of degree n appproximating a function f over [a,b].

Least Squares Polynomials
If we write ci for the coefficient of xi in our polynomial p(x) then for p to be a least squares approximator for f we require
dE2 / dci =0 i Î Zn+1
Ûd/dci[ òab(p(x)-f(x))2  dx]=0 i Î Zn+1
Û òabd/dci[(p(x)-f(x))2] dx=0 i Î Zn+1
Û òabd/dci[(Sj=0.ncjxj - f(x))2] dx=0 i Î Zn+1
Û òab2(Sj=0.ncjxj - f(x))d/d(Sj=0.ncjxj) dx=0 i Î Zn+1
Û òab2(Sj=0.ncjxj - f(x))xi dx=0 i Î Zn+1
Û òabSj=0.ncjxi+j dx= òabxif(x) dx i Î Zn+1
ÛSj=0.n òabcjxi+j dx= òabxif(x) dx i Î Zn+1
ÛSj=0.n(bi+j+1 - ai+j+1)/(i+j+1)= òabxif(x) dx i Î Zn+1
ÛAc=y i Î Zn+1

where Aij = (bi+j+1 - ai+j+1)/(i+j+1) i,j Î Zn+1. ; yi =  òabxif(x) dx   i Î Zn+1.
y can be obtained either by hand or computationally using, for example, Simpsons Method for integration of a given range. The (n+1)(n+1) linear equations Ac = y can then be solved using Gaussian Elimination. This, and Simpson's Method, are both extensively covered in conventional Computer Algorithm books and accordingly are not described here.

Minimax Polynomials
Finding minimax polynomials is less straightforward. An algorithm does exist, however, the  Exchange Algorithm, which we will describe below. Any reader wishing to see a rigorous methematical justification for this algorithm should consult (for example) Approximation Theory and Methods (M.J.D.Powell, Cambridge University Press 1981).

    The basic principle is that the error function of a minimax polynomial p of degree n approximating a function f over [a,b] will achieve its greatest magnitude h at exactly n+2 different values of x in the range [a,b] and, further, that if these values are {h0,h1,...,hn+1} where a£h0<h1<.....<hn+1£b then
e(hi) =  (-1)ig i=0,1,..n+1  where |g|=h. (The errors alternate in sign).
    Suppose we wanted to fit a polynomial of degree n to the n+2 points (xi,f(xi)) i=0,1,..n+1.
    Any polynomial that minimises Maxi=0,n+1|p(xi)-f(xi)| satisfies  p(xi)-f(xi) = (-1)ie    i=0,1,..n+1 for some e, and, conversely, any polynomial of degree n that satisfies this for some e does minimise Maxi=0,n+1|p(xi)-f(xi)|. We must also have that |e|£h since if |e|>h then the minimax polynomial over [a,b] would give a smaller Maxi=0,n+1|p(xi)-f(xi)|.

    If we happened to know what the values h0,h1,...,hn+1 were then we could solve the n+2 linear equations
    p(hi)-f(hi) =  (-1)ig i=0,1,..n+1
to find g and, more importantly, the n+1 coefficients of the minimax polynomial p.
    Since we do not know these values we choose any set S0 of n+2 values from [a,b]. S0={xi : i=0,1,..n+1} where a£x0<x1<...<xn+1£b.
    xi=a+i(b-a)/(n+1) would do but, for reasons that need not concern us here, the  Chebyshev Points
    xi = (a+b)/2 +(ba)cos((n+i+½)p/(n+1))/2
are a more appropriate choice.
    At stage M we find the minimax polynomial pM for f over SM by solving the n+2 linear equations
     pM(xi)-f(xi) = (-1)ieM     i=0,1,..n+1.
by Gausssian elimination to find the n+1 coefficinets of pM and the unknown deviation eM.
    We then find one or more values {xj: j=0,1,..k} (where k ³ 0) that maximise |pM(x)-f(x)| over [a,b]. If these points equal some of the points in Sm then we have |eM|=h and pM is minimax over [a,b]. Otherwise we replace one or more of the values in SM with these new values such that the new set SM+1 still satisfies a£x0<x1<...<xn+1£b and that xi is only replaced by xj if pM(xi)-f(xi) and pM(xj)-f(xj) have the same sign.
    Having obtained SM+1 we then find pM+1, use that to obtain SM+2, and so on.
    Each stage will bring our set of values closer to the "worst possible" set of values {h0,h1,...,hn+1} and at every stage we will have |eM+1|³|eM|. The algorithm terminates at stage N when |eN+1| - |eN|<d for some small d since we will then have Maxi=0,n+1|pN(xi)-f(xi)| < h+d.

Minimax Rational Functions
    If we allow a division in our approximation function then we have the full range of rational functions available to us. A rational function over [a,b] is one of the form r(x) = p(x)/q(x) where p(x) is a polynomial of degree m and q(x) is a polynomial of degree n satisfying q(x)>0 for x Î [a,b].

The exchange algorithm described above may be used to find minimax rational approximation functions using sets of m+n+2 points although it is no longer the case that at stage M we can find the minimax polynomial rM for f over SM by solving a set of linear equations; instead we are faced with:

pM(xi) = [f(xi)-(-1)ieM]qM(xi)     i=0,1,.,m+n+1.

Though it is possible to solve these directly using eigenvalues (see Approximation Theory and Methods) we instead adobt the tools of linear programming such as the  Primal Simplex Algorithm.

    We choose an estimation h for the deviation eM and look for a rational function rM satisfying |f(xi)-rM(xi)|£h for i=0,1,.,m+n+1. That is, we look for polynomials p and q satisfying
    p(xi) - f(xi)q(xi£ hq(xi)      i=0,1,..m+n+1
    f(xi)q(xi) - p(xi£  hq(xi)      i=0,1,..m+n+1
    q(xi) > 0     i=0,1,..m+n+1.
    If we can find such p and q then we know that h³ eM. If such p and q can be shown not to exist then we know that h<eM. This enables us to find eM to any required accuracy (using a binary search technique) and thus pM and qM.

Finding coefficients of p and q to satisfy these equalities is a linear programming problem. However we can achieve better performance in terms of h's approach to eM by using the differential correction algorithm. We start stage M with h=eM-1 and look for polynomials p and q and scalar d³0 satisfying the m+n+3 equations:
     p(xi) - f(xi)q(xi£hq(xi) - dqM-1(xi)      i=0,1,..m+n+1
    f(xi)q(xi) - p(xi£hq(xi) - dqM-1(xi)      i=0,1,..m+n+1
    q(xk) = 1     for a fixed kÎ{0,1,..,n+m+1}.

    If such can be found with d=0 then p=pM, q=qM, h=eM. Otherwise d>0 and we replace h with Maxi=0,..m+n+1|f(xi)-p(xi)/q(xi)| and repeat. h will tend to eM from above.
    For stage 1 we define e0 to be anything known to be larger than the optimal minimax rational approximation error over [a,b], and  q0(x) to be identically equal to 1.

It only remains to reformulate the above inequalities into the standard form usual for linear programming techniques. An appropriate reformulation is obtained by writing the coefficient pr of xr in p as pr0-pr1 (r=0,1,..,m) and the coefficient qs of xs in q as qs0-qs1 (s=0,1,..,n) and also introducing slack variables yi and zi for i=0,1,..,m+n+1. We then have the linear programming problem:
    Minimise åj=0m+n+1 (yj+zj) subject to the 2(m+n)+3 constraints
    år=0m (pr0-pr1)xir - (f(xi)+h)[ås=0n (qs0-qs1)xis] + yi =   - dqM-1(xi)      i=0,1,..m+n+1
    -år=0m (pr0-pr1)xir + (f(xi)-h)[ås=0=n(qs0-qs1)xis ]  + zi=   - dqM-1(xi)     i=0,1,..m+n+1
    åj=0n (qj0-qj1)xkj   =  1
    over pr0,pr1,qs0,qs1,yi,zi, d ³ 0.

Basic Math Functions


Square Root function
Square roots are often required, occasionally in the form Ö(1+x). In the presence of log and antilog functions one can exploit the trivial result Öx = antilog( log(x)/2).

Bit-shifting Methods

     It is reasonably straightforward to evaluate square roots explicitly in time proportional to the accuracy required.

    In the following discussion we will assume that the problem is to evaluate, to N bit accuracy, the square root of a fractional operand x. We assume x to be a positive value strictly less than one. This means that the result y=Öx will be of the same order no matter how wide x is. If x is four bytes wide we will regard it as being in (232)ths, if it is two bytes wide we will regard it as being in (2-16)ths. In both cases, if N=8 the result will be in (2-8)ths.

    This assumption is not limiting. If we wish x to represent, say, a 32-bit integer then y=Ö(x/232) need only be multiplied by 216 to give the correct result. Either algorithm with N=16 will return y in (2-16)ths which is effectively the same thing as 216y.

    Since Ö(a+b) is not easily expressible in terms of a and b we do not use an operand-based bit-shifting algorithm. However since f-1(y)=y2 is monotonically increasing we can use a result-based one.
    Suppose at the nth stage the top n bits of yn match those of y=Öx but the rest of  yn is zero  (ie.  yn =åi=-n-1 y{i}2i) so that En=x-yn2 is a non-negative error value for yn2 as an underapproximator for x.
     y{n+1} = 1     Û (yn+2-(n+1))2 £ x     Û yn2 + 2-nyn + 2-2(n+1) £ x     Û 2-nyn + 2-2(n+1) £ En , with the comparison one of positive (unsigned) integers.

    An algorithm exploiting this is

  1. Set n=0 ; yn=0 ; En=x
  2. IF En ³ 2-nyn + 2-2(n+1) set Pn+1 =Pn+2-(n+1) ; En+1 = En-2-nyn-2-2(n+1)
      ELSE set yn+1 = yn  ; En+1 = En
  3. If n<N set n=n+1 ;  goto (ii)
  4.  yN=Öx (if regarded as an N-bit fraction)
This algorithm does not shift its variables although it does require 2-2n and multiplication by 2-n each cycle. Pn is N bits wide, En as wide as Q.
    By maintaining Rn = 2N-nyn instead of yn, and Sn=2-2(n+1) we can improove the algorithm thus:
  1.  Set n=0 ; Rn=0 ; En=x ; Sn=1/4
  2.  If   En ³ 2NRn + Sn set Rn+1 = Rn/2+2N+1Sn ; En+1 = En - 2-NRn - Sn
      Else set Rn+1 = Rn/2      ; En+1 = En
  3. Set Sn+1 = Sn/4
  4. If n<N set n=n+1 ;  goto (ii)
  5. RN=yN=Öx (if regarded as an N-bit fraction)
With Rn being 3N-2 bits wide, Sn 2N wide, and En as wide as Q.

An alternative approach is to operate on the bits of x in pairs by shifting them into a comparison field. Each of these double shifts gives one bit of result data.
    At each stage we have yn containing an integral underestimation of the square root of 4nx. We maintain an error value En=4nx-yn2.

    How is En+1 related to En?
    If yn+1=2yn   then En+1 = 4n+1x-4yn2 = 4En
    If yn+1=2yn +1 then En+1 = 4n+1x-4yn2-4yn-1 = 4En - 4yn -1.
    So we set yn+1=2yn+1+1 Û 4En ³ 4yn + 1 ie. Û En ³ yn + 1/4.
    The algorithm is therefore:

  1. Set n=0; yn=0 ; En=x (fraction)
  2. If  En ³ yn+1/4 set En+1=4(En-yn-1/4) ; yn+1=2yn+1
      Else          set En+1=4En         ; yn+1=2yn
  3. If n<N set n=n+1 ;  goto (ii)
  4.  yN=Öx (if regarded as an N-bit fraction)
Setting Fn = 4En to loose the 1/4 gives:
  1. Set n=0; yn=0 ; Fn=4x (fraction)
  2. If  Fn ³ 4yn+1 set Fn+1=4*(Fn-4yn-1) ; yn+1=2yn+1
      Else          set Fn+1=4Fn         ; yn+1=2yn
  3. If n<N set n=n+1 ;  goto (ii)
  4.  yN=Öx (if regarded as an N-bit fraction)
Here the shifts must have width N+M where M is the width of Q.

Iterative Methods

    It is possible to obtain square roots using an iterative method involving divisions. Consider the iteration xn+1 = (xn+y/xn)/2.
    If en = xn-Öy then we have en+1 = en2/2xn.
    So if xn>Öy we must have 0< en+1 < en2/2Öy; writing gn=en/Öy for the fractional error in xn, we have gn+1 £ gn2/2 Þ gn £ g02n/22n-1.
    Thus provided x0 differs from Öy by less than 141% (g0<Ö2) then xn®Öy (from above) extremely rapidly. If, for example, x0 is within 6% of Öy so that the most signigicant four bits of x0 are accurate, then x1 will be accurate to nine bits, x2 to nineteen bits, x3 to thirty nine bits, x4 to seventy nine bits, and so on (though not beyond the accuracy to which the divisions are evaluated).

    Since this iteration converges to the square root so fast, we rarely need to do more than three iterations and two will often suffice if we start from a resonable estimate for Öy. If Öy is close to one (eg. when normalising an approximately normal vector), then we set x0=1 and get our first iteration x1=(1+y)/2 almost for free. To find the exact magnitude of a vector whose magnitude is not necessarily close to unity we can sum the squares of the components and then use a distance measure (see Distance Measures) to start the iteration.
    Similar iterations exist to find cube and higher roots. See Inverse Functions Via Iteration.

Approximation Functions

    We can reduce the domain from [0,1] to [0,½] by dividing any x>½ by four and the doubling its square root. Indeed, we can reduce to [0,2-m] by dividing by 2m and then multiplying the squareroot by 22m although we loose the LeastSig m input bits by doing so.
[Include some rational poly fits here ]

Power Function
    Square roots are the b=½ special case of the general power function xb . In the presence of log and antilog functions one can exploit the trivial result xb= antilog(b log(x)) .
    Of particular interest is xb for x Î [0,1] and a "large" positive b (around twelve) eg. the ( cosq)b arising in Phong Shading. The result is then also in [0,1].
    For known integer b>0,  xb can be evaluated in less than b multiplications in a "hardwired" manner. x18 , for example, requires 4-Mult as (((x2)2)2)(x2) [   (x2) being evaluated once and used twice ].

Differentiating power functions

    If ' denotes /x then we have (ax)' =  ln(a)ax and (xa)' = axa-1 for a¹0.

    If ¦(x)=g(x)_h(x) then
    d-1(¦(x+d)-¦(x))   =   d-1(g(x+d)_h(x+d)-g(x+d)_h(x) + g(x+d)_h(x)-g(x)_h(x))
    » ln(g(x+d))g(x+d)_h(x)_h'(x) + _h(x)g(x)_h(x)-1g'(x) » ln(g(x))g(x)_h(x))h'(x) + _h(x)g(x)_h(x)-1g'(x) = (  ln(g(x))g(x)_h'(x) + _h(x)g'(x) )g(x)_h(x)-1

Approximation Functions

    If the result is required only to within a fairly course error tolerance e [  Typically e=2-8 for an 8-bit shade value ] then a rough-and-ready piecewise approximator is given by
    ¦(x) º 0      for 0£ x <d ;      (Ax+B)m      for d£ x £1
where db £ e (typically d = e(b-1) to give equality) ;     A=(1-d)-1  ;      B = 1-A = -d(1__delta)-1 .
    This satisfies ¦(d)=0 ; ¦(1)=1 for any m and it remains only to chose an integer m smaller than b (ideally a power of two) giving the required accuracy.
    For a given m, the optimal (minimax) values for aand Bare given by the minimax linear approximator to x(b/m) over x Î [d,1] though this gives a discontinuity at x=d and may deviate slightly from [0,1].

    In general, a "power bulging" function curve ¦(x) over a given range   can be approximated by computing the mth power of an approximator to the "flatter" curve ¦(x)(m-1) for a prechosen "friendly" m.

Sine and Cosine functions

Look Up Tables

    We will assume we need a q-bit sine (that is a result expressed in (2q)ths) of a p-bit angle (that is an angle expressed in (2-p)ths of a complete revolution (2p radians or 360o)),ie. we need S(i) = 2qsin(2(p1)p) where iÎZ2p-1; and that we have room for a lookup table containing N=2n m-bit entries.
    We take X* to be the first quarter of X as before. Suitable base functions are s(i)=0 and s(i)=2q+2-pi
    X=Z2p-1  ; X*=Z2p-2-1
    S(i) = 2qsin(2(p1)p)  ; s(i) = 0 , 2q+2-pi
     Y=Z2q-1.
s11(i)=i if 0<i<2p-2
=2p-1-i if 2p-2<i<2p-1
=i-2p-1 if 2p-1<i<3*2p-1
=2p-i if 3*2p-2<i<2p
s12(i)=2n-p+2i.

If we use s(i)=0 then we have S[i] = [2msin(2-n-1p)| and our expansion function is
s2(y,i)=2q-my if 0 £ i < 2P-1
=-2q-my if 2p-1 £ i < 2P.

If we use s(i)=2q+2-pi then we have S[i] = [2m+2(sin(2-n-1pi)2-n-1p)| and our expansion function is
s2(y,i)=2q-m-2y + 2q-p+2s11(i) if 0 £ i < 2P-1
=-2q-m-2y - 2q-p+2s11(i) if 2p-1 £ i < 2P.

    Our thinness errors come from an absolute error of 2-1 in yÎZ2m1 and so translate to an absolute error in our final result of 2q-m for s(i)=0 and 2q-m-3 for s(i)=2q+2-pi. If we are using an 8-bit table to give a 16-bit result (m=8, q=16) for example, then our absolute thinness error is 25 for the linear base function, only the top eleven bits of our result are accurate. To eliminate thinness errors we thus need m ³ q-3.

    Our sparseness errors derive from any difference between p2 and n. If we wish to eliminate sparseness errors we must either have n³p2 or use some form of interpolation. Rather than using naive linear interpolation we note that
    S'(x) » 2qp+1pcos(2(p1)px) = 2p1pS(2p-2-x) so that
    S(j+k) » S(j) + (21-ppk) S(2p-2-i).

    Using the contraction function s12(i) = 2n-p+2i results in the loss of p-n-2 bits so the k we are interested in will be in Z2p-n-2-1. We therefore construct a corrective look up table C to approximate the function  c(k)=21-ppk. We will assume this table to have 2p-n-2 entries (giving a contraction function c1(k)=k) entries of width 2r (giving a table C[k]=[2r+n-ppk|, and expansion function c2(y,k)=21-r-ny if we use a zero base function). For any x Î X we can therefore define our approximation S of S to be S(i) = s2(S[j],i)+ 21-r-nC[k]s2(S[2n-j],2p-i)
where j=i DIV 2p-n-2 , k=i MOD 2p-n-2.

Example
    If we have n=r=8 and p=q=m=16 then s12(i) = 2-6i.
    If we use a zero base function then:
    S[] contains 28 16-bit enties:  S[i] = [216sin(29pi)|.

C[] contains 26 8-bit entries:  C[k] = kp.

S(i) = S[j] + 2-15C[k]*S[28-j]

 where j=i DIV 26,k=i MOD 26 for iÎðZ214-1.

If we use a linear base function then:

  S[] contains 28 16-bit enties:  S[i] = [218(sin(29pi)29pi)|.

C[] contains 26 8-bit entries:  C[k] = kp.

S(i) = 2-2S[j] + 28j + 2-15C[k]*(2-2S[28-j]+216-28j)

 where j=i DIV 26, k=i MOD 26 for iÎðZ214-1.

The multiplication is of the form 2-15(8-bit)*(16-bit) and we only need the top nine bits of the 24-bit result.

For a zero base function this system gives values for S(i) within ±2 of the desired 16-bit scaled sine. If we only evaluate the multiplication to eight bits then the errors become ±3. Raising r to 16 reduces the errors to ±1 except for just 14 (out of 214) +2s. Taking n=9, r=16 and a 9-bit multiply gives errors of ±1 with only 75 +1s. Increasing n to 11 (a 4 Kbyte lookup table) ensures that the value is only ever out by -1.

Bit-Shifting Methods

    The identity sin(a+b) = sinasinb + cosacosb allows an operand-based method (if we tabulate sin(2i) and cos(2i)) but this would be extremely extremely costly requiring two multiplications per non-zero operand bit.

The monotonicity of sin-1(y) allows a result-based method but only if sin-1(y) is available.

Approximation Functions

    The following tables give minimax polynomial approximations to sin(px/2), that is, we assume x=1 corresponds to a right angle.
    e1 is the maximal negative error in the approximation, e2 the maximal gives the maximal positve error. e3 and e4 give the maximal negative and positive errors if all multiplications are only performed to four heximal places (sixteen bit fractions).

Approximations to sin(px/2) over [0,1].
x0x1x2x3x4x5x6e1(-ve)e2e3e4(+ve)
-.038CA7+1.D87297-.D1594A                                -.038CA7+.038CA7-.038C28+.038E13
-.005998+1.9C47AC-.2CA6C2-.6FA0EA                        -.005998+.005998-.00595D+.005AF5
+.00070F+1.90AF50+.0C0D72-.C915CE+.2C4AEE                -.00070F+.00070F-.00070F+.0009E9
+.000077+1.91FFC8+.0162BE-.AAE54A+.0980D0+.0E01F4        -.000077+.000077-.000077+.0003FA
-.000007+1.922253-.002AE6-.A45511-.030FD3+.191CAC-.03AF27-.000007+.000007-.000001+.000459

Approximations to sin(px/2) over [0,½].
x0x1x2x3x4x5x6e1e2e3e4
+.03D13F+1.6A09E6                                        -.03D13F+.03D13F-.03D13F+.03D2BE
-.009760+1.A7DEC3-.76EEB9                                -.009760+.009760-.009730+.0098CA
-.00031C+1.92D222-.060D8B-.9705DA                        -.00031C+.00031C-.00030C+.0004B1
+.00004B+1.92022C+.01DD56-.AFF0E3+.18A0E6                -.00004B+.00004B-.00004B+.000209
+.000001+1.921F29+.000C12-.A5BD88+.0145CD+.12B5CB        -.000001+.000001-.000001+.0001B8
-.000000+1.921FC2-.0001B0-.A5495E-.007687+.15C668-.020835-.000000+.0000000+.0001DD

Approximations to sin(px/2) over [½,1].
x0x1x2x3x4x5x6e1e2e3e4
+.730C08+.95F61A                                        -.090222+.090222-.090215+.0902C9
-.2646B8+2.45AA3C-1.1F23F8                               -.003F8C+.003F8C-.003F48+.004042
-.0E8F7A+1.E064E7-.934EDE-.3E8E00                        -.000772+.000772-.0006E6+.00097A
+.023D41+1.816DEE+.31ECE6-.F10DB9+.3B7586                -.00001F+.00001F0+.000327
+.00AB6F+1.8C8EBF+.12D800-.C64B6C+.1E795B+.07BFE6        -.000002+.0000020+.0003D9
-.0010BA+1.92CAD4-.02ECBD-.9E3EEA-.0A7B74+.1DD4D2-.04E7D0-.000000+.0000000+.0003DC



Arctangent function
    This function is most commonly used converting cartesian coordinates into polar coordinates when we need q=tan-1(y/x). Now y/x can take any value in the range [-¥,+¥] but by exploiting the identities:
    tan-1(1/x) = p/2 -tan-1x)
    tan-1(-x) = -tan-1(x)
we can reduce our requirements to finding tan-1(x) over xÎ[0,1].
[0,1] is a representative portion of R with respect to tan-1(x).

Bit-Shifting Methods

    The monotonicity of tan(y) makes result-based algorithms viable in the presense of a routine to evaluate tan.

Iterative Methods

    The iteration  xn+1 = ((2-sin2xn)xn + (1+cos2xn)y)/2 will approach tan-1(y) extremely rapidly. See Inverse Functions Via Iteration.

Approximation Functions

    It is possible to find polynomials that approximate  tan-1(x) fairly well over xÎ[0,1], or indeed over xÎ[-1,1]. Unfortunately we need a polynomial of the form C1x+C3x3+C5x5+C7x7 (requiring five multiplications) merely for an accuracy of ±&0.0006 (achieved by C1=.9992150,C3=-.3211819,C5=.1462766,C7=-.0389929).

The simplest and most efficient approach is to have two or more different polynomials for different portions of the required domain but alternatively we can reduce the required domain to [0,Ö2-1]»[0,&0.6A09] using the identity
     tan-1(x) = 2 tan-1((Ö(1+x2-1)/x)
which of course we need only use when x>Ö2-1.

    Alternatively we use identities of the form  
     tan-1(x) = a +  tan-1((x-tana)/(1+xtana))
for various a Î (0,1).
    Taking a= tan-1(½) » &0.76B2 and using
     tan-1(x) =  tan-1(½) +  tan-1((2x-1)/(2+x))
on any x>½ will reduce the domain to [0,½].
    If we then take a =  tan-1(1/4) » &0.3EB7 and use
     tan-1(x) =  tan-1(1/4) +  tan-1((4x-1)/(4+x))
on any x>½ we will further reduce the domain to [0,1/4].
    While it is possible to continue this process to obtain an arbitarily small domain over which to use a simple approximation function (a linear polynomial for example) or a short look up table, it is probably more computatonally efficient to use a higher degree polynomial over a larger range since whereas each additional degree of a polynomial adds one multiplication, each domain reduction may require one division.

    The tables below gives good approximation polynomials for tan-1(x) over the ranges [-1,1], [0,1], [0,Ö2-1], [0,½],[½,1],[0,1/4],[1/4,½],[½,3/4], and [3/4,1]. Coefficients and errors are all given in heximal.

    Approximations to tan-1(x) over [-1,1].
x0x1x2x3x4x5x6e1e2e3e4
0+10-1/3                  -.1E6530+.1E6530-.1E65DB+.1E65DB
0+10-1/30+1/5              -.14CE03+.14CE03-.14CD25+.14CD25
0+.F8EED50-.312382         -.014488+.014488-.0144DB+.0146DB
0+.FECFC70-.49E7990+.144F8F -.0027E2+.0027E2-.0028CA+.002A69

Approximations to tan-1(x) over [0,1].
x0x1x2x3x4x5x6e1e2e3e4
+.091A49+.C90FDB                                       -.091A49+.091A49-.091A49+.091B60
-.00A0B5+1.10F8B3-.46A76D                               -.00A0B5+.00A0B5-.00A067+.00A1F4
-.004868+1.096170-.2F9CC5-.10B4D0                       -.004868+.004868-.004857+.00498E
-.0006F6+1.00FB4D-.04AE78-.568A15+.235B05               -.0006F6+.0006F6-.0006D0+.000971
+.00015F+0.FF8D86+.060E9D-.737E60+.439E18-.0CAC00        -.00015F+.00015F-.00015F+.0004DA
+.00006B+0.FFD785+.0279D9-.637579+.237551+.10D9B4-.0A1BE6-.00006B+.00006B-.00006B+.000493

Approximations to tan-1(x) over [0,½].
x0x1x2x3x4x5x6e1e2e3e4
+.01B5FA+.ED6338                                       -.01B5FA+.01B5FA-.01B5FA+.01B6D3
-.003925+1.08E16D-.35333F                               -.003925+.003925-.0038FA+.003A47
-.000580+1.0147E9-.0B97C8-.386334                       -.000580+.000580-.000572+.0006B5
+.000041+.FFE25C+.022CD9-.641216+.2B6F86                -.000041+.000041-.000041+.000180
+.000010+.FFF6E6+.00D1F3-.5C29DE+.18C5B2+.0F519B        -.000010+.000010-.000010+.000192
+.000000+.FFFFF5-.0001C0-.5518BD-.02860C+.3FE832-.206E10-.000000+.000000-.000000+.000198

Approximations to tan-1(x) over [½,1].
x0x1x2x3x4x5x6e1e2e3e4
-.0243E7+1.18C9E2-.4D8472                               -.000E51+.000E51-.000DFD+.000FC2
-.07BC4A+1.2FCF56-.6CD820+.0DD30D                       -.0001E1+.0001E1-.000192+.0003AF
-.03A604+1.188DAC-.3C36BC-.1E7BC7+.0EDAF8               -.000042+.0000420+.0003C2
-.004AC5+1.00C75F+.0620D5-.79AA51+.4C9B9E-.107EE6       -.000003+.0000030+.000378
+.00B718+.F83E19+.23E7B1-.B065E7+.848836-.2E9F4F+.06AFFA-.000000+.0000000+.00051A

Approximations to tan-1(x) over [0,Ö2-1].
x0x1x2x3x4x5x6e1e2e3e4
+.0106BD+.F2B406                                        -.0106BD+.0106BD-.0106BD+.0107AB
-.0024B2+1.06ABCD-.2E8926                                -.0024B2+.0024B2-.00249D+.0025D7
-.000268+1.00AAE7-.0731BF-.40058E                        -.000268+.000268-.000257+.0003AF
+.000026+.FFEC79+.019FEC-.61B5CC+.285BA4                 -.000026+.000026-.000026+.00015B
+.000005+.FFFC70+.0061C5-.591C11+.1031BB+.17AC22        -.000005+.000005-.000005+.000161
+.000000+1.00001C-.000519-.54FC8A-.02F392+.40AD59-.20F2B8-.000000+.0000000+.000166

Approximations to tan-1(x) over [0,1/4].
x0x1x2x3x4x5x6e1e2e3e4
+.003E90+.FADBB0                                        -.003E90+.003E90-.003E90+.003F94
-.00099D+1.02C5A5-.1E7437                                -.00099D+.00099D-.000995+.000AA3
-.00003A+1.001A58-.01CD2D-.4CB5D4                        -.00003A+.00003A-.000030+.000148
+.000005+.FFFC08+.00830D-.5B4B77+.1CDD39                 -.000005+.000005-.000005+.00012C
+.000000+.FFFFC6+.000A22-.55F79F+.0460BD+.27587F         -.000000+.000000-.000000+.000129
+.000000+1.000004-.000121-.553915-.0153B5+.3B86E1-.1ACC2C-.000000+.0000000+.000129

Approximations to tan-1(x) over [1/4,½].
x0x1x2x3x4x5x6e1e2e3e4
+.074E6A+.DFEAC1                                               -.00922F+.00922F-.0091FF+.009355
-.026B6D+1.16AEA2-.48D849                               -.000435+.000435-.00041C+.000581
-.00D40D+1.091C8A-.236D49-.216D4F                       -.000060+.000060-.00003C+.0001C2
+.00045C+.FF8B21+.045CFE-.698805+.30171E                -.000001+.0000010+.000164
+.00113A+.FED65A+.084545-.742E60+.3E6582-.079517        -.000000+.0000000+.000173
+.000315+.FFC6C9+.01B343-.5BF600+.0CC46B+.2E07EF-.17DD12-.000000+.0000000+.000171

Approximations to tan-1(x) over [½,3/4].
x0x1x2x3x4x5x6e1e2e3e4
+.1B4083+.B82B84                                        -.00A4A9+.00A4A9-.00A467+.00A59A
-.04377A+1.1EF1EB-.523CBC                               -.0000B1+.0000B1-.00007C+.00020B
-.056DBA+1.24D1E0-.5BA27A+.04F684                      -.000034+.0000340+.0001F1
-.019D91+1.0BE19D-.1EE41C-.3C58F8+.1A2825               -.000003+.0000030+.0001FC
+.004576+.FC78F8+.132F0E-.8D4BD5+.5B40F6-.14D5C5        -.000000+.0000000+.000244
+.00719F+.FACA68+.19FD08-.9BEB52+.6CDBE0-.201827+.02FD16-.000000+.0000000+.000242

Approximations to tan-1(x) over [3/4,1].
x0x1x2x3x4x5x6e1e2e3e4
+.3851F9+.914D76                                        -.008F94+.008F94-.008F3C+.009082
+.020481+1.0EC7EC-.47BF10                                -.00027D+.00027D-.000197+.0004A6
-.0B1716+1.3C2B04-.7BDC68+.13D848                        -.00000B+.00000B0+.000273
-.07E796+1.2D6C09-.6259FE+.004C9C+.0598CC                -.000001+.0000010+.0003AA
-.0257DE+1.0D4B72-.184E59-.54CF00+.3661EE-.0B284E        -.000000+.0000000+.000410
+.0106A3+0.F5E41C+.2B4CCC-.BCC31E+.901EC3-.3464A4+.07E150-.000000+.0000000+.000450



Logarithm function

Look Up Tables

    The dynamic range of loga(x) for x=1,2,3,...N is usually small for reasonable N. In order to use logs efficiently it is often necessary to scale them up.
    For example, if a=2 and N=255, loga(N) is less than 8 so storing [log2(x)| only gives three bits of data and a corresponding fractional error of about 12%. If L(x) approximates 212log2(x) so that L[255] » 212*7.994=&7FE7, 15 bits are used profitably.
    We define L(x) = Kloga(x) for x Î X and some convenient K.
    A convenient representative portion of Â+ wrt L is an interval of the form [1,at) for some postive (perhaps integral) t.
If x Î Â+ is expressed in the form x = am(aq+r) where m is an integer (the  exponent) and r (the  mantissa) Î [0,at+q-aq) (Standard binary floating point format, for example, has a=2,t=1,q=-1) then,  writing s for ra-q Î [0,at-1),
    L(x) = L(am(aq+r)) = L(am+q(1+s)) = K(m+q) + L(1+s).
    A contraction function can thus be based on l11(am(aq+r)) = 1+raq and an expansion function on l22(y,am(aq+r)) = y + K(m+q).
    We thus have: X = R+  ; X*= [1,at) ; L(x) = Kloga(x) ; l(x)= 0 , Kt(x-1)/(at-1) ,or Ktxat
    l11(am(aq+r)) = 1+raq
   l22(y,am(aq+r)) = y + K(m+q).
    Assuming that x=am(aq+r), combining l11(x) = 1+raq with l12(1+s)=[NLs/(at-1)] gives a contraction function l1(x)=NLraq/(at-1).
    If we choose l(x)=0 and store L[i] = [Kloga(1+i(at1)/NL)| iÎZN
(where K is chosen so that Kt is as close to (but not greater than) ML) then, setting _2B l2(y1,y2,x) =  y1 + K(m+q),
we obtain
l2(L[l1(x)].y.x)=L[l1(x)] + K(m+q)
=L[[Nra-q(at-1))|] + K(m+q)
=Kloga(1+(NLra-q/(at-1)))(at-1)/NL) + K(m+q)
=Kloga(1+ra-q) +  K(m+q)
=Kloga((1+ra-q)am+q)
=Kloga(am(aq+r))
=Kloga(x)

as required.
    Although using l12(1+s)=[NLsat] and L(i)=[Kloga(1+iat/NL)| uses the LUT space less efficiently, it is often easier to code.
    Over [1,at) the dynamic range of Kloga(x) is [0,Kt), however most of the dynamic range occurs over the first (1/a-(1/a)t)th of the interval. With a=2, t=8, for example, we have the last half of our table varying by less than K ((1/8)th of the full range).
    This can be avoided by either: taking a=2 is t=1 (so that X*=[1,2], and L[i] = [Klog2(1+i/NL)| i Î ZNL ranges from L[1]=0 to L[NL]=K with L[NL/2]=.6K; a reasonable spread of values); or using a nonzero base function.
    The maximum deviation of L(x)=Kloga(x) from its linear under-approximation l(x)=Kt(x-1)/(at-1) occurs at x0=L'-1((L(at)-L(1))/(at-1).
    Now L'(x)=K/(ln(a)x) so L-1(y) = K/(ln(a)y). Thus x0=K/(ln(a)(Kt/(at1)) = (at-1)/(ln(a)t) and our maximal deviation is L(x0)-l(x0) =  K(loga(x0)- t(x0-1)/(at-1)).
    Another possible base function is l(x)=Ktxat which gives a possibly greater deviation of tat at x=1 and is sometimes greater than L(x). (The gradient of Kloga(x) over x>c is greatest at x=c).
    Using l12(x)=[(x-1)/b| will generate maximal sparseness errors over x>c of Kloga(c+b)-Kloga(c) = Kloga(1+b/c). If we have c=1, b=(at-1)/NL then this is Kloga(1+(at-1)/NL) which for a=2, t=1 and large NL is approximately K/(ln(2)NL) ® 0.31K/NL. We can use linear interpolation to reduce this.
    The naive linear interpolator  g(x) = f(c)+(xc)(f(c+b)f(c))/b approximating f=L over [c,c+b] has maximum error approximately equal to f '(c)b2/8 for small b. Since L'(x) has its greatest absolute value over [1,at] at x=1 and is there -K/ln(a) naive linear interpolation will reduce our sparseness errors from Kloga(1+(at-1)/NL) to K((at-1)/NL)2/(8ln(a)). For a=2, t=1 this is K/(8ln(2)NL2) = 0.18K/NL2.
    The gradient of L is actually given by L'(x) = K/(ln(a)x) and we would need a corrective LUT approximating C(d)=Kd/ln(a) in conjunction with a reciprocal table to exploit this.

Antilog or Exponential Function
If we define our log function as L(x)=Kloga(x) then we must define our antilog function as A(x) = ax/K.

Look Up Tables

    Any interval in  is a representative portion of  wrt A. We will consider an interval [d,d+w) and suppose that a given xΠis readily expressed in the form mS+r where mÎZ, rÎ[d,d+w) and S is some constant (typically d=0,w=S, so that m=x DIV S, r=x MOD S and we have
    A(x) = A(mS+r) = amS/KA(r)).
    Thus a contraction function can be based on a11(mS+r)=r and an expansion function on a22(y,mS+r)=amS/Ky, this latter only being really feasible if mS/K is an integer.
    Antilog generates the same non-uniform variation problem as log but with the low spread of values occurring at the beginning of the range. This is again only a problem if a constant base function is used and can be avoided even then by using d=0,w=K giving a dynamic range of [1,a).
    The obvious constant base function for A over [d,d+w] is a(x)=ad/K.
    Typically we would take d=0,w=S=1 giving:
    X  = R+  ; X* = [0,1)
    A(x) = ax/K  ; a(x) = 1 , 1+x(a1/K-1) , xa1/K
    a11(z+x) = x   (zÎZ,xÎ[0,1))
     a22(y,z+x) = az/Ky.

Thinness Errors

    The maximum deviation of A(x)=ax/K from its linear approximation a(x)=ad/K(1+x(aw/K-1)/w) which overestimates A(x) over (d,d+w) occurs at x0=A'-1((A(d+w)-A(d))/w).
    Now A'(x)=ln(a)ax/K/K so A-1(y) = Kloga(Ky/ln(a)).
Thus x0=Kloga(Kad/K(aw/K-1)/w/ln(a))
=d+Kloga(K(aw/K-1)/w/ln(a))
=d+Kloga(m)

where  m = K(aw/K-1)/w/ln(a) so the maximal deviation is
A(x0)-g(x0)=ad/K(m-1-Kloga(m)/w(aw/k-1))
=ad/K(m-1-mln(a)loga(m))
=ad/K(m1-mln(m)).

Sparseness Errors

    Setting b=w/NA and a12(r)=[(r-d)/b] will generate fractional sparseness errors of approximately e=ln(a)b/K. For a=2 this is approximately 0.69b/K.
    One way to reduce sparseness errors is to note that A(x+d) = A(d)A(x) and use a corrective LUT of the form C(y)=A(y/b).
    Alternatively we can use naive linear interpolation. Since A'(x) = ln(a)2ax/k/K2 has its greatest absolute value over [a,b] at x=b and is there (ln(a)/K)2A(b), an estimation of the theoretical fractional sparseness error after using naive linear interpolation is e'=(bln(a)/K)2/8=e2/8.
    For a=2 this is approximately 0.06(b/K)2.
    Linear interpolation, however, involves multiplication so we can use L and A again. Suppose c=d+yb for some yÎZ and xÎ(c,c+b). Let z=(xc)/b which we know lies inside (0,1). (if d=0 we have y=[x/b], z=(x/b)-y, c=yb).
g(x)=A(c) + z(A(c+b)-A(c))
=A(c) + zA(c)(A(b)-1)
=A(c) + alogz+c/K+ log(A(b)-1)
=A(c) + a(Klogz+c+K log(A(b)-1))/K
=A(c) + A(Kloga(z)+c+Kloga(A(b)-1))

 Now z may be approximated as Rzap where p>0 and   RZÎZap and then provided RZ>0 we have Kloga(z) = Kloga(Rz)  Kp so that:
    g(x) = A(c) + A(c + Kloga(Rz) + G)
where G is the constant Kloga(A(b)-1) - Kp = K(loga(2b/K-1)-p).
    But Kloga(Rz)»l21(L[l12(Rz)]) and so we have:
g(x)=A(c) + A(c +l21(L[l12(Rz)])+G)
»a2(A[y],c) + a2(A[[y+l21(L[l12(Rz)])/b+G/b|],c+l21(L[l12(Rz)])+G))

    An example should make this clearer. With a=2,K=212,d=0,w=S=214  we have a11(x)=x MOD 214, a22(y,x)=24(x DIV 2^10)y. If we have NA=28 we need to complete our contraction function with a12(x)=2-6 and are then interested in interpolating for values of the form 26i+j where jÎZ261. If we use a constant base function a(x)=20=1 then replacing c with 26i, b with 26, and z with 2-6j (Rz=j) we have
g(26i+j)=A(26i) + A(26i + 212log2(j) + G)
=A(26i) + A(26i + L(j) + G)

where G is the constant 212(log2(A(26)-1) - 6).

    Let us consider how approximating g in this manner effects our theoretical sparseness eror of e'=e2/8.
    A(c) will not be prone to sparseness errors since c divides b. However both it and the corrective term A(Kloga(z)+c+Kloga(A(b)-1)) are subject to thinness errors in A so we have effectively doubled the thinness errors. The corrective term is bounded in magnitude by A(c+b)-A(c). But this is just the original absolute sparseness error bound for A over [c,c+b) and equals eA(c). If the errors in L are small compared to the sparseness errors in A we can take all the errors in the corrective term as deriving from A and we have a maximal sparseness error of e(Max corrective term)=e(eA(c))=e2A(c).
    Our fractional sparseness error is then e2A(c)/A(x) < e2 since A(c)<A(x). Adding this to the theorectical error e2/8 derived above gives a final sparseness error in A of e'=9e2/8. We have also doubled our thinness errors.
(The gradient of A is greatest at the end of the range so our crude absolute error bound is A(d+w)-A(d+w-b)=a(d+w)/K-a(d+w-b)/K » a(d+w)/K(ln(a)b/K). We can similarly derive a maximal fractional error of ln(a)b/K.).

Reciprocal function
The function f(x)=1/x is often tabulated to provide a way of doing divisions using multiplications; xy being evaluated as x*(1/y).

Bit-Shifting Methods

    Any of the bit-shifting division algorithms for P/Q can be used with P=1.



Gamma Function
    The  Gamma function is a generalisation of the factorial integer function to real x defined so that G(x+1)=xG(x) . This enables us to express any G(x) for x³0 as a multiple of G(x') for x'Î[½,1½] . G(1)=1 ensures G(k) = (k-1)! for positive integer k.
    We also have G(½) = p½ providing G(k+½) = 1×3×5×...(2k-1) 2-k p½ .
    G(x) can be extended over negative x via G(-x) = -p(x sin(px)G(x))-1 having sign (-1)k+1 between -k and -(k+1) .

    G(x) =  ò01 dt ( ln(t-1))x-1 =  ò0¥ dt tx-1e-t provide alternate integration-based definitions of G(x).over x>0.
    Gauss's defintion is G(x) º  Limn ® ¥ n!nx-1(x(x+1)..(x+n-1))-1 for noninteger real x.

    G(-(k+d))   =   (-1)k+1p((k+d) sin(pd)G(k+d))-1 is, for k>3 , small for most dÎ(0,1) ramping rapidly to ±¥ as d®0 and 1.

Dropoff Functions
    A  dropoff function of order N is a function g:[-1,1]®[-1,1] satisfying
g(0)=1 . g(1)=0 , g(i)(0)=g(i)(1)=0 " 1 £ i £ N
where (i) denotes the ith derivative.
    Though defined over [-1,1], we typicaly only require evaluation over [0,1].

    Dropoff functions have associated  blendings,
    If we have defined ¦(x)=g(x) for x £ a, h(x) for x ³ b where a < b then the blended function
¦(x)= g((x-a)/(b-a))g(x) + g((b-x)/(b-a))h(x) for (a £ x £ b)
ensures N-order continuity of ¦ over [a,b].

    If g satisfies g(1-t) = 1-g(t) over t Î [0,1] we say the blending is ½-symmetric and we have g(½)=½.
    If g satisfies g(-t) = g(t) over t Î [-1,1] we say the blending is 0-symmetric. 0-symmetric dropoff functions are also known as  splines.

    If g is ½-symmetric then g'(1-t) = g'(t). If g is both ½- and 0-symmetric we have g(t-1) = 1-g(t) ; g'(t-1) = -g'(t) ; g"(t-1) = -g"(t).

½-symmetrising

    We can create a ½-symmetric droppof function from any g(t) defined over [0,½] with g(0)=1 ;  g(½)=½ ; g'(0)=0 ; g"(½)=0 by defining g(t)=g(|t|) for t<0 ; g(t) = 1-g(1-t) for ½<t£1 ; and g(t)=0 for t>1.   We will refer to this as a  ½-symmetrised dropoff function. The g"(½)=0 condition ensures continuity in g" at ½ and we have g'(t) = g'(1-t) and g"(t) = -g"(1-t) for t>½ .
    There will be a kth order discontinuity at t=½ for even k³4 if g(k)(½)¹0

Linear
    g(t)= 1-|t|
    Linear blending is ½-symmetric of order 0. It gives first order discontinuities. g'(t)= -1 " t Î [0,1].

Cubic
    g(t)= 1-t2(3-2|t|)
    Cubic blending is not ½-symmetric (although g(½)=½) and is only 0-symmetric if defined as here with a |t|. It has order 1 and cubic blends are consequently often discontinuous in the second derivative as a result.
    g'(t) ³ g'(½) = -3/2 " t Î [0,1].

    By ½-extending 1-3t2+2t3 we obtain a ½-symmetric order 1 blend.

Quartic
    g(t) = 1-8|t|3(1-|t|) over [-½,½] ½-symmetrised over [-1,1] provides an order 2 ½-symmetric drop off which can be used to constuct blendings with third order continuity (ie. continuous second derivatives). There is a discontinuity in g(4)(t) at tau=½.
    g'(t) ³ g'(½) = -2 " t Î [0,1] so quartic dropoff is only a third steeper than cubic dropoff at t=½, though of course it requires one further multiply to evaluate. |g'(t)|£2 and |g"(t)|£4.

    More generally, for integer k³3 we can ½-symmetrise k-degree polynomial g(t)= 1 - 2k-3tk-1(k - 2t(k-2)) to obtain an order k-2 droppoff function with g'(½) = -½k .

Squared cosine
    g(t)=½(1+ cos(pt)) =  cos2(pt/2)
      Squared cosine blending is 0- and ½-symmetric and has order 1, but with g(i)(0)=g(i)(1)=0 for all odd i.
    g'(t) = -½p sin(pt) so g'(t) ³ g'(½) = -p/2 " t Î [0,1].

Bias
    ga(t) = Biasa(1-t) where Biasa(t)   º t-log2a.
    a=½ gives linear blending.
    ga(½) = a.
    Bias blending has order 0 since g'(1) = -log2a.

Gain
    gb(t) = Gainb(1-t) where
     Gainb(t) º Bias1-b(2t)/2      if t £ ½ ;     1-Bias1-b(2-2t)/2      else.
    Gain blending is ½-symmetric. It has order Int((2b +1)/2).

Slope1
    Define Slope1d(t) where d > 0 as equal to
    (d+1)/(1+t/d) - d = d(1-t)/(t+d) for t £ 1, 0 else.
    Slope blending has order 0 since   Slope1d'(0) = -(d+1)/d, Slope1d'(1) = -d/(d+1).
    Of particular interest is Slope12a-1(t) =  (2a-1)(1-t)/(t+2a-1). which has gradients -2a/(2a-1) and -(2a-1)/2a at t=0 and 1 respectively.
    For large a these gradients are just below and just above -1. As a®0 they tend to -¥ and 0.
    For a=1 we have Slope11(t) = 2/(1+t) - 1 with gradient -2 at t=0 and -½ at t=1.
    [ Insert graph image here ]

Sinc Function
    g(t) = (kpt)-1  sin(kpt) for integer k is a 0th order dropoff function because g'(t) =   t-1 ( cos(kpt) -  g(t)) has g'(_0)=0 but g'(1)=(-1)k. It has k zeroes in [0,1].

Dropoff derivatives   
g(t)g'(t)g"(t)g"'(t)g'(½)
Linear1-|t|Sign(-t)00-1
Cubic1-3t2+2t36t(t-1)12t-6123/2
Quartic1-8t3+8t4 for t£½
8t(1-t)3 for t³½
-8t2(4(1-t)-1) for t£½
-8(1-t)2(4t-1) for t³½
-48t(1-2t)  for t£½
-48(1-t)(1-2t)  for t³½
-48(1-4t)    for t£½
-48(4t-3) for t³½
-2
Polynomial 1 - 2k-3xk-1(k - 2(k-2)x) for x£½
2k-3(1-x)k-1(1-k + 2(k-2)x) for x³½
-2k-3kxk-2(k-1 - 2(k-2)x) k
Squared Cosine½(1+ cos(pt))-(p/2) sin(pt) -(p2/2) cos(pt) = p2(½-g(t))(p3/2) sin(pt) p
Bias (1-t)-log2a (log2a)(1-t)-(1+log2a) (log2a(1-log2a) (1-t)-(2+log2a) (log2a) (1-log2a) (2-log2a) (1-t)-(3+log2a) 2alog2a
Sinc (kpt)-1  sin(kpt) t-1( cos(kpt)-Sinc(t)) t-1(-kp sin(kpt)-t-1( cos(kpt)-Sinc(t)) -t-2( cos(kpt)-Sinc(t)) = ???


Random Variables
    A  random function is a function ¦: Â ® X  mapping a "time" t into a (possibly multidimensional) value x with  probabilty density function ¦P(x,t). We say the density function is  normalised if sums or integrate to 1 over X so that the value is the probability.
    We speak of ¦ "returning" x at "time" t with probabilty ¦P(x,t). A random function is  static if ¦P(y,t)= ¦P(y,0)      " y,t .

Uniform
    A random function is  uniform if ¦P(y,t)= ¦P(y0,t)      " y,t     In the case of a finite number of possible values in X, all values are equally probable to occur.
    Most standard programming libraries provide an attempt at a real-valued uniform random function Ua.b(t) which provides a real x Î [a,b] for any t such that the probability the x at any given t lies in subrange [ _a0,_a0+d] is d(b-a)-1 .

General
    Set F(x) =  ò-¥x¦P(s) ds . Observe that F(x) strictly increases with x.
    If F is invertable then F-1( U0.1 ) is a random function with normalised probabilty density function ¦P .
[ Proof :  Probabilty(F-1(U0.1)) £ y} = Probabilty(U0.1 £ F(y)) = F(y)  .]
    We can thus easily emulate many random distributions given a decent U0.1(t) implementation .

Random Directions
    We can specify a random 3D direction with two "spherical polar" angles q,f but if we take q uniformly from [0,p] then we will get as many "at" the poles as are "around" the equator. We favour q Î [0,p] with nonnormalised probability density ½ sinq in proportion to the longitude circumferance length,
    F(q) = ½ ò0q sins ds = ½(1- cosq)     Þ F-1(y)=  cos-1(1-2y)     so we set
     q =  cos-1(1-2U0.1) =  cos-1(U-1.1) and f=U0.2p for a uniform spherical distribution.

Next : Functions Of Several Variables


Glossary   Contents   Author
Copyright (c) Ian C G Bell 1998
Web Source: www.iancgbell.clara.net/maths or www.bigfoot.com/~iancgbell/maths
18 Nov 2006.