Iteration

BASIC ARITHMETIC

Numbers
    From the programmer's perspective, there are four types of number from which more complicated numeric types such as complex numbers or multivectors can be constructed: integer, rational, fixed-point, and floating-point. Integers are whole numbers, typically signed and in the range [-2^(M-1), 2^(M-1)) with M typically 8, 16, 32 or 64. Two's complement form is usually used for negative values, with -x represented by (x XOR ~0) + 1, where ~0 has 1 in all M bits.
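    The two's complement rule can be checked with a minimal sketch. The width M=16 and the helper names negate and as_signed are assumptions made for illustration only.

```python
M = 16                     # word width in bits (assumed for illustration)
ALL_ONES = (1 << M) - 1    # ~0 restricted to M bits

def negate(x):
    """Two's complement negation of an M-bit pattern: (x XOR ~0) + 1."""
    return ((x ^ ALL_ONES) + 1) & ALL_ONES

def as_signed(pattern):
    """Interpret an M-bit pattern as a signed value in [-2^(M-1), 2^(M-1))."""
    return pattern - (1 << M) if pattern >> (M - 1) else pattern

print(hex(negate(1)))          # 0xffff
print(as_signed(negate(5)))    # -5
```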
    For rational numbers, we represent x with two integers p and q≠0 with the understanding that x = pq^(-1).
    For fixed-point we represent x with a signed integer z and the understanding that x = 2^(-L) z, where L is a positive integer less than M. With M=16 and L=14, for example, we can represent values in [-2,2) to within 2^(-14). One is represented by 0x4000, ½ by 0x2000 and so on. When multiplying two such numbers together, we must shift the 32-bit result down by 14 places. If M=32 (with L still 14) we achieve dynamic range [-2¹⁷, 2¹⁷).
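    A minimal sketch of the M=16, L=14 format follows. The helper names to_fixed, to_float and fixed_mul are hypothetical, and Python's unbounded integers stand in for the 32-bit intermediate product.

```python
L = 14          # number of fractional bits (the L of the text)
ONE = 1 << L    # 1.0 is stored as 0x4000

def to_fixed(x):
    """Convert a real value in [-2, 2) to its fixed-point integer z."""
    return int(round(x * ONE))

def to_float(z):
    """Recover the real value x = 2^(-L) z."""
    return z / ONE

def fixed_mul(a, b):
    """Multiply two fixed-point values. The raw product carries 2L
    fractional bits, so it is shifted down by L places."""
    return (a * b) >> L

half = to_fixed(0.5)               # 0x2000
three_quarters = to_fixed(0.75)    # 0x3000
product = fixed_mul(half, three_quarters)
print(hex(product), to_float(product))   # 0x1800 0.375
```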
    For floating-point we represent x by a fixed-point mantissa y and an integer exponent z with the understanding that x = 2^z y. Typically we would have a separate sign bit for the sign of x, forcing y positive, and choose z so that y ∈ [1,2), allowing us to store y-1 rather than y and so enabling L=M for y.
    However, if we wish to store groups of similar-magnitude numbers (vector coordinates (x_1, x_2, x_3), say) it is often sensible to have just one common exponent for all of them. We then have x = 2^z (y_1, y_2, y_3) where y_1, y_2 and y_3 are signed values in (-1,1). We would typically choose z so that at least one of the y_i has |y_i| ≥ ½.
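    One way to choose the common exponent is sketched below, assuming the coordinates arrive as ordinary floats. The function name shared_exponent is hypothetical; math.frexp and math.ldexp are standard library calls.

```python
import math

def shared_exponent(xs):
    """Return (z, ys) with xs[i] == 2**z * ys[i], every |ys[i]| < 1
    and the largest |ys[i]| in [1/2, 1)."""
    biggest = max(abs(x) for x in xs)
    if biggest == 0:
        return 0, [0.0 for _ in xs]
    _, z = math.frexp(biggest)       # biggest = m * 2**z with 1/2 <= m < 1
    return z, [math.ldexp(x, -z) for x in xs]

z, ys = shared_exponent([3.0, -0.5, 10.0])
print(z, ys)    # 4 [0.1875, -0.03125, 0.625]
```

    In a real implementation the y_i would then be stored as fixed-point values rather than floats.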
    Fixed-point numbers are arguably the most fundamental type. They are less forgiving in use than floating-point, requiring greater care with range and accuracy, but are much faster to add and subtract. That said, a lot of silicon has been cut to provide fast floating-point units, and a lot of standard library routines assume floating-point.

Heximal Numbers
    Serious programmers should be familiar with heximal (base 16) numeric notation and recognise key values.

Value	Heximal	Decimal
¼	.4000000	0.25000000
1/3	.5555555	0.33333333
½	.8000000	0.50000000
π/4	.C90FDAA	0.78539816
π/3	1.0C15238	1.04719755
π/2	1.921FB54	1.57079633
π	3.243F6A8	3.14159265
2π	6.487ED51	6.28318531
2^-½	.B504F33	0.70710678
2^½	1.6A09E66	1.41421356
3^-½	.93CD3A2	0.57735027
3^½	1.BB67AE8	1.73205081
5^-½	.727C971	0.44721360
5^½	2.3C6EF37	2.23606798
7^-½	.60C2479	0.37796447
7^½	2.A54FF53	2.64575131
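
    Entries of this kind can be regenerated with a short sketch such as the following. The function name heximal is hypothetical; it truncates rather than rounds and prints a leading 0 that the table omits.

```python
import math

def heximal(x, digits=7):
    """Return x >= 0 written in base 16 with `digits` places, truncated."""
    scaled = int(x * 16 ** digits)               # drop anything beyond `digits` places
    whole, frac = divmod(scaled, 16 ** digits)
    return f"{whole:X}.{frac:0{digits}X}"

print(heximal(0.25))            # 0.4000000
print(heximal(math.pi))         # 3.243F6A8
print(heximal(math.sqrt(2)))    # 1.6A09E66
```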

Multiplication and Division
    The fastest way to multiply or divide two numbers is often logarithmically, as described in the next section. Failing this, if the processor does not have multiply and divide instructions, they must be explicitly programmed.

Multiplication
    We will assume we have an N-bit value P to be multiplied by an M-bit value Q to give a K-bit value R. A fully accurate result requires K ≥ N+M.
    We will further assume that P is non-negative.
    If we write p_i for the i-th bit of P (i=0 giving the LSB) then we have: QP = Q Σ_{i=0}^{N-1} 2^i p_i
    Let R_n = Σ_{i=0}^{n-1} Q 2^i p_{N-n+i}. R_n is thus Q times the top n bits of P, and R_0 = 0.
    Further:

R_{n+1}	=	Σ_{i=0}^{n} Q 2^i p_{N-n-1+i}
	=	Σ_{i=1}^{n} Q 2^i p_{N-n-1+i} + Q 2^0 p_{N-n-1}
	=	Σ_{i=0}^{n-1} Q 2^{i+1} p_{N-n+i} + Q p_{N-n-1}
	=	2 Σ_{i=0}^{n-1} Q 2^i p_{N-n+i} + Q p_{N-n-1}
	=	2R_n + Q p_{N-n-1}

So our algorithm is

(i) Set n = 0 ; R_0 = 0
(ii) If p_{N-n-1} = 1 set R_{n+1} = 2R_n + Q
     Else set R_{n+1} = 2R_n
(iii) Set n = n+1 ; If n < K-M goto (ii)
(iv) R_{K-M} = MostSig K bits of PQ. If K=N+M this is exact.

which may be implemented as

(i) Set n = 0 ; R = 0
(ii) If p_{N-n-1} = 1 set R = 2R + Q else set R = 2R
(iii) Set n = n+1 ; If n < K-M goto (ii)
(iv) R = MostSig K bits of PQ.
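
The loop can be sketched in Python as below. The function name and the steps parameter (the number of passes, which the text ties to K and M) are assumptions for illustration, and Python's unbounded integers mean R never overflows here.

```python
def shift_add_multiply(p, q, n_bits, steps=None):
    """Multiply q by the top `steps` bits of the non-negative n_bits-wide
    value p. With steps == n_bits (the default) the result is exactly p * q."""
    if steps is None:
        steps = n_bits
    r = 0
    for n in range(steps):
        r <<= 1                              # R = 2R
        if (p >> (n_bits - n - 1)) & 1:      # test bit p_{N-n-1}
            r += q                           # R = 2R + Q
    return r

print(shift_add_multiply(0x38, 0x7E, 8))            # 7056
print(shift_add_multiply(0xB5, 3, 8, steps=4))      # 33, i.e. 3 * 0xB
```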

This algorithm works for any value of Q and any non-negative value of P. However, it is often worthwhile checking explicitly for zero-valued P or Q before using the algorithm and exiting with R=0. This will give a marked performance improvement if the routine is frequently called with one or other argument zero (e.g. multiplication of sparse matrices).
If we assume that Q is also non-negative then the addition of Q to R in step (ii) may be more efficiently codable. There is another advantage to having Q of low width. If Q is no wider than P it is possible to improve on a naive implementation of the algorithm that shifts P leftwards to extract its bits in the required order for the test in (ii). Since R is also being shifted leftwards (with the occasional addition of Q) we can effectively do both shifts at once by bringing R into P from below. For this to work R must have width ≥ N+M. We proceed as follows:

(i) Set R = 2^M P
REPEAT K-M TIMES
(ii) Shift R left by one bit, setting carry = the bit shifted out of the top of the (N+M)-bit register
(iii) If carry is set, set R = R + Q
ENDREPEAT
(end) LeastSig K bits of R = MostSig K bits of PQ.

    Note that if we do not perform N cycles of the loop there will still be some low bits of P at the top of our "result" R.
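    A sketch of the shared-register scheme for the full N passes follows, assuming unsigned P and Q. The function name is hypothetical, and the (N+M)-bit register is simulated by masking a Python integer.

```python
def multiply_in_place(p, q, n_bits, m_bits):
    """Unsigned p (n_bits wide) times unsigned q (m_bits wide), using one
    (n_bits + m_bits)-wide register that holds both P and the growing R."""
    width = n_bits + m_bits
    mask = (1 << width) - 1
    r = p << m_bits                    # (i)   R = 2^M P
    for _ in range(n_bits):            # N passes give the exact product
        r <<= 1                        # (ii)  shift left one bit
        carry = (r >> width) & 1       #       carry = bit shifted out of the register
        r &= mask
        if carry:                      # (iii) add Q only when the carry is set
            r += q
    return r

print(multiply_in_place(0x38, 0x7E, 8, 8))   # 7056
```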
    In step (iii) the addition is performed only if the carry is set. This is annoying on processors such as the 6502 without an "Add without Carry" instruction. Rather than clearing the carry every time, it is possible to set R = 2^M times the complement of P (so that the addition is made when the carry is clear). Alternatively one can decrement Q before entering the loop and restore it later if necessary.
    A general multiplication routine is typically of the form:

Check for easy values of P and/or Q like 0 or +1.
Set S = |P| ; T =|Q|
Set R = ST (unsigned multiplication)
If P and Q have different sign negate R
R=result
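
The outline above might be sketched as follows. Both function names are hypothetical, and the unsigned core is the plain shift-and-add loop rather than any particular optimised form.

```python
def unsigned_multiply(s, t):
    """Shift-and-add product of two non-negative integers."""
    r = 0
    for bit in bin(s)[2:]:                 # bits of s, most significant first
        r = 2 * r + (t if bit == "1" else 0)
    return r

def multiply(p, q):
    """Signed multiply: easy cases first, then magnitudes, then the sign."""
    if p == 0 or q == 0:
        return 0
    if p == 1:
        return q
    if q == 1:
        return p
    r = unsigned_multiply(abs(p), abs(q))
    return -r if (p < 0) != (q < 0) else r

print(multiply(-7, 13))    # -91
```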

A "narrow" multiplication operation can be used for wider arguments by repeated application in accordance with the results

(2^m a + b)(2^n c + d)	=	2^(n+m) ac + 2^m ad + 2^n bc + bd
	=	2^(2n) ac + 2^n (ad+bc) + bd	if n=m
	=	(2^(2n) + 2^n) ac + 2^n (a-b)(d-c) + (2^n + 1) bd.

the latter rearrangement giving three rather than four "sub" multiplications.
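
This three-multiplication rearrangement is commonly known as Karatsuba multiplication. A minimal sketch for non-negative arguments follows, in which the names mul_three and narrow_mul are assumptions for illustration. Note that (a-b) and (d-c) may be negative, so the narrow multiply must accept signed operands.

```python
def mul_three(x, y, n, narrow_mul=lambda u, v: u * v):
    """Multiply x = 2^n a + b by y = 2^n c + d (0 <= b, d < 2^n)
    using only three calls to narrow_mul."""
    low = (1 << n) - 1
    a, b = x >> n, x & low
    c, d = y >> n, y & low
    ac = narrow_mul(a, c)
    bd = narrow_mul(b, d)
    mid = narrow_mul(a - b, d - c)           # signed "sub" multiplication
    # (2^(2n) + 2^n) ac + 2^n (a-b)(d-c) + (2^n + 1) bd
    return (ac << (2 * n)) + ((ac + mid + bd) << n) + bd

print(hex(mul_three(0x1234, 0xABCD, 8)))
```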

Division
It is possible to perform division by successive multiplication. We can set P₀, Q₀ equal to P and Q scaled by a common factor so that Q₀ is in the range [1/2,1). Writing d_n for 1-Q_n we then have
P_n/Q_n = P_n/(1-d_n) = P_n(1+d_n)/(1-d_n²)
so if we set P_{n+1} = P_n(1+d_n) ; Q_{n+1} = 1-d_n² we have that
( i) P_n/Q_n = P/Q for all n ≥ 0.
(ii) Q_n → 1 as n → ∞, since d_{n+1} = d_n² and 0 < d₀ ≤ ½.
Whence P_n → P/Q as n → ∞.
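
This scheme is often called Goldschmidt division. The sketch below uses exact rationals from Python's fractions module purely so that the invariant P_n/Q_n = P/Q can be checked exactly; a practical routine would use fixed-point values. The function name and the iteration count are assumptions.

```python
from fractions import Fraction

def divide_by_multiplication(p, q, iterations=6):
    """Return (P_n, Q_n) after `iterations` steps, for q > 0.
    P_n approximates p / q and Q_n approaches 1."""
    p, q = Fraction(p), Fraction(q)
    while q >= 1:                    # scale both by powers of two ...
        p, q = p / 2, q / 2
    while q < Fraction(1, 2):        # ... until q lies in [1/2, 1)
        p, q = p * 2, q * 2
    for _ in range(iterations):
        d = 1 - q
        p = p * (1 + d)              # P_{n+1} = P_n (1 + d_n)
        q = 1 - d * d                # Q_{n+1} = 1 - d_n^2
    return p, q

p_n, q_n = divide_by_multiplication(0x38, 0x7E)
print(float(p_n))    # close to 4/9
```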
Division by specific constants can be done "by hand", for example

x/3	=	x (1 + 1/2)^(-1) / 2
	=	x (1 - 1/2 + 1/4 - 1/8 + ...) / 2
	=	x (1/2 - 1/4 + 1/8 - 1/16 + ...).
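
The series translates directly into shifts and alternating adds and subtracts. In this sketch the function name and the choice of 16 fractional bits are assumptions; the result is x/3 scaled by 2^16 and falls short of the true value by a factor of (1 - 2^-16) because the series is cut off.

```python
def third_fixed(x, frac_bits=16):
    """Approximate (x / 3) * 2**frac_bits for an integer x >= 0
    as x/2 - x/4 + x/8 - ... using only shifts, adds and subtracts."""
    acc = 0
    term = x << frac_bits        # x as a fixed-point value
    sign = 1
    for _ in range(frac_bits):
        term >>= 1               # next power of 1/2
        acc += sign * term
        sign = -sign
    return acc

print(third_fixed(3))                  # 65535, i.e. just under 1.0
print(third_fixed(300) / 65536)        # just under 100
```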

General division routines are usually based on binary long division. P is shifted left into a comparison field F and if F is then greater than or equal to Q, Q is subtracted from F and a 1 bit is appended to the result R; otherwise a 0 bit is appended.
We will assume for the moment that Q>P>0. The result P/Q is therefore fractional and we will calculate 2^K(P/Q).

(i) Set n = 0 ; R_0 = 0 ; F_0 = P
(ii) Set F = 2F_n
(iii) If F ≥ Q set F_{n+1} = F - Q ; R_{n+1} = 2R_n + 1
      Else set F_{n+1} = F ; R_{n+1} = 2R_n
(iv) Set n = n+1 ; If n < K goto (ii)
(v) R_K = 2^K(P/Q) rounded down, i.e. (2^K P) DIV Q ; F_K = 2^K P - Q R_K, i.e. (2^K P) MOD Q
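
A direct sketch of this loop follows. The function name is hypothetical, and the comparison is taken as "greater than or equal" so that the case F = Q produces a 1 bit.

```python
def long_divide(p, q, k):
    """Return ((2**k * p) // q, (2**k * p) % q) for 0 <= p < q
    by binary long division."""
    r, f = 0, p
    for _ in range(k):
        f <<= 1              # F = 2F
        r <<= 1              # R = 2R
        if f >= q:
            f -= q           # F = F - Q
            r += 1           # append a 1 bit to the result
    return r, f

quotient, remainder = long_divide(0x38, 0x7E, 8)
print(hex(quotient), hex(remainder))    # 0x71 0x62
```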

As for multiplication, we can combine the two shifts and rotate R "into" F.

Set R = 2^(N+1) P, where the low N ≥ K bits of R will receive the result
REPEAT
If R ≥ 2^N Q Set R = 2(R - 2^N Q) + 1
Else Set R = 2R
K TIMES
LeastSig K bits of R = (2^K P) DIV Q
Remaining high bits of R (R DIV 2^(N+1)) = (2^K P) MOD Q

For example, if P=&38, Q=&7E and K=8 then, since 38/7E = 0.71C71C... in heximal, we would obtain R₈ = &71 ; F₈ = 2⁸(38 - 7E×.71) = &62.
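
The combined-register form can be checked against this example with the sketch below. It rests on an assumed layout: 2P starts in the field above an N-bit result area, the comparison is against 2^N Q, and N defaults to K. The function name is hypothetical.

```python
def divide_in_place(p, q, k, n=None):
    """Return ((2**k * p) // q, (2**k * p) % q) for 0 <= p < q using one
    register: comparison field in the high bits, result in the low bits."""
    if n is None:
        n = k                        # assumed: result field just wide enough
    r = (2 * p) << n                 # 2P placed above the N-bit result field
    top = q << n                     # 2^N Q
    for _ in range(k):
        if r >= top:
            r = 2 * (r - top) + 1    # subtract Q, shift, append a 1 bit
        else:
            r = 2 * r                # shift, append a 0 bit
    quotient = r & ((1 << k) - 1)    # least significant K bits
    remainder = r >> (n + 1)         # the field holds twice the remainder
    return quotient, remainder

print([hex(v) for v in divide_in_place(0x38, 0x7E, 8)])   # ['0x71', '0x62']
```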

The above will typically form the basis of a general division routine of the following form:

Check for special cases like Q=0, P=0, or Q=+1.
Set P₁=|P|/2^L ; Q₁=|Q| where L is chosen so that P₁ will be < Q₁.
Set R₁=(2^KP₁) DIV Q₁ as above; R₂=(2^KP₁) MOD Q₁ if required.
If P and Q are of different sign negate R₁.
If P is negative negate R₂ if required.
R₁ = (2^(K-L) P) DIV Q ; R₂ = (2^(K-L) P) MOD Q.

Copyright (c) Ian C G Bell 1998
Web Source: www.iancgbell.clara.net/maths or www.bigfoot.com/~iancgbell/maths
18 Nov 2006.