Fast DCT and Integer Arithmetic in CUDA

Discrete Cosine Transform (DCT)

Number of calucations ~ N2

Fast DCT

   Break down the summation into stages.
   Number of calucations ~ N log N rather than N2

  
    Since θ = π / 16, or 16θ = π, also cos(π - α) = -cos(α) cos(2π - α) = cos(α) Using these formulas, we can reduce cos(nθ) to a form of ±cos(kθ) with k ≤ 7. e.g. cos(9θ) = cos(16θ - 7θ) = cos(π - 7θ) = -cos(7θ) cos(35θ) = cos(32θ + 3θ) = cos(2π + 3θ) = cos(3θ)
Let cs = cosine
Some symmetries. Simpify.
      y0 = (x0 + x7) + (x1 + x6) + (x2 + x5) + (x3 + x4)
      y2 = (x0 + x7)c2 + (x1 + x6)c6 - (x2 + x5)c6 - (x3 + x4)c2

      where ck = cos(kθ);
Continue this process recursively. The cosine functions can be expressed in terms of each other. In the calculations we often have this form c = a + b d = a - b Can be represented by a flow graph: Flow graph of Butterfly Computation If we include the constants ak in our calculations of yk and let Ck = ckak = ck/2 = cos(kθ)/2 for k ≥ 1 (θ = π/16) C0 = 1/√2 Note that C4 = cos(π/4)/2 = 1/(2√2) We arrive at the following flow-graph:

//1st stage transform 
for (j = 0; j < 4; j++){ 
  j1 = 7 - j;
  x1[j] = x[j] + x[j1];
  x1[j1] = x[j] - x[j1];
}
//second stage transform
x[0] = x1[0] + x1[3]; 
x[1] = x1[1] + x1[2];
x[2] = x1[1] - x1[2];
x[3] = x1[0] - x1[3];
x[4] = x1[4];
x[5] = (x1[6] - x1[5]) * C0;
x[6] = ( x1[6] + x1[5]) * C0
x[7] = x1[7];
//upper-half of 3rd (final) stage
y[0] = ( x[0] + x[1] )*C4; 
y[4] = ( x[0] - x[1] )*C4; 
y[2] = x[2] * C6 + x[3] * C2;
y[6] = x[3] * C6 - x[2] * C2;
//lower-half of third stage
x1[4] = x[4] + x[5]; 
x1[5] = x[4] - x[5];
x1[6] = x[7] - x[6];
x1[7] = x[7] + x[6];
//lower-half of 4th (final) stage
y[1] = x1[4] * C7 + x1[7] * C1;
y[7] = x1[7] * C7 - x1[4] * C1;
y[5] = x1[5] * C3 + x1[6] * C5;
y[3] = x1[6] * C3 - x1[5] * C5;

Integer Arithmetic
  
  Without loss of generality, let us consider 16-bit integers
	e.g. x =  2.75 (= 21 + 2-1 + 2-2)

  Imagine there's a binary point at the right side of bit 0 of a 16-bit integer.
   bit position: 15           9 8 7 6 5 4 3 2 1 0
        x:        0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 . 1 1 

  To retain the information of the fractional part, we can left-shift 
  the whole number by 8 bits:
    bit position: 15           9 8 7 6 5 4 3 2 1 0
         x':       0 0 0 0 0 0 1 0.1 1 0 0 0 0 0 0 

  value of x' is 29 + 27 + 26 = 704 
  (x' = 2.75 * 256)
  If we right-shift x' by 8 bits, we get 2 -- effectively truncating the number.
  In most applications, rounding is preferred, which can be
  achieved by adding 0.5 to the original value:
  i.e. adding 0.5x256 = 128 to x' :
    bit position: 15           9 8 7 6 5 4 3 2 1 0
         x':       0 0 0 0 0 0 1 0.1 1 0 0 0 0 0 0 
        0.5:       0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
        x'':       0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0

   when we right-shift x'' by 8 bits, we obtain our desired value 3.

   In general, we can transform real positive numbers to integers by multiplying 
   the real numbers by 2n; rounding effect is achieved by adding the 
   value 2n−1 to the results before shifting them right n bits.

Integer-arithmetic Implementation of Fast DCT