[IPOL discuss] Fast DCT using FFTW3

Tue Feb 28 16:02:15 CET 2012

Hi Nicolas,
> Thanks for sharing the code, maybe you can put it somewhere in the
> wiki?
Sure, I'll add it to the wiki.

> * The float variant is faster, for the simple reason that the data
>   takes less place in the CPU cache, hence less memory<->CPU
>   transfers. If 6 decimals are enough for you, which is probable if
>   your input data comes from a 8bit image, single precision is worth
>   trying.

I agree. And it's not only a speed issue, but floats need the half amount
of memory than doubles.

> * Instead of copying the data from u to FFT_in, I think you could
>   directly process a block of u with the advanced FFTW3 interface
>   ->
> http://fftw.org/fftw3_doc/Advanced-Real_002dto_002dreal-Transforms.html
>   This would save a possibly expensive copy.
>
> * If you want to copy from u to FFT_in, you could use a memcpy instead
>   of the inner for() loop; this could be faster, nested loops are
>   often a big performance penalty. You could do something like
>
>       for (int j = 0; j < w; j++)
>           memcpy(FFT_in + j*w, u + (y+j)*Nx+x, w * sizeof(double));
>
> * Still about loops, you could use a single loop instead of the two
>   external ones, something like
>
>       /* z is the upper left block corner indice */
>       for (int z = 0; z < Nx * (Ny - w), z++) {
>           if (z % Nx > Nx - w)
>               continue; /* skip borders */
>           for (int i = 0; i < w; i++)
>               memcpy(FFT_in + i*w, u + z+Nx*i, w * sizeof(double));
>           fftw_execute(fft_plan);
>           ...
>       }
>
>   But without more comments, this code is harder to understand.
>
> * Multiplications are faster than divisions, so you could do
>
>     double inv_2w = 1. / ( 2 * w);
>
>     for (...)
>         FFT_out[...] *= inv_2w;
>
>   Same thing for sqrt, you only need it once:
>
>     double inv_sqrt2 = 1. / sqrt(2);
>
>     for (...)
>        FFT_out[...] *= inv_sqrt2;
>
>   Some compiler may recognize these optimizations automatically, some
>   will not. This formula may also have a small impact on the rounding
>   error.

Thank you for all your suggestions.
You're right, the code can be modified to improve it's speed. I just
wanted to show a very clear piece of code that shows how to compute the
orthonormal 2D DCT-II with FFTW.

However, it's true that using the advanced interface may be faster that
executing several times the same plan and therefore it should be used.

About the multiplications speed and pre-storing constants to use them
later, my experience with GCC is that it's asked to optimize the code
(-O3, I think), it computes in compile-time the constants.

It even analyzes the numerical expressions and tries to compute it the
shorter and faster way.
For example, if y = f(y) + 2*f(y) + 3*f(y), it doesn't call f(y) three times.

It's true that other compilers perhaps doesn't have such a set of high
level  optimizations, but GCC is almost omnipresent.

I'll add a version of the code using the advanced interface in the wiki,
however.

Best,
Miguel