Many years later, I revisited the code and enhanced it to use the GNU autotools; it therefore became portable to all Intel platforms under the sun (Linux/x86, Mac OS X, Windows (via MinGW gcc), OpenBSD/amd64 and FreeBSD/amd64). Benchmarks showed - as expected - that the OS had very little impact on such "pure" calculations as this; speed was pretty much the same under all of them (I used libSDL for portable HW-assisted "blitting" of the pixel data).
Recently (2009), I learned how to use OpenMP and Intel TBB to make use of additional CPUs/cores - so I added OpenMP #pragmas to both the pure-C and SSE modes... and smiled as I saw my little Atom330 get a boost from 58 to around 150 frames per second. The manually written SSE code used GCC's AT&T syntax, so it could be inlined by GCC even at -O3 optimization.
I thought that this was the limit - what else could I do now but to sit back and wait for CPU cores to multiply?... Perhaps buy an i7?
Nope. Wrong.
Last weekend, I got to play with an NVIDIA GT240 (around $100). Having read a lot of blogs about GPU programming, I downloaded the CUDA SDK and started reading some samples.
In less than one hour, I went from my rather complex SSE inline assembly to a simple, clear Mandelbrot implementation... that ran... 15 times faster!
Let me say this again: 1500% faster. Jaw dropping. Or, put another way: I went from 147fps at 320x240... to 210fps... at 1024x768!
I only have one comment for my fellow developers:
The algorithm in question was perfect for a CUDA implementation. You won't always get this kind of speedup (let alone while writing clearer and significantly less code).
But regardless of that, you must start looking into GPGPU coding: CUDA (and soon, OpenCL) can become huge game changers.
If you don't, you are in danger of missing... significant opportunities...
Update, one week later: It turned out that my GT240 is so ...fresh, that the Linux drivers aren't yet up to speed - they are driving my card at Performance Level 1 instead of 2 (i.e. at 324MHz memory clock instead of 1700MHz!). Under Windows, the code runs at an astonishing 420 frames per second, so the speedup is not 15x, it is in fact more like 30x...
Update, March 2010: The newest NVIDIA drivers (195.36.15) have corrected the bug - the Linux and Windows versions now run at identical speeds: Blazing ones :‑)
./configure && make && ./src/mandelSSE
You can zoom-in/zoom-out with the left and right mouse buttons. Hit ESC and it will report the frames per second achieved.
If you are a Windows user, here is a Win32 package (cross-compiled via MinGW). To use it, simply unzip the package and run either one of the two .bat files. The 'autopilot' one zooms in a specific location, while the other one allows you to zoom interactively using your mouse (left-click zooms in, right-click zooms out).
If you are working under Windows, a precompiled Windows binary is here (compressed with 7-zip). Make sure you disable Vertical Sync in order to get the maximum frame rate.
__global__ void CoreLoop(
    int *p,
    float xld, float yld,   /* Left-Down coordinates */
    float xru, float yru,   /* Right-Up coordinates */
    int MAXX, int MAXY)     /* Window size */
{
    float re, im, rez, imz;
    float t1, t2, o1, o2;
    int k;
    unsigned result = 0;
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    int y = idx / MAXX;
    int x = idx % MAXX;

    re = (float) xld + (xru-xld)*x/MAXX;
    im = (float) yld + (yru-yld)*y/MAXY;
    rez = 0.0f;
    imz = 0.0f;
    k = 0;
    while (k < ITERA) {
        o1 = rez * rez;
        o2 = imz * imz;
        t2 = 2 * rez * imz;
        t1 = o1 - o2;
        rez = t1 + re;
        imz = t2 + im;
        if (o1 + o2 > 4) {
            result = k;
            break;
        }
        k++;
    }
    p[y*MAXX + x] = lookup[result];  // Palettized lookup
}
; x' = x^2 - y^2 + a
; y' = 2xy + b
;
        mov      ecx, 0
        movaps   xmm5, [fours]  ; 4. 4. 4. 4.                  ; xmm5
        movaps   xmm6, [re]     ; a0 a1 a2 a3                  ; xmm6
        movaps   xmm7, [im]     ; b0 b1 b2 b3                  ; xmm7
        xorps    xmm0, xmm0     ; 0. 0. 0. 0.
        xorps    xmm1, xmm1     ; 0. 0. 0. 0.
        xorps    xmm3, xmm3     ; 0. 0. 0. 0.                  ; xmm3
loop1:
        movaps   xmm2, xmm0     ; x0 x1 x2 x3                  ; xmm2
        mulps    xmm2, xmm1     ; x0*y0 x1*y1 x2*y2 x3*y3      ; xmm2
        mulps    xmm0, xmm0     ; x0^2 x1^2 x2^2 x3^2          ; xmm0
        mulps    xmm1, xmm1     ; y0^2 y1^2 y2^2 y3^2          ; xmm1
        movaps   xmm4, xmm0
        addps    xmm4, xmm1     ; x0^2+y0^2 x1...              ; xmm4
        subps    xmm0, xmm1     ; x0^2-y0^2 x1...              ; xmm0
        addps    xmm0, xmm6     ; x0' x1' x2' x3'              ; xmm0
        movaps   xmm1, xmm2     ; x0*y0 x1*y1 x2*y2 x3*y3      ; xmm1
        addps    xmm1, xmm1     ; 2x0*y0 2x1*y1 2x2*y2 2x3*y3  ; xmm1
        addps    xmm1, xmm7     ; y0' y1' y2' y3'              ; xmm1
        cmpltps  xmm4, xmm5     ; <4 <4 <4 <4 ?                ; xmm4
        movaps   xmm2, xmm4     ; at this point, xmm2 has all 1s in the non-overflowed pixels
        movmskps eax, xmm4      ; (lower 4 bits reflect comparisons)
        andps    xmm4, [ones]   ; so, prepare to increase the non-overflowed
        addps    xmm3, xmm4     ; by updating the 4 bailout counters
        or       eax, eax       ; have all 4 pixels overflowed ?
        jz       short nomore   ; yes, we're done
        inc      ecx
        cmp      ecx, 119
        jnz      short loop1
Not only was this SSE code a pain to write, and impossible to maintain... It is now an order of magnitude slower than what you can get with cheap GPUs!
Times have changed (yet again)...
Check the Git repository for details.
Executive summary: 100s of frames per second with huge windows and interactive zooms...
Updated: Sat Dec 3 19:52:56 2022