Not logged in, Join Here! or Log In Below:  
News Articles Search    

 Home / 3D Theory & Graphics / Question about SSE ray/box code Account Manager
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
Pierre Terdiman

May 26, 2005, 06:07 AM

The original thread ( is closed so let's ask here.

Out of curiosity I tried the code (on a P4) and I'm getting weird results. Grab the test here:

Now, can you tell me what results you get ? And what's wrong with the profiling ? Where are the "33 cycles" mentioned in the original article ? From what I get the SSE code is slower than the non-SSE version, and the intersection point is apaprently not always correct.

Ideas ?

- Pierre


May 26, 2005, 11:59 AM

Um, your program just spits out a boatload of result on the console. Not seeing any timings. I'd guess it's printf limited ;P
(i ran it on my athlon1800+ for curiosity)


May 26, 2005, 12:06 PM

yep, same here. just a bunch of coordinates.


Pierre Terdiman

May 26, 2005, 12:42 PM

Errr, you have to run it inside VC7 for the window not to immediately vanish. It's obvious from the source code what the numbers are :) The 3 first numbers are the relevant ones:

First = number of cycles for an FPU, SAT-based ray-box test

Second = number of cycles for the SSE version from the mentioned COTD

Third = number of cycles for an FPU ray-box test (Woo algo)

The other numbers are the impact points found by the Second & Third methods. The First method is only a boolean test that doesn't compute the impact point (which explains why it's usually the fastest).

I'm wondering why the SSE code takes a crazy amount of time on my machine (~7000 cycles) instead of "33", as quoted in the COTD.

The alternative question could be: did anybody get significant *speedups* using that COTD ?

- Pierre

Jacco Bikker

May 26, 2005, 01:39 PM

COuld it be that you compiled without sse support? I.e., sse might be emulated with the compiler settings you used.


May 26, 2005, 03:37 PM


I had the same results as you, the SSE version was definately the slowest. Looking at the asm though, I expected to see some _mm_blah_blah style instructions but there were none... I don't know much about assembly other than the odd bit here and there though. Certainly no SSE stuff.


Fabian 'ryg' Giesen

May 26, 2005, 07:05 PM

Okay, after a few minutes of intense staring I discovered the reason the SSE version was so damn slow:

You didn't initialize the .w components of min/max/orig/idir. The FPU code couldn't care less, but "whatever happens to be on the stack" apparently is inf/nan/denormal most of the time.

That levels out cycle counts a wee bit. Currently looking into why it's still far off from the 33 cycles, and why the result is sometimes incorrect, because I think the algorithm is basically correct.

In any case, most of the cycles spent in all three methods currently are due to CPUID serializing, but I guess NB wasn't set to 1 in the original test suite ;)

Fabian 'ryg' Giesen

May 26, 2005, 08:01 PM

Ok, some testing later. I'm now using NB=1000 which gives me somewhat realistic timings.

With proper input values (i.e. no NaNs), the SSE ray/AABB intersection test levels out at about 89 cycles. When I comment out the "filtering" part, I arrive at 31 cycles using VC++. With Intel C++ 8.0, I get 75 cycles; so much for what was implied in the "if you have a decent compiler (wink wink, nudge nudge)" part.

Anyway, seeing as removing the "filtering" kills about 4 SSE instructions, I can only assume that the original author never really timed his code, atleast not on a P4 (though I don't see how a P3 would be much faster). The 58 cycles difference for the filtering are certainly in line with the experiences I've made with using floating point specials in general - infinites and denormals are slow in calculations. I definitely do not see any way this is going to run on an average of 33 cycles on my P4 with the way it's implemented, atleast not after looking over it some more. I've still not found out why it returns the wrong result sometimes, though frankly I haven't looked much.

On a side note, the paper the author refers to describes an algorithm that checks a 4-ray packet stored as 3 vectors (x0,x1,x2,x3), (y0,y1,y2,y3), (z0,z1,z2,z3) against a box. I see no value of 17 cycles mentioned anywhere in the paper; and using the pseudocode given as a reference, I very much doubt they arrive at 17 cycles for the whole test - 68 (or more) cycles for the 4 tests seems a lot more likely. This is all fine if it's ray packets you're working with, but not much help in a general scenario - though I guess a symmetrical version which tests one ray against 4 boxes could be divised, which would probably be more interesting outside of a raytracer setting (e.g. for collision detection).

I'd try the ray-against-4-boxes variant for SSE optimization, again with a "vertical" layout (i.e. (x,x,x,x),(y,y,y,y),(z,z,z,z) vectors). This tends to lead to much nicer code (basically just as you'd write a normal FPU variant, only you're dealing with 4 values at once), and be a lot faster aswell because you get tons less data dependencies, swizzling instructions, and perhaps most importantly, don't have to do annoying "merge steps" where you combine the results from several components of a vector into one field.

Fabian 'ryg' Giesen

May 26, 2005, 09:21 PM

And again somewhat later...

Found the correctness problem. It was one of those "duh" things - when there is no hit, the two functions report different "intersection points". But since there is no intersection anyway, those values are pure junk. Setting the respective coordinates to 0 when there is no hit (in the outer loop ofcourse) gives the results one would expect.

I've also test-implemented the ray-against-4-boxes code. This is actually absolutely straightforward: (using the macros in the original ray-against-box SSE code)

  2. static int Ray4AABB(const Point* min,const Point* max,const Point& orig,const Point& idir,float* near,float* far)
  3. {
  4.   // Fetch position+inverse direction
  5.   __m128 pos = loadps(&orig);
  6.   __m128 invDir = loadps(&idir);
  8.   // X coordinate
  9.   __m128 posX = _mm_shuffle_ps(pos, pos, 0x00);
  10.   __m128 invDirX = _mm_shuffle_ps(invDir, invDir, 0x00);
  11.   __m128 lambda1 = mulps(subps(loadps(&min[0]), posX), invDirX);
  12.   __m128 lambda2 = mulps(subps(loadps(&max[0]), posX), invDirX);
  13.   __m128 lmin = minps(lambda1, lambda2);
  14.   __m128 lmax = maxps(lambda1, lambda2);
  16.   // Y coordinate
  17.   __m128 posY = _mm_shuffle_ps(pos, pos, 0x55);
  18.   __m128 invDirY = _mm_shuffle_ps(invDir, pos, 0x55);
  19.   lambda1 = mulps(subps(loadps(&min[1]), posY), invDirY);
  20.   lambda2 = mulps(subps(loadps(&max[1]), posY), invDirY);
  21.   lmin = maxps(minps(lambda1, lambda2), lmin);
  22.   lmax = minps(maxps(lambda1, lambda2), lmax);
  24.   // Z coordinate
  25.   __m128 posZ = _mm_shuffle_ps(pos, pos, 0xaa);
  26.   __m128 invDirZ = _mm_shuffle_ps(invDir, invDir, 0xaa);
  27.   lambda1 = mulps(subps(loadps(&min[2]), posZ), invDirZ);
  28.   lambda2 = mulps(subps(loadps(&max[2]), posZ), invDirZ);
  29.   lmin = maxps(minps(lambda1, lambda2), lmin);
  30.   lmax = minps(maxps(lambda1, lambda2), lmax);
  32.   // Write back
  33.   _mm_store_ps(near,lmin);
  34.   _mm_store_ps(far,lmax);
  36.   __m128 hit = _mm_and_ps(_mm_cmplt_ps(lmin,lmax),_mm_cmpgt_ps(lmin,_mm_setzero_ps()));
  37.   return _mm_movemask_ps(hit);
  38. }

min/max should point to three vectors each (one x0-x3,one y0-y3,one z0-z3), and near/far should point to 4 floats that receive the respective tMin/tMax values.

This code again has the same deficiency as the algorithm in the paper: If one of the min/max coordinates exactly matches the ray origin and one of the ray's directional components is zero, you get NaNs somewhere inbetween which messes up the whole calculation. I think some messing around with unordered compares plus some bit-twiddling is a better way to work around this problem than the "filtering" in the original article; but then, this is testing a ray against multiple boxes, not multiple rays against one box, so it might be more practical to work around the problem "at the source" by changing the input ray direction very slightly if it exhibits the problem. Though one would have to check how that affects accuracy. If all else fails, you could a "mask" vector together with the inverse direction, to substitute the right +inf/-inf after the mulps/subps calculation, therefore avoiding 0*inf altogether. Something like

  2.   lambda1 = addps(mulps(subps(loadps(&min[1]), posX), invDirX), fixDirX);
  3.   lambda2 = addps(mulps(subps(loadps(&max[1]), posX), invDirX), fixDirX);

Where fixDir is a vector containg +inf in components where dir = 0, and invDir having set those components to zero. In any case testing one ray against lots of boxes turns everything around a bit, and you can move the trickier distinctions out of the inner loops, which seems like a good thing.

The code as presented above without the fixDir stuff levels out at 70/cycles execution in your test framework, so it is actually faster than the 1-ray-against-1-box SSE version with fixes, and about factor 2 slower than the 1-ray-against-1-box SSE version without fixes. Also, 70/4 = 17.5, so this is pretty close to those mythical 17 cycles/intersection; when I don't store the far intersection values I level out at 67 cycles, which means 67/4 = 16.75 cycles/intersections on average.

All assuming you always have 4 boxes to test again and everything is nice and in the cache, which is pretty darn optimistic. Still a nice result.

Pierre Terdiman

May 27, 2005, 02:22 AM

What can I say ? As usual you rule, Fabian. Thanks !

- Pierre

Pierre Terdiman

May 27, 2005, 02:33 AM

What can I say ? As usual you rule, Fabian. Thanks !

- Pierre


May 27, 2005, 03:01 AM

Thanks guys, I was about to check/use this code too, and you guys saved me probably a lot of troubles!

This thread contains 12 messages.
Hosting by Solid Eight Studios, maker of PhotoTangler Collage Maker.