
Nick wrote:
Ok, unless I'm terribly mistaken, your 'SSE Ray/Box Intersection Test' works with just one ray and one box. SSE is just used to store xyz vectors. Of course that's faster than using the FPU, but still, you're wasting 1/4 of SSE's processing power to start with. It is potentially much faster to test intersections for four rays and four corresponding boxes in parallel. So use one SSE register to store all the x coordinates, one whole register for the y coordinates, and one for z. Do that for every vector. Now work with each of these registers as if it were just one scalar. So something like a dot product becomes three mulps and two addps, nothing more. This eliminates shuffle instructions and scalar instructions, and you're using all four components all the time. In fact, for nearly the same number of instructions you now get four dot products instead of one!
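(Aside, to make the quoted layout concrete: a minimal sketch of the "one register per coordinate" dot product using SSE intrinsics. The `vec3x4` type and the `dot4` name are made up for illustration and do not come from the article or the paper.)

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Four 3D vectors in structure-of-arrays form (hypothetical type):
   lane i of x, y and z together holds vector i. */
typedef struct { __m128 x, y, z; } vec3x4;

/* Four dot products at once: three multiplies (mulps) and two adds (addps),
   with no shuffles and no scalar instructions. */
static inline __m128 dot4(vec3x4 a, vec3x4 b)
{
    __m128 xx = _mm_mul_ps(a.x, b.x);
    __m128 yy = _mm_mul_ps(a.y, b.y);
    __m128 zz = _mm_mul_ps(a.z, b.z);
    return _mm_add_ps(_mm_add_ps(xx, yy), zz);
}
```

Lane i of the result is the dot product of the i-th vector of `a` with the i-th vector of `b`.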
You're damn right. But if you'd bothered to read the paper I picked that idea from, you'd find that they devised it specifically for the ray packet vs. box case :) In fact I've implemented a coherent raytracer exactly as described in that paper (using SSE to shoot 4 rays at once against a BVH). Been there, done that.
The way I fixed the NaN corner case is a bit of a pain for ray packets (and I didn't use the "correct" version for performance reasons), but while cooking it up I thought it would still be way better than the usual branchy way to intersect 1 ray vs 1 box. Hence the article.
You're right when you say I'm wasting 1/4 of the computation, but my line of thought was that it wasn't that bad (a scalar min/max takes 2 cycles, a vector min/max 3)... I was more worried about the cycles taken to get things in and out of the vector (barillet style) and the dependencies that would create. Still, I thought it would be more efficient than a long sequence of scalar-only ops (and more fun).
Edit: ah crap, I missed that you were talking about 4 rays vs *4* boxes. Hmm, thinking about it, there's not much to be gained from checking 4 boxes at once instead of just one (and that would further restrict applicability); with 4 rays vs 1 box you can already make full use of the vectors.
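To show what "4 rays vs 1 box" looks like in practice, here is a rough sketch of a slab test on a packet of four rays. It is not the code from the article or the paper: the `ray_packet4` and `aabb` types and the function names are hypothetical, the rays are assumed to carry precomputed inverse directions, and it uses plain min/max with no special handling of the NaN corner case mentioned above.

```c
#include <float.h>      /* FLT_MAX */
#include <xmmintrin.h>  /* SSE intrinsics */

/* Four rays in structure-of-arrays form (hypothetical layout). */
typedef struct {
    __m128 ox, oy, oz;  /* origins */
    __m128 ix, iy, iz;  /* precomputed 1/direction */
} ray_packet4;

typedef struct { float min[3], max[3]; } aabb;

/* Clip the running [tnear, tfar] interval of all four rays against one slab. */
static inline void clip_slab(__m128 lo, __m128 hi, __m128 o, __m128 inv,
                             __m128 *tnear, __m128 *tfar)
{
    __m128 t0 = _mm_mul_ps(_mm_sub_ps(lo, o), inv);
    __m128 t1 = _mm_mul_ps(_mm_sub_ps(hi, o), inv);
    *tnear = _mm_max_ps(*tnear, _mm_min_ps(t0, t1));
    *tfar  = _mm_min_ps(*tfar,  _mm_max_ps(t0, t1));
}

/* Returns a 4-bit mask: bit i is set if ray i hits the box at some t >= 0. */
static inline int packet4_hits_box(const ray_packet4 *r, const aabb *b)
{
    __m128 tnear = _mm_setzero_ps();      /* rays start at t = 0 */
    __m128 tfar  = _mm_set1_ps(FLT_MAX);  /* and are unbounded */

    /* The box bounds are broadcast, so every lane tests the same box. */
    clip_slab(_mm_set1_ps(b->min[0]), _mm_set1_ps(b->max[0]), r->ox, r->ix, &tnear, &tfar);
    clip_slab(_mm_set1_ps(b->min[1]), _mm_set1_ps(b->max[1]), r->oy, r->iy, &tnear, &tfar);
    clip_slab(_mm_set1_ps(b->min[2]), _mm_set1_ps(b->max[2]), r->oz, r->iz, &tnear, &tfar);

    return _mm_movemask_ps(_mm_cmple_ps(tnear, tfar));
}
```

Every lane does useful work here, and the whole packet needs only one branch, on the returned mask.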
