
And again somewhat later...
Found the correctness problem. It was one of those "duh" things  when there is no hit, the two functions report different "intersection points". But since there is no intersection anyway, those values are pure junk. Setting the respective coordinates to 0 when there is no hit (in the outer loop ofcourse) gives the results one would expect.
I've also testimplemented the rayagainst4boxes code. This is actually absolutely straightforward: (using the macros in the original rayagainstbox SSE code)
static int Ray4AABB (const Point * min,const Point * max,const Point & orig ,const Point & idir ,float * near ,float * far )
{
// Fetch position+inverse direction
__m128 pos = loadps (&orig );
__m128 invDir = loadps(&idir);
// X coordinate
__m128 posX = _mm_shuffle_ps (pos, pos, 0x00);
__m128 invDirX = _mm_shuffle_ps(invDir, invDir, 0x00);
__m128 lambda1 = mulps (subps (loadps (&min[0]), posX ), invDirX );
__m128 lambda2 = mulps (subps (loadps (&max[0]), posX ), invDirX );
__m128 lmin = minps(lambda1, lambda2);
__m128 lmax = maxps(lambda1, lambda2);
// Y coordinate
__m128 posY = _mm_shuffle_ps (pos, pos, 0x55);
__m128 invDirY = _mm_shuffle_ps (invDir , pos, 0x55);
lambda1 = mulps (subps (loadps (&min[1]), posY ), invDirY );
lambda2 = mulps (subps (loadps (&max[1]), posY ), invDirY );
lmin = maxps(minps(lambda1, lambda2), lmin);
lmax = minps(maxps(lambda1, lambda2), lmax);
// Z coordinate
__m128 posZ = _mm_shuffle_ps (pos, pos, 0xaa);
__m128 invDirZ = _mm_shuffle_ps(invDir, invDir, 0xaa);
lambda1 = mulps (subps (loadps (&min[2]), posZ ), invDirZ );
lambda2 = mulps (subps (loadps (&max[2]), posZ ), invDirZ );
lmin = maxps(minps(lambda1, lambda2), lmin);
lmax = minps(maxps(lambda1, lambda2), lmax);
// Write back
_mm_store_ps(near,lmin);
_mm_store_ps(far,lmax);
__m128 hit = _mm_and_ps(_mm_cmplt_ps(lmin,lmax),_mm_cmpgt_ps(lmin,_mm_setzero_ps()));
return _mm_movemask_ps(hit);
}

min/max should point to three vectors each (one x0x3,one y0y3,one z0z3), and near/far should point to 4 floats that receive the respective tMin/tMax values.
This code again has the same deficiency as the algorithm in the paper: If one of the min/max coordinates exactly matches the ray origin and one of the ray's directional components is zero, you get NaNs somewhere inbetween which messes up the whole calculation. I think some messing around with unordered compares plus some bittwiddling is a better way to work around this problem than the "filtering" in the original article; but then, this is testing a ray against multiple boxes, not multiple rays against one box, so it might be more practical to work around the problem "at the source" by changing the input ray direction very slightly if it exhibits the problem. Though one would have to check how that affects accuracy. If all else fails, you could a "mask" vector together with the inverse direction, to substitute the right +inf/inf after the mulps/subps calculation, therefore avoiding 0*inf altogether. Something like
lambda1 = addps (mulps (subps (loadps (&min[1]), posX ), invDirX ), fixDirX );
lambda2 = addps (mulps (subps (loadps (&max[1]), posX ), invDirX ), fixDirX );

Where fixDir is a vector containg +inf in components where dir = 0, and invDir having set those components to zero. In any case testing one ray against lots of boxes turns everything around a bit, and you can move the trickier distinctions out of the inner loops, which seems like a good thing.
The code as presented above without the fixDir stuff levels out at 70/cycles execution in your test framework, so it is actually faster than the 1rayagainst1box SSE version with fixes, and about factor 2 slower than the 1rayagainst1box SSE version without fixes. Also, 70/4 = 17.5, so this is pretty close to those mythical 17 cycles/intersection; when I don't store the far intersection values I level out at 67 cycles, which means 67/4 = 16.75 cycles/intersections on average.
All assuming you always have 4 boxes to test again and everything is nice and in the cache, which is pretty darn optimistic. Still a nice result.
