Not logged in, Join Here! or Log In Below:  
 
News Articles Search    
 


Submitted by Jacco Bikker, posted on January 05, 2005




Image Description, by Jacco Bikker



These are some images from my latest (yet unnamed) ray tracer. Everyone has seen the bunny a thousand times now so I'm throwing in a real time demo as well. :-) As you can see, so far I have been focussing on raw performance. The ray tracer now fully implements Ingo Wald's packet tracing: Whenever possible, four rays are cast simultaneously and traversed through a kd-tree. By using packets, the ray tracer becomes very suitable for vectorization using sse/sse2 instructions. Despite some overhead this approach doubles the speed of a regular raytracer.

About the images: The top left image shows the famous Stanford Bunny model, 69k triangles. There are two (dynamic) light sources in this scene, and the animation is rendered at 512x384 pixels. On my 1.7Ghz Pentium-M laptop this runs at about 5 frames per second. To the right of this image a dump of the matching kd-tree is shown.

The two lower images show the Happy Buddha model and the matching kd-tree, which consists of no less than 1089k triangles. This model renders slightly slower; on my system it runs at about 4 frames per second.

This brings up a very interesting characteristic of ray tracing: The technology scales very well. Switching from 69k to 1089k triangles only means that some extra levels are added to the kd-tree; the speed decrease is therefore not linear at all. Besides that, ray tracing performance scales virtually linearly with processing power and the number of available processors. This means that, given a sufficiently complex scene, it's possible to outperform high-end graphics cards using commodity hardware.

Also interesting is the fact that the ray tracing algorithm is very simple. 750 lines of code get me 'z-buffering' with virtually unlimited accuracy, visibility determination, self-shadowing, recursive reflections and refractions, hard and / or soft shadows, per-pixel lighting, HDR effects and so on.

I'm currently working with Thierry Berger-Perrin to produce a more interesting demo, perhaps similar to the famous RealStorm benchmark. In the meantime, you can download the bunny demo (3.8MB).

Greets
Jacco Bikker


[prev]
Image of the Day Gallery
www.flipcode.com

[next]

 
Message Center / Reader Comments: ( To Participate in the Discussion, Join the Community )
 
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
 
Nick

January 11, 2005, 06:15 PM

El Pinto Grande wrote: That's one of the last operation, has a 4 cycles latency and that's what you call "opportunities"?

The scalar operations are potentially optimizable as well. See below...
It seems you think that you and i are the only one to know that. The fact is the compiler also does; both shufps are issued first followed by a couple of faster movhlps and only then comes some interwinded scalar min/max + comparison.

It's not a question of scheduling or instruction selection. See below...
A much better argument would be to avoid accumulating values in vectors and to go for an all scalar route (removing the need for the final horizontal crap but potentially wasting decoding bandwith etc).

What I'm suggesting comes down to nearly the same thing, except wasting the decoding bandwidth...

Ok, unless I'm terribly mistaken, your 'SSE Ray/Box Intersection Test' works with just one ray and one box. The SSE is just used to store xyz vectors. Of course that's faster that using the FPU, but still, you're wasting 1/4 of SSE's processing power to start with. It is potentially much faster to test intersections for four rays and four corresponding boxes in parallel. So use one SSE register to store all x coordinates, one whole register for the y coordinate, one for z. Do that for every component. Now, work with each of these registers as if they're just one scalar. So something like a dot product becomes three mulps and two addps, nothing more. This eliminates shuffle instructions, scalar instructions, and you're using all four components all the time. In fact, for nearly the same number of instructions you now get four dot products instead of one!

It requires data structures like vector {float x[4]; float y[4]; float z[4];} but that doesn't seem unworkable for a raytracer. This approach also increases register pressure, but that's nothing your compiler can't handle. ;-) Three registers can hold four vectors (instead of one for one), and most arguments can remain in memory.

 
El Pinto Grande

January 11, 2005, 08:11 PM

Nick wrote: Ok, unless I'm terribly mistaken, your 'SSE Ray/Box Intersection Test' works with just one ray and one box. The SSE is just used to store xyz vectors. Of course that's faster that using the FPU, but still, you're wasting 1/4 of SSE's processing power to start with. It is potentially much faster to test intersections for four rays and four corresponding boxes in parallel. So use one SSE register to store all x coordinates, one whole register for the y coordinate, one for z. Do that for every component. Now, work with each of these registers as if they're just one scalar. So something like a dot product becomes three mulps and two addps, nothing more. This eliminates shuffle instructions, scalar instructions, and you're using all four components all the time. In fact, for nearly the same number of instructions you now get four dot products instead of one!


You're damn right.
Now if you bothered reading the paper i've picked that idea from, you'll find that they've made it up only for the ray packet vs a box case :)
In fact i've implemented a coherent raytracer exactly as described in that paper (using SSE to shoot 4 rays at once against a BVH).
Been there, done that.

The way i've fixed the NaN corner case is a bit of pain for ray packets (and i didn't used the "correct" version for performance reasons), but while cooking it i thought it would still be way better than the usual branchy way to intersect 1 ray vs 1 box. Hence the article.

You're right when you say i'm wasting 1/4 of the computation, but my line of thought was that it wasn't that bad (a scalar min/max takes 2 cycles, a vector min/max 3)... i was more worried by cycles taken to store/unstore things in the vector (barillet style) and dependencies it would create. Still i thought it would be more efficient than a long sequence of scalar only ops (and funnier).

edit: ah crap, missed that you were talking about 4 rays vs *4* boxes; hmm thinking about it, there's not much to be gained from checking 4 boxes at once instead of just one (and that would further restrain applicability); with 4 rays vs 1 box you can already make full use of vectors.

 
Nick

January 12, 2005, 09:22 AM

El Pinto Grande wrote: You're damn right. Now if you bothered reading the paper i've picked that idea from, you'll find that they've made it up only for the ray packet vs a box case :) In fact i've implemented a coherent raytracer exactly as described in that paper (using SSE to shoot 4 rays at once against a BVH). Been there, done that.

I partially read that article but didn't see any SSE code for four rays. Must have missed it, sorry.
The way i've fixed the NaN corner case is a bit of pain for ray packets (and i didn't used the "correct" version for performance reasons), but while cooking it i thought it would still be way better than the usual branchy way to intersect 1 ray vs 1 box. Hence the article.

Thanks for the explanation!
You're right when you say i'm wasting 1/4 of the computation, but my line of thought was that it wasn't that bad (a scalar min/max takes 2 cycles, a vector min/max 3)... i was more worried by cycles taken to store/unstore things in the vector (barillet style) and dependencies it would create. Still i thought it would be more efficient than a long sequence of scalar only ops (and funnier).

Ok, I see your point. All I wanted to point out is that 33 cycles per intersection can definitely be improved for code like that. Packing and unpacking data can often be done in a non-critical stage (e.g. a vertex buffer only requires unpacking at creation, your bounding box hierarchy is static, etc). Dependencies can be eliminted with software pipelining and similar techniques. By the way, extra copy operations (even to/from memory) are not that bad, because they're done by a different execution unit that can work in parallel.

When people first explained to me to do my software shaders on four vertices in parallel I first claimed it had too many disadvantages. I was terribly wrong... Now I'm even planning to do four polygon setups in parallel so I don't have to pack the transformed vertices back together. :-)

 
El Pinto Grande

January 12, 2005, 10:07 AM

Nick wrote: Ok, I see your point. All I wanted to point out is that 33 cycles per intersection can definitely be improved for code like that.


Definitely, but that requires more assumptions about the data layout (or some prior data juggling) and that was against the purpose of the article.

With proper data layout intersecting a ray packet is just as fast as intersecting one, or in throughoutput speakage it's 4 times faster (if you put aside coherency issues ;).

It took quite some iterations to come up with the conclusion that we agreed from the start :p

 
This thread contains 64 messages.
First Previous ( To view more messages, select a page: 0 1 2 ... out of 2) Next Last
 
 
Hosting by Solid Eight Studios, maker of PhotoTangler Collage Maker.