Submitted by Jacco Bikker, posted on January 05, 2005




Image Description, by Jacco Bikker



These are some images from my latest (yet unnamed) ray tracer. Everyone has seen the bunny a thousand times now, so I'm throwing in a real-time demo as well. :-) As you can see, so far I have been focusing on raw performance. The ray tracer now fully implements Ingo Wald's packet tracing: whenever possible, four rays are cast simultaneously and traversed through a kd-tree together. By using packets, the ray tracer becomes very suitable for vectorization using SSE/SSE2 instructions. Despite some overhead, this approach doubles the speed of a regular ray tracer.
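
To give a flavour of what a packet looks like in code, here is a minimal sketch of the idea (not the demo's actual source; the names and data layout are made up for illustration): four rays stored side by side, clipped against an axis-aligned box such as a kd-tree node's bounds, using SSE intrinsics.

    #include <xmmintrin.h>

    struct RayPacket
    {
        __m128 orig[3];   // x, y, z origins of the four rays (structure-of-arrays)
        __m128 rdir[3];   // precomputed reciprocal directions per axis
    };

    // Returns a 4-bit mask; bit i is set if ray i intersects the box [bmin, bmax].
    inline int IntersectBox4(const RayPacket& p, const float bmin[3], const float bmax[3])
    {
        __m128 tnear = _mm_set1_ps(0.0f);
        __m128 tfar  = _mm_set1_ps(1e30f);
        for (int a = 0; a < 3; ++a)
        {
            __m128 t0 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(bmin[a]), p.orig[a]), p.rdir[a]);
            __m128 t1 = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(bmax[a]), p.orig[a]), p.rdir[a]);
            tnear = _mm_max_ps(tnear, _mm_min_ps(t0, t1));  // per-ray entry distance
            tfar  = _mm_min_ps(tfar,  _mm_max_ps(t0, t1));  // per-ray exit distance
        }
        return _mm_movemask_ps(_mm_cmple_ps(tnear, tfar));  // which of the four rays hit
    }

Every arithmetic instruction in there operates on all four rays at once, which is where the gain over tracing rays one by one comes from.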

About the images: The top-left image shows the famous Stanford Bunny model, 69k triangles. There are two (dynamic) light sources in this scene, and the animation is rendered at 512x384 pixels. On my 1.7 GHz Pentium M laptop this runs at about 5 frames per second. To the right of this image, a dump of the matching kd-tree is shown.

The two lower images show the Happy Buddha model, which consists of no less than 1089k triangles, and the matching kd-tree. This model renders slightly slower; on my system it runs at about 4 frames per second.

This brings up a very interesting characteristic of ray tracing: The technology scales very well. Switching from 69k to 1089k triangles only means that some extra levels are added to the kd-tree; the speed decrease is therefore not linear at all. Besides that, ray tracing performance scales virtually linearly with processing power and the number of available processors. This means that, given a sufficiently complex scene, it's possible to outperform high-end graphics cards using commodity hardware.
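
(A back-of-the-envelope check: a reasonably balanced kd-tree over N triangles is roughly log2(N) levels deep, so going from 69k to 1089k triangles adds only about log2(1089/69) ≈ 4 extra levels per ray, which fits the drop from roughly 5 to 4 frames per second far better than the 16x increase in triangle count would suggest.)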

Also interesting is the fact that the ray tracing algorithm is very simple. 750 lines of code get me 'z-buffering' with virtually unlimited accuracy, visibility determination, self-shadowing, recursive reflections and refractions, hard and/or soft shadows, per-pixel lighting, HDR effects and so on.

I'm currently working with Thierry Berger-Perrin to produce a more interesting demo, perhaps similar to the famous RealStorm benchmark. In the meantime, you can download the bunny demo (3.8MB).

Greets
Jacco Bikker


Image of the Day Gallery - www.flipcode.com

 
Message Center / Reader Comments:
 
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
 
Sebastian Wagner

January 06, 2005, 02:13 AM

May I propose a starting angle:

why not go for a practical, short tutorial that shows by example how to implement some math operation in SSE/SSE2? You start off by showing how to embed the asm code in regular C++ and then you go into the details.

You could do a few really short and sweet tutorials this way, starting with something really simple and adding more complicated examples as you go. Expect people to understand basic asm programming (for example we did a lot of MIPS asm programming here at my university).
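
Something as small as this could already be a useful first installment (a rough sketch, using intrinsics rather than raw asm just to keep it short): the same four-float addition written once in plain C++ and once with SSE.

    #include <xmmintrin.h>

    void add4_scalar(const float* a, const float* b, float* out)
    {
        for (int i = 0; i < 4; ++i)
            out[i] = a[i] + b[i];                // four separate additions
    }

    void add4_sse(const float* a, const float* b, float* out)
    {
        __m128 va = _mm_loadu_ps(a);             // load four unaligned floats
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));  // one instruction does all four adds
    }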

 
Jacco Bikker

January 06, 2005, 02:39 AM

The idea behind packet rendering is that it limits the amount of memory accessed. The chip that is being developed at SAAR doesn't render 4-ray packets, but more complex packets, and one of the main reasons for that is reducing memory usage.

But indeed, memory access is more or less random, which also makes ray tracing (or rather, kd-tree traversal) very hard to implement on today's graphics hardware.

About 'cool shots': In a ray tracer, a bumpy surface could reflect a soft shadow as seen through a refracted sphere. Apart from the soft shadow, this does not require trees of rays, just a chain (and even soft shadows can be faked efficiently, if you wish, in that case they are as expensive as a single shadow ray). But what I really like is that it 'just works'. Add a new feature, and it will work with all the other features.

3dsmax does not use a ray tracer, it's a scanline thingy, but I get your point. I still do not agree however: If you take the PovRay source code and take out kd-tree traversal, intersections and basic shading, you're left with only a couple of lines of code. On the other hand, if you take the Quake 1 source code and take out the polygon rasterizers and the transformation pipeline, you still have the BSP/PVS. For more recent engines much more functionality is added on top of what the 3D hardware provides. Ray tracing is simpler and does not require LODs or PVSs. If you move the core algorithm of ray tracing to hardware, what's left does not require a Tim Sweeney, IMHO.

 
Dennis de Bruijn

January 06, 2005, 02:45 AM

If you ask me, the reason why there are so few FPU or SSE tutorials is that they defeat the purpose.
I suppose you have to have a certain level of understanding of programming and asm in general to understand them. And when you have reached that level anyway, you can easily figure it out yourself.
As said, the instructions are really simple; the trick is how to use them.
And I think that comes down to experience and a certain 'knack' for assembly programming; I don't think it's something you can teach in a tutorial.
I suppose all half-decent assembly programmers are mostly self-taught.

 
Scali

January 06, 2005, 03:05 AM

3dsmax is a hybrid renderer, like most photorealistic renderers.
Also, shouldn't the analogy be that taking away the kd-tree from POVRAY is the same as taking away the BSP/PVS from Quake?

Also I don't really agree with the comment on hardware-accelerated raytracing being simple.
We all said the same when we moved from software rendering to OpenGL/D3D.
Sure, at first it was really easy, gouraud-shaded textured polygons with little effort. Then we got T&L hardware, and the per-poly approach no longer worked, you had to start batching your triangles, and rethink things like BSP/PVS systems.
Then we got programmable shaders, and suddenly things were about as intensive as software-rendering again... or actually more intensive because you have all this extra processing power, and per-pixel phong lighting is just the beginning. And then there were all the new possibilities with render-to-texture etc... so I think hardware acceleration only helped to reach a high level of complexity sooner. The actual things that are accelerated are trivial, basically only the actual rasterization and clipping are taken away from the programmer, everything else must still be coded manually.

I think it will be the same for raytracing. The actual ray/triangle intersections will be taken away, but you still have to program the shading, which also includes things like reflection and refraction of course (don't want things like that hardwired in the end)... And you will also have to solve the problems with animation, as mentioned before. When you skin a mesh, a kd-tree will not work. Also, you can only build kd-trees for static meshes themselves. You can't build one kd-tree for an entire scene when that entire scene is not static (kd-trees can't be made dynamic very efficiently). So you still have to do some kind of BSP/PVS (e.g. build a dynamic grid or octree) to eliminate all objects that cannot possibly be seen in the current frame, or else you'll end up traversing the kd-tree of every static mesh in the scene.

So no, if you ask me, raytracing won't be much easier... It may even be harder.
The problems will just be a bit different from a rasterizer.
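
To make the two-level idea concrete, here is a rough sketch (not anyone's actual design; the type names are invented): each mesh keeps its static, object-space kd-tree, and only a coarse top-level structure over object bounds is rebuilt per frame.

    #include <vector>

    struct AABB { float min[3], max[3]; };   // world-space bounds
    struct KdTree;                           // static per-mesh tree, built once

    struct SceneObject
    {
        const KdTree* tree;       // object-space kd-tree, never rebuilt
        float objectToWorld[16];  // updated per frame for animation
        AABB  worldBounds;        // refit per frame from the transform
    };

    struct Scene
    {
        std::vector<SceneObject> objects;
        // Per frame: refit each worldBounds and rebuild a coarse grid/BVH over them.
        // Per ray: test the coarse structure first, and only for candidate objects
        // transform the ray into object space and traverse that mesh's kd-tree.
    };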

 
Jacco Bikker

January 06, 2005, 03:05 AM

GCC performance: You'll have to wait till Pierre looks here again (I'll ask him to), as he introduced me to the wonders of gcc only days ago. But yes, the gcc compiler (4.0.x) actually can produce code superior to what the Intel compiler does. However, note that Pierre actually dives into the generated code to tweak the original C so it produces better code; I don't know if that's something for you. :) Perhaps someone familiar with the ins and outs of the Intel compiler would do the same and beat the gcc compiler in turn. ;) Point is they are both very good compilers.

 
Rhinoid

January 06, 2005, 03:07 AM

What makes this demo even more impressive is the fact that Jacco does most of the programming on this new project of his on the half-hour train ride to and from work (given that he can find a space to sit). Combined with the fact that he started with raytracing just a few months ago, it is really amazing.

Respect!

 
kudi

January 06, 2005, 03:23 AM

Wow, looks really good. I also wanted to try this after my exams; I think there's a lot of work to do. How long did it take?

I first wanted to work through the thesis
http://www.cgg.cvut.cz/~havran/phdthesis.html
to learn how kd-trees work.

One problem, I think, is the movement of the objects. Doesn't this change your kd-tree? Did you already implement this?

 
Jacco Bikker

January 06, 2005, 03:39 AM

Dynamic kd-trees: This is indeed a problem. Check out Ingo Wald's thesis on realtime ray tracing for details on this. It usually involves using lower-quality trees and updating those (partially) on the fly.

BTW, this took me 4 months, starting from scratch (code- and knowledge-wise). You could get a jump start by reading my articles. :) After that it takes lots of tweaking and a little help from my friends.

 
Andrew Brown

January 06, 2005, 03:47 AM

Very nice

3 GHz P4, 1 GB RAM, 256 MB ATI 9800 Pro, 10 FPS

 
Roel

January 06, 2005, 04:11 AM

very neat! the kd-tree view looks like some alien technology stuff :)

 
Nick

January 06, 2005, 04:32 AM

Sebastian Wagner wrote: May I propose a starting angle: why not go for a practical, short tutorial that shows by example how to implement some math operation in SSE/SSE2? You start off by showing how to embed the asm code in regular C++ and then you go into the details. You could do a few really short and sweet tutorials this way, starting with something really simple and adding more complicated examples as you go. Expect people to understand basic asm programming (for example we did a lot of MIPS asm programming here at my university).

That's a great idea! This way a wide public can be addressed and we could start interesting discussions about the ideas presented in each tutorial. By really splitting it into short tutorials the beginners won't be scared away by the advanced aspects and the experts can go right to the last tutorials without getting bored with 'trivial' matter.

 
Geeko

January 06, 2005, 04:33 AM

Where can I find SSE tutorials? Thx

Btw very nice work

 
Nick

January 06, 2005, 04:39 AM

Dennis de Bruijn wrote: If you ask me, the reason why there are so few FPU or SSE tutorials is that they defeat the purpose. I suppose you have to have a certain level of understanding of programming and asm in general to understand them. And when you have reached that level anyway, you can easily figure it out yourself.

Exactly my point!
As said, the instructions are really simple; the trick is how to use them. And I think that comes down to experience and a certain 'knack' for assembly programming; I don't think it's something you can teach in a tutorial. I suppose all half-decent assembly programmers are mostly self-taught.

It would be nice to document those 'tricks' though, without making it too hard for the beginner and without making it boring for the advanced assembly programmer. A classical tutorial would miss that goal I believe, but Sebastian's idea of making a series of small tutorials might actually be very effective.

 
Scali

January 06, 2005, 06:17 AM

Well, in the odd few tutorials I've written, I've tried to focus on the concepts rather than the code itself.
I suppose that is the best way... You should try to teach a way of thinking, rather than handing code. Code is fine to illustrate a point here and there, but the focus should be on the ideas behind the code, not on the code itself.
Coding should always be about concepts anyway. Concepts are universal, they work on all similar architectures and in all similar programming languages. The code itself doesn't.

 
El Pinto Grande

January 06, 2005, 06:40 AM

We're using a recent gcc snapshot (ftp://gcc.gnu.org/pub/gcc/snapshots/4.0-20050102/) with a bunch of patches, compiled for cygwin; it's what will become gcc 4.0.0 in a couple of months. The main new feature is the use of SSA trees; see http://gcc.gnu.org/gcc-4.0/changes.html for a temporary list of changes.

It's still a bit raw, and I have a couple of SSE-related bug reports in flight atm.
But it's really promising (read: it's already on par with Intel's in FP/vector-heavy code).

The doctoring done on Jacco's source was mostly done to kludge around some known bugs and to avoid some pessimization; i.e. the Const Correctness Brigade, aliasing, etc... some compilers are a bit more lenient than gcc about that (but it never hurts).

The timings in that http://www.flipcode.com/cgi-bin/fcarticles.cgi?show=4&id=65014 article also have been done with that gcc; msvc2k3 is typically more than 25% slower (don't have Intel's here so I can't extrapolate).

Nick:
You're quite right, but I still think SSE beginners could use such a toolbox; a blinking 2x2m disclaimer would be in order :)

And you cannot decently compare your otherwise neat SoftWire optimizer to what a full-blown compiler gives.

PS: it's Thierry not Pierre you silly :p

 
El Pinto Grande

January 06, 2005, 07:13 AM

Nick wrote: That's not really SSE specific.

The proper way to handle those is extremely SSE specific.

But you're right about the low-level nature of all that crap, and the later propositions for multiple snippets/tutorials really make sense.

 
Jacco Bikker

January 06, 2005, 10:13 AM

Thierry? Aw. Incredible. We've exchanged only 300 e-mails so far so I guess things will improve in the future. ;)

 
Nick

January 06, 2005, 02:39 PM

El Pinto Grande wrote: The timings in that http://www.flipcode.com/cgi-bin/fcarticles.cgi?show=4&id=65014 article also have been done with that gcc; msvc2k3 is typically more than 25% slower (don't have Intel's here so I can't extrapolate).

Why isn't that written in inline assembly? If there's a 25% performance difference when using intrinsics then this small code definitely seems worth hand-optimizing! It's useless looking for the best compiler options and such if it costs more time than actually writing the assembly manually. To me, a compiler is just a tool to generate assembly.
El Pinto Grande wrote: And you cannot decently compare your otherwise neat SoftWire optimizer to what a full-blown compiler gives.

SoftWire's strength is not its optimization capabilities (which are in their infancy), but its flexibility. Like Jacco noted, it becomes messy to manage similar code when using inline assembly or intrinsics. Things where you'd normally use function calls and conditional statements now have to be 'unrolled', 'flattened', 'inlined'. SoftWire allows you to bring structure into all that. Conditional statements become conditional compilation. Function calls become seamless inlining (in fact, the more function calls the better, because it determines the lifetimes of variables). It actually minimizes the amount of assembly code that has to be written. For example, I have a function that converts data in an SSE register to fixed-point format in an MMX register. Instead of copy-pasting that assembly code, making things messy and harder to manage, I just call that function of run-time intrinsics anywhere I need this conversion. No need to look at that assembly code ever again.
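
For comparison, roughly that kind of conversion written with plain compiler intrinsics could look like this (a loose sketch with invented names and an assumed 8.8 fixed-point format, not SoftWire code):

    #include <xmmintrin.h>   // SSE
    #include <mmintrin.h>    // MMX

    // Four floats in an SSE register -> four 8.8 fixed-point values in an MMX register.
    inline __m64 ToFixed88(__m128 v)
    {
        __m128 scaled = _mm_mul_ps(v, _mm_set1_ps(256.0f));        // shift the fraction into the integer bits
        __m64 lo = _mm_cvtps_pi32(scaled);                          // lower two floats -> int32
        __m64 hi = _mm_cvtps_pi32(_mm_movehl_ps(scaled, scaled));   // upper two floats -> int32
        return _mm_packs_pi32(lo, hi);                              // pack to four saturated 16-bit values
        // (the caller must issue _mm_empty() before touching the FPU again)
    }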

Anyway, maybe you're right and I shouldn't be pimping SoftWire here. ;-) I wish you all the luck with your projects!

 
El Pinto Grande

January 06, 2005, 05:24 PM

Nick wrote: Why isn't that written in inline assembly? If there's a 25% performance difference when using intrinsics then this small code definitely seems worth hand-optimizing! It's useless looking for the best compiler options and such if it costs more time than actually writing the assembly manually. To me, a compiler is just a tool to generate assembly.


Just for fun, I've tried to beat gcc with ad-hoc asm for that function; the best I could get was a draw (and a headache): guessing how things get scheduled on a modern CPU really isn't obvious for non-trivial functions.

In that function there are: 2 vmul, 2 vsub, 6 vmin/vmax, 4 smin/max, 2 shuffle, 2 movehl and 2*32-bit stores + 6*128-bit loads.
Ignoring loads and stores, on my K8 that's: 2*5 + 2*5 + 6*3 + 4*2 + 2*4 + 2*2 = 58 cycles.
33 cycles sounds ok as Real World performance to me.
If you don't believe me, just try :p

Now suppose I write that as inline asm. With msvc I'd be stuck with a naked/non-inlinable function, with no way for it to shortcut either the prologue or the epilogue (due to the asinine inline asm convention).
It would be better with gcc but I'd still force it to pessimize a lot.

Written with intrinsics, the compiler can inline it (if that's a win) and use higher-level knowledge to schedule around it even better; that's not science fiction, it happens, you know :)

Last but not least, as the scheduling is mostly left to the compiler, I can tailor the code gen to a different processor if need be: i.e. an Opteron and a P4 really don't behave the same way. That wouldn't be an option with inline asm (that argument is a tad dubious as is, but imagine maintaining that piece of code a couple of years later on the architecture of the day... ouch).

That 25% performance drop is just for msvc2k3; it just sucks with intrinsics.
There's no such issue with gcc (or ICC, I guess).

Nick wrote: Anyway, maybe you're right and I shouldn't be pimping SoftWire here. ;-)

No that's fine, just keep it in perspective ;)

 
Nick

January 07, 2005, 05:19 AM

El Pinto Grande wrote: Just for the fun, i've tried to beat gcc with ad-hoc asm for that function; the best i could get was a draw (and a headache): guessing how things get scheduled on a modern cpu really isn't obvious for non trivial functions.

Yeah, scheduling is fun. :-) In my experience it becomes pointless at a certain point though. Out-of-order execution is really powerful, so as long as you break long dependency chains and don't use memory loads right after a store, the hardware is able to schedule things perfectly.
El Pinto Grande wrote: In that function there are: 2 vmul, 2 vsub, 6 vmin/vmax, 4 smin/max, 2 shuffle, 2 movehl and 2*32-bit stores + 6*128-bit loads. Ignoring loads and stores, on my K8 that's: 2*5 + 2*5 + 6*3 + 4*2 + 2*4 + 2*2 = 58 cycles. 33 cycles sounds ok as Real World performance to me. If you don't believe me, just try :p

33 cycles for 18 instructions? That's terrible really. ;-) With software pipelining and more parallelization (getting rid of the scalar instructions and shuffles) I'm positive that could be reduced significantly. If I had more time, I'd definitely give it a try, but unfortunately I have exams in two weeks.
El Pinto Grande wrote: Now suppose I write that as inline asm. With msvc I'd be stuck with a naked/non-inlinable function, with no way for it to shortcut either the prologue or the epilogue (due to the asinine inline asm convention). It would be better with gcc but I'd still force it to pessimize a lot.

Inline it completely yourself. Just write the surrounding loop or whatever else is in the bottleneck in assembly too.
El Pinto Grande wrote: Last but not least, as the scheduling is mostly left to the compiler, I can tailor the code gen to a different processor if need be: i.e. an Opteron and a P4 really don't behave the same way. That wouldn't be an option with inline asm (that argument is a tad dubious as is, but imagine maintaining that piece of code a couple of years later on the architecture of the day... ouch).

That's no different when using intrinsics. If the architecture changes significantly you have to rewrite things.

 
El Pinto Grande

January 07, 2005, 09:41 PM

Nick wrote: 33 cycles for 18 instructions? That's terrible really. ;-) With software pipelining and more parallelization (getting rid of the scalar instructions and shuffles) I'm positive that could be reduced significantly. If I had more time, I'd definitely give it a try, but unfortunately I have exams in two weeks.

Did you even look at what it's doing and how? Parallelizing what?
There are two dependency chains (for the l1/l2 values) until they are NaN-filtered, fused, and then horizontally min/maxed again.
It requires a bit more than just some hand waving or software pipelining to do that in 33 cycles.
The 18 instructions figure is for a fantasy world where the register set is infinite and data magically moves around.
And don't be fooled by the way it's presented, the gcc generated code is way more tortured.

If you can come up with something faster I'd be delighted, but until you show me some real code...

Nick wrote: Inline it completely yourself. Just write the surrounding loop or whatever else is in the bottleneck in assembly too.

So I have to hand-write one callable version and another for one inlining position.
Say that at some point I don't need the intersection point and just need to know if an intersection happened. That's another version to write and schedule by hand (because you know the compiler will remove the unneeded loads/stores/computations).
Given how inline asm induces quite some pessimization, I'd better write the whole thing in asm from the start.

I've had to write things only in asm in the past and I have no nostalgia for that era, thanks.
Perhaps you should take a look at the kind of code a good compiler produces nowadays, the keyword being good.

Nick wrote: That's no different when using intrinsics. If the architecture changes significantly you have to rewrite things.

We're talking about scheduling and I wouldn't have to re-schedule everything; again, that's the compiler's job.

 
TaskForce_Reivax

January 08, 2005, 08:59 AM

The demo was amazing! Never seen anything like it. Hardware vendors should really start considering specialized hardware for ray tracing.

On my P4 2.8 HT with 256 MB RAM, I get about 8 FPS, which is very cool.

 
Nick

January 09, 2005, 08:20 AM

Here's a very interesting tutorial which could be of use: http://optimizations.org/optimizations/Siggraph2001-Klimovitski1.ppt

 
Jacco Bikker

January 11, 2005, 10:25 AM

Update, for those interested:
The demo is now 10% faster and includes a reflective floor plane (so the tracer itself is about 15% faster, as the floor plane now creates extra rays for the reflection). Not sure where the ceiling is...

 
davepermen

January 11, 2005, 10:39 AM

I removed the caches, downloaded it again, but it's still the same...

 
Lotuspec

January 11, 2005, 11:13 AM

I think he only updated his own raytracer code, not the uploaded file (yet? :)) so we'll have to wait I guess.

 
Jacco Bikker

January 11, 2005, 12:03 PM

Ow yeah, sorry, it was just a progress report. I can post a demo, but I think 15% isn't enough to justify that. I'm working on something better to show.

 
davepermen

January 11, 2005, 01:01 PM

grmbl.. I've deleted all my cache just for you! Now I want to at least see why! SEND ME A DEMO!!!

Well, easy, if something even better follows, I can wait :D Great to hear you have progress.. I'm just curious what that means for the articles.. hehe.. *waiting eagerly*

 
Nick

January 11, 2005, 01:36 PM

El Pinto Grande wrote: Did you even look at what it's doing and how? Parallelizing what?

I don't even have to look further than the shufps instructions to see opportunities to parallelize things.

Anyway, I'd truly love to prove it can be done faster, but honestly I've got exams and I uninstalled my compiler.

Besides, I wouldn't want to spoil your Aha-Erlebnis. ;-)

 
El Pinto Grande

January 11, 2005, 02:15 PM

Nick wrote: I don't even have to look further than the shufps instructions to see opportunities to parallelize things.


That's one of the last operations, has a 4-cycle latency, and that's what you call "opportunities"?
It seems you think that you and I are the only ones to know that.
The fact is the compiler also does; both shufps are issued first, followed by a couple of faster movhlps, and only then comes some intertwined scalar min/max + comparison.

A much better argument would be to avoid accumulating values in vectors and to go for an all-scalar route (removing the need for the final horizontal crap but potentially wasting decoding bandwidth etc).

 
This thread contains 64 messages.
 
 