 


Submitted by Jacco Bikker, posted on April 29, 2005




Image Description, by Jacco Bikker



A while back I sent in a ray tracing IOTD showing the Stanford bunny, rendered at high speed. I've been busy since I made that demo, and these shots show the current state of the art. The top shot represents the maximum image quality: there are textures, adaptive supersampling (for edge anti-aliasing), a bloom filter causing a subtle glow, and of course the reflections. Sadly, all this eye candy comes at a cost. The lower shot shows a very well-performing model: the number of rays per second is no less than 3 million on a 1.7 GHz Pentium-M - on a P4 @ 3.2 GHz this would be about 6 million rays per second, which is better than the SaarCOR FPGA ray tracing chip.

Over the past months, many things have improved: the overall speed of the ray tracer has improved considerably due to some stiff competition from tbp (the odd French dude), there's a complete tool chain now to get from downloaded content to ray traced images (via the .obj file format), and the functionality has been extended considerably (textures, reflections, HDRI, networked rendering, etc.).

There will be more good stuff, I'll keep you all informed. Greets - Jacco.


Image of the Day Gallery
www.flipcode.com


 
Message Center / Reader Comments:
 
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
 
lycium

April 30, 2005, 11:17 AM

the new renderer is still very much in the design phase, though i'm expecting it to be pretty fast when doing global illumination.

reasons for double precision:

1. i use a lot of accumulation (filtered aa) and numerical methods in my code
2. memory bandwidth is usually the problem, not alu time. the use of double precision is very localised, probably in registers/l1 most of the time.

i'm also not using double precision for colours, only at the aa stage. however, all the geometric computations use doubles. basically, this is targeted more at scientific visualisation (full evaluation of the serret-frenet and weingarten equations with vector display and labeling etc) than realtime use.

the renderer is supposed to be:

1. completely cpu-bound
2. very efficient
3. use little memory
4. scale very well with additional cores/cpus

most people won't be interested in it because it doesn't use triangles, however...

 
Nick

April 30, 2005, 01:52 PM

Jacco Bikker wrote: Nick: I know you can't really compare Mhz, but it's actually an educated guess: I share experiences (and demos) with tbp, who has a 2.2Ghz P4. On his machine, my demos run 60% faster than on my machine. Also, I tried earlier versions on a colleague's P4/3.2, and got roughly doubled performance. I don't know why you don't see the same results; perhaps my code is less dependent on the cache?

My code is very MMX intensive, which a Pentium 4 isn't very good at. But even if your code doesn't rely on MMX that much, I'm quite surprised there's such a big performance difference. Does your colleague's processor have a Northwood or Prescott core?

One of the Pentium 4's biggest strengths is its trace cache, which holds already-decoded micro-ops. So if you use many complex instructions, the Pentium M will have trouble decoding them while the Pentium 4 skips that work. If this is the bottleneck you should consider reordering instructions and/or replacing them with ones that are faster to decode. With a bit of luck you could get closer to that Pentium 4's performance!

 
Nick

April 30, 2005, 01:54 PM

dummy wrote: [account locked, using a decoy] I don't own a P4 (holy jumping Jesus, that would be so embarrassing)..

Any specific reason for that? Recent AMD processors are great but I don't see any need to be embarrassed about owning a Pentium 4...

 
Scali

April 30, 2005, 02:11 PM

dummy wrote: Better filtering asks for too much memory bandwidth, a scarce resource for modern CPUs. It's not really practical to go beyond bilerp.


True... but bilinear texture filtering is not exactly up to today's standards in terms of image quality... so I was hoping that we were at a point where some memory bandwidth could be sacrificed for at least trilinear filtering or such.

And animation isn't an issue as long as you can bear lowering your space partitioning efficiency (that's where the tradeoff stands).


Yes, but basically it means that using something like skinning can mean the difference between realtime raytracing and non-realtime raytracing.
It would be nice if someone would find a solution to that, since we've grown used to seeing skinning in realtime by now, and non-skinned characters are just not acceptable anymore.

 
Dan Royer

April 30, 2005, 03:24 PM

My point exactly.

 
dummy

April 30, 2005, 10:13 PM

Nick wrote: Any specific reason for that?

Because they suck, specifically.

 
dummy

April 30, 2005, 10:36 PM

Scali wrote: True... but bilinear texture filtering is not exactly up to today's standards in terms of image quality... so I was hoping that we were at a point where some memory bandwidth could be sacrificed for at least trilinear filtering or such.


GPUs can spend that much on filtering because they have that much die area dedicated to that particular bandwidth problem.
General-purpose CPUs optimize for general-purpose memory access.

So you have to balance things out and leave some for AA & friends.

 
hplus1104

May 01, 2005, 12:18 AM

A 2.4 GHz Pentium 4 probably has a 400 or 533 MHz FSB, often paired with 266 MHz memory. A 3.2 GHz Pentium 4 typically has an 800 MHz FSB, paired with dual-DDR-400 memory. A Pentium-M only has a 400 MHz FSB, and can only use a single channel of DDR-400. Thus, it wouldn't be uncommon for memory-bound algorithms to be almost twice as fast on dual-DDR-400 as on single-DDR-400.

 
Jacco Bikker

May 01, 2005, 04:34 AM

I'll post the legocar demo on Monday, so people can check out performance on various machines.

And I promise to fix the blooming effect (filter out values < 1.0). As I mentioned, the current implementation was a 'quick hack' and 'first attempt'.

BTW does anyone have tips to implement the filter in an efficient manner? The current code is painfully slow, and I wonder how this could be fixed especially when I add a conditional to it...

 
Betelgeuse

May 01, 2005, 08:58 AM

Dumb question, but as I don't know the details of your implementation... can you offload the frame buffer to hardware and use shaders to implement your bloom?

And... if the demo doesn't run on my Athlon XP I will cry crocodile tears! :)

 
Lotuspec

May 01, 2005, 10:06 AM

I don't think the Athlon XP supports SSE2 (at least my 2600+ doesn't). But maybe some more recent revisions do have support.

 
Axel

May 01, 2005, 12:27 PM

Only Athlon 64 supports SSE2, the newest revision adds SSE3 also (E3 stepping).

 
lycium

May 01, 2005, 03:21 PM

use separable filters, this allows you to compose an x-pass and a y-pass to get (correct) blurring using arbitrary size kernels in o(n*k) rather than o(n*k^2) time. also very amenable to sse(2) i'm guessing.

i seriously need to learn sse sometime, and might write a "for idiots" article when i'm done; the kind i'd really like to have now...

 
dummy

May 01, 2005, 04:14 PM

lycium wrote: use separable filters, this allows you to compose an x-pass and a y-pass to get (correct) blurring using arbitrary size kernels in o(n*k) rather than o(n*k^2) time. also very amenable to sse(2) i'm guessing.

Yep, and the high pass + tone mapping is just as easy (my 10-minute hack trying to show that on gamedev was... hmm... dismissed).
Jacco, i don't see the need for a conditional in there at all.

i seriously need to learn sse sometime, and might write a "for idiots" article when i'm done; the kind i'd really like to have now...


I guess anyone about to start tinkering with SSE is bound to come to that same conclusion.

 
Reedbeta

May 01, 2005, 04:17 PM

I don't think a conditional is needed if you can do arithmetic (saturated) subtraction...

 
sylvan

May 01, 2005, 04:39 PM

If you can't do a zero-clamped subtraction, do something like

color = color - 1.0;
color = color * (color > 0.0);

For each pixel in the buffer before you blur and add it back...

 
dummy

May 01, 2005, 11:10 PM

Yes, you can work your way via min/max (still better than those 3-op conditional moves).

 
Jacco Bikker

May 02, 2005, 03:24 AM

Yes I figured that out shortly after writing my post (the saturation idea). :) Thanks. I'm not too sure about this O(n) filtering; I'm using different weights for each element in the filter. I'm not yet sure how this can be done efficiently.

 
Jacco Bikker

May 02, 2005, 03:43 AM

Lego car performance test is online now. Please report your timings.
http://www.bik5.com/legodemo.zip

 
Jacco Bikker

May 02, 2005, 04:20 AM

By the way, my own machine peaks at 3417k rays / second for this scene. That's a 1.7 GHz Pentium-M.

 
Yonaz

May 02, 2005, 04:21 AM

Lego-Car-Demo
min: 3022K rays/s
max: 5402K rays/s (can't reproduce this value, though.. normal is 4911K)

Bunny-Demo:
min: 1927K rays/s
max: 3096K rays/s

machine:
A64 3000+
512MB DDR400-Ram (Single Channel)

Good work! If the speed improvement goes on like this, we might see some animated stuff in a few months ;)

 
ghavami

May 02, 2005, 04:28 AM

When viewing from behind the car: ~4.9-5.2 fps
When viewing from front of the car: ~6.1-6.7 fps

max rays: ~4882k
min rays: ~2621k

My specs:
Pentium 4 2.4 GHz with 533 MHz system bus and 512 KB cache
768 MB RAM, 333 MHz I think, can't check right now :P

 
Jacco Bikker

May 02, 2005, 04:40 AM

Some notes:
- You can't really compare the bunny and the car. The car is 10k triangles, the bunny 68k.
- You NEED sse2 for this demo. If you downloaded it, and noticed that it doesn't work on the AthlonXP, well, too bad. :)
- The rays/sec figure goes up if you increase the resolution. At 800x600, the rate is about 10% higher than it is at 512x384. The OpenRT figures are usually based on 1024x768 renderings.

Sorry about the frequent SaarCOR / OpenRT comparisons; the point is that I have tried to match Wald's excellent work for several months, and he was ahead by a factor of 10 or more for most of the time. So it was quite an experience when I realized I matched his performance. I think he did a marvelous job, and his extreme performance figures were a big motivation.

 
Betelgeuse

May 02, 2005, 04:41 AM

I will get back to you this time next year when I've upgraded to an Athlon 64... *wails*

 
dummy

May 02, 2005, 04:59 AM


Then you'll have the privilege to waste cycles on that in full 64bit glory with my own version ;)

 
xunil

May 02, 2005, 06:35 AM

Curious:

max rays: ~6286k
min rays: ~2621k

Most of the time it's around 4000k I guess.

Framerate basically ranges from about 6fps to 8fps depending on what part of its rotation the car is on.

This is on a Pentium 4 3.2 GHz with 1 GB dual-channel DDR2 RAM.

 
playmesumch00ns

May 02, 2005, 06:39 AM

What happens when you want to add hemispherical sampling? How does your performance stack up when there's absolutely no coherence between rays?

That's what's always bugged me about Wald's work: it's like, OK you can render that bunny really quick for a raytracer, but we can already do that with scanline...

 
[MENTAL]

May 02, 2005, 06:49 AM

P4 2.8 GHz w/ hyperthreading, 512 MB RAM (don't ask what type!)

5-6fps at all times.

 
Scali

May 02, 2005, 06:56 AM

dummy wrote: GPUs can spend that much on filtering because they have that much die area dedicated to that particular bandwidth problem. General-purpose CPUs optimize for general-purpose memory access. So you have to balance things out and leave some for AA & friends.


Well, I think the option should at least be there... so you can get the better image quality if you want... speed hit or not.
I also think that with a good texture filter, you can get the same quality with less AA, so effectively you'd be faster.

 
Marmakoide

May 02, 2005, 07:08 AM

Hi

Awesome job, real time raytracing is not a joke...
A suggestion to speed it all up:
- You use kd-trees, so the geometry must be static, but the lights can move.
In the case of static lights, instead of using the kd-tree to compute a ray from the light to an object, a 'visibility buffer' could be a good thing. A 'visibility buffer' is made of 2D boxes on a sphere; one box bounds the projection of one object on the sphere - a trick like the 'dirty rectangles' trick. More details here:
http://www.cs.ualberta.ca/~dneilson/raytracing.html

 
This thread contains 160 messages.
 
 