 


Submitted by Jacco Bikker, posted on April 29, 2005




Image Description, by Jacco Bikker



A while back I sent in a ray tracing IOTD showing the Stanford bunny, rendered at high speed. I've been busy since I made that demo, and these shots show the current state of the art. The top shot represents the maximum image quality: there are textures, adaptive supersampling (for edge anti-aliasing), a bloom filter causing a subtle glow and of course the reflections. Sadly, all this eye candy comes at a cost. The lower shot shows a very well performing model: the number of rays per second is no less than 3 million on a 1.7 GHz Pentium-M. On a P4 @ 3.2 GHz this would be about 6 million rays per second, which is better than the SaarCOR FPGA ray tracing chip.

Over the past months, many things have improved: The overall speed of the ray tracer has been improved considerably due to some stiff competition from tbp (the odd French dude), there's now a complete tool chain to get from downloaded content to ray traced images (via the .obj file format), and the functionality has been extended considerably (textures, reflections, HDRI, networked rendering etc.).

There will be more good stuff, I'll keep you all informed. Greets - Jacco.



 
Message Center / Reader Comments:
 
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
 
Scali

May 02, 2005, 07:08 AM

If you can write the columns of your kernel as weighted rows of your kernel, then you can 'separate' it in horizontal and vertical passes.
A gaussian kernel is such a kernel.
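
To make the separability point concrete, here is a minimal C++ sketch of a two-pass Gaussian blur on a grayscale float image. It is not code from the demo; the function names (gaussianKernel1D, blurPass, separableBlur), the clamped edge handling and the row-major image layout are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized 1D Gaussian weights for taps -radius..radius.
std::vector<float> gaussianKernel1D(int radius, float sigma) {
    std::vector<float> k(2 * radius + 1);
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        k[i + radius] = std::exp(-static_cast<float>(i * i) / (2.0f * sigma * sigma));
        sum += k[i + radius];
    }
    for (float& w : k) w /= sum;
    return k;
}

// One 1D pass over a row-major width*height image. (dx, dy) is (1, 0) for the
// horizontal pass and (0, 1) for the vertical pass. Edge pixels are clamped.
std::vector<float> blurPass(const std::vector<float>& src, int width, int height,
                            const std::vector<float>& k, int dx, int dy) {
    const int radius = static_cast<int>(k.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int i = -radius; i <= radius; ++i) {
                const int sx = std::min(std::max(x + i * dx, 0), width - 1);
                const int sy = std::min(std::max(y + i * dy, 0), height - 1);
                acc += k[i + radius] * src[static_cast<std::size_t>(sy) * width + sx];
            }
            dst[static_cast<std::size_t>(y) * width + x] = acc;
        }
    }
    return dst;
}

// The 2D Gaussian is the outer product of the 1D kernel with itself, so a
// horizontal pass followed by a vertical pass gives the same result as the
// full 2D convolution.
std::vector<float> separableBlur(const std::vector<float>& src, int width, int height,
                                 int radius, float sigma) {
    const std::vector<float> k = gaussianKernel1D(radius, sigma);
    const std::vector<float> horizontal = blurPass(src, width, height, k, 1, 0);
    return blurPass(horizontal, width, height, k, 0, 1);
}
```

With a kernel of radius r, the two passes cost 2(2r+1) taps per pixel instead of (2r+1)^2 for the unseparated 2D kernel.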

 
Paul Garton

May 02, 2005, 07:26 AM

I think he knows all that, and if you read his posts you can see he is using a kd-tree

 
Crog

May 02, 2005, 07:37 AM

max rays: ~4911k
min rays: ~1912.49k

My specs:
Pentium 4 2.8 GHz Prescott with 800 MHz system bus and 1 GB of 333 MHz RAM (not running dual channel)

One thing I'd like to note is that I had Task Manager open and the processor usage NEVER went above 50%! I don't know that much about this stuff, but surely there could be room for a performance increase :P

 
Jacco Bikker

May 02, 2005, 07:51 AM

Crog:
The processor usage is that low because there is a deliberate small sleep to prevent the ray tracer from taking over the system. Apart from that, I guarantee that it eats virtually every cycle. :) The reason for the sleep is that this ray tracer can also run in networked mode; in that case one machine runs the 'master' and a single 'slave'. Other machines can connect with a 'slave' renderer running. To keep the 'slave' app on the main machine from stalling the 'master', I took some measures to prevent 100% CPU usage.

This is also the reason that max and min rays/sec differ so much: Something else apparently was more important and so the tracer more or less waited for the system to become idle again. That's why that figure is not very important, it's really the 'max' ray count that matters. For testing purposes, I don't animate the object, so I get a more stable figure. But even under those circumstances, min and max vary 10% or more. On faster machines, this is bound to be more than 10% (as you've noticed).
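
A throttle of the kind described above could look roughly like the following C++ sketch. It is only an illustration of the idea: runThrottled, the renderFrame callback and the pause length are hypothetical names and values, and the demo's actual loop is not shown in this thread.

```cpp
#include <chrono>
#include <thread>

// Hypothetical render loop: renderFrame stands in for "trace one frame".
// After every frame the thread sleeps briefly, so that other processes on the
// same machine (for example a 'master' next to a 'slave') get CPU time.
template <typename RenderFrame>
int runThrottled(RenderFrame renderFrame, int frameCount,
                 std::chrono::milliseconds pause) {
    int rendered = 0;
    for (int frame = 0; frame < frameCount; ++frame) {
        renderFrame(frame);
        ++rendered;
        std::this_thread::sleep_for(pause);  // deliberate small sleep
    }
    return rendered;
}
```

A call such as runThrottled(traceFrame, 100, std::chrono::milliseconds(1)), with traceFrame being whatever function renders a frame, would never show 100% usage in a process monitor, because part of every iteration is spent asleep.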

 
Nicolas Lelong

May 02, 2005, 08:06 AM

Crog wrote: One thing I'd like to note is that I had Task Manager open and the processor usage NEVER went above 50%! I don't know that much about this stuff, but surely there could be room for a performance increase :P


On your P4 this happens whenever a process only uses one of the two virtual CPUs... as you can see, the demo only has one thread running, so it does not benefit from hyperthreading!

 
Nicolas Lelong

May 02, 2005, 08:07 AM

Just out of curiosity, do you SSEx guys use intrinsics to code your optimized functions, or raw assembly? From what I saw, I suppose everyone is using intrinsics to let the compiler schedule the instructions. Am I right?

Has anyone explored defining the SSE intrinsic functions to target the traditional FPU, so that the same source could generate code that works everywhere?
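
One way the "same source everywhere" idea could be approached is a thin wrapper type that maps to SSE intrinsics when they are available and to plain scalar code otherwise. The C++ sketch below is an assumption about how that might look, not something any poster in this thread describes using; Vec4, load4, store4, add4, mul4 and the RT_HAVE_SSE macro are hypothetical names. The intrinsics themselves (_mm_loadu_ps, _mm_storeu_ps, _mm_add_ps, _mm_mul_ps) are the standard ones from xmmintrin.h.

```cpp
// Pick the SSE path when the compiler reports SSE support, else fall back
// to scalar float code that the compiler can schedule for the FPU.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define RT_HAVE_SSE 1
#else
#define RT_HAVE_SSE 0
#endif

struct Vec4 {
#if RT_HAVE_SSE
    __m128 v;
#else
    float v[4];
#endif
};

// Load four floats from (possibly unaligned) memory.
inline Vec4 load4(const float* p) {
#if RT_HAVE_SSE
    return Vec4{_mm_loadu_ps(p)};
#else
    return Vec4{{p[0], p[1], p[2], p[3]}};
#endif
}

// Store four floats to (possibly unaligned) memory.
inline void store4(float* p, const Vec4& a) {
#if RT_HAVE_SSE
    _mm_storeu_ps(p, a.v);
#else
    for (int i = 0; i < 4; ++i) p[i] = a.v[i];
#endif
}

// Component-wise addition.
inline Vec4 add4(const Vec4& a, const Vec4& b) {
#if RT_HAVE_SSE
    return Vec4{_mm_add_ps(a.v, b.v)};
#else
    return Vec4{{a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2], a.v[3] + b.v[3]}};
#endif
}

// Component-wise multiplication.
inline Vec4 mul4(const Vec4& a, const Vec4& b) {
#if RT_HAVE_SSE
    return Vec4{_mm_mul_ps(a.v, b.v)};
#else
    return Vec4{{a.v[0] * b.v[0], a.v[1] * b.v[1], a.v[2] * b.v[2], a.v[3] * b.v[3]}};
#endif
}
```

Calling code such as store4(out, add4(mul4(load4(a), load4(b)), load4(c))) then compiles unchanged on both paths; whether the scalar path performs acceptably is a separate question.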

 
Lewil

May 02, 2005, 08:27 AM

this is Lewil (ex pocketmatrix :x ) d'ya remember Evil Circles ? :)

good to see you're still producing such codes :) nice !! master

 
Vast

May 02, 2005, 09:37 AM

Max: 6300
Min: 4000

Some not-so-strong machine with an intel video card at my school... err.

All I know is it's a relatively new Dell, but since it's a school machine, it's not a monster. That's all I can say.

- Tim

BTW: Very impressive work!

 
Jacco Bikker

May 02, 2005, 09:44 AM

It's probably a P4 @ 3.2 (check the other scores).
BTW my original statement that a P4 @ 3.2 would render twice as fast as my 1.7 Pentium-M laptop appears to be correct after all.

 
Scali

May 02, 2005, 10:16 AM

Jacco Bikker wrote: BTW my original statement that a P4 @ 3.2 would render twice as fast as my 1.7 Pentium-M laptop appears to be correct after all.


Makes sense too, since you apparently make heavy use of SSE/SSE2, which is one of the weaker points of the Pentium-M architecture.
The same used to go for Athlons, but it seems that the Athlon64 isn't far from the P4 that corresponds to its rating. I wonder what happens at the high end... an FX55 vs a P4EE.

 
ZEN

May 02, 2005, 10:35 AM

yes true, i can see a small market for raytracing cards, but the software for them is so limited they will never make it, a chicken and egg scenario. as for physics, an addon card will only take off if it provides 10x-100x the performance of a P4 4 GHz, and I just don't see that happening. since all users will have a 2nd proc for 'free' once dual cores go mainstream, there will be a very large barrier to upgrading to a physics card.

 
ZEN

May 02, 2005, 10:40 AM

6300, 3600

P4 3.0 HT 800FSB, 1GB MEM

 
Axel

May 02, 2005, 10:40 AM

A64 2Ghz (3000+ Single Channel) 2900 Min 4800 Max

 
Scali

May 02, 2005, 11:01 AM

ZEN wrote: as for physics, an addon card will only take off if it provides 10x-100x the performance of a P4 4 GHz, and I just don't see that happening. since all users will have a 2nd proc for 'free' once dual cores go mainstream, there will be a very large barrier to upgrading to a physics card.


But PhysX does exactly that...
It can handle about 30000 rigidbodies in realtime, where a high-end P4 or Athlon will struggle with a mere 100... So even in the best case scenario, the dualcore CPUs would still only get about 200 rigidbodies in realtime. Making PhysX a huge leap forward in terms of performance.

 
LastInquisitor

May 02, 2005, 11:18 AM

A64 4000+ 2.45GHz 1GB Dual Channel DDR400
3666 Min
6308 Max

 
Betelgeuse

May 02, 2005, 11:49 AM

...where do you guys get these machines? ;p

 
Marmakoide

May 02, 2005, 12:03 PM

I know that kd-trees are used, but I am just suggesting something TO ADD TO kd-trees. The trick I'm speaking of is another preprocessing step, independent of the kd-tree processing. I'm curious to know whether it is used or not... If you take a look at the link I posted, you can see that it is a big improvement.

 
Jacco Bikker

May 02, 2005, 12:05 PM

The correct question is: Where are the P4EE's? ;)
It looks like I should really do a HT capable version.

 
Nick

May 02, 2005, 12:47 PM

Jacco Bikker wrote: BTW my original statement that a P4 @ 3.2 would render twice as fast as my 1.7 Pentium-M laptop appears to be correct after all.

Yes, I was wrong, impressive results for a Pentium 4! Created some new respect for these stoves. ;-)

 
Roel

May 02, 2005, 12:52 PM

Betelgeuse: People with lower results don't dare to reply ;)

 
Scali

May 02, 2005, 01:04 PM

Oh yea?!

Well I have a 1.6 GHz Celeron in my laptop... and I get about 3100 max, 1900 min... So there!

 
ZEN

May 02, 2005, 01:10 PM

I have a 3.0 HT, just ran it again to see what proc usage was

when running 1 instance
cpu1 20%
cpu2 80% 6300max

when running 2 instances
cpu1 100% 4900max
cpu2 100% 3600max

both seem to average around 3000-3500 though

to get the 2 instances running without disturbing the max value i started 4 instances, then closed the first 2. this prevented the first one from starting up with the full cpu available, which would then give a false reading for the max value.

 
[MENTAL]

May 02, 2005, 01:52 PM

Divide the screen into equal portions depending on the number of processors detected (obviously count hyperthreading as 2) and render each part in a different thread. Raytracing speed scales almost linearly with the number of CPUs, so doubling the CPUs can almost double the framerate (provided you use threads).

I'm sure you already know this though :).
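
The suggestion above can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the demo's code: shadePixel is a hypothetical stand-in for tracing one primary ray, and renderRows and renderParallel are invented names. std::thread::hardware_concurrency() reports logical processors, so a hyperthreaded CPU counts as 2, and it may return 0 when the count is unknown.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for tracing one primary ray through pixel (x, y).
inline std::uint32_t shadePixel(int x, int y) {
    return static_cast<std::uint32_t>(x) ^ (static_cast<std::uint32_t>(y) << 16);
}

// Render the horizontal band of rows [yBegin, yEnd) into the frame buffer.
void renderRows(std::vector<std::uint32_t>& fb, int width, int yBegin, int yEnd) {
    for (int y = yBegin; y < yEnd; ++y)
        for (int x = 0; x < width; ++x)
            fb[static_cast<std::size_t>(y) * width + x] = shadePixel(x, y);
}

// Split the image into equal bands, one per thread. Passing threadCount == 0
// means "use the number of logical processors detected".
std::vector<std::uint32_t> renderParallel(int width, int height, unsigned threadCount) {
    if (threadCount == 0)
        threadCount = std::max(1u, std::thread::hardware_concurrency());
    const int n = static_cast<int>(threadCount);
    const int rowsPerThread = (height + n - 1) / n;  // round up

    std::vector<std::uint32_t> fb(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> workers;
    for (int t = 0; t < n; ++t) {
        const int yBegin = std::min(t * rowsPerThread, height);
        const int yEnd = std::min(yBegin + rowsPerThread, height);
        if (yBegin < yEnd)
            workers.emplace_back([&fb, width, yBegin, yEnd] {
                renderRows(fb, width, yBegin, yEnd);
            });
    }
    // Bands never overlap, so the threads write disjoint parts of fb.
    for (std::thread& w : workers) w.join();
    return fb;
}
```

Equal bands are the simplest split. When some parts of the image are much more expensive than others, smaller tiles handed out on demand tend to balance the load better.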

 
Billy Zelsnack

May 02, 2005, 03:10 PM

P-M 1.6, 1GB RAM
3783 max
2425 min

Xeon 2.8, 1GB RAM
5500 max
3090 min

 
dummy

May 02, 2005, 07:23 PM

Scali wrote: Makes sense too, since you apparently make heavy use of SSE/SSE2, which is one of the weaker points of the Pentium-M architecture. The same used to go for Athlons, but it seems that the Athlon64 isn't far from the P4 that corresponds to its rating. I wonder what happens at the high end... an FX55 vs a P4EE.


It's not really fair to compare k7 to current p4.
The demo has peaked at 5.4Mray/s on my k8 (opteron 146, 2Ghz), and it's at 4.7Mray/s (more or less) for same view as in the IOTD shot.

Now 2 remarks:
. that demo is compiled with ICC and even if, so far and in my own experience, it produces the best code for IA32, it truly has a P4 slant, with shifts replaced by chained adds etc (a kludge a k8 doesn't ask for).
. on x86-64 the story is different as gcc rulez and you can use more neutral code.

And that's how i get >6Mray/s (IOTD view, not peak) with my own code on my box.

 
dummy

May 02, 2005, 07:27 PM

Scali wrote: Well, I think the option should at least be there... so you can get the better image quality if you want... speed hit or not. I also think that with a good texture filter, you can get the same quality with less AA, so effectively you'd be faster.


There's some interaction between AA & texture filtering indeed.
But that kind of raytracer really shines when you throw insane amounts of geometry at it (log(n) vs linear) and not just a handful of triangles as in this demo.
And then your main problem becomes edge anti-aliasing, not texture filtering.
So, i'm not saying you're wrong, but now you can see why AA is privileged.

 
hoho

May 03, 2005, 02:55 AM

I really doubt that it's useful to use two parallel threads on a HT processor. HT is only good when the CPU can't decode instructions fast enough or there are a lot of memory stalls. This raytracer's bottleneck, on the other hand, is probably not instruction decoding or memory stalls but memory bandwidth. Having two threads filling the cache with different data won't do any good either. On the whole I think having two threads running in parallel on a HT processor will decrease the performance.

 
zed zeek

May 03, 2005, 03:31 AM

it would be nice to control the camera/lights
athlon64 2.0ghz 512mb@200mhz
5402 max
3204 min

"The overall speed of the ray tracer has been improved considerably due to some stiff competition from tbp"
nothing like someone up your ass to get u motivated, hmmm wait that doesn't sound right

 
Scali

May 03, 2005, 03:54 AM

dummy wrote: that demo is compiled with ICC and even if, so far and in my own experience, it produces the best code for IA32, it truly has a P4 slant, with shifts replaced by chained adds etc (a kludge a k8 doesn't ask for).


I really don't think that the bottleneck in a raytracer is the shift operation. So I doubt it affects performance much.

on x86-64 the story is different as gcc rulez and you can use more neutral code.


I would not be surprised if icc generated better code than gcc in 64 bits... It never had much trouble doing it in 32 bits. And else there's the new MS compiler, which is probably even better than icc at most code.
And are you talking about neutral code, or more AMD-slanted?

And that's how i get >6Mray/s (IOTD view, not peak) with my own code on my box.


Yes, but can you compare your code to this code... and if so... what kind of performance do you get with your code on a P4?

 
Scali

May 03, 2005, 03:59 AM

I'm still not convinced... You're just doing some handwaving here..
I don't agree with any of it, by the way... the polycount has nothing to do with the amount of texturefiltering required...

But if you're doing only edge-AA, then of course the textures won't get filtered at all... So you'll have to do fullscreen AA to get improved texture quality... And even then, you need to push the AA quite far to get certain oversampled textures at decent quality.
Compare that to today's 3d accelerators, which get very good quality with only 4x rotated grid multisampling and 16x adaptive AF.
I think such a balance of AA and AF would also improve quality and/or performance in a raytracer.
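
As a point of reference for what a texture filter does at its simplest, here is a C++ sketch of a bilinear lookup: one filtered sample blends the four nearest texels instead of picking a single one. This is neither the anisotropic filtering mentioned above nor code from the demo; the Texture struct, the clamped addressing and the [0, 1] coordinate convention are assumptions for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-channel texture, stored row by row.
struct Texture {
    int width;
    int height;
    std::vector<float> texels;
};

// Fetch one texel, clamping coordinates to the texture edges.
inline float texelAt(const Texture& t, int x, int y) {
    x = std::min(std::max(x, 0), t.width - 1);
    y = std::min(std::max(y, 0), t.height - 1);
    return t.texels[static_cast<std::size_t>(y) * t.width + x];
}

// Bilinear sample at (u, v) in [0, 1], with texel centers at half-integers.
float sampleBilinear(const Texture& t, float u, float v) {
    const float fx = u * t.width - 0.5f;
    const float fy = v * t.height - 0.5f;
    const int x0 = static_cast<int>(std::floor(fx));
    const int y0 = static_cast<int>(std::floor(fy));
    const float tx = fx - x0;  // horizontal blend weight
    const float ty = fy - y0;  // vertical blend weight

    const float top = texelAt(t, x0, y0) * (1.0f - tx) + texelAt(t, x0 + 1, y0) * tx;
    const float bottom = texelAt(t, x0, y0 + 1) * (1.0f - tx) + texelAt(t, x0 + 1, y0 + 1) * tx;
    return top * (1.0f - ty) + bottom * ty;
}
```

A lookup like this costs four texel reads per ray hit, independent of how many extra rays the anti-aliasing spends on the pixel.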

 
This thread contains 160 messages.