
Submitted by Austin Appleby, posted on August 09, 2001

Image Description, by Austin Appleby

8,000 Butterflies at 100 frames per second - no assembly, no hardcoding, no hacks.

These are three shots from a test of the particle system in my experimental engine (currently named Pandora). The butterflies are completely dynamic and independently animated, with no precalculation. When the butterflies flap their wings they climb higher, and when they don't flap they glide down. The butterflies are all different sizes, and smaller butterflies are faster than larger ones. There's also a simple physics model (a force field) that corrals the butterflies into a donut shape.

Each butterfly is a separate C++ object, and each uses virtual functions to implement its behavior (they derive from a CParticle base class). Before people complain about the performance penalties of virtual functions, I've extensively benchmarked the virtual function overhead here and it's practically negligible. The particles are moderately memory-inefficient, but that can be cleaned up quite a bit with a custom allocator. There are exactly 0 lines of assembly code used in the particle system code - the particles themselves are straight C++, and the vector math routines underneath them are hybrid C/C++.
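
The design described above can be sketched roughly like this. This is a hypothetical reconstruction, not Pandora's actual code: the class names (CParticle is mentioned in the post, but CButterfly, Update, and the motion constants are my own guesses for illustration):

```cpp
#include <cassert>

// Hypothetical sketch of the design described above: every particle is a
// C++ object deriving from a CParticle base class and implements its
// behavior through a virtual function. Member names and constants here
// are illustrative assumptions, not Pandora's actual code.
struct Vec3 { float x = 0, y = 0, z = 0; };

class CParticle {
public:
    virtual ~CParticle() {}
    virtual void Update(float dt) = 0;  // per-particle behavior, virtually dispatched
    Vec3 pos;
};

class CButterfly : public CParticle {
public:
    explicit CButterfly(float size) : size(size) {}
    void Update(float dt) override {
        // Flapping butterflies climb, gliding ones sink, and smaller
        // butterflies move faster than larger ones (speed ~ 1/size).
        float speed = 1.0f / size;
        pos.z += (flapping ? 1.0f : -0.5f) * speed * dt;
    }
    float size;
    bool flapping = true;
};
```

With a custom pool allocator for the particles, the per-object memory overhead the post mentions could be reduced without changing this interface.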

The geometry for each butterfly is essentially a square folded down the diagonal, and is built with 8 vertices and 4 triangles (4 verts and 2 triangles per side). The current version renders 8,000 butterflies at 95-100 frames a second on my development machine (a P4-1.7ghz + GeForce 3), or approximately 800,000 particles per second (3.2 million triangles per second). It doesn't use any GeForce 3 or Pentium 4 specific features (no vertex shaders, SSE, etc.) though it does use some NVidia-specific OpenGL extensions (mainly NV_vertex_array_range).
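
The quoted throughput figures follow directly from the per-butterfly geometry; a quick arithmetic check:

```cpp
#include <cassert>

// Arithmetic behind the figures quoted above: each butterfly is a square
// folded down its diagonal, modeled two-sided (4 verts and 2 triangles per
// side), and 8,000 of them at ~100 frames/sec give the stated throughput.
constexpr int vertsPerFly = 8;
constexpr int trisPerFly  = 4;
constexpr int butterflies = 8000;
constexpr int fps         = 100;

constexpr int particlesPerSec = butterflies * fps;            // 800,000
constexpr int trianglesPerSec = particlesPerSec * trisPerFly; // 3,200,000

static_assert(particlesPerSec == 800000, "800,000 particles per second");
static_assert(trianglesPerSec == 3200000, "3.2 million triangles per second");
```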

I wrote this demo just to prove that you can get excellent performance out of a C++/OpenGL engine without any sort of hacking or assembly optimization - as long as you keep your code efficient and benchmark every one of your changes, you can usually avoid any performance bottlenecks. I probably won't be releasing the full source code to the demo (as that would require releasing huge chunks of my still-in-development engine) but I can write up a quick overview of the techniques I used if enough people are interested.

-Austin Appleby



Message Center / Reader Comments:
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.

August 10, 2001, 08:35 AM

Nice work.

For what it's worth, I'm also quite obsessive about extracting maximal performance from my kit (k6-2, gf2). First off, establish what the upper throughput limit of your system actually is. I used nvidia's benmark5 and ms' optimized mesh. This is a very important step in your optimization process, because it gives you your target. Notice the *MASSIVE* differences in this upper limit caused by OS and driver versions. On my system, my upper limit is ~10MT/s for strips, ~3MT/s for lists according to benmark5. optimizedmesh gets a little better, ~12MT/s.

With your system, you should have a theoretical max of ~20MT/s. Aim for that!


August 10, 2001, 08:48 AM

Yeah, he's smooth alright!!


August 10, 2001, 08:54 AM

Just to let you know, you are not alone, Nick ;): the only reason I did not participate in the above conversation was that all the valid points I could think of had already been stated. Assembly is still my language of choice (though I haven't had an opportunity to use it lately :( ), since CPUs still "think" in machine code and not in OOP ;). However, mainstream assembly is a dying art... too bad, but that's life, and life is change, so we have to learn to live with it.
I have a horror vision before my eyes: in 10-20 years programming will be like using a GUI and telling the compiler what to do, without writing any code by hand... And only a few people (hackers, the people responsible for maintaining the machines, etc.) will know how things work behind the curtains... but hey, this is called PROGRESS... :-/



August 10, 2001, 08:54 AM

Oh yeah, that triangle limit thing was another thing I was going to comment on. It's definitely not 15MTris/s, but maybe twice that. I'm currently getting ~18MTri/sec in my engine (a nice OO one too :) on a GeForce3 + 1.5GHz with a 267ktri static untextured model at 1280x1024x32bpp (note that clearing the z-buffer and screen buffer every frame also limits performance) with one directional light. The model is built from triangle lists and isn't optimized at all. The GeForce 256 actually has a 15MTri/s theoretical limit.

Cheers, Altair


August 10, 2001, 09:02 AM

Ok, it's very off-topic, but who cares... ;)
For most people it seems easier to believe in God and destiny than just to live on their own. It is such a frightening idea that an individual's life on earth has no higher meaning... 'cause it's true.

> The answer is 42, but do YOU know the question?
Unfortunately the only guy who was able to give us an answer is dead. :-/


August 10, 2001, 09:05 AM

Oh my god !!!!!!!! Hard to accept that world of nightmare :oO


August 10, 2001, 09:17 AM

Damn, there are loads of French speakers in here, awesome :o)


August 10, 2001, 09:33 AM

Hi Ogotay!

"since CPUs still "think" in machine code and not in OOP ;)"

Very nice argument, thanks!

I like assembly, and I like C++ (object-oriented programming). The hard thing is just to find a balance between the two. I'm constantly trying new ways to make my assembly flexible and reusable, so it fits into my object-oriented framework. Maybe I need to write my own assembler ;)

In a few years, the people that still understand assembly are the ones that will write those freaky "Intentional Programming" compilers. With a little luck one of the things I'll learn at university in the last years is compiler techniques, and you can't write a fast compiler without assembly knowledge. So the assembly I know now will be very handy.

I think that even when we have 'intelligent' compilers that are very hard to beat, it's still important that people know how things work internally. It's like math: you first have to learn to do division on paper. Then we get a calculator, and after that we can use computers for more complicated calculations. Then suddenly you're in a situation where you need to go back to the basics :P We have to go with progress, but not forget the basics.

It's the same thing with my software engine. I now have more knowledge about the inner workings of a 3D engine than the newbies that start with OpenGL or Direct3D and have no clue why or how a projection matrix works. My horror vision is that in 10-20 years we will have programmers that can only link libraries together instead of writing their own algorithms and trying to improve things...




August 10, 2001, 09:40 AM

"The reason: a compiler _knows_ (or might knows) the machine much better than people..."

I can't agree with that! The compiler doesn't know the machine; the compiler only knows what humans told it! And is the compiler's author really the best assembly programmer in the world? Maybe there are faster ways to optimize some code, unknown to the compiler!

Jukka Liimatta

August 10, 2001, 09:41 AM

>Small clue: That's not fast.

Another clue: each object (8000!) has a unique transformation associated with it.

It's easier to draw more triangles when they are:
1. smaller
2. in larger sets

To get 8000 individual objects, you have two choices:
1. change MODELVIEW per object
2. transform with host

IMHO, 8000 *objects* is quite a large number, don't you think so too? 32,000 triangles is a small figure, I agree, but you're looking at the wrong figure. Go figure. ;-)


August 10, 2001, 10:12 AM

Nice screenshot, I'd also like to see a demo.
People here are always talking about speed; I like the idea of rendering so many butterflies.
Whether assembler, C++, or any other language, they are just tools and should only be used if they fit the purpose of solving a "problem".


August 10, 2001, 10:13 AM

My argument wasn't really about the existence of god, but whether people actually believe in god or not, and by winning the argument I mean that I sent the guy a link to this board with a PS saying "so there", and then ignored his counter-messages. Is there a better way to end an argument?

Shawn Kirst

August 10, 2001, 10:23 AM

Austin, you mentioned that you think you might have a vertex pipeline bottleneck. You also mentioned you are using NV_vertex_array_range. Are you updating your VAR vertex data sequentially after you manually transform it to world space? Are you reading from the VAR data? If so, you should keep a copy of the data in system memory, because VAR reads are very slow. And non-sequential AGP writes are even slower.
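
The pattern suggested here (system-memory shadow copy, sequential writes into AGP memory) might look something like the sketch below. The names are illustrative assumptions; in a real engine the AGP buffer would come from wglAllocateMemoryNV and be bound via NV_vertex_array_range:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Sketch of the suggested pattern: keep a system-memory copy of the vertex
// data, do all reads and transforms there, and push the result into the
// AGP/VAR buffer with one forward sequential write. The agpBuffer pointer
// is a stand-in for memory from wglAllocateMemoryNV (an assumption here,
// simulated with an ordinary array so the sketch is runnable).
struct Vertex { float x, y, z; };

void updateVertices(std::vector<Vertex>& sysCopy, Vertex* agpBuffer, float dx) {
    // Read-modify-write only the system copy: reads from VAR/AGP memory
    // are uncached and very slow.
    for (Vertex& v : sysCopy)
        v.x += dx;
    // One sequential copy into the AGP buffer, so writes can combine.
    std::memcpy(agpBuffer, sysCopy.data(), sysCopy.size() * sizeof(Vertex));
}
```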


August 10, 2001, 10:30 AM

LOL, remind me to never argue with you!



August 10, 2001, 10:33 AM

My point was that they seem more like individual rigid bodies because they have orientation, although now that I read the response below, there's none of the distributed-mass stuff (i.e. inertia tensors, etc.) and probably no collision volumes either. I guess the particles in my engine have orientation, but none of the other stuff either.

I thought there was a little more to this demo, that's all.



August 10, 2001, 10:55 AM

sounds like a challenge to me...


August 10, 2001, 11:01 AM

Yes, it's only a matter of time, and on some supercomputers neural nets may already outrun our brains.


August 10, 2001, 12:03 PM

There are even some Quebecers here :P


August 10, 2001, 12:21 PM

Whew... I just skipped over some majorly unreadable text above.

All I want to say is: is it for the bug contest? I mean, it fits the description completely.

Austin Appleby

August 10, 2001, 12:40 PM

Just thought I'd do a bit more of my homework and see what sort of performance I should be expecting vert-per-second-wise - I think I'm actually not that far off from where I should be.

I grabbed SphereMark off of NVidia's developer site and ran it with a few different settings. I'm running at 1600x1200x32, and SphereMark is running in a small window. Here are the results:

No texture, no light - 25.2 mtri/sec
Texture, no light - 20.9 mtri/sec
Texture, 1 directional light - 17.3 mtri/sec
Texture, 1 spot/point* light - 6.6 mtri/sec (ouch!)

* - I can't tell if SphereMark is using spot or point lights; under OpenGL the only difference between them is the light cutoff angle. There might be a difference between SphereMark and my engine here, I'm not sure.

I also grabbed the source to the demo to see how the spheres are being rendered. Looks like they're almost all triangle strips except for the end caps, which are individual triangles. The spheres are rendered using glDrawElements() and the vertex data is stored in AGP memory, so that should be about the same as my engine. Since the end caps are only a small portion of the whole sphere, I'd venture to guess that SphereMark sends a little more than 1 vertex per triangle on average, say about 1.1 verts/triangle.
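
The ~1.1 verts/triangle estimate follows from basic strip-vs-list vertex counts; a small sanity check of that arithmetic (the helper names are mine):

```cpp
#include <cassert>

// Vertex-count arithmetic behind the estimate above: with indexed
// rendering, a single triangle strip of n triangles references n + 2
// vertices, while a plain triangle list references 3n. A sphere built
// almost entirely from strips, plus a few individual cap triangles,
// therefore averages a little over 1 vertex per triangle.
constexpr int stripVerts(int tris) { return tris + 2; }
constexpr int listVerts(int tris)  { return 3 * tris; }
```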

So, with 1 point light and 1 texture SphereMark has a vertex transform rate of somewhere around 7 million verts per second, which is roughly equal to what I measured last night. I just tried the butterfly demo without lighting and with the butterfly logic paused, and it ran at ~280 fps, or about 18 mverts/sec, which again seems about right if a tad slower than SphereMark.

Good rule to remember for the future - Default OpenGL point lights are _much_ slower than directional lights. They're probably both much slower than what I could get if I used a vertex program, so I guess that will be my next avenue of optimization.
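
In fixed-function OpenGL, the difference between the two light types comes down to the w component of GL_POSITION; a minimal sketch (the GL calls are shown in comments, and the array values are arbitrary examples):

```cpp
#include <cassert>

// In fixed-function OpenGL the w component of the GL_POSITION parameter
// selects the light type: w == 0 gives a directional light (one constant
// light vector shared by every vertex), w == 1 gives a positional/point
// light (a per-vertex light direction, plus optional distance attenuation),
// which is why the point light measured above is so much slower.
// In GL code this would be passed as, e.g.:
//   glLightfv(GL_LIGHT0, GL_POSITION, directionalLight);
const float directionalLight[4] = { 0.0f, 1.0f, 0.5f, 0.0f };  // a direction
const float pointLight[4]       = { 10.0f, 20.0f, 5.0f, 1.0f }; // a position

bool isDirectional(const float pos[4]) { return pos[3] == 0.0f; }
```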

Regarding collisions/physics on the particles - there ain't any. No collision volumes, no spring-mass systems, no inertia tensors, nuttin'. Just some simple particles in a force field.



August 10, 2001, 12:54 PM

That IOTD looks cool. I'd love to see it in action. I remember an early XBox demo featuring butterflies. That was nice stuff.

Anyway, I'm not gonna rant about your performance. It's true that you don't get the best performance out of your system (and a hell of a system it is you've got there mate). Just consider all those comments a motivation to optimise your code.

Austin Appleby

August 10, 2001, 12:59 PM

The demo is now available in EXE form for your downloading pleasure. Go to, you can't miss it. Like I said before, GeForce 3 required until I have a chance to fix some stuff. Good luck running it, I haven't tested it on any other machines yet.

By the way, I have no idea what kind of bandwidth caps my Roadrunner page has, so PLEASE don't submit a link to it to any news sites without warning me first.




August 10, 2001, 01:17 PM

I've tried it ! Nice memory error !
congrats !


August 10, 2001, 01:36 PM

OK, cool, I just ran it and I have a GeForce2.

1ghz PIII with 256MB ram

It looks really damn cool - I was getting 30fps - I'm sure with a GeForce3 it would have been a much higher fps.

but the butterflies floating around look very nice

if only I had the BFG...


August 10, 2001, 01:36 PM

Hmm??? Only GeForce3, you said?? I don't think so, it's running on my GF2 too...

1Ghz Thunderbird, 512 MB RAM, GF2 GTS Pro

frame time ~30
update time ~26.5
render time ~2.5
swap time ~0.15
text time 0.00
proc time ~32.6
fps ~34
rt ~90



August 10, 2001, 01:43 PM

11fps on PII/GeForce3.

I spent some time at nVidia recently learning about vertex programs. They are really great for particle systems-- I think you could move about half the processing to the graphics card using vertex programs. With a pixel program (actually, or in the vertex program), you could do 2 sided lighting yourself and cut the geometry in half.

I understand the reluctance to use features for one card only. As a game developer, I'm interested in the GeForce3 programmable pipeline because I expect over half the high-end gamer PC's to have them within the next year, and because the X-Box basically uses a GeForce3. This means a pretty big market, and it's worth putting conditional code in if it will make the program run 2x faster on half of my users machines.

Also, programmable pipelines aren't going away-- I think by the time we hit DirectX 10, we'll see programmable pipeline standards across vendors.


Austin Appleby

August 10, 2001, 01:45 PM

What megahertz P2 are you running? Could you send me the values of the different timers that are in the profiling window?



August 10, 2001, 01:46 PM

There is a slight difference between rendering at 0.5 vertices/triangle (most high-poly closed meshes are like that) and 2 vertices/triangle. His throughput (7MVerts/sec) is not that bad, though not really good either ^^. (Just wanted to note that with this type of geometry you definitely will not push 30MTris/sec...)
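
The 0.5 vertices/triangle figure for closed meshes has a simple derivation from Euler's formula; a worked check (the helper name is mine):

```cpp
#include <cassert>

// Why a closed mesh approaches 0.5 vertices per triangle: Euler's formula
// for a genus-0 mesh gives V - E + F = 2, and in a closed triangle mesh
// every edge is shared by exactly two triangles, so E = 3F/2. Substituting
// yields V = F/2 + 2, i.e. about half a vertex per triangle.
constexpr int closedMeshVerts(int tris) { return tris / 2 + 2; }
```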


August 10, 2001, 01:47 PM

100 fps is a pretty high number though... and so is 8,000 butterflies with such complicated physics attached to each of them individually. On a slower machine it would still perform very well considering the operations that are going on.

This thread contains 152 messages.