Not logged in, Join Here! or Log In Below:  
News Articles Search    

Submitted by Austin Appleby, posted on August 09, 2001

Image Description, by Austin Appleby

8,000 Butterflies at 100 frames per second - no assembly, no hardcoding, no hacks.

These are three shots from a test of the particle system in my experimental engine (currently named Pandora). The butterflies are completely dynamic and are independently animated - no precalculation. When the butterflies flap their wings they climb higher, and when the don't flap they glide down. The butterflies are all different sizes, and smaller butterflies are faster than larger ones. There's also a simple physics model (a force field) that corrals the butterflies into a donut-shape.

Each butterfly is a separate C++ object, and each uses virtual functions to implement its behavior (they derive from a CParticle base class). Before people complain about the performance penalties of virtual functions, I've extensively benchmarked the virtual function overhead here and it's practically negligible. The particles are moderately memory-inefficient, but that can be cleaned up quite a bit with a custom allocator. There are exactly 0 lines of assembly code used in the particle system code - the particles themselves are straight C++, and the vector math routines underneath them are hybrid C/C++.

The geometry for each butterfly is essentially a square folded down the diagonal, and is built with 8 vertices and 4 triangles (4 verts and 2 triangles per side). The current version renders 8,000 butterflies at 95-100 frames a second on my development machine (a P4-1.7ghz + GeForce 3), or approximately 800,000 particles per second (3.2 million triangles per second). It doesn't use any GeForce 3 or Pentium 4 specific features (no vertex shaders, SSE, etc.) though it does use some NVidia-specific OpenGL extensions (mainly NV_vertex_array_range).

I wrote this demo just to prove that you can get excellent performance out of a C++/OpenGL engine without any sort of hacking or assembly optimization - as long as you keep your code efficient and benchmark every one of your changes, you can usually avoid any performance bottlenecks. I probably won't be releasing the full source code to the demo (as that would require releasing huge chunks of my still-in-development engine) but I can write up a quick overview of the techniques I used if enough people are interested.

-Austin Appleby

Image of the Day Gallery


Message Center / Reader Comments: ( To Participate in the Discussion, Join the Community )
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.

August 10, 2001, 03:53 AM

About assembler nowadays:

I have found out that it's easy to optimize with it.
Not to start another round of "asm vs algorithmic"..
it's supposed to use both.. first find a best algo
for the job, then code it in assembler.

Only things I find that is needed to do with assembler
are (x86 "platform independant" rules):

1. Get instruction accessing memory to as few as possible
2. Get number of instruction to as few as possible
3. Greate about 3 independant paths, which could be
running parallel.

There is it, cpus native out-of-order execution handles the
rest. There's not much use to "pair" the instructions for
p2/k6-2 above, even less to p3/athlon.

Allso avoiding instructions in code, which performs poorly
on intel- or amd-platform, is a good thing.
Generaly i have found that athlons run allmost any code to
near max without doing much work. P3 is doing allmost as good.

Vector stuff (simd for babies) is another thing, but code using
sse or 3dnow is usually very reusable, mostly math-routines.
So spending time optimizing those will pay off.

... oh.. and nice picture.. i just tough you should get them to
run alot faster (without any assembler) than that. ;)
If you opt the memory usage, it should FLY on p4, compared to
current situation.


August 10, 2001, 04:05 AM

mmmmh ...

don't want to seem to be an MC BAXTON clone, but ...

what the fuck is the impressive thing in this IOTD ???

i mean, there is NO performance at ALL !

... 100 fps on a P4 1.7GHz + GeForce3 ... and you call that efficient code ????

moreover, you want to give us a lesson about assembly coding and optimization ???
the ONLY nice things on this IOTD are the butterflies and the way they have been implemented ... don't try to impress anyone with your "performance" please ...


August 10, 2001, 04:28 AM

Hey flure, tu te laches :o)

Mike Taylor

August 10, 2001, 04:31 AM

Can you do this?

-Mike Taylor

Austin Appleby

August 10, 2001, 04:39 AM

Here's some responses to the questions/comments/flames - more later (I can only fit so much in one post...)

1. "Your system's absurdly powerful, no wonder you get such good performance. / Have you tried it on another system?" -
Yes, it is. I used to work as a games & graphics programmer, and I try and keep my system up to date. I haven't tested it on any other systems, though I hope to get some feedback once the demo is up on my (still under construction) website.

2. "Do the butterflies move because they're flapping their wings, or do they flap their wings because they're moving?" -
I tried it the first way and it worked OK, but the second way was faster. The butterflies' wing flapping speed is proportional to their vertical velocity, and their velocity is controlled by a semi-random "wander" force plus a friction (air resistance) force.

3. "Can we get the demo EXE somewhere?" -
I'm in the process of setting up my Roadrunner personal home page, I should have a copy of the executable up there in a few days.

4. "Does each butterfly turn in the direction the particle is heading?" -
Yes, they turn to face the direction they're heading. Each particle is made up of (at least) a position, a velocity, an acceleration, and a 3x3 orientation/scale matrix.

5. "A particle system that renders below 1 million particles per second is pathetic / 32,000 triangles at 100 fps (3.2m triangles/sec) is pathetic" -
Point sprites are good at rendering simple particle systems (i.e. clouds of transparent sprites) and a GeForce 3 can render many many millions of them per second. However, point sprites are strictly 2D and can't be used for particle systems where each particle is a 3d object. A GeForce 3 can render 15+ million triangles per second, but only when the geometry is static (or dynamic with certain restrictions, see NVidia's vertex array range demo) and the geometry is heavily stripified. This project was a test to see how fast I could draw small animated 3d particles, so I couldn't use point sprites. I could probably draw many more triangles per second, but the demo is CPU limited (Yes, I know, aren't I pathetic! I should optimize more!) so the graphics card isn't the bottleneck.

6. "Why are you so anti-assembly-language?" -
Like one of the other posters said, I'm not really anti-assembly-language, I'm pro-algorithmic-optimization and pro-application-profiling (and pro-code-clarity and pro-abstraction, etcetera). I've seen a lot of engine code mucked up by assembly optimization that didn't improve performance enough to justify the man-months that went into optimizing it, so I decided not to use it and thought that it would be good to show that you don't always need it.

7. "Maxing out a P4 + GeForce3 is nothing to brag about." -
Er, if I wasn't maxing out my machine then I wouldn't be demonstrating the maximum performance I could achive, right? Would it be more impressive to say "I'm drawing 40,000 particles per second and only using 5% of my CPU"?

8. "You're not doing any texture swapping." -
No, I'm not. I once had 4 different colors of butterflies in there (4 different textures) and rendered the particles in 4 batches with a texture swap in between. It rendered at nearly the same speed, but the effect was really confusing to look at - it was more like rainbow noise than a nice cohesive cloud of butterflies.

9. "I wonder if it would be better to use 2 triangles instead of 4." -
I tried it, and without proper two-sided lighting support it looked really funky (I turned backface culling off and set the normals at the corners pointing outwards). Right now there's four verts for the top side with normals pointing up, and four for the bottom with normals pointing down. I could probably get a _huge_ optimization by doing some proper two-sided lighting using NVidia's vertex programs, but I didn't want to open that can of worms quite yet.

10. "When you blow them up, do butterfly guts fly everywhere? Can I have your computer?" -
Bwahahaa... damn, that would be really really funny... I should do that next - exploding butterflies/pigeons/pegasuses (pegasi?) And no, you can't have my computer - I'd suck at Tribes 2 even more without it. ;-)

More comments and some background info in the next post.



August 10, 2001, 04:46 AM

Really nice.

But i have a poor 800Mhz P3 and a GEFORCE 2. I can
display about 100 frogs at 50 FPS ;-)


August 10, 2001, 04:51 AM

Its nice to see an IOTD which is considerably different to the rest. Give us all a shout when its downloadable please.



August 10, 2001, 05:02 AM

Yes, best would be to post a link to the document ;), if that is not possible, send it over to me, too.

Frans Bouma

August 10, 2001, 05:14 AM

You say your butterflies are made of 8 vertices. But isn't a triangle fan with just 5 vertices a lot faster? Or are the nVidia drivers not faster using tri-fans (or tri-strips for that matter using 5 vertices per butterfly) than when you just use tri's with all their vertices?

Just wondering :) It looks real nice. Dunno when you get the 100fps, when the camera is close/in the cloud of butterflies or a little away from the butterflies. (imho it makes a lot of difference)

Austin Appleby

August 10, 2001, 05:17 AM

Ok, those responses are out of the way. Now for some more info -

First off, thanks for the new-asshole-ripping. I tend to rant without enough justification beforehand, and it's good to have people point out where I'm screwing things up communication-wise.

Hmm. Background. Well, for starters, I decided to write this demo as both a test of my engine and as a little stab at the guys who did the XBox demo with the pond and butterflies - everyone raved over the "huge" numbers of butterflies (which looked to be a few hundred max), so I decided to see how many I could do on my own machine. However, I decided that I oughta stick with the programming conventions that I use on real game code, which meant that there would be a few extra rules -

1. No assembly. It's not worth the time and effort, especially on funky-20-stage-pipelined and doublespeed-ALU processors like the P4. Too many new rules to follow - I'd rather let the compiler handle that.

2. Minimal compromises to good OO design principles - make it fast, but try and keep the external interfaces clean. This means that any new particles have to derive from CParticle and implement their special behaviors by overriding CParticle's virtual functions - no hardcoded butterfly logic inside the particle engine.

3. Make it fast - Benchmark everything before and after each change that might affect performance. Use the most efficient way of doing things whenever you can do so without breaking rule #2.

Blah blah blah - this shouldn't be anything new to the other ex-industry guys out there.

OK, now for the meaty bits -

To make a long story short, there are a few main optimizations that I found work quite well -

1. Transform your particles to world space on the host CPU and let the GPU do world-to-screen transforms. Trying to change the OpenGL matrix stack before every particle is horribly, horribly slow.

2. Store your particle geometry in a vertex array allocated in AGP memory, and dump the whole thing to the GPU using just a few big glDrawElements calls.

3. Break the particle system logic up into pieces and have each piece run at a different rate. Instead of having one "virtual void CParticle::Update()" function, I broke that into a couple separate functions, and each function isn't necessarily called each frame :

Move - Move each particle along a parabolic path determined by its current velocity and acceleration. This is done for each particle every frame. I also stuck the wing-flapping code here so that it'd run every frame too.

Turn - Reorient the particle. The butterflies implement this method to make them face the direction they're moving. I can also make them do odd things like face the camera and spin like pinwheels. This runs 60 times per second, though it's tolerable to turn them only 30 times per second.

Phys - Apply our physics model to the particle. In the butterfly's case this is just a simple force field that combines a pull-to-center force, a go-around-in-circles force and a friction force. Doh! I just noticed that I'm re-phys-ing all my particles every frame... I should probably only be doing this 20-30 times a second...

Think - Run whatever you want here. The butterflies just pick a new wander force 10 times per second.

CalcSortValue - Each particle has a CalcSortValue() method that is used by the particle engine to keep the particles in back-to-front (or front-to-back) order. This doesn't actually need to be done very often for most particle effects - 10-15 times per second seems OK. I'm not using this for the butterflies anymore, as they're now drawn as alpha-tested cutouts instead of alpha-blended textures.

Each of those particle functions can be called either X times per second or every X frames. Each particle doesn't keep track of its next think/turn/sort time, that's done for the particle system as a whole - all the butterflies think at the same time, they all turn at the same time, etcetera. It would probably look better if the turn/phys times were staggered more, but that would add a bunch of conditional logic and per-particle counters that I didn't want to worry about.

Those are the major important bits that I can think of right now, as it's 4 in the morning and I'm rather sleepy. Any other questions?


Anton Knyazev

August 10, 2001, 05:20 AM

Can I have it too, please?

Austin Appleby

August 10, 2001, 05:20 AM

A fan wouldn't work because you wouldn't be able to render the whole thing with a single glDrawElements call - you'd need one call per fan. Also, since the normals for the top and bottom vertices are different, you'd need degenerate triangles in the fan to get the lighting to work properly. Strips aren't much better either - you could render the whole thing with one drawElements call, but you're still stuck with lots of degenerate triangles. I'm just using plain ol' GL_TRIANGLES, which is reasonably fast on a GeForce with indexed vertex arrays due to it's post-transform vertex cache.

-Austin Appleby


August 10, 2001, 05:21 AM

A famous comedian hit the sweet spot when he said:

"God is mankind's greatest invention. Once invented, you could never disprove his existance."

I think that is so true...

Austin Appleby

August 10, 2001, 05:26 AM

To the guy who replied to a comment with the message "Can you do this?" - Thanks for reminding me that I shouldn't take the flames quite so seriously. I'm hoping someone will make their own demo that's a few times faster than mine, as in all honesty I'm not particularly good at optimization (not to mention the fact that I've just noticed a few glaring slowdowns as I was reading through my code to post my responses...)

To all you out there who think that 800,000 butterflies per second is too damn slow - Make your own demo, and make it faster than mine. I'll have mine available for download sometime tomorrow at, though it will still require a GeForce 3 (crappy planning on my part - I left out some conditional stuff in my GL extension initialization code. I'll swap out the GF3 with my GF2 and try to get it working soon - promise.)


Austin Appleby

August 10, 2001, 05:37 AM

I did see the particle-system-in-vertex-programs demo, and though it was interesting I wouldn't be able to use the same kind of techniques that he did. His particles were relatively simple and had only static state - if you know the positions, forces, and velocities at time 0 it's very easy to calculate them at time X in a vertex shader. Since I'm dealing with particles that have to keep track of a randomly changing wander direction and a variable-speed wing-flap-speed variable that's calculated from the vertical velocity (amongst other things), I'd never be able to implement them in vertex programs.

One optimization that I really ought to take a look at is replacing the 4 triangles (2 top, 2 bottom) with just 2 and doing proper two-sided lighting in the vertex shader. Half the geometry, half the transforms, more speed = oh so tasty and good.



August 10, 2001, 05:37 AM

In a straight fight, Id back the frogs against the butterflys...


August 10, 2001, 05:39 AM

Well, i'm totally agree with Austin Appleby, asm can be useful sometimes, but i think that compilers can optimize much better than a person. The reason: a compiler _knows_ (or might knows) the machine much better than people, it know the delay before a branch, it knows the way to order the instructions... and all that stuff.
I think the best way to attack the problems is to find a good algortihm to solve it, and if it really hits the performance, and you know enough the machine try to optimize it... but most of cases you won't need it, or you won't be able to do it.
Ending, Austin did a really good job, original and, surely, clean and human readable.
That's my two cents

Mark Friedenbach

August 10, 2001, 05:41 AM

I just have to complemnt you on how well you seemingly took all that criticism.


August 10, 2001, 05:48 AM


I think you should have hacked or something to get decent performance :)

You seem to be very proud of the fact that you can get 3.2 Million tri's per sec. on a GEFORCE 3.
(And I'm not even mentioning the 1.7Ghz...)

So what's the point saying - "look guys, I'm proving that you can do all this without hacking and asm" while the performance is simply BUTT ? This could easily be done without asm, hacking and even without bragging at least 10 times as fast on your hardware.

As someone here shouted - wow, amazing, I could do that on a Voodoo2...



August 10, 2001, 05:50 AM

give me such a machine, i'll do it ...

anyway i never said this IOTD wasn't nice ... just that it wasn't impressive in terms of performance ...

by the way ... nice OO design, etc ...

Austin Appleby

August 10, 2001, 06:07 AM

Just added a "pause" key that stops the particle system from updating so I could check the GPU load - with none of the butterfly logic running, the engine update loop takes 0.02 msec, the render loop 0.6 msec, and the stall during frame swap takes 8.3 msec. So, it's running at 110 fps (or about 3.5 million tris/sec) with the butterfly logic (which used to be the bottleneck) turned off. Since I'm drawing the butterflies using GL_TRIANGLES and indexed vertex arrays, I'm sending (and transforming) 2 vertices per triangle, or about 7 million vertices per second. Not great, not bad. My vertices are rather fat (40 bytes : position is 3 floats, normal is 3 floats, texcoord is 4 floats) but I don't think that would be too much of a problem. Hmm. Looks like I have a vertex pipeline bottleneck somewhere else in my engine that I should take a look at...

Gawd, I really oughta sleep now - the sun will be up in another hour or so...



August 10, 2001, 06:13 AM

Austin, about the assembly stuff, I never said anything bad about your program. So sorry if you feel offended. I just wanted to react to isobel who said that assembly should be avoided nowadays...

I think your program is impressive, even with the P4 and GF3. 800,000 'particles' per second means you've got only a little more than 2,000 clock cycles per particle. Without SSE or hand-optimised assembly, that's a big limit!

Transforming a vertex with my software engine's optimised matrix code (no SSE yet) takes about 50 clock cycles, and the C++ version is just a little slower. To transform your eight vertices per butterfly to world space that will most probably take already 500 clock cycles for just the vertex position. Or do you use OpenGL's optimised matrix stuff that does the transformation with SSE in just a few clock cycles?

I'm a little curious, what's are the real bottlenecks in your program?

Kind regards,



August 10, 2001, 06:16 AM

ca defoule :)


August 10, 2001, 06:38 AM

Having read through the posts one thing that hasn't been mentioned
but which I think is important is that with modern processors, IMHO
the best method will be to learn what the processor and compiler are
trying to achieve and to help them out whenever possible.

Simply being aware of what the compiler will convert your c/c++
statements into should allow you to write fairly fast code most of
the time. I'm not saying you'll never want to write something in
assembly just that knowing what goes on 'under the hood' is likely
to make you a better driver ;)


August 10, 2001, 07:19 AM

"The reason: a compiler _knows_ (or might knows) the machine much better than people, it know the delay before a branch, it knows the way to order the instructions... and all that stuff."

Thats true, but it's really easy to say that's the task of the compiler. We got programs like VTune to assist us with that while optimising assembly code. It also has a "code coach" that can do automatic optimisations, but it tells me very often "No optimisations found" while I can optimise it a lot further.

I think a lot of people are 'scared' of assembly because their first attempt to optimise some code failed. So they quickly think compilers are always better. You need a lot of practice, just like you need a lot of practice to write good C++ programs or to learn Win32 for example. Assembly is really not that hard if you spend 50% programming assembly and 50% programming C++ like I do for my software engine.

It's not easy, but certainly not impossible...


August 10, 2001, 07:21 AM

Um, no.

A [chess computer | compiler] searches a discrete space of [board states | instruction combinations] for the optimal configuration ([computer wins | fastest/smallest/etc].

You're also confusing 'now' and 'never'. Humans will currently almost always beat a computer at assembly coding. That was also true of humans beating chess computers twenty years ago.


August 10, 2001, 07:24 AM

yes of course, no man ever walked on the moon, all the videos were filmed in some Walt Disney Studios, just to get some advance agaisnt russia during the cold war ...

but i thought everybody knew that ...

Bemmu Sepponen

August 10, 2001, 07:33 AM

Humans write chess programs, so chess programs will never play better than humans.

Frans Bouma

August 10, 2001, 08:19 AM

Ok agreed, haven't thought of that glDrawElements limit (I never used fans, so forgive me ;)). However I don't see the issue with the normals. Ok, the vertices that are shared between both wings are not showing the correct lighting when you just pass the normal of 1 face but you can average that normal and pass that. I don't think that will give you any artifacts. I don't know what you ment with "degenerated triangles"..


August 10, 2001, 08:20 AM

Austin: You could have individually colored butterflies without state changes by using vertex coloring.

About asm/c++ ranting: It's true that you can always have atleast as fast asm code as c++ code (just because of the fact that you can use asm output of the compiler and optimize it further), but that's not the point. The point is, which way you should invest your time to reach best performance in given time. Nowdays compilers do very good job in optimization and there is no much point to waste time on optimizing those things by hand. Of course there is exceptions like very tight pixel loops and such, but in many cases a lot better performance as a whole can be achieved by concentrating on algorithms and memory usage (as has already been mentioned) rather than local 'hacker' optimizations by using assembler.

Also if you head to optimize things in asm too early and your product will be released let say in a year, it's very much likely that during that year compilers are also optimized to produce more optimal code and new processors will be introduced, which makes your asm code even more obsolete. Optimizing by hand should be like the last trick in the sleeve and exploited in the very final step in the production if it's really necessary.

I know (from personal experience) that it's fun to optimize things in asm, but it doesn't have much more meaning than the temporary pleasure which you get from the results. More meaningful is to design your code so that it can be easily optimized later if needed.

Cheers, Altair

This thread contains 152 messages.
First Previous ( To view more messages, select a page: 0 1 2 3 4 5 ... out of 5) Next Last
Hosting by Solid Eight Studios, maker of PhotoTangler Collage Maker.