
Submitted by Austin Appleby, posted on August 09, 2001

Image Description, by Austin Appleby

8,000 Butterflies at 100 frames per second - no assembly, no hardcoding, no hacks.

These are three shots from a test of the particle system in my experimental engine (currently named Pandora). The butterflies are completely dynamic and are independently animated - no precalculation. When the butterflies flap their wings they climb higher, and when they don't flap they glide down. The butterflies are all different sizes, and smaller butterflies are faster than larger ones. There's also a simple physics model (a force field) that corrals the butterflies into a donut shape.

Each butterfly is a separate C++ object, and each uses virtual functions to implement its behavior (they derive from a CParticle base class). Before people complain about the performance penalties of virtual functions, I've extensively benchmarked the virtual function overhead here and it's practically negligible. The particles are moderately memory-inefficient, but that can be cleaned up quite a bit with a custom allocator. There are exactly 0 lines of assembly code used in the particle system code - the particles themselves are straight C++, and the vector math routines underneath them are hybrid C/C++.

The geometry for each butterfly is essentially a square folded down the diagonal, and is built with 8 vertices and 4 triangles (4 verts and 2 triangles per side). The current version renders 8,000 butterflies at 95-100 frames a second on my development machine (a P4 1.7GHz + GeForce 3), or approximately 800,000 particles per second (3.2 million triangles per second). It doesn't use any GeForce 3 or Pentium 4 specific features (no vertex shaders, SSE, etc.), though it does use some NVidia-specific OpenGL extensions (mainly NV_vertex_array_range).

I wrote this demo just to prove that you can get excellent performance out of a C++/OpenGL engine without any sort of hacking or assembly optimization - as long as you keep your code efficient and benchmark every one of your changes, you can usually avoid any performance bottlenecks. I probably won't be releasing the full source code to the demo (as that would require releasing huge chunks of my still-in-development engine) but I can write up a quick overview of the techniques I used if enough people are interested.

-Austin Appleby

Message Center / Reader Comments:
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.

August 10, 2001, 01:53 PM

Here are the results in 1024x768x32 (windowed) - noAA

AMD Athlon 900Mhz/A7V/256MB + GeForce3 (ELSA Gladiac 920) 64MB/DDR

frame time: 37.04
update time: 33.89
render time: 2.60
swap time: 0.20
text time: 0.00
proc time: 37.41
fps: 27.26
rt: 68.78

My AGP is set to x2 - otherwise I'm getting hangs ;(

Very nice demo - what's that called - flocking? I mean, if one or more butterflies move around, will the other ones follow them? Why don't you send the demo plus sources to nVidia? They keep lots of demos up there.


August 10, 2001, 02:07 PM

Doesn't work on an Athlon 750 + GeForce 1 (32MB DDR)


August 10, 2001, 02:07 PM

Until you moron flamers show me something better than this, why don't you keep your freaking mouths shut!

Hiro Protagonist

August 10, 2001, 02:11 PM

Yes I should have quoted your smiley face emoticon to show that the last part of the sentence was in jest. It is ridiculous to say that a human given enough time and patience would ALWAYS beat a computer at a game of chess.

To say that a human would always beat a computer contradicts what you said earlier. If a human designed the computer to play chess, then wouldn't the computer have a chance of winning based on its designer's skill in chess and in programming? I think it would.

We are a long way off from the day when we can say a computer is better at some skill than a human. Computers are not sentient, and if a computer is not sentient, then it cannot be more skilled than a human. A computer is simply an interface for a human to practice his skill.

Some people play chess by sitting in front of the board and analyzing their next move, some people play chess by writing a program to do that for them. In both cases however, it is a human playing chess.


August 10, 2001, 02:14 PM

I wonder who's the flamer now...


August 10, 2001, 02:24 PM



August 10, 2001, 02:26 PM

Yes, you're right, it's more like the programmer against the chess player...

So if the programmer is smarter than the chess player and he can put his intelligence into code, then there's a good chance the computer will beat an average chess player. What I was actually trying to say is that the computer will never be smarter than the programmer. Not in this computer era.

So a compiler will never generate better code than somebody who understands the algorithms that the compiler uses, if you gave him enough time...

Amen ;)

Austin Appleby

August 10, 2001, 02:26 PM

There's a new version up for grabs. I changed some of the particle update rates so that I'm not running my physics every frame - the butterflies are now re-phys-ed 12 times per second and reoriented 45 times per second. It should be considerably faster on most systems; I'm now GPU limited instead of CPU limited. With lighting and textures off, it runs at 140 fps on my machine.


Austin Appleby

August 10, 2001, 02:27 PM

There's a new version up for grabs. I changed some of the particle update rates so that I'm not running my physics every frame - the butterflies are now re-phys-ed 12 times per second and reoriented 45 times per second. It should be considerably faster on most systems; I'm now GPU limited instead of CPU limited. With lighting and textures off, it runs at 140 fps on my machine.



August 10, 2001, 02:32 PM

Nice demo!

PIII 700 + GF3

frame time 30.76
update time 28.22
render time 3.37
swap time 0.08
text time 0.01
proc time 31.19
fps 32.43
rt 146.55

The Digital Bean

August 10, 2001, 02:39 PM

Nice work Austin. I'm curious as to what compiler you used? In my experience, Microsoft's compiler does a terrible job compiling code for the P4. Intel's new P4-optimizing compiler gave me a 30% speed boost in my ray tracer. Of course I was CPU bound and you probably aren't, but different compilers are still worth exploring.

As to the whole ASM issue: sure, ASM has its place - i.e. write solid code with efficient algorithms, profile to find where you spend a lot of time, and ASM-optimize that. For example, ASM-optimizing double-to-int conversions is extremely worthwhile (especially if that cast is done often).

I also have to agree with you that 'C++isms' like polymorphism and virtual functions have a negligible impact on performance. Naturally, having a virtual 'putpixel' function deep in a hierarchy is going to hurt (since you'd call it an obscene number of times per frame). However, the design and organization benefits offered by C++ far outweigh the negligible performance drop.


- DB


August 10, 2001, 02:45 PM

Nice demo..

P!!! 933Mhz
192MB SDRAM (pc133)
GeForce2 MX (normal version)
Detonator 12.41
Win2K Pro.

Frame time: ~30
Update time: ~26
Render time: ~3
Swap time: ~0.20
Text time: 0
Proc time: ~29
FPS: ~36
Rt: ~25

Blah, I want a P4 + GF3 :P


August 10, 2001, 02:54 PM

Austin, two quick things.

First, don't reload the page with the message confirmation on it. It's causing you to post each of your messages twice.

Second, I don't see why anyone would post a link to your demo on a news site. It's neat-looking, but it's not like it's Slashdot-worthy or anything. Unless the 3D hardware scene is still in the infantile "Ooo look what the GeForce3 can do!" stage.

Rectilinear Cat

August 10, 2001, 02:58 PM

Richard: Dood go get the 12.90 drivers and then post your FPS. You are almost guaranteed to get a speed boost.

Austin Appleby

August 10, 2001, 03:03 PM

Yeah, I hit the back arrow too many times and landed on the posting page, which put another copy up. Bleh.

I'm not really worried about the news sites, I just want to make sure that I don't lose my Roadrunner home page due to bandwidth abuse until I have a chance to get a real web host.


Jukka Liimatta

August 10, 2001, 03:17 PM

The demo runs ~52fps for me ( P4@1.7, GF3, Windows2000, Det14.40, 1280x1024@32bit desktop ).

Only half the framerate I expected (110 ;-)

Still, there's an obscene amount of individually transformed triangles. I can't say I'm impressed, but I'll still say good job. Even if you get a lot of criticism from some guys, I must say I've never seen THIS amount of stuff on screen before.

The nVidia tree T&L demo and the grove demo had a lot of triangles, but those were most likely in some fixed coordinate system, i.e. a static model vs. the dynamic one you have.

I don't really have a point... except, for those who criticize too lightly: don't just stare at the number of triangles or vertices on screen, think about how you would do the same. The approach that was described sounds pretty reasonable: have a vertex buffer, transform with the host, then dispatch it to the rendering API with as few calls as possible.

Try doing 8000 matrix state changes per frame yourself and see how "fast" it is even if you had a TB@4GHz - then notice that the bottleneck is OpenGL, not your code, and work around it like done here. Good luck! ;-)

Have to go now, going to see "Kiss of the Dragon" ( Jet Li's movie for those who don't know what that is all about ).


Austin Appleby

August 10, 2001, 03:26 PM

We're running mostly identical systems (processor, video, driver) except for the OS - I'm under Windows 98, you're under 2000.

I had a hunch that it might be failing to allocate AGP memory under Win2K, so I tried the demo on my machine with AGP allocation disabled in the engine - it ran at 52 fps.

The engine is allocating a very big chunk of AGP memory to store the particles in; I guess the drivers under Win2K don't like the amount I'm trying to allocate and are falling back on normal system RAM.


Wim Libaers

August 10, 2001, 04:49 PM

Well, about compiler writers being the best assembly language programmers...
I suggest you go to
There, go for chapters 17 and 18 of Abrash's book (suggested reading section): 17 for the introduction, 18 to see what someone from Borland pulled off - a C program that is a compiler specifically optimised for creating the Game of Life program, for a contest. Some really interesting stuff in there!

Mark Friedenbach

August 10, 2001, 05:04 PM

VERY, VERY off topic:

Unfortunately the only guy being able to give us an answer is dead. :-/

Wasn't the question "What do you get when you multiply 6 by 9?" (yes, *9*, that was not a typo).

I could have sworn that they figured that out when Arthur and Ford were stuck on the Earth 2,000,000 years in the past, by having Arthur pull pieces out of a Scrabble bag or something...


August 10, 2001, 05:15 PM

300 MHz dual PII

ft 76
ut 66
rt 7
sw .39
tt 0
pt 75
fps 13
rt 77


Isaack Rasmussen

August 10, 2001, 06:01 PM

Austin, I'm 99% certain that your program is allocating AGP memory under Win2K. The problem is that, on most motherboards, AGP FastWrites are disabled.
It's supposedly enabled in XP.

I have Duron750, GF3 and get 30fps.

But it looks cool, though it's hard to see the butterflies flapping. I can only see a few of them doing it.
There are SO many of them, but I think it looks more like leaves, because it's hard to see them flapping their wings.

Hehe, I can't imagine them with different colors, it's got to look like a big mess of rainbow colors :)

Thomas S

August 10, 2001, 07:30 PM

Didn't make any difference on my system (P3-733MHz, GeForce3/Det 12.41, Win2K) - I got the same FPS with or without AGP enabled.
I don't know if that's because there's a problem allocating AGP memory in Win2K, but the AGP demo in the 'DirectX Diagnostic Tool' works,
and there are no problems with 3DMark 2001 (I used to have some problems with that, but fixed them with driver updates and SP2)... Maybe I should install Windows 98...

Thomas S

August 10, 2001, 07:37 PM

Forgot to mention that I loved the demo :>

Pierre Terdiman

August 10, 2001, 07:41 PM

The demo just crashes on my machine.

Celeron 500 Mhz, Win98, GeForce 2 MX.

Jukka Liimatta

August 10, 2001, 07:43 PM

That sounds like a plausible explanation. If you need someone with a similar setup on W2K to test whether there's a way around this, I'll do the test run. ;-)

You can reach me from, I'll know what time it is when you mail about the update (harder to come back here now that this is already in the IOTD history folder).

Tell me about quirks - just recently we found out (actually Mr. Mikko Nurmi did) that certain DX8 VB locking flags which work with the retail runtimes fail on the debug runtimes. Some people reported that a certain "production" failed for them lately, and we couldn't find anything wrong with it. Then it dawned on us that the debug runtimes hosed everything. ;-)

That's PC coding..

But even if consoles have some "sexy" appeal for a certain sort of programmer - for example, the sort who doesn't like to get reamed for stuff he never fucked up in the first place - the PC has some stuff going for it too. PCs are generally much better networked than game consoles, and there is a lot more power, although it's harder to tap into, since you have to keep compatibility with other PC boxes with different drivers, hardware, and so forth.

I'm getting a bit off-topic (how uncommon for me); it's just interesting to see that a switch of OS can affect framerate this dramatically without any changes in the code or in the way the API is used.

Like said above, that's PC coding.. ;----------)

Doc Kennedy

August 10, 2001, 09:38 PM

Personally, I find it's more efficient to work on a somewhat slow machine, like my old PPro 200 that I'm on now - it forces you to optimize as much as you can. Sure, it may not be the standard of the day that everyone's using, but it makes your programs naturally fast. A system as nice as yours is an awful thing to work on; having all the power in the world tends to make one lazy.

Doc Kennedy

August 10, 2001, 10:17 PM

Perhaps it's not as important for people who develop just for Intel machines (an awful lot of processors they are), but as there are more and more architectures out there, there's more and more need for portability. This is especially important for those who develop on and for Unix machines, as their programs are able to run on anything from an m68k (not to knock the m68k) to a complex distributed supercomputer. Unless your program is very specialized (and Windows programs _are_), you generally try to avoid assembly and make use of standardized libraries (such as OpenGL). Although I don't like it, that's how it is. The second your particular architecture goes out of style, your program dies. Have an Intel one day and get an Alpha the next, and you may just need to do a lot of work to port everything on your system to it.


August 11, 2001, 03:19 AM

Rectilinear Cat,

Thanx for the hint but I don't think it'll get any faster with this shitty GeForce2 MX =)

Here are the results with the 12.90 Detonators:

frame time: 28.33
update time: 24.38
render time: 3.08
swap time: 0.26
text time: 0.00
proc time: 28.31
fps: 35.22
rt: 85.70

Richard Szalay

August 11, 2001, 04:30 AM

PII 550 - GeForce 2 MX - 128MB - Win2k Pro

frame time 68.59
update time 62.87
render time 4.84
swap time 0.31
text time 0.00
proc time 68.60
fps 14.40
rt 209.85


August 12, 2001, 02:02 AM


good thing those aren't bees!
well done!

This thread contains 152 messages.