
Submitted by Austin Appleby, posted on August 09, 2001

Image Description, by Austin Appleby

8,000 Butterflies at 100 frames per second - no assembly, no hardcoding, no hacks.

These are three shots from a test of the particle system in my experimental engine (currently named Pandora). The butterflies are completely dynamic and are independently animated - no precalculation. When the butterflies flap their wings they climb higher, and when they don't flap they glide down. The butterflies are all different sizes, and smaller butterflies are faster than larger ones. There's also a simple physics model (a force field) that corrals the butterflies into a donut-shape.
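As an illustration only (this is not the Pandora code, which was not released), the flap/glide rule and the donut-shaped force field described above could be sketched roughly like this. The `Butterfly` struct, the `UpdateButterfly` function, and every constant are assumptions made up for the example.

```cpp
#include <cmath>

// Hypothetical particle state; the post does not show the real layout.
struct Butterfly {
    float x, y, z;     // position, with y pointing up
    float vx, vy, vz;  // velocity
    float size;        // smaller butterflies respond faster (assumed rule)
    bool  flapping;    // true while the wings are beating
};

// Assumed constants, chosen only to make the sketch concrete.
const float kRingRadius = 10.0f;  // radius of the donut's centre circle
const float kSpring     = 2.0f;   // strength of the pull toward that circle
const float kLift       = 3.0f;   // upward acceleration while flapping
const float kSink       = 1.0f;   // downward acceleration while gliding

// Advance one butterfly by dt seconds.
void UpdateButterfly(Butterfly& b, float dt) {
    // Flapping climbs, gliding sinks.
    float ay = b.flapping ? kLift : -kSink;
    // Vertical part of the force field: pull back toward the ring's plane.
    ay -= kSpring * b.y;

    // Horizontal part: pull toward a circle of radius kRingRadius around
    // the y axis, which herds the swarm into a torus-like region.
    float r = std::sqrt(b.x * b.x + b.z * b.z);
    float ax = 0.0f, az = 0.0f;
    if (r > 1e-6f) {
        float pull = kSpring * (kRingRadius - r);  // > 0 pushes outward
        ax = pull * b.x / r;
        az = pull * b.z / r;
    }

    // Smaller butterflies accelerate more for the same force.
    float agility = 1.0f / b.size;
    b.vx += ax * agility * dt;
    b.vy += ay * agility * dt;
    b.vz += az * agility * dt;

    b.x += b.vx * dt;
    b.y += b.vy * dt;
    b.z += b.vz * dt;
}
```

A butterfly inside the ring gets pushed outward and one outside gets pulled back in, so over time the swarm settles around the ring.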

Each butterfly is a separate C++ object, and each uses virtual functions to implement its behavior (they derive from a CParticle base class). Before people complain about the performance penalties of virtual functions, I've extensively benchmarked the virtual function overhead here and it's practically negligible. The particles are moderately memory-inefficient, but that can be cleaned up quite a bit with a custom allocator. There are exactly 0 lines of assembly code used in the particle system code - the particles themselves are straight C++, and the vector math routines underneath them are hybrid C/C++.
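For readers unfamiliar with the pattern, here is a minimal sketch of particles deriving from a base class and being updated through a virtual call. Only the name `CParticle` comes from the post; `CButterfly`, `Update`, `UpdateAll`, and the toy movement rule are assumptions for illustration.

```cpp
#include <memory>
#include <vector>

// Base class named in the post; its real interface is not shown there,
// so this single-method version is an assumption.
class CParticle {
public:
    virtual ~CParticle() {}
    virtual void Update(float dt) = 0;  // one virtual call per particle per frame
    float y = 0.0f;                     // height, the only state this sketch needs
};

// Hypothetical derived class: climbs while flapping, sinks while gliding.
class CButterfly : public CParticle {
public:
    explicit CButterfly(bool flapping) : m_flapping(flapping) {}
    void Update(float dt) override {
        y += (m_flapping ? 1.0f : -0.5f) * dt;
    }
private:
    bool m_flapping;
};

// The per-frame loop: the system only knows about CParticle.
void UpdateAll(std::vector<std::unique_ptr<CParticle>>& particles, float dt) {
    for (auto& p : particles) {
        p->Update(dt);
    }
}
```

With 8,000 particles at roughly 100 frames per second this loop makes on the order of 800,000 virtual calls per second, which is consistent with the overhead being hard to measure next to the per-particle math and rendering.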

The geometry for each butterfly is essentially a square folded down the diagonal, and is built with 8 vertices and 4 triangles (4 verts and 2 triangles per side). The current version renders 8,000 butterflies at 95-100 frames a second on my development machine (a P4-1.7ghz + GeForce 3), or approximately 800,000 particles per second (3.2 million triangles per second). It doesn't use any GeForce 3 or Pentium 4 specific features (no vertex shaders, SSE, etc.) though it does use some NVidia-specific OpenGL extensions (mainly NV_vertex_array_range).
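To make the vertex and triangle counts concrete, here is one way such a folded square could be built, plus the throughput arithmetic. This is a sketch under assumptions: the `Vec3` and `ButterflyMesh` types, the vertex order, and the choice to duplicate the four corners for the back side (presumably so each side can carry its own normal) are not taken from the post.

```cpp
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// 4 vertices + 2 triangles for the front, the same again for the back.
struct ButterflyMesh {
    Vec3          verts[8];
    std::uint16_t indices[12];  // 4 triangles * 3 indices
};

// Build a square of half-diagonal s, folded along the diagonal that runs
// down the z axis. flapAngle (radians) lifts both wing tips; 0 is flat.
ButterflyMesh BuildButterfly(float s, float flapAngle) {
    const float c  = std::cos(flapAngle);
    const float sn = std::sin(flapAngle);

    const Vec3 quad[4] = {
        { 0.0f,   0.0f,   -s   },  // 0: tail end of the fold
        { 0.0f,   0.0f,    s   },  // 1: head end of the fold
        { -s * c, s * sn, 0.0f },  // 2: left wing tip
        {  s * c, s * sn, 0.0f },  // 3: right wing tip
    };

    ButterflyMesh m;
    for (int i = 0; i < 4; ++i) {
        m.verts[i]     = quad[i];  // front side
        m.verts[i + 4] = quad[i];  // back side, same positions
    }

    // Front triangles, then the same two with reversed winding for the back.
    const std::uint16_t idx[12] = {
        0, 1, 2,   0, 3, 1,
        4, 6, 5,   4, 5, 7,
    };
    for (int i = 0; i < 12; ++i) {
        m.indices[i] = idx[i];
    }
    return m;
}

// Throughput check: butterflies * frames per second * 4 triangles each.
long TrianglesPerSecond(long butterflies, long framesPerSecond) {
    return butterflies * framesPerSecond * 4;
}
```

Plugging in the figures from the post, 8,000 butterflies at 100 frames per second is 800,000 butterflies per second, and at 4 triangles each that is 3,200,000 triangles per second.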

I wrote this demo just to prove that you can get excellent performance out of a C++/OpenGL engine without any sort of hacking or assembly optimization - as long as you keep your code efficient and benchmark every one of your changes, you can usually avoid any performance bottlenecks. I probably won't be releasing the full source code to the demo (as that would require releasing huge chunks of my still-in-development engine) but I can write up a quick overview of the techniques I used if enough people are interested.
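The "benchmark every one of your changes" habit needs nothing fancy. A minimal timing helper might look like the sketch below; `AverageMillis` is a made-up name, and the post does not say what tooling was actually used.

```cpp
#include <chrono>

// Run `work` a number of times and return the average cost in milliseconds.
// Hypothetical helper for before/after comparisons of a single change.
template <typename Fn>
double AverageMillis(Fn&& work, int iterations) {
    using Clock = std::chrono::steady_clock;  // monotonic, safe for intervals
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Typical use would be to wrap the particle update or the render submission in a lambda, record the number, make one change, and measure again on the same scene.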

-Austin Appleby



Message Center / Reader Comments:
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.

August 09, 2001, 02:57 PM

you must be really good to get such frame rates on a crap system like that... ;)


August 09, 2001, 03:01 PM

That's a hell of a lot of butterflies. Do they rise and fall because of the flapping, or is it just a "if (flapping) butterfly.y++" type of thing?
It looks damn cool, and I'd like to see it in action...
You could also use this as a base to implement some kind of flocking behaviour algorithms.

I need more free time.


August 09, 2001, 03:03 PM

Nice butterflies !
The P4 is probably a good reason for the 100 fps ;-)
But this is a good example to show that assembly should be avoided today.

Matt A

August 09, 2001, 03:04 PM

Even if you're not releasing the source, can we get the demo exe somewhere? I'd like to see this in action.


August 09, 2001, 03:06 PM

Wow, this sounds really nice, and extremely impressive... However I'd like to run a demo of it if that's possible, just to see how it checks out on my machine (1.2ghz Athlon, Geforce3)



August 09, 2001, 03:34 PM

"But this is a good example to show that assembly should be avoided today."

Humans write compilers, so compilers will never produce better assembly than humans...


August 09, 2001, 03:44 PM

Have you tried it on another system?

I mean, yeah 100fps, but that's still on a GeForce3 + 1.7ghz CPU, that's not a "common system"...


August 09, 2001, 03:46 PM

The number of people that can create assembly code that is more efficient than current professional compilers on very modern processors (P3+) is much smaller than the total number of people who still use assembly just because they think they are getting some benefit that either doesn't exist at all or is so very small that it's not even close to worth the effort they put into writing the code.

In any case, I wouldn't go so far as to say compilers will never produce better assembly than humans. In the future I'm sure we'll see compilers that will make use of ridiculously fast processors (for the compile stage) and some basic form of genetic algorithms to do genetic-style optimizations on sets of standard algorithms that wind up with a solution that is faster than any human could come up with in a reasonable amount of time.


August 09, 2001, 03:46 PM

Not only would I love to see a demo, but I for one would love to read a rundown of how you move so many objects so quickly.
My daughter would kill to have something like this as a screen saver! She's the kind that has to 'save' every caterpillar by bringing them into her butterfly house and taking care of them till they become butterflies/moths. 8000 butterflies, yep that's about a normal summer for her ; )

Isobel - why do you think that assembly should be avoided? Although it makes maintenance and future development a little, shall we say, funner, it does provide some really nice speed advantages.


August 09, 2001, 03:47 PM

I think the best way for you to benchmark your program would be a public demo. We would definitely let you know if it's too slow on our machines.

Rectilinear Cat

August 09, 2001, 03:51 PM

< rant >
Yea? Well let's see it pull off even 10 fps on my k6-2 500/geforce 2 mx400? And a crappy VIA board at that. And crappy FP calc speed because it's a k6-2. Where's your 100fps now? I suppose you could argue "get with the times, man. all the cool kids have 999ghz and geforce 10 systems" but then there's the other ~80% of the consumer market that will be pissed because they're getting crappy framerates. In other words, this is more a monument to the increasing strength of brute force methods. I mean, if it was really a whole bunch of individually modeled, skinned, and boned butterflies with their own flocking behavior, maybe we could accept the fact that it needs a fast system to run. But 8000 groups of textured triangles propelled by standard C++ code? That just screams "optimize me". Don't be so quick to shrug off assembly as a type of "hack", as if it's somehow an unclean way to code.
< /rant >

Anyways, nice pictures ;)


August 09, 2001, 04:11 PM

basically, two heads are better than one


August 09, 2001, 04:14 PM

too bad rant isn't an html tag
so many possibilities...


August 09, 2001, 04:16 PM

the pics are nice but you should've used something less cheerful than butterflies. locusts devouring everything in sight would be a welcome modification ;)


August 09, 2001, 04:19 PM

not only will they be pissed that they're getting crappy framerate, but that they also dished out 60 bucks to see flying butterflies


August 09, 2001, 04:21 PM

"Humans write compilers, so compilers will never produce better assembly than humans..."

Humans write chess programs, so chess programs will never beat humans...


August 09, 2001, 04:22 PM

Lots of libraries (like OpenGL) are full of hand-optimised assembly. This is because it's still easier to write assembly than to write an optimising compiler that can do the same thing. Every time you play an mp3 or a DVD on your computer, it wouldn't be possible with that quality and speed without SIMD instructions.

BTW, it's not hard at all to write better assembly than for instance VTune. I once had written a piece of unoptimised MMX code for lightmapping in software, first it took more than 20 clock cycles per pixel, but it was already faster than the C++ code compiled by MSVC. VTune optimised it to 18 clock cycles, but I pushed it to 11 clock cycles. The code is 100% paired now. Of course that little snippet of code (22 instructions now) took two days to fully optimise...

I'm not saying VTune sucks, it helps me a lot to find the bottlenecks in my code to see where assembly would mean a significant speedup. The problem is that VTune doesn't know what I'm doing. It doesn't know that it can trade some fixed-point precision for speed, it doesn't know that the variable x at memory location y is a floating-point number and that I can multiply it by two without sending it to the FPU, it doesn't know that the input can only have a certain range, etc.
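For anyone wondering how a float can be doubled without the FPU: one common way, presumably the kind of trick meant here, is to add one to the exponent field using integer arithmetic. The sketch below assumes IEEE 754 single precision and a 32-bit `float`; the function name `TimesTwoViaExponent` is made up for the example.

```cpp
#include <cstdint>
#include <cstring>

// Double a float by incrementing its 8-bit exponent field (bits 23..30).
// Only valid for normal, finite, non-zero inputs whose doubled value still
// fits: zero, denormals, infinities and NaNs give wrong answers. Knowing
// that the input is in a safe range is exactly the kind of thing the
// programmer knows and the tool does not.
float TimesTwoViaExponent(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // reinterpret the bytes safely
    bits += 1u << 23;                     // exponent + 1 means value * 2
    std::memcpy(&x, &bits, sizeof x);
    return x;
}
```

In hand-written assembly this is a single integer add on the value in memory, which is why it can be cheaper than a round trip through the FPU on the processors being discussed.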

If you wanted to make a compiler that does know these things, you would have to completely 'explain' what the algorithm should do, what assumptions it can make and how much precision you need for certain results. You see, it would be much easier to just directly start programming that in assembly instead of trying to make clear to the compiler how fast you want it to be...

Wim Libaers

August 09, 2001, 04:23 PM

Well, humans will always make code that is at least as good as that of a compiler. Why? The human can use his brain, and the compiler. The compiler can't tap the human brain. That's why humans will always win, we can cheat ;-)
Anyway, this subject has been beaten to death (a few times already, but keeps rising from the grave)


August 09, 2001, 04:30 PM

Do your butterflies twist and turn in the direction they are heading (i.e. each one has "orientation" and "rotation")? It seems a little weird to me that this would be called a particle. I would probably call them bodies instead. I guess "particle system" is a loose term these days...

Can you give us some info about how you calculate their orientation so we can get a better idea about the performance? It seems to me that this calculation is potentially the most complex one that you do and might partially be what is making your demo more "processor bound." You are definitely not taking full advantage of that spanky graphics card.

I'm thinking there is a lot more to this than meets the eye and I'm just being really really picky. I love your pics.



August 09, 2001, 04:44 PM

Not necessarily that it *should* (there are always places where a little hand optimization goes a long way) but certainly that it *can* be and still receive good results.


August 09, 2001, 04:46 PM

That's a very stupid change of words...

A chess program can beat a human because it can execute its program faster than a human can think (about the game). Nobody can multiply two floating-point numbers in less than a nanosecond, but computers can. The computer doesn't 'know' how it does it, only humans can develop new ways to make it faster. If you gave the chess player enough time (and patience) he would always be able to win, except when the program has won before the match begins because it has stored all solutions in memory :P

All you need to write better assembly than a compiler is time. MSVC can compile megabytes of code in just a few minutes, but if you give me a few hours or days I can optimise the bottlenecks in that program so it's noticeably faster.

A computer is a tool for doing things we can do too, only much slower...

Pierre Mengal

August 09, 2001, 04:52 PM

Love it


August 09, 2001, 04:52 PM

Really nice shot and not common !
Would be great to see a demo !!

How are they moving exactly ?

funny !

David Olsson

August 09, 2001, 05:00 PM

I'm not that impressed with the performance part, you should be able to push more than that with that pc. On the other hand, I don't know how complex the butterfly calculations are.

However the screens look really cool and nice.

I'm just about to start on my own particle system and I'm aiming at at least 2-3 million alpha blended particles per second. Below 1 million would be a big disappointment. (athlon 700 + geforce2)


August 09, 2001, 05:02 PM

Nice image. I, too, would enjoy a running demo -- looks like it'd be really cool to watch. :)

A note to the "assembly-is-not-a-hack" posters:

I don't think he was trying to say assembly is a hack. I think his (very valid) point is that the best optimizations are algorithmic.

- MidNight


August 09, 2001, 05:14 PM

The "only" 100 fps, is probably due to vsync locking. Like the demo, more original than most things posted here.

Derek Simkowiak

August 09, 2001, 05:14 PM

"Humans write compilers, so compilers will never produce better assembly than humans..."

"Humans write chess programs, so chess programs will never beat humans..."

"All you need to write better assembly than a compiler is time. [...] but if you give me a few hours or days [...]"

It is generally accepted that the programmer time and maintenance problems associated with assembly are much more expensive than the marginal benefits that can be had by hand-optimized assembly in your codebase. Take a few days to assembly-optimize and gain 2 FPS, or add some really cool new gameplay feature?

Around the early- to mid-eighties, chip manufacturers stopped designing instruction sets for people and started designing instruction sets for compilers. Compilers can do a pretty good job these days (at least, gcc does for me; I've never used the MS stuff).

My own opinion is that assembly is just a tool, like any other; if you profile your code and find a huge bottleneck where there shouldn't be one, consider assembly. I've never had to do this (but I've only been programming games for about a year.) If I were writing software rasterizers, device drivers, or some other oft-used and low-level code, I would definitely consider assembly.



August 09, 2001, 05:58 PM

While yes, it does run pretty damn fast, you are doing absolutely no texture swapping. This would be more impressive if you ended up using 20 megs of butterfly textures, as most games require at least 20 megs of world textures. The average Quake3 level uses about 40-50 megs IIRC, and that's a relatively OLD game!

I guess I should also mention that using vendor-specific extensions (NV_vertex_array_range) is next to cheating. It's all well and good that you can get great framerates on 1 brand, but just like Rectilinear Cat said, you just threw performance out the window for all the other cards, even if you DO write extra support for them.

Let's try to be a little more robust in our bitching. Write better programs to show people it's possible.


August 09, 2001, 06:01 PM

"Take a few days to assembly-optimize and gain 2 FPS, or add some really cool new gameplay feature?"

I agree, if you're using hardware acceleration and an optimised library like OpenGL, optimising your own code in assembly won't make a big speed difference. The algorithms you use will be more important like Paul said.

I am writing a software engine, and I try to push the performance to the maximum. There is no way I can get good results without using assembly. Compilers simply do not understand where and how things should be optimised. I process millions of pixels, texels, lumels and other tiny little things per second that shouldn't take more than a few clock cycles. It's not about one single FPS more, it's about doubling the FPS!

It's a shame there are no real software rendering libraries out there. I've never even seen one that uses SSE, even though it lets you use SIMD integer and floating-point instructions at the same time! This might enable huge speedups for software rendering but nobody has ever tried it (and I don't have a PIII, yet). It would be pointless to optimise anything other than my rasterizers, math routines and pipelines in assembly, just like it's pointless to optimise an OpenGL program in assembly since all these bottlenecks are already optimised.

A few months ago the goal of my software engine was to experiment with VSD and HSR algorithms (B-Buffer!), but now I have a second goal: showing that software engines are not dead and can produce amazing graphics on modern CPUs.

Of course, it's not only a software engine that benefits from assembly. A lot of sound and video libraries were written with important assembly optimisations. Here it's not so important that we get a high 'FPS', but it's important that the application doesn't take much processor power so we can run a lot of programs simultaneously...

As a last argument to learn at least a bit of assembly: you'll have a lot better knowledge of how your compiler and your computer in general work. A few days ago we had a question in the forum about how to speed up negating a floating-point number. If you know a little about assembly then you know this is a stupid question (sorry) and the bottleneck should be solved differently.

Just my humble opinion,



August 09, 2001, 06:32 PM

Definitely: learning assembly has given me a great understanding of the PC and as a consequence, has actually helped me write better HLL code.

This thread contains 152 messages.