One thing a lot of people don't do is optimize for specific hardware. Normally you would expose options in your engine, but you can also predefine the fastest available set of options, not just for a chipset, but for the amount of memory, the texture download bandwidth (some cards are still PCI, eg the Voodoo2, and some come in both PCI and AGP versions, such as the TnT) and a given resolution.
The main thing, when doing this, is to know the specs of your 3d hardware very well, and to understand that all chipsets are different and have different requirements to reach peak performance. To illustrate this, I will use the example of a widely misunderstood but widely distributed chipset, the 3dlabs Permedia 2. I own a P2 and understand the specs.
I also know the specs of other cards. The TnT, for example, has a good fillrate, especially if you know how to get both chips to bilinear filter a DIFFERENT polygon in a single pass. It can't handle triangles as well as some other cards, but it does fairly well. It has good blend mode support and reasonably stable OpenGL drivers, but the driver coders aren't afraid to cheat to boost performance (eg the debacle of not drawing polygons smaller than a pixel when anti-aliasing was turned on).
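As a rough illustration of predefining options per card, here is a minimal Python sketch. The `Hardware` record, the `choose_options` function and the option names are all made up for this example, and the rules inside simply restate points made in this article; memory size and bus type could be added as further fields in the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    chipset: str   # e.g. "permedia2", "voodoo2", "tnt" (hypothetical names)
    width: int     # requested resolution
    height: int


def choose_options(hw, low_overdraw):
    """Return a dict of engine options tuned for the detected hardware."""
    # Conservative defaults for a card we know nothing about.
    opts = {"z_buffer": True, "colour_bits": 16}

    if hw.chipset == "permedia2":
        # 32 bit rendering is fast on this chip, and dropping the
        # z-buffer pays off when the engine keeps overdraw low.
        opts["colour_bits"] = 32
        opts["z_buffer"] = not low_overdraw
    elif hw.chipset == "voodoo2" and hw.width * hw.height > 800 * 600:
        # The highest resolution is only reachable without a z-buffer.
        opts["z_buffer"] = False

    return opts
```

An engine would call something like this once at startup, after detecting the card, and still let the user override the result.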
The Permedia 2 is a good card to demonstrate with though, as it has many little tricks that can be pulled. It has good texture upload speeds (faster than most AGP 2x cards!), a (normally) poor fillrate, a lack of some blend modes, very good triangle setup and excellent rendering quality (if used properly). It is actually much better as a professional workstation card, and is very good at tasks such as 3d studio max rendering through its dedicated heidi driver.
Some of this is common knowledge, but by digging a little deeper we can find out more. This card is the only mainstream card that sports 4-bit texture support. Turning off the z-buffer can double the fillrate. It supports full 32 bit rendering (one of the first mainstream cards to do so) at an excellent speed, handles some alpha tests, has full OpenGL support for hardware stenciling and has a very robust, well tuned OpenGL ICD. It can also do one of the features that the Banshee was marketed on, which is environment mapping from the frame buffer.
And contrary to popular belief, it can actually do colour lighting in Quake 2 (16 bit mode isn't right, but 32 bit mode is great).
So, if you have a very good visibility set (Claustrophobic Irony has an excellent vis set ;-) and less than 2x overdraw, the Permedia 2 would benefit from having the z-buffer off. This is important to note for things like a portal engine, which has very little overdraw.
In fact, the Permedia 2 normally does about 40 million pixels per second (40mpix) with 8, 15 or 12 bit textures, bilinear filtered, single textured, z-buffered and vertex shaded (vertex colouring makes no difference). That is for an un-overclocked card; this card is extremely overclockable, and I've got mine running at 98MHz on an 80MHz card. Point sampled, the rate is the same, except that the bandwidth between memory and the chip is less stressed, so you get slightly better speeds. But as John Carmack (God himself) pointed out to me, the card can do 80mpix for a 4bit point sampled texture. What he didn't tell me, though, is that without any z-buffer use I could crank this rate up to 80mpix for nearly all of the bilinear filtered modes. On my overclocked card, this means I can get 100mpix, which isn't too shoddy. This is just what I have noted, though... It may be wrong.
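To put those fill rates in perspective, here is a small back-of-the-envelope sketch in Python. It assumes that fill rate scales linearly with the core clock and that fill rate is the only bottleneck (no geometry or texture upload cost), which is a simplification; the function names are made up for this example.

```python
def scaled_fill_rate(base_mpix, base_mhz, actual_mhz):
    """Fill rate after overclocking, assuming it scales with the clock."""
    return base_mpix * actual_mhz / base_mhz


def frames_per_second(fill_mpix, width, height, overdraw):
    """Upper bound on frame rate if fill rate were the only limit."""
    pixels_per_frame = width * height * overdraw
    return fill_mpix * 1_000_000 / pixels_per_frame


# 80mpix without the z-buffer, clocked from 80MHz up to 98MHz.
overclocked = scaled_fill_rate(80, 80, 98)          # 98.0, roughly 100mpix

# 640x480 with 1.5x overdraw, z-buffer on (40mpix) versus off (80mpix).
with_z = frames_per_second(40, 640, 480, 1.5)       # about 87 fps
without_z = frames_per_second(80, 640, 480, 1.5)    # about 174 fps
```

Real frame rates will be lower, but the ratio shows why dropping the z-buffer is worth the trouble when the visibility set is good.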
Other cards can also benefit from not using the z-buffer. On cards such as the Voodoo2 you can go to 1024*768 without the z-buffer, since the memory it would occupy becomes available for the colour buffers.
Another point is that, on most cards, using the stencil buffer can be expensive, so ONLY use it when needed. You can in fact avoid the stencil buffer in most circumstances (eg reflections or shadows) by using plane based clipping, which the next gen Nvidia card will have in hardware, and which the Glint Gamma has now.
Many older cards don't support trilinear filtering either, and those that do often run it at 1/3 the speed. But a trilinear filter is only a 2 pass bilinear operation if you use a flat or vertex interpolated alpha blend of the two mips. On the Permedia 2, using no z-buffer and this trilinear filtering method, I can get 40mpix, which is what I would normally get using bilinear with the z-buffer enabled.
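The Voodoo2 case comes down to simple memory arithmetic. The sketch below assumes a 4MB frame buffer, 16 bit colour, a 16 bit z-buffer and double buffering; those figures are my assumptions for the example, so check them against your own card.

```python
def frame_memory_bytes(width, height, use_z,
                       colour_bits=16, depth_bits=16, buffers=2):
    """Memory needed for the colour buffers plus an optional z-buffer."""
    pixels = width * height
    total = pixels * colour_bits // 8 * buffers
    if use_z:
        total += pixels * depth_bits // 8
    return total


FRAME_BUFFER_BYTES = 4 * 1024 * 1024  # assumed 4MB of frame buffer memory

# 1024*768 with a z-buffer needs 4.5MB, which does not fit...
fits_with_z = frame_memory_bytes(1024, 768, use_z=True) <= FRAME_BUFFER_BYTES
# ...but without one it needs only 3MB.
fits_without_z = frame_memory_bytes(1024, 768, use_z=False) <= FRAME_BUFFER_BYTES
```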
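Where there is no hardware support, one way to read "plane based clipping" is to clip the geometry yourself before drawing it, for example clipping reflected polygons against the mirror plane instead of masking them with the stencil buffer. Here is a minimal Python sketch of clipping one convex polygon against one plane; the function name and the tuple representation of points are just choices made for this example.

```python
def clip_polygon_to_plane(polygon, normal, d):
    """Keep the part of a convex polygon where dot(normal, p) + d >= 0.

    polygon is a list of (x, y, z) tuples in winding order.
    """
    def dist(p):
        return normal[0] * p[0] + normal[1] * p[1] + normal[2] * p[2] + d

    result = []
    for i, current in enumerate(polygon):
        previous = polygon[i - 1]          # wraps round to the last vertex
        dc, dp = dist(current), dist(previous)
        if dc * dp < 0:
            # The edge crosses the plane: add the intersection point.
            t = dp / (dp - dc)
            result.append(tuple(a + t * (b - a)
                                for a, b in zip(previous, current)))
        if dc >= 0:
            result.append(current)         # vertex is on the visible side
    return result
```

To clip against several planes (say, the edges of a mirror), feed the output of one call into the next. An empty list means the polygon is entirely behind the plane and can be skipped.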
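The two pass idea can be sketched like this in Python. The first pass draws the sharper mip level opaque and the second pass alpha blends the next mip level over it, using the fractional part of the level of detail as the alpha. The function names are made up, and how you compute `lod` per polygon or per vertex is up to your engine.

```python
def trilinear_passes(lod, num_levels):
    """Return [(mip_level, alpha), (mip_level, alpha)] for the two passes."""
    lod = min(max(lod, 0.0), num_levels - 1.0)   # clamp to the mip chain
    near = int(lod)                              # sharper mip level
    far = min(near + 1, num_levels - 1)          # blurrier mip level
    return [(near, 1.0), (far, lod - near)]


def blend(near_texel, far_texel, alpha):
    """What the alpha blend of the second pass computes for each pixel."""
    return far_texel * alpha + near_texel * (1.0 - alpha)


passes = trilinear_passes(1.25, num_levels=4)    # [(1, 1.0), (2, 0.25)]
pixel = blend(100.0, 200.0, passes[1][1])        # 125.0
```

Two passes at 80mpix give an effective 40mpix, which matches the figure quoted above.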
This may seem a little biased towards the P2, but what it actually is, is an example of what research and knowledge can get you.
Another cool thing that many people don't know is that you can often cheat in making advanced effects.
Bumpmapping can be done at high quality (16 bits y and 16 x) in four passes. This is done by having four maps, one for positive x, one for negative x and the same for y, and then using vertex shading to balance the normals to the correct value. If you want to increase brightness you can use a paletted 50-50 blend instead of the normal alpha blend.
Bumpmapping can also be done, on cards like the Voodoo and P2 which have high texture download rates, using paletted 8-bit relief maps. This works by having the top 4 bits for y and the bottom 4 for x, interpolating the actual normals into the palette, and then applying light. This can effectively make your lighting an extra pass if you simply use vertex lighting. So you have a texture pass and a lighting pass. The one problem is, I can't recall any way to change palettes on the fly for hardware textures (does anyone know of one?). But this method, as developed by The Phantom for software rendering, works great. A clever optimisation is realizing that the same palette can be used for all polygons that share a normal. So, in optimizing this, you can use the same bumpmap and palette for whole planes (say your floor polygons, many of which will share planes), saving some of the huge download costs. Also, you can cheat, and use the same bumpmap and palette for polygons that have normals that are close together.
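Here is a minimal Python sketch of how such a palette might be built. It is my own reading of the method, not The Phantom's code. It assumes the 4 bit values are signed offsets of the normal in x and y (with 8 meaning "flat"), that the light vector is given in the polygon's own space with the unperturbed normal pointing along z, and that the palette holds simple 0..255 intensities. The cache stands in for the "same palette for all polygons that share a normal" trick.

```python
import math
from functools import lru_cache


def encode_texel(x, y):
    """Pack two 4 bit values (0..15) into one index: y on top, x below."""
    return (y << 4) | x


@lru_cache(maxsize=None)
def build_palette(light):
    """Light all 256 possible perturbed normals once.

    light is a unit (x, y, z) tuple pointing towards the light.
    Polygons that see the same light vector share the cached palette.
    """
    palette = []
    for index in range(256):
        nx = ((index & 0x0F) - 8) / 8.0    # bottom 4 bits: x offset
        ny = ((index >> 4) - 8) / 8.0      # top 4 bits: y offset
        length = math.sqrt(nx * nx + ny * ny + 1.0)
        dot = (nx * light[0] + ny * light[1] + light[2]) / length
        palette.append(round(255 * max(dot, 0.0)))
    return tuple(palette)


overhead = build_palette((0.0, 0.0, 1.0))
flat_brightness = overhead[encode_texel(8, 8)]    # flat texel, fully lit
```

With hardware paletted textures the 256 entries would be uploaded as the texture's palette; whether that can be done cheaply per polygon is exactly the open question raised above.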