The texture cache, as used by Quake, is a very interesting invention, mainly because it allows multiple passes to be rendered, fully perspective correct, with only one rendering iteration per texel. It is also interesting because the perspective correction happens at rendering time, when the cached surface is mapped to the screen. While the cache itself is being built, x, y, and z can be treated linearly, and the area being worked on is simply a rectangle.
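To make the idea concrete, here is a minimal sketch in C of building one cached surface. It assumes an 8-bit grayscale texture that tiles, and a coarse lightmap with one brightness sample every 16 texels. Quake also lit surfaces from lightmaps sampled at that spacing, but the types and names here (Texture, Lightmap, build_surface, LIGHT_STEP) are hypothetical. Each texel of the lit surface is computed exactly once. The rasterizer would then map this block onto the screen with perspective correction, just as it would a plain texture.

```c
#include <assert.h>
#include <stdint.h>

#define LIGHT_STEP 16 /* texels between lightmap samples (assumed) */

/* Hypothetical 8-bit grayscale texture that tiles across the surface. */
typedef struct { int w, h; const uint8_t *texels; } Texture;

/* Hypothetical lightmap: one brightness sample (0..255) every LIGHT_STEP texels. */
typedef struct { int w, h; const uint8_t *samples; } Lightmap;

/* Build one cached, pre-lit surface of surf_w x surf_h texels into out.
   (s0, t0) is the surface's offset into the tiling texture. */
void build_surface(const Texture *tex, const Lightmap *lm,
                   int surf_w, int surf_h, int s0, int t0, uint8_t *out)
{
    /* The lightmap must cover every texel plus one extra sample for interpolation. */
    assert(lm->w >= (surf_w - 1) / LIGHT_STEP + 2);
    assert(lm->h >= (surf_h - 1) / LIGHT_STEP + 2);
    assert(s0 >= 0 && t0 >= 0);

    for (int t = 0; t < surf_h; t++) {
        int lt = t / LIGHT_STEP, ft = t % LIGHT_STEP;
        const uint8_t *row0 = lm->samples + lt * lm->w;
        const uint8_t *row1 = row0 + lm->w;
        for (int s = 0; s < surf_w; s++) {
            int ls = s / LIGHT_STEP, fs = s % LIGHT_STEP;
            /* Bilinear interpolation of the four surrounding lightmap samples. */
            int top = row0[ls] * (LIGHT_STEP - fs) + row0[ls + 1] * fs;
            int bot = row1[ls] * (LIGHT_STEP - fs) + row1[ls + 1] * fs;
            int light = (top * (LIGHT_STEP - ft) + bot * ft) / (LIGHT_STEP * LIGHT_STEP);
            int texel = tex->texels[((t + t0) % tex->h) * tex->w + (s + s0) % tex->w];
            /* Modulate once here; the rasterizer never touches the lightmap. */
            out[t * surf_w + s] = (uint8_t)(texel * light / 255);
        }
    }
}
```

The work done here is proportional to the surface's texel count, not to how many screen pixels it covers. That is why extra passes folded into this loop are cheap as long as the cached block can be reused.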
This allows for perspective-correct fog without using two divides per pixel. It allows for linear interpolation between mipmap levels (the kind used in trilinear mapping). It also allows bump mapping, environment mapping, and similar effects to be perspective corrected along with the texture, and to be calculated only once per needed use (assuming static lighting).
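The fog claim rests on one fact. Across a planar surface, view-space depth is an affine function of the surface coordinates, so inside the cache it can be stepped with one add per texel. In screen space only 1/z is linear, and recovering z needs a divide. Below is a minimal sketch using plain linear fog. It assumes the caller has already computed the depth at texel (0, 0) and its per-texel steps along s and t. The function and parameter names (apply_linear_fog, z00, dzds, dzdt) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Blend a cached surface toward fog_value using linear fog.
   Assumes depth is affine across the surface: z(s,t) = z00 + dzds*s + dzdt*t. */
void apply_linear_fog(uint8_t *surf, int w, int h,
                      float z00, float dzds, float dzdt,
                      float fog_start, float fog_end, uint8_t fog_value)
{
    assert(fog_end > fog_start);
    const float inv_range = 1.0f / (fog_end - fog_start); /* one divide per surface */
    float z_row = z00;
    for (int t = 0; t < h; t++) {
        float z = z_row;
        for (int s = 0; s < w; s++) {
            float f = (z - fog_start) * inv_range; /* 0 = no fog, 1 = full fog */
            if (f < 0.0f) f = 0.0f;
            if (f > 1.0f) f = 1.0f;
            uint8_t *p = &surf[t * w + s];
            *p = (uint8_t)(*p + (fog_value - *p) * f + 0.5f);
            z += dzds; /* step along s: one add, no divide */
        }
        z_row += dzdt; /* step to the next row along t */
    }
}
```

Because depth depends on the viewer, this pass has to be redone when the camera moves. The sketch only shows that the per-texel work needs no divide.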
You can also use it to render multiple coplanar triangles, such as a surface split along shadow boundaries. I am thinking about doing some kind of advanced lighting using this method. It could even allow for radiosity, if the shadow set only has to be calculated once, before runtime, with the shadows from a dynamic light added in afterwards.
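The shadow idea can be sketched under one loud assumption: the dynamic light stays in a fixed position. Its shadow set (which texels can see it) and its distance falloff can then be computed once before runtime, and only its intensity changes while the game runs. The names below (add_dynamic_light, shadow_mask, falloff) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Add a fixed-position dynamic light to a surface's light values.
   shadow_mask[i]: 1 if texel i sees the light, 0 if shadowed (precomputed offline).
   falloff[i]:     0..255 attenuation for texel i (precomputed offline).
   intensity:      0..255, the only value that changes at runtime. */
void add_dynamic_light(uint8_t *light, const uint8_t *shadow_mask,
                       const uint8_t *falloff, int n, int intensity)
{
    assert(intensity >= 0 && intensity <= 255);
    for (int i = 0; i < n; i++) {
        if (!shadow_mask[i])
            continue; /* texel occluded: this light adds nothing */
        int sum = light[i] + falloff[i] * intensity / 255;
        light[i] = (uint8_t)(sum > 255 ? 255 : sum); /* saturate instead of wrapping */
    }
}
```

The resulting light values would be combined with the texture the next time the cached surface is rebuilt, so the visibility test itself never runs at runtime.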
The point is that this system can avoid many perspective corrections and keep blends from happening every frame. I think this is a perfectly good reason why software rendering is not dead: with faster processors and simplified 3D FPU instructions like 3DNow!, it may be reasonable to get near-hardware-quality rendering on a PC.
This is because the key to writing a good software renderer, which many have forgotten, is that you can use tricks such as this one, which are not possible in hardware (in this case, not without strangling the bandwidth to your 3D card).
Hardware, although fast and increasingly feature-rich, cannot in its current state do anywhere near the things possible in a software renderer (a programmable chip with a special 3D-oriented instruction set might be able to). So instead of writing your software renderer to the exact pattern of your hardware one, albeit with nice filtering and blending modes, consider how you can cut many cycles from the process by being smarter than the average 3D processor. A TNT cannot keep blend results across frames. It can only apply two textures per pass. It relies on its z-buffer. You should not feel obliged to implement a similar system, even if that is your target hardware platform.