Nick April 19, 2002, 03:02 PM 

Thanks for the old version! :)
>> Besides, my fixed point math is more precise.
A float has 'only' 24 bits of mantissa precision, but the eight exponent bits are not lost, you know! Your 32-bit fixed-point numbers can have a lot of unused bits at the MSB side, while a float 'always' uses the full 24-bit mantissa. Also, when you divide or multiply floats the same relative precision remains, while a fixed-point number loses precision. If you scale your world by a factor of 1000, floats won't give any problem, but fixed-point numbers might overflow/underflow. So use floats, they're really not slower and your code looks a lot better ;)
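To make the precision argument concrete, here is a minimal sketch in C. It assumes a 16.16 fixed-point format, which is only a guess for illustration (the format you actually use is not stated), and the helper names `fix_from_double`, `fix_mul` and `fix_mul_fits` are made up for this example.

```c
#include <stdint.h>

/* Hypothetical 16.16 fixed-point type, chosen only for illustration. */
typedef int32_t fix16;
#define FIX_ONE 65536

/* Convert a double to 16.16, rounding to nearest. */
static fix16 fix_from_double(double x) {
    return (fix16)(x * FIX_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Multiply two 16.16 numbers; the caller must ensure the result fits. */
static fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) / FIX_ONE);
}

/* Returns 1 if a*b is representable in 16.16, 0 if it would overflow. */
static int fix_mul_fits(fix16 a, fix16 b) {
    int64_t p = ((int64_t)a * b) / FIX_ONE;
    return p >= INT32_MIN && p <= INT32_MAX;
}
```

With these helpers, 0.001 * 0.001 comes out as exactly zero in 16.16 (the product is smaller than one fractional step), while 0.001f * 0.001f is still a perfectly usable float. At the other end, 1000 * 1000 no longer fits in 16.16 at all, which is the overflow case after scaling the world.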
>> But I like Phong!
Ok, in that case, keep the LUT :) But try making it a lot smaller. I even think that a table of 32x32 will look the same, but it won't cause a cache miss every time like your 1 MB LUT. Look, you are storing 8bit values in a gigantic table, most neighboring values would probably be the same!
>> Maybe it's a little slower, but I wanted to create an eyecandy, not just another superfastgouraud thing.
When you have enough triangles (like in this case), Gouraud looks just the same as Phong if you use the Phong formula at the vertices.
>> I'm sorry, but 3 mul's - I don't think it possible.
Then you obviously don't know where the formula for bilinear filtering came from. I'll show you two ways to get a formula with only three multiplications. First, analytically. We start from the formula you copied from Intel:
c = c0*(1-du)*(1-dv) + c1*du*(1-dv) + c2*(1-du)*dv + c3*du*dv
Expand it and you get:
c = c0 - c0*dv - c0*du + c0*du*dv + c1*du - c1*du*dv + c2*dv - c2*dv*du + c3*du*dv
Collect du and dv:
c = ((c0-c1-c2+c3)*du - c0 + c2)*dv + (c1-c0)*du + c0
Hmm, yucky, but it's already three multiplications, isn't it? When we use some temporary variables we can simplify it a bit:
c20 = (c2-c0)*dv + c0
c31 = (c3-c1)*dv + c1
c = (c31-c20)*du + c20
That's only three multiplications and six additions. I don't think it can be any shorter. Now let me explain the meaning. Draw a square with the colors c0...c3 at the corners first. If we interpolate the color along the left side with dv, we get c20 in the formula above; if we do the same for the right side we get c31. Now interpolate across from c20 to c31 using du and we get c! Ascii art:
   c0        c1
    +---+-----+
    |   | dv  |
c20 +---c-----+ c31
    | du      |
    +---------+
   c2        c3
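If you don't trust the algebra, the two forms are easy to compare numerically. This is just a sketch in plain C with floats; the function names `bilerp_weights` and `bilerp3` are made up for this post and do not come from Intel's code.

```c
/* Reference form: one weight per corner, as in the original formula. */
static float bilerp_weights(float c0, float c1, float c2, float c3,
                            float du, float dv) {
    return c0 * (1 - du) * (1 - dv) + c1 * du * (1 - dv)
         + c2 * (1 - du) * dv       + c3 * du * dv;
}

/* Three-multiplication form: two vertical lerps, then one horizontal. */
static float bilerp3(float c0, float c1, float c2, float c3,
                     float du, float dv) {
    float c20 = (c2 - c0) * dv + c0;   /* left edge,  c0 -> c2 */
    float c31 = (c3 - c1) * dv + c1;   /* right edge, c1 -> c3 */
    return (c31 - c20) * du + c20;     /* across, c20 -> c31   */
}
```

For c0..c3 = 10, 20, 30, 40 with du = 0.25 and dv = 0.5, both functions give 22.5, and at the corners (du and dv both 0 or both 1) the second one returns c0 and c3 respectively.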
Also notice that we don't have to calculate (1-du) and (1-dv) any more, and we don't need any multiplications for the weight factors. So if you count every color component (including alpha) separately, we've reduced from 20 to 12 multiplications, or from five to three MMX multiplications. Not to mention the additions we've saved. Du and dv can be calculated in just a few clock cycles if you store u and v in a separate register.
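Here is roughly how the multiplication count works out for a packed 32-bit pixel, written as scalar C rather than MMX. Everything specific is an assumption made for the sketch: u and v are taken to be 16.16 fixed point, du and dv are the top 8 bits of their fractions, and `lerp8` and `bilerp_rgba` are hypothetical names.

```c
#include <stdint.h>

/* Lerp one 8-bit channel from a to b; frac is an 8-bit fraction (0..255).
   Written as (a*256 + (b-a)*frac) >> 8 so the shifted value is never
   negative. One multiplication per call. */
static int lerp8(int a, int b, int frac) {
    return ((a << 8) + (b - a) * frac) >> 8;
}

/* Bilinear filter of four packed 8:8:8:8 pixels (p0 p1 on top, p2 p3 below).
   Three multiplications per channel, twelve in total. */
static uint32_t bilerp_rgba(uint32_t p0, uint32_t p1,
                            uint32_t p2, uint32_t p3,
                            uint32_t u, uint32_t v) {
    int du = (int)((u >> 8) & 0xFF);   /* top 8 bits of the 16-bit fraction */
    int dv = (int)((v >> 8) & 0xFF);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int c0 = (int)((p0 >> shift) & 0xFF);
        int c1 = (int)((p1 >> shift) & 0xFF);
        int c2 = (int)((p2 >> shift) & 0xFF);
        int c3 = (int)((p3 >> shift) & 0xFF);
        int c20 = lerp8(c0, c2, dv);   /* left edge  */
        int c31 = lerp8(c1, c3, dv);   /* right edge */
        out |= (uint32_t)lerp8(c20, c31, du) << shift;
    }
    return out;
}
```

The loop over the four channels is what a single MMX multiply does in one go, so the three `lerp8` calls per iteration correspond to the three MMX multiplications. Extracting du and dv is just a shift and a mask on u and v.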
>> I always thought those guys were good in optimization. :)
Their implementation is as optimised as can be, but they didn't think long enough about the algorithms and formulas...
