
Some notes on that trick (I've been using it for blurring in all my texture generators for quite some time now ;):
1. I find the variant Per mentioned generally nicer in practice, because it needs less memory (O(N) vs. O(N^2)) and works with fewer bits in your accumulators. The latter may seem irrelevant at first, but it's not; I use 16 bits per color channel for texture generation (a good idea if you plan to generate normal and displacement maps), so with a 32 bit accumulator, you can get overflow after more than 65536 pixels, or a 256x256 bitmap -- oops. Plus the smaller memory requirements make it feasible to implement things in a way that basically everything happens in the cache, which is a nice speed boost.
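For reference, here's a minimal Python sketch of that sliding-window variant -- one running sum per row, so you never re-add the whole window (the function name and the clamp-to-edge border handling are my own choices, not from the original):

```python
def box_blur_1d(row, radius):
    """Sliding-window box blur of one row of integer samples.

    A single running sum is updated as the window slides, so each
    pixel costs one add and one subtract regardless of radius, and
    the accumulator only ever holds (2*radius + 1) samples.
    Out-of-range taps clamp to the border pixel.
    """
    n = len(row)
    width = 2 * radius + 1
    out = [0] * n
    # prime the accumulator with the window centered on pixel 0
    acc = sum(row[min(max(i, 0), n - 1)] for i in range(-radius, radius + 1))
    for x in range(n):
        out[x] = acc // width
        # slide the window one pixel to the right:
        # drop the leftmost sample, pull in the next one
        leaving = row[max(x - radius, 0)]
        entering = row[min(x + radius + 1, n - 1)]
        acc += entering - leaving
    return out
```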
2. The horizontal blur step is usually very fast. The vertical one tends to be *much* slower, because you're reading one pixel from line 0, then one pixel from line 1, etc., which is a cache nightmare. One solution is to do the vertical pass in "stripes" of more than one pixel width -- how much more depends on the number of bytes in a pixel and in a cache line. This improves the speed notably.
3. Fast blurs are nice, but being limited to a box filter isn't. Fortunately, you aren't: you can blur multiple times in succession. When you blur once, you're essentially doing a convolution with a box filter kernel. Now, convolution is associative, so "((((signal * box1) * box2) * ...) * boxN)" is the same as "signal * (box1 * box2 * ... * boxN)", with * indicating convolution. As some experimenting will convince you (I'll spare you the proofs :), a box filter kernel applied onto a box filter kernel yields a triangle kernel, and another application gives a quadratic B-spline, which looks pretty close to a gaussian (further applications will result in B-spline basis functions of higher degree, which are even closer to a gaussian; again, I'll skip the details here). It has the advantage that it actually drops off to zero after a while, unlike a true gaussian, which has infinite support; so in practice this is nicer to use than the "real thing", because you don't need to do any windowing.
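If you want to check the kernel claim numerically, a tiny brute-force convolution will do (hypothetical helper name; normalized 3-tap boxes assumed):

```python
def convolve(a, b):
    # full discrete convolution of two kernels
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

box = [1.0 / 3.0] * 3
tri = convolve(box, box)       # triangle kernel: 1,2,3,2,1 (over 9)
bspline = convolve(tri, box)   # quadratic B-spline: 1,3,6,7,6,3,1 (over 27)
```

Note the kernels stay normalized (they sum to 1), so repeated blurring doesn't change the overall brightness.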
4. The last point leads to a relatively easy implementation of a very fast "gaussian" blur (some experiments I've conducted lead me to the conclusion that Photoshop also uses the 3-repeated-box-filters approximation, so this is the same "gaussian" as in Photoshop's "gaussian blur"), if you take care of one detail: don't just do a "full" blur 3 times in succession, do the 3 passes directly for every row/column you process, while it's still in cache!
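Structurally, that per-row triple pass might look like this sketch (brute-force box passes for clarity; a real implementation would use the running-sum variant, and the cache benefit obviously only materializes in a compiled language):

```python
def blur3_row(row, radius):
    """Three box passes applied back-to-back on one row, so the row
    is touched once in memory instead of making three full-image
    sweeps. Out-of-range taps clamp to the border pixel."""
    n, w = len(row), 2 * radius + 1
    for _ in range(3):
        row = [sum(row[min(max(i, 0), n - 1)]
                   for i in range(x - radius, x + radius + 1)) / w
               for x in range(n)]
    return row
```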
An interesting phenomenon I've encountered is that replacing the vertical pass altogether by transposing the image, then doing a horizontal pass, then transposing again, was faster than the separate vertical pass in "stripes". Obviously, transposing is quite memory bandwidth-heavy, so you'll only want to do it twice, and not for all 3 passes! Still, I find this quite interesting -- but it's nice news, since it allows me to only have code for horizontal blurs together with a very short loop that transposes the image -- less work than having two separate steps with different code.
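The transpose trick boils down to this sketch, where `hblur` stands for whatever horizontal blur routine you already have (both names are hypothetical):

```python
def transpose(img):
    # rows become columns; with this, a vertical pass is just
    # transpose -> horizontal pass -> transpose
    return [list(col) for col in zip(*img)]

def blur2d(img, radius, hblur):
    """Separable 2D blur using only horizontal blur code.
    `hblur(row, radius)` is any 1D horizontal blur routine."""
    img = [hblur(row, radius) for row in img]   # horizontal pass
    img = transpose(img)
    img = [hblur(row, radius) for row in img]   # "vertical" pass
    return transpose(img)
```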

For hardware, you cannot implement this technique directly. However, for box filters of width N, you can still do better than summing N weighted copies of the image: assume for a moment that N is a power of 2.
In the first pass, compute the average of each pixel and the pixel just to the right.
Then do a second pass, with the results from the first pass as input; this time, compute the average of each pixel and the pixel *2* steps to the right. The pixel at x contains (intensity[x+0]+intensity[x+1])/2, the pixel at x+2 contains (intensity[x+2]+intensity[x+3])/2, so averaging them gives (intensity[x+0]+intensity[x+1]+intensity[x+2]+intensity[x+3])/4.
Repeat as needed. This takes exactly log2(N) steps for a width-N horizontal blur, if N is a power of two; extending that method to one that computes the exact result of blurring with a radius-N box kernel for non-power-of-2 N is quite hard to do with hardware rendering, but in practice I've found you don't need to; just offset the texture coordinates smoothly and make use of bilinear filtering. You won't get any visible popping and it will still look blurred -- who cares that it's not quite right. Also, the method I've used for explanation shifts the image to the right by N/2 pixels; "centering" is again just a matter of adjusting the texture coordinates appropriately, which again will not be quite correct if they become fractional in the process, and which I'll again ignore :)
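A software simulation of those doubling passes might look like this (names are mine; out-of-range taps clamp to the last pixel, where a real hardware version would rely on the texture addressing mode instead):

```python
def hw_box_blur(row, log2_n):
    """Simulates the log2(N) passes for an N = 2**log2_n wide box
    filter: pass k averages each pixel with the pixel 2**k steps to
    the right, so after all passes the value at x is the average of
    pixels x..x+N-1 (i.e. the result is shifted, as noted above)."""
    n = len(row)
    row = [float(v) for v in row]
    for k in range(log2_n):
        step = 1 << k
        row = [(row[x] + row[min(x + step, n - 1)]) / 2
               for x in range(n)]
    return row
```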
To do a real 2D kernel requires another log2(N) passes in the vertical direction. This can add up to quite a lot of passes. Another not-quite-right-but-OK approximation is to sum 4 copies of the image in each step -- one offset to the top-left, one to the top-right, one bottom-left, and one bottom-right. This doesn't really result in a proper 2D box filter at all, but it still looks blurred, so we'll happily take the reduced number of passes and decide to completely forget about proper filtering :).
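Sketched in the same style, one such 4-tap diagonal pass could look like this (hypothetical name; clamp-to-edge borders assumed):

```python
def diag4_pass(img, offset):
    """One pass averaging the four diagonal neighbors at +-offset.
    Not a proper separable box filter, but it spreads energy along
    both axes in a single pass."""
    h, w = len(img), len(img[0])
    def px(y, x):
        # clamp-to-edge sampling
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]
    return [[(px(y - offset, x - offset) + px(y - offset, x + offset) +
              px(y + offset, x - offset) + px(y + offset, x + offset)) / 4
             for x in range(w)] for y in range(h)]
```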
In the first pass, you can normally get some more "bang for the buck" by actively using bilinear filtering to weight more pixels in. For bloom, doing one or two such low-radius passes afterwards does a nice job of removing the very linear gradients that box filters tend to produce. Obviously, this all only pays off if you're doing a box filter with relatively high width in the first place, which is probably not a good idea anyway; but still, this makes for a cool demo effect :)
