Compute Shader: Crazy 'continue' performance impact
 
CodingCat

September 16, 2012, 06:43 AM

So let's pull yet another forum out of its deep slumber and see how many graphics programmers stop by. Yesterday, I came across a rather odd performance issue in one of my ray marching shaders. In that compute shader, I basically loop over a number of rays, marching each ray through a grid for a fixed number of steps. Some rays may no longer require marching, so I inserted a continue into the outer loop to simply skip them.

I am aware that this continue will generally not gain me anything performance-wise (as long as there is at least one active ray in the group of threads running concurrently); it was there solely for the sake of simplicity. However, the performance impact was enormous in the opposite direction: the more rays got skipped, the longer the marching took. It reliably went from 10 milliseconds (no rays skipped) up to 4 seconds (most rays skipped).
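The expectation that skipping rays should not help (but also should not hurt) can be illustrated with a toy cost model. The C++ sketch below assumes an idealized group of threads running in lockstep, where one outer-loop iteration costs as many marching steps as the busiest lane needs. The function name lockstepCost and the model itself are assumptions made for illustration; they do not describe any particular GPU or the shader in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model (an assumption, not real hardware behavior): every lane in a
// thread group executes in lockstep, and lanes that skipped their ray are
// merely masked out. The group is therefore busy for as long as its
// slowest lane, i.e. the maximum step count over all lanes.
std::size_t lockstepCost(const std::vector<std::size_t>& stepsPerLane)
{
    if (stepsPerLane.empty())
        return 0;
    return *std::max_element(stepsPerLane.begin(), stepsPerLane.end());
}
```

Under this model, a group with steps {64, 64, 64, 64} and a group with steps {0, 0, 0, 64} cost the same, and only a group in which every lane skips its ray gets cheaper. Nothing in the model predicts a slowdown from skipping, which is what makes the measured 4 seconds so surprising.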

I implemented an alternative way of skipping inactive rays, simply by setting the number of marching steps to zero. In this case, the marching time stayed constant at 10 milliseconds, as I would have expected in both cases.

Sketch of the variant taking up to 4 seconds:

   while (...)
   {
      rayIdx = nextRayIdx();

      if (!rayActive(rayIdx))
         continue;
      ...
      for (rayLength ...)
   }


Sketch of the variant constantly taking 10 milliseconds as expected:

   while (...)
   {
      rayIdx = nextRayIdx();

      if (!rayActive(rayIdx))
         rayLength = 0;
      ...
      for (rayLength ...)
   }


Up to this point, I had basically settled on the idea that branching on modern GPUs was only a matter of some threads being masked out and idling during the execution of branches not taken. However, this issue suggests otherwise, and I am not entirely sure what to make of it.

Assembly code diffs only yield the following change, as expected:
   if_nz r0.w
      continue

 
Nathan Reed

September 16, 2012, 12:51 PM

I wouldn't be surprised if the shader compiler is screwing you over. Complicated control flow is not a strong point of the current DX11 shader compilers I've tried, and a 'continue' statement may be getting it confused enough to generate really bad backend code. :( The situation will improve with time, I suppose, just like when high-level shading languages were new.

You might also try wrapping the rest of the loop in an if-statement, just to see what happens, like:

   while (...)
   {
      rayIdx = nextRayIdx();

      if (rayActive(rayIdx))
      {
         ...
         for (rayLength ...)
      }
   }


It would be logically equivalent to the 'continue', but might coax the compiler into generating better code.
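As a CPU-side illustration of that equivalence, here is a minimal C++ sketch of both loop shapes. The Ray struct and the function names marchWithContinue and marchWithBranch are hypothetical stand-ins, not taken from the original shader, and the inner loop only counts steps instead of marching through a grid.

```cpp
#include <vector>

// Hypothetical stand-in for a ray: whether it still needs marching,
// and how many marching steps it would take.
struct Ray {
    bool active;
    int length;
};

// Shape 1: inactive rays skip the rest of the loop body via 'continue'.
int marchWithContinue(const std::vector<Ray>& rays)
{
    int steps = 0;
    for (const Ray& ray : rays) {
        if (!ray.active)
            continue;
        for (int i = 0; i < ray.length; ++i)
            ++steps; // placeholder for one marching step
    }
    return steps;
}

// Shape 2: the rest of the loop body is wrapped in an if-statement.
int marchWithBranch(const std::vector<Ray>& rays)
{
    int steps = 0;
    for (const Ray& ray : rays) {
        if (ray.active) {
            for (int i = 0; i < ray.length; ++i)
                ++steps; // placeholder for one marching step
        }
    }
    return steps;
}
```

Both functions return the same step count for any input. On a CPU they should also perform about the same; any difference on the GPU would come from how the shader compiler translates each shape, which this sketch cannot reproduce.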

 
CodingCat

September 18, 2012, 05:08 AM

Thank you for your reply. I just got around to trying the variant you suggested and found something interesting. Between the active-ray condition and the inner marching for loop, there is another early-out condition:

   if (rayIdx >= rayCount)
      return;

If I move this inside the active-ray branch (which corresponds to the original continue variant), I get the same devastating drop in performance. However, if I move the early-out condition out of the branch, to right before the active-ray condition, I get the same performance as with the rayLength = 0 variant.
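The ordering reported as fast can be sketched in C++ as follows. This is only a CPU-side illustration of the control-flow shape, with the early-out placed before the active-ray test; RaySlot and marchEarlyOutFirst are hypothetical names, the extra bounds check against the vector size is an addition for memory safety, and the code cannot reproduce the GPU timing difference.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-ray data; not taken from the original shader.
struct RaySlot {
    bool active;
    int length;
};

// Counts marching steps for the first rayCount rays, with the early-out
// check hoisted out of (and placed before) the active-ray branch.
int marchEarlyOutFirst(const std::vector<RaySlot>& rays, std::size_t rayCount)
{
    int steps = 0;
    for (std::size_t rayIdx = 0; ; ++rayIdx) {
        // Early-out first, before any branch that depends on ray state.
        // The second condition only guards against reading past the vector.
        if (rayIdx >= rayCount || rayIdx >= rays.size())
            return steps;

        if (rays[rayIdx].active) {
            for (int i = 0; i < rays[rayIdx].length; ++i)
                ++steps; // placeholder for one marching step
        }
    }
}
```

The result is the same as with the return nested inside the active-ray branch, so the two orderings differ only in the structure handed to the shader compiler.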

 
Sirithang

September 27, 2012, 10:00 AM

Well, if I understood correctly how shader processing works (though I may have missed the point entirely), the speaker in this video, http://www.youtube.com/watch?v=2MzSmdC49Ns , clearly says that conditions work like this:

- if a condition fails, the shaders are marked "pending" and executed again, after the current batch finishes, with the "next" condition (in the case of a trivial if/else, if the "if" fails, each failed shader is executed in another batch, this time with the "else" active).

So could it be that, since your loop is dynamic, each "continue" basically creates a new batch (because it creates a unique condition)? Instead of one batch running slowly (no test), you would end up with lots of small batches, each faster on its own, but a lot slower once combined (given the cost of resetting things and relaunching the shader with the new "branch").

Again, that is just how I see it. I'm only a student myself, so I'm not sure how everything works.

 
This thread contains 4 messages.
 
