Not logged in, Join Here! or Log In Below:  
 
News Articles Search    
 

 Home / General Programming / Sorting with min/max Account Manager
 
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
 
Nick

May 08, 2005, 09:37 AM

Hi all,

I'm wondering if this piece of code can be futher optimized:

  1.  
  2. b = min(max(a1, c), b);
  3. b = max(min(a0, c), b);
  4.  

Or, more mathematically:
  1.  
  2. b' = max(min(a0, c), min(max(a1, c), b));
  3.  

b is supposed to be always between a0/1 and c. In other words, if it's the biggest value, replace it with the second biggest, and if it's the smallest replace it with the second smallest. As an extra given, a1 is always bigger than a0 (both are constants, while c and b are recomputed every iteration).

It's probably optimal as it is, so I'm just checking...

 
ReaperUnreal

May 08, 2005, 02:43 PM

Nesting complex statements like that isn't great for the CPU, but a statement that small will make next to no effect on the runtime of whatever you're working on. That is of course unless that's in an inner loop, in that case I'd suggest separating it into smaller statements so that the previous statement has nothing to do with the previous one.

 
Julio Gorgé

May 08, 2005, 04:06 PM

Why don't you look at the compiler output? Make sure you build the executable in release/deployment mode, so that the compiler effectively applies any optimizations to the code.

 
Nick

May 08, 2005, 05:08 PM

The code is used in an inner loop but is not a major bottleneck. Either way I'd like to further optimize it if possible. I'm already using SSE so the compiler has no influence in it.

Don't take the problem too serious. :-) The code works and I was only wondering if furhter optimizations are possible...

 
goltrpoat

May 08, 2005, 09:52 PM

This should be faster, I think (didn't count the worst case number of branches in the unrolled min/max expression, but it looked like quite a few).

  1.  
  2. if (b >= a1)      // b > a1 -> b > a0 due to a1 > a0
  3. {
  4.    // b > c means that b (and possibly a1) are the greatest elements.
  5.    // c is either second largest or second smallest by pigeonhole principle.
  6.    if (b > c)    
  7.    {
  8.       b = c;
  9.    }
  10.    // handle the special case where both b and c are the greatest elements.
  11.    // a1 is then both second smallest and second largest.
  12.    else if (b == c)
  13.    {
  14.       b = a1;
  15.    }
  16. }
  17. else if (b <= a0)  // b < a0 -> b < a1 due to a0 < a1
  18. {
  19.    // b < c means that b (and possibly a0) are the smallest elements.
  20.    // c is either second smallest or second largest.
  21.    if (b < c)
  22.    {
  23.       b = c;
  24.    }
  25.    // handle the special case where both b and c are the smallest elements.
  26.    // a0 is then both second smallest and second largest.
  27.    else if (b == c)
  28.    {
  29.       b = a0;
  30.    }
  31. }
  32.  
  33. // keep b otherwise
  34.  
  35.  


You can get rid of two else/if's in there if you allow situations like c = b < a0 < a1 (i.e. if you can relax the definition from "b is not the smallest element and not the greatest element" to "b is not the unique smallest element and not the unique greatest element"). In that case, the whole thing executes only two branches. Oh, and (b >= a1) becomes (b > a1) and (b

 
dummy

May 09, 2005, 12:59 AM

Branches is what's going to kill you in there but any decent compiler will turn that into a branchless thingy, provided you allow for P6 or better codegen (otherwise and explicit expansion like goltrpoat did is the way to go).

feeding...

  1.  
  2. namespace woot {
  3.         template<class T> static T max(const T a, const T b) { return a<b ? b : a; }
  4.         template<class T> static T min(const T a, const T b) { return a<b ? a : b; }
  5.  
  6.         template<class T> struct foo_t {
  7.                 T a0, a1;
  8.                 T bar(const T b, const T c) {
  9.                         return max(min(a0, c), min(max(a1, c), b));
  10.                 }
  11.         };
  12.         template struct foo_t<int>;
  13.         template struct foo_t<float>;
  14. }
  15.  


i get
  1.  
  2. 0000000000433310 <woot::foo_t<int>::bar(int, int)>:
  3.   433310:       mov    0x4(%rdi),%eax
  4.   433313:       mov    (%rdi),%ecx
  5.   433315:       cmp    %edx,%eax
  6.   433317:       cmovl  %edx,%eax
  7.   43331a:       cmp    %eax,%esi
  8.   43331c:       cmovle %esi,%eax
  9.   43331f:       cmp    %ecx,%edx
  10.   433321:       cmovg  %ecx,%edx
  11.   433324:       cmp    %eax,%edx
  12.   433326:       cmovge %edx,%eax
  13.   433329:       retq
  14.  


and... (with SSE, but fcmov* would be used for fpu)
  1.  
  2. 0000000000433330 <woot::foo_t<float>::bar(float, float)>:
  3.   433330:       movss  0x4(%rdi),%xmm2
  4.   433335:       maxss  %xmm1,%xmm2
  5.   433339:       minss  (%rdi),%xmm1
  6.   43333d:       minss  %xmm0,%xmm2
  7.   433341:       movaps %xmm1,%xmm0
  8.   433344:       maxss  %xmm2,%xmm0
  9.   433348:       retq
  10.  

 
Nick

May 09, 2005, 02:59 AM

Thanks goltrpoat! Unfortunately I do need the closest one. I'm using this code for an improved linear approximation, and sorting these values this way is the key to it.

If I write it down with if-else statements like you suggested and then 'convert' that to using min/max I get the code I'm already using. So I guess that proves its optimality...

 
Nick

May 09, 2005, 03:03 AM

This is GCC's output? Impressive! Only the movaps could have been replaced with a movss.

Anyway, clearly the compiler couldn't optimize it further eiher...

 
dummy

May 09, 2005, 04:01 AM

Nick wrote: This is GCC's output? Impressive! Only the movaps could have been replaced with a movss. Anyway, clearly the compiler couldn't optimize it further eiher...


Yes that's with gcc and obviously for x86-64 as posted. But you get the same kind of code for x86 (modulo ABI issues, namely passing stuff via fpu stack).

It takes a bit of work to convince icc to spit something equivalent while msvc2k3 is completly hermetic.

 
This thread contains 9 messages.
 
 
Hosting by Solid Eight Studios, maker of PhotoTangler Collage Maker.