Not logged in, Join Here! or Log In Below:  
 
News Articles Search    
 

 Home / General Programming / Accessing structures is faster? Account Manager
 
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
 
Peter Bone

April 03, 2005, 06:01 AM

While trying to speed up my Gouraud shader I replaced structured RGB values with individual variables in the hope that it would be slightly faster. I was surprised to find that it was a lot slower. Here's the original line scanning loop with the RGB values as a structure (Delphi code):

// scan the line
for x := ScanStart to ScanEnd do begin
Scan[x].rgbtBlue := RGB.B shr 8;
Scan[x].rgbtGreen := RGB.G shr 8;
Scan[x].rgbtRed := RGB.R shr 8;
Inc(RGB.B, RGBdx.B);
Inc(RGB.G, RGBdx.G);
Inc(RGB.R, RGBdx.R);
end;

RGB and RGBdx are of the following type:

TRGBFixed = record
R : integer;
G : integer;
B : integer;
end;

and here's the same code with seperate variables:

// scan the line
for x := ScanStart to ScanEnd do begin
Scan[x].rgbtBlue := B shr 8;
Scan[x].rgbtGreen := G shr 8;
Scan[x].rgbtRed := R shr 8;
Inc(B, RGBdxB);
Inc(G, RGBdxG);
Inc(R, RGBdxR);
end;

The 6 RGB and RGBdx values are now all integers.

With seperate variables it was quite a lot slower. Why? I would have thought that accessing structures would have been slower.

Thanks

Peter Bone

 
Chris

April 03, 2005, 06:26 AM

If the compiler is good, it should boil down to almost the same code.

If you want to know the answer, step to the code in the debugger and select View|Debug Windows|CPU in the Delphi IDE to have a look at the generated assembly code.

Also be sure to enable Optimization and remove I/O, Range and Overflow checking under Project|Options|Compiler.

(I don't know if I'm telling you things here you've already known for years,
so forgive me if this doesn't help you).

I copied your code to Delphi and had a quick look at my assembly code, and the answer explains the behaviour you see:

Using structs:
- 15 instructions doing 12 memory accesses per loop.

Using separate variables:
- 12 instructions doing 6 memory accesses per loop.

So the lesson is: Delphi is a nice language, excellent RAD tool, has a bleedingly fast compiler, but has abysmal code optimization. It's not suited for such things as software renderers.

Using variables, the compiler offloads them to CPU registers. Using structs, it must keep them in memory and also recalculates the struct's base address in every loop. It could do much better, but sadly it doesn't.

The answer to all these questions (also to your 16 versus 32 bit question) is hidden in the generated assembly code. It's definitively worth learning to read and even code assembly. You can gain much speed by doing such loops in assembly yourself.

 
Patrick Grawehr

April 03, 2005, 03:04 PM

Chris: Maybe you just overlooked it, but Peter said that the SECOND code was slower, which, after your analysis and after what he'd exspected, shan't be.

 
timm

April 03, 2005, 03:16 PM

Doesn't this have less to do with code and more to do with data,
specifically locality. Since all of the struct members are stored contiguously you will have Red and Green in cache once you read Blue.

 
Peter Bone

April 04, 2005, 04:43 AM

Yes, I think Chris overlooked what one I was saying was faster. The structure was faster. But he still gave good advice on reading the generated assembly. I've never learnt assembly but I've always wanted to.
I think timm may be right. Because the structure is stored in the same place it can access green and blue faster after accessing red.
As for Delphi not creating optimized code I have no idea, but I know that the speed of my software renderer is now limited to the speed the image can be blitted to the screen. The time taken to draw the polygons is negligible compared to this.
If anyone knows a faster way to blit bitmaps to the screen in Delphi, other than BitBlt, then I'd love to know. I'd be willing to use hardware just for the blitting.

Thanks

Peter

 
Chris

April 04, 2005, 01:50 PM

I overlooked it, but I also didn't notice it because I cannot reproduce it. With my machine and Delphi, full opimizations, no checks, the version w/o structures is faster, which the assembley analysis above clearly explains.

Also cache locality is not an explanation, because variables on the local function stack don't have wildly different addresses but are aligned to each other as well. And cache lines are generally 64 bytes wide, so that should not matter.

Things would be different if he declared his Scan[x] array as three different arrays, e.g. comparing an SOA (structure-of-arrays) approach to an AOS (array-of-structures) approach. Then AOS approach would clearly benefit from cache locality. But that he didn't do.

 
Peter Bone

April 05, 2005, 05:17 AM

Ok, here's my test rig:

http://atlas.walagata.com/w/peterbone/GouraudSpeedTest.zip

When I click the 'old' button (structures) I get 10 seconds.
When I click the 'new' button (variables) I get 15 seconds.

The only thing I can think of is that I have to assign the initial RGB for the line variable at a time instead of assigning one structure to another - but surely this cannot account for this kind of speed reduction.

Thanks for your help.

Peter

 
z80

April 05, 2005, 11:34 AM

Peter Bone wrote: If anyone knows a faster way to blit bitmaps to the screen in Delphi, other than BitBlt, then I'd love to know. I'd be willing to use hardware just for the blitting.


SetDIBitsToDevice (see http://msdn.microsoft.com/library/default.asp?url=/library/en-us/gdi/bitmaps_8e5h.asp) is probably a better way to get pixels on the screen (if you have that function in Delphi -- dunno much about Delphi).

If you wanna use H/W writing to a dynamic texture/surface and rendering that on the screen is probably the best way. You could use DirectDraw 7 (again if this is available in Dephi) if you want h/w acceleration without the requirement for a 3D card.

 
Alex Herz

April 05, 2005, 12:26 PM

maybe u can post the importand bits of the disassembly for both versions, so ppl who cannot be bothered to compile the stuff using delphi can also have a look :)

Alex

 
Chris

April 05, 2005, 03:38 PM

The results on my machine are even more dramatic.
11.2 seconds for structure and 29.8 (!) seconds for variables.

Here's the disassembly of the innermost loop (e.g. horizontal scanline filling), and it's entirely clear that the code for separate variables is abysmally slow.

(BTW the Delphi IDE is so uncomfortable. I cannot copy/paste from the disassembly view, believe it or not)

Structure:
lea esi, [eax+eax*2] // red
mov ecx, [ebp-74h]
shr ecx, 8
mov [ebx+esi], cl
mov ecx, [ebp-78h] // green
shr ecx, 8
mov [ebx+esi+1], cl
mov ecx, [ebp-7ch] // blue
shr ecx, 8
mov [ebx+esi+2], cl

Note that this is crap, I could hand-code it much faster. I count 7 memory accesses for what should have been 2.

The worst thing is accumulation of the bytes in memory instead of a register, this creates 3 unaligned masked accessed to a dword, about the worst-case memory acccess scenario for pipelined CPUs.

But now it get's even worse:

Variables:

lea ecx,[eax+eax*2] //red
lea ecx,[ebx+ecx]
push ecx
mov ecx,[ebp-30h]
shr ecx, 8
pop edi
mov [edi], cl
lea ecx,[eax+eax*2] // green
lea ecx,[ebx+ecx+1]
push ecx
mov ecx,[ebp-2ch]
shr ecx, 8
pop edi
mov [edi], cl
lea ecx,[eax+eax*2] // blue
lea ecx,[ebx+ecx+2]
push ecx
mov ecx,esi
shr ecx, 8
pop edi
mov [edi], cl

No further comment needed here.
I count 17 (!) memory accesses for what should have been 2. Note the pushs and pops.

Summary: Do it in assembly yourself, or do it using a really optimizing compiler like gcc, Microsoft's C, or Intel's C.

Also don't draw the conclusion that accessing structures is faster somehow. If done properly by the compiler, it should be the exact same thing.

 
Nick

April 05, 2005, 04:27 PM

Slightly off-topic, but I want to share an 'optimization horror-story' I encountered lately:

I was working on a (provable) algorithmic improvement of a major bottleneck in my project. I isolated the bottleneck in a test application and first tested the optimization there. Everything went excellent, and performance almost doubled for a common scenario. Then I tried to integrate the optimization into my project, and the bottleneck suddenly took two times longer!

Looking for explanations, I tried profiling the code, and the performance degradation was gone! Debug builds were significally faster than Release builds. But all the code was static hand-written assembly code...

I'll save you from further details, but it turned out that I was heavily thrashing the L1 data cache of my Pentium M processor. The only difference between the Debug and Release build was the starting address of the stack (esp register). Adding a certain offset to the esp register solved the problem. Unfortunately, this isn't a permanent solution because it would be different for nearly every processor. Luckily it doesn't have a huge impact on my project's total performance, but it did make it very hard to practically prove that my optimization was working.

My only conclusion is: Don't waste time on optimizations that depend on too many external conditions. For Peter's code, the compiler controls everything. So never make your high-level code less readable just because it would create faster code on one compiler. Assembly code is a lot more predictable but it can still give unexpected results...

 
Peter Bone

April 06, 2005, 05:25 AM

ok, I finally admit that the Delphi compiler is rubbish. Why did I ever choose Delphi over c. Maybe because it was the first programming language I learnt and it was easier to learn.
Now I just need to learn assembly and then compress those 17 memory accesses into 2. Thanks for your help.

Peter

 
glWizardry

April 06, 2005, 08:14 AM


Peter Bone wrote: ok, I finally admit that the Delphi compiler is rubbish. Why did I ever choose Delphi over c. Maybe because it was the first programming language I learnt and it was easier to learn. Now I just need to learn assembly and then compress those 17 memory accesses into 2. Thanks for your help. Peter


Why don't you just go learn good ol c/c++? assembler is wonderful in the hands of an expert, but in the hands of an amateur it is worse. Trust me. Get a good C compiler and translate your code to C. It is not that hard really. It is good practice to learn C/C++. Trust me, it is very hard to beat out most of these modern compilers yourself unless you have years of experience in assembler.

 
pauljan

April 06, 2005, 08:53 AM

Peter, Chris, what Delphi version are you using? Borland has been optimizing their compilers (even though compiled code optimization is still far from their number 1 prio), and I'd be interesting to see if the differences still hold in the latest 2005 compiler. Also, I'd like to see how freepascal performs here.

Personally, I am not too surprised by the results you got. Using a "With Scan[x] do" will result in code that is a bit better. And yes, that fact alone implies the optimizer sucks.

 
Chris

April 06, 2005, 09:37 AM

The disassembly is from a Delphi 7 version. Since I don't have any of the newer versions, I cannot answer this. I've heard of, but never used freepascal.

Borland always gave priority to compilation speed, and truly their compilers' speed, whether C++ or Pascal/Delphi, is and has always been unbeaten. So I'm not surprised either, and in my very first post I already blamed the compiler's bad optimization (w/o having confirmed it, granted).

They also put strong emphasis on safety, with the default project options nearly every array access and arithmetic operations gets range and overflow-checked, at a great penalty to execution speed.
But I guess in their most important market segment, large database and web applications, hardcore optimization does not matter much, and it'd be a waste of manhours for them to try to compete with MS/Intel here.

Surely Microsoft's compiler spits out strange code sometimes, and in the past even generated plain wrong code due to over-optimizing it (which Delphi never did, not even in the otherwise very broken version 4), but it's generally much more efficient and does a much better job with mapping variables to registers.

It's a matter of properly choosing tools. Delphi for beautiful UI, extremely short development times. MS/Intel for rapid execution speeds.

However this is not a question of language. I'm sure one could optimize Pascalish source code the same way.

 
pauljan

April 07, 2005, 04:43 AM

Note that I agree completely with what you just said. I've been a happy Delphi user since version 2, and while for inner loops the assembler generated could've been a lot better, for business appplications the code optimizer does a decent job. And, frankly, I have never seen a business application run too slow on a target machine because of poor low-level optimization.

 
Peter Bone

April 07, 2005, 05:17 AM

I use Delphi 7 too. I don't have Delphi 2005 so I can't try it.
I tried the 'with Scan[y] do' but it made it slower. Actually it was very strange - there are 2 parts of the Gouraud code that have the inner loop (the 2 sections of the triangle), if I put the with on just 1 of them (either one) it made it faster, but putting it on both made it slower! How could that possibly happen?
I think I will learn C++. Does the Borland C++Builder have a good compiler? Is it a RAD environment like Delphi? I want to be able to create an interface with drag and drop components like Delphi.

 
pauljan

April 07, 2005, 07:33 AM

Why don't you write your interface in Delphi then, and use DLL's built in any decent C++ flavour (I'd go for a free compiler to start with) for anything that is time-critical.

 
Chris

April 07, 2005, 11:20 AM

Yes, pauljan, this is exactly what I also do.

The IDE and environment is exactly like Delphi, it uses the VCL too. Just the language is different.

However I'd recommend against C++Builder. It's C++ support has been rather incomplete - though I didn't have a look at the newest versions - when it comes to templates and especially partial template specification.

To be evil, I'd say take Linux, KDevelop and gcc. But that may bee too big a change to make it at once. For the time being, why don't you try the free Dev-Cpp IDE which uses the free MinGW compiler, which essentially is gcc for Windows.

gcc's optimization rivals Intel's - which is known to be more or less the best available on the market. To be honest, I cannot comment on C++Builder's optimization qualities.

http://www.bloodshed.net/devcpp.html

Admittedly I was a bit annoyed by the Dev-Cpp IDE when I tried it.

 
Steven Hansen

April 07, 2005, 11:33 AM

Quick addition to this fun discussion.

Our professional presentation product uses Delphi for UI and engine control, and C++ for the engine written inside COM dlls. It works great most of the time. Delphi is *so* much easier for RAD developement. I personally hate MFC. (Note - I'm not the Delphi guy here).

Our product consists of quite a few high-end graphics features, with much of it needing hand-coded SSE/MMX optimizations as well. The moral of the story is that you can use the right tool for the job, and still mix and match with quite a bit of success.

In our case, the original product started out entirely in Delphi. With the need to add optimized graphics routines, a C++/DirectX developer (me) was brought on staff. Things have evolved since then. Fun!

Good luck.

 
This thread contains 20 messages.
 
 
Hosting by Solid Eight Studios, maker of PhotoTangler Collage Maker.