3D Theory & Graphics / glDrawRangeElements() bottleneck
Archive Notice: This thread is old and no longer active. It is here for reference purposes. This thread was created on an older version of the flipcode forums, before the site closed in 2005. Please keep that in mind as you view this thread, as many of the topics and opinions may be outdated.
 
telak

April 08, 2005, 09:27 PM

Hi folks! I'm having some frame rate troubles and wonder if you could help me out.

I'm using OpenGL on a 2.8GHz P4 with a 32MB GeForce Go 5200. I'm drawing a mesh with a constant number of quads, stored in an indexed, sorted-by-texture vertex array (which is uploaded once offline using the VBO extension). Because the array is sorted by texture, my understanding is that I have to call glBindTexture() and glDrawRangeElements() exactly once for each texture.

With a 320x320 GL_QUADS mesh (102400 quads, i.e. 204800 triangles) using only one texture, I get an acceptable frame rate of 30fps. However, increasing this to 8 textures results in a drop to 20fps. Surely 8 textures isn't that much?

As I said, the only functions that are called more often when the number of textures increases are glBindTexture() and glDrawRangeElements(): if the mesh has 1 texture, they're each called once; if it has 8 textures, they're each called 8 times. I commented out the texture binds to see if they were the bottleneck, but this made no difference to the frame rate. So it must be the multiple glDrawRangeElements() calls that slow things down. Which means I'm CPU-limited, right?

I then tried converting the index array to a VBO, and uploaded the indices once at the start of every display cycle using glBufferDataARB(). I hoped this would prevent the indices having to be sent across the bus every time glDrawRangeElements() is called. But the frame rate still stays exactly the same. No increase, no decrease!

Anyway, I just wondered if it's normal for glDrawRangeElements() calls to be this slow. Or am I missing something? Is there an alternative way to render an indexed sorted-by-texture VBO that doesn't require a call to glDrawRangeElements() for each texture? As the project develops, I'll need far more than 8 textures, so I can't continue with the current system.

 
Chad Austin

April 08, 2005, 10:00 PM

Why do you call glDrawRangeElements multiple times with multiple textures? Shouldn't you be using multitexturing?

If you're rendering your geometry eight times, the drop would be normal...

 
telak

April 08, 2005, 10:12 PM

No, there's no multitexturing: each quad only has one texture. The geometry is only rendered once. Perhaps I didn't explain too well.

In the first test, where I got 30fps, the mesh uses only one texture: that is, every quad uses that texture, so the whole mesh can be drawn with a glBindTexture() call and a glDrawRangeElements() call. In the 20fps test, 8 textures are being drawn: so if I have 102400 quads, texture 0 is drawn on roughly 12800 of them (an eighth), texture 1 on another 12800, and so on. This requires 8 glBindTexture() and glDrawRangeElements() calls - or at least I can't see any other way of doing it. If there is an alternative, I'd be pleased to hear about it. :)

 
criznach

April 09, 2005, 12:47 AM

Have you tried this on any other cards? What resolution and color depth are you using? What's the size of your textures? 32MB is not much by today's standards, and I wonder if you're thrashing your VRAM with 8 textures.

 
Reedbeta

April 09, 2005, 01:19 AM

You could try stacking all 8 textures into a single texture, and using the texture coordinates to select the "subtextures".

However, rendering 200K triangles with simple texturing on your card should be FAR faster than 30fps, unless you're using a very expensive shader of some kind. I'd look for the bottleneck elsewhere.
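To illustrate the stacking idea: below is a minimal sketch of remapping a tile's local texture coordinates into one sub-region of a packed texture. The 4x2 layout and the function name are illustrative, not from the thread.

```c
/* Map a local coordinate (u,v) in [0,1] into the sub-rectangle of a
   single packed texture that holds subtexture 'slot' (0..7).
   Assumes a hypothetical atlas of 4 columns x 2 rows. */
void atlas_coord(int slot, float u, float v, float *au, float *av)
{
    const int cols = 4, rows = 2;
    int cx = slot % cols;    /* column of this subtexture */
    int cy = slot / cols;    /* row of this subtexture */
    *au = (cx + u) / cols;
    *av = (cy + v) / rows;
}
```

One caveat with this approach: bilinear filtering and mipmapping can bleed texels across subtexture borders, so packed tiles usually need a border or clamped sampling region.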

 
telak

April 09, 2005, 09:08 AM

Have you tried this on any other cards? What resolution and color depth are you using? What's the size of your textures? 32mb is not much by today's standards and I wonder if you're thrashing your vram with 8 textures.


I'm using 800x600 at 16-bit. The textures are 64x64. I don't think vram can be the problem, as I store all 8 textures in memory whether they're used in the mesh or not.

You could try stacking all 8 textures into a single texture, and using the texture coordinates to select the "subtextures".


That's a possibility, I suppose. Thing is, because the mesh is a regular quad grid, I'm having OGL generate the texture coords automatically. This is mainly to save memory: it means that each shared vertex only needs to be stored in the array once (thereby reducing the size of the array to 25%). It also means I don't have to spend any memory on storing texture coords. Stacking the textures onto one big texture would allow me to reduce the number of function calls, but it would also make my vertex array grow by 300%.

Besides all this, it would only be a temporary fix: in future, I plan to use more textures than could be stacked onto a single one, so multiple function calls will be necessary anyway. As you say, with only 8 texture changes this should be running faster than it is, so I'd like to eliminate the problem at this early stage.

However, rendering 200K triangles with simple texturing on your card should be FAR faster than 30fps, unless you're using a very expensive shader of some kind. I'd look for the bottleneck elsewhere.


But what else can it be? Performance only falls when I use more textures in the mesh, and the only two functions that are called more when the mesh has more textures are glBindTexture() and glDrawRangeElements() - everything else stays the same.

Here's the rendering code. I hope that makes things clearer. The vertex array is already uploaded to the card at this point (I do that once at startup). Sorry this is a bit ugly: is there a way to do indents on this board? :P


//Bind the vertex and index arrays, upload the index array and set the vertex pointer
if(usevbo){
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbovertices);
    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, vboindices);
    glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, sizeof(unsigned int)*numberofindices, indices, GL_STATIC_DRAW_ARB);
    glVertexPointer(3, GL_FLOAT, 0, 0);
}
else{
    glVertexPointer(3, GL_FLOAT, 0, vertices);
}

unsigned int start=0;

//Render the quads by texture: texturecount[t] holds the number of indices using texture t
for(int t=0; t<numberoftextures; t++){
    glBindTexture(GL_TEXTURE_2D, texturenames[t]);
    glDrawRangeElements(GL_QUADS, 0, numberofvertices-1, texturecount[t], GL_UNSIGNED_INT,
                        usevbo ? (const void*)(start*sizeof(unsigned int)) : (const void*)&indices[start]);
    start += texturecount[t];
}

 
Erik Faye-Lund

April 09, 2005, 01:40 PM

okay, first of all... as far as i know, nearly no current hardware actually supports quads. the driver has to triangulate the data itself, on the CPU. so this might cause SOME of the way-too-low fps you're getting to begin with... i also suspect that your hardware doesn't support 32-bit indices; try reducing them to 16-bit. also... is the glBufferDataARB() call done on a per-frame basis? if so, you should use the GL_DYNAMIC_DRAW flag or something instead... i know that none of my points here are directly related to your question, but hey ;)

 
Reedbeta

April 09, 2005, 01:57 PM

I can't think of very many things that could be wrong here...

Is it necessary to upload the index data each frame? Is the index data changing? If not, upload it just once, with the vertex data; if it's changing, maybe you shouldn't use GL_STATIC_DRAW_ARB (although I don't yet know of a driver that actually pays attention to the usage parameter).

Also, if your terrain is less than 65536 vertices, you might try using GL_UNSIGNED_SHORT indices and see if this makes a difference.
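If the mesh does fit in 16 bits, the conversion is mechanical. A sketch with a range check (the function name is mine, not from the thread); the caller would also switch the type parameter of glDrawRangeElements to GL_UNSIGNED_SHORT:

```c
/* Convert 32-bit indices to 16-bit ones. Returns 1 on success, or 0 if
   any index exceeds 65535, in which case the caller should keep the
   32-bit (GL_UNSIGNED_INT) path. */
int indices_to_shorts(const unsigned int *in, unsigned short *out, int n)
{
    for (int i = 0; i < n; i++) {
        if (in[i] > 65535u)
            return 0;                  /* doesn't fit in GL_UNSIGNED_SHORT */
        out[i] = (unsigned short)in[i];
    }
    return 1;
}
```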

 
Reedbeta

April 09, 2005, 02:00 PM

Erik brings up a good point - you could also reduce the number of indices you need to send by ~50% using triangle strips instead of quads. (Quads require 2 indices per triangle on average, while triangle strips need only 1.)

 
Chad Austin

April 09, 2005, 02:15 PM

For what it's worth, the VBO test in GLScry (http://aegisknight.org/glscry) with a GF4MX on Linux shows that buffers designated with the READ flags are about four times slower for drawing than the others. But I've never seen a difference between STATIC and DYNAMIC...

 
telak

April 09, 2005, 03:01 PM

Thanks for the replies, guys. I'll get to the bottom of this one day!

okay, first of all... as far as i know, nearly no current hardware actually supports quads. the driver has to triangulate the data itself, on the CPU. so this might cause SOME of the way-too-low fps you're getting to begin with... i also suspect that your hardware doesn't support 32-bit indices; try reducing them to 16-bit. also... is the glBufferDataARB() call done on a per-frame basis? if so, you should use the GL_DYNAMIC_DRAW flag or something instead... i know that none of my points here are directly related to your question, but hey ;)


All right, I've now changed it to use GL_DYNAMIC_DRAW instead. It hasn't made any difference, but I'll leave it in just in case it does in future. ;)

I'm not sure about using triangle strips instead of quads. Wouldn't I have to join together non-adjacent mesh tiles with degenerate tris? Since I'm sorting the mesh by texture, a lot of non-adjacent tiles have to be drawn: in fact, I'd probably need a degenerate triangle for almost every tile. Would this still be quicker than using quads?

As for converting the index array to shorts: my 320x320 mesh has 409600 indices, so it won't fit. Even if I use tristrips, it'll be over 204800. I could split it up into, say, 8 arrays of 51200 indices each: but then the glBindTexture()-glDrawRangeElements() cycle would have to run eight times as often as it does now! Again, the question is, on balance, would any speed-up from using shorts instead of ints be enough to offset this disadvantage?

Is it necessary to upload the index data each frame? Is the index data changing? If not, upload it just once, with the vertex data


At present, the indices don't change. But I was hoping to add octree-based frustum culling later, which would require the index array to be built every frame (unless there's a better way to do it?). So any saving I achieved now from uploading the indices at startup would be erased later on. Nevertheless, I just tried it out (uploading the indices at startup): and the fps stays the same!

I also tried turning off the texcoord generation and replacing it with just one call to glTexCoord2i(0,0) before rendering begins (so the textures just look like blocks of colour): this results in a pitiful boost of around 5fps on the 8 texture mesh, but I'm guessing this is just because no texcoords are being specified for each vertex. Not really a viable way to speed things up. I'm still stuck.

Out of curiosity, is it standard practice to get the GL to generate texcoords automatically on regularly-spaced meshes? Does anyone know if it's slower, faster or no different from explicitly specifying texcoords?

 
Reedbeta

April 09, 2005, 03:24 PM

I used an indexing scheme for my quadtree-based frustum culler that required no change to the indices. Basically, the index buffer was subdivided into 4 contiguous intervals and each was assigned to one of the 4 children of the root of the quadtree; this is applied recursively until you've reached leaves, at which point the subdivided intervals are actually filled with indices. This results in the indices for any quadtree node, from the root to the leaves, being contiguous in the buffer. (I used triangles, by the way, not triangle strips.)

The same idea would work for octrees, as long as you didn't have any geometry in multiple octree nodes. However, my terrain only used one texture; this scheme might prove impractical when one needs to do texture sorts, and it certainly doesn't lend itself to doing geomipmapping or any other kind of LOD.
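The interval-assignment step described above can be sketched recursively; the node layout and names here are illustrative, not Reedbeta's actual code:

```c
#include <stddef.h>

/* A quadtree node owning a contiguous slice [first, first+count) of the
   shared index buffer. Leaves have 'count' set when their triangle
   indices are emitted; internal nodes cover their children's slices. */
typedef struct Node {
    struct Node *child[4];   /* all NULL for leaves */
    int first, count;
} Node;

/* Assign slices depth-first so that every node's indices are contiguous.
   Returns the first index position after this subtree. */
int assign_intervals(Node *n, int first)
{
    n->first = first;
    if (n->child[0] == NULL)
        return first + n->count;       /* leaf: slice already sized */
    int next = first;
    for (int i = 0; i < 4; i++)
        next = assign_intervals(n->child[i], next);
    n->count = next - first;
    return next;
}
```

Drawing any node, root or leaf, is then one glDrawRangeElements over its [first, first+count) slice, which is what lets the culler avoid rebuilding indices per frame.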

I haven't heard anyone complain about texture coordinate generation being slow, though I've only used explicit texture coordinates in my terrain renderer.

As for triangle strips, yes, you'd have to join non-adjacent terrain patches with degenerates. This would only cost two indices per join (you just have to repeat the last index before the join and the first index after it). Whether it's worth the effort of converting your code to tri-strips is an open question - it would definitely reduce the number of indices, but that may not even be the problem.
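The two-index join can be sketched as below; the names are mine, and winding parity (keeping each appended strip starting on an even position so triangle facing is preserved) is ignored for brevity:

```c
/* Append strip 'src' (n indices) onto batch 'dst', which already holds
   *count indices, joining with degenerate triangles: repeat the last
   index before the join and the first index after it. */
void append_strip(unsigned short *dst, int *count,
                  const unsigned short *src, int n)
{
    int c = *count;
    if (c > 0) {
        dst[c] = dst[c - 1]; c++;   /* repeat last index of previous strip */
        dst[c] = src[0];     c++;   /* repeat first index of next strip */
    }
    for (int i = 0; i < n; i++)
        dst[c++] = src[i];
    *count = c;
}
```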

You might want to give your code a quick profile, just to make sure some innocuous function isn't running slowly and CPU limiting the whole thing. If you spend most of your execution time in glFinish or SwapBuffers, it's GPU limited.

 
GuybrushThreepwood

April 09, 2005, 04:13 PM

Don't know for sure whether this might help, but it hasn't been mentioned so it might be worth looking at.

glDrawRangeElements does have two defines you can feed to the glGet* commands..

GL_MAX_ELEMENTS_VERTICES_EXT
GL_MAX_ELEMENTS_INDICES_EXT

these values, the spec says, are used by vendors to tell you what should be considered the maximum amounts of vertex and index data... this could be your problem: you seem to have a lot of indices per call... here's more from the spec....

glDrawRangeElementsEXT may also be further constrained to only operate
at maximum performance for limited amounts of data. Implementations may
advertise recommended maximum amounts of vertex and index data using the
GL_MAX_ELEMENTS_VERTICES_EXT and GL_MAX_ELEMENTS_INDICES_EXT enumerants.
If a particular call to glDrawRangeElementsEXT has (end-start+1) greater
than GL_MAX_ELEMENTS_VERTICES_EXT or if count is greater than
GL_MAX_ELEMENTS_INDICES_EXT then the implementation may be forced to
process the data less efficiently than it could have with less data.
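Splitting an oversized draw into batches under the advertised limit is mostly arithmetic; a sketch, where the limit comes from glGetIntegerv(GL_MAX_ELEMENTS_INDICES, ...) and with GL_QUADS each batch must stay a multiple of 4 indices (function name is illustrative):

```c
/* Largest batch size not exceeding 'maxidx' (the driver's advertised
   GL_MAX_ELEMENTS_INDICES), rounded down to a whole number of quads. */
int quad_batch_size(int remaining, int maxidx)
{
    int b = maxidx - (maxidx % 4);   /* whole quads only */
    return remaining < b ? remaining : b;
}
```

The draw loop would then advance an index offset by quad_batch_size(...) per glDrawRangeElements call; whether several small calls actually beat one oversized call is exactly what gets tested later in the thread.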

 
telak

April 09, 2005, 04:55 PM

Reedbeta: I stuck a glFinish() just before swapping the buffers, and it didn't make any difference to the speed. I assume this means I'm not GPU-limited?

Guybrush: you may be onto something. GL_MAX_ELEMENTS_INDICES_EXT returns only 4096. When I render the 8-texture mesh, I draw on average 51200 indices with each call to glDrawRangeElements(): that's 12.5 times the recommended limit. And when I have a mesh with one texture, I'm drawing 409600 indices with one call: 100 times the limit!

So calling glDrawRangeElements() more often, with fewer indices each time, might speed things up. On the other hand, that doesn't explain why the 1-texture mesh (409600 indices at once) renders significantly faster than the 8-texture mesh (51200 indices at once).

A further wrinkle: I've now tried rendering a mesh containing 100 textures. Since the 8-texture mesh rendered quite a bit slower than the 1-texture mesh, I was expecting 100 textures to be really slow. But it's exactly the same as the 8-texture mesh - 20fps. This seems to support Guybrush's idea that it's not the number of glDrawRangeElements() calls that's relevant, but the number of indices in each call.

I guess the first step to getting down the number of indices is to convert to triangle strips. Even in the worst case with the maximum number of degenerate tris, it'll take fewer indices than using quads, right? So I'll give that a try. Shouldn't be too hard... :P

 
Reedbeta

April 09, 2005, 05:00 PM

telak wrote: Reedbeta: I stuck a glFinish() just before swapping the buffers, and it didn't make any difference to the speed. I assume this means I'm not GPU-limited?


No, no, that won't do anything. I meant you should profile your code. (A profiler is a program that measures how much time your program spends executing certain function calls). If the program is GPU limited, the profiler will reveal that most of the time is spent in glFinish or SwapBuffers; if the program is CPU limited, most of the time will be spent elsewhere.

 
This thread contains 15 messages.
 
 