Flaws with Renderer, GLProgram and gl state management

I have worked with cocos2dx for quite some time now. I looked a lot into cocos2d::Renderer and noticed some flaws and possible performance issues here and there.

1.) The Renderer’s VBO gets updated multiple times

If you submit some render commands in an order like this:

renderer->addCommand(TrianglesCommand());
renderer->addCommand(CustomCommand());
renderer->addCommand(TrianglesCommand());

the Renderer under the hood will do something like this:

...
glBindBuffer(GL_ARRAY_BUFFER, vertexVBO);
glBufferData(GL_ARRAY_BUFFER, ...);
glDrawElements(...);
...
glBindBuffer(GL_ARRAY_BUFFER, vertexVBO);
glBufferData(GL_ARRAY_BUFFER, ...);
glDrawElements(...);
...

The problem here is that glBufferData is called on the same vbo multiple times. Generally there’s no problem with that, but in this case you’re updating a vbo whose data is still needed by a previous draw call. In this case - according to this - the current thread halts and waits until all draw commands that could be affected by this update have finished. This could lead to some serious performance issues.

As an example, on my Samsung Galaxy S (I know it’s a pretty old device) this leads to an extra frame time of 14ms which is unacceptable.

There are multiple solutions to this. One would be to simply use a new vbo for each glBufferData. Another one would be to just use client-side vertex arrays.

Both of them are okay I guess, but I can think of a better one.

The third approach would be to gather all vertices and indices from all TrianglesCommands and QuadCommands before executing any RenderCommands, and then update the vertex and index vbo with the data once, before any rendering is done. This would also get rid of the overhead of having to call glBufferData multiple times. If any flushes are needed between the TrianglesCommands, add a byte offset to the glVertexAttribPointer calls. Like this:

glVertexAttribPointer(..., offset_in_vbo_in_bytes + offset_of_vertex_attrib);
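The gathering stage could look roughly like the sketch below. It is CPU-side only, so it runs without a GL context. `VertexSketch`, `GatheredCommand` and `gatherVertices` are made-up names for illustration, not cocos2d-x API; the struct only mirrors the V3F_C4B_T2F layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for V3F_C4B_T2F: position, color, texture coordinate.
struct VertexSketch {
    float x, y, z;
    std::uint8_t r, g, b, a;
    float u, v;
};

// Hypothetical record for one triangles/quad command after gathering.
struct GatheredCommand {
    std::size_t byteOffset;   // where this command's vertices start in the shared vbo
    std::size_t vertexCount;  // how many vertices belong to this command
};

// Copies the vertices of every command into one contiguous array and records
// the byte offset of each command. The caller would upload `out` with a single
// glBufferData call before issuing any draw call.
std::vector<GatheredCommand> gatherVertices(
        const std::vector<std::vector<VertexSketch>>& commands,
        std::vector<VertexSketch>& out) {
    assert(out.empty() && "expects an empty output buffer");
    std::vector<GatheredCommand> result;
    result.reserve(commands.size());
    for (const auto& verts : commands) {
        result.push_back({out.size() * sizeof(VertexSketch), verts.size()});
        out.insert(out.end(), verts.begin(), verts.end());
    }
    return result;
}
```

Each draw would then pass `byteOffset` plus the attribute's offset inside the vertex struct as the pointer argument of glVertexAttribPointer, so a flush in between never requires rewriting the buffer.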

2.) Every vertex submitted via TrianglesCommand gets transformed on the cpu

This one is generally not a bad thing, as it allows you to render lots of simple sprites with different transforms in one batched draw call.

But in a case where you submit a TrianglesCommand with lots of vertices, say 3000, it would be better to let the gpu handle all the vertex transformation, even if this would make batching unavailable for this TrianglesCommand.

A solution would be to add a flag to TrianglesCommand indicating whether the vertices should be transformed on the cpu or not.
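A rough sketch of how such a flag could steer the renderer. `FlaggedTrianglesCommand`, `transformOnCpu` and `prepareForDraw` are invented for this example and are not part of cocos2d-x; the matrix is assumed to be column-major like OpenGL's.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Mat4Sketch = std::array<float, 16>;  // column-major, as OpenGL expects

// Hypothetical command; transformOnCpu is the proposed flag.
struct FlaggedTrianglesCommand {
    std::vector<float> positions;  // x, y, z per vertex
    Mat4Sketch modelView;
    bool transformOnCpu;
};

// Returns true if the vertices were transformed here and the command may join
// a shared batch; false if it must be drawn alone with modelView as a uniform.
bool prepareForDraw(FlaggedTrianglesCommand& cmd) {
    assert(cmd.positions.size() % 3 == 0 && "positions must hold xyz triples");
    if (!cmd.transformOnCpu) {
        return false;  // vertices stay untouched; the gpu applies modelView
    }
    const Mat4Sketch& m = cmd.modelView;
    for (std::size_t i = 0; i < cmd.positions.size(); i += 3) {
        const float x = cmd.positions[i];
        const float y = cmd.positions[i + 1];
        const float z = cmd.positions[i + 2];
        cmd.positions[i]     = m[0] * x + m[4] * y + m[8]  * z + m[12];
        cmd.positions[i + 1] = m[1] * x + m[5] * y + m[9]  * z + m[13];
        cmd.positions[i + 2] = m[2] * x + m[6] * y + m[10] * z + m[14];
    }
    return true;
}
```

The trade-off is the one described above: a command that skips the cpu transform saves per-vertex work but can no longer share a draw call with commands that use a different transform.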

3.) The Renderer can only handle vertices of type V3F_C4B_T2F

If you want to render anything using a different vertex layout, you have to wrap it in a CustomCommand and implement batching for it yourself. This could be problematic, especially for opengl newbies.

It would be cool if there were a render command like this:

ArbitraryVertexCommand cmd;
cmd.init(glProgramState, texturesUsed, vertexAttribFormat, vertices_as_byte_ptr, size_of_vertices, indices_as_short_ptr, size_of_indices, gl_draw_type);
// gl_draw_type is one of GL_POINTS, GL_TRIANGLES, etc.

It would also be nice if the renderer could automatically batch these if possible.
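One way the "if possible" check could be expressed, as a sketch only. `ArbitraryCommandSketch`, `DrawTypeSketch`, `canBatch` and `appendIndices` are hypothetical names, and the ids stand in for whatever the real command would compare (program state, textures, vertex format). The sketch assumes that only list primitives are merged, because strips cannot be joined by simply concatenating their indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class DrawTypeSketch { Points, Lines, LineStrip, Triangles, TriangleStrip };

// Hypothetical, reduced view of a command: only what batching needs to compare.
struct ArbitraryCommandSketch {
    unsigned programId;
    unsigned textureId;
    unsigned vertexFormatId;
    DrawTypeSketch drawType;
};

bool isListPrimitive(DrawTypeSketch t) {
    return t == DrawTypeSketch::Points || t == DrawTypeSketch::Lines ||
           t == DrawTypeSketch::Triangles;
}

// Two commands can share a draw call if they use the same material and vertex
// layout and draw a list primitive.
bool canBatch(const ArbitraryCommandSketch& a, const ArbitraryCommandSketch& b) {
    return a.programId == b.programId && a.textureId == b.textureId &&
           a.vertexFormatId == b.vertexFormatId && a.drawType == b.drawType &&
           isListPrimitive(a.drawType);
}

// Appends a command's indices to the batch, rebased by the number of vertices
// that are already in the batch.
void appendIndices(std::vector<std::uint16_t>& batch,
                   const std::vector<std::uint16_t>& indices,
                   std::size_t baseVertex) {
    for (std::uint16_t i : indices) {
        assert(baseVertex + i <= 0xFFFF && "16-bit index overflow; start a new batch");
        batch.push_back(static_cast<std::uint16_t>(baseVertex + i));
    }
}
```

With 16-bit indices (as in `indices_as_short_ptr` above), a batch has to be closed once it holds 65536 vertices, which is why the rebasing step checks for overflow.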

4.) GLProgram shader can only use the default glsl version

This one is pretty self-explanatory.

Normally, if you want to use something other than the default glsl version in a shader you type something like this:

#version 130

on top of your shader file. But using the GLProgram class, cocos2dx puts some pre-defined uniforms before your actual shader source. This is problematic as the shader compiler expects the #version line as the first line before anything else.

It would be nice if there were a way to set the glsl version of the shader in GLProgram before it gets compiled.
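A minimal sketch of what such an option would have to do when the source is assembled: emit the #version line first, then the predefined uniforms, then the user's code. `buildShaderSource` and `kPredefinedUniformsSketch` are made up for this example, and the uniform list is abbreviated; this is not how GLProgram is actually structured.

```cpp
#include <cassert>
#include <string>

// Abbreviated stand-in for the uniform declarations that get prepended.
const std::string kPredefinedUniformsSketch =
    "uniform mat4 CC_MVPMatrix;\n"
    "uniform vec4 CC_Time;\n";

// Builds the final source. An empty version string means "use the default",
// in which case no #version line is written at all.
std::string buildShaderSource(const std::string& version,
                              const std::string& userSource) {
    assert(version.find('\n') == std::string::npos && "version must fit on one line");
    std::string out;
    if (!version.empty()) {
        out += "#version " + version + "\n";  // must precede everything else
    }
    out += kPredefinedUniformsSketch;
    out += userSource;
    return out;
}
```

For example, `buildShaderSource("130", source)` would yield a string starting with `#version 130`, followed by the uniforms and then `source`.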

5.) GLProgram puts a lot of pre-defined uniforms that you won’t need in most cases on top of the shader

Again, pretty self-explanatory. This feature could be helpful in some cases, but in most cases you won’t need it, so it would be good if there was an option to turn off specific ones.
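Such an option could be as simple as a bit mask that selects which declarations get prepended. The sketch below assumes three of the built-in uniforms for brevity; the enum and `builtinUniformHeader` are hypothetical and do not exist in cocos2d-x.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical flags, one per predefined uniform (list shortened).
enum BuiltinUniformSketch : std::uint32_t {
    kUniformMVP  = 1u << 0,
    kUniformTime = 1u << 1,
    kUniformTex0 = 1u << 2,
    kUniformAll  = kUniformMVP | kUniformTime | kUniformTex0,
};

// Returns only the declarations that were asked for.
std::string builtinUniformHeader(std::uint32_t mask) {
    assert((mask & ~static_cast<std::uint32_t>(kUniformAll)) == 0 && "unknown uniform flag");
    std::string header;
    if (mask & kUniformMVP)  header += "uniform mat4 CC_MVPMatrix;\n";
    if (mask & kUniformTime) header += "uniform vec4 CC_Time;\n";
    if (mask & kUniformTex0) header += "uniform sampler2D CC_Texture0;\n";
    return header;
}
```

Passing `kUniformAll` would keep today's behaviour as the default, so existing shaders would not need to change.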

6.) GLStateCache doesn’t handle all gl states

I recently played around with Qualcomm’s Adreno gpu profiler and noticed that there were a lot of redundant (gl calls that don’t change any state) and query (glIs, glGet) calls. This is due to the fact that the GLStateCache doesn’t keep track of things like depth test, blending, framebuffer binding, etc.

This - especially the amount of query calls - could lead to some performance issues on mobile devices.
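To illustrate the idea, here is a small sketch of a cache for boolean capabilities such as depth test or blending. `CapabilityCacheSketch` is a hypothetical class; the setter callback stands in for glEnable/glDisable so the example runs without a GL context.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

class CapabilityCacheSketch {
public:
    using Setter = std::function<void(unsigned cap, bool enabled)>;

    explicit CapabilityCacheSketch(Setter setter) : setter_(std::move(setter)) {
        assert(setter_ != nullptr);
    }

    // Forwards to the driver only if the requested state differs from the
    // cached one; the first request for a capability is always forwarded.
    void set(unsigned cap, bool enabled) {
        auto it = states_.find(cap);
        if (it != states_.end() && it->second == enabled) {
            return;  // redundant call, skip it
        }
        states_[cap] = enabled;
        setter_(cap, enabled);
    }

    // Answers from the cache instead of calling glIsEnabled. Capabilities that
    // were never set are reported as disabled, which is an assumption; ones
    // that default to enabled would have to be seeded first.
    bool isEnabled(unsigned cap) const {
        auto it = states_.find(cap);
        return it != states_.end() && it->second;
    }

private:
    Setter setter_;
    std::unordered_map<unsigned, bool> states_;
};
```

The cache is only correct as long as every state change goes through it; code that calls glEnable directly (for example inside a CustomCommand) would leave it out of sync.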

Okay that’s all I noticed till now.

I’m going to make a version of the Renderer class that addresses the first three issues I mentioned, so you can see what exactly I mean and how they could be implemented/fixed, and upload it here.

That sounds really cool! I’m looking forward to your version of the renderer!

Thanks man. Already working on it.

It might be a good idea to open a Github issue and paste this same information to the Cocos2D-x issue tracker. You will get a direct response from the devs then.

I just finished the reworked Renderer and uploaded it to GitHub; check it out here.

It works fine and I was not able to find any bugs or errors. If you find one, please let me know :smile:
In its current form, it addresses the first three flaws/issues. The most important feature is of course the ArbitraryVertexCommand, as it enables you to render all kinds of meshes, etc. without even touching OpenGL. For more info check the GitHub page.

The rendering process is as follows:

  1. sort RenderQueues
  2. make a single list out of all RenderQueues while:
     a. converting TrianglesCommands and QuadCommands to ArbitraryVertexCommand
     b. creating batching data
  3. map the index and vertex buffer to the gpu
  4. process all RenderCommands

There are however some issues/flaws with this implementation too.

To make sure that you can use this Renderer without changing any code in your game, every TrianglesCommand and QuadCommand needs to be converted to an ArbitraryVertexCommand. This creates quite some overhead if you are using a lot of them. Another issue is that if you call the Renderer::render function more than once per frame (e.g. if using multiple Cameras per scene), issue #1 happens again. A solution to this would be multi-buffering.
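Multi-buffering here means rotating through several vbos so that a render pass never rewrites a buffer that an earlier pass may still be reading. A minimal sketch, with `BufferRingSketch` as a made-up helper and plain integers standing in for buffer handles created with glGenBuffers:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class BufferRingSketch {
public:
    explicit BufferRingSketch(std::vector<unsigned> handles)
        : handles_(std::move(handles)) {
        assert(!handles_.empty() && "need at least one buffer");
    }

    // Returns the buffer to fill for the next render pass and advances the ring.
    unsigned acquire() {
        const unsigned handle = handles_[next_];
        next_ = (next_ + 1) % handles_.size();
        return handle;
    }

private:
    std::vector<unsigned> handles_;
    std::size_t next_ = 0;
};
```

How many buffers the ring needs is an open question: it depends on how many render passes happen per frame and how far the driver lets the gpu run behind, neither of which is pinned down here.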

I also did some performance comparisons between the cocos2dx renderer and my renderer. Here are the results:

###Tests

MR: My Renderer
CR: Cocos Renderer
android1: Tested on Samsung Galaxy S (very old device)
android2: Tested on Motorola Moto G

EDIT: The given time is the time that the Renderer::render function took to finish.

Test case 1: 10000 Sprites are visible on screen sharing all the same material.
Results:

Win32 MR release =  7.803578ms
Win32 MR debug   = 30.862434ms
android1 MR      = 59.805569ms
android2 MR      = 38.757420ms
Win32 CR release =  2.439611ms
Win32 CR debug   =  9.020019ms
android1 CR      = 28.575230ms
android2 CR      = 20.793112ms

Test case 2: 10000 Sprites are visible on screen. They are rendered in 1000 batches
Results:

Win32 MR release =  8.053041ms
Win32 MR debug   = 32.239613ms
android1 MR      = 63.866676ms
android2 MR      = 42.696968ms
Win32 CR release =  2.613098ms
Win32 CR debug   =  9.656133ms
android1 CR      = 31.219564ms
android2 CR      = 21.113848ms

Test case 3: 10000 Sprites are visible on screen. Every 100th Sprite, there is a CustomCommand (meaning pipeline flush)
Results:

Win32 MR release =   8.103778ms
Win32 MR debug   =  32.057499ms
android1 MR      =  61.389282ms
android2 MR      =  40.610973ms
Win32 CR release =   9.170638ms
Win32 CR debug   =  26.020094ms
android1 CR      = 201.462753ms
android2 CR      =  17.157057ms

Test case 4: 10000 Sprites using ArbitraryVertexCommand directly (meaning no conversion necessary) are visible on screen. They share the same material.
Results:

Win32 MR release =  3.819415ms
Win32 MR debug   = 18.110653ms
android1 MR      = 40.330093ms
android2 MR      = 24.701368ms
Win32 CR release =       N / A
Win32 CR debug   =       N / A
android1 CR      =       N / A
android2 CR      =       N / A

###Conclusion

Note: The gigantic gaps between the debug build and the release build are mainly due to the lack of function inlining in the debug build.

As you can see from the results, converting the TrianglesCommands and QuadCommands produces quite some overhead. Issue #1 is nicely represented by test case 3: on Win32 release and on the Galaxy S, CR is slower than MR there. However, on the Moto G, CR handles test case 3 even faster than the other cases, which is pretty odd. You can also see that if every TrianglesCommand were replaced by an ArbitraryVertexCommand, my renderer would be nearly as fast as the original.

It would be cool if anyone could use my renderer in a ‘real-life’ scenario and tell me how it performs there ;).

I haven’t really optimized the Renderer yet, so that could be another reason why it is slower than the original.

Cheers,
Darinex

What do you guys @slackmoehrle @ricardo think about this? Would be nice to hear some feedback from the cocos2dx devs too ;).

Very good job!

Yes, we discussed problems 1, 2 and 3 a lot during the design of the renderer. Because we weren’t sure how developers were going to use custom commands, we focused on optimizing the basic commands instead. I think we should revisit this and see what we can do for custom commands.

I really like the idea of ArbitraryVertexCommand: you can take advantage of batching while giving developers more flexibility to design their own meshes or VBOs.

Hey everyone, I just updated the cocos2dx-AdvancedRenderer. Check it out [here][1].

The core renderer mechanics were optimized a lot. The renderer is up to 3(!) times faster now. I also fixed the non-cpu transform batching logic (didn’t work correctly), and fixed an issue with accidentally enabled back face culling which led to some triangles not being drawn.

However, these optimizations came at a price: the batching code is a lot more complex and harder to understand now.

I also ran some tests again, here are the results:

###Tests

MR: My Renderer
CR: Cocos Renderer
android1: Tested on Samsung Galaxy S (very old device) in release mode
android2: Tested on Motorola Moto G in release mode

The given time is the time that the Renderer::render function took to finish.

Test case 1: 10000 Sprites are visible on screen sharing all the same material.

Test case 2: 10000 Sprites are visible on screen. They are rendered in 1000 batches

Test case 3: 10000 Sprites are visible on screen. Every 100th Sprite, there is a CustomCommand (meaning pipeline flush)

Test case 4: 10000 Sprites using ArbitraryVertexCommand directly (meaning no conversion necessary) are visible on screen. They share the same material.

As you can see from the results, my renderer is pretty much on par with the cocos2dx renderer performance wise now.

Cheers,
Darinex
[1]: https://github.com/darinex/cocos2dx-AdvancedRenderer

@ricardo you should take a look.

@Darinex man… great job! Thanks. I’ll take a look these days.

I added this ticket as a reminder to me: https://github.com/cocos2d/cocos2d-x/issues/14913

Thanks man, nice to hear. Also, I just committed some bug fixes. Make sure you use the newest version :smile:.

@ricardo Sorry for maybe interrupting you again :smile:, but I just committed another patch that fixes several bugs/issues.

Edit: Has anybody used my renderer in their projects yet? If yes, I would appreciate it if you could give me some feedback on how it performs, if there are any issues etc.

Cheers,
Darinex

Thanks @darinex. I’ll take a look at it after finishing what I’m doing (some rich text support)… so in one or two days I’ll play with it.

Hey again.

I integrated my renderer into one of my existing projects and tested the ArbitraryVertexCommand. I have a scene with lots of vertex data (200 kilobytes), and noticed that performance decreased significantly on my Motorola Moto G android smartphone. I was able to track down the source of the problem: it was the vertex buffer updating part of my renderer class.

So I did some research and testing, and it seems like the glBufferData function (at least on android) doesn’t like data that is too large.

Note 1: At first I thought that issue #1 that I mentioned at the top of this topic could only happen if glBufferData is called on a buffer two or more times in one frame. But I just read that the gpu sometimes doesn’t flush its internal render queue for multiple frames. I don’t know if this is the case on android, as I thought eglSwapBuffers flushes the render queue and halts until every task on the gpu is finished just like glFinish, but I assume for now that this doesn’t happen.

Note 2: The following seems to only happen on android (and perhaps all other mobile devices?)

Here is what I did:

// init logic
byte data[200000];
GLuint vbo;
glGenBuffers(1, &vbo);
...
// draw logic
beginTimeTracking();
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, 200000, &data[0], GL_DYNAMIC_DRAW);
endTimeTrackingAndLog();

glDrawArrays(...);

The endTimeTrackingAndLog function logs a value between 5ms and 15ms, with 13ms being the average.
My first thought was that this was caused by what I mentioned in Note 1, so I commented the glDrawArrays(...) line out, which didn’t change anything. My second thought was that this could be due to the speed limitation of the cpu-to-gpu transfer, so I did another test:

// init logic
byte data[200000];
GLuint vbos[100];
for (int i = 0; i < 100; i++) { 
     glGenBuffers(1, &vbos[i]);
}
...
beginTimeTracking();
for (int i = 0; i < 100; i++) {
     glBindBuffer(GL_ARRAY_BUFFER, vbos[i]);
     glBufferData(GL_ARRAY_BUFFER, 2000, &data[i * 2000], GL_DYNAMIC_DRAW);
}
endTimeTrackingAndLog();
...
for (int i = 0; i < 100; i++) {
     glDrawArrays(...);
}

The endTimeTrackingAndLog function logs a value of less than 1ms. In this case I submit 2000 bytes 100 times, which equals 200000 bytes per frame, the same as in the first test. I tried this with 4 buffers of size 50000, and the logged value is still below 1ms. With 2 buffers of size 100000, the logged values are also below 1ms. I continued testing and noticed that once the data size submitted via glBufferData exceeds ~130000 bytes (the value may differ on other devices), performance decreases significantly. This happens even if the data pointer passed to glBufferData is a nullptr. My guess is that allocating a very big (140kb) chunk of memory on the gpu is pretty heavy.

So what are possible solutions?

###1.) Use GL_OES_map_buffer extension
This one would be to call glBufferData once at the start up of the game and then update the buffer via glMapBuffer/glUnmapBuffer. If this doesn’t have the same problem as glBufferData (which I don’t know), it would be a good solution. However, not all devices support the GL_OES_map_buffer extension, which makes it a non-universal solution.

###2.) Split the data at N bytes into several VBOs
There are two approaches on how to achieve this. They differ in code complexity and maintainability. They both use glBufferData for updating.

Approach 1:

Do the splitting while generating the batching data. This would increase the code complexity of the batching algorithm and make the code harder to maintain. A pro would be that you have more control over when you start using a new VBO.

Approach 2:

Do the splitting after generating the batching data. This would be based on the VertexBatches. You then would attach an IBO and VBO value to them, indicating which VBO/IBO they should use when drawing. A negative thing about this approach would be that if there were 1000 ArbitraryVertexCommands - using a total of N bytes or more of vertex or index data - batched together into one, there couldn’t be any splits made because the algorithm would only split after or before an ArbitraryVertexCommand, and the above described problem would happen again. This could be avoided by either automatically or manually inserting some kind of DummyCommand that would break batching after portions of big ArbitraryVertexCommands so splitting can be done properly.

Both approaches would have problems with the case where you submit a single ArbitraryVertexCommand which uses more than N bytes of index or vertex data. A solution would be to just not accept any ArbitraryVertexCommands which use more than N bytes of vertex or index data.
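The splitting step of approach 2 can be sketched as a greedy pass over the command sizes. `assignBufferSlots` is a hypothetical helper; `maxBytes` plays the role of N, and the ~130000 byte figure from my tests is device-specific, so it is passed in rather than hard-coded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assigns each command to a vbo slot so that no slot holds more than maxBytes.
// Splits happen only between commands. Returns one slot index per command.
std::vector<std::size_t> assignBufferSlots(
        const std::vector<std::size_t>& commandBytes,
        std::size_t maxBytes) {
    std::vector<std::size_t> slots;
    slots.reserve(commandBytes.size());
    std::size_t slot = 0;
    std::size_t used = 0;
    for (std::size_t bytes : commandBytes) {
        // A single command above the limit cannot be split, so it is rejected.
        assert(bytes <= maxBytes && "command exceeds the per-buffer limit");
        if (used + bytes > maxBytes) {
            ++slot;    // current vbo is full, continue in the next one
            used = 0;
        }
        used += bytes;
        slots.push_back(slot);
    }
    return slots;
}
```

For command sizes of 50000, 60000, 40000, 100000 and 20000 bytes and a limit of 130000, this puts the first two commands into slot 0, the third into slot 1 and the last two into slot 2. Commands that land in different slots can no longer be batched into one draw call, which is the cost of this approach.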

Again this whole problem only occurs if you have more than ~130kb of vertex or index data and - as said above - only seems to occur on mobile devices.

Since cocos2d-x is mainly made for mobile games, it would be good if I could fix that issue. I’m currently preferring solution 1.) with approach 2.

Edit: I meant solution 2.) with approach 2.

What do you - as a cocos2d-x dev and code maintainer - say @ricardo?

I finished what I was doing with Rich Text label… I’ll start playing with your PR this Monday… where I’ll be able to read in detail your PR and understand better your two approaches. Thanks!

Oh, sorry for bugging you then :stuck_out_tongue:.

PS: I’m probably going to use a hybrid model of approach 1 and 2.

I fixed the android performance problem. As I said, I used a kind of hybrid model between the two approaches I mentioned above. My scene that took about 14ms now takes around 7ms, which is pretty good. Again, the new version is up on GitHub.

However, I noticed that my Renderer implementation has a main issue: the code is pretty complex, especially the batching/conversion code and the buffer splitting code. This also means that it is hard to maintain and extend the Renderer. Additionally, implementing other graphics backends (gles 3.0, Metal, Vulkan, DirectX, etc.) would be difficult, given the current structure of the Renderer. I already have ideas on how to solve these issues and make the workflow of the Renderer clearer and better. A problem is that I don’t have that much time on my hands right now, so it might take some time till this kind of ‘revamp’ is done :smile: .

PS: @ricardo If you don’t understand some parts of my Renderer, feel free to ask me ;).

Cheers,
Darinex

yes, ideally we should not have custom commands. And users should not need them…
but the truth is that users need them because our API is not that rich, and we haven’t converted all our nodes to TrianglesCommand.

good point. We should add a do_not_batch flag… and use the MV matrix transform, and not the global one.

IIRC, the MeshCommand supports more formats… any kind of format.

Yes, our shader code is somewhat limited… I think we improved this a bit with the Material system.
It supports DEFINES, so a single shader can be customized (compiled) for different uses.

In any case, I would like to solve this bug also with a “defines” API or something like that.

yep.

Yes, the GLStateCache is somewhat deprecated… instead we should use the new material system which has a better state handling… not super complete, but much better than GLStateCache. Take a look at RenderState:

When I added the Material System (including RenderState), I only added it for 3d… I also added it in Node, so that 2d sprites could use the new system, but it was adding extra complexity and some backwards incompatibilities, so in the end I decided to remove it from Node… but I would re-add it if there is a need for it.

For sure, it will make our API more powerful.

… I’m taking a look at your code ATM…

That’s why I implemented a vertex/index data gathering stage. Users can use CustomCommands and flush the pipe without needing to rewrite the buffer, because it has been written to before the actual command processing.

That’s true. The only problem is that they can’t be batched together in one draw call. That’s one of the main ideas of ArbitraryVertexCommand, since they can be batched together in one draw call if they share the same material.

You’re right, it looks a lot better. Didn’t see it before, my bad :smile:.

Sounds like a v4.0 thing to me.

Currently I’m working on an improved cocos2dx-AdvancedRenderer that supports static vertex/index data, a cleaner batching logic and some commands for drawing non-indexed vertex data.

Cheers,
Darinex