Android Performance Issues - Vertex Array Object (VAO)

ricardo · November 23, 2016, 5:17pm

yes, absolutely, at least in theory.
The thing is that trying to work around performance issues on Android is expensive from a testing point of view. So before starting the expensive task of testing this on all possible Android devices, I was asking whether it makes sense from the logic of the source code.

zhangxm · November 24, 2016, 6:11am

Feedback from ARM: Mali GPUs should not have the issue where

Configuration::supportsShareableVAO() returns true, but the device doesn't actually support VAO.

They asked me to report the issue to them if I run into it on a Mali GPU. They will fix it.

@ricardo the code uses "vertex_array_object" to check for VAO support. Why not use "GL_OES_vertex_array_object"? I am not sure whether that is what causes the issue.

UPDATE:
You can refer to this URL for GL_OES_vertex_array_object.

ricardo · November 24, 2016, 6:36am

yes!
probably on GL ES we should check that extension and on GL we should check the current one

zhangxm · November 24, 2016, 6:45am

vertex_array_object
- true on Mac, iPhone 6 and Nexus 5X
GL_OES_vertex_array_object
- true on Nexus 5X and iPhone 6
- false on Mac

Yep, i think so. And should check other configurations too.
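
One possible explanation for the results above is a plain substring search: "vertex_array_object" is contained in "GL_OES_vertex_array_object" as well as in the desktop variants, so a substring test would report true everywhere. The following is only a minimal sketch of a whole-token check, assuming the extension list is the usual space-separated string from glGetString(GL_EXTENSIONS). `hasExtension` and `supportsVaoExtension` are hypothetical helpers for illustration, not cocos2d-x API, and the desktop policy (ARB or APPLE variant) is an assumption.

```cpp
#include <sstream>
#include <string>

// Returns true only if `name` appears as a complete, space-separated token
// in `extensions` (no substring matches).
bool hasExtension(const std::string& extensions, const std::string& name) {
    std::istringstream stream(extensions);
    std::string token;
    while (stream >> token) {
        if (token == name) {
            return true;
        }
    }
    return false;
}

// Hypothetical policy: on GL ES look for the OES extension, on desktop GL
// look for the ARB or APPLE variant.
bool supportsVaoExtension(const std::string& extensions, bool isGLES) {
    if (isGLES) {
        return hasExtension(extensions, "GL_OES_vertex_array_object");
    }
    return hasExtension(extensions, "GL_ARB_vertex_array_object") ||
           hasExtension(extensions, "GL_APPLE_vertex_array_object");
}
```

With this kind of check, the bare name "vertex_array_object" no longer matches a list that only contains the prefixed extension names.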

zhangxm · November 24, 2016, 7:40am

I sent a PR for this issue.

framusrock · November 24, 2016, 11:33am

@zhangxm

Cool find, maybe with the correct extension check that problem could be solved, since on some devices the other check doesn’t work properly. But we should be really sure, so that we don’t accidentally cause non-functioning games on hundreds of devices.

What you say makes a lot of sense and should be handled like this to distinguish between devices that support MapBuffer and/or VAO. Though it doesn’t improve the performance if the device supports MapBuffer, but not VAO, correct?

My Nexus 7 is exactly in this category: MapBuffer: Yes; VAO: No

This call is the one that takes 5-10 ms, even with only around 100 Vertices and around 20 sprites:

    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex , _verts, GL_DYNAMIC_DRAW);

So for this case the proposed fix wouldn’t change anything, right? Please correct me, if I’m wrong. I’m still eager to profile any ideas.

About the parameters, instead of GL_DYNAMIC_DRAW, I also tried GL_STATIC_DRAW as I found out that it’s maybe faster on some devices. But it’s exactly the same result for me - very slow.

ricardo · November 29, 2016, 1:27am

well… it depends on the driver implementation. it might improve the performance in some devices.
although, technically, “orphaning” (that technique) requires certain conditions to actually perform faster… IIRC the buffers must be of the same size… but not 100% sure

framusrock · November 29, 2016, 1:44am

@ricardo What I actually meant is that you suggest to split the one check into two. But on my specific device that wouldn’t change anything, since the code flow would stay the same. At least that’s how I understood your change.
My device supports MapBuffers, but not VAO - in this case your change would not make it faster, right?
I’m really hoping that there’s a thing to improve the performance on my device, but I kind of doubt it

Though on other devices, your change makes sense and should probably be implemented like this.

Also what’s the plan with the VAO settings now? Did anybody have the chance to test / further explore the findings of @zhangxm yet?

Darinex · November 29, 2016, 5:26pm

Hey, maybe a little late to the party but let’s try to clear some things up.

Using client side arrays means that your mesh/vertex memory is located in the default RAM, meaning that the GPU must copy the data to the GPU RAM first, which is typically slower. I’m not quite sure how this behaves on mobile devices though, as the GPU and CPU share the same RAM here.

A VBO (Vertex Buffer Object) is essentially a piece of memory located on the GPU RAM. You can upload data to it via glBufferData/glBufferSubData and also give the driver some hints on how frequently the memory is accessed.

A VAO (Vertex Array Object) is an object that holds information about the vertex layout and VBO/IBO bindings. Without using a VAO, a draw call will typically start with some glBindBuffer and glVertexAttribPointer calls to define the vertex layout. Every time the vertex layout is changed via those two functions (or at the following glDrawXXX call, respectively), some computations have to be done, which can be quite costly when done often. When using VAOs, these computations are done when the VAO is created or changed, which can save a lot of computation time if you reuse the VAO a lot.

But back to the topic ;).

I think that the reason that the VAO-path is so much faster is that it is using buffer orphaning.

If you want to update a VBO that is currently in use by a previous draw call (the GPU works asynchronously to the CPU, so previously emitted draw calls are not guaranteed to be done yet), the CPU needs to stall and wait for the usage to be over. Only then can it write the new content to the VBO. The wait time can become quite long, especially when the draw calls in question were quite heavy.

To work around that you can use buffer orphaning which is, to put it simply, some driver magic that lets you use a “new” chunk of memory for the new contents. Buffer orphaning is done via glBufferData(GL_XXX_BUFFER, size, nullptr, usage) where nullptr is the important part. For buffer orphaning to be perfect, the given size and usage must match the previous values (as noted by @ricardo). Although the driver, depending on its implementation, may be able to do some magic even if these conditions are not fulfilled.

Some solutions to the problem would be:

1.) Use Client Side Arrays. They may be a little bit slower in general, but work around the VBO already-in-use stall problem (Which in my opinion is the culprit here) pretty nicely.
2.) Use buffer orphaning. The code would look something like this:

    ...
    glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_DYNAMIC_DRAW);
    glBufferData(GL_ARRAY_BUFFER, size, ptr_to_actual_data, GL_DYNAMIC_DRAW);
    ...

Note that the driver decides if orphaning is done or not, so if the implementation sucks, you are pretty much back where you started.

3.) Use round-robin buffering. This one may be the most complex (to code), but gets the job done as well as the others (and maybe even better), at the cost of more memory usage.
The idea is to just create multiple VBOs and cycle through them. Something like the following:

    ...
    glBindBuffer(GL_ARRAY_BUFFER, vbos[vboIndex]);
    glBufferData(GL_ARRAY_BUFFER, size, ptr_to_data, GL_STREAM_DRAW);

    vboIndex = (vboIndex + 1) % vboCount;
    ...

Depending on the number of VBOs you should be able to work around the VBO already-in-use problem.

4.) Use round-robin buffering with orphaning. Pretty self-explanatory
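
The index bookkeeping for option 3 can be kept in one small class. This is only a sketch under assumptions: `BufferRing` is a hypothetical name, and the handles are plain integers standing in for the buffer names that glGenBuffers would return, so the example compiles without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Cycles through a fixed set of buffer handles, one per upload.
// In real code the handles would come from glGenBuffers.
class BufferRing {
public:
    explicit BufferRing(std::vector<unsigned int> handles)
        : handles_(std::move(handles)) {
        assert(!handles_.empty());  // a ring needs at least one buffer
    }

    // Returns the buffer to fill for this upload, then advances the ring.
    unsigned int acquire() {
        const unsigned int handle = handles_[index_];
        index_ = (index_ + 1) % handles_.size();
        return handle;
    }

private:
    std::vector<unsigned int> handles_;
    std::size_t index_ = 0;
};
```

The renderer would then bind `ring.acquire()` to GL_ARRAY_BUFFER before each glBufferData call. How many buffers are needed to avoid the already-in-use stall depends on how far the GPU lags behind the CPU, which the thread does not establish.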

framusrock · November 30, 2016, 12:01pm

@Darinex Thanks a lot for this explanation. It makes a lot of sense and I wasn’t aware of the nuances that you perfectly lay out here.

I just tried 2 - Buffer Orphaning on my Nexus 7 like this, but it didn’t help - probably due to a bad driver, as you explained:

    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex , nullptr, GL_DYNAMIC_DRAW);
    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex , _verts, GL_DYNAMIC_DRAW);

instead of just:

    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex , _verts, GL_DYNAMIC_DRAW);

The buffer orphaning method was not slower though, so at least it didn’t do any harm.

Would it make sense to artificially increase the VBO size with zero-vertices to have the same count of vertices all the time and be able to use buffer orphaning as it’s supposed to be used?
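
One way to keep the orphaning size constant without padding with zero vertices would be to always allocate a fixed capacity and then upload only the used part with glBufferSubData. The sketch below illustrates that call order under assumptions: `BufferCalls` and `uploadWithFixedCapacity` are hypothetical names, and the GL entry points are injected as callbacks so the example compiles without a GL context. Whether this actually helps on a given driver is not something this thread has verified.

```cpp
#include <cstddef>
#include <functional>

// The two GL entry points are injected so the sketch compiles without a GL
// context. In real code they would wrap glBufferData and glBufferSubData
// for the currently bound GL_ARRAY_BUFFER.
struct BufferCalls {
    std::function<void(std::size_t size, const void* data)> bufferData;
    std::function<void(std::size_t offset, std::size_t size, const void* data)>
        bufferSubData;
};

// Uploads `usedBytes` of vertex data into a buffer whose storage size is
// always `capacityBytes`, so every orphaning call passes the same size.
// Returns false (and issues no calls) if the data does not fit.
bool uploadWithFixedCapacity(const BufferCalls& gl, const void* data,
                             std::size_t usedBytes, std::size_t capacityBytes) {
    if (usedBytes > capacityBytes) {
        return false;
    }
    gl.bufferData(capacityBytes, nullptr);  // orphan: same size every frame
    if (usedBytes > 0) {
        gl.bufferSubData(0, usedBytes, data);  // fill only the used part
    }
    return true;
}
```

The draw call would still use `_filledVertex` / `_filledIndex` for its counts, so the unused tail of the buffer is never read.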

Maybe 3 - Round-robin buffering could fix this issue. As far as I understood this might also improve the performance of the VAO - Rendering path as it is supposed to be faster in general. The additional memory shouldn’t really be a concern, since Vertex Count is generally low in 2D. Unfortunately I don’t really know how to implement it perfectly.

This topic is getting more and more interesting as it seems to turn out that it’s most likely not the VAO, but the missing buffer orphaning that causes these issues. Though in both cases it’s the GPU driver of old devices being unable to do certain things.

In any case, I strongly believe that the core rendering code needs a serious overhaul, as there are many suboptimal things / wrong comments / bits of chaos in there. It feels like this is rather low-hanging fruit and will improve the life of any Cocos2d-x user.

framusrock · December 6, 2016, 1:16am

@ricardo @zhangxm I’d love to revive this thread and ask again if you guys have an idea how to ideally fix this issue in an upcoming Cocos2d-x version or if you have any other ideas that I could try.

What do you guys think about @Darinex’s solutions?

shivmsit · December 6, 2016, 6:34pm

Can someone please provide sample code where I can see the performance degradation? I want to look into the issue. I am fairly new to cocos2dx but experienced in OpenGL. I have seen that the Chromium project has a lot of GPU-specific optimizations and implements lots of workarounds for driver bugs. We can learn something from there and apply it to improve cocos2dx rendering performance.

zhangxm · December 7, 2016, 2:04am

@shivmsit the code is here. It uses VAO but has to update the VBO, as its content changes every frame.

@Darinex good to know about the buffer orphaning optimization. But the code already uses buffer orphaning:

    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex, nullptr, GL_STATIC_DRAW);
    void *buf = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    memcpy(buf, _verts, sizeof(_verts[0]) * _filledVertex);
    glUnmapBuffer(GL_ARRAY_BUFFER);

@framusrock you can just create some VBOs in advance and pick one at a time, as @Darinex suggests. I want to try this method, but first I need an example that reproduces the issue. I don’t have a Nexus 7; is there any other device that has the issue? And how did you measure it? Just by timing the function call?
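
A side note on the orphan-then-map sequence quoted in this post: glMapBuffer can return NULL on failure, and glUnmapBuffer returns GL_FALSE if the buffer contents were lost while mapped, so a defensive version would check both and fall back to a plain glBufferData upload. The following is only a sketch of that control flow. `MapCalls` and `uploadViaMap` are hypothetical names, and the GL calls are injected as callbacks so the example compiles without a GL context.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>

// Injected stand-ins for glBufferData, glMapBuffer and glUnmapBuffer on the
// currently bound GL_ARRAY_BUFFER (assumption: real code would wrap those).
struct MapCalls {
    std::function<void(std::size_t size, const void* data)> bufferData;
    std::function<void*()> mapBuffer;
    std::function<bool()> unmapBuffer;
};

// Orphans the buffer, then copies `bytes` of data through a mapping.
// Returns true if the mapped path succeeded, false if it had to fall back
// to a direct bufferData upload.
bool uploadViaMap(const MapCalls& gl, const void* data, std::size_t bytes) {
    gl.bufferData(bytes, nullptr);  // orphan the old storage
    void* dst = gl.mapBuffer();
    if (dst == nullptr) {
        gl.bufferData(bytes, data);  // mapping failed: upload directly
        return false;
    }
    std::memcpy(dst, data, bytes);
    if (!gl.unmapBuffer()) {
        gl.bufferData(bytes, data);  // contents were lost: upload again
        return false;
    }
    return true;
}
```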

framusrock · December 7, 2016, 2:18pm

@zhangxm Okay, thanks for trying to solve it. At first I tried to find some kind of Android CPU profiling tools (like the ones in Xcode), but since I wasn’t able to find anything useful I just used the CCProfiler that comes with cocos2d-x and slowly narrowed down where all the time is lost. After countless iterations I found out that it’s the single call that I highlighted.

This is a short sample from my profiling code:

    CC_PROFILER_START_CATEGORY("Tri2_2", "Tri2_2");

    // Client Side Arrays - This is NOT using client side arrays - it's VBO
#define kQuadSize sizeof(_verts[0])
    glBindBuffer(GL_ARRAY_BUFFER, _buffersVBO[0]);

    CC_PROFILER_START_CATEGORY("Tri222", "Tri222");
    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex , _verts, GL_DYNAMIC_DRAW);
    CC_PROFILER_STOP_CATEGORY("Tri222", "Tri222");

    GL::enableVertexAttribs(GL::VERTEX_ATTRIB_FLAG_POS_COLOR_TEX);

    CC_PROFILER_START_CATEGORY("Tri223", "Tri223");
    // vertices
    glVertexAttribPointer( GLProgram::VERTEX_ATTRIB_POSITION, 3, GL_FLOAT, GL_FALSE, kQuadSize, (GLvoid*) offsetof(V3F_C4B_T2F, vertices));

    // colors
    glVertexAttribPointer(GLProgram::VERTEX_ATTRIB_COLOR, 4, GL_UNSIGNED_BYTE, GL_TRUE, kQuadSize, (GLvoid*) offsetof(V3F_C4B_T2F, colors));

    // tex coords
    glVertexAttribPointer(GLProgram::VERTEX_ATTRIB_TEX_COORD, 2, GL_FLOAT, GL_FALSE, kQuadSize, (GLvoid*) offsetof(V3F_C4B_T2F, texCoords));
    CC_PROFILER_STOP_CATEGORY("Tri223", "Tri223");

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, _buffersVBO[1]);
   
    CC_PROFILER_START_CATEGORY("Tri224", "Tri224");
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(_indices[0]) * _filledIndex, _indices, GL_STATIC_DRAW);
    CC_PROFILER_STOP_CATEGORY("Tri224", "Tri224");

    CC_PROFILER_STOP_CATEGORY("Tri2_2", "Tri2_2");

And here the “Tri222” Profiler-Mark takes like 5-10 ms per frame.
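
For anyone who wants to reproduce this kind of measurement outside of cocos2d-x, a scoped timer built on std::chrono gives comparable per-section numbers. This is a generic sketch, not how CCProfiler is implemented. `SectionStats`, `ProfileTable` and `ScopedTimer` are hypothetical names. Note that it measures CPU-side wall time of the call, which includes any time the driver spends stalling, but not the GPU's own execution time.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulated timing for one named section.
struct SectionStats {
    long long calls = 0;
    double totalMs = 0.0;
};

// Hypothetical stand-in for a profiler registry, keyed by section name.
using ProfileTable = std::map<std::string, SectionStats>;

// Times the enclosing scope and adds the result to table[name] when the
// timer goes out of scope.
class ScopedTimer {
public:
    ScopedTimer(ProfileTable& table, std::string name)
        : table_(table),
          name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        SectionStats& stats = table_[name_];
        stats.calls += 1;
        stats.totalMs +=
            std::chrono::duration<double, std::milli>(elapsed).count();
    }

private:
    ProfileTable& table_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping the glBufferData call in a block containing `ScopedTimer t(table, "Tri222");` and dividing `totalMs` by `calls` after a few hundred frames would give an average per-call cost.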

devnoob · December 7, 2016, 2:51pm

Qualcomm has some profile tools. Profiler

framusrock · December 7, 2016, 3:09pm

I have also tried the NVIDIA Tegra Profiling Tool that is specifically for my Nexus 7. But it didn’t work; it needs a specific Android ROM and very specific settings that are not documented anywhere.

framusrock · December 20, 2016, 7:38pm

Hey guys,

I just wanted to ask again if anyone has a solution for this yet / if there are any more fixes that I could try?

framusrock · January 29, 2017, 12:01am

@ricardo @zhangxm Hey guys, I hope it’s okay that I tag the two of you here again.

I just wanted to ask what the conclusion of this thread is now? Has there been any confirmation of where the issue comes from? Any ideas how to improve performance on older devices?

ricardo · January 30, 2017, 10:13pm

yes, sure. It is ok to tag.

The patch that @zhangxm sent, didn’t it work?

framusrock · January 30, 2017, 11:00pm

@ricardo Thanks. Just to be sure, which patch do you mean exactly? I see that there’s some code in his last reply, but I don’t think this is actually a patch; it looks more like a clarification that it’s already using buffer orphaning. Or did I miss something and there’s a patch somewhere that I can try?