Android Performance Issues - Vertex Array Object (VAO)

framusrock · November 16, 2016, 5:45pm

I did a lot of profiling over the last few days, since I had vastly worse performance on Android than on iOS.

I ended up in the drawBatchedTriangles function of the CCRenderer which took a lot of time on my Nexus 7 (2012).

There’s an if-condition that checks whether a device supports VAO or not. I found out that the first branch of the condition is very fast and uses a VAO, while the second one is very slow and uses Client Side Arrays instead of a VAO.
According to my tests, the second, non-VAO branch is 10-20 times slower.

This call would regularly take 5-10 ms alone with a filledVertex count of 100-150 and around 25 Sprites on the screen:

    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex , _verts, GL_DYNAMIC_DRAW);
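
For anyone who wants to reproduce a number like this, one way is to wrap the call in a small wall-clock timer. This is only a minimal sketch: `measureMilliseconds` is a hypothetical helper, not part of Cocos2d-x, and it measures only the CPU time spent inside the call.

```cpp
#include <chrono>

// Hypothetical helper (not part of Cocos2d-x): runs `work` once and returns
// the elapsed wall-clock time in milliseconds.
template <typename Work>
double measureMilliseconds(Work&& work)
{
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    work();
    const auto stop = Clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Usage would look like `double ms = measureMilliseconds([&] { /* the glBufferData call */ });`. Since GL commands can be queued by the driver, a CPU-side timing like this shows how long the call blocks the calling thread, not how long the GPU works on it.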

My Nexus 7 does not support VAO according to the GLExtensions test, so apparently users of this device can’t do much to change this.

But what really confused me is that VAOs are disabled by a define (CC_TEXTURE_ATLAS_USE_VAO) for ALL Android devices.

I’ve got two questions:

1. Why are VAOs disabled by default for all Android devices? Are there any reported problems with it?
2. Is there anything one can do to improve the performance on older devices that do not support VAOs?

nite · November 16, 2016, 6:52pm

Probably worth a look at @dabingnn @ricardo

ricardo · November 17, 2016, 1:55pm

Hmm… @zhangxm any idea?

Our documentation says:

Some android devices cannot support VAO very well, so we disable it by default for android platform.

But perhaps we can enable it by default on Android and disable it on certain phones.
Or do the opposite: disable it by default but enable it on certain Android phones where we know it works OK. For example, we should try it on the most popular ones: Samsung, Xiaomi, Nexus.

zhangxm · November 18, 2016, 1:37am

@ricardo We can do it like this, but as you know there are so many Android brands, and even a single brand (such as Samsung) has many models, so it is hard for us to do it this way.

I am curious whether there is a way to check VAO support at runtime.

About question 2, I have no idea currently. But I will do more research.

zhangxm · November 18, 2016, 1:47am

Yep, we already check VAO support at runtime. We added the macro just because some devices don’t follow the spec very well: the runtime check says the device supports VAO, but in fact it doesn’t. I don’t remember which devices have the problem. Maybe we can enable it by default and add exceptions based on developers’ reports?
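
A runtime check of this kind usually comes down to looking for the extension name in the string returned by glGetString(GL_EXTENSIONS). Here is a minimal sketch, assuming the OpenGL ES 2.0 extension name GL_OES_vertex_array_object and a hypothetical helper `hasGLExtension`. It is not the engine's actual implementation.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `name` appears as a whole
// space-separated token in `extensions`, which is the format of the
// string returned by glGetString(GL_EXTENSIONS).
bool hasGLExtension(const std::string& extensions, const std::string& name)
{
    std::istringstream stream(extensions);
    std::string token;
    while (stream >> token)
    {
        if (token == name)
        {
            return true;
        }
    }
    return false;
}
```

A check like this only reports what the driver advertises, which is why a device that does not follow the spec can still pass it.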

zhangxm · November 18, 2016, 1:53am

@framusrock what’s your engine version? I hope this issue can help you.

framusrock · November 18, 2016, 2:02am

@zhangxm If some devices claim they support VAO but in fact don’t, are there any other things we can test to make sure they support it? Or can we try some specific call and catch the exception instead of failing?
VAO brings so much performance (apparently).
It’d be really cool to have a bulletproof solution that uses VAO where it’s available and falls back to non-VAO on devices that don’t have it.

I’m using engine version 3.13.1. I’ve tried/profiled many things in this issue but nothing really helped so far…
Can you point me to something specific that should help? I’m happy to profile anything that might work.
As far as I’ve understood it, the things in this issue are primarily for older Cocos2d-x versions?

Why is this VAO actually so slow? 5-10 ms with barely anything to move to the GPU is very high. Is it due to a sync that the GPU and CPU have to perform, with one waiting for the other?
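
To put "barely anything" into numbers, here is a rough sketch of the upload size. It assumes each vertex is a V3F_C4B_T2F made of three floats, four color bytes and two floats with no padding. `VertexPosColorTex` is a hypothetical mirror of that layout, not the engine's type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the assumed V3F_C4B_T2F layout:
// 3 floats for position, 4 bytes for color, 2 floats for texture coordinates.
struct VertexPosColorTex
{
    float position[3];
    std::uint8_t color[4];
    float texCoord[2];
};

static_assert(sizeof(VertexPosColorTex) == 24,
              "expected a tightly packed 24-byte vertex");

// Number of bytes passed to glBufferData for the given vertex count.
constexpr std::size_t vertexUploadBytes(std::size_t filledVertexCount)
{
    return sizeof(VertexPosColorTex) * filledVertexCount;
}
```

Under these assumptions, 100-150 filled vertices come to 2,400-3,600 bytes per upload, so the raw amount of data alone is unlikely to explain several milliseconds.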

zhangxm · November 18, 2016, 5:49am

@framusrock AFAIK there is no way to do that.

v3.13.1 already includes the patch, so it is not the issue.

What did you mean? Did you mean VAO is slow?

framusrock · November 18, 2016, 12:09pm

@zhangxm That’s really sad to hear. Do you/we know any specific phone that had this issue? I think we should test this phone again with the current version of Cocos2d-x and find out if this issue persists.

> zhangxm: What did you mean? Did you mean VAO is slow?

Sorry, no, I meant: why is it so slow without VAO? I was a little confused yesterday.
So again, to clarify: why is it so slow without VAO (hence with Client-Side Vertex Arrays)? The difference I measured is 5-10 ms per frame.

ricardo · November 19, 2016, 5:59am

I guess it varies from one driver implementation to another. But client-side arrays are slow because they need to copy all the memory from user memory to GPU memory.
An alternative to client-side arrays is VBOs.

framusrock · November 20, 2016, 12:37am

@ricardo Just to be clear about what we’re talking about, let me paste the piece of code in question here. It’s from CCRenderer.cpp, drawBatchedTriangles:

    // Client Side Arrays
    #define kQuadSize sizeof(_verts[0])
    glBindBuffer(GL_ARRAY_BUFFER, _buffersVBO[0]);

    // this call is very slow on my Nexus 7
    glBufferData(GL_ARRAY_BUFFER, sizeof(_verts[0]) * _filledVertex , _verts, GL_DYNAMIC_DRAW);

    GL::enableVertexAttribs(GL::VERTEX_ATTRIB_FLAG_POS_COLOR_TEX);

    // vertices
    glVertexAttribPointer( GLProgram::VERTEX_ATTRIB_POSITION, 3, GL_FLOAT, GL_FALSE, kQuadSize, (GLvoid*) offsetof(V3F_C4B_T2F, vertices));

    // colors
    glVertexAttribPointer(GLProgram::VERTEX_ATTRIB_COLOR, 4, GL_UNSIGNED_BYTE, GL_TRUE, kQuadSize, (GLvoid*) offsetof(V3F_C4B_T2F, colors));

    // tex coords
    glVertexAttribPointer(GLProgram::VERTEX_ATTRIB_TEX_COORD, 2, GL_FLOAT, GL_FALSE, kQuadSize, (GLvoid*) offsetof(V3F_C4B_T2F, texCoords));

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, _buffersVBO[1]);
   
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(_indices[0]) * _filledIndex, _indices, GL_STATIC_DRAW);

Unfortunately, I’m not quite the OpenGL genius, so I might not fully understand what is happening here / how to improve it.
The comment says that Client-Side arrays are being used here, though afterwards a VBO is being bound. Which of the two techniques is actually being used here?
How could one change it to improve it?

zhangxm · November 21, 2016, 1:25am

Sorry, I don’t remember. The issue has nothing to do with the engine, so it will still exist if the hardware or OpenGL ES driver doesn’t support VAO.

framusrock · November 21, 2016, 10:54am

@zhangxm Yes, I understand. I’d just love to do my own profiling on one of these devices and do some testing. Do you remember what happens on these devices that claim that they have VAO, but in fact don’t have it? Is the result no rendering at all?

If we can find a solution together, to enable VAO for the working devices or to disable it for the non-working ones as @ricardo suggested, Cocos2d-x might get a performance boost on Android. At the moment Android is being artificially slowed down on many devices.

zhangxm · November 22, 2016, 1:25am

Yep. So I think it is OK to enable it by default and add exceptions for devices that have the issue.
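
The "enable by default, add exceptions" idea could be sketched roughly like this. Everything here is hypothetical: `shouldUseVAO`, the way the device model is obtained, and the contents of the exception list are assumptions for illustration, since no affected device has been identified in this thread.

```cpp
#include <set>
#include <string>

// Hypothetical policy: use VAOs whenever the driver advertises them, unless
// the device model is on a list of known-bad devices collected from reports.
bool shouldUseVAO(bool driverAdvertisesVAO,
                  const std::string& deviceModel,
                  const std::set<std::string>& knownBadModels)
{
    if (!driverAdvertisesVAO)
    {
        return false;
    }
    return knownBadModels.count(deviceModel) == 0;
}
```

The list would start empty and grow as developers report devices that advertise VAO support but render incorrectly or crash.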

ricardo · November 22, 2016, 7:10am

Yep. It’s using VBOs (not client-side arrays).

framusrock · November 22, 2016, 11:45am

Okay, so the comment in the code confused me into thinking it is client-side arrays.
Anyway, since it’s already VBO, client-side arrays would be even slower, I guess.

Okay, awesome. How do we find out which devices have this issue?
What happens if we forget one device? Will the app not render at all then?

zhangxm · November 23, 2016, 1:38am

I don’t remember the result. Maybe a wrong effect or a crash. So we should find out after reproducing the issue. There is no way to find these devices in advance; because there are so many devices, we can only find them when developers report them.

I think I will ask some companies to check whether this issue is common. Maybe ask the ARM, Samsung and Huawei guys. We have some contacts with them.

framusrock · November 23, 2016, 1:53am

@zhangxm

Okay, thank you. ARM, Samsung and Huawei should definitely be able to give some hints on this issue. Awesome that you have a contact there! I’m looking forward to the results.

ricardo · November 23, 2016, 7:10am

So, right now I don’t know where the bottleneck is. VBOs shouldn’t be that slow.
Perhaps it is the glBufferData() parameters that we are using.

I think it makes sense to split the switch like this:

Proposed fix:

    if (conf->supportsShareableVAO())
    {
        bindVAO();
        if (conf->supportsMapBuffer()) {
            useMapBuffers();
        } else {
            useBufferData();
        }
    }
    else /* VBO */
    {
        bindVBO();
        if (conf->supportsMapBuffer()) {
            useMapBuffers();
        } else {
            useBufferData();
        }
    }

Current code:

    if (conf->supportsShareableVAO() && conf->supportsMapBuffer())
    {
        bindVAO();
        useMapBuffers();
    }
    else /* VBO */
    {
        bindVBO();
        useBufferData();
    }

thoughts?
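
To make the difference between the two conditions concrete, here is a small self-contained sketch that models only the decision and makes no GL calls. `GLCapabilities`, `DrawPath`, `currentPath` and `proposedPath` are hypothetical names for illustration; the two booleans stand in for supportsShareableVAO() and supportsMapBuffer().

```cpp
// Hypothetical stand-in for the two capability queries on the configuration.
struct GLCapabilities
{
    bool shareableVAO;
    bool mapBuffer;
};

enum class VertexState { VAO, PlainVBO };
enum class UploadMethod { MapBuffer, BufferData };

// Which binding and which upload method a frame would end up using.
struct DrawPath
{
    VertexState state;
    UploadMethod upload;
};

// Current condition: both features must be present, otherwise the code
// falls back to a plain VBO filled with glBufferData.
DrawPath currentPath(const GLCapabilities& caps)
{
    if (caps.shareableVAO && caps.mapBuffer)
    {
        return {VertexState::VAO, UploadMethod::MapBuffer};
    }
    return {VertexState::PlainVBO, UploadMethod::BufferData};
}

// Proposed split: the binding and the upload method are chosen independently.
DrawPath proposedPath(const GLCapabilities& caps)
{
    const VertexState state =
        caps.shareableVAO ? VertexState::VAO : VertexState::PlainVBO;
    const UploadMethod upload =
        caps.mapBuffer ? UploadMethod::MapBuffer : UploadMethod::BufferData;
    return {state, upload};
}
```

For a device described by `{false, true}` (no VAO, but buffer mapping available), `currentPath` ends up on glBufferData while `proposedPath` keeps the mapped upload. Whether that is actually faster on a given device still has to be profiled.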

Alanmars · November 23, 2016, 7:38am

Use the proposed fix and profile it, then we’ll know whether it works or not.