Flaws with Renderer, GLProgram and gl state management

ricardo · February 1, 2016, 5:23pm

Not sure when. It is @ShunLin 's call. He owns the roadmap. But he commissioned me to try to improve the renderer for v3.x in a 100% backward compatible way. So I’ll try to add the changes that we discussed last week for v3.x…

ricardo · February 2, 2016, 1:25am

So, the first thing that I’m doing is to implement the “no cpu transformation” feature.
Currently I have the following dilema… how to implement it correctly, in a good way?

The idea is that this feature should be invisible for the user. The renderer should do the CPU transformation if:

the command can be batched
and it makes sense to batch it (triangles < a certain threshold).

But tricky thing is the shader. If it is going to be batched, it should use shader without “Model View”. If not, it should use a different shader.
But currently Node (and Sprite… and everything except Sprite3d) only have one shader.

So, Node should provide two shaders: one that multiplies Projection* Model View * position.
Another that multiplies Projection * position… Both shaders are exactly the same with just that change in the vertex shader.

option a)

Add Material into Node, and material should have two techniques:

one technique with the MV shader (no batch)
another technique without it (batch)

and the renderer will use the different techniques according to the situation.

Pros: Material could be a good feature for sprites as well.
Cons: kind of hack-ish… although Material Techniques are supposed to do that… this is kind of abusing it.

option b)

No need to use two shaders. Just one. Kind of unify the multiplication… instead of these two shaders:

Batchable vertex shader:

gl_Position = CC_PMatrix * a_position;

Non-Batchable vertex shader:

gl_Position = CC_MVPMatrix * a_position;

have one:

gl_Position = CC_Matrix * a_position;

And the attribute CC_Matrix is defined from the renderer… or by GLProgramState

Pros: cleaner approach. simplifies shader creation. Just one.
Cons: not sure yet…

option c)
any other option?

Darinex · February 2, 2016, 5:43pm

I like option b) more than option a). My main concern about option a) is that using the Material class on Sprites would make them alot more complex and also could slow down rendering when you have lots of sprites. Also I like the idea of having it as a invisible feature that works automatically. Two questions: Would it be possible that two commands who share the same mv matrix are batched together in your implementation? And would it be possible that you can explicitly give hints to the renderer on when to disable transforming on cpu, something like renderer->disableTransformOnCpu()?

The feature that I mentioned in the second question could be usefull in some cases. As an example: If you have a complex Spine skeleton that has 15 skinned mesh attachements with an average of 200 vertices per mesh attachement, you would create a TrianglesCommand for every attachement. These TrianglesCommand are then transformed on the cpu, which could be somewhat expensive especially on mobile devices. In case of a skeleton in spine, all attachements of a skeleton always share the same model view matrix. So if they all share the same blend func and the same texture, they technically be drawn in a single render call too without being transformed on the cpu, as they share the same mv matrix.

Currently the spine renderer does that by using CustomCommands, but it would be better if it would issue its drawing stuff through TrianglesCommands or PrimitivesCommands in the future. So in that case the abillity to hint the renderer on when to disable “transform on cpu” would be usefull .

ricardo · February 2, 2016, 8:59pm

thanks for the feedback.

regarding option a), adding material won’t make the renderer slower by default… it will make it slower if users add multiple “passes” to the material. but it will make the code more complex.

regarding batching nodes that share the MV (or parent), it is feasible… but it doesn’t seem to be easy to implement… I think the correct approach is to put them in a “batch group command” (or add more functionality to the group command"), and use the MV associated to the “batch group command” and use it as the parent MV. I’ll think more about this.

IIRC, the PrimitivesCommand was added in order to support spine. But the Spine author told us that before applying our patch, we must first release and stable version that includes the PrimitivesCommand… and then we forgot (or we were busy on something else) and we didn’t send the patch for Spine.

ricardo · February 3, 2016, 1:25am

My changes are here… not finished yet… this is WIP… I took approach b):
https://github.com/ricardoquesada/cocos2d-x/tree/uniform_matrix_auto

Some random notes:

SpritePolygon: doesn’t support islands… nasty bug… at least it should tell us… there is an island, I cannot generate the polygon correctly.
SpritePolygon: as it is right now, it is only useful for… I don’t know… it is not really useful since it doesn’t support any kind of animation. SpritePolygon has a lot of potential, but we should support animation.

UPDATE:

I can’t find a test that proves that disabling batching (no cpu point transformation) is better… I only tried with up to sprites with 2000 vertices. @Darinex, do you have any good number for this?

Darinex · February 3, 2016, 11:22am

Your implementation looks good from what I saw.

I did a test yesterday. The setup were 15 sprites with 1000 vertices each. I don’t remember the concrete numbers but I remember that on win32 batching was a little bit faster than non-batching something around 0.5ms faster. However on my Motorola Moto G android device, non-batching was around 2ms faster than the batched approach.

I think this is due to the more of cpu power on the desktop device.

Thats a really good idea, I think I’m going to use it in the cocos2dx-AdvancedRenderer. The only problem with this approach is that currently the RenderQueue is a little bit heavy for this case, as it has mutliple sub queues which aren’t needed. It would be better to have a special RenderQueue which just acts like one sub-list for render commands for the “batch group command”.

piotrros · February 3, 2016, 12:07pm

So. Will it be integrated in 3.11?
Besides, question from similar category. Do you (cocos team) work on improving DrawNode? Optimize it, add anti-aliasing etc.

ricardo · February 3, 2016, 5:22pm

I think I’m going to leave the threshold number high… like in 5.000 (or bigger)… but users can optimize the code by just doing:

class MySprite : public Sprite
{
    if (verteCount > A_CERTAIN_NUMBER)
         triangleCommand->setSkipBatching(true);
}

If you do that in TriangleCommand or a QuandCommand, then it won’t batch it and it won’t transform the points to world space.

Yes, this will be part of v3.11.
I’ll wrap up this feature today, and then I’ll start playing with the PrimitivesCommand, and try to add it in the spine runtime.

we are not working on improving it right now… but we should. IIRC it already supports anti-aliasing, and that is why it is so slow.

ricardo · February 4, 2016, 11:45pm

I kept working on this feature… I fixed some old rendered bugs, and I think I make it slightly faster… specially on old devices.

The 3.10 renderer works like this:

it converted to world coordinates all sprites… regardless whether or not the sprites could be grouped in a batch > 1.

The 3.11 renderer works like this:

it only converts to world coordinates the sprites that can be batched in groups > 1.
otherwise the sprites will keep using the MV coordinates.

Example, let’s say that we have 3 different textures different sprites: S1, S2, S3.

If we want to draw the following sequence: S1 S1, S2, S3 S3, S1, S2 S2 S2, then it will create 5 batches:

S1 S1: will be transformed to world coordinates
S2: won’t be transformed to world coordinates, since batch size == 1
S3 S3: world coordinates
S1: MV coordinates (batch size == 1)
S2 S2 S2: world coordinates

so, this little improvement won’t be perceived in high-end devices, but it will be good for old and/or low-end devices.

v3.11 renderer has other small improvements: it will call glBufferData less times, which means faster.

Darinex · February 5, 2016, 12:44pm

Could you explain how he does that?

Edit: I just looked at your source code and saw how the renderer does that. The Renderer doesn’t flush the batched triangles if a TrianglesCommand has skip batching enabled.

However issue 1 still exists. You could fix that be using a Round-Robin approach. The Round-Robin technique works as following:

You have an array with a certain size, say 20, of vbos handles. When you update a vbo you increment the vbo look up index by one and use that vbo for the next vertex/index update. If the index gets higher than 20 - 1 it is set back to 0. This way, you don’t update the same vbo/ibo multiple times and improve performance (espacially whit lots of glBufferData calls). A drawback of this approach is that you have to guess how much buffer updates you have per frames. Or you would create a define that defines how big the Round-Robin chain is, so the user can set the size himself if he needs more or less than the default chain size

ricardo · February 5, 2016, 4:06pm

We are actually using “orphaning” in our code ( https://www.opengl.org/wiki/Buffer_Object_Streaming#Buffer_re-specification ), so, in a way we are addressing that issue.
So, I don’t know if using using multiple buffers ( https://www.opengl.org/wiki/Buffer_Object_Streaming#Explicit_multiple_buffering ) will bring any benefit.

Although I noticed that we are not using orphaning correctly. We are missing this part: "and the exact same size and usage hints it had before."

I’ll give it a try.
BTW, I’m kind of frustrated… yesterday I run the “Sprites tests performance tests” and it had the same speed as v3.10 except in one test were it was a bit slower (rendering multiple sprites with no batching at all)… I know this new approach should be faster for SpritePolygon, and we don’t have official SpritePolygon tests, so I’ll create a test with it.

Darinex · February 5, 2016, 9:39pm

As I see it, you’re currently only using orphaning if a shareable vao is supported. But you’re right using multiple buffers might be unnecessary complex. However the text under Buffer Re-specification also says that, quote: “One issue with this method is that it is implementation dependent. Just because an implementation has the freedom to do something does not mean that it will.”

I think that this is nearly impossible to achieve with opengl 2.1 (2.0 es). You could use glBufferSubData. However glBufferSubData seems to be pretty slow on android (mobile in general?) devices. I think you should do something that is mentioned here:

glBindBuffer(GL_ARRAY_BUFFER, 1);
glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_STREAM_DRAW);
glBufferData(GL_ARRAY_BUFFER, size, data_ptr, GL_STREAM_DRAW);

If you want to make sure that you call it with the same size as its predecessor, you have to make sure that the data_ptr is always as big as the size that was initially used for the vbo and never change the size value. This however would be bad in the following situation: The vbo and data_ptr have a size of 100kb and you want to draw 10 triangles, this would mean that the vertex data that you actually want to submit has a size of 720 ((sizeof(V3F_C4B_T2F) = 24) * 3 * 10) bytes(!). But since you cant change the size of the vbo you would have to submit 100kb even though you only need the first 720 bytes.

I hope you understand what I mean, because I described it a little bit messy

ricardo · February 6, 2016, 12:46am

yes, thanks. I think I got it.

So, I kept doing tests, and the more vertices the sprite has, the more efficient is the new code (at least on older devices like the iPhone 4S)… Still trying to find out how many vertices a sprite should have in order to use the new code…

UPDATE: I think I know why it is making is slower… I’m changing the “auto matrix” uniform each draw call… mmm… why? I mean, there are situations where I should change it… but when all the nodes belong to the same parent, if all the nodes are non-batched, then they share the same MV matrix… strange… I’ll debug it.

UPDATE 2: Yep, the issue was updating the uniform. So it is more expensive to update the MV Matrix uniform for each sprite than transforming its MV to world coordinates and keep using the previous uniform.
I guess this varies from driver to driver, but I prefer to not innovate here unless I have solid data that it is worth adding this change. So, I will leave this feature (the AutoMatrix thing) disabled for the moment.

UPDATE 3: I’ll merge QuadCommand and TrianglesCommand internally. Whether you use QuadCommand or TrianglesCommand, the renderer will treat them as TriangleCommand. Benefits:

reduces the flushes (faster)
more chances to create batches (faster)
less code to maintain (yeah!)

The benefit of using QuadCommand (over TriangleCommands) is that you don’t need to create the indices, but I think the benefits of not using QuadCommand are greater than that.

UPDATE 4: This the squashed new branch: https://github.com/cocos2d/cocos2d-x/pull/15022

Darinex · February 8, 2016, 1:09pm

As you mentioned in your pull request, QuadCommands subclasses TrianglesCommand and creates the indices during its initialization, which means that the indices must be allocated or reallocated on every QuadCommand creation. However, you could do something that I did to avoid this problem: The QuadCommand class has a static member named s_quadIndices which is a short pointer. The Renderer creates a list of all possible quad indices during its initialization (such as he did before) and then passes the indices pointer to the QuadComman’s static member s_quadIndices. Whenever you create a QuadCommand, you just use the s_quadIndices pointer for the indices. This way you don’t have to malloc/realloc/free the for every QuadCommand and save memory and (possibly) time.

ricardo · February 8, 2016, 1:47pm

Thanks! Nice tip!

update: fixed https://github.com/cocos2d/cocos2d-x/commit/cce893a9114da5dda91a895e4e661894f365e2c7

update 2: new PR is here: https://github.com/cocos2d/cocos2d-x/pull/15027/files

Darinex · February 8, 2016, 11:48pm

You’re welcome ;).

I took a really close look at BGFX this time and noticed some issues and or flaws:

The OpenGL renderer doesn’t support dynamic gl versions at runtime, because the target GL version is defined by a #define. That means that the renderer can’t use features from higher gl version if a lower target version is defined, even if the device theoretically supports the higher gl version. E.g if you define that you want to use opengles 20 as the target gl version, but the devices context is a gles31 one, you can’t use compute shaders even if they are supported (I hope you understand what I just said because I didn’t really know how to write it down )
BGFX is handling all context/device/window creation and managing stuff. This could be a problem with if its used with cocos2d-x as cocos is currently handling all that stuff by itself, so it would probably require a big and deep refactor.
(I’m not sure about the following, because I haven’t used metal in practice yet, but only read some stuff about it) The BGFX metal (and vulkan later) renderer is (currently) not making any use of one of metal coolest features (compared to OpenGL): Its mulit-threading capabilities. This feature would give a huge performance bonus in some situations.

As someone noted above you could fix those issue using a fork of BGFX for cocos2d-x and fix them there. However, issue 1 and 2 are deeply integrated into the whole BGFX structure, so fixing them would require you too:

Fully understand the BGFX code (time consuming)
Find and eliminate all parts that are referencing the problematic code, which would also requires lots of time

So I think that it would be better (and less time consuming) to recreate BGFX and reuse parts of it. But again, that is just my opinion.

Edit: I know its not your call, but I’m just stating some issues with BGFX here

Edit 2: Apparently flaw 2 is not true. See @bkaradzic’s comment.

ricardo · February 9, 2016, 2:04am

Thanks for sharing your research!

bkaradzic · February 10, 2016, 7:02am

Hi guys, I’m author of bgfx (noticed this discussion in bgfx’s project analytics)

I took a really close look at BGFX this time and noticed some issues and or flaws:

You specify minimum version, higher version features become available through extensions in case of OpenGL. Generally I support capabilities flags and it would be quite trivial to add ability to pick GLES profile in runtime, but since most GLES devices are mobile, switching and checking caps in runtime comes with cost (anyhow it’s doable).
False, Window creation is not handled by bgfx at all. Context/device is handled by bgfx only if you don’t pass context/device pointers. More info: bkaradzic.github.io/bgfx/bgfx.html#_CPPv2N4bgfx12PlatformDataE
bgfx is multithreaded. Two threads bgfx API, and rendering submission thread split. Rendering thread doesn’t do threaded submission yet. Most of applications are bounded by their internal logic rather than bgfx renderer submission. Here are some performance numbers: bkaradzic.github.io/bgfx/examples.html#drawstress

Anyhow, bgfx is active project and issues will be addressed over time. Also if you have feature requests or even general questions about bgfx feel free to open GitHub issue.

Darinex · February 10, 2016, 11:39am

Thanks for clearing this up @bkaradzic.

All the things I said were based on my understanding of the source code. Guess I interpreted some of them wrong.

I know how bgfx is making use of multithreading. I was just stating that the metal renderer currently isn’t making use of metals capability to encode render commands on more than one thread to their own command buffers, which could improve performance with applications with lots of render/dispatch/blit commands.

stevetranby · February 11, 2016, 3:38am

Metal et al have better support for shared memory access and other lower-level capabilities and features that could be utilized by those who need it with less work than writing from scratch, if the API is open and designed well enough. Essentially it should improve performance without any changes to your game code just by being a more optimized framework. It would likely require “duplicate” shaders written however even though its fairly similar to GLSL, but then I’d love to see Unity-level abstraction over most shaders (with ability to write them by hand for those pesky custom needs). That said for me I see GL ES 3.x+ and extensions support as being the main driver toward BGFX or another similar thing possibly written custom for the next cocos2d engine.

http://blogs.unity3d.com/2014/07/03/metal-a-new-graphics-api-for-ios-8/?

2c