How does Cocos Creator optimize the use of Gaussian blur? Explore the best performance options!

Gaussian blur is a technique most of us know well. The developer Quan XiaoMo has deeply optimized it using bilinear sampling. The sample project has been upgraded to Cocos Creator 3.6.2; see the end of the article for the download link.

First, let’s see the final result.


Result of a 9x9 Gaussian kernel after four iterations

Gaussian blur is a blurring technique often used in game post-processing. How is it implemented, and how can its performance be optimized? In this article, we look at the implementation of Gaussian blur and explore deeper optimizations that give a better blur with less computation.

This article focuses on the following points.

  • Basic implementation of Gaussian blur
  • Linear decomposition of Gaussian blur
  • Bilinear sampling for Gaussian blur
  • Cocos post-processing applications

Preliminary Preparation

First, I referred to a forum post about Gaussian blur.

Gaussian Blur Shader, by Chen Pi Pi:

Blur effect of the scheme

The effect is really good, and it can even produce a depth-blur effect, but the performance is poor. My interface resolution is 960x640. With a blur radius of 20, each frame requires 960x640x41x41 ≈ 1.03 billion samples. PS: Why 41? A radius of 20 plus the center pixel gives 20+1+20=41; in general, a radius R corresponds to a (2R+1)x(2R+1) Gaussian kernel.

In short, this method brought my computer to a crawl. With the radius reduced to 4, the count is 960x640x9x9 ≈ 49.77 million samples per frame. The effect is shown in the figure above. So is there any room for optimization?

Note: Multiplying by 960 and 640 every time makes the numbers hard to compare, so the calculations that follow are given per pixel, i.e., with these two factors divided out.
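The per-frame counts quoted in this section can be reproduced with a few lines of Python. This is only an arithmetic sketch; the function name `naive_samples` is made up for illustration and is not part of the sample project.

```python
def naive_samples(width, height, radius):
    """Texture samples per frame for a non-separated (2R+1) x (2R+1) kernel."""
    kernel_side = 2 * radius + 1          # radius on each side plus the center
    return width * height * kernel_side * kernel_side

print(naive_samples(960, 640, 20))  # radius 20 -> 41 x 41 kernel
print(naive_samples(960, 640, 4))   # radius 4  -> 9 x 9 kernel
```

Dividing either result by 960x640 gives the per-pixel figure used in the rest of the article (1681 and 81 respectively).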

Core Ideas

Gaussian blur is commonly used in image processing to reduce image noise and the level of detail. Visually, the result looks like viewing the image through a translucent screen.

From a digital signal processing perspective, image blurring is essentially the process of filtering out high-frequency signals and retaining low-frequency ones. A common way to filter out high-frequency signals is convolution filtering. From this perspective, Gaussian blurring an image means convolving the image with a normal distribution. Since the normal distribution is also called the “Gaussian distribution,” this technique is called Gaussian blur. And since the Fourier transform of a Gaussian function is another Gaussian function, Gaussian blur acts as a low-pass filter on the image.

Speaking of Gaussian blur, it is necessary to talk about the Gaussian kernel, which has a basic model as follows:


Three-dimensional schematic of Gaussian function

For each input pixel, the pixels within the blur radius around it are weighted by the Gaussian kernel and then summed to produce the output value.
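As a rough illustration of the weighted sum described above, the following Python sketch builds a normalized kernel from the Gaussian function and applies it to one pixel. The names `gaussian_kernel_2d` and `blur_pixel`, the `sigma` value, the square neighbourhood, and the edge clamping are all assumptions made for this example; they are not taken from the sample project.

```python
import math

def gaussian_kernel_2d(radius, sigma):
    """Return a normalized (2R+1) x (2R+1) grid of Gaussian weights."""
    coords = range(-radius, radius + 1)
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma)) for x in coords]
           for y in coords]
    total = sum(sum(row) for row in raw)
    return [[value / total for value in row] for row in raw]

def blur_pixel(image, px, py, kernel):
    """Weighted sum of the neighbourhood around (px, py); edges are clamped."""
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    result = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y = min(max(py + dy, 0), height - 1)
            x = min(max(px + dx, 0), width - 1)
            result += kernel[dy + radius][dx + radius] * image[y][x]
    return result

kernel = gaussian_kernel_2d(radius=2, sigma=1.0)
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0                       # a single bright pixel
center_after_blur = blur_pixel(image, 2, 2, kernel)
```

Because only the center pixel is lit, `center_after_blur` equals the center weight of the kernel; the rest of the brightness is spread to the neighbours.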

Gaussian blur on a two-dimensional image can also be computed as two independent one-dimensional operations, i.e., it is linearly separable. This means the effect of a 2D kernel can also be obtained by applying a 1D Gaussian kernel in the horizontal direction followed by a 1D Gaussian kernel in the vertical direction. This is very useful computationally, since it requires only MxNxm + MxNxn operations instead of the original MxNxmxn, where M and N are the dimensions of the image to be filtered (in pixels) and m and n are the dimensions of the filter kernel.
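The separability claim above can be checked numerically. The sketch below compares a full 2D pass with a horizontal pass followed by a vertical pass on a small test image. The helper names, the 3-tap kernel, and the clamped edges are assumptions chosen to keep the example short.

```python
def clamp(i, n):
    return max(0, min(n - 1, i))

def blur_2d(img, k):
    """Apply the 2D kernel k[dy] * k[dx] in a single pass."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    return [[sum(k[dy + r] * k[dx + r] * img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-r, r + 1) for dx in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def blur_1d(img, k, horizontal):
    """Apply the 1D kernel k along one axis only."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    def sample(y, x, d):
        return img[y][clamp(x + d, w)] if horizontal else img[clamp(y + d, h)][x]
    return [[sum(k[d + r] * sample(y, x, d) for d in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

image = [[float((3 * x + 7 * y) % 11) for x in range(6)] for y in range(5)]
kernel = [0.25, 0.5, 0.25]

full = blur_2d(image, kernel)                                    # m * n taps per pixel
separated = blur_1d(blur_1d(image, kernel, True), kernel, False)  # m + n taps per pixel
max_diff = max(abs(a - b) for ra, rb in zip(full, separated) for a, b in zip(ra, rb))
```

`max_diff` stays at floating-point rounding level, so the two approaches produce the same image while the separated version reads far fewer samples.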

The following is the linear decomposition process of a Gaussian Kernel:


In the figure above, the filter is applied five times. Yang Hui's triangle (Pascal's triangle) gives the binomial coefficients, which can be used as the convolution kernel weights (each element is the sum of the two adjacent elements in the previous row).


We take the bottom row as our data sample. The numbers in the bottom row sum to 4096. Because 1/4096 and 12/4096 are relatively small, we can drop the 1 and 12 on each side without visibly hurting the result. The sum then becomes 4070, and the weights from the edge to the center are [66, 220, 495, 792, 924]/4070.
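The weights above can be derived directly from the binomial coefficients. This is a small sketch under the assumption that the bottom row of the triangle is row 12 (the row that sums to 4096); the variable names are illustrative only.

```python
from math import comb

row = [comb(12, k) for k in range(13)]   # 1, 12, 66, 220, ..., 924, ..., 66, 12, 1
trimmed = row[2:-2]                      # drop the outer 1 and 12 on each side
total = sum(trimmed)                     # 4096 - 2 * (1 + 12)
weights = [c / total for c in trimmed]   # nine normalized taps, center at index 4

print(total)
print([round(w, 10) for w in weights[4:]])  # center tap, then taps 1 to 4
```

The rounded values are the constants that appear in the fragment shader later in the article.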

Implementation process

This time, we use a camera to render the scene into a renderTexture, display it with a Sprite, and finally use the Canvas camera to render it to the screen. See my previous post for the detailed process.

3D camera captures to renderTexture and then renders to the 2D camera, by Quan XiaoMo:

Let’s start with a 9x9 Gaussian kernel with a total weight of 4070.

N x M → N + M

 _BlurOffsetX: { value: 0, editor: { slide: true, range: [0, 1.0], step: 0.0001 } }
 _BlurOffsetY: { value: 0, editor: { slide: true, range: [0, 1.0], step: 0.0001 } }


First, define two uniforms that hold the uv offset of a single pixel, and expose them as two sliders so that the horizontal and vertical offsets can be adjusted dynamically. Because each slider runs from 0 to 1, its value is divided by size (the screen size) to become the uv offset of a single pixel. (PS: I scale by 2 here, so the maximum slider value corresponds to an offset of 2 pixels per tap, which makes the effect easier to test.)

The vertex shader is unchanged. Because the kernel is 9x9, the fragment shader takes nine samples, as follows:

vec4 GaussianBlur() {
  // Origin (center sample)
  vec4 color = 0.2270270270 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0);
  // Sample points to the right / above
  color += 0.1945945946 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(1.0 * _BlurOffsetX, 1.0 * _BlurOffsetY));
  color += 0.1216216216 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(2.0 * _BlurOffsetX, 2.0 * _BlurOffsetY));
  color += 0.0540540541 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(3.0 * _BlurOffsetX, 3.0 * _BlurOffsetY));
  color += 0.0162162162 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(4.0 * _BlurOffsetX, 4.0 * _BlurOffsetY));
  // Sample points to the left / below
  color += 0.1945945946 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(-1.0 * _BlurOffsetX, -1.0 * _BlurOffsetY));
  color += 0.1216216216 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(-2.0 * _BlurOffsetX, -2.0 * _BlurOffsetY));
  color += 0.0540540541 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(-3.0 * _BlurOffsetX, -3.0 * _BlurOffsetY));
  color += 0.0162162162 * CCSampleWithAlphaSeparated(cc_spriteTexture, uv0 + vec2(-4.0 * _BlurOffsetX, -4.0 * _BlurOffsetY));
  return color;
}

Dragging the sliders to the maximum gives the following blur result.
One blur

This result is basically the same as the 9x9 blur, but the computation per pixel drops from 9x9=81 samples to 9+9=18. However, there are problems with this effect, which we will discuss later.

Multiple filtering

To filter multiple times, the output image of one filtering pass must be passed as the input to the next. This time we use a layers + multi-camera scheme to achieve this.


I added eight layers in the project settings and placed eight sprites in the Canvas, one per layer, and then created eight cameras to capture these eight sprites. The cameras’ rendering priorities run from low to high, so the eight cameras can apply up to eight filtering passes in total.

Here I borrowed the technical approach of the screen post-processing effects example from the forum. That demo is really powerful: it contains 14 screen post-processing effects, and I took what I needed from it. The code is already encapsulated, so I will not go into detail here. You can download the demo directly to study it.

Screen post-processing effects examples:


I followed the code above and processed the blurred result four times to get the following result.


I noticed that some details were distorted after blurring, so something must have gone wrong. Let’s go back to the decomposition we saw earlier: the NxN kernel is split into an Nx1 pass and a 1xN pass, and the 1xN pass must run on the output of the Nx1 pass. My current shader, however, applies both offsets in the same pass, so the result is not equivalent and the blur is distorted.
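One way to see the problem described above is to compare which neighbours actually contribute to a pixel. The Python sketch below assumes both offsets are set to the same non-zero value, so every tap of the single-pass shader moves in x and y at once. The names `single_pass` and `two_pass` are illustrative only.

```python
BINOMIAL = {0: 924, 1: 792, 2: 495, 3: 220, 4: 66}
TOTAL = 4070

def weight(i):
    """Normalized 1D weight of the tap at distance |i| from the center."""
    return BINOMIAL[abs(i)] / TOTAL

taps = range(-4, 5)

# Both offsets applied in one pass: tap i samples at (i, i), i.e. only the diagonal.
single_pass = {(i, i): weight(i) for i in taps}

# Horizontal pass followed by a vertical pass: every (dx, dy) in the 9x9 block contributes.
two_pass = {(dx, dy): weight(dx) * weight(dy) for dx in taps for dy in taps}

print(len(single_pass), len(two_pass))  # prints "9 81"
```

Both footprints sum to 1, but the single-pass version only smears the image along one diagonal line, which is consistent with the distorted details seen in the screenshot.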

The corrected multiple filtering

So we have to correct the code, and here is the updated code.


We first use one pass to process the horizontal direction and then one pass to process the vertical direction, alternating between the two. I think of this as ping-pong alternation. Although there are four passes, they count as only two iterations. The effect is as follows.
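The alternating order of passes can be written down as a simple schedule. The sketch below is a hypothetical illustration that uses two buffers named "A" and "B" which swap roles after every pass. The project itself chains sprites and cameras instead, so the buffer names and the function `pass_schedule` are assumptions for this example only.

```python
def pass_schedule(iterations):
    """List (direction, source, target) for each pass of a ping-pong blur."""
    buffers = ("A", "B")
    schedule = []
    for p in range(2 * iterations):          # one horizontal + one vertical pass per iteration
        direction = "horizontal" if p % 2 == 0 else "vertical"
        source = buffers[p % 2]
        target = buffers[(p + 1) % 2]        # the previous target becomes the next source
        schedule.append((direction, source, target))
    return schedule

for step in pass_schedule(2):
    print(step)
```

With two iterations this yields four passes, and each pass reads the image written by the pass before it.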

Two iterations using ping-pong crossover

Compared with the previous image, the result is now silky smooth. The blur is also a bit stronger than the initial one, and it increases with the number of iterations. So let’s try four iterations and get the following results.

Four iterations using ping-pong crossover

The blur is much stronger than the initial result, yet it takes only 9x4x2=72 samples per pixel, fewer than the 81 of the original 9x9 kernel. We get a better blur with less computation, but that’s not the end of it.
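Why repeated passes blur more strongly can be sketched in 1D: applying the same kernel several times is equivalent to applying one wider kernel. The example below assumes a spacing of exactly one texel per tap (the project uses adjustable offsets), and the helper names are illustrative.

```python
def convolve(a, b):
    """Full discrete convolution of two 1D kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def variance(kernel):
    """Spread of a normalized, symmetric kernel around its center tap."""
    center = (len(kernel) - 1) / 2
    return sum(w * (i - center) ** 2 for i, w in enumerate(kernel))

base = [c / 4070 for c in (66, 220, 495, 792, 924, 792, 495, 220, 66)]

effective = base
for _ in range(3):                 # three more applications, four in total
    effective = convolve(effective, base)

print(len(effective))              # width of the equivalent single kernel
```

Under this assumption, four iterations of the 9-tap kernel behave like a single 33-tap kernel whose variance is four times that of the original, while costing only 9 samples per pass.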

Linear sampling

Up to this point, we have assumed that one texture fetch is needed for each pixel, meaning that nine pixels require nine texture fetches. While this holds for a CPU implementation, it is not always the case on the GPU, where bilinear sampling is available at almost no extra cost. This means that if the texture is not sampled at the center of a texel, a single fetch returns information about several pixels. Since we have already exploited the separability of the Gaussian function and are effectively working in 1D, bilinear interpolation gives us information about two pixels per fetch. How much each texel contributes to the color is determined by the coordinates used.

By choosing the coordinate offset of each texture fetch correctly, we can get the exact weighted information of two pixels (texels) with a single fetch. This means that only five texture fetches are needed for a 9x1 or 1x9 Gaussian filter. In general, ⌈N/2⌉ texture fetches are required for an Nx1 or 1xN filter.


How should this be understood?

According to the formula above, the coordinate of point A is 1.3846153846, and the coordinate of point B is 3.2307692308. Substituting these into the previous fragment shader gives the following code:


At this point, you will find that the result is almost the same as before, but the amount of computation is only 5x4x2=40 samples per pixel.

If you do not need the four iterations and only want the original 9x9 effect, the computation is just 5x2=10 samples per pixel, which is a considerable performance improvement!
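The coordinates of points A and B quoted in this section can be reproduced by merging each pair of adjacent taps into one bilinear fetch: the merged weight is the sum of the two weights, and the merged offset is their weighted average. The sketch below uses this commonly used merging rule as an assumption about the formula; `merge_taps` and `lerp_sample` are illustrative names, and `lerp_sample` only imitates in 1D what the GPU's bilinear filter does.

```python
def merge_taps(offset1, weight1, offset2, weight2):
    """Combine two discrete taps into one linearly interpolated tap."""
    weight = weight1 + weight2
    offset = (offset1 * weight1 + offset2 * weight2) / weight
    return offset, weight

def lerp_sample(signal, position):
    """Linear interpolation between the two texels around `position`."""
    lo = int(position)
    t = position - lo
    return signal[lo] * (1.0 - t) + signal[lo + 1] * t

TOTAL = 4070
offset_a, weight_a = merge_taps(1, 792 / TOTAL, 2, 495 / TOTAL)
offset_b, weight_b = merge_taps(3, 220 / TOTAL, 4, 66 / TOTAL)

# One interpolated fetch reproduces the sum of the two discrete taps.
signal = [0.3, 0.9, 0.1, 0.7, 0.5]
discrete = (792 / TOTAL) * signal[1] + (495 / TOTAL) * signal[2]
bilinear = weight_a * lerp_sample(signal, offset_a)

print(round(offset_a, 10), round(offset_b, 10))
```

Together with the center tap (weight 924/4070 at offset 0), one side of the kernel needs only two fetches instead of four, which gives the five fetches per direction mentioned above.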

Resources Download

Complete project:

Forum post

To summarize, if you need a stronger blur, you can increase the number of iterations or, with bilinear sampling, increase the sampling radius. Even for a 13x13 Gaussian kernel, one iteration requires only 7x2=14 samples per pixel.
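The per-pixel counts used throughout the article can be collected in one small helper. This is only a bookkeeping sketch; `fetches_per_pixel` is a made-up name, and the counts assume one horizontal and one vertical pass per iteration.

```python
import math

def fetches_per_pixel(kernel_size, iterations=1):
    """Return (naive 2D, separated, separated + bilinear) sample counts per pixel."""
    naive = kernel_size * kernel_size * iterations
    separated = 2 * kernel_size * iterations
    bilinear = 2 * math.ceil(kernel_size / 2) * iterations
    return naive, separated, bilinear

print(fetches_per_pixel(9))       # single 9x9 blur
print(fetches_per_pixel(9, 4))    # four iterations
print(fetches_per_pixel(13))      # single 13x13 blur
```

The last element of each tuple is the bilinear figure: 10 for a single 9x9 blur, 40 for four iterations, and 14 for a 13x13 kernel.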
