Bao Ye: Tencent Senior Engineer
Bao Ye, a senior engineer at Tencent Games, is a Cocos Star Writer. He published Cocos2d-x tutorials in the early years and has since published many Cocos Creator technical tutorials on the forum.
At the Cocos Developer Salon in Shenzhen, he gave a talk on real-time synchronization in games. He covered both frame synchronization and state synchronization, and walked through the implementation details of frame-synchronized games from several angles: fighting latency, framework design, disconnection and reconnection, and consistency problems.
Hello everyone, I’m Bao Ye. Some of you know me from the Cocos forum, and I work at Photon Studio inside Tencent. Today I want to talk about real-time synchronization in games, mainly frame synchronization and state synchronization. All of this comes from experience I accumulated on earlier game projects.
Let’s look at a frame synchronization demo first. This is a PVP game I developed with Cocos2d-x a few years ago. First there is matchmaking. After the match loads, an opening animation plays. Next, I will demonstrate the effect of chasing frames.
Right now the other player is stalled and does not move. When the connection recovers, the player who was stalled fast-forwards to catch up. Next is a small reconnection. When we drop off the network during the battle, we can see that the other side keeps playing. When we restore the network, our side catches up and displays everything correctly.
I could trigger that small reconnection so easily because we built debug shortcuts into the game: one shortcut key drops the connection and another restores it. This is very useful for testing.
The last demonstration is a large reconnection. We quit the game in the middle of a battle and then reconnect. After re-entering the game, the entire scene is restored completely. It is a little slower, mainly because resources have to be loaded. Unlike traditional frame synchronization, the restoration does not replay the battle from the beginning to the end.
Frame synchronization synchronizes the players’ operations so that every client gets the same input and executes the same logic on every frame, and therefore ends up with consistent behavior and results. Two things have to be synchronized: time and commands.
We want all players to start the game at the same time. Before the game starts, each client has to load map and model resources, which is very time-consuming. Some players have older devices that load slowly, and others have better devices that load quickly. What should we do?
We generally use a loading screen that synchronizes the loading progress of all players and starts the game only after everyone has finished loading. But even when all resources are loaded, the clients may still not start in sync, because entering the game scene runs a lot of initialization logic. If some phones execute that logic slowly, their game will lag noticeably behind the other players from then on.
An opening animation can absorb this difference. Suppose player A needs one second to initialize and player B needs two. With a three-second opening animation, player A plays it for two seconds and player B plays it for one. This way, everyone starts the game at the same point in time, the moment the opening animation ends.
Finally, we also need to synchronize the clocks of the devices, with the server’s time as the reference. When we enter the match, we request the server’s time and measure the ping of that request, that is, the round-trip time for the packet to reach the server and come back. Dividing the ping by 2 and adding it to the time the server returned gives a good estimate of the current server time. During the game we keep synchronizing, and whenever a later sample has a smaller ping we use it to correct the estimate.
Command synchronization is relatively simple. The server collects all player operations every frame and then broadcasts them to all players.
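The time synchronization step described above can be sketched as follows. This is a minimal illustration, not the code used in the project: `TimeSample`, `ServerClock`, and their members are hypothetical names, and the sketch assumes the request and the reply each take half of the round trip.

```typescript
// Illustrative names only; the talk does not show this code.
interface TimeSample {
  sentAt: number;     // local clock when the request left (ms)
  receivedAt: number; // local clock when the reply arrived (ms)
  serverTime: number; // server clock value carried by the reply (ms)
}

class ServerClock {
  private offsetMs = 0;          // estimated server time minus local time
  private bestPingMs = Infinity; // smallest round trip seen so far

  // Feed one request/reply pair. Only a sample with a smaller ping than
  // every previous one replaces the current estimate.
  addSample(sample: TimeSample): boolean {
    const pingMs = sample.receivedAt - sample.sentAt;
    if (pingMs < 0 || pingMs >= this.bestPingMs) return false;
    this.bestPingMs = pingMs;
    // Assumption: the reply spent half of the round trip on the wire.
    const serverNow = sample.serverTime + pingMs / 2;
    this.offsetMs = serverNow - sample.receivedAt;
    return true;
  }

  // Convert a local timestamp into estimated server time.
  toServerTime(localMs: number): number {
    return localMs + this.offsetMs;
  }
}

const clock = new ServerClock();
clock.addSample({ sentAt: 1000, receivedAt: 1100, serverTime: 5040 }); // ping 100
clock.addSample({ sentAt: 2000, receivedAt: 2020, serverTime: 6012 }); // ping 20, replaces the estimate
```

Keeping only the sample with the smallest ping reflects the idea that a shorter round trip leaves less room for the two directions to differ.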
The core logic of frame synchronization consists of two parts: the design of the command queue and the implementation of the game’s main loop. These two are the most important.
In a command queue design, an operation does not act on the character directly. It first enters a queue, and the game logic then executes it from the queue.
We create different listeners for stand-alone mode and network mode. In stand-alone mode, the listener puts the operation into the queue as soon as it detects the player’s input. In network mode, the listener sends the operation to the server when it detects the player’s input. It also listens to the server and inserts the operations the server returns into the queue.
With this design, stand-alone mode and network mode share basically the same code. The only difference is which listener you create. The queue also makes features such as replays and spectating easy to implement.
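A minimal sketch of the queue and the two listeners described above might look like this. All type and method names (`Command`, `CommandQueue`, `LocalListener`, `NetworkListener`, `send`) are hypothetical, and the network transport is reduced to a callback.

```typescript
// Hypothetical types for illustration; none of these names come from the talk.
interface Command {
  frame: number;    // logical frame on which the command takes effect
  playerId: number;
  action: string;
}

class CommandQueue {
  private pending: Command[] = [];
  readonly history: Command[] = []; // every command handed out, kept for replays

  push(cmd: Command): void {
    this.pending.push(cmd);
  }

  // Remove and return the commands due on or before the given frame.
  takeUpTo(frame: number): Command[] {
    const due = this.pending.filter(c => c.frame <= frame);
    this.pending = this.pending.filter(c => c.frame > frame);
    for (const c of due) this.history.push(c);
    return due;
  }
}

interface InputListener {
  onPlayerInput(cmd: Command): void;
}

// Stand-alone mode: local input goes straight into the queue.
class LocalListener implements InputListener {
  private queue: CommandQueue;
  constructor(queue: CommandQueue) {
    this.queue = queue;
  }
  onPlayerInput(cmd: Command): void {
    this.queue.push(cmd);
  }
}

// Network mode: local input goes to the server, and only what the
// server broadcasts back enters the queue.
class NetworkListener implements InputListener {
  private queue: CommandQueue;
  private send: (cmd: Command) => void;
  constructor(queue: CommandQueue, send: (cmd: Command) => void) {
    this.queue = queue;
    this.send = send;
  }
  onPlayerInput(cmd: Command): void {
    this.send(cmd);
  }
  onServerBroadcast(cmd: Command): void {
    this.queue.push(cmd);
  }
}

const queue = new CommandQueue();
const listener: InputListener = new LocalListener(queue);
listener.onPlayerInput({ frame: 1, playerId: 1, action: "move" });
const due = queue.takeUpTo(1); // contains the one command above
```

Because the game logic only ever reads from `CommandQueue`, feeding it from a recorded `history` instead of a listener is one plausible way to get the replay feature mentioned above.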
In an ordinary game the main loop is easy to understand: we basically write the game logic in the update function that the engine drives. Frame synchronization, however, requires strict control over the execution order of the entire game, so in general you cannot use the engine’s update directly. We need to keep everything in our own hands. The first thing to control is the frame rate.
Frame rate control has to do two things. One is running the game at a fixed frequency, for example 10 logical frames per second. The other is frame chasing: if a client has fallen behind the progress of the game, it should run more logical frames. Instead of the original 10 frames per second, a client that is chasing may need to run 60 frames per second to fast-forward to the current progress.
The code shown in the talk handles this case. For example, if the game has been stalled for a long time, the delta passed to the next update will be very large. Once that delta is added to the accumulated time, far more logical frames are due than in a normal update. We execute them at an accelerated rate, but only up to a fixed number per update before returning, and then continue on the following updates until we have caught up with the current progress.
The execution order of the logic is strictly controlled as well. Before executing the logic, we sort all objects and then call each object’s update method in that fixed order.
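The frame-rate control described above can be sketched with a time accumulator. This is a reconstruction under assumptions, not the speaker’s code: `LogicLoop`, `LOGIC_FRAME_MS`, and `MAX_FRAMES_PER_UPDATE` are hypothetical names, and the cap of 5 frames per engine update is an arbitrary illustrative value.

```typescript
const LOGIC_FRAME_MS = 100;      // 10 logical frames per second
const MAX_FRAMES_PER_UPDATE = 5; // assumed cap on catch-up work per engine update

class LogicLoop {
  frame = 0;                 // logical frames executed so far
  private accumulatedMs = 0; // elapsed time not yet turned into frames
  private step: (frame: number) => void;

  constructor(step: (frame: number) => void) {
    this.step = step;
  }

  // Called from the engine's update with the elapsed time in ms.
  // Returns how many logical frames this call executed.
  update(deltaMs: number): number {
    this.accumulatedMs += deltaMs;
    let executed = 0;
    while (this.accumulatedMs >= LOGIC_FRAME_MS && executed < MAX_FRAMES_PER_UPDATE) {
      this.accumulatedMs -= LOGIC_FRAME_MS;
      this.frame += 1;
      this.step(this.frame);
      executed += 1;
    }
    return executed;
  }
}

const executedFrames: number[] = [];
const loop = new LogicLoop(frame => { executedFrames.push(frame); });
loop.update(50);   // runs 0 frames, 50 ms accumulated
loop.update(50);   // runs 1 frame
loop.update(1000); // a long stall: runs only 5 frames, 500 ms still owed
loop.update(0);    // runs the remaining 5 frames
```

Capping the work per engine update keeps a long stall from freezing the display while the logic catches up over the next few updates.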
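One way to get a strictly controlled update order is to sort with a unique tie-breaker, so that the result never depends on insertion order or on the sort algorithm. The sketch below assumes every object has an `id` that is assigned identically on all clients. `LogicObject`, `priority`, and the helper functions are hypothetical.

```typescript
interface LogicObject {
  id: number;       // unique, and assigned identically on every client
  priority: number; // hypothetical ordering key, smaller runs first
  update(frame: number): void;
}

function sortForUpdate(objects: LogicObject[]): LogicObject[] {
  // Copy, then sort by priority with the id as a tie-breaker.
  return objects.slice().sort((a, b) => a.priority - b.priority || a.id - b.id);
}

function runFrame(objects: LogicObject[], frame: number): void {
  for (const obj of sortForUpdate(objects)) obj.update(frame);
}

const trace: number[] = [];
function makeObject(id: number, priority: number): LogicObject {
  return { id, priority, update: () => { trace.push(id); } };
}
runFrame([makeObject(3, 1), makeObject(1, 1), makeObject(2, 0)], 1);
// trace is now [2, 1, 3], whatever order the objects were created in
```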
Network delay can cause problems such as game freezes and inconsistent behavior between clients. It is a fact of life and cannot be avoided. So how should we deal with it?
First of all, we can enlarge the frame buffer and use wind-up animations to cover up the delay. This is a very common practice. Secondly, we can replace TCP at the transport layer with a UDP-based protocol such as KCP. But why replace TCP?
There are several reasons. The first is TCP’s Nagle algorithm. By default, it collects as many small packets as possible and sends them together, which saves bandwidth but hurts real-time performance. We can turn this feature off with the TCP_NODELAY option. The second is TCP’s timeout retransmission mechanism. When the packet for a frame is lost, the game logic cannot continue normally until that packet has been retransmitted.
So when is it retransmitted? Either TCP waits for the retransmission timeout, or its fast retransmit mechanism kicks in, which is triggered when the sender receives three duplicate ACKs. Either way, the wait is still too long compared with a UDP-based protocol. Finally, when packets are lost, TCP’s congestion control also throttles your sending rate.
These optimizations help, but they cannot eliminate delay. So what should we do when the network is delayed? To guarantee consistent results, the usual approach is to wait (locked frame synchronization). If the data for a frame has not arrived, the client stalls until it does, and then speeds up to catch up with the current progress.
On that project, I used another synchronization scheme that does not lock frames: prediction and rollback. If the clients have no operations, the server does not send empty packets and only broadcasts when there is an operation. A client that receives no packet does not stall. It simply keeps executing. But what if a packet we receive arrives late? Then the client’s state is wrong, and we need to correct it.
In the initial version, our server also ran the game logic, so the client could request a copy of the latest state from the server and deserialize it locally. In addition, the client itself can implement a “rollback and retry” mechanism, which rolls back to the latest state known to be correct and then re-executes from there.
This design follows from the characteristics of the game. The game we were making was a strategy game, and players issue relatively few operations. With the previous scheme (locked frame synchronization), the proportion of empty frames is very high. Even when the player does nothing, the game stalls whenever one of those empty frames is delayed by the network.
With the latter scheme (no frame locking), as long as nobody issues an operation, the game is unaffected and does not freeze no matter how laggy the network is. It also uses less traffic.
As for how to deserialize and how to roll back, I will discuss that in detail later.
Rolling back is actually a highly complex task, but if your framework is well designed, the complexity drops a lot. The core of the combat framework design is to use components and to separate the display layer from the logic layer. This is somewhat similar to ECS but not as thorough.
ECS is very thorough. All its components are pure data without logic, which makes serialization and deserialization easier.
Once the logic layer and the display layer are separated, they can run at different frequencies: the logic layer at a rate that suits the simulation, and the display layer at a higher rate so that rendering stays smooth. The display layer updates what it shows according to state changes in the logic layer, and each component does only one thing. Organized this way, serialization and deserialization become straightforward.
A decoupled logic layer can also easily run on the server, which brings a lot of convenience. The first benefit is security. In frame synchronization the client is generally trusted. Being able to run the logic on the server lets us verify battle results there, in real time or offline, which is very effective against cheats. It is also convenient for running battles in batches, which helps a lot with testing and balance adjustments.
As for disconnection and reconnection, because we have serialization and deserialization, we do not need to run the battle from start to finish to reach the current state, as a typical frame-synchronized game does. With ordinary frame synchronization, if the game has been running for a long time, a large reconnection is very painful. How do we deal with it?
In the original version, the server ran the logic. When a client disconnected and reconnected, the server sent it the latest state directly for recovery. However, when disconnections and reconnections are triggered frequently, this puts a heavy load on the server.
So we made two optimizations. For a small reconnection, where you have only lost a few frames of data, we simply resend those frames. For a large reconnection, the data we serialize is cached for 5 seconds. If the connection drops and reconnects repeatedly during this period, we reuse the cached data.
In addition, we added a keyframe and timing-frame optimization. Its purpose is to relieve the server, and it is possible because we found that serializing the entire battle scene is very fast. When we receive a correct packet, we serialize once. This is the keyframe. If 10 seconds pass without a serialization, we serialize again, which is a timing frame. We keep at most 3 timing frames and one keyframe. When we need to roll back, we search backward from the current time for the latest usable serialized data, deserialize it to restore the state, and then fast-forward again.
Finally, we save the serialized data to disk. When you reconnect, even if there is no state in memory, the client can load the data from disk to recover. At this point the server basically does not need to run the logic, and we still keep the security benefits.
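The bookkeeping for keyframes and timing frames could be sketched as below. This is a guess at one reasonable structure, not the project’s implementation: `BattleSnapshot`, `SnapshotStore`, and the method names are hypothetical, the serialized state is treated as an opaque string, and 10 seconds is expressed as 100 logical frames on the assumption of 10 logical frames per second.

```typescript
interface BattleSnapshot {
  frame: number; // last logical frame already executed in this state
  data: string;  // serialized battle state, opaque in this sketch
}

const TIMING_INTERVAL_FRAMES = 100; // 10 s at an assumed 10 logical frames per second
const MAX_TIMING_FRAMES = 3;

class SnapshotStore {
  private keyframe: BattleSnapshot | null = null;
  private timingFrames: BattleSnapshot[] = [];
  private lastSavedFrame = 0;

  // Call when a received packet confirms that the state at `frame` is correct.
  saveKeyframe(frame: number, data: string): void {
    this.keyframe = { frame, data };
    this.lastSavedFrame = frame;
  }

  // Call every logical frame; serializes only if nothing was saved for 10 s.
  maybeSaveTimingFrame(frame: number, serialize: () => string): boolean {
    if (frame - this.lastSavedFrame < TIMING_INTERVAL_FRAMES) return false;
    this.timingFrames.push({ frame, data: serialize() });
    if (this.timingFrames.length > MAX_TIMING_FRAMES) this.timingFrames.shift();
    this.lastSavedFrame = frame;
    return true;
  }

  // Newest snapshot taken strictly before `frame`, or null if there is none.
  latestBefore(frame: number): BattleSnapshot | null {
    const all = this.keyframe ? this.timingFrames.concat([this.keyframe]) : this.timingFrames;
    let best: BattleSnapshot | null = null;
    for (const s of all) {
      if (s.frame < frame && (best === null || s.frame > best.frame)) best = s;
    }
    return best;
  }
}

const store = new SnapshotStore();
store.saveKeyframe(50, "state@50");
store.maybeSaveTimingFrame(150, () => "state@150");
// A late command for frame 120 arrives: restore from frame 50 and re-run from there.
const rollbackPoint = store.latestBefore(120);
```

After deserializing the chosen snapshot, the client would re-execute the logical frames that follow it with the corrected commands, which is the fast-forward step mentioned above.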
But there is also an extremely abnormal case: what if you switch to a new phone and perform a large reconnection, and the new phone has no saved data?
In that case the server can run the battle to produce a copy of the player’s current state and send it over. But this is a very abnormal and uncommon situation. With these optimizations, the server does not need to run logic in the normal case, and the overall server cost drops considerably.
Serialization and deserialization are not difficult to implement. It is nothing more than writing all attributes into a buffer, or reading them back from the buffer and restoring them. A few points need attention:
- Serialize the directory first. The directory is a table of all our objects. During restoration it guarantees that every object has been instantiated before we restore properties that refer to other objects. For example, my object has a Target attribute that points to the character it is attacking. We store that character’s ID, and when restoring we look up the character for that ID in the table and assign it to the attribute.
- If a dead object is still referenced elsewhere, it must also be serialized and deserialized to keep the logic correct.
- Watch out for side effects. Creating an object and adding it to the scene may execute some methods, and those methods have side effects. For example, if adding an object during deserialization makes it execute a skill, the skill modifies attributes of other objects and pollutes their state. Or, if a random method is called, your subsequent random results will differ from those of the other clients.
After reconnecting, there are many more situations to consider, and I won’t go through all of them here.
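The directory-first rule above can be illustrated with a two-pass restore. This is a simplified sketch using JSON rather than a binary buffer. `Unit`, `UnitState`, and the two functions are hypothetical names, and the `Unit` constructor is deliberately free of side effects so that restoring does not trigger any game logic.

```typescript
interface UnitState {
  id: number;
  hp: number;
  targetId: number | null; // the reference is stored as an id, not as a pointer
}

class Unit {
  id: number;
  hp: number;
  target: Unit | null = null;
  constructor(id: number, hp: number) {
    // No game logic here: construction must not touch other objects.
    this.id = id;
    this.hp = hp;
  }
}

function serializeUnits(units: Unit[]): string {
  const states: UnitState[] = units.map(u => ({
    id: u.id,
    hp: u.hp,
    targetId: u.target ? u.target.id : null,
  }));
  return JSON.stringify(states);
}

function deserializeUnits(json: string): Unit[] {
  const states = JSON.parse(json) as UnitState[];
  // Pass 1: build the directory so every object exists before any reference is resolved.
  const directory: { [id: number]: Unit } = {};
  const units: Unit[] = [];
  for (const s of states) {
    const u = new Unit(s.id, s.hp);
    directory[s.id] = u;
    units.push(u);
  }
  // Pass 2: restore references by looking ids up in the directory.
  for (const s of states) {
    if (s.targetId === null) continue;
    const owner = directory[s.id];
    const target = directory[s.targetId];
    if (owner && target) owner.target = target;
  }
  return units;
}

const attacker = new Unit(1, 100);
const dead = new Unit(2, 0); // dead, but still referenced, so it is serialized too
attacker.target = dead;
const restored = deserializeUnits(serializeUnits([attacker, dead]));
```

In `restored`, the first unit’s `target` points at the restored copy of the dead unit rather than at the original object, which is the behavior the ID lookup is meant to guarantee.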
Let me give you one example. What happens if the battle has already ended by the time we reconnect? We did not handle this at first and added it later. When a battle ends, its result is cached for a period of time. When the player reconnects, we send him the final result as he comes in, so he sees the result of the battle directly.
The last topic is consistency. It is very common for the same code to produce different battle results, and this is the most painful part of developing frame-synchronized games.
The typical consistency problems are roughly as follows:
- Floating-point calculations are used.
- Randomness is not used appropriately.
- Pointers participate in calculations.
- Static variables used in the previous round are not reset and affect the next round.
- The order of execution differs, for example because an unstable sort is used. Suppose two clients sort to find the object with the highest health or the closest object, and two objects have exactly the same health or distance. One client may return A while the other returns B.
- Subjective logic is used in a global system.
- The logic layer depends on the display layer.
- Deserialization has side effects, as mentioned earlier.
What is subjective logic? It is logic whose execution order depends on who the local player is, for example executing my own objects first and everyone else’s afterward. The order then changes from client to client. In a frame-synchronized game, we should strictly follow an objective order, player No. 1 and then player No. 2, regardless of which one is me. The display layer has no such restriction.
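For the randomness item in the list above, a common remedy is a seeded generator that every client creates afresh for each battle, so the same seed always yields the same sequence and nothing leaks from the previous round. The sketch below uses the Park-Miller (MINSTD) generator as an example choice, which is not something the talk specifies. `SeededRandom` and `battleSeed` are hypothetical names, and the seed is assumed to be distributed by the server.

```typescript
class SeededRandom {
  private state: number;

  constructor(seed: number) {
    // Park-Miller needs a state in [1, 2147483646].
    this.state = (Math.floor(Math.abs(seed)) % 2147483646) + 1;
  }

  // Next integer in [1, 2147483646]. The product stays below 2^53,
  // so the arithmetic is exact in a JavaScript number.
  nextInt(): number {
    this.state = (this.state * 48271) % 2147483647;
    return this.state;
  }

  // Integer in [0, n) for a positive integer n, using integer arithmetic only.
  // The modulo introduces a slight bias, acceptable for a sketch.
  nextBelow(n: number): number {
    return this.nextInt() % n;
  }
}

const battleSeed = 20240101; // assumed to be sent by the server to every client
const rngA = new SeededRandom(battleSeed); // client A
const rngB = new SeededRandom(battleSeed); // client B
const sameRoll = rngA.nextBelow(100) === rngB.nextBelow(100); // true
```

The generator must be called only from the logic layer, and the same number of times on every client, otherwise the sequences drift apart again.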
When developing frame-synchronized games, being able to locate consistency problems is very important. So how do we locate an inconsistency?
The most primitive method is to write a lot of logs. I believe many people use this method. But it is inefficient, and the scene is difficult to reproduce. An inconsistency that occurred on someone else’s device may not reproduce on yours.
A more scientific method is to write an in-memory log and use tools to compare the differences quickly. This method is described in detail in the book “The Essence of Tencent Game Development.” It has high performance and low overhead, so it can also be used in the live environment.
We can build something better on top of that method. For example, after each battle, both players hash their memory logs and report the hashes to the server, which compares them. If the hashes differ, the two clients compress their detailed memory logs and upload them to the backend. In this way, we can capture the inconsistencies of many online players and analyze the causes. Inconsistencies can then be eliminated quickly, because the live environment often exposes more problems.
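The hash-then-compare flow above could look roughly like this. It is a sketch under assumptions, not the method from the book: `MemoryLog` and its methods are hypothetical, log entries are plain strings, and 32-bit FNV-1a is used only as an example of a cheap hash.

```typescript
class MemoryLog {
  private entries: string[] = [];

  // Record one value produced by the logic layer on a given frame.
  record(frame: number, tag: string, value: number): void {
    this.entries.push(frame + ":" + tag + ":" + value);
  }

  // 32-bit FNV-1a over the UTF-16 code units of all entries.
  hash(): number {
    const text = this.entries.join("\n");
    let h = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
      h ^= text.charCodeAt(i);
      // Multiply by the FNV prime 16777619 using shifts to stay within 32 bits.
      h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
    }
    return h >>> 0;
  }

  // Index of the first entry that differs from the other log, or -1 if identical.
  firstDifference(other: MemoryLog): number {
    const n = Math.max(this.entries.length, other.entries.length);
    for (let i = 0; i < n; i++) {
      if (this.entries[i] !== other.entries[i]) return i;
    }
    return -1;
  }
}

const logA = new MemoryLog(); // client A
const logB = new MemoryLog(); // client B
logA.record(1, "hp", 100);
logB.record(1, "hp", 100);
logA.record(2, "hp", 90);
logB.record(2, "hp", 91); // the clients diverge on frame 2
const consistent = logA.hash() === logB.hash(); // false, so the full logs get uploaded
const divergedAt = logA.firstDifference(logB);  // 1, the second entry
```

Only the small hash travels to the server after every battle. The detailed logs are uploaded just for the rare battles where the hashes disagree.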
State synchronization is very different from frame synchronization. It puts most of the state and logic calculations on the server side. The server sends the results to the clients, and a client only plays animations according to the results it receives.
Its advantages are that it can support more players and longer sessions, it is safer, disconnection and reconnection are faster, and real-time performance is better. Its disadvantages are that it is relatively complex to implement and that replays are not easy to implement.
I am now designing an open-source state synchronization framework for Cocos Creator 3.x in my spare time. It borrows some of the excellent concepts of the UE network synchronization framework and carries no historical baggage. I will implement it with relatively simple and clear code. Development is progressing slowly, and I hope to release a version before the Lunar New Year.
The development roadmap is roughly as follows. First comes single-machine replication of a single object, then single-machine replication of multiple objects. What is single-machine multi-object replication? For example, I create two layers on the left and right of a scene and synchronize the changes of the nodes on one side to the other side in real time. Then I gradually add networking and finally implement synchronization with an independent server.
There are many concepts involved. For example, JS server code has to be generated. Attribute replication is very complicated, especially on the JS side, where tracking the historical changes of an array is troublesome. It also needs to support various containers, such as Array, Map, and Set, along with incremental updates for them, which is really complicated. To do even better, we would also need a non-rendering Cocos engine so that some animations can run on the server side.
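To give a feel for what an incremental update involves, here is a deliberately simplified sketch that diffs a flat bag of primitive properties. It is not taken from the framework being described, and it avoids the hard parts mentioned above (arrays, Map, Set, nested objects). `PropBag`, `PropDelta`, `diffProps`, and `applyDelta` are hypothetical names.

```typescript
type PropValue = number | string | boolean;
interface PropBag { [key: string]: PropValue; }
interface PropDelta {
  changed: PropBag;  // keys that are new or have a different value
  removed: string[]; // keys that no longer exist
}

function hasKey(bag: PropBag, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(bag, key);
}

// Compute what the server would need to send to turn `prev` into `next`.
function diffProps(prev: PropBag, next: PropBag): PropDelta {
  const delta: PropDelta = { changed: {}, removed: [] };
  for (const key of Object.keys(next)) {
    if (!hasKey(prev, key) || prev[key] !== next[key]) delta.changed[key] = next[key];
  }
  for (const key of Object.keys(prev)) {
    if (!hasKey(next, key)) delta.removed.push(key);
  }
  return delta;
}

// Apply a delta on the receiving side without mutating the input.
function applyDelta(base: PropBag, delta: PropDelta): PropBag {
  const result: PropBag = {};
  for (const key of Object.keys(base)) {
    if (delta.removed.indexOf(key) === -1) result[key] = base[key];
  }
  for (const key of Object.keys(delta.changed)) result[key] = delta.changed[key];
  return result;
}

const serverState: PropBag = { x: 10, y: 5, hp: 100 };
const nextState: PropBag = { x: 12, y: 5, stunned: true };
const delta = diffProps(serverState, nextState);
// delta.changed is { x: 12, stunned: true } and delta.removed is ["hp"]
const clientState = applyDelta(serverState, delta);
```

Containers are harder because a single insertion shifts every later index, which is why tracking array history is called out above as troublesome.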
That’s it for what I shared, thank you.
The following is the guest Q&A session:
Q: Hello, my question is about the logical frame chasing you just described. When the logic layer is chasing frames, how should we handle the display layer?
In Cocos, a lot lives in the display layer, such as easing timers and animations. While the logic layer is chasing frames, the display layer presumably cannot play out every detail. For example, the logic layer may chase three seconds’ worth of frames, and a three-second ultimate may have been cast three seconds ago. Its animation cannot be played in the short time the chase takes. Is there a good practice for this kind of problem, or any experience you can share?
A: This depends on how your display layer is implemented. Mine works like this. I have a variety of display components, and each display component does only one thing: it watches the piece of logic-layer state it cares about and detects changes in it. For example, my logic layer may run 10 frames in one go, but the display layer checks only once afterward whether the state it watches has changed.
In fact, during the whole recovery process I do not handle the display layer specially. I expect the display layer to have a certain ability: as long as it gets the correct state, it displays correctly. Every component works this way.
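A display component of the kind described in this answer might be sketched as follows. `HeroState` and `HpBarView` are hypothetical, and the actual drawing is replaced by a counter so the example stays self-contained.

```typescript
interface HeroState { hp: number; } // logic-layer state, hypothetical shape

class HpBarView {
  shownHp: number; // what the display currently shows
  redraws = 0;     // how many times the view had to update
  private hero: HeroState;

  constructor(hero: HeroState) {
    this.hero = hero;
    this.shownHp = hero.hp;
  }

  // Called once per rendered frame, no matter how many logical frames ran.
  refresh(): void {
    if (this.hero.hp === this.shownHp) return;
    this.shownHp = this.hero.hp; // a real view would update a sprite or label here
    this.redraws += 1;
  }
}

const hero: HeroState = { hp: 100 };
const hpBar = new HpBarView(hero);
for (let i = 0; i < 10; i++) hero.hp -= 1; // ten logical frames run during a chase
hpBar.refresh(); // one redraw, now showing hp 90
```

The view never sees the nine intermediate values. It only needs the final state to display correctly, which is why frame chasing does not require special handling in the display layer.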
Q: Hello, I want to ask about the inconsistency of floating-point numbers. In which situations can inconsistencies occur? When we tried it before, inconsistent results did not seem to appear in every case.
A: First of all, floating-point numbers follow the IEEE 754 standard. If all hardware implemented the standard strictly, there should be no inconsistencies in floating-point calculations.
But the reality is that a lot of hardware does not implement the standard strictly, so there can be subtle differences. I have discussed this issue in depth with many people, but because it is so low-level, the specific differences depend on the implementation of the underlying hardware.