Dev blog 21st April 2017

A step by step walkthrough of finding and fixing an elusive bug

A bug we had been plagued with for a long time, was an occasional crash of the game client just after a practice match restarted, or when the map changed. We thought we had fixed it about three or four times, but because it was an intermittent crash, it was very easy to believe the problem was fixed. We would investigate, find a genuine error, fix it, and after a few test runs, think it was gone, only to discover it reappear after release.

The most frustrating aspect was that because of our use of UDK, any actual crashes happen in the engine, to which we don't have source code or debugging symbols. It's basically a black box. To put it another way, we had to find the butterfly that flapped its wings without knowing where in the world the resulting hurricane happened.

So how did we find and squash this bug?

Establishing the facts: We knew that the problem happened in our release build, but much more rarely in our development debug builds. So this meant timing or perhaps memory layout was likely to matter. We knew that it only happened when a client was connected to a server, never offline, so this meant it was probably linked to the order of network messages, or data replication. We also knew that it did not crash if you were in spectator mode, meaning it must be something that the playing client did that the spectator didn't (or vice-versa).

The first step of fixing a problem is finding a way to reproduce it reliably. Through trial and error, I eventually determined that locally running a 64 bit server in release mode, using non-cooked resources, and connecting to it using a 32 bit client, also in release with non-cooked resources, was able to trigger the problem about 9 times out of 10. With a one minute match time, I could trigger the bug in a couple of minutes after making code changes. This was important because now I could prove that any changes I made actually fixed the problem. As an aside, I came up with a new game to play in that one minute practice match while waiting for the crash: Try to shoot the ball so it hits the crossbar of both goals without touching the ground in between. Give it a try! I managed two double-dings in about hundred attempts.

Next, we investigate. This is a bit of a dark art, and is about 75% system knowledge, 20% gut instinct and 5% luck. I ran through my test sequence in normal and in spectator mode, then compared the log files to see what was different. What stood out was more data replication happening in normal mode than as a spectator. This makes sense as the normal player has things like a pawn (their character), and a gun, which the spectator does not. I saw that the crash happened a couple of seconds after the first replications of the pawn and gun so I took a guess and started putting some logging statements in the tick functions of the pawn and the gun. This showed me that the pawn and the gun were completing one tick, then doing nothing before the crash a couple of seconds later. I started going through other things that ticked, and eventually came to the HUD.

The HUD has caused problems before, most notably when trying to draw before the player's team is known. The team number comes from the server, and there is a period where the client doesn't know what team it is on. I took a guess and thought that maybe we were looking at a similar problem, so I started logging all calls to GetTeamNum(), the function which says whether the player is on team 0 (red), team 1 (blue), team 254 (spectator), or 255 (unknown). Sure enough, the difference between the 9/10 runs of failure and the 1/10 success was that the team number was still set to 255. However, this was just the start...

I added a check to the HUD render call to return immediately if the team number was 255. This made the crash go away. Promising, but we hadn't yet found the root cause. I started by manually reviewing the HUD rendering code to check for array length problems that might have been missed before (for the uninitiated, imagine what might happen when a computer has a list of 2 things, and you ask it for the 255th thing in that list). I couldn't see anything obvious, so I had to take the hard approach and just start adding log entries every few lines of HUD code, see how far we got before we crashed, find the last function that was called and repeat the process further down the call stack. 

Eventually I made it to a call into the engine: SkeletalMeshComponent::GetPosition(). In this case we were in the radar drawing code, trying to find out where every player is. As I mentioned above, at this point we are stuck at the black box of the unreal engine with no way of digging further. I checked the SkeletalMeshComponent object and it existed and was not None at the time of the call, so I thought I might be stuck. At this point I probably would have added a check for the team number with an apologetic code comment explaining how I got to this point. This would have stopped the crash, but we would never have known why. However, googling "udk getposition crash" led me to this page where someone describes a very familiar sounding problem. I don't know who you are, Farseer on the Unreal forums, but I love you very much. He noticed that calling GetPosition before the SkeletalMesh property was set in the SkeletalMeshComponent would result in a engine crash with no warning.

I added a check for the SkeletalMesh. Sure enough, it was set to None. Modifying the code to return early if the SkeletalMesh was None stopped the crash. Problem found, but the question of "why?" was still unanswered. 

As it turns out, the team number still being 255 was not a red herring. The code to create the mesh returns early if the player's team is not yet set (because it doesn't know what colour material to use), retrying a fraction of a second later. If the player's team number was received after the pawn replication, there was a very small window where the mesh was yet to be created, but the pawn was already trying to draw the radar.

All in all, that's probably 20 hours of investigation (ignoring all the previous attempts) to add one line of code. I am reminded of the joke about a mechanic charging $500 to fix a car by hitting it with a hammer. It's not the hit you are paying for, it's knowing where to hit it that matters.

tl;dr: In-depth investigation showed that the timing of certain events could trigger a bug in the unreal engine.