We’ve seen how using the CPU’s cache can lead to a 13x speedup, but that only took advantage of one of the CPU’s cache types. Today’s article shows you how to go further by making use of a whole other type of CPU caching!

Consumer CPUs since the 1980s have had two kinds of caches on the chip itself to speed up access to slow main RAM. The first is the “data cache” I wrote about two weeks ago: it caches the data your program stores in RAM. In a nutshell, whenever you access memory the CPU first checks this cache to see if what you need is there and, if it is, avoids accessing RAM. A cache “miss” means the CPU “stalls” (i.e. does nothing) while it waits a really long time for a whole “cache line” (e.g. 32 or 64 bytes) to be fetched from RAM.

The other cache is the “instruction cache”. It works just like the data cache, except it stores the instructions that make up your program rather than your program’s data. As the CPU executes your program it pulls instructions from RAM, and to speed this up it first checks the instruction cache. Otherwise, this all behaves very similarly to the data cache.

Normally your instructions are just a linear list of commands with “jumps” (e.g. a C# `if` or `while`) that only go a little way. You might skip forward or backward a few dozen instructions, but you probably won’t skip a million. Even if you do, modern CPUs will read ahead of the current instruction to see that you’re about to jump to a far-flung function and try to load that code ahead of time so it’s already in the instruction cache by the time you need it. That means code is normally very well set up for localized execution that repeatedly reads from the same lines of the instruction cache.
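For instance, the jumps the compiler generates for ordinary branching code like this only move execution a few instructions forward or backward. This is just a minimal sketch; `values` is a stand-in for whatever array you’re processing:

```csharp
// Ordinary control flow: the jumps compiled for the `for` loop and the
// `if` land just a few instructions away, so execution stays within the
// same few instruction cache lines.
int[] values = { 3, -1, 4, -1, 5 };
int sum = 0;
for (int i = 0; i < values.Length; i++)
{
    if (values[i] > 0)
    {
        sum += values[i];
    }
}
```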

So what can go wrong? The problems start when we write code whose execution order is determined at runtime. The code the CPU ends up executing says “read this memory location and jump to the address you find there”. That address might change from one execution to the next, so the CPU has a hard time pre-loading the right memory into the instruction cache.

In C# there are two main ways to trigger this. The first is the most literal version: delegates. Delegates (e.g. `Action`) are variables that can be changed at runtime to “point” to different functions. They can even point to multiple functions at once. So when the CPU gets to a delegate call it might not be able to predict which function it should load into the instruction cache and might need to stall while it waits for the right function to load from RAM.
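Here’s a minimal sketch of that. The `Meow` and `Woof` functions are just stand-ins for whatever your delegate might point to:

```csharp
using System;

public static class DelegateExample
{
    static void Meow() { Console.WriteLine("meow"); }
    static void Woof() { Console.WriteLine("woof"); }

    public static void Main()
    {
        // The delegate's target isn't known until runtime, so the CPU
        // can't reliably pre-load the right function's instructions.
        Action speak = Meow;
        speak(); // jumps to Meow

        speak = Woof;
        speak(); // now jumps to Woof

        // Delegates can even point to multiple functions at once
        speak += Meow;
        speak(); // prints "woof", then "meow"
    }
}
```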

The second way to trigger this is less obvious: virtual functions. These include functions marked `virtual`, functions marked `override` to override a virtual function, and any function implementing a function defined in an interface. Consider the classic polymorphism example where `IAnimal` has a `void Speak()` and `Cat` and `Dog` classes implement `IAnimal` by defining their own `Speak` functions that print “meow” and “woof”. Then some other code has an `IAnimal` variable and calls `animal.Speak()`. The way this works behind the scenes is that both `Dog` and `Cat` instances contain a function pointer (essentially a low-level delegate) that points to their own version of `Speak`. This too trips up the CPU, causing it to frequently wait on main memory just to know what to execute next.
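Here’s that classic example in code:

```csharp
using System;

public interface IAnimal
{
    void Speak();
}

public class Cat : IAnimal
{
    public void Speak() { Console.WriteLine("meow"); }
}

public class Dog : IAnimal
{
    public void Speak() { Console.WriteLine("woof"); }
}

public static class VirtualExample
{
    // The CPU can't know which Speak this will jump to until the
    // function pointer inside `animal` is read at runtime
    public static void MakeSpeak(IAnimal animal)
    {
        animal.Speak();
    }

    public static void Main()
    {
        MakeSpeak(new Cat()); // meow
        MakeSpeak(new Dog()); // woof
    }
}
```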

Unfortunately, it’s rather hard to create a small test script for the purposes of demonstrating instruction cache misses. Doing so requires something more like a real game, with a lot more complexity than I can fit in this article. For example, say you had an `IEntity` interface with 50 classes implementing it: `Player`, `Boss`, `Obstacle`, etc. If you loop over an `IEntity[]` calling a virtual `Update` then you’ll constantly be telling the CPU to jump to different areas of RAM to find the instructions for updating each of those types. If, however, you only had a few classes implementing `IEntity` then all of those `Update` functions might fit in the instruction cache. Hence the need for more code than is practical for this article.
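The loop itself looks innocent enough. This is just a sketch, with `IEntity` assumed to look something like the interface the example describes:

```csharp
public interface IEntity
{
    void Update();
}

public static class EntityUpdater
{
    public static void UpdateAll(IEntity[] entities)
    {
        foreach (IEntity entity in entities)
        {
            // Each call may jump to a different type's Update, pulling
            // different code into the instruction cache every iteration
            entity.Update();
        }
    }
}
```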

By the way, a more instruction cache-friendly way to call `Update` on all the `IEntity` objects is to sort them by concrete type and call non-virtual functions on each group. This presumes that their update order doesn’t need to be interleaved across entity types, but that’s often the case.
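A minimal sketch of that approach, again with hypothetical `Player` and `Boss` classes standing in for a real game’s types:

```csharp
public class Player
{
    public void Update() { /* player update logic */ }
}

public class Boss
{
    public void Update() { /* boss update logic */ }
}

public static class GroupedUpdater
{
    public static void UpdateAll(Player[] players, Boss[] bosses)
    {
        // All Player.Update calls jump to the same instructions, which
        // stay resident in the instruction cache for the whole loop
        foreach (Player player in players)
        {
            player.Update();
        }

        // Likewise for all the Boss.Update calls
        foreach (Boss boss in bosses)
        {
            boss.Update();
        }
    }
}
```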

So keep the instruction cache in mind as you focus on performance. If you can arrange your performance-critical code to avoid virtual functions and delegates, you may well have a nice performance win on your hands.