Last week’s article showed how to effectively use the CPU’s caches to boost performance by an order of magnitude. Today’s article goes further and shows you how to use even more of the CPU’s capabilities!

Writing fast code requires you to utilize all of the computer’s processors to their limit. In order to do that, you have to understand what those processors have to offer. If you’re writing code for some abstract computer then it’s really easy to miss out on massive optimizations. It turns out that our code runs on actual computers that are built to provide huge performance if you just give them code that they’re designed to run fast.

The CPU’s caches of RAM are just one feature like this. Sure, you could ignore the cache and your program will still run correctly. But if you want your program to run fast then you need to be aware of the cache when you’re laying out your data structures and algorithms.

The very same goes for the CPU’s cores. These days there are almost no computers with just a single core. The iPhone has had at least two cores since the 4S in 2011 and now has four cores in the iPhone 7. Android devices have been multi-core for at least as long. Desktop computers and consoles have been multi-core even longer.

So while you can write working code that ignores the CPU cache or ignores all but one of the CPU’s cores, your code will be slow. If you want it to run faster, you need to take advantage of the performance potential that’s sitting there waiting for you to use it. In the case of multi-core CPUs, your app needs to run multiple threads. I imagine most programmers have at least heard of these if not used them in their own apps, but in a nutshell they allow you to run your code in parallel on each of the CPU’s cores while sharing the same memory.
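
To make that concrete, here is a minimal sketch of two threads working on the same array at the same time. It is not part of this article's test script: `SharedMemoryThreadsSketch`, `SumRange`, and `SumOnTwoThreads` are names made up for illustration, and the only thing it assumes is the standard `System.Threading.Thread` class.

```csharp
using System;
using System.Threading;

// Hypothetical example: names here are illustrative, not from the test script.
static class SharedMemoryThreadsSketch
{
    // Sums values[start .. start + count) and stores the total in results[slot].
    // Each thread writes only to its own slot, so no locking is required.
    static void SumRange(int[] values, int start, int count, long[] results, int slot)
    {
        long sum = 0;
        for (int i = start; i < start + count; ++i)
        {
            sum += values[i];
        }
        results[slot] = sum;
    }

    public static long SumOnTwoThreads(int[] values)
    {
        // Both threads see the same 'values' and 'results' arrays
        long[] results = new long[2];
        int half = values.Length / 2;

        Thread first = new Thread(() => SumRange(values, 0, half, results, 0));
        Thread second = new Thread(() => SumRange(values, half, values.Length - half, results, 1));

        // Start both threads so the two halves can be summed in parallel
        first.Start();
        second.Start();

        // Wait for both threads to finish before reading their results
        first.Join();
        second.Join();

        return results[0] + results[1];
    }
}
```

Each thread reads from the same `values` array and writes to its own slot of `results`, so no lock is needed, and the `Join` calls make sure both threads are done before the totals are added up.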

For example, consider the little program from last week’s article. It wanted to update the position of a projectile according to its velocity and the elapsed time. It simply looped from the start of an array of projectiles to the end updating them one at a time. That program was single-threaded because only one copy of the code was running at a time. The rest of the CPU’s cores were just sitting there idle, waiting for a program to give them something to do.

In today’s article we’ll wake up those cores and put them to work! We’ll do this by splitting up the loop so that we don’t have just one big loop that goes from the beginning to the end of the array. Instead, we’ll have one loop per CPU core. In the case of the 6-core Android device I’m running on, each loop will update a one-sixth chunk of the projectiles array.
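
As a rough sketch of how an array can be carved up into one chunk per core, the helper below computes a start index and a count for each chunk. The `ChunkSketch` class, the `Chunk` struct, and the `Split` function are hypothetical names used only for illustration, and spreading the leftover elements over the first few chunks is just one possible choice.

```csharp
using System;

// Hypothetical helper: not part of the test script.
static class ChunkSketch
{
    public struct Chunk
    {
        public int StartIndex;
        public int Count;
    }

    // Splits 'count' elements into 'numChunks' contiguous chunks.
    // When count isn't evenly divisible, the first (count % numChunks)
    // chunks each take one extra element so that nothing is skipped.
    public static Chunk[] Split(int count, int numChunks)
    {
        Chunk[] chunks = new Chunk[numChunks];
        int baseSize = count / numChunks;
        int remainder = count % numChunks;
        int start = 0;
        for (int i = 0; i < numChunks; ++i)
        {
            int size = baseSize + (i < remainder ? 1 : 0);
            chunks[i].StartIndex = start;
            chunks[i].Count = size;
            start += size;
        }
        return chunks;
    }
}
```

For 10,000,000 projectiles on six cores, `ChunkSketch.Split(10000000, 6)` gives four chunks of 1,666,667 elements followed by two chunks of 1,666,666, which together cover the whole array.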

using System;
using System.Threading;
 
using UnityEngine;
 
class TestScript : MonoBehaviour
{
	struct ProjectileStruct
	{
		public Vector3 Position;
		public Vector3 Velocity;
	}
 
	class ProjectileClass
	{
		public Vector3 Position;
		public Vector3 Velocity;
	}
 
	class ThreadStartParamStruct
	{
		public ProjectileStruct[] ProjectileStructs;
		public int StartIndex;
		public int Count;
		public float Time;
	}
 
	class ThreadStartParamClass
	{
		public ProjectileClass[] ProjectileClasses;
		public int StartIndex;
		public int Count;
		public float Time;
	}
 
	void Start()
	{
		// Setup
 
		const int count = 10000000;
		ProjectileStruct[] projectileStructs = new ProjectileStruct[count];
		ProjectileClass[] projectileClasses = new ProjectileClass[count];
		for (int i = 0; i < count; ++i)
		{
			projectileClasses[i] = new ProjectileClass();
		}
		Shuffle(projectileStructs);
		Shuffle(projectileClasses);
		int numThreads = Environment.ProcessorCount;
		int numPerThread = count / numThreads;
		Thread[] structThreads = new Thread[numThreads];
		ParameterizedThreadStart threadStartStruct = UpdateProjectilesStruct;
		ThreadStartParamStruct threadStartParamStruct = new ThreadStartParamStruct();
		Thread[] classThreads = new Thread[numThreads];
		ParameterizedThreadStart threadStartClass = UpdateProjectilesClass;
		ThreadStartParamClass threadStartParamClass = new ThreadStartParamClass();
		for (int i = 0; i < numThreads; ++i)
		{
			structThreads[i] = new Thread(threadStartStruct);
			classThreads[i] = new Thread(threadStartClass);
		}
 
		// Struct
 
		System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
		for (int i = 0; i < count; ++i)
		{
			UpdateProjectile(ref projectileStructs[i], 0.5f);
		}
		long structTime = sw.ElapsedMilliseconds;
 
		// Class
 
		sw.Reset();
		sw.Start();
		for (int i = 0; i < count; ++i)
		{
			UpdateProjectile(projectileClasses[i], 0.5f);
		}
		long classTime = sw.ElapsedMilliseconds;
 
		// Threaded Struct
 
		sw.Reset();
		sw.Start();
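		// Give each thread its own parameter object. Sharing one object and
		// changing StartIndex after Start() would race with the running threads.
		for (int i = 0; i < numThreads; ++i)
		{
			threadStartParamStruct = new ThreadStartParamStruct();
			threadStartParamStruct.ProjectileStructs = projectileStructs;
			threadStartParamStruct.StartIndex = i * numPerThread;
			// The last thread also takes any remainder left by the integer division
			threadStartParamStruct.Count = (i == numThreads - 1)
				? count - i * numPerThread
				: numPerThread;
			threadStartParamStruct.Time = 0.5f;
			structThreads[i].Start(threadStartParamStruct);
		}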
		for (int i = 0; i < numThreads; ++i)
		{
			structThreads[i].Join();
		}
		long threadedStructTime = sw.ElapsedMilliseconds;
 
		// Threaded Class
 
		sw.Reset();
		sw.Start();
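		// Give each thread its own parameter object. Sharing one object and
		// changing StartIndex after Start() would race with the running threads.
		for (int i = 0; i < numThreads; ++i)
		{
			threadStartParamClass = new ThreadStartParamClass();
			threadStartParamClass.ProjectileClasses = projectileClasses;
			threadStartParamClass.StartIndex = i * numPerThread;
			// The last thread also takes any remainder left by the integer division
			threadStartParamClass.Count = (i == numThreads - 1)
				? count - i * numPerThread
				: numPerThread;
			threadStartParamClass.Time = 0.5f;
			classThreads[i].Start(threadStartParamClass);
		}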
		for (int i = 0; i < numThreads; ++i)
		{
			classThreads[i].Join();
		}
		long threadedClassTime = sw.ElapsedMilliseconds;
 
		string report = string.Format(
			"Type,Time,Threaded Time\n" +
			"Struct,{0},{2}\n" +
			"Class,{1},{3}",
			structTime,
			classTime,
			threadedStructTime,
			threadedClassTime
		);
		Debug.Log(report);
	}
 
	static void UpdateProjectilesStruct(object startParam)
	{
		ThreadStartParamStruct typedStartParam = (ThreadStartParamStruct)startParam;
		ProjectileStruct[] projectileStructs = typedStartParam.ProjectileStructs;
		int endIndex = typedStartParam.StartIndex + typedStartParam.Count;
		float time = typedStartParam.Time;
		for (int i = typedStartParam.StartIndex; i < endIndex; ++i)
		{
			UpdateProjectile(ref projectileStructs[i], time);
		}
	}
 
	static void UpdateProjectilesClass(object startParam)
	{
		ThreadStartParamClass typedStartParam = (ThreadStartParamClass)startParam;
		ProjectileClass[] projectileClasses = typedStartParam.ProjectileClasses;
		int endIndex = typedStartParam.StartIndex + typedStartParam.Count;
		float time = typedStartParam.Time;
		for (int i = typedStartParam.StartIndex; i < endIndex; ++i)
		{
			UpdateProjectile(projectileClasses[i], time);
		}
	}
 
	static void UpdateProjectile(ref ProjectileStruct projectile, float time)
	{
		projectile.Position += projectile.Velocity * time;
	}
 
	static void UpdateProjectile(ProjectileClass projectile, float time)
	{
		projectile.Position += projectile.Velocity * time;
	}
 
	public static void Shuffle<T>(T[] list)  
	{
		System.Random random = new System.Random();
		for (int n = list.Length; n > 1; )
		{
			n--;
			int k = random.Next(n + 1);
			T value = list[k];
			list[k] = list[n];
			list[n] = value;
		}
	}
}

If you want to try out the test yourself, simply paste the above code into a TestScript.cs file in your Unity project’s Assets directory and attach it to the main camera game object in a new, empty project. Go to Player Settings and change Scripting Backend to IL2CPP. I ran it that way on this machine:

  • LG Nexus 5X
  • Android 7.1.2
  • Unity 5.6.0f3

And here are the results I got:

Type      Time (ms)   Threaded Time (ms)
Struct    296         123
Class     3478        583

[Chart: How To Use The Whole CPU]

Using all six cores on the testing device has yielded a massive speedup for both structs and classes! Structs are about 2.4x faster (296 ms down to 123 ms) and classes are about 6x faster (3478 ms down to 583 ms)! Structs still hold the performance advantage over classes because they make better use of the CPU cache. Not waiting on slow RAM access makes the threaded structs version about 4.7x faster than the threaded classes version (123 ms versus 583 ms).

Overall, using both CPU caching (via structs) and the CPU’s cores (via threads) has led us to a 28x performance boost: from 3478 ms for single-threaded classes down to 123 ms for threaded structs!

At this point I should mention that both structs and threads are just tools in your toolbox. You should understand how they work so you can put them to use for appropriate tasks. If you do, you could write way faster code!