How to Use the Whole CPU
Last week’s article showed how to effectively use the CPU’s caches to boost performance by an order of magnitude. Today’s article goes even further to show you how to use even more of the CPU’s capabilities!
Writing fast code requires you to utilize all of the computer’s processors to their limit. In order to do that, you have to understand what those processors have to offer. If you’re writing code for some abstract computer then it’s really easy to miss out on massive optimizations. It turns out that our code runs on actual computers that are built to provide huge performance if you just give them code that they’re designed to run fast.
The CPU’s caches of RAM are just one feature like this. Sure, you could ignore the cache and your program will still run correctly. But if you want your program to run fast then you need to be aware of the cache when you’re laying out your data structures and algorithms.
The very same goes for the CPU’s cores. These days there are almost no computers with just a single core. The iPhone has had at least two cores since the 4S in 2011 and now has four cores in the iPhone 7. Android devices have been multi-core for at least as long. Desktop computers and consoles have been multi-core even longer.
So while you can write working code that ignores the CPU cache or ignores all but one of the CPU’s cores, your code will be slow. If you want it to run faster, you need to take advantage of the performance potential that’s sitting there waiting for you to use it. In the case of multi-core CPUs, your app needs to run multiple threads. I imagine most programmers have at least heard of these if not used them in their own apps, but in a nutshell they allow you to run your code in parallel on each of the CPU’s cores while sharing the same memory.
For example, consider the little program from last week’s article. It wanted to update the position of a projectile according to its velocity and the elapsed time. It simply looped from the start of an array of projectiles to the end updating them one at a time. That program was single-threaded because only one copy of the code was running at a time. The rest of the CPU’s cores were just sitting there idle, waiting for a program to give them something to do.
In today’s article we’ll wake up those cores and put them to work! We’ll do this by splitting up the loop so that we don’t have just one big loop that goes from the beginning to the end of the array. Instead, we’ll have one loop per CPU core. In the case of the 6-core Android device I’m running on, each loop will update a 1⁄6 chunk of the projectiles array.
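The chunking idea can be sketched on its own before looking at the full test. This is a minimal standalone example in plain C# with no Unity types. The `ChunkDemo` class and its `GetChunk` and `AddToAll` helpers are hypothetical names made up for illustration, and it spreads any remainder over the first few chunks, which is one option among several when the element count does not divide evenly by the thread count.

```csharp
using System;
using System.Threading;

static class ChunkDemo
{
    // Hypothetical helper: computes the [start, end) range of one chunk.
    // Any remainder is spread over the first chunks so no element is skipped.
    public static void GetChunk(int count, int numChunks, int chunk, out int start, out int end)
    {
        int baseSize = count / numChunks;
        int remainder = count % numChunks;
        start = chunk * baseSize + Math.Min(chunk, remainder);
        end = start + baseSize + (chunk < remainder ? 1 : 0);
    }

    // Adds delta to every element, using one thread per chunk.
    public static void AddToAll(float[] values, float delta, int numThreads)
    {
        Thread[] threads = new Thread[numThreads];
        for (int t = 0; t < numThreads; ++t)
        {
            // Declared inside the loop body, so each lambda captures its own copies
            int start, end;
            GetChunk(values.Length, numThreads, t, out start, out end);
            threads[t] = new Thread(() =>
            {
                for (int i = start; i < end; ++i)
                {
                    values[i] += delta;
                }
            });
            threads[t].Start();
        }

        // Wait for every chunk to finish before returning
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
    }
}
```

Calling `ChunkDemo.AddToAll(values, 0.5f, Environment.ProcessorCount)` would update the whole array with one thread per core. Each thread only writes to its own range, so no locking is needed for this particular workload.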
```csharp
using System;
using System.Threading;
using UnityEngine;

class TestScript : MonoBehaviour
{
    struct ProjectileStruct
    {
        public Vector3 Position;
        public Vector3 Velocity;
    }

    class ProjectileClass
    {
        public Vector3 Position;
        public Vector3 Velocity;
    }

    class ThreadStartParamStruct
    {
        public ProjectileStruct[] ProjectileStructs;
        public int StartIndex;
        public int Count;
        public float Time;
    }

    class ThreadStartParamClass
    {
        public ProjectileClass[] ProjectileClasses;
        public int StartIndex;
        public int Count;
        public float Time;
    }

    void Start()
    {
        // Setup
        const int count = 10000000;
        ProjectileStruct[] projectileStructs = new ProjectileStruct[count];
        ProjectileClass[] projectileClasses = new ProjectileClass[count];
        for (int i = 0; i < count; ++i)
        {
            projectileClasses[i] = new ProjectileClass();
        }
        Shuffle(projectileStructs);
        Shuffle(projectileClasses);
        int numThreads = Environment.ProcessorCount;
        // Note: integer division, so if count is not a multiple of numThreads
        // the last few elements are not updated by the threaded passes
        int numPerThread = count / numThreads;
        Thread[] structThreads = new Thread[numThreads];
        ParameterizedThreadStart threadStartStruct = UpdateProjectilesStruct;
        ThreadStartParamStruct threadStartParamStruct = new ThreadStartParamStruct();
        Thread[] classThreads = new Thread[numThreads];
        ParameterizedThreadStart threadStartClass = UpdateProjectilesClass;
        ThreadStartParamClass threadStartParamClass = new ThreadStartParamClass();
        for (int i = 0; i < numThreads; ++i)
        {
            structThreads[i] = new Thread(threadStartStruct);
            classThreads[i] = new Thread(threadStartClass);
        }

        // Struct
        System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < count; ++i)
        {
            UpdateProjectile(ref projectileStructs[i], 0.5f);
        }
        long structTime = sw.ElapsedMilliseconds;

        // Class
        sw.Reset();
        sw.Start();
        for (int i = 0; i < count; ++i)
        {
            UpdateProjectile(projectileClasses[i], 0.5f);
        }
        long classTime = sw.ElapsedMilliseconds;

        // Threaded Struct
        // Note: every thread receives the same parameter object, so a thread
        // that starts late may read a StartIndex that has already been advanced
        sw.Reset();
        sw.Start();
        threadStartParamStruct.ProjectileStructs = projectileStructs;
        threadStartParamStruct.StartIndex = 0;
        threadStartParamStruct.Count = numPerThread;
        threadStartParamStruct.Time = 0.5f;
        for (int i = 0; i < numThreads; ++i)
        {
            structThreads[i].Start(threadStartParamStruct);
            threadStartParamStruct.StartIndex += numPerThread;
        }
        for (int i = 0; i < numThreads; ++i)
        {
            structThreads[i].Join();
        }
        long threadedStructTime = sw.ElapsedMilliseconds;

        // Threaded Class
        sw.Reset();
        sw.Start();
        threadStartParamClass.ProjectileClasses = projectileClasses;
        threadStartParamClass.StartIndex = 0;
        threadStartParamClass.Count = numPerThread;
        threadStartParamClass.Time = 0.5f;
        for (int i = 0; i < numThreads; ++i)
        {
            classThreads[i].Start(threadStartParamClass);
            threadStartParamClass.StartIndex += numPerThread;
        }
        for (int i = 0; i < numThreads; ++i)
        {
            classThreads[i].Join();
        }
        long threadedClassTime = sw.ElapsedMilliseconds;

        // Report all four timings (milliseconds) as CSV
        string report = string.Format(
            "Type,Time,Threaded Time\n" +
            "Struct,{0},{2}\n" +
            "Class,{1},{3}",
            structTime,
            classTime,
            threadedStructTime,
            threadedClassTime
        );
        Debug.Log(report);
    }

    // Thread entry point: updates one chunk of the struct array
    static void UpdateProjectilesStruct(object startParam)
    {
        ThreadStartParamStruct typedStartParam = (ThreadStartParamStruct)startParam;
        ProjectileStruct[] projectileStructs = typedStartParam.ProjectileStructs;
        int endIndex = typedStartParam.StartIndex + typedStartParam.Count;
        float time = typedStartParam.Time;
        for (int i = typedStartParam.StartIndex; i < endIndex; ++i)
        {
            UpdateProjectile(ref projectileStructs[i], time);
        }
    }

    // Thread entry point: updates one chunk of the class array
    static void UpdateProjectilesClass(object startParam)
    {
        ThreadStartParamClass typedStartParam = (ThreadStartParamClass)startParam;
        ProjectileClass[] projectileClasses = typedStartParam.ProjectileClasses;
        int endIndex = typedStartParam.StartIndex + typedStartParam.Count;
        float time = typedStartParam.Time;
        for (int i = typedStartParam.StartIndex; i < endIndex; ++i)
        {
            UpdateProjectile(projectileClasses[i], time);
        }
    }

    static void UpdateProjectile(ref ProjectileStruct projectile, float time)
    {
        projectile.Position += projectile.Velocity * time;
    }

    static void UpdateProjectile(ProjectileClass projectile, float time)
    {
        projectile.Position += projectile.Velocity * time;
    }

    // Fisher-Yates shuffle
    public static void Shuffle<T>(T[] list)
    {
        System.Random random = new System.Random();
        for (int n = list.Length; n > 1; )
        {
            n--;
            int k = random.Next(n + 1);
            T value = list[k];
            list[k] = list[n];
            list[n] = value;
        }
    }
}
```
If you want to try out the test yourself, simply paste the above code into a TestScript.cs file in your Unity project’s Assets directory and attach it to the main camera game object in a new, empty project. Go to Player Settings and change Scripting Backend to IL2CPP. I ran it that way on this machine:
- LG Nexus 5X
- Android 7.1.2
- Unity 5.6.0f3
And here are the results I got:
| Type | Time (ms) | Threaded Time (ms) |
|---|---|---|
| Struct | 296 | 123 |
| Class | 3478 | 583 |
Using all six cores on the testing device has yielded a massive speedup for both structs and classes! Structs are about 2.4x faster (296 ms down to 123 ms) and classes are about 6x faster (3478 ms down to 583 ms)! Structs still hold the performance advantage over classes due to their better use of the CPU cache. Not waiting on slow RAM access and making good use of the CPU’s caches leaves the threaded structs version about 4.7x faster than the threaded classes version.
Overall, using both CPU caching (via structs) and the CPU’s cores (via threads) has led us to a 28x performance boost, from 3478 ms for single-threaded classes down to 123 ms for threaded structs!
At this point I should mention that both structs and threads are just tools in your toolbox. You should understand how they work so you can put them to use for appropriate tasks. If you do, you could write way faster code!
#1 by Liam on May 15th, 2017 ·
Interesting. Do you know if Unity performs any of these kinds of optimisations itself – like the Flex SDK’s compile time optimisations on AS3?
#2 by jackson on May 15th, 2017 ·
Unity itself will use multiple threads for its own work in certain circumstances, but it won’t automatically run your code using multiple threads.
#3 by Kaiyum on May 16th, 2017 ·
Great article. But we can not use any unity API on threaded code, can we? Because Unity API is not thread safe.
#4 by jackson on May 16th, 2017 ·
That’s not always the case. In the article I used Vector3 from other threads without issue. There are areas of the Unity API that are thread-safe and there are areas that aren’t thread-safe.
#5 by Kaiyum on May 16th, 2017 ·
A list of the thread-safe things would be awesome. And about the Vector3 part, did you try modifying the x, y, z components individually from threaded code? As far as I know, the Mathf library is pretty OK with threaded code, though I do not really consider it Unity API. It looks more like the standard math library found in any framework.
#6 by jackson on May 17th, 2017 ·
No, I didn’t try modifying the components individually. Let me know if you ever find a list of thread-safe Unity APIs. That could sure come in handy!
#7 by Frenzy on May 17th, 2017 ·
I like your articles a lot, I always learn something new.
But this post’s conclusion is pretty misleading.
This code runs on 10,000,000 items. When you come down to a more reasonable and common number like 5,000 to 10,000 iterations, the results start to shift back in favor of non-threaded code, because you didn’t mention the cost of thread creation and join.
#8 by jackson on May 17th, 2017 ·
I didn’t test at small numbers, only the 10 million used in the previous article. You have an excellent point about overhead though. If you expect to start and terminate threads rapidly you’re in for a surprise. That certainly isn’t the way to handle small tasks as you’ll spend way more time spinning up and down threads than doing the actual work. Instead, what you should do is create a “thread pool” of sorts and schedule “tasks” for those threads to execute. That’s a common approach used to avoid the overhead of thread start and join costs while still taking advantage of the CPU’s cores.
#9 by Jeremiah on January 19th, 2018 ·
Hey, thanks. I implemented this in my terrain generator and it is now way faster. As for the above, there certainly is overhead for smaller operations that run frequently, and as for your response, the threadpool is the way to solve it. I did some googling on how to join threadpools, and it was pretty easy. So if you wanted to update your code to use a threadpool, it’d be trivial and helpful for other people!
The biggest change from your code is that threadpools don’t have a join, so you use a CountdownEvent object instead. You can initialize the countdown event object’s count to be the number of threads you have, then at the end of every thread function you can do “countdownObject.Signal()” which decrements the count in the countdown. Then, instead of writing a for loop to join all the threads, you just use countdownObject.Wait() which blocks the main thread until the countdownObject’s internal counter is 0. After that, you can reset the countdownObject’s count back to its initial count with countdownObject.Reset(), preparing it for the next run.
Implementing this made multithreading faster for me in all cases (when my terrain was lower quality and had fewer verts, singlethreading used to be faster because of thread creation overhead, but now multithreading is plainly always faster). The very first time you run the threads might be slow because threads have to be created the first time, but every other time after that will be quick!
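The approach described in this comment can be sketched roughly as follows. This is a minimal standalone example, not the commenter’s actual terrain code: the `PoolDemo` class and its `AddToAll` method are hypothetical names, and it assumes a runtime that provides `System.Threading.CountdownEvent` (.NET 4.0 or later). For simplicity it creates a fresh countdown per call rather than reusing one with `Reset()`.

```csharp
using System;
using System.Threading;

static class PoolDemo
{
    // Adds delta to every element by queueing one work item per chunk
    // on the shared thread pool, then waiting on a CountdownEvent.
    public static void AddToAll(float[] values, float delta, int numChunks)
    {
        // Round up so the final chunk picks up any leftover elements
        int perChunk = (values.Length + numChunks - 1) / numChunks;

        using (CountdownEvent done = new CountdownEvent(numChunks))
        {
            for (int c = 0; c < numChunks; ++c)
            {
                // Locals declared per iteration so each work item gets its own range
                int start = Math.Min(c * perChunk, values.Length);
                int end = Math.Min(start + perChunk, values.Length);

                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try
                    {
                        for (int i = start; i < end; ++i)
                        {
                            values[i] += delta;
                        }
                    }
                    finally
                    {
                        // Signal even if the work throws, so Wait() cannot hang
                        done.Signal();
                    }
                });
            }

            // Takes the place of the Join() loop: blocks until every chunk has signaled
            done.Wait();
        }
    }
}
```

Because the pool’s threads are reused between calls, repeated calls avoid paying thread creation costs each time, which is the overhead discussed in the comments above.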
#10 by jackson on January 23rd, 2018 ·
I’m glad to hear your terrain generator is running faster now that you’re putting the CPU’s cores to work with threads! I’ll quite likely be writing an article on thread pools since, as you found, they’re quite helpful. If you used one from an existing library, can you link me to it?
#11 by ZJP on May 24th, 2017 ·
“…the results are start to shift back in favor of non threaded code cause you didn’t mention cost for thread creation and join…”
Have a look here : https://forum.unity3d.com/threads/c-optimized-and-advanced-classic-threads-for-unity-option-with-setthreadaffinitymask.463307/
#12 by ThoMue on July 3rd, 2017 ·
Great Blog, great work. Love to see guys that talk about the real shit ;)
One mention on your code Sample:
shouldn’t class ThreadStartParamStruct { ... } be struct ThreadStartParamStruct { ... }?
Didn’t test your sample on my machine, but if you ran the code with this minor typo your results may be falsified.
Love your work!
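The difference this comment points at can be shown without any threads. `Thread.Start(object)` takes an `object`, so a class instance is passed by reference while a struct is boxed into a copy at the moment of the call. The sketch below is a hypothetical illustration (the `ParamDemo`, `ParamClass`, and `ParamStruct` names are invented here) that imitates the article’s pattern of handing over a parameter and then advancing `StartIndex`.

```csharp
using System;

static class ParamDemo
{
    class ParamClass { public int StartIndex; }
    struct ParamStruct { public int StartIndex; }

    // Class parameter: the receiver holds the same object the caller keeps mutating.
    public static int SeenViaClass()
    {
        ParamClass param = new ParamClass();
        object handedToThread = param;   // reference only, no copy is made
        param.StartIndex += 100;         // caller advances the index afterwards
        return ((ParamClass)handedToThread).StartIndex;
    }

    // Struct parameter: converting to object boxes a copy, so later changes are not seen.
    public static int SeenViaStruct()
    {
        ParamStruct param = new ParamStruct();
        object handedToThread = param;   // boxing copies the current field values
        param.StartIndex += 100;         // only the caller's local copy changes
        return ((ParamStruct)handedToThread).StartIndex;
    }
}
```

Here `SeenViaClass()` returns 100 and `SeenViaStruct()` returns 0. With a class parameter, a thread that starts running after the caller has advanced `StartIndex` would see the advanced value. A struct parameter, or a separate parameter object per thread, avoids that.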