The last time we looked at performance was way back in part four of the series. Ever since then we’ve been relentlessly adding more and more features to the C++ scripting system. So today we’ll take a break from feature additions to improve the system’s performance in a couple of key areas.

Back in part four we validated the system’s performance with some benchmarking to prove that all these calls between C# and C++ weren’t adding significant overhead. Any such overhead offsets whatever performance we gain by scripting in C++, such as avoiding the garbage collector and IL2CPP overhead altogether, so it’s best to keep it as small as possible.

Many features were added to the system in the 15 articles since that performance test was done. While we’ve paid attention to performance all along, some areas could inevitably use some work. The first such area relates to how we pass parameters in C++. Specifically, when we pass a type like String that derives from Object, we pass by value. Here’s a simple example:

struct Debug
{
	// Function that takes an Object
	static void Log(Object message);
};
 
// Call the function
String message("hello");
Debug::Log(message);

Passing by value introduces some overhead that would be best avoided. Copying an Object doesn’t move much data, since these types are typically only a few bytes in size. Instead, the issue is that the class’ copy constructor is implicitly invoked on every call. Recall that the copy constructor looks like this:

String::String(const String& other)
	: Object(Plugin::InternalUse::Only, other.Handle)
{
	if (Handle)
	{
		Plugin::ReferenceManagedClass(Handle);
	}
}

There’s one if here that can trip up the CPU’s branch predictor, but the bigger issue lies within ReferenceManagedClass:

void ReferenceManagedClass(int32_t handle)
{
	if (handle != 0)
	{
		RefCountsClass[handle]++;
	}
}

There’s one more if, but the big issue is that this function reads from the global RefCountsClass array, which may not be in the CPU’s data cache. That happens when the reference count hasn’t been accessed recently, such as for an object that was created long ago. In that case, the “cache miss” stalls the CPU for about 100 nanoseconds. That doesn’t sound like a long time until you remember that a 1 GHz processor runs one cycle per nanosecond. That means this single cache miss takes as long as executing dozens or even hundreds of instructions!

So we want to avoid accessing the reference counts array if possible. After all, there isn’t any point to incrementing the reference count when calling the function only to decrement it when the function returns. Working around this is relatively straightforward. If we switch to passing Object types by reference then no copy is made and there’s no need for the reference counting.

Here’s how the new version of Debug::Log looks:

struct Debug
{
	// Function that takes a reference to an Object
	static void Log(Object& message);
};
 
// Call the function
String message("hello");
Debug::Log(message);

The only difference is that the function takes an Object& instead of an Object. There’s no change for the caller since C++ automatically binds the reference to the variable. The one drawback is that a non-const reference requires an “lvalue”, which is roughly something that has a name. We can’t pass an “rvalue” like this:

Debug::Log(String("hello"));

The reason is that the String doesn’t have a name. C++ has some ways around this, but they lead to much more complicated code that’s best avoided for now. So we’ll just give our Object-typed variables a name, such as by saving them to local variables, and that lets us skip the potential cache misses.
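Here’s that workaround as a self-contained toy sketch. The simplified Object and String here use std::string just for illustration (the real types wrap managed handles), and the Log function is a hypothetical stand-in for the bindings above:

```cpp
#include <string>

// Simplified stand-ins for the real handle-wrapping binding types
struct Object { std::string Value; };
struct String : Object { String(const char* s) { Value = s; } };

// Hypothetical function taking an Object by reference
static std::string lastLogged;
void Log(Object& message) { lastLogged = message.Value; }

void LogHello()
{
	// Log(String("hello")); // error: an rvalue can't bind to Object&

	// The workaround: name the temporary so it becomes an lvalue
	String message("hello");
	Log(message);
}
```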

Keep in mind that this is only necessary for Object types because they’re the only ones that are reference-counted. We still pass primitives (e.g. int) and enums (e.g. KeyCode) by value and “full structs” (e.g. Vector3) by reference as we did before.
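To summarize the calling conventions, here’s a hedged sketch. These functions and stub types are purely illustrative (they aren’t part of the real bindings); only the parameter-passing styles match the conventions described above:

```cpp
#include <cstdint>

// Stub stand-ins for the real binding types
enum class KeyCode : int32_t { Space = 32 };
struct Vector3 { float x, y, z; };
struct Object { int32_t Handle; };

// Primitives and enums: by value, as before
int32_t Double(int32_t value) { return value * 2; }
bool IsSpace(KeyCode key) { return key == KeyCode::Space; }

// "Full structs": by reference, as before
float SumComponents(Vector3& vec) { return vec.x + vec.y + vec.z; }

// Object types: now also by reference, skipping the copy
// constructor and its reference-count bookkeeping
int32_t GetHandle(Object& obj) { return obj.Handle; }
```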

The next optimization relates to managed arrays. These are the arrays that C# deals with, not the arrays that C++ has. We implemented these back in part 14 and upgraded them in part 18, but they could use a performance upgrade. Consider looping over a managed array in C++:

// Make the array
Array1<float> floats(1000);
 
// Fill in the array with 0.0, 1.0, 2.0, etc.
for (int32_t i = 0; i < floats.GetLength(); ++i)
{
	floats[i] = (float)i;
}

This results in 1000 calls to Array1<float>::GetLength(), which causes 1000 calls from C++ into C#. That’s a lot of expensive work to get a value that cannot change by definition. So far it’s been easy to work around by caching the length in a local variable:

for (int32_t i = 0, len = floats.GetLength(); i < len; ++i)
{
	floats[i] = (float)i;
}

This works great for one loop, but each loop has to call into C# to get the length. Instead, we can cache that length in the C++ Array class itself so it only needs to be fetched once. Then we can check the local cache in GetLength and avoid the call into C# if we already have the length cached. Here’s how that looks:

template<> struct Array1<float> : Array
{
	// The cached length
	int32_t InternalLength;
 
	// Create a new array
	Array1(int32_t length)
	{
		// Cache the length
		InternalLength = length;
 
		// Call into C# to create the array
		Handle = ArrayFloat1Constructor(length);
	}
 
	// Create a null array
	Array1(std::nullptr_t n)
	{
		// A zero length means we don't have the length cached
		InternalLength = 0;
 
		// A zero handle means this is null
		Handle = 0;
	}
 
	int32_t GetLength()
	{
		// We don't have the length cached
		if (InternalLength == 0)
		{
			// Call into C# to get the length
			// Cache the result
			InternalLength = ArrayFloat1GetLength(Handle);
		}
 
		return InternalLength;
	}
};

The API is exactly the same as it was before, so both loops will still work just fine. The first version, which didn’t cache the length in a local variable, now does 1000 if checks to make sure the length is cached but makes only a single call into C# to get it. That’s a huge speedup! The version that does cache the length in a local variable goes one step further: it does just one if check and then reads the local variable. That’s still the faster version, but it’s now only a minor improvement rather than the difference between one call into C# and a thousand.
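To see the effect in isolation, here’s a runnable sketch where a counter stands in for the C++-to-C# boundary. The ArrayFloat1GetLength stub and the stripped-down Array1Float are illustrative, not the real generated bindings:

```cpp
#include <cstdint>

// Counts how many times we cross the (stubbed) language boundary
static int32_t callsIntoCSharp = 0;

// Stand-in for the real C++-to-C# call
int32_t ArrayFloat1GetLength(int32_t handle)
{
	callsIntoCSharp++;
	return 1000; // pretend the managed array has 1000 elements
}

// Stripped-down version of the caching Array1 above
struct Array1Float
{
	int32_t Handle;
	int32_t InternalLength;

	Array1Float(int32_t handle) : Handle(handle), InternalLength(0) {}

	int32_t GetLength()
	{
		// Zero means "not cached yet": fetch once, then reuse
		if (InternalLength == 0)
		{
			InternalLength = ArrayFloat1GetLength(Handle);
		}
		return InternalLength;
	}
};

// The uncached loop style: 1000 calls to GetLength,
// but only one boundary crossing thanks to the cache
int32_t CountIterations(Array1Float& floats)
{
	int32_t iterations = 0;
	for (int32_t i = 0; i < floats.GetLength(); ++i)
	{
		iterations++;
	}
	return iterations;
}
```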

Now we can take this caching strategy and apply it a couple more times with multidimensional arrays. Consider a 2D array of float. It has both a GetLength() to get the overall length and a GetLength(int32_t) to get the length of one dimension. We just need to add an array of cached dimension lengths, since those are constant too.

template<> struct Array2<float> : Array
{
	// The cached overall length
	int32_t InternalLength;
 
	// The cached lengths for each dimension
	int32_t InternalLengths[2];
 
	// Create a new array
	Array2(int32_t length0, int32_t length1)
	{
		// Cache the overall length
		InternalLength = length0 * length1;
 
		// Cache the lengths of each dimension
		InternalLengths[0] = length0;
		InternalLengths[1] = length1;
 
		// Call into C# to create the array
		Handle = ArrayFloat2Constructor(length0, length1);
	}
 
	// Create a null array
	Array2(std::nullptr_t n)
	{
		// A zero length means we don't have the length cached
		InternalLength = 0;
		InternalLengths[0] = 0;
		InternalLengths[1] = 0;
 
		// A zero handle means this is null
		Handle = 0;
	}
 
	int32_t GetLength(int32_t dimension)
	{
		// We don't have the length cached
		int32_t length = InternalLengths[dimension];
		if (length == 0)
		{
			// Call into C# to get the length
			// Cache the result
			length = ArrayFloat2GetLength(Handle, dimension);
			InternalLengths[dimension] = length;
		}
 
		return length;
	}
};
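One nice consequence: an array constructed on the C++ side caches its dimension lengths up front, so GetLength(dimension) never needs to call into C# at all. Here’s a sketch with a stubbed boundary call, again using hypothetical names rather than the real bindings:

```cpp
#include <cstdint>

// Counts stubbed C++-to-C# boundary crossings
static int32_t callsIntoCSharp = 0;
int32_t ArrayFloat2GetLength(int32_t handle, int32_t dimension)
{
	callsIntoCSharp++;
	return dimension == 0 ? 10 : 20; // pretend the array is 10x20
}

// Stripped-down version of the caching Array2 above
// (handle management omitted)
struct Array2Float
{
	int32_t Handle;
	int32_t InternalLengths[2];

	// Constructed in C++: both dimension lengths cached immediately
	Array2Float(int32_t length0, int32_t length1) : Handle(1)
	{
		InternalLengths[0] = length0;
		InternalLengths[1] = length1;
	}

	int32_t GetLength(int32_t dimension)
	{
		// Zero means "not cached yet": fetch once, then reuse
		int32_t length = InternalLengths[dimension];
		if (length == 0)
		{
			length = ArrayFloat2GetLength(Handle, dimension);
			InternalLengths[dimension] = length;
		}
		return length;
	}
};
```

Only arrays received from C# (starting as null and assigned later) ever take the lazy-fetch path.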

Finally, there’s one more constant in arrays that we can optimize: GetRank. This returns the number of dimensions in the array, which is known at compile time. There’s simply no need to ever call into C# to get this, so we can modify these functions to just return a constant:

template<> struct Array2<float> : Array
{
	int32_t GetRank()
	{
		return 2;
	}
};

Compilers will almost certainly inline this function to just a constant, so it’s essentially free to call it now.

Those are the optimizations we’ll cover today. There’s always more that can be done, but the steps we’ve taken today should solve some of the bigger performance issues that could arise and lead to even lower overhead when using C++ scripting. As usual, this is all available now on the GitHub project if you’d like to try it out or see all the details about how it was implemented.