How SharedStatic Works
Burst 1.2 comes with a new feature: SharedStatic<T>. This allows us to write to static variables from within Burst-compiled code like jobs and function pointers. Today we’ll look at how this is implemented by Burst and IL2CPP. We’ll also run a performance test to see how fast it is.
We’ll be using the simple SharedStatic<int> example from the Burst docs:
public abstract class MutableStaticTest
{
    public static readonly SharedStatic<int> IntField =
        SharedStatic<int>.GetOrCreate<MutableStaticTest, IntFieldKey>();

    // Define a Key type to identify IntField
    private class IntFieldKey {}
}
To write to this SharedStatic<int>, we’ll create a Burst-compiled job:
[BurstCompile(CompileSynchronously = true)]
struct SharedStaticJob : IJob
{
    public void Execute()
    {
        MutableStaticTest.IntField.Data = 5;
    }
}
Let’s use Burst 1.2.3’s Burst Inspector to look at the assembly for SharedStaticJob in Unity 2019.3.9f1:
; Execute
        lea     rax, [rip + .L0$pb]
        movabs  rcx, offset _GLOBAL_OFFSET_TABLE_-.L0$pb
        add     rcx, rax
        movabs  rax, offset ".LUnity.Burst.SharedStatic`1<System.Int32> MutableStaticTest::IntField"@GOTOFF
        mov     rax, qword ptr [rcx + rax]
        mov     dword ptr [rax], 5
        ret

; Unity.Jobs.IJobExtensions.JobStruct`1<SharedStaticJob>.Execute
        push    r15
        push    r14
        push    rbx
        sub     rsp, 16
        mov     r14, rdi
.L1$pb:
        lea     rax, [rip + .L1$pb]
        movabs  rbx, offset _GLOBAL_OFFSET_TABLE_-.L1$pb
        add     rbx, rax
        movabs  rdi, offset .Lburst_abort.function.string@GOTOFF
        add     rdi, rbx
        call    r14
        movabs  r15, offset .Lburst_abort_Ptr@GOTOFF
        mov     qword ptr [rbx + r15], rax
        movabs  rdi, offset ".LUnity.Burst.LowLevel.BurstCompilerService::GetOrCreateSharedMemory.function.string"@GOTOFF
        add     rdi, rbx
        call    r14
        movabs  rcx, offset .LCPI1_0@GOTOFF
        movaps  xmm0, xmmword ptr [rbx + rcx]
        movaps  xmmword ptr [rsp], xmm0
        mov     rdi, rsp
        mov     esi, 4
        mov     edx, 4
        call    rax
        mov     r14, rax
        test    rax, rax
        jne     .LBB1_2
# %bb.1:                                # %BL.0035.i.i.i
        movabs  rdi, offset .Lburst_abort.error.id.1@GOTOFF
        add     rdi, rbx
        movabs  rsi, offset .Lburst_abort.error.message.2@GOTOFF
        add     rsi, rbx
        call    qword ptr [rbx + r15]
.LBB1_2:                                # %"MutableStaticTest..cctor()_2E87130171D178A4.exit"
        movabs  rax, offset ".LUnity.Burst.SharedStatic`1<System.Int32> MutableStaticTest::IntField"@GOTOFF
        mov     qword ptr [rbx + rax], r14
        add     rsp, 16
        pop     rbx
        pop     r14
        pop     r15
        ret
The actual writing of the SharedStatic<int> value takes six instructions before the ret, mostly to look up the address to write to in a table named _GLOBAL_OFFSET_TABLE_.
However, there is also the IJobExtensions.JobStruct.Execute that implicitly wraps the SharedStaticJob.Execute call. That involves a lot more code. It also deals with the _GLOBAL_OFFSET_TABLE_ table, calls GetOrCreateSharedMemory, aborts with an error if that call returns null, and otherwise stores the returned pointer in the IntField slot that Execute reads from.
Next, let’s write a regular static class that isn’t compiled by Burst:
static class TestSharedStatic
{
    public static void Write()
    {
        MutableStaticTest.IntField.Data = 1;
    }
}
To see how IL2CPP writes to the SharedStatic<int>, let’s make a macOS build and look at the C++ that IL2CPP outputs. We’ll start in Assembly-CSharp.cpp with TestSharedStatic.Write (mild formatting for line length by Jackson):
IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR void
TestSharedStatic_Write_m1A0666B6B553B746BA9FE2C60BB0BA3A3BBDBC09 (
    const RuntimeMethod* method)
{
    static bool s_Il2CppMethodInitialized;
    if (!s_Il2CppMethodInitialized)
    {
        il2cpp_codegen_initialize_method (TestSharedStatic_Write_m1A0666B6B553B746BA9FE2C60BB0BA3A3BBDBC09_MetadataUsageId);
        s_Il2CppMethodInitialized = true;
    }
    {
        // MutableStaticTest.IntField.Data = 1;
        IL2CPP_RUNTIME_CLASS_INIT(MutableStaticTest_tC2B1BC83B9404FFF95FD83208F03C7599D0A899F_il2cpp_TypeInfo_var);
        int32_t* L_0 = SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C(
            (SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C *)(
                ((MutableStaticTest_tC2B1BC83B9404FFF95FD83208F03C7599D0A899F_StaticFields*)
                    il2cpp_codegen_static_fields_for(
                        MutableStaticTest_tC2B1BC83B9404FFF95FD83208F03C7599D0A899F_il2cpp_TypeInfo_var))
                ->get_address_of_IntField_0()),
            /*hidden argument*/SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C_RuntimeMethod_var);
        *((int32_t*)L_0) = (int32_t)1;
        // }
        return;
    }
}
Here we see the usual method initialization overhead followed by a call to the SharedStatic<T>.Data property getter to get a pointer to the int. That pointer is then dereferenced and set to 1.
Let’s dive into the SharedStatic<T>.Data property in Generics17.cpp to see what it does:
inline int32_t* SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C (
    SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C * __this,
    const RuntimeMethod* method)
{
    return ((
        int32_t* (*) (
            SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C *,
            const RuntimeMethod*))
        SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C_gshared
        )(__this, method);
}
Apparently, it’s just a wrapper around another function call. Let’s jump to that function, also in Generics17.cpp:
IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR int32_t*
SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C_gshared (
    SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C * __this,
    const RuntimeMethod* method)
{
    {
        // return ref Unsafe.AsRef<T>(_buffer);
        void* L_0 = (void*)__this->get__buffer_0();
        int32_t* L_1 = ((
            int32_t* (*) (void*, const RuntimeMethod*))
            IL2CPP_RGCTX_METHOD_INFO(
                InitializedTypeInfo(method->klass)->rgctx_data,
                0)->methodPointer)
            ((void*)(
                void*)L_0,
                /*hidden argument*/IL2CPP_RGCTX_METHOD_INFO(
                    InitializedTypeInfo(method->klass)->rgctx_data,
                    0));
        return (int32_t*)(L_1);
    }
}
First, let’s check the IL2CPP_RGCTX_METHOD_INFO macro in il2cpp-codegen-il2cpp.h:
#define IL2CPP_RGCTX_METHOD_INFO(rgctxVar, index) (rgctxVar[index].method)
Now we can jump down to InitializedTypeInfo in the same file:
inline RuntimeClass* InitializedTypeInfo(RuntimeClass* klass)
{
    il2cpp::vm::ClassInlines::InitFromCodegen(klass);
    return klass;
}
This calls into some hand-written (not generated) code in ClassInlines.h:
// This function is critical for performance, before optimization it
// caused up to 20% of all CPU usage in code generated by il2cpp
static IL2CPP_FORCE_INLINE bool InitFromCodegen(Il2CppClass *klass)
{
    if (klass->initialized_and_no_error)
        return true;

    return InitFromCodegenSlow(klass);
}
The initialized_and_no_error flag will usually be set, avoiding the need to call the slow version of the function. Just to see what that looks like, let’s peek at ClassInlines.cpp:
bool ClassInlines::InitFromCodegenSlow(Il2CppClass *klass)
{
    bool result = Class::Init(klass);

    if (klass->has_initialization_error)
        il2cpp::vm::Exception::Raise((Il2CppException*)gc::GCHandle::GetTarget(klass->initializationExceptionGCHandle));

    return result;
}
Finally, let’s jump back to TestSharedStatic.Write and see what il2cpp_codegen_static_fields_for does. It’s in il2cpp-codegen-il2cpp.h:
inline void* il2cpp_codegen_static_fields_for(RuntimeClass* klass)
{
    return klass->static_fields;
}
Taken all together, there’s nothing that surprising here. Most writes should just check a couple of flags (is the method initialized, is the class initialized), skip the expensive initialization work, make one indirect call through the generic method table to get the pointer, and then set the shared value by dereferencing that pointer.
Now let’s compare the SharedStatic<int> job above with a job that writes to a NativeArray<int>:
[BurstCompile(CompileSynchronously = true)]
struct NativeArrayJob : IJob
{
    public NativeArray<int> Array;

    public void Execute()
    {
        Array[0] = 5;
    }
}
Here’s what Burst Inspector shows:
; Execute
        mov     rax, qword ptr [rdi]
        mov     dword ptr [rax], 5
        ret

; Unity.Jobs.IJobExtensions.JobStruct`1<NativeArrayJob>.Execute
        ret
The Execute portion is down from six instructions to just two because there’s no need to look up the pointer in a table: it is loaded directly from the job struct. The IJobExtensions.JobStruct.Execute wrapper is completely cleared out as no table needs to be set up and no errors need to be handled.
To further compare, let’s write a job that uses a raw int* pointer:
[BurstCompile(CompileSynchronously = true)]
unsafe struct PointerJob : IJob
{
    [NativeDisableUnsafePtrRestriction] public int* Pointer;

    public void Execute()
    {
        *Pointer = 5;
    }
}
Here’s what Burst Inspector shows:
; Execute
        mov     rax, qword ptr [rdi]
        mov     dword ptr [rax], 5
        ret

; Unity.Jobs.IJobExtensions.JobStruct`1<PointerJob>.Execute
        ret
This is the exact same output as with NativeArray<int>.
Finally for today, let’s write a quick performance comparison between these three job types. We’ll execute a bunch of each job and check how long each took to run:
using System.Diagnostics;                // Stopwatch
using Unity.Collections;                 // NativeArray, Allocator
using Unity.Collections.LowLevel.Unsafe; // GetUnsafePtr
using Unity.Jobs;                        // Run
using UnityEngine;                       // MonoBehaviour

class TestScript : MonoBehaviour
{
    unsafe void Start()
    {
        const int reps = 100000;

        TestSharedStatic.Write();

        NativeArray<int> array = new NativeArray<int>(1, Allocator.TempJob);
        SharedStaticJob sharedStaticJob = new SharedStaticJob();
        NativeArrayJob nativeArrayJob = new NativeArrayJob { Array = array };
        PointerJob pointerJob = new PointerJob { Pointer = (int*)array.GetUnsafePtr() };

        // Warmup
        sharedStaticJob.Run();
        nativeArrayJob.Run();
        pointerJob.Run();

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < reps; ++i)
        {
            sharedStaticJob.Run();
        }
        long sharedStaticTicks = sw.ElapsedTicks;

        sw.Restart();
        for (int i = 0; i < reps; ++i)
        {
            nativeArrayJob.Run();
        }
        long nativeArrayTicks = sw.ElapsedTicks;

        sw.Restart();
        for (int i = 0; i < reps; ++i)
        {
            pointerJob.Run();
        }
        long pointerTicks = sw.ElapsedTicks;

        array.Dispose();

        print(
            "Job,Ticks\n" +
            "SharedStatic," + sharedStaticTicks + "\n" +
            "NativeArray," + nativeArrayTicks + "\n" +
            "Pointer," + pointerTicks);
    }
}
I ran the test in this environment:
- 2.7 GHz Intel Core i7-6820HQ
- macOS 10.15.3
- Unity 2019.3.9f1
- Burst package 1.2.3
- macOS Standalone
- .NET 4.x scripting runtime version and API compatibility level
- IL2CPP
- Non-development
- 640×480, Fastest, Windowed
And here are the results I got:
Job | Ticks |
---|---|
SharedStatic | 267909 |
NativeArray | 243741 |
Pointer | 243402 |
SharedStatic<T> takes about 10% longer than both NativeArray<T> and raw pointers in this test, which seems about right given how little work each job is doing and how great a share of the time is therefore spent in job system overhead.
In conclusion, SharedStatic<T> adds a little overhead compared to raw pointers or NativeArray<T>: a table lookup in Burst-compiled code and a few flag checks under IL2CPP. That came to about 10% in this test, where the jobs do almost no other work, so it’s quite unlikely to cause any serious performance problems.
#1 by B Porter on April 13th, 2020 ·
Currently digesting this article, it’s awesome. Just set up the environment.
I noticed a few things. I couldn’t figure out the combination of Jobs/Collections that worked with Burst 1.2.3, so I used Jobs (Preview .3 0.2.8), Collections (Preview.3 0.7.1), and Burst (Preview.9 1.3.0 ).
Also, in the text above, it appears the blog is removing the slashes from your escaped “\n”s in the print() function.
My numbers (Similar to yours but SharedStatic is a lot closer to the others):
SharedStatic: 183448
NativeArray: 176779
Pointers: 174767
(Used 18 core Intel-i9, Windows 10)
A noobish question: I output my results using Text mesh pro. How did you get the print() output in your IL2CPP build – without outputting it to the screen?
#2 by jackson on April 14th, 2020 ·
Thanks for pointing out the unescaped newlines. I updated the article with a fix.
As for print, it just calls Debug.Log so the output is located in your Player log file.
#3 by Arman on April 14th, 2020 ·
Would be interesting to see how the asynchronous job execution influences the compiled instructions as well as the performance, even if just to confirm that it doesn’t change much. I am curious to see if it adds overhead for shared static.
#4 by jackson on April 14th, 2020 ·
The job is compiled to the same instructions regardless of whether it’s executed synchronously or asynchronously. Performance would definitely be better with asynchronous execution, but it’s not really controllable enough for a performance test like in this article.