Burst 1.2 comes with a new feature: SharedStatic<T>. This allows us to write to static variables from within Burst-compiled code like jobs and function pointers. Today we’ll look at how this is implemented by Burst and IL2CPP. We’ll also run a performance test to see how fast it is.

We’ll be using the simple SharedStatic<int> example from the Burst docs.

public abstract class MutableStaticTest
{
    public static readonly SharedStatic<int> IntField =
        SharedStatic<int>.GetOrCreate<MutableStaticTest, IntFieldKey>();
 
    // Define a Key type to identify IntField
    private class IntFieldKey {}
}

To read from and write to this SharedStatic<int>, we’ll create a Burst-compiled job:

[BurstCompile(CompileSynchronously = true)]
struct SharedStaticJob : IJob
{
    public void Execute()
    {
        MutableStaticTest.IntField.Data = 5;
    }
}

Let’s use Burst 1.2.3’s Burst Inspector to look at the assembly for SharedStaticJob in Unity 2019.3.9f1:

; Execute
        lea        rax, [rip + .L0$pb]
        movabs        rcx, offset _GLOBAL_OFFSET_TABLE_-.L0$pb
        add        rcx, rax
        movabs        rax, offset ".LUnity.Burst.SharedStatic`1<System.Int32> MutableStaticTest::IntField"@GOTOFF
        mov        rax, qword ptr [rcx + rax]
        mov        dword ptr [rax], 5
        ret
 
; Unity.Jobs.IJobExtensions.JobStruct`1<SharedStaticJob>.Execute
        push        r15
        push        r14
        push        rbx
        sub        rsp, 16
        mov        r14, rdi
.L1$pb:
        lea        rax, [rip + .L1$pb]
        movabs        rbx, offset _GLOBAL_OFFSET_TABLE_-.L1$pb
        add        rbx, rax
        movabs        rdi, offset .Lburst_abort.function.string@GOTOFF
        add        rdi, rbx
        call        r14
        movabs        r15, offset .Lburst_abort_Ptr@GOTOFF
        mov        qword ptr [rbx + r15], rax
        movabs        rdi, offset ".LUnity.Burst.LowLevel.BurstCompilerService::GetOrCreateSharedMemory.function.string"@GOTOFF
        add        rdi, rbx
        call        r14
        movabs        rcx, offset .LCPI1_0@GOTOFF
        movaps        xmm0, xmmword ptr [rbx + rcx]
        movaps        xmmword ptr [rsp], xmm0
        mov        rdi, rsp
        mov        esi, 4
        mov        edx, 4
        call        rax
        mov        r14, rax
        test        rax, rax
        jne        .LBB1_2
# %bb.1:                                # %BL.0035.i.i.i
        movabs        rdi, offset .Lburst_abort.error.id.1@GOTOFF
        add        rdi, rbx
        movabs        rsi, offset .Lburst_abort.error.message.2@GOTOFF
        add        rsi, rbx
        call        qword ptr [rbx + r15]
.LBB1_2:                                # %"MutableStaticTest..cctor()_2E87130171D178A4.exit"
        movabs        rax, offset ".LUnity.Burst.SharedStatic`1<System.Int32> MutableStaticTest::IntField"@GOTOFF
        mov        qword ptr [rbx + rax], r14
        add        rsp, 16
        pop        rbx
        pop        r14
        pop        r15
        ret

The actual write of the SharedStatic<int> value takes six instructions before the ret, mostly to look up the address to write to by way of a table named _GLOBAL_OFFSET_TABLE_.

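However, there is also the IJobExtensions.JobStruct.Execute that implicitly wraps the SharedStaticJob.Execute call. That involves a lot more code. It also goes through _GLOBAL_OFFSET_TABLE_, and judging by the labels it contains the inlined MutableStaticTest static constructor: it calls GetOrCreateSharedMemory, aborts with an error if the result is null, and otherwise stores the returned address in the IntField slot that Execute reads from.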
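To make the division of labor between those two functions concrete, here’s a rough C++ model of what the assembly appears to do. Every name in it (Hash128, GetOrCreateSharedMemory, g_intFieldSlot, InitIntField, Execute) is made up for illustration, the key value is arbitrary, and the std::map registry merely stands in for whatever Unity does natively. The two 4s passed in esi and edx are presumably the size and alignment of an int, which is how the sketch treats them. It’s a sketch under those assumptions, not Burst’s implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>
#include <utility>

// Hypothetical 128-bit key; the assembly passes a 16-byte constant by pointer.
struct Hash128 { std::uint64_t lo, hi; };

// Stand-in registry: one zeroed block per key, created on first request.
// Assumption: the real lookup lives inside Unity, not in a std::map.
void* GetOrCreateSharedMemory(const Hash128& key, std::size_t size, std::size_t align)
{
    static std::map<std::pair<std::uint64_t, std::uint64_t>, void*> registry;
    void*& block = registry[{key.lo, key.hi}];
    if (block == nullptr)
    {
        // calloc zero-fills; the block intentionally lives for the whole process
        block = std::calloc(1, size);
        assert(block == nullptr ||
               reinterpret_cast<std::uintptr_t>(block) % align == 0);
    }
    return block;
}

// Models the global slot that Execute loads its pointer from.
static int* g_intFieldSlot = nullptr;

// Models the one-time setup seen in the JobStruct wrapper.
void InitIntField()
{
    const Hash128 key{0x1234, 0x5678}; // made-up key value
    void* block = GetOrCreateSharedMemory(key, sizeof(int), alignof(int));
    if (block == nullptr)
        std::abort(); // mirrors the burst_abort branch
    g_intFieldSlot = static_cast<int*>(block);
}

// Models SharedStaticJob.Execute: load the pointer from the slot, store through it.
void Execute()
{
    *g_intFieldSlot = 5;
}
```

Calling InitIntField a second time hands back the same block, so a value written by Execute survives it. That’s the "get or create" half of the name.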

Next, let’s write a regular static class that isn’t compiled by Burst:

static class TestSharedStatic
{
    public static void Write()
    {
        MutableStaticTest.IntField.Data = 1;
    }
}

To see how IL2CPP writes to the SharedStatic<int>, let’s make a macOS build and look at the C++ that IL2CPP outputs. We’ll start in Assembly-CSharp.cpp with TestSharedStatic.Write: (mild formatting for line length by Jackson)

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR void
TestSharedStatic_Write_m1A0666B6B553B746BA9FE2C60BB0BA3A3BBDBC09 (
    const RuntimeMethod* method)
{
    static bool s_Il2CppMethodInitialized;
    if (!s_Il2CppMethodInitialized)
    {
        il2cpp_codegen_initialize_method (TestSharedStatic_Write_m1A0666B6B553B746BA9FE2C60BB0BA3A3BBDBC09_MetadataUsageId);
        s_Il2CppMethodInitialized = true;
    }
    {
        // MutableStaticTest.IntField.Data = 1;
        IL2CPP_RUNTIME_CLASS_INIT(MutableStaticTest_tC2B1BC83B9404FFF95FD83208F03C7599D0A899F_il2cpp_TypeInfo_var);
        int32_t* L_0 = SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C(
            (SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C *)(
                ((MutableStaticTest_tC2B1BC83B9404FFF95FD83208F03C7599D0A899F_StaticFields*)
                    il2cpp_codegen_static_fields_for(
                        MutableStaticTest_tC2B1BC83B9404FFF95FD83208F03C7599D0A899F_il2cpp_TypeInfo_var))
                        ->get_address_of_IntField_0()),
                    /*hidden argument*/SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C_RuntimeMethod_var);
        *((int32_t*)L_0) = (int32_t)1;
        // }
        return;
    }
}

Here we see the usual method initialization overhead followed by a call to the SharedStatic.Data property to get a pointer to the int. That pointer is then dereferenced and set to 1.

Let’s dive into the SharedStatic.Data property in Generics17.cpp to see what it does:

inline int32_t*
SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C (
    SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C * __this,
    const RuntimeMethod* method)
{
    return ((  int32_t* (*) (
        SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C *,
        const RuntimeMethod*))
        SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C_gshared
        )(__this, method);
}

Apparently, it’s just a wrapper around another function call. Let’s jump to that function, also in Generics17.cpp:

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR int32_t*
SharedStatic_1_get_Data_m14AE2D6CC6A8E0C94485512D2C7D22C433B7EA0C_gshared (
    SharedStatic_1_t4816A3740ED422B03300036A1C23055FDA1FC77C * __this,
    const RuntimeMethod* method)
{
    {
        // return ref Unsafe.AsRef<T>(_buffer);
        void* L_0 = (void*)__this->get__buffer_0();
        int32_t* L_1 = ((  int32_t* (*) (void*, const RuntimeMethod*))
            IL2CPP_RGCTX_METHOD_INFO(
                InitializedTypeInfo(method->klass)->rgctx_data, 0)->methodPointer)
                ((void*)(
                    void*)L_0,
                    /*hidden argument*/IL2CPP_RGCTX_METHOD_INFO(
                        InitializedTypeInfo(method->klass)->rgctx_data,
                        0));
        return (int32_t*)(L_1);
    }
}

First, let’s check the IL2CPP_RGCTX_METHOD_INFO macro, which is in il2cpp-codegen-il2cpp.h:

#define IL2CPP_RGCTX_METHOD_INFO(rgctxVar, index) (rgctxVar[index].method)
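
The double use of IL2CPP_RGCTX_METHOD_INFO in the shared getter boils down to "fetch entry 0 of a per-class table, then call through the function pointer stored there, passing that same entry as the hidden argument." Here’s a stripped-down C++ model of that indirection. The Fake* types, FAKE_RGCTX_METHOD_INFO, AsRefInt, and GetData are invented for this sketch and only mimic the field names visible in the generated code. The real IL2CPP structures carry far more data.

```cpp
#include <cassert>

struct FakeMethod;

// Shape of the call made in the shared getter: (buffer, hidden method argument) -> int*
using AsRefFn = int* (*)(void* buffer, const FakeMethod* method);

// Invented stand-in for a method record; only the pointer we call through is modeled.
struct FakeMethod
{
    AsRefFn methodPointer;
};

// Invented stand-in for one rgctx_data entry, which the macro reads as `.method`.
struct FakeRgctxEntry
{
    const FakeMethod* method;
};

// Same expansion as the real macro shown above.
#define FAKE_RGCTX_METHOD_INFO(rgctxVar, index) (rgctxVar[index].method)

// Plays the role of Unsafe.AsRef<int>: reinterpret the buffer as an int*.
int* AsRefInt(void* buffer, const FakeMethod* /*method*/)
{
    return static_cast<int*>(buffer);
}

// Models the shared get_Data body: look up entry 0, then call through it.
int* GetData(void* buffer, const FakeRgctxEntry* rgctx_data)
{
    assert(rgctx_data != nullptr);
    const FakeMethod* target = FAKE_RGCTX_METHOD_INFO(rgctx_data, 0);
    return target->methodPointer(buffer, target);
}
```

In this model the getter does no real work of its own: the pointer it returns is just the buffer it was handed, reached through one table lookup and one indirect call.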

Now we can jump down to InitializedTypeInfo in the same file:

inline RuntimeClass* InitializedTypeInfo(RuntimeClass* klass)
{
    il2cpp::vm::ClassInlines::InitFromCodegen(klass);
    return klass;
}

This calls into some hand-written, not code-generated, code in ClassInlines.h:

// This function is critical for performance, before optimization it
// caused up to 20% of all CPU usage in code generated by il2cpp
static IL2CPP_FORCE_INLINE bool InitFromCodegen(Il2CppClass *klass)
{
    if (klass->initialized_and_no_error)
        return true;
    return InitFromCodegenSlow(klass);
}

The initialized_and_no_error flag will usually be set, avoiding the need to call the slow version of the function. Just to see what that looks like, let’s peek at ClassInlines.cpp:

bool ClassInlines::InitFromCodegenSlow(Il2CppClass *klass)
{
    bool result = Class::Init(klass);
 
    if (klass->has_initialization_error)
        il2cpp::vm::Exception::Raise((Il2CppException*)gc::GCHandle::GetTarget(klass->initializationExceptionGCHandle));
 
    return result;
}

Finally, let’s jump back to TestSharedStatic.Write and see what il2cpp_codegen_static_fields_for does. It’s in il2cpp-codegen-il2cpp.h:

inline void* il2cpp_codegen_static_fields_for(RuntimeClass* klass)
{
    return klass->static_fields;
}

Taken all together, there’s nothing that surprising here. Most writes should check a couple of flags, skip the expensive work of initializing the class and the method, and get on to the quick work of setting the shared value by dereferencing a pointer.
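
To see why repeated writes stay cheap, here’s a small C++ model of that control flow. FakeClass, InitSlow, InitFast, and Write are hypothetical names, and the counters exist only so the sketch can show how often each expensive path runs. It mirrors the shape of the generated code rather than reproducing it.

```cpp
#include <cassert>

// Hypothetical stand-in for the runtime's per-class record.
struct FakeClass
{
    bool initialized_and_no_error = false;
    int  slowInitCalls = 0; // instrumentation for this sketch only
};

// Slow path: pretend to run the expensive class initialization.
bool InitSlow(FakeClass& klass)
{
    ++klass.slowInitCalls;
    klass.initialized_and_no_error = true;
    return true;
}

// Fast path: a single flag test, shaped like InitFromCodegen.
inline bool InitFast(FakeClass& klass)
{
    if (klass.initialized_and_no_error)
        return true;
    return InitSlow(klass);
}

// Models TestSharedStatic.Write: method-init check, class-init check, then the store.
void Write(FakeClass& klass, int* data, int& methodInitCalls)
{
    assert(data != nullptr);

    static bool s_methodInitialized = false;
    if (!s_methodInitialized)
    {
        ++methodInitCalls; // stands in for il2cpp_codegen_initialize_method
        s_methodInitialized = true;
    }

    InitFast(klass);
    *data = 1;
}
```

Calling Write a thousand times against the same FakeClass runs each expensive path exactly once. Every later call pays for two flag tests and one store.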

Now let’s compare the SharedStatic<int> job above with a job that writes to a NativeArray<int>:

[BurstCompile(CompileSynchronously = true)]
struct NativeArrayJob : IJob
{
    public NativeArray<int> Array;
 
    public void Execute()
    {
        Array[0] = 5;
    }
}

Here’s what Burst Inspector shows:

; Execute
        mov        rax, qword ptr [rdi]
        mov        dword ptr [rax], 5
        ret
 
; Unity.Jobs.IJobExtensions.JobStruct`1<NativeArrayJob>.Execute
        ret

The Execute portion is down from six instructions to just two because there’s no need to look up the pointer in a table. The IJobExtensions.JobStruct.Execute wrapper is completely cleared out as no table needs to be set up and no errors need to be handled.

To further compare, let’s write a job that uses a raw int* pointer:

[BurstCompile(CompileSynchronously = true)]
unsafe struct PointerJob : IJob
{
    [NativeDisableUnsafePtrRestriction]
    public int* Pointer;
 
    public void Execute()
    {
        *Pointer = 5;
    }
}

Here’s what Burst Inspector shows:

; Execute
        mov        rax, qword ptr [rdi]
        mov        dword ptr [rax], 5
        ret
 
; Unity.Jobs.IJobExtensions.JobStruct`1<PointerJob>.Execute
        ret

This is the exact same output as with NativeArray<int>.

Finally for today, let’s write a quick performance comparison between these three job types. We’ll execute a bunch of each job and check how long each took to run:

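// Namespaces needed by this script
using System.Diagnostics;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using UnityEngine;

class TestScript : MonoBehaviour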
{
    unsafe void Start()
    {
        const int reps = 100000;
        TestSharedStatic.Write();
 
        NativeArray<int> array = new NativeArray<int>(1, Allocator.TempJob);
 
        SharedStaticJob sharedStaticJob = new SharedStaticJob();
        NativeArrayJob nativeArrayJob = new NativeArrayJob
        {
            Array = array
        };
        PointerJob pointerJob = new PointerJob
        {
            Pointer = (int*)array.GetUnsafePtr()
        };
 
        // Warmup
        sharedStaticJob.Run();
        nativeArrayJob.Run();
        pointerJob.Run();
 
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < reps; ++i)
        {
            sharedStaticJob.Run();
        }
        long sharedStaticTicks = sw.ElapsedTicks;
 
        sw.Restart();
        for (int i = 0; i < reps; ++i)
        {
            nativeArrayJob.Run();
        }
        long nativeArrayTicks = sw.ElapsedTicks;
 
        sw.Restart();
        for (int i = 0; i < reps; ++i)
        {
            pointerJob.Run();
        }
        long pointerTicks = sw.ElapsedTicks;
 
        array.Dispose();
 
        print(
            "Job,Ticks\n" +
            "SharedStatic," + sharedStaticTicks + "\n" +
            "NativeArray," + nativeArrayTicks + "\n" +
            "Pointer," + pointerTicks);
    }
}

I ran the test in this environment:

  • 2.7 GHz Intel Core i7-6820HQ
  • macOS 10.15.3
  • Unity 2019.3.9f1
  • Burst package 1.2.3
  • macOS Standalone
  • .NET 4.x scripting runtime version and API compatibility level
  • IL2CPP
  • Non-development
  • 640×480, Fastest, Windowed

And here are the results I got:

Job           Ticks
SharedStatic  267909
NativeArray   243741
Pointer       243402

[Figure: SharedStatic<T> Performance Graph]

SharedStatic<T> takes about 10% longer than both NativeArray<T> and raw pointers in this test, which seems about right given how little work each job is doing and how great a share of the time is therefore spent in job system overhead.
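
For anyone who wants to check the arithmetic, the two helpers below compute the relative and per-run differences from the tick counts. PercentSlower and ExtraTicksPerRun are names invented for this sketch, and the results are left in Stopwatch ticks rather than converted to time, since the tick frequency depends on the platform.

```cpp
#include <cassert>

// Percentage by which `ticks` exceeds `baselineTicks`.
double PercentSlower(long long ticks, long long baselineTicks)
{
    assert(baselineTicks > 0);
    return 100.0 * static_cast<double>(ticks - baselineTicks)
                 / static_cast<double>(baselineTicks);
}

// Average number of extra Stopwatch ticks per job run.
double ExtraTicksPerRun(long long ticks, long long baselineTicks, int reps)
{
    assert(reps > 0);
    return static_cast<double>(ticks - baselineTicks) / reps;
}
```

Plugging in the numbers above, PercentSlower(267909, 243741) comes out to roughly 9.9 against NativeArray<T> and PercentSlower(267909, 243402) to roughly 10.1 against the raw pointer. ExtraTicksPerRun(267909, 243741, 100000) is roughly 0.24, so the difference is about a quarter of a tick per job.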

In conclusion, SharedStatic<T> adds a little overhead compared to raw pointers or NativeArray<T> when compiled with either IL2CPP or Burst, but that overhead is minimal and quite unlikely to cause any serious performance problems.