IL2CPP Output for Unsafe Code
C# has some powerful features like fixed-size buffers, pointers, and unmanaged local variable arrays courtesy of stackalloc. These are deemed “unsafe” since they all deal with unmanaged memory. We should know what we’re ultimately instructing the CPU to execute when we use these features, so today we’ll take a look at the C++ output from IL2CPP and the assembly output from the C++ compiler to find out just that.
Dereferencing Pointers
Let’s start simple by looking at what happens when we dereference a pointer in C#:
    static unsafe class TestClass
    {
        static int DereferencePointer(int* x)
        {
            return *x;
        }
    }
IL2CPP in Unity 2017.3 turns this C# into the following C++:
    extern "C" int32_t TestClass_DereferencePointer_m3073840890 (RuntimeObject * __this /* static, unused */, int32_t* ___x0, const RuntimeMethod* method)
    {
        {
            int32_t* L_0 = ___x0;
            return (*((int32_t*)L_0));
        }
    }
This is a pretty literal translation that isn’t adding much. There’s an unnecessary code block ({}), an unnecessary copy from the ___x0 parameter to the L_0 local variable, and an unnecessary cast from int32_t* to int32_t*, but otherwise this function is just dereferencing the pointer as we did in C#.
Let’s see what Xcode 9.2’s C++ compiler turns this into when it generates ARM machine code:
    ldr r0, [r1]
    bx lr
All of that syntax just boils down to reading the memory at the pointer’s address (a.k.a. dereferencing it) and then returning the value that was read. This is the minimal assembly: neither IL2CPP nor the C++ compiler has added any overhead at all. Great!
Indexing Unmanaged Arrays
Pointers can also be thought of as the address of the first element of an array. Let’s try using a pointer like an array using the p[x] syntax:
    static unsafe class TestClass
    {
        static int IndexPointer(int* x)
        {
            return x[3];
        }
    }
Here’s the IL2CPP output:
    extern "C" int32_t TestClass_IndexPointer_m1288887649 (RuntimeObject * __this /* static, unused */, int32_t* ___x0, const RuntimeMethod* method)
    {
        {
            int32_t* L_0 = ___x0;
            return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_0, (int32_t)((int32_t)12)))));
        }
    }
Here we see that IL2CPP has multiplied the index (3) by the size of the int elements of the array (4) to get an offset in bytes: 12. Aside from a bunch of unnecessary casts and code blocks, we see a call to il2cpp_codegen_add, which looks like this:
    template<typename T, typename U>
    inline typename pick_bigger<T, U>::type il2cpp_codegen_add(T left, U right)
    {
        return left + right;
    }
So this just adds the two parameters together and returns a pick_bigger<T, U>::type. That type is defined by the combination of three template structs using a C++ technique called metaprogramming:
    template<class T, class U>
    struct pick_first<true, T, U>
    {
        typedef T type;
    };

    template<class T, class U>
    struct pick_first<false, T, U>
    {
        typedef U type;
    };

    template<class T, class U>
    struct pick_bigger
    {
        typedef typename pick_first<(sizeof(T) >= sizeof(U)), T, U>::type type;
    };
This boils down to pick_bigger<T, U>::type being the larger of the T and U types. So if il2cpp_codegen_add were to add an int32_t and an int64_t, it would return an int64_t since it’s bigger.
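To make the type selection concrete, here is a minimal standalone sketch of the same pattern. The primary pick_first template is an assumption, since the excerpt only shows its two partial specializations, and add_sketch is a hypothetical stand-in for il2cpp_codegen_add rather than the real function.

```cpp
#include <cstdint>
#include <type_traits>

// Primary template: assumed here, since the excerpt only shows the two
// partial specializations.
template<bool PickFirst, class T, class U>
struct pick_first;

template<class T, class U>
struct pick_first<true, T, U> { typedef T type; };

template<class T, class U>
struct pick_first<false, T, U> { typedef U type; };

template<class T, class U>
struct pick_bigger
{
    typedef typename pick_first<(sizeof(T) >= sizeof(U)), T, U>::type type;
};

// Hypothetical stand-in for il2cpp_codegen_add.
template<typename T, typename U>
inline typename pick_bigger<T, U>::type add_sketch(T left, U right)
{
    return left + right;
}

// Mixed sizes: the wider type wins regardless of argument order.
static_assert(std::is_same<pick_bigger<int32_t, int64_t>::type, int64_t>::value, "");
static_assert(std::is_same<pick_bigger<int64_t, int32_t>::type, int64_t>::value, "");
// Equal sizes: the >= comparison selects the first type.
static_assert(std::is_same<pick_bigger<int32_t, uint32_t>::type, int32_t>::value, "");
```

All of the selection happens at compile time, which is consistent with none of it surviving into the generated machine code.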
Now let’s see how all this metaprogramming and extra syntax boils down to ARM assembly code:
    ldr r0, [r1, #12]
    bx lr
All of that got stripped out and we’re left with just one read from 12 bytes after the pointer and then returning the value that was read. Again, this is the minimal work so the output is great!
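The equivalence between indexing and adding a byte offset can be sketched in standalone C++. Both function names below are hypothetical; the first mirrors the integer arithmetic of the generated code and the second uses ordinary indexing.

```cpp
#include <cstdint>

// Read element `index` by computing the byte offset explicitly, the way the
// generated C++ does (index * sizeof(int32_t) added to the base address).
inline int32_t read_at_byte_offset(const int32_t* base, int32_t index)
{
    const intptr_t address = reinterpret_cast<intptr_t>(base)
        + static_cast<intptr_t>(index) * static_cast<intptr_t>(sizeof(int32_t));
    return *reinterpret_cast<const int32_t*>(address);
}

// Read the same element with ordinary pointer indexing.
inline int32_t read_at_index(const int32_t* base, int32_t index)
{
    return base[index];
}
```

With index 3 and 4-byte elements, both forms read from 12 bytes past the base pointer, which is the #12 offset in the ldr instruction.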
Casting Pointers
Next up, let’s cast a pointer to one type to a pointer to another type. We’ve seen before that casting objects can be quite expensive, but does this hold true for pointers? Let’s try:
    static unsafe class TestClass
    {
        static float CastPointer(int* x)
        {
            return *(float*)x;
        }
    }
Here’s what IL2CPP outputs:
    extern "C" float TestClass_CastPointer_m222125499 (RuntimeObject * __this /* static, unused */, int32_t* ___x0, const RuntimeMethod* method)
    {
        {
            int32_t* L_0 = ___x0;
            return (*((float*)L_0));
        }
    }
Aside from the pointless local variable and code block, this is a literal translation of the C#. There’s no dynamic type checking going on here, unlike with object casting. Let’s see how this gets compiled into ARM assembly:
    ldr r0, [r1]
    bx lr
This is exactly the same assembly as with just dereferencing the pointer without a cast, as it should be. We simply read from memory at the pointer’s address then return the value. Great!
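For readers who want to try the same reinterpretation in standalone C++, here is a small sketch. It swaps the direct pointer cast seen in the IL2CPP output for memcpy, which is the form that stays within C++ aliasing rules; compilers typically reduce a fixed-size memcpy like this to a plain load. The function name is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the 32 bits of an int32_t as a float. memcpy is used instead of
// a pointer cast so the sketch stays within C++ aliasing rules.
inline float reinterpret_as_float(const int32_t* x)
{
    static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");
    float result;
    std::memcpy(&result, x, sizeof(result));
    return result;
}
```

No conversion of the value takes place: the bits are read as-is, so on a platform with IEEE 754 floats the bit pattern 0x3F800000 comes back as 1.0f.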
Fixing a Pointer to a Managed Array
The memory manager in C# is allowed to move objects around as it pleases. This is possible since C# references aren’t literal memory addresses, so no pointers will be invalidated by the move. This also means that if we want a pointer to a managed object, we need to use a fixed block to temporarily prevent the object from being moved while we use its current location in memory. Let’s try that by getting a pointer to the first element of a managed array:
    static unsafe class TestClass
    {
        static int* FixedBlock(int[] x)
        {
            fixed (int* p = x)
            {
                return p;
            }
        }
    }
The following is the C++ that IL2CPP outputs. It’s a lot longer, so I’ve annotated it with comments to explain what’s going on.
    extern "C" int32_t* TestClass_FixedBlock_m2981938384 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___x0, const RuntimeMethod* method)
    {
        // A temporary that's used later
        int32_t* V_0 = NULL;

        // This will be the return value
        uintptr_t G_B4_0 = 0;
        {
            // If the array is null, skip to assigning the return value to null
            Int32U5BU5D_t385246372* L_0 = ___x0;
            if (!L_0)
            {
                goto IL_000e;
            }
        }
        {
            // The array is not null because we didn't execute the above 'goto'
            // Null check the array (redundantly)
            Int32U5BU5D_t385246372* L_1 = ___x0;
            NullCheck(L_1);

            // If the array is not empty, skip to getting the first element's address
            if ((((int32_t)((int32_t)(((RuntimeArray *)L_1)->max_length)))))
            {
                goto IL_0015;
            }
        }

    IL_000e:
        {
            // Set null and then go return it
            G_B4_0 = (((uintptr_t)0));
            goto IL_001c;
        }

    IL_0015:
        {
            // The array is not null because two null checks have passed and we also dereferenced it
            // Null check the array (for the third time)
            Int32U5BU5D_t385246372* L_2 = ___x0;
            NullCheck(L_2);

            // Set the return value to the address of the first element
            G_B4_0 = ((uintptr_t)(((L_2)->GetAddressAt(static_cast<il2cpp_array_size_t>(0)))));
        }

    IL_001c:
        {
            // Cast to int* and return
            V_0 = (int32_t*)G_B4_0;
            int32_t* L_3 = V_0;
            return (int32_t*)(L_3);
        }
    }
GetAddressAt and its helper macro look like this:
    inline int32_t* GetAddressAt(il2cpp_array_size_t index)
    {
        IL2CPP_ARRAY_BOUNDS_CHECK(index, (uint32_t)(this)->max_length);
        return m_Items + index;
    }

    // Performance optimization as detailed here: http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx
    // Since array size is a signed int32_t, a single unsigned check can be performed to determine if index is less than array size.
    // Negative indices will map to a unsigned number greater than or equal to 2^31 which is larger than allowed for a valid array.
    #define IL2CPP_ARRAY_BOUNDS_CHECK(index, length) \
        do { \
            if (((uint32_t)(index)) >= ((uint32_t)length)) il2cpp::vm::Exception::Raise (il2cpp::vm::Exception::GetIndexOutOfRangeException()); \
        } while (0)
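The single unsigned comparison described in the macro's comment can be isolated in a few lines. This is a sketch with a hypothetical function name; it returns a bool rather than raising an exception.

```cpp
#include <cstdint>

// True when 0 <= index < length, using one unsigned comparison. A negative
// index converts to a value of at least 2^31, which exceeds any valid length.
inline bool index_in_bounds(int32_t index, int32_t length)
{
    return static_cast<uint32_t>(index) < static_cast<uint32_t>(length);
}
```

One comparison covers both the "negative index" and the "index too large" cases, so a bounds check costs one branch instead of two.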
So there are some redundant checks. We checked for a null array three times and a valid index twice. There are also a lot of extra code blocks and pointless local variable copies, but that hasn’t tripped up the C++ compiler so far. Let’s see what ARM assembly it generates in this case. I’ve annotated it to explain what’s going on.
        cbz r1, LBB3_2       // if array is null, go to LBB3_2
        ldr r0, [r1, #12]    // read array's max_length
        cmp r0, #0           // compare max_length with 0
        it ne                // if max_length isn't zero
        addne.w r0, r1, #16  // set the first element's address as return value if max_length isn't zero
        bx lr                // return
    LBB3_2:
        movs r0, #0          // set null as return value
        bx lr                // return
The compiler has removed the second and third null checks and the redundant bounds check, leaving just one null check and one empty-array check. These are sensible checks for general-purpose code, but what if we know the array isn’t null and isn’t empty? Can we remove these checks with [Il2CppSetOption] attributes? Let’s try:
    static unsafe class TestClass
    {
        [Il2CppSetOption(Option.NullChecks, false)]
        [Il2CppSetOption(Option.ArrayBoundsChecks, false)]
        static int* FixedBlockNoNullChecksNoRangeChecks(int[] x)
        {
            fixed (int* p = x)
            {
                return p;
            }
        }
    }
Here’s the C++ we get out of IL2CPP:
    extern "C" int32_t* TestClass_FixedBlockNoNullChecksNoRangeChecks_m1091582187 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___x0, const RuntimeMethod* method)
    {
        int32_t* V_0 = NULL;
        uintptr_t G_B4_0 = 0;
        {
            Int32U5BU5D_t385246372* L_0 = ___x0;
            if (!L_0)
            {
                goto IL_000e;
            }
        }
        {
            Int32U5BU5D_t385246372* L_1 = ___x0;
            if ((((int32_t)((int32_t)(((RuntimeArray *)L_1)->max_length)))))
            {
                goto IL_0015;
            }
        }

    IL_000e:
        {
            G_B4_0 = (((uintptr_t)0));
            goto IL_001c;
        }

    IL_0015:
        {
            Int32U5BU5D_t385246372* L_2 = ___x0;
            G_B4_0 = ((uintptr_t)(((L_2)->GetAddressAtUnchecked(static_cast<il2cpp_array_size_t>(0)))));
        }

    IL_001c:
        {
            V_0 = (int32_t*)G_B4_0;
            int32_t* L_3 = V_0;
            return (int32_t*)(L_3);
        }
    }
Both NullCheck calls have been removed and the call to GetAddressAt has been converted to GetAddressAtUnchecked, as we’d expect from [Il2CppSetOption]. However, we still have a manual null check (if (!L_0)) and a manual length check (if ((((int32_t)((int32_t)(((RuntimeArray *)L_1)->max_length)))))) that didn’t get removed. How will this affect the ARM assembly? Let’s see:
        cbz r1, LBB6_2
        ldr r0, [r1, #12]
        cmp r0, #0
        it ne
        addne.w r0, r1, #16
        bx lr
    LBB6_2:
        movs r0, #0
        bx lr
This assembly code is exactly the same as without the [Il2CppSetOption] attributes. The C++ compiler was already stripping out the redundant null and bounds checks, so adding attributes didn’t change the code the CPU will execute. It’s unfortunate that we can’t get rid of the two remaining checks, but at least we don’t need to remember to add attributes to get the best machine code output.
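Written out by hand, the logic that survives in both versions of the assembly is roughly the following. This is a sketch only: IntArraySketch is a hypothetical stand-in for the managed array type, not IL2CPP's actual layout, and first_element_or_null is a made-up name.

```cpp
#include <cstdint>

// Hypothetical stand-in for a managed int[]: a length followed by storage.
struct IntArraySketch
{
    int32_t max_length;
    int32_t items[4];
};

// Mirrors the two checks that survive in the assembly: a null array or an
// empty array yields a null pointer, otherwise the first element's address.
inline int32_t* first_element_or_null(IntArraySketch* array)
{
    if (array == nullptr || array->max_length == 0)
    {
        return nullptr;
    }
    return &array->items[0];
}
```

Seen this way, the two remaining checks are part of what the fixed statement means for a null or empty array rather than safety checks layered on top, which fits with the attributes having no effect on them.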
Fixing a Pointer to a Fixed Buffer
There’s another meaning of the fixed keyword. It can be used to create a fixed-length buffer of primitives as a direct field of a struct. So fixed float pos[3] is just like typing float x, y, z: the values are added directly to the struct’s contents. Unlike individual values, fixed-length buffers can be indexed into like a managed array. To do so, we once again need to use the other fixed keyword to fix a pointer since the managed object containing the struct might be moved by the memory manager. Let’s try that now:
    unsafe struct TestStruct
    {
        public fixed int FixedBuffer[10];

        public int UseFixedBuffer()
        {
            fixed (int* f = FixedBuffer)
            {
                return f[3];
            }
        }
    }
Here’s the C++ that IL2CPP outputs:
    extern "C" int32_t TestStruct_UseFixedBuffer_m1456987545 (TestStruct_t512363622 * __this, const RuntimeMethod* method)
    {
        int32_t* V_0 = NULL;
        {
            U3CFixedBufferU3E__FixedBuffer0_t1481979028 * L_0 = __this->get_address_of_FixedBuffer_0();
            int32_t* L_1 = L_0->get_address_of_FixedElementField_0();
            V_0 = (int32_t*)L_1;
            int32_t* L_2 = V_0;
            return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_2, (int32_t)((int32_t)12)))));
        }
    }

    struct U3CFixedBufferU3E__FixedBuffer0_t1481979028
    {
    public:
        union
        {
            struct
            {
                // System.Int32 TestStruct/<FixedBuffer>__FixedBuffer0::FixedElementField
                int32_t ___FixedElementField_0;
            };
            uint8_t U3CFixedBufferU3E__FixedBuffer0_t1481979028__padding[40];
        };

    public:
        inline static int32_t get_offset_of_FixedElementField_0() { return static_cast<int32_t>(offsetof(U3CFixedBufferU3E__FixedBuffer0_t1481979028, ___FixedElementField_0)); }
        inline int32_t get_FixedElementField_0() const { return ___FixedElementField_0; }
        inline int32_t* get_address_of_FixedElementField_0() { return &___FixedElementField_0; }
        inline void set_FixedElementField_0(int32_t value)
        {
            ___FixedElementField_0 = value;
        }
    };
We can see that the generated struct has 40 bytes (uint8_t) of data, which is the capacity for 10 int values in the fixed-length buffer. Various accessor functions were generated and a couple of them are used in UseFixedBuffer. il2cpp_codegen_add returns here to offset the pointer to the first element by 12, which is 3 elements of 4 bytes each.
Notably missing are the null and length checks. Because fixed-length buffers cannot be null and cannot be empty, there’s simply no reason to check for these cases. Let’s see how this translates to assembly code via the C++ compiler:
    ldr r0, [r0, #12]
    bx lr
This is identical to the assembly code that was generated when we indexed into a pointer. There are no null checks, no range checks, and no length checks as we saw above with managed arrays. Great!
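The layout idea behind the generated struct, one named first element sharing storage with a 40-byte padding array, can be approximated in standalone C++. Every name below is hypothetical, and the sketch reads and writes elements with memcpy through the byte storage rather than indexing off the first element's address as the generated code does.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical analogue of the generated struct: the first element is named
// and the padding array reserves room for all ten int32_t values.
union FixedBufferSketch
{
    int32_t first_element;
    uint8_t padding[10 * sizeof(int32_t)];
};

static_assert(sizeof(FixedBufferSketch) == 40, "ten 4-byte elements");

// Byte offset of element `index` from the start of the buffer.
inline std::size_t element_offset(std::size_t index)
{
    return index * sizeof(int32_t);
}

// Store an element through the byte storage with memcpy.
inline void store_element(FixedBufferSketch& buffer, std::size_t index, int32_t value)
{
    std::memcpy(buffer.padding + element_offset(index), &value, sizeof(value));
}

// Load an element through the byte storage with memcpy.
inline int32_t load_element(const FixedBufferSketch& buffer, std::size_t index)
{
    int32_t value;
    std::memcpy(&value, buffer.padding + element_offset(index), sizeof(value));
    return value;
}
```

Because the storage is part of the struct itself, there is no separate array object that could be null or empty, which matches the absence of checks in the output above.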
Now let’s change up how we use the fixed buffer. Instead of using it from within the struct, let’s use it from another class.
    static unsafe class TestClass
    {
        static int FixedBufferFromPointer(TestStruct* x)
        {
            fixed (int* p = x->FixedBuffer)
            {
                return p[3];
            }
        }
    }
This is the same code, except it uses a pointer parameter instead of the this pointer to access the fixed-length buffer. Let’s see the C++:
    extern "C" int32_t TestClass_FixedBufferFromPointer_m2376647269 (RuntimeObject * __this /* static, unused */, TestStruct_t512363622 * ___x0, const RuntimeMethod* method)
    {
        int32_t* V_0 = NULL;
        {
            TestStruct_t512363622 * L_0 = ___x0;
            NullCheck(L_0);
            U3CFixedBufferU3E__FixedBuffer0_t1481979028 * L_1 = L_0->get_address_of_FixedBuffer_0();
            int32_t* L_2 = L_1->get_address_of_FixedElementField_0();
            V_0 = (int32_t*)L_2;
            int32_t* L_3 = V_0;
            return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_3, (int32_t)((int32_t)12)))));
        }
    }
We get a null check now, but otherwise the code is the same as when the function was within the struct. Let’s see the assembly code:
        push {r4, r7, lr}
        add r7, sp, #4
        mov r4, r1
        cbnz r4, LBB9_2
        bl __ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
    LBB9_2:
        ldr r0, [r4, #12]
        pop {r4, r7, pc}
Most of this is for the null check, so let’s remove it to see if we can get back to the minimal assembly that was generated for the function inside the struct:
    static unsafe class TestClass
    {
        [Il2CppSetOption(Option.NullChecks, false)]
        static int FixedBufferFromPointerNoNullChecks(TestStruct* x)
        {
            fixed (int* p = x->FixedBuffer)
            {
                return p[3];
            }
        }
    }
Here’s the C++ that IL2CPP outputs:
    extern "C" int32_t TestClass_FixedBufferFromPointerNoNullChecks_m1201836921 (RuntimeObject * __this /* static, unused */, TestStruct_t512363622 * ___x0, const RuntimeMethod* method)
    {
        int32_t* V_0 = NULL;
        {
            TestStruct_t512363622 * L_0 = ___x0;
            U3CFixedBufferU3E__FixedBuffer0_t1481979028 * L_1 = L_0->get_address_of_FixedBuffer_0();
            int32_t* L_2 = L_1->get_address_of_FixedElementField_0();
            V_0 = (int32_t*)L_2;
            int32_t* L_3 = V_0;
            return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_3, (int32_t)((int32_t)12)))));
        }
    }
This is the same, except the NullCheck call has been removed. Now let’s see what this compiles to:
    ldr r0, [r1, #12]
    bx lr
We’re back to the original, minimal assembly as with the function inside the struct. Great!
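The trade-off between the two versions can be sketched in standalone C++. The struct and both functions are hypothetical analogues, and std::runtime_error stands in for the managed NullReferenceException.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for TestStruct: ten ints stored inline.
struct TestStructSketch
{
    int32_t fixed_buffer[10];
};

// Checked access: analogous to the version with NullCheck, which raises an
// exception instead of reading through a null pointer.
inline int32_t read_fourth_checked(const TestStructSketch* x)
{
    if (x == nullptr)
    {
        throw std::runtime_error("null pointer");
    }
    return x->fixed_buffer[3];
}

// Unchecked access: analogous to disabling null checks. Passing a null
// pointer here is undefined behavior, so the caller must guarantee validity.
inline int32_t read_fourth_unchecked(const TestStructSketch* x)
{
    return x->fixed_buffer[3];
}
```

The unchecked form is only appropriate when the caller can prove the pointer is valid; with the check removed, a null pointer is simply dereferenced instead of producing an exception.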
Stack Allocating Arrays
Finally for today, let’s look at the stackalloc keyword. Like a fixed-length buffer, it can be used to create an unmanaged array, this time as a local variable inside a function.
    static unsafe class TestClass
    {
        static int StackallocVar(int len)
        {
            int* x = stackalloc int[len];
            return x[3];
        }
    }
IL2CPP outputs this C++, which I’ve annotated with comments:
    extern "C" int32_t TestClass_StackallocVar_m2549670191 (RuntimeObject * __this /* static, unused */, int32_t ___len0, const RuntimeMethod* method)
    {
        int32_t* V_0 = NULL;
        {
            int32_t L_0 = ___len0;

            // If the stackalloc length is too long, throw an exception
            if ((uint64_t)(uint32_t)L_0 * (uint64_t)(uint32_t)4 > (uint64_t)(uint32_t)kIl2CppUInt32Max)
                IL2CPP_RAISE_MANAGED_EXCEPTION(il2cpp_codegen_get_overflow_exception());

            // stackalloc the array and clear it to zeroes
            int8_t* L_1 = (int8_t*) alloca(((int32_t)il2cpp_codegen_multiply((int32_t)L_0, (int32_t)4)));
            memset(L_1,0,((int32_t)il2cpp_codegen_multiply((int32_t)L_0, (int32_t)4)));

            // Index into the fourth element and return it
            V_0 = (int32_t*)(L_1);
            int32_t* L_2 = V_0;
            return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_2, (int32_t)((int32_t)12)))));
        }
    }
The function begins with a check that the array’s size in bytes (the length multiplied by 4) fits in an unsigned 32-bit integer, raising an overflow exception if the stackalloc array would be too large. Then it allocates the array, clears it to zeroes, and returns the fourth element. Let’s see how this compiles into ARM assembly:
        push {r4, r5, r7, lr}
        add r7, sp, #8
        sub sp, #4
        movw r0, :lower16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_0+4))
        cmp.w r1, #1073741824
        movt r0, :upper16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_0+4))
    LPC12_0:
        add r0, pc
        ldr r0, [r0]
        ldr r0, [r0]
        str r0, [r7, #-12]
        bhs LBB12_2
        sub.w r4, sp, r1, lsl #2
        mov sp, r4
        lsls r2, r1, #2
        mov r0, r4
        movs r1, #0
        bl _memset
        ldr r0, [r4, #12]
        ldr r1, [r7, #-12]
        movw r2, :lower16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_1+4))
        movt r2, :upper16:(L___stack_chk_guard$non_lazy_ptr-(LPC12_1+4))
    LPC12_1:
        add r2, pc
        ldr r2, [r2]
        ldr r2, [r2]
        subs r1, r2, r1
        ittt eq
        subeq.w r4, r7, #8
        moveq sp, r4
        popeq {r4, r5, r7, pc}
        bl ___stack_chk_fail
    LBB12_2:
        bl __Z37il2cpp_codegen_get_overflow_exceptionv
        bl __ZN6il2cpp2vm9Exception5RaiseEP15Il2CppException
Most of this is the code for the call to alloca, which is making room on the stack for the array, the length check and ensuing exception, and the memset to clear the array to zeroes. There’s definitely overhead to be paid here, but it’s a lot less than the overhead of allocating a managed array and, ultimately, feeding it to the GC.
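The size check at the top of the generated function can be isolated as a small predicate. This sketch assumes kIl2CppUInt32Max equals the largest 32-bit unsigned value (UINT32_MAX); the function name is hypothetical.

```cpp
#include <cstdint>

// True when length * elementSize does not fit in 32 unsigned bits, which is
// the condition under which the generated code raises an overflow exception.
inline bool stackalloc_size_overflows(int32_t length, uint32_t elementSize)
{
    const uint64_t bytes = static_cast<uint64_t>(static_cast<uint32_t>(length))
        * static_cast<uint64_t>(elementSize);
    return bytes > UINT32_MAX;
}
```

For 4-byte elements the threshold is a length of 1073741824 (2^30), which is the constant compared against r1 in the assembly. A negative length converts to a large unsigned value and is rejected by the same comparison.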
If the length of the array is constant, there’s a workaround to eliminate the alloca and length checks. Just create a bunch of local variables and use the address of the first one as a pointer to the first element of the “array”:
    static unsafe class TestClass
    {
        static int StackallocManual(int i)
        {
            int x0, x1, x2, x3, x4, x5, x6, x7, x8, x9;
            int* x = &x0;
            return x[i];
        }
    }
Here’s the C++ that IL2CPP generates:
    extern "C" int32_t TestClass_StackallocManual_m2866866183 (RuntimeObject * __this /* static, unused */, int32_t ___i0, const RuntimeMethod* method)
    {
        int32_t V_0 = 0;
        int32_t V_1 = 0;
        int32_t V_2 = 0;
        int32_t V_3 = 0;
        int32_t V_4 = 0;
        int32_t V_5 = 0;
        int32_t V_6 = 0;
        int32_t V_7 = 0;
        int32_t V_8 = 0;
        int32_t V_9 = 0;
        int32_t* V_10 = NULL;
        {
            V_10 = (int32_t*)(&V_0);
            int32_t* L_0 = V_10;
            int32_t L_1 = ___i0;
            return (*((int32_t*)((int32_t*)il2cpp_codegen_add((intptr_t)L_0, (intptr_t)((intptr_t)il2cpp_codegen_multiply((intptr_t)(((intptr_t)L_1)), (int32_t)4))))));
        }
    }
There’s no more length check or alloca call here. Let’s look at the assembly to see what ultimately runs on the CPU:
    sub sp, #4
    movs r0, #0
    str r0, [sp]
    mov r0, sp
    ldr.w r0, [r0, r1, lsl #2]
    add sp, #4
    bx lr
This assembly is much shorter and no longer has any branch instructions, such as those for length checks.
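For comparison only, a constant-length stack array in standalone C++ is normally written as a local array. The sketch below is not what IL2CPP generates; the function name and the stored value are hypothetical.

```cpp
#include <cstdint>

// A zero-initialized, fixed-size array that lives on the stack. The length
// is a compile-time constant, so no alloca call or size check is needed.
inline int32_t read_from_stack_array(int32_t i)
{
    int32_t x[10] = {};
    x[3] = 42; // hypothetical value so the read is observable
    return x[i];
}
```

As in the C# version, nothing checks that i is within range, so the caller is responsible for passing an index from 0 to 9.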
Conclusion
All of these “unsafe” language features are implemented reasonably well in both IL2CPP and the C++ compiler. In many cases, the resulting ARM assembly code is completely optimal. In particular, dereferencing pointers, indexing into an unmanaged array, and casting pointers all produced the minimal assembly code in these tests.
Indexing into fixed-length buffer fields of a struct usually produces optimal machine code, but null checks will be added if accessing the buffer via a pointer. Thankfully, they’re easily removed by adding [Il2CppSetOption(Option.NullChecks, false)] to the function using the buffer.
Unlike fixed-length buffers, indexing into a managed array via a pointer requires two branches to check for null and for empty arrays. These can’t be removed with an attribute.
The last form of array is created by stackalloc and it comes with the most overhead. There’s a length check, a call to memset to clear to zeroes, and alloca to dynamically allocate inside the stack. Working around this requires a fixed array length and some extra typing, but it’s an option when performance is really crucial.
All of these are as fast as or faster than their managed equivalents. They typically involve fewer cache misses due to pointer indirection, less need to disable null and bounds checks, and no interaction with the GC and memory manager. For performance-critical C#, there’s a lot of upside to using “unsafe” code.
#1 by greenland on April 26th, 2019 ·
This is really great. Thanks for doing all this research.
Does IL2CPP still do length checks on stackalloc when the length is a fixed literal value? Does it also ignore [Il2CppSetOption(Option.ArrayBoundsChecks, false)]?
I would not have thought to define each “array” element individually; seems a bit crazy, but makes perfect sense.
I’ve a min/max search implemented as a C++ plugin, but I’m wondering if it would be worth it to rewrite it as C# using stackalloc, just for the benefit of Mono running it in-editor instead of requiring an IL2CPP build.
#2 by jackson on April 26th, 2019 ·
You’re welcome!
There are no bounds checks in the generated C++, so adding that attribute won’t help. The exception that’s thrown is for when the stackalloc array is too big.
As for stackalloc with a constant length, that’ll produce the same alloca call but the exception check won’t be present. So it’s faster than the version in the article that uses a variable-sized array, but not as flexible.
As for your C++ plugin, I’d recommend profiling both approaches since it sounds like a minimal amount of code to duplicate. If you discover a significant difference, it might be worthwhile to dive into the IL2CPP-generated C++ and see why.
#3 by Dennin on September 27th, 2019 ·
I’d like to know what would be your thoughts about the Unity NativeArrays vs C# stackallocs.
#4 by jackson on September 28th, 2019 ·
There are a lot of differences between them. The biggest difference is that stackalloc allocates on the stack and NativeArray allocates on the heap. So when your function returns, the stackalloc array is gone. NativeArray, on the other hand, can live on indefinitely.
This also implies that you need to call Dispose at some point with NativeArray but there is no need to manually dispose of a stackalloc array. Since stack space is limited, stackalloc is also only suitable for short arrays while NativeArray is only limited by RAM size.
Other differences include the requirement to use unsafe with stackalloc and the requirement to use Unity with NativeArray.
#5 by Menyus on January 3rd, 2020 ·
AFAIK you can use stackalloc without unsafe by using Span, just throwing this here. Very nice and informative article it was!
#6 by Amn on June 16th, 2022 ·
Hi @jackson!
Shouldn’t StackallocManual(int i) be returning return x[-i]; (Negative) because of how the stack grows?
I tried it in normal C code and it wasn’t working, so I negated i and it worked.