JacksonDunstan.com

There are many permutations of loops we can write, but what do they compile to? We should know the consequences of using an array versus a List<T>, for versus foreach, caching Length, and other factors. So today’s article dives into the C++ code that IL2CPP outputs when we write these various types of loops to examine the differences. We’ll even go further and look at the ARM assembly that the C++ compiles to and really find out how much overhead our choices are costing us.

Array `for` loop, checks disabled, `Length` cached

Today we’ll progress from the fastest possible loop to the slowest. Along the way we’ll look at the C# source code we write, the C++ source code that IL2CPP in Unity 2017.3 generates for our C#, and the ARM assembly code that Xcode 9.2 compiles it to for a release build on iOS. Let’s start with the C# for a for loop on an array when we’ve told IL2CPP to disable null- and bounds-checks and we’re caching the array’s Length field as a local variable. Everything’s exactly the same for while and do-while loops, so only for loops will be covered here.

static class TestClass
{
	[Il2CppSetOption(Option.NullChecks, false)]
	[Il2CppSetOption(Option.ArrayBoundsChecks, false)]
	static int ForArrayChecksDisabled(int[] array)
	{
		int sum = 0;
		int len = array.Length;
		for (int i = 0; i < len; ++i)
		{
			sum += array[i];
		}
		return sum;
	}
}

Now the C++ that IL2CPP outputs:

extern "C"  int32_t TestClass_ForArrayChecksDisabled_m816776997 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___array0, const RuntimeMethod* method)
{
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	int32_t V_2 = 0;
	{
		V_0 = 0;
		Int32U5BU5D_t385246372* L_0 = ___array0;
		V_1 = (((int32_t)((int32_t)(((RuntimeArray *)L_0)->max_length))));
		V_2 = 0;
		goto IL_0017;
	}
 
IL_000d:
	{
		int32_t L_1 = V_0;
		Int32U5BU5D_t385246372* L_2 = ___array0;
		int32_t L_3 = V_2;
		int32_t L_4 = L_3;
		int32_t L_5 = (L_2)->GetAtUnchecked(static_cast<il2cpp_array_size_t>(L_4));
		V_0 = ((int32_t)il2cpp_codegen_add((int32_t)L_1, (int32_t)L_5));
		int32_t L_6 = V_2;
		V_2 = ((int32_t)il2cpp_codegen_add((int32_t)L_6, (int32_t)1));
	}
 
IL_0017:
	{
		int32_t L_7 = V_2;
		int32_t L_8 = V_1;
		if ((((int32_t)L_7) < ((int32_t)L_8)))
		{
			goto IL_000d;
		}
	}
	{
		int32_t L_9 = V_0;
		return L_9;
	}
}

There are a lot of pointless local variable aliases and unnecessary casts, but other than that it’s pretty clear what’s going on here. We don’t see any NullCheck calls and we see GetAtUnchecked, which means we’ve effectively disabled null- and bounds-checks. Other than that, the only weird parts are the calls to il2cpp_codegen_add. This is an inline templated function whose return type is the larger of the two argument types, so int extends to long and so forth. In this case, both arguments are int32_t, so it’s equivalent to just using the + operator.
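To make the "larger of the two argument types" idea concrete, here is a minimal sketch of such a function. It is not IL2CPP's actual source: the name codegen_add_sketch is made up for illustration, and std::common_type is an assumed stand-in for however IL2CPP picks the wider type.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for il2cpp_codegen_add, not IL2CPP's actual source.
// The result type is the wider of the two operand types, which is what
// std::common_type yields for the integer types used here.
template <typename T, typename U>
inline typename std::common_type<T, U>::type codegen_add_sketch(T left, U right)
{
	return left + right;
}

// int32_t + int32_t stays int32_t, so the call collapses to a plain +
static_assert(
	std::is_same<decltype(codegen_add_sketch(int32_t(1), int32_t(2))), int32_t>::value,
	"int32_t + int32_t yields int32_t");

// int32_t + int64_t widens to int64_t
static_assert(
	std::is_same<decltype(codegen_add_sketch(int32_t(1), int64_t(2))), int64_t>::value,
	"int32_t + int64_t yields int64_t");
```

Because the function is a trivial inline template, an optimizing compiler has no reason to emit an actual call for it.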

Now let’s see how effective the C++ compiler is at removing all these local variable aliases, casts, and function calls by looking at the assembly it generates:

	ldr	r2, [r1, #12]
	cmp	r2, #1
	itt	lt
	movlt	r0, #0
	bxlt	lr
	adds	r1, #16
	movs	r0, #0
LBB0_1:
	ldr	r3, [r1], #4
	subs	r2, #1
	add	r0, r3
	bne	LBB0_1
	bx	lr

It’s OK to not understand all the details of this assembly. There’s not much going on here though, just some loop setup, the contents of the loop, incrementing the iterator variable, and jumping back to the LBB0_1 label at the start of the loop to keep it going. This is quite minimal and shows that the C# to C++ to Assembly pipeline can be really efficient under the right circumstances.
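For readers who prefer C++ to assembly, the instructions above correspond roughly to the hand-written function below. This is only a sketch: the function name and the data and length parameters are made up for illustration and stand in for the array's element storage and its max_length field.

```cpp
#include <cstdint>

// Hand-written equivalent of the optimized loop: walk a pointer over the
// elements and count the remaining length down to zero.
// `data` and `length` are illustrative parameters, not IL2CPP names.
int32_t SumLikeTheAssembly(const int32_t* data, int32_t length)
{
	if (length < 1)          // cmp r2, #1 / movlt r0, #0 / bxlt lr
	{
		return 0;
	}
	int32_t sum = 0;         // movs r0, #0
	do
	{
		sum += *data++;      // ldr r3, [r1], #4 / add r0, r3
		--length;            // subs r2, #1
	} while (length != 0);   // bne LBB0_1
	return sum;              // bx lr
}
```

The loop body is three instructions plus the branch, with no index variable at all: the compiler replaced the index with a moving pointer and a countdown.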

Array `for` loop, checks enabled, `Length` cached

Hardly anybody ever disables the null- and bounds-checks everywhere though, so let’s see what a for loop on an array looks like normally when we don’t disable them. Here’s the C#:

static class TestClass
{
	static int ForArray(int[] array)
	{
		int sum = 0;
		int len = array.Length;
		for (int i = 0; i < len; ++i)
		{
			sum += array[i];
		}
		return sum;
	}
}

And the C++:

extern "C"  int32_t TestClass_ForArray_m2463453876 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___array0, const RuntimeMethod* method)
{
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	int32_t V_2 = 0;
	{
		V_0 = 0;
		Int32U5BU5D_t385246372* L_0 = ___array0;
		NullCheck(L_0);
		V_1 = (((int32_t)((int32_t)(((RuntimeArray *)L_0)->max_length))));
		V_2 = 0;
		goto IL_0017;
	}
 
IL_000d:
	{
		int32_t L_1 = V_0;
		Int32U5BU5D_t385246372* L_2 = ___array0;
		int32_t L_3 = V_2;
		NullCheck(L_2);
		int32_t L_4 = L_3;
		int32_t L_5 = (L_2)->GetAt(static_cast<il2cpp_array_size_t>(L_4));
		V_0 = ((int32_t)il2cpp_codegen_add((int32_t)L_1, (int32_t)L_5));
		int32_t L_6 = V_2;
		V_2 = ((int32_t)il2cpp_codegen_add((int32_t)L_6, (int32_t)1));
	}
 
IL_0017:
	{
		int32_t L_7 = V_2;
		int32_t L_8 = V_1;
		if ((((int32_t)L_7) < ((int32_t)L_8)))
		{
			goto IL_000d;
		}
	}
	{
		int32_t L_9 = V_0;
		return L_9;
	}
}

As expected, the NullCheck calls are back and GetAtUnchecked has been replaced with GetAt. Let’s see the impact on the ARM assembly:

	push	{r4, r5, r6, r7, lr}
	add	r7, sp, #12
	push.w	{r8, r10}
	mov	r8, r1
	cmp.w	r8, #0
	it	eq
	bleq	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
	ldr.w	r0, [r8, #12]
	cmp	r0, #1
	blt	LBB0_6
	sub.w	r10, r0, #1
	add.w	r4, r8, #16
	movs	r5, #0
	movs	r6, #0
	b	LBB0_3
LBB0_2:
	adds	r6, #1
	ldr.w	r0, [r8, #12]
LBB0_3:
	cmp	r0, r6
	bhi	LBB0_5
	bl	__ZN6il2cpp2vm9Exception27GetIndexOutOfRangeExceptionEv
	bl	__ZN6il2cpp2vm9Exception5RaiseEP15Il2CppException
LBB0_5:
	ldr.w	r0, [r4, r6, lsl #2]
	cmp	r10, r6
	add	r5, r0
	bne	LBB0_2
	b	LBB0_7
LBB0_6:
	movs	r5, #0
LBB0_7:
	mov	r0, r5
	pop.w	{r8, r10}
	pop	{r4, r5, r6, r7, pc}

Well that got a lot longer! Again, there’s no need to understand all the details. The gist here is that a lot more assembly code got generated. There are more conditional branches (e.g. bhi) now to do all the checks, plus calls (bl) to raise an exception when a check fails. These can be expensive so they’re best avoided in performance-critical code, but it’s not too bad.
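The work those extra branches do can be modeled in plain C++. The sketch below is an illustration only: IntArraySketch, GetAtChecked, and SumWithChecks are hypothetical names, and standard C++ exceptions stand in for the IL2CPP runtime's NullReferenceException and IndexOutOfRangeException.

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative model of a managed int[] array; field names are hypothetical.
struct IntArraySketch
{
	uint32_t max_length;
	const int32_t* items;
};

// Models GetAt: compare the index against the length before every read.
inline int32_t GetAtChecked(const IntArraySketch* array, uint32_t index)
{
	if (index >= array->max_length)
	{
		throw std::out_of_range("index out of range");
	}
	return array->items[index];
}

// Models the checks-enabled loop: one null check up front for Length, then a
// bounds check on each element access.
int32_t SumWithChecks(const IntArraySketch* array)
{
	if (array == nullptr)
	{
		throw std::invalid_argument("null array");
	}
	int32_t sum = 0;
	int32_t len = static_cast<int32_t>(array->max_length);
	for (int32_t i = 0; i < len; ++i)
	{
		sum += GetAtChecked(array, static_cast<uint32_t>(i));
	}
	return sum;
}
```

The per-element comparison in GetAtChecked is the part that survives into the loop body as the cmp and bhi pair.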

Array `for` loop, checks enabled, `Length` not cached

Next, let’s look at another for loop over an array, but this time without caching the Length of the array as a local variable. Many C# programmers don’t do this, so let’s see what difference it makes.

static class TestClass
{
	static int ForArrayNoCache(int[] array)
	{
		int sum = 0;
		for (int i = 0; i < array.Length; ++i)
		{
			sum += array[i];
		}
		return sum;
	}
}

Here’s the C++:

extern "C"  int32_t TestClass_ForArrayNoCache_m2316932607 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___array0, const RuntimeMethod* method)
{
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	{
		V_0 = 0;
		V_1 = 0;
		goto IL_0013;
	}
 
IL_0009:
	{
		int32_t L_0 = V_0;
		Int32U5BU5D_t385246372* L_1 = ___array0;
		int32_t L_2 = V_1;
		NullCheck(L_1);
		int32_t L_3 = L_2;
		int32_t L_4 = (L_1)->GetAt(static_cast<il2cpp_array_size_t>(L_3));
		V_0 = ((int32_t)il2cpp_codegen_add((int32_t)L_0, (int32_t)L_4));
		int32_t L_5 = V_1;
		V_1 = ((int32_t)il2cpp_codegen_add((int32_t)L_5, (int32_t)1));
	}
 
IL_0013:
	{
		int32_t L_6 = V_1;
		Int32U5BU5D_t385246372* L_7 = ___array0;
		NullCheck(L_7);
		if ((((int32_t)L_6) < ((int32_t)(((int32_t)((int32_t)(((RuntimeArray *)L_7)->max_length)))))))
		{
			goto IL_0009;
		}
	}
	{
		int32_t L_8 = V_0;
		return L_8;
	}
}

Notice that the if line is no longer just comparing two local variables (if ((((int32_t)L_7) < ((int32_t)L_8)))) but now queries the Length field of the array via max_length. How much of an impact does this have on the assembly? Let’s find out:

	push	{r4, r5, r6, r7, lr}
	add	r7, sp, #12
	str	r8, [sp, #-4]!
	mov	r4, r1
	add.w	r8, r4, #16
	movs	r6, #0
	movs	r5, #0
	b	LBB1_4
LBB1_1:
	cmp	r0, r6
	bhi	LBB1_3
	bl	__ZN6il2cpp2vm9Exception27GetIndexOutOfRangeExceptionEv
	bl	__ZN6il2cpp2vm9Exception5RaiseEP15Il2CppException
LBB1_3:
	ldr.w	r0, [r8, r6, lsl #2]
	adds	r6, #1
	add	r5, r0
LBB1_4:
	cbnz	r4, LBB1_6
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB1_6:
	ldr	r0, [r4, #12]
	cmp	r6, r0
	blt	LBB1_1
	mov	r0, r5
	ldr	r8, [sp], #4
	pop	{r4, r5, r6, r7, pc}

This is shorter than before, but structured differently. The array is now checked for null (with cbnz) on every loop iteration. So there is a reason to cache the array length as a local variable. It will reduce the amount of branching required in the loop when null checks are left enabled.

Array `foreach` loop

Now that we’ve established a baseline of what normal for loops look like, let’s try out our first foreach loop using an array:

static class TestClass
{
	static int ForeachArray(int[] array)
	{
		int sum = 0;
		foreach (int cur in array)
		{
			sum += cur;
		}
		return sum;
	}
}

Here’s the C++:

extern "C"  int32_t TestClass_ForeachArray_m760813861 (RuntimeObject * __this /* static, unused */, Int32U5BU5D_t385246372* ___array0, const RuntimeMethod* method)
{
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	Int32U5BU5D_t385246372* V_2 = NULL;
	int32_t V_3 = 0;
	{
		V_0 = 0;
		Int32U5BU5D_t385246372* L_0 = ___array0;
		V_2 = L_0;
		V_3 = 0;
		goto IL_0017;
	}
 
IL_000b:
	{
		Int32U5BU5D_t385246372* L_1 = V_2;
		int32_t L_2 = V_3;
		NullCheck(L_1);
		int32_t L_3 = L_2;
		int32_t L_4 = (L_1)->GetAt(static_cast<il2cpp_array_size_t>(L_3));
		V_1 = L_4;
		int32_t L_5 = V_0;
		int32_t L_6 = V_1;
		V_0 = ((int32_t)il2cpp_codegen_add((int32_t)L_5, (int32_t)L_6));
		int32_t L_7 = V_3;
		V_3 = ((int32_t)il2cpp_codegen_add((int32_t)L_7, (int32_t)1));
	}
 
IL_0017:
	{
		int32_t L_8 = V_3;
		Int32U5BU5D_t385246372* L_9 = V_2;
		NullCheck(L_9);
		if ((((int32_t)L_8) < ((int32_t)(((int32_t)((int32_t)(((RuntimeArray *)L_9)->max_length)))))))
		{
			goto IL_000b;
		}
	}
	{
		int32_t L_10 = V_0;
		return L_10;
	}
}

This looks very similar but not identical to the previous test where we used a for loop without caching Length. Let’s look at the assembly to strip away the C++ syntax and see what really gets run by the CPU:

	push	{r4, r5, r6, r7, lr}
	add	r7, sp, #12
	str	r8, [sp, #-4]!
	mov	r4, r1
	add.w	r8, r4, #16
	movs	r6, #0
	movs	r5, #0
	b	LBB1_4
LBB1_1:
	cmp	r0, r6
	bhi	LBB1_3
	bl	__ZN6il2cpp2vm9Exception27GetIndexOutOfRangeExceptionEv
	bl	__ZN6il2cpp2vm9Exception5RaiseEP15Il2CppException
LBB1_3:
	ldr.w	r0, [r8, r6, lsl #2]
	adds	r6, #1
	add	r5, r0
LBB1_4:
	cbnz	r4, LBB1_6
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB1_6:
	ldr	r0, [r4, #12]
	cmp	r6, r0
	blt	LBB1_1
	mov	r0, r5
	ldr	r8, [sp], #4
	pop	{r4, r5, r6, r7, pc}

This is exactly the same assembly code! Every single character of every single line is the same, even down to the registers that local variables occupy. This means we can very easily state what a foreach loop looks like on an array: it’s the same as a for loop that doesn’t cache the Length field and doesn’t disable null- or bounds-checks. That makes it equivalent to the slowest for loop we can write. No more. No less.

`List<T>` `for` loop

Now let’s venture into List<T> territory and explore what it looks like to iterate over one with a for loop:

static class TestClass
{
	static int ForList(List<int> list)
	{
		int sum = 0;
		int len = list.Count;
		for (int i = 0; i < len; ++i)
		{
			sum += list[i];
		}
		return sum;
	}
}

Here’s the C++ that IL2CPP outputs:

extern "C"  int32_t TestClass_ForList_m2093568030 (RuntimeObject * __this /* static, unused */, List_1_t128053199 * ___list0, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (TestClass_ForList_m2093568030_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	int32_t V_2 = 0;
	{
		V_0 = 0;
		List_1_t128053199 * L_0 = ___list0;
		NullCheck(L_0);
		int32_t L_1 = List_1_get_Count_m186164705(L_0, /*hidden argument*/List_1_get_Count_m186164705_RuntimeMethod_var);
		V_1 = L_1;
		V_2 = 0;
		goto IL_001e;
	}
 
IL_0010:
	{
		int32_t L_2 = V_0;
		List_1_t128053199 * L_3 = ___list0;
		int32_t L_4 = V_2;
		NullCheck(L_3);
		int32_t L_5 = List_1_get_Item_m888956288(L_3, L_4, /*hidden argument*/List_1_get_Item_m888956288_RuntimeMethod_var);
		V_0 = ((int32_t)il2cpp_codegen_add((int32_t)L_2, (int32_t)L_5));
		int32_t L_6 = V_2;
		V_2 = ((int32_t)il2cpp_codegen_add((int32_t)L_6, (int32_t)1));
	}
 
IL_001e:
	{
		int32_t L_7 = V_2;
		int32_t L_8 = V_1;
		if ((((int32_t)L_7) < ((int32_t)L_8)))
		{
			goto IL_0010;
		}
	}
	{
		int32_t L_9 = V_0;
		return L_9;
	}
}

As we know, using methods of a generic type means IL2CPP will generate method initialization overhead for us. The first part of this function is just that and the rest is the actual loop. It’s a pretty literal translation of our C# code with calls to List_1_get_Count_m186164705 to get the Count property and List_1_get_Item_m888956288 to use the indexer. Overall, it looks very similar to the array-based for loop with length caching and with null- and bounds-checks enabled, except for those function calls. Let’s look at the functions to find out what they do before diving into the assembly.

#define List_1_get_Count_m186164705(__this, method) ((  int32_t (*) (List_1_t128053199 *, const RuntimeMethod*))List_1_get_Count_m186164705_gshared)(__this, method)
 
extern "C"  int32_t List_1_get_Count_m186164705_gshared (List_1_t128053199 * __this, const RuntimeMethod* method)
{
	{
		int32_t L_0 = (int32_t)__this->get__size_1();
		return L_0;
	}
}
 
#define List_1_get_Item_m888956288(__this, p0, method) ((  int32_t (*) (List_1_t128053199 *, int32_t, const RuntimeMethod*))List_1_get_Item_m888956288_gshared)(__this, p0, method)
 
extern "C"  int32_t List_1_get_Item_m888956288_gshared (List_1_t128053199 * __this, int32_t ___index0, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (List_1_get_Item_m888956288_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	{
		int32_t L_0 = ___index0;
		int32_t L_1 = (int32_t)__this->get__size_1();
		if ((!(((uint32_t)L_0) >= ((uint32_t)L_1))))
		{
			goto IL_0017;
		}
	}
	{
		ArgumentOutOfRangeException_t777629997 * L_2 = (ArgumentOutOfRangeException_t777629997 *)il2cpp_codegen_object_new(ArgumentOutOfRangeException_t777629997_il2cpp_TypeInfo_var);
		ArgumentOutOfRangeException__ctor_m3628145864(L_2, (String_t*)_stringLiteral797640427, /*hidden argument*/NULL);
		IL2CPP_RAISE_MANAGED_EXCEPTION(L_2);
	}
 
IL_0017:
	{
		Int32U5BU5D_t385246372* L_3 = (Int32U5BU5D_t385246372*)__this->get__items_0();
		int32_t L_4 = ___index0;
		NullCheck(L_3);
		int32_t L_5 = L_4;
		int32_t L_6 = (L_3)->GetAt(static_cast<il2cpp_array_size_t>(L_5));
		return L_6;
	}
}

It turns out that each of these functions is really a macro to call an actual function. In the case of the Count getter, it just returns a field of the List. The indexer, however, is quite long. It has method initialization overhead of its own since it can throw an exception. Then it performs its own bounds check to throw an ArgumentOutOfRangeException. By calling GetAt, it then performs essentially the same bounds check again to see if it should throw an IndexOutOfRangeException, which will never happen. It might as well disable bounds-checks since it performs its own bounds check, but it doesn’t.
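The indexer's own check, `((uint32_t)L_0) >= ((uint32_t)L_1)`, is worth a closer look because one comparison covers two cases. The sketch below models it along with the redundant second check. The function names are hypothetical, and the capacity parameter is an assumption standing in for the length of the List's backing array.

```cpp
#include <cstdint>
#include <stdexcept>

// Mirrors the test in List_1_get_Item: casting both sides to uint32_t turns a
// negative index into a huge positive number, so a single >= comparison
// rejects both index < 0 and index >= size.
inline bool IsOutOfRange(int32_t index, int32_t size)
{
	return static_cast<uint32_t>(index) >= static_cast<uint32_t>(size);
}

// Hypothetical model of the indexer: the List's own range check runs first...
int32_t ListGetItemSketch(const int32_t* items, int32_t size, int32_t capacity, int32_t index)
{
	if (IsOutOfRange(index, size))
	{
		throw std::out_of_range("ArgumentOutOfRangeException");
	}
	// ...and then the backing array repeats the check against its own length
	// (assumed capacity >= size), which cannot fail for a consistent List.
	if (IsOutOfRange(index, capacity))
	{
		throw std::out_of_range("IndexOutOfRangeException");
	}
	return items[index];
}
```

Any index that passes the first check also passes the second, so the second comparison is pure overhead on every element access.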

With that in mind, let’s see how these function calls affect the assembly for the function:

	push	{r4, r5, r6, r7, lr}
	add	r7, sp, #12
	push.w	{r8, r10}
	movw	r5, :lower16:(__ZZ29TestClass_ForList_m2093568030E25s_Il2CppMethodInitialized-(LPC2_0+4))
	mov	r4, r1
	movt	r5, :upper16:(__ZZ29TestClass_ForList_m2093568030E25s_Il2CppMethodInitialized-(LPC2_0+4))
LPC2_0:
	add	r5, pc
	ldrb	r0, [r5]
	cbnz	r0, LBB2_2
	movw	r0, :lower16:(L_TestClass_ForList_m2093568030_MetadataUsageId$non_lazy_ptr-(LPC2_1+4))
	movt	r0, :upper16:(L_TestClass_ForList_m2093568030_MetadataUsageId$non_lazy_ptr-(LPC2_1+4))
LPC2_1:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r0, [r0]
	bl	__ZN6il2cpp2vm13MetadataCache24InitializeMethodMetadataEj
	movs	r0, #1
	strb	r0, [r5]
LBB2_2:
	cbnz	r4, LBB2_4
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB2_4:
	movw	r0, :lower16:(L_List_1_get_Count_m186164705_RuntimeMethod_var$non_lazy_ptr-(LPC2_2+4))
	movt	r0, :upper16:(L_List_1_get_Count_m186164705_RuntimeMethod_var$non_lazy_ptr-(LPC2_2+4))
LPC2_2:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r1, [r0]
	mov	r0, r4
	bl	_List_1_get_Count_m186164705_gshared
	mov	r8, r0
	movs	r5, #0
	cmp.w	r8, #1
	blt	LBB2_9
	movw	r0, :lower16:(L_List_1_get_Item_m888956288_RuntimeMethod_var$non_lazy_ptr-(LPC2_3+4))
	movs	r6, #0
	movt	r0, :upper16:(L_List_1_get_Item_m888956288_RuntimeMethod_var$non_lazy_ptr-(LPC2_3+4))
LPC2_3:
	add	r0, pc
	ldr.w	r10, [r0]
LBB2_6:
	cbnz	r4, LBB2_8
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB2_8:
	ldr.w	r2, [r10]
	mov	r0, r4
	mov	r1, r6
	bl	_List_1_get_Item_m888956288_gshared
	adds	r6, #1
	add	r5, r0
	cmp	r8, r6
	bne	LBB2_6
LBB2_9:
	mov	r0, r5
	pop.w	{r8, r10}
	pop	{r4, r5, r6, r7, pc}

There’s a lot of code at the start for the method initialization, but after that it starts to look pretty normal. The major difference remains that there are now function calls for _List_1_get_Count_m186164705_gshared and _List_1_get_Item_m888956288_gshared that we looked at earlier. It’s worth noting that these were not inlined, even though _List_1_get_Count_m186164705_gshared is just a single line of code. So we’ll have to eat the function call overhead once for the cached Count getter call and every loop iteration to index into the contents of the List.

`List<T>` `foreach` loop

Moving on, let’s see a foreach loop on a List<T> now that we know what a normal for loop looks like:

static class TestClass
{
	static int ForeachList(List<int> list)
	{
		int sum = 0;
		foreach (int cur in list)
		{
			sum += cur;
		}
		return sum;
	}
}

Here’s the C++:

extern "C"  int32_t TestClass_ForeachList_m2132133839 (RuntimeObject * __this /* static, unused */, List_1_t128053199 * ___list0, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (TestClass_ForeachList_m2132133839_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	Enumerator_t2017297076  V_2;
	memset(&V_2, 0, sizeof(V_2));
	Exception_t * __last_unhandled_exception = 0;
	NO_UNUSED_WARNING (__last_unhandled_exception);
	Exception_t * __exception_local = 0;
	NO_UNUSED_WARNING (__exception_local);
	int32_t __leave_target = 0;
	NO_UNUSED_WARNING (__leave_target);
	{
		V_0 = 0;
		List_1_t128053199 * L_0 = ___list0;
		NullCheck(L_0);
		Enumerator_t2017297076  L_1 = List_1_GetEnumerator_m593114157(L_0, /*hidden argument*/List_1_GetEnumerator_m593114157_RuntimeMethod_var);
		V_2 = L_1;
	}
 
IL_0009:
	try
	{ // begin try (depth: 1)
		{
			goto IL_001a;
		}
 
IL_000e:
		{
			int32_t L_2 = Enumerator_get_Current_m207670954((&V_2), /*hidden argument*/Enumerator_get_Current_m207670954_RuntimeMethod_var);
			V_1 = L_2;
			int32_t L_3 = V_0;
			int32_t L_4 = V_1;
			V_0 = ((int32_t)il2cpp_codegen_add((int32_t)L_3, (int32_t)L_4));
		}
 
IL_001a:
		{
			bool L_5 = Enumerator_MoveNext_m3181700225((&V_2), /*hidden argument*/Enumerator_MoveNext_m3181700225_RuntimeMethod_var);
			if (L_5)
			{
				goto IL_000e;
			}
		}
 
IL_0026:
		{
			IL2CPP_LEAVE(0x39, FINALLY_002b);
		}
	} // end try (depth: 1)
	catch(Il2CppExceptionWrapper& e)
	{
		__last_unhandled_exception = (Exception_t *)e.ex;
		goto FINALLY_002b;
	}
 
FINALLY_002b:
	{ // begin finally (depth: 1)
		Enumerator_Dispose_m222348240((&V_2), /*hidden argument*/Enumerator_Dispose_m222348240_RuntimeMethod_var);
		IL2CPP_END_FINALLY(43)
	} // end finally (depth: 1)
	IL2CPP_CLEANUP(43)
	{
		IL2CPP_JUMP_TBL(0x39, IL_0039)
		IL2CPP_RETHROW_IF_UNHANDLED(Exception_t *)
	}
 
IL_0039:
	{
		int32_t L_6 = V_0;
		return L_6;
	}
}

That’s a lot more C++, so let’s take it one step at a time. First, we have method initialization because we’re using methods of a generic class. Then we have what a foreach loop breaks down into. It’s roughly equivalent to this C#:

var enumerator = list.GetEnumerator();
try
{
	while (enumerator.MoveNext())
	{
		int cur = enumerator.Current;
		sum += cur;
	}
}
finally
{
	enumerator.Dispose();
}

We can see all the parts of it clearly in the C++ code. List_1_GetEnumerator_m593114157 gets the enumerator, Enumerator_MoveNext_m3181700225 advances it, and Enumerator_get_Current_m207670954 gets the current value. Again, we need to look at these to find out what they do:

#define List_1_GetEnumerator_m593114157(__this, method) ((  Enumerator_t2017297076  (*) (List_1_t128053199 *, const RuntimeMethod*))List_1_GetEnumerator_m593114157_gshared)(__this, method)
 
extern "C"  Enumerator_t2017297076  List_1_GetEnumerator_m593114157_gshared (List_1_t128053199 * __this, const RuntimeMethod* method)
{
	{
		Enumerator_t2017297076  L_0;
		memset(&L_0, 0, sizeof(L_0));
		Enumerator__ctor_m247851533((&L_0), (List_1_t128053199 *)__this, /*hidden argument*/IL2CPP_RGCTX_METHOD_INFO(method->declaring_type->rgctx_data, 23));
		return L_0;
	}
}
 
#define Enumerator_MoveNext_m3181700225(__this, method) ((  bool (*) (Enumerator_t2017297076 *, const RuntimeMethod*))Enumerator_MoveNext_m3181700225_gshared)(__this, method)
 
extern "C"  bool Enumerator_MoveNext_m3181700225_gshared (Enumerator_t2017297076 * __this, const RuntimeMethod* method)
{
	int32_t V_0 = 0;
	{
		Enumerator_VerifyState_m1898450050((Enumerator_t2017297076 *)__this, /*hidden argument*/IL2CPP_RGCTX_METHOD_INFO(InitializedTypeInfo(method->declaring_type)->rgctx_data, 0));
		int32_t L_0 = (int32_t)__this->get_next_1();
		if ((((int32_t)L_0) >= ((int32_t)0)))
		{
			goto IL_0014;
		}
	}
	{
		return (bool)0;
	}
 
IL_0014:
	{
		int32_t L_1 = (int32_t)__this->get_next_1();
		List_1_t128053199 * L_2 = (List_1_t128053199 *)__this->get_l_0();
		NullCheck(L_2);
		int32_t L_3 = (int32_t)L_2->get__size_1();
		if ((((int32_t)L_1) >= ((int32_t)L_3)))
		{
			goto IL_0053;
		}
	}
	{
		List_1_t128053199 * L_4 = (List_1_t128053199 *)__this->get_l_0();
		NullCheck(L_4);
		Int32U5BU5D_t385246372* L_5 = (Int32U5BU5D_t385246372*)L_4->get__items_0();
		int32_t L_6 = (int32_t)__this->get_next_1();
		int32_t L_7 = (int32_t)L_6;
		V_0 = (int32_t)L_7;
		__this->set_next_1(((int32_t)il2cpp_codegen_add((int32_t)L_7, (int32_t)1)));
		int32_t L_8 = V_0;
		NullCheck(L_5);
		int32_t L_9 = L_8;
		int32_t L_10 = (L_5)->GetAt(static_cast<il2cpp_array_size_t>(L_9));
		__this->set_current_3(L_10);
		return (bool)1;
	}
 
IL_0053:
	{
		__this->set_next_1((-1));
		return (bool)0;
	}
}
 
#define Enumerator_get_Current_m207670954(__this, method) ((  int32_t (*) (Enumerator_t2017297076 *, const RuntimeMethod*))Enumerator_get_Current_m207670954_gshared)(__this, method)
 
extern "C"  int32_t Enumerator_get_Current_m207670954_gshared (Enumerator_t2017297076 * __this, const RuntimeMethod* method)
{
	{
		int32_t L_0 = (int32_t)__this->get_current_3();
		return L_0;
	}
}

Once again, these are macros that call real functions. List_1_GetEnumerator_m593114157_gshared just returns a struct and Enumerator_get_Current_m207670954_gshared just returns a field of that struct, but Enumerator_MoveNext_m3181700225_gshared has a lot more going on. There’s nothing that surprising as it’s mostly just checking if the index has hit the end and caching the current value. There is a call to Enumerator_VerifyState_m1898450050 though, so let’s check that out:

#define Enumerator_VerifyState_m1898450050(__this, method) ((  void (*) (Enumerator_t2017297076 *, const RuntimeMethod*))Enumerator_VerifyState_m1898450050_gshared)(__this, method)
 
extern "C"  void Enumerator_VerifyState_m1898450050_gshared (Enumerator_t2017297076 * __this, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (Enumerator_VerifyState_m1898450050_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	{
		List_1_t128053199 * L_0 = (List_1_t128053199 *)__this->get_l_0();
		if (L_0)
		{
			goto IL_0026;
		}
	}
	{
		Enumerator_t2017297076  L_1 = (*(Enumerator_t2017297076 *)__this);
		RuntimeObject * L_2 = Box(IL2CPP_RGCTX_DATA(InitializedTypeInfo(method->declaring_type)->rgctx_data, 2), &L_1);
		NullCheck((RuntimeObject *)L_2);
		Type_t * L_3 = Object_GetType_m88164663((RuntimeObject *)L_2, /*hidden argument*/NULL);
		NullCheck((Type_t *)L_3);
		String_t* L_4 = VirtFuncInvoker0< String_t* >::Invoke(18 /* System.String System.Type::get_FullName() */, (Type_t *)L_3);
		ObjectDisposedException_t21392786 * L_5 = (ObjectDisposedException_t21392786 *)il2cpp_codegen_object_new(ObjectDisposedException_t21392786_il2cpp_TypeInfo_var);
		ObjectDisposedException__ctor_m3603759869(L_5, (String_t*)L_4, /*hidden argument*/NULL);
		IL2CPP_RAISE_MANAGED_EXCEPTION(L_5);
	}
 
IL_0026:
	{
		int32_t L_6 = (int32_t)__this->get_ver_2();
		List_1_t128053199 * L_7 = (List_1_t128053199 *)__this->get_l_0();
		NullCheck(L_7);
		int32_t L_8 = (int32_t)L_7->get__version_2();
		if ((((int32_t)L_6) == ((int32_t)L_8)))
		{
			goto IL_0047;
		}
	}
	{
		InvalidOperationException_t56020091 * L_9 = (InvalidOperationException_t56020091 *)il2cpp_codegen_object_new(InvalidOperationException_t56020091_il2cpp_TypeInfo_var);
		InvalidOperationException__ctor_m237278729(L_9, (String_t*)_stringLiteral1621028992, /*hidden argument*/NULL);
		IL2CPP_RAISE_MANAGED_EXCEPTION(L_9);
	}
 
IL_0047:
	{
		return;
	}
}

Once again, there’s method initialization overhead here since this can throw an exception when the enumerator is in an invalid state, such as using it after it has been disposed (its List reference is null) or modifying the List during the foreach loop. In addition to the method initialization check, both of these cases are checked for at each iteration of the loop.
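The version-checking scheme is easier to follow in a stripped-down form. The following sketch models it in plain C++ under some assumptions: ListSketch and EnumeratorSketch are hypothetical types, std::logic_error stands in for the managed exceptions, and Dispose clearing the list pointer is a guess, since the Dispose body isn't shown above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical model of List<int>: a version counter bumps on every mutation.
struct ListSketch
{
	std::vector<int32_t> items;
	int32_t version = 0;

	void Add(int32_t value)
	{
		items.push_back(value);
		++version;
	}
};

// Hypothetical model of List<int>.Enumerator.
struct EnumeratorSketch
{
	ListSketch* list;
	int32_t next;
	int32_t version;
	int32_t current;

	explicit EnumeratorSketch(ListSketch* l)
		: list(l), next(0), version(l->version), current(0)
	{
	}

	// Models VerifyState: both checks run on every MoveNext call.
	void VerifyState() const
	{
		if (list == nullptr)
		{
			throw std::logic_error("enumerator used after Dispose");
		}
		if (version != list->version)
		{
			throw std::logic_error("list modified during enumeration");
		}
	}

	bool MoveNext()
	{
		VerifyState();
		if (next < 0)
		{
			return false;
		}
		if (next < static_cast<int32_t>(list->items.size()))
		{
			current = list->items[static_cast<std::size_t>(next++)];
			return true;
		}
		next = -1;
		return false;
	}

	// Assumption: Dispose just drops the list reference.
	void Dispose()
	{
		list = nullptr;
	}
};
```

Compared with an indexed for loop, each element now costs a state verification, two comparisons on next, a store to current, and a later read of current.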

With all of this expensive work present in the C++ code, let’s look at the assembly it compiles to. This is going to be long and there’s no need to understand it all or even read it all. Just taking in the broad strokes of the amount of code generated is probably enough to give some idea of its performance characteristics.

	push	{r4, r5, r6, r7, lr}
	add	r7, sp, #12
	push.w	{r8, r10, r11}
	sub.w	r4, sp, #64
	bfc	r4, #0, #4
	mov	sp, r4
	vst1.64	{d8, d9, d10, d11}, [r4:128]!
	vst1.64	{d12, d13, d14, d15}, [r4:128]
	sub	sp, #112
	movw	r5, :lower16:(__ZZ33TestClass_ForeachList_m2132133839E25s_Il2CppMethodInitialized-(LPC3_2+4))
	mov	r4, r1
	movt	r5, :upper16:(__ZZ33TestClass_ForeachList_m2132133839E25s_Il2CppMethodInitialized-(LPC3_2+4))
	movw	r0, :lower16:(L___gxx_personality_sj0$non_lazy_ptr-(LPC3_3+4))
	movt	r0, :upper16:(L___gxx_personality_sj0$non_lazy_ptr-(LPC3_3+4))
LPC3_2:
	add	r5, pc
LPC3_3:
	add	r0, pc
	ldr	r1, LCPI3_0
	ldrb	r6, [r5]
	ldr	r0, [r0]
LPC3_0:
	add	r1, pc
	str	r0, [sp, #84]
	ldr	r0, LCPI3_1
	str	r1, [sp, #88]
	orr	r0, r0, #1
	str	r7, [sp, #92]
LPC3_1:
	add	r0, pc
	str.w	sp, [sp, #100]
	str	r0, [sp, #96]
	add	r0, sp, #60
	bl	__Unwind_SjLj_Register
	cbnz	r6, LBB3_2
	movw	r0, :lower16:(L_TestClass_ForeachList_m2132133839_MetadataUsageId$non_lazy_ptr-(LPC3_4+4))
	mov.w	r1, #-1
	movt	r0, :upper16:(L_TestClass_ForeachList_m2132133839_MetadataUsageId$non_lazy_ptr-(LPC3_4+4))
	str	r1, [sp, #64]
LPC3_4:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r0, [r0]
	bl	__ZN6il2cpp2vm13MetadataCache24InitializeMethodMetadataEj
	movs	r0, #1
	strb	r0, [r5]
LBB3_2:
	vmov.i32	q8, #0x0
	add	r6, sp, #24
	cmp	r4, #0
	vst1.64	{d16, d17}, [r6]
	bne	LBB3_4
	mov.w	r0, #-1
	str	r0, [sp, #64]
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB3_4:
	movw	r0, :lower16:(L_List_1_GetEnumerator_m593114157_RuntimeMethod_var$non_lazy_ptr-(LPC3_5+4))
	add	r5, sp, #8
	movt	r0, :upper16:(L_List_1_GetEnumerator_m593114157_RuntimeMethod_var$non_lazy_ptr-(LPC3_5+4))
	mov	r1, r4
LPC3_5:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r2, [r0]
	mov.w	r0, #-1
	str	r0, [sp, #64]
	mov	r0, r5
	bl	_List_1_GetEnumerator_m593114157_gshared
	vld1.64	{d16, d17}, [r5]
	movs	r0, #0
	vst1.64	{d16, d17}, [r6]
	movw	r1, :lower16:(L_Enumerator_MoveNext_m3181700225_RuntimeMethod_var$non_lazy_ptr-(LPC3_6+4))
	movt	r1, :upper16:(L_Enumerator_MoveNext_m3181700225_RuntimeMethod_var$non_lazy_ptr-(LPC3_6+4))
LPC3_6:
	add	r1, pc
	ldr	r1, [r1]
	str	r1, [sp, #4]
	movw	r1, :lower16:(L_Enumerator_get_Current_m207670954_RuntimeMethod_var$non_lazy_ptr-(LPC3_7+4))
	movt	r1, :upper16:(L_Enumerator_get_Current_m207670954_RuntimeMethod_var$non_lazy_ptr-(LPC3_7+4))
LPC3_7:
	add	r1, pc
	ldr	r1, [r1]
	str	r1, [sp]
	b	LBB3_7
LBB3_5:
	ldr	r1, [sp]
	movs	r2, #1
	ldr	r1, [r1]
	str	r2, [sp, #64]
	bl	_Enumerator_get_Current_m207670954_gshared
	ldr	r1, [sp, #44]
	add	r6, sp, #24
	add	r0, r1
LBB3_7:
	str	r0, [sp, #44]
	ldr	r0, [sp, #4]
	ldr	r1, [r0]
	movs	r0, #2
	str	r0, [sp, #64]
	mov	r0, r6
	bl	_Enumerator_MoveNext_m3181700225_gshared
	cmp	r0, #0
	add	r0, sp, #24
	bne	LBB3_5
	movs	r5, #0
	movs	r4, #1
LBB3_10:
	movw	r0, :lower16:(L_Enumerator_Dispose_m222348240_RuntimeMethod_var$non_lazy_ptr-(LPC3_8+4))
	movt	r0, :upper16:(L_Enumerator_Dispose_m222348240_RuntimeMethod_var$non_lazy_ptr-(LPC3_8+4))
	str	r5, [sp, #56]
LPC3_8:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r1, [r0]
	mov.w	r0, #-1
	str	r0, [sp, #64]
	add	r0, sp, #24
	bl	_Enumerator_Dispose_m222348240_gshared
	ldr	r0, [sp, #56]
	cbnz	r4, LBB3_13
	cbz	r0, LBB3_13
	ldr	r0, [sp, #56]
	mov.w	r1, #-1
	str	r1, [sp, #64]
	bl	__ZN6il2cpp2vm9Exception5RaiseEP15Il2CppException
	add	r0, sp, #60
	ldr	r4, [sp, #44]
	bl	__Unwind_SjLj_Unregister
	mov	r0, r4
	add	r4, sp, #112
	vld1.64	{d8, d9, d10, d11}, [r4:128]!
	vld1.64	{d12, d13, d14, d15}, [r4:128]
	sub.w	r4, r7, #24
	mov	sp, r4
	pop.w	{r8, r10, r11}
	pop	{r4, r5, r6, r7, pc}
LBB3_14:
	ldr	r0, [sp, #64]
	cmp	r0, #2
	bls	LBB3_16
	trap
LBB3_16:
LCPI3_2:
	tbb	[pc, r0]
LJTI3_0:
LBB3_18:
	b	LBB3_20
LBB3_19:
LBB3_20:
	ldr	r0, [sp, #68]
	ldr	r1, [sp, #72]
	strd	r1, r0, [sp, #48]
	ldr	r0, [sp, #48]
	cmp	r0, #1
	bne	LBB3_22
	ldr	r0, [sp, #52]
	bl	___cxa_begin_catch
	ldr	r5, [r0]
	mov.w	r0, #-1
	str	r0, [sp, #64]
	bl	___cxa_end_catch
	movs	r4, #0
	b	LBB3_10
LBB3_22:
	ldr	r0, [sp, #52]
	mov.w	r1, #-1
	str	r1, [sp, #64]
	bl	__Unwind_SjLj_Resume

Notice that this isn’t the full assembly source for the loop. There are still calls to plenty of other functions like _List_1_GetEnumerator_m593114157_gshared, _Enumerator_get_Current_m207670954_gshared, _Enumerator_MoveNext_m3181700225_gshared, and even _Enumerator_Dispose_m222348240_gshared. The Dispose function is just a single line, but we still get function call overhead for it.

`IEnumerable<T>` `foreach` loop

Finally for today, let’s look at a foreach loop over an IEnumerable<T>:

static class TestClass
{
	static int ForeachEnumerable(IEnumerable<int> enumerable)
	{
		int sum = 0;
		foreach (int cur in enumerable)
		{
			sum += cur;
		}
		return sum;
	}
}

And the C++:

extern "C"  int32_t TestClass_ForeachEnumerable_m2471184119 (RuntimeObject * __this /* static, unused */, RuntimeObject* ___enumerable0, const RuntimeMethod* method)
{
	static bool s_Il2CppMethodInitialized;
	if (!s_Il2CppMethodInitialized)
	{
		il2cpp_codegen_initialize_method (TestClass_ForeachEnumerable_m2471184119_MetadataUsageId);
		s_Il2CppMethodInitialized = true;
	}
	int32_t V_0 = 0;
	int32_t V_1 = 0;
	RuntimeObject* V_2 = NULL;
	Exception_t * __last_unhandled_exception = 0;
	NO_UNUSED_WARNING (__last_unhandled_exception);
	Exception_t * __exception_local = 0;
	NO_UNUSED_WARNING (__exception_local);
	int32_t __leave_target = 0;
	NO_UNUSED_WARNING (__leave_target);
	{
		V_0 = 0;
		RuntimeObject* L_0 = ___enumerable0;
		NullCheck(L_0);
		RuntimeObject* L_1 = InterfaceFuncInvoker0< RuntimeObject* >::Invoke(0 /* System.Collections.Generic.IEnumerator`1<!0> System.Collections.Generic.IEnumerable`1<System.Int32>::GetEnumerator() */, IEnumerable_1_t1930798642_il2cpp_TypeInfo_var, L_0);
		V_2 = L_1;
	}
 
IL_0009:
	try
	{ // begin try (depth: 1)
		{
			goto IL_0019;
		}
 
IL_000e:
		{
			RuntimeObject* L_2 = V_2;
			NullCheck(L_2);
			int32_t L_3 = InterfaceFuncInvoker0< int32_t >::Invoke(0 /* !0 System.Collections.Generic.IEnumerator`1<System.Int32>::get_Current() */, IEnumerator_1_t3383516221_il2cpp_TypeInfo_var, L_2);
			V_1 = L_3;
			int32_t L_4 = V_0;
			int32_t L_5 = V_1;
			V_0 = ((int32_t)il2cpp_codegen_add((int32_t)L_4, (int32_t)L_5));
		}
 
IL_0019:
		{
			RuntimeObject* L_6 = V_2;
			NullCheck(L_6);
			bool L_7 = InterfaceFuncInvoker0< bool >::Invoke(1 /* System.Boolean System.Collections.IEnumerator::MoveNext() */, IEnumerator_t1853284238_il2cpp_TypeInfo_var, L_6);
			if (L_7)
			{
				goto IL_000e;
			}
		}
 
IL_0024:
		{
			IL2CPP_LEAVE(0x36, FINALLY_0029);
		}
	} // end try (depth: 1)
	catch(Il2CppExceptionWrapper& e)
	{
		__last_unhandled_exception = (Exception_t *)e.ex;
		goto FINALLY_0029;
	}
 
FINALLY_0029:
	{ // begin finally (depth: 1)
		{
			RuntimeObject* L_8 = V_2;
			if (!L_8)
			{
				goto IL_0035;
			}
		}
 
IL_002f:
		{
			RuntimeObject* L_9 = V_2;
			NullCheck(L_9);
			InterfaceActionInvoker0::Invoke(0 /* System.Void System.IDisposable::Dispose() */, IDisposable_t3640265483_il2cpp_TypeInfo_var, L_9);
		}
 
IL_0035:
		{
			IL2CPP_END_FINALLY(41)
		}
	} // end finally (depth: 1)
	IL2CPP_CLEANUP(41)
	{
		IL2CPP_JUMP_TBL(0x36, IL_0036)
		IL2CPP_RETHROW_IF_UNHANDLED(Exception_t *)
	}
 
IL_0036:
	{
		int32_t L_10 = V_0;
		return L_10;
	}
}

This is pretty similar to the foreach loop on a List<T>. It has method initialization and then the breakdown into GetEnumerator, the Current getter, MoveNext, and Dispose, with exception handling via a finally block. The major difference is that InterfaceFuncInvoker0::Invoke and InterfaceActionInvoker0::Invoke are used to call all of these functions, so they’ll all be slower calls than the non-virtual functions that were called when using a foreach loop on a List<T>.
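
To make that shape easier to see without the generated labels and gotos, here’s a rough hand-written C++ sketch of the same lowering. It is not IL2CPP output: IntEnumerator, IntEnumerable, and VectorEnumerable are hypothetical stand-ins for the C# interfaces and a concrete collection, and ordinary virtual calls stand in for the interface invokers, whose lookup is not reproduced here.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical analogue of IEnumerator<int> plus IDisposable.
// Every member is a virtual call, standing in for the interface invokers.
struct IntEnumerator {
    virtual ~IntEnumerator() = default;
    virtual bool MoveNext() = 0;
    virtual int32_t Current() const = 0;
    virtual void Dispose() = 0;
};

// Hypothetical analogue of IEnumerable<int>.
struct IntEnumerable {
    virtual ~IntEnumerable() = default;
    virtual std::unique_ptr<IntEnumerator> GetEnumerator() const = 0;
};

// Hypothetical concrete collection backed by a vector.
class VectorEnumerable final : public IntEnumerable {
public:
    explicit VectorEnumerable(std::vector<int32_t> values) : values_(std::move(values)) {}

    std::unique_ptr<IntEnumerator> GetEnumerator() const override {
        return std::make_unique<Enumerator>(values_);
    }

private:
    class Enumerator final : public IntEnumerator {
    public:
        explicit Enumerator(const std::vector<int32_t>& values) : values_(values) {}
        // next_ counts how many elements have been stepped onto so far
        bool MoveNext() override { return ++next_ <= values_.size(); }
        int32_t Current() const override { return values_[next_ - 1]; }
        void Dispose() override {}
    private:
        const std::vector<int32_t>& values_;
        std::size_t next_ = 0;
    };

    std::vector<int32_t> values_;
};

// Hand-lowered foreach: GetEnumerator, then MoveNext/Current until done,
// with Dispose run on both the normal and the exceptional path.
int32_t ForeachEnumerable(const IntEnumerable& enumerable) {
    int32_t sum = 0;
    std::unique_ptr<IntEnumerator> e = enumerable.GetEnumerator();
    try {
        while (e->MoveNext()) {
            sum += e->Current();
        }
    } catch (...) {
        if (e) e->Dispose(); // the "finally" block when an exception escapes
        throw;
    }
    if (e) e->Dispose();     // the "finally" block on normal completion
    return sum;
}
```

Calling ForeachEnumerable on a VectorEnumerable holding 1, 2, 3, 4 returns 10. Keep in mind that a C++ compiler may be able to devirtualize calls like these when it can see the concrete type, so this only illustrates the structure, not the cost.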

Let’s look at the assembly for this. Just like with the foreach loop on a List<T>, it’s long and can just be skimmed to get a sense of its size:

	push	{r4, r5, r6, r7, lr}
	add	r7, sp, #12
	push.w	{r8, r10, r11}
	sub.w	r4, sp, #64
	bfc	r4, #0, #4
	mov	sp, r4
	vst1.64	{d8, d9, d10, d11}, [r4:128]!
	vst1.64	{d12, d13, d14, d15}, [r4:128]
	sub	sp, #96
	movw	r5, :lower16:(__ZZ39TestClass_ForeachEnumerable_m2471184119E25s_Il2CppMethodInitialized-(LPC4_2+4))
	mov	r4, r1
	movt	r5, :upper16:(__ZZ39TestClass_ForeachEnumerable_m2471184119E25s_Il2CppMethodInitialized-(LPC4_2+4))
	movw	r0, :lower16:(L___gxx_personality_sj0$non_lazy_ptr-(LPC4_3+4))
	movt	r0, :upper16:(L___gxx_personality_sj0$non_lazy_ptr-(LPC4_3+4))
LPC4_2:
	add	r5, pc
LPC4_3:
	add	r0, pc
	ldr	r1, LCPI4_0
	ldrb	r6, [r5]
	ldr	r0, [r0]
LPC4_0:
	add	r1, pc
	str	r0, [sp, #68]
	ldr	r0, LCPI4_1
	str	r1, [sp, #72]
	orr	r0, r0, #1
	str	r7, [sp, #76]
LPC4_1:
	add	r0, pc
	str.w	sp, [sp, #84]
	str	r0, [sp, #80]
	add	r0, sp, #44
	bl	__Unwind_SjLj_Register
	cbnz	r6, LBB4_2
	movw	r0, :lower16:(L_TestClass_ForeachEnumerable_m2471184119_MetadataUsageId$non_lazy_ptr-(LPC4_4+4))
	mov.w	r1, #-1
	movt	r0, :upper16:(L_TestClass_ForeachEnumerable_m2471184119_MetadataUsageId$non_lazy_ptr-(LPC4_4+4))
	str	r1, [sp, #48]
LPC4_4:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r0, [r0]
	bl	__ZN6il2cpp2vm13MetadataCache24InitializeMethodMetadataEj
	movs	r0, #1
	strb	r0, [r5]
LBB4_2:
	cbnz	r4, LBB4_4
	mov.w	r0, #-1
	str	r0, [sp, #48]
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB4_4:
	movw	r0, :lower16:(L_IEnumerable_1_t1930798642_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_5+4))
	mov	r2, r4
	movt	r0, :upper16:(L_IEnumerable_1_t1930798642_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_5+4))
	movs	r5, #0
LPC4_5:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r1, [r0]
	mov.w	r0, #-1
	str	r0, [sp, #48]
	movs	r0, #0
	bl	__ZN21InterfaceFuncInvoker0IP12Il2CppObjectE6InvokeEjP11Il2CppClassS1_
	str	r0, [sp, #20]
	movw	r0, :lower16:(L_IEnumerator_t1853284238_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_6+4))
	movt	r0, :upper16:(L_IEnumerator_t1853284238_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_6+4))
LPC4_6:
	add	r0, pc
	ldr	r0, [r0]
	str	r0, [sp, #16]
	movw	r0, :lower16:(L_IEnumerator_1_t3383516221_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_7+4))
	movt	r0, :upper16:(L_IEnumerator_1_t3383516221_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_7+4))
LPC4_7:
	add	r0, pc
	ldr	r0, [r0]
	str	r0, [sp, #12]
	b	LBB4_9
LBB4_5:
	ldr	r0, [sp, #20]
	cbnz	r0, LBB4_7
	movs	r0, #1
	str	r0, [sp, #48]
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB4_7:
	ldr	r0, [sp, #12]
	ldr	r1, [r0]
	movs	r0, #2
	ldr	r2, [sp, #20]
	str	r0, [sp, #48]
	movs	r0, #0
	bl	__ZN21InterfaceFuncInvoker0IiE6InvokeEjP11Il2CppClassP12Il2CppObject
	ldr	r1, [sp, #24]
	adds	r5, r0, r1
	str	r5, [sp, #24]
	ldr	r0, [sp, #20]
	cbnz	r0, LBB4_11
	movs	r0, #3
	str	r0, [sp, #48]
	bl	__ZN6il2cpp2vm9Exception27RaiseNullReferenceExceptionEv
LBB4_11:
	ldr	r0, [sp, #16]
	ldr	r1, [r0]
	movs	r0, #4
	ldr	r2, [sp, #20]
	str	r0, [sp, #48]
	movs	r0, #1
	bl	__ZN21InterfaceFuncInvoker0IbE6InvokeEjP11Il2CppClassP12Il2CppObject
	cmp	r0, #0
	bne	LBB4_5
	movs	r4, #0
	movs	r0, #54
LBB4_14:
	strd	r0, r4, [sp, #36]
	ldr	r0, [sp, #20]
	cbz	r0, LBB4_16
	movw	r0, :lower16:(L_IDisposable_t3640265483_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_8+4))
	movt	r0, :upper16:(L_IDisposable_t3640265483_il2cpp_TypeInfo_var$non_lazy_ptr-(LPC4_8+4))
LPC4_8:
	add	r0, pc
	ldr	r0, [r0]
	ldr	r1, [r0]
	mov.w	r0, #-1
	ldr	r2, [sp, #20]
	str	r0, [sp, #48]
	movs	r0, #0
	bl	__ZN23InterfaceActionInvoker06InvokeEjP11Il2CppClassP12Il2CppObject
LBB4_16:
	ldr	r1, [sp, #36]
	ldr	r0, [sp, #40]
	cmp	r1, #54
	it	ne
	cmpne	r0, #0
	beq	LBB4_18
	ldr	r0, [sp, #40]
	mov.w	r1, #-1
	str	r1, [sp, #48]
	bl	__ZN6il2cpp2vm9Exception5RaiseEP15Il2CppException
LBB4_18:
	add	r0, sp, #44
	ldr	r4, [sp, #24]
	bl	__Unwind_SjLj_Unregister
	mov	r0, r4
	add	r4, sp, #96
	vld1.64	{d8, d9, d10, d11}, [r4:128]!
	vld1.64	{d12, d13, d14, d15}, [r4:128]
	sub.w	r4, r7, #24
	mov	sp, r4
	pop.w	{r8, r10, r11}
	pop	{r4, r5, r6, r7, pc}
LBB4_19:
	ldr	r0, [sp, #48]
	cmp	r0, #4
	bls	LBB4_21
	trap
LBB4_21:
LCPI4_2:
	tbb	[pc, r0]
LJTI4_0:
LBB4_23:
	b	LBB4_27
LBB4_24:
	b	LBB4_27
LBB4_25:
	b	LBB4_27
LBB4_26:
LBB4_27:
	ldr	r0, [sp, #52]
	ldr	r1, [sp, #56]
	strd	r1, r0, [sp, #28]
	ldr	r0, [sp, #28]
	cmp	r0, #1
	bne	LBB4_29
	ldr	r0, [sp, #32]
	bl	___cxa_begin_catch
	ldr	r4, [r0]
	mov.w	r0, #-1
	str	r0, [sp, #48]
	bl	___cxa_end_catch
	movs	r0, #0
	b	LBB4_14
LBB4_29:
	ldr	r0, [sp, #32]
	mov.w	r1, #-1
	str	r1, [sp, #48]
	bl	__Unwind_SjLj_Resume

This looks very much like the assembly that was generated for foreach on a List<T> and lines up with the C++ quite well. We can see all the calls like __ZN23InterfaceActionInvoker06InvokeEjP11Il2CppClassP12Il2CppObject for InterfaceActionInvoker0::Invoke that have their own assembly elsewhere. There are also all the calls to exception-related code like ___cxa_begin_catch and __Unwind_SjLj_Resume.

Conclusion

There is a clear winner here: for loops on an array with the Length field cached and the null- and bounds-checks disabled. This form produces tiny, minimal assembly code for the CPU to run and should be preferred whenever we care about performance. Not disabling the null- and bounds-checks triples the number of if checks in every iteration of the loop! Unfortunately, foreach loops on an array use this very form. Even with the checks disabled, a foreach loop still fetches the Length field every iteration, and there’s no way to cache it in a local variable (i.e. in a CPU register) as we can with a for loop. So foreach loops on arrays should be avoided when performance matters.
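
As a quick reminder of where that tripling comes from, here’s a simplified C++ sketch of the two extremes. It’s hand-written rather than IL2CPP output: IntArray, NullCheck, and BoundsCheck are hypothetical stand-ins for the managed array and the checks IL2CPP emits.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a managed int[]: a length plus the elements.
struct IntArray {
    std::size_t length;
    const int32_t* items;
};

// Hypothetical helpers playing the role of the null and bounds checks.
inline void NullCheck(const IntArray* array) {
    if (array == nullptr) throw std::runtime_error("NullReferenceException");
}

inline void BoundsCheck(const IntArray* array, std::size_t index) {
    if (index >= array->length) throw std::out_of_range("IndexOutOfRangeException");
}

// Checks enabled, Length not cached: three tests per iteration
// (loop condition, null check, bounds check), and length is re-read each time.
int32_t SumChecked(const IntArray* array) {
    int32_t sum = 0;
    NullCheck(array); // guards the first read of array->length
    for (std::size_t i = 0; i < array->length; ++i) {
        NullCheck(array);
        BoundsCheck(array, i);
        sum += array->items[i];
    }
    return sum;
}

// Checks disabled, Length cached: only the loop condition remains.
// The caller must guarantee a non-null array.
int32_t SumUnchecked(const IntArray* array) {
    int32_t sum = 0;
    const std::size_t len = array->length;
    for (std::size_t i = 0; i < len; ++i) {
        sum += array->items[i];
    }
    return sum;
}
```

Both functions return the same sum for a valid array. An optimizing C++ compiler may well remove the redundant tests in a small sketch like this, so treat it as a picture of the source-level shape rather than a prediction of the assembly.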

Aside from plain arrays, performance takes a serious dive with List<T>. No matter what kind of loop is used, calling methods of a generic type means type initialization overhead is generated for the function with the loop. Every iteration of the loop is slowed down with range checks, function calls to get the current item, and exception-induced overhead. Compared to the few instructions of assembly code that could have been generated, now we’re looking at dozens or even hundreds with expensive branching, cache misses, and function calls. The foreach loop version is even worse than the for version, but neither is good. Stick to plain arrays to avoid all this overhead.

Finally, there’s IEnumerable<T>. It’s a poor choice from the outset due to the GC allocation required to get its enumerator for foreach, which is the only loop option. Aside from that, its loop is the slowest because every iteration requires two interface function calls: one to advance the enumerator and one to get the current value. It should be used very sparingly.
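
To put rough numbers on that, here’s a small C++ sketch that tallies the work a single enumerator-based loop does. Everything in it is hypothetical scaffolding for illustration (Tally, IntEnumerator, CountingEnumerator, GetEnumerator, SumAndTally); it counts calls and allocations and says nothing about how long each one takes under IL2CPP.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical tallies, used only to make the per-loop work visible.
struct Tally {
    int allocations = 0;
    int moveNextCalls = 0;
    int currentCalls = 0;
};

// Hypothetical analogue of IEnumerator<int>; both members are virtual calls.
struct IntEnumerator {
    virtual ~IntEnumerator() = default;
    virtual bool MoveNext() = 0;
    virtual int32_t Current() = 0;
};

class CountingEnumerator final : public IntEnumerator {
public:
    CountingEnumerator(const std::vector<int32_t>& values, Tally& tally)
        : values_(values), tally_(tally) {}

    bool MoveNext() override {
        ++tally_.moveNextCalls;
        return ++next_ <= values_.size();
    }

    int32_t Current() override {
        ++tally_.currentCalls;
        return values_[next_ - 1];
    }

private:
    const std::vector<int32_t>& values_;
    Tally& tally_;
    std::size_t next_ = 0;
};

// Stand-in for IEnumerable<int>.GetEnumerator(): one heap allocation per loop.
std::unique_ptr<IntEnumerator> GetEnumerator(const std::vector<int32_t>& values, Tally& tally) {
    ++tally.allocations;
    return std::make_unique<CountingEnumerator>(values, tally);
}

// Sums the values through the enumerator and records the calls made.
int32_t SumAndTally(const std::vector<int32_t>& values, Tally& tally) {
    int32_t sum = 0;
    std::unique_ptr<IntEnumerator> e = GetEnumerator(values, tally);
    while (e->MoveNext()) {
        sum += e->Current();
    }
    return sum;
}
```

For N elements this makes one allocation, N + 1 calls to MoveNext (the last one returns false), and N calls to Current. A for loop over a plain array with Length cached needs none of them.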

#1 by VVEthan on March 5th, 2018 · Reply

First, thank you for these awesome in-depth investigations!

Secondly, I feel it might be worth mentioning the cases where optimal looping is critical (Update() methods, or anything else that runs each frame) and where it’s not (e.g. doing a foreach on a collection of objects once when a user taps a button.)

I only bring this up because I feel like one of the hardest, ongoing challenges of software is understanding how to take focused lessons like this article and use them productively in the context of writing a full blown application/game.

(I recently had some confusing discussions on Unity forums about how Coroutines are bad and never had a good reason to be used… except for the slew of scenarios in which they are useful and have no meaningful impact on performance.)

#2 by jackson on March 5th, 2018 · Reply

This is an excellent point. If you’re going to loop over five things once every ten seconds then it really doesn’t matter what kind of loop you use. Even a foreach loop over an IEnumerable<T> might be acceptable and it creates garbage!

This investigation is to arm you with the knowledge you need to make this kind of decision. It’s highly game-specific and even within the game, situation-dependent. Once you know the costs of each loop you’re able to judge the tradeoffs in terms of performance, readability, and other factors and decide on the appropriate type of collection, loop, IL2CPP settings, etc.

So far I haven’t discussed this much in the IL2CPP articles other than adding a “when performance matters” disclaimer to my conclusions, but perhaps I should call this out explicitly.

#3 by rick on November 30th, 2021 · Reply

Very interesting article! I would love to see this revisited in 2021 (or 2022?) because both C# and IL2CPP have evolved a lot since the time of writing.
