Variable Ordering
I recently perused the AVM2 bytecode overview that Adobe provides and found something strange: there are special instructions for the first few local variables. The first four locals are read with getlocal0, getlocal1, getlocal2, and getlocal3, but a generalized getlocal is used for every local variable after that. This got me thinking about the performance implications of these specialized instructions. After all, why have them when you already have a generalized one? Read on to see the performance differences and whether you can get any speed boost from rearranging your local variables.
Deriving a test to judge the difference between the specific getlocalN and the general getlocal (as well as setlocalN and setlocal) could be difficult, but thankfully (in this case) MXMLC in Flex 4 will order local variables exactly as you declare them. It’ll even keep (and initialize!) any local variables you don’t use, which seems to be a glaring optimization opportunity. So it’s easy to lay out local variables such that we can pick which will be in the first four corresponding to getlocal0, getlocal1, getlocal2, and getlocal3. After this is done, it’s a simple matter of using that local variable a lot to emphasize the local variable reads and writes. A very long loop is perfectly suited to this task. So, here is the test app featuring the two test functions:
```actionscript
package
{
    import flash.text.*;
    import flash.utils.*;
    import flash.display.*;

    public class VariableOrdering extends Sprite
    {
        private var __logger:TextField = new TextField();

        public function VariableOrdering()
        {
            __logger.autoSize = TextFieldAutoSize.LEFT;
            addChild(__logger);

            testIFirst();
            testILast();
        }

        private function log(msg:*): void
        {
            __logger.appendText(msg + "\n");
        }

        private function testIFirst(): void
        {
            const REPS:uint = 4000000000;
            var i:uint;
            var beforeTime:int;
            var afterTime:int;
            var val0:int;
            var val1:int;
            var val2:int;
            var val3:int;
            var val4:int;
            var val5:int;
            var val6:int;
            var val7:int;
            var val8:int;
            var val9:int;

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
            }
            afterTime = getTimer();
            log("i first: " + (afterTime-beforeTime));
        }

        private function testILast(): void
        {
            var val1:int;
            var val2:int;
            var val3:int;
            var val4:int;
            var val5:int;
            var val6:int;
            var val7:int;
            var val8:int;
            var val9:int;
            const REPS:uint = 4000000000;
            var i:uint;
            var beforeTime:int;
            var afterTime:int;
            var val0:int;

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
            }
            afterTime = getTimer();
            log("i last: " + (afterTime-beforeTime));
        }
    }
}
```
As you can see, the loop variables i and REPS are declared at the beginning of the function in testIFirst and at the end in testILast. Moving both of them (rather than just i) past the first four slots doubles the number of generalized local variable accesses per loop iteration. The rest of the variables, val0 through val9, are simply padding to ensure that i and REPS land beyond the first four local variables and well into the generalized territory covered by getlocal and setlocal. So how do these test functions perform? Let's take a look.
| Environment | testIFirst (ms) | testILast (ms) |
| --- | --- | --- |
| 3.0 GHz Intel Core 2 Duo, Windows XP | 6876 | 6939 |
| 2.0 GHz Intel Core 2 Duo, Mac OS X | 9433 | 9639 |
Putting i and REPS first produces a 0.92% performance increase on Windows XP and about a 2.2% performance increase on Mac OS X. Both differences are small, but if you run the test over and over (as I have), you'll see that they occur on every run, with roughly the same advantage in favor of getlocalN/setlocalN. Still, what do you have to give up in order to cash in on that extra 1-2%? All you need to do is rearrange the order in which your local variables are declared. If you have any code that iterates over a lot of elements, you may find this handy for eking out just a little bit more performance. For most AS3 programmers, though, this will simply come as a relief: you needn't worry that the getlocalN/setlocalN opcodes are providing a performance boost you're missing out on.
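For reference, here is the arithmetic behind those speedup figures, taken straight from the timings in the table above (a quick Python sketch, using the faster time as the baseline; the function name is mine):

```python
# Compute the relative speedups quoted in the article from the raw
# timings (in milliseconds) reported in the table above.
def speedup(slow_ms, fast_ms):
    """Relative time saved by the faster variant, as a percentage."""
    return (slow_ms - fast_ms) / fast_ms * 100.0

windows = speedup(6939, 6876)  # testILast vs. testIFirst on Windows XP
mac = speedup(9639, 9433)      # same comparison on Mac OS X

print(f"Windows XP: {windows:.2f}%")  # → Windows XP: 0.92%
print(f"Mac OS X:   {mac:.2f}%")      # → Mac OS X:   2.18%
```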
#1 by Chris H on June 8th, 2010 ·
“This brought me to think about the performance implications of these specialized instructions. After all, why have them when you already have a generalized one?”
To save space in your SWF. getlocal/setlocal take 2 or 3 bytes each (the opcode plus a variable-length-encoded index), while getlocalN/setlocalN take only 1 byte each. Although it's weird that Adobe bothered with this when the compiler misses so many other optimization gimmes.
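Chris's byte counts can be sketched out, assuming the AVM2 u30 variable-length encoding (seven payload bits per byte, with the high bit set on every byte but the last, LEB128-style) and the 0x62 opcode for the generic getlocal; the helper names here are invented for illustration:

```python
# Sketch of the SWF-size difference between getlocalN and the generic
# getlocal, under the AVM2 u30 variable-length integer encoding.
def encode_u30(value):
    """Encode a non-negative integer as AVM2 u30 bytes (LEB128-style)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def getlocal_size(index):
    """Bytes of bytecode needed to read local register `index`."""
    if index <= 3:
        return 1  # getlocal0..getlocal3: the opcode alone
    return 1 + len(encode_u30(index))  # generic getlocal + u30 index

print(getlocal_size(2))    # → 1  (getlocal2, one byte)
print(getlocal_size(9))    # → 2  (getlocal plus a one-byte index)
print(getlocal_size(200))  # → 3  (the index needs two u30 bytes)
```

So a function that hammers on registers 0 through 3 pays one byte per access, while every access past register 3 costs at least two.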
#2 by jackson on June 8th, 2010 ·
Very good points. One obvious optimization gimme is getting rid of all the unused local variables, like in the example code above.
#3 by Chris H on June 8th, 2010 ·
Another one is if-statement short-circuiting (I discovered this playing around with nemo440). Compare these two functions and the generated bytecode:
Logically equivalent, right?
#4 by jackson on June 8th, 2010 ·
Ouch! The first version is just a painfully bad compilation of the AS3. The second version is perfectly sane, so I wonder why MXMLC does such a bad job with the first version.
For those not wanting to delve into the bytecode: the first version does a lot of extraneous conditional jumps and stack pushing and popping, while the second version is clean, tidy, and efficient.
I’m definitely going to have to do an article on this and other strange compilations. Thanks very much for bringing this to my attention!
#5 by Chris H on June 9th, 2010 ·
You’re welcome, I’m looking forward to reading it.
Some of the bytecode I've noticed the compiler producing is so bizarre that it makes me wonder whether the compiler is trying to work around bugs in earlier versions of the VM, or to output bytecode that's more amenable to JIT compilation to native code, or something else non-obvious like that. However, if it's possible to make the compiler produce the code it probably should be producing just by changing the syntax to something logically equivalent, then maybe that's not the case.
switch-case statements are another quite interesting case. Given that the AVM2 doc mentions a special lookupswitch bytecode that jumps based on an index into a jump table, I would have assumed that switch-case statements would be a lot faster than gigantic if-else chains. However, looking at the output bytecode, in order to compute the index into the jump table from the variable you switch on, the compiler outputs… a gigantic if-else chain comparing against each of the possible case values and setting the jump index for each one. Then it uses the jump index to jump to the right code. It does this even when all the case values are the same as their jump-table indices. Bizarrely, even though the lookupswitch bytecode supports a default jump offset for when you index outside the table, the jump table contains an extra entry for "default" and the fall-through condition of the if-else chain sets the jump index to this value, meaning the default-case jump offset appears in the table twice.
So I'm not sure switch-case statements in general would be any faster than just using an if-else chain directly (aside from the convenience of fall-throughs, etc.). It also means that case values appearing earlier in the switch will almost certainly execute faster, since they appear earlier in the if-else chain (I couldn't say for sure without profiling it, though).
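The dispatch pattern described above can be modeled in a few lines of Python (this is an illustrative sketch, not AVM2 bytecode, and the function and table names are invented; it assumes the case values map to consecutive jump-table slots in declaration order):

```python
# A rough model of the switch-case codegen described above: the compiler
# emits an if-else chain just to pick a jump-table index, then indexes
# the lookupswitch table -- even when value == index already.
def compiled_switch_dispatch(value, case_values):
    # Step 1: the if-else chain that computes the jump index. Earlier
    # cases are tested first, so they dispatch in fewer comparisons.
    jump_index = len(case_values)  # fall-through: the "default" slot
    for i, case_value in enumerate(case_values):
        if value == case_value:
            jump_index = i
            break
    # Step 2: the jump table. Note the default target appears both as
    # an extra last entry and (in real bytecode) as the out-of-range
    # fallback, which is the duplication described above.
    table = [f"case_{v}" for v in case_values] + ["default"]
    return table[jump_index]

print(compiled_switch_dispatch(1, [0, 1, 2]))  # → case_1
print(compiled_switch_dispatch(7, [0, 1, 2]))  # → default
```

The linear scan in step 1 is why a matching value near the top of the switch should dispatch faster than one near the bottom, table or no table.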