Why Static Is Slow
Using static variables and functions is slow. That was the conclusion of the previous article on statics, but the subject is actually more nuanced than that. Today we’ll explore statics in more depth and find out just why they are so slow.
Based on some keen comments (particularly by skyboy), I’ve put together the following test:
// BaseClass.as
package
{
    import flash.display.Sprite;

    public class BaseClass extends Sprite
    {
        protected var superVal:Number = 44;
    }
}

// StaticTest2.as
package
{
    import flash.display.*;
    import flash.utils.*;
    import flash.text.*;

    public class StaticTest2 extends BaseClass
    {
        private var __logger:TextField = new TextField();
        private function row(...cols): void
        {
            __logger.appendText(cols.join(",")+"\n");
        }

        protected var val:Number = 33;
        protected static var staticVal:Number = 33;

        public function StaticTest2()
        {
            __logger.autoSize = TextFieldAutoSize.LEFT;
            addChild(__logger);

            var beforeTime:int;
            var afterTime:int;
            var REPS:int = 1000000;
            var i:int;
            var c:Class;
            var num:Number;

            row("Test", "Time");

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                c = Math;
            }
            afterTime = getTimer();
            row("Get class", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = Math.PI;
            }
            afterTime = getTimer();
            row("Dot access static var", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = Math["PI"];
            }
            afterTime = getTimer();
            row("Index access static var", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = val;
            }
            afterTime = getTimer();
            row("num = val", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = this.val;
            }
            afterTime = getTimer();
            row("num = this.val", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = superVal;
            }
            afterTime = getTimer();
            row("num = superVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = this.superVal;
            }
            afterTime = getTimer();
            row("num = this.superVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = super.superVal;
            }
            afterTime = getTimer();
            row("num = super.superVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = staticVal;
            }
            afterTime = getTimer();
            row("num = staticVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = StaticTest2.staticVal;
            }
            afterTime = getTimer();
            row("num = StaticTest2.staticVal", (afterTime-beforeTime));
        }
    }
}
Let’s look at the bytecode generated for each of these in order. First, just getting a class: (annotated by me)
getlex :Math      // get the Math class
coerce :Class     // cast it to a Class
setlocal 5        // assign it to the c variable
Next is getting a static variable (PI) of a class (Math), which was the focus of the last article:
getlex :Math      // get the Math class
getproperty :PI   // get its PI property
convert_d         // convert PI to a Number
setlocal 6        // assign PI to the num variable
Next we follow skyboy’s suggestion and access the PI property of Math by indexing like so: Math["PI"]
getlex :Math      // get the Math class
pushstring "PI"   // push the "PI" string on the stack
getproperty private,StaticTest2,http://adobe.com/AS3/2006/builtin,,flash.text,flash.utils,private,,flash.display,StaticTest2,BaseClass,flash.display:Sprite,flash.display:DisplayObjectContainer,flash.display:InteractiveObject,flash.display:DisplayObject,flash.events:EventDispatcher:null   // get the PI property
convert_d         // convert PI to a Number
setlocal 6        // assign PI to the num variable
Wow, that’s a lot of arguments to getproperty! Later on in the article we’ll see what kind of performance impact that has. For now, let’s continue with some comparisons of non-static fields starting with num = val:
getlocal0                     // get the "this" object
getproperty StaticTest2:val   // get val, which is defined in StaticTest2
convert_d                     // convert val to a Number
setlocal 6                    // assign val to the num variable
That was pretty straightforward. If we type a little more (as is common) and use num = this.val, does the compiler generate different bytecode? Let’s see:
getlocal0                     // get the "this" object
getproperty StaticTest2:val   // get val, which is defined in StaticTest2
convert_d                     // convert val to a Number
setlocal 6                    // assign val to the num variable
The answer is clear: no, the generated code is identical whether or not you use the this keyword. Now what about variables defined in our parent class? Let’s start with num = superVal:
getlex BaseClass:superVal   // get superVal
convert_d                   // convert superVal to a Number
setlocal 6                  // assign superVal to the num variable
This bytecode uses a getlex just like the static accesses did, but it avoids the getproperty instruction. We’ll find out in a little bit what kind of performance impact this has. In the meantime, let’s see the bytecode generated when we use the this keyword to write num = this.superVal:
getlocal0                        // get the "this" object
getproperty BaseClass:superVal   // get superVal
convert_d                        // convert superVal to a Number
setlocal 6                       // assign superVal to num
Using the this keyword totally transforms the bytecode! Rather than using getlex, the bytecode is now identical to the num = this.val bytecode except that getproperty references the parent class (BaseClass) instead of the current class (StaticTest2). Does the compiler do this when we use the super keyword? Let’s look:
getlocal0                     // get the "this" object
getsuper BaseClass:superVal   // get superVal from the base class
convert_d                     // convert superVal to a Number
setlocal 6                    // assign superVal to num
Here we have a third set of bytecode generated for functionally equivalent code. This version gets the this object like the version using the this keyword, but it uses a new instruction (getsuper) to fetch the superVal variable. Now that we have a good set of non-statics to compare against, let’s turn our attention back to statics and check out num = staticVal:
getlex StaticTest2:staticVal   // get staticVal
convert_d                      // convert staticVal to a Number
setlocal 6                     // assign staticVal to num
This access ties num = superVal for the fewest instructions (3) so far, but is it faster given that it has a getlex? We’ll see in just a bit. Let’s look at the final test with num = StaticTest2.staticVal:
getglobalscope                      // get the global scope
getslot 1                           // get slot 1 of the global scope
getproperty StaticTest2:staticVal   // get the staticVal property of that slot's value
convert_d                           // convert staticVal to a Number
setlocal 6                          // assign staticVal to num
This has to be the most roundabout way of getting to staticVal, given that it is functionally identical to the previous version, which simply left off the redundant class name.
Now that we’ve inspected the bytecode, let’s look at the results. I ran these tests in the following environment:
- Flex SDK (MXMLC) 4.5.1.21328, compiling in release mode (no debugging or verbose stack traces)
- Release version of Flash Player 11.1.102.55
- 2.4 GHz Intel Core i5
- Mac OS X 10.7.2
And got these results:
Test | Time (ms) |
---|---|
Get class | 3 |
Dot access static var | 8 |
Index access static var | 1360 |
num = val | 2 |
num = this.val | 2 |
num = superVal | 2 |
num = this.superVal | 2 |
num = super.superVal | 2 |
num = staticVal | 2 |
num = StaticTest2.staticVal | 2 |
Given that the first three tests forced the number of iterations down so far that there was apparently no difference between the rest of them, I decided to make another test class that drops the first three tests and cranks up the iterations (from 1,000,000 to 100,000,000) to examine the performance differences between the remaining tests:
package
{
    import flash.display.*;
    import flash.utils.*;
    import flash.text.*;

    public class StaticTest3 extends BaseClass
    {
        private var __logger:TextField = new TextField();
        private function row(...cols): void
        {
            __logger.appendText(cols.join(",")+"\n");
        }

        protected var val:Number = 33;
        protected static var staticVal:Number = 33;

        public function StaticTest3()
        {
            __logger.autoSize = TextFieldAutoSize.LEFT;
            addChild(__logger);

            var beforeTime:int;
            var afterTime:int;
            var REPS:int = 100000000;
            var i:int;
            var c:Class;
            var num:Number;

            row("Test", "Time");

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = val;
            }
            afterTime = getTimer();
            row("num = val", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = this.val;
            }
            afterTime = getTimer();
            row("num = this.val", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = superVal;
            }
            afterTime = getTimer();
            row("num = superVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = this.superVal;
            }
            afterTime = getTimer();
            row("num = this.superVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = super.superVal;
            }
            afterTime = getTimer();
            row("num = super.superVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = staticVal;
            }
            afterTime = getTimer();
            row("num = staticVal", (afterTime-beforeTime));

            beforeTime = getTimer();
            for (i = 0; i < REPS; ++i)
            {
                num = StaticTest3.staticVal;
            }
            afterTime = getTimer();
            row("num = StaticTest3.staticVal", (afterTime-beforeTime));
        }
    }
}
Here are the results for StaticTest3 in the same environment:
Test | Time (ms) |
---|---|
num = val | 215 |
num = this.val | 223 |
num = superVal | 219 |
num = this.superVal | 220 |
num = super.superVal | 220 |
num = staticVal | 222 |
num = StaticTest3.staticVal | 293 |
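The two tables use different iteration counts (1,000,000 and 100,000,000), so the raw times are not directly comparable. A minimal sketch of converting a total time to a per-iteration cost follows. The nsPerIteration function is a hypothetical helper, not part of the original tests, and the figures it produces still include the loop overhead itself:

```actionscript
// nsPerIteration.as (hypothetical helper, not part of the original tests)
package
{
    /**
     * Converts a total loop time reported by getTimer() (milliseconds)
     * into nanoseconds per iteration.
     */
    public function nsPerIteration(totalMs:Number, reps:Number):Number
    {
        // 1 millisecond = 1,000,000 nanoseconds
        return (totalMs * 1000000) / reps;
    }
}
```

With the numbers above, nsPerIteration(215, 100000000) comes to about 2.15 ns for num = val, nsPerIteration(293, 100000000) comes to about 2.93 ns for num = StaticTest3.staticVal, and nsPerIteration(1360, 1000000) comes to 1360 ns for Math["PI"] in the first test.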
Now that we have the results data, let’s draw some conclusions:
- Indexing a static variable (Math["PI"]) is far and away the slowest way you can access a field. Avoid it at all costs in performance-critical code.
- Just getting a class (c = Math) is 50% slower than accessing a field (static or not) in the same class or any superclass, and that is before paying for access to any of the class’s fields.
- Accessing a static field of another class is about 4x slower than accessing any field in the same class or superclass.
- Specifying even the same class (StaticTest3.staticVal) results in bytecode that is slower (roughly a third slower in the StaticTest3 results) than any other way of accessing a field of the same class or superclass, static or not.
- Despite wildly different bytecode, all other ways of accessing fields of the same class or superclass are roughly the same speed. The JIT is probably producing identical machine code for all of these different sets of bytecode instructions.
The main takeaway here is to remember that static is slow when you’re accessing through a class name. Minimize that (perhaps by caching) and you should be fine.
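As a minimal sketch of the caching idea, assuming a hot loop that repeatedly reads a static of another class, the static can be copied into a local variable once before the loop. The class name CachedStaticExample and the variable names are hypothetical and chosen only for illustration:

```actionscript
// CachedStaticExample.as (hypothetical example, not part of the original tests)
package
{
    import flash.display.Sprite;

    public class CachedStaticExample extends Sprite
    {
        public function CachedStaticExample()
        {
            var REPS:int = 1000000;
            var i:int;
            var sum:Number = 0;

            // Uncached: Math.PI is fetched through the class name
            // (getlex + getproperty) on every iteration
            for (i = 0; i < REPS; ++i)
            {
                sum += Math.PI;
            }

            // Cached: one access through the class name before the loop,
            // then only local variable reads inside it
            var pi:Number = Math.PI;
            sum = 0;
            for (i = 0; i < REPS; ++i)
            {
                sum += pi;
            }
        }
    }
}
```

Both loops compute the same sum. How much the cached version gains will depend on the player version and on how much other work the loop body does, so it is worth timing in your own code.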
Spot a bug? Have a suggestion? Post a comment!
#1 by NemoStein on January 30th, 2012 ·
“The main takeaway here is to remember that static is slow when you’re accessing through a class name. Minimize that (perhaps by caching) and you should be fine.”
In other words:
#2 by jackson on January 30th, 2012 ·
Exactly, but you could also make someStuff static and then reference it like this:
Not like this:
#3 by ben w on January 30th, 2012 ·
out of curiosity could you run the following test
just to see what impact creating the string object a zillion times had on the test
cheers, ben
#4 by jackson on January 30th, 2012 ·
I actually did that test too, but dropped it when I found that it was quite a bit slower than the string literal version. Perhaps I’ll go more in depth on why in another article.
#5 by Aleksandr Makov on January 30th, 2012 ·
Hey Jackson, thanks for the research. What did you use to inspect the bytecode?
#6 by jackson on January 30th, 2012 ·
You’re welcome. :)
I used swfdump that comes with the Flex SDK. Adobe has a nice article on how to use it. Basically you just do this:
Then check out bytecode.txt in a text editor. Scroll down a little and you should see the function names with their bytecode.
#7 by skyboy on January 30th, 2012 ·
I’d like to comment on this bit of code:
getproperty private,StaticTest2,http://adobe.com/AS3/2006/builtin,,flash.text,flash.utils,private,,flash.display,StaticTest2,BaseClass,flash.display:Sprite,flash.display:DisplayObjectContainer,flash.display:InteractiveObject,flash.display:DisplayObject,flash.events:EventDispatcher:null // get the PI property
Those aren’t a list of arguments; the arguments (Object, followed by the String) are on the stack. That’s actually a multiname and all of the names it contains; perhaps a NamespaceSet. All of the opcodes that have an argument on the same line are reading an index into a pool somewhere; such as with callmethod and callstatic – two opcodes for calling functions you will never find in SWFs, but they’re completely implemented. All that’s needed is a compiler or post-compiler that makes use of them: they skip most of the checks, and call directly into the SWF function pool (callstatic; this has limitations on application) or into the method pool of a class (callmethod; same limitations as callstatic).
The opcodes that do this are often faster than their stack-based brothers; but not always, due to optimizations made by the JIT in flash; for instance:
(x * (x * (x * x))) is more likely to be optimized into faster code than x * x * x * x because of what the compiler generates: getlocal, getlocal, getlocal, getlocal, multiply, multiply, multiply vs. getlocal, getlocal, multiply, getlocal, multiply, getlocal, multiply. The JIT can optimize the former into a single superword (superwords are 16 bits where words are 8 bits; in the context of the interpreter) that loads 4 local variables in one go much more easily than the latter.
As useful as the JIT optimizations are, in most cases they don’t seem to be applied at all because the compiler generates dumb assembler, not designed to take advantage of the JIT optimizations. In some cases the compiler generates outright stupid code (like converting uints to float, decrementing then back to uint only when the decrement takes place inside a condition or brackets), but that’s another complaint.
This instruction is using a runtime multiname as well, which invokes a massive overhead (it’s this exact multiname that’s delayed me making my own post-compile optimizer; I’m still struggling with implementing a fully type-strict stack and interpreter just to address these properly).
This specific opcode could get an entire article series devoted to it; I’ll continue in another comment because I’m not sure what the length limit is.
#8 by jackson on January 30th, 2012 ·
I’m not sure what the length limit is, either. :)
Very interesting food for thought. Perhaps I’ll write an article on workarounds for the compiler like your (x * (x * (x * x))) or uint -> Number -> uint for decrement. Any more to share for inclusion?
#9 by skyboy on January 30th, 2012 ·
I can’t say every optimization the JIT makes and how to take advantage of them (I don’t know terribly many to start with; I haven’t dug deep into the JIT) but here is a list+implementations of the superwords: http://pastebin.com/g9tqypW3
sp is the current working stack; pc is the current byte in the SWF’s ABC code (post-optimization) that’s being executed; I’m not entirely certain what framep is.
The full code of the interpreter can be found here.
#10 by Kyle Murray / Krilnon on January 31st, 2012 ·
framep is the frame pointer, which points to an address in the current stack frame. It’s useful because the stack pointer register often changes around during the execution of code in that frame.
#11 by skyboy on January 30th, 2012 ·
This is actual live code from the Flash Player AS3 interpreter source:
Yes, that getproperty_fast instruction is in the code, and it is commented out with precompiler tags. I don’t know the specifics on why implementing that superword was ditched, but I believe it has to do with the Dictionary class.
The getlex instruction does the same thing getproperty does, but it’s so much faster because it doesn’t have to deal with Array/Vector/Dictionary or runtime multinames. If that opcode was generated for getting the properties of static classes, we wouldn’t have performance pitfalls for accessing them. I still believe the access is dynamic, just not runtime dynamic – the classes are of the type Class, which has no properties or methods defined. The difference between accessing a static property and instance property appears to be similar to the difference seen with the dot operator vs index operator on another class (perhaps Dictionary?) in a previous article. So while the name is known, it’s still a dynamic look up when compared to the look up for an instance; it’s just not a dynamic runtime look up.
Of course, I can’t say for certain without digging deeper into the multiname pool of an individual SWF along with the associated code in the AS3 interpreter.
Apparat has a class or two in it that appear to exploit that by having non-dynamic calls to the function of a class, potentially by treating the class as an instance. I haven’t looked into what exactly goes on in that class to say, but using Apparat you could devise a test of keeping the Math class in a local variable then calling its methods/retrieving properties in a loop; Following up with a test that uses a single getlex instruction to grab the static properties.
More on callstatic/callmethod: The specific limitation I mentioned for both is that you can only call into the method pool of the running SWF. So these opcodes may very well result in huge performance gains in calling functions, they can have zero impact for calling functions native to AS3; such as the Math functions. You can see more of how they’re faster in this code:
I can’t say why the compiler doesn’t make use of these two extra fully implemented instructions, but my post-compile optimizer will most definitely have the option to test it. If I can manage to get the stack/interpreting completed.
#12 by jackson on January 31st, 2012 ·
Very interesting insights. Thanks for sharing. When your post-compile optimizer is in usable shape, I’d love to test it out for you.
#13 by Rackdoll on January 31st, 2012 ·
First of all…. –> Nice article.
Second.
I was wondering if you tried the private static var / const in your tests ?
IF so can you post those results….. ?
thnx :)
#14 by Rackdoll on January 31st, 2012 ·
or are the results same as ” protected static” ?
Is there even a difference in performance..-> protected , private or public constants ?
#15 by skyboy on January 31st, 2012 ·
public is marginally faster (less than 1%) at thousands of iterations; the rest of them have less meaningful differences and const/var perform identically.
When it comes to these matters (public/private/protected/internal/custom | const/var), write your code for what you need it to do; performance gains/losses is entirely meaningless due to incredibly small gains, that result in nothing of value when the user has other applications running that compete for resources (so, everyone).
#16 by Rackdoll on January 31st, 2012 ·
ok. seems clear enough. Thnx for the feedback!
keep on the greatness!
#17 by Kyle Murray / Krilnon on January 31st, 2012 ·
Using your tests, I don’t get such ridiculously slow results for Math[‘PI’]:
Flash Player 11.2.202.197 (beta 5, 64-bit, ‘release’, ActiveX)
SDK 4.5.1
Core i7-2600K 3.4 GHz
Of course there are a number of differences in our setups, but it seemed like a notable performance difference.
#18 by NemoStein on February 1st, 2012 ·
“Flash Player 11.2.202.197 (beta 5, 64-bit, ‘release’, ActiveX)”
Maybe FP 11.2 has some improvements that can lead to this result.
Beta 5 should be the last beta before release, so, in two weeks we can see the “real” results of this.
#19 by Amit Patel on February 14th, 2012 ·
Since you’re testing variable access inside a loop, have you looked at the bytecode for the loop operations as well, to see how much of your running time is from the loop (increment, test, jump) vs the variable access?
#20 by jackson on February 14th, 2012 ·
This is a good idea for a test, so I tried it out on the same test machine as in the article. I’m getting about 209 ms for an empty loop. So all the accesses except the static access through a class name are a little more differentiated than in the above graph, but the basic idea still stands: static access through a class name is relatively way slower than all other kinds of access.
#21 by Amit Patel on February 15th, 2012 ·
Thanks! If I understand right, that means accessing a local var (215-209 = 6) is over twice as fast as accessing an instance this.var (223 – 209 = 14) in the loop test, but they both test the same in the first test. Interesting. (I also don’t have a good sense of the std.dev of these measurements so that might account for the difference)
#22 by jackson on February 15th, 2012 ·
Yes, local variables are a good deal quicker than fields, especially when writing to them. For more on this, see my article Local Variable Caching.
#23 by Alex on December 11th, 2012 ·
Hi. Got an off-topic question here. How do you see the bytecode? And if you look at whole .swf bytecode, how do you identify the method which a portion of bytecode relates to? Thanks.
#24 by jackson on December 11th, 2012 ·
Adobe’s Flex SDK comes with a little command line tool to show you the bytecode. Just run this command:
#25 by Alex on December 14th, 2012 ·
Thank you very much!