Double the Performance of Stage3D Apps
To draw with Flash Player 11’s Stage3D API, you must set up the state of various GPU resources before finally calling drawTriangles. Inevitably, you’ll end up calling drawTriangles multiple times during a single frame to draw your characters, terrain, sky, and so forth. In between these calls you will change the GPU’s state by calling Context3D’s set* functions. This article will show you which of these functions can literally cut your app’s performance in half.
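As a rough illustration of that pattern, here is a minimal sketch of one frame with two draw calls and a state change in between. The function name and all of its parameters are hypothetical; it assumes the context, program, buffers, and textures were already created and uploaded elsewhere.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.Context3DProgramType;
    import flash.display3D.Context3DVertexBufferFormat;
    import flash.display3D.IndexBuffer3D;
    import flash.display3D.Program3D;
    import flash.display3D.VertexBuffer3D;
    import flash.display3D.textures.Texture;
    import flash.geom.Matrix3D;

    /** Hypothetical helper: draws two objects that share everything but a texture and a matrix */
    public function drawTwoObjects(
        context3D:Context3D,
        program:Program3D,
        positions:VertexBuffer3D,
        texCoords:VertexBuffer3D,
        indices:IndexBuffer3D,
        skyTexture:Texture,
        terrainTexture:Texture,
        skyMatrix:Matrix3D,
        terrainMatrix:Matrix3D): void
    {
        context3D.clear(0.5, 0.5, 0.5);

        // State shared by both draws is set once
        context3D.setProgram(program);
        context3D.setVertexBufferAt(0, positions, 0, Context3DVertexBufferFormat.FLOAT_3);
        context3D.setVertexBufferAt(1, texCoords, 0, Context3DVertexBufferFormat.FLOAT_2);

        // First draw
        context3D.setTextureAt(0, skyTexture);
        context3D.setProgramConstantsFromMatrix(Context3DProgramType.VERTEX, 0, skyMatrix, false);
        context3D.drawTriangles(indices);

        // State change between draws: only the texture and the matrix differ
        context3D.setTextureAt(0, terrainTexture);
        context3D.setProgramConstantsFromMatrix(Context3DProgramType.VERTEX, 0, terrainMatrix, false);
        context3D.drawTriangles(indices);

        context3D.present();
    }
}
```

The set* calls between the two drawTriangles calls are the kind of state change this article measures.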
There are many set* functions in Context3D, but here I’m testing some of the most common ones:
- Shader programs: setProgram
- Vertex buffers: setVertexBufferAt
- Textures: setTextureAt
The following test app started life as the Simple Stage3D Camera test app. I then augmented it in the following ways:
- Added more cubes until my performance dropped below the 60 FPS cap
- Removed all camera controls and pointed it so that all cubes were visible
- Made duplicate GPU resources for the texture, shader program, and vertex buffers (positions and texture coordinates)
- Added check boxes to allow for changing the state of these GPU resources between every draw
- Added dummy code (an & and a function call) when state changing is disabled to avoid an unfair advantage
- UPDATE: Added buttons to change the number of cubes. The results below are for 15x15x15.
When the GPU state (i.e. texture, shader program, or vertex buffer) is changed between every draw, there should be no difference in the visual output because the alternate GPU resources being switched to are identical.
Here’s how the test app ended up:
package
{
    import com.adobe.utils.*;

    import flash.display.*;
    import flash.display3D.*;
    import flash.display3D.textures.*;
    import flash.events.*;
    import flash.filters.*;
    import flash.geom.*;
    import flash.text.*;
    import flash.utils.*;

    public class Stage3DStateChanging extends Sprite
    {
        private static const FLOAT_2:String = Context3DVertexBufferFormat.FLOAT_2;
        private static const FLOAT_3:String = Context3DVertexBufferFormat.FLOAT_3;

        /** Positions of all cubes' vertices */
        private static const POSITIONS:Vector.<Number> = new <Number>[
            // back face - bottom tri
            -0.5, -0.5, -0.5,   -0.5, 0.5, -0.5,   0.5, -0.5, -0.5,
            // back face - top tri
            -0.5, 0.5, -0.5,   0.5, 0.5, -0.5,   0.5, -0.5, -0.5,

            // front face - bottom tri
            -0.5, -0.5, 0.5,   -0.5, 0.5, 0.5,   0.5, -0.5, 0.5,
            // front face - top tri
            -0.5, 0.5, 0.5,   0.5, 0.5, 0.5,   0.5, -0.5, 0.5,

            // left face - bottom tri
            -0.5, -0.5, -0.5,   -0.5, 0.5, -0.5,   -0.5, -0.5, 0.5,
            // left face - top tri
            -0.5, 0.5, -0.5,   -0.5, 0.5, 0.5,   -0.5, -0.5, 0.5,

            // right face - bottom tri
            0.5, -0.5, -0.5,   0.5, 0.5, -0.5,   0.5, -0.5, 0.5,
            // right face - top tri
            0.5, 0.5, -0.5,   0.5, 0.5, 0.5,   0.5, -0.5, 0.5,

            // bottom face - bottom tri
            -0.5, -0.5, 0.5,   -0.5, -0.5, -0.5,   0.5, -0.5, 0.5,
            // bottom face - top tri
            -0.5, -0.5, -0.5,   0.5, -0.5, -0.5,   0.5, -0.5, 0.5,

            // top face - bottom tri
            -0.5, 0.5, 0.5,   -0.5, 0.5, -0.5,   0.5, 0.5, 0.5,
            // top face - top tri
            -0.5, 0.5, -0.5,   0.5, 0.5, -0.5,   0.5, 0.5, 0.5
        ];

        /** Texture coordinates of all cubes' vertices */
        private static const TEX_COORDS:Vector.<Number> = new <Number>[
            // back face - bottom tri
            1, 1,   1, 0,   0, 1,
            // back face - top tri
            1, 0,   0, 0,   0, 1,

            // front face - bottom tri
            0, 1,   0, 0,   1, 1,
            // front face - top tri
            0, 0,   1, 0,   1, 1,

            // left face - bottom tri
            0, 1,   0, 0,   1, 1,
            // left face - top tri
            0, 0,   1, 0,   1, 1,

            // right face - bottom tri
            1, 1,   1, 0,   0, 1,
            // right face - top tri
            1, 0,   0, 0,   0, 1,

            // bottom face - bottom tri
            0, 0,   0, 1,   1, 0,
            // bottom face - top tri
            0, 1,   1, 1,   1, 0,

            // top face - bottom tri
            0, 1,   0, 0,   1, 1,
            // top face - top tri
            0, 0,   1, 0,   1, 1
        ];

        /** Triangles of all cubes */
        private static const TRIS:Vector.<uint> = new <uint>[
            2, 1, 0,    // back face - bottom tri
            5, 4, 3,    // back face - top tri
            6, 7, 8,    // front face - bottom tri
            9, 10, 11,  // front face - top tri
            12, 13, 14, // left face - bottom tri
            15, 16, 17, // left face - top tri
            20, 19, 18, // right face - bottom tri
            23, 22, 21, // right face - top tri
            26, 25, 24, // bottom face - bottom tri
            29, 28, 27, // bottom face - top tri
            30, 31, 32, // top face - bottom tri
            33, 34, 35  // top face - bottom tri
        ];

        [Embed(source="flash_logo.png")]
        private static const TEXTURE:Class;

        private static const TEMP_DRAW_MATRIX:Matrix3D = new Matrix3D();

        private var context3D:Context3D;
        private var positionsBuffer:VertexBuffer3D;
        private var positionsBuffer2:VertexBuffer3D;
        private var texCoordsBuffer:VertexBuffer3D;
        private var texCoordsBuffer2:VertexBuffer3D;
        private var indexBuffer:IndexBuffer3D;
        private var program:Program3D;
        private var program2:Program3D;
        private var texture:Texture;
        private var texture2:Texture;
        private var camera:Camera3D;
        private var cubes:Vector.<Cube> = new Vector.<Cube>();

        private var changeProgram:Boolean;
        private var changePositionBuffer:Boolean;
        private var changeTexCoordBuffer:Boolean;
        private var changeTexture:Boolean;
        private var numCubes:uint = 15;

        private var fps:TextField = new TextField();
        private var lastFPSUpdateTime:uint;
        private var lastFrameTime:uint;
        private var frameCount:uint;
        private var driver:TextField = new TextField();

        public function Stage3DStateChanging()
        {
            stage.align = StageAlign.TOP_LEFT;
            stage.scaleMode = StageScaleMode.NO_SCALE;
            stage.frameRate = 60;

            var stage3D:Stage3D = stage.stage3Ds[0];
            stage3D.addEventListener(Event.CONTEXT3D_CREATE, onContextCreated);
            stage3D.requestContext3D(Context3DRenderMode.AUTO);
        }

        protected function onContextCreated(ev:Event): void
        {
            // Setup context
            var stage3D:Stage3D = stage.stage3Ds[0];
            stage3D.removeEventListener(Event.CONTEXT3D_CREATE, onContextCreated);
            context3D = stage3D.context3D;
            context3D.configureBackBuffer(
                stage.stageWidth,
                stage.stageHeight,
                0,
                true
            );
            context3D.enableErrorChecking = true;

            // Setup camera
            camera = new Camera3D(
                0.1, // near
                100, // far
                stage.stageWidth / stage.stageHeight, // aspect ratio
                40*(Math.PI/180), // vFOV
                -6, -6, 6, // position
                0, 0, 0, // target
                0, 1, 0 // up dir
            );

            // Setup cubes
            makeCubes();

            // Setup UI
            fps.background = true;
            fps.backgroundColor = 0xffffffff;
            fps.autoSize = TextFieldAutoSize.LEFT;
            fps.text = "Getting FPS...";
            addChild(fps);

            driver.background = true;
            driver.backgroundColor = 0xffffffff;
            driver.text = "Driver: " + context3D.driverInfo;
            driver.autoSize = TextFieldAutoSize.LEFT;
            driver.y = fps.height;
            addChild(driver);

            // Make checkboxes
            var checkBoxes:Sprite = new Sprite();
            var cb:Sprite;
            var thiz:Stage3DStateChanging = this;
            function makeCallback(field:String): Function
            {
                return function(checked:Boolean): void
                {
                    thiz[field] = checked;
                };
            }
            for each (var option:Object in [
                {label:"Shader Program", field:"changeProgram"},
                {label:"Position Buffer", field:"changePositionBuffer"},
                {label:"Tex Coord Buffer", field:"changeTexCoordBuffer"},
                {label:"Texture", field:"changeTexture"}
            ])
            {
                cb = makeCheckBox(option.label + ": ", false, makeCallback(option.field));
                cb.y = checkBoxes.height;
                checkBoxes.addChild(cb);
            }
            checkBoxes.y = stage.stageHeight - checkBoxes.height;
            addChild(checkBoxes);

            makeButtons("15x15x15 Cubes", "25x25x25 Cubes", "32x32x32 Cubes");

            var assembler:AGALMiniAssembler = new AGALMiniAssembler();

            // Vertex shader
            var vertSource:String = "m44 op, va0, vc0\nmov v0, va1\n"
            assembler.assemble(Context3DProgramType.VERTEX, vertSource);
            var vertexShaderAGAL:ByteArray = assembler.agalcode;

            // Fragment shader
            var fragSource:String = "tex oc, v0, fs0 <2d,linear,mipnone>";
            assembler.assemble(Context3DProgramType.FRAGMENT, fragSource);
            var fragmentShaderAGAL:ByteArray = assembler.agalcode;

            // Shader program
            program = context3D.createProgram();
            program.upload(vertexShaderAGAL, fragmentShaderAGAL);
            program2 = context3D.createProgram();
            program2.upload(vertexShaderAGAL, fragmentShaderAGAL);

            // Setup buffers
            positionsBuffer = context3D.createVertexBuffer(36, 3);
            positionsBuffer.uploadFromVector(POSITIONS, 0, 36);
            positionsBuffer2 = context3D.createVertexBuffer(36, 3);
            positionsBuffer2.uploadFromVector(POSITIONS, 0, 36);
            texCoordsBuffer = context3D.createVertexBuffer(36, 2);
            texCoordsBuffer.uploadFromVector(TEX_COORDS, 0, 36);
            texCoordsBuffer2 = context3D.createVertexBuffer(36, 2);
            texCoordsBuffer2.uploadFromVector(TEX_COORDS, 0, 36);
            indexBuffer = context3D.createIndexBuffer(36);
            indexBuffer.uploadFromVector(TRIS, 0, 36);

            // Setup texture
            var bmd:BitmapData = (new TEXTURE() as Bitmap).bitmapData;
            texture = context3D.createTexture(
                bmd.width,
                bmd.height,
                Context3DTextureFormat.BGRA,
                true
            );
            texture.uploadFromBitmapData(bmd);
            texture2 = context3D.createTexture(
                bmd.width,
                bmd.height,
                Context3DTextureFormat.BGRA,
                true
            );
            texture2.uploadFromBitmapData(bmd);

            // Start the simulation
            addEventListener(Event.ENTER_FRAME, onEnterFrame);
        }

        private function makeCubes(): void
        {
            cubes = new Vector.<Cube>();
            for (var i:int; i < numCubes; ++i)
            {
                for (var j:int = 0; j < numCubes; ++j)
                {
                    for (var k:int = 0; k < numCubes; ++k)
                    {
                        cubes.push(new Cube(i*2, j*2, -k*2));
                    }
                }
            }
        }

        private function makeButtons(...labels): void
        {
            const PAD:Number = 5;

            var curX:Number = stage.stageWidth;
            var curY:Number = stage.stageHeight;
            for each (var label:String in labels)
            {
                var tf:TextField = new TextField();
                tf.mouseEnabled = false;
                tf.selectable = false;
                tf.defaultTextFormat = new TextFormat("_sans", 16, 0x0071BB);
                tf.autoSize = TextFieldAutoSize.LEFT;
                tf.text = label;
                tf.name = "lbl";
                tf.background = true;
                tf.backgroundColor = 0xffffff;

                var button:Sprite = new Sprite();
                button.buttonMode = true;
                button.graphics.beginFill(0xF5F5F5);
                button.graphics.drawRect(0, 0, tf.width+PAD, tf.height+PAD);
                button.graphics.endFill();
                button.graphics.lineStyle(1);
                button.graphics.drawRect(0, 0, tf.width+PAD, tf.height+PAD);
                button.addChild(tf);
                button.addEventListener(MouseEvent.CLICK, onButton);

                tf.x = PAD/2;
                tf.y = PAD/2;

                button.x = curX - button.width;
                button.y = curY - button.height;
                addChild(button);

                curY -= button.height;
            }
        }

        public static function makeCheckBox(
            label:String,
            checked:Boolean,
            callback:Function,
            labelFormat:TextFormat=null): Sprite
        {
            var sprite:Sprite = new Sprite();

            var tf:TextField = new TextField();
            tf.autoSize = TextFieldAutoSize.LEFT;
            tf.text = label;
            tf.background = true;
            tf.backgroundColor = 0xffffff;
            tf.selectable = false;
            tf.mouseEnabled = false;
            tf.setTextFormat(labelFormat || new TextFormat("_sans"));
            sprite.addChild(tf);

            var size:Number = tf.height;

            var background:Shape = new Shape();
            background.graphics.beginFill(0xffffff);
            background.graphics.drawRect(0, 0, size, size);
            background.x = tf.width;
            sprite.addChild(background);

            var border:Shape = new Shape();
            border.graphics.lineStyle(1, 0x000000);
            border.graphics.drawRect(0, 0, size, size);
            border.x = background.x;
            sprite.addChild(border);

            var check:Shape = new Shape();
            check.graphics.lineStyle(1, 0x000000);
            check.graphics.moveTo(0, 0);
            check.graphics.lineTo(size, size);
            check.graphics.moveTo(size, 0);
            check.graphics.lineTo(0, size);
            check.x = background.x;
            check.visible = checked;
            sprite.addChild(check);

            sprite.addEventListener(
                MouseEvent.CLICK,
                function(ev:MouseEvent): void
                {
                    checked = !checked;
                    check.visible = checked;
                    callback(checked);
                }
            );

            return sprite;
        }

        private function onButton(ev:MouseEvent): void
        {
            var tf:TextField = ev.target.getChildByName("lbl");
            var lbl:String = tf.text;
            switch (lbl)
            {
                case "15x15x15 Cubes":
                    numCubes = 15;
                    makeCubes();
                    break;
                case "25x25x25 Cubes":
                    numCubes = 25;
                    makeCubes();
                    break;
                case "32x32x32 Cubes":
                    numCubes = 32;
                    makeCubes();
                    break;
            }
        }

        private function onEnterFrame(ev:Event): void
        {
            // Render scene
            context3D.setProgram(program);
            context3D.setVertexBufferAt(0, positionsBuffer, 0, FLOAT_3);
            context3D.setVertexBufferAt(1, texCoordsBuffer, 0, FLOAT_2);
            context3D.setTextureAt(0, texture);
            context3D.clear(0.5, 0.5, 0.5);

            // Draw all cubes
            var worldToClip:Matrix3D = camera.worldToClipMatrix;
            var drawMatrix:Matrix3D = TEMP_DRAW_MATRIX;
            var temp:int;
            var cubes:Vector.<Cube> = this.cubes;
            var numCubes:uint = cubes.length;
            for (var i:int; i < numCubes; ++i)
            {
                var cube:Cube = cubes[i];

                if (changeProgram)
                {
                    context3D.setProgram(i & 1 ? program : program2);
                }
                else
                {
                    temp = i & 1 ? 1 : 0;
                    cube.dummyFunction();
                }

                if (changePositionBuffer)
                {
                    context3D.setVertexBufferAt(0, i & 1 ? positionsBuffer : positionsBuffer2, 0, FLOAT_3);
                }
                else
                {
                    temp = i & 1 ? 1 : 0;
                    cube.dummyFunction();
                }

                if (changeTexCoordBuffer)
                {
                    context3D.setVertexBufferAt(1, i & 1 ? texCoordsBuffer : texCoordsBuffer2, 0, FLOAT_2);
                }
                else
                {
                    temp = i & 1 ? 1 : 0;
                    cube.dummyFunction();
                }

                if (changeTexture)
                {
                    context3D.setTextureAt(0, i & 1 ? texture : texture2);
                }
                else
                {
                    temp = i & 1 ? 1 : 0;
                    cube.dummyFunction();
                }

                cube.mat.copyToMatrix3D(drawMatrix);
                drawMatrix.prepend(worldToClip);
                context3D.setProgramConstantsFromMatrix(
                    Context3DProgramType.VERTEX,
                    0,
                    drawMatrix,
                    false
                );
                context3D.drawTriangles(indexBuffer, 0, 12);
            }
            context3D.present();

            // Update frame rate display
            frameCount++;
            var now:int = getTimer();
            var dTime:int = now - lastFrameTime;
            var elapsed:int = now - lastFPSUpdateTime;
            if (elapsed > 1000)
            {
                var framerateValue:Number = 1000 / (elapsed / frameCount);
                fps.text = "FPS: " + framerateValue.toFixed(1);
                lastFPSUpdateTime = now;
                frameCount = 0;
            }
            lastFrameTime = now;
        }
    }
}

import flash.display.Shape;
import flash.display.Sprite;
import flash.events.MouseEvent;
import flash.geom.*;
import flash.text.TextField;
import flash.text.TextFieldAutoSize;
import flash.text.TextFormat;

class Cube
{
    public var mat:Matrix3D;

    public function Cube(x:Number, y:Number, z:Number)
    {
        mat = new Matrix3D(
            new <Number>[
                1, 0, 0, x,
                0, 1, 0, y,
                0, 0, 1, z,
                0, 0, 0, 1
            ]
        );
    }

    public function dummyFunction(): void
    {
    }
}
I ran this test app in the following environment:
- Flex SDK (MXMLC) 4.5.1.21328, compiling in release mode (no debugging or verbose stack traces)
- Release version of Flash Player 11.2.202.229
- 2.4 GHz Intel Core i5
- NVIDIA GeForce GT 330M 256 MB
- Mac OS X 10.7.3
It is very important to remember that this is just one possible testing environment. Other environments will vary a lot compared to this one. For example, consider Windows machines running DirectX 9, iOS and Android devices running OpenGL ES 2.0, and desktops of all sorts running software rendering and you’ll have some idea of just how vast the performance landscape is. That said, the following results were gathered with a real-world machine (a Mid-2010 MacBook Pro), so they will very likely apply to your users if you are targeting the desktop.
Here are the results I got (an x means that state was changed between every draw):
Shader Program   |      | x    |      | x    |      | x    |      | x    |      | x    |      | x    |      | x    |      | x    |
Position Buffer  |      |      | x    | x    |      |      | x    | x    |      |      | x    | x    |      |      | x    | x    |
Tex Coord Buffer |      |      |      |      | x    | x    | x    | x    |      |      |      |      | x    | x    | x    | x    |
Texture          |      |      |      |      |      |      |      |      | x    | x    | x    | x    | x    | x    | x    | x    |
FPS              | 58.2 | 26.6 | 49.3 | 23.6 | 49.3 | 23.6 | 46.7 | 23.3 | 44.1 | 25.3 | 30.5 | 22.3 | 37.6 | 22.4 | 36.1 | 22.1 |
Here we see a clear order of performance impact:
- Shader Program (54% performance loss)
- Texture (24% performance loss)
- Vertex buffer (15% performance loss)
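These percentages appear to be the drop from the 58.2 FPS baseline (no state changes) to the run where only one kind of state changes. As a quick check, and assuming the 26.6, 44.1, and 49.3 FPS readings are those single-change runs, a hypothetical helper like this reproduces them:

```actionscript
package
{
    /** Hypothetical helper: percent of frame rate lost relative to a baseline */
    public function percentLoss(baselineFPS:Number, measuredFPS:Number): Number
    {
        return 100 * (baselineFPS - measuredFPS) / baselineFPS;
    }
}

// Usage, e.g. from a document class:
//   trace(percentLoss(58.2, 26.6).toFixed(1)); // 54.3 (shader program)
//   trace(percentLoss(58.2, 44.1).toFixed(1)); // 24.2 (texture)
//   trace(percentLoss(58.2, 49.3).toFixed(1)); // 15.3 (vertex buffer)
```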
Further, we see that state changes are always cumulative. No state change includes another, so saving any state change is always good for performance. Just because you’re already changing one part of the state does not mean that you should stop trying to avoid changing another part.
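One simple way to save state changes is to skip set* calls that would re-set the resource that is already bound. The class below is a sketch of that idea, not something from the test app: StateCache and its fields are hypothetical names, and it assumes eight vertex attribute slots and eight texture samplers. Whether Flash Player or the driver already filters such redundant calls is not something this article measures.

```actionscript
package
{
    import flash.display3D.Context3D;
    import flash.display3D.Program3D;
    import flash.display3D.VertexBuffer3D;
    import flash.display3D.textures.TextureBase;

    /** Hypothetical wrapper that only forwards set* calls that change the bound state */
    public class StateCache
    {
        private var context3D:Context3D;
        private var curProgram:Program3D;
        private var curTextures:Vector.<TextureBase> = new Vector.<TextureBase>(8, true);
        private var curBuffers:Vector.<VertexBuffer3D> = new Vector.<VertexBuffer3D>(8, true);
        private var curOffsets:Vector.<int> = new Vector.<int>(8, true);
        private var curFormats:Vector.<String> = new Vector.<String>(8, true);

        public function StateCache(context3D:Context3D)
        {
            this.context3D = context3D;
        }

        public function setProgram(program:Program3D): void
        {
            if (program != curProgram)
            {
                curProgram = program;
                context3D.setProgram(program);
            }
        }

        public function setTextureAt(sampler:int, texture:TextureBase): void
        {
            if (curTextures[sampler] != texture)
            {
                curTextures[sampler] = texture;
                context3D.setTextureAt(sampler, texture);
            }
        }

        public function setVertexBufferAt(
            index:int,
            buffer:VertexBuffer3D,
            offset:int,
            format:String): void
        {
            if (curBuffers[index] != buffer
                || curOffsets[index] != offset
                || curFormats[index] != format)
            {
                curBuffers[index] = buffer;
                curOffsets[index] = offset;
                curFormats[index] = format;
                context3D.setVertexBufferAt(index, buffer, offset, format);
            }
        }
    }
}
```

This only helps when consecutive draws really do share resources, and it goes stale if anything calls Context3D's set* functions directly or if the context is lost and recreated. In the test app above it would change nothing, because the checked resources alternate on every draw.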
In conclusion, state change plays a major role in the performance of your Stage3D-based app. It is probably worth spending a sizable amount of CPU time to avoid unnecessary GPU state changes.
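One way to spend that CPU time, sketched here under assumptions rather than taken from the test app, is to sort each frame's draws so that draws sharing the most expensive state end up next to each other. The comparison order follows the costs measured above: shader program first, then texture, then vertex buffer. DrawItem and its ID fields are hypothetical; the app would assign the IDs when it creates each resource.

```actionscript
package
{
    /** Hypothetical per-draw record holding app-assigned IDs for the state it needs */
    public class DrawItem
    {
        public var programID:int;
        public var textureID:int;
        public var bufferID:int;

        public function DrawItem(programID:int, textureID:int, bufferID:int)
        {
            this.programID = programID;
            this.textureID = textureID;
            this.bufferID = bufferID;
        }

        /** Orders draws by program, then texture, then vertex buffer */
        public static function compareByState(a:DrawItem, b:DrawItem): Number
        {
            if (a.programID != b.programID)
            {
                return a.programID < b.programID ? -1 : 1;
            }
            if (a.textureID != b.textureID)
            {
                return a.textureID < b.textureID ? -1 : 1;
            }
            if (a.bufferID != b.bufferID)
            {
                return a.bufferID < b.bufferID ? -1 : 1;
            }
            return 0;
        }
    }
}

// Usage, given a Vector.<DrawItem> named items:
//   items.sort(DrawItem.compareByState);
```

After sorting, the draw loop only needs to call a set* function when an ID differs from the previous item's. Note that this ignores any draw order the scene itself requires, such as back-to-front ordering for blended geometry, so it fits opaque draws best.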
Spot a bug? Have a suggestion or a question? Post a comment!
#1 by Alex on April 9th, 2012 ·
I don’t know what happens when you run the test, but I can select all options without any loss of frames per second.
Specs:
Flash Player 11.2.202.228
AMD Phenom II X4 840 @3.2 GHz (x64)
8192 MB RAM
NVIDIA GeForce GTX 550 Ti (1024 MB)
Windows 7 Ultimate x64
#2 by NemoStein on April 9th, 2012 ·
“NVIDIA GeForce GTX 550 Ti (1024 MB)” vs “NVIDIA GeForce GT 330M 256 MB”
Your GPU is a little (put many littles one behind another for this) better than his.
#3 by NemoStein on April 9th, 2012 ·
Oh, and your overall configuration is MUCH better…
So much that there aren’t enough cubes on screen to make any difference…
#4 by jackson on April 9th, 2012 ·
I just updated the article with some buttons to increase the number of cubes from 15x15x15 to 25x25x25 or 32x32x32 (the maximum number of draws). Could you try it again?
#5 by Alex on April 16th, 2012 ·
I wish you hadn’t done that. Now I have to watch the frame rate crumble under the workload imposed by 32^3 cubes.
10 frames per second is rather low, don’t you think?
#6 by jackson on April 16th, 2012 ·
That’s also a lot of cubes. :)
15^3 = 3375
25^3 = 15625
32^3 = 32768
#7 by Chris on April 10th, 2012 ·
Great!
Now you just need to do a tutorial on how to tell if an object is in the camera’s view, and don’t draw it if it isn’t.
That would help a lot.
#8 by jackson on April 10th, 2012 ·
I may write something like that, but there are already a lot of articles on “view frustum culling” out there. Still, the cost of draws outside the camera’s view in Flash is something good to measure…
#9 by Chris on April 12th, 2012 ·
There are also loads of optimization articles like this, but it didn’t stop you from writing it anyway. :P
I just figured it would be a good read for someone looking for optimization options.
Anyway…
I got view culling to work for me.
It wasn’t hard to figure out after I posted that comment and looked it up.
Though I was wondering, do you know of any other ways to speed up Stage3D besides these two?
#10 by ben w on April 12th, 2012 ·
if you are interested I wrote 4 implementations of frustum culling, the 4th being hands down fastest
1. standard frustum culling with a vector of 6 planes to loop through and test against
2. same as 1. but unrolled the loop
3. same as 1. but added plane caching
4. same as 3, but unrolled the loop
1. == slowest
4. == fastest
I started an article on it but never finished it off, might get round to it one of these days.
As for another technique to speed things up – I wrote a software rasterizer to draw a crude z-buffer (with id encoding) for use with occlusion culling and that had positive results… but not quite fast enough for my liking.
Enter Alchemy 2, hopefully that will be able to handle that kinda thing without too much difficulty.
#11 by jackson on April 12th, 2012 ·
One good way to improve performance is to reduce your number of draw calls, which implicitly reduces your number of state changes. Simplifying shaders, reducing the amount of geometry, and disabling effects that rely on fancier features (e.g. scissor rects, blending, alpha) also helps a lot. And, obvious as it may be, you should also make sure to run a release build in a release version of Flash Player with Context3D.enableErrorChecking set to false. Those are all general tips, but there will be lots that is specific to your app. I’d recommend using a profiler to help with the CPU side and perhaps a third-party tool (e.g. PIX or GPA) to help with the GPU side. Also, as Ben points out, you can implement better/faster culling (e.g. BSP Trees, portals), or even switch to using “domain memory” via Alchemy, Alchemy 2, HaXe’s Memory class, or Apparat.
Hope that helps.
#12 by Kevin Newman on April 13th, 2012 ·
Does updating the program constants affect state the same way as updating texture, shader program, or vertex buffer?
Separately, if you have multiple distinct batches to run (in a 2D engine that might be a batch of transparent quads, and one of opaque quads, with separate vertex, texture and programs – or different material shaders in 3D) – if you were to update all the vertex/tex/index then just switch the programs between drawTriangles, would that yield any benefit (if that’s even possible, I’m still learning)?
In other words, is the state change between draw calls the problem, or just the raw overhead of changing the state?
#13 by jackson on April 13th, 2012 ·
There is a little overhead to the calls to change the state, but the major factor is what it does to the GPU: a big interruption in chugging through lots of geometry. Yes, changing vertex constants will affect that too, but for the test I needed to change something or I’d just end up with the same cube over and over. You’re right that in other situations, like that 2D engine, constants may not change and you’ll instead be dealing with different performance from changing the state of vertex constants. Again, the specific performance will be very dependent on GPU, OS, driver, and API (i.e. OpenGL 2.0, OpenGL ES 2.0, DirectX 9).
#14 by ben w on April 15th, 2012 ·
just tried FP11.3 in ie and chrome… No idea what they did but it is almost twice as fast!?!?
can render 3000 objects at 60fps in chrome and 5000 at 60fps in ie
That’s a big jump up! was limited to 2000 ish in 11.2, wonder why it is so different. Have you run any of your plethora of speed tests through 11.3 yet? Be interesting to see if they have sped it up or if it’s just improved speed on Stage3D related calls.
#15 by jackson on April 15th, 2012 ·
On the Mac from the article I now get these numbers with 15x15x15 cubes:
Since 60 is the cap, I bumped up the number of cubes to 25x25x25:
So there may be some optimizations, but they seem small on my Mac. Once again, platform differences are huge here, so they could easily have written optimizations for a particular OS, GPU, driver, etc. What system are you testing on?