Stage3D Upload Speed Tester
Since Flash Player 11’s new Stage3D allows us to utilize hardware acceleration for 3D graphics, there is a whole new set of performance issues we need to consider. Today’s article discusses the performance of uploading data from system memory (RAM) to video memory (VRAM), such as when you upload textures, vertex buffers, and index buffers. Is it faster to upload to one type of resource rather than another? Is it faster to upload from a Vector, a ByteArray, or a BitmapData? Is there a significant speedup when using software rendering so that VRAM is the same as RAM? Find out the answers to all of these questions below.
The performance test below checks the upload speeds in both hardware and software mode of all of these types:
- Texture from…
  - BitmapData
  - ByteArray
- VertexBuffer3D from…
  - Vector
  - ByteArray
- IndexBuffer3D from…
  - Vector
  - ByteArray
Check it out:
```actionscript
package
{
    import flash.display3D.*;
    import flash.display3D.textures.*;
    import flash.display.*;
    import flash.events.*;
    import flash.utils.*;
    import flash.text.*;

    public class Stage3DUploadTester extends Sprite
    {
        private var __stage3D:Stage3D;
        private var __logger:TextField = new TextField();
        private var __context:Context3D;
        private var __driverInfo:String;

        private var __texture:Texture;
        private var __bmdNoAlpha:BitmapData;
        private var __bmdAlpha:BitmapData;
        private var __texBytes:ByteArray;

        private var __vertexBuffer:VertexBuffer3D;
        private var __vbVector:Vector.<Number>;
        private var __vbBytes:ByteArray;

        private var __indexBuffer:IndexBuffer3D;
        private var __ibVector:Vector.<uint>;
        private var __ibBytes:ByteArray;

        public function Stage3DUploadTester()
        {
            __stage3D = stage.stage3Ds[0];

            __logger.autoSize = TextFieldAutoSize.LEFT;
            addChild(__logger);

            // Allocate texture data: 2048x2048 pixels, 4 bytes per pixel
            __bmdNoAlpha = new BitmapData(2048, 2048, false, 0xffffffff);
            __bmdAlpha = new BitmapData(2048, 2048, true, 0xffffffff);
            __texBytes = new ByteArray();
            var size:int = __texBytes.length = 2048*2048*4;
            for (var i:int = 0; i < size; ++i)
            {
                __texBytes[i] = 0xff;
            }

            // Allocate vertex buffer data: 65535 vertices, 64 floats each
            size = 65535*64;
            __vbVector = new Vector.<Number>(size);
            for (i = 0; i < size; ++i)
            {
                __vbVector[i] = 1.0;
            }
            __vbBytes = new ByteArray();
            __vbBytes.length = size*4;
            for (i = 0; i < size; ++i)
            {
                __vbBytes.writeFloat(1.0);
            }
            __vbBytes.position = 0;

            // Allocate index buffer data
            size = 524287;
            __ibVector = new Vector.<uint>(size);
            for (i = 0; i < size; ++i)
            {
                __ibVector[i] = 1;
            }
            // NOTE: indices are 16 bits each, so uploadFromByteArray() below
            // consumes only size*2 of these bytes. Kept as run so the code
            // still matches the results.
            __ibBytes = new ByteArray();
            __ibBytes.length = size*4;
            for (i = 0; i < size; ++i)
            {
                __ibBytes.writeFloat(1.0);
            }
            __ibBytes.position = 0;

            // First pass: hardware if available
            setupContext(Context3DRenderMode.AUTO);
        }

        private function setupContext(renderMode:String): void
        {
            __stage3D.addEventListener(Event.CONTEXT3D_CREATE, onContextCreated);
            __stage3D.requestContext3D(renderMode);
        }

        private function onContextCreated(ev:Event): void
        {
            __stage3D.removeEventListener(Event.CONTEXT3D_CREATE, onContextCreated);

            // An empty log means this is the first (AUTO) pass
            var first:Boolean = __logger.text.length == 0;
            if (first)
            {
                __logger.appendText("Driver,Test,Time,Bytes/Sec\n");
            }

            const width:int = stage.stageWidth;
            const height:int = stage.stageHeight;

            __context = __stage3D.context3D;
            __context.configureBackBuffer(width, height, 0, true);
            __driverInfo = __context.driverInfo;

            __texture = __context.createTexture(
                2048,
                2048,
                Context3DTextureFormat.BGRA,
                false
            );
            __vertexBuffer = __context.createVertexBuffer(65535, 64);
            __indexBuffer = __context.createIndexBuffer(524287);

            runTests();

            // After the first pass, dispose and repeat in software mode
            if (first)
            {
                __context.dispose();
                setupContext(Context3DRenderMode.SOFTWARE);
            }
        }

        private function runTests(): void
        {
            var beforeTime:int;
            var afterTime:int;
            var time:int;

            beforeTime = getTimer();
            __texture.uploadFromBitmapData(__bmdNoAlpha);
            afterTime = getTimer();
            time = afterTime - beforeTime;
            row("Texture from BitmapData w/o alpha", time, 2048*2048*4);

            beforeTime = getTimer();
            __texture.uploadFromBitmapData(__bmdAlpha);
            afterTime = getTimer();
            time = afterTime - beforeTime;
            row("Texture from BitmapData w/ alpha", time, 2048*2048*4);

            beforeTime = getTimer();
            __texture.uploadFromByteArray(__texBytes, 0);
            afterTime = getTimer();
            time = afterTime - beforeTime;
            row("Texture from ByteArray", time, 2048*2048*4);

            beforeTime = getTimer();
            __vertexBuffer.uploadFromVector(__vbVector, 0, 65535);
            afterTime = getTimer();
            time = afterTime - beforeTime;
            row("VertexBuffer from Vector", time, 65535*64*4);

            beforeTime = getTimer();
            __vertexBuffer.uploadFromByteArray(__vbBytes, 0, 0, 65535);
            afterTime = getTimer();
            time = afterTime - beforeTime;
            row("VertexBuffer from ByteArray", time, 65535*64*4);

            beforeTime = getTimer();
            __indexBuffer.uploadFromVector(__ibVector, 0, 524287);
            afterTime = getTimer();
            time = afterTime - beforeTime;
            row("IndexBuffer from Vector", time, 524287*4);

            beforeTime = getTimer();
            __indexBuffer.uploadFromByteArray(__ibBytes, 0, 0, 524287);
            afterTime = getTimer();
            time = afterTime - beforeTime;
            row("IndexBuffer from ByteArray", time, 524287*4);
        }

        private function row(name:String, time:int, bytes:int): void
        {
            // NOTE: time is in milliseconds, so the last column is really
            // bytes per millisecond despite the "Bytes/Sec" header
            __logger.appendText(
                __driverInfo + "," + name + "," + time + ","
                + (bytes/time).toFixed(2) + "\n"
            );
        }
    }
}
```
I ran this performance test with the following environment:
- Flex SDK (MXMLC) 4.5.1.21328, compiling in release mode (no debugging or verbose stack traces)
- Release version of Flash Player 11.0.1.152
- 2.4 GHz Intel Core i5
- Mac OS X 10.7.2
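And got these results (times are in milliseconds; the column the tester labels “Bytes/Sec” is computed as bytes divided by elapsed milliseconds, so it is really bytes per millisecond):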
Driver | Test | Time | Bytes/Sec |
---|---|---|---|
OpenGL (Direct blitting) | Texture from BitmapData w/o alpha | 22 | 762600.73 |
OpenGL (Direct blitting) | Texture from BitmapData w/ alpha | 18 | 932067.56 |
OpenGL (Direct blitting) | Texture from ByteArray | 18 | 932067.56 |
OpenGL (Direct blitting) | VertexBuffer from Vector | 42 | 399451.43 |
OpenGL (Direct blitting) | VertexBuffer from ByteArray | 5 | 3355392.00 |
OpenGL (Direct blitting) | IndexBuffer from Vector | 3 | 699049.33 |
OpenGL (Direct blitting) | IndexBuffer from ByteArray | 1 | 2097148.00 |
Software (Direct blitting) | Texture from BitmapData w/o alpha | 12 | 1398101.33 |
Software (Direct blitting) | Texture from BitmapData w/ alpha | 5 | 3355443.20 |
Software (Direct blitting) | Texture from ByteArray | 5 | 3355443.20 |
Software (Direct blitting) | VertexBuffer from Vector | 15 | 1118464.00 |
Software (Direct blitting) | VertexBuffer from ByteArray | 5 | 3355392.00 |
Software (Direct blitting) | IndexBuffer from Vector | 3 | 699049.33 |
Software (Direct blitting) | IndexBuffer from ByteArray | 2 | 1048574.00 |
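The throughput figures in the table are the byte count divided by the elapsed time in milliseconds. As a rough sketch of how to turn one of them into a more familiar unit, the snippet below recomputes the first row and converts it to megabytes per second. The class name `ThroughputUnits` is hypothetical and is not part of the tester.

```actionscript
package
{
    import flash.display.Sprite;

    // Hypothetical helper class, not part of the tester: recomputes the
    // first table row and converts it to megabytes per second.
    public class ThroughputUnits extends Sprite
    {
        public function ThroughputUnits()
        {
            var bytes:Number = 2048 * 2048 * 4; // one 2048x2048 BGRA texture
            var ms:Number = 22;                 // elapsed time from the first row

            // What the tester prints: bytes per millisecond
            var bytesPerMs:Number = bytes / ms;

            // 1000 ms per second, 1,000,000 bytes per (decimal) megabyte
            var mbPerSec:Number = bytesPerMs * 1000 / 1000000;

            trace(bytesPerMs.toFixed(2)); // 762600.73
            trace(mbPerSec.toFixed(2));   // 762.60
        }
    }
}
```

Read this way, the first row works out to roughly 763 MB/s rather than 0.76 MB/s.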
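There is a clear order of speed in these tests, regardless of hardware or software mode or the type of GPU resource being uploaded to:
- ByteArray (fastest)
- Vector
- BitmapData (slowest)
Only the magnitude of the advantage changes, and the one tie is a BitmapData with alpha, which uploaded exactly as fast as a ByteArray in both modes. In particular, if you can manage to upload a vertex or index buffer from a ByteArray, you’re in for a big performance win: in this test it was between 1.5x and roughly 8x faster than uploading from a Vector.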
Uploading texture data seems much faster in software compared to hardware: between roughly 2x and 3.6x depending on the source. As for vertex and index buffers, it’s more of a mixed bag. Software is faster when uploading vertex buffers from a Vector, hardware is faster when uploading index buffers from a ByteArray, and the rest are a tie. When uploading from a ByteArray, vertex buffers are curiously quicker to upload per byte than index buffers. The difference is more dramatic with software rendering (about 3x faster) than hardware rendering (about 60% faster).
More than in any of my previous performance articles, it is important to keep in mind that the performance results posted above are valid only for the test environment that produced them. These numbers may change on Windows, which uses DirectX instead of OpenGL, or on any of a number of mobile handsets using OpenGL ES.
Spot a bug? Have a suggestion? Different results on your environment? Post a comment!
#1 by Smily on October 31st, 2011 ·
My results for the following config:
Flash Player 11.0.1.152
Firefox 7.0.1
Windows 7
Intel Core 2 Quad Q6600
GeForce 560ti
Driver,Test,Time,Bytes/Sec
DirectX9 (Direct blitting),Texture from BitmapData w/o alpha,22,762600.73
DirectX9 (Direct blitting),Texture from BitmapData w/ alpha,11,1525201.45
DirectX9 (Direct blitting),Texture from ByteArray,11,1525201.45
DirectX9 (Direct blitting),VertexBuffer from Vector,22,762589.09
DirectX9 (Direct blitting),VertexBuffer from ByteArray,11,1525178.18
DirectX9 (Direct blitting),IndexBuffer from Vector,2,1048574.00
DirectX9 (Direct blitting),IndexBuffer from ByteArray,2,1048574.00
Software (Direct blitting),Texture from BitmapData w/o alpha,21,798915.05
Software (Direct blitting),Texture from BitmapData w/ alpha,12,1398101.33
Software (Direct blitting),Texture from ByteArray,11,1525201.45
Software (Direct blitting),VertexBuffer from Vector,22,762589.09
Software (Direct blitting),VertexBuffer from ByteArray,11,1525178.18
Software (Direct blitting),IndexBuffer from Vector,2,1048574.00
Software (Direct blitting),IndexBuffer from ByteArray,2,1048574.00
Note that the numbers vary quite a bit from test to test, with “DirectX9 (Direct blitting),Texture from BitmapData w/o alpha” running as fast as 15ms and as slow as 28ms.
#2 by Tronster on October 31st, 2011 ·
“One other curiosity to note is that uploading texture data seems much faster in hardware compared to software.” I may be reading this wrong, but doesn’t your graph show the opposite of this? Looks like all three types of software uploading can push more bytes/sec than hardware.
#3 by jackson on October 31st, 2011 ·
The graphs are showing bytes-per-second, so you want a higher bar as opposed to most of my articles that are showing time, so you want a lower bar. Sorry for the confusion.
#4 by skyboy on October 31st, 2011 ·
On your system, software is faster; vs. where you said hardware is faster in the article.
#5 by jackson on October 31st, 2011 ·
Ah, I see now. I’ve updated the article to have more accurate conclusions on the rendering mode: hardware vs. software.
#6 by James on November 2nd, 2011 ·
My results are somehow all the same? What does Infinity mean?
The Laptop is Lenovo Y460
Driver,Test,Time,Bytes/Sec
DirectX9 (Direct blitting),Texture from BitmapData w/o alpha,10,1677721.60
DirectX9 (Direct blitting),Texture from BitmapData w/ alpha,10,1677721.60
DirectX9 (Direct blitting),Texture from ByteArray,0,Infinity
DirectX9 (Direct blitting),VertexBuffer from Vector,10,1677696.00
DirectX9 (Direct blitting),VertexBuffer from ByteArray,10,1677696.00
DirectX9 (Direct blitting),IndexBuffer from Vector,0,Infinity
DirectX9 (Direct blitting),IndexBuffer from ByteArray,0,Infinity
Software (Direct blitting),Texture from BitmapData w/o alpha,10,1677721.60
Software (Direct blitting),Texture from BitmapData w/ alpha,0,Infinity
Software (Direct blitting),Texture from ByteArray,10,1677721.60
Software (Direct blitting),VertexBuffer from Vector,10,1677696.00
Software (Direct blitting),VertexBuffer from ByteArray,10,1677696.00
Software (Direct blitting),IndexBuffer from Vector,0,Infinity
Software (Direct blitting),IndexBuffer from ByteArray,0,Infinity
#7 by skyboy on November 8th, 2011 ·
Heh. Infinity is due to X/0 in AS3 resulting in Infinity instead of NaN; though 0/0 is still NaN.
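The zero-millisecond rows come from `getTimer()` reporting whole milliseconds, so a fast upload can finish within a single tick. A minimal sketch of one way around this, assuming a hypothetical package-level helper named `averageMillis` that is not part of the original tester: repeat the upload several times and divide the total by the repetition count.

```actionscript
package
{
    import flash.utils.getTimer;

    /**
     * Hypothetical helper (not part of the original tester): calls the
     * given upload closure `reps` times and returns the average time
     * per call in milliseconds, which may be fractional.
     */
    public function averageMillis(upload:Function, reps:int = 10):Number
    {
        var before:int = getTimer();
        for (var i:int = 0; i < reps; ++i)
        {
            upload();
        }
        // int / int yields a Number in AS3, so the fraction is preserved
        return (getTimer() - before) / reps;
    }
}
```

Inside the tester it could be called as `averageMillis(function():void { __vertexBuffer.uploadFromVector(__vbVector, 0, 65535); })`. The average can still come out as zero for very fast uploads, and a driver may defer or batch repeated uploads, so treat the number as an estimate.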
#8 by Jacob on November 19th, 2011 ·
Here are results from my machine:
Driver,Test,Time,Bytes/Sec
OpenGL Vendor=ATI Technologies Inc. Version=2.1 ATI-1.6.38 Renderer=ATI Radeon HD 6490M OpenGL Engine GLSL=1.20 (Direct blitting),Texture from BitmapData w/o alpha,16,1048576.00
OpenGL …Texture from BitmapData w/ alpha,16,1048576.00
OpenGL …Texture from ByteArray,16,1048576.00
OpenGL …VertexBuffer from Vector,36,466026.67
OpenGL …VertexBuffer from ByteArray,2,8388480.00
OpenGL …IndexBuffer from Vector,3,699049.33
OpenGL …IndexBuffer from ByteArray,1,2097148.00
Software (Direct blitting),Texture from BitmapData w/o alpha,9,1864135.11
Software (Direct blitting),Texture from BitmapData w/ alpha,3,5592405.33
Software (Direct blitting),Texture from ByteArray,3,5592405.33
Software (Direct blitting),VertexBuffer from Vector,13,1290535.38
Software (Direct blitting),VertexBuffer from ByteArray,3,5592320.00
Software (Direct blitting),IndexBuffer from Vector,2,1048574.00
Software (Direct blitting),IndexBuffer from ByteArray,1,2097148.00
#9 by David on November 21st, 2011 ·
nvidia gtx560
i7 2600k
Driver,Test,Time,Bytes/Sec
DirectX9 (Direct blitting),Texture from BitmapData w/o alpha,4,4194304.00
DirectX9 (Direct blitting),Texture from BitmapData w/ alpha,2,8388608.00
DirectX9 (Direct blitting),Texture from ByteArray,3,5592405.33
DirectX9 (Direct blitting),VertexBuffer from Vector,6,2796160.00
DirectX9 (Direct blitting),VertexBuffer from ByteArray,3,5592320.00
DirectX9 (Direct blitting),IndexBuffer from Vector,1,2097148.00
DirectX9 (Direct blitting),IndexBuffer from ByteArray,1,2097148.00
Software (Direct blitting),Texture from BitmapData w/o alpha,4,4194304.00
Software (Direct blitting),Texture from BitmapData w/ alpha,3,5592405.33
Software (Direct blitting),Texture from ByteArray,2,8388608.00
Software (Direct blitting),VertexBuffer from Vector,5,3355392.00
Software (Direct blitting),VertexBuffer from ByteArray,3,5592320.00
Software (Direct blitting),IndexBuffer from Vector,1,2097148.00
Software (Direct blitting),IndexBuffer from ByteArray,1,2097148.00
I really don’t understand why the BitmapData would upload slower without the alpha channel. Does the GPU need the alpha channel, so that all RGB pixels get converted to RGBA, or where is this coming from?
By the way, Jackson, it would be nice if you could create some sort of best-practice article for the new 3D API, based on your performance research.
After all the years of performance optimizing in Flash, and now with this new technology, I’m constantly struggling with FPS drops in my experiments and I just don’t know where they come from. The tutorials on the web are quite thin and I don’t really trust the Adobe tutorials. They posted so many bad/slow examples in the past that I don’t think it’s much better this time ;)
thanks for your great efforts.
#10 by jackson on November 21st, 2011 ·
Your guess is as good as mine about the alpha channel upload. That does seem plausible though.
Thanks for the idea about a “best practices” article for Stage3D. I’ve been writing a lot on the subject recently so that may very well happen. :)
#11 by Sam on February 5th, 2012 ·
Hi Jackson, the index buffer code needs a correction: you should use writeShort() instead of writeFloat(), and the size should be size*2 instead of size*4, because each index is only 16 bits wide.
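A minimal sketch of the correction described above, written as a standalone class rather than a patch to the tester. The class name `IndexBytesSketch` is hypothetical, and the little-endian setting is an assumption about the byte order Stage3D expects for buffer data.

```actionscript
package
{
    import flash.display.Sprite;
    import flash.utils.ByteArray;
    import flash.utils.Endian;

    // Hypothetical sketch, not the article's code: fills a ByteArray
    // with 16-bit indices instead of 32-bit floats.
    public class IndexBytesSketch extends Sprite
    {
        public function IndexBytesSketch()
        {
            var count:int = 524287; // same index count as the tester

            var ibBytes:ByteArray = new ByteArray();
            // Assumption: Stage3D reads buffer data as little-endian
            ibBytes.endian = Endian.LITTLE_ENDIAN;
            ibBytes.length = count * 2; // two bytes per index

            for (var i:int = 0; i < count; ++i)
            {
                ibBytes.writeShort(1); // writes exactly 16 bits
            }
            ibBytes.position = 0;

            trace(ibBytes.length); // 1048574
        }
    }
}
```

With data laid out this way, the byte count reported for the index buffer ByteArray test would be 524287*2 rather than 524287*4.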
#12 by Matt Lockyer on March 21st, 2012 ·
Hey Jackson,
Wondering if you could revisit this.
I’ve tested some vectors and byte arrays, and I’m wondering what the cost of updating the bytearray would be if that is incorporated into the test?
Roughly I’m showing average framerates of 32-33 for vectors compared to 29-30 for bytearrays… I think this might be due to the fact that you must update the bytearray and this access is slower?
Is a ba.writeFloat(…) slower than data[x] = y ???
#13 by jackson on March 21st, 2012 ·
Hey Matt,
If I’m understanding correctly, you’d like a test of Vector access as opposed to multiple forms of ByteArray access. If so, this sounds like a great idea for an article and one I can’t believe I haven’t written yet! I’ll definitely add it to my list of articles to write.
Thanks for the tip,
-Jackson
#14 by Matt Lockyer on March 21st, 2012 ·
Yes that’s correct!
Thank you for taking the time to follow up.
The basic premise behind why this would be an interesting article to write rests in the need for us (as3 developers) to constantly update many values per object per frame into a single vector or bytearray for uploading to the vertex buffer.
Thanks again!
#15 by Glidias on January 24th, 2013 ·
I have a bit of a dilemma. I know ByteArray uploads to Stage3D are faster. Unfortunately, in order to write a series of floats to a ByteArray at randomly accessed positions, I have to constantly set the ByteArray.position pointer and then call writeFloat(). I’m not sure if indexed access is possible here since I’m not writing bytes, but a full float. So, I fear that doing this in the CPU pre-processing step would take up more resources compared to simply updating values in a dense int fixed-size vector, which should be much faster and easier to read code-wise.
I guess I can get around this ByteArray problem by using some Domain memory API to handle this stuff, since my application could allocate a fixed ByteArray that I can use specifically for things like vertex data uploads, on-the-fly geometry collision detection, calculations, etc. The stupid thing about Flash is that Domain Memory isn’t available upfront unless I use something like Haxe or Apparat, and I’d also need to register for a license.
Anyway, I’m doing such ByteArray uploads per frame, but only occasionally, for various pooled terrain LOD chunks in a quad-tree which are no longer cached (i.e. their vertex data was invalidated by other used chunks). So, it doesn’t happen too often across many frames. Thus, no need to prematurely optimize things.
Maybe you can provide a benchmark for this. (randomly accessed positions and writing floats to ByteArray, then upload, compare to doing it for Vectors). Maybe even do an indexed access for byteArray (ie. no using writeFloat() method), but you’d need to write into 4 byte indices manually to compose into a float, and I’m not too sure how to do that…
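The position-then-write pattern described above can be sketched as follows. This is only an illustration of the access pattern, not a benchmark. The class name `RandomFloatWrites` and the helper `writeFloatAt` are hypothetical, and the little-endian setting is an assumption about what Stage3D expects.

```actionscript
package
{
    import flash.display.Sprite;
    import flash.utils.ByteArray;
    import flash.utils.Endian;

    // Hypothetical sketch: writes 32-bit floats at arbitrary float
    // indices of a fixed-size ByteArray.
    public class RandomFloatWrites extends Sprite
    {
        public function RandomFloatWrites()
        {
            var bytes:ByteArray = new ByteArray();
            bytes.endian = Endian.LITTLE_ENDIAN; // assumed Stage3D byte order
            bytes.length = 4 * 4;                // room for four 32-bit floats

            writeFloatAt(bytes, 2, 1.5);  // third float, byte offset 8
            writeFloatAt(bytes, 0, -3.0); // first float, byte offset 0

            // Read the third float back to check the write landed there
            bytes.position = 2 * 4;
            trace(bytes.readFloat()); // 1.5
        }

        // Seek to the float's byte offset, then write its 4 bytes
        private function writeFloatAt(bytes:ByteArray, floatIndex:int, value:Number):void
        {
            bytes.position = floatIndex * 4;
            bytes.writeFloat(value);
        }
    }
}
```

Each write costs a property set plus a method call, which is the overhead a head-to-head test against plain `Vector` indexing would need to measure.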
#16 by jackson on January 24th, 2013 ·
I haven’t done such a test, but it’s intriguing. This is actually a common problem when using Stage3D with AS3, particularly when you’re trying to avoid the conversion from Number to 32-bit floating point values in uploading vertex buffers and constants. You can avoid it by writing out 32-bit floats to a ByteArray and then uploading that, which really helps CPU usage.
In your situation it seems like the ByteArray version may be quicker, especially if you’re willing to go the licensed route with domain memory. Of course there’s only one way to find out. I’ll see about putting together a head-to-head performance test.
#17 by Glidias on February 7th, 2013 ·
Well, domain memory is no longer categorised as a Premium Feature. So, I think domain memory should be the best approach to avoid costly ByteArray manipulation prior to uploading via ByteArray. You only need to manage your offsets/ranges though, since most domain memory implementations use one ByteArray, so this should be easy to handle with various Haxe libs or Apparat memory managers.
#18 by Xavi Colomer on April 23rd, 2013 ·
Hi There,
I’m having some problems uploading 196000 vertices. It takes up to 15-20 seconds to upload them to the context. Is that normal?
vector is a Vector.
buffer.elements = Engine.context.createVertexBuffer(size, buffer.itemSize);
buffer.elements.uploadFromVector(vector, 0, size );
I have also posted this question on SO and the Adobe forums:
http://forums.adobe.com/people/XColomer?view=discussions
http://stackoverflow.com/questions/16085832/vertexbuffer3d-is-too-slow-how-can-i-optimize
Thanks!
#19 by jackson on April 23rd, 2013 ·
No, that’s not normal. In the test from this article I upload 64k vertices in under 50 milliseconds. It’s not that I’m doing anything tricky, that’s just how long it took in the test environment using the normal uploading strategy. Are you using a super slow computer? Are you inadvertently including way more in your time measurement? What are your results using the “Try” link from the article on the same computer that takes 15-20 seconds?
#20 by Xavi Colomer on April 24th, 2013 ·
Hi Jackson,
Thank you for your reply.
It’s difficult to say how much time it takes because the Flash Player appears to be stopped for 15-20 seconds during the upload.
I am porting a customised threejs engine to Flash as a fallback for old browsers. Small scenes work great, but the rendering time of complex ones increases exponentially. We just found a couple of bugs on the JS side, but I definitely have to improve the Flash side.
I’ll let you know once we fix the JS bug, which is increasing the data unnecessarily.
Thanks!
#21 by Xavi Colomer on April 24th, 2013 ·
Hi jackson,
We found what the problem is.
Apparently the problem is ExternalInterface, which has problems sending 200000 vertices.
#22 by Zwick on May 2nd, 2013 ·
My results on 5 years old XPS M1530 with M8600GT.
Driver,Test,Time,Bytes/Sec
OpenGL,Texture from BitmapData w/o alpha,51,328965.02
OpenGL,Texture from BitmapData w/ alpha,39,430185.03
OpenGL,Texture from ByteArray,39,430185.03
OpenGL,VertexBuffer from Vector,32,524280.00
OpenGL,VertexBuffer from ByteArray,24,699040.00
OpenGL,IndexBuffer from Vector,14,149796.29
OpenGL,IndexBuffer from ByteArray,2,1048574.00
Software Hw_disabled=explicit,Texture from BitmapData w/o alpha,16,1048576.00
Software Hw_disabled=explicit,Texture from BitmapData w/ alpha,11,1525201.45
Software Hw_disabled=explicit,Texture from ByteArray,16,1048576.00
Software Hw_disabled=explicit,VertexBuffer from Vector,43,390161.86
Software Hw_disabled=explicit,VertexBuffer from ByteArray,16,1048560.00
Software Hw_disabled=explicit,IndexBuffer from Vector,3,699049.33
Software Hw_disabled=explicit,IndexBuffer from ByteArray,2,1048574.00
#24 by NickyD on June 26th, 2014 ·
I know this article is old, but I have been working with batched geometry and ran into a little problem; hoping you can shed some light on the matter. Adobe Scout reports that uploadFromVector() seems to take longer and longer to finish executing over the course of time when the number of objects to be batched is lower than the maximum number of batched objects allowed in the space from the VBO.
When numVerts != BATCH_VERTEX_MAX is when I notice the strange inconsistencies in performance.
mBatchedVertexBuffer.uploadFromVector(mVertexBufferData, 0, numVerts);