Your computer has RAM, but it also has much faster forms of memory. If you want your app to run quickly, you need to be aware of these other memories. Today’s article discusses them and shows two AS3 examples that, even with such a high level language, you can still take advantage of them.

It’s true that RAM is fast, but only in comparison to hard drives, SSDs, optical media, the internet, and so forth. RAM is not fast compared to the cache that every CPU has built directly into it. You may have heard of them before, described by their levels: L1, L2, and L3.

These CPU caches are small blocks of memory that are used to store RAM. When RAM is requested, the much-faster cache can be used instead… but only if the requested RAM is in the cache. That’s called a cache “hit”. If it’s not, you’ll have a cache “miss” and the CPU will need to fetch from the much-slower RAM.

CPU strategies for caching vary, but contiguous chunks of RAM are usually stored in the cache. This means that if your app works on a block of contiguous memory (e.g. a Vector>) at a time and doesn’t go outside that limited chunk of memory, you’ll be hitting the cache rather than RAM and you’ll reap a big performance win.

To demonstrate this with a somewhat-contrived example, consider accessing the elements of a big Vector that simulates RAM. In one case we’ll access the elements in order and in the other case we’ll access them at random. To eliminate the unfairness of a Math.random call, both Vector instances will be indexed via a lookup into a Vector. In the case of the random access, the Vector of indices will be randomly shuffled so elements will be read randomly. Here’s an example of what this might look like on a Vector with 10 elements:

// In order
0 1 2 3 4 5 6 7 8 9
 
// Random
2 3 9 1 4 0 6 8 5 7

I ran this test in the following environment:

  • Release version of Flash Player 12.0.0.41
  • 2.3 Ghz Intel Core i7-3615QM (256 KB L2 per core, 6 MB L3 cache)
  • Mac OS X 10.9.1
  • Google Chrome 32.0.1700.77
  • ASC 2.0.0 build 354071 (-debug=false -verbose-stacktraces=false -inline -optimize=true)

And here are the results I got:

Vector Access Time
Sequential 27
Random 169

Vector Access Graph

Here we see the sequential test that makes use of CPU cache beating the random test that doesn’t by over 6x. The order you access memory in really matters!

The above test is, as admitted, rather contrived. The following test handles a real world problem though: looping over the pixels of a BitmapData. You’ll need a nested loop where one goes over the rows (Y) and the other goes over the columns (X). Which should be the outer loop and which should be the inner?

Behind the scenes, BitmapData is laid out “row-wise”. Here’s how a 3×2 BitmapData looks:

(0,0) (1,0) (2,0) (0,1) (1,1) (1,2)

If your outer loop is over the columns (X), you’ll access these memory addresses:

0 4 8 1 5 9 2 6 10 3 7 11

But if your outer loop is over the rows (Y), you’ll go in order:

0 1 2 3 4 5 6 7 8 9 10 11

As we learned above, going in order really helps with CPU caching. It may seem like we’re only skipping around a little bit, but the problem really starts to present itself when we have larger BitmapData instances. If the BitmapData was 2048×2, the first four memory accesses would be 0, 2047, 1, 2048.

The following example compares these two looping strategies with a single 2048×2048 BitmapData in the same test environment as above:

BitmapData Outer Loop Time
Row 89
Col 348

Update: the original version of the test app had a bug leading this article to claim that the column version was only 27% faster and the results were swapped.

BitmapData Outer Loops Graph

The advantage here isn’t as strong as with the Vector, but it’s still a huge 4x advantage when looping over the same pixels. If the BitmapData were larger, the difference would only grow. This is especially true on CPUs with less cache than in my test environment.

Here’s the source code for the app that runs both tests:

package
{
	import flash.display.BitmapData;
	import flash.display.Sprite;
	import flash.display.StageAlign;
	import flash.display.StageScaleMode;
	import flash.text.TextField;
	import flash.text.TextFieldAutoSize;
	import flash.utils.getTimer;
 
	public class CacheTest extends Sprite
	{
		private var logger:TextField = new TextField();
		private function row(...cols): void
		{
			logger.appendText(cols.join(",")+"\n");
		}
 
		public function CacheTest()
		{
			stage.scaleMode = StageScaleMode.NO_SCALE;
			stage.align = StageAlign.TOP_LEFT;
 
			logger.autoSize = TextFieldAutoSize.LEFT;
			addChild(logger);
 
			testVector();
			row();
			testBitmapData();
		}
 
		private function testVector(): void
		{
			const SIZE:int = 10000000;
			var beforeTime:int;
			var afterTime:int;
			var i:int;
			var sum:int;
 
			// Build vectors of indices
			var inOrder:Vector.<int> = new Vector.<int>(SIZE);
			for (i = 0; i < SIZE; ++i)
			{
				inOrder[i] = i;
			}
			var randomOrder:Vector.<int> = inOrder.slice();
			shuffle(randomOrder);
 
			row("Vector Access", "Time");
 
			sum = 0;
			beforeTime = getTimer();
			for (i = 0; i < SIZE; ++i)
			{
				sum += inOrder[inOrder[i]];
			}
			afterTime = getTimer();
			row("Sequential", (afterTime-beforeTime));
 
			sum = 0;
			beforeTime = getTimer();
			for (i = 0; i < SIZE; ++i)
			{
				sum += randomOrder[randomOrder[i]];
			}
			afterTime = getTimer();
			row("Random", (afterTime-beforeTime));
		}
 
		private function testBitmapData(): void
		{
			var SIZE:int = 2048;
			var bmd:BitmapData = new BitmapData(SIZE, SIZE, true, 0xff123456);
			var curRow:int;
			var curCol:int;
			var sum:int;
			var beforeTime:int;
			var afterTime:int;
 
			row("BitmapData Outer Loop", "Time");
 
			sum = 0;
			beforeTime = getTimer();
			for (curRow = 0; curRow < SIZE; ++curRow)
			{
				for (curCol = 0; curCol < SIZE; ++curCol)
				{
					sum += bmd.getPixel32(curCol, curRow);
				}
			}
			afterTime = getTimer();
			row("Row", (afterTime-beforeTime));
 
			sum = 0;
			beforeTime = getTimer();
			for (curCol = 0; curCol < SIZE; ++curCol)
			{
				for (curRow = 0; curRow < SIZE; ++curRow)
				{
					sum += bmd.getPixel32(curCol, curRow);
				}
			}
			afterTime = getTimer();
			row("Col", (afterTime-beforeTime));
		}
 
		private function shuffle(vec:Vector.<int>): void
		{
			var len:uint = vec.length;
			while (len)
			{
				var index:uint = Math.floor(Math.random() * len);
				len--;
 
				var temp:int = vec[len];
				vec[len] = vec[index];
				vec[index] = temp;
			}
		}
	}
}

Run both tests

Spot a bug? Have a question or suggestion? Post a comment!