Optimizing with Bit Twiddling Hacks
Stanford’s Sean Eron Anderson has a great page titled Bit Twiddling Hacks. This is a great resource for C programmers looking to employ bitwise operators (& | ^ ~ << <<< >> >>>
etc.) to speed up their code. But what about AS3 programmers? We have bitwise operators too, but will the same tricks work for us? Today’s article ports some of these bit twiddling hacks to AS3 and tests to see if any ground is gained.
A few months ago I tested just one of these bit twiddling hacks and found that Math.abs
could be sped up by 20% for integers. Today I’ll test several more of the hacks:
- Sign of integer
- Detect different signs
- Minimum integer (like
Math.min
) - Maximum integer (like
Math.max
) - Conditionally set or clear bits
- Conditionally negate a value without branching
- Count bits set
Here’s the test app that does the testing:
package { import flash.utils.getTimer; import flash.text.TextFieldAutoSize; import flash.text.TextField; import flash.display.Sprite; public class BitTwiddlingHacks extends Sprite { private static const BITS_SET_TABLE_256:Vector.<int> = new <int>[0, 0 +1, 0 +1, 0 +2, 0 +1, 0 +1 +1, 0 +1 +1, 0 +1 +2, 0 +1, 0 +1 +1, 0 +1 +1, 0 +1 +2, 0 +2, 0 +2 +1, 0 +2 +1, 0 +2 +2, 0 +1, 0 +1 +1, 0 +1 +1, 0 +1 +2, 0 +1 +1, 0 +1 +1 +1, 0 +1 +1 +1, 0 +1 +1 +2, 0 +1 +1, 0 +1 +1 +1, 0 +1 +1 +1, 0 +1 +1 +2, 0 +1 +2, 0 +1 +2 +1, 0 +1 +2 +1, 0 +1 +2 +2, 0 +1, 0 +1 +1, 0 +1 +1, 0 +1 +2, 0 +1 +1, 0 +1 +1 +1, 0 +1 +1 +1, 0 +1 +1 +2, 0 +1 +1, 0 +1 +1 +1, 0 +1 +1 +1, 0 +1 +1 +2, 0 +1 +2, 0 +1 +2 +1, 0 +1 +2 +1, 0 +1 +2 +2, 0 +2, 0 +2 +1, 0 +2 +1, 0 +2 +2, 0 +2 +1, 0 +2 +1 +1, 0 +2 +1 +1, 0 +2 +1 +2, 0 +2 +1, 0 +2 +1 +1, 0 +2 +1 +1, 0 +2 +1 +2, 0 +2 +2, 0 +2 +2 +1, 0 +2 +2 +1, 0 +2 +2 +2, 1, 1 +1, 1 +1, 1 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +2, 1 +2 +1, 1 +2 +1, 1 +2 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +2, 1 +1 +2 +1, 1 +1 +2 +1, 1 +1 +2 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +2, 1 +1 +2 +1, 1 +1 +2 +1, 1 +1 +2 +2, 1 +2, 1 +2 +1, 1 +2 +1, 1 +2 +2, 1 +2 +1, 1 +2 +1 +1, 1 +2 +1 +1, 1 +2 +1 +2, 1 +2 +1, 1 +2 +1 +1, 1 +2 +1 +1, 1 +2 +1 +2, 1 +2 +2, 1 +2 +2 +1, 1 +2 +2 +1, 1 +2 +2 +2, 1, 1 +1, 1 +1, 1 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +2, 1 +2 +1, 1 +2 +1, 1 +2 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +2, 1 +1 +2 +1, 1 +1 +2 +1, 1 +1 +2 +2, 1 +1, 1 +1 +1, 1 +1 +1, 1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +1, 1 +1 +1 +2, 1 +1 +2, 1 +1 +2 +1, 1 +1 +2 +1, 1 +1 +2 +2, 1 +2, 1 +2 +1, 1 +2 +1, 1 +2 +2, 1 +2 +1, 1 +2 +1 +1, 1 +2 +1 +1, 1 +2 +1 +2, 1 +2 +1, 1 +2 +1 +1, 1 +2 +1 +1, 1 +2 +1 +2, 1 +2 +2, 1 +2 +2 +1, 1 +2 +2 +1, 1 +2 +2 +2, 2, 2 +1, 2 +1, 2 +2, 2 +1, 2 +1 +1, 2 +1 +1, 2 +1 +2, 2 +1, 2 +1 +1, 2 +1 +1, 2 +1 +2, 2 +2, 2 +2 +1, 2 +2 +1, 2 +2 +2, 2 +1, 2 +1 +1, 2 +1 +1, 2 +1 +2, 2 +1 +1, 2 +1 +1 +1, 2 +1 +1 +1, 2 +1 +1 +2, 2 +1 +1, 2 +1 +1 +1, 2 +1 +1 +1, 2 +1 +1 +2, 2 +1 +2, 2 +1 +2 +1, 2 +1 +2 +1, 2 +1 +2 +2, 2 +1, 2 +1 +1, 2 +1 +1, 2 +1 +2, 2 +1 +1, 2 +1 +1 +1, 2 +1 +1 +1, 2 +1 +1 +2, 2 +1 +1, 2 +1 +1 +1, 2 +1 +1 +1, 2 +1 +1 +2, 2 +1 +2, 2 +1 +2 +1, 2 +1 +2 +1, 2 +1 +2 +2, 2 +2, 2 +2 +1, 2 +2 +1, 2 +2 +2, 2 +2 +1, 2 +2 +1 +1, 2 +2 +1 +1, 2 +2 +1 +2, 2 +2 +1, 2 +2 +1 +1, 2 +2 +1 +1, 2 +2 +1 +2, 2 +2 +2, 2 +2 +2 +1, 2 +2 +2 +1, 2 +2 +2 +2]; private var logger:TextField = new TextField(); private function row(...cols): void { logger.appendText(cols.join(",")+"\n"); } public function BitTwiddlingHacks() { logger.autoSize = TextFieldAutoSize.LEFT; addChild(logger); init(); } private function init(): void { var beforeTime:int; var afterTime:int; var i:int; var REPS:int = 100000000; var normalTime:int; var twiddlingTime:int; var input:int; var input2:int; var inputBool:Boolean; var result:int; var resultBool:Boolean; var BitsSetTable256:Vector.<int> = BITS_SET_TABLE_256; row("Function,Normal,Twiddling"); ////////////////////////////////// // Sign of integer (when positive) ////////////////////////////////// input = 10; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = input < 0 ? -1 : 0; } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = -(input>>31); } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Sign of integer (when positive)", normalTime, twiddlingTime); ////////////////////////////////// // Sign of integer (when negative) ////////////////////////////////// input = -10; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = input < 0 ? -1 : 0; } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = -(input>>31); } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Sign of integer (when negative)", normalTime, twiddlingTime); ///////////////////////// // Detect different signs ///////////////////////// input = 10; input2 = 30; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { resultBool = (input < 0 && input2 < 0) || (input >= 0 && input2 >= 0); } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { resultBool = ((input ^ input2) < 0); } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Detect different signs", normalTime, twiddlingTime); ////////////////// // Minimum integer ////////////////// input = 10; input2 = 30; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = input < input2 ? input : input2; } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = input2 ^ ((input ^ input2) & -(input < input2 ? 1 : 0)); } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Minimum integer", normalTime, twiddlingTime); ////////////////// // Maximum integer ////////////////// input = 10; input2 = 30; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = input > input2 ? input : input2; } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = input ^ ((input ^ input2) & -(input < input2 ? 1 : 0)); } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Maximum integer", normalTime, twiddlingTime); ///////////////////////////////////////// // Conditionally set or clear bits (true) ///////////////////////////////////////// input = 0; // to modify input2 = 6; // mask inputBool = true; // set bits beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = inputBool ? (input | input2) : (input & ~input2); } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = (input & ~input2) | (-(inputBool ? 1 : 0) & input2); } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Conditionally set or clear bits (true)", normalTime, twiddlingTime); ////////////////////////////////////////// // Conditionally set or clear bits (false) ////////////////////////////////////////// input = 0; // to modify input2 = 6; // mask inputBool = false; // clear bits beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = inputBool ? (input | input2) : (input & ~input2); } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = (input & ~input2) | (-(inputBool ? 1 : 0) & input2); } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Conditionally set or clear bits (false)", normalTime, twiddlingTime); //////////////////////////////////////////////////////// // Conditionally negate a value without branching (true) //////////////////////////////////////////////////////// input = 30; inputBool = false; // do negate (this flag is to NOT negate) beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = inputBool ? -input : input; } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { input2 = inputBool ? 1 : 0; // temp result = (input2 ^ (input2 - 1)) * input; } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Conditionally negate a value without branching (true)", normalTime, twiddlingTime); ///////////////////////////////////////////////////////// // Conditionally negate a value without branching (false) ///////////////////////////////////////////////////////// input = 30; inputBool = true; // don't negate (this flag is to NOT negate) beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = inputBool ? -input : input; } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { input2 = inputBool ? 1 : 0; // temp result = (input2 ^ (input2 - 1)) * input; } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Conditionally negate a value without branching (false)", normalTime, twiddlingTime); ///////////////// // Count bits set ///////////////// input = 30; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { input2 = input; // temp for (result = 0; input2; input2 >>= 1) { result += input2 & 1; } } afterTime = getTimer(); normalTime = afterTime - beforeTime; beforeTime = getTimer(); for (i = 0; i < REPS; ++i) { result = BitsSetTable256[input & 0xff] + BitsSetTable256[(input >> 8) & 0xff] + BitsSetTable256[(input >> 16) & 0xff] + BitsSetTable256[input >> 24]; } afterTime = getTimer(); twiddlingTime = afterTime - beforeTime; row("Count bits set", normalTime, twiddlingTime); } } }
I ran this test app in the following environment:
- Flex SDK (MXMLC) 4.6.0.23201, compiling in release mode (no debugging or verbose stack traces)
- Release version of Flash Player 11.5.31.2
- 2.3 Ghz Intel Core i7
- Mac OS X 10.8.2
And here are the results I got:
Function | Normal | Twiddling |
---|---|---|
Sign of integer (when positive) | 220 | 220 |
Sign of integer (when negative) | 217 | 216 |
Detect different signs | 261 | 164 |
Minimum integer | 216 | 370 |
Maximum integer | 156 | 372 |
Conditionally set or clear bits (true) | 194 | 402 |
Conditionally set or clear bits (false) | 205 | 402 |
Conditionally negate a value without branching (true) | 184 | 186 |
Conditionally negate a value without branching (false) | 282 | 216 |
Count bits set | 957 | 586 |
There’s definitely a mixed bag of performance here. Sometimes the bit twiddling version is two times slower, sometimes it’s just as fast, sometimes it’s a little faster, and sometimes it’s two times faster. While always fancier, bitwise methods are not always faster. Thankfully, the above performance results should serve as a handy reference guide for the next time you’re thinking of using some bitwise tricks. You could double your performance… or cut it in half.
Spot a bug? Have a suggestion? Post a comment!
#1 by orion elenzil on November 27th, 2012 ·
some of the cases where the bit-twiddling conversion is slower are hardly surprising.
eg, which of these will be faster ?
the latter is certainly doing some bit-twiddling, but it’s doing it in addition to the exact same ternary operation the former is doing.
one of the main points of doing bit hacks is to avoid branching, and the ternary doesn’t really count as avoiding branching.
for example, here’s an example implemented once with traditional if/else and once with ternary:
and here is the resulting release-build .swf for each decompiled:
(decompiled using swfdump -a)
ternary:
00000) + 0:0 getlocal_0
00001) + 1:0 pushscope
00002) + 0:1 pushnan
00003) + 1:1 setlocal_1
00004) + 0:1 getlocal_0
00005) + 1:1 constructsuper 0 params
00006) + 0:1 pushbyte 3
00007) + 1:1 convert_d
00008) + 1:1 setlocal_2
00009) + 0:1 pushbyte 4
00010) + 1:1 convert_d
00011) + 1:1 setlocal_3
00012) + 0:1 getlocal_2
00013) + 1:1 getlocal_3
00014) + 2:1 lessthan
00015) + 1:1 iffalse ->19 21
00019) + 0:1 pushbyte 0
00020) + 1:1 convert_d
00021) + 1:1 convert_d
00022) + 1:1 setlocal_1
00023) + 0:1 returnvoid
if/else:
00000) + 0:0 getlocal_0
00001) + 1:0 pushscope
00002) + 0:1 pushnan
00003) + 1:1 setlocal_1
00004) + 0:1 getlocal_0
00005) + 1:1 constructsuper 0 params
00006) + 0:1 pushbyte 3
00007) + 1:1 convert_d
00008) + 1:1 setlocal_2
00009) + 0:1 pushbyte 4
00010) + 1:1 convert_d
00011) + 1:1 setlocal_3
00012) + 0:1 getlocal_2
00013) + 1:1 getlocal_3
00014) + 2:1 ifnlt ->19 22
00019) + 0:1 pushbyte 0
00020) + 1:1 convert_d
00021) + 1:1 setlocal_1
00022) + 0:1 returnvoid
i wonder if there’s a different branch-prediction success rate for “ifnlt” and “iffalse”.
in the “count bits set” test,
it might be interesting to use a value larger than 30.
that is, the speed of the twiddling method should be a constant no matter what the input value is,
while the speed of the traditional method should be proportional to the position of the highest bit set.
so eg, for 30 (0b11110), the number of iterations will be 5,
while for 0x800000, the number of iterations will be 32.
#2 by jackson on November 27th, 2012 ·
Very good points. In porting some of the hacks it was necessary to make some of them slower due to differences between C and AS3. One major problem is, as you point out, that a
Boolean
can’t be simply treated as anint
oruint
. This led to thebool ? 1 : 0
expression you see a lot. Regardless, the point was to show that not all of the hacks result in better performance when ported to AS3, but I could have been a little clearer as to why.As for the “count bits set” test, that is definitely something to try out. For simplicity’s sake I just had one test for the results, but a fuller range would yield some more clarity. Using
0xFFFFFFFF
as an input would certainly take long with the normal method and widen the gap so that the bitwise version is even faster.#3 by orion elenzil on November 27th, 2012 ·
roger.
i think text-formatting stripped out a comment from the decompiled versions,
just noting that the decompile of the ternary operator includes a conditional branch at index 15.
#4 by skyboy on January 1st, 2013 ·
bool ? 1 : 0
isn’t nessicary:int(bool)
will do the same thing, but is much faster since the JIT sweeps it up and converts it to a single opcode:convert_i
. It is slightly more expensive than in C. pseudocode:I’d cite what the AVM2 actually does, but i don’t really want to look through the dozens of files and thousands of lines of code to find it. :P
However, the end result is that int(boolean) is very nearly free.
#5 by jackson on January 1st, 2013 ·
Great tip!
#6 by benjamin guihaire on June 21st, 2013 ·
can you redo the profiling with int(bool) instead ?
also , in resultBool = ((input ^ input2) >31
in this one:
result = input2 ^ ((input ^ input2) & -(input < input2 ? 1 : 0));
the part doing this: -(input >31) ?
#7 by jackson on June 21st, 2013 ·
Perhaps in an update article I’ll check with
int(bool)
instead ofbool ? 1 : 0
.I’m not sure what your question is about
-(input > 31)
. Could you clarify?#8 by benjamin guihaire on June 22nd, 2013 ·
Sorry, I was not clear, my proposal is to instead of doing this:
result = input2 ^ ((input ^ input2) & -(input < input2 ? 1 : 0));
do that:
result = input2 ^ ((input ^ input2) & -((input-input2)<<31);
to avoid the ternary operator.
((input-input2) will be negative when input<input2, so the bit sign will be set..)
#9 by jackson on June 22nd, 2013 ·
No problem. Let’s consider an example:
But the original version gives either 1 or 0 as an output, so they’re not equivalent.
Perhaps I’m missing something. Care to elaborate how the
<<31
version works?#10 by benjamin guihaire on August 6th, 2013 ·
okay … one more time (can you delete the extra comments?)
It looks like I need to type the comments with & lt ; to get it to show properly:
result = input2 ^ ((input ^ input2) & ((input-input2)>>31));
in this test, if input<input2 , input-input2 is negative, and a negative number with >>31 should be -1, which was the test with -(input < input2 ? 1 : 0)); but my version should be faster than the ternary operator.
#11 by crussell on April 7th, 2013 ·
I would be really wary of trying to prove that the bit twiddling hack performs worse when you are using inputs that remain constant over the repetitions that your are timing. You need to realize the power of a quickly saturating local branch predictor in these tests. On a superscalar out-of-order architecture such as that of the i7 you are using, the price of branch mis-predicts can be huge. You should try this experiment again using a randomized input on each iteration of the loops you are exercising.
#12 by jackson on April 7th, 2013 ·
Good point about branch prediction. I’d like to use more varied inputs, but it’s difficult in a test like this. The cost of a
Math.random
call (or equivalent) would be quite large compared to the actual work being tested. For example, if theMath.random
call takes 100 nanoseconds, the bit twiddling hack takes 50 nanoseconds, and the normal version takes 75 nanoseconds, you get results of 150 and 175. The relative order is preserved but the difference looks much smaller than it actually is. I’ve tried to measure theMath.random
call by itself and subtract it out in the past, but that usually has very poor results with regards to consistency. There are a lot of negative values that come up when using such a technique and I generally don’t think the technique yields high-quality result sets.I am totally open to suggestions though. Please let me know if you’ve got a good way to add some variation to the test inputs without skewing the results. I would happily incorporate such a technique into my testing approach.
#13 by crussell on April 8th, 2013 ·
A solution might be to populate a large array up front with the random values. Try to make it not too large so that the majority of it stays in the cache. Since there is really no global correlation with the branches you are testing, all you need to do is make sure that the randomization array is large enough to negate any pattern correlation with the local prediction. I would start with an array of size 2^10 or so and make your REPS a multiple of that. Just index your array as rand_array[i & 0x3FF]. Hopefully this will get rid of a lot of the overhead of calling Math.random directly during the timing portion of your code.
#14 by jackson on April 9th, 2013 ·
That’s a good idea for reducing the amount of randomizing per loop. Unfortunately, even just a
Vector
indexing takes quite a while. I added a fake test that just indexesBitsSetTable256
and it takes about 220 milliseconds on the same test. That result is actually larger than half the results in the test, so the level of skewing would be quite extreme were I to add aVector
index to each test: results would approximately double and hide lots of details as described above. Your idea is sound, but I’m still looking for a way to implement it without skewing results. If you (or anyone!) have any more ideas, I’m all ears.