Sub-Byte Sizes: Part 1
The smallest a C# type can be is one byte. The byte type and an empty struct are examples of this. But what if we want to store data in less than a byte to improve performance in areas such as load times and CPU cache utilization? Today’s article does just that by packing at the bit level!
Data Layout
The key realization today is that a single piece of data can’t take up less than a byte, but many pieces of data can take up less than one byte as long as they’re stored in something that takes up at least one byte. Consider an array of four 4-bit integers a, b, c, and d:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a | a | a | a | b | b | b | b | c | c | c | c | d | d | d | d |
While there’s no 4-bit type in C#, we can still pack these into 16 bits (2 bytes) because the overall array is at least one byte large. The trick is in reading and writing the packed data, which involves some bitwise math to extract the right bits. This gets tricky when a piece of data straddles a byte boundary. For example, consider an array of five 3-bit integers a, b, c, d, and e:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a | a | a | b | b | b | c | c | c | d | d | d | e | e | e | – |
When reading c, we need to read the first byte (bits 0-7) to get its first two bits (at 6 and 7), then read the second byte (bits 8-15) to get its third bit (at 8). Then we need to mask off the unwanted bits of both bytes (0-5 and 9-15), shift the bits of both bytes into place (bits 6-7 shift left by 1, bit 8 shifts right by 7), and finally combine them with a bitwise OR.
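To make that concrete, here’s a minimal sketch of reading c by hand. The two byte values are made up for illustration and aren’t from the article’s code; bit 0 is taken to be the most-significant bit of the first byte, matching the layout above.

byte byte0 = 0b00010110; // a = 0b000, b = 0b101, first two bits of c = 0b10
byte byte1 = 0b10111000; // last bit of c = 1, d = 0b011, e = 0b100, one unused bit
int highPart = (byte0 & 0b00000011) << 1; // mask off bits 0-5, shift bits 6-7 left by 1
int lowPart = (byte1 & 0b10000000) >> 7;  // mask off bits 9-15, shift bit 8 right by 7
int c = highPart | lowPart;               // c == 0b101 == 5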
Code Design
There are three main goals for the code that operates on these bits:
- Efficiency: Only the minimum number of memory accesses, branches, and bitwise operations should be performed.
- Burst Support: All accesses should be able to be compiled by Burst.
- Code Reuse: The container of the bits should be abstracted from the code that accesses the bits.
The goal of efficiency precludes the simpler approach of repeatedly reading a byte and masking off all but one bit until we have read enough. This approach reads from memory too many times, performs too many branches for the loop, and does an unnecessary number of bitwise operations. Instead, we’ll read exactly once per byte containing the bits we need, perform just enough branches to figure out how many times that is, and use bitwise operations on runs of bits instead of just one.
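For contrast, a bit-at-a-time reader might look something like the following sketch. This is a hypothetical illustration of the rejected approach, not code from the article:

// Hypothetical bit-at-a-time reader: one memory access, one loop branch, and
// several bitwise operations for every single bit that's read
static int ReadBitsSlowly(byte[] bytes, int bitIndex, int num)
{
    int result = 0;
    for (int i = 0; i < num; ++i)
    {
        int byteIndex = (bitIndex + i) / 8;
        int bitInByte = (bitIndex + i) % 8;
        int bit = (bytes[byteIndex] >> (7 - bitInByte)) & 1; // bit 0 is the MSB
        result = (result << 1) | bit;
    }
    return result;
}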
Burst support means we’ll be omitting anything it doesn’t support: managed arrays, classes, and calls through raw C# interface references. So we’ll abstract the collection holding the bits instead of using a managed array, use structs instead of classes, and use generics to give Burst the compile-time knowledge of exactly which interfaces we’re using.
Code reuse can be achieved using the Burst-compatible generic algorithms approach of combining an interface with generics. Instead of just taking an interface (IFoo), add a type parameter (TFoo), take a TFoo, and use a where TFoo : IFoo constraint. This means the code isn’t using an IFoo determined at runtime, but rather a TFoo determined at compile time. In this case, we’ll have an interface that accesses the bits using a simple index to read and write.
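As a quick illustration of the pattern, here’s a hypothetical example; IFoo, ConstantFoo, Algorithms, and Sum are placeholder names rather than types from the article:

public interface IFoo
{
    int GetValue(int index);
}

// A struct implementing the interface, known at compile time
public struct ConstantFoo : IFoo
{
    public int Value;

    public int GetValue(int index)
    {
        return Value;
    }
}

public static class Algorithms
{
    // Taking a TFoo instead of an IFoo means the call to GetValue is resolved
    // at compile time rather than being a virtual call through an interface
    // reference
    public static int Sum<TFoo>(in TFoo foo, int count)
        where TFoo : IFoo
    {
        int sum = 0;
        for (int i = 0; i < count; ++i)
        {
            sum += foo.GetValue(i);
        }
        return sum;
    }
}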
For today, we’ll just design support for reading bits. The core of today’s code is a struct named BitStream containing the current bit index, functions to read various numbers of bits, and an interface to abstract the bits:
public struct BitStream
{
    public interface IByteStream
    {
        byte Read(int index);
    }

    public int BitIndex;

    public byte ReadUpTo8<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        // ...
    }

    public ushort ReadUpTo16<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        // ...
    }

    public uint ReadUpTo32<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        // ...
    }

    public ulong ReadUpTo64<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        // ...
    }
}
Implementation
ReadUpTo8 contains the trickiest bitwise work. The functions for reading more bits (ReadUpTo16, ReadUpTo32, and ReadUpTo64) each delegate to the next-smaller read function, so all of the tricky work ultimately lands in ReadUpTo8. This keeps the others quite simple and easy to read. Here is the full file:
/// <summary>
/// Reader and writer of a stream of bits, which can be stored in any indexed
/// container type. This allows for accessing the bits regardless of their size
/// (up to 64 bits) even if the accesses cross one or more byte boundaries. This
/// can be used to efficiently pack bits such that individual values can be
/// stored in less than one byte. For example, a boolean can take up only one
/// bit and a 3-bit integer is possible.
/// </summary>
///
/// <author>
/// Jackson Dunstan, https://JacksonDunstan.com/articles/5426
/// </author>
///
/// <license>
/// MIT
/// </license>
public struct BitStream
{
    /// <summary>
    /// Abstraction of a stream of bytes
    /// </summary>
    public interface IByteStream
    {
        /// <summary>
        /// Read a byte from the stream at a given index
        /// </summary>
        ///
        /// <param name="index">
        /// Index of the byte to read.
        /// </param>
        ///
        /// <returns>
        /// The byte at the given index.
        /// </returns>
        byte Read(int index);
    }

    /// <summary>
    /// Index of the next bit to access
    /// </summary>
    public int BitIndex;

    /// <summary>
    /// Read up to 8 bits starting at <see cref="BitIndex"/>.
    /// </summary>
    ///
    /// <param name="stream">
    /// Byte stream to read from.
    /// </param>
    ///
    /// <param name="num">
    /// Number of bits to read.
    /// </param>
    ///
    /// <typeparam name="TByteStream">
    /// Type of byte stream to read with.
    /// </typeparam>
    ///
    /// <returns>
    /// The read value, stored in the least-significant bits.
    /// </returns>
    public byte ReadUpTo8<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        // Calculate where we are in the stream and advance
        int bitIndex = BitIndex;
        int localByteIndex = bitIndex / 8;
        int localBitIndex = bitIndex % 8;
        BitIndex = bitIndex + num;

        // Read the byte with the high bits and decide if that's the only byte
        byte high = stream.Read(localByteIndex);
        int numHighBitsAvailable = 8 - localBitIndex;
        int excessHighBitsAvailable = numHighBitsAvailable - num;
        int highMask;
        if (excessHighBitsAvailable >= 0)
        {
            highMask = ((1 << num) - 1) << excessHighBitsAvailable;
            return (byte)((high & highMask) >> excessHighBitsAvailable);
        }

        // Read the low byte and combine with the high byte
        highMask = (1 << numHighBitsAvailable) - 1;
        int numLowBits = num - numHighBitsAvailable;
        byte low = stream.Read(localByteIndex + 1);
        int lowShift = 8 - numLowBits;
        int lowMask = ((1 << numLowBits) - 1) << lowShift;
        int highPart = (high & highMask) << numLowBits;
        int lowPart = (low & lowMask) >> lowShift;
        int result = highPart | lowPart;
        return (byte)result;
    }

    /// <summary>
    /// Read up to 16 bits starting at <see cref="BitIndex"/>.
    /// </summary>
    ///
    /// <param name="stream">
    /// Byte stream to read from.
    /// </param>
    ///
    /// <param name="num">
    /// Number of bits to read.
    /// </param>
    ///
    /// <typeparam name="TByteStream">
    /// Type of byte stream to read with.
    /// </typeparam>
    ///
    /// <returns>
    /// The read value, stored in the least-significant bits.
    /// </returns>
    public ushort ReadUpTo16<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        if (num <= 8)
        {
            return ReadUpTo8(stream, num);
        }
        byte high = ReadUpTo8(stream, 8);
        int numLowBits = num - 8;
        byte low = ReadUpTo8(stream, numLowBits);
        return (ushort)((high << numLowBits) | low);
    }

    /// <summary>
    /// Read up to 32 bits starting at <see cref="BitIndex"/>.
    /// </summary>
    ///
    /// <param name="stream">
    /// Byte stream to read from.
    /// </param>
    ///
    /// <param name="num">
    /// Number of bits to read.
    /// </param>
    ///
    /// <typeparam name="TByteStream">
    /// Type of byte stream to read with.
    /// </typeparam>
    ///
    /// <returns>
    /// The read value, stored in the least-significant bits.
    /// </returns>
    public uint ReadUpTo32<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        if (num <= 16)
        {
            return ReadUpTo16(stream, num);
        }
        uint high = ReadUpTo16(stream, 16);
        int numLowBits = num - 16;
        uint low = ReadUpTo16(stream, numLowBits);
        return (high << numLowBits) | low;
    }

    /// <summary>
    /// Read up to 64 bits starting at <see cref="BitIndex"/>.
    /// </summary>
    ///
    /// <param name="stream">
    /// Byte stream to read from.
    /// </param>
    ///
    /// <param name="num">
    /// Number of bits to read.
    /// </param>
    ///
    /// <typeparam name="TByteStream">
    /// Type of byte stream to read with.
    /// </typeparam>
    ///
    /// <returns>
    /// The read value, stored in the least-significant bits.
    /// </returns>
    public ulong ReadUpTo64<TByteStream>(in TByteStream stream, int num)
        where TByteStream : IByteStream
    {
        if (num <= 32)
        {
            return ReadUpTo32(stream, num);
        }
        ulong high = ReadUpTo32(stream, 32);
        int numLowBits = num - 32;
        ulong low = ReadUpTo32(stream, numLowBits);
        return (high << numLowBits) | low;
    }
}
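To see the boundary-straddling branch of ReadUpTo8 in action, here’s a small sketch that reads the 3-bit c from the earlier example. TwoByteStream and the byte values are made up for illustration and aren’t part of the article’s code:

// Illustrative two-byte stream for the sketch below
struct TwoByteStream : BitStream.IByteStream
{
    public byte First;
    public byte Second;

    public byte Read(int index)
    {
        return index == 0 ? First : Second;
    }
}

// ... then, for example inside a test method:
BitStream stream = new BitStream { BitIndex = 6 };
byte c = stream.ReadUpTo8(
    new TwoByteStream { First = 0b00010110, Second = 0b10111000 },
    3);
// Inside ReadUpTo8: numHighBitsAvailable = 8 - 6 = 2, so excessHighBitsAvailable
// is -1 and both bytes are read. highPart = (0b00010110 & 0b11) << 1 = 0b100 and
// lowPart = (0b10111000 & 0b10000000) >> 7 = 0b1, so c == 0b101 == 5 and
// stream.BitIndex has advanced to 9.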
Along with the runtime code, we have unit tests to exhaustively run through every combination of bit offset and number of bits to read:
using NUnit.Framework;
using Unity.Collections;

/// <summary>
/// Unit tests for <see cref="BitStream"/>.
/// </summary>
///
/// <author>
/// Jackson Dunstan, https://JacksonDunstan.com/articles/5426
/// </author>
///
/// <license>
/// MIT
/// </license>
public class BitStreamTests
{
    private struct NativeArrayByteStream : BitStream.IByteStream
    {
        public NativeArray<byte> Array;

        public byte Read(int index)
        {
            return Array[index];
        }
    }

    private static NativeArray<byte> CreateArray()
    {
        NativeArray<byte> array = new NativeArray<byte>(9, Allocator.Temp);
        for (int i = 0; i < array.Length; ++i)
        {
            array[i] = 0b10101010;
        }
        return array;
    }

    private static BitStream CreateStream(int bitIndex)
    {
        return new BitStream { BitIndex = bitIndex };
    }

    [Test]
    public void ReadUpTo8()
    {
        using (NativeArray<byte> array = CreateArray())
        {
            NativeArrayByteStream byteStream = new NativeArrayByteStream { Array = array };
            for (int bitIndex = 0; bitIndex < 8; ++bitIndex)
            {
                int expected = 0;
                int next = bitIndex & 1;
                for (int numBits = 1; numBits <= 8; ++numBits)
                {
                    next = ~next & 1;
                    expected = (expected << 1) | next;
                    BitStream stream = CreateStream(bitIndex);
                    Assert.That(
                        stream.ReadUpTo8(byteStream, numBits),
                        Is.EqualTo(expected),
                        $"Bit index: {bitIndex}. Num bits: {numBits}");
                }
            }
        }
    }

    [Test]
    public void ReadUpTo16()
    {
        using (NativeArray<byte> array = CreateArray())
        {
            NativeArrayByteStream byteStream = new NativeArrayByteStream { Array = array };
            for (ushort bitIndex = 0; bitIndex < 8; ++bitIndex)
            {
                ushort expected = 0;
                ushort next = (ushort)(bitIndex & 1);
                for (int numBits = 1; numBits <= 16; ++numBits)
                {
                    next = (ushort)(~next & 1);
                    expected = (ushort)((expected << 1) | next);
                    BitStream stream = CreateStream(bitIndex);
                    Assert.That(
                        stream.ReadUpTo16(byteStream, numBits),
                        Is.EqualTo(expected),
                        $"Bit index: {bitIndex}. Num bits: {numBits}");
                }
            }
        }
    }

    [Test]
    public void ReadUpTo32()
    {
        using (NativeArray<byte> array = CreateArray())
        {
            NativeArrayByteStream byteStream = new NativeArrayByteStream { Array = array };
            for (int bitIndex = 0; bitIndex < 8; ++bitIndex)
            {
                uint expected = 0;
                uint next = (uint)bitIndex & 1;
                for (int numBits = 1; numBits <= 32; ++numBits)
                {
                    next = ~next & 1;
                    expected = (expected << 1) | next;
                    BitStream stream = CreateStream(bitIndex);
                    uint actual = stream.ReadUpTo32(byteStream, numBits);
                    Assert.That(
                        actual,
                        Is.EqualTo(expected),
                        $"Bit index: {bitIndex}. Num bits: {numBits}.");
                }
            }
        }
    }

    [Test]
    public void ReadUpTo64()
    {
        using (NativeArray<byte> array = CreateArray())
        {
            NativeArrayByteStream byteStream = new NativeArrayByteStream { Array = array };
            for (int bitIndex = 0; bitIndex < 8; ++bitIndex)
            {
                ulong expected = 0;
                ulong next = (ulong)bitIndex & 1;
                for (int numBits = 1; numBits <= 64; ++numBits)
                {
                    next = ~next & 1;
                    expected = (expected << 1) | next;
                    BitStream stream = CreateStream(bitIndex);
                    ulong actual = stream.ReadUpTo64(byteStream, numBits);
                    Assert.That(
                        actual,
                        Is.EqualTo(expected),
                        $"Bit index: {bitIndex}. Num bits: {numBits}.");
                }
            }
        }
    }
}
Usage
Using BitStream begins by defining a struct implementing IByteStream. This is trivial for array-like types such as NativeArray<T>:
struct NativeArrayByteStream : BitStream.IByteStream
{
    public NativeArray<byte> Array;

    public byte Read(int index)
    {
        return Array[index];
    }
}
Next, let’s define a Burst-compiled job to prove that we’ve achieved that goal:
[BurstCompile]
struct BitStreamJob : IJob
{
    public NativeArray<byte> Input;
    public NativeArray<uint> Output;

    public void Execute()
    {
        Output[0] = new BitStream().ReadUpTo32(
            new NativeArrayByteStream { Array = Input },
            32);
    }
}
Finally, let’s execute that job from a MonoBehaviour:
class TestScript : MonoBehaviour
{
    void Start()
    {
        using (NativeArray<byte> input = new NativeArray<byte>(4, Allocator.TempJob))
        {
            NativeArray<byte> inputAlias = input;
            inputAlias[0] = 0x12;
            inputAlias[1] = 0x34;
            inputAlias[2] = 0x56;
            inputAlias[3] = 0x78;
            using (NativeArray<uint> output = new NativeArray<uint>(1, Allocator.TempJob))
            {
                new BitStreamJob { Input = input, Output = output }.Run();
                print(Convert.ToString(output[0], 16));
            }
        }
    }
}
As expected, this prints 12345678 as it reads 32 bits across the four bytes in the input array into the output array. That’s a relatively easy test case, but the hard cases have already been proven by the unit tests.
Using the Burst Inspector, we can verify that this job is indeed being compiled by Burst to x86 assembly:
mov rax, qword ptr [rdi]
mov rcx, qword ptr [rdi + 56]
mov eax, dword ptr [rax]
bswap eax
mov dword ptr [rcx], eax
ret
Burst has nicely taken constants like the bit length of 32 and the bit index of 0 and optimized away the branches and bitwise calculations. In general there will be more instructions and more cost when reading from a BitStream, but this at least shows that we’ve achieved Burst compatibility.
Conclusion
While C# doesn’t allow for a single value to be stored in less than a byte, we can pack multiple sub-byte values into a bit stream as long as the total size of the stream is at least one byte. That’s a very minor concession that allows us to greatly reduce data sizes for purposes such as saving disk space, decreasing load times, and improving CPU cache utilization. Now we can store boolean values in a single bit, define fixed-point numbers such as a 1.3 format (one integer bit and three fraction bits), and even use weird sizes like 9-bit integers without wasting a bunch of bits.
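As a rough sketch of what those reads could look like with BitStream (byteStream here stands for any IByteStream implementation, and the bit widths are purely illustrative):

BitStream stream = new BitStream();
bool flag = stream.ReadUpTo8(byteStream, 1) != 0;       // a boolean in a single bit
byte fixed13 = stream.ReadUpTo8(byteStream, 4);         // 1.3 fixed-point: 1 integer bit, 3 fraction bits
float value = (fixed13 >> 3) + (fixed13 & 0b111) / 8f;  // convert the fixed-point bits to a float
ushort nineBit = stream.ReadUpTo16(byteStream, 9);      // a 9-bit integer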
Next week we’ll expand BitStream to support writing bits. Stay tuned!
#1 by Jayckob on November 25th, 2019 ·
I’ve noticed you’re dabbling in bitwise operations lately.
I’d really like to see some performance statistics to prove the conjecture that the time taken to execute the bitwise operations and packing is made up for with cache performance because I’m not quite seeing it.
For example, I have a Job that iterates over an array of 65,536 values. These values are currently bytes, so I was curious whether making them smaller would save time. It turns out that even though I packed 4 values per byte, the bitwise extraction and packing took longer than the cache speedup saved.
For clarity, I’m writing the array in one job and reading from the array 4 times in another; the reads are cache-friendly (sequential, not random).
It would be wonderful to see the appropriate use cases for such a pattern. I’ve done some research and can see a few articles suggesting that the game Minecraft now uses a system similar to this to store the ‘chunks’ to save RAM, but then converts the data structure back into bytes to operate on.
Great article, thanks!
#2 by jackson on December 1st, 2019 ·
This sounds like an excellent idea for a follow-up article. Thanks!