ASTC runtime compression
18 May 2022
This is a bit of a brain dump on what a BC7 equivalent in ASTC would look like for targeting runtime compression for virtual texturing - “astcenc-rt”. This is an evening thought experiment over a beer, so beware I’ve not tried to actually build this (yet) …
UPDATE: My original idea here was to try and limit the encoding to 2^N QUANT levels, because this is very simple to encode. In hindsight this doesn’t actually work in the general case. ASTC doesn’t explicitly store the color quantization; it’s inferred amount of space remaining after everything else is accounted for. You can’t “under quant” to make encoding easier - it will always round up to the highest quantization factor that will fit in the space remaining. Any encoding will either need to use the full BISE encoding, or cherry pick very carefully to avoid cases that “under quant”.
BC7
The BC7 format is the high-quality format for desktop and consoles. If you squint a bit, it has many of the same fundamental features as ASTC: variable quantization, variable partition counts, options for two weight planes. However it gets more limited options to deploy them, with eight block modes each with a fixed feature set and color/weight encoding.
The one “magic trick” in BC7 is the use of P-bits, which use a common LSB for each color endpoint. For example it can store a 555 color as 444+1, saving two bits, but accepting that one of the three LSBs might be wrong. For the modes with multiple color partitions this can add up a lot of bits saved, so it’s a pretty handy feature and trivial to encode.
BC7 block modes
- Mode 0: RGB+RGB QUANT_32, QUANT_8 weights, 3 partitions, 16 partitionings
- Mode 1: RGB+RGB QUANT_128, QUANT_8 weights, 2 partitions, 64 partitionings
- Mode 2: RGB+RGB QUANT_32, QUANT_4 weights, 3 partitions, 64 partitionings
- Mode 3: RGB+RGB QUANT_256, QUANT_4 weights, 2 partitions, 64 partitionings
- Mode 4: RGB+RGB QUANT_32 (RGB) + QUANT_64 (A), QUANT_4+QUANT_8 weights, channel swizzle
- Mode 5: RGBA+RGBA QUANT_256 (RGB) + QUANT_64 (A), QUANT_4+QUANT4 weights, channel swizzle
- Mode 6: RGBA+RGBA QUANT_256 (RGBA), QUANT_16 weights
- Mode 7: RGBA+RGBA QUANT_64 (RGBA), QUANT_4 weights, 2 partitions, 64 partitionings
Offline compression using BC7enc uses only modes 1 and 6 for opaque blocks, and modes 1, 5, 6, 7 for transparent blocks.
ASTC
For the purposes of this exercise we only care about 4x4 blocks, as this is the matching bitrate for BC7, and the obvious target to use for a runtime compressor as the higher bitrate gives us more wiggle room to cope with slightly weak encoding choices.
To get something fast to encode we need to prune the search space. Ideally we want to reduce the number of encodings we even bother trying, but we also want to try and drop encodings that are expensive to turn into a final bitstream. ASTC includes tricks which are very useful to keep quality at low bitrate, and tricks which keep the decompressor hardware simpler at the expense of compressor complexity, neither of which help our goal of fast runtime compression here. So, let’s bring out the scissors and start trimming …
My immediate choices to end on the cutting room floor for would be:
- Partitions: No support for 3 or 4 partitions
- Rules out directly matching BC7 Mode 0 and Mode 2
- Weight planes: No support for 2 weight planes
- Rules out directly matching BC7 Mode 4 and Mode 5
- Endpoint modes: No support for split modes
- Ensures we use a single CEM for all partitions
- Weight decimation: No support for low-density weight grids
- Value quantization: No support for non-2^N quantization
The last two restrictions make compression MUCH easier to encode into a bitstream on the GPU. BUT they are basically throwing away the two major super-powers of ASTC, so we pay overhead in block mode bits for encoding them but get no return on that cost. Because of this I don’t expect a fast runtime ASTC compressor to match an equivalent BC7 compressor on image quality, especially where BC7 can still use P-bits.
These limitations define the basic bitrate budget we have to store colors and weights:
- 1 partition:
- 11 bits for block mode
- 2 bits for partition count
- 4 bits for endpoint mode
- 128 - (11 + 2 + 4) = 111 bits free
- 2 partition:
- 11 bits for block mode
- 2 bits for partition count
- 10 bits for partition index
- 6 bits for endpoint mode
- 128 - (11 + 2 + 10 + 6) = 99 bits free
The main challenge with ASTC vs BC7 is going to be the fixed overhead of the encoding. ASTC spends a lot of the bits in the “control plane” encoding for a block, which we really won’t get much pay-back on for in this real-time use case. In the 1 partition case ASTC loses 17 bits, and in the 2 partition case ASTC loses 29 bits.
BC7’s cost here is variable (“mode index + 1” bits, plus 4 or 6 for partition selection for the 2 partition encodings) but works out much lower than ASTC AND it saves 4 bits per partition from P-bit encoding the color endpoints.
Weight encoding
If we assume we want to stick to power-of-two quantization factors for simplicity of encoding, we need:
- 16 * QUANT_32 = 80 bits
- 16 * QUANT_16 = 64 bits
- 16 * QUANT_8 = 48 bits
- 16 * QUANT_4 = 32 bits
Color encoding
We have two endpoint component component counts that are of primary interest for matching BC7 - RGB and RGBA. If we assume we want to stick to power-of-two quantization factors for simplicity of encoding, we need:
One part of ASTC that we will need to keep is the ability to store color in different ways. The basic version is two arbitrary color endpoints e.g. (RGB+RGB), but we can also store a base color and a diff (RGB+dRGB) and a base color and a scale (RGBS).
The diff encoding allows us to more accurately match the second endpoint, so the effective quantization is finer than the base quantization. BC7 cannot do this.
The scale encoding allows us to encode RGB data in fewer values (4 rather than 6). The effectively allows a higher base quantization but can only represent colors with correlated chroma. BC7 cannot do this.
- 4 values (RGBS) for 1 partition
- 4 * QUANT_256 = 32 bits
- 4 * QUANT_128 = 28 bits
- 4 * QUANT_64 = 24 bits
- 4 * QUANT_32 = 20 bits
- 4 * QUANT_16 = 16 bits
- 6 values (RGB+RGB, RGB+dRGB, RGBS+A+A) for 1 partition
- 6 * QUANT_256 = 48 bits
- 6 * QUANT_128 = 42 bits
- 6 * QUANT_64 = 36 bits
- 6 * QUANT_32 = 30 bits
- 6 * QUANT_16 = 24 bits
- 8 values (RGBA+RGBA, RGBA+dRGBA) for 1 partition or (RGBS) for 2 partitions
- 8 * QUANT_256 = 64 bits
- 8 * QUANT_128 = 56 bits
- 8 * QUANT_64 = 48 bits
- 8 * QUANT_32 = 40 bits
- 8 * QUANT_16 = 32 bits
- 12 values (RGB+RGB, RGB+dRGB, RGBS+A+A) for 2 partitions
- 12 * QUANT_256 = 96 bits
- 12 * QUANT_128 = 84 bits
- 12 * QUANT_64 = 72 bits
- 12 * QUANT_32 = 60 bits
- 12 * QUANT_16 = 48 bits
- 16 values (RGBA+RGBA, RGBA+dRGBA) for 2 partitions
- 16 * QUANT_256 = 128 bits
- 16 * QUANT_128 = 112 bits
- 16 * QUANT_64 = 96 bits
- 16 * QUANT_32 = 80 bits
- 16 * QUANT_16 = 64 bits
- 16 * QUANT_8 = 48 bits
Encoding options
RGB 1 partition encoding
BC7 Mode 4: RGBA QUANT_32/64 color, QUANT_4/8 weights
This is one area where BC7 lacks a little - it doesn’t have a plain RGB 1 partition endpoint mode. With 111 bits to play with, and the ability to actually encode endpoints without alpha, ASTC provides better options than BC7 here.
Note that mode 4 does allow a separate non-correlated channel, but I’ve ignored that for the purposes of this exercise as it’s out-of-scope for the runtime compressor.
With RGB+RGB endpoints, ASTC can do either:
- Annoyingly 1 bit short for QUANT_128 + QUANT_16.
- QUANT_128 color + QUANT_8 weights (42 + 48 = 90 bits)
- QUANT_64 color + QUANT_16 weights (36 + 64 = 100 bits)
The RGB+S endpoint does better. ASTC can do either:
- Annoyingly 1 bit short for QUANT_256 + QUANT_32.
- QUANT_256 color + QUANT_16 weights (32 + 64 = 96 bits)
- QUANT_128 color + QUANT_32 weights (28 + 80 = 108 bits)
RGB 2 partition encoding
BC7 Mode 1: RGB QUANT_128 color, QUANT_8 weights
BC7 Mode 3: RGB QUANT_256 color, QUANT_4 weights
With only 99 bits to play with, using simple RGB+RGB endpoints for ASTC is limiting. The closest it gets is QUANT_32 color and QUANT_4 weights (60+32=92). This is really NOT compelling in terms of image quality vs BC7.
The RGB+S endpoint does better. ASTC can do either:
- QUANT_64 color and QUANT_8 weights (48 + 48 = 96 bits).
- QUANT_128 color and QUANT_4 weights (56 + 32 = 88 bits).
This is in the right ballpark, so if you have chroma-correlated data is likely a good choice.
RGBA 1 partition encoding
BC7 Mode 4: RGBA QUANT_32/64 color, QUANT_4/8 weights
BC7 Mode 6: RGBA+RGBA QUANT_256 (RGBA), QUANT_16 weights
As before, I’m ignoring the mode 4 separate non-correlated channel.
With RGBA+RGBA endpoints, ASTC can do either:
- Annoyingly 1 bit short for QUANT_64 + QUANT_16.
- QUANT_64 color + QUANT_8 weights (48 + 48 = 96 bits).
- QUANT_128 color and QUANT_4 weights (56 + 32 = 88 bits).
The RGBS+A+A endpoint does better. ASTC can do either:
- Annoyingly 1 bit short for QUANT_128 + QUANT_16.
- QUANT_128 color + QUANT_8 weights (42 + 48 = 90 bits)
- QUANT_64 color + QUANT_16 weights (36 + 64 = 100 bits)
RGBA 2 partition encoding
BC7 Mode 7: RGBA QUANT_64 color, QUANT_4 weights
With only 99 bits to play with, using simple RGBA+RGBA endpoints for ASTC is limiting. The closest it gets is QUANT_16 color and QUANT_4 weights (64+32=96). This is really NOT compelling in terms of image quality vs BC7.
The RGBS+A+A endpoint does better, but it’s still behind BC7. ASTC can do:
- QUANT_32 color and QUANT_4 weights (60 + 32 = 92 bits).
This is pretty weak, but usable for cases where you really need the second partition (QUANT_32 is a 555 color endpoint).
The 1 bit short cases
The way ASTC is specified, I do wonder if it’s actually legal to use overlapping bit streams to “find” the extra bit we need here.
The weight stream is stored bit-reversed, so we’d be colliding a single MSB in both. If the values are mismatched it’s going to be catastrophically bad, but we should be able to use it half the time …
UPDATE: It’s not legal to do this - the color quant is inferred from the free space remaining so you cannot overlap bitstreams …
Summary
It’s definitely possible to build an ASTC compressor which is simple for a real-time use case, but the simplicity does mean we use the format with two hands tied behind its back in terms of image quality (no weight decimation, no NPOT quantization factors). BC7 will win on quality, but a good compressor with these restrictions should still beat offline textures stored at a lower bitrate using e.g. ASTC 6x6 blocks.
For one partition cases it looks like ASTC is better than BC7 for RGB, and only slightly worse for RGBA, but both look pretty usable in terms of quality.
For two partition cases ASTC really struggles to encode two independent endpoints in the available bitrate, but looks very usable for the RGB+scale endpoints which need to store fewer colors. So “good” if you have a block with chroma-correlated data, and I probably wouldn’t even try the independent endpoints.