Games by Tim: A Material System, Part 2: Deciphering the HLSL Packing Rules

Tentative series plan:

An Introduction
Deciphering the HLSL Packing Rules (you are here)
Shader Reflection (clever title pending)
Runtime Parameters (clever title pending)

Last time, on...

In the previous installment of this series, we saw a high-level overview of how a flexible material system could look. Ignoring a few details, the result was a largely data-driven approach, where the shader and the parameters that make up a specific material can be defined in data, with enough flexibility to change the parameters -- not just the values, but even what the shader expects -- without any code changes.

One of the hand-wavey parts was how to go from the cbuffer layout in HLSL to the proper offsets at which to place the final parameter values within a buffer. This article covers part of that: the packing rules of HLSL cbuffers.

Disclaimer: Unless otherwise noted, the following are the results of my own experiments. It seems to be the case, but I can't guarantee it wasn't just a coincidence that things worked out.

Disclaimer: I am only concerned with the automatic packing done by HLSL. It's also possible to explicitly define the layout of cbuffer members, using the packoffset keyword, but my aim is to minimize the work needed when writing shaders, putting the complicated, finicky stuff in code instead.

First, we RTFM

Obviously the first thing we should do is check out the documentation, see what it says about things. So we go to Packing Rules for Constant Variables at the Windows Dev Center.

HLSL ... packs data into 4-byte boundaries. Additionally, HLSL packs data so that it does not cross a 16-byte boundary. Variables are packed into a given four-component vector until the variable will straddle a 4-vector boundary; the next variables will be bounced to the next four-component vector.

Okay, so far so good. We can check various cases by running a simple shader through FXC. So let's try some basic stuff.

Simple Vectors

cbuffer A
{
    float a1;       // Offset:    0 Size:     4
    float2 a2;      // Offset:    4 Size:     8
    float3 a3;      // Offset:   16 Size:    12
    float a4;       // Offset:   28 Size:     4
    bool2 a5;       // Offset:   32 Size:     8
    int a6;         // Offset:   40 Size:     4
};

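This is pretty much as advertised. a2 fits immediately after a1, but a3 (12 bytes) would straddle the boundary at byte 16 if it started at offset 12, so it gets bounced to the next 16-byte boundary. a4 then fills the last 4 bytes of that vector. a5 is 4 bytes per component even though it's just a boolean value. This is easy!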
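The rule for plain vectors is small enough to sketch in a few lines. The following is a minimal Python sketch, not anything FXC exposes: the helper names (place_vector, layout_vectors) are made up for illustration, and it assumes every component is 4 bytes, as observed above.

```python
# Sketch of the vector packing rule: 4-byte components, and a vector may not
# straddle a 16-byte boundary. All names here are hypothetical.
REGISTER_BYTES = 16   # one four-component vector
COMPONENT_BYTES = 4   # float, int and bool components each took 4 bytes above


def place_vector(cursor, components):
    """Return (offset, next_cursor) for a vector with 1 to 4 components."""
    size = components * COMPONENT_BYTES
    room = REGISTER_BYTES - cursor % REGISTER_BYTES
    if size > room:
        cursor += room  # bounce to the start of the next 16-byte vector
    return cursor, cursor + size


def layout_vectors(members):
    """members: list of (name, component_count); returns {name: (offset, size)}."""
    cursor, result = 0, {}
    for name, components in members:
        offset, cursor = place_vector(cursor, components)
        result[name] = (offset, cursor - offset)
    return result


# The members of cbuffer A from the listing above.
cbuffer_a = [("a1", 1), ("a2", 2), ("a3", 3), ("a4", 1), ("a5", 2), ("a6", 1)]
for name, (offset, size) in layout_vectors(cbuffer_a).items():
    print(f"{name}: Offset: {offset:4d} Size: {size:4d}")
```

Feeding it the members of cbuffer A should reproduce the offsets FXC reported, which is a decent sanity check that the rule as stated is complete for this simple case.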

Maybe we want to put a matrix in there. What happens to those?

Matrices

cbuffer B
{
    float4x4 b1;    // Offset:    0 Size:    64
    float4x3 b2;    // Offset:   64 Size:    48
    float3x4 b3;    // Offset:  112 Size:    60
    float2x2 b4;    // Offset:  176 Size:    24
    float1x4 b5;    // Offset:  208 Size:    52
};

We can see b1 takes up a full 64 bytes, as expected. Likewise, b2 is 48 bytes (basically 3 x float4). But what about b3? If it were tightly packed, we would expect 48 bytes again, but if we treat it as 4 x float3, each float3 needs to start on a new 16-byte boundary, so a full 64 might make sense as well. But instead we have 60 bytes. Well, I guess the above excerpt only concerns where a value starts, not where it ends, so okay, b3 packs the same as if we had this:

cbuffer B
{
    float4x4 b1;    // Offset:    0 Size:    64
    float4x3 b2;    // Offset:   64 Size:    48
    float3   b3_0;  // Offset:  112 Size:    12
    float3   b3_1;  // Offset:  128 Size:    12
    float3   b3_2;  // Offset:  144 Size:    12
    float3   b3_3;  // Offset:  160 Size:    12
    float2x2 b4;    // Offset:  176 Size:    24
    float1x4 b5;    // Offset:  208 Size:    52
};

Moving on to b4, we again see something a bit unexpected. Based on what happened with b3, I would expect b4 to take 16 bytes (2 x float2), but instead we have 24! Well, as it turns out, this works out so that each vector of the matrix starts on a new 16-byte boundary (these are columns rather than rows, since HLSL stores cbuffer matrices column-major by default). The same carries over to b5.
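The matrix sizes can be reproduced with a one-line formula. Here is a rough Python sketch with hypothetical helper names. It assumes the default column-major storage, so a floatRxC is stored as C vectors of R components, and it assumes a matrix always starts on a 16-byte boundary, which matches every offset observed here but which I haven't tested for odd cases.

```python
# Sketch: a matrix behaves like an array of column vectors, each starting on
# its own 16-byte vector, with no padding after the last one.
REGISTER_BYTES = 16
COMPONENT_BYTES = 4


def align_up(value, alignment=REGISTER_BYTES):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def matrix_size(rows, cols):
    """Size in bytes of floatRxC under column-major storage."""
    # All columns except the last occupy a full 16 bytes.
    return (cols - 1) * REGISTER_BYTES + rows * COMPONENT_BYTES


def layout_matrices(members):
    """members: list of (name, rows, cols); returns {name: (offset, size)}."""
    cursor, result = 0, {}
    for name, rows, cols in members:
        offset = align_up(cursor)  # assumption: matrices start on a boundary
        size = matrix_size(rows, cols)
        result[name] = (offset, size)
        cursor = offset + size
    return result


# The members of cbuffer B from the listing above.
cbuffer_b = [("b1", 4, 4), ("b2", 4, 3), ("b3", 3, 4), ("b4", 2, 2), ("b5", 1, 4)]
for name, (offset, size) in layout_matrices(cbuffer_b).items():
    print(f"{name}: Offset: {offset:4d} Size: {size:4d}")
```

For example, float2x2 comes out as one full 16-byte vector plus an 8-byte float2, which is where the 24 comes from.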

Let's check the docs again, maybe it says something about this. The closest thing that resembles it is this about arrays:

Arrays are not packed in HLSL by default. To avoid forcing the shader to take on ALU overhead for offset computations, every element in an array is stored in a four-component vector.

This seems to indicate that each element in an array fills 16 bytes, but otherwise could match what's going on with the matrices. So let's play with arrays a bit:

Arrays

cbuffer C
{
    float4 c1[3];   // Offset:    0 Size:    48
    float3 c2[4];   // Offset:   48 Size:    60
    float2 c3[2];   // Offset:  112 Size:    24
    float  c4[4];   // Offset:  144 Size:    52
    float  c5;      // Offset:  196 Size:     4
};

Well, this is familiar! c1, c2, c3 and c4 have the same sizes as b2, b3, b4 and b5! So the docs are a little misleading here: array elements aren't stored in 4-component vectors, they're just aligned to 16 bytes. c5 verifies that the elements of c4 aren't filling the 16 bytes, since it lands at offset 196, right after the last element of c4.
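On the CPU side, this means array elements have to be written with a 16-byte stride rather than back to back. A small Python sketch of what that could look like, using the standard struct module to write floats into a byte buffer. The helper names and the values written are made up, and the offsets for c4 and c5 are taken from the listing above.

```python
import struct

REGISTER_BYTES = 16


def array_element_offsets(base_offset, count):
    """Byte offset of each element; every element starts a new 16-byte vector."""
    return [base_offset + i * REGISTER_BYTES for i in range(count)]


def array_size(element_size, count):
    """Total size: a full 16 bytes for all but the last element."""
    return (count - 1) * REGISTER_BYTES + element_size


# float c4[4] lives at offset 144, followed by float c5.
buffer = bytearray(208)
c4_offsets = array_element_offsets(144, 4)
for offset, value in zip(c4_offsets, [1.0, 2.0, 3.0, 4.0]):
    struct.pack_into("<f", buffer, offset, value)

# c5 packs into the unused tail of the vector holding c4's last element.
c5_offset = 144 + array_size(4, 4)
struct.pack_into("<f", buffer, c5_offset, 5.0)

print(c4_offsets, c5_offset)
```

Copying a tightly packed float[4] straight into the buffer would put three of the four values in the wrong place, which is exactly the kind of bug this whole exercise is meant to avoid.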

So where do we stand?

Vectors are easy. Pack them together, but a single vector can't cross a 16-byte boundary.
Matrices are treated as arrays of vectors.
Each element in an array of vectors is aligned to 16 bytes. Padding is not inserted after the last element, so the next constant can be packed tightly if it fits.

We're almost done with our exploration of HLSL cbuffer packing. We next turn to structs.

Structs

Here's what the docs have to say about structs in cbuffers:

Each structure forces the next variable to start on the next four-component vector. This sometimes generates padding for arrays of structures. The resulting size of any structure will always be evenly divisible by sizeof(four-component vector).

And here's what some basic experimentation shows:

cbuffer D
{
    struct
    {
        float d1_1;     // Offset:    0
    } d1;               // Offset:    0 Size:     4

    struct
    {
        float2 d2_1;    // Offset:   16
    } d2;               // Offset:   16 Size:     8

    float d3;           // Offset:   24 Size:     4

    struct
    {
        float2x2 d4_1;  // Offset:   32
        float d4_2;     // Offset:   56
    } d4;               // Offset:   32 Size:    28

    float d5;           // Offset:   60 Size:     4
};

So right off the bat, the docs seem to be giving the wrong information. None of these structs has a size that's a multiple of sizeof(four-component vector). d1 has a single float, and is the 4 bytes you would expect if it weren't a struct. d2 starts on a 16-byte boundary, but again has only the size of its contents. d3 confirms that a value outside the struct is packed tightly after it. d4 has the 24 bytes we saw earlier for a float2x2, plus an additional 4 bytes for d4_2 following immediately. And d5 is again packed right after d4 without any padding.
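Putting the struct observation together with the earlier rules, a recursive layout pass might look like the Python sketch below. This is only my reading of the experiments, not a documented algorithm: the member description format ("vec", "mat", "struct") and the function names are invented for this example, matrices are assumed column-major, and arrays are left out to keep it short.

```python
REGISTER_BYTES = 16
COMPONENT_BYTES = 4


def align_up(value):
    """Round value up to the next multiple of 16."""
    return (value + REGISTER_BYTES - 1) // REGISTER_BYTES * REGISTER_BYTES


def layout(members, cursor=0, out=None, prefix=""):
    """Assign offsets to members; returns (out, end_cursor).

    Each member is (name, kind, arg), with hypothetical kinds:
      "vec"    arg = component count
      "mat"    arg = (rows, cols), column-major
      "struct" arg = nested member list
    out maps a dotted member path to (offset, size).
    """
    out = {} if out is None else out
    for name, kind, arg in members:
        path = prefix + name
        if kind == "vec":
            size = arg * COMPONENT_BYTES
            if cursor % REGISTER_BYTES + size > REGISTER_BYTES:
                cursor = align_up(cursor)  # would straddle a boundary
            offset, cursor = cursor, cursor + size
        elif kind == "mat":
            rows, cols = arg
            offset = align_up(cursor)
            cursor = offset + (cols - 1) * REGISTER_BYTES + rows * COMPONENT_BYTES
        else:  # "struct": starts on a fresh 16-byte vector, ends without padding
            offset = align_up(cursor)
            _, cursor = layout(arg, offset, out, path + ".")
        out[path] = (offset, cursor - offset)
    return out, cursor


# The members of cbuffer D from the listing above.
cbuffer_d = [
    ("d1", "struct", [("d1_1", "vec", 1)]),
    ("d2", "struct", [("d2_1", "vec", 2)]),
    ("d3", "vec", 1),
    ("d4", "struct", [("d4_1", "mat", (2, 2)), ("d4_2", "vec", 1)]),
    ("d5", "vec", 1),
]
offsets, end = layout(cbuffer_d)
for path, (offset, size) in offsets.items():
    print(f"{path}: Offset: {offset:4d} Size: {size:4d}")
```

The only struct-specific behaviour is the align_up call on entry. Nothing is rounded on the way out, which is why d3 and d5 can slot in right behind the structs that precede them.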

There is one final topic for us. What happens if we take a struct and put it in an array?

Arrays of Structs

Based on past experience, it's probably reasonable to assume that an array of structs will behave similarly to any other array. That is, each element starts on a 16-byte boundary, with no padding at the end. How does it look?

cbuffer E
{
    struct
    {
        float2 e1_1;    // Offset:    0
    } e1[3];            // Offset:    0 Size:    40
    
    float e2;           // Offset:   40 Size:     4
    
    struct
    {
        float  e3_1;    // Offset:   48
        float4 e3_2;    // Offset:   64
        float  e3_3;    // Offset:   80
    } e3[2];            // Offset:   48 Size:    84
};

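Looks about how we expect! Going by the sizes given, each array element starts on a 16-byte boundary, with no padding after the last element. e1 is 16 + 16 + 8 = 40 bytes, and e3 is 48 + 36 = 84 bytes, since each 36-byte element is padded out to 48 bytes before the next one starts.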
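The arithmetic for arrays of structs can be written down directly. This is a small Python sketch with made-up names, taking the unpadded struct sizes (8 bytes for e1's element, 36 bytes for e3's element) as inputs rather than computing them.

```python
REGISTER_BYTES = 16


def align_up(value):
    """Round value up to the next multiple of 16."""
    return (value + REGISTER_BYTES - 1) // REGISTER_BYTES * REGISTER_BYTES


def struct_array_layout(start, element_size, count):
    """Return (offset, stride, total_size) for an array of structs.

    element_size is the unpadded size of one struct, laid out as if it
    started at offset 0.
    """
    offset = align_up(start)
    stride = align_up(element_size)              # each element starts on a boundary
    total = (count - 1) * stride + element_size  # no padding after the last one
    return offset, stride, total


# e1: struct { float2 } is 8 bytes; e3: struct { float; float4; float } is 36.
e1 = struct_array_layout(0, 8, 3)
e2_offset = e1[0] + e1[2]                        # a lone float fits right after e1
e3 = struct_array_layout(e2_offset + 4, 36, 2)

# Offsets of e3_1, e3_2, e3_3 within each element are 0, 16 and 32.
e3_member_offsets = [e3[0] + i * e3[1] + rel for i in range(2) for rel in (0, 16, 32)]
print(e1, e2_offset, e3, e3_member_offsets)
```

The member offsets for the first element of e3 line up with the ones FXC printed. The ones for the second element are simply the stride added on top, which FXC doesn't list but which the total size implies.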

Summary

So I'll just give a quick summary of what we found:

Vectors are easy. Pack them together, but a single vector can't cross a 16-byte boundary.
Matrices are treated as arrays of vectors.
Each element in an array of vectors is aligned to 16 bytes. Padding is not inserted after the last element, so the next constant can be packed tightly if it fits.
Structs are aligned to 16 bytes. As with arrays, padding is not inserted after the last member.
Arrays of structs behave as expected with these rules.

It's really not so complicated, but it took a bit of experimentation to get a handle on it. The single page of documentation was mostly correct, but had some misleading bits. I didn't look at double values here, but I expect they would behave consistently -- just keeping in mind that each component is now 8 bytes instead of 4, while the alignment is probably still 16 bytes.

With this information, hopefully you can go forth and build all sorts of complex cbuffers, and pack them correctly.

Stay tuned for next time, when I use the D3D Shader Reflection interface to automatically figure out the entire cbuffer memory layout!


Saturday, April 23, 2016
