Saturday, April 23, 2016

A Material System, Part 2: Deciphering the HLSL Packing Rules

Tentative series plan:
  1. An Introduction
  2. Deciphering the HLSL Packing Rules (you are here)
  3. Shader Reflection (clever title pending)
  4. Runtime Parameters (clever title pending)

Last time, on...

In the previous installment of this series, we saw a high-level overview of how a flexible material system could look. Ignoring a few details, the result was a largely data-driven approach, where the shader and the parameters that make a specific material can be defined in data, with enough flexibility to change the parameters -- not just the values, but even what the shader expects -- without any code changes.

One of the hand-wavey parts was how to go from the cbuffer layout in HLSL to the proper offsets at which to put the final parameter values within a buffer. This article will cover a part of that: the packing rules of HLSL cbuffers.

Disclaimer: Unless otherwise noted, the following are the results of my own experiments. It seems to be the case, but I can't guarantee it wasn't just a coincidence that things worked out.

Disclaimer: I am only concerned with the automatic packing done by HLSL. It's also possible to explicitly define the layout of cbuffer members, using the packoffset keyword, but my aim is to minimize the work needed when writing shaders, putting the complicated finicky stuff in code instead.

First, we RTFM

Obviously the first thing we should do is check out the documentation, see what it says about things. So we go to Packing Rules for Constant Variables at the Windows Dev Center.
"HLSL ...  packs data into 4-byte boundaries. Additionally, HLSL packs data so that it does not cross a 16-byte boundary. Variables are packed into a given four-component vector until the variable will straddle a 4-vector boundary; the next variables will be bounced to the next four-component vector."
Okay, so far so good. We can check various cases by running a simple shader through FXC. So let's try some basic stuff.

Simple Vectors

cbuffer A
{
    float a1;       // Offset:    0 Size:     4
    float2 a2;      // Offset:    4 Size:     8
    float3 a3;      // Offset:   16 Size:    12
    float a4;       // Offset:   28 Size:     4
    bool2 a5;       // Offset:   32 Size:     8
    int a6;         // Offset:   40 Size:     4
};
This is pretty much as advertised. a2 fits immediately after a1, but a3 needs to start on a new 16-byte boundary. a5 is 4 bytes per component even though it's just a boolean value. This is easy!
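
As a sanity check, here's a minimal C++ sketch of that rule as I understand it. PlaceVector is a name I made up for illustration (it isn't part of FXC or D3D), and it only assumes what the listing above shows: a value stays put unless it would straddle a 16-byte boundary.

```cpp
#include <cstddef>

// Hypothetical helper, not part of any D3D API. Given the first free byte
// (`cursor`) and the size in bytes of a scalar or vector (4 to 16), returns
// the offset at which the value would be placed.
constexpr std::size_t PlaceVector(std::size_t cursor, std::size_t size)
{
    // Keep the value where it is if its first and last byte share a
    // 16-byte row; otherwise bump it to the start of the next row.
    return (cursor / 16 == (cursor + size - 1) / 16)
               ? cursor
               : (cursor / 16 + 1) * 16;
}

// These reproduce the offsets reported for cbuffer A.
static_assert(PlaceVector(0, 4) == 0, "float a1");
static_assert(PlaceVector(4, 8) == 4, "float2 a2");
static_assert(PlaceVector(12, 12) == 16, "float3 a3");
static_assert(PlaceVector(28, 4) == 28, "float a4");
static_assert(PlaceVector(32, 8) == 32, "bool2 a5");
static_assert(PlaceVector(40, 4) == 40, "int a6");
```

The cursor passed in for each member is simply the previous member's offset plus its size.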

Maybe we want to put a matrix in there. What happens to those?

Matrices

cbuffer B
{
    float4x4 b1;    // Offset:    0 Size:    64
    float4x3 b2;    // Offset:   64 Size:    48
    float3x4 b3;    // Offset:  112 Size:    60
    float2x2 b4;    // Offset:  176 Size:    24
    float1x4 b5;    // Offset:  208 Size:    52
};
We can see b1 takes up a full 64 bytes, as expected. Likewise, b2 is 48 bytes (basically 3 x float4). But what about b3? If it were tightly packed, we would expect 48 bytes again, but if we treat it as 4 x float3, each float3 needs to start on a new 16-byte boundary, so a full 64 might make sense as well. But instead we have 60 bytes. Well, I guess the above excerpt only concerns where a value starts, not where it ends, so okay, b3 packs the same as if we had this:
cbuffer B
{
    float4x4 b1;    // Offset:    0 Size:    64
    float4x3 b2;    // Offset:   64 Size:    48
    float3   b3_0;  // Offset:  112 Size:    12
    float3   b3_1;  // Offset:  128 Size:    12
    float3   b3_2;  // Offset:  144 Size:    12
    float3   b3_3;  // Offset:  160 Size:    12
    float2x2 b4;    // Offset:  176 Size:    24
    float1x4 b5;    // Offset:  208 Size:    52
};
Moving on to b4, we again see something a bit unexpected. Based on what happened with b3, I would expect b4 to take 16 bytes (2 x float2), but instead we have 24! Well, as it turns out, each vector of the matrix (each column, under HLSL's default column-major packing) starts on a new 16-byte boundary, so the first float2 is padded out to 16 bytes and the second adds 8 more. The same carries over to b5.
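
The sizes above fit a simple formula, sketched here in C++. MatrixSize is a hypothetical helper of my own, and it assumes the default column-major packing, where a floatRxC is stored as C vectors of R floats.

```cpp
#include <cstddef>

// Hypothetical helper: size in bytes of a floatRxC matrix in a cbuffer,
// assuming column-major packing. Every column starts on a 16-byte
// boundary, and no padding is added after the last column.
constexpr std::size_t MatrixSize(std::size_t rows, std::size_t cols)
{
    return 16 * (cols - 1) + 4 * rows;
}

// These reproduce the sizes reported for cbuffer B.
static_assert(MatrixSize(4, 4) == 64, "float4x4 b1");
static_assert(MatrixSize(4, 3) == 48, "float4x3 b2");
static_assert(MatrixSize(3, 4) == 60, "float3x4 b3");
static_assert(MatrixSize(2, 2) == 24, "float2x2 b4");
static_assert(MatrixSize(1, 4) == 52, "float1x4 b5");
```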

Let's check the docs again, maybe it says something about this. The closest thing that resembles it is this about arrays:
"Arrays are not packed in HLSL by default. To avoid forcing the shader to take on ALU overhead for offset computations, every element in an array is stored in a four-component vector."
This seems to indicate that each element in an array fills 16 bytes, but otherwise could match what's going on with the matrices. So let's play with arrays a bit:

Arrays

cbuffer C
{
    float4 c1[3];   // Offset:    0 Size:    48
    float3 c2[4];   // Offset:   48 Size:    60
    float2 c3[2];   // Offset:  112 Size:    24
    float  c4[4];   // Offset:  144 Size:    52
    float  c5;      // Offset:  196 Size:     4
};
Well this is familiar! c1,c2,c3,c4 look the same as b2,b3,b4,b5! So the docs are a little misleading here: array elements aren't stored in 4-component vectors, they're just aligned to 16 bytes. c5 verifies that the elements of c4 aren't filling the 16 bytes.
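
Here's the same observation as a small C++ sketch. AlignUp16 and VectorArraySize are names I invented for illustration; the only assumptions are the ones suggested by the listing: the array starts on a 16-byte boundary, elements have a 16-byte stride, and the last element isn't padded.

```cpp
#include <cstddef>

// Rounds a byte offset up to the next multiple of 16.
constexpr std::size_t AlignUp16(std::size_t n)
{
    return (n + 15) / 16 * 16;
}

// Hypothetical helper: size in bytes of an array of `count` scalars or
// vectors of `elemSize` bytes each (elemSize <= 16).
constexpr std::size_t VectorArraySize(std::size_t elemSize, std::size_t count)
{
    return 16 * (count - 1) + elemSize;
}

// Sizes reported for cbuffer C.
static_assert(VectorArraySize(16, 3) == 48, "float4 c1[3]");
static_assert(VectorArraySize(12, 4) == 60, "float3 c2[4]");
static_assert(VectorArraySize(8, 2) == 24, "float2 c3[2]");
static_assert(VectorArraySize(4, 4) == 52, "float c4[4]");

// Each array starts on the next 16-byte boundary after the previous one...
static_assert(AlignUp16(48 + 60) == 112, "offset of c3");
static_assert(AlignUp16(112 + 24) == 144, "offset of c4");
// ...but the scalar c5 packs right behind the last element of c4.
static_assert(144 + VectorArraySize(4, 4) == 196, "offset of c5");
```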

So where do we stand?
  1. Vectors are easy. Pack them together, but a single vector can't cross a 16-byte boundary.
  2. Matrices are treated as arrays of vectors.
  3. Each element in an array of vectors is aligned to 16 bytes. Padding is not inserted after the last element, so the next constant can be packed tightly if it fits.
We're almost done with our exploration of HLSL cbuffer packing. We next turn to structs.

Structs

Here's what the docs have to say about structs in cbuffers:
"Each structure forces the next variable to start on the next four-component vector. This sometimes generates padding for arrays of structures. The resulting size of any structure will always be evenly divisible by sizeof(four-component vector)."
And here's what some basic experimentation shows:
cbuffer D
{
    struct
    {
        float d1_1;     // Offset:    0
    } d1;               // Offset:    0 Size:     4

    struct
    {
        float2 d2_1;    // Offset:   16
    } d2;               // Offset:   16 Size:     8

    float d3;           // Offset:   24 Size:     4

    struct
    {
        float2x2 d4_1;  // Offset:   32
        float d4_2;     // Offset:   56
    } d4;               // Offset:   32 Size:    28

    float d5;           // Offset:   60 Size:     4
};
So right off the bat, the docs seem to be giving the wrong information. None of these structs has a size that's a multiple of sizeof(four-component vector). d1 has a single float, and is the 4 bytes you would expect if it weren't a struct. d2 starts on a 16-byte boundary, but again has only the size of its contents. d3 confirms that a value outside the struct is packed tightly after it. d4 has the 24 bytes we saw earlier for a float2x2, plus an additional 4 bytes for d4_2 following immediately. And d5 is again packed right after d4 without any padding.
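
To double check that reading, here's a rough C++ sketch that walks cbuffer D with a running cursor. Packer and OffsetOfD5 are hypothetical names of mine, not anything from FXC or D3D, and the matrix rule assumes the default column-major packing.

```cpp
#include <cstddef>

// Hypothetical layout helper that mirrors the observed rules.
// `cursor` is the first free byte in the cbuffer.
struct Packer
{
    std::size_t cursor = 0;

    // Structs and matrices start on a 16-byte boundary.
    std::size_t Align16()
    {
        cursor = (cursor + 15) / 16 * 16;
        return cursor;
    }

    // Scalars and vectors pack tightly unless they would straddle a boundary.
    std::size_t Vector(std::size_t size)
    {
        if (cursor / 16 != (cursor + size - 1) / 16)
            Align16();
        const std::size_t offset = cursor;
        cursor += size;
        return offset;
    }

    // floatRxC: C columns with a 16-byte stride, no padding after the last.
    std::size_t Matrix(std::size_t rows, std::size_t cols)
    {
        const std::size_t offset = Align16();
        cursor += 16 * (cols - 1) + 4 * rows;
        return offset;
    }
};

// Walks cbuffer D member by member and returns the offset computed for d5.
std::size_t OffsetOfD5()
{
    Packer p;
    p.Align16(); p.Vector(4);                  // d1 { float d1_1; }        ->  0
    p.Align16(); p.Vector(8);                  // d2 { float2 d2_1; }       -> 16
    p.Vector(4);                               // float d3                  -> 24
    p.Align16(); p.Matrix(2, 2); p.Vector(4);  // d4 { float2x2; float; }   -> 32, 56
    return p.Vector(4);                        // float d5                  -> 60
}
```

Note that there is no "end of struct" step at all: since no padding follows the last member, closing a struct doesn't move the cursor.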

There is one final topic for us. What happens if we take a struct and put it in an array?

Arrays of Structs

Based on past experience, it's probably reasonable to assume that an array of structs will behave similarly to any other array. That is, each element starts on a 16-byte address, with no padding at the end. How does it look?
cbuffer E
{
    struct
    {
        float2 e1_1;    // Offset:    0
    } e1[3];            // Offset:    0 Size:    40
    
    float e2;           // Offset:   40 Size:     4
    
    struct
    {
        float  e3_1;    // Offset:   48
        float4 e3_2;    // Offset:   64
        float  e3_3;    // Offset:   80
    } e3[2];            // Offset:   48 Size:    84
};
Looks about how we expect! Going by the sizes given, each array element starts on a 16-byte address, with no padding after the last element.
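
Going one step further, the same rule tells us where a member of a given element lives. This is a small C++ sketch under that assumption; StructArraySize and ElementMemberOffset are names I made up, and the last check is derived from the rule rather than read off the FXC listing.

```cpp
#include <cstddef>

// Rounds a byte count up to the next multiple of 16.
constexpr std::size_t AlignUp16(std::size_t n)
{
    return (n + 15) / 16 * 16;
}

// Hypothetical helper: total size of an array of structs. The stride is the
// struct size rounded up to 16 bytes; the last element is not padded.
constexpr std::size_t StructArraySize(std::size_t structSize, std::size_t count)
{
    return AlignUp16(structSize) * (count - 1) + structSize;
}

// Hypothetical helper: absolute offset of a member inside element `index`.
constexpr std::size_t ElementMemberOffset(std::size_t arrayOffset,
                                          std::size_t structSize,
                                          std::size_t index,
                                          std::size_t memberOffset)
{
    return arrayOffset + AlignUp16(structSize) * index + memberOffset;
}

// e1's struct is 8 bytes; e3's struct is 36 bytes (members at 0, 16 and 32).
static_assert(StructArraySize(8, 3) == 40, "e1[3]");
static_assert(StructArraySize(36, 2) == 84, "e3[2]");
static_assert(ElementMemberOffset(48, 36, 0, 16) == 64, "e3[0].e3_2");
// Derived: e3[1].e3_3 ends at 128 + 4 = 132, which matches 48 + 84.
static_assert(ElementMemberOffset(48, 36, 1, 32) == 128, "e3[1].e3_3");
```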

Summary

So I'll just give a quick summary of what we found:
  1. Vectors are easy. Pack them together, but a single vector can't cross a 16-byte boundary.
  2. Matrices are treated as arrays of vectors.
  3. Each element in an array of vectors is aligned to 16 bytes. Padding is not inserted after the last element, so the next constant can be packed tightly if it fits.
  4. Structs are aligned to 16 bytes. As with arrays, padding is not inserted after the last member.
  5. Arrays of structs behave as expected with these rules.
It's really not so complicated, but it took a bit of experimentation to get a handle on it. The single page of documentation was mostly correct, but had some misleading bits. I didn't look at double values here, but I expect they would behave consistently -- just keeping in mind that each component is now 8 bytes instead of 4, while the alignment is probably still 16 bytes.

With this information, hopefully you can go forth and build all sorts of complex cbuffers, and pack them correctly.

Stay tuned for next time, when I use the D3D Shader Reflection interface to automatically figure out the entire cbuffer memory layout!

Saturday, April 9, 2016

A Material System, Part 1: An Introduction

Tentative series plan:
  1. An Introduction (you are here)
  2. Deciphering the HLSL Packing Rules
  3. Shader Reflection (clever title pending)
  4. Runtime Parameters (clever title pending)
I've been working on some sort of material system, for rendering objects and such. One feature I'm looking for is that it should be easy to set up new materials and shaders with minimal (or preferably no) code changes. At first glance, this may seem like a simple thing: "well, shaders are written in some shader language, generally as separate data files... so just write a new shader and attach it to your mesh!" But things are seldom so simple...

This post is going to look at some high-level concepts for my materials to set the stage. For the time being, I'm focusing on the pixel shader for the material design, as that's what I'm currently working on. Vertex/Input Assembly has not reached the level of flexibility I want yet, so maybe I'll write about that later when I get there (as well as maybe other crazy things like tessellation support).

Disclaimer 1 : The material system I have arrived at suits my needs at this time. There may be better ways to do it, but this is what I've gone with. To build up to my design, I'll talk about some other possibilities that don't work for me. I'm not saying they're terrible, just that they don't fit what I want. And even if I do say it's terrible, maybe it's perfect for some other purpose. If you're using one of them, and don't need to go any more advanced, then that's fine!

Disclaimer 2 : I'm talking about D3D11/HLSL here. The material design can probably be carried over to other APIs and shader languages, but I'm not generally considering that.

Start with something simple

The simplest thing, with limited flexibility, is probably to have your pixel shader like this:

cbuffer Material : register(b0)
{
  float4 Color;
};
Texture2D DiffuseTex;

float4 psmain( VSOUT In )
{
  return Color * DiffuseTex.Sample( sampler, In.uv );
}

Give or take some missing code, this gives you a configurable color parameter and a texture to sample from. In your C++ code, you might have something like:

struct MaterialParameterData
{
  float Color[4];
};
struct PerObjectMaterialParameters
{
  ID3D11Buffer* Constants;
  ID3D11Texture2D* DiffuseTexture;
};

Give each object a PerObjectMaterialParameters, with Constants filled with a MaterialParameterData. Bind everything and draw your thing. Maybe you read the colors from some data file when creating the object, along with a filename to grab a texture from. Totally flexible! Just change the data and get different colors and textures! Ship it!

Don't get too excited... What happens when color modulation isn't good enough? Maybe someone decided that some objects should use the texture as a mask over a solid color. Well that's easy, just use a new shader:

cbuffer Material : register(b0)
{
  float4 Color;
};
Texture2D DiffuseTex;

float4 psmain( VSOUT In )
{
  float4 tex = DiffuseTex.Sample( sampler, In.uv );
  return lerp( Color, tex, tex.a );
}

Like magic! And look, the cbuffer and texture are the same, so no code changes required! Just point the object at this new shader, and it'll be perfect! ... but wait, someone now wants to have a layered material:

cbuffer Material : register(b0)
{
  float4 Color0;
  float4 Color1;
};
Texture2D DiffuseTex0;
Texture2D DiffuseTex1;

float4 psmain( VSOUT In )
{
  float4 tex0 = DiffuseTex0.Sample( sampler, In.uv );
  float4 tex1 = DiffuseTex1.Sample( sampler, In.uv );
  return lerp( Color0*tex0, Color1*tex1, tex1.a );
}

Phooey. We've got more constants and more textures. Now there are two obvious choices:
  1. Update the other shaders to have the same cbuffer and textures, just don't use the extra stuff. This will require some small C++ changes to use the new data, but it's pretty simple. But as the materials get more complex the buffer size and number of possible texture bindings may rapidly increase.
  2. Add a new struct in C++. Objects can specify what type of material they use, and get the appropriate constant and texture bindings. Each material's buffer will only contain the data it needs, but any new material will require several code changes.

Which one is best? Neither, they're both terrible.

A Little More Flexible

It's likely that you don't want to spend all your time supporting new materials, with new parameters, new textures, new computations. Maybe eventually it would stabilize, but at what cost? There are more important things to do!

So throw out everything. From the C++ side, we'll treat the constant buffer as a black box. It's just a chunk of memory that gets filled with something. For textures, we'll just have a list of bindings (essentially a texture and the slot to bind it to). Considering the first shader above, with a single Color parameter and a single Texture, we might define the parameters in some data file like:

constants:
  1, 1, 0, 1
textures:
  0=texture.dds

... or whatever. I don't care how it's stored, but somehow we parse that, come up with a bunch of floats to stick in a buffer, and a texture to load and bind. And we get a lovely yellow thing. Now how about the clever layered material? Well, how about doing something like:

constants:
  1, 1, 0, 1,
  0, 1, 1, 1,
textures:
  0=bottom_layer.dds
  1=top_layer.dds

Now there are 8 floats for the buffer and two textures, but because we aren't making assumptions about it, there's no need to make any code changes. Amazing! Okay, this is the best thing since bacon-wrapped hot dogs! We can do anything now, what more could we want?
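
To make the black-box idea a bit more concrete, here's a rough C++ sketch of what the parsed data might turn into. TextureBinding, MaterialData, ConstantBytes and LayeredExample are all names I'm making up for illustration; parsing the file, creating the D3D11 buffer and loading the textures are left out.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// A texture file and the slot to bind it to (hypothetical type).
struct TextureBinding
{
    std::uint32_t slot;
    std::string   file;
};

// Everything the C++ side knows about a material: a bag of floats and a
// list of texture bindings. No per-shader struct is involved.
struct MaterialData
{
    std::vector<float>          constants;
    std::vector<TextureBinding> textures;
};

// Flattens the constants into raw bytes, ready to be copied into a buffer.
std::vector<std::uint8_t> ConstantBytes(const MaterialData& material)
{
    std::vector<std::uint8_t> bytes(material.constants.size() * sizeof(float));
    if (!bytes.empty())
        std::memcpy(bytes.data(), material.constants.data(), bytes.size());
    return bytes;
}

// The layered material from the data listing above.
MaterialData LayeredExample()
{
    return MaterialData{
        {1.0f, 1.0f, 0.0f, 1.0f,
         0.0f, 1.0f, 1.0f, 1.0f},
        {{0, "bottom_layer.dds"}, {1, "top_layer.dds"}}};
}
```

Going from the single-color material to the layered one changes only the contents of the two vectors, which is the whole point.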

Well, as it happens, the moment you've finished off this masterpiece, someone comes along and gives you this:

cbuffer Material : register(b0)
{
  float4 Color0;
  float Blend;
  float4 Color1;
};

float4 psmain()
{
  return DoSomethingCleverWithTheParameters();
}

"Easy," you think, "I'll just give it data like this:"

constants:
  1, 0, 0, 1,
  0.3,
  0, 1, 0, 1

... and then it doesn't work as expected... This is where things can get a little complicated. The HLSL compiler has certain rules for how variables are packed into a cbuffer. When using float4, it's nice and easy. Using just float, or just float2 is also nice. When you start mixing things, it gets much worse. I'm not going to go into detail here, I'll just say that in this case, there are 12 bytes of padding inserted after the Blend variable, because Color1 can't straddle a 16-byte boundary and so gets bumped to the next one. You can check the link for some more detail, although it's maybe not as complete as it should be.

Let's assume we've got the packing all worked out. We can explicitly pad stuff like so:

constants:
  1, 0, 0, 1,
  0.3, -1, -1, -1,
  0, 1, 0, 1

Or we can use shader reflection to figure out programmatically where every value needs to go. This is what I'm doing, and a future article in this series will cover all the annoying fiddly bits of that.
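
For comparison, the explicit padding option can be mirrored on the C++ side too. This is just a sketch: MaterialConstants and Pad are names I picked, and the expected offsets are the ones implied by the 12 bytes of padding after Blend.

```cpp
#include <cstddef>

// Hypothetical C++ mirror of the Color0 / Blend / Color1 cbuffer, with the
// padding written out by hand.
struct MaterialConstants
{
    float Color0[4];  // bytes  0..15
    float Blend;      // bytes 16..19
    float Pad[3];     // bytes 20..31, never read by the shader
    float Color1[4];  // bytes 32..47
};

// If any of these fail, the struct no longer matches the cbuffer layout.
static_assert(offsetof(MaterialConstants, Blend) == 16, "Blend offset");
static_assert(offsetof(MaterialConstants, Color1) == 32, "Color1 offset");
static_assert(sizeof(MaterialConstants) == 48, "total size");
```

The catch is that this struct has to be kept in sync with the shader by hand, which is exactly the kind of code change I'm trying to avoid.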

Another potential problem here is that we're assuming the material parameters are packed into a single cbuffer. But what if we have some effect we want to apply on top of a regular material:

cbuffer Material : register(b0)
{
  float4 BaseColor;
};
cbuffer Effect : register(b1)
{
  float Amount;
  float2 Displacement;
};

These have been split up because Material holds some basic properties that are likely shared between many objects (maybe instances of the same object, maybe entirely different, doesn't matter). Sure, we could merge the two, and just not share buffers when the effect is active. But if the base material is much bigger than a single color, and if the effect parameters are changing per frame, maybe it would be a good idea to have a small buffer to update.

An easy solution here is to do the same for constant buffers as we did for textures: Just have a list of them. Then the data might look like:

cb0:
  1, 1, 1, 1
cb1:
  0.75,
  -3, 2.7

This works fine for static data, but if it's static we're probably better off with everything in one buffer. For this effect, we want to update parameters at runtime, which requires runtime knowledge of where one value ends and the next begins. This will be another topic for the future.
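
As a very rough preview, a runtime update boils down to writing some floats at a known byte offset in the right buffer. Here's a C++ sketch under my own assumptions: ConstantBufferData, WriteFloats and MakeEffectBuffer are hypothetical names, the offsets are hard-coded (Amount at 0, Displacement at 4, since a float2 fits right after a float within the same 16 bytes), and the storage is rounded up to 16 bytes because D3D11 wants constant buffer sizes in multiples of 16.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// CPU-side copy of one constant buffer and the slot it binds to.
struct ConstantBufferData
{
    std::uint32_t             slot;
    std::vector<std::uint8_t> bytes;
};

// Writes `count` floats at byte `offset`. Returns false (and writes nothing)
// if the range would fall outside the buffer.
bool WriteFloats(ConstantBufferData& cb, std::size_t offset,
                 const float* values, std::size_t count)
{
    const std::size_t size = count * sizeof(float);
    if (offset > cb.bytes.size() || size > cb.bytes.size() - offset)
        return false;
    std::memcpy(cb.bytes.data() + offset, values, size);
    return true;
}

// Builds the cb1 data from the listing above.
ConstantBufferData MakeEffectBuffer()
{
    ConstantBufferData cb{1, std::vector<std::uint8_t>(16, 0)};
    const float amount = 0.75f;
    const float displacement[2] = {-3.0f, 2.7f};
    WriteFloats(cb, 0, &amount, 1);       // Amount
    WriteFloats(cb, 4, displacement, 2);  // Displacement
    return cb;
}
```

The hard-coded 0 and 4 are the part that needs to come from somewhere smarter.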

That's All For Now

So far, we have a material system that allows a pixel shader to be written with any parameters packed into any constant buffers, and any texture bindings we want. The parameter values for the constant buffers, and names for the textures, can be specified in a separate data file. There are a lot of details that I've glossed over, which I hope to explore deeper in the future.

Saturday, January 30, 2016

Find Junk Released

That's right, Find Junk has been released to the wild!

This initial release marks my first Windows Phone game. There are over 250 objects to find across almost 70 different images. More will be added over time in future updates and add-on packs.

Check it out on the Windows Phone store!