DirectCompute tutorial for Unity: Kernels and thread groups

scrawk
Compute shader, DirectCompute, dx11

The last post of this tutorial series was just a bit of an introduction, but from here on it’s all about the code. Today I will be going over the core concepts for writing compute shaders in Unity. At the heart of a compute shader is the kernel. This is the entry point into the shader and acts like the Main function in other programming languages. I will also cover the tiling of threads by the GPU. These tiles are also known as blocks or thread groups. DirectCompute officially refers to these tiles as thread groups.

To create a compute shader in Unity, go to the project panel, click Create->Compute Shader, and then double click the shader to open it in Monodevelop for editing. Paste the following code into the newly created compute shader.

#pragma kernel CSMain1
[numthreads(4,1,1)]
void CSMain1()
{
}

This is the bare minimum of content for a compute shader. It will of course do nothing, but it serves as a good starting point. A compute shader has to be run from a script in Unity, so we will need one of those as well. Go to the project panel and click Create->C# Script. Name it KernelExample and paste in the following code.

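using UnityEngine;
using System.Collections;
public class KernelExample : MonoBehaviour
{
    // Assign the compute shader asset to this field in the inspector.
    public ComputeShader shader;
    void Start ()
    {
        // Run kernel 0 with 1 x 1 x 1 thread groups.
        shader.Dispatch(0, 1, 1, 1);
    }
}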

Now drag the script onto any game object and then assign the compute shader to the script’s shader field. The shader will now run in the Start function when the scene is run. Before you run the scene, however, you need to enable dx11 in Unity. Go to Edit->Project Settings->Player and then tick the “Use Direct3D 11” box. You can now run the scene. The shader will do nothing but there should also be no errors.

In the script you will see the “Dispatch” function called. This is responsible for running the shader. Notice the first argument is a 0. This is the id of the kernel that you want to run. In the shader you will see the line “#pragma kernel CSMain1”. This defines which function in the shader is the kernel, as you may have many functions (and even many kernels) in one shader. There must be a function with the name CSMain1 in the shader or the shader will not compile.

Now notice the “[numthreads(4,1,1)]” line. This tells the GPU how many threads of the kernel to run per group. The 3 numbers relate to each dimension. A thread group can have up to 3 dimensions, and in this example we are just running a 1 dimensional group with a width of 4 threads. That means we are running a total of 4 threads and each thread will run a copy of the kernel. This is why GPUs are so fast. They can run thousands of threads at a time.

Now let’s get the kernel to actually do something. Change the shader to this…

#pragma kernel CSMain1
RWStructuredBuffer<int> buffer1;
[numthreads(4,1,1)]
void CSMain1(int3 threadID : SV_GroupThreadID)
{
    buffer1[threadID.x] = threadID.x;
}

and the script’s Start function to this…

void Start ()
{
    ComputeBuffer buffer = new ComputeBuffer(4, sizeof(int));
    shader.SetBuffer(0, "buffer1", buffer);
    shader.Dispatch(0, 1, 1, 1);
    int[] data = new int[4];
    buffer.GetData(data);
    for(int i = 0; i < 4; i++)
        Debug.Log(data[i]);
    buffer.Release();
}

Now run the scene and you should see the numbers 0, 1, 2 and 3 printed out. Don’t worry too much about the buffer for now. I will cover buffers in detail in the future, but just know that a buffer is a place to store data and it needs to have its Release function called when you are finished with it. Notice this argument added to the CSMain1 function: “int3 threadID : SV_GroupThreadID”. This is a request to the GPU to pass the thread id into the kernel when it is run. We are then writing the thread id into the buffer, and since we have told the GPU we are running 4 threads, the id ranges from 0 to 3 as we see from the printout.

Now those 4 threads make up what’s called a thread group. In this case we are running 1 group of 4 threads, but you can run multiple groups of threads. Let’s run 2 groups instead of 1. Change the shader’s kernel to this…

void CSMain1(int3 threadID : SV_GroupThreadID, int3 groupID : SV_GroupID)
{
    buffer1[threadID.x + groupID.x*4] = threadID.x;
}

and the script’s Start function to this…

void Start ()
{
    ComputeBuffer buffer = new ComputeBuffer(4 * 2, sizeof(int));
    shader.SetBuffer(0, "buffer1", buffer);
    shader.Dispatch(0, 2, 1, 1);
    int[] data = new int[4 * 2];
    buffer.GetData(data);
    for(int i = 0; i < 4 * 2; i++)
        Debug.Log(data[i]);
    buffer.Release();
}

Now run the scene and you should have 0-3 printed out twice. Notice the change to the Dispatch call. The last three arguments (the 2,1,1) are the number of groups we want to run. Just like the number of threads, groups can go up to 3 dimensions, and in this case we are running 1 dimension of 2 groups. We have also had to change the kernel by adding the argument “int3 groupID : SV_GroupID”. This is a request to the GPU to pass in the group id when the kernel is run. We need this because we are now writing out 8 values, 2 groups of 4 threads. We now need the thread’s position in the buffer, and the formula for this is the thread id plus the group id times the number of threads per group ( threadID.x + groupID.x*4 ). This is a bit awkward to write. Surely the GPU knows the thread’s position? Yes it does. Change the shader’s kernel to this and rerun the scene.

void CSMain1(int3 threadID : SV_GroupThreadID, int3 dispatchID : SV_DispatchThreadID)
{
    buffer1[dispatchID.x] = threadID.x;
}

The results should be the same, two sets of 0-3 printed. Notice that the group id argument has been replaced with “int3 dispatchID : SV_DispatchThreadID”. This is the same number our formula gave us, except now the GPU is calculating it for us. This is the thread’s position across all of the thread groups in the dispatch.

So far these have all been in 1 dimension. Let’s step things up a bit and move to 2 dimensions, and instead of rewriting the kernel let’s just add another one to the shader. It’s not uncommon to have a kernel for each dimension in a shader performing the same algorithm. First add this code to the shader below the previous code so there are two kernels in the shader.
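To make the relationship between the three ids concrete, here is a minimal CPU-side sketch in plain C#. It is not how the GPU actually schedules work (the threads run in parallel, not in nested loops), and the class and method names are hypothetical. It only reproduces the index arithmetic for a Dispatch of 2 groups with numthreads(4,1,1).

```csharp
using System;

public static class DispatchIdSketch
{
    // Hypothetical CPU stand-in for the kernel: visits every (group, thread)
    // pair that Dispatch(groupsX, 1, 1) with [numthreads(threadsX,1,1)] launches.
    public static int[] Run(int groupsX, int threadsX)
    {
        int[] buffer = new int[groupsX * threadsX];
        for (int groupID = 0; groupID < groupsX; groupID++)
        {
            for (int threadID = 0; threadID < threadsX; threadID++)
            {
                // The same value the GPU supplies as SV_DispatchThreadID.x.
                int dispatchID = threadID + groupID * threadsX;
                buffer[dispatchID] = threadID;
            }
        }
        return buffer;
    }

    public static void Main()
    {
        // Prints: 0 1 2 3 0 1 2 3
        Console.WriteLine(string.Join(" ", Run(2, 4)));
    }
}
```

The output matches what the scene logs: each group writes its own 0-3 into its own quarter or half of the buffer.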

#pragma kernel CSMain2
RWStructuredBuffer<int> buffer2;
[numthreads(4,4,1)]
void CSMain2( int3 dispatchID : SV_DispatchThreadID)
{
    int id = dispatchID.x + dispatchID.y * 8;
    buffer2[id] = id;
}

and the script to this…

void Start ()
{
    ComputeBuffer buffer = new ComputeBuffer (4 * 4 * 2 * 2, sizeof(int));
    int kernel = shader.FindKernel ("CSMain2");
    shader.SetBuffer (kernel, "buffer2", buffer);
    shader.Dispatch (kernel, 2, 2, 1);
    int[] data = new int[4 * 4 * 2 * 2];
    buffer.GetData (data);
    for(int i = 0; i < 8; i++)
    {
        string line = "";
        for(int j = 0; j < 8; j++)
        {
            line += " "  + data[j+i*8];
        }
        Debug.Log (line);
    }
    buffer.Release ();
}

Run the scene and you will see a row printed from 0 to 7, the next row from 8 to 15, and so on up to 63. Why from 0 to 63? Well, we now have 4 2D groups of threads and each group is 4 by 4, so it has 16 threads. That gives us 64 threads in total. Notice what value we are outputting from this line: “int id = dispatchID.x + dispatchID.y * 8”. The dispatch id is the thread’s position across the groups of threads for each dimension. We now have 2 dimensions, so we need the thread’s global position in the buffer, and this is just the dispatch x id plus the dispatch y id times the total number of threads in the first dimension (4 * 2). This is a concept you will have to be familiar with when working with compute shaders. The reason is that buffers are always 1 dimensional, and when working in higher dimensions you need to calculate which index in the buffer the result should be written to.

The same theory applies when working with 3 dimensions, but as it gets fiddly I will only demonstrate up to 2 dimensions. You just need to know that in 3 dimensions the buffer position is calculated as “int id = dispatchID.x + dispatchID.y * groupSizeX + dispatchID.z * groupSizeX * groupSizeY”, where each group size is the total number of threads in that dimension (the number of groups times the number of threads per group).

You should also have an understanding of how the semantics work. Take for example this kernel argument…
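The 2D case can be sketched on the CPU in the same way. The following plain C# program is only an illustration with hypothetical names; it loops over the 2 by 2 groups of 4 by 4 threads and applies the same flattening formula as CSMain2, with the row width computed rather than hard coded as 8.

```csharp
using System;

public static class Dispatch2DSketch
{
    // Hypothetical CPU stand-in for CSMain2 run with
    // Dispatch(groupsX, groupsY, 1) and [numthreads(threadsX,threadsY,1)].
    public static int[] Run(int groupsX, int groupsY, int threadsX, int threadsY)
    {
        int width = groupsX * threadsX;   // total threads in x (8 in the article)
        int height = groupsY * threadsY;  // total threads in y (8 in the article)
        int[] buffer = new int[width * height];
        for (int gy = 0; gy < groupsY; gy++)
        for (int gx = 0; gx < groupsX; gx++)
        for (int ty = 0; ty < threadsY; ty++)
        for (int tx = 0; tx < threadsX; tx++)
        {
            // What the GPU would pass in as SV_DispatchThreadID.xy.
            int dispatchX = gx * threadsX + tx;
            int dispatchY = gy * threadsY + ty;
            // Flatten the 2D position into the 1D buffer.
            int id = dispatchX + dispatchY * width;
            buffer[id] = id;
        }
        return buffer;
    }

    public static void Main()
    {
        int[] data = Run(2, 2, 4, 4);
        for (int i = 0; i < 8; i++)
        {
            string line = "";
            for (int j = 0; j < 8; j++)
            {
                line += " " + data[j + i * 8];
            }
            Console.WriteLine(line);
        }
    }
}
```

Every one of the 64 slots is written exactly once, which is why the printed rows run from 0 to 63 without gaps.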
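As a quick check of the 3D formula, here is a small plain C# sketch. The helper name Flatten and the sizes used in Main are assumptions made for the example; sizeX and sizeY play the role of groupSizeX and groupSizeY above.

```csharp
using System;

public static class Flatten3DSketch
{
    // Flattens a 3D dispatch id into a 1D buffer index.
    // sizeX and sizeY are the total thread counts in x and y
    // (number of groups times threads per group in that dimension).
    public static int Flatten(int x, int y, int z, int sizeX, int sizeY)
    {
        return x + y * sizeX + z * sizeX * sizeY;
    }

    public static void Main()
    {
        // Assumed layout: 2 groups of 4 threads in both x and y.
        int sizeX = 2 * 4;
        int sizeY = 2 * 4;
        // 3 + 2 * 8 + 1 * 64 = 83
        Console.WriteLine(Flatten(3, 2, 1, sizeX, sizeY));
    }
}
```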

int3 dispatchID : SV_DispatchThreadID

SV_DispatchThreadID is the semantic and tells the GPU what value it should pass in for this argument. The name of the argument does not matter. You can call it what you want. For example this argument works the same as above.

int3 id : SV_DispatchThreadID

Also the variable type can be changed. For example…

int dispatchID : SV_DispatchThreadID

See the int3 has been changed to int. This is fine if you are only working with 1 dimension. You could also just use an int2 for 2 dimensions, and you could also use an unsigned int (uint) instead of an int if you choose.

Since we now have two kernels in the shader, we also need to tell the GPU which kernel we want to run when we make the dispatch call. Each kernel is given an id in the order they appear. Our first kernel would be id 0 and the next is id 1. When the number of kernels in a shader becomes larger this can become a bit confusing and it’s easy to set the wrong id. We can solve this by asking the shader for the kernel’s id by name. This line here “int kernel = shader.FindKernel ("CSMain2");” gets the id of kernel “CSMain2”. We then use this id when setting the buffer and making the dispatch call.

About now you may be thinking that this concept of groups of threads is a bit confusing. Why can’t I just use one group of threads? Well you can, but just know that there is a reason that threads are arranged into groups by the GPU. For a start, a thread group is limited in the number of threads it can have (defined by the line “[numthreads(x,y,z)]” in the shader). This limit is currently 1024 but may change with new hardware. For example you can have a maximum of “numthreads(1024,1,1)” for 1D, “numthreads(32,32,1)” for 2D and so on. You can however have any number of groups of threads, and as you will often be processing data with millions of elements, the concept of thread groups is essential. Threads in a group can also share memory, and this can be used to make dramatic performance gains for certain algorithms, but I will cover that in a future post.

Well I think that about covers kernels and thread groups. There is just one more thing I want to cover: how to pass uniforms into your shader. This works the same as in Cg shaders but there is no uniform keyword. For the most part this is relatively simple, but there are a few “gotchas” so I will briefly go over it.
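The arithmetic behind these limits can be sketched in plain C#. The helper names below are hypothetical, and the round-up strategy for the group count is an assumption that is not covered in this post: when the element count is not a multiple of the group size, the last group has spare threads, and the kernel would need to skip those so it does not write past the end of the buffer.

```csharp
using System;

public static class GroupCountSketch
{
    // The per-group thread limit quoted in the article.
    public const int MaxThreadsPerGroup = 1024;

    // True if numthreads(x,y,z) stays within the per-group limit.
    public static bool FitsInGroup(int x, int y, int z)
    {
        return x * y * z <= MaxThreadsPerGroup;
    }

    // Number of groups needed to cover elementCount items,
    // rounding up so a partial last group is still dispatched.
    public static int GroupsNeeded(int elementCount, int threadsPerGroup)
    {
        return (elementCount + threadsPerGroup - 1) / threadsPerGroup;
    }

    public static void Main()
    {
        Console.WriteLine(FitsInGroup(32, 32, 1));     // True  (1024 threads)
        Console.WriteLine(FitsInGroup(32, 32, 2));     // False (2048 threads)
        Console.WriteLine(GroupsNeeded(1000000, 64));  // 15625
        Console.WriteLine(GroupsNeeded(1000, 64));     // 16 (24 spare threads)
    }
}
```

The value returned by GroupsNeeded is what you would pass as the group count in the Dispatch call for a 1D kernel.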

For example if you want to pass in a float you need this line in the shader…

float myFloat;

and this line in your script…

shader.SetFloat("myFloat", 1.0f);

To set a vector you need this in the shader…

float4 myVector;

and this in the script…

shader.SetVector("myVector", new Vector4(0,1,2,3));

You can only pass in a Vector4 from the script but your uniform can be a float, float2, float3 or float4. It will be filled with the appropriate values.

Now here’s where it gets tricky. You can pass in arrays of values. Note that this first example won’t work, and I will explain why. You need this line in your shader…

float myFloats[4];

and this in your script…

shader.SetFloats("myFloats", new float[]{0,1,2,3});

Now this won’t work. Whether this is by design or a bug in Unity I don’t know. You need to use vectors as uniforms for this to work. In your shader…

float4 myFloats;

and your script…

shader.SetFloats("myFloats", new float[]{0,1,2,3});

This works. You can also use a float2 or float3. Just not a single float. You can also have arrays of vectors. In your shader…

float4 myFloats[2];

and your script…

shader.SetFloats("myFloats", new float[]{0,1,2,3,4,5,6,7});

So here we have an array of two float4’s and it is set from an array of 8 floats in the script. The same principles apply when setting matrices. In your shader…

float4x4 myFloats;

and your script…

shader.SetFloats("myFloats", new float[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15});

And of course you can have arrays of matrices. In your shader…

float4x4 myFloats[2];

and your script…

shader.SetFloats("myFloats", new float[]{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31});

This same logic does not seem to apply to float2x2 or float3x3. Again, whether this is a bug or design I don’t know.
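For the float4 array case, writing the flat list of floats out by hand gets error prone as the array grows. Below is a minimal sketch of a packing helper in plain C#. The class and method names are hypothetical, and it uses float[] rows as a stand-in for Unity’s Vector4 so that it runs outside Unity. In a Unity script the result would be passed to shader.SetFloats as in the examples above.

```csharp
using System;

public static class PackFloatsSketch
{
    // Packs an array of fixed-size vectors into one flat float array,
    // vector after vector, in the order SetFloats is given them above.
    public static float[] Pack(float[][] vectors, int componentsPerVector)
    {
        float[] flat = new float[vectors.Length * componentsPerVector];
        for (int i = 0; i < vectors.Length; i++)
        {
            if (vectors[i].Length != componentsPerVector)
            {
                throw new ArgumentException("Vector " + i + " has the wrong size.");
            }
            Array.Copy(vectors[i], 0, flat, i * componentsPerVector, componentsPerVector);
        }
        return flat;
    }

    public static void Main()
    {
        float[][] vectors =
        {
            new float[] { 0, 1, 2, 3 },
            new float[] { 4, 5, 6, 7 }
        };
        // Prints: 0,1,2,3,4,5,6,7
        Console.WriteLine(string.Join(",", Pack(vectors, 4)));
    }
}
```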

I think that about covers it for today. The next part will cover how to use textures in your compute shaders. You can also download the project file for the kernel example. It’s rather basic but it’s there if you need it. I will be adding to the same project file for each tutorial I do.