Sunday, April 4, 2010

Simple Compute Shader Example

The other big side of DirectX 11 is the Compute Shader. I thought I would write up a very simple example along the same lines as my tessellation example.

First let me say that the Compute Shader is awesome! It opens up so many possibilities. My mind is just reeling with new ideas to try out. I must also mention that SlimDX really does a great job of minimizing the code necessary to use the Compute Shader.

This example shows how to create a Compute Shader and then use it to launch threads that simply output the thread ID to a texture.

Device device = new Device(DriverType.Hardware, DeviceCreationFlags.Debug, FeatureLevel.Level_11_0);
 
ComputeShader compute = Helper.LoadComputeShader(device, "SimpleCompute.hlsl", "main");
 
Texture2D uavTexture;
UnorderedAccessView computeResult = Helper.CreateUnorderedAccessView(device, 1024, 1024, Format.R8G8B8A8_UNorm, out uavTexture);
 
device.ImmediateContext.ComputeShader.Set(compute);
device.ImmediateContext.ComputeShader.SetUnorderedAccessView(computeResult, 0);
device.ImmediateContext.Dispatch(32, 32, 1);
 
Texture2D.ToFile(device.ImmediateContext, uavTexture, ImageFileFormat.Png, "uav.png");


Believe it or not, that is the entirety of my CPU code.

Here is what is going on in the code:
1) Create a feature level 11 Device, in order to use Compute Shader 5.0
2) Load/Compile the HLSL code into a ComputeShader object.
3) Create a 1024x1024 UnorderedAccessView (UAV) object which will be used to store the output.
4) Set the ComputeShader and UAV on the device.
5) Run the Compute Shader by calling Dispatch (32x32x1 thread groups are dispatched).
6) Save the output texture out to disk.

My HLSL code is even simpler:

RWTexture2D<float4> Output;

[numthreads(32, 32, 1)]
void main( uint3 threadID : SV_DispatchThreadID )
{
    Output[threadID.xy] = float4(threadID.xy / 1024.0f, 0, 1);
}


As you can see, an RWTexture2D object is used to store the output (this is the UAV). The shader is set up to run 32x32x1 threads per thread group. Since the CPU launches 32x32x1 thread groups, a total of 1024x1024x1 separate threads are run. This equates to 1 thread per pixel in the output UAV. So, in the UAV, the color is just set based upon the thread ID.

This code results in the following output image:


Quite simple, eh? But not that interesting. We could easily do something like that with a pixel shader (although we would have to rasterize a full-screen quad to do it).
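Before moving on, the thread arithmetic from the first example (32x32 groups of 32x32 threads giving exactly one thread per pixel) can be checked on the CPU. The following is only a plain C++ sketch, not GPU code: UInt3, dispatchThreadId and countThreadsPerPixel are hypothetical names made up for this illustration, and the formula is the documented relationship SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side stand-in for HLSL's uint3; not part of any D3D API.
struct UInt3 { std::uint32_t x, y, z; };

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID, per component.
UInt3 dispatchThreadId(UInt3 groupId, UInt3 numThreads, UInt3 groupThreadId) {
    return { groupId.x * numThreads.x + groupThreadId.x,
             groupId.y * numThreads.y + groupThreadId.y,
             groupId.z * numThreads.z + groupThreadId.z };
}

// Emulates Dispatch(groups, groups, 1) with [numthreads(threads, threads, 1)]
// and counts how many threads land on each pixel of a square texture.
std::vector<std::uint32_t> countThreadsPerPixel(std::uint32_t groups, std::uint32_t threads) {
    assert(groups > 0 && threads > 0);
    const std::uint32_t size = groups * threads;  // texture width and height
    std::vector<std::uint32_t> hits(static_cast<std::size_t>(size) * size, 0);
    for (std::uint32_t gy = 0; gy < groups; ++gy) {
        for (std::uint32_t gx = 0; gx < groups; ++gx) {
            for (std::uint32_t ty = 0; ty < threads; ++ty) {
                for (std::uint32_t tx = 0; tx < threads; ++tx) {
                    const UInt3 id = dispatchThreadId({gx, gy, 0},
                                                      {threads, threads, 1},
                                                      {tx, ty, 0});
                    ++hits[static_cast<std::size_t>(id.y) * size + id.x];
                }
            }
        }
    }
    return hits;
}
```

Calling countThreadsPerPixel(32, 32) fills a 1024x1024 grid in which every entry is 1, which matches the "1 thread per pixel" claim. The last thread of the last group gets ID (1023, 1023, 0).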

We should try to do something that shows the power of the compute shader; something you couldn't do in a pixel shader before. How about drawing some primitives like lines and circles?

For drawing lines, let's use the Digital Differential Analyzer algorithm. It translates to HLSL very easily.


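// Writes a single blue pixel into the output UAV.
void Plot(int x, int y)
{
    Output[uint2(x, y)] = float4(0, 0, 1, 1);
}

// Simple DDA: steps one pixel at a time along x.
// Assumes start.x < end.x and a slope between -1 and 1;
// vertical lines would divide by zero here.
void DrawLine(float2 start, float2 end)
{
    float dydx = (end.y - start.y) / (end.x - start.x);
    float y = start.y;
    for (int x = (int)start.x; x <= end.x; x++)
    {
        Plot(x, (int)round(y));
        y = y + dydx;
    }
}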
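
The DrawLine above steps along x only, so it suits shallow, left-to-right lines such as the two diagonals used in this post. As a rough CPU-side illustration (plain C++, not the shader from this post), here is one common way to generalize DDA: step along whichever axis is longer, so steep, vertical and right-to-left lines also work. The names ddaLine and Pixel are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Pixel = std::pair<int, int>;  // (x, y); hypothetical helper type

// DDA that steps along the longer axis, one pixel per step.
// Returns the pixels in drawing order instead of writing to a texture.
std::vector<Pixel> ddaLine(float x0, float y0, float x1, float y1) {
    const float dx = x1 - x0;
    const float dy = y1 - y0;
    assert(std::isfinite(dx) && std::isfinite(dy));

    // Number of steps is the larger of |dx| and |dy|.
    const int steps = static_cast<int>(std::max(std::fabs(dx), std::fabs(dy)));

    std::vector<Pixel> pixels;
    pixels.reserve(static_cast<std::size_t>(steps) + 1);

    if (steps == 0) {
        // Start and end fall on the same pixel; avoid dividing by zero.
        pixels.emplace_back(static_cast<int>(std::lround(x0)),
                            static_cast<int>(std::lround(y0)));
        return pixels;
    }

    const float xInc = dx / static_cast<float>(steps);
    const float yInc = dy / static_cast<float>(steps);
    float x = x0;
    float y = y0;
    for (int i = 0; i <= steps; ++i) {
        pixels.emplace_back(static_cast<int>(std::lround(x)),
                            static_cast<int>(std::lround(y)));
        x += xInc;
        y += yInc;
    }
    return pixels;
}
```

In a shader, each returned pixel would correspond to one Plot call. For a 1024x1024 target, both diagonals come out as 1024 pixels each when the endpoints are kept inside 0..1023.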


For drawing circles let's use the Midpoint Circle algorithm. For brevity I won't list it here now.
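
For readers who want to see the general shape of the algorithm, here is a minimal CPU-side sketch of the textbook Midpoint Circle algorithm in plain C++. It is not the DrawCircle from this post's shader; midpointCircle and Pixel are names invented for the illustration, and it returns pixel coordinates rather than writing to a UAV.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Pixel = std::pair<int, int>;  // (x, y); hypothetical helper type

// Textbook midpoint circle: walk one octant with an integer decision
// variable and mirror each point into the other seven octants.
// The result may contain duplicate pixels where octants meet.
std::vector<Pixel> midpointCircle(int cx, int cy, int radius) {
    assert(radius >= 0);
    std::vector<Pixel> pixels;
    int x = radius;
    int y = 0;
    int decision = 1 - radius;  // sign tells whether the midpoint is inside the circle

    while (x >= y) {
        // Eight-way symmetry around the center.
        const int dxs[8] = {  x,  y, -y, -x, -x, -y,  y,  x };
        const int dys[8] = {  y,  x,  x,  y, -y, -x, -x, -y };
        for (int i = 0; i < 8; ++i) {
            pixels.emplace_back(cx + dxs[i], cy + dys[i]);
        }

        ++y;
        if (decision <= 0) {
            decision += 2 * y + 1;        // midpoint inside: keep x
        } else {
            --x;                          // midpoint outside: step x inward
            decision += 2 * (y - x) + 1;
        }
    }
    return pixels;
}
```

In a shader version, each pixel would be passed to Plot. Only integer additions and comparisons are needed inside the loop, which is the main appeal of the algorithm.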

Then, in my Compute Shader main function, I simply add this code:

if (threadID.x == 1023 && threadID.y == 1023)
{
    DrawLine(float2(0, 0), float2(1023, 1023));
    DrawLine(float2(0, 1023), float2(1023, 0));

    DrawCircle(512, 512, 250);
    DrawCircle(0, 512, 250);
}


The if check ensures that only a single thread draws the lines and circles, rather than every thread drawing them. This code results in the following image:


I must admit it seems quite odd writing a shader that draws primitives. It's like some strange recursive loop. But it definitely helps to illustrate the features of the Compute Shader and how powerful it is.

You may download the source code to this example here:
ComputeShader11.zip

My next goal is to set up a standard DX11 Swap Chain and use the Compute Shader to write directly to the backbuffer. Well, that's all for now.

FYI: This is my 50th blog post! I never thought I would continue on this long. I think I should crack open a beer to celebrate.

11 comments:

Antoine Leblond said...

Nice entry.

Can you post the sample code? I always get a non-descript error when I do ComputeShader cs = new ComputeShader(device, shaderBytecode); and I would like to compare with your code.

Thanks!

Patrick said...

There really isn't much more to the code, which is why I didn't post the full project before. But since you asked for it, I went ahead and posted it. You can download it here:
ComputeShader11.zip

By the way, are you sure you are compiling against "cs_5_0" and NOT "fx_5_0"? That seems to be a common problem.

Let me know if you need any more help.

Antoine Leblond said...

Thanks for the quick reply and the sample code.

The non-descript error was because I was compiling against "fx_5_0". Thanks! Then I encountered a "'main': entrypoint not found" error... That was because my hlsl file was not in "ansi" encoding (http://forums.silverlight.net/forums/p/81994/192533.aspx)...

It finally works! ... for now. :)

Thanks again

Anonymous said...

This was really helpful for me! Thanks for writing this great tutorial!

Danix said...

I tried to run your code, but unfortunately many things have changed in the latest SlimDX release and it doesn't even compile. What should I do to get it working? (e.g. ShaderBytecode seems to have been moved into a DX9 namespace??)

Danix said...

Never mind, I got it working with minimal effort...

Unknown said...

There is one topic which is hardly covered anywhere around the net, and this is how to work with multiple kernels to have 2 or more stages where you work on a buffer (e.g. when you calculate the positions of vertices in the first kernel and want to compute the normals from that result in the second kernel).

I heard you have to use double buffers or a ping-pong technique (so, not write to the same buffer from multiple kernels in one frame, to force the GPU to finish the first task first). However, I have not yet managed to successfully implement a compute shader with two kernels using a separate buffer for the interim results of the first kernel.

My idea was to have another buffer in the compute kernel, “patchGenerationDataBuffer”, where the results of the first kernel are written to. In the second kernel, “patchGenerationDataBuffer” is read back and the result is written to “patchGeneratedDataBuffer”. However, this does not appear to work: “patchGeneratedDataBuffer” does not contain the expected results.

StructuredBuffer generationConstantsBuffer;
RWStructuredBuffer patchGenerationDataBuffer;
RWStructuredBuffer patchGeneratedDataBuffer;

I also tried declaring “patchGenerationDataBuffer” as a StructuredBuffer to make it read-only and swapping the buffers outside, in C# in Unity, but this didn't work either.

Any idea how this can be done, so that the first kernel writes its results to a buffer that I can safely work on in the second kernel?

Anonymous said...

Could you repost the sample?

Demofox said...

The zip file link is broken, can you fix it?

Jeroen said...

Without the zipfile, this example is useless. Show us the contents of "Helper.CreateUnorderedAccessView(device, 1024, 1024, Format.R8G8B8A8_UNorm, out uavTexture);"

Anonymous said...

Can you repost the zipfile?