GPU shaders used to be simple kernel functions that did some basic maths and wrote out a few values. Today they can be as complex and generic as CPU code, with complex maths, maps, allocators, sorting, global synchronisation, etc.
But shader languages have not evolved to give you tools to debug and validate shaders during development. Usually a shader is a black box and there is not a lot you can do. Of course, tools like RenderDoc or PIX can give you a lot of detail about what is happening, but they are post-mortem tools and do not offer any diagnostics or validation during development. Debugging after the fact is sometimes too late, and it is not always the solution. It is a similar situation to debugging heavily parallel CPU code with a lot of complex synchronisation, where debugging is a nightmare.
Some tools have appeared recently that help a lot with diagnosing shaders during development, like GPU Reshape.
This post is about the features that Cute offers for developing shaders.
How can we make our shaders easy to diagnose and validate during development?
Reloading shaders:
One of the first features that has to be implemented in an engine is reloading shaders. Reloading lets you modify a shader and see the changes with fast iteration, like live editing. A lot of your shader code can be written while the application is running.
You should be able to roll back the reload if the new shader doesn't compile, and give the user the option to fix a shader that fails to compile the first time it is loaded. Only with this feature does developing shaders become fun rather than a frustration each time you have a problem.
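The rollback behaviour can be sketched on the CPU side like this. It is a minimal sketch, not Cute's actual code: `ShaderCache`, `ShaderBlob` and `CompileFn` are hypothetical names, and the compiler is passed in as a callback so the example does not depend on any real shader compiler API.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a compiled shader blob.
using ShaderBlob = std::vector<unsigned char>;

// Hypothetical compiler callback: returns the blob, or std::nullopt and fills `error`.
using CompileFn =
    std::function<std::optional<ShaderBlob>(const std::string& source, std::string& error)>;

class ShaderCache {
public:
    explicit ShaderCache(CompileFn compile) : compile_(std::move(compile)) {
        assert(compile_ && "a compile callback is required");
    }

    // Tries to compile `source`. On success the new blob replaces the old one.
    // On failure the previous blob (if any) stays active and `error` is filled,
    // so the user can fix the shader and reload again.
    bool Reload(const std::string& name, const std::string& source, std::string& error) {
        std::optional<ShaderBlob> blob = compile_(source, error);
        if (!blob) return false;  // keep whatever was loaded before
        shaders_[name] = std::move(*blob);
        return true;
    }

    // Returns the active blob, or nullptr if the shader has never compiled.
    const ShaderBlob* Find(const std::string& name) const {
        auto it = shaders_.find(name);
        return it == shaders_.end() ? nullptr : &it->second;
    }

private:
    CompileFn compile_;
    std::unordered_map<std::string, ShaderBlob> shaders_;
};
```

The key design choice is that the swap only happens after a successful compile, so a broken edit never replaces a working shader.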
Asserts:
uint indirect_box; //Encode the instance list index (8bits), instance index (12bits), box index (12 bits);
assert(instance_list_index < (1 << 8)); //check range
assert(instance_index < (1 << 12)); //check range
assert(j < (1 << 12)); //check range
indirect_box = ((instance_list_index & 0xFF) << 24) | ((instance_index & 0xFFF) << 12) | (j & 0xFFF);
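To see why the range checks matter, here is the same packing written as plain C++ with a matching decoder. This is only an illustrative sketch: `IndirectBox`, `PackIndirectBox` and `UnpackIndirectBox` are names made up for this example. The masks on their own silently drop the high bits of an out-of-range value, so without the asserts a bad index would just turn into a different, valid-looking index.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded form of the packed value.
struct IndirectBox {
    std::uint32_t instance_list_index;  // 8 bits
    std::uint32_t instance_index;       // 12 bits
    std::uint32_t box_index;            // 12 bits
};

// Same layout as the shader snippet: [31..24] list, [23..12] instance, [11..0] box.
inline std::uint32_t PackIndirectBox(const IndirectBox& b) {
    assert(b.instance_list_index < (1u << 8));  // check range
    assert(b.instance_index < (1u << 12));      // check range
    assert(b.box_index < (1u << 12));           // check range
    return ((b.instance_list_index & 0xFFu) << 24) |
           ((b.instance_index & 0xFFFu) << 12) |
           (b.box_index & 0xFFFu);
}

// Reverses PackIndirectBox.
inline IndirectBox UnpackIndirectBox(std::uint32_t packed) {
    return {(packed >> 24) & 0xFFu, (packed >> 12) & 0xFFFu, packed & 0xFFFu};
}
```

For any in-range input, unpacking the packed value gives back the original three indices.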
One of the classic bugs in shaders is a NaN being written to a UAV or render target. Now we can add some asserts to validate that we always output a correct value.
assert(all(!IsNaN(result)));
return float4(result, 1.f);
Control variables:
CONTROL_VARIABLE(bool, PostProcess, BloomEnable, true)
if (!BloomEnable) bloom = float3(0.f, 0.f, 0.f);
Reloading shaders is really powerful, but sometimes you just want to set a value from outside the shader and use it inside. For example, to enable or disable a code path, or to pass in a value and multiply a variable by it to test an idea. Control variables allow exactly that. To use a control variable, you just declare it, use it and reload the shader; the control variable will appear in the debug menu (Cute already has control variables on the CPU) and you can change its value there.
Counters:
COUNTER(BoxCulling, TotalCubes)
COUNTER(BoxCulling, FrustumVisibleInstances)
COUNTER(BoxCulling, FirstPassVisibleCubes)
COUNTER_INC(TotalCubes);
Implementation details:
If the development shaders option is enabled, each shader that needs to be compiled goes through three extra steps: we create new bindings for our development resources (a new root signature parameter), parse the shader to collect the features it uses, and inject code that replaces each feature with the code that implements it.
For control variables, we parse the shader looking for control variable definitions, extract the name, group name, type and default value, and register each one on the CPU (we already have control variables on the CPU with an ImGui interface). Then we create a resource that is updated each frame with the user-defined values from the CPU and sent to the GPU; each control variable on the GPU just reads its value from that resource.
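The parsing step could look something like the sketch below. It assumes the `CONTROL_VARIABLE(type, group, name, default)` argument order seen in the example above and uses a simple regular expression. `ControlVariableDecl` and `ParseControlVariables` are hypothetical names, and this is not necessarily how Cute does it.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record for one declaration found in the shader source.
struct ControlVariableDecl {
    std::string type;
    std::string group;
    std::string name;
    std::string default_value;
};

// Collects every CONTROL_VARIABLE(type, group, name, default) in `source`.
// Limitations of this sketch: it does not skip comments, and the default
// value must be a single token without spaces or parentheses.
inline std::vector<ControlVariableDecl> ParseControlVariables(const std::string& source) {
    static const std::regex pattern(
        R"re(CONTROL_VARIABLE\s*\(\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*,\s*([^)\s]+)\s*\))re");

    std::vector<ControlVariableDecl> result;
    for (std::sregex_iterator it(source.begin(), source.end(), pattern), end; it != end; ++it) {
        const std::smatch& m = *it;
        result.push_back({m[1].str(), m[2].str(), m[3].str(), m[4].str()});
    }
    return result;
}
```

Each parsed declaration can then be registered with the CPU-side control variable system under its group and name, with the default value converted according to its type.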
For counters, each one is registered on the CPU and assigned an index/slot. Then a UAV is created, cleared to zero each frame and sent to the GPU. Each time a counter is triggered on the GPU, it just needs to do an InterlockedAdd on the correct slot. On the CPU we read that resource back (using readback buffers) and update the CPU counters with the values from the GPU.
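The counter flow can be mimicked entirely on the CPU to make the slot bookkeeping concrete. In this sketch `CounterRegistry` and `CounterBuffer` are hypothetical names; the buffer is a plain array of atomics standing in for the UAV, and `fetch_add` stands in for the shader's atomic add.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assigns one slot per (group, name) counter, in registration order.
class CounterRegistry {
public:
    std::uint32_t Register(const std::string& group, const std::string& name) {
        const std::string key = group + "/" + name;
        auto it = slots_.find(key);
        if (it != slots_.end()) return it->second;  // already registered
        const std::uint32_t slot = static_cast<std::uint32_t>(names_.size());
        slots_.emplace(key, slot);
        names_.push_back(key);
        return slot;
    }
    std::size_t Count() const { return names_.size(); }
    const std::string& Name(std::uint32_t slot) const { return names_.at(slot); }

private:
    std::unordered_map<std::string, std::uint32_t> slots_;
    std::vector<std::string> names_;
};

// CPU stand-in for the per-frame counter UAV.
class CounterBuffer {
public:
    explicit CounterBuffer(std::size_t slot_count) : slots_(slot_count) { Clear(); }

    // Start of frame: reset every slot to zero.
    void Clear() { for (auto& s : slots_) s.store(0); }

    // What COUNTER_INC would boil down to in the shader: an atomic add on the slot.
    void Increment(std::uint32_t slot, std::uint32_t amount = 1) {
        assert(slot < slots_.size());
        slots_[slot].fetch_add(amount, std::memory_order_relaxed);
    }

    // End of frame: copy the values out, like reading a readback buffer.
    std::vector<std::uint32_t> ReadBack() const {
        std::vector<std::uint32_t> values;
        values.reserve(slots_.size());
        for (const auto& s : slots_) values.push_back(s.load());
        return values;
    }

private:
    std::vector<std::atomic<std::uint32_t>> slots_;
};
```

On real hardware the readback usually arrives a few frames late, so the values shown on the CPU lag behind the frame that produced them.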
Asserts are more complicated, as you don't know in advance how many will trigger. So Cute implements a generic command buffer system that allows the shaders to send messages/commands to the CPU each frame. It works with a UAV resource bound as an RWStructuredBuffer<uint>, where the first uint slot is the current command buffer position and the second is the max size of the buffer. Each time the GPU needs to send a command, it increments the buffer position by the size needed for the message and writes the message into the buffer. The first uint of the message is the value that identifies the command and the rest is the data. For the assert message, the first element is zero (the assert command), the second element is the file index (each file is identified by an index) and the third is the line. The resource is then read back on the CPU and each message is decoded. In the case of asserts, all the asserts with the same file and line are merged, so when one triggers it doesn't spam the log too much.
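Here is a small CPU-side model of that command buffer, mostly to pin down the layout. It follows the description above (position in slot 0, max size in slot 1, assert command id 0 followed by file index and line), but the details are assumptions: positions are counted in uints, the first message starts right after the two header slots, and a message that doesn't fit is dropped. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

constexpr std::uint32_t kHeaderSize = 2;         // [0] write position, [1] max size (in uints, assumed)
constexpr std::uint32_t kCommandAssert = 0;      // command id of the assert message
constexpr std::uint32_t kAssertMessageSize = 3;  // command id, file index, line

// Creates an empty buffer with the write position just past the header.
inline std::vector<std::uint32_t> MakeCommandBuffer(std::uint32_t max_size) {
    assert(max_size >= kHeaderSize);
    std::vector<std::uint32_t> buffer(max_size, 0);
    buffer[0] = kHeaderSize;
    buffer[1] = max_size;
    return buffer;
}

// Mimics the shader side: reserve space, then write the message.
// In a shader the reservation would be an atomic add on buffer[0].
inline void PushAssert(std::vector<std::uint32_t>& buffer,
                       std::uint32_t file_index, std::uint32_t line) {
    const std::uint32_t offset = buffer[0];
    buffer[0] += kAssertMessageSize;
    if (offset + kAssertMessageSize > buffer[1]) return;  // no room: drop the message
    buffer[offset + 0] = kCommandAssert;
    buffer[offset + 1] = file_index;
    buffer[offset + 2] = line;
}

using AssertSite = std::pair<std::uint32_t, std::uint32_t>;  // (file index, line)

// CPU side: walk the messages and merge asserts that share file and line.
inline std::map<AssertSite, std::uint32_t> DecodeAsserts(const std::vector<std::uint32_t>& buffer) {
    std::map<AssertSite, std::uint32_t> hits;
    const std::uint32_t end = std::min(buffer[0], buffer[1]);
    std::uint32_t cursor = kHeaderSize;
    while (cursor + kAssertMessageSize <= end) {
        if (buffer[cursor] != kCommandAssert) break;  // unknown command in this sketch
        ++hits[AssertSite(buffer[cursor + 1], buffer[cursor + 2])];
        cursor += kAssertMessageSize;
    }
    return hits;
}
```

The decoder returns one entry per assert site with a hit count, which is enough to log each failing assert once per frame instead of once per thread.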
If the shader development option is off, the shaders are compiled with all diagnostic/validation features removed. So adding these features will not make your final shaders slower.
The main implementation can be found here in the CompileShader function.
Future work:
- A printf() function working in shaders, to be able to output the values of variables and the state of the shader to the log. Really interesting post about implementing printf() in shaders.
- Asserts with printf() integrated. Once an assert is triggered, it can send a message to the CPU.
- Debug primitive rendering. From any shader you can create debug primitives like boxes, circles, lines, etc. Those debug primitives can either be sent to the CPU through the message system and drawn with the CPU debug primitives API, or accumulated into a buffer and drawn using indirect draw, so there will not be any lag.