Thursday, 25 January 2024

Development with Shaders

GPU shaders used to be simple kernel functions: some basic maths that wrote out a few values. Nowadays GPU shaders have the power to be just as complex and generic as CPU code, with complex maths, maps, allocators, sorting, complex global synchronisation, etc.

But GPU shader languages have not evolved to give you tools to debug and validate shaders during development. Usually a shader is a black box and there is not a lot you can do. Of course, tools like RenderDoc or PIX can give you a lot of detail about what is happening, but they are post-mortem tools and they do not offer any diagnostics or validation during development. Sometimes debugging comes too late and is not the solution; it is a similar situation to debugging heavily parallel CPU code with a lot of complex synchronisation, where debugging is a nightmare.

Some tools have appeared recently that can help you a lot to diagnose your shaders during development, like GPU Reshape.

This post is about the features that Cute offers for developing shaders.

How can we make our shaders easy to diagnose and validate during development?

Reloading shaders:

One of the first features that has to be implemented in an engine is shader reloading. Reloading shaders allows you to modify a shader and see the changes with fast iteration, like a live edit. A lot of the coding of your shaders can be done while the application is running.

You should be able to undo the reload if the new shader doesn't compile, and give the user the option to fix the shader when it fails to compile the first time during loading. With just this feature, developing shaders becomes fun instead of a frustration each time you have a problem.

Asserts:


uint indirect_box; //Encodes the instance list index (8 bits), instance index (12 bits) and box index (12 bits)
assert(instance_list_index < (1 << 8)); //check range
assert(instance_index < (1 << 12)); //check range
assert(j < (1 << 12)); //check range (j is the box index)
indirect_box = ((instance_list_index & 0xFF) << 24) | ((instance_index & 0xFFF) << 12) | (j & 0xFFF);

Asserts are a really useful feature in CPU programming, used to validate your data and execution during development. Everybody knows what an assert is and why it is so powerful, but GPU shaders don't have them. When shader development mode is on, asserts will be checked and reported back to the CPU when there is a problem, so you can see when your validation has failed.

One of the classic bugs in shaders is a NaN being written to a UAV or render target. Now we can add some asserts to validate that we always output a correct value.

assert(all(!IsNaN(result)));
return float4(result, 1.f);


Control variables:
  
CONTROL_VARIABLE(bool, PostProcess, BloomEnable, true)
  
if (!BloomEnable) bloom = float3(0.f, 0.f, 0.f);

Reloading shaders is really powerful, but sometimes you just want to set a value from outside the shader and use it inside it, for example to enable or disable a code path, or to pass a value and multiply a variable by it to test some idea.
Control variables allow you exactly that. If you want to use a control variable, you just need to declare it, use it and reload the shader; the control variable will appear in the debug menu (Cute already has control variables on the CPU) and you can just change its value.


Counters:


COUNTER(BoxCulling, TotalCubes)
COUNTER(BoxCulling, FrustumVisibleInstances)
COUNTER(BoxCulling, FirstPassVisibleCubes)

COUNTER_INC(TotalCubes);

Counters are another diagnostic tool used a lot on the CPU (Cute already has them for the CPU). Each time something happens, you just increment a counter, and each frame you can see the results. When shader development mode is on, you can define counters in the shader, reload the shaders and see the results in the CPU counters interface. Counters will help you check that the execution of the shader is what you expect and collect stats from your shaders.

Implementation details:

If the development shaders option is enabled, for each shader that needs to be compiled we create some new bindings for our development resources (a new root signature parameter), parse the shader to collect the features it uses, and inject code into the shader that replaces each feature with the correct code to execute.

For control variables, we parse the shader looking for the definition of each control variable, extract its name, group name, type and default value, and register it on the CPU (we already have control variables on the CPU with an ImGui interface). Then we create a resource that is updated each frame with the user-defined CPU values and sent to the GPU; each control variable on the GPU just reads its value from that resource.
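As an illustration, the CPU-side registration could look something like this (the names and API here are hypothetical sketches, not Cute's actual code):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>

// A control variable can hold one of a few basic types.
using ControlValue = std::variant<bool, float, int32_t>;

// Hypothetical registry that the shader parser fills in; the debug menu
// edits these values and they are uploaded to the GPU every frame.
struct ControlVariableRegistry
{
    // Key is "Group/Name", e.g. "PostProcess/BloomEnable".
    std::map<std::string, ControlValue> variables;

    void Register(const std::string& group, const std::string& name, ControlValue default_value)
    {
        variables.emplace(group + "/" + name, default_value);
    }

    ControlValue Get(const std::string& group, const std::string& name) const
    {
        return variables.at(group + "/" + name);
    }
};
```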

For counters, each one registers a counter on the CPU and is assigned an index/slot; then a UAV is created, initialised each frame with zeros and sent to the GPU. Each time a counter is triggered on the GPU, it just needs to do an InterlockedAdd on the correct slot. On the CPU we read that resource back (using readback buffers) and then update the CPU counters with the values from the GPU.
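A minimal sketch of the CPU side of this scheme (RegisterCounter and ResolveReadback are illustrative names, not Cute's actual API):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Each counter gets a slot in a uint buffer; the GPU increments its slot
// with InterlockedAdd, and after readback the values are folded into the
// CPU-side counters.
struct GPUCounterPool
{
    std::vector<std::string> names;   // counter name per slot
    std::vector<uint32_t> cpu_values; // accumulated CPU-side values

    uint32_t RegisterCounter(const std::string& name)
    {
        names.push_back(name);
        cpu_values.push_back(0);
        return static_cast<uint32_t>(names.size() - 1); // slot index
    }

    // Called once the readback buffer for a finished frame has been mapped.
    void ResolveReadback(const uint32_t* readback, size_t num_slots)
    {
        for (size_t i = 0; i < num_slots; ++i)
            cpu_values[i] += readback[i];
    }
};
```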

Asserts are more complicated, as you don't know the exact number of asserts in advance. So Cute implements a generic command buffer system that allows the shaders to send messages/commands to the CPU each frame. It works with a UAV resource declared as RWStructuredBuffer<uint>, where the first integer slot is the current command buffer position and the second the maximum size of the buffer. Each time the GPU needs to send a command, it increments the buffer position by the size needed for the message and writes the message into the buffer. The first uint of the message identifies the command and the rest is the data. For the assert message, the first element is zero (representing the assert command), the second element is the file index (each file is identified by an index) and the third is the line. The resource is then read back on the CPU and each message is decoded. In the case of asserts, all the asserts with the same file and line are merged, so a triggered assert doesn't spam the log too much.
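The decoding side can be sketched like this: a CPU-side mock of the buffer layout described above, with illustrative names and assuming the position counter counts uints written after the two-slot header:

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

struct AssertReport { uint32_t file_index; uint32_t line; };

// Decode assert messages (command id 0) from a read-back command buffer.
// Layout: [0] = write position (uints after the header), [1] = capacity,
// then a stream of messages, each starting with a command id.
std::vector<AssertReport> DecodeAsserts(const std::vector<uint32_t>& buffer)
{
    std::vector<AssertReport> result;
    std::set<std::pair<uint32_t, uint32_t>> seen; // merge same file/line
    const uint32_t header = 2;
    uint32_t write_pos = buffer[0];
    uint32_t cursor = header;
    while (cursor < header + write_pos)
    {
        uint32_t command = buffer[cursor];
        if (command == 0) // command 0: assert, payload = file index, line
        {
            AssertReport report{buffer[cursor + 1], buffer[cursor + 2]};
            if (seen.insert({report.file_index, report.line}).second)
                result.push_back(report);
            cursor += 3;
        }
        else
        {
            break; // unknown command id: stop decoding this frame
        }
    }
    return result;
}
```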


If the shader development option is off, the shaders are compiled with all the diagnostic/validation features stripped out, so adding all these features will not make your final shaders slower.

The main implementation can be found here in the CompileShader function.

Future work:

Development shaders have proven to be a really useful feature, but they can still be expanded.
Once the message system to send commands from the GPU to the CPU is implemented, we can add more features like:
  • A Printf() function working in shaders, to be able to log the values of variables and the state of the shader from inside the shader. There is a really interesting post about implementing printf() in shaders.
  • Asserts with printf() integrated, so once an assert is triggered it can send a message to the CPU.
  • Debug primitive rendering, so from any shader you can create debug primitives like boxes, circles, lines, etc. Those debug primitives can be sent through the message system to the CPU and drawn with the CPU debug primitives API, or accumulated into a buffer and drawn using indirect draw, so there is no lag.



Wednesday, 8 July 2020

The job system in Cute

Cute implements a job system using workers with job queues and a work-stealing process for when a queue runs empty.
The idea is that there is a list of worker threads (usually one per CPU core); each worker has a queue where it can add jobs (push) and remove jobs (pop) for execution. Then, if your own queue is empty, you can steal a job from another worker.
The most critical part of the process is the synchronisation of the queue, as it needs to be lock free between push, pop and steal. It is implemented using two atomics, and the idea is to keep the synchronisation as minimal as possible.
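A simplified, fixed-size sketch of such a queue (sequentially-consistent atomics for readability; a tuned implementation would relax the memory ordering and handle growth):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Work-stealing deque synchronised with two atomics (bottom and top).
// The owning worker pushes and pops at the bottom; other workers steal
// from the top, so they only contend on the last remaining job.
class WorkStealingQueue
{
public:
    explicit WorkStealingQueue(int64_t capacity) : jobs_(capacity), capacity_(capacity) {}

    bool Push(int job) // owner thread only
    {
        int64_t b = bottom_.load();
        if (b - top_.load() >= capacity_) return false; // full
        jobs_[b % capacity_] = job;
        bottom_.store(b + 1);
        return true;
    }

    bool Pop(int& job) // owner thread only
    {
        int64_t b = bottom_.load() - 1;
        bottom_.store(b);
        int64_t t = top_.load();
        if (t > b) { bottom_.store(t); return false; } // queue was empty
        job = jobs_[b % capacity_];
        if (t == b) // last element: race against a concurrent steal
        {
            int64_t expected = t;
            bool won = top_.compare_exchange_strong(expected, t + 1);
            bottom_.store(t + 1);
            return won;
        }
        return true;
    }

    bool Steal(int& job) // any other thread
    {
        int64_t t = top_.load();
        if (t >= bottom_.load()) return false; // empty
        job = jobs_[t % capacity_];
        return top_.compare_exchange_strong(t, t + 1); // lost race -> caller retries
    }

private:
    std::vector<int> jobs_;
    int64_t capacity_;
    std::atomic<int64_t> bottom_{0};
    std::atomic<int64_t> top_{0};
};
```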
Once the job system is implemented, you can add jobs from anywhere and the job will run on a worker thread. But you still need two more tools:

Fences:

Cute implements a simple fence system, so jobs can wait for other jobs. A fence is just an atomic integer that gets incremented when a job associated with the fence is added and decremented when the job finishes. Then, anywhere in the code, you can wait for the fence: at that moment the job will wait until the fence reaches zero and all the tasks associated with that fence are finished.
It is really simple to use and it has minimal synchronisation.
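A minimal sketch of the idea (the real system runs other jobs while waiting instead of spinning; names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Fence: incremented when a job is added, decremented when it finishes.
// Waiting just means waiting for the counter to reach zero.
struct JobFence
{
    std::atomic<int32_t> pending{0};

    void OnJobAdded()    { pending.fetch_add(1); }
    void OnJobFinished() { pending.fetch_sub(1); }
    bool Done() const    { return pending.load() == 0; }
    void Wait() const    { while (!Done()) { /* a real worker would run another job here */ } }
};
```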
The implementation has some flaws: while you are waiting for a fence to finish, you may start to run another job, and this job could be really long, blocking the wake-up of the waiting job. To solve this issue you can use fibers, but then the job system implementation is not as clean as it is now. So, this job system is really sensitive to long jobs and can produce some long waits.

Thread data:

The best synchronisation is no synchronisation. That is not always possible, but there is a pattern you can apply when you write multithreaded code that helps a lot. The idea is that each worker thread has its own data and always works with it (different jobs on the same worker can use the same data; it is similar to thread local storage); once all the jobs have finished with their own data, we collect the data from all the workers.

Implementation:

Cute implements a templated class called ThreadData. That helper class creates an array of data (the template parameter), one entry per worker, so each worker can access its own data (Get function); once all the jobs are finished (nobody accesses the data anymore), you can collect the results (using the Visit or Access functions). Each entry is also aligned to a cache line, so there is no false sharing.
With this pattern you don't need to synchronise access to the data, as only the owning worker can access it. You only access all the data together when all the jobs working on it are finished.
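A sketch of what such a helper could look like (illustrative, not Cute's exact implementation):

```cpp
#include <cstddef>
#include <vector>

// One cache-line-aligned slot per worker. Get(worker) is only touched by
// that worker during the jobs; Visit runs after all jobs have finished.
template <typename DATA>
class ThreadData
{
public:
    explicit ThreadData(size_t num_workers) : slots_(num_workers) {}

    DATA& Get(size_t worker_index) { return slots_[worker_index].data; }

    template <typename VISITOR>
    void Visit(VISITOR&& visitor) // call only once all jobs are done
    {
        for (auto& slot : slots_) visitor(slot.data);
    }

private:
    struct alignas(64) Slot { DATA data{}; }; // 64-byte alignment avoids false sharing
    std::vector<Slot> slots_;
};
```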

Sample:

I always use the same example to explain this pattern. Imagine that you have a huge array of integers and you need to calculate the max and the min. Using this pattern, you divide the array into sections (avoiding false sharing) and create a lot of jobs that calculate the max and min of each section. Each result is compared with the current value stored in a ThreadData and updated if needed, so when all the jobs are finished you have an array of min/max values, one per worker. Then you just calculate the min and max of that array; that array is small (the number of CPU cores) and fast to process.
This sample only needed one sync point: waiting for all the sections to finish. Each worker can execute a lot of jobs for different sections, but we don't need to synchronise access to the data, as jobs on the same worker run serialised.
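The example can be sketched like this, with a plain per-worker array standing in for ThreadData and the jobs run inline (ParallelMinMax and the round-robin section assignment are illustrative):

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

struct MinMax { int min = INT_MAX; int max = INT_MIN; };

// Each "worker" reduces its sections into a private slot (no sync needed),
// then a tiny final pass reduces the per-worker slots.
MinMax ParallelMinMax(const std::vector<int>& values, size_t num_workers)
{
    std::vector<MinMax> per_worker(num_workers); // stand-in for ThreadData
    const size_t section = 1024;
    for (size_t begin = 0; begin < values.size(); begin += section)
    {
        // In the real system each section is a job; here we just pick
        // which worker would have run it.
        size_t worker = (begin / section) % num_workers;
        size_t end = std::min(begin + section, values.size());
        for (size_t i = begin; i < end; ++i)
        {
            per_worker[worker].min = std::min(per_worker[worker].min, values[i]);
            per_worker[worker].max = std::max(per_worker[worker].max, values[i]);
        }
    }
    MinMax result; // final reduction over one entry per worker
    for (const MinMax& m : per_worker)
    {
        result.min = std::min(result.min, m.min);
        result.max = std::max(result.max, m.max);
    }
    return result;
}
```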

Friday, 3 July 2020

GPU memory management (part 1)

It is clear why APIs like DX12 and Vulkan appeared: they are close to the metal, but big power comes with bigger responsibility.
A few of the top issues from the older libraries that you can improve are:
  • Map and Unmap calls with discard between draw call submissions to update data on the GPU.
  • Double or triple buffering of big buffers for uploading instances, bones, ...
  • Only the GPU thread could upload data easily, which meant a lot of copying and synchronisation of data from the game thread to the render thread, with the render thread then submitting to the GPU.
With the Cute renderer I added some memory management interfaces that allow submitting data to the GPU from any of the worker threads at the same time.

Life handling

One of the most important issues in these modern APIs is that the user controls the lifetime of the GPU memory, so you are responsible for keeping the memory alive (not overwritten by another frame) until the GPU has finished with it. Cute is based on a job system, which means we can have jobs working on different frames; for example, we can have an update job working on frame 3 while a render job works on frame 1 and the GPU may still be executing frame 0. To solve this, the Cute renderer has three frame counters: one for the game/update, one for the render and one for the GPU. So we know that if a game job has used some GPU memory at frame 3, we cannot reuse that memory until the GPU has finished that frame.
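The reuse rule can be sketched with a small helper (names are assumptions, not Cute's actual types):

```cpp
#include <cstdint>

// Three counters track which frame each stage is working on. Memory last
// touched at frame N can only be reused once the GPU has finished frame N.
struct FrameCounters
{
    uint64_t game_frame = 0;   // frame the update jobs are producing
    uint64_t render_frame = 0; // frame the render jobs are recording
    uint64_t gpu_frame = 0;    // last frame the GPU has fully executed

    bool CanReuse(uint64_t last_used_frame) const
    {
        return gpu_frame >= last_used_frame;
    }
};
```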

Dynamic memory management

This type of GPU memory is used only for one frame, so it is meant for data that changes every frame.

It is implemented using an allocator inside a big reserved buffer in GPU memory (an upload heap); that memory can be used as a buffer or a vertex buffer. The allocator reserves a segment of memory for each job working on a frame when needed and returns memory inside that segment until it is full; then it allocates another segment. Each segment knows in which frame it was used, so the renderer keeps the segments alive until the GPU has finished that frame.

Using this system we can update GPU memory from all the workers (game and render jobs can write directly into GPU memory), and because we know in which frame each segment was used, we know when we can reuse it for another allocation. With this implementation we only need to synchronise inside the allocator when we access the segment list, in case two workers are allocating a segment at the same time or the renderer is freeing segments the GPU no longer needs. Once a segment is associated with a worker/frame, it doesn't need to synchronise with other threads, as the segment cannot be accessed from other threads; so the linear allocator inside a segment doesn't need any synchronisation.
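A sketch of the two-level scheme (sizes and names are illustrative): grabbing a segment takes a lock, but allocating inside a worker's own segment is just a cursor bump, with no synchronisation:

```cpp
#include <cstdint>
#include <cstddef>
#include <mutex>
#include <vector>

struct Segment
{
    size_t offset_in_buffer = 0;  // where this segment lives in the big buffer
    size_t used = 0;              // linear allocation cursor, owner thread only
    uint64_t last_used_frame = 0; // GPU must pass this frame before reuse
};

class SegmentAllocator
{
public:
    SegmentAllocator(size_t segment_size, size_t num_segments) : segment_size_(segment_size)
    {
        for (size_t i = 0; i < num_segments; ++i)
            free_segments_.push_back(Segment{i * segment_size, 0, 0});
    }

    // Slow path, called only when a worker's current segment is full.
    bool GrabSegment(uint64_t frame, Segment& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_segments_.empty()) return false;
        out = free_segments_.back();
        free_segments_.pop_back();
        out.used = 0;
        out.last_used_frame = frame;
        return true;
    }

    // Fast path: linear allocation inside the worker's own segment, no lock.
    static bool Alloc(Segment& segment, size_t size, size_t segment_size, size_t& out_offset)
    {
        if (segment.used + size > segment_size) return false; // segment full
        out_offset = segment.offset_in_buffer + segment.used;
        segment.used += size;
        return true;
    }

private:
    size_t segment_size_;
    std::mutex mutex_;
    std::vector<Segment> free_segments_;
};
```

Returning finished segments to the free list (once the GPU has passed their frame) is omitted here, but works the same way under the lock.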

The interface is really simple:

//Alloc dynamic gpu memory
void* AllocDynamicGPUMemory(display::Device* device, const size_t size, const uint64_t frame_index);

Just allocate the memory and fill it; then you can calculate the offset inside the buffer (to pass it to your GPU shaders) using the base pointer of the resource and the returned pointer.
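The offset calculation is just pointer arithmetic against the mapped base pointer (GPUOffset is an illustrative helper, not Cute's API):

```cpp
#include <cstdint>
#include <cstddef>

// Distance in bytes between the returned allocation and the mapped base
// pointer of the big upload buffer; this is what the shader receives.
inline uint64_t GPUOffset(const void* base, const void* allocation)
{
    return static_cast<uint64_t>(
        reinterpret_cast<const uint8_t*>(allocation) -
        reinterpret_cast<const uint8_t*>(base));
}
```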

Sample:

For my ECS multi-agent system test (link) I use this system to upload the instance buffers that render my agents. There is a step where we go through all the agents in the game and cull them with the camera frustum; this step is done by a lot of workers at the same time. So, instead of adding all of them into one big array (with a lot of synchronisation) and then submitting it to the GPU, we use the dynamic memory system.
Each worker allocates a block of memory on the GPU and adds the instances into that memory; once the memory is allocated, we don't need to synchronise between workers, as we know that all the jobs using that memory execute on the same thread (similar to thread local storage), one after the other. Once the memory block is full, we add an instanced draw call with that memory offset into the instance buffer and the correct size.

The clear benefit is that we are submitting to the GPU from all the worker threads at the same time, avoiding a copy into another intermediate buffer and almost without contention; we copy directly into GPU memory. The downside is that we issue more draw calls, as each worker thread needs its own draw call. Of course, if we are drawing a lot of instances this is not a big issue (modern GPUs are good at overlapping draw calls): instead of one draw call we will have a minimum of 6 (on a 6-core CPU), and more if a memory block gets filled by the same worker. In the ECS test the benefits outweigh the downside.

Code: https://github.com/JlSanchezB/Cute/blob/master/engine/render_module/render_segment_allocator.h 

Monday, 29 June 2020

Entity Component System, Testing it (Part3)

An ECS is simple to understand and to implement, but it is when you stress it that you realise where it performs great and where the issues are, especially if you want to support multithreading.
To test the Cute ECS I implemented a simple multi-agent system, so I am not only testing my ECS, I can also stress it and calibrate it for maximum performance.
A multi-agent system defines simple agents that solve specific problems; then you run all of them at the same time, creating a lot of interesting reactions. My multi-agent system simply tries to simulate the relations between hunters, prey and vegetation, creating a simulated ecosystem.

Description of the multi agent system:

We have three types of agents:
  • Grass. Green circles; they grow until they touch other grass.
  • Gazelle. White circles; they eat the closest grass around them, trying to avoid the hunters.
  • Lion. Yellow triangles; they try to eat gazelles. They have an energy recharge, so they can sprint for a short time to catch a gazelle.

Each agent will have some identity values (DNA?) that differentiate it from other agents:
  • Grass: each one has a different growing speed.
  • Gazelle: different speeds (linear and angular), different vision ranges, and some of them prefer bigger grass over closer grass.
  • Lion: different speeds, targeting angles and recharge times.
The size of an agent represents its life: agents get smaller as they move around because moving consumes life, and they get bigger when they eat. Once they reach a maximum size, they duplicate their identity values and divide into three new agents, so successful agents propagate their identity values.

During the execution of the system we can change some of the parameters of the simulation, so we can see what happens when we spawn a lot more lions or when we stop spawning grass.

Implementation details:

  • It will use the ECS and the job system.
    • The ECS supports accessing, creating and deleting entities from different threads.
    • The job system in Cute has a lock-free stealing queue for each worker.
  • Each agent will have common components, like EntitySize, PositionComponent and SpeedComponent.
  • The world is a 2D box, divided into 24x24 zones.
  • The frame is divided into three steps:
    • Agent update: where the agents update their behaviours.
    • Agent movement: where the agents apply their new speed to their positions and move between zones.
    • Agent rendering: where frustum culling happens, an instance buffer for each type of agent is created and the render commands are sent to the renderer.
  • Each step generates a lot of jobs that get distributed to all the worker threads; the jobs do not share cache lines and are synchronised with the correct fence.
  • Each update uses the ECS as much as possible; for example, when a lion is looking for gazelles, it can calculate which zones match its query and then iterate looking for gazelles only in those zones.
  • Gazelles and lions collide with each other and try to keep some distance apart.
  • We calculate all the interactions every frame, so each lion looks around for all the gazelles each frame and each gazelle selects which grass is its target each frame. We could reduce the frequency of a lot of the logic and split the cost across different frames, but I wanted to stress the ECS as much as possible.
  • There are still some interactions that need threading synchronisation, for example when more than one gazelle is eating the same grass. For these situations we use atomics.
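For example, claiming a grass can be done with a compare-and-exchange, so only one gazelle wins (a sketch; the field names are illustrative and agent id 0 is reserved to mean "free"):

```cpp
#include <atomic>
#include <cstdint>

// Whichever gazelle claims the grass first wins; the loser looks elsewhere.
struct Grass
{
    std::atomic<uint32_t> eaten_by{0}; // 0 = free, otherwise the winner's agent id

    bool TryEat(uint32_t gazelle_id)
    {
        uint32_t expected = 0;
        return eaten_by.compare_exchange_strong(expected, gazelle_id);
    }
};
```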

Result:


This simulation has more than 10k agents running at more than 150 fps on my modest CPU, an AMD Ryzen 2400G.

Microprofile (https://github.com/jonasmr/microprofile) is integrated in Cute for profiling; these are a few typical frames.