Wednesday, 8 July 2020

The job system in Cute

Cute implements a job system using worker threads with job queues and a work-stealing scheme for when a queue is empty. It is based on:
The idea is that there is a list of worker threads (usually one per CPU core); each worker has a queue where it can add jobs (push) and remove jobs (pop) to execute them. Then, if your own queue is empty, you can steal a job from another worker.
The most critical part of the process is the synchronisation of the queue, as it needs to be lock-free between push, pop, and steal. It is implemented using two atomics, as explained in the links; the idea is to keep the synchronisation as minimal as possible.
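As a reference for how the two atomics come into play, here is a minimal sketch of a bounded work-stealing queue in the Chase-Lev style (bottom and top atomics). This is only an illustration of the technique, not Cute's actual implementation, and it omits details like overflow checking:

#include <atomic>
#include <cstddef>
#include <cstdint>

struct Job; // opaque job type

class WorkStealingQueue
{
public:
    static constexpr size_t kSize = 4096; // must be a power of two
    static constexpr size_t kMask = kSize - 1;

    // Push and Pop are called only by the owner worker; Steal by any other worker.
    void Push(Job* job)
    {
        const int64_t b = m_bottom.load(std::memory_order_relaxed);
        m_jobs[b & kMask] = job;
        // Publish the job before making it visible through bottom.
        m_bottom.store(b + 1, std::memory_order_release);
    }

    Job* Pop()
    {
        const int64_t b = m_bottom.load(std::memory_order_relaxed) - 1;
        m_bottom.store(b, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        int64_t t = m_top.load(std::memory_order_relaxed);
        if (t <= b)
        {
            Job* job = m_jobs[b & kMask];
            if (t != b)
                return job; // more than one job left, no race possible
            // Last job: race against concurrent steals with a CAS on top.
            if (!m_top.compare_exchange_strong(t, t + 1))
                job = nullptr; // a thief won the race
            m_bottom.store(b + 1, std::memory_order_relaxed);
            return job;
        }
        // Queue was already empty, restore bottom.
        m_bottom.store(b + 1, std::memory_order_relaxed);
        return nullptr;
    }

    Job* Steal()
    {
        int64_t t = m_top.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        const int64_t b = m_bottom.load(std::memory_order_acquire);
        if (t < b)
        {
            Job* job = m_jobs[t & kMask];
            // Claim the job; fail if the owner or another thief got there first.
            if (m_top.compare_exchange_strong(t, t + 1))
                return job;
        }
        return nullptr;
    }

private:
    Job* m_jobs[kSize];
    std::atomic<int64_t> m_top{0};
    std::atomic<int64_t> m_bottom{0};
};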
Once the job system is implemented, you can add jobs from anywhere and the job will run in a worker thread. But you still need two more tools:

Fences:

Cute implements a simple fence system, so jobs can wait for other jobs. A fence is just an atomic integer that gets incremented when jobs associated with the fence are added and decremented when a job finishes. Then, in whatever part of the code, you can wait for the fence... at that moment the job will wait until the fence reaches zero and all tasks associated with that fence are finished.
It is really simple to use and it has minimal synchronisation.
The implementation has some flaws: while you are waiting for a fence to finish, you may start to run another job, and if that job is really long it blocks the wake-up of the waiting job. To solve this issue you can use fibers, but then the job system implementation is not as clean as it is now. So, this job system is really sensitive to long jobs and can produce some long waits.
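A minimal sketch of such a fence, assuming a hypothetical TryRunOneJobFromThisWorker helper that pops or steals one job and executes it (the names are illustrative, not Cute's actual interface):

#include <atomic>

struct Fence
{
    std::atomic<int> counter{0};
};

void TryRunOneJobFromThisWorker(); // hypothetical: pop or steal one job and execute it

// Called when a job associated with the fence is added.
void OnJobAdded(Fence& fence)
{
    fence.counter.fetch_add(1, std::memory_order_relaxed);
}

// Called when a job associated with the fence finishes.
void OnJobFinished(Fence& fence)
{
    fence.counter.fetch_sub(1, std::memory_order_acq_rel);
}

// Waits until the fence reaches zero, helping with other jobs instead of blocking.
void Wait(Fence& fence)
{
    while (fence.counter.load(std::memory_order_acquire) > 0)
    {
        TryRunOneJobFromThisWorker();
    }
}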

Thread data:

The best synchronisation is no synchronisation. That is not always possible, but there is a pattern that you can apply when you write multithreaded code that helps a lot. The idea is that each worker thread has its own data and always works with it (different jobs can run on the same worker with the same data, similar to thread local storage); once all jobs have finished with their own data, we collect the data from all the workers.

Implementation:

Cute implements a templated class called ThreadData; this helper class creates an array of data (the template parameter), one entry per worker. Each worker can access its own data (Get function), and once all jobs are finished (nobody accesses the data anymore) you can collect the results (using the Visit or Access functions). Each entry is aligned to a cache line as well, so there is no false sharing.
With this pattern you don't need to synchronise access to the data, as only the owning worker can touch it. You only access all the data together once all the jobs working on it are finished.
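A sketch of what such a helper could look like, assuming hypothetical GetWorkerIndex and GetNumWorkers functions provided by the job system (this illustrates the pattern, not Cute's actual code):

#include <cstddef>
#include <vector>

size_t GetWorkerIndex(); // hypothetical: index of the calling worker thread
size_t GetNumWorkers();  // hypothetical: number of worker threads

template <typename DATA>
class ThreadData
{
public:
    ThreadData() : m_data(GetNumWorkers()) {}

    // Called from inside a job: returns the data owned by the current worker.
    DATA& Get()
    {
        return m_data[GetWorkerIndex()].data;
    }

    // Called once all jobs are finished: visits every worker's data.
    template <typename VISITOR>
    void Visit(VISITOR&& visitor)
    {
        for (auto& entry : m_data)
        {
            visitor(entry.data);
        }
    }

private:
    // Each entry is aligned to a cache line (64 bytes here) to avoid false sharing.
    struct alignas(64) Entry
    {
        DATA data{};
    };
    std::vector<Entry> m_data;
};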

Sample:

I always use the same sample to explain this pattern. Imagine that you have a huge array of integers and you need to calculate the max and the min. Using this pattern you divide the array into sections (avoiding false sharing) and create a lot of jobs that calculate the max and min for each section. Each result is compared with the current values stored in a ThreadData and updated if needed, so when all the jobs are finished you will have an array of min/max values. Then you just calculate the min and max of that array; usually that array is small and fast to process, as its size is the number of CPUs.
This sample needs only one sync point: waiting for all the sections to finish. Each worker can execute a lot of jobs for different sections, but we don't need to synchronise access to the data, as jobs on the same worker run serialised.
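Putting it together, a sketch of this sample using the hypothetical Fence and ThreadData helpers from above, plus assumed AddJob and WaitForFence job system calls:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <limits>

void AddJob(Fence& fence, std::function<void()> job); // hypothetical: schedule a job on the fence
void WaitForFence(Fence& fence);                      // hypothetical: wait until the fence is zero

struct MinMax
{
    int32_t min = std::numeric_limits<int32_t>::max();
    int32_t max = std::numeric_limits<int32_t>::min();
};

MinMax CalculateMinMax(const int32_t* values, const size_t count, const size_t section_size)
{
    ThreadData<MinMax> per_worker;
    Fence fence;

    for (size_t begin = 0; begin < count; begin += section_size)
    {
        const size_t end = std::min(begin + section_size, count);
        AddJob(fence, [&per_worker, values, begin, end]()
        {
            // Each job updates only its own worker's entry, so no sync is needed.
            MinMax& local = per_worker.Get();
            for (size_t i = begin; i < end; ++i)
            {
                local.min = std::min(local.min, values[i]);
                local.max = std::max(local.max, values[i]);
            }
        });
    }

    // The only sync point: wait until every section job has finished.
    WaitForFence(fence);

    // Collect one result per worker; that array is as small as the number of CPUs.
    MinMax result;
    per_worker.Visit([&result](const MinMax& worker_data)
    {
        result.min = std::min(result.min, worker_data.min);
        result.max = std::max(result.max, worker_data.max);
    });
    return result;
}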

Friday, 3 July 2020

GPU memory management (part 1)

It is clear why APIs like DX12 and Vulkan appeared... close to the metal, but with great power comes greater responsibility.
A few of the top issues from older APIs that you can improve are:
  • Map and Unmap calls (with discard) between drawcall submissions to update data on the GPU.
  • Double or triple buffering of big buffers for uploading instances, bones, ...
  • Only the GPU thread was able to upload data easily, which meant a lot of copying and synchronisation of data from the game thread to the render thread, with the render thread then submitting to the GPU.
With the Cute renderer I added some memory management interfaces that allow submitting data to the GPU from any of the worker threads at the same time.

Lifetime handling

One of the most important issues in these modern libraries is that the user has control of the lifetime of the GPU memory, so you are responsible for keeping the memory alive (not overwritten by another frame) until the GPU has finished with it. Cute is based on a job system, which means that we can have jobs working on different frames; for example, we can have an update job working on frame 3 at the same time as a render job working on frame 1, while the GPU may still be doing frame 0. To solve this, the Cute renderer has three frame counters: one for the game/update, one for the render, and one for the GPU. So we know that if a game job has used some GPU memory at frame 3, we cannot reuse that memory until the GPU has finished that frame.
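As a sketch, the three counters could look like this (hypothetical names, just illustrating the rule for when memory can be reused):

#include <atomic>
#include <cstdint>

struct FrameCounters
{
    std::atomic<uint64_t> game_frame{0};   // frame the update jobs are working on
    std::atomic<uint64_t> render_frame{0}; // frame the render jobs are working on
    std::atomic<uint64_t> gpu_frame{0};    // last frame the GPU has finished

    // Memory used at 'frame' can be reused only once the GPU has passed that frame.
    bool CanReuse(const uint64_t frame) const
    {
        return gpu_frame.load(std::memory_order_acquire) >= frame;
    }
};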

Dynamic memory management

This type of GPU memory is used for only one frame, so it is meant for data that changes every frame.

It is implemented using an allocator inside a big reserved buffer in GPU memory (an upload heap); that memory can be used as a buffer or a vertex buffer. The allocator reserves a segment of memory for each job working in a frame when needed and returns memory inside that segment until it is full; then it allocates another segment. Each segment knows in which frame it was used, so the renderer keeps the segments alive until the GPU has finished that frame.

Using this system we can update GPU memory from all the workers (game and render jobs are able to write directly into GPU memory), and because we know in which frame a segment was used, we know when we can reuse it for another allocation. With this implementation we only need to synchronise inside the allocator when accessing the segments, in case two workers are allocating a segment at the same time or the renderer is freeing segments because the GPU no longer needs them. Once a segment is associated with a worker/frame, it doesn't need to synchronise with other threads, as the segment cannot be accessed from another thread. So the linear allocator inside a segment doesn't need any synchronisation.
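A simplified sketch of the segment idea (the real implementation is linked at the end of the post; here a plain mutex stands in for the allocator's synchronisation and the names are illustrative):

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct Segment
{
    uint8_t* base = nullptr; // start of the segment inside the big GPU buffer
    size_t size = 0;
    size_t offset = 0;       // linear allocation cursor, touched only by the owner worker
    uint64_t frame_used = 0; // frame this segment was used in
};

class SegmentAllocator
{
public:
    // Called by a worker when it has no segment or its current one is full.
    Segment* AllocSegment(const uint64_t frame_index)
    {
        std::lock_guard<std::mutex> lock(m_mutex); // the only sync point
        if (m_free_segments.empty())
            return nullptr; // out of reserved memory
        Segment* segment = m_free_segments.back();
        m_free_segments.pop_back();
        segment->offset = 0;
        segment->frame_used = frame_index;
        m_live_segments.push_back(segment);
        return segment;
    }

    // Called by the renderer once the GPU has finished 'gpu_frame'.
    void FreeFinishedSegments(const uint64_t gpu_frame)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        for (size_t i = 0; i < m_live_segments.size();)
        {
            if (m_live_segments[i]->frame_used <= gpu_frame)
            {
                m_free_segments.push_back(m_live_segments[i]);
                m_live_segments[i] = m_live_segments.back();
                m_live_segments.pop_back();
            }
            else
            {
                ++i;
            }
        }
    }

private:
    std::mutex m_mutex;
    std::vector<Segment*> m_free_segments;
    std::vector<Segment*> m_live_segments;
};

// Inside a segment the allocation is a simple bump of the cursor; no sync is
// needed because only one worker owns the segment at a time.
void* AllocFromSegment(Segment& segment, const size_t size)
{
    if (segment.offset + size > segment.size)
        return nullptr; // full, ask the allocator for a new segment
    void* result = segment.base + segment.offset;
    segment.offset += size;
    return result;
}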

The interface is really simple:

//Alloc dynamic gpu memory
void* AllocDynamicGPUMemory(display::Device* device, const size_t size, const uint64_t frame_index);

Just allocate the memory and fill it; then you can calculate the offset inside the buffer (to pass to your GPU shaders) using the base pointer of the resource and the returned pointer.
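For example, a hypothetical upload could look like this (GetDynamicBufferBasePointer is an assumed helper for the base pointer of the resource, not part of Cute's shown interface):

#include <cstdint>
#include <cstring>

namespace display { class Device; } // forward declaration for the sketch

uint8_t* GetDynamicBufferBasePointer(display::Device* device); // hypothetical

struct InstanceData
{
    float transform[16];
};

size_t UploadInstances(display::Device* device, const InstanceData* instances,
                       const size_t count, const uint64_t frame_index)
{
    const size_t size = count * sizeof(InstanceData);

    // Allocate dynamic GPU memory for this frame and fill it directly.
    void* memory = AllocDynamicGPUMemory(device, size, frame_index);
    memcpy(memory, instances, size);

    // Offset inside the big dynamic buffer, ready to pass to the GPU shaders.
    return static_cast<size_t>(reinterpret_cast<uint8_t*>(memory)
                               - GetDynamicBufferBasePointer(device));
}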

Sample:

For my ECS multi agent system test (link) I use this system for uploading the instance buffers to render my agents. There is a step where we go through all the agents in the game and cull them using the camera frustum; this step is done by a lot of workers at the same time. So, instead of adding all of them into a big array (using a lot of synchronisation) and then submitting it to the GPU, we use the dynamic memory system.
Each worker allocates a block of memory in the GPU and adds the instances into that memory. Once the memory is allocated, we don't need to sync between workers, as we know that all jobs using that memory are executed in the same thread (similar to thread local storage), one after the other. Once the memory block is full, we add an instanced drawcall with that memory offset into the instance buffer and the correct size.
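A sketch of such a culling job, with IsVisible, AddInstancedDrawcall, and OffsetInDynamicBuffer as hypothetical stand-ins for the frustum test, the drawcall submission, and the offset calculation described above:

#include <cstddef>
#include <cstdint>

struct GPUInstance
{
    float transform[16];
};

bool IsVisible(const GPUInstance& instance);                         // hypothetical frustum test
void AddInstancedDrawcall(size_t buffer_offset, size_t count);       // hypothetical
size_t OffsetInDynamicBuffer(display::Device* device, void* memory); // hypothetical

void CullSectionJob(display::Device* device, const GPUInstance* agents,
                    const size_t begin, const size_t end, const uint64_t frame_index)
{
    constexpr size_t kBlockCount = 1024; // instances per dynamic memory block
    GPUInstance* block = nullptr;
    size_t count = 0;

    for (size_t i = begin; i < end; ++i)
    {
        if (!IsVisible(agents[i]))
            continue;

        if (block == nullptr)
        {
            // Grab a new block; only this allocation needs any synchronisation.
            block = static_cast<GPUInstance*>(AllocDynamicGPUMemory(
                device, kBlockCount * sizeof(GPUInstance), frame_index));
            count = 0;
        }

        block[count++] = agents[i];

        if (count == kBlockCount)
        {
            // Block is full: emit an instanced drawcall for it and start a new one.
            AddInstancedDrawcall(OffsetInDynamicBuffer(device, block), count);
            block = nullptr;
        }
    }

    // Emit the remaining partial block, if any.
    if (block != nullptr && count > 0)
        AddInstancedDrawcall(OffsetInDynamicBuffer(device, block), count);
}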

The clear benefit is that we submit to the GPU from all the worker threads at the same time, avoiding a copy into another intermediate buffer and almost without contention; we copy directly into GPU memory. But there is a downside: we are going to issue more drawcalls, as each worker thread needs its own drawcall. Of course, if we are drawing a lot of instances it is not going to be a big issue (modern GPUs are good at overlapping drawcalls); instead of one drawcall we are going to have a minimum of 6 (on a 6-core CPU, and more if a worker fills its memory block). The benefits outweigh the downside in the ECS test.

Code: https://github.com/JlSanchezB/Cute/blob/master/engine/render_module/render_segment_allocator.h