Friday, 3 July 2020

GPU memory management (part 1)

It is clear why libraries like DX12 and Vulkan appeared: close to the metal. But big power comes with bigger responsibility.
A few of the top issues from older libraries that you can now improve are:
  • Map and Unmap calls with discard semantics between draw call submissions, used to update data on the GPU.
  • Double or triple buffering of big buffers used to upload instances, bones, etc.
  • Only the render thread could upload data easily, which meant a lot of copying and synchronisation of data from the game thread to the render thread before the render thread finally submitted it to the GPU.
With the Cute renderer I added some memory management interfaces that allow submitting data to the GPU from any of the worker threads at the same time.

Lifetime handling

One of the most important issues in these modern libraries is that the user controls the lifetime of the GPU memory, so you are responsible for keeping the memory alive (not overwritten by another frame) until the GPU has finished with it. Cute is based on a job system, which means we can have jobs working on different frames: for example, an update job can be working on frame 3 while a render job works on frame 1 and the GPU is maybe still executing frame 0. To solve this, the Cute renderer keeps three frame counters: one for the game/update, one for the render and one for the GPU. So, if a game job has used some GPU memory at frame 3, we know we can not reuse that memory until the GPU has finished that frame, as sketched below.
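To make the rule concrete, here is a minimal sketch of that check; the names (FrameCounters, CanReuse) are hypothetical and not the actual Cute code:

#include <atomic>
#include <cstdint>

struct FrameCounters
{
    std::atomic<uint64_t> game_frame{0};   // Frame the game/update jobs are working on
    std::atomic<uint64_t> render_frame{0}; // Frame the render jobs are working on
    std::atomic<uint64_t> gpu_frame{0};    // Last frame the GPU has fully finished
};

// GPU memory tagged with allocation_frame can only be reused once the GPU
// has finished that frame
bool CanReuse(const FrameCounters& counters, uint64_t allocation_frame)
{
    return counters.gpu_frame.load(std::memory_order_acquire) >= allocation_frame;
}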

Dynamic memory management

This type of GPU memory lives for only one frame, so it should be used for data that changes every frame.

It is implemented with an allocator inside a big reserved buffer in GPU memory (an upload heap); that memory can be used as a constant buffer or a vertex buffer. The allocator reserves a segment of memory for each job working on a frame when it is needed, and returns memory from inside that segment until it is full; then it allocates another segment. Each segment knows in what frame it was used, so the renderer keeps segments alive until the GPU has finished that frame.

Using this system we can update GPU memory from all the workers (game and render jobs can write directly into GPU memory), and because we know in what frame each segment was used, we know when we can reuse it for another allocation. With this implementation we only need to synchronise inside the allocator when accessing the segments, for example when two workers allocate a segment at the same time or when the renderer frees segments because the GPU no longer needs them. Once a segment is associated to a worker/frame, it does not need to synchronise with other threads, as the segment can not be accessed from another thread. So the linear allocator inside a segment does not need any synchronisation. A simplified sketch of this allocator follows.
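This sketch assumes fixed-size segments and is much simpler than the real render_segment_allocator.h; all names are hypothetical. It shows the two synchronised paths (grabbing and recycling segments) and the lock-free bump allocation inside a worker's own segment:

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct Segment
{
    size_t   offset = 0;      // Offset of the segment inside the big GPU buffer
    size_t   used = 0;        // Bump pointer inside the segment (one worker only, no sync)
    uint64_t frame_index = 0; // Frame that used the segment, for lifetime tracking
};

class SegmentAllocator
{
public:
    SegmentAllocator(size_t buffer_size, size_t segment_size)
        : m_segment_size(segment_size)
    {
        // Split the big buffer into fixed-size segments, all free at start
        m_segments.resize(buffer_size / segment_size);
        for (size_t i = 0; i < m_segments.size(); ++i)
        {
            m_segments[i].offset = i * segment_size;
            m_free.push_back(&m_segments[i]);
        }
    }

    // Synchronised path: a worker grabs a fresh segment for this frame
    Segment* AllocSegment(uint64_t frame_index)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_free.empty())
            return nullptr; // Out of segments, the caller has to handle it
        Segment* segment = m_free.back();
        m_free.pop_back();
        segment->used = 0;
        segment->frame_index = frame_index;
        m_live.push_back(segment);
        return segment;
    }

    // Lock-free path: bump allocation inside the worker's own segment
    bool AllocInSegment(Segment& segment, size_t size, size_t& out_offset)
    {
        if (segment.used + size > m_segment_size)
            return false; // Segment full, the worker needs a new one
        out_offset = segment.offset + segment.used;
        segment.used += size;
        return true;
    }

    // Synchronised path: recycle segments whose frame the GPU has finished
    void FreeCompletedSegments(uint64_t gpu_frame)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        for (size_t i = 0; i < m_live.size();)
        {
            if (m_live[i]->frame_index <= gpu_frame)
            {
                m_free.push_back(m_live[i]);
                m_live[i] = m_live.back();
                m_live.pop_back();
            }
            else
            {
                ++i;
            }
        }
    }

private:
    const size_t          m_segment_size;
    std::mutex            m_mutex;
    std::vector<Segment>  m_segments; // Metadata, one entry per slice of the big buffer
    std::vector<Segment*> m_free;     // Segments ready to be handed out
    std::vector<Segment*> m_live;     // Segments in flight, waiting for the GPU
};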

The interface is really simple:

// Alloc dynamic GPU memory
void* AllocDynamicGPUMemory(display::Device* device, const size_t size, const uint64_t frame_index);

Just allocate the memory and fill it; then you can calculate the offset inside the buffer (to pass it to your GPU shaders) using the base pointer of the resource and the returned pointer.
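For example, a usage sketch; InstanceData, FillInstance and GetDynamicGPUMemoryBasePointer are hypothetical helpers used only for illustration:

#include <cstddef>
#include <cstdint>

namespace display { struct Device; }
void* AllocDynamicGPUMemory(display::Device* device, const size_t size, const uint64_t frame_index);

struct InstanceData { float world_matrix[16]; };               // Hypothetical per-instance data
void FillInstance(InstanceData* data);                         // Hypothetical: fills the instance
void* GetDynamicGPUMemoryBasePointer(display::Device* device); // Hypothetical helper

void UploadOneInstance(display::Device* device, uint64_t frame_index)
{
    // Allocate one-frame GPU memory and write directly into the upload heap
    InstanceData* data = reinterpret_cast<InstanceData*>(
        AllocDynamicGPUMemory(device, sizeof(InstanceData), frame_index));
    FillInstance(data);

    // Offset inside the big buffer, calculated from the resource base pointer
    // and the returned pointer, ready to be passed to the shaders
    const size_t offset = reinterpret_cast<uint8_t*>(data) -
        reinterpret_cast<uint8_t*>(GetDynamicGPUMemoryBasePointer(device));
    (void)offset;
}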

Sample:

For my ECS multi-agent system test (link) I use this system to upload the instance buffers that render my agents. There is a step where we go through all the agents in the game and cull them against the camera frustum; this step is done by a lot of workers at the same time. So, instead of adding all of them into a big array (with a lot of synchronisation) and then submitting it to the GPU, we use the dynamic memory system.
Each worker allocates a block of memory in the GPU and adds its instances into that memory. Once the allocation of the block is done, we do not need to sync between workers, as we know that all the work using that memory is executed on the same thread (similar to thread local storage), one job after the other. Once the memory block is full, we add an instanced draw call with that memory offset inside the instance buffer and the correct size, as in the sketch below.
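A sketch of that per-worker pattern could look like this; Instance, EmitInstancedDrawCall and the block size are hypothetical, only AllocDynamicGPUMemory comes from the interface above:

#include <cstddef>
#include <cstdint>

namespace display { struct Device; }
void* AllocDynamicGPUMemory(display::Device* device, const size_t size, const uint64_t frame_index);

struct Instance { float world_matrix[16]; };                     // Hypothetical instance layout
void EmitInstancedDrawCall(const Instance* block, size_t count); // Hypothetical draw submission

constexpr size_t kInstancesPerBlock = 1024;

// Worker-local state, one per worker thread (similar to thread local storage)
struct WorkerState
{
    Instance* block = nullptr; // Current GPU memory block for this worker
    size_t    count = 0;       // Instances already written into the block
};

void PushVisibleInstance(WorkerState& state, const Instance& instance,
                         display::Device* device, uint64_t frame_index)
{
    if (state.block == nullptr || state.count == kInstancesPerBlock)
    {
        // Flush the full block as one instanced draw call and grab a new block
        if (state.block)
            EmitInstancedDrawCall(state.block, state.count);
        state.block = reinterpret_cast<Instance*>(AllocDynamicGPUMemory(
            device, kInstancesPerBlock * sizeof(Instance), frame_index));
        state.count = 0;
    }
    state.block[state.count++] = instance; // No locks: only this worker touches the block
}

At the end of the culling step each worker flushes its remaining, partially filled block in the same way.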

The clear benefit is that we submit to the GPU from all the worker threads at the same time, avoiding the copy into an intermediate buffer and almost without contention: we copy directly into GPU memory. The negative is that we are going to issue more draw calls, as each worker thread needs its own draw call. Of course, if we are drawing a lot of instances it is not a big issue (modern GPUs are good at overlapping draw calls): instead of one draw call we are going to have a minimum of 6 (on a 6 core CPU), and more if a worker fills its memory block more than once. In the ECS test the benefits are bigger than the negatives.

Code: https://github.com/JlSanchezB/Cute/blob/master/engine/render_module/render_segment_allocator.h 
