Wednesday, 8 July 2020

The job system in Cute

Cute implements a job system using workers with job queues and a stealing process when they are empty. It is based on:
The idea is that there is a list of worker threads (usually the number of CPUs), each worker has a queue where it can add jobs (push) and remove jobs (pop) for executing them. Then, if you own queue is empty you can steal a job other worker.
The most critical part of the process is the synchronisation of the queue as it need to be lock free between the push, pop and steal, it is implemented using two atomics as explained at the links, the idea is to keep the synchronisation as minimal as possible.
Once the job system is implemented, you can add jobs from anywhere and the job will run in a worker thread. But you still needs two more tools:

Fences:

Cute implements a simple fence system, so jobs can wait for another jobs. A fence is just an integer atomic that gets incremented when jobs associated to the fence are added and the integer gets decremented when the job finish. Then, in whatever part of the code, you can wait for the fence... it that moment the job will wait until the fence reach to zero and all task associated to that fence are finished.
It is really easy and simple to use and it has minimal synchronisation. 
The implementation has some flaws, as maybe when you are waiting for a fence to finish, you start to run another job, but this job could be really long, blocking the wake up of the waiting job. For solving this issue you can use fibers but then then job system implementation is not as clean as now. So, this job system is really sensible from long jobs and can produce some long waits.

Thread data:

The best synchronisation is not do synchronisation. But that is not always possible, still there is a pattern that you can apply when you do multithread code that helps a lot. The idea is that each worker thread has its own data and it works always with it (it can be different jobs in the same worker and same data, it is similar than thread local storage), once all jobs have finished with their own data, then we collect all the data from all the workers.

Implementation:

Cute implements a templated class called ThreadData, that helper class creates an array of data (template parameter), one for each worker. So, each worker now can access to its data (Get function), once all jobs are finished (nobody access to the data anymore) you can collect the results (using the Visit or Access functions). Each data will be aligned to the cache lines as well, so there is not false sharing.
With this pattern you don't need to sync for accessing the data, as only the correct worker can access it. You only access all the data together when all the jobs working in the data are finished.

Sample:

I always use the same sample for explaining this pattern. Imagine that you have a huge array of integers and you need to calculate the max and the min. Using this pattern you will divide the array (avoiding false sharing) in sections and create a lot of jobs that calculate the max and min for the section. The result will be compare with the current data stored in a ThreadData and updated if needed, so when all the jobs are finished you will have an array of min/max values. Then you just calculate the min and max of that array, usually that array is going to be small and fast to calculate, it is the number of CPUs.
This sample only needed one sync point, that is for waiting for all the sections to finish. Each worker can execute a lot of jobs for different sections, but we don't need to sync the access to the data, as for the same worker the jobs will run serialised.

No comments:

Post a Comment