Threading Model#

POLARIS uses a worker-pool threading model which, combined with memory-aligned data blocks, maximises performance. This section gives an overview of the threading model, how it uses condition variables to co-ordinate between threads, and how it handles exceptions.

Intro#

The threading model is implemented in two main locations:

  • World.h/.cpp - Controller Thread

  • SimulationThread.h/.cpp - Worker Threads

World.h implements a singleton class (World) which initialises structures, creates threads and ensures co-ordination between threads. It provides atomic data members which are used for inter-thread communication, and an implementation of a ThreadGate class which is used to synchronize workload across threads.

Patterns#

ThreadGate - This is a simple class that uses std::condition_variable to implement a “gate” pattern which can be used to synchronize concurrent threads. A gate cannot be crossed while it is closed, and the class provides a low-cost wait operation which minimises spurious wake operations on the waiting thread.

The POLARIS implementation uses three gates:

  1. Controller Thread

    1. worldGate - used to pause main thread execution while workers are working

  2. Worker Thread

    1. readyGate - used to pause worker thread execution until all workers are ready to work

    2. finishedGate - used to pause worker thread execution until all workers have finished working

It may not seem obvious at first why worker threads require both a start gate (readyGate) and an end gate (finishedGate). If there were only a single gate, a thread could finish the current iteration’s work so quickly that it gets through that gate twice in a single iteration before the gate can be closed. Having two gates allows for a “holding pen” type arrangement where all threads are guaranteed to be in a single place, as only one gate is open at a time.

The main worker thread logic is as follows:

while world->is_running:
    do_work()
    tell_world_I_am_at_finish_gate_and_wait()     # until main thread says otherwise
    tell_world_I_am_at_ready_gate_and_wait()      # until the last worker is ready       

The same pattern is used for the initialization procedure:

do_init()
tell_world_I_am_at_finish_gate_and_wait()     # until main thread says otherwise
tell_world_I_am_at_ready_gate_and_wait()      # until the last worker is ready       

Meanwhile on the main world controller thread, the loop logic is:

tell_simulation_engine_to_move_to_first_time_step()
release_threads_to_do_first_time_step()

while running: 
    wait_at_world_gate()         # until all threads are at finish gate
    close_world_gate_behind_me()

    tell_simulation_engine_to_move_to_next_time_step()

    if simulation_is_done():
        running = false
        release_threads_to_discover_the_grim_news()
    else:
        release_threads_to_do_next_time_step()
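The worker and controller loops above can be sketched as a small self-contained C++ program. Everything here is an illustrative assumption rather than POLARIS code: `Gate`, `MiniWorld`, `run_simulation`, and the arrival counters are hypothetical names, and `do_work()` is replaced by recording the current step number. As a simplification, the controller in this sketch also waits for initialisation to finish before the first release.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal gate: wait() blocks while the gate is closed.
class Gate {
public:
    void open() {
        { std::lock_guard<std::mutex> lock(m_); open_ = true; }
        cv_.notify_all();
    }
    void close() { std::lock_guard<std::mutex> lock(m_); open_ = false; }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return open_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool open_ = false;
};

// Hypothetical stand-in for World: three gates plus arrival counters.
struct MiniWorld {
    explicit MiniWorld(int n) : n_workers(n) {}
    const int n_workers;
    Gate world_gate, ready_gate, finished_gate;
    std::atomic<int> at_finish{0}, at_ready{0}, current_step{0};
    std::atomic<bool> running{true};

    void arrive_at_finish_gate() {
        if (++at_finish == n_workers) {  // last worker to finish
            at_finish = 0;
            ready_gate.close();          // every worker is already past it
            world_gate.open();           // wake the controller
        }
        finished_gate.wait();            // until the controller releases us
    }
    void arrive_at_ready_gate() {
        if (++at_ready == n_workers) {   // last worker to become ready
            at_ready = 0;
            finished_gate.close();       // every worker is already past it
            ready_gate.open();           // release the whole pen at once
        }
        ready_gate.wait();
    }
};

// Runs n_steps lock-step iterations; returns the steps each worker saw.
std::vector<std::vector<int>> run_simulation(int n_workers, int n_steps) {
    MiniWorld world(n_workers);
    std::vector<std::vector<int>> seen(n_workers);
    std::vector<std::thread> workers;
    for (int id = 0; id < n_workers; ++id) {
        workers.emplace_back([&world, &seen, id] {
            // Initialisation would go here; it uses the same two-gate pattern.
            world.arrive_at_finish_gate();
            world.arrive_at_ready_gate();
            while (world.running.load()) {
                seen[id].push_back(world.current_step.load());  // stand-in for do_work()
                world.arrive_at_finish_gate();
                world.arrive_at_ready_gate();
            }
        });
    }
    for (int step = 0; ; ++step) {
        world.world_gate.wait();         // until all workers are at the finish gate
        world.world_gate.close();        // close the gate behind the controller
        if (step == n_steps) world.running = false;
        else world.current_step = step;
        world.finished_gate.open();      // release workers (to work or to exit)
        if (!world.running.load()) break;
    }
    for (auto& t : workers) t.join();
    return seen;
}
```

In this sketch the last worker to arrive at a gate is the one that closes the gate behind the pen and opens the one in front of it. A gate is therefore only ever closed once every worker has already passed it, which is what prevents a fast worker from crossing the same gate twice in one iteration.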

Exception Handling#

Exception handling across threads is hard and the current approach may be sub-optimal. When an exception is encountered on a thread, that thread calls the world->raise() method, which sets running = false and sets an exception_occurred flag.

The main loop then exits its while loop and checks that flag, at which point the main thread raises a generic error ("Exception occurred on thread, check your logs").
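The flow described above can be illustrated with a small sketch. It is a simplification under stated assumptions: `WorldFlags`, `guarded_worker`, and `run_workers` are hypothetical names, the gates are omitted entirely, and writing to std::cerr stands in for whatever logging POLARIS actually performs.

```cpp
#include <atomic>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical stand-in for the two flags described above.
struct WorldFlags {
    std::atomic<bool> running{true};
    std::atomic<bool> exception_occurred{false};

    // Called from a worker's catch block: record the failure, stop the run.
    void raise() {
        exception_occurred = true;
        running = false;
    }
};

// Wraps a worker body so that no exception ever escapes the thread
// (an uncaught exception in a std::thread calls std::terminate).
template <typename Work>
void guarded_worker(WorldFlags& world, Work work) {
    try {
        while (world.running.load()) work();
    } catch (const std::exception& e) {
        std::cerr << "worker failed: " << e.what() << '\n';  // stand-in for logging
        world.raise();
    } catch (...) {
        world.raise();
    }
}

// Controller side: wait for the workers, then surface a generic error.
template <typename Work>
void run_workers(int n_workers, Work work) {
    WorldFlags world;
    std::vector<std::thread> threads;
    for (int i = 0; i < n_workers; ++i)
        threads.emplace_back([&world, work] { guarded_worker(world, work); });
    for (auto& t : threads) t.join();
    if (world.exception_occurred.load())
        throw std::runtime_error("Exception occurred on thread, check your logs");
}
```

Because only a boolean flag crosses the thread boundary, the type and message of the original exception are available in the log output but not to the code that catches the generic error on the main thread.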

Things to improve:

  1. A count of running threads is maintained in World and is decremented when a thread finishes or exits because of an exception. This should allow the controller thread to wait for worker threads to terminate in a normal, sane manner rather than our current approach of “tell them the building is on fire and run for the door”.

  2. We aren’t re-raising the original exception. This is because multiple exceptions can be raised (it has happened in practice) and it’s not easy to decide how to deal with that.