Threading Model#

POLARIS uses a worker-pool threading model which combines with memory-aligned data blocks to maximise performance. This section gives an overview of the threading model and of how it uses condition variables to coordinate threads and handle exceptions.

Intro#

The threading model is implemented in two main locations:

  • World.h/.cpp - Controller Thread

  • SimulationThread.h/.cpp - Worker Threads

World.h implements a singleton class (World) which initialises structures, creates threads and ensures co-ordination between threads. It provides atomic data members which are used for inter-thread communication and an implementation of a threadGate class which is used to synchronize workload across threads.

Patterns#

ThreadGate - This is a simple class that uses std::condition_variable to implement a “gate” pattern which can be used to synchronize concurrent threads. Gates cannot be crossed while they are closed, and the class provides a low-cost wait operation which minimises spurious wake-ups of the waiting thread.
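A minimal sketch of such a gate is shown here, assuming a mutex-protected boolean and the predicate overload of std::condition_variable::wait. The class name SketchGate and its member names are hypothetical; the real ThreadGate in World.h may be structured differently.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical gate for illustration only; not the POLARIS ThreadGate.
class SketchGate {
public:
    // Block the calling thread until the gate is open.
    void wait() {
        std::unique_lock<std::mutex> guard(mutex_);
        // The predicate is re-checked after every wake-up, so a spurious
        // wake-up while the gate is closed simply goes back to waiting.
        cv_.wait(guard, [this] { return open_; });
    }

    // Open the gate and wake every thread waiting at it.
    void open() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            open_ = true;
        }
        cv_.notify_all();
    }

    // Close the gate; threads arriving afterwards will block in wait().
    void close() {
        std::lock_guard<std::mutex> guard(mutex_);
        open_ = false;
    }

    bool is_open() {
        std::lock_guard<std::mutex> guard(mutex_);
        return open_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool open_ = false;  // gates start closed in this sketch
};
```

A thread that calls wait() on a closed gate sleeps inside the condition variable rather than spinning, which is what keeps the wait cheap.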

In the POLARIS implementation there are three gates utilised.

  1. Controller Thread

    1. worldGate - used to pause main thread execution while workers are working

  2. Worker Thread

    1. readyGate - used to pause worker thread execution until all workers are ready to work

    2. finishedGate - used to pause worker thread execution until all workers have finished working

It may not seem obvious at first why worker threads require both a start gate and an end gate. If there were only a single gate, a thread could finish the current iteration’s work so quickly that it gets through that single gate twice in one iteration before the gate can be closed. Having two gates allows for a “holding pen” type arrangement where all threads are guaranteed to be in a single place, as only one gate is open at a time.

The main worker thread logic is as follows:

while world->is_running:
    do_work()
    tell_world_I_am_at_finish_gate_and_wait()     # until main thread says otherwise
    tell_world_I_am_at_ready_gate_and_wait()      # until the last worker is ready       

The same pattern is used for the initialization procedure:

do_init()
tell_world_I_am_at_finish_gate_and_wait()     # until main thread says otherwise
tell_world_I_am_at_ready_gate_and_wait()      # until the last worker is ready       

Meanwhile on the main world controller thread, the loop logic is:

tell_simulation_engine_to_move_to_first_time_step()
release_threads_to_do_first_time_step()

while running: 
    wait_at_world_gate()         # until all threads are at finish gate
    close_world_gate_behind_me()

    tell_simulation_engine_to_move_to_next_time_step()

    if simulation_is_done():
        running = false
        release_threads_to_discover_the_grim_news()
    else:
        release_threads_to_do_next_time_step()
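The worker and controller loops above can be sketched together as a small runnable program. This is an illustration under stated assumptions, not the POLARIS implementation: the Gate class, the arrival counters (at_finish, at_ready) and the function run_sketch are all hypothetical names, and the "work" is just a counter increment. The last worker to arrive at a gate closes the gate behind the group and opens the next one.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the gate class described earlier.
class Gate {
public:
    void wait() {
        std::unique_lock<std::mutex> guard(m_);
        cv_.wait(guard, [this] { return open_; });
    }
    void open() {
        { std::lock_guard<std::mutex> guard(m_); open_ = true; }
        cv_.notify_all();
    }
    void close() {
        std::lock_guard<std::mutex> guard(m_);
        open_ = false;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool open_ = false;
};

// Runs `steps` time steps on `workers` threads and returns the total
// number of work units performed (expected: workers * steps).
int run_sketch(int workers, int steps) {
    assert(workers >= 1 && steps >= 1);
    Gate world_gate, ready_gate, finished_gate;
    std::atomic<int> at_finish{0}, at_ready{0}, work_done{0};
    std::atomic<bool> running{true};

    auto worker = [&] {
        while (running) {
            work_done.fetch_add(1);                       // do_work()
            if (at_finish.fetch_add(1) + 1 == workers) {  // last to finish
                at_finish = 0;
                ready_gate.close();  // every worker is already past it
                world_gate.open();   // wake the controller
            }
            finished_gate.wait();    // until the controller says otherwise
            if (at_ready.fetch_add(1) + 1 == workers) {   // last to be ready
                at_ready = 0;
                finished_gate.close();  // every worker is already past it
                ready_gate.open();      // release the whole holding pen
            }
            ready_gate.wait();       // until the last worker is ready
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) pool.emplace_back(worker);

    for (int step = 1; running; ++step) {
        world_gate.wait();   // until all workers are at the finish gate
        world_gate.close();  // close the world gate behind the controller
        if (step == steps) running = false;  // simulation is done
        finished_gate.open();  // release workers to continue or to exit
    }
    for (auto& t : pool) t.join();
    return work_done;
}
```

Because a gate is only ever closed by the last thread to arrive at the following gate, no worker can still be on the wrong side of it, which is the property the two-gate arrangement is meant to provide.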

Synchronizing Shared State Across Threads#

When multiple simulation threads can read and write the same piece of data, that access needs to be coordinated to avoid data races. The right tool depends on what is being shared.

Simple shared values: use std::atomic#

For a single numeric value that is updated by multiple threads (a counter, a running total, a flag), a std::atomic is sufficient and requires no locking at all. Atomic operations (load, store, fetch_add, exchange, etc.) are guaranteed to be indivisible, so threads can update them concurrently without corruption:

std::atomic<int> shared_counter{0};

// safe to call from any thread, no lock needed
shared_counter += 10;

Use this wherever possible — it is cheaper than any locking approach.
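As a small runnable illustration of the point above, the sketch below has several threads increment one shared std::atomic<int>. The helper name count_concurrently is hypothetical and exists only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` threads each add 1 to a shared counter
// `per_thread` times; returns the final counter value.
int count_concurrently(int threads, int per_thread) {
    std::atomic<int> shared_counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                // A single indivisible read-modify-write; no lock needed.
                shared_counter.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared_counter.load();
}
```

Note that only the individual operation is atomic. An expression such as shared_counter = shared_counter + 10 performs a separate load and store, so another thread can update the value in between; use += or fetch_add for a read-modify-write.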

Compound operations and containers: use a lock#

A lock is needed when you need to perform several operations that must appear atomic as a group, or when modifying a data structure (e.g. std::vector, std::deque, std::unordered_map) whose internal state can be corrupted if two threads modify it simultaneously. POLARIS provides a lightweight lock type for this: one thread acquires the lock, does its work, and releases it — any other thread that tries to acquire the lock while it is held will wait until it is free.

The _lock type#

Lock variables are declared as the _lock type, generally as members on the class that owns the shared data:

_lock my_lock{0};  // std::atomic<unsigned int>, explicitly initialized to 0 (unlocked)

_lock is defined in libs/core/Threads.h. It must be initialized to 0 (unlocked). Two usage patterns are provided.

Option 1: ScopedLock (preferred outside of core)#

ScopedLock sl(my_lock);  // acquires on construction, releases on scope exit
// ... protected code ...

ScopedLock is an RAII wrapper that acquires the lock when constructed and releases it automatically when it goes out of scope. Prefer this in agent and simulator code because it is exception-safe (the lock is always released even if the protected code throws) and early-return-safe (no need to manually release before every return or break). The overhead over bare macros is minimal: one stack-allocated reference and a non-virtual destructor call that the compiler can inline.
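To illustrate why the RAII form is exception-safe, here is a minimal sketch of a guard over a std::atomic<unsigned int>. The names sketch_lock_t, SketchScopedLock and SharedLog are hypothetical, and the acquire loop is an assumption made for the example; the real ScopedLock in libs/core/Threads.h may differ.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <thread>
#include <vector>

using sketch_lock_t = std::atomic<unsigned int>;  // hypothetical stand-in for _lock

// Hypothetical RAII guard for illustration only.
class SketchScopedLock {
public:
    explicit SketchScopedLock(sketch_lock_t& lock) : lock_(lock) {
        // Spin until this thread is the one that flips 0 -> 1.
        while (lock_.exchange(1u, std::memory_order_acquire) != 0u) {
            std::this_thread::yield();
        }
    }
    // Runs on normal scope exit, early return, and stack unwinding alike.
    ~SketchScopedLock() { lock_.store(0u, std::memory_order_release); }

    SketchScopedLock(const SketchScopedLock&) = delete;
    SketchScopedLock& operator=(const SketchScopedLock&) = delete;

private:
    sketch_lock_t& lock_;
};

// Shared container owned together with the lock that protects it.
struct SharedLog {
    sketch_lock_t lock{0u};
    std::vector<int> entries;

    void add(int value) {
        SketchScopedLock guard(lock);
        if (value < 0) {
            // The guard's destructor still releases the lock here.
            throw std::invalid_argument("negative value");
        }
        entries.push_back(value);
    }
};
```

After add() throws, the lock is back at 0, so later callers are not blocked by the failed call.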

Option 2: LOCK / UNLOCK macros (core hot paths)#

LOCK(my_lock);
// ... protected code ...
UNLOCK(my_lock);

The explicit macro form. Use this in the simulation core where protected sections are extremely short and every nanosecond matters. It avoids even the minimal overhead of a destructor call. Take care to pair every LOCK with an UNLOCK on all exit paths.

Why spin locks under the hood#

Both options use the same spin lock mechanism (std::atomic<unsigned int>) rather than std::mutex. Three approaches were benchmarked at realistic thread counts (8–64) across both core internals and agent code (experiments conducted May 2026 by Jamie Cook):

| Approach | Outcome |
| --- | --- |
| std::mutex | Kernel-assisted sleep/wake adds hundreds of nanoseconds per pair; frequent cache-line invalidations on contested locks |
| std::mutex with try_lock spin-before-block | Similar latency once backoff kicks in; gives up cache locality just like plain mutex |
| Spin lock (chosen) | Keeps the owning cache line hot; waiters acquire with a single atomic exchange and no kernel transition |

At the contention levels typical in POLARIS — very short critical sections, few threads competing at once — spin locks were consistently fastest, both in core and in agent code.

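The sleep_for(0 ns) inside LOCK is intentional: it is meant to play the role of a spin-wait hint (the job the PAUSE instruction does on x86), signalling that the thread is in a spin-wait loop and reducing power consumption and memory-order penalties without adding measurable latency. Note that a zero-duration sleep_for is not guaranteed to emit PAUSE; what it does depends on the standard library implementation (it may return immediately or yield the thread), so the effect is worth verifying on each toolchain.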
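The mechanism described in this section (an atomic exchange in a loop, with a zero-length sleep as the back-off step) can be sketched as follows. The function names sketch_acquire, sketch_release and guarded_sum are hypothetical, and the memory orderings are an assumption; the actual LOCK and UNLOCK macros in libs/core/Threads.h may be written differently.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

using sketch_lock_t = std::atomic<unsigned int>;  // hypothetical stand-in for _lock

inline void sketch_acquire(sketch_lock_t& lock) {
    // exchange returns the previous value: 0 means this thread took the lock.
    while (lock.exchange(1u, std::memory_order_acquire) != 0u) {
        // Zero-length sleep as the back-off step; what this compiles to
        // depends on the standard library implementation.
        std::this_thread::sleep_for(std::chrono::nanoseconds(0));
    }
}

inline void sketch_release(sketch_lock_t& lock) {
    lock.store(0u, std::memory_order_release);
}

// Each thread increments a plain (non-atomic) integer under the lock.
long guarded_sum(int threads, int per_thread) {
    sketch_lock_t lock{0u};
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                sketch_acquire(lock);
                ++total;  // very short critical section
                sketch_release(lock);
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

The uncontended path is a single atomic exchange with no kernel transition, which matches the reasoning given for choosing spin locks at short critical-section lengths.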

When to revisit#

  • If critical sections grow substantially longer, or thread counts rise well above 64, a hybrid spin-then-block approach may become worthwhile.

  • On architectures where atomic read-modify-write is expensive (e.g. some RISC-V implementations) this should be re-evaluated.

Lock variables in core#

| Lock variable | Protects |
| --- | --- |
| _ex_lock (Simulation_Engine) | EX-level next-revision and queued-type list |
| _tex_lock (component managers) | TEX-level schedule and block activation queues |
| _ptex_lock (Execution/Event_Block) | PTEX-level schedule per block |
| _memory_lock (Execution/Event_Block) | Block memory pool allocation and free |
| _optex_lock (Execution/Event_Object, SAFE_MODE only) | Per-object rescheduling races |

Exception Handling#

Exception handling across threads is hard, and the current approach may be sub-optimal. When an exception is encountered on a thread, that thread calls the world->raise() method, which sets running = false and sets an exception_occurred flag.

The controller's while loop then exits and checks that flag, at which point the main thread raises a generic error ("Exception occcurred on thread, check your logs").

Things to improve:

  1. There is a count of running threads maintained in World which is decremented when threads finish or when they exit because of an exception. This should allow us to wait in the controller thread for worker threads to terminate in a normal sane manner rather than our current approach of “tell them the building is on fire and run for the door”.

  2. We aren’t re-raising the original exception - this is because there can (it has happened in practice) be multiple exceptions raised and it’s not easy to decide how to deal with that.
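One possible direction for both points, sketched under stated assumptions: workers store their exception as a std::exception_ptr, the controller joins all workers, and then rethrows the first one captured. ExceptionSink, run_and_collect and still_running are hypothetical names and not part of the current World class; which exception to surface when there are several remains a design decision.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <exception>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical collector for exceptions raised on worker threads.
class ExceptionSink {
public:
    // Called from a worker's catch block.
    void capture(std::exception_ptr error) {
        std::lock_guard<std::mutex> guard(mutex_);
        errors_.push_back(error);
    }
    std::size_t count() {
        std::lock_guard<std::mutex> guard(mutex_);
        return errors_.size();
    }
    // Called on the controller thread once all workers have terminated.
    void rethrow_first() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (!errors_.empty()) std::rethrow_exception(errors_.front());
    }
private:
    std::mutex mutex_;
    std::vector<std::exception_ptr> errors_;
};

// Runs `workers` threads; in this demo, threads with an odd id throw.
// Returns the number of exceptions captured.
std::size_t run_and_collect(int workers, ExceptionSink& sink) {
    std::atomic<int> still_running{workers};
    std::vector<std::thread> pool;
    for (int id = 0; id < workers; ++id) {
        pool.emplace_back([&, id] {
            try {
                if (id % 2 == 1) {
                    throw std::runtime_error("worker " + std::to_string(id) + " failed");
                }
            } catch (...) {
                sink.capture(std::current_exception());  // keep the original
            }
            still_running.fetch_sub(1);  // on normal and failed exit alike
        });
    }
    for (auto& t : pool) t.join();  // orderly wait for every worker
    assert(still_running == 0);
    return sink.count();
}
```

On the controller side, calling sink.rethrow_first() after run_and_collect returns surfaces the original exception type and message instead of a generic error, while the remaining captured exceptions stay available for logging.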