Building Orion: A C++23 Distributed Task Runtime Inspired by Ray


Over the past few months, I’ve been taking deep dives into influential open-source projects to understand exactly how they tick under the hood. One project that particularly caught my eye was Ray, a unified framework for scaling AI and Python applications that is widely used for distributed training of ML workloads.

I really wanted to understand Ray's internals, specifically how it handles distributed task scheduling, dependency resolution, and object storage. The best way to learn complex systems is to build them yourself, so I started Orion: a minimal, Ray-style distributed task runtime written in C++23.

Here is a look at the journey of building Orion so far.

v0.1: Nailing the Local Execution Engine

For the first version, my goal was to build a robust local runtime that could execute a Directed Acyclic Graph (DAG) of tasks using multiple worker threads. The core problem was defining what a "task" is and figuring out how to resolve its dependencies before it gets scheduled.

I designed the engine to be strictly dataflow-driven: a task executes only once all of its inputs (dependencies) are available. To avoid wasting CPU cycles, I relied heavily on mutexes and condition variables. Instead of constantly polling for ready tasks, the scheduler sleeps. The dependency checker is woken only when a worker thread completes a task and fires an event.
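To make the dataflow idea concrete, here is a minimal sketch of dependency counting. It is not Orion's actual code: `TaskId`, `DependencyTracker`, `add_task`, and `on_complete` are hypothetical names, and the sketch assumes a task is registered before any of its dependencies complete. Each task keeps a count of unmet dependencies, and completing a task decrements the count of everything that depends on it.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical names: not taken from the Orion source.
using TaskId = int;

class DependencyTracker {
public:
    // Registers a task together with the tasks it depends on.
    // Assumes none of `deps` has completed yet.
    // Returns true if the task has no dependencies and can run right away.
    bool add_task(TaskId id, const std::vector<TaskId>& deps) {
        pending_[id] = deps.size();
        for (TaskId dep : deps) {
            dependents_[dep].push_back(id);  // reverse edge: dep -> id
        }
        return deps.empty();
    }

    // Called when a task finishes. Returns the tasks whose last unmet
    // dependency was `id`, i.e. the tasks that just became ready.
    std::vector<TaskId> on_complete(TaskId id) {
        std::vector<TaskId> ready;
        auto it = dependents_.find(id);
        if (it == dependents_.end()) return ready;
        for (TaskId child : it->second) {
            auto p = pending_.find(child);
            assert(p != pending_.end() && p->second > 0);
            if (--p->second == 0) ready.push_back(child);
        }
        dependents_.erase(it);
        return ready;
    }

private:
    std::unordered_map<TaskId, std::size_t> pending_;             // unmet deps per task
    std::unordered_map<TaskId, std::vector<TaskId>> dependents_;  // who waits on whom
};
```

With tasks 1 and 2 as inputs to task 3, completing task 1 returns nothing, and completing task 2 returns task 3. The class is not thread-safe on its own; in a real scheduler it would sit behind the scheduler's mutex.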

During this build phase, I leveled up my systems programming skills significantly. I learned the nuances of thread synchronization: exactly when (and how safely) to put a thread to sleep, and how to wake it up using condition variables without causing race conditions.
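The sleep-and-wake pattern can be sketched roughly as follows. `ReadyQueue` and its methods are illustrative names, not Orion's API. The key detail is that the waiting side passes a predicate to `std::condition_variable::wait`, so the condition is re-checked under the lock. That protects against spurious wakeups and against a notification that arrives before the thread goes to sleep.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical sketch: a blocking queue of ready task ids.
class ReadyQueue {
public:
    // Producer side: called when a task becomes ready to run.
    void push(int task_id) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            assert(!closed_);  // pushing after close() is a caller bug
            items_.push_back(task_id);
        }
        cv_.notify_one();  // wake one sleeping consumer, outside the lock
    }

    // Signals shutdown so that sleeping consumers can exit.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Consumer side: sleeps until a task is available or the queue is closed.
    // Returns std::nullopt once the queue is closed and fully drained.
    std::optional<int> pop() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return std::nullopt;
        int id = items_.front();
        items_.pop_front();
        return id;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<int> items_;
    bool closed_ = false;
};
```

A consumer thread would typically loop on `pop()` until it returns `std::nullopt`. Because the state change happens under the mutex and the predicate is evaluated under the same mutex, a wakeup cannot be lost between the check and the sleep.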

Alongside the scheduler and worker pool, I built a thread-safe, in-memory Object Store. When a task is submitted to the scheduler, the user immediately receives an ObjectRef (a concept similar to a Future). Tasks store their results in this object store upon completion, and consumers use that ObjectRef to seamlessly query the store and retrieve the result without busy-waiting.
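A store like this might look something like the sketch below. Only the names `ObjectRef` and "object store" come from the design described above; the methods `create_ref`, `put`, `ready`, and `get` are my assumptions for illustration, and the store is templated on a single value type to keep the example short.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <utility>

// Handle returned to the caller at submission time, before the value exists.
struct ObjectRef {
    std::uint64_t id;
};

// Hypothetical sketch of a thread-safe, in-memory store for values of type T.
template <typename T>
class ObjectStore {
public:
    // Reserves a fresh id for a result that will be produced later.
    ObjectRef create_ref() {
        std::lock_guard<std::mutex> lock(mu_);
        return ObjectRef{next_id_++};
    }

    // Called by a worker when its task finishes. The first write wins.
    void put(ObjectRef ref, T value) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            objects_.emplace(ref.id, std::move(value));
        }
        cv_.notify_all();  // wake every consumer waiting on any ref
    }

    // Non-blocking check for whether the result has been stored yet.
    bool ready(ObjectRef ref) const {
        std::lock_guard<std::mutex> lock(mu_);
        return objects_.find(ref.id) != objects_.end();
    }

    // Sleeps (no busy-waiting) until the result is stored, then returns a copy.
    T get(ObjectRef ref) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return objects_.find(ref.id) != objects_.end(); });
        return objects_.find(ref.id)->second;
    }

private:
    mutable std::mutex mu_;
    std::condition_variable cv_;
    std::unordered_map<std::uint64_t, T> objects_;
    std::uint64_t next_id_ = 0;
};
```

In use, the submitting thread would call `create_ref()`, hand the ref to a worker, and later call `get(ref)`, which blocks until the worker calls `put`. A single shared condition variable is the simplest choice here; one condition variable per object would avoid waking unrelated waiters at the cost of more bookkeeping.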

v0.2: Taking It Distributed with gRPC

I was incredibly happy with how the local runtime turned out in v0.1, but a task execution framework isn't truly "Ray-style" unless it can scale horizontally. The natural next step was to distribute these tasks across multiple physical (or logical) machines.

To achieve this, I introduced the concept of a Cluster Head and Worker Nodes, bringing actual network communication into the fold using gRPC.

I built a distributed runtime layer on top of the existing architecture:

  • The Worker Node: Instead of rewriting the execution engine for the network, I designed the NodeRuntime as a clean abstraction that wraps the local runtime built in v0.1. The node simply acts as a gRPC server, listening for incoming tasks over the network and feeding them directly into its isolated local scheduler and worker pool.

  • The Cluster Head: This acts as the brain of the distributed setup. I implemented dynamic Node Registration, allowing worker nodes to announce themselves to the head upon startup. The head maintains a NodeRegistry to keep track of cluster membership.

  • Distributed Scheduling: A new ClusterScheduler replaces the local dispatcher. When a task is submitted to the head, the scheduler tracks cross-node dependencies and uses a Round-Robin algorithm to dispatch tasks across the registered worker nodes.
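The registration and Round-Robin parts of the list above can be sketched without any networking. This is a simplified illustration, not Orion's implementation: the name `NodeRegistry` comes from the description above, but the methods `register_node` and `next_node` and the address strings are assumptions. The gRPC layer is left out entirely, and node removal is not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: tracks cluster membership and picks nodes in rotation.
class NodeRegistry {
public:
    // Called when a worker announces itself to the head.
    // Returns false if the address is already registered.
    bool register_node(std::string address) {
        std::lock_guard<std::mutex> lock(mu_);
        for (const std::string& existing : nodes_) {
            if (existing == address) return false;
        }
        nodes_.push_back(std::move(address));
        return true;
    }

    // Round-robin selection: returns nodes in registration order and wraps
    // around. Returns std::nullopt when no node has registered yet.
    std::optional<std::string> next_node() {
        std::lock_guard<std::mutex> lock(mu_);
        if (nodes_.empty()) return std::nullopt;
        std::string chosen = nodes_[cursor_];
        cursor_ = (cursor_ + 1) % nodes_.size();  // nodes_ only grows, so cursor_ stays valid
        return chosen;
    }

private:
    std::mutex mu_;
    std::vector<std::string> nodes_;
    std::size_t cursor_ = 0;
};
```

With two registered nodes, successive calls to `next_node()` alternate between them. Round-robin ignores node load and data locality, which keeps it easy to reason about as a first scheduling policy.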

By the end of v0.2, I could spin up the Head server in one terminal, multiple Worker nodes in others, and seamlessly submit tasks via a gRPC client, watching the workload automatically distribute across the cluster.

What's Next?

Building Orion has been a fantastic masterclass in concurrency, synchronization, and distributed systems architecture in modern C++. I've successfully modeled the core DAG execution and distributed dispatching that make frameworks like Ray so powerful.

Looking ahead to future versions, the next major hurdles will be implementing fault tolerance (handling node crashes), building a true distributed object store (so nodes can fetch data from each other), and supporting dynamic task graphs.

You can check out the source code for Orion on my GitHub here: https://github.com/samarthmahendraneu/Orion