HyperQueue (HQ) is a specialized job scheduler that enhances the efficiency and scalability of computational tasks in HPC environments. It serves as a meta-scheduler, providing a lightweight layer between workflow managers and system schedulers such as Slurm and PBS.

HyperQueue has been designed to simplify the execution of large workflows (task graphs) on HPC clusters: it eliminates the need to manually submit each job through traditional batch schedulers such as Slurm or PBS. The user specifies what to compute; HyperQueue automatically requests computational resources and dynamically load-balances tasks across all allocated nodes and resources. HyperQueue can also be used without Slurm/PBS, as a general distributed task execution engine.
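To make this concrete, a minimal session might look as follows. This is an illustrative sketch based on the HyperQueue CLI; the partition name, time limit, and script name are placeholders, and exact flag syntax may differ between versions (check `hq --help` on your installation).

```shell
# Start the HQ server, e.g. on a login node (it keeps the task queue)
hq server start &

# Option A: start a worker manually on a node you already have allocated
hq worker start &

# Option B: let HQ request Slurm allocations on demand (automatic allocation);
# the partition name here is a placeholder
hq alloc add slurm --time-limit 1h -- --partition standard

# Submit 100 tasks as one array job; HQ load-balances them across
# whatever workers are currently connected
hq submit --array 1-100 -- ./compute.sh

# Inspect progress
hq job list
```

Note that workers attach to the server as they come online, so tasks submitted before any allocation is granted simply wait in the queue and start as soon as capacity appears.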

HyperQueue illustrates the MaX approach to workflow tools: modular, efficient components that can be combined to meet the demanding computational needs of materials science at the exascale. For more information about HyperQueue, please consult the IT4I documentation.


Key Features

  • Dynamic load balancing; users don’t need to pre-configure resources for specific tasks
  • Automatic worker creation and shutdown for efficient allocation of resources
  • Scales efficiently across hundreds of nodes while maintaining negligible overhead (below 0.1 ms per task)
  • Simple deployment as a single, statically linked binary
  • Support for virtual resources, so that appropriately tagged tasks are assigned to workers offering the required kind of resource
  • Ability to define task dependencies, essential for building complex workflow graphs
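As a sketch of the virtual-resource feature, a worker can advertise a custom resource and tasks can declare how much of it they need. The resource name (`gpus`), the amounts, and the script name below are illustrative placeholders, and the exact flag syntax should be verified against the HyperQueue documentation for your version.

```shell
# Start a worker that advertises four units of a custom "gpus" resource
hq worker start --resource "gpus=range(1-4)"

# Submit a task that will only be scheduled on workers providing that
# resource, consuming one unit while it runs
hq submit --resource gpus=1 -- ./train.sh
```

This mechanism lets a single worker pool serve heterogeneous tasks: tasks without resource requests run anywhere, while tagged tasks wait for a matching worker.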


Integration and Use

HyperQueue is already successfully deployed at major facilities including LUMI and other EuroHPC sites. Within the MaX ecosystem, it is integrated with AiiDA to provide efficient resource management for high-throughput workflows. Its lightweight design makes it particularly suitable for workflows that need to manage many small tasks alongside larger calculations.


Ongoing Development

In its third phase, MaX is actively working on enhancing HyperQueue for exascale workflows. Current development focuses on:

  • Server state checkpointing to increase resilience and allow continuity in long-running workflows
  • A data-handling layer that leverages node resources to offer an abstraction of a unified data space
  • More fine-grained dependency models that allow for partial data availability signaling
  • Integration with workflow managers to optimize resource usage at exascale