Optimizing heterogeneous memory use on modern CPUs


How intelligent data placement unlocks near-peak performance with limited high-bandwidth memory.


Application sectors: HPC systems & supercomputing, Computational materials science & physics simulations, Data-intensive engineering.
Keywords: HBM, NUMA, memory optimization, performance tuning, HPC tools.


Modern CPUs are evolving rapidly, with increasing core counts and vector capabilities driving unprecedented compute performance. However, this growth exposes a critical bottleneck: memory bandwidth. Traditional DDR memory struggles to keep up, especially for data-intensive scientific applications common in high-performance computing (HPC).

To address this, new processors integrate high-bandwidth memory (HBM) alongside conventional DDR memory, creating heterogeneous memory systems. While HBM offers significantly higher bandwidth, it comes with trade-offs such as increased latency and limited capacity. The central challenge is determining how to efficiently distribute application data across these memory types.

This work introduces a lightweight tool that analyzes and controls data placement at the level of individual memory allocations. The key finding is striking: only 60–75% of application data needs to reside in HBM to achieve around 90% of maximum performance across a wide range of benchmarks. This demonstrates that careful data placement, not full migration, is sufficient to unlock most of the performance benefits.

The study also reveals nuanced behavior in mixed-memory scenarios: some access patterns, such as HBM-to-DDR transfers, can significantly degrade performance, while others maintain near-optimal throughput. These insights highlight the importance of fine-grained control rather than coarse memory policies.

The proposed approach combines profiling and control into a unified toolchain, enabling dynamic analysis of memory usage. By instrumenting memory allocations and leveraging CPU performance counters (via Linux perf and instruction-based sampling), the tool correlates memory access behavior with specific data structures in the application. The key technological innovation lies in non-intrusive, allocation-level memory steering, enabling proactive optimization rather than reactive page migration. This bridges the gap between low-level hardware capabilities and application-level performance tuning.

Implications

This work demonstrates that efficient use of heterogeneous memory does not require full reliance on HBM. Instead, strategic placement of critical data structures can deliver near-peak performance while preserving scarce high-bandwidth resources. For HPC practitioners, this has immediate implications:

  • Larger problem sizes can be tackled without exceeding HBM capacity
  • Performance tuning can focus on key allocations rather than entire datasets
  • Applications can become more portable across heterogeneous architectures


In domains such as materials science, computational fluid dynamics, and medical simulation, where memory bandwidth is often the limiting factor, these findings enable more efficient use of next-generation hardware. Looking ahead, integrating such tools into automated workflows or runtime systems could enable self-optimizing applications, adapting memory placement dynamically during execution.

Interested in applying these techniques to your simulation codes or MaX workflows? Get in touch with the MaX Centre of Excellence or explore available tools and repositories to start optimizing your applications today.


Reference paper

F. Vaverka, O. Vysocky, and L. Riha, "Heterogeneous Memory Pool Tuning," 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), https://doi.org/10.1109/IPDPSW66978.2025.00141.