The new release of Quantum ESPRESSO (qe-v7.0) brings a significant performance improvement to the QE Car-Parrinello (CP) quantum engine on machines with CUDA GPGPUs. The work done by the MaX centre (CINECA and SISSA groups) eliminates a few significant bottlenecks that hindered the efficient use of CP with accelerators, allowing this new version of CP to reduce the number of node-hours needed for a simulation by roughly a factor of 10.
All operations on plane waves are now offloaded to GPUs, and the code reaches optimal performance with a smaller number of MPI processes. This is particularly advantageous for molecular dynamics (MD) simulations in HPC centers, where workload managers may penalize jobs that request a large number of nodes for very long times.
For example, with the new version of CP, a simulation with 108 ammonia molecules per cell runs on 2 nodes of CINECA M100 at a rate of 1 step per second, whereas the non-accelerated version takes 3 seconds per step on 4 nodes of Marconi A3.
For larger simulations, such as the ZrO2 benchmark with 792 atoms, CP takes 14 seconds per step on 2 nodes of M100, compared with 54 seconds per step for the non-accelerated version on 8 nodes of Marconi A3.
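
To make the node-hour saving concrete, the timings quoted above can be converted into node-seconds per MD step (seconds per step multiplied by the number of nodes). The short sketch below does only that arithmetic. The dictionary keys and function names are illustrative and not part of Quantum ESPRESSO, and the comparison assumes that a node of M100 and a node of Marconi A3 are counted equally, although they are different hardware.

```python
# Node-seconds per step = wall time per step * number of nodes.
# Timings are the ones quoted in the text; all names here are hypothetical.

def node_seconds_per_step(seconds_per_step, nodes):
    """Resource cost of one MD step, in node-seconds."""
    return seconds_per_step * nodes

# (seconds per step, nodes) for the accelerated and non-accelerated runs
benchmarks = {
    "ammonia_108": {"gpu": (1.0, 2), "cpu": (3.0, 4)},
    "zro2_792": {"gpu": (14.0, 2), "cpu": (54.0, 8)},
}

def cost_ratio(case):
    """How many times fewer node-seconds the GPU run needs per step."""
    gpu = node_seconds_per_step(*case["gpu"])
    cpu = node_seconds_per_step(*case["cpu"])
    return cpu / gpu

ratios = {name: cost_ratio(case) for name, case in benchmarks.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x fewer node-seconds per step")
# ammonia_108: 6.0x fewer node-seconds per step
# zro2_792: 15.4x fewer node-seconds per step
```

Under that assumption, the two benchmarks give reductions of about 6x and 15x, which bracket the factor of 10 quoted for the release.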