GPUs, portability and refactoring: these are the keywords of this second phase of MaX. The flagship code developers have been working on several key pillars of the transition to accelerated architectures, such as outsourcing standard mathematical operations and adopting new programming models.
In their latest official reports (D1.4, D2.2, and D3.3), the code developers outline their most relevant achievements. These start with the advancement of modularisation and the creation of new libraries and interfaces, which improve the sustainability of development and the flexibility of the codes. For each code, the regular MPI/OpenMP version and the version for heterogeneous architectures are completing their convergence towards a common code base, where this is not already in place. Above all, the high-level parts of the codes have been made more portable and architecture-agnostic through the introduction of new general interfaces for domain-specific computational kernels. In addition, improvements have been made to algorithms, interoperability and I/O functionalities.
See also the news on the latest code releases.
Please find below some key aspects of the main advancements on a per-code basis.
Version v6.7 of Quantum ESPRESSO has been available for download since December 2020, with the new possibility of compiling the GPU-enabled version and running it on the CPUs of non-accelerated platforms as well. The code has now been reorganised into four layers. In this way, the high-level code (applications, quantum engines and property calculators) is separated from the parts of the code that implement lower-level functionalities. Other relevant new developments concern Hubbard-corrected functionals and the modularisation of Density-Functional Perturbation Theory (DFPT).
In the Siesta code v4.1, one of the major milestones is the GPU acceleration of the diagonalization solver, achieved through the use of appropriate libraries, following the MaX spirit of separation of concerns. Among the many improvements, we mention here the incorporation of the PSolver library, which provides the important capability of performing simulations without imposing periodic boundary conditions.
YAMBO v5.0 is a release with a large number of changes, beginning with the revamped user interface. Moreover, a new feature has been implemented: descriptors, objects that encapsulate all database variables in a compact way. They are written to databases and can be dumped to human-readable output files with a single call to a specific routine, greatly reducing the duplication of code lines. Memory tracking has been further extended and is used extensively within the code base; separate tracking of CPU and GPU memory has been implemented.
Work on release 5 of the FLEUR code has centred on more advanced refactoring, aimed at improving the usability of the code, separating property calculators from the more computationally intensive kernels and incorporating new functionalities. The evaluation of spectral properties now has a refactored property calculator framework, characterized by encapsulated functionality that makes extension and maintenance easier.
COSMA is a parallel, high-performance, GPU-accelerated, matrix-matrix multiplication algorithm that is communication-optimal for all combinations of matrix dimensions, number of processors and memory sizes, without the need for any parameter tuning.
Finally, the new library XC_lib is built for the calculation of the exchange and correlation (XC) energies and potentials, a frequent and compute-intensive task that involves evaluating given functional expressions over many thousands of grid points. To make the XC part architecture-agnostic, it has been isolated, encapsulated and refactored as an autonomous library, which will be used in future releases of Quantum ESPRESSO. Drivers for the derivative of the XC potential are included too.
Regarding the domain-specific libraries: the SIRIUS API was refactored to contain only Fortran subroutines, with an optional error code parameter as the last argument; and SPLA (Specialized Parallel Linear Algebra), a domain-specific library, has been developed at CSCS. This library takes care of the two special matrix-matrix multiplications arising in the iterative solvers of plane-wave DFT codes.
DBCSR is a library designed to efficiently perform sparse matrix-matrix multiplication, among other operations. It is MPI and OpenMP parallel and can exploit Nvidia and AMD GPUs via CUDA and HIP.
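The idea behind sparse matrix-matrix multiplication on blocked matrices can be sketched in a few lines. The example below assumes a toy storage scheme, a dictionary mapping (block row, block column) to a small dense block, and the function names are hypothetical; it does not reproduce the DBCSR API, its data layout or its parallelisation, only the principle that products are formed solely for pairs of stored blocks.

```python
def dense_matmul(a, b):
    """Multiply two small dense blocks stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def add_into(target, block):
    """Accumulate block into target in place (same shape assumed)."""
    for i, row in enumerate(block):
        for j, value in enumerate(row):
            target[i][j] += value


def block_sparse_matmul(a_blocks, b_blocks):
    """Compute C = A * B, visiting only the blocks that are actually stored."""
    # Index the blocks of B by block row so each A block finds its partners.
    b_by_row = {}
    for (k, j), block in b_blocks.items():
        b_by_row.setdefault(k, []).append((j, block))

    c_blocks = {}
    for (i, k), a_block in a_blocks.items():
        for j, b_block in b_by_row.get(k, []):
            product = dense_matmul(a_block, b_block)
            if (i, j) in c_blocks:
                add_into(c_blocks[(i, j)], product)
            else:
                c_blocks[(i, j)] = product
    return c_blocks
```

In this picture, the many small dense products are the part that a library would typically hand over to tuned CPU or GPU kernels, while the bookkeeping over block indices stays architecture-independent.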
BigDFT version 1.9.1 was released in December 2020. The code has further enhanced its modularity: the BigDFT suite is compiled as a layered stack of multiple libraries, most of which are developed outside the BigDFT consortium.
The CP2K code, now available at v8.1, relies heavily on external libraries for its performance portability, as the code itself is written in an architecture-agnostic way, without any architecture-specific implementations. In particular, the DBCSR and ScaLAPACK libraries must perform well on a given platform in order to achieve fast execution in O(N) and RPA calculations. As such, the effort of optimising and porting CP2K to new architectures was channelled into the performance tuning of the DBCSR and COSMA libraries.