Running on HPC

General remarks

The instructions below are intended to help you compile and run ONETEP with parallelisation and/or threading, and will hopefully help you optimise the performance of your ONETEP runs. However, they are by no means complete. In particular, the commands required to compile, link or run binaries are system-dependent. While you may gain some insights from the config files distributed with the code, we can’t replace the support provided by your system administrator. The same applies to the particularities of your hardware. We provide detailed documentation for some common HPC systems (e.g., Iridis5, ARCHER2, Michael and Young) and typical Red Hat desktops in the hpc_resources directory of the ONETEP distribution.

Tasks ≤ physical cores

The following explanations assume a node > NUMA region > processor > core hierarchy. In general, the number of tasks (processes times threads) at any level of the hierarchy should not be larger than the number of cores within. Note that, when we say “core”, we mean a “physical core”, not a “logical” one: some systems have physical (hardware) cores which can multitask thanks to technologies such as Intel® Hyper-Threading Technology, and some parts of the operating system may think there are more (logical) cores available than there physically are (typically two logical cores per physical core). Due to the nature of the computations performed by ONETEP, there’s no benefit in trying to exploit this kind of multitasking (quite likely the opposite).
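
For example, on a typical Linux system you can check how many physical cores are available (as opposed to logical ones) by inspecting the output of lscpu:

lscpu | grep -E 'Thread|Core|Socket|NUMA'

The number of physical cores is Socket(s) × Core(s) per socket; if Thread(s) per core is greater than 1, the operating system exposes more logical cores than physical ones, and you should still limit the number of tasks to the physical core count.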

MPI parallelisation

Compilation

In order to generate an MPI-enabled binary (an illustrative compile line is given after this list):

  • You need to set at least the MPI flag in your compilation command (usually achieved by adding -DMPI).
  • Like any MPI-capable code, ONETEP requires the compiler to provide access to the system’s MPI Fortran support. The compiler may do so either by default or by requiring, as part of the compilation command, an explicit include path to the include directory of the local MPI installation (-I/path/to/include_dir). By default, ONETEP tries to USE the MPI module of the installation, but this can be downgraded to just including the installation’s mpif.h file via the compilation flag USE_INCLUDE_MPIFH (add -DUSE_INCLUDE_MPIFH to the compilation commands).
  • MPI-IO functionality is enabled by default; if you need to disable it, use the compilation flag NOMPIIO (add -DNOMPIIO).
  • You may set the SCALAPACK flag (add -DSCALAPACK).
  • You need to link the binary against the MPI libraries.
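
Putting the points above together, the relevant parts of a hypothetical compile-and-link command might look as follows; the mpif90 wrapper name, paths and library names are illustrative and system-dependent (a wrapper like mpif90 usually supplies the include path and MPI libraries for you):

mpif90 -DMPI -DSCALAPACK -I/path/to/include_dir ... -o onetep -lscalapack

where ... stands for the remaining flags and source files. In practice these settings belong in the config files distributed with the code.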

Running the binary

You may gain some efficiency by pinning MPI processes to the physical cores (if your system allows this).
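
For example, with Open MPI you can bind each MPI process to a core, and under Slurm srun offers similar control; the binary name, input file and process count below are illustrative:

mpirun -np 16 --bind-to core ./onetep input.dat
srun -n 16 --cpu-bind=cores ./onetep input.dat

Consult your system’s documentation for the binding mechanism available with your MPI launcher or scheduler.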

OpenMP threading

General remarks

  • If you are a beginner, you are advised not to use any threading keywords in your input file. Instead, use the onetep_launcher utility script, which will take care of the OpenMP threading settings for you automatically.
  • At the moment, threading is controlled by 5 keywords:
    • threads_max
    • threads_num_fftboxes
    • threads_per_fftbox
    • threads_per_cellfft
    • threads_num_mkl
  • These keywords are used to set the number of threads used by each individual (possibly MPI) process in different parts of the code. Please look up these keywords in the keyword database for further information.
  • At the moment,
    • Threading enabled by threads_max and threads_num_fftboxes is considered stable and is encouraged. Having threads_max=threads_num_fftboxes is reasonable. They are affected by compilation and run-time defaults.
    • Threading enabled by threads_per_fftbox and threads_per_cellfft is less mature. Setting threads_per_cellfft=threads_max is reasonable. Keeping threads_per_fftbox=1 is recommended. They are unaffected by compilation or run-time defaults.
    • threads_num_mkl: maximum number of threads to use in Intel MKL routines.
  • The maximum number of threads used by a process is the maximum of threads_max, threads_per_cellfft, and the product threads_per_fftbox * threads_num_fftboxes (see the worked example after this list).
  • The number of FFTboxes simultaneously held in memory is controlled by the keyword fftbox_batch_size. Because some operations work on two FFTboxes simultaneously per thread, there is little point in having threads_num_fftboxes > fftbox_batch_size / 2.
  • Each process should run within a NUMA region. The maximum number of threads should never be larger than the number of cores in that NUMA region.
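
As a worked example (all values illustrative), consider processes confined to a 4-core NUMA region. A set of threading keywords consistent with the advice above might be:

threads_max            4
threads_num_fftboxes   4
threads_per_fftbox     1
threads_per_cellfft    4
fftbox_batch_size      8

The maximum number of threads per process is then max(4, 4, 1 × 4) = 4, which fits within the NUMA region, and threads_num_fftboxes does not exceed fftbox_batch_size / 2.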

Compilation

  • You can specify a default number of threads via the DEFAULT_THREADS flag (e.g., if you want to set the default to 4 threads, you should add -DDEFAULT_THREADS=4 to your compilation command).
    • This default will only affect threads_max and threads_num_fftboxes.
    • Otherwise, the default number of threads is one.
  • You need to tell your compiler that the ONETEP source files contain OpenMP directives (this requires a compiler-dependent flag; see the example after this list).
  • You need to link the binary against the OpenMP libraries.
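
For instance, the OpenMP flag is -fopenmp for gfortran and -qopenmp for recent Intel compilers; these flags typically also take care of linking the OpenMP runtime. A hypothetical hybrid MPI + OpenMP compile command (names and values illustrative) might contain:

mpif90 -DMPI -DDEFAULT_THREADS=4 -fopenmp ... -o onetep

where ... stands for the remaining flags and source files.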

Running the ONETEP executable

Memory

If you are a beginner, you are advised to use the onetep_launcher utility script. It will take care of stack sizes (both the global one and the OpenMP stack size) automatically. You can then ignore this section.

In order to run ONETEP, each process and/or thread requires a sufficiently large memory stack. Your system may provide large enough stack sizes by default, but if they are insufficient you will need to increase them.

  1. If you are running ONETEP from a shell of the bash family, you can lift any artificial limitation on the stack available to each (MPI) process by running ulimit -s unlimited before executing the ONETEP binary. Naturally, some limitations, such as total physical memory, will still apply, but ONETEP should never reach those limits via stack memory. By default, ONETEP checks at initialisation that your stack is reasonably large, and aborts with an informative message if it believes that you should increase it.
  2. If you are running an OpenMP-enabled binary, the command above will only affect the stack size of the master thread of each (MPI) process. The stack size of all other threads is controlled at runtime via the environment variable OMP_STACKSIZE; if the variable is not defined in the environment, the Fortran runtime will use a default value, usually of the order of 4 MiB. If you are running ONETEP from bash or a POSIX-compliant shell, you may set its value to, say, 64 MiB by executing export OMP_STACKSIZE=64M before starting ONETEP (a combined example follows this list). Please bear in mind the following:
    • If your OMP_STACKSIZE is too small your simulation may crash with a runtime error that explicitly mentions a problem with the stack, but it is also possible (and actually very likely) that the error you receive is a generic SIGINT or SIGSEGV which does not mention the stack at all.
    • The OMP_STACKSIZE that a given simulation requires may depend on the system and compiler that you use. In particular, there seems to be a bug in v16 of the Intel Fortran Compiler such that ONETEP binaries compiled with it require much larger values of OMP_STACKSIZE than binaries compiled with gfortran or another version of the Intel compiler (≤v15 or ≥v17).
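
Putting the above together, a typical preamble for launching an OpenMP-enabled binary from bash might be as follows; the launcher, process count, thread count and file names are illustrative:

ulimit -s unlimited
export OMP_STACKSIZE=64M
export OMP_NUM_THREADS=4
mpirun -np 8 ./onetep input.dat

(OMP_NUM_THREADS is discussed in the next sub-section.)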

Number of OpenMP threads

  • Run-time defaults: threads_max and threads_num_fftboxes may be set at runtime via the environment variable OMP_NUM_THREADS. E.g., from bash or a POSIX-compliant shell, if you want to set this default to 2, before running the ONETEP binary you should execute
    export OMP_NUM_THREADS=2
  • What will happen if you don’t export any value for OMP_NUM_THREADS before running the binary?
    • If OMP_NUM_THREADS is given a value in the initialisation files of your shell, or anywhere else, so that it is in fact defined when you run ONETEP, the code will act as if you had set it yourself. You probably want to check that this doesn’t happen inadvertently.
    • If OMP_NUM_THREADS is not defined at run-time, the OpenMP libraries will still provide a default, which we would normally abide by.
      • If we detect a hybrid MPI+OpenMP compilation running with more than one MPI process, we will discard this default due to the high risk of CPU oversubscription.
      • Otherwise, we will use this value as the run-time default.
  • Run-time defaults have precedence over compilation defaults.
  • Values for all keywords, including those related to threading, can be set in the input file. Values defined in the input file have complete precedence over conflicting numbers of threads defined in any other way (compilation or run-time defaults); see the worked example below.
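
As a worked example of this precedence (all values illustrative): if the binary was compiled with -DDEFAULT_THREADS=4, you export OMP_NUM_THREADS=2, and your input file sets threads_max 8, ONETEP will run with threads_max = 8, because the input file overrides the run-time default, which in turn overrides the compilation default.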

If, in a run with an OpenMP-enabled binary, you set neither OMP_NUM_THREADS nor threads_max and threads_num_fftboxes in your input file, ONETEP will try its best to guess how many threads to use based on your system and compilation defaults, but this may lead to unwanted setups with poor performance. You very probably want to avoid this!

For further information on how the compilation or run-time defaults are affecting your run, please increase the verbosity of ONETEP’s output by setting the keyword output_detail to NORMAL (main relevant detail) or VERBOSE (full detail).

MPI levels of thread support

This sub-section only applies to hybrid MPI + OpenMP compilations.

MPI libraries may provide four different levels of thread support; they are, in increasing order of thread support: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE.

Hybrid MPI + OpenMP runs of ONETEP require MPI_THREAD_SERIALIZED.

In the initialisation stage, ONETEP checks the provided level of thread support:

  • If the provided level is the required one (MPI_THREAD_SERIALIZED) or higher (MPI_THREAD_MULTIPLE), everything should be fine.
  • If the provided level is lower, namely MPI_THREAD_FUNNELED, ONETEP will issue a severe warning. ONETEP usually produces correct results when this level of threading is provided, but according to the MPI standard the parallelisation + threading is not guaranteed to be correct. This is therefore allowed but strongly discouraged.
  • If the provided level is the lowest, MPI_THREAD_SINGLE, running ONETEP with multiple threads is not viable, and ONETEP aborts with an informative message.

This behaviour was implemented in ONETEP v4.5.8.14 (r1551) in March 2017, at a time when most MPI libraries supported MPI_THREAD_SERIALIZED. Older versions of ONETEP still technically required MPI_THREAD_SERIALIZED for their communications, but the rationale for the check and the logic of the warnings were different (due to earlier poorer support of MPI_THREAD_SERIALIZED by MPI libraries), and ONETEP would never abort if the support level was insufficient.

Data corruption with Intel MPI 2017

We have observed data corruption in communications when ONETEP is compiled with the Intel Compiler v17 + Intel MPI 2017 and communications take place over InfiniBand. Apparently, there is an incompatibility between the optimisations that ifort v17 uses for allocating memory and the way Intel MPI 2017 caches the data for the InfiniBand transfer.

This cache can be enabled/disabled via the environment variable I_MPI_OFA_TRANSLATION_CACHE. Unfortunately, according to the Intel MPI documentation, the cache is enabled by default, despite possibly producing wrong results: “The cache substantially increases performance, but may lead to correctness issues in certain situations.”

So far, disabling this cache via the environment variable
export I_MPI_OFA_TRANSLATION_CACHE=0
seems to prevent the data corruption with no noticeable performance hit.

onetep_launcher

The onetep_launcher utility script (by Jacek Dziedzic) can be used to control most of the settings discussed on this page. As of 29th August 2017, it can be used to set

  • OMP_STACKSIZE (-o),
  • OMP_NUM_THREADS (-t),
  • per-process stack size (-s),
  • maximum allowed core file size (-c),
  • ifort environment variables to produce core files on RTL’s severe errors (-d), and
  • Intel MPI OFA translation cache (-m).

onetep_launcher provides reasonable defaults for all these parameters, but they may need to be further adjusted. The built-in documentation / help functionality can be accessed by executing the onetep_launcher script without an input file.
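
As an illustration only, using just the -t and -o options listed above, an invocation requesting 4 OpenMP threads and a 64 MiB OMP stack size might look something like this (process count illustrative):

mpirun -np 8 onetep_launcher -t 4 -o 64M [remaining arguments and input file]

The exact invocation syntax, in particular how the ONETEP binary and input file are passed, is best obtained from the built-in help.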

Things to do

Check the compatibility of threads_num_mkl (mind that in the code it is not used to compute threads_max_possible). Some of these bits might belong to an MKL/FFTW documentation page.