General remarks
The instructions given below are intended to provide support on how to compile and run ONETEP exploiting parallelisation and/or threading. This advice will hopefully help you to optimise the performance of your ONETEP runs. However, it is by no means complete. In particular, the commands required to compile, link or run binaries are system-dependent. While you may gain some insights from the config files distributed with the code, we can't replace the support provided by your system administrator. The same applies to the particularities of your hardware. We provide detailed documentation for some common HPC systems (e.g., Iridis5, ARCHER2, Michael and Young) and typical Red Hat desktops in the hpc_resources directory of the ONETEP distribution.
Tasks ≤ physical cores
The following explanations assume a node > NUMA region > processor > core hierarchy. In general, the number of tasks (processes times threads) at any level of the hierarchy should not be larger than the number of cores within. Note that, when we say “core”, we mean a “physical core”, not a “logical” one: some systems have physical (hardware) cores which can multitask thanks to technologies such as Intel® Hyper-Threading Technology, and some parts of the operating system may think there are more (logical) cores available than there physically are (typically two logical cores per physical core). Due to the nature of the computations performed by ONETEP, there’s no benefit in trying to exploit this kind of multitasking (quite likely the opposite).
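If you are unsure how many physical cores and NUMA regions your machine actually has, on most Linux systems the lscpu command will tell you (the exact field names can vary between versions):

    # Show sockets, physical cores, hardware threads per core and the NUMA layout.
    # "Thread(s) per core: 2" indicates Hyper-Threading; count only physical cores.
    lscpu | grep -E 'Socket|Core|Thread|NUMA'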
MPI parallelisation
Compilation
In order to generate an MPI-enabled binary:
- You need to set at least the MPI flag in your compilation command (usually achieved by adding -DMPI).
- Like any MPI-capable code, ONETEP requires the compiler to provide access to the MPI Fortran support of the system. The compiler may do so either by default or by requiring, as part of the compilation command, an explicit include path to the include directory of the local MPI installation (-I/path/to/include_dir). By default, ONETEP tries to USE the MPI module of the installation, but this can be downgraded to just including the mpif.h file of the installation via the compilation flag USE_INCLUDE_MPIFH (add -DUSE_INCLUDE_MPIFH to the compilation commands).
- MPI-IO functionality is enabled by default; if you need to disable it, please use the compilation flag NOMPIIO (add -DNOMPIIO).
- You may set the SCALAPACK flag (add -DSCALAPACK).
- You need to link the binary against the MPI libraries.
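Purely as an illustration, a hand-written compile-and-link sequence for an MPI build might look like the sketch below. The compiler wrapper, include path, source file and library names are assumptions made for this example; in practice you should start from one of the config files shipped with ONETEP and the settings recommended for your machine.

    # Hypothetical example: compile one source file and link an MPI-enabled binary.
    # mpifort is a common MPI Fortran wrapper; your system may provide mpif90, ftn, etc.
    mpifort -O2 -DMPI -DSCALAPACK -I/path/to/mpi/include -c example_module.F90
    mpifort -o onetep_mpi_example *.o -lscalapack -llapack -lblas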
Running the binary
You may gain some efficiency by pinning MPI processes to the physical cores (if your system allows this).
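How pinning is requested depends on your MPI launcher; the lines below are examples for two common implementations (process counts and the binary/input names are placeholders):

    # Open MPI: bind each MPI process to a physical core.
    mpirun -np 16 --bind-to core ./onetep example.dat

    # Intel MPI: request pinning via an environment variable.
    export I_MPI_PIN=1
    mpirun -np 16 ./onetep example.dat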
OpenMP threading
General remarks
- If you are a beginner user, you are advised not to use any threading keywords in your input file. Instead, use the onetep_launcher utility script: it will take care of the OpenMP threads for you automatically.
- At the moment, threading is controlled by five keywords:
  - threads_max
  - threads_num_fftboxes
  - threads_per_fftbox
  - threads_per_cellfft
  - threads_num_mkl

  These keywords are used to set the number of threads used by each individual (possibly MPI) process in different parts of the code. Please look up these keywords in the keyword database for further information.
- At the moment:
  - Threading enabled by threads_max and threads_num_fftboxes is considered stable and is encouraged. Having threads_max = threads_num_fftboxes is reasonable. These keywords are affected by compilation and run-time defaults.
  - Threading enabled by threads_per_fftbox and threads_per_cellfft is less mature. Setting threads_per_cellfft = threads_max is reasonable. Keeping threads_per_fftbox = 1 is recommended. These keywords are unaffected by compilation or run-time defaults.
  - threads_num_mkl sets the maximum number of threads to use in Intel MKL routines.
- The maximum number of threads used by a process is the maximum of threads_max, threads_per_cellfft, and the product threads_per_fftbox * threads_num_fftboxes.
- The number of FFTboxes simultaneously held in memory is controlled by the keyword fftbox_batch_size. Because some operations work on two FFTboxes simultaneously per thread, there is little point in having threads_num_fftboxes > fftbox_batch_size / 2.
- Each process should run within a NUMA region: the maximum number of threads should never be larger than the number of cores in that NUMA region (see the sketch below).
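As an illustration of the last point, a hybrid launch on a node with two NUMA regions of 12 cores each might look like the sketch below. The mapping options are Open MPI syntax, and the node layout, binary and input names are assumptions; other launchers use different flags.

    # Hypothetical hybrid run: one MPI process per NUMA region, 12 OpenMP threads
    # each, so no process's threads spill across a NUMA boundary.
    export OMP_NUM_THREADS=12
    mpirun -np 2 --map-by numa --bind-to numa ./onetep example.dat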
Compilation
- You can specify a default number of threads via the DEFAULT_THREADS flag (e.g., if you want to set the default to 4 threads, you should add -DDEFAULT_THREADS=4 to your compilation command).
  - This default will only affect threads_max and threads_num_fftboxes.
  - Otherwise, the default number of threads is one.
- You need to tell your compiler that the ONETEP source files contain OpenMP directives (this requires a compiler-dependent flag).
- You need to link the binary against the OpenMP libraries.
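Again purely as an illustration (the wrapper, optimisation level and file names are placeholders), an OpenMP-enabled compile line for a gfortran-based toolchain might look like:

    # -fopenmp enables OpenMP for gfortran (recent Intel compilers use -qopenmp);
    # DEFAULT_THREADS=4 sets the compile-time default described above.
    mpifort -O2 -fopenmp -DMPI -DDEFAULT_THREADS=4 -c example_module.F90
    mpifort -fopenmp -o onetep_omp_example *.o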
Running the ONETEP executable
Memory
If you are a beginner user, you are advised to use the onetep_launcher utility script. It will take care of stack sizes (both the global one and the OMP stack size) automatically. You can then ignore this section.
In order to run ONETEP, each process and/or thread will require a big enough memory stack. Your system may provide large enough stack sizes by default, but if they are insufficient you will need to increase them.
- If you are running ONETEP from a shell of the bash family, you can lift any artificial limitation on the stack available to each (MPI) process by running ulimit -s unlimited before executing the ONETEP binary. Naturally, some limitations, like total physical memory, will still apply, but ONETEP should never reach those limits via stack memory. By default, ONETEP checks at initialisation that your stack is reasonably large, and aborts with an informative message if it believes that you should increase it.
- If you are running an OpenMP-enabled binary, the command above will only affect the stack size of the master thread of each (MPI) process. The stack size of all other threads is controlled at run-time via the environment variable OMP_STACKSIZE; if the variable is not defined in the environment, the Fortran runtime will use a default value, usually of the order of 4 MiB. If you are running ONETEP from bash or a POSIX-compliant shell, you may set its value to, say, 64 MiB by executing export OMP_STACKSIZE=64M before starting ONETEP (see the example after this list). Please bear in mind the following:
  - If your OMP_STACKSIZE is too small, your simulation may crash with a runtime error that explicitly mentions a problem with the stack, but it is also possible (and actually very likely) that the error you receive is a generic SIGINT or SIGSEGV which does not mention the stack at all.
  - The OMP_STACKSIZE that a given simulation requires may depend on the system / compiler that you use. In particular, there seems to be a bug in v16 of the Intel Fortran Compiler such that runs of ONETEP binaries compiled with it require much larger values of OMP_STACKSIZE than if they were compiled with gfortran or another version of the Intel Compiler (≤ v15 or ≥ v17).
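Putting the two settings together, a minimal sketch for a bash-family shell is given below; 64M is just the example value used above, the binary and input names are placeholders, and on some clusters these lines need to go in your job script so that they take effect on every node.

    # Lift the stack limit for the MPI processes and set the per-thread OpenMP stack,
    # then launch ONETEP.
    ulimit -s unlimited
    export OMP_STACKSIZE=64M
    mpirun -np 8 ./onetep example.dat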
Number of OpenMP threads
- Run-time defaults: threads_max and threads_num_fftboxes may be set at run-time via the environment variable OMP_NUM_THREADS. E.g., from bash or a POSIX-compliant shell, if you want to set this default to 2, you should execute export OMP_NUM_THREADS=2 before running the ONETEP binary.
- What will happen if you don't export any value for OMP_NUM_THREADS before running the binary?
  - If OMP_NUM_THREADS is given a value in the initialisation files of your shell, or anywhere else, so that it is indeed defined when you run ONETEP, the code will act as if the value had been set by you. You probably want to check that this doesn't happen to you.
  - If OMP_NUM_THREADS is not defined at run-time, the OpenMP libraries will still provide a default:
    - If we detect a hybrid MPI+OpenMP compilation running with more than one MPI process, we will discard this default due to the high risk of CPU oversubscription.
    - Otherwise, we will use this value as the run-time default.
- Run-time defaults take precedence over compilation defaults.
- Values for all keywords, including those related to threading, can be set in the input file. Values defined in the input file take complete precedence over conflicting numbers of threads defined in any other way (compilation or run-time defaults).
If, in a run with an OpenMP-enabled binary, you don't set either OMP_NUM_THREADS or threads_max and threads_num_fftboxes in your input file, ONETEP will try its best to guess how many threads to use based on your system and compilation defaults, but this may lead to unwanted setups with poor performance. You very probably want to avoid this!

For further information on how the compilation or run-time defaults are affecting your run, please increase the verbosity of ONETEP's output by setting the keyword output_detail to NORMAL (main relevant detail) or VERBOSE (full detail).
MPI levels of thread support
This sub-section only applies to hybrid MPI + OpenMP compilations.
MPI libraries may provide four different levels of thread support; in increasing order of thread support, they are: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE.

Hybrid MPI + OpenMP runs of ONETEP require MPI_THREAD_SERIALIZED.
In the initialisation stage, ONETEP checks the provided level of thread support:
- If the provided level is the required one (MPI_THREAD_SERIALIZED) or higher (MPI_THREAD_MULTIPLE), everything should be fine.
- If the provided level is lower and equal to MPI_THREAD_FUNNELED, ONETEP will issue a severe warning. ONETEP usually produces correct results when this level of threading is provided, but according to the MPI standard the parallelisation + threading is not guaranteed to be correct. This is therefore allowed but strongly discouraged.
- If the provided level is the lowest (MPI_THREAD_SINGLE), running ONETEP with multiple threads is unviable; hence, ONETEP aborts with an informative message.
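If you want to check in advance what your MPI library provides, some implementations report this directly; for example, with Open MPI (assuming the ompi_info tool is in your path) the following should print the supported level:

    # Lists the thread-support level compiled into the Open MPI installation.
    ompi_info | grep -i thread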
This behaviour was implemented in ONETEP v4.5.8.14 (r1551) in March 2017, at a time when most MPI libraries supported MPI_THREAD_SERIALIZED. Older versions of ONETEP still technically required MPI_THREAD_SERIALIZED for their communications, but the rationale for the check and the logic of the warnings were different (due to earlier, poorer support of MPI_THREAD_SERIALIZED by MPI libraries), and ONETEP would never abort if the support level was insufficient.
Data corruption with Intel MPI 2017
We have observed data corruption in the communications when ONETEP is compiled with the Intel Compiler v17 + Intel MPI 2017 and communications take place over InfiniBand. Apparently, there is an incompatibility between the optimisations that ifort v17 uses for allocating memory and the way Intel MPI 2017 caches the data for the InfiniBand transfer.

This cache can be enabled/disabled via the environment variable I_MPI_OFA_TRANSLATION_CACHE. Unfortunately, according to the Intel MPI documentation, the cache is enabled by default, despite possibly producing wrong results: "The cache substantially increases performance, but may lead to correctness issues in certain situations."

So far, disabling this cache via

export I_MPI_OFA_TRANSLATION_CACHE=0

seems to prevent the data corruption with no noticeable performance hit.
onetep_launcher
The onetep_launcher utility script (by Jacek Dziedzic) is a tool that can be used to control most of the settings discussed on this page. As of 29th August 2017, it can be used to set:

- OMP_STACKSIZE (-o),
- OMP_NUM_THREADS (-t),
- the per-process stack size (-s),
- the maximum allowed core file size (-c),
- the ifort environment variables needed to produce core files on severe RTL errors (-d), and
- the Intel MPI OFA translation cache (-m).

onetep_launcher provides reasonable defaults for all these parameters, but they may need to be further adjusted. The built-in documentation / help functionality can be accessed by executing the onetep_launcher script without an input file.
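The exact calling convention may differ between versions and systems, so treat the line below purely as a hypothetical illustration: it assumes the launcher is started by mpirun, takes the input file as its last argument, and uses the -t and -o options listed above (you will typically also need to point it at the ONETEP executable; consult the built-in help).

    # Hypothetical invocation: 4 MPI processes, 4 OpenMP threads each (-t),
    # 64 MiB of OpenMP stack per thread (-o).
    mpirun -np 4 ./onetep_launcher -t 4 -o 64M example.dat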
Things to do
Compatibility of threads_num_mkl (mind that in the code it is not used to compute threads_max_possible). Some of these bits might belong to an MKL/FFTW documentation page.