The instructions below are intended to help you compile and run ONETEP with parallelisation and/or threading, and to optimise the performance of your ONETEP runs. However, they are by no means complete. In particular, the commands required to compile, link or run binaries are system-dependent. While you may gain some insights from the config files distributed with the code, we can’t replace the support provided by your system administrator. The same applies to the particularities of your hardware. We provide detailed documentation for some common HPC systems (e.g., Iridis5, ARCHER2, Michael and Young) and typical Red Hat desktops in the hpc_resources directory of the ONETEP distribution.
Tasks ≤ physical cores
The following explanations assume a node > NUMA region > processor > core hierarchy. In general, the number of tasks (processes times threads) at any level of the hierarchy should not be larger than the number of cores within. Note that, when we say “core”, we mean a “physical core”, not a “logical” one: some systems have physical (hardware) cores which can multitask thanks to technologies such as Intel® Hyper-Threading Technology, and some parts of the operating system may think there are more (logical) cores available than there physically are (typically two logical cores per physical core). Due to the nature of the computations performed by ONETEP, there’s no benefit in trying to exploit this kind of multitasking (quite likely the opposite).
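The rule above can be checked with simple arithmetic before submitting a job. The following is a minimal sketch, not part of ONETEP: the helper name fits_physical_cores and the example figures (40 physical cores) are assumptions made purely for illustration.

```shell
# Hypothetical helper (not part of ONETEP): succeeds only if
# processes x threads does not exceed the number of physical cores.
fits_physical_cores() {
    procs=$1      # number of (MPI) processes at this level of the hierarchy
    threads=$2    # threads per process
    cores=$3      # physical (not logical) cores at this level
    tasks=$((procs * threads))
    if [ "$tasks" -le "$cores" ]; then
        echo "OK: $tasks tasks on $cores physical cores"
    else
        echo "OVERSUBSCRIBED: $tasks tasks on $cores physical cores"
        return 1
    fi
}

# Illustrative figures only: a node assumed to have 40 physical cores.
fits_physical_cores 4 10 40
fits_physical_cores 8 10 40 || true
```

The same check applies at each level of the hierarchy (node, NUMA region, processor), using the number of physical cores available at that level.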
In order to generate an MPI-enabled binary:
- You need to set at least the MPI flag in your compilation command (usually achieved by adding
- Like any MPI-capable code, ONETEP requires the compiler to provide access to the MPI Fortran support of the system. The compiler may do so either by default or by requiring an explicit include path to the include directory of the local MPI installation as part of the compilation command (-I/path/to/include_dir). By default, ONETEP tries to use the MPI module of the installation, but that can be downgraded to just including the mpif.h file of the installation by adding the compilation flag -DUSE_INCLUDE_MPIFH to the compilation commands.
- MPI-IO functionality is enabled by default, if you need to disable it please use the compilation flag
- You may set the
- You need to link the binary against the MPI libraries.
Running the binary
You may gain some efficiency by pinning MPI processes to the physical cores (if your system allows this).
- If you are a beginner user, you are advised not to use any threading keywords in your input file. Instead, use the onetep_launcher utility script. It will take care of OMP threads for you automatically.
- At the moment, threading is controlled by 5 keywords:
- These keywords are used to set the number of threads used by each individual (possibly MPI) process in different parts of the code. Please look up these keywords in the keyword database for further information.
- At the moment,
- Threading enabled by threads_num_fftboxes is considered stable and is encouraged. Having threads_max=threads_num_fftboxes is reasonable. They are affected by compilation and run-time defaults.
- Threading enabled by threads_per_cellfft is less mature. Setting threads_per_cellfft=threads_max is reasonable. Keeping threads_per_fftbox=1 is recommended. They are unaffected by compilation or run-time defaults.
threads_num_mkl: maximum number of threads to use in Intel MKL routines.
- The maximum number of threads used by a process is the maximum of
threads_per_cellfft, and the product
- The number of FFTboxes simultaneously held in memory is controlled by the keyword
fftbox_batch_size. Because some operations work on two FFTboxes simultaneously per thread, there is little point in having
- Each process should run within a NUMA region. The maximum number of threads should never be larger than the number of cores in that NUMA region.
- You can specify a default number of threads via the DEFAULT_THREADS flag (e.g., if you want to set the default to 4 threads, you should add -DDEFAULT_THREADS=4 to your compilation command).
- This default will only affect
- Otherwise, the default number of threads is one.
- You need to tell your compiler that the ONETEP source files have OpenMP instructions (this requires a compiler-dependent flag).
- You need to link the binary against the OpenMP libraries.
Running the ONETEP executable
If you are a beginner user, you are advised to use the onetep_launcher utility script. It will take care of stack sizes (both the global one and the OMP stack size) automatically. You can then ignore this section.
In order to run ONETEP, each process and/or thread will require a big enough memory stack. Your system may provide large enough stack sizes by default, but if they are insufficient you will need to increase them.
- If you are running ONETEP from a shell of the bash family, you can lift any artificial limitation on the stack available to each (MPI) process by running ulimit -s unlimited before executing the ONETEP binary. Naturally, some limitations such as total physical memory will still apply, but ONETEP should never reach those limits via stack memory. By default, ONETEP checks at initialisation that your stack is reasonably large, and aborts with an informative message if it believes that you should increase it.
- If you are running an OpenMP-enabled binary, the command above will only affect the stack size of the master thread of each (MPI) process. The stack size of all other threads is controlled at runtime via the environment variable OMP_STACKSIZE. If the variable is not defined in the environment, the Fortran runtime will use a default value, usually of the order of 4 MiB. If you are running ONETEP from bash or a POSIX-compliant shell, you may set its value to, say, 64 MiB by executing export OMP_STACKSIZE=64M before starting ONETEP. Please bear in mind the following:
- If your OMP_STACKSIZE is too small, your simulation may crash with a runtime error that explicitly mentions a problem with the stack, but it is also possible (and actually very likely) that the error you receive is a generic SIGINT or SIGSEGV which does not mention the stack at all.
- The OMP_STACKSIZE that a given simulation requires may depend on the system / compiler that you use. In particular, there seems to be a bug in v16 of the Intel Fortran Compiler such that runs of ONETEP binaries compiled with it require much larger values of OMP_STACKSIZE than if they were compiled with gfortran or another version of the Intel Compiler (≤v15 or ≥v17).
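The two stack settings described above can be combined at the top of a job script. The following is a minimal sketch assuming bash or another POSIX-compliant shell. The 64M value is the illustrative figure used in the text, not a recommendation, and some systems forbid raising the stack limit, which is why the failure of ulimit is tolerated here.

```shell
# Try to lift the per-process stack limit. Some systems forbid this,
# so report the current limit instead of aborting on failure.
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit (current: $(ulimit -s))"

# Stack size for all threads other than the master thread of each process.
# 64M is the illustrative value from the text; your run may need more.
export OMP_STACKSIZE=64M

echo "stack limit: $(ulimit -s), OMP_STACKSIZE: $OMP_STACKSIZE"
```

These lines need to run in the same shell that subsequently launches the ONETEP binary, since both settings are inherited by child processes only.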
Number of OpenMP threads
- Run-time defaults:
threads_num_fftboxes may be set at runtime via the environment variable OMP_NUM_THREADS. E.g., from bash or a POSIX-compliant shell, if you want to set this default to 2, before running the ONETEP binary you should execute
- What will happen if you don’t export any value for OMP_NUM_THREADS before running the binary?
- If OMP_NUM_THREADS is given a value in the initialisation files of your shell, or anywhere else, so that it is indeed defined when you run ONETEP, the code will act as if it had been set by you. You probably want to check that this doesn’t happen to you.
- If OMP_NUM_THREADS is not defined at run-time, the OpenMP libraries will still provide a default by which we should abide.
- If we detect a hybrid MPI+OpenMP compilation running with more than one MPI process we will discard this default due to the high risk of CPU oversubscription.
- Otherwise, we will use this value as the run-time default.
- Run-time defaults have precedence over compilation defaults.
- Values for all keywords, including those related to threading, can be set in the input file. Values defined in the input file have complete precedence over conflicting numbers of threads defined in any other way (compilation or run-time defaults).
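The order of precedence described in this list (input file, then run-time default, then compilation default, then one thread) can be sketched as follows. This is an illustration only, not ONETEP's actual logic: the helper name resolve_threads is hypothetical, and the sketch deliberately ignores the case where OMP_NUM_THREADS is undefined and the default supplied by the OpenMP libraries is discarded in multi-process hybrid runs.

```shell
# Hypothetical helper (not part of ONETEP) mirroring the precedence above:
# input file > run-time default (OMP_NUM_THREADS) > compilation default > 1.
resolve_threads() {
    input_value=$1      # value set in the input file, empty if not set
    runtime_value=$2    # value of OMP_NUM_THREADS, empty if not defined
    compile_value=$3    # value given via -DDEFAULT_THREADS, empty if not given
    if [ -n "$input_value" ]; then
        echo "$input_value"
    elif [ -n "$runtime_value" ]; then
        echo "$runtime_value"
    elif [ -n "$compile_value" ]; then
        echo "$compile_value"
    else
        echo 1
    fi
}

# Set the run-time default to 2 threads, as in the example in the text.
export OMP_NUM_THREADS=2

resolve_threads "" "$OMP_NUM_THREADS" 4    # run-time default beats compilation default
resolve_threads 8 "$OMP_NUM_THREADS" 4     # input file beats everything else
resolve_threads "" "" ""                   # nothing set: one thread
```

Running echo "$OMP_NUM_THREADS" before launching ONETEP is a quick way to check whether the variable is already defined by your shell initialisation files.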
If in a run with an OpenMP-enabled binary you don’t set either
threads_num_fftboxes in your input file, ONETEP will try its best to guess how many threads to use based on your system and compilation defaults, but this may lead to unwanted setups with poor performance. You will very probably want to avoid this.
For further information on how the compilation or run-time defaults are affecting your run, please increase the verbosity of ONETEP’s output by setting the keyword
output_detail to NORMAL (main relevant detail) or VERBOSE (full detail).
MPI levels of thread support
This sub-section only applies to hybrid MPI + OpenMP compilations.
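MPI libraries may provide four different levels of thread support; they are, in increasing order of thread support:
- MPI_THREAD_SINGLE
- MPI_THREAD_FUNNELED
- MPI_THREAD_SERIALIZED
- MPI_THREAD_MULTIPLE
Hybrid MPI + OpenMP runs of ONETEP require MPI_THREAD_SERIALIZED.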
In the initialisation stage, ONETEP checks the provided level of thread support:
- If the provided level is the required one (MPI_THREAD_SERIALIZED) or higher (MPI_THREAD_MULTIPLE), everything should be fine.
- If the provided level is lower and equal to MPI_THREAD_FUNNELED, ONETEP will issue a severe warning. ONETEP usually produces correct results when this level of threading is provided, but according to the MPI standard the parallelisation + threading is not guaranteed to be correct. This is therefore allowed but strongly discouraged.
- If the provided level is the lowest, MPI_THREAD_SINGLE, running ONETEP with multiple threads is not viable. Hence, ONETEP aborts with an informative message.
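The three cases above amount to a small decision table. The following sketch restates it as a shell function for reference. The function name thread_support_action and the labels proceed, warn and abort are made up for illustration; the real check is performed inside ONETEP at initialisation, and the abort applies to runs with multiple threads.

```shell
# Hypothetical helper (not part of ONETEP): maps the level of thread support
# provided by the MPI library to the behaviour described in the list above.
thread_support_action() {
    case "$1" in
        MPI_THREAD_SERIALIZED|MPI_THREAD_MULTIPLE) echo "proceed" ;;  # required level or higher
        MPI_THREAD_FUNNELED) echo "warn" ;;                           # allowed, strongly discouraged
        MPI_THREAD_SINGLE) echo "abort" ;;                            # multi-threaded runs not viable
        *) echo "unknown"; return 1 ;;
    esac
}

for level in MPI_THREAD_SINGLE MPI_THREAD_FUNNELED MPI_THREAD_SERIALIZED MPI_THREAD_MULTIPLE; do
    echo "$level -> $(thread_support_action "$level")"
done
```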
This behaviour was implemented in ONETEP v184.108.40.206 (r1551) in March 2017, at a time when most MPI libraries supported
MPI_THREAD_SERIALIZED. Older versions of ONETEP still technically required
MPI_THREAD_SERIALIZED for their communications, but the rationale for the check and the logic of the warnings were different (due to earlier poorer support of
MPI_THREAD_SERIALIZED by MPI libraries), and ONETEP would never abort if the support level was insufficient.
Data corruption with Intel MPI 2017
We have observed data corruption in the communications when ONETEP is compiled with the Intel Compiler v17 + Intel MPI 2017 and communications take place over InfiniBand. Apparently, there is an incompatibility between the optimisations that ifort v17 uses for allocating memory and the way IMPI 2017 caches the data for the InfiniBand transfer.
This cache can be enabled/disabled via the environment variable I_MPI_OFA_TRANSLATION_CACHE. Unfortunately, according to the Intel MPI documentation, the cache is enabled by default, despite possibly producing wrong results: “The cache substantially increases performance, but may lead to correctness issues in certain situations.”
So far, disabling this cache via the I_MPI_OFA_TRANSLATION_CACHE environment variable seems to prevent the data corruption with no noticeable performance hit.
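As a sketch, the cache could be disabled from bash or a POSIX-compliant shell before launching ONETEP as shown below. The value 0 is an assumption about what this Intel MPI release accepts as "disabled"; please check the Intel MPI reference for your version before relying on it.

```shell
# Assumption: 0 is accepted by Intel MPI 2017 as the "disabled" value
# for the OFA translation cache. Verify against your Intel MPI documentation.
export I_MPI_OFA_TRANSLATION_CACHE=0

echo "I_MPI_OFA_TRANSLATION_CACHE=$I_MPI_OFA_TRANSLATION_CACHE"
```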
The onetep_launcher utility script (by Jacek Dziedzic) is a tool that can be used to control most of the settings discussed on this page. As of 29th August 2017, it can be used to set
- per-process stack size (-s),
- maximum allowed core file size (-c),
- ifort environment variables to produce core files on RTL’s severe errors (-d), and
- Intel MPI OFA translation cache (-m).
onetep_launcher provides reasonable defaults for all these parameters, but they may need to be further adjusted. The built-in documentation / help functionality can be accessed by executing the onetep_launcher script without an input file.
Things to do
Compatibility of threads_num_mkl (mind that in the code it is not used to compute threads_max_possible). Some of these bits might belong to a MKL/FFTW documentation page.