Due to the nature of our research, most of our experiments run in compute-intensive environments, such as large shared-memory or distributed systems. While in some cases well-known research clusters (e.g., at HLRN or LRZ) are used for the computations, the majority of the results are obtained on our own cluster, located at Humboldt-Universität zu Berlin. To support reproducibility, we list the current configuration of the system here. We hope that, if you want to rerun the experiments on your own, this gives you a better understanding of our results and simplifies the assessment of yours.

Overview of the cluster topology used for most experiments in our research.

Specifications of the Head Node:

  • 4x 6-Core Intel Xeon X7460

  • 128 GB RAM

  • 350 GB HDD (5x HDD in RAID-6)

  • 2x10 Gbit bonded LAN

Specifications of the SHM Node:

  • 2x 18-Core Intel Xeon Gold 6154

    • Hyperthreading disabled

    • Security mitigations disabled

    • Driver: intel_pstate (see the configuration sketch after this list)

      • all C-states >= C1 disabled

      • C0/P0 state permanently active

    • Permanent boost clocks on all cores:

      • (scalar parallel code) 3.7 GHz

      • (vector parallel code, AVX2) 3.3 GHz

      • (vector parallel code, AVX-512) 2.7 GHz

      • \(\mathrm{FLOPS}_{node} = 3.11\ \text{TFLOPS}\)

  • 24x 64 GB = 1.5 TB RAM (1.49 TB usable for computation)

    • DDR4 2666 MT/s, Quad-Rank (NUMA balanced)

  • 8.2 TB HDD (5x HDD in RAID-5, 2x SATA SSD in RAID-1)

  • 1x10 Gbit LAN (Distributed File System, Management)
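
A minimal sketch, assuming the cpupower utility and the sysfs interface of the intel_pstate driver, of how the C-state and frequency settings listed above can be reproduced at runtime (on our systems, the boot-time kernel parameters listed under Operating System below serve the same purpose):

    # Assumption: cpupower (kernel-tools) is installed, driver is intel_pstate.
    # Disable all idle states with a wakeup latency above 1 us,
    # i.e. everything deeper than the C0/POLL state:
    cpupower idle-set -D 1

    # Keep all cores at maximum performance (no downclocking):
    cpupower frequency-set -g performance
    echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct

    # Verify that hyperthreading is off (1 thread per core):
    lscpu | grep -E 'Thread|Core|Socket'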

Specifications of the Distributed Nodes (16 in total):

  • 2x 12-Core Intel Xeon Gold 6126

    • Hyperthreading disabled

    • Security mitigations disabled

    • Driver: intel_pstate (configured as on the SHM node)

      • all C-states >= C1 disabled

      • C0/P0 state permanently active

    • Permanent boost clocks on all cores:

      • (scalar parallel code) 3.25 GHz

      • (vector parallel code, AVX2) 2.9 GHz

      • (vector parallel code, AVX-512) 2.3 GHz

      • \(\mathrm{FLOPS}_{node} = 1.77\ \text{TFLOPS}\)

  • 12x 16 GB = 192 GB RAM (182 GB usable for computation)

    • DDR4 2666 MT/s, Dual-Rank (NUMA balanced)

  • 512 GB NVMe SSD

  • 1x10 Gbit LAN (Distributed File System, Management)

  • 1x100 Gbit OPA (MPI)

  • \(R_{peak} = \mathrm{FLOPS}_{\text{all nodes}} = 28.26\ \text{TFLOPS}\)
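
This theoretical peak follows from the per-node value and the node count; the factor 16 is the number of distributed nodes, matching the 16x2 HPL partition below (using the unrounded per-node value):

\(R_{peak} = 16 \times \mathrm{FLOPS}_{node} = 16 \times 1.7664\ \text{TFLOPS} \approx 28.26\ \text{TFLOPS}\)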

Operating System

  • CentOS 8.1

  • Kernel boot parameters: rhgb quiet nospec_store_bypass_disable noibrs noibpb spectre_v2_user=off spectre_v2=off nopti l1tf=off kvm-intel.vmentry_l1d_flush=never mitigations=off idle=poll

  • Tools for job scheduling: Slurm [1], simexpal [2]
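
For illustration, a minimal sketch of a Slurm batch script of the kind such experiments can be scheduled with; the job name, binary, and time limit are placeholders (not our actual job configuration), while the resource shape mirrors the distributed-node specifications above (2 sockets, 12 cores each):

    #!/bin/bash
    #SBATCH --job-name=example-run     # placeholder name
    #SBATCH --nodes=2                  # two distributed nodes
    #SBATCH --ntasks-per-node=2        # one MPI rank per socket
    #SBATCH --cpus-per-task=12         # one OpenMP thread per physical core
    #SBATCH --time=01:00:00            # placeholder time limit

    # Match the OpenMP thread count to the allocated cores:
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    srun ./my_experiment               # placeholder binary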

Benchmarks

The following benchmarks are specific to the distributed nodes, since they concern MPI efficiency.

  • LINPACK (HPL): \(R_{max} = 22.99\ \text{TFLOPS}\), which means the cluster has an efficiency of \(81.3\%\) (\(\frac{R_{max}}{R_{peak}}\)). The following configuration was used (an HPL.dat sketch follows this list):

    • Using cblas, MPI, and the compiler from Intel [3]

    • Launching the problem with hybrid MPI + OpenMP instead of MPI only.

    • Partitioning LINPACK into 16x2 parts (reflecting the number of nodes and sockets per node); each MPI process spawns 12 OpenMP threads.

    • Communication between different sockets and nodes is done via MPI; within a socket, OpenMP is used.

    • Using the Omni-Path libraries also for OpenMP communication.

    • Processing a block size (\(\text{NB}\)) of 384 double-precision entries per loop iteration.

    • Calling procedure: mpirun -genv I_MPI_FABRICS shm:tmi -genv PSM2_BOUNCE_SZ 8192 -perhost 2 -hostfile ${HOST_FILE} -np 32, where the host file contains the information about the 16x2 partition.

  • CPU communication latency: Measurements include latencies for core-to-core communication with varying message sizes in three different scenarios:

    1. Both cores are on the same socket (NUMA node 0). This accounts for intra-socket communication.

    2. Both cores are on the same node, but on different sockets. This accounts for the delay in inter-socket (UPI) communication.

    3. Both cores are on different nodes, but on the same socket (NUMA node 0). This accounts for the delay in inter-node (OPA) communication.

    • Calling procedure: I_MPI_DEBUG=4 mpirun -n 2 -ppn 1 -genv I_MPI_PIN_PROCESSOR_LIST=$pin1,$pin2 -iface ib0 -hosts $i,$j -genvall -verbose ./IMB-MPI1 pingpong -off_cache -1, where pin1/pin2 and i/j set the core and node pinning. IMB-MPI1 is part of the Intel MPI benchmarks suite [4]. The tests use the round-trip time (RTT) of a single MPI message pair.
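
As referenced above, a sketch of how an HPL.dat matching this setup could look; the problem size N is a placeholder (not the value of the quoted run), and the remaining algorithmic parameters of the full template are omitted:

    # Writes a (truncated) HPL.dat reflecting the 16x2 partition above:
    # 4x8 process grid = 32 MPI ranks, block size NB = 384.
    cat > HPL.dat <<'EOF'
    HPLinpack benchmark input file
    sketch only; see the HPL distribution for the full template
    HPL.out      output file name (if any)
    6            device out (6=stdout,7=stderr,file)
    1            # of problems sizes (N)
    100000       Ns (placeholder)
    1            # of NBs
    384          NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    4            Ps
    8            Qs
    16.0         threshold
    EOF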


CPU latency for intra-socket, inter-socket, and inter-node communication


Notes

Formula for the FLOPS (floating-point operations per second) calculation: \(\mathrm{FLOPS}_{node} = \frac{\text{FLOPs}}{\text{cycle}} \times f_{avx512} \times S \times C\)

  • \(\frac{\text{FLOPs}}{\text{cycle}}\) is calculated as \(\frac{\text{vector width}}{\text{variable width}} \times N_{\text{FMA units}} \times N_{\text{FMA ops per cycle}}\). Since both node types (SHM + distributed) share the same architecture and feature level, \(\frac{\text{FLOPs}}{\text{cycle}} = \frac{512}{64} \times 2 \times 2 = 32\).

  • \(f_{avx512}\) is the maximum all-core frequency under full AVX-512 vectorization.

  • \(S\) is the number of sockets.

  • \(C\) is the number of cores per socket.
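
For concreteness, instantiating the formula with the values listed in the specifications above yields the quoted per-node numbers:

\(\mathrm{FLOPS}_{SHM} = 32 \times 2.7\,\text{GHz} \times 2 \times 18 \approx 3.11\ \text{TFLOPS}\)

\(\mathrm{FLOPS}_{dist} = 32 \times 2.3\,\text{GHz} \times 2 \times 12 \approx 1.77\ \text{TFLOPS}\)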



[1] https://slurm.schedmd.com/documentation.html

[2] https://github.com/hu-macsy/simexpal

[3] https://software.intel.com/content/www/us/en/develop/tools/mpi-library.html

[4] https://software.intel.com/content/www/us/en/develop/articles/intel-mpi-benchmarks.html