http://www2.mmm.ucar.edu/wrf/OnLineTutorial/compilation_tutorial.php
Check speedup using test case here:
/gaia/home/sweeneyc/Code/WRFMPI/run
gaia has 24 nodes, each with two Intel Xeon E5-2660 v2 CPUs
http://gaia.ucd.ie/technicalDetails.php
These CPUs have 10 cores and 20 threads:
http://ark.intel.com/products/75272/Intel-Xeon-Processor-E5-2660-v2-25M-Cache-2_20-GHz
Hyperthreading just adds a second virtual core per physical core, which can help when, for example, one thread stalls on a cache miss, but I won't use it here. So each node has 2 CPUs with 10 physical cores each, i.e. 20 cores per node.
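To double-check the topology on a compute node (assuming lscpu and numactl are available there):
lscpu | grep -E 'Socket|Core|Thread'   # should report 2 sockets, 10 cores per socket, 2 threads per core
numactl --hardware                     # shows the two NUMA domains and which core IDs belong to each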
Test speedup by submitting jobs with different numbers of cores:
#PBS -l nodes=1:ppn=40
- time mpirun -np 4 --map-by ppr:1:core wrf.exe: 550s
- time mpirun -np 9 --map-by ppr:1:core wrf.exe: 331s
- time mpirun -np 15 --map-by ppr:1:core wrf.exe: 299s (barely faster than 9 procs?)
- time mpirun -np 16 --map-by ppr:1:core wrf.exe: 303s (likewise poor scaling?)
- time mpirun -np 20 wrf.exe: 166s (no flags)
- time mpirun -np 20 --map-by ppr:1:core wrf.exe: 166s (map by core)
- time mpirun -np 30 wrf.exe: 258s
- time mpirun -np 30 --map-by ppr:1:core wrf.exe: doesn't run; more processes requested than physical cores available (good).
#PBS -l nodes=2:ppn=40
- time mpirun -np 30 --map-by ppr:1:core wrf.exe: 126s
- time mpirun -np 36 --map-by ppr:1:core wrf.exe: 118s
- time mpirun -np 40 --map-by ppr:1:core wrf.exe: 110s
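For reference, a minimal sketch of the kind of PBS job script used for these timings; the job name, walltime and module line are assumptions, only the resource request and mpirun command come from the notes here:
#!/bin/bash
#PBS -N wrf_speedup                 # assumed job name
#PBS -l nodes=2:ppn=40              # resource request as used above
#PBS -l walltime=01:00:00           # assumed walltime
#PBS -j oe                          # merge stdout and stderr into one file
cd $PBS_O_WORKDIR                   # run from the directory the job was submitted in
module load WRF/3.6.1openmpi_ib     # module named later in these notes
time mpirun -np 40 --map-by ppr:1:core wrf.exe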
The multi-node speedup was initially terrible. The problem was InfiniBand: it wasn't working, so the nodes were falling back to communicating over standard Ethernet, hence the very slow times. Eamonn fixed this by using the correct InfiniBand port (ib1, not ib0) and installing the official Mellanox drivers. He then recompiled WRF against openMPI/1.8.4 and gcc/4.9.2. The new module is WRF/3.6.1openmpi_ib.
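Some quick checks that MPI traffic really is going over InfiniBand rather than Ethernet (assuming the standard OFED/Mellanox diagnostics are installed on the nodes):
ibstat                            # HCA ports should show State: Active, Physical state: LinkUp
ip addr show ib1                  # the IPoIB interface Eamonn configured should be up
ompi_info | grep -i openib        # Open MPI should list the openib (InfiniBand) BTL
mpirun --mca btl openib,self,sm -np 40 wrf.exe   # force the InfiniBand BTL so the job fails fast if it can't be used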
Doing longer run tests: Run WRF for 7 days:
WPS/3.6.1 and WRF/3.6.1openmpi_ib
#PBS -l nodes=4:ppn=40
- time mpirun -np 80 --map-by ppr:1:core wrf.exe: real 46m20s
#PBS -l nodes=6:ppn=40
- time mpirun -np 120 --map-by ppr:1:core wrf.exe: real 39m43s
WPS/3.7 and WRF/3.7
#PBS -l nodes=4:ppn=40
- time mpirun -np 80 --map-by ppr:1:core wrf.exe: real
#PBS -l nodes=6:ppn=40
- time mpirun -np 120 --map-by ppr:1:core wrf.exe: real 40m50s
WRF crashes with 7 or more nodes: over-decomposition of the domain:
http://forum.wrfforum.com/viewtopic.php?f=6&t=4930
The rough guideline there suggests at least 15x15 grid points per tile.
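A rough way to apply that guideline before picking -np; e_we and e_sn here are stand-ins for the domain dimensions in namelist.input, not the values actually used in these runs:
# With ~15x15 grid points needed per tile, the most MPI ranks a domain
# can usefully support is roughly (e_we/15) * (e_sn/15).
e_we=400   # west-east grid points (hypothetical)
e_sn=300   # south-north grid points (hypothetical)
echo $(( (e_we / 15) * (e_sn / 15) ))   # 26 * 20 = 520 ranks at most for this example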
SMALLER DOMAINS
#PBS -l nodes=4:ppn=40
- time mpirun -np 80 --map-by ppr:1:core wrf.exe: real 40m09s
#PBS -l nodes=5:ppn=40
- time mpirun -np 100 --map-by ppr:1:core wrf.exe: real 37m19s
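The times above are wall-clock from time mpirun, which include start-up and output; to compare just the integration, the per-step timings WRF writes to rsl.out.0000 can be summed (assuming the default rsl log format):
# each matching line ends with "<seconds> elapsed seconds", so $(NF-2) is the step time
grep "Timing for main" rsl.out.0000 | awk '{sum += $(NF-2)} END {print sum " s total compute"}'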
WRF/3.7_dm_sm
#PBS -l nodes=5:ppn=40
WRF crashes with 6 or more nodes: over-decomposition of the domain.
export OMP_NUM_THREADS=20
- time mpirun -np 5 --map-by ppr:1:core wrf.exe: time out (40m) at 12-02-06:00
export OMP_NUM_THREADS=4
- time mpirun -np 25 --map-by ppr:1:core wrf.exe: time out (40m) at 12-02-18:00
export OMP_NUM_THREADS=4
- time mpirun -np 25 wrf.exe: time out (40m) at 12-01-00:00 (!)
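The dm+sm runs above suggest process placement matters a lot: the un-mapped 25-rank run was still at 12-01-00:00 after 40 minutes. One layout worth trying, not tested here, is one MPI rank per socket with the OpenMP threads bound to that socket's ten cores; the exact syntax should be checked against the Open MPI version on gaia:
export OMP_NUM_THREADS=10                              # one thread per physical core in a socket
# 5 nodes x 2 sockets = 10 ranks; PE=10 gives each rank 10 cores for its threads
time mpirun -np 10 --map-by socket:PE=10 --bind-to core wrf.exe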