Monday, 26 January 2015

WRF mpi on gaia

WRF compiled and ran first time following these instructions:

http://www2.mmm.ucar.edu/wrf/OnLineTutorial/compilation_tutorial.php
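For the record, the build boils down to something like the following (the module names and NETCDF path are assumptions about gaia, and the configure option depends on the compiler/MPI combination chosen):

module load openmpi gcc            # assumed module names on gaia
export NETCDF=/path/to/netcdf      # assumed NetCDF install location
cd WRFV3
./configure                        # pick a (dmpar) option for distributed-memory MPI
./compile em_real >& log.compile   # build the real-data case; check log.compile for errors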

Check speedup using test case here:

/gaia/home/sweeneyc/Code/WRFMPI/run



gaia has 24 nodes, each with two Intel Xeon E5-2660 v2 CPUs:

http://gaia.ucd.ie/technicalDetails.php

These CPUs have 10 cores and 20 threads:

http://ark.intel.com/products/75272/Intel-Xeon-Processor-E5-2660-v2-25M-Cache-2_20-GHz

Hyperthreading just adds a second virtual core per physical core, which can speed things up when, for example, one thread stalls on a cache miss, but I won't use it here. So each node has 2 CPUs, and each CPU has 10 physical cores (20 physical cores per node).
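A quick way to confirm this layout on a compute node (standard lscpu, nothing gaia-specific):

lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'
# expect: 2 sockets, 10 cores per socket, 2 threads per core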

Test speedup by submitting jobs with different numbers of cores (a sample job script is sketched after this list):

#PBS -l nodes=1:ppn=40
  • time mpirun -np 4 --map-by ppr:1:core wrf.exe: 550s
  • time mpirun -np 9 --map-by ppr:1:core wrf.exe: 331s
  • time mpirun -np 15 --map-by ppr:1:core wrf.exe: 299s (surprisingly little gain over 9 cores)
  • time mpirun -np 16 --map-by ppr:1:core wrf.exe: 303s (slightly slower than 15, oddly)
  • time mpirun -np 20 wrf.exe: 166s (no flags)
  • time mpirun -np 20 --map-by ppr:1:core wrf.exe: 166s (map by core)
  • time mpirun -np 30 wrf.exe: 258s
  • time mpirun -np 30 --map-by ppr:1:core wrf.exe: doesn't run: more processes than physical cores available, so it refuses rather than oversubscribing (good).
#PBS -l nodes=2:ppn=40
  • time mpirun -np 30 --map-by ppr:1:core wrf.exe: 126s
  • time mpirun -np 36 --map-by ppr:1:core wrf.exe: 118s
  • time mpirun -np 40 --map-by ppr:1:core wrf.exe: 110s
#PBS -l nodes=3:ppn=40
  • time mpirun -np 42 --map-by ppr:1:core wrf.exe: 107s
  • time mpirun -np 49 --map-by ppr:1:core wrf.exe: 98s
  • time mpirun -np 56 --map-by ppr:1:core wrf.exe: 94s
  • time mpirun -np 60 --map-by ppr:1:core wrf.exe: 90s
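Each of these timings comes from a job script along the following lines, with only -np, --map-by and the node count varied (a sketch; the walltime and the exact WRF module name are assumptions):

#!/bin/bash
#PBS -l nodes=1:ppn=40
#PBS -l walltime=01:00:00          # assumed limit
module load WRF                    # whichever WRF build module is current
cd /gaia/home/sweeneyc/Code/WRFMPI/run
time mpirun -np 20 --map-by ppr:1:core wrf.exe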


This speedup was initially horrible. The problem was InfiniBand: it wasn't working, so the nodes were communicating over standard Ethernet, hence the horribly slow times. Eamonn fixed this by using the correct port for InfiniBand (ib1, not ib0) and installing the official Mellanox drivers. He then recompiled WRF against openMPI/1.8.4 and gcc/4.9.2. The new module is WRF/3.6.1openmpi_ib.
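A couple of checks that would catch this kind of silent fallback to Ethernet earlier (assuming the standard OFED tools are installed and Open MPI was built with the openib BTL):

ibstat                             # the IB port (ib1 here) should report State: Active
ompi_info | grep openib            # confirm Open MPI was built with the openib BTL
# restrict the transports so the job fails loudly instead of quietly using TCP:
mpirun --mca btl openib,self,sm -np 40 --map-by ppr:1:core wrf.exe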

Longer run tests: run WRF for a 7-day simulation:

WPS/3.6.1 and WRF/3.6.1openmpi_ib

#PBS -l nodes=4:ppn=40
- time mpirun -np 80 --map-by ppr:1:core wrf.exe: real    46m20s

#PBS -l nodes=6:ppn=40
- time mpirun -np 120 --map-by ppr:1:core wrf.exe: real    39m43s

WPS/3.7 and WRF/3.7

#PBS -l nodes=4:ppn=40
- time mpirun -np 80 --map-by ppr:1:core wrf.exe: real    

#PBS -l nodes=6:ppn=40
- time mpirun -np 120 --map-by ppr:1:core wrf.exe: real  40m50s

WRF crashes with 7 or more nodes due to over-decomposition of the domain:
http://forum.wrfforum.com/viewtopic.php?f=6&t=4930
The rough guideline there is at least 15x15 grid points per MPI tile (see the quick check below).
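So, a rough upper bound on how many MPI ranks a domain can sensibly take, following that 15x15 rule (placeholder grid dimensions here; substitute e_we/e_sn from the actual namelist):

e_we=220; e_sn=200                       # placeholder namelist dimensions
echo $(( (e_we / 15) * (e_sn / 15) ))    # rough max number of MPI ranks (here 14*13 = 182)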

SMALLER DOMAINS

#PBS -l nodes=4:ppn=40
- time mpirun -np 80 --map-by ppr:1:core wrf.exe: real  40m09s

#PBS -l nodes=5:ppn=40
- time mpirun -np 100 --map-by ppr:1:core wrf.exe: real  37m19s

WRF crashes with 6 or more nodes, again due to over-decomposition of the domain.

WRF/3.7_dm_sm
#PBS -l nodes=5:ppn=40
export OMP_NUM_THREADS=20
- time mpirun -np 5 --map-by ppr:1:core wrf.exe: hit the 40 min walltime having only reached 12-02-06:00

export OMP_NUM_THREADS=4
- time mpirun -np 25 --map-by ppr:1:core wrf.exe: hit the 40 min walltime having only reached 12-02-18:00

export OMP_NUM_THREADS=4
- time mpirun -np 25 wrf.exe: hit the 40 min walltime having only reached 12-01-00:00 (!)
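One plausible reason these dm+sm runs crawl is that --map-by ppr:1:core binds each MPI rank, and therefore all of its OpenMP threads, to a single core. A saner hybrid launch would give each rank a whole socket for its threads; a sketch, assuming Open MPI 1.8 mapping/binding options (not tested above):

#PBS -l nodes=5:ppn=40
export OMP_NUM_THREADS=10
# 5 nodes x 2 sockets = 10 ranks; each rank's 10 threads get one socket's 10 physical cores
time mpirun -np 10 --map-by ppr:1:socket --bind-to socket wrf.exe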