Amber PME Midpoint (Amber 18)
Release Notes


News and Updates

Background

About the new Midpoint method implementation in Amber 18

The Midpoint method for Molecular Dynamics was developed by researchers at IBM as part of their BlueMatter project with BlueGene, and was subsequently expanded upon and described publicly by researchers at D. E. Shaw Research, LLC (see this 2006 paper published in The Journal of Chemical Physics). The Midpoint method is based on domain decomposition and provides an efficient approach to significantly reduce "data distribution" time as node count increases. The San Diego Supercomputer Center (SDSC) at UC San Diego, in collaboration with Intel® Corporation, has developed a prototype implementation of the Midpoint method within the Amber PMEMD software.

This implementation has been thoroughly tested with three key Amber benchmarks. The Midpoint implementation will continue to be refined, and its performance improved, over the coming months through patches to Amber 18.

Motivation for re-architecting Amber 16 PMEMD CPU code

The aim of re-architecting Amber 16's PMEMD code is to continue to improve CPU performance and cluster scalability. For example, the following figures show that the current atom-decomposition approach in Amber 16 does not scale well for the Cellulose and STMV benchmarks (roughly 400K and 1 million atoms, respectively) beyond 8 nodes, each containing a single Intel® Xeon Phi™ 7250 (Knights Landing) processor. It was also observed that the data distribution time grew exponentially with increased node count.

Figure 1: Scaling of the existing Amber 16 PMEMD code on Intel® Xeon Phi™ 7250-equipped nodes connected with Intel® Omni-Path.

 

Authorship & Support

PMEMD Midpoint Implementations:

Ashraf Bhuiyan (Intel)
Charles Lin (SDSC)
Tareq Malas (Intel)
Ross C. Walker (SDSC)*

*Corresponding author.

Citing the Midpoint Code

If you make use of any of the Amber Midpoint code in your work, please include the following citation (in addition to the standard Amber citation):

  • Charles Lin, Tareq Malas, Ashraf Bhuiyan, and Ross C. Walker*, "Scalable Amber Molecular Dynamics Implementations for Intel Architecture", 2018, in prep.


Performance and Accuracy

Amber 18 Midpoint mixed-precision accuracy

In addition to the traditional full double-precision implementation used in the PMEMD CPU code, the Amber 18 Midpoint code also introduces a mixed-precision model, first pioneered in the Amber GPU implementation, termed SPDP. This precision model uses single precision for each particle-particle interaction but sums the resulting forces into double-precision accumulators. It was established as sufficient for accurate MD simulations and shown to conserve energy in the original Amber GPU publications, and it has been tested to the same degree of tolerance as part of this new CPU optimization. The benefit of mixed precision is improved performance, both serial and parallel, since all modern Intel CPUs can perform two single-precision floating-point calculations for the cost of a single double-precision calculation. The figures and table below provide validation data for Amber 18 DPDP and SPDP, with Amber 16 DPDP as the reference.

Figure 2: Cellulose

Figure 3: DHFR

Figure 4: Factor IX

 

 

Maximum and RMS deviations of Amber 18 SPDP and DPDP relative to Amber 16 DPDP (reference):

                  DHFR       Factor IX  Cellulose  STMV
  Max deviation
    A18 SPDP      8.6E-04    1.9E-03    3.8E-03    4.0E-03
    A18 DPDP      5.0E-08    4.7E-07    4.9E-07    5.0E-07
  RMS deviation
    A18 SPDP      5.4E-05    1.1E-04    1.4E-04    1.9E-04
    A18 DPDP      1.9E-08    2.0E-08    2.0E-08    2.0E-08

Amber 18 Midpoint beta performance results

Benchmark Downloads: STMV | Cellulose | Poliovirus

The new Midpoint-based parallel version of PMEMD in the Amber 18 beta is now faster and scales better than PMEMD from the released Amber 16 version. The following figures show that the Amber 18 beta is consistently faster than Amber 16 on the primary benchmarks (Cellulose and STMV) on current Intel processors [Intel® Xeon® Gold 6148 (Skylake) and Intel® Xeon Phi™ 7250 (Knights Landing)]. At larger node counts the performance gap between the Amber 18 beta and Amber 16 increases, as the former has better scaling efficiency. For example, the figure below shows that the Amber 18 beta is 2.2x faster than the released Amber 16 code when running the STMV benchmark on a single Intel® Xeon Phi™ 7250 node. Moreover, the Amber 18 beta provides a 2.5x speedup over Amber 16 on 8 nodes of Intel® Xeon Phi™ 7250 and a 3.1x speedup on 32 nodes of Intel® Xeon® Gold 6148 for the STMV benchmark. In addition to the traditional Cellulose and STMV benchmarks, we have also introduced a larger, 4-million-atom benchmark based on the poliovirus. This represents a large simulation and is thus a good stress test of the Midpoint decomposition approach.

Hardware configuration: the Intel® Xeon Phi™ 7250 runs at 1.4 GHz with 96 GB of DRAM and 16 GB of MCDRAM in Quadrant/Cache mode. The Intel® Xeon® Gold 6148 processor runs at 2.4 GHz with 192 GB of DRAM. The multi-node results use the Intel® Omni-Path (OPA) fabric on the cluster. The compilers used are Intel compiler version 2017 update 2 and Intel MPI version 5.1.3.


Currently Supported Features

The Midpoint implementation in Amber 18 does not yet support the full PMEMD functionality; we plan to add critical missing functionality over the coming months.
The Midpoint implementation of Amber 18 currently supports NVE, NVT (Langevin thermostat), and SHAKE.


Configure, Compile, and Build Instructions

We advise users to configure, compile, and run on the same computer. If you build on an Intel® Xeon® E7-4850 v2 processor (formerly Ivy Bridge), you will get AVX instructions. If you build on an Intel® Xeon® E7-4850 v4 processor (formerly Broadwell), you will get AVX2 instructions. If you build on an Intel® Xeon Phi™ 7250 processor, you will get AVX-512 instructions. Binaries compiled on the Intel® Xeon Phi™ 7250 will not run on Haswell or earlier-generation processors. Currently, the Intel compilers version 2017 update 2 and GCC version 6.2.0 are supported and have been extensively tested with the Amber 18 Midpoint implementation.
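
As a quick check (assuming a Linux host), you can list which of these instruction-set extensions the current machine advertises before deciding where to build:

    # List the AVX instruction-set flags this host reports (from /proc/cpuinfo).
    # Expect "avx" on Ivy Bridge, "avx2" on Haswell/Broadwell, "avx512f" on Xeon Phi 7250 and Skylake-SP.
    grep -o -w -E 'avx512f|avx2|avx' /proc/cpuinfo | sort -u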

Enabling the Midpoint optimizations at runtime:

The user can choose at runtime (using the "midpoint" option in the &cntrl namelist) to run the PMEMD simulation with the Midpoint implementation by setting "midpoint=1", provided that all of the features used are supported by the Midpoint implementation. The original Amber 16 code path can be selected by setting "midpoint=0" or by omitting the parameter.
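
For example, a minimal &cntrl input for an NVT (Langevin thermostat) run with SHAKE that enables the Midpoint code path might look like the sketch below. Only midpoint = 1 is specific to this release; the step count, timestep, cutoff, and temperature values are illustrative placeholders.

    NVT (Langevin) + SHAKE example using the Midpoint implementation
     &cntrl
       imin = 0,                 ! molecular dynamics (no minimization)
       nstlim = 10000,           ! number of MD steps (illustrative)
       dt = 0.002,               ! 2 fs timestep (requires SHAKE)
       ntc = 2, ntf = 2,         ! SHAKE constraints on bonds to hydrogen
       ntt = 3, gamma_ln = 2.0,  ! Langevin thermostat
       temp0 = 300.0,            ! target temperature in K
       ntb = 1, cut = 8.0,       ! constant-volume PME, 8 Angstrom direct-space cutoff
       midpoint = 1,             ! run with the Midpoint implementation
     /

Setting midpoint = 0, or omitting the keyword entirely, selects the original code path, so the same input file can be used to compare the two implementations directly.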

Optimized code with mixed-precision (SPDP):

    ./configure -intelmpi -openmp -midpoint_spdp intel
    cd $AMBERHOME/src/pmemd/src
    make parallel
    Run in a bash shell:

        export KMP_BLOCKTIME=0 # threads sleep right away when idle (e.g., on barriers)
        export KMP_STACKSIZE=200M # Allocate 200 MB for each OpenMP private stack
        export I_MPI_PIN_DOMAIN=core # Restrict the threads of each MPI rank to one physical core
        export OMP_NUM_THREADS=4 # 4 for Xeon Phi, 2 for Xeon
       
        mpirun -np {cores} $AMBERHOME/bin/pmemd.MPI \
        -O -i mdin -o mdout -p prmtop -c inpcrd 

        # note: {cores} is the number of physical cores, not hyperthreads
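
As a concrete illustration, on a single Intel® Xeon Phi™ 7250 node (68 physical cores) the settings above would translate into the following; adjust the rank count to match the physical core count of your own machine:

        export OMP_NUM_THREADS=4                   # 4 hardware threads per core on the 7250
        mpirun -np 68 $AMBERHOME/bin/pmemd.MPI \
        -O -i mdin -o mdout -p prmtop -c inpcrd    # 68 MPI ranks = one rank per physical core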


Optimized code with full double precision (DPDP):

    ./configure -intelmpi -openmp intel
    cd $AMBERHOME/src/pmemd/src
    make parallel
    Run in a bash shell:

        export KMP_BLOCKTIME=0 # threads sleep right away when idle (e.g., on barriers)
        export KMP_STACKSIZE=200M # Allocate 200 MB for each OpenMP private stack
        export I_MPI_PIN_DOMAIN=core # Restrict the threads of each MPI rank to one physical core
        export OMP_NUM_THREADS=4 # 4 for Xeon Phi, 2 for Xeon
   
        mpirun -np {cores} $AMBERHOME/bin/pmemd.MPI \
        -O -i mdin -o mdout -p prmtop -c inpcrd

        # note: {cores} is the number of physical cores, not hyperthreads


Without hardware specific or OpenMP optimizations:

    ./configure -intelmpi intel
    cd $AMBERHOME/src/pmemd/src
    make parallel
    Run in a bash shell:

        mpirun -np {cores} $AMBERHOME/bin/pmemd.MPI \
        -O -i mdin -o mdout -p prmtop -c inpcrd

        # note: {cores} is the number of physical cores, not hyperthreads


Additional Resources

The following provides some additional resources that you may find useful.
