

# **CPU Engagement Models with Arm**

Arm IP is the basic building block for extraordinary solutions.

#### Core License

- Partner licenses complete microarchitecture.
- CPU differentiation via:
  - Configuration options.
  - Wide implementation envelope with different process technologies.



#### **Architecture License**

- Partner designs complete microarchitecture.
- Clean-room design, from scratch.
- Maximum design freedom:
  - Directly address needs of the target market.
- Arm architecture validation preserves software compatibility.



### Arm Neoverse Momentum in Servers & HPC





# Fujitsu/RIKEN Fugaku: Fastest Supercomputer in the World

#### Top place in 4 categories:

- Top500 @ 416 Pflop/s
- HPCG @ 13.4 Pflop/s
- HPL-AI @ 1.42 Eflop/s
- Graph 500 @ 70,980 GTEPS









# 1. High-Performance Arm CPU A64FX in HPC and AI Areas





#### Architecture features

| Feature      | Specification                                             |
|--------------|-----------------------------------------------------------|
| ISA          | Armv8.2-A (AArch64 only), SVE (Scalable Vector Extension) |
| SIMD width   | 512-bit                                                   |
| Precision    | FP64/32/16, INT64/32/16/8                                 |
| Cores        | 48 computing cores + 4 assistant cores (4 CMGs)           |
| Memory       | HBM2: Peak B/W 1,024 GB/s                                 |
| Interconnect | TofuD: 28 Gbps x 2 lanes x 10 ports                       |
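To make the SIMD width and precision rows concrete, the short sketch below works out how many elements of each listed type fit in one 512-bit vector. The helper name `lanes_per_vector` and the dictionary layout are illustrative choices, not part of any Arm or Fujitsu API.

```python
# SIMD width taken from the table above.
SVE_VECTOR_BITS = 512

# Element widths in bits for the precisions listed in the table.
ELEMENT_BITS = {
    "FP64": 64, "FP32": 32, "FP16": 16,
    "INT64": 64, "INT32": 32, "INT16": 16, "INT8": 8,
}


def lanes_per_vector(vector_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    return vector_bits // element_bits


# Hypothetical helper output: lanes available for each data type.
lanes = {name: lanes_per_vector(SVE_VECTOR_BITS, bits)
         for name, bits in ELEMENT_BITS.items()}

for name, count in lanes.items():
    print(f"{name}: {count} lanes per {SVE_VECTOR_BITS}-bit vector")
```

Halving the element width doubles the number of lanes, which is why lower-precision types can raise per-instruction throughput on the same hardware.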

#### Peak performance (Chip level)



# Vanguard Astra by HPE

- 2,592 HPE Apollo 70 compute nodes
  - 5,184 CPUs, 145,152 cores, 2.3 PFLOPs (peak)
- Marvell ThunderX2 Arm SoC, 28 cores, 2.0 GHz
- Memory per node: 128 GB (16 x 8 GB DR DIMMs)
  - Aggregate capacity: 332 TB, 885 TB/s (peak)

- Mellanox IB EDR, ConnectX-5
  - 112 36-port edges, 3 648-port spine switches
- Red Hat RHEL for Arm
- HPE Apollo 4520 All-flash Lustre storage
  - Storage Capacity: 403 TB (usable)
  - Storage Bandwidth: 244 GB/s
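The aggregate figures above can be reproduced from the per-node numbers. The sketch below is a back-of-the-envelope check, not an official derivation: the node count, core count, clock and memory per node come from the slide, while two sockets per node, 8 FP64 flops per cycle per core, and 8 channels of DDR4-2666 per socket are assumptions chosen because they are consistent with the quoted totals.

```python
# Values stated on the slide.
NODES = 2_592
CORES_PER_CPU = 28
CLOCK_HZ = 2.0e9
MEM_PER_NODE_GB = 128

# Assumptions (not stated on this slide).
SOCKETS_PER_NODE = 2                 # 5,184 CPUs / 2,592 nodes
FLOPS_PER_CYCLE = 8                  # FP64 flops per core per cycle
CHANNELS_PER_SOCKET = 8              # DDR4 channels per CPU
CHANNEL_BYTES_PER_S = 2666e6 * 8     # DDR4-2666, 8 bytes per transfer

cpus = NODES * SOCKETS_PER_NODE
cores = cpus * CORES_PER_CPU
peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15
mem_tb = NODES * MEM_PER_NODE_GB / 1000
bw_tbs = cpus * CHANNELS_PER_SOCKET * CHANNEL_BYTES_PER_S / 1e12

print(f"CPUs: {cpus:,}  cores: {cores:,}")
print(f"Peak: {peak_pflops:.2f} PFLOPs")
print(f"Memory: {mem_tb:.0f} TB  bandwidth: {bw_tbs:.0f} TB/s")
```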



# Isambard system specification

- **10,752** Armv8 cores (168n x 2s x 32c)
  - Cavium ThunderX2 32core 2.1→2.5GHz
- Cray XC50 'Scout' form factor
- High-speed Aries interconnect
- Cray HPC optimised software stack
  - CCE, Cray MPI, math libraries, CrayPAT, ...
- Phase 2 (the Arm part):
  - Delivered Oct 22nd, handed over Oct 29th
  - Accepted Nov 9th
  - Upgrade to final B2 TX2 silicon, firmware, CPE completed March 15th 2019





# Isambard 2 production system

- **21,504** Armv8 cores (336n x 2s x 32c)
  - Marvell ThunderX2 32 core @2.5GHz
- Cray XC50 'Scout' form factor
- High-speed Aries interconnect
- Cray HPC optimised software stack
  - Compilers, math libraries, CrayPAT, ...
  - Also comes with all the open source software toolchains: GNU, Clang/LLVM etc.





# Isambard 2's A64FX Apollo 80 system

- 72 nodes, 3,456 cores, 1.8GHz
  - 72 TB/s memory bandwidth
  - 202 TFLOP/s double precision
- Connected with 100Gbps InfiniBand
- Comes with a Cray software stack
  - CCE, Armclang, GNU
- Hope to add the Fujitsu compiler



# **CEA**: Deployment by ATOS







- 292 Atos Sequana X1310 compute nodes
- 584 CPUs, 18,688 cores
- Marvell ThunderX2 Arm SoC, 32 cores, 2.2 GHz
- Memory: 8 channels, DDR4 2666, 256 GB
- Mellanox InfiniBand EDR

- ✓ Peak performance: 329 TFLOPS
- ✓ HPL efficiency: 84%
- ✓ HPCG = 3.47% of HPL
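As a rough cross-check of the peak and HPL figures, the sketch below derives them from the CPU count and clock speed. The value of 8 FP64 flops per cycle per core is an assumption (it is not stated on the slide), picked because it reproduces the quoted 329 TFLOPS peak.

```python
# Values stated on the slide.
NODES = 292
CPUS = 584
CORES_PER_CPU = 32
CLOCK_HZ = 2.2e9
HPL_EFFICIENCY = 0.84

# Assumption (not stated on the slide): FP64 flops per core per cycle.
FLOPS_PER_CYCLE = 8

sockets_per_node = CPUS // NODES
cores = CPUS * CORES_PER_CPU
peak_tflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
hpl_tflops = peak_tflops * HPL_EFFICIENCY

print(f"{sockets_per_node} sockets per node, {cores:,} cores")
print(f"Peak: {peak_tflops:.0f} TFLOPS, HPL at 84%: {hpl_tflops:.0f} TFLOPS")
```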





#### AWS Graviton2 - an Arm Server Processor



#### **Graviton Processor**



First Arm-based processor available in major cloud



Built on 64-bit Arm Neoverse cores with AWS-designed silicon using 16nm manufacturing technology



Up to 16 vCPUs, 10Gbps enhanced networking, 3.5Gbps EBS bandwidth

#### **Graviton2 Processor**



7x performance, 4x compute cores, and 5x faster memory



Built with 64-bit Arm Neoverse cores with AWS-designed silicon using 7nm manufacturing technology



Up to 64 vCPUs, 25Gbps enhanced networking, 18Gbps EBS bandwidth



## AWS Graviton 2 for HPC workloads

The c6g instances have outstanding price/performance compared to similar x86 instances.



- The AWS Graviton 2 implements the Arm Neoverse N1
- Up to 40% improved price/performance over x86 instances





[Figure: cost and run time; lower is better for both.]



# Arm Software Ecosystem



#### **Applications**

Open-source, owned, commercial ISV codes, ...

#### Containers, Interpreters, etc.

Singularity, Podman, Docker, Python, ...

#### Debuggers & Profilers

Arm Forge (DDT, MAP), Rogue Wave, HPCToolkit, Scalasca, Vampir, TAU, ...

#### **Middleware**

Mellanox IB/OFED/HPC-X, OpenMPI, MPICH, MVAPICH2, OpenSHMEM, OpenUCX, HPE MPI

#### OEM/ODM's

Cray-HPE, ATOS-Bull, Fujitsu, Gigabyte, ...

#### Compilers

Arm, GNU, LLVM, Clang, Flang, Cray, PGI/NVIDIA, Fujitsu, ...

#### **Libraries**

ArmPL, FFTW, OpenBLAS, NumPy, SciPy, Trilinos, PETSc, Hypre, SuperLU, ScaLAPACK, ...

#### **Filesystems**

BeeGFS, Lustre, ZFS, HDF5, NetCDF, ...

#### Schedulers

SLURM, IBM LSF, Altair PBS Pro, ...

#### Cluster Management

Bright, HPE CMU, xCAT, ...

#### OS

RHEL, SUSE, CentOS, Ubuntu, ...

#### **Arm Server Ready Platform**

Standard firmware and RAS

#### Silicon Suppliers

Marvell, Fujitsu, Mellanox, NVIDIA, ...



# A Rich and Growing Application Ecosystem

| GROMACS   | LAMMPS           | CESM2    | MrBayes | Bowtie  | DeepBench |
|-----------|------------------|----------|---------|---------|-----------|
| NAMD      | TensorFlow       | ParaView | SIESTA  | UM      | AMBER     |
| WRF       | Quantum ESPRESSO | VASP     | Torch   | MILC    | GEANT4    |
| OpenFOAM  | GAMESS           | Mahout   | VisIt   | DL_POLY | NEMO      |
| Weka      | BLAST            | NWChem   | Abinit  | BWA     | QMCPACK   |

Application areas: Chem/Phys, Weather, CFD, Visualization, Genomics, AI/ML



#### **GNU and LLVM Toolchains**

Toolchains for all Arm cores – supported at release

#### **Status:**

- LTS Linux distributions support Arm CPU features when a CPU becomes generally available
- Improve performance for key user workloads and industry benchmarks

#### **GNU Toolchain (compilers, debuggers, libraries, etc.)**

- Default compiler in Linux distributions like Red Hat, SUSE, Ubuntu
- Key segments: Cloud, networking and HPC

#### LLVM Toolchain (compilers, debuggers, libraries, etc.)

- Default compiler in Android and the basis for commercial compilers (including Arm and Cray compilers)
- Key segments: Mobile (Android/iOS), Cloud



# **Example: SVE Support**

Over four years of active, ongoing development

- Arm actively posting SVE open source patches upstream
  - Beginning with first public announcement of SVE at HotChips 2016



#### Available upstream

- Since GNU Binutils 2.28 (released Feb 2017): SVE assembler and disassembler
- Since GCC 8: full assembly, disassembly and basic auto-vectorization
- Since LLVM 7: full assembly, disassembly
- Since QEMU 3: user-space SVE emulation
- Since GDB 8.2: HPC use cases fully included

#### Constant upstream review

- LLVM: since Nov 2016, as presented at the LLVM conference
- Linux kernel: since Mar 2017, LWN article on SVE support

Automatic Arm support in latest version of all tools – peer to x86



# **Example: Auto-vectorization in LLVM**

- Auto-vectorization via LLVM vectorizers:
  - Use cost models to drive decisions about what code blocks can and/or should be vectorized.
  - Since October 2018, two different vectorizers used from LLVM: Loop Vectorizer and SLP Vectorizer.
- Loop Vectorizer support for SVE and NEON:
  - Loops with unknown trip count
  - Runtime checks of pointers
  - Reductions
  - Inductions
  - "If" conversion
  - Pointer induction variables
  - Reverse iterators
  - Scatter / gather
  - Vectorization of mixed types
  - Global structures alias analysis
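To illustrate two of the cases listed above (a reduction over a loop whose trip count is only known at run time), here is a pure-Python model of what a vectorized loop does conceptually. It is a sketch for intuition only and is not how LLVM is implemented: the names `LANES` and `vectorized_sum` are hypothetical, and the per-lane predicate for the final partial vector is modelled on SVE-style predication, whereas fixed-width NEON code typically finishes with a scalar remainder loop instead.

```python
VECTOR_BITS = 512            # example width; SVE code is vector-length agnostic
LANES = VECTOR_BITS // 64    # number of FP64 lanes in one vector


def vectorized_sum(values, lanes=LANES):
    """Model a vectorized sum reduction with a run-time trip count."""
    n = len(values)                      # trip count, unknown until run time
    partial = [0.0] * lanes              # one accumulator per vector lane
    for start in range(0, n, lanes):     # each iteration models one vector op
        # Predicate: lane i is active only while start + i is inside the array.
        predicate = [start + i < n for i in range(lanes)]
        for i, active in enumerate(predicate):
            if active:
                partial[i] += values[start + i]
    return sum(partial)                  # horizontal reduction across lanes


data = [float(i) for i in range(1, 21)]  # 20 elements: two full vectors + a tail of 4
print(vectorized_sum(data), sum(data))
```

Note that the per-lane accumulators change the order of additions compared with a scalar loop, which is why compilers usually require relaxed floating-point semantics before vectorizing floating-point reductions.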



# Server & HPC Development Solutions from Arm

Commercially supported tools for Linux and high performance computing

#### **Code Generation**

for Arm servers

#### Arm Compiler for Linux

- Arm C/C++ Compiler
- Arm Fortran Compiler
- Arm Performance Libraries

#### **Performance Engineering**

cross-platform, scalable

- DDT Debugger
- MAP Profiler
- Performance Reports

#### **Server & HPC Solution**

for Arm servers

**Arm Allinea Studio**

Commercially supported toolkit for application development on Linux

- C/C++ Compiler for Linux
- Fortran Compiler for Linux
- Performance Libraries
- Performance Reports
- Debugger (DDT)
- Profiler (MAP)



# **Arm Compiler for Linux**

a.k.a. Arm Compiler for HPC, a.k.a. Arm Allinea Compiler







#### Tuned for Scientific Computing, HPC and Enterprise workloads

- Processor-specific optimizations for various server-class Arm-based platforms
- Optimal shared-memory parallelism using latest Arm-optimized OpenMP runtime

#### Linux user-space compiler with latest features

- C++14 and Fortran 2003 language support with OpenMP 4.5
- Support for Armv8-A and SVE architecture extension
- Based on LLVM and Flang, leading open-source compiler projects

#### Commercially supported by Arm

- Available for a wide range of Arm-based platforms running leading Linux distributions: Red Hat, SUSE and Ubuntu



# Building on LLVM, Clang and Flang projects





### **Arm Performance Libraries**

Optimized BLAS, LAPACK and FFT







#### Commercial 64-bit Armv8-A math libraries

- Commonly used low-level math routines BLAS, LAPACK and FFT
- Provides FFTW compatible interface for FFT routines
- Sparse linear algebra and batched BLAS support
- libamath provides high-performance implementations of math.h functions

#### Best-in-class serial and parallel performance

- Generic Armv8-A optimizations by Arm
- Tuning for specific platforms like Marvell ThunderX2 in collaboration with silicon vendors

#### Validated and supported by Arm

- Available for a wide range of server-class Arm-based platforms
- Validated with NAG's test suite, a de-facto standard



# Arm Performance Libraries – Leading BLAS performance

Arm Compiler for Linux 20.0 vs latest OpenBLAS vs latest BLIS



- BLAS level 3 routines such as **GEMM** combine high serial performance with class-leading parallel performance
- Shown is DGEMM on square matrices using 56 threads on a ThunderX2



# Arm Performance Libraries: OpenMP Scaling on N1

Run on AWS Graviton2



- Shown is DGEMM on square matrices using 64 threads on an AWS Graviton2
- Shown for matrix sizes of 100, 1,000 and 10,000
- Shows up to 85.7% efficiency for large matrices
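For context on these numbers, the sketch below estimates the work in each DGEMM and the speedup implied by the quoted efficiency. It uses the commonly cited 2n³ flop estimate for a square matrix multiply and the usual definition of parallel efficiency (speedup divided by thread count); the function names are illustrative and are not ArmPL APIs.

```python
def dgemm_flops(n):
    """Approximate flop count for multiplying two n x n matrices (2 * n^3)."""
    return 2 * n ** 3


def parallel_efficiency(speedup, threads):
    """Parallel efficiency: achieved speedup divided by the number of threads."""
    return speedup / threads


THREADS = 64                    # from the slide
REPORTED_EFFICIENCY = 0.857     # from the slide, large matrices

# Speedup over a single thread implied by the reported efficiency.
implied_speedup = REPORTED_EFFICIENCY * THREADS

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: about {dgemm_flops(n):.1e} flops")
print(f"Implied speedup on {THREADS} threads: {implied_speedup:.1f}x")
```

The smallest matrix involves about a million times less work than the largest, which is one reason efficiency is typically quoted for large matrices: small problems leave little work per thread to amortize threading overhead.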



## ArmPL 20.0 FFT vs FFTW 3.3.8





# Arm Performance Libraries – Optimized Math Routines

Open Source: https://github.com/ARM-software/optimized-routines

#### Normalised runtime



#### ArmPL includes libamath and libastring

- Algorithmically better performance than standard library calls
- No loss of accuracy
- Enabled by default with Arm Compiler for Linux
- Double-precision implementations of: erf(), erfc()
- Single- and double-precision implementations of: exp(), pow(), log(), log10()
- Single-precision implementations of: sin(), cos(), sincos()
- Efficient memory/string functions from string.h
- Enable autovectorization of math and string routines by adding -armpl or -fsimdmath

...more to come.



#### **Build Tools**

All popular build tools are supported on Arm

#### Support

- All major build systems and tools:
  - CMake, Make, GNUMake, Spack etc.
  - Spack used internally at Arm.
- Arm supports Kitware etc. to ensure build tools like CMake are stable and supported.
- Arm upstreams any necessary changes to support Arm's commercial tools.
  - e.g. CMake toolchain files for Arm Compilers.

#### **Compilation Performance**

- A data point: ThunderX2 compilation of large code bases is on par with Intel Skylake
  - Usually faster due to higher core counts.
- GNU compilers run faster than LLVM, but that's not aarch64-specific; same on any arch.









# **Application Build Recipes and Spack**

#### Spack is used extensively by Arm

- Multiple places for recipes
  - https://gitlab.com/arm-hpc/packages/wikis/packages
  - https://developer.arm.com/hpc/hpc-software/categories/applications
  - https://github.com/UoB-HPC/benchmarks
- Want to move our knowledge base into Spack
  - https://github.com/spack/spack
  - Would like customers to also contribute to Spack
- Ideally get package owners to update their code





# MPI Implementations

Out-of-the-box support for Arm in the latest versions of...

#### **OpenMPI**

- Out-of-the-box support since 3.1.2 (currently 4.0.4)
- developer.arm.com guide
- Upstream contributions
- Used in-house
- Basis of Bull, Mellanox and Fujitsu K implementations
- Active development from Arm and Arm partners

#### **MPICH**

- Basis of Cray and Intel implementations
- ...and MVAPICH

#### **MVAPICH**

- developer.arm.com guide
- Upstream contributions
- Used in-house
- Basis of Sunway TaihuLight implementation
- Arm investment in OSU
  - Arm hardware & tools



#### Parallel Runtime Environments

#### Threading, thread placement, and affinity

- POSIX threads 2.0 fully supported.
- Thread placement, pinning, affinity via hwloc, numactl, etc.
- Most SoCs support a simple memory hierarchy partitioned into a minimal number of NUMA nodes, e.g. one NUMA node per CPU socket.
- The goal is to minimize code refactoring for performance and eliminate "guess and check" data movement optimization strategies.
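As a minimal illustration of pinning, the sketch below restricts the current process to one CPU and then restores the original mask, using the Linux-only `os.sched_getaffinity` and `os.sched_setaffinity` calls from the Python standard library. This stands in for what tools such as numactl or hwloc do from the command line; the helper name `pin_to_cpu` is hypothetical, and choosing the lowest-numbered CPU is an arbitrary choice for the example.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process (pid 0 means 'self') to a single CPU."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):          # Linux-only API
    original = os.sched_getaffinity(0)        # CPUs we are allowed to run on
    target = min(original)                    # arbitrary choice for the example
    pinned = pin_to_cpu(target)
    print(f"allowed before: {sorted(original)}")
    print(f"allowed after pinning: {sorted(pinned)}")
    os.sched_setaffinity(0, original)         # restore the original mask
else:
    print("CPU affinity calls are not available on this platform")
```

In production HPC jobs, placement is more commonly set by the scheduler or MPI launcher than by the application itself.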

#### Dynamically linked libraries and page size

- Users do not need to change anything in their execution environment or workflow to achieve good performance.
  - Demonstrated at multiple application scales at several sites including Sandia and Bristol.
- Tools like LLNL's Spindle are supported to reduce I/O pressure when loading dynamically linked applications.



# **Scientific Computing Libraries**

https://gitlab.com/arm-hpc/packages/wikis/categories/library

#### **Package Support**

- Trilinos, PETSc, Hypre, SuperLU, ScaLAPACK, NetCDF, HDF5, BLIS, etc.
- Tested to work well with Arm and GNU compilers.
- 54+ packages in Arm's Community Packages Wiki

#### **Testing and Development**

- ThunderX2 access freely available for open source project CI/CD
  - packet.net
  - Verne Global

#### Resourcing

- Arm supports communities as part of broader NRE and commercial projects
- Arm provides reactive support to users at key HPC sites worldwide



# **Arm Performance Engineering Tools Ecosystem**

See http://www.vi-hps.org/tools/ for an excellent overview of the tools ecosystem.











# Hardware Performance Counter Support

Hardware performance counter APIs are fully supported

#### **PAPI**

- Support for many aarch64 server-class CPUs:
  - e.g. ThunderX2
- Marvell planning support for future CPUs e.g. ThunderX4

#### perf\_events

- Native HPM API is fully supported
- User applications may:
  - Initialize the HPM
  - Initiate and reset counters
  - Read counters
  - Generate interrupts on counter overflow
  - Register interrupt handlers from each process and thread independently

#### **Documentation and Tools**

- Arm MAP, HPCToolkit, IPM, TAU, ScoreP, etc.
  - HPM values can be accessed by non-privileged users in a secure manner
- Performance metrics derived from multiple counters:
  - Partners provide their own PMU/HPM documentation
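As an example of a metric derived from multiple counters, the sketch below computes instructions per cycle and a cache miss rate from raw counts. The counter names and values are hypothetical placeholders; on a real system they would be read through PAPI or perf\_events using the event names documented by the silicon partner.

```python
# Hypothetical raw counter readings for one measured region.
counters = {
    "cycles": 2_000_000,
    "instructions": 3_000_000,
    "cache_refs": 100_000,
    "cache_misses": 5_000,
}


def instructions_per_cycle(c):
    """IPC: retired instructions divided by elapsed core cycles."""
    return c["instructions"] / c["cycles"]


def cache_miss_rate(c):
    """Fraction of cache references that missed."""
    return c["cache_misses"] / c["cache_refs"]


ipc = instructions_per_cycle(counters)
miss_rate = cache_miss_rate(counters)
print(f"IPC: {ipc:.2f}, cache miss rate: {miss_rate:.1%}")
```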





The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks

# **Arm Forge Ultimate**

A cross-platform toolkit for debugging, profiling and performance analysis







#### The de-facto standard for HPC development

- Available on the vast majority of the Top500 machines in the world
- Fully supported by Arm on Arm servers, x86, IBM Power, Nvidia GPUs, etc.

#### State-of-the-art debugging and profiling capabilities

- Powerful and in-depth error detection mechanisms (including memory debugging)
- Sampling-based profiler to identify and understand bottlenecks
- Available at any scale (from serial to petaflopic applications)

#### Easy to use by everyone

- Unique capabilities to simplify remote interactive sessions
- Innovative approach to present quintessential information to users



# Arm Forge – DDT Parallel Debugger



# Arm Forge – MAP Multi-node Low-overhead Profiler



# Arm Performance Reports Application Analysis Tool

Analyze all performance aspects in a single HTML or TXT file

SIMD, multithreading, and many more...



Follow the guidance to plan your next steps and maximize output

