

# **ENERGY EFFICIENCY IN HIGH PERFORMANCE HETEROGENEOUS COMPUTING**



Brian Bergner & Dr. Rong Ge

#### INTRODUCTION

- High performance computing (HPC) refers to the large-scale computer systems used in fields such as molecular modeling and weather forecasting
- Today's computing systems need to be both faster and more energy efficient than their predecessors
- As Moore's Law predicted, transistor densities are still doubling roughly every 1.5 years
- Clock speeds, however, are limited by power availability, heat dissipation, and current leakage
- Applications can be optimized to use multi-core hardware and coprocessor-based accelerators
- Large software applications, specifically those requiring repeated matrix operations, benefit from this optimization
- Parallelism and hardware efficiency, delivered through multi-core "cluster on a chip" technology, are the future of performance gains
- Finding optimal operating points gives a competitive advantage: the best possible time-to-solution with minimal waste of resources

## PROBLEM & OBJECTIVE

- Many of the world's supercomputers, terascale and petascale systems, are designed to perform as many Floating Point Operations Per Second (FLOPS) as possible. Performance improvements in each node can have ripple effects across an entire system.
- The heterogeneous computing node studied includes an NVIDIA Tesla GPU (Graphics Processing Unit) and an Intel Xeon Phi MIC (Many Integrated Core) coprocessor as system accelerators.
- Our goal is to systematically find the load balancing points that maximize energy efficiency under specific resource constraints while still meeting throughput requirements. The proposed approach is a series of performance tests followed by analysis of energy sensitivity and power consumption. The results can be used to evaluate current and future computing hardware and interfaces and to identify appropriate combinations.
- A heterogeneous system containing a CPU, a GPU and an Intel MIC connected over a PCI Express interface is studied using a BLAS (Basic Linear Algebra Subprograms) operation as the workload.
- Results indicate that a heterogeneous computing system can provide more energy-efficient solutions for scientific computing across a range of performance demands, and that the improvement in system energy efficiency is sensitive to the system components used.
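
The objective above (maximize energy efficiency while meeting a throughput requirement) can be sketched as a constrained selection over measured test runs. This is a minimal illustration, not the study's analysis code: the `Run` record, the `best_split` helper, and the sample numbers are all hypothetical, and energy is approximated as average power multiplied by time-to-solution.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Run:
    """One test run (hypothetical record layout)."""
    cpu_share: int    # percent of the matrix work assigned to the CPU
    seconds: float    # time-to-solution for the job
    avg_watts: float  # average system power during the job

    @property
    def joules(self) -> float:
        # Energy approximated as average power times elapsed time.
        return self.avg_watts * self.seconds


def best_split(runs: Iterable[Run], max_seconds: float) -> Optional[Run]:
    """Return the lowest-energy run that finishes within max_seconds."""
    feasible = [r for r in runs if r.seconds <= max_seconds]
    return min(feasible, key=lambda r: r.joules, default=None)


# Made-up numbers for illustration only; not measurements from this study.
runs = [
    Run(cpu_share=100, seconds=100.0, avg_watts=400.0),  # 40000 J
    Run(cpu_share=50, seconds=40.0, avg_watts=600.0),    # 24000 J
    Run(cpu_share=0, seconds=50.0, avg_watts=450.0),     # 22500 J
]
```

With these sample values, a loose deadline selects the lowest-energy split, while a tighter deadline forces a faster split that uses more energy. This is the kind of trade-off the load balancing search is meant to expose.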

### **HARDWARE**

- 'Ivy' is an Intel-based server with dual Ivy Bridge Intel Xeon E5-2670 v2 CPUs
- Each CPU has 10 cores clocked at 2.5 GHz with a 25 MB Intel Smart Cache
- The NVIDIA GPU accelerator card, connected via the PCI-Express bus, is a Tesla K20c which features 2688 CUDA cores
- The Intel Xeon Phi 7120D coprocessor is connected via the PCI-Express bus and has 61 cores clocked at 1.238 GHz sharing a 30.5 MB L2 cache
- An external power meter records the total energy supplied to the system

### **IMPLEMENTATION & TESTING**

- Jobs are submitted to the 'Ivy' system for processing from a remote computer
- Power measurements are sampled at regular intervals and recorded while the matrix operations run on the CPU and accelerator
- Power averages are calculated and recorded along with performance metrics based on time of job submission and completion
- Each test result creates a data point for further analysis
- For each test only one variable is modified at a time
- Variables tested include:
  - Percent of Matrix Computation per Component
  - Software Power Limits Applied to CPU Package
- Further variables can include:
  - Different Accelerator Hardware
  - Matrix Operation Size
  - Matrix Operation Type
  - Number of Iterations of Matrix Operation
  - Software Based Power Limits Applied to Accelerator
  - Software Power Limits Applied to CPU Core
  - Software Power Limits Applied to DRAM
  - Hardware Based Power Limits
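
The measurement procedure described above (regular power samples, averaged and combined with job submission and completion times) can be sketched as follows. The function names and sample values are assumptions for illustration; the actual meter interface and logging format are not described here.

```python
from typing import Sequence


def average_power(samples_watts: Sequence[float]) -> float:
    """Mean of power samples taken at regular intervals during a job."""
    if not samples_watts:
        raise ValueError("no power samples recorded")
    return sum(samples_watts) / len(samples_watts)


def energy_joules(samples_watts: Sequence[float],
                  submit_time_s: float,
                  complete_time_s: float) -> float:
    """Estimate job energy as average power times elapsed wall time."""
    elapsed = complete_time_s - submit_time_s
    if elapsed < 0:
        raise ValueError("completion time precedes submission time")
    return average_power(samples_watts) * elapsed
```

For example, samples of 400 W, 500 W and 600 W over a 20 second job give an average of 500 W and an estimated 10000 J, which becomes one data point for that test configuration.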

[Table: raw test results, one row per run. Columns appear to cover CPU share (%), GPU share (%), matrix size, iterations, CPU package power limit, performance (CPU, system, GPU), time (CPU, total, GPU), and average power (package, CPU, DRAM, GPU). The numeric values are not legible in this copy.]

## **SOFTWARE & CONFIGURATION**

- 'Ivy' runs a Linux variant (CentOS 6.5) and is configured to measure and limit CPU power usage using Intel's Running Average Power Limit (RAPL) driver
- Intel Manycore Platform Software Stack (MPSS) 3.2 is used to offload computation to the Intel Xeon Phi Coprocessor
- Tests are run using matrix multiplication operations built on Basic Linear Algebra Subprograms (BLAS), the generally accepted building-block standard for linear algebra functions
- cuBLAS is used as a library for GPU accelerated and offloaded subroutines
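
One way to picture the "percent of matrix computation per component" variable is a row-wise partition of the left operand, with each share computed by a different device. The sketch below is an assumption about how such a split could work, not the study's implementation: both shares use a plain-Python `matmul` stand-in where the real system would call a host BLAS routine and a cuBLAS or offloaded routine, and all function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, a stand-in for a BLAS GEMM call."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def split_rows(a: Matrix, cpu_percent: int) -> Tuple[Matrix, Matrix]:
    """Split the rows of `a` into a CPU share and an accelerator share."""
    if not 0 <= cpu_percent <= 100:
        raise ValueError("cpu_percent must be between 0 and 100")
    cut = len(a) * cpu_percent // 100
    return a[:cut], a[cut:]


def split_matmul(a: Matrix, b: Matrix, cpu_percent: int) -> Matrix:
    """Compute a @ b with the rows of `a` divided between two components."""
    cpu_rows, acc_rows = split_rows(a, cpu_percent)
    cpu_part = matmul(cpu_rows, b)  # would run on the host CPU
    acc_part = matmul(acc_rows, b)  # would run on the GPU or coprocessor
    return cpu_part + acc_part      # stack the row blocks back together
```

Because each row block of the product depends only on the matching rows of the left operand, the two shares are independent and can run concurrently. The result is the same for any split percentage.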

### **RESULTS & ANALYSIS**

Results can be analyzed by asking:

- How much performance do we gain by parallelizing the code?
- Are there ways to gain energy efficiency by offloading only certain functions to the accelerator?
- Do specific BLAS operations perform better on specific accelerator hardware?
- How does matrix size affect specific hardware efficiency?
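
The first questions above can be made quantitative with a few simple metrics. The sketch below assumes the usual approximation of 2n³ floating point operations for an n × n dense matrix multiplication; the helper names are hypothetical and the formulas are a generic starting point, not the study's exact analysis.

```python
def gemm_flops(n: int, iterations: int = 1) -> float:
    """Approximate operation count for `iterations` n x n matrix products."""
    return 2.0 * n ** 3 * iterations


def speedup(baseline_seconds: float, seconds: float) -> float:
    """Performance gain relative to a baseline run (e.g. CPU only)."""
    return baseline_seconds / seconds


def gflops_per_watt(n: int, iterations: int,
                    seconds: float, avg_watts: float) -> float:
    """Energy efficiency: sustained GFLOPS divided by average power."""
    gflops = gemm_flops(n, iterations) / seconds / 1e9
    return gflops / avg_watts
```

Comparing `speedup` and `gflops_per_watt` across workload splits, power limits and matrix sizes shows where a faster configuration is also the more energy-efficient one and where the two goals diverge.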





### **ACKNOWLEDGEMENTS**

- Rong Ge, Ryan Vogt, Jahangir Majumder, Arif Alam, Martin Burtscher and Ziliang Zong. Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU. Proceedings of the 42nd International Conference on Parallel Processing. 2013
- Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li and Kirk W. Cameron. PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications. IEEE Transactions on Parallel and Distributed Systems, Vol. 21, No. 5, 658-671 (2010)
- Jeffers, Jim, and James Reinders. Intel Xeon Phi Coprocessor High-performance Programming.