# Toward Extreme-Scale Processor Chips

#### **Josep Torrellas**

Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu

> HiPC 2016 Hyderabad, India





#### Accelerated Progress in Transistor Integration

- Large multicores for data centers
- 3D stacked chips and cloud



Intel Xeon Phi 7290F (Oct 2016) 72 cores, 288 contexts, 260W



Intel 3D Xpoint memory



Micron's Hybrid Memory Cube





#### Research is Pushing Ever Farther Ahead

• More integration  $\rightarrow$  1,000 cores/chip



Runnemede prototype [HPCA-13]

- Research on stacking multiple processor and memory dies





### Meanwhile: Energy Wall... and Performance Wall

• University of Illinois Blue Waters Supercomputer



- Performance: 11 PF
- Power: 6-11 MW (idle to loaded)
- 1 MW = \$1M per year in electricity
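The figures above imply a rough efficiency and running cost. The following is a back-of-the-envelope sketch, not a measured result: it assumes the 11 PF figure is delivered at the loaded 11 MW power level, and all variable names are illustrative.

```python
# Figures quoted on the slide.
PEAK_FLOPS = 11e15                 # 11 PF
POWER_MW = (6, 11)                 # idle, loaded
COST_PER_MW_YEAR_USD = 1e6         # 1 MW = $1M per year

# Assumption: peak performance is reached at the loaded power level.
efficiency_gflops_per_watt = PEAK_FLOPS / (POWER_MW[1] * 1e6) / 1e9

# Yearly electricity bill at idle and at full load.
annual_cost_usd = tuple(mw * COST_PER_MW_YEAR_USD for mw in POWER_MW)

print(f"{efficiency_gflops_per_watt:.1f} GFLOPS/W")
print(f"${annual_cost_usd[0] / 1e6:.0f}M to ${annual_cost_usd[1] / 1e6:.0f}M per year")
```

Under these assumptions the machine delivers about 1 GFLOPS per watt, which is the baseline that extreme-scale designs need to improve on.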

• Technology improvements in speed and power slowing down









- Very high energy efficiency
- Faster communication and synchronization
- Ease of programming





### Energy Wall: How Did We Get Here?

- Ideal scaling (Dennard scaling), per semiconductor generation:
  - Dimension: 0.7
  - Area of transistor:  $0.7 \times 0.7 = 0.49$
  - Supply Voltage V<sub>dd</sub>, C: 0.7
  - Frequency: 1/0.7 = 1.4



Constant dynamic power density

- Real Scaling: V<sub>dd</sub> does not decrease much
  - If too close to threshold voltage (V<sub>th</sub>)  $\rightarrow$  slow transistor
  - Dynamic power density increases with smaller tech
  - Additionally: There is the static power

Power density increases rapidly
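The two scaling regimes can be compared numerically. This is a minimal sketch using the 0.7 factor from the slide and $P_{dyn} \propto C V_{dd}^2 f$; the "real" case assumes, for illustration only, that V<sub>dd</sub> stays completely flat, and static power is ignored.

```python
S = 0.7  # linear dimension shrink per generation (from the slide)

def power_density_factor(vdd_scale: float, s: float = S) -> float:
    """Relative change in dynamic power density after one generation.

    Dynamic power per transistor ~ C * Vdd^2 * f, where C scales by s,
    f scales by 1/s, and transistor area scales by s^2.
    """
    c_scale = s
    f_scale = 1.0 / s
    area_scale = s * s
    power_per_transistor = c_scale * vdd_scale ** 2 * f_scale
    return power_per_transistor / area_scale

ideal = power_density_factor(vdd_scale=S)    # Vdd shrinks with the dimension
real = power_density_factor(vdd_scale=1.0)   # assumed: Vdd does not shrink at all

print(f"ideal scaling: {ideal:.2f}x power density per generation")
print(f"flat Vdd:      {real:.2f}x power density per generation")
```

With ideal scaling the factor is 1.0 (constant power density). With a flat V<sub>dd</sub> it is about 2x per generation, before static power is even counted.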





### Energy Efficiency: Low Voltage Operation

• V<sub>dd</sub> reduction is the best lever for energy efficiency

Dynamic power: $P_{dyn} \propto C V_{dd}^2 f$

Static power: $P_{sta} \propto V_{dd} T^2 e^{-qV_{th}/kT}$

- Advantages:
  - Reduces energy per operation quickly
- Drawbacks:
  - Lower speed
  - Higher variation in gate delay and power consumption
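The trade-off can be illustrated with a small model. Dynamic energy per operation follows from $P_{dyn}/f \propto C V_{dd}^2$. The delay model is the alpha-power law, which is not on the slide; the threshold voltage (0.3 V) and exponent (1.3) are illustrative assumptions. The 0.9 V and 0.6 V operating points are the nominal and low-voltage values quoted elsewhere in the talk.

```python
def dynamic_energy_ratio(vdd: float, vdd_nominal: float) -> float:
    """Dynamic energy per operation relative to nominal (E ~ C * Vdd^2)."""
    return (vdd / vdd_nominal) ** 2

def gate_delay_ratio(vdd: float, vdd_nominal: float,
                     vth: float = 0.3, alpha: float = 1.3) -> float:
    """Gate delay relative to nominal under the alpha-power law.

    delay ~ Vdd / (Vdd - Vth)^alpha; vth and alpha are assumed values.
    """
    def delay(v: float) -> float:
        return v / (v - vth) ** alpha
    return delay(vdd) / delay(vdd_nominal)

energy = dynamic_energy_ratio(0.6, 0.9)
delay = gate_delay_ratio(0.6, 0.9)
print(f"energy per op: {energy:.2f}x, gate delay: {delay:.2f}x")
```

Energy per operation drops quadratically with V<sub>dd</sub>, while delay grows faster as V<sub>dd</sub> approaches V<sub>th</sub>, which is why low-voltage operation pays off mainly when there is parallelism to hide the lower speed.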





### Attaining Very High Energy Efficiency

- Voltage-scalable cores
- Dynamic voltage speculation
- Pervasive power gating
- Control-theoretic controllers














#### Goal: A Voltage-Scalable Core



Go to low voltage (~0.6V) and attain high energy efficiency "EEMode"



Deliver high performance at nominal voltage (~0.9V) "HPMode"

Goal: Operate at very low  $V_{dd}$  when we have parallelism





- SRAM and logic scale differently with V<sub>dd</sub>
- Small increase in  $V_{dd} \rightarrow$  large reduction in delay



[Gopireddy HPCA'16]

- Decouple the Vdd of logic and storage structures in the pipeline
  - Can reduce the Vdd of logic more  $\rightarrow$  higher energy efficiency



[Gopireddy HPCA'16]

- Raise Vdd of storage structures a little: faster at low E cost
  - Reconfigure the pipeline to leverage the faster storage structures and improve IPC



- At nominal, high-performance conditions (HPMode):
  - Conventional processor
- When energy efficiency matters (EEMode):
  - Decouple  $V_{dd}$  for storage and logic stages in the pipeline
    - Storage stages ~2x faster than logic stages
  - Reconfigure pipeline in one of the two ways:
    - Fuse storage stages in the pipeline (e.g., access register file)
    - Increase storage structure sizes (e.g., load-store queue)
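Stage fusion can be sketched as a simple transformation on a pipeline description. This is an illustrative model only: the stage names, the logic/storage classification, and the greedy fusion rule are assumptions, not the design from the cited paper. It relies on the slide's figure that storage stages are about 2x faster than logic stages in EEMode, so two adjacent storage stages fit in one cycle.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    kind: str  # "logic" or "storage"

def fuse_storage_stages(stages, storage_speedup=2.0):
    """Merge runs of adjacent storage stages that fit in one logic-stage cycle."""
    per_cycle = max(1, int(storage_speedup))  # storage stages per cycle
    fused, run = [], []

    def flush():
        # Emit the pending run of storage stages in groups of per_cycle.
        for i in range(0, len(run), per_cycle):
            group = run[i:i + per_cycle]
            fused.append(Stage("+".join(s.name for s in group), "storage"))
        run.clear()

    for stage in stages:
        if stage.kind == "storage":
            run.append(stage)
        else:
            flush()
            fused.append(stage)
    flush()
    return fused

# Hypothetical HPMode pipeline with a two-cycle register file access.
hp_pipeline = [
    Stage("fetch", "logic"), Stage("decode", "logic"), Stage("rename", "logic"),
    Stage("rf_read_1", "storage"), Stage("rf_read_2", "storage"),
    Stage("execute", "logic"), Stage("writeback", "logic"),
]
ee_pipeline = fuse_storage_stages(hp_pipeline)
print(len(hp_pipeline), "->", len(ee_pipeline), [s.name for s in ee_pipeline])
```

A shorter pipeline reduces branch misprediction and dependence penalties, which is one way the reconfiguration could improve IPC.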











### Fusing Two Pipeline Stages into One














- Highly energy-efficient when needed (parallel sections):
  - Vdd of logic stages very low
  - Reconfigured to fuse stages to increase IPC
- High performance at nominal conditions (serial sections):
  - Unmodified pipeline





# Attaining Very High Energy Efficiency

- Voltage-scalable cores
- Dynamic voltage speculation
- Pervasive power gating
- Control-theoretic controllers





### Risky Ways to Reduce V<sub>dd</sub>














# How Much Can We Reduce the Vdd?

[Bacha ISCA'13]



**Observation**: When running a stress-test workload, correctable errors are always triggered before uncorrectable ones.





### Reducing the Voltage of On-Chip Network

[Ansari HPCA'14]

- Networks typically have error detection capabilities
- Networks connect slow and fast parts of the chip (due to process variation)
- Propose:
  - Dynamically reduce Vdd of different parts of the network
  - Detect and handle errors





### Error Rate as Function of Vdd

- On-chip network with many routers
- Error rate per router as we change Vdd



• Process variation has a major impact on the routers





### Leveraging the Error Handling of Networks

- Reduce Vdd of clusters of routers based on their tolerance
  - Continuously monitor for errors (and handle them)
  - Dynamically adapt Vdd of each cluster of routers based on errors
- Highly energy efficient
  - Remove Vdd margins added for variation
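The adaptation loop can be sketched as follows. This is a simplified illustration, not the algorithm from the cited paper: the error model (errors appear exactly when Vdd drops below a per-cluster minimum), the 10 mV step, the cluster names, and the minimum voltages are all hypothetical. The sketch also stops lowering after the first error, whereas the scheme on the slide adapts continuously.

```python
def tune_cluster_vdd(vmin_mv: int, nominal_mv: int = 900,
                     step_mv: int = 10, epochs: int = 100) -> int:
    """Lower a cluster's Vdd until errors appear, then back off one step.

    vmin_mv is the cluster's true minimum safe Vdd, unknown to the
    controller; it is only used here to model when errors are detected.
    """
    vdd = nominal_mv
    lowering = True
    for _ in range(epochs):
        errors_detected = vdd < vmin_mv   # hypothetical error model
        if errors_detected:
            # Errors are handled by the network (e.g., retransmission);
            # the controller raises Vdd and stops probing lower.
            vdd += step_mv
            lowering = False
        elif lowering:
            vdd -= step_mv
    return vdd

# Process variation gives each cluster a different tolerance (made-up values).
clusters_vmin_mv = {"A": 780, "B": 835, "C": 810}
tuned = {name: tune_cluster_vdd(vmin) for name, vmin in clusters_vmin_mv.items()}
print(tuned)
```

Each cluster settles at the lowest step that shows no errors, so the margin added for worst-case variation is paid only by the clusters that need it.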





### Scheme Operation (Initial)







#### Scheme Operation (Lowering Voltage)







#### Scheme Operation (Vdd Tuning on a Path)







#### Scheme Operation (Convergence)













# Attaining Very High Energy Efficiency

- Voltage-scalable cores
- Dynamic voltage speculation
- Pervasive power gating
- Control-theoretic controllers





- Components not in use need to be power-gated
  - OS/software can do it sometimes, but has overhead
  - Many short idle periods; need HW-based power gating
    - Example: Last-level cache miss
- When we power-gate a structure, we lose its state
- Propose: micro-checkpoint the pipeline for fast restoration of state





### Use Non-Volatile Memory for Micro-Checkpointing

- Challenge: NVM write latency
  - Need to bring NVM access latency to < 10 cycles



#### Monolithic integration

#### Same die integration





[Pan ICCD'14]

- Write-latency sensitive units:
  - Reg file, Inst window, ROB, Ld/st queue, pipeline regs
  - Implemented in SRAM + shadow in STTRAM
- Hybrid SRAM/STTRAM
  - SRAM for primary storage
  - STTRAM shadow of identical size
  - Data moved to shadow lazily
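The hybrid organization can be modeled in a few lines. This is a behavioral sketch under assumptions: the class and method names are illustrative, dirty tracking with a set is a modeling convenience, and the lazy copy is driven by an explicit call rather than by idle write ports.

```python
class HybridArray:
    """SRAM structure with an identically sized non-volatile shadow."""

    def __init__(self, size: int):
        self.sram = [0] * size     # primary storage (volatile)
        self.shadow = [0] * size   # models the STTRAM shadow
        self.dirty = set()         # entries not yet copied to the shadow

    def write(self, idx: int, value: int) -> None:
        # Writes complete at SRAM speed; the shadow is updated later.
        self.sram[idx] = value
        self.dirty.add(idx)

    def read(self, idx: int):
        return self.sram[idx]

    def lazy_copy(self, budget: int = 1) -> int:
        """Move up to `budget` dirty entries to the shadow in the background."""
        moved = 0
        while self.dirty and moved < budget:
            idx = self.dirty.pop()
            self.shadow[idx] = self.sram[idx]
            moved += 1
        return moved

    def checkpoint(self) -> int:
        """Before power gating: flush whatever is still dirty."""
        return self.lazy_copy(budget=len(self.dirty))

    def power_gate(self) -> None:
        self.sram = [None] * len(self.sram)   # volatile state is lost

    def wakeup(self) -> None:
        self.sram = list(self.shadow)         # restore from the shadow

rf = HybridArray(4)
rf.write(0, 11)
rf.write(2, 22)
rf.lazy_copy()              # one entry migrates during normal operation
flushed = rf.checkpoint()   # only the remaining dirty entry is written
rf.power_gate()
rf.wakeup()
print(flushed, rf.read(0), rf.read(2))
```

Because most entries have already migrated lazily, the checkpoint at power-gating time only has to write the few entries that are still dirty, which keeps the slow NVM writes off the critical path.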







- Checkpointing and wakeup of cores managed by L1 cache controller
- Checkpoint/wakeup sequence:







# Attaining Very High Energy Efficiency

- Voltage-scalable cores
- Dynamic voltage speculation
- Pervasive power gating
- Control-theoretic controllers





- Extreme scale manycores need effective controllers
  - Power, energy, temperature, utilization...
- Current approaches for architectural control and tuning
  - Heuristics
  - Machine learning
  - Control theory





### Heuristics

- Advantages:
  - Lightweight
  - Popular with architects
- Drawbacks:
  - No guarantees
  - No formal methodology
  - Hard to add learning
  - Prone to errors
  - Hard to deal with multiple inputs and/or outputs





# Controlling a System with Control Theory

[Pothukuchi ISCA'16]



- The model is {A, B, C, D} + Unpredictability matrices
  - Obtained from analytical formulas or experimental characterization
- Want MIMO control (Multiple Inputs and Multiple Outputs)
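For reference, the {A, B, C, D} matrices usually denote the standard discrete-time state-space form, written here as a sketch with the unpredictability (noise) terms omitted. The symbols $x$ (internal state), $u$ (inputs), and $y$ (outputs) follow the usual control-theory convention and are assumed here rather than taken from the slide:

$$
\begin{aligned}
x(t+1) &= A\,x(t) + B\,u(t) \\
y(t) &= C\,x(t) + D\,u(t)
\end{aligned}
$$

In a MIMO controller, $u$ and $y$ are vectors, so a single model captures how every input affects every output.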





### MIMO Control

- Actuate on multiple inputs (u): cache size, frequency, #ROB entries
- Control multiple outputs: performance (BIPS), power





# **Control Theory**

- Advantages:
  - Feedback loop: runtime adaptation to conditions not seen during training
  - Guarantees: convergence, stability, optimality
  - Easy to add or remove an input
- Drawbacks:
  - Specifying the target values of the outputs (power, performance) is not obvious





- Each input and each output has a cost (or weight)
  - Cost of an input: How hard it is to change it from its current value
  - Cost of an output: Cost of not meeting the target of the output
  - System will try to minimize the changes to costly inputs/outputs





#### MIMO Controller Details

• Relative cost of inputs & outputs controls the inertia of the system





Input weights << output weights: Ripply system

Input weights >> output weights: System with inertia
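The effect of the weights can be seen in a toy single-input, single-output example. This is a deliberately simplified sketch, not the controller from the cited paper: it assumes a static plant `y = gain * u` and a controller that, at each step, picks the input change `du` minimizing `q * (target - y_next)**2 + r * du**2`. The names `q` (output weight) and `r` (input weight) are illustrative.

```python
def control_step(u: float, target: float, gain: float, q: float, r: float) -> float:
    """One step of a weighted least-squares controller.

    Minimizes q * (target - gain * (u + du))**2 + r * du**2 over du,
    whose closed-form solution is du = q*gain*error / (q*gain**2 + r).
    """
    error = target - gain * u
    du = q * gain * error / (q * gain ** 2 + r)
    return u + du

def run(q: float, r: float, steps: int = 5, u0: float = 0.0,
        target: float = 1.0, gain: float = 1.0) -> list:
    """Return the output trace for the given weights."""
    u, trace = u0, []
    for _ in range(steps):
        u = control_step(u, target, gain, q, r)
        trace.append(gain * u)
    return trace

reactive = run(q=100.0, r=1.0)    # input weight << output weight
inertial = run(q=1.0, r=100.0)    # input weight >> output weight
print([round(y, 3) for y in reactive])
print([round(y, 3) for y in inertial])
```

With a low input weight the controller closes almost the whole gap in one step, so it would also chase every disturbance (a ripply system). With a high input weight it moves in small steps toward the target (a system with inertia).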





### Uses of the Controller (I)

- Set outputs to target values:
  - Performance (BIPS<sub>0</sub>) and Power ( $P_0$ )








### Uses of the Controller (II)

- Set outputs to varying target values:
  - Changing the quality of service (QoS) as the battery is depleted in a mobile device



### Uses of the Controller (III)

- Optimize a combination of output measures:
  - Minimize the energy-delay product (E×D), which is equivalent to maximizing IPS<sup>2</sup>/Power

Propose: Optimizer that searches directly in the space of (IPS<sup>2</sup>,P)
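The equivalence between E×D and IPS<sup>2</sup>/Power follows from a short derivation, assuming a fixed instruction count $N$ and average power $P$ (both symbols are introduced here for illustration). With delay $D = N/\mathrm{IPS}$ and energy $E = P \cdot D$:

$$
E \times D = P \cdot D^2 = P \cdot \frac{N^2}{\mathrm{IPS}^2} = \frac{N^2}{\mathrm{IPS}^2 / P}
$$

Since $N$ is fixed, minimizing $E \times D$ is the same as maximizing $\mathrm{IPS}^2/P$, which is why the optimizer can search directly in the space of (IPS<sup>2</sup>, P).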







# Attaining Very High Energy Efficiency

- Voltage-scalable cores
- Dynamic voltage speculation
- Pervasive power gating
- Control-theoretic controllers





#### More Energy Efficiency? Need New Technologies



- Picture is unclear
- If we want energy efficiency, we will get lower performance and need to rely on more parallelism (many more cores)
- Likely a combination of technologies in the same die/stack



**TFET Characteristics** 

- Advantages:
  - Can be fabricated on the same die as CMOS
  - Consumes much less power (4-8x less)
  - Scalable
- Drawbacks:
  - Not as fast as CMOS (2-4x slower)





#### What Will Happen?

• Heterogeneous architectures?







### Conclusion

- Energy and power efficiency are the strongest constraints in future computer architectures
- There is no silver bullet (or perhaps it is V<sub>dd</sub> reduction)
- Some principles:
  - Reduce voltage (safely or taking risks)
  - Turn off if unused
  - Minimize waste










Currently: Big/Little











**Big/Little Not Optimal** 

- Fixed partitioning of cores
  - A fraction of chip unused
- Migration overhead



**ARM System** 



