### Minimum Precision Requirements of Deep Neural Networks

Numerical Software Verification July 20, 2020



### Naresh R. Shanbhag



Charbel Sakr, Sujan Gonugondla, and Hassan Dbouk

## The Efficiency Challenge in Learning

### **Model Complexity**

### **Data Movement Cost**



#### **ILLINOIS** Electrical & Computer Engineering

### **fundamental question**

### how do we design **learning machines** that operate at the limits of accuracy-robustness-energy efficiency with guarantees?

### **Shanbhag Group Research Vectors**




# **Machine Learning in Reduced Precision**

- Google's TPU [ISCA'17]: 8b fixed-point (inference)
- MIT's Eyeriss [ISSCC'16]: 16b fixed-point (inference)
- Intel's NNP (Nervana) [NIPS'17]: 16b flexpoint (training)
- IBM's AI Core [VLSI'18]: 16b floating-point (training)

Are these the minimum precisions required?

Can minimum precision requirements be determined analytically?

UIUC (Sakr, Shanbhag) - ICML 2017, ICASSP 2018, ICLR 2019, ICLR 2019 (with IBM)


## Outline

- 1) "What are the minimum output precision requirements of a dot product kernel to meet a specific accuracy requirement at its output (DP accuracy)?"
- 2) "What are the minimum precision requirements of a DNN to meet a specific accuracy requirement at its output (network accuracy)?"
- 3) Employ the above two insights to determine the precision limits of the recently proposed **in-memory computing (IMC) architectures**.

## Minimum Output Precision Requirements for Dot Product




# Why Output Precision $B_y$ ?



$$y_q = \mathbf{w}^{\mathrm{T}} \mathbf{x} + q_y$$

- $B_y$ is the accumulator precision in digital architectures → accumulator complexity dominates power in low-precision DNNs
  - e.g., a 32b accumulator consumes 10× more power than a 3×1-b multiplier in 28nm CMOS, hence the research on low-resolution accumulation [Sakr ICLR'19; Wang NeurIPS'18]
- $B_y$  is the ADC precision in in-memory architectures  $\rightarrow$  ADCs can dominate (~80%) latency and power when implementing DNNs [Kim ISLPED'18, Rekhi DAC'20]

## **Quantization Noise Model**



- additive model assumption:  $q_x$  is independent of x
- SQNR : signal-to-quantization noise ratio  $\rightarrow$  accuracy measure
- $\zeta$  : peak-to-average (power) ratio  $\rightarrow$  measure of 'peakiness' of signal distribution

$$SQNR_x = 10 \log_{10} \left[ \frac{\sigma_x^2}{\sigma_{q_x}^2} \right]$$

$$SQNR_{x}(dB) = 6B_{x} + 4.78 - \zeta_{x} (dB)$$
$$\zeta_{x} = \frac{x_{m}}{\sigma_{x}}$$
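The rule of thumb above can be checked numerically. The following is a minimal sketch, assuming a uniform quantizer with $2^{B_x}$ levels over $[-x_m, x_m]$ and reading $\zeta_x(dB)$ as $20\log_{10}(x_m/\sigma_x)$; the function names and the uniformly distributed test signal are illustrative choices, not taken from the slides.

```python
import numpy as np

def quantize(x, bits, x_max):
    """Uniform quantizer with 2**bits levels on [-x_max, x_max] (assumed model)."""
    step = 2.0 * x_max / 2**bits
    idx = np.clip(np.floor(x / step), -2**(bits - 1), 2**(bits - 1) - 1)
    return (idx + 0.5) * step

def sqnr_db_measured(x, bits, x_max):
    """Empirical SQNR: signal variance over mean squared quantization error."""
    noise = quantize(x, bits, x_max) - x
    return 10.0 * np.log10(np.var(x) / np.mean(noise**2))

def sqnr_db_predicted(x, bits, x_max):
    """SQNR(dB) = 6B + 4.78 - zeta(dB), with zeta(dB) = 20*log10(x_max / sigma_x)."""
    zeta_db = 20.0 * np.log10(x_max / np.std(x))
    return 6.02 * bits + 4.78 - zeta_db

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)  # hypothetical test signal
for bits in (4, 6, 8):
    print(bits, round(sqnr_db_measured(x, bits, 1.0), 2),
          round(sqnr_db_predicted(x, bits, 1.0), 2))
```

For this test signal the measured and predicted values should agree closely; a peakier distribution (larger $\zeta_x$) lowers the SQNR at the same $B_x$.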

## **Fixed-Point Dot Product**





input quantization noise *reflected* at the output

## **Accuracy Metrics for Quantized Dot Products**

$$SQNR_{\rm T} = \left[\frac{1}{SQNR_{q_{iy}}} + \frac{1}{SQNR_{y}}\right]^{-1}$$
 Limited by  $SQNR_{q_{iy}}$ 

- Choose $SQNR_y(dB) \ge SQNR_{q_{iy}}(dB) + 9$ to keep its impact on $SQNR_T$ to about 0.5 dB

$$SQNR_y(dB) = 6B_y + 4.8 - [\zeta_x(dB) + \zeta_w(dB)] - 10\log_{10}(N)$$

- But for fixed $B_y$, $SQNR_y(dB)$ decreases with N (N is in the hundreds in DNNs) $\rightarrow$ must increase $B_y$
- But a large $B_y$ leads to very large accumulator bit widths
- How to choose the output precision $B_y$?
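The two relations on this slide can be evaluated directly. This is a small sketch, assuming the formulas exactly as written above; the numeric inputs (a 30 dB input-reflected SQNR, 8 dB peak-to-average ratios, a 39 dB target) are hypothetical values chosen only for illustration.

```python
import math

def db_to_linear(db):
    return 10.0 ** (db / 10.0)

def linear_to_db(ratio):
    return 10.0 * math.log10(ratio)

def total_sqnr_db(sqnr_qiy_db, sqnr_y_db):
    """SQNR_T = [1/SQNR_qiy + 1/SQNR_y]^-1, combined in the linear domain."""
    inverse = 1.0 / db_to_linear(sqnr_qiy_db) + 1.0 / db_to_linear(sqnr_y_db)
    return linear_to_db(1.0 / inverse)

def output_sqnr_db(b_y, zeta_x_db, zeta_w_db, n):
    """SQNR_y(dB) = 6*B_y + 4.8 - (zeta_x + zeta_w) - 10*log10(N)."""
    return 6.0 * b_y + 4.8 - (zeta_x_db + zeta_w_db) - 10.0 * math.log10(n)

def bits_for_target(target_db, zeta_x_db, zeta_w_db, n):
    """Smallest integer B_y for which the formula above reaches target_db."""
    needed = (target_db - 4.8 + zeta_x_db + zeta_w_db + 10.0 * math.log10(n)) / 6.0
    return math.ceil(needed)

sqnr_qiy_db = 30.0  # hypothetical input-reflected SQNR
for margin_db in (0, 3, 6, 9, 12):
    loss = sqnr_qiy_db - total_sqnr_db(sqnr_qiy_db, sqnr_qiy_db + margin_db)
    print(f"margin {margin_db:2d} dB -> SQNR_T loss {loss:.2f} dB")

for n in (64, 256, 1024):
    print(n, bits_for_target(39.0, 8.0, 8.0, n))
```

Inverting the formula this way shows why $B_y$ must grow with N when the full output range is kept.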

# Bit Growth Criterion (BGC) for Choosing $B_y$



$$B_y = B_x + B_w + \log_2(N)$$

$$SQNR_y^{BGC}(dB) = 6(B_x + B_w) + 4.8 - [\zeta_x(dB) + \zeta_w(dB)] + 10\log_{10}(N)$$

- commonly employed in digital architectures and network design
- $B_y$  (accumulator precision) and  $SQNR_y$  both increase with N
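As a consistency check (a sketch that uses only the general $SQNR_y$ expression and the BGC assignment stated in these slides), substituting $B_y = B_x + B_w + \log_2(N)$ and using $6\log_2(N) \approx 20\log_{10}(N)$ gives:

$$
\begin{aligned}
SQNR_y^{BGC}(dB) &= 6\,[B_x + B_w + \log_2(N)] + 4.8 - [\zeta_x(dB) + \zeta_w(dB)] - 10\log_{10}(N) \\
&\approx 6(B_x + B_w) + 20\log_{10}(N) + 4.8 - [\zeta_x(dB) + \zeta_w(dB)] - 10\log_{10}(N) \\
&= 6(B_x + B_w) + 4.8 - [\zeta_x(dB) + \zeta_w(dB)] + 10\log_{10}(N)
\end{aligned}
$$

The $\log_2(N)$ extra bits more than offset the $-10\log_{10}(N)$ penalty, which is why $SQNR_y$ grows with N under BGC.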

## **Proposed - Minimum Precision Criterion (MPC)**



- allows a non-zero but small probability of clipping $(p_c)$, whereas BGC avoids clipping entirely
- exploits reduction in  $\frac{\sigma}{\mu}$  of  $y_o$  with N (Central Limit Theorem)
- exhibits a trade-off between clipping noise and quantization noise
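The clipping-versus-quantization trade-off can be illustrated with a short Monte Carlo experiment. This is only a sketch and not the authors' MPC derivation: it assumes weights and inputs drawn uniformly from $[-1, 1]$, N = 256, an 8-bit output quantizer, and a clipping level set to a multiple k of the output standard deviation. All names and parameter values are hypothetical.

```python
import numpy as np

def quantize(y, bits, y_max):
    """Uniform quantizer with 2**bits levels on [-y_max, y_max]; values outside are clipped."""
    step = 2.0 * y_max / 2**bits
    idx = np.clip(np.floor(y / step), -2**(bits - 1), 2**(bits - 1) - 1)
    return (idx + 0.5) * step

def sqnr_db(y, y_hat):
    """Signal power over total (quantization + clipping) error power, in dB."""
    return 10.0 * np.log10(np.mean(y**2) / np.mean((y_hat - y)**2))

rng = np.random.default_rng(0)
n, trials, bits = 256, 20_000, 8
w = rng.uniform(-1.0, 1.0, size=(trials, n))
x = rng.uniform(-1.0, 1.0, size=(trials, n))
y = np.sum(w * x, axis=1)          # dot-product outputs
sigma_y = np.std(y)

# Worst-case range (no clipping possible): |y| <= N when |w|, |x| <= 1.
sqnr_full_range = sqnr_db(y, quantize(y, bits, float(n)))

# Clipped ranges: k standard deviations of the observed output.
results = {}
for k in (2, 3, 4, 5, 6):
    y_max = k * sigma_y
    clip_prob = np.mean(np.abs(y) > y_max)
    results[k] = (sqnr_db(y, quantize(y, bits, y_max)), clip_prob)

best_k = max(results, key=lambda k: results[k][0])
print(f"full range: {sqnr_full_range:.1f} dB")
for k, (s, p) in results.items():
    print(f"k={k}: SQNR {s:.1f} dB, clipping probability {p:.2e}")
print("best k:", best_k)
```

A small k wastes accuracy on clipping noise, a large k wastes it on quantization noise, and the full worst-case range spends most of the available levels on values that almost never occur.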


# **Comparing MPC and BGC**



- MPC achieves the desired $SQNR_y^*$ with minimum precision ($B_y = 8$)
- BGC is a huge overkill → leads to very large accumulator bit widths ($B_y = 16$ to $20$)
- tBGC (truncated BGC) needs  $B_y = 12$  (still significant)
- Use MPC to assign minimum output (accumulator) precision


## **Input Precision Requirements for DNNs**




## **Related Works**

- much work on reduced precision machine learning since 2015 [stochastic rounding, BinaryNets, TernGrad, pruning, ...]

### But...

- largely based on heuristics relying on the benevolence of SGD
- lacking theoretical guarantees on accuracy: try and hope it works!
- difficult to realize in H/W: complex and irregular arithmetic

# **Deep Learning in Finite Precision**

- **Fixed-point inference with theoretical guarantees**: Sakr, Kim, Shanbhag, ICML 2017; Sakr & Shanbhag, ICASSP 2018
- **Fixed-point training with close-to-minimal precisions**: Sakr & Shanbhag, ICLR 2019
- **Fl.-pt. training with accumulation bit-width scaling**: Sakr & Shanbhag, ICLR 2019 (with K. Gopalakrishnan [IBM])

code available at https://github.com/charbel-sakr



## **Precision Analysis Framework**



$$p_m \le \Delta_A^2 E_A + \Delta_W^2 E_W$$

- no retraining; per-layer precision; activation vs. weight trade-off
- noise gains computed via one standard backprop iteration
- minimizing  $p_m$  done via noise equalization (NE)
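One way to read the bound together with noise equalization is sketched below. It assumes $p_m$ is the probability that the fixed-point network's decision differs from the floating-point one, that the quantization step size is $\Delta = 2^{-(B-1)}$, and that equalization means giving each of the two terms half of the target budget. The step-size convention, the function names, and the example noise gains are assumptions made for illustration, not values from the slides.

```python
import math

def step_size(bits):
    """Assumed quantization step for a B-bit signed fixed-point value in [-1, 1)."""
    return 2.0 ** (-(bits - 1))

def mismatch_bound(b_a, b_w, e_a, e_w):
    """Upper bound p_m <= Delta_A^2 * E_A + Delta_W^2 * E_W."""
    return step_size(b_a) ** 2 * e_a + step_size(b_w) ** 2 * e_w

def equalized_precisions(e_a, e_w, p_target):
    """Smallest (B_A, B_W) such that each term contributes at most p_target / 2."""
    def min_bits(gain):
        # 2^(-2(B-1)) * gain <= p_target / 2  =>  B >= 1 + 0.5 * log2(2 * gain / p_target)
        return max(1, math.ceil(1.0 + 0.5 * math.log2(2.0 * gain / p_target)))
    return min_bits(e_a), min_bits(e_w)

# Hypothetical noise gains, e.g. as estimated from one backprop pass.
e_a, e_w, p_target = 10.0, 1000.0, 0.01
b_a, b_w = equalized_precisions(e_a, e_w, p_target)
print(b_a, b_w, mismatch_bound(b_a, b_w, e_a, e_w))
```

Under these assumptions the tensor with the larger noise gain receives more bits, which is how the relative values of the noise gains set the activation vs. weight trade-off.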


### Lesson 1 – Precision Trade-offs Captured Analytically



- bound predicts the minimum precision to within 1 to 2 bits
- bound reveals trade-offs between network precisions
- trade-offs captured by relative values of noise gains


## Lesson 2 – Precision Decreases with Depth

CIFAR-10 using VGG-9



- weights typically require more precision than activations
- precision decreases because early perturbations are most destructive



## Lesson 3 – BinaryNets are More Complex than Minimum Precision Networks

CIFAR-10 using VGG-9



- up to 3.5× lower complexity & 2× lower storage than BinaryNet at iso-accuracy
- empirically observed by [Moons, Verhelst]


# **Challenges in Fixed-point Training**



- multiple forward quantization noise sources
- unknown gradient dynamic range
- instability due to quantization noise bias in updates
- back-propagation of quantization noise in activation gradients
- risk of premature stoppage of convergence


- gradients & weights: huge dynamic range mismatch
- gradients: spatio-temporally varying dynamic range
- activations & weights: spatio-temporally varying distributions

[Figures: weight gradients (layer3.1.conv1.weight) and weights (layer4.1.conv2.weight), ResNet-18 on CIFAR-10]





## **The Solution**

- analytically guaranteed to provide close-to-minimal precision @ iso-accuracy
- requires a floating-point network to be probed during training


# FX Training Converges with Close-to-Minimal Precisions



- FX training was believed to be hard due to dynamic range issues [Koester, NeurIPS'2017]
- proposed FX training is able to match FL training accuracy
- precision assignment found to be nearly minimal

# **Per-layer Precision Trends Vary**



- weight precision decreases from network input to output
- precisions of activation gradients and weight accumulators increase
- ResNets have uniform precision requirements per tensor type


# Comparison w.r.t. Hyper-precision Reduction Techniques

**CIFAR-10 ConvNet**

|                    | $\mathcal{C}_W$ ($10^{6}$ b) | $\mathcal{C}_A$ ($10^{6}$ b) | $\mathcal{C}_M$ ($10^{9}$ FA) | $\mathcal{C}_C$ ($10^{6}$ b) | Test Error |
|--------------------|------------------------------|------------------------------|-------------------------------|------------------------------|------------|
| FL                 | 148                          | 9.3                          | 94.4                          | 49                           | 12.02%     |
| $\mathbf{FX}(C_o)$ | 56.5                         | 1.7                          | 11.9                          | 14                           | 12.58%     |
| BN                 | 100                          | 4.7                          | 2.8                           | 49                           | 18.50%     |
| SQ                 | 78.8                         | 1.7                          | 11.9                          | 14                           | 11.32%     |
| TG                 | 102                          | 9.3                          | 94.4                          | 3.1                          | 12.49%     |

**CIFAR-100 ResNet**

|                    | $\mathcal{C}_W$ ($10^{6}$ b) | $\mathcal{C}_A$ ($10^{6}$ b) | $\mathcal{C}_M$ ($10^{9}$ FA) | $\mathcal{C}_C$ ($10^{6}$ b) | Test Error |
|--------------------|------------------------------|------------------------------|-------------------------------|------------------------------|------------|
| FL                 | 1789                         | 97                           | 4319                          | 597                          | 28.06%     |
| $\mathbf{FX}(C_o)$ | 750                          | 25                           | 776                           | 216                          | 27.43%     |
| BN                 | 1211                         | 50                           | 128                           | 597                          | 29.35%     |
| SQ                 | 1081                         | 25                           | 776                           | 216                          | 28.03%     |
| TG                 | 1230                         | 97                           | 4319                          | 37.3                         | 30.62%     |

- feedforward binarization (BN) and gradient ternarization (TG) fail to match FL accuracy for same topology
- stochastic quantization (SQ) provides marginal returns
- BN, TG, SQ do not address the fundamental problem of realizing true FX training
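To make the comparison easier to read, the cost columns can be turned into reduction factors relative to FL. The snippet below is a small arithmetic sketch over the CIFAR-10 ConvNet numbers in the table; the dictionary layout and key names are illustrative only.

```python
# Costs from the CIFAR-10 ConvNet table: (C_W, C_A, C_M, C_C, test error in %)
cifar10 = {
    "FL": (148.0, 9.3, 94.4, 49.0, 12.02),
    "FX": (56.5, 1.7, 11.9, 14.0, 12.58),
    "BN": (100.0, 4.7, 2.8, 49.0, 18.50),
    "SQ": (78.8, 1.7, 11.9, 14.0, 11.32),
    "TG": (102.0, 9.3, 94.4, 3.1, 12.49),
}
cost_names = ("C_W", "C_A", "C_M", "C_C")

def reduction_vs_fl(table, method):
    """Ratio FL cost / method cost for each cost column (higher means cheaper)."""
    fl = table["FL"]
    row = table[method]
    return {name: fl[i] / row[i] for i, name in enumerate(cost_names)}

for method in ("FX", "BN", "SQ", "TG"):
    ratios = reduction_vs_fl(cifar10, method)
    error_gap = cifar10[method][4] - cifar10["FL"][4]
    summary = ", ".join(f"{k} {v:.1f}x" for k, v in ratios.items())
    print(f"{method}: {summary}, error change {error_gap:+.2f} points")
```

Read this way, FX lowers all four costs at once, while BN and TG each leave at least one cost at the FL level.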

# Precision Requirements for In-Memory Architectures

## **UIUC 6T SRAM Deep In-memory IC Prototypes**

### **100X EDP reduction over von Neumann equivalent<sup>\*</sup>** @ iso-accuracy



23 Mar 2018 | 15:00 GMT

# To Speed Up AI, Mix Memory and Processing

New computing architectures aim to extend artificial intelligence from the cloud to smartphones

#### By Katherine Bourzac



Image: Sujan Gonugondla. **Tearing Down Walls:** This prototype features a new chip design called deep in-memory architecture.

If John von Neumann were designing a computer today, there's no way he would build a thick wall between processing and memory. At least, that's what computer engineer <u>Naresh Shanbhag</u> of the University of Illinois at Urbana-Champaign believes. The eponymous von Neumann architecture was published in 1945. It enabled the first stored-memory, reprogrammable computers, and it's been the backbone of the industry ever since.

Now, Shanbhag thinks it's time to switch to a design that's better suited for today's data-intensive tasks. In February, at the <u>International Solid-State Circuits Conference</u> (ISSCC), in San Francisco, he and others made their case for a new architecture that brings computing and memory closer together. The idea is not to replace the processor altogether but to add new functions to the memory that will make devices smarter without requiring more power.



### The Deep In-memory Architecture (DIMA)




https://spectrum.ieee.org/computing/hardware/to-speed-up-ai-mix-memory-and-processing

## **In-memory ICs for Machine Learning is Hot!**




### **Research Questions**

- Is there a common basis for these architectures?
- What are their precision (compute SNR) limits? (a BIG question for IMCs today)








## **Fixed-Point Dot Product on IMCs**





noise sources: analog noise $\eta_a$ and output quantization noise $q_y$

$$y_o = \mathbf{w}^{\mathrm{T}}\mathbf{x}, \qquad y_d = y_o + q_y$$

$$SNR_a = \frac{\sigma_{y_o}^2}{\sigma_{\eta_a}^2} \quad \text{(analog SNR)}$$

$$SNR_d = \frac{\sigma_{y_o}^2}{\sigma_{q_y}^2} \quad \text{(digitization SNR)}$$

$$SNR_{\rm T} = \left[\frac{1}{SNR_{\rm a}} + \frac{1}{SNR_{\rm d}}\right]^{-1}$$

Limited by $SNR_{\rm a}$


### **SNR Tradeoffs in Charge Summing Architectures**



$$SNR_{a} = \frac{12\sigma_{w}^{2}E[x^{2}]}{E[x^{2}]\Delta_{w}^{2} + \sigma_{w}^{2}\Delta_{x}^{2} + E[x^{2}]\frac{D_{c}}{4}\frac{\sigma_{I}^{2}}{I_{o}^{2}} + \sigma_{w}^{2}\frac{D_{c}}{4}\frac{\sigma_{T}^{2}}{T_{o}^{2}} + \frac{1}{N}\sigma_{\eta_{c}}^{2}}$$

- discharge current and pulse width trade off with clipping noise
- clipping noise dominates as dimensionality N increases

SNR trades off with N and  $V_{WL}$ 


## **ADC Precision Requirements**



precision limited by $SNR_a$

$$SNR_a(dB) - SNR_T(dB) \le 0.5 dB$$



$$B_y > \min\left(\log_2(k_{\text{clip}}), \frac{\text{SNR}_a(\text{dB}) + 16.6}{6}\right)$$
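The bound above is easy to evaluate for a given design point. The helper below is a minimal sketch that implements the inequality exactly as written on the slide and returns the smallest integer strictly greater than the bound; the function name and the example values of $SNR_a$ and $k_{\text{clip}}$ are hypothetical.

```python
import math

def min_adc_bits(snr_a_db, k_clip):
    """Smallest integer B_y with B_y > min(log2(k_clip), (SNR_a(dB) + 16.6) / 6)."""
    bound = min(math.log2(k_clip), (snr_a_db + 16.6) / 6.0)
    return math.floor(bound) + 1

# Hypothetical design points: (analog SNR in dB, clipping level k_clip)
for snr_a_db, k_clip in ((20.0, 1024), (30.0, 1024), (40.0, 16)):
    print(snr_a_db, k_clip, min_adc_bits(snr_a_db, k_clip))
```

When the analog SNR is low, the SNR term sets the ADC precision; when $k_{\text{clip}}$ is small, the $\log_2(k_{\text{clip}})$ term takes over.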


## Summary

- Design of DNNs need not be trial-&-error based: analytical methods (for precision assignment) exist or can be developed
- use MPC & noise gain analysis to determine minimum precisions of DNNs
- parallel considerations apply to in-memory architectures: interplay between analog and quantization noise sources
- Next: design optimization framework? network accuracy vs. energy vs. latency vs. ...

### Major challenge – engineered design of AI systems $\rightarrow$ composability, interpretability, robustness, security, ethics, with guarantees




### **Thank You!**

http://shanbhag.ece.uiuc.edu