

**Trinity College Dublin** Coláiste na Tríonóide, Baile Átha Cliath The University of Dublin

# Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing

PASM

James Garland, David Gregg

SFI Project 12/IA/1381

Date: 23 Jan 2019



## **Research Challenge**

"By the year 2600, the world's population would be standing shoulder to shoulder, and the electricity consumption would make the Earth glow red-hot." <sup>1</sup>

— We need to start now to prevent a toasty warm environment!



- Artificial Intelligence (AI) and machine learning (ML) are becoming ubiquitous.
- They consume more and more power in data centres.
- How can we stop this trend of increasing power consumption whilst getting ML into off-line embedded devices?

## Quick Intro to CNNs



Convolutional neural network (CNN) architecture <sup>2</sup>

<sup>2</sup> mathworks.com

## The One That Started It All! (AlexNet)



- CNNs have hundreds of thousands or more multiply-accumulate operations, e.g. AlexNet<sup>3</sup>
  - However, LeNet was the pioneer for OCR<sup>4</sup>
- 90% of computation time is spent in the convolution layer <sup>5</sup>

<sup>3</sup> Krizhevsky et al. 2012.
<sup>4</sup> LeCun et al. 1998.
<sup>5</sup> Farabet et al. 2010.

## **Convolution Layer**



- To reduce computation time, systolic array loops are unrolled
- CNN Challenges:
  - A lot of data movement required due to megabytes of weight data
  - Hardware convolution accelerators could have as many multipliers as multiply-accumulate (MAC) operations

- Hardware multipliers are large and power hungry. <sup>6</sup>

<sup>6</sup> Sabeetha et al. 2015.
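To make the MAC count of a convolution layer concrete, here is a minimal sketch of the direct loop nest in plain Python. It assumes stride 1 and no padding, and the function name `conv_layer` and the nested-list data layout are illustrative choices, not taken from the slides. The symbols M (kernels), C (input channels), K (kernel size) and IMG (image size) follow the notation used on the results slides.

```python
def conv_layer(image, kernels):
    """Direct convolution, assuming stride 1 and no padding.

    image:   C x H x W nested lists
    kernels: M x C x K x K nested lists
    Returns the M x (H-K+1) x (W-K+1) output and the number of MACs performed.
    """
    C, H, W = len(image), len(image[0]), len(image[0][0])
    M, K = len(kernels), len(kernels[0][0])
    out_h, out_w = H - K + 1, W - K + 1
    output = [[[0.0] * out_w for _ in range(out_h)] for _ in range(M)]
    macs = 0
    for m in range(M):                      # each kernel
        for y in range(out_h):              # each output row
            for x in range(out_w):          # each output column
                acc = 0.0
                for c in range(C):          # each input channel
                    for ky in range(K):
                        for kx in range(K):
                            # One multiply-accumulate per innermost iteration.
                            acc += image[c][y + ky][x + kx] * kernels[m][c][ky][kx]
                            macs += 1
                output[m][y][x] = acc
    return output, macs


if __name__ == "__main__":
    # IMG = 5 x 5, K = 3 x 3, C = 15, M = 2, with all-ones data for illustration.
    image = [[[1.0] * 5 for _ in range(5)] for _ in range(15)]
    kernels = [[[[1.0] * 3 for _ in range(3)] for _ in range(15)] for _ in range(2)]
    out, macs = conv_layer(image, kernels)
    print(macs)  # 2430 MACs for this small layer
```

A fully unrolled hardware implementation of this nest would need one multiplier per innermost iteration, which is why the multiplier count matters.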

## Weight Shared CNN Accelerator

- Reduce the weight data movement
- Pre-trained weights pruned and quantised to 16-256 shared values <sup>7</sup>.
- Pre-trained weight values are stored in a weights register file.
- Weights are indexed, retrieved, then multiplied by the corresponding image value.
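The indexed lookup described above can be sketched as follows. This is only an illustration: the function name `weight_shared_mac`, the four shared weight values and the sample data are hypothetical, and integers are used to keep the arithmetic easy to follow.

```python
def weight_shared_mac(image_values, weight_indices, shared_weights):
    """Weight-shared MAC: each input carries an index into a small weight table."""
    acc = 0
    for value, idx in zip(image_values, weight_indices):
        weight = shared_weights[idx]   # retrieve the shared weight from the register file
        acc += value * weight          # one multiplication per input value
    return acc


if __name__ == "__main__":
    shared_weights = [2, -1, 4, 3]          # hypothetical table of 4 shared values
    image_values = [5, 1, 3, 2, 7, 4]
    weight_indices = [0, 3, 0, 1, 2, 3]     # which shared weight each input uses
    print(weight_shared_mac(image_values, weight_indices, shared_weights))  # 57
```

Only the small indices travel with the data instead of full-width weights, but the unit still performs one multiplication per input value.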



## We Propose PASM

Multiple-PAS-Shared-MAC: parallel accumulate shared MAC (PASM)

- Multiple parallel accumulate and store (PAS) units followed by **one** shared MAC.
- PAS units accumulate w-bit **image** values into a register file of $b = 2^{wci}$ bins
- Post-pass MAC multiplies weights with binned image values
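The two phases above can be sketched in a few lines of Python. This is a software illustration under stated assumptions, not the hardware design: the function names `pas_accumulate` and `post_pass_mac`, the four shared weights and the sample data are all hypothetical.

```python
def pas_accumulate(image_values, weight_indices, num_bins):
    """PAS phase: add each image value into the bin selected by its weight index."""
    bins = [0] * num_bins
    for value, idx in zip(image_values, weight_indices):
        bins[idx] += value             # addition only, no multiplier needed
    return bins


def post_pass_mac(bins, shared_weights):
    """Post-pass phase: multiply each bin total by its shared weight and sum."""
    acc = 0
    for bin_total, weight in zip(bins, shared_weights):
        acc += bin_total * weight      # one multiplication per bin
    return acc


if __name__ == "__main__":
    shared_weights = [2, -1, 4, 3]          # hypothetical table of 4 shared values
    image_values = [5, 1, 3, 2, 7, 4]
    weight_indices = [0, 3, 0, 1, 2, 3]

    bins = pas_accumulate(image_values, weight_indices, len(shared_weights))
    print(bins)                                  # [8, 2, 7, 5]
    print(post_pass_mac(bins, shared_weights))   # 57

    # Same result as multiplying every input by its indexed weight directly.
    direct = sum(v * shared_weights[i] for v, i in zip(image_values, weight_indices))
    assert direct == post_pass_mac(bins, shared_weights)
```

The result is identical because multiplication distributes over addition; the number of multiplications drops from one per input value to one per bin, at the cost of the extra post-pass.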















### **PASM In Operation**

















|      | 0    | 1   | 2   | 3    |
|------|------|-----|-----|------|
| bins | 32.8 | 3.4 | 4.8 | 17.7 |









## Complexity of the PAS

| Sub Component         | Gates    | Simple MAC | Weight Shared MAC | PAS |
|-----------------------|----------|------------|-------------------|-----|
| Adder                 | $O(W)$   | 1          | 1                 | 1   |
| Multiplier            | $O(W^2)$ | 1          | 1                 |     |
| Weight Register       | $O(W)$   | 0          | B                 |     |
| Accumulation Register | $O(W)$   | 1          | 1                 | B   |
| File Port             | $O(WB)$  |            | 1                 | 2   |
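As a rough way to read the table, the sketch below turns each asymptotic gate term into a unitless cost with every hidden constant set to 1. Both that choice and treating the empty table cells as zero are assumptions made here for illustration; the function name `relative_cost` is hypothetical and the outputs are not gate counts.

```python
def relative_cost(W, B):
    """Unitless cost model: W is the bit width, B the number of bins.

    Assumptions (not from the slides): all big-O constants are 1 and
    empty table cells count as zero components.
    """
    adder = W              # O(W)
    multiplier = W * W     # O(W^2)
    register = W           # O(W), per weight or accumulation register
    file_port = W * B      # O(WB)
    return {
        "simple_mac": adder + multiplier + register,
        "weight_shared_mac": adder + multiplier + B * register + register + file_port,
        "pas": adder + B * register + 2 * file_port,
    }


if __name__ == "__main__":
    for name, cost in relative_cost(W=32, B=4).items():
        print(f"{name}: {cost}")
```

The point of the model is structural: the PAS column has no $O(W^2)$ multiplier term, so its cost grows with $WB$ rather than $W^2$, which favours small bin counts and wide data.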

## **PASM - Gate Count Results**

- Utilization results show a **66%** lower NAND2 gate count for PASM (**lower is better**).






## **PASM - Power Consumption Results**

- Power results show **70%** lower total power consumption for PASM (**lower is better**).






## Kernel Idea Published by IEEE CAL

- Short 4-page paper published in IEEE Computer Architecture Letters <sup>8</sup>.
- DOI: 10.1109/LCA.2017.2656880
- Cited three times (so far!)



<sup>8</sup> Garland et al. 2017.

## **Extended Research**

- Designed three CNN accelerators
  - Standard convolution accelerator (no weight sharing).
  - Weight shared convolution accelerator
  - Weight shared convolution accelerator implemented with PASM.
- Designed in SystemC rather than Verilog
- Optimised / implemented in field programmable gate array (FPGA) and application specific integrated circuit (ASIC)
- Compared timing, latency, power and gate count of the three designs in FPGA and ASIC



## **Typical Numbers of MAC Operations**

| kernels (K) | input_channels (C) = 32 | input_channels (C) = 128 | input_channels (C) = 512 |
|-------------|-------------------------|--------------------------|--------------------------|
| 1x1         | 32                      | 128                      | 512                      |
| 3x3         | 288                     | 1152                     | 4608                     |
| 5x5         | 800                     | 3200                     | 12800                    |
| 7x7         | 1568                    | 6272                     | 25088                    |
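Every entry in the table equals K × K × C, the number of multiply-accumulates needed to produce one output value. A short sketch reproduces the table; the function name `macs_per_output` is illustrative.

```python
def macs_per_output(kernel_size, input_channels):
    """MACs needed for one output value of a K x K kernel over C input channels."""
    return kernel_size * kernel_size * input_channels


if __name__ == "__main__":
    channels = (32, 128, 512)
    for k in (1, 3, 5, 7):
        row = [macs_per_output(k, c) for c in channels]
        print(f"{k}x{k}: {row}")
```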

## Weight-Shared Convolution with PASM



## **Development Flow - FPGA and ASIC**



<sup>9</sup> Xilinx User Guide 902 Vivado High Level Synthesis.
 <sup>10</sup> Cadence Genus User Guide.

#### **ASIC Results**

- 4 bins, 32-bit values, IMG=5 × 5, K=3 × 3, C=15, M=2
- 8% increase in latency
- 48% less total area
- 53% less total power

Figure legend: Inverter, Buffer, Sequential, Logic, Total

Figure: ASIC power (mW)

Figure: Latency Comparison of Accelerators (ASIC)



#### **FPGA Results**

- 4 bins, 32-bit values, IMG=5 × 5, K=3 × 3, C=15, M=2
- 8.5% increase in latency
- 99% fewer DSPs
- 28% fewer BRAMs
- 80% power saving





Figure: FPGA power (W) by accelerator type (Non-Weight-Shared, Weight-Shared, Weight-Shared-with-PASM)

Figure: Latency Comparison of Accelerators (FPGA)



## Extended Idea Published by ACM TACO

- 25-page paper published in ACM TACO <sup>11</sup>.
- DOI: 10.1145/3233300
- Cited once (so far!)



<sup>11</sup> Garland et al. 2018.

## To Sum Up



- There's a great need to reduce power and resources in a CNN.
- This will reduce power consumption in data centres, allow implementation in low-power embedded devices and help the environment.
- We change the programming model of CNNs by rearchitecting the MAC.
- These are optimised / implemented in FPGA and ASIC.
  - **8.5%** increase in latency for PASM
  - ASIC: 48% less total area; 53% less total power
  - FPGA: **99%** fewer DSPs; **28%** fewer BRAMs; **80%** less total power
- We show that PASM reduces power, ASIC gate count and FPGA resources compared with the other two designs, with only a slight increase in latency.




# Thank You

- James Garland: https://www.scss.tcd.ie/~jgarland/
- David Gregg: https://www.scss.tcd.ie/David.Gregg/