



# FPGA hardware acceleration turns out to be a software-based design flow



#### **Accelerators and Systems**



- An accelerator is a dedicated piece of IP implemented in the configurable logic of an SoC and coupled to the processing system
- The goal is to offload the processor's computationally intensive tasks to the hardware, where they can be executed at a significantly higher rate
- The design of the internals of the accelerator is referred to as the microarchitecture and is governed by coding style and #pragmas

BRINGING YOU THE NEXT LEVEL IN EMBEDDED DEVELOPMENT



#### **System Design Challenges**



- How to connect the processor to the accelerator?
  - AXI ports: general-purpose masters and slaves, ACP, high performance, ACE, HPC
  - Interrupts, WFE, WFI, polling
  - Clocking
  - Cache and memory utilization
  - Data movement (DMA, datamover)
- How to coordinate hardware and software?
  - Polling versus interrupting
  - Knowing when the DMA and accelerator(s) are done
  - Knowing where the data is at the end of an acceleration process
  - Blocking versus non-blocking coding styles and support





- Achieving higher computing performance is the primary objective
- Saving processor cycles by offloading the computation
- High performance of the PL-based accelerator itself
  - Lower latency
  - Higher throughput
  - Several times faster compared to software-based computation
- Ensure that data transfer delays between PS and accelerator do not eliminate the performance gain from the accelerator



#### **System-level Considerations**





- Will it meet performance requirements on the first try?
  - What changes are required at the macro/micro-architecture levels (or both)?



#### Zynq-7000 SoC Block Diagram








#### **Zynq Accelerator Interfaces**







#### Zynq UltraScale<sup>+</sup> MPSoC








#### **Zynq UltraScale+ Accelerator Interfaces**

- Accelerator coherency port (ACP)
- AXI coherency extension (ACE)
- Two high-performance coherency (HPC) interfaces
- Four AXI high-performance slave ports
- Two high-performance master ports
  - Can be accessed from APU or RPU





#### **Data Flow Model**



- Custom IP for complex function and data flow
- PS used for control and resource management
  - Minimal to no data processing by the CPUs
- Custom IP in PL operates nearly autonomously from the PS
  - May pass through the PS to access the DDR using the HP ports





#### **Acceleration Model**



- **PS** primarily configures data for the accelerator
  - Can also perform significant tasks
- **PL** for hardware acceleration
  - Custom IP tightly coupled with processor
  - Accelerator reacts to PS
- **Communications between PS and PL**
  - GP ports used for accelerator management
  - Data moved on high-efficiency ports (ACP/HPx)
  - Interrupts or event signals used to signal significant occurrences







#### **Typical ACP Accelerator Example**



1. The CPU leaves (updates) data in either the L1 or L2 cache, depending on the volume of data to move to the accelerator.
2. The CPU notifies the accelerator via the event bus to begin data operations.
3. The accelerator issues a read to the SCU through the AXI slave port of the ACP. Data may be returned from L1 or L2 cache, OCM, or (worst case) DDR.
4. After processing, the accelerator writes back to the specified memory location, which may be in L1, L2, OCM, or DDR, via the AXI slave connected to the ACP.





#### **Typical ACP Accelerator Example** (cont.)



5. The SCU ensures coherency by placing the data into the appropriate location, ideally L1 or L2 cache, but possibly DDR. This is handled transparently by the SCU; neither the accelerator nor the CPUs need to worry about it.
6. The accelerator notifies the PS via the event bus that it has completed.
7. The accelerator is now out of the picture, and one or both of the CPUs begin operating on the returned data, which should now be in a near (fast) memory (L1, L2, OCM). Where there is too much data or the wrong addresses are targeted, data movement will involve DDR or other slower memories.





#### **Design Flow without SDSoC**








#### Add directives to your C/C++ code



Vivado HLS source view of `dct.c` (left pane of the screenshot; the listing starts inside `dct_1d`):

<pre>
      for (n = 0, tmp = 0; n &lt; DCT_SIZE; n++) {
         int coeff = (int)dct_coeff_table[k][n];
         tmp += src[n] * coeff;
      }
      dst[k] = DESCALE(tmp, CONST_BITS);
   }
}

void dct_2d(dct_data_t in_block[DCT_SIZE][DCT_SIZE],
            dct_data_t out_block[DCT_SIZE][DCT_SIZE])
{
   dct_data_t row_outbuf[DCT_SIZE][DCT_SIZE];
   dct_data_t col_outbuf[DCT_SIZE][DCT_SIZE], col_inbuf[DCT_SIZE][DCT_SIZE];
   unsigned i, j;

   // DCT rows
Row_DCT_Loop:
   for (i = 0; i &lt; DCT_SIZE; i++) {
      dct_1d(in_block[i], row_outbuf[i]);
   }
   // Transpose data in order to re-use 1D DCT code
Xpose_Row_Outer_Loop:
   for (j = 0; j &lt; DCT_SIZE; j++)
Xpose_Row_Inner_Loop:
      for (i = 0; i &lt; DCT_SIZE; i++)
         col_inbuf[j][i] = row_outbuf[i][j];
   // DCT columns
Col_DCT_Loop:
   for (i = 0; i &lt; DCT_SIZE; i++) {
      dct_1d(col_inbuf[i], col_outbuf[i]);
   }
   // Transpose data back into natural order
Xpose_Col_Outer_Loop:
   for (j = 0; j &lt; DCT_SIZE; j++)
Xpose_Col_Inner_Loop:
      for (i = 0; i &lt; DCT_SIZE; i++)
         out_block[j][i] = col_outbuf[i][j];
}
</pre>

Directive pane (right pane of the screenshot):

- `dct_1d`
  - `dct_coeff_table`
  - `DCT_Outer_Loop`: `HLS PIPELINE`
    - `DCT_Inner_Loop`
- `dct_2d`: `HLS INLINE`
  - `row_outbuf`, `col_outbuf`
  - `col_inbuf`: `HLS ARRAY_RESHAPE variable=col_inbuf complete dim=2`
  - `Row_DCT_Loop`
  - `Xpose_Row_Outer_Loop`
    - `Xpose_Row_Inner_Loop`: `HLS PIPELINE`
  - `Col_DCT_Loop`
  - `Xpose_Col_Outer_Loop`
    - `Xpose_Col_Inner_Loop`: `HLS PIPELINE`
- `read_data`
  - `RD_Loop_Row`
    - `RD_Loop_Col`: `HLS PIPELINE`
- `write_data`
  - `WR_Loop_Row`
    - `WR_Loop_Col`: `HLS PIPELINE`
- `dct`: `HLS DATAFLOW`
  - `input`, `output`
  - `buf_2d_in`: `HLS ARRAY_RESHAPE variable=buf_2d_in complete dim=2`
  - `buf_2d_out`




#### Add #pragma to your C/C++ code








#### **Compare different Solutions**



- Each solution uses a different directive file
  - Constraints
- Improved latency using a pipeline directive or #pragma
- Performance gain comes with area overhead

Vivado HLS report comparison for `dct.c`. Compared solutions: solution7 (xc7z020clg484-1) and solution1.

Performance estimates, timing (ns):

| Clock `ap_clk` | solution7 | solution1 |
|----------------|-----------|-----------|
| Target         | 10.00     | 10.00     |
| Estimated      | 9.73      | 6.60      |

Performance estimates, latency (clock cycles):

|              | solution7 | solution1 |
|--------------|-----------|-----------|
| Latency min  | 627       | 3959      |
| Latency max  | 627       | 3959      |
| Interval min | 132       | 3960      |
| Interval max | 132       | 3960      |

Utilization estimates:

| Resource | solution7 | solution1 |
|----------|-----------|-----------|
| BRAM_18K | 22        | 5         |
| DSP48E   | 16        | 1         |
| FF       | 3769      | 272       |
| LUT      | 5499      | 874       |



#### **Design Flow with SDSoC**







#### **Embedded Design Flow with SDSoC**



Migrate C/C++ functions to hardware

- System-level debug and profile
- Simple hardware/software partitioning
- Full system generation including driver and hardware connectivity





#### **SDSoC System Level Profiling**



- Rapid system performance estimation
  - Full system estimation (programmable logic, data communication, processing system)
  - Reports SW/HW cycle level performance and hardware utilization
- Automated performance measurement
  - Runtime measurement by instrumentation of cache, memory, and bus utilization





#### **SDSoC System Level Profiling**



#### Performance, speedup and resource estimation report for the 'Topic' project

Note: Performance estimation assumes worst-case latency of hardware accelerators, it also assumes worst-case data transfer size for arrays (if transfer size cannot be determined at compile time). If the accelerator latency and data transfer size at run-time is smaller than such assumptions, the performance estimation will be more pessimistic than the actual performance.

#### Summary

| Performance estimates for 'main' function |             |
|-------------------------------------------|-------------|
| SW-only (Measured cycles)                 | 13736544117 |
| HW accelerated (Estimated cycles)         | 96206486    |
| Estimated speedup                         | 142.78      |

#### Details

#### Performance estimates for functions 'sobel\_filter, sharpen\_filter and rgb\_2\_gray'

| SW-only (Measured cycles)         | 2741921807 |
|-----------------------------------|------------|
| HW accelerated (Estimated cycles) | 13854282   |
| Estimated speedup                 | 197.91     |

#### **Resource utilization estimates for hardware accelerators**

| Resource | Used | Total  | % Utilization |
|----------|------|--------|---------------|
| DSP      | 3    | 220    | 1.36          |
| BRAM     | 6    | 140    | 4.29          |
| LUT      | 715  | 53200  | 1.34          |
| FF       | 600  | 106400 | 0.56          |



#### **Core** Vision



#### **Our competences**

Core|Vision has more than 125 man-years of design experience in hardware and software development. Our competence areas are:

- System Design
- FPGA Design
- Consultancy / Training
- Digital Signal Processing
- Embedded Real-time Software
- App development (iOS, Android)
- Data Acquisition, digital and analog
- Modeling & Simulation
- PCB design & Layout
- Doulos & Xilinx Training Partner







# CORE Vision

# Q&A



Cereslaan 10b, 5384 VT Heesch, +31 (0)412 660088

www.core-vision.nl, Email: info@core-vision.nl










- SYSTEM DEVELOPMENT
- DEDICATED ELECTRONICS
- EMBEDDED SOFTWARE
- DESIGN SERVICES
- MODELING AND SIMULATION

# Visit our booth 27



| Course                                              | Duration |
|-----------------------------------------------------|----------|
| Essentials of FPGA Design                           | 1 day    |
| Designing for Performance                           | 2 days   |
| Advanced FPGA Implementation                        | 2 days   |
| Design Techniques for Lower Cost                    | 1 day    |
| Designing with Spartan-6 and Virtex-6 Family        | 3 days   |
| Essential Design with the PlanAhead Analysis Tool   | 1 day    |
| Advanced Design with the PlanAhead Analysis Tool    | 2 days   |
| Xilinx Partial Reconfiguration Tools and Techniques | 2 days   |
| Designing with the 7 Series Families                | 2 days   |





| Course                                                                            | Duration |
|-----------------------------------------------------------------------------------|----------|
| Designing FPGAs Using the Vivado Design Suite 1                                   | 2 days   |
| Designing FPGAs Using the Vivado Design Suite 2                                   | 2 days   |
| Designing FPGAs Using the Vivado Design Suite 3                                   | 2 days   |
| Designing FPGAs Using the Vivado Design Suite 4                                   | 2 days   |
| Designing with the UltraScale and UltraScale<sup>+</sup> Architecture             | 2 days   |
| Vivado Design Suite for ISE Software Project Navigator User                       | 1 day    |
| Vivado Design Suite Advanced XDC and Static Timing Analysis for ISE Software User | 2 days   |







| Course                                                   | Duration |
|----------------------------------------------------------|----------|
| Designing with Multi Gigabit Serial IO                   | 3 days   |
| High Level Synthesis with Vivado                         | 2 days   |
| C-Based HLS Coding for Hardware Designers                | 1 day    |
| C-Based HLS Coding for Software Designers                | 1 day    |
| DSP Design Using System Generator                        | 2 days   |
| Essential DSP Implementation Techniques for Xilinx FPGAs | 2 days   |






| Course                                                     | Duration |
|------------------------------------------------------------|----------|
| Embedded Systems Design                                    | 2 days   |
| Embedded Systems Software Design                           | 2 days   |
| Advanced Features and Techniques of SDK                    | 2 days   |
| Advanced Features and Techniques of EDK                    | 2 days   |
| Zynq All Programmable SoC Systems Architecture             | 2 days   |
| Zynq UltraScale<sup>+</sup> MPSoC for the System Architect | 2 days   |
| Introduction to the SDSoC Development Environment          | 1 day    |
| Advanced SDSoC Development Environment & Methodology       | 2 days   |







| Course                              | Duration |
|-------------------------------------|----------|
| VHDL for Designers                  | 3 days   |
| Advanced VHDL                       | 2 days   |
| Comprehensive VHDL                  | 5 days   |
| Expert VHDL Verification            | 3 days   |
| Expert VHDL Design                  | 2 days   |
| Expert VHDL                         | 5 days   |
| Essential Digital Design Techniques | 2 days   |



DOULOS