FPGA-centric software acceleration made easy
Build (embedded) high-performance computing solutions using FPGA acceleration technology with standard software programming methods
By Dirk van den Heuvel, Product Manager, Topic Embedded Systems
The world of computing is becoming more and more heterogeneous. Processing platforms integrate an ever wider variety of processing units at an increasing pace, in single SOCs, in (embedded) PCs and in combinations of edge and cloud computing. This comes with a shift in programming methods, a broad range of software abstractions and a focus on coding efficiency, such as “low-code” development. An interesting question is how to program these different processing architectures with a common and effective software approach. For multi-core processor architectures, a wide variety of compiler solutions is already available. However, heterogeneous accelerator platforms, which combine GPUs, FPGAs and neural network accelerators with multi-core CPUs, still require specific programming skills.
This white paper addresses the programming of FPGA fabric from a software development perspective and explains a method to develop and integrate FPGA functionality in a typical software development context without requiring deep FPGA experience.
2. Bridging the CPU-FPGA integration gap
The main difference between a CPU and an FPGA is that a CPU executes instructions on a predefined silicon logic structure, whereas on an FPGA the logic structures themselves also need to be designed. In return, everything implemented on the FPGA runs in parallel, which can boost performance considerably. This also means that the programming abstraction level of an FPGA is lower than that of a CPU. The use of FPGA devices is therefore mostly focused on high-speed signal processing, video applications, communication interfaces, blockchain algorithms and other compute-intensive applications with regular constructs. Programming FPGA devices requires a hardware description language such as VHDL or Verilog. In addition, IP libraries and OpenCL-type kernels can make life easier, but low-level FPGA design know-how remains required.
For many years, a lot of effort has been put into creating compilers that target FPGA implementations using C or C++ as the programming language, such as Xilinx High Level Synthesis (HLS). Where a CPU compiler builds functionality on top of an existing, fixed processor architecture, an FPGA compiler must also generate the logic structure. This technology is now mature, and because the logic densities of FPGA devices are very high, a bit of logic inefficiency in the generated result is acceptable.
3. Dynamic Process Loader (Dyplo®)
A remaining problem is that, after compilation of the FPGA functionality, the result still needs to be integrated with the processing system. On a System-on-Chip (SOC) as well as on a PC/FPGA combination, data needs to be exchanged between the two processing entities. This is typically where specialist expertise is required: writing Linux kernel drivers, constructing proper DMA-based data exchange mechanisms, implementing high-performance FPGA interfaces according to strict bus protocols, and so on. Here, multiple programming disciplines meet. TOPIC recognized this problem years ago and developed the Dynamic Process Loader (Dyplo®), which solves it to a large extent. On the FPGA side, Dyplo® forms a Network-on-Chip (NOC), wrapping fixed and dynamically exchangeable FPGA function blocks. On the processor side, Dyplo® is a Linux kernel driver that interfaces with the Dyplo® NOC using file I/O based data streams. The third aspect of Dyplo® is the implementation flow that transforms a software-defined function block into a Dyplo® wrapped FPGA function block.
In this paper, the focus is on the integration of FPGA logic with a PC system using PCI-Express as the interface medium. However, the design process and functionality are identical for SOC-type implementations or in a cloud context. The examples mentioned are implemented on a standard Intel i7 based PC system running Ubuntu 18.04 LTS, fitted with a Xilinx Alveo U50 FPGA accelerator card on a 16-lane PCI-Express 3.0 bus.
The Dyplo® concept is based on streaming data transport. The FPGA communication infrastructure is loosely based on Kahn Process Networks (KPN). This means that nodes (accelerator regions) interact via buffers, which synchronize operation between nodes and match the computational performance of the system with the available communication bandwidth. In the software application, the data to and from the FPGA is accessible as file streams: you open them, read from or write to them, and close them. The streams are presented as standard Linux file streams with a clear reference.
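The buffer-based synchronization described above can be illustrated in plain software. The following is a minimal sketch, not Dyplo® code: it assumes two hypothetical nodes (`producer` and `doubler`) connected by bounded queues, where a full buffer blocks the writer and an empty buffer blocks the reader. The names, the buffer depth and the end-of-stream marker are illustrative assumptions.

```python
import queue
import threading


def producer(out_q, n):
    """Source node: emits n samples, then an end-of-stream marker (None)."""
    for i in range(n):
        out_q.put(i)  # blocks while the buffer is full (back-pressure)
    out_q.put(None)


def doubler(in_q, out_q):
    """Processing node: reads a sample, transforms it, forwards it."""
    while True:
        item = in_q.get()  # blocks while the buffer is empty
        if item is None:
            out_q.put(None)
            return
        out_q.put(item * 2)


def run_pipeline(n, depth=4):
    """Connect producer -> doubler -> caller with bounded buffers."""
    a = queue.Queue(maxsize=depth)
    b = queue.Queue(maxsize=depth)
    threads = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=doubler, args=(a, b)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        item = b.get()
        if item is None:
            break
        result.append(item)
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(run_pipeline(10))
```

Because every transfer goes through a bounded, blocking buffer, no sample is lost and the result does not depend on the relative speed of the nodes, even with a buffer depth of 1.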
Figure 1 – Dyplo® conceptual architecture
3.1 Dyplo Network-on-Chip
Figure 1 gives a conceptual illustration of the Dyplo® NOC architecture from a system perspective. Using one or more parallel DMA data streams, data can flow between the PC and the FPGA fabric. The processor's volatile system background memory is used for the data exchange. The Dyplo® infrastructure interacts directly with this background memory, with limited processor involvement. It is worth noting that the observed responsiveness and latency are much better under Linux than under Microsoft Windows.
The NOC is constructed as a ring topology with configurable bandwidth. The default data width of the ring is 64 bits, and the ring performance can be increased in multiples of 64 bits. The clock rate of the ring depends on the FPGA fabric technology and speed grade; clock rates of over 350 MHz are possible. Each node in the NOC is connected to this ring with a maximum of 4 inputs and 4 outputs. There are five types of nodes that interact with this communication infrastructure.
Routing the inputs and outputs of nodes to other nodes or to the CPU system is controlled by the application software. Setting up a route uses a simple addressing mechanism that states from which node (1 to 32) and output (1 to 4) the data flows, and to which node (1 to 32) and input it should go. These routes are flexible and can be changed on the fly by software instructions.
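To give a feel for these numbers, the raw ring throughput can be estimated from the data width and the clock rate. This is a back-of-the-envelope sketch that assumes one data word is transferred per clock cycle and ignores any protocol or arbitration overhead, so it is an upper bound rather than a measured figure. The function name is illustrative.

```python
def ring_bandwidth_bytes_per_s(width_bits, clock_hz):
    """Raw upper bound: one word of width_bits per clock cycle (assumption)."""
    if width_bits <= 0 or width_bits % 64 != 0:
        raise ValueError("ring width must be a positive multiple of 64 bits")
    return width_bits // 8 * clock_hz


if __name__ == "__main__":
    for width in (64, 128, 256):
        rate = ring_bandwidth_bytes_per_s(width, 350_000_000)
        print(f"{width:3d} bits @ 350 MHz: {rate / 1e9:.1f} GB/s")
```

Under these assumptions, the default 64-bit ring at 350 MHz tops out at 2.8 GB/s, and each additional 64 bits of width adds the same amount.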
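The addressing scheme can be captured in a small data structure. The sketch below is a hypothetical model for illustration only and not the Dyplo® API: the `Route` class and its field names are assumptions, the input range of 1 to 4 is inferred from the maximum of 4 inputs per node, and routes to or from the CPU system are not modeled.

```python
from dataclasses import dataclass

MAX_NODES = 32  # node numbers 1 to 32, as stated in the text
MAX_PORTS = 4   # at most 4 inputs and 4 outputs per node


@dataclass(frozen=True)
class Route:
    """One node-to-node route (hypothetical model, not the driver's type)."""
    src_node: int
    src_output: int
    dst_node: int
    dst_input: int

    def __post_init__(self):
        checks = (
            ("src_node", self.src_node, MAX_NODES),
            ("src_output", self.src_output, MAX_PORTS),
            ("dst_node", self.dst_node, MAX_NODES),
            ("dst_input", self.dst_input, MAX_PORTS),
        )
        for name, value, limit in checks:
            if not 1 <= value <= limit:
                raise ValueError(f"{name}={value} is outside 1..{limit}")


if __name__ == "__main__":
    # Output 1 of node 3 feeds input 2 of node 7.
    print(Route(src_node=3, src_output=1, dst_node=7, dst_input=2))
```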
3.2 Linux driver
The Linux driver for Dyplo® creates a software abstraction of the NOC interfaces and of the loading of the reconfigurable nodes. The driver allows multiple applications to share the same Dyplo® infrastructure, and it arbitrates the assignment of routes and node functionality. If an application tries to allocate a node that is already occupied, the driver returns an error flag. The same holds for routes that are already programmed.
The Dyplo® driver builds on top of standard Linux drivers for memory and DMA control as well as PCI-Express drivers. For proper handling of the parallel data streams and the specifics of Dyplo®, an additional driver layer is created around these drivers. Through these drivers, the DMA data exchange channels are available to the user as file streams.
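The arbitration behavior can be pictured with a toy model. This sketch is purely illustrative and does not reflect the driver's implementation or interface: `NodeArbiter`, `claim` and `release` are hypothetical names, and the boolean return value stands in for the error flag mentioned above.

```python
class NodeArbiter:
    """Toy model of first-come, first-served node ownership (hypothetical)."""

    def __init__(self, node_count=32):
        self._node_count = node_count
        self._owner = {}  # node number -> owning application

    def claim(self, node, app):
        """Return True if app now owns node, False if another app holds it."""
        if not 1 <= node <= self._node_count:
            raise ValueError(f"node {node} is outside 1..{self._node_count}")
        current = self._owner.setdefault(node, app)
        return current == app

    def release(self, node, app):
        """Free the node, but only if app is the current owner."""
        if self._owner.get(node) == app:
            del self._owner[node]


if __name__ == "__main__":
    arbiter = NodeArbiter()
    print(arbiter.claim(3, "app-a"))  # first claim succeeds
    print(arbiter.claim(3, "app-b"))  # node is occupied, claim is refused
```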
3.3 Function integrator
The third part of Dyplo® is the Dyplo® Development Environment (DDE), the tool used to create and manage the partial bitstreams of the FPGA. A GUI is available for interactive configuration, compilation and bitstream management; scripted and command-line driven operation is supported as well. The tool helps to configure the NOC FPGA IP in a comprehensive manner, guides the import and creation of function blocks from C/C++ and HDL, and automatically creates the partial bitstreams that correspond to the baseline FPGA image. Figure 2 gives an impression of the DDE user interface.
Figure 2 – Example screenshots of the Dyplo® Development Environment
4. Programming example
The remaining question is how an actual design flow works out in practice. The following paragraphs describe a typical design flow using the Dyplo® acceleration framework, based on a desktop PC solution with an Alveo U50 board. The exact same design flow applies if you want to run it in a cloud configuration, for example at Nimbix or Amazon, or on a system with a Zynq 7000 or Zynq Ultrascale+ device.
This particular Dyplo® workflow is explained using the Dyplo® Development Kit. The kit consists of an Alveo U50 board, the Dyplo® Development Environment (DDE) and an illustrative example application. In addition, a PC running Linux (e.g. Ubuntu 18.04 LTS) with an available 8-lane PCI-Express 3.0 slot is required. Make sure that GCC is installed on the machine, as well as a Xilinx Vivado/Vitis installation, preferably version 2020.2 or more recent. Also install the Dyplo® Linux driver by running the Dyplo® installer, which is part of the DDE. The required Dyplo® specific PCI-Express driver is installed automatically, and you will notice that a number of file I/O devices are created as standard Linux peripherals.
Figure 3 – Demo image configuration Dyplo® Development Kit
4.2. NOC configuration
The second step is to configure the Dyplo NOC on the FPGA according to the needs of your application. You basically partition the large FPGA into smaller partitions with well-defined, easy-to-use streaming interfaces. This configuration process is guided by an interactive GUI, which launches scripts that result in a complete FPGA project and bit image for the Alveo U50 board. The development kit also comes with a pre-configured NOC wrapping 8 reconfigurable nodes and 4 DMA nodes. Figure 3 illustrates the functionality of this reference configuration. This means you can have 4 parallel streams to the FPGA fabric as well as 4 back into the PC. The full PCI-Express bandwidth is available, providing more than 500 Mbyte/s per channel when divided equally over 4 channels, which matches the performance requirements of 1080p60 video applications.
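The 1080p60 claim can be checked with a short calculation. The sketch below computes the uncompressed data rate of a 1920x1080 stream at 60 frames per second; the pixel formats (3 and 4 bytes per pixel) are assumptions for illustration, since the text does not name one.

```python
def video_bandwidth_bytes_per_s(width, height, fps, bytes_per_pixel):
    """Uncompressed video data rate, ignoring blanking and padding."""
    return width * height * fps * bytes_per_pixel


if __name__ == "__main__":
    for bpp in (3, 4):  # assumed formats: 24-bit RGB and 32-bit RGBA
        rate = video_bandwidth_bytes_per_s(1920, 1080, 60, bpp)
        print(f"1080p60 at {bpp} bytes/pixel: {rate / 1e6:.0f} MB/s")
    # prints 373 MB/s and 498 MB/s
```

Both figures fit within the 500 Mbyte/s per channel quoted above, so under these assumptions a single channel can carry one uncompressed 1080p60 stream.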
4.3. Function implementation
This is where the user-specific functionality comes in. Dyplo supports two implementation flows. The first is based on the traditional FPGA design flow, but strongly simplified. The interfaces of the HDL node must be AXI4-Stream compatible. A template design with a test bench is provided. The code can be hand-crafted, created using functions from the standard Xilinx IP catalog, generated with Simulink HDL Coder or bought from a third party.
Figure 4 – Typical iterative implementation cycle of FPGA accelerator functions
The second implementation flow uses Xilinx high level synthesis (HLS) technology. The part of the application code that needs to be accelerated has to be isolated and given an interface that meets the Dyplo API requirements. This is very similar to the way file I/O functionality is handled in Linux. Using the Dyplo GUI, the provided code is automatically wrapped for compliance with the NOC interfaces. In the same pass, a Vivado HLS project with default settings is created for this particular function. By opening the Vivado HLS GUI, the function can be optimized further to meet specific processing performance requirements. All features offered by Vivado HLS are supported, including a selection of OpenCL and OpenCV operators. A typical design cycle to get from a “software” function to an FPGA-accelerated equivalent is illustrated in figure 4.
Both implementation flows result in partial bitstreams for the particular function, which can be deployed on any of the 8 reconfigurable nodes. The reference design comes with a number of standard video manipulation functions, provided as C/C++ code, as HDL code and as partial bitstreams.
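What "isolating" a function means in practice is giving it a pure stream-in, stream-out shape. The sketch below is a Python reference model only: the HLS flow itself takes C/C++ as input, and the pixel inversion and the `invert_stream` name are hypothetical examples rather than part of the kit. A model like this can serve as a golden reference when comparing the output of the accelerated function during the iterative cycle.

```python
import io


def invert_stream(src, dst, chunk_size=4096):
    """Read bytes from src until end of stream, invert each, write to dst.

    The function touches no state other than its two streams, which is
    the property that makes it a candidate for acceleration.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(bytes(255 - b for b in chunk))
        total += len(chunk)
    return total


if __name__ == "__main__":
    source = io.BytesIO(bytes([0, 1, 254, 255]))
    sink = io.BytesIO()
    count = invert_stream(source, sink)
    print(count, list(sink.getvalue()))
```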
4.4. Application development
In the previous step, a function was isolated from the application software for acceleration. To replace the software function by the Dyplo® accelerated FPGA variant, the application needs to meet the Dyplo® programming model requirements. It is very similar to file I/O operations:
- Start the driver. By default, the NOC on the FPGA is in a passive state. It needs to be initialized using a specific command.
- Configure the routing of streams within the NOC:
- From the software application to one of the nodes (maximum 4 streams)
- From the nodes back to the software application (maximum 4 streams)
- Between the nodes in the NOC on the FPGA (each node has maximum 4 input streams and 4 output streams)
The result of this operation is a set of file-type pointers that the application software can use.
- The file pointers can be activated using standard file operations such as “open” and “close”. When opened, you can read from and write to these streams at will. The operations are blocking: you can keep pushing data to the nodes in the FPGA until the driver blocks. The data transport is lossless.
- The final operation to make the application use the accelerator in the fabric is the actual programming of the node with a partial bitstream. This is done with a simple programming command that references the node number and the partial bitstream file. The return value indicates whether programming was successful; it can fail because the node is already occupied by a different application, as Dyplo® allows multiple applications to use the same NOC.
Although the programming model is inspired by file I/O, its combination with the NOC configuration resembles certain OpenCL programming constructs. Using FPGA devices in a software context with Dyplo® is therefore comparable in complexity to developing CUDA or OpenCL applications for GPU accelerator boards.
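The file I/O part of the steps above can be sketched with plain POSIX-style calls. This is a minimal illustration under stated assumptions: the paper does not name the device files or the initialization, routing and programming commands, so those are not shown, and an operating system pipe stands in for the CPU-to-node-to-CPU round trip. The helper names `send_all`, `recv_exact` and `loopback_demo` are hypothetical.

```python
import os


def send_all(fd, data):
    """Write all of data to fd; os.write may accept fewer bytes than offered."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # blocks while the stream cannot accept data
        view = view[written:]


def recv_exact(fd, size):
    """Read exactly size bytes from fd, or fewer if the stream ends early."""
    parts = []
    remaining = size
    while remaining:
        chunk = os.read(fd, remaining)  # blocks until data is available
        if not chunk:
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)


def loopback_demo(payload):
    """Stand-in round trip through an OS pipe instead of real stream devices.

    The payload must fit in the pipe buffer; a larger one would block the
    writer until a reader drains it, which is the same back-pressure effect.
    """
    read_fd, write_fd = os.pipe()
    try:
        send_all(write_fd, payload)
        return recv_exact(read_fd, len(payload))
    finally:
        os.close(write_fd)
        os.close(read_fd)


if __name__ == "__main__":
    print(loopback_demo(b"frame-0001"))
```

With real hardware, the two descriptors would instead come from opening the stream devices that the driver creates, after the NOC has been initialized, the routes configured and the node programmed.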
Figure 5 – Common software IDE tools are used to create Dyplo® compatible application software
5. Dyplo® in action
In the previous chapters, the Dyplo® concept and programming method were explained. However, the best way to learn about the concept is to try it. Dyplo® is available for Zynq 7000, Zynq Ultrascale+ and Alveo. The Dyplo® Development and Runtime Environment can be obtained via the Xilinx App Store, with licensing based on Accelize technology. The best way to experience the Dyplo® benefits, however, is the Dyplo® Development Kit.
Figure 6 – A Qt-based example application running 2 applications simultaneously in the NOC
TOPIC recently released Dyplo® 2.0, with improved NOC performance, extended device support for Alveo accelerator boards and 4K video support. Where the 1.x version of Dyplo® focused on opening up the FPGA fabric for software development, the 2.0 version unlocks the performance and integration capabilities of the FPGA fabric and integrates them seamlessly within Dyplo®. While maintaining the threading-type processing infrastructure based on dynamic function exchange (DFX, formerly known as partial reconfiguration), the data communication infrastructure has been enhanced to natively support multiple 4K video streams, to make DMA-based software-in-the-loop via DDR memory an integral part of the system, to allow seamless integration of high-bandwidth memory (HBM) in the data path, and to connect multiple FPGA devices to each other using high-bandwidth, error-free links. As an introductory offer, the Dyplo® Development Kit, based on a Xilinx Alveo U50 board, is available at an attractive price. Contact us for details.
Unleash superior accelerated algorithmic performance on FPGAs now by testing Dyplo® yourself. Use either a SOC or a PC based system and experience the benefits of this flow: the convenience of GPU-style programming combined with reduced power consumption.