Reconfiguration Computing Systems Lab

Project: Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA

Decision trees are machine learning models commonly used in various application scenarios. In the era of big data, traditional decision tree induction algorithms are not suitable for learning large-scale datasets due to their stringent data storage requirement. Online decision tree learning algorithms have been devised to tackle this problem by concurrently training with incoming samples and providing inference results. However, even the most up-to-date online tree learning algorithms still suffer from either high memory usage or high computational intensity with dependency and long latency, making them challenging to implement in hardware. To overcome these challenges, we introduce a new quantile-based algorithm to improve the induction of the Hoeffding tree, one of the state-of-the-art online learning models. The proposed algorithm is light-weight in terms of both memory and computational demand, while still maintaining high generalization ability. A series of optimization techniques dedicated to the proposed algorithm have been investigated from the hardware perspective, including coarse-grained and fine-grained parallelism, dynamic and memory-based resource sharing, pipelining with data forwarding. We further present a high-performance, hardware-efficient and scalable online decision tree learning system on a field-programmable gate array (FPGA) with system-level optimization techniques. Please refer to [C74] for more details.
Project: Optimizing OpenCL-based CNN Design on FPGA with Comprehensive Design Space Exploration and Collaborative Performance Modeling

Recent success in applying CNNs to object detection and classification has sparked great interest in accelerating CNNs using hardware like FPGAs. However, finding an efficient FPGA design for a given CNN model and FPGA board is non-trivial. In this work, we try to solve this problem by design space exploration with a collaborative framework, which consists of three main parts: FPGA design generation, coarse-grained modeling, and fine-grained modeling. In the FPGA design generation, we propose a novel data structure, LoopTree, to capture the details of the FPGA design for CNN applications without writing down the source code. Different LoopTrees are automatically generated in this process. A coarse-grained model will evaluate LoopTrees at the operation level so that the most efficient LoopTrees can be selected. A fine-grained model, which is based on the source code, will then refine the selected design in a cycle-accurate manner. A set of comprehensive OpenCL-based designs have been implemented onboard to verify our framework. An average estimation error of 8.87% and 4.8% have been observed for our coarse-grained model and fine-grained model, respectively.
Project: FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates

DNNs (Deep Neural Networks) have demonstrated great success in numerous applications such as image classification, speech recognition, video analysis, etc. However, DNNs are much more computation-intensive and memory-intensive than previous shallow models. Thus, it is challenging to deploy DNNs in both large-scale data centers and real-time embedded systems. Considering performance, flexibility, and energy efficiency, FPGA-based accelerator for DNNs is a promising solution. Unfortunately, conventional accelerator design flows make it difficult for FPGA developers to keep up with the fast pace of innovations in DNNs. To overcome this problem, we propose FP-DNN (Field Programmable DNN), an end-to-end framework that takes TensorFlow-described DNNs as input, and automatically generates the hardware implementations on FPGA boards with RTL-HLS hybrid templates. FP-DNN performs model inference of DNNs with our high-performance computation engine and carefully-designed communication optimization strategies. We implement CNNs, LSTM-RNNs, and Residual Nets with FPDNN, and experimental results show the great performance and flexibility provided by our proposed FP-DNN framework.
Project: Scalable Light-Weight Integration of FPGA Based Accelerators with Chip Multi-Processors

Modern multicore systems are migrating from homogeneous systems to heterogeneous systems with accelerator-based computing in order to overcome the barriers of performance and power walls. In this trend, FPGA-based accelerators are becoming increasingly attractive, due to their excellent flexibility and low design cost. In this project, we propose the architectural support for efficient interfacing between FPGA-based multi-accelerators and chip-multiprocessors (CMPs) connected through the network-on-chip (NoC). Distributed packet receivers and hierarchical packet senders are designed to maintain scalability and reduce the critical path delay under a heavy task load. A dedicated accelerator chaining mechanism is also proposed to facilitate intra-FPGA data reuse among accelerators to circumvent prohibitive communication overhead between the FPGA and processors. In order to evaluate the proposed architecture, a complete system emulation with programmability support is performed using FPGA prototyping. Experimental results demonstrate that the proposed architecture has high-performance, and is light-weight and scalable in characteristics. Please refer to [J41] for more details.
Project: A Hybrid Approach to Cache Management in Heterogeneous CPU-FPGA Platforms

Heterogenous computing is gaining increasing attention due to its promise of high performance with low power. Shared coherent cache based CPU-FPGA platforms, like Intel HARP, are a particularly promising example of such systems with enhanced efficiency and high flexibility. In this work, we propose a hybrid strategy that relies on both static analysis of applications and dynamic control of cache driven by such static analysis to minimize contention on the coherent FPGA cache in emerging shared coherent cache based CPU-FPGA platforms. We develop an LLVM pass, based on reuse distance theory, to analyze memory access patterns of the application kernels to be executed on FPGA and generate kernel characteristics called Key values. Thereafter, a dynamic scheme for cache bypassing and partitioning based on these Key values is proposed to increase the cache hit rate and improve the overall performance. Experiments using a number of benchmarks show that the proposed strategy can increase the cache hit rate by 22.90% on average and speed up the application by up to 12.52%.
Project: Dynamically reconfigurable architecture for closing the FPGA/ASIC gap while providing superior design flexibility

Current FPGA design faces many challenges such as low logic utilization and long reconfiguration delay. The gap between ASIC and FPGA is big in terms of area, delay and power consumption, which reduces the advantages of FPGA devices for embedded system. We proposed hybrid CMOS/nanotechnology dynamically reconfigurable architectures, including NATURE and FDR, to overcome the logic density and reconfiguration efficiency obstacles. We use CMOS logic and interconnects, aided by on-chip nano-electronic RAMs that store reconfiguration bits. It uses the concept of temporal logic folding and fine-grain (i.e., cycle-level) dynamic reconfiguration to bring significant benefits. An over 10X improvement in area-delay product and 2X power reduction can be obtained compared to traditional FPGA implementations, which effectively reduced the gap between reconfigurable architecture and ASIC. We further augment the design with integrated coarse-grained blocks, such as DSP and block memory, new technologies, such as 3D integration and FinFET, to significantly improve the performance and reduce the power consumption of the architecture.
Project: Non-volatile 3D stacking RRAM-based FPGA with unified data/configuration memory

The existing FPGA products keep the configured logics and signal routing information in the SRAM to realize the required functionalities. The fast development of emerging memory techniques, such as the Resistive Random Access Memory (RRAM), has demonstrated significant advantages compared to traditional memory techniques including high density, low power consumption, comparable access speed and non-volatility. We proposed a new FPGA architecture that can completely substitute RRAM for SRAM in all the major components including the logic blocks (LB), switch blocks (SB) and connection blocks (CB). The look-up-table (LUT) design in LB can be engineered in 3D stacking to maximize the benefits from the high density of RRAM. It naturally supports bit-addressable access and can be used as Distributed Random Access Memory (D-RAM), which usually has limited utilization in the conventional SRAM-based FPGAs due to high design complexity and large area. The routing controls in SB and CB utilize the complementary RRAM cells of the crossbar structure. We keep the design of pass gate used in the conventional FPGA to transfer signals for better performance. However, owe to the high density of RRAM, the large number of configuration memory cells associated with the pass gates is no longer a problem. Since the area is improved with smaller area owe to RRAM cells, the delay along the critical path is also reduced because of shorter connection. Experimental results on benchmarks show 62.7% area reduction with 34% delay improvement of this hardware architecture compared to conventional FPGA. Since each LUT is bit-addressable, which enables the run-time fast access, it also can be used together with dynamically reconfigurable architecture and improve the flexibility of the architecture. We take this advantage and propose the unified Block RAM (BRAM) structure which combines the on-chip data memory and configuration memory. With this design, the RRAM-based FPGA can perform run-time fast reconfiguration when needed. While the reconfiguration is not needed, the configuration memory can be replaced with data memory for temporary data storage. Hence no explicit area overhead is incurred. The design hides the configuration memory area but can support the run-time reconfiguration when needed.

FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

Multi-die FPGAs are widely adopted for large-scale accelerators, but optimizing high-level synthesis designs on these FPGAs faces two challenges. First, the delay caused by die-crossing nets creates an NP-hard floorplanning problem. Second, traditional directive optimization cannot consider resource constraints on each die or the timing issue incurred by the die-crossings. Furthermore, the high algorithmic complexity and the large scale lead to extended runtime for legalizing the floorplan of HLS designs under different directive configurations.To co-optimize the directives and floorplan of HLS designs on multi-die FPGAs, we formulate the co-search based on bin-packing variants and present two iterative optimization flows. The first (FADO 1.0) relies on a pre-built QoR library. It involves a latency-bottleneck-guided greedy directive search and incremental floorplanning. Compared with a global floorplanning solution, it takes 693X~4925X shorter search time and achieves 1.16X~8.78X better design performance, measured in workload execution time. To remove the time-consuming QoR library generation, the second flow (FADO 2.0) integrates an analytical QoR model and re-designs the directive search to accelerate convergence. Through experiments on mixed dataflow and non-dataflow designs, compared with 1.0, FADO 2.0 further yields a 1.35X better design performance on average after implementation on the Alveo U250 FPGA. This project is open-sourced at Github.
AMF-Placer: High-Performance Analytical Mixed-size Placer for FPGA

AMF-Placer is an open-source analytical mixed-size FPGA placer supporting mixed-size placement on FPGA, with an interface to Xilinx Vivado. To speed up the convergence and improve the quality of the placement, AMF-Placer is equipped with a series of new techniques for wirelength optimization, cell spreading, packing, and legalization. Based on a set of the latest large open-source benchmarks from various domains for Xilinx Ultrascale FPGAs, experimental results indicate that AMF-Placer can improve HPWL by 20.4%-89.3% and reduce runtime by 8.0%-84.2%, compared to the baseline. Furthermore, utilizing the parallelism of the proposed algorithms, with 8 threads, the placement procedure can be accelerated by 2.41x on average.The implementation source code and detailed documentation of AMF-Placer is open to the community.
Hi-ClockFlow: Multi-Clock Dataflow Automation and Throughput Optimization in High-Level Synthesis

Tools of high-level synthesis (HLS) are developed to improve the accessibility of FPGAs by allowing designer to describe hardware designs in high-level language, e.g. C/C++. However, the source codes of general applications are not structured as canonical dataflow. Furthermore, clock frequencies are powerful parameters to improve dataflow throughput but currently commercial HLS tools limit themselves to single clock domain. Consequently, in order to benefit from the multiple-clock dataflow design, designers still suffer from manually analyzing the applications, partitioning the source code into modules, optimizing them with appropriate parameters and resource allocation, and finally interconnecting them. We analyze the impact of multiple clock domains for HLS designs and present Hi-ClockFlow, an automatic HLS framework. Hi-ClockFlow, which is implemented as one of Light-HLS applications, can analyze the source code based on Light-HLS, our light weight HLS evaluation framework, explore the large design space, and optimize such parameters as clock frequencies and HLS directives in dataflow. By properly partitioning the source code of an application into parts with various clock domains, Hi-ClockFlow can optimize the dataflow with imbalanced modules and speed up the performance under the specific constraint of resource. The implementation source code of Hi-ClockFlow is open to the community.
Hi-DMM: High-Performance Dynamic Memory Management in High-Level Synthesis

High Level Synthesis (HLS) of FPGA based accelerators has been proposed in order to simplify accelerator design process with respect to design time and complexity. However, modern HLS tools do not consider dynamic memory allocation constructs in high-level programming languages like C and limit themselves to static memory allocation. Hi-DMM is proposed as a dynamic memory allocation and management scheme, for inclusion in commercial HLS design flows. Hi-DMM performs source-to-source transformation of user C code with dynamic memory constructs into C-source code with the dynamic memory allocator and management scheme developed in this work. The transformed C-source code is amenable to synthesis by commercial tools like Vivado HLS. Relying on buddy tree-based allocation schemes and efficient hardware implementation of the allocators, Hi-DMM achieves 4x speed-up in both fine-grained and coarse-grained memory allocation compared to previous works. The highlights of Hi-DMM includes three perspectives. (a): It is a part of HLS methodology. The DMM components, including allocators and heap memories, are described in C and are synthesizable with commercial HLS tools like Vivado-HLS. (b): HLS accelerators can access Hi-DMM allocator via HLS handshake protocol. Most of the proposed DMM components are automatically configured for adaption to the characteristics of source code, e.g. memory allocation granularity and HLS directives. (c): It achieves high-performance memory allocation. The buddy-tree allocators search the allocable addresses by using bit-vector (BV) computation and maintain the information in parallel. Pre-allocation scheme, look-up table and mini-heap are involved to minimize memory allocation latency. The implementation source code of Hi-DMM is open to the community.
Project: Machine Learning Based Routing Congestion Prediction in FPGA High-Level Synthesis

Optimization of complex applications in HLS is challenging due to the effects of implementation issues such as routing congestion. Routing congestion estimation is absent or inaccurate in existing HLS design methods and tools. Early and accurate congestion estimation is of great benefit to guide the optimization in HLS and improve the efficiency of implementation. However, routability, a serious concern in FPGA designs, has been difficult to evaluate in HLS without analyzing post-implementation details after Place and Route. To this end, we propose a novel method to predict routing congestion in HLS using machine learning and map the expected congested regions in the design to the relevant high-level source code. This is greatly beneficial in early identification of routability oriented bottlenecks in the high-level source code without running time-consuming register-transfer level (RTL) implementation flow. Experiments demonstrate that our approach accurately estimates vertical and horizontal routing congestion with errors of 6.71% and 10.05% respectively. By presenting Face Detection application as a case study, we show that by discovering the bottlenecks in high-level source code, routing congestion can be easily and quickly resolved compared to the efforts involved in RTL implementation and design feedback.
COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications

High level synthesis (HLS) relies on the use of synthesis directives to generate digital designs meeting a set of specifications. However, the selection of directives depends largely on designer experience and knowledge of the target architecture and digital design. Existing automated methods of directive selection are very limited in scope and capability to analyze complex design descriptions in high-level languages to be synthesized using HLS. We propose a comprehensive model-based analysis framework, COMBA, which is capable of analyzing the effects of a multitude of directives related to functions, loops and arrays in the design description using pluggable analytical models, a recursive data collector and a metric-guided design space exploration algorithm. The proposed automatic framework can accurately estimate the performance and resource usage of an HLS design given different directives and quickly find the high-performance configuration of directives in an exponentially increasing design space. COMBA has been released as an open-source tool [Tool link].
Project: Modular based partial reconfiguration aware placement and routing in FPGAs

While traditional FPGA design flow usually employs fine-grained tile-based placement, modular placement is increasingly required to speed up the large-scale placement and save the synthesis time. Moreover, the commonly used module can be pre-synthesized and stored in the library for design reuse to significantly save the design, verification time and development cost. In this project, we develop a library-based placement and routing flow, which utilizes the pre-placed and routed modules from the library to significantly save the execution time while achieving the minimal area-delay product. The flow supports the static and reconfigurable modules at the same time. The modular information is represented in B*-Tree structure. Simulated Annealing is performed based on the B*-tree to enable a fast search of the placement space. Different width-height ratios of the modules are exploited to achieve area-delay product optimization. Partial reconfiguration-aware routing using pin-to-wire abutment is defined to connect the modules after placement. Our placer can reduce the compilation time by 77% on average with up to 20% area overhead, and small delay improvement (4% to 21%) compared with fine-grained results of VPR through the reuse of module information in the library.
Project: Integrated mapping flow for cycle-level dynamically reconfigurable system

Automatic mapping methods and tools play a key role in implementing applications in a reconfigurable system, while efficiently utilizing its advantages. In this part of our research, we developed an integrated design and optimization platform for implementing applications using a cycle-level reconfigurable system, such as NATURE or FDR. This platform fully facilitates logic folding and automatically maps and evaluates a specific design in the cycle-level reconfigurable architecture. The mapping flow conducts design optimization from the register-transfer level (RTL) or gate level down to the physical level using novel mapping methodologies. Given a design, it automatically explores and identifies the best temporal logic folding configuration, targeting user-defined area/delay optimization objectives, and implements the design through several steps including logic mapping, temporal clustering, temporal placement, and routing. During logic mapping, a force-directed scheduling technique is used to balance resource usage across different clock cycles, minimize overall resource usage, and optimize performance under constraints. The following steps consider resource sharing and temporal data storage across clock cycles during clustering, placement, and routing. For NATURE or FDR architecture, the flow grants the designer superior flexibility to perform area-delay trade-offs and satisfy various requirements through using different logic folding configurations for one application.

Project: Learning-Based Power Modeling for FPGA

Field-programmable gate arrays (FPGAs) are gaining popularity in wide-ranging domains including data centers and embedded systems, where they serve as reconfigurable hardware accelerators for speeding up computation-centric tasks. As the architecture complexity, integration density and chip size of modern FPGAs continue to grow, the importance of power efficiency increases and FPGA power consumption is turning out to be a key design constraint. In this project, we seek opportunities to exploit the use of machine learning techniques to model FPGA power consumption, and conduct power-oriented optimizations based on the power models. Specifically, we investigate the FPGA power modeling techniques from design time to run time. These techniques are compatible with each other, and putting it all together, they can jointly facilitate FPGA power savings in both design-time hardware construction and real-time application execution.

Power Modeling and DSE [C79]. We investigate power modeling for design-time power analysis and power-oriented design space exploration (DSE) for FPGA designs. We propose HL-Pow, an accurate, fast and early-stage power modeling methodology for high-level synthesis (HLS). In HL-Pow, we incorporate an automated feature construction flow to efficiently identify and extract features that exert a major influence on power consumption, simply based upon HLS results, and a modeling flow that can build an accurate and generic learning-based power model applicable to a variety of designs with HLS. By using HL-Pow, the power evaluation process for FPGA designs can be significantly expedited because the power inference of HL-Pow is established on HLS instead of the time-consuming register-transfer level (RTL) implementation flow. To further facilitate design-time power minimization, we describe a novel and fast DSE algorithm built on top of HL-Pow to reach a close approximation of the Pareto frontier between latency and power consumption.

Run-Time Power Monitoring and Management [J47],[C62]. We study in-situ monitoring of FPGA power consumption at run time. We propose and evaluate a power monitoring scheme capable of accurately estimating the dynamic power of FPGA designs in a fine-grained timescale during onboard execution. Customized computer-aided design (CAD) flows are devised for power modeling either offline or online. Traditional decision trees and customized model ensemble strategies are deployed for power model establishment with offline sampling, while online decision trees are used for simultaneous power inference and training with real-time sample collection. Following this, we introduce the light-weight and scalable realization of the developed models, which can be integrated into the target applications for FPGA dynamic power monitoring at run time. The novel power monitoring techniques open up new opportunities for real-time FPGA power management in both coarse-grained and fine-grained timescales.
Project: Energy Minimization for Multi-core Platforms through DVFS and VR Phase Scaling With Comprehensive Convex Model

Energy management is a critical challenge in multicore processors due to continuous technology scaling. Previous methods have mostly focused on the energy minimization of the processor cores. However, energy overhead of the off-chip voltage regulator (VR) has recently shown to be a non-trivial part of the total energy consumption and has been previously overlooked. In this paper, we propose an overall energy optimization method for the system that minimizes both per-core energy consumption and VR energy consumption using dynamic voltage frequency scaling (DVFS) and VR phase scaling by solving a comprehensive convex model. In order to improve the accuracy of the task latency model, a new task model considering both computation and memory access of the task is also developed. Furthermore, for better scalability and lower on-line overhead, we decompose our proposed convex method into two stages: an off-line stage and an on-line stage. During the off-line stage, we explore the convex model by assuming different numbers of active phases of the VR, various workload pressures and workload characteristics to collect the optimal frequency assignments under different scenarios. During the online stage, the specific frequency assignment for cores and optimal active phase number of the VR are selected and applied based on the actual workload pressure and its characteristics running on the cores. Experiments on real benchmarks show that when compared with state-of-the-art approaches, which are oblivious to VR overheads and exploit slack time to achieve energy minimization, our method can achieve a significant energy saving of up to 22.4% with negligible on-line overhead.

Project: Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA

Project: Optimizing OpenCL-based CNN Design on FPGA with Comprehensive Design Space Exploration and Collaborative Performance Modeling

Project: FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates

Project: Scalable Light-Weight Integration of FPGA Based Accelerators with Chip Multi-Processors

Project: A Hybrid Approach to Cache Management in Heterogeneous CPU-FPGA Platforms

Project: Dynamically reconfigurable architecture for closing the FPGA/ASIC gap while providing superior design flexibility

Project: Non-volatile 3D stacking RRAM-based FPGA with unified data/configuration memory

FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs

AMF-Placer: High-Performance Analytical Mixed-size Placer for FPGA

Hi-ClockFlow: Multi-Clock Dataflow Automation and Throughput Optimization in High-Level Synthesis

Hi-DMM: High-Performance Dynamic Memory Management in High-Level Synthesis

Project: Machine Learning Based Routing Congestion Prediction in FPGA High-Level Synthesis

COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications

Project: Modular based partial reconfiguration aware placement and routing in FPGAs

Project: Integrated mapping flow for cycle-level dynamically reconfigurable system

Project: Learning-Based Power Modeling for FPGA

Project: Energy Minimization for Multi-core Platforms through DVFS and VR Phase Scaling With Comprehensive Convex Model

Project: Two-stage Thermal-Aware Scheduling of Task Graphs on 3D Multi-cores

Project: Thermal simulator of 3D-IC with modeling of anisotropic TSV conductance and microchannel entrance effects

Project: Vulnerability Hunting on Intel Integrated GPU

Project: Enforce Basic Block CFI on the Fly for Real-world Binaries