HeteroBench: Multi-kernel Benchmarks for Heterogeneous Systems
As Moore’s Law slows, heterogeneous systems with CPUs, GPUs, and FPGAs have become essential for performance and efficiency, yet their architectural diversity complicates hardware selection. To address this, HeteroBench provides a vendor-agnostic benchmark suite for evaluating multi-kernel applications across diverse accelerators and programming models, including Python, OpenMP, OpenACC, CUDA, and Vitis HLS. Covering domains like image processing, ML, and simulation, HeteroBench enables fine-grained performance comparison and optimization across platforms. Its cross-platform design offers practical guidance for AI/ML deployment and HPC development, benefiting both researchers and industry users.
Optimized Spatial Architecture Mapping Flow for Transformer Accelerators
Recent advances in Transformer-based large language models demand high-performance hardware accelerators for efficient inference. While spatial architectures like TPUs offer strong potential, their design is often manual and inflexible across applications. To address this, we introduce SAMT (Spatial Architecture Mapping for Transformers), a framework that automatically optimizes dataflow mappings and applies dynamic operator fusion for Transformer workloads. SAMT achieves 12%–91% latency reduction and 3%–23% energy savings across edge, mobile, and cloud platforms compared to traditional spatial designs.
An MLIR-based Compiling Flow for Heterogeneous Architecture
We propose an MLIR-based compiler framework that bridges high-level Python code and heterogeneous hardware platforms, including CPUs, GPUs, and FPGAs. The framework enables automatic backend selection, task-level parallelism, and graph-level optimizations such as operator fusion and dataflow restructuring. By decoupling hardware-specific concerns from application code, it simplifies development while improving performance and energy efficiency. Evaluations show significant speedup and energy reduction from heterogeneous parallel execution compared to CPU-only baselines, demonstrating its effectiveness for high-performance computing workloads.
Accelerating Autonomous Path Planning on FPGAs with Sparsity-Aware HW/SW Co-Optimizations
Path planning in autonomous driving is often bottlenecked by the computational cost of quadratic programming (QP). We present an FPGA-based accelerator that pairs the OSQP solver with a preconditioned conjugate gradient (PCG) method, which offers better scalability and hardware efficiency than traditional direct methods. By applying memory optimizations and exploiting task-level and operator-level parallelism through hardware pipelining, our design achieves up to 1.8× speedup and 3.2× power reduction over an Intel i5, and 3.1× speedup over an ARM Cortex-A57.
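The PCG iteration mentioned above can be illustrated with a minimal software sketch. It assumes a Jacobi (diagonal) preconditioner and a small dense symmetric positive-definite system; the function name `pcg` and the example matrix are hypothetical and chosen only for illustration, and the sketch does not reflect the accelerator's actual preconditioner, data layout, or pipelining.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A.

    Illustrative only: uses a Jacobi (diagonal) preconditioner, which is
    an assumption of this sketch, not a detail taken from the abstract.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]  # M^-1 for M = diag(A)

    x = [0.0] * n
    r = list(b)                                   # residual b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # initial search direction
    rz = dot(r, z)

    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                # converged
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)                   # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz                        # keeps directions A-conjugate
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x


# Hypothetical 2x2 example system.
A_example = [[4.0, 1.0], [1.0, 3.0]]
b_example = [1.0, 2.0]
x_example = pcg(A_example, b_example)
```

Each iteration needs only one matrix-vector product plus a few vector operations and no matrix factorization, which is the general reason iterative methods of this kind are considered a good fit for pipelined hardware.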
PyAIE: a Python-based Programming Framework for Versal ACAP AI Engines
To bridge the gap between application-level programming abstractions and the AI Engine, we propose PyAIE, a Python-based programming framework specifically targeting the AI Engines in the Versal ACAP. PyAIE lets users focus on algorithm-level design without knowledge of the underlying low-level details. It automatically translates Python code into optimized AI Engine kernel C/C++ code, host code, and configuration scripts, thereby completing the entire AI Engine-based system design. To the best of our knowledge, this is the first Python-based programming and compilation flow designed specifically for Versal AI Engines.
Design Exploration of Associative Processor Implementation on FPGA
In-memory computing saves the time and energy spent moving data between memory and processor, avoiding the memory-wall bottleneck of the traditional von Neumann architecture. The associative processor (AP) is one architecture proposed for in-memory computing, and content-addressable memory (CAM) plays a central role in it. In this paper, we propose a novel FPGA implementation of the AP, including the CAM and its peripheral circuits, such as the controller, data cache, instruction cache, and program counter. To the best of our knowledge, this is the first work that implements an associative processor on a real-world FPGA platform.
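As a rough software illustration of how an AP computes with a CAM, the sketch below models the usual compare-then-write passes: a masked key is compared against every row in parallel, matching rows are tagged, and a masked value is written to the tagged rows. The class `CAM`, its methods, and the bit layout are hypothetical names chosen for this example, and the model says nothing about the FPGA design itself.

```python
class CAM:
    """Toy content-addressable memory; each row is stored as an integer."""

    def __init__(self, rows):
        self.rows = list(rows)

    def compare(self, key, mask):
        # Tag every row whose masked bits equal the masked key.
        return [(row & mask) == (key & mask) for row in self.rows]

    def write(self, tags, value, mask):
        # Overwrite the masked bits of tagged rows; leave other rows alone.
        self.rows = [
            (row & ~mask) | (value & mask) if tag else row
            for row, tag in zip(self.rows, tags)
        ]


# Assumed bit layout for this example: bit 0 = a, bit 1 = b, bit 2 = result.
cam = CAM([0b00, 0b01, 0b10, 0b11])

# XOR as truth-table passes: for each (a, b) pattern whose output is 1,
# compare on bits 0-1 and write a 1 into bit 2 of the matching rows.
for pattern in (0b01, 0b10):
    tags = cam.compare(key=pattern, mask=0b011)
    cam.write(tags, value=0b100, mask=0b100)

xor_rows = cam.rows
```

The number of passes depends on the truth table of the operation rather than on the number of rows, which is why this style of computation scales with data-level parallelism.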