Publications

Tse AHT, Thomas DB, Tsoi KH, Luk W et al., 2011, Efficient reconfigurable design for pricing Asian options, ACM SIGARCH Comput. Archit. News, Vol: 38, Pages: 14-20, ISSN: 0163-5964

Journal article

Yamaguchi Y, Tsoi HK, Luk W, 2011, FPGA-Based Smith-Waterman Algorithm: Analysis and Novel Design, 7th International Symposium on Applied Reconfigurable Computing, ARC 2011, Publisher: SPRINGER-VERLAG BERLIN, Pages: 181-+, ISSN: 0302-9743

Citations: 18

Conference paper

Mak T, Cheung PYK, Lam KP, Luk W et al., 2011, Adaptive routing in network-on-chips using a dynamic-programming network, IEEE Transactions on Industrial Electronics, Vol: 58, Pages: 3701-3716

Journal article

Betkaoui B, Thomas DB, Luk W, Przulj N et al., 2011, A framework for FPGA acceleration of large graph problems: Graphlet counting case study, Pages: 1-8

Conference paper

Yamaguchi Y, Kuen HT, Luk W, 2011, A Comparison of FPGAs, GPUs and CPUs for Smith-Waterman Algorithm, 19th Annual ACM International Symposium on Field-Programmable Gate Arrays, Publisher: ASSOC COMPUTING MACHINERY, Pages: 282-282

Conference paper

Chow GCT, Kwok KW, Luk W, Leong P et al., 2011, Mixed Precision Comparison in Reconfigurable Systems, IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE COMPUTER SOC, Pages: 17-24

Citations: 5

Conference paper

Yamaguchi Y, Tsoi KH, Luk W, 2011, A comparison of FPGAs, GPUs and CPUs for Smith-Waterman algorithm (abstract only), Publisher: ACM, Pages: 281-281

Conference paper

Jin Q, Luk W, Thomas DB, 2011, On Comparing Financial Option Price Solvers on FPGA, IEEE Conf. on Field-Programmable Custom Computing Machines (FCCM), Pages: 89-92

Conference paper

Denholm S, Tsoi KH, Pietzuch P, Luk W et al., 2011, CusComNet: A Customisable Network for Reconfigurable Heterogeneous Clusters, 22nd IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 9-16, ISSN: 2160-0511

Citations: 1

Conference paper

Todman T, Liu Q, Luk W, Constantinides GA et al., 2010, Customizable Composition and Parameterization of Hardware Design Transformations, 13th Euromicro conference on Digital System Design

Conference paper

Lysaght P, Luk W, Cheung PYK, 2010, Proceedings - 2010 International Conference on Field Programmable Logic and Applications, FPL 2010: Message from the steering committee, Proceedings - 2010 International Conference on Field Programmable Logic and Applications, FPL 2010

Journal article

Sriram V, Cox D, Tsoi KH, Luk W et al., 2010, Towards an embedded biologically-inspired machine vision processor, Pages: 273-278

Biologically-inspired machine vision algorithms - those that attempt to capture aspects of the computational architecture of the brain - have proven to be a promising class of algorithms for performing a variety of object and face recognition tasks. However these algorithms typically require a large number of arithmetic operations per image frame evaluated. Meanwhile, the increasing ubiquity of inexpensive cameras in a wide array of embedded devices presents an enormous opportunity for the deployment of embedded machine vision systems. As a first step towards an embedded implementation, we consider the main requirements for the design of an embedded processor for biologically-inspired object recognition and demonstrate an FPGA prototype of the V1-like algorithm, a simple biologically-inspired system from the literature [1], [2], [3]. We present a multiple instruction, single data (MISD) pipeline implementation of V1-like, and show that such designs are feasible in an FPGA context, particularly for small frame sizes (e.g. 100x100). In addition, we show that such an implementation offers good performance per unit silicon area and power dissipation in comparison to traditional CPU and GPU implementations. Finally, we discuss the constraints under which such an embedded strategy would be feasible for a more general biologically inspired face recognition system, and consider paths forward towards a wider range of possible embedded targets. © 2010 IEEE.

Citations: 28

Conference paper

Liu Q, Todman T, Tsoi KH, Luk W et al., 2010, Convex models for accelerating applications on FPGA-based clusters, Pages: 495-498

We propose a new approach, based on a set of convex models, to accelerate an application using a computing cluster which contains field-programmable gate arrays (FPGAs). The computationally-intensive tasks of the application are mapped onto multiple acceleration nodes, and the datapaths on the nodes are customized around the tasks during compilation. We propose models for computation and communication on the FPGA-based cluster, and formulate the design problem as a convex non-linear optimization problem allowing design exploration. We evaluate our approach on a cluster with 16 nodes for Monte Carlo simulation, resulting in a design 690 times faster than a software implementation. © 2010 IEEE.

Citations: 4

Conference paper

Tse AHT, Thomas DB, Tsoi KH, Luk W et al., 2010, Reconfigurable control variate Monte-Carlo designs for pricing exotic options, Pages: 364-367

Exotic options are financial derivatives which have complex features including path-dependency. These complex features make them difficult to price, as only computationally intensive Monte-Carlo methods can provide accurate prices. This paper proposes an FPGA-accelerated control variate Monte-Carlo (CVMC) framework for pricing exotic options. An optimised implementation of arithmetic Asian option pricing under this framework in a Virtex-5 xc5vlx330t FPGA at 200MHz is 24 times faster than a multi-threaded software implementation on a Xeon E5420 at 2.5GHz; it is also 2.4 times faster than the Tesla C1060 GPU at 1.3 GHz. © 2010 IEEE.

Citations: 14

Conference paper

Wray S, Luk W, Pietzuch P, 2010, Run-time reconfiguration for a reconfigurable algorithmic trading engine, Pages: 163-166

In this paper we present an analysis of using run-time reconfiguration of reconfigurable hardware to modify trading algorithms during use. This provides flexibility in algorithm design, enabling the implementation to react to changes in market conditions and increasing performance. We study what can be achieved to reduce performance loss in algorithms while reconfiguration takes place, such as buffering information during this time. Our results show that our average partial reconfiguration time is 0.002091 seconds; at the historic highest market data rates, this would result in about 5,000 messages being missed or requiring buffering. This is the worst-case scenario; normally the system would only require a fraction of messages. The reconfiguration time is acceptable if it is under the limit required by the user to prevent business performance suffering. © 2010 IEEE.

Citations: 2

Conference paper

Chow GCT, Eguro K, Luk W, Leong P et al., 2010, A Karatsuba-based Montgomery multiplier, Pages: 434-437

Modular multiplication of long integers is an important building block for cryptographic algorithms. Although several FPGA accelerators have been proposed for large modular multiplication, previous systems have been based on O(N^2) algorithms. In this paper, we present a Montgomery multiplier that incorporates the more efficient Karatsuba algorithm which is O(N^(log 3/log 2)). This system is parameterizable to different bitwidths and makes excellent use of both embedded multipliers and fine-grained logic. The design has significantly lower LUT-delay product and multiplier-delay product compared with previous designs. Initial testing on a Virtex-6 FPGA showed that it is 60-190 times faster than an optimized multi-threaded software implementation running on an Intel Xeon 2.5 GHz CPU. The proposed multiplier system is also estimated to be 95-189 times more energy efficient than the software-based implementation. This high performance and energy efficiency makes it suitable for server-side applications running in a datacenter environment. © 2010 IEEE.

Citations: 40

Conference paper

Cecchi S, Primavera A, Piazza F, Bettarelli F, Ciavattini E, Toppi R, Coutinho JGF, Luk W, Pilato C, Ferrandi F, Sima VM, Bertels K et al., 2010, The hArtes CarLab: A new approach to advanced algorithms development for automotive audio, 129th Audio Engineering Society Convention 2010, Vol: 1, Pages: 601-612

In the last decade automotive audio has been gaining great attention from the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for a heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab) has been developed employing professional audio equipment. The algorithms can be tested and validated on a PC exploiting each application as a plug-in of a real time framework. Then a set of tools (hArtes Toolchain) can be used to generate code for the embedded platform starting from plug-in implementation. An overview of the entire system is here presented, showing its effectiveness.

Citations: 2

Journal article

Tse AHT, Thomas DB, Tsoi KH, Luk W et al., 2010, Dynamic scheduling Monte-Carlo framework for multi-accelerator heterogeneous clusters, Pages: 233-240

Conference paper

Lam YM, Coutinho JGF, Ho CH, Leong PHW, Luk W et al., 2010, Multiloop parallelisation using unrolling and fission, International Journal of Reconfigurable Computing, Vol: 2010, ISSN: 1687-7195

A technique for parallelising multiple loops in a heterogeneous computing system is presented. Loops are first unrolled and then broken up into multiple tasks which are mapped to reconfigurable hardware. A performance-driven optimisation is applied to find the best unrolling factor for each loop under hardware size constraints. The approach is demonstrated using three applications: speech recognition, image processing, and the N-Body problem. Experimental results show that a maximum speedup of 34 is achieved on a 274MHz FPGA for the N-Body over a 2.6GHz microprocessor, which is 4.1 times higher than that of an approach without unrolling. Copyright © 2010 Yuet Ming Lam et al.

Citations: 4

Journal article

Tsoi KH, Tse AHT, Pietzuch P, Luk W et al., 2010, Programming framework for clusters with heterogeneous accelerators, ACM SIGARCH Computer Architecture News, Vol: 38, Pages: 53-59, ISSN: 0163-5964

We describe a programming framework for high performance clusters with various hardware accelerators. In this framework, users can utilize the available heterogeneous resources productively and efficiently. The distributed application is highly modularized to support dynamic system configuration with changing types and number of the accelerators. Multiple layers of communication interface are introduced to reduce the overhead in both control messages and data transfers. Parallelism can be achieved by controlling the accelerators in various schemes through scheduling extension. The framework has been used to support physics simulation and financial application development. We achieve significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.

Journal article

Bertels K, Sima V-M, Yankova Y, Kuzmanov G, Luk W, Coutinho G, Ferrandi F, Pilato C, Lattuada M, Sciuto D, Michelotti A et al., 2010, hArtes: Hardware-Software Codesign for Heterogeneous Multicore Platforms, IEEE Micro, Vol: 30, Pages: 88-97, ISSN: 0272-1732

Citations: 20

Journal article

Fu H, Mencer O, Luk W, 2010, FPGA Designs with Optimized Logarithmic Arithmetic, IEEE Transactions on Computers, Vol: 59, Pages: 1000-1006, ISSN: 0018-9340

Citations: 28

Journal article

Cope B, Cheung PYK, Luk W, Howes L et al., 2010, Performance comparison of graphics processors to reconfigurable logic: a case study, IEEE Transactions on Computers, Vol: 54, Pages: 433-448, ISSN: 0018-9340

A systematic approach to the comparison of the graphics processor (GPU) and reconfigurable logic is defined in terms of three throughput drivers. The approach is applied to five case study algorithms, characterized by their arithmetic complexity, memory access requirements, and data dependence, and two target devices: the nVidia GeForce 7900 GTX GPU and a Xilinx Virtex-4 field programmable gate array (FPGA). Two orders of magnitude speedup, over a general-purpose processor, is observed for each device for arithmetic intensive algorithms. An FPGA is superior, over a GPU, for algorithms requiring large numbers of regular memory accesses, while the GPU is superior for algorithms with variable data reuse. In the presence of data dependence, the implementation of a customized data path in an FPGA exceeds GPU performance by up to eight times. The trends of the analysis to newer and future technologies are analyzed.

Journal article

Jamieson P, Becker T, Cheung PYK, Luk W, Rissa T, Pitkanen T et al., 2010, Benchmarking and evaluating reconfigurable architectures targeting the mobile domain, ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol: 15

We present the GroundHog 2009 benchmarking suite that evaluates the power consumption of reconfigurable technology for applications targeting the mobile computing domain. This benchmark suite includes seven designs; one design targets fine-grained FPGA fabrics allowing for quick state-of-the-art evaluation, and six designs are specified at a high level allowing them to target a range of existing and future reconfigurable technologies. Each of the six designs can be stimulated with the help of synthetically generated input stimuli created by an open-source tool included in the downloadable suite. Another tool is included to help verify the correctness of each implemented design. To demonstrate the potential of this benchmark suite, we evaluate the power consumption of two modern industrial FPGAs targeting the mobile domain. Also, we show how an academic FPGA framework, VPR 5.0, that has been updated for power estimates can be used to estimate the power consumption of different FPGA architectures and an open-source CAD flow mapping to these architectures.

Journal article

Mak T, Sedcole P, Cheung PYK, Luk W et al., 2010, Wave-pipelined intra-chip signaling for on-FPGA communications, Integration, the VLSI Journal, Vol: 43, Pages: 188-201

On-FPGA communication is becoming more problematic as the long interconnection performance is deteriorating in technology scaling. In this paper, we address this issue by proposing a novel wave-pipelined signaling scheme to achieve substantial throughput improvement in FPGAs. A new analytical model capturing the electrical characteristics in FPGA interconnects is presented. Based on the model, throughput and power consumption of a wave-pipelined link have been derived analytically and compared to the conventional synchronous links. Two circuit designs are proposed to realize wave-pipelined link using FPGA fabrics. The proposed approaches are also compared with conventional synchronous and asynchronous pipelining techniques. It is shown that the wave-pipelined approach can achieve up to 5.7 times improvement in throughput and 13% improvement in power consumption versus conventional delay-based on-chip communication schemes. Also, trade-offs between power, throughput and area consumption between the proposed and conventional designs are studied. The wave-pipelining approach provides a new alternative for on-FPGA communications and can potentially become a promising solution to mitigate the future interconnect scaling challenge.

Journal article

Becker T, Jamieson P, Luk W, Cheung PYK, Rissa T et al., 2010, Power characterisation for fine-grain reconfigurable fabrics, International Journal of Reconfigurable Computing, Vol: 2010

This paper proposes a benchmarking methodology for characterising the power consumption of the fine-grain fabric in reconfigurable architectures. This methodology is part of the GroundHog 2009 power benchmarking suite. It covers active and inactive power as well as advanced low-power modes. A method based on random number generators is adopted for comparing activity modes. We illustrate our approach using five field-programmable gate arrays (FPGAs) that span a range of process technologies: Xilinx Virtex-II Pro, Spartan-3E, Spartan-3AN, Virtex-5, and Silicon Blue iCE65. We find that, despite improvements through process technology and low-power modes, current devices need further improvements to be sufficiently power efficient for mobile applications. The Silicon Blue device demonstrates that performance can be traded off to achieve lower leakage.

Journal article

Thomas DB, Luk W, 2010, An FPGA-Specific Algorithm for Direct Generation of Multi-Variate Gaussian Random Numbers, 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Publisher: IEEE, ISSN: 2160-0511

Conference paper

Liu Q, Todman T, Luk W, 2010, Combining optimizations in automated low power design, International Conference on Design, Automation & Test in Europe, Pages: 1791-1796

Conference paper

Le Masle A, Luk W, Eldredge J, Carver K et al., 2010, Parametric Encryption Hardware Design, 6th International Workshop on Applied Reconfigurable Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 68-+, ISSN: 0302-9743

Citations: 4

Conference paper

Le Masle A, Luk W, 2010, Design Space Exploration of Parametric Pipelined Designs, 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Publisher: IEEE, ISSN: 2160-0511

Conference paper

Professor Wayne Luk