Publications
Tse AHT, Thomas DB, Tsoi KH, et al., 2011, Efficient reconfigurable design for pricing Asian options, ACM SIGARCH Comput. Archit. News, Vol: 38, Pages: 14-20, ISSN: 0163-5964
Yamaguchi Y, Tsoi HK, Luk W, 2011, FPGA-Based Smith-Waterman Algorithm: Analysis and Novel Design, 7th International Symposium on Applied Reconfigurable Computing, ARC 2011, Publisher: SPRINGER-VERLAG BERLIN, Pages: 181-+, ISSN: 0302-9743
- Citations: 18
Mak T, Cheung PYK, Lam KP, et al., 2011, Adaptive routing in network-on-chips using a dynamic-programming network, IEEE Transactions on Industrial Electronics, Vol: 58, Pages: 3701-3716
Betkaoui B, Thomas DB, Luk W, et al., 2011, A framework for FPGA acceleration of large graph problems: Graphlet counting case study, Pages: 1-8
Yamaguchi Y, Kuen HT, Luk W, 2011, A Comparison of FPGAs, GPUs and CPUs for Smith-Waterman Algorithm, 19th Annual ACM International Symposium on Field-Programmable Gate Arrays, Publisher: ASSOC COMPUTING MACHINERY, Pages: 282-282
Chow GCT, Kwok KW, Luk W, et al., 2011, Mixed Precision Comparison in Reconfigurable Systems, IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE COMPUTER SOC, Pages: 17-24
- Citations: 5
Yamaguchi Y, Tsoi KH, Luk W, 2011, A comparison of FPGAs, GPUs and CPUs for Smith-Waterman algorithm (abstract only), Publisher: ACM, Pages: 281-281
Jin Q, Luk W, Thomas DB, 2011, On Comparing Financial Option Price Solvers on FPGA, IEEE Conf. on Field-Programmable Custom Computing Machines (FCCM), Pages: 89-92
Denholm S, Tsoi KH, Pietzuch P, et al., 2011, CusComNet: A Customisable Network for Reconfigurable Heterogeneous Clusters, 22nd IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Publisher: IEEE, Pages: 9-16, ISSN: 2160-0511
- Citations: 1
Todman T, Liu Q, Luk W, et al., 2010, Customizable Composition and Parameterization of Hardware Design Transformations, 13th Euromicro conference on Digital System Design
Lysaght P, Luk W, Cheung PYK, 2010, Proceedings - 2010 International Conference on Field Programmable Logic and Applications, FPL 2010: Message from the steering committee, Proceedings - 2010 International Conference on Field Programmable Logic and Applications, FPL 2010
Sriram V, Cox D, Tsoi KH, et al., 2010, Towards an embedded biologically-inspired machine vision processor, Pages: 273-278
Biologically-inspired machine vision algorithms - those that attempt to capture aspects of the computational architecture of the brain - have proven to be a promising class of algorithms for performing a variety of object and face recognition tasks. However, these algorithms typically require a large number of arithmetic operations per image frame evaluated. Meanwhile, the increasing ubiquity of inexpensive cameras in a wide array of embedded devices presents an enormous opportunity for the deployment of embedded machine vision systems. As a first step towards an embedded implementation, we consider the main requirements for the design of an embedded processor for biologically-inspired object recognition and demonstrate an FPGA prototype of the V1-like algorithm, a simple biologically-inspired system from the literature [1], [2], [3]. We present a multiple instruction, single data (MISD) pipeline implementation of V1-like, and show that such designs are feasible in an FPGA context, particularly for small frame sizes (e.g. 100x100). In addition, we show that such an implementation offers good performance per unit silicon area and power dissipation in comparison to traditional CPU and GPU implementations. Finally, we discuss the constraints under which such an embedded strategy would be feasible for a more general biologically-inspired face recognition system, and consider paths forward towards a wider range of possible embedded targets. © 2010 IEEE.
Liu Q, Todman T, Tsoi KH, et al., 2010, Convex models for accelerating applications on FPGA-based clusters, Pages: 495-498
We propose a new approach, based on a set of convex models, to accelerate an application using a computing cluster which contains field-programmable gate arrays (FPGAs). The computationally-intensive tasks of the application are mapped onto multiple acceleration nodes, and the datapaths on the nodes are customized around the tasks during compilation. We propose models for computation and communication on the FPGA-based cluster, and formulate the design problem as a convex non-linear optimization problem allowing design exploration. We evaluate our approach on a cluster with 16 nodes for Monte Carlo simulation, resulting in a design 690 times faster than a software implementation. © 2010 IEEE.
Tse AHT, Thomas DB, Tsoi KH, et al., 2010, Reconfigurable control variate Monte-Carlo designs for pricing exotic options, Pages: 364-367
Exotic options are financial derivatives which have complex features including path-dependency. These complex features make them difficult to price, as only computationally intensive Monte-Carlo methods can provide accurate prices. This paper proposes an FPGA-accelerated control variate Monte-Carlo (CVMC) framework for pricing exotic options. An optimised implementation of arithmetic Asian option pricing under this framework in a Virtex-5 xc5vlx330t FPGA at 200MHz is 24 times faster than a multi-threaded software implementation on a Xeon E5420 at 2.5GHz; it is also 2.4 times faster than the Tesla C1060 GPU at 1.3 GHz. © 2010 IEEE.
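As a rough illustration of the control variate idea described in this abstract, the following Python sketch prices a discretely monitored arithmetic Asian call and uses the geometric-average Asian call, which has a closed-form price under Black-Scholes assumptions, as the control. This is a software sketch only, not the paper's FPGA design; the choice of control, the function names, and all parameter values are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def geometric_asian_call(s0, k, r, sigma, t, n):
    """Closed-form price of a geometric-average Asian call with n monitoring dates."""
    dt = t / n
    # Mean and variance of the log of the geometric average of S at dates dt, 2*dt, ..., n*dt.
    mu = math.log(s0) + (r - 0.5 * sigma ** 2) * dt * (n + 1) / 2
    var = sigma ** 2 * dt * (n + 1) * (2 * n + 1) / (6 * n)
    sd = math.sqrt(var)
    d2 = (mu - math.log(k)) / sd
    d1 = d2 + sd
    cdf = NormalDist().cdf
    return math.exp(-r * t) * (math.exp(mu + 0.5 * var) * cdf(d1) - k * cdf(d2))


def cvmc_asian_call(s0, k, r, sigma, t, n, paths, seed=0):
    """Return (plain estimate, control variate estimate, plain variance, adjusted variance)."""
    rng = random.Random(seed)
    dt = t / n
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    disc = math.exp(-r * t)
    arith, geo = [], []
    for _path in range(paths):
        log_s, sum_s, sum_log = math.log(s0), 0.0, 0.0
        for _step in range(n):
            log_s += drift + vol * rng.gauss(0.0, 1.0)
            sum_s += math.exp(log_s)
            sum_log += log_s
        arith.append(disc * max(sum_s / n - k, 0.0))            # target payoff
        geo.append(disc * max(math.exp(sum_log / n) - k, 0.0))  # control payoff
    mean_a, mean_g = fmean(arith), fmean(geo)
    cov = sum((a - mean_a) * (g - mean_g) for a, g in zip(arith, geo)) / (paths - 1)
    beta = cov / variance(geo)  # estimated optimal control coefficient
    exact_g = geometric_asian_call(s0, k, r, sigma, t, n)
    adjusted = [a - beta * (g - exact_g) for a, g in zip(arith, geo)]
    return mean_a, fmean(adjusted), variance(arith), variance(adjusted)


if __name__ == "__main__":
    plain, cv, var_plain, var_cv = cvmc_asian_call(100, 100, 0.05, 0.2, 1.0, 12, 5000, seed=1)
    print(f"plain MC: {plain:.4f}  control variate MC: {cv:.4f}")
    print(f"per-sample variance reduction: {var_plain / var_cv:.1f}x")
```

Because the arithmetic and geometric payoffs are strongly correlated, the adjusted samples have a much smaller variance, so fewer paths are needed for a given accuracy.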
Wray S, Luk W, Pietzuch P, 2010, Run-time reconfiguration for a reconfigurable algorithmic trading engine, Pages: 163-166
In this paper we present an analysis of using run-time reconfiguration of reconfigurable hardware to modify trading algorithms during use. This provides flexibility in algorithm design, enabling the implementation to react to changes in market conditions and so improve performance. We study what can be done to reduce performance loss in algorithms while reconfiguration takes place, such as buffering information during this time. Our results show that our average partial reconfiguration time is 0.002091 seconds; at the historic highest market data rates, this would result in about 5,000 messages being missed or requiring buffering. This is the worst-case scenario; normally the system would need to buffer only a fraction of these messages. The reconfiguration time is acceptable if it is below the limit the user requires to prevent business performance suffering. © 2010 IEEE.
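The buffering figure quoted in this abstract follows from multiplying the message rate by the reconfiguration time. The sketch below makes that arithmetic explicit. The peak rate is not stated in the abstract, so the value used here is back-derived from the two quoted figures and should be treated as an assumption; the function name is hypothetical.

```python
RECONFIG_TIME_S = 0.002091   # average partial reconfiguration time quoted in the abstract
MESSAGES_AFFECTED = 5_000    # approximate worst-case message count quoted in the abstract


def messages_during_reconfig(rate_per_s, reconfig_time_s=RECONFIG_TIME_S):
    """Number of messages that arrive while the device is being reconfigured."""
    return rate_per_s * reconfig_time_s


# Assumption: peak rate implied by the two quoted figures (not given in the abstract).
implied_peak_rate = MESSAGES_AFFECTED / RECONFIG_TIME_S

if __name__ == "__main__":
    print(f"implied peak rate: {implied_peak_rate:,.0f} messages/s")
    tenth = messages_during_reconfig(implied_peak_rate / 10)
    print(f"messages to buffer at a tenth of that rate: {tenth:.0f}")
```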
Chow GCT, Eguro K, Luk W, et al., 2010, A Karatsuba-based Montgomery multiplier, Pages: 434-437
Modular multiplication of long integers is an important building block for cryptographic algorithms. Although several FPGA accelerators have been proposed for large modular multiplication, previous systems have been based on O(N^2) algorithms. In this paper, we present a Montgomery multiplier that incorporates the more efficient Karatsuba algorithm, which is O(N^(log 3/log 2)). This system is parameterizable to different bitwidths and makes excellent use of both embedded multipliers and fine-grained logic. The design has significantly lower LUT-delay product and multiplier-delay product compared with previous designs. Initial testing on a Virtex-6 FPGA showed that it is 60-190 times faster than an optimized multi-threaded software implementation running on an Intel Xeon 2.5 GHz CPU. The proposed multiplier system is also estimated to be 95-189 times more energy efficient than the software-based implementation. This high performance and energy efficiency make it suitable for server-side applications running in a datacenter environment. © 2010 IEEE.
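To make the combination named in this abstract concrete, here is a minimal software sketch of Montgomery multiplication in which every large product is computed with a recursive Karatsuba multiplier. It illustrates the two textbook algorithms only and says nothing about the paper's hardware architecture; the function names, the 64-bit recursion threshold, and the choice of R as a power of two just above the modulus are assumptions.

```python
def karatsuba(x, y, threshold_bits=64):
    """Multiply non-negative integers using three half-size products per level."""
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y  # small operands: fall back to the built-in multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    lo = karatsuba(x_lo, y_lo, threshold_bits)
    hi = karatsuba(x_hi, y_hi, threshold_bits)
    # (x_lo + x_hi)(y_lo + y_hi) - lo - hi gives the cross term with one multiply.
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi, threshold_bits) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo


def montgomery_multiply(a, b, modulus, r_bits, n_prime):
    """Return a * b * R^-1 mod modulus, where R = 2**r_bits and a, b < modulus."""
    mask = (1 << r_bits) - 1
    t = karatsuba(a, b)
    m = karatsuba(t & mask, n_prime) & mask
    u = (t + karatsuba(m, modulus)) >> r_bits  # exact division by R
    return u - modulus if u >= modulus else u


def modmul(a, b, modulus):
    """Compute a * b mod modulus (odd modulus, a, b < modulus) via Montgomery form."""
    r_bits = modulus.bit_length()
    r = 1 << r_bits
    n_prime = (-pow(modulus, -1, r)) % r  # needs Python 3.8+ for the modular inverse
    r2 = (r * r) % modulus
    a_m = montgomery_multiply(a, r2, modulus, r_bits, n_prime)  # a * R mod N
    b_m = montgomery_multiply(b, r2, modulus, r_bits, n_prime)  # b * R mod N
    prod_m = montgomery_multiply(a_m, b_m, modulus, r_bits, n_prime)
    return montgomery_multiply(prod_m, 1, modulus, r_bits, n_prime)  # leave Montgomery form


if __name__ == "__main__":
    n = (1 << 521) - 1  # an odd modulus chosen only for the demonstration
    a, b = 3 ** 300 % n, 7 ** 180 % n
    print(modmul(a, b, n) == (a * b) % n)
```

In practice the conversion into and out of Montgomery form is amortised over many multiplications, for example inside a modular exponentiation.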
Cecchi S, Primavera A, Piazza F, et al., 2010, The hArtes CarLab: A new approach to advanced algorithms development for automotive audio, 129th Audio Engineering Society Convention 2010, Vol: 1, Pages: 601-612
In the last decade automotive audio has been gaining great attention from the scientific and industrial communities. In this context, a new approach to testing and developing advanced audio algorithms for a heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab) has been developed employing professional audio equipment. The algorithms can be tested and validated on a PC, exploiting each application as a plug-in of a real-time framework. A set of tools (hArtes Toolchain) can then be used to generate code for the embedded platform starting from the plug-in implementation. An overview of the entire system is presented here, showing its effectiveness.
Tse AHT, Thomas DB, Tsoi KH, et al., 2010, Dynamic scheduling Monte-Carlo framework for multi-accelerator heterogeneous clusters, Pages: 233-240
Lam YM, Coutinho JGF, Ho CH, et al., 2010, Multiloop parallelisation using unrolling and fission, International Journal of Reconfigurable Computing, Vol: 2010, ISSN: 1687-7195
A technique for parallelising multiple loops in a heterogeneous computing system is presented. Loops are first unrolled and then broken up into multiple tasks which are mapped to reconfigurable hardware. A performance-driven optimisation is applied to find the best unrolling factor for each loop under hardware size constraints. The approach is demonstrated using three applications: speech recognition, image processing, and the N-Body problem. Experimental results show that a maximum speedup of 34 is achieved on a 274MHz FPGA for the N-Body over a 2.6GHz microprocessor, which is 4.1 times higher than that of an approach without unrolling. Copyright © 2010 Yuet Ming Lam et al.
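The two transformations named in this abstract can be sketched in software as follows. Unrolling a loop by a factor creates that many copies of the loop body per iteration, and fission then splits those copies into separate loops that can each be treated as an independent task. The loop body, the function names, and the sequential execution of the tasks are assumptions for illustration; in the paper the tasks are mapped to reconfigurable hardware and the unrolling factor is chosen by a performance-driven optimisation.

```python
def reference_loop(xs):
    """Original single loop: one body execution per element."""
    out = [0] * len(xs)
    for i in range(len(xs)):
        out[i] = xs[i] * xs[i] + 1
    return out


def unrolled_and_fissioned(xs, factor):
    """Same computation after unrolling by `factor` and splitting the copies into tasks."""
    out = [0] * len(xs)

    def task(lane):
        # Task `lane` owns the lane-th copy of the unrolled body: indices lane, lane + factor, ...
        for i in range(lane, len(xs), factor):
            out[i] = xs[i] * xs[i] + 1

    # The tasks touch disjoint indices, so they could run concurrently;
    # they are run one after another here only to keep the sketch simple.
    for lane in range(factor):
        task(lane)
    return out


if __name__ == "__main__":
    data = list(range(10))
    print(unrolled_and_fissioned(data, 4) == reference_loop(data))
```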
Tsoi KH, Tse AHT, Pietzuch P, et al., 2010, Programming framework for clusters with heterogeneous accelerators, ACM SIGARCH Computer Architecture News, Vol: 38, Pages: 53-59, ISSN: 0163-5964
We describe a programming framework for high performance clusters with various hardware accelerators. In this framework, users can utilize the available heterogeneous resources productively and efficiently. The distributed application is highly modularized to support dynamic system configuration with changing types and number of the accelerators. Multiple layers of communication interface are introduced to reduce the overhead in both control messages and data transfers. Parallelism can be achieved by controlling the accelerators in various schemes through scheduling extension. The framework has been used to support physics simulation and financial application development. We achieve significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.
Bertels K, Sima V-M, Yankova Y, et al., 2010, hArtes: Hardware-Software Codesign for Heterogeneous Multicore Platforms, IEEE Micro, Vol: 30, Pages: 88-97, ISSN: 0272-1732
- Citations: 20
Fu H, Mencer O, Luk W, 2010, FPGA Designs with Optimized Logarithmic Arithmetic, IEEE Transactions on Computers, Vol: 59, Pages: 1000-1006, ISSN: 0018-9340
- Citations: 28
Cope B, Cheung PYK, Luk W, et al., 2010, Performance comparison of graphics processors to reconfigurable logic: a case study, IEEE Transactions on Computers, Vol: 59, Pages: 433-448, ISSN: 0018-9340
A systematic approach to the comparison of the graphics processor (GPU) and reconfigurable logic is defined in terms of three throughput drivers. The approach is applied to five case study algorithms, characterized by their arithmetic complexity, memory access requirements, and data dependence, and two target devices: the nVidia GeForce 7900 GTX GPU and a Xilinx Virtex-4 field programmable gate array (FPGA). Two orders of magnitude speedup, over a general-purpose processor, is observed for each device for arithmetic intensive algorithms. An FPGA is superior, over a GPU, for algorithms requiring large numbers of regular memory accesses, while the GPU is superior for algorithms with variable data reuse. In the presence of data dependence, the implementation of a customized data path in an FPGA exceeds GPU performance by up to eight times. The trends of the analysis to newer and future technologies are analyzed.
Jamieson P, Becker T, Cheung PYK, et al., 2010, Benchmarking and evaluating reconfigurable architectures targeting the mobile domain, ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol: 15
We present the GroundHog 2009 benchmarking suite that evaluates the power consumption of reconfigurable technology for applications targeting the mobile computing domain. This benchmark suite includes seven designs; one design targets fine-grained FPGA fabrics allowing for quick state-of-the-art evaluation, and six designs are specified at a high level allowing them to target a range of existing and future reconfigurable technologies. Each of the six designs can be stimulated with the help of synthetically generated input stimuli created by an open-source tool included in the downloadable suite. Another tool is included to help verify the correctness of each implemented design. To demonstrate the potential of this benchmark suite, we evaluate the power consumption of two modern industrial FPGAs targeting the mobile domain. Also, we show how an academic FPGA framework, VPR 5.0, that has been updated for power estimates can be used to estimate the power consumption of different FPGA architectures and an open-source CAD flow mapping to these architectures.
Mak T, Sedcole P, Cheung PYK, et al., 2010, Wave-pipelined intra-chip signaling for on-FPGA communications, Integration, the VLSI Journal, Vol: 43, Pages: 188-201
On-FPGA communication is becoming more problematic as the long interconnection performance is deteriorating in technology scaling. In this paper, we address this issue by proposing a novel wave-pipelined signaling scheme to achieve substantial throughput improvement in FPGAs. A new analytical model capturing the electrical characteristics in FPGA interconnects is presented. Based on the model, throughput and power consumption of a wave-pipelined link have been derived analytically and compared to the conventional synchronous links. Two circuit designs are proposed to realize wave-pipelined link using FPGA fabrics. The proposed approaches are also compared with conventional synchronous and asynchronous pipelining techniques. It is shown that the wave-pipelined approach can achieve up to 5.7 times improvement in throughput and 13% improvement in power consumption versus conventional delay-based on-chip communication schemes. Also, trade-offs between power, throughput and area consumption between the proposed and conventional designs are studied. The wave-pipelining approach provides a new alternative for on-FPGA communications and can potentially become a promising solution to mitigate the future interconnect scaling challenge.
Becker T, Jamieson P, Luk W, et al., 2010, Power characterisation for fine-grain reconfigurable fabrics, International Journal of Reconfigurable Computing, Vol: 2010
This paper proposes a benchmarking methodology for characterising the power consumption of the fine-grain fabric in reconfigurable architectures. This methodology is part of the GroundHog 2009 power benchmarking suite. It covers active and inactive power as well as advanced low-power modes. A method based on random number generators is adopted for comparing activity modes. We illustrate our approach using five field-programmable gate arrays (FPGAs) that span a range of process technologies: Xilinx Virtex-II Pro, Spartan-3E, Spartan-3AN, Virtex-5, and Silicon Blue iCE65. We find that, despite improvements through process technology and low-power modes, current devices need further improvements to be sufficiently power efficient for mobile applications. The Silicon Blue device demonstrates that performance can be traded off to achieve lower leakage.
Thomas DB, Luk W, 2010, An FPGA-Specific Algorithm for Direct Generation of Multi-Variate Gaussian Random Numbers, 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Publisher: IEEE, ISSN: 2160-0511
Liu Q, Todman T, Luk W, 2010, Combining optimizations in automated low power design, International Conference on Design, Automation & Test in Europe, Pages: 1791-1796
Le Masle A, Luk W, Eldredge J, et al., 2010, Parametric Encryption Hardware Design, 6th International Workshop on Applied Reconfigurable Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 68-+, ISSN: 0302-9743
- Citations: 4
Le Masle A, Luk W, 2010, Design Space Exploration of Parametric Pipelined Designs, 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Publisher: IEEE, ISSN: 2160-0511
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.