Publications

Todman T, Liu Q, Luk W, Constantinides GA et al., 2010, A scripting engine for combining design transformations

Journal article

Lopez S, Sarmiento R, Potter PG, Luk W, Cheung PYK et al., 2010, Exploration of Hardware Sharing for Image Encoders, Design, Automation and Test in Europe Conference and Exhibition (DATE), Publisher: IEEE, Pages: 1737-1742, ISSN: 1530-1591

Citations: 1

Conference paper

Becker T, Luk W, Cheung PYK, 2010, Energy-aware optimisation for run-time reconfiguration, 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Publisher: IEEE COMPUTER SOC, Pages: 55-62

Citations: 11

Conference paper

Becker T, Koester M, Luk W, 2010, Automated placement of reconfigurable regions for relocatable modules, International Symposium on Circuits and Systems Nano-Bio Circuit Fabrics and Systems (ISCAS 2010), Publisher: IEEE, Pages: 3341-3344, ISSN: 0271-4302

Citations: 6

Conference paper

Tsoi KH, Luk W, 2010, Axel: A Heterogeneous Cluster with FPGAs and GPUs, 18th ACM International Symposium on Field-Programmable Gate Arrays, Publisher: ASSOC COMPUTING MACHINERY, Pages: 115-124

Citations: 59

Conference paper

Wray S, Luk W, Pietzuch P, 2010, Exploring Algorithmic Trading in Reconfigurable Hardware, 21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Publisher: IEEE, ISSN: 2160-0511

Conference paper

Betkaoui B, Thomas DB, Luk W et al., 2010, Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing, Pages: 94-101

Conference paper

Thomas D, Luk W, 2010, FPGA-Optimised Uniform Random Number Generators Using LUTs and Shift Registers, International Conference on Field Programmable Logic and Applications, Pages: 77-82, ISSN: 1946-1488

Conference paper

Mak T, Cheung PYK, Luk W, Lam KP et al., 2009, A DP-network for optimal dynamic routing in network-on-chip, Pages: 119-127

Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. However, implementation of adaptive routing in a network-on-chip (NoC) system is not trivial and is further complicated by the requirements of deadlock freedom and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduce the size of the routing table and maintain a high quality of adaptation, which leads to a scalable dynamic routing solution with minimal hardware overhead. Our results based on a cycle-accurate simulator demonstrate the effectiveness of the DP-network, which outperforms both the deterministic and adaptive routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for the DP-network is insignificant based on the results obtained from the hardware implementations. Copyright 2009 ACM.

Citations: 18

Conference paper

Thomas DB, Luk W, 2009, Using FPGA resources for direct generation of multivariate Gaussian random numbers, Proceedings of the 2009 International Conference on Field-Programmable Technology, FPT'09, Pages: 344-347

The multivariate Gaussian distribution is used to model random processes with distinct pair-wise correlations, such as stock prices that tend to rise and fall together. Generation from a distribution with dimension n is usually achieved by starting with a vector of n independent Gaussian samples, then multiplying with a correlation inducing matrix, using O(n^2) multiplications. This paper presents a method of generating vectors directly from the uniform distribution, removing the need for any multipliers or a scalar Gaussian generator. The method uses only small ROMs and adders, and so can be implemented using just basic FPGA resources (LUTs and FFs), saving DSP and block-RAM resources for the numerical simulation that the multivariate generator is driving. The method produces a new vector every cycle, unlike existing methods which produce vectors serially over n cycles, with only a modest increase in resource usage. This provides a ten times increase in performance over the fastest existing method, while also providing five times the performance per logic resource of the most efficient method. © 2009 IEEE.

Citations: 4

Journal article

Tse AHT, Thomas DB, Luk W, 2009, Option pricing with multi-dimensional quadrature architectures, Proceedings of the 2009 International Conference on Field-Programmable Technology, FPT'09, Pages: 427-430

Quadrature-based methods for numerical integration provide a means of quickly and accurately pricing financial products such as options. These methods can be applied to multi-dimensional products, such as options on multiple underlying assets, but suffer from an exponential increase in computational complexity as the dimension increases. This paper examines the theoretical complexity of quadrature methods for pricing multi-dimensional options, and then relates this to practical performance in contemporary hardware. An automated system for generating hardware architectures for quadrature is used to explore the performance of increasing dimensionality in FPGA implementations, which are then compared to GPU and CPU solutions. We find that a single-precision FPGA can provide 25 times speedup over software in three dimensions, and offers slightly improved performance over a GPU using comparable technology. The latest GPUs are 2.7 times faster than the older technology Virtex-4 FPGA, but the FPGA still provides over 9 times the energy efficiency. © 2009 IEEE.

Citations: 10

Journal article

Thomas DB, Coutinho JGF, Luk W, 2009, Reconfigurable computing: Productivity and performance, Conference Record - Asilomar Conference on Signals, Systems and Computers, Pages: 685-689, ISSN: 1058-6393

Reconfigurable computing technologies have been used in a variety of computer systems, from supercomputers to portable embedded systems. However, developing applications on reconfigurable computers still appears to be more difficult than developing applications on conventional computers and on graphics processing units. This paper suggests some reasons why this could be the case, and proposes a number of solutions designed to meet the challenge of improving design productivity and performance for reconfigurable computing systems. © 2009 IEEE.

Citations: 4

Journal article

Yu CW, Luk W, Wilton SJE, Leong PHW et al., 2009, Routing optimization for hybrid FPGAs, Proceedings of the 2009 International Conference on Field-Programmable Technology, FPT'09, Pages: 419-422

This paper optimizes the routing structure for hybrid FPGAs, in which high I/O density coarse-grained units are embedded within fine-grained logic. This significantly increases the routing resource requirement between elements. We investigate the routing demand for hybrid FPGAs over a set of domain-specific applications. The trade-off in delay, area and routability of the separation distance between coarse-grained blocks is studied. The effects of adding routing switches to the coarse-grained blocks and using wider channels near them to meet extra routing demand are examined. Our optimized architectures are compared to an existing column-based architecture. The results show that (1) there is 44% track usage at the edge of the embedded blocks, (2) both the separation of embedded blocks and the addition of switches to embedded blocks can increase the area and delay performance by 48.4% compared to a column-based FPGA architecture, (3) a wider channel width reduces the area of a highly congested system by 34.9%, but it cannot further improve a system with separation of embedded blocks and additional switches on embedded blocks. © 2009 IEEE.

Citations: 3

Journal article

Jin Q, Thomas DB, Luk W, 2009, Automated application acceleration using software to hardware transformation, Proceedings of the 2009 International Conference on Field-Programmable Technology, FPT'09, Pages: 411-414

This paper describes an approach that allows applications to be developed in a software language, while taking advantage of hardware by facilities that automatically transform such software programs for hardware accelerators. A demonstration of this approach has been built for the C# language. Three case studies in numerical integration show that the automatically generated hardware accelerators can achieve similar speed-ups to manually optimised versions. In particular, the automatically generated accelerator running on an xc4vlx160 FPGA at 83MHz with single precision arithmetic can be more than 18 times faster and up to 143 times more power efficient than a Pentium 4 processor at 3.6GHz, while the double precision accelerator running at 64MHz is 7 times faster and 77 times more power efficient. © 2009 IEEE.

Journal article

Jin Q, Thomas DB, Luk W, Cope B et al., 2009, Exploring Reconfigurable Architectures for Tree-Based Option Pricing Models, ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol: 2, Pages: 21:1-21:??, ISSN: 1936-7406

Journal article

Thomas DB, Howes L, Luk W, 2009, A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation, Pages: 63-72

The future of high-performance computing is likely to rely on the ability to efficiently exploit huge amounts of parallelism. One way of taking advantage of this parallelism is to formulate problems as "embarrassingly parallel" Monte-Carlo simulations, which allow applications to achieve a linear speedup over multiple computational nodes, without requiring a super-linear increase in inter-node communication. Such applications are reliant on a cheap supply of high quality random numbers, particularly for the three maximum entropy distributions: uniform, used as a source of randomness; Gaussian, for discrete-time simulations; and exponential, for discrete-event simulations. In this paper we look at four different types of platform: multi-core CPUs (Intel Core2); GPUs (NVidia 200); FPGAs (Xilinx Virtex-5); and Massively Parallel Processor Arrays (Ambric AM2000). For each platform we determine the most appropriate algorithm for generating each type of number, then calculate the peak generation rate and estimated power efficiency for each device. Copyright 2009 ACM.

Citations: 136

Conference paper

Fahmy SA, Cheung PYK, Luk W, 2009, High-throughput one-dimensional median and weighted median filters on FPGA, Computers & Digital Techniques, IET, Vol: 3, Pages: 384-394

Most effort in designing median filters has focused on two-dimensional filters with small window sizes, used for image processing. However, recent work on novel image processing algorithms, such as the trace transform, has highlighted the need for architectures that can compute the median and weighted median of large one-dimensional windows, to which the optimisations in the aforementioned architectures do not apply. A set of architectures for computing both the median and weighted median of large, flexibly sized windows through parallel cumulative histogram construction is presented. The architecture uses embedded memories to control the highly parallel bank of histogram nodes, and can implicitly determine window sizes for median and weighted median calculations. The architecture is shown to perform at 72 Msamples, and has been integrated within a trace transform architecture.

Journal article

Fu H, Osborne W, Clapp RG, Mencer O, Luk W et al., 2009, Accelerating seismic computations using customized number representations on FPGAs, Eurasip Journal on Embedded Systems, Vol: 2009, ISSN: 1687-3955

The oil and gas industry has an increasingly large demand for high-performance computation over huge volumes of data. Compared to common processors, field-programmable gate arrays (FPGAs) can boost the computation performance with a streaming computation architecture and support for application-specific number representations. With hardware support for reconfigurable number formats and bit widths, reduced precision can greatly decrease the area cost and I/O bandwidth of the design, thus multiplying the performance with concurrent processing cores on an FPGA. In this paper, we present a tool to determine the minimum number precision that still provides acceptable accuracy for seismic applications. By using the minimized number format, we implement core algorithms in seismic applications (the FK step in forward continued-based migration and 3D convolution in reverse time migration) on FPGA and show speedups ranging from 5 to 7 when including the transfer time to and from the processors. Provided sufficient bandwidth between CPU and FPGA, we show that a further increase to 48X speedup is possible.

Citations: 13

Journal article

Lee D-U, Cheung RCC, Luk W, Villasenor JD et al., 2009, Hierarchical Segmentation for Hardware Function Evaluation, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol: 17, Pages: 103-116, ISSN: 1063-8210

Citations: 33

Journal article

Susanto KW, Todman T, Coutinho JG, Luk W et al., 2009, Design Validation by Symbolic Simulation and Equivalence Checking: A Case Study in Memory Optimization for Image Manipulation, 35th Conference on Current Trends in Theory and Practice of Computer Science, Publisher: SPRINGER-VERLAG BERLIN, Pages: 509-520, ISSN: 0302-9743

Citations: 3

Conference paper

Luk W, Coutinho JGF, Todman T, Lam YM, Osborne W, Susanto KW, Liu Q, Wong WS et al., 2009, A High-Level Compilation Toolchain for Heterogeneous Systems, IEEE International SOC Conference, Publisher: IEEE, Pages: 9-18, ISSN: 2164-1676

Citations: 10

Conference paper

Lam YM, Coutinho JGF, Luk W, Leong PHW et al., 2009, Optimising Multi-Loop Programs for Heterogeneous Computing Systems, 2009 5th Southern Conference on Programmable Logic, Proceedings, Pages: 129-+

Citations: 4

Journal article

Jin Q, Thomas DB, Luk W, 2009, Exploring Reconfigurable Architectures for Explicit Finite Difference Option Pricing Models, 19th International Conference on Field Programmable Logic and Applications, Publisher: IEEE, Pages: 73-78, ISSN: 1946-1488

Citations: 75

Conference paper

Liu Q, Todman T, Luk W, Constantinides GA et al., 2009, Optimising Designs by Combining Model-based and Pattern-based Transformations

Conference paper

Jamieson P, Luk W, Constantinides GA, Wilton SJE et al., 2009, An Energy and Power Consumption Analysis of FPGA Routing Architectures, Pages: 324-327

Conference paper

Liu Q, Todman T, Coutinho G, Luk W, Constantinides GA et al., 2009, Automatic optimisation of map-reduce designs by geometric programming, Pages: 215-222

Conference paper

Becker T, Jamieson P, Luk W, Cheung PYK, Rissa T et al., 2009, Power Characterisation for the Fabric in Fine-Grain Reconfigurable Architectures, 5th Southern Conference on Programmable Logic, Publisher: IEEE, Pages: 77-+

Conference paper

Becker T, Luk W, Cheung PYK, 2009, Parametric Design for Reconfigurable Software-Defined Radio, 5th International Workshop on Applied Reconfigurable Computing, Publisher: SPRINGER-VERLAG BERLIN, Pages: 15-+, ISSN: 0302-9743

Citations: 10

Conference paper

Jamieson P, Becker T, Luk W, Cheung PYK, Rissa T, Pitkaenen T et al., 2009, Benchmarking Reconfigurable Architectures in the Mobile Domain, 17th Annual IEEE Symposium on Field Programmable Custom Computing Machines, Publisher: IEEE COMPUTER SOC, Pages: 131-+

Citations: 2

Conference paper

Wildie M, Luk W, Schultz SR, Leong PHW, Fidjeland A et al., 2009, Reconfigurable acceleration of neural models with gap junctions, Sydney, Australia, Pages: 439-442

Conference paper

Professor Wayne Luk