Energy-efficient Fine-grained Many-core Architecture for Video and DSP Applications

ISBN-10 : 1267970197
ISBN-13 : 9781267970190

Synopsis Energy-efficient Fine-grained Many-core Architecture for Video and DSP Applications by : Zhibin Xiao

Many-core processor architecture has become the most promising computer architecture. However, how to utilize the extra system performance for real applications such as video encoding is still challenging. This dissertation investigates the architecture design, physical implementation, and performance evaluation of a fine-grained many-core processor for advanced video coding, with a focus on interconnection, topology, memory system, and the related parallel programming methodology. A baseline residual encoder for H.264/AVC is proposed on a current-generation fine-grained many-core system that uses no application-specific hardware. The 25-processor encoder encodes video sequences with variable frame sizes and can encode 1080p HDTV at 30 frames per second with 293 mW average power consumption by adjusting each processor to workload-based optimal clock frequencies and dual supply voltages, a 38.4% power reduction compared to operation with only one clock frequency and supply voltage. In comparison to published implementations on the TI C642 DSP platform, the design has approximately 2.9 to 3.7 times higher scaled throughput, 11.2 to 15.0 times higher throughput per chip area, and 4.5 to 5.8 times lower energy per pixel. Compared to a heterogeneous SIMD architecture customized for H.264, the presented design has 2.8 to 3.6 times greater throughput, 4.5 to 5.9 times higher area efficiency, and similar energy efficiency. Next, this dissertation proposes novel processor shapes and interconnection topologies for many-core processor arrays, which result in an overall application processor that requires fewer cores and has a lower total communication length. The proposed topologies are compared to the commonly used 2D mesh and include two 8-neighbor topologies, two 5-nearest-neighbor topologies, and three 6-nearest-neighbor topologies, three of which utilize 5-sided or hexagonal processor tiles.
A 1080p H.264/AVC residual video encoder and a complete 54 Mbps 802.11a/11g wireless LAN baseband receiver are mapped onto all topologies and compared. The methodology to implement an array of hexagonal-shaped processor tiles with industry-standard CAD tools and an automatic place-and-route flow is described. A 16-bit DSP processor tile is tailored for all proposed topologies and implemented in 65 nm CMOS technology without full-custom layout. Results show that the 6-neighbor hexagonal tile and the 6-neighbor rectangular tile incur a 2.9% area increase per tile compared to the 4-neighbor 2D mesh, but their much more effective inter-processor interconnect yields an average total application area reduction of 21% and a total application inter-processor communication distance reduction of 19%. Motivated by the fact that video encoding tasks normally read and write a block of data in a single transaction, the third part of this dissertation proposes a novel source-synchronous bufferless shared memory to enable safe memory sharing among multiple processors in different clock domains. Compared with the previous FIFO-buffered memory design, the bufferless memory module achieves lower latency, higher throughput, lower area overhead, and lower power consumption. The bufferless memory module also supports direct communication with far-away processors through the existing processor-to-processor circuit-switched interconnection network. The implementation results show that a 16 KB bufferless memory module reduces single memory access latency by 58% and has 1% higher burst-mode throughput compared to the 16 KB buffered memory module. The bufferless memory module also reduces the area overhead from 63% to 17% compared with the buffered memory module, which yields a power reduction of 43%.
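The power figures quoted for the 25-processor encoder can be turned into two derived numbers with simple arithmetic. The sketch below is illustrative only: it assumes "1080p" means 1920 x 1080 pixels per frame, and the function names are hypothetical. The dissertation's own "energy per pixel" comparisons may use a different, technology-scaled definition.

```python
# Figures quoted in the synopsis.
POWER_SCALED_MW = 293.0   # average power with per-processor frequency/voltage scaling
POWER_REDUCTION = 0.384   # reduction relative to one clock frequency and supply voltage
FRAMES_PER_SECOND = 30

# Assumption: "1080p" is taken as 1920 x 1080 pixels per frame.
WIDTH, HEIGHT = 1920, 1080


def baseline_power_mw(scaled_mw: float, reduction: float) -> float:
    """Power before scaling, given the scaled power and the fractional reduction."""
    return scaled_mw / (1.0 - reduction)


def energy_per_pixel_nj(power_mw: float, width: int, height: int, fps: int) -> float:
    """Average energy spent per encoded pixel, in nanojoules."""
    pixels_per_second = width * height * fps
    joules_per_pixel = (power_mw * 1e-3) / pixels_per_second
    return joules_per_pixel * 1e9


baseline = baseline_power_mw(POWER_SCALED_MW, POWER_REDUCTION)
energy = energy_per_pixel_nj(POWER_SCALED_MW, WIDTH, HEIGHT, FRAMES_PER_SECOND)
print(f"implied single-voltage power: {baseline:.1f} mW")
print(f"energy per pixel at 1080p30: {energy:.2f} nJ")
```

Under these assumptions the implied single-voltage power is roughly 476 mW and the unscaled energy is about 4.7 nJ per pixel.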

Energy-efficient Computing with Fine-grained Many-core Systems

ISBN-10 : 1369615574
ISBN-13 : 9781369615579

Synopsis Energy-efficient Computing with Fine-grained Many-core Systems by : Bin Liu

For the past half century, Moore's Law has been the fundamental driver of high-performance computing. Continued CMOS technology scaling has doubled the transistor density of VLSI systems and has provided a predictable 40% performance improvement in single-core processors every 18 to 24 months. However, as Dennard scaling ends, the era of scaling frequency and performance without increasing power density is over. Since 2005, the semiconductor industry has shifted to multi-core and many-core processors in order to sustain the proportional scaling of performance along with transistor count increases. One of the critical challenges for many-core system design is to reduce the power dissipation and improve the energy efficiency of the chip. Researchers are eager to find innovative low-power architectures and techniques to relieve the "dark silicon" problem and effectively convert transistors to performance. To demonstrate that many-core processors with network-on-chip interconnects are a promising architecture for high-performance, energy-efficient computing, 16 Advanced Encryption Standard (AES) engines are proposed on a fine-grained many-core system by exploring different granularities of data-level and task-level parallelism. The smallest design utilizes only six cores for offline key expansion and eight cores for online key expansion, while the largest requires 107 cores and 137 cores, respectively. In comparison with published AES cipher implementations on general-purpose processors, the designs have 3.5 to 15.6 times higher throughput per unit of chip area and 8.2 to 18.1 times higher energy efficiency. Moreover, the design shows 2.0 times higher throughput than the TI DSP C6201, and 3.3 times higher throughput per unit of chip area and 2.9 times higher energy efficiency than the GeForce 8800 GTX.
Next, a scalable joint local and global dynamic voltage and frequency scaling (DVFS) scheme is proposed to further improve the energy efficiency of many-core systems by monitoring online workload variations. The local algorithm selects the voltage and frequency pair for each individual core based on its FIFO occupancy and stall information, while the global algorithm tunes the global voltage supplies based on the workload of all active processors. To demonstrate the effectiveness of the proposed solution, a suite of benchmarks is tested on a many-core globally asynchronous locally synchronous (GALS) platform. The experimental results show that the proposed approach can achieve near-optimal power saving under performance constraints. Different local algorithms are compared in terms of power saving, voltage switching frequency, and response delay to workload variation. The impact of the number of voltage supplies and the global voltage tuning resolution on the global algorithm is also investigated. To further improve energy efficiency beyond traditional DVFS, core scaling is proposed, which introduces an extra dimension beyond supply voltage and clock frequency scaling. This dissertation addresses the problem of minimizing the power dissipation of many-core systems under performance constraints by choosing an appropriate number of active cores and per-core voltage/frequency levels. A genetic-algorithm-based solution is proposed to solve the problem.
Experiments with real applications show that (1) dynamically scaling the number of active cores can improve energy efficiency by 5% to 42% compared with per-core DVFS for different performance requirements; (2) core scaling favors systems with more global voltage supplies and a high-performance, leaky process when the performance requirement is loose, while it favors systems with fewer global voltage supplies and a low-power, less-leaky process when the performance requirement is tight; (3) increasing the number of global voltage supplies or the leakage ratio can reduce the optimal core count by 22% and 50%, respectively.
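The synopsis says the local DVFS algorithm picks a voltage/frequency pair per core from its FIFO occupancy and stall information. The sketch below is one hypothetical way such a rule could look; it is not the dissertation's algorithm. The voltage/frequency table, the thresholds, and the names `CoreSample` and `next_level` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical (volts, MHz) pairs, ordered from slowest to fastest.
VF_LEVELS = [(0.7, 300), (0.9, 600), (1.1, 900), (1.3, 1200)]

# Hypothetical thresholds; the source gives no numeric values.
FIFO_HIGH = 0.75   # input FIFO filling up: the core is falling behind
FIFO_LOW = 0.25    # input FIFO draining: the core has slack
STALL_HIGH = 0.50  # core spends most cycles stalled waiting on neighbours


@dataclass
class CoreSample:
    fifo_occupancy: float  # 0.0 (empty) to 1.0 (full)
    stall_ratio: float     # fraction of cycles stalled, 0.0 to 1.0


def next_level(current: int, sample: CoreSample) -> int:
    """Return the index into VF_LEVELS that one core should use next."""
    top = len(VF_LEVELS) - 1
    # A nearly full input FIFO means this core is the bottleneck: speed up.
    if sample.fifo_occupancy > FIFO_HIGH and current < top:
        return current + 1
    # An almost empty FIFO or frequent stalls mean wasted cycles: slow down.
    if (sample.fifo_occupancy < FIFO_LOW or sample.stall_ratio > STALL_HIGH) and current > 0:
        return current - 1
    return current


level = 1
for sample in [CoreSample(0.9, 0.0), CoreSample(0.5, 0.1), CoreSample(0.1, 0.7)]:
    level = next_level(level, sample)
    volts, mhz = VF_LEVELS[level]
    print(f"level {level}: {volts} V, {mhz} MHz")
```

Moving one level at a time keeps voltage switching infrequent, which is one of the metrics the synopsis says the local algorithms are compared on. The global algorithm and core scaling are not modelled here.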

Energy Efficient Embedded Video Processing Systems

Publisher : Springer
Total Pages : 242
ISBN-10 : 331961455X
ISBN-13 : 9783319614557

Synopsis Energy Efficient Embedded Video Processing Systems by : Muhammad Usman Karim Khan

This book provides its readers with the means to implement energy-efficient video systems, by using different optimization approaches at multiple abstraction levels. The authors evaluate the complete video system with a motive to optimize its different software and hardware components in synergy, increase the throughput-per-watt, and address reliability issues. Subsequently, this book provides algorithmic and architectural enhancements, best practices and deployment models for new video systems, while considering new implementation paradigms of hardware accelerators, parallelism for heterogeneous multi- and many-core systems, and systems with long life-cycles. Particular emphasis is given to the current video encoding industry standard H.264/AVC, and one of the latest video encoders (High Efficiency Video Coding, HEVC).

ADACORE

OCLC : 940962821

Synopsis ADACORE by : Nithesh Kurella

Heterogeneous multicore processors offer an energy-efficient alternative to homogeneous multicores. Typically, heterogeneous multicore refers to a system with more than one core where all the cores use a single ISA but differ in one or more micro-architectural configurations. A carefully designed multicore system consists of cores with diverse power and performance profiles. During execution, an application is run on the core that offers the best trade-off between performance and energy efficiency. Since the resource needs of an application may vary with time, so does the optimal core choice. Moving a thread from one core to another involves transferring the entire processor state and warming up the cache. Frequent migration leads to large performance overhead, negating any benefits of migration. Infrequent migration, on the other hand, leads to missed opportunities. Thus, reducing the overhead of migration is integral to harnessing the benefits of heterogeneous multicores.
This work proposes AdaCore, a novel core architecture which pushes the heterogeneity exploited in the heterogeneous multicore into a single core. AdaCore primarily addresses the resource bottlenecks in workloads. The design attempts to adaptively match the resource demands by reconfiguring on-chip resources at a fine granularity. The adaptive core morphing allows core configurations with diverse power and performance profiles within a single core through adaptive voltage, frequency, and resource reconfiguration. To this end, the proposed architecture provides energy savings while improving performance through low-overhead in-core reconfiguration. This thesis further compares AdaCore with a standard out-of-order core capable of Dynamic Voltage and Frequency Scaling (DVFS), designed to achieve energy efficiency.
The results presented in this thesis indicate that the proposed scheme can improve the performance/Watt of applications, on average, by 32% over a static out-of-order core and by 14% over DVFS. The proposed scheme improves IPS^2/Watt by 38% over a static out-of-order core.

Computing Platforms for Software-Defined Radio

Publisher : Springer
Total Pages : 241
ISBN-10 : 3319496794
ISBN-13 : 9783319496795

Synopsis Computing Platforms for Software-Defined Radio by : Waqar Hussain

This book addresses Software-Defined Radio (SDR) baseband processing from the computer architecture point of view, providing a detailed exploration of different computing platforms by classifying different approaches, highlighting the common features related to SDR requirements, and showing the pros and cons of the proposed solutions. It covers architectures that exploit parallelism by extending the single-processor environment (such as VLIW, SIMD, and TTA approaches), multi-core platforms that distribute the computation to either a homogeneous array or a set of specialized heterogeneous processors, and architectures that exploit fine-grained, coarse-grained, or hybrid reconfigurability.

Design of Cost-Efficient Interconnect Processing Units

Publisher : CRC Press
Total Pages : 292
ISBN-10 : 1420044729
ISBN-13 : 9781420044720

Synopsis Design of Cost-Efficient Interconnect Processing Units by : Marcello Coppola

Streamlined Design Solutions Specifically for NoC

To solve critical network-on-chip (NoC) architecture and design problems related to structure, performance and modularity, engineers generally rely on guidance from the abundance of literature about better-understood system-level interconnection networks. However, on-chip networks present several distinct challenges that require novel and specialized solutions not found in the tried-and-true system-level techniques.

A Balanced Analysis of NoC Architecture

As the first detailed description of the commercial Spidergon STNoC architecture, Design of Cost-Efficient Interconnect Processing Units: Spidergon STNoC examines the highly regarded, cost-cutting technology that is set to replace well-known shared bus architectures, such as STBus, for demanding multiprocessor system-on-chip (SoC) applications. Employing a balanced, well-organized structure, simple teaching methods, numerous illustrations, and easy-to-understand examples, the authors explain: how the SoC and NoC technology works; why developers designed it the way they did; the system-level design methodology and tools used to configure the Spidergon STNoC architecture; and differences in cost structure between NoCs and system-level networks.

From professionals in computer sciences, electrical engineering, and other related fields, to semiconductor vendors and investors, all readers will appreciate the encyclopedic treatment of background NoC information ranging from CMPs to the basics of interconnection networks. The text introduces innovative system-level design methodology and tools for efficient design space exploration and topology selection. It also provides a wealth of key theoretical and practical MPSoC and NoC topics, such as technological deep sub-micron effects, homogeneous and heterogeneous processor architectures, multicore SoC, interconnect processing units, generic NoC components, and embeddings of common communication patterns.

SmartCell -- An Energy Efficient Reconfigurable Architecture for Stream Processing

Total Pages : 238
OCLC : 891342886

Synopsis SmartCell -- An Energy Efficient Reconfigurable Architecture for Stream Processing by : Cao Liang

Abstract: Data streaming applications, such as signal processing and multimedia applications, often require high computing capacity, yet also have stringent power constraints, especially in portable devices. General-purpose processors can no longer meet these requirements due to their sequential software execution. Although fixed-logic ASICs are usually able to achieve the best performance and energy efficiency, ASIC solutions are expensive to design and their lack of flexibility makes them unable to accommodate functional changes or new system requirements. Reconfigurable systems have long been proposed to bridge the gap between the flexibility of software processors and the performance of hardware circuits. Unfortunately, mainstream reconfigurable FPGA designs suffer from high costs in area, power consumption, and speed due to the routing area overhead and timing penalty of their bit-level fine granularity. In this dissertation, we present the architecture design, application mapping, and performance evaluation of a novel coarse-grained reconfigurable architecture, named SmartCell, for data streaming applications. The system tiles a large number of computing cell units in a 2D mesh structure, with four coarse-grained processing elements developed inside each cell to form a quad structure. Based on this structure, a hierarchical reconfigurable network is developed to provide flexible on-chip communication among computing resources, including a fully connected crossbar, nearest-neighbor connections, and a clustered mesh network. SmartCell can be configured to operate in various computing modes, including SIMD, MIMD, and systolic array styles, to fit different application requirements. The coarse-grained SmartCell has the potential to improve power and energy efficiency compared with fine-grained FPGAs. It is also able to provide high performance comparable to fixed-function ASICs through deep pipelining and a large amount of computing parallelism.
Dynamic reconfiguration is also addressed in this dissertation. To evaluate its performance, a set of benchmark applications has been successfully mapped onto the SmartCell system, ranging from signal processing and multimedia applications to scientific computing and data encryption. A 4 by 4 SmartCell prototype system was initially designed as a standard-cell CMOS ASIC in a 130 nm process. The chip occupies 8.2 mm square and dissipates 1.6 mW/MHz under full operation. The results show that SmartCell can bridge the performance and flexibility gap between logic-specific ASICs and reconfigurable FPGAs. SmartCell is also about 8% and 69% more energy efficient and achieves 4x and 2x throughput gains compared with the Montium and RaPiD CGRAs. Based on experience with the first SmartCell prototype, an improved SmartCell-II architecture was developed, which includes distributed data memory, a segmented instruction format, and improved dynamic configuration schemes. A novel parallel FFT algorithm with balanced workloads and optimized data flow was also proposed and successfully mapped onto SmartCell-II for performance evaluation. A 4 by 4 SmartCell-II prototype was then synthesized as a standard-cell ASIC in a 90 nm process. The results show that SmartCell-II consists of 2.0 million gates and is fully functional at up to 295 MHz with 3.1 mW/MHz power consumption. SmartCell-II is about 3.6 and 28.9 times more energy efficient than a Xilinx FPGA and TI's high-performance DSPs, respectively. It is concluded that SmartCell provides a promising solution for achieving high performance and energy efficiency in future data streaming applications.
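The SmartCell-II figures are given per MHz, so an absolute power number has to be derived. The short sketch below does that arithmetic, assuming power scales linearly with clock frequency up to the quoted maximum; the function name is hypothetical.

```python
# Figures quoted in the synopsis for the SmartCell-II prototype.
POWER_PER_MHZ_MW = 3.1   # mW per MHz
MAX_CLOCK_MHZ = 295      # highest clock at which the prototype is fully functional


def power_at_clock_mw(mw_per_mhz: float, clock_mhz: float) -> float:
    """Power in mW, assuming it scales linearly with clock frequency."""
    return mw_per_mhz * clock_mhz


peak_power_mw = power_at_clock_mw(POWER_PER_MHZ_MW, MAX_CLOCK_MHZ)
print(f"SmartCell-II at {MAX_CLOCK_MHZ} MHz: about {peak_power_mw:.1f} mW")
```

Under that assumption the prototype would draw a little over 0.9 W at its top clock. The same calculation cannot be done for the first prototype, because the synopsis gives its 1.6 mW/MHz figure without an operating frequency.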

Advances in Computer Systems Architecture

Publisher : Springer Science & Business Media
Total Pages : 850
ISBN-10 : 3540296433
ISBN-13 : 9783540296430

Synopsis Advances in Computer Systems Architecture by : Thambipillai Srikanthan

This book constitutes the refereed proceedings of the 10th Asia-Pacific Computer Systems Architecture Conference, ACSAC 2005, held in Singapore in October 2005. The 65 revised full papers presented were carefully reviewed and selected from 173 submissions. The papers are organized in topical sections on energy efficient and power aware techniques, methodologies and architectures for application-specific systems, processor architectures and microarchitectures, high-reliability and fault-tolerant architectures, compiler and OS for emerging architectures, data value predictions, reconfigurable computing systems and polymorphic architectures, interconnect networks and network interfaces, parallel architectures and computation models, hardware-software partitioning, verification, and testing of complex architectures, architectures for secured computing, simulation and performance evaluation, architectures for emerging technologies and applications, and memory systems hierarchy and management.