Scalable Parallel Programming Applied to H.264/AVC Decoding (SpringerBriefs in Computer Science)
Further questions about the tool and the software package can be directed to Karthik Chandrasekar. The main feature of the simulator is its core model, which is based on the interval core model, a fast mechanistic core model. The interval model raises the level of abstraction in architectural simulation. The project proposes an effective subset of existing standardized UML profiles for embedded systems modeling, SysML and MARTE, while avoiding the incompatibilities resulting from simultaneous usage of both profiles.
For this purpose, the MADES language subset is proposed along with a specific set of unique diagrams for expressing different aspects of a system. In the EU project DeSyRe, researchers tackle these issues with a new approach: building a reliable system on unreliable components. Partners include Recore Systems B.V. and Yogitech SpA. The remaining energy is used to power up the memory, interconnection network, and storage system. My advisors are Xavier Martorell and Alejandro Duran.
This was my first contact with a compiler infrastructure based on LLVM. LLVM was a new discovery for me, providing new possibilities for my PhD. I also conducted benchmarking activities on the new Mali architecture with the intention of gathering information about performance and memory hierarchy behavior. For this task, I had the privilege of early access to the architecture, which gave insight into new trends in this kind of device.
This internship was especially valuable for the professional relationships it created. The evaluation of the whole internship was excellent. In my opinion, ARM has all the features needed to lead an internship to success. Since I expect to finish my PhD before summer, I am fully committed to many tasks, including writing the thesis, performing final experiments, and attending conferences. This is when I decided to make a poster of the work done during my PhD, go to Paris, and present it in this special session.
Perhaps I could find a job that would meet my expectations. In the end, although I didn't find a suitable job, my experience in the poster session was very positive because, on the one hand, people were very interested in my work. Firstly, because they gave me a student grant.
Secondly, for the huge effort made to help people like me find a job by improving the traditional HiPEAC conference model. Jose L. During the internship, we developed a method that quantifies the slowdown that simultaneously running tasks may experience due to collisions in shared processor resources. We used the presented method to determine whether given multithreaded processors are good candidates for systems with timing requirements. Finally, we showed that the slowdown experienced by real tasks can be measured. This information can be used during the incremental verification of multithreaded COTS architectures.
The search is extended to GAPtimize. The impact of GAPtimize can be raised by selecting parameters adaptively. Hence, established methods for Worst-Case Execution Time analysis can be applied. Juan C. Pichel and Dr. Francisco F. Two different contexts have been considered.
Manolis Katevenis and Prof. The crossbar is 32 bits wide, runs at 2 GHz, and consumes 8 W. Koen L. Bertels and Prof. Henk J. Sips, Delft University of Technology, the Netherlands, January. The dissertation addresses the problem of runtime adaptation of the application. The work focuses on heterogeneous multicore architectures. It addresses three aspects of application optimization: hardware/software mapping, memory allocation, and parallel execution. In this master thesis, a generic optimization approach is developed using machine learning, with a focus on higher-performance embedded systems.
The optimization algorithm is presented, followed by a discussion of the evaluation results and an optional live demonstration. As the speed gap between CPU and external memory widens, memory latency has become the dominant performance bottleneck in modern applications.
Caches play an important role in reducing the average memory latency. Their performance is strongly influenced by the way data is accessed.
Numerous multimedia algorithms, which operate on data such as images and videos, perform processing over rectangular regions of pixels. If this distinctive data access pattern, as well as other data access patterns, is exploited properly, significant performance improvements can be achieved. In this paper, a new memory organization exploiting the 2D, strided, and sequential data access patterns exhibited by multimedia applications is proposed. It aims at reducing the memory access latency, lowering the number of memory accesses, and utilizing the bandwidth efficiently.
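As a toy illustration of why such 2D patterns matter, the following sketch (all parameters are illustrative, not from the paper) computes the linear addresses touched by a rectangular block access in a row-major image: each block row is a short contiguous run, and consecutive runs are separated by the full image stride, which is exactly the pattern a 2D-aware memory organization can exploit.

```python
# Sketch: addresses touched by a 2D block access in a row-major image.
# Parameters are hypothetical and only illustrate the access pattern.

def block_addresses(base, width, bx, by, bw, bh, elem=1):
    """Return the linear addresses of a bw x bh pixel block at (bx, by)."""
    return [base + ((by + r) * width + (bx + c)) * elem
            for r in range(bh) for c in range(bw)]

# A 4x4 block in a 16-pixel-wide image touches four short runs,
# one per row, separated by the image stride.
addrs = block_addresses(base=0, width=16, bx=2, by=1, bw=4, bh=4)
rows = [addrs[i * 4:(i + 1) * 4] for i in range(4)]
assert rows[0] == [18, 19, 20, 21]                 # one contiguous run
assert all(rows[i + 1][0] - rows[i][0] == 16       # stride between runs
           for i in range(3))
```

A conventional cache sees four disjoint short bursts here; an organization that understands the (stride, run-length) pair can fetch the whole block in one request.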
Recently, deep learning, a branch of machine learning, has become one of the most important topics in computer vision. A Convolutional Neural Network (CNN) is an efficient tool for deep learning that can automatically learn features from the input image to achieve a high recognition rate in classification problems.
The aim of this thesis is to investigate machine learning techniques, and specifically CNNs, for the problem of recognizing human faces. The main objectives are to find ways to speed up training and to improve the accuracy rate. The main challenge, however, is to make the network model adaptive: it should be able to learn new classes (the identity of a person), the way the human brain works, without affecting the recognition accuracy of the older classes. Transfer learning is a technique that can be used each time a change in the dataset occurs (in our case, a new person).
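A minimal sketch of this last-layer retraining idea, in plain Python: the feature extractor below is only a stand-in for the frozen pre-trained convolutional layers, and all names and data are illustrative; only the final classifier layer is trained.

```python
# Sketch: transfer learning as "freeze the features, retrain the last layer".
# features() is a hypothetical stand-in for the pre-trained CNN layers.
import math

def features(x):
    # Fixed (frozen) mapping; a real system would run the pre-trained CNN.
    return [x[0] + x[1], x[0] * x[1], 1.0]

def train_last_layer(samples, labels, classes, epochs=200, lr=0.5):
    """Softmax regression on frozen features: the only trained part."""
    w = [[0.0] * len(features(samples[0])) for _ in range(classes)]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = features(x)
            scores = [sum(wi * fi for wi, fi in zip(row, f)) for row in w]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            for c in range(classes):
                grad = exps[c] / z - (1.0 if c == y else 0.0)
                for i in range(len(f)):
                    w[c][i] -= lr * grad * f[i]
    return w

def predict(w, x):
    f = features(x)
    scores = [sum(wi * fi for wi, fi in zip(row, f)) for row in w]
    return scores.index(max(scores))

# Two toy "identities"; adding a third person would mean growing w by one
# row and re-running train_last_layer, leaving features() untouched.
xs = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1)]
ys = [0, 0, 1, 1]
w = train_last_layer(xs, ys, classes=2)
assert predict(w, (0.05, 0.0)) == 0
assert predict(w, (1.0, 0.95)) == 1
```

The point of the sketch is the division of labor: adding a class touches only the small softmax layer, which is why transfer learning is cheap compared to retraining the whole network.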
This technique is realized by taking the pre-trained CNN weights, replacing the last layer (the fully connected layer), and training it on top of the pre-trained convolutional layers. The Yale Extend dataset, combined with our own dataset, is used to train the CNN model. The dataset has varying illumination conditions and poses. In this presentation, we will explain the architecture of the CNN model, discuss the transfer learning process, and finally present a demo of this CNN model. Encoder and decoder implementations of the High Efficiency Video Coding (HEVC) standard have been subject to many optimization approaches since the standard's release. However, the real-time decoding of high-quality and ultra-high-resolution videos is still a very challenging task.
Unfortunately, it has not been adopted in the latest video coding standard, although it makes it possible to multiply the throughput in CABAC decoding. Experimental results show throughput improvements up to 5. The main challenge is to split the already available HEVC decoder code, which is implemented as a single multi-threaded process, into three processes to accommodate the Kalray MPPA architecture (host processor, IO processor, and cluster general-purpose processor). To realize this implementation, a simulated version of the functional split and communication model is implemented on a Linux-based x86 PC.
Furthermore, a port of the HEVC decoder to the MPPA platform is implemented, which uses one thread of one cluster, and the corresponding decoding speed is evaluated. A detailed analysis of the memory constraints and communication overhead of the proposed solution is presented, showing the bottlenecks in IO and cluster communication. Finally, as a conclusion, the maximum theoretical decoding rate, recommendations, and future work are presented.
These recommendations include increasing the utilization of the MPPA cluster resources and improving communication between the IO and cluster modules. In recent years, GPUs have shown huge improvements in their performance. Using high-level languages simplifies software development and allows device-independent development. With OpenCL, development is also vendor-independent. Writing highly optimized code, however, is very hardware- and device-dependent and cannot fully benefit from the abstraction layers.
Detailed knowledge about the inner workings of the GPU is vital, but finer architectural aspects are often not documented. We provide a framework of OpenCL microbenchmarks to uncover device-specific architectural details of GPUs from different hardware vendors. In addition, the SIMD width, branch divergence behaviour, and number of compute units are revealed for the tested devices, and more benchmarks are still in development. Precise real-time control and synchronization are important in systems with several coupled power electronic devices.
Precise control of a distributed system requires a high-efficiency fieldbus system. Real-time Ethernet-based fieldbus protocols have gained great significance in industrial automation. A new real-time Ethernet protocol is proposed for the point-to-point communication of power electronic drives.
The new protocol is to offer higher performance and reduced complexity compared to the existing hardware-assisted RTE protocols. It is to support a line topology of controllers, allowing them to communicate with each other and with the master with minimum latency, and to remain synchronized with each other. The thesis studies the existing real-time Ethernet protocols, analyzes their real-time capabilities, and proposes a new real-time Ethernet protocol based on these studies.
The newly proposed protocol is implemented, and its performance is analyzed and compared with that of the existing protocols. SIMD instructions have commonly been used to accelerate video codecs. The recently introduced HEVC codec, like its predecessors, is based on the hybrid video codec principle and is therefore also well suited to be accelerated with SIMD.
Vectorization is a key technique to increase application performance. In modern architectures, special hardware registers have been added for this purpose, executing the same calculation on multiple data chunks in parallel, typically exploiting a program's data-level parallelism. Ideally, code should be vectorized automatically by today's compilers, supplying an ideal mapping of the application to these special-purpose registers and instructions. The reality, though, is that compilers lag behind manual optimization with specialized instructions, so-called intrinsics.
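The transformation a vectorizer performs can be sketched in plain Python: the scalar loop body is replicated across a chunk of SIMD-width elements, with a scalar epilogue for the remainder. The width of 4 is an assumption for illustration, not a claim about any particular ISA.

```python
# Sketch of loop vectorization: process WIDTH elements per "vector" iteration,
# mirroring what one SIMD multiply-add instruction would do, then finish the
# leftover elements scalar. WIDTH = 4 is an illustrative assumption.
WIDTH = 4

def saxpy_scalar(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_vectorized(a, x, y):
    out = []
    i = 0
    # "Vector" loop: one iteration covers WIDTH elements at once.
    while i + WIDTH <= len(x):
        chunk_x, chunk_y = x[i:i + WIDTH], y[i:i + WIDTH]
        out.extend(a * xi + yi for xi, yi in zip(chunk_x, chunk_y))
        i += WIDTH
    # Scalar epilogue for the remainder (here, 10 mod 4 = 2 elements).
    while i < len(x):
        out.append(a * x[i] + y[i])
        i += 1
    return out

x = [float(i) for i in range(10)]
y = [1.0] * 10
assert saxpy_vectorized(2.0, x, y) == saxpy_scalar(2.0, x, y)
```

When a compiler cannot prove this rewrite safe (aliasing, unknown trip counts), it gives up, which is exactly where hand-written intrinsics still win.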
In my research, I investigated the state-of-the-art vectorization capabilities of popular compilers and identified opportunities to improve the overall vectorization rate. In this presentation, I will show these analysis results for the LLVM compiler and discuss how I was able to improve the compiler's two vectorization passes, yielding a significantly higher vectorization rate and hence improving the average speedup of the test patterns used. The recent introduction of the Vulkan API, along with SPIR-V, an intermediate language for the native representation of graphical shaders and compute kernels, promises cross-platform access, high efficiency, and better performance for graphics applications.
However, none of the currently available GPU simulators supports these new Khronos standards. The challenge is to find the best result in terms of quality per bit rate. A problem is the definition of quality in this context. Experiments have shown, however, that the human visual system (HVS) is not consistently sensitive to the input.
Thus, there are many approaches to allocating the bit rate according to the perceptual sensitivity of the input. To improve the perceptual performance, this thesis follows the concept of adapting the Lagrangian multiplier of each Coding Tree Unit (CTU) to its perceptual sensitivity. The actual implementation of a memory-demanding application like the HEVC decoder on many-core systems should be preceded by a performance evaluation of the memory subsystems. The goal is to determine the peak throughput and to establish whether the memory accesses limit the overall performance.
The processor integrates user cores and 32 system cores on a chip, each running at a frequency of up to MHz. Micro-benchmarks have been implemented to test the performance of the different communication paths: host memory to MPPA's off-chip memory, MPPA's off-chip memory to compute clusters' shared memories and between the compute clusters' shared memories. Multiple compute clusters are considered, each with up to 16 active cores. In addition, a trace-based benchmarking of the motion compensation and write back stages of the HEVC decoder is performed.
Since the two stages require a high memory bandwidth, it is important to predict whether an implementation would satisfy real-time requirements. The best results are obtained for the largest transfer sizes, with performance decreasing as the transfer size shrinks. For the HEVC-trace-based benchmarking, the best results of around 1. Consequently, the real-time requirement of 60 frames per second cannot be satisfied. The implemented benchmarks reveal that the memory accesses are very costly, especially for small transfer sizes. Therefore, a high number of such memory accesses might lead to performance bottlenecks.
When using FPGA-SoCs to solve complex problems heterogeneously, the developer has to tackle the issue of coherence in memory management. To avoid bottlenecks, this should happen without a significant increase in development cost or runtime cost. To access the memory from the FPGA, it is necessary to translate addresses in between. I also implement the necessary hardware and software infrastructure to use and evaluate the MMU core. In comparison to an alternative implementation without a separate MMU, the performance of the developed IP core is ca. At the same time, the performance is somewhat lower than that of a solution with dedicated memory and without virtual memory management.
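The translation such an MMU core performs can be sketched as a simple page-table lookup. The single-level table and 4 KiB pages below are illustrative assumptions, not the parameters of the actual IP core.

```python
# Minimal sketch of virtual-to-physical address translation (single-level
# page table, 4 KiB pages; sizes are illustrative, not those of the IP core).
PAGE_BITS = 12
PAGE_SIZE = 1 << PAGE_BITS

def translate(page_table, vaddr):
    """Map a virtual address to a physical one via a VPN -> PFN table."""
    vpn, offset = vaddr >> PAGE_BITS, vaddr & (PAGE_SIZE - 1)
    if vpn not in page_table:
        raise RuntimeError("page fault: VPN 0x%x not mapped" % vpn)
    return (page_table[vpn] << PAGE_BITS) | offset

page_table = {0x0: 0x80, 0x1: 0x42}          # VPN -> PFN
assert translate(page_table, 0x0123) == (0x80 << PAGE_BITS) | 0x123
assert translate(page_table, 0x1FFF) == (0x42 << PAGE_BITS) | 0xFFF
```

In hardware, the lookup is done per access (usually with a small TLB in front), which is why a separate MMU costs some performance relative to dedicated, untranslated memory.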
To the developer, the use of the MMU core is almost transparent. We address the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and decompression with low latency. Face recognition is an important task in many applications. In this work we investigated the use of Convolutional Neural Networks for face recognition.
The new approach we present in this research allows the network to learn new faces and update its structure accordingly. We also present a possible hardware architecture that can be used to implement this network. When a device gains new features, its firmware needs to be updated. But how does one update a device that needs to be highly available? The purpose of this thesis is to develop an update unit for a highly available operating device.
With this update unit the device can remain in its system and process the update data, while it keeps its full functionality. Special attention is paid to the discovery and handling of errors during the update process. First, requirements for the update process and variations of the update unit are developed.
These variations are discussed against the requirements of the update process. Based on these variations, the update process and the implementation of the update unit are described and evaluated. The Intel SPMD Program Compiler (ispc) is a research programming language for easily writing vectorized code while maintaining high performance. However, due to the different programming model used in this language and missing language features, the obtained performance can fall short of expectations. The language's type system and control flow constructs are analyzed, and a compiler extension is written to output additional information from the compiler's front-end.
With regard to performance, two applications are optimized and benchmarked. Various issues of ispc, like the missing support for templates and missing standard optimizations, are identified and workarounds discussed. Readers want it, funders want it, libraries want it: free and open access to scholarly output. Studies have shown that open access publications have a citation advantage. There are different ways to implement OA: the golden road, by publishing with an OA publisher, or the green road, by self-archiving papers that were published with a traditional, closed access publisher.
Self-archiving on a website is good; self-archiving on a repository is better! As the popularity of Augmented and Virtual Reality (VR) rises quickly in modern industry, a lot of attention has been put into the development of VR applications such as omnidirectional VR video. Several internet platforms such as YouTube have arisen where VR video content is distributed and which allow streaming of VR video.
However, this type of video content currently lacks quality and consumes a lot of bandwidth due to its high bitrate. VR video gives the consumer the freedom to navigate through an entire video scene, which is usually a panoramic or spherical scene. This poses the problem that the user's perspective view cannot itself be pre-encoded, as in the case of conventional video with predefined sequences of images. Therefore, the common approach is to compress the entire scene, involving the transformation and back-projection of video content, which causes the compression of redundant data as well as additional rendering overhead.
Furthermore, it introduces the requirement of particular video players for the playback of VR video. This thesis discusses the main challenges of VR video and investigates the performance bottlenecks, as well as solutions, with regard to the playback of VR video. The main bottleneck remains the decoding of the video stream, due to the high bitrate and the required resolution of VR video frames.
It will be shown that a significant gain in performance can be achieved by utilizing the HEVC codec over the leading H.264 codec. The potential use cases are endless: from self-driving cars to faster drug development, from automatic image captioning to smart real-time language translation, Deep Learning is providing exciting opportunities wherever machines interact with the human world.
To this day, one of the pillars behind this revolution has been the use of NVIDIA GPUs, enabling the first groundbreaking results as well as continuously driving performance up to support the advancement of the field. This presentation covers the procedure of implementing and testing a new instruction set using the Codasip Integrated Development Environment (IDE). Compiler optimizations rely on code features in order to perform advanced transformations.
Static code feature extraction has been used extensively, for instance, in the context of optimization frameworks based on machine learning. However, such approaches use approximate heuristics based on strong assumptions, which limits the accuracy of the modeling. This work proposes an automated feature extraction framework for OpenCL kernels based on cost relations.
By exploiting the information known to the OpenCL runtime, the proposed framework builds a set of cost relation features, calculating each feature as a polynomial of the input variables known at runtime. The method exploits a characteristic of OpenCL: it is based on the C99 standard and does not allow recursive function calls. The PHP programming language is commonly used for the server-side implementation of web applications, powering everything from personal blogs to the world's largest websites.
As such, its performance is often critical to the response time, throughput, and resource utilization of these applications. The aim of this thesis is to reduce runtime interpreter overhead by applying classical data-flow optimizations to the PHP bytecode in static single assignment form. Type inference is used to enable the use of type-specialized instructions.
Other optimizations include flow-sensitive constant propagation, dead code elimination, and copy propagation. Additionally, inlining is used to increase the applicability of other optimizations. The main challenge is to reconcile classical compiler optimizations, which have been developed in the context of statically typed and compiled languages, with a programming language that is not only dynamically and weakly typed, but also supports a plethora of other dynamic language features.
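To make the flavor of two of these passes concrete, here is a toy sketch of constant propagation with folding, followed by dead code elimination, on a made-up three-address IR (deliberately much simpler than PHP's actual bytecode):

```python
# Toy IR: (dst, op, a, b). Illustrative only; real PHP bytecode is far richer.

def const_prop(code):
    """Fold operations whose operands are compile-time constants."""
    env, out = {}, []
    for dst, op, a, b in code:
        a = env.get(a, a)                  # substitute known constants
        b = env.get(b, b)
        if op == 'const':
            env[dst] = a
        elif op == 'add' and isinstance(a, int) and isinstance(b, int):
            env[dst] = a + b               # fold: both operands known
        else:
            out.append((dst, op, a, b))    # keep, with operands substituted
            env.pop(dst, None)
    return out, env

def dead_code_elim(code, live):
    """Drop instructions whose result is never used (no side effects assumed)."""
    out = []
    for dst, op, a, b in reversed(code):
        if dst in live:
            out.append((dst, op, a, b))
            live |= {v for v in (a, b) if isinstance(v, str)}
    return list(reversed(out))

code = [('x', 'const', 1, None),
        ('y', 'const', 2, None),
        ('z', 'add', 'x', 'y'),    # folds to 3
        ('t', 'add', 'z', 'q'),    # q unknown at compile time -> kept
        ('u', 'add', 'q', 'q')]    # result unused -> removed by DCE
remaining, env = const_prop(code)
assert env['z'] == 3
remaining = dead_code_elim(remaining, live={'t'})
assert [i[0] for i in remaining] == ['t']
```

The dynamic-typing challenge the text mentions shows up precisely in the `isinstance` guard: in PHP, proving that both operands are integers (and that `+` has no observable side effects) is the hard part, which is why type inference must run first.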
In the past, maintenance was predominantly based on schedule and, more recently, on mileage.
These approaches lead to excessive replacements and do not reduce the failure rate significantly. With condition-based maintenance, it is possible to achieve a better schedule and replace parts only when their performance starts to drop. In this thesis, the state-of-the-art approach to analysing single-throw mechanical equipment is studied and applied.
The angular displacement of the bus door leaves is used to train two logistic regression classifiers that determine whether the door opening and closing movements belong to an operational or a broken door leaf. The models are trained and evaluated on data obtained from a test bus fitted with a broken door pillar bracket. The results of this thesis are being incorporated into the next-generation bus body electronic system at Scania.
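A minimal sketch of such a classifier, in plain Python: a logistic regression over a single summary feature of the door movement (here, peak angular displacement, scaled to [0, 1]). The data, the choice of feature, and all constants are illustrative, not Scania's.

```python
# Sketch: logistic regression on one door-movement feature.
# Toy premise (illustrative): a broken leaf reaches a smaller peak displacement.
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(xs, ys, epochs=2000, lr=1.0):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x      # gradient of the log loss
            b -= lr * (p - y)
    return w, b

def classify(w, b, x):
    return 1 if sigmoid(w * x + b) > 0.5 else 0   # 1 = broken leaf

disp   = [0.88, 0.90, 0.91, 0.60, 0.62, 0.58]     # scaled peak displacement
broken = [0,    0,    0,    1,    1,    1]
w, b = train_logreg(disp, broken)
assert classify(w, b, 0.89) == 0    # operational-range movement
assert classify(w, b, 0.61) == 1    # broken-range movement
```

The thesis uses two such classifiers, one for the opening and one for the closing movement; the sketch shows only the shared training mechanics.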
Free-space optical communication technology has great potential for high-speed point-to-point applications. The possibility of high data rates, license-free use, and small terminal size makes it a promising alternative to radio frequency and fiber-optic communication.
However, atmospheric perturbations, building motion, and pointing misalignment can cause Angle-of-Arrival (AoA) changes, which reduce the link availability. The purpose of this thesis is to design and implement a real-time system that measures the AoA changes and performs high-speed steering, enabling direct coupling of an optical beam into a Single Mode Fiber (SMF). The hardware-software solution was tested in laboratory conditions as well as over a 1 km link, yielding good results.
At the end of this thesis, an actual tracking error suppression by a factor of more than 10 has been achieved. Rasmus is the founder of Merantix, a company in the space of artificial intelligence. Earlier this year he launched howhot. Rasmus will present the research behind howhot. Recent years have shown huge improvements in GPU performance.
Using high-level languages simplifies software development and renders it intra- or even inter-vendor device independent. Writing highly optimized code, however, is very hardware- and thus device-dependent and cannot benefit from the abstraction layers. Detailed knowledge about the inner workings of the GPU is vital, but finer architectural aspects are often not documented. In addition, the SIMD width, branch divergence behaviour, and number of compute units are revealed for the tested devices, and more benchmarks are still in development.
Precise real-time control and synchronisation are important in systems with several coupled power electronic devices. For precise control of the distributed system, we need a high-efficiency fieldbus system. A new Ethernet-based industrial network is proposed for the point-to-point communication of the drives. The new system aims at reducing the complexity of the current system and is expected to remove the main control unit. It is also expected to handle a daisy-chained topology of controllers, allowing them to communicate with each other and with the master with minimum latency and to be synchronized with each other.
In this thesis, a new real-time Ethernet-based industrial network technology will be proposed. The thesis will classify the existing technologies and compare them. The newly proposed technology will be compared to the existing systems, and the performance of the system will be analyzed. The thesis will also involve the design of the slave controller for the new technology. Stencil computations expose a large and complex space of possible equivalent implementations.
These computations often rely on autotuning techniques, based on iterative compilation or machine learning (ML), to achieve high performance. Iterative compilation autotuning is a challenging and time-consuming task, which may be unaffordable in many scenarios. Meanwhile, traditional machine learning autotuning approaches exploiting classification algorithms like neural networks and support vector machines face difficulties in capturing all features of large search spaces. This paper proposes a new way of automatically tuning stencil computations based on structural learning.
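For contrast, the iterative-compilation baseline mentioned above can be sketched in a few lines: enumerate candidate configurations, measure each, and keep the best. The cost function below is a made-up stand-in for a real timing measurement, and the parameter names are illustrative.

```python
# Sketch of exhaustive iterative-compilation autotuning. run_time() is a
# hypothetical cost model standing in for compiling and timing each variant.

def run_time(tile, unroll):
    # Made-up cost surface that happens to favor 32-wide tiles, unroll 4.
    return abs(tile - 32) * 0.1 + abs(unroll - 4) * 0.2 + 1.0

def autotune(tiles, unrolls):
    """Measure every (tile, unroll) pair and return the fastest one."""
    best = min((run_time(t, u), t, u) for t in tiles for u in unrolls)
    return best[1], best[2]

assert autotune([8, 16, 32, 64], [1, 2, 4, 8]) == (32, 4)
```

The cost of this loop is one full compile-and-run per point in the search space, which is exactly what makes it unaffordable for large stencil spaces and motivates learned approaches.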
By organizing the training data as a set of partially-sorted samples, our approach can be coupled with an iterative compilation method or used as a standalone autotuner. We demonstrate its potential by comparing it with state-of-the-art iterative compilation methods on a set of nine stencil codes. The High Efficiency Video Coding (HEVC) standard provides state-of-the-art compression efficiency at the cost of increased computational complexity, which makes real-time decoding a challenge for high-definition, high-bitrate video sequences.
Then a couple of solutions are discussed to improve the performance scalability. To improve the parallelization of video standards, the AVC standard has undergone some improvements. As its name implies, it can deal with more frames in flight than its predecessors. In this thesis, a strategy will be presented to improve the scheduler. Distributing the available threads among the running processes is the key to making the decoding process more efficient and faster.
This thesis will present a frame-size-based model the scheduler can use to decide how many threads should work on decoding a frame.
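The shape of such a model can be sketched as follows: predicted decode time grows with frame size, and the thread count is the smallest that meets the per-frame deadline. All constants (deadline, per-KB cost, parallel efficiency) are illustrative assumptions, not values from the thesis.

```python
# Sketch of a frame-size-based thread-allocation model. Constants are
# hypothetical; a real scheduler would calibrate them from measurements.
DEADLINE_MS = 16.7          # per-frame budget at 60 fps
MS_PER_KB   = 0.9           # assumed single-thread decode cost per KB
EFFICIENCY  = 0.8           # assumed parallel efficiency per extra thread

def predict_decode_ms(frame_kb, threads):
    single = frame_kb * MS_PER_KB
    return single / (1 + EFFICIENCY * (threads - 1))

def threads_for_frame(frame_kb, max_threads=8):
    """Smallest thread count whose predicted decode time meets the deadline."""
    for t in range(1, max_threads + 1):
        if predict_decode_ms(frame_kb, t) <= DEADLINE_MS:
            return t
    return max_threads          # saturate: deadline may be missed

assert threads_for_frame(10) == 1       # small frame: one thread suffices
assert threads_for_frame(60) > 1        # large frame needs more threads
```

The sub-linear speedup term models the diminishing returns of adding threads, so the scheduler does not waste cores on frames that a single thread can already decode in time.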
With this model, the scheduler is able to predict the amount of time the decoding process would need to decode a frame. The predicted time can then be used as a basis for deciding how many threads should work on a frame. Most common fan controllers are integrated into the mainboard of a PC.
They have limited configuration options, and the speed control is based only on CPU temperatures or measurements on the mainboard. In this presentation, a new fan controller concept that also uses GPU temperatures is shown. The system consists of hardware and software components.
A graphical user interface offers flexible configuration options. The current progress is presented together with promising simulation results using both test and real data. The Kalray MPPA manycore processor is a general-purpose processor that integrates user cores and 32 system cores on a chip. While the MPPA promises high computational performance, for many memory-demanding applications like the HEVC decoder, the memory bandwidth can limit the overall performance.
A performance evaluation of the MPPA memory subsystem is therefore necessary to predict the upper-bound performance of an actual implementation. We have thus implemented micro-benchmarks for the evaluation of the memory accesses. The tests include communication between different numbers of clusters and between the clusters and the main memory, for multiple parallel transfers, thread counts, and data sizes.
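The idea behind such micro-benchmarks can be sketched on any host: time repeated block copies of varying sizes and derive the effective bandwidth, which makes the fixed per-transfer overhead visible for small sizes. This sketch measures the machine it runs on, not the MPPA; it only illustrates the method.

```python
# Sketch of a copy-bandwidth micro-benchmark (host-side illustration only).
import time

def copy_bandwidth(size_bytes, iters=50):
    """Average bytes/second achieved by repeated block copies of a buffer."""
    src = bytearray(size_bytes)
    total = 0.0
    for _ in range(iters):
        t0 = time.perf_counter()
        dst = bytes(src)                    # one block copy
        total += time.perf_counter() - t0
    return size_bytes * iters / total

# Sweep transfer sizes, as the MPPA benchmarks do across communication paths.
for size in (4 << 10, 256 << 10, 4 << 20):
    bw = copy_bandwidth(size)
    print("%8d KiB: %.1f MB/s" % (size >> 10, bw / 1e6))
```

On real DMA-based paths, the per-transfer setup cost dominates small transfers, which is why the reported bandwidth typically falls as the transfer size shrinks.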
The next step is to perform trace-based HEVC benchmarking. The motion compensation stage of the HEVC decoder requires a high memory bandwidth, and it is thus of great importance to test whether the memory accesses represent a bottleneck. Stereo vision is a method of extracting 3D information about a scene using images taken from different viewpoints. There are many approaches for disparity map (DM) creation in a stereo vision system.
Stereo vision is one of the applications well suited to implementation on an embedded GPU. In this master thesis, the effect of parameter variations in a stereo vision algorithm is studied. A software implementation of the algorithm is carried out, and several optimizations are applied to improve its execution time. The optimizations yielded a 31x to 35x improvement in execution time, depending on the platform. Observations on the eGPU architecture are discussed, and future work in the field is suggested. Multi-core processors are becoming dominant in the computer industry.
The number of cores inside a chip is expected to keep increasing in the next few years. The problem of programmability has emerged: how can multi-core processors be utilized efficiently? Task-based programming models such as OmpSs are promising for solving this problem.
But the runtime systems within these programming models can be a bottleneck that limits performance. This loosely coupled setup could be a limiting factor for the system performance. Several new components were developed, including a Linux device driver, communication protocols, NexusIO, etc. Because of hardware limitations, we could only obtain test results for at most two cores on the ZC development board. Evaluation results indicate that, due to memory contention, the runtime system VSs with hardware acceleration has performance similar to the pure-software VSs runtime system.
These devices combine a powerful embedded processor with programmable logic similar to that found in FPGAs.
Due to the hard-core processor, the overall performance is higher than that of a system using a soft-core processor in an FPGA. While high throughput inside an FPGA can easily be achieved for many applications, these ports, and therefore the memory bandwidth, limit the overall performance. This is especially troublesome as it is often challenging to define the corresponding requirements in advance. Furthermore, the actually achievable memory bandwidth depends on many parameters. In this presentation, I will present a workflow that allows one to estimate the required memory bandwidth by analyzing the memory trace of an equivalent pure-software solution.
The flow also includes mechanisms to simulate the behavior of a hardware implementation by mimicking its memory accesses and measuring the achieved bandwidth. As an initial result, we will present a profile of the HEVC decoder to identify the potential total improvement and which functions could benefit the most.
Due to its sequential algorithm, CABAC is one of the most critical throughput bottlenecks in video decoding, especially for high-bitrate videos. Unlike other components of the HEVC decoder, there is no data-level parallelism that can be exploited. However, the implementation affects the coding efficiency and requires a duplication of the corresponding decoding hardware. In this work, we propose a tiny modification of the HEVC bitstream format that enables the parallel decoding of multiple bitstream sections with negligible impact on coding efficiency.
Furthermore, the hardware cost is not expected to grow linearly with the number of parallel bitstream partition decoders, as only parts of the decoder need to be duplicated. Simulation results show potential speed-ups up to 4. Even higher speed-ups can be expected due to the potential clustering and better customization of the decoding hardware. All in all, the proposed bitstream format modification is a promising step towards very high-throughput CABAC decoding, which might be adopted in future video coding standards.
In this talk I will describe the design choices and the implementation of a new and improved power measurement testbed for the LPGPU2 project. The old testbed was a prototype that was not suitable for manufacturing. It also used a commercial DAQ with buggy closed-source drivers incompatible with many versions of Linux, and it could not sample voltage and current at exactly the same time.
An ARM platform was chosen with size, weight, and power in mind. Because the software is safety-critical, determinism in terms of worst-case execution time is a crucial factor. However, the analysis concluded that the worst-case execution time on ARM processors is not within the acceptable range. This thesis studies the bottlenecks imposed by the application, the data dependencies between the processing chains, and possible improvements. The implementation focuses on the parallelism in the radar application and schedules the processing chains optimally to achieve the best results. As part of the investigation, peak processor utilization, peak memory utilization, peak bandwidth utilization, worst-case execution time, bottlenecks, and future scope are discussed in detail.
An analysis is presented based on the new implementation schemes. PHP is a dynamically typed programming language commonly used for the implementation of web applications; as such, its performance is often critical to the response time and throughput of those applications. This thesis aims to reduce runtime overhead by improving the quality of the generated PHP bytecode, using data-flow optimizations that act on static single assignment (SSA) form annotated with inferred value-range and type information.
These optimizations include flow-sensitive constant propagation, dead code elimination, elimination of redundant computations through global value numbering and type specialization, as well as some optimizations specific to the PHP virtual machine. Inlining is used to increase the applicability of other optimizations. A primary challenge is to reconcile these optimizations with PHP's highly dynamic nature, which makes it hard to statically prove preconditions necessary for optimization, especially when considering real application code rather than artificial benchmarks.
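As an illustration of the kind of transformation involved, the following is a minimal, hypothetical sketch of constant propagation and folding over SSA-style triples. It is not the PHP engine's actual intermediate representation:

```python
# Hypothetical sketch of constant propagation and folding on SSA-style
# triples (dest, op, operands); operands are SSA names or integer
# constants. Because SSA guarantees a single assignment per name, a
# value recorded as constant can be substituted everywhere it is used.

def const_propagate(instrs):
    consts = {}   # SSA name -> known constant value
    out = []      # surviving instructions, operands rewritten
    for dest, op, args in instrs:
        vals = [consts.get(a, a) for a in args]  # substitute known values
        if op == "const":
            consts[dest] = vals[0]               # record and drop
        elif op == "add" and all(isinstance(v, int) for v in vals):
            consts[dest] = vals[0] + vals[1]     # fold at compile time
        else:
            out.append((dest, op, vals))
    return out, consts

instrs = [
    ("a1", "const", [2]),
    ("b1", "const", [3]),
    ("c1", "add", ["a1", "b1"]),   # folded away to 5
    ("d1", "echo", ["c1"]),        # survives, uses the folded constant
]
remaining, consts = const_propagate(instrs)
```

The flow-sensitive analysis described in the thesis is stronger than this sketch: it tracks per-branch value ranges and types rather than a single global constant table.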
A library to support this interoperability is presented, together with benchmark results. It aims for high efficiency and performance while keeping the simplicity of scalar code. In this thesis, a new compiler mode, "ispc explained", is written to gain better insight into the compiler as well as into the actual generated code. This makes it possible to find problems and difficulties both in the compiler itself and in the code of vectorized programs.
SIMD extensions were added to microprocessors in the mid-1990s to speed up data-parallel code by vectorization. Unfortunately, the SIMD programming model has barely evolved, and the most efficient utilization is still obtained with elaborate intrinsics coding. The proposals were assessed by implementing two kernels: a standard floating-point benchmark and a real-world integer-based application, both highly data-parallel.
In this presentation, we will discuss the different programming approaches and highlight the achieved performance gains, but also share the models' current drawbacks. Processor speed has increased much faster than memory speed. Therefore, the performance improvements expected from increasing processor speed are limited by the memory latency. Most algorithms that work on image data exhibit a distinctive data access pattern.
This data access pattern is called two-dimensional (2D) access. If this special access pattern is exploited properly, it can yield a significant improvement in application performance. Standard cache and memory organizations do not reflect the way data are accessed by image processing applications; as a result, they achieve poor performance for such applications. In this work, we propose a memory organization that exploits the 2D data access pattern of image processing applications.
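One common way to exploit 2D locality is a tiled (block-linear) address mapping, sketched generically below. The square tile size and the mapping are illustrative assumptions, not the specific organization proposed in the work:

```python
# Generic sketch of a tiled (block-linear) address mapping: the pixels
# of one T x T tile occupy a contiguous block of memory, so a 2D
# neighbourhood touches few distinct memory blocks. Assumes the image
# width is divisible by T; purely an illustration.

def tiled_offset(x, y, width, T):
    """Map pixel (x, y) of a `width`-wide image to a tiled linear offset."""
    tiles_per_row = width // T
    tile_index = (y // T) * tiles_per_row + (x // T)
    return tile_index * T * T + (y % T) * T + (x % T)

# All 16 pixels of the first 4x4 tile land in one contiguous block;
# the next tile starts at offset 16.
offs = [tiled_offset(x, y, 16, 4) for y in range(4) for x in range(4)]
```

With a row-major layout, the same 4x4 neighbourhood would be spread across four cache lines 16 pixels apart; the tiled mapping keeps it in one block, which is why such organizations suit the 2D access pattern of image kernels.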
The trajectory data is uploaded to a server. This data is then fetched and projected on the trajectory projection system. A comparison between trajectories obtained without filtering and trajectories estimated with the filter is also presented. In this chapter we focus on the next step: finding or discovering the parallelism in the application. We qualitatively compare different approaches to parallelize H.264 decoding. In the previous chapter we analyzed various parallelization approaches for H.264 decoding.
The next question is how to exploit this parallelism efficiently. To answer it, in this chapter we present two implementations of the 2D-Wave approach. The first implementation maintains a centralized pool of macroblocks that are ready to be decoded, and cores retrieve tasks from this Task Pool. In the second approach, called Ring-Line, full lines of macroblocks are statically assigned to cores, and the cores synchronize and communicate point-to-point. Both approaches have been implemented and are evaluated on a dual-chip Cell BE system with 18 cores in total. If higher performance is required, a parallel application developer might have to extract more parallelism than initially employed in the application.
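The dependency rule driving both implementations can be sketched as follows. This is a simplified serial model of the 2D-Wave Task Pool, with the concurrently popping cores abstracted into a single loop:

```python
# Simplified serial model of the 2D-Wave dependency rule: a macroblock
# (MB) can be decoded once its left, top, and top-right neighbours are
# done. In the real Task Pool implementation, cores pop ready MBs
# concurrently; here one loop stands in for all cores.
from collections import deque

def deps(x, y, w):
    """MBs that must be decoded before (x, y)."""
    d = []
    if x > 0:
        d.append((x - 1, y))             # left
    if y > 0:
        d.append((x, y - 1))             # top
        if x + 1 < w:
            d.append((x + 1, y - 1))     # top-right
    return d

def wave_order(w, h):
    decoded, order = set(), []
    ready, queued = deque([(0, 0)]), {(0, 0)}  # top-left has no deps
    while ready:
        x, y = ready.popleft()           # a "core" retrieves a task
        order.append((x, y))
        decoded.add((x, y))
        # decoding (x, y) may release its right, lower-left, lower MBs
        for nx, ny in [(x + 1, y), (x - 1, y + 1), (x, y + 1)]:
            if 0 <= nx < w and ny < h and (nx, ny) not in queued \
                    and all(d in decoded for d in deps(nx, ny, w)):
                ready.append((nx, ny))
                queued.add((nx, ny))
    return order
```

Ring-Line replaces the shared queue with a static assignment of MB lines to cores, so only point-to-point synchronization between neighbouring lines remains.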
To illustrate this step, this chapter presents a parallel implementation of H.264 decoding. The application implements the dynamic 3D-Wave algorithm, which exploits intra-frame MB-level parallelism as well as inter-frame MB-level parallelism. The 3D-Wave algorithm is based on the observation that inter-frame dependencies have a limited spatial range.
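As a rough illustration of that observation, a 3D-Wave-style readiness test might look like the following. The MB-granularity motion-vector bound is an assumed, illustrative parameter:

```python
# Rough sketch of a 3D-Wave-style readiness test: an MB of frame f may
# start once everything it can reference in frame f-1 is decoded. The
# motion-vector bound `max_mv_mbs` (in MB units) is assumed, not taken
# from the book's actual implementation.

def ref_area_done(x, y, prev_decoded, w, h, max_mv_mbs=1):
    """True if the clamped reference window around (x, y) is decoded."""
    for dy in range(-max_mv_mbs, max_mv_mbs + 1):
        for dx in range(-max_mv_mbs, max_mv_mbs + 1):
            rx = min(max(x + dx, 0), w - 1)   # clamp to frame borders
            ry = min(max(y + dy, 0), h - 1)
            if (rx, ry) not in prev_decoded:
                return False
    return True

# With only the top-left 2x2 MBs of the previous frame decoded,
# MB (0, 0) of the next frame may start; MB (2, 2) may not yet.
prev = {(0, 0), (1, 0), (0, 1), (1, 1)}
```

Because the reference window is small, decoding of a frame can begin long before the previous frame finishes, which is what stacks the intra-frame wavefronts of consecutive frames into the "3D" wave.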
Experimental results obtained using a simulator of a many-core architecture containing NXP TriMedia embedded processors show that the implementation scales very well, achieving a speedup of more than 50 for an FHD sequence. In the previous chapters we mainly focused on what is generally the most time-consuming phase of H.264 decoding. There is another phase, however, the entropy decoding phase, that also takes a significant amount of time. In order to parallelize it, dependencies that result from reusing sequential legacy code need to be eliminated.
In the previous chapters we have presented efficient and scalable parallelization strategies for the different stages of H.264 decoding.