
Welcome to the PERFORM web site

Portability and Performance in Heterogeneous Many Core Systems

Below is a summary of this project. Detailed information about the project team, the project description and publications can be found on the respective tabs.

Summary

The multi-core paradigm has shifted the burden of performance efficiency from hardware to software. Application performance will not automatically scale with the increasing number of cores; it will depend on the ability of the programmer, compiler and run-time system to exploit the available resources. This problem is aggravated by two major trends: the number of cores will keep increasing over time, and these cores will be heterogeneous in functionality. Current computing systems already combine multiple heterogeneous resources, such as multi-core CPUs, many-core GPUs and additional coprocessors (e.g., the Cell or ClearSpeed). Efficient and portable programming for such parallel, heterogeneous systems is a major research challenge for the next decade. The computing paradigms of CPUs and specialized coprocessors are very different (scalar vs. vector operations, shared address space vs. complex memory hierarchies). Development environments are also platform-specific, locking applications into particular platforms. The mapping of basic operations onto processors is fixed at design time. It cannot be changed without rewriting the code, regardless of the number and nature of the devices available on each particular computing system.

Besides code portability, performance portability is also a major issue. A key challenge is to develop software that can adapt to different system configurations and workload requirements while still maintaining high efficiency.

Our goal is to understand how applications can efficiently exploit multiple heterogeneous cores scattered over a distributed system (e.g., a cluster). We also want to understand how they can benefit from run-time adaptation to the number and characteristics of the available devices. The application's basic operations are implemented as platform-independent kernels, which are dynamically parameterized and assigned to the processors. This generates a dynamic schedule that efficiently matches the workload requirements to the devices' capabilities.
Scheduling requires a performance model and scheduling rules. The former dynamically characterizes the devices' capabilities, the kernels' computing requirements and the workload (or dataset) characteristics, while the latter are responsible for all scheduling decisions. OpenCL provides a platform-independent parallel programming environment for writing efficient and portable applications for heterogeneous systems, including CPUs, GPUs and other coprocessors. Its programming language allows the implementation of portable computing kernels, while its API allows the discovery of available devices and the parameterization and mapping of kernels onto these devices. Code portability is enabled by the platform-independent language. Performance portability is achieved through dynamic scheduling of kernels and data onto processors, according to the system state reported by the performance model.
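
The split between a performance model and scheduling rules can be illustrated with a minimal sketch. It is written in plain Python rather than against the OpenCL API, and it assumes that the model measures throughput (work items per second) for each device and kernel pair. The scheduling rule shown is one simple choice: divide a workload among devices in proportion to their estimated throughput. The names `PerformanceModel`, `record`, `estimate` and `partition`, the device labels and the smoothing factor are all hypothetical, not part of the project or of OpenCL.

```python
# Hypothetical sketch: a throughput-based performance model plus a
# proportional scheduling rule. All names are illustrative assumptions.


class PerformanceModel:
    """Tracks observed throughput (work items per second) per (device, kernel)."""

    def __init__(self, smoothing=0.5):
        self.smoothing = smoothing  # weight given to the newest measurement
        self.throughput = {}

    def record(self, device, kernel, items, seconds):
        """Update the model after a kernel run completes on a device."""
        if items <= 0 or seconds <= 0:
            raise ValueError("items and seconds must be positive")
        sample = items / seconds
        old = self.throughput.get((device, kernel))
        # Exponential moving average: follows changes in system state
        # while damping noise from individual runs.
        if old is None:
            self.throughput[(device, kernel)] = sample
        else:
            self.throughput[(device, kernel)] = (
                self.smoothing * sample + (1 - self.smoothing) * old
            )

    def estimate(self, device, kernel, default=1.0):
        """Estimated throughput; `default` stands in for never-measured pairs."""
        return self.throughput.get((device, kernel), default)


def partition(model, devices, kernel, total_items):
    """Scheduling rule: split work in proportion to estimated throughput."""
    rates = [model.estimate(d, kernel) for d in devices]
    total_rate = sum(rates)
    shares = [int(total_items * r / total_rate) for r in rates]
    # Give the rounding remainder to the fastest device so no item is dropped.
    fastest = max(range(len(devices)), key=lambda i: rates[i])
    shares[fastest] += total_items - sum(shares)
    return dict(zip(devices, shares))


model = PerformanceModel()
model.record("cpu", "shade", items=1000, seconds=1.0)
model.record("gpu", "shade", items=9000, seconds=1.0)
print(partition(model, ["cpu", "gpu"], "shade", 10000))  # {'cpu': 1000, 'gpu': 9000}
```

Keeping the two parts separate means the scheduling rule can be swapped for another one without touching the model. Examples include work stealing or a rule that also accounts for data-transfer costs. A real system would probably probe never-measured devices first instead of relying on a fixed default rate.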

 

The feasibility of the proposed concept will be demonstrated by developing, in OpenCL, a customizable and efficient interactive global illumination (IGI) engine. IGI, currently based mostly on ray tracing (RT), has high computational costs, particularly for dynamic scenes. It has been shown that IGI is achievable through extensive optimization of the code and the sampling distributions, and by exploiting both memory locality and parallelism at multiple levels.

RT entails looping over a set of basic operations: sorting, sampling, space traversal, ray intersection, shading and integration. These operations exhibit different data dependencies and access patterns, diverse computational requirements and varying degrees of parallelism. Some require complex control flow, while others apply the same kernel to many data items and are therefore easier to vectorize. RT is thus well suited to exploiting a combination of the different computing paradigms available on heterogeneous many-core systems composed of CPUs, GPUs and additional coprocessors. The interface between the IGI engine and the user application will rely on OpenRT, with extensions to OpenRTS to support deferred shading. IGI still requires exploiting multiple computers distributed over a cluster. The OpenCL specification will therefore be extended to support clusters by specifying a distributed memory model, through extensions to the OpenCL API and a cluster-aware run-time system.

This project, building on the team's experience in interactive rendering and parallel processing, will contribute to advancing knowledge in scheduling, heterogeneous parallel computing and IGI. Contributions include: i) a better understanding of the addressed computing paradigms and their suitability for IGI; ii) development and assessment of a performance model and scheduling mechanism for heterogeneous many-core distributed systems; iii) a specification of, and support for, a distributed memory model on top of OpenCL; iv) design, evaluation and dissemination of an efficient, adaptive and portable OpenCL IGI engine.
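
Ray intersection is an example of the easily vectorized operations above: the same small kernel runs independently on every ray. It contrasts with space traversal, whose control flow depends on the data. The following minimal sketch is written in Python rather than OpenCL C and uses a single sphere as a stand-in for scene geometry. The function names `intersect_sphere` and `intersect_batch` are hypothetical, and ray directions are assumed to be unit length.

```python
import math


def intersect_sphere(origin, direction, center, radius):
    """Ray-sphere intersection: nearest hit distance t > 0, or None on a miss.

    Assumes `direction` is unit length, so the quadratic for t reduces to
    t^2 + 2*b*t + c = 0 with b = dot(oc, d) and c = dot(oc, oc) - r^2.
    """
    oc = [origin[i] - center[i] for i in range(3)]
    b = sum(oc[i] * direction[i] for i in range(3))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None  # the ray misses the sphere
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):  # try the nearer root first
        if t > 0:
            return t
    return None  # the sphere lies entirely behind the ray origin


def intersect_batch(rays, center, radius):
    # The same kernel is applied independently to every ray, with no data
    # dependencies between items, so each ray could be one OpenCL work-item.
    return [intersect_sphere(o, d, center, radius) for o, d in rays]


rays = [
    ((0.0, 0.0, -5.0), (0.0, 0.0, 1.0)),  # heads straight at the unit sphere
    ((0.0, 0.0, -5.0), (0.0, 1.0, 0.0)),  # passes beside the sphere: miss
]
print(intersect_batch(rays, center=(0.0, 0.0, 0.0), radius=1.0))  # [4.0, None]
```

Because every loop iteration is independent, the batch maps naturally onto wide vector or GPU hardware. A traversal of an acceleration structure would instead branch differently per ray, and may suit a CPU better.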

 

 