Performance-Portable Programming

The OpenCL language and ecosystem could be a lot more portable than it is today. The idea of a static OpenCL runtime coupled with an OpenCL-to-C translator would make OpenCL code theoretically runnable anywhere C and C++ can be compiled for.

As microprocessors became more prevalent, the challenge of portability became critical. The C programming language became popular in large part because it provided the right programming mechanisms for matching common programming patterns, such as counted loops, to architecture-specific mechanisms for handling those constructs. It’s still true that many of the most highly tuned libraries are written in assembly code or intrinsics and tuned to specific pieces of hardware. But the vast majority of programmers do not write assembly code anymore, even the ones who really care about performance. The typical performance-minded programmer does take into account both algorithmic complexity and general architecture principles such as locality, stride-one memory accesses in loops, and the avoidance of control flow in innermost loops. Today, practically the first tools available for any new processor are an assembler and an optimizing C compiler.

Now, parallel computing hardware is becoming increasingly diverse and attractive. Multicore CPUs, GPUs, DSPs and many other architectures are becoming more and more parallel. The change has become so overwhelming that, for possibly the first time in decades, many people are considering rewriting working code to be scalably parallel, and portable across parallel architectures.

Accepting that working code can be rewritten opens up huge opportunities for new parallel programming models, which explains why so many have sprung up in the last few years. But the existing landscape is largely segmented. Want to run parallel code on an NVIDIA GPU? Write it in CUDA, and decompose your code into thousands of tiny threads. Want to run parallel code on multicore CPUs? Write it in OpenMP and make sure it scales to a few dozen traditional CPU threads. The only language that is widely touted as being applicable for a wide variety of parallel processors today is OpenCL.

Figure 1: Typical OpenCL Software Framework

Figure 1 shows how OpenCL is implemented in practice. The OpenCL application code links against a system library defined and provided by the Khronos OpenCL standardization committee. When the application runs on a system, it will locate that shared library, which will use some mechanism to discover registered vendor OpenCL platforms on the system. The application code therefore gains visibility into every available OpenCL stack, including the platform interface implementing the OpenCL API for that platform. That interface defines methods for invoking that platform’s compiler on kernel code, and for managing the execution of those kernels. The goal is that OpenCL application code, at runtime, can find vendor support for OpenCL on each particular system, making the application itself portable among all systems implementing OpenCL.

Yet the OpenCL ecosystem today does not provide portability to the extent many software publishers desire. First, OpenCL is only portable across supported architectures and systems. OpenCL is not installed by default on the majority of consumer x86 devices in use today, so any application intending to use it takes on a new library dependency. Those planning on using Intel’s OpenCL stack are limited to CPUs at least as recent as the Core i7 generation. The AMD OpenCL stack is somewhat more general, but even it struggles to provide more than functional portability.

Although OpenCL is a functionally portable standard, people can only write performance-portable code if the different implementers of OpenCL agree on a set of best programming practices for good performance. Unfortunately, that agreement does not exist today. The GPU vendors teach developers to create thousands of work-items to fill their hardware thread contexts, and to avoid divergence among work-items in a work-group, because that is where SIMD execution on a GPU really comes from. To help control locality, GPU vendors promise lightning-fast, hardware-implemented barriers between work-items in a group. Yet on AMD’s CPU implementation, divergence among work-items ends up being irrelevant, because the implementation does not execute multiple work-items in SIMD anyway. In addition, a barrier costs at least a dozen instructions per work-item per executed barrier, which fundamentally changes what a barrier can reasonably be used for.

So there are two major problems that need to be solved. The first is that OpenCL is distributed as a system runtime library that is not present by default on many consumer platforms today. The second is that current OpenCL implementations on x86 CPUs are either insufficiently portable (Intel) or suffer in performance if given the same code that works well on a GPU (AMD). We need a solution that will enable both broad portability and a single-source, high-performance programming environment compatible with GPU acceleration and parallel CPU execution.

Figure 2: An OpenCL static runtime library enables broader portability

To solve both problems, we need a way of making the OpenCL code, both the runtime and the kernels, just another compiled module of the application, as shown in Figure 2. An OpenCL API implementation could be statically linked against the main application to mediate the standard OpenCL API calls to the precompiled kernels. To be truly portable, it would also have to be able to dynamically find the system OpenCL runtime library, and through it, any other vendor libraries that happen to be available on the system. A static OpenCL runtime with that capability would provide the same access to GPU-accelerated OpenCL implementations, for instance, but would still run on a platform without any preinstalled OpenCL implementation.
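The fallback behavior described here can be sketched in C. This is only an illustration under stated assumptions, not MulticoreWare's implementation: the names `count_opencl_platforms`, `choose_backend`, and the `backend` enum are hypothetical, the function-pointer type is a hand-written stand-in for the `clGetPlatformIDs` prototype so that no OpenCL headers are needed, and `libOpenCL.so.1` is assumed to be the system loader's name, as it commonly is on Linux. It relies on POSIX `dlopen`, and older glibc versions need `-ldl` at link time.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-written stand-in for the clGetPlatformIDs prototype (assumption:
 * cl_int/cl_uint are 32-bit and platform IDs are opaque pointers). */
typedef int32_t (*get_platform_ids_fn)(uint32_t num_entries,
                                       void **platforms,
                                       uint32_t *num_platforms);

/* Returns the number of platforms reported by the OpenCL library named
 * lib_name, or 0 if the library or its platforms cannot be found. */
int count_opencl_platforms(const char *lib_name)
{
    assert(lib_name != NULL);

    void *lib = dlopen(lib_name, RTLD_NOW | RTLD_LOCAL);
    if (lib == NULL)
        return 0;                       /* no system OpenCL runtime */

    get_platform_ids_fn get_ids;
    *(void **)(&get_ids) = dlsym(lib, "clGetPlatformIDs");

    uint32_t n = 0;
    if (get_ids == NULL || get_ids(0, NULL, &n) != 0 || n == 0) {
        dlclose(lib);                   /* nothing usable behind the library */
        return 0;
    }
    return (int)n;                      /* library intentionally stays loaded */
}

/* Hypothetical backend choice for a statically linked runtime. */
enum backend { BACKEND_BUILTIN_CPU, BACKEND_SYSTEM_OPENCL };

enum backend choose_backend(const char *lib_name)
{
    return count_opencl_platforms(lib_name) > 0 ? BACKEND_SYSTEM_OPENCL
                                                : BACKEND_BUILTIN_CPU;
}
```

Calling `choose_backend("libOpenCL.so.1")` would select the system stack when one is installed and fall back to the precompiled CPU kernels otherwise, so the application runs either way.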

The OpenCL kernels themselves are difficult to manage portably. It would be unlikely for one company’s OpenCL runtime product to support all the different architecture variants that C can support now, much less be kept up to date. The best situation would be if the OpenCL kernels could be compiled with the same tools used to compile the main application: the target’s C or C++ compiler. An OpenCL-to-C translator for CPUs would enable this possibility, and would be generically reusable for a wide variety of architectures.

Figure 3: An OpenCL kernel multiplying a list of small matrices, being translated into C code

Translation from OpenCL to multithreaded C is difficult to get right. Several academic papers have been published on the topic, but the core concept is that the many small work-items of an OpenCL work-group have to be merged into a single CPU software thread. A CPU thread has too much creation and scheduling overhead to be suitable for a typically tiny, individual OpenCL work-item. Serialization wraps regions of the kernel code in loops over work-item indexes, as shown in Figure 3, and so replaces the implicitly declared local work-item indexes with explicitly enumerated loop iterations. This increases the task granularity to a degree much more suitable to a coarsely threaded CPU architecture. Figure 3 also shows that care must be taken to note the placement of barriers in the original OpenCL code and to obey the ordering constraints those barriers impose. What is labeled as the second region overwrites input used in the first region, so the barrier is there to ensure all the input has been consumed before any of it is overwritten. The barriers essentially define regions of code that can safely be serialized: separate regions must be divided such that a region after a barrier executes only after every work-item has completed all operations before the barrier. The technique even generalizes to cases where barriers are inside other control flow constructs. Once the barrier-ordering constraints have been applied to the serialized code, the barrier itself has no remaining function to perform and can be removed.
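A minimal C sketch of the idea follows. It is not the matrix kernel from Figure 3 and not the output of any particular translator: the kernel `rotate_left`, the function `rotate_left_serialized`, and the `MAX_GROUP_SIZE` limit are hypothetical, chosen because the kernel has the same shape of hazard, with a second region that overwrites data the first region reads. The sketch assumes a single work-group.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUP_SIZE 256  /* hypothetical upper bound on work-group size */

/*
 * Original OpenCL-style kernel, shown only for reference:
 *
 *   __kernel void rotate_left(__global float *buf) {
 *       size_t lid = get_local_id(0);
 *       size_t n   = get_local_size(0);
 *       float v = buf[(lid + 1) % n];    // region 1: read a neighbor's slot
 *       barrier(CLK_GLOBAL_MEM_FENCE);   // all reads finish before any write
 *       buf[lid] = v;                    // region 2: overwrite own slot
 *   }
 */

/* Serialized form: one CPU thread runs the whole work-group. */
void rotate_left_serialized(float *buf, size_t n)
{
    /* The private variable v lives across the barrier, so it is expanded
     * into one slot per work-item. */
    float v[MAX_GROUP_SIZE];
    assert(n > 0 && n <= MAX_GROUP_SIZE);

    /* Region 1, looped over every work-item index. */
    for (size_t lid = 0; lid < n; ++lid)
        v[lid] = buf[(lid + 1) % n];

    /* The barrier stood here. Ending the first loop before starting the
     * second already enforces its ordering, so no call remains. */

    /* Region 2, looped over every work-item index. */
    for (size_t lid = 0; lid < n; ++lid)
        buf[lid] = v[lid];
}
```

Fusing the two loops into one would be incorrect: for the input {1, 2, 3, 4} the last work-item would read a slot that the first had already overwritten, giving {2, 3, 4, 2} instead of {2, 3, 4, 1}.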

Once the kernels are in this translated, serialized C-code format, they can be considered just additional source files of the application and compiled using the same toolchains. This makes the translator forward-compatible, as updated C compilers are provided with every new architecture release from most companies. A suitable runtime for interfacing the OpenCL API to those precompiled C kernels would essentially treat each kernel as a plug-in identifiable by kernel name, using techniques such as LLVM’s compiler plugin mechanisms.
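One simple way to picture kernels as plug-ins identifiable by name is a lookup table of function pointers. The sketch below is an assumption made for illustration, not LLVM's plugin mechanism and not a description of any shipping runtime: the `kernel_fn` signature, the `kernel_entry` table, `find_kernel`, and the two sample kernels are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical calling convention for a precompiled, serialized kernel:
 * an array of argument pointers plus the index of the work-group to run. */
typedef void (*kernel_fn)(void **args, size_t group_id);

struct kernel_entry {
    const char *name;   /* kernel name as it appeared in the OpenCL source */
    kernel_fn   fn;     /* translated C function compiled into the app */
};

/* Two stand-in kernels; each "work-group" handles a single element. */
static void scale_by_two(void **args, size_t group_id)
{
    float *data = (float *)args[0];
    data[group_id] *= 2.0f;
}

static void add_one(void **args, size_t group_id)
{
    float *data = (float *)args[0];
    data[group_id] += 1.0f;
}

static const struct kernel_entry kernel_table[] = {
    { "scale_by_two", scale_by_two },
    { "add_one",      add_one      },
};

/* What a static runtime might do when the application asks for a kernel
 * by name: return the precompiled function, or NULL if it is unknown. */
kernel_fn find_kernel(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof kernel_table / sizeof kernel_table[0]; ++i) {
        if (strcmp(kernel_table[i].name, name) == 0)
            return kernel_table[i].fn;
    }
    return NULL;
}
```

A runtime built this way would resolve the name once, when the application creates the kernel object, and then call the returned function once per work-group when the kernel is enqueued.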

In conclusion, the OpenCL language and ecosystem could be a lot more portable than it is today. The idea of a static OpenCL runtime coupled with an OpenCL-to-C translator would make OpenCL code theoretically runnable anywhere C and C++ can be compiled for. This is the vision adopted by MulticoreWare in the Multicore cross-Platform Architecture (MxPA) product line. The goal is to move OpenCL away from being the duplicate, fast codepath that’s only used when it works, to being the only implementation an application needs for its data-parallel kernels.



John A. Stratton is a senior architect for MulticoreWare, Inc. John has been teaching GPU computing since the first university course on the subject in spring 2007, and developing compilers and runtimes for kernel-based accelerated programming models since that year. He has received several awards for outstanding research, teaching and technology development, most recently given the “Most Valuable Entrepreneurial Leadership in a Startup” award by the University of Illinois Research Park for his work with MulticoreWare.
