Cycling through SYCL

Umar Arshad

We recently gave an overview of recent history in the technical computing hardware market and mentioned the energy at Intel right now. Intel has put its weight behind the SYCL standard through its new software approach, oneAPI.

SYCL is a cross-platform API that targets heterogeneous hardware, similar to OpenCL and CUDA. The SYCL standard was developed with significant contributions from Codeplay and is managed by the Khronos Group. It allows single-source compilation in C++ to target multiple devices on a system, rather than using C++ for the host and domain-specific kernel languages for the device.

Furthermore, SYCL is fully standards-compliant C++17. There are no extensions to the language that would prevent a standards-compliant compiler from processing the source code. This opens up the full power of the C++ language on both the host and accelerator sides. Advanced C++ features such as metaprogramming, variadic templates, and lambdas allow for concise expression of algorithms on any target hardware.

Compiling SYCL code with a standard C++ compiler will not provide the same performance on a device as a SYCL-aware compiler. Many organizations are implementing the SYCL standard to target specific hardware for the best performance. Intel is using this framework for its heterogeneous architectures. Codeplay, which has been closely involved with the standard from the start, targets multiple devices in its own SYCL compiler, including NVIDIA and AMD GPUs. Others have also been working on their own implementations: Xilinx for its FPGAs, and the University of Heidelberg targeting both CUDA and AMD's HIP software stack.

At ArrayFire, we always champion things that are open and programmable, while maximizing performance. We shy away from proprietary software and vendor lock-in whenever possible. 

In this post, we explore the open SYCL standard and the differences between its various implementations. In addition to a high-level discussion of each implementation, we cover some nuances in the build steps of a simple SYCL program where applicable. There are many exciting features in SYCL that we will look at in future blog posts, but for now we are going to use a simple vector addition example, sketched below, to test the default behavior of the different SYCL compilers.
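
For reference, a vector addition program along the following lines is what we have in mind. Treat it as a sketch rather than the exact source compiled below, but the structure (a queue, buffers, accessors, and a kernel expressed as a C++ lambda) is the same across the implementations discussed here.

#include <CL/sycl.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

    // The default selector picks whatever the implementation considers
    // the best available device.
    cl::sycl::queue q{cl::sycl::default_selector{}};
    std::cout << "Running on "
              << q.get_device().get_info<cl::sycl::info::device::name>()
              << std::endl;

    {
        // Buffers manage data movement between the host and the device.
        cl::sycl::buffer<float, 1> bufA(a.data(), cl::sycl::range<1>(N));
        cl::sycl::buffer<float, 1> bufB(b.data(), cl::sycl::range<1>(N));
        cl::sycl::buffer<float, 1> bufC(c.data(), cl::sycl::range<1>(N));

        q.submit([&](cl::sycl::handler& cgh) {
            auto A = bufA.get_access<cl::sycl::access::mode::read>(cgh);
            auto B = bufB.get_access<cl::sycl::access::mode::read>(cgh);
            auto C = bufC.get_access<cl::sycl::access::mode::write>(cgh);
            // The kernel is a plain C++ lambda, named by the tag type vecadd.
            cgh.parallel_for<class vecadd>(cl::sycl::range<1>(N),
                                           [=](cl::sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    } // Buffer destruction copies the results back to the host vectors.

    std::cout << "Result of vector addition: " << c[0] << " ..." << std::endl;
    return 0;
}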

ComputeCpp

ComputeCpp is Codeplay's implementation of SYCL and the world's first SYCL v1.2.1 conformant implementation. It enables developers to easily integrate heterogeneous parallel computing into their applications and accelerate their code on OpenCL devices such as GPUs. Codeplay has been heavily involved in the development of the SYCL standard and has been helping its adoption in various open-source and commercial applications. ComputeCpp is available as a free download on Codeplay's website. Once installed, you can display system information by running the computecpp_info application in the bin directory of the SDK. For our test system, computecpp_info lists the available devices and SYCL information. ComputeCpp is currently based on OpenCL, so the OpenCL driver from a previously installed CUDA toolkit shows up in the output.

$ /usr/local/computecpp/bin/computecpp_info

********************************************************************************
ComputeCpp Info (CE 2.2.1 2020/10/16)

SYCL 1.2.1 revision 3

********************************************************************************

Toolchain information:

GLIBC version: 2.31
GLIBCXX: 20190605

Device 0:
  Device is supported                     : UNTESTED - Untested OS
  Bitcode targets                         : ptx64 
  CL_DEVICE_NAME                          : GeForce GTX 1080
  CL_DEVICE_VENDOR                        : NVIDIA Corporation
  CL_DRIVER_VERSION                       : 450.80.02
...

Codeplay also provides a repository of ComputeCpp samples. The samples are open source and available on GitHub. They can be built with CMake by setting the ComputeCpp_DIR variable to the installed path containing the bin, include, and lib directories.

A more detailed quick start guide to ComputeCpp can be found on Codeplay's website.

ComputeCpp can target a wide range of platforms with some additional setup. In our case, where an NVIDIA OpenCL driver was the target, the CMake configuration required several additional options.

The OpenCL driver needed to be explicitly targeted:  

-DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/CL
-DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so
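
Putting these together, configuring and building the samples against the NVIDIA OpenCL driver looks roughly like the following (run from inside the samples checkout; the install paths are from our system and the layout is an assumption, so adjust to yours):

mkdir build && cd build
cmake .. -DComputeCpp_DIR=/usr/local/computecpp \
         -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/CL \
         -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so
cmake --build .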

The ComputeCpp compiler can also be invoked manually, as shown below:

compute++ -sycl-driver -sycl-target ptx64 -I/usr/local/computecpp/include/ -L/usr/local/computecpp/lib/ -lComputeCpp vecadd.cpp

This compiler is built on top of Clang and has different invocation modes that allow it to output intermediate SYCL headers or a full object file. Compilation for an NVIDIA target requires specifying the ptx64 bitcode target. More details on manual usage can be found here.

Codeplay additionally offers a “Professional” edition of ComputeCpp, which includes live support services, offline kernel compilation, Multi Instruction Single Binary, and expanded profiling capabilities.

hipSYCL

hipSYCL is a SYCL implementation targeting CPUs and GPUs. It leverages existing toolchains like CUDA and HIP as much as possible, which makes it possible to mix SYCL code with code written for those platforms. For example, with hipSYCL you can mix CUDA device intrinsics with regular C++ code, and hipSYCL will understand and compile it. The same applies to ROCm-specific code when targeting the ROCm platform. It is also possible to use vendor-optimized libraries like CUB and rocPRIM with hipSYCL, and vendor tools such as profilers and debuggers work well with it for the same reason.
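
As a sketch of what this mixing looks like, the kernel below uses a CUDA fast-math intrinsic when compiled for the CUDA backend and falls back to standard SYCL math elsewhere. Note that the macro names used for the guard (SYCL_DEVICE_ONLY, HIPSYCL_PLATFORM_CUDA) are our reading of the hipSYCL documentation and are worth double-checking against the version you install:

#include <CL/sycl.hpp>
#include <cstddef>
#include <vector>

int main() {
    constexpr size_t N = 256;
    std::vector<float> in(N, 1.0f), out(N, 0.0f);
    cl::sycl::queue q;
    {
        cl::sycl::buffer<float, 1> bin(in.data(), cl::sycl::range<1>(N));
        cl::sycl::buffer<float, 1> bout(out.data(), cl::sycl::range<1>(N));
        q.submit([&](cl::sycl::handler& cgh) {
            auto x = bin.get_access<cl::sycl::access::mode::read>(cgh);
            auto y = bout.get_access<cl::sycl::access::mode::write>(cgh);
            cgh.parallel_for<class fast_exp>(cl::sycl::range<1>(N),
                                             [=](cl::sycl::id<1> i) {
#if defined(SYCL_DEVICE_ONLY) && defined(HIPSYCL_PLATFORM_CUDA)
                // CUDA fast-math intrinsic, only valid in device code
                // compiled for the CUDA backend (macro names assumed).
                y[i] = __expf(x[i]);
#else
                y[i] = cl::sycl::exp(x[i]);
#endif
            });
        });
    }
    return 0;
}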

A few salient points about the current state of hipSYCL:

  • It is under active development
  • It is the only SYCL implementation that doesn’t use OpenCL
  • It is not yet a fully conformant implementation

hipSYCL targets the following hardware:

  • Any CPU for which a C++17 OpenMP compiler exists
  • NVIDIA GPUs using CUDA
  • AMD GPUs via HIP/ROCm

Note that clang, which hipSYCL relies on, may not always support the very latest CUDA version, which sometimes affects support for new hardware. At the time of writing, the highest supported CUDA toolkit version is 10.1, which limits the targetable NVIDIA GPU architectures to compute capabilities 3.0 through 7.5.

In hipSYCL, the command to compile SYCL code is `syclcc`, the front end of the hipSYCL compilation toolchain. It invokes clang++ to compile the actual code and uses a clang plugin to parse the source. syclcc compilation loosely follows the CUDA model, in which regular C++ code and device code can sit in the same source file. The following are the most common options you may pass to syclcc:

  1. `--hipsycl-platform`: takes one of the following values: `omp`, `cuda`, or `hip`
  2. `--hipsycl-gpu-arch`: the compute architecture of the target GPU (for example, `sm_61` for CUDA or `gfx906` for ROCm)

Given below are sample compilation commands targeting a CPU (OpenMP), an NVIDIA GPU (CUDA), and an AMD GPU (HIP).

syclcc --hipsycl-platform=omp vector_add.cpp -o vector_add_omp

syclcc --hipsycl-platform=cuda --hipsycl-gpu-arch=sm_61 vector_add.cpp -o vector_add_cuda

syclcc --hipsycl-platform=hip --hipsycl-gpu-arch=gfx906 vector_add.cpp -o vector_add_hip

triSYCL

triSYCL is an open-source research SYCL project targeting extensions for Xilinx FPGAs and ACAPs. The project is under active development on GitHub and is very experimental at the moment. Development started at AMD and is now mainly funded by Xilinx. The implementation is not aimed at end users yet and is heavily biased toward the development of Xilinx device compilers. The single-source compilation model is especially exciting for FPGAs, which can have extremely long compilation times. By writing SYCL code, developers can debug their programs with host-fallback device execution: by emulating FPGA device code on a CPU first, they can explore hardware-software co-design much more efficiently. Within the project, there is also an effort to unify the oneAPI DPC++ SYCL implementation and triSYCL for a better end-user experience once the FPGA device compilers mature.
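
Since triSYCL is largely a header-only library, the host-fallback path is easy to try: a simple SYCL program such as the vector addition example can, in principle, be compiled with an ordinary C++17 compiler and run entirely on the CPU, with no device compiler involved. Roughly (assuming the triSYCL repository has been cloned next to the source file and Boost is installed; given how experimental the project is, not every SYCL 1.2.1 feature is available):

g++ -std=c++17 -fopenmp -I triSYCL/include vector_add.cpp -o vector_add_cpu
./vector_add_cpu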

DPC++ and oneAPI

oneAPI is an open, cross-industry specification to facilitate heterogeneous computing. It is built on top of the SYCL standard and additionally provides a set of tools and libraries to accelerate common tasks like linear algebra and video processing. Intel recently launched its implementation of oneAPI, and multiple hardware and software vendors are working on their own versions of the standard. Each vendor can optimize its libraries to target specific hardware environments. You can learn more about Intel's reference implementation of the oneAPI specification here.

oneAPI is much more than just SYCL, and it is an interesting approach to building high-performance applications. Because the tools and libraries are included as part of the oneAPI specification, it has the potential to make it much easier to port code to multiple platforms.

You can download Intel's oneAPI implementation, which contains the tools necessary to get started with oneAPI and Data Parallel C++ (DPC++). The DPC++ language is an important component of oneAPI: a heterogeneous parallel language based on modern C++ and SYCL. It features a number of extensions that Intel is developing with the intention of feeding them back into the SYCL standard, and many of these have been incorporated into the next version of the standard; we show one of them, unified shared memory, at the end of this section.

Intel makes it easy to set up the development environment by providing a script that sets the environment variables on your system. Assuming you are in a bash shell, run the following command:

source /opt/intel/oneapi/setvars.sh intel64

This will set all the environment variables needed to start using the Intel compilers and libraries. Once it has been executed, you can compile our sample program with the following command:

dpcpp source.cpp -o program

Here is the output of the program on my laptop:

./program: Running on Intel(R) Gen9 HD Graphics NEO
Result of vector addition:
...
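
As an illustration of the DPC++ extensions mentioned above, here is a sketch of the vector addition rewritten with unified shared memory (USM), one of the extensions that has since been folded into SYCL 2020. This is not the program compiled above, just a taste of the extension; the API shown (sycl::malloc_shared, the queue::parallel_for shortcut, sycl::free) follows the SYCL 2020 / DPC++ interface.

#include <CL/sycl.hpp>
#include <cstddef>
#include <iostream>

int main() {
    constexpr size_t N = 1024;
    sycl::queue q;

    // malloc_shared returns memory accessible from both host and device.
    float* a = sycl::malloc_shared<float>(N, q);
    float* b = sycl::malloc_shared<float>(N, q);
    float* c = sycl::malloc_shared<float>(N, q);
    for (size_t i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // No buffers or accessors: raw pointers are used directly in the kernel.
    q.parallel_for<class vecadd_usm>(sycl::range<1>(N),
                                     [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    std::cout << "c[0] = " << c[0] << std::endl;

    sycl::free(a, q);
    sycl::free(b, q);
    sycl::free(c, q);
    return 0;
}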

Beyond SYCL

SYCL provides an excellent platform to target multiple accelerators. It conforms to the C++ standard and does not add special syntax to the language. This approach makes it easy to incorporate this standard into existing applications. It is also encouraging that multiple vendors are bringing SYCL to their platforms.

Although SYCL is a great specification, it needs an ecosystem around it to be successful. This is why we are excited about oneAPI: it builds tools and APIs around the platform, which is extremely important for productive programming. Stay tuned for a more in-depth discussion of the oneAPI specification in the upcoming weeks!

