Private CUDA and OpenCL training

Train your team to ship accelerated code.

ArrayFire teaches practical 2- or 4-day GPU programming courses for teams that need to understand CUDA, OpenCL, heterogeneous computing, profiling, and performance optimization from the ground up.

We deliver group training at your site, from our Atlanta office, or remotely by video conference. Courses are tailored to your team, your hardware, and the application-specific problems your engineers actually face.

Course tracks

CUDA or OpenCL

2 or 4 days

Choose a focused short course or a deeper program with multi-GPU and algorithm modules.

Private groups

Bring the course to your location, train remotely, or host the team at ArrayFire in Atlanta.

Hands-on

Lecture material is paired with programming exercises, profiling work, and concrete optimization patterns.

What your team gets better at

Turn GPU concepts into engineering judgment.

The goal is not a tour of syntax. Your engineers should leave knowing how to map work to accelerators, reason about memory, profile bottlenecks, and decide when ArrayFire, CUDA, OpenCL, or a vendor library is the right tool.

Programming model fluency

Understand kernels, work-items, grids, blocks, threads, datasets, device context, and the mental model behind heterogeneous execution.

Performance profiling

Use profiling tools, timing patterns, and measurement discipline to distinguish real bottlenecks from code that merely looks suspicious.

Memory and mapping strategy

Work through global, shared, constant, and host-device memory concerns, including practical dataset mapping and coalescing patterns.

Library and ArrayFire judgment

Learn when to write kernels, when to call optimized libraries, and how ArrayFire's lazy evaluation and vectorized code can simplify production work.

Course format

Private training, tuned to your stack.

The strongest courses start with your team's actual work: simulation, imaging, AI inference, signal processing, quant analytics, embedded systems, or a custom pipeline that needs GPU acceleration.

At your location

We bring the course to your engineers so the examples can stay close to the systems, hardware, and constraints they use every day.

Hosted in Atlanta

Bring a group to ArrayFire for an immersive course with direct access to our accelerated computing team.

Remote live training

Run the course by video conference when distributed engineering teams need a shared baseline without travel.

Included in every course

You bring the engineers. We bring the training environment.

Our promise is simple: we take logistics out of the learning day so the team can spend its energy on GPU programming.

Expert instruction

Live teaching from practitioners who specialize in CUDA, OpenCL, ArrayFire, and performance engineering.

Hands-on exercises

Programming work that reinforces the lecture material and exposes common mistakes early.

GPU-capable hardware

Training can include laptops with CUDA-capable GPUs and OpenCL-capable GPUs and CPUs.

Linux or Windows

Teams can choose the operating system that best matches their development environment.

Printed manual

Lecture material is packaged as a reference your engineers can keep after the course.

Exercise files

Electronic programming exercises give the team examples to revisit after training.

CUDA or OpenCL syllabus

A practical progression from first kernel to harder algorithm problems.

Courses are taught in either CUDA or OpenCL. The same performance principles carry across both frameworks, and the third and fourth days can be customized around your applications.

Day 1

Introduction

GPU computing overview, the programming model, dataset mapping, libraries, ArrayFire, and profiling tools.

Practice: simple kernels, equivalent ArrayFire examples, Monte Carlo Pi estimation, timing, and debugging.
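
For a sense of what the Monte Carlo Pi exercise involves, here is a minimal CPU-side sketch in C++. It is not the course's exercise code; the function name `estimate_pi` and its parameters are hypothetical and chosen only for illustration. It samples points uniformly in the unit square and counts how many land inside the quarter circle of radius 1.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper, for illustration only (not course material).
// Estimates pi from the fraction of uniformly sampled points in the
// unit square that fall inside the quarter circle x^2 + y^2 <= 1.
double estimate_pi(std::uint64_t samples, std::uint32_t seed) {
    std::mt19937 rng(seed);  // seeded so repeated runs give the same result
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    std::uint64_t inside = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = dist(rng);
        const double y = dist(rng);
        if (x * x + y * y <= 1.0) {
            ++inside;
        }
    }
    // The quarter circle covers pi/4 of the unit square.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

Calling `estimate_pi(1000000, 42)` should give a value close to 3.14. In a GPU version of the same idea, each thread would typically draw its own samples and the per-thread counts would then be combined with a reduction.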

Day 2

Optimization

Architecture, grids, blocks, threads, global/shared/constant memory, advanced mapping, streams, lazy evaluation, and code vectorization.

Practice: matrix transpose, shared memory, median filtering, stream examples, and nearest-neighbor work in ArrayFire.

Day 3

Multi-GPU

Multi-GPU use cases, contexts, existing libraries, and scaling work across multiple devices.

Practice: out-of-core matrix multiply, task-level parallelism, optimization, and ArrayFire multi-GPU examples.

Day 4

Algorithm problems

Reductions, scan algorithms, sorting, convolution, and customer-specific exercises.

Practice: adapt the principles to the kinds of problems your team actually needs to accelerate.
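
To illustrate the kind of pattern behind the reductions listed for Day 4, here is a CPU-side C++ sketch of a stride-halving tree reduction, the access pattern commonly used inside a single GPU thread block. It is an assumption-laden illustration, not the course's implementation; `tree_reduce_sum` is a hypothetical name. Each pass of the outer loop stands in for one synchronized step, and each inner iteration stands in for the work of one thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only (not course material).
// Sums the input with a tree reduction: at every step, element i
// accumulates element i + stride, and the stride halves until one
// value remains in data[0].
float tree_reduce_sum(std::vector<float> data) {
    if (data.empty()) {
        return 0.0f;
    }

    // Pad to the next power of two with zeros (the additive identity)
    // so every step pairs elements cleanly.
    std::size_t n = 1;
    while (n < data.size()) {
        n *= 2;
    }
    data.resize(n, 0.0f);

    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // On a GPU, these iterations would run in parallel, followed
        // by a barrier before the next, smaller stride.
        for (std::size_t i = 0; i < stride; ++i) {
            data[i] += data[i + stride];
        }
    }
    return data[0];
}
```

The sketch takes the vector by value so the caller's data is left untouched. A real kernel would typically stage the data in shared memory and handle inputs larger than one block with multiple passes or per-block partial sums.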

Best fit

Designed for engineers who will own accelerated code.

Attendees should have a working knowledge of C or C++ so the course can move quickly from concepts into real implementation.

Training works especially well for teams building HPC, image processing, signal processing, AI inference, simulation, quantitative analytics, embedded, or multi-GPU systems.

For individual students, ask us about upcoming scheduled sessions. For groups, we usually recommend a private course tailored to the team's stack.

A US Army Research Lab attendee praised the course for individualized instruction and focused attention on his needs and concerns.

Brian Rapp, US Army Research Lab