CUDA Programming Model | Learn CUDA

This post series works through the exercises in Professional CUDA C Programming by John Cheng, Max Grossman, and Ty McKercher.

Chapter 2: CUDA Programming Model

In this section, we will learn how to:

  • Write a CUDA program
  • Execute kernel functions
  • Organize CUDA threads with grids and blocks
  • Measure GPU performance

A typical processing flow of a CUDA program follows this pattern:

  1. Copy data from CPU memory to GPU memory
  2. Invoke kernels to operate on the data stored in GPU memory
  3. Copy data back from GPU memory to CPU memory

We will start with memory management and data movement between the host and the device.


Device memory is allocated with cudaMalloc.
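As a minimal sketch of cudaMalloc, the program below allocates a device buffer, checks the returned error code, and frees the buffer again. The element count n and the pointer name d_a are placeholders chosen for illustration; they are not taken from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;                       // hypothetical element count
    const size_t nBytes = n * sizeof(float);

    // cudaMalloc writes the device address into d_a and returns an error code.
    float *d_a = NULL;
    cudaError_t err = cudaMalloc((void **)&d_a, nBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemset(d_a, 0, nBytes);               // zero the buffer on the device
    cudaFree(d_a);                            // release the device memory

    printf("allocated and freed %zu bytes on the device\n", nBytes);
    return 0;
}
```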

Data is transferred between the host and the device with cudaMemcpy.

The cudaMemcpyKind flag takes one of the following values:

  • cudaMemcpyHostToHost
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice
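A small round trip shows two of these flags in use: the data goes to the device with cudaMemcpyHostToDevice and comes back with cudaMemcpyDeviceToHost. This is only a sketch; the buffer names and the array size are assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 8;                          // hypothetical element count
    const size_t nBytes = n * sizeof(float);

    float h_src[n], h_dst[n];
    for (int i = 0; i < n; i++) {
        h_src[i] = (float)i;
        h_dst[i] = -1.0f;
    }

    float *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, nBytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // cudaMemcpy(dst, src, count, kind): the kind flag states the direction.
    cudaMemcpy(d_buf, h_src, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_dst, d_buf, nBytes, cudaMemcpyDeviceToHost);

    // Check that the data survived the round trip unchanged.
    int ok = 1;
    for (int i = 0; i < n; i++) {
        if (h_dst[i] != h_src[i]) ok = 0;
    }
    printf("round trip %s\n", ok ? "ok" : "FAILED");

    cudaFree(d_buf);
    return ok ? 0 : 1;
}
```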

So, let's see how to manage data movement between the host and the device, and how to program in CUDA C, through a simple example: summing two arrays.

First, I start with a simple C program that does not use the GPU.
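A host-only version could look like the sketch below. The function name sumArraysOnHost, the array size, and the input values are illustrative assumptions rather than the post's original listing.

```cuda
#include <cstdio>
#include <cstdlib>

// Element-wise sum on the CPU: C[i] = A[i] + B[i].
void sumArraysOnHost(const float *A, const float *B, float *C, const int n) {
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    const int n = 1024;                       // hypothetical element count
    const size_t nBytes = n * sizeof(float);

    float *h_A = (float *)malloc(nBytes);
    float *h_B = (float *)malloc(nBytes);
    float *h_C = (float *)malloc(nBytes);
    if (!h_A || !h_B || !h_C) return 1;

    for (int i = 0; i < n; i++) {
        h_A[i] = (float)i;
        h_B[i] = 2.0f * (float)i;
    }

    sumArraysOnHost(h_A, h_B, h_C, n);
    printf("C[10] = %.1f\n", h_C[10]);        // 10 + 20, prints C[10] = 30.0

    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
```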

Then I change this code so that the computation runs on the GPU.
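A first GPU port could be sketched as follows. It follows the three-step flow described earlier (copy in, run the kernel, copy out). The kernel name sumArraysOnGPU and the sizes are assumptions. Note that the kernel indexes by threadIdx.x alone, so it only covers the array when the launch uses a single block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
// Indexing by threadIdx.x alone assumes a single-block launch.
__global__ void sumArraysOnGPU(const float *A, const float *B, float *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    const int n = 32;                         // hypothetical size, fits in one block
    const size_t nBytes = n * sizeof(float);

    float h_A[n], h_B[n], h_C[n];
    for (int i = 0; i < n; i++) {
        h_A[i] = (float)i;
        h_B[i] = 2.0f * (float)i;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, nBytes);
    cudaMalloc((void **)&d_B, nBytes);
    cudaMalloc((void **)&d_C, nBytes);

    // 1. Copy the inputs from host memory to device memory.
    cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);

    // 2. Launch the kernel: one block of n threads.
    sumArraysOnGPU<<<1, n>>>(d_A, d_B, d_C);

    // 3. Copy the result back; this cudaMemcpy waits for the kernel to finish.
    cudaMemcpy(h_C, d_C, nBytes, cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < n; i++) {
        if (h_C[i] != h_A[i] + h_B[i]) ok = 0;
    }
    printf("GPU sum %s\n", ok ? "matches the host" : "does NOT match the host");

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return ok ? 0 : 1;
}
```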

Organizing Threads

In this section, we will learn to use CUDA blocks and threads. CUDA exposes a thread hierarchy of grids, blocks, and threads. All threads spawned by a single kernel launch are collectively called a grid, and all threads in a grid share the same global memory space. A grid is made up of many thread blocks. A thread block is a group of threads that can cooperate with each other using block-local synchronization and shared memory. A thread is identified by two built-in variables: blockIdx (the index of the block within the grid) and threadIdx (the index of the thread within its block). Each of these variables has three components: x, y, and z.


We can read these variables inside a kernel to see how the threads are laid out.
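As a sketch, the kernel below prints threadIdx and blockIdx for every thread, together with the block and grid dimensions (blockDim and gridDim). The kernel name checkIndex and the sizes (6 elements, 3 threads per block) are assumptions chosen to keep the output short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread prints its own coordinates and the launch dimensions.
__global__ void checkIndex(void) {
    printf("threadIdx:(%u, %u, %u) blockIdx:(%u, %u, %u) "
           "blockDim:(%u, %u, %u) gridDim:(%u, %u, %u)\n",
           threadIdx.x, threadIdx.y, threadIdx.z,
           blockIdx.x, blockIdx.y, blockIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           gridDim.x, gridDim.y, gridDim.z);
}

int main() {
    const int nElem = 6;                          // hypothetical element count

    dim3 block(3);                                // 3 threads per block
    dim3 grid((nElem + block.x - 1) / block.x);   // 2 blocks cover 6 elements

    checkIndex<<<grid, block>>>();
    cudaDeviceSynchronize();                      // wait so the device printf output is flushed
    return 0;
}
```

The unused y and z components are 0 for the indices and 1 for the dimensions.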

The point is that choosing the grid and block sizes is an important decision.
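For a 1D problem, a common approach is to fix the block size and derive the grid size by rounding up, so that every element gets a thread. The sketch below shows how the grid size changes with the block size; the element count of 1024 is an assumption for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int nElem = 1024;                       // hypothetical element count
    const int blockSizes[] = {1024, 512, 256, 128};

    for (int k = 0; k < 4; k++) {
        dim3 block(blockSizes[k]);
        // Round up so that grid.x * block.x >= nElem.
        dim3 grid((nElem + block.x - 1) / block.x);
        printf("block.x = %u -> grid.x = %u\n", block.x, grid.x);
    }
    return 0;
}
```

This prints grid sizes of 1, 2, 4, and 8 for block sizes of 1024, 512, 256, and 128.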

Launch a CUDA Kernel

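A CUDA kernel launch extends the C function-call syntax with an execution configuration inside triple angle brackets: kernel_name<<<grid, block>>>(argument list).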
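To make the launch syntax concrete, here is a small sketch in which each thread writes its global index into an array. The kernel name writeGlobalIndex and the sizes are assumptions. A kernel launch is asynchronous, so the host checks for launch errors with cudaGetLastError and waits with cudaDeviceSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread stores its global index at the matching array position.
__global__ void writeGlobalIndex(int *out, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    const int n = 8;                              // hypothetical element count
    int h_out[n];

    int *d_out = NULL;
    cudaMalloc((void **)&d_out, n * sizeof(int));

    dim3 block(4);                                // 4 threads per block
    dim3 grid((n + block.x - 1) / block.x);       // 2 blocks

    // kernel_name<<<grid, block>>>(argument list)
    writeGlobalIndex<<<grid, block>>>(d_out, n);

    cudaError_t err = cudaGetLastError();         // reports launch configuration errors
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceSynchronize();                      // the launch returns immediately; wait here

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", h_out[i]);
    printf("\n");

    cudaFree(d_out);
    return 0;
}
```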

You may have noticed that the GPU array summation example above is wrong. We will write it again, and you can try to find the error yourself.
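The post's corrected listing is not reproduced here, so the following is only one plausible rewrite. It assumes the usual pitfalls of this example: a kernel that indexes by threadIdx.x alone covers only one block, and a grid that is rounded up needs a bounds check. The names and sizes are illustrative.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// General 1D version: global index from block and thread indices, plus a guard.
__global__ void sumArraysOnGPU(const float *A, const float *B, float *C, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];                // threads past the end do nothing
}

int main() {
    const int n = 1000;                           // deliberately not a multiple of the block size
    const size_t nBytes = n * sizeof(float);

    float *h_A = (float *)malloc(nBytes);
    float *h_B = (float *)malloc(nBytes);
    float *h_C = (float *)malloc(nBytes);
    if (!h_A || !h_B || !h_C) return 1;
    for (int i = 0; i < n; i++) {
        h_A[i] = (float)i;
        h_B[i] = 2.0f * (float)i;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, nBytes);
    cudaMalloc((void **)&d_B, nBytes);
    cudaMalloc((void **)&d_C, nBytes);
    cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);       // 4 blocks, 1024 threads for 1000 elements
    sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, nBytes, cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < n; i++) {
        if (h_C[i] != h_A[i] + h_B[i]) ok = 0;
    }
    printf("GPU sum %s\n", ok ? "matches the host" : "does NOT match the host");

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return ok ? 0 : 1;
}
```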

nvprof: shows the timing.
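A profiler measures the program from the outside. As a complementary sketch, which is not something the post itself uses, kernel time can also be measured inside the program with CUDA events. The kernel, the array size, and the block size below are assumptions for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumArrays(const float *A, const float *B, float *C, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;                        // hypothetical element count
    const size_t nBytes = n * sizeof(float);

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, nBytes);
    cudaMalloc((void **)&d_B, nBytes);
    cudaMalloc((void **)&d_C, nBytes);
    cudaMemset(d_A, 0, nBytes);                   // give the inputs defined contents
    cudaMemset(d_B, 0, nBytes);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record an event before and after the kernel in the same (default) stream.
    cudaEventRecord(start);
    sumArrays<<<grid, block>>>(d_A, d_B, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                   // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds
    printf("sumArrays<<<%u, %u>>> took %f ms\n", grid.x, block.x, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```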

Parallel Threads

In this section, we will implement multidimensional grids and blocks.

  • 2D grid with 2D blocks
  • 1D grid with 1D blocks
  • 2D grid with 1D blocks

Each thread can compute its matrix coordinates and its global memory index from threadIdx, blockIdx, and blockDim.
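For a row-major matrix with nx columns, a common mapping is ix = threadIdx.x + blockIdx.x * blockDim.x, iy = threadIdx.y + blockIdx.y * blockDim.y, and idx = iy * nx + ix. The sketch below prints this mapping for a tiny matrix; the kernel name and the sizes are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the mapping from (block, thread) coordinates to matrix coordinates
// and to the linear index of a row-major nx-by-ny matrix.
__global__ void printThreadIndex(const int nx, const int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;   // column
    int iy = threadIdx.y + blockIdx.y * blockDim.y;   // row
    if (ix < nx && iy < ny) {
        int idx = iy * nx + ix;                       // linear global memory index
        printf("block (%u, %u) thread (%u, %u) -> coordinate (%d, %d) -> index %d\n",
               blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, ix, iy, idx);
    }
}

int main() {
    const int nx = 4;                                 // hypothetical matrix width
    const int ny = 2;                                 // hypothetical matrix height

    dim3 block(2, 2);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    printThreadIndex<<<grid, block>>>(nx, ny);
    cudaDeviceSynchronize();                          // flush the device printf output
    return 0;
}
```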

First, we implement matrix summation with a 2D grid and 2D blocks.
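A matrix summation with a 2D grid and 2D blocks could be sketched as follows. The kernel name sumMatrixOnGPU2D, the matrix size, and the 32 x 32 block shape are assumptions, not the post's measured configuration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per matrix element; the matrices are stored row-major.
__global__ void sumMatrixOnGPU2D(const float *A, const float *B, float *C,
                                 const int nx, const int ny) {
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix < nx && iy < ny) {
        int idx = iy * nx + ix;
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int nx = 256, ny = 128;                     // hypothetical matrix size
    const int nxy = nx * ny;
    const size_t nBytes = nxy * sizeof(float);

    float *h_A = (float *)malloc(nBytes);
    float *h_B = (float *)malloc(nBytes);
    float *h_C = (float *)malloc(nBytes);
    if (!h_A || !h_B || !h_C) return 1;
    for (int i = 0; i < nxy; i++) {
        h_A[i] = (float)(i % 100);
        h_B[i] = 1.0f;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, nBytes);
    cudaMalloc((void **)&d_B, nBytes);
    cudaMalloc((void **)&d_C, nBytes);
    cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);

    dim3 block(32, 32);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    sumMatrixOnGPU2D<<<grid, block>>>(d_A, d_B, d_C, nx, ny);

    cudaMemcpy(h_C, d_C, nBytes, cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < nxy; i++) {
        if (h_C[i] != h_A[i] + h_B[i]) ok = 0;
    }
    printf("grid (%u, %u), block (%u, %u): %s\n", grid.x, grid.y, block.x, block.y,
           ok ? "result matches the host" : "result does NOT match the host");

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return ok ? 0 : 1;
}
```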

Manage Devices

In this section, we will query information about the devices from C code.
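A minimal sketch of such a query uses cudaGetDeviceCount and cudaGetDeviceProperties from the CUDA runtime API. Which fields to print is a choice made for this example; cudaDeviceProp has many more.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Detected %d CUDA capable device(s)\n", deviceCount);

    for (int dev = 0; dev < deviceCount; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:         %.2f GB\n",
               (double)prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:  %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:   %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}
```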


I solved the exercises of Chapter 2. You can find the solutions below. I used a Jetson Xavier.

Example 1

I changed the block.x variable from 1024 to 1023, and the results are below.

Example 2

I changed the block.x variable to 256. The results and the new kernel code are below.


Example 3

I changed the block.x and block.y variables. The results and the new kernel code are below.

Block.x   Block.y   Kernel Time
1024      1         0.036771 s
512       1         0.035524 s
128       2         0.037406 s
128       8         0.034023 s
32        32        0.034817 s

Example 4

I changed the block.x variable. The results and the new kernel code are below.

Block.x   Kernel Time
32        0.333858 s
64        0.167834 s
128       0.111178 s
256       0.115275 s
512       0.112036 s
1024      0.139362 s

The best result is obtained with block.x = 128 (0.111178 s).

