Programmer's view: threads, thread blocks, and the kernel grid are essentially a programmer's perspective, a software abstraction that the hardware maps onto physical resources.

For example, the NVIDIA Jetson TX2 GPU has two streaming multiprocessors (SMs), each providing 128 1.3-GHz cores that share a 512-KB L2 cache.

Hardware view: the hardware groups threads that execute the same instruction into warps (32 threads on NVIDIA GPUs). Several warps constitute a thread block. Several thread blocks are assigned to a Streaming Multiprocessor (SM). Several SMs constitute the whole GPU, which executes the entire kernel grid.

At the next level, the packing of threads into warps is done by the warp scheduler that sits inside each SM.

A higher-level scheduler dispatches thread blocks to the SMs.

SM = Streaming Multiprocessor

SP = Streaming Processor = CUDA Core

Total SP/CUDA cores = number of SMs * number of SP/CUDA cores per SM

Three kernels, K1, K2, and K3, were launched by a single task to a single stream, and three additional kernels, K4, K5, and K6, were launched by a second task to two separate streams. The kernels are numbered by launch time. Copy operations occurred after K2 and K5, before and after K3, and after K6, in their respective streams.

GPU timeline produced by this experiment: each rectangle represents a thread block; the jth block of kernel Kk is labeled "Kk:j." The left and right boundaries of each rectangle correspond to that block's start and end times, as measured on the GPU using the global timer register. The height of each rectangle is the number of threads used by the block. The y-position of each rectangle indicates the SM upon which it executed. Arrows below the x-axis indicate kernel launch times.