Programmer's view: threads, thread blocks, and the grid are essentially the programmer's abstraction of how work is organized.
Hardware view: the hardware groups threads that execute the same instruction into warps. Several warps constitute a thread block, several thread blocks are assigned to a Streaming Multiprocessor (SM), and several SMs constitute the whole GPU, which executes the entire kernel grid.
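To make the two views concrete, here is a minimal CUDA sketch. The kernel name, data, and launch sizes are illustrative assumptions, not taken from the notes: the programmer only chooses the grid and block dimensions, while the hardware splits each 256-thread block into 8 warps of 32 threads and assigns whole blocks to SMs.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread finds its place in the
// programmer-visible hierarchy (grid -> block -> thread).
__global__ void scaleVector(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Programmer's view: pick a block size and a grid size.
    // Hardware view: each 256-thread block runs as 8 warps of 32 threads,
    // and complete blocks are dispatched to the SMs.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scaleVector<<<grid, block>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```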
For example, the TX2 GPU has two streaming multiprocessors (SMs), each providing 128 1.3-GHz cores; the two SMs share a 512-KB L2 cache.
Three kernels, K1, K2, and K3, were launched by a single task to a single stream, and three additional kernels, K4, K5, and K6, were launched by a second task to two separate streams. The kernels are numbered by launch time. Copy operations occurred after K2 and K5, before and after K3, and after K6, in their respective streams.
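The following sketch shows one way such an experiment could be issued with the CUDA stream API. The kernel bodies, buffer names, and the exact split of K4–K6 between the two streams are assumptions; in the original experiment the second set of kernels also comes from a separate task (process), whereas here everything is issued from one program for brevity.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; the actual kernels K1..K6 from the experiment are not shown.
__global__ void work(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main()
{
    const int n = 1 << 16;
    float *d_buf, *h_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaMallocHost(&h_buf, n * sizeof(float));   // pinned host memory for async copies

    // s1 stands in for the first task's single stream; s2 and s3 for the
    // second task's two streams.
    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);

    dim3 block(256), grid((n + block.x - 1) / block.x);
    size_t bytes = n * sizeof(float);

    // First task, single stream: K1, K2, copy, copy, K3, copy (serialized within s1).
    work<<<grid, block, 0, s1>>>(d_buf, n);                               // K1
    work<<<grid, block, 0, s1>>>(d_buf, n);                               // K2
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, s1);     // copy after K2
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, s1);     // copy before K3
    work<<<grid, block, 0, s1>>>(d_buf, n);                               // K3
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, s1);     // copy after K3

    // Second task, two streams (illustrative split of K4..K6).
    work<<<grid, block, 0, s2>>>(d_buf, n);                               // K4
    work<<<grid, block, 0, s3>>>(d_buf, n);                               // K5
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, s3);     // copy after K5
    work<<<grid, block, 0, s2>>>(d_buf, n);                               // K6
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, s2);     // copy after K6

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2); cudaStreamDestroy(s3);
    cudaFreeHost(h_buf);   cudaFree(d_buf);
    return 0;
}
```

Operations in the same stream execute in launch order, while operations in different streams may overlap, which is what makes the resulting timeline interesting to inspect.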
GPU timeline produced by this experiment. Each rectangle represents a block: the j th block of kernel Kk is labeled “Kk:j.” The left and right boundaries of each rectangle correspond to that block’s start and end times, as measured on the GPU using the global timer register. The height of each rectangle is the number of threads used by the block. The y-position of each rectangle indicates the SM upon which it executed. Arrows below the x-axis indicate kernel launch times.
A high-level scheduler dispatches thread blocks to the SMs
SM = Streaming Multiprocessor
SP = Streaming Processor = CUDA Core
Total SP/CUDA cores = number of SMs × number of SP/CUDA cores per SM
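For the TX2 above this gives 2 SMs × 128 cores per SM = 256 CUDA cores. A small sketch of how to compute this at runtime is below; note that the CUDA runtime reports the SM count, but the cores-per-SM figure is not exposed by the API, so it is an assumed constant here (128 for the TX2's SMs).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Assumed value: cores per SM is architecture-specific and not
    // queryable through the runtime API (128 on the TX2).
    const int assumedCoresPerSM = 128;
    int totalCores = prop.multiProcessorCount * assumedCoresPerSM;

    printf("SMs: %d, assumed cores/SM: %d, total CUDA cores: %d\n",
           prop.multiProcessorCount, assumedCoresPerSM, totalCores);
    return 0;
}
```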
At the next level, threads are packed into warps by the warp scheduler that sits inside each SM, as sketched below.
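A minimal sketch of how this packing looks from a thread's point of view; the kernel name and launch configuration are illustrative. Consecutive threads within a block are grouped into warps of warpSize (32) threads, and each warp is scheduled as a unit by the SM.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread works out which warp and lane it was packed into.
// warpSize is a built-in constant (32 on current NVIDIA GPUs).
__global__ void showWarpPacking()
{
    int warpId = threadIdx.x / warpSize;   // which warp inside the block
    int laneId = threadIdx.x % warpSize;   // position within the warp
    if (laneId == 0)
        printf("Block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main()
{
    // A 128-thread block is packed into 4 warps of 32 threads
    // by the warp scheduler inside the SM.
    showWarpPacking<<<2, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```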