Practice CUDA core

CUDA Core

Sum = Sum + (a x b)

D = C + (A x B)

where A, B, C, and D are scalar numbers.
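This accumulating multiply-add is the scalar operation a CUDA core carries out. A minimal sketch of it as a CUDA kernel follows; the kernel name fma_kernel and the length parameter n are illustrative, not from the referenced git code:

    // Each thread performs one scalar fused multiply-add: d = c + (a x b).
    __global__ void fma_kernel(const float *a, const float *b,
                               const float *c, float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d[i] = fmaf(a[i], b[i], c[i]);  // single FMA instruction
    }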

An NVIDIA GPU contains 1-N Streaming Multiprocessors (SMs). Each SM has 1-4 warp schedulers. Each warp scheduler has a register file and multiple execution units. The execution units may be exclusive to the warp scheduler or shared between schedulers. Execution units include CUDA cores (FP/INT), special function units, texture units, and load/store units. The Fermi and Kepler white papers provide additional information.
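This per-SM structure can be inspected at runtime through the CUDA runtime API; a hedged sketch, assuming device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0
        printf("SM count       : %d\n", prop.multiProcessorCount);
        printf("Warp size      : %d\n", prop.warpSize);
        printf("Max threads/SM : %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }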

A "queue of blocks" is associated with each kernel launch.

As resources on an SM become available, the block scheduler deposits a block from the queue onto that SM. The block scheduler does not deposit blocks warp by warp; it is an all-or-nothing proposition, on a block-by-block basis.

Let us consider a block that has already been deposited on an SM. A warp is "eligible" when it has one or more instructions that are ready to be executed.


1. Block scheduling does not include warp scheduling.

2. The block scheduler is a device-wide entity.

3. The warp scheduler is a per-SM entity.
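A small sketch of the block queue described above; the kernel body and launch sizes are illustrative. The launch enqueues far more blocks than the device can hold at once, and the block scheduler deposits them, whole blocks at a time, as SMs free up:

    #include <cuda_runtime.h>

    __global__ void work(float *out)
    {
        if (threadIdx.x == 0)
            out[blockIdx.x] = (float)blockIdx.x;  // one value per block
    }

    int main()
    {
        const int numBlocks = 10000;  // far more than can be resident at once
        float *d_out;
        cudaMalloc(&d_out, numBlocks * sizeof(float));
        // Only a subset of blocks is resident at any instant; the rest wait
        // in the per-launch queue until an SM has free registers, shared
        // memory, and warp slots.
        work<<<numBlocks, 256>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }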

In Ampere GPUs, the maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64). The high-priority recommendation from those guides is to find ways to parallelize sequential code.
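The warp limit can be derived from queried device properties; a sketch, again assuming device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // e.g., 2048 threads / 32 threads per warp = 64 warps on Volta and A100.
        printf("Max concurrent warps per SM: %d\n",
               prop.maxThreadsPerMultiProcessor / prop.warpSize);
        return 0;
    }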

.cu and .py code for NVIDIA devices (ref: in git)

Devices with the same major revision number are of the same core architecture.   

CUDA version (the software version) versus compute capability

The compute capability version of a particular GPU should not be confused with the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. 
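The two versions can be printed side by side with the runtime API; a hedged sketch, assuming device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int runtimeVersion = 0, driverVersion = 0;
        cudaRuntimeGetVersion(&runtimeVersion);  // CUDA software platform version
        cudaDriverGetVersion(&driverVersion);
        printf("Compute capability : %d.%d\n", prop.major, prop.minor);  // hardware
        printf("CUDA runtime       : %d\n", runtimeVersion);  // e.g., 12040 for 12.4
        printf("CUDA driver        : %d\n", driverVersion);
        return 0;
    }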

The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting its compute capability version, new versions of the platform typically also include software features that are independent of hardware generation.

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps, and each warp gets scheduled by a warp scheduler for execution.

The way a block is partitioned into warps is always the same: each warp contains threads of consecutive, increasing thread IDs, with the first warp containing thread 0. A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.
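A short sketch of this fixed partitioning and of a divergence-free branch; the names warpId and lane are illustrative:

    __global__ void divergence_demo(int *out)
    {
        int lane   = threadIdx.x % warpSize;  // position within the warp
        int warpId = threadIdx.x / warpSize;  // warp index within the block

        // Branching on warpId keeps all 32 lanes of a warp on the same path
        // (full efficiency); branching on lane would split the warp and
        // serialize both paths.
        if (warpId % 2 == 0)
            out[threadIdx.x] = lane;
        else
            out[threadIdx.x] = -lane;
    }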

The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements.

Compute capability of a given NVIDIA device (ref)

C++ programming guide to using CUDA cores (ref)

GPU Survival Toolkit for the AI age (ref)