More effort in designing the code, less effort in coding

Thread to Warp
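
A minimal sketch of the thread-to-warp mapping, assuming the usual CUDA convention: threads in a block are numbered linearly (x fastest, then y, then z) and each run of 32 consecutive threads forms one warp. The function names and the `WARP_SIZE` constant are illustrative only, and this runs on the CPU just to show the arithmetic.

```python
WARP_SIZE = 32  # assumed warp size (32 on current NVIDIA GPUs)


def thread_to_warp(thread_idx, block_dim):
    """Map a (x, y, z) thread index inside a block to (warp_id, lane_id)."""
    tx, ty, tz = thread_idx
    bx, by, _ = block_dim
    # Linear thread ID inside the block: x varies fastest, then y, then z.
    linear = tx + ty * bx + tz * bx * by
    return linear // WARP_SIZE, linear % WARP_SIZE


def warps_per_block(block_dim):
    """Number of warps needed for a block; the last warp may be partly empty."""
    bx, by, bz = block_dim
    threads = bx * by * bz
    return (threads + WARP_SIZE - 1) // WARP_SIZE  # ceiling division


# Thread (5, 2, 0) in a 16 x 16 x 1 block has linear ID 5 + 2 * 16 = 37,
# so it is lane 5 of warp 1.
print(thread_to_warp((5, 2, 0), (16, 16, 1)))
print(warps_per_block((16, 16, 1)))
```

A block size that is not a multiple of 32 (for example 10 x 10 = 100 threads) still occupies 4 full warps, so some lanes of the last warp stay idle.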

Run C++ and Python code on the CUDA cores of GPUs

Run C++ and Python code on the TP cores of GPUs

Data read and write

    1) CPU to GPU

    2) GPU to GPU
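
The two transfer directions above can be sketched in Python, assuming the third-party CuPy library is installed and a CUDA device is present. The `round_trip` helper and the `HAVE_GPU` flag are hypothetical names used for illustration. Here "GPU to GPU" is shown as a copy inside one device's memory; a copy between two separate GPUs is not covered by this sketch.

```python
try:
    import cupy as cp  # optional third-party library, needs an NVIDIA GPU
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no usable CUDA device
    cp = None
    HAVE_GPU = False


def round_trip(host_values):
    """Copy a list of floats CPU -> GPU -> GPU -> CPU and return it as a list."""
    if not HAVE_GPU:
        # Fallback so the sketch still runs without a GPU:
        # no device transfer happens on this path.
        return [float(v) for v in host_values]
    on_gpu = cp.asarray(host_values, dtype=cp.float32)  # 1) CPU to GPU
    on_gpu_copy = on_gpu.copy()                         # 2) GPU to GPU (same device)
    return cp.asnumpy(on_gpu_copy).tolist()             # back to the CPU


print(round_trip([1.0, 2.0, 3.0]))
```

Transfers between CPU and GPU memory are usually much slower than computation on the device, so it tends to pay off to move data once and keep it on the GPU for as many steps as possible.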

DLA core (Deep Learning Accelerator core): a dedicated hardware unit, separate from the CUDA cores, for running deep learning networks (mostly neural networks)