CUDA Toolkit 3.2 downloads

CUDA C++ language and compiler improvements

CUDA 11 is also the first release to officially include CUB as part of the CUDA Toolkit. CUB is now one of the supported CUDA C++ core libraries. 

One of the major features in nvcc for CUDA 11 is support for link-time optimization (LTO) to improve the performance of separately compiled code. LTO, using the -dlto or --dlink-time-opt options, stores intermediate code during compilation and then performs higher-level optimizations at link time, such as inlining code across files. 

nvcc in CUDA 11 adds support for ISO C++17 and for new host compilers across PGI, gcc, clang, Arm, and Microsoft Visual Studio. If you want to experiment with host compilers not yet supported, nvcc supports the new -allow-unsupported-compiler flag during the compile-build workflow. nvcc adds other new features, including the following:

  • Improved lambda support
  • Dependency file generation enhancements (-MD, -MMD options)
  • Pass-through options to the host compiler  

Figure 4. Platform support in CUDA 11.  


Description of Download | Link to Binaries | Documents

Developer Drivers for MacOS (download). Documents: Getting Started Guide for Mac, Release Notes for Mac, *Updated* CUDA C Programming Guide, CUDA C Best Practices Guide, CUDA Reference Manual, API Reference, PTX ISA 2.1, Visual Profiler User Guide, Visual Profiler Release Notes, Fermi Compatibility Guide, *Updated* Fermi Tuning Guide, CUBLAS User Guide, CUFFT User Guide, CUDA Developer Guide for Optimus Platforms, License

CUDA Toolkit

  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation
GPU Computing SDK code samples (download). Documents: CUDA C/C++ Release Notes, CUDA Occupancy Calculator, License


Description of Download | Link to Binaries | Documents

Developer Drivers for MacOS (download)

CUDA Toolkit

  • C/C++ compiler
  • CUDA Visual Profiler
  • OpenCL Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • Additional tools and documentation

Documents: Getting Started Guide for Mac, Release Notes for Mac, CUDA C Programming Guide, CUDA C Best Practices Guide, OpenCL Programming Guide, OpenCL Best Practices Guide, OpenCL Implementation Notes, CUDA Reference Manual, API Reference, PTX ISA 2.0, Visual Profiler User Guide, Visual Profiler Release Notes, Fermi Compatibility Guide, Fermi Tuning Guide, CUBLAS User Guide, CUFFT User Guide, License

NVIDIA Performance Primitives (NPP) library (download)
GPU Computing SDK code samples (download). Documents: Release Notes for CUDA C, Release Notes for OpenCL, CUDA Occupancy Calculator, License

Host Code

The main function declares two pairs of arrays.

  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

The pointers x and y point to the host arrays, allocated with malloc in the typical fashion, and the d_x and d_y arrays point to device arrays allocated with the cudaMalloc function from the CUDA runtime API. The host and device in CUDA have separate memory spaces, both of which can be managed from host code (CUDA C kernels can also allocate device memory on devices that support it).

The host code then initializes the host arrays. Here we set x to an array of ones, and y to an array of twos.

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

To initialize the device arrays, we simply copy the data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy, which works just like the standard C memcpy function, except that it takes a fourth argument which specifies the direction of the copy. In this case we use cudaMemcpyHostToDevice to specify that the first (destination) argument is a device pointer and the second (source) argument is a host pointer.

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

After running the kernel, to get the results back to the host, we copy from the device array pointed to by d_y to the host array pointed to by y by using cudaMemcpy with cudaMemcpyDeviceToHost.

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

Launching a Kernel

The saxpy kernel is launched by the statement:

saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);

The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel. In CUDA there is a hierarchy of threads in software which mimics how thread processors are grouped on the GPU. In the CUDA programming model we speak of launching a kernel with a grid of thread blocks. The first argument in the execution configuration specifies the number of thread blocks in the grid, and the second specifies the number of threads in a thread block.

Thread blocks and grids can be made one-, two- or three-dimensional by passing dim3 (a simple struct defined by CUDA with x, y, and z members) values for these arguments, but for this simple example we only need one dimension so we pass integers instead. In this case we launch the kernel with thread blocks containing 256 threads, and use integer arithmetic to determine the number of thread blocks required to process all N elements of the arrays ((N+255)/256).

For cases where the number of elements in the arrays is not evenly divisible by the thread block size, the kernel code must check for out-of-bounds memory accesses.

Cleaning Up

After we are finished, we should free any allocated memory. For device memory allocated with cudaMalloc, simply call cudaFree. For host memory, use free as usual.


The Benefits of Unified Memory on Pascal and Later GPUs

Starting with the Pascal GPU architecture, Unified Memory functionality is significantly improved with 49-bit virtual addressing and on-demand page migration. 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system. The Page Migration engine allows GPU threads to fault on non-resident memory accesses so the system can migrate pages on demand from anywhere in the system to the GPU’s memory for efficient processing.

In other words, Unified Memory transparently enables oversubscribing GPU memory, enabling out-of-core computations for any code that is using Unified Memory for allocations (e.g. cudaMallocManaged()). It “just works” without any modifications to the application, whether running on one GPU or multiple GPUs.

Also, Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs. This is useful in writing efficient multi-GPU cooperative algorithms.

Demand paging can be particularly beneficial to applications that access data with a sparse pattern. In some applications, it’s not known ahead of time which specific memory addresses a particular processor will access. Without hardware page faulting, applications can only pre-load whole arrays, or suffer the cost of high-latency off-device accesses (also known as “Zero Copy”). But page faulting means that only the pages the kernel accesses need to be migrated.

Understanding how a GPU works:


  1. Obtain the data for the computation.
  2. Copy that data to GPU memory.
  3. Perform the computation on the GPU via a kernel function.
  4. Copy the computed data from GPU memory back to host RAM.
  5. Examine the results.
  6. Free the resources used.


cudaMalloc parameters:

  1. devPtr – pointer that receives the address of the allocated memory,
  2. count – size of the allocation in bytes.

cudaMalloc return values:

  1. cudaSuccess – memory was allocated successfully
  2. cudaErrorMemoryAllocation – the allocation failed

cudaMemcpy parameters:

  1. dst – pointer holding the destination address of the copy,
  2. src – pointer holding the source address of the copy,
  3. count – size of the copied data in bytes,
  4. cudaMemcpyKind – enumeration specifying the direction of the copy (one of cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyHostToHost, cudaMemcpyDeviceToDevice).

cudaMemcpy return values:

  1. cudaSuccess – the copy succeeded
  2. cudaErrorInvalidValue – invalid argument values (for example, a negative copy size)
  3. cudaErrorInvalidDevicePointer – invalid device memory pointer
  4. cudaErrorInvalidMemcpyDirection – invalid copy direction (for example, source and destination swapped)


cudaEventCreate parameter:

  1. *event – pointer that receives the event handle.

Return values:

  1. cudaSuccess – success
  2. cudaErrorInitializationError – initialization error
  3. cudaErrorPriorLaunchFailure – a previous asynchronous launch failed
  4. cudaErrorInvalidValue – invalid value
  5. cudaErrorMemoryAllocation – memory allocation error


cudaEventRecord parameters:

  1. event – handle of the event being recorded,
  2. stream – the stream in which to record the event (in our case, the default stream 0).

Return values:

  1. cudaSuccess – success
  2. cudaErrorInvalidValue – invalid value
  3. cudaErrorInitializationError – initialization error
  4. cudaErrorPriorLaunchFailure – a previous asynchronous launch failed
  5. cudaErrorInvalidResourceHandle – invalid event handle

cudaEventSynchronize parameter:

  1. event – handle of the event whose completion is awaited.

Return values:

  1. cudaSuccess – success
  2. cudaErrorInitializationError – initialization error
  3. cudaErrorPriorLaunchFailure – a previous asynchronous launch failed
  4. cudaErrorInvalidValue – invalid value
  5. cudaErrorInvalidResourceHandle – invalid event handle


CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance
by harnessing the power of the graphics processing unit (GPU).

CUDA was developed with several design goals in mind:

  • Provide a small set of extensions to standard programming languages, like C, that enable a straightforward implementation
    of parallel algorithms. With CUDA C/C++, programmers can focus on the task of parallelization of the algorithms rather than
    spending time on their implementation.
  • Support heterogeneous computation where applications use both the CPU and GPU. Serial portions of applications are run on
    the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be incrementally applied to existing applications.
    The CPU and GPU are treated as separate devices that have their own memory spaces. This configuration also allows simultaneous
    computation on the CPU and GPU without contention for memory resources.

CUDA-capable GPUs have hundreds of cores that can collectively run thousands of computing threads. These cores have shared
resources including a register file and a shared memory. The on-chip shared memory allows parallel tasks running on these
cores to share data without sending it over the system memory bus.

This guide will show you how to install and check the correct operation of the CUDA development tools.

To use CUDA on your system, you will need the following installed:

  • A CUDA-capable GPU
  • A supported version of Microsoft Windows
  • A supported version of Microsoft Visual Studio
  • The NVIDIA CUDA Toolkit

The next two tables list the currently supported Windows operating systems and compilers.

Table 1. Windows Operating System Support in CUDA 11.4
Operating System    | Native x86_64 | Cross (x86_32 on x86_64)
Windows 10          | YES           | NO
Windows Server 2022 | YES           | NO
Windows Server 2019 | YES           | NO
Windows Server 2016 | YES           | NO

Table 2. Windows Compiler Support in CUDA 11.4

Compiler*         | IDE                                           | Native x86_64 | Cross (x86_32 on x86_64)
MSVC Version 192x | Visual Studio 2019 16.x                       | YES           | YES
MSVC Version 191x | Visual Studio 2017 15.x (RTW and all updates) | YES           | YES

* Support for Visual Studio 2015 is deprecated in release 11.1.

x86_32 support is limited. See the section for details.

For more information on MSVC versions and Visual Studio product versions, visit the Microsoft Visual Studio documentation.

Native development using the CUDA Toolkit on x86_32 is unsupported. Deployment and execution of CUDA applications on x86_32
is still supported, but is limited to use with GeForce GPUs.

To create 32-bit CUDA applications, use the cross-development capabilities of the CUDA Toolkit on x86_64.

Support for developing and running x86 32-bit applications on x86_64 Windows is limited to use with:

  • GeForce GPUs
  • CUDA Driver
  • CUDA Runtime (cudart)
  • CUDA Math Library (math.h)
  • CUDA C++ Compiler (nvcc)
  • CUDA Development Tools


CUDA 11 provides a foundational development environment for building applications for the NVIDIA Ampere GPU architecture and powerful server platforms built on the NVIDIA A100 for AI, data analytics, and HPC workloads, both for on-premises (DGX A100) and cloud (HGX A100) deployments.  

Figure 12. Different ways to get CUDA 11.

CUDA 11 is now available. As always, you can get CUDA 11 in several ways: download local installer packages, install using package managers, or grab containers from various registries. For enterprise deployments, CUDA 11 also includes driver packaging improvements for RHEL 8 using modularity streams to improve stability and reduce installation time. 

To learn more about CUDA 11 and get answers to your questions, register for the following upcoming live webinars:

  • Inside the NVIDIA Ampere Architecture
  • CUDA New Features and Beyond
  • Inside the HPC SDK: The Compilers, Libraries, and Tools for Accelerated Computing
  • CUDA on NVIDIA Ampere Architecture: Taking Your Algorithms to the Next Level of Performance
  • Optimizing Applications for NVIDIA Ampere GPU Architecture
  • Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide
  • Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100

Also, watch out for the following related GTC talks for deep dives on the features for A100 covered in this post. These GTC recorded talks will be posted during the month of May:

  • How CUDA Math Libraries Can Help You Unleash the Power of the New NVIDIA A100 GPU
  • Inside NVIDIA’s ​Multi-Instance GPU Feature
  • CUDA Developer Tools: Overview & Exciting New Features​​

Finally, register for the NVIDIA Developer Program to receive updates on CUDA 11 and future releases of CUDA.

Out of the Blocks

CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, a Tesla P100 GPU based on the Pascal GPU Architecture has 56 SMs, each capable of supporting up to 2048 active threads. To take full advantage of all these threads, I should launch the kernel with multiple thread blocks.

By now you may have guessed that the first parameter of the execution configuration specifies the number of thread blocks. Together, the blocks of parallel threads make up what is known as the grid. Since I have N elements to process, and 256 threads per block, I just need to calculate the number of blocks to get at least N threads. I simply divide N by the block size (being careful to round up in case N is not a multiple of 256).

I also need to update the kernel code to take into account the entire grid of thread blocks. CUDA provides gridDim.x, which contains the number of blocks in the grid, and blockIdx.x, which contains the index of the current thread block in the grid. Figure 1 illustrates the approach to indexing into an array (one-dimensional) in CUDA using blockDim.x, gridDim.x, and threadIdx.x. The idea is that each thread gets its index by computing the offset to the beginning of its block (the block index times the block size: blockIdx.x * blockDim.x) and adding the thread’s index within the block (threadIdx.x). The code blockIdx.x * blockDim.x + threadIdx.x is idiomatic CUDA.

The updated kernel also sets stride to the total number of threads in the grid (blockDim.x * gridDim.x). This type of loop in a CUDA kernel is often called a grid-stride loop.

Save the file, then compile and run it under the profiler again.

That’s another 28x speedup, from running multiple blocks on all the SMs of a K80! We’re only using one of the 2 GPUs on the K80, but each GPU has 13 SMs. Note that the GeForce in my laptop has 2 (weaker) SMs, and it takes 680us to run the kernel.

New Features

  • As part of the CUDA Toolkit 11.0 release,
    • CUPTI adds tracing and profiling support for devices with compute capability 8.0, i.e., NVIDIA A100 GPUs and systems based on A100.
    • CUPTI adds support for the Arm server platform (arm64 SBSA).
    • Enhancements for CUDA Graph:
      • Support to correlate the CUDA Graph node with the GPU activities: kernel, memcpy, memset.
        • Added a new field graphNodeId for Node Id in the activity records for kernel, memcpy, memset and P2P transfers. Activity records CUpti_ActivityKernel4, CUpti_ActivityMemcpy2, CUpti_ActivityMemset, and CUpti_ActivityMemcpyPtoP are deprecated and replaced by new activity records CUpti_ActivityKernel5, CUpti_ActivityMemcpy3, CUpti_ActivityMemset2, and CUpti_ActivityMemcpyPtoP2.
        • graphNodeId is the unique ID for the graph node.
        • graphNodeId can be queried using the new CUPTI API cuptiGetGraphNodeId().
        • Callback CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED is issued between a pair of the API enter and exit callbacks.
      • Introduced new callback CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED to indicate the cloning of the CUDA Graph node.
      • Retain CUDA driver performance optimization in case memset node is sandwiched between kernel nodes. CUPTI no longer disables the conversion of memset nodes into kernel nodes for CUDA graphs.
      • Added support for cooperative kernels in CUDA graphs.
    • Fixed issues in the cuptiFinalize() API, including an issue that could cause the application to crash. This API provides the ability to safely and fully detach CUPTI during execution of the application.
    • Added support for tracing OptiX applications.
    • PC sampling overhead is reduced by avoiding the reconfiguration of the GPU when PC sampling period doesn’t change between successive kernels. This is applicable for devices with compute capability 7.0 and higher.
    • CUPTI overhead is associated with the thread rather than process. Object kind of the overhead record CUpti_ActivityOverhead is switched to CUPTI_ACTIVITY_OBJECT_THREAD.
    • Added error code CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED to indicate the presence of another CUPTI subscriber. The API cuptiSubscribe() returns the new error code instead of CUPTI_ERROR_MAX_LIMIT_REACHED.
    • Added a new enum CUpti_FuncShmemLimitConfig to indicate whether the user has opted in for the maximum dynamic shared memory size on devices with compute capability 7.x by using the function attributes CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES or cudaFuncAttributeMaxDynamicSharedMemorySize with the CUDA driver and runtime, respectively. The field shmemLimitConfig in the kernel activity record CUpti_ActivityKernel5 shows the user’s choice. This helps in correct occupancy calculation. The value FUNC_SHMEM_LIMIT_OPTIN in the enum cudaOccFuncShmemConfig is the corresponding option in the CUDA occupancy calculator.

Developer tools

CUDA 11 continues to add rich features to the existing portfolio of developer tools.  This includes familiar plugins for Visual Studio, with the NVIDIA Nsight Integration for Visual Studio,  and Eclipse, with Nsight Eclipse Plugins Edition. It also includes standalone tools, such as Nsight Compute for kernel profiling, and Nsight Systems for system-wide performance analysis. Nsight Compute and Nsight Systems are now supported on all three CPU architectures supported by CUDA: x86, POWER, and Arm64. 

Figure 10. Nsight Systems (tracing) and Nsight Compute (kernel profiling).

One of the key features of Nsight Compute for CUDA 11 is the ability to generate the Roofline model of the application. A Roofline model is a visually intuitive method for you to understand kernel characteristics by combining floating-point performance, arithmetic intensity, and memory bandwidth into a two-dimensional plot.

By looking at the Roofline model, you can quickly determine whether the kernel is compute-bound or memory-bound. You can also understand potential directions for further optimizations, for example, kernels that are near the roofline make optimal use of computational resources. 

For more information, see Roofline Performance Model.

  Figure 11. A Roofline model in Nsight Compute.

CUDA 11 includes the Compute Sanitizer, a next-generation, functional correctness checking tool that provides runtime checking for out-of-bounds memory accesses and race conditions. Compute Sanitizer is intended to be a replacement for the cuda-memcheck tool. 

For example, Compute Sanitizer’s memcheck tool reports out-of-bounds memory accesses, and its racecheck tool reports shared-memory data races.

Finally, even though CUDA 11 no longer supports running applications on macOS, we are making developer tools available for users on macOS hosts: 

  • Remote target debugging using cuda-gdb
  • NVIDIA Visual Profiler
  • Nsight Eclipse plugins 
  • The Nsight family of tools for remote profiling or tracing 

CUDA libraries

The libraries in CUDA 11 continue to push the boundaries of performance and developer productivity by using the latest and greatest A100 hardware features behind familiar drop-in APIs in linear algebra, signal processing, basic mathematical operations, and image processing. 

Figure 5. New features in the CUDA math libraries for NVIDIA A100.

Across the linear algebra libraries, you will see Tensor Core acceleration for the full range of precisions available on A100, including FP16, Bfloat16, TF32, and FP64. This includes BLAS3 operations in cuBLAS, factorizations and dense linear solvers in cuSOLVER, and tensor contractions in cuTENSOR. 

In addition to the enhanced range of precisions, restrictions on matrix dimensions and alignment for Tensor Core acceleration have been removed. For appropriate precisions, the acceleration is now automatic, requiring no user opt-in. The heuristics for cuBLAS automatically adapt to resources when running on the GPU instances with MIG on A100.

Figure 6. Mixed-precision matrix multiply on A100 with cuBLAS.

CUTLASS, the CUDA C++ template abstractions for high-performance GEMM, supports all the various precision modes offered by A100. With CUDA 11, CUTLASS now achieves more than 95% performance parity with cuBLAS. This allows you to write your own custom CUDA kernels for programming the Tensor Cores in NVIDIA GPUs. 

Figure 7. CUTLASS template abstractions for NVIDIA A100.
Figure 8. How CUDA libraries allow you to take advantage of Tensor Cores in A100.

cuFFT takes advantage of the larger shared memory size in A100, resulting in better performance for single-precision FFTs at larger batch sizes. Finally, on multi-GPU A100 systems, cuFFT scales and delivers 2X performance per GPU compared to V100. 

nvJPEG is a GPU-accelerated library for JPEG decoding. Together with NVIDIA DALI, a data augmentation and image loading library, it can accelerate deep learning training on image classification models, especially computer vision. The libraries accelerate the image decode and data augmentation phase of the deep learning workflow. 

The A100 includes a 5-core hardware JPEG decode engine and nvJPEG takes advantage of the hardware backend for batched processing of JPEG images. JPEG acceleration by a dedicated hardware block alleviates bottlenecks on the CPU and allows better GPU utilization.  

Selecting the hardware decoder is done automatically for a given image, or the hardware backend can be selected explicitly in the init function (nvjpegCreateEx()). nvJPEG provides acceleration of baseline JPEG decode, and various color conversion formats, for example, YUV 420, 422, and 444. 

Figure 9 shows that this results in up to 18x faster image decode compared to CPU-only processing. If you use DALI, you can directly benefit from this hardware acceleration because nvJPEG is abstracted. 

Figure 9. nvJPEG Speedup vs. CPU. (Batch 128 with Intel Platinum 8168 @2GHz 3.7GHz Turbo HT on; with TurboJPEG)

There are many more features in the CUDA math libraries than can be covered in a single post.

Key Features

  • Tensor Core acceleration for all popular convolutions including 2D, 3D, Grouped, Depth-wise separable, and Dilated with NHWC and NCHW inputs and outputs
  • Optimized kernels for computer vision and speech models including ResNet, ResNext, EfficientNet, EfficientDet, SSD, MaskRCNN, Unet, VNet, BERT, GPT-2, Tacotron2 and WaveGlow
  • Supports FP32, FP16, BF16 and TF32 floating-point formats, and INT8 and UINT8 integer formats
  • Arbitrary dimension ordering, striding, and sub-regions for 4D tensors mean easy integration into any neural net implementation
  • Speed up fused operations on any CNN architecture

cuDNN is supported on Windows and Linux with Ampere, Turing, Volta, Pascal, Maxwell, and Kepler GPU architectures in data center and mobile GPUs.
