GPU Related FAQ

There is a known issue with suspending GPU instances

We will update this Status IO Incident with details.

There is a bug with suspending GPU instances in the version of libvirt that Jetstream2 is using for virtualization.

DO NOT SUSPEND GPU instances.

We will have to upgrade the compute nodes to resolve it. This is on the near-term timeline but we do not have a precise date at this time.

In the meantime, please only use stop or shelve with GPU instances.
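
If you use the OpenStack CLI, the equivalent commands look something like this (the instance name is just a placeholder):

openstack server stop my-gpu-instance

openstack server shelve my-gpu-instance

The same actions are also available from the web interfaces.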

I don’t have GPU instances available to me when I go to launch an instance

Jetstream2 consists of four distinct resources. You must explicitly have access to the Jetstream2-GPU resource to access GPUs. Having access to Jetstream2 (CPU) does not give you access to GPUs. We also highly encourage you to familiarize yourself with the VM instance sizes/flavors and note the difference in burn rate (SU cost per hour).
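
If you do have access to Jetstream2-GPU, you can verify which GPU flavors your allocation can see from the OpenStack CLI with, for example:

openstack flavor list

GPU flavor names are distinct from the CPU flavors (for example, the g3.* family); check the instance flavors documentation for the current list and the SU burn rate of each.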

How do I use multiple GPUs on an instance for my research?

The short answer is that you cannot use multiple GPUs on a single instance at this time.

The longer answer is that this is a limitation of the NVIDIA GRID vGPU driver for our hypervisors. Basically, even with NVLINK present, the driver cannot gang multiple GPUs together into a single VM. Recent updates indicate that we may be able to use multiple fractional vGPUs on an instance. Engineers are currently looking into this and we will update this FAQ and the documentation overall accordingly if there is a means to do this.
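
You can confirm how many GPU devices an instance actually sees with, for example:

nvidia-smi -L

which lists one line per visible GPU (or vGPU slice).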

CentOS 7 does not work with my GPU

Due to issues with the NVIDIA GRID driver, we have discontinued support for GPUs using CentOS 7. We will be removing CentOS 7 from the featured images once we have a stable Rocky 9 build available.

My GPU is not usable after a kernel update

The NVIDIA drivers are built as kernel modules and should rebuild on a kernel update. If they do not, you can do this on Ubuntu 20.04 instances:

ls /var/lib/initramfs-tools | sudo xargs -n1 /usr/lib/dkms/dkms_autoinstaller start

For Ubuntu 22.04 instances, you can try:

ls /usr/lib/modules | sudo xargs -n1 /usr/lib/dkms/dkms_autoinstaller start

This doesn’t work on Red Hat-based instances like Rocky or Alma. We’re working on a simple solution for that.
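
On Ubuntu instances, once the modules have rebuilt you can confirm the driver is working again with, for example:

dkms status

nvidia-smi

dkms status should show the nvidia module installed for the running kernel, and nvidia-smi should list the GPU.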

The CUDA debugger (cuda-gdb) doesn’t work on GPU instances

If you use the nvhpc module with the nvcc compiler and try to use the cuda-gdb debugger, you will get an error like this:

fatal:  One or more CUDA devices cannot be used for debugging

There is an issue with vGPU and our configuration that cannot be readily resolved. We are looking into options to work around this problem. There is no estimate for when a workaround will be in place. We apologize for any inconvenience.
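
For reference, a typical sequence that triggers the error looks something like this (the module name/version and source file are placeholders):

module load nvhpc

nvcc -g -G vectoradd.cu -o vectoradd

cuda-gdb ./vectoradd

The -g and -G flags build host and device debug symbols, respectively; the failure appears when cuda-gdb starts the program on the vGPU.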

Unified memory doesn’t work on GPU instances

We can confirm that unified memory is not working under the NVIDIA drivers we’re using. We have reached out to NVIDIA for a timeline on when we might expect that functionality.

Regardless of the timeline, unified memory is not expected to work on fractional GPUs (vGPU slices), only on full-GPU flavors.

We will update this FAQ entry when we have additional information.

Is nvcc/CUDA available on the images or in the software store?

The NVIDIA HPC SDK is available from the Jetstream Software Store.

You can do

module avail

on featured images to see available software packages. You should see several with names like nvhpc that contain the HPC SDK software.
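
For example, to pick up nvcc from the HPC SDK (the exact module name/version may differ; check the module avail output):

module load nvhpc

nvcc --version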

For other GPU software, we highly recommend using containers from NVIDIA where available. The NVIDIA Docker Container Catalog is the repository.
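
For example, pulling a CUDA base image from NVIDIA's registry looks something like this (the tag is illustrative; check the catalog for current tags):

docker pull nvcr.io/nvidia/cuda:12.0.0-base-ubuntu22.04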

What CUDA version do I need for Jetstream2 GPUs?

We recommend using the same major revision as reported by nvidia-smi; however, NVIDIA maintains that the driver is backward compatible with CUDA toolkits up to one major revision older. For example, if nvidia-smi reports:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

then it is “safe” to use CUDA 11.x, though CUDA 12.0 is recommended. In this example, CUDA 10.x and older will not work.
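
To compare the two on an instance, you can run, for example:

nvidia-smi

nvcc --version

nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc --version reports the CUDA toolkit you are actually compiling with.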