
GPU Related FAQ

I don’t have GPU instances available to me when I go to launch an instance

Jetstream2 consists of four distinct resources. You must explicitly have access to the Jetstream2-GPU resource to access GPUs. Having access to Jetstream2 (CPU) does not give you access to GPUs. We also highly encourage you to familiarize yourself with the VM instance sizes/flavors and note the difference in burn rate (SU cost per hour).

What GPUs are available?

Jetstream2 features nodes with A100, L40S, and H100 GPUs in the g3, g4, and g5 flavors, respectively. For more information about these flavors and how to gain access to them, see the information on full GPU flavors.

How do I use multiple GPUs on an instance for my research?

To use multiple GPUs on an instance for your research, you will need to launch an instance with one of the two-card flavors (g3.2xl, g4.2xl, or g5.2xl) or four-card flavors (g3.4xl, g4.4xl, or g5.4xl). Please note that multi-GPU flavors are not available by default; you can request them by emailing help@jetstream-cloud.org with proper justification. Multi-GPU flavors give you access to the combined resources of the cards, all of which are attached to a single node.
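Once a multi-GPU instance is running, you can confirm that all cards are visible with a quick check (a minimal sketch, assuming the NVIDIA driver and nvidia-smi are present, as they are on the GPU featured images):

# List every GPU the instance can see; a four-card flavor should show four entries
nvidia-smi --list-gpus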

My GPU is not usable after a kernel update

The NVIDIA drivers are built as kernel modules and should rebuild automatically after a kernel update. If they do not, you can trigger the rebuild manually on Ubuntu 22.04 instances:

ls /usr/lib/modules | sudo xargs -n1 /usr/lib/dkms/dkms_autoinstaller start

This does not work on Red Hat-based instances such as Rocky Linux. We are working on a simple solution for that.
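After the modules rebuild, you can confirm that the driver loaded again (a quick check, assuming the NVIDIA driver packages are installed):

# Confirm the kernel module is loaded and the GPU is visible again
lsmod | grep nvidia
nvidia-smi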

The CUDA debugger (cuda-gdb) doesn’t work on some GPU instances

If you try to use the cuda-gdb debugger, you may get an error like this:

fatal:  One or more CUDA devices cannot be used for debugging

GPU instance flavors smaller than g3.xl (e.g. g3.large) rely on a technology called NVIDIA virtual GPU (vGPU), which is unfortunately known to be incompatible with CUDA debugging and some forms of profiling.

Only instances flavored g3.xl or larger are expected to work with cuda-gdb.
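If you are unsure whether your flavor uses vGPU, you can query the driver directly (a minimal sketch; the exact field names may vary by driver version):

# "Pass-Through" indicates a full GPU; "VGPU" indicates a fractional (vGPU) flavor
nvidia-smi -q | grep -A2 -i "virtualization mode"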

Unified memory doesn’t work on some GPU instances

Like the CUDA debugger, unified memory (cudaMallocManaged) is only expected to work on flavors g3.xl and larger; vGPU-enabled or “fractional” flavors will not be able to allocate unified memory.

Is nvcc/CUDA available on the images or in the software store?

The NVIDIA HPC SDK is available from the Jetstream2 Software Store.

You can do

module avail

on featured images to see available software packages. You should see several with names like nvhpc that provide the HPC SDK software.
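For example, loading one of those modules puts the CUDA compiler on your PATH (a minimal sketch; the exact module name and version on your image may differ):

# Load the NVIDIA HPC SDK module and confirm nvcc is available
module load nvhpc
nvcc --version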

For other GPU software, we highly recommend using containers from NVIDIA where they are available. The NVIDIA Docker Container Catalog is the repository to use.
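As a minimal sketch of the container route (assuming Docker and the NVIDIA Container Toolkit are already set up on the instance, and using an illustrative image tag):

# Run an NVIDIA CUDA container with GPU access and verify the GPU is visible inside it
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi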

What CUDA version do I need for Jetstream2 GPUs?

We recommend using the same major revision as reported by nvidia-smi; however, NVIDIA maintains that CUDA versions are backward compatible up to one major revision back. For example, if nvidia-smi reports:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01    Driver Version: 535.183.01    CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
then it is “safe” to use CUDA 11.x, though CUDA 12.2 is recommended. In this example, CUDA 10.x and older will not work.
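To compare what your installed toolkit provides against what the driver supports, you can check both directly (a quick sketch; nvcc may come from a module or container rather than the base image):

# Driver-supported CUDA version appears in the top-right of the output
nvidia-smi
# Installed CUDA toolkit version, for comparison with the driver's reported version
nvcc --version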

There is a known issue with suspending GPU instances

There is an issue/bug with suspending GPU instances in the version of libvirt that Jetstream2 uses for virtualization.

DO NOT SUSPEND GPU instances.

How do I enable MIG and have the changes persist?

On boot, two services prevent MIG from being enabled (gpu-disable-mig.service and gpu-driver-fix.service). You can stop and disable these services with

sudo systemctl disable --now gpu-disable-mig.service

sudo systemctl disable --now gpu-driver-fix.service

Once the services have been stopped, enable MIG using

sudo nvidia-smi -mig 1

and reboot your instance.
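After the reboot, you can confirm MIG is enabled and begin partitioning the card (a minimal sketch; the available profiles depend on the GPU and driver version):

# Verify MIG mode is now enabled
nvidia-smi --query-gpu=mig.mode.current --format=csv
# List the GPU instance profiles available for partitioning
sudo nvidia-smi mig -lgip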