
Containers losing GPUs with error: "Failed to initialize NVML: Unknown Error"

https://github.com/NVIDIA/nvidia-docker/issues/1730

5. Workarounds

The following workarounds are available for both standalone Docker environments and Kubernetes (k8s) environments. The options are listed in order of preference; the first is the most recommended.

For Docker environments


  • Using the nvidia-ctk utility:

    The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in /dev/char for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:

    sudo nvidia-ctk system create-dev-char-symlinks \
        --create-all
    

    This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
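
    To confirm that the symlinks were created, a quick check is shown below (a sketch; the exact device names vary by system):

    ls -l /dev/char/ | grep -i nvidia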

    A simple udev rule to enforce this can be seen below:

    # This will create /dev/char symlinks to all device nodes
    ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system 	create-dev-char-symlinks --create-all"
    

    A good place to install this rule would be:
    /lib/udev/rules.d/71-nvidia-dev-char.rules
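
    After installing the rule file, reload the udev rules so the rule takes effect the next time the driver is loaded (a sketch; for the current boot, the nvidia-ctk command above can simply be run once manually):

    sudo udevadm control --reload-rules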

    In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:

    sudo nvidia-ctk system create-dev-char-symlinks \
        --create-all \
        --driver-root={{NVIDIA_DRIVER_ROOT}}
    

    Where {{NVIDIA_DRIVER_ROOT}} is the path at which the NVIDIA GPU Driver Container installs the NVIDIA GPU driver and creates the NVIDIA device nodes.
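
    For example, when the driver container is deployed through the NVIDIA GPU Operator, the driver root is commonly /run/nvidia/driver (an assumption; verify the path used by your deployment):

    sudo nvidia-ctk system create-dev-char-symlinks \
        --create-all \
        --driver-root=/run/nvidia/driver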


  • Method 2 (WORKING FOR ME): Explicitly disabling systemd cgroup management in Docker

    • Set the parameter "exec-opts": ["native.cgroupdriver=cgroupfs"] in the /etc/docker/daemon.json file and restart Docker (see the example configuration and restart sketch below).
    • Example /etc/docker/daemon.json:

      {
          "runtimes": {
              "nvidia": {
                  "args": [],
                  "path": "nvidia-container-runtime"
              }
          },
          "exec-opts": ["native.cgroupdriver=cgroupfs"]
      }
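
    • After updating /etc/docker/daemon.json, restart Docker and confirm that containers can reach the GPUs again (a sketch; the CUDA image tag is only an example, use any CUDA image available to you):

      sudo systemctl restart docker
      docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
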
  • Method 3:

    • Downgrading to docker.io packages where systemd is not the default cgroup manager (and, of course, not overriding that default).
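
    • To confirm which cgroup driver the installed Docker daemon is using (a sketch):

      docker info --format '{{.CgroupDriver}}'   # expect: cgroupfs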