Containers losing GPUs with error: "Failed to initialize NVML: Unknown Error"
https://github.com/NVIDIA/nvidia-docker/issues/1730
5. Workarounds
The following workarounds are available for both standalone Docker environments and k8s environments (multiple options are presented in order of preference; the one at the top is the most recommended):
For Docker environments
1. Using the `nvidia-ctk` utility:

   The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in `/dev/char` for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:

   ```bash
   sudo nvidia-ctk system create-dev-char-symlinks \
       --create-all
   ```

   This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
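   To confirm that the symlinks were actually created, a quick check (this verification step is an illustration, assuming the usual `/dev/nvidia*` device-node naming) is:

   ```bash
   # List the /dev/char entries that link to NVIDIA device nodes
   ls -l /dev/char | grep nvidia
   ```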
   A simple udev rule to enforce this can be seen below:

   ```
   # This will create /dev/char symlinks to all device nodes
   ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
   ```

   A good place to install this rule would be: `/lib/udev/rules.d/71-nvidia-dev-char.rules`

   In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:

   ```bash
   sudo nvidia-ctk system create-dev-char-symlinks \
       --create-all \
       --driver-root={{NVIDIA_DRIVER_ROOT}}
   ```

   Where `{{NVIDIA_DRIVER_ROOT}}` is the path to which the NVIDIA GPU Driver Container installs the NVIDIA GPU driver and creates the NVIDIA device nodes.
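   As an alternative to the udev rule, the command can also be run at boot from a systemd oneshot unit. The unit below is a minimal sketch; the unit name and the `After=` ordering are assumptions, and the driver kernel modules must still be loaded before it runs:

   ```ini
   # /etc/systemd/system/nvidia-dev-char-symlinks.service (hypothetical unit name)
   [Unit]
   Description=Create /dev/char symlinks for NVIDIA device nodes
   # The NVIDIA kernel modules must already be loaded when this runs
   After=systemd-modules-load.service

   [Service]
   Type=oneshot
   ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

   [Install]
   WantedBy=multi-user.target
   ```

   Enable it with `sudo systemctl enable --now nvidia-dev-char-symlinks.service`.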
2. Method 2: Explicitly disabling systemd cgroup management in Docker.

   Set the parameter `"exec-opts": ["native.cgroupdriver=cgroupfs"]` in the `/etc/docker/daemon.json` file and restart Docker:

   ```json
   {
       "runtimes": {
           "nvidia": {
               "args": [],
               "path": "nvidia-container-runtime"
           }
       },
       "exec-opts": ["native.cgroupdriver=cgroupfs"]
   }
   ```
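   To apply and verify the change (these exact commands are an illustration, not part of the original workaround):

   ```bash
   # Restart Docker so the new exec-opts take effect
   sudo systemctl restart docker

   # Confirm that Docker is now using the cgroupfs driver
   docker info --format '{{.CgroupDriver}}'   # expected output: cgroupfs
   ```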
3. Method 3: Downgrading to `docker.io` packages where `systemd` is not the default `cgroup` manager (and not overriding that, of course).
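   The exact package version to pin is distribution-specific; as a rough sketch on an Ubuntu/Debian system (the version string is a placeholder, not a recommendation):

   ```bash
   # List the docker.io versions available from the configured repositories
   apt-cache madison docker.io

   # Downgrade to a chosen version and hold it so it is not upgraded again
   sudo apt-get install --allow-downgrades docker.io=<desired-version>
   sudo apt-mark hold docker.io
   ```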