Containers losing GPUs with error: "Failed to initialize NVML: Unknown Error"
https://github.com/NVIDIA/nvidia-docker/issues/1730
5. Workarounds
The following workarounds are available for both standalone Docker environments and k8s environments (multiple options are presented in order of preference; the one at the top is the most recommended):
For Docker environments
1. Using the `nvidia-ctk` utility:

   The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in `/dev/char` for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:

   ```bash
   sudo nvidia-ctk system create-dev-char-symlinks \
       --create-all
   ```

   This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
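   To confirm that the symlinks were actually created, a quick check (this verification step is an illustration, assuming the usual `/dev/nvidia*` device-node naming) is:

   ```bash
   # List the /dev/char entries that link to NVIDIA device nodes
   ls -l /dev/char | grep nvidia
   ```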
   A simple udev rule to enforce this can be seen below:

   ```
   # This will create /dev/char symlinks to all device nodes
   ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
   ```

   A good place to install this rule would be: `/lib/udev/rules.d/71-nvidia-dev-char.rules`

   In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:

   ```bash
   sudo nvidia-ctk system create-dev-char-symlinks \
       --create-all \
       --driver-root={{NVIDIA_DRIVER_ROOT}}
   ```

   Where `{{NVIDIA_DRIVER_ROOT}}` is the path to which the NVIDIA GPU Driver Container installs the NVIDIA GPU driver and creates the NVIDIA device nodes.
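   As an alternative to the udev rule, the command can also be run at boot from a systemd oneshot unit. The unit below is a minimal sketch; the unit name and the `After=` ordering are assumptions, and the driver kernel modules must still be loaded before it runs:

   ```ini
   # /etc/systemd/system/nvidia-dev-char-symlinks.service (hypothetical unit name)
   [Unit]
   Description=Create /dev/char symlinks for NVIDIA device nodes
   # The NVIDIA kernel modules must already be loaded when this runs
   After=systemd-modules-load.service

   [Service]
   Type=oneshot
   ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

   [Install]
   WantedBy=multi-user.target
   ```

   Enable it with `sudo systemctl enable --now nvidia-dev-char-symlinks.service`.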
2. Method 2: Explicitly disabling systemd cgroup management in Docker.

   Set the parameter `"exec-opts": ["native.cgroupdriver=cgroupfs"]` in the `/etc/docker/daemon.json` file and restart Docker:

   ```json
   {
       "runtimes": {
           "nvidia": {
               "args": [],
               "path": "nvidia-container-runtime"
           }
       },
       "exec-opts": ["native.cgroupdriver=cgroupfs"]
   }
   ```
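   To apply and verify the change (these exact commands are an illustration, not part of the original workaround):

   ```bash
   # Restart Docker so the new exec-opts take effect
   sudo systemctl restart docker

   # Confirm that Docker is now using the cgroupfs driver
   docker info --format '{{.CgroupDriver}}'   # expected output: cgroupfs
   ```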
3. Method 3: Downgrading to `docker.io` packages where `systemd` is not the default `cgroup` manager (and not overriding that, of course).
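   The exact package version to pin is distribution-specific; as a rough sketch on an Ubuntu/Debian system (the version string is a placeholder, not a recommendation):

   ```bash
   # List the docker.io versions available from the configured repositories
   apt-cache madison docker.io

   # Downgrade to a chosen version and hold it so it is not upgraded again
   sudo apt-get install --allow-downgrades docker.io=<desired-version>
   sudo apt-mark hold docker.io
   ```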