
Download Nvidia Modular Diagnostic Software

Solveig Lichtenberg
Dec 31, 2023
Beginning with 2.4.0, DCGM diagnostics support an additional level 4 diagnostic run (-r 4). The first of these additional diagnostics is memtest. Similar to memtest86, the DCGM memtest exercises GPU memory with various test patterns. Each pattern is run as a separate test and can be enabled and disabled by administrators.
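
For example, a level 4 run is launched from the dcgmi CLI as shown below; the test-parameter key used to disable one memtest pattern is an assumption, so check dcgmi diag --help and the DCGM documentation for the exact names:

# run the full level 4 diagnostic, including memtest
dcgmi diag -r 4

# per-test parameters are passed with -p; "memtest.test0=false" is illustrative
dcgmi diag -r 4 -p "memtest.test0=false"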


When running GPU diagnostics, DCGM by default drops privileges and uses an unprivileged service account to run the diagnostics. If the service account does not have write access to the directory where diagnostics are run, users may encounter this issue. To summarize, the issue happens when both of these conditions are true: DCGM drops privileges to the unprivileged service account, and that account cannot write to the directory from which the diagnostics are run.



Edit the systemd unit service file to include a WorkingDirectory option, so that the service is started in a location writeable by the nvidia-dcgm user (be sure that the directory shown in the example below, /tmp/dcgm-temp, is created):
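
The original example is not reproduced in this post; the drop-in below is a minimal sketch of one way to do it (the drop-in path and file name follow common systemd conventions and may differ on your system):

# create the working directory first
mkdir -p /tmp/dcgm-temp

# /etc/systemd/system/nvidia-dcgm.service.d/workdir.conf
[Service]
WorkingDirectory=/tmp/dcgm-temp

# reload systemd and restart the service
systemctl daemon-reload
systemctl restart nvidia-dcgm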


If the DCGM agent (nv-hostengine) is running, then stop the DCGM agent (nv-hostengine) or ensure that the service was started with privileges. This can be achieved by modifying the systemd service file (under /usr/lib/systemd/system/nvidia-dcgm.service) so that nv-hostengine is not started with the unprivileged nvidia-dcgm service account.
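
A sketch of that change is shown below; the exact User/Group lines in the shipped unit file may differ, so treat this as illustrative:

# stop the agent if it is running
systemctl stop nvidia-dcgm

# in /usr/lib/systemd/system/nvidia-dcgm.service, comment out the
# unprivileged service account so nv-hostengine starts with privileges
[Service]
#User=nvidia-dcgm
#Group=nvidia-dcgm

systemctl daemon-reload
systemctl start nvidia-dcgm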


GPU telemetry (whether via the NVML/DCGM APIs or with nvidia-smi dmon / dcgmi dmon) should not be used when running the EUD (End User Diagnostics). The EUD interacts heavily with the driver, and contention will impact testing and may cause timeouts.


Automating workflows based on DCGM diagnostics can enable sites to handle GPU errors more efficiently. Additional data for determining the severity of errors and potential next steps is available using either the API or by parsing the JSON returned on the CLI. Besides simply reporting human-readable strings describing which errors occurred during the diagnostic, each error also includes a specific ID, Severity, and Category that can be useful when deciding how to handle the failure.
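
As a sketch, the JSON output can be requested with dcgmi's -j flag and filtered with jq; the field name in the jq filter is an assumption, so verify it against the JSON your DCGM version actually emits:

# run the diagnostic with JSON output
dcgmi diag -r 3 -j > diag.json

# pull out any objects carrying an error identifier; "error_id" is an
# assumed key, inspect diag.json for the real field names
jq '.. | objects | select(has("error_id"))' diag.json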


In general, DCGM has high confidence that errors with the ISOLATE and RESET severities should be handled immediately. Other severities may require more site-specific analysis, a re-run of the diagnostic, or a scan of DCGM and system logs to determine the best course of action. Gathering and recording the failure types and rates over time can give datacenters insight into the best way to automate handling of GPU diagnostic errors.


Super Diagnostics Offline (SDO) provides the capability to determine the health of Supermicro servers' components, including CPU, memory, BMC, HDD, USB, power supply, backplane, PCIe, VGA, GPU, and network. It allows data center administrators to minimize server downtime and unnecessary RMAs. Super Diagnostics Offline can be run in either CLI or GUI mode on the local host. Combined with the SMCIPMITool and SSM (Supermicro Server Manager), remote diagnostics are also possible.






Nvidia MODS, or Modular Diagnostic Software, is an internal Nvidia tool set for GPU diagnostics. The tools leaked out and are now used by third-party repair shops when troubleshooting broken GPUs. Let's take a look at what MODS can do and how to use it.


The x11-drivers/nvidia-drivers package supports a range of available NVIDIA cards. Multiple versions are available for installation, depending on the card(s) that the system has. See Feature support list and the official NVIDIA documentation, What's a legacy driver?, to find out what version of nvidia-drivers should be used.


The kernel module (nvidia.ko) consists of a proprietary part (commonly known as the "binary blob") which drives the graphics chip(s), and an open source part (the "glue") which at runtime acts as intermediary between the proprietary part and the kernel. These all need to work nicely together as otherwise the user might be faced with data loss (through kernel panics, X servers crashing with unsaved data in X applications) and even hardware failure (overheating and other power management related issues should spring to mind).


From time to time, a new kernel release changes the internal ABI for drivers, which means all drivers that use those ABIs must be changed accordingly. For open source drivers, especially those distributed with the kernel, these changes are nearly trivial to fix since the entire chain of calls between drivers and other parts of the kernel can be reviewed quite easily. For proprietary drivers like nvidia.ko, it doesn't work quite the same. When the internal ABIs change, then it is not possible to merely fix the "glue", because nobody knows how the glue is used by the proprietary part. Even after managing to patch things up to have things seem to work nicely, the user still risks that running nvidia.ko in the new, unsupported kernel will lead to data loss and hardware failure.


When a new, incompatible kernel version is released, it is probably best to stick with the newest supported kernel for a while. NVIDIA usually takes a few weeks to prepare a new proprietary release they think is fit for general use. Just be patient. If absolutely necessary, then it is possible to use the epatch_user command with the nvidia-drivers ebuilds: this allows the user to patch nvidia-drivers to somehow fit in with the latest, unsupported kernel release. Do note that neither the nvidia-drivers maintainers nor NVIDIA will support this situation. The hardware warranty will most likely be void, Gentoo's maintainers cannot begin to fix the issues since it's a proprietary driver that only NVIDIA can properly debug, and the kernel maintainers (both Gentoo's and upstream) will certainly not support proprietary drivers, or indeed any "tainted" system that happens to run into trouble.
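
For reference, user patches are typically dropped into /etc/portage/patches, where epatch_user picks them up during the build; the patch file name below is only a placeholder:

# place the patch where the ebuild's epatch_user call will find it
mkdir -p /etc/portage/patches/x11-drivers/nvidia-drivers
cp my-kernel-compat.patch /etc/portage/patches/x11-drivers/nvidia-drivers/
emerge --ask --oneshot x11-drivers/nvidia-drivers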


A framebuffer driver is required for rendering the Linux console (TTY), as this functionality is not yet provided by the proprietary NVIDIA driver[2][3]; i.e. nvidia-drivers, unlike in-tree DRM drivers, relies on other framebuffer drivers to provide Linux console (TTY) support instead of providing its own. As shown below, set Mark VGA/VBE/EFI FB as generic system framebuffer (CONFIG_SYSFB_SIMPLEFB=y), and then enable a framebuffer driver. Common options are efifb (CONFIG_FB_EFI=y) for UEFI devices or vesafb (CONFIG_FB_VESA=y) for BIOS/CSM devices. simplefb (CONFIG_FB_SIMPLE=y) may also be chosen, although there are reports of it both working and not working; the decision is up to the end user.
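
Putting those options together, a kernel configuration fragment along these lines can be used (choose the framebuffer driver matching your firmware; this simply restates the symbols named above):

# Mark VGA/VBE/EFI FB as generic system framebuffer
CONFIG_SYSFB_SIMPLEFB=y
# UEFI systems
CONFIG_FB_EFI=y
# or, for BIOS/CSM systems
CONFIG_FB_VESA=y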


The nvidia-drivers ebuild automatically discovers the kernel version based on the /usr/src/linux symlink. Please ensure that this symlink is pointing to the correct sources and that the kernel is correctly configured. Please refer to the "Configuring the Kernel" section of the Gentoo Handbook for details on configuring the kernel.
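
The symlink is usually managed with eselect; as a quick sketch (the index passed to "set" depends on which sources are installed):

# list installed kernel sources and point /usr/src/linux at the right one
eselect kernel list
eselect kernel set 1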


If the kernel's GCC plugins are enabled, compilation of nvidia-drivers will use them. If the compiler version that was used to compile the plugins does not match the compiler used for nvidia-drivers, an error will occur.
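
Two ways around this, sketched below, are to rebuild the kernel (and therefore its plugins) with the currently selected compiler, or to disable the plugins entirely; the commands and symbol are illustrative:

# check which gcc is currently selected
gcc-config -l

# either rebuild the kernel so its plugins match the active compiler,
# or disable GCC plugins in the kernel configuration:
# CONFIG_GCC_PLUGINS=n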


Now it's time to install the drivers. First follow the X Server Configuration Guide and set VIDEO_CARDS="nvidia" in /etc/portage/make.conf. During the installation of the X server, Portage will then pull in the right version of x11-drivers/nvidia-drivers.
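
In practice that amounts to something like the following minimal sketch; adjust the packages to your setup:

# /etc/portage/make.conf
VIDEO_CARDS="nvidia"

# install the X server; x11-drivers/nvidia-drivers is pulled in automatically
emerge --ask x11-base/xorg-server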


Once the modules are signed, the driver will load as expected on boot up. This module signing method can be used to sign other modules too - not only the nvidia-drivers. Just modify the path and corresponding module accordingly.
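
As an illustration of the idea, the kernel's sign-file helper is invoked per module; the key, certificate, and module paths below are placeholders:

# sign the freshly installed nvidia module with the kernel's sign-file script
/usr/src/linux/scripts/sign-file sha256 \
    /path/to/signing_key.pem /path/to/signing_cert.der \
    /lib/modules/$(uname -r)/video/nvidia.ko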


NVIDIA packages a daemon called nvidia-persistenced to assist in situations where the tearing down of the GPU device state isn't desired. Typically, the tearing down of the device state is the intended behavior of the device driver. Still, the latencies incurred by repetitive device initialization can significantly impact performance for some applications.


nvidia-persistenced is intended to be run as a daemon from system initialization and is generally designed as a tool for compute-only platforms where the NVIDIA device is not used to display a graphical user interface. Depending on the user's system and its uses, it may not be necessary to set the persistenced USE flag.
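
If it is wanted, enabling it looks roughly like this on a Gentoo system; the package.use file name is arbitrary, the service is shown for OpenRC, and the init script name is an assumption (systemd users would enable the corresponding unit):

# enable the USE flag and rebuild the driver package
echo "x11-drivers/nvidia-drivers persistenced" >> /etc/portage/package.use/nvidia
emerge --ask --changed-use x11-drivers/nvidia-drivers

# start the daemon at boot (OpenRC)
rc-update add nvidia-persistenced default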


NVIDIA also provides a settings tool. This tool allows the user to monitor and change graphical settings without restarting the X server and is available through Portage as part of x11-drivers/nvidia-drivers with the tools USE flag set.


Disable the Intel CPU idling method by adding intel_idle.max_cstate=0 to the kernel command line, which should cause the kernel to fall back to the normal (older) ACPI CPU idling method. Disabling the NVIDIA PowerMizer feature, or setting PowerMizer to maximum performance within nvidia-settings, has also been reported to help. The Intel CPU idling method, recently introduced as the default for i5 and i7 CPUs (in place of ACPI CPU idling), is the root cause here. Disabling it largely solves the problem, although some minimal stuttering or slow video may still be encountered if deinterlacing is enabled on video that is likely already deinterlaced (e.g. an mplayer alias with deinterlacing disabled can be used as a workaround).
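
For GRUB-based systems, the kernel parameter can be added roughly as follows; file locations vary by bootloader, and the parameter should be appended to any options already present:

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="intel_idle.max_cstate=0"

# regenerate the bootloader configuration
grub-mkconfig -o /boot/grub/grub.cfg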


When using systemd, it may be worth adding a soft-dependency entry to /etc/modprobe.d to ensure that nvidia-uvm is loaded as a soft dependency of the nvidia module. This helps prevent an error that happens when the configuration file is added to the initrd but the nvidia-uvm module is not, which causes Plymouth to complain that it cannot find the nvidia-uvm module.
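
The original snippet is not included in this post; with the standard modprobe softdep syntax it would look something like this (the file name is arbitrary):

# /etc/modprobe.d/nvidia-uvm.conf
# load nvidia-uvm after the nvidia module has been loaded
softdep nvidia post: nvidia-uvm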


Cause: an API mismatch occurs when the nvidia kernel modules are a different version than the userspace utilities. This happens when a full system reboot is not performed after an nvidia-drivers package update.
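
Rebooting resolves it; alternatively, the driver modules can be reloaded by hand, roughly as follows (module names may vary with the driver version, and X must not be running):

# unload the stale modules, then load the freshly installed ones
modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
modprobe nvidia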


The nvidia kernel module accepts a number of parameters (options) which can be used to tweak the behavior of the driver. Most of these are mentioned in the documentation. To add or change the values of these parameters, edit the file /etc/modprobe.d/nvidia.conf. Remember to run update-modules after modifying this file, and reload the nvidia module for the new settings to take effect.
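
As a sketch, an options line in that file looks like the following; the parameter shown is just an example of a real NVreg option, not a recommendation:

# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_PreserveVideoMemoryAllocations=1

# apply the change
update-modules
modprobe -r nvidia && modprobe nvidia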
