NVLink

NVLink is supported only on instance flavors that include 8× NVIDIA H100 or H200 GPUs. In this configuration, NVLink provides high-bandwidth, low-latency GPU-to-GPU communication, enabling:

  • Faster model training, especially for large models that require frequent inter-GPU data exchange

  • Improved scaling efficiency when using distributed training frameworks (e.g., Megatron-LM, DeepSpeed, PyTorch FSDP)

  • Reduced communication bottlenecks compared to PCIe-only GPU connectivity

  • Higher overall compute throughput for workloads that depend on multi-GPU synchronization

Instances with fewer than 8 GPUs do not support NVLink.
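
Before enabling NVLink, you can confirm that your flavor exposes all 8 GPUs. This is a minimal check, assuming the NVIDIA driver and nvidia-smi are present in the image:

nvidia-smi -L

Each of the 8 GPUs should be listed with its model name and UUID.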

Note: NVLink is not enabled by default in FPT-provided images. Users must manually enable NVLink if required.

To enable NVLink support, follow the steps below:

  • Open the file: /etc/default/grub.d/00-fci-grub.cfg

  • Locate the following line and remove or comment it out:

GRUB_CMDLINE_LINUX_DEFAULT="nvidia.NVreg_NvLinkDisable=1" 
  • Update the GRUB configuration:

sudo update-grub

  • Reboot the instance so that the updated kernel parameters take effect:

sudo reboot
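
If you prefer to apply the change non-interactively, the steps above can be scripted. This is a minimal sketch, assuming the file contains exactly the NvLinkDisable line shown above:

# Comment out the line that disables NVLink
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="nvidia.NVreg_NvLinkDisable=1"/#&/' /etc/default/grub.d/00-fci-grub.cfg
# Regenerate the GRUB configuration and reboot so the new kernel parameters take effect
sudo update-grub
sudo reboot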

Verification

To verify that NVLink has been enabled successfully, run the commands below.

1. Check NVLink status
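
The link state can be queried with nvidia-smi (the nvlink subcommand ships with standard data-center drivers, so it should be available in the FPT GPU images):

nvidia-smi nvlink --status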

The output should show the active NVLink connections and their link speeds (e.g., around 25 GB/s per link, depending on the GPU generation).

2. Check GPU topology
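
The GPU interconnect topology matrix can be printed with:

nvidia-smi topo -m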

The output should display NVLink (NV##) connections between all GPUs, similar to the example below:
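
The following is an abridged, illustrative sketch only; the actual output also includes columns such as CPU and NUMA affinity, and the exact link count depends on the hardware and driver version. On 8× H100/H200 flavors, NV18 indicates 18 bonded NVLinks between each GPU pair:

        GPU0   GPU1   ...    GPU7
GPU0     X     NV18   ...    NV18
GPU1    NV18    X     ...    NV18
...
GPU7    NV18   NV18   ...     X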

Legend

  • X : Same GPU

  • SYS : PCIe + inter-NUMA interconnect

  • NODE: PCIe + interconnect within one NUMA node

  • PHB : PCIe Host Bridge

  • PXB : Multiple PCIe bridges

  • PIX : Single PCIe bridge

  • NV# : NVLink connection with # bonded links


Troubleshooting

If you encounter errors such as:

  • system not yet initialized

  • Issues when calling torch.cuda.device_count() or torch.cuda.get_device_name(i)

Restart the NVIDIA Fabric Manager:
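
On systemd-based images, the Fabric Manager runs as the nvidia-fabricmanager service (assuming the package matching your driver version is installed):

sudo systemctl restart nvidia-fabricmanager
systemctl status nvidia-fabricmanager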

Then retry your application or verification steps.
