NVIDIA Academy Series – InfiniBand Network Administration

The InfiniBand Essentials course gave me the vocabulary. Now I’m looking to dig deeper into network administration. From the IPC certification, I already know what a “bad crimp” looks like under a microscope. This course should teach me what a bad crimp looks like to the subnet manager and the host channel adapter, so I can translate my physical inspection expertise into digital forensics. The course provides two benefits for an SQE:

  1. Sometimes a cable failure is actually a firmware mismatch or incorrect coding by the cable vendor. This course teaches how to read those digital signatures, so I can tell “This cable is physically broken” apart from “This cable is coded incorrectly”.
  2. Isolating failures: Without this course, I’m guessing which cable is bad or swapping them all. With this course, I run the diagnostics and identify the specific cable that is throwing “Link down” errors every few hours. I then pull that cable, dissect it, and find that a physical construction defect, such as an overmold stress fracture, caused intermittent contact.
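
To make the isolation step concrete, here is a minimal Python sketch. It assumes I have already collected a few snapshots of a cumulative link-down counter per port; the port names, the numbers, and the `flapping_ports` helper are all made up for illustration and are not output from any NVIDIA tool.

```python
from typing import Dict, List

# Hypothetical snapshots: port name -> cumulative link-down count.
# Real values would come from whichever diagnostic tool is in use.
Snapshot = Dict[str, int]


def flapping_ports(snapshots: List[Snapshot], min_events: int = 2) -> Dict[str, int]:
    """Return ports whose link-down counter rose by at least min_events
    between the first and the last snapshot."""
    first, last = snapshots[0], snapshots[-1]
    suspects = {}
    for port, end_count in last.items():
        delta = end_count - first.get(port, 0)
        if delta >= min_events:
            suspects[port] = delta
    return suspects


# Three snapshots taken a few hours apart (illustrative numbers only).
snapshots = [
    {"switch1/port7": 0, "switch1/port8": 3, "switch2/port1": 1},
    {"switch1/port7": 0, "switch1/port8": 5, "switch2/port1": 1},
    {"switch1/port7": 0, "switch1/port8": 9, "switch2/port1": 1},
]

print(flapping_ports(snapshots))  # {'switch1/port8': 6}
```

The point of comparing snapshots rather than reading one value is that a counter that keeps climbing marks the cable worth pulling, while a counter that is non-zero but stable may just be left over from an old event.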

Now let’s begin the journey by revisiting the basics of InfiniBand:

Here is how the components in the above diagram work:

Compute Servers (CPU & DGX systems): These are the powerful computers doing the heavy lifting, like running LLMs. They need to exchange massive amounts of data with very little delay. InfiniBand can support speeds of up to 400 Gb/s.

Storage Front/Back-End: This is where all the data lives. The compute servers on the left need to fetch this data quickly, and InfiniBand lets them refuel with data without waiting.

Switches: These are the traffic directors. They are designed for very low latency and make sure that a message from a GPU on the left reaches storage on the right with minimal delay.

Gateway to Internet: This is the bridge to the outside world. The supercomputers live in their private, high-speed InfiniBand bubble. Eventually, their results need to be sent out to the internet so that ordinary users like us can log in and see them.

Now let’s look at the InfiniBand port structure:

This is a critical slide that connects to my IPC 620 knowledge. If I have a bad crimp or cold solder joint on just one pin, the entire cable doesn’t necessarily go dead. Instead, Lane 0 might fail while the others keep working. The MlxLink tool (which I will discuss in a later blog) reports the health of each lane individually. Note that EDR, HDR, and NDR are just generation names (like 2G, 3G, 4G for phones). They stand for Enhanced, High, and Next Data Rate.
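
As a rough illustration of why per-lane reporting matters, the sketch below flags lanes whose error count stands far above the quietest lane on the same cable. The counts, the `suspect_lanes` helper, and the 10x ratio are my own assumptions for the example; they are not MlxLink output or an NVIDIA threshold.

```python
from typing import Dict, List


def suspect_lanes(lane_errors: Dict[int, int], ratio: float = 10.0) -> List[int]:
    """Flag lanes whose error count is at least `ratio` times the
    quietest lane (the baseline is floored at 1 to avoid a zero threshold)."""
    baseline = max(min(lane_errors.values()), 1)
    return sorted(
        lane for lane, errors in lane_errors.items() if errors >= ratio * baseline
    )


# Illustrative per-lane error counts for one 4-lane cable.
lane_errors = {0: 4812, 1: 3, 2: 5, 3: 2}

print(suspect_lanes(lane_errors))  # [0]
```

Comparing lanes against each other, instead of against a fixed number, reflects the inspection logic: a defect on one pin should show up as one lane behaving very differently from its neighbours.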

Another key topic is the InfiniBand architecture layers. A fault cascades upward: if the Physical Layer is bad -> the Link Layer retries -> the Transport Layer waits -> the Upper Layer (the AI model) slows down.

The Link Layer is logic implemented in the hardware and firmware of the ports at each end of the link (the adapter port and the switch port). It checks every packet that crosses the wire, around the clock. If the wire is perfect (Physical Layer), it stays silent. If the wire has even a microscopic flaw, the Link Layer increments its error counters.

How will this lesson help me? Right now, I can look at a cable and see a defect. After this course, I will be able to look at a Link Layer error log and say “Errors observed on Lane 3? That tells me there is likely a crimp deformation on the corresponding pair of wires.”

Now we will look at the data packet structure. This concept was difficult for me to understand. The key takeaway for me from this slide: a VCRC (Variant CRC) error is the digital footprint of a physical defect. It means a bad crimp, poor shielding, or an impedance mismatch distorted the electrical signal.

Let’s look at the “lifecycle” of a single message (data packet) as it travels from one compute server to another.

If I have to explain this concept in a digestible manner, think of this lifecycle as a shipping process where the cable acts as a highway for a fragile package called data. Before leaving the source, the data is wrapped in protective layers, including a shipping label (LRH) for the switch to route by and tamper-evident seals (CRCs) on the back. As the packet travels through the cable, physical defects like poor shield terminations or an impedance mismatch act like potholes that shake the package. Each switch reads the label to keep traffic moving and checks the per-hop seal (VCRC) when the packet arrives. The destination performs the final end-to-end quality check (ICRC). If a tamper-evident seal is broken (a CRC error), the system rejects the package immediately, providing digital proof that the physical assembly failed to protect signal integrity.
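
The tamper-evident seal idea can be tried out in a few lines of Python. This is only a sketch of the principle: it uses the standard library’s `zlib.crc32` over a made-up payload as a stand-in, whereas real InfiniBand packets carry their own CRC fields computed over specific parts of the packet. The `seal` and `seal_intact` helpers are names I made up for the example.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (the 'tamper-evident seal')."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def seal_intact(packet: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailing seal."""
    payload, received = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


packet = seal(b"gradient shard 42")  # made-up payload

# Simulate a "pothole" on the cable: flip a single bit in transit.
damaged = bytearray(packet)
damaged[3] ^= 0x01

print(seal_intact(packet))          # True
print(seal_intact(bytes(damaged)))  # False
```

One flipped bit is enough to break the seal, which is why a CRC error counter is such a sensitive indicator of marginal signal integrity.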

We will continue to look at the Physical Layer, Link Layer, Network Layer, Transport Layer, and Upper Layer in more detail in the next few blogs. Thanks for reading!
