Design Compute Network for AI Clusters
Preface

AI clusters involve three types of networks:

- Frontend fabric: connects to the internet or to storage systems for loading training data.
- GPU backend fabric (compute network): supports GPU-to-GPU communication, provides lossless connectivity, and enables cluster scaling. It is the core carrier for training data interaction between GPU nodes.
- Storage backend fabric: handles massive data storage, retrieval, and management between GPUs and high-performance storage.

This guide focuses on the design of 400G AI intelligent computing GPU backend networks at different scales. Using Asterfusion high-density 400G/800G data center switches as the hardware carrier, the solutions implement Clos networks based on rail-only and rail-optimized architectures to provide standardized deployment guides.

Target Audience

This guide is intended for solution planners, designers, and on-site implementation engineers who are familiar with Asterfusion data center switches, RoCE, PFC, ECN, and related technologies.

Overview

The rapid evolution of AI/ML (artificial intelligence / machine learning) applications has driven a continuous surge in demand for large-scale clusters. AI training is a network-intensive workload in which GPU nodes exchange massive gradient data and model parameters at high frequency. This drives the need for a network infrastructure defined by high bandwidth, low latency, and interference resistance. Traditional general-purpose data center networks struggle to adapt to the traffic characteristics of AI training, which are dominated by "elephant flows" and low entropy. This often leads to bandwidth bottlenecks, transmission congestion, and latency jitter, failing to meet the rigorous requirements of AI training. As the "communication backbone" of the AI cluster, the backend network directly determines how efficiently GPU compute can be released. Therefore, an efficient cluster networking solution is
urgently needed to satisfy low-latency, high-throughput inter-node communication.

AI Backend Network Architecture

Rail-Only Architecture

Leaf switches connected to GPUs with the same index across different servers are said to form a rail plane; that is, rail n interconnects all #n GPUs via the n-th leaf switch. As shown in the figure below, the GPUs on each server are numbered 0-7, corresponding to rail 1 through rail 8. Intra-rail transmission occurs when the source and destination GPUs' corresponding NICs are connected to the same leaf switch. LLM (large language model) training optimizes traffic distribution through hybrid parallelism strategies (data, tensor, and pipeline parallelism), concentrating most traffic within nodes and within the same rail.

The rail-only architecture adopts a single-tier network design, physically partitioning the entire cluster network into 8 independent rails. Communication between GPUs of different nodes is intra-rail, achieving "single-hop" connectivity. Compared to traditional Clos architectures, the rail-only design eliminates the spine layer; by reducing network tiers, it saves on switches and optical modules, thereby reducing hardware costs. It is a cost-effective, high-performance architecture tailored for AI large-model training in small-scale compute clusters.

Rail-Optimized Architecture

Building on the rail concept, a basic building block consisting of a set of rails is regarded as a group, which includes several leaf switches and GPU servers. As the cluster scale increases, expansion is achieved by horizontally stacking multiple groups.
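As a minimal sketch of the rail wiring rule (assuming the 8-GPU servers from the example; the function names are illustrative, not from any vendor API):

```python
# Sketch of the rail-only wiring rule: NIC/GPU index n on every server
# connects to leaf switch n, so same-indexed GPUs share a single-hop path.
# The 8-rail layout follows the figure described in the text.

GPUS_PER_SERVER = 8  # rails 1..8

def leaf_for(gpu_index: int) -> int:
    """Leaf (rail) switch serving a given GPU index (0-based GPU -> 1-based rail)."""
    return gpu_index + 1

def same_rail(src_gpu: int, dst_gpu: int) -> bool:
    """Intra-rail traffic crosses only one leaf; inter-rail traffic cannot."""
    return leaf_for(src_gpu) == leaf_for(dst_gpu)

# GPU 3 on server A and GPU 3 on server B communicate through leaf 4 alone.
print(same_rail(3, 3))  # True: single-hop, intra-rail
print(same_rail(3, 5))  # False: inter-rail (NVLink/PXN or a spine would be needed)
```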
The compute network can be visualized as a railway system: compute nodes are "stations" loaded with computing power; rails are "exclusive rail lines" connecting the same-numbered GPUs at each station to ensure high-speed direct access; and groups are "standard platform" units integrating multiple tracks and their supporting switches. Through this modular stacking, an intelligent computing center can scale horizontally like building blocks, ensuring both ultra-fast intra-rail communication and efficient interconnection for 10,000-GPU clusters.

As shown above, the key design of the rail-optimized architecture is to connect the same-indexed NICs of every server to the same leaf switch, ensuring that multi-node GPU communication completes in the fewest possible hops. In this design, communication between GPU nodes can utilize internal NVSwitch paths, requiring only one network hop to reach the destination without crossing multiple switches, thus avoiding additional latency. The details are as follows:

- Intra-server: the 8 GPUs connect to the NVSwitch via the NVLink bus, achieving low-latency intra-server communication and reducing scale-out network transmission pressure.
- Server to leaf: all servers follow a uniform cabling rule; NICs connect to multiple leaf switches according to the "NIC1-leaf1, NIC2-leaf2, ..." pattern.
- Network layer: leaf and spine switches are fully meshed in a 2-tier Clos architecture.

A key design factor in multi-stage Clos architectures is the oversubscription ratio: the ratio of total downlink bandwidth (leaf nodes to GPU servers) to total uplink bandwidth (leaf nodes to spine nodes). If the ratio is greater than 1:1, the fabric may lack sufficient capacity to handle inter-GPU traffic when downlink traffic reaches line rate, potentially causing congestion or packet loss. In short, a smaller oversubscription ratio leads to non-blocking communication but higher costs, while a larger ratio reduces costs but increases congestion risk. In high-performance AI networks, a 1:1 non-blocking design is generally recommended.
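The oversubscription arithmetic above can be sketched as follows; the port counts and speeds are illustrative, not tied to a specific switch model:

```python
# Hedged sketch: a leaf switch's oversubscription ratio as defined in the
# text, i.e. total downlink bandwidth (toward GPU servers) divided by total
# uplink bandwidth (toward spines).

def oversubscription_ratio(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)

# 64 x 400G down to GPUs, 64 x 400G up to spines -> 1:1 non-blocking
print(oversubscription_ratio(64, 400, 64, 400))  # 1.0

# 96 down / 32 up -> 3:1 oversubscribed; congestion risk at line rate
print(oversubscription_ratio(96, 400, 32, 400))  # 3.0
```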
Traffic Path Analysis

The intra-server and intra-rail communication paths are similar in both architectures. Taking the rail-optimized architecture as an example, the following analyzes inter-GPU communication paths in different scenarios:

- Intra-server communication: completed via NVSwitch without passing through the external network.
- Intra-rail communication: forwarded through a single leaf switch.
- Inter-rail (without PXN) and cross-group communication: inter-rail communication is routed through the spine layer; similarly, inter-group communication traverses the spine fabric to reach its destination.
- Inter-rail (with PXN) communication: with PXN technology, transmission completes in a single hop without crossing the spine.

Technologies Supporting Lossless Networking

DCQCN Technology

RDMA (Remote Direct Memory Access) is widely used in HPC, AI training, and storage. Originally implemented on InfiniBand, it evolved into iWARP and RoCE (RDMA over Converged Ethernet) for Ethernet transport. RoCEv2 uses UDP for transport, which necessitates end-to-end congestion control via PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) to guarantee lossless performance. A PFC-only strategy risks unnecessary head-of-line blocking by halting traffic too aggressively, while a standalone ECN approach may suffer from reaction-time latency, potentially leading to buffer overflows and packet loss. Consequently, a unified congestion control strategy is required to balance responsiveness with stability.

DCQCN (Data Center Quantized Congestion Notification) is a hybrid congestion control algorithm designed to balance throughput and latency. It triggers ECN during early congestion to proactively throttle the NIC's transmission rate; should congestion intensify, PFC acts as a fail-safe to prevent buffer overflows by exerting backpressure hop by hop. The DCQCN operational logic follows a structured hierarchy.

ECN first (proactive intervention): as egress queues begin to accumulate and breach WRED thresholds, the switch marks packets by setting the CE bits.
Upon receiving these marked packets, the destination node generates CNPs (Congestion Notification Packets) directed back to the sender, which then smoothly scales down its injection rate to relieve pressure without halting traffic.

PFC second (reactive safeguard): if congestion persists and buffer occupancy hits the XOFF threshold, the switch issues a pause frame upstream. This temporarily halts transmission for the affected queue, ensuring zero packet loss.

Flow recovery: once buffer levels recede below the XON threshold, a resume frame notifies the upstream device to resume transmission.

To streamline the complexities of lossless Ethernet, Asterfusion has introduced the Easy RoCE capability in AsterNOS. This feature automates optimized parameter generation and abstracts intricate configurations into business-level operations, significantly enhancing cluster maintainability.

Load Balancing Technology

ECMP (Equal-Cost Multi-Path) per-flow load balancing is the most widely used routing strategy in data center networks. It assigns packets to one of several paths by hashing header fields such as the IP 5-tuple; this approach is known as static load balancing. However, per-flow hashing struggles to distribute traffic uniformly when it lacks entropy. The impact is most severe with "elephant flows", which overwhelm specific member links and trigger packet loss. AI workloads further challenge this model: deep learning relies on collective communication (e.g., all-reduce, all-gather, and broadcast) that generates massive, bursty traffic reaching terabits per second (Tbps). These operations are subject to the "straggler effect", where congestion on a single link bottlenecks the entire training job. This makes traditional ECMP unfit for RoCEv2-based AI backend fabrics.
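A toy illustration of why low-entropy traffic defeats per-flow hashing (the flow tuples are invented, and md5 merely stands in for a hardware hash function):

```python
# Hedged sketch of per-flow ECMP: hash the 5-tuple, pick a member link.
# With only a few long-lived "elephant" flows (low entropy), some links end
# up carrying several flows while others stay idle.

import hashlib
from collections import Counter

LINKS = 8  # equal-cost uplinks

def ecmp_link(src_ip, dst_ip, proto, src_port, dst_port):
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    # Real switches use hardware hash functions; md5 is just a stand-in here.
    return int(hashlib.md5(key).hexdigest(), 16) % LINKS

# Eight bulky RDMA flows between the same few endpoints (low entropy):
flows = [("10.0.0.1", "10.0.1.1", "UDP", 49152 + i, 4791) for i in range(8)]
placement = Counter(ecmp_link(*f) for f in flows)
print(placement)  # typically uneven: some links get several flows, some get none
```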
To address this, the following solutions are introduced.

Adaptive Routing and Switching (ARS)

ARS is a flowlet-based load balancing technology. Leveraging hardware ALB (Auto Load Balancing) capabilities, ARS achieves near per-packet equilibrium while mitigating packet reordering. The technology partitions a flow into a series of flowlets based on gap time. By sensing real-time link quality, such as bandwidth utilization and queue depth, ARS dynamically assigns flowlets to the most idle paths, maximizing overall fabric throughput.

Intelligent Routing

Intelligent routing provides both dynamic and static mechanisms:

- Dynamic intelligent routing: evaluates path quality based on bandwidth usage, queue occupancy, and forwarding latency. Bandwidth and queue statistics are pulled from hardware registers at millisecond precision, while latency is monitored via INT (In-band Network Telemetry) at nanosecond resolution. Switches exchange this real-time telemetry via BGP extensions and use dynamic WCMP (Weighted Cost Multipath) to steer traffic toward the optimal path, proactively eliminating bottlenecks.
- Static intelligent routing: designed for scenarios requiring high path stability, this method uses PBR (Policy-Based Routing) to enforce deterministic forwarding. By binding specific GPU traffic to dedicated physical paths (leaf to spine), it ensures strict 1:1 non-blocking oversubscription for fixed traffic models.

Packet Spraying

Packet spraying is a per-packet load balancing technique that distributes packets uniformly across all available member links so that no single path congests. It supports two primary algorithms:

- Random: disperses packets across members using a randomized distribution.
- Round-robin: sequences packets across members in a cyclic, equal-weight manner.

While packet spraying theoretically maximizes network utilization, it introduces the challenge of packet reordering due to varying link latencies. This technology therefore requires robust hardware support, specifically high-performance NICs capable of sophisticated out-of-order reassembly at the endpoint.
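The flowlet partitioning that ARS relies on can be sketched as follows; the timestamps and the gap threshold are illustrative values, not vendor defaults:

```python
# Hedged sketch of flowlet detection: a flow is split wherever the
# inter-packet gap exceeds a threshold, and each flowlet may then take a
# different path without reordering packets inside it.

FLOWLET_GAP_US = 50.0  # illustrative idle-gap threshold in microseconds

def split_into_flowlets(packet_times_us):
    """Group packet timestamps into flowlets by idle-gap detection."""
    flowlets, current = [], [packet_times_us[0]]
    for t_prev, t in zip(packet_times_us, packet_times_us[1:]):
        if t - t_prev > FLOWLET_GAP_US:
            flowlets.append(current)
            current = []
        current.append(t)
    flowlets.append(current)
    return flowlets

# A burst, a pause well above the gap threshold, then another burst:
times = [0, 5, 9, 14, 200, 204, 209]
print(split_into_flowlets(times))  # [[0, 5, 9, 14], [200, 204, 209]]
```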
Building a 400Gbps AI Backend Network

Based on hardware cost and scalability, the following design recommendations are provided.

Table 1: Solution design by GPU cluster scale. (Table content could not be recovered from the source.)

Small-Scale Cluster Design

Standardized Networking Solution

The figure above illustrates a rail-only architecture for a 400G AI backend network consisting of 32 compute nodes (256 GPUs), with 8 CX732Q-N switches deployed as leaf nodes. The key design principles are as follows:

- Each GPU connects to a dedicated NIC; NICs follow the "NIC n to leaf n" rule.
- Independent subnets per rail.
- Single-tier Clos architecture.
- Easy RoCE enabled on leaf switches.

Hardware Selection

For small-scale 400Gbps RoCEv2 fabrics, Asterfusion CX864E-N or CX732Q-N switches are recommended. Taking the NVIDIA DGX H100 server (equipped with 8 GPUs) as a baseline, the maximum capacity for each model is summarized below.

Table 2: Maximum capacity per model (rail-only architecture). (Table content could not be recovered from the source.)

Note: the CX864E-N provides 64 x 800G ports, which can be split into 128 x 400G ports.
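The rail-only capacity arithmetic can be sketched as follows (port counts follow the figures in the text; the helper name is illustrative):

```python
# Hedged sketch of rail-only capacity limits for a given leaf model:
# each rail needs one leaf, so the leaf count is capped by GPUs per server,
# and every GPU consumes one leaf port.

def rail_only_limits(leaf_ports, gpus_per_server=8):
    max_leafs = gpus_per_server          # one leaf per rail
    max_gpus = max_leafs * leaf_ports    # each GPU takes one leaf port
    return max_leafs, max_gpus

print(rail_only_limits(128))  # CX864E-N split to 128 x 400G -> (8, 1024)
print(rail_only_limits(32))   # CX732Q-N, 32 x 400G -> (8, 256)
```

The second call matches the 256-GPU small-scale design above: 8 leafs of 32 ports each serve 32 servers.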
Example: building a 512-GPU cluster. To build a cluster with 64 H100 servers (512 GPUs) using the CX864E-N as leaf nodes:

- Number of leaf nodes required = 512 / 128 = 4
- Scalability limit (leafs) = 8 (matching the 8 GPUs per server)
- Scalability limit (GPUs) = 8 x 128 = 1024

Node requirements and scalability summary:

- Number of leaf nodes = total GPUs / max GPUs per switch
- Maximum scalability (leafs) = number of GPUs per server
- Maximum scalability (total GPUs) = GPUs per server x max GPUs per switch

Medium- to Large-Scale Cluster Design

Standardized Networking Solution

The figure above depicts a rail-optimized architecture for 128 compute nodes (1024 GPUs). It employs 24 CX864E-N switches (8 spines, 16 leafs) organized into two groups, with 8 leaf nodes per group. Key design principles include:

- Each GPU connects to a dedicated NIC; NICs follow the "NIC n to leaf n" rule.
- Independent subnets per rail.
- 2-tier Clos fabric: leaf and spine switches are fully meshed. Leveraging IPv6 link-local addresses, unnumbered BGP neighbors are established to exchange rail subnet routes, eliminating the need for IP planning on interconnect interfaces.
- 1:1 oversubscription: to ensure non-blocking transport, the oversubscription ratio on leaf switches is strictly maintained at 1:1.
- Unified lossless fabric: Easy RoCE and advanced load balancing features are enabled on both leaf and spine nodes.

Hardware Selection

For these fabrics, we recommend the CX864E-N and CX732Q-N for their ultra-low latency: the CX864E-N offers end-to-end latency as low as 560 ns, while the CX732Q-N reaches 500 ns. This keeps intra-rail latency around 600 ns and inter-rail (3-hop) latency under 2 μs. In a rail-optimized design, the number of leaf nodes per group matches the number of GPUs per server (rails); for H100 servers (8 GPUs), each group contains 8 leaf nodes. To maintain 1:1 oversubscription, half of each leaf's ports connect to GPUs and half to spines.
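The per-group sizing rules above can be sketched in a few lines, using the running H100 example (the constants and function name are illustrative):

```python
# Hedged sketch of rail-optimized group sizing: one leaf per rail, and half
# of each leaf's ports face spines to hold the 1:1 oversubscription ratio.

GPUS_PER_SERVER = 8
LEAF_PORTS = 128  # e.g. a CX864E-N split into 128 x 400G

def rail_optimized_group(gpus_per_server=GPUS_PER_SERVER, leaf_ports=LEAF_PORTS):
    leafs_per_group = gpus_per_server        # one leaf per rail
    servers_per_group = leaf_ports // 2      # half the ports face spines (1:1)
    gpus_per_group = servers_per_group * gpus_per_server
    return leafs_per_group, servers_per_group, gpus_per_group

print(rail_optimized_group())  # (8, 64, 512), matching the per-group figures in the text
```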
Table 3: Maximum capacity per group (rail-optimized architecture). (Table content could not be recovered from the source.)

Spine node calculation: the number of spine nodes is determined by the port density (radix) of the leaf nodes. If leaf and spine switches provide m and n ports respectively, the required number of spines = (total leafs x m / 2) / n. If leaf and spine use identical models, the spine count = total leafs / 2.

Example: building a 4096-GPU cluster. To build a cluster with 512 H100 servers (4096 GPUs) using the CX864E-N for both leaf and spine layers, the calculation is as follows:

- Leaf nodes per group = 8
- Max servers per group = 128 / 2 = 64
- Max GPUs per group = 64 x 8 = 512
- Number of groups required = 4096 / 512 = 8
- Total leaf count = 8 (per group) x 8 (groups) = 64 nodes
- Total spine count = 64 (leafs) / 2 = 32 nodes

Scalability limits (CX864E-N as spine/leaf): when designing a compute network, scalability is limited by the spine switch radix. For the CX864E-N (128 x 400G ports), the theoretical maximum scale is:

- Max groups supported = 128 (spine ports) / 8 (leafs per group) = 16
- Max servers = 16 x 64 = 1024
- Max GPUs = 16 x 512 = 8192
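A short sketch reproducing the worked example and the scalability ceiling above (constants follow the CX864E-N figures in the text; names are illustrative):

```python
# Hedged sketch: full-cluster node counts for a 2-tier rail-optimized Clos
# fabric using the same switch model (128 x 400G ports) at both tiers.

GPUS_PER_SERVER = 8
PORTS = 128  # identical leaf and spine model

def cluster_plan(total_gpus):
    leafs_per_group = GPUS_PER_SERVER                 # one leaf per rail
    gpus_per_group = (PORTS // 2) * GPUS_PER_SERVER   # 64 servers x 8 GPUs
    groups = total_gpus // gpus_per_group
    leafs = leafs_per_group * groups
    spines = leafs // 2                               # identical leaf/spine models
    return groups, leafs, spines

print(cluster_plan(4096))  # (8, 64, 32), as in the worked example

# Ceiling set by the spine radix:
max_groups = PORTS // GPUS_PER_SERVER                 # 16
print(max_groups * (PORTS // 2) * GPUS_PER_SERVER)    # 8192 GPUs maximum
```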
The following tables detail the node configuration requirements for deploying backend networks of varying GPU scales using the CX864E-N and CX732Q-N in the rail-optimized architecture.

Table 4: Node requirements for CX864E-N. (Table content could not be recovered from the source.)

Table 5: Node requirements for CX732Q-N. (Table content could not be recovered from the source.)

Node requirements summary. For a given cluster size, the required number of components is determined as follows:

- Leaf nodes per group = number of GPUs per server
- Max servers per group = available leaf ports / 2 (based on 1:1 oversubscription)
- Max GPUs per group = max servers per group x GPUs per server
- Total number of groups = total target GPUs / max GPUs per group
- Total leaf count = leaf nodes per group x total number of groups
- Total spine count = (total leaf count x m / 2) / n, where m is the port count of the leaf switch and n is the port count of the spine switch

Maximum scalability limits summary. The ultimate scale of a 2-tier Clos network is physically constrained by the spine switch radix (port count):
- Max supportable groups = spine available ports / leaf nodes per group
- Max supportable servers = max supportable groups x max servers per group
- Max supportable GPUs = max supportable groups x max GPUs per group

Conclusion

By leveraging rail-only and rail-optimized architectures, this solution minimizes communication hops between GPUs, significantly accelerating AllToAll performance and reducing overall training cycles. The design provides a robust and scalable framework for AI compute fabrics of any magnitude. For detailed deployment cases and configuration specifics, please refer to our best practices documentation.
