PAM4 and Coherent-lite Interconnect for Hyperscale Campuses and AI Data Centers


The explosion of data processing demands driven by the ever-increasing size and complexity of AI models is introducing significant challenges in how data is transferred between processing units (e.g., GPUs and AI accelerators), between processing units and memory, and to the external world.

Additionally, the parallelization of AI models into smaller, more targeted models further increases the need for high-speed, low-latency interconnects both within a single AI node and between nodes across data center networks.

This trend is pushing data center architectures to scale at an unprecedented pace. Interconnects have become a critical enabler of this growth, ensuring low-latency, high-bandwidth communication between systems, where tens, hundreds, or even thousands of GPUs must operate cohesively within a single node. For example, data presented by Meta at OCP shows that over the past 25 years, processing capabilities (measured in FLOPS) have scaled at more than double the rate of interconnect bandwidth: 3.1X every two years versus 1.4X over the same period (Figure 1), indicating the critical need to improve interconnect bandwidth.

Figure 1: Meta’s data highlights that peak hardware FLOPS have scaled at more than twice the rate of interconnect bandwidth – presented at OCP 2022.
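
As a back-of-the-envelope check of how that gap compounds, the short Python sketch below extrapolates the two growth rates from Figure 1 over 25 years. The per-period factors (3.1X and 1.4X every two years) come from the data above; the extrapolation itself is only illustrative.

```python
# Compounding the two-year growth factors reported by Meta over 25 years.
periods = 25 / 2                  # number of two-year scaling periods
flops_growth = 3.1 ** periods     # cumulative growth in peak hardware FLOPS
bw_growth = 1.4 ** periods        # cumulative growth in interconnect bandwidth

print(f"FLOPS: ~{flops_growth:.1e}x, interconnect bandwidth: ~{bw_growth:.1e}x")
print(f"Compute outpaces interconnect by ~{flops_growth / bw_growth:.0f}x over 25 years")
```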

The above trend has led to the definition of new standards and creation of innovative approaches to improve the efficiency of back-end processing interconnects, specifically in connecting matrices of accelerators and processing units. This is essential to meet the demands of AI models and their rapid evolution.

Scale-up focuses on interconnecting GPUs and CPUs within a single AI computing node, optimizing for low latency, high power efficiency, and dense connectivity. In contrast, scale-out addresses how to move processed data efficiently from the node to external systems.

These networks build tightly integrated matrices of GPUs and CPUs, and enable high-bandwidth, low-latency communication across numerous processing units. And while NVIDIA’s NVLink has long served this purpose, new standards such as UALink and Scale-Up Ethernet are emerging to define next-generation requirements for intra-node and node-to-node interconnects.

On the scale-out side, traditional Ethernet (especially in its current RDMA implementations) is limited, particularly with a tail latency that makes it less effective for the performance needs of AI workloads. To address this, the Ultra Ethernet Consortium is leading efforts to evolve Ethernet to better support AI-driven requirements across Network Interface Cards (NICs) and scale-out switching architectures.

Figure 2 illustrates the scale-up and scale-out architecture models optimized for AI-centric data centers.

Figure 2: A Scale up and Scale out architecture

Traditionally, copper has been the dominant material for interconnects inside data center racks due to its cost-effectiveness, flexibility, and reliability. However, as bandwidth and speed requirements continue to rise, the physical limitations of copper (loss and signal integrity challenges), especially for rack-to-rack and even intra-rack connections, are becoming more apparent. This has led to increased adoption of active components such as Active Electrical Cables (AECs) and Active Optical Cables (AOCs), which integrate retimers to extend reach and improve signal integrity.

Furthermore, the reach of traditional Intensity Modulation Direct Detection (IMDD) optical modules is becoming insufficient for long-reach (LR) data center campus applications at higher data rates. As data rates increase, Coherent-lite modulation schemes are emerging as a compelling alternative to PAM4, offering longer reach and higher optical link budgets.

Coherent-lite modules are designed to consume significantly less power than full Coherent solutions and to be cost-competitive with IMDD, making them ideal for campus applications. While current Coherent-lite standards at 800G span reaches of 2 to 10 km, Coherent-lite solutions will be used for links inside the data center for the first time, due to the reach limitations of IMDD technology at next-generation data rates.

In March 2025, Alphawave Semi announced a portfolio of PAM4 and Coherent-lite DSP-based products, purpose-built to address the needs of hyperscaler data centers and AI interconnect markets. Alphawave Semi is among a select group of companies capable of offering both PAM4 and Coherent-lite DSPs, designed in leading-edge process nodes and backed by a strong legacy in high-speed SerDes.

The company has introduced three product families:

  • Cu-Wave™ PAM4 DSPs for Active Electrical Cables (AEC)
  • O-Wave™ PAM4 DSPs for optical retimers and gearbox transceivers
  • Co-Wave™ Coherent-lite DSPs for optical transceivers

These product lines are designed to support the scaling requirements of next-generation data centers, enabling the high-throughput, low-power interconnect solutions that are essential for AI and hyperscale workloads in both scale-up and scale-out architectures.

Figure 3: Alphawave Semi owns all critical connectivity assets behind the Cu-Wave, O-Wave and Co-Wave portfolio of 800G/1.6T PAM4 and Coherent-lite DSPs

To find out more about Alphawave Semi’s portfolio of 800G / 1.6T PAM4 and Coherent-lite DSPs for AI data center and hyperscale campus applications, please visit the DSP page.

Product briefs are also available here.

Cu-Wave briefs: https://awavesemi.com/cu-wave-product-brief/

O-Wave briefs: https://awavesemi.com/o-wave-product-brief/

Co-Wave brief: https://awavesemi.com/wp-content/uploads/2025/03/aw400-o_product_brief_1_0_5.pdf

Programmable Hardware Delivers 10,000X Improvement in Verification Speed over Software for Forward Error Correction

Tony Chan Carusone, CTO, Alphawave Semi
https://awavesemi.com/

In the following article, Tony Chan Carusone explores the critical role of Forward Error Correction (FEC) in high-speed wireline networking, particularly with the adoption of PAM4 modulation for mid-distance transmission across data centers, and how advances in programmable hardware for verification can achieve speeds 10,000 times faster than traditional software-based simulation.

As data transmission rates increase, FEC is essential for maintaining low error rates and obviating the need for retransmission. The performance of modern FEC depends critically on the details of the receiver DSP, particularly with respect to the potential for bursts of errors to corrupt entire frames of data, making them uncorrectable. Software-based time-domain simulation is traditionally used to verify the performance of FEC. However, software simulation is too slow to confirm the probability of these extremely rare error events.

Fortunately, using FPGAs, complete links can be modelled and simulated with enough speed and accuracy to validate FEC performance in a wide variety of real application scenarios prior to widespread deployment. Moreover, such a model can be used to evaluate alternative DSP and FEC schemes for new and emerging applications. With high-speed networking evolving rapidly, these innovations will be a key focus at OFC, where experts will explore the latest advancements shaping the future of optical and data center connectivity.

In the race to increase the speeds of wireline networking and communications, forward error correction (FEC) has become a vital part of the toolkit. To function effectively, especially with the increasing use of four-level pulse amplitude modulation (PAM4), high-speed protocols need FEC to avoid a rise in the number of reception errors. Each incremental increase in the transmitted symbol rate requires higher signal bandwidths, with a commensurate increase in the amount of noise in receivers. Thus, more powerful and complex FEC may be expected to counter the increased noise levels.

Next generation PAM4 wireline links for data-centre interconnection will support transmission rates of 200 Gbps per serial lane. The IEEE 802.3dj task force is responsible for writing the standard that implementors will use to develop their 200 Gbps Ethernet interfaces. To prevent a rise in bit error rate (BER), the task force has adopted a two-layer FEC scheme with inner and outer codes to provide two layers of error correction. However, many details of the system-level architecture and how they affect FEC performance need to be analysed.

Figure 1. Concatenated FEC

Required conversion steps

The requirement for two concatenated FEC codes lies in the end-to-end composition of 200 Gbps links. Several transmission hops must be considered: first, a short electrical link transmits data from a host chip in a server or switch to an optical module. The optical module receives this electrical signal and retransmits an optical signal that is then communicated over a much longer distance to the receiving optical module. This module then receives the incoming optical signal and translates it into an electrical equivalent, relaying the data to the destination chip in another server or switch. In the IEEE 802.3dj architecture, all three of these links employ PAM4 signalling and each of them can introduce errors in the transmitted symbols.

Figure 2. 200 Gb/s multi-part link with concatenated FEC

The electrical links can suffer from large amounts of inter-symbol interference (ISI). Correcting for this ISI often requires equalization techniques, such as decision feedback equalization (DFE) or maximum-likelihood sequence detection (MLSD). These equalization techniques are necessary to establish a link, but are subject to error propagation, whereby a single error in the received bit stream due to an extreme noise event may significantly increase the probability of additional errors in neighbouring bits. Thus, errors are probabilistically correlated, and it is relatively common to see errors arise in bursts.
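
To make the burst-error mechanism concrete, here is a minimal Monte Carlo sketch of error propagation. The error and propagation probabilities are illustrative placeholders, not measured link parameters, and the model is far simpler than a real DFE.

```python
import random

def simulate_errors(n_symbols, p_noise=1e-4, p_propagate=0.3, seed=1):
    """Toy error-propagation model: a wrong decision raises the chance the next one is wrong too."""
    random.seed(seed)
    errors, prev_error = [], False
    for _ in range(n_symbols):
        p = p_propagate if prev_error else p_noise
        prev_error = random.random() < p
        errors.append(prev_error)
    return errors

errs = simulate_errors(2_000_000)

# Group consecutive errors into bursts and report the longest burst observed.
bursts, run = [], 0
for e in errs:
    if e:
        run += 1
    elif run:
        bursts.append(run)
        run = 0
if run:
    bursts.append(run)

print(f"{sum(errs)} errors in total, longest burst = {max(bursts) if bursts else 0} symbols")
```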

On the other hand, optical connections are generally subject to less ISI, but a lower signal-to-noise ratio. In practice, PAM4 optical transceivers contain bandwidth-limited amplifiers, introducing some ISI that also demands some equalization. However, the errors will generally be less strongly correlated compared to purely electrical links. Thus, simulations can model the predominant optical link impairment using additive white Gaussian noise (AWGN), resulting in random and largely uncorrelated errors.

This difference in behaviour between the electrical and optical links supports the use of two levels of FEC in a concatenated arrangement. The outer code corrects errors in all three links. The inner code protects only the optical part of the connection and can therefore use a simpler error-correction method. In the case of the upcoming 200 Gbps Ethernet standards, the proposed code is a binary extended Hamming code that can correct a 1-bit error in each 128-bit codeword. This is effective enough for the uncorrelated errors that are likely to be encountered in the optical domain.

In order to correct the correlated error bursts arising from any of the three links, the outer code proposed for the standard uses a Reed-Solomon code, commonly known as the KP4 FEC, with reference to a previous standard where the same code was used. This can correct up to 15 FEC-symbol errors per 544-symbol codeword in the FEC-encoded stream, where each FEC symbol in turn comprises multiple PAM symbols. Note that when a decoding operation fails, it can mistakenly introduce yet more errors in its effort to make the corrections. Further complicating matters, a codeword interleaver is introduced between the inner and outer codes, which spreads out any error bursts introduced by the inner code, improving overall performance of the link.
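
The toy calculation below illustrates why the interleaver matters. The burst length and interleave depth are arbitrary illustrative values, not figures from the 802.3dj specification; only the 15-symbol correction capability of the RS(544,514) outer code comes from the text above.

```python
BURST_LEN = 24          # consecutive erroneous FEC symbols leaving the inner decoder (example)
INTERLEAVE_DEPTH = 4    # number of outer codewords the stream is spread across (example)
T_CORRECTABLE = 15      # RS(544,514) "KP4" corrects up to 15 symbol errors per codeword

# Round-robin interleaving spreads the burst across several outer codewords.
errors_per_codeword = [0] * INTERLEAVE_DEPTH
for symbol_index in range(BURST_LEN):
    errors_per_codeword[symbol_index % INTERLEAVE_DEPTH] += 1

print("errors per codeword:", errors_per_codeword)
print("without interleaving:", "uncorrectable" if BURST_LEN > T_CORRECTABLE else "correctable")
print("with interleaving:   ",
      "uncorrectable" if max(errors_per_codeword) > T_CORRECTABLE else "correctable")
```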

The complexity of these new codes leads to potential hazards in their implementation. Traditionally, it has been assumed that for a given FEC code, a required post-FEC BER can be translated into an equivalent required pre-FEC BER. Called the FEC limit paradigm, this approach allowed system analysis to focus on evaluating only the relatively high pre-FEC BER of worst-case links, which can be simulated in software as described below.  However, when widely deployed, 200 Gbps links will face a diversity of noise and ISI in each of the three constituent hops, which will translate into different error correlations and, hence, different post-FEC BER even at the same pre-FEC BER.  Moreover, the choice of equalization in each receiver may also impact the end-to-end post-FEC BER.  Thus, it has become necessary to accurately evaluate the very low probability of FEC decoder failure in the presence of all these variations.  It is also desirable to consider the impact of different DSP approaches (e.g. DFE vs. MLSD).

Software approaches to FEC analysis

An approach often used in the analysis of such protocols is the use of software-based time-domain simulation. A program models the transmission of test data through a wireline channel that captures different signal-integrity impairments and counts the resulting bit errors. But software simulation is slow compared to the speed of the physical system, a problem that is compounded both by the need to target extremely low post-FEC BERs and the use of DFE and MLSD techniques that are more computationally complex.

A typical analysis holds the electrical link’s BER at a constant level, while the optical link’s BER is swept to find the level at which the overall system’s codeword error ratio (CER) is sufficiently low. To meet the Ethernet standard’s specification, this target CER level should be 1.45×10⁻¹¹. Obtaining enough data for analysis using only software simulation can take days, or even weeks. This delay is prohibitive in a situation where a development team needs to try out different protocol and hardware-design strategies.
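
A rough calculation shows why brute-force software simulation cannot directly confirm error ratios this low. The assumed simulator throughput is purely illustrative; real figures vary widely with model complexity.

```python
TARGET_CER = 1.45e-11           # target codeword error ratio from the Ethernet specification
ERRORS_NEEDED = 10              # rough number of codeword errors for a meaningful estimate
BITS_PER_CODEWORD = 544 * 10    # RS(544) codeword of 10-bit FEC symbols
SIM_THROUGHPUT_BPS = 5e6        # assumed software simulation speed, in simulated bits per second

codewords = ERRORS_NEEDED / TARGET_CER
bits = codewords * BITS_PER_CODEWORD
days = bits / SIM_THROUGHPUT_BPS / 86_400
print(f"~{codewords:.1e} codewords (~{bits:.1e} bits), ~{days:,.0f} days of simulated traffic")
```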

Another possibility is to use a statistical analysis tool to predict a system’s post-FEC BER without having to run a long time-domain simulation. However, there are currently no statistical methods available that can accurately model the architectures considered for the 200 Gbps Ethernet standard. The equalization methods and symbol interleaving techniques proposed for the standard introduce too much complexity to statistical modelling.

Evaluation of hardware for FEC analysis

As the overhead of running software on even high-performance processors represents the biggest bottleneck, a viable solution is to use programmable hardware as the simulation engine. This provides the ability to evaluate algorithms and changing channel conditions in far less time. It also lets developers try out ideas and implement them quickly by reconfiguring the hardware platform.

The capacity of today’s field-programmable gate arrays (FPGAs) allows many parallel instances of a complete simulation to run on a single device, including a built-in processor to manage the dataflows. A platform for doing so is described in “An FPGA-Accelerated Platform for Post-FEC BER Analysis of 200 Gb/s Wireline Systems”.

To avoid allocating space to resource-intensive Reed-Solomon encoders and decoders, the FPGA model need not include them. Instead, the platform can use a checker to detect the number of FEC-symbol errors in each Reed-Solomon codeword. If the total number of FEC-symbol errors in a codeword exceeds 15, then a codeword error can be registered without actually performing the decoding. To allow high flexibility in modelling different setups, the hardware platform can be parameterized. Moreover, for any new emerging applications, different constituent linear block codes can be substituted.
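
A minimal sketch of that checker, in Python for readability (the real implementation is FPGA logic): it simply counts differing FEC symbols per outer codeword and registers a failure whenever the count exceeds the correction capability.

```python
CODEWORD_SYMBOLS = 544   # outer RS codeword length in FEC symbols
T_CORRECTABLE = 15       # symbol errors correctable by the outer code

def count_codeword_failures(tx_symbols, rx_symbols):
    """Flag uncorrectable codewords by symbol comparison, without running an RS decoder."""
    failures = 0
    for start in range(0, len(tx_symbols), CODEWORD_SYMBOLS):
        tx = tx_symbols[start:start + CODEWORD_SYMBOLS]
        rx = rx_symbols[start:start + CODEWORD_SYMBOLS]
        symbol_errors = sum(1 for a, b in zip(tx, rx) if a != b)
        if symbol_errors > T_CORRECTABLE:
            failures += 1
    return failures
```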

A key challenge in performing accurate time-domain simulations is the generation of random noise. For example, a white-noise spectrum is commonly used to evaluate FEC performance, which suits statistical techniques. But amplifiers tend to colour the noise spectrum. By passing the AWGN through a finite impulse response (FIR) filter, the model can better represent this coloured noise. Noise generation is one of the most computationally intensive operations in time-domain simulations, and can be significantly accelerated on an FPGA platform.
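
As a sketch of the coloured-noise idea, the NumPy snippet below shapes white Gaussian samples with a short FIR filter. The filter taps and noise amplitude are illustrative values only.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05                                   # assumed RMS noise amplitude
white = rng.normal(0.0, sigma, 1_000_000)      # additive white Gaussian noise

fir_taps = np.array([0.5, 0.3, 0.15, 0.05])    # example low-pass shaping filter
fir_taps /= np.sqrt(np.sum(fir_taps ** 2))     # normalize so total noise power is preserved
coloured = np.convolve(white, fir_taps, mode="same")

# Neighbouring samples of white noise are uncorrelated; the filtered noise is not.
print("lag-1 autocorrelation:", np.corrcoef(coloured[:-1], coloured[1:])[0, 1])
```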

As in software, hardware channel emulation can trade off simulation accuracy and speed.  Accurately capturing noise statistics and the ISI of channels with a long response requires more logic.  Thus, simpler modelling allows for more parallel instances of a channel simulation to run on any given FPGA platform.  For example, using a simple AWGN channel model, an FPGA might host 200 parallel emulations, whereas a more complex channel with a soft-decision inner FEC decoding scheme might reduce the number of cores to eight.  Nevertheless, throughputs of at least four orders of magnitude faster than software-only simulation are readily achievable.
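
The orders-of-magnitude claim can be checked with simple throughput arithmetic. The clock rate, bits per instance per cycle, and software baseline below are assumptions chosen only to illustrate the scale of the gap.

```python
FPGA_CLOCK_HZ = 200e6          # assumed emulation clock frequency
PARALLEL_INSTANCES = 200       # parallel channel emulations with a simple AWGN model
BITS_PER_CYCLE = 2             # one PAM4 symbol (2 bits) per instance per clock cycle
SW_THROUGHPUT_BPS = 5e6        # assumed software simulation speed

fpga_bps = FPGA_CLOCK_HZ * PARALLEL_INSTANCES * BITS_PER_CYCLE
print(f"FPGA: ~{fpga_bps:.1e} bit/s, software: ~{SW_THROUGHPUT_BPS:.0e} bit/s")
print(f"speed-up: ~{fpga_bps / SW_THROUGHPUT_BPS:.0e}x")
```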

Hardware simulation platforms can be scaled using multiple FPGAs in parallel. For example, using dozens of FPGAs, CER levels can be validated to the Ethernet standard in less than a day. This significant speed improvement over existing simulation methods makes it possible to explore new and complex FEC architectures with high accuracy and support the creation of reliable optical connectivity at 200 Gbps and beyond.

In addition to discussing the emerging demands of FEC, the company will demonstrate products and IP for mid-distance transmission over PAM4 optical connections at OFC in the Exhibition Hall, booth #5645.

Exploring Alphawave Semi’s AlphaCHIP1600-IO Chiplet and its Real-World Applications

By Shivi Arora, ASIC IP Solutions, and Sue Hung Fung, Product Line Manager

Meet the AlphaCHIP1600-IO: a high-bandwidth, low-latency, and power-efficient chiplet built for next-generation connectivity. It integrates the UCIe™ standard for die-to-die communication alongside support for PCIe® 6.0, CXL® 3.1, and 800G Ethernet protocols—enabling robust compute and networking interfaces. Designed with up to 16 lanes of 112G multi-standard SerDes PHY, this chiplet delivers an impressive total bandwidth of up to 1.6 Tbps.

As AI and ML infrastructures, data centers, networking, telecom, cloud, and edge computing continue to evolve, the demand for modular and scalable system integration is accelerating. The AlphaCHIP1600-IO chiplet has been purpose-built for use in SiPs (System-in-Packages) targeting these next-generation platforms. In these environments, robust connectivity across servers, racks, and storage systems is critical—and chiplets offer the modular upgrade path needed to meet this demand.

For SoC designers working on AI data centers and other high-performance applications, traditional monolithic die designs have hit the reticle limit. To scale compute and bandwidth further, disaggregation is essential. Chiplet-based architectures offer a strategic solution by offloading I/O functionality and interfaces from the main die, enabling smaller die sizes, improved yield, and increased design flexibility.

By adopting Alphawave Semi chiplet architectures, design teams can significantly reduce time to market and avoid costly redesigns. This approach also lowers NRE (non-recurring engineering) and manufacturing costs by allowing multiple dies, potentially built on different process nodes, to be integrated on a single package substrate. The AlphaCHIP1600-IO chiplet simplifies this integration by offering high-speed I/O connectivity with support for PCIe®, CXL®, and Ethernet protocols—all through a UCIe™ interface.

This achievement also reflects Alphawave Semi’s deep alignment with industry standards and momentum around chiplet-based design. As Brian Rea, Chair of the UCIe Consortium’s Marketing Working Group, noted, “UCIe™ (Universal Chiplet Interconnect Express™) standardizes the interconnect interface between chiplets, ensuring compatibility and ease of integration across different generations and types of chiplets. We are excited to see Alphawave Semi's commitment to innovation and excellence in advancing the chiplet ecosystem with UCIe technology.”

By integrating UCIe into a silicon-proven I/O chiplet, Alphawave Semi is setting the foundation for scalable and interoperable chiplet solutions that support the growing demands of AI, HPC, and cloud infrastructure.

AlphaCHIP1600-IO Chiplet: Real-World Use Cases for High-Bandwidth Connectivity

End-to-End Ethernet Traffic

Alphawave Semi’s AlphaCHIP1600-IO chiplet demonstrates robust, high-performance Ethernet connectivity by driving a Direct Attach Copper (DAC) link with consistently strong signal integrity and healthy link margins.

In the demonstration below, Ethernet packets are generated by a NIC connected to a host and transmitted end-to-end across two AlphaCHIP1600-IO chiplets, which are interconnected via the UCIe™ standard.

This setup highlights the chiplet’s ability to deliver scalable and flexible Ethernet connectivity—enabling both scale-up bandwidth and scale-out system expansion—making it ideally suited for next-generation data center and AI workloads.

End-to-End Ethernet over AlphaCHIP1600-IO Chiplets

Demonstration Highlights

  • Robust End-to-End Ethernet Connectivity
    The AlphaCHIP1600-IO chiplet drives a QSFP-DD DAC link with strong signal integrity and consistently reliable link margins.
  • UCIe-Based Chiplet Interconnection
    This demo showcases Ethernet packet transfer between two AlphaCHIP1600-IO chiplets using the Universal Chiplet Interconnect Express (UCIe) standard.
  • Scalable and Flexible Architecture for AI and Data Centers
    Designed to support both scale-up bandwidth and scale-out system expansion, the architecture is ideal for future-proofing high-performance data center and AI workloads.

Linear Pluggable Optics and Live Interoperability

800G Optical Connectivity: AlphaCHIP1600-IO + Lessengers’ VCSEL LPO

At OFC 2025, in the OIF live interoperability booth, Alphawave Semi successfully demonstrated the AlphaCHIP1600-IO chiplet connected to a QSFP-DD SR8 Linear Pluggable Optics (LPO) VCSEL module. The setup received Ethernet traffic over optical fiber from a switch, achieving reliable link-up and a healthy Bit Error Rate (BER) margin—clearly validating the chiplet’s robust signal integrity and high-performance capabilities.

This live demonstration proved the flexibility and scalability of the AlphaCHIP1600-IO chiplet for data center and cloud-scale networking. Paired with a VCSEL-based LPO, the solution meets the bandwidth and efficiency demands of next-generation scale-up and scale-out architectures while maintaining signal integrity over extended optical links.

Demonstration Highlights

  • Chiplet-Driven 800G Optical Connectivity
    AlphaCHIP1600-IO chiplet powers Lessengers’ 800G QSFP-DD SR8 Linear Pluggable Optics (LPO) module for advanced high-speed optical data transmission.
  • Ethernet Link to Switch Infrastructure
    Demonstrates reliable high-speed Ethernet connectivity, successfully linking to a switch in a real-world network scenario.
  • Stable BER Across Long-Distance Optical Links
    Delivers consistent, high-performance Bit Error Rate (BER) over VCSEL-based LPO modules, even across fiber links extending up to 100 meters.

PCIe Link up to a Host System

AlphaCHIP1600-IO Chiplet Link-Up to Host Demonstration

The AlphaCHIP1600-IO chiplet has successfully demonstrated a reliable PCIe® link-up to a host system, showcasing its versatility and robust plug-and-play capabilities.

In this demonstration, the AlphaCHIP1600-IO chiplet was configured as an Endpoint device and connected to a host system motherboard acting as the Root Complex. The system achieved a healthy link initialization within a standard PCIe plug-and-play environment—proving the chiplet’s readiness for I/O disaggregation and its applicability across a wide range of compute architectures.

Demonstration Highlights

  • AlphaCHIP1600-IO PCIe PHY/Controller Link-Up
    The chiplet established a stable and healthy PCIe link with the host system, demonstrating healthy performance, signal integrity, and interoperability.
  • Plug-and-Play Simplicity
    Designed for modern multi-chiplet ecosystems, the AlphaCHIP1600-IO effortlessly negotiates link-up in PCIe-compliant systems without manual intervention.
  • Flexible Endpoint/Root Complex/Switch Integration
    The chiplet operates as an Endpoint while the host system acts as the Root Complex — a configuration ideal for modular designs where scalability and interoperability are paramount. The chiplet can also be configured as Root Complex or Switch mode.

Conclusion

At Alphawave Semi, our mission is to enable high-performance, scalable connectivity solutions for the next generation of compute and data infrastructure. Whether you're designing for AI/ML clusters, cloud infrastructure, or hyperscale data centers, Alphawave Semi is driving the future of chiplet-based systems with high-speed electrical or optical connectivity.

The silicon demonstrations above highlight Alphawave Semi’s continued innovation in delivering scalable, standards-compliant chiplet solutions. Whether you're architecting next-generation AI, HPC platforms, or disaggregated data center systems, the AlphaCHIP1600-IO chiplet is engineered to meet your performance, flexibility, and interoperability needs.

Validated samples are now available, complete with hardware, firmware, and software support.

Ready to learn more?

Let’s discuss how the AlphaCHIP1600-IO chiplet can accelerate your connectivity roadmap.

Contact Us or Download the Datasheet to get started

In this video from Chiplet Summit, learn about Alphawave Semi’s new AlphaCHIP1600-IO, an industry-first multi-protocol IO chiplet supporting PCIe Gen 6, CXL 3.1, and 800G Ethernet.

Read more about Alphawave Semi’s tape-out of the industry’s first off-the-shelf multi-protocol I/O connectivity chiplet on TSMC’s 7nm process here.

Learn more about how Alphawave Semi drives innovation in hyperscale AI accelerators with its I/O chiplet for Rebellions Inc. here.

Intel and Alphawave Semi Demonstrate UCIe Interoperability

By Soni Kapoor, Product Marketing Manager and Sue Hung Fung, Product Line Manager

Chiplets bring several advantages in terms of cost, performance and yield, but among the key benefits is the ability to enact multi-vendor chiplet ecosystems. In these, chiplets from a range of vendors, each built using the optimal process node for that function, can work together seamlessly in a SiP (System in a Package). But this is only true when we have a standard, open (i.e. non-proprietary) die-to-die communication protocol.

Universal Chiplet Interconnect Express (UCIe) provides that common standardized interface for chiplets to communicate regardless of the vendor, foundry, or process node. It provides design flexibility over form factors, such as the use of organic substrate standard packaging, advanced packaging with silicon interposer, RDL, bridge type solutions, or 3D hybrid bonding. Critically, this interoperability allows elements within chiplet-based designs to be easily swapped out and/or upgraded without a redesign of the entire system.

In this paper we will provide an overview of the scenarios covered in an Intel and Alphawave Semi UCIe interoperability demonstration. Stimulus was provided by Intel UCIe standard package PHY and received by Alphawave’s UCIe PHY. The bring-up process of PHY parameters was constrained by specific configurations of clock modes, data rates, and other UCIe technical specifications that ensure proper system function. The multi-module bring-up requirements and states from the link training state machine (LTSM) were covered during the training phases.

Overview of Design Verification Setup for Interoperability Testing

Figure 1: Design Verification Setup and placement of vectors for Alphawave Semi and Intel UCIe PHYs interoperability testing

As shown in Figure 1 above, the Intel vectors containing sideband and mainband signals are driven to the UCIe DUT interface by Alphawave Semi’s UCIe driver. These signals are driven to the design RX via the UCIe interface. The design simulates and sends the response over the UCIe interface to bring up the LTSM.

PHY Parameters Constrained During UCIe Bring-Up

During the UCIe bring-up phase, several PHY parameters were constrained to ensure proper system function and configuration. The breakdown below outlines these parameters and the range of acceptable configurations; a configuration sketch follows the list.

Detailed Parameter Breakdown
  • Maximum Data Rate: Constrained to 16 GT/s during bring-up, but the UCIe PHY can support a wider range, from 4 to 32 GT/s, depending on system configuration.
  • Clock Mode: Strobe mode is used during the bring-up, but a continuous clock mode is available in other scenarios.
  • Clock Phase: Differential clock phase is constrained, while quad-phase configurations are excluded.
  • Module ID: Bring-up occurs with a single UCIe module (M0), while the other modules (M1-M3) are also brought up successfully through the standard link-state process.
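
A hypothetical configuration record capturing those constraints might look like the following. The field names are illustrative; only the constrained values themselves (16 GT/s, strobe clock, differential phase, module M0) come from the testing described above.

```python
ucie_bringup_config = {
    "max_data_rate_gtps": 16,       # constrained for bring-up; the PHY supports 4 to 32 GT/s
    "clock_mode": "strobe",         # continuous clock mode is available in other scenarios
    "clock_phase": "differential",  # quad-phase configurations excluded during bring-up
    "module_id": "M0",              # single module used; M1-M3 follow the same flow
}

print(ucie_bringup_config)
```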

Multi-Module Bring-Up Requirements

At this point, it should also be noted that when bringing up multiple modules, the Intel stimulus required separate vectors for each module. Each vector must be associated with a distinct Module ID. System bring-up for multiple modules was successfully covered by instantiating the same Intel vectors for each module. Some of the LTSM states required from reset to an active die-to-die link are described below.

Figure 2: UCIe LTSM States
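
The snippet below is a simplified, illustrative walk through the main training path, assuming the usual UCIe progression from RESET through sideband initialization (SBINIT), mainband initialization and training (MBINIT, MBTRAIN) and link initialization (LINKINIT) to ACTIVE; retrain, error, and power-management states are omitted.

```python
LTSM_MAIN_PATH = ["RESET", "SBINIT", "MBINIT", "MBTRAIN", "LINKINIT", "ACTIVE"]

def next_state(current, step_completed=True):
    """Advance one step along the main bring-up path; hold the state if a step has not completed."""
    if not step_completed or current == "ACTIVE":
        return current
    return LTSM_MAIN_PATH[LTSM_MAIN_PATH.index(current) + 1]

state = "RESET"
while state != "ACTIVE":
    state = next_state(state)
    print(state)
```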

Results and Achievements

The collaborative interoperability testing between Intel and Alphawave Semi yielded several significant achievements.

  • Firstly, the testing accomplished bring-up of the UCIe link, which includes both sideband and mainband signal bring-up.
  • Data integrity was also demonstrated, with the LFSR (Linear Feedback Shift Register) pattern sent over mainband during LINKSPEED state.
  • Timing was synced with design behavioral models and the vectors to bring up LTSM within 40 µs.
  • Furthermore, during the process, the design under test (DUT) was able to seamlessly bring up the state machine with 0 new bugs found.
  • Specifically, we learned both that the vectors contained mainband clock and valid glitches to mimic real-time scenarios, and that the DUT was able to handle them.
  • And, using the data provided in the Intel vectors, we developed an understanding of how to synchronize their timing with the DUT for better simulations.

Identified challenges and lessons learned for UCIe Interoperability

The interoperability testing also highlighted several challenges and provided valuable lessons for future UCIe ecosystem development.

  • As part of this team effort, several issues were identified and resolved for the successful interoperability between Intel and Alphawave Semi UCIe PHY IPs. These were primarily related to the three following groups:
    • Some unintentional mainband valid bit-slips in vectors.
    • Valid and clock alignment issue on mainband signals.
    • And shifts in the lane id pattern sent during bring up.
  • The results and analysis also showed areas to improve the dataset, notably that it’s possible to provide vectors with a smaller delay to ensure greater accuracy.
  • Additionally, Alphawave Semi provided a list of additional PHY parameters for inclusion in Intel vectors to test interoperability across all scenarios, such as clock mode, in future testing.

Conclusion

Intel UCIe PHY stimulus played a critical role in the UCIe link bring-up and interoperability process of the Alphawave Semi UCIe PHY. The UCIe modules demonstrated stability of the die-to-die link after the reset, sideband, and mainband training and calibration processes were performed.

In conclusion, this simulation-based demonstration of interoperability between the Intel and Alphawave Semi UCIe PHYs enables successful and robust die-to-die communication in a broader multi-vendor open chiplet ecosystem.

Learn more about the AresCORE UCIe Die-to-Die PHY IP here.

Read our blog on "UCIe for 1.6T Interconnects in Next-Gen I/O Chiplets for AI data centers" and discover how this open standard is enabling the future of hyperscale AI infrastructure.

UCIe for 1.6T Interconnects in Next-Gen I/O Chiplets for AI data centers

Letizia Giuliano, VP of IP Product Marketing & Management, Alphawave Semi

The rise of generative AI is pushing the limits of computing power and high-speed communication, posing serious challenges as it demands unprecedented workloads and resources. No single design can be optimized for the different classes of models – whether the focus is on compute, memory bandwidth, memory capacity, network bandwidth, latency sensitivity, or scale, all of which are affected by the choke point of interconnectivity in the data center.

Processing hardware is garnering attention because it enables faster processing of data, but arguably as important is the networking infrastructure and interconnectivity that enables the flow of data between processors, memory and storage. Without this, even the most advanced models can be slowed from data bottlenecks. Data from Meta suggests that more than a third of the time data spends in a data center is spent traveling from point to point. By preventing the data from being effectively processed, connectivity is choking the current network and slowing training tasks.

Fig 1: Data from Meta suggests more than a third of the time data spends in a data center is effectively wasted, moving from processor to processor across the network.

 

AI data centers

Infrastructure architecture for AI data centers requires a new design paradigm from that used in traditional data centers. Machine learning-accelerated clusters residing in the network’s back end, which handle AI’s large training workloads, require high-bandwidth traffic to move across the back-end network. Unlike the front-end network, where packet-by-packet handling is needed, this traffic (typically) moves in regular patterns and operates with high levels of activity.

There are steps to reduce latency, with fast access to other resources enabled by a flat hierarchy that limits hops. This prevents compute from being underutilized, as the performance of AI networks can be bottlenecked by even one link with frequent packet loss. Non-blocking switch designs and network robustness are critical design considerations. These back-end ML networks allow AI processors to access each other’s memory seamlessly, as the dedicated network isolates the data from the vagaries of front-end network demands, which rise and fall with various priorities depending on the incoming compute request.

Connectivity in AI data centers

 

Focusing on AI cluster connectivity: as CPUs have taken a managing role in the proliferation of AI connectivity in the data center, a suite of connectivity solutions is required to keep the XPUs networked together and to enable the scale necessary to keep up with the latest models. Alphawave Semi offers industry-leading solutions for all of these.

Fig 2: A simplified data center network displaying connectivity required.

The AI cluster connects to the front-end networks via Ethernet through a network interface card (NIC), which can run at up to 112G per lane today; you will likely find the fastest speeds in the Ethernet switches tying all of these NICs together, and 224G SerDes enabling 1.6T pluggables are coming soon with the IEEE P802.3dj standard.

Internal to the server, PCIe is used to connect the CPU to the front-end NIC, the XPUs, and other peripherals such as storage, with PCIe 5.0 used today, PCIe 6.0 to be deployed very soon, and PCIe 7.0 not far off, with the first revision of its specification set to be published in 2025. PCIe offers an intrinsically lower-latency solution than Ethernet for this aspect of AI connectivity, and it improves the efficiency and uptime of compute and storage connections by utilizing a PCIe PHY with either a PCIe or CXL layer on top to pool the memory and storage resources of the network. This effectively makes the memory and storage look like one large resource, instead of many fragmented resources in practice.

The back-end network, where low latency and high speed are paramount, sees the proliferation of tailored connectivity MSAs. Ultra Accelerator Link (UALink) will be used as an alternative to other proprietary technologies to deliver highly parallelized connectivity to scale up over 1,000 XPUs together.

Within a server or compute rack, internal links between XPUs also use UALink (or another proprietary solution) with extremely dense I/O within the package to connect the compute die to memory.

Scaling network barriers

Scaling this model makes it increasingly difficult for monolithic SoCs to include all the necessary communication bandwidth plus the additional required functionalities without going past the reticle limits of the photolithographic equipment. Doing so introduces more defects and increases wafer costs, ultimately reducing yields.

The alternative approach is to use a chiplet-based architecture. Unlike traditional monolithic SoCs, this unites small and specialized building blocks that have each been created on a process to specifically optimize for function, cost, power, etc. Chiplets can, in many ways, be thought of as an extension to the IP model, with the final SoC using chiplets from multiple vendors.

Fig 3: Moving to a chiplet design in 7, 5, and 3 nm can reduce the total cost of ownership – by more than 30 percent for the largest SoCs on advanced processes.

Having a smaller die size delivers greater yields and, due to the ability to reuse proven silicon, the model cuts NRE (non-recurring engineering) costs. Data also suggests that by using this model it is possible to reduce the power of the overall system by somewhere between 25 and 50 percent.

Moving back to the data center, we can envision different types of I/O chiplets with different configurations of I/O connectivity. These would then be combined with memory chiplets for different types of memory subsystems, and compute chiplets developed for specific types of workloads or AI applications.

These are not the only market drivers behind a move to chiplet architectures, with advances in IC packaging technologies playing a crucial role. These include 2.5D silicon interposer, RDL interposer, embedded bridge, and 3D, where we see the deployment of hybrid bonding and technologies to enable a more stacked-die solution.

Industry standards on die-to-die interconnect protocols also play a key role. Cross-supply-chain collaborations, like the MDI Alliance from Samsung and the 3DFabric™ Alliance from TSMC, are helping to simplify assembly and are driven by the foundries.

1.6T

Monolithic SoC dies currently exceed the reticle limit and have done so for more than five years. Before this limit was reached, it was possible to increase the die size and deliver increased bandwidth, since a larger die area equals more pins. But continuing along this path becomes prohibitively expensive, as one can no longer significantly increase the number of lanes around the chip or in the cable, and therefore cannot simply add significantly more ports.

Another aspect to consider is thermal density on the front panel. The standard rack size used for data center infrastructure limits how much power can be dissipated and how deep and wide a typical rack server can be designed. Ultimately, the bandwidth per lane must be increased, and architectures must move to a more scalable approach.

For example, a 51.2 terabit switch in use today (created by aggregating 512 x 100G links) would benefit from increasing the data rate per link to the incoming 200G rate. This enables the same bandwidth to be delivered with fewer links (256) and would move the SoC back below the reticle limit. Increases above 51.2 terabits could then be delivered via a more scalable path (to 512 x 200G links) using a chiplet architecture.
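
The lane-count arithmetic behind that example is straightforward:

```python
SWITCH_BANDWIDTH_GBPS = 51_200   # 51.2 Tb/s switch capacity

for lane_rate_gbps in (100, 200):
    lanes = SWITCH_BANDWIDTH_GBPS // lane_rate_gbps
    print(f"{lanes} lanes needed at {lane_rate_gbps}G per lane")

# Keeping 512 lanes while moving to 200G per lane doubles the capacity.
print(f"512 x 200G = {512 * 200 / 1000:.1f} Tb/s")
```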

The future AI SoC

Returning to the ‘traditional’ monolithic AI SoC, with its limited peak-connectivity anatomy, these are (typically) large chips designed on 2 and 3 nm process nodes. As discussed earlier, these include CPU cores, interconnect, memory, cache, and SRAMs, with a custom accelerator and security IP, plus core-to-core and logic-to-logic connectivity and a dedicated chip-to-chip link.

PCIe and Ethernet connectivity are key IP building blocks in the AI SoC that can be easily disaggregated. Doing so reduces the communication functions on a large AI SoC die to small I/O chiplet dies for use across different systems and applications. In such a chiplet-based SoC, the key connectivity IP for this will be PCIe/CXL, Ethernet, UCIe and HBM.

There are some drawbacks when implementing a chiplet architecture. There is a duplication of the same function on different dies; the same PHY and controller will likely be on both sides of a D2D link. This can impact power, area, and latency, and it is vital to consider multiple factors when selecting the die-to-die interconnect in order to optimize the design.

Bandwidth density must be optimized to match the form factor and the package type being used, as well as cost. Additionally, power consumption must be a key design consideration, with sub-pJ/bit being the target, as each node looks to devote as much power as possible to the compute and not to the interconnect.

These considerations make the UCIe protocol a key choice. Not only is it designed for the lowest possible latency, but it also has a bandwidth density greater than 10 Tbps/mm and a power consumption of 0.3 pJ/bit. In addition, the standard is highly robust and has full protocol stack definitions and platforms for interoperability.

Fig 4: A comparison of chiplet die-to-die interconnect protocols in terms of energy efficiency and density.

As seen in Figure 4 above, the optimal die-to-die interconnect between chiplets in terms of shoreline bandwidth density (Gbps/mm per pJ/bit) is the UCIe parallel interface; chiplet designs powered by UCIe can enable 224G SerDes and the next generation of high-radix use cases, such as switches.
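
Those headline figures translate into very small power and shoreline budgets for a die-to-die interface. The 1.6 Tb/s aggregate bandwidth below is an example value chosen for illustration, not a product specification.

```python
LINK_BANDWIDTH_BPS = 1.6e12        # example aggregate die-to-die bandwidth
ENERGY_PER_BIT_J = 0.3e-12         # 0.3 pJ/bit, as quoted above
EDGE_DENSITY_BPS_PER_MM = 10e12    # >10 Tbps per mm of die edge, as quoted above

power_w = LINK_BANDWIDTH_BPS * ENERGY_PER_BIT_J
edge_mm = LINK_BANDWIDTH_BPS / EDGE_DENSITY_BPS_PER_MM
print(f"~{power_w:.2f} W and ~{edge_mm:.2f} mm of die edge for a 1.6 Tb/s link")
```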

The Future of AI Chiplets

AI workloads are set to grow in complexity and size, with the need for advanced silicon solutions increasing in line with this.

Alphawave Semi is committed to leading this charge, and our chiplets, which have been optimized for AI applications from I/O to memory to compute, are enabled by UCIe in addition to a myriad of other high-performance interconnects.

In September 2024 Alphawave Semi announced the availability of the industry’s first silicon-proven UCIe subsystem on 3 nm processes, developed specifically for use in high-performance AI infrastructure. Alphawave Semi also launched a 1.2 Tbps connectivity chiplet for HBM3E subsystems (June 2024), taped out the first multi-protocol I/O chiplet for HPC and AI infrastructure, and collaborated with Arm, Samsung and TSMC on AI and HPC chiplets using process nodes down to 2 nm. To round off these accomplishments, TSMC announced Alphawave Semi as its high-speed SerDes partner of the year in September 2024.

Modern data centers require scalability, power efficiency and flexibility, and by driving innovation in chiplet-based design, advanced packaging, and interconnect technology, Alphawave Semi is leading the way for the next generation of AI-enabled data centers.

Further information on our chiplet offering is available here, and on our AresCORE UCIe die-to-die PHY IP is available here.

How PCIe® Technology is Connecting Disaggregated Systems for Generative AI

David Kulansky, Director of Solutions Engineering, Alphawave Semi

Published by: www.pcisig.com

PCIe technology is set to be leveraged as an important component in the AI infrastructure marketplace. According to the “PCI Express Market Vertical Opportunity” report from ABI Research, the expected total addressable market (TAM) for PCIe technology in AI will grow from $449.33 million to $2.784 billion by 2030, at a compound annual growth rate (CAGR) of 22%. One emerging use case for AI is generative artificial intelligence, or GenAI. GenAI is a type of AI technology that is used to produce content, including text, images, video, audio and more. As GenAI evolves, some unique challenges in GenAI applications are becoming clear, such as the need for low-power, low-latency, robust technologies to connect these systems together. Due to the continuing increase in complexity and scale of Large Language Models (LLMs), the most advanced generative AI models can’t fit on one GPU, one server, one rack, or even a single data center.
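
Since the excerpt does not state the base year of the ABI Research forecast, a quick back-calculation from the 22% CAGR (a sketch, assuming simple annual compounding) indicates the implied forecast period:

```python
import math

start_value = 449.33e6   # USD, starting TAM
end_value = 2.784e9      # USD, TAM by 2030
cagr = 0.22

years = math.log(end_value / start_value) / math.log(1 + cagr)
print(f"implied growth period: ~{years:.1f} years, i.e. a base year around {2030 - round(years)}")
```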

PCI Express® (PCIe®) technology offers numerous benefits for generative AI applications, since its inherent DNA is perfectly suited to enabling disaggregated systems, including the distributed multiplication functionality of value for LLMs. In this blog, we’ll touch on how PCIe technology is used in generative AI today, how PCIe technology features align with growing AI demands, and how the relationship between PCIe technology and AI will continue to evolve for future applications.

PCIe Technology Features Meet the Technical Demands of Generative AI

PCIe technology is a ubiquitous I/O interconnect that provides the structure to connect nodes together by enabling low-latency, low-power connections, and always ensuring backwards compatibility. PCIe technology connects the entire data center, creating a pooled resource of compute, memory, and storage to fit the unique and specific needs of generative AI applications.

As the need for higher data rates continues and the industry makes the switch from NRZ to PAM4 signaling, Forward Error Correction (FEC) becomes essential to maintaining reliability. PCIe addresses this for generative AI, and all low-latency applications, by utilizing FLIT (Flow Control Unit) Mode. FLITs help maintain the low latency of PCIe technology while still delivering low post-FEC error rates. Additionally, the PCIe architecture includes Low Power Modes, which conserve energy when less data throughput is needed and allow for even greater savings when links are temporarily unused, through the L0p and L1 substates.

Digging further into the low-latency benefits of PCIe, hardware coherency plays a crucial role in scale-up networks to enhance efficiency. It’s not just about the overall bandwidth – latency in data exchanges can cause GPUs and CPUs to stall as they wait for data. Pipelined algorithms often depend on distributed results, and even a single node's delay can lead to significant slowdowns, idling valuable compute resources. PCIe technology, now with FLIT mode, keeps data transport delays minimized and consistent, allowing for efficient performance.

Future Evolutions of Generative AI With Emerging PCIe Technology

PCIe technology will help to evolve the applications of generative AI due to its scalability for both electrical and optical links in the back-end network, where AI operates. As bandwidths increase and electrical reaches decrease, CopprLink™ Internal and External Cables can extend the reach of PCIe signals within generative AI applications, with the CopprLink Internal cable having a maximum reach of 1m within a single system, while the CopprLink External cable extends the maximum reach to 2m. Additionally, the PCI-SIG Optical Work Group is currently investigating a path for enabling PCIe technology over optical links to ensure any PCIe link will be possible in the future of generative AI applications.

Join PCI-SIG to Support the Future of PCIe Technology and Generative AI

If you would like to support the future development of PCIe technology and generative AI, we encourage you to join PCI-SIG. Follow PCI-SIG on LinkedIn and Twitter/X for the latest information about the PCIe specifications, events and more. 

Unleashing AI Potential Through Advanced Chiplet Architectures

Tony Chan Carusone, CTO, Alphawave Semi

www.awavesemi.com

The rapid proliferation of machine-generated data is driving unprecedented demand for scalable AI infrastructure, placing extreme pressure on compute and connectivity within data centers. As the power requirements and carbon footprint of AI workloads rise, there is a critical need for efficient, high-performance hardware solutions to meet growing demands. Traditional monolithic ICs will not scale.  Thus, chiplet architectures are playing a critical role in scaling AI.

Combining chiplets via low-latency and high-bandwidth connections across modular, custom components facilitates performance growth beyond the reticle limit. Connectivity standards such as UCIe enable seamless inter-die communication. Chiplets also support AI scale-up and scale-out.  Even distributed AI across multiple sites benefits from chiplet architectures.

Harnessing the chiplet ecosystem to design flexible, interoperable compute and connectivity within a single package optimized for workloads is the only way to sustainably scale AI.

Data is proliferating

Machine-generated data is proliferating like never before. The size of the data sphere will reach 181 billion terabytes (181 ZB) next year, and the need to scale AI has accelerated the build-out of new and upgraded data center infrastructure. However, processing isn’t the only limit – improvements in connectivity will also be critical in scaling AI.

A decade ago, data was primarily generated by people interacting with technology, and its growth was linear. With autonomous sensor and video data, financial data and yet more data produced by analyzing other data, the growth became exponential.

This is driving a focus on AI to parse all this data. Compute infrastructure is being pushed to the absolute limit imposed by the performance of a single full-reticle-sized monolithic die. Hardware cost is a significant concern, since deploying AI at scale may require, for example, 8 GPUs per server across 20,000 servers at a cost of around $4 billion. Energy is also a limiting factor, representing millions of dollars in operational costs. Moreover, individual training runs are estimated to generate 500 tons of CO2, presenting environmental costs.
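
The implied per-GPU hardware cost in that example follows directly from the numbers quoted:

```python
gpus = 8 * 20_000                  # 8 GPUs per server across 20,000 servers
cost_per_gpu = 4e9 / gpus          # ~$4 billion total hardware cost
print(f"{gpus:,} GPUs at roughly ${cost_per_gpu:,.0f} each")
```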

AI is driving power consumption

Although computing power consumption has been a growing concern for a while, AI is accelerating the trend, as AI training demands that large chips operate with high activity, continuously for weeks or even months at a time.

According to the IEA, electricity consumption from data centers is expected to reach 1,000 TWh within the next two years. A Goldman Sachs analysis suggests that by 2030 the US alone will reach this figure, and that the data center share of power will rise from 2% in 2020 to 8% in 2030, with virtually all of the acceleration in electricity demand caused by AI.

If we assume the US’s mix of energy sources remains the same, then this 1,000 TWh will produce 389.8 metric megatons of CO2.

In many locations, data center consumption is a huge portion of total demand. In Ireland, it is expected to be 30% within a couple of years. Projections forecast that by 2026, global AI usage will consume more electricity than all bar five countries – Japan, Russia, India, the US and China.

Chips such as Nvidia’s H100 GPU consume more energy than previous devices and, due to their cost, will not be idle for long – unlike traditional compute, where overprovisioning addresses peak loads.

A bare H100 consumes 700W but, in a rack with networking and cooling, this can easily become 2kW. With Nvidia shipping 2 million GPUs annually, the power consumption would rival that of a large US city – and later processors almost double the consumption.
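
Multiplying those two figures gives a sense of the aggregate demand; this is simple arithmetic on the values above, not a formal estimate.

```python
gpus_per_year = 2_000_000      # annual GPU shipments cited above
power_per_gpu_kw = 2.0         # ~2 kW per installed GPU including networking and cooling

total_gw = gpus_per_year * power_per_gpu_kw / 1e6
print(f"~{total_gw:.0f} GW of continuous demand from one year of shipments")
```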

Clearly, there must be a focus on low power design within AI to reduce consumption and benefit the planet.

The role of chiplets within AI computing

AI development started before the availability of dedicated hardware, with early applications running on server-grade machines. Hardware and software developments over two decades enhanced AI performance.


Fig 1: Silicon has rapidly evolved to support AI growth and with it a chiplet ecosystem has emerged to enable growth beyond the reticle limit.

Finally, the use of highly-parallel GPU architectures accelerated progress.  Now, dedicated hardware is used to support the mathematical models for deep learning, including specific AI accelerator chips tailored to the data they must process.

Moving from teraFLOPS to petaFLOPS will require even more specialized hardware with low power consumption. It’s not just the AI accelerators that need this attention; it applies to devices such as CPUs and networking as well. The energy-efficient performance must come at a lower cost, as hyperscalers require hundreds of thousands of these devices to continue scaling AI.

The pace of AI development now demands new custom silicon chips every 12 months, but the design, verification and fabrication cycle of these custom devices is 18-24 months.  Thus, chiplets are the only viable solution to maintain the annual cadence of hardware upgrades that AI scaling demands.

To understand the benefits of chiplet architectures, consider a monolithic IC implementation where all the logic and I/O are on a single die. Implementing such a large ASIC monolithically implies all of it must be in a single technology. So, for example, DRAM must be outside the package. Additionally, the advanced CMOS die is burdened with analog/mixed-signal circuits that take away from the area available for logic and embedded memory. Such designs will be limited by the maximum die size and, as they approach this limit, yield falls, pushing costs higher. The integration of complex IP to make a single chip increases the time required for design and verification, and if that timeline is rushed, risk increases.


Fig 2: Chiplets deliver significant benefits in designing for advanced AI.

However, with chiplets, higher performance, greater bandwidth and lower cost are realized by putting the latest processors on sub-3nm tiles, interconnected with low-latency die-to-die interfaces. I/O transceivers for Ethernet and PCIe can be moved to separate I/O tiles, which can be implemented in whatever process technology best suits those circuits.

All tiles can be separately pre-designed and validated in parallel, lowering the overall design cycle time. Developing chiplets in this way can increase the compute power for a single part while increasing yield – and lowering cost by up to 40% – due to the use of smaller dies.
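The yield effect can be illustrated with a simple Poisson defect model. The die areas and defect density below are assumptions chosen for illustration, not figures from this article:

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float = 0.1) -> float:
    """Probability that a die of the given area has no killer defects."""
    return math.exp(-(area_mm2 / 100.0) * d0_per_cm2)

# Assumed: one 800 mm^2 monolithic die vs. four 200 mm^2 chiplets, D0 = 0.1 defects/cm^2.
mono_yield    = poisson_yield(800)   # ~0.45
chiplet_yield = poisson_yield(200)   # ~0.82 per chiplet

# Silicon cost scales roughly with area / yield (ignoring packaging and D2D overhead).
mono_cost    = 800 / mono_yield
chiplet_cost = 4 * 200 / chiplet_yield
print(f"Relative silicon cost saving: {1 - chiplet_cost / mono_cost:.0%}")   # ~45%
```

In practice, packaging, test and die-to-die interface overheads claw back part of that raw silicon saving, which is consistent with the ‘up to 40%’ figure quoted above.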

Scaling this across millions of AI accelerators within data centers leads to incredible savings. Reasonable estimates put this in the region of billions of dollars for hardware and hundreds of gigawatts of power.  With these clear benefits, chiplets are being used for virtually all large compute & networking chips entering development today.

UCIe for die-to-die connectivity

The key to successful chiplet-based designs is low-latency, low-power interconnect between tiles, allowing the whole system to perform as a single chip.

The UCIe die-to-die interconnection standard is proving crucial in the ecosystem. It is an open standard allowing developers to choose the optimum elements for each part of the SoC. It offers 10+ Tbps per mm of die edge at an energy consumption below 0.5 picojoules per bit, with a latency of just a few clock cycles. To achieve this combination of bandwidth density and low power, attention must be paid to maintaining signal integrity, power integrity, and high-performance clocking across the interface.
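Those two headline figures also set the power budget of the interface itself. A minimal sketch, using the numbers quoted above and an assumed 4 mm stretch of die edge dedicated to die-to-die links:

```python
# UCIe headline figures quoted in the text: 10 Tbps per mm of die edge at < 0.5 pJ/bit.
bandwidth_per_mm_bps = 10e12     # 10 Tbps per mm of shoreline
energy_per_bit_j     = 0.5e-12   # 0.5 pJ/bit, taken as an upper bound

power_per_mm_w = bandwidth_per_mm_bps * energy_per_bit_j   # 5 W per mm of shoreline

edge_mm = 4   # assumed shoreline allocated to die-to-die links
print(f"{bandwidth_per_mm_bps * edge_mm / 1e12:.0f} Tbps over {edge_mm} mm "
      f"at roughly {power_per_mm_w * edge_mm:.0f} W")
```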

Scaling AI with connectivity

AI is scaling along two dimensions. ‘Scaling up’ involves adding more resources per computer (i.e. more accelerators per server) while ‘scaling out’ adds more servers to spread the workload. The scaling is happening in both directions at once, while each individual AI accelerator is also becoming more powerful thanks to chiplet design. This is creating a dramatic increase in data processing and, thus, networking demands.

Traditional data center networks are evolving. As data rates increase, optical is replacing electrical in more links and active solutions (using retimers) are replacing passive solutions, yet there is increased focus on low power consumption and cost.

Machine learning AI clusters are connected by their own back-end networks, allowing direct memory access that is isolated from traffic demands on the front-end network. These back-end networks have specific requirements, including high sustained bandwidth due to the significant AI workloads running there. Scale-up networks, in particular, require low latency, which implies a flat hierarchy. Robustness and reliability of the links are crucial, as packet losses on a single link can bottleneck the whole workload.

Fig 3: AI clusters sit on a high-speed, low-latency back-end network (top) and distributed AI (bottom) delivers privacy but requires high-speed broadband connectivity.

Interconnectivity in the modern data center

A modern server shares a lot with a typical PC, containing one or more CPUs, memory, storage and a (front-end) network interface. The primary interconnect will usually be PCIe between the CPU and its peripherals, Ethernet to other compute nodes, and DDR for the memory.

Fig 4: AI compute nodes add xPU capability and connect to the back-end network.

An AI compute node is more complex but will include many similar parts. Powerful xPUs are added with significant quantities of their own dedicated memory. To perform the most complex machine learning/AI tasks, many xPUs will need to collaborate and have seamless access to each other’s memory. This requires a back-end network interface to connect to the ML network.

Thus, there is a suite of connectivity technologies that enable AI comprising PCIe, Ethernet, HBM and UCIe. Whereas Ethernet is always used for the front-end network, the back-end network may include customized Ethernet and/or PCIe interfaces that reduce latency and/or improve reliability. Connections to the CPU and storage leverage PCIe, and possibly CXL, interfaces. The internal links between chiplets in the CPU and xPU can use UCIe while the GPU connects to memory via HBM.

AI scaling is also impacting the connections to memory and storage. On average, a third of memory in data centers sits unused, yet a third of infrastructure spend is here – meaning over 10% of the overall spend is wasted.  Sharing storage in centralized pools allows for its more efficient use, affording cost savings of 30%.  Such disaggregated architectures require low latency connectivity with sufficient reach. As distances exceed a few meters, optics are likely to be required.
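The ‘over 10%’ figure follows directly from the two fractions quoted above:

```python
# Fractions quoted in the text.
memory_share_of_spend = 1 / 3    # a third of infrastructure spend goes to memory
memory_unused         = 1 / 3    # a third of that memory sits idle on average

wasted_share = memory_share_of_spend * memory_unused
print(f"~{wasted_share:.0%} of total infrastructure spend is stranded")   # ~11%
```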

Scaling up and out is allowing hundreds of thousands of xPUs to collaborate on workloads, but that increases connection lengths. Although passive copper offers low cost and low power consumption, it simply cannot deliver the required bandwidth, speed and reach for AI scaling. Thus, we see increased use of fiber optics.

Today, optical connectivity is typically established via pluggable modules in servers and switches. These transceivers are being re-architected for AI and benefitting from chiplet-based design.  Leveraging silicon photonics and advanced packaging technologies, tighter integration can be achieved, delivering cost and power savings when integrating many parallel lanes of traffic.

Chiplet architectures also open the door to co-packaged optics (CPO), offering the potential for optical connectivity right to the xPU and switch. Direct-drive CPO relies on high-performance DSP SerDes on the xPU or switch to handle both optical and electrical links. The use of chiplets allows the broadband analog amplifiers to be implemented separately in their own process technology alongside a photonic chiplet. Digital-drive CPO uses UCIe to connect to the optical I/O on a separate chiplet, requiring less space on the central logic tiles. With this approach, optical and/or electrical interface chiplets can be integrated to create the desired solution for each application. Moreover, digital-drive CPO allows for tremendous bandwidth density in the die-to-die interface to the logic tile. To allow all that bandwidth to escape the package, several approaches are being considered, such as the use of multiple wavelengths and/or dense fiber-attach methods along with dense transceiver circuit design.

Distributing AI

Another key trend is that AI is being distributed across sites that can be separated by kilometers, creating new connectivity challenges. There are several motivations for this trend. Regional data centers are being located at “the edge,” close to end users, to allow for a responsive experience. It’s also becoming desirable to spread large AI workloads across multiple facilities to form a single, larger virtual data center. Distributed training is gaining momentum as it preserves privacy by training AI models without passing sensitive information between institutions. In this approach, models are trained locally and only model updates are shared with the server, without sharing sensitive data such as medical images for healthcare AI, individual location data for autonomous vehicle training, or financial data for fraud detection.

Distributing data centers geographically requires new types of broadband connectivity. Coherent-lite is seen as an enabler in this area. These transceivers implement coherent optical connectivity, previously developed for reaches exceeding 100km, with greatly reduced power consumption. Coherent-lite transceivers leverage chiplet design paradigms to achieve this, combining bare dies made with different fabrication technologies into a highly integrated, low-power subsystem.

Fig 5: As AI moves to the edge to provide more responsive inference, data centers are becoming distributed (top). Distributed AI training allows for better data security and privacy (bottom). Both rely heavily on new connectivity technologies, including coherent-lite.

Enabling the chiplet ecosystem

To support AI scaling and foster the chiplet ecosystem, Alphawave Semi has developed UCIe IP across the world’s most advanced CMOS nodes along with a portfolio of reconfigurable, customizable, and interoperable chiplets. 

It includes the industry’s first off-the-shelf multi-protocol I/O connectivity chiplet on TSMC’s 7nm process, integrating Ethernet, PCIe, CXL, and UCIe standards to provide high-bandwidth, low-power connectivity for AI and HPC applications. This chiplet, offering up to 1.6 Tbps bandwidth across 16 lanes, enables flexible, scalable system designs by allowing customers to mix and match with existing hardware ecosystems without extensive customization.

Alphawave Semi has also partnered with Arm to develop a high-performance compute chiplet using Arm Neoverse Compute Subsystems (CSS) for AI, HPC, data center, and 5G/6G networking. This chiplet integrates Alphawave’s advanced connectivity IP, including PCIe Gen 6.0/7.0, UCIe, and high-speed Ethernet, supporting scalable and efficient system-on-chip (SoC) solutions tailored to complex digital infrastructure needs.

These developments underscore Alphawave Semi’s commitment to advancing a robust chiplet ecosystem for high-performance AI and data center infrastructure.

Summary

AI has redefined data center compute and connectivity infrastructure. Chiplets enable the custom silicon solutions that are optimized for AI workloads and are essential to affordably scale performance and keep power consumption manageable. Chiplets also enable hardware developers to use trusted, proven silicon IP to get to market more quickly and meet the cadence of AI hardware design cycles.

The post Unleashing AI Potential Through Advanced Chiplet Architectures appeared first on Alphawave Semi.

]]>
Redefining XPU Memory for AI Data Centers Through Custom HBM4 - Part 3 https://awavesemi.com/redefining-xpu-memory-for-ai-data-centers-through-custom-hbm4-part-3/ Tue, 03 Dec 2024 13:44:06 +0000 https://awavesemi.com/?p=21132 Part 3: implementing custom HBM by Archana Cheruliyil Principal Product Marketing Manager This is the third and final of a…

The post Redefining XPU Memory for AI Data Centers Through Custom HBM4 - Part 3 appeared first on Alphawave Semi.

]]>
Part 3: implementing custom HBM

by Archana Cheruliyil
Principal Product Marketing Manager

This is the third and final part of a series from Alphawave Semi on HBM4 and examines custom HBM implementations. Click here for part 1, which gives an overview of the HBM standard, and here for part 2, on HBM implementation challenges.

This follows on from our second blog, where we discussed the substantial improvements high bandwidth memory (HBM) provides over traditional memory technologies for high-performance applications, and in particular AI training, deep learning, and scientific simulations. In that post, we detailed the various advanced design techniques implemented during the pre-silicon design phase. We also highlighted the critical need for more innovative memory solutions to keep pace with the data revolution as AI pushes the boundaries of what computational systems can do. A custom implementation of HBM allows for greater integration with compute dies and custom logic and can, therefore, be a performance differentiator justifying its complexity.

Custom High Bandwidth Memory systems

Creating custom HBM using a die-to-die interface such as the UCIe standard (Universal Chiplet Interconnect Express) is a cutting-edge approach that involves tightly integrating memory dies with compute dies to achieve extremely high bandwidth as well as a low latency between the components. In such an implementation, the memory controller directly interfaces with the HBM DRAM through a Through-Silicon-Via (TSV) PHY on the memory base die. Commands from the host or compute are translated through a die-to-die interface using a streaming protocol. This allows for reuse of die-to-die shoreline already occupied on the main die for core-to-core or core-to-I/O connections. Such an implementation requires a close collaboration between IP vendors, DRAM vendors and end customers to create a custom memory base die.

Alphawave Semi is uniquely positioned to pioneer this effort with its cutting-edge HBM4 memory controller portfolio and the recently announced industry-first, silicon-proven 3 nm, 24 Gbps die-to-die UCIe IP subsystem delivering 1.6 TB/s of bandwidth. In addition, Alphawave Semi can design and build the custom ASIC die in-house, in close collaboration with end customers.


 

Benefits of a Custom HBM Integration:

1. Better alignment of memory with workload needs

A custom HBM4 design means optimizing the memory and memory controller to align closely with the specific needs of the processor or AI accelerator. This may involve tweaking the memory configuration (for example, by increasing bandwidth, reducing latency, or adding more memory layers) and fine-tuning the die-to-die interface to ensure smooth and fast communication. Alphawave Semi offers a highly configurable HBM4 memory controller that supports all JEDEC defined parameters and can be customized to meet specific workloads.

2. 2.5D integration

In 2.5D packaging, the processor die and HBM custom dies are placed side-by-side on an interposer, which acts as a high-speed communication bridge between them. This approach allows for wide data buses and short interconnects, resulting in higher bandwidth and lower latency. Alphawave Semi has deep expertise in designing 2.5D systems with interposer and package design. The resulting system-in-package is tested extensively for signal and power integrity.

3. The die-to-die interface gives improved bandwidth

Die-to-die interfaces can support very wide data buses at high clock rates, resulting in massive bandwidth throughput. For example, Alphawave Semi’s UCIe link on a single advanced-package module, with lanes running at 24 Gbps, can drive up to 1.6 Tbps per direction; the short calculation after this list makes the lane arithmetic explicit.

4. It also improves latency

By reducing the distance between the memory and processor, die-to-die interfaces minimize the latency that comes from accessing external memory. This is critical in AI model training and inference, where latency can significantly impact performance.

5. And power efficiency

The shorter interconnect distances and reduced need for external memory controllers lower power consumption. This is a key advantage for data centers running power-intensive AI workloads, as well as for edge AI devices where power efficiency is crucial.
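To make the bandwidth arithmetic in point 3 explicit, here is a minimal sketch. The 64-lane module width is an assumption based on the UCIe advanced-package module definition, not a figure stated in this article:

```python
# Assumed: a UCIe advanced-package module with 64 data lanes per direction.
lanes_per_module = 64
lane_rate_gbps   = 24

raw_bw_tbps = lanes_per_module * lane_rate_gbps / 1000
print(f"~{raw_bw_tbps:.2f} Tbps raw bandwidth per direction")   # ~1.54 Tbps, i.e. ~1.6 Tbps
```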

High-bandwidth memory (HBM) was initially developed to enhance memory capacity in 2.5D packaging. Today, it has become essential for high-performance computing. In this interview for Semiconductor Engineering Magazine, Ed Sperling speaks with Archana Cheruliyil, Principal Product Marketing Manager at Alphawave Semi, about the evolution of HBM and its impact on the industry.

 

Summary

When combined with a die-to-die interface, custom HBM provides a powerful solution that helps address the memory bottleneck problem faced by AI chips. By leveraging advanced packaging technologies like 2.5D and 3D stacking, AI chips can achieve ultra-high memory bandwidth, lower latency, and greater power efficiency. This is crucial for handling the massive data requirements of modern AI workloads, particularly in applications such as deep learning, real-time inference, and high-performance computing. While there are challenges in terms of cost and thermal management, the performance benefits make this approach highly valuable for next-generation AI hardware systems.

Alphawave Semi is uniquely positioned to provide complex, high-performance SoC implementations by utilizing our superior HBM4 solutions, our industry-leading portfolio of connectivity IP to complement our HBM4, and the ability to bring it all together using our in-house custom silicon expertise.

Check out our 3-part blog series on HBM4.
 - Part 1: Overview on the HBM standard
 - Part 2: HBM implementation challenges
 - Part 3: Custom HBM implementations

The post Redefining XPU Memory for AI Data Centers Through Custom HBM4 - Part 3 appeared first on Alphawave Semi.

]]>
To GPU or not GPU https://awavesemi.com/to-gpu-or-not-gpu/ Tue, 26 Nov 2024 13:23:46 +0000 https://awavesemi.com/?p=21104 Over the past decade, GPUs became fundamental to the progress of artificial intelligence (AI) and machine learning (ML). They are…

The post To GPU or not GPU appeared first on Alphawave Semi.

]]>
Over the past decade, GPUs became fundamental to the progress of artificial intelligence (AI) and machine learning (ML). They are widely used to handle large-scale computations required in AI model training and inference, and are playing an increasingly important role in data centers. The key feature of the GPU is its ability to perform parallel processing efficiently, making it ideal for training machine learning models, which require numerous computations to be carried out simultaneously.

However, as the demand for AI computing increases, the overall efficiency of this technology is being called into question. Industry data suggests that roughly 40% of time is spent in networking between compute chips across a variety of workloads, bottlenecked by limited communication capacity (fig. 1).

The demand for AI applications continues to rise, and this connectivity issue comes at the same time as the cost of general-purpose GPUs for specific workloads and their high power consumption are motivating a move away from GPU-centric compute architectures towards custom silicon and chiplet-based designs. These modular and flexible SoCs enable scalable hardware solutions that can be optimized not only to reduce power consumption and cost but also to improve communication bandwidth.


Figure 1: Case studies presented by Meta suggest that, for its models, approximately 40 percent of the time data spends in a data center is wasted on processor-to-processor transfer.

Why GPUs are reaching their limits

GPUs are pivotal in advancing AI due to their parallel processing capabilities, which allow them to handle the intensive, simultaneous calculations required to meet the needs of AI training, such as processing enormous datasets and accelerating model training times. Certainly, GPUs have been extremely successful in this role.

However, GPUs are an expensive option, in terms of cost and energy. Their design focuses on 64-bit elements to handle a broad range of computational tasks, but in real-time AI workloads, dropping the 64-bit components could reduce die size and energy requirements by up to a third, while still meeting most AI processing needs.

Despite their efficiency in training scenarios, GPUs face significant downsides in scaling AI applications for widespread use. As AI moves towards real-time inference, particularly in edge environments where data must be processed close to its source, the high cost and power consumption associated with GPUs will become increasingly unsustainable.

Instead, AI-dedicated ASICs now offer a more cost-effective and powerful alternative for specific inference tasks, and can often deliver faster performance and lower power consumption for those functions compared to GPUs.

Transition to inference-only models and edge AI

The industry's priorities are shifting from training to inference, emphasizing the need for scalable, energy-efficient hardware solutions suited for edge deployments. Edge AI devices, which process data on-site rather than transferring it to a central data center, benefit from lightweight, specialized chips that emphasize inference over training.

Arm-based architectures are steadily emerging as a strong alternative to GPUs, with architectures like Neoverse delivering high performance for AI inference at the edge with lower energy and cooling demands than would be needed in a GPU-based data center. Because of their low power consumption, Arm chips are already widely used in mobile devices and their architecture is being adapted for AI, particularly for edge computing scenarios where power efficiency is needed.

By focusing on inference-only models, these specialized chips can support AI workloads in constrained environments, making edge AI not only feasible, but also more efficient and adaptable. The transition to edge AI and inference-dedicated architectures could lessen reliance on traditional GPUs while broadening the AI computing landscape to include more diverse, task-optimized hardware solutions.

The scalability challenge and connectivity barriers

Scaling AI to meet mass-market demands introduces some serious challenges. Monolithic GPUs are currently constrained to 858 mm2, the maximum reticle size in photolithographic equipment. This not only limits how many transistors can be placed on a single chip, but also dictates a physical perimeter limit, which restricts the number of input/output (I/O) connections, and therefore directly impacts data bandwidth.
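The perimeter constraint can be made concrete with a rough shoreline calculation. The reticle dimensions below are the standard 26 mm x 33 mm lithography field; the per-millimeter escape bandwidth is an illustrative assumption:

```python
# Standard lithography field: 26 mm x 33 mm = 858 mm^2.
width_mm, height_mm = 26, 33
area_mm2     = width_mm * height_mm           # 858 mm^2
perimeter_mm = 2 * (width_mm + height_mm)     # 118 mm of die edge

# Assumed: ~1 Tbps of off-package SerDes escape bandwidth per mm of die edge.
escape_bw_tbps_per_mm = 1.0
print(f"Area {area_mm2} mm^2, perimeter {perimeter_mm} mm, "
      f"~{perimeter_mm * escape_bw_tbps_per_mm:.0f} Tbps maximum escape bandwidth")
```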

Furthermore, larger chips not only increase manufacturing costs, but also introduce defect risks and reduce yields.


Figure 2: Moving to a chiplet design in 7, 5, and 3 nm can reduce the total cost of ownership – by more than 30 percent for the largest SoCs on advanced processes.

As AI workloads scale, their I/O connectivity bottlenecks become increasingly problematic. This wasted networking time poses a significant barrier to AI, as real-time communication between nodes becomes critical for large-scale deployments. This scalability barrier, coupled with the reticle limitations of manufacturing equipment, highlights the need for an alternative design approach, with chiplet-based architectures better supporting AI applications at scale.

Breaking down barriers to scale with chiplets

Chiplets enable designers to optimize specific parts of the processor for AI workloads, such as matrix multiplications or memory management, enhancing performance while keeping costs down.

Unlike monolithic SoCs, chiplet-based designs offer a modular approach, breaking down the system into smaller, specialized units. Each chiplet can be optimized for a specific function – such as processing, memory management, or I/O operations – and can be developed using the optimal process for its function and cost, regardless of the process used in other parts of the chip. This means leading-edge processes (e.g. 5 nm, 3 nm or 20Å) can be used for logic components, while larger, cost-efficient processes can be used for analog components.

Chiplet architectures can reduce the total cost of ownership for large SoCs by more than 30%, significantly lowering non-recurring engineering (NRE) costs and improving power efficiency.

Furthermore, by integrating chiplets from multiple vendors, companies can build more flexible and scalable AI hardware using only “best-of-breed” components and removing dependence on a single provider.

Chiplets also enable more effective data movement within the SoC, resolving the connectivity bottlenecks that challenge traditional architectures by reducing latency and enhancing communication bandwidth, both of which are essential for real-time AI inference.

Chiplets also remove the reticle-size constraint that limits the performance of monolithic SoCs, opening up greater I/O capacity without sacrificing yields or introducing manufacturing defects.

Companies like Alphawave Semi offer innovative custom silicon ASIC solutions for AI computing. These chiplet solutions can outperform general-purpose GPUs in certain tasks because they are designed to handle very specific workloads.


Figure 3: Alphawave Semi can overcome performance, power and space limitations through custom silicon ASICs designed to accelerate specialized AI applications, whether it is machine learning training or inference, generative AI or neural nets.

To GPU, or not to GPU, that is the question

NVIDIA’s dominance in the GPU market, where it commands more than 90% of the data center market share, is built on a combination of powerful hardware and an extensive software ecosystem.

But the question is not ‘How can smaller GPU vendors compete against NVIDIA?’. It is ‘Is the GPU even needed as we move from training to inference models?’.

As we saw with Arm in the mobile sector, disruptive players can take over from well-established industry leaders. Dominance in one domain does not guarantee success in another, and in the mobile sector, Arm’s IP model enabled its technology to dominate the market through its adoption by companies such as Qualcomm.

Chiplets are effectively an evolution of the IP model, and in a similar way, Nvidia’s supremacy in AI hardware can be challenged by chiplet-based architectures, which distribute functionality across smaller, interconnected modules. These architectures offer the combined advantages of higher I/O density and better data throughput, as well as lower costs and greater energy efficiency, especially when used in custom configurations tailored to specific workloads with a fast time-to-market.

As AI scales, the GPU model’s limitations in cost and power efficiency are becoming increasingly apparent. GPUs will continue to play a pivotal role in AI training, but a shift towards chiplets-based SoCs and AI-dedicated ASICs is already emerging.

Chiplets in particular stand out as a key enabler for future AI computing, reducing manufacturing costs, power usage, and network bottlenecks.

The debate over the post-GPU era of AI computing remains lively. To explore this topic in more detail, watch this panel discussion titled "To GPU or not GPU: Is that the question?", featuring Karl Freund from Cambrian-AI Research, Dr. John Summers from RISE Research Institutes of Sweden, and Tony Chan Carusone, CTO of Alphawave Semi. In this panel debate, they discuss the significance of GPUs in AI and the custom solutions that are most likely to challenge the status quo.

Watch webinar now

The post To GPU or not GPU appeared first on Alphawave Semi.

]]>
Redefining XPU Memory for AI Data Centers Through Custom HBM4 - Part 2 https://awavesemi.com/redefining-xpu-memory-for-ai-data-centers-through-custom-hbm4-part-2/ Thu, 21 Nov 2024 06:27:34 +0000 https://awavesemi.com/?p=21073 Part 2: HBM implementation challenges by Archana Cheruliyil Principal Product Marketing Manager This is the second in a three-part series…

The post Redefining XPU Memory for AI Data Centers Through Custom HBM4 - Part 2 appeared first on Alphawave Semi.

]]>
Part 2: HBM implementation challenges

by Archana Cheruliyil
Principal Product Marketing Manager

This is the second in a three-part series from Alphawave Semi on HBM4 and gives insights into HBM implementation challenges. Click here for part 1, which gives an overview of HBM; in part 3, we will introduce details of a custom HBM implementation.

Implementing a 2.5D System-in-Package (SiP) with High Bandwidth Memory (HBM) is a complex process that spans architecture definition, the design of a highly reliable interposer channel, and robust testing of the entire data path, including system-level validation. Here is a breakdown of the key elements and considerations involved in implementing a 2.5D HBM design.

Advanced Design and Architecture Planning

Determining the necessary bandwidth, latency and power requirements is important for planning the overall system architecture. A monolithic chip can also be disaggregated into smaller specialized modules called chiplets to handle specific functions within the system. This approach can provide enhanced design flexibility, power efficiency, yield and overall scalability.


Interposer Design

The interposer can be made of either silicon or organic material and supports multiple metal layers to handle high-density routing between the HBM stacks and the compute die. HBM4 will build upon improvements seen in HBM3E and aims to push data rates, energy efficiency and memory density further. With the interface width doubled (to 2048 bits), but with the HBM4 memory shoreline remaining the same as HBM3E, the primary challenge is how to manage the denser I/O routing in the PHY as well as the interposer. The layout should ensure careful signal routing, power distribution, and grounding to minimize crosstalk and losses through the channel and meet the HBM4E specifications. Alphawave Semi is implementing a test vehicle with an HBM4E memory sub-system, interposer, package and board designed in-house to study the entire data path signal integrity on a leading-edge silicon node.
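To get a feel for what the doubled interface width means at the package level, a rough per-stack bandwidth estimate is shown below. The per-pin data rate is an assumption for illustration only:

```python
# HBM4: interface width doubled to 2048 bits per stack (vs. 1024 bits for HBM3E).
interface_width_bits = 2048
pin_rate_gbps        = 8       # assumed per-pin data rate, for illustration

stack_bw_gbs = interface_width_bits * pin_rate_gbps / 8    # ~2,048 GB/s
print(f"~{stack_bw_gbs / 1000:.1f} TB/s per HBM4 stack at {pin_rate_gbps} Gbps per pin")
```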

Signal Integrity (SI) and Power Integrity (PI) Analysis

To prevent signal degradation at HBM4E data rates, Alphawave Semi is implementing techniques like impedance matching, shielding and measures to ensure minimal crosstalk between adjacent traces. The interposer channel is characterized for insertion loss (IL), return loss (RL), power-sum crosstalk (PSXT) and insertion-loss-to-crosstalk ratio (ICR) to ensure we meet the requirements of next-generation HBM4E technology. Additionally, to achieve peak performance we are also upgrading the I/O architecture to include equalization techniques that ensure maximum eye opening through the data path channel.
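ICR ties the other channel metrics together: in decibel terms it is simply the margin between the through-path loss and the aggregated crosstalk at a given frequency. A minimal sketch with illustrative, not measured, values:

```python
# Insertion-loss-to-crosstalk ratio (ICR) at one frequency, working in dB.
# Example S-parameter magnitudes below are illustrative only.
insertion_loss_db = -3.0     # through-channel loss at the frequency of interest
psxt_db           = -30.0    # power-sum of all aggressor couplings at that frequency

icr_db = insertion_loss_db - psxt_db    # 27 dB of signal-to-crosstalk margin
print(f"ICR = {icr_db:.1f} dB")
```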

 


Simulated WRITE EYE Diagram

The power delivery network also requires careful planning to determine decoupling capacitor placement, low-impedance paths and dedicated power planes for critical, sensitive signals. The noise contribution from all components, including the motherboard, package, interposer and silicon die, needs to be considered in determining the target impedance of the power delivery network.
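A common first-order starting point for that planning is the PDN target impedance, set by the supply voltage, the allowed ripple, and the worst-case transient current. The values below are illustrative assumptions, not design targets from this article:

```python
# First-order PDN target impedance: Z_target = (Vdd * ripple) / I_transient.
# All values are assumptions for illustration only.
vdd         = 0.75    # supply voltage (V)
ripple      = 0.05    # 5% allowed supply noise
i_transient = 30.0    # worst-case current step (A)

z_target_mohm = vdd * ripple / i_transient * 1000
print(f"Target PDN impedance of roughly {z_target_mohm:.2f} mOhm up to the frequency of interest")
```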

System Level Testing and Validation

Extensive SI-PI testing ensures the HBM package meets jitter and power specifications. Decomposing interposer-induced jitter into ISI, crosstalk, and rise/fall-time degradation helps to identify the dominant channel parameter affecting eye closure and aids better layout and I/O architecture optimization.

System-level testing of all components in the data path is critical to ensuring that the assembled package meets the performance specifications set forth in the design phase. A comprehensive test suite that includes DFT-enabled design is also critical for early diagnostics to achieve high yield. Alphawave Semi supports a full range of DFT standards and functions across all major vendors and specialized EDA tools.


System Level Testing Components

Summary

While providing several benefits in achieving better performance and lower latency, HBM4 systems raise some complex technical challenges in architecture, interposer design, SI/PI analysis, and test and validation. Alphawave Semi is well placed to overcome these challenges and create industry-leading, high-performance, high-reliability HBM4 implementations for AI and data center applications. AI continues to push the boundaries of what computational systems can do, creating a critical need for more innovative memory solutions that keep pace with the data revolution. A custom implementation of HBM4, allowing for greater integration with compute dies and custom logic, can be a performance differentiator justifying its complexity.

Check out our 3-part blog series on HBM4.
 - Part 1: Overview on the HBM standard
 - Part 2: HBM implementation challenges
 - Part 3: Custom HBM implementations

The post Redefining XPU Memory for AI Data Centers Through Custom HBM4 - Part 2 appeared first on Alphawave Semi.

]]>