AI Data Center Power Design at an Unprecedented Juncture

Power consumption per server rack averaged around 10-15 kW in 2020. For racks equipped with the latest AI accelerators, the class of hardware used to run models such as GPT-4, it is now reported to exceed 100 kW. Racks densely packed with NVIDIA H100s approach 70 kW each, and the next-generation Blackwell architecture is already heading toward more than 120 kW per rack. Power density is on track to rise by roughly an order of magnitude within five years.

This shift necessitates a fundamental re-evaluation of power supply design. In a 10,000-rack data center, a 2% reduction in distribution loss changes the power that must be provisioned by several MW, with a corresponding effect on capital investment. The choice of power topology, selection of power devices, and thermal design strategies are no longer solely the concern of infrastructure providers; they have become engineering challenges that will dictate the overall competitiveness of AI systems.
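As a rough check on the scale involved, the arithmetic can be sketched as follows. The function name is illustrative, and treating the 2% as a fraction of total rack load is a simplifying assumption; the two rack powers are the 2020-era and current figures quoted above.

```python
def distribution_loss_mw(racks: int, kw_per_rack: float, loss_fraction: float) -> float:
    """Power corresponding to a given loss fraction of the total rack load, in MW."""
    rack_load_mw = racks * kw_per_rack / 1000.0  # kW -> MW
    return rack_load_mw * loss_fraction


# 10,000 racks at a 2020-era density (15 kW) and at a current AI density (100 kW)
for kw in (15, 100):
    delta = distribution_loss_mw(10_000, kw, 0.02)
    print(f"{kw} kW/rack: 2% of rack load = {delta:.1f} MW")
```

Under these assumptions the same 2% is worth about 3 MW at 15 kW per rack and about 20 MW at 100 kW per rack, which is why the percentage matters more as density rises.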

What Has Changed? The Reality of 10x Power Density

In the traditional server rack configuration, power supply units (PSUs) generate a 48V bus, from which DC-DC converters supply low voltages of around 1V to individual devices. The 48V architecture itself remains relevant, but the fundamental question now is "how to get to 48V, and how to step down from there."

The Open Compute Project (OCP) specifications, promoted by hyperscalers such as Google, Meta, and Microsoft, unify the entire rack at 48V to maximize conversion efficiency on the server board. Building on this, the discussion is now shifting toward 400V-800V DC distribution (HVDC). This configuration converts AC to high-voltage DC and then steps it down incrementally within the rack; with fewer conversion stages it is, in theory, more efficient. However, it introduces new costs associated with safety standards and the complexity of protection design for high-voltage DC.
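Apart from the number of conversion stages, a higher bus voltage also means less current for the same power, and conduction loss scales with the square of that current. The sketch below illustrates this for a 100 kW rack. `PATH_RESISTANCE_OHM` is a hypothetical value chosen only for illustration, and holding it constant across voltages is a simplification, since real conductors would be resized for each design.

```python
def bus_current_a(power_w: float, bus_voltage_v: float) -> float:
    """Bus current from I = P / V."""
    return power_w / bus_voltage_v


def conduction_loss_w(power_w: float, bus_voltage_v: float, path_resistance_ohm: float) -> float:
    """Resistive loss in the distribution path from P_loss = I^2 * R."""
    current = bus_current_a(power_w, bus_voltage_v)
    return current * current * path_resistance_ohm


RACK_POWER_W = 100_000          # 100 kW rack, the level cited in the text
PATH_RESISTANCE_OHM = 0.0001    # hypothetical 0.1 milliohm distribution path

for voltage in (48, 400, 800):
    current = bus_current_a(RACK_POWER_W, voltage)
    loss = conduction_loss_w(RACK_POWER_W, voltage, PATH_RESISTANCE_OHM)
    print(f"{voltage:>3} V bus: {current:7.1f} A, conduction loss {loss:6.1f} W")
```

At 48V the rack draws roughly 2,000 A, against 250 A at 400V and 125 A at 800V, which is one reason busbar and connector design becomes difficult at 48V for racks of this size.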

The increase in power density is intrinsically linked to an increase in thermal density, and the transition from air cooling to liquid cooling is progressing in parallel. The adoption of Direct Liquid Cooling (DLC) or immersion cooling not only alters cooling system design but also changes the premises for power device junction temperatures. While increased cooling capacity allows devices to operate under harsher conditions, it also concentrates risk if the cooling system fails.
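The link between cooling method and junction temperature can be sketched with the steady-state relation Tj = T_ref + P_loss × Rth. Every number below (reference temperatures, device loss, thermal resistances) is a hypothetical placeholder rather than data for any real device or cooling system; the point is only how the design premise shifts.

```python
def junction_temp_c(reference_temp_c: float, loss_w: float, rth_k_per_w: float) -> float:
    """Steady-state junction temperature: Tj = T_ref + P_loss * Rth(junction to reference)."""
    return reference_temp_c + loss_w * rth_k_per_w


DEVICE_LOSS_W = 50.0  # hypothetical loss in one power device

# Hypothetical cooling cases: (reference temperature in degC, Rth junction-to-reference in K/W)
cases = {
    "air cooling": (45.0, 1.5),
    "direct liquid cooling": (35.0, 0.6),
}

for name, (t_ref, rth) in cases.items():
    tj = junction_temp_c(t_ref, DEVICE_LOSS_W, rth)
    print(f"{name}: Tj = {tj:.0f} degC")
```

A design that uses the lower thermal resistance to run the device harder gives up exactly the margin that would otherwise cover a loss of coolant flow, which is the risk concentration described above.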

Key Trends in AI Data Center Power Architecture: Three Axes

01. High Voltage DC (HVDC)

A transition from the 48V bus to HVDC (400-800V) is being considered. The primary objective is improved efficiency through fewer conversion stages, at the cost of more complex insulation design and protection circuits.

02. Rapid Increase in Power Density

The latest AI racks exceed 100 kW. Smaller, higher-density power units and low-impedance power distribution paths are becoming design prerequisites.

03. From Air to Liquid Cooling

The widespread adoption of DLC and immersion cooling changes the thermal design premises for power devices. Thermal resistance margins increase, but protection design for cooling system failure becomes critical.

Taken together, these three trends show that the changes in power supply design are not merely about "1-2% efficiency improvements" but are large enough to alter the architecture itself. This is accelerating discussions around the adoption of Wide Bandgap (WBG) devices such as SiC and GaN.

SiC and GaN: Differentiating Their Use in Data Centers

SiC (Silicon Carbide) and GaN (Gallium Nitride) are both Wide Bandgap semiconductors. Their bandgaps are wider than that of silicon (Si), which makes them advantageous for high-voltage, high-temperature, and high-frequency operation. Being "WBG" is not the whole story, however; each has its own strengths depending on the voltage range and application.

SiC is primarily used at breakdown voltages from 650V to 1700V and above. In data centers, its main applications are Uninterruptible Power Supplies (UPS), power conversion equipment (PFC stages, inverters), and conversion stages for HVDC buses. Its ability to reduce switching losses and on-resistance at the same time directly improves efficiency in high-power conversion.

GaN excels in lower voltage ranges (primarily below 650V) and is adept at high switching frequency operation. Its adoption is expanding in conversion stages within server PSUs, particularly for the hundreds of volts to 48V stage, and in high-frequency LLC resonant converters. Increasing the switching frequency allows for miniaturization of passive components (inductors and capacitors), leading to reductions in board area and weight.
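The relationship between switching frequency and passive size can be illustrated with the textbook ripple equation for a buck converter in continuous conduction, L = Vout × (1 − Vout/Vin) / (ΔI × fsw). A non-isolated buck is assumed here purely because its inductor equation is simple; server PSUs typically use isolated topologies such as the LLC converters mentioned above, whose magnetics are sized differently. The voltages, ripple current, and frequencies are hypothetical.

```python
def buck_inductance_h(v_in: float, v_out: float, ripple_a: float, f_sw_hz: float) -> float:
    """Inductance for a target peak-to-peak ripple in a CCM buck converter.

    L = Vout * (1 - Vout / Vin) / (delta_I * f_sw)
    """
    duty = v_out / v_in
    return v_out * (1.0 - duty) / (ripple_a * f_sw_hz)


V_IN, V_OUT = 400.0, 48.0   # hypothetical step-down from a 400 V bus to 48 V
RIPPLE_A = 5.0              # hypothetical peak-to-peak inductor ripple current

for f_sw in (100e3, 500e3, 1e6):
    inductance = buck_inductance_h(V_IN, V_OUT, RIPPLE_A, f_sw)
    print(f"f_sw = {f_sw / 1e3:6.0f} kHz -> L = {inductance * 1e6:6.2f} uH")
```

For a fixed ripple target the required inductance falls in inverse proportion to switching frequency, which is the mechanism behind the smaller passives; whether that is a net gain depends on the switching loss the device adds at the higher frequency.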

SiC vs. GaN: Differentiation in Data Center Power Supplies

01. SiC MOSFET (650V - 1700V)

For UPS, large PFC, and HVDC conversion stages. Its strength is low loss at high voltage and high power. Short-circuit withstand time and gate driver design need attention.

02. GaN HEMT (Below 650V)

For high-frequency conversion stages within PSUs. High switching frequencies allow smaller passive components. The narrow gate voltage margin requires care.

03. Si IGBT / MOSFET (Conventional Comparison)

Still advantageous in cost and supply stability. Its limits on switching frequency and loss are what decide between Si and WBG devices.

04. Hybrid Configuration

Combinations such as SiC for high-power stages and GaN for high-frequency step-down stages are emerging. This increases complexity in both design and procurement.

Why Short-Circuit Withstand Time Dictates Design When Selecting SiC MOSFETs

When adopting SiC for primary switches in PSUs or UPS, short-circuit withstand time (SCWT, or Tsc) is a technical aspect that is often overlooked.

Short-circuit withstand time is the duration from the onset of a load short circuit until the device fails. In other words, it is the "grace period" within which the protection circuit must turn off the switch. If protection does not act within this time, the device will be damaged.

The critical factor here is the physical characteristics of the SiC die.

SiC dies are small and have high current density. Heat generated during a short circuit concentrates locally, so the response time required of the protection circuit is stricter than for Si devices. The datasheets for Microchip's 700V/1200V SiC MOSFETs specify a typical SCWT of 3 μs under certain conditions. Under those conditions, the protection circuit must detect the fault and complete turn-off within 3 microseconds.

A commonly used implementation for short-circuit detection is the DESAT (desaturation) function.

DESAT monitors the drain-source voltage (VDS) in the ON state, detects the voltage rise during a short circuit, and turns off the switch. In data center designs, the combination of the DESAT threshold voltage (VDESAT), DESAT current (IDESAT), and short-circuit blanking time determines the balance between reliable protection and avoiding false triggering.
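How these three parameters interact with SCWT can be sketched as a timing budget. The first-order blanking estimate t_blank ≈ C_blank × VDESAT / IDESAT (the time for the DESAT current source to charge the blanking capacitor to the threshold) is the approximation commonly given in gate-driver application notes; it ignores the voltage the capacitor starts from. The capacitor values, driver delay, turn-off time, and 80% margin below are hypothetical, as are the function names.

```python
def desat_blanking_time_s(c_blank_f: float, v_desat_v: float, i_desat_a: float) -> float:
    """First-order blanking time: t = C_blank * V_DESAT / I_DESAT."""
    return c_blank_f * v_desat_v / i_desat_a


def protection_budget(scwt_s: float, blanking_s: float, driver_delay_s: float,
                      turn_off_s: float, margin: float = 0.8) -> tuple[float, bool]:
    """Return (total response time, whether it fits inside margin * SCWT)."""
    total = blanking_s + driver_delay_s + turn_off_s
    return total, total <= scwt_s * margin


SCWT_S = 3e-6           # typical SCWT cited in the text, valid only for its datasheet conditions
DRIVER_DELAY_S = 0.3e-6  # hypothetical comparator and driver propagation delay
TURN_OFF_S = 0.5e-6      # hypothetical soft turn-off duration

for c_blank in (56e-12, 220e-12):  # hypothetical blanking capacitor choices
    t_blank = desat_blanking_time_s(c_blank, v_desat_v=7.0, i_desat_a=500e-6)
    total, ok = protection_budget(SCWT_S, t_blank, DRIVER_DELAY_S, TURN_OFF_S)
    print(f"C_blank = {c_blank * 1e12:.0f} pF: blanking {t_blank * 1e6:.2f} us, "
          f"total {total * 1e6:.2f} us, fits = {ok}")
```

A longer blanking time suppresses false trips from switching noise but eats directly into the SCWT budget, which is the balance described above.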

Furthermore, short-circuit withstand time varies with the device's operating conditions. It depends mainly on three variables: drain voltage, gate voltage, and junction temperature. The withstand time tends to increase as these conditions are relaxed. The typical values listed in datasheets apply only to specific conditions, so the design margin must account for worst-case scenarios.

Temperature dependence deserves separate attention. At higher temperatures, RDSon increases and limits the saturation current, which works in favor of short-circuit withstand capability. A design that assumes liquid cooling and low junction temperatures operates with lower RDSon, so the saturation current during a short circuit may be higher than at the datasheet condition. Evaluating this point becomes a crucial factor in the design decision.

The Trade-off Between On-Resistance and Short-Circuit Withstand Time: Where Manufacturers Differentiate

In the development race for SiC MOSFETs, lower on-resistance (Ron) and short-circuit withstand capability trade off against each other in principle. Increasing cell density to lower on-resistance tends to increase current density during a short circuit, which reduces withstand capability. The point at which each manufacturer balances this trade-off is where their structural design differences become apparent.

Each company is taking its own approach to address this challenge. Mitsubishi Electric has developed a structure that significantly improves short-circuit withstand capability by introducing a p-type protection layer in its trench-type SiC-MOSFETs. Rohm's fourth-generation SiC MOSFETs are said to achieve both low on-resistance (RonA) and high short-circuit withstand capability through their proprietary device structure. While both share the goal of "resolving the trade-off through structure," their approaches differ in detail.

From a designer's perspective, simply comparing SCWT values in datasheets is insufficient. The measurement conditions (voltage and temperature) and the reliability degradation under repeated short circuits are often not detailed in catalogs. For new product adoption, actual measurements using evaluation boards and reference designs serve as the basis for judgment.

From a procurement perspective, selecting a SiC MOSFET supplier involves more than unit price and delivery time. The cadence at which the supplier updates its device structure generation, and the cost of verifying compatibility with successor products, also need to be considered. Generation updates such as Rohm's fourth generation bring performance improvements but may also require redesign of gate drive circuits and re-evaluation of reliability.

Quantifying Efficiency: Where Losses Concentrate

When discussing power supply efficiency, it is easy to lose track of which conversion stage incurs how much loss. The data center's power flow can be broadly divided into a conversion chain: utility AC → UPS/PFC stage → DC bus → server PSU → on-board DC-DC.


Because the stage efficiencies multiply, improvements at each stage compound and significantly alter the total losses in the chain. If there is a 2.5 percentage point difference between configuring the UPS/PFC stage with Si versus SiC, this translates to a difference on the order of MW for a 10,000-rack data center. This is why individual device comparisons directly influence business decisions.

However, the above figures are merely reference levels; actual efficiency varies greatly depending on circuit topology, operating point, and cooling conditions. The key is to identify "which stage is the dominant source of loss," as this will change the priority for improvement investments.
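The compounding effect can be sketched by multiplying per-stage efficiencies along the chain. The stage efficiencies below are hypothetical placeholders, not measured values; only the 2.5 percentage point gap at the UPS/PFC stage and the 10,000-rack, 100 kW scale are taken from the text.

```python
from math import prod


def chain_efficiency(stages: dict[str, float]) -> float:
    """Overall efficiency of cascaded stages is the product of the stage efficiencies."""
    return prod(stages.values())


def input_power_mw(it_load_mw: float, stages: dict[str, float]) -> float:
    """Utility power needed to deliver a given load through the chain."""
    return it_load_mw / chain_efficiency(stages)


# Hypothetical per-stage efficiencies for illustration only
si_chain = {"ups_pfc": 0.955, "server_psu": 0.96, "onboard_dcdc": 0.92}
sic_chain = dict(si_chain, ups_pfc=0.980)  # UPS/PFC stage 2.5 points higher

IT_LOAD_MW = 10_000 * 100 / 1000  # 10,000 racks at 100 kW each

for name, chain in (("Si UPS/PFC", si_chain), ("SiC UPS/PFC", sic_chain)):
    print(f"{name}: chain efficiency {chain_efficiency(chain):.3f}, "
          f"input {input_power_mw(IT_LOAD_MW, chain):.0f} MW")

saving = input_power_mw(IT_LOAD_MW, si_chain) - input_power_mw(IT_LOAD_MW, sic_chain)
print(f"difference: {saving:.1f} MW")
```

With these placeholder inputs the difference comes out in the tens of MW. The same structure can be used to find the dominant loss stage by varying one stage at a time.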

The "Next Question" for Design, Procurement, and Technology Planning

Based on the above analysis, the focus shifts depending on one's role.

AI Data Center Power Design: Next Confirmation Points by Role

01. Circuit Design / Device Selection

Confirm the measurement conditions for SCWT and check that the DESAT parameters are consistent with them. Whether the short-circuit blanking time fits within the protection circuit's response speed is a critical selection point.

02. Reliability / Evaluation Engineer

Degradation data under repeated short circuits, Ron shift under thermal cycling, and long-term reliability of the gate insulation film are evaluation items that go beyond catalog specifications.

03. Procurement / Supplier Management

With generational updates such as fourth-generation SiC ongoing, total cost of ownership must include successor compatibility and evaluation costs. The difficulty of securing multiple sources is also linked to SiC wafer procurement risks.

04. Technology Planning / Business Development

The timing of the shift to HVDC and standardization trends (e.g., OCP) will determine market inflection points. Whether to prioritize investment in SiC or GaN depends on the voltage range of the target conversion stage.

Discussions on power supply design tend to start with individual device performance comparisons. However, actual decisions are multi-layered issues involving conversion architecture, protection circuit design, procurement risk, and long-term reliability. Even with a single aspect like the short-circuit withstand capability of SiC MOSFETs, it is necessary to examine three layers: measurement conditions, device structure, and protection circuit response time.

As rack power for AI servers exceeds 100 kW, the question of "can the current design be scaled to 100 kW" becomes the next critical point of discussion. The answer in many cases is "no, not entirely," and determining which parts to address first will serve as a basis for decisions in both design and business.