High-density AI exceeds the limits of air cooling, making liquid cooling a design premise

As AI server rack power exceeds 100kW and moves toward 1MW, approaches that move heat through air are nearing practical limits. Air cooling is generally a fit for racks in the tens-of-kW range; far beyond that density, airflow, underfloor and aisle design, and the power and space required for room air conditioning become constraints.

Product designs already assume liquid cooling. NVIDIA's GB200 NVL72 is a rack-scale liquid-cooled design that combines 36 Grace CPUs and 72 Blackwell GPUs, and the successor GB300 and Vera Rubin NVL72 are also liquid-cooled rack-scale configurations (NVIDIA GB200 NVL72). Cooling is no longer an accessory added later; it is designed together with compute, power, and facilities. This article lays out what that shift means for power-supply and power-device design, without overstating what the evidence can support.

Main forms of liquid cooling: direct liquid cooling (DLC) and immersion cooling

Liquid cooling can be divided into two broad families. Direct liquid cooling (DLC, also called direct-to-chip) places a cold plate against high-heat components such as CPUs and GPUs and removes heat through a liquid loop. It does not immerse the whole server; instead, it extracts heat locally from the hottest components. Commercial products are already available. CoolIT's rack CDU (coolant distribution unit), the CHx200, is specified to handle 200kW of heat load in 4U and cool up to 200 servers, with warm-water cooling support for ASHRAE W17 to W+ and N+1 redundant pumps and power supplies (CoolIT Systems).
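To give a feel for what a 100kW or 200kW heat load means for a liquid loop, the sketch below applies the basic heat balance Q = m·cp·ΔT. It is a rough sizing illustration, not a vendor method: the function name `required_flow_lpm`, the 10 K coolant temperature rise, and the use of plain-water properties are all assumptions made here. Real loops often run a water-glycol mix with lower specific heat, which raises the required flow.

```python
def required_flow_lpm(heat_load_w, delta_t_k,
                      cp_j_per_kg_k=4186.0, density_kg_per_l=1.0):
    """Coolant flow in L/min needed to carry heat_load_w at a given temperature rise.

    Uses Q = m_dot * cp * delta_T. Defaults approximate plain water (assumption).
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_s = heat_load_w / (cp_j_per_kg_k * delta_t_k)
    return mass_flow_kg_s / density_kg_per_l * 60.0


# Illustrative loads from the text; the 10 K rise is a hypothetical design choice.
for load_kw in (100, 200):
    flow = required_flow_lpm(load_kw * 1000, delta_t_k=10.0)
    print(f"{load_kw} kW at a 10 K rise: about {flow:.0f} L/min")
```

Under these assumptions the required flow scales linearly with heat load and inversely with the allowed temperature rise, which is why supply temperature and flow rate end up being coupled design variables.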

The other family, immersion cooling, directly immerses electronic equipment in a dielectric liquid. It includes single-phase systems, where the liquid does not change phase, and two-phase systems, where boiling and condensation are used. Two-phase systems are generally described as providing higher heat transfer than single-phase systems. On the standardization side, OCP's Advanced Cooling Solutions includes "Cold Plate" and "Immersion" communities, and specifications are being developed for each approach (Open Compute Project).

Two families of data-center liquid cooling
01

Direct liquid cooling (DLC)

A cold plate contacts high-heat components and liquid removes the heat. CoolIT's CHx200 is specified for 200kW in 4U. This approach is generally easier to apply to existing racks than immersion.

02

Immersion (single-phase)

Equipment is immersed in a dielectric liquid that does not change phase; the liquid is circulated through a heat exchanger to reject the heat. Cabling and maintenance practices differ from air cooling.

03

Immersion (two-phase)

Boiling and condensation provide high heat transfer. Refrigerant, condenser design, and failure-mode management become central.

04

Standardization (OCP)

OCP's Cold Plate and Immersion communities define specifications and requirements. Ecosystem maturity affects adoption decisions.

Why liquid cooling matters for power supplies and power-device design

The essential effect of liquid cooling is that it changes the assumptions around power-device junction temperature (Tj). Switching loss in SiC MOSFETs depends on Tj, and turn-on loss increases at higher temperature. If stronger cooling keeps Tj lower, the assumptions behind loss budgets, thermal-runaway margin, and device selection can change. However, Tj is difficult to measure directly and must be estimated from electrical parameters, so monitoring and protection design become part of the problem.
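The effect of a cooler reference temperature on Tj can be sketched with a steady-state thermal-resistance chain. Note that this is a simpler, different technique from the electrical-parameter estimation mentioned above; it is used here only to show how the margin moves. The function name, the thermal resistances, the loss figure, the coolant temperatures, and the 175°C limit are all hypothetical values and are not taken from any datasheet.

```python
def junction_temp_c(ref_temp_c, power_loss_w, rth_k_per_w):
    """Steady-state estimate: Tj = T_ref + P_loss * sum(Rth along the heat path)."""
    return ref_temp_c + power_loss_w * sum(rth_k_per_w)


# Hypothetical thermal path in K/W: junction-to-case, case-to-plate, plate-to-coolant
RTH_CHAIN = (0.25, 0.10, 0.15)
LOSS_W = 120.0        # hypothetical device loss
TJ_LIMIT_C = 175.0    # hypothetical maximum junction temperature

tj_warm = junction_temp_c(45.0, LOSS_W, RTH_CHAIN)  # warmer coolant supply
tj_cool = junction_temp_c(30.0, LOSS_W, RTH_CHAIN)  # cooler coolant supply

for label, tj in (("45 C coolant", tj_warm), ("30 C coolant", tj_cool)):
    print(f"{label}: Tj ~ {tj:.1f} C, margin to limit {TJ_LIMIT_C - tj:.1f} K")
```

In this simple model a lower reference temperature shifts Tj down by the same amount, but the model ignores the temperature dependence of the loss itself and any transient behavior, so it cannot replace device-level validation.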

Hot-spot management also becomes a concrete design target. An academic study optimizing a cold plate for the GB200 Grace Blackwell Superchip reported reductions of more than 5°C in average temperature and more than 35°C in maximum temperature versus a parallel-channel baseline (arXiv). Suppressing localized temperature variation inside the package matters for both reliability and performance.

One caution applies. Higher cooling capacity does not automatically mean power-device ratings can be raised. Lower Tj creates design margin, but ratings and reliability still need to be validated separately at the device and protection-circuit level. Liquid cooling can also concentrate risk: when the cooling system fails, every device on the loop is affected. That is why a CDU includes N+1 pumps and power supplies, and why redundancy and failure behavior are central to safety.
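The N+1 idea can be expressed as a simple check: after losing any single unit, the remaining units must still cover the requirement. The sketch below tests the worst case, the loss of the largest unit. The function name and the pump capacities are hypothetical, and a real assessment would also cover controls, power feeds, and transient behavior during switchover.

```python
def survives_single_failure(unit_capacities, required):
    """Return True if capacity still meets `required` after losing the largest unit."""
    if not unit_capacities:
        return False
    remaining = sum(unit_capacities) - max(unit_capacities)
    return remaining >= required


# Hypothetical pump capacities in L/min against a hypothetical 287 L/min requirement
print(survives_single_failure([150, 150, 150], 287))  # three pumps installed
print(survives_single_failure([150, 150], 287))       # only two pumps installed
```

With three pumps the loop keeps 300 L/min after one failure and the check passes; with two pumps only 150 L/min remains and it fails.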

Efficiency and operations: PUE and control optimization

Liquid cooling is often discussed in the context of efficiency, but it is not as simple as saying that liquid cooling always improves PUE. PUE (Power Usage Effectiveness) is total facility energy divided by IT equipment energy; the ideal value is 1.0, and it is standardized as ISO/IEC 30134-2 (ISO). Values change depending on facility boundaries, treatment of server fans, outdoor-air conditions, water use, and whether heat is reused, so the cooling method alone cannot determine the result.
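A small calculation shows why the treatment of server fans matters. In the sketch below, fan energy is metered inside the IT boundary, so removing fans shrinks the denominator as well as the total. All energy figures are hypothetical and chosen only to illustrate the boundary effect; they are not measurements from any facility.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy over the same period."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("total facility energy cannot be below IT energy")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical energy figures in kWh. Server fans sit inside the IT boundary here.
air = {"it": 1000.0, "overhead": 300.0}    # IT figure includes 80 kWh of fan energy
liquid = {"it": 920.0, "overhead": 280.0}  # fans removed, slightly lower overhead

for name, e in (("air", air), ("liquid", liquid)):
    total = e["it"] + e["overhead"]
    print(f"{name}: total={total:.0f} kWh, PUE={pue(total, e['it']):.3f}")
```

In this made-up case total energy falls from 1300 kWh to 1200 kWh, yet PUE moves from 1.300 to about 1.304, so a comparison based on PUE alone would point the wrong way.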

That said, liquid-cooling facilities leave substantial room for control optimization. A digital-twin study of Frontier's liquid-cooling infrastructure showed the potential to reduce total energy by 30.1% through simultaneous optimization of flow rate and supply temperature (arXiv). Cooling is not "installed and finished"; efficiency moves with operations and control. That point matters to both power design and operations.

What each role should check next

Liquid cooling is no longer only a cooling topic. It has become a design theme that runs through power supplies, devices, facilities, and operations.

Power-design checkpoints in the liquid-cooling era
01

Power and device design

Revisit loss and reliability margin under lower Tj assumptions. Rating increases still require separate validation at the device and protection-circuit level. Tj-estimation monitoring is also a design topic.

02

Thermal and mechanical design

Choose DLC or immersion cooling, retrofit or new build. Check CDU cooling capacity, redundancy, and preparation for concentrated risk during cooling-system failure.

03

Facilities and operations

Do not judge PUE by cooling method alone; evaluate by facility boundary. Flow-rate and supply-temperature control can materially move efficiency.

04

Procurement and technology planning

Track OCP and other standardization work plus ecosystem maturity. Time the shift toward integrated procurement of cooling, power, and devices.

For high-density AI racks, power design and thermal design can no longer be treated separately. The next question is whether power designs built on today's air-cooling assumptions can be carried unchanged into the 100kW-to-1MW liquid-cooling scale.

Reference FactCards