Nvidia's liquid cooling revolution for AI servers

The power consumption of cutting-edge AI chips keeps rising, and this is pushing the next generation of DGX AI servers toward liquid cooling. Nvidia's current flagship, the H100 GPU, has a TDP (thermal design power) of 700W, which is at or near the practical limit of traditional air cooling. Nvidia is expected to launch the Blackwell-architecture B100 GPU later this year with a TDP of approximately 1000W, at which point liquid cooling will very likely be necessary.

Nvidia liquid cooling

For high-performance computing systems, liquid cooling has several key advantages over air cooling:
   Higher heat transfer efficiency, so components with a higher TDP can be cooled adequately
   Quieter operation, because fewer high-speed fans are needed
   Denser system design, because bulky heat sinks and fans no longer take up as much space
   Potential to capture and reuse waste heat through liquid-to-liquid heat exchangers
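
The heat transfer advantage can be illustrated with the basic heat balance Q = m * c_p * dT, which gives the coolant flow needed to carry away a given heat load. The sketch below is a rough estimate, not a design calculation: the fluid properties are rounded textbook values near room temperature, the 10 K coolant temperature rise is an assumption chosen for illustration, and the function name is hypothetical. Only the 700W and 1000W figures come from the article.

```python
# Rounded textbook fluid properties near room temperature (assumed values).
FLUIDS = {
    "air":   {"density": 1.2,   "specific_heat": 1005.0},  # kg/m^3, J/(kg*K)
    "water": {"density": 997.0, "specific_heat": 4186.0},  # kg/m^3, J/(kg*K)
}


def volumetric_flow_l_per_min(heat_w, fluid, delta_t_k):
    """Flow in L/min needed to absorb heat_w watts with a coolant
    temperature rise of delta_t_k kelvin (from Q = m_dot * c_p * dT)."""
    props = FLUIDS[fluid]
    mass_flow = heat_w / (props["specific_heat"] * delta_t_k)  # kg/s
    return mass_flow / props["density"] * 1000.0 * 60.0        # m^3/s -> L/min


# Compare the two TDP figures mentioned above, assuming a 10 K rise.
for tdp_w in (700, 1000):
    air = volumetric_flow_l_per_min(tdp_w, "air", 10.0)
    water = volumetric_flow_l_per_min(tdp_w, "water", 10.0)
    print(f"{tdp_w} W: air {air:.0f} L/min, water {water:.2f} L/min")
```

Under these assumptions, a 700W chip needs roughly 1 L/min of water but about 3,500 L/min of air for the same temperature rise. Real systems also depend on cold plate design, inlet temperature, and pressure drop, which this estimate ignores.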

GPU liquid cooling

By using liquid cooling, Nvidia can keep raising the performance of its AI accelerators without being constrained by the cooling system. This is crucial as AI training workloads grow more complex and the power consumption of the underlying hardware rises. Nvidia's DGX AI servers package multiple GPUs into a system optimized for AI workloads, and they have been rapidly adopted by large enterprises. Major technology companies such as Google Cloud, Meta, and Microsoft have deployed DGX systems in their data centers. In recent years, as more organizations seek to apply artificial intelligence, adoption of Nvidia DGX systems has grown rapidly.

GPU LIQUID COOLING

The Nvidia DGX system may use advanced immersion cooling designs based on dielectric fluids. In direct chip immersion cooling, the dielectric fluid is pumped directly onto the GPU chips and other heat-generating components without intermediate cold plates, which gives a more direct heat transfer path. This approach can support very high TDP levels (500W and above) on a single chip and allows denser systems.

Direct chip immersion cooling

As artificial intelligence continues to develop at an astonishing speed, the hardware infrastructure that supports it must evolve in step. Liquid cooling is a key enabling technology that will allow accelerators to scale to unprecedented performance levels. This transformation is not without challenges. Data centers must retrofit their facilities for liquid cooling and develop new maintenance procedures, but the benefits in energy efficiency, density, and performance are too significant to ignore.

 
