AI chip thermal management

    Tech giants such as Microsoft, Google, and Meta are also expanding their data centers to train and run their artificial intelligence models. According to reports, Microsoft and OpenAI plan to build a data center project that would include a supercomputer with millions of dedicated server chips and could cost as much as $115 billion, including an artificial intelligence supercomputer called Stargate, expected to launch in 2028. Meta CEO Mark Zuckerberg also stated in January this year that the company's computing infrastructure will include 350,000 H100 graphics cards by the end of 2024, adding that if other GPUs are included, it will amount to approximately 600,000 H100-equivalents of compute.

 

AI computing

 

    AIGC is built on large models and big data. A large model is one trained on large-scale, broad data that can then be adapted to downstream tasks. The emergence of large models has had two effects: (1) model parameter counts have grown by orders of magnitude; (2) diversified demand has accelerated the diversification of computing power, which can be divided by demand into basic, intelligent, and supercomputing capacity. In 2021, the total computing power of global computing devices reached 615 EFlops, a growth rate of 44%; by 2030 it is expected to reach 56 ZFlops, a CAGR of 65%. Intelligent computing power is expected to grow from 232 EFlops to 52.5 ZFlops over the same period, a CAGR exceeding 80%. Large models have also set a new pace of compute growth, with compute demand doubling on average every 9.9 months.
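The growth figures above can be checked with a short calculation. This sketch uses only the numbers quoted in the text (615 EFlops to 56 ZFlops, 232 EFlops to 52.5 ZFlops, a 9.9-month doubling time); the function names are illustrative.

```python
# Reproducing the compute-growth figures quoted above (inputs from the text).
def cagr(start, end, years):
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1

# Total compute: 615 EFlops (2021) -> 56 ZFlops (2030); 1 ZFlops = 1000 EFlops
total = cagr(615, 56_000, 2030 - 2021)
print(f"Total compute CAGR: {total:.0%}")              # ~65%

# Intelligent compute: 232 EFlops -> 52.5 ZFlops over the same period
intelligent = cagr(232, 52_500, 2030 - 2021)
print(f"Intelligent compute CAGR: {intelligent:.0%}")  # ~83%

# A 9.9-month doubling time implies this annual growth factor
annual_factor = 2 ** (12 / 9.9)
print(f"Annual growth at 9.9-month doubling: {annual_factor:.1f}x")  # ~2.3x
```

Both computed rates match the CAGRs stated in the text, so the endpoint figures and the growth rates are mutually consistent.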

 

AIGC chip cooling

 

    Behind the improvement in computing power, chips must compute more efficiently and complete more calculations in less time, which inevitably increases chip energy consumption. The high density and high power consumption of data centers and supercomputing centers make heat dissipation an increasingly prominent problem. Modern data centers, especially supercomputing centers, typically contain large numbers of high-power devices that generate significant heat during operation. If that heat is not dissipated promptly and effectively, it will not only degrade device performance but may also cause hardware failures. According to an IDC report, about 40% of data center energy consumption goes to cooling systems, which shows that effective cooling solutions are crucial to data center operation.
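To see what a 40% cooling share means in the industry's usual metric, here is a minimal sketch converting it to PUE (Power Usage Effectiveness). The simplifying assumption, not from the text, is that everything that is not cooling is IT load.

```python
# Sketch: what a 40% cooling energy share implies for PUE.
# Assumption (not from the source): all non-cooling energy is IT load.
def pue_from_cooling_share(cooling_share):
    """PUE = total facility energy / IT energy, ignoring other overheads."""
    it_share = 1.0 - cooling_share
    return 1.0 / it_share

print(f"PUE at 40% cooling share: {pue_from_cooling_share(0.40):.2f}")  # ~1.67
```

A PUE near 1.67 is well above the ~1.1 that efficient modern facilities target, which is why reducing cooling energy is such a large lever.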

 

data center liquid cooling

 

   Traditional air cooling systems can no longer meet the cooling needs of current supercomputers, so liquid cooling technology has gradually become the mainstream choice in the industry. Liquid cooling lets data centers accommodate more computing devices in the same space while reducing the energy consumption of the cooling system. Because liquid conducts heat far more efficiently than air, a liquid-cooled facility can handle more computing tasks for the same cooling energy, significantly reducing operating costs.
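The advantage of liquid over air can be illustrated with the basic coolant heat-balance relation Q = ρ · V̇ · c_p · ΔT. The property values below are textbook room-temperature approximations, and the flow rate and temperature rise are illustrative assumptions, not figures from the text.

```python
# Sketch: heat removed by a coolant stream, Q = rho * V_dot * c_p * dT.
# Property values are room-temperature approximations (assumed, not from the text).
def heat_removal_w(rho, cp, flow_m3_s, delta_t):
    """Heat carried away (W): density (kg/m^3) * flow (m^3/s) * specific
    heat (J/kg.K) * coolant temperature rise (K)."""
    return rho * flow_m3_s * cp * delta_t

# Same volumetric flow (10 L/s) and same 10 K temperature rise for both coolants
air   = heat_removal_w(rho=1.2,   cp=1005.0, flow_m3_s=0.01, delta_t=10.0)
water = heat_removal_w(rho=997.0, cp=4186.0, flow_m3_s=0.01, delta_t=10.0)

print(f"Air:   {air:,.0f} W")    # ~121 W
print(f"Water: {water:,.0f} W")
print(f"Water carries ~{water / air:.0f}x more heat at the same flow and dT")
```

The roughly three-orders-of-magnitude gap in volumetric heat capacity is why liquid loops can serve rack densities that air simply cannot.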

 

data center immersion liquid cooling

 

   With the increasing demand for AI training and high-performance computing, liquid cooling technology will play an ever more important role in future supercomputing centers. It is expected to become a standard configuration in supercomputing centers and large data centers in the coming years to meet growing computing demands and heat dissipation challenges.
