What challenges and innovations lie ahead for the data center industry with the rise of artificial intelligence? Arkadiusz Starczewski, an expert from Talex S.A., shares his insights following the Vertiv conference, where the global leader in critical infrastructure presented its vision for powering and cooling high-power servers. In this interview, Arek points out key development directions and challenges facing the industry, particularly regarding liquid cooling and new design solutions tailored for AI.
What are your main takeaways from the Vertiv conference?
We should start with the fact that Vertiv showcased AI development and the challenges that come with it from their own perspective—that is, as a company focused on creating an environment (infrastructure) capable of providing guaranteed power and highly precise cooling for servers and racks responsible for AI and machine learning processes. However, Vertiv is a global brand and one of the leaders in critical infrastructure, so it was certainly worthwhile to explore their expert view on the broad development of AI.
Primarily, we can highlight two key takeaways after analyzing the tables and charts presented by Vertiv.
The first pertains to the rapid development of graphics processors. Nvidia is the main player making waves in this field; each new generation of their GPUs not only brings greater computational power but unfortunately also increases energy consumption. A dedicated approach had to be developed for handling such high power within high-density IT racks and servers.
The second takeaway is an entirely new approach to designing Data Centers, right from the construction planning stage. Here, we’re talking about AI-dedicated DCs where high power and liquid cooling play crucial roles. This doesn’t mean that traditional server rooms will soon be phased out—not at all. Both technologies will coexist.
Could you elaborate on the growing demand for high-power IT racks and its consequences for data centers?
Currently, the standard power of IT racks is about 5–10 kW, with slight fluctuations on either side. Based on my experience, the average in the Polish market tends to hover closer to the lower end, around 5 kW. At this level, heat is traditionally removed with air; the facility loop often uses chilled water, but the medium that actually cools the IT equipment is cold air. Technically, we can still cool with air up to about 40–50 kW per rack (depending on the project), although at some point supplementary measures become necessary, such as rear-door heat exchangers on the IT racks fed by a chilled-water loop, while the IT devices themselves are still cooled by air.
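To get a feel for why air cooling runs out of headroom around those power levels, here is a minimal back-of-envelope sketch in Python based on the standard sensible-heat relation. The air density, specific heat, and the assumed 12 K temperature rise across the rack are illustrative assumptions, not figures quoted in the interview or by Vertiv.

```python
# Rough estimate of the airflow needed to remove a rack's heat with air alone,
# using the sensible-heat relation P = rho * V_dot * cp * dT.
# Density, specific heat, and the 12 K temperature rise are assumed values.

AIR_DENSITY = 1.2         # kg/m^3
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K)
DELTA_T = 12.0            # K, assumed cold-aisle to hot-aisle temperature rise

def required_airflow_m3_per_h(rack_power_kw: float) -> float:
    """Volumetric airflow needed to carry away rack_power_kw of heat."""
    watts = rack_power_kw * 1000
    m3_per_s = watts / (AIR_DENSITY * AIR_SPECIFIC_HEAT * DELTA_T)
    return m3_per_s * 3600

for power_kw in (5, 10, 40, 50, 100):
    print(f"{power_kw:>3} kW rack -> ~{required_airflow_m3_per_h(power_kw):,.0f} m^3/h of air")
```

With these assumptions, a 5 kW rack needs on the order of 1,200 m³/h of air, while a 50 kW rack already needs more than 12,000 m³/h, which illustrates why purely air-based cooling becomes impractical well before the power densities discussed for AI racks.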
Above 50 kW, we must move away from traditional cooling methods and design a liquid cooling system: a loop of specially conditioned chilled water is connected directly to the server, which in turn has the appropriate fittings and channels feeding a heat exchanger mounted on the CPU/GPU. Brands like Dell, HP, Nvidia, and several other major players already have solutions ready.
As for the second part of the question: "consequences" may be too strong a word, but there will certainly be new challenges and investments for data centers. Liquid cooling requires not just a different type of air conditioning unit but also several additional measures, such as putting the air conditioners on UPS power; there is no option to switch the cooling off even briefly without shutting the servers down first. There is also the issue of bringing liquid into the rooms, or even into the server racks themselves: such a system must be continuously monitored for leaks and pressure drops, with an immediate response required.
Nvidia appears to be a leader in this field. What role do their graphics processors play in increasing the need for liquid cooling in servers?
Nvidia is undoubtedly the leader in designing and releasing new generations of very efficient GPUs. Intel and AMD may be slightly behind, but they also have solutions, and their CPUs/GPUs in certain versions are adapted for liquid cooling.
The current power demand per chip for top versions is 1,200 W. Given that several of these processors can be placed in each server, a straightforward estimate (sketched below) results in very high power density within a standard rack. As I mentioned before, beyond a certain level, liquid cooling is now the only solution enabling productive operation of such machines.
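A quick version of that estimate follows. Only the 1,200 W per-chip figure comes from the interview; the GPU count per server, the overhead factor for CPUs, memory, and fans, and the number of servers per rack are illustrative assumptions rather than vendor specifications.

```python
# Illustrative rack-density estimate. Only the 1,200 W per-chip figure comes
# from the interview; the other parameters are assumptions for the sketch.

GPU_POWER_W = 1200      # per-chip draw for top versions, as quoted above
GPUS_PER_SERVER = 8     # assumed
OVERHEAD_FACTOR = 1.3   # assumed allowance for CPUs, memory, fans, PSU losses
SERVERS_PER_RACK = 4    # assumed

server_power_kw = GPU_POWER_W * GPUS_PER_SERVER * OVERHEAD_FACTOR / 1000
rack_power_kw = server_power_kw * SERVERS_PER_RACK

print(f"~{server_power_kw:.1f} kW per server, ~{rack_power_kw:.0f} kW per rack")
# -> roughly 12.5 kW per server and about 50 kW per rack, an order of
#    magnitude above the 5-10 kW racks that are typical today.
```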
You mentioned that liquid cooling is becoming standard in servers using GPUs. What are the main benefits and challenges associated with this solution?
It may be a bit early to call liquid cooling a standard. Yes, it is being used and needed, and it will continue to grow and develop year by year, but it will still take some time before it becomes a universally adopted solution. The technology isn't a new invention; IBM successfully used liquid cooling as early as the last century. There are also several techniques, including replacing traditional heat sinks with liquid-cooled cold plates (as discussed in the previous questions) and immersion cooling, where entire devices are submerged in a cooled liquid.
It's also worth noting that servers connected to a liquid cooling loop still require air cooling. The liquid cools only the most energy-intensive components, which account for roughly 70–90% of the server's total power draw; the remainder is still removed by air.
So the challenge is to design a dual cooling system. Redundancy must be factored in to ensure uninterrupted server operation. Another consideration is expanding the UPS modules that keep the liquid cooling plant running without interruption. In case of a complete failure of the air cooling, we have only a few minutes to restart the units or automatically bring alternative ones online. That isn't much time, but it should be enough to avoid problems with the cooled equipment. By contrast, if the liquid cooling system fails, the server should shut down within seconds; otherwise it can suffer serious, often irreversible damage.
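As a purely illustrative sketch of those two very different reaction windows, the Python snippet below encodes the idea that an air-side fault leaves minutes to act while a liquid-side fault leaves only seconds. The grace periods and action names are assumptions, not logic from any real control system described in the interview.

```python
# Illustrative sketch of the reaction windows for cooling faults.
# The grace periods and action names are assumptions, not vendor logic.

AIR_FAILURE_GRACE_S = 180     # assumed: a few minutes to restart air units
LIQUID_FAILURE_GRACE_S = 5    # assumed: only seconds before servers must go down

def cooling_fault_action(kind: str, seconds_since_fault: float) -> str:
    """Return the (hypothetical) action to take for a cooling fault."""
    if kind == "liquid":
        # Loss of coolant flow or pressure: switch loops immediately, and if
        # that fails, shut the servers down before they suffer damage.
        if seconds_since_fault >= LIQUID_FAILURE_GRACE_S:
            return "emergency-shutdown-servers"
        return "switch-to-redundant-coolant-loop"
    if kind == "air":
        # Air-side failure: restart or bring standby units online; only shed
        # load if the outage drags on past the grace period.
        if seconds_since_fault >= AIR_FAILURE_GRACE_S:
            return "shutdown-non-critical-load"
        return "start-standby-air-units"
    return "no-action"

print(cooling_fault_action("liquid", 10))  # -> emergency-shutdown-servers
print(cooling_fault_action("air", 30))     # -> start-standby-air-units
```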
Thus, the air conditioning systems, the power supply systems, and the power required to feed the entire Data Center are all growing significantly. Securing that much power from energy providers, or investing in our own energy sources, becomes a challenge in itself.
You also mentioned that liquid cooling is very precise. Could you expand on this?
Yes, I added "very" for a reason. In data centers we always deal with precision air conditioning, where the settings go well beyond choosing temperature and humidity levels, as with typical comfort air conditioning. Precision air conditioner controllers today are essentially computers with operating systems and dedicated applications. At the service level, there are numerous parameters defined specifically for the cooled room or the cold-aisle arrangement; for example, sensors are calibrated to monitor the pressure differences generated by the fans in the IT equipment's power supplies.
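As an illustration of what such a parameter might look like in practice, here is a minimal sketch of a pressure-differential control loop: the precision unit adjusts its fan speed so the supply side stays slightly above cold-aisle pressure, following whatever airflow the IT equipment is pulling. The setpoint, gain, and speed limits are assumptions, not settings from any particular Vertiv controller.

```python
# Minimal sketch of a pressure-differential control loop for a precision
# cooling unit. The setpoint, gain, and speed limits are assumed values.

TARGET_DELTA_P_PA = 10.0   # assumed setpoint: Pa above cold-aisle pressure
GAIN = 2.0                 # assumed proportional gain: % fan speed per Pa of error

def adjust_fan_speed(current_speed_pct: float, measured_delta_p_pa: float) -> float:
    """Nudge the fan speed toward the pressure setpoint (proportional only)."""
    error = TARGET_DELTA_P_PA - measured_delta_p_pa
    new_speed = current_speed_pct + GAIN * error
    return max(20.0, min(100.0, new_speed))  # clamp to the unit's working range

# Example: the IT fans ramp up, pressure drops to 6 Pa, so the unit speeds up.
print(adjust_fan_speed(60.0, 6.0))  # -> 68.0
```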
With liquid cooling systems, this precision reaches a new level. The glycol (chilled water) in a traditional system must be of the correct quality and concentration, whereas the glycol circulating in the loop connected to the blocks mounted directly on the processors needs to be perfectly pure and enriched with additives that protect it against losing its designed properties and against sediment and rust formation. There is no room for error here: the channels in the cooling blocks are only 20 to 30 microns across, several times thinner than a human hair, so without precise filtration a blockage would quickly "cook" the processor.
Additionally, the startup process for such a system is more complicated, and conditions must be practically sterile. The maintenance guidelines state outright that configuration and work on these air conditioners are carried out in the same way as on medical devices.
What key changes in server room design will be essential in the future? What are your predictions about the future of AI-dedicated server rooms, particularly regarding power infrastructure design?
Server room design is currently an interesting endeavor. Already at the design stage, it's necessary to predict how the server room will be used over the next several years or even decades. First and foremost, we must define the server room's purpose: whether it will use traditional solutions, a mixed configuration, or be dedicated specifically to AI.
Since the question was about an AI-dedicated solution, let's focus on that. Assuming we have already chosen a location based on a prior risk analysis and confirmed that the power connections meet our needs, we can safely proceed to turn the usual infrastructure design and construction process on its head. Until now, in simplified terms, one-third of a Data Center's space was typically used for power infrastructure (distribution boards, UPS, generators, and part of the air conditioning systems), with the remaining two-thirds reserved for server rooms. For AI needs and IT racks exceeding 100 kW per unit, however, we can assume one-third of the space for servers and two-thirds for the infrastructure that allows them to operate safely and without interruption.
Naturally, we also face several issues, some of which I've already mentioned: higher UPS power, more batteries, and the precise design of the cooling system. This might mean forgoing underfloor air supply and raised floors altogether, and instead running all cabling and busbars directly above the IT racks, close to the water supply lines feeding the servers.
There are many considerations, and each individual system requires thorough analysis. Ultimately, everything depends on the specific project assumptions and needs, which serve as the starting point for design work.
An Electrifying Future
The data center industry faces challenges related to the growing demand for computing power, particularly for artificial intelligence applications. The development of high-performance graphics processors, especially from Nvidia, necessitates precise liquid cooling, currently the only effective method for servers with very high power density. Traditional air cooling alone is no longer sufficient for AI infrastructure, which introduces new design challenges, including expanded power systems and dual cooling systems. The future of data centers lies in adapting designs to these high energy and technological demands so as to ensure operational stability and efficiency.