Building the sustainable HPC environments of the future

2022-08-13

Putting green IT, sustainability and lean thinking into a business context.

In this guest post, Mischa van Kesteren, sustainability officer at HPC systems integrator OCF, runs through the wide variety of ways that large-scale computing environments can be made to run more energy efficiently.

Supercomputers are becoming more energy-hungry. The pursuit of Moore’s Law and ever greater hardware performance has led manufacturers to massively ramp up the power consumption of components.

For example, a typical high performance computing (HPC) CPU from 10 years ago would have a thermal design power (TDP) of 115 watts; today that figure is closer to 200.

Modern GPUs can exceed 400 watts. Even network switches, which used to be an afterthought from a power consumption perspective, can now draw over 1 kW each.

And the race to achieve exascale has pushed the power consumption of the fastest supercomputer on the planet from 7.9 MW in 2012 to 29.9 MW in 2022.

In this era of climate chaos, is this justifiable? Ultimately, yes. Whilst 29.9 MW is enough electricity to power 22,000 average UK households, the research performed on these large systems is some of the most crucial to navigating the challenges we face now and in the future, whether that’s research into climate change, renewable energy or combating disease.

It is vital, however, that we continuously strive to find ways of running HPC infrastructures as efficiently as possible.

The most common method of measuring the power efficiency of a datacentre is its power usage effectiveness (PUE). Traditional air-cooled infrastructure blows air through the servers, switches and storage to cool their components; air conditioning then removes the heat from that air before it is recirculated. All of this consumes a lot of power.

Air-cooled facilities often have a PUE in excess of two, meaning the datacentre as a whole consumes twice as much power as the IT equipment alone. The goal is to bring the PUE of the HPC infrastructure as close to one as possible (or, with heat reuse, effectively even lower).
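The arithmetic behind PUE is simple enough to sketch; the facility and IT power figures below are assumed for illustration, not measured from any real site:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# Assumed figures: an air-cooled room drawing 1,050 kW in total to
# support 500 kW of IT load has a PUE above two...
print(pue(total_facility_kw=1050, it_equipment_kw=500))  # 2.1

# ...while a direct liquid-cooled system with little cooling overhead
# gets close to the ideal of one.
print(pue(total_facility_kw=540, it_equipment_kw=500))   # 1.08
```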

A more efficient method is to cool the hot air with water. Water transfers heat over 20 times faster than air, making it far better suited to cooling hardware. Air-cooled components can still benefit via rear-door heat exchangers, which place a large water-filled radiator at the rear of the rack to cool the hot air exhausted by the servers.

Get the flow rate and water temperature right and you can remove the need for air conditioning altogether. This can bring the PUE down closer to 1.4.
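The trade-off between flow rate and temperature rise follows directly from the specific heat capacity of water. A minimal back-of-the-envelope sketch, assuming a hypothetical 30 kW rack and a 10 K rise across the exchanger:

```python
WATER_SPECIFIC_HEAT = 4186  # J/(kg*K), specific heat capacity of water

def required_flow_rate(heat_load_w: float, delta_t_k: float) -> float:
    """Water flow (kg/s, roughly litres/s) needed to carry away a heat
    load given the inlet-to-outlet temperature rise: Q = m_dot * c_p * dT."""
    return heat_load_w / (WATER_SPECIFIC_HEAT * delta_t_k)

# Assumed figures: a 30 kW rack with a 10 K rise across the rear door.
flow = required_flow_rate(heat_load_w=30_000, delta_t_k=10)
print(f"{flow:.2f} kg/s (~{flow * 60:.0f} litres per minute)")
# -> 0.72 kg/s (~43 litres per minute)
```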

Alternatively, components can be fitted with water blocks on the CPU, GPU, network cards and so on, cooling them directly and removing the need for air cooling altogether. This is far more efficient, bringing the PUE down further still, possibly to less than 1.1.

Ultimately, we need to do something with the waste heat. A good option is free cooling, where the outside air temperature is used to cool the water in the system. The highest outdoor temperature recorded in the UK is 40.3 °C.

Computer components are rated to run at up to roughly double that, so as long as the transfer medium is efficient enough (like water), you can cool your components year-round for just the energy used by the pumps. This is one of the reasons datacentres in Norway and Iceland are so competitive: their lower ambient temperatures let them make far greater use of free cooling.

Taking things one step further, the waste heat can be put to practical use rather than exhausted into the air. A few innovative datacentres have partnerships with local communities to heat homes, or even the local swimming pool, from their exhaust heat. The energy those homes would otherwise have consumed to heat themselves has in theory been saved, which can bring the effective PUE of the total system below one.
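Strictly speaking, PUE as defined can never fall below one, which is why The Green Grid introduced energy reuse effectiveness (ERE) to credit exported heat. A sketch of that accounting, with purely illustrative numbers:

```python
def ere(total_facility_kwh: float, reused_heat_kwh: float, it_kwh: float) -> float:
    """Energy reuse effectiveness: (total facility energy minus energy
    reused elsewhere) divided by IT equipment energy."""
    return (total_facility_kwh - reused_heat_kwh) / it_kwh

# Assumed figures: a facility running at a PUE of 1.1 that exports a
# portion of its waste heat to a district heating network can score
# an ERE below one.
print(ere(total_facility_kwh=1100, reused_heat_kwh=400, it_kwh=1000))  # 0.7
```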

The next step being investigated is to store the heat in salt, which can hold it for extended periods, to smooth out the mismatch between heating demand and compute utilisation. Imagine the knock-on effect of the traditional Christmas maintenance window, when IT infrastructure is turned off just as local households need heat the most.

One thing you may have noticed about all of these solutions is that they are largely only practical at scale. It is no coincidence that vast cloud datacentres and colocation facilities are where these innovations are being tested; that is where they work best. The good news is that the industry seems to be moving in that direction anyway, as the age of the broom-cupboard server room fades.

However, in the pursuit of economies of scale, public cloud providers operate huge fleets of servers, many of which are underutilised. This is clearly visible in the price difference between on-demand instances, which run when you want them to (typically at peak times), and ‘spot’ instances, which run when it is most affordable for the cloud provider.

Spot instances can be up to 90% cheaper. As cloud pricing is driven largely by the power consumption of the instance you are running, there must be a huge amount of wasted energy costed into the price of standard on-demand instances.

Making use of spot instances lets you run HPC jobs affordably, in the excess capacity of the cloud datacentres, improving their overall efficiency. If you run your workloads on demand, however, you can make this inefficiency worse.

Luckily, HPC workloads often fit the spot model well. Users are already familiar with submitting a job and walking away, letting the scheduler decide the best time to run it.

Most of the major cloud providers let you set a maximum price you are willing to pay when you submit a job, then wait for the spot market to reach that price point.
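On AWS, for example, the EC2 spot request API accepts exactly such a maximum price. A minimal sketch using boto3; the region, AMI ID, instance type and bid price below are placeholders, and other providers expose similar preemptible-capacity APIs:

```python
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="eu-west-2")

# Hypothetical example: bid a maximum hourly price for a compute-optimised
# instance; the request is only fulfilled when the spot price falls to it.
response = ec2.request_spot_instances(
    SpotPrice="0.10",        # maximum USD per hour we are willing to pay
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "c5.18xlarge",       # placeholder instance type
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
```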

This is only one element of HPC energy efficiency; there is a whole other world of shortening job times through better code, right-sizing hardware to fit workloads and enabling power-saving features on the hardware itself, to name a few.

HPC sustainability is a huge challenge that involves everyone who interacts with the HPC system, not just the designers and infrastructure planners. They are, however, a good place to start: talking to the people who can build the right technologies in from the beginning ensures you end up with a sustainable HPC system fit for the future.
