When people think of supercomputers, they think of two different performance vectors (intentional pun), but usually the first thing they think about is the performance of a large parallel machine, since it is performing a massive task scaling across tens of thousands to hundreds of thousands of cores working in concert. It’s what gets all the headlines, like simulating massive weather systems or nuclear explosions or the inner workings of a distant galaxy.
But most of the time supercomputers are not really used for such capability class workloads but rather used to run many smaller jobs. Sometimes these groups of simulations are linked, as in weather or climate ensembles that modify the initial conditions of their models a bit to see the resulting statistical variation in the prediction. Sometimes the tasks are absolutely independent of each other and on a modest scale – on the order of hundreds to thousands of cores – and the supercomputer is more like a time-sharing number processing engine, with task schedulers playing Tetris with hundreds of simultaneous simulations, all waiting their turn on a slice of the supercomputer.
This last scenario is just as important as the first in the HPC space, and a lot of important scientific work is done at this smaller scale – which says nothing about the complexity of the models being run. One such job at the National Center for Atmospheric Research in the United States is called the Community Earth System Model, the continuation of the Community Climate Model for atmospheric modeling that NCAR created and gave to the world in 1983. Over time, land surface, ocean, and sea ice models were added to this community model, and other U.S. government agencies, including NASA and the Department of Energy, became involved in the project. Today, the CESM code is funded primarily by the National Science Foundation and is maintained by NCAR’s Climate and Global Dynamics Laboratory.
NCAR, of course, is one of the pioneers of supercomputing as we know it, which we discussed at length over six years ago when the organization announced the deal with Hewlett Packard Enterprise to build the 5.34 petaflops “Cheyenne” system in the data center outside the Wyoming city of the same name. NCAR was an early adopter of supercomputers from Control Data Corporation (which Seymour Cray co-founded with William Norris in 1957 after the two grew tired of working at Sperry Rand) and Cray Research (which was created in 1972 when the father of supercomputing tired of not running his own show), and so it’s only fitting for NCAR to take a look at cloudy HPC and see what it can and can’t do.
To show that cloud HPC can offer advantages over on-premises equipment, Brian Dobbins, a software engineer at the Climate and Global Dynamics Laboratory, presented some benchmark results running CESM both on the internal Cheyenne system and on the Microsoft Azure cloud, which more than anything else shows the benefits of having more modern hardware and proves that the cloud can be used to run complex climate models and meteorological simulations.
As we have repeatedly pointed out, HPC and AI have many similarities, but there is one key difference that should always be considered. HPC takes a small set of data and explodes it into a massive simulation of a physical phenomenon, usually with visualization, because human beings need that. AI, on the other hand, takes an absolutely massive amount of unstructured, semi-structured, or structured data, mixes it up, and uses neural networks to sift through it to recognize patterns, boiling it all down to a relatively simple and small model through which you can run new data. So it’s “easy” to move HPC to the cloud (in that sense), but the resulting simulations and visualizations can be quite large. And to do AI training, if your data is in the cloud, that’s where you need to do the training, because the egress fees will eat you alive. Likewise, if your HPC simulation results are in the cloud, they will likely stay there and you will do your visualizations there as well. It’s just too expensive to move data out of the cloud. In either case, HPC or AI, organizations need to be careful about where source and target data will end up and about the size and cost of retaining and moving it.
None of this was part of the benchmark talk that Dobbins gave, of course, which was part of Microsoft’s presentation on running HPC on Azure.
On the Cheyenne side, the server nodes are equipped with a pair of 18-core “Broadwell” Xeon E5-2697 v4 processors clocked at 2.3 GHz; the nodes are connected via 100 Gb/sec InfiniBand EDR switching from Mellanox (now part of Nvidia).
The Azure cloud configuration was based on HBv3 instances, one of several H-series instance types available from Microsoft for HPC workloads. HBv3 instances are for workloads that require high bandwidth and feature 200 Gb/sec HDR InfiniBand ports on their nodes, which terminate in a two-socket machine based on a pair of AMD “Milan” Epyc 7003 series chips (we don’t know which one), each with 60 cores enabled, running at a peak of 3.75 GHz (according to Microsoft); we believe that’s just a maximum boost speed with one core enabled, and with all cores running you’re looking at 2 GHz instead. Microsoft said in its presentation that the HBv3 node delivers 8 teraflops of raw FP64 performance on those two sockets. Unless there’s overclocking going on, those 8 teraflops on the node seem a little high to us – we were expecting around 7.68 teraflops with 120 cores at 2 GHz and 32 FP64 operations per clock per core. Anyway, when we do the math on those old Broadwell chips inside Cheyenne, which can do 16 FP64 operations per clock per core, we get just 1.32 teraflops of peak theoretical performance per node.
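The back-of-the-envelope math above can be sketched in a few lines of Python; the all-core clock speed and ops-per-clock figures are the assumptions stated in the text, not vendor-confirmed numbers:

```python
# Peak FP64 throughput: cores x clock (GHz) x FP64 ops per clock per core.
# The clock and ops-per-clock values below are this article's assumptions.

def peak_fp64_teraflops(cores: int, ghz: float, ops_per_clock: int) -> float:
    """Theoretical peak FP64 teraflops for one node."""
    return cores * ghz * ops_per_clock / 1000.0

# Azure HBv3 node: 2 x 60-core "Milan" Epyc at an assumed 2 GHz all-core clock
hbv3 = peak_fp64_teraflops(cores=120, ghz=2.0, ops_per_clock=32)

# Cheyenne node: 2 x 18-core "Broadwell" Xeon E5-2697 v4 at 2.3 GHz
cheyenne = peak_fp64_teraflops(cores=36, ghz=2.3, ops_per_clock=16)

print(f"HBv3: {hbv3:.2f} TF, Cheyenne: {cheyenne:.2f} TF, ratio: {hbv3/cheyenne:.1f}x")
# → HBv3: 7.68 TF, Cheyenne: 1.32 TF, ratio: 5.8x
```

That 5.8X gap in raw peak compute per node is the yardstick against which the benchmark results below should be read.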
So yeah, the Azure stuff better do a lot more work. Oddly enough, it doesn’t do as much as you might think based on these raw specs. Which just shows you that it still depends on how the code can see and use hardware functionality – or can’t without some tweaking.
Here is the relative performance of the Community Atmosphere Model (CAM), which Dobbins says is one of the most computationally intensive parts of the CESM stack:
The ne30 in this graph refers to a one-degree resolution in the model, and F2000climo means running from 1850 (before the rise of the Industrial Revolution) to 2000 (which is apparently the rise of the machines, but we don’t know that with certainty yet).
The X axis shows the number of Cheyenne nodes or Azure HBv3 instances (which are also nodes in this case) running the climate model, and the Y axis shows the resulting number of simulated years per day that each set of machines can run. In every case – big surprise – Azure beats Cheyenne. With 2x the interconnect bandwidth and 5.8x the raw 64-bit floating point compute, Azure had better beat Cheyenne. But it only beats it by a factor of 2 or so, though Dobbins says NCAR is pushing the CAM model within CESM up to 35 simulated years per day by running on more Azure nodes. He didn’t say how many more, but our eye says around 40 Azure nodes should do it. And our eye also says that 80 nodes won’t give you 70 simulated years per day either. CAM does not scale that way – at least not yet.
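Why doubling the node count falls short of doubling simulated years per day can be illustrated with a simple strong-scaling (Amdahl’s law) sketch; the serial fraction used here is a made-up illustrative value, not anything measured on CESM:

```python
# Strong-scaling illustration (Amdahl's law): if a fraction s of each
# simulated step is effectively serialized (communication, I/O, load
# imbalance), doubling the node count yields less than 2x the throughput.
# s = 0.01 is an illustrative value, not a measured CESM number.

def amdahl_speedup(nodes: int, s: float) -> float:
    """Speedup over one node when a fraction s of the work is serial."""
    return 1.0 / (s + (1.0 - s) / nodes)

s = 0.01
gain = amdahl_speedup(80, s) / amdahl_speedup(40, s)
print(f"40 -> 80 nodes: {gain:.2f}x throughput, not 2x")
# → 40 -> 80 nodes: 1.55x throughput, not 2x
```

Even a one percent serialized share is enough to bend the scaling curve well below linear at these node counts, which is consistent with what the CAM results show.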
Simulating a part of the climate, like the atmosphere, is one thing, but being able to run all sorts of models at the same time for different Earth systems is where the real sweat starts to come out of the systems, and on such a test, Cheyenne isn’t doing as badly as one might expect:
This B1850 simulation, which means using 1850 as the initial starting condition of the climate model, couples the atmosphere, ocean, land, sea ice, and river runoff models.
“It’s very complex and load balancing is a big challenge, so for now we’ve been running identical setups on Cheyenne and Azure to see how it works,” Dobbins explained. “These results represent a lower bound, and with tuning, we expect to be able to run faster on Azure.”
As you can see, as the number of cores increases, both machines increase their performance on the full CESM model, and at 2,160 cores, the Azure configuration does 35 percent more work (in terms of years simulated per day of execution) than the Cheyenne system. It’s hard to say how far the tuning can be pushed. Each part of the CESM simulation is different, and the whole obviously did not behave like the CAM part alone.
We are not for or against any particular platform here at The Next Platform, and we love the platforms of the past, in their time, as much as we love those coming down the pike. What we certainly are for is the wise spending of public funds, and we are definitely for more accurate climate models and weather forecasts. To the extent that Azure can help NCAR and other weather centers and government organizations accomplish this, awesome.
But none of these comparisons are about money, and that’s a problem. There’s no easy way from the outside to compare the cost of runs on Cheyenne to the cost of runs on Azure, as we don’t have any of the system, people, and facility cost data attributable to Cheyenne over its lifetime to try to calculate the cost per compute hour. But make no mistake: it would be fascinating to see what the cost per minute of running an application on Cheyenne really is, with all overhead, compared to the cost of lifting and shifting the same work to Microsoft Azure, Amazon Web Services, and Google Cloud. This is the real hard math that needs to be done, and such data for any publicly funded infrastructure should be freely and readily available so that we can all learn from this experience.