In fact, high performance computing (HPC) refers to a specific subfield of computer science and engineering, one that has important connections to many other science and engineering disciplines. At Virginia Tech, several faculty in the Departments of Computer Science and Electrical and Computer Engineering list HPC as a central research focus; and faculty from dozens of other departments use HPC as an important tool in their research and teaching. The purpose of this article is to provide a brief introduction to HPC and to describe some of our recent efforts in this area.
What is high performance computing? Very informally, HPC deals with the design, implementation, and use of the most powerful computing systems at any point in time. Note that last phrase. Today's most powerful systems will be considered average in a few years. (See www.top500.org for one ranking of the world's most powerful computers.) Obsolescence happens not just because faster processors are announced. Historically, radically different approaches to building high performance systems have emerged every ten years or so. The rate of change has slowed a bit in recent years, and more stable software layers have emerged to hide changes in hardware. But the HPC field is still a very dynamic one. Hence, one of our major challenges is to design software that achieves good performance across multiple generations of hardware. A second major challenge characterizing HPC stems from its highly interdisciplinary nature. High performance can only be achieved with considerable knowledge about the problem being solved. At the same time, the opportunities and constraints of HPC often influence how these problems are formulated in the first place. So there is always close interaction between computing specialists and `applications' specialists in HPC research teams.
What's the catch? Why can't we settle the global warming debate once and for all? The problem is that computer simulations can never get the right answer---at least, not if ``right answer'' means the exact behavior of the system being simulated. We can only compute approximations to the true answer. Accuracy is lost at multiple points along the way. The simulation process begins with mathematical models describing the system of interest. (Here is one place where computational science depends on theoretical and experimental science.) These mathematical models describe relationships among the various components of the system. They consist of equations that are often very complicated, although many problems are simulated using models that are taught in high school physics or chemistry. The important point is that these models are only approximations of the real system; they simplify or completely ignore features of the system that (we hope) do not significantly affect the simulation. For example, a solar system simulation assumes all planets are perfect spheres, a structural engineering simulation pretends that the material's density is uniform, or a U.S. weather simulation ignores the effect of ocean temperatures. Any of these assumptions may be dangerous, depending on the questions we expect our simulation to answer. The challenge is to decide which simplifications can be made and which cannot. Making the models more complex may improve the fidelity of the simulation, but complex models require much more computational power to simulate.
Given a mathematical model, the second important loss of accuracy occurs when we generate a finite computational model. We say `finite', in contrast to the mathematical model, which describes what we would like to be true at every point in our simulated world. For example, meteorologists can write down a nice set of equations describing the relationships among various properties of the atmosphere, e.g., temperature, pressure, humidity, and wind velocity. The mathematical model says these equations are true at every point in time and space. We must relax this requirement if we want to do a simulation that runs for a finite amount of time. The computational model calculates approximate values of these unknowns at a large, but finite, number of points. It does this by enforcing the governing equations at a finite number of points or regions in the simulated space/time domain. (There are many algorithms for finding values of N unknowns that satisfy N equations; this is the single most important type of computation in scientific computing.) In the weather simulation example, the output might consist of the values of the unknowns at one-hour intervals, at millions of uniformly distributed points in the atmosphere. It should come as no surprise that the accuracy of the computational model is roughly proportional to the number of space and time points used. If we represent the atmosphere over the U.S. by a 1000 x 1000 x 100 grid of points, and use a simulation time-step of one hour, we are pretending that the unknowns don't vary across a few kilometers or within an hour's time. These assumptions make meteorologists nervous. An obvious approach to improving accuracy is to add more space and time points. Unfortunately, this generates an enormous increase in the computational cost of the simulation.
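For readers who want to see the arithmetic, here is a small back-of-the-envelope sketch in Python. The 1000 x 1000 x 100 grid and the one-hour time step come from the weather example above; the forecast length and the number of unknowns per grid point are made-up illustrative values, not those of any real weather code.

```python
# A back-of-the-envelope sketch (not an actual weather code) of how resolution
# drives cost.  The grid dimensions and one-hour step come from the example in
# the text; the 48-hour forecast and five unknowns per point are assumptions.

def simulation_size(nx, ny, nz, hours_simulated, dt_hours, unknowns_per_point=5):
    points = nx * ny * nz
    steps = hours_simulated / dt_hours
    return points * unknowns_per_point * steps   # total values to be computed

coarse = simulation_size(1000, 1000, 100, hours_simulated=48, dt_hours=1.0)
# Halving the spacing in each direction (2x the points per axis) and halving the
# time step multiplies the work by roughly 2 * 2 * 2 * 2 = 16.
fine = simulation_size(2000, 2000, 200, hours_simulated=48, dt_hours=0.5)

print(f"coarse grid: {coarse:.1e} values; refined grid: {fine:.1e} values "
      f"({fine / coarse:.0f}x more)")
```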
This brings us back, finally, to high performance computing. On the one hand, accurate simulations require complicated mathematical models and computational models with fine spatial and temporal resolution. On the other hand, both of these approaches require huge computational resources. The goal of HPC is to push on the computational resources side of this bottleneck. We turn now to a brief review of the main ingredients that contribute to progress in HPC.
In past decades HPC hardware was noticeably different from the computers the rest of us used. Novel processors, sophisticated memory architectures, and very steep price tags characterized the field. Today much of this exotic computer architecture has trickled down to the desktop. The building blocks that make up modern high-performance systems aren't all that different from what you find in systems all around you. What is different is scale. HPC systems typically have more of everything---more processors, more memory, more disk space. Hence, the defining characteristic of most HPC systems today is parallelism. The idea of parallel computation is to attack a single problem using many computers (or `processors') at the same time. There are many ways to build a parallel computer. Important issues include whether the processors will share one large memory or each have their own, how the processors will communicate, and how they will be synchronized. One approach to building parallel computers that has become popular in the last decade is the idea of a parallel cluster. A cluster is built out of computers that could exist perfectly well as stand-alone machines; in the cluster they are interconnected with a network, and a software layer makes the set of machines look like a unified resource. Clusters are becoming an attractive alternative to self-consciously parallel computers because of their excellent price/performance. The per-processor cost of a cluster is very low, and the performance of off-the-shelf PCs continues to improve steadily.
A second necessary ingredient for high-end computing is system software. By `system' software we mean the software that allows programmers and users to take advantage of a parallel computing system, without worrying about most of the nasty details under the hood. The responsibilities of such software include memory management (`which processor gets which section of memory?'), scheduling (`which program gets which processors next?'), synchronization (`processor 7 needs the results from processor 12; are they ready yet?'), and fault tolerance (`processor 13 isn't responding; what do we do?'). Other important examples of HPC system software include parallel debugging and performance monitoring tools, and programming languages that target parallel computer systems. Over the years, there have been many research projects that have proposed new programming languages for parallel computing. Other projects attempted to automatically detect parallelism in existing non-parallel (`sequential') programs. Neither approach has proved very successful. By far the most common approach today is to use one of the standard sequential programming languages, but with a few extensions that allow the user to explicitly take advantage of parallelism, e.g., by indicating sections of the program that should be distributed across the processors. This puts some burden on the programmer, to be sure. Fortunately, some of the most common computational steps in typical scientific programs have been implemented by experts and are freely available in software libraries.
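To give a flavor of what explicit message passing looks like, here is a minimal sketch using Python's mpi4py bindings. (The scientific codes discussed in this article are written in other languages, but the send-and-receive pattern is the same in spirit.) Each processor computes its own piece of a sum, and processor 0 collects the pieces; the file name in the comment is only an example.

```python
# A minimal sketch of explicit message passing, using Python's mpi4py bindings
# purely for illustration.  Run with, e.g.:  mpirun -n 4 python demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # which processor am I?
size = comm.Get_size()          # how many processors in total?

# Each processor works on its own slice of the data...
local_result = sum(i * i for i in range(rank, 1000, size))

if rank != 0:
    # ...and explicitly sends its piece to processor 0.
    comm.send(local_result, dest=0, tag=11)
else:
    total = local_result
    for source in range(1, size):
        # recv blocks until that processor's result arrives; this is where the
        # synchronization question ("are the results ready yet?") shows up.
        total += comm.recv(source=source, tag=11)
    print("combined result:", total)
```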
Probably the most important ingredient in applying HPC to large-scale computational science problems is good algorithms. Traditionally, the emphasis in evaluating an algorithm is on running time. Faster is better. For many computational science problems, the amount of memory required is also a concern. But a new issue, scalability, must be added to the definition of a `good' algorithm in the context of HPC. Informally, an algorithm is scalable if it performs well as both the problem size and the number of processors increase. Notice that we assume the problem size (i.e., the amount of data and amount of arithmetic) will grow proportionally to the number of processors available. This is a realistic assumption when we recall the computational science context---usually more computational power means I will do a larger (more accurate) simulation, not that I will do the same one a little faster. Since processor speeds are much faster than network speeds, the key to finding scalable algorithms is to minimize the amount of communication between processors.
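The following toy model, with entirely made-up cost constants, illustrates why the communication pattern decides scalability: an algorithm in which each processor exchanges data only with a fixed number of neighbors keeps its efficiency as the machine grows, while one in which every processor must talk to every other one does not.

```python
# A toy model (not measurements) of "weak scaling": each processor keeps the
# same amount of work as the machine grows.  Cost constants are illustrative.

WORK = 1.0    # compute time per processor, per step
MSG = 0.01    # time to exchange one message with one partner

def efficiency(partners):
    """Fraction of time a processor spends computing rather than communicating."""
    return WORK / (WORK + MSG * partners)

for p in (8, 64, 512, 4096):
    neighbor = efficiency(partners=6)        # e.g., a 3-D grid: six neighbors
    all_to_all = efficiency(partners=p - 1)  # everyone talks to everyone
    print(f"{p:5d} processors: neighbor-only {neighbor:.2f}, "
          f"all-to-all {all_to_all:.2f}")
```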
Transport is typical of large-scale scientific codes in at least three ways. First, its computational requirements grow very quickly as a function of a few simple parameters. For example, the accuracy of the simulation is directly tied to the number of so-called plane waves in each of three directions. The table below gives the typical memory and time requirements for a single run of transport, for various numbers of plane waves.
Plane Waves | Memory | Time
---|---|---
(5,5,10) | 0.5 GB | 6 hours
(6,6,12) | 1.6 GB | 27 hours
(7,7,14) | 3.8 GB | 4 days
... | ... | ...
(15,15,30) | 307.2 GB | 8 years
Here we are assuming a computer that can execute 1 billion arithmetic steps per second, a common rate for desktop systems today. Each row of the table represents parameter values that correspond to important scientific questions that Di Ventra's group would like to answer. These estimates are clearly bad news if we hope to do exciting physics (and graduate students) in our lifetime.
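As a sanity check on the scale involved, the short sketch below simply converts the run times from the table into rough operation counts at the assumed rate of one billion arithmetic operations per second; no new data is involved.

```python
# Convert the table's run times into rough operation counts at the rate assumed
# in the text (10^9 arithmetic operations per second).

RATE = 1e9  # arithmetic operations per second

runs = {
    "(5,5,10)": 6 * 3600,               # 6 hours, in seconds
    "(6,6,12)": 27 * 3600,              # 27 hours
    "(7,7,14)": 4 * 24 * 3600,          # 4 days
    "(15,15,30)": 8 * 365 * 24 * 3600,  # 8 years (ignoring leap days)
}

for waves, seconds in runs.items():
    print(f"{waves:>10}: about {seconds * RATE:.1e} arithmetic operations")
```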
The second characteristic that transport shares with many scientific codes is that it never runs just once. Instead, it must be run thousands of times to reach scientifically interesting conclusions. This is very bad news, since the time estimates in the table above are for only a single run of the code. In many cases, the output of a run of transport is only a single number---one point on a graph showing the relationship between some input value and the computed output value. So a computational experiment (i.e., producing that graph) for a particular molecular configuration requires many runs of transport. Worse, there are many molecular configurations to study. And even worse, some of the individual points on the graph should be statistical averages over several runs. One begins to see why computational scientists can consume any amount of computing power they can find!
Finally, transport is a typical HPC application in that it is made up of relatively homogeneous parts. At last, this is some good news. It makes it more likely that we will be able to use parallel computers to attack the problem. By homogeneous parts, we mean that the computation is dominated by millions of the same steps, done over and over again, but with different data. At a very high level, transport naturally breaks into 32 to 64 (almost) independent tasks; each one corresponds to an energy level and each is of roughly similar size. So each of these tasks can be assigned to a different processor on a parallel machine. Within each of the tasks there is a great deal of homogeneity too, but splitting the work of a single task across multiple processors is not nearly so straightforward.
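To make the idea concrete, here is a miniature sketch of this task-level decomposition in Python: a list of independent energy-level tasks handed to a pool of worker processes. The task count is taken from the description above, but the work function is a made-up stand-in, not transport's actual physics.

```python
# A sketch of the "natural" decomposition described above: one energy-level
# task per worker.  The work function is a placeholder, not transport's code.
from multiprocessing import Pool

NUM_ENERGY_LEVELS = 32   # transport breaks into roughly 32 to 64 such tasks

def solve_energy_level(level):
    """Stand-in for the large, homogeneous computation for one energy level."""
    return sum((level + 1) * i for i in range(1000))

if __name__ == "__main__":
    # On a real parallel machine each task would run on its own processor;
    # here a pool of worker processes plays that role on a single machine.
    with Pool() as pool:
        results = pool.map(solve_energy_level, range(NUM_ENERGY_LEVELS))
    print("combined contribution:", sum(results))
```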
Clearly, we need to apply all the ingredients of HPC---hardware, software, and especially algorithms---if we are to make a dent in the formidable computational requirements of an application such as transport. Parallel computing can definitely help. We are currently running transport night and day on most of LASCA's Anantham cluster. Anantham was built by Dr. Srinidhi Varadarajan of the CS Department, with support from the Departments of Physics and Computer Science, and from the Virginia Bioinformatics Institute (VBI). It consists of 200 personal computers, purchased (literally) from our neighborhood computer store. Each PC runs Linux, has a 1 GHz AMD Athlon processor, 1 GB of main memory, and a 20 GB disk. The only relatively exotic thing about Anantham is the interconnection network. Data can be sent between any two processors in the cluster over a Myrinet network at rates of over 2 billion bits per second.
In terms of software infrastructure, transport is implemented in a widely-used scientific programming language (FORTRAN 90), extended with special functionality to allow explicit sending and receiving of data among the processors. We also take advantage of publicly available parallel implementations of the three most dominant computational steps in the application.
Algorithmically, we have made several improvements to the code. One change involved an improved strategy for assigning energy level tasks to processors, so that the total amount of work is more balanced across the processors. A second major change has substantially improved the scalability of the code. Earlier versions of transport were constrained in that they assigned a single energy level task to a single processor. This had two negative consequences: with 32 energy levels we could use only 32 processors at a time, and we could not solve with more than (5,5,10) plane waves because larger problems wouldn't fit in the 1 GB of memory on each Anantham processor. In order to get around these bottlenecks, we changed the problem and data decomposition strategy so that a single energy level task is now distributed over several processors. In this way we have enough total memory for larger problems, and we can conceivably use many hundreds of processors for a single run of transport.
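One standard way to balance uneven tasks is a greedy rule: estimate each task's cost, then repeatedly hand the most expensive remaining task to the least-loaded processor. The sketch below shows that generic idea; the cost estimates are hypothetical, and the strategy actually used in transport may differ in its details.

```python
# A generic load-balancing sketch: sort tasks by estimated cost and always give
# the next one to the least-loaded processor.  Not transport's actual scheduler.
import heapq

def balanced_assignment(task_costs, num_procs):
    """Return {processor: [task indices]} using a greedy largest-task-first rule."""
    loads = [(0.0, p) for p in range(num_procs)]      # (current load, processor)
    heapq.heapify(loads)
    assignment = {p: [] for p in range(num_procs)}
    # Hand out the most expensive tasks first.
    for task in sorted(range(len(task_costs)), key=lambda t: -task_costs[t]):
        load, proc = heapq.heappop(loads)
        assignment[proc].append(task)
        heapq.heappush(loads, (load + task_costs[task], proc))
    return assignment

# Hypothetical per-task cost estimates (e.g., one per energy level).
costs = [4.0, 3.5, 3.5, 2.0, 2.0, 1.5, 1.0, 0.5]
print(balanced_assignment(costs, num_procs=3))
```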
We are currently studying new algorithms for transport that have the potential for improvement that is much more dramatic than we can get from parallel computing alone. For example, the algorithm we are currently using for the most expensive step of the computation takes time proportional to (nx*ny*nz)^3, where (nx, ny, nz) represents the number of plane waves. There is reason to believe that we may be able to use a different strategy here, with running time proportional to (nx*ny*nz)^2.5. If this works, the time estimate for a single run of the (15,15,30) case will drop from 8 years to 12 days. It is easy to see why better algorithms are the key to real improvements in simulation fidelity.
The second major focus of our current work is in context-aware algorithms and software. By `context' we mean both resources (machines, networks, software) and problems (the scientific application, previous results, domain knowledge). As problems get more challenging and machines get more diverse, it is critical to design software that responds to, and takes advantage of, its context; it must `do the right thing', no matter where it runs on the grid. The performance penalty for using the wrong algorithm or the wrong implementation technique on a particular HPC resource can be enormous. It is unrealistic to expect individual programmers to anticipate all the combinations of problem instance and computing resource that a particular program will be asked to run on. Working with colleagues in Computer Science, we are designing mechanisms that will allow scientific codes to automatically choose among candidate algorithms for important steps of a computation. These mechanisms include analytical models of algorithms and machines (Eunice Santos), patterns mined from previous results (Naren Ramakrishnan; see his article in this issue for a discussion of how software is more effective when it takes greater advantage of context), and software frameworks for building complex simulations and managing computational experiments (Cliff Shaffer and Layne Watson).
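A cartoon version of such a mechanism is a dispatch rule that inspects the problem instance and the machine before choosing an implementation. The sketch below is purely illustrative: the solver names, thresholds, and memory model are hypothetical, and the real mechanisms described above also draw on analytical models and mined patterns rather than a single hard-coded rule.

```python
# A toy sketch of context-aware algorithm selection: pick an implementation of
# some step based on the problem instance and the machine it lands on.
# All names, thresholds, and the memory model here are hypothetical.

def dense_solver(n): ...          # placeholder candidate implementations
def iterative_solver(n): ...
def out_of_core_solver(n): ...

def choose_solver(n_unknowns, memory_bytes_available):
    """Crude dispatch rule: dense if the matrix fits comfortably in memory,
    iterative if it does not, out-of-core as a last resort."""
    dense_matrix_bytes = 8 * n_unknowns ** 2          # double precision entries
    if dense_matrix_bytes < 0.5 * memory_bytes_available:
        return dense_solver
    elif 8 * n_unknowns < memory_bytes_available:     # a few vectors still fit
        return iterative_solver
    return out_of_core_solver

solver = choose_solver(n_unknowns=50_000, memory_bytes_available=1 * 2**30)
print("chosen:", solver.__name__)
```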
What does the distant future hold for HPC? It is a safe bet that the field will continue to grow and change for the rest of my career. New ways of building parallel computers will be tried. Molecular electronics may yield dramatic increases in per-processor speeds. Perhaps one of the exotic approaches (DNA computing? quantum computing?) will prove successful! It's exciting to think about, and it makes me very confident that high performance computing will continue to be much more than an advertising slogan.