ALBUQUERQUE, N.M. — Researchers expect the latest version of Cplant™ Antarctica, the upwardly mobile computer at Sandia, to become the 20th fastest in the world once 1,300 off-the-shelf computers, recently arrived from Compaq Computer Corporation, are integrated by early fall.
Formerly composed of only 600 Alpha workstations, the “home-grown” Sandia computational cluster had already been ranked 44th among the world’s fastest supercomputers. It is also already the largest “production” Linux cluster — a cluster that produces technical results to aid ongoing science projects.
“Another kind of revolution is going on,” says lead Cplant software developer Rolf Riesen of Sandia. “A major government laboratory like Sandia is willing to spend $9.6 million plus a significant amount of in-house development to make a supercomputer out of a supply of off-the-shelf parts.”
The new Cplant will include 1,600 Compaq Alpha computers, also called nodes. (Three hundred older nodes will be used for other purposes.) The additional units are expected to be up and running early this fall.
Better than Beowulf?
Cplant differs from better-publicized clusters like Beowulf (see “Additional Background” at end of release), developed to run very specific programs for small groups of users, or the University of California’s Millennium Project, which attempts to link clusters of computers so that, when unused by specific owners, they can be tapped to contribute to the overall power of the system.
Cplant is a true, multipurpose supercomputer, says Bill Camp, director of Sandia’s Computations, Computers, and Math center. Scientists can run any program exactly as though they were using ASCI Red, the Sandia supercomputer that is currently the second fastest in the world. Cplant’s current use is to provide backup for the over-subscribed Red machine, also known as the teraflops computer. With its new capabilities, driven by Compaq’s Alpha DS10L processors, Cplant should run at one-half to two-thirds the top computer’s speed.
The term Cplant, for Computational Plant, has a double meaning: physical computational hardware (as in industrial plant), and an organic plant that grows, evolves, and is pruned.
“Most researchers have a hard time convincing their sponsors that this approach is feasible; ordinarily, the software out there doesn’t scale to such numbers of nodes,” says Riesen. “Our software, on the other hand, has already run. So Sandia jumped out ahead of the pack.
“But not that ahead — only a year or two. Eventually, other people will get there too. People all over the world are already using Beowulf. We are hoping to release our software to the general public soon. Then everyone in the world will help us improve it. Otherwise we will have this proprietary code that no one knows about, and something else that may not be as good will become the standard to be improved; and when we hire people, they won’t have experience with the systems we’re running.”
The genesis of “normal” supercomputers
“Supercomputers for the past decade have traditionally been purchased as turnkey machines from the world’s largest computer makers,” says Neil Pundit, manager of Cplant software development at Sandia. “Such machines have cables, connection boxes, as well as monitors and testing equipment, already built in place. In Cplant, we are following a new path, assembling a supercomputer out of parts, open-source software, and our own developments.”
The fastest supercomputers in the world are an integral part of DOE’s science-based stockpile stewardship program, which requires extremely high computational speeds to simulate nuclear explosions and to make sense of the torrent of data obtained from those simulations. ASCI Red, Sandia’s Intel-built supercomputer, was the fastest machine in the world for several years until bested in early July by another DOE supercomputer — ASCI White, an IBM-built supercomputer at Lawrence Livermore National Laboratory. The factory-built machines are still far superior to any off-the-shelf products.
The poor man’s supercomputer
However, Sandia researchers decided they could create a “poor man’s” ASCI Red architecture by combining high-performance commodity parts with Sandia software to be developed by Riesen and his colleagues at Sandia’s sites in New Mexico and California. They called this idea Cplant. Because they had helped develop the system software that made Red into the fastest computer in the world, they believed they could succeed with an off-the-shelf version.
Sandia took up the task of physically linking the highest performance commodity PCs in the world into a tightly knit cluster — really a virtual supercomputer. The researchers then developed the software to make this work.
Bill Blake, vice president of Compaq’s High Performance Technical Computing Group, says, “Sandia is doing pioneering work in building truly large Linux systems, using a combination of open source software along with their researchers’ own development, along with hardware, tools and compilers from Compaq.”
Technical information
The Compaq AlphaServer systems run a modified version of RedHat Linux plus the parallel systems software developed in the Cplant project. The DS10L is less than two inches tall, allowing up to 42 DS10L systems to be packaged in a standard rack. The Sandia design packages 33 systems in a rack, leaving room for other required components such as high-performance interconnections, networking, and system management. This is a significant expansion, since current Cplant designs allow only eight systems in a rack, with little space for other components. The new racks are designed to require as few external connections as possible, allowing the major functional units of the system to be integrated and tested in manufacturing at Compaq. This greatly simplifies installation and maintenance of this large system.
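The packaging figures above imply a rough rack count, which the short Python sketch below works out. It assumes, purely for illustration, that all 1,300 newly delivered systems go into racks of a single density; the release does not say how the racks are actually filled, and the function name is hypothetical.

```python
def racks_needed(systems: int, per_rack: int) -> int:
    """Return the number of racks needed to hold `systems` machines.

    Hypothetical helper: rounds up, because a partly filled rack
    still occupies floor space.
    """
    if systems < 0 or per_rack <= 0:
        raise ValueError("systems must be >= 0 and per_rack must be > 0")
    return (systems + per_rack - 1) // per_rack  # integer ceiling division


# Figures quoted in the text; applying them all to the 1,300 new
# systems is an assumption made only for this comparison.
NEW_SYSTEMS = 1300
DENSITIES = {
    "older Cplant design": 8,      # systems per rack
    "new Sandia design": 33,       # leaves room for interconnect and management
    "DS10L physical maximum": 42,  # nothing else in the rack
}

for label, per_rack in DENSITIES.items():
    print(f"{label}: {racks_needed(NEW_SYSTEMS, per_rack)} racks")
```

Under that assumption the new design needs roughly a quarter as many racks as the older eight-per-rack layout, while giving up about a fifth of each rack's theoretical capacity to networking and management hardware.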
Internal communications among processors are carried out over a series of links and switches called Myrinet, developed by Myricom Corp. The several internal communications networks in Cplant are critical to managing the computer as a single resource and to carrying large parallel jobs. The newest Myrinet switches and links arrived in July. “The machines should be up and running as production resources in their new configuration within a few months,” says Art Hale, Sandia manager of the Cplant project.
Sandia now has 2,600 Compaq Alpha computers as nodes in Cplant clusters of various configurations, with 512 at Sandia’s California site. The Antarctica subcluster, in New Mexico, is the largest and has 1,632 processors. This system is really three systems, with 256 processors always in a classified partition, 256 always in a secure but unclassified partition, and 64 always in an “open” partition. The last are available to uncleared staff and partners from industry and academia. The other 1,056 processors will be switched among the three elements as demand for the types of calculations warrants.
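The partition arithmetic for Antarctica can be checked with a small Python sketch. The fixed partition sizes and the 1,632-processor total come from the text; the proportional split of the switchable processors is only an assumed policy for illustration, since the release does not describe how Sandia decides the switching, and the names `FIXED`, `FLOATING`, and `allocate` are hypothetical.

```python
# Processors permanently assigned to each partition (from the text).
FIXED = {"classified": 256, "unclassified": 256, "open": 64}
TOTAL = 1632
# Processors that can be switched among partitions: 1632 - 576 = 1056.
FLOATING = TOTAL - sum(FIXED.values())


def allocate(demand: dict) -> dict:
    """Split the floating pool in proportion to `demand`.

    ASSUMED policy, not Sandia's documented method. `demand` maps
    partition names to non-negative integer weights. Leftover
    processors from integer rounding go to the largest remainders.
    """
    if set(demand) - set(FIXED):
        raise ValueError("unknown partition name in demand")
    if any(weight < 0 for weight in demand.values()):
        raise ValueError("demand weights must be non-negative")

    weights = {name: demand.get(name, 0) for name in FIXED}
    total_weight = sum(weights.values())
    if total_weight == 0:
        # No demand: the floating pool stays idle.
        return dict(FIXED)

    shares = {n: FLOATING * w // total_weight for n, w in weights.items()}
    leftover = FLOATING - sum(shares.values())
    by_remainder = sorted(
        FIXED, key=lambda n: (FLOATING * weights[n]) % total_weight, reverse=True
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return {name: FIXED[name] + shares[name] for name in FIXED}


# Example: classified work is twice as heavy as each other class.
print(allocate({"classified": 2, "unclassified": 1, "open": 1}))
```

Whatever the real switching policy is, the invariant the text implies is that each partition never drops below its fixed size and the three partitions together never exceed 1,632 processors.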
ADDITIONAL BACKGROUND: Thomas Sterling credited with idea that started it all
The idea of hooking together off-the-shelf computers to create an inexpensive computational cluster with supercomputer functionality is generally credited to Thomas Sterling, now at the Jet Propulsion Laboratory in California. Sterling was one of the creators of the Beowulf system in the mid-1990s. Beowulf achieves cost savings and simplicity but sacrifices scalability, balance, and generality. Beowulf systems are typically loosely coupled and devoted to running one or a few applications for small groups of researchers. In such roles they have proven to be extremely significant resources with a huge cost advantage over commercially provided solutions.
Many institutions — including Sandia’s California site and the University of California at Berkeley — experimented throughout the late 1980s with loosely coupled clusters or farms of workstations. These were the ancestors of Sterling’s Beowulf system.