Moving up from the far, far outside

Sandia's 'garage' computer hookup grows, prospers

Sandia news media contact

Neal Singer
nsinger@sandia.gov
505-977-7255

ALBUQUERQUE, N.M. — Among the 500 machines listed as the world’s fastest supercomputers is a peculiar entry in 44th place.

Rather than the familiar manufacturing names — Intel, IBM, Cray-SGI, Fujitsu, Hitachi, and so on — that occupy all other spots on the Top500 website, the 44th-positioned machine is described as “self-made.”

The machine, called Cplant Cluster, is an assembly of 600 computers at the Department of Energy’s Sandia National Laboratories. Taken one at a time, the computers are no different from high-end systems one might find in any retail store, says Sandia Cplant software developer Rolf Riesen, and each solves benchmark problems at a rate of approximately 600 million operations a second (600 megaflops). Working together, however, 580 nodes (computer “brains”) solve the same benchmark problems at 232.6 billion operations a second (232.6 gigaflops).

“What we do is find off-the-shelf hardware,” says Riesen. “Then we write our own device driver to get fast communication between nodes; next, we have all these utilities we developed to tie nodes together for networking.”

The utilities, he said, were modeled after those on Sandia’s ASCI Red machine, more familiarly known as Teraflops, the fastest computer in the world. It solves problems at better than a trillion operations a second. Many of the same computer researchers worked on both projects.

A later test run achieving 247 billion operations a second on 572 nodes would have placed Cplant in 40th position on the supercomputer list, but was accomplished too late to be submitted for this year’s list.
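For readers who want to check the arithmetic, the short Python sketch below compares each reported benchmark result with what the same number of nodes would deliver if performance scaled perfectly. It rests on one assumption not stated in the release: that the approximate single-node rate of 600 megaflops is a fair baseline for ideal scaling. The function name and the run labels are illustrative only, and the resulting percentages should be read as rough estimates, not as figures reported by Sandia.

```python
# Rough scaling check using only the figures quoted in the release.
# Assumption: ~600 megaflops (0.6 gigaflops) per node is the ideal-scaling
# baseline; the release gives this value only as an approximation.
PER_NODE_GFLOPS = 0.6


def parallel_efficiency(measured_gflops, nodes, per_node_gflops=PER_NODE_GFLOPS):
    """Return measured aggregate rate divided by (nodes * per-node rate)."""
    if nodes <= 0 or per_node_gflops <= 0:
        raise ValueError("nodes and per_node_gflops must be positive")
    ideal_gflops = nodes * per_node_gflops
    return measured_gflops / ideal_gflops


# (measured gigaflops, node count) for the two runs described in the text.
runs = {
    "ranked run": (232.6, 580),
    "later run": (247.0, 572),
}

for label, (gflops, nodes) in runs.items():
    efficiency = parallel_efficiency(gflops, nodes)
    print(f"{label}: {nodes} nodes, {gflops} gigaflops, "
          f"about {efficiency:.0%} of ideal scaling")
```

Under that assumption, the ranked run works out to roughly two-thirds of ideal scaling, and the later run to a little over 70 percent.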

Last year, Cplant, then running 400 nodes, was ranked 92nd in the world on a widely accepted test called LINPACK, which measures the speed and accuracy of machines processing very similar series of operations.

A very important factor, Riesen says, is the ability to add more units to Cplant — a capability called scalability. “There are programs and hardware out there as fast or faster than ours, but it’s the ability to stay that fast even when you have 600 nodes in the system working together that counts.”

It’s relatively easy to get 16 nodes working together, he says, but there are barriers at 64. Two hundred fifty-six nodes are even more difficult.

The work, while not yet achieving teraflops-scale performance, is aimed at preparing for the day when companies no longer produce massively parallel supercomputers, says Riesen. And the clusters do useful work now by reserving the Teraflops machine for the larger-scale problems only it can handle.

“We’re running production codes on last year’s Cplant version,” he said. “It’s already useful.”

While many laboratories have attempted similar feats, “no one has a scalable cluster that large, in production modes running large and demanding technical applications,” says Riesen.

 

Sandia National Laboratories is a multimission laboratory operated by National Technology and Engineering Solutions of Sandia LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration. Sandia Labs has major research and development responsibilities in nuclear deterrence, global security, defense, energy technologies and economic competitiveness, with main facilities in Albuquerque, New Mexico, and Livermore, California.
