ALBUQUERQUE, N.M. — Imagine the large-scale structure of science displayed as a landscape. That big mountain range with colorful peaks, ridges, and valleys is physics. Chemistry is a range off in the distance, and in another direction is, say, biology. Trails linking ranges are composite sciences like nuclear chemistry or molecular biology.
The first phase of this unique science landscape has been constructed at Sandia National Laboratories. The view — formed by algorithms fed citation data from technical articles — originally was conceived to help US intelligence more easily track the thrust of research in countries of interest to the United States. The complex data management scheme can use open scientific literature to make visible the movement of nuclear and other military technologies across industries, countries, and regions.
An expanded program is being made available to funding agencies, civilian scientists, and eventually (in scaled-down form) local libraries, to help them quickly find information about science. It is expected to pay for itself from licensing fees.
The expanded program will allow managers of funding agencies to watch peaks, ridges, hills and trails change contour and definition on a year-by-year basis, indicating research areas of increasing or diminishing scientific interest. That information is helpful in determining which grant applications to fund, in what overall areas to invest for research and development, and how to improve investment strategies.
Worldwide coverage and market
Researchers using virtual reality techniques can fly over the landscape to see new sub-areas appearing and others merging or separating in perhaps unknown ways. By flying lower, researchers see more subcomponents of the region, and still lower, titles of the journal articles which form its substance.
Henry Small, chief researcher at the Institute for Scientific Information (ISI) — a Philadelphia-based, for-profit corporation that supplied raw data for the project — said his company is interested in licensing the Sandia software for a worldwide market.
The Sandia landscape prototype became operable in September. The computerized color image is created by newly derived algorithms fed citations from 30,000 articles published in analytic chemistry from 1986 to 1996. By November, a landscape created by collating references from more than 3 million articles published over the last 18 years in a variety of sciences is expected to be organized in a huge supercomputer data matrix. However, most graphics workstations will be able to run subsections of the data.
“I suspect that anyone who spent an hour playing with the tool would discover interesting patterns they were unaware of,” says Bruce Hendrickson, principal Sandia researcher on the project. Other project researchers include David Johnson and University of New Mexico student Brian Wylie.
“People are looking for a way to make information more accessible from huge data bases. In this case, where are the concentrations of effort in different fields of science?” Small says. “The Sandia system gives an overview, but you can drill down to see details. We’ve never had a computer system that could deal with this magnitude of data. The software gives you the whole display at your fingertips, and you can interact with it, which would be impossible without that kind of computing power.”
Video demonstrates method
A four-minute video, “Mapping Science,” which demonstrates the power of the method, will be shown in the fall of 1996 at workshops and symposiums on global security, information retrieval, library and information science, and science and technology, in the United States, Switzerland, Germany and Denmark.
The project was funded by Sandia’s Laboratory Directed Research and Development program, which finances speculative defense projects at the Labs. The program bought use of ISI’s data base for $40,000 to help develop the method.
Says program manager Chuck Meyers, “We know that science is produced through a network of interactions among researchers. This network and science evolve together. It is this co-evolution that drives the idea that we can portray science as an evolving landscape and can learn much from what we observe.
“The landscape shows the dynamics of the evolution of science, and ultimately may help us improve our ability to invest in leading areas.”
Traditionally, industries measure their return on research and development by the profit from selling a product, but seeing scientific interest in new areas might lead industrial researchers in new directions, says Meyers.
“People from the intelligence community are interested in a more general way because they also get reams of data to analyze,” says Hendrickson. “Organizing this data in a visual way will enable new insights much more quickly.”
How it works: Citations and algorithms chart science as time passes
The mapping algorithms cluster scientific papers by the number of citations the papers list in common, along with the number of times they are cited together in the reference sections of other scientific papers, rather than by their titles or key words. Titles may differ eccentrically, and key words have different meanings in different disciplines.
Citations are the collected footnotes that form the list, found at the end of almost every scientific paper, of previously published articles most critical to the current advance. The more citations two articles list in common and the more articles that cite them both, the more likely the research papers have a common focus, and the closer the data points that represent them.
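The two signals described above (references two papers list in common, and later papers that cite them both) can be illustrated with a minimal Python sketch. It is not Sandia's or ISI's algorithm, whose details are not given here. The toy data, the function names, and the choice to simply add the two counts are all assumptions made for illustration.

```python
# Hypothetical toy data: each paper maps to the set of papers it cites.
references = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p9"},
    "D": {"A", "B"},        # a later paper citing both A and B
    "E": {"A", "B", "C"},   # a later paper citing A, B and C
}


def shared_references(a, b, refs):
    """Count the earlier papers that both a and b list as references."""
    return len(refs[a] & refs[b])


def co_citations(a, b, refs):
    """Count the papers whose reference lists contain both a and b."""
    return sum(1 for cited in refs.values() if a in cited and b in cited)


def similarity(a, b, refs):
    """Illustrative score only: an unweighted sum of the two counts (an assumption)."""
    return shared_references(a, b, refs) + co_citations(a, b, refs)


sim_ab = similarity("A", "B", references)  # two shared references, two co-citing papers
sim_ac = similarity("A", "C", references)  # no shared references, one co-citing paper
```

In this toy data, A and B score higher than A and C, so a layout driven by such a score would draw the data points for A and B closer together.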
For example, data points representing research papers in related areas of superconductivity might be close together because their citations probably reference many of the same previous papers.
Superconductivity and computing papers might have some citations in common. These coalesce into separate clusters in the same mountain range.
Biology papers might show none of the same references and be quite distinct in biology’s own mountain range. But some biology papers might have a few physics references. The data point for such a paper then would be near the biology mountains but not in them, forming a step in a trail between some aspects of physics and biology. The more physics references, the closer to physics. The more physics/biology papers that follow similar aspects of the subjects, the more obvious the trail, or ridge, or hill on a year-by-year basis.
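The idea that a paper drifts toward physics as its share of physics references grows can be shown with a small sketch. How the landscape coordinates are actually computed is not described here, so the field centers, the `place_paper` helper, and the weighted-average rule below are hypothetical and serve only as an illustration.

```python
def place_paper(ref_counts, centers):
    """Place a paper at the average of field centers, weighted by its reference counts.

    ref_counts: {field name: number of the paper's references in that field}
    centers:    {field name: (x, y) position of that field's mountain range}
    """
    total = sum(ref_counts.values())
    x = sum(centers[field][0] * n for field, n in ref_counts.items()) / total
    y = sum(centers[field][1] * n for field, n in ref_counts.items()) / total
    return (x, y)


# Hypothetical positions of two mountain ranges on the map.
centers = {"physics": (0.0, 0.0), "biology": (10.0, 0.0)}

pure_biology = place_paper({"biology": 20}, centers)
mostly_biology = place_paper({"biology": 18, "physics": 2}, centers)
half_and_half = place_paper({"biology": 10, "physics": 10}, centers)
```

Under these assumptions the paper with two physics references sits just outside the biology range, and the evenly split paper lands midway along the trail between the two ranges.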
Libraries may choose to use sections of the system because “there will be much more content in even the subsectioned landscape than you can get by using keywords to search for documents,” says Hendrickson.
An early version of the clustering method was developed at ISI under the direction of Small, who generated a pen-and-ink landscape about AIDS research through use of the citation method. “We could see clusters, but it didn’t involve computer visualization, only graphs and charts,” says Small.
Sandia uses a mathematically sophisticated technique to reduce the operating time by making the program’s running time proportional to the number of documents (three million) instead of the square of that number. This made the undoable doable.
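To see why the square of the document count matters, note that comparing every pair among three million papers means roughly 4.5 trillion comparisons. The sketch below does not reproduce Sandia's technique, which is not described here. It shows one common way to avoid the all-pairs loop: index papers by the references they cite, so that only papers sharing at least one reference are ever paired. The function names and toy data are assumptions, and the saving depends on most papers sharing references with only a few others.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(n):
    """Number of distinct pairs among n documents (grows with the square of n)."""
    return n * (n - 1) // 2


def coupled_pairs(references):
    """Return {(a, b): shared reference count} for pairs sharing at least one reference."""
    # Inverted index: cited paper -> list of papers that cite it.
    cited_by = defaultdict(list)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].append(paper)

    # Only papers that appear together under some reference are ever paired.
    counts = defaultdict(int)
    for papers in cited_by.values():
        for a, b in combinations(sorted(papers), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy data: each paper maps to the set of papers it cites.
references = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p9"},
    "D": {"p4"},
}
pairs = coupled_pairs(references)
```

Here four papers have six possible pairs, but only two pairs share a reference, so only those two are ever generated and scored.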