ALBUQUERQUE, N.M. — Data classification, often considered a humdrum task, really is no such thing when the stakes are high enough.
Quick, error-free identifications are needed for chemicals released on a battlefield, explosives found in an airport, or substances collected on interplanetary explorations, where astronomers have no option of a second look. Physicians analyzing complicated medical images, and researchers carrying out certain environmental analyses, also require quick, accurate answers.
That’s why a sophisticated new data classification scheme is being incorporated into the design of Sandia National Laboratories’ hand-held “lab-on-a-chip” chemical sensor system. Sandia is a US Department of Energy national security laboratory.
The classification method — based on human perception rather than mathematical equations — is so simple that it can be hard to grasp for those who expect complexity.
It is based upon the human ability to visually group real-world objects seen near each other, says the technique’s principal developer Gordon Osbourn, a Sandia Labs physicist.
“In the area of visual perception, no computer has ever matched a biological system — for example a dog’s or a two-year-old’s,” says Osbourn.
The dumbbell factor
A person groups by subconsciously superimposing over any two points an invisible shape that resembles a dumbbell, Osbourn says. The subconscious mind sizes the dumbbell so that each bell centers on a point. If no other point intrudes in that space, one considers the two points a group. It’s that simple.
But while biological visual systems are limited to analyzing two-dimensional plots or 3-D patterns, Osbourn’s system offers the opportunity to “see” in many “dimensions,” in effect cross-analyzing patterns among many data sets. While this sounds complicated, all that is happening is that the same empirical judgments made by human eyes are made by a computer program to judge closeness among the points of many groups of data. The relations between data may be too complex for a human to see, but the same empirical process used in human decision-making is followed in the computer program.
“We discovered a way to capture in a software model the way human judgments empirically group patterns in two-dimensional plots or 3-D patterns, so that these judgments can be mathematically applied to high-dimensional data,” says Osbourn.
Because the technique is based on observation of how people empirically group objects they see, it is called VERI, for Visual-Empirical Region of Influence. A patent is expected to be issued in 1999.
Connecting the dots
To demonstrate a relationship between data points, the “dumbbell” program draws lines that, in effect, connect the dots, sometimes in surprising ways.
Says Daniel Carr, a George Mason University professor of applied and engineering statistics who has an interest in visualizing high-dimensional data, “I was using [a different algorithm] to do a grand tour [of gene-expression clustering]. Then I added the connecting lines produced by the VERI algorithm, and the thing seemed to come alive. I saw lots of patterns. I showed it at a conference. There was no question that you suddenly saw structure that wasn’t previously obvious. This is a very effective algorithm.”
One way to visualize the system is to imagine yourself in a spaceship observing a number of oddly shaped solar systems, which may be very close to, intertwine with, or even wrap around each other. Each solar system represents a chemical; each planet in a system represents a data point from that chemical. The incoming chemical to be analyzed enters as a newly observed planet. If this planet resides in any of the solar systems, then it is identified as an example of that system. However, if the newly observed planet resides outside of all the known solar systems, then it is labeled an unknown chemical. The method does more than simply measure which solar system is closest to a newly observed planet — it automatically “sees” if the planet is close enough to any solar system to belong in it.
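The solar-system picture can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not Sandia's implementation: the real VERI template was measured from human subjects and its exact shape is not given in this article, so the code substitutes a hypothetical dumbbell (two round bells joined by a thin bar) whose proportions, BELL_RATIO and BAR_RATIO, are invented for the example. The function names and the toy library of chemicals "A" and "B" are likewise hypothetical. A new sample is assigned to a chemical only if it groups with at least one calibration point of that chemical; otherwise the result is empty, meaning unknown. The sketch assumes each chemical has several calibration points.

```python
import math

# Hypothetical dumbbell proportions (NOT the measured VERI template).
BELL_RATIO = 0.5  # bell radius as a fraction of the pair separation
BAR_RATIO = 0.2   # bar half-width as a fraction of the pair separation


def dist_to_segment(x, p, q):
    """Distance from point x to the line segment p-q, in any dimension."""
    d2 = sum((b - a) ** 2 for a, b in zip(p, q))
    if d2 == 0.0:
        return math.dist(x, p)
    t = sum((xi - a) * (b - a) for xi, a, b in zip(x, p, q)) / d2
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    closest = tuple(a + t * (b - a) for a, b in zip(p, q))
    return math.dist(x, closest)


def intrudes(x, p, q):
    """True if x lies inside the assumed dumbbell drawn around p and q."""
    d = math.dist(p, q)
    in_bell = min(math.dist(x, p), math.dist(x, q)) < BELL_RATIO * d
    in_bar = dist_to_segment(x, p, q) < BAR_RATIO * d
    return in_bell or in_bar


def identify(sample, library):
    """Return the labels the sample groups with; an empty list means unknown."""
    calib = [(label, pt) for label, pts in library.items() for pt in pts]
    matches = set()
    for i, (label, pt) in enumerate(calib):
        if label in matches:
            continue
        # The sample and pt group only if no other calibration point intrudes.
        blocked = any(intrudes(other, sample, pt)
                      for j, (_, other) in enumerate(calib) if j != i)
        if not blocked:
            matches.add(label)
    return sorted(matches)


# Toy calibration library: two "solar systems" of four "planets" each.
library = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    "B": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
}
print(identify((1.5, 0.5), library))  # ['A']
print(identify((5.0, 5.0), library))  # []  (unknown)
```

In this toy data the point (5, 5) is nearer to chemical A than to chemical B, so a plain nearest-match rule would label it A. The region-of-influence test instead reports it as unknown, which is the behavior the article describes.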
A paper, “VERI Pattern Recognition Applied to Chemical Microsensor Array Selection and Chemical Analysis,” was published in the American Chemical Society’s Accounts of Chemical Research, Vol. 31, No. 5, 1998, with Sandia researchers John Bartholomew, Tony Ricco, and Greg Frye as co-authors with Osbourn. Osbourn has also recently presented the VERI concept at an invited talk at a Gordon Conference and at Optical Society of America and American Chemical Society meetings.
Osbourn himself has credentials in complex matters. He is credited with creation of strained-layer superlattices, a novel material used in semiconductor lasers. He won the 1993 American Physical Society’s International Prize for new materials and the Department of Energy’s prestigious E.O. Lawrence Award in 1985.
Deriving the VERI dumbbell template
While the fashion of the last half-century has been to consider human perceptions untrustworthy — in a famous movie, people who see the same murder each describe it differently — Osbourn says society actually is based on the commonality of human perception. Everyone with normal vision recognizes large nearby buildings as such, and is expected to similarly perceive “STOP” signs or traffic lights and react accordingly.
Working from that commonsense basis, Osbourn and colleagues tested 12 subjects (Sandia employees who ranged from 20 to 50 years old) over a five-year period to determine how they grouped points scattered on a graph. Without exception, the responses of the subjects to complex dot patterns could be predicted from their responses to simple, three-dot patterns. The subjects reacted as if putting an invisible shape resembling a dumbbell around each pair of points. Each pair grouped together only if every other point (any potential third point) lay outside the invisible dumbbell, a shape that operated as a region of influence. The researchers found that groupings among many dots are built up one pair at a time.
Because each subset of three points in a complex high-dimensional data set can also be tested in the same way with the dumbbell, the VERI method extends and automates human cluster perception for use in complex data analysis problems. “This approach provides a new way to think about automated pattern recognition,” says Osbourn. “The VERI clusterings provide a direct and automated decision for whether multidimensional patterns match or are distinct.”
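The pair-at-a-time grouping described above can be sketched in Python for points of any dimension. This is an illustration only: the measured VERI template is not given in this article, so the sketch stands in a hypothetical dumbbell made of two round bells joined by a thin bar, with invented proportions BELL_RATIO and BAR_RATIO. All function names are assumptions as well. Each pair of points is linked only if no third point falls inside the dumbbell drawn around that pair, and clusters are then read off as the connected groups of linked points.

```python
import math
from itertools import combinations

# Hypothetical dumbbell proportions (NOT the measured VERI template).
BELL_RATIO = 0.5  # bell radius as a fraction of the pair separation
BAR_RATIO = 0.2   # bar half-width as a fraction of the pair separation


def dist_to_segment(x, p, q):
    """Distance from point x to the line segment p-q, in any dimension."""
    d2 = sum((b - a) ** 2 for a, b in zip(p, q))
    if d2 == 0.0:
        return math.dist(x, p)
    t = sum((xi - a) * (b - a) for xi, a, b in zip(x, p, q)) / d2
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    closest = tuple(a + t * (b - a) for a, b in zip(p, q))
    return math.dist(x, closest)


def intrudes(x, p, q):
    """True if x lies inside the assumed dumbbell drawn around p and q."""
    d = math.dist(p, q)
    in_bell = min(math.dist(x, p), math.dist(x, q)) < BELL_RATIO * d
    in_bar = dist_to_segment(x, p, q) < BAR_RATIO * d
    return in_bell or in_bar


def grouped_pairs(points):
    """Index pairs (i, j) whose dumbbell contains no third point."""
    n = len(points)
    return [(i, j) for i, j in combinations(range(n), 2)
            if not any(intrudes(points[k], points[i], points[j])
                       for k in range(n) if k != i and k != j)]


def clusters(points):
    """Merge grouped pairs into clusters, one pair at a time (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in grouped_pairs(points):
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Six points in three dimensions: two short chains separated by a wide gap.
pts = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (12, 0, 0)]
print(grouped_pairs(pts))  # [(0, 1), (1, 2), (3, 4), (4, 5)]
print(clusters(pts))       # [[0, 1, 2], [3, 4, 5]]
```

Because the assumed dumbbell scales with the distance between the two points, no fixed distance threshold is needed: the long pair spanning the gap is rejected simply because each endpoint's near neighbors fall inside its bell.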
The VERI method is a powerful new alternative to conventional mathematical approaches, especially for high-consequence applications. One advantage of VERI is the ability to “see” patterns in data with arbitrarily complex distribution shapes. “Conventional methods often require that real-world data look like widely-separated, compact Gaussian distributions to work properly, yet modern chemical sensor arrays can produce data for different chemicals that resemble intertwined snakes or tangles of spaghetti,” says Osbourn.
“This distinction is important for avoiding false alarms in unexpected or uncontrolled field conditions, for example alarms triggered by diesel fumes or fertilizer in sensor systems that are intended to alert soldiers to chemical warfare attacks,” he says. “VERI provides a complete treatment of the complexity of real-world data distributions that may be essential when human lives are at stake. Another advantage of VERI is that it can discover the simplest and most effective set of sensors to use in the design of hand-held sensor systems — the so-called electronic noses.”
Describing a benchmark study published in the Journal of Pattern Recognition in 1995 with Sandia researcher Rubel Martinez, Osbourn says, “We culled 25 patterns in computer science literature that caused problems for clustering algorithms. Ours outperformed all commercial clustering algorithms. VERI is one of the best recognizers of cluster patterns because it is based on our discovery of how to quantitatively mimic biological clustering-based pattern recognition performance.”
Works with noisy signals
The computer program works well with imperfect sensor signals coming from what may be an electronically “noisy” field to pinpoint a gas’s identity. Sometimes the gas’s characteristics may not be located in the program’s library of known substances — that is, the gas is unknown — yet the classification system is savvy enough not to create a false alarm by assigning it a category that happens to be the closest match.
VERI also minimizes the power needs, size, and weight of a hand-held unit by informing its users of the smallest number of sensors necessary for a particular job. The imperfect signals it tolerates include those degraded because a sensor has aged.
“The system deals well with false alarms,” says Sandia sensor researcher Tony Ricco. “If something is unknown and has never been calibrated, it will tell you it’s unknown. You don’t want false alarms when you’re screening for explosives. When [the classification system] sees a new chemical species it’s never seen before, it just says it’s unknown. And it’s likely that in planetary explorations, you’ll come across combinations of chemicals you’ve never seen before. False alarms when soldiers’ lives are at stake on a battlefield are unacceptable.”
Technical contact:
Gordon Osbourn, gcosbou@sandia.gov, (505) 844-8850