Using a sophisticated computer algorithm, a team of scientists at the Whitehead Institute has designed a new technique to analyze the massive amounts of data generated by DNA microarrays, also known as DNA chips. This technique will help scientists decipher how our estimated 100,000 genes work together to keep us healthy and how diseases result when they fail.
"DNA arrays have revolutionized DNA analysis by allowing us to observe the activities of thousands of genes simultaneously," said Todd Golub, a research scientist at the Whitehead/MIT Center for Genome Research. "But until now, it's been really difficult to interpret this extraordinarily complex raw data. Our technique is among the first in a new generation of tools that will speed up the analysis of the enormous amounts of genetic data emerging from laboratories worldwide."
He and his colleagues at Whitehead, Dana-Farber Cancer Institute, Dartmouth Medical School and MIT (including Professor of Biology Eric S. Lander, director of the Whitehead/MIT Center for Genome Research) reported their technique in the March 16 issue of the Proceedings of the National Academy of Sciences.
"The core of the technique is an algorithm, called a self-organizing map (SOM), that takes advantage of the fact that many genes in a cell behave similarly," explained Pablo Tamayo, lead author of the paper and a Whitehead research scientist. "Instead of having 2,000 individual genes, all doing different things, you might have 25 groups of genes doing similar things."
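To make the grouping idea concrete, here is a minimal sketch of a self-organizing map in Python. It is not the Genecluster code: the function names, the linear decay schedule, the Gaussian neighborhood and the made-up expression values are all assumptions chosen for illustration. Each node on a small grid holds a prototype expression pattern; training repeatedly picks a gene, finds the closest node, and nudges that node and its grid neighbors toward the gene's pattern.

```python
import math
import random


def normalize(profile):
    """Scale one expression profile to mean 0 and variance 1."""
    mean = sum(profile) / len(profile)
    var = sum((x - mean) ** 2 for x in profile) / len(profile)
    std = math.sqrt(var) or 1.0  # a flat profile becomes all zeros
    return [(x - mean) / std for x in profile]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_node(profile, nodes):
    """Index of the node whose prototype is closest to the profile."""
    return min(range(len(nodes)), key=lambda i: sq_dist(profile, nodes[i]))


def train_som(profiles, rows, cols, iterations=2000, seed=0):
    """Fit a rows x cols map; returns one prototype vector per node."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: start every node at a randomly chosen profile.
    nodes = [list(rng.choice(profiles)) for _ in grid]
    max_radius = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = 1.0 - t / iterations        # shrinks linearly toward 0
        rate = 0.5 * frac + 0.01           # learning rate (assumed schedule)
        radius = max_radius * frac + 0.5   # neighborhood width on the grid
        p = rng.choice(profiles)
        wr, wc = grid[nearest_node(p, nodes)]
        for i, (r, c) in enumerate(grid):
            grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
            influence = math.exp(-grid_d2 / (2 * radius ** 2))
            nodes[i] = [w + rate * influence * (x - w)
                        for w, x in zip(nodes[i], p)]
    return nodes


# Made-up expression levels at four time points: three genes that rise
# and three that fall. Real microarray data would have thousands of rows.
rising = [[0, 1, 2, 3], [1, 2, 3, 4.5], [0, 2, 3, 5]]
falling = [list(reversed(p)) for p in rising]
profiles = [normalize(p) for p in rising + falling]

nodes = train_som(profiles, rows=1, cols=2)
clusters = [nearest_node(p, nodes) for p in profiles]
print(clusters)  # one cluster index per gene
```

With a larger grid, for example 5 by 5, the same code would yield the kind of "25 groups" summary described above, with neighboring nodes tending to hold similar patterns.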
Dr. Tamayo compared the final product of the SOM to an executive summary for CEOs. Rather than having to read every page of a 1,000-page report, CEOs can get an overview of the report by simply reading the summary. "It's impossible to visually inspect every gene," he said. "This method produces a quick scan of what's going on with thousands of genes."
The researchers created a computer package called Genecluster, which organizes the activities of thousands of genes in only minutes. To test Genecluster, they analyzed the genes expressed in several models of leukemia cell growth. In many cases, the algorithm identified genes known to be important in this process, but occasionally it also identified unexpected genes.
This finding suggests that the method might be useful in helping to identify the function of unknown genes. "Because genes that have similar functions are generally expressed in the same basic pattern, knowing the expression pattern of a gene could help identify its function," said Dr. Tamayo.
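The reasoning here, that a shared expression pattern hints at a shared function, can be illustrated with a small Python sketch. This is not the method from the paper: it simply ranks annotated genes by Pearson correlation with an unannotated one. The gene names and expression values are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (spread_a * spread_b)


def most_similar(query, annotated):
    """Return (name, correlation) of the annotated gene best matching query."""
    name = max(annotated, key=lambda g: pearson(query, annotated[g]))
    return name, pearson(query, annotated[name])


# Hypothetical expression levels at four time points.
annotated = {
    "cell_cycle_gene": [1.0, 3.0, 1.0, 3.0],
    "differentiation_gene": [1.0, 2.0, 4.0, 8.0],
}
unknown = [0.5, 1.1, 2.0, 4.1]

best, r = most_similar(unknown, annotated)
print(best, round(r, 3))
```

A high correlation is only a lead for follow-up experiments, not proof that the two genes share a function.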
SOMs have been used widely in data mining, particularly for large or messy datasets like stock market data, but this study is the first to apply them to gene analysis.
The study was supported in part by a consortium of three companies--Bristol-Myers Squibb Company, Affymetrix, Inc., and Millennium Pharmaceuticals Inc.--that have funded a five-year research program in functional genomics at the Whitehead/MIT Center for Genome Research. It was also supported by grants from the National Institutes of Health.
A version of this article appeared in the April 7, 1999 issue of MIT Tech Talk (Volume 43, Number 25).