There is a “deluge of information” at the intersection of biology and computation, says Gaurav Bhatia PhD '14, a graduate of the Harvard-MIT Division of Health Sciences and Technology (HST). Bhatia, however, seeks to “push complexity away, and solve problems in the simplest incarnation possible.” Using newly developed statistical methods and computer models, he is doing just that with some of modern biology’s largest and most challenging datasets: those involving the human genome.
By his own admission, Bhatia took “the scenic route” to his academic specialty of bioinformatics and integrative genomics. As an undergraduate at the University of California at San Diego, Bhatia was part of a special program preparing students for medical school. But after a few courses in computer science, he was hooked by the “straightforward power” of programming. He dropped his bioengineering major, and with it the goal of becoming a doctor, to pursue computer science through a master’s degree and as an engineer at a Web startup.
But a bioinformatics class triggered a change of heart. While investigating natural selection and the human genome through a statistical lens, Bhatia says, “I realized this is what I wanted to be doing.” He then enrolled in HST in 2009.
He came to the right place at the right time: The Broad Institute, Harvard University, and MIT stood at the center of genomics research as key players in international projects including HapMap and the 1000 Genomes Project. Aided by recent advances in DNA sequencing technology, these institutions and their partners were producing monumental, publicly accessible datasets on human genetic variation. With these massive information troves immediately at hand, Bhatia was soon immersed in the deluge, which was just the kind of analytic challenge he was looking for.
He found an inspiring mentor and research partner in Alkes Price, an assistant professor of statistical genetics at the Harvard School of Public Health and a member of the Broad Institute. Price's work focused on developing statistical methods for probing population genetics and revealing the genetic basis of human disease — a ripe area of investigation given the newly available wealth of fine-grained genetic data. Bhatia seized the opportunity to apply his computational modeling and programming expertise.
One key question Bhatia took on (and developed for his dissertation) was a curious discrepancy between two estimates of the genetic distance between human populations: one derived from HapMap (a catalog of common human genetic variation used to find genes associated with disease) and one from the 1000 Genomes Project. This measure of population differentiation, known as the fixation index (FST), represents the proportion of genetic variance attributable to differences between populations. Comparing genetic samples of West African and European populations, for example, the 1000 Genomes Project's FST numbers suggested much greater genetic similarity than HapMap’s estimates. “Everybody thought that was odd,” Bhatia says. “So we set out to make accurate estimates.”
Bhatia designed a statistical protocol for estimating FST across the two very different consortia datasets: HapMap information includes 1,300 DNA samples yielding genetic variants from 11 discrete populations; the 1000 Genomes Project's collection contains the nearly complete genomes of 1,000 different people, revealing less common genetic variants. Bhatia’s research demonstrated that statistical analyses giving rare genetic variants too much weight were likely skewing FST calculations, which accounted for the differing estimates of genetic distance.
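How the weighting of rare variants can move an FST estimate is easier to see with numbers. The sketch below is an illustration only, not the protocol Bhatia developed: it assumes a simple per-variant definition of FST for two populations with known allele frequencies, ignores sample-size corrections, and uses made-up frequencies. The function names are hypothetical. It contrasts two ways of combining variants: averaging the per-variant ratios, which gives every variant equal weight, and dividing the summed numerators by the summed denominators, which lets common variants dominate.

```python
def fst_terms(p1, p2):
    """Numerator and denominator of a simple per-variant FST.

    p1 and p2 are allele frequencies of the same variant in two populations.
    Assumption: frequencies are treated as exact (no sample-size correction).
    """
    numerator = (p1 - p2) ** 2
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator, denominator


def average_of_ratios(freqs):
    """Compute FST per variant, then take the unweighted mean."""
    ratios = []
    for p1, p2 in freqs:
        numerator, denominator = fst_terms(p1, p2)
        if denominator > 0:  # skip variants absent from both populations
            ratios.append(numerator / denominator)
    return sum(ratios) / len(ratios)


def ratio_of_averages(freqs):
    """Sum numerators and denominators across variants, then divide once."""
    terms = [fst_terms(p1, p2) for p1, p2 in freqs]
    total_numerator = sum(n for n, _ in terms)
    total_denominator = sum(d for _, d in terms)
    return total_numerator / total_denominator


# Made-up allele frequencies, for illustration only.
common_variants = [(0.50, 0.20), (0.60, 0.30)]
rare_variants = [(0.010, 0.005), (0.020, 0.010), (0.005, 0.010)]
all_variants = common_variants + rare_variants

print("common only, ratio of averages:", ratio_of_averages(common_variants))
print("all variants, ratio of averages:", ratio_of_averages(all_variants))
print("all variants, average of ratios:", average_of_ratios(all_variants))
```

In this toy example, adding the rare variants barely changes the ratio-of-averages figure but pulls the average-of-ratios figure sharply downward, making the two populations look more similar than the common variants alone would suggest.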
Bhatia also perceived that “confusion about statistical methods” might generate “misleading conclusions.” In another paper, he sought to put on a “more rigorous statistical footing” a much-publicized 2012 study claiming that African-American populations have been subject to natural selection since their ancestors arrived in the United States 300 years ago. This study compared DNA segments from a sample of contemporary African-Americans, who have a combination of African and European ancestry, to those from a sample of current West Africans. Researchers suggested that variant genes, perhaps those protecting against such diseases as influenza, conferred genetic fitness on some African-American populations in a relatively short amount of time.
To set things right, Bhatia, Price, and colleagues scanned the genomes of nearly 30,000 African-Americans, looking for specific DNA sequences that deviated significantly from those in the genomes of West Africans. With a sample 15 times larger than the original study's, Bhatia’s team found no genome-wide significant deviations between the two populations, and suggested the initial study had a “low threshold” for establishing evidence of natural selection in America.
“Real data is messy, and we all have to be careful,” Bhatia notes. “When we make small statistical mistakes, especially when analyzing a big dataset, we can easily draw the wrong conclusions.” Conversely, the “robust estimates” of natural selection and genetic distance that nuanced statistical models make possible are powerful tools for answering fundamental questions about population genetics and advancing public health, Bhatia says.
With his quantitative toolkit in hand, Bhatia envisions “a pivot toward medical genetics.” Now a postdoc at the Harvard School of Public Health, he hopes to explore rare genetic variation in such diseases as schizophrenia, using the statistical protocols he has been developing.
“With this type of work, I’m confident that what we and others do will change medicine,” Bhatia says. “In the next 10 years, genetics will move to the center of medical practice. People will live longer and fuller lives because of this work.”