A team of more than 2,800 scientists, including several from MIT, has published its scientific description of the finished human genome sequence, reducing its estimate of the number of human protein-coding genes from 35,000 to only 20,000-25,000, a surprisingly low number for our species.
In the Oct. 21 issue of Nature, researchers with the International Human Genome Sequencing Consortium describe the final product of the Human Genome Project, the 13-year effort to read the information encoded in the human chromosomes. One of the central goals of the effort was to identify all genes, which are generally defined as stretches of DNA that code for particular proteins.
The Nature paper provides rigorous scientific evidence that the genome sequence produced by the Human Genome Project has both the high coverage and accuracy needed to perform sensitive analyses, such as those focusing on the number of genes, segmental duplications involved in disease, and the "birth" and "death" of genes over the course of evolution.
"The human genome sequence far exceeds our expectations in terms of accuracy, completeness and continuity. It reflects the dedication of hundreds of scientists working together toward a common goal: creating a solid foundation for biomedicine in the 21st century," said Eric Lander, director of the Broad Institute of MIT and Harvard and a professor in MIT's Department of Biology.
Francis S. Collins, director of the National Human Genome Research Institute (NHGRI), said, "Only a decade ago, most scientists thought humans had about 100,000 genes. When we analyzed the working draft of the human genome sequence three years ago, we estimated there were about 30,000 to 35,000 genes, which surprised many. This new analysis reduces that number even further and provides us with the clearest picture yet of our genome." In the United States, the International Human Genome Sequencing Consortium is led by NHGRI and the Department of Energy (DOE).
The Nature paper also provides the scientific community with a peer-reviewed description of the finishing process and an assessment of the quality of the finished human genome sequence. The assessment confirms that the finished sequence now covers more than 99 percent of the euchromatic (or gene-containing) portion of the human genome and was sequenced to an accuracy of 99.999 percent, which is 10 times more accurate than the original goal.
"Finished" doesn't mean that the human genome sequence is perfect. There still remain 341 gaps in the sequence, in contrast to the 150,000 gaps in the working draft announced in June 2000. The technology now available can't readily close these gaps; doing so will require more research and new technologies.
The human genome sequence and its annotations can be accessed through several public genome browsers, including GenBank at the National Center for Biotechnology Information.
The International Human Genome Sequencing Consortium includes scientists from 20 institutions in six countries. The five largest sequencing centers are located at Baylor College of Medicine, the Broad Institute of MIT and Harvard, DOE's Joint Genome Institute, Washington University School of Medicine, and the Wellcome Trust Sanger Institute.
A version of this article appeared in MIT Tech Talk on November 10, 2004.