Human Genome Project: Big Science PDF

Hood and Rowen Genome Medicine 2013, 5:79
http://genomemedicine.com/content/5/9/79

OPINION

The Human Genome Project: big science transforms biology and medicine

Leroy Hood* and Lee Rowen*

Abstract

The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called ‘big science’ - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project.

Origins of the human genome project

The Human Genome Project (HGP) has profoundly changed biology and is rapidly catalyzing a transformation of medicine [1-3]. The idea of the HGP was first publicly advocated by Renato Dulbecco in an article published in 1984, in which he argued that knowing the human genome sequence would facilitate an understanding of cancer. In May 19...

* Correspondence: Leroy Hood, Lee Rowen. Seattle, WA 98109, USA

BioMed Central Ltd.
[…] University in St Louis, the Joint Genome Institute, and the Whole Genome Laboratory at Baylor College of Medicine) emerged from this effort, with these five centers continuing to provide genome sequence and technology development. The HGP also fostered the development of mathematical, computational and statistical tools for handling all the data it generated.

The HGP produced a curated and accurate reference sequence for each human chromosome, with only a small number of gaps, and excluding large heterochromatic regions. In addition to providing a foundation for subsequent studies in human genomic variation, the reference sequence has proven essential for the development and subsequent widespread use of second-generation sequencing technologies, which began in the mid-2000s. Second-generation cyclic array sequencing platforms produce, in a single run, up to hundreds of millions of short reads (originally approximately 30 to 70 bases, now up to several hundred bases), which are typically mapped to a reference genome at highly redundant coverage. A variety of cyclic array sequencing strategies (such as RNA-Seq, ChIP-Seq, bisulfite sequencing) have significantly advanced biological studies of transcription and gene regulation as well as genomics, progress for which the HGP paved the way.

Impact of the human genome project on biology and technology

First, the human genome sequence initiated the comprehensive discovery and cataloguing of a ‘parts list’ of most human genes [16,17], and by inference most human proteins, along with other important elements such as non-coding regulatory RNAs. Understanding a complex biological system requires knowing the parts, how they are connected, their dynamics and how all of these relate to function. The parts list has been essential for the emergence of ‘systems biology’, which has transformed our approaches to biology and medicine [21,22].
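To make concrete the ‘highly redundant coverage’ at which second-generation short reads are mapped to a reference genome, the sketch below computes mean sequencing depth as total sequenced bases divided by genome length, together with the standard Poisson estimate of the fraction of bases that no read covers. It is a minimal illustration, not a method described in this article: the function names are hypothetical, and the read count, read length and genome size are round illustrative values assumed for the example.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth: total sequenced bases divided by the genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Poisson approximation: probability that a given base is hit by zero reads."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return math.exp(-coverage)


# Illustrative values only (assumptions, not figures from the article):
# 300 million reads of 100 bases against a genome of 3 billion bases.
depth = mean_coverage(num_reads=300_000_000, read_length=100, genome_size=3_000_000_000)
print(f"mean coverage: {depth:.1f}x")
print(f"expected fraction of bases with no read: {fraction_uncovered(depth):.2e}")
```

The Poisson estimate assumes reads land uniformly at random, which real libraries only approximate, so it is best read as an optimistic lower bound on the uncovered fraction.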
As an example, the ENCODE (Encyclopedia Of DNA Elements) Project, launched by the NIH in 2003, aims to discover and understand the functional parts of the genome. Using multiple approaches, many based on second-generation sequencing, the ENCODE Project Consortium has produced voluminous and valuable data related to the regulatory networks that govern the expression of genes. Large datasets such as those produced by ENCODE raise challenging questions regarding genome functionality. How can a true biological signal be distinguished from the inevitable biological noise produced by large datasets [25,26]? To what extent is the functionality of individual genomic elements only observable (used) in specific contexts (for example, regulatory networks and mRNAs that are operative only during embryogenesis)? It is clear that much work remains to be done before the […]

[…] fostering a more cross-disciplinary culture [1,21,38]. It is important to note that the HGP popularized the idea of making data available to the public immediately in user-friendly databases such as GenBank and the UCSC Genome Browser. Moreover, the HGP also promoted the idea of open-source software, in which the source code of programs is made available to and can be edited by those interested in extending their reach and improving them [41,42]. The open-source operating system of Linux and the community it has spawned have shown the power of this approach. Data accessibility is a critical concept for the culture and success of biology in the future because the ‘democratization of data’ is critical for attracting available talent to focus on the challenging problems of biological systems with their inherent complexity. This will be even more critical in medicine, as scientists need access to the data cloud available from each individual human to mine for the predictive medicine of the future - an effort that could transform the health of our children and grandchildren.
Fifth, the HGP, as conceived and implemented, was the first example of ‘big science’ in biology, and it clearly demonstrated both the power and the necessity of this approach for dealing with its integrated biological and technological aims. The HGP was characterized by a clear set of ambitious goals and plans for achieving them; a limited number of funded investigators typically organized around centers or consortia; a commitment to public data/resource release; and a need for significant funding to support project infrastructure and new technology development. Big science and smaller-scope individual-investigator-oriented science are powerfully complementary, in that the former generates resources that are foundational for all researchers while the latter adds detailed experimental clarification of specific questions, and analytical depth and detail to the data produced by big science. There are many levels of complexity in biology and medicine; big science projects are essential to tackle this complexity in a comprehensive and integrative manner.

The HGP benefited biology and medicine by creating a sequence of the human genome; sequencing model organisms; developing high-throughput sequencing technologies; and examining the ethical and social issues implicit in such technologies. It was able to take advantage of economies of scale and the coordinated effort of an international consortium with a limited number of players, which rendered the endeavor vastly more efficient than would have been possible if the genome were sequenced on a gene-by-gene basis in small labs. It is also worth noting that one aspect that attracted governmental support to the HGP was its potential for economic benefits. The Battelle Institute published a report on the economic impact of the HGP. For an initial investment of […]

[…] determining whether these hits reflect the mis-functioning of regulatory elements.
The question as to what fraction of the thousands of GWAS hits are signal and what fraction are noise is a concern. Pedigree-based whole-genome sequencing offers a powerful alternative approach to identifying potential disease-causing variants. Five years ago, a mere handful of personal genomes had been fully sequenced (for example, [53,54]). Now there are thousands of exome and whole-genome sequences (soon to be tens of thousands, and eventually millions), which have been determined with the aim of identifying disease-causing variants and, more broadly, establishing well-founded correlations between sequence variation and specific phenotypes. For example, the International Cancer Genome Consortium and The Cancer Genome Atlas are undertaking large-scale genomic data collection and analyses for numerous cancer types (sequencing both the normal and cancer genome for each individual patient), with a commitment to making their resources available to the research community.

We predict that individual genome sequences will soon play a larger role in medical practice. In the ideal scenario, patients or consumers will use the information to improve their own healthcare by taking advantage of prevention or therapeutic strategies that are known to be appropriate for real or potential medical conditions suggested by their individual genome sequence. Physicians will need to educate themselves on how best to advise patients who bring consumer genetic data to their appointments, which may well be a common occurrence in a few years.

In fact, the application of systems approaches to disease has already begun to transform our understanding of human disease and the practice of healthcare and push us towards a medicine that is predictive, preventive, personalized and participatory: P4 medicine. A key assumption of P4 medicine is that in diseased tissues biological networks become perturbed - and change dynamically with the progression of the disease.
Hence, knowing how the information encoded by disease-perturbed networks changes provides insights into disease mechanisms, new approaches to diagnosis and new strategies for therapeutics [58,59].

Let us provide some examples. First, pharmacogenomics has identified more than 70 genes for which specific variants cause humans to metabolize drugs ineffectively (too fast or too slow). Second, there are hundreds of ‘actionable gene variants’ - variants that cause disease but whose consequences can be avoided by available medical strategies with knowledge of their presence. Third, in some cases, cancer-driving mutations in tumors, once identified, can be counteracted by treatments with currently available drugs. And last, a systems approach to blood protein diagnostics has generated powerful new […]

[…] information relates to functionally and evolutionarily […] will be important. Developing the ability to rapidly analyze complete human genomes with regard to actionable gene variants is essential. It is also essential to develop software that can accurately fold genome-predicted proteins into three dimensions, so that their functions can be predicted from structural homologies. Likewise, it will be fascinating to determine whether we can make predictions about the structures of biological networks directly from the information of their cognate genomes. Indeed, the idea that we can decipher the ‘logic of life’ of an organism solely from its genome sequence is intriguing. While we have become relatively proficient at determining static and stable genome sequences, we are still learning how to measure and interpret the dynamic effects of the genome: gene expression and regulation, as well as the dynamics and functioning of non-coding RNAs, metabolites, proteins and other products of genetically encoded information.
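The call above to rapidly analyze complete genomes with regard to actionable gene variants can be pictured, in its simplest form, as a lookup of an individual's called variants against a curated table. The sketch below is only an illustration under stated assumptions: the `Variant` record, the `find_actionable` helper and every entry in the table are hypothetical placeholders, not real clinical annotations, and real pipelines must also handle variant normalization, genotype, and evidence grading.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Variant:
    """A hypothetical minimal variant record (frozen so it can be a dict key)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def find_actionable(
    called: Iterable[Variant], actionable: Dict[Variant, str]
) -> List[Tuple[Variant, str]]:
    """Return (variant, note) pairs for called variants found in the lookup table."""
    return [(v, actionable[v]) for v in called if v in actionable]


# Placeholder table: coordinates and notes are invented for illustration only.
ACTIONABLE = {
    Variant("chr1", 1000, "A", "G"): "placeholder: adjust drug dose",
    Variant("chr2", 2000, "C", "T"): "placeholder: enhanced screening",
}

# Placeholder variant calls for one individual.
genome_calls = [
    Variant("chr1", 1000, "A", "G"),
    Variant("chr3", 3000, "G", "A"),
]

hits = find_actionable(genome_calls, ACTIONABLE)
for variant, note in hits:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alt} -> {note}")
```

A dictionary keyed on the full variant keeps each lookup constant-time, so the scan stays linear in the number of called variants even for whole-genome call sets.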
The HGP, with its focus on developing the technology to enumerate a parts list, was critical for launching systems biology, with its concomitant focus on high-throughput ‘omics’ data generation and the idea of ‘big data’ in biology [21,38]. The practice of systems biology begins with a complete parts list of the information elements of living organisms (for example, genes, RNAs, proteins and metabolites). The goals of systems biology are comprehensive yet open ended because, as seen with the HGP, the field is experiencing an infusion of talented scientists applying multidisciplinary approaches to a variety of problems. A core feature of systems biology, as we see it, is to integrate many different types of biological information to create the ‘network of networks’ - recognizing that networks operate at the genomic, the molecular, the cellular, the organ, and the social network levels, and that these are integrated in the individual organism in a seamless manner. Integrating these data allows the creation of models that are predictive and actionable for particular types of organisms and individual patients. These goals require developing new types of high-throughput omic technologies and ever increasingly powerful analytical tools.

The HGP infused a technological capacity into biology that has resulted in enormous increases in the range of research, for both big and small science. Experiments that were inconceivable 20 years ago are now routine, thanks to the proliferation of academic and commercial wet lab and bioinformatics resources geared towards facilitating research. In particular, rapid increases in throughput and accuracy of the massively parallel second-generation sequencing platforms with their correlated decreases in cost of sequencing have resulted in a great wealth of accessible genomic and transcriptional sequence data for myriad microbial, plant and animal genomes.
These data in turn have enabled large- and […]

[…] transcript, and different start and termination sites. Last, it is exciting to contemplate that the ability to parallelize this process (for example, by generating millions of nanopores that can be used simultaneously) could enable the sequencing of a human genome in 15 minutes or less. The high-throughput nature of this sequencing may eventually lead to human genome costs of $100 or under. The interesting question is how long it will take to make third-generation sequencing a mature technology.

The HGP has thus opened many avenues in biology, medicine, technology and computation that we are just beginning to explore.

Abbreviations

BAC: Bacterial artificial chromosome; DOE: Department of Energy; ELISA: Enzyme-linked immunosorbent assay; GWAS: Genome-wide association studies; HGP: Human Genome Project; NIH: National Institutes of Health; SNP: Single nucleotide polymorphism; UCSC: University of California, Santa Cruz.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The authors gratefully acknowledge support from the Luxembourg Centre for Systems Biomedicine and the University of Luxembourg; from the NIH, through award 2P50GM076547-06A; and the US Department of Defense (DOD), through award W911SR-09-C-0062. LH receives support from NIH P01 NS041997; 1U54CA151819-01; and DOD awards W911NF-10-2-0111 and W81XWH-09-1-0107.

Published: 13 September 2013

References

1. Hood L: Acceptance remarks for Fritz J. and Delores H. Russ Prize. The Bridge 2011, 41:46–49.
2. Collins FS, McKusick VA: Implications of the Human Genome Project for medical science. JAMA 2001, 285:540–544.
3. Green ED, Guyer MS, National Human Genome Research Institute: Charting a course for genomic medicine from base to bedside. Nature 2011, 470:204–213.
4. Dulbecco R: A turning point in cancer research: sequencing the human genome. Science 1984, 231:1055–1056.
5. Sinsheimer RL: The Santa Cruz workshop - May 1985. Genomics 1989, 5:954–956.
6. Cook-Deegan RM: The Gene Wars: Science, Politics and the Human Genome. New York: WW Norton; 1994.
7. Report on the Human Genome Initiative for the Office of Health and Environmental Research. http://www.ornl.gov/sci/techresources/Human_Genome/project/herac2.shtml.
8. National Academy of Science: Report of the Committee on Mapping and Sequencing the Human Genome. Washington DC: National Academy Press; 1988.
9. Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 2004, 431:931–945.
10. Understanding Our Genetic Inheritance. The United States Human Genome Project, The First Five Years: Fiscal Years 1991–1995. http://www.genome.gov/10001477.
11. Collins FS, Galas D: A new five-year plan for the U.S. Human Genome Program. Science 1993, 262:43–46.
12. Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, Hood LE: Fluorescence detection in automated DNA sequence analysis. Nature 1986, 321:674–679.
13. Church G, Kieffer-Higgins S: Multiplex DNA sequencing. Science 1988, 240:185–188.
[…]
44. Knoppers BM, Harris JR, Tasse AM, Budin-Ljosne I, Kaye J, Deschenes M, Zawati M: Towards a data-sharing Code of Conduct for international genomic research. Genome Med 2011, 3:46.
45. Hood L: Biological complexity under attack: a personal view of systems biology and the coming of “big science”. Genet Eng Biotechnol News 2011, 31:17.
46. Tripp S, Grueber M: Economic Impact of the Human Genome Project. Columbus: Battelle Memorial Institute; 2011.
47. International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437:1299–1320.
48. The International HapMap3 Consortium: Integrating common and rare genetic variation in diverse human populations. Nature 2010, 467:52–58.
49. Abbott A: Neuroscience: solving the brain. Nature 2013, 499:272–274.
50. The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 2012, 491:56–65.
51. A Catalog of Published Genome-wide Association Studies. http://www.genome.gov/gwastudies/.
52. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ: Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 2010, 328:636–639.
53. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol 2007, 5:e254.
54. Wheeler DA, Srinivasian M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y-J, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song X, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature 2008, 452:872–876.
55. International Cancer Genome Consortium. http://icgc.org/.
56. The Cancer Genome Atlas. http://cancergenome.nih.gov/.
57. Pandey A: Preparing for the 21st century patient. JAMA 2013, 309:1471–1472.
58. Hood L, Flores M: A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory. Nat Biotechnol 2012, 29:613–624.
59. Price ND, Edelman LB, Lee I, Yoo H, Hwang D, Carlson G, Galas DJ, Heath JR, Hood L: Systems biology and the emergence of systems medicine. In Genomic and Personalized Medicine: From Principles to Practice. Volume 1. Edited by Ginsburg G, Willard H. Philadelphia: Elsevier; 2009:131–141.
60. Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, McGuire A, Nussbaum RL, O’Daniel JM, Ormond KE, Rehm HL, Watson MS, Williams MS, Biesecker LG: ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing. Bethesda: American College of Medical Genetics and Genomics; 2013.
61. Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet 2010, 11:685–696.
62. Qin S, Zhou Y, Lok AS, Tsodikov A, Yan X, Gray L, Yuan M, Moritz RL, Galas D, Omenn GS, Hood L: SRM targeted proteomics in search for biomarkers of HCV-induced progression of fibrosis to cirrhosis in HALT-C patients. Proteomics 2012, 12:1244–1252.
63. Li X-J, Hayward C, Fong P-Y, Dominguez M, Hunsucker SW, Lee LW, McClean M, Law S, Butler H, Schirm M, Gingras O, Lamontague J, Allard R, Chelsky D, Price ND, Lam S, Massion PP, Pass H, Rom WN, Vachani A, Fang KC, Hood L, Kearney P: A blood-based proteomic classifier for the molecular characterization of pulmonary nodules. Sci Transl Med, in press.
64. Knoppers BM, Thorogood A, Chadwick R: The Human Genome Organisation: towards next-generation ethics. Genome Med 2013, 5:38.
65. Hood L: Who we are: the book of life. Commencement Address. In Whitman College Magazine 2002:4–7.
66. Foster MW, Sharp RR: Beyond race: towards a whole-genome perspective on human populations and genetic variation. Nat Rev Genet 2004, 5:790–796.
67. Royal CDM, Dunston GM: Changing the paradigm from ‘race’ to human genetic variation. Nat Genet 2004, 36:S5–S7.
68. Witherspoon DJ, Wooding S, Rogers AR, Marchani EE, Watkins WS, Batzer MA, Jorde LB: Genetic similarities within and between populations. Genetics 2007, 176:351–359.
[…]
