Panorama of Life - Chapter 3 - Introduction to Genome Browsers PDF
Summary
This document provides an introduction to genome browsers. It details how to use three key browsers: UCSC, NCBI and Ensembl. You'll also learn about variants and genomic data, enabling you to navigate and analyze large sets of genomic data.
Full Transcript
Panorama of Life Chapter 3: 88-97, 99-101, 104 (Box 3.4), 108-114

Forward versus Reverse strands: Lab 3 -- Variants

Forward (+) or reverse (-) strand?
-AAACATGGG-

Forward (+) or reverse (-) strand?
5'-AAACATGGG-3'

Forward (+) or reverse (-) strand?
5'-AAACATGGG-3'
3'-TTTGTACCC-5'

Forward (+) or reverse (-) strand?
5'-AAACATGGG-3'
   |||||||||
3'-TTTGTACCC-5'

Forward (+) or reverse (-) strand?
5'-CCCATGTTT-3'
   |||||||||
3'-GGGTACAAA-5'

Forward (+) or reverse (-) strand?
3'-TTTGTACCC-5'
   |||||||||
5'-AAACATGGG-3'

How do we get strand?
--> 5'-AAACATGGG-3'   + strand or forward strand
       |||||||||
    3'-TTTGTACCC-5'

UCSC shows forward by default
--> 5'-AAACATGGG-3'   + strand or forward strand

Which strand is variant on?
       123456789
--> 5'-AAACATGGG-3'   + strand or forward strand
Variant at: chr1:5 A -> G

What is reference/alternate codon/AA?
       123456789
--> 5'-AAACATGGG-3'   + strand or forward strand
Variant at: chr1:5 A -> G
Ref: CAT His2
Alt: CGT Arg2

Now I show the reverse strand
5'-AAACATGGG-3'
   |||||||||
3'-TTTGTACCC-5'

1) When a variant is reported, it is "typically" (but not always) reported relative to the mRNA (coding DNA strand)
2) gnomAD: all variants relative to Reference forward strand
5'-AAACATGGG-3'   - strand or reverse strand
   |||||||||
G
Ref: ATG Met2
Alt: ACG Thr2
This is the sequence we are viewing when we look at the reverse strand on UCSC.

When a variant is reported, strand is "implied" by the reference sequence supplied
5'-AAACATGGG-3'   - strand or reverse strand
   |||||||||
G
Ref: ATG Met2
Alt: ACG Thr2
chr1:5 T -> C
This is the sequence we are viewing when we look at the reverse strand on UCSC.
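The strand bookkeeping above can be checked with a few lines of Python. This is only a sketch using the toy 9-base sequence from the slides: the function names are made up for illustration, the codon table covers just the four codons in the example, and the reading frame is assumed to start at base 1. Note that a browser keeps the same genomic coordinate on both strands; the position computed below is just the index into the reverse-strand string.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Only the codons that appear in the slide example (NOT a full genetic code).
CODON_TO_AA = {"CAT": "His", "CGT": "Arg", "ATG": "Met", "ACG": "Thr"}


def reverse_complement(seq):
    """Return the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def apply_snv(seq, pos, ref, alt):
    """Substitute one base at a 1-based position, after checking the reference base."""
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]


def codon_at(seq, pos):
    """Return (codon, 1-based codon number) covering a 1-based position, frame 1 assumed."""
    index = (pos - 1) // 3
    return seq[index * 3:index * 3 + 3], index + 1


def to_reverse_strand(seq_len, pos, ref, alt):
    """Re-express a forward-strand SNV as an index and bases on the reverse-strand string."""
    return seq_len - pos + 1, ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)


forward = "AAACATGGG"
reverse = reverse_complement(forward)

# Forward strand: position 5, A -> G
fwd_ref, fwd_num = codon_at(forward, 5)
fwd_alt, _ = codon_at(apply_snv(forward, 5, "A", "G"), 5)

# The same variant as seen on the reverse strand
rpos, rref, ralt = to_reverse_strand(len(forward), 5, "A", "G")
rev_ref, rev_num = codon_at(reverse, rpos)
rev_alt, _ = codon_at(apply_snv(reverse, rpos, rref, ralt), rpos)

print(f"+ strand: {fwd_ref} {CODON_TO_AA[fwd_ref]}{fwd_num} -> {fwd_alt} {CODON_TO_AA[fwd_alt]}{fwd_num}")
print(f"- strand: {rev_ref} {CODON_TO_AA[rev_ref]}{rev_num} -> {rev_alt} {CODON_TO_AA[rev_alt]}{rev_num}")
```

The two printed lines match the Ref/Alt values on the slides: CAT His2 -> CGT Arg2 on the + strand, and ATG Met2 -> ACG Thr2 on the - strand.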
Examples
NM_000329.3(RPE65):c.1205G>A (p.Trp402Ter)    mRNA
NP_000320.1:p.Trp402Ter                       protein
NC_000001.11:68431508:C:T                     genomic (GRCh38)
NC_000001.10:g.68897192C>T                    genomic (GRCh37)
NG_008472.2:g.23451G>A                        genomic (genomic contig)

Other observations
GTEx -- Genotype-Tissue Expression Portal
  Expression is reported per WT gene (2 reported variants would not affect WT expression)
  GTEx reports expression in different tissues == amount of RNA == gene-based
  If I gave you a variant that was NOT in a gene, then there probably is no relevant GTEx value (unless it is in an intron)
Many of the databases and resources are cross-linked (UCSC -> GTEx) and (gnomAD -> NCBI/ClinVar)
  But not always, for all data
  Typically, you can get at the corresponding data through the other resources, if you know how to search for it.

Topics
Genome Browsers -- the BIG 3
  UCSC
  NCBI
    ClinVar and dbSNP
  Ensembl -- similar but different
Also gnomAD -- Genome Aggregation Database
Automation -- Ensembl

Browsers (Genomic Databases)
http://genome.ucsc.edu/
  Track Viewing, Table Browser
https://www.ncbi.nlm.nih.gov/
  Database queries
http://useast.ensembl.org/index.html
  Ensembl
  European
  In between UCSC and NCBI

Lab3 -- variants
genome.ucsc.edu:
  Gene, exon/intron, position, strand, DNA codon, AA codon, GTEx Expression
https://gnomad.broadinstitute.org/
  position, change, minor allele frequency (MAF), Consequence, Clinical Significance
https://ncbi.nlm.nih.gov/
https://ncbi.nlm.nih.gov/clinvar

Gene "view" examples

Examples: UCSC: "CFH"

Gene View: CFH

Updates
Note -- the various genome browsers update continuously (warning at top of gnomAD page)
Therefore, what you see in these slides may be slightly different than what you actually see on the genome browser sites.
Generally speaking, you should be able to replicate the functionality shown here, but it may take some trial and error.

UCSC Assemblies:
GRCh38: Newer, fewer gaps.
Newer annotation
GRCh37: Older (2013)
Older annotation, but more annotation

When in doubt, you can reset...

Anatomy of UCSC
  Coordinates
  Gene search or coordinates example: RDX or BBS1
  Click on Gene Name for description/links
  Banding pattern
  Track View
  Zoom in/Zoom out (buttons)
  Point and click zoom
  Pan left (buttons)
  Point and click pan
  DNA view (amino acid level) and strand
  Not covered: table browser for data extraction

Demo
Lab 3 example -- refer to lab3 Variants.

The following slides show various screen shots and examples.

Searching for a gene, by gene symbol
RDX is the only match.

So many genes – which one?
NCBI RefSeq are most conservative/highest quality (human reviewed)

chr11:110,045,605-110,167,437
Multiple Transcripts

Size of the window
Representation of chromosome
Relative position on chromosome
Coordinates
Can enter gene symbol or coordinates: chr11:110,045,605-110,767,437

Pan Left/Right
Zoom IN
Tells us which assembly
Can also click in this track to zoom in by factor of 3X
Click + HOLD + Move mouse L/R in the DATA tracks to PAN left/right

Click + HOLD + drag to highlight a region for zoom (in the SCALE or CHR track)

Data track title
Entries in that data track. These are mRNA sequences (different transcripts)
Mouse over an exon to get the exon (or intron) number
Click on a transcript to get to a page that describes it and links out to other resources
Note the direction of the arrows is LEFT, which tells us this gene is on the reverse (3' to 5') strand

Zoomed in to level of DNA and Amino Acids
chr11:110,150,392-110,150,445

Click here to access track

With "Full", if you zoom in far enough (to see nucleotides), you can see all three coding "frames".
There are 6 frames in total, 3 forward, and 3 reverse
chr11:110,279,641-110,279,741

Click this arrow to view the reverse strand (and see the other 3 coding frames)

UCSC Tracks
Mouse over, or follow link
Organized by "type" -- Mapping, Genes, Phenotype, etc.
RefSeq track
Link outs

Turn on/off different tracks of data.
Can follow the links for more detailed description (or mouse over for summary).
Different tracks for GRCh38 and GRCh37

View -> DNA (Note – 101 bp)
chr11:110,279,641-110,279,741

Table Browser: More advanced extraction (not covered)

Selected default to "Lower", and selected NCBI RefSeq

Shows the DNA that was in the window view (101 bp)

Can be slow…
European Mirror is often the fastest: http://genome-euro.ucsc.edu

NCBI
  ClinVar – Clinical Variations
    Publicly submitted variations and observation of disease
    Any problem with this?
  Gene
    Already did this in HW or Lab – Rhodopsin (RHO) or GJB2
  GEO
    Gene Expression Omnibus
    100's of thousands of expression experiments
  PubMed
    Publications
  SNP (aka dbSNP): Example: rs104893768
    Database of SNPs
  NCBI has a "tool" box to facilitate data extraction, but it is complicated.

Databases
Starting: All Resources

ClinVar example
USH2A

ClinVar: Type name of gene
Shows numbers of results of that type

Results
Selecting both, 7 results

Stars: Level of confidence (4 stars is best)
Tyrosine to "termination" codon change

Review status
https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/

ClinVar
We have used UCSC, ClinVar and gnomAD in previous Labs

PubMed Example: Search by gene, author, keyword, etc.

GEO Example: Gene Expression Omnibus
Filters

5,873 samples!

"dbSNP" Example: rs2277596

dbSNP Record

gnomAD
Version v3.*

Web Queries
Anyone can query a web page interface
How do we handle 100's of queries?
  1000's of exons/introns?
  Millions of nucleotides?

Ensembl
Has some automated extraction tools...
(may or may not show)
Also has a nice (Perl) API for programmatic data extraction, and REST (representational state transfer) -- Perl, Python, Java, etc.

Code:
    # Some docs here:
    # https://pypi.org/project/ensembl-rest/
    # https://ensemblrest.readthedocs.io/en/latest/
    import ensembl_rest

    # symbol_lookup returns a dictionary describing the gene
    lookup = ensembl_rest.symbol_lookup(
        species='homo sapiens',
        symbol='BRCA2'
    )
    print("lookup :homo sapiens and BRCA2")
    print("lookup is a dictionary")
    print(lookup)
    print()
    print("Found gene name =", lookup['display_name'])
    print("lookup description")
    print(lookup['description'])
    print()

Output:
    python rest_test.py
    lookup :homo sapiens and BRCA2
    lookup is a dictionary
    {'id': 'ENSG00000139618', 'db_type': 'core', 'object_type': 'Gene', 'logic_name': 'ensembl_havana_gene_homo_sapiens', 'biotype': 'protein_coding', 'display_name': 'BRCA2', 'assembly_name': 'GRCh38', 'version': 18, 'start': 32315086, 'end': 32400268, 'species': 'homo sapiens', 'source': 'ensembl_havana', 'canonical_transcript': 'ENST00000380152.8', 'description': 'BRCA2 DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]', 'strand': 1, 'seq_region_name': '13'}

    Found gene name = BRCA2
    lookup description
    BRCA2 DNA repair associated [Source:HGNC Symbol;Acc:HGNC:1101]

Code:
    # symbol_post looks up several gene symbols in one request;
    # the result is a dictionary keyed by symbol
    BBS_genes = ensembl_rest.symbol_post(
        species='homo sapiens',
        params={'symbols': ["BBS1", "BBS2", "BBS4"]})
    print("BBS genes")
    print(BBS_genes)
    print()

Output:
    {'BBS2': {'strand': -1, 'start': 56465640, 'db_type': 'core', 'source': 'ensembl_havana', 'end': 56582667, 'display_name': 'BBS2', 'seq_region_name': '16', 'id': 'ENSG00000125124', 'version': 14, 'object_type': 'Gene', 'species': 'homo sapiens', 'assembly_name': 'GRCh38', 'biotype': 'protein_coding', 'description': 'Bardet-Biedl syndrome 2 [Source:HGNC Symbol;Acc:HGNC:967]', 'logic_name': 'ensembl_havana_gene_homo_sapiens', 'canonical_transcript': 'ENST00000245157.11'},
    'BBS4': {'assembly_name': 'GRCh38', 'species': etc......
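On the earlier question of how to handle 100's of queries: one option is to send the symbols in batches rather than one at a time. The sketch below uses only the Python standard library and builds POST requests for the Ensembl REST `lookup/symbol` endpoint. The helper names and the batch size of 500 are assumptions for illustration (check the server's documented POST limit), and `lookup_symbols` needs network access, so the demonstration at the bottom only builds a request without sending it.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"   # public Ensembl REST server
BATCH_SIZE = 500                       # assumed value; check the documented POST limit


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_symbol_request(symbols, species="homo_sapiens"):
    """Build (without sending) a POST request for one batch of gene symbols."""
    body = json.dumps({"symbols": list(symbols)}).encode("utf-8")
    return urllib.request.Request(
        f"{SERVER}/lookup/symbol/{species}",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def lookup_symbols(symbols, batch_size=BATCH_SIZE):
    """Send one POST per batch and merge the per-symbol dictionaries (needs network)."""
    results = {}
    for batch in chunked(symbols, batch_size):
        with urllib.request.urlopen(build_symbol_request(batch)) as response:
            results.update(json.load(response))
    return results


# Offline demonstration: split 5 symbols into batches of 2 and inspect the first request.
demo_symbols = ["BBS1", "BBS2", "BBS4", "BRCA2", "RDX"]
demo_batches = list(chunked(demo_symbols, 2))
demo_request = build_symbol_request(demo_batches[0])
print(demo_batches)
print(demo_request.get_method(), demo_request.full_url)
print(demo_request.data.decode("utf-8"))
```

Calling `lookup_symbols(demo_symbols)` on a machine with internet access should return a dictionary keyed by symbol, similar in shape to the `BBS_genes` output above.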
Code:
    print("BBS4")
    print(BBS_genes['BBS4'])
    print("BBS4 description")
    print(BBS_genes['BBS4']['description'])

Output:
    BBS4
    {'assembly_name': 'GRCh38', 'species': 'homo sapiens', 'canonical_transcript': 'ENST00000268057.9', 'logic_name': 'ensembl_havana_gene_homo_sapiens', 'description': 'Bardet-Biedl syndrome 4 [Source:HGNC Symbol;Acc:HGNC:969]', 'biotype': 'protein_coding', 'end': 72738475, 'source': 'ensembl_havana', 'db_type': 'core', 'strand': 1, 'start': 72686179, 'object_type': 'Gene', 'version': 14, 'id': 'ENSG00000140463', 'seq_region_name': '15', 'display_name': 'BBS4'}
    BBS4 description
    Bardet-Biedl syndrome 4 [Source:HGNC Symbol;Acc:HGNC:969]

Code:
    # This gets exons and introns
    lookup = ensembl_rest.symbol_lookup('human', 'BRCA2', params={'expand': True})
    print("****************")
    # Here, lookup is HUGE -- so I don't print it out
    #print("lookup :")
    #print(lookup)
    transcript_id = lookup['canonical_transcript']
    # this drops the version suffix -- the ".8"
    # (splitting on "." also works for two-digit versions such as ".11")
    transcript_id = transcript_id.split('.')[0]
    print("transcript_id=", transcript_id)
    sequence = ensembl_rest.sequence_id(transcript_id)
    print(sequence)

Output:
    ****************
    transcript_id= ENST00000380152
    {'molecule': 'dna', 'query': 'ENST00000380152', 'version': 8, 'id': 'ENST00000380152', 'seq': 'AGAGGCGGAGC....ETC

Code:
    # This is using the REST interface to get the CDNA
    # for the transcript_id from above
    # Note, the CDNA would be exons only
    # By setting the headers, I'm telling it to
    # provide it as a "fasta" format
    import requests, sys
    server = "http://rest.ensembl.org"
    ext = "/sequence/id/" + transcript_id + "?type=cdna"
    r = requests.get(server + ext, headers={"Content-Type": "text/x-fasta"})
    if not r.ok:
        r.raise_for_status()
        sys.exit()
    print("r.text=", r.text)

Output:
    r.text= >ENST00000380152.8
    AGAGGCGGAGCCGCTGTGGCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTTGCGGC
    GGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGC etc...
    print("****************")

Code:
    # this example is pulling out the "id" from the BBS_genes
    # 2D dictionary to get the ensembl gene dictionary
    # 'id': 'ENSG00000174483'
    sequence = ensembl_rest.sequence_id(BBS_genes['BBS1']['id'])
    print("sequence for BBS1:")
    print(sequence)

Output:
    sequence for BBS1:
    {'version': 20, 'desc': 'chromosome:GRCh38:11:66510606:66533613:1', 'query': 'ENSG00000174483', 'id': 'ENSG00000174483', 'molecule': 'dna', 'seq':
    'ACTATTGGGCGTTACGCGAGGGCGGGGCCGGTTGCCAGGACGACGCCTGCGAAGATGGCCGCTGCGTCCTCATCGGATTCCGACGCCTGCGGAGCTGAGAGGTGA
    AGGCAGGGCTCCTCAAGGCCTCTTTTCCCACCCGTGTAAAGAGGGTCCCTTGGTCCCCGGGCTCTGGGCTCCTGCTGTTCTCGGGCAGTCTGGAAGGACTCTTAAGA
    GGTCAGATAGGGAAACCGAGGCCTAAGATGTGCATATGTGTGTCTGGAGGTCGCTCGGTGAGCCAGTGGCGGAGCCTGGACTTGTACCCAGACGTTCTGTACCTTAT
    GTTCAGAGTGACCTGTACCAGCTTCCTCAAAGTTTTTTTTTCCCCTCCCATGGGACTCACTCCCCAACTGTCTTTCCCCCACTTCCAGCAATGAGGCCAATTCGAAGTG
    GTTGGATGCGCACTACGACCCAATGGCCAATATCCACACCTTTTCTGCCTGCCTAGGTGAGTCTCTGGAACCAGGAACCCTGGGTTCTAGTGGGATGGGGAGTCAGAC
    AATGGTCCTGTAGTGAAGCCTCTGGGATTCTGAGACTCTGGTTTTGGCCCTTTTGTTTTCCAGCGCTGGCAGATTTACATGGGGATGGGGAATACAAGGTAAGCATATC
    ACCCTAGCCAGGAGAGTTGAGGGTAGGGGGGTGTACCCAGAAATGAGATTTCCTGACGGCTGAAAATAGGCCCAGCATACTCTGGAATTCACATATACTGTGAAAAA
    GCACATTTAGCGTTAAATTTTGTTTTTATATGGAAAGTGAAGAGAACTTATTGATTTCATAATACAGACTGCAGAAAGTAAAGCAGATCAAATATTTCTGGGGATTCTAC
    TAAGACAAAGCCTCTTAGTTAACCTGATTTCTTTGTTGGAAGGGCAAATCCCTTGGTTGGGGGTGGGCACTAGAATCTGGAGGGAAAGTAAAAAGGAGGGAAGTGA
    ACGTTTGATGTTTATCAAAAGATTTTAAG etc

www.ensembl.org

Big List of "hits"

Data and More transcripts

Zooming

Data Extraction: BioMart

Select Sequence, cDNA sequences

Count button, 1 gene

Results

Perl

Gives you code to extract data...
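For records like the BBS1 sequence result above, the 'desc' field can be turned back into browser-style coordinates. A small sketch, assuming the field always has the colon-separated layout seen in that one example (coordinate system, assembly, chromosome, start, end, strand); `Region` and `parse_desc` are hypothetical helper names, not part of ensembl_rest.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic interval with 1-based, inclusive coordinates."""
    assembly: str
    chrom: str
    start: int
    end: int
    strand: int          # 1 = forward (+), -1 = reverse (-)

    @property
    def length(self):
        """Number of bases covered, counting both end points."""
        return self.end - self.start + 1

    def ucsc(self):
        """Format as a UCSC-style position string, e.g. chr11:1,000-2,000."""
        return f"chr{self.chrom}:{self.start:,}-{self.end:,}"


def parse_desc(desc):
    """Parse a 'desc' string such as 'chromosome:GRCh38:11:66510606:66533613:1'."""
    coord_system, assembly, chrom, start, end, strand = desc.split(":")
    if coord_system != "chromosome":
        raise ValueError(f"unexpected coordinate system: {coord_system}")
    return Region(assembly, chrom, int(start), int(end), int(strand))


# The 'desc' value copied from the BBS1 output above
region = parse_desc("chromosome:GRCh38:11:66510606:66533613:1")
print(region.assembly, region.ucsc(), "strand", "+" if region.strand == 1 else "-")
print("length:", region.length, "bp")
```

The printed position can be pasted into the UCSC search box (with the GRCh38 assembly selected), and `region.length` can be compared against `len(sequence['seq'])` as a sanity check on the returned sequence.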
Summary
UCSC – Track view
  Many data tracks, can extract DNA
  More advanced extraction called table browser (not covered)
NCBI – Different databases
  Lots of databases
  Mostly text-based searches
  Track view exists
  Tools exist for data extraction (NCBI ToolKit was not covered)
Ensembl – Combination of both (Data and Tracks)
  API access (Perl)
  Can't zoom to DNA but can see in Region Comparison tab

End

Golden Rice: GMO'ed to contain beta-carotene, precursor to Vitamin A
Global and Social Awareness