BIO259 Lecture Notes PDF
University of Toronto Mississauga
Summary
This document provides lecture notes for a course on biological data. It covers various topics such as data structures, common data formats, and approaches to performing biological research using data analysis techniques. It is aimed at an undergraduate level.
BIO259 Lecture Notes

Lecture One: Introduction to Biological Data

Course Overview:
1. BIO360: Biometrics I
   a. This course is the natural continuation of BIO259
2. BIO361: Biometrics II
   a. The sequel to BIO360
3. BIO427: Data science in biology
   a. Introduction to machine learning in biology
4. BIO429: Data analysis in neurobiology

Course Overview: Full Course
Week 1-4: What is Data
1. What does data look like
2. Data organization and manipulation
3. Data summaries
4. Plots
Week 5-8: Probabilities
1. Summary statistics
2. Probability distribution
3. Sampling data
4. Regression
Week 9-11: Statistics
1. Hypotheses and P-values
2. Statistical testing
3. Multiple testing correction

Course Overview: Weekly Plan
1. Initial Lecture
   a. Introduction to material covered in a given week
2. Tutorial
   a. Guided coding on Jupyter Hub
   b. Walk-through of example material with TAs
3. Practical
   a. Graded activities distributed via Jupyter Hub
   b. Support from TAs will be provided
4. Recap Lecture
   a. Review of material that we’ve covered during the week
   b. Some small amount of new material
   c. Extended question and answer period

Course Overview
The course syllabus is your guide
Quercus
○ Announcements
○ Lecture slides
○ Lecture/tutorial readings
○ Grades

Course Grading
1. 11 practical assignments
   a. Top 10 grades count
2. 2 midterms of 1 hour (20% each)
   a. Multiple choice
   b. Not cumulative
   c. Lecture material + coding questions (no live coding)
   d. Closed book
3. Final exam
   a. Multiple choice
   b. Cumulative
   c. Lecture material + coding questions
   d. Closed book

Scale of Biological Data
Several advanced technologies have opened the door to rapid, cost-effective generation of immense and interconnected datasets across biological systems.
A. High-throughput instrumentation
   a. Speed, automation, diversification
B. Next-generation sequencing
   a. DNA, RNA, protein
C. Computational simulations
   a. Enhanced processing power
D. Online data repositories
   a. Massive, regularly updated datasets

Big Data Challenges
1. Individual datasets can contain millions of rows and/or columns
2. Datasets can be multidimensional and relational
3. Datasets are often dynamic and regularly updated

Microsoft Excel Specifications
Max. Columns: 16,384
Max. Rows: 1,048,576
Memory limitations

Big Data Example: GWAS
Genome-wide association studies (GWAS) can help biologists identify the genetic basis of biological traits by studying variation across many individuals.
These studies often take a k-mer based approach to study variation, which allows us to identify all forms of variation across genomes.
K-mers are subsequences of length k; the set of k-mers drawn from a genome represents every sequence of length k that occurs in it.

What Does Biological Data Look Like?
Rows always run from left to right, while columns always run from top to bottom.
When numerical variables are erroneously converted to other data types, it can cause analysis issues.

Beyond Tables
A vector is a collection of like values without dimensions.
Example 1: A vector of strings
Example 2: A vector of integers
Example 3: A vector of floats (called doubles in R, short for double-precision floating-point numbers)
A matrix is a collection of like values organized into two dimensions.
Matrices are indexed with two values (row, column), whereas vectors are indexed with only one value.

Matrix vs. Dataframe
An array is a collection of like values organized into many dimensions.

Lecture One Recap: Introduction to Biological Data

What are Data Files?
1. Text files
   a. The most common format for sharing biological data
   b. Human-readable characters
   c. Usually, fields are delimited
   d. Each line of text is terminated by an “end of line” character
   e. Programs must convert the data into their own internal structure
2. Binary files
   a. Store the internal structure directly for programs to read
   b. Usually much smaller and faster for programs to read
   c. Mostly not human-readable, though there are some exceptions

Know Your Common Text Files
1. Biological sequences
   a. GFF3: Sequence features (annotations) in the context of genomes
   b. FASTA: Sequences
   c. FASTQ: FASTA extended with per-base quality scores
   d. SAM: Read alignments
   e. VCF: Variant polymorphisms
   f. CLUSTAL: Sequence alignments
2. Biological objects
   a. JSON: Structured data objects
   b. XML: Structured data objects
3. Text files
   a. TXT
   b. CSV
   c. TAB

Know Your Common Text Files
1. Biological sequences
2. Biological objects
3. Text files
4. Phylogenetic trees
5. So many more weird things

Text Files Quirks
1. End of line
   a. Differs between Windows and Macs (Windows ends lines with CR+LF, while macOS and Linux use LF alone)
      i. Double-check data conversion by looking at the number of lines or rows
2. Extension (.txt, .csv, .tab, etc.)
   a. No rule says that the extension actually specifies any delimiter
   b. Extensions are not even needed (but greatly appreciated)
3. Character encoding
   a. The character encoding specifies the “alphabet” that can be present within the text
   b. Not all programs can read all character sets

Text Files Fields
Delimiters separate units of data (or fields).
Fields (internally) usually describe:
○ Nominal values (labels with no order)
○ Ordinal values (labels with some order)
○ Numeric values (numbers)
Fields are stored as data types:
○ Strings
○ Integers
○ Doubles
○ Floats
○ Logical
○ Etc.

Lecture 2: Data Organization and Manipulation

Learning Outline
1. Data structures and alternative formats
2. Indexing and subsetting data
3. Sorting and filtering data
4. Relational databases
5. Leveraging the tidyverse

Review of Basic Data Structures
Vectors are a collection of like values in a single dimension.
Matrices are a collection of like values organized in two dimensions.
Arrays are a collection of like values organized into many dimensions.
Data frames are two-dimensional collections of values with headers.

Data Frame Variants
Common delimiters include spaces, tabs, commas, and other special characters.

Indexing Biological Data
Indexing or slicing allows you to access, extract and replace values in your data, regardless of the structure.
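As a minimal sketch of what accessing, extracting and replacing values can look like in R, the example below indexes a small named vector. The gene names and expression values are invented for illustration and do not come from the course material.

```r
# Hypothetical data: expression values for five made-up genes.
expr <- c(geneA = 2.5, geneB = 0.0, geneC = 7.1, geneD = 3.3, geneE = 1.2)

# Access: a single element, by position or by name.
first_value <- expr[1]
by_name <- expr["geneC"]

# Extract: several elements at once, by position or by a condition.
first_three <- expr[1:3]
high <- expr[expr > 3]

# Replace: assign a new value into an indexed position.
expr["geneB"] <- 0.4
```

The same square-bracket syntax serves all three purposes; only the side of the assignment arrow it appears on changes.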
R uses one-based indexing, which means that the first element in a data series has an index of one.

Indexing Vectors
Negative index: Remove from selection

Indexing Multidimensional Data
Indexing matrices and arrays has to account for each dimension.
Matrices are indexed by row first, then by column.
Arrays are indexed by row, then column, then each further dimension.
Data frames can be indexed in much the same way as other multidimensional datasets, particularly matrices.
Data frames can also be indexed with the dplyr package, which is especially good at indexing based on the values within the table.
The arrange command allows you to sort your data by one or more columns.

What are Relational Databases?
Relational databases enable us to organize data into multiple tables that are linked by common data.
Advantages of relational databases include:
1. Decreasing the footprint of individual tables
2. Being able to link tables together with single or multiple queries
3. Being able to create new tables to better understand relationships between data
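The idea of tables linked by common data can be sketched with two small data frames. This is an illustrative example only: the table names, the sample_id key and all values are hypothetical, and base R's merge() is used here as one way to link tables (dplyr's join functions are another).

```r
# Hypothetical tables linked by a shared key column, sample_id.
samples <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  tissue    = c("leaf", "root", "leaf")
)
measurements <- data.frame(
  sample_id  = c("S1", "S1", "S2", "S3"),
  gene       = c("geneA", "geneB", "geneA", "geneA"),
  expression = c(2.5, 0.4, 7.1, 3.3)
)

# Link the tables on the shared key: each measurement row picks up
# the tissue recorded for its sample.
linked <- merge(measurements, samples, by = "sample_id")

# Build a new table from the linked data: mean expression per tissue.
per_tissue <- aggregate(expression ~ tissue, data = linked, FUN = mean)
```

Storing the tissue once per sample, rather than once per measurement, is what keeps the individual tables small; the link is only materialized when a question needs both tables.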
Connecting Relational Tables

Filesystem
The filesystem is your computer's way of managing files.
The only important thing for you to know is how to navigate file paths.
The root is the bottom of the file system tree.
Absolute paths should begin with the root and work their way up from there.
Relative paths use short representations
○ Current directory (.)
○ Parent directory (..)

Leveraging the Tidyverse
One of the most challenging features of programming in R is that there are many ways to code the same task.
The tidyverse is a collection of open source R packages designed to model, transform, and visualize data.
The code in all of these packages shares a consistent design philosophy, grammar and data structure.
We will leverage a number of the tidyverse packages in this course.

Dplyr Commands
Common tasks
○ Filter (extract rows)
○ Select (extract columns)
○ Arrange (sort)
○ Mutate (add column)
○ Summarize (aggregate)
○ Join (for relational tables)

Lecture 3: Basic Operations and Data Summaries

Performing Basic Operations
Arithmetic operators enable us to perform basic calculations.
Relational operators enable us to compare our values.

Logical Operators

Chaining Operators
Logical values are at the heart of most functions and computations.
Operations are chained through formal logic.
Logical operators also follow an order of precedence
○ ! before & before |
Logical operators have left-to-right associativity.

Managing Different Types of Variables
Logicals are Boolean values (TRUE or FALSE).
Integers (no decimals) and floats (with decimals) are numerical values.
Characters are strings of letters and numbers.
Factors are strings with preset levels or categories.

Continuous, Discrete and Categorical Variables

Nominal and Ordinal Variables
Nominal variables are simply categorical (Nom -> Named), while ordinal variables are categorical variables with a clear order (Ord -> Order).

Qualitative and Quantitative Variables

Independent and Dependent Variables

Importance of Measurement Scales
The four primary data measurement scales are nominal, ordinal, interval and ratio.
Data that are on a nominal scale involve categorical values without any quantitative value (e.g., hair colour).
Data that are on an ordinal scale involve categorical values where order is significant, but the differences between values are ambiguous (e.g., letter grades: A, B, C, D, F).
Data that are on an interval scale involve numerical values where both order and exact differences are known (e.g., temperature in degrees Celsius).
○ Two additional key properties: the “0” in the interval scale does not mean absence, and by extension the ratio of two values on the interval scale is meaningless
Data that are on the ratio scale are numerical values with known order, exact differences and an absolute zero (e.g., height).
○ The difference between the interval scale and the ratio scale is subtle, but it becomes apparent when trying to do statistics on them
○ In some sense, the interval scale has some arbitrariness in where it starts

Measurement Scales and Statistical Summaries

Review of Key Statistical Definitions
The range of a dataset is the difference between the greatest and the lowest value.
The mode of a dataset is the value that occurs most often.
The median of a dataset is the middle number of ordered values.
The mean of a dataset is the average.
The standard deviation of a dataset measures the dispersion relative to the mean and is calculated as the square root of the variance. It is frequently denoted as lowercase sigma (σ).
○ The standard error is the standard deviation of the sampling distribution of a statistic

Leveraging Regular Expressions
Regular expressions are sequences of characters that specify a search pattern in a text, file or dataset.
All sequences that meet the criteria of the search pattern can be printed, stored or replaced.
The power of regular expressions comes from the ability to specify an enormous range of matches with a single search string.

Operations on Data Frames
Operations and regular expressions are most effectively applied to data frames using the dplyr package.
The mutate command enables you to add new variables to a data frame by performing operations on existing variables.

Statistical Data Summaries with Dplyr

Pivoting Your Data
Pivoting your data allows you to rearrange and aggregate columns and rows in your data frame to view your data from different perspectives and summarize it in various ways.
Basically, the pivot table groups data based on a common feature, and then summarizes each group.

Lecture 3 Recap: Basic Operations and Data Summaries

Regular Expressions
Regular expressions are patterns for describing strings
○ They are incredibly powerful, but can be super tricky to design
Why RegEx is tricky
○ The more you want to capture something and exclude other things, the harder it is to develop a precise RegEx that does not fail
○ Let's think of a super “trivial” example: let's develop a pattern for email addresses. Maybe you have this in your data frame somewhere.
○ But first, let's try to understand how RegEx works

Regex are representations of state machines

The OR Architecture in RegEx

Character Sets

The Repeat Architecture in RegEx

The “skip” Architecture in RegEx

Some Miscellaneous Characters

Some Shorthands

A Basic State Machine that Matches Some Email Addresses

Replacements

Regex in R

Lecture 4: Basic Graphical Data Summaries

Basics of Graphical Data Representation
Graphical data representation involves the use of visual tools to display, analyze, and interpret data.
The primary goal is to simplify data in order to gain insight into patterns in individual data and relationships between variables.
There are many types of graphs, with most including axes, titles, units and variables represented in two-dimensional space.

Choosing the Best Graph for your Data
Choosing the appropriate chart for your data depends on a variety of factors, including the nature of your data, the purpose of the figure, and the scale the data is on.

Other Common Visualization Mishaps
Unnecessary use of multiple colours when one will suffice.
Trying to fit too much data into a single chart.
Manipulating axis start points to amplify differences.
Using a 3D rendering of data when a 2D rendering will suffice.

Graph Types: Dot Plots
Dot plots utilize dots to represent quantities of categorical data.
They typically plot independent variables on the x-axis and dependent variables on the y-axis.
They can represent an alternative to standard bar charts when you want to display all values in a dataset.

Graph Elements: Error Bars
Error bars are used to indicate the uncertainty in a reported measurement.
In all cases, they are meant to give some idea of how much the value would vary if the measurement were resampled.
In “big data” biological studies, the most reasonable error bar is the 95% confidence interval for the data point represented.

Graph Types: Line Graphs
Line graphs display continuous data and are often used to visualize changes in a quantitative variable through time.
They typically plot the time variable on the x-axis and a dependent quantitative variable on the y-axis.

Graph Types: Bar Charts
Bar charts use rectangular bars with lengths that are proportional to the values they represent to compare categorical data.
The categorical data are typically displayed on the x-axis, while the quantitative data are displayed on the y-axis.
Error bars and dot plots can be superimposed to display variance.

Graph Types: Box and Whisker Plots
Box and whisker plots are a variant of bar charts that also typically display categorical data on the x-axis with quantitative dependent variables on the y-axis.
Unlike bar charts, they always include information about variance and outliers.
The median is represented by a horizontal line, boxes display the interquartile range, whiskers typically extend to the most extreme values within 1.5 times the interquartile range of the box, and outliers beyond the whiskers are shown as dots.

Graph Types: Scatter Plots
Scatter plots are some of the most commonly used plots in biological data science.
They are used to display the relationship between two variables, with each dot representing an individual piece of data.
On scatter plots, both the x and the y axes represent quantitative measurements of the same piece of data.

Graph Types: Pie Charts
Pie charts are used to display percentage values as slices of a pie.
Including the percentages and even the labels within the pie is a common strategy to improve the readability of these charts.
They are typically only used for categorical data with a small number of categories.

Graph Types: Histograms
Histograms display frequency distributions of data from one or more variables using adjacent vertical bars.
Discrete intervals on the x-axis categorize the data, while count data from each interval are displayed on the y-axis.
Histograms are often used in data exploration to determine if the data are normally distributed.
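The binning behind a histogram can be sketched in base R as follows. The data are simulated rather than taken from the course, and the variable names and the 5 mm bin width are arbitrary choices made for illustration.

```r
# Hypothetical data: 200 simulated body lengths (mm), drawn from a
# normal distribution so the expected shape is known in advance.
set.seed(1)
lengths_mm <- rnorm(200, mean = 50, sd = 5)

# Interval edges every 5 mm, padded outward to cover the observed range.
bin_edges <- seq(floor(min(lengths_mm) / 5) * 5,
                 ceiling(max(lengths_mm) / 5) * 5,
                 by = 5)

# plot = FALSE returns the binned counts without drawing anything;
# remove that argument to draw the histogram itself.
h <- hist(lengths_mm, breaks = bin_edges, plot = FALSE)

# Counts per interval (the y-axis) and interval midpoints (the x-axis).
h$counts
h$mids
```

A roughly symmetric, single-peaked set of counts is consistent with normally distributed data, although a histogram alone is an exploratory check and not a formal test.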
Graph Types: Heat Maps
Heat maps display data in a two-dimensional matrix, where each cell represents a piece of data and the colour indicates its value.
The data on the x and y axes of heat maps are typically categorical, while the colour of the cells is typically quantitative.
A colour gradient outside the heat map specifies the relationship between colour and the variable being displayed.

Other Graph Types
Violin plots are a variant of box and whisker plots that show you the full shape of the data distribution and are especially useful when you have a multimodal distribution.
Bubble plots are a variant of scatter plots that enable you to look at the relationship between three quantitative variables. The third variable is represented by the size of the point.
Density plots are a variant of histograms, showing the distribution of data over a continuous interval. They look like smoothed histograms and are used for determining the shape of a distribution.

Graph Elements
Axis labels are text that marks major divisions on a chart.
The legend provides a description of the data rendered on a chart.
The caption is one of the most important elements in a figure, relaying important aspects of the data to allow readers to correctly interpret the figure. It contains clarifying elements, such as what the error bars represent, what units the axes are in, etc.
Statistical tests are usually shown as stars, or as bars connecting specific pairwise tests.
Lecture 4 Recap: Basic Graphical Data Summaries

Good Graphical Design
Colour choices
○ Multiple colour hues are particularly poor for quantitative data: the order of colour hues is not intuitive
○ Colour perception is influenced by the surrounding colours
○ Always use colours that are less likely to deceive both you and the reader
○ Categorical data labels are incredibly well represented by distinct colour choices
○ Be AWARE of colourblind individuals
○ Bad colour map usage is accentuated in colourblind individuals
○ The viridis colour map
○ Can highlight specific differences through colour choice
3D in graphs is almost always bad
○ Representing 3D in 2D leads to the loss of a dimension
○ Not everyone has the same intuition of distance or height in 3D
Labels
○ Avoid busy figures: maybe one message per figure max
Dealing with large datasets
○ Scatterplots are often used to portray relationships between the x and y axes
○ The eye is very bad at determining the density of points when there are overlapping dots