Introduction to Data Science Textbook PDF
Université Mohammed VI des Sciences et de la Santé
Agouzoul Hibatallah
Summary
This textbook, "Introduction to Data Science," is intended for biostatistics students. It covers organizing, analyzing, and visualizing data using software like Jamovi and Excel. The book also explains different types of variables, data organization, visualizations, cross-sectional study designs and basic statistical tests (Chi-squared, t-test, and correlation).
LEARNING OBJECTIVES

- Understand and classify different types of variables (categorical vs. quantitative) using Jamovi
- Learn how to organize and manage data efficiently in Excel for statistical analysis
- Understand how to transform continuous variables into categorical variables in Jamovi or Excel
- Learn how to create effective data visualizations (histograms, bar charts, scatter plots) in Jamovi and Excel
- Gain an understanding of cross-sectional study design and its application in biostatistics and medical research
- Conduct basic statistical tests (Chi², t-test, and correlation) to analyze relationships in data

Table of contents of the textbook

1 Introduction
2 Types of variables
3 Data organisation
4 Transforming a quantitative to a categorical variable
5 Visual presentation of data
6 Cross sectional studies
7 Simple data analysis

Introduction

KEY CONCEPTS IN BIOSTATISTICS AND DATA ANALYSIS

In biostatistics, working with data is a foundational skill that allows researchers to draw valid conclusions and make evidence-based decisions. The process of data analysis typically begins with organizing and cleaning the data, ensuring it's ready for meaningful interpretation. Proper classification of variables is key, as it determines the appropriate statistical methods for analysis. Data can be transformed and visualized in a variety of ways, enabling researchers to uncover patterns, trends, and relationships within the data. Whether it's converting numerical data into categories, presenting findings visually, or applying statistical tests, each step plays a critical role in the research process. Understanding the structure of different study designs, such as cross-sectional studies, also helps in contextualizing the data.
With the right tools and techniques, researchers can apply fundamental statistical tests (such as the Chi-square test, t-test, or correlation coefficient) to test hypotheses and gain insights that inform decision-making.

Definition and Types of Variables

WHAT'S A VARIABLE?

A variable is any characteristic, number, or quantity that can be measured, counted, or classified. Variables can take on different values across different individuals or items. These values can vary in different ways, depending on the type of variable and the measurement scale used. Variables are the fundamental units of analysis in data collection and statistical modeling, allowing researchers to study relationships, trends, and patterns.

WHAT ARE THE TYPES OF VARIABLES?

In biostatistics, variables are categorized based on their characteristics and how they can be measured or classified. Understanding the different types of variables is crucial because it helps determine the appropriate statistical methods to apply in data analysis. Variables are broadly classified into two main categories: categorical variables and quantitative variables.

QUALITATIVE VARIABLES

1 Categorical / qualitative variables

- Represent data that can be grouped into distinct categories or labels
- Are non-numeric and describe qualities or characteristics
- Are used to classify or identify groups within a dataset

Categorical variables are divided into two sub-types:

Nominal Variables
Categorical variables that represent distinct categories or groups without ranking between them. These categories cannot be meaningfully arranged in any particular sequence.
Examples:
- Eye color (blue, green, brown)
- Marital status (married, single, etc.)
- Blood type (A, AB, O, B)
- Disease status (presence, absence)

Ordinal Variables
Categorical variables where the categories have a specific order or ranking, but the intervals between the categories are not necessarily equal or meaningful. Thus ordinal variables indicate a relative position.
Examples:
- Pain severity (Mild, Moderate, or Severe)
- Socioeconomic status (Low, Middle, or High)
- Stages of addiction (Mild, Moderate, or Severe)

QUANTITATIVE VARIABLES

2 Numerical / quantitative variables

Represent measurable quantities, where the values are numbers that indicate amounts or magnitudes and allow for arithmetic operations such as addition and subtraction.

Numerical variables are divided into two sub-types:

Discrete Variables
These values are typically whole numbers, and there are no intermediate or fractional values between two consecutive values. All possible values can be listed, and they are usually finite.
Examples:
- Number of hospital visits
- Number of surgeries a patient has had
- Number of car accidents in a year
- Number of positive test results in a population

Continuous Variables
Continuous variables can take an infinite number of values within a given range. They can take fractional or real values.
Examples:
- Weight (70.25 kg, 70.253 kg, 70.2531 kg)
- Temperature (36.5°C, 36.55°C, or 36.555°C)
- Cholesterol level (200.5 mg/dL, 200.55 mg/dL)
- Distance (60.1 km, 60.15 km, 60.156 km)

CLASSIFYING VARIABLES IN JAMOVI

After understanding the different types of variables, it's important to know how to classify them for statistical analysis. Jamovi, an open-source statistics package, simplifies this process by automatically assigning each variable a measure type on import (nominal, ordinal, or continuous; discrete counts are typically stored as continuous variables with an integer data type). Users can easily adjust these classifications through the software's intuitive interface.
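
To make the four variable types defined above concrete, here is a minimal Python sketch that guesses a type from a column's values. The function name `classify_variable` and its `ordered_levels` argument are hypothetical, and the rule is only an illustrative heuristic: it is not how Jamovi detects types, and real data usually needs a human decision (for example, a numeric code such as 1/2/3 may really be a category).

```python
from typing import Optional, Sequence


def classify_variable(values: Sequence, ordered_levels: Optional[Sequence[str]] = None) -> str:
    """Return 'nominal', 'ordinal', 'discrete' or 'continuous' for one column.

    Hypothetical heuristic for illustration only; not Jamovi's detection rule.
    """
    observed = [v for v in values if v is not None]  # ignore missing entries
    if not observed:
        raise ValueError("column has no observed values")

    # Text (or True/False) values are categories.
    if all(isinstance(v, (str, bool)) for v in observed):
        # Categories are ordinal only if the caller supplies a ranking for them.
        if ordered_levels is not None and set(observed) <= set(ordered_levels):
            return "ordinal"
        return "nominal"

    # Whole numbers only: treat as a count (bool is excluded on purpose).
    if all(isinstance(v, int) and not isinstance(v, bool) for v in observed):
        return "discrete"

    # Any mix of whole and fractional numbers: treat as a measurement.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed):
        return "continuous"

    raise ValueError("column mixes text and numbers; clean it first")


# Example columns taken from the examples in the text (values are made up).
columns = {
    "blood_type": ["A", "AB", "O", "B"],
    "pain_severity": ["Mild", "Severe", "Moderate"],
    "hospital_visits": [0, 3, 1],
    "weight_kg": [70.25, 81.4, 65.0],
}
rankings = {"pain_severity": ["Mild", "Moderate", "Severe"]}

for name, column in columns.items():
    print(name, "->", classify_variable(column, rankings.get(name)))
```

The ranking is passed in explicitly because nothing in the values themselves says that "Mild" comes before "Severe"; that knowledge has to come from the analyst.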
Jamovi provides several tools to help identify, classify, and manage variables effectively for statistical analysis:

Variable Classification in Data View:
- Automatic Detection: Upon data import, Jamovi automatically assigns each variable a measure type (nominal, ordinal, or continuous) based on its content.
- Manual Adjustment: Users can easily modify the classification of a variable by adjusting its type (e.g., from continuous to categorical) through the variable settings.

Assigning Variable Roles:
- Role Assignment: Variables can be assigned specific roles (e.g., dependent or independent) within statistical analyses, ensuring proper treatment in tests like regression or ANOVA.

Descriptive Statistics and Visualization:
- Summary Statistics: Jamovi generates basic summaries (mean, median, mode, etc.) for continuous variables, and frequency tables for categorical variables.
- Graphical Tools: Jamovi produces plots such as histograms, box plots, and bar charts to accompany these summaries.

Data Transformation:
- Recoding and Binning: Continuous variables can be recoded into categories or grouped (e.g., creating age ranges or other bins), simplifying data analysis.

Automatic Classification:
- When importing data (e.g., from Excel or CSV), Jamovi attempts to automatically classify variables based on their format, but users can adjust classifications as needed.

DIFFERENCES BETWEEN CATEGORICAL & QUANTITATIVE VARIABLES

APPLYING DATA VARIABLES IN HEALTHCARE

Both variable types are essential for improving patient care. Integrating them helps to better understand disease patterns and outcomes. Quantitative variables, such as blood pressure or cholesterol levels, provide numerical data that can be analyzed to identify trends and predict outcomes. These variables are typically analyzed using descriptive statistics and regression models.
Qualitative variables, such as gender or disease status, categorize data into distinct groups and are analyzed using frequencies or chi-square tests. Combining both types of variables provides a more comprehensive understanding of health conditions. Quantitative data offers precision, while qualitative data adds context. Together, they allow for more accurate risk stratification and personalized treatment. This combination enhances clinical decision-making and supports medical research.

Data organisation

EXCEL FOR DATA ORGANIZATION

Effective data organization is a fundamental step in biostatistical analysis within the medical sector, and Microsoft Excel is a widely utilized tool for managing and preparing clinical and epidemiological data. Proper organization ensures accuracy, consistency, and ease of analysis, and it is crucial for drawing reliable conclusions from medical research, patient care, and public health initiatives. Well-organized data simplifies statistical analysis and enhances reproducibility, transparency, and collaboration across research teams.

BEST PRACTICES FOR ORGANIZING DATA IN EXCEL

- Structured Data Layout: Each variable should have its own column, and each row should represent a unique observation or data point (e.g., an individual participant or case). The first row should contain descriptive column headers to clearly label each variable.
- Clear and Consistent Labeling: Column headers should be concise, descriptive, and free of spaces. Use underscores (e.g., Treatment_Group) or camel case (e.g., treatmentGroup) for clarity. Consistent naming conventions should be applied throughout the dataset.
- Avoid Merging Cells: Merging cells can disrupt sorting, filtering, and data manipulation. Ensure that each cell in a column represents a single piece of data for a given variable.
- Use Data Validation: Excel's data validation feature should be used to restrict data entry and ensure consistency. For example, set minimum or maximum values for numerical data (e.g., age should be between 0 and 120) or define allowed text entries (e.g., "Yes" or "No" for a binary variable).
- Handle Missing Data Appropriately: Use consistent placeholders for missing data, such as "NA" or "#N/A." Do not leave cells blank, as this can lead to errors in analysis. Make sure that missing data is clearly identifiable.
- Avoid Special Characters: Use only standard alphanumeric characters (letters, numbers, and underscores) in column headers and category labels. Special characters like commas, periods, and slashes in headers or labels may interfere with data processing or analysis.

LABELING AND ARRANGING COLUMNS IN EXCEL

1-Column Labels
Column headers should be clear, precise, and representative of the data contained in each column (e.g., "Age," "Treatment_Group," "Blood_Pressure"). Avoid using spaces or ambiguous abbreviations, and maintain consistency in terminology across the dataset.

2-Column Arrangement
Place the most important or identifying variables (e.g., unique identifier, demographic variables like age or sex) at the beginning of the dataset. This allows for easier navigation and analysis, especially when working with large datasets. Group related variables together, such as grouping all clinical measurements or lab results in adjacent columns.

3-Consistent Data Entry
Ensure that the format of data within each column is consistent. For example, for a column representing age, all values should be numerical, and for categorical data, the same codes or words should be used consistently across the dataset.
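
The validation rules described above (age between 0 and 120, "Yes"/"No" entries, an explicit "NA" placeholder, headers without spaces or special characters) can also be checked after the data leave Excel. The following is a minimal Python sketch under stated assumptions: the column names `Participant_ID`, `Age`, and `Smoker` and the function names are hypothetical, and each row is assumed to have already been read into a dictionary of text values.

```python
import re

MISSING = "NA"  # placeholder for missing data, as recommended in the text
HEADER_PATTERN = re.compile(r"[A-Za-z0-9_]+")  # letters, digits, underscores only


def check_headers(headers):
    """Return the headers that contain spaces or special characters."""
    return [h for h in headers if not HEADER_PATTERN.fullmatch(h)]


def check_row(row):
    """Return a list of problems found in one observation (a dict of cell values)."""
    problems = []

    age = row.get("Age", "")
    if age == "":
        problems.append("Age is blank; use the NA placeholder instead")
    elif age != MISSING:
        try:
            if not 0 <= float(age) <= 120:
                problems.append(f"Age out of range: {age}")
        except (TypeError, ValueError):
            problems.append(f"Age is not numeric: {age}")

    smoker = row.get("Smoker", "")
    if smoker not in ("Yes", "No", MISSING):
        problems.append(f"Smoker must be Yes, No or NA, got: {smoker!r}")

    return problems


# Made-up rows for illustration.
rows = [
    {"Participant_ID": "P001", "Age": "34", "Smoker": "No"},
    {"Participant_ID": "P002", "Age": "150", "Smoker": "yes"},
    {"Participant_ID": "P003", "Age": "NA", "Smoker": "Yes"},
]

print("Bad headers:", check_headers(["Participant_ID", "Age", "Blood Pressure"]))
for row in rows:
    for problem in check_row(row):
        print(row["Participant_ID"], "-", problem)
```

Note that the check is case-sensitive on purpose: "yes" and "Yes" would otherwise be counted as two different categories in a frequency table.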
IMPORTANCE OF UNIQUE IDENTIFIERS IN DATA ACCURACY AND SECURITY

- Tracking Individual Data Points: A unique identifier (e.g., a participant ID) ensures that each row of data corresponds to a specific individual or observation, allowing you to track and reference data accurately throughout the analysis.
- Facilitating Data Merging and Linking: In biostatistical analysis, data often comes from multiple sources (e.g., surveys, clinical records, lab results). A unique identifier allows you to merge datasets from different sources while maintaining data integrity, ensuring that information is linked correctly to the same individual or observation.
- Minimizing Errors: Using a unique identifier helps avoid confusion or duplication, reducing the risk of mixing up data from different participants. It also facilitates efficient data cleaning and auditing.
- Ensuring Data Privacy: A unique identifier can be used in place of personally identifiable information (PII) to maintain participant confidentiality, while still allowing accurate tracking and analysis of individual data.

Binning Quantitative Variables

GROUPING QUANTITATIVE VARIABLES FOR ENHANCED DATA INTERPRETATION

Converting a continuous variable into a categorical variable involves grouping data into predefined intervals or categories. Continuous variables, such as age or income, represent an infinite range of values, while categorical variables represent data organized into distinct groups. To categorize a continuous variable, you define specific ranges or intervals and assign each data point to the appropriate group. For example, age can be divided into categories like "18-24," "25-34," "35-44," and so on.

Example of Categorizing a Quantitative Variable in a Health Study

Categorizing quantitative variables may be necessary to simplify analysis, identify trends, or make data easier to interpret.
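
As a rough illustration of the interval-based grouping described above, the sketch below assigns ages to the three intervals named in the text ("18-24", "25-34", "35-44") and counts how many observations fall in each. The names `AGE_BINS` and `age_group`, the catch-all label "Other" for ages outside those intervals, and the choice to count age in completed years are all assumptions made for the example; the ages are made up.

```python
from collections import Counter

# (lowest age, highest age, label) for each interval named in the text.
AGE_BINS = [(18, 24, "18-24"), (25, 34, "25-34"), (35, 44, "35-44")]


def age_group(age, bins=AGE_BINS, other="Other"):
    """Return the label of the interval that contains `age`.

    Age is truncated to completed years first (an assumption), so 24.9 counts as 24.
    Ages outside every interval get the hypothetical catch-all label `other`.
    """
    years = int(age)
    for low, high, label in bins:
        if low <= years <= high:
            return label
    return other


ages = [19, 23, 31, 44, 27, 52]            # made-up continuous values
groups = [age_group(a) for a in ages]      # the new categorical variable
counts = Counter(groups)                   # frequency table of the categories

for label, n in counts.items():
    print(label, n)
```

Keeping the original numeric column alongside the new categorical one is usually worthwhile, since binning discards information that cannot be recovered from the labels alone.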
For example, BMI (Body Mass Index) is a continuous variable, but in a study examining obesity risk, it might be grouped into categories such as "Underweight," "Normal weight," "Overweight," and "Obese" based on predefined BMI ranges. Categorizing BMI allows researchers to more easily assess the prevalence of obesity in different populations and analyze health outcomes across different BMI groups.

CONCEPT OF "BINNING" OR "GROUPING" VALUES

Binning or grouping involves dividing the range of a continuous variable into smaller, more manageable intervals or categories. Binning aims to reduce the complexity of the data, making it easier to analyze and interpret. For example, instead of analyzing age as a continuous variable, researchers might group it into intervals such as "Children," "Adults," and "Elderly" to study health outcomes across different age groups. This technique simplifies the data and makes patterns and relationships more apparent, especially when analyzing large datasets.

Quantitative Variable -> Binning -> Qualitative Variable

HOW CAN CATEGORIZING CONTINUOUS VARIABLES IMPROVE DATA INTERPRETATION

In certain analytical contexts, categorizing continuous variables into discrete groups can provide a clearer structure for data analysis. This transformation is particularly useful when handling large datasets or when aiming to simplify complex data for better interpretation and comparison. The following highlights some key advantages of categorizing continuous variables in data analysis.

- Simplifying Complex Data: In large datasets, continuous variables can have a wide range of values, making it difficult to detect patterns. By categorizing the data, analysts can focus on key trends without being overwhelmed by the variation in individual values.
- Facilitating Comparisons: When comparing different groups (e.g., patients with different levels of cholesterol or blood pressure), categorizing continuous variables allows for clearer comparisons and easier interpretation of results.
- Improving Statistical Analysis: In some cases, continuous variables may not meet the assumptions of certain statistical tests, such as normality. Binning can help transform a variable into a categorical form that is more suitable for statistical techniques, such as chi-square tests, which analyze categorical data.
- Enhancing Decision-Making: Categorizing data into meaningful groups can help healthcare professionals make more informed decisions. For example, classifying patients' risk levels (low, medium, high) based on continuous measurements like blood pressure or glucose levels can help prioritize treatment plans.

In conclusion, the conversion of quantitative variables into categorical variables through binning or grouping can facilitate data analysis by reducing complexity, supporting comparative analysis, and enhancing the applicability of statistical methodologies in health research.

Visual Presentation of Data

In medical biostatistics, effectively presenting data visually is crucial for clear communication and understanding of research findings. Graphical representations help researchers, clinicians, and policymakers identify patterns, trends, and outliers, as well as make data-driven decisions.
Below are key points on common graph types and the use of software like Jamovi for data visualization.

COMMON TYPES OF GRAPHS FOR QUANTITATIVE AND CATEGORICAL DATA

For quantitative data we use:
- Scatter plots, which help identify relationships between two continuous variables
- Box plots, which are useful for summarizing the distribution and identifying outliers
- Histograms, which display the distribution of the data, showing how frequently data points fall within specific intervals

For qualitative data we use:
- Bar charts, which show the frequency or proportion of each category
- Pie charts, which are used to display the relative proportion of each category within a whole

KEY DIFFERENCES BETWEEN A HISTOGRAM AND A BAR CHART

USING JAMOVI SOFTWARE FOR DATA VISUALIZATION

Jamovi offers a variety of visualization tools essential for medical biostatistics, including histograms, box plots, bar charts, and scatter plots. The software allows for easy customization of graphs, including adjustments to labels, axis scales, and color schemes, to enhance clarity and presentation. Additionally, Jamovi seamlessly integrates visualizations with statistical analysis outputs, facilitating a more efficient workflow for health researchers and biostatisticians. This combination of intuitive design and powerful analytical capabilities makes Jamovi a valuable tool for data visualization in medical research.

JAMOVI: A POWERFUL TOOL FOR DATA VISUALIZATION AND STATISTICAL ANALYSIS

Below is a summary of key features in Jamovi, along with screenshots giving visual examples of each capability.

1. Data Visualization: Jamovi provides a wide range of graphical tools for visualizing data, including histograms, box plots, bar charts, scatter plots, and more. These graphs can be customized for clarity and better presentation.
2. Statistical Analysis Integration: Jamovi integrates statistical analysis tools with data visualization, allowing for the generation of descriptive statistics, t-tests, ANOVAs, regression analyses, and more, with corresponding visualizations available simultaneously.
3. Descriptive Statistics: Jamovi allows for the computation of descriptive statistics, such as mean, median, and standard deviation, with results displayed alongside relevant visualizations like histograms or bar charts.
4. Regression Analysis (Linear and Logistic): Jamovi supports both linear and logistic regression analyses, enabling the visualization of results through scatter plots and regression lines.
5. ANOVA (Analysis of Variance): Jamovi facilitates one-way and two-way ANOVA, enabling group comparisons and visual representation through box plots, bar charts, and interaction plots.
6. Factor Analysis and Principal Component Analysis (PCA): Jamovi offers tools for performing factor analysis and PCA, displaying results in formats such as biplots, scree plots, or other appropriate visual representations.
7. Non-parametric Tests: Jamovi supports various non-parametric tests, including the Mann-Whitney U test, Kruskal-Wallis test, and Friedman test, with visual outputs such as box plots for better interpretation.
8. Customizable Plots: Jamovi allows extensive customization of plots, including changes to axis labels, color schemes, and formatting, enabling tailored visualizations for specific presentation needs.
9. Data Import and Export: Jamovi supports importing data from various file formats, such as CSV, Excel, and SPSS, as well as exporting analysis results for further use in other software or reporting tools.
10. Reliability Analysis (Cronbach's Alpha): Jamovi offers tools for calculating reliability coefficients such as Cronbach's Alpha, frequently used in psychometric analysis, with results visualized as necessary.
11. Data Transformation: Jamovi allows for the transformation of data, such as creating new variables, recoding existing ones, or normalizing data for further analysis.

Study Designs in Medical and Epidemiological Research

DEFINITION OF A CROSS-SECTIONAL STUDY

A cross-sectional study is an observational research design in which data are collected from a population or a representative subset at a single point in time. This design provides a snapshot of health outcomes, characteristics, or other variables of interest within a population. Cross-sectional studies are frequently employed in epidemiology to assess the prevalence of diseases, risk factors, or health behaviors.

RESEARCH STUDY DESIGNS: OBSERVATIONAL AND EXPERIMENTAL STUDIES

Research studies can be broadly categorized into different types based on their design and purpose. These include cross-sectional, longitudinal, and experimental studies, each offering unique approaches to data collection and analysis. While some studies focus on observing a particular moment, others track changes over time or introduce interventions to explore causality. Understanding the distinctions between these study types is crucial in selecting the appropriate methodology for specific research questions and objectives.

Observational Studies:

In cross-sectional studies, data are collected from a population or representative subset at a single point in time. These studies are commonly used to assess prevalence and associations between variables but do not provide temporal information, making causation difficult to establish.

Longitudinal studies collect data from the same subjects over an extended period. By following individuals over time, these studies capture developments and changes in health or behavior, making them well suited to studying progression and possible causes. This temporal sequence strengthens the evidence for causative relationships and allows researchers to observe trends within a population.
Experimental Studies:

Experimental (or interventional) studies, such as randomized controlled trials, involve actively introducing a specific treatment or intervention and observing its effects. These studies include control groups and randomization, making them the gold standard for testing causative hypotheses. Unlike observational designs, experimental studies provide strong evidence of causation due to their controlled conditions and ability to isolate the effect of an intervention.

WHY CROSS-SECTIONAL STUDIES ARE COMMONLY USED IN MEDICAL AND EPIDEMIOLOGICAL RESEARCH

Cross-sectional studies are valuable in medical and epidemiological research for several reasons:

- Efficiency and Cost-Effectiveness: Because data collection occurs only once, cross-sectional studies are quicker and more cost-effective than longitudinal studies. They require fewer resources and are practical for studying large populations.
- Estimating Prevalence: Cross-sectional studies are particularly useful for estimating the prevalence of diseases, risk factors, or health behaviors within a defined population, making them effective tools for public health surveillance and needs assessment.
- Hypothesis Generation: These studies often reveal associations and patterns among variables, serving as a basis for generating hypotheses. While they do not establish causation, cross-sectional studies can inform future research, such as longitudinal or experimental studies, that explores causal relationships.

ORGANIZING DATA IN CROSS-SECTIONAL STUDIES

Data from cross-sectional studies are typically organized into a structured dataset where each row represents an individual participant or observation, and each column corresponds to a specific variable (e.g., demographic information, exposures, health outcomes). A data dictionary is often recommended to ensure clarity and consistency, as it defines variable names, measurement units, and allowable values.
This structured approach to data organization helps reduce errors and ensures that the data are easily interpretable for analysis.

Analyzing Data from Cross-Sectional Studies

Key Steps in Analyzing Data from Cross-Sectional Studies

Data from a cross-sectional study can be organized and analyzed in various ways:

- Descriptive Statistics: The data can be summarized using measures such as means, medians, frequencies, and percentages to describe the characteristics of the study population.
- Categorizing Variables: Variables can be classified into different categories (e.g., age groups, disease status) to facilitate comparative analysis and help uncover patterns or differences between subgroups.
- Statistical Tests: Methods like chi-square tests, t-tests, or regression analysis can be applied to explore the relationships between categorical or continuous variables, such as examining the association between smoking and lung disease.
- Prevalence Estimates: Researchers can calculate the prevalence of conditions (e.g., the proportion of individuals with hypertension) and investigate factors that may be related to these conditions.

WHEN IS A CHI-SQUARE TEST APPROPRIATE TO USE?

The chi-square test is used when you want to:

- Assess the relationship between two categorical variables (either nominal or ordinal).
- Evaluate whether there is a significant difference between the observed and expected frequencies within categories.

For example, a chi-square test can be used to determine if there is an association between gender (male/female) and smoking status (smoker/non-smoker) in a population.

HOW WOULD YOU INTERPRET A SIGNIFICANT P-VALUE IN A T-TEST OR CHI-SQUARE TEST?

A significant p-value (typically p < 0.05) indicates that results at least as extreme as those observed would be unlikely if the null hypothesis were true.
- T-test: If the p-value is significant, it suggests a statistically significant difference between the means of the two groups being compared (e.g., treatment vs. control).
- Chi-square test: A significant p-value suggests an association between the categorical variables being tested, showing that the observed distribution differs significantly from the expected distribution.

DESCRIBE THE PURPOSE OF A T-TEST

A t-test is used to compare the means of two groups and assess whether they are significantly different from each other. Two common types are:

- Independent t-test: Compares the means between two independent groups (e.g., control group vs. treatment group).
- Paired t-test: Compares means within the same group at different times or under different conditions (e.g., before and after treatment for the same participants).

The goal is to determine whether any differences observed between group means are statistically significant or if they might have occurred by chance.

WHAT IS THE PEARSON CORRELATION COEFFICIENT (R), AND HOW IS IT USED TO MEASURE THE RELATIONSHIP BETWEEN TWO VARIABLES?

The Pearson correlation coefficient (r) measures the strength and direction of the linear relationship between two continuous variables.

Range: r values range from -1 to +1.
- r = 1: A perfect positive correlation, meaning that as one variable increases, the other also increases along an exact straight line.
- r = -1: A perfect negative correlation, meaning that as one variable increases, the other decreases along an exact straight line.
- r = 0: No linear correlation.

Interpretation:
- Positive r: Both variables increase together (e.g., height and weight).
- Negative r: As one variable increases, the other decreases (e.g., hours of exercise and body fat percentage).
- r ≈ 0: There is little to no linear relationship between the variables.
Pearson's r is widely used in fields such as psychology, medicine, and social sciences to assess the strength of the relationship between two continuous variables.
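
To show where the value of r comes from, here is a minimal Python sketch of the standard formula: the sum of the products of each pair's deviations from the two means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` is hypothetical and the height and weight figures are made up for illustration; in practice Jamovi or another statistics package would compute this, together with a p-value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    if n < 2:
        raise ValueError("at least two paired observations are needed")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of cross-products and sums of squares of the deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)

    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")

    return sxy / math.sqrt(sxx * syy)


# Made-up paired measurements for five participants.
height_cm = [160, 165, 170, 175, 180]
weight_kg = [55, 62, 66, 74, 79]

r = pearson_r(height_cm, weight_kg)
print(f"r = {r:.3f}")  # a value close to +1 indicates a strong positive linear relationship
```

Because r only captures linear association, a scatter plot of the two variables is worth checking alongside the coefficient: a clear curved relationship can still give an r near 0.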