Introduction to Data Science

Questions and Answers

Which of the following is NOT a primary category of data in data science?

  • Structured Data
  • Data Streams
  • Unstructured Data
  • Abstract Data (correct)

Which of the following is the MOST direct goal of data science?

  • Building faster computer hardware.
  • Extracting knowledge and insights from data. (correct)
  • Creating complex mathematical formulas.
  • Developing new programming languages.

Which of the following data types represents characteristics or attributes divided into distinct groups or categories?

  • Categorical Data (correct)
  • Quantitative Data
  • Discrete Data
  • Continuous Data

Which of the following is an example of ordinal data?

  • Education levels (high school, bachelor's, master's) (correct)

What distinguishes ratio scale data from interval scale data?

  • Ratio data has a meaningful zero point. (correct)

In the context of data measurement scales, what characteristic is present in ordinal, interval, and ratio scales, but NOT in nominal scales?

  • Magnitude (correct)

Which sampling method ensures every member of the population has an equal chance of being selected?

  • Simple Random Sampling (correct)

Which type of analysis is used to answer 'Why did this happen?' by identifying relationships and patterns?

  • Diagnostic Analysis (correct)

Which of the following types of analysis seeks to provide recommendations for actions to achieve desired outcomes?

  • Prescriptive Analysis (correct)

What is a primary goal of Exploratory Data Analysis (EDA)?

  • To summarize main characteristics and uncover patterns in a dataset. (correct)

What is a key challenge associated with Predictive Data Analysis (PDA)?

  • The potential for overfitting complex models to training data. (correct)

Which of the following statements is a misconception about data science?

  • Data science is solely about predictive modeling. (correct)

In the data science life cycle, which stage involves gathering relevant data from various sources, ensuring it aligns with the problem being addressed?

  • Data Collection (correct)

What does a probability of 1 indicate?

  • The event is certain to occur. (correct)

Two dice are rolled. Event A is that the first die shows a 2. Event C is that the first die shows an even number. What is P(A ∩ C)?

  • 1/6 (correct)
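The answer can be checked by listing every outcome. This is a minimal sketch that assumes fair dice and reads Event C as "the first die shows an even number", which is the reading consistent with the answer 1/6; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the first die shows a 2.
event_a = {o for o in sample_space if o[0] == 2}
# Event C (assumed reading): the first die shows an even number.
event_c = {o for o in sample_space if o[0] % 2 == 0}

# Every outcome in A is also in C, so the intersection is A itself.
p_a_and_c = Fraction(len(event_a & event_c), len(sample_space))
print(p_a_and_c)  # 1/6
```

If Event C were instead about the second die, the same enumeration would give a different value, so the wording of the event matters.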

What is the purpose of data preprocessing in data analysis and machine learning?

  • To transform raw data into a clean and structured format for analysis. (correct)

What is a common technique for handling missing values in a dataset?

  • Imputation (correct)
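As an illustration, mean imputation is one simple form of this technique: each missing value is replaced by the mean of the observed values. The sketch below assumes missing entries are marked with `None`; the function name `impute_mean` and the sample data are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [25.0, None, 31.0, 40.0, None]
print(impute_mean(ages))  # [25.0, 32.0, 31.0, 40.0, 32.0]
```

Mean imputation is easy to apply but reduces the variance of the column, so other strategies (median, model-based) may suit skewed data better.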

What is the primary goal of data curation?

  • Ensuring data remains accessible, reliable, and valuable throughout its lifecycle. (correct)

Which KDD step involves applying algorithms to extract patterns from prepared data?

  • Data Mining (correct)

The 68-95-99.7 rule applies to which distribution?

  • Normal distribution (correct)
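The rule can be verified numerically. A minimal sketch using the standard library's `statistics.NormalDist`; the helper name `mass_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {mass_within(k):.4f}")
```

The three printed values are approximately 0.68, 0.95, and 0.997, which is where the rule gets its name.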

Flashcards

What is Data Science?

A multidisciplinary field that extracts knowledge from various data formats, using computer science, mathematics, and statistics.

Sources of Data

Databases, web logs, social media, and sensors.

Data Processing

Cleaning, integrating, and transforming data to prepare it for analysis, including handling missing values and converting formats.

Data Analysis

Using statistical and computational tools to identify trends, correlations, and patterns.

Data Visualization

Presenting results using charts, graphs, and dashboards to communicate findings.

Real-World Applications of Data Science

Customer behavior prediction, disease prediction, fraud detection, and route optimization.

Structured Data

Data that resides in fixed fields and conforms to a predefined schema, typically stored in databases, spreadsheets, and CSV files.

Semi-Structured Data

Data that contains tags or markers to separate semantic elements, such as JSON documents, XML files, and records in NoSQL databases.

Unstructured Data

Data that lacks a predefined format or organization, such as text documents, emails, and social media posts.

Data Streams

Continuous flows of data generated in real-time by various sources.

Categorical Data

Represents characteristics or attributes divided into groups or categories without numerical meaning, such as eye color.

Quantitative Data

Consists of numerical values representing counts or measurements; it comes in two types, discrete and continuous.

Nominal Scale

Categorizes data without any specific order or rank (for example, fruit types).

Ordinal Scale

Categorizes data with a meaningful order but without uniform differences between categories (for example, finishing places in a race).

Interval Scale

Numerical data with ordered categories and equal intervals, but no true zero, such as temperature in Celsius.

Ratio Scale

Has all the properties of the interval scale plus a true zero, where a value of zero means the measured attribute is absent.

Descriptive Analysis

Summarizes the main features of a dataset using measures such as mean, median, mode, and standard deviation.

Diagnostic Analysis

Delves deeper to understand reasons behind outcomes by identifying existing relationships and patterns.

Predictive Analysis

Uses historical data to forecast future events employing statistical models and machine learning.

Prescriptive Analysis

Provides recommendations for actions by considering scenarios and their potential impact, using techniques such as optimization.

Study Notes

  • Data Science is a multidisciplinary field extracting knowledge and insights from data by integrating computer science, mathematics, statistics, and domain expertise
  • The goal of data science is to support data-driven decisions through pattern identification, prediction, and problem-solving

Key Aspects of Data Science

  • Data Collection involves obtaining data from various sources in structured, semi-structured, or unstructured formats
  • Data Processing includes cleaning, integrating, and transforming data
  • Analysis and Interpretation uses statistical and computational tools for trend identification
  • Visualization and Presentation involves communicating findings through charts, graphs, and dashboards using tools such as Tableau, Matplotlib, and Power BI
  • In business, applications include predicting customer behaviour and optimizing supply chains
  • In healthcare, applications include disease prediction and health monitoring
  • In finance, applications include fraud detection and risk analysis
  • In transport, applications include route optimization and logistics planning
  • Data science uses machine learning, AI and big data tools

Data Types

  • Structured Data is highly organized for easy searching and analysis, residing in fixed fields within records or files, used in CRM and OLTP systems
  • Semi-Structured Data contains tags or markers to separate semantic elements and enforce hierarchies, serving as a middle ground between structured and unstructured
  • Unstructured Data lacks a predefined format and requires specialized techniques such as NLP, computer vision, and machine learning for analysis
  • Data Streams are continuous flows of real-time data that require immediate processing

Statistical Data Types

  • Data is split into categorical and quantitative types so that the correct analytical methods can be selected and the results interpreted properly
  • Nominal Data is categorical data whose qualitative attributes fall into distinct, unordered groups, such as eye color or cuisine type
  • Ordinal Data is categorical data whose categories have a meaningful order but inconsistent intervals, such as education levels or satisfaction ratings
  • Quantitative Data consists of numerical values for counts or measurements
  • Discrete Data takes countable whole-number values, such as the number of students in a class or cars in a lot
  • Continuous Data can take any value within a range, for example height, weight, or temperature

Measurement of Data

  • Assigning values to variables according to set rules allows for appropriate statistical analyses
  • Nominal Scale categorizes unordered data, such as labeling different fruits with numbers without ranking them
  • Ordinal Scale categorizes data with a meaningful order but inconsistent intervals, like ranking runners in a race
  • Interval Scale uses numerical data with consistent intervals but lacks a true zero, such as temperature in Celsius
  • Ratio Scale is similar to the interval scale but includes a true zero point, such as weight or height

Measurement Scales Characteristics

  • Identity means each value has a unique meaning; for example, gender can be recorded as "F" for female and "M" for male
  • Magnitude means values can be ranked or ordered, for example in a range from "low income" to "high income"
  • Equality in Intervals means the difference between any two consecutive values is consistent
  • Minimum or Zero Value means the scale has a true zero, where 0 indicates the absence of the measured attribute

Four Measurement Scales

  • Nominal Scale has identity only, such as gender
  • Ordinal Scale has identity and magnitude, such as income categories from low to high
  • Interval Scale has identity, magnitude, and equal intervals, such as temperature in Celsius
  • Ratio Scale has identity, magnitude, equal intervals, and a minimum zero value, such as temperature in Kelvin
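The practical difference between the interval and ratio scales shows up when taking ratios. A minimal sketch comparing the same two temperatures in Celsius (interval) and Kelvin (ratio); the temperatures chosen are arbitrary illustrative values.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15

t1_c, t2_c = 10.0, 20.0
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on both scales and agree (equal intervals).
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios are only meaningful on the ratio scale, which has a true zero.
ratio_c = t2_c / t1_c  # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_k = t2_k / t1_k  # about 1.035, the physically meaningful ratio

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```

Because 0 °C does not mean "no temperature", the Celsius ratio depends on where the zero was placed, while the Kelvin ratio does not.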

Methods of Sampling

  • Sampling involves selecting a subset from a population
  • This makes inferences about the entire population practical and cost-effective
  • Probability Sampling gives every member of the population a non-zero chance of being selected; it includes the four methods below
  • Simple Random Sampling: every member has an equal chance of selection
  • Systematic Sampling: every nth member is chosen
  • Stratified Sampling: the population is divided into subgroups and a random sample is drawn from each
  • Cluster Sampling: the population is divided into clusters, some clusters are selected, and every member of the selected clusters is sampled
  • Non-Probability Sampling does not give every member a chance of being included, which can bias the results; it includes the four methods below
  • Convenience Sampling: the members who are easiest to reach are selected
  • Quota Sampling: the sample is built to meet certain quotas, such as 50% male and 50% female
  • Purposive (Judgmental) Sampling: individuals are selected based on their expertise
  • Snowball Sampling: participants are recruited from the acquaintances of existing participants
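The four probability sampling methods above can be sketched with the standard library's `random` module. This is an illustrative sketch, not a survey-grade implementation: the function names, the population of integers, and the even/odd strata are all hypothetical.

```python
import random

def simple_random_sample(population, k, rng):
    """Every member has an equal chance of selection."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Pick a random start, then take every step-th member."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(population, key, per_stratum, rng):
    """Group members into strata by key, then sample randomly within each."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and keep every member of each."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, lambda n: n % 2, 5, rng)
clustered = cluster_sample({"a": [1, 2, 3], "b": [4, 5], "c": [6]}, 2, rng)

print(srs, systematic, stratified, clustered, sep="\n")
```

Non-probability methods such as convenience or snowball sampling have no random selection step, so they cannot be expressed as a random draw in this way.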

The Importance of Sampling

  • It allows researchers to draw conclusions about a population without examining every individual
  • The chosen sampling method must align with the research objectives to ensure validity and reliability
  • Probability sampling methods are generally preferred for their ability to produce representative samples

Data Analysis Methods

  • Data Analysis encompasses methods to extract insights from data
  • These methods can be broadly categorized into four primary types: Descriptive, Diagnostic, Predictive, and Prescriptive

Descriptive Analysis

  • Descriptive Analysis focuses on summarizing and understanding main dataset characteristics
  • Descriptive Analysis provides insights via mean, median, mode, and standard deviation
  • Descriptive analysis uses visualizations
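The summary measures named above are available in Python's `statistics` module. A minimal sketch on a small hypothetical list of exam scores:

```python
import statistics

# Hypothetical exam scores.
scores = [70, 85, 85, 90, 100]

summary = {
    "mean": statistics.mean(scores),      # arithmetic average
    "median": statistics.median(scores),  # middle value when sorted
    "mode": statistics.mode(scores),      # most frequent value
    "stdev": statistics.stdev(scores),    # sample standard deviation
}
print(summary)
```

Note that `statistics.stdev` computes the sample standard deviation; `statistics.pstdev` would give the population version.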

Diagnostic Analysis

  • Diagnostic Analysis seeks to understand the reasons behind past outcomes
  • Diagnostic analysis seeks to answer "why did this happen?"
  • Drill-Down Analysis is used
  • Data mining is used
  • Correlation Analysis is employed
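As an example of correlation analysis, the Pearson correlation coefficient measures the strength of a linear relationship between two variables. The sketch below computes it by hand; the function name `pearson_r` and the data are illustrative, and Pearson's coefficient is only one of several correlation measures.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / (spread_x * spread_y)

# Hypothetical data: y rises in step with x, so r is close to +1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value near +1 or -1 indicates a strong linear relationship, but correlation alone does not establish why the outcome happened.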

Predictive Analysis

  • Predictive Analysis uses historical data to forecast future events
  • Predictive Analysis answers "what is likely to happen?"
  • Regression Analysis is employed
  • Time Series Analysis is used
  • Classification and clustering are employed
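To illustrate regression analysis, the sketch below fits a straight line to historical points by ordinary least squares and uses it to forecast a new value. The function names and the data are hypothetical, and simple linear regression is only the most basic member of this family.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Forecast y for a new x using the fitted line."""
    return slope * x + intercept

# Hypothetical history: sales in periods 1 to 4.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
forecast = predict(5, slope, intercept)
print(slope, intercept, forecast)  # 2.0 1.0 11.0
```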

Prescriptive Analysis

  • Prescriptive Analysis provides recommendations for achieving outcomes
  • Prescriptive analysis answers "what should we do?"
  • Optimization techniques are used

Exploratory Data Analysis (EDA)

  • EDA involves examining datasets to summarize their main characteristics
  • EDA uncovers patterns, anomalies, and relationships without relying on prior assumptions
  • EDA has four primary objectives: identifying patterns, spotting anomalies, testing hypotheses, and checking assumptions
  • Common EDA Techniques: Data Visualization, descriptive statistics, Data Transformation

Inferential Analysis

  • Inferential analysis draws conclusions about a population from sample data
  • Inference extends beyond the immediate data available
  • Key components of inferential analysis: estimating parameters, hypothesis testing, and constructing confidence intervals
  • Common inferential methods: t-tests, ANOVA, and regression analysis
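As a small illustration of a t-test, the sketch below computes the one-sample t statistic, which compares a sample mean with a hypothesized population mean. The function name and the data are hypothetical. Turning the statistic into a p-value requires the t distribution, which the standard library does not provide, so that step is left out here.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t statistic for testing whether the population mean equals mu0."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error

# Hypothetical sample tested against a hypothesized mean of 3.0.
t_stat = one_sample_t([2.0, 4.0, 6.0], 3.0)
print(round(t_stat, 3))  # 0.866
```

The statistic has n - 1 degrees of freedom; a larger absolute value means the sample mean lies further from the hypothesized mean relative to the sampling noise.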

EDA vs Inference

  • EDA's purpose is to explore and visualize data without formal modelling, whereas inference sets out to make predictions
  • EDA's approach is informal, whereas inference is formal
  • EDA aims to build understanding, whereas inference relies on hypothesis testing

Predictive Data Analysis (PDA)

  • PDA forecasts future outcomes by applying algorithms to datasets
  • PDA applications include customer behaviour prediction, risk assessment, and optimization
  • PDA challenges include data quality, model complexity, and interpretability

Data Science Misconceptions

  • Data science is only about modeling - Data science also draws on a broad range of statistics
  • All data roles are identical - Data engineers focus on building data infrastructure, while data analysts interpret data
  • Data science is solely for math experts - The field benefits from diverse skills and backgrounds
  • Data science provides absolute answers - Results require context
  • Quantity is better than quality - Poor-quality data undermines results regardless of volume
  • Data science can be fully automated - Humans must be involved to make decisions

Data science applications

  • Medical image analysis and prediction
  • Fraud detection
  • Recommendation systems
  • The data science life cycle helps to extract valuable insights from data

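Key Probability Definitions

  • Probability is a numerical value between 0 and 1 reflecting the likelihood of an event
  • For equally likely outcomes, the probability of an event is the number of favorable outcomes divided by the total number of outcomes in the sample space
  • Sample space represents all possible outcomes of the experiment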

Key Definitions

  • Independent Events: the outcome of one does not affect the probability of the other
  • Disjoint Events: events that do not overlap, so they cannot occur together
  • Intersection of Events: the set of outcomes common to both events
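These three definitions can be checked by treating events as sets of outcomes. A minimal sketch assuming two fair dice; the event names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: 36 equally likely ordered pairs.
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a subset of the sample space."""
    return Fraction(len(event), len(sample_space))

first_is_2 = {o for o in sample_space if o[0] == 2}
first_is_3 = {o for o in sample_space if o[0] == 3}
second_is_even = {o for o in sample_space if o[1] % 2 == 0}

# Intersection: outcomes common to both events.
both = first_is_2 & second_is_even

# Independent: P(A and B) equals P(A) * P(B).
independent = prob(both) == prob(first_is_2) * prob(second_is_even)

# Disjoint: the intersection is empty, so the events cannot occur together.
disjoint = len(first_is_2 & first_is_3) == 0

print(prob(both), independent, disjoint)  # 1/12 True True
```

Disjoint events with non-zero probability are never independent, because knowing that one occurred rules out the other.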

Conditional Probability

  • Conditional probability is the probability of an event given that another event has already occurred
  • Formula: P(X|Y) = P(X∩Y) / P(Y), where P(Y) > 0
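The formula can be applied directly by counting outcomes. A minimal sketch using a standard 52-card deck, where X is "the card is a king" and Y is "the card is a face card"; the example and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely cards

def prob(event):
    """Probability of drawing a card that belongs to the event."""
    return Fraction(len(event), len(deck))

def conditional(x, y):
    """P(X|Y) = P(X and Y) / P(Y), defined only when P(Y) > 0."""
    if not y:
        raise ValueError("P(Y) must be greater than zero")
    return prob(x & y) / prob(y)

kings = {card for card in deck if card[0] == "K"}
face_cards = {card for card in deck if card[0] in {"J", "Q", "K"}}

p_king_given_face = conditional(kings, face_cards)
print(p_king_given_face)  # 1/3
```

Knowing the card is a face card shrinks the sample space from 52 cards to 12, of which 4 are kings.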

Applying Bayes' Theorem:

  • Bayes' theorem reverses conditional probabilities and follows the formula:

P(X|Y) = (P(Y|X) × P(X)) / P(Y)
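A worked example helps show how the theorem reverses a conditional probability. The sketch below uses a diagnostic-test scenario in which X is "has the condition" and Y is "tests positive". All of the numbers are hypothetical and chosen only for illustration; P(Y) is obtained from the law of total probability.

```python
from fractions import Fraction

# Hypothetical inputs (not taken from any real test).
p_x = Fraction(1, 100)               # P(X): prior probability of the condition
p_y_given_x = Fraction(95, 100)      # P(Y|X): positive test when condition present
p_y_given_not_x = Fraction(5, 100)   # P(Y|not X): false positive rate

# Law of total probability: P(Y) = P(Y|X)P(X) + P(Y|not X)P(not X).
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y).
p_x_given_y = p_y_given_x * p_x / p_y

print(p_x_given_y, float(p_x_given_y))  # 19/118, roughly 0.161
```

Even with an accurate test, the posterior stays low here because the condition is rare, which is the kind of result the reversed conditional makes visible.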

Random Variables Definition:

  • A random variable maps the outcomes of an experiment to numerical values
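As an illustration, the total shown by two fair dice is a random variable: it maps each outcome (a pair of faces) to a number. The sketch below builds its probability distribution and expected value; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Outcomes of the experiment: ordered pairs of faces from two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def total(outcome):
    """The random variable: maps an outcome to the sum of the two faces."""
    return outcome[0] + outcome[1]

# Probability of each value the random variable can take.
counts = Counter(total(o) for o in outcomes)
pmf = {value: Fraction(count, len(outcomes)) for value, count in counts.items()}

# Expected value: each value weighted by its probability.
expected_value = sum(value * p for value, p in pmf.items())

print(pmf[7], expected_value)  # 1/6 7
```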
