Data Science: Overview, Roles and Tools

Questions and Answers

What is the central aim of Data Science?

  • Extracting actionable insights from data. (correct)
  • Developing complex algorithms.
  • Creating data visualizations.
  • Managing large databases.

Which discipline is least relevant to Data Science?

  • Statistics
  • Astrology (correct)
  • Domain Knowledge
  • Computer Science

Which task is typically outside the purview of a Data Scientist's responsibilities?

  • Creating data visualizations.
  • Writing fiction (correct)
  • Developing predictive models.
  • Analyzing statistical data.

What is the initial step in the Data Science Life Cycle?

  • Problem Formulation (correct)

Which platform is engineered for collaborative coding and data analysis in a notebook environment?

  • Google Colab (correct)

What is the minimum score required to pass the Digital Skills Essentials (DSE) assessment?

  • 60% (correct)

Which Python library excels in creating insightful statistical visuals?

  • Seaborn (correct)

Which of the listed options functions as a Machine Learning framework within Python?

  • PyTorch (correct)

Which outcome is not typically associated with using Jupyter Notebooks?

  • Static HTML pages (correct)

What types of data can Data Science methodologies effectively process?

  • Structured and unstructured data. (correct)

What is the main goal of data analysis?

  • Interpret and understand data (correct)

Which of the following tools is a text-based IDE?

  • Visual Studio Code (correct)

In pivot tables, which component allows for the grouping of data?

  • Rows (correct)

What specific action necessitates the refreshing of a pivot table?

  • You added new data (correct)

What area does TensorFlow primarily address?

  • Deep Learning (correct)

Which of the following is a valid Python variable name?

  • value_2 (correct)

What is the result of the Python expression: 5**2?

  • 25 (correct)

Which of the following is the correct assignment operator in Python?

  • = (correct)

Which of the following is a logical operator in Python?

  • and (correct)

Which operator checks for object identity in Python?

  • is (correct)

Which Python data type is used to store key-value pairs?

  • Dictionary (correct)

Which of the following data types in Python is ordered and mutable?

  • List (correct)

Which statement is used to check a condition in Python?

  • if (correct)

What is the correct syntax for a function definition in Python?

  • def myFunc(): (correct)

What will the following Python code print: print(type(3.14))?

  • <class 'float'> (correct)

Which method is used to read the contents of a file in Python?

  • file.read() (correct)

How do you correctly slice a list in Python to get a sublist from index 1 to 3 (exclusive)?

  • list[1:3] (correct)

Which of the following data types is immutable in Python?

  • Tuple (correct)

What keyword is used to initiate a loop in Python?

  • for (correct)

What will print(10 not in [5, 10, 15]) return?

  • False (correct)

Which of the following is the first step in data preprocessing?

  • Data Collection (correct)

Which Pandas function is the most appropriate to use when trying to remove duplicate rows from a DataFrame?

  • drop_duplicates() (correct)

What is the primary purpose of the astype() method in Pandas?

  • Converts data types (correct)

If a dataset represents missing numerical values with a dash (-), what should it be replaced with?

  • 0 (correct)

Which of the following plots is commonly employed for outlier detection?

  • Box plot (correct)

Which column was renamed to 'Grade100/100' in the Certified Test dataset?

  • 'Grade100/00' (correct)

In a dataset of student scores, what label is typically applied to students scoring ≥ 60?

  • 'Passed' (correct)

How can you filter a Pandas DataFrame to only include rows with a specific value in a certain column?

  • df[df.column == value] (correct)

Which Pandas attribute is used to retrieve the data type of each column in a DataFrame?

  • dtypes (correct)

What is the main purpose of data cleaning?

  • To ensure accurate analysis (correct)

Which Python library facilitates parsing XML data?

  • xml.etree.ElementTree (correct)

Which Pandas method is used to combine multiple DataFrames?

  • concat() (correct)

What type of rows are specified to be dropped from a given dataset?

  • Rows where 'State' is not 'Finished' (correct)

Which function calculates the number of null values in Pandas?

  • isnull().sum() (correct)

What new columns are to be created as part of a project?

  • level and q_type (correct)

What is the main goal of Exploratory Data Analysis (EDA)?

  • Understanding the data's structure and patterns (correct)

Which of the following is NOT a typical EDA task?

  • Model tuning (correct)

Which plot is best for visually detecting outliers in a dataset?

  • Box plot (correct)

Which summary statistic represents the center of a dataset?

  • Mean (correct)

Flashcards

What is Data Science?

A field using statistics, computer science, and domain knowledge to gain insights from data.

Core Data Science areas

Statistics, Programming, Machine Learning, and Domain Expertise.

Purpose of Data Visualization

To make data understandable and ease the communication of insights.

Data Science Life Cycle steps

Problem formulation, data collection, data cleaning, EDA, modeling, evaluation, communication.

Data Science in Agriculture

Predicting crop yields using satellite data and weather patterns.

Why is Python preferred?

Easy to learn, rich libraries, strong community, good for prototyping and production.

Python Libraries for Data Science

Pandas, Matplotlib, Scikit-learn.

Pivot tables in Excel

They summarize large datasets quickly by organizing data using rows, columns, values, and filters.

Aggregation functions in pivot tables

Sum, Count, Average, Min, Max.

Refresh a pivot table

Right-click -> Refresh. Ensures your table reflects the source data.

Exploratory Data Analysis (EDA)

Understand data patterns and characteristics before modeling, using statistics and visualizations.

Open Data

Freely accessible datasets for research, testing, and creating public tools.

Data Science vs. Business Intelligence

BI reports past data; Data Science predicts future trends using analytics.

Netflix and Data Science

Recommending shows using algorithms based on viewing history.

Role of domain expertise

Helps interpret data correctly and ensures solutions are practical.

What is a variable?

A container for storing data values.

What does == do?

Checks if two values are equal.

What does 'and' do?

Returns True if both conditions are True.

'is' vs ==

== checks equality of values; 'is' checks object identity.

Mutable vs Immutable

Mutable: list. Immutable: tuple.

Dictionary

A collection of key-value pairs.

What does a tuple look like?

It is written in round brackets, e.g., (1, 2, 3).

What does an if statement do?

Executes code if a condition is true.

for loop

Iterates over a sequence.

def keyword

Defines a function.

Open a file

Use open('filename.txt', 'r').

What is list slicing for?

Extracting a portion of a list using index ranges.

Role of EDA

Helps understand the data's structure, discover patterns, and detect outliers.

Python Libraries for EDA

Pandas and Seaborn (also Matplotlib).

Check for missing values

Use df.isnull().sum() to count missing values per column.

Why is a box plot useful?

It identifies outliers and visualizes the data distribution.

Purpose of data cleaning

Correct or remove inaccurate, incomplete, or irrelevant data.

Rename a pandas column

Use df.rename(columns={'old': 'new'}).

Method to remove duplicate rows

df.drop_duplicates()

Command to replace values in a DataFrame

df.replace('-', 0)

Change a column to integer type

df['col'] = df['col'].astype(int)

Why remove rows where 'State' is not 'Finished'?

It ensures the data only includes completed entries.

Handling missing values

Drop or fill them.

What does pd.concat() do?

Combines multiple DataFrames.

Why understand the dataset before cleaning?

To apply the correct cleaning techniques.

Study Notes

Data Science Overview

  • The primary goal of Data Science is to extract insights from data
  • Data science uses statistics, computer science, and domain knowledge to extract insights from data to support data-driven decisions
  • Data science integrates statistics, programming, machine learning, and domain expertise.
  • Astrology is not a core component of Data Science

Roles in Data Science

  • A Data Scientist typically does not write fiction

Data Science Life Cycle

  • Problem formulation is the first step

Collaboration Tools

  • Google Colab facilitates real-time collaboration in notebooks

Digital Skills Essentials (DSE)

  • The passing score is 60%.

Python Libraries for Data Science

  • Seaborn is a Python library suited for statistical data visualization
  • Pandas helps with data manipulation
  • Matplotlib performs visualization
  • Scikit-learn assists with machine learning

Machine Learning Frameworks in Python

  • PyTorch qualifies as a Machine Learning framework

Jupyter Notebooks

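  • In this course's quiz, static HTML pages are not counted as a typical Jupyter Notebook outcome (notebooks can nevertheless be exported to HTML with nbconvert)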

Data Types in Data Science

  • Data Science deals with both structured and unstructured data

Data Analysis

  • Data analysis aims to interpret and understand data

IDEs

  • Visual Studio Code is a text-based IDE

Pivot Tables

  • Rows are used to group data in a pivot table
  • Pivot tables summarize large datasets quickly using rows, columns, values, and filters
  • Refreshing a pivot table ensures it reflects changes in the source data
  • Right-click the pivot table and select "Refresh" to update it with the latest information
  • Aggregation functions in pivot tables summarize data using functions like Sum, Count, Average, Min, and Max

Exploratory Data Analysis (EDA)

  • EDA is a process to understand data patterns and characteristics before modeling, using statistics and visualizations

Open Data

  • Open Data gives freely accessible datasets for research, testing models, and creating public tools

Data Science vs Business Intelligence

  • Business Intelligence (BI) reports past data, while Data Science predicts future trends using advanced analytics

Applications of Data Science

  • Netflix uses Data Science to enhance user experience through show recommendations based on viewing history
  • Data Science can predict crop yields in agriculture using satellite data and weather patterns

Importance of Domain Expertise

  • Domain expertise helps correctly interpret results and ensures solutions are practical for the target industry

Python Fundamentals

  • value_2 is a valid Python variable name
  • The output of 5 ** 2 is 25
  • "=" is the assignment operator in Python
  • "and" is a logical operator
  • "is" checks for object identity
  • Dictionaries are used to store key-value pairs
  • Lists are ordered and changeable data types
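
The points above can be tried out in a few lines. This is only a minimal sketch: the variable names other than value_2 and the sample values are made up for illustration.

```python
# "=" assigns a value; value_2 is a valid name (letters, digits, underscores).
value_2 = 5

# "**" is the exponent operator.
squared = value_2 ** 2

# "and" is True only when both conditions are True.
in_range = squared > 10 and squared < 100

# A dictionary stores key-value pairs (sample data, not from the lesson).
student = {"name": "Ada", "age": 20}

# A list is ordered and changeable: items can be replaced in place.
scores = [70, 85, 90]
scores[0] = 75

print(squared, in_range, student["age"], scores)  # 25 True 20 [75, 85, 90]
```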

Python Conditionals and Functions

  • The "if" statement checks a condition in Python
  • def myFunc(): is the correct syntax for a Python function definition

Python Data Types and File Handling

  • print(type(3.14)) outputs <class 'float'>
  • file.read() reads a file in Python
  • list[1:3] is the correct slicing syntax for a list
  • Tuples are immutable data types

Python Loops

  • "for" is the keyword used to begin a loop
  • print(10 not in [5, 10, 15]) returns False
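
A short sketch of both points, using the same list as the quiz question. The variable names are illustrative only.

```python
numbers = [5, 10, 15]

# "for" visits each element of the sequence in order.
total = 0
for n in numbers:
    total += n

# "in" and "not in" test membership and evaluate to a bool.
has_ten = 10 in numbers
missing_ten = 10 not in numbers

print(total)        # 30
print(missing_ten)  # False, because 10 is in the list
```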

Variables in Python

  • A variable is a container that stores data values, e.g., student_age = 20
  • The == operator checks if two values are equal
  • The "and" operator checks if both conditions are True
  • "==" checks equality, "is" checks object identity
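
The difference between equality and identity is easiest to see with two lists that hold the same values. A minimal sketch with made-up names:

```python
a = [1, 2, 3]
b = [1, 2, 3]  # a separate list with the same contents
c = a          # a second name for the very same list object

print(a == b)  # True: the values are equal
print(a is b)  # False: they are two different objects
print(a is c)  # True: both names refer to one object
```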

Data Types

  • Lists are mutable; tuples are immutable

Dictionaries

  • Dictionaries are collections of key-value pairs

Tuples

  • Tuples are written in round brackets, e.g., (1, 2, 3)

Conditional Statements

  • The "if" statement executes a block of code if a condition is true

Loops

  • For loops iterate over a sequence (list or range)

Functions

  • The "def" keyword defines a function

File Reading

  • Use open('filename.txt', 'r') to open a file for reading

List Slicing

  • List slicing extracts a part of a list using index ranges
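
As a small illustration of index ranges (the list contents are arbitrary), note that the end index of a slice is excluded:

```python
letters = ["a", "b", "c", "d", "e"]

middle = letters[1:3]    # indices 1 and 2; index 3 is excluded
first_two = letters[:2]  # an omitted start means "from the beginning"
last_two = letters[-2:]  # negative indices count from the end

print(middle)     # ['b', 'c']
print(first_two)  # ['a', 'b']
print(last_two)   # ['d', 'e']
```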

Logical Operators

  • if age > 18 and is_student: is an example of using a logical operator

Data Preprocessing

  • Data collection is the first step
  • The pandas function drop_duplicates() removes duplicate rows
  • Replacing "-" with 0 is part of data cleaning

Pandas Methods

  • The astype() method in pandas converts data types
  • df[df.column == value] keeps rows where a column equals a value
  • df.dtypes returns data types of each column
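
The three calls above might be combined as follows. This is a sketch on a tiny made-up DataFrame; the column names and values are assumptions, not the course dataset. Bracket access `df["state"]` is used here and is equivalent to the attribute form `df.state`.

```python
import pandas as pd

# Hypothetical data: scores arrive as text, as they often do from a CSV file.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Caro"],
    "state": ["Finished", "In progress", "Finished"],
    "score": ["72", "55", "90"],
})

# astype() converts the column from text to integers.
df["score"] = df["score"].astype(int)

# A boolean condition inside the brackets keeps only the matching rows.
finished = df[df["state"] == "Finished"]

# dtypes is an attribute (no parentheses) listing each column's type.
print(df.dtypes)
print(finished)
```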

Handling Missing Data

  • Handling missing data ensures accurate analysis
  • Drop or fill missing values after identifying them

Parsing XML Data

  • The xml.etree.ElementTree module helps parse XML data
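
A minimal sketch of parsing with this module. The XML snippet, its tag names, and the `id` attribute are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the structure is an assumption, not the course data.
xml_text = """
<students>
    <student id="1"><name>Ana</name><score>72</score></student>
    <student id="2"><name>Ben</name><score>55</score></student>
</students>
"""

root = ET.fromstring(xml_text)

# Visit each <student> element and read its attribute and child text.
records = [
    {
        "id": student.get("id"),
        "name": student.findtext("name"),
        "score": int(student.findtext("score")),
    }
    for student in root.findall("student")
]

print(records)
```

A list of dictionaries like `records` can be handed to `pd.DataFrame(records)` if the data should continue into pandas.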

Combining DataFrames

  • The pandas method concat() combines multiple DataFrames

Removing Data

  • Remove rows where ‘State’ is not ‘Finished’ from the dataset

Counting Null Values

  • The isnull().sum() from Pandas counts null values

Adding New Columns

  • The new columns added in the question pool section are level and q_type

Data Cleaning

  • Data cleaning corrects or removes inaccurate, incomplete, or irrelevant data
  • df.rename(columns={'old': 'new'}) changes a column name
  • df.drop_duplicates() removes duplicate rows
  • Replace "-" with 0 in a DataFrame using df.replace('-', 0)
  • df['col'] = df['col'].astype(int) turns a column to int type
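
The cleaning steps above could be chained roughly like this. It is a sketch on invented rows: only the column rename and the "-" to 0 replacement come from the notes, and `dtype=object` is an assumption made here so the mixed text and number values behave the same across pandas versions.

```python
import pandas as pd

# Hypothetical raw data: a misnamed column, a duplicate row, "-" for missing scores.
raw = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Caro"],
        "Grade100/00": ["72", "-", "-", "90"],
    },
    dtype=object,
)

clean = (
    raw.rename(columns={"Grade100/00": "Grade100/100"})  # fix the column name
    .drop_duplicates()                                   # drop the repeated "Ben" row
    .replace("-", 0)                                     # "-" becomes 0, as in the notes
)

# Once no "-" remains, the column can be converted to integers.
clean["Grade100/100"] = clean["Grade100/100"].astype(int)

print(clean)
```

Replacing "-" with 0 follows the notes; whether 0 is a sensible stand-in for a missing score depends on the analysis.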

Label Column Purpose

  • The ‘Label’ column classifies rows as 'Passed' or 'Failed' based on a score
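
One way such a label could be computed, assuming the 60-point pass mark mentioned elsewhere in these notes. The function name `label_score` and the sample scores are hypothetical.

```python
PASS_MARK = 60  # threshold from the notes: scores of 60 or more pass

def label_score(score: float, pass_mark: float = PASS_MARK) -> str:
    """Return 'Passed' when the score reaches the pass mark, otherwise 'Failed'."""
    return "Passed" if score >= pass_mark else "Failed"

scores = [45, 60, 88]
labels = [label_score(s) for s in scores]

print(labels)  # ['Failed', 'Passed', 'Passed']
```

In pandas the same function could fill a new column with something like `df['Label'] = df['score'].apply(label_score)`, where `score` is an assumed column name.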

Outlier Detection

  • Box plots visually detect outliers
  • To ensure the data only includes completed entries, remove rows where 'State' is not 'Finished'

Null Value Function

  • df.isnull().sum() detects null values in each column

XML parsing

  • xml.etree.ElementTree can parse XML data

State Filtering

  • df[df['State'] == 'Finished'] is a filtered DataFrame with only 'Finished' rows

Exploratory Data Analysis (EDA)

  • The main goal is to understand the data's structure and patterns

EDA Tasks

  • Model tuning is not a typical EDA task

Outlier Detection

  • Box plots are best for visual outlier detection

Center Statistics

  • The mean represents the data's center

Interquartile Range (IQR)

  • IQR is defined as Q3 - Q1
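
A worked sketch of the IQR using only the standard library. The data are made up, and the 1.5 × IQR fences are the common box-plot convention rather than something stated in these notes. Different quartile methods can give slightly different Q1 and Q3 values; `method="inclusive"` is one choice.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# With n=4, quantiles() returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Common convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)  # 4.0 8.0 4.0
print(outliers)     # [30]
```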

Skewed Distribution

  • A right-skewed distribution has a long tail to the right

Descriptive Statistics

  • df.describe() provides descriptive statistics

Missing Value Identification

  • .isnull().sum() helps find missing values

Data Visualization Library

  • Seaborn is used for data visualization

Categorical Variable Frequencies

  • Bar charts compare categorical variable frequencies

Dispersion Measure

  • Range is a measure of dispersion

Symmetrical Data Distribution

  • A symmetrical data distribution occurs when the mean, median, and mode are approximately equal

Missing Values and Pandas

  • The dropna() pandas function drops rows with missing values

Continuous Variables

  • The scatter plot analyzes relationships between two continuous variables

Data Spread Measure

  • Variance identifies the spread of data from the mean

EDA Purpose

  • EDA helps understand data structure, discover patterns, detect outliers, and prepare data for modeling

EDA Libraries

  • Pandas and Seaborn are two Python libraries useful for EDA, also Matplotlib

DataFrame Missing values

  • df.isnull().sum() is used to check for missing values in a DataFrame

Box Plot Use

  • Box plots are useful for identifying outliers and visualizing data distribution

IQR Definition

  • The Interquartile Range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1)

Pandas Describe

  • The describe() method in pandas displays summary statistics: count, mean, std, min, max, and quartiles

Outlier Importance

  • Outlier detection is important because outliers can distort statistical analyses and affect model accuracy

Common Correlation Plot

  • Scatter plots or heatmaps are commonly used to identify correlation

Skewness

  • Skewness indicates asymmetry in a data distribution; positive skew indicates a longer tail on the right

Handling Missing Data Techniques

  • Use dropna() to drop rows/columns or fillna() to fill missing values
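
Both options can be compared side by side. In this sketch the DataFrame is invented, and filling with the column mean is just one possible choice of fill value.

```python
import pandas as pd

# Hypothetical data with one missing score (None becomes NaN).
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Caro"],
    "score": [72.0, None, 90.0],
})

# Count missing values per column before deciding what to do.
missing_per_column = df.isnull().sum()

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill the gap, here with the mean of the known scores (81.0).
filled = df.fillna({"score": df["score"].mean()})

print(missing_per_column)
print(dropped)
print(filled)
```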

Mean vs Median

  • Mean is the average, median is the middle value when data is sorted

Histogram Data

  • Histograms are suitable for numerical, continuous data

Standard Deviation

  • A large standard deviation implies data points are spread widely from the mean

Median and Mean vs Data

  • The median is preferred over the mean when data contains outliers or is skewed
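
A quick numeric illustration with made-up values: a single extreme value pulls the mean far more than the median.

```python
from statistics import mean, median

values = [30, 32, 35, 38, 40]
with_outlier = values + [400]  # add one extreme value

print(mean(values), median(values))  # 35 35
print(mean(with_outlier))            # jumps to roughly 95.8
print(median(with_outlier))          # 36.5, barely moved
```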

Visualizing Numerical Variables

  • Histograms or box plots can visualize one numerical variable's distribution

Data Analysis Process

  • Understanding the problem is the initial step

Data Type

  • Nominal is a type of data

Parts-of-a-Whole Charts

  • Pie charts show parts of a whole

Data Set Center

  • The mean is a statistical measure that describes the center of a data set

Outliers Plot

  • Box plots detect outliers

Data Vis Python Library

  • Matplotlib is a Python library for visualization

Data Manipulation Python Library

  • Pandas is the best-suited library for data manipulation

Variable w/ labels

  • Categorical

Means Test

  • T-test

Histogram Usage

  • Visualizes frequency distribution

P-values Determine

  • Statistical significance
  • Line Chart

Boxplot Graph

  • sns.boxplot() draws a box plot in Seaborn

Data Analysis Purpose

  • To inspect, clean, transform, and model data

Common Data Chart Tool

  • Matplotlib is commonly used to chart data for analysis

Standard Deviation Use

  • Measures the dispersion of data in a dataset

Outlier Detection Method

  • A box plot helps detect outliers

Age Variable

  • Numerical

P-Values Indicate

  • A small p-value indicates strong evidence against the null hypothesis

Read File

  • pd.read_csv() reads a CSV file into a pandas DataFrame

Histogram Use

  • Distribution visualization

t-test

  • Compare 2 means

Pandas Describe

  • Gives summary statistics

Correlation Measurement

  • Linear direction/strength
  • Shows correlations

EDA Roles

  • Understand data, find patterns, look for oddities, and data preparation

Libraries for EDA

  • Seaborn and Pandas

Check Values

  • Check for missing values with isnull().sum()

Useful Plots

  • Box plot

IQR Definition

  • Q3 - Q1

Week 6

  • Machine Learning (ML) - AI that learns
  • Supervised Learning (SL) - Models trained on data make predictions
  • Unsupervised Learning - Finds hidden groupings
  • Reinforcement Learning - Learns from rewards and penalties
  • Overfitting - The model does well on existing samples but poorly on unseen ones
  • Reducing Overfitting - By gathering more data
  • Normalization - Transforms the data
  • Label Encoding - Converts text labels to numbers
  • Accuracy - The share of correct predictions
  • F1 Score - Combines precision and recall
  • Cross-Validation - Uses multiple splits

  • ML - Learning from data
  • SL - Uses labeled data
  • Unsupervised Learning - Helpful for finding hidden groupings
  • Reinforcement Learning (RL) - Action based
  • Overfitting - The model has learned noise, so it does poorly on new data
  • Cross-Validation - Helps prevent overfitting
  • Feature Scaling (FS) - Ensures fairness between features
  • Normalization - Rescales values to the range 0 to 1
  • Encoding - Text becomes numeric
  • Accuracy - Correct predictions
  • PE - Positive predictions
  • Recall - Captures the actual positives

F1 Balance - Precision and recall

Cross-Validation - Multiple splits

NP Model - Grows with the data

Supervised Learning

  • Supervised learning requires labeled data

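Unsupervised Learning

  • Market segmentation is a typical use case

Reinforcement Learning

  • Actions directly influence future outcomes

Overfitting

  • Overfitting happens when a model learns noise in its training data and then performs poorly on new data

Overfitting Prevention

  • Cross-validation helps detect and prevent overfitting

Feature Scaling

  • Feature scaling (FS) ensures features are treated fairly, so none dominates because of its scale

Normalization

  • Normalization rescales values to the range 0 to 1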
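
Rescaling to the 0 to 1 range is often done with min-max scaling. The sketch below assumes that method. The function name and sample ages are made up.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 42, 66]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```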

Transforming to Numeric

  • Text should be converted to numeric values

Positive Predictions

  • Precision (PE) measures how many predicted positives are correct

Recall

  • Captures the actual positives

Balancing Precision and Recall

  • Precision and recall must be balanced

Cross-Validation Splits

  • Multiple splits

NP Model

  • Grows with the data

Classification

  • Used to assign items to classes

Examples

  • Diagnosing diseases is a good example

Label Requirement

  • Supervised learning requires labels

Feature Vectors

  • Refers to the input features

Not Classification

  • K-Means is not a classification algorithm

Evaluation

  • Evaluates the model accurately

Matrices

  • A confusion matrix gives the counts that help evaluate a model

TP in a Confusion Matrix

  • True positive

Precision

  • TP / (TP + FP)

Recall

  • TP / (TP + FN)

F-Score

  • Combines precision and recall
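
The precision and recall formulas, together with the usual F1 definition (the harmonic mean of the two), can be checked with a few lines. The counts below are invented, and the function name is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```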

Evaluation Help

  • Cross-validation helps with model evaluation

High-Quality Models

  • Overfitting is bad for model quality

means

  • Cant get it is the model bad then

Deaf Khmer Community

  • To connect with deaf people and see their reactions

Random Forest

  • Evaluated using accuracy

Letters

  • 33 letters are recognized

Interface

  • Gradio provides a web interface for gesture input

Data Processing

  • MediaPipe

Technique

  • Scaling

Train/Test Split

  • 80/20

Good for Systems

  • High accuracy and efficiency

Extraction of Keypoints

  • 21 keypoints are extracted

Display on Interface

  • Instantly

Labels

  • The letters (ko, ka, etc.) are the labels

Importance

  • Helps bridge gaps

Loaded Models and Interfaces

  • Models are loaded with joblib

Gradio Interface

  • Deployed in real time

Aim

  • To build a prototype
  • To combine a classification model with Gradio

The 3-Step Workflow for Gradio

  1. Training
  2. Saving
  3. Testing + deployment

Libraries

  • Scikit-learn

Model Types

  • Use a decision tree

How Models Are Saved

  • With the libraries

How to Get Set Up

  • Gradio must be loaded

Keys

  • Inputs to show

Testing

  • Colab or Hugging Face Spaces

Reporting

  • Model accuracy and interfaces

Dataset Preparation

  • Choose a topic for classification

Datasets

  • Data cleaning

Splits

  • Splitting the data into training and test sets

Assessing Performance

  • Evaluate the model using test data

What to Install for Gradio

  • pip install gradio

Main Goal

  • To use or build a classification-based model

Required Step

  • Creating the interface

Common Tools

  • Scikit-learn
  • Classification: decision tree
  • Storing or loading a model: joblib or pickle
  • Installing Gradio: pip install gradio

Essential Components

  • Input and output

Where to Deploy

  • Hugging Face Spaces, Colab, or local
