Untitled Quiz

Questions and Answers

What is a key characteristic of data science compared to business intelligence?

  • Emphasis on predictive analytics (correct)
  • Focus on historical data analysis
  • Immediate operational decision support
  • Strictly structured data handling

Which of the following describes a principal goal of data science?

  • To ensure data quality and consistency
  • To organize data into traditional reports
  • To solve real problems using data (correct)
  • To manage database systems effectively

How does data science handle the complexity of data?

  • By applying fixed algorithms to all data types
  • Through structured query language (SQL) usage only
  • By grappling with the structure and messiness of data (correct)
  • By focusing exclusively on clean and organized datasets

Which type of data structure is NOT commonly associated with big data?

  • Static arrays (correct)

What aspect distinguishes analyst-owned processes from DBA-owned ones in a data context?

  • Analyst-owned prioritizes data insights and analysis (correct)

What is one of the challenges data scientists face when working with data?

  • Interpreting data that is messy and complex (correct)

Which method is typically NOT associated with data science?

  • Data entry tasks (correct)

Which of the following risks is commonly associated with data replication?

  • Data inconsistency and duplication errors (correct)

What are the two main components of an audio signal?

  • DC component and AC component (correct)

Why is the DC component usually removed before analyzing the audio signal?

  • It elevates the level of volume. (correct)

What is the basic unit for representing a digital image?

  • Pixel (correct)

How is digital image data generally presented?

  • In 2-D form (correct)

Which term is used to represent a point in a 3D image?

  • Voxel (correct)

In the context of processing digital images, which aspect is often isolated?

  • Brightness from color channels (correct)

What does the term 'subsampled' refer to in digital imaging?

  • Reducing the resolution of an image (correct)

What is the role of the AC component in an audio signal?

  • It represents the frequency corresponding to the pitch. (correct)

What is the primary goal of supervised learning?

  • To map input variables to output variables using known associations. (correct)

Which of the following is NOT a category of supervised models?

  • Clustering (correct)

What type of output is associated with regression models?

  • Real values or numerical outputs, such as $250. (correct)

In the context of supervised learning, what does a training set consist of?

  • Input variables and their corresponding output variables. (correct)

Which of the following learning types relies on labeled data?

  • Supervised learning (correct)

What characterizes unsupervised learning?

  • It seeks to find structure or patterns in data without labeled outputs. (correct)

What distinguishes semi-supervised learning from supervised and unsupervised learning?

  • It relies on both labeled and unlabeled data for training. (correct)

What type of task would require classification in supervised learning?

  • Determining whether an email is spam or not. (correct)

What distinguishes the ETLT approach in data preparation?

  • It can involve either ETL or ELT based on specific goals. (correct)

What is a key activity during the data conditioning phase?

  • Cleaning and normalizing datasets. (correct)

Which of the following is NOT a key activity in Phase 2 of data preparation?

  • Building predictive models (correct)

Why is conducting a data gap analysis important?

  • To assess what data is available versus what is needed. (correct)

What should teams consider prior to moving data into the sandbox?

  • The types of transformations that will be needed. (correct)

What is the purpose of creating a dataset inventory?

  • To help in understanding available data sources. (correct)

What is assessed to determine if a team can move to the modeling phase?

  • The quality and sufficiency of the data. (correct)

Which activity is involved in understanding the data during Phase 2?

  • Identifying data entry errors and acceptable value ranges. (correct)

What distinguishes a data scientist from someone with basic data skills?

  • Data scientists extract meaning and interpret data. (correct)

Which data type can be represented in a 1-D form?

  • Text Data (correct)

What does ASCII stand for in data encoding?

  • American Standard Code for Information Interchange (correct)

Which type of data can be treated as time-series data?

  • Audio Data (correct)

What is the primary use of semantic analysis in data interpretation?

  • To extract information from text data. (correct)

What is one of the key characteristics of Unicode compared to ASCII?

  • Unicode can represent multiple languages and more symbols. (correct)

Which of the following describes trajectory data?

  • Data that tracks movement over time. (correct)

Which data type typically requires sophisticated coding standards to properly represent various symbols?

  • Text Data (correct)

What is the first step in Phase 1 - Discovery of a project?

  • Identifying key stakeholders (correct)

Which of the following is NOT a key activity in Phase 1 - Discovery?

  • Conducting market research (correct)

What criterion helps to define what constitutes project failure?

  • Establishing failure criteria (correct)

What aspect of the project does interviewing the Analytics Sponsor primarily address?

  • Defining the business problem (correct)

Which statement best describes Initial Hypotheses in Phase 1 - Discovery?

  • They should start with a few primary ideas. (correct)

Which of the following is important when identifying key stakeholders?

  • Understanding their pain points (correct)

In the context of project discovery, what is the significance of industry issues?

  • They may impact analysis focus and project direction. (correct)

What is one of the expected outcomes of developing Initial Hypotheses?

  • Ideas that can be tested with data (correct)

Flashcards

Data Scientist Definition

Someone who extracts meaning from and interprets data using methods from statistics and machine learning.

Text Data Types

Data represented by limited symbols encoded using standards like ASCII or Unicode.

Image Data Types

Visual data like pictures, photos, and other visual recordings.

Audio Data Types

Sound data, often represented as time-series with amplitude corresponding to volume.

Streaming/Video Data Types

Data containing flowing or moving visual information (videos, streams).

3-D Image Data Types

Data represented as three-dimensional images.

Trajectory Data Types

Data tracking movement or paths of things.

ASCII Code

Character-encoding standard representing symbols with 7 bits (8 bits in extended variants).

Data Science vs. Business Intelligence

Data Science is a broader field focusing on extracting insights and knowledge from data, while Business Intelligence (BI) focuses on using data to help make better business decisions.

Data Science Goal

To solve real-world problems using large datasets and computational methods.

Data Scientist (Academic Def)

A scientist skilled in various fields (like biology or social science) who works with large datasets, handling computational challenges associated with data size and complexity to solve a real problem.

Shadow File Systems

Unofficial copies of data kept outside the warehouse by analysts; an analytic sandbox reduces the cost and risk of replicating data to such systems.

Analyst Ownership (Data)

Data ownership is handled by analysts, not database administrators (DBAs).

Data Structures in Big Data

Various ways of organizing and storing data in big data, each with advantages and disadvantages for different tasks.

Data replication costs

The money, effort, and risk involved in maintaining duplicate copies of data.

Data Science Principle

A fundamental concept or guideline for conducting data science tasks.

Problem Statement

A clear description of the issue or challenge that the project aims to solve. It should be concise and understandable to all stakeholders.

Key Stakeholders

Individuals or groups who have a vested interest in the project's success or are significantly impacted by it.

Project Objectives

Specific and measurable goals that the project intends to achieve. These objectives should be aligned with the problem statement.

Business Impact

The tangible benefits or changes the project will bring to the organization in terms of revenue, efficiency, or customer satisfaction.

Failure Criteria

Conditions or outcomes that would indicate the project's failure to meet its objectives. This helps define clear boundaries for success.

Analytics Sponsor Interview

A structured conversation with the project's sponsor to understand the business problem, desired outcomes, available data, and project constraints.

Initial Hypotheses (IHs)

Testable statements that propose potential explanations for the observed problem or phenomenon. These are starting points for data analysis.

Gather Hypotheses from Stakeholders

Collecting potential explanations for the problem from individuals with domain expertise or experience related to the project.

Audio Signal Components

An audio signal consists of two parts: a direct current (DC) component and an alternating current (AC) component. The DC component acts as a bias, controlling the volume level, while the AC component carries the audio information.

DC Component Function

The DC component of an audio signal is responsible for setting the overall volume level. It acts as a bias, raising or lowering the signal level.

Removing DC Component

Before analyzing an audio signal, the DC component is usually removed. This process eliminates the volume bias and allows for a clearer understanding of the actual audio information.

AC Component Function

The AC component of an audio signal carries the actual sound information. This includes the pitch, tone, and other characteristics of the audio.

Frequency in Audio

The frequency of the AC component in an audio signal corresponds to the perceived pitch or tone of the sound. Higher frequencies produce higher pitches, and lower frequencies produce lower pitches.

Image Data Representation

Image data is typically represented in a 2D form, with each point in the image called a pixel. Pixels are the basic units for representing digital images.

Color and Brightness in Images

Some image storage models separate the brightness information from the color channels. This allows for efficient storage and processing of images.

3D Image Data Representation

In 3D images, each point is called a voxel, representing a 3D space. Voxels are the basic units for representing 3D image data.

ETLT

A data preparation approach combining ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) to populate a data sandbox. The team chooses the method based on their specific goals.

Data Gap Analysis

Comparing the data available with the data needed for the project, to identify what is missing or must still be obtained.

Data Conditioning

Cleaning, standardizing, and transforming data to prepare it for analysis, including merging datasets and selecting relevant information.

Data Inventory

A documented list of all datasets used in a project, including their sources, formats, and descriptions.

Join/Merge Datasets

Combining different datasets with shared information into a single, unified dataset.

Data Conditioning Steps

Processes involved in data conditioning, including cleaning, normalizing, and transforming data.

Dataset Selection

Identifying the most useful data for analysis from a pool of available datasets, deciding which data to keep or discard.

Model Planning Phase

The stage where the data science team decides whether they have enough good quality data for model building, determining if they can proceed to the next phase.

Supervised Learning

A machine learning approach where the model is trained on labeled data, meaning both inputs and the desired outputs are provided. The model learns to map inputs to outputs based on this training data.

Unsupervised Learning

A machine learning technique where the model learns from unlabeled data, meaning it doesn't have any specific outputs to learn from. It identifies patterns, structures, and relationships within the data itself.

Semi-supervised Learning

A hybrid approach where the model is trained on a mix of labeled and unlabeled data. It leverages the strengths of both supervised and unsupervised learning.

Classification

A type of supervised learning where the model predicts a categorical outcome, placing data points into predefined categories.

Regression

A type of supervised learning where the model predicts a numerical output, finding a continuous relationship between inputs and outputs.

Training Set

The data used to train a machine learning model. It contains both the inputs and desired outputs, allowing the model to learn the relationship between them.

Test Set

The data used to evaluate the performance of a trained machine learning model. The model predicts outputs for its inputs, and the predictions are compared against the known outputs, which are withheld during training.

Mapping Function

The function that the machine learning model learns during training. It represents the relationship between the inputs and outputs, allowing the model to predict outputs for new, unseen inputs.

Study Notes

Final Online Test Details

  • Date of test: During Week 12 Tutorial Sessions
  • Group 1: Tuesday (November 26th) at A312, 8 am - 10 am
  • Group 2: Thursday (November 28th) at B219, 1 pm - 3 pm
  • Group 3: Friday (November 29th) at A312, 10 am - 12 pm
  • Group 4: Wednesday (November 27th) at A312, 8 am - 10 am
  • Group 5: Friday (November 29th) at B219, 2 pm - 4 pm
  • Mobile phones and ChatGPT prohibited during the test
  • Test must be taken in person on campus

Assessment Details

  • Closed-book test
  • 50 questions
  • Total points: 120
  • 40 questions x 2 points = 80 points
  • 10 questions x 4 points = 40 points
  • Question formats:
    • Multiple choice (one correct answer)
    • Multiple choice (up to two correct answers)
    • Matching questions

Big Data Ecosystem Components

  • Data Devices: Cell phone, GPS, MP3, eBook reader, video player, cable box, ATM, credit card reader, RFID
  • Data Collectors: Law enforcement, government, insurance companies, individual medical information brokers, advertising, marketers, employers
  • Data Users/Buyers: Media archives, credit bureaus, financial institutions, banks, delivery services, websites, private investigators
  • Data Aggregators: Websites, data aggregators, etc

Data Devices

  • Gather data from multiple locations
  • Continuously generate new data about the data they collect
  • For each gigabyte of new data created, a petabyte of data is generated about that data

Data Collectors

  • Entities that collect data from devices and users
  • Example: Cable TV provider tracks:
    • Shows watched
    • Channels subscribed to/not willing to pay for
    • Prices for premium TV content

Data Aggregators

  • Entities that compile and make sense of data collected by collectors
  • Companies that transform and package data to sell

Data Users and Buyers

  • Direct beneficiaries of the data collected and aggregated
  • Example: Corporate customers, analytical services, media archives, advertising companies, information brokers, credit bureaus, catalog co-ops

Four V's of Big Data

  • Scale (volume)
  • Diversity (variety)
  • Timeliness (velocity)
  • Accuracy (veracity)

Data Science vs Enterprise Data Warehouse

  • Data Warehouse (DW) is a relational database designed for querying and analysis rather than for transaction processing.
  • Data warehouse contains cleaned, selective historical data.
  • Includes ETL (Extraction, Transformation, and Loading), OLAP (Online Analytical Processing) processes
  • Data Science processes deal with diverse data sets (4 Vs of big data) and often need different architectures and analytics.

Analytic Sandbox (Workspaces)

  • Resolves conflicts between analysts' needs and traditional enterprise data warehouses.
  • Stores data from various sources and technologies
  • Enables flexible, high-performance analysis in non-production environments
  • Reduces costs and risks of data replication to "shadow" file systems
  • "Analyst-owned" rather than "DBA-owned"

Data Science vs Business Intelligence

  • Data Science is exploratory and predictive, focusing on past and future trends and scenarios using various types of data.
  • Business Intelligence is focused on historical and current data to present trends, performance, and issues via reports.

Big Data Data Structures

  • Unstructured: Data with no predefined format (e.g. text documents, images)
  • Quasi-structured: Data with inconsistent formats that can be structured (e.g., clickstream data)
  • Semi-structured: Data with a defined pattern or format that can be parsed (e.g., spreadsheets, XML)
  • Structured: Data with defined formats, models, and structures (e.g., databases)

Data Scientist Definition (Academic)

  • A scientist trained in diverse fields from social science to biology.
  • Works with large amounts of data
  • Addresses computational issues of data structure, size, and messiness
  • Solves real-world problems in the process

Data Scientist Definition (Industry)

  • Someone with the capacity to extract meaning from and interpret data, using tools and methods from statistics and machine learning.

Data Types

  • Text data: Limited symbols, usually encoded with ASCII, Unicode, or other standards.
  • Audio data: Amplitude corresponds to volume, frequency corresponds to pitch
  • Image data: Pixels are basic representation unit.
  • 3-D Image data: Voxels instead of pixels to indicate points in a 3D space
  • Video/Streaming data: Image frames displayed in a timeline of events
  • Trajectory data: Collected by GPS, including geo-location and timestamp
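Since audio can be treated as time-series data (and, as the DC-component questions earlier note, the DC bias is removed before analysis), that removal is simply mean subtraction. A minimal pure-Python sketch with hypothetical sample values:

```python
# Removing the DC component of an audio signal by subtracting the mean.
# A hypothetical 8-sample signal; real audio would come from a file or stream.
samples = [130, 126, 131, 125, 132, 124, 133, 123]

dc = sum(samples) / len(samples)   # DC component: the signal's bias level
ac = [s - dc for s in samples]     # AC component: the signal left after removal

print(dc)        # 128.0
print(sum(ac))   # ~0: the centered signal carries no bias
```

What remains (the AC component) carries the audio information itself, with frequency corresponding to pitch.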

Data Analytics Lifecycle

  • Discovery: Evaluating resources, framing the analytics problem, identifying stakeholders, and determining initial hypotheses.
  • Data Prep: Preparing the analytic sandbox by extracting and cleaning the relevant data
  • Model Planning: Determining the best methods, techniques, and workflow for the next modeling phase
  • Model Building: Creating the model, using data sets for training and testing
  • Communicating Results: Presenting findings and determining if the project achieved intended goals.
  • Operationalizing Results: Implementing models in a production environment.

Key Activities in Phase 1 (Discovery)

  • Learning the business domain and assessing the resources needed for the project (people, technology, time, and data)
  • Formulating initial hypotheses that are testable against the data
  • Determining the key stakeholders (those who benefit from or are affected by the project)
  • Articulating the key stakeholders' pain points
  • Interviewing the analytics sponsor

Key Activities in Phase 2 (Data Preparation)

  • Preparing the analytics sandbox
  • Performing ETL/ETLT on large datasets
  • Gathering insights about the data's characteristics
  • Building a dataset inventory
  • Performing data conditioning (cleaning, normalizing, transforming data)

Data Discrepancies

  • Causes: poorly designed forms, human error, deliberate errors (e.g., withholding information), data decay (outdated information), system errors, and data integration issues that produce attribute-name inconsistencies
  • Detection strategies: examining metadata; applying rules about uniqueness, consecutiveness, or null values; and employing commercial data-scrubbing tools

Data Reduction Strategies

  • Data cube aggregation
  • Attribute subset selection (removing irrelevant, weakly relevant, or redundant attributes)
  • Dimensionality reduction (reducing data set size using encoding schemes)
  • Numerosity reduction (replacing the data or estimating it with smaller representations)
  • Discretization and concept hierarchy generation (replacing raw attribute values with ranges or high-level concepts)

Data Transformation Strategies

  • Data smoothing (removing noise)
  • Attribute/feature construction (creating new attributes)
  • Aggregation (building data cubes)
  • Normalization (scaling attributes to fall within a specified range)
  • Discretization (replacing values with numerical intervals or conceptual labels)
  • Concept hierarchy generation (generalizing attributes into higher-level categories)

Data Normalization Methods

  • Min-Max: Transforming data to a specific range (e.g., 0 to 1)
  • Z-score: Normalizing data by expressing each value as the number of standard deviations it lies from the mean; the result has mean 0 and standard deviation 1 (values are not confined to a fixed range)
  • Decimal scaling: Scaling data by dividing by 10^j, where j is the smallest integer such that the largest absolute scaled value is less than 1
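The three normalization methods above can be sketched in plain Python (function names are illustrative; in practice a library such as scikit-learn provides equivalents):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by 10**j so the maximum absolute value is < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```

Note that `z_score` output is centered at 0 but unbounded, which is why it suits data whose min and max are unknown or contain outliers.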

Data Discretization Methods

  • Binning, histogram analysis, cluster analysis, decision-tree analysis, and correlation analysis
  • Concept Hierarchy Approach: Method of transforming data into various levels of granularity (e.g., age, zipcode, country)
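Binning, the first method listed, can be illustrated with a small equal-width sketch (hypothetical function name and data): each value is mapped to the index of the interval it falls in.

```python
def equal_width_bins(values, k):
    """Discretize values into k equal-width bins; return a bin index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value lands exactly on the upper edge; clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(ages, 3))  # [0, 0, 1, 1, 1, 2, 2, 2, 2]
```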

K-Means Clustering

  • Exploratory model, unsupervised.
  • Groups data based on attributes into clusters (using centroid values for cluster center).
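The clustering loop described above (Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to its cluster's mean) can be sketched in pure Python on 2-D points; the function name and data are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means (Lloyd's algorithm) on 2-D points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from k distinct points
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids, clusters

# Two well-separated blobs; k-means should recover their means.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, groups = kmeans(pts, 2)
print(sorted(centers))  # centroids near (1.33, 1.33) and (8.33, 8.33)
```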

DBSCAN Clustering

  • Density-based
  • Locates areas of high density, clusters are regions where data density exceeds some threshold
  • Sensitive to parameters (ε, MinPts)

Hypothesis Testing

  • Assessing the difference in means of two data samples, or the significance of the difference
  • Two types of hypotheses:
    • Null Hypothesis (H0): No difference between the two data samples
    • Alternative Hypothesis (HA): A difference exists between the two data samples
  • Outcome can lead to rejection or non-rejection of H0
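One common way to quantify the difference in means described above is Welch's t statistic; a pure-Python sketch (the statistic only — a complete test would also compute degrees of freedom and a p-value):

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means of two samples.
    Values near 0 are consistent with H0 (no difference); large |t| favors HA."""
    va, vb = variance(sample_a), variance(sample_b)   # sample variances
    return (mean(sample_a) - mean(sample_b)) / (va / len(sample_a) + vb / len(sample_b)) ** 0.5

a = [5.1, 4.9, 5.0, 5.2, 4.8]
b = [5.3, 5.6, 5.4, 5.7, 5.5]
print(welch_t(a, b))  # clearly negative: sample b has the larger mean
```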

Predictive Models

  • Identifying attributes of a data object in advance (e.g., guessing whether a customer will subscribe or not)

Regression vs Classification

  • Classification deals with making decisions based on categorical results.
  • Linear regressions give numerical values (as opposed to classes)

Training and Test Sets

  • Training set: Used to train the model
  • Test set: Used to evaluate the model
  • Both sets are independent of each other and non-overlapping

Validation Set

  • Part of the data set that's not used in training or testing and is separated to tune the model's parameters

Naïve Bayes Model

  • Classifier based on probabilities and the Bayes' theorem.
  • Simplifying assumption that attribute values are conditionally independent of one another given the class

Naïve Bayes Classifier Metrics

  • Accuracy, TPR, FPR, FNR, Precision, AUC
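The listed metrics — except AUC, which requires ranked scores across many thresholds rather than a single confusion matrix — can all be computed from the four confusion-matrix counts. A small sketch with hypothetical counts:

```python
def classifier_metrics(tp, fp, tn, fn):
    """Common evaluation metrics from the four confusion-matrix counts."""
    return {
        "accuracy":  (tp + tn) / (tp + fp + tn + fn),
        "tpr":       tp / (tp + fn),   # true positive rate (recall)
        "fpr":       fp / (fp + tn),   # false positive rate
        "fnr":       fn / (tp + fn),   # false negative rate
        "precision": tp / (tp + fp),
    }

print(classifier_metrics(tp=40, fp=10, tn=45, fn=5))
# accuracy 0.85, tpr ≈ 0.889, fpr ≈ 0.182, fnr ≈ 0.111, precision 0.8
```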

Cross-Validation

  • Holdout (percentage split): Divides data into training and test sets based on pre-determined percentages. The performance is dependent on how the data is split
  • K-fold cross validation: Data divided into K subsets, K trials performed. For each trial, one subset is used for testing, and the remaining K-1 subsets are for training
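The K-fold procedure above can be sketched as a simple splitter (round-robin partition for illustration; real implementations usually shuffle the data first):

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs for k-fold cross-validation.
    Each of the k subsets serves as the test set exactly once."""
    folds = [data[i::k] for i in range(k)]   # round-robin partition into k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    assert len(test) == 2 and len(train) == 8
    assert sorted(train + test) == data      # every trial uses all the data
```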

Model Deployment Best Practices

  • Specify performance requirements (accuracy, TPR, precision, etc.)
  • Separate model coefficients from the program
  • Develop automated tests of the model (testing on a smaller portion of the data outside the training/testing data sets)
  • Develop a back-test and now-test infrastructure (re-test models on historical data after updates, and check that a model still works as expected on new data points)
  • Evaluate each model update (testing each update to verify if performance requirements are still met).

Decision Tree & Ensemble Learning

  • Prediction/classification by creating a tree-like structure based on attributes and criteria
  • Splitting attributes, pruning, information gain, and Gini index

Ensemble learning

  • A learning model that combines multiple learners to make predictions that are stronger and more accurate than individual models.
  • Approaches like bagging (Bootstrap Aggregation) create multiple models from sampled training data and find a 'majority' vote for predictions
  • Boosting builds models sequentially, with each successive model focusing on the examples earlier, weaker models mishandled, and combines their weighted outputs into the final prediction
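The bagging idea above — train learners on bootstrap samples, then take a majority vote — can be sketched with hypothetical one-threshold "stump" learners (all names and data here are illustrative):

```python
import random

def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x):
    """Majority vote over the predictions of the individual learners."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Hypothetical training data: (value, label) pairs from two separable classes.
train = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]

def train_stump(sample):
    # Weak learner: threshold at the midpoint between the class means of the sample.
    yes = [v for v, y in sample if y == "yes"]
    no = [v for v, y in sample if y == "no"]
    t = (sum(yes) / len(yes) + sum(no) / len(no)) / 2 if yes and no else 5
    return lambda x, t=t: "yes" if x > t else "no"

rng = random.Random(42)
models = [train_stump(bootstrap_sample(train, rng)) for _ in range(11)]
print(bagged_predict(models, 8))  # yes
print(bagged_predict(models, 2))  # no
```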
