Questions and Answers
What is a key characteristic of data science compared to business intelligence?
Which of the following describes a principal goal of data science?
How does data science handle the complexity of data?
Which type of data structure is NOT commonly associated with big data?
What aspect distinguishes analyst-owned processes from DBA-owned ones in a data context?
What is one of the challenges data scientists face when working with data?
Which method is typically NOT associated with data science?
Which of the following risks is commonly associated with data replication?
What are the two main components of an audio signal?
Why is the DC component usually removed before analyzing the audio signal?
What is the basic unit for representing a digital image?
How is digital image data generally presented?
Which term is used to represent a point in a 3D image?
In the context of processing digital images, which aspect is often isolated?
What does the term 'subsampled' refer to in digital imaging?
What is the role of the AC component in an audio signal?
What is the primary goal of supervised learning?
Which of the following is NOT a category of supervised models?
What type of output is associated with regression models?
In the context of supervised learning, what does a training set consist of?
Which of the following learning types relies on labeled data?
What characterizes unsupervised learning?
What distinguishes semi-supervised learning from supervised and unsupervised learning?
What type of task would require classification in supervised learning?
What distinguishes the ETLT approach in data preparation?
What is a key activity during the data conditioning phase?
Which of the following is NOT a key activity in Phase 2 of data preparation?
Why is conducting a data gap analysis important?
What should teams consider prior to moving data into the sandbox?
What is the purpose of creating a dataset inventory?
What is assessed to determine if a team can move to the modeling phase?
Which activity is involved in understanding the data during Phase 2?
What distinguishes a data scientist from someone with basic data skills?
Which data type can be represented in a 1-D form?
What does ASCII stand for in data encoding?
Which type of data can be treated as time-series data?
What is the primary use of semantic analysis in data interpretation?
What is one of the key characteristics of Unicode compared to ASCII?
Which of the following describes trajectory data?
Which data type typically requires sophisticated coding standards to properly represent various symbols?
What is the first step in Phase 1 - Discovery of a project?
Which of the following is NOT a key activity in Phase 1 - Discovery?
What criterion helps to define what constitutes project failure?
What aspect of the project does interviewing the Analytics Sponsor primarily address?
Which statement best describes Initial Hypotheses in Phase 1 - Discovery?
Which of the following is important when identifying key stakeholders?
In the context of project discovery, what is the significance of industry issues?
What is one of the expected outcomes of developing Initial Hypotheses?
Study Notes
Final Online Test Details
- Date of test: During Week 12 Tutorial Sessions
- Group 1: Tuesday (November 26th) at A312, 8 am - 10 am
- Group 2: Thursday (November 28th) at B219, 1 pm - 3 pm
- Group 3: Friday (November 29th) at A312, 10 am - 12 pm
- Group 4: Wednesday (November 27th) at A312, 8 am - 10 am
- Group 5: Friday (November 29th) at B219, 2 pm - 4 pm
- Mobile phones and ChatGPT prohibited during the test
- Test must be taken in person on campus
Assessment Details
- Closed-book test
- 50 questions
- Total points: 120
  - 40 questions x 2 points = 80 points
  - 10 questions x 4 points = 40 points
- Question formats:
  - Multiple choice (one correct answer)
  - Multiple choice (up to two correct answers)
  - Matching questions
Big Data Ecosystem Components
- Data Devices: Cell phone, GPS, MP3, eBook reader, video player, cable box, ATM, credit card reader, RFID
- Data Collectors: Law enforcement, government, insurance companies, individual medical information brokers, advertising, marketers, employers
- Data Users/Buyers: Media archives, credit bureaus, financial institutions, banks, delivery services, websites, private investigators
- Data Aggregators: Websites, data aggregators, etc
Data Devices
- Gather data from multiple locations
- Continuously generate new data about subject data
- For each gigabyte of data created, a petabyte of data is also generated about the subject data
Data Collectors
- Entities that collect data from devices and users
- Example: a cable TV provider tracks:
  - Shows watched
  - Channels subscribed to and channels the customer is not willing to pay for
  - Prices for premium TV content
Data Aggregators
- Entities that compile and make sense of data collected by collectors
- Companies that transform and package data to sell
Data Users and Buyers
- Direct beneficiaries of the data collected and aggregated
- Example: Corporate customers, analytical services, media archives, advertising companies, information brokers, credit bureaus, catalog co-ops
Four V's of Big Data
- Scale (volume)
- Distribution
- Diversity (variety)
- Timeliness (velocity)
- Accuracy (veracity)
Data Science vs Enterprise Data Warehouse
- Data Warehouse (DW) is a relational database designed for querying and analysis rather than for transaction processing.
- Data warehouse contains cleaned, selective historical data.
- Includes ETL (Extraction, Transformation, and Loading) and OLAP (Online Analytical Processing) processes
- Data Science processes deal with diverse data sets (4 Vs of big data) and often need different architectures and analytics.
Analytic Sandbox (Workspaces)
- Resolves conflicts between analysts' needs and traditional enterprise data warehouses.
- Stores data from various sources and technologies
- Enables flexible, high-performance analysis in non-production environments
- Reduces costs and risks of data replication to "shadow" file systems
- "Analyst-owned" rather than "DBA-owned"
Data Science vs Business Intelligence
- Data Science is exploratory and predictive, focusing on past and future trends and scenarios using various types of data.
- Business Intelligence is focused on historical and current data to present trends, performance, and issues via reports.
Big Data Data Structures
- Unstructured: Data with no predefined format (e.g. text documents, images)
- Quasi-structured: Data with inconsistent formats that can be structured (e.g., clickstream data)
- Semi-structured: Data with a defined pattern or format that can be parsed (e.g., spreadsheets, XML)
- Structured: Data with defined formats, models, and structures (e.g., databases)
Data Scientist Definition (Academic)
- A scientist trained in diverse fields from social science to biology.
- Works with large amounts of data
- Addresses computational issues of data structure, size, and messiness
- Solves a real-world problem at the same time as handling those computational issues
Data Scientist Definition (Industry)
- Someone who can extract meaning from and interpret data, which requires tools and methods from statistics and machine learning as well as being human.
Data Types
- Text data: Limited symbols, usually encoded with ASCII, Unicode, or other standards.
- Audio data: Amplitude corresponds to volume, frequency corresponds to pitch
- Image data: Pixels are basic representation unit.
- 3-D Image data: Voxels instead of pixels to indicate points in a 3D space
- Video/Streaming data: Image frames displayed in a timeline of events
- Trajectory data: Collected by GPS, including geo-location and timestamp
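The difference between ASCII and Unicode encodings mentioned for text data can be seen directly in Python. This is only a small illustration; the sample strings are made up.

```python
# ASCII covers a limited symbol set: one byte per character.
ascii_bytes = "data".encode("ascii")

# UTF-8 (a Unicode encoding) represents a far wider range of symbols,
# using more than one byte for characters outside the ASCII range.
text = "café €"
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))

# The same text cannot be encoded as ASCII at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```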
Data Analytics Lifecycle
- Discovery: Evaluating resources, framing the analytics problem, identifying stakeholders, and determining initial hypotheses.
- Data Prep: Preparing the analytic sandbox by extracting and cleaning the relevant data
- Model Planning: Determining the best methods, techniques, and workflow for the next modeling phase
- Model Building: Creating the model, using data sets for training and testing
- Communicating Results: Presenting findings and determining if the project achieved intended goals.
- Operationalizing Results: Implementing models in a production environment.
Key Activities in Phase 1 (Discovery)
- Learning the business domain and assessing the resources needed for the project (people, technology, time, and data)
- Formulating initial hypotheses that are testable against the data.
- Determining the key stakeholders (those who benefit or are affected by the project).
- Articulating the key stakeholders' pain points.
- Interviewing the analytics sponsor
Key Activities in Phase 2 (Data Preparation)
- Preparing the analytics sandbox
- Performing ETL/ETLT on large datasets
- Gathering insights about the data's characteristics
- Building a dataset inventory
- Performing data conditioning (cleaning, normalizing, transforming data)
Data Discrepancies
- Causes: poorly designed forms, human errors, deliberate errors (e.g., not providing information), data decay (outdated information), system errors, and data integration issues that cause attribute name inconsistencies.
- Detection strategies: examining metadata; applying rules on uniqueness, consecutiveness, or null values; and using commercial data scrubbing tools.
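The rule-based checks (uniqueness, consecutiveness, null values) could be sketched as follows. The record layout and the field names `id` and `name` are hypothetical, chosen only for illustration, and the consecutive rule is assumed to apply to an integer key.

```python
def check_rules(records, key="id"):
    """Report violations of the unique, consecutive, and null rules."""
    ids = [r[key] for r in records]
    # Unique rule: every key value should appear once.
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    # Consecutive rule: no gaps between the smallest and largest key.
    missing = sorted(set(range(min(ids), max(ids) + 1)) - set(ids))
    # Null rule: flag empty or missing field values.
    nulls = [(r[key], field) for r in records
             for field, value in r.items() if value is None or value == ""]
    return {"duplicates": duplicates, "missing_ids": missing, "nulls": nulls}

records = [  # made-up sample records
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": ""},
    {"id": 2, "name": "Bob"},
    {"id": 5, "name": None},
]
report = check_rules(records)
print(report)
```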
Data Reduction Strategies
- Data cube aggregation
- Attribute subset selection (removing irrelevant, weakly relevant, or redundant attributes)
- Dimensionality reduction (reducing data set size using encoding schemes)
- Numerosity reduction (replacing the data or estimating it with smaller representations)
- Discretization and concept hierarchy generation (replacing raw attribute values with ranges or high-level concepts)
Data Transformation Strategies
- Data smoothing (removing noise)
- Attribute/feature construction (creating new attributes)
- Aggregation (building data cubes)
- Normalization (scaling attributes to fall within a specified range)
- Discretization (replacing values with numerical intervals or conceptual labels)
- Concept hierarchy generation (generalizing attributes into higher-level categories)
Data Normalization Methods
- Min-Max: Transforming data to a specific range (e.g., 0 to 1)
- Z-score: Subtracts the mean and divides by the standard deviation, so the result has mean 0 and standard deviation 1. The values are not confined to a fixed range such as [-1, 1].
- Decimal scaling: Divides each value by 10^j, where j is the smallest integer that brings the largest absolute value below 1, so the results fall within (-1, 1).
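The three methods can be compared on a small sample. This is a minimal sketch using only the standard library; the sample values are made up, the z-score uses the population standard deviation, and the functions assume the values are not all identical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max(|v|) / 10^j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [200, 300, 400, 600, 1000]  # made-up sample values
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling(incomes))
```

On this sample the largest z-score is above 1, which shows that z-score normalization does not bound the data to a fixed interval.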
Data Discretization Methods
- Binning, Histogram analysis, Cluster analysis, Decision tree analysis, correlation analysis
- Concept Hierarchy Approach: Method of transforming data into various levels of granularity (e.g., age, zipcode, country)
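Binning is the simplest of the listed discretization methods. The sketch below shows equal-width binning, one common variant (the notes do not say which variant is meant); the sample ages are made up.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        labels.append(index)
    return labels

ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up sample values
bins = equal_width_bins(ages, 3)
print(bins)
```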
K-Means Clustering
- Exploratory model, unsupervised.
- Groups data based on attributes into clusters (using centroid values for cluster center).
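A minimal sketch of the centroid-based grouping, assuming the standard assign-then-update iteration (Lloyd's algorithm) with Euclidean distance. The points and starting centroids are made up; in practice the starting centroids are usually chosen at random.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up data
centroids, clusters = k_means(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```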
DBSCAN Clustering
- Density-based
- Locates areas of high density, clusters are regions where data density exceeds some threshold
- Sensitive to parameters (ε, MinPts)
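The role of the two parameters can be seen in a compact sketch of the usual DBSCAN procedure. It assumes Euclidean distance and counts a point as its own neighbour; the data, `eps`, and `min_pts` values are made up.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:   # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:    # noise reachable from a core point is a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point too: keep expanding
                queue.extend(j_neighbours)
    return labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (5, 15)]  # made-up data
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point ends up labelled as noise because no neighbourhood around it reaches the density threshold.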
Hypothesis Testing
- Assessing the difference in means of two data samples, or the significance of the difference
- Two types of hypotheses:
  - Null Hypothesis (H0): No difference between the two data samples
  - Alternative Hypothesis (HA): A difference exists between the two data samples
- The outcome is either rejection or non-rejection of H0
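The notes do not name a specific test. As one illustration, the sketch below uses an exact permutation test on the difference in means, chosen here because it needs only the standard library (a t-test is the more common textbook choice). The samples and the 0.05 significance level are assumptions.

```python
from itertools import combinations
from statistics import mean

def permutation_test(sample_a, sample_b):
    """Two-sided exact permutation test for a difference in means."""
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = sample_a + sample_b
    total = sum(pooled)
    n_a, n_b = len(sample_a), len(sample_b)
    extreme = count = 0
    # Under H0 the group labels are interchangeable, so try every relabelling.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        extreme += diff >= observed - 1e-12  # tolerance for float rounding
        count += 1
    return extreme / count

sample_a = [12, 14, 15, 16, 18]  # made-up samples
sample_b = [20, 21, 23, 24, 26]
p_value = permutation_test(sample_a, sample_b)
print(p_value, "reject H0" if p_value < 0.05 else "do not reject H0")
```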
Predictive Models
- Predicting the value of an attribute of a data object in advance (e.g., predicting whether a customer will subscribe or not)
Regression vs Classification
- Classification predicts categorical outcomes (classes).
- Regression (e.g., linear regression) predicts numerical values rather than classes.
Training and Test Sets
- Training set: Used to train the model
- Test set: Used to evaluate the model
- Both sets are independent of each other and non-overlapping
Validation Set
- Part of the data set that's not used in training or testing and is separated to tune the model's parameters
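A three-way split into non-overlapping training, validation, and test sets might look like this. The 60/20/20 proportions and the fixed seed are assumptions made for the example.

```python
import random

def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into non-overlapping train / validation / test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test

train, validation, test = split_dataset(list(range(10)))
print(len(train), len(validation), len(test))
```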
Naïve Bayes Model
- Classifier based on probabilities and the Bayes' theorem.
- Makes the simplifying ("naïve") assumption that attribute values are conditionally independent of one another given the class
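A minimal sketch of a Naïve Bayes classifier for categorical attributes. The training rows are made up, and no smoothing is applied, so an attribute value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def nb_predict(row, class_counts, value_counts):
    """Pick the class maximising P(class) * product of P(value | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total
        for i, value in enumerate(row):
            # Independence assumption: multiply the per-attribute likelihoods.
            score *= value_counts[(i, label)][value] / n
        scores[label] = score
    return max(scores, key=scores.get), scores

# Made-up data: (occupation, income level) -> subscribed?
rows = [("student", "high"), ("student", "low"), ("employed", "high"),
        ("employed", "low"), ("retired", "low"), ("retired", "high")]
labels = ["yes", "yes", "yes", "no", "no", "no"]
model = train_naive_bayes(rows, labels)
prediction, scores = nb_predict(("employed", "high"), *model)
print(prediction, scores)
```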
Naïve Bayes Classifier Metrics
- Accuracy, TPR, FPR, FNR, Precision, AUC
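Most of these metrics follow directly from the confusion-matrix counts. The sketch below uses the standard definitions with made-up labels; AUC is left out because it needs predicted scores rather than hard class labels.

```python
def classification_metrics(actual, predicted, positive="yes"):
    """Compute metrics from true/false positive and negative counts."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),        # true positive rate (recall)
        "FPR": fp / (fp + tn),        # false positive rate
        "FNR": fn / (tp + fn),        # false negative rate
        "precision": tp / (tp + fp),
    }

actual = ["yes"] * 4 + ["no"] * 6  # made-up labels
predicted = ["yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"]
metrics = classification_metrics(actual, predicted)
print(metrics)
```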
Cross-Validation
- Holdout (percentage split): Divides data into training and test sets based on pre-determined percentages. The performance is dependent on how the data is split
- K-fold cross validation: Data divided into K subsets, K trials performed. For each trial, one subset is used for testing, and the remaining K-1 subsets are for training
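The K trials of K-fold cross-validation can be sketched by generating index lists. This minimal version does not shuffle; the data is usually shuffled first, and the sizes used here are made up.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k trials."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx, train_idx)
```

Every sample is used for testing exactly once and for training in the other K-1 trials.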
Model Deployment Best Practices
- Specify performance requirements (accuracy, TPR, precision, etc.)
- Separate model coefficients from the program
- Develop automated tests of the model (testing on a smaller portion of the data outside the training/testing data sets)
- Develop a back-test and now-test infrastructure (testing models on historical data for updates, and ensuring that the model still works as expected when used on new data points).
- Evaluate each model update (testing each update to verify if performance requirements are still met).
Decision Tree & Ensemble Learning
- Prediction/classification by creating a tree-like structure based on attributes and criteria
- Splitting attributes, pruning, information gain, and Gini index
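The two split criteria can be computed directly from class labels. The sketch below uses the standard definitions of the Gini index and entropy-based information gain; the example split is made up.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5  # made-up labels
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4
gain = information_gain(parent, [left, right])
print(gini(parent), entropy(parent), gain)
```

A split with higher information gain (or lower weighted Gini index) separates the classes better, so a tree builder would prefer it.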
Ensemble Learning
- A learning model that combines multiple learners to make predictions that are stronger and more accurate than individual models.
- Approaches like bagging (Bootstrap Aggregation) create multiple models from sampled training data and find a 'majority' vote for predictions
- Boosting builds models in succession, with each new weak model concentrating on the cases the earlier ones got wrong, and combines their weighted outputs into the final prediction.
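Bagging can be sketched with very simple base learners. In this illustration each model is a one-feature threshold rule (a decision stump) trained on a bootstrap sample, and predictions are combined by majority vote. The data, the number of models, and the seed are all made up.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a threshold rule on (x, label) pairs: predict 1 when x > threshold."""
    best = None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def bagging(data, n_models=15, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]
        models.append(train_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Combine the individual stump predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]

data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]  # made-up data
models = bagging(data)
print(bagging_predict(models, 1), bagging_predict(models, 9))
```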