Untitled Quiz
48 Questions

Questions and Answers

What is a key characteristic of data science compared to business intelligence?

  • Emphasis on predictive analytics (correct)
  • Focus on historical data analysis
  • Immediate operational decision support
  • Strictly structured data handling

Which of the following describes a principal goal of data science?

  • To ensure data quality and consistency
  • To organize data into traditional reports
  • To solve real problems using data (correct)
  • To manage database systems effectively

How does data science handle the complexity of data?

  • By applying fixed algorithms to all data types
  • Through structured query language (SQL) usage only
  • By grappling with the structure and messiness of data (correct)
  • By focusing exclusively on clean and organized datasets

Which type of data structure is NOT commonly associated with big data?

  • Static arrays

What aspect distinguishes analyst-owned processes from DBA-owned ones in a data context?

  • Analyst-owned prioritizes data insights and analysis

What is one of the challenges data scientists face when working with data?

  • Interpreting data that is messy and complex

Which method is typically NOT associated with data science?

  • Data entry tasks

Which of the following risks is commonly associated with data replication?

  • Data inconsistency and duplication errors

What are the two main components of an audio signal?

  • DC component and AC component

Why is the DC component usually removed before analyzing the audio signal?

  • It elevates the level of volume.

What is the basic unit for representing a digital image?

  • Pixel

How is digital image data generally presented?

  • In 2-D form

Which term is used to represent a point in a 3D image?

  • Voxel

In the context of processing digital images, which aspect is often isolated?

  • Brightness from color channels

What does the term 'subsampled' refer to in digital imaging?

  • Reducing the resolution of an image

What is the role of the AC component in an audio signal?

  • It represents the frequency corresponding to the pitch.

What is the primary goal of supervised learning?

  • To map input variables to output variables using known associations.

Which of the following is NOT a category of supervised models?

  • Clustering

What type of output is associated with regression models?

  • Real values or numerical outputs, such as $250.

In the context of supervised learning, what does a training set consist of?

  • Input variables and their corresponding output variables.

Which of the following learning types relies on labeled data?

  • Supervised learning

What characterizes unsupervised learning?

  • It seeks to find structure or patterns in data without labeled outputs.

What distinguishes semi-supervised learning from supervised and unsupervised learning?

  • It relies on both labeled and unlabeled data for training.

What type of task would require classification in supervised learning?

  • Determining whether an email is spam or not.

What distinguishes the ETLT approach in data preparation?

  • It can involve either ETL or ELT based on specific goals.

What is a key activity during the data conditioning phase?

  • Cleaning and normalizing datasets.

Which of the following is NOT a key activity in Phase 2 of data preparation?

  • Building predictive models

Why is conducting a data gap analysis important?

  • To assess what data is available versus what is needed.

What should teams consider prior to moving data into the sandbox?

  • The types of transformations that will be needed.

What is the purpose of creating a dataset inventory?

  • To help in understanding available data sources.

What is assessed to determine if a team can move to the modeling phase?

  • The quality and sufficiency of the data.

Which activity is involved in understanding the data during Phase 2?

  • Identifying data entry errors and acceptable value ranges.

What distinguishes a data scientist from someone with basic data skills?

  • Data scientists extract meaning and interpret data.

Which data type can be represented in a 1-D form?

  • Text Data

What does ASCII stand for in data encoding?

  • American Standard Code for Information Interchange

Which type of data can be treated as time-series data?

  • Audio Data

What is the primary use of semantic analysis in data interpretation?

  • To extract information from text data.

What is one of the key characteristics of Unicode compared to ASCII?

  • Unicode can represent multiple languages and more symbols.

Which of the following describes trajectory data?

  • Data that tracks the movement over time.

Which data type typically requires sophisticated coding standards to properly represent various symbols?

  • Text Data

What is the first step in Phase 1 - Discovery of a project?

  • Identifying key stakeholders

Which of the following is NOT a key activity in Phase 1 - Discovery?

  • Conducting market research

What criterion helps to define what constitutes project failure?

  • Establishing failure criteria

What aspect of the project does interviewing the Analytics Sponsor primarily address?

  • Defining the business problem

Which statement best describes Initial Hypotheses in Phase 1 - Discovery?

  • They should start with a few primary ideas.

Which of the following is important when identifying key stakeholders?

  • Understanding their pain points

In the context of project discovery, what is the significance of industry issues?

  • They may impact analysis focus and project direction.

What is one of the expected outcomes of developing Initial Hypotheses?

  • Ideas that can be tested with data

    Study Notes

    Final Online Test Details

    • Date of test: During Week 12 Tutorial Sessions
    • Group 1: Tuesday (November 26th) at A312, 8 am - 10 am
    • Group 2: Thursday (November 28th) at B219, 1 pm - 3 pm
    • Group 3: Friday (November 29th) at A312, 10 am - 12 pm
    • Group 4: Wednesday (November 27th) at A312, 8 am - 10 am
    • Group 5: Friday (November 29th) at B219, 2 pm - 4 pm
    • Mobile phones and ChatGPT prohibited during the test
    • Test must be taken in person on campus

    Assessment Details

    • Closed-book test
    • 50 questions
    • Total points: 120
    • 40 questions x 2 points = 80 points
    • 10 questions x 4 points = 40 points
    • Question formats:
      • Multiple choice (one correct answer)
      • Multiple choice (up to two correct answers)
      • Matching questions

    Big Data Ecosystem Components

    • Data Devices: Cell phone, GPS, MP3, eBook reader, video player, cable box, ATM, credit card reader, RFID
    • Data Collectors: Law enforcement, government, insurance companies, individual medical information brokers, advertising, marketers, employers
    • Data Users/Buyers: Media archives, credit bureaus, financial institutions, banks, delivery services, websites, private investigators
    • Data Aggregators: Websites, data aggregators, etc

    Data Devices

    • Gather data from multiple locations
    • Continuously generate new data about subject data
    • For each gigabyte of data created, a petabyte of data is also generated about the subject data

    Data Collectors

    • Entities that collect data from devices and users
    • Example: Cable TV provider tracks:
      • Shows watched
      • Channels the customer will and will not pay for
      • Prices the customer is willing to pay for premium TV content

    Data Aggregators

    • Entities that compile and make sense of data collected by collectors
    • Companies that transform and package data to sell

    Data Users and Buyers

    • Direct beneficiaries of the data collected and aggregated
    • Example: Corporate customers, analytical services, media archives, advertising companies, information brokers, credit bureaus, catalog co-ops

    Four V's of Big Data

    • Scale (volume)
    • Distribution (listed with the V's here, though it is not itself one of the four)
    • Diversity (variety)
    • Timeliness (velocity)
    • Accuracy (veracity)

    Data Science vs Enterprise Data Warehouse

    • Data Warehouse (DW) is a relational database designed for querying and analysis rather than for transaction processing.
    • Data warehouse contains cleaned, selective historical data.
    • Includes ETL (Extract, Transform, Load) and OLAP (Online Analytical Processing) processes.
    • Data Science processes deal with diverse data sets (4 Vs of big data) and often need different architectures and analytics.

    Analytic Sandbox (Workspaces)

    • Resolves conflicts between analysts' needs and traditional enterprise data warehouses.
    • Stores data from various sources and technologies
    • Enables flexible, high-performance analysis in non-production environments
    • Reduces costs and risks of data replication to "shadow" file systems
    • "Analyst-owned" rather than "DBA-owned"

    Data Science vs Business Intelligence

    • Data Science is exploratory and predictive, focusing on past and future trends and scenarios using various types of data.
    • Business Intelligence is focused on historical and current data to present trends, performance, and issues via reports.

    Big Data Data Structures

    • Unstructured: Data with no predefined format (e.g. text documents, images)
    • Quasi-structured: Data with inconsistent formats that can be structured (e.g., clickstream data)
    • Semi-structured: Data with a defined pattern or format that can be parsed (e.g., XML, JSON)
    • Structured: Data with defined formats, models, and structures (e.g., relational databases, spreadsheets)

    Data Scientist Definition (Academic)

    • A scientist trained in diverse fields from social science to biology.
    • Works with large amounts of data
    • Addresses computational issues of data structure, size, and messiness
    • Solves a real-world problem while handling those computational issues.

    Data Scientist Definition (Industry)

    • Someone who can extract meaning from data and interpret it, which requires tools and methods from statistics and machine learning as well as human judgment.

    Data Types

    • Text data: Limited symbols, usually encoded with ASCII, Unicode, or other standards.
    • Audio data: Amplitude corresponds to volume, frequency corresponds to pitch
    • Image data: Pixels are basic representation unit.
    • 3-D Image data: Voxels instead of pixels to indicate points in a 3D space
    • Video/Streaming data: Image frames displayed in a timeline of events
    • Trajectory data: Collected by GPS, including geo-location and timestamp

    Data Analytics Lifecycle

    • Discovery: Evaluating resources, framing the analytics problem, identifying stakeholders, and determining initial hypotheses.
    • Data Prep: Preparing the analytic sandbox by extracting and cleaning the relevant data
    • Model Planning: Determining the best methods, techniques, and workflow for the next modeling phase
    • Model Building: Creating the model, using data sets for training and testing
    • Communicating Results: Presenting findings and determining if the project achieved intended goals.
    • Operationalizing Results: Implementing models in a production environment.

    Key Activities in Phase 1 (Discovery)

    • Learning the business domain and assessing the resources needed for the project (people, technology, time, and data)
    • Formulating initial hypotheses that are testable against the data.
    • Determining the key stakeholders (those who benefit or are affected by the project).
    • Articulating the key stakeholders' pain points.
    • Interviewing the analytics sponsor

    Key Activities in Phase 2 (Data Preparation)

    • Preparing the analytics sandbox
    • Performing ETL/ETLT on large datasets
    • Gathering insights about the data's characteristics
    • Building a dataset inventory
    • Performing data conditioning (cleaning, normalizing, transforming data)

    Data Discrepancies

    • Causes: poorly designed forms, human errors, deliberate errors (e.g., not providing information), data decay (outdated information), system errors, and data integration issues that cause inconsistent attribute names.
    • Detection strategies: examining metadata, using rules regarding uniqueness, consecutiveness or null values and employing commercial data scrubbing tools.
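
As a rough illustration of the rule-based detection strategies above, the sketch below checks a uniqueness rule and a null-value rule over a list of records. The function name `find_discrepancies`, the record layout, and the field names are hypothetical and chosen only for this example; a consecutiveness rule could be added in the same style.

```python
def find_discrepancies(records, unique_field, required_fields):
    """Return (row_index, message) pairs for rule violations.

    Rules checked (illustrative only):
      - null rule: each required field must be present and non-empty
      - uniqueness rule: unique_field must not repeat across records
    """
    issues = []
    first_seen = {}  # value of unique_field -> index of the first row using it
    for i, rec in enumerate(records):
        for field in required_fields:
            if rec.get(field) in (None, ""):
                issues.append((i, f"missing {field}"))
        key = rec.get(unique_field)
        if key is not None:
            if key in first_seen:
                issues.append(
                    (i, f"duplicate {unique_field} (first seen at row {first_seen[key]})")
                )
            else:
                first_seen[key] = i
    return issues


# Hypothetical sample records
records = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "c@example.org"},
]
issues = find_discrepancies(records, "id", ["email"])
```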

    Data Reduction Strategies

    • Data cube aggregation
    • Attribute subset selection (removing irrelevant, weakly relevant, or redundant attributes)
    • Dimensionality reduction (reducing data set size using encoding schemes)
    • Numerosity reduction (replacing the data or estimating it with smaller representations)
    • Discretization and concept hierarchy generation (replacing raw attribute values with ranges or high-level concepts)

    Data Transformation Strategies

    • Data smoothing (removing noise)
    • Attribute/feature construction (creating new attributes)
    • Aggregation (building data cubes)
    • Normalization (scaling attributes to fall within a specified range)
    • Discretization (replacing values with numerical intervals or conceptual labels)
    • Concept hierarchy generation (generalizing attributes into higher-level categories)

    Data Normalization Methods

    • Min-Max: Transforming data to a specific range (e.g., 0 to 1)
    • Z-score: Rescales each value to (value - mean) / standard deviation, so the result has mean 0 and standard deviation 1. The output is not bounded to a fixed range such as -1 to 1.
    • Decimal scaling: Divides each value by 10^j, where j is the smallest integer that brings the maximum absolute value below 1
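
The three normalization methods can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names and sample data are made up for the example, and the code assumes the input values are not all identical (otherwise the min-max span and the standard deviation are zero).

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumed non-zero
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumed non-zero
    return [(v - mean) / sd for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max |v| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


data = [200, 300, 400, 600, 1000]
scaled_min_max = min_max(data)          # 200 maps to 0.0, 1000 maps to 1.0
scaled_z = z_score(data)                # mean 0, standard deviation 1
scaled_decimal = decimal_scaling(data)  # divides by 10**4 here
```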

    Data Discretization Methods

    • Binning, Histogram analysis, Cluster analysis, Decision tree analysis, correlation analysis
    • Concept Hierarchy Approach: Method of transforming data into various levels of granularity (e.g., age, zipcode, country)
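
Binning is the simplest of the discretization methods listed above. The sketch below shows equal-width binning, one common variant, under the assumption that the values are numeric and not all equal; the function name and the sample ages are illustrative only.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins (bin indices 0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumed non-zero
    labels = []
    for v in values:
        # The maximum value would land in bin k, so clamp it into the last bin.
        labels.append(min(int((v - lo) / width), k - 1))
    return labels


ages = [5, 18, 25, 40, 64, 80]
bins = equal_width_bins(ages, 3)  # bin width is (80 - 5) / 3 = 25
```

Replacing each bin index with a label such as "young", "middle-aged", or "senior" would move the attribute one level up a concept hierarchy.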

    K-Means Clustering

    • Exploratory model, unsupervised.
    • Groups data based on attributes into clusters (using centroid values for cluster center).
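
A minimal sketch of the k-means loop follows: assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until assignments stop changing. For reproducibility this example seeds the centroids with the first k points, which is a simplification; practical implementations usually use random or k-means++ initialization. The function name and sample points are made up for illustration.

```python
import math


def kmeans(points, k, iterations=100):
    """Cluster points (tuples of floats) into k groups; returns (centroids, assignment)."""
    centroids = [points[i] for i in range(k)]  # naive initialization: first k points
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignment = kmeans(points, k=2)
```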

    DBSCAN Clustering

    • Density-based
    • Locates areas of high density, clusters are regions where data density exceeds some threshold
    • Sensitive to parameters (ε, MinPts)
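
To show how ε and MinPts drive the result, here is a compact sketch of DBSCAN using brute-force neighbor search. It assumes the common convention that a point counts as its own neighbor, and it labels noise as -1; the function name and sample points are illustrative.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks low-density points."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force eps-neighborhood; includes point i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these settings the isolated point (5, 5) is noise; raising `eps` or lowering `min_pts` far enough would change that, which is the parameter sensitivity noted above.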

    Hypothesis Testing

    • Assessing the difference in means of two data samples, or the significance of the difference
    • Two types of hypotheses:
      • Null Hypothesis (H0): No difference between the two data samples
      • Alternative Hypothesis (HA): A difference exists between the two data samples
    • Outcome can lead to rejection or non-rejection of H0
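
The notes do not name a specific test, so the sketch below uses Welch's t-statistic as one common way to compare two sample means. The p-value here uses a normal approximation from the standard library, which is only reasonable for large samples; an exact test would use the t distribution. The function names, sample data, and 0.05 significance level are assumptions for illustration.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for the difference in means of two samples."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (assumption: large samples)."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))


a = [5.1, 4.9, 5.0, 5.2, 4.8]
b = [5.9, 6.1, 6.0, 6.2, 5.8]
t = welch_t(a, b)
p = approx_two_sided_p(t)
reject_h0 = p < 0.05  # assumed significance level
```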

    Predictive Models

    • Predicting an attribute of a data object before it is observed (e.g., predicting whether a customer will subscribe or not)

    Regression vs Classification

    • Classification deals with making decisions based on categorical results.
    • Regression models, such as linear regression, output numerical values rather than classes.

    Training and Test Sets

    • Training set: Used to train the model
    • Test set: Used to evaluate the model
    • Both sets are independent of each other and non-overlapping

    Validation Set

    • Part of the data set that is held out from both training and final testing and is used to tune the model's parameters (hyperparameters)
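
One way to obtain non-overlapping training, validation, and test sets is to shuffle the rows once and slice. The sketch below assumes a 60/20/20 split and a fixed random seed; both are arbitrary choices for the example, as is the function name.

```python
import random


def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and cut them into disjoint train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


rows = list(range(10))
train, validation, test = three_way_split(rows)
```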

    Naïve Bayes Model

    • Classifier based on probabilities and the Bayes' theorem.
    • Makes the simplifying ("naïve") assumption that attribute values are conditionally independent of one another given the class
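
A small sketch of a categorical Naïve Bayes classifier: it multiplies the class prior by one per-attribute likelihood for each attribute, which is exactly where the independence assumption enters. Laplace (add-one) smoothing is added here so unseen values do not zero out the product; that choice, the function names, and the toy weather data are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    attribute_values = defaultdict(set)  # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            attribute_values[i].add(value)
    return class_counts, value_counts, attribute_values


def predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, value_counts, attribute_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Laplace-smoothed estimate of P(value | class)
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(attribute_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("overcast", "no"), ("overcast", "yes")]
labels = ["yes", "yes", "no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
```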

    Naïve Bayes Classifier Metrics

    • Accuracy, TPR, FPR, FNR, Precision, AUC
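
Most of these metrics follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; AUC is left out because it needs ranked prediction scores rather than counts. The function name and the example counts are made up.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from true/false positive/negative counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn),        # true positive rate (recall)
        "fpr": fp / (fp + tn),        # false positive rate
        "fnr": fn / (tp + fn),        # false negative rate = 1 - TPR
        "precision": tp / (tp + fp),
    }


metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```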

    Cross-Validation

    • Holdout (percentage split): Divides data into training and test sets based on pre-determined percentages. The performance is dependent on how the data is split
    • K-fold cross validation: Data divided into K subsets, K trials performed. For each trial, one subset is used for testing, and the remaining K-1 subsets are for training
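
The K-fold procedure can be sketched as an index generator: each of the K trials holds out one fold for testing and trains on the rest. Folds are formed here by striding through the indices, which is one simple assumption; shuffling first is common in practice. The function name is illustrative.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k trials."""
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    for test_fold in folds:
        held_out = set(test_fold)
        train = [j for j in range(n) if j not in held_out]
        yield train, test_fold


splits = list(k_fold_indices(n=10, k=5))
```

Averaging the evaluation metric over the K trials gives an estimate that depends less on any single split than the holdout method does.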

    Model Deployment Best Practices

    • Specify performance requirements (accuracy, TPR, precision, etc.)
    • Separate model coefficients from the program
    • Develop automated tests of the model (testing on a smaller portion of the data outside the training/testing data sets)
    • Develop a back-test and now-test infrastructure (testing models on historical data when they are updated, and checking that the model still works as expected on new data points).
    • Evaluate each model update (testing each update to verify if performance requirements are still met).

    Decision Tree & Ensemble Learning

    • Predicts or classifies by building a tree-like structure of tests on attribute values
    • Key concepts: splitting attributes, pruning, information gain, and Gini index
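
Information gain and the Gini index both score how well a candidate split separates the classes. The sketch below uses the standard definitions (Gini impurity as one minus the sum of squared class proportions, information gain as the entropy reduction from parent to children); the function names and labels are illustrative.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, higher when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]
```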

    Ensemble Learning

    • A learning approach that combines multiple learners to make predictions that are stronger and more accurate than those of the individual models.
    • Bagging (Bootstrap Aggregation) trains multiple models on bootstrap samples of the training data and takes a majority vote of their predictions.
    • Boosting builds models in sequence, with each new model concentrating on the cases the earlier, weaker models got wrong, and combines their weighted outputs into the final prediction.
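
A minimal sketch of bagging, assuming one-dimensional inputs, binary 0/1 labels, and decision stumps (single-threshold classifiers) as the base learners. All of these are simplifications chosen for the example, as are the function names, the seed, and the toy data.

```python
import random
from collections import Counter


def train_stump(sample):
    """Pick the threshold and label orientation with the fewest errors on (x, label) pairs."""
    best = None
    for threshold, _ in sample:
        for low_label, high_label in ((0, 1), (1, 0)):
            errors = sum(
                (low_label if x <= threshold else high_label) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x <= threshold else high_label


def bagging(data, n_models=25, seed=0):
    """Train stumps on bootstrap samples; return a majority-vote predictor."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_stump(bootstrap))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
predict = bagging(data)
low_prediction = predict(1)
high_prediction = predict(9)
```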
