Questions and Answers
Which of the following roles is primarily focused on building and maintaining the data infrastructure?
- Machine Learning Practitioner
- Data Engineer (correct)
- Data Scientist
- Data Analyst
A large greenhouse tomato grower wants to optimize their harvesting schedule based on predicted yields and market prices. Which category of analytics would be most suited to address this?
- Diagnostic
- Prescriptive (correct)
- Descriptive
- Predictive
A retail company is analyzing why many customers are leaving their service. Examining historical sales data, they identified a correlation between a recent change in the loyalty program and customer churn. Which type of analytics is exemplified?
- Prescriptive
- Descriptive
- Diagnostic (correct)
- Predictive
A hospital is using machine learning to predict which patients are most at risk of readmission within 30 days of discharge. Which type of analytics does this represent?
A company is implementing a data strategy that involves moving from on-premises servers to cloud-based infrastructure. Which of the following aspects of modern data strategies does this align with?
An organization has a large amount of customer data stored in various databases and wants to create a single, accessible repository for analysis. Which component of modern data strategies would be most applicable?
A company wants to improve customer service. They decide to implement a system that uses AI to analyze real-time customer feedback and provide suggestions to service representatives. This change falls under which tenet of modern data strategies?
In a data pipeline, what is the primary goal of the 'Ingestion' layer?
In the context of data engineering, which of the following describes data wrangling?
Which of the following best describes the importance of iterating the data pipeline?
Which question is a data engineer most likely to ask when designing a data pipeline?
Which of the following questions is a data scientist most likely to consider when working with a data pipeline?
In designing a data pipeline, which of the following is the MOST important initial step?
When designing a data pipeline from the start, what describes starting with the business problem to be solved and working backwards to the data?
What does the data pipeline include?
Which of the following scenarios exemplifies the use of data science in everyday life?
What are the considerations in deciding what type of analysis to use?
A company is deciding whether to invest in a more accurate prediction model. What factor should they consider?
In the context of the Five V's of data, what does 'Veracity' refer to?
A company wants to improve decision-making and needs to determine the relevance of its available data. Which of the five V's of data should they focus on?
A company is struggling to analyze customer data due to inconsistencies in formatting and data types across different sources. Which characteristic of the five V's of data presents the greatest challenge?
A data pipeline processes clickstream data from a high-traffic website. Which of the following considerations related to data characteristics is MOST critical for designing the pipeline's ingestion layer?
Why could there be a choice between streaming ingestion and batch ingestion?
Why could there be a choice between long-term reporting access and short-term, very fast access?
A financial company needs to analyze transaction data and generate alerts for potential fraud in real-time. Which type of processing is most suitable?
What is the best description of unstructured data?
Which of the following is an example of unstructured data?
A company analyzes comments from an online chat application. How is this data stored?
What is the best description of time-series data?
A weather monitoring system collects temperature, humidity, and wind speed measurements from various sensors every minute. Which type of data source does this represent?
A data team discovers inconsistencies and outdated information in a customer database obtained from a third-party vendor. Which data quality issue does this primarily relate to?
What happens if value rests on veracity and the veracity is not strong?
Which of the following is a method of evaluating the veracity of data?
A data analyst identifies a pattern of missing values in a specific field of a customer dataset. According to the best practices for cleaning data, what should be their NEXT step?
What describes immutable data?
A company is deciding to analyze only aggregated data to save space. What is the BEST reason to avoid this practice?
A company needs to store user activity data for compliance purposes and ensure that any changes to the data are fully traceable. Which approach is MOST suitable?
What is a good way to maintain data integrity and consistency?
Flashcards
Data-Driven Decisions
Using data science to make informed decisions in an organization.
Data Analytics
Systematic analysis of large datasets to find patterns and produce actionable insights.
AI/ML
Mathematical models that make predictions from data at a scale impossible for humans.
Descriptive Analytics
Describes past performance, history, and trends; answers "What happened?"
Diagnostic Analytics
Describes the reasons for trends; answers "Why did it happen?"
Predictive Analytics
Forecasts future trends; answers "What will happen?"
Prescriptive Analytics
Recommends a decision or course of action; answers "How can we make it happen?"
Data Pipeline
The infrastructure for data-driven decisions: ingesting, storing, processing, and analyzing/visualizing data.
Data Wrangling
Discovering, cleaning, normalizing, transforming, and enriching data.
Data Engineering
Building and maintaining the data infrastructure that the pipeline runs on.
Modern Data Strategies
A three-pronged approach to becoming data driven: modernize, unify, and innovate.
Innovate
Apply AI and ML to find new insights in data.
Modernize
Move to cloud-based and purpose-built services to reduce effort.
Unify
Create a single source of truth for data across the organization.
Volume
How big the dataset is and how much new data is generated.
Velocity
How frequently new data is generated and ingested.
Variety
The types of data and the sources the data comes from.
Veracity
How accurate, precise, and trusted the data is.
Value
The insights that can be pulled from the data.
Unstructured Data
Files with no defined structure; flexible, and the majority of available data.
Semi-structured data
Data organized as elements with self-describing attributes; easy to use.
Structured Data
Rows and columns with a well-defined schema; easy to use.
Public Datasets
Data aggregated about a topic, such as census, health, and population data.
On-Premises Databases
Application data owned and managed by the organization.
Events, IoT Devices, Sensors
Data generated continually by events, with a time-based component.
Veracity and Value
Value rests on veracity; insights are only as good as the accuracy and trustworthiness of the data behind them.
Study Notes
- The lecture covers IT Service Management, specifically focusing on Data Engineering concepts.
- A new lab link, AWS Academy Data Engineering [105501], is available in the Blackboard Information section.
- Job roles related to data engineering include data engineer, data analyst, data scientist, extract, transform, and load (ETL) developer, and machine learning (ML) practitioner.
Data-Driven Decisions
- Throughout 2026, organizations are expected to increase investments in data and analytics services by 45% to become more data driven and digital.
- Data-driven decisions are relevant in various contexts.
- Examples include deciding on a restaurant, finding a used bicycle, playing fantasy football, searching for a job, or pricing a home.
- Organizations make data-driven decisions for flagging fraudulent transactions, optimizing webpage design, predicting patient relapse, identifying security issues, and determining the optimum time for harvesting crops.
- Data-driven decisions are useful for large greenhouse tomato growers to optimize operations.
Data Analytics vs AI/ML
- Data analytics systematically analyzes large datasets (big data) to find patterns and trends and produce actionable insights; it uses programming logic to answer questions from structured data with a limited range of variables.
- AI/ML uses mathematical models that learn from examples in large amounts of data to make predictions at a scale difficult for humans; it is suitable for unstructured data with complex variables.
Business example: Customer relationship management
- Data analytics analyzes total revenue per customer to segment customers into categories, enabling higher customer service for high-spending customers.
- The AI/ML approach analyzes customer churn and its causes, helping businesses improve customer retention strategies; a sketch of the data-analytics segmentation side follows below.
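As a rough illustration of the data-analytics side of this example, the sketch below segments customers by total revenue using fixed thresholds. The DataFrame, column names, and threshold values are all hypothetical, chosen only to show the idea.

```python
import pandas as pd

# Hypothetical transaction data; column names are assumptions for illustration.
transactions = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "amount": [120.0, 80.0, 35.0, 300.0, 250.0, 410.0],
})

# Data-analytics approach: aggregate total revenue per customer,
# then segment with simple programming logic (fixed thresholds).
revenue = transactions.groupby("customer_id")["amount"].sum()
segments = pd.cut(
    revenue,
    bins=[0, 100, 500, float("inf")],
    labels=["low", "medium", "high"],
)
print(segments)
```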
Categories of Analytics
- Descriptive analytics describes past performance, history, and trends and answers the question "What happened?"
- Diagnostic analytics describes the reasons for trends and answers the question "Why did it happen?"
- Predictive analytics forecasts future trends and answers the question "What will happen?"
- Prescriptive analytics recommends a decision or course of action and answers the question "How can we make it happen?"
- Insights become more valuable but also more difficult to derive as you move from descriptive to prescriptive analytics.
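A minimal sketch of how two of these categories might look in code, using made-up monthly sales figures: descriptive analytics summarizes what happened, and predictive analytics fits a simple linear trend to forecast the next month. Diagnostic and prescriptive analytics are only noted in comments, since they would need explanatory variables and an optimization step, respectively.

```python
import numpy as np

# Twelve months of hypothetical sales figures.
monthly_sales = np.array([210, 225, 240, 238, 255, 270, 268, 290, 305, 300, 320, 335])

# Descriptive analytics -- "What happened?": summarize past performance.
print("total:", monthly_sales.sum(), "mean:", monthly_sales.mean())

# Predictive analytics -- "What will happen?": fit a simple linear trend
# and forecast the next month. (Diagnostic analytics would look for causes,
# and prescriptive analytics would recommend an action based on the forecast.)
months = np.arange(len(monthly_sales))
slope, intercept = np.polyfit(months, monthly_sales, deg=1)
forecast = slope * len(monthly_sales) + intercept
print("next month forecast:", round(forecast, 1))
```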
Key points about decisions
- More data combined with fewer barriers enables more data-driven decisions.
- Data science is apparent when shopping for shoes online, and the advertisements become shoe-related.
- Data science is relevant on a streaming platform where there are recommendations for movies or music.
- Data science is also relevant when ordering pizza online and getting continuous updates about the preparation and delivery.
- Data science appears with credit card fraud detection and navigation apps that help identify traffic jams.
- More data does not automatically result in more value.
- More data means higher costs, more unstructured data, greater security risks, and slower query processing.
- Data becomes less valuable for decision-making over time.
- The most valuable data is preventive/predictive used in near real time.
- The least valuable data is historical used within days to months of the initial data capture.
Trade-offs of data-driven Decisions
- The tradeoffs of data-driven decisions include cost (how much to invest), speed (how quickly an answer is needed), and accuracy (how accurate the prediction needs to be).
Key Takeaways: Data-driven decisions
- Data-driven organizations use data science to make informed decisions.
- Data analytics uses programming logic and tools for predictions from structured data.
- AI/ML uses examples from data for unstructured data predictions.
- Increased data and technology advancements are expanding opportunities for data-driven decisions.
Key Takeaways: Modern data strategies
- Organizations that want to become data driven should modernize, unify, and innovate with their data infrastructures.
- A modern data strategy involves a three-pronged approach: modernize, unify, innovate.
- Modernizing involves moving to cloud-based services and purpose-built services to reduce effort.
- Unifying involves creating a single source of truth for data across the organization.
- Innovating involves applying AI and ML to find new data insights.
- Being a data-driven organization means culturally treating data as a strategic asset.
Data Pipeline for Data-Driven Decisions
- The data pipeline involves collecting data, processing it, and building something useful with it.
- It is important to work backward to design infrastructure.
- One must determine what decision to make first, and then identify what data is needed to support the decision and weigh the trade-offs of cost, speed, and accuracy.
- The layers of a pipeline include the data sources, ingestion, storage, processing, analysis/visualization, and predictions and decisions.
- These actions include data wrangling (discovering, cleaning, normalizing, and enriching) and transformation.
- Iterative processing is used to evaluate and improve the data pipeline.
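The sketch below walks through those layers as plain Python functions operating on an in-memory list. The function names, record fields, and the list standing in for "storage" are illustrative assumptions, not any particular service or framework.

```python
# A minimal sketch of the pipeline layers as plain functions.

def ingest(source_records):
    """Ingestion layer: pull raw records from a data source."""
    return list(source_records)

def store(records, storage):
    """Storage layer: persist raw records (here, an in-memory list)."""
    storage.extend(records)
    return storage

def process(storage):
    """Processing layer: wrangle the data -- clean, normalize, enrich."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in storage
        if r.get("amount") is not None  # drop incomplete records
    ]

def analyze(clean_records):
    """Analysis/visualization layer: produce an insight for a decision."""
    return sum(r["amount"] for r in clean_records)

raw = [{"order": 1, "amount": "19.99"}, {"order": 2, "amount": None}]
total = analyze(process(store(ingest(raw), storage=[])))
print("total revenue:", total)
```

Working backward, the decision (here, "what is total revenue?") determines what the analysis layer must produce, which in turn determines what the earlier layers must ingest, store, and clean.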
Key points about Data Pipeline
- A data pipeline provides the infrastructure for data-driven decision-making.
- Pipeline design should begin with the business problem.
- A pipeline includes ingesting, storing, processing, and analyzing and visualizing data.
- Data wrangling includes discovery, cleaning, normalization, transformation, and augmentation.
- Data is processed iteratively to refine results.
Data Engineer Role
- Both data scientists and data engineers work with the data pipeline.
- Tasks are sometimes interchangeable between data engineers and data scientists.
- Data engineering focuses on the infrastructure, while data science emphasizes data utilization.
- It is crucial to ask questions about the desired outcomes and the data when building a pipeline, so that the most effective pipeline gets built.
Common data pipeline questions
- Some include whether the organization owns data that addresses the need and whether it needs to combine data from multiple sources.
- Others include the type and quality of the data, the source of truth, and the security requirements for the data.
- Others ask what mechanisms should be implemented for transferring data, how much data exists, how frequently it is updated, and how important the requested data speed is.
- Others ask what the data can teach, how to assess results, what type of visualization is required, and what formats and tools analysts are familiar with.
- Others ask whether a big data framework is needed, whether AI/ML models fit the problem, and what the easiest way to implement them is.
Module objectives
- This module prepares one to list the five Vs of data.
- Describe the impact of volume and velocity on the data pipeline.
- Compare and contrast structured, semistructured, and unstructured data types.
- Identify data sources that feed a data pipeline, and the questions to ask when assessing data veracity.
- Suggest methods to improve the veracity of data in the pipeline.
Five V's of Data
- Volume, velocity, variety, veracity, and value are the five V's of data.
- Volume refers to how big the dataset is and how much new data is generated.
- Velocity refers to the frequency of new data generation and ingestion.
- Variety is about the types of data and the sources where the data comes from.
- Veracity relates to how accurate, precise, and trusted the data is. Value relates to the insights that are pulled from the data.
- Confirm that the data meets the need and evaluate the feasibility of acquiring it.
- Match the pipeline design to the data, balancing throughput and cost.
- Let users focus on the business, catalog data and metadata, and implement governance.
Volume and Velocity
Data and Pipeline design and configuration
- Volume and velocity impact all layers of the pipeline, and each layer must be evaluated against its own requirements.
- It is also important to consider the amount and pace of the data when designing the pipeline.
- Balance costs for throughput and storage against the required time to answer and the accuracy of the answer.
Ingestion consideration
- Suit ingestion method to the amount of data to be ingested and the frequency with which new data must be ingested and processed.
- Two examples of handling data:
- Streaming ingestion deals with continuous but small bits of data and requires that analysis be performed immediately; an example is clickstream data from a retail website.
- Batch ingestion involves the same sales transaction data but from all retailers worldwide; it is sent to a central location periodically and analyzed or processed at intervals (overnight in this example). A sketch contrasting the two approaches follows.
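The sketch below contrasts the two styles with hypothetical handler functions standing in for real ingestion services: streaming processes each small event as it arrives, while batch accumulates records and loads them at an interval.

```python
import time
from datetime import datetime, timezone

def handle_click(event):
    """Streaming ingestion: each small event is processed as it arrives."""
    print("ingested at", datetime.now(timezone.utc).isoformat(), event)

def nightly_batch_load(events):
    """Batch ingestion: accumulated records are loaded at an interval."""
    print(f"loading {len(events)} records in one batch")

# Streaming: clickstream events trickle in continuously.
for click in [{"page": "/cart"}, {"page": "/checkout"}]:
    handle_click(click)
    time.sleep(0.1)  # stand-in for events arriving over time

# Batch: worldwide sales transactions collected during the day,
# sent to a central location and processed overnight.
days_transactions = [{"sale_id": i, "amount": 10.0 * i} for i in range(1000)]
nightly_batch_load(days_transactions)
```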
Storage Consideration
- Which storage types can scale to the volume of data to be ingested and make it accessible to processing and analysis as quickly as required?
- Two examples of this:
- Long term: five years of data stored for reporting access.
- Short term: incoming ecommerce sales data used for the current session, requiring very fast access.
Processing considerations
- How much data must be processed in a single iteration?
- Big data processing: Performing analytics on all US-based transactions from the last week.
- Streaming analytics: Real-time alerts produced on log data.
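The streaming-analytics case might look something like the following sketch, which raises an alert when the error rate in a sliding window of log events crosses a threshold. The window size and threshold are arbitrary choices for illustration.

```python
from collections import deque

WINDOW = 20          # most recent events to consider
THRESHOLD = 0.25     # alert if more than 25% of the window are errors

recent = deque(maxlen=WINDOW)

def on_log_event(level):
    """Process one log event as it streams in and alert on high error rates."""
    recent.append(level == "ERROR")
    if len(recent) == WINDOW and sum(recent) / WINDOW > THRESHOLD:
        print("ALERT: error rate exceeded", THRESHOLD)

# Simulated stream of log levels arriving one at a time.
for level in ["INFO"] * 14 + ["ERROR"] * 6:
    on_log_event(level)
```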
Analysis/Visualization
- Analysis and visualization considerations include how much data must be used in a visualization, at what level of detail, and how quickly consumers need to see and act on the data.
Examples for Analysis/Visualization
- A year's worth of sales data is visualized and the users can drill down by the region and salesperson.
- The real-time error rates of sensors in a plant are visualized.
- Volume is the amount of data to process.
- Velocity is about how quickly the data moves through the pipeline.
- Evaluate the volume and velocity for each layer, and balance the cost and throughput requirements.
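A small sketch of the drill-down idea, assuming a hypothetical pandas DataFrame of sales rows: the top-level view aggregates by region, and the drill-down view aggregates by region and salesperson.

```python
import pandas as pd

# A year of hypothetical sales rows; column names are assumptions.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "salesperson": ["Ana", "Ben", "Ana", "Cy"],
    "amount": [1200, 800, 950, 400],
})

# Top level: totals by region (what a dashboard shows first).
print(sales.groupby("region")["amount"].sum())

# Drill-down: totals by region and salesperson.
print(sales.groupby(["region", "salesperson"])["amount"].sum())
```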
Variety - Data Types
- Pipeline design decisions are influenced by both data types and data sources.
- Certain data types lend themselves to certain kinds of processing and analysis.
- Data source types might require different amounts of discovery and transformation work.
- Unstructured data has greater potential for discovering untapped insights.
- The data source type will also drive the type and scope of the ingestion layer.
Data categorization and benefits
- Structured: Rows and columns and well-defined schema. Easy to use.
- Semi-structured: Elements and self-describing attributes. Easy to use.
- Unstructured: Files with no defined structure. Flexible.
- 80% or more of available data is unstructured.
- Examples include querying a relational database to report on customer cases, analyzing customer comments from an online chat application, and performing sentiment analysis on customer service emails.
- The general data types are structured, semi-structured, and unstructured, and the type determines how readily the data can be queried.
- Unstructured data holds the most promise and is the most flexible; it is also the form most available data takes (the three forms are sketched below).
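To make the three categories concrete, the sketch below represents customer-service data in each form: a relational table (structured), a JSON chat comment (semi-structured), and a free-text email (unstructured). The table schema, JSON fields, and naive keyword check are illustrative assumptions.

```python
import json
import sqlite3

# Structured: rows and columns with a well-defined schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cases (id INTEGER, customer TEXT, status TEXT)")
db.execute("INSERT INTO cases VALUES (1, 'C42', 'open')")
print(db.execute("SELECT COUNT(*) FROM cases WHERE status = 'open'").fetchone())

# Semi-structured: elements with self-describing attributes (JSON here).
chat_comment = json.loads(
    '{"user": "C42", "ts": "2024-05-01T10:15:00Z", "text": "App keeps crashing"}'
)
print(chat_comment["text"])

# Unstructured: free text with no defined structure; needs techniques such as
# sentiment analysis to extract meaning (a naive keyword check stands in here).
email_body = "Thanks for the quick fix, the new release works great!"
print("positive" if "great" in email_body.lower() else "unknown")
```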
Variety - Data Sources
- Data source types that can feed a pipeline include organizational data stores, public datasets, and time-series sources such as events, IoT devices, and sensors.
- Combining these sources into one dataset can enrich the data and deepen the analysis, but it can also complicate data processing.
Common data source types
- On-premises databases or file stores: Application data is owned and managed by the organization.
- Public datasets: Data is aggregated about a topic, such as census data, health data, and population data.
- Events, IoT devices, sensors: Data is generated continually by events and includes a time-based component.
- Sources like these differ in who controls them, whether they contain data that may not be needed, and whether they are time series.
- They also differ in privacy and structure characteristics that need to be managed (a sample time-series record is sketched below).
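A time-series record from the weather-sensor example might look like the sketch below. The field names and values are hypothetical; the defining feature is the timestamp carried by every reading.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# A minimal sketch of one time-series reading from a weather sensor.
@dataclass
class SensorReading:
    sensor_id: str
    timestamp: datetime       # the time-based component that makes it time series
    temperature_c: float
    humidity_pct: float
    wind_speed_ms: float

reading = SensorReading(
    sensor_id="greenhouse-7",
    timestamp=datetime.now(timezone.utc),
    temperature_c=21.4,
    humidity_pct=63.0,
    wind_speed_ms=1.8,
)
print(reading)
```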
Veracity and value
- Making a data-driven decision with bad data is worse than making no decision at all.
- It is important to maintain the integrity of data when extracting, combining and transforming it.
- Value rests on veracity: the insights that drive good decisions depend on the quality and reliability of the data.
Data Evaluation
- Evaluate how the source data was entered and how it is managed, and maintain the integrity of audit trails.
- Be wary of the reliability of datasets and the types of bias that can be built into them.
- Clean and transform the data for consistency, and verify that sources are reliable.
Maintaining and improving data
- Errors are often repetitive, so tracing a data error can improve data in the future.
- Define clean data states.
- Discover the data: understand what it contains from the beginning.
- Clean and transform: remove duplicates, incorrect data points, and so on (see the sketch after this list).
- Prevent errors from software bugs, tampering, and human error.
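A minimal cleaning sketch along those lines, assuming hypothetical customer records pulled from two sources: values are normalized for consistency, duplicates are removed, and missing or incorrect points are flagged so the error can be traced rather than silently dropped.

```python
import pandas as pd

# Hypothetical customer records from two sources, with inconsistent
# formatting, a duplicate, and missing/incorrect values.
customers = pd.DataFrame({
    "email": ["a@x.com", "A@X.COM", "b@y.com", None],
    "country": ["US", "us", "DE", "DE"],
    "age": [34, 34, -1, 28],   # -1 is an obviously incorrect point
})

# Normalize for consistency before deduplicating.
customers["email"] = customers["email"].str.lower()
customers["country"] = customers["country"].str.upper()

# Remove duplicates, then flag missing or incorrect values rather than
# silently dropping them, so the error can be traced back to its source.
customers = customers.drop_duplicates()
issues = customers[customers["email"].isna() | (customers["age"] < 0)]
print(issues)
```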
Data Integrity and Consistency
- Secure all layers of the pipeline and grant access by least privilege. A best practice is to maintain data integrity with an audit trail that records how the data is used.
- Implement compliance in a single secure location.
- Use simple processes from the start to capture and timestamp details.
- Data transformations can be simple or complex, and each transformation carries its own data integrity considerations.
- Organizations should implement compliance and protection strategies for their data; one traceability approach is sketched below.
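One way to keep changes fully traceable is an append-only audit log, sketched below with a hypothetical hash chain so tampering is detectable. This is an illustration of the idea, not a prescribed implementation.

```python
import hashlib
import json
from datetime import datetime, timezone

# Append-only audit log: records are never updated in place. Every change is
# written as a new, timestamped entry whose hash is chained to the previous
# entry, so any later tampering breaks the chain and can be detected.
audit_log = []

def append_event(action, detail):
    prev_hash = audit_log[-1]["hash"] if audit_log else ""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "detail": detail,
        "prev_hash": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        (prev_hash + json.dumps(entry, sort_keys=True)).encode()
    ).hexdigest()
    audit_log.append(entry)

append_event("profile_update", {"user": "C42", "field": "email"})
append_event("profile_update", {"user": "C42", "field": "address"})
print(json.dumps(audit_log, indent=2))
```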