Questions and Answers
Which of the following is considered a 'related job role' in the field of data engineering?
- Data Scientist (correct)
- Network Security Engineer
- Database Administrator specializing in backups
- Frontend Web Developer
Organizations are projected to increase their investment in data and analytics services by what percentage through 2026?
- 45 percent (correct)
- 25 percent
- 15 percent
- 35 percent
Which of the following best describes the use of data analytics?
- Applying mathematical models to handle unstructured data in real-time.
- Systematic analysis of large datasets (big data) to find patterns and trends to produce actionable insights. (correct)
- Analyzing large and complex datasets to uncover intricate relationships between variables.
- Making predictions from data at a scale impossible for humans.
What is the primary function of AI/ML in contrast to data analytics?
A retail store analyzes historical total revenue per customer and categorizes customers based on their spending habits. Which type of analytics is the store primarily using?
A healthcare provider uses machine learning to forecast which patients are most likely to be readmitted within 30 days. Which category of analytics does this fall under?
A manufacturing plant analyzes sensor data to determine the root causes of recent equipment failures. Which category of analytics are they employing?
A city uses traffic pattern analysis to recommend adjustments to signal timings aiming to reduce congestion. Which type of data analytics are they applying?
What is the crucial initial step in designing a data pipeline for data-driven decisions?
Which of the following best describes data wrangling in the context of a data pipeline?
In a data-driven organization, what is the primary focus of data engineering?
Which of the following practices aligns with the 'Unify' aspect of modern data strategies?
What is the primary goal of 'Modernizing' data infrastructure in data-driven organizations?
Which action exemplifies the 'Innovate' pillar of a modern data strategy?
Which of the following is NOT one of the 'Five Vs' of data?
Which 'V' of data is most closely associated with the trustworthiness and accuracy of data?
How do volume and velocity most directly impact data pipeline design?
Which scenario exemplifies streaming ingestion?
Which data type is characterized by elements and attributes, has a self-describing structure, and is exemplified by formats like JSON and XML?
Why is it important to consider data variety when designing a data pipeline?
Which of the following is generally true about unstructured data compared to structured data?
In data analytics, what is the significance of 'data lineage'?
A data engineer discovers inconsistencies in how customer addresses are formatted across two merged datasets. Which action best addresses this issue?
What is the advantage of storing timestamped details instead of aggregated values in a data analytics system?
Which of the following strategies would most effectively enhance the 'Veracity' of data in a data pipeline?
A sensor network is established at a wind farm which sends constant streams of data about wind speed, direction, and temperature. Which category of data source best describes this situation?
A data scientist is tasked with building a predictive model for customer churn. During the data exploration phase, they notice a significant number of missing values in a key demographic field. Which action is the MOST appropriate first step to address this issue?
A company has traditionally relied on nightly batch processing of sales data to generate reports. However, business stakeholders are now demanding real-time insights into sales performance. Which data architecture change would best enable this capability?
A financial institution wants to detect fraudulent transactions as quickly as possible. Which type of data analysis approach is MOST suitable for this purpose?
Which of the following is a critical consideration when choosing a data storage solution for a high-volume, high-velocity data stream, such as sensor data from industrial equipment?
A large e-commerce company captures user browsing behavior, product views, and purchases to personalize recommendations. Which data type would best categorize the user's comments?
A health tech company is developing a personalized medicine app that combines patient medical history, genomic data, and real-time data from wearable sensors. What challenge is MOST likely to arise due to the variety of data sources?
A retail company implements a new data governance program. Which activity would directly support maintaining data integrity and consistency?
After a systems upgrade, a data analyst notices that customer birthdates are being incorrectly recorded in the database, resulting in many customers appearing to be born on January 1, 1900. Which step would be the MOST effective in addressing this data quality issue?
A data team is designing a new data warehouse. Which approach is typically recommended to best support traceability and debugging in data analytics?
A company captures data about user interactions with their web application. Over time, they transition from capturing simple page view events to more complex events including mouse movements and form inputs. Which 'V' of data does this change primarily reflect?
Flashcards
Data analytics
A systematic analysis of large datasets to find patterns and trends, producing actionable insights.
AI/ML
Mathematical models used to make predictions from data at a scale that is difficult for humans.
Descriptive Analytics
Analyzing past performance, history, and trends.
Diagnostic Analytics
Analyzing the reasons behind results and trends to answer "Why did it happen?"
Predictive Analytics
Forecasting future trends to answer "What will happen?"
Prescriptive Analytics
Recommending a course of action to answer "How can we make it happen?"
Data Pipeline
The infrastructure that supports data-driven decisions, with layers for data sources, ingestion, storage, processing, visualization, and predictions and decisions.
Data Wrangling
Transforming data as it moves through the pipeline, with steps such as discovery, cleaning, normalization, enrichment, and transformation.
Data Volume
The size of a dataset and how much new data is being generated.
Data Velocity
How quickly new data is generated and how fast it enters and moves through the pipeline.
Data Variety
The number of different types, formats, and sources of data.
Data Veracity
The accuracy, precision, and trustworthiness of data.
Data Value
The insights that can be derived from the data to support decisions.
Data Consistency
Keeping data uniform and free of integrity issues as it is combined and moved through the pipeline.
Structured Data
Data organized in rows and columns with a well-defined schema, such as tables in a relational database.
Semi-structured Data
Data with self-describing elements and attributes, such as CSV, JSON, and XML.
Unstructured Data
Data with no predefined structure, such as images, video, and clickstream data.
On-Premises Databases
Application data stores owned and managed by the organization, often containing private, structured data.
Public Datasets
Openly available datasets about a topic, such as census, health, or population data.
Time-series Data
Data generated continuously by events, IoT devices, and sensors, with a time-based component.
Data Integrity
Preserving the validity of data as it is ingested, cleaned, and processed through the pipeline.
Discover Data Errors
Identifying data quality issues such as duplicates, outdated data, and missing records.
Immutable data
Data that cannot be changed after it is written, which prevents unwanted changes to stored data.
Study Notes
- These notes are for DAB 202, IT Service Management, Week 6
Job Roles
- Job roles related to data engineering include data analyst, data scientist, extract, transform, and load (ETL) developer, and machine learning (ML) practitioner
Data-Driven Decisions
- Through 2026, organizations plan to increase investments in data and analytics services by 45% in order to become more data-driven and digital (Gartner, March 2022)
Making Decisions
- Individuals use data to decide on restaurants, where to buy items, fantasy football, job types, and home prices
- Organizations use data to identify fraudulent transactions, measure the impact of webpage designs, find at-risk patients, detect security issues, and determine harvest times
- Large greenhouse tomato growers, for example, can make data-driven decisions
Data Analytics vs AI/ML
- Data analytics is the systematic analysis of large datasets (big data) to identify patterns and produce actionable insights
- Data analytics uses programming logic to answer questions from data and works well with structured data and a limited number of variables
- AI/ML uses mathematical models to make predictions from data at a scale that is difficult or impossible for humans
- AI/ML learns from examples in large amounts of data to answer questions
- AI/ML works well when data is complex and unstructured
Customer Relationship Management Example
- A retail business uses data analytics to analyze total revenue per customer and segment customers by spending
- Segmentation lets the business provide a higher level of customer service to high-spending customers
- The same retail business can use AI/ML to predict customer churn, that is, how often customers come and go
- AI/ML helps discover what influences churn, enabling changes that better retain customers
Categories of Analytics
- Descriptive analytics describes past performance, history and trends to answer "What happened?"
- Diagnostic analytics describes the reasons behind trends and helps answer "Why did it happen?"
- Predictive analytics forecasts future trends to answer "What will happen?"
- Prescriptive analytics recommends a course of action and helps answer "How can we make it happen?"
Analytics Categories and Questions
- "Which customer did churn?" is Descriptive.
- "Why did the customer churn?" is Diagnostic.
- "Which customer will churn" is Predictive
- "What can I do to change the outcome of customer churn" is Prescriptive
Importance of data value and insights
- More valuable insights are more difficult to derive
- Descriptive analytics produces insights that are less valuable but easy to derive
- Diagnostic analytics produces medium-value insights that are more difficult to derive
- Predictive analytics produces higher-value insights that are even more difficult to derive
- Prescriptive analytics provides the highest value and is the hardest to derive
- More data and fewer barriers to using it lead to more data-driven decisions
Data Science in Daily Life
- Everyday experiences include seeing shoe ads online after shopping, streaming recommendations and pizza delivery tracking
- Further examples are fraud alerts when using a credit card and navigation apps warning about traffic jams
- More data doesn't always equate to more value
- More collected data also brings higher data costs, more unstructured data, greater security risks and slower query processing
- Data becomes less valuable for decision-making over time
Data Value
- From most to least valuable, data is preventive or predictive, then actionable, reactive, and historical
- Data that is near real time is more valuable than data from days or months ago
- Cost, speed, and accuracy are the trade-offs involved in data-driven decisions
- Organizations use data to make informed decisions
Data Analytics vs AI/ML usage
- Data analytics relies on programming logic and tools to derive answers from data and works best with a limited number of variables
- AI/ML can learn to make predictions from examples in data and works best when data is unstructured and the variables are complex
The Data Engineering Pipeline
- Data pipelines provide infrastructure for data-driven decision-making
- The simplest data pipeline has stages for data collection, storage and processing, and then building something practical with data
- Design the infrastructure by first identifying the decision to be made and then determining what data is essential to support that decision
- The infrastructure should be designed weighing cost, speed and accuracy
- Pipeline infrastructure contains layers for data sources, ingestion, storage, processing, visualization and then, finally, predictions and decisions
- Data wrangling is how data is transformed as it moves through the pipeline and includes steps such as discovery, cleaning, normalization, enrichment, and transformation
- Data is often processed iteratively to evaluate and improve the results
Key Pipeline actions
- The pipeline ingests, stores, processes, analyzes, and visualizes data
- When designing a pipeline, begin by determining the problem to be solved and then decide what data is required
- Data wrangling acts on data passing through the pipeline with steps such as discovery, cleaning, normalization, transformation, and augmentation (see the sketch after this list)
- Data is iteratively processed to evaluate and refine results
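As an illustration of these wrangling steps, here is a minimal Python sketch using pandas; the column names, values, and segmentation rule are assumptions made for the example, not part of the course material.

```python
import pandas as pd

# Illustrative raw extract; the column names and values are assumptions.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "state": ["ny", "ny", "Ca", None],
    "total_spend": ["250.00", "250.00", "1,200.50", "75.10"],
})

# Clean: drop exact duplicates and rows missing required fields.
clean = raw.drop_duplicates().dropna(subset=["state"]).copy()

# Normalize: standardize formats so values from different sources match.
clean["state"] = clean["state"].str.upper()
clean["total_spend"] = clean["total_spend"].str.replace(",", "").astype(float)

# Enrich: derive a new attribute to be used later in analysis.
clean["segment"] = pd.cut(
    clean["total_spend"], bins=[0, 500, float("inf")], labels=["standard", "high_spend"]
)

print(clean)
```

In practice these steps run iteratively: each pass surfaces new issues to discover, clean, and transform.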
Roles of the Data Scientist and Data Engineer
- Data scientists and data engineers work with the data pipeline and certain tasks can be fulfilled by either
- Data engineering primarily concerns the infrastructure while a data scientist works with the data inside the pipeline
- Building the best pipeline requires asking questions about the required outcomes and the data, and then iterating on the design
Data Strategies
- A three-pronged strategy to build data infrastructure involves modernizing, unifying and innovating
- Organizations should modernize, unify, and innovate their data infrastructure
- Modernizing means moving to cloud-based, purpose-built services to reduce administrative and operational overhead
- Unifying means creating a single source of truth for data, making it available across the organization
- Innovating looks for new value in data, specifically by applying AI and ML
The Five Vs of Data
- This module lists the five Vs of data: volume, velocity, variety, veracity, and value
- It describes the impact of volume and velocity on a data pipeline
- It compares and contrasts structured, semi-structured, and unstructured data types
- It identifies data sources commonly used to feed data pipelines and questions for assessing data veracity
- It suggests methods to improve data veracity in a pipeline
Data Pipeline Q&A
- Common data pipeline questions include whether the organization has the needed data, where it is stored, and what format it is in
- Additional questions are whether the data must be combined from multiple sources, what its quality is, and what its security requirements are
- Other questions cover what mechanisms exist to move the data into a pipeline, how much data there is, how frequently it is updated, and how important speed is
- Further questions concern what the data can determine, how to evaluate results, visualization specifics, formats, tools, and whether a big data solution is required
- Final questions are which AI/ML models are best suited to the problem and what the simplest way to implement AI/ML is
Data Characteristics
- Data characteristics drive infrastructure decisions in respect to volume, velocity, variety, veracity and value
- Volume relates to dataset size and how much new data is being generated
- Velocity considers new data being generated, how often and when
- Variety evaluates the types, formats and amount of data sources
- Veracity looks at data accuracy, precision and trustworthiness
- Value reviews what insights can be assembled from the data
- Strategies to get the best value from data include confirming that available data meets the need and evaluating the feasibility of acquiring additional data
- Additional strategies include matching the pipeline design to the data, cataloging data, focusing on the business need, and implementing governance policies
Volume and Velocity
- Scaling a pipeline considers data volume and velocity
- Data volume and pace drive design choices and all pipeline layers are impacted
- Each pipeline layer must be evaluated for its individual requirements
- Balance throughput and storage costs against response time and accuracy
Ingestion, Volume and Velocity
- Determine the best ingestion method, the amount of data to be ingested, and how frequently new data must be ingested and processed
- An example of streaming ingestion is clickstream data from a retailer's website: large amounts of data arrive in small pieces at a continuous pace and require immediate analysis
- An example of batch ingestion is sales transaction data sent periodically from individual retail locations to a central location
- Batch data is typically analyzed overnight, with reports sent the following morning (a sketch contrasting the two approaches follows this list)
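A minimal sketch of the contrast, assuming Python with pandas for the batch case and a simple iterator of JSON events for the streaming case; the file name, columns, and event fields are illustrative assumptions.

```python
import json
import pandas as pd

# Batch ingestion (illustrative): load a periodic file on a schedule and
# process all of its records at once. The file name and columns are assumptions.
def ingest_batch(path="sales_2024-01-15.csv"):
    sales = pd.read_csv(path)
    return sales.groupby("store_id")["amount"].sum()

# Streaming ingestion (illustrative): handle each event as it arrives,
# for example from a message queue, instead of waiting for a nightly file.
def ingest_stream(events):
    for raw_event in events:              # 'events' could wrap a queue or socket
        event = json.loads(raw_event)
        if event.get("action") == "add_to_cart":
            print("analyze immediately:", event["product_id"])

ingest_stream(['{"action": "add_to_cart", "product_id": "A1"}'])
```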
Storage and Volume
- Storage must scale with the volume of data being ingested and keep data accessible for processing and analysis when it is required
- An example of long-term, reporting-style access is storing five years of sales data for trend analysis performed on a monthly cadence
- An example of short-term, very fast access is using incoming e-commerce transaction data to suggest additional purchases within the current session
Processing, Volume and Velocity
- Assess the volume of data that must be processed in a single iteration and whether a distributed solution is needed
- Also determine how quickly and how frequently the processing needs to occur
- An example of big data processing is analyzing credit card data for all US-based transactions within the past week
- Streaming analytics can produce real-time alerts on log data to identify potential fraud as it occurs (a minimal sketch follows this list)
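A minimal sketch of a streaming-style check in Python, where each record is evaluated as it arrives; the threshold, field names, and alerting mechanism are assumptions for illustration only.

```python
import json

# Illustrative rule: flag any single transaction above a threshold amount.
# The threshold, field names, and alert mechanism are assumptions.
ALERT_THRESHOLD = 5_000.00

def check_transaction(raw_record: str) -> None:
    record = json.loads(raw_record)
    if record["amount"] > ALERT_THRESHOLD:
        # A real pipeline might publish to an alerting topic instead of printing.
        print(f"ALERT: possible fraud on card {record['card_id']}: {record['amount']}")

# Each record is checked as it arrives rather than waiting for a nightly batch.
check_transaction('{"card_id": "1234", "amount": 7250.00}')
```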
Analysis, Visualization, Volume and Velocity
- Historical analysis involves visualizing a year’s worth of sales data, enabling users to drill down by region and salesperson
- An example of streaming analysis is reporting on data arriving from Internet of Things (IoT) sensors in a plant
Data Volume
- Volume relates to the amount of data being processed
- Velocity is how quickly data enters and moves through a pipeline
- Volume and velocity decide the expected throughput and scaling needs of a pipeline
- Every pipeline layer requires evaluation for volume and velocity
- Balance costs and throughput needs
Data Variety and Pipelines
- Pipeline design is influenced by the type and source of data
- Every data type contributes to certain types of processing and analysis
- Each data source contributes to different amounts of discovery and transformation work
- The data source type determines the type and scope of the ingestion layer
Types of Data
- Data types include structured, semistructured and unstructured
- Structured data is easy to use, involving rows and columns and a well-defined schema, like relational databases
- Semistructured data contains self-describing elements and attributes, as in CSV, JSON, and XML
- Unstructured data contains files with no predefined structure, such as images, videos, and clickstream data
- Unstructured data is "colder" but more flexible, whereas structured data is "hotter" but less flexible
- 80% or more of available data is unstructured
How data types are used
- Structured data example: querying a relational database to report on customer service within a specified time period
- Semi-structured data example: extracting and analyzing customer comments from an online chat application that saves conversations as JSON
- Unstructured data example: analyzing the sentiment of customer service emails (a short sketch of the first two cases follows this list)
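A short sketch contrasting the structured and semi-structured examples, using Python's built-in sqlite3 and json modules; the table, column, and field names are illustrative assumptions.

```python
import json
import sqlite3

# Structured: rows and columns with a defined schema, queried with SQL.
# The table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_cases (case_id INTEGER, opened_at TEXT, status TEXT)")
conn.execute("INSERT INTO support_cases VALUES (1, '2024-01-10', 'closed')")
count = conn.execute(
    "SELECT COUNT(*) FROM support_cases WHERE opened_at >= '2024-01-01'"
).fetchone()[0]
print("cases opened since January 1:", count)

# Semi-structured: a JSON chat transcript describes its own fields, so the
# consumer can navigate the structure without a fixed schema.
chat = json.loads(
    '{"conversation_id": "c-42", "messages": [{"from": "customer", "text": "My order is late"}]}'
)
comments = [m["text"] for m in chat["messages"] if m["from"] == "customer"]
print("customer comments:", comments)
```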
Data Source Types
- Data includes structured, semi-structured and unstructured
- Structured is the easiest to query but least flexible
- Unstructured is hardest to query but more flexible
- Most of the growth in available data is unstructured
- Data source types include organizational stores, public datasets, and time-series data
- Combining datasets enriches analysis but can complicate processing
Common Data Sources
- On-premises databases and file stores hold application data that is owned and managed by the organization
- Public datasets cover a topic, such as census, health, or population data
- Events, IoT devices, and sensors continuously generate data that includes a time-based component
Pipeline Considerations
- On-premises databases are controlled by the organization and may contain private, structured information
- Public datasets are often semistructured and may need transformation or merging before use
- Events, IoT devices, and sensors require streaming ingestion and time-series storage, requiring real-time processing
Data Veracity Challenges
- An application in healthcare runs analysis on customer data to determine patients who have not received proper care
- An application in healthcare combines public health data with customer data for more accurate and improved mobile alerts
- A mobile application provides real-time heart rate monitoring and alerts given risk patterns
- How the data is formatted can affect the ability to analyze it
- Data can become difficult to maintain given multiple data types and merges
Veracity and Value
- Data veracity is easier to maintain when data integrity is preserved and there are fewer data challenges
- The type of data, its sources, and its governance all influence data issues
- Trustworthy data is more relevant and useful to the business
- Volume means having the right amount of data of the type needed
- The nature of the data helps you derive more business insights at the appropriate speed
- Data source types range from organizational stores to public datasets and time-series data
- Combining datasets enriches analysis but adds complexity to processing
Data and Pipeline Security
- Bad data is worse than no data; veracity describes how much the data can be trusted
- To maintain data veracity, clean the data and implement controls
- Prevent unwanted changes to stored data and ensure consistency
Data Integrity
- The ingestion process must track the state and validity of the data
- Cleaning processes must preserve data integrity
- Processing must preserve that validity as the data is analyzed
- Data issues include duplicates, outdated data, and missing records, among other concerns (a minimal validation sketch follows this list)
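A minimal sketch of ingestion-time integrity checks in Python with pandas; the column names, required fields, and 30-day freshness window are assumptions for illustration.

```python
import pandas as pd

# Illustrative ingestion-time integrity checks; the column names, required
# fields, and 30-day freshness window are assumptions for this sketch.
def validate(batch: pd.DataFrame) -> dict:
    now = pd.Timestamp.now(tz="UTC")
    age = now - pd.to_datetime(batch["updated_at"], utc=True)
    return {
        "duplicate_rows": int(batch.duplicated(subset=["record_id"]).sum()),
        "missing_required": int(batch["customer_id"].isna().sum()),
        "stale_records": int((age > pd.Timedelta(days=30)).sum()),
    }

batch = pd.DataFrame({
    "record_id": [1, 1, 2],
    "customer_id": [10, 10, None],
    "updated_at": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:00Z", "2020-06-01T00:00:00Z"],
})
print(validate(batch))  # the counts can feed alerts or block a bad batch from loading
```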
Maintaining Consistency
- Keep new data sources in mind to prevent data integrity issues
- Ask how new data will be managed in relation to the processes already in place
- Establish ways to discover, correct, and prevent data issues
- Secure every layer of the pipeline
- Assign permissions following the principle of least privilege
- Design processes with data usage and integrity in mind when implementing new steps
- Keep audit trails that record past actions to support data integrity
Data Management Tips
- Ask questions about the trustworthiness and lineage of the data
- Work from known data standards so that different processes do not report different details for the same facts
- Be careful when conversions add or change details in the data
- Store timestamped values to keep track of how data changes over time (a brief sketch follows this list)
- Follow best practices to maintain data integrity and keep data safe
- Data management plans are a must
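A brief sketch of why timestamped detail is preferable to storing only aggregates: detail rows can be re-aggregated at any granularity later. It assumes Python with pandas; the event values are illustrative.

```python
import pandas as pd

# Timestamped detail rows (rather than a single pre-computed total) can be
# re-aggregated at any granularity later. The values here are illustrative.
events = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-15 09:05", "2024-01-15 09:40", "2024-01-16 14:10"]),
    "amount": [20.0, 35.0, 12.5],
})

daily = events.resample("D", on="ts")["amount"].sum()   # daily totals
hourly = events.resample("h", on="ts")["amount"].sum()  # finer granularity from the same rows
print(daily)
```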