IT Service Management: Data Engineering

Questions and Answers

Which of the following roles is primarily focused on building and maintaining the data infrastructure?

  • Machine Learning Practitioner
  • Data Engineer (correct)
  • Data Scientist
  • Data Analyst

A large greenhouse tomato grower wants to optimize their harvesting schedule based on predicted yields and market prices. Which category of analytics would be most suited to address this?

  • Diagnostic
  • Prescriptive (correct)
  • Descriptive
  • Predictive

A retail company is analyzing why many customers are leaving their service. Examining historical sales data, they identified a correlation between a recent change in the loyalty program and customer churn. Which type of analytics is exemplified?

  • Prescriptive
  • Descriptive
  • Diagnostic (correct)
  • Predictive

A hospital is using machine learning to predict which patients are most at risk of readmission within 30 days of discharge. Which type of analytics does this represent?

Answer: Predictive

A company is implementing a data strategy that involves moving from on-premises servers to cloud-based infrastructure. Which of the following aspects of modern data strategies does this align with?

Answer: Modernizing Infrastructure

An organization has a large amount of customer data stored in various databases and wants to create a single, accessible repository for analysis. Which component of modern data strategies would be most applicable?

Answer: Unify

A company wants to improve customer service. They decide to implement a system that uses AI to analyze real-time customer feedback and provide suggestions to service representatives. This change falls under which tenet of modern data strategies?

Answer: Innovation

In a data pipeline, what is the primary goal of the 'Ingestion' layer?

Answer: Collecting data from various sources.

In the context of data engineering, which of the following describes data wrangling?

Answer: The tasks performed to transform data as it passes through a pipeline, including discovery, cleaning, and normalization.

Which of the following best describes the importance of iterating the data pipeline?

Answer: To refine and improve the data processing and analysis based on evaluated results.

Which question is a data engineer most likely to ask when designing a data pipeline?

Answer: How frequently is the data updated?

Which of the following questions is a data scientist most likely to consider when working with a data pipeline?

Answer: Do I need a big data framework?

In designing a data pipeline, which of the following is the MOST important initial step?

Answer: Defining the business problem to be solved.

When designing a data pipeline from the start, which approach describes starting with the business problem to be solved and working backwards to the data?

Answer: Work Backwards

What does the data pipeline include?

Answer: Layers to ingest, store, process, analyze, and visualize data passing through it.

Which of the following scenarios exemplifies the use of data science in everyday life?

Answer: A navigation app alerting you to a traffic jam.

What are the considerations in deciding what type of analysis to use?

Answer: Cost, speed, accuracy

A company is deciding whether to invest in a more accurate prediction model. What factor should they consider?

Answer: How much incremental improvement justifies additional cost?

In the context of the Five V's of data, what does 'Veracity' refer to?

Answer: The accuracy, precision, and trustworthiness of data.

A company wants to improve decision-making and needs to determine the relevance of its available data. Which of the five V's of data should they focus on?

Answer: Value

A company is struggling to analyze customer data due to inconsistencies in formatting and data types across different sources. Which characteristic of the five V's of data presents the greatest challenge?

Answer: Variety

A data pipeline processes clickstream data from a high-traffic website. Which of the following considerations related to data characteristics is MOST critical for designing the pipeline's ingestion layer?

Answer: Volume and velocity of incoming click events.

Why could there be a choice between streaming ingestion and batch ingestion?

Answer: Sales transaction data from retailers across the world is sent to a central location periodically.

Why could there be a choice between long-term reporting access and short-term, very fast access?

Answer: Because five years of sales data is stored for trending analysis, which is performed monthly, AND incoming ecommerce sales transaction data is used to suggest additional purchases within the current session.

A financial company needs to analyze transaction data and generate alerts for potential fraud in real-time. Which type of processing is most suitable?

Answer: Streaming Analytics

What is the best description of unstructured data?

Answer: Hardest to query but the most flexible

Which of the following is an example of unstructured data?

Answer: An image of a handwritten document.

A company analyzes comments from an online chat application. How is this data stored?

Answer: JSON

What is the best description of time-series data?

Answer: All other options are correct

A weather monitoring system collects temperature, humidity, and wind speed measurements from various sensors every minute. Which type of data source does this represent?

Answer: Events, IoT devices, sensors

A data team discovers inconsistencies and outdated information in a customer database obtained from a third-party vendor. Which data quality issue does this primarily relate to?

Answer: Veracity

What happens if value rests on veracity and the veracity is not strong?

Answer: You could make bad decisions.

Which of the following is a method of evaluating the veracity of data?

Answer: Does the source have audit trails?

A data analyst identifies a pattern of missing values in a specific field of a customer dataset. According to the best practices for cleaning data, what should be their NEXT step?

Answer: Trace errors by returning to the data and finding the root cause.

What describes immutable data?

Answer: All other options are correct

A company is deciding to analyze only aggregated data to save space. What is the BEST reason to avoid this practice?

Answer: Working from aggregate data can make errors harder to identify.

A company needs to store user activity data for compliance purposes and ensure that any changes to the data are fully traceable. Which approach is MOST suitable?

Answer: Implement an immutable data store, writing a new record for each event.

What is a good way to maintain data integrity and consistency?

Answer: All other options are correct

Flashcards

Data-Driven Decisions

Using data science to make informed decisions in an organization.

Data Analytics

Systematic analysis of large datasets to find patterns and produce actionable insights.

AI/ML

Mathematical models that make predictions from data at a scale impossible for humans.

Descriptive Analytics

Describing past performance, trends and history.

Diagnostic Analytics

Describing the reason for past trends.

Predictive Analytics

Forecasting future trends.

Prescriptive Analytics

Recommending a decision or course of action.

Data Pipeline

Infrastructure for data-driven decisions.

Data Wrangling

Acting on the data as it passes through the pipeline. Tasks include discovery, cleaning, normalization, and augmentation.

Data Engineering

Focuses on the infrastructure that the data passes through.

Modern Data Strategies

Three-pronged strategy to build data infrastructure

Innovate

Move from reactive to proactive decision-making.

Modernize

Move from on premises to cloud-based services.

Unify

Create a single source of truth

Volume

A data characteristic. How much data is generated?

Velocity

How frequently is new data generated and ingested?

Variety

What types and formats of data are there?

Veracity

How accurate and trusted is the data?

Value

What insights can be pulled from the data?

Unstructured Data

Files with no predefined structure. (Images, movies, etc.)

Semi-structured data

Data with some organizational properties, but not a well-defined structure (CSV, JSON, XML).

Structured Data

Data organized in rows and columns with a well-defined schema, like a relational database.

Public Datasets

Data aggregated about a topic (census data, health data, etc.).

On-Premises Databases

Application data is owned and managed by the organization.

Events, IoT Devices, Sensors

Data generated continually by events and includes a time-based component.

Veracity and Value

You need to evaluate the veracity of each data source, clean and transform the data for your needs, and prevent unwanted changes to the data.

Study Notes

  • The lecture covers IT Service Management, specifically focusing on Data Engineering concepts.
  • A new lab link, called AWS Academy Data Engineering [105501], is available in the Blackboard Information section.
  • Job roles related to data engineering include data engineer, data analyst, data scientist, extract, transform, and load (ETL) developer, and machine learning (ML) practitioner.

Data-Driven Decisions

  • Throughout 2026, organizations are expected to increase investments in data and analytics services by 45% to become more data driven and digital.
  • Data-driven decisions are relevant in various contexts.
  • Some examples are deciding on a restaurant, finding a used bicycle, playing fantasy football, choosing a job, or pricing a home.
  • Organizations make data-driven decisions for flagging fraudulent transactions, optimizing webpage design, predicting patient relapse, identifying security issues, and determining the optimum time for harvesting crops.
  • Data-driven decisions are useful for large greenhouse tomato growers to optimize operations.

Data Analytics vs AI/ML

  • Data analytics systematically analyzes large datasets (big data) to find patterns and trends and produce actionable insights; it uses programming logic to answer questions from structured data with a limited range of variables.
  • AI/ML uses mathematical models to make predictions from data at a scale difficult for humans, uses examples from large amounts of data, and is suitable for unstructured data with complex variables.

Business example: Customer relationship management

  • Data analytics analyzes total revenue per customer to segment customers into categories, enabling higher customer service for high-spending customers.
  • AI/ML approach uses AI/ML to analyze customer churn and its causes, helping businesses to improve customer retention strategies.

Categories of Analytics

  • Descriptive analytics describes past performance, history, and trends and answers the question "What happened?"
  • Diagnostic analytics describes the reasons for trends and answers the question "Why did it happen?"
  • Predictive analytics forecasts future trends and answers the question "What will happen?"
  • Prescriptive analytics recommends a decision or course of action and answers the question "How can we make it happen?"
  • Insights become more valuable but also more difficult to derive as you move from descriptive to prescriptive analytics.

Key points about decisions

  • More data combined with fewer barriers enables more data-driven decisions.
  • Data science is apparent when, after shopping for shoes online, the advertisements you see become shoe-related.
  • Data science is relevant on a streaming platform where there are recommendations for movies or music.
  • Data science is also relevant when ordering pizza online and getting continuous updates about the preparation and delivery.
  • Data science appears with credit card fraud detection and navigation apps that help identify traffic jams.
  • More data does not automatically result in more value.
  • More data brings higher costs, a larger share of unstructured data, greater security risk, and slower query processing.
  • Data becomes less valuable for decision-making over time.
  • The most valuable data is preventive/predictive used in near real time.
  • The least valuable data is historical used within days to months of the initial data capture.

Trade-offs of data-driven Decisions

  • The tradeoffs of data-driven decisions include cost (how much to invest), speed (how quickly an answer is needed), and accuracy (how accurate the prediction needs to be).

Key Takeaways: Data-driven decisions

  • Data-driven organizations use data science to make informed decisions.
  • Data analytics uses programming logic and tools for predictions from structured data.
  • AI/ML uses examples from data for unstructured data predictions.
  • Increased data and technology advancements are expanding opportunities for data-driven decisions.

Key Takeaways: Modern data strategies

  • Organizations that want to become data driven should modernize, unify, and innovate with their data infrastructures.
  • A modern data strategy involves a three-pronged approach: modernize, unify, innovate.
  • Modernizing involves moving to cloud-based services and purpose-built services to reduce effort.
  • Unifying involves creating a single source of truth for data across the organization.
  • Innovating involves applying AI and ML to find new data insights.
  • Being a data-driven organization means culturally treating data as a strategic asset.

Data Pipeline for Data-Driven Decisions

  • The data pipeline involves collecting data, processing it, and building something useful with it.
  • It is important to work backward to design infrastructure.
  • One must determine what decision to make first, and then identify what data is needed to support the decision and weigh the trade-offs of cost, speed, and accuracy.
  • The layers of a pipeline include the data sources, ingestion, storage, processing, analysis/visualization, and predictions and decisions (a minimal sketch follows this list).
  • The actions applied within these layers include data wrangling (discovering, cleaning, normalizing, and enriching) and transformation.
  • Iterative processing is used to evaluate and improve the data pipeline.
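
The following is a minimal sketch of these layers in Python. The function names, field names, and file names (sales.csv, raw_sales.json, region, amount) are invented for illustration; it is only a toy that shows where the layer boundaries sit, not a reference implementation.

```python
# Toy data pipeline: each function stands in for one layer of the pipeline.
# File and field names are invented for illustration only.
import csv
import json
import statistics


def ingest(path):
    """Ingestion layer: collect raw records from a source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def store(records, path):
    """Storage layer: persist what arrived so later steps can be re-run."""
    with open(path, "w") as f:
        json.dump(records, f)


def process(records):
    """Processing layer: wrangle the data (clean, normalize, drop bad rows)."""
    cleaned = []
    for row in records:
        try:
            cleaned.append({"region": row["region"].strip().lower(),
                            "amount": float(row["amount"])})
        except (KeyError, ValueError):
            continue  # discard rows that cannot be parsed
    return cleaned


def analyze(records):
    """Analysis layer: produce figures a decision-maker can act on."""
    amounts = [r["amount"] for r in records]
    if not amounts:
        return {"total": 0.0, "mean": 0.0}
    return {"total": sum(amounts), "mean": statistics.mean(amounts)}


if __name__ == "__main__":
    raw = ingest("sales.csv")            # hypothetical source file
    store(raw, "raw_sales.json")         # keep a copy of exactly what arrived
    print(analyze(process(raw)))         # iterate on this step as needs evolve
```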

Key points about Data Pipeline

  • A data pipeline provides the infrastructure for data-driven decision-making.
  • Pipeline design should begin with the business problem.
  • A pipeline includes ingesting, storing, processing, analyzing, and visualizing data.
  • Data wrangling includes discovery, cleaning, normalization, transformation, and augmentation.
  • Data is processed iteratively to refine results.

Data Engineer Role

  • Both data scientists and data engineers work with the data pipeline.
  • Tasks are sometimes interchangeable between data engineers and data scientists.
  • Data engineering focuses on the infrastructure, while data science emphasizes data utilization.
  • It is crucial to ask questions about the desired outcomes and the data in order to build the most effective pipeline.

Common data pipeline questions

  • Does the organization own data that addresses the need, and does it need to combine data from multiple sources?
  • What are the type and quality of the data, what is the source of truth, and what are the security requirements for the data?
  • What mechanisms should be implemented for transferring data, how much data exists, how frequently is it updated, and how important is the speed at which it is delivered?
  • What can the data teach me, how should results be assessed, what type of visualization is required, and what formats and tools are analysts familiar with?
  • Is a big data framework needed, do AI/ML models fit, and what is the easiest way to implement them?

Module objectives

  • This module prepares one to list the five Vs of data.
  • Describe the impact of volume and velocity on the data pipeline.
  • Compare and contrast structured, semistructured, and unstructured data types.
  • Identify data sources that feed a data pipeline and the questions to ask when assessing data veracity.
  • Suggest methods to improve the veracity of data in the pipeline.

Five V's of Data

  • Volume, velocity, variety, veracity, and value are the five V's of data.
  • Volume refers to how big the dataset is and how much new data is generated.
  • Velocity refers to the frequency of new data generation and ingestion.
  • Variety is about the types of data and the sources where the data comes from.
  • Veracity relates to how accurate, precise, and trusted the data is. Value relates to the insights that are pulled from the data.
  • Confirm that the data meets your needs and evaluate the feasibility of acquiring it.
  • Match the pipeline design to the data, balancing throughput and cost.
  • Let users focus on the business: catalog data and metadata, and implement governance.

Volume and Velocity

Data and Pipeline design and configuration

  • Volume and velocity impact all layers of the pipeline, so each layer must be evaluated against its own requirements.
  • It is also important to consider the amount and pace of the data when designing.
  • Balance costs for throughput and storage against the required time to answer and the accuracy of the answer.

Ingestion considerations

  • Suit ingestion method to the amount of data to be ingested and the frequency with which new data must be ingested and processed.
  • Two examples of handling data:
  • Streaming ingestion deals with continuous, small pieces of data that require immediate analysis; an example is clickstream data from a retailer's website.
  • Batch ingestion deals with the same kind of sales transaction data but collected from retailers worldwide, sent to a central location periodically, and processed at intervals (overnight in this example). A minimal sketch of both approaches follows this list.
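
A minimal sketch of the two ingestion styles, using invented in-memory stand-ins for the event source and the daily batch; a real pipeline would read from an actual stream or scheduled file drop.

```python
# Streaming vs. batch ingestion, sketched with in-memory stand-ins.
# The event source and the daily batch are hypothetical examples.
import time


def stream_ingest(event_source, handler):
    """Streaming: process each small record as soon as it arrives."""
    for event in event_source:          # e.g. clickstream events
        handler(event)                  # analysis happens immediately


def batch_ingest(records, handler):
    """Batch: accumulate records, then process the whole set at an interval."""
    handler(records)                    # e.g. run overnight on the day's sales


if __name__ == "__main__":
    clicks = ({"page": "/cart", "ts": time.time()} for _ in range(3))
    stream_ingest(clicks, lambda e: print("stream:", e))

    daily_sales = [{"sku": "A1", "amount": 10.0}, {"sku": "B2", "amount": 7.5}]
    batch_ingest(daily_sales, lambda batch: print("batch total:",
                                                  sum(r["amount"] for r in batch)))
```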

Storage considerations

  • Which storage types can scale to the volume of data to be ingested and make it accessible to processing and analysis as quickly as required?
  • Two examples of this:
  • Long-term, reporting access: five years of sales data stored for trending analysis performed monthly.
  • Short-term, very fast access: incoming ecommerce sales transaction data used to suggest additional purchases within the current session.

Processing considerations

  • How much data must be processed in a single iteration?
  • Big data processing: Performing analytics on all US-based transactions from the last week.
  • Streaming analytics: real-time alerts produced on log data (see the sketch after this list).
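
A minimal sketch of streaming analytics on log data: it raises an alert whenever the error rate over a sliding window of events crosses a threshold. The window size, threshold, and log record format are illustrative assumptions.

```python
# Sketch of streaming analytics: alert when the error rate in a sliding
# window of log events exceeds a threshold. Values are illustrative only.
from collections import deque


def error_rate_alerts(log_events, window=100, threshold=0.05):
    """Yield an alert message whenever the windowed error rate crosses the threshold."""
    recent = deque(maxlen=window)
    for event in log_events:
        recent.append(1 if event.get("level") == "ERROR" else 0)
        rate = sum(recent) / len(recent)
        if len(recent) == window and rate > threshold:
            yield f"ALERT: error rate {rate:.1%} over last {window} events"


if __name__ == "__main__":
    sample = [{"level": "INFO"}] * 95 + [{"level": "ERROR"}] * 10
    for alert in error_rate_alerts(sample):
        print(alert)
```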

Analysis/Visualization

  • For analysis and visualization, consider how much data must be presented, at what level of detail, and how quickly consumers need to see and act on it.

Examples for Analysis/Visualization

  • A year's worth of sales data is visualized, and users can drill down by region and salesperson (see the aggregation sketch after this list).
  • The real-time error rates of sensors in a plant are visualized.
  • Volume is the amount of data to process.
  • Velocity is how quickly the data moves through the pipeline.
  • Evaluate the volume and velocity for each layer, and balance the cost and throughput requirements.
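
As a sketch of the drill-down example above, the following assumes pandas is installed and uses invented column names to aggregate sales first by region and then by region and salesperson.

```python
# Sketch of an analysis/visualization step: aggregate a year of sales so users
# can drill down by region and salesperson. Column names are invented.
import pandas as pd

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "salesperson": ["ana", "bo", "ana", "cy"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})

# Top-level view, then one drill-down level, as in the example in the notes.
by_region = sales.groupby("region")["amount"].sum()
by_region_and_rep = sales.groupby(["region", "salesperson"])["amount"].sum()

print(by_region)
print(by_region_and_rep)
```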

Variety - Data Types

  • Pipeline design decisions are influenced by both data types and data sources.
  • Certain data types lend themselves to certain kinds of processing and analysis.
  • Data source types might require different amounts of discovery and transformation work.
  • Unstructured data has greater potential for discovering untapped insights.
  • The data source type will also drive the type and scope of the ingestion layer.

Data categorization and benefits

  • Structured: Rows and columns and well-defined schema. Easy to use.
  • Semi-structured: Elements and self-describing attributes. Easy to use.
  • Unstructured: Files with no defined structure. Flexible.
  • 80% or more of available data is unstructured.
  • Examples include querying a relational database to report on customer cases, analyzing customer comments from online chat applications, and performing sentiment analysis on customer service emails.
  • The general data types are structured, semi-structured, and unstructured; the type determines how easily the data can be queried.
  • Unstructured data holds the most promise and is the most flexible, and most available data is in this form (see the JSON sketch after this list).
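
A minimal sketch of working with semi-structured chat comments stored as JSON. The field names and the naive word-list "sentiment" check are placeholders for illustration, not a real sentiment model.

```python
# Sketch: customer chat comments stored as semi-structured JSON documents.
# Field names and the word-list check are illustrative assumptions.
import json

raw = '''[
  {"user": "c1001", "ts": "2024-05-01T10:02:00Z", "comment": "Great support, thanks!"},
  {"user": "c1002", "ts": "2024-05-01T10:05:00Z", "comment": "Still waiting on my refund"}
]'''

NEGATIVE_WORDS = {"waiting", "refund", "broken", "late"}

for doc in json.loads(raw):
    words = set(doc["comment"].lower().split())
    label = "negative" if words & NEGATIVE_WORDS else "positive/neutral"
    print(doc["user"], "->", label)
```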

Variety - Data Sources

  • The data source types that can be drawn on include on-premises data stores, public datasets, and time-series sources such as events, IoT devices, and sensors.
  • These sources can be combined to enrich a dataset and enable deeper analysis, but combining them can complicate data processing.

Common data source types

  • On-premises databases or file stores: Application data is owned and managed by the organization.
  • Public datasets: Data is aggregated about a topic, such as census data, health data, and population data.
  • Events, IoT devices, sensors: Data is generated continually by events and includes a time-based component.
  • Sources like these differ in how much control the organization has over them, may contain data that is not needed, and may be time-series in nature.
  • They also differ in privacy requirements and structure, which affects how they must be managed.

Veracity and value

  • Making a data-driven decision with bad data is worse than making no decision at all.
  • It is important to maintain the integrity of data when extracting, combining and transforming it.
  • Value rests on veracity, because the insights that drive good decisions depend on the quality and reliability of the underlying data.

Data Evaluation

  • Evaluate how the source data was entered and how it is managed; check whether the source maintains audit trails.
  • Be wary of the reliability of datasets and of any bias built into them.
  • Clean and transform the data for consistency, and confirm that the sources are reliable.

Maintaining and improving data

  • Errors are often repetitive, so tracing a data error to its root cause can improve the data in the future.
  • Define clean data states.
  • Discover the data: understand what it contains from the beginning.
  • Clean and transform: remove duplicates, correct bad data points, and so on (a cleaning sketch follows this list).
  • Prevent corruption from software bugs, tampering, and human error.
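
A minimal cleaning sketch covering the steps above: remove duplicates, normalize values, and set aside records with missing fields so their root cause can be traced. The records, field names, and mapping table are invented.

```python
# Sketch of a cleaning/transformation pass: drop duplicates, normalize text,
# and flag missing values for root-cause tracing. Data is invented.
records = [
    {"id": 1, "country": " US ", "age": "34"},
    {"id": 1, "country": " US ", "age": "34"},   # exact duplicate
    {"id": 2, "country": "usa", "age": None},    # missing value to trace
    {"id": 3, "country": "U.S.", "age": "29"},
]

COUNTRY_MAP = {"us": "US", "usa": "US", "u.s.": "US"}

seen, cleaned, suspect = set(), [], []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        continue                                  # remove duplicates
    seen.add(key)
    country = COUNTRY_MAP.get(r["country"].strip().lower(), r["country"].strip())
    if r["age"] is None:
        suspect.append(r["id"])                   # keep a list for root-cause tracing
        continue
    cleaned.append({"id": r["id"], "country": country, "age": int(r["age"])})

print("cleaned:", cleaned)
print("trace these ids back to the source:", suspect)
```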

Data Integrity and Consistency

  • Secure all layers of the pipeline and grant access by least privilege; maintaining an audit trail of data usage is a best practice for data integrity.
  • Implement compliance controls, keeping data in a single secure location.
  • Use simple processes from the start to capture and timestamp the details of each change.
  • Transformations can be simple or complex, and each has its own data integrity considerations.
  • Organizations should implement compliance and protection strategies for their data (see the immutable-store sketch after this list).
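
A minimal sketch of an immutable, append-only store: every change is written as a new timestamped record and the current state is derived by replaying history, so the full audit trail is preserved. The entity and field names are invented for illustration.

```python
# Sketch of an immutable, append-only store: every change is a new
# timestamped event, so history stays fully traceable for audits.
from datetime import datetime, timezone

event_log = []                      # stand-in for an append-only table or stream


def record_event(entity_id, field, new_value):
    """Append a new record instead of overwriting the old one."""
    event_log.append({
        "entity_id": entity_id,
        "field": field,
        "new_value": new_value,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })


def current_value(entity_id, field):
    """Derive the latest state by replaying history; nothing is ever mutated."""
    value = None
    for e in event_log:
        if e["entity_id"] == entity_id and e["field"] == field:
            value = e["new_value"]
    return value


record_event("user-42", "email", "old@example.com")
record_event("user-42", "email", "new@example.com")
print(current_value("user-42", "email"))   # latest value
print(len(event_log), "events retained for audit")
```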
