Questions and Answers
Which processing zone is responsible for verifying data quality and accuracy in a data lake architecture?
What describes the function of the discovery sandbox in a data treatment process?
Which of the following is NOT considered a consumer system in big data architecture?
In a serverless big data architecture, which of the following is a common cloud service provider?
What is the primary role of metadata in big data architecture?
Which of the following describes the process of data capturing?
What distinguishes a data warehouse from a data lake?
Which of the following is NOT a main type of data source?
Which aspect is NOT considered key in effective data presentation?
What is a responsibility of the user in an on-premise data management setup?
Which statement regarding data processing is incorrect?
What is one disadvantage of manual data entry?
Which method is typically used for automated data collection?
Study Notes
Data Collection and Processing
- Data is captured from various sources: web browsers, smartphones, search engines, internet, banking, and gaming.
- Data is collected by government agencies, pharmaceutical companies, consumer product companies, big box stores, and credit card companies.
- The end-to-end data flow includes: collection, ingestion, preparation, computation, and presentation.
- A data source is any system providing raw material for analysis, reporting, or applications. Sources can be structured or unstructured (databases, APIs, flat files, cloud storage, data warehouses, spreadsheets).
- Data capturing methods: manual data entry (e.g., spreadsheets), automated data collection (e.g., sensors, IoT, APIs, surveys, forms, point-of-sale systems, mobile devices). Capturing must be accurate, private, secure, and integrated.
- Data processing steps: collection, cleaning, transformation, storage, and analysis.
- Data presentation displays data clearly, engagingly, and understandably, emphasizing key insights (visualization, clarity, context, interactivity, storytelling).
- Data warehouses organize incoming data into a single schema for analysis.
- Data lakes store raw data for flexible future use.
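The processing steps listed above (collection, cleaning, transformation, storage, analysis) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample CSV, the `purchases` table, and every function name are hypothetical, and an in-memory SQLite database stands in for whatever storage a real system would use.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one row has a missing amount, one has stray whitespace.
RAW_CSV = """customer,amount
alice, 10.50
bob,
carol,7.25
alice,4.00
"""

def collect(text):
    """Collection: read raw rows from a source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning: strip whitespace and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if amount:
            cleaned.append({"customer": row["customer"].strip(), "amount": amount})
    return cleaned

def transform(rows):
    """Transformation: convert text fields into typed records."""
    return [(row["customer"], float(row["amount"])) for row in rows]

def store(records):
    """Storage: load the typed records into a table with a fixed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    conn.commit()
    return conn

def analyze(conn):
    """Analysis: total amount per customer."""
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM purchases "
        "GROUP BY customer ORDER BY customer"
    )
    return dict(cursor.fetchall())

def run_pipeline(text):
    return analyze(store(transform(clean(collect(text)))))

if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))  # {'alice': 14.5, 'carol': 7.25}
```

Keeping each step as its own function mirrors the list above and makes it easy to test or swap out a single stage.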
Data Architecture
- On-Premise: Users manage all components, including networking, storage, servers, virtualization, operating systems, middleware, runtime, data, and applications (e.g., private data centers).
- Software as a Service (SaaS): Vendors manage everything; users access applications over the internet (e.g., Gmail, Microsoft Office 365).
- Serverless Big Data Architecture: Cloud service providers (AWS, Azure, GCP) handle the infrastructure.
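The on-premise and SaaS responsibility splits described above can be written down as a small lookup table. This is a hypothetical sketch: the names `STACK_LAYERS`, `MANAGED_BY`, and `user_managed` are invented for illustration, and serverless is left out because the notes only say that the provider handles the infrastructure, without listing layers.

```python
# Layers of the stack, in the order the notes list them.
STACK_LAYERS = [
    "networking", "storage", "servers", "virtualization",
    "operating systems", "middleware", "runtime", "data", "applications",
]

# Who manages each layer under each model, per the notes:
# on-premise -> the user manages everything; SaaS -> the vendor does.
MANAGED_BY = {
    "on-premise": {layer: "user" for layer in STACK_LAYERS},
    "saas": {layer: "vendor" for layer in STACK_LAYERS},
}

def user_managed(model):
    """Return the layers the user is responsible for under a given model."""
    return [layer for layer in STACK_LAYERS if MANAGED_BY[model][layer] == "user"]

if __name__ == "__main__":
    print(user_managed("on-premise"))  # all nine layers
    print(user_managed("saas"))        # []
```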
Data Treatment (Two Approaches)
- Data Treatment 1:
  - Data Sources: Streaming, file data, and relational data.
  - Data Lake Zones:
    - Transient: Ingest, tag, catalog data.
    - Raw: Metadata, protect sensitive data. Identify data.
    - Trusted: Data quality, validation; verified accuracy.
    - Refined: Enrich data, automate workflows.
  - Consumer Systems: Data catalog, data prep tools, data visualization, external connectors.
- Data Treatment 2:
  - Data Sources: OLTP/ODS, enterprise data warehouse, logs, cloud services, streaming, and file data.
  - Steps:
    - Transient loading zone: Ingest data without transformation.
    - Raw data zone: Store original, unaltered data and tokenize.
    - Refined data zone: Integrate data into consistent format.
    - Trusted data zone: Provides high-quality data.
    - Discovery sandbox: Enables exploratory analysis and experimentation.
  - Consumers: Business analysts, data scientists.
  - Foundation Layers: Metadata, data quality, data catalog, and security.
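As a rough illustration of how a record might move through the zones of the second treatment approach, here is a minimal Python sketch. Everything in it is an assumption made for the example: the sample records, the `SENSITIVE_FIELDS` set, the function names, and the quality rule (a non-negative numeric amount). A truncated SHA-256 hash stands in for tokenization; a real system would more likely use a dedicated tokenization service.

```python
import copy
import hashlib

SENSITIVE_FIELDS = {"email"}  # hypothetical list of fields to tokenize

# Hypothetical incoming records with inconsistent keys and one bad value.
SAMPLE = [
    {"Email": "a@example.com", "Amount": " 12.5 "},
    {"email": "b@example.com", "amount": "n/a"},
    {"email": "c@example.com", "amount": "3"},
]

def transient_zone(payloads):
    """Transient loading zone: ingest as-is, no transformation."""
    return copy.deepcopy(payloads)

def tokenize(value):
    """Stand-in tokenizer (assumption): a truncated SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def raw_zone(records):
    """Raw zone: keep original structure, tokenize sensitive fields."""
    return [
        {k: tokenize(v) if k.lower() in SENSITIVE_FIELDS else v
         for k, v in rec.items()}
        for rec in records
    ]

def refined_zone(records):
    """Refined zone: integrate into one consistent format."""
    return [{k.lower(): v.strip() for k, v in rec.items()} for rec in records]

def trusted_zone(records):
    """Trusted zone: keep only records that pass quality checks."""
    trusted = []
    for rec in records:
        try:
            amount = float(rec.get("amount", ""))
        except ValueError:
            continue  # fails validation, never reaches consumers
        if amount >= 0:
            trusted.append({**rec, "amount": amount})
    return trusted

def discovery_sandbox(records):
    """Sandbox: an isolated copy analysts can experiment on freely."""
    return copy.deepcopy(records)

def run_zones(payloads):
    return trusted_zone(refined_zone(raw_zone(transient_zone(payloads))))

if __name__ == "__main__":
    for row in discovery_sandbox(run_zones(SAMPLE)):
        print(row)
```

The record with the amount `n/a` is dropped in the trusted zone, and the sandbox works on a copy so experiments cannot alter trusted data.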
Data Technologies and Software Architecture
- Data technologies span the whole flow: collection through apps and IoT devices, ingestion through third-party services (e.g., MQTT), preparation and computation in a data lake, loading into a data warehouse, and presentation.
- Software architecture considers design, quality attributes, IT environment, business strategy, and human dynamics.
Description
This quiz explores the fundamentals of data collection and processing, covering various methods used to capture and handle data. It highlights the importance of accurate and secure data collection from different sources, along with the necessary steps in processing and presenting data effectively.