Part 1 PraDM.pdf
PART 1: DATA WAREHOUSING AND MANAGEMENT

DATA WAREHOUSING

Data warehousing is a system used for reporting and data analysis, serving as a central repository of integrated data from one or more disparate sources. It stores current and historical data in one place and is used for creating analytical reports for knowledge workers throughout the enterprise. The primary goal is to support decision-making and business intelligence activities.

DATA WAREHOUSE

1. Data Warehouse: A data warehouse is a large, centralized repository that integrates data from various sources within an organization. It is designed to support complex queries, analytics, and reporting processes. Data warehouses often contain vast amounts of historical data, facilitating trend analysis, forecasting, and business intelligence. The key characteristics of data warehouses include:

Subject-Oriented: Organizes data around key business subjects, like customers, products, and sales, making it easier to access relevant information.
Integrated: Combines data from different sources into a coherent format, ensuring consistency in naming conventions, encoding structures, and data types.
Time-Variant: Stores historical data to provide a time-based perspective for analysis.
Non-Volatile: Once data enters the warehouse, it is not frequently modified or deleted, preserving data integrity.

DATA MARTS

2. Data Marts: A data mart is a smaller, more focused subset of a data warehouse, catering to the specific needs of a particular department or business unit. While data warehouses are comprehensive, data marts are designed for more specialized analytical processing. Data marts can be:

Dependent Data Marts: Sourced directly from a centralized data warehouse. They inherit consistent data and maintain integrity with the main warehouse.
Independent Data Marts: Created without relying on a central data warehouse, gathering data directly from operational systems or external sources.
This approach is often faster and easier to set up but may lead to data silos and inconsistencies.

ADVANTAGES AND DISADVANTAGES OF DATA MARTS

Advantages:
Quicker access to relevant data for a specific department.
Easier and faster implementation compared to full-scale data warehouses.
Simplifies data management for smaller, focused queries and analyses.

Disadvantages:
Potential for data inconsistency if not integrated with a central warehouse.
Limited scope of data, making cross-departmental analysis more challenging.

ALTERNATE DATA WAREHOUSING ARCHITECTURES

In the traditional data warehousing architecture, data from different sources is extracted, transformed, and loaded (ETL) into a centralized repository. However, evolving business needs and technological advancements have led to alternate architectures to handle data more flexibly, efficiently, and cost-effectively.

DATA LAKE

1. Data Lake: Data lakes are storage repositories that hold vast amounts of raw, unprocessed data in its native format until it is needed. Unlike data warehouses, data lakes do not impose a schema upon data ingestion. The schema is applied when the data is read (schema-on-read). This approach offers:

Flexibility: Accommodates structured, semi-structured, and unstructured data.
Scalability: Built on scalable storage systems like Hadoop and cloud storage, suitable for handling large volumes of data.
Cost-Efficiency: Usually more cost-effective than traditional data warehouses, especially for storing large datasets.

Challenges of Data Lakes:

Data Quality and Governance: Without strict schemas, data quality can become inconsistent, requiring careful management.
Complex Data Processing: Analyzing data directly from a data lake can be more complex, often requiring specialized tools and skills.

DATA LAKEHOUSE

2. Data Lakehouse: The data lakehouse is a more recent architecture that combines the strengths of data warehouses and data lakes.
It allows the storage of vast amounts of raw data (like a data lake) but includes features of a data warehouse, such as ACID (Atomicity, Consistency, Isolation, Durability) transactions, data governance, and support for diverse analytics workloads.

Unified Storage: Manages both structured and unstructured data in a single repository.
Versatility: Supports both traditional business intelligence and more advanced analytics, including machine learning.
Efficiency: By integrating data warehousing and data lakes, data lakehouses can simplify data processing and reduce data movement.

CLOUD DATA WAREHOUSING

3. Cloud Data Warehousing: With cloud technology becoming more accessible, cloud data warehouses like Amazon Redshift, Google BigQuery, and Snowflake provide flexible, scalable, and cost-efficient alternatives to traditional on-premises warehouses. Features include:

Scalability: Dynamic scaling based on workload, avoiding the limitations of fixed hardware.
Cost-Effectiveness: Pay-as-you-go models and reduced maintenance costs.
Integrated Services: Seamless integration with various data sources, data lakes, and analytics tools.

FEDERATED DATA WAREHOUSE

4. Federated Data Warehouse: In a federated data warehouse, data is not physically stored in a single repository. Instead, a virtual layer is created to access data across multiple, disparate databases and sources in real time. This architecture provides:

Reduced Data Duplication: Avoids the need to replicate data in a central repository.
Real-Time Access: Retrieves up-to-date data directly from source systems.
Flexibility: Accommodates diverse data sources without requiring extensive ETL processes.

Challenges of Federated Data Warehouses:

Complexity: Implementing a federated system can be technically complex and may involve performance trade-offs.
Data Consistency: Real-time data access can lead to inconsistencies if source systems are not properly synchronized.
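
The federated idea above can be illustrated with a small sketch. It is not taken from the slides and does not represent any particular product: the two in-memory SQLite databases, the table names, and the function revenue_by_customer are hypothetical stand-ins for two separate source systems and a very thin virtual layer. The point is only that the data stays in the sources and is combined when the query is run.

```python
import sqlite3


def make_source(schema_sql, insert_sql, rows):
    """Create an independent in-memory database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Hypothetical source 1: a CRM system that owns customer records.
crm = make_source(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Hypothetical source 2: an order system that owns sales records.
orders = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)


def revenue_by_customer(crm_conn, orders_conn):
    """Virtual layer: query each source at request time and join in memory.

    Nothing is copied into a central store, so the result reflects whatever
    the sources contain at the moment of the call.
    """
    names = dict(crm_conn.execute("SELECT id, name FROM customers"))
    totals = orders_conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    )
    # An order whose customer is missing from the CRM is labelled rather than
    # dropped; this is the kind of cross-source inconsistency noted above.
    return {names.get(cid, f"unknown:{cid}"): total for cid, total in totals}


print(revenue_by_customer(crm, orders))
```

If a new row is inserted into the orders source, the next call to revenue_by_customer picks it up without any reload step, which is the real-time access benefit. The cost is that every call goes back to every source, which is one form of the performance trade-off listed under the challenges.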
IN CONCLUSION

Data warehouses and data marts serve as the cornerstone of traditional data warehousing, providing structured, historical data for analytics and reporting. However, as data volume, variety, and velocity have increased, alternate architectures like data lakes, data lakehouses, cloud data warehouses, and federated systems have emerged. These alternatives offer flexibility, scalability, and cost-efficiency to better support modern data-driven business environments.
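
As a closing illustration of the schema-on-read behaviour described for data lakes, the sketch below stores raw records unchanged and applies a schema only when they are read. It is a minimal example under stated assumptions, not part of the original slides: the sample records, the SALES_SCHEMA mapping, and the read_with_schema function are all hypothetical.

```python
import json

# Raw events kept exactly as they arrived, the way a data lake would hold them.
# Fields are missing or differently typed from record to record (hypothetical data).
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "ts": "2024-01-05"}',
    '{"user": "ben", "amount": 5, "channel": "web"}',
    '{"user": "cy"}',
]

# Schema chosen by the reader at query time: field name -> (converter, default).
SALES_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse each raw record and project it onto the reader's schema.

    Fields outside the schema are ignored and missing fields get the default,
    so the stored data never has to be rewritten to fit the reader.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
total = sum(row["amount"] for row in rows)
print(rows)
print(total)
```

A different reader could pass a different schema over the same raw records, for example one that keeps the ts and channel fields. A warehouse, by contrast, would reject or transform the second and third records at load time (schema-on-write), which is where the data quality trade-off mentioned for data lakes comes from.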