# Fundamentals
#databricks #dataengineer

### Data Lakehouse paradigm

- unified security, governance and cataloging
- unified data storage for reliability and sharing

| Data science | ETL & real-time | Workflows | Data Warehousing |
| ------------------------ | --------------- | --------- | ---------------- |
| Data Intelligence Engine | | | |
| Unity Catalog | | | |
| Delta Lake | | | |

#### Delta Lake

Offers:

- predictive I/O
- predictive optimization
- liquid clustering

It is a file-based, open-source format that allows:

- ACID transactions
- scalable data and metadata handling
- audit history and time travel
- schema enforcement and evolution
- support for deletes, updates and merges
- unified streaming and batch processing

Also:

- uses Delta tables
- includes a transaction log
- compatible with Spark

#### Unity Catalog

Allows discovery and classification, including:

- user management
- access control
- data lineage
- automated monitoring
- auditing
- data sharing

Provides:

- a unified view of data and AI assets: discover, classify and organize; data federation; tag-based search
- a single permission model for data and AI: unified interface, fine-grained access controls, open interfaces
- AI-driven monitoring and reporting: proactive alerts, real-time lineage, auto-generated dashboards, end-to-end view of data flow
- open data sharing

### Databricks Data Intelligence Platform

It's **Data Lakehouse + AI**:

- Databricks AI
- Delta Live Tables (ETL)
- Workflows (orchestration)
- Databricks SQL with text-to-SQL (data warehousing)

Supported by:

- Data Intelligence Engine
- Unity Catalog
- Delta Lake

### Databricks data governance

- Unity Catalog: unified governance and security
- Delta Sharing: sharing between organizations
- Databricks Marketplace: commercialization of datasets
- Databricks Clean Rooms: private, secure computing (collaboration environment)

### Control plane & data plane

The control plane refers to:

- the web application
- configurations
- notebooks, repos, DBSQL
- the cluster manager

It manages secure access to the data plane.

The data plane refers to:

- clusters
- cloud storage

It includes data encryption for data-at-rest and data-in-motion.

### Compliance

Databricks compliance includes:

- SOC 2 Type II
- ISO 27001
- ISO 27017
- ISO 27018

Also supports:

- FedRAMP High
- HITRUST
- HIPAA
- PCI

GDPR and CCPA ready.

### Serverless data plane

Serverless computation is available: it reduces cost and effort while increasing productivity.

- three-layer isolation with data encryption
- elastic, automatic scale up and down

### Photon

A query engine compatible with Spark APIs: data is received as Delta/Parquet and Photon delivers results in the same formats. Its goal is to accelerate data and analytics workloads:

- SQL-based jobs
- IoT use cases
- data privacy and compliance
- loading data into Delta and Parquet

### Databricks SQL

Simplifies data analysis:

- text-to-SQL
- auto-generation and completion of code/queries
- problem diagnosis and solutions
- Unity Catalog integration

Allows:

- best TCO and performance for data warehousing
- intelligent workloads
- elastic allocation of resources

### Orchestration

Workflows:

- intelligent ETL processes
- AI-driven debugging
- end-to-end monitoring
- broad ecosystem integration

### Delta Live Tables

- automated, scalable streaming ingestion and transformation
- automated autoscaling
- intelligent orchestration, error handling and optimization

Automated checkpoints allow restarting processes if they are interrupted by unexpected errors. Data is refreshed incrementally.

### Data science and AI

Allows:

- securely training models
- moving models to production quickly
- deploying LLMs cost-effectively

Supports:

- custom models
- model serving
- RAG
- MLOps (MLflow)
- AutoML
- monitoring
- governance

#### DatabricksX

Allows building your own LLMs, training the model and doing RAG cost-effectively.
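The Delta Lake features above — a transaction log, ACID commits, and time travel — can be illustrated with a toy append-only log. This is a conceptual sketch only, not Delta Lake's actual on-disk format or API:

```python
# Toy illustration of a Delta-style transaction log: every write appends a
# new versioned snapshot, so any historical version can be read back
# ("time travel"). Conceptual sketch, NOT Delta Lake's real implementation.

class ToyDeltaTable:
    def __init__(self):
        self._log = []  # append-only list of snapshots (the "transaction log")

    def commit(self, rows):
        """Atomically commit a new snapshot; the version is the log index."""
        self._log.append(list(rows))
        return len(self._log) - 1  # version number of this commit

    def read(self, version=None):
        """Read the latest snapshot, or a past one via time travel."""
        if not self._log:
            return []
        if version is None:
            version = len(self._log) - 1  # default to the latest version
        return self._log[version]

table = ToyDeltaTable()
v0 = table.commit([{"id": 1, "name": "alice"}])
v1 = table.commit([{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}])
print(table.read())            # latest snapshot: two rows
print(table.read(version=v0))  # time travel back to version 0: one row
```

In real Delta Lake the same idea is exposed declaratively, e.g. reading a table at an older version with a `versionAsOf` option or `DESCRIBE HISTORY` in SQL.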
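The Unity Catalog "single permission model with fine-grained access controls" can be sketched as a toy privilege check over a catalog.schema.table hierarchy. This is a conceptual illustration of hierarchical grants, not the Unity Catalog API (which uses SQL `GRANT` statements):

```python
# Toy sketch of a Unity-Catalog-style permission model: privileges are
# granted on securables in a catalog.schema.table hierarchy and checked
# at access time, with grants on a parent covering its children.
# Conceptual only -- not the real Unity Catalog interface.

class ToyCatalog:
    def __init__(self):
        self._grants = {}  # (principal, securable) -> set of privileges

    def grant(self, privilege, securable, principal):
        self._grants.setdefault((principal, securable), set()).add(privilege)

    def is_allowed(self, principal, privilege, securable):
        # Walk up the hierarchy: a grant on "main.sales" also covers
        # "main.sales.orders" (inheritance from parent securables).
        parts = securable.split(".")
        for i in range(len(parts), 0, -1):
            parent = ".".join(parts[:i])
            if privilege in self._grants.get((principal, parent), set()):
                return True
        return False

uc = ToyCatalog()
uc.grant("SELECT", "main.sales", "analyst")  # schema-level grant
can_select = uc.is_allowed("analyst", "SELECT", "main.sales.orders")
can_modify = uc.is_allowed("analyst", "MODIFY", "main.sales.orders")
print(can_select, can_modify)  # True False
```

The hierarchy walk is why a single grant at the schema level gives fine-grained-yet-manageable control: one statement covers every table underneath it, while table-level grants remain possible for exceptions.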
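The Delta Live Tables behavior described above — automated checkpoints that let an interrupted pipeline resume, and incremental refresh — can be sketched with a toy offset-based checkpoint. Real DLT manages checkpoints automatically; this only illustrates the mechanism:

```python
# Toy sketch of checkpoint-based incremental processing: a checkpoint
# records the last processed offset, so a restarted run picks up where
# it left off instead of reprocessing everything. Conceptual only --
# Delta Live Tables handles this automatically under the hood.

def process_incrementally(source_rows, checkpoint):
    """Process only rows past the checkpointed offset, then commit progress."""
    start = checkpoint.get("offset", 0)        # resume point (0 on first run)
    new_rows = source_rows[start:]             # only the unseen rows
    processed = [row.upper() for row in new_rows]  # stand-in transformation
    checkpoint["offset"] = len(source_rows)    # commit the new checkpoint
    return processed

checkpoint = {}
source = ["a", "b", "c"]
first = process_incrementally(source, checkpoint)   # processes a, b, c
source += ["d"]                                      # new data arrives
second = process_incrementally(source, checkpoint)  # only the new row d
print(first, second)  # ['A', 'B', 'C'] ['D']
```

If the process crashes after the first run, the persisted checkpoint means the restart reprocesses nothing already committed — the same property the notes attribute to DLT's automated checkpoints.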