Big Data Processing - 2410 Application Deployment PDF


Summary

This document covers application deployment for big data processing, including manual installation, cloud computing distributions, and scalability techniques. It details the practical steps involved and explains the underlying concepts. The document is part of a larger course or module on big data processing from the EAE Business School.

Full Transcript


03. Application deployment. Development of scalable applications

How to deploy these applications:

1. Manual installation: install Hadoop and Spark (Link1), including multi-node setups.
2. Cloud computing distributions: managed services where installation is simpler and scaling is straightforward:
   1. Cloudera
   2. Hortonworks
   3. Google
   4. AWS
   5. Microsoft
   6. IBM

The best thing about them is the scalability. You can create a cluster for a particular use case, deploy it when you want, and increase its size as you need. When you stop needing it, you can delete it and you will not be charged for any of it.

What do I need to know to install manually?

a. Basic cmd/bash commands.
b. Download the software needed: the desired software plus its dependencies (for example, Java).
c. Set up the machine properly:
   I. Environment variables.
   II. Networking: IP addresses.
   III. Communication protocols: SSH (keys for secure, encrypted access).
d. Install and verify prerequisites (check versions with commands like java --version, python3 --version, pip ...; system utilities: curl, wget, unzip ...).
e. Install the big data tool:
   i. Follow the installation guide for the software: extract the downloaded binaries, move them to a suitable directory, and run any setup scripts provided.
   ii. Create system services for tools that run continuously (on Linux, a .service unit in /etc/systemd/system; on Windows, use Task Scheduler).
f. Configure the tool:
   i. Modify configuration files as required: core-site.xml, hdfs-site.xml, and mapred-site.xml for Hadoop; spark-defaults.conf and spark-env.sh for Spark.
   ii. Set appropriate resource limits: memory allocation, number of threads.
   iii. Enable logging: specify log levels (INFO, DEBUG, ERROR) and configure log rotation.
g. Validate the installation:
   i. Run basic commands to test functionality. Hadoop: hadoop version, hdfs dfs -ls /. Spark: spark-submit --version.
   ii. Verify services are running: check processes with ps or tasklist; use the web UIs (e.g., Hadoop ResourceManager, Spark Master).
h. Optimize for performance:
   i. Adjust system limits (ulimit, kernel parameters).
   ii. Optimize JVM settings for Java-based tools. Example: -Xms2g -Xmx4g.
   iii. Enable caching and compression for large data handling.
i. Set up monitoring:
   i. Install monitoring tools: JMX Exporter, Prometheus, and Grafana for metrics; log analyzers like Elasticsearch and Kibana.
   ii. Monitor system resources: disk usage, CPU, memory, network throughput.
j. Document the process:
   i. Keep a log of installation steps and configurations for future reference.
   ii. Create a backup of configuration files.

Short sketches of steps c/d/g, e.ii, and h appear below.
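As an illustration of steps c, d, and g, here is a minimal sketch for a single Ubuntu node. The Java package path and the Hadoop install directory are assumptions for the example, not values from the slides:

    # (c.I) environment variables -- persist these in ~/.bashrc on a real machine
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    export HADOOP_HOME="$HOME/hadoop"
    export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

    # (c.III) passwordless SSH to localhost, needed by Hadoop's start scripts
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys

    # (d) verify prerequisites
    java --version
    python3 --version
    which curl wget unzip

    # (g) validate the installation
    hadoop version
    hdfs dfs -ls /
    spark-submit --version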
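For step e.ii (system services), a hypothetical systemd unit; the service name, user, and ExecStart path are placeholders for whatever your installation uses. Create /etc/systemd/system/hadoop-namenode.service with:

    [Unit]
    Description=Hadoop NameNode
    After=network.target

    [Service]
    User=hadoop
    ExecStart=/opt/hadoop/bin/hdfs namenode
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Then reload systemd and enable the service so it starts on boot:

    sudo systemctl daemon-reload
    sudo systemctl enable --now hadoop-namenode.service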
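And for step h (performance), a sketch of the kind of adjustments involved; the values here are illustrative only, not recommendations from the slides:

    # (h.i) raise the open-file limit for the current shell session
    ulimit -n 65536

    # (h.i) a common kernel tweak for data-heavy nodes: reduce swapping
    echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p

    # (h.ii) JVM heap settings for Hadoop daemons, using the flags from the slide
    echo 'export HADOOP_OPTS="-Xms2g -Xmx4g"' >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"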
Deploy Spark

Option 1: Virtual machine (on a Linux OS). Steps:
- Install VirtualBox and Ubuntu.
- Follow the steps in this video to install Spark: link

Option 2: Using Docker (the pyspark + JupyterLab image). Run it with a command:

    docker run -p 8888:8888 -p 4040:4040 -e JUPYTER_ENABLE_LAB=yes -v /c/Users/marti/OneDrive/Documentos/Notebooks:/home/jovyan/work jupyter/pyspark-notebook

Deploy Airflow

Option 1: Virtual machine (on a Linux OS). Steps:
- Install VirtualBox and Ubuntu.
- Follow the steps in this video to install Airflow: link

Option 2: Using Docker.

Option 3: Using docker compose (a minimal compose sketch appears after the tools overview below).

Applications Deployment: Development of Scalable Applications for Big Data Processing

Designing for High-Volume, High-Velocity and High-Variety Data

Big data applications must handle high velocity (streaming), massive volumes, and diverse data. Example: Netflix processes petabytes daily for recommendations and analytics.

Why does scalability matter? It enables real-time analytics, efficient storage, and seamless scaling. Why is it critical in big data?
- To manage exponential data growth.
- To enable real-time insights from streaming data.
- To adapt quickly to new data sources and frameworks.
- To ensure high availability and fault tolerance.

SCALING: VERTICAL vs. HORIZONTAL
- Vertical scaling: upgrading the RAM/CPU/storage of an existing machine.
- Horizontal scaling: adding more nodes.

CHALLENGES:
- Managing distributed systems.
- Sharding and partitioning data across nodes.
- Ensuring fault tolerance.
- Balancing compute vs. storage requirements.
- Handling varying workloads effectively.

BEST PRACTICES FOR BIG DATA SCALABILITY:
- Use distributed frameworks like Apache Spark, Kafka, or Flink.
- Leverage cloud-based storage (e.g., Amazon S3, Google BigQuery).
- Optimize data pipelines for streaming frameworks.
- Design for modularity and scalability in architecture.

TOOLS & TECHNIQUES FOR BIG DATA SCALABILITY

1. Distributed Storage
- HDFS (Hadoop Distributed File System): a reliable and scalable storage system for managing large datasets.
- Amazon S3: cloud-based object storage with high scalability and durability.
- Google Cloud Storage: a highly available, secure, and scalable storage option for big data processing.

2. Data Processing Frameworks
- Apache Hadoop: batch processing for large-scale data analysis.
- Apache Spark: unified analytics engine for large-scale data processing with in-memory computation.
- Apache Flink: stream and batch processing framework optimized for real-time analytics.

3. Streaming Tools
- Apache Kafka: distributed event streaming platform for real-time data pipelines.
- AWS Kinesis: scalable and fully managed platform for streaming data processing.
- Apache Pulsar: multi-tenant, high-performance message broker for streaming and queuing.
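A minimal streaming sketch tying these tools to the partitioning ideas above, assuming a local Kafka broker on localhost:9092 and Kafka's CLI scripts on the PATH; the topic name, event payload, and partition count are made up for the example:

    # create a topic with three partitions so consumers can scale horizontally
    kafka-topics.sh --create --topic page-views --partitions 3 \
      --replication-factor 1 --bootstrap-server localhost:9092

    # produce one test event...
    echo '{"user": 1, "page": "/home"}' | kafka-console-producer.sh \
      --topic page-views --bootstrap-server localhost:9092

    # ...and read everything back from the beginning
    kafka-console-consumer.sh --topic page-views --from-beginning \
      --bootstrap-server localhost:9092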
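Returning to the docker compose option from the Airflow slides: a stripped-down sketch, assuming the apache/airflow image running in standalone mode (the image tag and mounted path are illustrative). The official compose file in the Airflow documentation is fuller, with separate Postgres, Redis, scheduler, and webserver services. A minimal docker-compose.yml:

    services:
      airflow:
        image: apache/airflow:2.9.0    # assumed version tag
        command: standalone            # all-in-one mode, for local experiments only
        ports:
          - "8080:8080"                # Airflow web UI
        volumes:
          - ./dags:/opt/airflow/dags   # local DAG definitions

Start it with:

    docker compose up -d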
4. Workflow Orchestration and Automation
- Apache Airflow: a platform to programmatically author, schedule, and monitor workflows.
- Apache NiFi: data flow automation for system integration, transformation, and routing.
- Luigi: Python-based workflow management system designed for batch processes.
- Prefect: modern workflow orchestration platform with an emphasis on simplicity and reliability.

5. Monitoring and Optimization
- Prometheus: open-source monitoring and alerting toolkit for system metrics.
- Grafana: visualization and analytics software for monitoring dashboards.
- Datadog: cloud-based monitoring and performance tracking for infrastructure and apps.
- Elastic Stack (ELK): log management and analytics platform for monitoring and troubleshooting.

6. ETL and Data Integration
- Talend: data integration platform for transforming, cleansing, and loading data.
- Informatica: enterprise-grade data integration and management.
- dbt (Data Build Tool): transform data inside the warehouse with SQL-based workflows.

7. Cloud and Resource Management
- Kubernetes: container orchestration for managing scalable deployments.
- Terraform: infrastructure as code (IaC) tool for provisioning scalable infrastructure.
- Cloud Dataflow: Google's managed service for stream and batch processing pipelines.

REAL CASE STUDIES:
- Netflix: real-time pipelines with Spark and Kafka for recommendations.
- Uber: scalable architecture for ride-matching and analytics.
- Twitter: handling millions of tweets per second with distributed systems.

STEPS TO BUILD BIG DATA APPLICATIONS:
1. Design modular data pipelines (ingestion, processing, storage).
2. Test with real-world data to identify bottlenecks.
3. Build fault-tolerant systems with recovery mechanisms.
4. Implement data partitioning for distributed workloads.

CONCLUSION
- Scalability ensures big data systems grow with business needs.
- Proper design supports real-time analytics and batch processing.

QUESTIONS OR THOUGHTS?
- Discuss ideas on scalable big data applications.
- Explore strategies for your projects.
