Introduction to Emerging Technology - Addis Ababa University PDF

Addis Ababa University, School of Commerce | Updated: Nov, 2024

INTRODUCTION TO EMERGING TECHNOLOGY
Chapter 2

Learning Objectives
By the end of this chapter, students will:
- Define data science and the role of data scientists.
- Differentiate between data and information.
- Explain the data processing life cycle.
- Identify various data types and their representations.
- Understand the data value chain in the context of big data.
- Grasp the basics of Big Data and the Hadoop ecosystem.
Key Focus: Empowering students to analyze, process, and derive value from data in the age of big data.

An Overview of Data Science
Definition: A multidisciplinary field that uses scientific methods, processes, and algorithms to extract insights from structured, semi-structured, and unstructured data.
Key Roles & Skills:
- Beyond Analysis: Involves data collection, organization, visualization, and decision-making.
- Skills Required:
  - Programming (e.g., Python, R).
  - Statistical and mathematical expertise (e.g., linear algebra, statistics).
  - Industry-specific knowledge and strong communication skills.

Data vs. Information
Data: Raw, unprocessed facts or figures (e.g., numbers, text, symbols).
Example: A list of daily temperatures: 25°C, 30°C, 28°C.
Information: Processed and organized data with context, making it meaningful.
Example: The average of these daily temperatures is about 27.7°C, which helps plan weather-dependent activities.
Key Differences:
- Data: Unorganized, lacks context. Suitable for input to processing systems.
- Information: Structured, contextualized, and valuable. Used for decisions and actions.

Data Processing Cycle
A sequence of steps to restructure or organize data, adding value and usefulness for specific purposes.
Key stages: Input, Processing, and Output.
- Input: Collecting and preparing data for processing. Example: Recording sales data on a hard disk or flash drive.
- Processing: Transforming input into meaningful information. Example: Calculating monthly sales totals or bank interest.
- Output: Presenting the processed data for use. Example: Generating employee payroll reports or sales summaries.

Data Types and Their Representation
Data Types
A data type defines the structure, storage, and operations allowed on data. Data types are key for organizing and analyzing data effectively.

Data Types in Computer Programming
Programming languages use data types to define how data is used. Common types:
- Integer (int): Whole numbers (e.g., 5, -100).
- Boolean (bool): Logical values, true or false.
- Character (char): A single letter or symbol (e.g., 'A', '!').
- Floating-point (float): Real numbers with decimals (e.g., 3.14).
- String (str): Sequence of characters (e.g., "Hello World").
Example: A program calculating interest: Principal amount → Integer. | Rate → Float. | Is loan approved? → Boolean.

Data Types in Data Analytics
Structured Data: Organized, tabular format with relationships. Data that adheres to a predefined data model with a tabular format (rows and columns). Each data point has a clear relationship with others.
Characteristics:
- Stored in relational databases or spreadsheets.
- Highly suitable for mathematical or statistical analysis.
Examples:
- A bank's customer database with columns like Name, Account Number, and Balance.
- An Excel spreadsheet tracking monthly expenses.
Use Cases:
- Sales Records: Capturing customer names, transaction IDs, and purchase amounts.
- Inventory Management: Tracking product IDs, quantities, and reorder levels.

Semi-structured Data: Loose structure, uses tags or markers. Data that doesn’t conform fully to a tabular structure but includes tags or markers for partial organization.
Characteristics:
- No strict schema but retains some level of structure.
- Suitable for NoSQL databases and web applications.
- Often used for transmitting data between systems.
Examples:
- JSON file describing a product’s details with tags like { "Name": "Laptop", "Price": 1200 }.
- XML file storing employee details, where tags wrap values such as John.
Use Cases:
- API Responses: Delivering structured responses for developers.
- Log Files: Analyzing web server activity logs to track website performance.

Unstructured Data: No predefined structure, text-heavy. Data that doesn’t follow any specific format or structure, making it complex to analyze without specialized tools.
Characteristics:
- Rich in information but difficult to process directly.
- Includes multimedia content and free-form text.
- Requires AI/ML models and advanced analytics for interpretation.
Examples:
- Social media posts with text, hashtags, and emojis.
- Videos, audio files, and scanned documents.
Use Cases:
- Customer Sentiment Analysis: Extracting insights from online reviews or social media feedback.
- Image Recognition: Identifying objects or patterns in photos.

Metadata: Data About Data
Metadata provides descriptive information about other data, giving context and details about its purpose, origin, or use.
Examples of Metadata:
- In Photos: Date and time of capture, location (geotag), camera settings (ISO, shutter speed).
- In Files: Author of the document, file creation date, version history.
Importance in Big Data:
- Organizes Complex Datasets: Metadata acts as an index, simplifying the search and retrieval process in massive datasets.
- Enables Analysis: For example, metadata in emails (e.g., sender, recipient, timestamps) helps filter and analyze communication trends.
- Foundation for Big Data Tools: Tools like Hadoop and Spark rely on metadata to efficiently process and store data.

Data Value Chain
The Data Value Chain outlines the steps in transforming raw data into valuable insights, highlighting the key activities in a big data system.

Data Acquisition
The process of gathering, filtering, and cleaning data before storing it for analysis.
Key Challenges:
- Infrastructure must handle high transaction volumes and dynamic structures.
- Low latency is critical for capturing and querying data efficiently.
Example: Collecting real-time weather data from sensors for climate prediction.

Data Analysis
Transforming raw data into actionable insights through exploration, modeling, and synthesis.
Techniques:
- Data mining for patterns.
- Machine learning for predictive insights.
- Business intelligence for decision-making.
Example: Analyzing customer purchase data to identify popular products and recommend them.
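The purchase-data example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the purchase records, the customer names, and the helper functions `popular_products` and `recommend` are hypothetical, and "popularity" is taken to mean a simple purchase count, not a production recommendation method.

```python
from collections import Counter

# Hypothetical purchase records as (customer, product) pairs.
purchases = [
    ("alice", "laptop"),
    ("bob", "phone"),
    ("carol", "laptop"),
    ("dave", "headphones"),
    ("alice", "phone"),
    ("erin", "laptop"),
]


def popular_products(records, top_n=2):
    """Return the top_n most frequently purchased products."""
    counts = Counter(product for _, product in records)
    return [product for product, _ in counts.most_common(top_n)]


def recommend(records, customer, top_n=2):
    """Suggest popular products that the customer has not bought yet."""
    owned = {product for who, product in records if who == customer}
    ranked = popular_products(records, top_n=len(records))
    return [product for product in ranked if product not in owned][:top_n]


top = popular_products(purchases)
suggestions = recommend(purchases, "bob")
print("Most popular:", top)
print("Suggestions for bob:", suggestions)
```

Counting purchases is the simplest form of the "data mining for patterns" technique listed above; a real system would likely weigh recency, price, and customer similarity as well.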
Data Curation
Ensuring the quality and accessibility of data through active management over its lifecycle.
Key Processes:
- Data selection, classification, transformation, and validation.
- Preservation for future use.
Why It Matters: Improves trust, usability, and discoverability of data for analysis.
Example: Crowdsourcing to annotate images for a machine-learning model to improve accuracy.

Data Storage
Persisting and managing data in systems that can scale and provide fast access.
Technologies:
- Traditional: RDBMS (e.g., SQL databases) with ACID properties for reliability.
- Modern: NoSQL (e.g., MongoDB, Cassandra) for handling vast and complex datasets.
Example: Using a NoSQL database like MongoDB to store unstructured social media data.

Data Usage
Using insights derived from data to drive business decisions and operations.
Benefits:
- Reduces costs and increases efficiency.
- Enhances competitiveness by enabling informed decision-making.
Example: Using predictive analytics in logistics to optimize delivery routes and reduce fuel costs.

Basic Concepts of Big Data
What is Big Data?
Refers to datasets too large and complex to process with traditional tools or a single computer. Big data challenges conventional database management and processing methods. RDBMS are failing …

Key Characteristics ("4Vs"):
- Volume: Massive datasets, e.g., zettabytes of data.
- Velocity: High-speed data generation and real-time streaming.
- Variety: Diverse data types (structured, semi-structured, unstructured).
- Veracity: Data reliability and accuracy.
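The Volume characteristic can be made concrete with rough arithmetic. The sketch below estimates how long a sequential scan of one zettabyte would take. The 200 MB/s read speed and the 10,000-machine cluster size are assumptions chosen for illustration, not figures from the slides, and the model ignores network and coordination overhead.

```python
ZETTABYTE = 10**21                 # bytes in one zettabyte (decimal SI definition)
DISK_BYTES_PER_SEC = 200 * 10**6   # assumed sequential read speed: 200 MB/s
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def read_time_seconds(total_bytes, machines=1):
    """Seconds to scan total_bytes when the work is split evenly across machines."""
    return total_bytes / (DISK_BYTES_PER_SEC * machines)


single = read_time_seconds(ZETTABYTE)
cluster = read_time_seconds(ZETTABYTE, machines=10_000)

print(f"One machine:     {single / SECONDS_PER_YEAR:,.0f} years")
print(f"10,000 machines: {cluster / SECONDS_PER_YEAR:,.1f} years")
```

Under these assumptions a single machine would need well over a hundred thousand years, which is why data at this scale is spread across many machines instead of being processed on one.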
Clustered Computing in Big Data
Why Clusters? Individual computers cannot handle the scale of big data effectively.
Key Benefits of Clusters:
- Resource Pooling: Combines storage, CPU, and memory across machines.
- High Availability: Fault tolerance ensures minimal data loss or downtime.
- Scalability: Easily add more machines to meet demand.
Cluster Management Tools:
- Hadoop YARN: Manages resources and schedules tasks across nodes.

Hadoop and Its Ecosystem
What is Hadoop? Open-source framework for distributed storage and processing of big data.
Key Features:
- Economical: Uses commodity hardware.
- Reliable: Stores redundant data copies for fault tolerance.
- Scalable: Easily handles horizontal and vertical scaling.
- Flexible: Supports both structured and unstructured data.
Core Hadoop Components:
- HDFS: Distributed file system for storage.
- YARN: Manages cluster resources.
- MapReduce: Processes data in parallel across nodes.
- Spark: In-memory data processing for speed (a separate Apache project that commonly runs on Hadoop clusters).
Additional Ecosystem Tools:
- Pig & Hive: Data querying and analysis.
- HBase: NoSQL database for unstructured data.
- Mahout & Spark MLlib: Machine learning libraries.
- ZooKeeper: Cluster coordination and management.
- Oozie: Job scheduling.

Big Data Lifecycle with Hadoop
Ingest Data: Transfer data from sources like relational databases or logs.
Tools:
- Sqoop: Transfers data from RDBMS to HDFS.
- Flume: Captures event logs and streaming data.
Process Data: Store data in HDFS or HBase for distributed processing.
Tools:
- Spark: Real-time processing.
- MapReduce: Batch processing.
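The MapReduce model named above can be illustrated with a word count, its classic teaching example. The sketch below runs in a single Python process and does not use any Hadoop API. The function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are hypothetical. On a real cluster, the map and reduce steps would run in parallel on different nodes, with the framework handling the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could live on a different node.
lines = [
    "big data needs big clusters",
    "clusters process data in parallel",
]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be distributed across machines without shared state.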
Analyze Data: Extract insights using querying frameworks.
Tools:
- Pig: Script-based transformation and analysis.
- Hive: SQL-like querying for structured data.
Access & Visualize Data: Present results using intuitive tools.
Tools:
- Hue: Web-based interface for accessing Hadoop.
- Cloudera Search: Enables search-based data retrieval.
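To give a feel for the SQL-like querying used in the analysis step, the sketch below runs an aggregate query with `sqlite3` from the Python standard library. SQLite is only a stand-in that runs without a cluster. It is not Hive, and the `sales` table and its rows are hypothetical. The query itself is a plain GROUP BY aggregation; Hive's own query language (HiveQL) supports similar statements, but its exact dialect should be checked against the Hive documentation.

```python
import sqlite3

# In-memory database standing in for a table of structured sales data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("laptop", 1200.0), ("phone", 650.0), ("laptop", 1100.0)],
)

# Total revenue per product, highest first.
rows = conn.execute(
    "SELECT product, SUM(amount) AS total "
    "FROM sales GROUP BY product ORDER BY total DESC"
).fetchall()
conn.close()

for product, total in rows:
    print(product, total)
```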
