Summary

This document provides an overview of Big Data, including its global volume, sources, and storage methods, and surveys the economic and financial data sources available in Spain together with the types of analysis they support. It then introduces databases: relational databases and SQL, Excel as a flat-file database, NoSQL databases, and Power BI and Power Apps.

Full Transcript


BIG DATA

"Data is absolutely key to everything we are doing. It is needed in all areas and fuels the insights that help us make better decisions in all aspects of the business." – Kathleen Hogan (Chief People Officer at Microsoft)

GLOBAL VOLUME OF DATA: It is estimated that in 2025 the world will generate 175 zettabytes (ZB) of data (1 ZB = 1 billion gigabytes); in 2010 it was just 2 ZB. Internet users generate around 2,500,000 GB of data every day, and 90% of all data was generated in the last 2 years.

THE 5 Vs OF BIG DATA:
1. VELOCITY: batch, near time, real time, streams
2. VARIETY: structured, unstructured, semi-structured, all of the above
3. VOLUME: terabytes, records, transactions, tables, files
4. VERACITY: trustworthiness, authenticity, origin, reputation, accountability
5. VALUE: statistical, events, correlations, hypothetical

SOURCES OF DATA: Main sources are:
- Facebook
- Twitter (500,000 tweets per minute)
- Instagram (347,222 posts per minute)
- IoT (75 billion connected devices generating data) – sensors

STORAGE OF GENERATED DATA: Less than 20% of global data is stored in relational databases. It is a small percentage, but an important one: it covers the databases of banks, hospitals, customers… The other 80% of global data is not structured (text, images, video); this data is stored in Big Data architectures, in the cloud and in NoSQL databases.

BIG DATA STORAGE: Different technologies are needed to store, process and analyze volumes of data that cannot be managed with traditional databases.

STORAGE IN HDFS (HADOOP DISTRIBUTED FILE SYSTEM): this type of setup is prepared to handle large volumes of data across multiple servers.
- It divides data into small blocks (typically 128 MB or 256 MB) and distributes them across different nodes (servers)
- It provides high redundancy (copies of data) to ensure that data is not lost if a node fails
- Ideal for storing large amounts of unstructured or semi-structured data

DATA LAKES: a centralized repository that stores flat files of all types of data (structured, semi-structured and unstructured). The data is stored raw, as it is generated, with no transformation. It is used when you need to store large volumes of diverse, raw data for long-term analysis, or when you don't yet know what type of analysis you will perform later.

NoSQL: the other main storage option for unstructured data (covered in detail below).

ECONOMIC AND FINANCIAL DATA SOURCES

There are multiple relevant data sources in the economic and financial space that can be used and incorporated into analyses and databases.

TYPE OF DATA ANALYSIS: In general, all these data sources support:
- Descriptive analysis: summarizing and describing a dataset (e.g. unemployment this month in Spain by age range)
- Trend analysis over time: analyzing how the data changed over time (e.g. how unemployment has changed during the last year, by month, in Spain)
- Comparative analysis: between regions, groups of people or different variables (e.g. how unemployment has changed during the last year, by month, across the different Spanish regions)

INE (Spanish National Statistics Institute): it offers a wide range of statistical data on economic, demographic and social aspects of the country. It regularly updates its data and provides access through its website, where you can download reports and databases and use interactive tools to analyze the information.
- Demography and population (census, births, deaths)
- Economy (GDP, Consumer Price Index, surveys)
- Labor market (active population survey data)
- Companies and establishments (Central Directory of Companies, Industrial Survey of Companies, Survey on Innovation in Companies, Statistics on Commercial Companies)
- Society (Education Statistics, Health Survey, Living Conditions Survey, Labor and Geographical Mobility Statistics)

MINISTRY OF ECONOMY, TRADE AND ENTERPRISE: provides a wide range of financial data and statistics.
- Macroeconomic data (evolution of the Spanish economy, economic growth)
- Public finances (data on budget execution, statistics on deficit and public debt)
- Labor market (labor market statistics: employment, unemployment, job offers)
- Financial system (information on the situation and evolution of the Spanish financial system: banks and other financial institutions)
- Foreign trade (data on exports, imports and the trade balance)

OTHER SOURCES:
- SPANISH GOVERNMENT
- MADRID STOCK EXCHANGE
- BANK OF SPAIN (interest rate statistics)
- EUROSTAT: high-quality statistics and data on Europe
- WORLD BANK: free and open access to global development data
- INTERNATIONAL MONETARY FUND: access to macroeconomic and financial data

INTRODUCTION TO DATABASES

INTRODUCTION: Understanding databases is essential, since they underpin every aspect of today's digital world, facilitating efficient data management across industries:
- E-commerce Platforms
- Social Media Networks
- Banking and Financial Services
- Healthcare Systems
- Educational Institutions
- Logistics and Supply Chain Management
- Customer Relationship Management (CRM)
- Government Services

BEFORE DB: Information used to be stored and managed in different ways: paper, magnetic tapes, books and accounting records, electronic files and directories. These methods had limitations:
- Difficulty in searching and retrieving
- Lack of integrity and security
- Inability to handle large volumes of data

EVOLUTION OF DB:
1970s: ER Model
- Introduction of the Entity-Relationship model as a standard tool for database design
- Oracle releases the first commercial RDBMS

1980s: DBMS / SQL
- SQL, created at IBM, becomes the standard query language
- More companies build RDBMSs, such as Sybase, whose technology became the basis of the early Microsoft SQL Server

1990s: NoSQL / Data mining
- NoSQL databases start to manage unstructured data (images, text, audio and other types of information)
- Data warehousing (a central repository consolidating data from different databases for analysis) and data mining (data analytics) appear

2000s: Big Data / Cloud
- Open-source databases (free versions: MySQL, PostgreSQL, Neo4j)
- Database products for large volumes
- Data lake concepts
- Databases in the cloud and serverless solutions

BASIC CONCEPTS OF DATABASES:

Database: a collection of interrelated data that is organized and stored in such a way that it can be easily accessed, managed and updated. It consists of tables, each containing rows and columns where data is stored and related to other tables through relationships. Databases can store information about people, products, orders or anything else. Many databases start as a list in a spreadsheet or word-processing program. As the list grows, data redundancies and inconsistencies start to appear; it becomes increasingly difficult to understand the data in list form, and methods for searching or extracting subsets of data for review are limited. Once these problems appear, it's a good idea to transfer the data to a database created with a database management system (DBMS).

Database Management System (DBMS): software designed to manage databases, providing functionality to define, create, query (search & select), update and administer databases. It acts as an interface between users and the database, ensuring efficient storage, retrieval and manipulation of data while maintaining data integrity and security. Examples: Oracle Database, Microsoft SQL Server, MySQL.
*It helps you organize the different tables included in the database.

DBM Tools and Software: applications that help users create, manage and manipulate databases. They provide an interface to interact with the data and perform operations like querying, updating and reporting.
- Storage: large volumes in structured format
- Retrieval: querying capabilities using languages like SQL
- Manipulation: insert, update and delete
- Integrity: accuracy and consistency through constraints, validation rules and transactions
- Security: protecting data with user authentication
- Backup and recovery: offering tools for data backup and recovery
- Performance optimization: features like indexing (building auxiliary structures so lookups on a text field are fast), caching (keeping frequently used data ready on the device) and query optimization (executing queries faster)

THE 3 ARCHITECTURE LEVELS IN DB: One of the main purposes of a DBMS is to provide users with a simplified view of the data, hiding the complexities of how the data is stored and managed.
1. CONCEPTUAL LEVEL: defining the main entities and relationships in a way that is technology-agnostic. The highest level of abstraction.
2. LOGICAL LEVEL: detailed but still abstract, specifying tables, columns and relationships, ready to be mapped to a specific DBMS.
3. PHYSICAL LEVEL: describes how data is really stored. It is the lowest level.
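The performance features listed above (indexing in particular) can be sketched with SQLite through Python's built-in sqlite3 module. The table, column and index names below are invented for illustration, and the exact EXPLAIN QUERY PLAN wording varies between SQLite versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)",
                 [("Ana",), ("Luis",), ("Marta",)])

# Without an index, looking a customer up by name scans the whole table
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE name = 'Ana'"
).fetchone()
print(plan[-1])  # e.g. "SCAN customers" (full table scan)

# After creating an index, the same lookup becomes an index search
conn.execute("CREATE INDEX idx_customers_name ON customers (name)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE name = 'Ana'"
).fetchone()
print(plan[-1])  # e.g. "SEARCH customers USING INDEX idx_customers_name (name=?)"
```

The same trade-off applies in any DBMS: the index costs extra storage and slows writes slightly, but turns lookups on that field from a full scan into a fast search.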
o NN: not null – the field cannot be left empty
o PK: primary key – unique, cannot be repeated
o FK: foreign key – connects two tables (store only the PK of the other table, where the rest of the information lives, instead of repeating it all)

DB ARE IMPORTANT IN ECONOMIC AND FINANCIAL ANALYSIS BECAUSE THEY:
- Provide a structured way to store data related to markets, transactions and financial records
- Organize the data into easily accessible formats for analysis, reporting and prediction
- Facilitate informed decision-making, identifying trends and performing accurate forecasting
- Maintain accuracy and confidentiality of sensitive financial information

INTRO TO SQL

SQL AND ITS IMPORTANCE: SQL (Structured Query Language) is used for managing and manipulating relational databases. It provides commands for creating, reading, updating and deleting data in a database. Most relational database management systems support SQL, making it a universal language for database interaction. It is essential for data analysis, allowing users to run queries that extract meaningful insights from datasets.
- It is designed to handle large volumes of data efficiently
- It is easy to learn and use, even without deep technical knowledge

REAL-WORLD APPLICATIONS:
- Business Intelligence: companies use SQL to generate reports and analyze business performance
- Finance: banks and financial institutions use SQL for transaction processing and risk management
- Healthcare: medical records and patient data management
- E-commerce: inventory management, customer data and sales tracking

MAIN DATA TYPES:

Data Type    | Description                                  | Example
INT          | Whole numbers                                | 42
FLOAT        | Floating-point numbers (approximate)         | 3.14
DOUBLE       | Double-precision floating-point              | 2.718281828459
DECIMAL(p,s) | Fixed-point numbers with precision and scale | 123.45 (DECIMAL(5,2))
VARCHAR(n)   | Variable-length text up to n chars           | 'John Doe' (VARCHAR(100))
CHAR(n)      | Fixed-length text of n chars                 | 'A' (CHAR(1))
TEXT         | Large amounts of text                        | 'This is a long text'
DATE         | Date value (YYYY-MM-DD)                      | '2024-07-09'
TIME         | Time value (HH:MM:SS)                        | '14:30:00'
DATETIME     | Date and time value                          | '2024-07-09 14:30:00'
TIMESTAMP    | Date and time, auto-updated                  | '2024-07-09 14:30:00'
BLOB         | Binary Large Object                          | [binary data]
BOOLEAN      | True/False values                            | TRUE or FALSE

MAIN STRUCTURES:

CREATING TABLES: a table consists of columns (which define the type of data) and rows (which contain the actual data). Basic syntax of CREATE TABLE:
CREATE TABLE TableName (Column1 DataType1, Column2 DataType2, ...);

SELECTING RECORDS: retrieving data from a table. Basic syntax of a SELECT query:
SELECT column1, column2, ... FROM table1 JOIN table2 ON table1.common_column = table2.common_column WHERE condition;

INSERTING RECORDS: inserting data into a table. Basic syntax of an INSERT query:
INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...);

UPDATING RECORDS: updating data in a table. Basic syntax of an UPDATE query:
UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition;

DELETING RECORDS: deleting data from a table.
Basic syntax of a DELETE query:
DELETE FROM table_name WHERE condition;

ALTERING A TABLE: adding columns to a table. Basic syntax of an ALTER query:
ALTER TABLE table_name ADD column_name datatype;

DROPPING A TABLE: deleting a table entirely. Basic syntax of a DROP query:
DROP TABLE table_name;

UNDERSTANDING JOINS IN SQL: JOINs are used in SQL to combine rows from 2 or more tables based on a related column between them.

TYPES OF JOINS:
Inner join: returns only the rows that have matching values in both tables. Syntax:
SELECT columns FROM table1 INNER JOIN table2 ON table1.common_column = table2.common_column;
Left outer join: returns all rows from the left table, plus the matched rows from the right table. If no match is found, NULL values are returned for columns from the right table. Syntax:
SELECT columns FROM table1 LEFT JOIN table2 ON table1.common_column = table2.common_column;
Right outer join: returns all rows from the right table, plus the matched rows from the left table. If no match is found, NULL values are returned for columns from the left table. Syntax:
SELECT columns FROM table1 RIGHT JOIN table2 ON table1.common_column = table2.common_column;

USING SIMPLE QUERIES FOR DATA ANALYSIS: These queries cover basic tasks:
- Retrieving data
- Performing aggregations
- Filtering results
- Using simple joins
SELECT column1, column2, ... FROM table_name;
SELECT * FROM table_name;
SELECT TOP X * FROM table_name; (SQL Server syntax; MySQL, PostgreSQL and SQLite use SELECT * FROM table_name LIMIT X;)

USING AGGREGATIONS FOR DEEPER ANALYSIS: aggregations are used for summarizing data, deriving insights and performing calculations on datasets.
SELECT COUNT(*) FROM table_name;
SELECT SUM(column), MAX(column), AVG(column), MIN(column), ... FROM table_name;

FILTERING CONDITIONS:
SELECT column1, column2 FROM table_name WHERE condition;

USING OPERATORS: operators are essential tools for manipulating and retrieving data. They let you perform various operations on the data (arithmetic calculations, comparisons, logical evaluations and combining multiple conditions in a query).
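As a runnable sketch of the CREATE / INSERT / UPDATE / DELETE and JOIN syntax above, here is a small SQLite session driven from Python's built-in sqlite3 module; the car-rental tables are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (carid INTEGER PRIMARY KEY, model TEXT, dailyrate REAL)")
conn.execute("CREATE TABLE rentals (rentalid INTEGER PRIMARY KEY, carid INTEGER, days INTEGER)")

# INSERT: two cars, two rentals (both for the Corolla)
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)",
                 [(1, "Corolla", 50.0), (2, "Ibiza", 40.0)])
conn.executemany("INSERT INTO rentals VALUES (?, ?, ?)", [(10, 1, 3), (11, 1, 2)])

# UPDATE: change one car's rate
conn.execute("UPDATE cars SET dailyrate = 45.0 WHERE carid = 2")

# INNER JOIN + aggregation: only cars that have matching rentals appear
rows = conn.execute("""
    SELECT cars.model, COUNT(*) AS times_rented
    FROM cars INNER JOIN rentals ON cars.carid = rentals.carid
    GROUP BY cars.model
""").fetchall()
print(rows)  # [('Corolla', 2)]

# LEFT JOIN keeps unmatched left-table rows, with NULL on the right side
rows = conn.execute("""
    SELECT cars.model, rentals.rentalid
    FROM cars LEFT JOIN rentals ON cars.carid = rentals.carid
    WHERE rentals.rentalid IS NULL
""").fetchall()
print(rows)  # [('Ibiza', None)] -- the Ibiza was never rented

# DELETE: remove one rental
conn.execute("DELETE FROM rentals WHERE rentalid = 11")
```

Note that SQLite supports LEFT JOIN but (in older versions) not RIGHT JOIN; a right join can always be rewritten as a left join with the tables swapped.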
TYPES:
Arithmetic: +, -, *, / → SELECT carid, dailyrate, dailyrate - 5 AS discounted_rate FROM cars;
Comparison: <, >, <=, >=, =, !=, LIKE → SELECT * FROM cars WHERE dailyrate > 50;
LIKE: used for pattern matching in strings, most often in WHERE clauses to search for a specified pattern in a column. Particularly useful when you need records that match a pattern rather than an exact value. It is used with wildcards: "%" (zero, one or multiple characters) and "_" (a single character) → SELECT * FROM cars WHERE carmodel LIKE 'Cam%';
Logical operators: AND, OR, NOT → SELECT * FROM rentals WHERE NOT (returndate IS NULL);

USING STRING FUNCTIONS: string functions are essential tools for manipulating and transforming text data. They allow you to perform various operations on strings (concatenation, extraction, formatting and searching for specific patterns).
1. CONCAT: SELECT CONCAT(CustomerName, ' ', CustomerPhone) AS FullContact FROM customers;
2. SUBSTRING: SELECT SUBSTRING(customerphone, 1, 3) AS area_code FROM customers;
3. LENGTH: SELECT LENGTH(carmodel) AS model_length FROM cars;
4. UPPER and LOWER: SELECT UPPER(carmodel) AS model_upper FROM cars;
5. TRIM, LTRIM, RTRIM: SELECT TRIM(customer_name) AS name FROM customers;
6. REPLACE: SELECT REPLACE(customer_name, 'S', 's') AS name FROM customers;
7. LEFT and RIGHT: SELECT RIGHT(customer_name, 2) FROM customers;

USING SUBQUERIES: subqueries are queries nested within another SQL query. They are enclosed in parentheses and can be used in various clauses (SELECT, FROM, WHERE, HAVING). They are powerful tools for breaking down complex queries into simpler parts, making SQL code easier to understand and maintain.
- Subquery returning 1 row: retrieve the name of the customer who made the most expensive rental →
SELECT CustomerName FROM customers WHERE customerid = (SELECT customerid FROM rentals ORDER BY totalcost DESC LIMIT 1);
- Subquery returning multiple rows: retrieve the details of cars that have been rented by customers who have rented more than 2 times →
SELECT * FROM cars WHERE carid IN (SELECT carid FROM rentals WHERE customerid IN (SELECT customerid FROM rentals GROUP BY customerid HAVING COUNT(*) > 2));
- In SELECT → SELECT customer_name, (SELECT SUM(total_amount) FROM Orders WHERE Orders.customer_id = Customers.customer_id) AS total_spent FROM Customers;
- In WHERE → SELECT first_name, last_name, salary FROM Employees WHERE salary > (SELECT AVG(salary) FROM Employees);
- Correlated subquery → SELECT first_name, salary FROM Employees e1 WHERE salary > (SELECT AVG(salary) FROM Employees e2 WHERE e2.department = e1.department);
- With IN → SELECT product_name FROM Products WHERE product_id IN (SELECT product_id FROM Orders);

USING MICROSOFT EXCEL AS A DATABASE

EXCEL AS A FLAT-FILE DATABASE: Excel functions as a flat-file database: it stores data in a single table or sheet, without the complex relational structures of SQL databases. Flat-file databases are useful for managing small datasets where complex relationships and constraints aren't required.
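The WHERE-clause and correlated subquery patterns above can be checked against a toy employees table in SQLite (data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Ana",  "Sales", 30000), ("Luis", "Sales", 50000),
    ("Eva",  "IT",    40000), ("Juan", "IT",    60000),
])

# Subquery in WHERE: employees earning more than the overall average (45000)
rows = conn.execute("""
    SELECT name FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
""").fetchall()
print(rows)  # [('Luis',), ('Juan',)]

# Correlated subquery: each row is compared against the average of its OWN
# department (Sales avg = 40000, IT avg = 50000)
rows = conn.execute("""
    SELECT name FROM employees e1
    WHERE salary > (SELECT AVG(salary) FROM employees e2
                    WHERE e2.department = e1.department)
""").fetchall()
print(rows)  # [('Luis',), ('Juan',)]
```

The correlated version re-evaluates the inner query once per outer row, which is what makes it more expressive (and usually slower) than a plain subquery.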
Excel is a useful tool for smaller applications or quick analyses.
- Data storage in Excel: each worksheet can represent a "table", with rows as records and columns as fields
- Advantages and limitations: while Excel can manage tables and perform simple lookups, it lacks the integrity constraints (like enforced referential integrity) and scalability of an RDBMS
- Appropriate use cases: rapid prototyping, data exploration and small-to-medium datasets

CREATING AND USING RELATIONSHIPS IN EXCEL: Excel's Data Model allows for the creation of relationships between tables, enabling users to query and analyze data from multiple tables within the same workbook.
Data Model basics: a built-in feature allowing multiple tables to exist in a single workbook with defined relationships, making it possible to analyze related data.
Setting up relationships: lets you establish relationships between tables using fields that serve as foreign keys.
Relationship types: Excel only supports one-to-one and one-to-many relationships. *Many-to-many relationships require more complex workarounds, like bridge tables.
Limitations of relationships in Excel: Excel doesn't enforce referential integrity in relationships, so data consistency requires manual attention.

DATA INTEGRITY IN EXCEL: Data integrity is critical in any database, ensuring that the data is accurate, consistent and reliable. Although Excel doesn't enforce integrity constraints, best practices and certain tools within Excel can help maintain data quality.
Unique IDs and data consistency: include a unique ID for each record to avoid duplicates, and use consistent data types within each column to prevent errors.
Validation rules: use Excel's Data Validation feature to restrict data input (enforcing date formats or restricting input to specific lists).
Error-checking tools: use Conditional Formatting to highlight potential errors (duplicates or blank cells) and Text to Columns to standardize data formats (dates or currency).

NORMALIZATION WITHIN EXCEL'S LIMITS: Database normalization minimizes redundancy and improves data integrity by organizing data into separate tables. In Excel, basic normalization techniques can still be applied, helping to make the data cleaner and easier to manage.
Basic normal forms:
1. 1NF: separating atomic values into different fields
2. 2NF: separating entities into different tabs (splitting data into tables) and using PKs and FKs
3. 3NF: avoiding transitive dependencies
Relational table design: design tables around entities and their relationships to minimize redundancy and ensure each piece of information appears only once.

EXCEL FUNCTIONS AS SQL EQUIVALENTS: Excel includes lookup functions like VLOOKUP, INDEX, MATCH and, more recently, XLOOKUP, which can simulate basic SQL JOIN operations by pulling data from other tables.
VLOOKUP (Vertical Lookup): retrieves data from a specified range based on a matching key. It has limitations → approximate matches need sorted data, and it can only look to the right of the key column.
INDEX and MATCH combination: INDEX and MATCH together offer more flexibility and, unlike VLOOKUP, can also retrieve data from columns to the left of the lookup column.
Limitations: unlike SQL joins, these functions don't adapt automatically when the data structure changes; they require careful handling and manual updates.
PIVOT TABLE: an advanced tool for calculating, summarizing and analyzing data that lets you see comparisons, patterns and trends in a simple way.

NOSQL DATABASES

INTRO TO NOSQL DATABASES: NoSQL (non-relational) databases are designed to store, retrieve and manage non-tabular data.
- Schema-less or flexible schema
- Horizontally scalable
- Handle a variety of data types: structured, semi-structured (JSON) and unstructured (text, images, logs)

WHEN ARE THEY USED:
- When scalability is necessary: they easily handle large volumes of data via horizontal scaling
- When flexibility is needed: dynamic schema design allows quick iterations and changes
- When performance is important: optimized for specific data models and access patterns
- In Big Data and real-time applications: well suited to handling big data and real-time analytics

TYPES OF NON-RELATIONAL DATABASES:

DOCUMENT STORES: used for content management systems (storing all the documents of a website) → MongoDB. Documents within the same collection can have different fields or structures; the database doesn't enforce a strict schema, giving developers the flexibility to store varied data formats within a single collection. Useful, for example, when storing products whose attributes depend on the category.
Use case: a news website uses MongoDB to store articles. Each article is a document containing fields like title, body, author, tags and publication date. MongoDB's flexible schema allows new fields (e.g. multimedia content) to be added easily without altering existing documents. The site can handle millions of readers by distributing the load across multiple servers.
Example (two documents with different fields in the same collection):
{ "carID": 101, "brand": "Toyota", "model": "Corolla", "year": 2020 }
{ "carID": 102, "color": "Blue", "model": "Ibiza", "year": 2019 }

KEY-VALUE STORES: used for caching mechanisms, session management and real-time bidding → Redis, DynamoDB.
Use case: an online shopping platform uses Redis to manage user sessions. Each user's session is stored with a unique session ID as the key, and the session data (e.g. cart contents, authentication tokens) as the value.
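The flexible-schema idea above (documents with different fields in one collection, and values stored under session keys) can be mimicked with plain Python structures. A real document store like MongoDB or a key-value store like Redis behaves analogously; this is only a stdlib sketch:

```python
import json

# A "collection" of documents: same collection, different fields, no fixed schema
collection = [
    {"carID": 101, "brand": "Toyota", "model": "Corolla", "year": 2020},
    {"carID": 102, "color": "Blue", "model": "Ibiza", "year": 2019},
]

# Queries must tolerate missing fields: .get() returns None where a field is absent
brands = [doc.get("brand") for doc in collection]
print(brands)  # ['Toyota', None]

# A key-value store is conceptually a mapping from key (session ID) to value
sessions = {}
sessions["sess-42"] = json.dumps({"cart": ["SKU-1"], "token": "abc"})  # SET
cart = json.loads(sessions["sess-42"])["cart"]                         # GET
print(cart)  # ['SKU-1']
```

In a relational table, the second car document would either need a NULL "brand" column or a schema change; in a document store, each document simply carries the fields it has.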
Redis allows quick access and updates to session information, ensuring a seamless shopping experience even when the site experiences high traffic during events like Black Friday sales.

GRAPH DATABASES (e.g. Neo4j): graph database management systems, designed to efficiently store, manage and query highly interconnected data.
*Graph data model: a data model that represents data as nodes (entities) connected by relationships (edges). This makes it particularly suitable for scenarios where the relationships between data points are important. It can be used in many domains (social networks, recommendation systems, network and IT operations, fraud detection); its flexibility comes from its ability to model complex relationships efficiently.
Use case: a social media platform uses Neo4j to manage and analyze its social graph. Users are nodes, and relationships like friendships, messages and likes are edges. Neo4j allows the platform to efficiently query and analyze these connections to recommend new friends, detect communities and identify influencers. This setup also helps detect fraudulent behavior by analyzing unusual patterns in connections and interactions.

NEO4J PRODUCTS:
Neo4j Database: the core product, available in both Community and Enterprise editions. This graph database management system lets users model, store and query highly connected data efficiently.
Neo4j Aura: a fully managed cloud service for deploying and running Neo4j databases. It gives users the benefits of Neo4j's graph database technology without the need to manage infrastructure, offering scalability, security and reliability in the cloud.
Neo4j Desktop: a desktop application that provides a development environment for Neo4j. It allows developers to create and manage local Neo4j databases, develop applications using Neo4j and explore graph data visually.
Neo4j Bloom: a graph visualization and exploration tool designed to help users uncover insights from their graph data.
It provides an intuitive interface for exploring and querying graph databases using natural-language search and interactive visualizations.
Neo4j Graph Data Science Library: a collection of algorithms and tools specifically designed for analyzing and extracting insights from graph data. It includes algorithms for centrality analysis, community detection and similarity analysis, enabling advanced graph analytics.
Neo4j AuraDB for Google Cloud: a fully managed graph database service hosted on Google Cloud Platform (GCP).

POWERBI AS A DATABASE TOOL

POWERBI AS A DATABASE: Power BI has 2 different solutions:

POWERBI DESKTOP:
- Installed on the computer. *It is free for personal use
- For creating reports and dashboards
- It embeds another solution, Power Query, used to clean and transform data from multiple sources (Excel, SQL)
- It lets you create relationships between tables (similar to relational databases)
- It lets you visualize data with interactive dashboards
- The audience is developers and designers

POWERBI SERVICE:
- Cloud-based platform for sharing and collaborating on Power BI reports
- Offers real-time data updates and sharing with team members
- It offers fewer features than Power BI Desktop, as its audience is business users who visualize or collaborate on reports created by developers in Power BI Desktop

COMPARISON WITH TRADITIONAL SQL DATABASES:
- Power BI focuses on analysis and visualization rather than data storage
- Use it to explore and connect data from databases, not as a primary storage tool

USING AI IN POWERAPPS TO CREATE DATABASES AND APPS:
- Power Apps is a low-code platform for creating custom applications connected to databases (like SQL or Excel)
- Allows drag-and-drop interface creation
- AI tools (Copilot) can assist in quickly generating tables, forms and business logic
- AI enhances productivity and accessibility for non-technical users

STEPS TO BUILD AN APP IN 10 MINUTES: 1.
Generate data: use AI to define the structure of a database (customers, rentals)
2. Create an interface: auto-generate forms and screens based on the database
3. Integrate logic: add AI-driven functionality (automated responses or workflows)

RELATIONAL DATABASES

A type of database that stores and provides access to data points that are related to one another. Data is organized in tables, which consist of rows and columns. Each table represents a specific entity and contains records (rows) composed of fields (columns).

Table: the fundamental structure for storing data in a relational database. Each table has a unique name and contains rows (records) and columns (fields). Columns define the attributes, and rows contain the data instances.

Relationships: essential in a relational database for linking tables, enabling complex queries and ensuring data integrity. Relationships are established through keys:
- PRIMARY KEY: a column (or set of columns) in a table that uniquely identifies each row in that table.
- FOREIGN KEY: a column (or set of columns) in one table that identifies a row of another table, establishing a link between the data in the two tables.

WHY DO RELATIONSHIPS MATTER?
Data integrity: ensures consistency and accuracy of the data. Foreign key constraints prevent invalid data from being inserted. EX: preventing an order from referencing a non-existent customer.
Efficient data retrieval: allows complex queries to retrieve related data across multiple tables. EX: joining the Customers and Orders tables to find out which customer placed which order.
Reduced data redundancy: eliminates the need to duplicate data by storing related data in separate tables. EX: storing customer details once in the Customers table and referencing them from the Orders table.

ENTITY-RELATIONSHIP DIAGRAMS – CONCEPTUAL DESIGN IN RELATIONAL DB:
- Entity identification: ER diagrams help identify entities, the attributes of these entities and the relationships between entities.
- Relationship definition: ER diagrams define the relationships between entities, which is essential for understanding how different entities interact. Cardinality refers to the number of instances of one entity that can be associated with instances of another entity through a relationship:
  o ONE TO ONE: one person has one national ID (DNI)
  o ONE TO MANY: one school has many students
  o MANY TO ONE: many employees work in one store
  o MANY TO MANY: many employees can enroll in many trainings, and many trainings can have many employees enrolled → requires an intermediate (bridge) table
- Normalization: reduces data redundancy and improves data integrity by organizing the fields and tables of a database.
- Implementation: ER diagrams guide the creation of tables, specifying primary keys, foreign keys and other constraints based on the relationships defined in the diagram.

STEPS FOR ER MODEL CREATION:
1. Identify entities
2. Define relationships
3. Determine attributes
4. Choose keys
5. Draw the diagram, using standard ER diagram symbols:
✓ ENTITY: represented by a rectangle
✓ RELATIONSHIP: represented by a diamond
✓ ATTRIBUTE: represented by an oval
✓ PRIMARY KEY: represented by an underlined attribute
✓ FOREIGN KEY: represented by a dashed underline or noted in the entity's attributes

NORMALIZATION: organizing and structuring the data efficiently to avoid redundancy and to ensure accuracy and data integrity.
- Eliminate data redundancy
- Use primary and foreign keys
- Reduce transitive dependencies
1ST NORMAL FORM: each column contains only atomic values (a single value per field, no lists inside a cell)
2ND NORMAL FORM: eliminate partial dependencies: all non-key attributes must depend on the complete primary key. Ask: what is the primary key? Do all non-key attributes depend on the whole of it?
3RD NORMAL FORM: eliminate non-key attributes that depend on another non-key attribute.

DATABASE DESIGN: the process of defining the structure, storage and retrieval mechanisms of data in a database system.
It involves creating a detailed blueprint of how data will be stored, accessed and managed in a database.
- Schema definition: specifies the tables, fields, data types and relationships in the database
- Normalization: ensures the database structure minimizes redundancy and optimizes data integrity
- Physical implementation: determines how the logical schema will be physically stored and accessed in the database management system
- Performance optimization: includes indexing, partitioning and query optimization to enhance database performance
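The normalization and schema-definition steps above can be illustrated end to end with SQLite: a denormalized list repeats customer details on every order, while splitting it into two related tables stores each fact once (all names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized: the customer's name and city are repeated on every order row
flat = [
    (1, "Ana", "Madrid", "Order A"),
    (2, "Ana", "Madrid", "Order B"),    # 'Ana, Madrid' stored twice (redundancy)
    (3, "Luis", "Sevilla", "Order C"),
]

# Normalized: customers stored once; orders reference them through a foreign key
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Ana", "Madrid"), (2, "Luis", "Sevilla")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, "Order A"), (2, 1, "Order B"), (3, 2, "Order C")])

# A JOIN reconstructs the original flat view without duplicating stored data
rows = conn.execute("""
    SELECT o.order_id, c.name, c.city, o.item
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows == flat)  # True
```

If Ana moves to a new city, the normalized design updates one row in customers; the flat design would have to update every one of her orders, risking inconsistency.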
