Database Notes PDF
Document Details
Uploaded by TriumphantTsavorite
Tags
Summary
These notes provide an introduction to databases. It covers the evolution of databases, from relational databases to NoSQL databases, discussing key concepts like tables, relationships, and primary/foreign keys. It also touches upon database design and the importance of databases in economic and financial analysis.
Full Transcript
INTRODUCTION TO DATABASES First concepts Databases are essential since they are under every aspect of today's digital world. We can see it in e-commerce platforms, social media networks, healthcare systems... **EVOLUTION:** The evolution goes from 1970 with the first RDBMS. Then in 1980 they crea...
INTRODUCTION TO DATABASES First concepts Databases are essential since they are under every aspect of today's digital world. We can see it in e-commerce platforms, social media networks, healthcare systems... **EVOLUTION:** The evolution goes from 1970 with the first RDBMS. Then in 1980 they created SQL becoming the standard language, introducing the early Microsoft SQL server. In 1990 NoSQL started to manage not structured data. We see DATA MINING, which is analyzing data. Lastly, in the 2000s open-source databases were created for products for large volumes. We see DataLake concepts and databases in the cloud (Big Data/Cloud). **WHAT IS A DATABASE?** Is a collection of objects that allows you to store and organize data that can easily be accessed, managed and updated. - It consists of tables, each containing rows and columns where data is stored and related to other tables through relationships. Databases can store information about people, products, orders, or other things. Many databases start as a list in a spreadsheet or word processing program. As the list grows, data redundancies and inconsistencies start to appear. A **Database Management System (DBMS)** is software designed to manage databases, providing functionalities to define, create, query, update, and administer databases. It acts as an interface between users and the database, ensuring efficient storage, retrieval, and manipulation of data while maintaining data integrity and security. - For example: Oracle Database, Microsoft SQL Server, MySQL, etc. Typical words: SELECT, UPDATE, INSERT, DELETE. **DBM Tools and Software**: Applications that help users create, manage, and manipulate databases. They provide an interface to interact with the data and perform various operations like querying, updating, and reporting. *It features*: · **Storage** - large volumes in structured format\ · **Retrieval** -- querying capabilities using languages like SQL\ · **Manipulation** -- insert, update and delete\ · **Integrity** -- accuracy and consistency through constraints, validation rules\ and transactions\ · **Security** -- protecting data with user authentication\ · **Backup and recovery** - offering tools for data backup and recovery\ · **Performance optimization** -- including features like indexing, caching,\ query optimization for better performance. - Cashing means that some information saves in your computer. - Performance means how well it works. Architectural levels DATABASE DESIGN is done in **THREE ARCHITECTURAL WAYS/LEVELS**. It's necessary to do it in order. ![](media/image2.png) 1. **Conceptual design** Highest level of abstraction, defining the main entities and relationships in a way that is technology-agnostic. ![](media/image4.png) 2. **Logical Design** Detailed but still abstract, specifying tables, columns, and relationships, ready to be mapped to a specific DBMS. 3. **Physical Design** Describes how data is really stored. It's the lower level. NN: it's the relation FK: Foreign Key - ![](media/image6.png)Importante examen. **Why is it important in the Economic and Financial Analysis?** 1\. Provide a structured way to store data related to markets, transactions, financial records.\ 2. Organizing the data into easily accessible formats for data analysis, reporting, prediction, etc.\ 3. Facilitating informed decision making, identifying trends and performing accurate forecasting.\ 4. Maintaining accuracy and confidentiality of sensitive financial information. Relational Database ***Definition:*** Is a type of database that stores and provides access to data points that are related to one another. Data is organized into tables, which consist of rows and columns. Each table represents a specific entity (e.g., customers, orders, products) and contains records (rows) that are composed of fields (columns).. [Key Concepts:] ***Table:*** The fundamental structure to store data in a relational database.\ Each table has a **unique name** and contains rows (records) and columns (fields). Columns define the **attributes** of the entity, and rows contain the data instances. ***Relationships:*** Relationships in a relational database are essential for linking tables to enable complex queries and to ensure data integrity. Relationships are established through keys: **primary keys** and **foreign keys**. - ***Primary Key:*** A primary key is a column (or a set of columns) in a table that uniquely identifies each row in that table. Example: In a Customers table, CustomerID can be a primary key because each customer has a unique ID. - ***Foreign Key:*** A foreign key is a column (or a set of columns) in one table that uniquely identifies a row of another table. It establishes a link between the data in the two tables. Example: In an Orders table, CustomerID can be a foreign key that references CustomerID in the Customers table. [Why relationships matter?] **Data Integrity:** Ensures consistency and accuracy of the data. Foreign key constraints prevent invalid data from being inserted (e.g., preventing an order from referencing a non-existent customer). **Efficient Data Retrieval:** Allows complex queries to retrieve related data across multiple tables. Example: Joining Customers and Orders tables to find out which customer placed which order. **Reduced Data Redundancy:** Eliminates the need to duplicate data by storing related data in separate tables. Example: Storing customer details once in the Customers table and referencing it in the Orders table. Entity Relationship Diagrams -- Conceptual in Relational DB ER (Entity Relationship) diagrams are a crucial tool in database design. They are used to: - Entity identification: ER diagrams assist in identifying entities, attributes of these entities, and the relationships between entities. - Relationship Definition: They define the relationships between entities, such as one-to- one, one-to-many, or many-to-many relationships. This is essential for understanding how different entities interact with each other. - Normalization: Normalization reduces data redundancy and improves data integrity by organizing fields and table of a database. - Implementation: ER diagrams guide the creation of tables, specifying primary keys, foreign keys, and other constraints based on the relationships defined in the ER diagram. [Steps for ER Model Creation ] 1. Identify Entities 2. Define Relationships 3. Determine Attributes 4. Choose Keys 5. Draw the Diagram: Use standard ER diagram symbols to represent the entities, relationships, and attributes. Here's a summary of [common symbols]: **Entity**: Represented by a rectangle. **Relationship**: Represented by a diamond. **Attribute**: Represented by an oval. **Primary Key**: Represented by an underlined attribute. **Foreign Key:** Sometimes represented by a dashed underline or simply noted in the entity\'s attributes. [Type of relationships ] It refers to the number of instances of one entity that can be associated with instances of another entity through a relationship. Here are the main types of cardinality: - One to One -- One person has one DNI - One to Many -- One school has many students - Many to One- Many Employees work in one Store - Many to Many -- Many Employees can enroll in many trainings and many trainings can have many employees enrolled. Intermediate table is typically used for this. Normalization Organize and structure the data in an efficient way to **avoid redundancies** and to ensure accuracy and data integrity. - Eliminate Data redundancy - Use primary and foreign keys - Reduce transitive dependencies There are 3 normal forms: - 1st Normal Form: Each column contains only atomic values. - 2nd Normal Form: Eliminate partial dependencies. All non-key attributes are fully dependent on the primary key. - 3rd Normal Form: Eliminate non-key attributes depending on another non-key attribute. Database Design ***Definition:*** Is the process of defining the structure, storage, and retrieval mechanisms of data in a database system. It involves creating a detailed blueprint for how data will be stored, accessed, and managed in a database. ***Key points:*** *Schema Definition:* Specifies the tables, fields, data types, and relationships in the database.\ *Normalization:* Ensures the database structure minimizes redundancy and optimizes data integrity. *Physical Implementation:* Determines how the logical schema will be physically stored and accessed in the database management system. *Performance Optimization:* Includes indexing, partitioning, and query optimization to enhance database performance.. Sql and its importance ***Definition:*** SQL stands for Structured Query Language and is used for **managing** and **manipulating** relational databases. SQL provide commands for **creating, reading, updating and deleting** data in a database. Most relational database management systems support SQL making it **universal** for database interaction. It is essential for **data analysis** allowing users to perform queries to extract meaningful insights from datasets. Is designed to handle **large volumes of data** in an efficient manner. It **is easy to learn** and use even with no deep technical knowledge. [Real world applications ] ***Business Intelligence:*** Companies use SQL to generate reports and analyze business performance. ***Finance:*** Banks and financial institutions use SQL for transaction processing and risk management. ***Healthcare:*** Medical records and patient data management. ***E-commerce:*** Inventory management, customer data, and sales tracking. [Main structures ] ***Creating tables:*** Table consist of columns and rows, where columns define the type of data, and rows contain the actual data. ***Selecting records:*** Retrieving data from a table ***Inserting records:*** Insert data into a table ***Updating records:*** Updating data into a table ***Deleting records:*** Delete data from a table ***Altering a table:*** Add columns to a table ***Droping Table:*** Delete a table entirely Understanding JOINS in SQL ***Definition:*** JOINS are used in SQL to **combine rows** from two or more tables based on a related column between them. Let\'s start with some foundational concepts and then look at different types of joins with examples. **Inner Join**\ An Inner Join returns only the rows that have matching values in both tables. **Left Outer Join** A Left Join returns all rows from the left table (Customers), and the matched rows from the right table (Orders). If no match is found, NULL values are returned for columns from the right table. **Right Outer Join** A Right Join returns all rows from the right table (Orders), and the matched rows from the left table (Customers). If no match is found, NULL values are returned for columns from the left table. [Using simple queries for data analysis ] ***Definition:*** These queries cover basic tasks such as retrieving data, performing aggregations, filtering results, and using simple joins. Practicing these examples will help you become comfortable with SQL syntax and lay a solid foundation for more advanced querying and analysis tasks in the future. [Using aggregations for deeper analysis ] ***Definition:*** Are used primarily for summarizing data, deriving insights, and performing calculations on datasets. [Using operators ] ***Definition:*** Operators in SQL are essential tools for manipulating and retrieving data. They enable to perform various operations on the data, such as arithmetic calculations, comparisons, logical evaluations, and combining multiple conditions in a query. Different types: Arithmetic, Comparison (also Like) & Logical. - ***Arithmetic (+, -, \*,/)*** - ***Comparison (\,=, \, !=, LIKE)*** - ***Logical Operators (AND,OR,NOT)*** [Example using LIKE ] LIKE is an operator in SQL used for **pattern matching** in strings. It is often used in WHERE clauses to search for a specified pattern in a column. The LIKE operator is particularly useful when you need to find records that match a certain pattern rather than an exact match. [Using string functions ] ***Definition:*** String functions in SQL are essential tools for manipulating and transforming text data. They allow you to perform various operations on strings, such as concatenation, extraction, formatting, and searching for specific patterns. [Common string functions ] 1. ***Concat:*** Select concat (CustomerName,\' \', CustomerPhone) as FullContact from Customers 2. ***Substring:*** SELECT SUBSTRING (customerphone , 1, 3) AS area\_code FROM customers; 3. ***Length:*** SELECT LENGTH (carmodel) AS model\_length FROM cars; 4. ***Upper and Lower:*** SELECT UPPER (carmodel) AS model\_length FROM cars; 5. ***Trim, Ltrim, Rtrim:*** SELECT TRIM (customer\_name) AS name FROM customer; 6. ***Replace:*** SELECT replace (customer\_name,"S","s") AS name FROM customer; 7. ***Left and Right:*** Select right (customer\_name,2) from customer [Using subqueries ] ***Definition:*** Are queries within another SQL query. They are enclosed in parentheses and can be used in various clauses such as SELECT, FROM, WHERE, and HAVING. Subqueries are powerful tools for breaking down complex queries into simpler parts, making it easier to understand and maintain SQL code. Economic and Financial Data Sources There are multiple and relevant Data sources in the Economic and Financial Space that you can use and incorporate into your analysis and databases. [Type of data analysis ] All these data sources allow in general to do: - **Descriptive analysis**, to summarize and describe a dataset.\ Example: Unemployment this month in Spain by Age range. - **Trend analysis during time**, to analyze how the data changed with time - **Comparative analysis** between regions, groups of people, or between different variables, etc. [INE: ] It offers a wide range of statistical data on various economic, demographic and social aspects of the country The INE **regularly** updates its data and provides access through its website, where you can download reports, databases and use interactive tools to analyze the information. - **Demography and Population:** Census, births, deaths, etc - **Economy:** PIB (GDP: Gross Domestic Product), IPC (Consumer Price Index), Surveys - **Labor Market:** Survey data to Active Population - **Companies and Establishments:** Central Directory of Companies (DIRCE). Industrial Survey of Companies. Survey on Innovation in Companies. Statistics on Commercial Companies. - **Society:** Education Statistics. Health Survey. Living Conditions Survey. Labour and Geographical Mobility Statistics. And more\.... [Ministry of Economy, Trade and Enterprise: ] provides a wide range of financial data and statistics - **Macroeconomic data** on the evolution of the Spanish economy, on economic growth, etc. - **Public Finances:** Data on budget execution, statistics on deficit and public debt. - **Labor Market:** Labor market statistics, employment, unemployment, job offerings.. - **Financial System:** Information on the situation and evolution of the Spanish financial system. Banks and other financial institutions. - **Foreign trade:** Data on exports, imports, and trade balance. [Eurostat ] High quality statistics and data on Europe [World Bank ] Free and open access to global development data [International Monetary Fund ] Access to macroeconomic and financial data Excel [Excel as a Flat-File Database ] **Overview**: Excel functions as a **flat-file database.** It stores data in a **single table or sheet** without complex relational structures like SQL databases. Flat-file databases are useful for managing **small datasets** where complex relationships and constraints aren't required. It is a useful tool for **smaller applications or quick analyses**. **Key points:** **Data Storage in Excel**: Each worksheet in Excel can represent a \"table\" with rows as records and columns as fields. **Advantages and Limitations**: While Excel can manage tables and perform simple lookups, it lacks the integrity constraints (like enforcing referential integrity) and scalability of an RDBMS. **Appropriate Use Cases**: Just used for rapid prototyping, data exploration, and small to medium datasets. [Creating and Using Relationships in Excel ] **Overview:** Excel's Data Model allows for the creation of relationships between tables, enabling users to query and analyze data from multiple tables within the same workbook. **Key points:** - **Data Model Basics**: Is a built-in feature allowing multiple tables to exist in a single workbook with defined relationships, making it possible to analyze related data. - **Setting Up Relationships**: Allow to establish relationships between tables using fields that serve as foreign keys. - **Relationship Types**: Excel only supports one-to-one and one-to-many relationships. Many-to-many relationships require more complex workarounds, like bridge tables. - **Limitations of Relationships in Excel**: Note that Excel does not enforce referential integrity in relationships, making manual attention to data consistency necessary. [Data Integrity in Excel ] **Overview:** Data integrity is critical in any database to ensure that the data is accurate, consistent, and reliable. Although Excel doesn't enforce integrity constraints, best practices and certain tools within Excel can help maintain data quality. **[Key points: ]** - **Unique IDs and Data Consistency:** Be sure to include unique IDs for each record to avoid duplicates and the use of consistent data types within each column to prevent errors. - **Validation Rules:** Use Excel's Data Validation feature to restrict data input (e.g., enforcing date formats or restricting input to specific lists). - **Error Checking Tools:** Use Conditional Formatting to highlight potential errors (e.g., duplicates or blank cells) and Text to Columns to standardize data formats (e.g., dates or currency). [Normalization with Excel Limits ] **Overview:** Database normalization minimizes redundancy and improves data integrity by organizing data into separate tables. In Excel, basic normalization techniques can still be implemented, helping to make the data cleaner and easier to manage. **Key points:** **Basic Normal Forms:** Remember how the three normal forms (1NF, 2NF, 3NF) will apply to Excel. - 1NF: Separating atomic values in different fields. - 2NF: Separating entities in different tabs (Splitting Data into Tables) and using PK and FK. - 3NF:Avoidingtransitivedependencies. **Relational Table Design:** Designing tables around entities and their relationships to minimize redundancy and ensure each piece of information only appears once. [Excel functions as SQL Equivalents ] **Overview:** Excel includes lookup functions like VLOOKUP, INDEX, MATCH, and more recently, which can simulate basic JOIN operations in SQL by pulling in data from other tables. **Key points:** - **VLOOKUP:** VLOOKUP (Vertical Lookup) can retrieve data from a specified range based on a matching key. It has **limitations,** like needing sorted data or being limited to exact matches**.** - **INDEX and MATCH Combination:** INDEX and MATCH together offer more flexibility and can retrieve data from left-to-right, unlike VLOOKUP. **Limitations**: Unlike SQL joins, these functions don't dynamically update as new data is added or existing data is changed; they require careful handling and manual updates if the data structure changes. PowerBI as a database PowerBI has two different solutions: - PowerBI Desktop: - Installed in the computer. It is license free for your personal use. - For creating reports and dashboards. - It also has another solution embedded which is PowerQuery used to clean and transform data from multiple sources (Excel, SQL, etc.). - It allows to create relationships between tables (similar to relational databases). - It allows to visualize data with interactive dashboards. - The audience is: Developers and Designers - PowerBI Service: - Cloud-based platform for sharing and collaborating on Power BI reports. - Offers real-time data updates and sharing with team members. - It offers less functionalities than PowerBI Desktop as the audience is for business users to visualize or collaborate in reports created by developers in powerBI desktop. [Comparison with traditional sql databases] - Power BI focuses on analysis and visualization rather than data storage. - Use it to explore and connect data from databases, not as a primary storage tool. [Using AI in PowerApps to Create Databases and Apps] - Power Apps is a low-code platform to create custom applications connected to databases (like Excel, or SQL). - Allows for drag-and-drop interface creation. - AI tools (like Copilot) can assist in quickly generating tables, forms, and business logic. - AI enhances productivity and accessibility for non-technical users. [Steps to build an app in 10 minutes (as we show in the demo):] - Generate Data: Use AI to define the structure of a database (e.g., customers, rentals). - Create an Interface: Auto-generate forms and screens based on the database. - Integrate Logic: Add AI-driven functionality, such as automated responses or workflows. Intro to NoSQL Databases Definition: - Non-relational databases, also known as NoSQL (Not Only SQL) databases, are designed to store, retrieve, and manage non-tabular data. Key Characteristics: - Schema-less or flexible schema - Horizontally scalable - Handle a variety of data types: structured, semi-structured (JSON), and unstructured (text, images, logs, etc) [When are they used?] When **Scalability** is necessary: Easily handle large volumes of data with horizontal scaling When **Flexibility** is needed: Dynamic schema design allows for quick iterations and changes When **Performance** is important: Optimized for specific data models and access patterns In Big Data and Real-Time Applications: Suited for handling big data and real-time analytics [Types of Non-Relational Databases] **Document Stores:** - Examples: MongoDB - Usage: Content management systems for example to store all the documents in a website. Example of storage - Documents within the same collection can have **different fields or structures**. The database does not enforce a strict schema, giving developers the flexibility to store varied data formats within a single collection. - Useful when you are storing products that can have different attributes depending on the category **Key-Value Stores:** - Examples: Redis, DynamoDB - Usage: Caching mechanisms, session management, real-time bidding **Graph Databases:** Graph database management system is designed to efficiently store, manage, and query highly interconnected data. What is a graph data model? It is a data model that represents data as nodes (entities) connected by relationships (edges). This makes it particularly suitable for **[scenarios where relationships between data points are important.]** It can be used in various domains such as social networks, recommendation systems, network and IT operations, fraud detection, and more. Its flexibility comes from its ability to model complex relationships efficiently. - Examples: Neo4j - Usage: Social networks, recommendation engines, fraud detection [Neo4j Products] **Neo4j Database**: The core product is the Neo4j Database, available in both Community and Enterprise editions. This graph database management system allows users to model, store, and query highly connected data efficiently. **Neo4j Aura**: Fully managed **cloud service** for deploying and running Neo4j databases. It provides users with the benefits of Neo4j\'s graph database technology without the need to manage infrastructure, offering scalability, security, and reliability in the cloud. **Neo4j Desktop**: is a desktop application that provides a development environment for Neo4j. It allows developers to create and manage local Neo4j databases, develop applications using Neo4j, and explore graph data visually. **Neo4j Bloom**: Graph visualization and exploration tool designed to help users uncover insights from their graph data. It provides an intuitive interface for exploring and querying graph databases using natural language search and interactive visualizations. **Neo4j Graph Data Science Library**: Collection of algorithms and tools specifically designed for analyzing and extracting insights from graph data. It includes algorithms for centrality analysis, community detection, similarity analysis, and more, enabling advanced graph analytics. **Neo4j AuraDB for Google Cloud**: Fully managed graph database service hosted on Google Cloud Platform (GCP). Big Data **Global Volume of Data:** In 2025 it is estimated that the world will generate 175 zettabytes (ZB) of data. In 2010 it was just 2. - 1 ZB = 1 billion of gigabytes Everyday Internet users generate around 2,500,000 GB daily. 90% of the data were generated in the last 2 years. Data is generated through: Smartphone usage, Social Media activity, Emails, Location Tracking, IoT devices, Online Shopping, Health Data, TV usage, etc. ![](media/image8.png)[The 5 Vs of Big Data] [Sources of data ] Main sources are: - Facebook - Twits ( 500.000 tweets per minute) - Instagram (347222 posts per minute) - IoT (75 mil millions of connected devices generating data) -- sensors [Storage of generated data ] Less than 20% of global data is stored in Relational Databases. Is a small percentage but important to handle Banks databases, hospitals, customers, etc 80% of the global data is not structured (text, images, video). This data is stored in Big Data Architectures, in the Cloud, and in NoSQL Databases. [Big Data Storage ] Different technologies are needed to store, process and analyze such volume of data that cannot be manage with traditional databases. 1\. Storage in HDFS (Hadoop Distributed File System): This type of set up is prepared to **handle large volumes of data across multiple servers**.\ It divides data into small blocks (typically 128 MB or 256 MB) and distributes them across different nodes (servers). It **provides high redundancy** (copies of data) to ensure that data is not lost if a node fails. Ideal for storing large amounts of **unstructured or semi- structured data**. 2\. DataLakes: It is a centralized repository that stores flat files of all types of data (structured, semi- structured and unstructured). It is stored as raw data, as the data is generated, with no transformation. It is used when you need to store large volumes of diverse and raw data for long --term analysis or if you don't know what type of analysis you will perform later. 3\. NoSQL - Esto es importante