INTRO Data Processing PDF
Uploaded by FlexibleAltoSaxophone
Summary
This document provides a basic introduction to data processing. It explains different types of data (structured, unstructured, and semi-structured) and their characteristics. It also briefly covers the processes of data collection, storage, and processing.
Full Transcript
Introduction

Understanding Data Processing

“Data is not information, Information is not knowledge, Knowledge is not understanding, Understanding is not wisdom.”

In general, data is a collection of characters, numbers, and other symbols that represent values of some situations or variables. The following list contains some examples of data that we often come across:

Name, age, gender, contact details, etc., of a person
Transaction data generated through banking, ticketing, shopping, etc., whether online or offline
Images, graphics, animations, audio, video
Documents and web pages
Online posts, comments and messages
Signals generated by sensors
Satellite data, including meteorological data, communication data, earth observation data, etc.

As data come from different sources, they can be in different formats. For example, an image is a collection of pixels; a video is made up of frames; and messages/chats are made up of text, icons (emoticons) and images/videos. The three broad categories into which data can be classified on the basis of their format are:

(A) Structured Data
(B) Unstructured Data
(C) Semi-Structured Data

Structured Data

Data which is organized and can be recorded in a well-defined format is called structured data. Structured data is usually stored in a computer in a tabular format (rows and columns), where each column holds the values of one parameter, called an attribute, characteristic or variable, and each row holds the values of one observation for those attributes.

Unstructured Data

Unstructured data refers to data that does not have a predefined format or organization. It lacks a specific data model and does not fit into traditional tabular structures. Unstructured data is typically in the form of text, images, audio, video, social media posts, emails, documents, web pages, and other content that is not easily organized into rows and columns.
Analyzing unstructured data requires advanced techniques such as natural language processing, computer vision, and machine learning algorithms. Examples of unstructured data include text documents, business reports, books, audio/video files, and social media messages.

Semi-Structured Data

Semi-structured data lies between structured and unstructured data. It has some organizational structure but does not fit neatly into the traditional tabular format. Examples of semi-structured data include XML files, JSON data, and log files.

Metadata

Metadata refers to the additional information or descriptors that provide context and describe the characteristics of the main data. It is essentially data about data, and it helps in understanding the main data. In the case of unstructured data, metadata plays a crucial role in organizing and categorizing the data. It helps to identify and describe different components or attributes of the unstructured data, making it easier to search, retrieve, and analyze. By extracting and using metadata, unstructured data can be indexed, classified, and structured in a way that enables efficient storage, retrieval, and analysis.

Data Collection

To process data, we first need to collect or gather it. We can then store the data in a file or database for later use. Data collection here means identifying already available data or collecting it from the appropriate sources. Data are continuously being generated at different sources. Our interactions with digital media continuously generate huge volumes of data. Hospitals collect data about patients to improve their services. Shopping malls collect data about the items being purchased by people.

Data Storage

Once we gather data and process them to get results, we usually do not simply discard the data. Rather, we would like to store them for future use as well.
Data storage is the process of storing data on storage devices so that the data can be retrieved later. We store data like images, documents, audio/video, etc. as files on our computers. Likewise, school/hospital data are stored in data files. We use computers to add, modify or delete data in these files, or process these data files to get results. However, file processing has certain limitations, which can be overcome through a Database Management System (DBMS).

Data Processing

By looking at a vast amount of data, one cannot arrive at a conclusion. Rather, data need to be processed to get results, and after analyzing those results we draw conclusions or make decisions. Finding information in a huge volume of papers, or deleting/modifying an entry, is a difficult task in a pen-and-paper approach.

File Processing System

A collection of application programs that perform services for the end-users, such as the production of reports. Each program defines and manages its own data.

[Figure: Program and Data Interdependence. The Library, Examination and Registration applications each use their own data files.]

[Figure: File Processing Systems. Fields kept in each application's file:]
Library: Reg_Number, Name, Father Name, Books Issued, Fine
Examination: Reg_Number, Name, Address, Class, Semester, Grade
Registration: Reg_Number, Name, Father Name, Phone, Address, Class

File-Based Processing

Disadvantages of File Processing

1. Program-Data Dependence: The file structure is defined within the program code, leading to a strong coupling between programs and data. Any change in the structure requires modifications to the program, making the system inflexible.
2. Metadata Maintenance: All programs need to maintain metadata for each file they use, which increases complexity and overhead.
3. Duplication of Data (Data Redundancy): Different systems or programs may have separate copies of the same data, resulting in data redundancy.
This redundancy wastes storage space and can lead to inconsistencies when data is updated in one copy but not in others.
4. Limited Data Sharing: There is no centralized control of data, making it difficult for different programs or systems to access and share data. Programs written in different languages may face compatibility issues when accessing each other's files.
5. Lengthy Development Times: Programmers must design and implement their own file formats for each program, which increases development time and duplicates work.
6. Excessive Program Maintenance: File processing requires significant program maintenance effort, consuming a large portion of the information system's budget (typically around 80%). Modifying file structures or formats can be time-consuming and error-prone.
7. Vulnerable to Inconsistency: Changes in one table or file may require corresponding changes in other related tables or files to maintain data consistency. Failure to make these changes can result in inconsistent or inaccurate data.

SOLUTION: The DATABASE Approach

Central Repository of Shared Data: Instead of having separate copies of data in different systems or programs, the database approach employs a central repository where data is stored. This promotes data consistency and reduces data redundancy.
Managed by a Controlling Agent: A database management system (DBMS) serves as the controlling agent for the data. It provides functionalities for data storage, retrieval, and management, ensuring data integrity and security.
Standardized and Convenient Form: Data in a database is organized using a standardized and structured format, such as tables in a relational database. This allows for efficient storage, retrieval, and manipulation of data.
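The redundancy and inconsistency problems listed above, and the central-repository remedy, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries stand in for data files and for a shared database, and the student record and the helper `address_for` are hypothetical.

```python
# Hypothetical sample data; the record below is made up for illustration.

# File-processing style: each application keeps its own copy of the data.
examination_file = {101: {"Name": "Asha", "Address": "12 Park Road"}}
registration_file = {101: {"Name": "Asha", "Address": "12 Park Road"}}

# The student moves, but only the registration application is updated.
registration_file[101]["Address"] = "7 Lake View"

# The two copies now disagree: this is data inconsistency caused by redundancy.
inconsistent = (
    examination_file[101]["Address"] != registration_file[101]["Address"]
)

# Database style: one shared record that every application reads.
students = {101: {"Name": "Asha", "Address": "12 Park Road"}}


def address_for(reg_number):
    """Every application looks up the single shared copy."""
    return students[reg_number]["Address"]


# One update is enough, because there is only one copy to change.
students[101]["Address"] = "7 Lake View"
examination_view = address_for(101)
registration_view = address_for(101)
consistent = examination_view == registration_view
```

With separate copies, one missed update leaves `inconsistent` true. With the shared record, both applications see the new address after a single change.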
This requires a database and a Database Management System (DBMS).

Advantages of Database Approach

[Figure: The Library, Examination and Registration applications access a shared University Students Database through the Database Management System.]

Data Sharing
Data Independence
Controlled Redundancy
Better Data Integrity

Database Management System

Limitations faced in a file system can be overcome by storing the data in a database, where data are logically related. We can organize related data in a database so that it can be managed in an efficient and easy way. A database management system (DBMS), or database system in short, is software that can be used to create and manage databases. A DBMS lets users or application programs create a database and store, manage, update/modify and retrieve data from that database. Some examples of open source and commercial DBMS include MySQL, Oracle, PostgreSQL, SQL Server, Microsoft Access and MongoDB.

A database system hides certain details about how data are actually stored and maintained. Thus, it provides users with an abstract view of the data. A database system has a set of programs through which users or other programs can access, modify and retrieve the stored data. The DBMS serves as an interface between the database and end users or application programs. Retrieving data from a database through a special type of command is called querying the database. In addition, users can modify the structure of the database itself through a DBMS.

Database Management System: A software system that is used to create, maintain, and provide controlled access to users of a database.

(Database) application program: A computer program that interacts with the database by issuing an appropriate request (SQL statement) to the DBMS.

A DBMS manages data resources like an operating system manages hardware resources.

Databases are widely used in various real-life applications and industries.

1. Banking and Finance: Banks and financial institutions rely heavily on databases to store and manage customer information, account details, transaction records, and financial data. Databases enable secure and efficient processing of transactions, tracking of balances, and generation of reports for regulatory compliance.
2. E-commerce and Retail: Online retailers use databases to manage product catalogs, customer profiles, orders, and inventory. Databases enable efficient tracking of stock levels, personalized recommendations, and streamlined order fulfillment processes.
3. Healthcare: Databases play a critical role in healthcare for storing patient records, medical histories, test results, and treatment plans. They enable healthcare providers to access and update patient information securely, track medication and allergy details, and share data among different healthcare facilities.
4. Transportation: Databases are used in transportation and logistics to manage routes, schedules, tracking information, and inventory.
5. Social Media and Content Management: Social media platforms and content management systems rely on databases to store user profiles, posts, comments, and media files.
6. Education: Educational institutions use databases to store student information, academic records, timetables, and course materials.
7. Government and Public Services: Databases are extensively used by government agencies for citizen registration, tax records, public safety, and administration.
8. Research and Science: Databases support scientific research by storing and organizing large volumes of research data, experimental results, and scientific literature.
Definitions of Database

Def 1: A database is an organized collection of logically related data.
Def 2: A database is a shared collection of logically related data that is stored to meet the requirements of different users of an organization.
Def 3: A database is a self-describing collection of integrated records.
Def 4: A database models a particular real-world system in the computer in the form of data.

[Figure 1-1a: Data in Context. Context helps users understand data.]

Graphical displays turn data into useful information that managers can use for decision making and interpretation.

Metadata are descriptions of the properties or characteristics of the data, including data types, field sizes, allowable values, and data context.

[Figure: The concept of a shared organizational database. Departments shown include Management, Marketing, Product Development, Accounting and Manufacturing, with functions such as Planning, Control, Sales, Accounts Receivable, Accounts Payable, Scheduling and Production, all sharing one Corporate Database.]

Key Concepts in DBMS

(A) Database Schema
A database schema is the design of a database. It is the skeleton of the database that represents the structure (table names and their fields/columns), the type of data each column can hold, constraints on the data to be stored (if any), and the relationships among the tables. The database schema is also called the visual or logical architecture, as it tells us how the data are organised in a database.

(B) Data Constraint
Sometimes we put certain restrictions or limitations on the type of data that can be inserted in one or more columns of a table. This is done by specifying one or more constraints on the column(s) while creating the tables. For example, one can define the constraint that the mobile number column can only hold non-negative integer values of exactly 10 digits. Since each student must have one unique roll number, we can put the NOT NULL and UNIQUE constraints on the RollNumber column.
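The two constraint examples above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: SQLite stands in for whichever DBMS is actually used, the table layout, the `Mobile` column name and the sample rows are hypothetical, and the 10-digit rule is approximated as an integer range (so a number with a leading zero would not pass).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Constraints are declared while creating the table.
conn.execute(
    """
    CREATE TABLE STUDENT (
        RollNumber INTEGER NOT NULL UNIQUE,
        SName      TEXT    NOT NULL,
        Mobile     INTEGER CHECK (Mobile BETWEEN 1000000000 AND 9999999999)
    )
    """
)


def try_insert(row):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO STUDENT VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


ok_first = try_insert((1, "Asha", 9876543210))
ok_duplicate_roll = try_insert((1, "Ravi", 9123456780))    # violates UNIQUE
ok_missing_roll = try_insert((None, "Meena", 9000000001))  # violates NOT NULL
ok_short_mobile = try_insert((2, "Kiran", 12345))          # violates CHECK

stored_rows = conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]
```

Only the first row is stored. The DBMS itself rejects the other three, so no application program has to repeat these checks.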
Constraints are used to ensure accuracy and reliability of data in the database.

(C) Meta-data or Data Dictionary
The database schema, along with the various constraints on the data, is stored by the DBMS in a database catalog or dictionary, called meta-data. Meta-data is data about the data.

(D) Database Instance
When we define the database structure or schema, the state of the database is empty, i.e. it contains no data. After loading data, the state or snapshot of the database at any given time is the database instance. We may then retrieve data through queries or manipulate data through insertion, update or deletion. Thus, the state of the database can change, and a database schema can have many instances at different times.

(E) Query
A query is a request to a database for obtaining information in a desired way. A query can be made to get data from one table or from a combination of tables. For example, “find names of all those students present on Attendance Date 2000-01-02” is a query to the database.

(F) Data Manipulation
Modification of a database consists of three operations, viz. Insertion, Deletion or Update.

(G) Database Engine
The database engine is the underlying component or set of programs used by a DBMS to create databases and handle various queries for data retrieval and manipulation.

Relational Data Model

Different types of DBMS are available, and they are classified based on the underlying data model. A data model describes the structure of the database, including how data are defined and represented, the relationships among data, and the constraints. The most commonly used data model is the Relational Data Model. Other types of data models include the object-oriented data model, entity-relationship data model, document model and hierarchical data model. In the relational model, tables are called relations, and they store data for different columns. Each table can have multiple columns, where each column name should be unique.
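The ideas of schema, instance, query and data manipulation described above can be seen together in a short sketch using Python's built-in `sqlite3` module. It rests on assumptions: SQLite stands in for the DBMS, the column `AttendanceStatus` with the codes 'P' and 'A' is hypothetical, and the rows are made up. The query is the attendance example quoted under (E).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: structure only. Right after this step the database state is empty.
conn.executescript(
    """
    CREATE TABLE STUDENT (
        RollNumber INTEGER PRIMARY KEY,
        SName      TEXT NOT NULL
    );
    CREATE TABLE ATTENDANCE (
        AttendanceDate   TEXT    NOT NULL,
        RollNumber       INTEGER NOT NULL,
        AttendanceStatus TEXT    NOT NULL
    );
    """
)
empty_count = conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]

# Loading data produces one instance (snapshot) of the schema.
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meena")],
)
conn.executemany(
    "INSERT INTO ATTENDANCE VALUES (?, ?, ?)",
    [("2000-01-02", 1, "P"), ("2000-01-02", 2, "A"), ("2000-01-02", 3, "P")],
)

# Query: names of all students present on a given attendance date.
# 'P' = present and 'A' = absent is an assumed coding.
QUERY = """
    SELECT S.SName
    FROM STUDENT AS S
    JOIN ATTENDANCE AS A ON A.RollNumber = S.RollNumber
    WHERE A.AttendanceDate = ? AND A.AttendanceStatus = 'P'
    ORDER BY S.SName
"""
present_before = [row[0] for row in conn.execute(QUERY, ("2000-01-02",))]

# Data manipulation changes the instance, not the schema.
conn.execute(
    "UPDATE ATTENDANCE SET AttendanceStatus = 'P' "
    "WHERE RollNumber = 2 AND AttendanceDate = '2000-01-02'"
)
conn.execute("DELETE FROM ATTENDANCE WHERE RollNumber = 3")
present_after = [row[0] for row in conn.execute(QUERY, ("2000-01-02",))]
```

The same query returns different answers before and after the update and deletion, because it runs against two different instances of one unchanged schema.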
Each tuple (row) in a relation (table) corresponds to data of a real-world entity (for example, Student, Guardian, and Attendance).

ATTRIBUTE: A characteristic or parameter for which data are to be stored in a relation. Simply stated, the columns of a relation are the attributes, which are also referred to as fields. For example, GUID, GName, GPhone and GAddress are attributes of relation GUARDIAN.

TUPLE: Each row of data in a relation (table) is called a tuple. In a table with n columns, a tuple is a relationship between the n related values.

DOMAIN: The set of values from which an attribute can take a value in each row. Usually, a data type is used to specify the domain of an attribute. For example, in the STUDENT relation, the attribute RollNumber takes integer values, and hence its domain is a set of integer values. Similarly, the set of character strings constitutes the domain of the attribute SName.

DEGREE: The number of attributes in a relation is called the degree of the relation. For example, relation GUARDIAN with four attributes is a relation of degree 4.

CARDINALITY: The number of tuples in a relation is called the cardinality of the relation. For example, the cardinality of relation GUARDIAN is 5, as there are 5 tuples in the table.

It is important to note here that relations in a database are not independent tables, but are associated with each other. For example, relation ATTENDANCE has the attribute RollNumber, which links it with the corresponding student record in relation STUDENT.

Three Important Properties of a Relation

In the relational data model, the following three properties hold for a relation and make it different from a data file or a simple table.

Property 1: imposes the following rules on the attributes of a relation.
Each attribute in a relation has a unique name.
The sequence of attributes in a relation is unimportant.

Property 2: imposes the following rules on the tuples of a relation.
Each tuple in a relation is distinct.
For example, data values in no two tuples of relation ATTENDANCE can be identical for all the attributes. Thus, each tuple of a relation must be uniquely identified by its contents.
The sequence of tuples in a relation is unimportant. The tuples are not considered to be ordered, even though they appear to be in tabular form.

Property 3: imposes the following rules on the state of a relation.
All data values in an attribute must be from the same domain (same data type).
Each data value associated with an attribute must be atomic (it cannot be divided further into meaningful subparts). For example, GPhone of relation GUARDIAN holds a ten-digit number, which is indivisible.
No attribute can have many data values in one tuple. For example, a guardian cannot specify multiple contact numbers under the GPhone attribute.
A special value “NULL” is used to represent values that are unknown or not applicable to certain attributes. For example, if a guardian does not share his or her contact number with the school authorities, then GPhone is set to NULL (data unknown).

Keys in a Relational Database

The tuples within a relation must be distinct. This means no two tuples in a table should have the same value for all attributes. That is, there should be at least one attribute in which data are distinct (unique) and not NULL. That way, we can uniquely distinguish each tuple of a relation.

Candidate Key
A relation can have one or more attributes that take distinct values. Any of these attributes can be used to uniquely identify the tuples in the relation. Such attributes are called candidate keys, as each of them is a candidate for the primary key.

Primary Key
Out of one or more candidate keys, the attribute chosen by the database designer to uniquely identify the tuples in a relation is called the primary key of that relation. The remaining attributes in the list of candidate keys are called the alternate keys.
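The definitions of degree, cardinality and candidate key given above can be checked on a small in-memory relation. This is a sketch with made-up rows for GUARDIAN (the original table is not reproduced here), and the helper `single_attribute_candidates` is hypothetical. Note that distinct, non-NULL values in the current rows are only a necessary condition: the database designer decides which attributes are truly keys.

```python
# Illustrative rows for relation GUARDIAN; all values are made up.
ATTRIBUTES = ("GUID", "GName", "GPhone", "GAddress")
GUARDIAN = [
    (1001, "Amit", 9000000001, "Delhi"),
    (1002, "Sita", 9000000002, "Delhi"),
    (1003, "Amit", None, "Pune"),  # None plays the role of NULL
    (1004, "Ravi", 9000000004, "Agra"),
    (1005, "Lata", 9000000005, "Pune"),
]

degree = len(ATTRIBUTES)      # number of attributes (columns)
cardinality = len(GUARDIAN)   # number of tuples (rows)

# Property 2: no two tuples may be identical in all attributes.
all_tuples_distinct = len(set(GUARDIAN)) == cardinality


def single_attribute_candidates(attributes, tuples):
    """Return attributes whose values are all distinct and never NULL."""
    candidates = []
    for index, name in enumerate(attributes):
        values = [row[index] for row in tuples]
        if None not in values and len(set(values)) == len(values):
            candidates.append(name)
    return candidates


candidate_keys = single_attribute_candidates(ATTRIBUTES, GUARDIAN)
primary_key = candidate_keys[0]     # the designer's choice; here the first one
alternate_keys = candidate_keys[1:]  # whatever candidates remain
```

In this sample only GUID qualifies: GName and GAddress repeat, and GPhone contains a NULL, so there are no alternate keys.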
Composite Primary Key
If no single attribute in a relation is able to uniquely distinguish the tuples, then more than one attribute is taken together as the primary key. Such a primary key, consisting of more than one attribute, is called a composite primary key.

Foreign Key
A foreign key is used to represent the relationship between two relations. A foreign key is an attribute whose value is derived from the primary key of another relation. This means that any attribute of a (referencing) relation that is used to refer to contents of another (referenced) relation becomes a foreign key if it refers to the primary key of the referenced relation. The referencing relation is called the foreign relation. In some cases, a foreign key can take a NULL value if it is not part of the primary key of the foreign table. The relation in which the referenced primary key is defined is called the primary relation or master relation.

SUMMARY

A file in a file system is a container to store data in a computer.
A file system suffers from Data Redundancy, Data Inconsistency, Data Isolation, Data Dependence and Limited Data Sharing.
A Database Management System (DBMS) is software to create and manage databases.
A database is a collection of tables.
A database schema is the design of a database.
A database constraint is a restriction on the type of data that can be inserted into a table.
The database schema and database constraints are stored in the database catalog, whereas the snapshot of the database at any given time is the database instance.
A query is a request to a database for information retrieval and data manipulation (insertion, deletion or update). It is written in Structured Query Language (SQL).
A Relational DBMS (RDBMS) is used to store data in related tables. Rows and columns of a table are called tuples and attributes respectively. A table is referred to as a relation.
The primary key in a relation is used for unique identification of tuples.
A foreign key is used to relate two tables or relations.
Each column in a table represents a feature (attribute) of a record.
A table stores the information for an entity, whereas each row in the table represents a record.
A tuple is a collection of attribute values that makes a record unique. Tuples are unique, whereas individual attribute values can be duplicated in the table.
SQL is the standard language for RDBMS systems like MySQL.
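The key concepts recapped in this summary (primary key, composite primary key and foreign key) can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite stands in for MySQL to keep the example self-contained, the `GUID` column in STUDENT and the `Status` column in ATTENDANCE are hypothetical, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE GUARDIAN (
        GUID  INTEGER PRIMARY KEY,
        GName TEXT NOT NULL
    );
    CREATE TABLE STUDENT (
        RollNumber INTEGER PRIMARY KEY,
        SName      TEXT NOT NULL,
        GUID       INTEGER REFERENCES GUARDIAN(GUID)
    );
    CREATE TABLE ATTENDANCE (
        AttendanceDate TEXT    NOT NULL,
        RollNumber     INTEGER NOT NULL REFERENCES STUDENT(RollNumber),
        Status         TEXT    NOT NULL,
        PRIMARY KEY (AttendanceDate, RollNumber)
    );
    """
)


def accepted(sql, params):
    """Return True if the statement succeeds, False if a key rule rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


guardian_ok = accepted("INSERT INTO GUARDIAN VALUES (?, ?)", (1001, "Amit"))

# Foreign key: STUDENT.GUID must match an existing GUARDIAN.GUID or be NULL.
fk_ok = accepted("INSERT INTO STUDENT VALUES (?, ?, ?)", (1, "Asha", 1001))
fk_null = accepted("INSERT INTO STUDENT VALUES (?, ?, ?)", (2, "Ravi", None))
fk_bad = accepted("INSERT INTO STUDENT VALUES (?, ?, ?)", (3, "Meena", 9999))

# Composite primary key: the pair (AttendanceDate, RollNumber) must be unique.
insert_attendance = "INSERT INTO ATTENDANCE VALUES (?, ?, ?)"
att_first = accepted(insert_attendance, ("2000-01-02", 1, "P"))
att_other_day = accepted(insert_attendance, ("2000-01-03", 1, "P"))
att_repeat = accepted(insert_attendance, ("2000-01-02", 1, "A"))
```

Here GUARDIAN is the primary (master) relation and STUDENT the foreign relation. The student row pointing at a non-existent guardian is rejected, as is a second attendance row for the same student on the same date.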