INTRO 0-15.pdf
Introduction

Understanding Data Processing

“Data is not information, Information is not knowledge, Knowledge is not understanding, Understanding is not wisdom.”

In general, data is a collection of characters, numbers, and other symbols that represent the values of some situation or variable. The following list contains some examples of data that we often come across:

❑ Name, age, gender, contact details, etc., of a person
❑ Transaction data generated through banking, ticketing, shopping, etc., whether online or offline
❑ Images, graphics, animations, audio, video
❑ Documents and web pages
❑ Online posts, comments, and messages
❑ Signals generated by sensors
❑ Satellite data, including meteorological data, communication data, earth observation data, etc.

As data come from different sources, they can be in different formats. For example, an image is a collection of pixels; a video is made up of frames; and messages/chats are made up of text, icons (emoticons), and images/videos. Three broad categories into which data can be classified on the basis of their format are:

(A) Structured Data
(B) Unstructured Data
(C) Semi-Structured Data

Structured Data

Data that is organized and can be recorded in a well-defined format is called structured data. Structured data is usually stored in a computer in a tabular format (rows and columns), where each column holds the values of one parameter, called an attribute, characteristic, or variable, and each row holds one observation's values for those attributes.

Unstructured Data

Unstructured data refers to data that does not have a predefined format or organization. It lacks a specific data model and does not fit into traditional tabular structures. Unstructured data is typically in the form of text, images, audio, video, social media posts, emails, documents, web pages, and other content that is not easily organized into rows and columns.
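To make the contrast concrete, here is a minimal Python sketch. The names, ages, and the note text are invented for illustration and do not come from the original material. It shows that a question such as "what is the average age?" maps directly onto a column of structured data, while the same fact buried in free text has no column to read from.

```python
import csv
import io

# Hypothetical structured data: each column is an attribute,
# each row is one observation.
structured_csv = """name,age,city
Asha,21,Pune
Ravi,34,Delhi
"""

rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Because the format is well defined, the "age" attribute can be read directly.
ages = [int(row["age"]) for row in rows]
average_age = sum(ages) / len(ages)

# Hypothetical unstructured data: free text with no predefined attributes.
unstructured_note = "Met Asha from Pune today; she mentioned turning 21 last month."

# There is no "age" column here; recovering the age would require text analysis.
print(average_age)
```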
Analyzing unstructured data requires advanced techniques such as natural language processing, computer vision, and machine learning algorithms. Examples of unstructured data include text documents, business reports, books, audio/video files, and social media messages.

Semi-Structured Data

Semi-structured data lies between structured and unstructured data. It has some organizational structure but does not fit neatly into the traditional tabular format. Examples of semi-structured data include XML files, JSON data, and log files.

Metadata

Metadata refers to the additional information or descriptors that provide context and describe the characteristics of the main data. It is essentially data about data, and it helps in understanding and drawing meaningful insights from the main data. In the case of unstructured data, metadata plays a crucial role in organizing and categorizing the data. It helps to identify and describe different components or attributes of the unstructured data, making the data easier to search, retrieve, and analyze. By extracting and using metadata, unstructured data can be indexed, classified, and structured in a way that enables efficient storage, retrieval, and analysis.

Data Collection

To process data, we first need to collect or gather it. We can then store the data in a file or database for later use. Data collection here means identifying data that is already available or collecting it from appropriate sources. Data are continuously being generated at different sources. Our interactions with digital media continuously generate huge volumes of data. Hospitals collect data about patients to improve their services. Shopping malls collect data about the items people purchase.

Data Storage

Once we gather data and process them to get results, we should not simply discard the data. Rather, we would like to store them for future use as well.
Data storage is the process of storing data on storage devices so that the data can be retrieved later. We store data such as images, documents, and audio/video as files on our computers. Likewise, school and hospital data are stored in data files. We use computers to add, modify, or delete data in these files, or to process these data files to get results. However, file processing has certain limitations, which can be overcome through a Database Management System (DBMS).

Data Processing

One cannot arrive at a conclusion just by looking at a large amount of data. Rather, data need to be processed to get results, and after analyzing those results, we draw conclusions or make decisions. Finding information in a huge volume of papers, or deleting or modifying an entry, is a difficult task with a pen-and-paper approach.

File Processing System

❑ A collection of application programs that perform services for end users, such as the production of reports
❑ Each program defines and manages its own data

[Figure: Program and Data Interdependence. The Library, Examination, and Registration applications each work with their own data files.]

File Processing Systems

Library         Examination     Registration
Reg_Number      Reg_Number      Reg_Number
Name            Name            Name
Father Name     Address         Father Name
Books Issued    Class           Phone
Fine            Semester        Address
                Grade           Class

File-Based Processing

Disadvantages of File Processing

❑ 1. Program-Data Dependence: The file structure is defined within the program code, leading to a strong coupling between programs and data. Any change in the structure requires modifications to the program, making the system inflexible.
❑ 2. Metadata Maintenance: All programs need to maintain metadata for each file they use, which increases complexity and overhead.
❑ 3. Duplication of Data (Data Redundancy): Different systems or programs may have separate copies of the same data, resulting in data redundancy.
This redundancy wastes storage space and can lead to inconsistencies when data is updated in one copy but not in others.
❑ 4. Limited Data Sharing: There is no centralized control of data, making it difficult for different programs or systems to access and share data. Programs written in different languages may face compatibility issues when accessing each other's files.
❑ 5. Lengthy Development Times: Programmers must design their own file formats, which increases development time and effort. File formats need to be designed and implemented for each program, leading to duplication of work.
❑ 6. Excessive Program Maintenance: File processing requires significant program maintenance efforts, consuming a large portion of the information system's budget (typically around 80%). Modifying file structures or formats can be time-consuming and error-prone.
❑ 7. Vulnerable to Inconsistency: Changes in one table or file may require corresponding changes in other related tables or files to maintain data consistency. Failure to make these changes can result in inconsistent or inaccurate data.
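The redundancy and inconsistency problems in the list above can be illustrated with a small Python sketch. It assumes two programs, Examination and Registration, that each keep a private copy of a student's address. The registration number, names, and addresses are hypothetical, and the in-memory dictionaries merely stand in for separate data files.

```python
# Hypothetical sketch: two programs each keep their own copy of student data.
# The dictionaries stand in for separate data files; all values are invented.
examination_file = {"R001": {"name": "Asha", "address": "12 Lake Road"}}
registration_file = {"R001": {"name": "Asha", "address": "12 Lake Road"}}


def update_address(data_file, reg_number, new_address):
    """Update the address in one program's private file only."""
    data_file[reg_number]["address"] = new_address


def find_inconsistencies(file_a, file_b, field):
    """Return registration numbers whose `field` differs between two files."""
    shared = file_a.keys() & file_b.keys()
    return sorted(r for r in shared if file_a[r][field] != file_b[r][field])


# The Registration program records a change of address,
# but nothing tells the Examination program to do the same.
update_address(registration_file, "R001", "7 Hill Street")

mismatched = find_inconsistencies(examination_file, registration_file, "address")
print(mismatched)
```

With no centralized control, the stale copy is only discovered if someone deliberately compares the files, which is the kind of cross-file check a DBMS is meant to make unnecessary.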