CSW241-File Organization and Processing Lecture Notes PDF
Document Details
Uploaded by ImprovingPeace
Sinai University
Dr. Riham Moharam
Tags
Summary
These lecture notes cover CSW241 File Organization and Processing, including course assessment, objectives, and activities. Topics discussed include file operations, data structures, memory/storage hierarchies, and computer architecture. The document was prepared by Dr. Riham Moharam from Sinai University.
Full Transcript
# CSW241-File Organization and Processing ## Introduction - Dr. Riham Moharam - Faculty of Information Technology & Computer Science - Sinai University - North Sinai, Egypt ## Course Assessment - Course Work (Assignments, Labs, Projects): 20% - Midterm Exam: 20% - Final Examination: 60% ## Refe...
# CSW241-File Organization and Processing ## Introduction - Dr. Riham Moharam - Faculty of Information Technology & Computer Science - Sinai University - North Sinai, Egypt ## Course Assessment - Course Work (Assignments, Labs, Projects): 20% - Midterm Exam: 20% - Final Examination: 60% ## References - **File Structures:** Michael J. Folk, Bill Zoellick, Greg Riccardi - **File Organization and Processing:** Alan L. Tharp - **Database Management Systems (3rd Edition):** Ramakrishnan, Gehrke ## Course Objectives This course will help students to achieve the following objectives: - The basic ideas and principles, overview of files, blocking and buffering. - Secondary storage devices. - Sequential file organization. - Explain the basic file operations. - Introducing techniques for organization and manipulation of data in secondary storage. - Introducing the most important high-level file structures tools which include indexing and hashing. - Applying different hashing and indexing techniques in the design of C++ programs solving various file management problems. ## Laboratory Activities - Lab activity will be use C++ Programming Language to: - Apply file operations. - Indexing and hashing files. - etc. - Student must read / do Pre-Lab before coming to the laboratory. - Lab will be assessed based on the report + Task of Day - There will be a lab assessment/ lab test - purpose to test student skill - individually. ## Plagiarism/Cheating Policy - What is Plagiarism: "Using another person's ideas or creative work without giving credit to that person". - Copying and pasting another student's code. - Giving your code to another student. - Attempting to buy code online. - This will result in an immediate F in the class. ## Outline - Introduction to File Organization - Data Structure VS File Structure - Memory/Storage Hierarchies - Computer Architecture - Why Study File Structure Design? - Overview of File Structure Design - File Design Summary ## Introduction to File Organization - Data processing from a computer science perspective: - Storage of data. - Organization of data. - Access of data. This will be built on your knowledge of Data Structures. - File Structure is a combination of representations for data in files and of operations for accessing the data. - A File Structure allows applications to read, write and modify data. It might support finding the data that matches some search criteria or reading through data in some particular order. ## Data Structures vs File Structures - Both involve: - Representation of Data + Operations for accessing data. - Difference: - **Data structures:** deal with data in the main memory. - **File structures:** deal with the data in the secondary storage. ## Computer Architecture - Data is Manipulated Here: Main Memory (RAM) - Data is Stored Here: Secondary Storage ## Computer Architecture - **CPU:** - Registers. - Cache. - **RAM (Semiconductor):** - Main Memory. - **Disk, Tape, DVD-R:** - Second Storage. ## Differences - **CPU:** fast, small, expensive, volatile. - **RAM:** slow, large, cheap, stable ## Computer Architecture - **Main Memory-MM:** fast, small, volatile (data is lost during power failures). - **Secondary Storage-SS:** big (because it is cheap), stable (non-volatile), slow (10,000 times slower than MM). ## Memory/Storage Hierarchies - Balancing performance with cost: - Small memories are fast but expensive. - Large memories are slow but cheap. - **Capacity:** - L1/2. - CACHE. - L3 CACHE. - DRAM. SSD. - HARD DISK. - **Performance:** ## Memory/Storage Hierarchies - Memory/storage hierarchy: - Combining many technologies to balance costs/benefits. - For long time not the focal point of storage system design. - **Persistence:** - Storing data for lengthy periods of time. - To be useful, it must also be possible to find it again later. ## How fast is the main memory? - Typical time for getting info from: - Main memory: ~120 nanosec = 120 x 10-9 sec. - Hard disks: ~30 millisec = 30 x 10-6 sec. ## Goal of the Course - Minimize number of trips to the disk in order to get desired information. Ideally, get what we need in one disk access or get it with as few disk accesses as possible. - Grouping related information so that we are likely to get everything we need with only one trip to the disk (e.g., name, address, phone number). - **Locality of Reference in Time and Space** ## Why Study File Structure Design? - Most computers are used for data processing, as a big growth area in the information age. Data processing from a computer science perspective: - Storage of data. - Organization of data. - Access to data. - Processing of data. ## Why Study File Structure Design? - **Data Storage:** - Computer Data can be stored in three kinds of locations: - Primary Storage ==> Memory [Computer Memory] - Secondary Storage [Online Disk/ Tape/ CDRom that can be accessed by the computer] - Tertiary Storage ==>Archival Data [Offline Disk/Tape/ CDRom not directly available to the computer.] - **Our Focus:** Secondary Storage ## Why Study File Structure Design? - **Performance:** - Minimize the number of trips to the Secondary Storage in order to get desired information. - Group related information so that we are likely to get everything we need with one trip to the Secondary Storage. - **Memory:** - Balance the memory size and the time. ## Why Study File Structure Design? - **How Can Secondary Storage Access Time be Improved?** - Use the right file structure. - Understand the advantages disadvantages of alternative methods. - **By improving the File Structure.** - Since the details of the representation of the data and the implementation of the operations determine the efficiency of the file structure for particular applications, improving these details can help improve secondary storage access time. ## Why Study File Structure Design? - Metrics used to measure efficiency and effectiveness of a file structure: - Simplicity. - Time complexities. - Space complexities. - Scalability. - Programmability. - Maintainability. - Note that the domains of the efficiency and effectiveness concerns rely on space complexity more than any other factor. ## Why Study File Structure Design? - **Good File Structure Design:** 1. Fast access to great capacity. 2. Reduce the number of disk accesses. 3. By collecting data into buffers, blocks or buckets. 4. Manage growth by splitting these collections. ## Why Study File Structure Design? - The file structures involve two domains: hardware and software. - **Hardware:** primarily involves the physical characteristics of the storage medium. - **Software:** involves the data structures used to manipulate the files and methods/ algorithms to deal with these structures. ## File Operations - Search for a particular data in a file. - Add a certain data item. - Remove a certain item. - Order the data items according to a certain criterion. - Merge of files. - Creation of new files from existing file(s). - Finally create, open, and close operations which have implications in the operation of the system. ## Overview of File Structure Design - **General Goals:** - Get the information we need with one access to the disk. - If that's not possible, then get the information with as few accesses as possible. - Group information so that we are likely to get everything we need with only one trip to the disk. - **Fixed versus Dynamic Files:** - It is relatively easy to come up with file structure designs that meet the general requirements when the files never change. - When files grow or shrink when information is added and deleted, it is much more difficult. ## History of File Structures - In the beginning... it was the tape. - Sequential access - Access cost proportional to size of file [Analogy to sequential access to an array data structure]. - Disks became more common. - Direct access [Analogy to access to position in array]. - Indexes were Invented. - List of keys and points stored in small file. - Allows direct access to a large primary file. - Great if index fits into main memory. - As file grows we have the same problem we had with a large primary file. ## History of File Structures - Tree structures emerged for main memory (1960's): - Binary search trees (BST`s). - Balanced, self-adjusting BST`s: e.g., AVL trees (1963). - AVL tree: Self-adjusting binary tree for data in memory. - A tree structure suitable for files was invented: - B trees (1979) and B+ trees. - Good for accessing millions of records with 3 or 4 disk accesses. - What about getting info with a single request? - Hashing Tables (Theory developed over 60's and 70's but still a research topic). - Good when files do not change too much in time. - Expandable, dynamic hashing (late 70's and 80's). - One or two disk accesses even if file grows dramatically. ## File Design Summary - **Data characteristics:** - **Usage characteristics:** - **Storage characteristics:** - **Performance criteria:** - **File Structures Design Problem:** - **File Organization:** ## Data Characteristics - Record Length. - Record format: fixed or variable. - Maximum record length. - Primary key and its value distribution. ## Usage Characteristics - Frequency of query. - Frequency of update. - Growth expected. - Volatility of data values. - Record ordering needed. - Response time required. - Variety of types of transactions. - Priorities of requirements. ## Storage Characteristics - Type of storage devices available. - Block sizes. - Average random access times. - Average sequential access times. - Cost of storage. - Available file organization methods. - Available access methods. ## Performance Factors of File Design - Excepted and worst case response times. - Excepted and worst case update times. - I/O access requirements. - Storage space requirements. - Complexity of required support. - Maintenance requirements.