Chapter 7 PDF - File Concepts
Document Details
Uploaded by CleverChrysoprase4501
University of San Agustin
Tags
Summary
This document provides an introduction to file concepts. It explores various aspects of files, including their attributes and operations. It also explains different types of files and the internal structures behind them. The document discusses how computers store information on different media types.
Full Transcript
File Concept: Computers can store information on various storage media, such as magnetic disks, magnetic tapes, and optical disks. The physical storage is converted into a logical storage unit by the operating system. The logical storage unit is called a FILE. A file is a collection of similar reco...
File Concept: Computers can store information on various storage media, such as magnetic disks, magnetic tapes, and optical disks. The physical storage is converted into a logical storage unit by the operating system. The logical storage unit is called a FILE. A file is a collection of similar records. A record is a collection of related fields that can be treated as a unit by some application program. A field is a basic element of data. Any individual field contains a single value. A database is a collection of related data. Student Marks Marks Fail/Pass KUMA 85 86 P LAKSH 93 92 P DATA FILE Student name, marks in sub1, sub2, and Fail/Pass are fields. The collection of fields is called a RECORD. RECORD: | LAKSH | 93 | 92 | P | The collection of these records is called a data file. FILE ATTRIBUTES: 1. Name: A file is named for the convenience of the user and is referred to by its name. A name is usually a string of characters. 2. Identifier: This unique tag, usually a number, identifies the file within the file system. 3. Type: Files are of many types. The type depends on the extension of the file. Example: ○.exe: Executable file ○.obj: Object file ○.src: Source file 4. Location: This information is a pointer to a device and the location of the file on that device. 5. Size: The current size of the file (in bytes, words, blocks). 6. Protection: Access control information determines who can read, write, execute, etc. 7. Time, Date, User Identification: This information may be kept for creation, last modification, and last use. FILE OPERATIONS: 1. Creating a file: Two steps are needed to create a file: ○ Check whether space is available or not. ○ If space is available, make an entry for the new file in the directory. The entry includes the name of the file, path of the file, etc. 2. Writing a file: To write a file, we have to know two things: the name of the file and the information or data to be written on the file. The system searches the entire given location for the file. If the file is found, the system must keep a write pointer to the location in the file where the next write is to take place. 3. Reading a file: To read a file, first of all, we search the directories for the file. If the file is found, the system needs to keep a read pointer to the location in the file where the next read is to take place. Once the read has taken place, the read pointer is updated. 4. Repositioning within a file: The directory is searched for the appropriate entry, and the current file position pointer is repositioned to a given value. This operation is also called file seek. 5. Deleting a file: To delete a file, first search the directory for the named file, then release the file space and erase the directory entry. 6. Truncating a file: To truncate a file, remove the file contents only, but the attributes remain as they are. FILE TYPES: The name of the file is split into two parts: the name and the extension. The file type depends on the extension of the file. File Type Extension Purpose Executable.exe,.com,.bin Ready to run (or) ready to run machine Source code.c,.cpp,.asm Source code in various languages Object.obj,.o Compiled, machine Batch.bat,.sh Commands to the command Text.txt,.doc Textual data, documents Word processor.doc,.wp,.rtf Various word processor formats Library.lib,.dll Libraries of routines for Print/View.pdf,.jpg Binary file in a format for Archive.arc,.zip Related files grouped into a Multimedia.mpeg,.mp3,.avi Binary file containing audio or audio/video FILE STRUCTURE File types can also be used to indicate the internal structure of the file. The operating system requires that an executable file have a specific structure so that it can determine where in memory to load the file and what the location of the first instruction is. If the OS supports multiple file structures, the resulting size of the OS is large. If the OS defines five different file structures, it needs to contain the code to support these file structures. All OS must support at least one structure, that of an executable file, so that the system is able to load and run programs. INTERNAL FILE STRUCTURE In UNIX OS, all files are defined as simply streams of bytes. Each byte is individually addressable by its offset from the beginning or end of the file. In this case, the logical record size is 1 byte. The file system automatically packs and unpacks bytes into physical disk blocks, such as 512 bytes per block. The logical record size, physical block size, and packing determine how many logical records are in each physical block. The packing can be done by the user’s application program or the OS. A file may be considered a sequence of blocks. If each block were 512 bytes, a file of 1949 bytes would be allocated 4 blocks (2048 bytes). The last 99 bytes would be wasted. This is called internal fragmentation. All file systems suffer from internal fragmentation; the larger the block size, the greater the internal fragmentation. FILE ACCESS METHODS Files store information, and this information must be accessed and read into computer memory. There are many ways that the information in the file can be accessed. 1. Sequential file access: Information in the file is processed in order, i.e., one record after the other. Magnetic tapes support this type of file access. ○ Example: A file consisting of 100 records. The current position of the read/write head is at the 45th record. Suppose we want to read the 75th record. It accesses sequentially from 45, 46, 47… 74, 75. So, the read/write head traverses all the records between 45 and 75. 2. Direct access: Direct access is also called relative access. Here, records can be read/written randomly without any order. The direct access method is based on a disk model of a file, because disks allow random access to any file block. ○ Example: A disk containing 256 blocks. The position of the read/write head is at the 95th block. The block to be read or written is the 250th block. Then, we can access the 250th block directly without any restrictions. ○ Example: A CD consists of 10 songs. At present, we are listening to song 3. If we want to listen to song 10, we can shift to song 10. 3. Indexed Sequential File access: The main disadvantage of sequential files is that it takes more time to access a record. Records are organized in sequence based on a key field. ○ Example: A file consisting of 60,000 records, with a master index dividing the total records into 6 blocks, each block consisting of a pointer to a secondary index. The secondary index divides the 10,000 records into 10 indexes, each index consisting of a pointer to its original location. Each record in the index file consists of two fields: a key field and a pointer field. ○ DIRECTORY STRUCTURE Sometimes, the file system consists of millions of files. In such cases, it is very hard to manage the files. To manage these files, group them and load one group into one partition. Each partition is called a directory. A directory structure provides a mechanism for organizing many files in the file system. Operations on the directories: Search for a file: Search the directory structure for the required file. Create a file: New files need to be created and added to the directory. Delete a file: When a file is no longer needed, remove it from the directory. List a directory: We can list the files in the directory. Rename a file: When we need to change the name of the file, we can change the name. Traverse the file system: We need to access every directory and every file within a directory structure. We can traverse the file system. The various directory structures: 1. Single-level directory: The directory system has only one directory, which consists of all files. Sometimes, it is called the root directory. ○ ○ Example: Here, the directory contains 4 files (A, B, C, D). The advantage of this scheme is its simplicity and the ability to locate files quickly. The problem is that different users may accidentally use the same names for their files. ○ Example: If user 1 creates a file called "sample" and then later user 2 creates a file called "sample," then user 2's file will overwrite user 1's file. That’s why it is not used in multi-user systems. 2. Two-level directory: The problem in a single-level directory is that different users may accidentally use the same name for their files. To avoid this problem, each user needs a private directory. Names chosen by one user don't interfere with names chosen by a different user. ○ ○ Root directory is the first-level directory. User 1, User 2, User 3 are user-level directories with files A, B, C, etc. 3. Tree-structured directory: Two-level directories eliminate name conflicts among users, but they are not satisfactory for users with a large number of files. To avoid this, subdirectories are created, and the same type of files are loaded into subdirectories. So, each user can have as many directories as needed. There are two types of paths: ○ Absolute path: Begins with the root and follows a path down to the specified file, giving the directory and directory name on the path. ○ Relative path: A path from the current directory. 4. Acyclic graph directory: When multiple users are working on a project, the project files can be stored in a common sub-directory for the multiple users. This type of directory is called an acyclic graph directory. The common directory is declared a shared directory. The graph contains no cycles with shared files, and changes made by one user are made visible to other users. A file may now have multiple absolute paths. When a shared directory/file is deleted, all pointers to the directory/files must also be removed. 5. General graph directory: When we add links to an existing tree-structured directory, the tree structure is destroyed, resulting in a simple graph structure. Advantages: Traversing is easy. Easy sharing is possible. File system structure: Disk provides the bulk of secondary storage on which a file system is maintained. They have two characteristics that make them a convenient medium for storing multiple files: A disk can be rewritten in place. It is possible to read a block from the disk, modify the block, and write it back to the same place. A disk can access any block of information it contains directly. I/O Control: It consists of device drivers and interrupt handlers to transfer information between the main memory and the disk system. The device driver writes specific bit patterns to special locations in the I/O controller’s memory to tell the controller which device location to act on and what actions to take. The Basic File System needs only to issue commands to the appropriate device driver to read and write physical blocks on the disk. Each physical block is identified by its numeric disk address (e.g., Drive 1, Cylinder 73, Track 2, Sector 10). The File Organization Module knows about files and their logical and physical blocks. By knowing the type of file allocation used and the location of the file, the File Organization Module can translate logical block addresses to physical addresses for the Basic File System to transfer. Each file’s logical blocks are numbered from 0 to n. Thus, physical blocks containing the data usually do not match the logical numbers. A translation is needed to locate each block. The Logical File System manages all file system structures except the actual data (contents of the file). It maintains file structures via file control blocks. A file control block (inode in Unix file systems) contains information about the file, ownership, permissions, and the location of the file contents. File System Implementation: Overview: A Boot Control Block (per volume) can contain information needed by the system to boot an OS from that volume. If the disk does not contain an OS, this block can be empty. A Volume Control Block (per volume) contains volume (or partition) details, such as the number of blocks in the partition, size of the blocks, free block count, and free block pointers, as well as the free FCB count and FCB pointers. A typical File Control Block. A Directory Structure (per file system) is used to organize the files. A per-file FCB contains many details about the file. Once a file has been created, it can be used for I/O. First, it must be opened. The open() call passes a file name to the Logical File System. The open() system call first searches the system-wide open file table to see if the file is already in use by another process. If it is, a per-process open file table entry is created, pointing to the existing system-wide open file table. If the file is not already open, the directory structure is searched for the given file name. Once the file is found, the FCB is copied into a system-wide open file table in memory. This table not only stores the FCB but also tracks the number of processes that have the file open. Next, an entry is made in the per-process open file table, with a pointer to the entry in the system-wide open file table and some other fields. These fields include a pointer to the current location in the file (for the next read/write operation) and the access mode in which the file is open. The open() call returns a pointer to the appropriate entry in the per-process file system table. All file operations are performed via this pointer. When a process closes the file, the per-process table entry is removed, and the system-wide entry open count is decremented. When all users who have opened the file close it, any updated metadata is copied back to the disk base directory structure. The system-wide open file table entry is then removed. The system-wide open file table contains a copy of the FCB of each open file and other information. The per-process open file table contains a pointer to the appropriate entry in the system-wide open file table and other information. Allocation Methods – Contiguous: An allocation method refers to how disk blocks are allocated for files: Contiguous Allocation – Each file occupies a set of contiguous blocks. ○ Best performance in most cases. ○ Simple – Only the starting location (block #) and length (number of blocks) are required. ○ Problems include finding space for the file, knowing the file size, external fragmentation, and the need for compaction (either off-line, during downtime, or on-line). ○ Linked Allocation: Each file is a linked list of blocks. ○ The file ends at a nil pointer. ○ No external fragmentation. ○ Each block contains a pointer to the next block. ○ No compaction, external fragmentation. ○ Free space management system is called when a new block is needed. ○ Efficiency is improved by clustering blocks into groups but increases internal fragmentation. ○ Reliability can be a problem. ○ Locating a block can take many I/Os and disk seeks. ○ FAT (File Allocation Table) variation: The beginning of the volume has a table, indexed by block number. Much like a linked list, but faster on disk and cacheable. File-Allocation Table Indexed Allocation: Each file has its own index block(s) of pointers to its data blocks. Free-Space Management: The file system maintains a free-space list to track available blocks/clusters: Linked List (Free List): ○ Cannot get contiguous space easily. ○ No waste of space. ○ No need to traverse the entire list. Bitmap or Bit Vector: ○ A Bitmap or Bit Vector is a series or collection of bits where each bit corresponds to a disk block. ○ The bit can take two values: 0 and 1. A 0 indicates that the block is allocated, and 1 indicates a free block. The given instance of disk blocks on the disk in Figure 1 (where green blocks are allocated) can be represented by a bitmap of 16 bits as: 0000111000000110. ○ Advantages: Simple to understand. Finding the first free block is efficient. It requires scanning the words (a group of 8 bits) in a bitmap for a non-zero word. (A 0-valued word has all bits 0). The first free block is then found by scanning for the first 1 bit in the non-zero word. Linked Free Space List on Disk: In this approach, the free disk blocks are linked together, i.e., a free block contains a pointer to the next free block. The block number of the very first disk block is stored at a separate location on disk and is also cached in memory. Grouping: Modify the linked list to store the address of the next n-1 free blocks in the first free block, plus a pointer to the next block that contains free-block pointers (like this one). An advantage of this approach is that the addresses of a group of free disk blocks can be found easily. Counting: Because space is frequently contiguously used and freed, with contiguous allocation, extents, or clustering: ○ Keep the address of the first free block and the count of following free blocks. The free space list then has entries containing addresses and counts. Directory Implementation: Linear List: In this algorithm, all the files in a directory are maintained as a singly linked list. Each file contains pointers to the data blocks assigned to it and the next file in the directory. Characteristics: ○ When a new file is created, the entire list is checked to see whether the new file name matches an existing file name. If it doesn’t exist, the file can be created at the beginning or at the end. ○ Therefore, searching for a unique name is a big concern because traversing the whole list takes time. ○ The list needs to be traversed in every operation (creation, deletion, updating, etc.), making the system inefficient. ○ Hash Table: To overcome the drawbacks of singly linked list implementation of directories, there is an alternative approach: the hash table. This approach suggests using a hash table along with linked lists. A key-value pair for each file in the directory is generated and stored in the hash table. The key can be determined by applying the hash function on the file name, while the key points to the corresponding file stored in the directory. Now, searching becomes efficient because the entire list is no longer searched on every operation. Only hash table entries are checked using the key, and if an entry is found, the corresponding file is fetched using the value. Efficiency and Performance: Efficiency depends on: Disk allocation and directory algorithms. Types of data kept in the file’s directory entry. Performance: Disk Cache: A separate section of main memory for frequently used blocks. Free-behind and Read-ahead: Techniques to optimize sequential access. Improve PC performance by dedicating a section of memory as a virtual disk or RAM disk. I/O Hardware: I/O Devices Input/output devices are the devices responsible for input/output operations in a computer system. There are two types of input/output devices: Block Devices Character Devices Block Devices: A block device stores information in blocks with fixed sizes and its own address. It is possible to read/write each and every block independently in the case of a block device. In the case of a disk, it is always possible to seek another cylinder and wait for the required block to rotate under the head, regardless of where the arm currently is. Therefore, a disk is a block-addressable device. Character Devices: A character device accepts/delivers a stream of characters without regard to any block structure. A character device isn't addressable. It doesn't have any seek operation. There are many character devices present in a computer system, such as printers, mice, rats, network interfaces, etc. These four are the common character devices. Device Controllers: Device drivers are software modules that can be plugged into an OS to handle a particular device. The operating system takes help from device drivers to handle all I/O devices. The Device Controller works as an interface between a device and a device driver. I/O units (keyboard, mouse, printer, etc.) typically consist of a mechanical component and an electronic component, where the electronic component is called the device controller. There is always a device controller and a device driver for each device to communicate with the Operating Systems. A device controller may be able to handle multiple devices. As an interface, its main task is to convert the serial bit stream to a block of bytes and perform error correction as necessary. Any device connected to the computer is connected by a plug and socket, and the socket is connected to a device controller. The following is a model for connecting the CPU, memory, controllers, and I/O devices, where the CPU and device controllers all use a common bus for communication. Synchronous vs Asynchronous I/O: Synchronous I/O: In this scheme, CPU execution waits while I/O proceeds. Asynchronous I/O: I/O proceeds concurrently with CPU execution. Communication to I/O Devices: The CPU must have a way to pass information to and from an I/O device. There are three approaches available to communicate with the CPU and device: Special Instruction I/O Memory-mapped I/O Direct Memory Access (DMA) Special Instruction I/O: This uses CPU instructions that are specifically made for controlling I/O devices. These instructions typically allow data to be sent to an I/O device or read from an I/O device. Memory-mapped I/O: When using memory-mapped I/O, the same address space is shared by memory and I/O devices. The device is connected directly to certain main memory locations so that the I/O device can transfer a block of data to/from memory without going through the CPU. While using memory-mapped I/O, the OS allocates a buffer in memory and informs the I/O device to use that buffer to send data to the CPU. The I/O device operates asynchronously with the CPU and interrupts the CPU when finished. The advantage of this method is that every instruction that can access memory can be used to manipulate an I/O device. Memory-mapped I/O is used for most high-speed I/O devices like disks and communication interfaces. Direct Memory Access (DMA) Slow devices like keyboards generate an interrupt to the main CPU after each byte is transferred. If a fast device, such as a disk, generated an interrupt for each byte, the operating system would spend most of its time handling these interrupts. So, a typical computer uses direct memory access (DMA) hardware to reduce this overhead. Direct Memory Access (DMA) means the CPU grants the I/O module authority to read from or write to memory without involvement. The DMA module itself controls the exchange of data between main memory and the I/O device. The CPU is only involved at the beginning and end of the transfer and is interrupted only after the entire block has been transferred. Direct Memory Access needs special hardware called the DMA controller (DMAC) that manages the data transfers and arbitrates access to the system bus. The controllers are programmed with source and destination pointers (where to read/write the data), counters to track the number of transferred bytes, and settings, which include I/O and memory types, interrupts, and states for the CPU cycles. The operating system uses the DMA hardware as follows: Step Description 1 Device driver is instructed to transfer disk data to a buffer address X. 2 Device driver then instructs the disk controller to transfer data to the buffer. 3 Disk controller starts DMA transfer. 4 Disk controller sends each byte to the DMA controller. DMA controller transfers bytes to the buffer, increases the memory address, and decreases 5 the counter C until C becomes zero. 6 When C becomes zero, DMA interrupts the CPU to signal transfer completion. I/O Software Layers I/O software is often organized in the following layers: User Level Libraries: This provides a simple interface to the user program to perform input and output. For example, stdio is a library provided by C and C++ programming languages. Kernel Level Modules: This provides device drivers to interact with the device controller and device-independent I/O modules used by the device drivers. Hardware: This layer includes actual hardware and hardware controllers that interact with the device drivers and make the hardware functional. A key concept in the design of I/O software is that it should be device-independent, meaning it should be possible to write programs that can access any I/O device without having to specify the device in advance. For example, a program that reads a file as input should be able to read a file on a floppy disk, a hard disk, or a CD-ROM, without having to modify the program for each different device. Device Drivers Device drivers are software modules that can be plugged into an OS to handle a particular device. The operating system relies on device drivers to handle all I/O devices. Device drivers encapsulate device-dependent code and implement a standard interface in such a way that the code contains device-specific register reads/writes. A device driver is generally written by the device's manufacturer and delivered along with the device on a CD-ROM. A device driver performs the following tasks: Accept requests from the device-independent software above it. Interact with the device controller to take and give I/O and perform the required error handling. Ensure that the request is executed successfully. How a device driver handles a request is as follows: Suppose a request comes to read a block N. If the driver is idle when a request arrives, it starts carrying out the request immediately. Otherwise, if the driver is already busy with another request, it places the new request in the queue of pending requests. Interrupt Handlers An interrupt handler, also known as an interrupt service routine (ISR), is a piece of software or, more specifically, a callback function in an operating system or a device driver, whose execution is triggered by the reception of an interrupt. When the interrupt happens, the interrupt procedure does whatever is necessary to handle the interrupt, updates data structures, and wakes up the process that was waiting for the interrupt to occur. The interrupt mechanism accepts an address ─ a number that selects a specific interrupt handling routine/function from a small set. In most architectures, this address is an offset stored in a table called the interrupt vector table. This vector contains the memory addresses of specialized interrupt handlers. Device-Independent I/O Software The basic function of device-independent software is to perform the I/O functions that are common to all devices and to provide a uniform interface to the user-level software. Though it is difficult to write completely device-independent software, we can write some modules that are common among all the devices. Following is a list of functions of device-independent I/O software: Uniform interfacing for device drivers Device naming - Mnemonic names mapped to Major and Minor device numbers Device protection Providing a device-independent block size Buffering because data coming off a device cannot be stored in its final destination. Storage allocation on block devices Allocation and releasing of dedicated devices Error reporting User-Space I/O Software These are the libraries that provide a richer and simplified interface to access the functionality of the kernel or, ultimately, interact with the device drivers. Most of the user-level I/O software consists of library procedures, with some exceptions like the spooling system, which is a way of dealing with dedicated I/O devices in a multiprogramming system. I/O libraries (e.g., stdio) are in user space to provide an interface to the OS-resident device-independent I/O software. For example, putchar(), getchar(), printf(), and scanf() are examples of user-level I/O library functions from stdio in C programming. Kernel I/O Subsystem The Kernel I/O Subsystem is responsible for providing many services related to I/O. The following are some of the services provided: Scheduling: The kernel schedules a set of I/O requests to determine a good order in which to execute them. When an application issues a blocking I/O system call, the request is placed on the queue for that device. The Kernel I/O scheduler rearranges the order of the queue to improve overall system efficiency and the average response time experienced by the applications. Buffering: The Kernel I/O Subsystem maintains a memory area known as a buffer that stores data while it is transferred between two devices or between a device and an application operation. Buffering is done to cope with a speed mismatch between the producer and consumer of a data stream or to adapt between devices that have different data transfer sizes. Caching: The kernel maintains cache memory, a region of fast memory that holds copies of data. Access to the cached copy is more efficient than access to the original. Spooling and Device Reservation: A spool is a buffer that holds output for a device, such as a printer, that cannot accept interleaved data streams. The spooling system copies the queued spool files to the printer one at a time. In some operating systems, spooling is managed by a system daemon process. In other operating systems, it is handled by an in-kernel thread. Error Handling: An operating system that uses protected memory can guard against many kinds of hardware and application errors.