File Concept: File-System Interface PDF
Document Details
Tags
Summary
This document provides an overview of file concepts in operating systems. It details file attributes and the operations that can be performed on files, discussing file systems, from a computer science perspective. The document emphasizes the role of file systems in managing data storage.
Full Transcript
# File Concept ## File-System Interface A file is a named collection of related information that is recorded on secondary storage. Files represent programs and data. ## File Attributes A file's attributes vary from one operating system to another but typically consist of these: * Name: The sym...
# File Concept ## File-System Interface A file is a named collection of related information that is recorded on secondary storage. Files represent programs and data. ## File Attributes A file's attributes vary from one operating system to another but typically consist of these: * Name: The symbolic file name is the only information kept in human-readable form. * Identifier: This is usually a number, identifies the file within the file system; it is the non-human-readable name for the file. * Type: This information is needed for systems that support different types of files. * Location: This information is a pointer to a device and to the location of the file on that device. * Size: The current size of the file (in bytes, words, or blocks) and possibly the maximum allowed size are included in this attribute. * Protection: Access-control information determines who can do reading, writing, executing, and so on. * Time, date, and user identification: This information may be kept for creation, last modification, and last use. These data can be useful for protection, security, and usage monitoring. The information about all files is kept in the directory structure. Typically, a directory entry consists of the file's name and its unique identifier. The identifier in turn locates the other file attributes. It may take more than a kilobyte to record this information for each file. In a system with many files, the size of the directory itself may be megabytes. ## File Operations A file is an abstract data type. We need to consider the operations that can be performed on files. The operating system can provide system calls to create, write, read, reposition, delete, and truncate files. * **Creating a file**: Two steps are necessary to create a file. First, space in the file system must be found for the file. Second, an entry for the new file must be made in the directory. * **Writing a file**: To write a file, we make a system call specifying both the name of the file and the information to be written to the file. Given the name of the file, the system searches the directory to find the file's location. The system must keep awrite pointer to the location in the file where the next write is to take place. * **Reading a file**: To read from a file, we use a system call that specifies the name of the file and where (in memory) the next block of the file should be put. Again, the directory is searched for the associated entry, and the system needs to keep a read pointer to the location in the file where the next read is to take place. * **Repositioning within a file**: The directory is searched for the appropriate entry, and the current-file-position pointer is repositioned to a given value. This file operation is also known as a file seek. * **Deleting a file**: To delete a file, we search the directory for the named file. Having found the associated directory entry, we release all file space, so that it can be reused by other files, and erase the directory entry. * **Truncating a file**: The user may want to erase the contents of a file but keep its attributes. This function allows all attributes to remain unchanged-except for file length-be reset to zero and its file space released. Most of the file operations mentioned involve searching the directory for the entry associated with the named file. To avoid this constant searching, many systems require that an *open()* system call be made before a file is first used actively. The operating system keeps a small table, called the open-file table, containing information about all open files. When a file operation is requested, the file is specified via an index into this table, so no searching is required. When the file is no longer being actively used, it is closed by the process, and the operating system removes its entry from the open-file table; create and delete are system calls that work with closed rather than open files. ## File Types A common technique for implementing file types is to include the type as part of the file name. The name is split into two parts-a name and an extension, usually separated by a period character. In this way, the user and the operating system can tell from the name alone what the type of a file is. | File type | Usual extension | Function | | --------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Executable | exe, com, bin, or none | Ready-to-run program in a compiled, machine-readable format that can be loaded into main memory and executed. May include instructions for one type of computer. The file may also contain data and instructions for running a language processor. | | Object | obj, o | Compiled machine language, not ready to run. May be linked with other object files to create an executable file. | | Source code | c, cc, java, pas, asm, a | Source code in a high-level language, not ready to run. Requires compilation to create an object file. May be written in assembly language. | | Batch | bat, sh | A file of commands that can be executed by a command interpreter. Each command is a single executable statement, such as a call to a program. | | Text | txt, doc | Textual data stored in a simple ascii format. | | Word processor | wp, tex, rtf, doc | Various word processor formats, possibly including formatting information, such as font sizes and margins. | | Library | lib, a, so, dll | Libraries of routines that can be linked with other programs during compilation. | | Print or view | ps, pdf, jpg | ASCII or binary format for printing or viewing. | | Archive | arc, zip, tar | Related files or data that have been compressed for storage or transmission. Data in an archive is usually compressed, and an archive typically includes a table of contents listing all the files in the archive. | | Multimedia | mpeg, mov, rm, mp3, avi | Binary file containing audio or A/V data. | ## Internal File Structure Disk systems have a well-defined block size determined by the size of a sector. All disk I/O is performed in units of one block (physical record), and all blocks are the same size. It is unlikely that the physical record size will exactly match the length of the desired logical record. Logical records may even vary in length. Packing a number of logical records into physical blocks is a common solution to this problem. For example, the UNIX operating system defines all files to be simply streams of bytes. In this case, the logical record size is 1 byte. The file system automatically packs and unpacks bytes into physical disk blocks- say, 512 bytes per block-as necessary. The logical record size, physical block size, and packing technique determine how many logical records are in each physical block. The packing can be done either by the user's application program or by the operating system. Because disk space is always allocated in blocks, some portion of the last block of each file is generally wasted. If each block were 512 bytes, for example, then a file of 1,949 bytes would be allocated four blocks (2,048 bytes); the last 99 bytes would be wasted. The waste incurred is internal fragmentation. All file systems suffer from internal fragmentation; the larger the block size, the greater the internal fragmentation. ## Access Methods Files store information. The information in the file can be accessed in several ways. ### Sequential Access The simplest access method is sequential access. Information in the file is processed in order, one record after the other. This mode of access is used editors and compilers. A read operation-read next-reads the next portion of the file and automatically advances a file pointer. Similarly, the write operation-write next-appends to the end of the file and advances to the end of the newly written material. Such a file can be reset to the beginning. ### Direct Access Another method is direct access (or relative access). A file is made up of fixed length logical records. It allows programs to read and write records rapidly in no particular order. For the direct-access method, the file operations must be modified to include the block number as a parameter. Thus, we have read n, where n is the block number, rather than read next, and write n rather than write next. And to add an operation position file to n, where n is the block number. The block number provided by the user to the operating system is normally a relative block number. A relative block number is an index relative to the beginning of the file. Not all operating systems support both sequential and direct access for files. Some systems allow only sequential file access; others allow only direct access. Some systems require that a file be defined as sequential or direct when it is created; such a file can be accessed only in a manner consistent with its declaration. ### Other Access (Indexed) Methods Other access methods can be built on top of a direct-access method. These methods generally involve the construction of an index for the file. The index contains pointers to the various blocks. To find a record in the file, we first search the index and then use the pointer to access the file directly and to find the desired record. For example, IBM's indexed sequential-access method (ISAM) uses a small master index that points to disk blocks of a secondary index. The secondary index blocks point to the actual file blocks. ## Directory Structure The file systems store millions of files. To manage all these files, we need directories to organize them. In this section, we explore the topic of directory structure. First, we explain some basic features of storage structure. ### Storage Structure Each volume contains information about the files in the system. This information is kept in entries in a device directory. The device directory records information-such as name, location, size, and type-for all files on that volume. ## Directory Overview The directory can be viewed as a symbol table of entries. We want to be able to insert entries, to delete entries, to search for a named entry, and to list all the entries in the directory. * **Create a file**: New files need to be created and added to the directory. * **Delete a file**: When a file is no longer needed, we want to be able to remove it from the directory. * **List a directory**: We need to be able to list the files in a directory and the contents of the directory entry for each file in the list. * **Rename a file**: We must be able to change the name when the contents or use of the file changes. * **Traverse the file system**: We may wish to access every directory and every file within a directory structure. In the following sections, we describe the most common schemes for defining the logical structure of a directory. ### Single-Level Directory The simplest directory structure is the single-level directory. All files are contained in the same directory, which is easy to support and understand. ``` directory └── cat └── bo └── a └── test └── data └── mail ├── cont ├── hex └── records ``` A single-level directory has significant limitations, however, when the number of files increases or when the system has more than one user. Since all files are in the same directory, they must have unique names. Even a single user on a single-level directory may find it difficult to remember the names of all the files as the number of files increases. ### Two-Level Directory As we have seen, a single-level directory often leads to confusion of file names among different users. The standard solution is to create a separate directory for each user. In the two-level directory structure, each user has his own user file directory (**UFD**). The UFDs have similar structures, but each lists only the files of a single user. When a user job starts or a user logs in, the system's master file directory (**MFD**) is searched. The MFD is indexed by user name or account number, and each entry points to the UFD for that user. When a user refers to a particular file, only his own UFD is searched. Thus, different users may have files with the same name, as long as all the file names within each UFD are unique. ``` └── master file directory └── user 1 └── user file directory └── cat └── bo └── a └── test └── a └── data └── a └── test └── x └── data └── a └── data └── a └── test └── a └── data └── a └── data └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── data └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── data └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data └── a └── test └── a └── data ``` Although the two-level directory structure solves the name-collision problem, it still has disadvantages. This structure effectively isolates one user from another. Isolation do not allow local user files to be accessed by other users. ### Tree-Structured Directories It is generalization of two-level directory, to extend the directory structure to a tree of arbitrary height. This generalization allows users to create their own subdirectories and to organize their files accordingly. A tree is the most common directory structure. A directory (or subdirectory) contains a set of files or subdirectories. A directory is simply another file, but it is treated in a special way. All directories have the same internal format. One bit in each directory entry defines the entry as a file (0) or as a subdirectory (1). Special system calls are used to create and delete directories. ``` └── root └── spell └── bin └── programs └── stat └── mail └── dist └── find └── count └── hex └── reorder └── P └── e └── mail └── prog └── copy └── prt └── exp └── reorder └── list └── find └── hex └── count └── list └── obj └── spell └── all └── last └── first ``` With a tree-structured directory system, users can be allowed to access, the files of other usersby specifying its path names. ### Acyclic-Graph Directories Consider two programmers who are working on a joint project. Both want the subdirectory to be shared. A shared directory or file will exist in the file system in two (or more) places at once. A tree structure prohibits the sharing of files or directories. An acyclic graph -that is, a graph with no cycles- allows directories to share subdirectories and files. The same file or subdirectory may be in two different directories. ``` └── root └── list └── all └── W └── count └── dict └── spell └── count └── words └── list └── list └── rade └── W7 ``` The acyclic graph is a natural generalization of the tree-structured directory scheme. With a shared file, only one actual file exists, so any changes made by one person are immediately visible to the other. Sharing is particularly important for subdirectories; a new file created by one person will automatically appear in all the shared subdirectories. ## File-System Mounting A file system must be mounted before it can be available to processes on the system. The mount procedure is straightforward. The operating system is given the name of the device and the mount point-the location within the file structure where the file system is to be attached. Next, the operating system verifies that the device contains a valid file system. Finally, the operating system notes in its directory structure that a file system is mounted at the specified mount point. ``` └── bill └── users └── fred └── help └── sue ``` ``` └── sue └── users └── prog └── jane └── doc ``` Figure shows the effects of mounting the volume residing on /device/disk over/users. If the volume is unmounted, the file system is restored to the previous situation depicted in Figure. A system may allow the same file system to be mounted repeatedly, at different mount points; or it may only allow one mount per file system. ## File Sharing We begin by discussing general issues that arise when multiple users share files. ### Multiple Users To implement sharing among multiple users, the system must maintain more file and directory attributes than are needed on a single-user system. Most systems have use the concepts of file owner (or user) and group. The owner is the user who can change attributes and grant access over the file. The group attribute defines a subset of users who can share access to the file. The owner and group IDs of a given file (or directory) are stored with the other file attributes. When a user requests an operation on a file, the user ID can be compared with the owner attribute to determine if the requesting user is the owner of the file. Likewise, the group IDs can be compared. The result indicates which permissions are applicable. The system then applies those permissions to the requested operation and allows or denies it. ### Remote File Systems Remote file systems allow a computer to mount one or more file systems from one or more remote machines. In this case, the machine containing the files is the server, and the machine seeking access to the files is the client. A server can serve multiple clients, and a client can use multiple servers, depending on the implementation details of a given client-server facility. The server usually specifies the available files on a volume or directory level. Client identification is by using secure authentication of the client via encrypted keys. ### Consistency Semantics Consistency semantics represent an important criterion for file sharing. These semantics specify how multiple users of a system are to access a shared file simultaneously. These semantics are typically implemented as code with the file system. * **UNIX Semantics**: * Writes to an open file by a user are visible immediately to other users that have this file open. * One mode of sharing allows users to share the pointer of current location into the file. Thus, the advancing of the pointer by one user affects all sharing users. Here, a file has a single image that interleaves all accesses, Regardless of their origin. * **Session Semantics**: * The Andrew file system (AFS) uses the following consistency semantics: * Writes to an open file by a user are not visible immediately to other users that have the same file open. * Once a file is closed, the changes made to it are visible only in sessions starting later. Already open instances of the file do not reflect these changes. * According to these semantics, a file may be associated temporarily with several (possibly different) images at the same time. Consequently, multiple users are allowed to perform both read and write accesses concurrently on their images of the file, without delay. Almost no constraints are enforced on scheduling accesses. * **Immutable-Shared-Files Semantics**: * A unique approach is that of immutable shared files. Once a file is declared as shared by its creator, it cannot be modified. An immutable file has two key properties: Its name may not be reused, and its contents may not be altered. Thus, the name of an immutable file signifies that the contents of the file are fixed. The implementation of these semantics in a distributed system is simple. ## Protection When information is stored in a computer system, we want to keep it safe from physical damage (reliability) and improper access (protection). * **Reliability** is generally provided by duplicate copies of files. * **Protection** can be provided in many ways in a multiuser system. ### Types of Access Protection provide by the types of file access that can be made. Access is permitted or denied depending on type of access requested. Several different types of operations may be controlled: * **Read**: Read from the file. * **Write**: Write or rewrite the file. * **Execute**: Load the file into memory and execute it. * **Append**: Write new information at the end of the file. * **Delete**: Delete the file and tree its space for possible reuse. * **List**: List the name and attributes of the file. ### Access Control The most common approach to the protection problem is to make access dependent on the identity of the user. Different users may need different types of access to a file or directory. The most general scheme to associate with each file and directory an access-control list (ACL) specifying user names and the types of access allowed for each user. When a user requests access to a particular file, the operating system checks the access list associated with that file. If that user is listed for the requested access, the access is allowed. Otherwise, a protection violation occurs, and the user job is denied access to the file. The main problem with access lists length. If we want to allow everyone to read a file, we must list all users with read access. This problems can be resolved by use of a condensed version of the access list. To condense the length of the access-control list, many systems recognize three classifications of users in connection with each file: * **Owner**: The user who created the file is the owner. * **Group**: A set of users who are sharing the file and need similar access is a group, or work group. * **Universe**: All other users in the system constitute the universe. The most common recent approach is to combine access-control lists with the more general (and easier to implement) owner, group, and universe access control scheme just described. ### Other Protection Approaches Another approach to the protection problem is to associate a password with each file. If the passwords are chosen randomly and changed often, this scheme may be effective in limiting access to a file. The use of passwords has a few disadvantages, however. First, the number of passwords that a user needs to remember may become large, making the scheme impractical. Second, if only one password is used for all the files, then once it is discovered, all files are accessible; protection is on an all-or-none basis. Some systems allow a user to associate a password with a subdirectory, rather than with an individual file, to deal with this problem. The IBMVM/CMS operating system allows three passwords for a minidisk-one each for read, write, and multi write access. # File-System Structure ## FILE SYSTEM IMPLEMENTATION File are stored on disk. To provide efficient access to the disk, the operating system imposes file systems to allow the data to be stored, and retrieved easily. The file system is generally composed of many different levels. The structure shown in Figure. Each level in the design uses the features of lower levels to create new features for use by higher levels. * Application programs * Logical file system * File-organization module * Basic file system * I/O control * Devices **Layered file system.** The lowest level, the **I/O control**, consists of device drivers to transfer information between the main memory and the disk system. A device driver can be thought of as a translator. Its input consists of high-level commands such as "retrieve block 123." Its output consists of lowlevel, hardware-specific instructions that are used by the hardware controller. The **basic file system** needs only to issue generic commands to the appropriate device driver to read and write physical blocks on the disk. Each physical block is identified by its numeric disk address (for example, drive 1, cylinder 73, track 2, and sector 10). The **file-organization module** knows about files and their logical blocks, as well as physical blocks. By knowing the type of file allocation used and the location of the file, the file-organization module can translate logical block addresses to physical block addresses. Finally, the **logical file system** manages metadata information. Metadata includes all of the file-system structure except the actual data (or contents of the files). It maintains file structure via file-control blocks. A file-control block (FCB) contains information about the file, including ownership, permissions, and location of the file contents. ## File-System implementation In this section, we discuss the structures and operations used to implement file system operations. Several on-disk and in-memory structures are used to implement a file system. * **On disk, structures are** * A boot control block (per volume) can contain information needed by the system to boot an operating system from that volume. If the disk does not contain an operating system, this block can be empty. It is called as boot block; * A volume control block (per volume) contains volume (or partition) details, such as the number of blocks in the partition, size of the blocks, free block count and free-block pointers, and free FCB count and FCB pointers. this is called as superblock. * A directory structure per file system is used to organize the files. This includes file names and associated inode numbers. * A per-file FCB contains many details about the file, including file permissions, ownership, size, and location of the data blocks. * **The in-memory structures may include the ones described below:** * An in-memory mount table contains information about each mounted volume. * An in-memory directory-structure cache holds the directory information of recently accessed directories. * The system-wide open-file table contains a copy of the FCB of each open file, as well as other information. * The per-process open-file table contains a pointer to the appropriate entry in the system-wide open-file table, as well as other information. To create a new file, an application program calls the logical file system. The logical file system knows the format of the directory structures. To create a new file, it allocates a new FCB. The system then reads the appropriate directory into memory, updates it with the new file name and FCB, and writes it back to the disk. ## Directory implementation The selection of directory-allocation and directory-management algorithms significantly affects the efficiency, performance, and reliability of the file system. In this section, we discuss various algorithms. ### Linear List The simplest method of implementing a directory is to use a linear list of file names with pointers to the data blocks. This method is simple to program but time-consuming to execute. To create a new file., we must first search the directory to be sure that no existing file has the same name. Then, we add a new entry at the end of the directory. To delete a file, we search the directory for the named file, then release the space allocated to it. The real disadvantage of a linear list of directory entries is that finding a file requires a linear search. ### Hash Table Another data structure used for a directory is a hash table. With this method, a hash data structure is used. The hash table takes a value computed from the file name and returns a pointer to the file name in the linear list. Therefore, it can greatly decrease the directory search time. Insertion and deletion are also fairly straightforward, although some provision must be made for collisions-situations in which two file names hash to the same location. a chained-overflow hash table can be used. Each hash entry can be a linked list instead of an individual value, and we can resolve collisions by adding the new entry to the linked list. ## Allocation Methods In this section we mainly discuss about how to allocate space to the files. Three major methods of allocating disk space are in use: contiguous, linked, and indexed. ### Contiguous Allocation Contiguous allocation requires that each file occupy a set of contiguous blocks on the disk. Disk addresses define a linear ordering on the disk. With this ordering, accessing block b + 1 after block bnormally requires no head movement. When head movement is needed, the head need only move from one track to the next. Thus, the number of disk seeks required for accessing contiguously allocated files is minimal. The IBM VM/CMS operating system uses contiguous allocation. The directory entry for each file indicates the address of the starting block and the length of the area allocated for this file. | File | Start | Length | | ----------- | ----- | ------ | | count | 0 | 2 | | tr | 2 | 3 | | mail | 19 | 6 | | tr | 28 | 4 | | list | 28 | 4 | **Contagious Allocation of Disk Space** Accessing a file that has been allocated contiguously is easy. For sequential access, the file system remembers the disk address of the last block referenced and, reads the next block. For direct access to block i of a file that starts at block b, we can immediately access block b + i. Thus, both sequential and direct access can be supported by contiguous allocation. Contiguous allocation has some problems. One difficulty is finding space for a new file. Second suffer from the problem of external fragmentation. Another problem is determining how much space is needed for a file, When the file is created. To minimize these drawbacks, some operating systems use a modified contiguous-allocation scheme. Here, a contiguous chunk of space is allocated initially; and then, if that amount proves not to be large enough, another chunk of contiguous space, known as an extent, is added. The location of a file's blocks is then recorded as a location and a block count, plus a link to the first block of the next extent. ### Linked Allocation Linked allocation solves all problems of contiguous allocation. With linked allocation, each file is a linked list of disk blocks; the disk blocks may be scattered anywhere on the disk. The directory contains a pointer to the first and last blocks of the file. There is no external fragmentation with linked allocation, and any free block on the free-space list can be used to satisfy a request. The size of a file need not be declared when that file is created. ``` directory └── jeep ├── start └── end └── 0 └── 1 └── 5 └── 2 └── 7 └── 3 └── 4 └── 1 └── 8 └── 9 └── 10 └── 2 └── 1 └── 12 └── 13 └── 14 └── 1 └── 16 └── 17 └── 18 └── 19 └── 20 └── 21 └── 22 └── 23 └── 24 └── 25 └── 26 └── 27 └── 28 └── 29 └── 30 └── 31 ``` **Linked Allocation** The major problem is that it can be used effectively only for sequential-access files. To find the ith block of a file, we must start at the beginning of that file and follow the pointers until we get to the ith block. Another disadvantage is the space required for the pointers. If a pointer requires 4 bytes out of a 512-byte block, then 0.78 percent of the disk is being used for pointers, rather than for information. The usual solution to this problem is to collect blocks into multiples, called clusters, and to allocate clusters rather than blocks. This approach is increases internal fragmentation, because more space is wasted when a cluster is partially full than when a block is partially full. Another problem of linked allocation is reliability, because files are linked together by pointers scattered all over the disk, and pointers are lost or damaged. ### Indexed Allocation Linked allocation solves the external-fragmentation