




Operating Systems: Three Easy Pieces

Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau
University of Wisconsin–Madison

© 2008–23 by Arpaci-Dusseau Books, LLC
All rights reserved

To Vedat S. Arpaci, a lifelong inspiration

Preface

To Everyone

Welcome to this book! We hope you'll enjoy reading it as much as we enjoyed writing it. The book is called Operating Systems: Three Easy Pieces (available at http://ostep.org), and the title is obviously an homage to one of the greatest sets of lecture notes ever created, by one Richard Feynman on the topic of Physics [F96]. While this book will undoubtedly fall short of the high standard set by that famous physicist, perhaps it will be good enough for you in your quest to understand what operating systems (and more generally, systems) are all about.

The three easy pieces refer to the three major thematic elements the book is organized around: virtualization, concurrency, and persistence. In discussing these concepts, we'll end up discussing most of the important things an operating system does; hopefully, you'll also have some fun along the way. Learning new things is fun, right? At least, it (usually) should be.

Each major concept is divided into a set of chapters, most of which present a particular problem and then show how to solve it. The chapters are short, and try (as best as possible) to reference the source material where the ideas really came from. One of our goals in writing this book is to make the paths of history as clear as possible, as we think that helps a student understand what is, what was, and what will be more clearly. In this case, seeing how the sausage was made is nearly as important as understanding what the sausage is good for[1].

[1] Hint: eating! Or if you're a vegetarian, running away from.

There are a couple of devices we use throughout the book which are probably worth introducing here. The first is the crux of the problem. Anytime we are trying to solve a problem, we first try to state what the most important issue is; such a crux of the problem is explicitly called out in the text, and hopefully solved via the techniques, algorithms, and ideas presented in the rest of the text.

In many places, we'll explain how a system works by showing its behavior over time. These timelines are at the essence of understanding; if you know what happens, for example, when a process page faults, you are on your way to truly understanding how virtual memory operates. If you comprehend what takes place when a journaling file system writes a block to disk, you have taken the first steps towards mastery of storage systems.

There are also numerous asides and tips throughout the text, adding a little color to the mainline presentation. Asides tend to discuss something relevant (but perhaps not essential) to the main text; tips tend to be general lessons that can be applied to systems you build. An index at the end of the book lists all of these tips and asides (as well as cruces, the odd plural of crux) for your convenience.

We use one of the oldest didactic methods, the dialogue, throughout the book, as a way of presenting some of the material in a different light. These are used to introduce the major thematic concepts (in a peachy way, as we will see), as well as to review material every now and then. They are also a chance to write in a more humorous style. Whether you find them useful, or humorous, well, that's another matter entirely.

At the beginning of each major section, we'll first present an abstraction that an operating system provides, and then work in subsequent chapters on the mechanisms, policies, and other support needed to provide the abstraction. Abstractions are fundamental to all aspects of Computer Science, so it is perhaps no surprise that they are also essential in operating systems.

Throughout the chapters, we try to use real code (not pseudocode) where possible, so for virtually all examples, you should be able to type them up yourself and run them. Running real code on real systems is the best way to learn about operating systems, so we encourage you to do so when you can. We are also making code available for your viewing pleasure[2].

[2] https://github.com/remzi-arpacidusseau/ostep-code
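To give a taste of what "typing it up and running it" looks like, here is a minimal sketch in the spirit of the book's first code example (listed in the figures as "Simple Example: Code That Loops And Prints (cpu.c)"). This is our own self-contained approximation for illustration, not the repository's exact code; the real cpu.c uses a Spin() helper from a common header. Compile with gcc and run it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[]) {
        if (argc != 2) {
            fprintf(stderr, "usage: cpu <string>\n");
            exit(1);
        }
        char *str = argv[1];
        // Print the string once per second, forever. Run two or more
        // copies at the same time to preview the first piece's question:
        // how does one CPU appear to run many programs at once?
        while (1) {
            sleep(1);  // crude stand-in for the book's Spin() helper
            printf("%s\n", str);
        }
        return 0;  // never reached
    }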
In various parts of the text, we have sprinkled in a few homeworks to ensure that you are understanding what is going on. Many of these homeworks are little simulations of pieces of the operating system; you should download the homeworks, and run them to quiz yourself. The homework simulators have the following feature: by giving them a different random seed, you can generate a virtually infinite set of problems; the simulators can also be told to solve the problems for you. Thus, you can test and re-test yourself until you have achieved a good level of understanding.

The most important addendum to this book is a set of projects in which you learn about how real systems work by designing, implementing, and testing your own code. All projects (as well as the code examples, mentioned above) are in the C programming language [KR88]; C is a simple and powerful language that underlies most operating systems, and thus worth adding to your tool-chest of languages.

Two types of projects are available (see the online appendix for ideas). The first type is systems programming projects; these projects are great for those who are new to C and UNIX and want to learn how to do low-level C programming. The second type is based on a real operating system kernel developed at MIT called xv6 [CK+08]; these projects are great for students that already have some C and want to get their hands dirty inside the OS. At Wisconsin, we've run the course in three different ways: either all systems programming, all xv6 programming, or a mix of both. We are slowly making project descriptions, and a testing framework, available. See our repository[3] for more information. If not part of a class, this will give you a chance to do these projects on your own, to better learn the material. Unfortunately, you don't have a TA to bug when you get stuck, but not everything in life can be free (but books can be!).

[3] https://github.com/remzi-arpacidusseau/ostep-projects

To Educators

If you are an instructor or professor who wishes to use this book, please feel free to do so. As you may have noticed, the chapters are free and available on-line from the following web page:

http://www.ostep.org

You can also purchase a printed copy from http://lulu.com or http://amazon.com. Look for it on the web page above.

The (current) proper citation for the book is as follows:

Operating Systems: Three Easy Pieces
Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau
Arpaci-Dusseau Books
August, 2018 (Version 1.00) or July, 2019 (Version 1.01) or October, 2023 (Version 1.10)
http://www.ostep.org
The course divides fairly well across a 15-week semester, in which you can cover most of the topics within at a reasonable level of depth. Cramming the course into a 10-week quarter probably requires dropping some detail from each of the pieces. There are also a few chapters on virtual machine monitors, which we usually squeeze in sometime during the semester, either right at the end of the large section on virtualization, or near the end as an aside.

One slightly unusual aspect of the book is that concurrency, a topic at the front of many OS books, is pushed off herein until the student has built an understanding of virtualization of the CPU and of memory. In our experience in teaching this course for nearly 20 years, students have a hard time understanding how the concurrency problem arises, or why they are trying to solve it, if they don't yet understand what an address space is, what a process is, or why context switches can occur at arbitrary points in time. Once they do understand these concepts, however, introducing the notion of threads and the problems that arise due to them becomes rather easy, or at least, easier.

As much as is possible, we use a chalkboard (or whiteboard) to deliver a lecture. On these more conceptual days, we come to class with a few major ideas and examples in mind and use the board to present them. Handouts are useful to give the students concrete problems to solve based on the material. On more practical days, we simply plug a laptop into the projector and show real code; this style works particularly well for concurrency lectures as well as for any discussion sections where you show students code that is relevant for their projects. We don't generally use slides to present material, but have now made a set available for those who prefer that style of presentation.

If you'd like a copy of any of these materials, please drop us an email. We have already shared them with many others around the world, and others have contributed their materials as well.

One last request: if you use the free online chapters, please just link to them, instead of making a local copy. This helps us track usage (millions of chapters downloaded each month) and also ensures students get the latest (and greatest?) version.

To Students

If you are a student reading this book, thank you! It is an honor for us to provide some material to help you in your pursuit of knowledge about operating systems. We both think back fondly towards some textbooks of our undergraduate days (e.g., Hennessy and Patterson [HP90], the classic book on computer architecture) and hope this book will become one of those positive memories for you.

You may have noticed this book is free and available online[4]. There is one major reason for this: textbooks are generally too expensive. This book, we hope, is the first of a new wave of free materials to help those in pursuit of their education, regardless of which part of the world they come from or how much they are willing to spend for a book. Failing that, it is one free book, which is better than none.

We also hope, where possible, to point you to the original sources of much of the material in the book: the great papers and persons who have shaped the field of operating systems over the years.
Ideas are not pulled out of the air; they come from smart and hard-working people (including numerous Turing-award winners[5]), and thus we should strive to celebrate those ideas and people where possible. In doing so, we hopefully can better understand the revolutions that have taken place, instead of writing texts as if those thoughts have always been present [K62]. Further, perhaps such references will encourage you to dig deeper on your own; reading the famous papers of our field is certainly one of the best ways to learn.

[4] A digression here: "free" in the way we use it here does not mean open source, and it does not mean the book is not copyrighted with the usual protections – it is! What it means is that you can download the chapters and use them to learn about operating systems. Why not an open-source book, just like Linux is an open-source kernel? Well, we believe it is important for a book to have a single voice throughout, and have worked hard to provide such a voice. When you're reading it, the book should kind of feel like a dialogue with the person explaining something to you. Hence, our approach.

[5] The Turing Award is the highest award in Computer Science; it is like the Nobel Prize, except that you have never heard of it.

Acknowledgments

This section will contain thanks to those who helped us put the book together. The important thing for now: your name could go here! But, you have to help. So send us some feedback and help debug this book. And you could be famous! Or, at least, have your name in some book.

The people who have helped so far include: Aaron Gember (Colgate), Aashrith H Govindraj (USF), Abdallah Ahmed, Abhinav Mehra, Abhinay Reddy, Abhirami Senthilkumaran*, Abhishek Bhattacherjee (NITR), Adam Drescher* (WUSTL), Adam Eggum, Adam Morrison, Aditya Venkataraman, Adriana Iamnitchi and class (USF), Ahmad Jarara, Ahmed Fikri*, Ajaykrishna Raghavan, Akiel Khan, Alain Clark (github), Alex Curtis, Alex Giorev, Alex Wyler, Alex Zhao (U. Colorado at Colorado Springs), Alexander Nordin (MIT), Ali Razeen (Duke), Alistair Martin, AmirBehzad Eslami, Anand Mundada, Andrei Bozântan, Andrew Mahler, Andrew Moss, Andrew Valencik (Saint Mary's), Angela Demke Brown (Toronto), Antonella Bernobich (UoPeople)*, Arek Bulski, Arun Rajan (Stonybrook), Aryan Arora, Axel Solis Trompler (KIT), B. Brahmananda Reddy (Minnesota), Bala Subrahmanyam Kambala, Bart Miller (Wisconsin), Basti Ortiz (github), Ben Kushigian (U. Mass), Benita Bose, Benjamin Wilhelm (Konstanz), Bill Yang, Biswajit Mazumder (Clemson), Bo Liu (UCSD), Bobby Jack, Björn Lindberg, Brandon Harshe (U. Minn), Brennan Payne, Brian Gorman, Brian Kroth, Calder White (University of Waterloo), Caleb Sumner (Southern Adventist), Cara Lauritzen, Charlotte Kissinger, Chen Huo (Shippensburg University), Chris Simionovici (U. Toronto), Cheng Su, Chien Chi, Chien-Chung Shen (Delaware)*, chriskorosu (github), Christian Stober, Christoph Jaeger (HTWK Leipzig), C.J. Stanbridge (Memorial U. of Newfoundland), Cody Hanson, Constantinos Georgiades, Dakota Cookenmaster (Southern Adventist), Dakota Crane (U. Washington-Tacoma), Dan Soendergaard (U. Aarhus),
Dan Tsafrir (Technion), Daniel J Williams (IBM)*, Daniela Ferreira Franco Moura, Danilo Bruschi (Universita Degli Studi Di Milano), Darby Asher Noam Haller, David Hanle (Grinnell), David Hartman, Dawn Flood, Deepika Muthukumar, Demir Delic, Dennis Zhou, Dheeraj Shetty (North Carolina State), Diego Oleiarz (FaMAF UNC), dominikw1 (github), Dominic White, Dorian Arnold (New Mexico), Dustin Metzler, Dustin Passofaro, Dylan Kaplan, Eduardo Stelmaszczyk, Efkan S. Goktepe, Emad Sadeghi, Emil Hessman, Emily Jacobson, Emmett Witchel (Texas), Eric Freudenthal (UTEP), Erik Hjelmås, Eric Kleinberg, Eric Johansson, Erik Turk, Ernst Biersack (France), Ethan Wood, Evan Leung, Fangjun Kuang (U. Stuttgart), Feng Zhang (IBM), Finn Kuusisto*, Francesco Piccoli, Gavin Inglis (Notre Dame), Gia Hoang Tran, Giovanni Di Santi, Giovanni Lagorio (DIBRIS), Giovanni Moricca, Glenn Bruns (CSU Monterey Bay), Glen Granzow (College of Idaho), Greggory Van Dycke, Guilherme Baptista, gurugio (github), Tian Guo (WPI), Hamid Reza Ghasemi, Hao Chen, Hao Liu, Hein Meling (Stavanger), Helen Gaiser (HTWG Konstanz), Henry Abbey, Hilmar Gústafsson (Aalborg University), Holly Johnson (USF), Hrishikesh Amur, Huanchen Zhang*, Huseyin Sular, Hugo Diaz, Hyrum Carroll (Columbus State), Ilya Oblomkov, Itai Hass (Toronto), Jack Xu (Wisconsin), Jackson "Jake" Haenchen (Texas), Jacob Levinson (UCLA), Jaeyoung Cho (SKKU), Jagannathan Eachambadi, Jake Gillberg, Jakob Olandt, James Earley, James Perry (U. Michigan-Dearborn)*, Jan Reineke (Universität des Saarlandes), Jason MacLafferty (Southern Adventist), Jason Waterman (Vassar), Jay Lim, Jerod Weinman (Grinnell), Jersey Deng (UCLA), Jhih-Cheng Luo, Jiao Dong (Rutgers), Jia-Shen Boon, Jiawen Bao, Jidong Xiao (Boise State), Jingxin Li, jmcruzGH (github), Joe Jean (NYU), Joel Kuntz (Saint Mary's), Joel Hassan (U. Helsinki), Joel Sommers (Colgate), John Brady (Grinnell), John Komenda, John McEachen (NPS), Jonathan Perry (MIT), Joshua Carpenter (NCSU), Josip Cavar, Jun He, Justinas Petuchovas, Kai Mast (Wisconsin), Karl Schultheisz, Karl Wallinger, Kartik Singhal, Katherine Dudenas, Katie Coyle (Georgia Tech), Kaushik Kannan, Kemal Bıçakcı, Kevin Liu*, Khaled Emara, Kyle Hale (Illinois Institute of Technology), KyoungSoo Park (KAIST), Kyutae Lee*, Lanyue Lu, Laura Xu, legate (github), Lei Tian (U. Nebraska-Lincoln), Leonardo Medici (U. Milan), Leslie Schultz, Liang Yin, Lihao Wang, Looserof, Louie Lu, Luigi Finetti (FaMAF, UNC), Luna Gal (Wooster), lyazj (github), Manav Batra (IIIT-Delhi), Manu Awasthi (Samsung), Marcel van der Holst, Marco Guazzone (U. Piemonte Orientale), Marco Pedrinazzi (Universita Degli Studi Di Milano), Marius Rodi, Mart Oskamp, Martha Ferris, Masashi Kishikawa (Sony), Matt Reichoff, Matías De Pascuale (FaMAF Universidad Nacional De Cordoba), Matthew Prinz, Matthias St. Pierre, Mattia Monga (U. Milan), Matty Williams, Megan Cutrofello, Meng Huang, Michael Machtel (Hochschule Konstanz), Michael Walfish (NYU), Michael Wu (UCLA), Mike Griepentrog, Ming Chen (Stonybrook), Mohammed Alali (Delaware), Mohamed Omran (GUST), Mondo Gao (Wisconsin), Muhammad Mobeen Movania, Muhammad Yasoob Ullah Khalid, Murat Kerim Aslan, Murugan Kandaswamy, Nadeem Shaikh, Nan Xiao, Natasha Eilbert, Natasha Stopa, Nathan Dipiazza, Nathan Sullivan, Neeraj Badlani (N.C. State), Neil Perry, Nelson Gomez, Neven Sajko, Nghia Huynh (Texas), Ngu (Nicholas) Q. Truong,
Nicholas Mandal, Nick Weinandt, Nishin Shouzab, Nizare Leonetti, Noah Jackson, Otto Sievert, Patel Pratyush Ashesh (BITS-Pilani), Patricio Jara, Patrick Elsen, Patrizio Palmisano, Pavle Kostovic, Perry Kivolowitz, Peter Peterson (Minnesota), Phani Karan, Pieter Kockx, Po-Hao Su (Taiwan), Prabhsimrandeep Singh, Prairie Rose Goodwin, Radford Smith, Repon Kumar Roy, Reynaldo H. Verdejo Pinochet, Riccardo Mutschlechner, Rick Perry, Richard Campanha (Georgia Tech), Ripudaman Singh, Rita Pia Folisi, Robert Ordóñez and class (Southern Adventist), Robin Li (Cornell), Roger Wattenhofer (ETH), Rohan Das (Toronto)*, Rohan Pasalkar (Minnesota), Rohan Puri, Ross Aiken, Ruslan Kiselev, Ryland Herrick, Sam Kelly, Sam Noh (UNIST), Sameer Punjal, Samer Al-Kiswany, Sandeep Ummadi (Minnesota), Santiago Marini, Sankaralingam Panneerselvam, Sarvesh Tandon (Wisconsin), Satish Chebrolu (NetApp), Satyanarayana Shanmugam*, Scott Catlin, Scott Lee (UCLA), Scott Schoeller, Seth Pollen, Sharad Punuganti, Shawn Ge, Shivaram Venkataraman, Shreevatsa R., Simon Pratt (University of Waterloo), Sirui Chen, Sivaraman Sivaraman*, Song Jiang (Wayne State), Spencer Harston (Weber State), Srinivasan Thirunarayanan*, Stardustman (github), Stefan Dekanski, Stephen Bye, Stephen Schaub, Suriyhaprakhas Balaram Sankari, Sy Jin Cheah, Teri Zhao (EMC), Thanumalayan S. Pillai, thasinaz (github), Thomas Griebel, Thomas Scrace, Tianxia Bai, Tobi Popoola (Boise State), Tong He, Tongxin Zheng, Tony Adkins, Torin Rudeen (Princeton), Tuo Wang, Tyler Couto, Varun Vats, Vegard Stikbakke, Vikas Goel, Waciuma Wanjohi, William Royle (Grinnell), Winson Huang (github), Xiang Peng, Xu Di, Yanyan Jiang, Yifan Hao, Yuanyuan Chen, Yubin Ruan, Yudong Sun, Yue Zhuo (Texas A&M), Yufeng Zhang (UCLA), Yufui Ren, Yuxing Xiang (Peking), Zef RosnBrick, Zeyuan Hu (Texas), Zhengguang Zhou (Wisconsin), ZiHan Zheng (USTC), zino23 (github), Zuyu Zhang. Special thanks to those marked with an asterisk above, who have gone above and beyond in their suggestions for improvement.

In addition, a hearty thanks to Professor Joe Meehean (Lynchburg) for his detailed notes on each chapter, to Professor Jerod Weinman (Grinnell) and his entire class for their incredible booklets, to Professor Chien-Chung Shen (Delaware) for his invaluable and detailed reading and comments, to Adam Drescher (WUSTL) for his careful reading and suggestions, to Glen Granzow (College of Idaho) for his incredibly detailed comments and tips, Michael Walfish (NYU) for his enthusiasm and detailed suggestions for improvement, Peter Peterson (UMD) for his many bits of useful feedback and commentary, Mark Kampe (Pomona) for detailed criticism (we only wish we could fix all suggestions!), and Youjip Won (Hanyang) for his translation work into Korean(!) and numerous insightful suggestions, to Terence Kelly for his sidebar on memory mapping. All have helped these authors immeasurably in the refinement of the materials herein.

A special thank you to Professor Peter Reiher (UCLA) for writing a wonderful set of security chapters, all in the style of this book. We had the fortune of meeting Peter many years ago, and little did we know that we would collaborate in this fashion two decades later. Amazing work!

Also, many thanks to the hundreds of students who have taken 537 over the years.
In particular, the Fall '08 class who encouraged the first written form of these notes (they were sick of not having any kind of textbook to read — pushy students!), and then praised them enough for us to keep going (including one hilarious "ZOMG! You should totally write a textbook!" comment in our course evaluations that year).

A great debt of thanks is also owed to the brave few who took the xv6 project lab course, much of which is now incorporated into the main 537 course. From Spring '09: Justin Cherniak, Patrick Deline, Matt Czech, Tony Gregerson, Michael Griepentrog, Tyler Harter, Ryan Kroiss, Eric Radzikowski, Wesley Reardan, Rajiv Vaidyanathan, and Christopher Waclawik. From Fall '09: Nick Bearson, Aaron Brown, Alex Bird, David Capel, Keith Gould, Tom Grim, Jeffrey Hugo, Brandon Johnson, John Kjell, Boyan Li, James Loethen, Will McCardell, Ryan Szaroletta, Simon Tso, and Ben Yule. From Spring '10: Patrick Blesi, Aidan Dennis-Oehling, Paras Doshi, Jake Friedman, Benjamin Frisch, Evan Hanson, Pikkili Hemanth, Michael Jeung, Alex Langenfeld, Scott Rick, Mike Treffert, Garret Staus, Brennan Wall, Hans Werner, Soo-Young Yang, and Carlos Griffin (almost).

Although they do not directly help with the book, our students have taught us much of what we know about systems. We talk with them regularly while they are at Wisconsin, but they do all the real work — and by telling us about what they are doing, we learn new things every week. This list includes the following collection of former Ph.D.s and post-docs: Aishwarya Ganesan, Florentina Popovici, Haryadi S. Gunawi, Joe Meehean, John Bent, Jun He, Kan Wu, Lanyue Lu, Lakshmi Bairavasundaram, Leo Arulraj, Muthian Sivathanu, Nathan Burnett, Nitin Agrawal, Ram Alagappan, Samer Al-Kiswany, Sriram Subramanian, Stephen Todd Jones, Sudarsun Kannan, Suli Yang, Swaminathan Sundararaman, Thanh Do, Thanumalayan S. Pillai, Timothy Denehy, Tyler Harter, Vijay Chidambaram, Vijayan Prabhakaran, Yiying Zhang, Yupu Zhang, Yuvraj Patel, Zev Weiss.

Of course, many other students (undergraduates, masters) and collaborators have co-authored papers with us. We thank them as well; please see our web pages or DBLP to see who they are.

Our graduate students have largely been funded by the National Science Foundation (NSF), the Department of Energy Office of Science (DOE), and by industry grants. We are especially grateful to the NSF for their support over many years, as our research has shaped the content of many chapters herein.

We thank Thomas Griebel, who demanded a better cover for the book. Although we didn't take his specific suggestion (a dinosaur, can you believe it?), the beautiful picture of Halley's comet would not be found on the cover without him.

A final debt of gratitude is also owed to Aaron Brown, who first took this course many years ago (Spring '09), then took the xv6 lab course (Fall '09), and finally was a graduate teaching assistant for the course for two years or so (Fall '10 through Spring '12). His tireless work has vastly improved the state of the projects (particularly those in xv6 land) and thus has helped better the learning experience for countless undergraduates and graduates here at Wisconsin. As Aaron would say (in his usual succinct manner): "Thx."

Final Words

Yeats famously said "Education is not the filling of a pail but the lighting of a fire." He was right but wrong at the same time[6]. You do have to "fill the pail" a bit, and these notes are certainly here to help with that part of your education; after all, when you go to interview at Google, and they ask you a trick question about how to use semaphores, it might be good to actually know what a semaphore is, right?

[6] If he actually said this; as with many famous quotes, the history of this gem is murky.
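Since that interview question comes up this early, here is a minimal sketch of the kind of knowledge it probes, using the POSIX semaphore API (the book covers semaphores properly in Chapter 31). This example is ours, not the book's, and assumes a Linux-like system where you compile with gcc -pthread:

    #include <stdio.h>
    #include <pthread.h>
    #include <semaphore.h>

    sem_t s;                 // initialized to 1 below, so it acts as a lock
    static int counter = 0;

    void *worker(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            sem_wait(&s);    // decrement; blocks if the value is already 0
            counter++;       // critical section: one thread at a time
            sem_post(&s);    // increment; wakes a waiting thread if any
        }
        return NULL;
    }

    int main(void) {
        sem_init(&s, 0, 1);  // middle arg 0: shared among threads, not processes
        pthread_t p1, p2;
        pthread_create(&p1, NULL, worker, NULL);
        pthread_create(&p2, NULL, worker, NULL);
        pthread_join(p1, NULL);
        pthread_join(p2, NULL);
        printf("counter = %d\n", counter);  // prints 2000000 every time
        return 0;
    }

Without the sem_wait()/sem_post() pair, the two threads race on counter and the final value varies from run to run; the gap between "mostly works" and "always works" is what the concurrency piece is all about.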
But Yeats's larger point is obviously on the mark: the real point of education is to get you interested in something, to learn something more about the subject matter on your own and not just what you have to digest to get a good grade in some class. As one of our fathers (Remzi's dad, Vedat Arpaci) used to say, "Learn beyond the classroom".

We created these notes to spark your interest in operating systems, to read more about the topic on your own, to talk to your professor about all the exciting research that is going on in the field, and even to get involved with that research. It is a great field(!), full of exciting and wonderful ideas that have shaped computing history in profound and important ways. And while we understand this fire won't light for all of you, we hope it does for many, or even a few. Because once that fire is lit, well, that is when you truly become capable of doing something great. And thus the real point of the educational process: to go forth, to study many new and fascinating topics, to learn, to mature, and most importantly, to find something that lights a fire for you.

Andrea and Remzi
Married couple
Professors of Computer Science at the University of Wisconsin
Chief Lighters of Fires, hopefully[7]

[7] If this sounds like we are admitting some past history as arsonists, you are probably missing the point. Probably. If this sounds cheesy, well, that's because it is, but you'll just have to forgive us for that.

References

[CK+08] "The xv6 Operating System" by Russ Cox, Frans Kaashoek, Robert Morris, Nickolai Zeldovich. From: http://pdos.csail.mit.edu/6.828/2008/index.html. xv6 was developed as a port of the original UNIX version 6 and represents a beautiful, clean, and simple way to understand a modern operating system.

[F96] "Six Easy Pieces: Essentials Of Physics Explained By Its Most Brilliant Teacher" by Richard P. Feynman. Basic Books, 1996. This book reprints the six easiest chapters of Feynman's Lectures on Physics, from 1963. If you like Physics, it is a fantastic read.

[HP90] "Computer Architecture a Quantitative Approach" (1st ed.) by David A. Patterson and John L. Hennessy. Morgan-Kaufman, 1990. A book that encouraged each of us at our undergraduate institutions to pursue graduate studies; we later both had the pleasure of working with Patterson, who greatly shaped the foundations of our research careers.

[KR88] "The C Programming Language" by Brian Kernighan and Dennis Ritchie. Prentice-Hall, April 1988. The C programming reference that everyone should have, by the people who invented the language.

[K62] "The Structure of Scientific Revolutions" by Thomas S. Kuhn. University of Chicago Press, 1962. A great and famous read about the fundamentals of the scientific process. Mop-up work, anomaly, crisis, and revolution. We are mostly destined to do mop-up work, alas.

Contents

To Everyone
To Educators
To Students
Acknowledgments
Final Words
References
1 A Dialogue on the Book
2 Introduction to Operating Systems
  2.1 Virtualizing The CPU
  2.2 Virtualizing Memory
  2.3 Concurrency
  2.4 Persistence
  2.5 Design Goals
  2.6 Some History
  2.7 Summary
  References
  Homework

I Virtualization

3 A Dialogue on Virtualization
4 The Abstraction: The Process
  4.1 The Abstraction: A Process
  4.2 Process API
  4.3 Process Creation: A Little More Detail
  4.4 Process States
  4.5 Data Structures
  4.6 Summary
  References
  Homework (Simulation)
5 Interlude: Process API
  5.1 The fork() System Call
  5.2 The wait() System Call
  5.3 Finally, The exec() System Call
  5.4 Why? Motivating The API
  5.5 Process Control And Users
  5.6 Useful Tools
  5.7 Summary
  References
  Homework (Simulation)
  Homework (Code)
6 Mechanism: Limited Direct Execution
  6.1 Basic Technique: Limited Direct Execution
  6.2 Problem #1: Restricted Operations
  6.3 Problem #2: Switching Between Processes
  6.4 Worried About Concurrency?
  6.5 Summary
  References
  Homework (Measurement)
7 Scheduling: Introduction
  7.1 Workload Assumptions
  7.2 Scheduling Metrics
  7.3 First In, First Out (FIFO)
  7.4 Shortest Job First (SJF)
  7.5 Shortest Time-to-Completion First (STCF)
  7.6 A New Metric: Response Time
  7.7 Round Robin
  7.8 Incorporating I/O
  7.9 No More Oracle
  7.10 Summary
  References
  Homework (Simulation)
8 Scheduling: The Multi-Level Feedback Queue
  8.1 MLFQ: Basic Rules
  8.2 Attempt #1: How To Change Priority
  8.3 Attempt #2: The Priority Boost
  8.4 Attempt #3: Better Accounting
  8.5 Tuning MLFQ And Other Issues
  8.6 MLFQ: Summary
  References
  Homework (Simulation)
9 Scheduling: Proportional Share
  9.1 Basic Concept: Tickets Represent Your Share
  9.2 Ticket Mechanisms
  9.3 Implementation
  9.4 An Example
  9.5 How To Assign Tickets?
  9.6 Stride Scheduling
  9.7 The Linux Completely Fair Scheduler (CFS)
  9.8 Summary
  References
  Homework (Simulation)
10 Multiprocessor Scheduling (Advanced)
  10.1 Background: Multiprocessor Architecture
  10.2 Don't Forget Synchronization
  10.3 One Final Issue: Cache Affinity
  10.4 Single-Queue Scheduling
  10.5 Multi-Queue Scheduling
  10.6 Linux Multiprocessor Schedulers
  10.7 Summary
  References
  Homework (Simulation)
11 Summary Dialogue on CPU Virtualization
12 A Dialogue on Memory Virtualization
13 The Abstraction: Address Spaces
  13.1 Early Systems
  13.2 Multiprogramming and Time Sharing
  13.3 The Address Space
  13.4 Goals
  13.5 Summary
  References
  Homework (Code)
14 Interlude: Memory API
  14.1 Types of Memory
  14.2 The malloc() Call
  14.3 The free() Call
  14.4 Common Errors
  14.5 Underlying OS Support
  14.6 Other Calls
  14.7 Summary
  References
  Homework (Code)
15 Mechanism: Address Translation
  15.1 Assumptions
  15.2 An Example
  15.3 Dynamic (Hardware-based) Relocation
  15.4 Hardware Support: A Summary
  15.5 Operating System Issues
  15.6 Summary
  References
  Homework (Simulation)
16 Segmentation
  16.1 Segmentation: Generalized Base/Bounds
  16.2 Which Segment Are We Referring To?
  16.3 What About The Stack?
  16.4 Support for Sharing
  16.5 Fine-grained vs. Coarse-grained Segmentation
  16.6 OS Support
  16.7 Summary
  References
  Homework (Simulation)
17 Free-Space Management
  17.1 Assumptions
  17.2 Low-level Mechanisms
  17.3 Basic Strategies
  17.4 Other Approaches
  17.5 Summary
  References
  Homework (Simulation)
18 Paging: Introduction
  18.1 A Simple Example And Overview
  18.2 Where Are Page Tables Stored?
  18.3 What's Actually In The Page Table?
  18.4 Paging: Also Too Slow
  18.5 A Memory Trace
  18.6 Summary
  References
  Homework (Simulation)
19 Paging: Faster Translations (TLBs)
  19.1 TLB Basic Algorithm
  19.2 Example: Accessing An Array
  19.3 Who Handles The TLB Miss?
  19.4 TLB Contents: What's In There?
  19.5 TLB Issue: Context Switches
  19.6 Issue: Replacement Policy
  19.7 A Real TLB Entry
  19.8 Summary
  References
  Homework (Measurement)
20 Paging: Smaller Tables
  20.1 Simple Solution: Bigger Pages
  20.2 Hybrid Approach: Paging and Segments
  20.3 Multi-level Page Tables
  20.4 Inverted Page Tables
  20.5 Swapping the Page Tables to Disk
  20.6 Summary
  References
  Homework (Simulation)
21 Beyond Physical Memory: Mechanisms
  21.1 Swap Space
  21.2 The Present Bit
  21.3 The Page Fault
  21.4 What If Memory Is Full?
  21.5 Page Fault Control Flow
  21.6 When Replacements Really Occur
  21.7 Summary
  References
  Homework (Measurement)
22 Beyond Physical Memory: Policies
  22.1 Cache Management
  22.2 The Optimal Replacement Policy
  22.3 A Simple Policy: FIFO
  22.4 Another Simple Policy: Random
  22.5 Using History: LRU
  22.6 Workload Examples
  22.7 Implementing Historical Algorithms
  22.8 Approximating LRU
  22.9 Considering Dirty Pages
  22.10 Other VM Policies
  22.11 Thrashing
  22.12 Summary
  References
  Homework (Simulation)
23 Complete Virtual Memory Systems
  23.1 VAX/VMS Virtual Memory
  23.2 The Linux Virtual Memory System
  23.3 Summary
  References
24 Summary Dialogue on Memory Virtualization

II Concurrency

25 A Dialogue on Concurrency
26 Concurrency: An Introduction
  26.1 Why Use Threads?
  26.2 An Example: Thread Creation
  26.3 Why It Gets Worse: Shared Data
  26.4 The Heart Of The Problem: Uncontrolled Scheduling
  26.5 The Wish For Atomicity
  26.6 One More Problem: Waiting For Another
  26.7 Summary: Why in OS Class?
  References
  Homework (Simulation)
27 Interlude: Thread API
  27.1 Thread Creation
  27.2 Thread Completion
  27.3 Locks
  27.4 Condition Variables
  27.5 Compiling and Running
  27.6 Summary
  References
  Homework (Code)
28 Locks
  28.1 Locks: The Basic Idea
  28.2 Pthread Locks
  28.3 Building A Lock
  28.4 Evaluating Locks
  28.5 Controlling Interrupts
  28.6 A Failed Attempt: Just Using Loads/Stores
  28.7 Building Working Spin Locks with Test-And-Set
  28.8 Evaluating Spin Locks
  28.9 Compare-And-Swap
  28.10 Load-Linked and Store-Conditional
  28.11 Fetch-And-Add
  28.12 Too Much Spinning: What Now?
  28.13 A Simple Approach: Just Yield, Baby
  28.14 Using Queues: Sleeping Instead Of Spinning
  28.15 Different OS, Different Support
  28.16 Two-Phase Locks
  28.17 Summary
  References
  Homework (Simulation)
29 Lock-based Concurrent Data Structures
  29.1 Concurrent Counters
  29.2 Concurrent Linked Lists
  29.3 Concurrent Queues
  29.4 Concurrent Hash Table
  29.5 Summary
  References
  Homework (Code)
30 Condition Variables
  30.1 Definition and Routines
  30.2 The Producer/Consumer (Bounded Buffer) Problem
  30.3 Covering Conditions
  30.4 Summary
  References
  Homework (Code)
31 Semaphores
  31.1 Semaphores: A Definition
  31.2 Binary Semaphores (Locks)
  31.3 Semaphores For Ordering
  31.4 The Producer/Consumer (Bounded Buffer) Problem
  31.5 Reader-Writer Locks
  31.6 The Dining Philosophers
  31.7 Thread Throttling
  31.8 How To Implement Semaphores
  31.9 Summary
  References
  Homework (Code)
32 Common Concurrency Problems
  32.1 What Types Of Bugs Exist?
  32.2 Non-Deadlock Bugs
  32.3 Deadlock Bugs
  32.4 Summary
  References
  Homework (Code)
33 Event-based Concurrency (Advanced)
  33.1 The Basic Idea: An Event Loop
  33.2 An Important API: select() (or poll())
  33.3 Using select()
  33.4 Why Simpler? No Locks Needed
  33.5 A Problem: Blocking System Calls
  33.6 A Solution: Asynchronous I/O
  33.7 Another Problem: State Management
  33.8 What Is Still Difficult With Events
  33.9 Summary
  References
  Homework (Code)
34 Summary Dialogue on Concurrency

III Persistence

35 A Dialogue on Persistence
36 I/O Devices
  36.1 System Architecture
  36.2 A Canonical Device
  36.3 The Canonical Protocol
  36.4 Lowering CPU Overhead With Interrupts
  36.5 More Efficient Data Movement With DMA
  36.6 Methods Of Device Interaction
  36.7 Fitting Into The OS: The Device Driver
  36.8 Case Study: A Simple IDE Disk Driver
  36.9 Historical Notes
  36.10 Summary
  References
37 Hard Disk Drives
  37.1 The Interface
  37.2 Basic Geometry
  37.3 A Simple Disk Drive
  37.4 I/O Time: Doing The Math
  37.5 Disk Scheduling
  37.6 Summary
  References
  Homework (Simulation)
38 Redundant Arrays of Inexpensive Disks (RAIDs)
  38.1 Interface And RAID Internals
  38.2 Fault Model
  38.3 How To Evaluate A RAID
  38.4 RAID Level 0: Striping
  38.5 RAID Level 1: Mirroring
  38.6 RAID Level 4: Saving Space With Parity
  38.7 RAID Level 5: Rotating Parity
  38.8 RAID Comparison: A Summary
  38.9 Other Interesting RAID Issues
  38.10 Summary
  References
  Homework (Simulation)
39 Interlude: Files and Directories
  39.1 Files And Directories
  39.2 The File System Interface
  39.3 Creating Files
  39.4 Reading And Writing Files
  39.5 Reading And Writing, But Not Sequentially
  39.6 Shared File Table Entries: fork() And dup()
  39.7 Writing Immediately With fsync()
  39.8 Renaming Files
  39.9 Getting Information About Files
  39.10 Removing Files
  39.11 Making Directories
  39.12 Reading Directories
  39.13 Deleting Directories
  39.14 Hard Links
  39.15 Symbolic Links
  39.16 Permission Bits And Access Control Lists
  39.17 Making And Mounting A File System
  39.18 Summary
  References
  Homework (Code)
40 File System Implementation
  40.1 The Way To Think
  40.2 Overall Organization
  40.3 File Organization: The Inode
  40.4 Directory Organization
  40.5 Free Space Management
  40.6 Access Paths: Reading and Writing
  40.7 Caching and Buffering
  40.8 Summary
  References
  Homework (Simulation)
41 Locality and The Fast File System
  41.1 The Problem: Poor Performance
  41.2 FFS: Disk Awareness Is The Solution
  41.3 Organizing Structure: The Cylinder Group
  41.4 Policies: How To Allocate Files and Directories
  41.5 Measuring File Locality
  41.6 The Large-File Exception
  41.7 A Few Other Things About FFS
  41.8 Summary
  References
  Homework (Simulation)
42 Crash Consistency: FSCK and Journaling
  42.1 A Detailed Example
  42.2 Solution #1: The File System Checker
  42.3 Solution #2: Journaling (or Write-Ahead Logging)
  42.4 Solution #3: Other Approaches
  42.5 Summary
  References
  Homework (Simulation)
43 Log-structured File Systems
  43.1 Writing To Disk Sequentially
  43.2 Writing Sequentially And Effectively
  43.3 How Much To Buffer?
  43.4 Problem: Finding Inodes
  43.5 Solution Through Indirection: The Inode Map
  43.6 Completing The Solution: The Checkpoint Region
  43.7 Reading A File From Disk: A Recap
  43.8 What About Directories?
  43.9 A New Problem: Garbage Collection
  43.10 Determining Block Liveness
  43.11 A Policy Question: Which Blocks To Clean, And When?
  43.12 Crash Recovery And The Log
  43.13 Summary
  References
  Homework (Simulation)
44 Flash-based SSDs
  44.1 Storing a Single Bit
  44.2 From Bits to Banks/Planes
  44.3 Basic Flash Operations
  44.4 Flash Performance And Reliability
  44.5 From Raw Flash to Flash-Based SSDs
  44.6 FTL Organization: A Bad Approach
  44.7 A Log-Structured FTL
  44.8 Garbage Collection
  44.9 Mapping Table Size
  44.10 Wear Leveling
  44.11 SSD Performance And Cost
  44.12 Summary
  References
  Homework (Simulation)
45 Data Integrity and Protection
  45.1 Disk Failure Modes
  45.2 Handling Latent Sector Errors
  45.3 Detecting Corruption: The Checksum
  45.4 Using Checksums
  45.5 A New Problem: Misdirected Writes
  45.6 One Last Problem: Lost Writes
  45.7 Scrubbing
  45.8 Overheads Of Checksumming
  45.9 Summary
  References
  Homework (Simulation)
  Homework (Code)
46 Summary Dialogue on Persistence
47 A Dialogue on Distribution
48 Distributed Systems
  48.1 Communication Basics
  48.2 Unreliable Communication Layers
  48.3 Reliable Communication Layers
  48.4 Communication Abstractions
  48.5 Remote Procedure Call (RPC)
  48.6 Summary
  References
  Homework (Code)
650 49 Sun’s Network File System (NFS) 653 49.1 A Basic Distributed File System................ 654 49.2 On To NFS............................ 655 49.3 Focus: Simple And Fast Server Crash Recovery....... 655 49.4 Key To Fast Crash Recovery: Statelessness......... 656 49.5 The NFSv2 Protocol...................... 657 49.6 From Protocol To Distributed File System.......... 659 49.7 Handling Server Failure With Idempotent Operations... 661 49.8 Improving Performance: Client-side Caching........ 663 49.9 The Cache Consistency Problem............... 663 49.10 Assessing NFS Cache Consistency.............. 665 49.11 Implications On Server-Side Write Buffering........ 665 49.12 Summary............................ 667 References............................... 669 Homework (Measurement)...................... 670 50 The Andrew File System (AFS) 671 50.1 AFS Version 1.......................... 671 50.2 Problems with Version 1.................... 673 50.3 Improving the Protocol.................... 674 50.4 AFS Version 2.......................... 674 50.5 Cache Consistency....................... 676 T HREE © 2008–23, A RPACI -D USSEAU E ASY P IECES xxvi C ONTENTS 50.6 Crash Recovery......................... 678 50.7 Scale And Performance Of AFSv2.............. 679 50.8 AFS: Other Improvements................... 681 50.9 Summary............................ 682 References............................... 683 Homework (Simulation)....................... 684 51 Summary Dialogue on Distribution 685 General Index 687 Asides 699 Tips 703 Cruces 707 O PERATING S YSTEMS WWW. OSTEP. ORG [V ERSION 1.10] List of Figures 2.1 Simple Example: Code That Loops And Prints (cpu.c)... 5 2.2 Running Many Programs At Once................ 6 2.3 A Program That Accesses Memory (mem.c).......... 7 2.4 Running The Memory Program Multiple Times....... 8 2.5 A Multi-threaded Program (threads.c)............ 9 2.6 A Program That Does I/O (io.c)................ 12 4.1 Loading: From Program To Process............... 30 4.2 Process: State Transitions..................... 32 4.3 Tracing Process State: CPU Only................ 32 4.4 Tracing Process State: CPU and I/O............... 33 4.5 The xv6 Proc Structure...................... 34 5.1 Calling fork() (p1.c)...................... 42 5.2 Calling fork() And wait() (p2.c).............. 43 5.3 Calling fork(), wait(), And exec() (p3.c)........ 45 5.4 All Of The Above With Redirection (p4.c).......... 48 6.1 Direct Execution Protocol (Without Limits).......... 58 6.2 Limited Direct Execution Protocol................ 61 6.3 Limited Direct Execution Protocol (Timer Interrupt)..... 67 6.4 The xv6 Context Switch Code.................. 68 7.1 FIFO Simple Example....................... 75 7.2 Why FIFO Is Not That Great................... 75 7.3 SJF Simple Example........................ 76 7.4 SJF With Late Arrivals From B and C.............. 77 7.5 STCF Simple Example...................... 78 7.6 SJF Again (Bad for Response Time)............... 79 7.7 Round Robin (Good For Response Time)........... 79 7.8 Poor Use Of Resources...................... 82 7.9 Overlap Allows Better Use Of Resources............ 82 xxvii xxviii L IST OF F IGURES 8.1 MLFQ Example.......................... 89 8.2 Long-running Job Over Time.................. 90 8.3 Along Came An Interactive Job: Two Examples........ 91 8.4 Without (Left) and With (Right) Priority Boost........ 93 8.5 Without (Left) and With (Right) Gaming Tolerance...... 94 8.6 Lower Priority, Longer Quanta.................. 
9.1 Lottery Scheduling Decision Code
9.2 Lottery Fairness Study
9.3 Stride Scheduling: A Trace
9.4 CFS Simple Example
9.5 CFS Red-Black Tree
10.1 Single CPU With Cache
10.2 Two CPUs With Caches Sharing Memory
10.3 Simple List Delete Code
13.1 Operating Systems: The Early Days
13.2 Three Processes: Sharing Memory
13.3 An Example Address Space
15.1 A Process And Its Address Space
15.2 Physical Memory with a Single Relocated Process
15.3 Dynamic Relocation: Hardware Requirements
15.4 Dynamic Relocation: Operating System Responsibilities
15.5 Limited Direct Execution (Dynamic Relocation) @ Boot
15.6 Limited Direct Execution (Dynamic Relocation) @ Runtime
16.1 An Address Space (Again)
16.2 Placing Segments In Physical Memory
16.3 Segment Register Values
16.4 Segment Registers (With Negative-Growth Support)
16.5 Segment Register Values (with Protection)
16.6 Non-compacted and Compacted Memory
17.1 An Allocated Region Plus Header
17.2 Specific Contents Of The Header
17.3 A Heap With One Free Chunk
17.4 A Heap: After One Allocation
17.5 Free Space With Three Chunks Allocated
17.6 Free Space With Two Chunks Allocated
17.7 A Non-Coalesced Free List
17.8 Example Buddy-managed Heap
18.1 A Simple 64-byte Address Space
18.2 A 64-Byte Address Space In A 128-Byte Physical Memory
18.3 The Address Translation Process
18.4 Example: Page Table in Kernel Physical Memory
18.5 An x86 Page Table Entry (PTE)
18.6 Accessing Memory With Paging
18.7 A Virtual (And Physical) Memory Trace
19.1 TLB Control Flow Algorithm
19.2 Example: An Array In A Tiny Address Space
19.3 TLB Control Flow Algorithm (OS Handled)
19.4 A MIPS TLB Entry
19.5 Discovering TLB Sizes and Miss Costs
20.1 A 16KB Address Space With 1KB Pages
20.2 A Page Table For 16KB Address Space
20.3 Linear (Left) And Multi-Level (Right) Page Tables
20.4 A 16KB Address Space With 64-byte Pages
20.5 A Page Directory, And Pieces Of Page Table
20.6 Multi-level Page Table Control Flow
21.1 Physical Memory and Swap Space
21.2 Page-Fault Control Flow Algorithm (Hardware)
21.3 Page-Fault Control Flow Algorithm (Software)
22.1 Tracing The Optimal Policy
22.2 Tracing The FIFO Policy
22.3 Tracing The Random Policy
22.4 Random Performance Over 10,000 Trials
22.5 Tracing The LRU Policy
22.6 The No-Locality Workload
22.7 The 80-20 Workload
22.8 The Looping Workload
22.9 The 80-20 Workload With Clock
23.1 The VAX/VMS Address Space
23.2 The Linux Address Space
26.1 Single-Threaded And Multi-Threaded Address Spaces
26.2 Simple Thread Creation Code (t0.c)
26.3 Thread Trace (1)
26.4 Thread Trace (2)
26.5 Thread Trace (3)
26.6 Sharing Data: Uh Oh (t1.c)
26.7 The Problem: Up Close and Personal
27.1 Creating a Thread
27.2 Waiting for Thread Completion
27.3 Simpler Argument Passing to a Thread
27.4 An Example Wrapper
28.1 First Attempt: A Simple Flag
28.2 Trace: No Mutual Exclusion
28.3 A Simple Spin Lock Using Test-and-set
28.4 Compare-and-swap
28.5 Load-linked And Store-conditional
28.6 Using LL/SC To Build A Lock
28.7 Ticket Locks
28.8 Lock With Test-and-set And Yield
28.9 Lock With Queues, Test-and-set, Yield, And Wakeup
28.10 Linux-based Futex Locks
29.1 A Counter Without Locks
29.2 A Counter With Locks
29.3 Tracing the Approximate Counters
29.4 Approximate Counter Implementation
29.5 Performance of Traditional vs. Approximate Counters
29.6 Scaling Approximate Counters
29.7 Concurrent Linked List
29.8 Concurrent Linked List: Rewritten
29.9 Michael and Scott Concurrent Queue
29.10 A Concurrent Hash Table
29.11 Scaling Hash Tables
30.1 A Parent Waiting For Its Child
30.2 Parent Waiting For Child: Spin-based Approach
30.3 Parent Waiting For Child: Use A Condition Variable
30.4 Parent Waiting: No State Variable
30.5 Parent Waiting: No Lock
30.6 The Put And Get Routines (v1)
30.7 Producer/Consumer Threads (v1)
30.8 Producer/Consumer: Single CV And If Statement
30.9 Thread Trace: Broken Solution (v1)
30.10 Producer/Consumer: Single CV And While
30.11 Thread Trace: Broken Solution (v2)
30.12 Producer/Consumer: Two CVs And While
30.13 The Correct Put And Get Routines
30.14 The Correct Producer/Consumer Synchronization
30.15 Covering Conditions: An Example
31.1 Initializing A Semaphore
31.2 Semaphore: Definitions Of Wait And Post
31.3 A Binary Semaphore (That Is, A Lock)
31.4 Thread Trace: Single Thread Using A Semaphore
31.5 Thread Trace: Two Threads Using A Semaphore
31.6 A Parent Waiting For Its Child
31.7 Thread Trace: Parent Waiting For Child (Case 1)
31.8 Thread Trace: Parent Waiting For Child (Case 2)
31.9 The Put And Get Routines
31.10 Adding The Full And Empty Conditions
31.11 Adding Mutual Exclusion (Incorrectly)
31.12 Adding Mutual Exclusion (Correctly)
31.13 A Simple Reader-Writer Lock
31.14 The Dining Philosophers
31.15 The get_forks() And put_forks() Routines
31.16 Breaking The Dependency In get_forks()
31.17 Implementing Zemaphores With Locks And CVs
32.1 Bugs In Modern Applications
32.2 Atomicity Violation (atomicity.c)
32.3 Atomicity Violation Fixed (atomicity_fixed.c)
32.4 Ordering Bug (ordering.c)
32.5 Fixing The Ordering Violation (ordering_fixed.c)
32.6 Simple Deadlock (deadlock.c)
32.7 The Deadlock Dependency Graph
33.1 Simple Code Using select()
36.1 Prototypical System Architecture
36.2 Modern System Architecture
36.3 A Canonical Device
36.4 The File System Stack
36.5 The IDE Interface
36.6 The xv6 IDE Disk Driver (Simplified)
37.1 A Disk With Just A Single Track
37.2 A Single Track Plus A Head
37.3 Three Tracks Plus A Head (Right: With Seek)
37.4 Three Tracks: Track Skew Of 2
37.5 Disk Drive Specs: SCSI Versus SATA
37.6 Disk Drive Performance: SCSI Versus SATA
37.7 SSTF: Scheduling Requests 21 And 2
37.8 SSTF: Sometimes Not Good Enough
38.1 RAID-0: Simple Striping
38.2 Striping With A Bigger Chunk Size
38.3 Simple RAID-1: Mirroring
38.4 RAID-4 With Parity
38.5 Full-stripe Writes In RAID-4
38.6 Example: Writes To 4, 13, And Respective Parity Blocks
38.7 RAID-5 With Rotated Parity
38.8 RAID Capacity, Reliability, and Performance
39.1 An Example Directory Tree
39.2 Shared Parent/Child File Table Entries (fork-seek.c)
39.3 Processes Sharing An Open File Table Entry
39.4 Shared File Table Entry With dup() (dup.c)
39.5 The stat structure
40.1 Simplified Ext2 Inode
40.2 File System Measurement Summary
40.3 File Read Timeline (Time Increasing Downward)
40.4 File Creation Timeline (Time Increasing Downward)
41.1 FFS Locality For SEER Traces
41.2 Amortization: How Big Do Chunks Have To Be?
41.3 FFS: Standard Versus Parameterized Placement
42.1 Data Journaling Timeline
42.2 Metadata Journaling Timeline
44.1 A Simple Flash Chip: Pages Within Blocks
44.2 Raw Flash Performance Characteristics
44.3 A Flash-based SSD: Logical Diagram
44.4 SSDs And Hard Drives: Performance Comparison
45.1 Frequency Of LSEs And Block Corruption
48.1 Example UDP Code (client.c, server.c)
48.2 A Simple UDP Library (udp.c)
48.3 Message Plus Acknowledgment
48.4 Message Plus Acknowledgment: Dropped Request
48.5 Message Plus Acknowledgment: Dropped Reply
49.1 A Generic Client/Server System
49.2 Distributed File System Architecture
49.3 Client Code: Reading From A File
49.4 The NFS Protocol: Examples
49.5 Reading A File: Client-side And File Server Actions
49.6 The Three Types Of Loss
49.7 The Cache Consistency Problem
50.1 AFSv1 Protocol Highlights
50.2 Reading A File: Client-side And File Server Actions
50.3 Cache Consistency Timeline
50.4 Comparison: AFS vs. NFS

1 A Dialogue on the Book

Professor: Welcome to this book! It's called Operating Systems in Three Easy Pieces, and I am here to teach you the things you need to know about operating systems. I am called "Professor"; who are you?
Student: Hi Professor! I am called "Student", as you might have guessed. And I am here and ready to learn!
Professor: Sounds good. Any questions?
Student: Sure! Why is it called "Three Easy Pieces"?
Professor: That's an easy one. Well, you see, there are these great lectures on Physics by Richard Feynman...
Student: Oh! The guy who wrote "Surely You're Joking, Mr. Feynman", right? Great book! Is this going to be hilarious like that book was?
Professor: Um... well, no. That book was great, and I'm glad you've read it. Hopefully this book is more like his notes on Physics. Some of the basics were summed up in a book called "Six Easy Pieces". He was talking about Physics; we're going to do Three Easy Pieces on the fine topic of Operating Systems. This is appropriate, as Operating Systems are about half as hard as Physics.
Student: Well, I liked physics, so that is probably good. What are those pieces?
Professor: They are the three key ideas we're going to learn about: virtualization, concurrency, and persistence. In learning about these ideas, we'll learn all about how an operating system works, including how it decides what program to run next on a CPU, how it handles memory overload in a virtual memory system, how virtual machine monitors work, how to manage information on disks, and even a little about how to build a distributed system that works when parts have failed. That sort of stuff.
Student: I have no idea what you're talking about, really.
Professor: Good! That means you are in the right class.
Student: I have another question: what's the best way to learn this stuff?
Professor: Excellent query! Well, each person needs to figure this out on their own, of course, but here is what I would do: go to class, to hear the professor introduce the material. Then, at the end of every week, read these notes, to help the ideas sink into your head a bit better. Of course, some time later (hint: before the exam!), read the notes again to firm up your knowledge. Of course, your professor will no doubt assign some homeworks and projects, so you should do those; in particular, doing projects where you write real code to solve real problems is the best way to put the ideas within these notes into action. As Confucius said...
Student: Oh, I know! "I hear and I forget. I see and I remember. I do and I understand." Or something like that.
Professor: (surprised) How did you know what I was going to say?!
Student: It seemed to follow.
Also, I am a big fan of Confucius, and an even bigger fan of Xunzi, who actually is a better source for this quote[1].
Professor: (stunned) Well, I think we are going to get along just fine! Just fine indeed.
Student: Professor – just one more question, if I may. What are these dialogues for? I mean, isn't this just supposed to be a book? Why not present the material directly?
Professor: Ah, good question, good question! Well, I think it is sometimes useful to pull yourself outside of a narrative and think a bit; these dialogues are those times. So you and I are going to work together to make sense of all of these pretty complex ideas. Are you up for it?
Student: So we have to think? Well, I'm up for that. I mean, what else do I have to do anyhow? It's not like I have much of a life outside of this book.
Professor: Me neither, sadly. So let's get to work!

[1] According to http://www.barrypopik.com (in an entry dated December 19, 2012 and entitled "Tell me and I forget; teach me and I may remember; involve me and I will learn"), Confucian philosopher Xunzi said "Not having heard something is not as good as having heard it; having heard it is not as good as having seen it; having seen it is not as good as knowing it; knowing it is not as good as putting it into practice." Later on, the wisdom got attached to Confucius for some reason. Thanks to Jiao Dong (Rutgers) for telling us!

2 Introduction to Operating Systems

If you are taking an undergraduate operating systems course, you should already have some idea of what a computer program does when it runs. If not, this book (and the corresponding course) is going to be difficult — so you should probably stop reading this book, or run to the nearest bookstore and quickly consume the necessary background material before continuing (both Patt & Patel [PP03] and Bryant & O'Hallaron [BOH10] are pretty great books).

So what happens when a program runs? Well, a running program does one very simple thing: it executes instructions. Many millions (and these days, even billions) of times every second, the processor fetches an instruction from memory, decodes it (i.e., figures out which instruction this is), and executes it (i.e., it does the thing that it is supposed to do, like add two numbers together, access memory, check a condition, jump to a function, and so forth). After it is done with this instruction, the processor moves on to the next instruction, and so on, and so on, until the program finally completes[1]. Thus, we have just described the basics of the Von Neumann model of computing[2].

Sounds simple, right? But in this class, we will be learning that while a program runs, a lot of other wild things are going on with the primary goal of making the system easy to use. There is a body of software, in fact, that is responsible for making it easy to run programs (even allowing you to seemingly run many at the same time), allowing programs to share memory, enabling programs to interact with devices, and other fun stuff like that. That body of software is called the operating system (OS)[3], as it is in charge of making sure the system operates correctly and efficiently in an easy-to-use manner.

[1] Of course, modern processors do many bizarre and frightening things underneath the hood to make programs run faster, e.g., executing multiple instructions at once, and even issuing and completing them out of order! But that is not our concern here; we are just concerned with the simple model most programs assume: that instructions seemingly execute one at a time, in an orderly and sequential fashion.
[2] Von Neumann was one of the early pioneers of computing systems. He also did pioneering work on game theory and atomic bombs, and played in the NBA for six years. OK, one of those things isn't true.
[3] Another early name for the OS was the supervisor or the master control program. Apparently, the latter sounded a little overzealous (see the movie Tron for details) and thus, thankfully, "operating system" caught on instead.
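To make that fetch-decode-execute cycle a bit more concrete, here is a minimal sketch (ours, not a figure from the book) of a Von Neumann-style machine written in C; the tiny two-byte instruction format and the ADD/JMP/HALT opcodes are invented purely for illustration:

    #include <stdio.h>
    #include <stdint.h>

    /* A toy Von Neumann machine: one memory holds both
       instructions and data; the processor loops, fetching,
       decoding, and executing, until it sees HALT. */
    enum { HALT = 0, ADD = 1, JMP = 2 };  /* invented opcodes */

    int main(void) {
        /* each instruction is two bytes: opcode, operand */
        uint8_t mem[] = { ADD, 5, ADD, 3, HALT, 0 };
        int pc = 0;     /* program counter */
        int acc = 0;    /* accumulator */

        for (;;) {
            uint8_t op  = mem[pc];       /* fetch */
            uint8_t arg = mem[pc + 1];
            pc += 2;
            switch (op) {                /* decode, then execute */
            case ADD:  acc += arg; break;
            case JMP:  pc = arg;   break;
            case HALT: printf("acc=%d\n", acc); return 0;
            }
        }
    }

Real processors are, of course, vastly more complicated, but this little loop captures the simple one-instruction-at-a-time model that most programs assume.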
THE CRUX OF THE PROBLEM: HOW TO VIRTUALIZE RESOURCES
One central question we will answer in this book is quite simple: how does the operating system virtualize resources? This is the crux of our problem. Why the OS does this is not the main question, as the answer should be obvious: it makes the system easier to use. Thus, we focus on the how: what mechanisms and policies are implemented by the OS to attain virtualization? How does the OS do so efficiently? What hardware support is needed?
We will use the "crux of the problem", in shaded boxes such as this one, as a way to call out specific problems we are trying to solve in building an operating system. Thus, within a note on a particular topic, you may find one or more cruces (yes, this is the proper plural) which highlight the problem. The details within the chapter, of course, present the solution, or at least the basic parameters of a solution.

The primary way the OS does this is through a general technique that we call virtualization. That is, the OS takes a physical resource (such as the processor, or memory, or a disk) and transforms it into a more general, powerful, and easy-to-use virtual form of itself. Thus, we sometimes refer to the operating system as a virtual machine.

Of course, in order to allow users to tell the OS what to do and thus make use of the features of the virtual machine (such as running a program, or allocating memory, or accessing a file), the OS also provides some interfaces (APIs) that you can call. A typical OS, in fact, exports a few hundred system calls that are available to applications. Because the OS provides these calls to run programs, access memory and devices, and other related actions, we also sometimes say that the OS provides a standard library to applications.

Finally, because virtualization allows many programs to run (thus sharing the CPU), and many programs to concurrently access their own instructions and data (thus sharing memory), and many programs to access devices (thus sharing disks and so forth), the OS is sometimes known as a resource manager.
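To give a small taste of what this "standard library" looks like in practice, here is a minimal sketch (ours, assuming a POSIX system such as Linux, and not code from the book) that uses three classic system calls, open(), read(), and write(), to copy a file's contents to the screen:

    #include <fcntl.h>    /* open()                     */
    #include <unistd.h>   /* read(), write(), close()   */

    int main(int argc, char *argv[]) {
        if (argc != 2)
            return 1;
        /* ask the OS to open the file named on the command line */
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;
        /* read chunks from the file and write them to stdout;
           each of these calls crosses from the program into the OS */
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            (void) write(STDOUT_FILENO, buf, n);
        close(fd);
        return 0;
    }

Run a program like this on Linux under a tool such as strace and you can watch each of these calls trap into the operating system, one by one.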
