Operating Systems (234123): Scheduling
Dan Tsafrir (2024-06-23)

BATCH (NON-PREEMPTIVE) SCHEDULERS

Consider a different context…
In this course, we typically consider a general-purpose OS
– Such as the one running on our own computers
For some of the topics in this lecture to make sense, we need to change the mindset
– (Although some of them nevertheless apply to general-purpose OSes)
The new context: "batch scheduling" on "supercomputers"
– We will explain both of these terms next

Supercomputers
Comprised of hundreds (1990s) to millions (2020s) of CPU cores
– "Nodes" (= computers) are connected via a high-speed network
Used by 100s to 1000s of scientific users
– Physicists, chemists…
Users submit "jobs" (= programs)
– Jobs can be serial (one core) or parallel (many cores)
– A parallel job simultaneously uses N cores to solve a single scientific problem quickly (ideally N times faster)
– N is the "size" of the job (determined by the submitting user)
  Different jobs have different sizes
  For a given running job, the size is fixed throughout its lifetime
– While the program is running, its threads/processes communicate with each other and exchange information
– Typical workload: jobs running from a few seconds to 10s–100s of hours

Think of jobs as rectangles in the cores × time plane
[Figure: each job is a rectangle; its height (cores axis) is its size, its length (time axis) is its runtime]
Terminology:
– Size: jobs are said to be "wide" (= "big") or "narrow" (= "small")
– Runtime: jobs are said to be "short" or "long"

A FCFS schedule example (FCFS = First-Come First-Served)
[Figure: jobs placed in the cores × time plane in arrival order; numbers indicate arrival order]

Batch scheduling
It's important that all the processes of a job run together (as they repeatedly & frequently communicate)
– Each process runs on a different core
– What will happen if they don't run together?
Jobs are sometimes tailored to use much of (if not the entire) physical memory of each individual multicore CPU
– So cores can't easily be shared with other jobs via multiprogramming
Thus, supercomputers typically use "batch scheduling"
– When a job is scheduled to run, it gets its own cores
– Cores are dedicated (not shared)
– Each job runs to completion (until it terminates)
– Only afterwards are the cores allocated to other waiting jobs
– Such scheduling is said to be "non-preemptive"

Wait time, response time, slowdown, utilization, throughput
METRICS TO EVALUATE PERFORMANCE OF BATCH SCHEDULERS

Average wait time & response time
Average wait time
– The "wait time" of a job is the interval between the time the job is submitted and the time the job starts to run:
  waitTime = startTime − submitTime
– The shorter the average wait time, the better the performance
Average response time
– The "response time" of a job is the interval between the time the job is submitted and the time the job terminates:
  responseTime = terminationTime − submitTime
  responseTime = waitTime + runTime
– The shorter the average response time, the better the performance
Wait vs. response
– Users typically care most about response time (they wait for their job to end)
– But batch schedulers primarily affect only the wait time
(A code sketch computing these metrics follows below.)
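The metrics above are easy to compute mechanically. Below is a minimal sketch (not from the lecture; the Job fields and function names are illustrative) that derives average wait and response time from per-job (submit, start, run) records; note how responseTime = waitTime + runTime falls out of the definitions.

```python
# Minimal sketch (not from the lecture): computing the batch-scheduling
# metrics defined above from per-job records. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    submit: float  # submitTime
    start: float   # startTime (decided by the scheduler)
    run: float     # runTime

def avg_wait(jobs):
    # waitTime = startTime - submitTime
    return sum(j.start - j.submit for j in jobs) / len(jobs)

def avg_response(jobs):
    # responseTime = terminationTime - submitTime = waitTime + runTime
    return sum((j.start + j.run) - j.submit for j in jobs) / len(jobs)

# Three jobs submitted together and run back-to-back on one core
# (these happen to be the FCFS numbers used later in the lecture)
jobs = [Job(submit=0, start=0, run=24), Job(0, 24, 3), Job(0, 27, 3)]
print(avg_wait(jobs), avg_response(jobs))  # 17.0 27.0
```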
Avg. wait vs. response time – the connection
Claim
– In our context, we assume that job runtimes (and thus their average) are a given; they stay the same regardless of the scheduler
– Therefore, for batch schedulers, the difference between the average wait time and the average response time of a given schedule is a constant
– The constant is the average runtime of all the jobs
Proof
– For each job i (i = 1, 2, 3, …, N):
  Let Wi be the wait time of job i
  Let Ri be the runtime of job i
  Let Ti be the response time of job i (Ti = Wi + Ri)
– With this notation we have:
  (1/N) Σ_{i=1..N} Ti = (1/N) Σ_{i=1..N} (Wi + Ri) = (1/N) Σ_{i=1..N} Wi + (1/N) Σ_{i=1..N} Ri
  i.e., avg(response) = avg(wait) + avg(runtime)

Average slowdown (aka "expansion factor")
The "slowdown" (or "expansion factor") of a job is
– The ratio between its response time and its runtime
– slowdown = responseTime / runTime = Ti / Ri
– slowdown = (waitTime + runTime) / runTime = (Wi + Ri) / Ri = Wi/Ri + 1
Examples
– If Ri = 1 minute and Wi = 1 minute:
  The job has a slowdown of (Wi + Ri)/Ri = (1 + 1)/1 = 2
  Namely, the job was slowed by a factor of 2 relative to its runtime
– If Ri = 1 hour and Wi = 1 minute, then the slowdown is much smaller:
  (Wi + Ri)/Ri = (1 + 60)/60 ≈ 1.02
  The delay was insignificant relative to the job's runtime
Like wait & response, we aspire to minimize the average slowdown
– slowdown = 1 ⇔ the job was scheduled immediately
– The greater the slowdown, the longer the job was delayed

Utilization & throughput
Utilization (aspire to maximize)
– The percentage of time the resource (the CPU in our case) is busy
– In this example (3 cores over 7 time units, with 6 idle core·time units), the utilization is:
  100 × (3 × 7 − 6) / (3 × 7) ≈ 71%
Throughput (aspire to maximize)
– How much work is done in one time unit
– Examples:
  Typical hard disk throughput: 100 MB/second (sequential access)
  Database server throughput: transactions per second
  Web server throughput: requests per second
  Supercomputer throughput: job completions per second
– In the above example: 4 (jobs) / 7 (time units) ≈ 0.57 jobs per time unit
[Figure: 4 jobs scheduled on 3 cores over time units 0–7]
(A sketch computing these examples follows below.)

Which metric is more important?
Depends on your point of view
– Users typically care about wait/response time and slowdown
– System owners typically care more about utilization & throughput
For users: wait time or slowdown?
– They embody different notions of "fairness":
  Wait time => FCFS, which humans typically consider the fairest
  Slowdown depends on your runtime (a.k.a. service time)
  (cf. the queue-jumper's "just a question": a short request may "deserve" to go first)
What's fairer? There is no definite answer – system owners decide
Performance analysts typically keep an eye on (= evaluate) both
– Notice that:
  The average wait time tends to be dominated by long jobs
  The average slowdown tends to be dominated by short jobs
  Why? Slowdown: easy. Wait time: try to answer after the next few slides
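As a quick sanity check, here is a minimal sketch (not from the lecture) that reproduces the slowdown examples (2 and ≈ 1.02) and the utilization/throughput figures (≈ 71%, ≈ 0.57 jobs per time unit) from the slides above.

```python
# Minimal sketch (not from the lecture): the slowdown and utilization
# examples from the slides above, computed explicitly.

def slowdown(wait, run):
    # slowdown = responseTime / runTime = (waitTime + runTime) / runTime
    return (wait + run) / run

print(slowdown(wait=1, run=1))   # 2.0    -> delayed by a factor of 2
print(slowdown(wait=1, run=60))  # ~1.017 -> delay insignificant (the ~1.02 case)

# Utilization of the pictured schedule: 3 cores x 7 time units, 6 idle
cores, horizon, idle = 3, 7, 6
print(100 * (cores * horizon - idle) / (cores * horizon))  # ~71%

# Throughput: 4 completed jobs over 7 time units
print(4 / 7)  # ~0.57 jobs per time unit
```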
FCFS, EASY, backfilling, RR, SJF
BATCH SCHEDULING EXAMPLES

FCFS (First-Come First-Served) scheduling
Jobs are scheduled by their arrival time
– If there are enough free cores, a newly arriving job starts to run immediately
– Otherwise it waits, sorted by arrival time, until enough cores are freed
Pros:
– Easy to implement (FIFO wait queue)
– Typically perceived as the fairest policy (we tend to dislike the queue-jumper's "I just have a question")
Cons:
– Creates fragmentation (= unutilized cores)
– Small/short jobs might wait for a very long time (we disallow "I just have a question" queue-jumping)
[Figure: FCFS schedule in the cores × time plane; numbers indicate arrival order]

EASY (= FCFS + backfilling) scheduling
The "backfilling" optimization
– A short waiting job may jump ahead of the head of the wait queue
– Provided it doesn't delay the job at the head of the FCFS wait queue
EASY algorithm: whenever a job arrives (= is submitted) or terminates:
1. Try to start the job at the head of the FCFS wait queue
2. Then iterate over the rest of the waiting jobs (in FCFS order) and try to backfill them
[Figure: the same workload scheduled by plain FCFS vs. FCFS + backfilling (= "EASY")]

EASY (= FCFS + backfilling) scheduling
Pros:
– Better utilization (less fragmentation)
– Narrow/short jobs have a better chance to run sooner
Con:
– Must know runtimes in advance
  To know the width of the holes
  To know whether backfill candidates are short enough to fit the holes

EASY (= FCFS + backfilling) scheduling
Backfilling requires users to estimate, upon job submission, how long their job will run
What if a job tries to overrun its estimate?
– It is killed by the system
– This provides an incentive to supply accurate estimates
  Short estimate => better chance to backfill
  Too short => the job will be killed
EASY (and FCFS) are popular
– Many supercomputer schedulers use them by default
BTW, EASY stands for
– Extensible Argonne Scheduling sYstem
– Developed at Argonne National Laboratory (USA) circa 1995

SJF (Shortest-Job First) scheduling
Instead of
– Ordering jobs (or processes) by their arrival time (FCFS)
Order them by
– Their (typically estimated) runtime
Perceived as unfair
– At the extreme: causes starvation (whereby a job waits forever)
But optimal (in a sense) for performance
– As we will see later, using some math
NOTE: theoretical reasoning about job scheduling (queueing theory) is limited
– Hard (impossible?) to do for arbitrary parallel workloads
– Therefore, for theoretical reasoning, we'll assume jobs are serial (job = process)
– Empirically, intuition for serial jobs often also applies to parallel jobs

Average wait time example: FCFS vs. SJF
Assume
– Processes P1, P2, P3 arrive together in the very same second
– Their runtimes are 24, 3, and 3, respectively
– FCFS orders them by their index: P1, P2, P3
– Whereas SJF orders them by their runtime: P2, P3, P1
Then
– The average wait time under FCFS is (0 + 24 + 27)/3 = 17   [schedule: P1 (24), P2 (3), P3 (3)]
– The average wait time under SJF is (0 + 3 + 6)/3 = 3       [schedule: P2 (3), P3 (3), P1 (24)]
SJF seems "better" than FCFS
– In terms of optimizing the average-wait-time metric
(A code sketch reproducing these numbers follows below.)

"Convoy effect"
Slowing down all (possibly short) processes due to currently servicing a very long process
– As we've seen in the previous slide
Does EASY suffer from the convoy effect?
– Yes (why?), but less than FCFS:
– Empirically, when examining real workloads (recordings of the activity of real supercomputers), we find that there are often holes in which short/narrow jobs fit, such that many of them manage to start shortly after they arrive
Does SJF suffer from the convoy effect?
– Yes (why?), but less than FCFS (why?)
Q: Can we eliminate the convoy effect altogether?
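The FCFS-vs-SJF example lends itself to a few lines of code. This minimal sketch (not from the lecture) reproduces the 17 vs. 3 average wait times on a 1-core system where all jobs arrive at time 0.

```python
# Minimal sketch (not from the lecture): the FCFS vs. SJF average-wait
# example above; one core, all jobs arrive at time 0.

def avg_wait(runtimes_in_order):
    wait, waits = 0, []
    for r in runtimes_in_order:
        waits.append(wait)  # each job waits for all jobs scheduled before it
        wait += r
    return sum(waits) / len(waits)

runtimes = {"P1": 24, "P2": 3, "P3": 3}
fcfs = [runtimes[p] for p in ("P1", "P2", "P3")]  # arrival order
sjf = sorted(runtimes.values())                   # shortest job first
print(avg_wait(fcfs))  # (0 + 24 + 27) / 3 = 17.0
print(avg_wait(sjf))   # (0 + 3 + 6) / 3 = 3.0
```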
Optimality of SJF for average wait time
Claim
– Given: a 1-core system where all jobs are serial
– If: (a) all processes arrive together, and (b) their runtimes are known
– Then: the average wait time of SJF is equal to or smaller than the average wait time of any other batch scheduling order S
Proof outline
– Assume the scheduling order S is: P(1), P(2), …, P(n)
– If S isn't SJF, then there exist two adjacent processes P(i), P(i+1) such that R(i) = P(i).runtime > P(i+1).runtime = R(i+1)
– If we swap the scheduling order of P(i) and P(i+1) under S, then:
  we've increased the wait time of P(i) by R(i+1),
  we've decreased the wait time of P(i+1) by R(i),
  and this sums up all the changes that we've introduced
[Figure: positions i and i+1 in the schedule, before and after the swap]
– And since R(i) > R(i+1), the overall average is reduced
– We repeat the above until we reach SJF
(A brute-force check of this claim appears in the sketch below.)

Fairer variants of SJF
Motivation: disallow job starvation
SJBF (Shortest-Job Backfilled First)
– Exactly like EASY in terms of servicing the head of the wait queue in FCFS order (and not allowing anyone to delay it)
– But the backfilling traversal is done in SJF order
LXF (Largest eXpansion Factor)
– Recall that the "slowdown" or "expansion factor" metric of a job is defined to be: slowdown = (waitTime + runTime) / runTime
– LXF is similar to EASY, but instead of ordering the wait queue by FCFS, it orders jobs based on their current slowdown (greater slowdown means higher priority)
– The backfilling activity is done relative to the job with the largest current slowdown (= the head of the LXF wait queue)
– Note that every scheduling decision (when jobs arrive/finish) requires recomputing the slowdowns (because the wait times have changed)
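The exchange argument above can also be checked by brute force. This minimal sketch (not from the lecture) enumerates all scheduling orders of a few random serial jobs on one core and verifies that the shortest-first order attains the minimum average wait time.

```python
# Minimal sketch (not from the lecture): brute-force check of the SJF
# optimality claim. One core, serial jobs, all arriving together.
from itertools import permutations
import random

def avg_wait(order):
    wait, total = 0, 0
    for r in order:
        total += wait  # this job waited for everything scheduled before it
        wait += r
    return total / len(order)

random.seed(1)
runtimes = [random.randint(1, 50) for _ in range(6)]
best = min(avg_wait(p) for p in permutations(runtimes))
# SJF (sorted ascending by runtime) achieves the minimum over all orders
assert abs(best - avg_wait(sorted(runtimes))) < 1e-9
print(sorted(runtimes), best)
```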
RR, selfish RR, negative feedback, multi-level priority queue
PREEMPTIVE SCHEDULERS

Reminder
In a previous lecture, we
– Talked about three process states
[Figure: process-state diagram: the two runnable states, ready and running (ready → running when scheduled, running → ready when preempted), plus sleeping (running → sleeping when the process needs slow service, sleeping → ready when the service is provided)]
In this lecture, we often (e.g., when we want to prove things)
– Focus on only two states: ready & running
Namely, we'll assume that
– Processes only consume CPU
– And don't do I/O (of course, actually, they do, but we ignore that for now)
– They're either running or ready-to-run

Preemption
The act of suspending one job (process) in favor of another
– Even though it is not finished yet
Exercise
– Assume a one-core system
– Assume two processes, each requiring 10 hours of CPU time
– Does it make sense to do preemption (say, every few milliseconds)?
When would we want a scheduler to be preemptive?
1. When responsiveness to the user matters (they actively wait for the output of the program), and
2. When runtimes vary (some jobs are shorter than others)
Examples
– Two processes: one needs 10 hours of CPU time, the other needs 10 seconds (and a user is actively waiting for it to complete)
– Two processes: one needs 10 hours of CPU time, the other is a word processor (like MS Word or Emacs) that has just been awakened because the user pressed a key

Quantum
The maximal amount of time a process is allowed to run before it is preempted
– Typically milliseconds to tens of milliseconds in general-purpose OSes (like Linux or Windows)
The quantum is oftentimes set per-process
– Processes that behave differently get different quanta; e.g., in Solaris:
  A CPU-bound process gets long quanta but with low priority
  Whereas an I/O-bound process gets short quanta with high priority
– In Linux, the process "nice" value affects the quantum duration
  "nice" is a system call and a shell utility which allow one process to be "nicer" to the other processes in terms of scheduling (see man)
  We'll see this later

Performance metrics for preemptive schedulers
Wait time (try to minimize)
– Same as in batch (non-preemptive) systems
Response time (or "turnaround time"; try to minimize)
– Like in batch systems, stands for the time from process submission to process completion
– But unlike batch systems, instead of responseTime = waitTime + runTime
– We have responseTime ≥ waitTime + runTime
– Because processes can be preempted, and context switches have a price
Overhead (try to minimize); consists of
– How long a context switch takes, and how often context switches occur
Utilization & throughput (try to maximize)
– Same as in batch (non-preemptive) systems
– Although we may want to account for the context-switching overhead

RR (Round-Robin) scheduling
Processes are arranged in a cyclic ready-queue
– The head process runs until its quantum is exhausted
– The head process is then preempted (suspended)
– The scheduler resumes the next process in the circular list
– When we've cycled through all the processes in the run-list (and we reach the head process again), we say that the current "epoch" is over and the next epoch begins
Requires a timer interrupt
– Typically a periodic interrupt (fires every few milliseconds)
– Upon receiving the interrupt, the OS checks if it's time to preempt
Features
– For a small enough quantum, it's as if each of the N processes advances at 1/N of the speed of the core (sometimes called "virtual time")
– With a huge (infinite) quantum, RR becomes FCFS
(A small RR simulation sketch follows below.)
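Here is a minimal round-robin sketch (not from the lecture): one core, all processes arriving at time 0, a fixed quantum, and zero context-switch cost. With a huge quantum it degenerates to FCFS, as the slide notes.

```python
# Minimal sketch (not from the lecture): round-robin on one core with a
# fixed quantum and zero context-switch cost; all processes arrive at time 0.
from collections import deque

def rr_response_times(runtimes, quantum):
    remaining = dict(enumerate(runtimes))
    ready = deque(remaining)           # cyclic ready-queue; head runs first
    clock, response = 0, {}
    while ready:
        pid = ready.popleft()
        slice_ = min(quantum, remaining[pid])
        clock += slice_
        remaining[pid] -= slice_
        if remaining[pid] == 0:
            response[pid] = clock      # arrived at 0, so response = finish time
        else:
            ready.append(pid)          # preempted: back to the tail
    return [response[pid] for pid in sorted(response)]

print(rr_response_times([24, 3, 3], quantum=1))      # [30, 8, 9]
print(rr_response_times([24, 3, 3], quantum=10**9))  # huge quantum -> FCFS: [24, 27, 30]
```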
Is there RR in parallel systems?
Yes! It's called "gang scheduling"
– Time is divided into slots (seconds, or minutes)
– Every job has a "native" time slot
– The algorithm attempts to fill holes in time slots by assigning to them jobs from other native slots (called "alternative" slots)
  Challenge: to keep contiguous chunks of free cores
  Uses a "buddy system" algorithm
  http://www.cs.huji.ac.il/~feit/parsched/jsspp96/p-96-6.pdf
– The algorithm attempts to minimize the number of slots, using slot unification when possible
– Why might gang scheduling be useful?
  We don't need to know runtimes in advance (but what about memory?)
– Supported by most commercial supercomputer schedulers
  However, it seems to be rarely used:
  If lots of memory is "swapped out" upon a context switch, context switching will take a very long time… (so it won't be worth it)

Gang scheduling – example
[Figure: RR scheduling of jobs in time slots, shown as a CPUs × time-slots matrix]

Gang scheduling – alternative slots
[Figure: a job also runs in a hole in another job's time slot (an "alternative" slot)]

Gang scheduling – native slot unification
[Figure: after a job terminates, two half-empty slots are unified into one]

Price of preemption: example
Assume
– One core, a 1-second quantum
– 10 processes, each requiring 100 seconds of CPU time
Assuming no context-switch overhead (a switch takes zero time)
– Then the "makespan" metric (time to completion of all processes) is 10 × 100 = 1000 seconds
– Both for FCFS and for RR
Assume a context switch takes 1 second
– The makespan of FCFS is 10 (proc) × 100 (sec) + 9 (ctx-sw) × 1 (sec) = 1009
– Whereas the makespan of RR is 10 (proc) × 100 (sec) + (10 × 100 − 1) (ctx-sw) × 1 (sec) = 1999
  (one switch after every 1-second quantum except the very last)
Shorter quantum
– Potentially quicker response/wait time (why?), but a higher overhead price
(The arithmetic is spelled out in the sketch after this subsection.)

Batching vs. preemption
Claim
– Let avgResp(X) be the average response time under algorithm X
– Assume a single-core system, and that all processes arrive together
– Assume X is a preemptive algorithm with context-switch price = 0
– Then there exists a non-preemptive algorithm Y such that avgResp(Y) ≤ avgResp(X)
Proof outline
1. Let Pk be the last preempted process to finish computing
   (the schedule looks like: … Pk … Pr … Pk … Pr … Pk)
2. Compress all of Pk's quanta to the "right" (assume time progresses left to right), such that the last quantum remains where it is and all the other quanta move right towards it
   (the schedule becomes: Pr … Pr … Pk Pk Pk)
   (Pk's response time didn't change, and the response times of the other processes improved or stayed the same)
3. Go to 1 (until no preempted processes are found)

Aftermath
Corollary: based on (1) the previous slide and (2) the proof of SJF optimality from a few slides ago
– SJF is also optimal relative to preemptive schedulers (that meet our assumptions):
  avgResponse(SJF) ≤ avgResponse(any batch or preemptive scheduler)
So why do we use preemptive schedulers?
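The price-of-preemption arithmetic, spelled out. This minimal sketch (not from the lecture) counts the context switches for FCFS and for 1-second-quantum RR under the stated assumptions (a switch after every quantum except the last).

```python
# Minimal sketch (not from the lecture): the context-switch price example.
# One core, 1-second quantum, 10 processes x 100 seconds of CPU each,
# and a context switch that costs 1 second.
procs, cpu_per_proc, ctx = 10, 100, 1

work = procs * cpu_per_proc            # 1000 s of useful work in both cases

# FCFS: one switch between consecutive processes -> 9 switches
fcfs_makespan = work + (procs - 1) * ctx
print(fcfs_makespan)                   # 1009

# RR with quantum=1: a switch after every 1-second quantum except the last
quanta = procs * cpu_per_proc          # 1000 quanta in total
rr_makespan = work + (quanta - 1) * ctx
print(rr_makespan)                     # 1999
```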
Connection between RR and SJF
Claim
– Assume:
  A 1-core system
  All processes arrive together (and only use CPU, no I/O)
  The quantum is identical for all processes + there is no context-switch overhead
– Then avgResponse(RR) ≤ 2 × avgResponse(SJF)   // RR is at most 2× "slower"
Notation / definitions:
– Processes P1, P2, …, PN
– Ri = runtime of Pi
– Wi = wait time of Pi
– Ti = Wi + Ri = response time of Pi
– delay(i,k) = the duration by which Pi delayed Pk (how long Pk waited due to Pi)
– delay(i,i) = Ri
– Note that Tk = Σ_{i=1..N} delay(i,k)

Connection between RR and SJF
Proof outline
– For any scheduling algorithm A:
  N · avgResponse(A) = Σ_{k=1..N} Tk
                     = Σ_{i=1..N} Σ_{j=1..N} delay_A(i,j)
                     = Σ_{i=1..N} Ri + Σ_{1≤i<j≤N} [ delay_A(i,j) + delay_A(j,i) ]
– Thus, for A = SJF, we get
  = Σ_{i=1..N} Ri + Σ_{1≤i<j≤N} min(Ri, Rj)
  (assume Pi is the shorter one => delay(i,j) = Ri and delay(j,i) = 0)
– But for A = RR, we get
  = Σ_{i=1..N} Ri + Σ_{1≤i<j≤N} 2·min(Ri, Rj)
  (assume Pi is the shorter one; then Pi delays Pj by Ri, and since it's a "perfect" RR, Pj symmetrically delays Pi by Ri)
– The RR pairwise term is at most twice the SJF term, which proves the bound
(A numeric check of this decomposition follows below.)

SRTF (Shortest-Remaining-Time First)
Assume different jobs may arrive at different times
SJF is not optimal in this setting
– As it's not preemptive, and
– A short job might arrive while a very long job is running => recall: the convoy effect
SRTF is just like SJF, but
– It is allowed to use preemption
– Hence, it's "optimal" (assuming a zero context-switch cost, etc.)
Whenever a new job arrives, or an old job terminates
– SRTF schedules the job with the shortest remaining time
– Thereby making an optimal decision

Selfish RR
New processes wait in a FIFO queue
– Not yet scheduled
Older processes are scheduled using RR
New processes are scheduled when
1. No ready-to-run "old" processes exist
2. "Aging" is applied to new processes (some per-process counter is increased over time); when the counter passes a certain threshold, the "new" process becomes "old" and is transferred to the RR queue
Fast aging
– The algorithm resembles RR
Slow aging
– The algorithm resembles FCFS
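The pairwise-delay decomposition can be checked numerically. This minimal sketch (not from the lecture) evaluates both closed forms on random workloads, cross-checks the SJF form against a direct simulation, and confirms avgResponse(RR) ≤ 2 × avgResponse(SJF).

```python
# Minimal sketch (not from the lecture): checking the RR-vs-SJF bound via
# the pairwise-delay decomposition proved above. One core, all processes
# arrive together, no context-switch overhead.
from itertools import combinations
import random

def n_times_avg_response(runtimes, rr):
    # N * avgResponse = sum(Ri) + c * sum over pairs of min(Ri, Rj),
    # where c = 1 for SJF and c = 2 for a "perfect" RR
    pair_term = sum(min(a, b) for a, b in combinations(runtimes, 2))
    return sum(runtimes) + (2 if rr else 1) * pair_term

random.seed(2)
for _ in range(1000):
    R = [random.randint(1, 100) for _ in range(8)]
    sjf = n_times_avg_response(R, rr=False)
    rr = n_times_avg_response(R, rr=True)
    assert rr <= 2 * sjf              # avgResponse(RR) <= 2 * avgResponse(SJF)
    # cross-check the SJF closed form against a direct SJF simulation
    t, finish_sum = 0, 0
    for r in sorted(R):               # shortest job first
        t += r
        finish_sum += t               # response time = finish time here
    assert finish_sum == sjf
print("bound and closed form verified on 1000 random workloads")
```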
Back to the context of general-purpose OSes
GENERAL-PURPOSE SCHEDULERS: PRIORITY-BASED & PREEMPTIVE

Scheduling using priorities
Every process is assigned a priority
– That reflects how "important" it is at that time instance
– It can change over time
Processes with higher priority are favored
– Scheduled before processes with lower priority
The priority concept can also express batch schedulers
– SJF: priority = runtime (smaller => higher)
– FCFS: priority = arrival time (earlier => higher)

Negative feedback principle
The schedulers of all general-purpose OSes (Linux, Windows, …) employ a negative-feedback policy
– Running reduces the priority to run more
– Not running increases the priority to run
How does this affect I/O-bound & CPU-bound processes?
– I/O-bound processes (that seldom use the CPU) get higher priority
– Which is why editors are responsive even if they run in the presence of CPU-bound processes like while(1) { sqrt( time() ); }
How about other interactive apps, like videos with a high frame rate, or 3D games?
– Negative feedback doesn't help them (they consume lots of CPU)
– We need other ways to identify/prioritize them (nontrivial)

Multi-level priority queue
General-purpose OSes
– Typically use some variant of a multi-level priority queue
It consists of several RR queues
– Each associated with a priority
– Higher-priority queues at the top
– Lower-priority queues at the bottom
Processes migrate between queues
– So they have a dynamic priority
– "Important" processes move up (e.g., I/O-bound or "interactive")
– "Unimportant" processes move down (e.g., CPU-bound or "non-interactive")

Multi-level priority queue (cont.)
Priority is greatly affected by
– CPU consumption of the processes
  I/O-bound ⇔ move up
  CPU-bound ⇔ move down
– Quanta duration
  Some schedulers allocate short quanta to higher-priority queues
  Some don't, or even do the opposite
(A toy sketch of such a queue follows below.)

A comprehensive example
Simple enough to see all its code, and
Provides intuition about how all general-purpose schedulers work:
THE LINUX ≤2.4 SCHEDULER
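To make the multi-level priority queue concrete, here is a toy sketch (not from the lecture; all class names, constants, and the promotion/demotion policy are illustrative assumptions, not the Linux ≤2.4 algorithm discussed next). Exhausting a quantum demotes a process; blocking for I/O promotes it.

```python
# Toy sketch (not from the lecture): a multi-level priority queue with
# negative feedback. Queue 0 is the highest priority. Exhausting a quantum
# demotes a process (it "ran", so its priority drops); blocking for I/O
# promotes it. All names and policy details are illustrative.
from collections import deque

NUM_QUEUES = 3  # illustrative constant

class MLFQ:
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_QUEUES)]

    def add(self, proc, level=0):
        proc.level = level
        self.queues[level].append(proc)

    def pick(self):
        # scan from the highest-priority queue downward
        for q in self.queues:
            if q:
                return q.popleft()
        return None

    def ran_full_quantum(self, proc):  # negative feedback: demote
        self.add(proc, min(proc.level + 1, NUM_QUEUES - 1))

    def blocked_on_io(self, proc):     # seldom uses the CPU: promote
        self.add(proc, max(proc.level - 1, 0))

class Proc:
    def __init__(self, name):
        self.name, self.level = name, 0

mlfq = MLFQ()
cruncher = Proc("while(1){sqrt(time());}")  # CPU-bound
editor = Proc("editor")                     # I/O-bound
mlfq.add(cruncher); mlfq.add(editor)
p = mlfq.pick(); mlfq.ran_full_quantum(p)   # cruncher exhausts its quantum
p = mlfq.pick(); mlfq.blocked_on_io(p)      # editor blocks for I/O
print(editor.level, cruncher.level)         # editor stays at 0, cruncher drops to 1
```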
