Computer Architecture PDF
Summary
This document provides an overview of computer architecture, covering fundamental concepts like binary control signals, processor components, instruction types, and instruction cycles. It also explores advanced architectures and modern trends in processor design.
Full Transcript
Computer Architecture

Main topics

1. Fundamentals of computer architecture
   - Binary control signals: computers operate using binary signals (0 and 1).
   - Processor (CPU): controlled by binary signals, capable of executing roughly 100-200 distinct operations; operations are selected via specific signals.
   - Instructions and programs: instructions are the basic operations executed by the CPU. Programs are sequences of instructions stored in memory (RAM) and accessed via a bus.
2. Components of a computer
   - Registers: small, fast storage areas in the CPU for instructions and data. Types include arithmetic/logical (data) registers, the instruction register (IR), and the program counter (PC).
   - ALU (arithmetic logic unit): executes arithmetic and logical operations.
   - Control unit: manages the binary control signals used to execute instructions.
   - Cache: speeds up memory access by storing frequently accessed data, leveraging the principles of locality (spatial and temporal).
   - Bus: a communication pathway that transfers data between the CPU, memory and I/O devices. Parallel buses are faster but must be shorter; serial buses are slower but can be longer.
3. Instruction types
   - Data processing: arithmetic and logical operations performed by the ALU.
   - Data transfer: moving data between memory and registers.
   - Program control: includes conditional jumps (if, while) that alter instruction flow.
4. Instruction cycle
   - Fetch and execute phases: instructions are retrieved from memory and executed sequentially.
   - Pipeline execution: breaks the instruction cycle into multiple stages for parallel execution; two-stage or five-stage pipelines, for example, improve speed.
5. Challenges in parallel execution (hazards)
   - Structural hazard: multiple instructions require the same resource.
   - Data hazard: an instruction depends on data that is not yet available.
   - Control hazard: jumps cause uncertainty about the next instruction.
6. Advanced architectures
   - Superscalar processors: feature multiple pipelines to execute instructions in parallel.
   - Out-of-order execution: allows the CPU to process instructions speculatively, reducing idle time.
7. Modern trends in processor design
   - Multicore processors: combine multiple cores on one chip, enabling parallel execution of programs; operating systems manage task distribution across cores.
   - Moore's law: increasing transistor density enables improved performance through features like larger caches and additional cores.
8. Memory and cache design
   - Multi-level cache: combines small, fast caches with larger, slower ones to balance speed and capacity.
   - Locality principle: programs tend to access nearby or recently used memory locations.
9. Numerical systems (see the short conversion sketch after this overview)
   - Binary (base 2): fundamental to computer operation, using the digits 0 and 1.
   - Decimal (base 10): the familiar system with digits 0-9.
   - Hexadecimal (base 16): uses 0-9 and A-F for compact representation.
10. Instruction-level parallelism (ILP)
    - Pipeline design: enables simultaneous processing of multiple instructions.
    - Superscalar execution: executes instructions in parallel within a single program.
11. Chipsets and system architecture
    - Traditional chipsets: a northbridge and a southbridge control CPU-memory and peripheral interactions.
    - Modern chipsets: integrate the northbridge into the processor, with the southbridge managing I/O devices.

A program is a sequence of instructions sent to the processor one after another. The instructions are stored in memory (RAM) and transferred to the processor over wires (the bus).
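To make the numerical-systems item above concrete, here is a minimal Python sketch (the value is an arbitrary example) showing the same number in binary, decimal and hexadecimal:

```python
# Convert a decimal value to its binary and hexadecimal representations,
# and parse those representations back to an integer.
value = 202                      # arbitrary example value

binary = bin(value)              # '0b11001010'
hexa = hex(value)                # '0xca'

# Parsing back: int() accepts an explicit base.
assert int("11001010", 2) == value
assert int("CA", 16) == value

print(value, binary, hexa)       # 202 0b11001010 0xca
```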
Each signal is called a bit (binary digit).

Main components of a computer system
- CPU (central processing unit): the brain of the computer, executing instructions and managing data.
- Memory (RAM): temporarily stores data and instructions for quick access.
- Bus system: includes data, address and control buses for communication between components.
- Cache memory: a small, fast memory that stores frequently accessed data to reduce access time.
- I/O devices: peripherals such as keyboards, mice and storage devices.
- Interrupt handling: allows the CPU to respond to high-priority events by pausing its current task and handling the interrupt.

The core functionality of a computer revolves around executing instructions, managing data flow, and interacting with hardware components.

CPU (central processing unit)
A program is a sequence of instructions executed strictly in the order they are written; this is called sequential execution. The program is stored in primary memory, and the instructions are fetched and executed one by one. The process of fetching an instruction and then executing it is known as the instruction cycle, and it has three steps: fetch - decode - execute.

Registers are small pieces of fast storage in the CPU where it holds the values it is working on. When an instruction is fetched from memory, it is placed in a register within the processor. A register is a small memory location capable of holding a single bit pattern. The processor has several registers: the register that holds the fetched instruction is called the instruction register (IR); another register, the program counter (PC), keeps track of the program's progress; and there are several data registers used to store the data required by the instructions.

A pipeline is a technique used to increase CPU performance by breaking instruction execution down into multiple stages. Each stage performs one part of the instruction, allowing multiple instructions to be processed simultaneously in an overlapping manner, which improves throughput and resource utilization. Pipelining divides instruction execution into stages and allows overlapping of instruction execution, meaning multiple instructions are in different stages at the same time.
- Fetch: the CPU retrieves the next instruction from main memory using the program counter (PC), which keeps track of the memory address of the instruction.
- Decode: the fetched instruction is decoded by the control unit, which interprets what needs to be done; the instruction being decoded is held in the instruction register (IR).
- Execute: the CPU performs the operation, which might involve arithmetic computation, data transfer or control-flow changes. The arithmetic logic unit (ALU) handles arithmetic and logic operations.
- Memory access: involves reading or writing data from/to memory.
- Write back: delays writing modified data back to memory to enhance performance; this approach is commonly used in caching systems.
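A toy Python sketch of the fetch-decode-execute cycle described above; the three-instruction "program" and its tiny instruction set are invented purely for illustration:

```python
# Minimal model of the instruction cycle: the PC selects the next instruction,
# the IR holds it while it is decoded and executed.
program = [                      # hypothetical machine "program" in memory
    ("LOAD", "r0", 5),           # r0 <- 5
    ("ADD",  "r0", 3),           # r0 <- r0 + 3
    ("HALT",),
]

registers = {"r0": 0}
pc = 0                           # program counter

while True:
    ir = program[pc]             # fetch: copy the instruction into the IR
    pc += 1                      # point the PC at the next instruction
    op = ir[0]                   # decode: inspect the opcode
    if op == "LOAD":             # execute
        registers[ir[1]] = ir[2]
    elif op == "ADD":
        registers[ir[1]] += ir[2]
    elif op == "HALT":
        break

print(registers)                 # {'r0': 8}
```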
Registers are critical for fast data access within the CPU:
- Program counter (PC): points to the next instruction to be executed.
- Instruction register (IR): holds the currently executing instruction.
- General-purpose registers (data registers): temporarily hold operands and results of computations.
- Status register (flags): indicates conditions like zero, carry, overflow, or sign after an operation.

Pipeline stages (continued)
- Memory access: data is read from or written to memory if needed.
- Write back: the result is written back to the register file.

Advantages of pipelining
- Increased throughput: multiple instructions can be in different stages at the same time, increasing the number of instructions completed per clock cycle.
- Clock cycles: the first instruction takes 5 clock cycles, and each subsequent instruction finishes one clock cycle later. The formula is P + n - 1, where P is the number of pipeline stages (5 here) and n is the number of instructions (10 here), giving 5 + 10 - 1 = 14 cycles (a small calculation sketch follows this subsection).
- Better resource utilization: the components of the CPU are kept busy with different parts of the pipeline.

Challenges: hazards can arise due to the overlapping of instructions. Instructions are programmed for sequential execution, and hazards occur when such instructions are executed in parallel. Hazards do not lead to errors, but they require special handling to ensure correct execution; the processor itself is responsible for ensuring proper execution.
- Data hazards: one instruction depends on the result of a previous instruction.
- Control hazards: the pipeline is unsure of the next instruction to fetch, often due to branches (branch prediction mitigates control hazards by guessing the outcome of conditional branches to keep the pipeline filled).
- Structural hazards: two instructions need the same hardware resource at the same time.

Types of hazards:
1. Structural hazard: resource conflicts (e.g., multiple instructions require the same resource, such as memory). Resolved by inserting stall cycles.
2. Data hazard: data dependencies, where one instruction requires a result that has not been calculated yet because a previous instruction has not progressed far enough in the pipeline. Resolved by inserting stall cycles.
3. Control hazard: problems with non-sequential execution, especially conditional and unconditional jumps. This requires efficient algorithms for branch prediction, and pipeline flushing when predictions are incorrect.

A program is typically written to execute sequentially (instruction by instruction: the next instruction starts only when the previous one is completely finished). To increase performance, modern processors are designed to work on multiple instructions simultaneously. Since this is a feature of the processor, the processor itself is responsible for ensuring correct execution: the programmer does not code this, the hardware resolves hazards. However, both the programmer and especially the compiler can optimize code to minimize hazards, which requires a good understanding of the platform the software will run on. It is important to understand that instruction-level parallelism differs from parallel execution of multiple programs (either running on separate cores or taking turns using the processor); in parallel execution of different programs, the operating system ensures no errors occur.
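A quick Python check of the P + n - 1 cycle count above, extended with a stall term to reflect the hazard discussion (the stall count is an arbitrary example):

```python
def pipeline_cycles(stages: int, instructions: int, stalls: int = 0) -> int:
    """Total clock cycles for an ideal pipeline, plus any inserted stall cycles."""
    return stages + instructions - 1 + stalls

print(pipeline_cycles(5, 10))            # 14 cycles, as in the example above
print(pipeline_cycles(5, 10, stalls=3))  # 17 cycles if hazards force 3 stalls
```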
A transistor is a semiconductor device used to amplify or switch electronic signals. It has three main parts: the emitter, base and collector for bipolar junction transistors (BJTs), or the source, gate and drain for field-effect transistors (FETs). Transistors control the flow of current between two terminals using a small input signal at the third terminal, making them essential for logic gates, amplification and signal processing in electronic circuits.

Logic gates are built by connecting transistors in specific ways to perform logical operations (a toy sketch appears after this passage):
- AND gate: both inputs must be high for the output to be high; built with transistors in series.
- OR gate: the output is high if at least one input is high; built with transistors in parallel.
- NOT gate: inverts the input, using a single transistor.
- NAND and NOR gates: the inverses of AND and OR.
- XOR gate: outputs high if the inputs are different; requires a combination of other gates.

Clock generator: produces a periodic signal that synchronizes the processor's operation. It controls when actions like instruction fetch, decode and execution happen.
- Clock signal: oscillates between high and low states to mark cycles.
- Synchronization: ensures all parts of the processor operate in step.
- Clock distribution: delivers the clock signal to all components.
- Clocked logic: many circuits use the clock to update inputs and outputs at precise times.
These systems, combining transistor-based gates and a synchronized clock signal, enable processors to execute instructions efficiently and at high speed.

Trends in modern processor architecture
A key development in processor architecture is the increasing integration of transistors into processor circuits. A modern processor is essentially made up of multiple independent processors, each referred to as a core. These processors also include large amounts of cache, which makes them extremely fast. Other hardware now struggles to keep up with the speed of the processor, and this has led to a trend of integrating hardware components that were previously outside the processor into the processor itself:
- Multicore processors
- System architecture
Modern performance-optimization mechanisms sometimes lead to security issues, such as side-channel attacks, with Spectre being the most well-known example.

Processor architecture
- Superscalar processors: can issue multiple instructions per cycle by using multiple execution units; this requires mechanisms to handle dependencies between instructions.
- Out-of-order execution: instructions are executed based on the availability of operands and resources rather than their order in the program; this improves CPU efficiency by avoiding stalls due to instruction dependencies.
- Multicore processors: contain multiple processing cores within a single CPU chip, enabling parallel execution of threads and processes; each core can independently execute tasks, improving performance for multi-threaded applications.

Processor components
- Cores: individual processing units capable of executing tasks independently.
- Graphics processing unit (GPU): specialized for parallel processing of graphics and computational tasks.
- Memory controller: manages data flow between the CPU and RAM.

Parallel execution
Parallelism enhances performance by dividing tasks across multiple processors or cores. Instruction-level parallelism (ILP) involves executing multiple instructions simultaneously within a single CPU core; techniques include pipelining and superscalar execution. In short, modern processors can execute multiple instructions at one time.
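Returning to the logic gates listed above, a toy Python sketch that models the basic gates with simple boolean operations (a real processor builds them from transistors, not software, so this is purely illustrative):

```python
# Boolean models of the gates listed above; inputs and outputs are 0 or 1.
def NOT(a):      return 1 - a
def AND(a, b):   return a & b
def OR(a, b):    return a | b
def NAND(a, b):  return NOT(AND(a, b))
def NOR(a, b):   return NOT(OR(a, b))
def XOR(a, b):   return OR(AND(a, NOT(b)), AND(NOT(a), b))  # built from other gates

# Print the truth table for all four input combinations.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", AND(a, b), OR(a, b), NAND(a, b), NOR(a, b), XOR(a, b))
```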
Pipelining partitions the actions required to execute an instruction into different steps, and the processor hardware is organized as a series of stages, each performing one of these steps. The stages can operate in parallel, working on different parts of different instructions, so the hardware design can sustain an execution rate close to one instruction per clock cycle. Processors that can sustain execution rates faster than one instruction per cycle are known as superscalar processors; most modern processors support superscalar operation.

High-level parallelism
- Multicore and multiprocessor systems: use multiple CPU cores or processors to run tasks concurrently.
- Clusters: groups of interconnected computers that work together to perform tasks as a single system.
- Parallel programming models: for example OpenMP for shared-memory systems and MPI for distributed systems.

Locality and cache
Primary memory is too slow to deliver instructions quickly enough to the processor. To address this, the most relevant portions of primary memory are copied to a faster (but smaller) memory called cache. Cache is a crucial mechanism in modern processors, and cache memory is critical for speeding up data access. To predict which parts of memory are likely to be needed next, the principle of locality is used: it is highly likely that the next required data will be located near the data just accessed. Therefore, instead of fetching a single instruction from memory, the processor also copies the surrounding area along with the requested instruction.

Types of locality
- Spatial locality: accessing memory locations close to one another; for example, a loop iterating over an array accesses adjacent elements sequentially.
- Temporal locality: accessing the same memory location multiple times within a short period, for example a frequently used variable in a function.

Cache hierarchy and operation
- L1 cache: closest to the CPU and the fastest, but also the smallest.
- L2 and L3 cache: larger but slower than L1; L3 is often shared among CPU cores and acts as a buffer between the CPU cores and main memory (RAM).

Cache mapping
- Direct mapping: each block of memory maps to a specific cache line.
- Associative mapping: any block can be placed in any cache line.
- Set-associative mapping: a compromise between direct and fully associative mapping.
- Replacement policies: determine which cache block to replace when the cache is full (e.g., least recently used, LRU).

Bus and bus hierarchy
A bus consists of copper tracks or wires that conduct electricity; each of these can be either ON or OFF, represented by 0 and 1. All signals in a computer are transmitted over buses. A bus is essentially a collection of wires used to transfer bits between different components. Since modern computers operate at very high speeds, they require fast buses with high bandwidth to handle large volumes of data efficiently. Modern computers use multiple buses, each optimized for specific tasks, creating a bus hierarchy. Efficient bus design and high bandwidth are critical for overall system performance: the speed and bandwidth of the bus determine how quickly data can move between components, affecting tasks such as data processing, gaming and multitasking.

Types of buses
1. Data bus: transfers data between the CPU, memory and peripherals. The width of the data bus (32-bit or 64-bit) determines how much data can be transferred at once.
2. Address bus: carries memory addresses from the CPU to other components, specifying where data should be read from or written to; it is unidirectional.
3. Control bus: transmits control signals, such as read/write commands, between the CPU and other components; it coordinates the activities of the system.

Bus hierarchy
1. Front-side bus (FSB): connects the CPU to main memory and the chipset. Modern systems often use a point-to-point connection like Intel's QuickPath Interconnect (QPI) or AMD's Infinity Fabric instead of a traditional FSB.
2. Back-side bus (BSB): links the CPU to the cache, specifically the L2 cache; it is typically faster than the FSB.
3. Peripheral buses:
   - PCI Express (PCIe): a high-speed interface for connecting peripherals such as graphics cards and storage devices.
   - USB: used for external devices like keyboards, mice and storage devices.
   - SATA/NVMe: interfaces for connecting storage devices.
4. Memory bus: connects the CPU to the system's RAM; high-speed memory buses, like DDR and GDDR, ensure fast data access.

RAM
RAM keeps the instructions and any values; it is called random access because it does not matter where in memory, or in what order, the information is read or written. Unlike permanent storage (like hard drives), RAM is much faster, but its contents are lost when the system powers down. It is used by the CPU to store running programs, operating system data and currently processed information, and it allows quick read and write access to data, which is essential for system speed and responsiveness.
- Dynamic RAM (DRAM): requires periodic refreshing.
- Static RAM (SRAM): faster but more expensive.
DRAM is typically used for main system memory, while SRAM is used in caches. RAM is temporary storage and its performance can greatly affect system speed: faster RAM means quicker data access and processing, improving overall system performance.

Processors double their performance approximately every 1.5 years, while DRAM takes about 10 years to double its performance. This growing performance gap between processors and DRAM makes cache mechanisms even more important. The total cache size increases over time; however, a large cache is slower than a small one, so multi-level caching is used. The locality principle explains why memory access tends to occur in the same area over a relatively long period of time. The cache mechanism copies this area of primary memory to a faster memory technology and feeds the processor from this copy instead. This principle is so strong that even though we load less than 1/100,000th of the memory into the L1 cache, there is a high probability that the next location to be used will be within this area. The performance improvement with cache is so significant that it justifies the increased complexity and cost that the cache mechanism introduces in modern processors.

Instructions in the same program are executed in parallel, either with partial overlap, as with pipelining, or full overlap, as with superscalar processors:
1. Pipeline (assembly line)
2. Superscalar processors
3. Speculative execution

Floating-point numbers
Floating-point numbers are placed at different locations along the number line compared to integers: they are very close together for small values, but for large values there are significant gaps between representable numbers. For most calculations we use relatively small numbers, where floating-point values are closer together than integers. However, when dealing with very large numbers, floating-point precision can become problematic if not handled carefully (a small demonstration follows below).
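A short Python demonstration of the gaps described above, using math.ulp to show the distance between a float and the next representable value:

```python
import math

# The gap between adjacent floating-point numbers grows with magnitude.
for x in (1.0, 1_000_000.0, 1e16):
    print(f"spacing around {x:g}: {math.ulp(x):g}")

# Around 1e16 the spacing is 2.0, so adding 1 can be lost entirely.
print(1e16 + 1 == 1e16)   # True
```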
Computing

User Datagram Protocol (UDP)
UDP is used for fast, connectionless communication in networks. It is commonly used in applications where speed is more critical than reliability, such as:
- streaming
- online gaming
It does not guarantee delivery, order, or error checking, which makes it lightweight and efficient. Using UDP for Gmail or similar email services would not work well, because email services require reliability and error checking, which UDP does not provide.

Transmission Control Protocol (TCP)
- Reliable delivery: data packets are guaranteed to arrive.
- Order: packets arrive in the same order they were sent.
- Error correction: corrupted data is detected and retransmitted.

Cloud Computing
Cloud computing is one of the most important paradigm shifts in computing. It is an outsourcing model in which IT services are provided and paid for based on actual on-demand use. Infrastructure as a Service (IaaS) is one specific service model of cloud computing. Clouds are described by five essential characteristics, four deployment models, and three service models.

Cloud characteristics
- On-demand self-service: users can deploy systems or apps with minimal management effort.
- Rapid elasticity: resources can scale up or down quickly based on demand.
- Resource pooling: resources are shared from a common pool using virtualization.
- Measured service: resource usage is tracked and billed as operating expenses.
- Broad network access: services are accessible over the network via standard methods.

Cloud deployment models
- Public cloud: accessible to the public via the internet, benefiting from economies of scale.
- Private cloud: exclusive to one organization, managed internally or by a third party.
- Community cloud: shared by organizations with similar concerns, managed by one or more parties.
- Hybrid cloud: combines public, private, or community clouds to run different services (like email servers).

Cloud service models
- SaaS: full applications for end users (Office 365, Salesforce).
- PaaS: platforms for developers to build apps (Azure, Google App Engine).
- IaaS: virtual machines and storage, with users responsible for the OS and software (Amazon Elastic Compute Cloud (EC2), S3, and Microsoft Azure IaaS).

IaaS (Infrastructure as a Service)
- Provides virtualized resources (machines, storage, networking).
- Users must install and maintain the OS and apps.
- IaaS hardware is inexpensive and can fail, so applications should be robust and scalable.

Cloud computing topics: cloud service providers, key characteristics of cloud computing, cloud-enabling technologies (virtualization, containerization, data centers), cloud service models, and cloud deployment models.

Cloud computing is the delivery of various computing services over the internet: servers, storage, networking, software, databases, machine learning. Various resources and functionalities can be provided as services. Some services are free (Gmail, Dropbox, OneDrive, Google Drive), but most are not, especially services that provide advanced functionality.

Cloud service providers
The companies that offer computing services are known as cloud providers or cloud service providers: Amazon Web Services, Microsoft Azure, IBM Cloud, DigitalOcean, Google Cloud Platform.

Key characteristics of cloud computing
1. Multitenancy
   Cloud computing is designed to serve multiple customers. Customers include regular end users, SMBs, big companies, organizations, and government agencies.
2. Resource pooling
   Each type of computing resource is organized and grouped into a logical pool to serve multiple customers. A CPU pool, for example, consists of many physical CPU cores located in many different interconnected physical machines. Computing resources can be dynamically assigned to customers from the pools according to customer demand. Two example resource pools: a CPU pool and a disk pool.
3. Scalability
   Scalability refers to the capability of a system/application to handle a growing amount of work. Cloud computing enables a system/application to accommodate increasing load by:
   ○ Vertical scaling: adding more resources to a single server.
   ○ Horizontal scaling: adding more servers or machines to distribute the load (more machines handle the whole application as demand requires).
   Performance is not affected by the number of users.
4. Rapid elasticity
   Elasticity extends the concept of scalability by automatically and dynamically adapting resources based on actual demand. Cloud computing can automatically provision (take more resources) or deprovision (release resources) computing resources in response to changes in workload: resources can be added when demand increases and removed when demand decreases. Customers want to pay based on their demand (scaling capacity to meet demand), either increasing or decreasing capacity as needed.
5. Pay as you go (pay per use)
   Customers pay only for the resources that they actually use. It is like purchasing electricity from an electricity provider.
6. Ubiquitous network access
   The services provided by a cloud provider are essentially available all the time, and they can be accessed from anywhere, by many kinds of devices, over the internet. From the cloud provider's point of view it does not matter where customers come from (anywhere in the world).

Cloud-enabling technologies
- Computer servers and storage devices
- Cloud access devices
- High-speed broadband (fast service)
- Virtualization
- Containerization
- Data centers

Virtualization
Virtualization is the technology that enables cloud computing by creating virtual versions of physical hardware, such as servers, storage, and networks. It allows multiple virtual machines (VMs) to run on a single physical machine, each with its own operating system and applications. It is the backbone of cloud computing, enabling resource pooling and rapid elasticity.

Key benefits of virtualization include:
- Efficient resource use: multiple VMs share the same physical resources, reducing hardware costs (with virtualization, the maximum computing capacity of the hardware can be utilized).
- Isolation: VMs run independently, ensuring one VM's issues don't affect others.
- Flexibility: VMs can be easily created, modified, or moved between physical machines.
- Scalability: virtualization allows for quick scaling of resources to meet demand.
- Minimum downtime: application and OS crashes can be mitigated by running multiple VMs with the same OS.
- Time management: setting up a whole server from scratch can be avoided by using sufficient hardware for virtualization.

A hypervisor is software that manages VMs; it acts as an interface between the VMs and the physical hardware to ensure proper access to the resources they need. Types of hypervisor: Type-1 (bare metal) and Type-2.

Types of virtualization
- Desktop virtualization: allows us to run multiple desktop OSs, each in its own VM. Variants: virtual desktop infrastructure (VDI) and local desktop virtualization.
- Network virtualization: can combine multiple physical networks into one virtual, software-based network, or divide one physical network into separate, independent virtual networks. Types of network virtualization: software-defined networking (SDN) and network function virtualization (NFV), for example a virtualized firewall.
- Storage virtualization:
  ○ Allows all the storage devices in the system to be accessed and managed as a single storage unit.
  ○ The management and provisioning of storage are done in software.
  ○ All the storage forms a shared pool from which capacity can be allotted to any VM on the system.
- Application virtualization:
  ○ Runs software applications without installing them directly on the host's OS.
  ○ Types of application virtualization: local application virtualization, application virtualization, and server-based application virtualization.

Virtualization refers to the act of creating a virtual version of something, including computer hardware platforms, memory, storage devices, and network resources; for example, physical storage can be abstracted into a virtual version of itself.

Virtualization technologies
1. Storage virtualization
   Storage virtualization is a technology that abstracts and combines physical storage resources into a single virtualized storage pool. The virtualization layer enables centralized management and allocation of storage capacity from various storage devices or arrays. Storage virtualization simplifies storage management and improves utilization; the virtualization layer handles management and mapping (it keeps track of where the data is actually stored).
2. Memory virtualization
   Memory virtualization is a technology that abstracts and pools a computer system's physical memory (RAM) resources. It allows memory to be efficiently managed and allocated to different applications and processes. The difference between the two variants: in a plain virtualized memory pool, contributors offer memory that is virtualized and handed out to different applications; in a virtualized memory pool with an operating system, applications cannot access the memory directly, because the operating system sits in between. We can allocate the amount of memory that each consumer needs; for example, out of 128 GB, a customer that needs 16 GB can be allocated exactly that, and the rest can be allocated to other customers according to their demand.
3. Network virtualization
   Network virtualization is a technology that abstracts and virtualizes the physical network infrastructure, enabling multiple virtual networks to run on top of a shared physical network. It provides a way to create isolated and logically segmented network environments within a single physical network, allowing organizations to achieve greater flexibility, scalability, and resource optimization. A physical network can be segmented so that one server is deployed in one virtual network and another server in a different one (there can be many virtual networks on one physical network).
4. Hardware virtualization
   Also called server virtualization or platform virtualization. Hardware virtualization is a method whereby one or more virtual machines are created to share the hardware resources of one physical machine at the same time. The VMs are called guest machines. Each virtual machine is a simulated computer environment, and it acts like a real computer with its own operating system; in other words, hardware virtualization enables multiple OSs to run on the same physical machine. The technique that enables all of this is the hypervisor.

Hypervisor
A hypervisor is software that manages VMs; it acts as an interface between the VMs and the physical hardware to ensure proper access to the resources needed. A hypervisor is also known as a Virtual Machine Manager (VMM): software designed to create and manage VMs. There are two kinds:
○ Type-1 hypervisor (bare-metal hypervisor): runs directly on the hardware and can access resources such as the CPU directly to create and run VMs.
○ Type-2 hypervisor (hosted hypervisor): runs like an application on top of an existing host OS, which it uses to create and manage the virtual machines.
Examples:
- Type-1 hypervisors: VMware ESXi Server, Microsoft Hyper-V, Xen.
- Type-2 hypervisors: Oracle VirtualBox, VMware Workstation Pro, Microsoft Virtual PC, QEMU.
Type-1 is more efficient because it accesses the hardware directly, while Type-2 has to go through the host operating system first.
5. OS virtualization (containerization)

Containerization
Also called OS virtualization, containerization is a method of packaging, distributing, and running applications and their dependencies within isolated environments called containers. Containerization enables multiple containers to be created and run on the same host OS; when a container is running, it looks like an ordinary process to the host OS. A container can then be easily moved and executed consistently across different environments, regardless of the underlying infrastructure. Containerization has emerged as an alternative to hypervisor-based virtualization. It offers the following benefits:
- Isolation: containers are isolated from one another and from the host system, ensuring that one container's changes or issues don't affect others.
- Portability: containers run consistently across different environments, reducing "it works on my machine" issues.
- Resource efficiency: containers share the host system's kernel, which makes them lightweight and more resource-efficient compared to traditional virtual machines.
- Fast deployment: containers start quickly, allowing for rapid application deployment.
- Version control: container images can be versioned, making it easier to manage and update software.

The most well-known containerization product is Docker. Docker is a software platform that allows us to build, test, and deploy applications quickly. To run an application on Docker, a Docker image needs to be created: a lightweight, standalone, executable package of software that includes everything needed to run the application (code, system tools, system libraries, etc.). When a Docker image runs on Docker Engine, it becomes a container, and it is separated and isolated from other containers.

Data center
A data center is a dedicated space within a building, or a group of buildings, used to house computers, storage devices, and network infrastructure. In other words, a data center is a place where a cloud provider keeps and maintains all its hardware, provides cloud services, runs customers' applications, and stores customers' data.

Key components of data centers
1. Server racks and hardware: data centers house numerous server racks or cabinets containing servers, storage devices, and other hardware. These servers can be standard x86 servers, specialized hardware, or virtual machines running on powerful servers.
2. Networking infrastructure: data centers have robust networking infrastructure that includes switches, routers, load balancers, and firewalls to manage data traffic both within and outside the data center.
3. Cooling and climate control: data centers require efficient cooling systems to maintain an optimal operating temperature for the hardware. Advanced cooling techniques are used to prevent overheating.
4. Power supply: uninterruptible power supplies (UPS) and backup generators ensure continuous power in case of electrical outages.
5. Physical security: data centers have strong physical security measures in place, including access controls, biometric authentication, surveillance cameras, and security personnel.
6. Fire suppression: special fire suppression systems, often using gases that do not harm equipment, are employed to prevent and manage fires.
7. Monitoring and management: advanced software and systems monitor the performance and status of the hardware and applications within the data center.
8. Energy efficiency: many modern data centers are designed with energy-efficient technologies to reduce power consumption and environmental impact.
Google has more than 30 data centers around the world so far.

Cloud service models
- Infrastructure as a Service (IaaS)
- Platform as a Service (PaaS)
- Software as a Service (SaaS)

Infrastructure as a Service (IaaS)
Computing infrastructure is delivered as a service, including computer servers, storage, networks, and operating systems. Users have control over the entire infrastructure that they buy from a cloud provider, but they do not need to maintain or manage the physical hardware.

Platform as a Service (PaaS)
A computing platform is delivered as a service, so that users can focus on developing their applications on top of the platform. A platform typically includes a runtime environment, development tools, version control and collaboration tools.

Software as a Service (SaaS)
A software application is offered as a service; users just use the application. The cloud provider manages everything, including infrastructure, platforms, applications, and data, while customers act purely as users.

Cloud deployment models
- Public cloud: accessible to the public via the internet, benefiting from economies of scale. In a public cloud, all the compute resources are owned, managed, and operated by a third-party cloud provider. High scalability, flexibility and cost effectiveness, but users might lose some control over their data.
- Private cloud: exclusive to one organization, managed internally or by a third party. In a private cloud, computing resources are maintained and used exclusively by one organization or enterprise; anyone outside the organization is unable to access the cloud. It requires more capital cost than using a public cloud, but organizations can keep their data safe locally.
- Hybrid cloud: combines public, private, or community clouds to run different services (like email servers). The best of both worlds: a combination of a public cloud and a private cloud. Organizations can use a public cloud for high-volume, lower-security tasks, while using a private cloud for sensitive, business-critical operations.

Computer Networks

History and evolution of networks
Early computers (pre-1970) were isolated; networking them began as an efficiency need. Initial networks, called local area networks (LANs), were formed to share resources within a limited space (e.g., research labs).

LANs and Ethernet technology
- LANs: networks that connect computers within a close range, like a single building or campus.
- Ethernet: developed in the 1970s, Ethernet allows computers to connect over a shared cable using unique MAC addresses for identification.

Managing network traffic: CSMA and collisions
- Carrier Sense Multiple Access (CSMA): a method for networked computers to communicate over shared media (Ethernet cables or WiFi).
- Handling collisions: when multiple computers transmit at once, a collision occurs. Exponential backoff and random wait times reduce these issues.

Expansion of networks and collision domains
- Switches and collision domains: dividing a network with switches reduces the number of devices on each shared line, increasing efficiency.
- Network structuring: multiple smaller networks interconnect to form larger networks, ultimately building the global network we know as the internet.

Routing methods and packet switching
- Circuit switching vs. packet switching:
  - Circuit switching: direct, exclusive lines between devices (like early telephones).
  - Packet switching: data is split into packets that travel through multiple routes, optimizing network capacity.
- Routing reliability: routers dynamically adjust routes based on network conditions to prevent congestion.

Protocols and the modern internet
- Internet Protocol (IP): assigns unique addresses (IP addresses) to devices, facilitating packet-based communication across networks.
- ARPANET: the original packet-switched network, funded by the US government, was a precursor to the internet.

Journey of a data packet
Data travels through a sequence of connections: from the local area network (LAN) to regional Internet Service Providers (ISPs), up to the internet backbone, and down to destination servers. Packet routes can be checked by users through tools like traceroute, which shows each hop from source to destination.

Internet protocols: IP and UDP
- Internet Protocol (IP): each packet has an IP address guiding it to the correct computer.
- User Datagram Protocol (UDP): adds essential metadata (port number and checksum) for specific programs, though it doesn't guarantee data integrity or completeness.

Reliability through Transmission Control Protocol (TCP)
TCP offers more robust data transfer than UDP by numbering packets, sending acknowledgments (ACKs), and retransmitting lost packets. It handles network congestion by adjusting the rate of data transmission based on acknowledgment success and delay times.

Domain Name System (DNS)
DNS translates domain names (like youtube.com) to IP addresses for easier access, using a distributed system of DNS servers organized in a tree structure with top-level domains like .com and .gov.

OSI model layers
The material discusses the first five layers: Physical, Data Link, Network, Transport, and Session, detailing how they abstract different aspects of networking to improve development efficiency and data transfer. The transport layer focuses on reliable point-to-point transfer, error detection, and data recovery.

Distinction between the internet and the web
- Internet: a global network that supports various applications, including the web, email, and online games.
- World Wide Web: the largest application on the internet, relying on web browsers to access information distributed across servers globally.

Web pages, hyperlinks, and URLs
- Web pages: documents that contain information and hyperlinks, which link to other pages, forming a connected "web" of data.
- URL: a unique address for a web page, critical for navigation and linking information.

HTTP and HTML
- HTTP (Hypertext Transfer Protocol): a protocol for requesting and receiving web pages from servers using commands like GET.
- HTML (Hypertext Markup Language): a language that defines the structure and links in web pages. Early HTML enabled basic content layout, while modern HTML includes advanced features like images and tables.

Evolution of web browsers
- Early browsers: Sir Tim Berners-Lee developed the first web browser and server in 1990, creating foundational web technologies like HTTP and HTML.
- Advances: browsers like Mosaic introduced embedded images and a GUI, paving the way for popular browsers of the 90s, including Netscape and Internet Explorer.

Search engines and ranking
- Early search engines: systems like JumpStation began cataloging pages using web crawlers and indexing for search.
- Google's approach: innovated by ranking pages based on backlinks from reputable sites, leading to reliable search results.
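To make the UDP/TCP contrast above concrete, here is a minimal self-contained Python sketch that sends one UDP datagram over localhost (the port is chosen automatically; TCP would add a handshake, ACKs and retransmission on top of this):

```python
import socket

# UDP: connectionless, no delivery or ordering guarantees from the protocol.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # let the OS pick a free port
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello over UDP", addr)   # fire-and-forget datagram

data, _ = receiver.recvfrom(1024)        # arrives here only because it's local
print(data)                              # b'hello over UDP'

sender.close()
receiver.close()
```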
The debate on net neutrality
- Net neutrality: the principle that all internet data should be treated equally, without prioritization.
- Arguments for: ensures fair access, promotes innovation, and prevents ISP monopolization.
- Arguments against: suggests flexibility in data flow could benefit real-time applications like video calls.

Storage, backup and recovery

Data availability
Data availability is a term used to describe any product/service that is able to ensure that data is continuously available in any situation. High data availability can be achieved by using the following three techniques together:
- Data replication
- Data backup
- Data archiving

Data replication
Any tool that is able to copy data from one server to another server in real time. Data is copied to another server; replication provides up-to-date copies of mission-critical data and continuous access in the event of a disaster.

Data backup
Any tool that can copy data from one server to another server periodically (how often depends on how critical the data is). Backups enable lost or corrupted data to be restored to its previous state in case of data loss, data corruption or disaster: you go back to a previous date to restore the data that was lost.

Data archiving
Moving data that is no longer used to a separate storage device for long-term retention (5-10 years), to keep data accessible in the long term for compliance and regulation reasons. An example is patients' medical records, which must be kept for many years even though they are rarely, if ever, accessed.

The 3-2-1 backup rule
- Create 3 copies of your data (1 primary copy and 2 backups).
- Store your copies on at least 2 types of storage media (local drive, network share/NAS, tape drive, etc.).
- Store 1 of these copies offsite (somewhere far away from your other backups, so your data is still there in case of an emergency).
For example, keep one backup on a disk drive and another on network-attached storage, or store it somewhere else (for example with a network provider): two different media for your backup copies.

RAID
Redundant Array of Independent Disks/Drives. RAID is a data storage technology that uses multiple physical disk drives to achieve data redundancy and performance improvement. Each RAID level provides a different balance among the key goals: reliability, availability, performance and capacity. Data redundancy refers to the duplication of data within a database or storage system.
It occurs when the same piece of data is stored in multiple locations. In practice, five RAID levels are most often implemented.

RAID 0: striping
- No redundancy: data is simply separated across the disks.
- The array requires at least two disks.
- Data is split/striped into blocks (e.g., 16 KB), which are evenly distributed to all disks of the array; each data block is stored on exactly one disk.
- RAID 0 is designed to increase read and write performance: reads and writes can occur in parallel, and more disks give better read and write performance. If we save a file to this array, the writes can happen in parallel, which gives great performance.
- No fault tolerance: the failure of one disk can cause total data loss; there is no redundancy, so if one disk fails there is no way to retrieve its part of the files.

RAID 1: mirroring
- A RAID 1 array consists of at least two disks.
- Redundancy: if one disk fails, you can reconstruct a file using the other disks' contents, because each block of data is mirrored and stored on all disks of the array.
- Provides fault tolerance: all data remains available as long as one of the disks is functional.
- An expensive backup solution: the capacity of the array is determined by the capacity of the smallest member disk.
- Any read request can be served by any disk drive in the array (fast); any write request must be performed by all the drives (slow).

RAID 10: striping and mirroring
- RAID 1+0 is the combination of RAID 1 and RAID 0 (a nested/hybrid RAID level).
- Requires at least four disks, and the disks are divided into groups.
- Data is split/striped into blocks, which are evenly distributed to all disk groups; each data block assigned to a disk group is then mirrored and stored on all disks in that group.
- Provides fault tolerance: all data remains available as long as each disk group has at least one working disk.
- In most cases it provides better I/O performance than all the other RAID levels except RAID 0.
- The total array capacity is half the total capacity of all disks in the array, since 50% is used to store data copies.
- Use case: suitable for I/O-intensive applications, such as databases, email and web servers.

RAID 5: striping and distributed parity
- Requires at least three disks.
- Data is split/striped into blocks (e.g., 16 KB), and parity blocks are generated using the XOR operation. Parity is a widely used method to provide fault tolerance and data recovery.
- Data blocks and parity blocks are evenly distributed across all disks.
- XOR takes 0s and 1s as input: it outputs 1 if the total number of 1s in the input is odd, and 0 if the total number of 1s in the input is even. For example, XOR(1,0,1,1,0,0,0) = 1 because the input contains an odd number of 1s.
- With more than one failed disk, the data cannot be reconstructed, because there is insufficient information for XOR to recalculate the lost data; a RAID 5 array therefore tolerates at most one failed disk at a time.
- Why does RAID 5 distribute parity among different disks? Because writing parity blocks to multiple disks can be done in parallel, which is much faster than writing all of them to one dedicated parity disk (that disk would become the performance bottleneck of the RAID 5 array).
- RAID 5 is cost effective. Read performance is good, but write performance can be very poor, because a parity block has to be recalculated for every write operation. IMPORTANT FOR THE EXAM: when the same disk system is used by many people and the data changes, the parity block has to be recalculated for each write, which is why parity is distributed across the different disks. (A small parity sketch follows below.)
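A small Python sketch of the XOR-parity idea behind RAID 5: the parity block is the XOR of the data blocks, and any single missing block can be rebuilt from the rest (the block contents here are arbitrary example bytes):

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks striped across three disks, parity stored on a fourth.
d0, d1, d2 = b"\x0f\x10", b"\xaa\x01", b"\x33\x44"
parity = xor_blocks([d0, d1, d2])

# Simulate losing disk 1: XOR of the surviving blocks and the parity
# reproduces the lost block.
rebuilt_d1 = xor_blocks([d0, d2, parity])
print(rebuilt_d1 == d1)   # True
```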
RAID 6: striping and double parity
- Extends RAID 5 by adding another parity block; requires at least 4 disks.
- It uses block-level striping with TWO parity blocks distributed across all member disks, and it can tolerate at most two failed disk drives at the same time.
- RAID 6 is less cost effective than RAID 5, since the equivalent of two entire disks is used to store parity.
- Good read performance, but write performance can be even worse than RAID 5.
- Suitable for servers that require highly available data archiving.

Storage solutions used by enterprises today

Tape library
Contains: one or more tape drives, a number of slots to hold tape cartridges, a barcode reader to identify tape cartridges, and an automated robot for loading tape cartridges.

DAS (Direct Attached Storage)
A storage system where one or more dedicated disk drives are directly attached to a computer or a server; these disks might be arranged as a RAID array. No network device is required, and data is only directly accessible by the server to which the DAS is attached. Limited scalability, due to the capability of the RAID controller used. Ideal for small businesses.

NAS (Network Attached Storage)
A centralized data storage server that is connected to a network. The goal is to provide data access and sharing to heterogeneous users on the network. Data access is at file level, rather than block level. Suitable for homes and SMBs (small and midsize businesses); easy backup and good fault tolerance.

SAN (Storage Area Network)
A dedicated high-speed network of storage devices over Fibre Channel. These devices are not accessible through the LAN, thereby preventing LAN traffic from interfering with data transfer. A SAN is primarily used to enhance the accessibility of diverse storage devices (disk arrays, tape libraries) to servers, and it provides block-level access. Used in data centers, large enterprises, or virtual computing environments. Key features: supports diverse storage devices, provides high-speed data transfer, and is highly scalable (more storage can keep being attached to the network). Suitable for large organizations and for complex, mission-critical applications.

Object storage
A data storage architecture that manages data as objects, rather than files or blocks. Each object has a globally unique identifier called the object ID, plus metadata (such as filename, date, owner, access permissions, file description). Object stores do not use directory trees to store objects: all objects are placed in a flat address space (like a bucket), and each object can be accessed by its object ID at the application level over HTTP or HTTPS. Suitable for storing massive amounts of unstructured data (any data that cannot be stored in a database table). Examples of unstructured data: text-heavy files (emails, e-books, word-processing documents, webpages, health records) and multimedia content (such as videos, audio files, photos). Unstructured data is often handled and analyzed with techniques such as data mining, text analytics and natural language processing. Cloud providers that offer object storage include Amazon Web Services S3, Rackspace Files and Google Cloud Storage.

SDS (Software-Defined Storage)
A marketing term for data storage software that is able to virtualize heterogeneous physical data storage devices into one large shared storage pool and manage the pool via one administrative interface. It is a toolset that helps a business address diverse data management challenges to better achieve business objectives, and it provides policy-based data management:
- Replication, snapshots, backup
- Data deduplication (a technique to reduce duplicate data copies)
- Thin provisioning
  ○ Relies on on-demand storage capacity allocation, rather than allocating storage space in advance.
  ○ Aims to optimize utilization of available storage.
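Since objects are accessed over HTTP/HTTPS at the application level, as noted above, here is a hedged Python sketch of a generic S3-style PUT/GET; the endpoint URL, bucket and object key are made-up placeholders, and real services additionally require authentication headers:

```python
import requests

# Hypothetical object-store endpoint, bucket and object key (placeholders only).
BASE_URL = "https://objects.example.com"
OBJECT_URL = f"{BASE_URL}/my-bucket/reports/2024/summary.txt"

# Store an object: the body is the object's data, addressed by its key.
resp = requests.put(OBJECT_URL, data=b"quarterly summary text")
resp.raise_for_status()

# Retrieve the same object later by the same key (no directory tree involved).
resp = requests.get(OBJECT_URL)
print(resp.content)
```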
Summarizing storage solutions
- Tape library - awesome
- DAS - no network device is needed
- NAS - centralized data sharing
- SAN - a high-speed network of storage
- Object storage - a data storage architecture for unstructured data
- SDS - a toolset

Linux commands
Wildcards, and commands for manipulating files and directories: mkdir, cp, mv, rm, cat.

Linux is case-sensitive: uppercase and lowercase are treated as different. "Abc.txt" and "abc.txt" are treated as two different files, "documents" and "Documents" are treated as two different directories, and "John" and "john" are treated as two different users.

Wildcards, also known as globbing, are a shell feature that provides powerful file manipulation. Wildcards are a set of special characters that enable us to construct a search pattern to quickly specify a group of files.

mkdir - create directories
Allows us to create one or more directories. If you try to create a directory that already exists, you get a message that the file already exists.

cp - copy files and directories
Syntax:
- cp item1 item2
- cp item... directory (copies multiple items, either files or directories, into a directory; the directory must already exist)
There are several commonly used options for cp.

mv - move/rename files and directories
Performs both file moving and file renaming; in either case, the original file(s) are removed from their original location.
Syntax:
- mv item1 item2 (moves or renames item1, which can be either a file or a directory, to item2)
- mv item... directory (moves one or more items into a directory)
Options for mv: -i (asks for confirmation); -u (when moving files from one directory to another, only move files that either don't exist in the destination, or are newer than the existing corresponding files in the destination directory).

rm - remove files and directories
Syntax: rm item...
Options: -i (prompts for confirmation); -r (recursively delete directories; if a directory being deleted has subdirectories, deletes them too - to delete a directory, this option must be specified); -f (ignore nonexistent files and do not prompt; this overrides the -i option); -v (display informative messages while deleting).

cat - concatenate files
With no arguments, displays what we type on the keyboard (Ctrl+D to exit). Given a file name, it displays a short text file (cat followed by the file name).

I/O redirection and pipeline
Outline: standard input, output and error (stdin 0, stdout 1, stderr 2; I = input, O = output); I/O redirection (where the input comes from and where the output goes to): redirect stdout to a file, redirect stderr to a file, redirect both stdout and stderr to a file, redirect stdin from a file; and connecting multiple commands together to build a command pipeline.

Redirect stdout to a file
The redirection operator > redirects the standard output of a command/program to a file: the operator is followed by the name of the file. The echo command, followed by a string, prints out exactly what you type.
- Overwrite an existing file: >
- Append to an existing file instead of overwriting: >>
- Redirect error messages (stderr) to a file: 2>
- Redirect both stdout and stderr to a file: &>
You can quickly create AN EMPTY FILE by just typing > followed by a file name.
The cat command can copy stdin to stdout, so we can also redirect what we type on the keyboard to a file.
We can also redirect stdin from a file with < (the file then becomes the input of cat).
The grep command is used for searching; searching for a word in a file: grep keyword < fileName

Pipeline
command1 | command2
A pipeline communicates between two processes: using the pipe operator, the stdout of one command is piped into the stdin of another command.
The difference between > and |: the pipeline operator | connects a command with another command, while the redirection operator > connects a command with a file. Never redirect a command to another command with > - NEVER.

Commands that can be used in a pipeline

sort
Syntax: sort [options] [files...]
Sorts a command's output in ascending order. Option -r reverses the sorting to descending order. Option -u sorts and removes duplicates (only unique items are listed and sorted). To output the sorted result to a file, add the option -o; syntax: sort -o newFile originalFile. If no arguments are provided, the sort command accepts stdin; therefore it can be used in a pipeline.

uniq
Syntax: uniq [options] [input [output]]
It does not sort, but is used to filter out adjacent, duplicate lines from a file or stdin; it basically removes adjacent repeated lines. uniq -d lists only the adjacent duplicated lines. The uniq command is often used in conjunction with sort in a pipeline.

wc
wc stands for word count. It is used to display the number of lines, words and bytes contained in a file or in multiple files. | wc -l is used to print only the number of lines.

grep
Syntax: grep pattern [file...]
grep is a powerful command used to find text patterns or keywords within files; it prints the lines containing the pattern/keyword. Option -i ignores letter case when performing the search: ls | grep -i b searches for lines containing either a 'b' or a 'B'. grep can also be used in a pipeline. To search for multiple keywords, separate the keywords with "\|" and include all patterns in single quotation marks; whitespace is considered part of the keyword, and if the pattern is not wrapped in single quotation marks there will be an error. The grep command accepts multiple files.

head vs tail
With no options, the head command prints the first ten lines of a file and the tail command prints the last ten lines of a file. With option -n we can set the number of lines we want to see (-n number-of-lines, or just -number-of-lines). Both commands can also be used in a pipeline.

Monitor files in real time using tail
By specifying option -f, the tail command shows the contents of a file in real time. This is useful for watching the progress of log files as they are being written.

tee
Syntax: command | tee [options] [file...]
Reads stdin and copies it to both stdout and the file(s); it can write to multiple files.
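The same pipeline idea can also be driven from Python with the subprocess module; a small sketch that reproduces something like `sort | uniq | wc -l` on an in-memory word list (the word list is an arbitrary example, and the sort/uniq/wc commands must be available, as they are on Linux):

```python
import subprocess

words = "banana\napple\napple\ncherry\nbanana\n"

# Stage 1: sort the lines (stdin of `sort` is fed from Python).
sorted_out = subprocess.run(["sort"], input=words, capture_output=True, text=True).stdout

# Stage 2: drop adjacent duplicates, as `uniq` does after `sort`.
uniq_out = subprocess.run(["uniq"], input=sorted_out, capture_output=True, text=True).stdout

# Stage 3: count the remaining lines, like `wc -l`.
count = subprocess.run(["wc", "-l"], input=uniq_out, capture_output=True, text=True).stdout
print(count.strip())   # 3
```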
Permission
File access control
Commands like:
○ chmod - change file permissions
○ umask - display or set the file mode mask
○ su - run a shell as another user
○ sudo - execute a command as another user
○ chown - change a file's owner
Permission attributes: R W X
File access control
File access control is a feature of an OS to protect files and directories from any unauthorized access
Why does Linux need this?
Multi-user system
More than one user may use the same computer at the same time
In order to protect users from each other and to prevent users from interfering with files belonging to another user, file access control is required
Linux controls file access by setting permissions on each file and directory
Permissions can be set to grant or deny access to specific files and directories
File mode: shows the file permission attributes for three types of users
Owner - the user who owns the file/directory
Owner group - a group of users who own the file/directory or have certain permissions
World - all other users on the system
Permission attributes
R read permission
○ For files: the user can view the file's content
○ For directories: the user can list the files and subdirectories inside the directory
W write permission
○ For files: the user can modify the file's content (edit or delete it)
○ For directories: the user can add, delete, or rename files and subdirectories within the directory
X execute permission
○ For files: the user can run the file as a program or script
○ For directories: the user can enter (navigate into) the directory and access its contents
Truncated - a file can be made empty (truncated); if the permissions are not right for such an operation, there is a command that can change the permissions
CHMOD - CHANGE FILE MODE
This command is used to change the file mode (permissions) of a file/directory
Who can execute this command:
The file/directory owner
Superusers
○ A superuser is a user who has all permissions to perform administrative tasks. A superuser is not necessarily the root user; the root user is a superuser and the main superuser.
○ There can be more than one superuser if the permission is granted
chmod syntax
chmod syntax with octal number representation
To view the permissions of a file or directory, use the command ls -l
The first character - indicates it is a file (d for a directory)
rwx: permissions for the owner (read, write, execute)
r-x: permissions for the group (read and execute)
r--: permissions for others (read only)
Changing permissions
To modify permissions, use the chmod command
chmod u+x filename: adds execute permission (+x) for the owner (u)
chmod g-w filename: removes write permission (-w) for the group (g)
chmod o=r filename: sets read permission (r) for others (o)
Can also use numeric values: R=4, W=2, X=1
For example, chmod 755 gives the owner full access (rwx), and the group and others can only read and execute (r-x)
How can we calculate the actual permissions for any newly created file?
How can we calculate the actual permissions for any newly created directory?
UMASK COMMAND
This command controls and decides the actual permissions for any newly created file and directory
It uses octal notation to express a mask of bits to be removed from the file mode of a file/directory
For example, with the default mask 022, removing the masked bits from a new file's base mode 666 gives 644 (rw-r--r--), and from a new directory's base mode 777 gives 755 (rwxr-xr-x)
It is important to know which command to use to change permissions:
For files/directories created in the past, use the chmod command
For all files and directories created in the future, use the umask command
Example
Change our role in Linux
Why: in some cases we might need to take on the identity of another user, for example:
○ When we want to gain superuser privileges to carry out some administrative task
Create/delete a user account
○ When we want to test an account for a regular user
Test if someone is able to access some files
How:
1. Log out and log back in as the alternative user
2. Use the sudo command
3. Use the su command
SUDO
This command allows a regular user to execute commands as another user (usually the superuser) in a controlled way.
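A quick hedged illustration of what this means in practice:
whoami          # prints your own username
sudo whoami     # runs the same command as the superuser, so it prints root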
On skyhigh we don't need a password to use the sudo command for the user account "ubuntu". However, if you use your own Linux machine, you will be asked to enter your own password when using the sudo command
sudo less /etc/shadow
To see the contents of the /etc/shadow file using less. We use sudo since the file is restricted to the root user for security reasons; that is why we need superuser privileges
SU COMMAND
The su command enables us to start a shell session as another user
If no username is specified, the root user is assumed
In general, executing this command requires the target user's password (not your own password)
On skyhigh the root user's password is unknown and randomly generated for each VM. If you want to become the root user, you can run "sudo su"
To return to your original user account, just type "exit"
Which one is more secure, "su" or "sudo"?
sudo is more secure than su, because su needs the root user's password but sudo doesn't
CHOWN COMMAND
Change file ownership
This command is used to change the owner and group owner of a file or directory
Syntax:
Superuser privileges are required to run this command
A regular user is not able to change the ownership of a file using this command
User management
The id command
The /etc/passwd file
The /etc/shadow file
Commands related to user management
useradd
usermod
userdel
passwd
Different accounts on Linux
1. Root user account
This account can run any command without any restriction
Its username is "root" and its UID is 0
2. Service accounts
Used by system services such as web servers
Typically created and configured by the package manager upon installation of the service software
Service accounts often have a predefined UID (< 1000)
3. User accounts
For regular users
These accounts have limited access to Linux
Main account information
Username (or account name)
It is a name that uniquely identifies an account on Linux
Case sensitive (e.g., Amy and amy are different accounts)
Password
A sequence of characters used to verify the identity of an account
UID (user identifier)
A numeric value used by Linux to identify an account and determine what system resources the account can access
GID (group ID)
It is a numeric value that uniquely identifies a group
Each account in Linux is assigned to a primary group
Each account can also belong to other groups to share files or resources
In the ls -l output, the first name shown is the owning user and the second is the name of the primary group; for the account "ubuntu" both are "ubuntu" (the username and the group name are the same)
COMMAND "id"
This command prints information about an account
Syntax: id [-options] [userName]
Superuser privileges are not required to run this command
Without any arguments, it shows your UID, username, GID, primary group name, and all the other groups you belong to
The id command accepts only one argument at a time; it doesn't allow multiple usernames, so passing several at once doesn't work
The /etc/passwd file
This file contains information about all accounts on Linux
You can use cat to read it; superuser privileges are not needed
The /etc/passwd file is a text file and it is world-readable
Each record represents one account, and it consists of 7 fields separated by colons (:)
1. User or account name
2. The password field; 'x' denotes that the password is encrypted and saved in the /etc/shadow file
3. UID
4. Primary GID (more group info is stored in the /etc/group file)
5. A commentary field to describe an account, such as full name or contact info
6. Home directory
7. The default login shell that is assigned to the account (/bin/bash)
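As a hedged illustration, a typical /etc/passwd record (all values here are made up) might look like:
kari:x:1001:1001:Kari Hansen,IT department:/home/kari:/bin/bash
Reading the fields left to right: username kari, 'x' meaning the encrypted password is in /etc/shadow, UID 1001, primary GID 1001, the comment field, the home directory, and the login shell. You could view your own record with, for example, grep ubuntu /etc/passwd.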
The /etc/shadow file
All password information associated with each account is stored in the /etc/shadow file
This file is a text file, but it is not world-readable
You can't simply use cat; you need superuser privileges to see this file
Each record consists of 9 fields separated by colons, showing the password information for one account
1. Username
2. Encrypted password
3. Last password change date
Expressed as the number of days since January 1, 1970; this date is called the Unix epoch (the date when time started for Unix computers; this timestamp is marked as 0)
4. Minimum password age
The minimum number of days that must pass before a user is allowed to change their password
Default value 0 (the user can change their password at any time)
5. Maximum password age
The maximum number of days a password can be used before it must be changed. Once the time period expires, users are forced to create a new password
Default value 99999 (the password will effectively always be valid; the system will not force you to change your password, so you can keep using the same one)
6. Password warning period
The number of days before the password expiration date when users start receiving notifications to change their passwords
Default value 7
7. Password inactivity period
It defines a number of days after the user's password expires. During this period, the user still needs to change their password; otherwise, the account will be locked (disabled) after this period. The user will then need to contact the system administrator to enable the account
By default, this field is empty
8. Account expiration date
The date when the account will be locked
It is expressed as the number of days since January 1, 1970
By default, this field is empty
9. This field is reserved for future use
What will happen if the maximum password age is 99999 and the inactivity period is 7?
The password will (practically) never expire, so the inactivity period never comes into play
What will happen if the account expiration date is earlier than the date of the password expiration?
The account will be locked on the expiration date, preventing login before the password expires
Commands related to user management
useradd
usermod
userdel
passwd
Note: superuser privileges are required to run these commands
COMMAND "useradd"
This command is used to create a new account
Syntax: useradd [-options] NewAccountName (exactly one account name here)
Example: sudo useradd john
A new account "john" will be created, but in a locked state, because this account does not have any password yet
To unlock this account, we need to set a password for it by typing "sudo passwd john"
You will be prompted to enter a password for the account
Commonly used options:
Option 'm'
With this option, a default home directory will be created
Options 'md'
To specify a different home directory for the new account
Option 'c'
Allows us to add an extra comment or info to the new account
Option 'e'
To add an account expiration date. Two ways to specify the date:
1. Use the format YYYY-MM-DD
2. Directly specify an integer (the number of days since the Unix epoch (Jan 1, 1970))
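For example, a hedged sketch of creating an account with an expiration date (the username and date are made up):
sudo useradd -m -e 2026-06-30 kari
sudo passwd kari
The first command creates the account kari with a default home directory and sets it to expire on 30 June 2026; the second sets a password so the account is no longer locked.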
Option 'u'
Assign a specific UID to the new account
Note that the UID cannot already be used by another account
For example: sudo useradd -u 1212 eric
Use case: maintain UID consistency across multiple systems
Options 'ou'
Assign an existing UID to the new account
Use case: collaborative work for multiple user accounts
Option 'g'
Specify an existing group to be the new account's primary group
Both a GID and a group name are accepted
Option 'G'
Add the new account to multiple existing groups
Each group name or GID is separated from the next by a comma, with no intervening spaces
Option 'f'
To set the inactivity period
Value '0' means that account "fanny" will be disabled as soon as her password has expired
Option 's'
By default, the login shell for every new account is "/bin/sh" (rather than "/bin/bash")
This is defined by the variable "SHELL" in the "/etc/default/useradd" file
With option 's', we can change the login shell for the new account
For example: sudo useradd -s /bin/bash john
When we create an account using the useradd command, this command will perform the following actions:
1. It edits the following four files for the new account: /etc/passwd, /etc/shadow, /etc/group, /etc/gshadow
2. If option 'm' is specified, this command also creates a default home directory for the new account and sets the permissions and ownership of this directory
3. However, if the options 'md' are specified, the command creates the specified home directory and sets the permissions and ownership of this directory
COMMAND "usermod"
This command is used to change any attribute associated with an existing account
Syntax: usermod [options] ExistingUsername
Commonly used options
Option 'g'
Change the specified account's primary group
Option 'G'
Add the specified account to more existing groups (note that -G on its own replaces the supplementary group list; combine it with -a to append to the current groups)
Option -G ""
Remove the account from all groups except its primary group
How do we remove an account from one particular group?
sudo gpasswd -d userName groupName
Options 'md'
Change the specified account's home directory
Option 'l'
Change the specified account's username
Syntax: sudo usermod -l newname oldname
Option 'u'
Change the specified account's UID to another unique UID
Options 'ou'
Change the specified account's UID to another existing UID
Option 'e'
Change the account expiration date
Two ways to specify the date:
1. Use the format YYYY-MM-DD
2. Specify an integer
By specifying -1 with option 'e', we disable the account expiration
Field 8 will become empty again
Option 'f'
Change the specified account's inactivity period
Option 'L'
Lock a user's password by adding a '!' at the beginning of the encrypted password, which disables the password
Option 'U'
Unlock a user's password by removing the '!' in front of the encrypted password
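For example, a hedged sketch of locking and unlocking a password (the username olaf is taken from the notes' other examples):
sudo usermod -L olaf    # lock: a '!' is placed in front of olaf's encrypted password
sudo usermod -U olaf    # unlock: the '!' is removed again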
Option 'c'
Change the extra comment or info for an existing account
Syntax: sudo usermod -c "Olaf Torland, Accounting Department" olaf
Option 's'
Change the login shell for an existing account
Syntax: sudo usermod -s /bin/bash olaf
COMMAND "userdel"
This command is used to delete an account
Syntax: userdel [options] username
If no option is specified, the specified account will be removed, but the associated home directory will be kept
Syntax: sudo userdel olaf
Use case: if the files in that directory are worth keeping
Option 'r'
The account's home directory will be removed as well
Syntax: sudo userdel -r olaf
COMMAND "passwd"
This command changes the password of a user account
Syntax: passwd [options] username
A regular user can only change their own password
To change your own password, just type "passwd" without any argument
You will be prompted to enter your current password, your new password, and to retype the new password for verification
A superuser can change any account's password
Option 'd'
Delete the user's password (make it empty)
This is a quick way to disable an account's password
Option 'e'
Immediately expire the specified user's password and force the user to change it at the next login
Option 'l'
Lock the specified account's password by adding a '!' at the beginning of the password
Option 'u'
Unlock the password of the specified account by removing the '!' from the beginning of the password
Field 7 (inactivity period): configurable with option 'i'
Minimum password age: option 'n'
Maximum password age: option 'x'
Warning period: option 'w'
The id command - shows a user account's username, UID, group name(s), and GID(s)
The /etc/passwd file - world-readable
The /etc/shadow file - not world-readable
Commands related to user management - useradd, usermod, userdel, passwd
Troubleshooting
The six-step troubleshooting methodology
Recommended by CompTIA, which stands for the Computing Technology Industry Association
Troubleshooting is a systematic approach to solving problems
The goal of troubleshooting is to determine why something does not work as expected and how to resolve the problem
Having a troubleshooting methodology is important since it gives you a starting place and a logical sequence of steps to follow
The six-step troubleshooting methodology
1. Identify the problem
Collect information on what is going wrong
It is critical to identify the underlying problem causing the symptoms
Gather information and identify the problem by asking detailed questions
○ When did the symptoms occur? How often? Has anything changed lately? Have you done anything to resolve this issue?
2. Establish a theory of probable cause
Use your technical knowledge to prioritize that list
Question the obvious causes
List the possible reasons and prioritize the list from the most to the least likely, then work through it and check each possible cause
3. Test the probable cause theory to determine the actual cause
Test each possible cause in your theory to determine the actual cause
4. Establish an action plan and execute the plan
This plan should outline the steps needed to fix the issue and may involve making changes, adjustments or repairs
For example: the adapter is broken --> you buy a new adapter; some problems can be more complicated than that. You can also call for help if you don't have much experience
5. Verify full system functionality
Make sure no new issues have been introduced during the troubleshooting process
You may be able to implement preventative measures so that the same problem won't occur again
Your original problem is resolved and no other issues have occurred during the process
6. Document the process
Document the whole process, including the problem, the symptoms, your actions, and the corresponding outcomes
Proper documentation is essential for future reference, knowledge sharing, and as a reference in case similar problems occur in the future
Identify which actions really worked, and which actions didn't work to resolve the issue. If you experience the same problem again, you have documentation on how to resolve it. It is also helpful for other people who may experience the same problem and can use your knowledge
Log Analysis
Logs are computer-generated records. They might be generated by operating systems, applications, or network devices.
Log analysis is the activity of analyzing logs to find valuable information
Use cases:
To troubleshoot system, computer or network problems
○ Reducing problem diagnosis and resolution time
To understand the behaviour of users, for example on a system that hosts many users, and to see which users might compromise your system
To comply with internal security policies and outside regulations and audits (you must know how to analyze a log to understand its contents)
To help organizations or businesses to either reactively or proactively mitigate different risks
LINUX LOGS main theme
In Linux, you can find plenty of logs; they provide a timeline of events for the system, for the kernel, for package managers, and for the booting process
Most log files can be found in the /var/log directory (ls -l /var/log)
System logs
Contain detailed records of events regarding a Linux system, including hardware, OS, software, system processes, and system components
Authorization log:
It records authentication-related events on the system: who has successfully logged in and at what time, and who tried and failed to log into your system
Location /var/log/auth.log; you can access it with cat /var/log/auth.log
This log is useful for learning about user logins, password changes, and account lockouts
Kernel log
It records kernel-related messages and events
Location /var/log/kern.log (an empty file if there is no issue)
It is useful for troubleshooting hardware issues, diagnosing kernel-related problems, or monitoring the general health and behaviour of the kernel
System log
Location /var/log/syslog
It logs everything about the system, except the authorization-related events
Application logs
Many applications also create logs in the /var/log directory
Some examples:
Apache HTTP server logs
○ /var/log/apache2/access.log contains entries for every request made to the web server (from which IP address, for instance)
○ /var/log/apache2/error.log records all error messages reported by the HTTP server
CUPS print system logs
○ The Common Unix Printing System (CUPS) is a printing system commonly used on Unix-like OSes
○ It uses /var/log/cups/error_log to store error messages and events related to printing and printer management
○ If you need to solve a printing issue in Ubuntu, this log may be a good place to start
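As a hedged sketch of inspecting these application logs (assuming the Apache server is installed; the keyword "denied" is made up):
sudo tail -n 20 /var/log/apache2/access.log
sudo grep -i "denied" /var/log/apache2/error.log | less
The first command shows the 20 most recent requests; the second filters the error log for lines containing the keyword in any letter case and pages through the matches.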
Non-human-readable logs
Some log files in the /var/log directory are designed to be read by applications, rather than by humans
Example: last login log
○ This log file stores the last login information for all users
○ Location /var/log/lastlog
○ To see the list of logins page by page: lastlog | less -> to exit, press the 'q' key
We can't use cat to see the contents of this file
But we can use the 'lastlog' command, for example while in the /var/log directory
lastlog | less shows the contents of this file one page at a time
0 indicates the first terminal
Linux commands for log analysis:
less: to see a log file page by page
sort: to see the output of a command in sorted order
uniq: to filter out the repeated lines in a file or from stdin
tail: to see the last few lines of a log file
○ Option 'n' to specify the number of output lines
Example: tail -n 5 /var/log/auth.log
Option 'f' to keep monitoring a log file on the fly
Example: tail -f /var/log/auth.log
grep
○ To search for lines based on a keyword or regular expression
○ Only the matching lines will be displayed on the screen
Example: grep kelly /var/log/auth.log
All the lines containing the word 'kelly' will be printed
Can be used to perform a surround search if we want to see a number of lines before or after a match
○ Option 'B' (before) to specify the number of lines to return before the matching line
○ Option 'A' (after) to specify the number of lines to return after the matching line
Prints the requested number of lines before or after the matching word by using the grep command with options 'B' and 'A'
cut
○ This command allows us to parse fields from delimited logs
○ Option 'd' to specify the desired delimiter
Delimiters are characters like the equals sign (=) or the colon (:) that break up fields or key-value pairs
○ Option 'f' to specify which field(s) of the results should be printed
awk
○ It is a powerful command and a good tool for log analysis
○ It provides a scripting language for text processing
○ With the awk scripting language, we can:
Define variables
Use string and arithmetic operators
Use control flow and loops
Generate formatted reports
○ Syntax: awk 'pattern {action}' filename
awk takes each line of the specified file as its input
If a line contains the specified pattern, awk performs the specified action on that line and prints the corresponding output
Examples:
Print all lines that are within a particular time range
For example, to list all logs between 13:25 and 13:45:
awk '/13:25:/, /13:45:/ {print}' auth.log
Print all lines that match a pattern and are within a particular time range
For example, to list all logs containing 'close' where the time is between 13:25 and 13:45:
awk '/13:25:/, /13:45:/ {if (/close/) print}' auth.log
Shell Scripting
Shell scripting is an important part of process automation in Linux
Scripting enables us to write a sequence of (lengthy and repetitive) commands into a file and then execute them
It saves us a lot of time because we don't need to type the commands again and again for daily tasks
We focus on Bash scripts: a series of commands written in a file and then executed by Bash
A Bash script usually ends with the file extension .sh
Example: wordcount.sh
How to create your Bash script
1. Create a file with .sh as the file extension (not required)
2. Find the path name to your Bash shell
3. Write commands to the file and check the content of the file
4. Set execute permission on the file
5. Run the script
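A minimal hedged sketch of these five steps (the script name hello.sh and its contents are made up):
which bash                          # step 2: find the path to Bash (commonly /bin/bash)
echo '#!/bin/bash' > hello.sh       # steps 1 and 3: create the file, starting with the shebang line
echo 'echo "Hello from a script"' >> hello.sh
cat hello.sh                        # check the content of the file
chmod u+x hello.sh                  # step 4: give the owner execute permission
./hello.sh                          # step 5: run the script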
Example:
Basic syntax of Bash scripting:
Variables
Arithmetic expansion
Read user input
If else statements
For loop
While loop
Variables
We can define a variable using the syntax variable_name=value (no spaces around the '=')
To retrieve the value of the variable, add $ before the variable name
Arithmetic expansion
Arithmetic expansion allows the evaluation of an arithmetic expression and the substitution of the result
Format: $(( expression ))
Expression is an arithmetic expression consisting of integer values and arithmetic operators
'+' for addition
'-' for subtraction
'*' for multiplication
'/' for division
'**' for exponentiation
'%' for modulus
The expression can also be calculated and stored in a variable using the syntax: var=$(( expression ))
Use $(( )) for simple integer calculations and bc for more complex or floating-point arithmetic
Example:
Read user input
In Bash, we can take user input using the read command
Syntax: read variable_name
Option 'p' prompts the user with a custom message:
read -p "Enter a number: " variable_name
Or just write it in two lines using echo:
echo "Enter a number: "
read a
Example:
If else statements
They are used for conditional execution of commands
They allow you to make decisions in your scripts based on the evaluation of a condition
Syntax:
Examples:
For loop
A for loop in Bash is used when you want to iterate over a sequence of values, such as numbers, items in a list, or files in a directory
Syntax:
Examples:
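Since the syntax and example slides above are only outlined here, the following is a minimal hedged sketch combining these constructs (all names and values are made up):
#!/bin/bash
name="world"                    # variable assignment: no spaces around '='
echo "Hello, $name"             # retrieve the value with $
sum=$(( 3 + 4 ))                # arithmetic expansion
echo "3 + 4 = $sum"
read -p "Enter a number: " n    # read user input with a prompt
if [ "$n" -gt 10 ]; then        # if/else based on a condition
  echo "greater than 10"
else
  echo "10 or less"
fi
for f in *.txt; do              # for loop over the .txt files in the current directory
  echo "found $f"
done
Saving this to a file, making it executable with chmod u+x, and running it walks through each construct in turn.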