CP4P_Compression-and-Backup.pptx
Uploaded by VibrantSwamp
COMPUTER PRINCIPLES FOR PROGRAMMERS
Compression, 3ncrypt10n, Backup

How many programmers does it take to change a lightbulb?
rKR0sCfVHXydWGwgIS5UbodHXjlpENq6TeXWEE12PSU=
(copy encrypted punchline and click, enter secret: CP4P)

News of the Week

Agenda
Lecture:
1. What, Why, and How of “File Compression”
   … depends on the Use Case
   … overview of formats
   … Lossless vs Lossy
2. What, Why, and How of “Backup”
   … types of backups
   … backup media

Agenda
Activity:
A. Explore File Compression à la LZW
B. Compress various file formats within a ZIP archive and compare their compression factors
C. Do your own 3-2-1 backup

What is “File Compression?”
Pink × 4, Green × 5, Blue × 3
Don't store redundant/repeating data.

What is “File Compression?”
Storing a file’s data in less space by “minimizing redundancy” in the content.
An archive is a collection of folders and files stored in one file, e.g. filename.ZIP
o archives are usually compressed to save space
o archives can be encrypted for security
o ZIP for cross-platform exchange (Katz & Conway, 1989)
o local OS options to compress / encrypt local files

Why use File Compression?
Writing / Sending data takes time (I/O is slow)
o Tape – Linear Tape-Open drives: on-board compression & encryption
o NAS – Network Attached Storage (local + backup/passthru to cloud)
o FTP or Cloud Storage: AWS Glacier, Google Nearline
o USB – removable drives (ad hoc, user level only)
Encrypt for security
o once a drive is dismounted, there is no OS security
o most archiving software has an encryption option
Streaming
o sender sends compressed data, receiver decompresses
o VoIP compresses data in real time

How File Compression is done
Data compression combines:
1. match and replace duplicate strings using a dictionary: unique code = "repeating string"
   Lempel–Ziv–Welch (LZW) compression (1984)
2. replace each 8-bit char with a variable bit length symbol based on character frequency
   Huffman coding (1952)
   David A. Huffman was a Ph.D. student at MIT … who didn't know it could not be done.
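The dictionary step above (unique code = "repeating string") can be sketched in a few lines of Python. This is a minimal textbook-style LZW, not the exact variant used by any particular file format: it assumes the input contains only characters with code points below 256, emits plain integer codes rather than packed bits, and the function names are illustrative.

```python
def lzw_compress(text: str) -> list[int]:
    """Encode text as a list of dictionary codes (textbook LZW sketch)."""
    # Assumption: every character has a code point below 256.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the match while the string is already known.
            current = candidate
        else:
            # Emit the code for the longest known string, then learn the new one.
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: list[int]) -> str:
    """Rebuild the text; the decoder relearns the same dictionary as it goes."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # Special case: the code refers to the entry being defined right now.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(output)


sample = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(sample)
print(len(sample), "characters ->", len(codes), "codes")
```

Note that the dictionary is never stored with the output: the decoder reconstructs it from the code stream, which is why repetitive input compresses well.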
Are Pop Lyrics Getting More Repetitive? (yes, they are)
Lempel-Ziv-Welch compression ratio used as a measure of repetitive lyric sequences.

Huffman decoding
Encoded: 1 0 1 1 0 0 0 1 0 1 1 1 0 0
Decode logic (decision tree; ? = next encoded bit; leaves S, L, O, E)
Decoded: 10 110 0 0 10 111 0 0 = L O S S L E S S
8 chars × 8 bits = 64 bits compressed to 14 bits (22%)
Encode / Decode info: 0=S, 10=L, 110=O, 111=E.

Lossless vs. Lossy Compression
Lossless: contains all original data with redundancies removed
o Data, PNG / TIFF images, FLAC / ALAC audio
Lossy: sacrifices quality for smaller file size
o JPG images, MP3 / AAC audio, all video
o drops details your eyes and ears may not notice
o for end-use only, not for modification/editing

Lossless versus Lossy
GIF: lossless, 45.4 KB, 8 bit colour, LZW
TIFF: lossless, 941 KB, 32 bit colour, 32 bit max size
JPG: lossy, 25.4 KB, 24 bit colour, 16 bit max size

HiRes: 9,216 kbps (653%)
CD: 1,411 kbps (100%)
MP3/AAC/Spotify: 320 kbps (22.7%)

Common Compression File Formats
Data: ZIP (standard for cross-platform data exchange), .docx .pptx .xlsx (ZIP), .tar.gz, 7z, RAR
Music: MP3, AAC, MQA, WAV, ogg, FLAC
Video: MPG, MP4, DIVX, XVID, MOV, AVI
Images: JPEG, JXL, AVIF, WebP
Images, lossless: GIF, PNG, TIFF, RAW, AVIF, WebP, JXL
Bold formats on the slide are lossless.

Drawbacks to Compression
o Time: compression needs CPU and primary storage resources
  o PCs have lots of both and only one user. Servers on the other hand…
o Space: archived files must be uncompressed before use, so extra space is needed for both compression & decompression
o Integrity: any data corruption can cause loss of the entire archive
  o Solid or multi-volume archives can be lost with even minor data corruption. Archive repair is possible but not probable.
  o Test your archives to confirm integrity.
o Recoverability: the Lossy sacrifice is reduced quality
  o Lossy compression is appropriate only for specific Use Cases.

Why do we need backups?
#1 Accidental deletion by users or IT people
#2 Hardware failure: all storage fails eventually
(#1 and #2 together: 2/3 to 3/4 of all data loss)
Far Less Frequent Causes
o Catastrophe: sabotage (ransomware), fire, flood, theft
o Account on cloud provider is cancelled or accidentally closed
o Cloud service provider as a single point of failure

Continuous Data Integrity with RAID (Redundant Array of Independent Disks)
RAID 1, 5, 6 tolerate drive failure
o RAID 1 pairs drives
o RAID 5 +1 parity drive
o RAID 6 +2 parity drives
RAID appears as one logical drive space to the OS (excluding parity drives)
Read/write performance increases with multiple drives doing concurrent I/O

Three characteristics of a Backup
A copy in a geographically separate location that is platform independent.

Classic File Backup Strategy
Types: Full (all files) + Differential (only files changed since last Full backup)
Full backup is slow; Differential backup is faster but gets slower as changes accumulate.
Restore requires the Full backup + the latest Differential.

Classic File Backup Strategy
Types: Full (all files) + Incremental (files changed since last backup of any type)
Incremental backups are faster than Differential.
Restore is slowest because the Full backup and every later Incremental must be restored in sequence.

Enterprise backup
o Backup software does Full, Differential, Incremental strategies
o Options for file versions / generations, and periodic snapshots.
o Enterprise OS provides for backup of continuously running systems.
o LTO tape or Optical Disc libraries as nearline tertiary storage
o AWS Glacier, Google Nearline, Sync cloud storage
  o Inexpensive upload storage, download may be ¢¢¢ slow or $$$ fast
Recovery and Restoration speed is highly variable.
Depends on data transfer rates from the backup device or location, and the complexity of rebuilding the relational aspect of database objects.
Data deduplication and Single-instance storage optimize storage
o eliminate duplicate copies of data within and across systems

User Level File Recovery … is not backup
Windows File History, macOS Time Machine are not exactly backup
o Automatic copying of files to external or network drive [good]
o Historical versions of user files maintained. Easy to restore. [good]
o Must configure and test to ensure copying of all user folders. [okay]
o If drive is always connected, it is not a backup, just a copy; likely not geographically separate or platform independent. [BAD!]
Windows Recycle Bin, macOS Trash can are not backup
o Only good for oops! and short-term recovery. [hopeful]
Two-way synchronization is not backup [deluded]
o Synchronization is platform interdependent, not independent. [BAD!]
o A file on one system does not have a "copy" on other systems, [BAD!]
o the same file co-exists on all synchronized systems. [good?]

3-2-1 Backup Checklist
3 copies (change only the active file, not the backups)
o 1 active, 1 local backup, 1 remote backup
2 different formats/platforms (platform independence)
o External drive is platform independent only when not plugged in
o LTO tape or optical disc. Initially local, optionally moved offsite.
o One-way backup to cloud cold storage (not two-way cloud sync)
1 off-site backup (geographically separate location)
o Cloud storage different from your cloud service provider
o tape/optical media – rotate Full, Diff, Incr to offsite storage services

The near loss of Toy Story 2

our data is safe in the cloud
UniSuper pension fund, USD $125 billion, 647,000 members
Cloud provider configures client's services on May 2, 2024
o One active data store at Google Cloud … GONE
o Two backup copies on Google Cloud … GONE
…oops "This should not have happened."
UniSuper pension fund also had a proper 3-2-1 backup with a different provider.
Services restored May 16, 2024

The final word on backups…
Backups do not matter. Only RESTORE matters.

NOTES
…not on the quiz but here for further information and explanation.

Effect of File Compression on Data Transfer
Assumptions:
o 1MB plain text file, unique for each of 30,000 users
o Network throughput is 2 seconds per file
o text compressed to 35% of original, throughput 1 sec/file

Data                                Time          Size
Original 1MB plain text, 8.24 Mb    16.6 hours    241.4 Gb
Compressed to 35%, 2.88 Mb          8.3 hours     84.5 Gb

Compression for end user distribution
o TIFF, PNG, GIF (lossless). JPG (lossy). Compression formats used by the graphics industry
o FLAC (lossless), CD:WAV (~lossless), MP3, AAC, MP4 (lossy). Compression formats used by the sound engineering and music industry
o MPG, MP4, DIVX, XVID, MOV, AVI (all lossy). Compression formats used by the video industry

How Compression Works
Here is an old quote from Vangie Beal:
Data compression is particularly useful in communications because it enables devices to transmit or store the same amount of data in fewer bits. There are a variety of data compression techniques, but only a few have been standardized. The CCITT has defined a standard data compression technique for transmitting and a compression standard for data communications through modems. In addition, there are file compression formats, such as ARC and ZIP.
This quote contains 449 characters.

How Compression Works (cont’d)
Replace “compression ” with "♠". The text becomes:
Data ♠is particularly useful in communications because it enables devices to transmit or store the same amount of data in fewer bits. There are a variety of data ♠techniques, but only a few have been standardized. The CCITT has defined a standard data ♠technique for transmitting and a ♠standard for data communications through modems. In addition, there are file ♠formats, such as ARC and ZIP.
With dictionary “♠compression˽” and 5 replacements, total size is 406 characters, 90.4% of 449.
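The token replacement shown above can be reproduced with a short Python sketch. It is only an illustration of the slide's idea, not a real compression algorithm: the dictionary is written by hand, the sample text is a shortened stand-in for the quote, and the function names are made up for this example. It also assumes the token characters never occur in the original text.

```python
def substitute(text: str, dictionary: dict[str, str]) -> str:
    """Replace each dictionary phrase with its one-character token."""
    # Longest phrases first, so a short phrase cannot break up a longer one.
    ordered = sorted(dictionary.items(), key=lambda item: len(item[0]), reverse=True)
    for phrase, token in ordered:
        text = text.replace(phrase, token)
    return text


def restore(encoded: str, dictionary: dict[str, str]) -> str:
    """Expand every token back into its phrase (assumes tokens are unique symbols)."""
    for phrase, token in dictionary.items():
        encoded = encoded.replace(token, phrase)
    return encoded


def stored_size(encoded: str, dictionary: dict[str, str]) -> int:
    """Characters needed for the encoded text plus the dictionary itself."""
    return len(encoded) + sum(len(token) + len(phrase) for phrase, token in dictionary.items())


# Hand-written dictionary in the style of the slide: phrase -> token.
dictionary = {"compression ": "♠"}
sample = "Data compression is useful. There are a variety of data compression techniques."

encoded = substitute(sample, dictionary)
print(encoded)
print(len(sample), "characters ->", stored_size(encoded, dictionary), "including dictionary")
```

Each replacement saves the phrase length minus one character, and the dictionary entry has to be paid for once, which is why a phrase only earns its place if it repeats often enough.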
The algorithm builds a token/string dictionary.

How Compression Works (cont’d)
With more pattern matching and a bigger dictionary…
Dictionary: ♠compression˽ ☻transmit ♣here are˽ ♥data˽ ♦communications˽ ☺standard˽ ♪technique
♥♠is particularly useful in ♦because it enables devices to ☻ or store the same amount of ♥in fewer bits. T♣a variety of ♥♠♪s, but only a few have been ☺ized. The CCITT has defined a ☺♥♠♪ for ☻ting faxes and a ♠☺ for ♥♦through modems. In addition, t♣ file ♠formats, such as ARC and ZIP.
Including dictionary, total size is 363 characters or 81% of original.
The more pattern matches there are for each dictionary item, the higher the compression.

Two types of Compression Formats
Lossless (LZW and/or Huffman)
ZIP, TIFF, FLAC, and other general file compression routines are lossless: all original data is completely encoded. Lossless compression reduces redundant data by not repeating recurring strings of the same data (e.g. a large blank space in a TIFF image or a long noiseless passage in FLAC audio). After decompression, the data is always complete and is indistinguishable from the original source file.
GIF files use LZW lossless compression as part of their file format. Although GIF images capture only 256 colours per frame, this is the format's maximum colour depth. The limited colour depth is by design; it is not an artifact of compression. GIFs are most successful as sharp-edged line art with animations.
Lossy
JPG/JPEG, MPG, MP3, and other end-user formats use lossy compression. Data representing the highest, fine-grained resolutions of the image or sound are removed in order to achieve high levels of compression to reduce file size for distribution. The level of compression is variable according to the Use Case, e.g. minimal compression of the original, high compression for email or MMS distribution.
JPG images effectively delete colour information to achieve compression.
MP3s simplify the sound waves of audio.
Developers decide to adjust the level of compression according to their Use Cases.

Overview of some Compression File Formats
ZIP:
o The most popular general-purpose compression archive.
o Supported on virtually all platforms from mainframes to PCs.
o Includes features such as password-based encryption.
RAR, 7z, TAR, StuffIt:
o Other general-purpose archive formats, some of them proprietary, with incremental improvements over ZIP but with the loss of standardized support.
o They use different algorithms, with various benefits and uses.
o Some are designed for different operating systems (StuffIt for Mac; TAR, short for Tape ARchive, for *nix).

What is a “Backup” and why do we need backups?
Backup is the “procedure for making extra copies” of data “in case the original is lost or damaged and must be restored.” The procedure includes storing the copy in a geographically separate location which is platform independent from the original file and host system.
Having a backup will allow you to recover from lost, broken or stolen hardware, and from your own accidental deletions.
You should be in the habit of backing up user-created files on your laptop or PC. OS and apps can be restored from their original software providers or a system Restore Point, but user-created data can only be restored from backups.

When & How to run your Backup
Automatically:
o Performed by continuously running backup software that constantly monitors for file changes. Used for Full and Incremental strategy.
Scheduled:
o A system operator or backup software runs a backup at specific times, such as overnight, when it has the least impact on business operations, or at critical business times such as accounting month/year end. Used for Full, Differential, and Incremental strategy.
Manual Backup:
o A user performs backups at their own convenience. Not a strategy.
o It is the least effective method (what if you forget to do it?), but it’s better than no backup at all!
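The Full, Differential, and Incremental strategies named above differ only in which files each run picks up. The Python sketch below shows that selection rule under simplifying assumptions: files are compared by modification timestamp only, and the `FileInfo` type, the `select_files` function, and the example timestamps are hypothetical names and values invented for illustration, not part of any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    modified: float  # modification time as a timestamp (illustrative values below)


def select_files(files: list[FileInfo], strategy: str,
                 last_full: float, last_backup: float) -> list[str]:
    """Return the paths a backup run would copy under the given strategy.

    last_full   -- time of the most recent Full backup
    last_backup -- time of the most recent backup of any type
    """
    if strategy == "full":
        # Full: every file, regardless of when it changed.
        return [f.path for f in files]
    if strategy == "differential":
        # Differential: everything changed since the last Full backup.
        cutoff = last_full
    elif strategy == "incremental":
        # Incremental: everything changed since the last backup of any type.
        cutoff = last_backup
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [f.path for f in files if f.modified > cutoff]


# Hypothetical timeline: Full backup at t=10, an Incremental at t=20.
files = [
    FileInfo("report.docx", modified=5),
    FileInfo("budget.xlsx", modified=15),
    FileInfo("notes.txt", modified=25),
]
for strategy in ("full", "differential", "incremental"):
    print(strategy, select_files(files, strategy, last_full=10, last_backup=20))
```

The Differential cutoff never moves until the next Full backup, which is why those runs grow over time, while the Incremental cutoff advances with every run and keeps each one small at the cost of a longer restore chain.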
Locations of Backup Media
Local:
o Copies files to a drive in use by the system.
o Fastest and most convenient, but if the computer is lost or malfunctions, so goes the data! Just having a copy is not a backup.
o Local copies may be made to reduce downtime. The copies are then moved to External media or transmitted, which is a slower process.
External:
o Copies files to an External/Portable/Flash Drive, i.e. a device which can be disconnected from the computer. It is a backup when the platform independent device is taken off-site.

Locations of Backup Media (Cont’d)
Network:
o Back up files to the cloud (Google Drive, OneDrive, Dropbox, iCloud)
o It is a slower option for large backups. Cost-effective communications bandwidth has significantly less throughput than writing data to a directly attached device.
The best location depends on the type of work you’re doing, the volatility of the data (how quickly it changes), the volume of data, the backup window (available downtime), security considerations, and the speed/availability of restoration.
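One way to weigh volume of data against the backup window is a rough transfer-time estimate. The Python sketch below is a back-of-the-envelope calculation only: the function name is made up, the data volume and throughput figures are hypothetical placeholders rather than measurements, and it ignores protocol overhead, compression, and contention.

```python
def transfer_hours(size_gb: float, throughput_mbps: float) -> float:
    """Estimate hours needed to move size_gb gigabytes at throughput_mbps megabits/second."""
    bits = size_gb * 8 * 1000**3                     # decimal gigabytes -> bits
    seconds = bits / (throughput_mbps * 1000**2)     # megabits/second -> bits/second
    return seconds / 3600


# Hypothetical figures, chosen only to show the calculation.
backup_size_gb = 450
for label, mbps in [("network upload (assumed 100 Mbps)", 100),
                    ("attached drive (assumed 1000 Mbps)", 1000)]:
    print(f"{label}: {transfer_hours(backup_size_gb, mbps):.1f} hours")
```

Running the same estimate for the restore direction is just as important, since download throughput from a remote location decides how long recovery takes.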