Regular Expressions grep and sed PDF

Summary

This document provides an introduction to regular expressions, covering basic usage with grep and sed for text search and manipulation. It details various commands and techniques, as well as typical applications of regular expression use in real-world scenarios.

Full Transcript


11/19/24, 6:57 PM OneNote p-Chapter 10 Tuesday, November 12, 2024 5:17 PM

10.4 Basic Regular Expressions

Filters Using Regular Expressions: grep and sed

What Is a Regular Expression?

A regular expression is a pattern used to match strings in a search. It is typically made up from special characters called metacharacters. Regular expressions are used in the "real" programming world in many applications, including:

- Web form validation: regular expressions are used for validating phone numbers, postal codes, etc. on many websites.
- Advanced searching: applications such as the UNIX grep utility, Google Codesearch, and popular text editors allow searching with regular expressions such as colou?r.
- Email and spam filtering: many automated email filters can be configured with regular expressions.
- Automated web services and spiders: many Internet tasks involve taking data from one source and reformatting it for another, a task made easier by regular expressions.

The Character Class

A regular expression lets you specify a group of characters enclosed within a pair of rectangular brackets, [ ]. The match is then performed for any single character in the group. This form resembles the one used by the shell's wild cards. Thus, the expression

[od]    Either o or d

matches either an o or a d.

Negating a Class (^)

Regular expressions use the ^ (caret) to negate the character class, while the shell uses the ! (bang). When the character class begins with this character, all characters other than the ones grouped in the class are matched. So, [^a-zA-Z] matches a single nonalphabetic character.

Repetition (*)

The * (asterisk) refers to the immediately preceding character. However, its interpretation is the trickiest of the lot: it bears absolutely no resemblance whatsoever to the * used by wild cards or DOS (or the * used by Amazon or eBay in their search strings). Here, it indicates that the previous character can occur many times, or not at all.

The Dot

The . (dot) matches any single character. The regular expression .* signifies any number of characters, or none. Say, for instance, you are looking for the name p. woodhouse, but are not sure whether it actually exists in the file as p.j. woodhouse. No problem, just embed the .* in the search string:

$ grep "p.*woodhouse" emp.lst

Anchors

Anchors are used to match at the beginning or end of a line (or both): ^ means the beginning of the line, and $ means the end of the line.

10.2 grep: Searching for a Pattern

grep scans its input for a pattern, and can display the selected pattern, the line numbers, or the filenames where the pattern occurs. The syntax is:

grep options pattern filename(s)

grep searches for pattern in one or more filename(s), or the standard input if no filename is specified. The first argument (barring the options) is the pattern, and the remaining ones are filenames.

grep: the command itself.
options: extra settings that control the search, like making it case-insensitive or showing line numbers.
pattern: the word or phrase you're looking for.
filename(s): the file or files where grep will look for matches of the pattern.

For example,

[malsaid@HAL ~]$ grep malsaid /etc/passwd

searches for the username "malsaid" in the file /etc/passwd, which stores user account information on most Linux systems. The matching line shows the username (malsaid), the user ID and group ID (both 1001 in this example), the user's home directory (/home/malsaid), and the default shell (/bin/bash).

Quoting in grep

Quoting is essential if the search string consists of more than one word or uses any of the shell's characters like *, $, and so on:

$ grep gordon lightfoot emp.lst
grep: lightfoot: No such file or directory
emp.lst:1006:gordon lightfoot:director :sales :09/03/38:140000
$ grep 'gordon lightfoot' emp.lst
1006:gordon lightfoot:director :sales :09/03/38:140000

In the first example, lightfoot was interpreted as a filename, but grep could still locate gordon in emp.lst. The second example solves the problem by quoting the string. We used single quotes here, but this technique won't do if we use grep to locate neil o'bryan from the file. Recall that double quotes protect single quotes.

Quote the pattern used with grep if it contains multiple words or special characters that can be interpreted otherwise by the shell. You can generally use either single or double quotes. However, if the special characters in the pattern require command substitution or variable evaluation to be performed, you must use double quotes.

Ignoring Case (-i)

$ grep -i 'WILCOX' emp.lst
2345:james wilcox :g.m. :marketing :03/12/45:110000

Explanation: the command searches for the text "WILCOX" in the file emp.lst, ignoring case.

grep: the command used for searching text.
-i: the "ignore case" option, which makes grep match regardless of letter case. It will find "WILCOX", "wilcox", or any other mix of upper and lower case letters.
'WILCOX': the search term that grep looks for in the file.
emp.lst: the file being searched.

Deleting Lines (-v)

grep can also play an inverse role; the -v (inverse) option selects all lines except those containing the pattern. Thus, you can create a file otherlist containing all but directors:

$ grep -v 'director' emp.lst > otherlist
$ wc -l otherlist
11 otherlist          There were four directors initially

More often than not, when we use grep -v, we also redirect its output to a file as a means of getting rid of unwanted lines. Obviously, the lines haven't been deleted from the original file as such. The -v option removes lines from grep's output, but it doesn't actually change the argument file.
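The quoting rule, the .* expression, and the -i and -v options described above can be tried on a throwaway file. The following is a minimal sketch, not part of the original notes: the three records are hypothetical sample data in the emp.lst layout, written to a temporary file so that no real emp.lst is needed, and -c (print a count of selected lines) is used only to keep the output short.

```shell
# Hypothetical sample data: three records in the emp.lst layout used in these notes.
emp=$(mktemp)
cat > "$emp" <<'EOF'
2345:james wilcox :g.m. :marketing :03/12/45:110000
1006:gordon lightfoot:director :sales :09/03/38:140000
1265:p.j. woodhouse :manager :sales :09/12/63: 90000
EOF

# -i ignores case, so the upper-case pattern matches the lower-case name.
ci=$(grep -i 'WILCOX' "$emp")

# -v selects the lines that do NOT contain the pattern; -c counts them.
non_directors=$(grep -vc 'director' "$emp")

# .* stands for any run of characters (or none) between "p" and "woodhouse".
wood=$(grep -c 'p.*woodhouse' "$emp")

# Quoting keeps the two-word pattern together as a single argument.
gl=$(grep -c 'gordon lightfoot' "$emp")

echo "$ci"
echo "$non_directors $wood $gl"
rm -f "$emp"
```

With this sample data the first echo prints the wilcox record and the second prints 2 1 1: two lines are not directors, and one line matches each of the other two patterns.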
Matching Multiple Patterns (-e)

The -e option has two functions: to match multiple patterns, and to match patterns beginning with a hyphen. Linux supports both functions, but Solaris offers this option only with the XPG4 version. This is how you match multiple patterns by using -e multiple times:

$ grep -e woodhouse -e wood -e woodcock emp.lst
2365:john woodcock :director :personnel :05/11/47:120000
5423:barry wood :chairman :admin :08/30/56:160000
1265:p.j. woodhouse :manager :sales :09/12/63: 90000

Explanation:
grep: the command used for searching text.
-e: the "expression" option, which allows multiple patterns to be searched in the same command.
-e woodhouse -e wood -e woodcock: the three patterns being searched. By using -e for each, grep looks for any line containing at least one of these terms ("woodhouse", "wood", or "woodcock").
emp.lst: the file where the search is conducted.

Result: the command outputs any line in emp.lst that includes any of the specified patterns. Each matching line is displayed, which could include different employees whose names or details contain one of these terms. This allows you to search for several variations of a related term (e.g., names with "wood") in a single command.

You could question the wisdom of entering such a long command line when the patterns don't differ much from one another. Yes, grep supports sophisticated pattern matching techniques that can display the same lines but with a single expression. This is the ideal forum for regular expressions to make their entry.

Listing Only Directories

UNIX has no command that lists only directories. However, we can use a pipeline to "grep" those lines from the listing that begin with a d:

ls -l | grep "^d"          Shows only the directories

It's indeed strange that ls, which supports 20 options, has none to display directories! You should convert this into an alias (Table 8.2) or a shell function so that it is always available for you to use.

Identifying Files with Specific Permissions

Here's how grep can add power to the ls command. This pipeline locates all files that have write permission for the group:

[malsaid@HAL ~]$ chmod 661 group
[malsaid@HAL ~]$ ls -l | grep '^.....w'
-rw-rw---x. 1 malsaid faculty 142 Oct 31 12:09 group

The five dots skip the file type and the first four permission characters, so the w is matched at the sixth position of the line, which is the group's write permission.

The caret has a triple role to play in regular expressions. When placed at the beginning of a character class (e.g., [^a-z]), it negates every character of the class. When placed outside it, and at the beginning of the expression (e.g., ^d, as in the ls -l pipeline above), the pattern is matched at the beginning of the line. At any other location (e.g., a^b), it matches itself literally.

When Metacharacters Lose Their Meaning

Some of the special characters may actually exist as text. If these characters violate the regular expression rules, their special meanings are automatically turned off. For example, the . and * lose their meanings when placed inside the character class. The * is also matched literally if it's the first character of the expression. Thus, grep "*" looks for an asterisk.

Sometimes, you may need to escape these characters. For instance, when looking for a pattern g*, you need to use grep "g\*". Similarly, to look for a [, you should use \[, and to look for the literal pattern .*, you should use \.\*.

10.5.1 Extended Regular Expressions (ERE) and egrep

Extended regular expressions (ERE) make it possible to match dissimilar patterns with a single expression.

The + and ?

The ERE set includes two special characters, + and ?. They are often used in place of the * to restrict the matching scope.

+    Matches one or more occurrences of the previous character.
?    Matches zero or one occurrence of the previous character.

What all of this means is that b+ matches b, bb, bbb, etc., but, unlike b*, it doesn't match nothing. The expression b? matches either a single instance of b or nothing. These characters restrict the scope of match as compared to the *.

Example: Extended Regular Expressions + ?

[malsaid@HAL ~]$ grep -E "true?man" emp.lst
3564:ronie trueman :executive:personnel :07/06/47: 75000
0110:julie truman :g.m. :marketing :12/31/40: 95000

Explanation: the command uses grep with extended regular expressions (enabled by -E) to search for lines in emp.lst that contain either "truman" or "trueman".

Matching Multiple Patterns (|, ( and ))

The | is the delimiter of multiple patterns. Using it, we can locate both woodhouse and woodcock without using the -e option twice:

$ grep -E 'woodhouse|woodcock' emp.lst
2365:john woodcock :director :personnel :05/11/47:120000
1265:p.j. woodhouse :manager :sales :09/12/63: 90000

The ERE thus handles the problem easily, but offers an even better alternative. The characters ( and ) let you group patterns, and when you use the | inside the parentheses, you can frame an even more compact pattern that locates both woodhouse and woodcock:

[malsaid@HAL ~]$ grep -E "wood(house|cock)" emp.lst
2365:john woodcock :director :personnel :05/11/47:120000
1265:p.j. woodhouse :manager :sales :09/12/63: 90000

EREs when combined with BREs form very powerful regular expressions. For instance, the expression in the following command contains characters from both sets:

$ grep -E 'wilco[cx]k*s*|wood(house|cock)' emp.lst
2365:john woodcock :director :personnel :05/11/47:120000
1265:p.j. woodhouse :manager :sales :09/12/63: 90000
3212:bill wilcocks :d.g.m. :accounts :12/12/55: 85000
2345:james wilcox :g.m. :marketing :03/12/45:110000

All EREs can also be placed in a file in exactly the same way they are used in the command line. You then have to use grep with both the -E and -f options to take the patterns from the file.

10.6 sed: The Stream Editor

§ sed is a multipurpose tool which combines the work of several filters.
§ sed performs noninteractive operations on a data stream.
§ the syntax is: sed options 'address action' file(s)

sed uses instructions to act on text. An instruction combines an address for selecting lines with an action to be taken on them, as shown by the syntax:

sed options 'address action' file(s)

The address and action are enclosed within single quotes. Addressing in sed is done in two ways:

By one or two line numbers (like 3,7).
By specifying a /-enclosed pattern which occurs in a line (like /From:/).

sed processes several instructions in a sequential manner. Each instruction operates on the output of the previous instruction. In this context, two options are relevant, and most likely they are the only ones you'll use with sed: the -e option that lets you use multiple instructions, and the -f option to take instructions from a file. Both options are used by grep in an identical manner.

Line Addressing

§ lines can be addressed using line numbers and context.
§ in line addressing, sed can be used to display the first 3 lines of a file.

To consider line addressing first, the instruction 3q can be broken down into the address 3 and the action q (quit). When this instruction is enclosed within quotes and followed by one or more filenames, you can simulate head -n 3 in this way:

$ sed '3q' emp.lst          Quits after line number 3
2233:charles harris :g.m. :sales :12/12/52: 90000
9876:bill johnson :director :production:03/12/50:130000
5678:robert dylan :d.g.m. :marketing :04/19/43: 85000

Generally, we'll use the p (print) command to display lines. However, this command behaves in a seemingly strange manner: it outputs both the selected lines and all lines, so the selected lines appear twice. We must suppress this behavior with the -n option, and remember to use this option whenever we use the p command.

10.8 sed Options

The -e option allows us to enter as many instructions as we wish, each preceded by the option:

sed -n -e '1,2p' -e '7,9p' -e '$p' emp.lst
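The effect of -n on the p command, and of repeating -e, is easy to check by counting output lines. The sketch below is an illustration added to these notes: the five-line input is hypothetical and stands in for emp.lst, and wc, tr and paste are used only to condense the output.

```shell
# Hypothetical five-line input standing in for emp.lst.
f=$(mktemp)
printf '%s\n' one two three four five > "$f"

# 3q: lines are printed as usual and sed quits after line 3 (like head -n 3).
first3=$(sed '3q' "$f" | wc -l | tr -d ' ')

# p without -n: all five lines are printed by default, and lines 1-2 a second time.
doubled=$(sed '1,2p' "$f" | wc -l | tr -d ' ')

# p with -n: the default output is suppressed, so only lines 1-2 appear.
selected=$(sed -n '1,2p' "$f" | wc -l | tr -d ' ')

# -e may be repeated; the address $ means the last line.
multi=$(sed -n -e '1,2p' -e '$p' "$f" | paste -s -d ' ' -)

echo "$first3 $doubled $selected"
echo "$multi"
rm -f "$f"
```

This prints 3 7 2 and then one two five: without -n the two selected lines are emitted twice, giving seven lines instead of two.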
With -n in place, the following command prints only the first two lines:

$ sed -n '1,2p' emp.lst
2233:charles harris :g.m. :sales :12/12/52: 90000
9876:bill johnson :director :production:03/12/50:130000

To select the last line of the file, use the $:

$ sed -n '$p' emp.lst
0110:julie truman :g.m. :marketing :12/31/40: 95000

Selecting Lines from Anywhere

The two command invocations above emulate the head and tail commands, but sed can also select a contiguous group of lines from any location. To select lines 9 through 11, use this:

sed -n '9,11p' emp.lst

Alternatively, you can place multiple sed instructions in a single line using the ; as delimiter:

sed -n '1,2p;7,9p;$p' emp.lst

Negating the Action (!)

§ sed has a negation operator (!), which can be used with any action.
§ selecting the first 2 lines is the same as not selecting lines 3 through the end. This is written as:

sed -n '3,$!p' emp.lst          # don't print lines 3 to end

Instructions in a File (-f)

When you have too many instructions to use or when you have a set of common instructions that you execute often, they are better stored in a file. For instance, the three instructions 1,2p, 7,9p and $p can be stored in a file, with each instruction on a separate line:

$ cat instr.fil
1,2p
7,9p
$p

You can now use the -f option to direct sed to take its instructions from the file:

sed -n -f instr.fil emp.lst

10.9 Context Addressing

§ lets us specify one or two patterns to locate lines.
§ the pattern must be bounded by a / on either side.

The second form of addressing lets you specify a pattern (or two) rather than line numbers. This is known as context addressing, where the pattern has a / on either side:

[malsaid@HAL ~]$ sed -n '/director/p' emp.lst

Using Regular Expressions

§ context addresses also use regular expressions.
§ locating all people born in the year 1950 is done as follows:

[malsaid@HAL ~]$ sed -n '/50.......$/p' emp.lst
9876:bill johnson :director :production:03/12/50:130000
4290:neil o'bryan :executive:production:09/07/50: 65000

The seven dots stand for the colon and the six-character salary field that follow the year at the end of each line.

10.11 Text Editing

Commands are: i (insert), a (append), c (change) and d (delete).

Inserting and Changing (i, a, c)

The i command inserts text. For example, a C programmer can use it to add two common "include" lines at the beginning of a program foo.c.

The "i" command tells sed to add a new line before each line where a pattern match is found:

sed '/unix/ i "Add a new line"' file.txt

The "a" command tells sed to add a new line after each line where a pattern match is found:

sed '/unix/ a "Add a new line"' file.txt

The "c" command tells sed to replace the entire matching line with a new line:

sed '/unix/ c "Change line"' file.txt

Note that the double quotes inside the single-quoted instruction are part of the text that sed adds; leave them out if they are not wanted. Placing the text on the same line as i, a or c is a GNU sed extension.

Deleting Lines (d)

§ sed uses the d (delete) command to select lines not containing the pattern.

[malsaid@HAL ~]$ sed '/director/d' emp.lst > outfile

selects all lines except those containing director, and saves them in outfile. The same lines are selected by negating the p command:

[malsaid@HAL ~]$ sed -n '/director/!p' emp.lst

10.12 Substitution (s)

Substitution is easily the most important feature of sed, and this is one job that sed does exceedingly well. It lets you replace a pattern in its input with something else. The use of regular expressions enhances our pattern matching capabilities.

§ sed is mostly used for substitution.
§ it lets us replace a pattern in its input with something else.
§ to replace the colon (:) with a |:

[malsaid@HAL ~]$ sed 's/:/ |/' emp.lst | head -n 2          # replace first occurrence
2233 |charles harris :g.m. :sales :12/12/52: 90000
9876 |bill johnson :director :production:03/12/50:130000

[malsaid@HAL ~]$ sed 's/:/ |/g' emp.lst | head -n 2          # replace all
2233 |charles harris |g.m. |sales |12/12/52 | 90000
9876 |bill johnson |director |production |03/12/50 |130000

Example: Text Replacement

sed 's/unix/linux/' file.txt                 # replace the first occurrence on each line
sed 's/unix/linux/2' file.txt                # replace the second occurrence on each line
sed 's/unix/linux/g' file.txt                # replace all occurrences
sed '3 s/unix/linux/' file.txt               # replace the first occurrence on line 3 only
sed '1,3 s/unix/linux/' file.txt             # replace on lines 1 to 3
sed '2,$ s/unix/linux/' file.txt             # replace on lines 2 to last
sed '/linux/ s/unix/centos/' file.txt        # on lines containing linux, replace unix with centos

§ the g flag at the end makes the substitution global.
§ use the g (global) flag to replace all the : delimiters with |:

$ sed 's/:/|/g' emp.lst

§ we can replace the word director with member in the first five lines of a file using:

$ sed '1,5 s/director/member /' emp.lst

§ to add the prefix 2 to all emp-ids:

$ sed 's/^/2/' emp.lst

§ we can add the suffix .00 to the salary:

$ sed 's/$/.00/' emp.lst
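The substitution flags and the use of ^ and $ as insertion points can be compared side by side on a single record. This is a small sketch added for illustration: the record is hypothetical, modelled on the emp.lst layout, and is fed to sed through a pipe rather than read from a file.

```shell
# Hypothetical one-record input in the emp.lst layout.
line='2233:charles harris:g.m.:sales'

# Without a flag, s replaces only the first match on the line.
first=$(printf '%s\n' "$line" | sed 's/:/|/')

# The g flag replaces every match on the line.
all=$(printf '%s\n' "$line" | sed 's/:/|/g')

# A numeric flag replaces only that occurrence (here the second colon).
second=$(printf '%s\n' "$line" | sed 's/:/|/2')

# ^ and $ match positions rather than characters, so they add a prefix or a suffix.
prefixed=$(printf '%s\n' "$line" | sed 's/^/2/')
suffixed=$(printf '%s\n' "$line" | sed 's/$/.00/')

printf '%s\n' "$first" "$all" "$second" "$prefixed" "$suffixed"
```

The five lines printed are 2233|charles harris:g.m.:sales, 2233|charles harris|g.m.|sales, 2233:charles harris|g.m.:sales, 22233:charles harris:g.m.:sales and 2233:charles harris:g.m.:sales.00. Since nothing is consumed by ^ or $, the original text is left intact and the replacement is simply placed at that position.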
