
  CITS4407/CITS2003 OPEN SOURCE TOOLS AND SCRIPTING
 
 

Lab 7: Regex, grep and sed

Questions

  1. Use grep (and wc) to answer the following questions about Alice_in_Wonderland.txt
    1. How many times does the word rabbit appear in Alice_in_Wonderland.txt? What about Rabbit?
      grep rabbit Alice_in_Wonderland.txt | wc -l # 5
      grep Rabbit Alice_in_Wonderland.txt | wc -l # 47
      # grep -c is an acceptable substitute for wc here and in most answers to this question
      
    2. How could you search for both rabbit and Rabbit at once?
      grep -i rabbit Alice_in_Wonderland.txt | wc -l # note this also matches RABBIT
      # or
      grep '[Rr]abbit' Alice_in_Wonderland.txt | wc -l # quoted so the shell cannot expand the brackets as a filename pattern
      
    3. How many times does the word Alice appear in Alice_in_Wonderland.txt? Note that it sometimes occurs more than once on the same line. You might want to look at grep -o
      grep -o Alice Alice_in_Wonderland.txt | wc -l # 401
      # grep -c doesn't work here, because it counts matching lines rather than individual matches
      # instead, print each match on its own line with -o and count them with wc -l
      
    4. How many lines do not contain the word [Cc]aterpillar?
      grep -vic caterpillar Alice_in_Wonderland.txt # 3732
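The difference between counting lines and counting matches is easier to see on a small input. The sketch below builds a hypothetical three-line sample (it is not taken from Alice_in_Wonderland.txt, and the variable names are made up for illustration) and runs the same grep options on it.

```shell
# Hypothetical three-line sample, NOT taken from Alice_in_Wonderland.txt.
sample=$(mktemp)
printf '%s\n' \
  'Alice saw the White Rabbit.' \
  'The rabbit ran; Alice followed, and Alice fell.' \
  'Nothing to see here.' > "$sample"

# -c counts matching LINES: lines 1 and 2 contain Alice.
lines_with_alice=$(grep -c Alice "$sample")

# -o prints every match on its own line, so wc -l counts occurrences.
# tr strips the padding that some wc implementations add.
alice_matches=$(grep -o Alice "$sample" | wc -l | tr -d ' ')

# Two ways to accept either capitalisation of the first letter.
either_case=$(grep -c '[Rr]abbit' "$sample")
ignore_case=$(grep -ci rabbit "$sample")

# -v inverts the match, so this counts lines WITHOUT rabbit in any case.
without_rabbit=$(grep -vic rabbit "$sample")

echo "lines containing Alice:  $lines_with_alice"   # 2
echo "occurrences of Alice:    $alice_matches"      # 3
echo "lines with [Rr]abbit:    $either_case"        # 2
echo "lines with rabbit (-i):  $ignore_case"        # 2
echo "lines without rabbit:    $without_rabbit"     # 1

rm -f "$sample"
```

On the second sample line Alice occurs twice, which is why the -o count is one higher than the -c count.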
      
  2. Solve the following problems about arcade.csv with grep:
    1. Print all lines in arcade.csv where the team is GREEN
      grep GREEN arcade.csv
      
    2. What arcade score did Molly get?
      grep Molly arcade.csv | cut -d, -f6
      
    3. Print all lines in arcade.csv where the machine is b and the score begins with the number 4
      grep ',b,' arcade.csv | grep ',4[0-9]*$' # you can use egrep (grep -E) to do this all in a single grep command
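One possible single-command version of the last answer is sketched below. The sample rows and the column order (name, age, team, date, machine, score) are assumptions made for illustration; the only things taken from the answers above are that the score is field 6 and sits at the end of the line. The optional group `(.*,)?` lets the pattern work whether or not the machine column is directly before the score.

```shell
# Hypothetical rows; the real arcade.csv may be laid out differently.
csv=$(mktemp)
printf '%s\n' \
  'Molly,12,GREEN,2021-03-01,b,4120' \
  'Ravi,11,RED,2021-03-01,b,3999' \
  'Sam,13,GREEN,2021-03-02,a,4500' \
  'Bea,12,BLUE,2021-03-02,b,977' > "$csv"

# Two greps chained with a pipe, as in the answer above.
two_step=$(grep ',b,' "$csv" | grep ',4[0-9]*$')

# One extended regex: a field equal to b, optionally more fields,
# then a final field that starts with 4 and contains only digits.
one_step=$(grep -E ',b,(.*,)?4[0-9]*$' "$csv")

# The score is the sixth comma-separated field.
molly_score=$(grep Molly "$csv" | cut -d, -f6)

echo "two greps: $two_step"
echo "one grep:  $one_step"
echo "Molly:     $molly_score"

rm -f "$csv"
```

Both forms print only Molly's row for this sample: Ravi and Bea played machine b but their scores do not start with 4, and Sam's score starts with 4 but was set on machine a.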
      
  3. You've been sent a list of university enrolment data (australian-universities.csv) but the data is messy. Write a sed script to do the following:
    • Remove all lines that do not contain the word University (case insensitive, so you should also keep UNIversity etc)
    • Remove all lines that contain letters after the name field
    • Replace all full stops (.) with commas (,)
    • Remove all trailing commas
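    # keep only lines that mention university; the I (ignore case) flag is a GNU sed extension
    /university/I!d
    # delete lines that have a letter anywhere after the first comma
    /,.*[a-zA-Z]/d
    # replace every full stop with a comma
    s/[.]/,/g
    # strip any run of trailing commas
    s/,*$//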
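To check a script like this before running it on the real file, it can help to feed it a few lines whose outcome is known in advance. The sketch below assumes GNU sed (for the `I` flag) and uses hypothetical data, since the layout of australian-universities.csv is not shown here. The last command is written as `s/,*$//` so that a run of several trailing commas is stripped in one go.

```shell
# Hypothetical sample; the real australian-universities.csv may differ.
data=$(mktemp)
printf '%s\n' \
  'name,enrolled,' \
  'University of Western Australia,25000.18000,' \
  'UNIversity of Sydney,60500.41000,,' \
  'Curtin University,unknown,' \
  'Perth Technical College,10000.9000,' > "$data"

# Same four steps as the script, passed as separate -e expressions.
cleaned=$(sed -e '/university/I!d' \
              -e '/,.*[a-zA-Z]/d' \
              -e 's/[.]/,/g' \
              -e 's/,*$//' "$data")

printf '%s\n' "$cleaned"

rm -f "$data"
```

Only the first two university rows should survive: the header and the college row never mention university, and the Curtin row has letters after the name field.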
    
  4. The following questions use sed to operate on json files in /lab/week8/aces
    1. Write a sed command to convert the json field name "seat" to "crew".
      sed -e 's/seat/crew/' heroes.json
      
    2. Test your sed command on heroes.json. Pay particular attention to Mace Windu's caption. Adjust your sed command to only affect the field name and not his caption.
      sed -e 's/"seat"/"crew"/' heroes.json
      # including the surrounding quotes restricts the match to the field name
      
    3. villains.json was written in a rush and contains invalid json. There should be a comma at the end of every field in an object except the last one. Fortunately for us, the "keywords" field is always the last field in an object, so we can make use of that when adding the missing commas. Write a sed command to add a comma at the end of every line, except lines ending in one of the following characters:
      , { } [ ]
      sed -e 's/\([^][{},]\)$/\1,/' villains.json
      # this is a tough one: inside a bracket expression, ] is only treated as a literal character when it comes first (directly after the ^), which is why the brackets are written as ][
      
      # regex breakdown
      's/\(        \) /   /' # \(\) defines a capturing group, because we need to keep the last character of the line that we match and put it back in the output
      's/\(        \) /\1 /' # \1 prints the first capture group
      's/\(        \) /\1,/' # , prints the comma we are adding
      's/\(        \)$/\1,/' # $ ensures we match at the end of a line
      's/\([^     ]\)$/\1,/' # [^ ] means we want to NOT match a character
      's/\([^][{},]\)$/\1,/' # list all the characters we want to NOT match
      
    4. villains.json is also missing quotes around some field names. Write a sed command to add quotes around any field names which do not have them. For example:
      name: "Grand Moff Tarkin"
      
      should be:
      "name": "Grand Moff Tarkin"
      
      Note: this is a challenging regex. You will probably need to use sed in extended mode with -r (GNU sed also accepts -E) so that you can use the + operator (match one or more occurrences of a character). Note that in extended mode, capturing groups do not need to be escaped (so use ( ) instead of \( \)).
      sed -re 's/^([[:space:]]*)([a-z]+):/\1"\2":/' villains.json
      
      # Regex breakdown
      's/^                       /       /' # ^ anchors the match at the start of the line, so names that already have quotes are skipped
      's/^([[:space:]]*)         /       /' # first capturing group (note no backslashes): the indentation before the field name
      's/^([[:space:]]*)([a-z]+) /       /' # second capturing group: the field name, which is one or more lowercase letters
      's/^([[:space:]]*)([a-z]+):/       /' # the field name is right before a colon
      's/^([[:space:]]*)([a-z]+):/\1"\2":/' # as output, print the indentation, then quotes around the captured field name, then the colon
      # this assumes each field sits on its own line; the quotes need no backslashes because the expression is inside single quotes
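The two villains.json repairs can be tried together on a small fragment. The sketch below uses a hypothetical object (the real file is not reproduced here) and assumes each field sits on its own line. The field-name substitution is anchored to the start of the line, so a name that already has quotes, or a colon inside a caption, is left untouched; an unanchored `([a-z]+):` would match the `crew:` inside `"crew":` and add a second pair of quotes.

```shell
# Hypothetical fragment with a missing comma and two unquoted names.
fragment='[
  {
    name: "Grand Moff Tarkin"
    "crew": "Death Star",
    caption: "Note: you may fire when ready"
    "keywords": ["empire"]
  }
]'

# -E selects extended regular expressions (GNU sed also accepts -r).
# First expression: append a comma unless the line already ends in
# one of , { } [ ]  (the ] must come first inside the brackets).
# Second expression: quote an unquoted field name at the line start.
repaired=$(printf '%s\n' "$fragment" |
  sed -E -e 's/([^][{},])$/\1,/' \
         -e 's/^([[:space:]]*)([a-z]+):/\1"\2":/')

printf '%s\n' "$repaired"
```

The output should have quoted `name` and `caption` fields, each followed by a comma, while the `"crew"` and `"keywords"` lines and the bracket-only lines stay exactly as they were.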
      

Bonus

  1. See how many levels you can beat at https://alf.nu/RegexGolf
  2. What regex would you write to save the day in this situation?
    [image: Regular Expressions]
  3. vim supports regex-style search with / and sed-style replacement with s. For example: /foo$ matches any line ending in "foo", and :%s/foo/bar/g replaces any occurrence of "foo" with "bar". Repeat this week's lab exercises using vim's search and replace commands. You will need to double check on the exact syntax for escaping brackets and using capturing groups, as it may vary slightly from the syntax used by sed.


Department of Computer Science & Software Engineering
The University of Western Australia
Last modified: 8 February 2022
Modified By: Daniel Smith
