Friday, May 27, 2011

Chapter 6. Regular Expressions

Chapter Syllabus

6.1 How a Command Is Executed

6.2 Position Specifiers

6.3 Meta Characters

6.4 Standard and Extended Regular Expressions

All human languages have idioms and phrases. These are made up of combinations of words not used in their ordinary meanings. Regular expressions can be considered as idioms of the UNIX shell. These are used for string pattern matching in many UNIX commands. As idioms and phrases convey a handful of meanings in few words, regular expressions are also very useful where you need to match complex text patterns and ordinary methods are just not applicable.

Regular expressions consist of strings of characters, position specifiers or anchor characters, and meta characters that have special meanings. Each regular expression is expanded into its meaning before the UNIX shell executes a command containing a regular expression. Before we actually use regular expressions in this chapter, we will start with the command execution process. We will then discuss basic meta characters used in regular expressions. You will learn the use of regular expressions with some simple commands. At the end of the chapter, you will be able to use regular expressions to search and replace character strings in files and in stdin and stdout.

6.1 How a Command Is Executed

All HP-UX commands consist of two basic parts. The first one is the command name and the second part consists of options and arguments. Before executing a command, the shell looks for a valid command in the path specified by the PATH variable. If it finds an executable command, it checks for any meta characters or position specifiers used in the arguments. These meta characters and position specifiers are discussed later in this chapter. If the shell finds any of these characters in the arguments, it starts expanding the argument according to predetermined rules. After expansion, the shell passes the arguments to the command and invokes it for the execution process. The shell then displays any output or error message generated by the command on the user terminal. It also checks to see if the command execution was successful and keeps a record until the next command is executed.

The command execution process is completed in the following steps.

1. The shell looks for a valid command by searching all directories specified by the PATH variable.

2. Options and arguments are parsed and arguments are expanded depending on the special characters used.

3. The command is invoked.

4. The results of the command are displayed back to the user.

As an example, if you issue a command ls [a-d]ile, before executing the ls command, the shell first expands its argument [a-d]ile to whichever of the file names aile, bile, cile, and dile exist in the current directory. After that, the ls command is executed, which, in turn, lists the files having any of these four names.
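The expansion step can be observed directly with echo, which prints its arguments exactly as the shell hands them over. The following is a minimal sketch, assuming a scratch directory created with mktemp and three hypothetical file names; none of these names come from the chapter's own examples.

```shell
# Scratch directory (hypothetical) so the example touches no real files.
demo_dir=$(mktemp -d)
touch "$demo_dir/aile" "$demo_dir/bile" "$demo_dir/eile"

# echo prints its arguments as received, so it reveals what the
# shell produced from the pattern before the command ran.
expanded=$(cd "$demo_dir" && echo [a-d]ile)
echo "$expanded"    # aile bile

# Quoting suppresses expansion; the pattern reaches the command as typed.
literal=$(cd "$demo_dir" && echo '[a-d]ile')
echo "$literal"     # [a-d]ile

rm -r "$demo_dir"
```

Only names that actually exist appear in the expanded list, and eile is left out because "e" falls outside the range a-d.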

After understanding this process, let us move to the next sections where you will learn the use of special characters and regular expressions.

6.2 Position Specifiers

Position specifiers are characters that are used to specify the position of text within a line. Sometimes these are also called anchor characters. The caret character (^) is the starting position specifier. It is used to match a text string occurring at the start of a line of text. The dollar sign ($) is the end-position specifier and is used to refer to a line that ends with a particular string.

Table 6-1 shows the uses of position specifiers.

Table 6-1. Uses of Position Specifiers

Position Specifier Example    Result of Match
^Miami                        Matches word Miami at the start of a line.
Miami$                        Matches word Miami at the end of a line.
^Miami$                       Matches a line containing only one word, Miami.
^$                            Matches a blank line.
^\^                           Matches a ^ at the beginning of a line.
\$$                           Matches a $ at the end of a line.
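Each row of Table 6-1 can be tried against a small file. The sketch below assumes a made-up six-line sample written to a temporary file, and uses grep -c to count matching lines instead of printing them. The patterns are placed in single quotes so the shell leaves them untouched.

```shell
# Hypothetical sample file; its contents are invented for illustration.
sample=$(mktemp)
printf '%s\n' 'Miami is warm' 'I live in Miami' 'Miami' '' \
    '^caret first' 'costs 5$' > "$sample"

start_count=$(grep -c '^Miami' "$sample")    # lines 1 and 3
end_count=$(grep -c 'Miami$' "$sample")      # lines 2 and 3
only_count=$(grep -c '^Miami$' "$sample")    # line 3 only
blank_count=$(grep -c '^$' "$sample")        # the empty line
caret_count=$(grep -c '^\^' "$sample")       # line starting with a literal ^
dollar_count=$(grep -c '\$$' "$sample")      # line ending with a literal $

echo "$start_count $end_count $only_count $blank_count $caret_count $dollar_count"
rm -f "$sample"
```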

Use of $

The dollar sign $ is used to match a string if it occurs at the end of a line. Consider a file with the name myfile having contents as shown below after using the cat command.

$ cat myfile

Finally I got it done. The procedure for adding a

new template is completed in three steps.

1- Create a new template.

2- Assign this template to a node with this procedure.

Action -> Agents -> Assign Templates -> Add -> Enter

hostname and template nee -> OK

3- After assignment, the template is still on the ITO

server. To install it on the required server, the

procedure is:

Action -> Agents -> Install/Update SW & Config ->

Select Templates, Node name & Force update -> OK

If step 3 is successful, a message appears on ITO

message browser showing that update process on the node

is complete.

IMPORTANT

===========

The template will not work if the node name specified

in it is unknown to ITO server. In our template we

specified batch_server which was unknown to ITO server

node name in the template. Finally I got out the node

name which is more convenient as ITO automatically takes

current node name if the name is not specified in the

template.

Template Options

===============

1- It runs every minute. Scans the file only if it is

modified.

2- User initiated action is specified to run restart.

3- A short instruction is provided to run the script.

It needs to be modified to make more meaningful.

$

Let us use the grep command to find all lines in the file that contain the word node.

$ grep node myfile

2- Assign this template to a node with this procedure.

message browser showing that update process on the node

The template will not work if the node name specified

node name in the template. Finally I got out the node

current node name if the name is not specified in the

$

You found out that there are five lines in the file containing the word node. Now let us find only those lines that end with this word by using the $ position specifier.

$ grep node$ myfile

message browser showing that update process on the node

node name in the template. Finally I got out the node

$

The position specifiers can be used with any command that deals with text-type data.

Use of ^

The caret character (^) matches a string at the start of a line. Using the same example of finding the word node, now at the start of a line, enter the following command and watch the result.

$ grep ^node myfile

node name in the template. Finally I got out the node

$

As another example, you can list all users on your system with login names starting with the letter "m" as follows.

$ grep ^m /etc/passwd

Getting Rid of Blank Lines

Use of position specifiers is very useful in many cases. To show you one example, ^$ can find blank lines in a file. If you want to count blank lines, you can just pipe output of the grep command to the wc command as in the following.

$ grep ^$ myfile | wc -l

5

$

This command will scan myfile and tell you exactly how many blank lines there are in the file. You can use the grep command to take out all blank lines from the file as shown below. The grep -v command reverses the selection and shows those lines that are not empty.

$ grep -v ^$ myfile

Finally I got it done. The procedure for adding a

new template is completed in three steps.

1- Create a new template.

2- Assign this template to a node with this procedure.

Action -> Agents -> Assign Templates -> Add -> Enter

hostname and template nee -> OK

3- After assignment, the template is still on the ITO

server. To install it on the required server, the

procedure is:

Action -> Agents -> Install/Update SW & Config ->

Select Templates, Node name & Force update -> OK

If step 3 is successful, a message appears on ITO

message browser showing that update process on the node

is complete.

IMPORTANT

===========

The template will not work if the node name specified

in it is unknown to ITO server. In our template we

specified batch_server which was unknown to ITO server

node name in the template. Finally I got out the node

name which is more convenient as ITO automatically takes

current node name if the name is not specified in the

template.

Template Options

===============

1- It runs every minute. Scans the file only if it is

modified.

2- User initiated action is specified to run restart.

3- A short instruction is provided to run the script.

It needs to be modified to make more meaningful.

$

Please note that an "empty line" means a line that doesn't contain any characters. Some lines seem to be empty but actually contain a space or tab character. These lines are not matched by the above command. To match a line that contains exactly one space character, you can use ^[ ]$, where there is a space character between the two square brackets. To match a line that is empty or contains only spaces, use ^[ ]*$.
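The difference between truly empty lines and lines that only look empty can be checked with a short experiment. This is a sketch on an invented five-line file. The [[:space:]] character class used in the last two commands is a POSIX bracket class that this chapter does not cover, so treat it as an assumption about your grep.

```shell
# Hypothetical sample: "first", an empty line, a line holding one space,
# a line holding one tab, and "last".
sample=$(mktemp)
printf 'first\n\n \n\t\nlast\n' > "$sample"

empty_only=$(grep -c '^$' "$sample")             # truly empty lines
one_space=$(grep -c '^[ ]$' "$sample")           # exactly one space
blank_like=$(grep -c '^[[:space:]]*$' "$sample") # empty or whitespace only
kept_count=$(grep -vc '^[[:space:]]*$' "$sample") # lines with visible text

echo "$empty_only $one_space $blank_like $kept_count"
rm -f "$sample"
```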

Escaping Position Specifiers

Sometimes the actual string contains one of the position specifiers or meta characters. If you pass this string as-is to a command, the character is interpreted with its special meaning, and you will not get correct results. To have a character treated literally, you need to escape it by placing a backslash (\) before the character. Because the shell also treats the backslash as a special character, enclose the pattern in single quotes so that the backslash reaches the command. For example, if you want to search for the $ character in a file, you will use the grep '\$' command instead of grep $. If you don't escape the $ character, this command will display all contents of the file, because $ by itself matches the end of every line.

Please note that \ is also a special character. To match a backslash, you need to use two backslashes \\ in the string.
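Escaping involves two layers: the shell processes the command line first, and grep then interprets whatever pattern it receives. The sketch below uses an invented three-line file to show why single quotes matter. It is an illustration under those assumptions, not output taken from the chapter's myfile.

```shell
# Hypothetical sample with one line containing $ and one containing \.
sample=$(mktemp)
printf '%s\n' 'price: $5' 'no symbol here' 'path C:\temp' > "$sample"

# Unquoted, the shell consumes the backslash and grep receives a bare $,
# which is the end-of-line anchor and therefore matches every line.
unquoted=$(grep -c \$ "$sample")

# Single quotes deliver the backslash to grep, which then treats $ literally.
quoted=$(grep -c '\$' "$sample")

# A literal backslash is matched with two backslashes inside single quotes.
backslash=$(grep -c '\\' "$sample")

echo "$unquoted $quoted $backslash"    # 3 1 1
rm -f "$sample"
```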

6.3 Meta Characters

Meta characters are those that have special meaning when used within a regular expression. You already have seen two meta characters used as position specifiers. A list of other meta characters and their meanings is shown in Table 6-2.

Table 6-2. Meta Characters Used in Regular Expressions

Character    Description
*            Matches zero or more occurrences of the preceding character.
.            Matches any character, one at a time.
[]           One of the enclosed characters is matched. The enclosed characters may be a list of characters or a range.
\{n1,n2\}    Matches minimum of n1 and maximum of n2 occurrences of the preceding character or regular expression.
\<           Matches at the beginning of the word.
\>           Matches at the end of the word.
\            The character following acts as a regular character, not a meta character. It is used for escaping a meta character.
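The interval notation in Table 6-2 has no worked example in this chapter. In standard regular expressions it is written \{n1,n2\}. A minimal sketch on an invented three-line file might look like the following. The anchors are included because, without them, a short interval would also match inside a longer run of digits.

```shell
# Hypothetical sample: identifiers with one, two, and five digits.
sample=$(mktemp)
printf '%s\n' 'id 7' 'id 42' 'id 12345' > "$sample"

# "id " followed by two to five digits and nothing else.
two_to_five=$(grep -c '^id [0-9]\{2,5\}$' "$sample")

# A single number inside the braces asks for exactly that many occurrences.
exactly_two=$(grep -c '^id [0-9]\{2\}$' "$sample")

echo "$two_to_five $exactly_two"    # 2 1
rm -f "$sample"
```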

Use of the Asterisk * Character

The asterisk character is used to match zero or more occurrences of the preceding character. If you take our example of myfile, the result of the following grep command will be as shown below.

$ grep mom* myfile

name which is more convenient as ITO automatically takes

modified.

It needs to be modified to make more meaningful.

$

Is this what you were expecting? The grep command found all text patterns that start with "mo" and after that have zero or more occurrences of the letter m. The words that match these criteria are "more" and "modified." Use of * after only a single character is meaningless as it will match anything. For example, m* matches any number of "m" characters, including zero. A line that contains no "m" at all is therefore also matched, because it has zero occurrences of "m". So one must be careful when using the asterisk (*) character in regular expressions.
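The effect of "zero occurrences" is easy to confirm by counting matches. The sketch below assumes an invented four-line word list. The pattern momm* is added for contrast, since it requires at least one "m" after "mo".

```shell
# Hypothetical word list, one word per line.
sample=$(mktemp)
printf '%s\n' 'more' 'modified' 'mommy' 'none' > "$sample"

total=$(grep -c '' "$sample")           # every line, for comparison
star_only=$(grep -c 'm*' "$sample")     # zero m's is enough, so all lines match
mom_star=$(grep -c 'mom*' "$sample")    # "mo" plus zero or more m
momm_star=$(grep -c 'momm*' "$sample")  # "mom" plus zero or more m

echo "$total $star_only $mom_star $momm_star"    # 4 4 3 1
rm -f "$sample"
```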

Use of the Dot (.) Character

The dot character matches any character excluding the new line character, one at a time. See the example below where we used the dot to match all words containing the letter "s" followed by any character, followed by the letter "e".

$ grep s.e myfile

new template is completed in three steps.

If step 3 is successful, a message appears on ITO

The template will not work if the node name specified

specified batch_server which was unknown to ITO server

current node name if the name is not specified in the

1- It runs every minute. Scans the file only if it is

2- User initiated action is specified to run restart.

$

In every line shown above, there is a word containing an "s" followed by another character and then "e". The second-to-last line is of special interest, where this letter combination occurs when we combine the two words "runs every." Here "s" is followed by a space and then an "e".
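Because the dot accepts any character, including a space, it also accepts a literal dot. When only a real dot should match, it has to be escaped. A small sketch on an invented four-line file:

```shell
# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'runs every' 'site' 's.e' 'see' > "$sample"

# s, then any one character, then e: matches "s e", "s.e" and "see".
any_char=$(grep -c 's.e' "$sample")

# Escaped dot: only the line with a literal "s.e" matches.
literal_dot=$(grep -c 's\.e' "$sample")

echo "$any_char $literal_dot"    # 3 1
rm -f "$sample"
```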

Use of Range Characters [...]

Consider that you want to list all files in a directory that start with the letters a, b, c, d, or e. You can use a command such as:

$ ls a* b* c* d* e*

This is not convenient if this list grows. The alternate way is to use a range pattern like the following.

$ ls [a-e]*

Square brackets are used to specify ranges of characters. For example, if you want to match all words that contain any of the capital letters from A to D, you can use [A-D] in the regular expression.

$ grep [A-D] myfile

1- Create a new template.

2- Assign this template to a node with this procedure.

Action -> Agents -> Assign Templates -> Add -> Enter

3- After assignment, the template is still on the ITO

Action -> Agents -> Install/Update SW & Config ->

IMPORTANT

3- A short instruction is provided to run the script.

$

Similarly, if you need to match any lowercase vowel, [aeiou] will serve the purpose. If the vowel is desired to be at the beginning of a line, we can use ^[aeiou]. Multiple ranges can also be used, such as ^A[a-z0-9], which matches lines that have "A" as the first character and either a lowercase letter or a digit as the second character.

The selection criteria can also be reversed using ^ as the first character within the square brackets. An expression [^0-9] matches any character other than a digit.
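The bracket expressions discussed above can be compared side by side. This sketch uses an invented six-line file and sets LC_ALL=C first, which is an assumption made here because in some locales a range such as [a-z] may also include uppercase letters.

```shell
# Use the C locale so that [a-z] means exactly the lowercase ASCII letters.
LC_ALL=C
export LC_ALL

# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'apple' 'Ab12' 'A9' 'AB' '12345' 'x7' > "$sample"

vowel_start=$(grep -c '^[aeiou]' "$sample")    # only "apple"
a_then=$(grep -c '^A[a-z0-9]' "$sample")       # "Ab12" and "A9", not "AB"
has_non_digit=$(grep -c '[^0-9]' "$sample")    # every line except "12345"

echo "$vowel_start $a_then $has_non_digit"     # 1 2 5
rm -f "$sample"
```

Note that [^0-9] selects lines containing at least one non-digit character, not lines that are free of digits.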

Use of the Word Delimiters \< and \>

These two sets of meta characters can be used to match complete words. The \< character matches the start of a word and \> checks the end of a word. Without these meta characters, all regular expressions match a string irrespective of its presence in the start, end, or middle of a word. If we want to match all occurrences of "this" or "This" as a whole word in a file, we can use the following grep command.

$ grep '\<[tT]his\>' myfile

If you use \< only, the pattern is matched if it occurs in the start of a word. Using only \> matches a pattern occurring in the end of a word.
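The three ways of using the word delimiters can be compared by counting matches for the word node. The sketch below relies on an invented four-line file and assumes a grep that supports \< and \>, as the grep described in this chapter does. The patterns are single-quoted so that the shell does not strip the backslashes or treat < and > as redirections.

```shell
# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'node' 'nodes' 'anode' 'a node here' > "$sample"

anywhere=$(grep -c 'node' "$sample")          # all four lines
whole_word=$(grep -c '\<node\>' "$sample")    # "node" and "a node here"
word_start=$(grep -c '\<node' "$sample")      # excludes "anode"
word_end=$(grep -c 'node\>' "$sample")        # excludes "nodes"

echo "$anywhere $whole_word $word_start $word_end"    # 4 2 3 3
rm -f "$sample"
```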

6.4 Standard and Extended Regular Expressions

Sometimes you may want to make logical OR operations in regular expressions. As an example, you may need to find all lines in your saved mail in the $HOME/mbox file that contain a sender's address or the date of sending. All such lines start with the words "From:" or "Date:". Using a standard regular expression it would be very difficult to extract this information. The egrep command uses an extended regular expression as opposed to the grep command that uses standard regular expressions. If you use parentheses and the logical OR operator (|) in extended regular expressions with the egrep command, the above-mentioned information can be extracted as follows.

$ egrep '^(From|Date):' $HOME/mbox

Note that we don't use \ prior to parentheses in extended regular expressions.

You may think that this task can also be accomplished using a standard regular expression with the following command. It might seem correct at first sight, but it is not.

$ grep '[FD][ra][ot][me]:' $HOME/mbox

This command does not work because it will also match "Fate:", "Drom:", "Droe:", and so on.
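The difference between the two approaches shows up as soon as the file contains a line such as "Fate:". The sketch below uses an invented five-line stand-in for the mailbox and calls grep -E, the POSIX way of requesting extended regular expressions, in place of egrep. That substitution is an assumption made here for portability. The address and header values are made up.

```shell
# Hypothetical stand-in for $HOME/mbox.
mbox=$(mktemp)
printf '%s\n' 'From: alice@example.com' 'Date: Fri, 27 May 2011' \
    'Subject: hello' 'Fate: sealed' 'Drom: typo' > "$mbox"

# Alternation: only lines starting with From: or Date: are counted.
alternation=$(grep -E -c '^(From|Date):' "$mbox")

# Bracket lists: each position is chosen independently, so the
# unintended combinations Fate: and Drom: are counted as well.
brackets=$(grep -c '[FD][ra][ot][me]:' "$mbox")

echo "$alternation $brackets"    # 2 4
rm -f "$mbox"
```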

Extended regular expressions are used with the egrep and awk commands. Sometimes it is more convenient to use standard expressions. At other times, extended regular expressions may be more useful. There is no hard and fast rule as to which type of expression you should use. I use both of these and sometimes combine commands using both types of expressions with pipes to get a desired result. With practice you will come to know the appropriate use.

Test Your Knowledge

1:

The purpose of the command grep ^Test$ is:

A. to find the word "Test" in the start of a line

B. to find the word "Test" in the end of a line

C. to find the word "Test" in the start or end of a line

D. to find a line containing a word "Test" only

2:

Square brackets in pattern matching are used for:

A. escaping meta characters

B. specifying a range of characters; all of which must be present for a match

C. specifying a range of characters; only one of which must be present for a match

D. specifying a range of characters; one or more of which must be present for a match

3:

A regular expression \matches:

A. all words starting with "join"

B. all words ending with "join"

C. all words starting or ending with "join"

D. none of the above

4:

The grep command can use:

A. standard regular expressions only

B. extended regular expressions only

C. both standard and extended regular expressions

D. either standard or extended regular expressions but not both of these simultaneously

5:

Which of these is NOT a meta character?

A. *

B. \

C. $

D. -
