Disclaimer:
The original version of this article was first published on IBM
developerWorks, and is property of Westtech Information Services. This
document is an updated version of the original article, and contains
various improvements made by the Gentoo Linux Documentation team.
This document is not actively maintained.
Awk by example, Part 1
1. An intro to the great language with the strange name
In defense of awk
In this series of articles, I'm going to turn you into a proficient awk coder.
I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the
GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with
the language may hear "awk" and think of a mess of code so backwards and
antiquated that it's capable of driving even the most knowledgeable UNIX guru to
the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for
the coffee machine).
Sure, awk doesn't have a great name. But it is a great language. Awk is geared
toward text processing and report generation, yet offers many well-designed
features that allow for serious programming. And, unlike some languages, awk's
syntax is familiar, and borrows some of the best parts of languages like C,
Python, and bash (although, technically, awk was created before both Python and
bash). Awk is one of those languages that, once learned, will become a key part
of your strategic coding arsenal.
The first awk
Code Listing 1.1: The first awk |
$ awk '{ print }' /etc/passwd
|
You should see the contents of your /etc/passwd file appear before
your eyes. Now, for an explanation of what awk did. When we called awk, we
specified /etc/passwd as our input file. When we executed awk, it
evaluated the print command for each line in /etc/passwd, in order.
All output is sent to stdout, and we get a result identical to catting
/etc/passwd.
Now, for an explanation of the { print } code block. In awk, curly braces are
used to group blocks of code together, similar to C. Inside our block of code,
we have a single print command. In awk, when a print command appears by itself,
the full contents of the current line are printed.
Code Listing 1.2: Printing the current line |
$ awk '{ print $0 }' /etc/passwd
$ awk '{ print "" }' /etc/passwd
|
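In awk, the $0 variable represents the entire current line, so print and print
$0 do exactly the same thing. The second command is different: print "" ignores
the input line and prints an empty line for each line of input.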
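To check this without depending on the contents of your own /etc/passwd, here is a small sketch that runs the same three commands against a made-up three-line sample. The sample data and the shell variable names are assumptions for illustration only.

```shell
# Hypothetical three-line sample standing in for /etc/passwd.
sample='root:x:0
bin:x:1
sync:x:5'

plain=$(printf '%s\n' "$sample" | awk '{ print }')
dollar0=$(printf '%s\n' "$sample" | awk '{ print $0 }')

# Both forms echo every input line unchanged.
[ "$plain" = "$dollar0" ] && echo "print and print \$0 agree"

# print "" emits one empty line per input line; count the lines it produced.
blank_lines=$(printf '%s\n' "$sample" | awk '{ print "" }' | awk 'END { print NR }')
echo "empty lines printed: $blank_lines"
```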
Code Listing 1.3: Filling the screen with some text |
$ awk '{ print "hiya" }' /etc/passwd
|
Multiple fields
Code Listing 1.4: Printing fields $1 and $3 |
$ awk -F":" '{ print $1 $3 }' /etc/passwd
halt7
operator11
root0
shutdown6
sync5
bin1
|
Code Listing 1.5: Separating $1 and $3 with a space |
$ awk -F":" '{ print $1 " " $3 }' /etc/passwd
|
Code Listing 1.6: Labeling the username and uid fields |
$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd
username: halt uid:7
username: operator uid:11
username: root uid:0
username: shutdown uid:6
username: sync uid:5
username: bin uid:1
|
External scripts
Code Listing 1.7: Sample script |
BEGIN { FS=":" }
{ print $1 }
|
This script differs from the earlier command-line examples in how the field
separator is set. Here, the field separator is specified within the code itself
(by setting the FS variable), while our previous examples set FS by passing the
-F":" option to awk on the command line. It's generally best to set the field
separator inside the script itself, simply because it means you have one less
command-line argument to remember to type. We'll cover the FS variable in more
detail later in this article.
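As a rough sketch of how an external script is used, the snippet below saves the sample script to a temporary file and hands it to awk with the -f option, then runs the command-line equivalent for comparison. The temporary file and the two-line passwd-style input are assumptions for illustration; in practice you would name your own script file and /etc/passwd.

```shell
# Write the two-line script to a temporary file (the name is arbitrary).
script=$(mktemp)
cat > "$script" <<'EOF'
BEGIN { FS=":" }
{ print $1 }
EOF

# Hypothetical passwd-style input; a real run would name /etc/passwd instead.
names=$(printf 'root:x:0:0\nbin:x:1:1\n' | awk -f "$script")
echo "$names"

# The command-line equivalent sets FS with -F instead.
names_cli=$(printf 'root:x:0:0\nbin:x:1:1\n' | awk -F":" '{ print $1 }')

rm -f "$script"
```

Both invocations print the first field of every line; only the place where FS gets set differs.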
The BEGIN and END blocks
Normally, awk executes each block of your script's code once for each input
line. However, there are many programming situations where you may need to
execute initialization code before awk begins processing the text from the input
file. For such situations, awk allows you to define a BEGIN block. We used a
BEGIN block in the previous example. Because the BEGIN block is evaluated before
awk starts processing the input file, it's an excellent place to initialize the
FS (field separator) variable, print a heading, or initialize other global
variables that you'll reference later in the program.
Awk also provides another special block, called the END block. Awk executes this
block after all lines in the input file have been processed. Typically, the END
block is used to perform final calculations or print summaries that should
appear at the end of the output stream.
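Here is a minimal sketch that puts both special blocks around an ordinary per-line block. The sample input, the "USER" heading, and the wording of the summary are assumptions chosen for illustration.

```shell
report=$(printf 'root:x:0\nbin:x:1\nsync:x:5\n' | awk '
BEGIN { FS=":"; print "USER" }          # runs once, before any input is read
      { print $1 }                      # runs once per input line
END   { print "total: " NR " users" }   # runs once, after the last line
')
echo "$report"
```

The heading appears first and the total last, no matter how many input lines sit in between.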
Regular expressions and blocks
Code Listing 1.8: Regular expressions and blocks |
/foo/ { print }
/[0-9]+\.[0-9]*/ { print }
|
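To get a feel for what the second pattern accepts, here is a small sketch that runs it over three made-up lines. The input is an assumption for illustration. Note that the pattern requires at least one digit and a literal dot, but allows zero digits after the dot.

```shell
matched=$(printf 'version 12\npi 3.14\nseven 7.\n' |
  awk '/[0-9]+\.[0-9]*/ { print }')
echo "$matched"
```

The first line is skipped because it has digits but no dot; the other two are printed.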
Expressions and blocks
Code Listing 1.9: Printing field three when field one equals fred |
$1 == "fred" { print $3 }
|
Code Listing 1.10: Printing field three when field five contains root |
$5 ~ /root/ { print $3 }
|
Conditional statements
Code Listing 1.11: The same test written as an if statement |
{
    if ( $5 ~ /root/ ) {
        print $3
    }
}
|
Both scripts function identically. In the first example, the boolean expression
is placed outside the block, while in the second example, the block is executed
for every input line, and we selectively perform the print command by using an
if statement. Both methods are available, and you can choose the one that best
meshes with the other parts of your script.
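To convince yourself that the two forms really are interchangeable, you can run both against the same input and compare the results. The three passwd-style lines below are made up for illustration (the "admin" entry in particular is hypothetical).

```shell
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/bin/false
admin:x:11:0:second root account:/root:/bin/bash'

# Condition placed in front of the block.
with_pattern=$(printf '%s\n' "$sample" | awk -F":" '$5 ~ /root/ { print $3 }')

# Same condition tested inside the block with an if statement.
with_if=$(printf '%s\n' "$sample" | awk -F":" '{ if ( $5 ~ /root/ ) { print $3 } }')

echo "$with_pattern"
```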
Code Listing 1.12: Nested if and else statements |
{
    if ( $1 == "foo" ) {
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ($1 == "bar" ) {
        print "two"
    } else {
        print "three"
    }
}
|
Code Listing 1.13: Negating a pattern |
! /matchme/ { print $1 $3 $4 }
|
Code Listing 1.14: The same negation written as an if statement |
{
    if ( $0 !~ /matchme/ ) {
        print $1 $3 $4
    }
}
|
Both scripts print fields one, three, and four of only those lines that don't
contain the character sequence matchme. Again, you can choose the method that
works best for your code. They both do the same thing.
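Here is a quick sketch of both negated forms on the same made-up, space-separated input. Since print $1 $3 $4 places the fields next to each other with no commas, the three values are joined without any spaces.

```shell
sample='a matchme c d
e f g h
i j k l'

negated_pattern=$(printf '%s\n' "$sample" | awk '! /matchme/ { print $1 $3 $4 }')
negated_if=$(printf '%s\n' "$sample" | awk '{ if ( $0 !~ /matchme/ ) { print $1 $3 $4 } }')

echo "$negated_pattern"
```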
Code Listing 1.15: Printing lines where field one is foo and field two is bar |
( $1 == "foo" ) && ( $2 == "bar" ) { print }
|
This example will print only those lines where field one equals foo and field
two equals bar.
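A short sketch with four made-up input lines shows the effect: only the lines that satisfy both conditions get through.

```shell
both=$(printf 'foo bar 1\nfoo baz 2\nqux bar 3\nfoo bar 4\n' |
  awk '( $1 == "foo" ) && ( $2 == "bar" ) { print }')
echo "$both"
```

The second line fails the test on field two and the third fails on field one, so neither is printed.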
Numeric variables!
In the BEGIN block, we initialize our integer variable x to zero. Then, each
time awk encounters a blank line, awk will execute the x=x+1 statement,
incrementing x. After all the lines have been processed, the END block will
execute, and awk will print out a final summary, specifying the number of blank
lines it found.
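The script being described is not reproduced above, so here is a minimal sketch reconstructed from that description. The BEGIN initialization and the x=x+1 statement come straight from the text; the /^$/ pattern for a blank line, the wording of the summary, and the sample input are my assumptions.

```shell
summary=$(printf 'first\n\nsecond\n\n\nthird\n' | awk '
BEGIN { x=0 }                              # start the counter at zero
/^$/  { x=x+1 }                            # an empty line matches /^$/
END   { print "blank lines found: " x }    # summary wording is an assumption
')
echo "$summary"
```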
Stringy variables
Code Listing 1.16: Sample field |
2.01
|
Code Listing 1.17: Squaring field one and adding one |
{ print ($1^2)+1 }
|
If you do a little experimenting, you'll find that if a particular variable
doesn't contain a valid number, awk will treat that variable as a numerical zero
when it evaluates your mathematical expression.
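If you'd like a starting point for that experimenting, here is a small sketch. The variable names and the "hello" string are assumptions for illustration; the 2.01 input matches the sample field shown earlier.

```shell
converted=$(awk 'BEGIN {
    x = "1.01"        # x holds a string that looks like a number
    x = x + 1         # arithmetic converts it to the number 1.01 first
    y = "hello" + 5   # "hello" has no numeric value, so it counts as 0
    print x, y
}')
echo "$converted"

# The same conversion applies to fields read from input.
squared=$(echo 2.01 | awk '{ print ($1^2)+1 }')
echo "$squared"
```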
Lots of operators
Another nice thing about awk is its full complement of mathematical operators.
In addition to standard addition, subtraction, multiplication, and division, awk
allows us to use the previously demonstrated exponent operator "^", the modulo
(remainder) operator "%", and a bunch of other handy assignment operators
borrowed from C.
These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div
assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that's not all -- we also
get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
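Here is a small sketch that exercises several of these operators in a BEGIN block, so no input file is needed. The variable names and values are arbitrary choices for illustration.

```shell
ops=$(awk 'BEGIN {
    a = 7;  a += 3      # add-assign:      a is now 10
    b = 7;  b %= 4      # modulo-assign:   b is now 3
    c = 3;  c ^= 2      # exponent-assign: c is now 9
    i = 5;  j = i++     # post-increment:  j gets 5, then i becomes 6
    k = 5;  m = --k     # pre-decrement:   k becomes 4, then m gets 4
    print a, b, c, i, j, k, m
}')
echo "$ops"
```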
Field separators
Awk has its own complement of special variables. Some of them allow you to
fine-tune how awk functions, while others can be read to glean valuable
information about the input. We've already touched on one of these special
variables, FS. As mentioned earlier, this variable allows you to set the
character sequence that awk expects to find between fields. When we were using
/etc/passwd as input, FS was set to ":". While this did the trick,
FS allows us even more flexibility.
Code Listing 1.18: Another field separator |
FS="\t+"
|
Above, we use the special "+" regular expression character, which means "one or
more of the previous character", so this FS treats any run of one or more tabs
as a single field separator.
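To see what difference the "+" makes, here is a sketch that splits a made-up line containing three consecutive tabs, first with FS="\t+" and then with a plain single-tab FS for contrast.

```shell
# With "\t+", a run of tabs counts as one separator.
pairs=$(printf 'alice\t\t\t30\nbob\t25\n' | awk '
BEGIN { FS="\t+" }
      { print $1 ":" $2 }
')
echo "$pairs"

# With a single tab as FS, every tab separates, leaving empty fields in between.
single_tab_nf=$(printf 'alice\t\t\t30\n' | awk 'BEGIN { FS="\t" } { print NF }')
echo "fields with single-tab FS: $single_tab_nf"
```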
Code Listing 1.19: Setting FS to space |
FS="[[:space:]]+"
|
While this assignment will do the trick, it's not necessary. Why? Because by
default, FS is set to a single space character, which awk interprets to mean
"one or more spaces or tabs" (leading and trailing whitespace on the line is
ignored as well). In this particular example, the default FS setting was exactly
what you wanted in the first place!
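Here is a sketch of the default behavior on a made-up line with leading spaces and a mix of spaces and tabs. For contrast, it also sets FS to the explicit regular expression "[ \t]+" (my choice for this example), which exposes one subtle difference: with an explicit regular expression, leading whitespace produces an empty first field.

```shell
# Default FS: runs of spaces and tabs separate fields, leading blanks are ignored.
default_split=$(printf '  one \t two   three\n' | awk '{ print NF ": " $2 }')
echo "$default_split"

# Explicit regex FS: the leading blanks now delimit an empty first field.
regex_split=$(printf '  one \t two   three\n' |
  awk 'BEGIN { FS="[ \t]+" } { print NF ": " $2 }')
echo "$regex_split"
```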
Code Listing 1.20: Field separator sample |
FS="foo[0-9][0-9][0-9]"
|
Number of fields
Code Listing 1.21: Number of fields |
{
    if ( NF > 2 ) {
        print $1 " " $2 ":" $3
    }
}
|
Record number
Code Listing 1.22: Record number |
{
    if ( NR > 10 ) {
        print "ok, now for the real information!"
    }
}
|
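To watch NF and NR change from one record to the next, here is a short sketch that prints both for each line of a made-up three-line input. The input and the message wording are assumptions for illustration.

```shell
counts=$(printf 'a b c\nd e\nf\n' |
  awk '{ print "record " NR " has " NF " field(s)" }')
echo "$counts"
```

NR counts up by one for every input line, while NF is recomputed for each line from the number of fields it contains.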
Awk provides additional variables that can be used for a variety of purposes.
We'll cover more of these variables in later articles.
We've come to the end of our initial exploration of awk. As the series
continues, I'll demonstrate more advanced awk functionality, and we'll end the
series with a real-world awk application. In the meantime, if you're eager to
learn more, check out the resources listed below.
2. Resources
Useful links