AWK Information

AWK is a text-processing language commonly used for massaging data. It automatically parses its input into records and each record into fields. An AWK program consists of one or more pattern-action pairs. Each pattern is matched against the input; if it matches, the corresponding action is performed. AWK has facilities for regular-expression matching and substitution and for other aspects of string processing. Many people now use Perl for similar tasks, but AWK is simpler, cleaner, and easier to learn. Since AWK is effectively a subset of Perl, if you don't know either one you are probably best off starting with AWK and moving on to Perl if you need it. (For more information about Perl, go to the Perl web site.)
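As a minimal sketch of the record, field, and pattern-action ideas described above, the following runs a three-rule AWK program from a POSIX shell. The sample data and the shell variable name prog are made up for illustration; any reasonably standard awk should behave the same way.

```shell
# The AWK program is kept in a shell variable (prog is an illustrative name).
# Each line of input is a record; whitespace-separated words are fields $1, $2, ...
prog='
    # Pattern: third field equals "london". Action: print first field and count it.
    $3 == "london" { print $1; n++ }
    # Pattern: a regular expression tested against the whole record.
    # Action: substitute within the record, then print it.
    /^bob/         { sub(/paris/, "PARIS"); print }
    # END is a special pattern that fires once, after all input has been read.
    END            { print n " records matched london" }
'

# Made-up sample input: name, age, city.
printf '%s\n' 'alice 30 london' 'bob 25 paris' 'carol 41 london' | awk "$prog"
```

With this input the program prints alice, then the modified bob record, then carol, and finally a count of 2. Rules are tried in order for every record, so a record can trigger more than one action.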

There are three main versions that you are likely to encounter on UNIX systems: the original AT&T AWK, the proprietary revised AT&T version NAWK, and the GNU implementation GAWK, which is approximately the same as NAWK. On many systems "awk" and "gawk" refer to the same program, but on some "awk" will be the old AT&T version and "gawk" will be GNU AWK.

The main GAWK web site is: http://www.gnu.org/directory/GNU/gawk.html.

GAWK will be found on all Linux systems and most other UNIX systems. It may be downloaded from any GNU project archive site. GAWK is also available for Microsoft Windows from http://gnuwin32.sourceforge.net/packages/gawk.htm.

GAWK is 8-bit clean, so any single-byte encoding can be used. If locale support is enabled, the locale is correctly set, and the encoding is one known to GAWK, GAWK will handle it correctly. However, as of 2004/06/21 GAWK supports Unicode only in part. Because GAWK is 8-bit clean, UTF-8 text is processed correctly provided that GAWK does not need to know how byte sequences are parsed into characters or to recognize particular codepoints. This means that UTF-8 text can be read and printed correctly, and that the basic parsing mechanisms work so long as the field and record separators are ASCII characters. Searches for fixed strings also work. The length() function, on the other hand, does not work correctly on UTF-8 text; it returns the number of bytes in the string rather than the number of characters. An important exception to these limitations is regular expression matching, which does work on UTF-8 text.
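The byte-counting behaviour of length() and the safety of ASCII separators described above can be illustrated with a short sketch. It assumes a POSIX shell whose printf understands octal escapes, and it sets LC_ALL=C to force byte-oriented processing; the variable names bytes and second are illustrative. A locale-aware awk run in a UTF-8 locale may report character counts instead.

```shell
# The letter e-acute (U+00E9) is encoded in UTF-8 as two bytes: octal 303 251.
# In the C locale, length() counts bytes, so a four-character word reports 5.
bytes=$(printf 'caf\303\251\n' | LC_ALL=C awk '{ print length($0) }')
echo "$bytes"

# Field splitting on an ASCII separator (here a colon) never cuts through a
# multi-byte UTF-8 sequence, so the second field comes out intact.
second=$(printf 'caf\303\251:na\303\257ve\n' | LC_ALL=C awk -F: '{ print $2 }')
echo "$second"
```

The second command works because no byte of a multi-byte UTF-8 sequence can be mistaken for an ASCII character, which is why ASCII field and record separators are safe.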

The Heirloom Toolchest contains a version of NAWK that provides full Unicode support. While not identical to GAWK, it is very similar. (Note: to get Unicode to work properly using this version of nawk, it is necessary to set the locale to UTF-8. In the csh, do: setenv LC_CTYPE UTF-8. In bash, do: LC_CTYPE=UTF-8; export LC_CTYPE.)

The classic description of AWK, still a good source, is The AWK Programming Language by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, the creators of AWK. Pointers to the book, downloadable excerpts, and other information may be found on the AWK Book Website.

The principal documentation for GAWK is the Reference Manual.

All sorts of information about AWK, including information about different versions and answers to common questions, can be found in the comp.lang.awk FAQ (Frequently Asked Questions).

The reference card for GNU AWK may be downloaded here.