Unix Tools

Introduction

There are a number of Unix utilities that allow one to do such things as break text files into pieces, combine text files together, extract bits of information from them, rearrange them, and transform their content. Taken together, these Unix tools provide a powerful system for obtaining linguistic information. Here is a brief summary of the relevant tools. In each case, the name of the program is a link to the manual page or other more detailed information.

Overview of the Tools

Determining How Much is in a File

It is often useful to know how much is in a file. This can help to determine whether it contains enough material to be worth the bother, whether it is so large as to require special handling, and whether it is in the expected or desired format. (For example, if a file has very few lines in comparison to the number of words or characters, it may come from a system with different end-of-line conventions.)

wc: Prints a count of characters, words, and lines. By default, wc simply counts bytes to produce its character count, but if the -m flag is used it will correctly count UTF-8 Unicode characters.
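
As a minimal sketch of the counts described above, the following commands run wc on a small sample file. The file name and its contents are invented for the illustration, and the character count printed by -m depends on the current locale.

```shell
# Hypothetical sample file, created only for this illustration.
tmp=$(mktemp -d)
printf 'one two three\nfour five\n' > "$tmp/sample.txt"

# Some wc implementations pad their output with spaces; tr -d ' ' strips the padding.
lines=$(wc -l < "$tmp/sample.txt" | tr -d ' ')
words=$(wc -w < "$tmp/sample.txt" | tr -d ' ')
bytes=$(wc -c < "$tmp/sample.txt" | tr -d ' ')
echo "lines=$lines words=$words bytes=$bytes"

# -m counts characters rather than bytes; the result depends on the locale.
wc -m < "$tmp/sample.txt"

rm -r "$tmp"
```

A file whose line count is far smaller than its word count would suggest, as noted above, may use a different end-of-line convention.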

Cutting a File Into Pieces

Several utilities allow one to cut a file into pieces.

cut: Extracts the specified fields from each input line. Fields may be specified by numerical position or in terms of character offset. By default, fields are taken to be separated by a TAB character. Another delimiter may be specified instead. The inverse of cut is paste.
head: Copies the first N lines of its input to the standard output. An option makes the unit bytes instead of lines. The opposite of head is tail.
tail: Copies the last N lines of its input to the standard output. An option makes the unit bytes instead of lines. The opposite of tail is head.
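
The following sketch shows the two ways of selecting material with cut mentioned above: by field and by character offset. The sample data (word, part of speech, frequency) is invented for the illustration and contains only ASCII characters.

```shell
# Invented TAB-separated sample: word, part of speech, frequency.
data=$(printf 'dog\tN\t12\nrun\tV\t7\n')

# -f selects fields by position; TAB-separated input needs no -d option.
first=$(printf '%s\n' "$data" | cut -f 1)

# -d names another delimiter, here a colon; -f 2,3 keeps fields 2 and 3.
second=$(printf 'dog:N:12\nrun:V:7\n' | cut -d : -f 2,3)

# -c selects by character offset instead of by field.
prefix=$(printf '%s\n' "$data" | cut -c 1-2)

printf '%s\n' "$first" "$second" "$prefix"
```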

Note that head and tail used in combination allow one to extract any desired contiguous set of lines. For example, the command

head -20 | tail -5

extracts lines 16 through 20.
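
The same idea can be written for an arbitrary range of lines. In this sketch the variable names first and last are invented for the illustration, seq stands in for a 30-line input file, and -n 20 is the current spelling of the older -20 option.

```shell
# Extract lines first..last from the input.
first=16
last=20

# head keeps everything up to the last wanted line;
# tail then keeps the final (last - first + 1) lines of that.
middle=$(seq 1 30 | head -n "$last" | tail -n "$((last - first + 1))")
printf '%s\n' "$middle"
```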

Extracting Selected Lines from a File

Instead of cutting a file into pieces based purely on the position of the pieces, it is possible to extract material based on its content.

uniq: Given sorted input, writes to the standard output the unique lines, that is, one line in place of what may be multiple identical lines in the input. If desired, uniq will print a count of the repeated lines. Options provide for the printing only of lines that are not repeated or only of lines that are repeated.
grep: Copies to the standard output the lines of input that match a regular expression. An option allows the lines not matching a regular expression to be selected instead. GNU grep understands Unicode.
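
A short sketch of the options described above, using an invented list of words. Because uniq only compares adjacent lines, the input is sorted first. The awk step is there only to strip the padding that uniq -c puts in front of each count.

```shell
# Invented sample: one word per line, with repetitions.
words=$(printf 'the\ncat\nthe\ndog\ncat\nthe\n')

# Count how often each distinct line occurs.
counts=$(printf '%s\n' "$words" | sort | uniq -c | awk '{print $1, $2}')

# -d prints only lines that are repeated; -u only lines that are not.
repeated=$(printf '%s\n' "$words" | sort | uniq -d)
single=$(printf '%s\n' "$words" | sort | uniq -u)

# grep selects lines matching a regular expression; -v inverts the selection.
with_t=$(printf '%s\n' "$words" | grep '^t')
without_t=$(printf '%s\n' "$words" | grep -v '^t')

printf '%s\n' "$counts"
```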

Combining Files

Given two or more files, it is possible to combine them into a single file either "horizontally" or "vertically", or on the basis of the contents of a particular field.

cat: Concatenates the files named on the command line and writes the result on the standard output.
paste: Writes lines consisting of sequentially corresponding lines of each input file on the standard output. By default, the "columns" taken from each input file are separated by a TAB character. Another delimiter may be specified. The inverse of paste is cut.
join: For each line in which the specified join field is the same in the two input files, writes to the standard output the concatenation of the two input lines. join provides a simple, text-based, relational database facility.
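
The three ways of combining files can be sketched as follows. All file names and contents are invented for the illustration. Note that join expects both inputs to be sorted on the join field, which here is the first field.

```shell
tmp=$(mktemp -d)
printf 'chien\nchat\n' > "$tmp/words"
printf 'dog\ncat\n' > "$tmp/glosses"

# "Vertical" combination: the second file follows the first.
vertical=$(cat "$tmp/words" "$tmp/glosses")

# "Horizontal" combination: corresponding lines side by side, TAB-separated.
horizontal=$(paste "$tmp/words" "$tmp/glosses")

# Combination on a shared field: both files start with an entry number.
printf '1 chien\n2 chat\n' > "$tmp/lexicon"
printf '1 dog\n2 cat\n' > "$tmp/english"
joined=$(join "$tmp/lexicon" "$tmp/english")

printf '%s\n' "$horizontal" "$joined"
rm -r "$tmp"
```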

Rearranging a File

Most frequently we want to rearrange a file on the basis of the content of the pieces, for which we use sort. The standard sort program is very useful, but it is not capable of some of the kinds of sorting that arise in linguistic work. A more powerful sorting program designed specifically for linguistics is msort, which we will look at later when we deal with sorting in more detail. It is occasionally useful, however, to be able to reverse the order of the contents of a file, for which tac is available.

sort: Sorts its input and writes the result on the standard output. GNU sort does not understand Unicode.
tac: Concatenates the files named on the command line and writes the result on the standard output in reverse order, that is, the last record first, the next-to-last record second, and so forth. Records default to lines, but the record separator may be specified by a regular expression.
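
A brief sketch of both programs on invented data (a word followed by a count). It assumes that tac is installed, as it is on systems with the GNU utilities. LC_ALL=C is set so that sort compares plain bytes regardless of the user's locale.

```shell
# Invented sample: word and count.
data=$(printf 'dog 12\ncat 7\nemu 30\n')

# Sort whole lines.
by_word=$(printf '%s\n' "$data" | LC_ALL=C sort)

# Sort on the second field only, numerically (n), largest first (r).
by_count=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2nr)

# Reverse the order of the lines without regard to their content.
reversed=$(printf '%s\n' "$data" | tac)

printf '%s\n' "$by_count"
```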

Comparing Two Files

cmp: Given two files, identifies the byte and line at which they differ, if they do. cmp is useful for finding out whether binary files are the same and, if they are different, finding out where to look for the difference. comm and diff are generally more useful for comparing human-readable files.
comm: Given two sorted files as input, writes on the standard output the lines that are common to both inputs, the lines that occur only in the first input file, and the lines that occur only in the second input file. Options allow any chosen combination of the three columns of output to be suppressed.
diff: Generates a description of how one input file differs from the other. Several output formats are available. Generally, they describe the differences in terms of the changes that must be made to derive the second file from the first.
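
The following sketch compares two small word lists, invented for the illustration. The numbers given to comm name the columns to suppress, so -12 leaves only the third column, the lines common to both files.

```shell
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/a"
printf 'banana\ncherry\ndate\n' > "$tmp/b"

# comm needs sorted input; LC_ALL=C fixes the expected ordering.
both=$(LC_ALL=C comm -12 "$tmp/a" "$tmp/b")
only_a=$(LC_ALL=C comm -23 "$tmp/a" "$tmp/b")
only_b=$(LC_ALL=C comm -13 "$tmp/a" "$tmp/b")

# cmp -s prints nothing; its exit status says whether the files are identical.
if cmp -s "$tmp/a" "$tmp/b"; then same=yes; else same=no; fi

# diff describes how to turn the first file into the second.
# It exits with a non-zero status when the files differ.
diff "$tmp/a" "$tmp/b" || true

rm -r "$tmp"
```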

Transforming a File

There are a variety of ways of transforming a file in a systematic way. These range from the specialized transformations provided by fold to the very general transformations provided by sed and awk.

fold

Breaks long input lines. The primary use is formatting, but fold is sometimes useful in linguistic text processing. For example, suppose that you need to get each character onto a line by itself. The command

fold -w 1

will do the job. It sets the line length to one character. GNU fold understands Unicode.

tr

Translates one set of characters into another. Can also delete specified characters and reduce sequences of multiple tokens of the same character to a single token.
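
A minimal sketch of the three operations, using an invented ASCII sentence. The first command is a common way of putting each word on a line by itself: -c takes the complement of the letters, so every non-letter is translated to a newline, and -s squeezes runs of newlines into one.

```shell
text='The cat saw the Cat.'

# One word per line, then upper case translated to lower case.
tokens=$(printf '%s\n' "$text" | tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z')

# -d deletes the listed characters.
no_vowels=$(printf '%s\n' "$text" | tr -d 'aeiou')

# -s alone squeezes repeated characters into a single one.
squeezed=$(printf 'a   b\n' | tr -s ' ')

printf '%s\n' "$tokens"
```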

sed

A powerful stream editor. sed copies its input to its output, editing it in passing. Each portion of an input line matching a regular expression can be replaced with a fixed string or another portion of the input. Lines matching a regular expression can be deleted. GNU sed understands Unicode.
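
The operations just described can be sketched as follows. The input, with its hyphen-separated stems and suffixes, is invented for the illustration.

```shell
# Invented sample: a comment line and two hyphenated forms.
input=$(printf '# field notes\nwalk-ed\njump-ing\n')

# Replace the matched text with a fixed string.
plus=$(printf '%s\n' "$input" | sed 's/-/+/')

# Delete comment lines, then swap the two parts of each remaining line.
# \( \) marks a portion of the match; \1 and \2 refer back to those portions.
swapped=$(printf '%s\n' "$input" | sed -e '/^#/d' -e 's/\(.*\)-\(.*\)/\2 \1/')

printf '%s\n' "$swapped"
```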

awk

Copies its input to standard output, performing specified actions whenever the input matches a specified pattern. awk automatically parses its input into records and the records into fields. By default, a record is a line, with fields separated by whitespace. However, both the field and record separators may be changed. awk is actually a full-fledged programming language, meaning both that there is a good deal to learn in order to use all of its capabilities and that it can be used for many purposes. With only a small amount of effort, however, it can be used to extract particular records and fields and to rearrange fields. For example, the command

awk '{print $3,$1}'

will extract the first and third fields from every input line and print them in reverse order, that is, the third field followed by the first field. GNU awk understands Unicode.
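
A slightly fuller sketch combines a pattern with an action and changes the field separator. The colon-separated data (word, part of speech, count) is invented for the illustration.

```shell
# Invented sample: word:part of speech:count.
data=$(printf 'dog:N:12\nrun:V:7\ncat:N:3\n')

# -F: makes the colon the field separator. The pattern before the braces
# selects records; the action inside the braces prints rearranged fields.
nouns=$(printf '%s\n' "$data" | awk -F: '$2 == "N" && $3 > 5 { print $3, $1 }')

# Actions can also accumulate values; END runs after the last record.
total=$(printf '%s\n' "$data" | awk -F: '{ sum += $3 } END { print sum }')

echo "$nouns"
echo "$total"
```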

Using Tools Together

In some cases it is possible to accomplish a task using a single tool, but often it is necessary, or at any rate easier and more efficient, to divide the task up among different tools. One of the strengths of Unix is the fact that it provides unusually good support for using tools together.

I/O Redirection

Some programs read and write files named on the command line. In this case, if you want one program to read the output of another, you have no choice but to have the second program read from the file created by the first program. However, many programs read from the standard input, abbreviated stdin, and write to the standard output, abbreviated stdout. By default, these are associated with the terminal: a program that reads from the standard input will read what the user types at the keyboard, and what a program writes on its standard output will appear on the terminal.

Every process has three i/o streams opened for it automatically when it is created. Two of these are stdin and stdout; the third, which we will mention only briefly here, is the standard error output, abbreviated stderr. stderr is a second output stream, provided so that a program's main output can be kept separate from error messages or other commentary.

It is possible to redirect the three default i/o streams. The less than sign reassociates stdin with a file; the greater than sign redirects stdout. Thus, a program like wc that reads from stdin and writes on stdout will read its input from file a and write on file b if we use the following command line:

wc < a > b
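
The command above can be tried as follows. The sketch also shows 2>, the usual shell notation for redirecting stderr. The file names and the report function are invented for the illustration.

```shell
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/a"

# wc reads file a on stdin and writes its line count to file b.
wc -l < "$tmp/a" > "$tmp/b"
count=$(tr -d ' ' < "$tmp/b")

# A hypothetical command that writes to both output streams.
report() {
    echo "result"
    echo "warning" >&2    # >&2 sends this line to stderr
}

# > captures stdout and 2> captures stderr, in separate files.
report > "$tmp/out" 2> "$tmp/err"
out=$(cat "$tmp/out")
err=$(cat "$tmp/err")

echo "count=$count out=$out err=$err"
rm -r "$tmp"
```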

Pipes
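
A pipe, written |, connects the stdout of one command to the stdin of the next, so that no intermediate file is needed. As a minimal sketch, the following pipeline chains several of the tools summarized earlier to produce a word frequency list. The sample text is invented for the illustration and contains only ASCII letters.

```shell
text='The cat saw the dog. The dog saw the cat and the bird.'

freq=$(printf '%s\n' "$text" |
    tr -cs 'A-Za-z' '\n' |      # one word per line
    tr 'A-Z' 'a-z' |            # fold case
    sort |                      # bring identical words together
    uniq -c |                   # count each distinct word
    sort -rn |                  # most frequent first
    awk '{print $1, $2}')       # strip the padding added by uniq -c

printf '%s\n' "$freq"
top=$(printf '%s\n' "$freq" | head -n 1)
```

Each stage does one small job, and the stages can be rearranged or replaced independently.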

Shell Scripts

simple shell scripts - just a sequence of commands like one might type on the command line
    command line arguments $1, $2 etc.
    shift
more complex shell scripts make use of the shell as a programming language
    loops
    conditionals
    variable setting
choice of shells: tcsh vs. bash
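
A small sketch touching on several of these points. The script name countlines.sh and the sample files are invented for the illustration. The script uses Bourne shell syntax, which also works in bash but not in tcsh.

```shell
tmp=$(mktemp -d)

# Write a hypothetical script to a file.
cat > "$tmp/countlines.sh" <<'EOF'
#!/bin/sh
# Usage: countlines.sh LABEL FILE...
label=$1      # first command-line argument
shift         # discard it; the remaining arguments move down one place
total=0
for file in "$@"; do                # loop over the remaining arguments
    if [ -r "$file" ]; then         # conditional: count only readable files
        n=$(wc -l < "$file")
        total=$((total + n))        # variable setting
    else
        echo "skipping $file" >&2
    fi
done
echo "$label: $total"
EOF

printf 'a\nb\n' > "$tmp/one"
printf 'c\n' > "$tmp/two"

# Run the script on two existing files and one missing file.
result=$(sh "$tmp/countlines.sh" total "$tmp/one" "$tmp/two" "$tmp/missing" 2>/dev/null)
echo "$result"
rm -r "$tmp"
```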

Make

If your work involves several stages, some of which take a significant amount of time, it is desirable not to have to redo more than necessary. If you just run your commands from the command line or assemble them into a shell script, whenever you change something you either have to rerun the entire process or you have to figure out which parts must be rerun and edit these out so that you can run them separately. This is tedious and error-prone.

Fortunately, there is an alternative, the make program. make executes the commands necessary to generate specified targets from the files on which they depend. make obtains its instructions from files known as makefiles. If you don't specify what file to use, make looks for a file named makefile in the current directory, then for a file named Makefile. A makefile may also be specified on the command line.

The makefile expresses dependencies among files and indicates how to generate each file from those it depends on. Here is a simple makefile:

text.u:		text.can
		WeirdFont2Unicode < text.can > text.u

text.can:	text
		ReorderWeirdFont < text > text.can

This might be used to obtain Unicode text from text in a font with a proprietary encoding. In some cases, it is necessary to reorder the codes before performing the rest of the conversion, and it is often easiest to separate the reordering from the main part of the conversion. If the conversions are not complex and the amount of text to be processed is not large, we might not need to use make, but this will serve as an example. For some real projects, the makefile is very complex. The makefile that controls the generation of Jonathan Amith's printed dictionary of Oapan and Ameyaltepec Nahuatl from the database is 36 lines long. The makefile that controls the generation of the Carrier dictionaries that I work on is 428 lines long.

The simple makefile shown above contains two rules. The first rule says that the target text.u depends on text.can and that the former can be generated from the latter by executing the command WeirdFont2Unicode < text.can > text.u. The second rule says that the target text.can depends on the file text and that text.can can be generated from text by executing the command ReorderWeirdFont < text > text.can.

If we execute make with this makefile in a directory containing the file text but neither text.can nor text.u, make will automatically execute the necessary commands to create text.u. It does this by observing that in order to generate text.u it needs text.can. Since this does not exist, it looks to see if it knows how to create it. In fact, it does, since the second rule tells it how. So make first executes the command ReorderWeirdFont < text > text.can to create text.can, and then executes the command WeirdFont2Unicode < text.can > text.u to create text.u.

Now, suppose that text.can already existed when we ran make. We might think that make would use the existing text.can and just execute the command WeirdFont2Unicode < text.can > text.u to create text.u. In fact, this will happen only if text.can is more up to date than the file it depends on, namely text. If the modification time of text is more recent than that of text.can, make will decide that text.can is out of date and should be replaced, and it will then regenerate text.u as well. As a result, whenever you modify a file, you need only run make and it will rerun precisely those commands necessary to update the target.
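
This behaviour can be observed with a small experiment. In the sketch below, sort and tr are stand-ins for the conversion programs ReorderWeirdFont and WeirdFont2Unicode, chosen only so that the example can be run anywhere, and the contents of text are invented. The -q option asks make whether the target is up to date without running any commands.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Same structure as the makefile above, with stand-in commands.
# Each command line in a makefile must begin with a TAB (\t).
printf 'text.u: text.can\n\ttr a-z A-Z < text.can > text.u\n\ntext.can: text\n\tsort < text > text.can\n' > Makefile
printf 'b\na\n' > text

make -s                     # builds text.can, then text.u
built=$(cat text.u)

# Nothing has changed, so make -q reports success (up to date).
if make -q; then fresh=yes; else fresh=no; fi

sleep 1                     # make sure the new timestamp is later
touch text                  # text is now newer than text.can

# The target is now out of date, so make -q reports failure.
if make -q; then stale=no; else stale=yes; fi

make -s                     # reruns both commands
if make -q; then rebuilt=yes; else rebuilt=no; fi

echo "fresh=$fresh stale=$stale rebuilt=$rebuilt"
cd - > /dev/null
rm -r "$tmp"
```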

A similar system that offers some advantages over make is makepp, which at present runs only on Unix systems. makepp is backwards compatible with make in the sense that it can use makefiles created for make, but it has additional capabilities. One difference that is particularly useful for linguistic work is that makepp rebuilds when the build rules are changed as well as when files change. When one is writing computer programs, it is typically the case that the relationship between the components is easy to set up and does not change very often. Most of the changes during the development process are in the programs themselves, that is, in the files that must be processed. So for the purposes for which make was originally developed, rebuilding only when files change makes sense. On the other hand, in linguistic data processing, the underlying files are typically data sources that do not change. What changes during development are the commands that process the data sources. Rebuilding automatically when these commands change is therefore a great convenience.

Resources

make homepage: The make homepage, with links to documentation, source, etc.
make manual: The on-line reference manual, as hypertext.
make tutorial: This is a very clear, elementary, and nicely illustrated tutorial. It uses the compilation of C programs as examples, but this shouldn't put off those who don't know anything about C programming.
make tutorial: Long and detailed. Not for beginners. A good introduction to the more advanced features. Good to read after you understand the basics.
makepp homepage: The homepage, with links to documentation, source, etc.
makepp tutorial: A nice, clear tutorial, suitable for those with no background.

Executing Other Programs as Child Processes

Another way in which separate programs can be made to work together is for one program to execute another. The first program is said to be the parent, the second to be a child of the first. Only some programs can do this, essentially those like awk and Python that are programming languages. How to do this depends on the programming language and is not appropriate for detailed discussion here. But you should keep this possibility in mind when using a programming language. To take a simple example, suppose that you need to sort some data. It is possible to write a sorting subroutine in awk, but it is much easier and more efficient to use sort or another specialized sorting program. If you need this sort in the midst of an awk program, you can run the sorting program as a child of awk.
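
For awk, one way of doing this is to pipe the output of print to a command named in a string. The following is only a sketch on invented data, not a full treatment.

```shell
# Invented sample: word and count.
data=$(printf 'dog 12\ncat 7\nemu 30\n')

sorted=$(printf '%s\n' "$data" | awk '
    # Swap the fields and send each record to a child sort process.
    { print $2, $1 | "sort -n" }
    # Closing the pipe lets sort see the end of its input and finish.
    END { close("sort -n") }
')

printf '%s\n' "$sorted"
```

The command string given to close must be exactly the same as the one used after the | in the print statement.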

Revised 2004/04/25.