Lists and Multidimensional Arrays in AWK

The Problem - Data Varying in Two or More Dimensions

A number of situations arise in which we have some category of data of which there may be more than one item and where we do not know in advance how many there will be. For example, a word may be exemplified by any number of example sentences. An affix on which we wish to index may occur in many words.

Where we know in advance how many entries there will be, we can simply set up a variable in which to store each one. For example, if we knew that each word would have exactly one example sentence, we could create two variables in which to store them:


ExampleSentence1

ExampleSentence2

If there is only one dimension of variation, we can use an ordinary array:

ExampleSentences[1]
ExampleSentences[2]
ExampleSentences[3]

etc.

But a problem arises when there are two or more dimensions of variation. Suppose, for example, that the lexicon with which we are working lists individual allomorphs of morphemes and that we must keep track of words and sentences that exemplify each allomorph. We now have two dimensions of variation: allomorph and example token. If we create an array indexed by the different allomorphs, how do deal with the existence of multiple examples for each allomorph? Similarly, if we are constructing an index for a certain class of affixes, if we create an array indexed by the affixes, how do we deal with the fact that we may have multiple words in which each affix occurs?

There are two approaches to this sort of problem. One is to use a single dimensional array but to store not a single value but a list of values. For example, if we are generating a list of the words that contain each of the "lexical suffixes" of a Salishan language, we would use an array indexed by suffix and in each entry in the array store the list of record IDs.

The other approach is to use a multi-dimensional array. For example, we could create an array of example sentences indexed both by allomorph and by token number:

Examples["qin"][1]
Examples["qin"][2]

would, for example, contain the first two examples of the allomorph "qin".

Unfortunately, AWK does not directly support either lists or multidimensional arrays. Fortunately, it is possible to emulate them without too much difficulty.

Emulating Lists

To emulate a list, what we need to do is to concatenate the members of the list into a single string, separated by a divider that we can use to when we need to extract the individual members. For the sake of concreteness, let us suppose that we are constructing a list of record identifiers consisting of integers for each lexical suffix. We suppose that the lexical suffix for the current record is stored in the variable ls and that the fields of the record are in an array rec indexed by their tags. As we process each record, we use code like this:

LexicalSuffixList[ls] = LexicalSuffixList[ls] rec["UID"] ",";

This line appends the unique identifier of the current record to the existing list of unique identifiers for the suffix in the variable ls. It does this by concatenating the current list with the unique identifier of the current record and a comma and replacing the current value of the list with the new one. (Remember that space is the concatenation operator in AWK.) The first time this statement is executed for a given lexical suffix, this slot in the array will be empty, so the current record identifier will be appended to the empty string, resulting in something like:

000120,

The next time it is executed, the current record identifier will be appended to this, resulting in something like:

000120,000276,

We will end up with a string containing a number of record identifiers, each followed by a comma. We can treat this string as a list since whenever we need to we can split it on the commas and extract the individual record identifiers. A statement like:


split(LexicalSuffixList[ls], uids,",")

will leave the identifiers of the records containing the suffix ls in the array uids.

In general, it is possible to emulate lists in AWK by combining the elements of the list into a string together with a separator that cannot be a member of any of the individual items.

Emulating Multidimensional Arrays

A multidimensional array is one in which there is more than one index. We can emulate this with single-dimensional arrays by combining two or more indices into a single string. From the point of view of AWK, it looks like a single index, but to it is composed of two or more discrete parts. In fact, AWK allows us to do this in a way that looks very much like using real multidimensional arrays. Consider the expression:

ex[allomorph,exno]

Superficially this looks like a reference to a two-dimensional array, where the first index is the allomorph and the second is the example number. It will in fact function this way. What actually happens, though, is that AWK treats this as a single-dimensional array whose index is the string consisting of the concatenation of the two indices with a separator between them. The separator is the value of the variable SUBSEP and defaults to character code octal 034. This is the ASCII "file separator" character and is unlikely to occur in text files.

For most purposes, this is a detail of implementation that makes no difference for the user. However, you can run into trouble if your indices contain the character that AWK uses as the separator. This is unlikely, and you can change the separator if necessary.

Notice that arrays with emulated lists as values and multi-dimensional arrays have very similar structures. The difference is that in one case we combine multiple values into a single value, while in the other case we combine multiple indices into a single index. Further discussion of multi-dimensional arrays in AWK can be found here.

The Extra-Separator Problem

In the discussion above of emulating lists by concatenating the list items together with a separator, we glossed over the fact that the code presented resulted in an extra comma at the end of the list. This "extra-separator" problem is one that arises frequently in computer programming. It arises whenever we need to generate a list of items delimited by a separator. Even if the programming language that we are using supports lists directly, this problem arises when printing a list out.

The reason for the problem is that we often don't know when we are dealing with the last item of a set. We want to end up with a separtor following every item but the last, but if we emit a separator after every item, we end up with an extra separator after the last item.

There are three ways of dealing with this problem:

Prevent it from occurring;
Clean up afterward;
Adapt to it;

It is possible to prevent the problem from occurring by emitting the separator before each item instead of after it. If we do this unconditionally, this results in the opposite problem, namely an extra separator at the beginning of the list:

,000120,000276

However, we can prevent this by emiting the separator only before items other than the first item. This we can do. Although we have no way of knowing, as we process the input, whether we are dealing with the last item, we can keep track of whether we are dealing with the first item. For the particular example discussed above, here is code that will do the job:

Count[ls] += 1;
if(Count[ls] < 2) LexicalSuffixList[ls] = LexicalSuffixList[ls] rec["UID"];
else LexicalSuffixList[ls] = LexicalSuffixList[ls] "," rec["UID"];

Another approach is to generate the extra separator and remove it later. This works because once we have read all of the input we know that the last character in the string is an extra separator. One way to do this in our example case is:

LexicalSuffixes[ls] = substr(LexicalSuffixes[ls],1,length(LexicalSuffixes[ls])-1);

A third approach is not to worry about the trailing extra separator but to adapt to it where necessary. If we are printing out a list, we do need to do something about it because, rightly or wrongly, people will consider the presence of a trailing comma an error. But if we are just emulating lists within a program, what harm does it do?

The one place in which it makes a difference is when we use split to extract the components of the list. If there is a trailing separator, split will think that it is followed by another item and will create an additional, empty, slot in the array.

For example, compare the following two statements:


cnt = split("001232,000435",ids,",")

cnt = split("001232,000435,",ids,",")

The first will produce the following result:

cnt	2
ids[1]	"001232"
ids[2]	"000435"

The second will produce the following result:

cnt	3
ids[1]	"001232"
ids[2]	"000435"
ids[3]	""

So long as we recognize that the count returned by split is one too high and ignore the last member of the array it creates, we will be fine.

Which approach to take is partly a matter of taste and partly a matter of the kind of program you are writing. If you are writing a small program, or one that no one else is likely to have to revise, you may well choose to ignore the problem and adapt to it as this is the simplest approach, and under most circumstances, the most efficient. On the other hand, it may be undesirable to do this in a complex program, because at some time in the future, you or someone else working on it will not realize that your lists have a trailing separator.

Preventing the problem from occurring seems to be the cleanest approach, but it requires a test when processing every record. However, sometimes it is desirable to avoid cluttering up the code at this point.