Parsing Lexical Database Files


  1. The Shoebox File Format
  2. Parsing the Input into Records
  3. Parsing Records into Fields
  4. Extracting the Desired Fields
    1. The Simple Case
    2. Repeated Tags
  5. Multiple-Valued Fields
  6. Variations on the Shoebox Format
  7. Band Format

The Shoebox File Format

The most widely used format for lexical databases is the Shoebox format. It is the format used by the Shoebox program, a program for working with lexical databases created by the Summer Institute of Linguistics for the use of its fieldworkers. However, the SIL has made Shoebox generally available, so many other people have used it. Furthermore, because this is a convenient format, many databases use essentially the same format even though they are not managed with the Shoebox software.

A Shoebox file consists of a set of records, each of which in practice is a block of text separated from other records by blank lines. Strictly speaking, blank lines are not necessary. A new record begins with a specified tag. Since people often edit Shoebox files directly and files without blank lines between records are very difficult (for humans) to read, in practice Shoebox files virtually always have blank lines between records.

A record consists of fields. Each field consists of two meaningful parts: a tag, which identifies the field, and a value, which is the content of the field. The tag and the value are separated by one or more space characters. Each field begins with a backslash. Because it is the backslash, not the end of a line, that marks the boundary between fields, a field may extend over multiple lines.

Here is a fragment of a database in Shoebox format, extracted from Steve Swartz's Warlpiri lexicon:

\w jakanypa
\p nm
\d small tree or {*shrub=sp.} (Petalostylis cassiodes)
\s wariyi Flora

\w jakarlapayi-payi
\p nm
\d *gall=sp. {*insect=sp.} {within nutlike coccoon which grows on the coolibah tree}

\w jakarlarra
\p nm
\d prickly=*bush=sp.
\s nyili Flora

\w jakarn-karrimi
\p vc1i
\d to be *crack%ed, as of the ground

Each record begins with the Warlpiri headword marked by the tag w ("word"). This is followed by the part of speech, with tag p, and a definition, tagged d. Two of the records also contain a synonym field, tagged s.

Shoebox format does not require tags to consist of a single character, and it does not require fields to be ordered in a consistent way. The choice of tags, furthermore, is up to the user. Depending on the language and the nature of the information about it, different tags will be required. Some databases make use of only a few different tags; others make use of over a hundred.

For the time being, we will discuss files in Shoebox format as narrowly defined. See below for discussion of variations on Shoebox format. However, there is one aspect of the original Shoebox format that is, from our point of view, a very bad choice. This is the use of backslash as the character that marks the beginning of a field. Backslash has a special meaning as a quote character in numerous Unix tools. This means that when we want to use a literal backslash, it is necessary to escape it. Since backslash has a special meaning to multiple Unix tools, we find ourselves having to worry about backslash repeatedly. Moreover, several levels of escape may be necessary, when, for example, a shell script contains an AWK script. Overall, it is a major pain in the neck to deal with backslash. Almost any other character would be less trouble.

Therefore, from now on we will use a percent-sign (%) in place of backslash as the character that begins fields. This is what I use in my own databases. When processing someone else's database, I simply convert their backslashes to percent-signs at the outset. (You can use tr to do this: tr '\\' '%'.) If the database uses percent-sign in its values, you can either convert these percent-signs to something else first or use another string at the beginning of fields.
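
As a minimal sketch of this conversion, here are two shell functions. The names to_percent and to_percent_line_initial are chosen here for illustration. The first does exactly what the tr command above does. The second is a more cautious variant, which assumes that a field initiator always stands at the very beginning of a line and therefore leaves any backslash inside a value alone.

```sh
# Convert every backslash to a percent-sign.
to_percent() {
    tr '\\' '%'
}

# Variant: convert only a backslash at the start of a line,
# so that backslashes inside values are left untouched.
to_percent_line_initial() {
    sed 's/^\\/%/'
}

# Usage: both functions read standard input and write standard output.
printf '%s\n' '\w jakanypa' '\p nm' | to_percent
```

Neither function does anything about percent-signs that are already present in the values; those have to be dealt with first, as noted above.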



Parsing the Input into Records

In Shoebox files records consist of blocks of text separated by blank lines. We can get AWK to split its input into records in this way by setting the record separator RS to the empty string.

RS = "";

Normally we would do this in the BEGIN action, though it can also be done on the command line. This is a special feature that you will need to remember. There is no logical reason that setting the record separator to the empty string should result in this kind of parse. Furthermore, when you do this, AWK silently absorbs any extra blank lines. If, for example, two "real" records are separated by four blank lines, AWK will absorb the extra blank lines rather than deciding that there are empty records between them.
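
A quick way to convince yourself of this behaviour is to count records. The following is only a sketch; count_records is a name made up for the purpose, and the input is two abbreviated records with four blank lines between them.

```sh
# count_records is a name chosen for this sketch.
count_records() {
    awk 'BEGIN { RS = "" }      # blank lines separate records
         END   { print NR }'
}

# Two records separated by four blank lines.
printf '%s\n' '%w jakanypa' '%p nm' '' '' '' '' '%w jakarlarra' '%p nm' | count_records
# prints 2
```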

Parsing Records into Fields

How AWK parses records into fields is determined by the value of the field separator variable FS. Since percent-sign separates fields, we want to make percent-sign the field separator.

FS="%";

This will work for records like those shown above, in which each field consists of a single line. However, it will fail if fields extend over multiple lines. This is because, when the record separator is set to the null string, AWK not only makes newline the default field separator, it insists on making it a field separator no matter what else is defined as the field separator.

To overcome this problem we set the field separator to a regular expression:

 FS = "\n?%";

This regular expression makes the field separator a percent-sign preceded by an optional newline character. This has the effect of overriding AWK's insistence on making the newline character a field separator. Note that regular expressions cannot be used as field separators in the oldest versions of AWK. They are, however, specified by the POSIX standard and supported by current implementations such as GNU awk. How a regular-expression field separator interacts with the implicit newline separator can differ between implementations, so it is worth testing on the AWK you actually use.
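
The following sketch shows the effect on a record whose definition runs over two lines. The function name show_fields is my own, the record is adapted from the Warlpiri example above, and the output format is arbitrary. I assume an AWK, such as GNU awk, that does not treat newline as an extra field separator once FS is a regular expression.

```sh
# show_fields is a name chosen for this sketch.
show_fields() {
    awk 'BEGIN { RS = ""; FS = "\n?%" }
         {
             # $1 is the empty field before the first percent-sign, so start at 2
             for (i = 2; i <= NF; i++) {
                 field = $i
                 gsub(/\n/, " | ", field)   # make embedded line breaks visible
                 print NR "." (i - 1) ": " field
             }
         }'
}

printf '%s\n' '%w jakarn-karrimi' '%d to be cracked,' 'as of the ground' | show_fields
```

If the second line of the definition is reported as part of field 1.2, the field separator is doing what we want. If it shows up as a field of its own, your AWK is still splitting on newlines.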

Extracting the Desired Fields


The Simple Case

Once we have the record and field separators set correctly, AWK will automatically parse the input. However, we still need to select the particular fields with which we wish to work. Since fields are identified by their tags, this means that we need to be able to separate the tag from the value and to select fields by their tags. Here is a useful way of doing this:

1 for(i = 2; i <= NF; i++){
2   split($i, f, " "); 
3   rec[f[1]]=substr($i,index($i," ")+1);
4 }

Line 1 iterates over the fields in the record. We start with the index at 2 rather than 1 because the first field will be blank. This is because, strictly speaking, in Shoebox format there are no field separators; rather there are field initiators. That is, the initiator (backslash in the original format, percent-sign here) occurs at the beginning of every field, including the first field in the record. Since AWK is looking for field separators, it will interpret the initiator at the beginning of the first field of the record as the end of a field, and will therefore create an empty initial field, assigned to $1. Starting the loop index at 2 skips this empty field.

Line 2 uses the builtin function split to create an array called f each of whose components is one piece of the field. The first argument to split specifies the string to split. In this case it is whichever field we are currently working on. The second argument specifies the name of the array into which to put the pieces. The third argument is a regular expression on which to split. It is analogous to the field separator used by AWK in its initial parse of the input. In this case, we have used a single space. This is a special case: as with the default field separator, AWK splits on runs of whitespace (spaces, tabs, and newlines) rather than on each individual space. The field will therefore be split into pieces separated by strings of one or more whitespace characters.

The result of the call to split is an array f whose first component, f[1], contains the tag. The remaining components contain the pieces of the value.

Line 3 makes an entry in an array called rec. The index for this entry is f[1], that is, the tag. The value assigned to this entry is the remainder of the field. We obtain the remainder by using the builtin function substr to extract the portion of the field that begins with the character following the first space and extends to the end of the field. The builtin function index returns the position of the first occurrence of the specified string, in this case, a space, in its target.

Note that although we know that f[1] contains the tag, we cannot assume that f[2] contains the value of the field. This is because the value may contain spaces, which will result in the value being split into several pieces. We must either construct the value by re-assembling it from f[2] etc., or, as above, skip over the tag and the separator between tag and value and make the remainder of the field the value.

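The array rec now contains the contents of the various fields of the record, indexed by their tags. If, for example, you want the headword, you can refer to it as rec["w"]. Of course, you may not be sure that every record contains the field in which you are interested. You may therefore need to test whether it is present. For this test to be meaningful, rec must be emptied at the start of each record (for example with split("", rec)); otherwise entries left over from earlier records will still be present. You can then test for a field using the construct: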

if("w" in rec)

There is one other point with which we may need to deal. In the approach taken here we found the beginning of the value by starting one character past the first space. Since there may be more than one space between the tag and the value, our "value" may begin with spaces that are really part of the separator, not intended as part of the value.

There are several ways of dealing with this. The best is not to create the problem in the first place by using a single character as the separator between tag and value. This is another way in which the original Shoebox format is not optimal. If one must deal with this problem, there are two approaches. One is to use a more sophisticated technique for identifying the beginning of the value. Another, probably the simplest, is to chop the extra spaces off the beginning of the value. Here is a revised version of the code that does this:

1 for(i = 2; i <= NF; i++){
2   split($i, f, " "); 
3   value = substr($i,index($i," ")+1);
4   sub(/^ */,"",value);
5   rec[f[1]]=value;
6 }

The call to sub in line 4 replaces a sequence of spaces at the beginning of the string with the empty string; that is, it deletes them.
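
Putting the pieces together, here is a sketch of a complete program built around the loop above. It prints the headword and synonym of every record that has a synonym field. The function name list_synonyms and the output layout are my own choices. Note the call to split("", rec) at the top of the action: it empties rec, so that a synonym from one record is not mistaken for a field of the next.

```sh
# list_synonyms is a name chosen for this sketch.
list_synonyms() {
    awk 'BEGIN { RS = ""; FS = "\n?%" }
         {
             split("", rec)                  # empty rec before each record
             for (i = 2; i <= NF; i++) {
                 split($i, f, " ")
                 value = substr($i, index($i, " ") + 1)
                 sub(/^ */, "", value)       # drop extra separator spaces
                 rec[f[1]] = value
             }
             if (("w" in rec) && ("s" in rec))
                 print rec["w"] ": " rec["s"]
         }'
}

list_synonyms <<'EOF'
%w jakanypa
%p nm
%s wariyi Flora

%w jakarlapayi-payi
%p nm

%w jakarlarra
%p nm
%s nyili Flora
EOF
```

With the three abbreviated records shown, only the first and third have an s field, so only those two should be listed.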

Repeated Tags

In some databases a tag may occur more than once in a single record. This arises because of the need to deal with multiple pieces of information of the same type. There may, for example, be multiple synonyms, or multiple example sentences. In this case the technique described above for extracting fields will not work because it assumes that there is at most one field with a given tag. If there is more than one field with a certain tag, the last one will be stored and the others discarded.

In this case, we loop over the fields and identify tags by matching regular expressions against the beginning of the field. Each time we identify a tag, we increment a counter for fields of the appropriate type, extract the value, and store it in an array indexed on the counter. Here is an example:

 1  for (i = 2; i <= NF; i++){
 2    if(match($i,"^xrb ")){
 3      RootCnt += 1;
 4      Roots[RootCnt] = substr($i,index($i," ")+1);
 5    }
 6    if(match($i,"^sem ")){
 7      SemanticFieldCnt += 1;
 8      SemanticFields[SemanticFieldCnt] = substr($i,index($i," ")+1);
 9    }
10   }

The loop in line 1 iterates over the fields. As before, we start at 2 because the first field is empty. Lines 2 and 6 attempt to match regular expressions against the field. In each case, the regular expression begins with a circumflex to anchor it to the beginning of the field, and ends in a space, so as to guarantee that we are matching the entire tag and not a prefix of it. If a match is found, the appropriate counter is incremented, as in lines 3 and 7. In each case there is an array intended to hold this sort of information. In lines 4 and 8 the value is extracted along the lines described above, by using substr to extract the portion of the field that follows the space that separates the tag from the value. As above, if it is possible for extra spaces to be located between tag and value, they can be stripped off using sub.

At the end of this process, the counters RootCnt and SemanticFieldCnt will contain the number of fields of each type found in the record, and the arrays Roots and SemanticFields will contain the values of these fields. For example, if there are two semantic fields in the record, SemanticFieldCnt will be equal to 2. The first semantic field will be stored in SemanticFields[1]; the second in SemanticFields[2].
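
Here is a sketch of how this loop might sit inside a complete program. The function name list_semantic_fields, the sample records, and the tab-separated output are invented for illustration; only the sem tag and the counter and array names come from the example above. The counter has to be reset at the start of each record, since otherwise it would go on counting from the previous record.

```sh
# list_semantic_fields is a name chosen for this sketch.
list_semantic_fields() {
    awk 'BEGIN { RS = ""; FS = "\n?%" }
         {
             SemanticFieldCnt = 0            # reset the counter for each record
             headword = ""
             for (i = 2; i <= NF; i++) {
                 if (match($i, "^w "))
                     headword = substr($i, index($i, " ") + 1)
                 if (match($i, "^sem ")) {
                     SemanticFieldCnt += 1
                     SemanticFields[SemanticFieldCnt] = substr($i, index($i, " ") + 1)
                 }
             }
             # one output line per semantic field: headword, position, value
             for (j = 1; j <= SemanticFieldCnt; j++)
                 print headword "\t" j "/" SemanticFieldCnt "\t" SemanticFields[j]
         }'
}

# Made-up records: the first has two sem fields, the second has one.
printf '%s\n' '%w alpha' '%sem Flora' '%sem Medicine' '' '%w beta' '%sem Fauna' |
    list_semantic_fields
```

Because the loop over j stops at the current value of the counter, stale entries left in SemanticFields by a longer earlier record are never printed.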



Multiple-Valued Fields

When there are several pieces of information of the same type, one way to store them is by using multiple fields with the same tag. A second way is again to use separate fields, but to use indexed tags, e.g. SemanticField1, SemanticField2, etc. These can be identified by matching just the constant part of the tag, or, if necessary, by a fuller parse, in which the index (the number at the end) is extracted as well. A third approach is to store multiple pieces of information in the same field, using a separator to keep them apart.
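
For the second approach, a fuller parse of indexed tags might look like the following sketch. The function name indexed_fields and the sample record are made up; SemanticField is the constant part of the tag used as an example above. I assume that the index is written in decimal digits directly after the constant part and that a single space follows it.

```sh
# indexed_fields is a name chosen for this sketch.
indexed_fields() {
    awk 'BEGIN { RS = ""; FS = "\n?%"; prefix = "SemanticField" }
         {
             for (i = 2; i <= NF; i++) {
                 # constant part, then one or more digits, then the separating space
                 if (match($i, "^" prefix "[0-9]+ ")) {
                     n = substr($i, length(prefix) + 1, RLENGTH - length(prefix) - 1) + 0
                     print n ": " substr($i, RLENGTH + 1)
                 }
             }
         }'
}

printf '%s\n' '%w alpha' '%SemanticField1 Flora' '%SemanticField2 Medicine' | indexed_fields
```

Here RLENGTH, which match sets to the length of the matched text, tells us both how many digits the index has and where the value begins.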

For example, some databases contain an "inverse header" field, used to generate an inverse dictionary, e.g. the English-Carrier section of a Carrier dictionary in which the headwords are Carrier words. A word may have multiple inverse headwords. Here is an example:

%P:lacholbai
%G:yarrow plant, milfoil
%IH:yarrow plant/milfoil
%C:N
%SN:Achillea millefolium
%S:MAJO
%UID:000296
%MD:1997/04/15

This plant is known by two English names, both of which are listed in the inverse header field, separated by a slash.

Records of this type can be parsed by split:

items = split(rec["IH"],ihs,"/");

This splits rec["IH"] into pieces separated by slashes, putting each piece into an array named ihs. The number of pieces, in this case the number of inverse headwords, is returned by split and stored in the variable items. After this line, when processing the record above, ihs[1] would have the value "yarrow plant" and ihs[2] would have the value "milfoil".
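
As a sketch of how this might be used, the following function prints one line of an inverse index for each inverse headword. The name inverse_headwords is my own. Since the record above separates tag from value with a colon rather than spaces, the loop that fills rec looks for the first colon. I am also assuming that the P field holds the Carrier headword.

```sh
# inverse_headwords is a name chosen for this sketch.
inverse_headwords() {
    awk 'BEGIN { RS = ""; FS = "\n?%" }
         {
             split("", rec)                  # empty rec before each record
             for (i = 2; i <= NF; i++) {
                 colon = index($i, ":")      # first colon ends the tag
                 if (colon > 0)
                     rec[substr($i, 1, colon - 1)] = substr($i, colon + 1)
             }
             if ("IH" in rec) {
                 items = split(rec["IH"], ihs, "/")
                 for (k = 1; k <= items; k++)
                     print ihs[k] "\t" rec["P"]
             }
         }'
}

inverse_headwords <<'EOF'
%P:lacholbai
%G:yarrow plant, milfoil
%IH:yarrow plant/milfoil
%C:N
EOF
```

Each English name is paired with the same Carrier word, which is what an English-Carrier section needs.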

Variations on the Shoebox Format

It is not uncommon to find variations on the Shoebox format, for two reasons. First, a format of the same general structure is logical and has evidently been invented independently a number of times. Secondly, as noted above, the original Shoebox format has some undesirable features. One, the use of backslash as a field-initiator, is a problem when using Unix tools. The other, the use of sequences of spaces to separate tag and value, both requires one to absorb extra spaces and can create difficulties because characters that one cannot see are hard to cope with. For example, although the intention may be that only space characters are permitted, users may include tabs as well.

The format that I personally use is one in which fields begin with a percent-sign. This eliminates the problem of dealing with backslashes. Percent-sign is a good choice because lexical data rarely contain percent-signs. Instead of spaces, I use a colon to separate tags from values. This eliminates the need to absorb extra separator characters, and eliminates the ambiguity between spaces and tabs.

In general, one should be prepared to work with databases that use a variety of strings at the beginning of fields. What string is appropriate is partly a matter of what may occur in the value. If, for example, percent-signs occur in the values, percent-sign is not a good choice for the field-initiator. Problems of this type are easily overcome by using a string that can be expected never to occur in a value. One database with which I have worked, for example, uses "==". A sequence of two equal-signs is extremely unlikely to occur in anything that might show up in this database.

Although from a computational point of view allowing the separator between tag and value to consist of any non-zero number of spaces is undesirable, some users insist on it because they use tags of variable length and like to keep the values aligned. It is, therefore, important to be prepared to deal with this format.
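
One way of coping with such files is to let match find the whole run of separator characters, whether spaces or tabs, and take the tag and the value from either side of it. The following is a sketch under that assumption; the name tag_value, the sample fields, and the bracketed output are purely illustrative.

```sh
# tag_value is a name chosen for this sketch.
tag_value() {
    awk 'BEGIN { RS = ""; FS = "\n?%" }
         {
             for (i = 2; i <= NF; i++) {
                 if (match($i, /[ \t]+/)) {  # first run of spaces and tabs
                     tag = substr($i, 1, RSTART - 1)
                     value = substr($i, RSTART + RLENGTH)
                 } else {                    # a tag with no value
                     tag = $i
                     value = ""
                 }
                 print tag "=[" value "]"
             }
         }'
}

# One field separated by three spaces, one by a tab.
printf '%%w   jakanypa\n%%gloss\tsmall tree\n' | tag_value
```

The brackets in the output make it easy to see whether any stray separator characters have been left at the beginning of a value.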

Band Format

There is another format that is not as widely used as Shoebox format but has nonetheless seen fairly extensive use. This is known as band format. This format was developed by Robert Hsu and has been used in a number of lexical databases, primarily for languages of North America and the Pacific, especially at the University of Hawaii. This format is intended to be processed by Hsu's Lexware suite of programs.

Band format resembles Shoebox format at first glance. Records consist of multiple lines, with fields identified by tags. Tags are separated from values by spaces. However, the manner in which fields and records are delimited sets it apart from Shoebox format. First, in band format a field is terminated by a linefeed not immediately followed by two spaces. In other words, fields by default are terminated by a linefeed. In order to make a field longer than one line, each line after the first must begin with two spaces. These two spaces serve as what is called a continuation marker.

Second, a record in this format begins with a line-initial period not immediately followed by another period. The reason for the restriction "not immediately followed by another period" is that in this format the number of periods before the tag is used to indicate hierarchical structure. For example, a word with two senses would have two definition fields, each preceded by two periods. Any other information associated with a particular sense would occur in the same block of text.

Here is an example of band format. The line numbers have been added for reference. They would not appear in the actual data.

 1 .root chun
 2 tag  tree
 3 ..noun  duchun
 4 gloss tree
 5 ..theme  n=d-chun
 6 gloss strut around
 7 example nusduchun
 8 trans I am strutting around  
 9 ..theme 0-chun
10 gloss stand proud

Here the record begins with a root chun. The next field contains the tag tree. This is associated with the entry as a whole and so has no leading periods. Line 3 begins with two periods, marking the first of three sub-entries. This sub-entry is a noun duchun. Line 4 contains the gloss. Since the gloss belongs to this sub-entry, it has no leading periods of its own. Line 5 begins a second sub-entry, this one a verb theme based on the same root as the noun. Line 6 contains the gloss, line 7 an example, and line 8 the English translation of the example. A third sub-entry begins at line 9, this one another verb theme. Line 10 contains the gloss.

The way in which records and fields are delimited makes band format impossible to parse in a simple way using general-purpose tools like AWK. It is of course possible to write special-purpose parsers for this format. Furthermore, the use of leading periods to indicate level of embedding means that tags are a bit more difficult to find since the leading periods must be removed or ignored.

Lexware and band format have been used to produce a number of fine dictionaries. When this format and the associated software were originally designed, there were reasons for using such a format. Furthermore, tools like AWK did not exist, so that compatibility with them was not a concern. However, at present, unless you are actually working with Lexware, it is best to avoid this format and use a format that is easier to deal with.

A file in band format can be converted to one in (generalized) Shoebox format as follows:

  1. Read the file one line at a time. Replace each line-initial period that is not immediately followed by another period with a blank line. It will now be possible to parse the file into records as if it were in Shoebox format.

  2. Replace every sequence of a line-feed followed by two spaces with a string that cannot otherwise occur in the data. This string marks the location of line breaks. It might, for example, be LINEBREAK. The result of this is to make every field into a single line.

  3. Now read the file line-by-line and at the beginning of every non-blank line insert the field-initiator of your choice. (The blank lines that separate records must be left as they are.) If you want to use strict Shoebox format, this will be a backslash.

  4. Now translate the string that you have chosen to mark the original line breaks, e.g. LINEBREAK, into linefeeds. This will restore the original line breaks. You now have a file that can be parsed into records and fields in the usual way.

  5. The remaining problem is to deal with the periods at the beginning of fields. If you don't care about the hierarchical information, or are prepared to recover it from the grouping of tags, you can simply delete them. That is, just delete one or more periods at the beginning of a field. Whether it is safe to do this depends on the structure of the data. If, for example, you know that nouns and verb themes will always be co-ordinate, you can simply assume that every field between one noun or verb theme and another is associated with the former. If it is necessary to retain the information provided by the initial periods, the best approach is probably to convert it to a new field, e.g. a LEVEL tag whose value is the level. Two periods would then translate into: LEVEL 2. Exactly what the best approach is will depend on the details of the hierarchical structure that you wish to impose.
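
The steps above can be sketched as a single AWK pass. This is not a literal transcription: because AWK reads one line at a time and a continuation line can be recognized by its two leading spaces, the LINEBREAK placeholder of steps 2 and 4 is not needed. The name band_to_shoebox is my own, percent-sign is used as the field-initiator, and the leading periods of sub-entries are converted to a LEVEL field as suggested in step 5. I assume that blank lines in the band file carry no meaning and can be dropped.

```sh
# band_to_shoebox is a name chosen for this sketch.
band_to_shoebox() {
    awk '
    /^  / {                         # continuation line: part of the previous field
        print substr($0, 3)         # drop the two-space continuation marker
        next
    }
    /^\./ && !/^\.\./ {             # exactly one leading period: a new record
        if (seen) print ""          # blank line between records
        seen = 1
        print "%" substr($0, 2)
        next
    }
    /^\.\./ {                       # two or more periods: a sub-entry
        match($0, /^\.+/)
        print "%LEVEL " RLENGTH     # record the depth in a field of its own
        print "%" substr($0, RLENGTH + 1)
        next
    }
    NF { print "%" $0 }             # any other non-blank line is an ordinary field
    '
}

band_to_shoebox <<'EOF'
.root chun
tag  tree
..noun  duchun
gloss tree
EOF
```

The output can then be parsed with RS and FS set as described earlier. Whether a separate LEVEL field is the right representation will, as noted above, depend on the hierarchical structure of the data.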
