Indices and Concordances

Introduction

Indices and concordances provide a means of finding words, or occasionally other units, in a text. In the case of traditional printed materials, an index provides the location information indirectly, by means of page numbers, line numbers, section numbers, verse numbers, and so forth. The word concordance is sometimes used with the same meaning as index, but it is also used to mean a format that provides the information more directly, for example, by printing the lines of text in which a word occurs. A printed index is thus more time-consuming to use than a concordance, but has the virtue of requiring much less space. In electronic form, an index becomes less of a burden because the pointer can be a hyperlink to the appropriate location in the full text, which one can follow virtually instantaneously.

An electronic concordance is superior to a printed concordance in two major ways. First, it is effectively unlimited by considerations of size. It may include virtually any amount of text, and it may be indexed in numerous ways. Second, an electronic concordance may readily be changed, whereas a printed concordance is fixed. There are now few if any reasons to produce a printed concordance other than circumstances in which it is to be used for an extended period of time in a location in which electrical power is not available.

In recent years, printed concordances have been replaced by electronic concordances, some intended for local use, some with web interfaces. The University of Dundee Web Concordances are typical examples of web-searchable concordances of literary texts.


Creating Your Own Concordances

If you work on well-known literary texts, there is a reasonable chance that someone has already created a concordance for the text that you are interested in. Otherwise, you will need to create your own concordance. You can do this using commercial software or by rolling your own.

Commercial Software

There are several commercial programs for creating and viewing concordances. These include the following:

For additional information, see Hans Klarkov Mortensen's concordance page [in Danish].

Rolling Your Own

It is easy to create indices and concordances using Unix tools. Here is a shell script that creates an index to a text in which each word is followed by a list of the line numbers on which it occurs.

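# $1 is the text file; the index is written to $1.index.
# Keep only letters, blanks, and newlines, so punctuation does not stick to words.
tr -dc 'A-Za-z[:blank:]\012' < "$1" |
gawk '
  # Append the current line number to the list kept for each word on the line.
  {for (i = 1; i <= NF; i++) words[$i] = words[$i]  sprintf(", %d",NR);}
  END{for (i in words){
	lines = words[i];
	sub(/^,/,"",lines);	# drop the leading comma
	printf("%s\t%s\n",i,lines);
  }
}' | sort -f -k 1 > "$1.index"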

This script makes use of the utilities tr and sort along with a short program in awk. Typical output looks like this:

ally	 2313
almost	 440, 1051, 1117, 2247
alone	 710, 927, 1084, 1394, 1395, 1850, 1950, 2221, 2390, 2594
along	 504, 1389

Click here to download a version of this script with detailed annotation.
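An index of this kind can be turned back into concordance-style output by looking up a word and printing the lines that the index points to. The following is only a minimal sketch and is not part of the original script: the function name lookup_word and the order of its arguments are assumptions made for illustration, and it expects the index format shown above (word, tab, comma-separated line numbers).

```shell
# lookup_word WORD INDEXFILE TEXTFILE   (hypothetical helper, for illustration)
# Print every line of TEXTFILE that INDEXFILE lists for WORD,
# each prefixed by its line number.
lookup_word() {
  awk -F'\t' -v word="$1" '
    # First file (the index): remember the line numbers listed for WORD.
    NR == FNR {
      if ($1 == word) {
        n = split($2, nums, /, */)
        for (j = 1; j <= n; j++) wanted[nums[j] + 0] = 1
      }
      next
    }
    # Second file (the text): print each wanted line with its number.
    FNR in wanted { printf("%d: %s\n", FNR, $0) }
  ' "$2" "$3"
}
```

Called as lookup_word alone FILE.index FILE, it prints each line of FILE on which "alone" occurs, preceded by its line number. The match on the word is exact and case-sensitive, which is a design choice of this sketch rather than a property of the index.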

One type of concordance with a long history on UNIX systems, where it was used to index computer manuals, is the keyword-in-context or KWIC index. A KWIC index looks like this:

                Principalities      are either hereditary, in whic
     long established; or they      are new.
itary, in which the family has      been
            Principalities are      either hereditary, in which th
                          long      established; or they are new.
ither hereditary, in which the      family has been
ereditary, in which the family      has been
     Principalities are either      hereditary, in which the famil
alities are either hereditary,      in which the family has been
                                    long established; or they are 
 long established; or they are      new.
             long established;      or they are new.
                                    Principalities are either here
re either hereditary, in which      the family has been
          long established; or      they are new.
ties are either hereditary, in      which the family has been

Each word in the input generates one line, in which the word is preceded and followed by a chosen amount of context, in this case at most 30 characters on each side. The indexed word immediately follows the gap that splits the line into two parts. Here is the text from which this was generated:

Principalities are either hereditary, in which the family has been
long established; or they are new.

Here is a shell script that generates a KWIC index, using two short awk programs along with sort.

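# First awk program: print each line, then one rotated copy per space,
# with the tail of the line first and the head after a tab.
awk '{print $0
for (i = length($0); i > 0; i--)
  if (substr($0,i,1) == " ") print substr($0,i+1) "\t" substr($0,1,i-1)
}' "$1" | sort -f | awk '
# Second awk program: put the two pieces back in order,
# keeping at most WID characters on each side.
BEGIN {FS = "\t"; WID = 30}
{printf("%" WID "s      %s\n",
	substr($2,length($2)-WID+1),
	substr($1,1,WID))
}'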

The first AWK program generates as many copies of each line as there are words in it. It prints the line unchanged and then, working backward from the end, splits the line at each space in turn and prints the second part before the first, separated by a tab. This has the effect of rotating the words: the first copy is the same as the original, the second copy begins with the last word, the third with the last two words, and so forth, until every word has appeared at the start of a copy. Here is the output of the first stage:

 
Principalities are either hereditary, in which the family has been
been	Principalities are either hereditary, in which the family has
has been	Principalities are either hereditary, in which the family
family has been	Principalities are either hereditary, in which the
the family has been	Principalities are either hereditary, in which
which the family has been	Principalities are either hereditary, in
in which the family has been	Principalities are either hereditary,
hereditary, in which the family has been	Principalities are either
either hereditary, in which the family has been	Principalities are
are either hereditary, in which the family has been	Principalities
long established; or they are new.
new.	long established; or they are
are new.	long established; or they
they are new.	long established; or
or they are new.	long established;
established; or they are new.	long

The call to sort orders the lines without regard to case (the -f option), so that copies beginning with the same word are grouped together:

are either hereditary, in which the family has been	Principalities
are new.	long established; or they
been	Principalities are either hereditary, in which the family has
either hereditary, in which the family has been	Principalities are
established; or they are new.	long
family has been	Principalities are either hereditary, in which the
has been	Principalities are either hereditary, in which the family
hereditary, in which the family has been	Principalities are either
in which the family has been	Principalities are either hereditary,
long established; or they are new.
new.	long established; or they are
or they are new.	long established;
Principalities are either hereditary, in which the family has been
the family has been	Principalities are either hereditary, in which
they are new.	long established; or
which the family has been	Principalities are either hereditary, in
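The role of the -f option can be seen on three words taken from the sample. This is a small illustration rather than part of the script; it forces the C locale with LC_ALL=C so that the comparison is byte-based, which is an assumption made here to keep the result predictable (in other locales a plain sort may already ignore case).

```shell
# Sort the same three words with and without case folding.
# LC_ALL=C forces byte-order comparison for this demonstration.
plain=$(printf 'long\nPrincipalities\nare\n' | LC_ALL=C sort)
folded=$(printf 'long\nPrincipalities\nare\n' | LC_ALL=C sort -f)

printf 'without -f:\n%s\n\nwith -f:\n%s\n' "$plain" "$folded"
```

In the C locale the plain sort places Principalities first, because uppercase letters precede lowercase ones; with -f the order is are, long, Principalities, which is the order seen in the listing above.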

The second AWK program splits each line at the tab and puts the two pieces back in order. It also truncates the two pieces so that they each contain at most 30 characters.
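Since the amount of context is a matter of choice, it can be convenient to pass it in rather than fix it at 30. The following is a sketch of one way to do that, under stated assumptions: the function name kwic and the optional second argument are inventions for illustration, and the width is handed to awk with -v instead of being set in a BEGIN block. Otherwise the pipeline is the same as the one above.

```shell
# kwic FILE [WIDTH]   (hypothetical wrapper; WIDTH defaults to 30)
kwic() {
  # Stage 1: the line itself, then one rotated copy per space.
  awk '{ print $0
         for (i = length($0); i > 0; i--)
           if (substr($0, i, 1) == " ")
             print substr($0, i + 1) "\t" substr($0, 1, i - 1)
       }' "$1" |
  # Stage 2: group copies that begin with the same word, ignoring case.
  sort -f |
  # Stage 3: left context right-aligned in WID columns, a six-space gap,
  # then the keyword and right context cut to WID characters.
  awk -F'\t' -v WID="${2:-30}" '
       { printf("%" WID "s      %s\n",
                substr($2, length($2) - WID + 1),
                substr($1, 1, WID)) }'
}
```

Called as kwic FILE 20, it would produce the same kind of listing with 20 characters of context on each side; with no second argument it behaves like the script above.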


Why Bother With a Concordance?

For many research purposes there is little reason to produce a pre-prepared electronic concordance. A good text editor can search the text interactively. emacs is a particularly good choice for this purpose. emacs provides incremental search, both forward and backward, for fixed strings or for full regular expressions. Here is a screenshot of emacs performing an interactive regular expression search of a Carrier text. The pink highlighted area shows the current match to the regular expression. The white highlighted area two lines above it shows the previous match.


emacs allows the screen to be divided, both horizontally and vertically, into multiple windows, which may show portions of the same file or different files. This allows emacs to be used to display parallel text in two or more languages. Here is a screenshot of emacs displaying the Italian original of Machiavelli's The Prince in the left window and an English translation in the right window.


Since emacs is actually an interpreter for LISP, a full-fledged programming language, which has been provided with some special facilities for editing text, it is not difficult to define new functions for carrying out more specialized searches.

There may be some purposes for which specialized concordance software is useful, but a very large part of the work previously done with concordances is easily done using emacs. One reason to use specialized concordance software is that your corpus contains markup for aligning parallel text. Software that understands this markup may make it easier to keep parallel portions of the text in view. Athelstan's Paraconc, for example, is designed expressly for parallel text. Another option to consider is emdros, a free database for analyzed or annotated text. emdros runs on Unix systems, including GNU/Linux, and on some versions of Microsoft Windows.

Sources of Electronic Text

It is often necessary to create one's own corpus, e.g. for a language in which there are few publications. For many languages, however, corpora of electronic text already exist, especially for those with a literary tradition in which there has been substantial interest, or on which there has been substantial work in computational linguistics. A major source of text in English (including translations from other languages) is Project Gutenberg, whose goal is to make famous and important texts available in digital form. The Project Gutenberg collection now includes over 6,000 books. Abundant on-line text is available in quite a few languages. See, for example, Marjorie Chan's List of Searchable and Archived Classical Chinese Texts (part of her wonderful site Marjorie Chan's ChinaLinks) and the University of Virginia Library Japanese Text Initiative.

For other such resources see the extensive lists at the University of Pennsylvania Library Online Books Page, the Summer Institute of Linguistics web site Linguistic Data Resources on the Internet and Mike Barlow's Corpus Linguistics Page.



Revised 2003/12/02 19:00.