Indices and concordances provide a means of finding words, or occasionally other units, in a text. In traditional printed materials, an index provides the location information indirectly, by means of page numbers, line numbers, section numbers, verse numbers and so forth. The word concordance is sometimes used with the same meaning as index, but it is also used to mean a format that provides the information more directly, for example by printing the lines of text in which a word occurs. A printed index is thus more time-consuming to use than a concordance, but has the virtue of requiring much less space. In electronic form, an index becomes less of a burden because the pointer can be a hyperlink to the appropriate location in the full text, which one can follow virtually instantaneously.
An electronic concordance is superior to a printed concordance in two major ways. First, it is effectively unlimited by considerations of size. It may include virtually any amount of text, and it may be indexed in numerous ways. Second, an electronic concordance may readily be changed, whereas a printed concordance is fixed. There are now few if any reasons to produce a printed concordance other than circumstances in which it is to be used for an extended period of time in a location in which electrical power is not available.
In recent years, printed concordances have been replaced by electronic concordances, some intended for local use, some with web interfaces. The University of Dundee Web Concordances are typical examples of web-searchable concordances of literary texts.
If you work on well-known literary texts, there is a reasonable chance that someone has already created a concordance for the text that you are interested in. Otherwise, you will need to create your own concordance. You can do this using commercial software or by rolling your own.
There are several commercial programs for creating and viewing concordances. These include the following:
For additional information, see Hans Klarkov Mortensen's concordance page [in Danish].
It is easy to create indices and concordances using Unix tools. Here is a shell script that creates an index to a text in which each word is followed by a list of the line numbers on which it occurs.
# Keep only letters, blanks and newlines, then collect line numbers per word.
tr -dc 'A-Za-z[:blank:]\012' < "$1" |
gawk '
    { for (i = 1; i <= NF; i++) words[$i] = words[$i] sprintf(", %d", NR) }
    END {
        for (w in words) {
            lines = words[w]
            sub(/^, /, "", lines)    # drop the leading separator
            printf("%s\t%s\n", w, lines)
        }
    }' |
sort -f -k 1 > "$1.index"
This script makes use of the utilities tr and sort along with a short program in awk. Typical output looks like this:
ally	2313
almost	440, 1051, 1117, 2247
alone	710, 927, 1084, 1394, 1395, 1850, 1950, 2221, 2390, 2594
along	504, 1389
Click here to download a version of this script with detailed annotation.
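The index gives only line numbers. A concordance in the more direct sense described at the start of this page prints the lines themselves. The following is a minimal sketch of that idea, not part of the original script: the function name concord is hypothetical, and it assumes a POSIX awk and a text in which words are separated by blanks.

```shell
# concord WORD FILE: print "line-number<TAB>line" for every line of FILE
# that contains WORD as a whole word, ignoring case.
# (concord is a hypothetical name chosen for this sketch.)
concord() {
    awk -v word="$1" '
        BEGIN { word = tolower(word) }
        {
            # Compare each field, lower-cased and stripped of
            # non-letters, with the target word.
            for (i = 1; i <= NF; i++) {
                w = tolower($i)
                gsub(/[^a-z]/, "", w)
                if (w == word) {
                    printf("%d\t%s\n", NR, $0)
                    break    # report each line only once
                }
            }
        }' "$2"
}
```

For example, concord family prince.txt (prince.txt being a hypothetical file name) would print each line of the file that contains the word family, preceded by its line number. Like the index script, this sketch handles only unaccented Latin letters.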
A type of concordance with a long history on Unix systems, owing to its use in indexing computer manuals, is the keyword-in-context, or KWIC, index. A KWIC index looks like this:
                Principalities are either hereditary, in whic
     long established; or they are new.
itary, in which the family has been
            Principalities are either hereditary, in which th
                          long established; or they are new.
ither hereditary, in which the family has been
ereditary, in which the family has been
     Principalities are either hereditary, in which the famil
alities are either hereditary, in which the family has been
                               long established; or they are
 long established; or they are new.
             long established; or they are new.
                               Principalities are either here
re either hereditary, in which the family has been
          long established; or they are new.
ties are either hereditary, in which the family has been
Each word in the input generates one line. It is preceded and followed by a chosen amount of context, in this case, 30 characters. The word indexed immediately follows the tab that splits the line into two parts. Here is the text from which this was generated:
Principalities are either hereditary, in which the family has been
long established; or they are new.
Here is a shell script that generates a KWIC index, using two short awk programs along with sort.
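# Stage 1: print each line, then one rotated copy per word boundary.
# Stage 2 (after the sort): put the two parts back in order, WID characters a side.
awk '{
    print $0
    for (i = length($0); i > 0; i--)
        if (substr($0, i, 1) == " ")
            print substr($0, i+1) "\t" substr($0, 1, i-1)
}' "$1" |
sort -f |
awk 'BEGIN { FS = "\t"; WID = 30 }
{
    printf("%" WID "s %s\n", substr($2, length($2)-WID+1), substr($1, 1, WID))
}'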
The first AWK program generates as many copies of each line as there are words in it. The first copy is the original line. Each further copy splits the line at a different word boundary, working backwards from the end of the line, and puts the second part of the line before the first, separated by a tab. This has the effect of rotating the words so that every word of the line begins one of the copies. Here is the output of the first stage:
Principalities are either hereditary, in which the family has been
been	Principalities are either hereditary, in which the family has
has been	Principalities are either hereditary, in which the family
family has been	Principalities are either hereditary, in which the
the family has been	Principalities are either hereditary, in which
which the family has been	Principalities are either hereditary, in
in which the family has been	Principalities are either hereditary,
hereditary, in which the family has been	Principalities are either
either hereditary, in which the family has been	Principalities are
are either hereditary, in which the family has been	Principalities
long established; or they are new.
new.	long established; or they are
are new.	long established; or they
they are new.	long established; or
or they are new.	long established;
established; or they are new.	long
The call to sort sorts the lines so that lines containing the same word are grouped together:
are either hereditary, in which the family has been	Principalities
are new.	long established; or they
been	Principalities are either hereditary, in which the family has
either hereditary, in which the family has been	Principalities are
established; or they are new.	long
family has been	Principalities are either hereditary, in which the
has been	Principalities are either hereditary, in which the family
hereditary, in which the family has been	Principalities are either
in which the family has been	Principalities are either hereditary,
long established; or they are new.
new.	long established; or they are
or they are new.	long established;
Principalities are either hereditary, in which the family has been
the family has been	Principalities are either hereditary, in which
they are new.	long established; or
which the family has been	Principalities are either hereditary, in
The second AWK program splits each line at the tab and puts the two pieces back in order. It also truncates the two pieces so that they each contain at most 30 characters.
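The amount of context is fixed by the WID variable in the second AWK program. As a sketch of one way to make it adjustable, the two stages can be wrapped in a shell function that takes the width as its first argument. The function name kwic and its argument order are assumptions made for this illustration; the awk code itself is unchanged apart from receiving WID through the -v option.

```shell
# kwic WIDTH [FILE...]: print a KWIC index with WIDTH characters of
# context on each side. Reads standard input if no FILE is given.
# (kwic is a hypothetical wrapper, not part of the original script.)
kwic() {
    wid=$1
    shift
    awk '{
        # Emit the line itself, then one rotated copy per word boundary.
        print $0
        for (i = length($0); i > 0; i--)
            if (substr($0, i, 1) == " ")
                print substr($0, i+1) "\t" substr($0, 1, i-1)
    }' "$@" |
    sort -f |
    awk -v WID="$wid" 'BEGIN { FS = "\t" }
    {
        # Left context is right-aligned in WID columns; the keyword follows.
        printf("%" WID "s %s\n", substr($2, length($2)-WID+1), substr($1, 1, WID))
    }'
}
```

Calling kwic 30 with the name of a text file as its second argument should reproduce the output of the script above, while a smaller width gives a narrower display.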
For many research purposes there is little reason to produce a pre-prepared electronic concordance. A good text editor can search the text interactively. emacs is a particularly good choice for this purpose. emacs provides incremental search, both forwards and backwards, for fixed strings or for full regular expressions.
Here is a screenshot of emacs performing an interactive regular expression search of a Carrier text. The pink highlighted area shows the current match to the regular expression. The white highlighted area two lines above it shows the previous match.
emacs allows the screen to be divided, both horizontally and vertically, into multiple windows, which may show portions of the same file or different files. This allows emacs to be used to display parallel text in two or more languages.
Here is a screenshot of emacs displaying the Italian original of Machiavelli's The Prince in the left window and an English translation in the right window.
Since emacs is actually an interpreter for LISP, a full-fledged programming language, extended with some special facilities for editing text, it is not difficult to define new functions for carrying out more specialized searches.
There may be some purposes for which specialized concordance software is useful, but a very large part of the work previously done with concordances is easily done using emacs. One reason for using specialized concordance software is if your corpus contains markup for aligning parallel text. Software that understands this markup may make it easier to keep parallel portions of the text in view. Athelstan's Paraconc, for example, is designed expressly for parallel text. Another option to consider is emdros, a free database for analyzed or annotated text. emdros runs on Unix systems, including GNU/Linux, and on some versions of Microsoft Windows.
It is often necessary to create one's own corpus, e.g. for a language in which there are few publications. For many languages, however, corpora of electronic text already exist, especially for those with a literary tradition that has attracted substantial interest or those on which there has been substantial work in computational linguistics. A major source of text in English (including translations from other languages) is Project Gutenberg, whose goal is to make famous and important texts available in digital form. The Project Gutenberg collection now includes over 6,000 books. Abundant on-line text is available in quite a few languages. See, for example, Marjorie Chan's List of Searchable and Archived Classical Chinese Texts (part of her wonderful site Marjorie Chan's ChinaLinks) and the University of Virginia Library Japanese Text Initiative.
For other such resources see the extensive lists at the University of Pennsylvania Library Online Books Page, the Summer Institute of Linguistics web site Linguistic Data Resources on the Internet and Mike Barlow's Corpus Linguistics Page.