Computational Resources for Linguistic Research

Introduction

This page lists computational tools for doing linguistics. There is of course some overlap with computational linguistics, but the emphasis is on using computation to do what ordinary linguists want to do, not on computational linguistics for its own sake.

The page emphasizes free software that runs on Unix systems. The emphasis is on Unix for several reasons. First, that's what I myself use. Second, in my opinion Unix is the environment of choice for this kind of work. The Unix philosophy of making it easy to connect one small tool to another is just right for linguistic research. Third, Unix is strongly represented in the free software world.

By free software I mean software that you can use as you wish to, modify, and redistribute. Software that is free in this sense is often also available at no cost, but that isn't the criterion. For a discussion of the distinction, click here.

The page emphasizes free software for two reasons. One is financial: linguists tend not to be well funded and so can't afford to buy expensive commercial software. Furthermore, since linguistics isn't a large or lucrative market, not much software is aimed specifically at linguistics. Linguists therefore often have to make use of tools intended for other purposes. If you buy a piece of commercial software, you may well find that it doesn't do what you want. If the software is free in the sense of "free beer", you haven't lost anything but a little time by trying it out, but if you've bought a commercial product, you're probably out of luck.

Equally if not more important is the other sense of freedom, namely freedom as in "free speech". Software that you can freely modify and redistribute is much more flexible. If it doesn't do exactly what you need, you can modify it so that it does, and you can make it available to other people.

With occasional exceptions the software listed runs on GNU/Linux systems. I'm most likely to know about such software because GNU/Linux is what I use most of the time. Software that runs on GNU/Linux systems will usually run on other Unix variants, such as FreeBSD, OpenBSD, NetBSD, SunOS/Solaris, HP-UX, IRIX, AIX, and Mac OS X. Many of the programs listed will also run natively under Microsoft Windows. Those that will not can in many cases be run on MS Windows machines by using Cygwin, which provides a Unix-like environment. A few programs are listed that run only on non-Unix systems. These are generally either programs of particular interest or programs that provide an alternative for non-Unix users to something else available for Unix.

In a few cases I list software that is not free. In such cases either the software is unusual and, to my knowledge, has no comparable free analogue, or it is widely used.

For what they are worth, here are some recommendations for relevant books.

Character Encoding
Fonts, Rendering, and Printing
Input Methods and Keyboard Layout
Extracting Text from Impure Formats
Regular Expressions and Other Pattern Matching
Unix Tools
Syntax
Text Corpus Databases and Searching
Obtaining Data From Web Sites
Sources of Electronic Text
Lexicography and Dictionaries
Concordances and File Comparison
Historical Linguistics
Sociolinguistics
Phonetics
Math and Statistics
Semantics
Other Software
Programming Languages
Structured Markup Languages
Miscellaneous

Character Encoding

Unicode

Information

Unicode Organization: The organization responsible for the Unicode standard. The web site contains all sorts of information about the standard, including code charts, and information about on-going activity and future plans.
UTF-8 Standard: RFC 3629 is the current definition of the UTF-8 encoding format for Unicode. RFC 3629 replaced RFC 2279, which in turn replaced RFC 2044.
Unicode Character Ranges: A list of the types of characters currently included in Unicode and the ranges of codepoints that they occupy.
Unicode Chart: A Unicode chart in the form of a set of web pages. Each page contains a 256 character block. Whether or not the characters will display depends on whether your browser has access to the necessary font. You can download the Python script used to generate the chart here.
A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX: A simple explanation of how to use Unicode on Unix/Linux systems.
UTF-8 and Unicode FAQ for Unix/Linux: Detailed information on using Unicode in Unix/Linux systems.
Alan Wood's Unicode Resources: A variety of information and links.
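
To make the UTF-8 format defined in RFC 3629 (listed above) a little more concrete, here is a minimal Python sketch that encodes a single codepoint by hand. The function name utf8_encode is invented for this illustration; in real work one would simply call Python's built-in str.encode("utf-8").

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 (illustrative helper, not a library API)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint outside the Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate codepoints are not valid in UTF-8")
    if codepoint < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

Comparing the result with chr(codepoint).encode("utf-8") is an easy way to check the sketch against Python's own encoder.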

Editors and Word Processors

TextEditors.org is a wiki devoted to text editors of all types. It contains information about over 700 editors. One section is devoted to Unicode Editors.

Babelpad: Unicode editor for Microsoft Windows.
Geresh: A Unicode editor oriented especially toward languages written right-to-left, particularly Hebrew and Arabic. Note that the web site and manual are in Hebrew but that the editor menus are in English, so people who do not read Hebrew should be able to use it. Here is an RPM and here is a compressed tar archive.
Katoob: A multilingual bidi editor, capable of reading and writing Unicode, designed especially for Arabic. It has keyboards for both Arabic and Hebrew.
Mined: A Unicode text editor. Display and edit Unicode files, and enter text in Unicode. Mined displays directly in an xterm window. Its basic command set resembles that of Wordstar, but it can also be configured to use Emacs-like commands. The quality of the rendering is not as high as in Yudit, but it is much easier to use as a general purpose editor. It also has a "char info" entry on the "Xtra" menu that provides detailed information about the current character, including, if desired, the readings for CJK characters.
OpenOffice.org Writer: OpenOffice.org Writer is a FLOSS word processor that runs on a variety of operating systems, including GNU/Linux, Mac OS X, FreeBSD, Solaris, Irix, and Microsoft Windows. It is capable of reading and writing Unicode.
Simredo: A Java Unicode editor.
Vim: A clone of the vi text editor that supports Unicode. See this page for instructions on using vim with Unicode.
Xetex: A version of TeX that works with Unicode.
XMLMind: A Docbook editor that allows Unicode.
Yudit: A Unicode text editor. Display and edit Unicode files, and enter text in Unicode. Numerous keymaps are supplied, but you can roll your own. Here is a screenshot of Yudit in action.

Other Software

Ascii2binary

A program that reads textual representations of numbers and converts them to binary format. It provides a simple way to generate text in an encoding for which you do not have a converter or input method as well as a way of being sure that you know exactly what is in the file. The associated program Binary2ascii provides conversions in the opposite direction.
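
The idea behind this kind of conversion can be sketched in a few lines of Python. This is not how Ascii2binary itself is implemented; the function names and the default 16-bit little-endian format are assumptions chosen for the example.

```python
import struct

def text_to_binary(text: str, fmt: str = "<H") -> bytes:
    """Pack whitespace-separated numbers into binary.

    fmt is a struct format for one value; "<H" means unsigned 16-bit,
    little-endian. int(token, 0) accepts decimal and 0x-prefixed hex.
    """
    return b"".join(struct.pack(fmt, int(token, 0)) for token in text.split())

def binary_to_text(data: bytes, fmt: str = "<H") -> str:
    """Inverse direction: unpack binary values and print them as hex numbers."""
    return " ".join(f"0x{value:04X}" for (value,) in struct.iter_unpack(fmt, data))
```

For instance, text_to_binary("0x0041 0x00E9") yields four bytes that decode as UTF-16LE to the two characters "Aé", which is one way to produce text in an encoding for which no input method is at hand.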

BabelMap

A Unicode character map for MS Windows. Displays selected portions of the Unicode character set and provides information about the characters. It also allows selected characters to be copied to the clipboard, making it useful for Unicode input into other programs.

Gucharmap

Displays selected portions of the Unicode character set and provides information about the characters. It also allows selected characters to be copied to the clipboard, making it useful for Unicode input into other programs. In my experience, the best Unicode character map for GNU/Linux. The current version can be downloaded from: http://ftp.gnome.org/pub/GNOME/sources/gucharmap/

Heirloom Toolchest

This is a set of classic Unix tools based on the code released by Caldera. As a result, they do not have some of the more recent extensions. However, the maintainer has updated them to handle Unicode, which many more recent tools, including some of the GNU tools, do not.

International Components for Unicode

A very extensive library for dealing with Unicode, with APIs for C/C++ and Java. The main problem with it is that it is so comprehensive that finding your way can be daunting.

kcharselect

A font map that displays the characters in the selected font. You can click on a character to copy it to a buffer from which it may be entered into a document, but it is also useful just for finding out exactly which characters a font provides. Moving the mouse pointer over a character and leaving it for a short time produces a tooltip containing the character's codepoint. This is identified as a Unicode codepoint but is actually just the offset into the font, meaning that it is only the Unicode codepoint if the font is Unicode-encoded. This program is part of the kdeutils package, which in turn is part of the KDE desktop environment, so it is not readily available separately. To get it, you need either to install KDE if you do not already have it, or to obtain the Debian kdeutils package.

libucd

A C library interface to the Unicode Character Database, which contains the properties of all Unicode characters.

libuninum

A library for converting between Unicode strings representing numbers in a wide variety of number systems and internal machine representations. The library has interfaces for C and Tcl. A command-line program numconv and a graphical user interface NumberConverter are also provided.
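
For the simplest case, positional decimal digits in various scripts, Python's standard library already does this kind of conversion. The sketch below is only an illustration of the idea and covers far less than libuninum; the helper names are made up for the example.

```python
import unicodedata

DEVANAGARI_ZERO = 0x0966  # codepoint of DEVANAGARI DIGIT ZERO

def from_decimal_digits(text: str) -> int:
    """Parse a string of Unicode decimal digits (general category Nd)."""
    return int(text)  # int() accepts decimal digits from any script

def to_devanagari(n: int) -> str:
    """Render a non-negative integer using Devanagari digits."""
    if n < 0:
        raise ValueError("this sketch handles only non-negative integers")
    return "".join(chr(DEVANAGARI_ZERO + int(d)) for d in str(n))

def char_value(ch: str) -> float:
    """Numeric value of a single character, e.g. a Roman numeral symbol."""
    return unicodedata.numeric(ch)
```

Non-positional systems such as Chinese or Roman numerals written with ordinary letters need real parsing logic, which is where a dedicated library earns its keep.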

Open Sesame

A graphical experiment-builder for psychological experiments.

Unicode Checker

A Mac OS X utility that can perform Unicode normalization and a variety of conversions, browse the character set, search by name or codepoint, display information about CJK characters, etc.

Unicode Data Browser

A browser for the UnicodeData.txt file, which contains much useful information but is not easily read by humans. The browser creates a scrollable table in which columns represent properties. The table may be sorted on any column. Abbreviations are expanded and characters cross-referenced in decomposition and casing fields are named. Regular expression search restricted to a selected column is available. The set of characters for which information is displayed may be restricted to those characters matching a regular expression on a specified property. Each such filtering operation applies to the output of the previous filtering operation unless the table is reset to the original full set of characters, so filtering on multiple properties is possible. If an up-to-date local copy of UnicodeData.txt is not available, it can be downloaded automatically from Unicode.org.
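
The successive filtering described above can be sketched in Python. The four sample lines follow the semicolon-separated layout of UnicodeData.txt and are included only as test data; the field labels and function names are choices made for this example, not part of the browser.

```python
import re

# Short labels for the 15 semicolon-separated fields of UnicodeData.txt.
FIELDS = ["code", "name", "category", "combining", "bidi", "decomposition",
          "decimal", "digit", "numeric", "mirrored", "old_name", "comment",
          "upper", "lower", "title"]

SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;
"""

def parse(text):
    """Turn UnicodeData.txt-style lines into a list of dictionaries."""
    return [dict(zip(FIELDS, line.split(";")))
            for line in text.splitlines() if line]

def restrict(rows, prop, pattern):
    """Keep only rows whose property `prop` matches the regular expression."""
    rx = re.compile(pattern)
    return [row for row in rows if rx.search(row[prop])]

rows = parse(SAMPLE)
letters = restrict(rows, "category", r"^L")       # first filter: letters
latin_letters = restrict(letters, "name", "LATIN")  # second filter on the result
```

Because each call to restrict works on the output of the previous one, filtering on several properties is just a chain of calls; starting again from rows corresponds to resetting the table.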

Unicode Normalization from the Command Line

How to use the facilities of common scripting languages to normalize Unicode.
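
As one example of what such a scripting-language facility looks like, Python's standard unicodedata module provides the four normalization forms. The wrapper name normalize_text is an assumption made for this sketch.

```python
import unicodedata

def normalize_text(text: str, form: str = "NFC") -> str:
    """Normalize text to one of the forms NFC, NFD, NFKC, or NFKD."""
    if form not in ("NFC", "NFD", "NFKC", "NFKD"):
        raise ValueError(f"unknown normalization form: {form}")
    return unicodedata.normalize(form, text)

decomposed = "e\u0301"                      # 'e' + COMBINING ACUTE ACCENT
composed = normalize_text(decomposed, "NFC")  # single precomposed character
```

Normalizing both sides to the same form before comparing or searching avoids the situation in which two strings look identical on screen but fail to match.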

Unicode Utilities

A set of programs that provide various kinds of information about the contents of Unicode files and that manipulate Unicode files:

unidesc - identifies the script ranges in a file
unifuzz - generates output for testing other software
unihist - generates a histogram of its input
uniname - identifies each character in a file
unireverse - reverses its input character-by-character
ExplicateUTF8 - determines and explains the validity of a series of bytes as UTF-8
UTF8Lookup - converts a codepoint to a Unicode character name
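
To give a feel for what tools of this kind report, here is a rough Python approximation of three of them using only the standard library. The functions borrow the program names for readability but are not the actual implementations, and their output format is invented.

```python
import unicodedata
from collections import Counter

def uniname(text):
    """Yield (offset, codepoint, character name) for each character."""
    for offset, ch in enumerate(text):
        yield offset, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")

def unihist(text):
    """Count how often each character occurs."""
    return Counter(text)

def unireverse(text):
    """Reverse codepoint by codepoint (combining marks end up misplaced)."""
    return text[::-1]
```

Calling list(uniname(text)) on a puzzling string is often the quickest way to discover that it contains, say, a combining accent or a lookalike letter from another script.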

Uni2Ascii

A pair of programs that convert between Unicode and ASCII representations of Unicode such as HTML numeric character references (e.g. &#x00E9; for é) and URL format UTF-8 (e.g. %C3%A9 for é).
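
The two ASCII representations mentioned above are easy to produce with Python's standard library. This is a sketch of the formats only, not of Uni2Ascii itself, and the function names are invented for the example.

```python
import html
from urllib.parse import quote, unquote

def to_html_ncr(text: str) -> str:
    """Replace each non-ASCII character by a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text)

def from_html_ncr(text: str) -> str:
    """Decode character references (html.unescape also decodes named entities)."""
    return html.unescape(text)

def to_url_utf8(text: str) -> str:
    """Percent-encode the UTF-8 bytes of the text."""
    return quote(text)

def from_url_utf8(text: str) -> str:
    """Decode percent-encoded UTF-8."""
    return unquote(text)
```

Both conversions are lossless, so text can be passed through an ASCII-only channel and restored at the other end.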

Fonts

Alphabetum: A Truetype font focussed on ancient languages, including: classical and mediaeval Latin, ancient Greek, Old Italic - Etruscan, Oscan, Umbrian, Faliscan, Messapic, Picene - Gothic, Iberian, Celtiberian, old and middle English, Hebrew, Sanskrit, Runic, Ogham, Ugaritic, Old Persian cuneiform, Coptic, Kharosthi, Phoenician, Linear B, Cypriot, Ancient Greek musical notation, Ancient Greek acrophonic numerals, New Testament editorial symbols, Ancient Greek papyrological numbers, Aegean numbers and old and mediaeval Nordic. A demonstration version with gaps here and there is available at no cost. An unrestricted individual license costs 30 euros.
Code2000, Code2001, and Code2002: If you work extensively with a particular writing system, you will likely have favorite fonts. However, it is useful to have fonts that cover the entire Unicode character set. James Kass has kindly made available shareware fonts with which you can do this. His Code2000 font covers the first Unicode plane. His Code2001 font covers the second Unicode plane. Code2002 is the beginning of a font with coverage of the third plane. He asks US$5 for personal use.
Clearlyu: A free font developed by Mark Leisher at the University of New Mexico. It is a 12 point BDF (bitmap) font, so it is useful for its coverage but does not scale well. It covers the following Unicode ranges: Basic Latin; Latin-1 Supplement; Latin Extended-B; IPA Extensions; Spacing Modifier Letters; Combining Diacritical Marks; Greek; Cyrillic; Armenian; Hebrew; Thaana; Devanagari; Thai; Lao; Georgian; Ethiopic; Cherokee; Unified Canadian Aboriginal Syllabics; Ogham; Runic; Letterlike Symbols; Number Forms; Arrows; Control Pictures; Geometric Shapes; Braille Patterns.
GF Zemen Unicode: A Unicode-encoded Truetype font of broad coverage.
SIL Doulos IPA Font: This is a Unicode-encoded font that includes all of the International Phonetic Alphabet together with a variety of other Roman and Cyrillic letters.
Titus Cyberbit: A Unicode-encoded Truetype font of broad coverage.
Re-encoding a Font to Unicode: There are many fonts in existence that were created with encodings other than Unicode. This illustrated tutorial describes how to re-encode an existing TrueType font to Unicode.

Other Encodings

ASCII Code: The standard for English except on IBM mainframes, generally used for programming. This link points to charts in binary, octal, decimal and hexadecimal, all color-coded for the basic POSIX character classes, with explanations of the derived character classes and the control characters.
ascii: A handy command-line program that lists all of the synonyms of an ASCII character given a name, abbreviation, or codepoint in any of several formats.
ByteName: For each byte of input prints the byte offset, the value of the byte in hex, octal, and binary, and a description of the byte in any of several dozen single-byte encodings. It can also generate a chart for a selected encoding, or give the interpretation of a given codepoint in all of the known encodings.
Cyrillic: Various encodings for Russian and other variants of the Cyrillic alphabet
Encoding Database: A database containing information on several dozen single-byte encodings, including the various ISO-8859 encodings and Microsoft Code Pages.
EUC-KR: The usual encoding for Korean.
ISCII (Indian Script Code for Information Interchange) Standard: The official Indian government encoding.
ISO-8859 (Latin-1 etc.): 8-bit encodings for European languages: extended Roman alphabet and Cyrillic
Mark Leisher's Csets: Mappings between various character sets often not covered by standard conversion tools and Unicode.
Microsoft Windows Codepage 1250: The 8-bit extension of ASCII commonly used by Microsoft Word.
Unicode Consortium Cross-Mapping Tables: The Unicode Consortium provides a large set of tables showing the relationship between other encodings and Unicode. These include all of the ISO-8859 encodings and Microsoft Codepages.

Encoding Converters

General Purpose

enca: Handles fewer encodings than most of the others, but is useful because it also functions as a detector.
iconv: The original GNU encoding conversion tool. It is a command-line tool based on libiconv.
recode: A successor to iconv but with a somewhat peculiar command-line syntax.
siconv: This is a stream-oriented counterpart to iconv, using libiconv, the same library that underlies iconv. It can handle larger amounts of data than iconv.
uniconv: This is the encoding conversion tool associated with the Yudit text editor. It can also convert from ASCII sequences to Unicode using the same keymaps used by the editor for input.
utrac: Converts among various single-byte encodings.
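
What these converters do can be imitated for small jobs with Python's built-in codecs. The sketch below assumes the data fits in memory; the function names are made up, and the trial-decoding detector is a deliberately crude stand-in, not the method enca uses.

```python
def convert_encoding(data: bytes, source: str, target: str,
                     errors: str = "strict") -> bytes:
    """Re-encode bytes from one encoding to another via Unicode."""
    return data.decode(source, errors).encode(target, errors)

def guess_encoding(data: bytes, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate encoding in which the data decodes cleanly.

    Order matters: permissive single-byte encodings such as ISO-8859-1
    accept any byte sequence, so they should come last.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

A successful decode shows only that the bytes are legal in that encoding, not that the text is meaningful, which is why real detectors also look at language statistics.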

Specialized

Autoconvert: Converts among Chinese encodings and Unicode.
Cz2cz: Converts among Czech encodings.
Jcode: Converts among Japanese encodings.
Polcnv: Converts among Polish encodings.
TLGU: Converts text in the encoding used by the Thesaurus Linguae Graecae to Unicode.
Xcode: Converts among Russian encodings.

Transliteration

Buckwalter2Unicode: A pair of programs that convert from the Buckwalter transliteration of Arabic to Unicode and back.
Earm2IPA: Transliterates Eastern Armenian from its native writing system to the International Phonetic Alphabet. Both input and output are in UTF-8 Unicode.
OOTranslit: An OpenOffice.org Writer macro that converts between the Roman and Cyrillic writing systems for Serbo-Croatian.
Tgn2IPA: Transliterates Tigrinya from its native writing system to the International Phonetic Alphabet. Both input and output are in UTF-8 Unicode.
Xlit: A general purpose transliteration program. Transliteration definitions may be read from files or defined interactively by entering strings to be transliterated on the left and the strings to which they are to be mapped on the right. Xlit can translate the entire text or restrict the transliteration to the text enclosed within specified delimiters or to text not enclosed in specified delimiters.
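
Table-driven transliteration of the kind these programs perform can be sketched as a longest-match replacement loop. The code below is a generic illustration, not the algorithm of any program listed; the function name and the small Roman-to-Cyrillic table for Serbo-Croatian are assumptions chosen so that the digraphs lj and nj show why longest match matters.

```python
def transliterate(text: str, table: dict) -> str:
    """Replace source strings by their targets, preferring the longest match."""
    keys = sorted((k for k in table if k), key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # no rule applies: copy the character unchanged
            i += 1
    return "".join(out)

# Partial sample table (lowercase only), written with escapes for clarity.
ROMAN_TO_CYRILLIC = {
    "lj": "\u0459", "nj": "\u045a",            # digraphs map to single letters
    "l": "\u043b", "n": "\u043d", "j": "\u0458",
    "a": "\u0430", "d": "\u0434", "g": "\u0433",
    "i": "\u0438", "k": "\u043a", "u": "\u0443",
}
```

With this table, transliterate("ljudi", ROMAN_TO_CYRILLIC) produces the four-letter Cyrillic word rather than a five-letter one, because "lj" is tried before "l".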

Terminal Emulators

mlterm: A terminal emulator for the X11 window system that supports UTF-8 Unicode, including bidirectional text, as well as a variety of parochial encodings, including: ISO-8859-[1-11], ISO-8859-[13-16], TIS-620 (same as ISO-8859-11), KOI8-R, KOI8-U, KOI8-T, GEORGIAN-PS, TCVN5712, VISCII, CP1251, CP1255, EUC-JP, EUC-JISX0213, Shift_JIS, Shift_JISX0213, ISO-2022-JP[1-3], EUC-KR, UHC, JOHAB, ISO-2022-KR, GB2312 (EUC-CN), GBK, GB18030, ISO-2022-CN, HZ, EUC-TW, BIG5, and BIG5HKSCS.

Fonts, Rendering, and Printing

DoubleType - a TrueType font design program
FontForge (previously called pfaedit) - a versatile font editor with many capabilities
Fontstruct - a simple web-based font editor
Gfontview - Outline Font Viewer
Installing TrueType Fonts in X11
Pango - A framework for the layout and rendering of Unicode text
SIL Graphite - cross-platform rendering for non-Roman scripts
TypeTuner - a program for modifying properties of SIL fonts
Worldprint - a filter for Mozilla output that enables printing in a large variety of writing systems

Input Methods and Keyboard Layout

CellWriter - A grid-entry natural handwriting input panel.

MSKLC - Keyboard layout editor for Microsoft Windows
SCIM - Multilingual input method platform
Ukelele - Keyboard layout editor for Mac OS X
X11 Tutorial - Keyboard layout editing tutorial for X11 on Ubuntu and other *nix systems
IPA Keyboard

Extracting Text from Impure Formats

Text often comes in formats that we cannot directly make use of. It is necessary first to extract plain text. Here are programs that can extract plain text from various impure formats.

Base 64 Encoding: Base64 is a method of encoding binary data as ASCII text for safe transmission in contexts that are not 8-bit safe, such as email. It is defined in RFC 3548, which has since been superseded by RFC 4648. Base64 is a standalone base64 encoder/decoder.
HTML: HTML2Text converts HTML to plain text. In addition to use for its own sake, conversion of HTML to plain text is often the second phase of conversion from other formats since many converters try to preserve layout by generating HTML. A program designed specifically to deal with the baroque HTML generated by Microsoft Word is Microsoft Word 2002 Unmunger.
Microsoft Word: Wvware is a suite of programs that convert Microsoft Word files to a variety of other formats, including plain text. Antiword is a free MS Word reader for Linux and RISC OS, with ports to FreeBSD, BeOS, OS/2, Mac OS X, Amiga, VMS, NetWare, Plan9, EPOC, Zaurus PDA, MorphOS and DOS. Antiword can convert files from Word 2, 6, 7, 97, 2000, 2002 and 2003 to plain text.
Open Document Format (ODF): odt2txt is a simple command-line tool that extracts plain text from ODF. OpenOffice.org Writer can read ODF and can export in a variety of formats including plain text.
Portable Document Format (PDF): PDFtotext extracts plain text from PDF files. It is part of the xpdf package, which also provides a PDF file viewer and some other tools. Commercial software for Microsoft Windows is available at http://www.pdf2text.com/. If you are having problems dealing with a PDF file and need to explore its internal structure, the PoDoFoBrowser PDF object browser may come in handy. Formswift PDF Editor is a PDF editor. Formswift PDF Converter converts PDF to Word and other formats. Able2Extract converts PDF to such other formats as Word, Powerpoint, Publisher, and AutoCad. It can also convert a scanned PDF to Excel. Runs on Windows, Mac and Linux.
Postscript: PSToText extracts plain text (in the ISO-8859-1 extended ASCII encoding) from Postscript files.
Rich Text Format (RTF): Docfrac converts from RTF to plain text. It runs on both Microsoft Windows and Unix platforms. Rtf-converter converts RTF to HTML. It runs on Linux and Microsoft Windows NT systems and therefore probably on other platforms as well. Rtfeeder also converts RTF to HTML. Since it is a Perl script, it should run on any platform. On the Macintosh, if you have access to the Apple Developer Tools, they contain the program convertRichTextToAscii, which actually converts RTF to Unicode. If there are no non-ASCII characters in the original, the output will be ASCII. This program is located at /Developer/Applications/Utilities/FileMerge.app/Contents/Resources/convertRichTextToAscii if you have the Developer Tools installed. Xue Brothers offers an RTF to text converter that runs in a Microsoft Windows shell for a modest price (currently $5.50). If you can do a little Java programming yourself, the components necessary are available. Information can be had here.
TeX: TTH translates TeX and LaTeX into HTML, from which it can be converted to plain text by an HTML converter.
Troff/Nroff/Groff: Unroff converts from Troff format to various other formats including HTML. It is a programmable translator for troff and so can be made to generate whatever output you like.
WordPerfect: Wp2x converts WordPerfect files to plain text and various other formats.
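
Of the formats in the list above, Base64 is simple enough to handle directly from a scripting language. A minimal Python sketch using the standard base64 module follows; the wrapper names are invented for the example.

```python
import base64

def encode_base64(data: bytes) -> str:
    """Encode binary data as Base64 ASCII text."""
    return base64.b64encode(data).decode("ascii")

def decode_base64(text: str) -> bytes:
    """Recover the original bytes from Base64 text."""
    return base64.b64decode(text)

def base64_to_text(text: str, encoding: str = "utf-8") -> str:
    """Decode Base64 and interpret the bytes in the given character encoding."""
    return decode_base64(text).decode(encoding)
```

Note that Base64 yields bytes, not characters: to get plain text you still have to know the character encoding of the decoded data, for which UTF-8 is assumed here.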

Unoconv converts among a wide variety of document formats, including: BibTeX [.bib], Microsoft Word 97/2000/XP [.doc], Microsoft Word 6.0 [.doc], Microsoft Word 95 [.doc], DocBook [.xml], HTML Document (OpenOffice.org Writer) [.html], Open Document Text [.odt], Open Document Text [.ott], Microsoft Office Open XML [.xml], AportisDoc (Palm) [.pdb], Portable Document Format [.pdf], Pocket Word [.psw], Rich Text Format [.rtf], LaTeX 2e [.ltx], StarWriter 5.0 [.sdw], StarWriter 4.0 [.sdw], StarWriter 3.0 [.sdw], OpenOffice.org 1.0 Text Document Template [.stw], OpenOffice.org 1.0 Text Document [.sxw], Text Encoded [.txt], Plain Text [.txt], StarWriter 5.0 Template [.vor], StarWriter 4.0 Template [.vor], StarWriter 3.0 Template [.vor], and XHTML Document [.html]. It also handles a variety of graphics formats, presentation formats and spreadsheet formats.

The Multivalent Document Tools can extract text from a number of formats, including PDF and HTML.
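
For simple HTML, text extraction of the sort discussed in this section can also be done with Python's standard html.parser module. The following is a minimal sketch, not a substitute for the tools above: the class and function names are invented, the list of block-level tags is a partial assumption, and layout is not preserved.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}

class TextExtractor(HTMLParser):
    """Collect character data, ignoring markup, scripts, and stylesheets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &eacute; and the like
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")   # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(markup: str) -> str:
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Inline tags such as b or i contribute no whitespace, so a word that is partly in bold comes out as a single word.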

Regular Expressions and Other Pattern Matching

Tools and Libraries

AT&T Finite State Morphology Library and Lextools: Tools for building, combining, optimizing, and searching weighted finite-state acceptors and transducers.
agrep: An approximate regular expression matcher. This is the older of two approximate regular expression matchers, sometimes referred to as Wu-Manber agrep after its original authors. The source code for the Unix version is available here. Another agrep is provided as part of the TRE regular expression package.
Bison: A parser generator. The input consists of a context-free grammar in a notation similar to BNF together with associated code. This is the GNU implementation of the classic Unix YACC. It is designed to work well with Flex but may be used separately. PyBison is a Python interface to Bison.
CL-PCRE library: A Perl-compatible regular expression library for Common Lisp.
CHSM: A code generator in the tradition of yacc and bison that generates Concurrent Hierarchical State Machines. The machines are described in a statechart specification language and annotated with code in either C++ or Java. The generated code is fully object oriented, allowing multiple machines to exist concurrently. The CHSM run-time library is small, efficient, and thread-safe.
Daciuk's Finite State Automaton Utilities: A variety of tools for working with finite state automata and transducers.
Dia2fsm: A tool that takes as input a diagram of a finite state machine in dia format and generates C or C++ code implementing it.
Finite State Automata Utilities: A collection of utilities for manipulating regular expressions, finite-state automata and finite-state transducers. Manipulations include automata construction from regular expressions, determinization, minimization, composition, complementation, intersection, and Kleene closure. Various visualization tools are available for browsing finite-state automata. Interpreters are provided to apply finite automata. Finite automata can also be compiled into stand-alone C programs.
Band2XML: Many lexical databases have been compiled in the format used by Robert Hsu's Lexware programs, known as band format. Those wishing to process such dictionaries using other tools may find it useful to convert them to XML. BAnd2XML.exe performs this conversion.
Flex: A lexical-analyzer generator. This is the GNU implementation of the classic Unix Lex.
Glark: Glark provides regular-expression matching facilities very similar to those of grep and adds several special features. It allows Boolean combinations of search predicates and it allows specifications of how far apart (in lines) the matches to different parts of a Boolean must be. It is possible, for instance, to ask for the set of lines containing both A and B no more than K lines apart. Glark also provides optional color highlighting of matches, allows the user to specify how much context to provide for matches (e.g., "show me the six lines surrounding a match") and allows for considerable control over multi-file searches and what information they produce (e.g. name of matching file only, name and matching lines, etc.).
Grail: A symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects.
Groningen Finite State Automaton Utilities: A collection of utilities to manipulate regular expressions, finite-state automata and finite-state transducers.
grep: GNU grep. For another kind of grep try here
HyperLex: A system for performing feature-based regular expression searches on lexical databases.
Kiki: A front end to the Python re module for testing regular expressions against a sample text that provides extensive output about the results, including highlighting of groups within a match.
Kodos: A tool for creating, testing and debugging regular expressions for the Python programming language.
Kregexpeditor: A graphical tool for constructing regular expressions in a fashion somewhat like a diagram editor. Generates regular expressions in the syntax of either the Qt windowing toolkit or emacs. This is part of the KDE package and so does not have its own website for downloading.
Levenshtein: A Python library for computing various measures of string similarity (Levenshtein, Hamming, Jaro, Jaro-Winkler) and related functions, such as applying edits.
Match: A library callable from C, C++, and Ada that provides a pattern matcher inspired by that of SNOBOL4.
monq.jfa: A Java class library for finite state automata. Unlike the standard java.util.regex, which provides only recognizers and substitution, it allows actions to be bound to regular expressions so that the action is performed whenever the regular expression is matched.
Nooj: NooJ is both a corpus processing tool and a linguistic development environment: it allows linguists to formalize several levels of linguistic phenomena: orthography and spelling, lexicons for simple words, multiword units and frozen expressions, inflectional, derivational and productive morphology, local, structural syntax and transformational syntax. For each of these levels, NooJ provides linguists with one or more formal tools specifically designed to facilitate the description of each phenomenon, as well as parsing tools designed to be as computationally efficient as possible. This approach distinguishes NooJ from most computational linguistic tools, which provide a single formalism that should describe everything. As a corpus processing tool, NooJ allows users to apply sophisticated linguistic queries to large corpora in order to build indices and concordances, annotate texts automatically, perform statistical analyses, etc.
PCRE library: Perl compatible regular expression library.
Pmatch: A regular expression matching tool similar to grep but based on the PCRE library and with highlighting of matches and display of surrounding lines.
PC-KIMMO: Implementation of Kimmo Koskenniemi's Two-Level Morphology.
QFSM: A graphical tool for designing finite state machines.
Ragel State Machine Compiler: Ragel compiles finite state machines from regular languages into C, C++, or Objective-C code. It allows the programmer to embed actions at any point in a regular language.
Redet [Regular Expression Development and Execution Tool]: Redet allows the user to construct regular expressions and test them against input data by executing any of more than 40 search programs, editors, and programming languages that make use of regular expressions or similar patterns. Redet is written in Tcl, which is therefore always available. Other matchers are executed as child processes if they are available on the user's system. When a suitable regular expression has been constructed it may be saved to a file. For each program, a palette showing the available regular expression syntax is provided. Selections from the palette may be copied to the regular expression window with a mouse click. Users may add their own definitions to the palette via their initialization file. So long as the underlying program supports Unicode, redet allows UTF-8 Unicode in both test data and regular expressions. Although the primary function of Redet is to provide a convenient interface to the actual regular expression tools, it also provides some extensions of particular interest to linguists. Redet allows you to define your own named character classes and provides a notation for taking their intersection. Together, these two capabilities make it possible to perform searches on feature matrices.
re_graph: Given a regular expression draws a diagram of the corresponding finite state automaton.
The Regex Coach: A tool for experimenting with regular expressions. It can single-step through the matching process as performed by the regex engine and can show a graphical representation of the regular expression's parse tree. Uses Perl-style regular expressions.
Regex Test: Given a file of sample text, displays the text and allows the user to enter regular expressions. As the user types, it matches the regular expression against the sample text and highlights the matching portions.
Regexopt: A program that takes as input a regular expression (in a large subset of Perl syntax) and produces a more compact equivalent regular expression.
Sed: The standard Unix stream editor. It provides regular expression searches and substitutions. The GNU sed manual is available at this site. The source code may be had here. There are quite a few versions of sed available, with implementations for a wide variety of architectures and operating systems. Links to various versions are available here together with links to debuggers, tutorials, and other information. If you find sed too complicated and just want to replace fixed strings, you might try replace.
Sgrep: A tool for searching and indexing text, SGML, XML and HTML files and filtering text streams using structural criteria.
Sgrep: A stanza grep tool that provides a more general interface for searching through IOS configurations (or any file that has a 'stanza'-like format). sgrep can also match IP addresses, including IP addresses inside a subnet.
Ssed - Super Sed: This is an enhanced version of the standard Unix stream editor sed. It provides extended regular expression syntax and large increases in speed in certain cases.
State Machine Compiler: Given a file containing a description of a finite state machine in a simple language, generates code for implementing the machine in C, C++, C#, Java, Perl, Python, Ruby, Tcl, and VB.net.
Stuttgart Finite State Transducer: A toolbox for the implementation of morphological analyzers and other tools based on finite state transducer technology. This is the closest non-proprietary equivalent to the Xerox Finite State Calculus.
Theo: A simulator for finite automata and Turing machines. Written in Java so available for most systems.
TRE regexp library: A library implementing an efficient new algorithm, with C and Python bindings. In addition to classical syntax it provides some GNU and Perl extensions. It also provides approximate matching and allows costs to be set in-line, individually for each group. Wide (UTF-32) and multibyte (UTF-8) characters are supported. An approximate grep command called agrep using the library is also supplied. This version of agrep is largely compatible with the older Wu-Manber agrep at the command-line level but is more powerful in some respects.
Txt2regex: Txt2regex is a regular expression wizard that converts human sentences to regexes. In a simple interactive console interface, the user answers questions and the program constructs the regular expression. Over 20 programs are supported.
Xerox Finite State Calculus: The lexc lexicon compiler and xfst rule compiler. These compile into finite state automata.
XFA: A C library for creating non-deterministic finite state automata, either programmatically or from regular expressions and for converting them to the minimal equivalent deterministic finite state automaton.
Xmlgrep: A command-line utility that matches regular expressions against strings with XML markup.
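
The idea behind named character classes and their intersection, mentioned above in connection with Redet, can be illustrated with a minimal Python sketch. This is not Redet's notation or implementation; the class names, the toy segment inventory, and the helper functions are all hypothetical and serve only to show how intersecting classes amounts to searching on a combination of features.

```python
import re

# Hypothetical named classes: each maps a feature value to the set of
# segments bearing it. The inventory is a toy one chosen for illustration.
CLASSES = {
    "voiced": set("bdgvzmnlr"),
    "stop": set("pbtdkg"),
    "labial": set("pbmfv"),
}

def intersect(*names):
    """Return the segments that belong to every named class."""
    return set.intersection(*(CLASSES[name] for name in names))

def class_pattern(*names):
    """Build a regex character class from the intersection of named classes."""
    segments = sorted(intersect(*names))
    if not segments:
        raise ValueError("empty intersection: " + ", ".join(names))
    return "[" + "".join(re.escape(s) for s in segments) + "]"

# Words containing a voiced stop immediately followed by a voiced labial.
pattern = re.compile(
    class_pattern("voiced", "stop") + class_pattern("voiced", "labial")
)
words = ["abba", "apma", "odvo", "akta", "agma"]
matches = [w for w in words if pattern.search(w)]
print(matches)
```

With this toy inventory the intersection of "voiced" and "stop" is the class [bdg], so the search behaves like a query on a two-feature matrix.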

Tutorials

Miscellaneous

Unix Tools

UNIX provides a number of tools that make it easy to extract information from text and to format text in ways useful for linguistic research without having to do any real programming. These tools are now available not only on Unix systems but on many other systems, and they are open source and available at no cost. You can obtain the source code for the GNU versions of these tools by downloading the GNU coreutils package here. Native MS Windows ports of most of them are available from the Unixutils project. (Note: do not be dismayed if you do not see a program in which you are interested in the list of programs on the Unixutils site. Their "program" list is actually a list of packages. Most of the programs of interest belong to the textutils package, which is provided by the Unixutils project.) Versions of these tools that run under Microsoft Windows in a Unix-like environment can be obtained from Cygwin.

Here is a list of the most useful UNIX tools with links to the GNU documentation:

And here are some on-line lecture notes that describe the use of the same tools.

A classic tutorial is the handout for Ken Church's talk "Unix for Poets", which you can download here[Postscript file].

The TDH Utilities are utilities of a similar nature intended to provide some capabilities not available with the standard Unix utilities.

The Heirloom Toolchest is a full set of classic Unix tools that handle Unicode.
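
A typical task for these tools is building a word frequency list, usually done by chaining tools such as tr, sort, and uniq in a pipeline. For readers who would like to see the logic spelled out, here is a minimal Python sketch of the same computation. The tokenization rule and the function name are assumptions made for illustration; real corpora usually need a more careful tokenizer.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count word tokens, ignoring case.

    Assumption: a word is a run of word characters, optionally with
    internal apostrophes (so "didn't" counts as one word).
    """
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    return Counter(tokens)

sample = "The cat saw the dog. The dog didn't see the cat."
freq = word_frequencies(sample)

# Most frequent first; ties broken alphabetically.
for word, count in sorted(freq.items(), key=lambda item: (-item[1], item[0])):
    print(f"{count:4d} {word}")
```

The three steps (split into words, count, sort by count) correspond to the stages of the usual pipeline.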

Syntax

Synpathy: A tool for manual syntactic annotation. Available for GNU/Linux, Mac OS X, and Microsoft Windows.
Syntactica: A tool for creating grammars and viewing syntactic structures.
TreeDraw: Software for drawing syntax trees.

Text Corpus Databases and Searching (including Treebanks)

CLaRK: An XML-based system for corpus development.
Computational Linguistics Toolset: A set of Perl programs for cleaning, splitting, refining, and taking samples from corpora (ICE, Penn, and a native one), for tagging them using the TnT-tagger, for doing permutation statistics on N-grams (useful for finding statistically significant syntactical differences between any two sets of tagged texts), and for examining corpora in various ways.
Corpus Mailing List Home Page: This page contains information about how to subscribe to the corpus mailing list, the list archives, and links to other resources.; A tool for the annotation of text corpora. MS Windows and Macintosh only.
DDC Linguistic Search Engine: The search tool developed for the DWDS Corpus of German.
emdros: A free database for analyzed or annotated text. emdros runs on Unix systems, including GNU/Linux, and on some versions of Microsoft Windows. The query language is probably the most sophisticated and powerful query language available for searching annotated text. Language Documentation and Conservation review.
MonaSearch: MonaSearch is a query tool for linguistic treebanks. The query language of MonaSearch is monadic second-order logic, an extension of first-order logic capable of expressing probably all linguistically interesting queries. In order to process queries efficiently, they are compiled into tree automata. A treebank is queried by checking whether the automaton representing the query accepts the tree, for each tree. The implementation includes a graphical user interface to facilitate the composition of queries and the interaction with treebanks. MonaSearch runs on all major platforms.
Middle English Corpus Home Page: Includes links to annotation and Corpus Search documentation
Phonological Corpus Tools: Tools for studying phonological properties of large corpora.
UPLUG: Tools for linguistic corpus processing, word alignment and term extraction from parallel corpora.
York Corpus Search Lite Manual

Obtaining Data from Web Sites

Much useful data can be obtained from web sites. In a way, the web is a gigantic collection of electronic corpora. In general, any material published on a web site is fair game for research use as this constitutes "fair use" under the law of the United States and many other jurisdictions. There are, however, legal and ethical issues concerning redistribution of material obtained on the web. Furthermore, there are legal and ethical issues involved in obtaining material available on the web but not intended by the owner to be stored or converted to another format. For example, if something is made available only as streaming audio, this may be because the provider does not want you to be able to store a copy. This raises the question of whether you may legally and ethically do so.

Some useful sources of information are:

Bitlaw: A site created by an intellectual property lawyer containing over 1,800 pages dealing with all aspects of intellectual property law.
Copyright and Fair Use: Lots of information on fair use of copyrighted material provided by the Stanford University Library system.
The Electronic Frontier Foundation: An organization dedicated to the preservation of freedom on the web. Its web site contains information about such topics as copyright law, file-sharing, and digital rights management.

Here are some useful tools:

clive: A tool for downloading videos from sites like YouTube and Google Video.
curl: A tool for transferring files with URL syntax.
DataparkSearch: A web indexing and search tool.
Getleft: Similar to curl but with a graphical interface.
H2Text: Strips HTML from a file, leaving pure text. (To do this, use the command line: h2text -nc -t < <input file name> > <output file name>.)
Linguist's Search Engine: A tool for performing syntactic searches on internet data.
wget: Automatically downloads web sites, recursively if so desired. Has oodles of options allowing detailed specifications of what links to follow and what kinds of files to download. It is possible, for example, to specify that text files should be downloaded but image files should not be.
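
Once pages have been downloaded, the markup usually has to be stripped before the text can be used as corpus data. The following is a minimal sketch of that step using only Python's standard library. It is not how H2Text works internally; the class and function names are hypothetical, and the decision to skip script and style content is an assumption about what counts as visible text.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())

page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello, <b>world</b> &amp; friends.</p></body></html>"
)
print(html_to_text(page))
```

Real web pages are often malformed, so for serious work a dedicated tool such as those listed above is likely to be more robust than this sketch.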

Problems sometimes arise in obtaining audio data from the web. You may find these lecture notes on audio data helpful, especially this section.

Sources of Electronic Text

Lexicography and Dictionaries

EnRus Dictionary Tools: Tools that provide a nice graphical interface for dictionaries in a simple plain text format and present. It comes with an English-Russian database but is not limited to this database or these languages.
International Bibliography of Lexicography: A bibliography on lexicography.
Kirrkirr: Kirrkirr is a research project exploring the use of computer software for automatic transformation of lexical databases ("dictionaries"), aiming at providing innovative information visualization, particularly targeted at indigenous languages.
Kura: A multi-user open-source linguistic database especially geared towards language description. Fully Unicode compatible. Runs on Unix (including GNU/Linux and Mac OS X) and MS Windows systems.
Lexica: A dictionary interface.
Lexique Pro: An interactive lexicon viewer and editor, with hyperlinks between entries, category views, dictionary reversal, search, and export tools.
Lexware: Robert Hsu's Lexware software.
Open Lexicon Interchange Format: An XML-based standard for interchange of electronic lexica.
Owl: A dictionary interface for dictionaries written in dicML, an XML-based markup language. The dicML standard and some large bilingual dictionaries are also available on this site.
S-Dictionary: A dictionary program for which quite a few dictionaries are available. Runs on Unix, MS Windows, and Symbian systems (for mobile phones).
Sheetswiper: Converts spreadsheets to Standard Dictionary Format files compatible with such programs as LexiquePro. The download is on the Files tab.
Shoebox: A widely used lexical database, now superseded by Toolbox. Lexique Pro is a viewer for data in this format. There is a program called Econv for converting among Shoebox, Transcriber, and Elan files.
Sqlite: A free, public-domain relational database that can be used directly or via bindings to C and Tcl. Wrappers exist for additional programming languages, including: D, Common Lisp, Haskell, Java, Javascript, Lua, Objective C, Objective Caml, Perl, Php, Pike, Python, R, Ruby, Scheme and Squeak. Sqlite supports Unicode text. It is easy to install, has a very small footprint, and is available for all major platforms, including Mac OS X, Linux, and MS Windows.
Toolbox: A data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. It is the successor to Shoebox. Help is available from the user group. There is a program called Econv for converting among Shoebox, Transcriber, and Elan files.
WeSay: Helps native speakers of a language to compile a dictionary with only limited assistance from a linguist or programmer.

A useful review of graphical front ends to the MySQL database is: http://www.databasejournal.com/features/mysql/slideshows/top-10-mysql-gui-tools.html.

Concordances and File Comparison

aConCorde: A multilingual concordancing program with particularly good support for Arabic and a choice between English and Arabic interfaces. Runs on all major platforms.
Conc: The Summer Institute of Linguistics' concordancing program for the Macintosh.
Corpora og Konkordansprogrammer Konkord: Hans Klarkov Mortensen's page on concordances, in Danish
DDC-Concordance: A search engine for linguists. It lets you search for words or sequences of words together with morphological patterns.
Diff: A program that finds differences between two text files.
Monoconc and Paraconc: Commercial concordance software for the Macintosh and Microsoft Windows. Paraconc handles bilingual parallel corpora.
MultiLingProfiler: A vocabulary profiling tool for French, German, and Spanish.
TextSTAT: TextSTAT is a simple programme for the analysis of texts. It reads plain text (in different encodings) and HTML files (directly from the internet) and produces word frequency lists and concordances from these files.
Uplug Corpus Tools: A collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora.
Xmldiff: A program that finds differences between two similar XML files.
Y-Sets: Computes statistics about words common to two texts.
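
For readers unfamiliar with what a concordance program produces, here is a minimal keyword-in-context (KWIC) sketch in Python. It does not reproduce the behavior of any program listed above; the function name, the whole-word matching rule, and the fixed-width character context are assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=20):
    """Return keyword-in-context lines for every occurrence of keyword.

    Matching is case-insensitive and restricted to whole words; width is
    the number of characters of context kept on each side.
    """
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, flat, re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

sample = """The dog chased the cat. The cat climbed a tree,
and the dog barked at the cat until dark."""

for line in concordance(sample, "cat", width=15):
    print(line)
```

Aligning the keyword in a fixed column is what makes recurring patterns to its left and right easy to spot by eye.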

Historical Linguistics

ALingua: ALingua simulates the evolution of a two-language system in a finite population. In particular, it allows one to examine the spatial dynamics of such a system given a set of initial conditions: a distribution of agents, a network defining connections between them, and a language learning algorithm with associated parameter settings. Pragmatically, comparisons between the outcome of simulations and empirical results from historical linguistics will facilitate the search for satisfactory theories of diachronic language change.
Computational Phylogenetics in Historical Linguistics: Publications and software from the CPHL project.
Epigrass: An epidemic simulator, possibly of use for the study of the spread of linguistic change.
Etymo: A program for modelling sound change.
IPA Zounds: A program for modelling sound change.
Jsesh: An editor for Egyptian hieroglyphics.
LingPY: A Python library for quantitative historical linguistics.
Phono: A program for modelling sound change.
Sounds: A program for modelling sound change.
TreeView: A viewer for phylogenetic trees. Intended for biologists, but useful for linguists too.
WordCorr: A set of tools for finding regular phonological correspondences.
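
Several of the programs above model sound change by applying an ordered list of rules to a set of forms. The core idea can be sketched in a few lines of Python. The rules, the vowel inventory, and the sample forms below are invented for illustration, and the sketch does not reflect the rule formats or algorithms of any of the listed programs.

```python
import re

# Hypothetical ordered rules, each a (pattern, replacement) pair written as
# a regular expression. V is a toy vowel inventory.
V = "aeiou"
RULES = [
    (rf"p(?=[{V}])", "f"),             # p > f before a vowel
    (rf"(?<=[{V}])t(?=[{V}])", "d"),   # t > d between vowels
    (rf"[{V}]$", ""),                  # loss of a word-final vowel
]

def apply_rules(word, rules=RULES):
    """Apply each rule in order and return every intermediate stage."""
    stages = [word]
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
        stages.append(word)
    return stages

for proto in ["pata", "tapi", "apto"]:
    print(" > ".join(apply_rules(proto)))
```

Because the rules apply in sequence, their order matters: if final vowel loss applied first, the intervocalic voicing rule would no longer find its environment in a form like "pata".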

Sociolinguistics

Goldvarb: A program for variable rule analysis. Available for MS-DOS, MS Windows, and Mac OS (classic).
Plotnik: A tool for making vowel plots showing the dispersion of vowel tokens in the vowel space. For Mac OS X.
R-Varb: A package for variable rule analysis along the lines of Varbrul but implemented in R and therefore available for most versions of Unix, MS Windows, and Mac OS X.
Social Networks Visualizer: A tool that allows the user to draw, visualize, and layout social networks.

Phonetics

Amadeus Pro: A general purpose audio editor with a number of analysis functions and, in particular, the ability to fragment an audio file automatically by positioning window edges at transitions between sound and silence. This is closed-source proprietary software but the license is inexpensive and a trial version may be freely downloaded. It runs only on Mac OS X. I have not personally used this software but rely on the company's description and that of a colleague who has used it. I mention it here because its automatic fragmentation capability is unusual and because it runs on Macs with Intel processors, which SndBite does not.
Audacity: An audio editor that runs on all major platforms. It is oriented more toward music than phonetics and has only limited analysis tools. However, it can do noise reduction and has some useful editing tools, including the ability to modify individual samples and to change the amplitude envelope.
AudioSpace: An audio storage calculator. Given the duration of an audio recording, it calculates the required storage in any of a variety of units, for uncompressed audio or a number of types of compression. It also works in the other direction: given the available amount of storage, it will compute the maximum duration of the audio that it will hold. Runs on all major platforms.
Autovot: Trainable software for automated measurement of Voice Onset Time.
Bartek Plichta: This site provides expert advice on microphones, recorders, and related topics, including informative reviews of particular devices.
Ecasound: A command-line tool for recording audio, playing it back and performing format conversions and mixing. It can also carry out some kinds of analysis. It is free software and runs on all major and some not-so-major platforms. Several graphical interfaces are available.
Elan: A tool for transcription and annotation of both audio and video. There is a program called Econv for converting among Shoebox, Transcriber, and Elan files. Available for GNU/Linux, Mac OS X, and Microsoft Windows.
Emu: A collection of tools for the creation, manipulation and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find various speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical as well as sequential labels for a speech utterance.
Exmaralda: A suite of programs for transcription and annotation of speech and the construction of spoken language corpora.
Festival: A multi-lingual speech synthesis system that runs on all major platforms.
Libsndfile: libsndfile is a C library for reading and writing audio files. As such it is of interest only to programmers, though you may find that you need to install it for other software to work even if you do no programming yourself. However, it comes with a number of example programs, several of which are of general utility. These are sndfile-info, which extracts information from sound files, sndfile-play, which plays sound files, and sndfile-convert, which converts files from one format to another. sndfile-play and sndfile-convert can handle some formats not supported by Sox. These include 24 bit PCM data and the mat formats used by Matlab and Octave.
Paradigm: Paradigm helps you to design and run psycholinguistic experiments with millisecond-level accuracy. It is scriptable in Python. It runs only on Microsoft Windows. This is non-free commercial software but the current beta version is downloadable at no cost.
Linux Sound: An extensive listing of software for Linux. The orientation is toward music, but there are quite a few items of interest to linguists.
Nyquist: A sound synthesis system in the form of an object-oriented dialect of Lisp with primitives for both events (as in typical musical score synthesis systems) and signal synthesis.
Phontools: An R package which contains functions intended to facilitate the organization, display, and analysis of the sorts of data frequently encountered in phonetics research and experimentation.
Praat: Praat is a "research, publication, and productivity tool for phoneticians." It includes a comprehensive set of capabilities, usable both interactively and via a scripting language.
Shntool: A command-line utility for handling a wide range of lossless compression formats. It can convert from one format to another, provide information about the contents of a file, split and join files, etc.
Shva: Shva ("speech hear view and annotate") is a Web GUI for aligning linguistic annotation with the acoustic signal. It runs only on GNU/Linux.
SndBite: SndBite is a specialized audio editor, designed for breaking large recordings into smaller components with great efficiency. Special features include: multiple simultaneous views of the waveform at different resolutions; the ability to position window edges at transitions between sound and silence; automated setting of cut points at zero-crossings; automatic filename generation easily controlled by the user; optional automatic playback on window motion; and logging of each write, so that a record exists both for the long term and in case (as often happens) the user loses his or her place while segmenting a large file. SndBite runs on all major platforms but does not currently run on Macintosh computers with Intel processors because of the unavailability of the Snack audio library on that platform. Those desiring the ability to fragment files automatically at speech/silence transitions on Intel Macs should consider Amadeus Pro.
SoundIndex: A transcription tool combining an XML editor with an audio display.
SoX: SoX ("Sound eXchange") is the Swiss Army Knife of the phonetics lab. It is a command line utility that can convert various formats of computer audio files to other formats, also changing sampling rate and performing some other modifications as instructed. The downside to this program is its complex and peculiar command-line syntax. Here is a brief tutorial that will show you how to do most of the more common tasks.
TranscriberAG: This is a program for creating transcriptions (typically orthographic) of sound recordings, time linked (typically at the phrase level) to a digital audio file. It will conveniently deal with long recordings -- an hour or more. The handling of audio I/O and waveform displays is based on the Snack sound toolkit. TranscriberAG is the successor to Transcriber. There is a program called Econv for converting among Shoebox, Transcriber, and Elan files.
Wavesurfer: This is a simple but fairly powerful program for interactive display of waveforms, spectrograms, pitch tracks and transcriptions (phonetic, orthographic etc.). It does not do everything that Praat does, but is easier for novices to learn to use.
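
The kind of calculation performed by an audio storage calculator such as AudioSpace is simple arithmetic for uncompressed audio: bytes = duration x sampling rate x bytes per sample x number of channels. The sketch below shows the calculation in both directions. It is not AudioSpace's code; the function names and default values are assumptions, file headers are ignored, and the bit depth is assumed to be a multiple of 8.

```python
def pcm_bytes(seconds, sample_rate=44100, bit_depth=16, channels=2):
    """Size in bytes of uncompressed PCM audio (headers not included)."""
    return int(seconds * sample_rate * (bit_depth // 8) * channels)

def max_duration(available_bytes, sample_rate=44100, bit_depth=16, channels=2):
    """Longest recording, in seconds, that fits in available_bytes."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return available_bytes / bytes_per_second

# One hour of 44.1 kHz, 16-bit stereo.
hour = pcm_bytes(3600)
print(f"{hour} bytes = {hour / 10**6:.1f} MB")

# How many hours of 16-bit mono at 22050 Hz fit in 10**9 bytes?
seconds = max_duration(10**9, sample_rate=22050, channels=1)
print(f"{seconds / 3600:.2f} hours")
```

Compressed formats need a different calculation, since their size depends on the bit rate of the encoding rather than on the sampling parameters alone.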

Math, Statistics, and Graphics

FreeMat

FreeMat is a free environment for rapid engineering and scientific prototyping and data processing. It is similar to commercial systems such as MATLAB from Mathworks, and IDL from Research Systems.

GnuPlot

Gnuplot is a free, command-driven, interactive, function and data plotting program. Roughly speaking, if you can write a mathematical expression in C or Fortran, Gnuplot will plot it for you. It will also draw plots from data, or combined plots showing both data and theoretical calculations. GnuPlot runs on just about every platform you've ever heard of and some you probably haven't, including not only Unix, DOS, Microsoft Windows, and Mac OS, but OS/2, VMS, and Atari. A manual or tutorial is available in: Czech, French, German, Indonesian, Italian, Japanese, and Portuguese. A nice, short tutorial is available here.

LabPlot

LabPlot is a scientific graphics and statistics package similar to commercial products such as Microcal Origin and SPSS Sigmaplot. It is fully scriptable and runs on most variants of Unix including Mac OS X.

Lush

A language designed particularly for large-scale numerical and graphical programming. Lush is an object-oriented dialect of Lisp. It has a huge library of numerical routines and many other libraries, including one for probabilistic finite state automata. It runs on all major platforms.

Octave

Octave is a free-software emulation of Matlab. It is largely compatible with Matlab 4.X but not with Matlab 5.X. Some interesting free octave/matlab toolboxes are available here.

PSPP

A free clone of SPSS, widely used in the social sciences. PSPP is not as mature or sophisticated as R, but is considered easier to use for non-programmers.

R

R is a free-software version of the improved version of the S statistics language, whose proprietary version goes by the name of "Splus". A nice, short, and simple introduction to R in the form of a web page can be found at: http://lib.stat.cmu.edu/R/CRAN/doc/contrib/kickstart/index.html. A more detailed introduction to R, that devotes more attention to teaching statistics at the same time, is: http://www.math.csi.cuny.edu/Statistics/R/simpleR/ An Introduction to R (about 100 pp.) can be downloaded in PDF format. To download a copy, click here. The R FAQ contains answers to many frequently asked questions. The full R Reference Manual, currently 1144 pages, can be downloaded in PDF format. To download a copy, click here. An introduction to R aimed specifically at linguistics is Analyzing Linguistic Data: A Practical Introduction to Statistics using R. A page containing lots of useful information about R is: http://finzi.psych.upenn.edu. A repository of code and datasets for S and Splus, most of which will also run under R, can be found at http://lib.stat.cmu.edu/S/. You can get the current sources from: http://lib.stat.cmu.edu/R/CRAN/banner.shtml. You can also download precompiled binary versions from the following sites:

GNU/Linux: http://lib.stat.cmu.edu/R/CRAN/bin/linux/
Mac OS X: http://lib.stat.cmu.edu/R/CRAN/bin/macosx/
Mac OS 8.6-9.1: http://lib.stat.cmu.edu/R/CRAN/bin/macos/
Microsoft Windows: http://cran.r-project.org/bin/windows/base/

Semantics

Discrete Event Calculus Reasoner: Performs automated commonsense reasoning using the event calculus, a comprehensive and highly usable logic-based formalism. It solves problems efficiently by converting them into satisfiability (SAT) problems.
Lambda Calculator: An interactive, graphical Java program to help students of natural language semantics practice derivations in the typed lambda calculus. A description is available here. Versions are available for Mac OS X, MS Windows, and Linux.
Molle: A cross-platform modal logic prover.
Projective Discourse Representation Theory Sandbox: PDRT-SANDBOX is a Haskell library that implements Discourse Representation Theory (DRT), and its extension Projective Discourse Representation Theory (PDRT). The implementation includes a translation from PDRT to DRT and First-order Logic, composition via different types of merge, and unresolved structures based on Montague Semantics, defined as Haskell functions.
WordNet-Similarity: A collection of Perl modules for the WordNet system. They are designed as object classes with methods that take two word senses as input and return the semantic relatedness of these word senses.

Other Software

An Gramadóir: An open source grammar checking engine, intended as a platform for the development of sophisticated natural language processing tools for languages with limited computational resources.
BBE: Bbe is a sed-like stream editor for binary files. Instead of reading input in lines like sed, bbe reads arbitrary blocks from an input stream and performs byte-related transformations on selected blocks. Blocks can be defined using start/stop strings, offset in the stream and block length, or a combination thereof. Basic editing commands include delete, replace, search/replace, binary operations (and, or, etc.), append, and Binary Coded Decimal/ASCII conversion. For examining the input stream, it contains some grep-like features such as printing the input file name, stream offset, and block number of the selected blocks. Block contents can also be printed in different formats such as hex, octal, ASCII, and binary.
Emacs: Emacs is a very powerful text editor with a number of features that make it especially useful for linguistic research. Emacs allows the screen to be divided into multiple windows, both vertically and horizontally. One can, for example, split the screen vertically so as to view different versions of a text in parallel. It provides full regular expression search and substitution facilities. Since emacs is implemented on top of a LISP interpreter, to which full access is available, emacs is fully programmable, not in a feeble extension language but in a full-fledged programming language.
Flat File Extractor: Flat File Extractor is a parser for flat file databases. Using specifications of the structure of the input file and of the desired output format, it parses its input and writes it back out in another format. Among other things, it can convert other formats to XML.
FreeLing: An open source language analysis tool suite covering: text tokenization, sentence splitting, morphological analysis, named entity detection, date/number/currency/ratio recognition, PoS tagging, chart-based shallow parsing, contraction splitting, physical magnitude detection (speed, weight, temperature, density, etc.), named entity classification, WordNet based sense annotation, and dependency parsing. It comes with data for Catalan, Spanish, Italian, Galician, and English. In C++.
GTick: A software metronome, useful, for example, for syllable-breaking tasks. For Unix-ish systems.
Linguistic Tree Constructor: Linguistic Tree Constructor is a program for drawing syntactic trees. It allows the user to create trees for large amounts of text quickly. "Generic" trees as well as Role and Relation Grammar and X-Bar trees are supported, as is exporting to Annotation Graph XML format. Printing and copying parts of the tree to clipboard are supported. LTC runs on all major platforms.
Minpair: A program for finding minimal pairs in a wordlist. It accepts definitions of multigraphs, allows words to be paired with identifiers, and handles Unicode input.
Msort: A sophisticated sorting program, capable of handling multiline records, locating fields by tags, using arbitrary sort orders with long multigraphs, and various other things. Fully Unicode-capable.
Numutils: A set of command line utilities that may be useful in dealing with numbers in combination with other Unix utilities. In particular, numgrep permits filtering by the numerical value of expressions, which can be difficult with ordinary versions of grep.
phpSyntaxTree: A web application that creates syntax tree graphs from phrases entered in labelled bracket notation.
PhonoApps: Programs for writing and testing phonological rules and finding natural classes of segments with respect to a given feature set.
GNU Poke: A program for reading, manipulating, and writing binary data. It provides a language and data structures for working with binary data at a more reasonable level of abstraction. Useful for tasks such as extracting data from media or wordprocessor file formats for which you do not have high-level tools.
Penn Controller: A system for controlling on-line psycholinguistic experiments using the IBEX system.
Replace: According to its author, "the sane person's alternative to sed". replace provides an easier alternative to sed for replacing one or more strings with another in one or more text or binary files or from standard input. It works with fixed strings rather than regular expressions but can adapt the substitution to the case of the original if so desired. An interactive mode is available in which the user is asked to confirm proposed substitutions.
SyNTeX: SyNTeX is a LaTeX preprocessor that draws syntactic trees using the LaTeX picture environment. The preprocessor reads the comments in a LaTeX file and draws the tree based on commands that it finds in the comments.
TDH Utilities: A suite of programs for handling tabular ASCII data intended to supplement the standard Unix utilities. They accomplish nothing that can't be done fairly easily in a scripting language like AWK, but some of them provide an easy way to do things that would take some work in a scripting language. Among these are utilities for extracting data from specified cells of a spreadsheet and for cleaning up spreadsheets.
Tree Draw: A program for drawing syntactic trees.
Turk Tools: A tool for constructing linguistic surveys. "More and more researchers in linguistics use large-scale experiments to test hypotheses about the data they research, in addition to more traditional informant work. In this paper we describe a new set of free, open-source tools that allow linguists to post studies online, turktools. These tools allow for the creation of a wide range of linguistic tasks, including grammaticality surveys, sentence completion tasks, and picture-matching tasks, allowing for easily implemented large-scale linguistic studies. Our tools further help streamline the design of such experiments and assist in the extraction and analysis of the resulting data. Surveys created using the tools described in this paper can be posted on Amazon’s Mechanical Turk service, a popular crowdsourcing platform that mediates between ‘Requesters’ who can post surveys online and ‘Workers’ who complete them. This allows many linguistic surveys to be completed within hours or days and at relatively low costs. Alternatively, researchers can host these randomized experiments on their own servers using a supplied server-side component."
Wordfreak: WordFreak is a Java-based linguistic annotation tool designed to support both human and automatic annotation of linguistic data, and to employ active learning for human correction of automatically annotated data.
WordGenerator: WordGenerator generates hypothetical words from specifications of their syllable structure. The user specifies the maximum length of the words in syllables, the abstract structure of syllables in the language (in terms of such units as consonants and vowels or onsets and rhymes), and the actual sounds that comprise each abstract class (e.g. the list of vowels in the language); WordGenerator then generates the words that conform to this specification.
Word generator: This takes a list of existing words in a language and generates other possible words, with a number of settings you can select to alter the characteristics of the created set.
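
The kind of generation that WordGenerator performs can be illustrated with a short Python sketch. This is not WordGenerator's own code or input format: the function names, the template strings, and the toy sound inventory are all assumptions made for illustration. Each syllable template is a string of class symbols, each class is a list of sounds, and the output is every concatenation of one to max_syllables syllables.

```python
from itertools import product

def generate_syllables(templates, classes):
    """Expand each template (e.g. "CV") into every concrete syllable."""
    syllables = []
    for template in templates:
        # One list of candidate sounds per position in the template.
        slots = [classes[symbol] for symbol in template]
        for combo in product(*slots):
            syllables.append("".join(combo))
    return syllables

def generate_words(templates, classes, max_syllables):
    """Return all words made of 1 to max_syllables syllables."""
    syllables = generate_syllables(templates, classes)
    words = []
    for n in range(1, max_syllables + 1):
        for combo in product(syllables, repeat=n):
            words.append("".join(combo))
    return words

# Hypothetical toy language: two consonants, two vowels,
# and two permitted syllable shapes, V and CV.
classes = {"C": ["p", "t"], "V": ["a", "i"]}
words = generate_words(["V", "CV"], classes, max_syllables=2)
print(len(words), words[:6])
```

With this toy inventory there are six possible syllables, so the list grows quickly as max_syllables increases; a real tool would normally add filters or sampling.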

Programming Languages

Languages

AWK: The classic workhorse of Unix text processing. If the program is complex, and in particular if sophisticated data structures are needed, other languages may be preferable, but many people find AWK easier and quicker to use for relatively simple programs. One reason for this is that AWK automatically parses its input into records and fields and uses an unusual pattern-action format. Some versions of AWK, including GNU Awk, support Unicode. X(ml)Gawk is a new derivative of AWK that contains an XML parser. Instead of reading input record by record as in traditional AWK, it reads input node by node.
Perl: A scripting language with elaborate regular expression support, Perl is too widely used not to mention, and its author, Larry Wall, is a linguist. However, Perl syntax is arcane and inconsistent, with numerous special cases. Perl programs are said to be "write only" because they are so often impossible to understand, even for the author after some time has passed. It is possible to write clean, understandable and modifiable programs in Perl, but you have to work at it. A good discussion of the problems of Perl and the advantages of Python over Perl can be found here. Another good critique of Perl is this one. Your mileage may vary.
Python: A general-purpose, high-level, object-oriented language with good built-in regular expression operations and strong Unicode support. Since version 3.3, every build of Python supports Unicode outside the Basic Multilingual Plane; earlier versions must be compiled specially to do so. Python is (in)famous for using indentation rather than parenthesization to indicate block structure. For expositions of the virtues of Python see: Why I Love Python and Why Python?.
Ruby: An interpreted, high-level, object-oriented language. It is similar in many ways to Python, but is more strictly object-oriented and has a more traditional syntax. Ruby now supports Unicode, though not as natively and completely as Python and Tcl.
Snobol: Snobol ("String Oriented Symbolic Language") was the first language focused on text processing, for which it was widely used in the 1970s and 1980s. It is no longer commonly used, but still has its devotees and perhaps deserves a resurgence due to its powerful pattern-matching facilities, which differ in interesting ways from regular expressions. It is the antecedent of Icon. See the Wikipedia article for more information.
Tcl: Much of the reputation of Tcl ("Tool Command Language") is due to its associated windowing and graphics library, Tk, for which bindings now exist for several other languages, including Python and Perl. However, Tcl is actually quite a nice language for text processing. Its unusually simple syntax makes it easy for beginners to learn. (In fact, many of the problems encountered by those new to Tcl are due to false assumptions that it is like more familiar languages. Novice programmers have no such assumptions.) It uses Unicode as its native character set and has a good set of built-in string manipulation functions and superior regular expression facilities. Tcl does not presently support Unicode above the Basic Multilingual Plane. Due to its original design as an extension language, Tcl can easily be extended by writing functions in C, and C programs can easily embed Tcl interpreters. Tcl is not object-oriented, but several object-oriented extensions are available.
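
AWK's model, in which input is split into records and fields and an action runs for each record that matches a pattern, can be imitated in Python to show the idea. This is only a sketch: the helper run_rules and the sample data are invented for illustration, fields are assumed to be whitespace-separated, and AWK features such as BEGIN/END blocks and empty values for missing fields are not reproduced.

```python
def run_rules(text, rules):
    """Apply (pattern, action) rules to each line, in the style of AWK.

    Each record is one line. fields[0] is the whole record (like $0 in AWK)
    and fields[1], fields[2], ... are the whitespace-separated fields.
    A pattern is a function returning True or False for a record;
    an action is a function returning the output for a matching record.
    """
    results = []
    for record in text.splitlines():
        fields = [record] + record.split()
        for pattern, action in rules:
            if pattern(fields):
                results.append(action(fields))
    return results

# Hypothetical word list: form, part of speech, frequency.
sample = "dog N 12\nrun V 7\ncat N 30\n"

# Roughly what the AWK program  $2 == "N" { print $1, $3 }  does.
rules = [(lambda f: f[2] == "N", lambda f: f[1] + " " + f[3])]

nouns = run_rules(sample, rules)
print(nouns)
```

The AWK version is a single line, which is the point made above: for simple jobs the record and field splitting comes for free.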

Information for Particular Languages

Python

Main Python Site
Download Documentation
Non-Programmers Tutorial for Python
Instant Hacking
Python tutorial
Python cheat sheet
Python regular expression tutorial (gentler introduction than library reference)
Natural Language Toolkit
Python string module reference
Python regular expression module reference
Python Unicode database module reference
Python structured markup (HTML, XML, etc.) reference
AGREPY - Python approximate matching library
Python 2 (for example version 2.5) supports all of Unicode. However, by default Python 2 is compiled with support only for the Basic Multilingual Plane. To check whether your Python supports all of Unicode, start Python and type the command "import sys". Then type "print(sys.maxunicode)". If the result is 65535, you have support only for the BMP. To compile Python 2 with support for all of Unicode, give the argument --enable-unicode=ucs4 to configure. From Python 3.3 onward every build supports all of Unicode, and sys.maxunicode is always 1114111.
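
The check described above can be wrapped in a small script. This is a sketch; the constant BMP_MAX and the helper name supports_full_unicode are made up for illustration. It avoids newer syntax so that it should run under both Python 2 and Python 3.

```python
import sys

# 65535, the highest code point in the Basic Multilingual Plane.
BMP_MAX = 0xFFFF

def supports_full_unicode():
    """Return True if this interpreter can represent code points above the BMP."""
    return sys.maxunicode > BMP_MAX

if supports_full_unicode():
    print("Full Unicode support (sys.maxunicode = %d)" % sys.maxunicode)
else:
    print("BMP only (sys.maxunicode = %d)" % sys.maxunicode)
```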

Tcl

Structured Markup Languages

HTML

Cascading Style Sheet specification: The official specification for Cascading Style Sheets, which allow you to separate the details of presentation from your HTML.
HTML Specification: The full, formal specification of HTML.
HTML Validator: The official HTML checker. Submit a URL to it and it will tell you whether the page conforms to the standard.
Dillo: Dillo is a lightweight web browser. It uses very little memory and starts up very quickly. It doesn't have all the features of a full-fledged browser - it doesn't display the full range of Unicode characters, for example - but has a very nice HTML checker. It shows a count of the errors it detects in your page. If you click on this, it brings up a window listing the errors and their locations in your HTML file. Once you have corrected them, if you right-click on the bug meter a menu will pop up offering you the opportunity to fully validate the page at either W3C or WDG.
HTML Tidy: This program detects errors in HTML pages and, if it is clear how to fix them, does so. If it is not, it generates a warning.
HTML Introduction: A set of tutorials.
HTML Attributes: A list of all of the HTML attributes.
HTML Character Entity References: The character entities are the HTML "names" that can be used to represent characters that are otherwise difficult or impossible to represent, such as &eacute; for é. This page provides a complete list.
HTML Cheat Sheet: A list of HTML elements.
HTML Latin-1 Characters: This page shows how to represent the characters in the ISO-8859-1 (aka "Latin-1") character set in HTML. This character set includes most of the non-ASCII characters used in Western European languages.
HTML Color Specifications
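
To see how character entity references map to characters, Python's standard html module can convert in both directions. The following is a minimal sketch; the helper to_entities and the sample strings are invented for illustration, and the helper deliberately leaves ASCII characters such as & and < untouched.

```python
import html
from html.entities import codepoint2name, name2codepoint

# Decode named and numeric references into the characters they stand for.
decoded = html.unescape("caf&eacute; &#233; &lt;b&gt;")
print(decoded)

# Look up the code point behind an entity name.
print(name2codepoint["eacute"])

def to_entities(text):
    """Replace each non-ASCII character with a named reference if the
    HTML 4 entity table has one, otherwise with a numeric reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)

encoded = to_entities("né ŋ")
print(encoded)
```

Characters with no named entity, such as many phonetic symbols, fall back to numeric references, which any HTML parser will accept.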

XML

LibXML: An XML parser library with a C API.
LibXSLT: An XML transformation library with a C API.
LXML: A Python binding for the libxml and libxslt XML processing libraries.
Xmldiff: A program that finds differences between two similar XML files.
XML Tree: An XML parser with C++ and Perl APIs.
Xsltproc/libxslt: libxslt is a library for processing XML, with C and Python bindings. There is also a convenient command-line tool, xsltproc, that takes as arguments a stylesheet and an XML file and processes the latter according to the former.
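
XML parsers such as those listed above turn a document into a tree of nodes that a program can search. The following sketch uses Python's standard xml.etree.ElementTree module rather than any of the libraries listed, so that it runs without extra installation; the element and attribute names in the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical two-entry lexicon.
doc = """<lexicon>
  <entry pos="N"><form>dog</form><gloss>canine</gloss></entry>
  <entry pos="V"><form>run</form><gloss>move fast</gloss></entry>
</lexicon>"""

# Parse the string into a tree and get its root element.
root = ET.fromstring(doc)

# Collect the form of every entry whose pos attribute is "N".
noun_forms = [
    entry.findtext("form")
    for entry in root.iter("entry")
    if entry.get("pos") == "N"
]
print(noun_forms)
```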

Miscellaneous

Revised 2024-07-12.
[This date is in the International Date Format, ISO 8601]