Computational Resources for Linguistic Research
This page lists computational tools for doing linguistics. There is of course some
overlap with computational linguistics, but the emphasis is on using computation to do
what ordinary linguists want to do, not on computational linguistics for its own sake.
The page emphasizes free software that runs on Unix systems.
The emphasis is on Unix for several reasons. First, that's what I myself use.
Second, in my opinion Unix is the environment of choice for this kind of work.
The Unix philosophy of making it easy to connect one small tool to another
is just right for linguistic research. Third, Unix is strongly represented in the
free software world.
By free software I mean software that you can use as you wish to, modify,
and redistribute. Software that is free in this sense is often also available
at no cost, but that isn't the criterion. For a discussion of the distinction,
click here.
The page emphasizes free software for two reasons. One is financial: linguists
tend not to be well funded and so can't afford to buy expensive commercial
software. Furthermore, since linguistics isn't a large or lucrative market,
not much software is aimed specifically at linguistics. Linguists therefore
often have to make use of tools intended for other purposes. If you buy
a piece of commercial software, you may well find that it doesn't do what you want.
If the software is free in the sense of "free beer", you haven't lost anything
but a little time by trying it out, but if you've bought a commercial product,
you're probably out of luck.
Equally if not more important is the other sense of freedom, namely freedom
as in "free speech". Software that you can freely modify and redistribute is much
more flexible. If it doesn't do exactly what you need, you can modify it so that
it does, and you can make it available to other people.
With occasional exceptions the software listed runs on GNU/Linux systems.
I'm most likely to know about such software because GNU/Linux is what I use most
of the time. Software that runs on GNU/Linux systems will usually run on other
Unix variants, such as FreeBSD, OpenBSD, NetBSD, SunOS/Solaris, HP-UX, IRIX, AIX,
and Mac OS X. Many of the programs listed will also run natively under Microsoft Windows.
Those that will not can in many cases be run on MS Windows machines by using
Cygwin, which provides a Unix-like environment.
A few programs are listed that run only on non-Unix systems.
These are generally either programs of particular interest or programs that provide
an alternative for non-Unix users to something else available for Unix.
In a few cases I list software that is not free, either because it is unusual
and does not, to my knowledge, have a comparable free analogue,
or because it is widely used.
For what they are worth, here are some
recommendations for relevant books.
- Character Encoding
- Fonts, Rendering, and Printing
- Input Methods and Keyboard Layout
- Extracting Text from Impure Formats
- Regular Expressions and Other Pattern Matching
- Unix Tools
- Syntax
- Text Corpus Databases and Searching
- Obtaining Data From Web Sites
- Sources of Electronic Text
- Lexicography and Dictionaries
- Concordances and File Comparison
- Historical Linguistics
- Sociolinguistics
- Phonetics
- Math and Statistics
- Semantics
- Signed Languages
- Other Software
- Programming Languages
- Structured Markup Languages
- Miscellaneous
Character Encoding
Unicode
Information
- Unicode Organization
- The organization responsible for the Unicode standard. The web site contains
all sorts of information about the standard, including code charts, and information
about on-going activity and future plans.
- UTF-8 Standard
- RFC 3629 is the current definition of the
UTF-8 encoding format for Unicode. RFC 3629 replaced
RFC 2279,
which in turn replaced RFC 2044.
- Unicode Character Ranges
- A list of the types of characters currently included in Unicode
and the ranges of codepoints that they occupy.
- Unicode Chart
- A Unicode chart in the form of a set of web pages. Each page contains
a 256 character block. Whether or not the characters will display depends on
whether your browser has access to the necessary font.
You can download the Python script used to generate the chart
here.
- A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX
- A simple explanation of how to use Unicode on Unix/Linux systems.
- UTF-8 and Unicode FAQ for Unix/Linux
- Detailed information on using Unicode in Unix/Linux systems.
- Alan Wood's Unicode Resources
- A variety of information and links.
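As a small illustration of what the UTF-8 encoding defined in RFC 3629 does, the following Python sketch prints the byte sequences for characters from different codepoint ranges, which take from one to four bytes. The helper name utf8_hex is made up for this example; str.encode is standard Python.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a single character as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


# One sample character for each UTF-8 sequence length (1 to 4 bytes).
samples = ["A", "\u00e9", "\u4e2d", "\U00010330"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
```

ASCII characters keep their single-byte values, which is why plain ASCII files are already valid UTF-8.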
Editors and Word Processors
TextEditors.org is a wiki devoted to
text editors of all types. It contains information about over 700 editors.
One section is devoted to Unicode Editors.
- Babelpad
- Unicode editor for Microsoft Windows.
- Geresh
- A Unicode editor oriented especially toward languages written right-to-left,
particularly Hebrew and Arabic. Note that the web site and manual are in Hebrew but
that the editor menus are in English, so people who do not read Hebrew should be
able to use it. Here is an RPM and here is a compressed tar archive.
- Katoob
- A multilingual bidi editor, capable of reading and writing Unicode,
designed especially for Arabic. It has keyboards for both Arabic and Hebrew.
- Mined
- A Unicode text editor. Display and edit Unicode files, and enter text in Unicode.
Mined displays directly in an xterm window.
Its basic command set resembles that of Wordstar, but it can also be configured to
use Emacs-like commands. The quality of the rendering is not as high as in Yudit,
but it is much easier to use as a general purpose editor. It also has a "char info"
entry on the "Xtra" menu that provides detailed information about the current
character, including, if desired, the readings for CJK characters.
- OpenOffice.org Writer
- OpenOffice.org Writer is a FLOSS word processor that runs on a variety of operating
systems, including GNU/Linux, Mac OS X, FreeBSD, Solaris, Irix, and Microsoft Windows.
It is capable of reading and writing Unicode.
- Simredo
- A Java Unicode editor.
- Vim
- A clone of the vi text editor that supports Unicode.
See this page
for instructions on using vim with Unicode.
- Xetex
- A version of TeX that works with Unicode.
- XMLMind
- A Docbook editor that allows Unicode.
- Yudit
- A Unicode text editor. Display and edit Unicode files, and enter text in Unicode.
Numerous keymaps are supplied, but you can roll your own.
Here is a screenshot of Yudit in action.
Other Software
- Ascii2binary
- A program that reads textual representations of numbers and converts them
to binary format. It provides a simple way to generate text in an encoding
for which you do not have a converter or input method as well as a way
of being sure that you know exactly what is in the file.
The associated program Binary2ascii provides conversions in the
opposite direction.
- BabelMap
- A Unicode character map for MS Windows.
Displays selected portions of the Unicode character set and provides information
about the characters. It also allows selected characters to be copied to the
clipboard, making it useful for Unicode input into other programs.
- Gucharmap
- Displays selected portions of the Unicode character set and provides information
about the characters. It also allows selected characters to be copied to the
clipboard, making it useful for Unicode input into other programs.
In my experience, the best Unicode character map for GNU/Linux.
The current version can be downloaded from:
http://ftp.gnome.org/pub/GNOME/sources/gucharmap/
- Heirloom Toolchest
- This is a set of classic Unix tools based on the code released
by Caldera. As a result, they do not have some of the more recent extensions.
However, the maintainer has updated them to handle Unicode, which many more recent
tools, including some of the GNU tools, do not.
- International Components for Unicode
- A very extensive library for dealing with Unicode, with APIs
for C/C++ and Java. The main problem with it is that it is so comprehensive
that finding your way can be daunting.
- kcharselect
- A font map that displays the characters in the selected font.
You can click on a character to copy it to a buffer from which
it may be entered into a document, but it is also useful just for
finding out exactly which characters a font provides. Moving
the mouse pointer over a character and leaving it for a short
time produces a tooltip containing the character's codepoint.
This is identified as a Unicode codepoint but is actually just
the offset into the font, meaning that it is only the Unicode
codepoint if the font is Unicode-encoded.
This program is part of the kdeutils package, which in turn
is part of the KDE desktop environment, so it is not readily
available separately. To get it, you need either to install KDE
if you do not already have it, or to obtain the Debian kdeutils
package.
- libucd
- A C library interface to the Unicode Character Database, which contains the
properties of all Unicode characters.
- libuninum
- A library for converting between Unicode strings representing numbers in a wide
variety of number systems and internal machine representations. The library
has interfaces for C and Tcl. A command-line program numconv and a graphical
user interface NumberConverter are also provided.
- Open Sesame
- A graphical experiment-builder for psychological experiments.
- Unicode Checker
- A Mac OS X utility that can perform Unicode normalization and a variety of conversions,
browse the character set, search by name or codepoint, display information about CJK
characters, etc.
- Unicode Data Browser
- A browser for the UnicodeData.txt file, which contains much useful
information but is not easily read by humans. The browser creates a scrollable table
in which columns represent properties.
The table may be sorted on any column. Abbreviations are expanded
and characters cross-referenced in decomposition and casing fields are named.
Regular expression search restricted to a selected column is available.
The set of characters for which information is displayed may be restricted to those
characters matching a regular expression on a specified property. Each such filtering
operation applies to the output of the previous filtering operation unless the table
is reset to the original full set of characters, so filtering on multiple properties
is possible.
If an up-to-date local copy of UnicodeData.txt is not available, it can be downloaded
automatically from Unicode.org.
- Unicode Normalization from the Command Line
- How to use the facilities of common scripting languages to normalize Unicode.
- Unicode Utilities
- A set of programs that provide various kinds of information about the contents of
Unicode files and that manipulate Unicode files:
- unidesc - identifies the script ranges in a file
- unifuzz - generates output for testing other software
- unihist - generates a histogram of its input
- uniname - identifies each character in a file
- unireverse - reverses its input character-by-character
- ExplicateUTF8 - determines and explains the validity of a series of bytes as UTF8.
- UTF8Lookup - converts a codepoint to a Unicode character name
- Uni2Ascii
- A pair of programs that convert between Unicode and ASCII representations
of Unicode such as HTML numeric character references (e.g. &#x00E9;) and
URL format UTF-8 (e.g. %C3%A9).
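The ASCII representations mentioned for Uni2Ascii, and the normalization step described under Unicode Normalization from the Command Line, can be sketched in Python using only the standard library. This is an illustration of the formats, not a description of how the listed programs are implemented; the function names to_ncr and to_url_utf8 are invented for the example.

```python
import unicodedata
from urllib.parse import quote


def to_ncr(text: str) -> str:
    """Replace each non-ASCII character with an HTML hexadecimal
    numeric character reference; ASCII passes through unchanged."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text
    )


def to_url_utf8(text: str) -> str:
    """Percent-encode the UTF-8 bytes of the text (URL format)."""
    return quote(text, safe="")


# "e" followed by a combining acute accent; NFC composes the pair
# into the single precomposed character U+00E9.
decomposed = "cafe\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(to_ncr(composed))
print(to_url_utf8(composed))
```

Normalizing first matters here: without it the decomposed input would produce a reference for the combining accent rather than for the precomposed letter.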
Fonts
- Alphabetum
- A Truetype font focussed on ancient languages, including:
classical and mediaeval Latin,
ancient Greek,
Old Italic (Etruscan,
Oscan,
Umbrian,
Faliscan,
Messapic,
Picene),
Gothic,
Iberian,
Celtiberian,
old and middle English,
Hebrew,
Sanskrit,
Runic,
Ogham,
Ugaritic,
Old Persian cuneiform,
Coptic,
Kharosthi,
Phoenician,
Linear B,
Cypriot,
Ancient Greek musical notation,
Ancient Greek acrophonic numerals,
New Testament editorial symbols,
Ancient Greek papyrological numbers,
Aegean numbers and old and mediaeval Nordic.
A demonstration version with gaps here and there is available at no cost.
An unrestricted individual license costs 30 euros.
- Code2000, Code2001, and Code2002
- If you work extensively with a particular writing system, you will likely
have favorite fonts. However, it is useful to have fonts that cover the entire
Unicode character set. James Kass has kindly made available shareware fonts with
which you can do this. His Code2000 font covers the first Unicode plane.
His Code2001 font covers the second Unicode plane. Code2002 is the
beginning of a font with coverage of the third plane. He asks US$5 for
personal use.
- Clearlyu
- A free font developed by Mark Leisher at the University of New Mexico.
It is a 12 point BDF (bitmap) font, so it is useful for its coverage but does not
scale well. It covers the following Unicode ranges:
Basic Latin;
Latin-1 Supplement;
Latin Extended-B;
IPA Extensions;
Spacing Modifier Letters;
Combining Diacritical Marks;
Greek;
Cyrillic;
Armenian;
Hebrew;
Thaana;
Devanagari;
Thai;
Lao;
Georgian;
Ethiopic;
Cherokee;
Unified Canadian Aboriginal Syllabics;
Ogham;
Runic;
Letterlike Symbols;
Number Forms;
Arrows;
Control Pictures;
Geometric Shapes;
Braille Patterns.
- GF Zemen Unicode
- A Unicode-encoded Truetype font of broad coverage.
- SIL Doulos IPA Font
- This is a Unicode-encoded font that includes all of the
International Phonetic Alphabet together with a variety of other Roman
and Cyrillic letters.
- Titus Cyberbit
- A Unicode-encoded Truetype font of broad coverage.
- Re-encoding a Font to Unicode
- There are many fonts in existence that were created with encodings other than Unicode.
This illustrated tutorial describes how to re-encode an existing TrueType font to Unicode.
Other Encodings
- ASCII Code
- The standard for English except on IBM mainframes, generally used for programming.
This link points to charts in binary, octal, decimal and hexadecimal, all color-coded for the
basic POSIX character classes, with explanations of the derived character classes and
the control characters.
- ascii
- A handy command-line program that lists all of the synonyms of an
ASCII character given a name, abbreviation, or codepoint in any of
several formats.
- ByteName
- For each byte of input prints the byte offset, the value of the byte in hex, octal, and
binary, and a description of the byte in any of several dozen single-byte encodings.
It can also generate a chart for a selected encoding, or give the interpretation
of a given codepoint in all of the known encodings.
- Cyrillic
- Various encodings for Russian and other variants of the Cyrillic alphabet
- Encoding Database
- A database containing information on several dozen single-byte encodings, including
the various ISO-8859 encodings and Microsoft Code Pages.
- EUC-KR
- The usual encoding for Korean.
- ISCII (Indian Script Code for Information Interchange) Standard
- The official Indian government encoding.
- ISO-8859 (Latin-1 etc.)
- 8-bit encodings for European languages: extended Roman alphabet and Cyrillic
- Mark Leisher's Csets
- Mappings between various character sets often not covered by standard conversion tools and Unicode.
- Microsoft Windows Codepage 1250
- The 8-bit extension of ASCII commonly used by Microsoft Word.
- Unicode Consortium Cross-Mapping Tables
- The Unicode Consortium provides a large set of tables showing the relationship between
other encodings and Unicode.
These include all of the ISO-8859 encodings and Microsoft Codepages.
Encoding Converters
General Purpose
- enca
- Handles fewer encodings than most of the others, but is useful because it
also functions as a detector.
- iconv
- The original GNU encoding conversion tool. It is a command-line tool
based on libiconv.
- recode
- A successor to iconv but with a somewhat peculiar command-line
syntax.
- siconv
- This is a stream-oriented counterpart to iconv,
using libiconv, the same library that underlies iconv.
It can handle larger amounts of data than iconv.
- uniconv
- This is the encoding conversion tool associated with the Yudit text editor.
It can also convert from ASCII sequences to Unicode using the same keymaps
used by the editor for input.
- utrac
- Converts among various single-byte encodings.
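What the general-purpose converters above do can be sketched in a few lines of Python: decode the bytes from the source encoding into Unicode, then encode them in the target encoding. The function name convert is an assumption made for this example, and Python's built-in codecs stand in for libiconv; the set of supported encodings therefore differs from that of the tools listed.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them
    in the target encoding, going through Unicode in between."""
    return data.decode(source).encode(target)


# The byte 0xE9 is "é" in ISO-8859-1 but a Cyrillic letter in KOI8-R,
# so the source encoding has to be stated explicitly.
latin1_bytes = b"\xe9"
print(convert(latin1_bytes, "iso8859-1", "utf-8"))

koi8_bytes = "Привет".encode("koi8-r")
print(convert(koi8_bytes, "koi8-r", "utf-8").decode("utf-8"))
```

The roughly equivalent iconv invocation for the second case is iconv -f KOI8-R -t UTF-8, reading from standard input.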
Specialized
- Autoconvert
- Converts between Chinese encodings and Unicode.
- Cz2cz
- Converts among Czech encodings.
- Jcode
- Converts among Japanese encodings.
- Polcnv
- Converts among Polish encodings.
- TLGU
- Converts text in the encoding used by the Thesaurus Linguae Graecae to Unicode.
- Xcode
- Converts among Russian encodings.
Transliteration
- Buckwalter2Unicode
- A pair of programs that convert from the Buckwalter transliteration of Arabic to Unicode and back.
- Earm2IPA
- Transliterates Eastern Armenian from its native writing system to
the International Phonetic Alphabet. Both input and output are in UTF-8
Unicode.
- OOTranslit
- An OpenOffice.org Writer macro that converts between the Roman and Cyrillic writing
systems for Serbo-Croatian.
- Tgn2IPA
- Transliterates Tigrinya from its native writing system to the International Phonetic Alphabet.
Both input and output are in UTF-8 Unicode.
- Xlit
- A general purpose transliteration program.
Transliteration definitions may be read from files or
defined interactively by entering strings to be transliterated on the left
and the strings to which they are to be mapped on the right. Xlit can translate
the entire text or restrict the transliteration to the text enclosed within
specified delimiters or to text not enclosed in specified delimiters.
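The kind of table-driven transliteration described for Xlit, including the option of restricting it to delimited text, can be sketched as follows. This is a minimal sketch and not Xlit's own implementation: the mapping TABLE, the function names, and the use of curly braces as delimiters are all assumptions made for illustration.

```python
import re

# Hypothetical transliteration table (Roman to Cyrillic), for illustration only.
TABLE = {"sh": "ш", "ch": "ч", "s": "с", "h": "х", "i": "и"}


def transliterate(text: str, table: dict) -> str:
    """Replace every occurrence of a table key with its mapped string."""
    # Try longer keys first so that "sh" wins over its prefix "s".
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)


def transliterate_delimited(text: str, table: dict,
                            open_delim: str = "{", close_delim: str = "}") -> str:
    """Transliterate only the text enclosed in the delimiters,
    leaving everything outside them untouched."""
    span = re.compile(
        re.escape(open_delim) + "(.*?)" + re.escape(close_delim), re.DOTALL
    )
    return span.sub(
        lambda m: open_delim + transliterate(m.group(1), table) + close_delim,
        text,
    )


print(transliterate("shchi", TABLE))
print(transliterate_delimited("see {shchi} here", TABLE))
```

Ordering the alternatives by length is the design choice that matters: without it, a single-letter rule would match first and the digraph rules would never apply.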
Terminal Emulators
- mlterm
- A terminal emulator for the X11 window system that supports UTF-8 Unicode,
including bidirectional text, as well as a variety of parochial encodings,
including: ISO-8859-[1-11],
ISO-8859-[13-16], TIS-620 (same as ISO-8859-11), KOI8-R,
KOI8-U, KOI8-T, GEORGIAN-PS, TCVN5712, VISCII, CP1251,
CP1255, EUC-JP, EUC-JISX0213, Shift_JIS, Shift_JISX0213,
ISO-2022-JP[1-3], EUC-KR, UHC, JOHAB, ISO-2022-KR, GB2312
(EUC-CN), GBK, GB18030, ISO-2022-CN, HZ, EUC-TW, BIG5,
and BIG5HKSCS.
Extracting Text from Impure Formats
Text often comes in formats that we cannot directly make use of. It is necessary
first to extract plain text. Here are programs that can extract plain text
from various impure formats.
- Base 64 Encoding
- Base64 is a method of encoding
binary data as ASCII text for safe transmission in contexts that
are not 8-bit safe, such as email. It is defined in
RFC 3548, which has since been obsoleted by RFC 4648.
Base64 is a standalone base64
encoder/decoder.
- HTML
- HTML2Text converts HTML to plain text.
In addition to use for its own sake, conversion of HTML to plain text is often
the second phase of conversion from other formats since many converters
try to preserve layout by generating HTML. A program designed specifically to deal
with the baroque HTML generated by Microsoft Word is
Microsoft Word 2002 Unmunger.
- Microsoft Word
- Wvware is a suite of programs that
convert Microsoft Word files to a variety of other formats,
including plain text. Antiword
is a free MS Word reader for Linux and RISC OS, with ports
to FreeBSD, BeOS, OS/2, Mac OS X, Amiga, VMS, NetWare, Plan9,
EPOC, Zaurus PDA, MorphOS and DOS. Antiword can convert files
from Word 2, 6, 7, 97, 2000, 2002 and 2003 to plain text.
- Open Document Format (ODF)
- odt2txt is a simple command-line
tool that extracts plain text from ODF.
OpenOffice.org Writer can read ODF and can
export in a variety of formats including plain text.
- Portable Document Format (PDF)
- PDFtotext extracts plain text from PDF files.
It is part of
the xpdf package, which also provides a PDF file viewer and some other
tools. Commercial software for Microsoft Windows is available at
http://www.pdf2text.com/.
If you are having problems dealing with a PDF file and need to explore
its internal structure,
the PoDoFoBrowser PDF object
browser may come in handy.
Formswift PDF Editor is a PDF
editor. Formswift PDF Converter converts PDF to Word and other formats.
Able2Extract
converts PDF to such other formats as Word, Powerpoint, Publisher, and AutoCad.
It can also convert a scanned PDF to Excel. Runs on Windows, Mac and Linux.
- Postscript
- PSToText extracts
plain text (in the ISO-8859-1 extended ASCII encoding) from Postscript files.
- Rich Text Format (RTF)
- Docfrac converts
from RTF to plain text. It runs on both Microsoft Windows and Unix platforms.
Rtf-converter converts RTF to HTML.
It runs on Linux and Microsoft Windows NT systems and therefore probably on
other platforms as well.
Rtfeeder
also converts RTF to HTML. Since it is a Perl script, it should run on
any platform. On the Macintosh, if you have access to the Apple Developer Tools, they contain
the program convertRichTextToAscii, which actually converts RTF to Unicode.
If there are no non-ASCII characters in the original, the output
will be ASCII. This program is located at
/Developer/Applications/Utilities/FileMerge.app/Contents/Resources/convertRichTextToAscii
if you have the Developer Tools installed.
Xue Brothers offers
an RTF to text converter that runs in a Microsoft Windows shell for a modest price
(currently $5.50). If you can do a little Java programming yourself, the components
necessary are available. Information can be had here.
- TeX
- TTH translates TeX and LaTeX into HTML,
from which it can be converted to plain text by an HTML converter.
- Troff/Nroff/Groff
- Unroff converts
from Troff format to various other formats including HTML.
It is a programmable translator for troff and so can be made to generate whatever
output you like.
- WordPerfect
- Wp2x converts
WordPerfect files to plain text and various other formats.
Unoconv converts among a wide variety
of document formats, including:
BibTeX [.bib],
Microsoft Word 97/2000/XP [.doc],
Microsoft Word 6.0 [.doc],
Microsoft Word 95 [.doc],
DocBook [.xml],
HTML Document (OpenOffice.org Writer) [.html],
Open Document Text [.odt],
Open Document Text [.ott],
Microsoft Office Open XML [.xml],
AportisDoc (Palm) [.pdb],
Portable Document Format [.pdf],
Pocket Word [.psw],
Rich Text Format [.rtf],
LaTeX 2e [.ltx],
StarWriter 5.0 [.sdw],
StarWriter 4.0 [.sdw],
StarWriter 3.0 [.sdw],
OpenOffice.org 1.0 Text Document Template [.stw],
OpenOffice.org 1.0 Text Document [.sxw],
Text Encoded [.txt],
Plain Text [.txt],
StarWriter 5.0 Template [.vor],
StarWriter 4.0 Template [.vor],
StarWriter 3.0 Template [.vor],
and XHTML Document [.html].
It also handles a variety of graphics formats, presentation formats and spreadsheet formats.
The Multivalent Document Tools
can extract text from a number of formats, including PDF and HTML.
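Since HTML is so often the intermediate step on the way to plain text, here is a minimal sketch of HTML-to-text extraction using Python's standard html.parser module. It is not how HTML2Text or any other tool listed above works; the class name TextExtractor and the function html_to_text are invented for this example, and the sketch makes no attempt to preserve layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping script and style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(html_to_text("<p>caf&eacute; <b>au</b> lait</p><script>x = 1</script>"))
```

A real converter would also insert breaks at block-level elements; this sketch simply concatenates the text it finds, so adjacent paragraphs with no whitespace between them run together.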
Regular Expressions and Other Pattern Matching
Tools and Libraries
- AT&T Finite State Morphology Library and Lextools
- Tools for building, combining, optimizing, and searching weighted finite-state acceptors and transducers.
- agrep
- An approximate regular expression matcher. This is the older of two approximate
regular expression matchers, sometimes referred to as Wu-Manber agrep
after its original authors. The source code for the Unix version is available
here.
Another agrep is provided as part of the
TRE regular expression package.
- Bison
- A parser generator. The input consists of a context-free grammar in a notation similar to
BNF together with associated code. This is the GNU implementation of the
classic Unix YACC. It is designed to work well with Flex but may
be used separately. PyBison is
a Python interface to Bison.
- CL-PCRE library
- A Perl-compatible regular expression library for Common Lisp.
- CHSM
- A code generator in the tradition of yacc and bison
that generates Concurrent Hierarchical State Machines.
The machines are described in a statechart specification language and annotated
with code in either C++ or Java. The generated code is fully object oriented,
allowing multiple machines to exist concurrently. The CHSM run-time library
is small, efficient, and thread-safe.
- Daciuk's Finite State Automaton Utilities
- A variety of tools for working with finite state automata and transducers.
- Dia2fsm
- A tool that takes as input a diagram of a finite state machine
in dia format and generates C or C++ code implementing it.
- Finite State Automata Utilities
- A collection of utilities for manipulating regular expressions,
finite-state automata and finite-state transducers.
Manipulations include automata construction from regular expressions,
determinization, minimization, composition, complementation, intersection,
and Kleene closure. Various visualization tools are available
for browsing finite-state automata. Interpreters are provided to apply
finite automata. Finite automata can also be compiled into stand-alone C programs.
- Band2XML
- Many lexical databases have been compiled in the format used by Robert Hsu's Lexware programs, known as band format. Those wishing to process such
dictionaries using other tools may find it useful to convert them to XML. Band2XML.exe performs this conversion.
- Flex
- A lexical-analyzer generator. This is the GNU implementation of the
classic Unix Lex.
- Glark
- Glark adds several special features to regular-expression matching facilities very similar to those of grep.
It allows Boolean combinations of search predicates
and it allows specifications of how far apart (in lines) the matches
to different parts of a Boolean must be. It is possible, for instance, to ask for
the set of lines containing both A and B no more than K lines apart.
Glark also provides optional color highlighting of matches, allows the user to specify
how much context to provide for matches (e.g., "show me the six lines surrounding a match")
and allows for considerable control over multi-file searches and what information
they produce (e.g. name of matching file only, name and matching lines, etc.).
- Grail
- A symbolic computation environment for finite-state machines, regular expressions, and other formal language theory objects.
- Groningen Finite State Automaton Utilities
- A collection of utilities to manipulate regular expressions, finite-state automata and finite-state transducers.
- grep
- GNU grep. For another kind of grep try here
- HyperLex
- A system for performing feature-based regular expression searches on lexical databases.
- Kiki
- A front end to the Python re module for testing regular expressions
against a sample text that provides extensive output about the results,
including highlighting of groups within a match.
- Kodos
- A tool for creating, testing and debugging regular expressions for the Python
programming language.
- Kregexpeditor
- A graphical tool for constructing regular expressions in a fashion somewhat like
a diagram editor. Generates regular expressions in the syntax of either the Qt windowing
toolkit or emacs. This is part of the KDE package and so does not have its own website
for downloading.
- Levenshtein
- A Python library for computing various measures of string similarity (Levenshtein,
Hamming, Jaro, Jaro-Winkler) and related functions, such as applying edits.
- Match
- A library callable from C, C++, and Ada that provides a pattern matcher
inspired by that of SNOBOL4.
- monq.jfa
- A Java class library for finite state automata. Unlike the standard java.util.regex,
which provides only recognizers and substitution, it allows actions to be bound to regular
expressions so that the action is performed whenever the regular expression is matched.
- Nooj
- NooJ is both a corpus processing tool and a linguistic development
environment: it allows linguists to formalize several levels of linguistic
phenomena: orthography and spelling, lexicons for simple words, multiword
units and frozen expressions, inflectional, derivational and productive
morphology, local, structural syntax and transformational syntax. For each
of these levels, NooJ provides linguists with one or more formal tools
specifically designed to facilitate the description of each phenomenon, as
well as parsing tools designed to be as computationally efficient as
possible. This approach distinguishes NooJ from most computational
linguistic tools, which provide a single formalism that should describe
everything. As a corpus processing tool, NooJ allows users to apply
sophisticated linguistic queries to large corpora in order to build indices
and concordances, annotate texts automatically, perform statistical
analyses, etc.
- PCRE library
- Perl compatible regular expression library.
- Pmatch
- A regular expression matching tool similar to grep but based on the PCRE
library and with highlighting of matches and display of surrounding lines.
- PC-KIMMO
- Implementation of Kimmo Koskenniemi's Two-Level Morphology.
- QFSM
- A graphical tool for designing finite state machines.
- Ragel State Machine Compiler
- Ragel compiles finite state machines from regular languages into C, C++, or Objective-C code.
It allows the programmer to embed actions at any point in a regular language.
- Redet [Regular Expression Development and Execution Tool]
- Redet allows the user to construct regular expressions and test them against input
data by executing any of more than 40 search programs, editors, and programming languages
that make use of regular expressions or similar patterns. Redet is written in Tcl, which is
therefore always available. Other matchers are executed as child processes
if they are available on the user's system. When a suitable regular expression has been constructed
it may be saved to a file. For each program, a palette showing the
available regular expression syntax is provided. Selections from the palette may be copied
to the regular expression window with a mouse click. Users may add their own definitions
to the palette via their initialization file. So long as the underlying program
supports Unicode, redet allows UTF-8 Unicode in both test data and
regular expressions. Although the primary function of Redet is to provide
a convenient interface to the actual regular expression tools, it also provides some
extensions of particular interest to linguists. Redet allows you to define
your own named character classes and provides a notation for taking their intersection.
Together, these two capabilities make it possible to perform searches on feature matrices.
- re_graph
- Given a regular expression draws a diagram of the corresponding finite state automaton.
- The Regex Coach
- A tool for experimenting with regular expressions. It can single-step through the
matching process as performed by the regex engine and can show a graphical representation
of the regular expression's parse tree. Uses Perl-style regular expressions.
- Regex Test
- Given a file of sample text, displays the text and allows the user to enter
regular expressions. As the user types, it matches the regular expression against
the sample text and highlights the matching portions.
- Regexopt
- A program that takes as input a regular expression (in a large subset of
Perl syntax) and produces a more compact equivalent regular expression.
- Sed
- The standard Unix stream editor. It provides regular expression searches and
substitutions. The GNU sed manual is available at
this site.
The source code may be had here.
There are quite a few versions of sed available, with implementations for a wide
variety of architectures and operating systems. Links to various versions are available
here together
with links to debuggers, tutorials, and other information.
If you find sed too complicated and just want to replace fixed strings,
you might try replace.
- Sgrep
- A tool for searching and indexing text, SGML, XML and HTML files and filtering text
streams using structural criteria.
- Sgrep
- A stanza grep tool, which is a more general interface into searching through
IOS configurations (or any file that has a 'stanza'-like format).
sgrep also can match ip addresses, and even match ip addresses inside a subnet.
- Ssed - Super Sed
- This is an enhanced version of the standard Unix stream editor sed.
It provides extended regular expression syntax and large increases in speed in certain cases.
- State Machine Compiler
- Given a file containing a description of a finite state machine in a simple
language, generates code for implementing the machine in
C, C++, C#, Java, Perl, Python, Ruby, Tcl, and VB.net.
- Stuttgart Finite State Transducer
- A toolbox for the implementation of morphological analyzers and other tools based on finite state transducer technology. This is the closest non-proprietary equivalent to the
Xerox Finite State Calculus.
- Theo
- A simulator for finite automata and Turing machines. It is written in Java and so is available for most systems.
- TRE regexp library
- A library implementing an efficient new algorithm, with C and Python bindings. In addition
to classical syntax it provides some GNU and Perl extensions. It also provides
approximate matching and allows costs to be set in-line, individually for each
group. Wide (UTF-32) and multibyte (UTF-8) characters are supported.
An approximate grep command called agrep using the library is also supplied.
This version of agrep is largely compatible with the
older Wu-Manber agrep at the command-line level but
is more powerful in some respects.
- Txt2regex
- Txt2regex is a regular expression wizard that converts human sentences to regexes.
In a simple interactive console interface, the user answers questions and the
program constructs the regular expression. Over 20 programs are supported.
- Xerox Finite State Calculus
- The lexc lexicon compiler and xfst rule compiler. These compile into finite state automata.
- XFA
- A C library for creating non-deterministic finite state automata, either programmatically
or from regular expressions and for converting them to the minimal equivalent deterministic
finite state automaton.
- Xmlgrep
- A command-line utility that matches regular expressions against strings with XML markup.
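To give an idea of what an interactive tester such as Regex Test does, here is a minimal sketch in Python. It is not taken from any of the programs listed above; the function name highlight and the bracket markers are my own choices for illustration. It matches a regular expression against a sample text and marks the matching portions.

```python
import re

def highlight(pattern, text, open_mark="[", close_mark="]"):
    """Return text with every non-overlapping match of pattern wrapped in markers."""
    regex = re.compile(pattern)
    pieces = []
    last = 0
    for m in regex.finditer(text):
        if m.start() == m.end():
            continue  # ignore empty matches, which have nothing to highlight
        pieces.append(text[last:m.start()])               # unmatched text before the match
        pieces.append(open_mark + m.group(0) + close_mark)  # the match itself, marked
        last = m.end()
    pieces.append(text[last:])                            # unmatched tail
    return "".join(pieces)

if __name__ == "__main__":
    sample = "the cat sat on the mat"
    # Words ending in "at"
    print(highlight(r"\b\w+at\b", sample))
```

Python's re module uses a Perl-style syntax, so patterns tried out this way will not necessarily behave identically in sed or grep.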
Tutorials
Miscellaneous
UNIX provides a number of tools that make it easy to extract information from text
and format text in ways useful for linguistic research without having to do any
real programming. These tools are now available not only on Unix systems but
on many other systems, and they are available at no cost and open-source.
You can obtain the source code for the GNU versions of these tools by
downloading the GNU coreutils package here.
Native MS Windows ports of most of them are available from the
Unixutils project.
(Note: do not be dismayed if you do not see a program in which you are interested
in the list of programs on the Unixutils site. Their "program" list is actually a
list of packages. Most of the programs of interest belong to the
textutils package, which is provided by the Unixutils project.)
Versions of these tools that run under Microsoft Windows in a Unix-like
environment can be obtained from Cygwin.
Here is a list of the most useful UNIX tools with links to the GNU documentation:
And here are some on-line lecture notes that describe the use of the same tools.
A classic tutorial is the handout for
Ken Church's talk "Unix for Poets",
which you can download
here[Postscript file].
The TDH Utilities are utilities of a similar nature intended
to provide some capabilities not available with the standard Unix utilities.
The Heirloom Toolchest
is a full set of classic Unix tools that handle
Unicode.
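A typical use of these tools is to chain a few of them together to produce a word frequency list. For readers who want to see the logic spelled out, here is a rough Python equivalent. It is only a sketch: the function names are hypothetical, and the definition of a word as a run of letters is an assumption that will not suit every language or orthography.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased words, where a word is a run of Unicode letters."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

def top_words(text, n=10):
    """Return the n most frequent words, ties broken alphabetically."""
    counts = word_frequencies(text)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

if __name__ == "__main__":
    sample = "The cat and the dog and the bird"
    for word, count in top_words(sample, 3):
        print(count, word)
```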
- Synpathy
- A tool for manual syntactic annotation.
Available for GNU/Linux, Mac OS X, and Microsoft Windows.
- Syntactica
- A tool for creating grammars and viewing syntactic structures.
- TreeDraw
- Software for drawing syntax trees.
- CLaRK
- An XML-based system for corpus development.
- Computational Linguistics Toolset
- A set of Perl programs for cleaning, splitting, refining, and taking samples from corpora
(ICE, Penn, and a native one), for tagging them using the TnT-tagger, for doing
permutation statistics on N-grams (useful for finding statistically significant
syntactical differences between any two sets of tagged texts), and for examining corpora
in various ways.
- Corpus Mailing List Home Page
- This page contains information about how to subscribe to the corpus mailing list,
the list archives, and links to other resources.
- Corpus Tool
- A tool for the annotation of text corpora. MS Windows and Macintosh only.
- DDC Linguistic Search Engine
- The search tool developed for the DWDS Corpus of German.
- emdros
- A free database for analyzed or annotated text.
emdros runs on Unix systems, including GNU/Linux, and
on some versions of Microsoft Windows. The query language
is probably the most sophisticated and powerful query language
available for searching annotated text.
Language Documentation and Conservation review.
- MonaSearch
- MonaSearch is a query tool for linguistic treebanks. The
query language of MonaSearch is monadic second-order logic, an extension
of first-order logic capable of expressing probably all linguistically
interesting queries. In order to process queries efficiently, they are
compiled into tree automata. A treebank is queried by checking whether the
automaton representing the query accepts the tree, for each tree.
The implementation includes a graphical user interface to facilitate
the composition of queries and the interaction with treebanks.
MonaSearch runs on all major platforms.
- Middle English Corpus Home Page
- Includes links to annotation and Corpus Search documentation.
- Phonological Corpus Tools
- Tools for studying phonological properties of large corpora.
- UPLUG
- Tools for linguistic corpus processing, word alignment and term extraction
from parallel corpora.
- York Corpus Search Lite Manual
-
Much useful data can be obtained from web sites. In a way, the web is a gigantic collection of
electronic corpora. In general, any material published on a web site is
fair game for research use as this constitutes "fair use" under the law of the
United States and many other jurisdictions. There are, however, legal and ethical
issues concerning redistribution of material obtained on the web. Furthermore,
there are legal and ethical issues involved in obtaining material available
on the web but not intended by the owner to be stored or converted to another format.
For example, if something is made available only as streaming audio, this may be
because the provider does not want you to be able to store a copy. This raises
the question of whether you may legally and ethically do so.
Some useful sources of information are:
- Bitlaw
- A site created by an intellectual property lawyer containing over 1,800 pages
dealing with all aspects of intellectual property law.
- Copyright and Fair Use
- Lots of information on fair use of copyrighted material provided by the
Stanford University Library system.
- The Electronic Frontier Foundation
- An organization dedicated to the preservation of freedom on the web.
Its web site contains information about such topics as copyright law, file-sharing,
and digital rights management.
Here are some useful tools:
- clive
- A tool for downloading videos from sites like YouTube and Google Video.
- curl
- A tool for transferring files with URL syntax.
- DataparkSearch
- A web indexing and search tool.
- Getleft
- Similar to curl but with a graphical interface.
- H2Text
- Strips HTML from a file, leaving pure text. (To do this, use the command line:
h2text -nc -t < <input file name> > <output file name>.)
- Linguist's Search Engine
- A tool for performing syntactic searches on internet data.
- wget
- Automatically downloads web sites, recursively if so desired.
Has oodles of options allowing detailed specifications of what links to
follow and what kinds of files to download. It is possible, for example,
to specify that text files should be downloaded but image files should
not be.
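Once pages have been downloaded, the markup usually has to be removed before the text can be used as a corpus. The following sketch shows the idea using only the Python standard library. It is not how H2Text works internally, which I have not examined; the class and function names are hypothetical, and the decision to discard the contents of script and style elements is an assumption about what is wanted.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping scripts and styles."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of html with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

if __name__ == "__main__":
    print(strip_html("<p>Hello &amp; <b>welcome</b></p>"))
```

Joining the pieces with spaces keeps adjacent paragraphs apart, but it also inserts a space where markup occurs inside a word, so the output may need checking.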
Problems sometimes arise in obtaining audio data from the web. You may find
these lecture notes on audio data helpful, especially this section.
- EnRus Dictionary Tools
- Tools that provide a nice graphical interface for dictionaries in a simple plain text
format. It comes with an English-Russian database but is not limited to this
database or these languages.
- International Bibliography of Lexicography
- A bibliography on lexicography.
- Kirrkirr
- Kirrkirr is a research project exploring the use of computer
software for automatic transformation of lexical databases
("dictionaries"), aiming at providing innovative information
visualization, particularly targeted at indigenous languages.
- Kura
- A multi-user open-source linguistic database especially geared towards language
description. Fully Unicode compatible. Runs on Unix (including GNU/Linux and
Mac OS X) and MS Windows systems.
- Lexica
- A dictionary interface.
- Lexique Pro
- An interactive lexicon viewer and editor, with hyperlinks between entries, category views, dictionary reversal,
search, and export tools.
- Lexware
- Robert Hsu's Lexware software.
- Open Lexicon Interchange Format
- An XML-based standard for interchange of electronic lexica.
- Owl
- A dictionary interface for dictionaries written in dicML, an XML-based
markup language. The dicML standard and some large bilingual dictionaries are also
available on this site.
- S-Dictionary
- A dictionary program for which quite a few dictionaries
are available. Runs on Unix, MS Windows, and Symbian systems (for mobile phones).
- Sheetswiper
- Converts spreadsheets to Standard Dictionary Format files compatible
with such programs as LexiquePro.
The download is on the Files tab.
- Shoebox
- A widely used lexical database, now superseded by Toolbox. Lexique Pro is a viewer for data in this format. There is a program called Econv
for converting among Shoebox,
Transcriber, and Elan files.
- Sqlite
- A free, public-domain relational database that can be used directly or via bindings to
C and Tcl. Wrappers exist for additional programming languages, including:
D, Common Lisp, Haskell, Java, JavaScript, Lua, Objective C, Objective Caml, Perl, PHP, Pike,
Python, R, Ruby, Scheme and Squeak.
Sqlite supports Unicode text. It is easy to install, has a very small footprint,
and is available for all major platforms, including Mac OS X, Linux, and MS Windows.
- Toolbox
- A data management and analysis tool for field linguists. It is especially useful
for maintaining lexical data, and for parsing and interlinearizing text,
but it can be used to manage virtually any kind of data. It is the successor
to Shoebox. Help is available from the
user group.
There is a program called Econv
for converting among Shoebox,
Transcriber, and Elan files.
- WeSay
- Helps native speakers of a language to compile a dictionary with
only limited assistance from a linguist or programmer.
A useful review of graphical front ends to the MySQL database is:
http://www.databasejournal.com/features/mysql/slideshows/top-10-mysql-gui-tools.html.
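To show how little is needed to keep lexical data in a relational database, here is a minimal sketch that uses Sqlite through the sqlite3 module in Python's standard library. The table layout (headword, part of speech, gloss) and the function names are assumptions made for illustration, not a recommended schema, and the three Russian entries are just sample data.

```python
import sqlite3

def build_lexicon(entries):
    """Create an in-memory database holding (headword, pos, gloss) entries."""
    conn = sqlite3.connect(":memory:")  # use a file name instead to keep the data
    conn.execute(
        "CREATE TABLE lexicon (headword TEXT NOT NULL, pos TEXT, gloss TEXT)"
    )
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn

def glosses_for(conn, headword):
    """Return all glosses recorded for headword, in alphabetical order."""
    rows = conn.execute(
        "SELECT gloss FROM lexicon WHERE headword = ? ORDER BY gloss",
        (headword,),
    )
    return [row[0] for row in rows]

if __name__ == "__main__":
    sample = [
        ("дом", "n", "house"),
        ("вода", "n", "water"),
        ("идти", "v", "to go"),
    ]
    conn = build_lexicon(sample)
    print(glosses_for(conn, "вода"))
    conn.close()
```

Passing the headword as a parameter rather than pasting it into the SQL string avoids quoting problems with apostrophes and other special characters in the data.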
- aConCorde
- A multilingual concordancing program with particularly good support for
Arabic and a choice between English and Arabic interfaces. Runs on all major platforms.
- Conc
- The Summer Institute of Linguistics' concordancing program for the Macintosh.
- Corpora og Konkordansprogrammer Konkord
- Hans Klarkov Mortensen's page on concordances, in Danish.
- DDC-Concordance
- A search engine for linguists. It lets you search
for words or sequences of words together with morphological
patterns.
- Diff
- A program that finds differences between two text files.
- Monoconc and Paraconc
- Commercial concordance software for the Macintosh and Microsoft Windows.
Paraconc handles bilingual parallel corpora.
- MultiLingProfiler
- A vocabulary profiling tool for French, German, and Spanish.
- TextSTAT
- TextSTAT is a simple programme for the analysis of texts. It reads plain text
(in different encodings) and HTML files (directly from the internet) and produces
word frequency lists and concordances from these files.
- Uplug Corpus Tools
- A collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora.
- Xmldiff
- A program that finds differences between two similar XML files.
- Y-Sets
- Computes statistics about words common to two texts.
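For those who have not used a concordancer, the following sketch shows the basic keyword-in-context operation that such programs perform. It is a simplified illustration, not the method of any program listed above: the function name kwic is hypothetical, tokens are taken to be runs of word characters, and matching is on lower-cased whole tokens only.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Contexts contain up to `width` tokens on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog saw the cat."
    for left, key, right in kwic(sample, "cat", width=2):
        # Right-align the left context so that the keywords line up
        print(f"{left:>12}  {key}  {right}")
```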
- ALingua
- ALingua simulates the evolution of a two-language system
in a finite population. In particular, it allows one to examine the spatial dynamics
of such a system given a set of initial
conditions: a distribution of agents, a network defining connections between them,
and a language learning algorithm with associated parameter settings.
Pragmatically, comparisons between the
outcome of simulations and empirical results from historical linguistics will
facilitate the search for satisfactory theories of diachronic language change.
- Computational Phylogenetics in Historical Linguistics
- Publications and software from the CPHL project.
- Epigrass
- An epidemic simulator, possibly of use for the study of the spread of linguistic change.
- Etymo
- A program for modelling sound change.
- IPA Zounds
- A program for modelling sound change.
- Jsesh
- An editor for Egyptian hieroglyphics.
- LingPy
- A Python library for quantitative historical linguistics.
- Phono
- A program for modelling sound change.
- Sounds
- A program for modelling sound change.
- TreeView
- A viewer for phylogenetic trees. Intended for biologists, but useful for linguists too.
- WordCorr
- A set of tools for finding regular phonological correspondences.
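Programs for modelling sound change generally apply an ordered list of rules to each word in a lexicon. The sketch below illustrates that general idea with regular expression substitutions in Python. It does not reproduce the rule notation or the workings of any of the programs listed above, and the rules and forms are invented for the example.

```python
import re

# Each rule is (description, pattern, replacement). These rules are invented.
RULES = [
    ("intervocalic voicing of p", r"(?<=[aeiou])p(?=[aeiou])", "b"),
    ("intervocalic voicing of t", r"(?<=[aeiou])t(?=[aeiou])", "d"),
    ("loss of final e", r"e$", ""),
]

def apply_rules(word, rules=RULES):
    """Apply the rules in order and return every intermediate form."""
    history = [word]
    for _description, pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
        history.append(word)
    return history

if __name__ == "__main__":
    print(" > ".join(apply_rules("kapate")))
    # Reordering the rules changes the outcome, which is why order matters.
    reordered = [RULES[2], RULES[0], RULES[1]]
    print(" > ".join(apply_rules("kapate", reordered)))
```

With the rules in the order given, the hypothetical form kapate becomes kabad. If the loss of the final vowel applies first, the t is no longer between vowels and the result is kabat.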
- Goldvarb
- A program for variable rule analysis. Available for MS-DOS, MS Windows, and Mac OS (classic).
- Plotnik
- A tool for making vowel plots showing the dispersion of vowel tokens in the vowel space.
For Mac OS X.
- R-Varb
- A package for variable rule analysis along the lines of Varbrul but implemented in R and therefore available for most versions of Unix, MS Windows, and Mac OS X.
- Social Networks Visualizer
- A tool that allows the user to draw, visualize, and layout social networks.
- Amadeus Pro
- A general purpose audio editor with a number of analysis functions and, in particular,
the ability to fragment an audio file automatically by positioning window edges
at transitions between sound and silence. This is closed-source proprietary software
but the license is inexpensive and a trial version may be freely downloaded.
It runs only on Mac OS X. I have not personally used this software but rely
on the company's description and that of a colleague who has used it. I mention
it here because its automatic fragmentation capability is unusual and because
it runs on Macs with Intel processors, which SndBite does not.
- Audacity
- An audio editor that runs on all major platforms. It is oriented more
toward music than phonetics and has only limited analysis tools. However, it
can do noise reduction and has some useful editing tools, including the ability to
modify individual samples and to change the amplitude envelope.
- AudioSpace
- An audio storage calculator. Given the duration of an audio recording, it calculates
the required storage in any of a variety of units, for uncompressed audio or a number of
types of compression. It also works in the other direction: given the available amount
of storage, it will compute the maximum duration of the audio that it will hold.
Runs on all major platforms.
- Autovot
- Trainable software for automated measurement of Voice Onset Time.
- Bartek Plichta
- This site provides expert advice on microphones, recorders, and related
topics, including informative reviews of particular devices.
- Ecasound
- A command-line tool for recording audio, playing it back and performing format conversions
and mixing. It can also carry out some kinds of analysis. It is free software and runs
on all major and some not-so-major platforms. Several graphical interfaces are available.
- Elan
- A tool for transcription and annotation of both audio and video.
There is a program called Econv
for converting among Shoebox,
Transcriber, and Elan files.
Available for GNU/Linux, Mac OS X, and Microsoft Windows.
- Emu
- A collection of tools for the creation, manipulation and analysis of speech databases.
At the core of EMU is a database search engine which allows the researcher
to find various speech segments based on the sequential
and hierarchical structure of the utterances in which they occur.
EMU includes an interactive labeller which can display spectrograms
and other speech waveforms, and which allows the creation of hierarchical
as well as sequential, labels for a speech utterance.
- Exmaralda
- A suite of programs for transcription and annotation of speech and the construction
of spoken language corpora.
- Festival
- A multi-lingual speech synthesis system that runs on all major platforms.
- Libsndfile
- libsndfile is a C library for reading and writing audio files.
As such it is of interest only to programmers,
though you may find that you need to install it for other software to work
even if you do no programming yourself.
However, it comes with a number of example programs, several of which are of general
utility. These are sndfile-info, which extracts information from sound files,
sndfile-play, which plays sound files,
and sndfile-convert, which converts files from one format to another.
sndfile-play and sndfile-convert can handle some formats not supported
by Sox. These include 24 bit PCM data and the mat formats used by
Matlab and Octave.
- Paradigm
- Paradigm helps you to design and run psycholinguistic experiments with
millisecond-level accuracy. It is scriptable in Python. It runs only on
Microsoft Windows. This is non-free commercial software but the current beta version
is downloadable at no cost.
- Linux Sound
- An extensive listing of software for Linux. The orientation is
toward music, but there are quite a few items of interest to linguists.
- Nyquist
- A sound synthesis system in the form of an object-oriented dialect of Lisp
with primitives for both events (as in typical musical score synthesis systems)
and signal synthesis.
- Phontools
- An R package which contains functions intended to facilitate the organization, display, and analysis of the sorts of data frequently encountered in phonetics research and experimentation.
- Praat
- Praat is a "research, publication,
and productivity tool for phoneticians." It includes a comprehensive set
of capabilities, usable both interactively and via a scripting language.
- Shntool
- A command-line utility for handling a wide range of lossless compression formats.
It can convert from one format to another, provide information about the contents
of a file, split and join files, etc.
- Shva
- Shva ("speech hear view and annotate") is a Web GUI for aligning linguistic
annotation with the acoustic signal. It runs only on GNU/Linux.
- SndBite
- SndBite is a specialized audio editor, designed for breaking large recordings
into smaller components with great efficiency. Special features include:
multiple simultaneous views of the waveform at different resolutions;
the ability to position window edges at transitions between sound and silence;
automated setting of cut points at zero-crossings;
automatic filename generation easily controlled by the user;
optional automatic playback on window motion;
and logging of each write, so that a record exists both for the long term
and in case (as often happens) the user loses his or her place while segmenting a large file.
SndBite runs on all major platforms but does not currently run on Macintosh computers
with Intel processors because of the unavailability of the Snack audio library on that
platform. Those desiring the ability to fragment files automatically at speech/silence
transitions on Intel Macs should consider Amadeus Pro.
- SoundIndex
- A transcription tool combining an XML editor with an audio display.
- SoX
- SoX ("Sound eXchange") is the Swiss Army Knife
of the phonetics lab. It is a command line utility that can convert various formats
of computer audio files to other formats, also changing sampling rate and performing
some other modifications as instructed. The downside to this program is its complex
and peculiar command-line syntax. Here is a brief tutorial that will show you how to do most
of the more common tasks.
- TranscriberAG
- This is a program for creating transcriptions (typically orthographic) of sound
recordings, time linked (typically at the phrase level) to a digital audio file.
It will conveniently deal with long recordings -- an hour or more.
The handling of audio I/O and waveform displays is based on the
Snack
sound toolkit. TranscriberAG is the successor to Transcriber.
There is a program called Econv
for converting among Shoebox,
Transcriber, and Elan files.
- Wavesurfer
- This is a simple but fairly powerful program for interactive display of waveforms,
spectrograms, pitch tracks and transcriptions (phonetic, orthographic etc.).
It does not do everything that Praat does, but is easier for novices
to learn to use.
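The storage calculation that a tool like AudioSpace performs is simple for uncompressed audio: the number of bytes is the duration multiplied by the sampling rate, the number of bytes per sample, and the number of channels. Here is a minimal sketch in Python. The function names and default values (CD-quality stereo) are my own choices, file headers are ignored, and compressed formats are not covered.

```python
def pcm_bytes(duration_s, sample_rate=44100, bits_per_sample=16, channels=2):
    """Bytes needed for uncompressed PCM audio of the given duration in seconds."""
    bytes_per_second = sample_rate * channels * bits_per_sample / 8
    return duration_s * bytes_per_second

def max_duration_s(available_bytes, sample_rate=44100, bits_per_sample=16, channels=2):
    """Longest uncompressed PCM recording, in seconds, that fits in available_bytes."""
    bytes_per_second = sample_rate * channels * bits_per_sample / 8
    return available_bytes / bytes_per_second

if __name__ == "__main__":
    one_hour = pcm_bytes(3600)
    print(f"One hour of 16-bit 44.1 kHz stereo: {one_hour / 1e6:.1f} MB")
    # A mono field recording at 22.05 kHz needs a quarter of the space.
    print(f"Same hour, mono at 22.05 kHz: {pcm_bytes(3600, 22050, 16, 1) / 1e6:.1f} MB")
```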
- FreeMat
- FreeMat is a free environment for rapid engineering and scientific
prototyping and data processing. It is similar to commercial systems such as MATLAB
from Mathworks, and IDL from Research Systems.
- GnuPlot
-
Gnuplot is a free, command-driven, interactive, function and data plotting program.
Roughly speaking, if you can write a mathematical expression in C or Fortran,
Gnuplot will plot it for you. It will also draw plots from data, or combined plots
showing both data and theoretical calculations.
GnuPlot runs on just about every platform you've ever heard of and some you probably
haven't, including not only Unix, DOS, Microsoft Windows, and Mac OS, but OS/2, VMS, and Atari.
A manual or tutorial is available in:
Czech, French, German, Indonesian, Italian, Japanese, and Portuguese.
A nice, short tutorial is available here.
- LabPlot
- LabPlot is a scientific graphics and statistics package similar to
commercial products such as Microcal Origin and SPSS Sigmaplot.
It is fully scriptable and runs on most variants of Unix including Mac OS X.
- Lush
- A language designed particularly for large-scale numerical and graphical programming.
Lush is an object-oriented dialect of Lisp. It has a huge library of numerical routines
and many other libraries, including one for probabilistic finite state automata.
It runs on all major platforms.
- Octave
- Octave is a free-software emulation of Matlab.
It is largely compatible with Matlab 4.X but not with Matlab
5.X. Some interesting free octave/matlab toolboxes are available
here.
- PSPP
- A free clone of SPSS, widely used in the social sciences. PSPP is
not as mature or sophisticated as R, but is considered
easier to use for non-programmers.
- R
- R is a free-software implementation of the improved
version of the S statistics language, whose proprietary implementation goes by the
name of "Splus".
A nice, short, and simple introduction to R in the form of a web page can be found at:
http://lib.stat.cmu.edu/R/CRAN/doc/contrib/kickstart/index.html.
A more detailed introduction to R, that devotes more attention to teaching
statistics at the same time, is:
http://www.math.csi.cuny.edu/Statistics/R/simpleR/
An Introduction to R (about 100 pp.) can be downloaded in PDF format.
To download a copy, click here.
The R FAQ
contains answers to many frequently asked questions.
The full R Reference Manual, currently 1144 pages, can be downloaded in PDF
format. To download a copy,
click here.
An introduction to R aimed specifically at linguistics is
Analyzing Linguistic Data: A Practical Introduction to Statistics using R.
A page containing lots of useful information about R is:
http://finzi.psych.upenn.edu.
A repository of code and datasets for S and Splus, most of which will also run under R, can be found at http://lib.stat.cmu.edu/S/.
You can get the current sources
from: http://lib.stat.cmu.edu/R/CRAN/banner.shtml.
You can also download precompiled binary versions from the following sites:
- GNU/Linux
-
http://lib.stat.cmu.edu/R/CRAN/bin/linux/
- Mac OS X
-
http://lib.stat.cmu.edu/R/CRAN/bin/macosx/
- Mac OS 8.6-9.1
-
http://lib.stat.cmu.edu/R/CRAN/bin/macos/
- Microsoft Windows
-
http://cran.r-project.org/bin/windows/base/
- Discrete Event Calculus Reasoner
- Performs automated commonsense reasoning using the event calculus,
a comprehensive and highly usable logic-based formalism.
It solves problems efficiently by converting them into satisfiability (SAT) problems.
- Molle
- A cross-platform modal logic prover.
- Projective Discourse Representation Theory Sandbox
-
PDRT-SANDBOX is a Haskell library that implements Discourse Representation Theory (DRT),
and its extension Projective Discourse Representation Theory
(PDRT). The implementation includes a translation from PDRT to DRT
and First-order Logic, composition via different types of
merge, and unresolved structures based on Montague Semantics,
defined as Haskell functions.
- WordNet-Similarity
- A collection of Perl modules for the WordNet system.
They are designed as object classes with methods that take two word
senses as input and return the semantic relatedness of these word senses.
- SignStream
- A Macintosh program for annotating recordings of American Sign Language.
- An Gramadóir
- An open source grammar checking engine, intended as a platform
for the development of sophisticated natural language processing tools
for languages with limited computational resources.
- BBE
- Bbe is a sed-like stream editor for binary files.
Instead of reading input in lines like sed, bbe reads arbitrary blocks from an
input stream and performs byte-related transformations on selected blocks.
Blocks can be defined using start/stop strings,
offset in the stream and block length, or a combination thereof.
Basic editing commands include delete, replace, search/replace,
binary operations (and, or, etc.), append, and Binary Coded Decimal/ASCII conversion.
For examining the input stream,
it contains some grep-like features such as printing the input file name,
stream offset, and block number of the selected blocks.
Block contents can also be printed in different formats such as hex,
octal, ASCII, and binary.
- Emacs
- Emacs is a very powerful text editor with a number of features that
make it especially useful for linguistic research. Emacs allows the screen to be
divided into multiple windows, both vertically and horizontally.
One can, for example, split the screen vertically so as to view different versions of a
text in parallel. It provides full regular expression search and substitution facilities.
Since emacs is implemented on top of a LISP interpreter, to which full
access is available, emacs is fully programmable, not in a feeble
extension language but in a full-fledged programming language.
- Flat File Extractor
- Flat File Extractor is a parser for flat file databases. Using specifications of the
structure of the input file and of the desired output format, it parses its input
and writes it back out in another format. Among other things, it can convert other
formats to XML.
- FreeLing
-
An open source language analysis tool suite covering:
text tokenization,
sentence splitting,
morphological analysis,
named entity detection,
date/number/currency/ratio recognition,
PoS tagging,
chart-based shallow parsing,
contraction splitting,
physical magnitude detection (speed, weight, temperature, density, etc.),
named entity classification,
WordNet based sense annotation,
and dependency parsing. It comes with data for Catalan, Spanish, Italian, Galician, and
English. In C++.
- GTick
- A software metronome, useful, for example, for syllable-breaking tasks.
For Unix-ish systems.
- Linguistic Tree Constructor
- Linguistic Tree Constructor is a program for drawing syntactic
trees. It allows the user to create trees for large amounts of text quickly.
"Generic" trees as well as Role and Reference Grammar and X-Bar trees
are supported, as is exporting to Annotation Graph XML format.
Printing and copying parts of the tree to clipboard are supported.
LTC runs on all major platforms.
- Minpair
- A program for finding minimal pairs in a wordlist. It accepts definitions
of multigraphs, allows words to be paired with identifiers, and handles Unicode input.
- Msort
- A sophisticated sorting program, capable of handling multiline records,
locating fields by tags, using arbitrary sort orders with long multigraphs, and various
other things. Fully Unicode-capable.
- Numutils
- A set of command line utilities that may be useful in dealing with numbers in
combination with other Unix utilities. In particular, numgrep permits
filtering by the numerical value of expressions, which can be difficult with
ordinary versions of grep.
- phpSyntaxTree
- A web application that creates syntax tree graphs from phrases
entered in labelled bracket notation.
- PhonoApps
- Programs for writing and testing phonological rules and finding natural
classes of segments with respect to a given feature set.
- GNU Poke
- A program for reading, manipulating, and writing binary data. It provides a language
and data structures for working with binary data at a more reasonable level of
abstraction. Useful for tasks such as extracting data from media or wordprocessor file
formats for which you do not have high-level tools.
- Penn Controller
- A system for controlling on-line psycholinguistic experiments using the IBEX
system.
- Replace
- According to its author, "the sane person's alternative to sed".
replace provides an easier alternative to sed for replacing one or
more strings with another in one or more text or binary files or from standard input.
It works with fixed strings rather than regular expressions but can adapt the
substitution to the case of the original if so desired. An interactive mode
is available in which the user is asked to confirm proposed substitutions.
- SyNTeX
- SyNTeX is a LaTeX preprocessor that draws syntactic trees using the
LaTeX picture environment. The preprocessor reads the comments in a LaTeX
file and draws the tree based on commands that it finds in the comments.
- TDH Utilities
- A suite of programs for handling tabular ASCII data intended to
supplement the standard
Unix utilities. They accomplish nothing that can't be done fairly
easily in a scripting language
like AWK, but some of them provide an easy way to do things that
would take some work in a scripting language. Among these are utilities
for extracting data from specified cells of a
spreadsheet and for cleaning up spreadsheets.
- Tree Draw
- A program for drawing syntactic trees.
- Turk Tools
- A tool for constructing linguistic surveys. "More and more researchers in linguistics use large-scale experiments to test hypotheses about the data they research, in addition to more traditional informant work. In this paper we describe a new set of free, open-source tools that allow linguists to post studies online, turktools. These tools allow for the creation of a wide range of linguistic tasks, including grammaticality surveys, sentence completion tasks, and picture-matching tasks, allowing for easily implemented large-scale linguistic studies. Our tools further help streamline the design of such experiments and assist in the extraction and analysis of the resulting data. Surveys created using the tools described in this paper can be posted on Amazon’s Mechanical Turk service, a popular crowdsourcing platform that mediates between ‘Requesters’ who can post surveys online and ‘Workers’ who complete them. This allows many linguistic surveys to be completed within hours or days and at relatively low costs. Alternatively, researchers can host these randomized experiments on their own servers using a supplied server-side component."
- Wordfreak
- WordFreak is a Java-based linguistic annotation tool designed to support human
and automatic annotation of linguistic data, as well as to employ active learning for human
correction of automatically annotated data.
- WordGenerator
- WordGenerator generates hypothetical words from specifications of
their syllable structure. The user specifies the maximum length of the
words in syllables, the abstract structure of syllables in the
language (in terms of such units as consonants and vowels or onsets
and rhymes), and the actual sounds that comprise each abstract class
(e.g. the list of vowels in the language); WordGenerator then
generates the words that conform to this specification.
- Word generator
- This takes a list of existing words in a language and generates other possible words, with a number of settings you can select to alter the characteristics of the created set.
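As a rough illustration of the syllable-structure approach described in the WordGenerator entry above, the following Python sketch expands abstract syllable templates into concrete syllables and then strings them together. It is not WordGenerator's own code or input format: the function name generate_words, the template notation ("CV"), and the toy sound inventory are all assumptions made for the example.

```python
from itertools import product

def generate_words(classes, syllable_templates, max_syllables):
    """Return every word of 1..max_syllables syllables that fits the specification.

    classes            maps an abstract symbol (e.g. "C") to its list of sounds
    syllable_templates lists the permitted syllable shapes (e.g. "CV", "CVC")
    All three parameter names are hypothetical, chosen for this sketch.
    """
    # Step 1: expand each abstract template into concrete syllables.
    syllables = []
    for template in syllable_templates:
        choices = [classes[symbol] for symbol in template]
        for combo in product(*choices):
            syllables.append("".join(combo))
    syllables = list(dict.fromkeys(syllables))  # drop duplicates, keep order

    # Step 2: concatenate syllables into words of each permitted length.
    words = []
    for n in range(1, max_syllables + 1):
        for combo in product(syllables, repeat=n):
            words.append("".join(combo))
    return list(dict.fromkeys(words))  # different syllabifications can coincide

# A toy language: two consonants, two vowels, CV syllables only.
classes = {"C": ["p", "t"], "V": ["a", "i"]}
words = generate_words(classes, ["CV"], max_syllables=2)
print(len(words), words[:6])
```

The number of words grows exponentially with the maximum length, so a realistic inventory would probably call for a generator that yields words one at a time rather than building the whole list.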
Languages
- AWK
- The classic workhorse of Unix text processing. If the program is
complex, and in particular if sophisticated data structures are needed,
other languages may be preferable,
but many people find Awk easier and quicker to use for relatively
simple programs. One reason for this is that Awk automatically parses its
input into records and fields and uses an unusual pattern-action format.
Some versions of Awk, including GNU Awk, support Unicode.
X(ml)Gawk
is a new derivative of awk that contains an XML parser. Instead of reading
input record by record as in traditional Awk, it reads input node by node.
- Perl
- A scripting language with elaborate regular expression support,
Perl is too widely used not to mention, and its author, Larry Wall,
is a linguist. However, Perl syntax is
arcane and inconsistent, with numerous special cases.
Perl programs are said to be "write only"
because they are so often impossible to understand, even for
the author after some time has passed. It is possible
to write clean, understandable and modifiable
programs in Perl, but you have to work at it.
A good discussion of the problems of Perl and the advantages of
Python over Perl can be found
here.
Another good critique of Perl is
this one.
Your mileage may vary.
- Python
- A general purpose, high-level object-oriented language
with good built-in regular expression operations and strong Unicode
support. Since version 3.3, Python supports Unicode outside the Basic
Multilingual Plane in every build; earlier versions had to be compiled specially to do so.
Python is (in)famous for using indentation rather than
braces or keywords to indicate block structure.
For expositions of the virtues of Python see:
Why I Love Python
and
Why Python?.
- Ruby
- An interpreted, high-level, object-oriented language.
It is similar in many ways to Python, but is more strictly
object-oriented and has a more traditional syntax. Ruby now
supports Unicode, though not as natively and completely
as Python and Tcl.
- Snobol
- Snobol ("StriNg Oriented symBOlic Language") was
the first language focussed on text processing, for which it was widely used
in the 1970s and 1980s. It is no longer commonly used, but still has its devotees and
perhaps deserves a resurgence due to its powerful pattern-matching facilities,
which differ in interesting ways from regular expressions. It is the antecedent of
Icon. See the Wikipedia article
for more information.
- Tcl
- Much of the reputation of Tcl ("Tool Command Language") is due to its
associated windowing and graphics library, Tk,
for which bindings now exist for several other languages,
including Python and Perl. However, Tcl is actually quite a nice language
for text processing. Its unusually simple syntax makes it easy for
beginners to learn. (In fact, many of the problems encountered by those
new to Tcl are due to false assumptions that it is like more familiar
languages. Novice programmers have no such assumptions.) It uses Unicode
as its native character set and has a good set of built-in string manipulation
functions and superior regular expression facilities.
Tcl does not presently support Unicode above the Basic Multilingual Plane.
Due to its original design as an
extension language, Tcl can easily be extended by writing functions in C and
C programs can easily embed Tcl interpreters. Tcl is not object-oriented, but several
object-oriented extensions are available.
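For readers who have not met the pattern-action format mentioned in the Awk entry above, the following Python sketch imitates it: each input line is a record, each record is split into fields, and every rule whose pattern matches has its action applied. The helper awk_like and the sample word-frequency data are invented for this illustration; Awk itself does all of this implicitly.

```python
def awk_like(lines, rules, field_sep=None):
    """Apply (pattern, action) rules to each record, in the manner of Awk.

    field_sep=None splits on runs of whitespace, like Awk's default.
    The function name and signature are hypothetical.
    """
    output = []
    for record in lines:
        fields = record.rstrip("\n").split(field_sep)
        for pattern, action in rules:
            if pattern(fields):
                output.append(action(fields))
    return output

# Roughly equivalent to the Awk program:  $2 > 100 { print $1 }
# Note that Awk's $1 is fields[0] here.
data = ["the 6041", "of 2921", "colour 87", "phoneme 12"]  # made-up frequencies
rules = [(lambda f: int(f[1]) > 100, lambda f: f[0])]
frequent = awk_like(data, rules)
print(frequent)
```

The Awk one-liner in the comment does the same job with no loop or splitting code at all, which is a large part of why many people find Awk quicker for simple tasks.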
Information for Particular Languages
Python
Tcl
HTML
- Cascading Style Sheet specification
- The official specification for Cascading Style Sheets, which allow you to separate
the details of presentation from your HTML.
- HTML Specification
- The full, formal specification of HTML.
- HTML Validator
- The official HTML checker. Submit a URL to it and it will tell you whether
the page conforms to the standard.
- Dillo
- Dillo is a lightweight web browser. It uses very little memory
and starts up very quickly. It doesn't have all the features of a full-fledged
browser - it doesn't display the full range of Unicode characters, for example -
but has a very nice HTML checker. It shows a count of the errors it detects in
your page. If you click on this, it brings up a window listing the errors and their
locations in your HTML file. Once you have corrected them, if you right-click on
the bug meter a menu will pop up offering you the opportunity to fully validate the
page at either W3C or WDG.
- HTML Tidy
- This program detects errors in HTML pages and, if it is clear how to fix them, does so.
If it is not, it generates a warning.
- HTML Introduction
- A set of tutorials.
- HTML Attributes
- A list of all of the HTML attributes.
- HTML Character Entity References
- The character entities are the HTML "names" that can be used to represent characters
that are otherwise difficult or impossible to represent, such as &amp;eacute; for é.
This page provides a complete list.
- HTML Cheat Sheet
- A list of HTML elements.
- HTML Latin-1 Characters
- This page shows how to represent the characters in the ISO-8859-1 (aka "Latin-1") character
set in HTML. This character set includes most of the non-ASCII characters used in Western European languages.
- HTML Color Specifications
-
XML
- LibXML
- An XML parser library with a C API.
- LibXSLT
- An XML transformation library with a C API.
- LXML
- A Python binding for the libxml and
libxslt XML processing libraries.
- Xmldiff
- A program that finds differences between two similar XML files.
- XML Tree
- An XML parser with C++ and Perl APIs.
- Xsltproc/libxslt
- libxslt is a library for transforming XML with XSLT stylesheets, with C and Python bindings.
There is also a convenient command-line tool, xsltproc, that takes as
arguments a stylesheet and an XML file and processes the latter according to the former.
Revised 2022-06-05.
[This date is in the
International Date Format, ISO 8601]