Text


  1. The Representation of Text on a Computer
  2. End-of-Line Conventions
  3. Other Character Encodings Than ASCII
  4. File Types
  5. Document Formats
  6. Modified Files

The Representation of Text on a Computer

Plain text is represented by character codes. These are bit patterns that might equally well be interpreted as numbers. Indeed, they are interpreted as numbers, as when we say "the code for a is 97". What this means is that the bit pattern 01100001, which may be interpreted as the number 97, will, when sent to an appropriate display device, cause the letter a to be displayed.

The most widely used character code is the ASCII (American Standard Code for Information Interchange) code. For many years, most computer equipment assumed this code. For example, computer terminals used to be designed to interpret such codes directly. When the computer sent a byte to the terminal, circuits in the terminal would generate and display the appropriate pattern of pixels. Similarly, until recently, much software contained assumptions based on the ASCII code. For example, lower-case letters would be converted to upper-case letters by subtracting 32 from their character codes. If the ASCII code is in use, this works because the upper-case letters precede the lower-case letters, with six punctuation signs between them, so that each lower-case letter's code is exactly 32 (26 letters plus 6 punctuation signs) greater than that of the corresponding upper-case letter.
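
To see this arithmetic at work, here is a minimal sketch in R (the language used for the examples later on this page; utf8ToInt and intToUtf8 report Unicode code points, which coincide with ASCII codes for these characters):

#Character codes of the letters of "hello"
codes=utf8ToInt("hello")   #104 101 108 108 111
#Subtract 32 from each code to convert to upper case
intToUtf8(codes-32)        #"HELLO"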

The original ASCII system contains 128 codes, ranging from 0 through 127. This means that in the usual binary representation for unsigned numbers, only the low seven bits are used. The high bit (written at the left) is always 0 for ASCII characters. Here is a chart of the ASCII characters. Fuller information can be found here.

  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 tab   10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
 32 sp    33 !     34 "     35 #     36 $     37 %     38 &     39 '
 40 (     41 )     42 *     43 +     44 ,     45 -     46 .     47 /
 48 0     49 1     50 2     51 3     52 4     53 5     54 6     55 7
 56 8     57 9     58 :     59 ;     60 <     61 =     62 >     63 ?
 64 @     65 A     66 B     67 C     68 D     69 E     70 F     71 G
 72 H     73 I     74 J     75 K     76 L     77 M     78 N     79 O
 80 P     81 Q     82 R     83 S     84 T     85 U     86 V     87 W
 88 X     89 Y     90 Z     91 [     92 \     93 ]     94 ^     95 _
 96 `     97 a     98 b     99 c    100 d    101 e    102 f    103 g
104 h    105 i    106 j    107 k    108 l    109 m    110 n    111 o
112 p    113 q    114 r    115 s    116 t    117 u    118 v    119 w
120 x    121 y    122 z    123 {    124 |    125 }    126 ~    127 del

The ASCII characters are divided into printing characters and non-printing characters. The printing characters are those that appear directly on a computer screen or a printout: the letters of the alphabet, the digits, and the punctuation symbols. The non-printing characters include the whitespace characters: space, tab, linefeed, and carriage return. Most of the non-printing characters are control characters. These codes were originally used to control teletype machines. Some of them, such as "start of text" and "acknowledge", have no use as such anymore. Others, such as "backspace" and "bell", are still meaningful. ("bell" is the code that causes computer terminals to beep. It originally caused the bell on the teletype machine to ring.) Nowadays the control characters are often used for other purposes, e.g. to give commands to a text editor.

The chart above gives the numerical values of the character codes as base-10 numbers, the numbers in common use. However, character codes are not usually given in this form. The older convention is to give them in octal, that is, in base 8. The ASCII code for a, decimal 97, would be described as octal 141. Here the first digit represents 8 to the second power, the second digit 8 to the first power, and the last digit 8 to the zeroth power, that is, one. 141 in base 8 is therefore 1*8*8 + 4*8 + 1 = 64 + 32 + 1 = 97. Nowadays, character codes are usually given in hexadecimal, that is, in base 16. In hexadecimal, decimal 10 is represented by A, 11 by B, 12 by C, 13 by D, 14 by E, and 15 by F. The ASCII code for the letter a is 61 in hexadecimal, that is, 6*16 + 1. Further information on base conversions may be found here.
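
These conversions are easy to check in R (a sketch using the base functions strtoi and sprintf):

#Interpret "141" as octal and "61" as hexadecimal
strtoi("141",base=8)   #97
strtoi("61",base=16)   #97
#Print decimal 97 in octal and in hexadecimal form
sprintf("%o",97)       #"141"
sprintf("%x",97)       #"61"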


End-of-Line Conventions

One of the ways in which text files that use the same character encoding may vary is in how they represent the end of a line. The old convention goes back to the days of teletype machines. To move from the end of one line to the beginning of the next, two operations were necessary: a "carriage return" (abbreviated CR) moved the print head back to the left margin, and a "line feed" (abbreviated LF, or NL, for "newline") moved the print head down one line. In the days of teletypes, two character codes therefore had to be sent at the end of a line: CR LF. DOS and Windows still make use of this convention. Other operating systems make use of just one of the two. On the Macintosh, end-of-line is represented by CR alone. On Unix systems, NL alone is used.
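
The byte values involved are easy to inspect in R (a sketch; in R's string notation \r is CR and \n is LF/NL):

#Character codes of the three end-of-line conventions
utf8ToInt("\r\n")   #13 10  CR LF: DOS and Windows
utf8ToInt("\r")     #13     CR alone: Macintosh
utf8ToInt("\n")     #10     NL alone: Unix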

Conventions also differ as to when end-of-line is marked at all. Word processors such as WordPerfect and Microsoft Word use end-of-line to indicate what they call a "hard return", that is, a place at which the line MUST end. This is more-or-less equivalent in these programs to the end of a paragraph. They wrap text to fit the chosen margins when they display it and print it, so it is not necessary for the user to enter an end-of-line marker except when a "hard return" is desired. Text files created in such word processors will therefore usually have relatively few end-of-line markers in them.


Other Character Encodings Than ASCII

The ASCII character code is by no means the only one that has been used. Over the years, hundreds of character encodings have been developed to represent languages and writing systems other than English. Even for English, there have been other codes, especially in the early days of computing. For English, the main competitor to ASCII has been the EBCDIC (Extended Binary Coded Decimal Interchange Code) encoding developed by IBM. EBCDIC is still used on IBM mainframe computers. Here is an EBCDIC chart. Note that the EBCDIC code, unlike ASCII, is an 8-bit code. The high bit is not necessarily 0. Indeed, the letters of the alphabet all have codes above 128 in EBCDIC.

How characters are represented for a wide range of languages and writing systems is a complicated matter that we will address later in the course. For the time being, we will generally work with text in ASCII code, or in the Latin-1 encoding, which is used for many European languages. Latin-1 is an extension of ASCII in the sense that it assigns the same codes to the ASCII characters and adds codes ranging from 128 to 255 for additional characters, such as é and Ü.
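
For example (a sketch in R; utf8ToInt reports Unicode code points, which agree with the Latin-1 codes for values from 0 to 255):

#Codes of two non-ASCII Latin-1 characters
utf8ToInt("é")   #233, above the ASCII range of 0-127
utf8ToInt("Ü")   #220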


File Types

We often speak of "text files", "graphics files", "sound files", and so forth, as if files were of one type or another. In Unix, files do not really have any intrinsic type. A file is simply a sequence of bytes. What those bytes represent is a matter, on the one hand, of the intention of the person who created the file, and on the other hand, of how they are interpreted. The bit pattern 01100001 may represent the letter a, if interpreted as ASCII text, the punctuation symbol /, if interpreted as EBCDIC text, or the number 97, if interpreted as an unsigned 8-bit number. It might also represent the grey-level of a pixel in a grey-scale image, the color of a pixel in a color image, or the amplitude of a sound wave at one point in time.

To illustrate the fact that the same data can represent either an image or sound depending on how it is interpreted, we will generate a sound file and an image file from the same data.

First we use R to generate the data:

#Generate a sequence of 1,000,000 values.
x=seq(1,10^6)
#Generate their sines, scaled to range from 0.0 to 1.0.
#The period is 441 samples, which at the sampling rate of
#44,100 samples per second used below yields a 100 Hz tone.
y=(sin(2*pi*x/441)+1.0)/2.0
#Scale so that the values range from 0 to 255.
y=y*(2^8-1)
#Convert to integer form.
yi=as.integer(y)
#Write to file as ASCII text.
write(yi,file="AudioImage.asc")

We now have a file containing a sine wave in ASCII form, each sample written out as a decimal number in the range 0 to 255. Next, we convert the ASCII values to binary. I use my own little utility, atoub, which takes ASCII integers as input and converts them to unsigned 8-bit binary integers:

%atoub AudioImage.asc AudioImage.raw
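
If you do not have such a utility at hand, the same conversion can be sketched in R (an alternative to atoub, not the author's tool):

#Read the ASCII integers and rewrite them as unsigned 8-bit bytes
yi=scan("AudioImage.asc")
writeBin(as.raw(yi),"AudioImage.raw")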

AudioImage.raw now contains one million 8-bit integers representing a sine wave. To create a sound file, we need to add a wav header. We can do this with sox:

%sox -r 44100 -u -b AudioImage.raw -r 44100 -u -b AudioImage.wav
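
(The command above uses the option syntax of older versions of sox. With a recent version of sox, the equivalent would be something like the following; this is an assumption based on current sox conventions, not the author's command.)

%sox -r 44100 -e unsigned-integer -b 8 -c 1 AudioImage.raw AudioImage.wav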

Running InfoWave produces the following summary of the structure of this file. It consists of 44 bytes of header information followed by the actual audio sample data.

     0: RIFF identifier.
     4: chunk size = 1,000,036 bytes.
     8: WAV identifier.
    12: format chunk identifier
    16: format chunk size = 16 bytes.
    20: data format: PCM.
    22: one channel (mono).
    24: Sampling Rate = 44,100 samples per second.
    28: Average Data Rate = 44,100 bytes per second.
    32: Bytes_Per_Sample value of 1 indicates 8-bit mono
    34: Bits_Per_Sample = 8.
    36: chunk id
    40: chunk length
    44: chunk of type data     (standard)  length   1,000,000 bytes
	amounting to 0 minutes and 22.7 seconds
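
The leading header fields can also be checked directly in R (a hedged sketch; it simply reads the first twelve bytes according to the standard RIFF layout shown above):

#Read the first three fields of the wav header
con=file("AudioImage.wav","rb")
readChar(con,4)                                 #"RIFF"
readBin(con,integer(),size=4,endian="little")   #1000036, the chunk size
readChar(con,4)                                 #"WAVE"
close(con)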

If you play this file, you will hear a pure 100 Hz tone 22.68 seconds long. Viewed in Wavesurfer, the file looks as expected: the upper panel of the display shows the F0 contour, a horizontal line at 100 Hz, and the lower panel shows the sound pressure waveform, a sine wave.

To create the graphics file, we prepend to the data a header appropriate for a binary format PGM file. The header itself is the following ASCII text:

P5 #xxxxxxxxxxxxxxxxxxxxxxxxx
1000
1000
255

This header provides the information that this is a binary format PGM file, that it contains an image 1000 pixels wide by 1000 pixels high, and that the maximum grey-level value is 255. I've added an unnecessary comment to pad the header to 44 bytes (3 bytes for "P5 ", 27 for the comment line and its newline, 5 for each of the two dimension lines, and 4 for the maximum-value line: 3 + 27 + 5 + 5 + 4 = 44) so that it will be the same length as the wav file header.

%cat pgmbhdr AudioImage.raw > AudioImage.pgm
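
The same concatenation can be sketched in R (assuming, as above, that the header text has been saved in the file pgmbhdr):

#Prepend the 44-byte PGM header to the raw image data
hdr=readBin("pgmbhdr",raw(),44)
img=readBin("AudioImage.raw",raw(),10^6)
writeBin(c(hdr,img),"AudioImage.pgm")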

The resulting file can be displayed by a suitable graphics program, such as the display program from the ImageMagick package.
Downloads (both 1,000,044 bytes):
AudioImage.pgm
AudioImage.wav

Indeed, the same data can also be interpreted as text. If we surround our sine wave with an HTML header and footer, with the header identifying the content as Latin-1 text, a browser renders the bytes as a long run of (mostly accented) Latin-1 characters. If we instead identify the character set as KOI8-R, the Cyrillic encoding widely used in Russia, the very same bytes display as Cyrillic text.
Admittedly, this isn't the most interesting text. Data that are coherent in one interpretation won't necessarily be in another.

Here are the HTML files in case you want to see for yourself. Be warned, though, that loading these files will tie up your browser for a while.

AudioImageLatin1.html
AudioImageKoi8r.html

In Unix, neither a file's contents nor its name necessarily indicates what kind of file it is. In some operating systems, such as Microsoft Windows, the filename extension indicates the kind of file: files with the extension exe are programs, while files with the extension doc are Microsoft Word documents. In Unix, filenames may be constructed so as to encode information about the content of the file, but the operating system does not require this, nor do most programs. For example, it is common to use the suffix .jpg for image files in the JPEG format, but programs for viewing and processing images will in general not care whether the file has such a suffix.

Another way of indicating what kind of information a file contains is by means of a header. This is information put at the beginning of the file that tells programs that read the file about its contents. A header often begins with a byte or short sequence of bytes that identifies the file type. Such a sequence is known as a magic number. The header may contain other information as well, such as the sampling rate of a sound file or the size of an image.
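
For instance, here is a sketch in R of checking a magic number by hand, using the PGM file created earlier (P5 is the magic number of a binary format PGM file):

#Read the first two bytes and compare them with the PGM magic number
magic=readBin("AudioImage.pgm",raw(),2)
rawToChar(magic)         #"P5"
rawToChar(magic)=="P5"   #TRUE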

On Unix systems each file also has associated with it information called permissions, which determine who is allowed to read it, write it, and execute it. Permissions are not actually part of the file. A Unix system will not execute a program if the permissions of the file containing it do not indicate that it is executable. However, execution permission does not actually indicate the contents of the file. An executable file may be a machine language program, which can be directly executed, or it may be a script that must be interpreted by another program, such as a shell. Marking it as executable only tells the system to consider it for execution.

A file's contents can often be identified by means of the file program. This is a utility that attempts to identify file types on the basis of their magic numbers and contents. It is not foolproof, but it does quite a good job. (The version of file provided by some computer manufacturers is rather impoverished. What is probably the most sophisticated version can be obtained here.)

If you cannot identify a file, or if for some reason you need to see exactly what is in a file, you may find the od utility useful. od stands for "octal dump" and refers to one of the things it can do, namely display the contents of each byte as an octal number. The command od -bc is particularly useful. It displays each byte as an octal number and also displays it as an ASCII character, where possible. Otherwise, it displays the escape sequence, if there is one (e.g. \t for tab), or the octal value (e.g. 377, which is ÿ in Latin-1). For example, if we give the command:

echo "abc xyz" | od -bc

the output will be:
0000000 141 142 143 040 170 171 172 012
          a   b   c       x   y   z  \n
0000010

0000000 and 0000010 are the offsets (in octal) of the first byte of each row of output; the second row is blank because it merely marks the end of the data, after eight (octal 10) bytes. The numbers 141, 142, etc. are the octal values of the bytes. On the line below are the characters that they represent. 040 is the code for space, so the apparent gap in the character line actually represents the space character. \n is the escape sequence for the newline character.

One use of od is to read header information. For example, if we run od on a GIF format image file, the output begins like this:

0000000 107 111 106 070 067 141 322 003 000 004 367 000 000 007 003 005
          G   I   F   8   7   a   Ò 003  \0 004   ÷  \0  \0  \a 003 005



Document Formats

Documents often consist of more than text. They may contain other kinds of data, such as images, and they may contain markup, that is, information about the structure of the text ("structural" or "logical" markup) and/or about how it should be displayed ("physical" or "visual" markup).

Markup

Structural markup of a typical piece of text might consist of information such as "this is a chapter title" or "this is an element of an ordered list". Physical markup might consist of information such as "this should be printed in 18 point bold Helvetica type" or "this should be centered".

Some text formatting systems make use of overt markup. This is true of the roff family of text formatters (troff, groff, nroff) and their preprocessors, of TeX, and of lout. To format text in one of these systems, one inserts, in addition to the actual text, markup which is interpreted by the text formatting program. Documents written in such formats can therefore be used as linguistic data by stripping out the markup.

The formats used by word processors, such as Microsoft Word, WordPerfect, and Nisus Writer Express, contain markup that is not intended to be human-readable. In addition to text and markup, word processor files may contain images and the fonts needed to print the document.

In addition to markup systems intended specifically for document formatting, there are now general markup languages, which can be used for many purposes. The grandfather of modern general markup languages is SGML ("Standard Generalized Markup Language"), which is specified by ISO standard 8879:1986. (The standard is not freely available online but may be purchased from the ISO.) HTML ("Hypertext Markup Language"), the language in which web pages are written, is a specialized derivative of SGML. XML is a derivative of SGML that is increasingly used both for structuring documents and for databases, if not as the internal format, at least as a transfer format.

Printer and Page Description Languages

When printed, text is usually eventually translated into a low-level printer language. Such languages contain detailed information about the position at which to print each character as well as control codes for the printer. They may also contain instructions for graphics. In some cases, the printer is used as a pure graphics device, and text is printed by translating character codes and font information into instructions that tell the printer how and where to draw each character.

Documents are often now printed by translating them into a page description language, which is then either translated into a low-level printer language or sent to a printer that is able to interpret it directly. A page description language is intermediate in abstraction between a document with markup and a low-level printer language. Perhaps the best known page description language is Postscript. Postscript is actually a complete programming language (similar to Forth), one capable of elaborate mathematical calculations, which contains primitives for printing characters and drawing graphics. One of the motivations for using Postscript when it was first developed in 1984 was to offload the computational burden of rendering complex documents from the computer to the printer. A document in Postscript could consist of a fairly abstract program which, when executed on the processor of the printer, would cause the printer to render the document. In the first few years of its life, the Apple laser printers that were the first to use Postscript usually had greater computing power than the computers to which they were connected.

A document in Postscript may consist of low-level graphics data, in which case the text cannot be extracted from it except, possibly, by optical character recognition. A document in Postscript may also consist of calls to functions, themselves defined in the document, applied to individual characters or other small bits of text. These functions position the text and determine its size and typeface; the characters or strings that are their arguments constitute the text itself. In this case, it is possible to extract the text by removing the function calls.


PDF Files

A file format that is frequently encountered is PDF, the Portable Document Format developed by Adobe in 1993. PDF is a derivative of Postscript, which adds to Postscript's imaging model a document structure and interactive navigation features. PDF is a portable page description format, which allows complex documents, including non-Roman fonts and graphics, to be read and printed on a wide variety of operating systems. PDF files may contain hyperlinks, and they may be password protected. The full PDF standard can be obtained from Adobe's web site here.

Adobe's own program for reading PDF files is Acrobat Reader, which can be downloaded free for most operating systems here, but there are a number of other PDF readers. A list of PDF readers for various operating systems may be found here.

However, for most linguistic purposes it is desirable to extract plain text from the PDF file. This can be done if the file is not locked and if the material is in the form of text rather than images. Since PDF files can also contain images, one way to avoid problems with the distribution of exotic fonts is to convert a document to a set of images and embed the images in a PDF file; such PDF files do not contain extractable, manipulable text. If the text is extractable, it can be extracted with pdftotext.
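
A typical invocation looks like this (the filenames are illustrative):

%pdftotext document.pdf document.txt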


RTF Files

Rich Text Format is a format for text and graphics interchange developed by Microsoft. RTF files are usually generated by programs rather than written directly by people. RTF is actually a physical markup system, simpler than PDF. Both the markup tags and the text characters are written in a human-readable format, using ASCII characters or, occasionally, another common single-byte character encoding. Characters outside that encoding are represented by numeric codes. Similarly, images are encoded as text.

The RTF specification is available as a set of web pages here and as a PDF file here.



Modified Files

Thus far we have described files in their native format, such as plain text files. Files may also appear in a variety of modified formats.

Email Encoding

Files are sometimes encoded to be sent as email. Once upon a time, the system for transmitting electronic mail was designed only to transmit plain text, that is, 7-bit ASCII, not "binary" files. In many ways, it mimicked teletype transmission. The designers expected only certain byte values to be included in message text; other values were interpreted as control codes for the mail system. As a result, messages that contained certain values would not be transmitted properly. In order to allow files containing such bytes, such as images and computer programs, to be transmitted by email, they are encoded so that only the non-disruptive values are used. They must then be decoded at the other end. On Unix systems, the usual program for encoding binary files for email is uuencode; uudecode is the corresponding program for decoding them. Another widely used encoding, originating on the Macintosh, is binhex.
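
For example (the filenames are illustrative):

%uuencode picture.jpg picture.jpg > picture.uue
%uudecode picture.uue

The first command encodes picture.jpg, recording its name in the encoded file; the second recreates the original file from the encoding.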

Nowadays email encoding is generally handled automatically, without user intervention. Programs like pine on Unix systems, Outlook Express on MS Windows systems, and Eudora on Macintosh systems automatically perform the necessary encoding and decoding using the MIME (Multipurpose Internet Mail Extensions) format, defined in RFC 2045. Furthermore, the software underlying the mail system is increasingly 8-bit clean, so it is often possible to send 8-bit data without any problem. The result is that you are not very likely to have to deal with a file in encoded form.

Should you nonetheless need to decode a file with email encoding, there is a single free program, uudeview, available for Unix and MS Windows systems, that handles all three major encoding systems: binhex, uuencode, and Base64.

Encryption

Files are sometimes encrypted for security. An encrypted text file will generally appear to contain random binary data; it will not be identifiable as a text file. The traditional Unix encryption program is crypt. It is increasingly being replaced by GnuPG.
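
For example, with GnuPG (the filename is illustrative; -c requests symmetric, passphrase-based encryption, and -d decrypts to the standard output):

%gpg -c notes.txt
%gpg -d notes.txt.gpg > notes.txt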

Compression

Files are sometimes compressed in order to reduce disk usage or to speed up transmission over the network. There are various ways of compressing files, which differ in how well they suit different types of data. On Unix systems, the most widely used compression program is probably gzip, which is invoked as gunzip to decompress. On MS Windows systems the most widely used compression program is WinZip. Another compression program that runs on just about every kind of system is bzip2.
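
For example (the filename is illustrative):

%gzip big.txt
%gunzip big.txt.gz

The first command replaces big.txt with the compressed file big.txt.gz; the second restores the original.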

Packaging

For some purposes it is desirable to combine several files into a single file. It is generally more convenient to deal with a single large file than many small files, e.g. for transmission over a network or inclusion in email. Furthermore, one sometimes wants to preserve the directory structure, not just the individual files.

There are a number of programs that package groups of files. The most important of these on Unix systems is tar ("tape archiver"), which was originally intended to package files for archiving on tape. tar creates a single file that contains not only the original files but the information necessary to reconstruct the original tree structure, permissions, modification times, and so forth. Files created by tar often have the suffix .tar.

Packaging is often combined with compression. On Unix systems, tar files that have also been compressed with gzip are usually given the suffix .tar.gz or, more compactly, .tgz. Recent versions of tar can also perform gzip compression and decompression if so requested (by the z flag). On MS Windows systems WinZip performs both packaging and compression, and it can also unpack and decompress archives created by tar.
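
For example (the names are illustrative), to package and compress a directory in one step and later unpack it:

%tar czf project.tgz project
%tar xzf project.tgz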


Revised 2009-04-22.