Character Set

Redet itself works with Unicode and by default reads and writes UTF-8. Whether regular expression matching works with characters outside the 7-bit ASCII range, in either the test data or the regular expression, depends on whether the program that Redet calls supports Unicode. Whether characters are displayed properly in Redet windows depends on the fonts that you have installed.

Test data and comparison data may be read in encodings other than UTF-8, and test data, comparison data, and results of matches and substitutions may be written in encodings other than UTF-8. For each type of data, an encoding is specified, which is used both to read and to write that type of data. By default, these are all set to UTF-8. The encodings may be changed interactively, via the File menu, or via the initialization file commands TestDataEncoding, ResultEncoding, and ComparisonDataEncoding. The set of encodings available is that supported by your Tcl/Tk installation. A full installation of Tcl/Tk currently provides 81 encodings, with good coverage of Europe and East Asia.

This image of the encoding menu on my machine illustrates the encodings available. Notice that both the current encoding (EUC-JP) and the encoding under consideration (Big5) are highlighted, in different ways.


The Encoding Menu
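
Reading data in one encoding and writing it in another amounts to decoding to Unicode on input and encoding again on output. The following is a minimal Python sketch of that idea, not Redet's own code (Redet relies on Tcl/Tk's encoding support). The function name transcode is hypothetical, and the codec names are Python's, which may differ from the names shown in Redet's encoding menu.

```python
def transcode(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    text = raw.decode(source_encoding)    # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes


# Test data stored as EUC-JP, written back out as UTF-8.
euc_jp_bytes = "日本語".encode("euc_jp")
utf8_bytes = transcode(euc_jp_bytes, "euc_jp", "utf-8")
print(utf8_bytes.decode("utf-8"))
```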

Since virtually all other encodings cover only a subset of Unicode, it is possible to attempt to write out data in an encoding that does not support one or more of the Unicode characters in the internal buffer. Redet detects this situation, aborts the write, and prints a message indicating the problem and identifying the characters that cannot be transcoded.
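
The kind of check described above can be sketched in Python as follows. This is an illustration under assumptions, not Redet's implementation: the names untranscodable and encode_or_abort are hypothetical, and the error message format is invented for the example.

```python
def untranscodable(text: str, encoding: str) -> list:
    """Return the distinct characters of text that encoding cannot
    represent, in order of first appearance."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def encode_or_abort(text: str, encoding: str) -> bytes:
    """Encode text, or raise ValueError naming the offending characters."""
    bad = untranscodable(text, encoding)
    if bad:
        listing = ", ".join(f"{ch} (U+{ord(ch):04X})" for ch in bad)
        raise ValueError(f"cannot write as {encoding}: {listing}")
    return text.encode(encoding)


# ISO-8859-1 has ñ but no Chinese characters.
print(untranscodable("año 中文", "iso-8859-1"))
```

Checking before writing means that nothing is written at all when the data cannot be represented, rather than leaving a partially written or silently altered file.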

Note that some programs that do handle Unicode only work with Unicode in certain locale settings, while others work with Unicode regardless of the locale. Members of the latter category include Python and Pike. Programs that support Unicode only in certain locales include GNU ed, GNU grep, GNU sed, and mawk. If you want to test this, try zh_TW.UTF-8 (Taiwan Chinese in UTF-8 encoding) or es_ES.UTF-8 (Castilian in Spain with UTF-8 encoding) for a locale in which Unicode should be supported and es_ES (Castilian in Spain, with default ISO-8859-1 encoding) for a locale in which Unicode is not supported.
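
As a small illustration of locale-independent Unicode matching, the following sketch assumes Python 3, where patterns on text strings are Unicode-aware by default. The re.ASCII flag is used here only to imitate what a matcher without Unicode support would do; it is not how the locale-dependent programs listed above actually behave internally.

```python
import re

text = "niño 漢字 abc"

# Unicode-aware matching: \w covers letters from any script.
unicode_words = re.findall(r"\w+", text)

# ASCII-only matching, for contrast: non-ASCII letters break the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['niño', '漢字', 'abc']
print(ascii_words)    # ['ni', 'o', 'abc']
```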

Perl can be made to handle Unicode in a variety of ways determined by the setting of an environment variable or command-line flag. Redet runs Perl in such a way as to use UTF-8 Unicode for all input and output, regardless of locale.

Non-ASCII characters can be entered using whatever entry methods the user's system provides or using a Unicode character map such as gucharmap. Widgets are provided for entering characters from the International Phonetic Alphabet, since these are scattered through several Unicode ranges and are therefore inconvenient to enter using a general-purpose Unicode character map. A widget is also provided for entering characters by their Unicode code point. Finally, it is possible to create custom character entry widgets by loading definitions from a file.
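
Entry by code point comes down to converting a hexadecimal number into the corresponding character. A minimal Python sketch follows. The function name char_from_codepoint and the accepted input forms (bare hexadecimal, or hexadecimal with a U+ prefix) are assumptions made for the example, not a description of what Redet's widget accepts.

```python
def char_from_codepoint(spec: str) -> str:
    """Convert a hexadecimal code point such as 'U+0259' or '4e2d'
    into the corresponding character."""
    digits = spec.strip()
    if digits.upper().startswith("U+"):
        digits = digits[2:]
    # int() raises ValueError for non-hexadecimal input;
    # chr() raises ValueError for values above U+10FFFF.
    return chr(int(digits, 16))


print(char_from_codepoint("U+0259"))  # the IPA schwa
```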

As an aid to those working with Unicode, lists of Unicode ranges and general character properties are available from the Help menu.

