Unicode Utilities

Description

This package consists of a set of programs for manipulating and analyzing Unicode text. The analysis utilities are useful when you are working with Unicode files and do not know the writing system, do not have the necessary font, need to inspect invisible characters, need to find out whether characters have been combined or in what order they occur, or need statistics on which characters occur.

Uniname
Unidesc
Unihist
ExplicateUTF8
Utf8lookup
Unireverse
Unifuzz
Unisurrogate

Uniname

By default, uniname prints the character offset of each character, its byte offset, its hex code value, its encoding, the glyph itself, and its name. Command line options allow undesired information to be suppressed and the Unicode range to be added. Other options permit a specified number of bytes or characters to be skipped. For example, the default output for this text:

これは日本語です。

is this:

character  byte       UTF-32   encoded as     glyph   name
        0          0  003053   E3 81 93       こ      HIRAGANA LETTER KO
        1          3  00308C   E3 82 8C       れ      HIRAGANA LETTER RE
        2          6  00306F   E3 81 AF       は      HIRAGANA LETTER HA
        3          9  0065E5   E6 97 A5       日      CJK character Nelson 2097
        4         12  00672C   E6 9C AC       本      CJK character Nelson   96
        5         15  008A9E   E8 AA 9E       語      CJK character Nelson 4374
        6         18  003067   E3 81 A7       で      HIRAGANA LETTER DE
        7         21  003059   E3 81 99       す      HIRAGANA LETTER SU
        8         24  003002   E3 80 82       。      IDEOGRAPHIC FULL STOP

uniname may also be used to validate UTF-8 input. In this case it reports the first invalid UTF-8 that it encounters, explains why it is invalid, and exits.
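
The following is a minimal Python sketch of the kind of listing and validation described above. It is not uniname's implementation. The function names describe and first_invalid_utf8 are hypothetical, and the character names come from Python's unicodedata module, so CJK ideographs are named differently than in the uniname output shown above.

```python
import unicodedata

def describe(text):
    """Return one row per character:
    (character offset, byte offset, code point, UTF-8 bytes, name)."""
    rows = []
    byte_offset = 0
    for char_offset, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        # Fallback string is used when Python's database has no name.
        name = unicodedata.name(ch, "(no name in Python's database)")
        rows.append((
            char_offset,
            byte_offset,
            f"{ord(ch):06X}",
            " ".join(f"{b:02X}" for b in encoded),
            name,
        ))
        byte_offset += len(encoded)
    return rows

def first_invalid_utf8(data):
    """Return (byte offset, reason) for the first invalid UTF-8 in data,
    or None if the whole byte string decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, err.reason
    return None

for row in describe("これ。"):
    print(*row, sep="\t")
print(first_invalid_utf8(b"ab\x81"))
```

Tracking the byte offset separately from the character offset is what makes the two columns diverge as soon as a multi-byte character appears.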

Unidesc

unidesc reports the character ranges to which different portions of the text belong. It can also be used to identify Unicode encodings (e.g. UTF-16be) flagged by magic numbers. Here is the output when given the above Japanese text as input:

       0	       2	Hiragana
       3	       5	CJK Unified Ideographs
       6	       7	Hiragana
       8	       9	CJK Symbols and Punctuation
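
The grouping of text into runs by character range can be sketched in Python as follows. This is only an illustration, not unidesc's code. The BLOCKS table is a small hand-picked subset of the Unicode block list, the names block_of and describe_ranges are hypothetical, and the end-offset convention (inclusive offset of the last character in the run) is an assumption that may differ from what unidesc reports.

```python
from itertools import groupby

# A tiny, hand-picked subset of Unicode blocks (assumption: just enough
# for the Japanese sample; a real tool would carry the full table).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]

def block_of(ch):
    """Return the name of the block containing ch, per the table above."""
    cp = ord(ch)
    for low, high, name in BLOCKS:
        if low <= cp <= high:
            return name
    return "(not in this sketch's table)"

def describe_ranges(text):
    """Return (first offset, last offset, block name) for each run of
    consecutive characters that belong to the same block."""
    runs = []
    offset = 0
    for name, group in groupby(text, key=block_of):
        length = sum(1 for _ in group)
        runs.append((offset, offset + length - 1, name))
        offset += length
    return runs

for first, last, name in describe_ranges("これは日本語です。"):
    print(f"{first:8d}\t{last:8d}\t{name}")
```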

Unihist

unihist generates a histogram of the characters in its input, which must be encoded in UTF-8 Unicode. By default, for each character it prints the frequency of the character as a percentage of the total, the absolute number of tokens in the input, the UTF-32 code in hexadecimal, and, if the character is displayable, the glyph itself as UTF-8 Unicode. Command line flags allow unwanted information to be suppressed. In particular, note that by suppressing the percentages and counts it is possible to generate a list of the unique characters in the input.

Here is a portion of the histogram for an Armenian text:

	  0.930	   2,599	0x00057E	վ
	  1.637	   4,574	0x00057F	տ
	  2.983	   8,332	0x000580	ր
	  0.720	   2,010	0x000581	ց
	  1.655	   4,622	0x000582	ւ
	  0.312	     872	0x000583	փ
	  0.441	   1,232	0x000584	ք
	  0.130	     362	0x000585	օ
	  0.147	     412	0x000586	ֆ
	  0.000	       1	0x002026	…
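
A histogram of this shape can be sketched in a few lines of Python. This is an illustration only, not unihist's implementation. The function name histogram is hypothetical, and sorting by code point is an assumption based on the ordering of the sample above.

```python
from collections import Counter

def histogram(text):
    """Return (percentage, count, code point, character) rows,
    sorted by code point."""
    counts = Counter(text)
    total = sum(counts.values())
    return [
        (100.0 * n / total, n, f"0x{ord(ch):06X}", ch)
        for ch, n in sorted(counts.items())
    ]

for pct, n, code, ch in histogram("abracadabra"):
    print(f"\t{pct:7.3f}\t{n:8,}\t{code}\t{ch}")
```

Dropping the first two fields of each row leaves the list of unique characters in the input.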

ExplicateUTF8

ExplicateUTF8 is intended for debugging or for learning about Unicode. It determines and explains the validity of a sequence of bytes as a UTF-8 encoding. Here is the output when given the above Japanese text as input:

The sequence 0xE3     0x81     0x93    
             11100011 10000001 10010011 
is a valid UTF-8 character encoding equivalent to UTF32 0x00003053.
The first byte tells us that there should be 2
continuation bytes since it begins with 3 contiguous 1s.
There are 2 following bytes and all are valid
continuation bytes since they all have high bits 10.
The first byte contributes its low 4 bits.
The remaining bytes each contribute their low 6 bits,
for a total of 16 bits: 0011 000001 010011 
This is padded to 32 places with 16 zeros: 00000000000000000011000001010011
                                           0   0   0   0   3   0   5   3

And here is the output when given as input the same file with the first byte removed:

The first byte, value 0x81, with bit pattern 10000001,
is not a valid first byte of a UTF-8
sequence because its high bits are 10.
A valid first byte must be of the form 0nnnnnnn or 11nnnnnn.
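
The reasoning in the two explanations above can be followed in code. The Python sketch below decodes the first UTF-8 sequence of a byte string by hand, using the same steps: count the leading 1 bits of the first byte, check the continuation bytes, and concatenate the payload bits. The function name decode_first_sequence is hypothetical, and the sketch deliberately omits checks that a complete validator needs, such as rejecting overlong forms and encoded surrogates.

```python
def decode_first_sequence(data):
    """Decode the first UTF-8 sequence in data by hand.
    Returns (code point, number of bytes used); raises ValueError
    with an explanation if the sequence is not well formed."""
    first = data[0]
    if first < 0x80:
        # 0nnnnnnn: a single-byte (ASCII) character.
        return first, 1
    if (first & 0xC0) == 0x80:
        raise ValueError(
            f"0x{first:02X} is not a valid first byte: its high bits are 10")
    # Count the contiguous leading 1 bits of the first byte.
    ones = 0
    while ones < 8 and first & (0x80 >> ones):
        ones += 1
    if ones > 4:
        raise ValueError(
            f"0x{first:02X} announces a sequence longer than 4 bytes")
    needed = ones - 1  # number of continuation bytes
    tail = data[1:1 + needed]
    if len(tail) < needed:
        raise ValueError(f"expected {needed} continuation bytes, "
                         f"found {len(tail)}")
    for b in tail:
        if (b & 0xC0) != 0x80:
            raise ValueError(
                f"0x{b:02X} is not a continuation byte: high bits are not 10")
    # The first byte contributes its low (7 - ones) bits,
    # each continuation byte its low 6 bits.
    codepoint = first & (0x7F >> ones)
    for b in tail:
        codepoint = (codepoint << 6) | (b & 0x3F)
    return codepoint, ones

cp, length = decode_first_sequence(bytes([0xE3, 0x81, 0x93]))
print(f"0x{cp:08X} from {length} bytes")
```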

Utf8lookup

Utf8lookup is a shell script which invokes uniname to provide an easy way to look up the character name corresponding to a codepoint from the command line. In addition to uniname it requires the utility Ascii2binary.

For example, the command:

utf8lookup 1254

will produce the output:

001254  ETHIOPIC SYLLABLE QHEE
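
For comparison, the same kind of lookup can be sketched with Python's unicodedata module. This is a stand-in for illustration, not the utf8lookup script itself, which relies on uniname and Ascii2binary. The function name lookup is hypothetical.

```python
import unicodedata

def lookup(hex_codepoint):
    """Return the code point and character name for a hex string."""
    cp = int(hex_codepoint, 16)
    return f"{cp:06X}  {unicodedata.name(chr(cp), '(unnamed)')}"

print(lookup("1254"))
```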

Unireverse

Unireverse is a filter that reverses UTF-8 strings character-by-character (as opposed to byte-by-byte). This is useful when dealing with text that is not encoded in the order in which you want to display it or analyze it. For example, if you want to display Arabic in a terminal window that does not support bidi text, Unireverse will put it into the normal display order.

For example, Unireverse will convert this:

   abcde
   한국말
   五十七

into this:

   edcba
   말국한
   七十五
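
The difference between reversing by character and reversing by byte can be seen in a short Python sketch. This is an illustration, not Unireverse's code, and the function name reverse_lines is hypothetical. It reverses code points within each line, so a combining mark would end up before its base character. Whether Unireverse treats combining sequences specially is not stated here.

```python
def reverse_lines(text):
    """Reverse each line of text character by character."""
    return "\n".join(line[::-1] for line in text.split("\n"))

print(reverse_lines("abcde\n한국말\n五十七"))

# Reversing the bytes instead breaks multi-byte characters apart:
reversed_bytes = "한".encode("utf-8")[::-1]
try:
    reversed_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("byte-reversed text is no longer valid UTF-8:", err.reason)
```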

Unifuzz

Unifuzz generates test input for programs that expect Unicode. It can generate a random string of characters, tokens of various potentially problematic characters and sequences, very long lines, strings with embedded nulls, and ill-formed UTF-8. Use it to find out whether your program reacts gracefully when given unexpected or ill-formed input.
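
To give a sense of the kinds of input listed above, here is a small Python sketch that builds a few such test cases by hand. It is not Unifuzz and does not reproduce its options or output. The function name fuzz_cases, the case names, and the particular byte sequences are all chosen for illustration.

```python
def fuzz_cases(long_line_length=100_000):
    """Return a dict of named byte strings that commonly trip up
    programs expecting well-formed UTF-8 text."""
    return {
        # Valid UTF-8, but awkward for naive programs.
        "embedded_null": b"abc\x00def",
        "very_long_line": b"a" * long_line_length + b"\n",
        # Ill-formed UTF-8.
        "lone_continuation": b"\x80",        # continuation byte with no lead
        "truncated_sequence": b"\xe3\x81",   # 3-byte sequence cut short
        "overlong_slash": b"\xc0\xaf",       # overlong encoding of "/"
        "encoded_surrogate": b"\xed\xa0\x80",  # U+D800 encoded directly
    }

for name, data in fuzz_cases(long_line_length=20).items():
    print(name, data)
```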

Unisurrogate

Unisurrogate takes a codepoint on the command line and, if it falls outside the BMP, reports its surrogate decomposition. This is useful if you need to enter Unicode codepoints manually in UTF-16. The codepoint on the command line may be either raw hexadecimal or prefixed by either "0x" or "U+".

For example, the command:

unisurrogate 12345

will produce the output:

The surrogate representation of U+12345 is U+D808 U+DF45
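
The arithmetic behind the decomposition is short enough to sketch in Python. This is an illustration of the standard UTF-16 surrogate calculation, not Unisurrogate's source. The function name surrogates is hypothetical.

```python
def surrogates(codepoint_text):
    """Return (high surrogate, low surrogate) for a code point given as
    hex text, optionally prefixed by "0x" or "U+". Returns None if the
    code point lies within the BMP and needs no surrogates."""
    text = codepoint_text.strip()
    if text[:2].upper() in ("0X", "U+"):
        text = text[2:]
    cp = int(text, 16)
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp <= 0xFFFF:
        return None
    offset = cp - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

high, low = surrogates("12345")
print(f"U+{high:04X} U+{low:04X}")
```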

Details

Language	C
Environment	POSIX
Current version	2.28
Last modified	2009-04-25
License	GNU General Public License, Version 3

Download Source

If you would like to be notified of new releases, subscribe to uniutils at Freshmeat.

Changes

Version 2.28: This release adds Unisurrogate, a utility that computes the UTF-16 surrogate decomposition of characters outside the BMP.
Version 2.27: This release updates the character data to Unicode 5.1 and fixes a bug in the -V option of uniname as well as a couple of other minor bugs.
Version 2.26: This release adds unifuzz, a utility that generates test input for programs expecting Unicode. Unifuzz can generate a random string of characters, tokens of various potentially problematic characters and sequences, very long lines, strings with embedded nulls, and ill-formed UTF-8. unirev is renamed unireverse. The license is now version 3 of the GPL.
Version 2.25: Adds to unidesc the option -r, which causes it to list the ranges detected after reading all input rather than listing them as they are encountered, and adds to uniname the option -B, which causes it to ignore characters within the Basic Multilingual Plane.
Version 2.24: Adds the utility Unirev, a filter which reverses UTF-8 strings.
Version 2.23: uniname and unidesc now provide information about the unofficial ranges within the Private Use Area registered with the ConScript Unicode Registry.

Full Change Log

Back to Bill Poser's software page.