Unicode Normalization


Unicode normalization is easily accomplished from the command line with simple filter programs that use the library facilities of widely used programming languages.

Here is a Perl script that reads UTF-8 Unicode from the standard input and writes the result, normalized to Normalization Form C (NFC), to the standard output.

perl -CSD -e 'use Unicode::Normalize;
while ($line = <STDIN>) {
    print NFC($line);
}'

Here is the equivalent in Python:

#!/usr/bin/env python3
import sys
import codecs
import unicodedata

(utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup('utf-8')
# Wrap the binary standard streams so that the filter reads and writes UTF-8 text.
infile = utf8_reader(sys.stdin.buffer)
outfile = utf8_writer(sys.stdout.buffer)
outfile.write(unicodedata.normalize('NFC', infile.read()))

In both of the above, to obtain another normalization form, just substitute NFD, NFKC, or NFKD for NFC.
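If the form is to be chosen at run time rather than edited into the script, the Python filter can also take it as a command-line argument. The following sketch is illustrative rather than part of the original script; it defaults to NFC and, like the Tcl script below, accepts the form with or without the leading "NF".

#!/usr/bin/env python3
import sys
import codecs
import unicodedata

# Take the normalization form from the command line; default to NFC.
# A leading "NF" is optional, so both "KD" and "NFKD" are accepted.
form = sys.argv[1].upper() if len(sys.argv) > 1 else 'NFC'
if not form.startswith('NF'):
    form = 'NF' + form

(utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup('utf-8')
infile = utf8_reader(sys.stdin.buffer)
outfile = utf8_writer(sys.stdout.buffer)
outfile.write(unicodedata.normalize(form, infile.read()))

An unrecognized form name is passed straight to unicodedata.normalize, which raises a ValueError.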

Here is the same filter in Tcl. By default, the output is in NFC. To specify another normalization form, give it on the command line, with or without the leading "NF".

#!/usr/bin/env tclsh
package require unicode   ;# provided by tcllib

# The normalization form may be given on the command line, with or
# without the leading "NF"; the default is NFC.
set form C
if {$argc > 0} {
    set form [lindex $argv 0]
}
if {[string match "NF*" $form]} {
    set form [string range $form 2 end]
}

fconfigure stdin -encoding utf-8
fconfigure stdout -encoding utf-8
while {[gets stdin line] >= 0} {
    puts [::unicode::normalizeS $form $line]
}
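
The effect of these filters can be checked directly with Python's unicodedata module. The short example below is purely illustrative: NFC composes the two-code-point sequence U+0065 U+0301 (e followed by a combining acute accent) into the single character U+00E9, and NFD decomposes it again.

#!/usr/bin/env python3
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = 'e\u0301'
composed = unicodedata.normalize('NFC', decomposed)

print([hex(ord(c)) for c in decomposed])                     # ['0x65', '0x301']
print([hex(ord(c)) for c in composed])                       # ['0xe9']
print(unicodedata.normalize('NFD', composed) == decomposed)  # True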