MSORT

Screenshot: Msort's graphical user interface

Contents

  1. Description
  2. Comparison with GNU Sort
  3. News
  4. Details
  5. Environment
  6. Documentation
  7. Downloads
  8. Change Log
  9. Bugs
  10. Roadmap

Description

msort is a program for sorting files in sophisticated ways. It was originally developed for alphabetizing dictionaries of "exotic" languages, for which it has been extensively used, but it is useful for many other purposes. msort differs from typical sort utilities in providing greater flexibility in parsing the input into records and identifying key fields, and greater control over the sort order. Its main distinctive features are:

msort understands UTF-8 Unicode. Unicode may be used anywhere that text is entered: in the text to be sorted, in sort order and exclusion definitions, as a field or record separator, or as a field tag. Full Unicode case-folding is available.
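As a rough illustration of what full case-folding means for a sort, here is a minimal Python sketch. It relies on Python's built-in str.casefold rather than on anything from msort, and the word list is made up for the example.

```python
# Illustration only: Python's str.casefold() applies full Unicode case folding.
# The word list is hypothetical and is not taken from msort.
words = ["Zebra", "straße", "STRASSE", "apple", "Äpfel"]

# Plain codepoint order: ASCII capitals sort before ASCII lowercase letters.
by_codepoint = sorted(words)

# Case-folded order: "straße" and "STRASSE" both fold to the key "strasse",
# so they end up next to each other.
by_folded = sorted(words, key=str.casefold)

print(by_codepoint)
print(by_folded)
```

Simple lowercasing would not be enough here: it leaves "ß" alone, whereas full folding expands it to "ss".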

Review by Ben Martin at linux.com
     (Japanese translation of the above)

If you are looking for the specialized Hungarian sort program also called msort, try here.



Comparison with GNU Sort

Msort's capabilities are very close to a superset of those of GNU sort. Msort provides greater flexibility in selecting key fields, more comparison types, the ability to use collation rules from different locales on different keys, the ability to handle numbers in non-Western number systems, and a variety of other options lacking in GNU sort. Whereas msort understands Unicode, GNU sort does not. It is a property of the UTF-8 transfer format that a binary sort will sort in Unicode codepoint order, so for some purposes GNU sort will behave in an acceptable manner on Unicode input. However, operations requiring an understanding of the encoding of the input do not work properly in GNU sort with Unicode input. Capabilities of GNU sort lacking in msort are the ability to merge files without sorting them (the --merge option) and the ability to emit only the first of an equal run (the --unique option).
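The point about UTF-8 and binary sorting can be checked with a short Python sketch. The sample strings are arbitrary, and the byte-level lowercasing merely stands in for any operation that treats the input as raw bytes; it is not a model of how GNU sort works internally.

```python
# Arbitrary sample strings: ASCII, Latin-1, CJK, and one character outside the BMP.
words = ["zebra", "Zebra", "éclair", "日本語", "𝄞 clef", "apple"]

# Python compares str values codepoint by codepoint.
by_codepoint = sorted(words)

# A "binary" sort: compare the raw UTF-8 bytes instead.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# The two orders agree, which is the UTF-8 property described above.
same_order = by_codepoint == by_utf8_bytes

# An operation that needs to understand the encoding: lowercasing.
# Working on bytes only affects ASCII letters, so "É" is left untouched.
byte_lowered = "ÉCLAIR".encode("utf-8").lower().decode("utf-8")
text_lowered = "ÉCLAIR".lower()

print(same_order, byte_lowered, text_lowered)
```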

Generally speaking, msort is the more powerful program, either the only choice or the more convenient choice in cases in which something other than a standard sort on positionally selected fields is required. On the other hand, if GNU sort is capable of doing what you want, it will generally be faster. The exact ratio varies with the details of the sort and the nature of the input, but in my tests, where msort and GNU sort are capable of performing the same sort, GNU sort is typically two to three times faster.


News

Binary packages are now available for Solaris.

Input and output files may now be specified by means of command line flags. Another new flag allows suppression of generation of the log file.

Msort can now handle ISO 8601 timezone specifiers in time and date-time keys.
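To see why timezone specifiers matter in a date-time key, consider this small Python sketch. The timestamps are invented, and the approach (convert to UTC, then compare) is only one plausible way to handle them; the source does not describe how msort does it.

```python
from datetime import datetime, timezone

# Hypothetical ISO 8601 timestamps with explicit UTC offsets.
stamps = [
    "2008-07-01T12:00:00+02:00",  # 10:00 UTC
    "2008-07-01T09:30:00-05:00",  # 14:30 UTC
    "2008-07-01T11:00:00+00:00",  # 11:00 UTC
]

def instant(stamp):
    """Parse the timestamp and express it in UTC so instants compare correctly."""
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)

# Sorting the text ignores the offsets and orders by local clock time.
by_text = sorted(stamps)

# Sorting by the parsed instant gives true chronological order.
by_instant = sorted(stamps, key=instant)

print(by_text)
print(by_instant)
```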

As of version 8.39, msort by default normalizes Unicode input.
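The following Python sketch shows why normalizing before sorting is useful. It uses NFC purely for illustration; the source does not say which normalization form msort applies, and the sample words are made up.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as a single precomposed codepoint
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

words = [composed, "cafz", decomposed]

# Without normalization the two spellings of the same word are separated,
# because U+0065 ("e") sorts before "z" while U+00E9 ("é") sorts after it.
raw_order = sorted(words)

# With normalization (NFC assumed here) both spellings get the same key.
def nfc(s):
    return unicodedata.normalize("NFC", s)

normalized_order = sorted(words, key=nfc)

print([nfc(w) for w in raw_order])
print([nfc(w) for w in normalized_order])
```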

It is now possible to select a configuration that does not require the GMP and Uninum libraries by executing configure with the option --disable-uninum. This simplifies installation for users who do not need the ability to handle non-Western numbers.

The available date formats have been significantly expanded. Month fields may now consist of month names or abbreviations. Dates consisting of only year and day-of-year are now accepted. The numerical components of dates may be in any supported number system.
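A sort key for mixed date formats might look roughly like the Python sketch below. The format strings and sample dates are assumptions chosen for the example, not msort's actual date syntax, and the month names are parsed according to the current locale (English is assumed).

```python
from datetime import datetime

# Hypothetical set of accepted layouts: numeric, full month name,
# abbreviated month name, and year plus day-of-year.
FORMATS = ("%Y-%m-%d", "%d %B %Y", "%d %b %Y", "%Y-%j")

def parse_date(text):
    """Try each layout in turn and return a date that can be compared."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

dates = ["2008-183", "27 May 2008", "2008-05-19", "1 Jul 2008"]

# Day 183 of 2008 (a leap year) is 1 July, so it ties with "1 Jul 2008".
for text in sorted(dates, key=parse_date):
    print(text, parse_date(text).isoformat())
```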

Msort can now handle both numeric and numeric string keys using a wide range of number systems. Unfortunately, as a result, installing msort has become a little more work, as it now depends on two additional libraries, Uninum and GNU MP. The upside is that you can now sort even data like this, in multiple non-Western number systems:

३८२४९ 三百五十三
७४५३९ 五十七
१४९ 三百二十七
२८४९३ 六万七十三
३२९४ 五十七
४८३९३ 五万七千

For example, to sort this data on the Chinese as primary key and the Devanagari as secondary key, do this:

msort -l -n 2 -c n -y any -n 1 -c n -y any

with the result:

३२९४ 五十七
७४५३९ 五十七
१४९ 三百二十七
३८२४९ 三百五十三
४८३९३ 五万七千
२८४९३ 六万七十三
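For readers who want to check the ordering by hand, here is a minimal Python sketch that reproduces it. It assumes the two fields are separated by whitespace, and chinese_to_int is a hypothetical helper that covers only the numerals in this sample; it is not the Uninum library and says nothing about how msort converts numbers.

```python
# Chinese numeral tables, covering only what the sample data needs.
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
MAN = 10_000  # 万

def chinese_to_int(text):
    """Convert a Chinese numeral below 10**8 to an int (hypothetical helper)."""
    total = section = digit = 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 counts as one of that unit.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * MAN
            section = digit = 0
        else:
            raise ValueError(f"unsupported numeral character: {ch!r}")
    return total + section + digit

records = [
    "३८२४९ 三百五十三",
    "७४५३९ 五十七",
    "१४९ 三百二十७".replace("७", "七"),  # same record as in the text: १४९ 三百二十七
    "२८४९३ 六万七十三",
    "३२९४ 五十七",
    "४८३९३ 五万七千",
]

def sort_key(record):
    devanagari, chinese = record.split()
    # Python's int() accepts any Unicode decimal digits, including Devanagari.
    return (chinese_to_int(chinese), int(devanagari))

for record in sorted(records, key=sort_key):
    print(record)
```

The key is a tuple, so the Chinese value decides first and the Devanagari value breaks ties, which is why ३२९४ precedes ७४५३९ among the two records ending in 五十七.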

Since you can use non-Western number systems for numeric string comparison, you can now sort non-Western numbers far outside the range representable using the usual fixed precision types. For example, here are some very large numbers written the traditional Chinese way:

三極五十七載二垓七万十三
三百八十九極
三極五十七載二十七垓七万三百十九
九載三正七澗五澗二溝
三極五十七載二十七垓七万三百九十九

These are equivalent to:

3,005,700,000,000,000,000,000,000,000,200,000,000,000,000,070,013
389,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000
3,005,700,000,000,000,000,000,000,002,700,000,000,000,000,070,319
900,030,012,000,200,000,000,000,000,000,000,000,000,000,000
3,005,700,000,000,000,000,000,000,002,700,000,000,000,000,070,399

If you try to sort these using a plain numeric sort, msort may report that all five records are ill-formed, since the values are not representable as unsigned long integers. But if you use a numeric string sort like this:

msort -l -w -c N -y any

msort will produce this output:

九載三正七澗五澗二溝
三極五十七載二垓七万十三
三極五十七載二十七垓七万三百十九
三極五十七載二十七垓七万三百九十九
三百八十九極
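The same ordering can be reproduced in Python, whose integers have arbitrary precision. The sketch below makes several assumptions: chinese_to_int is a hypothetical helper listing only the large units that occur in this sample, each large unit is taken to be 10**4 times the previous one, unsigned long is taken to be 64 bits wide, and comparing digit strings by length and then character by character is only one way a numeric string comparison could work. None of this is taken from msort's source.

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
# Only the large units that occur in the sample data are listed.
LARGE_UNITS = {"万": 10**4, "垓": 10**20, "溝": 10**32, "澗": 10**36,
               "正": 10**40, "載": 10**44, "極": 10**48}

def chinese_to_int(text):
    """Convert a Chinese numeral to an int (hypothetical helper)."""
    total = section = digit = 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            # Each large-unit group is simply added to the running total.
            total += (section + digit) * LARGE_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"unsupported numeral character: {ch!r}")
    return total + section + digit

def numeric_string_key(number):
    """Compare non-negative integers as digit strings: shorter is smaller."""
    digits = str(number)
    return (len(digits), digits)

records = [
    "三極五十七載二垓七万十三",
    "三百八十九極",
    "三極五十七載二十七垓七万三百十九",
    "九載三正七澗五澗二溝",
    "三極五十七載二十七垓七万三百九十九",
]

ULONG_MAX_64 = 2**64 - 1  # assumed width of unsigned long
all_too_big = all(chinese_to_int(r) > ULONG_MAX_64 for r in records)

by_value = sorted(records, key=chinese_to_int)
by_digit_string = sorted(records, key=lambda r: numeric_string_key(chinese_to_int(r)))

for record in by_value:
    print(record, f"{chinese_to_int(record):,}")
```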

A specialized domain name comparison type has been added.

Thanks to the Huntsville Macintosh User's Group, Mac OS X binaries are now available.


Details

Language         C                                              main program
                 Tcl/Tk                                         for GUI only
Dependencies     TRE regular expression library                 required
                 ICU - International Components for Unicode     required (either ICU or Utf8proc)
                 Utf8proc                                       required (either ICU or Utf8proc)
                 Uninum number conversion library               optional
                 GNU MP multiple precision arithmetic library   optional, used by uninum
                 Tcl/Tk version 8.3 or higher                   for GUI only
                 Iwidgets (Tcl/Tk library)                      for GUI only
License          GNU General Public License, Version 3
Current version  8.47
Last modified    2008-07-01



Environment

The underlying command-line program msort should compile and run without difficulty on any POSIX-conformant system on which the requisite libraries are available. In practice, this should mean just about anywhere. It is known to compile and run without modification under GNU/Linux, FreeBSD, Mac OS X, and SunOS. I am not sure whether the current version will compile and run properly under MS Windows, even under Cygwin, because MS Windows uses UTF-16 Unicode internally while msort expects UTF-32.
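The practical difference between UTF-16 and UTF-32 shows up with characters outside the Basic Multilingual Plane. The Python sketch below counts code units in each encoding; the sample string is arbitrary and code_units is a helper introduced only for this example.

```python
# a, é, 日, and 𝄞 (U+1D11E, a character outside the Basic Multilingual Plane)
text = "a\u00e9\u65e5\U0001D11E"

def code_units(s, encoding, unit_bytes):
    """Number of fixed-size code units needed to encode s (illustrative helper)."""
    return len(s.encode(encoding)) // unit_bytes

utf16_units = code_units(text, "utf-16-le", 2)
utf32_units = code_units(text, "utf-32-le", 4)

# UTF-32 uses exactly one unit per character; UTF-16 needs a surrogate pair
# for U+1D11E, so code that assumes one unit per character miscounts.
print(len(text), utf16_units, utf32_units)
```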

Note also that msort may be configured to compile without the GMP and Uninum libraries, at the cost of forgoing the ability to handle numbers in non-Western number systems. If you cannot or do not want to install these libraries, run configure with the option --disable-uninum. This will also disable linkage with libgmp.

The graphical user interface should run anywhere that Tcl/Tk is available, but a few features may not work on non-Unix systems. In particular, the Abort Sort command depends on the existence of a Unix-style kill program that can be used to send a signal to another process. The GUI is known to run under GNU/Linux, FreeBSD, and SunOS. msg will run properly under Mac OS X if you have installed X11 and use Tk-X11. msg now adapts itself to Tk-Aqua well enough to be usable, but some details remain to be dealt with.
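The Abort Sort dependency comes down to one process sending a signal to another. The Python sketch below shows that mechanism on a POSIX system. It uses Python's own subprocess and signal modules rather than an external kill program, and the sleeping child is only a stand-in for a long-running sort; it does not reflect how msg is implemented.

```python
import signal
import subprocess
import sys

def start_and_abort():
    """Start a long-running child, send it SIGTERM, and return its exit status."""
    # Stand-in for a long-running sort: a child process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Equivalent in effect to running `kill <pid>` from a shell.
    child.send_signal(signal.SIGTERM)
    # On POSIX, a child killed by a signal reports the negated signal number.
    return child.wait(timeout=10)

if __name__ == "__main__":
    print(start_and_abort())
```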


Note: obtaining the necessary Tcl/Tk environment.

The GUI requires both the basic Tcl/Tk distribution and the iwidgets library. If you already have Tcl/Tk and just need to add iwidgets, you can obtain the package from the Sourceforge project site. On the download page you will find source and binary packages for both [incr Tcl/Tk], which is the basic part of this package, and [incr widgets], which is the part that contains the widgets. You will need to install both. (iwidgets is an alternative name for [incr widgets].)

The easiest way to obtain the Tcl/Tk environment you need is to install the ActiveTcl distribution from ActiveState. This distribution provides the Tcl language, the Tk graphics library, and a bunch of extensions, including [incr tcl] and [incr widgets]. Don't be concerned by the fact that ActiveState is a commercial outfit. The Tcl/Tk distribution that they provide is free as in both beer and speech. They make their money selling services and programming tools. The ActiveTcl distribution is currently available for: GNU/Linux, HP-UX, AIX, Solaris, Mac OS X, and MS Windows.

For FreeBSD, Tcl and Tk are available at:



Documentation

A standard Unix manual page is included in the package, or you can read it here. The full documentation is the reference manual (PDF), a copy of which is included in the package.


Downloads

File                 Size (Bytes)  MD5 Sum
msort-8.47.tar.bz2   421,260       ca550abf701cddc030ec8678efa365de
msort-8.47.tar.gz    448,744       6bf495c5330cb1a4e526e10f45d773de
msort-8.47.zip       468,394       3d8e11943c239fe819a486745f17bbf
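A downloaded archive can be checked against its published MD5 sum with a few lines of Python. This is a minimal sketch using the standard hashlib module; the function names are made up for the example, and the file path is whatever you saved the download as.

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_sum(path, published_sum):
    """Compare the computed digest with a published one, ignoring case and spaces."""
    return md5_of_file(path) == published_sum.strip().lower()
```

For example, matches_published_sum("msort-8.47.tar.gz", "6bf495c5330cb1a4e526e10f45d773de") should return True for an intact copy of that archive.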

If you would like to be notified of new releases, subscribe to msort at Freshmeat.

Packages

Debian
  Debian package (testing)
  Debian package (unstable)
FreeBSD
  FreeBSD Freshport
Mac OS X
  Macport
  Mac OS X binaries
  Softpedia (PPC and Intel)
Nexenta/GNU Solaris
  Nexenta packages
Redhat Linux
  Redhat RPMs
SUSE Linux
  Source and i686 executable RPMs courtesy of Pascal Bleser: SUSE RPMs.
Solaris (SPARC and Intel)
  Solaris Package Index
T2
  T2
Ubuntu
  Ubuntu packages



Changes

8.47 - 2008-07-01

8.46 - 2008-05-27

8.45 - 2008-05-19

8.44

8.43


Full Change Log



Known Bugs

Under obscure conditions date sorts may produce a segmentation fault or valid date fields may be rejected as invalid. I have been unable to reproduce this bug on my own system. It may or may not be significant that the machine on which this bug has been reported is a 64-bit machine.

Known bugs in the GUI are:


Roadmap

If you care about any of these, please feel free to drop me a line.




Back to Bill Poser's software page.