Parse v2.5  *  Gregory Kwok

This is a command-line utility that counts how many letters, words, and
sentences there are in a text file. It also calculates the average number
of letters/word and words/sentence. Finally, it gives you a list of the
most common words in the file.

For brief instructions on command-line switches and syntax, simply type
PARSE, which prints the following:

  Syntax: PARSE file [/Cn] [/Mn] [/O] [/S]

    file  The file to read and parse, with an extension if necessary
    /C    Prints the most common words, default n = 10
          ("/C" is equivalent to "/C0")
    /M    Specifies that only words that appear n times or more are to be
          added to the common word list, default n = 2
    /O    Do not add ordinary words to the common word list, default off
          (ordinary words are added)
    /S    Semicolons and colons break sentences, default off

Normally, the following characters break a sentence:

  (tab) (space) ! " ( ) , - . / ? [ ] { }

These extended ASCII characters also break sentences:

  0x9b: ¢   0x9c: £   0xa8: ¿   0xad: ¡   0xc4: -   0xff: _

To parse a file, type

  PARSE file.txt

Here is a sample output, using the /O switch, on the MANUAL.DOC file
included with PKZip 2.04g:

  Printable letters:   190438
  Words:                27888
  Sentences:             1931
  Avg. letters/word:      4.9
  Avg. words/sentence:   14.4

  10 most common words:
   1) File     (907)
   2) Zip      (478)
   3) Pkzip    (409)
   4) Files    (395)
   5) |        (194)
   6) Option   (191)
   7) Pkunzip  (182)
   8) May      (178)
   9) ¦        (168)
  10) List     (146)

This archive contains both 32-bit and 16-bit executables, as well as
source code. The 32-bit version was compiled with DJGPP 2.01; the 16-bit
version was compiled with Borland Turbo C++ 3.0 using /mc. It is highly
recommended that you use the 32-bit executable. The 16-bit executable is
limited in memory capacity; while it is adequate for smaller files (up to
1.5 MB or so), it cannot handle multi-megabyte files like Parse32 can,
and it is slower.

For this reason, both executables use a three-pass algorithm (I
know... slow!). The first pass generates the statistics. The second pass
builds the common word list by writing each word into memory (a binary
tree). The third pass searches the tree for the words with the greatest
frequency. (An illustrative sketch of this binary-tree approach appears
at the end of this file.) If Parse16's binary tree grows too large during
the second pass, you will get just the statistics and no common word
list.

The running time of the common-word-list algorithm is approximately
(C+1) * N * log2(N), where C is the number of common words to find and N
is the number of unique words in the file.

This program is shareware. Please distribute it freely. If you make any
modifications to the source code, please contact me.

Contact information:

  Gregory Kwok
  gkwok@jps.net
  http://www.jps.net/gkwok/
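
The following is a minimal, illustrative sketch in C of the binary-tree
word counting described in the algorithm notes above. It is NOT the
actual Parse source; every name in it (node, insert, find_max, and so
on) is made up for this example, it splits words only on non-alphabetic
characters, and it reports just the single most frequent word rather
than a ranked list.

  /* demo.c -- illustrative sketch only, not the Parse source.
     Builds a binary search tree of words with counts (like Parse's
     second pass), then scans the tree for the most frequent word
     (like the third pass). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <ctype.h>

  struct node {
      char *word;
      long count;
      struct node *left, *right;
  };

  /* Insert a word into the tree, or bump its count if already there. */
  static struct node *insert(struct node *root, const char *word)
  {
      int cmp;

      if (root == NULL) {
          root = malloc(sizeof *root);
          if (root == NULL)
              exit(1);
          root->word = malloc(strlen(word) + 1);
          if (root->word == NULL)
              exit(1);
          strcpy(root->word, word);
          root->count = 1;
          root->left = root->right = NULL;
          return root;
      }
      cmp = strcmp(word, root->word);
      if (cmp < 0)
          root->left = insert(root->left, word);
      else if (cmp > 0)
          root->right = insert(root->right, word);
      else
          root->count++;
      return root;
  }

  /* Walk the whole tree, remembering the most frequent word seen. */
  static void find_max(const struct node *root, const struct node **best)
  {
      if (root == NULL)
          return;
      if (*best == NULL || root->count > (*best)->count)
          *best = root;
      find_max(root->left, best);
      find_max(root->right, best);
  }

  int main(int argc, char **argv)
  {
      FILE *fp;
      struct node *root = NULL;
      const struct node *best = NULL;
      char word[128];
      int c, len = 0;

      if (argc < 2 || (fp = fopen(argv[1], "r")) == NULL) {
          fprintf(stderr, "usage: demo file\n");
          return 1;
      }
      /* Second pass: split input into words, insert each into the tree. */
      while ((c = fgetc(fp)) != EOF) {
          if (isalpha(c) && len < (int)sizeof word - 1) {
              word[len++] = (char)tolower(c);
          } else if (len > 0) {
              word[len] = '\0';
              root = insert(root, word);
              len = 0;
          }
      }
      if (len > 0) {              /* last word, if the file ends mid-word */
          word[len] = '\0';
          root = insert(root, word);
      }
      fclose(fp);

      /* Third pass: search the tree for the highest count. */
      find_max(root, &best);
      if (best != NULL)
          printf("Most common word: %s (%ld)\n", best->word, best->count);
      return 0;
  }

Compiled with any standard C compiler (for example, gcc demo.c -o demo)
and run on a text file, it prints the single most common word and its
count. Parse itself keeps a ranked list of the top n words and applies
the /C, /M, /O, and /S rules on top of this basic idea.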