Parse v2.5  *  Gregory Kwok

This is a command-line utility that counts how many letters, words, and
sentences there are in a text file. It also calculates the average number
of letters/word and words/sentence. Finally, it gives you a list of the
most common words in the file.

For brief instructions on command-line switches and syntax, simply type
PARSE, which prints the following:

  Syntax: PARSE file [/Cn] [/Mn] [/O] [/S]

    file  The file to read and parse, with an extension if necessary
    /C    Prints the most common words, default n = 10
          ("/C" is equivalent to "/C0")
    /M    Specifies that only words that appear n times or more are to be
          added to the common word list, default n = 2
    /O    Do not add ordinary words to the common word list, default off
          (ordinary words are added)
    /S    Semicolons and colons break sentences, default off

Normally, the following characters break a sentence:

  (tab) (space) ! " ( ) , - . / ? [ ] { }

These extended ASCII characters also break sentences:

  0x9b: ¢   0x9c: £   0xa8: ¿   0xad: ¡   0xc4: -   0xff: _

To parse a file, type

  PARSE file.txt

Here is a sample output, using the /O switch, on the MANUAL.DOC file
included with PKZip 2.04g:

  Printable letters:   190438
  Words:                27888
  Sentences:             1931
  Avg. letters/word:      4.9
  Avg. words/sentence:   14.4

  10 most common words:
   1) File     (907)
   2) Zip      (478)
   3) Pkzip    (409)
   4) Files    (395)
   5) |        (194)
   6) Option   (191)
   7) Pkunzip  (182)
   8) May      (178)
   9) ¦        (168)
  10) List     (146)

This archive contains both 32-bit and 16-bit executables, as well as
source code. The 32-bit version was compiled with DJGPP 2.01; the 16-bit
version was compiled with Borland Turbo C++ 3.0 using /mc. It is highly
recommended that you use the 32-bit executable. The 16-bit executable is
limited in memory capacity; while it is adequate for smaller files (up to
1.5 MB or so), it cannot handle multi-megabyte files like Parse32 can,
and it is slower.

For this reason, both executables use a three-pass algorithm (I
know... slow!). The first pass generates the statistics. The second pass
builds the common word list by writing each word into memory (a binary
tree). The third pass searches the tree for the words with the greatest
frequency. (An illustrative sketch of this binary-tree approach appears
at the end of this file.) If Parse16's binary tree grows too large during
the second pass, you will get just the statistics and no common word
list.

The running time of the common-word-list algorithm is approximately
(C+1) * N * log2(N), where C is the number of common words to find and N
is the number of unique words in the file.

This program is shareware. Please distribute it freely. If you make any
modifications to the source code, please contact me.

Contact information:

  Gregory Kwok
  gkwok@jps.net
  http://www.jps.net/gkwok/
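
The following is a minimal, illustrative sketch in C of the binary-tree
word counting described in the algorithm notes above. It is NOT the
actual Parse source; every name in it (node, insert, find_max, and so
on) is made up for this example, it splits words only on non-alphabetic
characters, and it reports just the single most frequent word rather
than a ranked list.

  /* demo.c -- illustrative sketch only, not the Parse source.
     Builds a binary search tree of words with counts (like Parse's
     second pass), then scans the tree for the most frequent word
     (like the third pass). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <ctype.h>

  struct node {
      char *word;
      long count;
      struct node *left, *right;
  };

  /* Insert a word into the tree, or bump its count if already there. */
  static struct node *insert(struct node *root, const char *word)
  {
      int cmp;

      if (root == NULL) {
          root = malloc(sizeof *root);
          if (root == NULL)
              exit(1);
          root->word = malloc(strlen(word) + 1);
          if (root->word == NULL)
              exit(1);
          strcpy(root->word, word);
          root->count = 1;
          root->left = root->right = NULL;
          return root;
      }
      cmp = strcmp(word, root->word);
      if (cmp < 0)
          root->left = insert(root->left, word);
      else if (cmp > 0)
          root->right = insert(root->right, word);
      else
          root->count++;
      return root;
  }

  /* Walk the whole tree, remembering the most frequent word seen. */
  static void find_max(const struct node *root, const struct node **best)
  {
      if (root == NULL)
          return;
      if (*best == NULL || root->count > (*best)->count)
          *best = root;
      find_max(root->left, best);
      find_max(root->right, best);
  }

  int main(int argc, char **argv)
  {
      FILE *fp;
      struct node *root = NULL;
      const struct node *best = NULL;
      char word[128];
      int c, len = 0;

      if (argc < 2 || (fp = fopen(argv[1], "r")) == NULL) {
          fprintf(stderr, "usage: demo file\n");
          return 1;
      }
      /* Second pass: split input into words, insert each into the tree. */
      while ((c = fgetc(fp)) != EOF) {
          if (isalpha(c) && len < (int)sizeof word - 1) {
              word[len++] = (char)tolower(c);
          } else if (len > 0) {
              word[len] = '\0';
              root = insert(root, word);
              len = 0;
          }
      }
      if (len > 0) {              /* last word, if the file ends mid-word */
          word[len] = '\0';
          root = insert(root, word);
      }
      fclose(fp);

      /* Third pass: search the tree for the highest count. */
      find_max(root, &best);
      if (best != NULL)
          printf("Most common word: %s (%ld)\n", best->word, best->count);
      return 0;
  }

Compiled with any standard C compiler (for example, gcc demo.c -o demo)
and run on a text file, it prints the single most common word and its
count. Parse itself keeps a ranked list of the top n words and applies
the /C, /M, /O, and /S rules on top of this basic idea.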