Miscellaneous Utilities for Text Processing

Testing and staging ground for useful pieces of future Lextutor routines, and for pieces of existing routines that have independent uses.

Forwarding addresses:

    FreqList Builders have moved to ../freq; Randomizers to ../rand

1. Tag Stripper

Removes HTML tags.
And, *NEW! Jan '16, square brackets [bla bla] and curly braces {bla bla}.
2. Corpus Builder
Joins up to 25 files, to a total of about half a million words.
3. Random Wiki Entries by Subject
Build your own balanced corpus with modest labour
4. Sentence Extractor / T-Unit Calculator (+ Std. Dev.) *NEW!
Splits a file into sentences.
5. Proper Stripper (under repair)

6. The Compleat Stripper (some elements under review Sept 2016)

Brought back on user demand Sept 2016, with problematic experiments removed.
7. Two useful off-site DBs (Collocations and Associations)
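
As a rough illustration of what the Corpus Builder does, the sketch below joins several text files into one and reports an approximate word count. It is not the Lextutor code: the function name build_corpus, the UTF-8 default, and the blank-line separator between files are assumptions made for the example; only the 25-file limit comes from the description above.

```python
from pathlib import Path

MAX_FILES = 25  # limit stated in the Corpus Builder description


def build_corpus(paths, encoding="utf-8"):
    """Join up to MAX_FILES text files; return (corpus_text, word_count).

    Hypothetical helper, not the routine used on the site.
    """
    paths = list(paths)
    if len(paths) > MAX_FILES:
        raise ValueError(f"at most {MAX_FILES} files, got {len(paths)}")

    # Read each file and trim surrounding whitespace.
    parts = [Path(p).read_text(encoding=encoding).strip() for p in paths]

    # Separate files with a blank line (an assumption); skip empty files.
    corpus = "\n\n".join(part for part in parts if part)

    # Whitespace-delimited tokens: a rough count, not a linguistic one.
    return corpus, len(corpus.split())
```

Calling build_corpus(["a.txt", "b.txt"]) on two existing text files would return the joined text and its token count, which can then be checked against the half-million-word guideline.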




  • Some of these routines require TEXT files as their input. A text file is a simple file that contains no codes for emphasis, font sizes, etc. To transform a Word file into a text file, simply SAVE it AS text. You will not thereby lose the original file, but will create an additional text file (identifiable by the .txt extension).

  • Most of these routines take their file inputs from a menu that accesses the hard drive of YOUR computer; they have not been adapted for copy-paste text entry. They have not been tested for French.

  • For complex jobs, combine routines (e.g., first strip the tags from an HTML file, save the result as a text file, then build a list or extract sentences).
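
The combination described in the last point can be sketched offline as follows. This is a minimal approximation, not the site's routines: all names are hypothetical, bracketed and braced material is assumed to be removed together with its contents, sentences are split naively on end punctuation (so abbreviations such as "e.g." will be split wrongly), and the statistics are for sentence length in words, not for true T-units, which would need syntactic analysis.

```python
import re
import statistics

TAG = re.compile(r"<[^>]*>")                     # HTML tags
BRACKETS = re.compile(r"\[[^\]]*\]|\{[^}]*\}")   # [bla bla] and {bla bla}
SENT_END = re.compile(r"(?<=[.!?])\s+")          # whitespace after . ! or ?


def strip_markup(text):
    """Remove tags and bracketed spans, then collapse whitespace."""
    text = TAG.sub(" ", text)
    text = BRACKETS.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def sentences(text):
    """Naive sentence split on end punctuation followed by whitespace."""
    return [s for s in SENT_END.split(text) if s]


def length_stats(sents):
    """Return (mean, sample standard deviation) of sentence length in words."""
    lengths = [len(s.split()) for s in sents]
    if not lengths:
        return 0.0, 0.0
    mean = statistics.mean(lengths)
    sd = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    return mean, sd


html = "<p>The cat sat. [note] It purred {aside} very loudly indeed!</p>"
clean = strip_markup(html)
sents = sentences(clean)
mean_len, sd_len = length_stats(sents)
```

Here clean could be saved as a .txt file between steps, mirroring the strip, save, then extract workflow.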

