Phrase Extractors
   Two ways to extract phrases from a corpus or text


N-Gram Phrase Extractor

The N-Gram extractor splits a text into word strings and counts those that are repeated, whether or not they are meaningful in themselves (unmeaningful examples would be 'of the', 'in a', etc.). Repeated sequences are often called n-grams or lexical bundles.
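
The counting step described above can be sketched roughly as follows. This is a minimal illustration, not Lextutor's own code: the function name `repeated_ngrams`, the tokenizing rule (lowercased alphabetic words), and the `min_count` cut-off are all assumptions made for the example.

```python
import re
from collections import Counter

def repeated_ngrams(text, n=2, min_count=2):
    """Return {n-gram: count} for word sequences seen at least min_count times."""
    # Assumed tokenizer: lowercase, alphabetic words, optional internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Slide a window of n words across the token list and join each window.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(grams)
    # Keep only the sequences that repeat, meaningful or not.
    return {gram: count for gram, count in counts.items() if count >= min_count}

sample = ("The cat sat in a box. The cat slept in a box of the old house, "
          "at the back of the garden.")
print(repeated_ngrams(sample, n=2))
```

On this sample the repeated two-word strings include both a content phrase ('the cat') and function-word strings such as 'in a' and 'of the', which is the mix the N-Gram list is expected to contain.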

M-I Phrase Extractor

The M-I Extractor is the N-Gram extractor with an extra step. It reduces the n-gram list to meaningful phrases (collocations) by calculating the mutual information (MI) provided by the individual words in each phrase. MI here is the ratio of a phrase's frequency to the total corpus frequency of each individual word in the phrase, averaged over the words. So if 'torrential rain/rainfall' appears 10 times in a corpus, and 'torrential' rarely appears except in this phrase, its ratio of phrase to total frequency is 10:10, or 1:1. Words from the 'rain' family also appear 10 times in the phrase, but another 10 times in other phrases, for a total of 20 and a ratio of 10:20, or .5:1. So the average for 'torrential' and 'rain' in this corpus is (1+.5)/2 = .75. A ratio >.5 indicates strong MI, so 'torrential rain' is a collocation or other unit. (MI needs a sizeable or specialized corpus for interesting results.)
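
The 'torrential rain' arithmetic can be checked with a short sketch. It follows the ratio as described on this page (phrase frequency over each word's total frequency, then averaged), which is simpler than the log-based MI found in some statistics texts. The names `mi_ratio` and `STRONG_MI` are hypothetical, and the counts are the ones from the example, not real corpus data.

```python
def mi_ratio(phrase_count, word_totals):
    """Average, over the words in a phrase, of phrase_count / word's total count."""
    ratios = [phrase_count / total for total in word_totals]
    return sum(ratios) / len(ratios)

# Assumed cut-off, taken from the '>.5 indicates strong MI' rule of thumb.
STRONG_MI = 0.5

# 'torrential rain' occurs 10 times; 'torrential' occurs 10 times in total,
# the 'rain' family 20 times in total.
score = mi_ratio(10, [10, 20])
print(score)              # 0.75
print(score > STRONG_MI)  # True
```

A full extractor would presumably take these counts from the n-gram list and a word frequency list for the same corpus, and would need some way of grouping word families such as 'rain/rainfall'; that step is not shown here.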

Phrase research

  1. Erman & Warren (2000)
  2. Review A. Wray (TC, 2002)
  3. Learner Corpus (TC, 2004)
  4. Review N. Nesselhauf (TC, 2005)
  5. Formulaic CALL (TC, 2018)


Related routines on Lextutor

  1. Concordancers (main/multi) do phrases
  2. Compleat Lister does phrases
  3. Phrase Profiler finds pre-set phrases in texts
  4. N-Word Cloze makes phrase cloze
  5. Keyword Extractor is like M-I but for single words