Phrase Extractors    
 Two ways to extract phrases from a corpus or text


N-Gram Phrase Extractor

The N-Gram extractor breaks a text into word strings (sequences of consecutive words) and counts any that are repeated, whether 'meaningful' or not. (Non-meaningful strings would be 'of the', 'in a', etc.) In the research literature, such repeated sequences are called n-grams or lexical bundles.
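
The counting step can be sketched in a few lines of Python. This is only an illustration of the general idea, not the routine Lextutor itself runs: the function name repeated_ngrams, the simple regex tokenizer, and the min_count threshold are all assumptions made for the example.

```python
import re
from collections import Counter

def repeated_ngrams(text, n=2, min_count=2):
    """Count every run of n consecutive words; keep those seen min_count+ times.

    Hypothetical helper: tokenization (lowercase letters and apostrophes)
    and the threshold are assumptions, not Lextutor's actual settings.
    """
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the text, one word at a time.
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return {g: c for g, c in counts.items() if c >= min_count}

sample = "the cat sat on the mat and the cat slept on the mat"
print(repeated_ngrams(sample, n=2))
# {'the cat': 2, 'on the': 2, 'the mat': 2}
```

Note that 'on the' is returned alongside 'the cat': the count alone does not separate meaningful from non-meaningful strings.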

M-I Phrase Extractor

The M-I Extractor does the same but with an extra step: it reduces the full set of repeated phrases to (probably) meaningful phrases, or 'collocations', by calculating the mutual information shared by the words in each phrase, defined here as the ratio of the phrase frequency to the averaged frequency of the individual words in the phrase. For example, in 'torrential rainfall' the first word rarely appears except in this one phrase, so the mutual information is strong. (A sizeable specialized text or corpus is needed to give interesting results.)
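
The ratio described above can be sketched as follows. This is a minimal illustration that follows the definition given here (phrase frequency divided by the mean frequency of the words in the phrase); it is not the standard log-based pointwise mutual information, and it is not necessarily how the Lextutor routine is coded. The name mi_ratio, the tokenizer, and min_count are assumptions.

```python
import re
from collections import Counter

def mi_ratio(text, n=2, min_count=2):
    """Score repeated n-word phrases by phrase freq / mean word freq.

    Hypothetical helper implementing the simplified ratio described
    in the text; tokenizer and threshold are assumptions.
    """
    words = re.findall(r"[a-z']+", text.lower())
    word_freq = Counter(words)
    phrase_freq = Counter(zip(*(words[i:] for i in range(n))))
    scores = {}
    for phrase, count in phrase_freq.items():
        if count < min_count:
            continue  # only repeated phrases are candidates
        mean_word_freq = sum(word_freq[w] for w in phrase) / n
        scores[" ".join(phrase)] = count / mean_word_freq
    return scores

sample = ("torrential rainfall closed the port of the city "
          "and the roads of the north after torrential rainfall")
for phrase, score in sorted(mi_ratio(sample).items(), key=lambda x: -x[1]):
    print(f"{phrase}: {score:.2f}")
# torrential rainfall: 1.00
# of the: 0.67
```

Both phrases occur twice, but 'the' also occurs in many other places, which pulls the score for 'of the' down. A cut-off on the score would then keep the likelier collocations; the value of that cut-off is left open here.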


Phrase research
  1. Erman & Warren (2000)
  2. Review A. Wray (TC, 2002)
  3. Learner Corpus (TC, 2004)
  4. Review N. Nesselhauf (TC, 2005)
  5. Formulaic CALL (TC, 2018)



Related routines on Lextutor
  1. Concordancers (main/multi) do phrases
  2. Compleat Lister does phrases
  3. Phrase Profiler finds pre-set phrases in texts
  4. N-Word Cloze makes phrase cloze
  5. Keyword Extractor works like M-I but on single words