Phrase Extractors    
 Two ways to extract phrases from a corpus or text


N-Gram Phrase Extractor

The N-Gram extractor breaks a text into strings of consecutive words and counts every string that is repeated, whether 'meaningful' or not. (Unmeaningful strings would be 'of the', 'in a', etc.) In the research literature, such repeated sequences are called n-grams or lexical bundles.
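
The counting step can be sketched in a few lines of Python. This is only an illustration and not the Lextutor routine itself: the function name repeated_ngrams, the simple tokenizer, and the min_count threshold are all assumptions made for the example.

```python
import re
from collections import Counter

def repeated_ngrams(text, n=2, min_count=2):
    """Count every run of n consecutive words; keep runs seen min_count+ times.

    Hypothetical helper for illustration; the tokenizer is deliberately crude.
    """
    # Lowercase and keep only alphabetic words (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the text.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    # Keep repeated strings only, meaningful or not.
    return {gram: count for gram, count in counts.items() if count >= min_count}

sample = "the cat sat in a box and the dog sat in a basket"
print(repeated_ngrams(sample, n=2))  # {'sat in': 2, 'in a': 2}
```

Note that 'in a' is returned alongside 'sat in': the count makes no judgment about whether a repeated string is meaningful.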

M-I Phrase Extractor
The M-I Extractor does the same but adds an extra step. It reduces the full set of repeated phrases to those that are (probably) meaningful, or 'collocations', by calculating the mutual information shared by the words in each phrase, defined here as the ratio of the phrase frequency to the averaged frequency of the individual words in the phrase. For example, in 'torrential rainfall' the first word rarely appears except in this one phrase, so the mutual information is strong. (A sizeable specialized text or corpus is needed to give interesting results.)
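
The ratio described above can be sketched as follows. This is a minimal illustration of the definition as stated here (phrase frequency divided by the averaged frequency of its words), not the Lextutor code and not the log-based pointwise mutual information found elsewhere in the literature. The function name mi_ratio and the tokenizer are assumptions.

```python
import re
from collections import Counter

def mi_ratio(text, phrase):
    """Phrase frequency divided by the mean frequency of the phrase's words.

    Hypothetical helper; follows the ratio definition given in the text above.
    """
    words = re.findall(r"[a-z']+", text.lower())
    parts = phrase.lower().split()
    n = len(parts)
    # How often the whole phrase occurs as a run of consecutive words.
    phrase_freq = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == parts
    )
    # Averaged frequency of the individual words in the phrase.
    word_freq = Counter(words)
    avg_word_freq = sum(word_freq[w] for w in parts) / n
    return phrase_freq / avg_word_freq if avg_word_freq else 0.0

sample = ("torrential rainfall hit the coast and the rainfall record fell "
          "while torrential rainfall flooded the town of the bay")
print(mi_ratio(sample, "torrential rainfall"))  # 0.8
print(mi_ratio(sample, "of the"))               # 0.4
```

In this toy sample 'torrential' never occurs outside the phrase, so the ratio is high, while 'the' occurs mostly outside 'of the', which pulls that ratio down. A real cut-off between collocations and other phrases would have to be chosen on a much larger text.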


Phrase research
Related routines on Lextutor
