Phrase extractor v.1.2
Find collocations in a text/corpus with mutual information (MI)
Collocations are high mutual-information phrases: phrases whose frequency as a phrase is high relative to the average frequency of their component words taken separately. E.g., in some corpus, Puerto Rico freq=10, Puerto freq=11, Rico freq=11. I.e., the words in this phrase appear mainly in this one phrase, not in other phrases or independently. The frequency of Puerto Rico is 10, the averaged frequency of the component terms is (11+11)/2 = 11, so the ratio of phrase to word frequency is 10:11, or about .91. MI becomes interesting at ratios of about .5 or .7 (see examples in the sample corpora provided). Maximum input is 800k words.
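
The phrase-to-word frequency ratio described above can be sketched in a few lines of Python. This is only an illustration, not the extractor's actual code: it assumes whitespace-split tokens, looks at two-word phrases only, and the names phrase_ratios, top_phrases, min_phrase_freq and threshold are all hypothetical. The default threshold of 0.5 is taken from the "about .5 or .7" guideline.

```python
from collections import Counter


def phrase_ratios(tokens, min_phrase_freq=2):
    """Map each adjacent word pair to phrase_freq / mean(component word freqs)."""
    word_freq = Counter(tokens)
    pair_freq = Counter(zip(tokens, tokens[1:]))  # adjacent two-word phrases
    ratios = {}
    for (first, second), freq in pair_freq.items():
        if freq < min_phrase_freq:  # hypothetical cutoff to skip one-off pairs
            continue
        mean_word_freq = (word_freq[first] + word_freq[second]) / 2
        ratios[(first, second)] = freq / mean_word_freq
    return ratios


def top_phrases(tokens, threshold=0.5, min_phrase_freq=2):
    """Return (pair, ratio) tuples at or above threshold, highest ratio first."""
    ratios = phrase_ratios(tokens, min_phrase_freq)
    kept = [(pair, r) for pair, r in ratios.items() if r >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy input (assumed): naive whitespace tokenization, case left as-is.
text = "we flew to puerto rico and then left puerto rico by boat"
for pair, ratio in top_phrases(text.split()):
    print(" ".join(pair), round(ratio, 2))
```

A real corpus would need proper tokenization (punctuation, case folding) and possibly longer phrases; the sketch leaves those choices open.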