LEXTUTOR CORPUS SOURCES

Where do Lextutor's Corpora come from?

The selection on Lextutor includes classic mid-size corpora, some web culls, plus corpora developed by Lextutor users and students. They are presented here roughly in order of frequency of access but with any new acquisition toward the top.

Get back to this page from linked pages using your Back or Alt-Left Arrow key

NEWER ACQUISITIONs USUALLY NEAR THE TOP,

Half-K

JAN 2025
Five thousand words within the first 500 families by BNC/COCA frequency, meaning that they have a consistent 95% with just over 500 most frequent words. From two sources, (1) the most basic level of Sonia Millet's 'speed reading' collection, an associate of Paul Nation (https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/speed-reading-and-listening-fluency), and (2) Camille Hanson's (2022) 77 Real-Life Dialogues with 500 most common English Words (https://www.learnenglishwithcamille.com.)

Harry Potter

JAN 2025
The entire seven original stories by J. K. Rowling, which rather remarkably all meet their 95% coverage mark at 4k families. A real find, available https://kvongcmehsanalibrary.wordpress.com/wp-content/uploads/2021/07/harrypotter.pdf

COCA Movies

APRIL 2024
Despite supposedly being samples of 'real language,' most corpora are in fact sanitized, with all 'bad words' removed. I discovered this when trying to show a learner that the expression 'I don't give it a sh*t' is non-existent. The search was 'give' with right associate 'sh*t' which should pick up both 'give a sh*t' and 'give it a sh*t' if present.
Surprise: This gave hits = 0 on all the corpora on Lextutor! Hm, bad news. These are mainly academic I guess, where this sort of language would be rare, though BNC speech and COCA speech sampler should have it. Just not big enough?
So I went to Mark Davies and got this corpus, a 1.6m word mini-sample from the 575m word 'TV and Movies' section of COCA (many thanks, Mark) - which is bristling with R-rated authenticity.
Get more details at https://www.english-corpora.org/files/tv_movie_corpora.pdf

QUEBEC LEARNER CORPUS

Hardly ' new' but rather a new assembly format of the three-part LC (learner corpus) based on Cobb (2003) study Analyzing late interlanguage with learner corpora: Quebec replications... (here) now reformatted in separate-files format (begin-inter-adv) to facilitate comparison. In this format, the whole study can be replicated by simple concordancing. The raw corpus files though small by current standards, totally unstructured, and sadly not equal in size (adv=62k, inter=58k, begin=32k) but can be calculated as porportions of the total, 152k. These are are requested regularly by graduate students doing LC studies. Sub-corpora can be accessed as three separate text files here. An interesting search to start with is subordinating conjunctions ('and' or 'but' vs 'while' or 'since' or first-person vs other pronouns - sorted by sub-corpus).

CHILDREN'S STORIES

This is a collection of the 13 most popular (L1) children's stories on Gutenberg, about 947,000 words (Alice, Anne, Kim, Black Beauty, Wind in the Willows, Wizard of Oz, Just So, Secret Garden, Treasure Island, Little Women, and Heidi). The storytelling is uniformly clear and yelds excellent concordance lines for learners. The lexical level, however, is another story. This collection confirms that the vocabulary of L1 stories is rather difficult (as discussed in Webb & Macalister's (2019) Why might children’s literature be difficult for non-native speakers of English?). These authors use VP tools to discover that 98% coverage is achieved in texts of this type only with knowledge of about 8-10 thousand word families, far more than is needed for graded readers. (Users can replicate this experiment with VP-Big with VP Compleat.) W+M advance some reasons for this (childen know more words than L2 learners, are more willing to re-read, are more tolerant of partial comprehension, and their texts often have pictures); two more might be the archtypicality of the story lines and the large yet constrained number of proper nouns.

LANGUAGE OF VOCAB RESEARCH

This collection of about 140,000 words is the entire RIFL 2010 Festschrift for Paul Nation (available as PDF files here), in which about a dozen of Nation's colleagues and students contributed pieces related to his work. A search on right collocates of word pulls out a useful summary of current themes in vocabulary research.

COMBO LL

This corpus is the combined 'cross-corpus' used by Cobb and Laufer's (2021) paper in Language Learning about the rationale, development, and validation of the nuclear frequency list (NFL7). It is 3.7 million words and is combined from two other corpora, the BNC spoken and written samplers of 1 million words each and the COCA US speech and writing sampler avaiable from Mark Davies' website.

Bible Corpora (two, c.800k words apiece)

First is the 1769 edition of the King James Version (1611), rendered chronological by following the scheme of the World English version which appears just below. The analysis goes down as far as Books (Genesis etc) but not chapters. Apocrypha not included. Also not included are the smaller Books 26_Ezekiel 27_Daniel 28_Hosea 31_Obadiah 35_Habbakuk though the numbering does not reflect this omission but rather proceeds from 1_Genesis to 66_Revelations in the traditional manner. The New Testament begins in Book 40 with Matthew. Ranges are calculated out of 61 not 66.
Texts come from Word one-doc-per-book collection here (http://www.gpbc.ca/bibledownload.html) rendered as TEXT files using Convert Doc by SoftInterface here.
Second is the World English Bible (2000), based on both American Standard Version (1901) and American Standard Version (1997), and designed to be internationally comprehensible. It normally lives here (https://ebible.org/ and is described here (en.wikipedia.org/wiki/World_English_Bible. Lextutor sub-analyzes down as far as Books (Genesis etc) + chapters but not verses; Apocrypha included though the attempt to expand abbreviated names is uncertain n some cases. On Lextutor this version is divided by Book (Genesis etc) and chapter (well over 1,000 in all) but not by verse.

Mid-Frequency Graded Stories Corpus (1 m)

THis is one of the collections of reduced-vocabulary Mid-Frequency Readers created by Paul Nation and Laurence Anthony. This collection responds to the need for graded/simplified texts beyond just the usual elementary (1k-2k) level. These are the 4k Level series of about 15 classic fiction and non-fiction texts (there are also 6k and 8k Levels). These and other simplified texts can be found here and are described in a 2013 research paper here. Names of the texts can be seen by concordancing items through the separate-files version of this collection.
The 4k reference means that in these texts a learner who knows 4,000 word families will know close to 98% of the words on average in these texts. This can be seen for the collection as a whole by entering the whole corpus at VP in the 'VP-Big' menu bottom left (copied here->).
The goal in creating this corpus is to give intermediate learners corpus data they can make sense of as they undertake 'Data Driven Learning.'

Brown (1 m)

The Brown is the classic early corpus that many others are based on. American, late 1970s, developed by Kucera and Francis at Brown University (NJ), this corpus comprised 500 written texts of 2,000 words each in three main divisions (press, journalism, and academic)and several subdivisions. FOr a functioning list of subdivisions on Lextutor, see Range - Corpus version (../range/corpus). For the original corpus manual see HREF.
The 500 x 2000 formula was also the basis of the Braun German corpus and Bruno Spanish corpus developed especially for Lextutor (at https://www.lextutor.ca/conc/germ/ for German and https://www.lextutor.ca/conc/span for Spanish).

LOB (1 m)

The Lancaster-Oslo-Bergen Corpus (LOB, 1978) was a UK-English adaptation of the Brown Corpus and follows an identical sampling method (ICAME, Stig Johansson)

BASE - British Academic spoken English (1.6m)

The British Academic Spoken English (BASE) corpus contains 196 hours (1,644,942 words) of transcribed lectures and seminars in four academic areas (arts and humanities, social science, physical science, and life/medicine) in roughly 2010. It was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi (formerly of the Centre for Applied Linguistics [previously called CELTE], Warwick) and Paul Thompson (formerly of the Department of Applied Linguistics, Reading), with funding from BALEAP, EURALEX, The British Academy (SG 30284) and the Arts and Humanities Research Board as part of their Resource Enhancement Scheme (RE/AN6806/APN13545).
BASE joined Lextutor in March, 2018. The BASE website is here and a summarized PDF description of corpus contents is here or in more complete spreadsheets via the link just above.

BAWE - British Academic Written English (6.5m)

The British Academic Written English (BAWE) corpus of university students' writing was developed at the Universities of Warwick, Reading and Oxford Brookes under the directorship of Hilary Nesi and Sheena Gardner (formerly of the Centre for Applied Linguistics [previously called CELTE], Warwick), Paul Thompson (formerly of the Department of Applied Linguistics, Reading) and Paul Wickens (Westminster Institute of Education, Oxford Brookes), with funding from the ESRC (RES-000-23-0800). The BAWE website is here.
From the website we read the following description:
"The corpus is a record of proficient university-level student writing at the turn of the 21st century. This Excel Spreadsheet contains information about the corpus holdings. A more detailed spreadsheet is available from the Oxford Text Archive. It contains just under 3000 good-standard student assignments (6,506,995 words). Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty main disciplines are represented."
The 1 million word 'sampler' selection from the larger 6.5 million word BAWE was installed in Lextutor in June 2015 at the request of academic EFL instructors for use with learners in e.g. Multiconc. The full 6.5 million word corpus was installed in February 2018, in collaboration with Prof. Nesi of Coventry University, and is sortable in by the 30 subject divisions that comprise the corpus. Corpus composition is described in summarized PDF format here or in more complete spreadsheets via the link just above.

COCA 'Now' sampler (1.72 million)

Added to Lextutor in December 2019. A roughly 1:100 sampler of Mark Davies' huge and growing COCA (Corpus of Contemporary American English). Described and obtainable here (look under 'linear text'; in multiple files that need assembly and clean-up).
Plus the Now speech sampler as a separate corpus of 387,000 words

BNC Written (1 million), BNC Spoken (1 million)

After the compilation of the 100 million word British National Corpus in 1986 (www.natcorp.ox.ac.uk/ , Oxford University Press publicized the achievement in two BNC Sampler corpora of roughly 1 million words each on CD-Rom, one of spoken English and one of written English, and fully tagged for SGML parsing (Description https://ucrel.lancs.ac.uk/bnc2sampler/sampler.htm, obtain https://www.natcorp.ox.ac.uk/corpus/index.xml?ID=products). These were modified for work on Lextutor by having their tags removed, and serve mainly to explore differences between written and spoken English (e.g. at https://www.lextutor.ca/range/corpus/).

Brown + BNC Written (2+ m)

These corpora are described above. The purpose of joining the Brown and the Written Sampler into a single corpus was threefold:

to form a corpus large enough to give at least 10 examples of most medium frequency items (for example at www.lextutor.ca/list_learn
which was nonetheless small enough to run over the Web on a phone line, and
combine features of British and American varieties of English.

Then in 2023 a further purpose for this corpus was to serve as a reference corpus for keyness calculations (the frequency of a word or expression in any corpus compared to this fairly standard corpus of written language, expressed as a proportion of 1).

General Academic (6+m)

This 6 million word corpus is an amalgam of the following, all available separately on Lextutor, here assembled in the aim of providing a sizeable "general academic" corpus:

bnc_humanities 3,361,000 million
bnc_soc_sci 2,322,000 million
AA_Academic_Abstracts 174,000
Brown_Academic 162,500
It is largely supplanted by the BAWE Complete with subject divisions but as a single-file corpus has its own uses.

Dr House Corpus

These comprise scripts for the entire eight seasons (811,150 words in 176 episodes, 2004-2012) divided and sortable here by season. Engaging and unexpectedly popular series starring UK Hugh Laurie in convincing US voice as a sarcastic, anti-social, pain-wracked, vicodin-addicted top diagnostician in a private for-pay East Coast hospital. For House, science and logic are a kind of religion, in echoes of Sherlock Holmes.
Scripts were obtained at http://www.springfieldspringfield.co.uk/ and extensively cleaned up by Clinton Hendry from Wiki class (this page).
An interesting corpus project would be to compare medical with fiction-medical usages and frequencies.

Wiki Corpus

Developed by MA students in Concordia's APLI program in a course given by Marlise Horst on Applied Corpus Linguistics in Winter session 2016. Comprises 1,051,921 words from random entries in each of Wiki's 12 divisions. Using native random Wiki Encyclopedia with repeat function and further randomization interface at https://www.lextutor.ca/tools/wiki_corpus/, 12 students contributed one sub-corpus each of random entries up to 900,000 words based on one of Wiki's 12 epistemological groupings. Individual corpus file titles and URLs can be seen here. Lextutor Concordancer was adapted on this occasion to search and sort by sub-corpus.

Simplified Wiki Corpus

A continuation of Wiki Corpus project above, carried forward by Clinton Hendry and Emily Sheepy at Education Dept., Concordia, Montreal. The whole simplified corpus, 15.46 million words, 101 MB.
"The Simple English Wikipedia Corpus was created for research into the online simplified English encyclopedia by Clinton Hendry and Emily Sheepy. It is derived from the June 20th, 2017 Simple English Wikipedia file dump located at https://dumps.wikimedia.org/. After downloading the .xml file, we unpacked it using open source software, and then reorganized the dump into a single .txt file while removing unnecessary tags (e.g. tags) using Bash. Programming assistance credited to Christopher J.F. Cameron."

Shakspeare (5 m)

Added January, 2020. Comprises all 37 plays plus the sonnets; plays are differentiated in concordance output, but sonnets are not. PLays approximately 836,000 words, Sonnets 19,000 words. Or, 18,000 different words (types), about 105 of which were used by Shakespeare for the first time. Average per play is 22,600 according to https://www.opensourceshakespeare.org/views/plays/plays_numwords.php - which also has a great online presentation for anyone who wants to read not analyse and includes a full-sentence concordancer.
Of the many Shakespeare collections available that were considered for treatment by Lextutor, only this one from Folger generated clear and attractive true concordances and broader contexts (others were XML and various other special formats).
One challenge was to reduce the size of some of the play names so not to obscure the contexts (As You Like It -> AYLikeIt).
Files downloaded from Folger Digital Texts under the non-commercial Digital Commons agreement https://www.folgerdigitaltexts.org/download/.

2k Graded Corpus (1.2 m)

This corpus is formed of hundreds of graded readers, scanned and digitized over 10 years. They have an overall VocapProfile of 2000 word families = 95% of the running words overall (not counting proper nouns). This corpus answers a major need in pedagogical concordancing, that in order for learners top perceive lexical or other patterns in a corpus, the corpus must be largely composed of items they are familiar with. This is not the case with e.g. the Brown corpus with lower intermediate ESL learners (2000 word families=80% of running items).

1k Graded Corpus (530,000)

This is a subset of the 2k graded corpus, with a profile of 1000 word families = 90% of running items. This is probably the closest thing we have at present to a corpus for near beginners.

BNC Law (2.2 million), BNC Med (1.4 million)

These are parts of the larger BNC corpus with temporary residence on Lextutor as part of an undergraduate applied linguistics exercise in building technical lexicons.

QUEST approved Quebec secondary ESL course

Developed by Juliane Martini for her MA study of lexical recycling in course materials in c. 2010 (mainly using Lextutor Range, Paul Nation's Range, or AntProfiler). Comprises three text books and three corresponding workbooks approved by Ministère de l'Education, Loisirs et Sport (MELS) du Québec. Writers Cynthia Beyea, Paul Bougie, Claire Maria Ford; publisher Chenelière, Montreal. Course books total 284,577 words; workbooks 85,051.

UK Law Reports (BLaRC, 835k wds)

MODIFICATION DESCRIPTION
In June, 2024, the BLaRC corpus was modified to comprise two sub corpora of UK law reports (CIVIL 417,965 and CRIMINAL 418,000) and renamed "UK Law Reports (BLaRC)." The original full 8+-million word corpus is still available from Lextutor but searches are slow on this size file as a single file. Check research related to BLaRC since it was published in c. 2010 at https://webs.um.es/mariajose.marin1/
ORIGINAL DESCRIPTION
BLaRC, the British Law Report Corpus, is an 8.85 million-word legal English corpus of law reports, that is, collections of judicial decisions as officially transcribed at British courts and tribunals. It is owned by María José Marín, a lecturer in legal English at the University of Murcia. Law reports were selected as the genre to build the corpus on owing to the pivotal role they play in common law countries such as the UK, being one of the major sources of law for these legal systems.Their lexical richness is also remarkable as they include terminology pertaining to all areas of law.

TV - scripted narrative (554,000)

This is collection was produced as part of an assignment in a colleague's graduate course in applied linguistics in 2008.
Details. The television corpus used in this study was established through the combined efforts of Applied Linguistics graduate students at ConcordiaUniversity. The corpus contains 10 popular 1990s TV shows, five comedies and five dramas that the graduate students, in an applied corpus linguistics course, deemed to be typical of what learners might be asked to watch as part of their language enrichment homework. The five comedies were: How I Met Your Mother, The Office, Seinfeld, Two and a Half Men and Frasier. The five dramas were: Alias, Desperate Housewives, Grey's Anatomy, Lost and Prison Break. The corpus material is narrative; news, commentaries or talk shows were not included. The sub-corpora from the 10 shows were compiled by downloading transcripts freely available on the internet; stage prompts and other non-spoken material in the transcripts were deleted manually. Each of the 10 show corpora amounted to around 50,000 words; the number of episodes represented in each ranged from 11 to 18 (due to differences in show length and amounts of talk that occurred in them). In total the corpus contained approximately 554,000 words in roughly equal halves, i.e., the comedy and drama sub-corpora amounted to about 250,000 words each.

2000 list corpus (240,000)

This is the purpose built corpus of coursebook materials developed for the 1997 Pet-2000 study (described in Breadth & Depth ) on which Lextutor is based.

Univ. Word List (550,000)

Similar to the previous corpus, this is a corpus-built collection of roughly the same vintage, designed to fuel a corpus based CALL program that gives students of the UWL (University Word List) exposure to a minimum number of examples of each of the UWL's 580 word families.

Focus on Vocab (82,300)

THis is the text part of Norbert and Diane Schmitt's 2007 book Focus on Vocabulary for learning the AWL (Academic Word List.These are great texts bearing high proportions of AWL vocabulary. This corpus could be joined with the UWL corpus at some point.

Call of the Wild (24,000)

A book as corpus - in this case the heavily used Call of the Wild developed as a full Assisted Reading hypertext at https://www.lextutor.ca/ra/CallWild here available for other kinds of searches.

TC Learner (Student) (150,000), TC Learner (Teacher) (61,000)

These are the learner corpora described in Cobb, T. Analyzing late interlanguage with learner corpora: Quebec replications of three European studies. Canadian Modern Language Review, 59(3), 393-423. (At at cv page)) Learner (Student) is composed of three levels of ESL students, originally three mini-corpora of 50,0000 words each but here joined to get some size.

JPU Learner (300,000)

This is a Learner Corpus developed by Joseph Horvaths for his PhD, a collection of 221 Hungarian student essays and research papers. The corpus is described at his blog http://joeandco.blogspot.com/

Presidential Speeches (2.15 million)

Montreal scholar Pierre Henrichon's hobby project - a large collection of presidential speeches collected mainly from www.presidency.ucsb.edu, in which he is mainly interested in military/militaristic references. In English; goes up to Obama.
Expanded to include early Trump presidency speeches in Aug 2020.

RAC Research Articles Corpus (HK, 132,000 wds)

The Research Article Corpus (RAC_academic) consists of 19 empirical journal articles (132,102 words) of required readings for students in English Language Education, most with an IMRD structure, recommended by MA and PhD students in the Education Faculty of the University of Hong Kong as the key reference articles related to their study/research. For inclusion as the default corpus in a Chinese language version of ConcordWriter (www.lextutor.ca/cgi-bin/conc/write/index.pl - August, 2011)

AA Academic Abstracts

AA corpus description
The corpus was compiled in an electronic format from the World Wide Web and is approximately 174,000 words. In order to qualify for the corpus, the abstracts primarily had to be from universities in countries where English is the native language. Another important criterion was that the abstracts had to be thesis and dissertation abstracts, since abstracts written for journal articles and conference presentations tend to differ. Therefore, journal article and conference presentation abstracts were disqualified.
The abstracts had to be written at Master’s and PhD levels at educational institutions. This criterion naturally excluded abstracts written at Bachelor’s level. One advantage of abstracts is that they do not exhibit significant differences in terms of length. This property contributes to balance within the corpus. To ensure representativeness of the corpus in accordance with the purpose of the study and corpus design in general, the corpus covers four main disciplines: Arts and Humanities, Social Sciences, Sciences and Architecture/ Urban and Regional Planning. Each discipline, making up a sub-corpus, includes 150 representative texts, in this case abstracts.
In the Arts and Humanities sub-corpus, 5 abstracts come from Anthropology, 30 from Archaeology, 19 from Art History, 7 from History, 40 from Language / Literature / Linguistics, 10 from Philosophy / Religion / Theology, 21 from Psychology, 5 from Music and 13 from Sociology.. Of the 150 abstracts in the Social Sciences sub-corpus, 20 are from Business Administration, 17 from Communications, 3 from Demography, 20 from Economics, 20 from Education, 5 from Geography, 38 from Information Technology, 12 from Accounting and 15 from Political Science. The Sciences sub-corpus is composed of 3 abstracts from Algebra, 13 from Biological Sciences, 10 from Chemistry, 11 from Computer Science, 92 from Engineering, 12 from Mathematics and 10 from Physics. The Architecture sub-corpus includes abstracts from the fields of town and regional planning, landscape architecture, interior architecture as well as architecture. Assembled for a thesis study by Nilgun Hancioglu, Eastern Mediterraneam Univ, Famagusta, Cyprus.
For inclusion as the default corpus in a Turkish language version of ConcordWriter (www.lextutor.ca/cgi-bin/conc/write/index.pl - June, 2011)

BNC Speech 10 million

The BNC is 10 million speech and 90 million text, which sums to 100 million. In making the BNC lists, Nation (e.g. 2006) used the speech section as the basis for the first two 1,000 lists, in order to assure that items like "hello" would appear in the most frequent zones (for pedagogical reasons).

BN-COCA 1k-2k / 14 million

(Following from the previous entry ~) Then when Nation integrated the COCA and BNC lists he needed something resembling the BNC 10-million spoken in both UK and US registers as well as spoken plus written as a basis for the first two 1,000 lists (for reason described just above). He therefore developed his own corpus of spoken and written US-UK speech and simplified readers (described on his website - see here ) as the basis for these lists. This corpus is kindly offered here to Lextutor users.

Elecrical Engineering - textbooks and ESP coursebooks

These corpora were assembled by Lin Chen for her Master's study at Carlton University, which compared "real" textbooks in electrical engineering to ESP (English for Specific Purposes) coursebooks for Electrical Engineering English. The volumes in question are (1) Irwin, J. D. (2002). Basic engineering circuit analysis (7th ed.). New York: John Wiley & Sons; (2) Sedra, A.S. & Smith, K.C. (2004). Microelectronic circuits (5th ed.). Oxford: OUP. (3) Glendinning, E. H., & McEwan, J. (1993). Oxford English for electronics. Oxford: OUP; and (4) Glendinning, E. H., & Glendinning, N. (1993). Oxford English for electrical and mechanical engineering. Oxford: OUP. Here the four volumes appear together in a single corpus.

Korean EFL teacher corpus

Yenny Kwon, Corpus of NS/NNS EFL teacher talk
Two corpora of EFL teacher talk from low-intermediate EFL classrooms at six Korean universities: one by NNS (non-native speaking) EFL teachers (123,122 words) and the other by native speakers (124,276 words). Recorded and transcribed by Yenny Kwon from 30+ hours of instruction in each condition.
For her PhD dissertations at Ewha Women's University graduate school, on formulaic sequences (chunks) in teacher talk of native and non-native teachers. Goal was to provide non-native teachers with meaningful lists of chunks useful in teaching.
Can also be accessed from UCL Louvain learner corpus collection
https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html