Where do Lextutor's Corpora come from?

The selection on Lextutor includes classic mid-size corpora, some web culls, plus corpora developed by Lextutor users and students. They are presented here roughly in order of frequency of access, but with any new acquisition at the top.

Get back to this page from linked pages using your Back or Alt-Left Arrow key


Mid-Frequency Graded Stories Corpus (1 m)

This is one of the collections of reduced-vocabulary Mid-Frequency Readers created by Paul Nation and Laurence Anthony. This collection responds to the need for graded/simplified texts beyond just the usual elementary (1k-2k) level. These are the 4k Level series of about 15 classic fiction and non-fiction texts (there are also 6k and 8k Levels). These and other simplified texts can be found here and are described in a 2013 research paper here. Names of the texts can be seen by concordancing items through the separate-files version of this collection.

The 4k reference means that a learner who knows 4,000 word families will know close to 98% of the words in these texts on average. This can be seen for the collection as a whole by entering the whole corpus at VP in the 'VP-Big' menu bottom left (copied here->).
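The coverage idea can be sketched in a few lines of Python. This is only an illustration, not Lextutor's VP code: the function name cumulative_coverage and the toy family_band table are hypothetical, and a real profiler would map inflected and derived forms to word families rather than looking up surface forms directly.

```python
from collections import Counter

def cumulative_coverage(tokens, family_band, max_band):
    """Share of running words whose band number is max_band or lower."""
    counts = Counter(tok.lower() for tok in tokens)
    total = sum(counts.values())
    # Words missing from the table are treated as off-list (beyond max_band).
    known = sum(n for word, n in counts.items()
                if family_band.get(word, max_band + 1) <= max_band)
    return known / total if total else 0.0

# Hypothetical toy table: word -> 1,000-family frequency band.
family_band = {"the": 1, "dog": 1, "ran": 1, "across": 1, "frozen": 3, "tundra": 9}
tokens = "The dog ran across the frozen tundra".split()

coverage_4k = cumulative_coverage(tokens, family_band, 4)
print(f"Coverage at 4k: {coverage_4k:.0%}")
```

On a real 4k Level text, the same calculation with a full family list would be expected to come out near the 98% figure quoted above.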

The goal in creating this corpus is to give intermediate learners corpus data they can make sense of as they undertake 'Data Driven Learning.'


Brown (1 m)

The Brown is the classic early corpus that many others are based on. American, 1960s (texts published in 1961), developed by Kucera and Francis at Brown University (RI), this corpus comprises 500 written texts of 2,000 words each in three main divisions (press, journalism, and academic) and several subdivisions. For a functioning list of subdivisions on Lextutor, see Range - Corpus version (../range/corpus). For the original corpus manual see HREF.

The 500 x 2000 formula was also the basis of the Braun German corpus and Bruno Spanish corpus developed especially for Lextutor (at https://www.lextutor.ca/conc/germ/ for German and https://www.lextutor.ca/conc/span for Spanish).
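As a rough sketch of what a 500 x 2000 design amounts to, the Python below draws fixed-size extracts from a pool of tokenized source texts. It is a simplification under stated assumptions: sample_extracts is a hypothetical helper, and the actual Brown, Braun and Bruno compilers also controlled the genre of each text, which is not modelled here.

```python
import random

def sample_extracts(texts, n_samples=500, sample_size=2000, seed=0):
    """Draw n_samples extracts of sample_size running words, one per source text."""
    rng = random.Random(seed)
    # Only texts long enough to yield a full extract are eligible.
    eligible = [t for t in texts if len(t) >= sample_size]
    if len(eligible) < n_samples:
        raise ValueError("not enough source texts of sufficient length")
    extracts = []
    for tokens in rng.sample(eligible, n_samples):
        # Pick a random starting point that leaves room for a full extract.
        start = rng.randrange(len(tokens) - sample_size + 1)
        extracts.append(tokens[start:start + sample_size])
    return extracts

# Toy run with small numbers: 5 extracts of 20 words from 10 texts of 50 words.
texts = [[f"w{i}_{j}" for j in range(50)] for i in range(10)]
corpus = sample_extracts(texts, n_samples=5, sample_size=20)
print(sum(len(extract) for extract in corpus))
```

With the default arguments the result would total 500 x 2,000 = 1,000,000 running words, which is where the "1 m" size of these corpora comes from.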

LOB (1 m)

The Lancaster-Oslo-Bergen Corpus (LOB, 1978) is a UK-English adaptation of the Brown Corpus and follows an identical sampling method (ICAME, Stig Johansson).

BASE - British Academic Spoken English (1.6m)

The British Academic Spoken English (BASE) corpus contains 196 hours (1,644,942 words) of transcribed lectures and seminars in four academic areas (arts and humanities, social science, physical science, and life/medicine) in roughly 2010. It was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi (formerly of the Centre for Applied Linguistics [previously called CELTE], Warwick) and Paul Thompson (formerly of the Department of Applied Linguistics, Reading), with funding from BALEAP, EURALEX, The British Academy (SG 30284) and the Arts and Humanities Research Board as part of their Resource Enhancement Scheme (RE/AN6806/APN13545).

BASE joined Lextutor in March, 2018. The BASE website is here and a summarized PDF description of corpus contents is here or in more complete spreadsheets via the link just above.

BAWE - British Academic Written English (8.0m)

The British Academic Written English (BAWE) corpus of university students' writing was developed at the Universities of Warwick, Reading and Oxford Brookes under the directorship of Hilary Nesi and Sheena Gardner (formerly of the Centre for Applied Linguistics [previously called CELTE], Warwick), Paul Thompson (formerly of the Department of Applied Linguistics, Reading) and Paul Wickens (Westminster Institute of Education, Oxford Brookes), with funding from the ESRC (RES-000-23-0800). The BAWE website is here.

From the website we read the following description:

"The corpus is a record of proficient university-level student writing at the turn of the 21st century. This Excel Spreadsheet contains information about the corpus holdings. A more detailed spreadsheet is available from the Oxford Text Archive. It contains just under 3000 good-standard student assignments (6,506,995 words). Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty main disciplines are represented."
The 1 million word 'sampler' selection from the 8 million word BAWE was installed in Lextutor in June 2015 at the request of academic EFL instructors for use with learners in e.g. Multiconc. The full 8 million word corpus was installed in February 2018, in collaboration with Prof. Nesi of Coventry University, and is sortable by the 30 subject divisions that comprise the corpus. Corpus composition is described in summarized PDF format here or in more complete spreadsheets via the link just above.

COCA 'Now' sampler (1.72 million)

Added to Lextutor in December 2019. A roughly 1:100 sampler of Mark Davies' huge and growing COCA (Corpus of Contemporary American English). Described and obtainable here (look under 'linear text'; in multiple files that need assembly and clean-up).

Plus the Now speech sampler as a separate corpus of 387,000 words.

BNC Written (1 million), BNC Spoken (1 million)

After the compilation of the 100 million word British National Corpus in the early 1990s (www.natcorp.ox.ac.uk/), Oxford University Press publicized the achievement in two BNC Sampler corpora of roughly 1 million words each on CD-ROM, one of spoken English and one of written English, both fully tagged for SGML parsing (description: https://ucrel.lancs.ac.uk/bnc2sampler/sampler.htm; obtain: https://www.natcorp.ox.ac.uk/corpus/index.xml?ID=products). These were modified for work on Lextutor by having their tags removed, and serve mainly to explore differences between written and spoken English (e.g. at https://www.lextutor.ca/range/corpus/).
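Tag removal of this kind can be sketched with a regular expression. The snippet below is a minimal illustration, not the procedure actually used for Lextutor: strip_tags is a hypothetical helper, the sample line only loosely imitates word-level SGML tagging, and character entities (such as &amp;) are not handled.

```python
import re

# Anything between angle brackets is treated as markup.
TAG = re.compile(r"<[^>]+>")

def strip_tags(marked_up: str) -> str:
    """Replace each tag with a space, then collapse runs of whitespace."""
    text = TAG.sub(" ", marked_up)
    return re.sub(r"\s+", " ", text).strip()

# Invented sample line in an SGML-like style (assumption, for illustration only).
sample = '<s n="1"><w AT0>The <w NN1>corpus <w VBZ>is <w AJ0>large<c PUN>.</s>'
plain = strip_tags(sample)
print(plain)
```

Replacing tags with a space rather than with nothing avoids gluing neighbouring words together, at the cost of leaving punctuation as separate tokens.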

Brown + BNC Written (2+ m)

These corpora are described above. The purpose of joining the Brown and the BNC Written Sampler into a single corpus was threefold:
  1. to form a corpus large enough to give at least 10 examples of most medium frequency items (for example at www.lextutor.ca/list_learn),
  2. to keep it nonetheless small enough to run over the Web on a phone line, and
  3. to combine British and American linguistic features.
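The first criterion in the list above (at least 10 examples per item) is easy to check for any candidate corpus. The sketch below is an assumption-laden illustration: items_with_enough_examples is a hypothetical helper, it counts exact lowercase word forms rather than word families, and the toy tokens are invented.

```python
from collections import Counter

def items_with_enough_examples(tokens, target_words, min_examples=10):
    """Return the target words that occur at least min_examples times, with counts."""
    counts = Counter(tok.lower() for tok in tokens)
    return {w: counts[w] for w in target_words if counts[w] >= min_examples}

# Toy corpus: "sustain" occurs 12 times, "allocate" only 3 times.
tokens = ("sustain " * 12 + "allocate " * 3).split()
enough = items_with_enough_examples(tokens, ["sustain", "allocate"])
print(enough)
```

Run over a real word list, the proportion of items returned gives a quick sense of whether a corpus is large enough for concordance-based learning of that list.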

General Academic (6+m)

This 6 million word corpus is an amalgam of the following, all available separately on Lextutor, here assembled in the aim of providing a sizeable "general academic" corpus:
  1. bnc_humanities 3,361,000
  2. bnc_soc_sci 2,322,000
  3. AA_Academic_Abstracts 174,000
  4. Brown_Academic 162,500

Dr House Corpus

This corpus comprises scripts for the entire eight seasons (811,150 words in 176 episodes, 2004-2012), divided and sortable here by season. An engaging and unexpectedly popular series starring UK actor Hugh Laurie in a convincing US voice as a sarcastic, anti-social, pain-wracked, vicodin-addicted top diagnostician in a private for-pay East Coast hospital. For House, science and logic are a kind of religion, in echoes of Sherlock Holmes.

Scripts were obtained at http://www.springfieldspringfield.co.uk/ and extensively cleaned up by Clinton Hendry from Wiki class (this page).
An interesting corpus project would be to compare medical with fiction-medical usages and frequencies.

Wiki Corpus

Developed by MA students in Concordia's APLI program in a course given by Marlise Horst on Applied Corpus Linguistics in Winter session 2016. Comprises 1,051,921 words from random entries in each of Wiki's 12 divisions. Using Wiki Encyclopedia's native random function with repeat, plus a further randomization interface at https://www.lextutor.ca/tools/wiki_corpus/, 12 students contributed one sub-corpus each of random entries up to 900,000 words based on one of Wiki's 12 epistemological groupings. Individual corpus file titles and URLs can be seen here. The Lextutor Concordancer was adapted on this occasion to search and sort by sub-corpus.

Simplified Wiki Corpus

A continuation of the Wiki Corpus project above, carried forward by Clinton Hendry and Emily Sheepy at the Education Dept., Concordia, Montreal. The whole simplified corpus is 15.46 million words, 101 MB.

"The Simple English Wikipedia Corpus was created for research into the online simplified English encyclopedia by Clinton Hendry and Emily Sheepy. It is derived from the June 20th, 2017 Simple English Wikipedia file dump located at https://dumps.wikimedia.org/. After downloading the .xml file, we unpacked it using open source software, and then reorganized the dump into a single .txt file while removing unnecessary tags (e.g. tags) using Bash. Programming assistance credited to Christopher J.F. Cameron."

Shakespeare (5 m)

Added January, 2020. Comprises all 37 plays plus the sonnets; plays are differentiated in concordance output, but sonnets are not. Plays approximately 836,000 words, Sonnets 19,000 words. Or, 18,000 different words (types), about 105 of which were used by Shakespeare for the first time. Average per play is 22,600 according to https://www.opensourceshakespeare.org/views/plays/plays_numwords.php - which also has a great online presentation for anyone who wants to read rather than analyse, and includes a full-sentence concordancer.

Of the many Shakespeare collections available that were considered for treatment by Lextutor, only this one from Folger generated clear and attractive true concordances and broader contexts (others were XML and various other special formats).

One challenge was to reduce the size of some of the play names so as not to obscure the contexts (As You Like It -> AYLikeIt).

Files downloaded from Folger Digital Texts under the non-commercial Digital Commons agreement https://www.folgerdigitaltexts.org/download/.

2k Graded Corpus (1.2 m)

This corpus is formed of hundreds of graded readers, scanned and digitized over 10 years. They have an overall VocabProfile of 2000 word families = 95% of the running words overall (not counting proper nouns). This corpus answers a major need in pedagogical concordancing: in order for learners to perceive lexical or other patterns in a corpus, the corpus must be largely composed of items they are familiar with. This is not the case with e.g. the Brown corpus for lower intermediate ESL learners (2000 word families = 80% of running items).

1k Graded Corpus (530,000)

This is a subset of the 2k graded corpus, with a profile of 1000 word families = 90% of running items. This is probably the closest thing we have at present to a corpus for near beginners.

BNC Law (2.2 million), BNC Med (1.4 million)

These are parts of the larger BNC corpus with temporary residence on Lextutor as part of an undergraduate applied linguistics exercise in building technical lexicons.

QUEST approved Quebec secondary ESL course

Developed by Juliane Martini for her MA study of lexical recycling in course materials in c. 2010 (mainly using Lextutor Range, Paul Nation's Range, or AntProfiler). Comprises three text books and three corresponding workbooks approved by Ministère de l'Education, Loisirs et Sport (MELS) du Québec. Writers Cynthia Beyea, Paul Bougie, Claire Maria Ford; publisher Chenelière, Montreal. Course books total 284,577 words; workbooks 85,051.

BLaRC (8.8 million), British Law Reports Corpus

BLaRC, the British Law Report Corpus, is an 8.85 million-word legal English corpus of law reports, that is, collections of judicial decisions as officially transcribed at British courts and tribunals. It is owned by María José Marín, a lecturer in legal English at the University of Murcia. Law reports were selected as the genre to build the corpus on owing to the pivotal role they play in common law countries such as the UK, being one of the major sources of law for these legal systems. Their lexical richness is also remarkable as they include terminology pertaining to all areas of law.

TV - Marlise Horst & colleagues (530,000)

This is a more principled version of the collection just above, produced as part of an assignment in a colleague's graduate course in applied linguistics in 2008.
Details. The television corpus used in this study was established through the combined efforts of Applied Linguistics graduate students at Concordia University. The corpus contains 10 popular TV shows (five comedies and five dramas) that the graduate students, in a corpus linguistics course, deemed to be typical of what learners might be asked to watch as part of their language enrichment homework. The five comedies were: How I Met Your Mother, The Office, Seinfeld, Two and a Half Men and Frasier. The five dramas were: Alias, Desperate Housewives, Grey's Anatomy, Lost and Prison Break. The corpus material is narrative; news, commentaries or talk shows were not included. The sub-corpora from the 10 shows were compiled by downloading transcripts freely available on the internet; stage prompts and other non-spoken material in the transcripts were deleted manually. Each of the 10 show corpora amounted to around 50,000 words; the number of episodes represented in each ranged from 11 to 18 (due to differences in show length and amounts of talk that occurred in them). In total the corpus contained approximately 500,000 words in roughly equal halves, i.e., the comedy and drama sub-corpora amounted to about 250,000 words each.

US TV Talk (868,000)

This is the Marlise Horst TV corpus described just above, augmented in size with the addition of 338,000 words of US television programming pulled off the Web.

2000 list corpus (240,000)

This is the purpose-built corpus of coursebook materials developed for the 1997 Pet-2000 study (described in Breadth & Depth) on which Lextutor is based.

Univ. Word List (550,000)

Similar to the previous corpus, this is a corpus-built collection of roughly the same vintage, designed to fuel a corpus-based CALL program that gives students of the UWL (University Word List) exposure to a minimum number of examples of each of the UWL's 580 word families.

Focus on Vocab (82,300)

This is the text part of Norbert and Diane Schmitt's 2007 book Focus on Vocabulary for learning the AWL (Academic Word List). These are great texts bearing high proportions of AWL vocabulary. This corpus could be joined with the UWL corpus at some point.

Call of the Wild (24,000)

A book as corpus - in this case the heavily used Call of the Wild, developed as a full Assisted Reading hypertext at https://www.lextutor.ca/CallWild, here available for other kinds of searches.

TC Learner (Student) (150,000), TC Learner (Teacher) (61,000)

These are the learner corpora described in Cobb, T. Analyzing late interlanguage with learner corpora: Quebec replications of three European studies. Canadian Modern Language Review, 59(3), 393-423. (At cv page.) Learner (Student) is composed of three levels of ESL students, originally three mini-corpora of 50,000 words each but here joined to get some size.

JPU Learner (300,000)

This is a Learner Corpus developed by Joseph Horvaths for his PhD, a collection of 221 Hungarian student essays and research papers. The corpus is described at his blog http://joeandco.blogspot.com/

Presidential Speeches (2.15 million)

Montreal scholar Pierre Henrichon's hobby project - a large collection of presidential speeches collected from a variety of sources, in which he is mainly interested in military/militaristic references. In English, goes up to Obama.

Expanded to include early Trump presidency speeches in Aug 2020.

RAC Research Articles Corpus (HK, 132,000 wds)

The Research Article Corpus (RAC_academic) consists of 19 empirical journal articles (132,102 words) of required readings for students in English Language Education, most with an IMRD structure, recommended by MA and PhD students in the Education Faculty of the University of Hong Kong as the key reference articles related to their study/research. For inclusion as the default corpus in a Chinese language version of ConcordWriter (www.lextutor.ca/cgi-bin/conc/write/index.pl - August, 2011)

AA Academic Abstracts

AA corpus description

The corpus was compiled in an electronic format from the World Wide Web and is approximately 174,000 words. In order to qualify for the corpus, the abstracts primarily had to be from universities in countries where English is the native language. Another important criterion was that the abstracts had to be thesis and dissertation abstracts, since abstracts written for journal articles and conference presentations tend to differ. Therefore, journal article and conference presentation abstracts were disqualified.

The abstracts had to be written at Master’s and PhD levels at educational institutions. This criterion naturally excluded abstracts written at Bachelor’s level. One advantage of abstracts is that they do not exhibit significant differences in terms of length. This property contributes to balance within the corpus. To ensure representativeness of the corpus in accordance with the purpose of the study and corpus design in general, the corpus covers four main disciplines: Arts and Humanities, Social Sciences, Sciences and Architecture/ Urban and Regional Planning. Each discipline, making up a sub-corpus, includes 150 representative texts, in this case abstracts.

In the Arts and Humanities sub-corpus, 5 abstracts come from Anthropology, 30 from Archaeology, 19 from Art History, 7 from History, 40 from Language / Literature / Linguistics, 10 from Philosophy / Religion / Theology, 21 from Psychology, 5 from Music and 13 from Sociology. Of the 150 abstracts in the Social Sciences sub-corpus, 20 are from Business Administration, 17 from Communications, 3 from Demography, 20 from Economics, 20 from Education, 5 from Geography, 38 from Information Technology, 12 from Accounting and 15 from Political Science. The Sciences sub-corpus is composed of 3 abstracts from Algebra, 13 from Biological Sciences, 10 from Chemistry, 11 from Computer Science, 92 from Engineering, 12 from Mathematics and 10 from Physics. The Architecture sub-corpus includes abstracts from the fields of town and regional planning, landscape architecture, interior architecture as well as architecture. Assembled for a thesis study by Nilgun Hancioglu, Eastern Mediterranean Univ, Famagusta, Cyprus.

For inclusion as the default corpus in a Turkish language version of ConcordWriter (www.lextutor.ca/cgi-bin/conc/write/index.pl - June, 2011)

Yenny Korean EFL teachers corpus

Yenny Corpus descriptions

Two corpora of EFL teacher talk from general EFL courses offered by six different universities in Korea: NN is the non-native EFL teacher corpus (123,122 words) and N is the native EFL teacher corpus (124,276 words). The target students were from low intermediate to intermediate level.

Recorded and transcribed by Yenny Kwon from 20+ hours of instruction in each condition, for her PhD dissertation at Ewha Women's University graduate school on formulaic sequences (chunks) in the teacher talk of native and non-native teachers. The goal was to provide non-native teachers with meaningful lists of chunks useful in teaching.

BNC Speech 10 million

The BNC is 10 million speech and 90 million text, which sums to 100 million. In making the BNC lists, Nation (e.g. 2006) used the speech section as the basis for the first two 1,000 lists, in order to assure that items like "hello" would appear in the most frequent zones (for pedagogical reasons).

BN-COCA 1k-2k / 14 million

(Following from the previous entry ~) Then when Nation integrated the COCA and BNC lists he needed something resembling the BNC 10-million spoken in both UK and US registers as well as spoken plus written as a basis for the first two 1,000 lists (for the reason described just above). He therefore developed his own corpus of spoken and written US-UK speech and simplified readers (described on his website - see here) as the basis for these lists. This corpus is kindly offered here to Lextutor users.

Electrical Engineering - textbooks and ESP coursebooks

These corpora were assembled by Lin Chen for her Master's study at Carleton University, which compared "real" textbooks in electrical engineering to ESP (English for Specific Purposes) coursebooks for Electrical Engineering English. The volumes in question are (1) Irwin, J. D. (2002). Basic engineering circuit analysis (7th ed.). New York: John Wiley & Sons; (2) Sedra, A.S. & Smith, K.C. (2004). Microelectronic circuits (5th ed.). Oxford: OUP; (3) Glendinning, E. H., & McEwan, J. (1993). Oxford English for electronics. Oxford: OUP; and (4) Glendinning, E. H., & Glendinning, N. (1993). Oxford English for electrical and mechanical engineering. Oxford: OUP. Here the four volumes appear together in a single corpus.