The role of the computer in modern science is well
known. In disciplines like physics and biology, the computer's ability to store
and process inhumanly large amounts of information has disclosed patterns and
regularities in nature beyond the limits of normal human experience. Similarly in language study, computer
analysis of large texts reveals facts about language that are not limited to
what people can experience, remember, or intuit. In the natural sciences,
however, the computer merely continues the extension of the human sensorium
that began 200 years ago with the telescope and microscope. But language study did not have its
telescope or microscope. The computer is its first analytical tool, making
feasible for the first time a truly empirical science of language.
The
details of this new empiricism are being worked out, mainly at conferences
rather than in books or journals. Corpus-Based
Studies in English contains selected papers from the seventeenth
International Conference on English Language Research on Computerized Corpora
(ICAME 17) held in Stockholm in 1996. This review will sample from the fruits
of the new empiricism, as well as its issues, procedures, and problems, mainly
for the benefit of linguists and applied linguists who are curious about the
computational end of the field but who do not often come across specific
instances of what goes on there. The review will not give a one-liner on each
contribution, which is provided in the book's preface, but rather will explore
a handful of themes in slightly more depth. These themes deal mainly with
applied rather than theoretical questions (the series title is "Studies in
Practical Linguistics.") The examples are entirely from English, although
provided mainly by Germans, Dutch, and Scandinavians, and the research ideas
and methodologies are ripe for adapting to French.
The
practical focus of many of the contributions is language pedagogy. A corpus
finding that has impacted strongly on English teaching is the importance of
lexicalized phrases in language use and acquisition (also known as
"formulaic expressions" or "chunks" (discussed in Nattinger
& DeCarrico, 1992). In contrast to the
slot-and-filler grammars attributed to Chomsky, where for example any noun can
fill the slot wherever NP is indicated in the tree chart, it now
seems clear that only one form of one noun will fill certain slots. For example,
you can say "He got cold feet and refused to sign," but not
"He got a cold foot and refused to sign." The analysis of
large text corpora has shown that such
restrictions are rather more common than unaided intuition would suggest. If
language learners apply rules freely and
productively, they will often end up with sentences that are grammatically
acceptable but idiomatically unusual. It is less the
"indefinitely many" sentences the grammar makes
possible (Chomsky, 1965) that should occupy language learners as much as the relatively few
that native speakers actually use.
How do we produce an inventory of these
lexical phrases in a language and determine the degree of freedom that particular phrases
have? Barkema's piece discusses procedures for doing
this. Briefly, using a syntactically parsed corpus of adequate size, the linguist can examine
the degree of syntactic flexibility of phrases like red tape or wet blanket by extracting
from the corpus all instances of
the following pattern: "Premodifying adjective (absolute form) + singular noun as head of noun phrase." This output, alphabetized and viewed
in concordance format (i.e., phrase plus immediate context), will show whether or not
English speakers ever say The project was tied up in miles of bright red
tape or This piece of news dropped two wet blankets over the dinner party. The verdict for language learners is that while these phrases
are possible, they are extremely infrequent, and best left to native speakers. Learners should
treat "wet blanket" and "red tape" as fixed and immutable.
Another
practical focus is the study of translation through parallel corpora (of
translated texts, normally appearing in concordance format on a horizontally
split computer screen). Schmied and Schäffler use the Chemnitz German-English
translation corpus to look into the phenomenon of "translationese,"
whereby texts that have been translated show systematic differences from texts written
in the target language. They argue that while some of such differences may stem
from particular differences between source and target languages, others are
universal features of the translation process. Two of these are explicitness
and condensation, or "showing more underlying elements on the
surface" (or fewer). For instance, the large number of non-finite verb
constructions possible in English are not available in German, and instead must
be broken down into relative clauses specifying agency, tense and other
information that is implicit in an English infinitive or participle. The writers
find more instances of explicitness than condensation in their corpus, and offer an information-processing account for why this should be so.
Since all their translation data deals with only one pair of languages, they
will presumably want to look at other pairs of languages before making a final
commitment to their universal hypothesis. In the meantime,
Chomskyans may take comfort in knowing that interest in various aspects of
universalism are alive and well among corpus linguists.
Chomskyans
will also find familiar the corpus linguists' occasional interest in linguistic
invisibles. The invisibles, as ever, are implicit pieces of
D-structure, for example the elided
relative in the sentence, The dog I
bought died (i.e. the dog that I bought died).
Lehmann's piece describes a way of inserting a placeholder Ø between each two NPs
where there could have been a relative pronoun. This insertion is
simple in a fully tagged corpus, of course, where every grammatical element has been fully
marked with a tag-set, as for example in The_artDef + dog_nounCommonCountable + that_relElided etc.)
In this case, a simple search for relElided will bring forth all instances
of the phenomenon for inspection. But Lehmann is interested in working with more
natural texts that have been parsed only with an
automatic tagging system (which do not attempt to assign phrase-level tags).
What
practical purpose is served by assigning all those Ø's?
Lehman is interested in machine translation. A problem that has plagued machine translation
between English and several other languages is that elided relatives
are permitted in English under certain conditions but are not permitted in some
other languages under any conditions.
For example, in French one cannot say *Le
chien j'ai acheté est mort, nor in German * Der Hund ich kaufte ist gestorben. Lehmann has worked out a way of
searching through an English text for the eight condition-sets where relatives
may be omitted, and inserting Ø in each. With Ø's inserted in the English
text, it can be passed to a machine translation
system which will replace Ø with que
or qui, as appropriate.
Another echo from the past unexpectedly encountered in this volume is the grammaticality judgment task, so derided in the early days of corpus analysis when it seemed hard evidence would supplant soft intuitions entirely. No one is any longer building any empires on whether subjects will grant grammaticality to Colorless green ideas sleep furiously, but there is still a role for grammatical judgment in certain cases. Mönnink describes a problem she has had in attempting to write a corpus-based descriptive grammar, which is that even in corpora of substantial size there is a very low representation for some structures that native speakers would instantly judge to be grammatical. Taking as an example the common NP, whose grammar-book formula is optional determiner + Ø or more premodifying elements + obligatory head + Ø or more postmodifying elements, Mönnink shows that several legitimate changes may be rung on this theme which are entirely legitimate and yet which will appear very infrequently in a corpus of reasonable size. Such changes include shifted premodifications ("I wouldn't give it so romantic a name"), discontinuous modifications ("We can do as much guessing about her as we please"), floating postmodification ("Much evidence has accumulated concerning cytoplasmic DNA.") These forms are clearly decent English, and yet a purely frequency-based approach to constructing a descriptive grammar might underplay them or omit them.
The writer proposes
supplementing corpus information about such NPs with information from a
principled set of elicitation tasks. One such task might be evaluation (rate
from 1=perfectly acceptable to 7=not at all acceptable the following sentence, I never saw so beautiful a person). Another
might be composition (give all possible sentences that can be constructed from
these phrases: was made, to win a medal,
today, no effort). A methodology for blending frequency and elicitation information is presented.
A
small quibble with Mönnink's piece is that the corpus in which she finds certain
structures inadequately represented is only about 120,000 words in four text
genres. She might find less need for experimental supplementation, and all the
vagueness that introduces, were she to consult a larger source such as the
British National Corpus, currently weighing in at 100 million words in several genres and
growing every day (see http://info.ox.ac.uk/bnc). On the other hand, smaller corpora and
home-grown corpora have their uses (for an excellent example see Granger, 1998,
who works with a corpus of learner English produced by Belgian francophones), and in these cases a purely frequency based approach will
often be usefully complemented by more judgment-based elicitation data.
The
three
classics of corpus study are all represented in this volume: corpus comparisons
of older English and newer English, written English and spoken English, and
British English (BE) with American English (AE). Only the latter will be discussed. The point of the latter is to find evidence whether BE and AE are
different, as they intuitively sound as if they are. A claim in need of support
is, for instance, that Steve Forbes
political neophyte "is an
apposition type more characteristic of AE than BE." However, a perceived
problem with such studies, and more or less the opposite of the problem
discussed just above concerning over-relying on objective data, is that the data
behind the AE-BE comparisons may not be objective enough. Kretzschmar, Meyer,
and Ingegneri argue that the sampling procedures by which corpora of AE have
been put together do not meet the standards necessary to allow a statistical
inference of representativeness. Indeed, to support such an inferences would
require that linguists have the resources of large political polling
organizations, or indeed of the US Federal government. Until then, all we know
about Steve Forbes political neophyte (linguistically speaking) is that it is a
phrase produced at least once in an AE publication and widely understood by AE
speakers (but then, by BE speakers, too).
The
reader is invited to read the book itself for more details on the studies I have
reported, and for all the details on the no less interesting studies not reported
for lack of space. The sense I take away from this volume is that corpus
investigation deals with extremely interesting questions and relies on hard
evidence as much as possible to do so. However, I also see that as the
discipline matures some of the problems with getting and using hard evidence
are presenting themselves, and that some of the lines that once seemed so clear
between old and new linguistics are blurring.
References
Chomsky,
N. (1965). Aspects of the theory of
syntax. Cambridge, MA: MIT Press.
Granger,
S. (1998). Learner English on
computer. London: Addison Wesley
Longman.
Nattinger,
J. & DeCarrico, J. (1992). Lexical
phrases and language teaching. London:
Oxford University Press.
Pawley,
A., & Syder, F. (1983). Two puzzles
for linguistic theory: Nativelike selection and nativelike fluency. In J.C. Richards & R. Schmidt (Eds.),
Language & communication. London:
Longman.