Review of Michael Hammond, Programming for linguists: Perl for language researchers. Oxford: Blackwell, 2003. (x + 219 pp), $74.95 US (hardcover), $39.95 (paper).

 

Reviewed by Tom Cobb, Dépt. de linguistique et de didactique des langues, Université du Québec à Montréal, 4 August 2004, for Canadian Journal of Linguistics.

 



Most people involved in the systematic study of language, whether theoretical linguists, applied linguists or language educators, are probably already using computation in their work or else wish they knew how they could. The sheer volume and variety of data relevant to any linguistic question makes our field especially suited to computational analysis, and indeed it can be argued that the computer is to language study what the microscope was to biology or the telescope to astronomy—a vital tool heralding a late entry into the empirical age.

 

Few at present would attempt to produce a professional document without making some use of their word processor’s spell checking, word counting, or find and replace features, but most linguists are aware that the computer’s potential extends well beyond this. The problem is how to realize this potential. A common first step for many is to switch, for some purposes, from an all-purpose word processing program (like Microsoft’s Word) to a specialized text editor (like BBEdit for Macintosh or Textpad for PC), which trades text formatting for text processing. These text editors can handle files of more than a million words, or handle several files at once, but their main advantage is that they can find and replace using “regular expressions” or regexes. In a regex search pattern, a full stop acts as a wildcard, so that searching a text for “h.t” will locate every hat, hit, hot or hut; square brackets indicate a constrained wildcard, so that searching for “h[aeiou][a-z]” matches hat, hip, hit, hop, or hut; and a caret (^) inside the brackets indicates a negative wildcard, so that searching for “h[^u]t” matches hat, hit, and hot, but not hut.
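
The three patterns above can be tried out in a few lines of code. What follows is a minimal sketch in Python rather than Perl (the two languages write these particular patterns identically); the word list and the helper name whole_word_matches are invented for illustration. Note that it tests whole words, whereas a text editor's search would also find the patterns inside longer words.

```python
import re

# A small, invented word list to test the patterns against.
words = ["hat", "hit", "hot", "hut", "hip", "hop", "heat"]

def whole_word_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

for pattern in ("h.t", "h[aeiou][a-z]", "h[^u]t"):
    print(pattern, "->", whole_word_matches(pattern, words))
```

The full stop accepts any single character, the bracketed set accepts only the characters listed, and the caret inside the brackets accepts anything except the characters listed.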

 

For a real life example of regular expressions in the workplace, as an instructor I once wanted to change the format of the 1 million word Brown corpus so that I could present parts of it to first-year students minus some of the distracting coding of the original. The Brown corpus is broken into separate lines, each beginning with a set of codes indicating its location in the corpus and its source.

 

A01 0010  1    The Fulton County Grand Jury said Friday an investigation

A01 0020  1    of Atlanta's recent primary election produced "no evidence"

A01 0020  9    that any irregularities took place.

A01 0030  5    The jury further said in term-end presentments that

A01 0040  3    the City Executive Committee, which had over-all charge

A01 0050  2    of the election, "deserves the praise and thanks of

A01 0050 11    the City of Atlanta" for the manner in which the election

A01 0060 11    was conducted.

 

MS Word could not even open the entire corpus, whereas my text editor (Textpad) opened it easily. Then, through a careful perusal of Textpad's Help files, I came up with a regular expression that would remove all the coding from the entire corpus in a few minutes. This was to replace “\n.\{15\}” with “”, that is to say, to replace the first 15 characters (each full stop standing for any single character) after every line break (\n) with nothing.
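
The same clean-up can be sketched in a short script. This is an illustration in Python, not the Textpad expression itself: it anchors on the start of each line with ^ instead of matching the preceding \n, an assumption made here so that the line breaks are kept and the very first line is handled too. The function name strip_location_codes is invented, and the sample lines are taken from the extract above.

```python
import re

# Three lines in the Brown corpus layout shown above: 15 characters of
# location coding, then the text itself.
sample = (
    "A01 0010  1    The Fulton County Grand Jury said Friday an investigation\n"
    "A01 0020  1    of Atlanta's recent primary election produced \"no evidence\"\n"
    "A01 0020  9    that any irregularities took place.\n"
)

def strip_location_codes(text, width=15):
    """Delete the first `width` characters of every line, keeping line breaks."""
    # With re.MULTILINE, ^ matches at the start of each line, and "." never
    # matches a newline, so lines shorter than `width` are left untouched.
    return re.sub(r"^.{%d}" % width, "", text, flags=re.MULTILINE)

print(strip_location_codes(sample))
```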

 

The source of these and many more powerful regexes is the programming language Perl (invented by Larry Wall in the 1980s; the name is usually expanded as Practical Extraction and Report Language). After moving to a text editor, Perl is the next logical step for the linguist wishing to exploit the powers of the computer, and there is no better place to begin than Michael Hammond’s (2003) Programming for Linguists: Perl for Language Researchers.

 

It is curious that a book on Perl “for linguists” is needed at all, since the Perl language was basically made for text processing and manipulation (most languages prioritize number crunching rather than text crunching). The reason a dedicated volume makes sense is that Perl has become so widely used (it is cross-platform, running on virtually any computer; it is free; and it is the main server-side scripting language for dynamic Web applications) that its original purpose and powers have to some extent been obscured.

 

Hammond’s book guides the linguist into Perl in a clear, step-by-step fashion, through hands-on examples that right from the beginning are of potential interest to linguists. The programming complexity and the linguistic complexity advance side by side. Concepts like variables, repeat loops, if clauses, regular expressions, subroutines, file input/output, and the different ways that Perl can represent text (such as whole strings and itemized arrays) are introduced in the context of authentic linguistic analyses. By the end of Chapter 3, we are already in a position to perform a real task, in this case a phonetics task—inputting a natural text to a Perl routine that outputs all consonant-vowel (CVCV) pairs. Several small pieces of programs are needed to do this, such as first defining for Perl what we mean by a consonant and a vowel, and then searching for each consonant followed by each vowel in a quadruple nested loop. It sounds complicated, but each piece is clearly and separately set out in program code that is both included in the text and available for download on the accompanying website (www.u.arizona.edu/~hammond, although the author wisely recommends typing the code ourselves to be sure we understand it). What the author does not give us, however, is the programs’ outputs, which we must get for ourselves by running them on our computers.
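
To give a flavour of the task, here is a minimal sketch of a consonant-vowel search. It is written in Python and uses a single regular expression rather than the nested loops of the book, so it is not Hammond's program. The definitions of "consonant" and "vowel" are orthographic simplifications assumed here (y is counted as a consonant), and the name cv_pairs is invented.

```python
import re

# Orthographic stand-ins for the two sound classes (an assumption of this
# sketch): five vowel letters, and every other letter as a consonant.
VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

def cv_pairs(text):
    """Return every consonant letter immediately followed by a vowel letter."""
    pattern = "[%s][%s]" % (CONSONANTS, VOWELS)
    return re.findall(pattern, text.lower())

print(cv_pairs("The jury further said"))
```

Defining the two classes first and only then building the search pattern mirrors the order of the steps described above.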

 

The linguistic analysis advances over the course of the book through a nice sequence including collecting experimental responses from subjects to randomized stimuli; translating natural language into Pig Latin or other nonsense language while accommodating irregularities (such as words with vowel endings, which of course recycles the vowel identification technique from a previous example); breaking a text into sentences; recording users’ reaction times in milliseconds; and breaking a text into a concordance (an alphabetized list of all words in a text accompanied by their immediate contexts). Each example introduces new concepts and re-uses or extends concepts from previous examples. Each chapter is followed by transfer and extension exercises.
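
The last of these tasks, the concordance, can be sketched briefly. The following is an illustrative Python version under simple assumptions, not the program from the book: a word is any run of letters or apostrophes, the context is two words on either side, and the function name concordance is invented.

```python
import re

def concordance(text, width=2):
    """Return (left context, keyword, right context) triples, alphabetized by keyword."""
    tokens = re.findall(r"[A-Za-z']+", text)
    entries = []
    for i, word in enumerate(tokens):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        entries.append((word.lower(), left, word, right))
    # Sort on the lower-cased keyword only; the sort is stable, so repeated
    # words stay in their order of appearance in the text.
    entries.sort(key=lambda entry: entry[0])
    return [(left, word, right) for _, left, word, right in entries]

for left, word, right in concordance("the jury said the election was conducted"):
    print("%20s  %-12s %s" % (left, word, right))
```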

 

The examples cited so far are shown as running on one’s own PC, but in fact all of them could be run as networked applications over the World Wide Web. Hammond devotes two later chapters to Perl as a scripting language running on a network server—sometimes known in this manifestation as CGI or “Common Gateway Interface”—and outputting Web pages in HTML (Hypertext Mark-up Language, the formatting code for most pages appearing on the Internet). Accordingly, Chapter 8 is a brief primer on HTML text tagging (e.g., <b>between these tags, this text comes out as darkened or bold</b>), and Chapter 9 is devoted to getting text input from a “form” on a Web page, into Perl on a server, where it is processed in some way, and from there either to a text file stored on the server itself and/or back onto an output Web page for the user.
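
The round trip from form to server to output page can be sketched as follows. This is a minimal Python illustration of the idea, not the Perl-CGI code of Chapter 9: the form field name text is hypothetical, the function name handle_form is invented, and the processing step (a word count) is only a placeholder. A real script would also need a Web server to pass it the submitted data.

```python
from html import escape
from urllib.parse import parse_qs

def handle_form(body):
    """Turn a URL-encoded form submission into a small HTML results page."""
    fields = parse_qs(body)
    # "text" is a hypothetical field name; parse_qs returns a list per field.
    text = fields.get("text", [""])[0]
    n_words = len(text.split())
    # Escape the user's text so that any markup in it is displayed, not interpreted.
    return (
        "<html><body><p><b>%s</b></p><p>Words: %d</p></body></html>"
        % (escape(text), n_words)
    )

print(handle_form("text=the+jury+said"))
```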

 

My own use of Perl is mainly for Web applications, which typically take user input texts, chop them up in one way or another, and then return them to the user for either contemplation or input to still another Perl routine. Many of these applications can be seen on my Compleat Lexical Tutor website, at www.lextutor.ca, and in fact a tour of some of its routines may help the reader understand several of the concepts discussed so far. Perhaps the clearest example is my Web Vocabprofiler, which returns the user’s input text, whether pasted into a form or uploaded to my server, colour-profiled by word frequency level (a research tool developed by Laufer & Nation, 1995, and used since then for a number of research purposes involving language acquisition and the development of language learning materials). Perl is an extremely useful language for performing this analysis in a number of ways. It handles large input files; it strips punctuation and irrelevant coding through successive waves of regexes; it handles several different natural languages (e.g., French) through Perl’s “locale” header; it opens several enormous, lemmatized frequency lists, against which each word in the input text is tested; it calculates percentages at each level; and it outputs the analysis in clear, colour-coded HTML tables—all in relatively short order. I do not know of any other scripting language that could perform all these functions.
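
The core of such a profile can be sketched in a few lines. This is an illustrative Python sketch, not the Vocabprofiler's actual code: the two band lists are tiny, hypothetical stand-ins for the large lemmatized frequency lists, the band names are invented labels, and no lemmatization or colour-coding is attempted.

```python
import re
from collections import Counter

# Hypothetical miniature frequency bands, for illustration only.
BANDS = {
    "K1": {"the", "jury", "said", "that", "took", "place", "any"},
    "K2": {"election", "evidence"},
}

def profile(text, bands=BANDS):
    """Return the percentage of tokens falling in each band, plus 'off-list'."""
    # Lower-casing and keeping only runs of letters strips the punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, band_words in bands.items():
            if token in band_words:
                counts[name] += 1
                break
        else:
            # The token was not found in any band.
            counts["off-list"] += 1
    total = len(tokens)
    if total == 0:
        return {}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

print(profile("The jury said: no evidence!"))
```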

 

Will Hammond’s reader be in a position to undertake large-scale web operations? Not exactly—what Hammond provides is more an introduction to the art of getting information from one Web page into a Perl routine and then back onto another page (although the chapter does end in a rather interesting example of running a linguistics experiment over the web that collects, stores, and processes the results). But the reader definitely will be ready to profit from a work dedicated more particularly to Web applications of Perl-CGI, such as Elizabeth Castro’s (2001) Perl and CGI for the World Wide Web, which revisits concepts like regex and array in a specifically online context.

 

I found this book to be engaging, humorous, relevant, and pedagogically sound with its clarity of presentation and recycling and building of concepts. Hammond will definitely get a language researcher up and running with Perl, and prepare him or her for a more detailed treatment, such as Wall et al.’s (1996) Programming Perl for non-Web applications, or Stein’s (1998) Official Guide to Programming with CGI.pm for Web applications, or of course the many information sources on the Web itself such as www.perldoc.com. Most people who learn Perl do so by studying and adapting other people’s Perl programs or scripts, which are voluminously available on the Web as open-source downloads (for example, at www.cgi-resources.com), and Hammond’s graduate will be in a position to start doing this.

 

An area where the book could be improved is in the start-up phase. Having learned Perl as a scripting language for the Web, I was unfamiliar with the off-line use of the language and had a little trouble getting started with it under Windows on my own desktop. The author directs aspiring linguist-programmers to the correct URL for a free download of the Perl program (www.cpan.org), and then gets us started writing and accessing our first Hello World program and indeed accessing all of the provided programs, via “the MS-DOS prompt on the program menu”—which leads to the “DOS window” where we will do such things as “switch to an appropriate directory using the cd command” (p. 6). As one of the dwindling number who actually remember DOS, I still had a little trouble getting some of this right, and I suspect others may, too, and even put the book down before really getting started. That would be a pity.

 

This may help: the MS-DOS prompt, on Windows XP Professional at any rate, is no longer on the Program Menu, as advertised, but in the Accessories menu inside the Program Menu. From the Start button, you should go to All Programs, then Accessories, where the MS-DOS prompt seems to have been renamed the Command Prompt (it is worth making a desktop shortcut for future use). When you click the Command Prompt, a small black text box appears where you see a C:\ plus a space where you are “prompted” to type something. For example, if you have downloaded Hammond’s files to a directory or folder called MyPerl in your C: drive, then you “go there” by typing the words cd MyPerl at the prompt (cd means “change directory”). Thereafter, the author’s instructions make sense and the sailing is smooth.

 

Coming soon: a review of Hammond's earlier (2002) book on Java programming for linguists. What do we get with Java that we do not get with Perl?

 

References

 

Castro, Elizabeth. 2001. Perl and CGI for the World Wide Web (2nd Edition). Berkeley CA: Peachpit Press.

 

Laufer, Batia, and Nation, Paul. 1995. Vocabulary size & use: Lexical richness in L2 written productions. Applied Linguistics 16 (3), 307-322.

 

Stein, Lincoln. 1998. Official Guide to Programming with CGI.pm. New York: Wiley.

 

Wall, Larry, Christiansen, Tom, and Schwartz, Randal. 1996. Programming Perl (2nd Edition). New York: O’Reilly.