Tuesday, May 05, 2009

Lexomics at Kalamazoo

Do you like to count things? If so, the website our research group (me; Mark LeBlanc, prof. of computer science; Mike Kahn, prof. of statistics; and our students) is unveiling at Kalamazoo might be useful to you.

The on-line tools we have put up at http://lexomics.wheatoncollege.edu help researchers examine Anglo-Saxon texts in terms of patterns of words. For example, if you would like to list all the words in The Gifts of Men in order of frequency, you can. If you'd like to see what were Wulfstan's favorite words, you can. If you want to know how many words there are in the entire Anglo-Saxon corpus, and which are most common, you can find out.
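
For readers curious what a frequency list involves, here is a minimal Python sketch. It is not the code behind the lexomics site; the function name `word_frequencies` and the toy string (a few Old English words strung together, not a quotation from any text) are assumptions made purely for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word form occurs in a text (hypothetical helper)."""
    # Lowercase, then keep runs of letters only; the pattern is Unicode-aware,
    # so Old English characters such as æ, þ and ð stay inside words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


# Toy input, invented for this example.
sample = "se cyning and se þegn and se ceorl"
freqs = word_frequencies(sample)

# List the word forms from most to least frequent.
for word, count in freqs.most_common():
    print(word, count)
```

Like the texts on the site, this counts surface forms rather than lemmas, so inflected forms of one word are tallied separately.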

Not only can you count words in any text, but you can create "virtual manuscripts" that combine separate texts into one file for counting purposes. So if you'd like to determine the most common (or the rarest) words in the Beowulf manuscript (not just the poem Beowulf), you now can. Want to see which words appear once and only once (or twice and only twice) in Cotton Tiberius A.iii? Easy!
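
The "virtual manuscript" idea can be sketched in a few lines of Python: pool the counts from several texts, then pick out the words that occur exactly n times. This is only an illustration under stated assumptions, not how the lexomics tools are implemented; `virtual_manuscript`, `words_occurring`, and the two toy texts are all hypothetical.

```python
from collections import Counter


def virtual_manuscript(texts):
    """Pool word counts from several texts into one combined count."""
    combined = Counter()
    for text in texts.values():
        # Simple whitespace tokenization; real texts would need
        # punctuation handling as well.
        combined.update(text.lower().split())
    return combined


def words_occurring(counts, n):
    """Return, in sorted order, the words that occur exactly n times."""
    return sorted(word for word, count in counts.items() if count == n)


# Two invented toy "texts" standing in for the contents of a manuscript.
texts = {
    "text_a": "se cyning and se þegn",
    "text_b": "se ceorl and seo cwen",
}
counts = virtual_manuscript(texts)

once = words_occurring(counts, 1)   # words appearing once and only once
twice = words_occurring(counts, 2)  # words appearing twice and only twice
print(once)
print(twice)
```

Counting over the pooled texts rather than over each text separately is what lets a word that is rare in the whole manuscript stand out, even if it is prominent within one short text.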

And each of the words you might count is linked to a concordance search in the Dictionary of Old English. So if you discover that meahtum appears only twice in the Beowulf manuscript, you can then click on it and, provided you have electronic access to the DOE, see its other appearances in Old English (yes, the texts aren't lemmatized; if you want to know why, come to our paper at Kalamazoo).

Now why would I want to count words, you ask?

First, because looking at word frequencies helps you to identify interesting words in a text. If a word appears only in the OE translation of the Rule of Chrodegang and a text by Ælfric, that's potentially useful. Our tools can thus help literary scholars zero in on words that are worth a second look and can open the way for additional literary analysis.

Also, because the counts can be downloaded as Excel spreadsheets, there are many advanced statistical techniques that can be applied to them. But saying more would give away our paper at Kalamazoo, which I don't want to do.

Instead, I want to invite you to three sessions, all on Thursday: a roundtable, "Computing with Style," at 10:00 in Bernhard 213; our formal paper, "Lexomics for Literature," at 1:30 in the Bernhard Brown and Gold Room; and the poster session for Digital Humanities at 6:30 in Fetzer 1035. We'll have the software there for you to try, and all three of us will be there to answer questions. We're looking for people who want to use the software and for additional collaborators.

All the software is open-source and available for free at the lexomics website, and it can easily be adapted to work on any electronic corpus of texts in any language.


This work is funded in part by the National Endowment for the Humanities Digital Humanities Initiative, Grant #HD-50300-08.

5 comments:

jeniffercox said...

This sounds like a pretty amazing collaborative effort. Congratulations on your project. Best of luck at Kalamazoo!

Jon Myerov said...

Sounds and looks very interesting. Have a great K'zoo. Wish I were there.

Jason Fisher said...

Excellent! This should be quite a useful new tool for zeroing in on the hapax legomena scattered throughout the texts.

Henrik said...

Boom! Word bomb going off in my head!

N.E. Brigand said...

As one who regularly counts words, I was really disappointed not to have seen this post until the day after you presented! I would at least have attended the poster session.