Hope is the Thing With Numbers: Poems and Stats

Professor Horton's student in the Emily Dickinson Museum

In seeking to learn how our words can be mined to uncover who we are, Professor of Statistics Nicholas Horton introduced his students to Emily Dickinson not just as a poet, but as the generator of a complex data set.

For a recent segment of his course “Data Science,” Horton and his students explored Dickinson’s poetry as they would any stream of information, be it linguistic or numerical.

To get a sense of their subject beyond mere words on the page, though, the class went to where the poet generated the data: her Amherst home. On a quiet, snowy morning, students toured the bedroom, parlor, library and other rooms where Dickinson wrote close to 1,800 poems, and where she died, largely unpublished, in 1886.

The purpose of the class is to connect the past with the future, Horton told the students. “To do that in any kind of reasonable way requires a little bit more knowledge than just ‘Here are the poems—start extracting data from them.’ So that's why we’re working with the Emily Dickinson Museum, to let you explore this building where a lot of the writing was done.”

Use of the museum as a classroom space is nothing unusual, said Elizabeth Bradley, program coordinator. Just a stone’s throw from campus, Dickinson’s home has seen visits from over 200 Amherst College students in the past year for private tour groups and on-site classes. The museum employs about a dozen students as assistants and guides, and partners with the College for undertakings like a recent exhibit and reading with the campus literary magazine Circus, she said.

“Data Science” takes students through “practical analyses of large, complex, and messy data sets leveraging modern computing tools,” according to the course description. “Computational data analysis is an essential part of modern statistics and data science.”

“Machine and machine-learning algorithms are now making decisions that affect us profoundly,” said Horton. “How do we train students, educated citizens, to be able to make sense of that? The course is designed to really get their hands dirty and then periodically have the opportunity to come back up and ask the big questions: What does it mean to have no privacy in the 21st century? What does it mean to bring these things together that on their own are pretty innocuous,” but, when you look at the big picture, can seem more sinister?

The course uses Modern Data Science with R, a textbook Horton co-authored; R is a software environment for statistical computing and graphics. Students started off by “scraping” Macbeth and Beatles songs, extracting data from the texts.

Emily Dickinson word cloud
Code and inspiration for this WordCloud, showing the frequency of words occurring in Emily Dickinson's poetry, created by Christina (Ningyue) Wang '16.

For the Dickinson segment, Horton never knows quite where the students are going to go with the data, he said. Using a web application called Shiny, students are basically let loose with Dickinson’s poetry. Horton started them off by creating a page where, at the click of a button, the application returns a poem selected at random.

“I say, ‘This is my beginning, you can do better than that,’” he said.
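For readers curious what that starting point might look like, here is a minimal sketch of the random-poem idea. It is written in Python rather than the R and Shiny the class uses, and it is not Horton's code. The three-excerpt list `POEMS` and the function name `random_poem` are assumptions made for illustration; the class's actual data set is not shown here.

```python
import random

# Hypothetical stand-in corpus: a few opening lines from early printed
# editions, used only for illustration. The class's real data set is larger.
POEMS = [
    "Hope is the thing with feathers\nThat perches in the soul",
    "I'm nobody! Who are you?\nAre you nobody, too?",
    "Because I could not stop for Death,\nHe kindly stopped for me",
]


def random_poem(poems, rng=None):
    """Return one poem chosen uniformly at random from the list."""
    # Accepting an optional random.Random makes the choice repeatable in tests.
    rng = rng or random.Random()
    return rng.choice(poems)
```

In a web application, a button click would simply call `random_poem(POEMS)` and display the result.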

Don’t look for dashes in this data set, though. Because of the complicated history of the ownership of Dickinson’s poetry, the students are using the public-domain, early printed versions of the poems, which were shorn of the poet’s idiosyncratic dashes and capitalizations.

In past semesters, students have created output such as charts and maps of the most-used words (Dickinson liked “like,” for instance), or lists of synonyms and antonyms in the poet’s lexicon.
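A most-used-words tally of the kind described above could be sketched as follows. This is a Python illustration under stated assumptions, not the students' R code: the function name `word_counts` and the tokenizing rule (lowercase letters, with an optional internal apostrophe) are choices made for this example.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, optionally joined by one apostrophe,
# so that "I'm" counts as a single word.
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def word_counts(poems):
    """Count how often each word appears across a list of poem texts."""
    counts = Counter()
    for poem in poems:
        # Lowercase first so "Hope" and "hope" are tallied together.
        counts.update(WORD_PATTERN.findall(poem.lower()))
    return counts
```

Calling `word_counts(poems).most_common(10)` would then give the ten most frequent words, which is the raw material for a chart or a word cloud.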

“What I’m really excited about, in teaching statistics at a liberal arts college, is the idea of helping students to make sense of the data in the world around them,” Horton said. “What could be more of a liberal art than that?”