When the UNC Library launched Documenting the American South (DocSouth) in 1996, the project helped set the standard for publishing historic texts online.
Nearly twenty years later, DocSouth is poised to reach a new set of readers: the computers that digest and find patterns in immense bodies of text through techniques known as digital text analysis.
The newly released DocSouth Data makes the full text of hundreds of nineteenth-century books and pamphlets available for easy download as text-only files. The materials come from four text-heavy Documenting the American South collections: The Church in the Southern Black Community; First-Person Narratives of the American South; Library of Southern Literature; and North American Slave Narratives.
“Researchers who want to experiment with text analysis have plenty of tools to choose from, but they often hit a roadblock when it comes to finding collections that are ready to be analyzed,” says Stewart Varner, UNC’s digital scholarship librarian.
According to Varner, most common text analysis techniques work best when they draw on hundreds, thousands, or even millions of pages at a time.
Machines comb these massive files, looking for unusual turns of phrase, words commonly appearing together, frequently used terms, and other patterns that can help the researcher make sense of the text.
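To give a sense of what that combing can look like in practice, here is a minimal sketch in Python that counts frequently used terms and pairs of words that appear side by side. It is an illustration only, not the method used by Varner or by any particular tool; the function names, the tiny stopword list, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real projects use longer ones).
STOPWORDS = frozenset({"the", "of", "and", "a", "to", "in", "was", "i"})


def tokenize(text):
    """Lowercase the text and return its words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def top_terms(tokens, n=5, stopwords=STOPWORDS):
    """Return the n most frequent tokens that are not stopwords."""
    return Counter(t for t in tokens if t not in stopwords).most_common(n)


def top_bigrams(tokens, n=5):
    """Return the n most frequent pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


# A made-up sample sentence standing in for a downloaded text file.
sample = (
    "My master sold my mother. My mother wept, and my master "
    "turned away from my mother."
)

tokens = tokenize(sample)
print(top_terms(tokens, n=3))
print(top_bigrams(tokens, n=3))
```

To run the same counts on a downloaded plain-text file, one could read it with Python's built-in `open(path, encoding="utf-8").read()` and pass the result to `tokenize`. Counting adjacent pairs is only the simplest form of "words commonly appearing together"; dedicated text analysis tools typically use wider windows and statistical weighting.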
Varner and his colleagues at the UNC Library realized that DocSouth provides a rich source of ready-to-use full text that is of high interest and high quality.
The North American Slave Narratives collection, for example, includes every known autobiographical narrative of fugitive or former slaves published in English up to 1920. Its titles are among the most frequently used in DocSouth. The collection offers a wide-ranging picture of the experiences of former slaves and African American life in antebellum America.
Moreover, these texts were transcribed by hand when the original digital files were created. That means they are unusually accurate, free of the errors that optical character recognition (OCR) often produces.
Varner is not sure just what discoveries will be made from DocSouth Data, but he is eager to see how people will use the collection.
“Visualizing text is a great way to start asking questions,” says Varner. “What is expected? What is surprising? Do the results have any significance?”
Instructions for downloading texts are on the DocSouth Data web page. Varner is glad to consult with students or faculty members who would like to learn more about performing text analysis with DocSouth Data or with other bodies of text. Reach him at firstname.lastname@example.org or (919) 962-2094.