
Internet Archive’s virtual reading room empowers data mining on a societal scale

Published January 7, 2014 by Kalev Leetaru

Knight Foundation supports the Internet Archive to improve the world’s access to a vast trove of information. Below, Roger Macdonald, director of the Television Archive initiative, and Kalev Leetaru, a data scientist who is currently the Yahoo Fellow at Georgetown University, write about their recent collaboration. Above: Mapping 400,000 hours of U.S. TV news.

The servers of the Internet Archive house one of the largest publicly accessible digital archives of human society; it spans nearly every medium from Web pages to software. Thanks to the support of Knight Foundation, the archive’s latest addition is a research library of nearly half a million hours of American television news broadcasts aired over the last four years.

This groundbreaking collection indexes the closed-captioning streams of each broadcast, making it possible to compare, contrast, quote, borrow and study this ephemeral medium, which is still the dominant source of news information for the American public. Journalists, scholars, librarians, documentarians and others are diving into this collection to unveil new perspectives on American society.

Mapping the geography of American television news

What would it look like if you “mapped” every location mentioned across all half million hours of American television news programming to visualize the picture of global events seen by the American public each day? In a recent collaboration we did just that, applying “fulltext geocoding” software to all 2.7 billion words of closed captioning. The software scanned each broadcast for any mention of a location anywhere in the world, used the surrounding narrative to disambiguate names shared by several places (Springfield, Ill., vs. Springfield, Mass.), and ultimately put every mention on a map. The resulting CartoDB visualizations provide one of the first large-scale glimpses of the geography of American television news; they begin to reveal which areas receive outsized attention and which are neglected. There is still much work to be done, but we believe this represents an exciting prototype for new ways of interacting with the world’s information, especially by organizing it through the lens of geography.
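To make the idea of fulltext geocoding concrete, here is a minimal sketch in Python. It is not the software used in the collaboration: the tiny gazetteer, the hint words and the geocode function are all hypothetical, and a real system would rely on a far larger gazetteer and richer context models. The sketch only illustrates the general technique of using nearby words to choose among places that share a name.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer: place name -> candidate locations, each with
# context words that hint at that candidate. Coordinates are approximate.
GAZETTEER = {
    "springfield": [
        {"label": "Springfield, Ill.", "lat": 39.80, "lon": -89.64,
         "hints": {"illinois", "lincoln", "chicago"}},
        {"label": "Springfield, Mass.", "lat": 42.10, "lon": -72.59,
         "hints": {"massachusetts", "boston", "basketball"}},
    ],
    "cairo": [
        {"label": "Cairo, Egypt", "lat": 30.04, "lon": 31.24,
         "hints": {"egypt", "nile", "tahrir"}},
    ],
}


def geocode(text, window=10):
    """Count location mentions, using nearby words to pick among candidates."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        candidates = GAZETTEER.get(token)
        if not candidates:
            continue
        # Words within `window` tokens on either side of the mention.
        context = set(tokens[max(0, i - window): i + window + 1])
        # Choose the candidate whose hint words overlap the context most.
        # On a tie, max() keeps the first candidate listed.
        best = max(candidates, key=lambda c: len(c["hints"] & context))
        counts[best["label"]] += 1
    return counts


if __name__ == "__main__":
    print(geocode("Lawmakers met in Springfield, the Illinois capital."))
```

The per-location counts returned here are the kind of output that could then be joined to coordinates and plotted on a map.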

The virtual reading room

As responsible stewards of content created by others, the Internet Archive endeavors to secure media holdings not in the public domain within a library context. Without adequate safeguards, the ease of modern digital distribution could facilitate widespread unauthorized release that might undermine the media’s potential market value. In addition, the archive’s public domain collections are simply too large to distribute widely in bulk. Yet, unlike humans, whose capacity for intense scrutiny is limited to small amounts of content, computerized data mining algorithms search for surface-level patterns across vast amounts of material, meaning they can process millions or even billions of words to identify subtle trends. How can a library enable such large-scale data mining of its archives to serve journalists, scholars and others?

The Internet Archive’s answer to this is a “virtual reading room” in which scholars can run their data mining algorithms directly on the archive’s servers using virtual machine software. Much like a reading room in a physical archive, scholars don’t borrow copies of large portions of the archive for study. They submit data mining algorithms to the virtual reading room, where the algorithms run on the archive’s computers and can access all of the archive’s holdings, but the holdings themselves don’t leave the reading room. Only the patterns identified by the algorithms, akin to a scholar’s handwritten notes, are permitted to leave the confines of the reading room. In the case of the television news mapping collaboration, only the final list of identified locations compiled by the software was allowed to leave the reading room; the original captioning text remained secure on the archive’s servers.
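The reading room model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the archive’s actual system: the sample holdings, the run_in_reading_room function and the export rule (only short labels mapped to integer counts may leave) are all hypothetical. It shows the general pattern of running a submitted algorithm next to the data and releasing only its tallies.

```python
from collections import Counter

# Hypothetical stand-in for holdings that must stay on the archive's servers.
HOLDINGS = {
    "broadcast-001": "Floods in Manila displaced thousands, officials in Manila said.",
    "broadcast-002": "Talks resumed in Geneva on Tuesday.",
}

# Assumed export policy: keys must be short labels, not passages of text.
MAX_KEY_LENGTH = 40


def run_in_reading_room(algorithm):
    """Run a submitted algorithm over the holdings and export only tallies."""
    totals = Counter()
    for text in HOLDINGS.values():
        result = algorithm(text)
        for key, value in result.items():
            # Export check: reject anything that is not a short label
            # mapped to an integer count.
            if not isinstance(key, str) or len(key) > MAX_KEY_LENGTH:
                raise ValueError("export rejected: key is not a short label")
            if not isinstance(value, int):
                raise ValueError("export rejected: value is not a count")
            totals[key] += value
    return dict(totals)


def count_places(text):
    """Example submitted algorithm: tally mentions of a fixed list of places."""
    places = ("Manila", "Geneva")
    return {place: text.count(place) for place in places if place in text}


if __name__ == "__main__":
    print(run_in_reading_room(count_places))
```

An algorithm that tried to return the captioning text itself, for example by using a whole passage as a key, would be stopped by the export check in this sketch. A production system would need much stronger safeguards than a length limit.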

We believe this presents a powerful new model for digital libraries to support large-scale public interest data mining of the collections they hold in trust while being respectful stewards of that intellectual property. It will enable a new generation of research exploring the macro-level patterns of society itself.

Billions of Web pages and PDF files; millions of books

Just two decades after the dawn of the popular Internet, the archive’s 15 petabytes house a longitudinal repository of one of the most powerful forces shaping modern society: more than 373 billion Web objects and their 1.6 billion PDF files collected over 17 years. While the print revolution is cataloged across tens of thousands of libraries throughout the world, much of the Web revolution is publicly preserved only on the archive’s servers. Couple with that the largest accessible archive of television news, one of the largest digitized book archives and a growing collection of software and other material, and you have the equivalent of a single library of the digital world, all available through our virtual reading room.

From left to right: Literary works of Leonardo da Vinci (1883), Uncle Tom’s Cabin (1900) and History of India (1917).

With the success of mapping the TV News Archive, we are already embarking upon even grander challenges to better understand our collective digital heritage. One initiative focuses on unlocking the visual imagery of the archive’s massive collection of more than 2 million digitized books. Large-scale research on digital book collections has focused nearly exclusively on textual analysis, with images discarded as a byproduct of the digitization process. Yet the archive’s 600 million digitized book pages house a vast treasure trove of tens of millions of illustrations, drawings, charts, maps, photographs and other images covering nearly every imaginable topic and location; it’s a visual tapestry of human society spanning more than two centuries. Picture tracing the changing visual portrayal of the American West over the 19th century, or taking a virtual tour of the world’s art galleries, natural wonders and major cities over 200 years.

A global public library of the digital world

Imagine a world in which the entire digital history of society is not only globally available 24/7 to every person on Earth and preserved for our future generations, but also housed in a new Library of Alexandria where all of the world’s researchers can come together in a single virtual reading room to pioneer new ways of visualizing and understanding our global heartbeat. This dream of the future of scholarly research is already being realized in early experiments such as the maps above of television news. The Internet Archive looks forward to elaborating on this vision even further in 2014 by working with others to expand and explore our archive of human society to reveal what 15 petabytes and more can tell us about the world in which we live.
